On Tue, May 25, 2010 at 10:47 AM, Per Jessen wrote:
I need to extract text from HTML for purposes of indexing; the implementation language is C or C++. So far I've come across html2text, which is written in C++. It looks pretty good, but I will need to make some changes to make it fit my purposes. Does any other library come to mind for extracting text from HTML?
/Per Jessen, Zürich
One way I've seen to do text extraction in the Windows world is to have a printer driver that extracts the text as it comes in. (It is not a perfect process with PDFs, because a PDF sometimes prints part of a word, then a command, then more of the word, etc.) Are there equivalent text-extracting printer drivers for Linux?

Greg