Patrick Shanahan wrote:
* Per Jessen [05-25-10 10:49]:
I need to extract text from HTML for indexing purposes - the implementation language is C or C++. So far I've come across html2text, which is written in C++ - it looks pretty good, but I will need to make some changes to make it fit my purposes. Does any other library come to mind for extracting text from HTML?
w3m -dump <url>
lynx has a similar function
Yeah, something like that will be a last resort - I'd prefer not to fork() for such a simple operation (I'll be indexing millions of documents). Maybe it's worth checking what those two utilities use for the extraction.

/Per Jessen, Zürich
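
For illustration, here is a rough sketch of what in-process extraction (no fork() per document) could look like in C++. It is not how html2text, w3m or lynx do it; `extract_text` and `find_ci` are hypothetical names made up for this example. It assumes that dropping tags, comments and script/style bodies, decoding a handful of entities and collapsing whitespace is good enough for indexing. It makes no attempt to handle malformed markup, character sets or the full entity table.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: case-insensitive search for an ASCII needle
// (given in lowercase) in `hay`, starting at offset `from`.
static std::size_t find_ci(const std::string& hay, const std::string& needle,
                           std::size_t from) {
    if (needle.empty()) return std::string::npos;
    for (std::size_t i = from; i + needle.size() <= hay.size(); ++i) {
        std::size_t k = 0;
        while (k < needle.size() &&
               std::tolower(static_cast<unsigned char>(hay[i + k])) == needle[k])
            ++k;
        if (k == needle.size()) return i;
    }
    return std::string::npos;
}

// Hypothetical helper: naive single-pass HTML-to-text conversion.
// Every tag is replaced by a space, so "wor<b>ld</b>" becomes "wor ld";
// that is a deliberate simplification for indexing, not correct rendering.
std::string extract_text(const std::string& html) {
    std::string out;
    out.reserve(html.size());
    const std::size_t n = html.size();
    std::size_t i = 0;

    // Append one character, collapsing whitespace runs into a single space
    // and skipping leading whitespace.
    auto emit = [&out](char c) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            if (!out.empty() && out.back() != ' ') out.push_back(' ');
        } else {
            out.push_back(c);
        }
    };

    while (i < n) {
        if (html[i] == '<') {
            if (html.compare(i, 4, "<!--") == 0) {  // skip a comment
                std::size_t end = html.find("-->", i + 4);
                i = (end == std::string::npos) ? n : end + 3;
                emit(' ');
                continue;
            }
            std::size_t close = html.find('>', i);
            if (close == std::string::npos) break;  // unterminated tag: stop
            // Collect the lowercased tag name (empty for closing tags).
            std::string name;
            for (std::size_t j = i + 1;
                 j < close && std::isalpha(static_cast<unsigned char>(html[j])); ++j)
                name.push_back(static_cast<char>(
                    std::tolower(static_cast<unsigned char>(html[j]))));
            i = close + 1;
            if (name == "script" || name == "style") {
                // Skip the element body up to and including its closing tag.
                std::size_t end = find_ci(html, "</" + name, i);
                std::size_t gt = (end == std::string::npos)
                                     ? std::string::npos
                                     : html.find('>', end);
                i = (gt == std::string::npos) ? n : gt + 1;
            }
            emit(' ');
            continue;
        }
        if (html[i] == '&') {
            // Only a few common entities are decoded in this sketch.
            static const struct { const char* name; char ch; } entities[] = {
                {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'},
                {"&quot;", '"'}, {"&nbsp;", ' '}};
            bool matched = false;
            for (const auto& e : entities) {
                const std::size_t len = std::strlen(e.name);
                if (html.compare(i, len, e.name) == 0) {
                    emit(e.ch);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (matched) continue;
        }
        emit(html[i]);
        ++i;
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();  // trim trailing space
    return out;
}
```

Calling `extract_text` on each document buffer from the indexer's own loop keeps everything in one process. For real-world HTML a proper parser library would likely be more robust than this hand-rolled scan.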