Re: [opensuse-programming] extracting text from html

25 May 2010


Patrick Shanahan wrote:
...
> * Per Jessen  [05-25-10 10:49]:
> ...
>> I need to extract text from html for purposes of indexing -
>> implementation language is C or C++.  So far I've come across
>> html2text which is written in C++ - it looks pretty good, but I will
>> need to make some changes to make it fit my purposes.  Does any other
>> library come to mind for extracting text from html?
>
> w3m -dump <url>
> lynx has a similar function

Yeah, something like that will be a last resort - I'd prefer not
having to fork() for such a simple operation.  (I'll be indexing
millions of documents).  Maybe it's worth checking out what those two
utilities use for the extraction.
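
For illustration, a minimal in-process sketch of the idea (no fork(), C++
standard library only) might look like the following.  It is a naive tag
stripper, not the approach html2text, w3m or lynx actually take.  The
names strip_tags, find_ci and add_space are made up for this example.
Entities such as &amp; are not decoded, and a '>' inside a quoted
attribute value or a stray '<' in the text will confuse it.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: case-insensitive search for a lower-case ASCII
// needle in hay, starting at offset `from`.
static std::size_t find_ci(const std::string& hay, const std::string& needle,
                           std::size_t from)
{
    for (std::size_t i = from; i + needle.size() <= hay.size(); ++i) {
        std::size_t k = 0;
        while (k < needle.size() &&
               std::tolower(static_cast<unsigned char>(hay[i + k])) ==
                   static_cast<unsigned char>(needle[k]))
            ++k;
        if (k == needle.size()) return i;
    }
    return std::string::npos;
}

// Append one separator space, never at the start and never twice in a row.
static void add_space(std::string& out)
{
    if (!out.empty() && out.back() != ' ') out += ' ';
}

// Hypothetical helper: drop tags and comments, skip <script>/<style>
// bodies, and collapse whitespace runs into single spaces.
std::string strip_tags(const std::string& html)
{
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        const unsigned char c = static_cast<unsigned char>(html[i]);
        if (c != '<') {                       // ordinary text
            if (std::isspace(c)) add_space(out); else out += html[i];
            ++i;
            continue;
        }
        if (html.compare(i, 4, "<!--") == 0) {  // comment: skip to "-->"
            const std::size_t end = html.find("-->", i + 4);
            i = (end == std::string::npos) ? html.size() : end + 3;
            add_space(out);
            continue;
        }
        const std::size_t close = html.find('>', i);
        if (close == std::string::npos) break;  // unterminated tag: drop rest
        std::string name;                       // lower-cased tag name
        for (std::size_t j = i + 1;
             j < close && std::isalnum(static_cast<unsigned char>(html[j])); ++j)
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(html[j])));
        i = close + 1;
        if (name == "script" || name == "style") {
            // Jump to the matching end tag; it is then handled as a normal tag.
            const std::size_t end = find_ci(html, "</" + name, i);
            i = (end == std::string::npos) ? html.size() : end;
        }
        add_space(out);                         // a tag separates words
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();
    return out;
}
```

Calling strip_tags("<p>Hello <b>world</b></p>") returns "Hello world".
For millions of real-world documents a proper HTML parser would likely
be more robust than this kind of scanning.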


/Per Jessen, Zürich
