Re: [opensuse-programming] extracting text from html

25 May 2010

Per Jessen wrote:
...
justin finnerty wrote:
...
...
...
I need to extract text from html for purposes of indexing -
implementation language is C or C++
I would use a SAX parser that handles HTML (libxml2?).  Then all you
might need to do is handle the TEXT nodes.
Something like that was indeed my first thought, but I'm pretty
certain it would require the html to be well-formed, which is far from
guaranteed :-(
Hmm,

http://www.xmlsoft.org/html/libxml-HTMLparser.html

might just be useful:

"this module implements an HTML 4.0 non-verifying parser with API
compatible with the XML parser ones. It should be able to parse "real
world" HTML, even if severely broken from a specification point of
view."

/Per Jessen, Zürich
