Per Jessen wrote:
justin finnerty wrote:
I need to extract text from HTML for purposes of indexing. The implementation language is C or C++.
I would use a SAX parser that handles HTML (libxml2?). Then all you might need to do is handle the TEXT nodes.
Something like that was indeed my first thought, but I'm pretty certain it would require the HTML to be well-formed, which is far from guaranteed :-(
Hmm, http://www.xmlsoft.org/html/libxml-HTMLparser.html might just be useful: "this module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view."

/Per Jessen, Zürich