Mailinglist Archive: opensuse-programming (16 mails)
Re: [opensuse-programming] extracting text from html
- From: Per Jessen <per@xxxxxxxxxxxx>
- Date: Sat, 29 May 2010 10:39:52 +0200
- Message-id: <htqjso$96d$1@xxxxxxxxxxxxxxxx>
Per Jessen wrote:
Hmm,
http://www.xmlsoft.org/html/libxml-HTMLparser.html
might just be useful:
"this module implements an HTML 4.0 non-verifying parser with API
compatible with the XML parser ones. It should be able to parse "real
world" HTML, even if severely broken from a specification point of
view."
A quick update for those who might be following this thread: I got this
to work fairly quickly with libxml. My documents are read from disk,
the HTML is parsed, and I then traverse the tree looking for text
nodes. I collect the text and feed it to Xapian, then start on the next
document. I still have some work left on a status listing (documents
processed, what was indexed, etc.) and on getting errors returned
properly, but I think this is the right approach.
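
For anyone who wants to try the text-collection step, here is a minimal sketch. It is not the libxml code described above: it uses Python's standard-library html.parser as a stand-in, which is event-driven rather than tree-based, so text is gathered in callbacks instead of by walking a tree. The names TextCollector and extract_text are made up for illustration, skipping script and style content is an assumption about what one would want indexed, and the Xapian step is left out entirely.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect character data from HTML, ignoring script and style content."""

    SKIPPED = {"script", "style"}  # assumption: not worth indexing

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []        # non-empty pieces of text, in document order
        self._skip_depth = 0    # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of one HTML document as a single string."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return " ".join(collector.chunks)


if __name__ == "__main__":
    # Deliberately sloppy markup: unclosed <b> and <p>, no </html>.
    sample = "<html><body><h1>Title</h1><p>Some <b>broken markup<p>More text</body>"
    print(extract_text(sample))
```

The string returned by extract_text is what would then be handed to the indexer, one document at a time.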
/Per Jessen, Zürich
--
To unsubscribe, e-mail: opensuse-programming+unsubscribe@xxxxxxxxxxxx
For additional commands, e-mail: opensuse-programming+help@xxxxxxxxxxxx