Mailinglist Archive: opensuse-programming (16 mails)
Re: [opensuse-programming] extracting text from html
- From: Per Jessen <per@xxxxxxxxxxxx>
- Date: Tue, 25 May 2010 17:50:36 +0200
- Message-id: <htgrkc$5qk$1@xxxxxxxxxxxxxxxx>
justin finnerty wrote:
>> I need to extract text from html for purposes of indexing -
>> implementation language is C or C++
> I would use a SAX parser that handles HTML (libxml2?). Then all you
> might need to do is handle the TEXT nodes.
Something like that was indeed my first thought, but I'm pretty certain
it would require the html to be well-formed, which is far from
guaranteed :-(
I also took a quick look at beautifulsoup, but I still need a C or C++
interface. Essentially I'm looking for something that will provide a
function:
int html2text( char *in, char *out );
It might be feasible or even easier to simply write something to strip
out html tags, I'm still contemplating that.
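A minimal sketch of what such a tag stripper could look like, assuming the
caller supplies an output buffer of at least strlen(in) + 1 bytes. Making
the input const, replacing each tag with a single space, and returning the
number of bytes written are illustrative choices, not part of any existing
library. Entities such as &amp; are left untouched, and the contents of
<script> and <style> elements are not filtered out.

```c
#include <string.h>

/*
 * Naive HTML tag stripper (illustrative only).
 * Copies everything outside <...> from `in` to `out`, replacing each tag
 * with one space so adjacent words do not run together. Comments are
 * dropped entirely. `out` must hold at least strlen(in) + 1 bytes; the
 * output can never be longer than the input.
 * Returns the number of bytes written, excluding the terminating NUL.
 */
int html2text(const char *in, char *out)
{
    char *dst = out;
    int in_tag = 0;

    while (*in) {
        /* Skip comments as a whole, since they may contain '<' or '>'. */
        if (!in_tag && strncmp(in, "<!--", 4) == 0) {
            const char *end = strstr(in + 4, "-->");
            if (!end)
                break;              /* unterminated comment: drop the rest */
            in = end + 3;
            continue;
        }

        if (*in == '<') {
            in_tag = 1;             /* start of a tag (or a stray '<') */
        } else if (*in == '>' && in_tag) {
            in_tag = 0;             /* end of tag: emit a word separator */
            *dst++ = ' ';
        } else if (!in_tag) {
            *dst++ = *in;           /* ordinary text, copy as-is */
        }
        in++;
    }

    *dst = '\0';
    return (int)(dst - out);
}
```

Because it never looks at tag names or nesting, this does not care whether
the html is well-formed; an unclosed '<' simply swallows the rest of the
input.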
/Per Jessen, Zürich
--
To unsubscribe, e-mail: opensuse-programming+unsubscribe@xxxxxxxxxxxx
For additional commands, e-mail: opensuse-programming+help@xxxxxxxxxxxx