On Tue, May 25, 2010 at 11:09 AM, Per Jessen email@example.com wrote:
Patrick Shanahan wrote:
- Per Jessen firstname.lastname@example.org [05-25-10 10:49]:
I need to extract text from html for purposes of indexing - implementation language is C or C++. So far I've come across html2text which is written in C++ - it looks pretty good, but I will need to make some changes to make it fit my purposes. Does any other library come to mind for extracting text from html?
w3m -dump <url>
lynx has a similar function
Yeah, something like that will be the last way out - I'd prefer not having to fork() for such a simple operation. (I'll be indexing millions of documents). Maybe it's worth checking out what those two utilities use for the extraction.
/Per Jessen, Zürich
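For illustration, the in-process approach Per is asking about (no fork() per document) could look roughly like the sketch below. This is not how html2text, w3m or lynx do it; `extract_text` is a hypothetical helper, and it is deliberately naive: it drops tags, comments and the bodies of script/style elements, collapses whitespace, and does not decode entities such as `&amp;`. It also treats every tag as a word break, so inline markup inside a word will split it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from html2text/w3m/lynx): naive in-process
// HTML-to-text extraction. Entities are NOT decoded.
std::string extract_text(const std::string& html) {
    const std::size_t n = html.size();
    std::string out;

    // Case-insensitive test: does `html` contain `word` (lowercase) at `pos`?
    auto starts_with_ci = [&](std::size_t pos, const std::string& word) {
        if (pos + word.size() > n) return false;
        for (std::size_t k = 0; k < word.size(); ++k) {
            if (std::tolower(static_cast<unsigned char>(html[pos + k])) != word[k])
                return false;
        }
        return true;
    };
    // Append a single separating space unless one is already there.
    auto add_space = [&]() {
        if (!out.empty() && out.back() != ' ') out.push_back(' ');
    };

    const std::string skip_tags[] = {"script", "style"};
    std::size_t i = 0;
    while (i < n) {
        if (html[i] != '<') {
            // Plain text: copy it, collapsing runs of whitespace.
            if (std::isspace(static_cast<unsigned char>(html[i]))) add_space();
            else out.push_back(html[i]);
            ++i;
            continue;
        }
        if (html.compare(i, 4, "<!--") == 0) {
            // Comment: skip to the closing "-->" (or to end of input).
            std::size_t end = html.find("-->", i + 4);
            i = (end == std::string::npos) ? n : end + 3;
            continue;
        }
        for (const std::string& tag : skip_tags) {
            if (starts_with_ci(i + 1, tag)) {
                // Skip the element body up to its closing tag.
                const std::string close = "</" + tag;
                std::size_t j = i + 1 + tag.size();
                while (j < n && !starts_with_ci(j, close)) ++j;
                i = j;
                break;
            }
        }
        if (i >= n) break;
        // Skip the tag itself and treat it as a word separator.
        std::size_t gt = html.find('>', i);
        i = (gt == std::string::npos) ? n : gt + 1;
        add_space();
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();
    return out;
}
```

Calling `extract_text` on each document buffer keeps everything in one process; for real-world HTML a proper parser library would be the safer choice.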
If you have millions of documents to index/search, you're getting into my world.
Indexing and searching is a core part of my day job, and we sell that capability as a service. We use an extremely fast but rather expensive engine that runs on Linux. Unless you have lots of money to spend ($100K+) and are willing to run RH, I won't describe it in detail, but the speed is incredible. (~800GB/hr indexing speed has been measured at the vendor's lab. We have a smaller solution, but we routinely see a couple hundred GB/hr.) The hardest part is getting an I/O system fast enough to feed it. (i.e., how do you load 800GB/hr onto a computer for indexing? A 1 Gbit LAN is way too slow; same for USB, FireWire, etc. In the vendor's lab I think they used a high-end NAS with 4x1Gbit ports bonded together as the data source.)
(Use private email if you want to know more. No need to bore everyone here.)
Assuming that's way overkill, would you consider a low-cost commercial solution?
dtSearch is definitely a market leader and it runs at reasonable speeds. For Linux I believe they only offer a library/engine, not a full GUI, but that may work for you. I have no idea what they charge for the library/engine.
We have dtSearch, but the Windows Desktop version, so I can't say much about the Linux engine they sell.
Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retriev...

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com