Mailinglist Archive: opensuse-programming (16 mails)

Re: [opensuse-programming] extracting text from html
  • From: Per Jessen <per@xxxxxxxxxxxx>
  • Date: Tue, 25 May 2010 18:02:54 +0200
  • Message-id: <htgsbe$5qk$2@xxxxxxxxxxxxxxxx>
Greg Freemyer wrote:

Yeah, something like that will be the last resort - I'd prefer not
having to fork() for such a simple operation.  (I'll be indexing
millions of documents).  Maybe it's worth checking out what those two
utilities use for the extraction.
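
For illustration only, here is a minimal sketch of what in-process
extraction (no fork() per document) could look like, using nothing but
Python's standard-library html.parser. It is not what either utility
mentioned in the thread does internally; the names TextExtractor and
html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse runs of whitespace.
        # Caveat: a word split by inline markup ("wor<b>ld</b>")
        # comes out as two tokens.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string, parsed in-process."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text() once per document keeps everything in a single
process; whether a pure-Python parser is fast enough for millions of
documents would have to be measured.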


/Per Jessen, Zürich

Per,

If you have millions of documents to index/search, you're getting into
my world.

Hi Greg - there's no real upper limit, but design-wise I'm working on
50-60 million documents per index - I will have many indexes on
different collections of documents.

Indexing and searching is a core part of my day job and we sell that
capability as a service. We use an extremely fast, but rather
expensive engine that runs on Linux. Unless you have lots of
money to spend ($100K+) and are willing to run RH, I won't describe it
in detail, but the speed is incredible. (~800GB/hr indexing speed has
been measured at the vendor's lab. We have a smaller solution, but we
see a couple hundred GB/hr routinely.)

I'm currently aiming at indexing up to 1 million new documents per index
per day, so about 10/sec on average. Later on, we will also be
removing (de-indexing?) the same amount per index per day.
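
As a back-of-the-envelope check of the rough per-second figure above,
the arithmetic can be sketched like this. The function name
docs_per_second is made up for the example, and it assumes the load is
spread evenly over 24 hours, which a real mail flow will not be.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def docs_per_second(docs_per_day):
    """Average operations per second for a given daily volume."""
    return docs_per_day / SECONDS_PER_DAY


# Additions only: 1 million new documents per index per day.
adds = docs_per_second(1_000_000)

# Once the same number of removals per day is added, the index
# sees twice as many write operations.
adds_and_removes = docs_per_second(2 * 1_000_000)

print(f"adds only:        {adds:.1f} ops/sec")
print(f"adds and removes: {adds_and_removes:.1f} ops/sec")
```

So the average is a little under 12 documents per second per index for
additions alone, and roughly double that once removals start.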

Assuming that's way overkill, would you consider a low-cost commercial
solution?
dtSearch is definitely a market leader and it runs at reasonable
speeds. For Linux they only offer a library/engine, I believe, not a
full GUI, but that may work for you. I have no idea what they charge
for the library/engine.

http://www.dtsearch.com/PLF_engine_2.html

For the time being we've pretty much decided to use Xapian, but if that
turns out to have significant performance issues, products such as
dtSearch could well come under consideration - thanks for the link.
I stumbled over Xapian more or less by accident, but gmane is using it
for indexing millions of emails, which is exactly what I will be doing
too.


/Per Jessen, Zürich
