Mailinglist Archive: opensuse-programming (16 mails)
Re: [opensuse-programming] extracting text from html
- From: Per Jessen <per@xxxxxxxxxxxx>
- Date: Tue, 25 May 2010 18:02:54 +0200
- Message-id: <htgsbe$5qk$2@xxxxxxxxxxxxxxxx>
Greg Freemyer wrote:
>> Yeah, something like that will be the last way out - I'd prefer not
>> having to fork() for such a simple operation. (I'll be indexing
>> millions of documents). Maybe it's worth checking out what those two
>> utilities use for the extraction.
>>
>> /Per Jessen, Zürich
>
> Per,
>
> If you have millions of documents to index/search, you're getting into
> my world.
Hi Greg - there's no real upper limit, but design-wise I'm planning for
50-60 million documents per index - I will have many indexes on
different collections of documents.
> Indexing and searching is a core part of my day job and we sell that
> capability as a service. We use an extremely fast, but rather
> expensive engine that runs on Linux. Unless you have lots of
> money to spend ($100K+) and are willing to run RH, I won't describe it
> in detail, but the speed is incredible. (~800 GB/hr indexing speed has
> been measured at the vendor's lab. We have a smaller solution, but we
> see a couple hundred GB/hr routinely.)
I'm currently aiming at indexing up to 1 million new documents per index
per day, so about 10/sec on average. Later on, we will also be
removing (de-indexing?) the same amount per index per day.
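
As a rough check on those figures, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the "days to fill" figure assumes, hypothetically, that removals only start once an index has reached its target size. One million documents per day works out to roughly 11.6/sec, in line with the "about 10/sec" estimate, and roughly double that in index operations once removals run at the same rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(docs_per_day: int) -> float:
    """Average documents per second for a given daily volume."""
    return docs_per_day / SECONDS_PER_DAY


def days_to_fill(index_capacity: int, docs_per_day: int) -> float:
    """Days needed to reach index_capacity at a constant daily intake.

    Assumption (not from the thread): nothing is removed while filling.
    """
    return index_capacity / docs_per_day


if __name__ == "__main__":
    adds = average_rate(1_000_000)
    print(f"additions only:       {adds:.1f} docs/sec")
    # Later on, the same number of documents is removed per day as added.
    print(f"additions + removals: {2 * adds:.1f} operations/sec")
    for capacity in (50_000_000, 60_000_000):
        days = days_to_fill(capacity, 1_000_000)
        print(f"{capacity:,} docs reached after {days:.0f} days")
```

These are averages only; real mail traffic is bursty, so the peak rate an index has to absorb may be considerably higher.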
> Assuming that's way overkill, would you consider a low-cost commercial
> solution?
>
> dtSearch is definitely a market leader and it runs at reasonable
> speeds. For Linux I believe they only offer a library/engine, not a
> full GUI, but that may work for you. I have no idea what they charge
> for the library/engine.
>
> http://www.dtsearch.com/PLF_engine_2.html
For the time being we've pretty much decided to use Xapian, but if that
turns out to have significant performance issues, products such as
dtSearch could well come under consideration - thanks for the link.
I stumbled over Xapian more or less by accident, but Gmane is using it
for indexing millions of emails, which is exactly what I will be doing
too.
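
On the earlier point about not wanting to fork() an external utility for every document: the extraction can be done in-process. The following is only a sketch under assumptions, not what any of the tools mentioned in this thread do. It uses Python's standard html.parser module, and the names TextExtractor and html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, in-process."""

    SKIP = {"script", "style"}  # content of these elements is not indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Chunks are joined with a space so that adjacent block elements
        # do not run together; a word split by inline markup would be
        # split here too, which is a known limitation of this sketch.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return whitespace-normalised text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><p>Hello &amp; <b>world</b></p></body></html>"
    print(html_to_text(sample))
```

The resulting string could then be handed to whatever indexer is in use. Whether this is fast enough at millions of documents per day would have to be measured.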
/Per Jessen, Zürich
--
To unsubscribe, e-mail: opensuse-programming+unsubscribe@xxxxxxxxxxxx
For additional commands, e-mail: opensuse-programming+help@xxxxxxxxxxxx