On 2014-05-03 23:10, mararm wrote:
> On Saturday, May 03, 2014 22:25:08 Carlos E. R. wrote:
>>> thousands of PDFs and a huge amount of weird (mostly text) files, like DNA sequences, huge matrices from permutations (I'm a biologist) and so on...
>> Those would appear to be "text", but (my educated guess) they are almost random data. If they are large, they will make any content indexer go berserk.
> Why is indexing one large file more of a problem than indexing the same amount of text in multiple smaller files?
No, the problem in this case (IMHO) is not the size, but the lack of patterns in those files, because they are like random data. When indexing the contents of text files, you want to be able to search later for a particular word or sentence and find it fast. So you would (very rough approximation) create a list of words and their locations. If the data is that random, you would have to create location entries for millions of separate words...

That's just a guess. I have never studied this type of thing, so I can only make guesses. Just think of compressing a file of random data: it cannot be compressed. This must be similar. If this data (Vojtěch's) can be indexed at all, it will take a lot of CPU to do so, or disk space, or both.

-- 
Cheers / Saludos,

        Carlos E. R.
        (from 13.1 x86_64 "Bottle" at Telcontar)
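The "list of words and their locations" idea above can be sketched with a toy inverted index. This is a minimal illustration in Python, not what any real indexer (Baloo, Nepomuk, etc.) actually does; the 8-word toy vocabulary and the 12-character DNA-like tokens are invented for the example. It shows why near-random data hurts: natural text reuses a small vocabulary, so the index stays small, while random tokens make almost every word a separate index entry.

```python
import random
from collections import defaultdict

def build_index(words):
    """Map each distinct word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index

# Natural-language-like text: a small vocabulary repeated many times.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]
natural = [random.choice(vocab) for _ in range(10_000)]

# Random "words" (DNA-like 12-letter fragments): almost every token is unique.
random_words = ["".join(random.choices("ACGT", k=12)) for _ in range(10_000)]

nat_index = build_index(natural)
rnd_index = build_index(random_words)

# The natural text needs at most 8 index keys; the random data needs
# close to one key per token, so the index grows with the file itself.
print("natural text keys:", len(nat_index))
print("random data keys: ", len(rnd_index))
```

In the natural case the index has a handful of keys with long position lists; in the random case it has nearly ten thousand keys with one position each, which is the "location indexes for millions of separate words" problem scaled down.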