On 2014-05-03 23:10, mararm wrote:
> On Saturday, May 03, 2014 22:25:08 Carlos E. R. wrote:
>>> thousands of PDFs and a huge amount of weird (mostly text) files, like DNA sequences, huge matrices from permutations (I'm a biologist) and so on...
>> Those would appear to be "text", but (my educated guess) they are almost random data. If they are large, they will make any content indexer go berserk.
> Why is indexing one large file more of a problem than indexing the same amount of text in multiple smaller files?
No, the problem in this case (IMHO) is not the size, but the lack of patterns in those files, because they are like random data. When indexing the contents of text files, you want to be able to search later for a particular word or sentence and find it fast. So you would (very rough approximation) create a list of words and their locations. If the data is that random, you would have to create location entries for millions of separate words...

That's just a guess. I have never studied this type of thing, so I can only make guesses. Just think of compressing a file of random data: it cannot be compressed. This must be similar. If this data (Vojtěch's) can be indexed at all, it will take a lot of CPU to do so, or disk space, or both.

-- 
Cheers / Saludos,

        Carlos E. R.
        (from 13.1 x86_64 "Bottle" at Telcontar)
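The "list of words and their locations" idea above can be sketched with a toy inverted index. This is a minimal illustration in Python, not what any real indexer (Baloo, Nepomuk, etc.) actually does; the 8-word toy vocabulary and the 12-character DNA-like tokens are invented for the example. It shows why near-random data hurts: natural text reuses a small vocabulary, so the index stays small, while random tokens make almost every word a separate index entry.

```python
import random
from collections import defaultdict

def build_index(words):
    """Map each distinct word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index

# Natural-language-like text: a small vocabulary repeated many times.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]
natural = [random.choice(vocab) for _ in range(10_000)]

# Random "words" (DNA-like 12-letter fragments): almost every token is unique.
random_words = ["".join(random.choices("ACGT", k=12)) for _ in range(10_000)]

nat_index = build_index(natural)
rnd_index = build_index(random_words)

# The natural text needs at most 8 index keys; the random data needs
# close to one key per token, so the index grows with the file itself.
print("natural text keys:", len(nat_index))
print("random data keys: ", len(rnd_index))
```

In the natural case the index has a handful of keys with long position lists; in the random case it has nearly ten thousand keys with one position each, which is the "location indexes for millions of separate words" problem scaled down.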