On Tue, Sep 16, 2014 at 5:43 PM, Carlos E. R. <carlos.e.r@opensuse.org> wrote:
On 2014-09-16 21:54, Claudio Freire wrote:
On Mon, Sep 15, 2014 at 8:43 PM, Carlos E. R. wrote:
On 2014-09-16 01:29, Claudio Freire wrote:
On Mon, Sep 15, 2014 at 8:25 PM, Carlos E. R. wrote:
So it is obvious! We simply run the cp to /dev/null trick ahead of the query in the script, and we're done.
Yes, it's a nice band-aid if the system has enough memory.
Not so much if it doesn't.
True.
It is a hack, or band-aid, as you say. The real problem is how the database engine is coded: it is apparently designed to minimize RAM use, doing non-sequential and non-cached disk reads.
That's not the case. It does use cached reads, but it takes about a minute to cache the whole thing in random order, whereas it takes only a few seconds in sequential order.
Wrong.
Why do you say that? The fact that the hack works proves it does use the kernel's buffer cache. In fact, it was one of the first things I checked with strace: whether it opened the file in direct mode or not. It does not.
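For reference, the same direct-mode check can be made from inside a process instead of with strace. This is only a minimal sketch, assuming a platform such as Linux where `os.O_DIRECT` is defined; `opened_direct` and `demo` are hypothetical helper names, not anything taken from rpm or db3.

```python
import fcntl
import os
import tempfile

# O_DIRECT only exists on some platforms (e.g. Linux); fall back to 0 elsewhere.
O_DIRECT = getattr(os, "O_DIRECT", 0)

def opened_direct(fd):
    """Return True if the descriptor's status flags include O_DIRECT."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return bool(flags & O_DIRECT)

def demo():
    """Open a scratch file the ordinary way and report whether it is in direct mode."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            return opened_direct(fd)
        finally:
            os.close(fd)

if __name__ == "__main__":
    print(demo())
```

A file opened without O_DIRECT goes through the kernel's page cache, which is the precondition for the cp trick to help at all.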
With the proposed hack, it takes about 3 seconds to cache the whole thing, then another 3 to do the whole query, compared to 90 seconds before the hack.
It does not matter how the database is accessed, once it is loaded in RAM. Of course, caching it as it is randomly accessed is wrong, unless the database engine is permanently running, as mysql might do.
It doesn't have to keep running. As the success of the cp shows, it only needs to put all the data into the OS buffer cache, which happens with each pread. The only difference between read and pread is that pread doesn't modify the file descriptor's pointer. Everything else the kernel does to cache reads applies, as demonstrated by the fact that the hack works.
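The read/pread difference described above can be seen with a few lines of Python. A minimal sketch, assuming a POSIX system where `os.pread` is available; the scratch file and the helper names `offsets_after_reads` and `demo` are made up for illustration.

```python
import os
import tempfile

def offsets_after_reads(path):
    """Return the fd offset after a pread() and after a following read()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.pread(fd, 4, 8)                          # 4 bytes at absolute offset 8
        after_pread = os.lseek(fd, 0, os.SEEK_CUR)  # pread left the offset alone
        os.read(fd, 4)                              # 4 bytes at the current offset
        after_read = os.lseek(fd, 0, os.SEEK_CUR)   # read advanced the offset
        return after_pread, after_read
    finally:
        os.close(fd)

def demo():
    """Run the comparison against a 16-byte scratch file."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789abcdef")
        tmp.flush()
        return offsets_after_reads(tmp.name)

if __name__ == "__main__":
    print(demo())
```

Both calls go through the same page cache; only the bookkeeping of the descriptor's position differs.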
Look:
Telcontar:~ # echo 3 > /proc/sys/vm/drop_caches
Telcontar:~ # time cp /var/lib/rpm/Packages /dev/null

real 0m3.532s
user 0m0.004s
sys 0m0.245s
Telcontar:~ # time rpm -qa | wc -l
6154

real 0m3.668s
user 0m2.670s
sys 0m0.206s
Telcontar:~ # echo 3 > /proc/sys/vm/drop_caches
Telcontar:~ # time rpm -qa | wc -l
6154

real 1m23.203s
user 0m2.912s
sys 0m1.692s
Telcontar:~ #
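The cp to /dev/null step in the transcript above amounts to one sequential pass over the file with the data thrown away. A rough Python equivalent, offered only as a sketch: `prewarm` and `demo` are hypothetical names, the 1 MiB chunk size is an arbitrary choice, and nothing here is part of rpm.

```python
import os
import tempfile

def prewarm(path, chunk_size=1 << 20):
    """Read path front to back, discarding the data; return the bytes read.

    The only purpose is the side effect: the kernel keeps the pages it
    read in its cache, so later random reads of the file avoid the disk.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # empty bytes object means end of file
                break
            total += len(chunk)
    return total

def demo():
    """Pre-warm a 2500-byte scratch file using small chunks."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 2500)
        path = tmp.name
    try:
        return prewarm(path, chunk_size=1024)
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(demo())
```

In a script, calling `prewarm("/var/lib/rpm/Packages")` right before the rpm query would play the role of the cp. As noted earlier in the thread, this only helps if the system has enough free memory to hold the file.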
What does it prove? The first run proves the reads are cached, otherwise the cp wouldn't help, it would hurt.

On Tue, Sep 16, 2014 at 6:01 PM, Stefan Brüns <stefan.bruens@rwth-aachen.de> wrote:
On Tuesday 16 September 2014 16:54:13 Claudio Freire wrote:
That's not the case. It does use cached reads, but it takes about a minute to cache the whole thing in random order, whereas it takes only a few seconds in sequential order.
I'm having a hard time following rpmdb.c's code. I see it uses plain db3 cursors, which should be sequentially scanning the file instead of hopping all over the place. If that's truly the case, then db3 is the one that needs fixing. If instead rpmdb.c is creating other cursors in parallel and seeking to other parts of the Packages database (which seems unlikely, but I can't rule it out because I haven't fully figured out the code yet), then rpmdb.c is the one in need of fixing.
I think a simple test case should clear this. I'll try to make one with python (it has a nice and neat interface to db3 that's easier to use than C for this).
The Packages db is in DB_HASH format - this has several implications:
1) A linear scan of the database is a random access pattern of the backing store, i.e. the disk.
2) bdb *does* a mmap of database files, but not for DB_HASH databases.
Um... are you sure about that? I thought the only difference between HASH and BTREE was that the iterating order of cursors was random (by key) in HASH, but it doesn't mean it will be random I/O. Do you have a pointer to documentation? I can't seem to find any relevant details on the access methods on the documentation I find by googling.