On Tuesday 16 September 2014 23:01:40 Stefan Brüns wrote:
On Tuesday 16 September 2014 16:54:13 Claudio Freire wrote:
That's not the case. It does use cached reads, but it takes about a minute to cache the whole thing in random order, whereas it takes only a few seconds in sequential order.
I'm having a hard time following rpmdb.c's code. I see it uses plain db3 cursors, which should be sequentially scanning the file instead of hopping all over the place. If it's truly the case, then it's db3 the one that needs fixing. If it's rpmdb.c creating other cursors in parallel and seeking other parts of the packages database, which I can't rule out because I couldn't fully figure out the code yet, but seems unlikely, it's rpmdb.c the one in need of fixing.
I think a simple test case should clear this. I'll try to make one with python (it has a nice and neat interface to db3 that's easier to use than C for this).
The Packages db is in DB_HASH format - this has several implications:
1) A linear scan of the database is a random access pattern of the backing store, i.e. the disk. 2) bdb *does* a mmap of database files, but not for DB_HASH databases.
Regards,
Stefan
According to /usr/lib/rpm/rpmdb_stat -d /var/lib/rpm/Packages the hash has 19 buckets, and the database contains 3042 keys. Thus every bucket has 150 elements average. I doubt this hash has better access times than a BTree ... The disk access pattern is horrible, see the attached graphic ... Regards, Stefan -- Stefan Brüns / Bergstraße 21 / 52062 Aachen phone: +49 241 53809034 mobile: +49 151 50412019