Author: dmacvicar Date: Wed Jun 6 13:59:08 2007 New Revision: 5688 URL: http://svn.opensuse.org/viewcvs/zypp?rev=5688&view=rev Log: moved to wiki Modified: trunk/libzypp/zypp2/cache/schema/DESIGN-SCHEMA.txt Modified: trunk/libzypp/zypp2/cache/schema/DESIGN-SCHEMA.txt URL: http://svn.opensuse.org/viewcvs/zypp/trunk/libzypp/zypp2/cache/schema/DESIGN-SCHEMA.txt?rev=5688&r1=5687&r2=5688&view=diff ============================================================================== --- trunk/libzypp/zypp2/cache/schema/DESIGN-SCHEMA.txt (original) +++ trunk/libzypp/zypp2/cache/schema/DESIGN-SCHEMA.txt Wed Jun 6 13:59:08 2007 @@ -1,482 +1,3 @@ -Cache Design Issues -=================== -dmacvicar@suse.de -Solver Requisites -================= - -The queries for the solver include: -(Kindly provided by schubi) - -1. get all installed items -2. get all uninstalled items -3. get all installed/uninstalled items of a specific kind -4. just find installed item with same kind/name as item - Helper::findInstalledByNameAndKind (const ResPool & pool, const string & - name, const Resolvable::Kind & kind) -5. just find uninstalled item with same kind/name as item - Helper::findUninstalledByNameAndKind (const ResPool & pool, const string - & name, const Resolvable::Kind & kind) -6. just find best (according to edition) uninstalled item with same - kind/name as item - // *DOES* check edition - PoolItem_Ref Helper::findUpdateItem (const ResPool & pool, PoolItem_Ref - item) -7. check if the given item is the best one of the pool - // Takes arch, edition in order to check it - bool Helper::isBestUninstalledItem (const ResPool & pool, PoolItem_Ref item) -8. get all requires,obsoletes,provides,recommends,... of an item -9. find all INSTALLED items which provides foo -10. find all UNINSTALLED items which provides foo -11. find all items which provides foo -12. find all locked items -13. find items which have a freshens entry with a given name or - corresponds with one of a given name list -14. find items which have a supplements entry with a given name or - corresponds with one of a given name list -15. find items which have a conflicts entry with a given name or - corresponds with one of a given name list -16. find items which have a recommend entry with a given name or - corresponds with one of a given - name list -17. find all INSTALLED items which requires foo -18. find all UNINSTALLED items which requires foo - -Most of those can be represented by single table queries which don't -depend on the schema at all except: - -* 6. requires edition comparision -* 9. 10. 11. 17. 18. Style queries are analyzed in our example for - different schemas -* 13. 14. 15. 16. Same as above - -Other system schema comparision -=============================== - -xpkg - http://www.xpkg.org --------------------------- - -Uses a simple dependency model (name,major,minor). -Database schema only has a packages table. - -YUM - http://linux.duke.edu/projects/yum ----------------------------------------- - -/var/cache/yum/stable/ -total 348M -0 cachecookie -16M filelists.xml.gz -72M filelists.xml.gz.sqlite -48 headers -45M other.xml.gz -172M other.xml.gz.sqlite -48 packages -6.8M primary.xml.gz -37M primary.xml.gz.sqlite -951 repomd.xml - -The schema is basically a 1:1 copy of the XML schema with -different databases per xml file. They also keep the raw cache. - -The names are not normalized and it is easy to see it looking -at the file sizes. - -Summaries and descriptions are together with resolvable data. - -The time to create factory cache from scratch via NFS is: - -real 2m20.947s -user 1m47.155s -sys 0m9.697s - -Once is created, looking for information about a file: - -real 0m12.470s -user 0m7.028s -sys 0m0.464s - -Solver queries with yum: - -Find package that provides foo: - - sqlite> select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3"; - kdelibs3|3.5.6|4|0 - kdelibs3|3.5.6|4|0 - - real 0m0.008s - user 0m0.008s - sys 0m0.000s - - sqlite> explain query plan select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3"; - 0|1|TABLE provides AS pr WITH INDEX providesname - 1|0|TABLE packages AS pk USING PRIMARY KEY - sqlite> - -Smart ------ - -Smart uses its own cache format. Creating it for stable-x86 -results in a 6.4M cache file. Creating it takes: - - real 2m18.128s - user 0m31.826s - sys 0m2.740s - -And later, searching for a package: - - real 0m3.198s - user 0m2.628s - sys 0m0.444s - -ZYPP Proposed schema Tests --------------------------- - -The current schema available in zypp-trunk. - -Find package that provides foo: - - select r.name,r.version,r.release,r.epoch from resolvables r,dependencies d,versioned_dependencies vd,names n where d.resolvable_id=r.id and d.dependency_type=1 and vd.name_id=n.id and n.name="kdelibs3" and vd.dependency_id=d.id; - kdelibs3|3.5.6|4| - kdelibs3|3.5.6|4| - - real 0m0.695s - user 0m0.636s - sys 0m0.036s - -Slow, this query is resolved as: - - 0|2|TABLE versioned_dependencies AS vd - 1|1|TABLE dependencies AS d USING PRIMARY KEY - 2|0|TABLE resolvables AS r USING PRIMARY KEY - 3|3|TABLE names AS n USING PRIMARY KEY - -Trying again with a new index: - - CREATE INDEX dependency_type ON dependencies (dependency_type); - 12M - - real 0m0.485s - user 0m0.456s - sys 0m0.024s - -Creating an index for the versioned_dependencies: - - CREATE INDEX versioned_dependencies_dependency ON versioned_dependencies (dependency_id); - - 0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1 - 1|1|TABLE dependencies AS d WITH INDEX dependency_type - 2|0|TABLE resolvables AS r USING PRIMARY KEY - 3|2|TABLE versioned_dependencies AS vd WITH INDEX versioned_dependencies_dependency - - real 0m0.579s - user 0m0.552s - sys 0m0.016s - -Still slow. Last try: - -create index vd_name_id on versioned_dependencies(name_id); - - real 0m0.007s - user 0m0.004s - sys 0m0.004s - -This means the bottleneck was in the normalized names table. -Now the query is done in this way: - - 0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1 - 1|2|TABLE versioned_dependencies AS vd WITH INDEX vd_name_id - 2|1|TABLE dependencies AS d USING PRIMARY KEY - 3|0|TABLE resolvables AS r USING PRIMARY KEY - -Adding all those indexes keep the database (without file lists and -descriptions) in a 15 M size. Some extra indexes can be dropped -if the query plan shows they are not used. - -With 40.000 packages. A database without normalized names uses 50mb, while a normalized one uses -around 43 mb. - -The query in the normalized one takes 2.092s. Adding the names index bumps the size to -50mb and gets the query in 0.010s. - -The query in the non-normalized one takes 0.443s. Adding the names index bumps the size to -76 mb and gets the query in 0.010s. - - -stable-x86 distribution facts [^onerepo] -======================================== - -* resolvable names: 8.204 -* unique resolvable names: 8.200 -* dependency names: 119.425 -* unique dependency names: 30.062 - - [^onerepo]: Note that those numbers are only 1 repository. - -If we take the proportions as a fact for every repository, that means 14.5 -dependencies per package in average. - -If you take in count that adding a repository which is not factory has not much -chance of adding much new names the unique 30k unique dependency names will not -grow too much, where the number of dependencies still grows up. - -Glob queries (ie: zypper) - - sqlite> explain query plan select * from dependencies where name like "kde%"; - 0|0|TABLE dependencies - -As we see, this query does not use any index, because, ok, there is none. - - sqlite> CREATE INDEX dependency_name ON dependencies (name); - sqlite> explain query plan select * from dependencies where name like "kde%"; - 0|0|TABLE dependencies - -This makes sense, as from [1], "The GLOB and LIKE operators are expensive in -SQLite because they can't make use of an index. One reason is that these are -implemented by user functions which can be overridden". Index can be bypassed, -if they slowdown the query: - -Example: "WHERE x=5" can be changed to "WHERE x+0=5" or "WHERE +x=5" to bypass the index. -( [1] 5.5.3 ) - -In contrast, this query does use an index: - - sqlite> explain query plan select * from resolvables r,dependencies d where d.name="kdelibs"; - 0|0|TABLE resolvables AS r - 1|1|TABLE dependencies AS d WITH INDEX dependency_name - -Database Size -============= - -A zmd style schema made with repotools, of stable-x86 takes 5.2 M -It is unrealistic as it does not contains descriptions but useful for comparision. - - CREATE INDEX dependency_name ON dependencies (name); - -Ths increases the size of the database to 8.4M!!. - -We normalize the database, taking the names to a new table. - - insert into dependencies (id, name_id, deptype_id, resolvable_id, refers_id, relation_id, epoch, version, release, pre ) select db.id, n.id, db.deptype_id, db.resolvable_id, db.refers_id, db.relation_id, db.epoch, db.version, db.release, db.pre from dependencies_backup db, names n where n.name=db.name; - -The new result is a 30kb difference, after vacuum. (????????) - - 5256192 stable.db - 5283840 stable.old.db - - sqlite> select sum(length(name)) from names; - 563224 - sqlite> select sum(length(name)) from dependencies; - 1922000 - -We should have saved at least 1.922.000 - 563.224 ????? - -However the index size for the names column is much smaller. -The database without normalized names jumps from 5.1M to 6.0M when -creating a name index, instead of the 8.4M mentioned above. - -Growing Test -============ - -zypp1.db uses name normalization for dependencies. zypp2.db does not. - -8.000 packages -9.9M zypp1.db -9.8M zypp2.db - -16.000 packages -18M zypp1.db -20M zypp2.db - -24.000 packages -26M zypp1.db -30M zypp2.db - -32.000 packages -34M zypp1.db -40M zypp2.db - -40.000 packages -43M zypp1.db -50M zypp2.db - -If x = thousand of packages, and y = database size. A normalized -schema has a slomp of 1.1 and a unnormalized schema a slomp of 1.25 - -File list data -============== - -For factory. Complete file list is - -2.840.491 files (unique names: 890.373) -119.070 directories - -We ommited using a schema with one table and one column. - -Test 1, Normalized Directories - -|-----------------| |----------------------------------| -| dirs | | files | -|-----------------| |----------------------------------| -| id | name | | dir_id | package_id | name | -|-----------------| |----------------------------------| - -For yum xml format, this is 192M in filelists.xml which can be -compressed to 17M filelists.xml.gz - -Using a simple schema normalizing the directories: - -186M filelist.db - -Obtaining the file list of a package: - -select d.name,f.name from files f,dirs d where f.dir_id=d.id and f.package_id=7557 -takes 0.135s and was not able to improve it more using indexes. This assumes you have the -key of the package. - -Reverse lookup, find the id of a package that owns file and dir. - -'select f.package_id from files f,dirs d where f.dir_id=d.id and f.name="libzypp.so" and d.name="/usr/lib";' -takes 1.582s - -Adding a index on file name, grows the database to 259M, but resolves the query in 0.007s. - -Test 1, Normalized Directories and file names - -|-----------------| |-----------------| -| dir_names | | file_names | -|-----------------| |-----------------| -| id | name | | id | name | -|-----------------| |-----------------| - -|----------------------------------| -| files | -|----------------------------------| -| dir_id | file_id | package_id | -|----------------------------------| - -This approach generates much slower insertions, which can be reduced with some -application level cache, for example for the directories, which probably get -inserted lot of files for the same directory. You can avoid N selects, keeping -the dir_id and name in memory, where N is the average number of files per directory. - -This generates a 193M filelist2.db. This includes an index on filenames, otherwise -every insertion requires a select on a non indexed text column. So this size is -comparable to the 259M non-normalized table. - -I could not get the reverse lookup query under a second. - -Application level cache for normalized insertions -================================================= - -|-----------------| |-----------------| -| table1 | | table2 | -|-----------------| |-----------------| -| id | name | | id | name | -|-----------------| |-----------------| - -|----------------------------------| -| table1_table1 | -|----------------------------------| -| t1_id | t2_id | other | -|----------------------------------| - -If you are inserting a list of: - -table1 table2 other ------------------------ -dog lalal 423432 -dog lelele 423432 -cat foo 423423 -cat bleh 423423 -cat ruby 423423 -duck nah 432432 - -If you insert in table1_table1 you will need to -check if an id already exists for table1 and table2, -using text comparision, which is slow. - -As we insert more than 1 table2 object for the same -table1 object, the id of table1 can be cached, and -the followings won't need another select statement. - -Representing in memory data -=========================== - -There are entities that are in memory, like Architectures and we dont -want to maintain them in the database store but we would like to be able -to relate database information with memory information. - -Virtual tables allow you to do this, so you can create a architectures table -that is not stored on disk that is only a set of callbacks to answer queries -but you can still use it in SQL. - -See [3] for more information about virtual tables. - -C style API vs C++ Style API -============================ - -1000 insertions and iterating using a select - - (sqlite3x insert) 2:05 (u 0.07 s 1.09 c 0.00) - (sqlite3x select) 0 (u 0.01 s 0.00 c 0.00) - - (sqlite-c insert) 2:05 (u 0.09 s 0.89 c 0.00) - (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00) - -Oh sqlite is really slow to insert!!! no, I had autocommit enabled so there -was one transaction per insert, if we do all inserts as a transaction - - (sqlite3x insert) 0 (u 0.01 s 0.00 c 0.00) - (sqlite3x select) 0 (u 0.00 s 0.00 c 0.00) - - (sqlite-c insert) 1 (u 0.01 s 0.00 c 0.00) - (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00) - -Ok, that is fast, lets do 10.000 inserts in a row. - - (sqlite3x insert) 0 (u 0.20 s 0.00 c 0.00) - (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00) - - (sqlite-c insert) 0 (u 0.20 s 0.02 c 0.00) - (sqlite-c select) 0 (u 0.03 s 0.00 c 0.00) - -repeat same test with flushed caches: - - (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00) - (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00) - - (sqlite-c insert) 1 (u 0.18 s 0.00 c 0.00) - (sqlite-c select) 0 (u 0.04 s 0.00 c 0.00) - -Conclusion, the c++ layer does not add a significant performance penalty but -does reduce the code by 30%-40% as errors are better handled using exceptions -instead of checking error values. same with transactions using block scopes. - -Some Conclusions -================ - -- Normalization lead to slower queries (due to JOINs) -- Normalized tables are faster than non-normalized tables - using an index. -- When both use an index, they reach the 0.01s lower limit -- The index size is much smaller in normalized tables. - -Recommendations -=============== - -[1] Dependencies names normalization can save space. Especially with - indexes. -[2] Dependency type normalization is cheap and can provide a flexible - capability representation. -[3] separating dependency types in different tables can also save - a JOIN, but if combined with [2], leads to (dep types)X(capability impls) - tables. - - -References - - [1]: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html - [2]: http://developer.mozilla.org/en/docs/Storage:Performance - [3]: http://www.sqlite.org/cvstrac/wiki?p=VirtualTables - [4]: http://katastrophos.net/andre/blog/2007/01/04/sqlite-performance-tuning-and-... - \ No newline at end of file +See: +http://en.opensuse.org/Libzypp/Refactoring/CacheSchema \ No newline at end of file -- To unsubscribe, e-mail: zypp-commit+unsubscribe@opensuse.org For additional commands, e-mail: zypp-commit+help@opensuse.org