-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andreas Hanke wrote:
Keith Goggin wrote:
I've noticed that <filelists.xml.gz> is refreshed by Yast YOU if it has changed on the server. It takes me about 4 minutes to download from <mirror.pacific.net.au> and is currently 2.9MB in size.
It's really insane. This XML stuff has made SUSE distros basically unusable without a broadband connection. 3MB before the distro is even released - crazy!
Let's have a look at some numbers ;)

On my repository for 10.1:

repo type | size (bytes) | MB    | files
- ----------+--------------+-------+---------------------------
yast2     |      1831446 |  1.74 | packages, packages.en
rpm-md    |      1454322 |  1.37 | primary.xml.gz, filelists.xml.gz

The yast2 repository files actually make up a somewhat larger download here (the rpm-md files are about 21% smaller).

When you look at Factory:

repo type | size (bytes) | MB    | files
- ----------+--------------+-------+---------------------------
yast2     |     17876184 | 17.04 | packages, packages.en
rpm-md    |     20263138 | 19.32 | primary.xml.gz, filelists.xml.gz

Here rpm-md is a little larger (the yast2 files are about 12% smaller).
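The percentages quoted for the two repositories appear to be the size difference taken relative to the larger of the two downloads. A quick sketch to recompute them from the byte counts above (the helper name `rel_diff` is made up for this illustration):

```python
def rel_diff(larger: int, smaller: int) -> float:
    """Return how much smaller `smaller` is than `larger`, in percent of `larger`."""
    return (larger - smaller) / larger * 100


# Byte counts copied from the two tables above.
sizes = {
    "10.1": {"yast2": 1831446, "rpm-md": 1454322},
    "Factory": {"yast2": 17876184, "rpm-md": 20263138},
}

for repo, by_type in sizes.items():
    # Sort the two repo types so that the larger download comes first.
    big, small = sorted(by_type, key=by_type.get, reverse=True)
    pct = rel_diff(by_type[big], by_type[small])
    print(f"{repo}: {small} is {pct:.0f}% smaller than {big}")
```

Taking the difference relative to the smaller download instead would give somewhat higher percentages, so it is worth stating the base when quoting such figures.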
I had a look at this compressed text file and observed that it contains a very large proportion of redundant data. /a/b/c/d/e/file1 /a/b/c/d/e/file2 etc
It's not redundant data. Those are file paths. Parts of them are redundant, sure (e.g. /usr/share, /usr/lib, ...), just as in the yast2 repository files (packages, packages.en). The yast2 files actually contain a lot more redundant data, as they're not even compressed.
If this file could be compressed further by omitting the redundant data, the compressed version would download much faster and might improve the end user's experience of updating their system.
That's exactly what gzip is already doing on those rpm-md XML files.
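To illustrate the point, here is a minimal sketch on synthetic data (the path layout and counts are invented, not taken from any real repository). It compresses a list of full paths and the same list with the directory part stripped by hand, then compares how much the shared prefixes cost before and after gzip:

```python
import gzip


def make_paths(n_dirs=50, n_files=40):
    """Build a synthetic file list with long shared directory prefixes."""
    return [
        f"/usr/share/doc/packages/pkg{d:03d}/file{f:03d}.txt"
        for d in range(n_dirs)
        for f in range(n_files)
    ]


paths = make_paths()
full = "\n".join(paths).encode()
# Same list with the directory part removed by hand (base names only).
stripped = "\n".join(p.rsplit("/", 1)[1] for p in paths).encode()

full_gz = gzip.compress(full)
stripped_gz = gzip.compress(stripped)

print(f"raw:  full={len(full)}  stripped={len(stripped)}")
print(f"gzip: full={len(full_gz)}  stripped={len(stripped_gz)}")
print(f"prefix overhead: raw={len(full) - len(stripped)}  "
      f"gzip={len(full_gz) - len(stripped_gz)}")
```

On input like this the repeated prefixes account for most of the raw size but only a small part of the compressed size, which is the effect described above. Real file lists are less regular, so the gain will be smaller.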
Yes, there are much more efficient ways to store these data than repomd does. But hey, XML is so much cooler.
XML has its advantages. It's very easy to parse (though not necessarily with good performance; that depends on whether you use SAX, DOM, StAX, ...). Using DOM for such files takes insane amounts of memory. Parsing yast2 metadata isn't that complex, but you definitely have to write your own error-prone parser instead of just using StAX. It took quite a few attempts and fixes to write the yast2 parser in smart properly, as the various "tags" are not documented (and suddenly new tags appeared, etc...).
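As a rough sketch of the streaming approach, the following uses Python's `xml.etree.ElementTree.iterparse` (an incremental parser, standing in here for SAX/StAX) directly on a gzip stream. The sample document is a tiny hand-written stand-in for filelists.xml.gz, not real repository data, and `iter_package_files` is a name made up for this example:

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Namespace as used by rpm-md filelists documents (assumed for this sketch).
NS = "{http://linux.duke.edu/metadata/filelists}"

# Tiny hand-written stand-in for a filelists.xml document.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="2">
  <package pkgid="abc" name="foo" arch="i586">
    <version epoch="0" ver="1.0" rel="1"/>
    <file>/usr/bin/foo</file>
    <file>/usr/share/doc/packages/foo/README</file>
  </package>
  <package pkgid="def" name="bar" arch="noarch">
    <version epoch="0" ver="2.1" rel="3"/>
    <file>/usr/bin/bar</file>
  </package>
</filelists>
"""


def iter_package_files(stream):
    """Yield (package name, list of file paths), one package at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "package":
            files = [f.text for f in elem.findall(NS + "file")]
            yield elem.get("name"), files
            elem.clear()  # drop this package's children once they are processed


compressed = gzip.compress(SAMPLE)
# Decompress and parse in one pass; the uncompressed XML is never held as a whole.
with gzip.open(io.BytesIO(compressed)) as stream:
    for name, files in iter_package_files(stream):
        print(name, files)
```

A DOM-style parse would build the full tree first; here each `<package>` subtree is discarded as soon as it has been handled, so memory use depends mostly on the largest single package.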
Maybe one of those talented persons who designed repomd should have a look at /var/lib/locatedb. It's able to store equivalent information about 2 Linuxes, 1 Windows system and a lot of user files in just 2.5MB. Thanks to not using XML.
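For context, locate databases get their compactness largely from front compression (incremental encoding) of a sorted path list: each entry stores only how many leading characters it shares with the previous entry, plus the differing tail. A minimal sketch of the idea follows; it is not the actual on-disk locatedb format, and the function names are invented:

```python
import os


def front_encode(sorted_paths):
    """Encode each path as (chars shared with the previous path, remaining suffix)."""
    prev = ""
    encoded = []
    for path in sorted_paths:
        shared = len(os.path.commonprefix([prev, path]))
        encoded.append((shared, path[shared:]))
        prev = path
    return encoded


def front_decode(encoded):
    """Rebuild the original path list from (shared, suffix) pairs."""
    prev = ""
    paths = []
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        paths.append(prev)
    return paths


paths = sorted([
    "/usr/share/doc/packages/foo/README",
    "/usr/share/doc/packages/foo/COPYING",
    "/usr/share/doc/packages/bar/README",
    "/usr/lib/libfoo.so.1",
])
encoded = front_encode(paths)
for shared, suffix in encoded:
    print(shared, suffix)
```

The same trick could in principle be applied inside any metadata format; it is independent of whether the container is XML or a custom binary layout.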
One might argue that xml.gz _is_ a binary format ;) gzip should already optimize away the XML tags and attributes.

filelists.xml.gz contains all files that are in Factory in 19.32 MB and files of RPMs for 1023 different projects in 1.37 MB in my repository.

Note that using bzip2 would be more effective (with filelists.xml.gz from Factory):

Compression         | File size | diff | ratio
- --------------------+-----------+------+-------
uncompressed        | 155920703 | +92% |  0.00
original (gzip)     |  12908161 |    0 | 12.07
gzip -9             |  12871908 |  -1% | 12.11
gzip -9 --rsyncable |  13158574 |  +2% | 11.84
bzip2               |  10848980 | -16% | 14.37
lrzip -M            |   4879263 | -63% | 31.95

(diff is the compressed size compared to the original .gz)
(ratio is the uncompressed size divided by the compressed size)

lrzip performs best (to say the least) but takes a long time, especially compared to gzip, and a lot of RAM (-M uses all available RAM; this was run on a 2GB box), even on my dual-core AMD64. The ratio is pretty impressive though (see http://ck.kolivas.org/apps/lrzip/README). bzip2 is also significantly slower than gzip and uses somewhat more CPU but still gives a 16% gain.

Now with the yast2 "packages" file:

Compression         | File size | diff | ratio
- --------------------+-----------+------+-------
original (no compr.)|  15547367 |    0 |  0.00
gzip -9             |   2340627 | -85% |  6.64
bzip2 -9            |   1797400 | -89% |  8.64
lrzip -M            |   1393749 | -92% | 11.15

Note: you cannot directly compare the yast2 and rpm-md files (except for overall size) as they contain somewhat different subsets of a repository's data.

Obviously there's still a lot to gain from compressing yast2 repository files -- even with a quick gzip -9 the file is 85% smaller. Who files the bug requesting that feature? ;)

But to summarize, the compression already optimizes away the redundant data, especially the XML tags and attributes. In terms of download size, the XML markup comes at pretty much no cost. Of course, when it is parsed, it first has to be uncompressed (or streamed).
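Anyone wanting to repeat this kind of comparison without the Factory files could start from a sketch like the one below. It builds a synthetic XML file list (invented structure and names, much more regular than real metadata) and compresses it with Python's standard gzip, bz2 and lzma modules. Note that lzma/xz is only a loose stand-in for lrzip here, which is a different tool and not part of the standard library:

```python
import bz2
import gzip
import lzma


def make_filelist_xml(n_packages=200, n_files=30):
    """Build a synthetic, filelists-like XML document as bytes."""
    parts = ['<filelists packages="%d">' % n_packages]
    for p in range(n_packages):
        parts.append('<package name="pkg%04d" arch="i586">' % p)
        for f in range(n_files):
            parts.append(
                "<file>/usr/share/doc/packages/pkg%04d/file%03d.txt</file>" % (p, f)
            )
        parts.append("</package>")
    parts.append("</filelists>")
    return "\n".join(parts).encode()


raw = make_filelist_xml()
results = {
    "gzip -9": gzip.compress(raw, compresslevel=9),
    "bzip2 -9": bz2.compress(raw, compresslevel=9),
    "xz (stand-in for lrzip)": lzma.compress(raw),
}

print(f"{'compression':<26} {'bytes':>9} {'ratio':>7}")
print(f"{'uncompressed':<26} {len(raw):>9} {1.0:>7.2f}")
for name, blob in results.items():
    # ratio = uncompressed size divided by compressed size
    print(f"{name:<26} {len(blob):>9} {len(raw) / len(blob):>7.2f}")
```

The absolute ratios on such regular synthetic input will be far higher than on real metadata, so only the relative ordering of the compressors is of any interest, and even that should be checked against real files.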
The parsing performance highly depends on the method (SAX, DOM, StAX, ...). An interesting approach is to store the parsed data in a relational database, which is what ZMD is doing (sqlite DB). A database can use indexes to speed up queries a lot, compared to having to load all the data into memory and process the query yourself.

But don't ask me why ZMD is a lot slower than zypp or smart... maybe the cost of inserting all that data into the sqlite database? Or the synchronization with zypp/yast?

cheers
- -- 
  -o) Pascal Bleser     http://linux01.gwdg.de/~pbleser/
  /\\ <pascal.bleser@skynet.be>       <guru@unixtech.be>
 _\_v The more things change, the more they stay insane.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iD8DBQFFb92/r3NMWliFcXcRAlAIAJ9Zd5EVozn2gLWZbcxC+yRflvEnXQCfSxbn
fS1AZmOlw0VeDTUfSzD1FGI=
=LNhD
-----END PGP SIGNATURE-----