Re: [opensuse-factory] File list for Yast YOU

1 Dec 2006

Andreas Hanke wrote:
...
Keith Goggin wrote:
...
I've noticed that <filelists.xml.gz> is refreshed by Yast YOU if it has 
changed on the server. It takes me about 4 minutes to download from 
<mirror.pacific.net.au> and is currently 2.9MB in size.
It's really insane. This XML stuff has made SUSE distros basically
unusable without a broadband connection. 3MB before the distro is even
released - crazy!
Let's have a look at some numbers ;)

On my repository for 10.1:
repo type | size (bytes) | MB    | files
----------+--------------+-------+---------------------------
 yast2    |      1831446 |  1.74 | packages, packages.en
 rpm-md   |      1454322 |  1.37 | primary.xml.gz, filelists.xml.gz

yast2 repositories are actually the somewhat larger download here: rpm-md
is about 21% smaller.

When you look at Factory:
repo type | size (bytes) | MB    | files
----------+--------------+-------+---------------------------
 yast2    |     17876184 | 17.04 | packages, packages.en
 rpm-md   |     20263138 | 19.32 | primary.xml.gz, filelists.xml.gz

Here rpm-md is a little larger: yast2 is about 12% smaller.
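
For the record, the percentages are taken relative to the larger of the
two downloads. A quick sketch of the arithmetic in Python (the helper
name is made up for illustration; the byte counts are the ones from the
two tables):

```python
def percent_smaller(smaller: int, larger: int) -> float:
    """How much smaller `smaller` is, as a percentage of `larger`."""
    return (larger - smaller) / larger * 100


# Byte counts copied from the two tables above.
repos = {
    "10.1 repo": {"yast2": 1831446, "rpm-md": 1454322},
    "Factory": {"yast2": 17876184, "rpm-md": 20263138},
}

for name, sizes_by_type in repos.items():
    # Sort the two repo types by size so we know which one is smaller.
    small, large = sorted(sizes_by_type, key=sizes_by_type.get)
    pct = percent_smaller(sizes_by_type[small], sizes_by_type[large])
    print(f"{name}: {small} is {pct:.0f}% smaller than {large}")
```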
...
...
I had a look at this compressed text file and observed that it contains a very 
large proportion of redundant data.
  /a/b/c/d/e/file1
  /a/b/c/d/e/file2
  etc
It's not redundant data. Those are file paths.
Parts of it are redundant, sure (e.g. /usr/share, /usr/lib, ...).
The same is true of the yast2 repository files (packages, packages.en),
which carry a lot more redundant data because they're not even
compressed.
...
...
If this file could be compressed further by omitting the redundant data the 
compressed version would download much faster and may improve the end users 
experience of updating their system.
That's exactly what gzip is already doing on those rpm-md XML files.
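
To get a rough feel for it, here is a small Python sketch using made-up
paths. The <file> wrapping is only loosely modelled on filelists.xml and
is not the real rpm-md schema, and the function names are invented. It
compares how many bytes the XML tags add before and after gzip:

```python
import gzip


def fake_paths(n_pkgs: int = 200, files_per_pkg: int = 20) -> list[str]:
    """Made-up file paths that share long prefixes, like a real file list."""
    dirs = ["/usr/share/doc/packages", "/usr/lib", "/usr/share/locale"]
    return [
        f"{dirs[(p + f) % len(dirs)]}/pkg{p}/file{f * 7 + p}.dat"
        for p in range(n_pkgs)
        for f in range(files_per_pkg)
    ]


def sizes(paths: list[str]) -> dict[str, int]:
    """Byte sizes of a plain and an XML-wrapped listing, raw and gzipped."""
    plain = "\n".join(paths).encode()
    xml = "\n".join(f"<file>{p}</file>" for p in paths).encode()
    return {
        "plain": len(plain),
        "xml": len(xml),
        "plain.gz": len(gzip.compress(plain, compresslevel=9)),
        "xml.gz": len(gzip.compress(xml, compresslevel=9)),
    }


result = sizes(fake_paths())
print(result)
print("tag overhead raw:    ", result["xml"] - result["plain"], "bytes")
print("tag overhead gzipped:", result["xml.gz"] - result["plain.gz"], "bytes")
```

Synthetic data like this compresses far better than a real file list, so
only the direction of the result means anything, not the absolute numbers.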
...
Yes, there are much more efficient ways to store these data than repomd
does. But hey, XML is so much cooler.
XML has its advantages.
It's very easy to parse, though not necessarily with good performance:
that depends on the API (SAX, DOM, StAX, ...). Using DOM on files this
size eats insane amounts of memory.
Parsing yast2 metadata isn't that complex but you definitely have to
write your own error-prone parser instead of just using StAX.
It took quite a few attempts and fixes to write the yast2 parser in
smart properly, as the various "tags" are not documented (and suddenly
new tags appeared, etc...).
...
Maybe one of those talented persons who designed repomd should have a
look at /var/lib/locatedb. It's able to store equivalent information
about 2 Linuxes, 1 Windows system and a lot of user files in just 2.5MB.
Thanks to not using XML.
One might argue that xml.gz _is_ a binary format ;)
gzip should already optimize away the XML tags and attributes.
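
For comparison, locatedb gets much of its compactness from front coding:
each entry stores only how many leading characters it shares with the
previous entry, plus the differing tail. A minimal Python sketch of the
idea follows; it is not the actual locatedb on-disk format, and the
function names are made up:

```python
import os


def front_encode(paths):
    """Encode sorted paths as (shared prefix length, remaining suffix)."""
    prev = ""
    entries = []
    for path in sorted(paths):
        # commonprefix compares character by character, not by path component.
        shared = len(os.path.commonprefix([prev, path]))
        entries.append((shared, path[shared:]))
        prev = path
    return entries


def front_decode(entries):
    """Rebuild the full paths from (shared length, suffix) pairs."""
    prev = ""
    paths = []
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        paths.append(prev)
    return paths


encoded = front_encode(["/a/b/c/d/e/file1", "/a/b/c/d/e/file2"])
print(encoded)  # [(0, '/a/b/c/d/e/file1'), (15, '2')]
```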

The rpm-md metadata (primary.xml.gz plus filelists.xml.gz) covers all
files that are in Factory in 19.32 MB, and the files of RPMs for 1023
different projects in my repository in 1.37 MB.

Note that using bzip2 would be more effective (with filelists.xml.gz
from Factory):
Compression         | File size | diff | ratio
--------------------+-----------+------+-------
uncompressed        | 155920703 | +92% |  0.00
original (gzip)     |  12908161 |   0  | 12.07
gzip -9             |  12871908 |  -1% | 12.11
gzip -9 --rsyncable |  13158574 |  +2% | 11.84
bzip2               |  10848980 | -16% | 14.37
lrzip -M            |   4879263 | -63% | 31.95

(diff is compressed compared to original .gz)
(ratio is compressed compared to uncompressed)

lrzip performs best (to say the least) but takes a long time, especially
compared to gzip, even on my dual-core AMD64. It also needs a lot of RAM
(-M uses all available RAM; this was run on a 2GB box). The ratio is
pretty impressive though (see http://ck.kolivas.org/apps/lrzip/README).
bzip2 is also significantly slower than gzip and uses somewhat more CPU
but still gives a 16% gain.
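
The gzip and bzip2 rows of such a table can be reproduced with nothing
but the Python standard library. The sketch below computes the same
"diff" and "ratio" columns for arbitrary data. The function name is made
up, the baseline is gzip -9 rather than the original .gz, and lrzip is
left out because the stdlib has no binding for it:

```python
import bz2
import gzip


def compare(data: bytes) -> list[tuple[str, int, float, float]]:
    """Return rows of (name, size, diff % vs. gzip -9, ratio vs. uncompressed)."""
    results = {
        "gzip -9": len(gzip.compress(data, compresslevel=9)),
        "bzip2 -9": len(bz2.compress(data, compresslevel=9)),
    }
    base = results["gzip -9"]
    return [
        (name, size, (size - base) / base * 100, len(data) / size)
        for name, size in results.items()
    ]


# Stand-in input; for real numbers read the uncompressed filelists.xml instead.
sample = b"/usr/share/doc/packages/foo/README\n" * 1000
for name, size, diff, ratio in compare(sample):
    print(f"{name:<10} | {size:>8} | {diff:+5.0f}% | {ratio:6.2f}")
```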

Now with the yast2 "packages" file:
Compression         | File size | diff | ratio
--------------------+-----------+------+-------
original (no compr.)|  15547367 |    0 |  0.00
gzip -9             |   2340627 | -85% |  6.64
bzip2 -9            |   1797400 | -89% |  8.64
lrzip -M            |   1393749 | -92% | 11.15

Note: you cannot directly compare the yast2 and rpm-md files (except for
overall size) as they contain somewhat different subsets of data of a
repository.

Obviously there's still a lot to gain from compressing yast2 repository
files -- even with a quick gzip -9 the file is 85% smaller.
Who files the bug requesting that feature? ;)

But to summarize, the compression already optimizes away the redundant
data, especially the XML tags and attributes. As far as download size
goes, the XML comes at pretty much no cost.
Of course, when it is parsed, it first has to be uncompressed (or
streamed). The parsing performance highly depends on the method (SAX,
DOM, StAX, ...).
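
The streamed variant can be sketched in Python with gzip plus
ElementTree's iterparse, which plays roughly the role StAX does in Java.
The element and attribute names below (filelists, package, name, file)
are simplified stand-ins, not the exact rpm-md schema:

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_files(gz_stream):
    """Yield (package name, file path) pairs without building a full tree."""
    with gzip.open(gz_stream, "rb") as xml_stream:
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag == "package":
                name = elem.get("name")
                for entry in elem.iter("file"):
                    yield name, entry.text
                # Free the children of the package we just finished.
                elem.clear()


# Tiny in-memory stand-in for a filelists.xml.gz download.
demo = gzip.compress(
    b"<filelists>"
    b'<package name="foo"><file>/usr/bin/foo</file></package>'
    b'<package name="bar"><file>/usr/bin/bar</file>'
    b"<file>/etc/bar.conf</file></package>"
    b"</filelists>"
)
for package, path in iter_files(io.BytesIO(demo)):
    print(package, path)
```

Decompression and parsing happen in one pass here, so neither the
uncompressed XML nor a complete DOM ever has to sit in memory at once.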

An interesting approach is to store that parsed data into a relational
database, which is what ZMD is doing (sqlite DB). A database can use
indexes to speed up queries a lot, as compared to having to load the
whole data into memory and processing the query yourself.
But don't ask me why ZMD is a lot slower than zypp or smart... maybe
the cost of inserting all that data into the sqlite database?
Or the synchronization with zypp/yast?
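
The database idea looks roughly like this in Python with sqlite3. The
schema, index name and helper functions are invented for illustration
and have nothing to do with what ZMD actually stores:

```python
import sqlite3


def build_db(rows):
    """Load (package, path) pairs into an in-memory sqlite DB with an index."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (package TEXT, path TEXT)")
    # One executemany in a single transaction; committing per row is much slower.
    db.executemany("INSERT INTO files VALUES (?, ?)", rows)
    db.execute("CREATE INDEX files_path ON files (path)")
    db.commit()
    return db


def owner_of(db, path):
    """Return the package owning `path`, or None if no package does."""
    row = db.execute(
        "SELECT package FROM files WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else None


db = build_db([("foo", "/usr/bin/foo"), ("bar", "/usr/bin/bar")])
print(owner_of(db, "/usr/bin/bar"))  # bar
```

With the index in place a lookup by path does not have to scan the whole
table, but building the table and the index is paid for up front on every
metadata refresh.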

cheers
--
  -o) Pascal Bleser     http://linux01.gwdg.de/~pbleser/
  /\\ <pascal.bleser@skynet.be>       <guru@unixtech.be>
 _\_v The more things change, the more they stay insane.