[opensuse-factory] A look on RPM compression in openSUSE
Compression performance plots are often done with something like the Silesia corpus. Linux distributions have rather different proportions of file types, I think. They have a lot of machine code, and even more data files, and probably not so much text and images. Since our data set is also 2.8 orders of magnitude bigger, rerunning a compression shootout will give more detail. So I did just that.

http://paste.opensuse.org/15790105
http://inai.de/files/openSUSE-compression.ods
(My measurements included just *.x86_64.rpm + *.noarch.rpm.)

The takeaway from that is:

* xz outperforms zstd in the regions that xz caters to. But overall, xz forms the far end of the "law of diminishing returns".
* Moving openSUSE from xz-5 to xz-2 saves 50% of time for an investment of just 3.2 GB of space. Or, moving to zstd-7, saving 85% for ~6.1 GB.

Other observations:

* There are steps in compressor behavior, and that penalizes a lot of levels, leaving only a few sensible ones: zstd-2-3-7-12, xz-1-2-3-4-5-9 (disregarding memory use, which is another factor).
* Some of our packages are too fat. kicad-packages takes longer to compress than the entire remaining distro at zstd-19 with 16x-parallelism. In other words, a sufficiently parallelized system may have to wait just for that one to complete.

(* "Trend lines" in LibreOffice Calc are quite useless sometimes, as it does not appear to calculate a constant offset portion for exp/pow fittings.)
On Mon, Oct 8, 2018 at 8:52 AM Jan Engelhardt <jengelh@inai.de> wrote:
Compression performance plots are often done with something like the Silesia corpus. Linux distributions have rather different proportions of file types, I think. They have a lot of machine code, and even more data files, and probably not so much text and images. Since our data set is also 2.8 orders of magnitude bigger, rerunning a compression shootout will give more detail. So I did just that.
http://paste.opensuse.org/15790105 http://inai.de/files/openSUSE-compression.ods (My measurements included just *.x86_64.rpm + *.noarch.rpm.)
The takeaway from that is:
* xz outperforms zstd in the regions that xz caters to. But overall, xz forms the far end of the "law of diminishing returns".
* Moving openSUSE from xz-5 to xz-2 saves 50% of time for an investment of just 3.2 GB of space. Or, moving to zstd-7, saving 85% for ~6.1 GB.
This jibes with my own analysis in Fedora. For what it's worth, we use xz-2 for Fedora packages for this reason.

At this point, I don't think zstd compression for RPM payloads makes sense yet, though it's something I'm constantly looking at.

--
真実はいつも一つ!/ Always, there's only one truth!
On 08.10.18 14:54, Neal Gompa wrote:
This jibes with my own analysis in Fedora. For what it's worth, we use xz-2 for Fedora packages for this reason.
At this point, I don't think zstd compression for RPM payloads makes sense yet, though it's something I'm constantly looking at.
This is just looking at compression - what about decompression? Any wins expected there?

Greetings, Stephan
On Mon, 8 Oct 2018 14:51:51 +0200 (CEST), Jan Engelhardt <jengelh@inai.de> wrote:
Compression performance plots are often done with something like the Silesia corpus. Linux distributions have rather different proportions of file types, I think. They have a lot of machine code, and even more data files, and probably not so much text and images. Since our data set is also 2.8 orders of magnitude bigger, rerunning a compression shootout will give more detail. So I did just that.
http://paste.opensuse.org/15790105 http://inai.de/files/openSUSE-compression.ods (My measurements included just *.x86_64.rpm + *.noarch.rpm.)
The takeaway from that is:
* xz outperforms zstd in the regions that xz caters to. But overall, xz forms the far end of the "law of diminishing returns".
* Moving openSUSE from xz-5 to xz-2 saves 50% of time for an investment of just 3.2 GB of space. Or, moving to zstd-7, saving 85% for ~6.1 GB.
Other observations:
* There are steps in compressor behavior, and that penalizes a lot of levels, leaving only a few sensible ones: zstd-2-3-7-12, xz-1-2-3-4-5-9 (disregarding memory use, which is another factor).
* Some of our packages are too fat. kicad-packages takes longer to compress than the entire remaining distro at zstd-19 with 16x-parallelism. In other words, a sufficiently parallelized system may have to wait just for that one to complete.
(* "Trend lines" in LibreOffice Calc are quite useless sometimes, as it does not appear to calculate a constant offset portion for exp/pow fittings.)
I value this kind of work!

Here's a list of my findings in compression performance on database dumps. The first set is just looking at the size, the second takes time compared to gain into account. The first line of each set is the uncompressed dump, for comparison.

Overall, we've chosen «lbzip2 -9» as the best option.

The tests were executed on an old 8 CPU openSUSE 13.2 Linux 3.16.7 HP Z440 Xeon(R) CPU E5-1620 v3 @ 3.50GHz/1256(8) x86_64 15972 Mb

Sorted by size

Command       Time            Size  rel_sz  compr  effccy  Filename
------------  --------  ----------  ------  -----  ------  -----------
# pg_dumpall  00:01:27  5635943963  100.0%   0.0%          bu.psql
# compress    00:01:21  2672452529   47.4%  52.6%   27.31  bu.psql.Z
# lz4 -9      00:03:03  1199599818   21.3%  78.7%   27.09  bu.psql.lz4
# lzop -9     00:16:02  1126038770   20.0%  80.0%    5.32  bu.psql.lzo
# lha c -o7   00:08:42   842297791   14.9%  85.1%   11.09  bu.lzh
# zip -9      00:07:32   835282929   14.8%  85.2%   12.84  bu.zip
# gzip -9     00:07:26   835282675   14.8%  85.2%   13.01  bu.psql.gz
# pigz -9     00:01:19   833434110   14.8%  85.2%   73.53  bu.psql.gz
# pbzip2 -9   00:01:52   766662771   13.6%  86.4%   53.32  bu.psql.bz2
# bzip2 -9    00:08:17   766019009   13.6%  86.4%   12.02  bu.psql.bz2
# lbzip2 -9   00:00:56   765752138   13.6%  86.4%  106.67  bu.psql.bz2
# rar a -m5   00:04:05   732839985   13.0%  87.0%   24.71  bu.rar
# zstd -19    00:32:24   639443277   11.3%  88.7%    3.23  bu.psql.zst
# lrzip -U    00:10:40   597932768   10.6%  89.4%    9.99  bu.psql.lrz
# 7z a -r     00:22:04   588093913   10.4%  89.6%    4.85  bu.7z
# plzip -9    00:17:32   496047492    8.8%  91.2%    6.32  bu.psql.lz
# lzip -9     01:23:33   476972612    8.5%  91.5%    1.34  bu.psql.lz
# clzip -9    01:22:44   476972612    8.5%  91.5%    1.35  bu.psql.lz
# xz -9       00:58:40   450632908    8.0%  92.0%    1.92  bu.psql.xz

Sorted by efficiency

Command       Time            Size  rel_sz  compr  effccy  Filename
------------  --------  ----------  ------  -----  ------  -----------
# pg_dumpall  00:01:27  5635943963  100.0%   0.0%          bu.psql
# lbzip2 -9   00:00:56   765752138   13.6%  86.4%  106.67  bu.psql.bz2
# pigz -9     00:01:19   833434110   14.8%  85.2%   73.53  bu.psql.gz
# pbzip2 -9   00:01:52   766662771   13.6%  86.4%   53.32  bu.psql.bz2
# compress    00:01:21  2672452529   47.4%  52.6%   27.31  bu.psql.Z
# lz4 -9      00:03:03  1199599818   21.3%  78.7%   27.09  bu.psql.lz4
# rar a -m5   00:04:05   732839985   13.0%  87.0%   24.71  bu.rar
# gzip -9     00:07:26   835282675   14.8%  85.2%   13.01  bu.psql.gz
# zip -9      00:07:32   835282929   14.8%  85.2%   12.84  bu.zip
# bzip2 -9    00:08:17   766019009   13.6%  86.4%   12.02  bu.psql.bz2
# lha c -o7   00:08:42   842297791   14.9%  85.1%   11.09  bu.lzh
# lrzip -U    00:10:40   597932768   10.6%  89.4%    9.99  bu.psql.lrz
# plzip -9    00:17:32   496047492    8.8%  91.2%    6.32  bu.psql.lz
# lzop -9     00:16:02  1126038770   20.0%  80.0%    5.32  bu.psql.lzo
# 7z a -r     00:22:04   588093913   10.4%  89.6%    4.85  bu.7z
# zstd -19    00:32:24   639443277   11.3%  88.7%    3.23  bu.psql.zst
# xz -9       00:58:40   450632908    8.0%  92.0%    1.92  bu.psql.xz
# clzip -9    01:22:44   476972612    8.5%  91.5%    1.35  bu.psql.lz
# lzip -9     01:23:33   476972612    8.5%  91.5%    1.34  bu.psql.lz

--
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.29  porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/  http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/
* Some of our packages are too fat. kicad-packages takes longer to compress than the entire remaining distro at zstd-19 with 16x-parallelism.
This data will therefore dominate the stats you created. Other packages may behave differently.

If the goal is to minimise network traffic, the sizes should be weighted by popularity (number of downloads).

One could also question whether picking a single compression method for all packages is the way forward. A simple suggestion would be to select the compression method for each package independently. Even better would be to allow an arbitrary number of subsections in a package, each of which can pick its own decompressor (having more subsections than decompressors can make sense, as one may want to use different compression settings). Each subsection can contain multiple files, scripts and/or metadata. For an example of a file format that can mix different compression methods, see PDF.
In other words, a sufficiently parallelized system may have to wait just for that one to complete.
Simple countermeasures are to

* split large packages into smaller ones (current kicad-packages would then become a meta-package/pattern that pulls in the other packages)
* order compression tasks so that largest packages come first
* not parallelise too much: use stats from previous run to estimate optimal number of workers
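The second and third countermeasures can be illustrated with a small simulation. This is only a sketch: the task durations are made-up numbers, `makespan` is a hypothetical helper, and the model assumes each compression job simply goes to whichever worker becomes free first.

```python
import heapq
import math


def makespan(durations, workers):
    """Greedy list scheduling: each task, in the given order, is handed to
    the worker that becomes free first. Returns the total wall-clock time."""
    finish = [0.0] * workers
    heapq.heapify(finish)
    for d in durations:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + d)
    return max(finish)


# Hypothetical per-package compression times in seconds from a previous run;
# one package dwarfs all the others.
times = [5, 7, 3, 8, 6, 4, 9, 2, 60]

arrival_order = makespan(times, workers=4)
largest_first = makespan(sorted(times, reverse=True), workers=4)

# No schedule can finish before the longest single task, so more workers than
# ceil(total / longest) cannot shorten the run any further.
useful_workers = math.ceil(sum(times) / max(times))
capped = makespan(sorted(times, reverse=True), workers=useful_workers)

print(arrival_order, largest_first, useful_workers, capped)
```

In this toy data set, starting the big package last makes everything wait for it, while starting it first hides all other work behind it, and two workers already reach the same finish time as four.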
On 08/10/2018 16.10, Joachim Wagner wrote:
One could also question whether picking a single compression method for all packages is the way forward. A simple suggestion would be to select the compression method for each package independently. Even better would be to allow an arbitrary number of subsections in a package, each of which can pick its own decompressor (having more subsections than decompressors can make sense, as one may want to use different compression settings). Each subsection can contain multiple files, scripts and/or metadata. For an example of a file format that can mix different compression methods, see PDF.
Yes, this is a possibility, but the correct procedure would be the compressing program choosing the correct algorithm automatically (with configured criteria like speed/size), and not us in advance.

Does that type of software exist?

--
Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
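As a rough illustration of what such automatic selection could look like, here is a brute-force sketch that tries several compressors on a payload and keeps the smallest result that stays within a time budget. It is not an existing tool or an rpm feature. `CANDIDATES` and `choose_compressor` are hypothetical names, and only compressors from the Python standard library are used, so zstd is not among the candidates.

```python
import bz2
import lzma
import time
import zlib

# Candidate compressors (hypothetical selection, standard library only).
CANDIDATES = {
    "gzip-9": lambda data: zlib.compress(data, 9),
    "bzip2-9": lambda data: bz2.compress(data, 9),
    "xz-2": lambda data: lzma.compress(data, preset=2),
    "xz-6": lambda data: lzma.compress(data, preset=6),
}


def choose_compressor(payload, max_seconds=None):
    """Compress `payload` with every candidate and return
    (name, compressed_size, seconds) for the smallest result among the
    candidates that finished within the optional time budget."""
    results = []
    for name, compress in CANDIDATES.items():
        start = time.perf_counter()
        size = len(compress(payload))
        elapsed = time.perf_counter() - start
        if max_seconds is None or elapsed <= max_seconds:
            results.append((name, size, elapsed))
    if not results:
        raise ValueError("no candidate met the time budget")
    return min(results, key=lambda r: r[1])


# Made-up payload standing in for one package's contents.
payload = b"ISO-10303-21;\nHEADER;\nENDSEC;\n" * 2000
name, size, seconds = choose_compressor(payload)
print(name, size, round(seconds, 4))
```

The obvious drawback is that every payload is compressed once per candidate, so the build pays the cost of the slowest compressor anyway. A practical variant would probably have to decide from a sample of the payload or from statistics of a previous build.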
On 08/10/2018 14:51, Jan Engelhardt wrote:
* Some of our packages are too fat. kicad-packages takes longer to compress than the entire remaining distro at zstd-19 with 16x-parallelism. In other words, a sufficiently parallelized system may have to wait just for that one to complete.
Interesting: apart from the main kicad package, which is a binary/Python mix, the other packages are all noarch text, of which kicad-packages3D is enormous in size. My assumption from this would be the need for a different compressor for noarch packages.

Regards
Dave P
On Tuesday 2018-10-09 12:52, Dave Plater wrote:
Interesting, apart from the main kicad package which is a binary/python mix the other packages are all noarch text of which kicad-packages3D is enormous in size. My assumption from this would be the need for a different compressor for noarch packages.
noarch does not mean anything with regard to compression, just that the package does not have (better: is believed not to have) any arch-specific content.
On Tuesday, 9 October 2018 13:52:26 CEST Jan Engelhardt wrote:
noarch does not mean anything with regard to compression, just that it does not (better: is believed to not have) any arch-specific content.
In my test, xz -6 outperforms zstd -19 in both compression ratio and speed.

But as compression is all about statistics, I have found a neat, cheap trick to improve the compression efficiency for kicad-packages3D considerably:

The packages (3D models) come in pairs of files, one VRML and one STEP (ISO 10303, part 214) file. Both are ASCII formats, but with different keywords, syntax and the like.

Just reordering the cpio contents to have all VRML files first and the STEP files after them reduces the file size from 291 MByte to 277 MByte (xz -6). Compression time is 40 minutes in both cases (i7-4500U, single thread). For comparison, zstd -4 needs 40 minutes too, but ends up with ~650 MByte.

I think this may work for a number of other cases:
- doc packages, mixing HTML and PNG, JPEG, ...
- python packages, mixing sources and .pyc/.pyo

I think this could even be incorporated in RPM itself, feed it the file list, order by filename suffix.

Kind regards,
Stefan
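The reordering idea can be tried outside of rpm with a few lines of Python. This is a sketch under several assumptions: an uncompressed tar stream stands in for the cpio payload, the file names and contents are tiny made-up placeholders rather than real models, and `by_suffix` and `xz_archive_size` are hypothetical helpers.

```python
import io
import lzma
import os
import tarfile


def by_suffix(names):
    """Order file names by extension first, then by name, so that files of
    the same type end up next to each other in the archive."""
    return sorted(names, key=lambda n: (os.path.splitext(n)[1].lower(), n))


def xz_archive_size(files, order):
    """Pack `files` (name -> bytes) into an uncompressed tar stream in the
    given order and return the size of the xz-compressed (preset 6) result."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return len(lzma.compress(buf.getvalue(), preset=6))


# Made-up stand-ins for paired 3D models (one VRML and one STEP file each).
files = {}
for i in range(3):
    files[f"part{i}.wrl"] = b"#VRML V2.0 utf8\nShape { }\n" * 50
    files[f"part{i}.step"] = b"ISO-10303-21;\nHEADER;\nENDSEC;\n" * 50

paired = sorted(files)      # part0.step, part0.wrl, part1.step, ...
grouped = by_suffix(files)  # all .step files, then all .wrl files

print(xz_archive_size(files, paired), xz_archive_size(files, grouped))
```

Sorting by suffix puts the STEP files before the VRML files here, the reverse of the order described above, but what matters is the grouping and not which type comes first. On placeholder data this small the two sizes will barely differ; any gain depends on the real file contents and on how far apart similar files sit relative to the compressor's window.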
On 11/10/2018 01.20, Stefan Brüns wrote:
I think this could even be incorporated in RPM itself, feed it the file list, order by filename suffix.
Nice trick. Reminds me of https://7-zip.org/faq.html
Why 7z archives created by new version of 7-Zip can be larger than archives created by old version of 7-Zip?
Old version of 7-Zip (before version 15.06) used file sorting "by type" ("by extension").
I also did that once in a university study, where re-ordering the contents of a binary struct to have values of the same kind next to each other gave large gains in compression ratio. So instead of

a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4

it became

a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4
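The struct reordering can be sketched as follows. The record layout (an unsigned 32-bit integer, a double and a 16-bit integer) and the sample values are invented for illustration, and zlib merely stands in for whatever compressor was used in that study.

```python
import struct
import zlib


def interleaved(records):
    """a1 b1 c1 a2 b2 c2 ...: every record is stored as one packed struct."""
    return b"".join(struct.pack("<Idh", a, b, c) for a, b, c in records)


def columnar(records):
    """a1 a2 ... b1 b2 ... c1 c2 ...: values of the same kind stored together."""
    n = len(records)
    a, b, c = zip(*records)
    return (struct.pack(f"<{n}I", *a)
            + struct.pack(f"<{n}d", *b)
            + struct.pack(f"<{n}h", *c))


# Made-up records: a counter, a slowly changing measurement, a small code.
records = [(i, i * 0.5, i % 7) for i in range(1000)]

row_bytes = interleaved(records)
col_bytes = columnar(records)

# Same data and same total length, only the byte order differs.
print(len(row_bytes), len(col_bytes))
print(len(zlib.compress(row_bytes, 9)), len(zlib.compress(col_bytes, 9)))
```

Both layouts carry exactly the same information, so a reader that knows the record count can convert between them losslessly. How much the columnar layout helps depends on how regular each field is on its own.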
One thing comes to mind from past incidents during SUSE upgrades. Don't forget about to-be-upgraded systems when changing essential stuff such as compression compatibility and so on. I think we used to have an openSUSE version some years ago that had a show stopper causing incompatibility in the rpm packages while zypper dup-ing or something similar, which made it necessary to install an updated rpm and/or some related libs first before being able to perform a zypper dup.

Thanks.
On Saturday 2018-10-13 19:38, cagsm wrote:
One thing comes to mind from past incidents during SUSE upgrades. Don't forget about to-be-upgraded systems when changing essential stuff such as compression compatibility and so on. I think we used to have an openSUSE version some years ago that had a show stopper causing incompatibility in the rpm packages while zypper dup-ing or something similar, which made it necessary to install an updated rpm and/or some related libs first before being able to perform a zypper dup.
Well, once you start thinking, you will recognize that this situation - requiring an updated rpm - is inevitable: A sufficiently old rpm program just is not going to be able to read a sufficiently new rpm package.

That is to say, even if rpm has had feature X for time Y, some dude will try to upgrade from a version older than Y.
On 13/10/2018 19.38, cagsm wrote:
One thing comes to mind from past incidents during SUSE upgrades. Don't forget about to-be-upgraded systems when changing essential stuff such as compression compatibility and so on. I think we used to have an openSUSE version some years ago that had a show stopper causing incompatibility in the rpm packages while zypper dup-ing or something similar, which made it necessary to install an updated rpm and/or some related libs first before being able to perform a zypper dup.
Yes, I remember that one.

The real issue would be a new version not being able to cope with the old format. Otherwise, it is normal that a new feature is not supported by an old version.

--
Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
participants (10)
- Bernhard M. Wiedemann
- cagsm
- Carlos E. R.
- Dave Plater
- H.Merijn Brand
- Jan Engelhardt
- Joachim Wagner
- Neal Gompa
- Stefan Brüns
- Stephan Kulow