On 3/31/24 19:03, yassadmi via openSUSE Factory wrote:
> The YouTuber Brodie Robertson made this statement: "the malicious code ends up on the distros because they build from the release tarballs instead of the git repo"
> I guess there are reasons to do so, but I thought his point had to be shared here.
[TLDR: using only release tarballs as a malware distribution vector was laziness; release tarballs are a useful, widely used, practical means of source code distribution; not using them won't help against adversaries hiding malware in a more refined way directly in repositories]

I'm not a famous YouTuber, but I'd still like to give some insights with my release manager hat on.

It's true that in this instance, the code activating the malicious parts has only been distributed in release tarballs. Other methods of fetching the source code do ship the malicious code itself, but since it's just part of test data, it's never actually embedded into the binaries and is pretty much discarded after a successful build. I've now read numerous times that the practice of using release tarballs is to blame for all the mess that was caused, or, rather, that not using release tarballs would always lead to less risk, but that's simply not true.

First of all, let me explain why release tarballs actually are a good thing and by no means evil in themselves. Historically, before (distributed) version control systems existed, packaging was the only meaningful way of distributing source code. This packaging need not be tarballs - even plain files on a diskette can be regarded as a package. However, tarballing (or cpioing, or whatever nom-du-jour archiver is being used) the source makes it easy to compress the content - and compression is very effective for plain text data, which is primarily what source code is.

Even with distributed version control systems now being the norm, release tarballs are still a very *practical* means of distributing source code. For small repositories and fast network connections, downloading the whole repository is not a practical issue, but for huge repositories (think of Firefox, Chromium, wine, Qt) it very much still is, since the repository size is orders of magnitude larger than a tarball for a specific release version. As a user, I'd always pick a release tarball of Chromium instead of having to download tens if not hundreds of gigabytes of repository data.

Now, most version control systems support a feature typically called "shallow clone", with which only the data relevant to a specific revision is requested and downloaded. Since data is typically compressed before being transmitted, one could argue that fetching data for a tag this way is practically equivalent to fetching release tarballs. However, this fails to take the distributor's side into consideration. Release tarballs are generated once and never touched again, but transferred many, many, many times after their creation. A shallow clone creates a non-negligible system load on the server side for every such request. Technically, it would be possible to implement some form of caching to circumvent this computational effort, but so far, no VCS has done so. Also, ironically, the whole idea sounds a lot like... creating release tarballs which are then distributed...

Then, there's the crucial question of how to verify the integrity of an archive - whether a release tarball or a source code checkout. There are two integrity issues: the first is whether the data itself is undamaged and really what the creator intended to distribute, and the second is whether the contained data corresponds to what has been made publicly available in the repository. The first issue is easy to check using a combination of checksums and digital signatures; the second is much more complicated, though technically also realizable.
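To make that first (easy) check concrete, here is a minimal sketch of what verifying a downloaded tarball could look like. All file names and the checksum value are placeholders for illustration, and it assumes the maintainer's public key has already been fetched into the local keyring:

    # Minimal sketch: verify a downloaded release tarball against a published
    # SHA-256 checksum and a detached GPG signature. All names/values below
    # are placeholders, not taken from any real project.
    import hashlib
    import subprocess
    import sys

    TARBALL = "project-1.2.3.tar.gz"          # hypothetical release tarball
    SIGNATURE = "project-1.2.3.tar.gz.sig"    # detached signature from upstream
    EXPECTED_SHA256 = "0123abcd..."           # checksum published by upstream

    # 1. Checksum: catches damaged or truncated downloads.
    digest = hashlib.sha256()
    with open(TARBALL, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != EXPECTED_SHA256:
        sys.exit("checksum mismatch - not using this tarball")

    # 2. Signature: ties the tarball to the maintainer's key (assuming that
    #    key has been fetched and trust-checked beforehand). check=True makes
    #    this raise if gpg reports a bad or missing signature.
    subprocess.run(["gpg", "--verify", SIGNATURE, TARBALL], check=True)
    print("checksum and signature look good")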
However, practically nobody performs that second check, because in virtually all cases, it's a waste of time (and, also, will often fail! More on that soon) and system resources, either on the developer's or the distributor's machine(s).

In the case of xz, the build system modifications were sneaked into the release tarball by an adversary, but that was merely done for practical reasons. Obfuscating this code (and splitting it up/sneaking parts in via harmless commits as "garbage changes" or the like) and committing it to the source code repository would have been possible, but would also have required a lot more sophistication than dropping a single file into the release tarballs. Unfortunately, the dark side™ will also learn from this incident and will probably be smarter about it next time to avoid detection (and to make the malware more readily available through any means of fetching the source code).

There is another practical reason why release tarballs are so widely used - some (if not most) build systems are configured in a high(ish)-level fashion by developers, but need another parsing/generation step to actually produce executable scripts and other data for builds to be doable. This data changes very often (if not inevitably always) with each generator run and is typically not committed to the source code repository, because it's just unnecessary noise. As a courtesy to users, maintainers run this intermediary step on their machines while generating a release tarball, so that users don't have to install a compatible build system version (which can be an actual problem if the user's system is either much newer or much older than what the developers were using at the time the release was made) and bootstrap the build system.

Xz uses a build system that needs this intermediary step - GNU autotools. It also offers CMake as an alternative, which likewise requires an intermediary step, but that one can't be reasonably cached, since it's an amalgamation of autotools' auto(re)conf and ./configure runs. Crucially, this means that release tarballs will practically ALWAYS differ from what a source code repository contains, at least for software relying on GNU autotools. Common files won't differ, but there will be new, autogenerated files. That new content is typically very verbose and big, and nobody actually checks it manually. While one could advocate not using GNU autotools for reasons like these, in the end, users and packagers will have to make do with what the developers provide in their project, short of reimplementing the whole build system for any project that is not using their "favorite" build system (and even then, that would only shift one possible failure point to end users/packagers).

Then, there's the general issue of developers and maintainers not signing commits and - worst of all - release tags. While the revision CAN be some form of checksum for the eventual data, there is no way to make sure that a specific revision was actually meant to be released other than using a digital signature. Until digital signatures are mandatory for release tags, fetching source code from repositories just shifts the failure/trust point to the party hosting the source code. (A small sketch of what checking such a tag signature could look like follows further below.)

Lastly, release tarballs are the only practical means of distributing source code to satisfy strong copyleft license terms. I want to point out that it's by no means the *only* way to comply with the terms, but it's a very convenient and practical one. AFAIK, no free software distribution uses other ways.
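Picking up on the tag signing point from above: here is a minimal sketch of what refusing to build from an unsigned or badly signed release tag could look like. The repository path and tag name are placeholders, and it again assumes the signer's public key is already in the local keyring:

    # Minimal sketch: refuse to build from a repository checkout unless the
    # release tag carries a valid signature. Path and tag are placeholders.
    import subprocess
    import sys

    REPO = "/path/to/checkout"    # hypothetical local clone
    TAG = "v1.2.3"                # hypothetical release tag

    # "git verify-tag" exits non-zero if the tag is unsigned or the signature
    # does not verify against the keys in the local GnuPG keyring.
    result = subprocess.run(["git", "-C", REPO, "verify-tag", TAG])
    if result.returncode != 0:
        sys.exit(f"tag {TAG} is not (validly) signed - not trusting this release")
    print(f"tag {TAG} carries a valid signature")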
Back to copyleft compliance: the upstream release tarballs might get repacked from time to time, but generally, no distribution "just" provides source code repositories for users to fetch source code from, also because that's not easily archivable, doesn't scale well, and because it's just too easy to miss creating a tag or importing a specific source version into the distribution-specific repository and inadvertently breach the GPL. Note that SRPMs are essentially just glorified/tuned release tarballs as well.

I hope I was able to shed a different light on what has been needlessly and often demonized over the past few days. Also, if you have read to this point and weren't able to obtain any new information: congratulations, reading the TLDR at the top would have been good enough.

Mihai