On 3/31/24 19:03, yassadmi via openSUSE Factory wrote:
> The YouTuber Brodie Robertson made this statement: "the malicious code ends up on the distros because they build from the release tarballs instead of the git repo"
> I guess there are reasons to do so, but I thought his point had to be shared here.
[TLDR: using only release tarballs as a malware distribution vector was laziness; release tarballs are a useful, widely used, practical means of source code distribution; not using them won't help against adversaries hiding malware in a more refined way directly in repositories]

I'm not a famous YouTuber, but I'd still like to give some insights with my release manager hat on.

It's true that in this instance, the code activating the malicious parts has only been distributed in release tarballs. Other methods of fetching the source code do ship the malicious code itself, but since it's just part of test data, it's never actually embedded into the binaries and is pretty much discarded after a successful build. I've now read numerous times that the practice of using release tarballs is to blame for all the mess that was caused, or, rather, that not using release tarballs would always lead to less risk, but that's simply not true.

First of all, let me explain why release tarballs actually are a good thing and by no means evil in themselves. Historically, before (distributed) version control systems existed, packaging was the only meaningful way of distributing source code. This packaging need not be tarballs - even plain files on a diskette can be regarded as a package. However, tarballing (or cpioing, or whatever nom-du-jour archiver is being used) the source makes it easy to compress the content - and compression is very effective for plain text data, which is primarily what source code is.

Even with distributed version control systems now being the norm, release tarballs are still a very *practical* means of distributing source code. For small repositories and fast network connections, downloading the whole repository is not a practical issue, but for huge repositories (think of Firefox, Chromium, wine, Qt) it very much still is, since the repository size is orders of magnitude larger than a tarball for a specific release version. As a user, I'd always pick a release tarball of Chromium instead of having to download tens if not hundreds of gigabytes of repository data.

Now, most version control systems support a feature typically called "shallow clone", with which only the data relevant to a specific revision is requested and downloaded. Since data is typically compressed before being transmitted, one could argue that fetching data for a tag this way is practically equivalent to fetching release tarballs. However, this fails to take the distributor's side into consideration. Release tarballs are generated once and never touched again, but transferred many, many, many times after their creation. A shallow clone creates a non-negligible system load on the server side for every such request. Technically, it would be possible to implement some form of caching to circumvent this computational effort, but so far, no VCS has done so. Also, ironically, the whole idea sounds a lot like... creating release tarballs which are then distributed...

Then, there's the crucial question of how to verify the integrity of an archive - whether a release tarball or a source code checkout. There are two integrity issues: the first is whether the data itself is undamaged and really what the creator intended to distribute, and the second is whether the contained data corresponds to what has been made publicly available in the repository. The first issue is easy to check using a combination of checksums and digital signatures; the second is much more complicated, though technically also realizable.
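To make that first (easy) check concrete, here is a minimal sketch of what verifying a downloaded tarball could look like. All file names and the checksum value are placeholders for illustration, and it assumes the maintainer's public key has already been fetched into the local keyring:

    # Minimal sketch: verify a downloaded release tarball against a published
    # SHA-256 checksum and a detached GPG signature. All names/values below
    # are placeholders, not taken from any real project.
    import hashlib
    import subprocess
    import sys

    TARBALL = "project-1.2.3.tar.gz"          # hypothetical release tarball
    SIGNATURE = "project-1.2.3.tar.gz.sig"    # detached signature from upstream
    EXPECTED_SHA256 = "0123abcd..."           # checksum published by upstream

    # 1. Checksum: catches damaged or truncated downloads.
    digest = hashlib.sha256()
    with open(TARBALL, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != EXPECTED_SHA256:
        sys.exit("checksum mismatch - not using this tarball")

    # 2. Signature: ties the tarball to the maintainer's key (assuming that
    #    key has been fetched and trust-checked beforehand). check=True makes
    #    this raise if gpg reports a bad or missing signature.
    subprocess.run(["gpg", "--verify", SIGNATURE, TARBALL], check=True)
    print("checksum and signature look good")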
However, practically nobody performs that second check, because in virtually all cases, it's a waste of time (and, also, will often fail! More on that soon) and system resources, either on the developer's or the distributor's machine(s).

In the case of xz, the build system modifications were sneaked into the release tarball by an adversary, but that was merely done for practical reasons. Obfuscating this code (and splitting it up/sneaking parts in via harmless commits as "garbage changes" or the like) and committing it to the source code repository would have been possible, but would also have required a lot more sophistication than dropping a single file into the release tarballs. Unfortunately, the dark side™ will also learn from this incident and will probably be smarter about it next time to avoid detection (and to make the malware more readily available through any means of fetching the source code).

There is another practical reason why release tarballs are so widely used - some (if not most) build systems are configured in a high(ish)-level fashion by developers, but need another parsing/generation step to actually produce executable scripts and other data for builds to be doable. This data changes very often (if not inevitably always) with each generator run and is typically not committed to the source code repository, because it's just unnecessary noise. As a courtesy to users, maintainers run this intermediary step on their machines while generating a release tarball, so that users don't have to install a compatible build system version (which can be an actual problem if the user's system is either much newer or much older than what the developers were using at the time the release was made) and bootstrap the build system.

Xz uses a build system that needs this intermediary step - GNU autotools. It also offers CMake as an alternative, which likewise requires an intermediary step, but that one can't be reasonably cached, since it's an amalgamation of autotools' auto(re)conf and ./configure runs. Crucially, this means that release tarballs will practically ALWAYS differ from what a source code repository contains, at least for software relying on GNU autotools. Common files won't differ, but there will be new, autogenerated files. That new content is typically very verbose and big, and nobody actually checks it manually. While one could advocate not using GNU autotools for reasons like these, in the end, users and packagers will have to make do with what the developers provide in their project, short of reimplementing the whole build system for any project that is not using their "favorite" build system (and even then, that would only shift one possible failure point to end users/packagers).

Then, there's the general issue of developers and maintainers not signing commits and - worst of all - release tags. While the revision CAN be some form of checksum for the eventual data, there is no way to make sure that a specific revision was actually meant to be released other than using a digital signature. Until digital signatures are mandatory for release tags, fetching source code from repositories just shifts the failure/trust point to the party hosting the source code. (A small sketch of what checking such a tag signature could look like follows further below.)

Lastly, release tarballs are the only practical means of distributing source code to satisfy strong copyleft license terms. I want to point out that it's by no means the *only* way to comply with the terms, but it's a very convenient and practical one. AFAIK, no free software distribution uses other ways.
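Picking up on the tag signing point from above: here is a minimal sketch of what refusing to build from an unsigned or badly signed release tag could look like. The repository path and tag name are placeholders, and it again assumes the signer's public key is already in the local keyring:

    # Minimal sketch: refuse to build from a repository checkout unless the
    # release tag carries a valid signature. Path and tag are placeholders.
    import subprocess
    import sys

    REPO = "/path/to/checkout"    # hypothetical local clone
    TAG = "v1.2.3"                # hypothetical release tag

    # "git verify-tag" exits non-zero if the tag is unsigned or the signature
    # does not verify against the keys in the local GnuPG keyring.
    result = subprocess.run(["git", "-C", REPO, "verify-tag", TAG])
    if result.returncode != 0:
        sys.exit(f"tag {TAG} is not (validly) signed - not trusting this release")
    print(f"tag {TAG} carries a valid signature")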
Back to copyleft compliance: the upstream release tarballs might get repacked from time to time, but generally, no distribution "just" provides source code repositories for users to fetch source code from, also because that's not easily archivable, doesn't scale well, and because it's just too easy to miss creating a tag or importing a specific source version into the distribution-specific repository and inadvertently breach the GPL. Note that SRPMs are essentially just glorified/tuned release tarballs as well.

I hope I was able to shed a different light on what has been needlessly and often demonized over the past few days. Also, if you have read to this point and weren't able to obtain any new information: congratulations, reading the TLDR at the top would have been good enough.

Mihai