I have a story to tell that is relevant to the xz-utils thing that just happened. I'll probably write this up properly later, but I'm in pre-vacation mode, so it may take a while. We have a problem with the way we develop and then distribute FOSS software, and both stories show that. A while ago I looked at the testcases of a widely used library implementing a widely used data format. There was one file that was... strange. 🧵
That file was named similar to the other testcases, but it was not used in any test. And if you fed that file into anything using that library, it would either crash or cause enormous CPU spikes. And most interestingly: This file was nowhere to be found in the project's git repository. It was *only* in the tarball.
I contacted the responsible project, but I never got an answer and never really got to the bottom of this. But here's what I think happened: this was a proof-of-concept file for a yet unfixed and undisclosed vulnerability. It appears the developer already had a testcase for that bug in his local copy of the source tree, then created the tarball from that source tree, and by doing so leaked a PoC for a zero-day. FWIW, it was "only" a DoS bug. But still.
I wanted to disclose this eventually, but then a new version of that library came out and fixed the bug. And plenty of others, and well, people crash parsers for data formats from hell all the time. And I had some concerns that it would sound like I wanted to ridicule the dev, which wasn't my intention at all. But I already thought there's a deeper story here than someone accidentally leaking a PoC for an unfixed vuln. Why can this even happen?
Pretty much everyone develops code using Git these days, or some other SCM (some don't, there's this mail server, but I digress). But people distribute code in tarballs. How does a Git repo become a tarball? The answer may disturb you. It's basically "every dev has some process, maybe some script, maybe some commands they remember". Nothing is reproducible, nothing is verifiable.
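To make that concrete, here's a minimal sketch (repo name, tag, and file contents are all invented) of the reproducible path that does exist: `git archive` output is a deterministic function of the tagged tree, so independent runs agree byte for byte.

```shell
# Sketch: "git archive" as a reproducible repo-to-tarball step.
# All names (demo, v1.0) are invented for illustration; assumes git
# and GNU coreutils are installed.
set -e
work=$(mktemp -d); cd "$work"
git init -q demo && cd demo
git config user.email dev@example.org
git config user.name dev
echo 'int main(void){return 0;}' > main.c
git add main.c
git commit -qm 'initial release'
git tag v1.0
# Two independent runs produce byte-identical archives, because the
# output depends only on the tagged tree and the prefix:
git archive --prefix=demo-1.0/ v1.0 > first.tar
git archive --prefix=demo-1.0/ v1.0 > second.tar
cmp first.tar second.tar && echo identical
```

Anyone with the repo can regenerate that archive and compare it against a published tarball, which is exactly the property the ad-hoc scripts lack.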
This creates a situation where even when the "many eyes" principle works, i.e. people are actually looking at the code, and at code changes and commits, you still have a path to a compromised package. Because no one checks how this git repo turns into a tarball. Because no one can, as nothing is standardized or reproducible. I can tell no one does for one of the most important libraries to parse one of the most important data formats, because of the story I just told you.
There were some substantial efforts to create "reproducible builds" in some areas. This is closely related, but not exactly the same thing. Even if we have reproducible builds, we don't have "reproducible source distribution". We should have that. Git already has some cryptographic integrity, and as much as it has some flaws (sha1...), it's a lot better than nothing at all. But we don't connect any of that to the actual source tarballs.
I think the same issue is true for most package managers out there. I don't think there's any mechanism that ties e.g. what's on pypi to what is in any git repo. (Anyone knows if any package manager does that?)
Anyway, what we should have is that every software release is tied to a git commit hash. And there should be a verifiable, automated process that checks it. It's more complicated than it sounds, as particularly in "C land" we have autotools, and what's in the source tarball is not just a snapshot of what's in the source repo, but contains all kinds of generated stuff. Either that needs to be reproducible, or we need to just stop doing it. It's solvable, but there are some obstacles. /fin
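A hedged sketch of what such an automated check could look like (everything here is illustrative: the repo, the tag, and the stray `poc.bin` standing in for the leaked test file from the story): regenerate the tree from the tagged commit and diff it against the unpacked release tarball, so anything that exists only in the tarball shows up.

```shell
# Sketch of a tarball-vs-tag verification. All names are invented;
# assumes git and GNU coreutils.
set -e
work=$(mktemp -d); cd "$work"
git init -q repo && cd repo
git config user.email dev@example.org
git config user.name dev
echo 'parser code' > parse.c
git add parse.c && git commit -qm release && git tag v1.0
cd "$work"
# The "maintainer" tars their working copy, which has an extra file
# that never made it into git (mimicking the leaked PoC):
mkdir dist
cp repo/parse.c dist/
echo 'crash-input' > dist/poc.bin
tar -cf release.tar dist
# The "verifier" regenerates the tree from the tag and compares:
mkdir from-git
git -C repo archive v1.0 | tar -x -C from-git
tar -xf release.tar
if ! diff -r from-git dist > report.txt; then
  echo "MISMATCH:"; cat report.txt
fi
```

Generated files (configure scripts etc.) would appear in the diff too, which is why they'd either need to be reproducible or go away entirely, as the post says.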
@hanno How could autotools theoretically solve this, when the whole point (for whatever reason- I don't understand the rationale) is that e.g. configure files and friends don't exist until releases are made?
@cr1901 I mean if the process is reproducible, it can be checked. But then you need some machine readable documentation of that process. And why they even do it this way: I think the rationale was that you can run configure scripts without having autotools installed.

@hanno Right, that rationale actually makes sense from the Unix-centric lens of "why should we bother making a config language when your OS provides one* "sufficient" for the task"?

* Except when it doesn't provide one, like anything not-Unix :P.

I've seen configure scripts from Softlanding Linux System. ./configure scripts weren't horrible in 1992. But it feels like they got too unwieldy too quickly :(.

@cr1901 @hanno with autotools, configure files are required to generate makefiles, which means that they all must exist just to do a build. their existence has no particular correlation to a release, even if configure.ac is used to define a release version.
@cr1901 @hanno autoconf needs to die
@lambdafu @cr1901 and C with it, but well... legacy code is a reality.

@hanno That reminds me of how NixOS generates flakes.

I'm not a huge fan of Nix but that notion does seem to fix this issue at hand as far as I understand it.

@publicvoit @hanno yes, Nix/nixpkgs achieves this to a certain extent. Any package in nixpkgs can be tied to its source, be it a provided tarball (hash will be checked) or a reproducible build from source.
E.g. in the case of xz, the tarball from GitHub was being fetched (https://github.com/NixOS/nixpkgs/blob/nixos-unstable/pkgs/tools/compression/xz/default.nix)

@basbebe @hanno Bastian, do you know what it means that the Github repo was removed by Github (until Nix project finds an alternative/solution)?

Does that mean that all updates/installs are breaking at the moment?

@publicvoit @hanno currently the package (and any older, non-customized versions) will be pulled from cache (cache.nixos.org), so that shouldn’t be an immediate problem.

There is a mirror by the original maintainer that could be used in the future:

- https://github.com/NixOS/nixpkgs/pull/300028
- https://discourse.nixos.org/t/cve-2024-3094-malicious-code-in-xz-5-6-0-and-5-6-1-tarballs/42405/18

But yes, this seems to be a currently unsolved issue

Sorry for advertising #guix here ;-) Software Heritage archives all the sources used by guix. And this is done for exactly the case that the original source disappears.
Thus users of guix will not face such an issue - as long as Software Heritage is alive.
@publicvoit @basbebe @hanno
No, nix does not solve the issue, neither does guix. Packagers can always decide to use the *dist* tarball as source - instead of some git checkout.
@basbebe @publicvoit @hanno
@hanno the whole story also reminds me of the Webmin backdoor we investigated together in 2019.
Same deal: it was only in the release tarball.
We learned nothing from it.

@hanno there is https://slsa.dev/ to solve mostly exactly this, and we've integrated that in PrivateBin: https://github.com/PrivateBin/PrivateBin/blob/master/doc/Release.md

This only works because the build process is super simple, essentially a "git archive" command. And this is what you also get with GitHub's source links, which is great: it can only remove files, not add any, and you have a valid source tarball that is also referenced by a third party (which is what SLSA aims at). #supplyChainSecurity /cc @elrido

@hanno Apart from doing a "git checkout tag" and having a signed manifest of every single controlled file (woe betide you if you have submodules), the other conundrum is that reproducible compressed containers are also needed. That's not easy to do outside tarballs, so Windows folks are usually excluded out of the box. (And even with tarballs, the incantations are really arcane.)
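For reference, the "really arcane incantations" mentioned above, as a hedged sketch assuming GNU tar (BSD tar lacks most of these flags): pin entry order, ownership, and timestamps so that independent runs produce identical bytes even when filesystem metadata differs.

```shell
# Sketch of a reproducible tar invocation. Requires GNU tar >= 1.28
# (for --sort=name); file names are invented.
set -e
work=$(mktemp -d); cd "$work"
mkdir src
echo alpha > src/a.txt
echo beta  > src/b.txt
tar --sort=name --owner=0 --group=0 --numeric-owner \
    --mtime='2024-01-01 00:00:00 UTC' -cf one.tar src
touch src/a.txt   # perturb an mtime between runs
tar --sort=name --owner=0 --group=0 --numeric-owner \
    --mtime='2024-01-01 00:00:00 UTC' -cf two.tar src
cmp one.tar two.tar && echo reproducible
# For .tar.gz, also compress with "gzip -n" so gzip doesn't embed
# its own timestamp in the header.
```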
@hanno Meson does their deployments with a git checkout + archive, but of course does not take care of the signing step.
@hanno I think you are right for many software stacks like C autotools, but if you check more modern languages, for example Go tooling, it's not as bad. Those ecosystems are in much better shape than the ones mentioned.
Also, all the SBOM tools should help make this better.

@hanno when it comes to autotools, the correct answer is always "stop doing that".

a build system that regularly fails at both backwards & forward compatibility? every developer must have specific version(s) of the build system installed (effectively system-wide) on their machines (and sometimes, different versions for different projects).

It's not super-popular, but one of waf's main selling points is that it is a part of your codebase, managed with the same tools as the source itself.

Behdad Esfahbod (@behdadesfahbod) on X: "@drjtwit If a bad actor gains write access to a repo and changes the release artifacts (eg. tarballs), there's no way to know... I want a page showing who and when and what was uploaded for a release."
@hanno This isn't airtight by any means, but one way to begin addressing this concern is to automate release processes to run on public CI. I didn't do this for a long time because I didn't see how to automate the automatable parts without losing human oversight of the parts that needed it — eventually decided I needed to write my own tool to make that possible. Once I figured out the formalism that I wanted (https://pkgw.github.io/cranko/book/latest/jit-versioning/), I've never looked back.

@hanno the new JSR by deno seems to have an approach: https://deno.com/blog/jsr_open_beta#publishing-from-github

@hanno I believe Go does that.

You publish your code as a repo. But there is a specified, reproducible way to build a zip-file from a repo. And even if you download the module via git clone, the Go tool stores it on disk in the canonical form and builds from that. But most devs use a global caching proxy, which is paired with a certificate-transparency like tamper-evident ledger to verify the download.

IMO the Go module ecosystem does all of this really well.
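A loose stand-in for the idea behind Go's canonical module hash (the real algorithm is dirhash's "H1" in golang.org/x/mod, which hashes a zip file listing; this simplified version only illustrates the shape): hash a sorted listing of per-file hashes, so the same source tree always yields the same value regardless of how it was downloaded.

```shell
# Simplified tree-hash sketch, NOT the exact Go dirhash algorithm.
# Assumes GNU coreutils (sha256sum); module contents are invented.
set -e
work=$(mktemp -d); cd "$work"
mkdir -p mod
printf 'module example.test/m\n' > mod/go.mod
printf 'package m\n' > mod/m.go
# One line per file: "<sha256>  <path>", in a canonical sort order,
# then hash the listing itself to get a single tree identifier:
( cd mod && find . -type f | LC_ALL=C sort | xargs sha256sum ) > listing
h=$(sha256sum listing | cut -d' ' -f1)
echo "tree-hash:$h"
```

Because the value depends only on file contents and names, a checksum database can pin it once and every later download can be verified against it.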

@hanno One caveat is that it's also possible *not* to build from a repo, but instead to distribute your module directly as a (well-formed) zip file, on your own web server.

I don't know anyone who does that, but in that case, you don't get a clear audit-trail, of course.

@Merovius @hanno see e.g. https://go.dev/blog/module-mirror-launch

The Go compiler/build system is supposed to act like a function -- same inputs, yield same outputs, always. https://go.dev/blog/rebuild

Go build/install also does not run arbitrary other programs (especially if you avoid cgo, that is, calling out to foreign code).

It's not bulletproof, and someone can write obfuscated Go, same as any other language, but (in my opinion) it's not nearly as easy a target as C and its infrastructure.

@hanno I think Golang does, well, if you don't use a proxy; with a proxy you're getting whatever the proxy has seen first, not *necessarily* what is in the repo today (because someone force-pushed). Not sure which is better.
https://go.dev/ref/mod#module-proxy
https://go.dev/ref/mod#checksum-database

@hanno Rust's Cargo does. It automatically adds the current git commit id to the ".cargo_vcs_info.json" file which it puts within the package, and the Cargo.toml file also within the package is supposed to have a pointer to the repository (it is shown in the package's page at the crates.io site).
@cesarb does the cargo upstream service check that the commit in the repo matches the file's content?
@hanno @cesarb no, but someone wrote a tool that could be used to do the comparison (but does not yet) https://rust-digger.code-maven.com/vcs/no-repo

@hanno I don't know rpm or pacman, but debian/ubuntu tie the package to a git tag, which is a pointer to a specific commit. But git tags are shoddy design and are not immutable: delete the tag, alter the code, recreate the tag with the same name. You'd have to grab the tagged source, build, then compare hashes to know the source didn't match. 100 to 1 that almost never happens. But it's massively more secure than Windows/Apple/Android package handling, so not a true concern. But it is a weakness in git's design.
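A quick demonstration of that tag-mutability complaint (throwaway repo, invented names): delete a tag, change the code, re-tag under the same name, and the tag now silently points at different code.

```shell
# Demo: git tags are mutable references, not immutable identifiers.
# Assumes git is installed; all names are invented.
set -e
work=$(mktemp -d); cd "$work"
git init -q r && cd r
git config user.email a@example.org
git config user.name a
echo one > f; git add f; git commit -qm c1; git tag v1
old=$(git rev-parse v1)
echo two > f; git commit -aqm c2
git tag -d v1 > /dev/null
git tag v1            # same name, different commit
new=$(git rev-parse v1)
[ "$old" != "$new" ] && echo "tag v1 silently moved"
```

This is why pinning a release to a commit hash (or a signed tag whose signature covers the commit) is stronger than pinning it to a bare tag name.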
@smxi I am pretty sure debian does not pull git sources for most of their packages.
@hanno I'm not sure where you got that idea, since that's exactly what they do. I talk to #unit193, the #debian #ubuntu #inxi packager, all the time, and that's exactly what he does. I can double-check with him if you want. Arch pacman packagers certainly pull from git then build. AUR is just direct live build scripts pulling from git. Rpm I don't follow but assume that's what they do. Unit193 has a tracker script to alert on new tagged releases. I'll ask what he does now.
@smxi @hanno there is no hard and fast rule for Debian packages and where the orig.tar.gz tarball come from. Ideally these match what was released by upstream, but sometimes the Debian maintainer even repacks the tarball so it doesn't even match what upstream released. And the upstream tarball is rarely a git tag - more often than not it is the output of running 'make distcheck' against the git repo and like Hanno says this may then include other bits.

@alexmurray @hanno While I'm fuzzy on the details, I only started tagging because packagers said the distro required it. It's been too long, but I want to say Fedora asked. Maybe Arch, but I don't remember. The point being reproducible builds based on a specific git commit.

Waiting to hear from packagers, now I'm curious. Lintian has by FAR the strictest packaging guidelines, so I doubt any other package manager will be more strict.

I do the #tinycore package for #inxi and they don't use that method at all.

@alexmurray @hanno yes, thanks. Unit193 explained it the same way. That surprises me, but on reflection these rules precede things like git by many years, so a git tag/asset/archive forms just one possible tarball source. Makes sense. I suspect slackware is similar, being from the same original era. I guess it varies. I wonder how other core package pools do it.
@smxi @hanno Technically they may pull from git directly, yes. The archive still needs a tarball uploaded - which many generate from some git - so somehow you both are right.
@Ganneff @hanno I wonder how Arch main repo packages do it? My guess is mostly from git direct but never looked into it. It's easy to forget how old #debian and #slackware are.
@hanno @smxi It depends on the package and maintainer but many do, especially when upstream uses git. Nearly all my packages (with one exception) track the upstream git history directly and validate that the contents of the upstream tarball match. Some packages I actually have to generate tarballs because upstream ones aren't available. The git-buildpackage tool does the tarball-source validation automatically, which is nice.
@hanno github and pypi have integration that does that
@hanno it shouldn't be a big deal to write a nix derivation that, instead of a build script, has a tarball packager. That should solve the problem, without even requiring any commitment to the rest of the nix ecosystem.
@hanno npm supports "provenance" through sigstore and i think docker does too. basically you have to build and publish your code using a github/gitlab action, which has the ability to cryptographically sign the output. then you get a "verified" badge on the npm page that links to the build logs. so one could view the exact source code and build scripts that went into the package. i don't think it has to be 100% reproducible, but in most cases it is. and so long as you trust github it doesn't matter anyway. it adds a degree of centralization so that's a downside. i think the upside is worth it though.

here's an example of a package with a verified badge: https://www.npmjs.com/package/husky