I found this Veritasium documentary on the xz Jia Tan backdoor adventure quite good and surprisingly detailed:

https://www.youtube.com/watch?v=aoag03mSuXQ

The Internet Was Weeks Away From Disaster and No One Knew

@bagder I'm confused as to why binary blobs are allowed to be stored in public source code repositories anyways.

I mean, I understand if you want to include assets for a game, but wouldn't it then be safer to store them in a readable format before compression? As a simplified example, PNGs could be stored as XPM in source and then converted into the better format using provided tools, also in the repo.

TL;DR: If blobs are to be used in tests, write a tool that generates the blob for them.
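
A minimal sketch of what such a generator could look like, assuming Python and its standard `lzma` module. The name `build_fixture` and its parameters are made up for illustration and come from neither xz nor curl. The point is that only the readable recipe (seed, size, alphabet) lives in the repo, and the blob is rebuilt when the tests run:

```python
import lzma
import random


def build_fixture(seed: int, size: int, alphabet: bytes = b"abcd") -> bytes:
    """Return an .xz blob built from a small, reviewable recipe.

    Hypothetical helper: the recipe is what gets committed and reviewed,
    the binary blob is regenerated on demand.
    """
    # A fixed seed gives the same payload on repeated runs of the same
    # Python version (random.choice is not guaranteed stable across versions).
    rng = random.Random(seed)
    payload = bytes(rng.choice(alphabet) for _ in range(size))
    return lzma.compress(payload, format=lzma.FORMAT_XZ)


if __name__ == "__main__":
    blob = build_fixture(seed=1234, size=4096)
    # Round-trip check: the generated blob decodes to the expected length.
    assert len(lzma.decompress(blob)) == 4096
    print(f"generated {len(blob)} byte fixture")
```

The exact compressed bytes can still differ between liblzma versions, so tests built this way would probably compare decoded content rather than the blob itself.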

@thanius @bagder it is weird but it was done to test compression/decompression with known blobs (I think) and since it was always like this, no one thought twice. I'd guess this kind of thing is being much more heavily scrutinized now.

@thanius convenience? lack of time? didn't think of the security implications?

Keeping everything readable all over takes effort. In the curl project the xz event kicked off a journey making sure we have less opaque data everywhere in git. It is work that is still ongoing!

@bagder Yeah, I understand it takes time to backtrack through an entire project or projects to make everything transparent for reviewers.

But after this debacle I hope that more developers look into dogfooding their binary storage in projects. I too am responsible for storing blobs, albeit in private repos, but I've since tried to implement build-time asset transformation instead even though it may bulk up the repos.

@thanius @bagder just guessing, but it might be to hit certain corner cases. For example, you might want a file with a certain type of noise to test that your changes to the algorithm didn't cause it to spit out a packed file that's significantly bigger than the incompressible source. Or something that was known to cause crashes or lossy behaviour in previous versions, to prevent regressions.
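
For the noise case, a rough sketch of such a check, assuming Python's standard `lzma` module stands in for the compressor under test. The helper names and the 1024-byte overhead budget are arbitrary assumptions, not values from any real test suite:

```python
import lzma
import random


def noise(seed: int, size: int) -> bytes:
    """Deterministic pseudo-random bytes, effectively incompressible."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def expansion(data: bytes) -> int:
    """Bytes added by compression (negative if the data shrank)."""
    return len(lzma.compress(data)) - len(data)


def check_noise_does_not_blow_up(max_overhead: int = 1024) -> None:
    # 64 KiB of seeded noise; the budget is an assumed, generous bound
    # for container headers and framing.
    data = noise(seed=42, size=64 * 1024)
    extra = expansion(data)
    assert extra <= max_overhead, f"noise grew by {extra} bytes"


if __name__ == "__main__":
    check_noise_does_not_blow_up()
```

Here the noise is generated from a seed rather than committed, which only works when any incompressible input will do and not one specific byte sequence.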

@duckz The whole point of unit tests is that they are reproducible. They're tailored for specific scenarios, and should thus be recreatable imho.

If you know how to reproduce a certain scenario, where the application expects a blob for the mockup, then build a tool that creates the blob before testing.

Prepare -> mock -> test
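
The prepare, mock, test flow could be sketched like this, assuming Python's `unittest` and `lzma` modules. `make_truncated_xz` and the test class are hypothetical names, and a truncated stream is just one example of a scenario that can be rebuilt on demand:

```python
import lzma
import tempfile
import unittest
from pathlib import Path


def make_truncated_xz(payload: bytes, keep: int) -> bytes:
    """Prepare: build a valid .xz stream, then keep only its first bytes."""
    return lzma.compress(payload)[:keep]


class TruncatedInputTest(unittest.TestCase):
    def setUp(self) -> None:
        # Prepare: generate the blob from a readable recipe.
        blob = make_truncated_xz(b"hello world\n" * 100, keep=20)
        # Mock: place it where the code under test expects a file.
        self._tmp = tempfile.TemporaryDirectory()
        self.path = Path(self._tmp.name) / "truncated.xz"
        self.path.write_bytes(blob)

    def tearDown(self) -> None:
        self._tmp.cleanup()

    def test_truncated_stream_is_rejected(self) -> None:
        # Test: the decoder must report an error instead of returning data.
        with self.assertRaises(lzma.LZMAError):
            lzma.decompress(self.path.read_bytes())
```

Saved as a test module, this can be run with `python -m unittest`; no binary file ever needs to be committed.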

@thanius I get that and you can probably do what you say to generate certain sequences including noise even though it might be non-trivial, but if there is a certain binary sequence that was once a chunk of a file that someone attached to a bug report, you can't reasonably generate that. It doesn't have to be megs either, so why look for a function that generates 10 specific bytes when you can just commit those 10 bytes?

@duckz Because it wouldn't be source code. :)
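
One possible middle ground, sketched under assumptions: keep those 10 bytes as a hex literal inside the test source, so they are committed as-is but still readable and diffable. The bytes below are hypothetical placeholders (the start of an ordinary .xz header), not data from any real bug report, and the helper names are made up:

```python
import lzma

# Hypothetical bytes standing in for a fragment taken from a bug report.
# Kept as spaced hex so a reviewer can read and diff them.
REGRESSION_INPUT_HEX = "fd 37 7a 58 5a 00 00 04 e6 d6"


def regression_input() -> bytes:
    """Decode the hex literal into the exact bytes fed to the decoder."""
    return bytes.fromhex(REGRESSION_INPUT_HEX)


def decoder_rejects(data: bytes) -> bool:
    """True if the decoder raises a clean error for this input."""
    try:
        lzma.decompress(data)
    except lzma.LZMAError:
        return True
    return False
```

This only stays reviewable for small inputs; a multi-kilobyte hex dump is as opaque to a human reader as the binary file it replaces.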