Don't leave me unsupervised near OSM data and computers with Rust and git... I literally spent my last week implementing the idea from https://blog.andygol.co.ua/en/2023/05/07/osm-2-0-api-using-git/

Soooo I now have a script (it only somewhat works) that converts #osm changesets to git commits... Uhm. It's still quite buggy and not really usable (the output needs to be split into chunks to actually be clonable in the end), but it works.

https://github.com/MTRNord/osm-git/

Readme will follow when this gets usable.

It currently just writes each object as a YAML file into the root folder. The plan is to chunk them by lat/lon, similar to how map tiles work. That would make it much easier to clone just a subset of the data.

Inside each file, the legacy object version field holds the OSM object's own version, the filename is the object id, and the file version is a version number for the file format itself. That allows for future format changes.
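For illustration only, a node file might look something like this. This is a purely hypothetical layout based on the description above; the field names are my guesses, not the actual format:

```yaml
# 123456789.yaml — filename is the object id (hypothetical example)
file_version: 1           # version of the file format itself, bumped on format changes
legacy_object_version: 4  # the OSM object's own version number
lat: 52.5163
lon: 13.3777
tags:
  name: Brandenburger Tor
```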

It also loads the whole changeset snapshot into RAM for this, so you need roughly 32+ GB of free RAM to run it.
Turns out you may be able to get away with less currently. Your mileage may vary.
Also, the script requires you to convert the changeset snapshots to zstd compression, since bzip2 is good for storage but painfully slow to decompress. Even with zstd, it still takes about 5-10 minutes to fully process the file (as in decompressing it and parsing the XML). The nice thing is that it only keeps the currently needed parts in memory; the downside is that the changeset file gets re-parsed for every data file. RAM is expensive, time less so.
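The one-time recompression is just a pipe. A command sketch, using a tiny stand-in file; for real use you'd replace it with the multi-gigabyte changeset snapshot:

```shell
# Create a tiny stand-in for the changeset snapshot, then recompress
# it from bzip2 to zstd: pay the slow bzip2 decompression once,
# get fast zstd decompression on every later run.
printf '<osm/>' > sample.xml
bzip2 -kf sample.xml                        # produces sample.xml.bz2
bzcat sample.xml.bz2 | zstd -T0 -q -f -o sample.xml.zst
zstdcat sample.xml.zst                      # round-trips the content
```

`-T0` lets zstd use all cores; a higher `-19`-style level could trade speed back for storage if disk space matters.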
To put the time into perspective: this will take weeks to go through all the data even if I loaded the changeset snapshot into RAM just once. So if it takes 2 days longer, I don't really care. I am mostly I/O-bound either way, and based on a flamegraph I am basically already at the edge of what's possible speed-wise, limited by needing to download and parse gigabytes of data.

Well. Switching from #git (libgit2 specifically) to #gitoxide brought a MAJOR performance improvement. It still takes a lot of time, but much less than before. It's visibly faster this way.

Currently running yet another flamegraph to make sure I am not spending too much time on something else.

And tomorrow I will figure out resuming things properly.
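One simple way to resume would be checkpointing. A sketch of that idea (my own approach, not necessarily what the repo will do): persist the id of the last fully committed changeset, and skip everything up to it on the next run.

```rust
use std::fs;

// Hypothetical checkpoint file name.
const CHECKPOINT: &str = "checkpoint.txt";

// Read the last fully processed changeset id, defaulting to 0
// on a fresh start or an unreadable file.
fn load_checkpoint() -> u64 {
    fs::read_to_string(CHECKPOINT)
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

// Write to a temp file and rename it into place, so a crash
// mid-write never leaves a truncated checkpoint behind.
fn save_checkpoint(id: u64) -> std::io::Result<()> {
    let tmp = format!("{}.tmp", CHECKPOINT);
    fs::write(&tmp, id.to_string())?;
    fs::rename(tmp, CHECKPOINT)
}

fn main() -> std::io::Result<()> {
    let resume_from = load_checkpoint();
    for id in (resume_from + 1)..=(resume_from + 3) {
        // ... convert changeset `id` to a git commit here ...
        save_checkpoint(id)?;
    }
    println!("resumed after {}, now at {}", resume_from, load_checkpoint());
    Ok(())
}
```

Since each changeset already becomes its own commit, the last commit in the repo could serve the same role as the checkpoint file.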

And this is going to be an issue... (and that's not even the full first file of the OSM dataset)
Yes, that is a full-on 200 GB git repo.