Don't leave me unsupervised near OSM data and computers with Rust and git... I literally spent my last week implementing the idea from https://blog.andygol.co.ua/en/2023/05/07/osm-2-0-api-using-git/

Soooo I now have a script (it only somewhat works) that converts #osm changesets to git commits... Uhm. It's still quite buggy and not really usable (the output needs to be split into chunks to actually be clonable in the end), but it works.

https://github.com/MTRNord/osm-git/

Readme will follow when this gets usable.

It currently just writes each object as a YAML file into the root folder. The plan is to chunk them by lat/lon, similar to how map tiles work. That would make it much easier to clone just a subset of the data.

Inside each file, the legacy object version field holds the OSM object's own version, the filename is the object id, and the file version is a version number for the file format itself. That allows for future format changes.
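For illustration only, a node file might look something like this. This is a purely hypothetical layout based on the description above; the field names are my guesses, not the actual format:

```yaml
# 123456789.yaml — filename is the object id (hypothetical example)
file_version: 1           # version of the file format itself, bumped on format changes
legacy_object_version: 4  # the OSM object's own version number
lat: 52.5163
lon: 13.3777
tags:
  name: Brandenburger Tor
```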

It also loads the whole changeset snapshot into RAM for this, so you need roughly 32+ GB of free RAM to run it.
Turns out you may be able to get away with less currently. Your mileage may vary.
Also, the script requires you to convert the changeset snapshots to zstd compression, since bzip2 is good for storage but painfully slow to decompress. Even with zstd, it still takes about 5-10 minutes to fully process the file (as in decompressing it and parsing the XML). The nice thing is that it only keeps the currently needed parts in memory; the downside is that the changeset file gets re-parsed for every data file. RAM is expensive, time less so.
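The one-time recompression is just a pipe. A command sketch, using a tiny stand-in file; for real use you'd replace it with the multi-gigabyte changeset snapshot:

```shell
# Create a tiny stand-in for the changeset snapshot, then recompress
# it from bzip2 to zstd: pay the slow bzip2 decompression once,
# get fast zstd decompression on every later run.
printf '<osm/>' > sample.xml
bzip2 -kf sample.xml                        # produces sample.xml.bz2
bzcat sample.xml.bz2 | zstd -T0 -q -f -o sample.xml.zst
zstdcat sample.xml.zst                      # round-trips the content
```

`-T0` lets zstd use all cores; a higher `-19`-style level could trade speed back for storage if disk space matters.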
To put the time into perspective: this will take weeks to go through all the data even if I loaded the changeset snapshot into RAM just once. So if it takes 2 days longer, I don't really care. I am mostly I/O-bound either way, and based on a flamegraph I am basically already at the edge of what's possible speed-wise, limited by needing to download and parse gigabytes of data.

Well. Switching from #git (libgit2 specifically) to #gitoxide brought a MAJOR performance improvement. It still takes a lot of time, but much less than before. It's visibly faster this way.

Currently running yet another flamegraph to make sure I am not spending too much time on something else.

And tomorrow I will figure out resuming things properly.
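One simple way to resume would be checkpointing. A sketch of that idea (my own approach, not necessarily what the repo will do): persist the id of the last fully committed changeset, and skip everything up to it on the next run.

```rust
use std::fs;

// Hypothetical checkpoint file name.
const CHECKPOINT: &str = "checkpoint.txt";

// Read the last fully processed changeset id, defaulting to 0
// on a fresh start or an unreadable file.
fn load_checkpoint() -> u64 {
    fs::read_to_string(CHECKPOINT)
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

// Write to a temp file and rename it into place, so a crash
// mid-write never leaves a truncated checkpoint behind.
fn save_checkpoint(id: u64) -> std::io::Result<()> {
    let tmp = format!("{}.tmp", CHECKPOINT);
    fs::write(&tmp, id.to_string())?;
    fs::rename(tmp, CHECKPOINT)
}

fn main() -> std::io::Result<()> {
    let resume_from = load_checkpoint();
    for id in (resume_from + 1)..=(resume_from + 3) {
        // ... convert changeset `id` to a git commit here ...
        save_checkpoint(id)?;
    }
    println!("resumed after {}, now at {}", resume_from, load_checkpoint());
    Ok(())
}
```

Since each changeset already becomes its own commit, the last commit in the repo could serve the same role as the checkpoint file.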

And this is going to be an issue... (and that's not even the full first file of the OSM dataset)
Yes, that is a full-on 200 GB git repo.