Instead of using git as a database, what if you used a database as a git?

https://nesbitt.io/2026/02/26/git-in-postgres.html

Git in Postgres

Instead of using git as a database, what if you used a database as a git?

Andrew Nesbitt
@andrewnez when aiming for operational simplicity, feels like SQLite would be a better fit than Postgres, especially when all the interactions come in through the network anyway. (See also: Fossil?)
@djc forgejo uses postgresql, that was the main driver
@djc @andrewnez (Fossil was indeed my first thought, although I think the schema there is a bit higher level, not just a storage backend)
@andrewnez does forgejo really fork out to run the git binary? Why would it not just use libgit 😨
CGo · Issue #5142 · go-gitea/gitea

Does this project require CGo? Does it use libgit2 under the covers? Or just integrate with an existing standalone git service that is already running on host.

GitHub
@andrewnez I guess it's an unverified assumption but it feels like that's wasting a whole bunch of CPU cycles for no reason 😕
@andrewnez
This reminds me of https://m.youtube.com/watch?v=wN6IwNriwHc
Although this (unlike that video) has practical use
Database as Filesystem

YouTube
I knew Homebrew was bad, but thanks for explaining some other reasons I hadn't paid attention to, such as:

"homebrew-core has one Ruby file per package formula, and every brew update used to clone or fetch the whole repository until it got large enough that GitHub explicitly asked them to stop."

I hadn't realized it was that bad! I'll add that to the list of reasons to continue avoiding Homebrew, as if the spyware by default and the founder turning into a cryptocoin grifter weren't bad enough already.

"Git packfiles use delta compression, storing only the diff when a 10MB file changes by one line, while the objects table stores each version in full. A file modified 100 times takes about 1GB in Postgres versus maybe 50MB in a packfile."

20X overhead seems kind of horrifying to me?

No, thank you.

"storing three full uncompressed copies of every repository across data centres because redundancy and operational simplicity beat storage efficiency even at hundreds of exabytes."

I can't even begin to wrap my head around "beat storage efficiency" at "hundreds of exabytes" but then, I have been witness to corporate largess on scales that defy rational explanation. Some companies can afford to light stacks of money on fire apparently, but I don't think taking inspiration from Heath Ledger's portrayal of The Joker in 2008's The Dark Knight should be a guiding light for any sane sorts.

IMHO, SQL is an anti-pattern that Steve Jobs wasn't smart enough to avoid when he bundled Sybase with NeXT, and we've been suffering from that oversight ever since. SQL should have died, or at least stayed, with IBM. There are so many better database paradigms in existence which are not SQL.

Anyway, I dislike Git and I dislike SQL, and you have somehow managed to create what is, to me, like the opposite of the Reese's peanut butter cup commercials of the 20th century?

But y'know, you're probably not entirely off the mark? Fossil-scm is a DVCS (with limited Git interoperability) and issue tracking system and wiki and such, which, presumably by virtue of being developed by the author of SQLite, is also wrapped around SQLite.

The concluding sentence, "there’s no filesystem of bare repos to manage alongside the database", hearkens back to an old Slashdot "Rob Pike Responds" Q&A about databases and filesystems (https://interviews.slashdot.org/story/04/10/18/1153211/rob-pike-responds) but seems to ignore the reality: databases exist on filesystems, always have, and presumably always will. So figuring out how not to overly abstract that and get down to brass tacks is vital.

I suppose, since elsewhere you write about S3, maybe you're too lost in the clouds and too far removed from bare metal and hardware implementations? That's not a good thing. Pretty much the opposite of good.
Rob Pike Responds - Slashdot

He starts by clearing up my error in saying he was a Unix co-creator in the original Call For Questions. From there he goes on to answer your questions both completely and lucidly. A refreshing change from the politicians and executives we've talked to so much recently, no doubt about it....

@andrewnez actually I would argue storage efficiency is important for making it accessible to self-host.

If you don't have a bunch of disposable income to put towards a homelab, storage really isn't very cheap. Especially now that prices of everything have doubled or tripled due to AI datacenter demand.

Renting a VPS with more than 20-50GB of storage gets expensive very fast. Too expensive for many people.

Using an old PC for hosting can sometimes be an option but that will have reliability issues, and you probably won't have more than like 1TB of space available then anyway. Also dependent on your ISP giving you public IPs.

@lunareclipse @andrewnez I'd be interested to know if anybody's explored how effective it is to run Postgres on a deduplicating filesystem like ZFS or btrfs.

Hopefully this turns out to be a non-issue in practice, without having to build deduplication or delta compression into Postgres directly.

@diffrentcolours @lunareclipse @andrewnez
I don't know much about this, but it seems like the tradeoffs/gains would still need to be measured on a case-by-case basis

despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-is-good-dont-use-it/
OpenZFS deduplication is good now and you shouldn't use it

OpenZFS 2.3.0 will be released any day now, and it includes the new “Fast Dedup” feature. My team at Klara spent many months in 2023 and 2024 working on it, and we reckon it’s pretty good, a huge step up from the old dedup as well as being a solid base for further improvements. I’ve been watching various forums and mailing lists since it was announced, and the thing I kept seeing was people saying something like “it has the same problems as the old dedup; needs too much memory, nukes your performance”. While that was true (ish), and is now significantly less true, the real problem is that this is just repeating the same old non-information that they probably heard from someone else repeating it.

despair labs
@diffrentcolours @lunareclipse @andrewnez
Deduplication is almost never worth it, especially not for Postgres. The Postgres page headers, tuple headers, and MVCC versioning make blocks look different even though the logical data is the same, resulting in no actual dedup happening with "naive" dedup methods like ZFS.

ZFS compression, on the other hand, works very well with Postgres. It's basically free on modern CPUs, reduces I/O, and can easily yield massive compression. The VM my instance is running on is compressed by 1.8x using zstd. That'd be the first knob to turn that pays off big.

See attached: my compression ratios for my hypervisor storage pool. Most Postgres data compresses so well that you don't even really need dedup.
@diffrentcolours @lunareclipse @andrewnez
From how I understand it, with gitgres specifically, every object in the objects table is content-addressed by its SHA-1 hash.
So, by definition, no two rows have the same content, ever. A file that changes by one byte produces a different blob with a different hash. ZFS/Btrfs block-level dedup would find no duplicates.
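
To make that concrete, here's a minimal Go sketch of how Git computes a blob's object ID (not gitgres's actual code, just the standard "blob <size>\0<content>" hashing scheme):

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// gitBlobID computes the SHA-1 object ID the way Git does for a blob:
// hash the header "blob <size>\x00" followed by the raw content.
func gitBlobID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := []byte("hello world\n")
	b := []byte("hello world!\n") // a one-byte edit
	fmt.Println(gitBlobID(a))     // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
	fmt.Println(gitBlobID(b))     // completely different ID, i.e. a new row in the objects table
}
```

Two almost-identical files land in two unrelated rows, which is why block-level dedup finds nothing to share.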

@andrewnez @stsp wasn't this done a few years back? I think maybe by AWS? They also leaned on libgit2 to interpose a SQL storage layer into git operations. They ran across the same storage size trade-offs of lacking delta-pack optimisations, but if I recall correctly their big lesson learned was that standard CLI git operations slowed down by a few thousand times because of an unexpectedly high number of round trips to the database.

Is this ringing any bells? I'll see if I can find it.

@andrewnez @stsp ah, here we go:

https://web.archive.org/web/20160304015546/http://blog.deveo.com/your-git-repository-in-a-database-pluggable-backends-in-libgit2/

And

https://github.com/libgit2/libgit2-backends/pull/4#issuecomment-36115322

2013 and 2014 respectively. (And upon re-reading it I see that libgit2 has several SQL pluggable back-ends, so now I'm sure you've already seen this. Sorry for pointing out the obvious.)

Your Git Repository in a Database: Pluggable Backends in libgit2

Libgit2 provides the ability to store the data in the .git directory in a relational or NoSQL database, or an in-memory data structure.

The Deveo Blog
@gnomon @andrewnez We have the same latency issue in game of trees where objects and pack files are always parsed in a sub-process (via fork+exec and passing data back to the main process via a file descriptor or a small buffer copied across Unix pipes). To obtain reasonable performance we try to perform object graph traversal operations in batches inside the pack file reader if possible. Still slower than Git but the good news is that Git is so incredibly fast that programs running 10 or even 100 times slower can still be perfectly usable.
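
For anyone unfamiliar with the shape of that pattern, here's a rough Go illustration (got itself is C; this just drives the real `git cat-file --batch-check`, which reads object IDs on stdin and answers one line per object over a pipe, so a whole batch costs a single process round trip):

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

func main() {
	// Parse objects in a child process and batch requests over pipes,
	// instead of paying one fork+exec per object.
	cmd := exec.Command("git", "cat-file", "--batch-check")
	stdin, err := cmd.StdinPipe()
	if err != nil {
		panic(err)
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// One batch of objects to resolve (run this inside any git repository).
	batch := []string{"HEAD", "HEAD^{tree}"}
	go func() {
		for _, name := range batch {
			fmt.Fprintln(stdin, name)
		}
		stdin.Close() // end of batch
	}()

	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		fmt.Println(scanner.Text()) // "<oid> <type> <size>" per requested object
	}
	cmd.Wait()
}
```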
@andrewnez internally we use PG to version data. We have commits, branches, diffs, merges, cherry picks,... Internally, data is kept in tables with branch_id, commit_id_from, commit_id_to, data_pk, with a range-based exclusion constraint. Branching is expensive, as it needs to copy the whole data set (~150 tables); diffing has some hints to make it not read all tables but only the touched ones.
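
Roughly the shape of those tables, if it helps: a simplified sketch (my own guess at the DDL, not the real schema), where each logical row is valid for a half-open range of commits on a branch and an exclusion constraint forbids overlapping ranges for the same key:

```go
package main

import (
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver
)

// Sketch of a range-versioned table: each row of logical data is valid
// from commit_id_from up to (but not including) commit_id_to on a branch;
// NULL commit_id_to means "still current". The EXCLUDE constraint stops
// two versions of the same data_pk from overlapping on the same branch.
const ddl = `
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE versioned_items (
    branch_id      bigint NOT NULL,
    commit_id_from bigint NOT NULL,
    commit_id_to   bigint,
    data_pk        bigint NOT NULL,
    payload        jsonb  NOT NULL,
    EXCLUDE USING gist (
        branch_id WITH =,
        data_pk   WITH =,
        int8range(commit_id_from, commit_id_to) WITH &&
    )
);`

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/example?sslmode=disable")
	if err != nil {
		panic(err)
	}
	if _, err := db.Exec(ddl); err != nil {
		panic(err)
	}
}
```

Diffing two commits then becomes a query over rows whose ranges contain one commit id but not the other; branching still means copying every current row under a new branch_id, which is where the expense comes from.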
@andrewnez Funny, I have a forge that does the exact opposite. Rather than putting everything in a database, I thought, "What if we had a forge without a database?" Everything is stored in Git (config, users, permissions, tickets, etc.). Of course, I don't use git-cli, but rather a library that allows me to mount the repository in memory: https://github.com/go-git/go-git, in Go, like Forgejo. Implementing a storer for PG should be doable with it. To see what my forge looks like: https://gitroot.dev
GitHub - go-git/go-git: A highly extensible Git implementation in pure Go.

A highly extensible Git implementation in pure Go. - go-git/go-git

GitHub
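
For the curious, "mount the repository in memory" with go-git looks roughly like this (a minimal sketch; a Postgres-backed storer would implement the same storage interface that memory.NewStorage() satisfies here):

```go
package main

import (
	"fmt"

	"github.com/go-git/go-billy/v5/memfs"
	"github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/storage/memory"
)

func main() {
	// Clone entirely into memory: objects and refs land in memory.Storage,
	// the checked-out worktree in an in-memory billy filesystem. Swapping
	// memory.NewStorage() for a database-backed storer is the extension
	// point the post above is talking about.
	repo, err := git.Clone(memory.NewStorage(), memfs.New(), &git.CloneOptions{
		URL:   "https://github.com/go-git/go-git",
		Depth: 1,
	})
	if err != nil {
		panic(err)
	}
	head, err := repo.Head()
	if err != nil {
		panic(err)
	}
	fmt.Println("HEAD is at", head.Hash())
}
```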