I'd like to store one billion variable-length binary objects & get them by SHA256(obj) key. The median object size is 2 kilobytes. Low read/write volumes.

What I've tried so far: NFS with the hash's first few octets as nested directory names. It works, but it's a bit slow. I also tried ZeroFS (also too slow).

Under considerations: DuckDB, RocksDB, BerkeleyDB, SQLite3, lmdb, something bespoke

Recommendations? Things/papers I should be reading?

@job hashed / sharded SQLite is very accessible and should do the job. Huhu. LMDB would be faster, but then you need to write some tooling.
@job define slow
@ydroneaud had trouble consistently reading faster than 2500 obj/sec on NFS; ZeroFS goes even slower, and I seem to be maxing out the one CPU core it uses for some task.

@job

I suspect that a lot of that was filesystem overhead: finding the file, checking permissions, and so on. A single-file solution removes that overhead by doing all of it once at open time, then only using fseek(3) to jump around within that file (see also: why games like Doom used .wad files, effectively creating a big read-only k/v store).

So of your original list, bdb and lmdb both fill about the same space and should be pretty interchangeable; I'd start there, since they have a bit less overhead than SQLite. SQLite (my fallback if bdb/lmdb don't meet your needs) offers more flexibility, but involves marshaling data in and out and query parsing that could introduce performance issues.

@job also, I don't know if licensing considerations matter.

bdb: AGPL
lmdb: OpenLDAP public license (permissive)
sqlite: ostensibly public domain

@job okay, apparently I can't shut up or formulate all my thoughts in a single reply 🙃

I'm unfamiliar with RocksDB but it sounds like a reasonable option too (local, performant, comfortably licensed)

Unless you see the need to access the data from multiple machines (like an actual MySQL/PostgreSQL type install), I'd bias away from DuckDB for this use case.

@job At my previous job we used lmdb to store DNS records with a btree on the reversed domain name.
So, example.com became moc.elpmaxe (as domain names are written the wrong way around), and we had three tables: one for the SOAs, one for DNSSEC, and one for the rest of the zone.
It looks like kind of the same load as you, and it was blazing fast.
I wasn’t the author of this (I mostly don’t code) but I can get you in touch with my previous teammates if you want to exchange directly with them.
@alarig @job y'know, you could have just stored the names in forward order, and set the MDB_REVERSEKEY option...

@job This kinda sounds like an application for S3 storage? IDK if any self-hostable solution would provide better performance than what you've already tried though. Also adds an S3 API somewhere in between, but then that's common enough to usually have good language support.
@galaxis tiny objects like these are terrible for s3 style storage. S3 works great if your objects are around 10MB, kinda okayish at 1MB each, and terrible at 2KB
GitHub - wilsonzlin/blobd: Blob storage designed for huge amounts of random reads and small objects with constant latency

@job may sound strange, but maybe git?
@kinnison I don’t think git has the ability to scale to these numbers, probably because git offers a lot of features and functionality (overhead) that I don’t need for this task
@job @kinnison yes, git isn't quite going to work; the packfile format is close, but a KV store probably scales better. If you wanted to design something yourself, some ideas from how packfile indexes are built could be interesting (and other tools building on the git format, e.g. http://github.com/bup/bup).

@job This is a prime example for the use of a key-value store. As you only have a few TB of storage need, put the data on an NVMe locally.

As for which KV store, any b-tree based should be fine unless you need the best performance possible. Then it comes to trying with real data and access patterns. Hint: pre-sort the data by key before loading it into the KV store for better performance.
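The pre-sort hint can be sketched in Python, with SQLite standing in for a generic b-tree store (schema and names are illustrative):

```python
import hashlib
import sqlite3

# Sketch of the pre-sort hint: loading records in key order fills b-tree
# pages sequentially, instead of splitting pages at random positions the
# way randomly ordered hash keys would.
objects = [f"object-{i}".encode() for i in range(1000)]
records = [(hashlib.sha256(o).digest(), o) for o in objects]
records.sort(key=lambda r: r[0])  # sort by key BEFORE loading

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blobs (key BLOB PRIMARY KEY, val BLOB)")
with db:  # one transaction for the whole bulk load
    db.executemany("INSERT INTO blobs VALUES (?, ?)", records)
```

With a billion records, doing the sort as an external merge sort on the key files before loading would be the equivalent step.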

@attilakinali @job if he only needs point lookups and not ranges or ordered lookups, an extensible hash will probably be fastest. This would be one of the rare times where I wouldn't recommend LMDB first.
@hyc @job The problem with hash-based storage systems is that you need to ensure you have a sparse table or you get too many collisions. With billions of records, the size of the table becomes large enough to be an issue. Meanwhile, we have plenty of methods for making b-trees as fast as, if not faster than, hash tables when the data doesn't fit into RAM anymore (see "Modern B-Tree Techniques" by Graefe for an in-depth explanation).
@attilakinali @job yeah... thanks, I understand the ins and outs of B+trees, better than current textbooks...
@hyc @job Oh... sorry. I did not want to imply anything of that kind!

@attilakinali @job anyway, it may well be that B+trees are still the superior solution, as you suggest. That was the conclusion we reached 20-some years ago when testing BerkeleyDB's hashes vs its B+tree implementation for larger-than-RAM DBs in OpenLDAP. But I didn't want to assume that our experience would generalize to everyone's.

A TRIE might be best, with keys in one file whose values are offsets to blobs in a separate append-only file.

@attilakinali @job when you're just accumulating records that are never deleted or modified, you can simplify quite a lot relative to a regular k/v store.
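A toy sketch of that append-only layout, with an in-memory dict standing in for the on-disk key index (the trie/b-tree file mentioned above) and a BytesIO standing in for the blob file; all names are illustrative:

```python
import hashlib
import io

# Minimal append-only blob store: blobs are appended to one file and an
# index maps SHA-256 key -> (offset, length). Records are never deleted
# or modified, which is what makes this simplification possible.
blobfile = io.BytesIO()
index: dict[bytes, tuple[int, int]] = {}

def put(obj: bytes) -> bytes:
    key = hashlib.sha256(obj).digest()
    if key not in index:                    # content-addressed: dedup for free
        offset = blobfile.seek(0, io.SEEK_END)
        blobfile.write(obj)
        index[key] = (offset, len(obj))
    return key

def get(key: bytes) -> bytes:
    offset, length = index[key]
    blobfile.seek(offset)                   # the fseek(3)-style jump
    return blobfile.read(length)
```

At 1e9 keys the index itself is the interesting part (32-byte key plus offset and length is roughly 48 GB raw), which is where the trie or a key-prefix scheme earns its keep.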
@attilakinali @job the other relevant factor here wrt LMDB is he mentions wanting to use the DB across NFS. LMDB's built in lock management won't work for multiple client hosts accessing the same DB, he'd have to manage locking himself. And NFS' interaction with pagecache is inconsistent at best, mmap across NFS isn't known for great performance or coherence.
@hyc @job Yes, that's one thing I wouldn't do. NFS works great for file storage, not so much for data storage. If anything, I'd pair a locally running storage engine with a small RPC shim if remote access is required. Not only would that be faster, it would also eliminate quite a few error modes that storing a DB on NFS comes with. But that depends on OP's needs, environment, and use case.

@hyc
BTW: Do you have a recommended reading list for people wanting to know more than what textbooks cover on b-trees, LSM, etc. Especially on the real world aspects?

@job

Howard Chu - LMDB [The Databaseology Lectures - CMU Fall 2015]

@hyc
Cool, I'll watch it this evening. Thanks a lot!
@job
@job What do you mean by "nested"? Just one level should always suffice & perform a lot better.
@dalias /mnt/nfs/XX/YY/ZZ/XXYYZZRESTOFHASH where XX, YY, and ZZ are the first few bytes of the hash
@job Yeah just do XXX/REST
@dalias why?
@job Only 2 directory lookups instead of 4+.
@dalias @job This is going to hurt on directory lookups (directory implementations are optimized for ca. 10³ entries). On the other hand, NFS client path traversal is lockstep, so you're going to get N×RTT where N is the number of directories in the path. (Plus another RTT for the actual read RPC.) Large files with an index that can be held open across multiple requests will perform much better. In theory a custom NFS client that didn't go through namei() would avoid the pipeline stalls.
@dalias @job I advise my users to store this kind of data in a ZIP file; they can hold it open during training and random seeks in an open file are much less costly than opening random tiny files to read a single block. There's plenty of memory to hold the whole central ZIP directory. The users ignore my advice.
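The open-ZIP approach might look like this in Python (member names and blobs are made up):

```python
import io
import zipfile

# Sketch of the "hold one ZIP open" advice: members named by hash, the
# central directory stays in memory once the file is opened, and each
# read is a seek within one open file rather than an open() of a tiny
# file. BytesIO stands in for a real on-disk archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:       # ZIP_STORED: no compression
    zf.writestr("deadbeef", b"blob one")
    zf.writestr("cafebabe", b"blob two")

archive = zipfile.ZipFile(buf, "r")          # open once, keep open across requests
blob = archive.read("cafebabe")              # random access by member name
```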
@wollman @job NFS round trips are going to dominate access time. Underlying directory lookups should not be slow until directory sizes get astronomical.
@job @dalias don't repeat the full hash in the filename, e.g. /mnt/nfs/XXXX/RESTOFHASH; it will ease the lookup in the last directory. (Also note I've used only one directory level: 64k entries in the root directory should be handled nicely, and for 10^9 files each subdirectory will contain ~15k entries.)
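The arithmetic behind that single-level layout checks out; a quick Python sketch (the path shape follows the suggestion above):

```python
import hashlib

# Single-level sharding: a 4-hex-char (2-byte) top level gives 65,536
# directories, so a billion objects land at roughly 15k files per
# directory, and the hash is not repeated in the leaf filename.
h = hashlib.sha256(b"example object").hexdigest()
path = f"/mnt/nfs/{h[:4]}/{h[4:]}"   # /mnt/nfs/XXXX/RESTOFHASH

dirs = 16 ** 4                        # 65,536 top-level directories
per_dir = 1_000_000_000 / dirs        # ~15,259 files each
```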
@job If only there were a key-value based storage system, like Berkeley DB and derived systems.
@czauner I mention Berkeley DB in my post, yes?

@job
Yes, you did. Sorry. But one question: why SHA256? You are aiming for speed, not cryptographic security, if I understand you correctly.

Why not go the computationally cheaper route of SHA1 or even MD5? Or CRC64 (you'd need a collision handler then, for sure)?
That also leads to shorter filenames.

@czauner everything in this particular ecosystem is addressed by SHA256() so I figured that the computational effort is offset by ease of constructing lookup keys

@job
Well, I just assume you will run into some latency problems here; you will need to shave off every microsecond. NFS does not help either: even a 1 ms RTT limits you rather brutally to 1k ops/sec. You can go and try to parallelize the fuck out of your implementation.
If your NFS server has an ext filesystem underneath, you need to enable dir_index there (ext filesystems are painfully slow when there are a ton of files in a given directory). Or you go the XFS route (but I'm no expert in Linux filesystems, that is).

Or, from a ZFS perspective (ZFS is also available for Linux):
tune the hell out of the metadata cache, including a decent L2ARC on an NVMe. You might also consider playing with the recordsize, but given that you are primarily after read performance, that might not be an issue. Ah yes, and generally turn atime off.
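The RTT arithmetic above (1 ms round trips cap a serial client near 1k ops/sec) and what parallelizing buys can be illustrated with a toy Python sketch, where fetch() fakes a fixed-latency round trip; all names and numbers are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of the RTT math: a serial loop pays RTT per request,
# so throughput has to come from keeping many requests in flight.
RTT = 0.001  # 1 ms fake network round trip

def fetch(key: int) -> int:
    time.sleep(RTT)          # stand-in for one NFS round trip
    return key

keys = list(range(200))
start = time.monotonic()
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(fetch, keys))   # 50 requests in flight
elapsed = time.monotonic() - start
# A serial loop would need ~200 ms here; 50-wide pipelining needs only
# a few round trips' worth of wall time.
```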

@job Do you need this to be distributed?

What kind of directory structure does the remote filesystem have? Linear list or btree? What is your latency goal?

XFS and modern ext4 offer trees; I don't know what you have.

With a tree you shouldn't have to split into subdirectories, so fewer NFS roundtrips if remote is a requirement.

This is only about 2 TB of data; you should be able to keep a btree in memory and only do the final read from media, and even that could be a single read.

@job Simply a large directory with files inside, with name == SHA256? XFS performs pretty well under these constraints, and while the file metadata certainly is annoying overhead, it sounds like a robust, easy-to-use, and relatively fast option. (Data structure used in that case: cleverly organized B+-trees, with storage-respecting scheduling of modifications.)

If that's not an option, any key-value store would do, as mentioned by the others. I'm not sure why you mention NFS – sounds like you want…

@job remote access? Multiple separate remote concurrent read, single write access with low coherency guarantees? Or do you need things to be fully transactional, i.e., no reader could ever see a half-written object? Do you need guarantees that the moment a write finishes its write, everyone else sees the same data, or might a bit of a propagation delay be OK?
And while I'm incessantly asking: what *are* acceptable read latencies for you?
Do you need high availability/how regularly take backup?
@funkylab multiple remote readers, no transactions, everything addressed by content hash, should propagate in seconds. Ideally able to read at least 10K objects / second.

@job yeah, a Postgres table with a blob column; I haven't done that, but you should be able to let the DB calculate the hash on insertion instead of having to supply it; the column type would be something like

hashcolumn TEXT GENERATED ALWAYS AS (encode(sha256(yourblobcolumn), 'hex')) STORED

or similar. Considering one billion entries, you might want to add an index on that column.

@job I'm a bit confused. From what I know, NFS is a way to mount a directory over the network, and the actual storage is handled by the underlying FS on the remote machine.

But many of the other things you mentioned are in-process local databases...

Do you need to read/write those blobs over a network?

@mo yeah, sorry for the awkward phrasing! The goal is to allow multiple servers to read the data.
@job postgres? :D
@mo might be a good one yes
@job https://github.com/seaweedfs/seaweedfs is where I'd start. That or clickhouse.
GitHub - seaweedfs/seaweedfs: SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, xDC replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding. Enterprise version is at seaweedfs.com.

@job Any specific reason for not trying Postgres? Whenever people seem to do something special someone suggest Postgres, optionally with a plugin. Quick search suggests https://stackoverflow.com/a/62926616
@job For your NFS method, a shallow and wide directory structure may yield better performance than a narrow and deep one. NFS server filesystems tend to have difficulty caching the metadata when you have few top-level directories and a structure that is multiple subdirectories deep.
@it where are the cutoffs / cliffs, you reckon? I currently use 4096 * 4096 * 4096
@job I don’t remember the exact advice received from an engineer at a major NAS storage vendor, but if I recall correctly, I worked on a project where we more than tripled performance by changing from three levels deep to two levels deep. So, in your case, maybe try 1.67772e7 x 4096?
I’ll try to find my notes and reply later with more details.
@job @it In ZFS the cutoff is 2,048 entries per directory — more than that requires a "full fat" ZAP and any updates will have write amplification due to indirect blocks.
@wollman Wow! I had no idea ZFS had such a low limit on directory entries. Thanks! So, @job it seems that my suggestion is not valid or helpful for your needs. :(
@it @job It's not a limit, it's just the point at which directories are big enough to require a less efficient representation.