I'd like to store one billion variable-length binary objects and retrieve them by SHA-256(obj) key. The median object size is 2 kilobytes. Low read/write volumes.

What I've tried so far: NFS with the hash's first few octets as nested directory names, which works but is a bit slow; I also tried ZeroFS, which was also too slow.
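The sharded-directory layout described above can be sketched like this (stdlib only; the two-hex-digit, two-level fan-out is an assumption, not what OP necessarily used):

```python
import hashlib
import os

def blob_path(root, data):
    """Derive a nested path from SHA-256(data), e.g. root/2c/f2/2cf2...
    Two 2-hex-digit levels give 65,536 leaf directories (~15k objects
    each at a billion objects), keeping per-directory listings small."""
    digest = hashlib.sha256(data).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest)

def put(root, data):
    """Write a blob at its content-addressed path and return the path."""
    path = blob_path(root, data)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

The slowness OP reports is typical of this scheme on NFS: every lookup costs several round-trip directory traversals before the data is even opened.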

Under consideration: DuckDB, RocksDB, BerkeleyDB, SQLite3, LMDB, something bespoke

Recommendations? Things/papers I should be reading?

@job This is a prime use case for a key-value store. Since you only need a few TB of storage, put the data on a local NVMe drive.

As for which KV store, any B-tree-based one should be fine unless you need the best possible performance; then it comes down to testing with real data and access patterns. Hint: pre-sort the data by key before loading it into the KV store for better performance.
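The pre-sorting hint can be illustrated with SQLite (one of the candidates, and stdlib in Python); sorted insertion fills the B-tree left to right, mostly sequential page writes instead of random splits. The schema and function name here are just for illustration:

```python
import hashlib
import sqlite3

def bulk_load(db_path, blobs):
    """Insert blobs keyed by SHA-256, pre-sorted by key so the B-tree
    pages are written mostly in order during the load."""
    records = sorted((hashlib.sha256(b).digest(), b) for b in blobs)
    con = sqlite3.connect(db_path)
    # WITHOUT ROWID stores rows directly in the primary-key B-tree,
    # avoiding a second index on the 32-byte key.
    con.execute("CREATE TABLE IF NOT EXISTS objects"
                " (key BLOB PRIMARY KEY, value BLOB) WITHOUT ROWID")
    with con:
        con.executemany("INSERT OR IGNORE INTO objects VALUES (?, ?)",
                        records)
    return con

con = bulk_load(":memory:", [b"foo", b"bar", b"baz"])
key = hashlib.sha256(b"bar").digest()
row = con.execute("SELECT value FROM objects WHERE key = ?",
                  (key,)).fetchone()
```

SHA-256 keys are uniformly random, so without pre-sorting every insert lands on a random leaf page, which is the worst case for a larger-than-RAM B-tree.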

@attilakinali @job if he only needs point lookups and not ranges or ordered lookups, an extensible hash will probably be fastest. This would be one of the rare times where I wouldn't recommend LMDB first.
@hyc @job The problem with hash-based storage systems is that you need to keep the table sparse or you get too many collisions. With billions of records, the table itself becomes large enough to be an issue. Meanwhile, there are plenty of techniques for making B-trees as fast as, if not faster than, hash tables once the data no longer fits in RAM. (See "Modern B-Tree Techniques" by Graefe for an in-depth treatment.)
@attilakinali @job yeah... thanks, I understand the ins and outs of B+trees better than current textbooks do...
@hyc @job Oh... sorry. I did not want to imply anything of that kind!

@attilakinali @job anyway, it may well be that B+trees are still the superior solution, as you suggest. That was the conclusion we reached 20-some years ago when testing BerkeleyDB's hashes vs its B+tree implementation for larger-than-RAM DBs in OpenLDAP. But I didn't want to assume that our experience would generalize to everyone's.

A trie might be best, with keys in one file whose values are offsets into blobs in a separate append-only file.

@attilakinali @job when you're just accumulating records that are never deleted or modified, you can simplify quite a lot relative to a regular k/v store.
@attilakinali @job the other relevant factor here wrt LMDB is that he mentions wanting to use the DB across NFS. LMDB's built-in lock management won't work for multiple client hosts accessing the same DB; he'd have to manage locking himself. And NFS's interaction with the page cache is inconsistent at best; mmap across NFS isn't known for great performance or coherence.
@hyc @job Yes, that's one thing I wouldn't do. NFS works great for file storage, not so much for database storage. If anything, I'd pair a locally running storage engine with a small RPC shim if remote access is required. Not only would that be faster, it would also eliminate quite a few error modes that storing a DB on NFS comes with. But that depends on OP's needs, environment, and use case.
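The "local engine plus RPC shim" idea can be sketched with nothing but the stdlib HTTP server; an in-memory dict stands in for the real engine (LMDB, SQLite, ...), and everything else about the interface is an assumption for illustration:

```python
import hashlib
import http.server
import threading

STORE = {}  # stand-in for the local storage engine

class ShimHandler(http.server.BaseHTTPRequestHandler):
    """Tiny shim: PUT a blob, GET it back by its SHA-256 hex key.
    A real shim would add auth, size limits, and batching."""

    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        data = self.rfile.read(length)
        key = hashlib.sha256(data).hexdigest()
        STORE[key] = data
        self.send_response(201)
        self.send_header("Content-Length", str(len(key)))
        self.end_headers()
        self.wfile.write(key.encode())

    def do_GET(self):
        data = STORE.get(self.path.lstrip("/"))
        if data is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), ShimHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

One round trip per lookup, with locking and caching kept entirely on the host that owns the disk, which is exactly the failure-mode reduction mentioned above.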

@hyc
BTW: Do you have a recommended reading list for people who want to know more than what textbooks cover on B-trees, LSM trees, etc.? Especially the real-world aspects?

@job

Howard Chu - LMDB [The Databaseology Lectures - CMU Fall 2015]

@hyc
Cool, I'll watch it this evening. Thanks a lot!
@job