Got unreasonably excited about this new, incredibly straightforward count-distinct algorithm. The CVM algorithm is a direct replacement for HyperLogLog, it nerd-sniped Donald Knuth for weeks, *and* it can easily be taught in an entry-level CS course.

h/t @munin
https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

Computer Scientists Invent an Efficient New Way to Count | Quanta Magazine

By making use of randomness, a team has created a simple algorithm for estimating large numbers of distinct objects in a stream of data.


It is a VERY simple algorithm. You can code it up in an hour or two, including tests.

Knuth's note is a bit denser, but it still shows how utterly fascinated he was by it.

https://www-cs-faculty.stanford.edu/~knuth/papers/cvm-note.pdf

Note: "count-distinct" means counting the number of distinct elements in a stream of data, using less memory than it would take to maintain an exact count. So the result is an estimate, and CVM produces an estimate with really good bounds.
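For anyone who wants to see just how simple it is, here's a minimal Python sketch of the algorithm as the paper describes it (remove the item, re-admit it with probability p, halve p when the buffer fills). The function name, the `capacity` parameter, and the seeded RNG are my own choices, not from the paper:

```python
import random

def cvm_estimate(stream, capacity, rng=None):
    """Estimate the number of distinct items in `stream`,
    remembering at most `capacity` items at a time (CVM algorithm)."""
    rng = rng or random.Random(0)
    p = 1.0       # current sampling probability
    buf = set()   # the "reservoir" of remembered items
    for a in stream:
        buf.discard(a)            # forget a if we already hold it...
        if rng.random() < p:      # ...then re-admit it with probability p
            buf.add(a)
        if len(buf) >= capacity:  # buffer full: evict each item with
            buf = {x for x in buf if rng.random() < 0.5}  # prob 1/2
            p /= 2                # and halve the sampling probability
    # Each surviving distinct item is present with probability p,
    # so the buffer size scaled by 1/p estimates the distinct count.
    return len(buf) / p
```

(The paper also handles the astronomically unlikely case where the subsampling step removes nothing; the sketch above ignores that for clarity.)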
@rain oh that looks super interesting! the tradeoffs with hll are gonna need a better brain day to understand.
@rain @munin @nwf Okay, yeah, that is one of those "so simple and so brilliant" algorithms.
@rain
it sounds like a variation of reservoir sampling... weird that it was only discovered now?
@rain @munin but it can't possibly be nearly as efficient as linear counting or the LogLog variants, given that it needs to look up each new value in the reservoir, instead of just computing a hash and blindly setting bits (LC) or counting leading zeroes in the hash (LL), right?

@tobinbaker @rain

hash computation is much more expensive than value comparison, even across many values. Since the table of values can be kept usefully small, the maximum lookup time can likewise be constrained to a predictable, tunable bound, so the matching can take place in an amount of time comparable to or less than the time required to compute a hash.

@tobinbaker @munin Knuth's note has some discussion about implementation -- he suggests a treap (random binary search tree). This means that you don't need to construct hashes, just compare elements with each other (in Rust terms, you only need Ord, not Hash).
@rain @munin well, even neglecting value comparisons, that's potentially a lot of pointer-chasing. I think this is all quite context-dependent (e.g. hashing throughput, cache miss rate for the treap), but I would be surprised if a CVM impl achieved comparable perf to LC or LL. Might try implementing both myself and comparing in a favorable scenario for CVM (cheap comparisons, small reservoir size).
@tobinbaker @munin Well I'd imagine that a fast impl would use trees of contiguous arrays, something closer to a b-tree. But worth profiling and such
@rain @munin what does X \ { a_i } mean? What is \ ?
@felipe @munin hmm, where is that from?
@rain @munin It's from the paper linked in the article, the second line in the loop here:
@felipe @munin Ah thanks. \ is notation for set difference -- so in this case it just means "remove a_i from X"
Counting Distinct Items with the CVM Algorithm

This notebook implements an approximate algorithm for counting the number of distinct items in a stream of data. For large streams of data which may not even fit in memory, this is a classic computer science problem. The algorithm implemented here comes from the paper Distinct Elements in Streams: An Algorithm for the (Text) Book by Chakraborty, Vinodchandran, and Meel, all of which is nicely summarized in Quanta Magazine's article, Computer Scientists Invent an Efficient New Way to Count.


@rreusser @felipe @rain @observablehq

lol, amazing, it forms a normal curve around the 'right' value, just like you'd intuitively expect

@munin @felipe @rain @observablehq Yeah. I observe that, relative to the advertised guarantee of a 50% chance of getting within 50% of the right answer, it's actually quite a bit more accurate. I suppose it's just a lower bound that probably isn't so meaningful for such small inputs, but 🤷
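That "more accurate than advertised" observation is easy to check empirically: run the algorithm many times against a stream with a known distinct count and measure how often the estimate lands within 50% of the truth. A quick Monte Carlo sketch (my own compact re-implementation of CVM; buffer size and trial counts are arbitrary choices):

```python
import random

def cvm(stream, cap, rng):
    """Compact CVM: remove, re-admit with prob p, halve p when full."""
    p, buf = 1.0, set()
    for a in stream:
        buf.discard(a)
        if rng.random() < p:
            buf.add(a)
        if len(buf) >= cap:
            buf = {x for x in buf if rng.random() < 0.5}
            p /= 2
    return len(buf) / p

# How often does a tiny 64-slot buffer land within 50% of the truth?
gen = random.Random(0)
distinct, trials, hits = 500, 200, 0
for t in range(trials):
    stream = [gen.randrange(distinct) for _ in range(5000)]
    est = cvm(stream, cap=64, rng=random.Random(t))
    if abs(est - len(set(stream))) / len(set(stream)) <= 0.5:
        hits += 1
print(hits / trials)  # typically well above the guaranteed 0.5
```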
@rreusser @munin @felipe @observablehq Knuth's note mentions that he was only able to prove bounds for a much coarser function

@rain @munin thank you. I love approximate data structures and that is much simpler than the version of this that I had seen before.

(Sadly, I think I have only had a chance to use approximate data structures once in “real life”: approximate counting up to N using log2(log2(N)) bits.)
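That log2(log2(N))-bit counter sounds like Morris's approximate counter; assuming that's what was meant, a toy sketch (naming is mine):

```python
import random

def morris_count(n_events, rng=None):
    """Morris's approximate counter: store only an exponent c and bump it
    with probability 2**-c, so counting to N needs ~log2(log2(N)) bits."""
    rng = rng or random.Random(0)
    c = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1  # 2**c - 1 is an unbiased estimate of the count
```

Each individual estimate is noisy (the variance is on the order of n**2), but it is unbiased, so averaging many independent counters recovers the true count.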

@rain @munin I got excited also, included it in my talk last year about counting things. My “ELI5” explanation of it at 21:51 https://m.youtube.com/watch?v=EpNooE5sEes
Monitorama PDX 2024 - Use counters to count things


@rain @munin I missed it when it came out, super interesting!

In Big Data settings, HLL is attractive because it can be parallelized, and it's useful to persist the probabilistic data structure ("sketch") to delay computing the distinct count.

This algorithm is so simple that I wonder if I have the chops to derive a merge procedure for separate buffers, either for the original CVM or for Knuth's version...

@rain @munin I thought this was reservoir sampling? Huh, should figure out what distinguishes them.
@rain @munin Damn, that's beautiful!
@rain @munin This is wonderful! Thank you for bringing it to my attention.

@rain this is pinging the 'zipf zipf zipf' bit of my brain heavily.

It's really cool!