Got unreasonably excited about this new, incredibly straightforward count-distinct algorithm. The CVM algorithm is a direct replacement for HyperLogLog, it nerd-sniped Donald Knuth for weeks, *and* it can easily be taught in an entry-level CS course.

h/t @munin
https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

Computer Scientists Invent an Efficient New Way to Count | Quanta Magazine

By making use of randomness, a team has created a simple algorithm for estimating large numbers of distinct objects in a stream of data.


It is a VERY simple algorithm. You can code it up in an hour or two, including tests.

Knuth's note is a bit denser, but it still shows how utterly fascinated he was by it.

https://www-cs-faculty.stanford.edu/~knuth/papers/cvm-note.pdf

Note: "count-distinct" means counting the number of distinct elements in a stream of data, using less memory than it would take to maintain an exact count. So the result is an estimate, and CVM produces an estimate with really good bounds.
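For anyone who wants to see just how simple it is, here's a minimal Python sketch of the algorithm as the paper describes it (remove the item, re-admit it with probability p, halve p when the buffer fills). The function name, the `capacity` parameter, and the seeded RNG are my own choices, not from the paper:

```python
import random

def cvm_estimate(stream, capacity, rng=None):
    """Estimate the number of distinct items in `stream`,
    remembering at most `capacity` items at a time (CVM algorithm)."""
    rng = rng or random.Random(0)
    p = 1.0       # current sampling probability
    buf = set()   # the "reservoir" of remembered items
    for a in stream:
        buf.discard(a)            # forget a if we already hold it...
        if rng.random() < p:      # ...then re-admit it with probability p
            buf.add(a)
        if len(buf) >= capacity:  # buffer full: evict each item with
            buf = {x for x in buf if rng.random() < 0.5}  # prob 1/2
            p /= 2                # and halve the sampling probability
    # Each surviving distinct item is present with probability p,
    # so the buffer size scaled by 1/p estimates the distinct count.
    return len(buf) / p
```

(The paper also handles the astronomically unlikely case where the subsampling step removes nothing; the sketch above ignores that for clarity.)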
@rain oh that looks super interesting! the tradeoffs with hll are gonna need a better brain day to understand.
@rain @munin @nwf Okay, yeah, that is one of those "so simple and so brilliant" algorithms.
@rain
it sounds like a variation of reservoir sampling... weird that it was only discovered now?
@rain @munin but it can't possibly be nearly as efficient as linear counting or the LogLog variants, given that it needs to look up each new value in the reservoir, instead of just computing a hash and blindly setting bits (LC) or counting leading zeroes in the hash (LL), right?

@tobinbaker @rain

hash computation is much more expensive than value comparison, even across many values. Since the table of values can be kept usefully small, the maximum lookup time can likewise be constrained to a predictable, tunable bound, so the matching can take place in an amount of time comparable to or less than the time required to compute a hash.

@tobinbaker @munin Knuth's note has some discussion about implementation -- he suggests a treap (random binary search tree). This means that you don't need to construct hashes, just compare elements with each other (in Rust terms, you only need Ord, not Hash).
@rain @munin well, even neglecting value comparisons, that's potentially a lot of pointer-chasing. I think this is all quite context-dependent (e.g. hashing throughput, cache miss rate for the treap), but I would be surprised if a CVM impl achieved comparable perf to LC or LL. Might try implementing both myself and comparing in a favorable scenario for CVM (cheap comparisons, small reservoir size).
@tobinbaker @munin Well I'd imagine that a fast impl would use trees of contiguous arrays, something closer to a b-tree. But worth profiling and such
@rain @munin what does X \ { a_i } mean? What is \ ?
@felipe @munin hmm, where is that from?
@rain @munin It's from the paper linked in the article, the second line in the loop here:
@felipe @munin Ah thanks. \ is notation for set difference -- so in this case it just means "remove a_i from X"
Counting Distinct Items with the CVM Algorithm

This notebook implements an approximate algorithm for counting the number of distinct items in a stream of data. For large streams of data which may not even fit in memory, this is a classic computer science problem. The algorithm implemented here comes from the paper Distinct Elements in Streams: An Algorithm for the (Text) Book by Chakraborty, Vinodchandran, and Meel, all of which is nicely summarized in Quanta Magazine's article, Computer Scientists Invent an Efficient New Way to Count.


@rreusser @felipe @rain @observablehq

lol, amazing, it forms a normal curve around the 'right' value, just like you'd intuitively expect

@munin @felipe @rain @observablehq Yeah. I observe that, relative to the advertised guarantee of a 50% chance of getting within 50% of the right answer, it's actually quite a bit more accurate. I suppose it's just a lower bound that probably isn't so meaningful for such small inputs, but 🤷
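That "more accurate than advertised" observation is easy to check empirically: run the algorithm many times against a stream with a known distinct count and measure how often the estimate lands within 50% of the truth. A quick Monte Carlo sketch (my own compact re-implementation of CVM; buffer size and trial counts are arbitrary choices):

```python
import random

def cvm(stream, cap, rng):
    """Compact CVM: remove, re-admit with prob p, halve p when full."""
    p, buf = 1.0, set()
    for a in stream:
        buf.discard(a)
        if rng.random() < p:
            buf.add(a)
        if len(buf) >= cap:
            buf = {x for x in buf if rng.random() < 0.5}
            p /= 2
    return len(buf) / p

# How often does a tiny 64-slot buffer land within 50% of the truth?
gen = random.Random(0)
distinct, trials, hits = 500, 200, 0
for t in range(trials):
    stream = [gen.randrange(distinct) for _ in range(5000)]
    est = cvm(stream, cap=64, rng=random.Random(t))
    if abs(est - len(set(stream))) / len(set(stream)) <= 0.5:
        hits += 1
print(hits / trials)  # typically well above the guaranteed 0.5
```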
@rreusser @munin @felipe @observablehq Knuth's note mentions that he was only able to prove bounds for a much coarser function

@rain @munin thank you. I love approximate data structures and that is much simpler than the version of this that I had seen before.

(Sadly, I think I have only had a chance to use approximate data structures once in “real life”: approximate counting up to N using log2(log2(N)) bits.)
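That log2(log2(N))-bit counter sounds like Morris's approximate counter; assuming that's what was meant, a toy sketch (naming is mine):

```python
import random

def morris_count(n_events, rng=None):
    """Morris's approximate counter: store only an exponent c and bump it
    with probability 2**-c, so counting to N needs ~log2(log2(N)) bits."""
    rng = rng or random.Random(0)
    c = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1  # 2**c - 1 is an unbiased estimate of the count
```

Each individual estimate is noisy (the variance is on the order of n**2), but it is unbiased, so averaging many independent counters recovers the true count.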

@rain @munin I got excited also, included it in my talk last year about counting things. My “ELI5” explanation of it at 21:51 https://m.youtube.com/watch?v=EpNooE5sEes
Monitorama PDX 2024 - Use counters to count things


@rain @munin I missed it when it came out, super interesting!

In Big Data settings, HLL is attractive because it can be parallelized, and it's useful to persist the probabilistic data structure ("sketch") to delay computing the distinct count.

This algorithm is so simple that I wonder if I have the chops to derive a merge procedure for separate buffers, either for the original CVM or for Knuth's version...

@rain @munin I thought this was reservoir sampling? Huh, should figure out what distinguishes them.
@rain @munin Damn, that's beautiful!
@rain @munin This is wonderful! Thank you for bringing it to my attention.

@rain this is pinging the 'zipf zipf zipf' bit of my brain heavily.

It's really cool!