Got unreasonably excited about this new, incredibly straightforward count-distinct algorithm. The CVM algorithm is a direct replacement for HyperLogLog, it nerd-sniped Donald Knuth for weeks, *and* it can easily be taught in an entry-level CS course.

h/t @munin
https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

Computer Scientists Invent an Efficient New Way to Count | Quanta Magazine

By making use of randomness, a team has created a simple algorithm for estimating large numbers of distinct objects in a stream of data.

Quanta Magazine
@rain @munin but it can't possibly be nearly as efficient as linear counting or the LogLog variants, given that it needs to look up each new value in the reservoir, instead of just computing a hash and blindly setting bits (LC) or counting leading zeroes in the hash (LL), right?

@tobinbaker @rain

hash computation is much more expensive than value comparison, even across many values. Given that the table of values can be usefully constrained to a small number, the maximum time for a lookup can be likewise constrained to within a predictable, tunable value such that the matching can take place with an amount of time comparable to or less than the time required to compute a hash.

@tobinbaker @munin Knuth's note has some discussion about implementation -- he suggests a treap (random binary search tree). This means that you don't need to construct hashes, just compare elements with each other (in Rust terms, you only need Ord, not Hash).
@rain @munin well, even neglecting value comparisons, that's potentially a lot of pointer-chasing. I think this is all quite context-dependent (e.g. hashing throughput, cache miss rate for the treap), but I would be surprised if a CVM impl achieved comparable perf to LC or LL. Might try implementing both myself and comparing in a favorable scenario for CVM (cheap comparisons, small reservoir size).
@tobinbaker @munin Well I'd imagine that a fast impl would use trees of contiguous arrays, something closer to a b-tree. But worth profiling and such