Got unreasonably excited about this new, incredibly straightforward count-distinct algorithm. The CVM algorithm is a direct replacement for HyperLogLog, it nerd-sniped Donald Knuth for weeks, *and* it can easily be taught in an entry-level CS course.

h/t @munin
https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

Computer Scientists Invent an Efficient New Way to Count | Quanta Magazine

By making use of randomness, a team has created a simple algorithm for estimating large numbers of distinct objects in a stream of data.

Quanta Magazine
@rain @munin what does X \ { a_i } mean? What is \ ?
@felipe @munin hmm, where is that from?
@rain @munin It's from the paper linked in the article, the second line in the loop* here:
Counting Distinct Items with the CVM Algorithm

This notebook implements the an approximate algorithm for counting the number of distinct items in a stream of data. For large streams of data which may not even fit in memory, this is a classic computer science problem. The algorithm implemented here comes from the paper Distinct Elements in Streams: An Algorithm for the (Text) Book by Chakraborty, Vinodchandran, and Meel, all of which is nicely summarized in Quanta Magazine's article, Computer Scientists Invent an Efficient New Way to Count. It's so simpl

Observable

@rreusser @felipe @rain @observablehq

lol, amazing, it forms a normal curve around the 'right' value, just like you'd intuitively expect

@munin @felipe @rain @observablehq Yeah. I observe that for a 50% chance of getting within 50% of the right answer, it's actually quite a bit more accurate than advertised. I suppose it's just a lower bound that probably isn't so meaningful for such small inputs, but 🤷
@rreusser @munin @felipe @observablehq Knuth's note mentions that he was only able to prove bounds for a much coarser function