Mastodawn

rain 🌦️May 20, 2024

Got unreasonably excited about this new, incredibly straightforward count-distinct algorithm. The CVM algorithm is a direct replacement for HyperLogLog, it nerd-sniped Donald Knuth for weeks, *and* it can easily be taught in an entry-level CS course.

h/t @munin
https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

Computer Scientists Invent an Efficient New Way to Count | Quanta Magazine

By making use of randomness, a team has created a simple algorithm for estimating large numbers of distinct objects in a stream of data.

Quanta Magazine

Show thread

skategoat 🐐 🇵🇸May 21, 2024

@rain @munin what does X \ { a_i } mean? What is \ ?

Show thread

rain 🌦️May 21, 2024

@felipe @munin hmm, where is that from?

Show thread

skategoat 🐐 🇵🇸May 21, 2024

@rain @munin It's from the paper linked in the article, the second line in the loop* here:

Show thread

Ricky Reusser May 21, 2024

@felipe @rain @munin It felt impossible not to just give this a quick try, so I implemented it in an @observablehq notebook.

https://observablehq.com/@rreusser/counting-distinct-items-with-the-cvm-algorithm?collection=@rreusser/writeups

Counting Distinct Items with the CVM Algorithm

This notebook implements the an approximate algorithm for counting the number of distinct items in a stream of data. For large streams of data which may not even fit in memory, this is a classic computer science problem. The algorithm implemented here comes from the paper Distinct Elements in Streams: An Algorithm for the (Text) Book by Chakraborty, Vinodchandran, and Meel, all of which is nicely summarized in Quanta Magazine's article, Computer Scientists Invent an Efficient New Way to Count. It's so simpl

Observable

Show thread

Fi 🏳️‍⚧️May 21, 2024

@rreusser @felipe @rain @observablehq

lol, amazing, it forms a normal curve around the 'right' value, just like you'd intuitively expect

Show thread

Ricky Reusser

@munin @felipe @rain @observablehq Yeah. I observe that for a 50% chance of getting within 50% of the right answer, it's actually quite a bit more accurate than advertised. I suppose it's just a lower bound that probably isn't so meaningful for such small inputs, but 🤷

Show thread

rain 🌦️May 21, 2024

@rreusser @munin @felipe @observablehq Knuth's note mentions that he was only able to prove bounds for a much coarser function