
I implemented a two-pass duplicate detection algorithm for duplik (see https://codeberg.org/hgrsd/duplik/pulls/2).

The idea is that on pass one, we don't read or hash any files; we only record their sizes. Then, on pass two, we find the files that share a size with at least one other file (i.e., potential duplicates) and hash only those.
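To make the two passes concrete, here is a minimal sketch in Python rather than Zig. It is not duplik's actual code: the names `find_duplicates` and `sha256_of` are made up for illustration, SHA-256 is an assumed choice of hash, and the max-depth limit is left out.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 16):
    """Hash a file's contents in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of paths under root whose contents hash identically."""
    # Pass one: stat every regular file and group paths by size.
    # No file contents are read here.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    # Pass two: hash only the files whose size is shared with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[(size, sha256_of(path))].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling `find_duplicates(".")` returns a list of groups, each a sorted list of paths with matching contents. The saving comes from pass one: a stat call is far cheaper than reading a file, and most files have a size nobody else shares.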

This led to a >100x speedup on nested directories. E.g., scanning my home directory to a depth of 5 took 1.3 seconds, versus 135 seconds with the previous implementation.

As a bonus, I shot myself with the footgun of calling .write() on a Writer instead of .writeAll() without realising that this doesn't guarantee all bytes are in fact written. You live and you learn.
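The same trap exists outside Zig. As a rough Python analogue (not the Zig Writer API), `os.write` returns how many bytes it actually wrote, which can be fewer than requested, so a "write all" has to loop. The helper name `write_all` below is made up for illustration.

```python
import os


def write_all(fd, data):
    """Call os.write repeatedly until every byte of data has been written."""
    view = memoryview(data)
    while len(view) > 0:
        # os.write may write only part of the buffer; it returns the count.
        written = os.write(fd, view)
        view = view[written:]
```

Calling `os.write(fd, data)` once and ignoring the return value is the equivalent of my .write() mistake: it often works in testing and then silently truncates output on a pipe or a large buffer.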

#opensource #zig #programming #code

two-pass duplicate detection

This PR changes the duplicate detection algorithm to be based on two passes. Pass one `stats` all files that are found in the directory tree, up until max depth, to get their file sizes. Pass two looks only at file sizes that have >1 files associated with them, as only files that share the...

Codeberg.org