Optimisation time for https://codeberg.org/hgrsd/duplik
This was fun, and I learned a lot. I'm using #Zig 's new-ish Io interface, with `std.Io.Queue` acting as a channel, to read and hash files concurrently. The hashing fans out to as many worker tasks as there are cores available (unless overridden with the --workers flag), and a single reader takes the hashed files off the queue and groups them by hash.
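For anyone who wants the shape of that fan-out without reading the commit, here is a rough sketch of the pattern. It is in Python rather than Zig so it doesn't depend on the exact `std.Io` signatures, and it is not the actual duplik code: the names (`hash_file`, `worker`, `find_duplicates`), the choice of SHA-256, the queue size, and the way paths are split between workers are all assumptions made for illustration.

```python
import hashlib
import os
import queue
import threading
from collections import defaultdict

_DONE = object()  # sentinel: each worker sends one when it has finished


def hash_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def worker(paths, results):
    """Hash each assigned path and send (digest, path) down the channel."""
    for path in paths:
        try:
            results.put((hash_file(path), path))
        except OSError:
            pass  # unreadable file: skipped in this sketch
    results.put(_DONE)


def find_duplicates(paths, workers=None):
    """Fan hashing out to `workers` threads; group the results in one reader."""
    paths = list(paths)
    workers = workers or os.cpu_count() or 1  # default: one worker per core
    results = queue.Queue(maxsize=64)  # bounded: workers block if the reader lags
    threads = [
        threading.Thread(target=worker, args=(paths[i::workers], results))
        for i in range(workers)
    ]
    for t in threads:
        t.start()

    groups = defaultdict(list)
    remaining = workers
    while remaining:  # single reader: the only code that touches `groups`
        item = results.get()
        if item is _DONE:
            remaining -= 1
        else:
            digest, path = item
            groups[digest].append(path)
    for t in threads:
        t.join()
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Because only the reader touches the grouping map, it needs no lock; the queue is the only shared state between the workers and the reader.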
Additionally, in the file-size-based prefilter (a file whose size no other file shares cannot have a duplicate, so there's no need to hash it), I now use `Dir.statFile` instead of `file.stat`. That is a single syscall, instead of opening a file handle, statting the file, and then closing the handle again.
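The prefilter idea can be sketched like this, again in Python as an analogy rather than the real Zig code. `os.stat` on a path stands in for `Dir.statFile`, and the open/fstat/close route stands in for the older `file.stat` approach; the function names are hypothetical.

```python
import os
from collections import defaultdict


def size_via_handle(path):
    """The slower route, for contrast: open, fstat, close (three calls)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


def size_candidates(paths):
    """Return only the paths whose size is shared with at least one other path."""
    by_size = defaultdict(list)
    for path in paths:
        try:
            # Stat the path directly: no file handle is opened or closed.
            size = os.stat(path).st_size
        except OSError:
            continue  # vanished or unreadable: skipped in this sketch
        by_size[size].append(path)
    # A size seen only once cannot belong to a duplicate, so drop those files.
    return [p for group in by_size.values() if len(group) > 1 for p in group]
```

Only the paths returned by `size_candidates` would then need to be read and hashed.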
I'm seeing 5-6x speedups for nontrivial workloads, which is very nice. I can now detect duplicates across all of my repos, to a depth of 5, within 1.2 seconds.
Here's the commit: https://codeberg.org/hgrsd/duplik/commit/8d8cf995eec8cbebfedfd6c87ab2bd57b7582ae7. If you feel like critiquing it, please go ahead; I'm working to learn Zig so feedback is most welcome :)







