Today's bug is a `duperemove` infinite looping bug: https://github.com/markfasheh/duperemove/pull/376
There `duperemove` was not able to dedupe against a NoCOW file:
```
$ dd if=/dev/urandom bs=8M count=1 > a
$ touch b
$ chattr +C b
$ cat a >> b
$ ./duperemove -d -q --batchsize=0 --dedupe-options=partial,same a b
<hangup>
```
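The hang has the shape of a dedupe loop that never observes progress. Below is a hypothetical Python sketch of that pattern, not duperemove's actual code; `try_dedupe` stands in for whatever asks the kernel to dedupe a range and reports how many bytes it managed:

```python
def dedupe_all(request_bytes, try_dedupe):
    """Keep asking the kernel to dedupe until the whole range is done."""
    remaining = request_bytes
    while remaining > 0:
        deduped = try_dedupe(remaining)
        if deduped == 0:
            # Without this bail-out the loop spins forever when the target
            # (e.g. a NoCOW file) never makes progress.
            break
        remaining -= deduped
    return request_bytes - remaining

# A target that never dedupes anything: we still terminate, deduping 0 bytes.
print(dedupe_all(8 * 1024 * 1024, lambda n: 0))  # 0
```

With the early exit in place a stubborn file simply gets skipped instead of wedging the whole run.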
I noticed it about a month ago but only got around to debugging it today. It's a 0.15 regression. Once bisected, the fix was trivial.
I am a bit curious after all whether the duperemove process that has been running since 29.11 will ever finish, or whether it will end up as part of my estate.
Today's `duperemove` bug is https://github.com/markfasheh/duperemove/issues/332.
There `duperemove` crashes when the file being deduped gets truncated down to zero.
And the bug is already fixed!
`duperemove-0.14` is a lot faster than `duperemove-0.13`!
Unfortunately it crashes sometimes on my input data. It takes about 10 minutes to observe the crash.
I wrote a trivial fuzzer to generate funny filesystem states for `duperemove`. Guess how long it takes to crash `duperemove` with it.
Spoiler: https://trofi.github.io/posts/305-fuzzing-duperemove.html
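A fuzzer of that kind can be sketched in a few lines of Python. This is my own illustrative take, not the fuzzer from the post; the operation mix, sizes and counts are made up:

```python
import os
import random
import tempfile

def fuzz_tree(root, steps=200, seed=42):
    """Apply random file operations to build a 'funny' filesystem state."""
    rng = random.Random(seed)
    files = []
    for _ in range(steps):
        op = rng.choice(["create", "append", "truncate", "clone"])
        if op == "create" or not files:
            path = os.path.join(root, f"f{len(files)}")
            with open(path, "wb") as f:
                f.write(rng.randbytes(rng.randrange(4096)))
            files.append(path)
        elif op == "append":
            with open(rng.choice(files), "ab") as f:
                f.write(rng.randbytes(rng.randrange(4096)))
        elif op == "truncate":
            os.truncate(rng.choice(files), rng.randrange(4096))
        else:  # "clone": duplicate contents to create dedupe candidates
            src = rng.choice(files)
            dst = os.path.join(root, f"f{len(files)}")
            with open(src, "rb") as s, open(dst, "wb") as d:
                d.write(s.read())
            files.append(dst)
    return files

with tempfile.TemporaryDirectory() as d:
    print(len(fuzz_tree(d)), "files generated")
```

Point the tool under test at the generated tree after each round and watch for crashes; truncations and clones are exactly the states the bugs above hide in.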
Today's `duperemove` bug is https://github.com/markfasheh/duperemove/pull/324.
There the rather aggressive `--dedupe-options=partial` option used a less optimized `sqlite` query to fetch unique file extents. That caused a full database scan every time data was queried for an individual file.
The fix swapped the `JOIN` query for a nested `SELECT` query, converting the full scan into an index lookup.
The idea of the change is to replace the linear scan of the extents table with a lookup into it in the block dedupe phase. Here are the query explanations by sqlite:

```
$ sqlite3 /tmp/foo.db
sqlite> .eqp on
```

Before...
Today's `duperemove` bug is a minor accounting bug: https://github.com/markfasheh/duperemove/pull/323
```
$ ls -lh /nix/var/nix/db/db.sqlite
1.4G /nix/var/nix/db/db.sqlite
```

Before the change:

```
$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 27065321263104 shared bytes
```

After the change:

```
$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 1169276928 shared bytes
```
The size reduction is not as impressive as initially reported :)
Before the change `filerec_count_shared()` accounted the extent end incorrectly relative to the file end:

```
$ ls -lh /nix/var/nix/db/db.sqlite
-rw-r--r-- 1 root root 1.4G Nov 9 22:21 /nix/var/nix/db/db.sq...
```
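The 27TB-vs-1.1GB gap is exactly the kind of over-count you get when an extent is allowed to run past the file's end. A minimal sketch of the corrected accounting (my illustration, not the actual `filerec_count_shared()` code):

```python
def count_shared(extents, file_size):
    """Sum shared bytes, clamping each (offset, length) extent to file end."""
    total = 0
    for loff, length in extents:
        end = min(loff + length, file_size)  # the fix: don't count past EOF
        if end > loff:
            total += end - loff
    return total

# A 128K extent backing a file that ends mid-extent:
print(count_shared([(0, 131072)], 100000))  # 100000, not 131072
```

Without the `min()` clamp every tail extent contributes its full on-disk length, and the "shared bytes" total balloons far beyond the file sizes involved.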
Today's bug is a `duperemove` quadratic slowdown: https://github.com/markfasheh/duperemove/pull/322
There `duperemove` was struggling to dedupe small files inlined into metadata entries. It kept trying to dedupe all of them as a single set (even when the files' contents did not match).
The fix is a one-liner: just don't track non-dedupable files.
Without the fix the dedupe run never finished on my system; I always had to run it on a subset to make any progress. Now the whole run takes 20 minutes.
Before the change, small files consisting of a single extent of type `FIEMAP_EXTENT_DATA_INLINE` were all hashed and stored as files with an identical checksum. The deduplication phase then attempted to deduplicate...
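The quadratic blow-up is easy to see: if every inline-extent file lands under the same sentinel checksum, they all fall into one dedupe set, and pairwise comparison of that set costs N(N-1)/2 operations. A small illustration (simplified; not duperemove's code):

```python
from collections import defaultdict

def dedupe_sets(files):
    """Group files into dedupe candidate sets keyed by checksum."""
    sets = defaultdict(list)
    for name, checksum in files:
        sets[checksum].append(name)
    return sets

# 10,000 inline files all sharing one sentinel checksum form a single set...
files = [(f"f{i}", "inline-sentinel") for i in range(10_000)]
(big_set,) = dedupe_sets(files).values()

# ...and comparing that set pairwise means ~50M candidate pairs.
n = len(big_set)
print(n * (n - 1) // 2)  # 49995000
```

Dropping inline-extent files from tracking (the one-liner) keeps them out of the checksum buckets entirely, so the pathological set never forms.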
It feels like `duperemove` could have worked a lot faster than it does today.
What would it take to get a 2x speedup on small files? A one-liner: https://github.com/markfasheh/duperemove/pull/318
There are still a ton of low-hanging improvements hiding in there.
The setup: create 100K files 1024 bytes each. This is 100MB of input:

```
echo "Creating directory structure, will take a minute"
mkdir dd
for d in `seq 1 100`; do
    mkdir dd/$d
    for f in `seq 1 1000...
```
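A scaled-down, self-contained equivalent of that setup in Python (10 directories of 10 files here instead of 100 of 1000, so it runs instantly; the layout is otherwise the same):

```python
import os
import tempfile

def make_tree(root, dirs=10, files_per_dir=10, size=1024):
    """Create dirs x files_per_dir files of `size` random bytes each."""
    for d in range(1, dirs + 1):
        subdir = os.path.join(root, str(d))
        os.mkdir(subdir)
        for f in range(1, files_per_dir + 1):
            with open(os.path.join(subdir, str(f)), "wb") as fh:
                fh.write(os.urandom(size))

with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    total = sum(
        os.path.getsize(os.path.join(dp, name))
        for dp, _, names in os.walk(root) for name in names
    )
    print(total)  # 102400 bytes: 100 files x 1024 bytes
```

Bump `dirs` and `files_per_dir` back to 100 and 1000 to reproduce the full 100MB benchmark input.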