@dbat the same issue exists in the research data management world with #DataLad / #gitAnnex. One thing I do for our storage servers is to regularly run #duperemove on them. It requires filesystem support (xfs/btrfs), but it deduplicates at the extent level, i.e. below the file level, so if the difference between two versions affects only a small part of a file it should be able to help. I wonder if it could be run as a post-commit hook, or something like that.
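
Such a hook might look roughly like this; a minimal sketch, assuming `duperemove` is on `PATH` and the repository sits on btrfs or xfs. The `.git/annex/objects` path and the hashfile location are illustrative:

```shell
#!/bin/sh
# Hypothetical .git/hooks/post-commit sketch: re-deduplicate the annex
# object store after each commit. Paths here are illustrative.
annex_objects=".git/annex/objects"
if command -v duperemove >/dev/null 2>&1 && [ -d "$annex_objects" ]; then
    # -r: recurse into the object tree; -d: actually submit dedupe
    # requests; --hashfile persists checksums so unchanged files are
    # not rescanned on the next commit.
    duperemove -r -d --hashfile=.git/annex/dedupe.hash "$annex_objects"
fi
```

The guard makes the hook a silent no-op on machines without `duperemove` or without an annex, so it is safe to install unconditionally.
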
All is well: I'm just running #duperemove.

Today's bug is a `duperemove` infinite looping bug: https://github.com/markfasheh/duperemove/pull/376

There `duperemove` was not able to dedupe against a NoCOW file:

$ dd if=/dev/urandom bs=8M count=1 > a
$ touch b
$ chattr +C b
$ cat a >> b
$ ./duperemove -d -q --batchsize=0 --dedupe-options=partial,same a b
<hangs>
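
For reference, the NoCOW state can be checked with `lsattr`. A guarded probe like the one below (guarded because the `C` attribute needs btrfs/xfs support, and `chattr +C` only works on an empty file) shows whether the current filesystem supports it at all:

```shell
# Probe whether the current filesystem supports the NoCOW attribute.
# chattr +C succeeds only on an empty file and only on btrfs/xfs.
touch nocow-probe
if chattr +C nocow-probe 2>/dev/null; then
    lsattr nocow-probe    # a 'C' in the flags column marks NoCOW
else
    echo "no NoCOW support on this filesystem"
fi
rm -f nocow-probe
```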

I noticed it about a month ago but only got around to debugging it today. It's a 0.15 regression; the fix is trivial once bisected.

#duperemove #bug

dedupe.c: fix infinite looping on NoCOW files by trofi · Pull Request #376 · markfasheh/duperemove

Now I'm a bit curious whether the duperemove process that has been running since November 29th will ever finish, or whether it will end up as part of my estate.

#duperemove

Today's `duperemove` bug is https://github.com/markfasheh/duperemove/issues/332.

There `duperemove` crashes when the file being deduped gets truncated down to zero.

And the bug is already fixed!

#duperemove #bug

`duperemove-0.14` `SIGSEGV`s in `fiemap_scan_extent()` · Issue #332 · markfasheh/duperemove

`duperemove-0.14` is a lot faster than `duperemove-0.13`!

Unfortunately it sometimes crashes on my input data. It takes about 10 minutes to observe the crash.

I wrote a trivial fuzzer to generate funny filesystem states for `duperemove`. Guess how long it takes to crash `duperemove` with it.

Spoiler: https://trofi.github.io/posts/305-fuzzing-duperemove.html
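
The real `duperemove-fuzz.bash` is in the post above; the general shape of such a fuzzer is roughly the following. Names and mutations here are illustrative, not the actual script:

```shell
#!/bin/sh
# Illustrative fuzzer shape, not the real duperemove-fuzz.bash:
# build a small random corpus, mutate it, run duperemove over it,
# and flag crashes (exit status > 128 means death by signal).
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
    head -c $((i * 4096 + 1)) /dev/urandom > "$workdir/f$i"
done
# Example mutations: truncate a couple of files down to zero.
: > "$workdir/f3"
: > "$workdir/f7"
if command -v duperemove >/dev/null 2>&1; then
    duperemove -d -q "$workdir"/f* >/dev/null 2>&1
    status=$?
    if [ "$status" -gt 128 ]; then
        echo "crash: killed by signal $((status - 128))"
    fi
fi
rm -rf "$workdir"
```

In a real fuzzer the loop runs until a crash is seen, saving the corpus that triggered it.
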

#duperemove #bug

fuzzing duperemove

Today's `duperemove` bug is https://github.com/markfasheh/duperemove/pull/324.

There the rather aggressive `--dedupe-options=partial` option used a less optimized `sqlite` query to fetch unique file extents. That caused a full database scan every time data was queried for an individual file.

The fix swapped the `JOIN` query for a nested `SELECT` query, turning the full scan into an index lookup.
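
The difference is easy to see on a toy schema. This is not duperemove's actual schema, just an illustration of scan vs. index search in sqlite's `EXPLAIN QUERY PLAN` output:

```shell
# Toy schema: a table of extents with an index on the inode column.
db=$(mktemp)
sqlite3 "$db" 'CREATE TABLE extents (ino INTEGER, off INTEGER, len INTEGER);
CREATE INDEX extents_ino ON extents(ino);'
# No index covers "len": the planner has to scan the whole table.
scan_plan=$(sqlite3 "$db" 'EXPLAIN QUERY PLAN SELECT * FROM extents WHERE len = 4096;')
echo "$scan_plan"
# "ino" is indexed: the same query shape becomes an index search.
search_plan=$(sqlite3 "$db" 'EXPLAIN QUERY PLAN SELECT * FROM extents WHERE ino = 42;')
echo "$search_plan"
rm -f "$db"
```

The first plan reports a `SCAN` of `extents`, the second a `SEARCH` using `extents_ino`; on a table with millions of extent rows, queried once per file, that is the difference between quadratic and roughly linear work.
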

#duperemove #bug

`--dedupe-options=partial`: avoid quadratic slowdown on extent count by trofi · Pull Request #324 · markfasheh/duperemove

Today's `duperemove` bug is a minor accounting bug: https://github.com/markfasheh/duperemove/pull/323

$ ls -lh /nix/var/nix/db/db.sqlite
1.4G /nix/var/nix/db/db.sqlite

Before the change:

$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 27065321263104 shared bytes

After the change:

$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 1169276928 shared bytes

The size reduction is not as impressive as initially reported :)
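
The PR description says the bug was in how the extent end was compared against the file end. A hedged numeric sketch of what correct accounting looks like (illustrative numbers only):

```shell
# Illustrative numbers only: an extent overlapping EOF must be clamped
# to the file end before its length is added to the shared total.
file_end=$((1024 * 1024))      # 1 MiB file
extent_start=$((900 * 1024))
extent_end=$((1200 * 1024))    # extent continues past EOF
end=$extent_end
if [ "$end" -gt "$file_end" ]; then
    end=$file_end
fi
shared=$((end - extent_start))
echo "shared bytes: $shared"   # 126976 (124 KiB), not 307200 (300 KiB)
```

Counting the unclamped tail for every shared extent is how a 1.4G file ends up "sharing" 27 terabytes.
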

#duperemove #bug

filerec_count_shared(): fix sharing accounting by trofi · Pull Request #323 · markfasheh/duperemove

Today's bug is a `duperemove` quadratic slowdown: https://github.com/markfasheh/duperemove/pull/322

There `duperemove` was struggling to dedupe small files inlined into metadata entries: they all got stored with an identical checksum, so it kept trying to dedupe all of them as a single set even when the files' contents did not match.

The fix is a one-liner: just don't track non-dedupable files.

Without the fix the dedupe run never finished on my system; I always had to run it on a subset of the files to get any progress. Now the whole run takes 20 minutes.
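
On btrfs such tiny files live inline in the metadata tree, and `filefrag -v` reports them with an `inline` flag. A guarded probe (guarded because FIEMAP/FIBMAP is not available on every filesystem, and output differs outside btrfs):

```shell
# Probe: does this filesystem inline tiny files? On btrfs, filefrag -v
# shows an "inline" flag for them; elsewhere the output differs.
printf 'tiny' > tiny-probe
filefrag -v tiny-probe || echo "FIEMAP/FIBMAP not available here"
rm -f tiny-probe
```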

#duperemove #bug

file_scan.c: avoid work when processing inline-only small files by trofi · Pull Request #322 · markfasheh/duperemove

It feels like `duperemove` could have worked a lot faster than it does today.

What would it take to get a 2x speedup on small files? A one-liner: https://github.com/markfasheh/duperemove/pull/318

There is still a ton of low-hanging improvement hiding in there.
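
The benchmark corpus from the PR (100 directories of 1000 files, 1024 bytes each, about 100MB in total) can be sketched like this, scaled down so it runs in seconds rather than minutes:

```shell
# Scaled-down sketch of the PR's benchmark corpus (the original uses
# 100 dirs x 1000 files of 1024 random bytes each, ~100MB in total).
dirs=4
files_per_dir=25
mkdir -p dd
for d in $(seq 1 "$dirs"); do
    mkdir -p "dd/$d"
    for f in $(seq 1 "$files_per_dir"); do
        head -c 1024 /dev/urandom > "dd/$d/$f"
    done
done
# Then: time duperemove -r dd/  -- with and without the patch.
```

Many tiny files maximize per-file overhead, which is exactly where an avoidable `calloc()` per file shows up in the profile.
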

#duperemove #bug

file_scan.c: don't use calloc() in csum_whole_file() by trofi · Pull Request #318 · markfasheh/duperemove
