@dbat the same issue exists in the research data management world with #DataLad / #gitAnnex. One thing I do for our storage servers is to regularly run #duperemove on them. It requires filesystem support (xfs/btrfs), but it deduplicates at the extent level, i.e. below the file level, so if the difference between two versions affects only a small part of a file it should be able to help. I wonder if it could be run as a post-commit hook, or something like that.
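
Such a hook might look roughly like this; a minimal sketch, assuming `duperemove` is on `PATH` and the repository sits on btrfs or xfs. The `.git/annex/objects` path and the hashfile location are illustrative:

```shell
#!/bin/sh
# Hypothetical .git/hooks/post-commit sketch: re-deduplicate the annex
# object store after each commit. Paths here are illustrative.
annex_objects=".git/annex/objects"
if command -v duperemove >/dev/null 2>&1 && [ -d "$annex_objects" ]; then
    # -r: recurse into the object tree; -d: actually submit dedupe
    # requests; --hashfile persists checksums so unchanged files are
    # not rescanned on the next commit.
    duperemove -r -d --hashfile=.git/annex/dedupe.hash "$annex_objects"
fi
```

The guard makes the hook a silent no-op on machines without `duperemove` or without an annex, so it is safe to install unconditionally.
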
All is well: I'm just running #duperemove.

Today's bug is a `duperemove` infinite looping bug: https://github.com/markfasheh/duperemove/pull/376

There `duperemove` was not able to dedupe against a NoCOW file:

$ dd if=/dev/urandom bs=8M count=1 > a
$ touch b
$ chattr +C b
$ cat a >> b
$ ./duperemove -d -q --batchsize=0 --dedupe-options=partial,same a b
<hangs>
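
For reference, the NoCOW state can be checked with `lsattr`. A guarded probe like the one below (guarded because the `C` attribute needs btrfs/xfs support, and `chattr +C` only works on an empty file) shows whether the current filesystem supports it at all:

```shell
# Probe whether the current filesystem supports the NoCOW attribute.
# chattr +C succeeds only on an empty file and only on btrfs/xfs.
touch nocow-probe
if chattr +C nocow-probe 2>/dev/null; then
    lsattr nocow-probe    # a 'C' in the flags column marks NoCOW
else
    echo "no NoCOW support on this filesystem"
fi
rm -f nocow-probe
```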

I noticed it about a month ago but only got around to debugging it today. It's a 0.15 regression; the fix is trivial once bisected.

#duperemove #bug

dedupe.c: fix infinite looping on NoCOW files by trofi · Pull Request #376 · markfasheh/duperemove

Now I'm a bit curious whether the duperemove process that has been running since November 29th will ever finish, or whether it will end up as part of my estate.

#duperemove

Today's `duperemove` bug is https://github.com/markfasheh/duperemove/issues/332.

There `duperemove` crashes when the file being deduped gets truncated down to zero.

And the bug is already fixed!

#duperemove #bug

`duperemove-0.14` `SIGSEGV`s in `fiemap_scan_extent()` · Issue #332 · markfasheh/duperemove

`duperemove-0.14` is a lot faster than `duperemove-0.13`!

Unfortunately it sometimes crashes on my input data. It takes about 10 minutes to observe the crash.

I wrote a trivial fuzzer to generate funny filesystem states for `duperemove`. Guess how long it takes to crash `duperemove` with it.

Spoiler: https://trofi.github.io/posts/305-fuzzing-duperemove.html
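
The real `duperemove-fuzz.bash` is in the post above; the general shape of such a fuzzer is roughly the following. Names and mutations here are illustrative, not the actual script:

```shell
#!/bin/sh
# Illustrative fuzzer shape, not the real duperemove-fuzz.bash:
# build a small random corpus, mutate it, run duperemove over it,
# and flag crashes (exit status > 128 means death by signal).
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
    head -c $((i * 4096 + 1)) /dev/urandom > "$workdir/f$i"
done
# Example mutations: truncate a couple of files down to zero.
: > "$workdir/f3"
: > "$workdir/f7"
if command -v duperemove >/dev/null 2>&1; then
    duperemove -d -q "$workdir"/f* >/dev/null 2>&1
    status=$?
    if [ "$status" -gt 128 ]; then
        echo "crash: killed by signal $((status - 128))"
    fi
fi
rm -rf "$workdir"
```

In a real fuzzer the loop runs until a crash is seen, saving the corpus that triggered it.
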

#duperemove #bug

fuzzing duperemove

Today's `duperemove` bug is https://github.com/markfasheh/duperemove/pull/324.

There the rather aggressive `--dedupe-options=partial` option used a less optimized `sqlite` query to fetch unique file extents. That caused a full database scan every time data was queried for an individual file.

The fix swapped the `JOIN` query for a nested `SELECT` query, turning the full scan into an index lookup.
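
The difference is easy to see on a toy schema. This is not duperemove's actual schema, just an illustration of scan vs. index search in sqlite's `EXPLAIN QUERY PLAN` output:

```shell
# Toy schema: a table of extents with an index on the inode column.
db=$(mktemp)
sqlite3 "$db" 'CREATE TABLE extents (ino INTEGER, off INTEGER, len INTEGER);
CREATE INDEX extents_ino ON extents(ino);'
# No index covers "len": the planner has to scan the whole table.
scan_plan=$(sqlite3 "$db" 'EXPLAIN QUERY PLAN SELECT * FROM extents WHERE len = 4096;')
echo "$scan_plan"
# "ino" is indexed: the same query shape becomes an index search.
search_plan=$(sqlite3 "$db" 'EXPLAIN QUERY PLAN SELECT * FROM extents WHERE ino = 42;')
echo "$search_plan"
rm -f "$db"
```

The first plan reports a `SCAN` of `extents`, the second a `SEARCH` using `extents_ino`; on a table with millions of extent rows, queried once per file, that is the difference between quadratic and roughly linear work.
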

#duperemove #bug

`--dedupe-options=partial`: avoid quadratic slowdown on extent count by trofi · Pull Request #324 · markfasheh/duperemove

Today's `duperemove` bug is a minor accounting bug: https://github.com/markfasheh/duperemove/pull/323

$ ls -lh /nix/var/nix/db/db.sqlite
1.4G /nix/var/nix/db/db.sqlite

Before the change:

$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 27065321263104 shared bytes

After the change:

$ ./show-shared-extents /nix/var/nix/db/db.sqlite
/nix/var/nix/db/db.sqlite: 1169276928 shared bytes

The size reduction is not as impressive as initially reported :)
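
The PR description says the bug was in how the extent end was compared against the file end. A hedged numeric sketch of what correct accounting looks like (illustrative numbers only):

```shell
# Illustrative numbers only: an extent overlapping EOF must be clamped
# to the file end before its length is added to the shared total.
file_end=$((1024 * 1024))      # 1 MiB file
extent_start=$((900 * 1024))
extent_end=$((1200 * 1024))    # extent continues past EOF
end=$extent_end
if [ "$end" -gt "$file_end" ]; then
    end=$file_end
fi
shared=$((end - extent_start))
echo "shared bytes: $shared"   # 126976 (124 KiB), not 307200 (300 KiB)
```

Counting the unclamped tail for every shared extent is how a 1.4G file ends up "sharing" 27 terabytes.
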

#duperemove #bug

filerec_count_shared(): fix sharing accounting by trofi · Pull Request #323 · markfasheh/duperemove

Today's bug is a `duperemove` quadratic slowdown: https://github.com/markfasheh/duperemove/pull/322

There `duperemove` was struggling to dedupe small files inlined into metadata entries: they all got stored with an identical checksum, so it kept trying to dedupe all of them as a single set even when the files' contents did not match.

The fix is a one-liner: just don't track non-dedupable files.

Without the fix the dedupe run never finished on my system; I always had to run it on a subset of the files to get any progress. Now the whole run takes 20 minutes.
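
On btrfs such tiny files live inline in the metadata tree, and `filefrag -v` reports them with an `inline` flag. A guarded probe (guarded because FIEMAP/FIBMAP is not available on every filesystem, and output differs outside btrfs):

```shell
# Probe: does this filesystem inline tiny files? On btrfs, filefrag -v
# shows an "inline" flag for them; elsewhere the output differs.
printf 'tiny' > tiny-probe
filefrag -v tiny-probe || echo "FIEMAP/FIBMAP not available here"
rm -f tiny-probe
```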

#duperemove #bug

file_scan.c: avoid work when processing inline-only small files by trofi · Pull Request #322 · markfasheh/duperemove

It feels like `duperemove` could have worked a lot faster than it does today.

What would it take to get a 2x speedup on small files? A one-liner: https://github.com/markfasheh/duperemove/pull/318

There is still a ton of low-hanging improvement hiding in there.
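
The benchmark corpus from the PR (100 directories of 1000 files, 1024 bytes each, about 100MB in total) can be sketched like this, scaled down so it runs in seconds rather than minutes:

```shell
# Scaled-down sketch of the PR's benchmark corpus (the original uses
# 100 dirs x 1000 files of 1024 random bytes each, ~100MB in total).
dirs=4
files_per_dir=25
mkdir -p dd
for d in $(seq 1 "$dirs"); do
    mkdir -p "dd/$d"
    for f in $(seq 1 "$files_per_dir"); do
        head -c 1024 /dev/urandom > "dd/$d/$f"
    done
done
# Then: time duperemove -r dd/  -- with and without the patch.
```

Many tiny files maximize per-file overhead, which is exactly where an avoidable `calloc()` per file shows up in the profile.
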

#duperemove #bug

file_scan.c: don't use calloc() in csum_whole_file() by trofi · Pull Request #318 · markfasheh/duperemove
