@taylor

Yeah! Of course, this is still a block-sorting compression algorithm*, so you won't get much advantage over zstd or xz when dealing with datasets with more inherent entropy, like binary files or whatnot, but it works miracles for text.

* Of course I know what that means. Tell you what, you tell me what you think it means, and I'll tell you if you're right. 🤣

Here's an example with non-text data, where you see that #bzip3 isn't as strong:

Pictures$ for x in cat "gzip -9" "bzip2 -9" "bzip3" "zstd --ultra -22" "xz -9e"; do $x < Hobbes.jpg |wc -c |tr "\n" "\t"; echo "$x"; done |sort -rn
3445659    cat
3444164    xz -9e
3441839    zstd --ultra -22
3439158    gzip -9
3384450    bzip2 -9
3274433    bzip3

WAIT.
WHAT.

Let's try something else...

Videos$ f="Federated Timeline.webm"; for x in cat "gzip -9" "bzip2 -9" "bzip3" "zstd --ultra -22" "xz -9e"; do $x < "$f" |wc -c |tr "\n" "\t"; echo "$x"; done |sort -rn
1231940    bzip2 -9
1231269    bzip3
1227060    xz -9e
1226931    cat
1226421    zstd --ultra -22
1226241    gzip -9

WHAT?!? THE WORLD IS BROKEN!!!

TrYiNg AgAiNnNn...

Documents$ f="Thinkpad x200 hardware maintenance manual.pdf"; for x in cat "gzip -9" "bzip2 -9" "bzip3" "zstd --ultra -22" "xz -9e"; do $x < "$f" |wc -c |tr "\n" "\t"; echo "$x"; done |sort -rn
8942833    cat
8657277    bzip2 -9
8617801    gzip -9
8592319    bzip3
8568484    xz -9e
8535244    zstd --ultra -22

Ok, that makes sense. That's what I was expecting.

YOU SAW NOTHING ELSE. DON'T ASK ME ANY MORE QUESTIONS. 🤣

P.S., here's another interesting one:

138240138    cat (large BMP file)
  3768642    gzip -9
  3143455    PNG format
  1987020    zstd --ultra -22
  1592854    bzip2 -9
  1512291    bzip3
  1501540    xz -9e

For text with a lot of repetition, #bzip3 still blows my mind. 😆

rld@Intrepid:Documents$ for x in cat "gzip -9" "zstd --ultra -22" "xz -9e" "bzip2 -9" bzip3; do $x < weatherlog-2024.txt |wc -c |tr "\n" "\t"; echo "$x"; done
1735300    cat
80423    gzip -9
63275    zstd --ultra -22
53516    xz -9e
52374    bzip2 -9
40645    bzip3
rld@Intrepid:Documents$ echo 1735300/40645 |bc -l
42.69405830975519744125
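
The one-liner above can be wrapped into a reusable function that also prints the ratio for every compressor instead of only the winner. This is a minimal sketch, not the author's script: the name `shootout` and the output layout are made up for illustration, and it assumes bash plus whatever compressors you pass in are installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original posts): run each compressor over
# one file and print compressed size, ratio versus the original, and command.
shootout() {
    local file=$1; shift
    local orig size cmd
    orig=$(wc -c < "$file")
    for cmd in "$@"; do
        # $cmd is left unquoted on purpose so "gzip -9" splits into command + flag
        size=$($cmd < "$file" | wc -c)
        # awk does the float division; guard against an empty output (size 0)
        awk -v s="$size" -v o="$orig" -v c="$cmd" \
            'BEGIN { printf "%d\t%.2f\t%s\n", s, (s > 0 ? o / s : 0), c }'
    done | sort -n    # smallest output first
}
```

Usage would look like `shootout weatherlog-2024.txt cat "gzip -9" "zstd --ultra -22" "xz -9e" "bzip2 -9" bzip3`. The `cat` row always shows a ratio of 1.00, which makes a handy baseline.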

#Lossless #Compression #LosslessCompression

P.S. times:

real 1.49    zstd --ultra -22
real 0.94    xz -9e
real 0.23    bzip2 -9
real 0.07    gzip -9
real 0.06    bzip3
real 0.00    cat
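
The post doesn't show how these times were collected, so here is one way a similar listing could be produced. It is a sketch under assumptions: it relies on the `time` keyword and the `TIMEFORMAT` variable, which are bash-specific, and the function name `time_compressors` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wall-clock time per compressor, slowest first.
time_compressors() {
    local file=$1; shift
    local cmd TIMEFORMAT='real %2R'    # bash-only: elapsed seconds, two decimals
    for cmd in "$@"; do
        # The time keyword reports on stderr; fold it into stdout and
        # throw the compressed data away.
        { time $cmd < "$file" > /dev/null; } 2>&1 | tr '\n' '\t'
        echo "$cmd"
    done | sort -rn -k2
}
```

Called as `time_compressors weatherlog-2024.txt "zstd --ultra -22" "xz -9e" "bzip2 -9" "gzip -9" bzip3 cat`. A single run on a small file is noisy, so treat the numbers as a rough ranking rather than a benchmark.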

DANG. 😂

R.L. Dane - Unix Data Compression Shootout

So I used #bzip3 to compress the #shamela library (the big 12GB ISO for Windows) and now it's only 1.4GB.

Edit: Correction. The filesystem ran out of space and left that partial file behind. bzip3 makes the file bigger, not smaller.
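
A mistake like this is easy to catch by checking the compressor's exit status and then doing a round trip before trusting the size. The sketch below is an assumption-laden illustration, not something from the post: `verify_roundtrip` is a made-up name, and it takes the decompress command as an argument so nothing about a particular tool's flags is baked in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compress, decompress, compare with the original, and
# only then report the compressed size in bytes.
verify_roundtrip() {
    local file=$1 compress=$2 decompress=$3
    local out="$file.tmp.$$"
    # A failing compressor (for example on a full disk) exits non-zero
    if ! $compress < "$file" > "$out"; then
        rm -f "$out"
        return 1
    fi
    # Decompress to a pipe and compare byte-for-byte with the original
    if $decompress < "$out" | cmp -s - "$file"; then
        wc -c < "$out"
        rm -f "$out"
    else
        rm -f "$out"
        return 1
    fi
}
```

For example `verify_roundtrip some.iso bzip3 "bzip3 -d"` (the file name is a placeholder, and `-d` is bzip3's decode flag as far as I know). The function prints a size only when the data survives the round trip.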

So @rl_dane introduced #bzip3 to me to use instead of #bzip2. Let's turn some bz2 files into bz3 to see the difference.

First example: 90k opus files

The Hey Snips wake word dataset. It has ~90k opus files and a tar file of 3.1GB. bzip2 produces the same 3.1GB, which is as expected (Opus audio is already compressed). bzip3 produced 3.0GB but used tons of computation power. Not worth the 100MB.

Second example: Windows 7 virtual box VM image

Windows7.vdi is a Windows 7 VM image for the "special" days. I think I have to get rid of it. But while it is still there, let's see how each will perform. It is 16GB uncompressed. bzip2 -9 is 7.0GB. bzip3 is 6.3GB, but at the expense of about 3x the CPU time. Deleting all of them anyway. Down with Windows.

Third example: Pure XML text file

A pure XML file with Persian and English characters. Uncompressed it is 1.7GB. bzip2 -9 is 276MB while bzip3 is 260MB.

Final example: Creating a simple bomb

So I did this:

dd if=/dev/zero of=./justzero bs=2G count=6

So now I have a 16GB file with only zero bytes. bzip2 -9 is 672KB. bzip3 is 46KB.
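
The same experiment can be run without writing the zero file to disk at all, by streaming zeros straight into the compressor. This is a minimal sketch with a hypothetical function name, assuming a `head` that supports `-c` (GNU and BSD both do). As an aside, `bs=2G count=6` asks dd for at most 12 GiB, and on Linux dd may read less than a full 2 GiB block per call, so it is worth checking the resulting file size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: feed N zero bytes to a compressor and print how many
# bytes come out. Nothing is written to disk.
zero_bomb_size() {
    local bytes=$1; shift
    head -c "$bytes" /dev/zero | "$@" | wc -c
}
```

For instance `zero_bomb_size $((16 * 1024 ** 3)) bzip3` or `zero_bomb_size $((16 * 1024 ** 3)) bzip2 -9`. Here the compressor and its flags are passed as separate words rather than one quoted string.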

Conclusion

Thank you @rl_dane

Real nice thing!

#compression #gzip #zip #filecompression #textcompression #datacompression #linux #unix #tech

#bzip3, y'all!

17,396,992 Apr 15 18:52 powertrack-Excelsior-2024.txt
   564,163 Apr 15 18:52 powertrack-Excelsior-2024.txt.bz3
~ $ tail powertrack.txt
2025-04-15 18:44 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:48
2025-04-15 18:45 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:49
2025-04-15 18:46 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:50
2025-04-15 18:47 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:51
2025-04-15 18:48 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:52
2025-04-15 18:49 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:53
2025-04-15 18:50 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:54
2025-04-15 18:51 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:55
2025-04-15 18:52 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:56
2025-04-15 18:53 Battery 0: Not charging, 81%; 0.00488959 W; uptime: 4:57

#bzip3 continues to amaze me:

-rw-r--r-- 1 ~~~ ~~~ 100M Apr 1 2025 outbox.json
$ simplify $(bzip3 < outbox.json |wc -c)
4.57 MiB
$ simplify $(xz -9e < outbox.json |wc -c)
4.69 MiB
$ simplify $(zstd --ultra -22 < outbox.json |wc -c)
5.06 MiB

(I didn't time it, but it was much faster than the other two)

Also, just in case anyone's curious, simplify (poor name pick, but I couldn't think of anything better) is just a bash function for converting byte counts to a binary (IEC) unit such as KiB or MiB:

function simplify { #Reduces a big byte count down to mebibytes or whatnot
    local steps num
    if [ -z "$1" ]; then
        echo "simplify() called without parameters (requires a number of bytes with no unit name)" >&2
        return 1 #has to run outside a subshell, or the function keeps going
    fi
    steps=0
    num=$1
    while [[ $(echo "$num > 1024" |bc) == 1 ]] #bc has to be used because num is a float
    do
        steps=$((steps + 1))
        num=$(echo "$num/1024" |bc -l)
    done
    #Cut off after two decimal places:
    num=$(echo "$num" |sed 's/\(\.[0-9][0-9]\)[0-9]*$/\1/')
    printf '%s ' "$num"
    case $steps in
        0) echo B;;
        1) echo KiB;;
        2) echo MiB;;
        3) echo GiB;;
        4) echo TiB;;
        5) echo PiB;;
        6) echo EiB;;
        7) echo ZiB;;
        8) echo YiB;;
        *) echo "1024 ^ $steps bytes";;
    esac
}
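
For comparison, the same conversion can be done in a single awk process instead of a bc call per loop iteration. This is a sketch of an alternative, not a drop-in replacement: `simplify_awk` is a made-up name, it rounds to two decimals where the function above truncates, and it stops at EiB.

```shell
#!/usr/bin/env bash
# Hypothetical alternative: byte count to a binary (IEC) unit using awk only.
simplify_awk() {
    awk -v n="$1" 'BEGIN {
        split("B KiB MiB GiB TiB PiB EiB", unit, " ")
        i = 1
        # Divide by 1024 until the value fits the current unit
        while (n >= 1024 && i < 7) { n /= 1024; i++ }
        printf "%.2f %s\n", n, unit[i]
    }'
}
```

For example, `simplify_awk 1536` prints `1.50 KiB`.
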

I did a few #compression experiments after recently running into a new compressor, #Bzip3. At least for compressing #backup copies of my files, it clearly lost on compression ratio to #XZ, which I have been using for backups, and clearly lost on compression speed to #ZStd, which I had been thinking of switching to. #atkjuttuja

BZip3

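I saw a link to BZip3 on Hacker News: "Bzip3: A spiritual successor to BZip2 (github.com/kspalaiologos)".

Although the name suggests a connection to bzip2, it appears to be the work of different people, though some classic algorithms have been kept, such as the Burrows-Wheeler transform.

Also worth mentioning: bzip2 came out in 1996 (though 1.0 was released around 2000), while the first BZip3 release was in 2022, and quite a few interesting algorithms have accumulated over that time.

If you want a relatively good compression ratio in lossless compression, the most commonly used algorithms today are probably the LZMA family (which appeared around 2001), and the tool used is usually X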

https://blog.gslin.org/archives/2025/02/02/12240/bzip3/

#Computer #Murmuring #Software #bzip2 #bzip3 #compression #lzma #ratio #xz

Gea-Suan Lin's BLOG

I love playing around with #compression

In this case, it's all text-based data in csv and xml formats.

Size:

32,696,320 202411.tar
 4,384,020 202411.tar.bz2
 4,015,912 202411.tar.zst
 3,878,583 202411.tar.bz3
 3,730,416 202411.tar.xz

zstd was invoked using zstd --ultra -22
xz was invoked using xz -9e
bzip2 was invoked using bzip2 -9
bzip3 has no compression level options

Speed:

zstd   54.31user 0.25system 0:54.60elapsed 99%CPU
xz     53.80user 0.06system 0:53.93elapsed 99%CPU
bzip2  5.33user 0.01system 0:05.35elapsed 99%CPU
bzip3  3.98user 0.02system 0:04.01elapsed 99%CPU

Maximum memory usage (RSS):

zstd   706,312
xz     300,480
bzip3   75,996
bzip2    7,680

*RSS sampled up to ten times per second during execution of the commands in question
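
The post doesn't include the sampling script, so the following is only a guess at how such a measurement could be taken: start the command in the background and poll `ps` roughly ten times per second. The name `peak_rss` is hypothetical, fractional `sleep` is assumed (GNU and BSD support it), and `ps` reports RSS in KiB on Linux.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and print the largest RSS value seen
# while polling it about ten times per second.
peak_rss() {
    local pid rss peak=0
    "$@" > /dev/null &    # discard the command's output; only memory matters here
    pid=$!
    while kill -0 "$pid" 2>/dev/null; do
        rss=$(ps -o rss= -p "$pid" 2>/dev/null | tr -d ' ')
        if [ -n "$rss" ] && [ "$rss" -gt "$peak" ]; then
            peak=$rss
        fi
        sleep 0.1
    done
    wait "$pid"
    echo "$peak"
}
```

Usage would be something like `peak_rss xz -9e -c 202411.tar`. Because this samples rather than hooks the process, short spikes between polls can be missed, and a very fast command may finish before the first sample.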

#bzip3 is freaking amazing, yo.

#DataCompression #bzip #bz3 #zstd #zst #zstandard #xz #lzma
#CouldaBeenABlost ;)