For fun I implemented a clone of lbzip2's lbzcat (parallel bzip2 decompression) in Rust, using the bzip2 crate. I found a lot of interesting things. 1/N

#lbzip2 #bzip2 #RustLang

GitHub - anisse/lbzip2-rs: PoC lbzip2 parallel decompression in Rust

PoC lbzip2 parallel decompression in Rust. Contribute to anisse/lbzip2-rs development by creating an account on GitHub.

GitHub

First, let's look at the baseline. The @trifectatech foundation recently claimed that bzip2-rs had faster decompression than libbzip2's original C implementation. Is this true? Well, it depends... 2/N

#bzip2 #RustLang

bzip2 crate switches from C to 100% rust - Trifecta Tech Foundation

I'll run benchmarks using my own bzcat and compare it with bzip2's original bzcat, all on an M1 laptop running Linux. The M1 CPU has both performance and efficiency cores; we'll run the benchmark on both, using hyperfine to do the comparisons. 3/N

#bzip2 #AsahiLinux #hyperfine
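
For reference, hyperfine's job is conceptually simple: run each command many times and report a mean ± standard deviation. Here is a minimal Rust sketch of that measurement loop, under the assumption of a plain arithmetic mean; the closure is a stand-in workload, not the actual bzcat invocation:

```rust
use std::time::Instant;

// Time a workload `runs` times and report (mean, standard deviation) in
// seconds -- conceptually what hyperfine does for each benchmarked command.
fn bench<F: FnMut()>(mut work: F, runs: usize) -> (f64, f64) {
    let samples: Vec<f64> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed().as_secs_f64()
        })
        .collect();
    let mean = samples.iter().sum::<f64>() / runs as f64;
    let var = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / runs as f64;
    (mean, var.sqrt())
}

fn main() {
    // Stand-in workload; the real benchmark runs `bzcat file.bz2 > /dev/null`.
    let (mean, sd) = bench(|| { let _ = (0..1_000_000u64).sum::<u64>(); }, 10);
    println!("mean = {mean:.6}s ± {sd:.6}s");
}
```

hyperfine additionally does warmup runs and outlier detection, which this sketch skips.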

So, let's start with the most obvious: the performance cores. They make the most sense for a CPU-intensive task; on my system those are CPUs 4-7. We can see here that the Trifecta Tech Foundation's claim holds, as the Rust implementation is about 4% faster. 4/N

#bzip2 #RustLang #AsahiLinux

What about the efficiency cores (0-3 on my system)? Well, everything breaks down there: the C implementation is now faster by a small margin! 5/N

Edit: it's probably not statistically significant; see the replies.

#bzip2

But why? Let's see what perf stat has to say: the Rust version executes fewer instructions, but with much lower IPC (instructions per clock); it also has fewer branches and branch misses overall. On the efficiency cores, the Rust version's worse IPC and branch prediction give the advantage to the C version. 6/N

#bzip2 #RustLang #perf
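
To make the reasoning explicit: cycles = instructions / IPC, so a build that executes fewer instructions can still burn more cycles if its IPC drops enough. The counter values below are made up purely for illustration; they are not my actual perf numbers:

```rust
// cycles = instructions / IPC: the basic identity behind the perf numbers.
fn cycles(instructions: f64, ipc: f64) -> f64 {
    instructions / ipc
}

fn main() {
    // Hypothetical counters: the "Rust" build retires fewer instructions
    // but at a lower IPC, so it needs more cycles overall.
    let c_cycles = cycles(100.0e9, 3.0);
    let rust_cycles = cycles(90.0e9, 2.4);
    assert!(rust_cycles > c_cycles); // at equal clock speed, more cycles = slower
    println!("C: {c_cycles:.3e} cycles, Rust: {rust_cycles:.3e} cycles");
}
```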

That gives us our baseline: bzip2 (in C) vs bzip2 (in Rust). But is it a fair enough comparison? I mentioned initially that I was implementing an lbzip2 "clone" (mostly a PoC for the decompression part). lbzip2 is another program (a C binary, without a library) that can compress and decompress bzip2 files in parallel. Surely it should be slower than bzip2, since it has the overhead of managing parallelism? 7/N

#lbzip2 #bzip2 #RustLang #perf

Let's have a look! Indeed, bzip2 is faster than lbzip2: between 18% and 95% faster! 8/N

#lbzip2 #bzip2

While running this, I discovered something: lbzip2's detection of the cores it can run on is... lacking. It uses the number of globally online CPU cores instead of the sched_getaffinity syscall for the current process. So let's manually limit the number of processing threads: lbzip2 is now always faster, between 4% and 8%. 9/N

#lbzip2 #bzip2
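
Incidentally, this is easy to get right in Rust: std::thread::available_parallelism() consults sched_getaffinity(2) on Linux (plus cgroup quotas), rather than counting all online CPUs the way lbzip2 does:

```rust
use std::thread;

fn main() {
    // On Linux this reflects the process's CPU affinity mask (e.g. a
    // `taskset -c 0` restriction), unlike sysconf(_SC_NPROCESSORS_ONLN),
    // which reports every online core on the machine.
    let n = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("usable cores: {n}");
}
```

Running this under `taskset -c 0` prints 1, which is the thread count lbzip2 should have picked.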

But wait, there's more. lbzip2 also does compression, and does so in a way that optimizes decompression. If we use a file compressed by lbzip2, it can be even faster, even on a single thread: up to 125% faster. 10/N

#lbzip2 #bzip2

What about my implementation of lbzcat? It was designed to work with files generated by lbzip2: it does not work on some files compressed by bzip2, and silently produces incorrect output (!). So we'll limit benchmarking to files produced by lbzip2. 11/N

#lbzip2 #bzip2 #RustLang

Overall, my Rust implementation (using the bzip2-rs crate) is (much) slower than lbzip2, and faster than bzip2. For some reason, it also sees a huge performance boost on the performance cores, most likely due to better IPC and branch prediction. 12/N

#lbzip2 #bzip2

We've been running benchmarks on single CPU cores since the start. What if we unleash the parallel mode? Here are the results: lbzip2 is still much faster on 8 cores; my implementation holds up fine, but is only 80% faster than bzip2 despite running on 8 cores. On bigger files though, it starts to pay off, reaching up to 6.3x faster, while lbzip2 goes up to 7.7x. 13/N

#lbzip2 #bzip2
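
The overall shape of the parallel mode is: split the input into blocks, hand each block to a worker thread, and reassemble the outputs in input order. A minimal sketch with scoped threads; process() here is a stand-in checksum, not real bzip2 block decompression:

```rust
use std::thread;

// Stand-in for per-block decompression: just a byte checksum.
fn process(block: &[u8]) -> u64 {
    block.iter().map(|&b| b as u64).sum()
}

// Fan blocks out to scoped threads, then join in input order so the
// output is reassembled deterministically.
fn parallel_process(blocks: &[&[u8]]) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = blocks
            .iter()
            .map(|block| s.spawn(move || process(block)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let data = [b"first block".as_slice(), b"second block".as_slice()];
    let out = parallel_process(&data);
    assert_eq!(out, data.iter().map(|b| process(b)).collect::<Vec<_>>());
    println!("{out:?}");
}
```

The real implementation additionally has to stream blocks as they are found instead of collecting them all up front, which is where the scheduling complexity comes from.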

That's it for the benchmarking! You can find my implementation at http://github.com/anisse/lbzip2-rs/ ; it's very much PoC-quality code, so use at your own risk! I chose to manually spawn threads instead of using rayon or an async runtime; there are other things I'm not proud of, like busy-waiting instead of using a condvar, for example. 14/N

#lbzip2 #bzip2 #RustLang #async #rayon
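
For the busy-waiting part, the textbook replacement is a Mutex + Condvar pair, so the consumer sleeps until a worker publishes its result instead of spinning. A minimal sketch, where the u64 "block" stands in for a decompressed buffer:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Wait for a worker to publish one result, sleeping on a Condvar
// instead of busy-waiting on a flag.
fn wait_for_block() -> u64 {
    let slot = Arc::new((Mutex::new(None::<u64>), Condvar::new()));

    let producer = Arc::clone(&slot);
    thread::spawn(move || {
        let (lock, cvar) = &*producer;
        *lock.lock().unwrap() = Some(42); // publish a "decompressed block"
        cvar.notify_one();
    });

    let (lock, cvar) = &*slot;
    let mut guard = lock.lock().unwrap();
    while guard.is_none() {
        guard = cvar.wait(guard).unwrap(); // blocks until notify_one()
    }
    guard.unwrap()
}

fn main() {
    println!("got block: {}", wait_for_block());
}
```

The wait sits in a loop because Condvar waits can wake spuriously; the predicate must be re-checked under the lock.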

lbzip2 internally implements a full task-scheduling runtime, and splits work into much smaller increments; it supports bit-aligned blocks (which are standard in the bzip2 format), while my Rust implementation purposefully doesn't: I wanted to rely on the bzip2 crate, which only supports byte-aligned buffers, and keep the code simple (at which I failed, IMHO). FIN 15/15

#lbzip2 #bzip2
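
For the curious: each bzip2 block starts with the 48-bit magic 0x314159265359, but in bzip2's own output that magic can sit at any bit offset, so finding block boundaries in general requires a bit-level scan. When blocks happen to land on byte boundaries, a simple byte-level search like this sketch is enough:

```rust
// The 48-bit bzip2 block-header magic, 0x314159265359 (BCD digits of pi).
const BLOCK_MAGIC: [u8; 6] = [0x31, 0x41, 0x59, 0x26, 0x53, 0x59];

// Find the next byte-aligned block magic at or after `from`.
// Misses any magic that straddles a bit boundary -- that case needs a
// 48-bit sliding window over the bit stream instead.
fn find_block(data: &[u8], from: usize) -> Option<usize> {
    data[from..]
        .windows(BLOCK_MAGIC.len())
        .position(|w| w == BLOCK_MAGIC)
        .map(|p| p + from)
}

fn main() {
    // Synthetic buffer with the magic at byte offset 4.
    let mut buf = vec![0u8; 4];
    buf.extend_from_slice(&BLOCK_MAGIC);
    println!("block at {:?}", find_block(&buf, 0)); // Some(4)
}
```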

You'll find this benchmarking adventure in its own blog post "Performance lessons of implementing lbzcat in Rust" https://anisse.astier.eu/lbzip2-rs.html

#RustLang #lbzip2 #bzip2 #benchmarking #performance

Performance lessons of implementing lbzcat in Rust - Linux Engineer's random thoughts

This was originally published as a thread on Mastodon. For fun I implemented lbzip2's lbzcat (parallel bzip2 decompression) clone in rust, using the bzip2 crate. Baseline: bzip2 vs bzip2-rs First, let's look at the baseline. The Trifecta Tech Foundation recently claimed that bzip2-rs had faster decompression than libbzip2's original C …

@Aissen Hi, libbz2-rs-sys developer here. Thanks for trying out our implementation.

I wonder how statistically significant these results are. For both of your measurements, the margin of error is larger than the effect.

This is in part why we use https://github.com/andrewrk/poop for our benchmarks: it automatically performs a significance check, not just for runtime but also for instructions and cycles.

GitHub - andrewrk/poop: Performance Optimizer Observation Platform

Performance Optimizer Observation Platform. Contribute to andrewrk/poop development by creating an account on GitHub.

GitHub

@folkertdev Excellent question. This is precisely why I included the full command and hyperfine output. I don't think the results - as shown - are significant. I also ran them many times, including with a higher minimum number of runs (not shown). I'll easily concede that the difference is negligible, though.