I wonder how much perf #pigz would lose if it had an optional switch to append compressed blocks bitwise, rather than using an empty stored block to byte-align each one before appending. I bet there's some really fast bit-shifting memcpy code out there.
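
For a rough idea of what that "bit-shifting memcpy" might look like, here is a minimal sketch in C. The name `bit_append` and its signature are made up for illustration and are not part of pigz or zlib. It assumes bits are packed starting from the least significant bit of each byte (as deflate does), that the caller knows the exact bit length of each piece, and that the unused high bits of the last destination byte are zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a pigz or zlib function).
 * Appends the first src_bits bits of src to dst, which already holds
 * dst_bits bits. Bits are packed LSB first within each byte.
 *
 * Assumptions: the unused high bits of dst's last partial byte are zero,
 * and dst has room for (dst_bits + src_bits + 7) / 8 bytes.
 * Returns the new total bit count.
 */
static size_t bit_append(uint8_t *dst, size_t dst_bits,
                         const uint8_t *src, size_t src_bits)
{
    assert(dst != NULL);
    if (src_bits == 0)
        return dst_bits;
    assert(src != NULL);

    size_t src_bytes = (src_bits + 7) / 8;
    size_t pos = dst_bits / 8;                  /* first byte to touch */
    unsigned shift = (unsigned)(dst_bits % 8);  /* bits used in that byte */
    size_t total = dst_bits + src_bits;
    size_t total_bytes = (total + 7) / 8;

    if (shift == 0) {
        /* Already byte-aligned: a plain memcpy is enough. */
        memcpy(dst + pos, src, src_bytes);
    } else {
        for (size_t i = 0; i < src_bytes; i++) {
            uint8_t b = src[i];
            /* Low part of b fills the free high bits of the current byte. */
            dst[pos + i] |= (uint8_t)(b << shift);
            /* The rest spills into the next byte, if that byte is in range. */
            if (pos + i + 1 < total_bytes)
                dst[pos + i + 1] = (uint8_t)(b >> (8 - shift));
        }
    }

    /* Clear any stray bits past the end so the next append starts clean. */
    if (total % 8 != 0)
        dst[total_bytes - 1] &= (uint8_t)((1u << (total % 8)) - 1);

    return total;
}
```

This version moves one byte at a time to keep the logic obvious. A faster variant would presumably shift 64-bit words instead, and the aligned case already degenerates to a plain `memcpy`. The harder part in practice is likely knowing each block's exact bit length, not the shifting itself.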