i looked at the rpgp codebase and it is very good code. incredibly small and the directory structure alone is remarkably effective
use snafu::{Backtrace, Snafu};

#[derive(Debug, Snafu)]
#[snafu(display("needed {}, remaining {}", needed, remaining))]
pub struct RemainingError {
    pub needed: usize,
    pub remaining: usize,
    backtrace: Option<Backtrace>,
}

wrote a good amount of code like this for the zstd impl, which i stopped working on when i realized zstd was weird and bad

i'm gonna try to make the proof of concept for length extension now. i'm very confident i'm right
i have to generate their fucked up little format
you literally just generate blocks that have zero length
it's the most unnecessary thing: https://amass.energy/rustdoc/yzx-unstable/yzx_core/frame/data/
ziv-lempel's bullshit already lets you do run-length shit. why are there TWO ways to encode a run-length block of the same byte?

the problem with e.g. writing code that would analyze .tar.zsts in the wild is that the zstd format is brain destroying and offensively poorly documented
yann collet owes me 1 million dollars for being incredibly insecure and not writing up his "standard"
SequenceBehavior in yzx_core::frame::data::block - Rust

there's never any actual offsets, sizes, block counts, id sets. the zip format has all of that shit. and google hates it

the idea is so basic it's literally:

  • split a .tar in two like a magician
  • read in an arbitrary byte stream from the user, encoded as 0-length RLE blocks
  • sandwich output
and there are two distinct ways to do this. have i mentioned i really despise this format
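a minimal sketch of the middle step, assuming the block header layout from the zstd format spec (RFC 8878): 3 bytes little-endian, bit 0 is Last_Block, bits 1-2 are Block_Type, the remaining 21 bits are Block_Size. for an RLE block, Block_Size is the regenerated length and the content is a single byte, so a Block_Size of 0 carries one byte and decodes to nothing. `smuggle` and `recover` are hypothetical names, and whether a given decoder accepts zero-length RLE blocks is an assumption to check, not something shown here.

```rust
// Sketch only: `smuggle` and `recover` are hypothetical helper names.
const RLE_BLOCK: u32 = 1;

/// One RLE block whose Block_Size (the regenerated length) is zero.
/// The single content byte still rides along but decodes to no output.
fn zero_length_rle_block(byte: u8, last_block: bool) -> [u8; 4] {
    let block_size: u32 = 0;
    let header = (last_block as u32) | (RLE_BLOCK << 1) | (block_size << 3);
    // The header is the low 3 bytes of this value, little-endian.
    let h = header.to_le_bytes();
    [h[0], h[1], h[2], byte]
}

/// Encode an arbitrary payload as a run of zero-length RLE blocks,
/// one payload byte per 4-byte block. None is marked as the last block.
fn smuggle(payload: &[u8]) -> Vec<u8> {
    payload
        .iter()
        .flat_map(|&b| zero_length_rle_block(b, false))
        .collect()
}

/// Read the payload back out of blocks produced by `smuggle`.
fn recover(blocks: &[u8]) -> Vec<u8> {
    blocks.chunks_exact(4).map(|c| c[3]).collect()
}
```

this only covers the block bytes. the sandwich still needs either a fresh frame header around them or a spot inside the previous frame before its last block, which is where the two ways diverge.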
jarek duda's tANS is cool as shit though. yann collet is such a loser for repeatedly obfuscating it. it's one of those cool mathematical results that is not a magic bullet but solves a specific problem and i like that duda is interested in expanding the application to other areas of information theory. the blake3 paper also mentions compression and i think these both should absolutely be more closely allied
compression unfortunately still sucks bc everyone acts like they absolutely cannot wait or expend CPU time on something that will be decompressed 100x or 1000x more than it was compressed
god i really hate this format. i'm gonna do it the slightly less annoying way first. i really hope the annoying way (not creating a new frame but embedding it in the previous frame, again as 0-length RLE blocks) does not require a whole encoder. i'm gonna add a dependency for that shit

this is how i decode their block type header into my enum

#[inline(always)]
fn block_type(&self) -> BlockType {
    match (self.0 & 0b110) >> 1 {
        0 => BlockType::RawBlock,
        1 => BlockType::RLEBlock,
        2 => BlockType::CompressedBlock,
        3 => BlockType::Reserved,
        _ => unreachable!("block type is limited to two bits"),
    }
}
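for context, a self-contained sketch of the whole 3-byte header that method would sit in. the `BlockHeader` newtype, `from_le_bytes`, `last_block` and `block_size` are assumptions for illustration (only `block_type` and the `BlockType` variants come from the snippet above); the bit layout is the one in the zstd format spec (RFC 8878).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockType {
    RawBlock,
    RLEBlock,
    CompressedBlock,
    Reserved,
}

/// The 3-byte block header widened into a u32 (hypothetical newtype).
#[derive(Debug, Clone, Copy)]
struct BlockHeader(u32);

impl BlockHeader {
    /// The header is stored little-endian; the top byte of the u32 stays zero.
    fn from_le_bytes(b: [u8; 3]) -> Self {
        BlockHeader(u32::from_le_bytes([b[0], b[1], b[2], 0]))
    }

    /// Bit 0: set on the final block of a frame.
    fn last_block(&self) -> bool {
        self.0 & 1 == 1
    }

    /// Bits 1-2: the block type.
    fn block_type(&self) -> BlockType {
        match (self.0 >> 1) & 0b11 {
            0 => BlockType::RawBlock,
            1 => BlockType::RLEBlock,
            2 => BlockType::CompressedBlock,
            _ => BlockType::Reserved,
        }
    }

    /// Bits 3-23: the 21-bit Block_Size field.
    fn block_size(&self) -> u32 {
        self.0 >> 3
    }
}
```

masking after the shift means the match only ever sees 0 through 3, so the last arm can be the reserved case instead of an `unreachable!`.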

two whole bits:

  • "raw": on its face: sure! not bad! if it doesn't compress well, don't compress it! we will discuss the flaw there in a moment
  • "RLE": the idea here is that you would have a massive run of a single byte. and like. ok. but that is also very much what the compression is supposed to handle for you already

the problem with prefix codes as i understand it now is that they require at least one bit to signal an output symbol (which is typically a byte--i also think this is incredibly important to parameterize). however prefix codes seem to be ideal in every other way (collet mentions performance)

and for extremely biased distributions i.e. ones that would benefit from less than one bit for some highly frequent output symbols, we have (actually a whole field here, but i like) duda's tANS. it's cute. it can be sized to a precise memory region. you follow the path of the blocks wherever they lead
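a small worked example of the "less than one bit" point, using a made-up two-symbol distribution (99% / 1%). the shannon entropy comes out to roughly 0.08 bits per symbol, while any prefix code has to spend a whole bit on each one. the function names are hypothetical and this is only the arithmetic, not an ANS implementation.

```rust
/// Shannon entropy of a probability distribution, in bits per symbol.
/// Zero-probability entries are skipped (they contribute nothing).
fn entropy_bits(probs: &[f64]) -> f64 {
    probs
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.log2())
        .sum()
}

/// For a two-symbol alphabet the best prefix code is "0" and "1",
/// i.e. exactly one bit per symbol no matter how skewed the input is.
/// This returns how many bits per symbol that code spends above the entropy.
fn two_symbol_prefix_overhead(p_frequent: f64) -> f64 {
    1.0 - entropy_bits(&[p_frequent, 1.0 - p_frequent])
}
```

at 50/50 the overhead is zero, which is the case where prefix codes are already ideal. the more skewed the distribution, the more of that one bit is wasted.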

i could cite yann collet on this here but i won't cause he doesn't. and i cited everyone here https://codeberg.org/cosmicexplorer/corporeal/src/branch/main/literature/README.md

the compressed block is where it becomes clear zstd is Just Fucking Deflate Again: https://en.wikipedia.org/wiki/Deflate

BTYPE (next two bits): Block type

  • 00: No compression (sometimes called stored). Any bits up to the next byte boundary are ignored. The rest of the block consists of 16-bit LEN, 16-bit NLEN (one's complement of LEN), and LEN bytes of uncompressed data, i.e. up to 65,535 (2^16 − 1) bytes. Useful for incompressible data (e.g. high-entropy, random, or already compressed), adding minimal overhead (i.e. ~5 bytes per block).
  • 01: A static Huffman compressed block, using a pre-agreed Huffman tree defined in the RFC.
  • 10: A dynamic Huffman compressed block, complete with the Huffman table supplied.
  • 11: Reserved (error).

shameless shit. they even kept the reserved block
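to make the resemblance concrete, a sketch that pulls the flag bit and the two type bits out of the first header byte in each format. this assumes the DEFLATE block starts on a byte boundary (true for the first block of a stream, not in general, since later blocks are bit-packed) and uses the header layouts from RFC 1951 and RFC 8878. the function names are hypothetical.

```rust
/// (BFINAL, BTYPE) from the first byte of a DEFLATE stream.
/// Bits are packed starting from the least significant bit.
fn deflate_header_bits(first_byte: u8) -> (bool, u8) {
    (first_byte & 1 == 1, (first_byte >> 1) & 0b11)
}

/// (Last_Block, Block_Type) from the first byte of a zstd block header.
fn zstd_header_bits(first_byte: u8) -> (bool, u8) {
    (first_byte & 1 == 1, (first_byte >> 1) & 0b11)
}
```

the two bodies are identical on purpose: one "this is the last block" bit, then a two-bit type whose value 3 is reserved in both.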

i think i actually buy the way zstd references prefix trees (note: everyone in the literature calls them "huffman" trees and i will not be doing that)

more from the DEFLATE page

Most compressible data will end up being encoded using method 10, the dynamic Huffman encoding, which produces an optimized Huffman tree customized for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the percentage compression loss due to using a non-optimal (thus, not technically Huffman) code.

one thing i haven't seen ANYONE fucking talk about is generating compression blocks that understand the boundaries of file contents

Tarlz Manual

@clayote before i read a single word, if it references ziv and lempel i think it's boring
@hipsterelectron It probably does in the actual compression part, but I've linked to the archiver part
@clayote oh thank you this rocks sorry