Mastodawn

i looked at the rpgp codebase and it is very good code. incredibly small and the directory structure alone is remarkably effective

d@nny disc@ mc² 1d ago

use snafu::{Backtrace, Snafu};

// error for a read that needs more bytes than the input has left
#[derive(Debug, Snafu)]
#[snafu(display("needed {}, remaining {}", needed, remaining))]
pub struct RemainingError {
    pub needed: usize,
    pub remaining: usize,
    backtrace: Option<Backtrace>,
}

wrote a good amount of code like this for the zstd impl that stopped when i realized zstd was weird and bad
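
For anyone without the snafu crate handy, here is a minimal std-only sketch of the same kind of error plus a bounds check that might raise it. The `take` helper and the hand-written `Display` impl are assumptions for illustration, not code from the zstd implementation mentioned above, and the backtrace field is left out.

```rust
use std::fmt;

/// Std-only stand-in for the snafu-derived struct (no backtrace field).
#[derive(Debug, PartialEq)]
pub struct RemainingError {
    pub needed: usize,
    pub remaining: usize,
}

impl fmt::Display for RemainingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Same message shape as the snafu display attribute.
        write!(f, "needed {}, remaining {}", self.needed, self.remaining)
    }
}

impl std::error::Error for RemainingError {}

/// Hypothetical helper: split off the first `needed` bytes,
/// or report how far short the input is.
pub fn take(buf: &[u8], needed: usize) -> Result<(&[u8], &[u8]), RemainingError> {
    if buf.len() < needed {
        return Err(RemainingError { needed, remaining: buf.len() });
    }
    Ok(buf.split_at(needed))
}
```

Returning the unread tail alongside the requested bytes lets a parser chain calls like this without tracking an offset by hand.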

d@nny disc@ mc² 1d ago

i'm gonna try to make the proof of concept for length extension now. i'm very confident i'm right

d@nny disc@ mc² 1d ago

i have to generate their fucked up little format

d@nny disc@ mc²

you literally just generate blocks that have zero length

d@nny disc@ mc² 1d ago

it's the most unnecessary thing https://amass.energy/rustdoc/yzx-unstable/yzx_core/frame/data/ ziv-lempel's bullshit already lets you do run-length shit. why are there TWO ways to encode a run-length block of the same byte

yzx_core::frame::data - Rust

Encoding of a Zstandard frame.

d@nny disc@ mc² 1d ago

the problem with e.g. writing code that would analyze .tar.zsts in the wild is that the zstd format is brain destroying and offensively poorly documented

d@nny disc@ mc² 1d ago

yann collet owes me 1 million dollars for being incredibly insecure and not writing up his "standard"

d@nny disc@ mc² 1d ago

oh yeah then there's this shit https://amass.energy/rustdoc/yzx-unstable/yzx_core/frame/data/block/enum.SequenceBehavior.html

SequenceBehavior in yzx_core::frame::data::block - Rust

API documentation for the Rust `SequenceBehavior` enum in crate `yzx_core`.

d@nny disc@ mc² 1d ago

there's never any actual offsets, sizes, block counts, id sets. the zip format has all of that shit. and google hates it

d@nny disc@ mc² 1d ago

the idea is so basic it's literally:

split a .tar in two like a magician
read in an arbitrary byte stream from the user, encoded as 0-length RLE blocks
sandwich output
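
The middle step can be sketched roughly like this, going by the Block_Header layout in RFC 8878 (3 bytes, little-endian: bit 0 is Last_Block, bits 1-2 are Block_Type, bits 3-23 are Block_Size). An RLE block carries exactly one content byte and Block_Size says how many times to repeat it, so a Block_Size of 0 leaves one byte on the wire that regenerates nothing. The function names are hypothetical, and whether a given decoder accepts a zero Block_Size here is the claim this thread sets out to test, not something this sketch verifies.

```rust
/// Build the 3-byte little-endian Block_Header:
/// bit 0 = Last_Block, bits 1-2 = Block_Type, bits 3-23 = Block_Size.
fn block_header(last: bool, block_type: u8, size: u32) -> [u8; 3] {
    let raw = (last as u32)
        | (((block_type & 0b11) as u32) << 1)
        | ((size & 0x1F_FFFF) << 3); // Block_Size is 21 bits wide
    let b = raw.to_le_bytes();
    [b[0], b[1], b[2]]
}

/// Block_Type value for RLE blocks.
const RLE_BLOCK: u8 = 1;

/// Hypothetical encoder: one zero-length RLE block per payload byte.
/// Each block is 3 header bytes plus the single "repeated" byte.
fn zero_length_rle_blocks(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() * 4);
    for &byte in payload {
        out.extend_from_slice(&block_header(false, RLE_BLOCK, 0));
        out.push(byte);
    }
    out
}
```

None of these blocks sets Last_Block, so in this sketch the caller still has to follow them with a real final block (the second half of the sandwich).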

d@nny disc@ mc² 1d ago

and there are two distinct ways to do this. have i mentioned i really despise this format

d@nny disc@ mc² 1d ago

jarek duda's tANS is cool as shit though. yann collet is such a loser for repeatedly obfuscating it. it's one of those cool mathematical results that is not a magic bullet but solves a specific problem and i like that duda is interested in expanding the application to other areas of information theory. the blake3 paper also mentions compression and i think these both should absolutely be more closely allied

d@nny disc@ mc² 1d ago

compression unfortunately still sucks bc everyone acts like they absolutely cannot wait or expend CPU time on something that will be decompressed 100x or 1000x more than it was compressed

d@nny disc@ mc² 1d ago

god i really hate this format. i'm gonna do it the slightly less annoying way first. i really hope the annoying way (not creating a new frame but embedding it in the previous frame, again as 0-length RLE blocks) does not require a whole encoder. i'm gonna add a dependency for that shit

d@nny disc@ mc² 1d ago

this is my enum for their block type header

      // Block_Type lives in bits 1-2 of the block header; bit 0 is the Last_Block flag
      #[inline(always)]
      fn block_type(&self) -> BlockType {
        match (self.0 & 0b110) >> 1 {
          0 => BlockType::RawBlock,
          1 => BlockType::RLEBlock,
          2 => BlockType::CompressedBlock,
          3 => BlockType::Reserved,
          _ => unreachable!("block type is limited to two bits"),
        }
      }
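
For context, a self-contained sketch of reading the whole 3-byte header, not just the type bits. The layout follows RFC 8878 (little-endian; bit 0 Last_Block, bits 1-2 Block_Type, bits 3-23 Block_Size). The `BlockHeader` struct and `parse_block_header` are made-up names for illustration and are not the yzx_core API.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum BlockType {
    RawBlock,
    RLEBlock,
    CompressedBlock,
    Reserved,
}

/// Hypothetical decoded form of the 3-byte Block_Header.
#[derive(Debug, PartialEq)]
struct BlockHeader {
    last_block: bool,
    block_type: BlockType,
    block_size: u32,
}

fn parse_block_header(bytes: [u8; 3]) -> BlockHeader {
    // Widen the 24-bit little-endian header to a u32 for easy shifting.
    let raw = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], 0]);
    let block_type = match (raw >> 1) & 0b11 {
        0 => BlockType::RawBlock,
        1 => BlockType::RLEBlock,
        2 => BlockType::CompressedBlock,
        _ => BlockType::Reserved, // only 3 is left after the two-bit mask
    };
    BlockHeader {
        last_block: raw & 1 == 1,
        block_type,
        block_size: raw >> 3, // remaining 21 bits
    }
}
```

Nothing in this parse rejects a Block_Size of 0, which is the part that matters for the rest of the thread.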

two whole bits:

"raw": on its face: sure! not bad! if it doesn't compress well, don't compress it! we will discuss the flaw there in a moment
"RLE": the idea here is that you would have a massive run of a single byte. and like. ok. but that is also very much what the compression is supposed to handle for you already

d@nny disc@ mc² 1d ago

the problem with prefix codes as i understand it now is that they require at least one bit to signal an output symbol (which is typically a byte--i also think this is incredibly important to parameterize). however prefix codes seem to be ideal in every other way (collet mentions performance)

and for extremely biased distributions i.e. ones that would benefit from less than one bit for some highly frequent output symbols, we have (actually a whole field here, but i like) duda's tANS. it's cute. it can be sized to a precise memory region. you follow the path of the blocks wherever they lead
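
A small worked example of that gap, as a sketch: for a two-symbol source with probabilities 0.99 and 0.01 (numbers picked purely for illustration), the Shannon entropy is roughly 0.08 bits per symbol, while any prefix code has to spend at least 1 bit on each symbol. The function names are hypothetical.

```rust
/// Shannon entropy in bits per symbol.
fn entropy_bits(probs: &[f64]) -> f64 {
    probs
        .iter()
        .filter(|&&p| p > 0.0) // a zero-probability symbol contributes nothing
        .map(|&p| -p * p.log2())
        .sum()
}

/// Expected length of a prefix code, given each symbol's code length in bits.
fn expected_code_len(probs: &[f64], lens: &[u32]) -> f64 {
    probs.iter().zip(lens).map(|(&p, &l)| p * l as f64).sum()
}
```

With `probs = [0.99, 0.01]` and code lengths `[1, 1]`, the prefix code costs 1 bit per symbol against an entropy of about 0.08, so it spends over ten times the information content. That gap is where fractional-bit coders like tANS come in.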

i could cite yann collet on this here but i won't cause he doesn't. and i cited everyone here https://codeberg.org/cosmicexplorer/corporeal/src/branch/main/literature/README.md

corporeal/literature/README.md at main

corporeal - String library that uses corpus dictionaries to produce a more efficient encoding than UTF-8.

Codeberg.org

d@nny disc@ mc² 1d ago

the compressed block is where it becomes clear zstd is Just Fucking Deflate Again: https://en.wikipedia.org/wiki/Deflate

BTYPE (next two bits): Block type

- 00: No compression (sometimes called stored). Any bits up to the next byte boundary are ignored. The rest of the block consists of 16-bit LEN, 16-bit NLEN (one's complement of LEN), and LEN bytes of uncompressed data, i.e. up to 65,535 (2^16 − 1) bytes. Useful for incompressible data (e.g. high-entropy, random, or already compressed), adding minimal overhead (i.e. ~5 bytes per block).
- 01: A static Huffman compressed block, using a pre-agreed Huffman tree defined in the RFC.
- 10: A dynamic Huffman compressed block, complete with the Huffman table supplied.
- 11: Reserved (error).

shameless shit. they even kept the reserved block

Deflate - Wikipedia

d@nny disc@ mc² 1d ago

i think i actually buy the way zstd references prefix trees (note: everyone in the literature calls them "huffman" trees and i will not be doing that)

d@nny disc@ mc² 1d ago

Frequently Asked Questions (FAQ)

d@nny disc@ mc² 1d ago

providing tarballs at your own site is cool but why would you also provide them from github and sourceforge? well, github has the lock-in effect.

very very funny repo though https://github.com/ip7z/7zip

GitHub - ip7z/7zip: 7-Zip

7-Zip. Contribute to ip7z/7zip development by creating an account on GitHub.

GitHub

d@nny disc@ mc² 1d ago

13 commits. i guess you can work like that. it's kind of the sqlite problem though

  MS DOCs:
    The range lock sector covers file offsets 0x7FFFFF00-0x7FFFFFFF.
    These offsets are reserved for byte-range locking to support
    concurrency, transactions, and other compound file features.
    The range lock sector MUST be allocated in the FAT and marked with
    ENDOFCHAIN (0xFFFFFFFE), when the compound file grows beyond 2 GB.
    If the compound file is greater than 2 GB and then shrinks to below 2 GB,
    the range lock sector SHOULD be marked as FREESECT (0xFFFFFFFF) in the FAT.

did you know microsoft really likes to obscure huge commits

this is just a fun fact though

d@nny disc@ mc² 1d ago

oh HELL fucking yes someone figured it out https://en.wikipedia.org/wiki/7z#Pre-processing_filters

The LZMA SDK comes with the BCJ and BCJ2 preprocessors included, so that later stages are able to achieve greater compression: For x86, ARM, PowerPC (PPC), IA-64 Itanium, and ARM Thumb processors, jump targets are "normalized"[4] before compression by changing relative position into absolute values. For x86, this means that near jumps, calls and conditional jumps (but not short jumps and conditional jumps) are converted from the machine language "jump 1655 bytes backwards" style notation to normalized "jump to address 5554" style notation; all jumps to 5554, perhaps a common subroutine, are thus encoded identically, making them more compressible.
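
A heavily simplified sketch of that idea, for x86 `call rel32` (opcode 0xE8) only: add the address of the next instruction to the relative operand so every call to the same target carries the same four bytes. This is not the real BCJ filter from the LZMA SDK, which handles more opcodes and uses heuristics to avoid touching bytes that only look like calls. `filter_calls` is a hypothetical name, and the buffer is assumed to be loaded at address 0.

```rust
/// Rewrite the rel32 operand of every 0xE8 byte found by a linear scan.
/// `encode = true` turns relative targets into absolute ones;
/// `encode = false` undoes it.
fn filter_calls(buf: &mut [u8], encode: bool) {
    let mut i = 0;
    while i + 5 <= buf.len() {
        if buf[i] == 0xE8 {
            let operand = u32::from_le_bytes([buf[i + 1], buf[i + 2], buf[i + 3], buf[i + 4]]);
            // A relative call is measured from the end of the 5-byte instruction.
            let next = (i as u32).wrapping_add(5);
            let rewritten = if encode {
                operand.wrapping_add(next)
            } else {
                operand.wrapping_sub(next)
            };
            buf[i + 1..i + 5].copy_from_slice(&rewritten.to_le_bytes());
            i += 5; // skip the operand so both directions scan identically
        } else {
            i += 1;
        }
    }
}
```

Because the opcode bytes are never modified and operands are always skipped, the decode pass visits exactly the same positions as the encode pass, which is what makes the transform reversible.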

7z - Wikipedia

some kind of orange shape 1d ago

@hipsterelectron Perhaps you'd be interested in what the tarlz archiver does https://www.nongnu.org/lzip/manual/tarlz_manual.html#Amendments-to-pax-format

Tarlz Manual

d@nny disc@ mc² 1d ago

@clayote before i read a single word, if it references ziv and lempel i think it's boring

some kind of orange shape 1d ago

@hipsterelectron It probably does in the actual compression part, but I've linked to the archiver part

d@nny disc@ mc² 1d ago

@clayote oh thank you this rocks sorry

the vessel of morganna 1d ago

@hipsterelectron incredible

d@nny disc@ mc² 1d ago

@astraleureka i forgot to mention the punchline which is that the one type of block that's not in DEFLATE (RLE block) is very specifically the one you can misuse

d@nny disc@ mc² 1d ago

@astraleureka and the use case doesn't really make sense since RLE should either be a component of the compression scheme like ziv lempel or you should do compression across the whole file contents which is correct

d@nny disc@ mc² 1d ago

@astraleureka billy mays here BUT WAIT, THERE'S MORE!

d@nny disc@ mc² 1d ago

@astraleureka https://datatracker.ietf.org/doc/html/rfc9659

so in addition to saying "oh yeah skippable frames are for watermarking you can just remove them"

there's also these fucking zero-length fields everywhere

and this is a format that works extremely hard to save every goddamn bit in its fucked up headers

they tell you several times BE CAREFUL!!! SOMEONE MIGHT TELL YOU TO USE A BIG WINDOW! which is an indicator that the window size hint is either monopolistic orrrrrrrrrrrrrrr

so then the new update 9659 its ENTIRE fucking deal is saying BE CAREFUL!!!!! OF BIG WINDOWS!!!!!!
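
For reference, a sketch of how the window size hint is decoded and checked, as I read the two RFCs: RFC 8878 packs it into a one-byte Window_Descriptor (top 5 bits exponent, low 3 bits mantissa), and the limit RFC 9659 makes mandatory for HTTP is 8 MB, taken here as 8 * 1024 * 1024 bytes. The function and constant names are hypothetical. This only covers frames that carry a Window_Descriptor.

```rust
/// Decode a Window_Descriptor byte:
/// windowLog = 10 + exponent, then add mantissa eighths of the base.
fn window_size(descriptor: u8) -> u64 {
    let exponent = (descriptor >> 3) as u64;
    let mantissa = (descriptor & 0b111) as u64;
    let window_base = 1u64 << (10 + exponent);
    let window_add = (window_base / 8) * mantissa;
    window_base + window_add
}

/// Assumed cap for the "zstd" HTTP content encoding (8 MB read as 8 MiB).
const HTTP_WINDOW_LIMIT: u64 = 8 * 1024 * 1024;

/// Hypothetical check a user agent might run before allocating the window.
fn acceptable_for_http(descriptor: u8) -> bool {
    window_size(descriptor) <= HTTP_WINDOW_LIMIT
}
```

The smallest value a descriptor can express is 1 KiB (descriptor byte 0) and the largest is well over a terabyte, which is presumably why the spec keeps warning about big windows.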

RFC 9659: Window Sizing for Zstandard Content Encoding

Deployments of Zstandard, or "zstd", can use different window sizes to limit memory usage during compression and decompression. Some browsers and user agents limit window sizes to mitigate memory usage concerns, thereby causing interoperability issues. This document updates the window size limit in RFC 8878 from a recommendation to a requirement in HTTP contexts.

IETF Datatracker

d@nny disc@ mc² 1d ago

@astraleureka 0-size windows are the meta bro trust me

d@nny disc@ mc² 1d ago

@astraleureka i was upset personally because i could not rule out the zero and i was gonna use a nightly feature to rule out the zero

but it was not to be.

the vessel of morganna 1d ago

@hipsterelectron why the heck does this need to be a whole ass RFC

Timo the timo 1d ago

@hipsterelectron pay your respects