#[derive(Debug, Snafu)]
#[snafu(display("needed {}, remaining {}", needed, remaining))]
pub struct RemainingError {
    pub needed: usize,
    pub remaining: usize,
    backtrace: Option<Backtrace>,
}
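for reference, a std-only sketch of roughly what that derive generates (hypothetical: no snafu dependency, backtrace capture omitted):

```rust
use std::fmt;

// Hand-rolled approximation of what #[derive(Snafu)] with the display
// attribute produces: Debug, Display, and std::error::Error impls.
#[derive(Debug)]
pub struct RemainingError {
    pub needed: usize,
    pub remaining: usize,
}

impl fmt::Display for RemainingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "needed {}, remaining {}", self.needed, self.remaining)
    }
}

impl std::error::Error for RemainingError {}

fn main() {
    let e = RemainingError { needed: 8, remaining: 3 };
    println!("{}", e); // needed 8, remaining 3
}
```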
wrote a good amount of code like this for the zstd impl that stopped when i realized zstd was weird and bad
the idea is so basic it's literally:
.tar in two like a magician

this is my enum for their block type header
#[inline(always)]
fn block_type(&self) -> BlockType {
    match (self.0 & 0b110) >> 1 {
        0 => BlockType::RawBlock,
        1 => BlockType::RLEBlock,
        2 => BlockType::CompressedBlock,
        3 => BlockType::Reserved,
        _ => unreachable!("block type is limited to two bits"),
    }
}
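for context, a sketch of the surrounding types that fn assumes (the BlockHeader name is my guess; the bit layout is from the zstd spec, RFC 8878 §3.1.1.2: a block header is 3 bytes, bit 0 is Last_Block, bits 1-2 are Block_Type, the remaining 21 bits are Block_Size):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockType {
    RawBlock,        // 0: uncompressed bytes, stored verbatim
    RLEBlock,        // 1: a single byte repeated Block_Size times
    CompressedBlock, // 2: a zstd-compressed block
    Reserved,        // 3: a conforming decoder must reject this
}

// Hypothetical newtype over the 24-bit block header, read little-endian.
struct BlockHeader(u32);

impl BlockHeader {
    fn last_block(&self) -> bool {
        self.0 & 0b1 != 0
    }
    fn block_type(&self) -> BlockType {
        match (self.0 & 0b110) >> 1 {
            0 => BlockType::RawBlock,
            1 => BlockType::RLEBlock,
            2 => BlockType::CompressedBlock,
            3 => BlockType::Reserved,
            _ => unreachable!("block type is limited to two bits"),
        }
    }
    fn block_size(&self) -> u32 {
        self.0 >> 3
    }
}

fn main() {
    // type bits = 0b10 = CompressedBlock, last_block bit clear
    let h = BlockHeader(0b100);
    assert_eq!(h.block_type(), BlockType::CompressedBlock);
    assert!(!h.last_block());
}
```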
two whole bits:
the problem with prefix codes as i understand it now is that they require at least one bit to signal an output symbol (which is typically a byte--i also think this is incredibly important to parameterize). however prefix codes seem to be ideal in every other way (collet mentions performance)
and for extremely biased distributions i.e. ones that would benefit from less than one bit for some highly frequent output symbols, we have (actually a whole field here, but i like) duda's tANS. it's cute. it can be sized to a precise memory region. you follow the path of the blocks wherever they lead
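to make the "less than one bit" point concrete, here's the Shannon entropy of a heavily biased two-symbol source; a prefix code has to spend at least 1 whole bit per symbol, but the actual information content is way under 1 bit, and that gap is exactly what tANS closes:

```rust
// Shannon entropy in bits/symbol: H = -sum(p * log2(p)).
fn entropy_bits(probs: &[f64]) -> f64 {
    probs
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.log2())
        .sum()
}

fn main() {
    // 99% zeros, 1% ones: far below the 1 bit/symbol a prefix code needs.
    let h = entropy_bits(&[0.99, 0.01]);
    println!("{:.4} bits/symbol", h); // 0.0808 bits/symbol
    assert!(h < 1.0);
}
```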
i could cite yann collet on this here but i won't cause he doesn't. and i cited everyone here https://codeberg.org/cosmicexplorer/corporeal/src/branch/main/literature/README.md
the compressed block is where it becomes clear zstd is Just Fucking Deflate Again: https://en.wikipedia.org/wiki/Deflate
BTYPE (next two bits): Block type
00: No compression (sometimes called stored). Any bits up to the next byte boundary are ignored. The rest of the block consists of 16-bit LEN, 16-bit NLEN (one's complement of LEN), and LEN bytes of uncompressed data, i.e. up to 65,535 (2^16 − 1) bytes. Useful for incompressible data (e.g. high-entropy, random, or already compressed), adding minimal overhead (i.e. ~5 bytes per block).
01: A static Huffman compressed block, using a pre-agreed Huffman tree defined in the RFC.
10: A dynamic Huffman compressed block, complete with the Huffman table supplied.
11: Reserved (error).

shameless shit. they even kept the reserved block
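the stored-block payload is simple enough to sketch (this skips the bit-level 3-bit header and byte alignment; it's just the LEN/NLEN/data layout described above):

```rust
// Payload of a DEFLATE BTYPE=00 (stored) block: LEN as u16 little-endian,
// NLEN = one's complement of LEN, then LEN raw bytes.
fn stored_block_payload(data: &[u8]) -> Vec<u8> {
    assert!(data.len() <= 0xFFFF, "stored blocks cap at 65,535 bytes");
    let len = data.len() as u16;
    let nlen = !len; // one's complement, used as a sanity check by decoders
    let mut out = Vec::with_capacity(4 + data.len());
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(&nlen.to_le_bytes());
    out.extend_from_slice(data);
    out
}

fn main() {
    let block = stored_block_payload(b"hi");
    // LEN = 2 -> 02 00, NLEN = 0xFFFD -> FD FF, then the raw bytes
    assert_eq!(block, vec![0x02, 0x00, 0xFD, 0xFF, b'h', b'i']);
}
```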
more from the DEFLATE page
Most compressible data will end up being encoded using method 10, the dynamic Huffman encoding, which produces an optimized Huffman tree customized for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the percentage compression loss due to using a non-optimal (thus, not technically Huffman) code.
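the cute part of "instructions to generate the necessary Huffman tree" is that DEFLATE only transmits a code LENGTH per symbol; the codes themselves are reconstructed canonically (RFC 1951 §3.2.2), shorter codes first, ties broken by symbol order. a sketch of that reconstruction:

```rust
// Canonical Huffman code assignment from code lengths, per RFC 1951.
fn canonical_codes(lengths: &[u8]) -> Vec<u32> {
    let max_len = *lengths.iter().max().unwrap_or(&0) as usize;
    // Count how many codes exist at each bit length.
    let mut bl_count = vec![0u32; max_len + 1];
    for &l in lengths {
        if l > 0 {
            bl_count[l as usize] += 1;
        }
    }
    // Compute the smallest code value for each bit length.
    let mut next_code = vec![0u32; max_len + 1];
    let mut code = 0u32;
    for bits in 1..=max_len {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }
    // Assign codes to symbols in symbol order.
    lengths
        .iter()
        .map(|&l| {
            if l == 0 {
                0
            } else {
                let c = next_code[l as usize];
                next_code[l as usize] += 1;
                c
            }
        })
        .collect()
}

fn main() {
    // The worked example from RFC 1951: lengths (3,3,3,3,3,2,4,4)
    // yield codes 010 011 100 101 110 00 1110 1111.
    let codes = canonical_codes(&[3, 3, 3, 3, 3, 2, 4, 4]);
    assert_eq!(
        codes,
        vec![0b010, 0b011, 0b100, 0b101, 0b110, 0b00, 0b1110, 0b1111]
    );
}
```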
one thing i haven't seen ANYONE fucking talk about is generating compression blocks that understand the boundaries of file contents
https://www.7-zip.org/faq.html hmmmmm
Can I use the source code of 7-Zip in a commercial application?
he then provides maybe the most concise description of the LGPL requirements i've ever seen. i am team 7zip for life as of this moment
providing tarballs at your own site is cool but why would you also provide them from github and sourceforge? well, github has the lock-in effect.
very very funny repo though https://github.com/ip7z/7zip
13 commits. i guess you can work like that. it's kind of the sqlite problem though
MS DOCs:
The range lock sector covers file offsets 0x7FFFFF00-0x7FFFFFFF.
These offsets are reserved for byte-range locking to support
concurrency, transactions, and other compound file features.
The range lock sector MUST be allocated in the FAT and marked with
ENDOFCHAIN (0xFFFFFFFE), when the compound file grows beyond 2 GB.
If the compound file is greater than 2 GB and then shrinks to below 2 GB,
the range lock sector SHOULD be marked as FREESECT (0xFFFFFFFF) in the FAT.
did you know microsoft really likes to obscure huge commits
this is just a fun fact though
oh HELL fucking yes someone figured it out https://en.wikipedia.org/wiki/7z#Pre-processing_filters
The LZMA SDK comes with the BCJ and BCJ2 preprocessors included, so that later stages are able to achieve greater compression: For x86, ARM, PowerPC (PPC), IA-64 Itanium, and ARM Thumb processors, jump targets are "normalized"[4] before compression by changing relative position into absolute values. For x86, this means that near jumps, calls and conditional jumps (but not short jumps and conditional jumps) are converted from the machine language "jump 1655 bytes backwards" style notation to normalized "jump to address 5554" style notation; all jumps to 5554, perhaps a common subroutine, are thus encoded identically, making them more compressible.
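the x86 E8 (CALL rel32) case can be sketched like this (an illustration of the idea only, not 7-Zip's actual BCJ filter, which also masks false positives and handles more opcodes):

```rust
// Rewrite CALL rel32 displacements into absolute targets, so every call
// to the same subroutine becomes an identical byte string for the
// compressor's match-finder.
fn bcj_encode_calls(code: &mut [u8]) {
    let mut i = 0;
    while i + 5 <= code.len() {
        if code[i] == 0xE8 {
            let rel =
                i32::from_le_bytes([code[i + 1], code[i + 2], code[i + 3], code[i + 4]]);
            // The target is relative to the end of the 5-byte instruction.
            let abs = (i as i32 + 5).wrapping_add(rel);
            code[i + 1..i + 5].copy_from_slice(&abs.to_le_bytes());
            i += 5;
        } else {
            i += 1;
        }
    }
}

fn main() {
    // Two calls at different offsets to the same target 0x20:
    // call at 0 has rel = 0x20 - 5 = 0x1B; call at 8 has rel = 0x20 - 13 = 0x13.
    let mut code = vec![0u8; 16];
    code[0] = 0xE8;
    code[1..5].copy_from_slice(&0x1Bi32.to_le_bytes());
    code[8] = 0xE8;
    code[9..13].copy_from_slice(&0x13i32.to_le_bytes());
    bcj_encode_calls(&mut code);
    // Both displacements now encode the absolute target 0x20: identical,
    // hence more compressible.
    assert_eq!(&code[1..5], &code[9..13]);
    assert_eq!(i32::from_le_bytes([code[1], code[2], code[3], code[4]]), 0x20);
}
```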
omg 7z uses lzfse! cc @steve https://github.com/ip7z/7zip/blob/839151eaaad24771892afaae6bac690e31e58384/DOC/License.txt#L49-L53
i also specifically reference it in my literature review as good c code. super glad it's out there
no_std MUST!!!!! be declared in the most inconvenient way possible, completely invisible to the human eye or cargo. i will never again subject myself to this. no one should have to live like this

being around when rust was still a queer revolution that broke google twice is crazy cause i get search results with supreme SEO linking to people who generated streams of falsehoods with steve klabnik on the podcast circuit
Binary packages must not expose their library functionality within the same package.
literally coincidentally today like 3 hours ago i decided to enable this for spack external packages because we don't make decisions for people we describe the world around us
The library package must be separated out, with an appropriate name linking the two.
binary and library in the same package? obfuscation. xz all over again
Some examples of linked names:
my-lib for the library, and my-lib-cli for the binary, if most people are going to use the library.
my-app-core for the library, and my-app for the binary, if most people are going to use the binary.
my-utility for the library, and cargo-my-utility for the binary, if your program is a Cargo plugin.

this is a really misleading portrayal of the "-core" convention which is actually a good and useful pattern. "my-lib-cli" is.......i mean that's what we'd want for the zip crate. (& i made up a really cute name for the smaller version.....zip-clite). but a cli is not an afterthought lmao
if most people are going to use the library.
what does this mean?
no_std crate) from an external API. this is what i did with my parser compiler, and with the grouplink signal fork. sometimes you end up with a copy of the same API. that's not actually cruft. that's breathing room. it's slack. that's specifically what you learn when you write twitter scale services (it's like google scale except we actually solved user problems)

@astraleureka my notes on this are currently:
It is deeply confusing to me why this would have been applied to machine code jumps for the
purpose of compression, when UTF-8 IS RIGHT THERE????
But the 7zip maintainer is Russian, I think? So he would be familiar with the scourge of
UTF-8. Maybe he does solve text in exactly the right way and just doesn't consider it
a separate optimization?
idk if i really buy that. i can't possibly imagine he hasn't thought of decoding the utf-8 varint encoding though and i think it would make sense to not consider that a "big idea" cause it's obvious. but i would talk about that shit all the time. this entire repo https://codeberg.org/cosmicexplorer/corporeal is me talking about that shit before i did it
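to spell out what i mean by "the utf-8 varint encoding": the leading byte alone tells you the sequence length, and continuation bytes are all 10xxxxxx, so the stream is self-synchronizing. a sketch of that length-from-lead-byte property:

```rust
// Sequence length implied by a UTF-8 lead byte; continuation bytes
// (0b10xxxxxx) and invalid leads get None.
fn utf8_seq_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1), // 0xxxxxxx: ASCII
        0xC0..=0xDF => Some(2), // 110xxxxx
        0xE0..=0xEF => Some(3), // 1110xxxx
        0xF0..=0xF7 => Some(4), // 11110xxx
        _ => None,              // continuation or invalid lead byte
    }
}

fn main() {
    assert_eq!(utf8_seq_len(b'a'), Some(1));
    assert_eq!(utf8_seq_len("é".as_bytes()[0]), Some(2));
    assert_eq!(utf8_seq_len("€".as_bytes()[0]), Some(3));
    assert_eq!(utf8_seq_len("🦀".as_bytes()[0]), Some(4));
    assert_eq!(utf8_seq_len(0x80), None); // bare continuation byte
}
```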
tarlz archiver does https://www.nongnu.org/lzip/manual/tarlz_manual.html#Amendments-to-pax-format

@astraleureka https://datatracker.ietf.org/doc/html/rfc9659
so in addition to saying "oh yeah skippable frames are for watermarking you can just remove them"
there's also these fucking zero-length fields everywhere
and this is a format that works extremely hard to save every goddamn bit in its fucked up headers
they tell you several times BE CAREFUL!!! SOMEONE MIGHT TELL YOU TO USE A BIG WINDOW! which is an indicator that the window size hint is either monopolistic orrrrrrrrrrrrrrr
so then the new update, RFC 9659: its ENTIRE fucking deal is saying BE CAREFUL!!!!! OF BIG WINDOWS!!!!!!

Deployments of Zstandard, or "zstd", can use different window sizes to limit memory usage during compression and decompression. Some browsers and user agents limit window sizes to mitigate memory usage concerns, thereby causing interoperability issues. This document updates the window size limit in RFC 8878 from a recommendation to a requirement in HTTP contexts.
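the window size they're fighting over comes from one byte in the frame header. a sketch of the Window_Descriptor math from RFC 8878 §3.1.1.1.2, against the 8 MiB ceiling that (as i understand it) RFC 9659 hardens from a recommendation to a requirement for the HTTP "zstd" content coding:

```rust
// The 8 MiB window ceiling for zstd in HTTP contexts (assumption: this is
// the RFC 9659 limit; RFC 8878 recommended it, 9659 requires it).
const HTTP_WINDOW_LIMIT: u64 = 8 << 20;

// Window_Descriptor: Exponent in bits 7-3, Mantissa in bits 2-0.
fn window_size(descriptor: u8) -> u64 {
    let exponent = (descriptor >> 3) as u64;
    let mantissa = (descriptor & 0b111) as u64;
    let window_log = 10 + exponent;
    let window_base = 1u64 << window_log;
    // Mantissa adds eighths of the base on top.
    window_base + (window_base / 8) * mantissa
}

fn main() {
    // Exponent 13, Mantissa 0 -> 1 << 23 = 8 MiB: right at the ceiling.
    assert_eq!(window_size(13 << 3), HTTP_WINDOW_LIMIT);
    // Exponent 14 -> 16 MiB: a conforming HTTP decoder may refuse this.
    assert!(window_size(14 << 3) > HTTP_WINDOW_LIMIT);
}
```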
@astraleureka i was upset personally because i could not rule out the zero and i was gonna use a nightly feature to rule out the zero
but it was not to be.