i looked at the rpgp codebase and it is very good code. incredibly small and the directory structure alone is remarkably effective
#[derive(Debug, Snafu)]
#[snafu(display("needed {}, remaining {}", needed, remaining))]
pub struct RemainingError {
    pub needed: usize,
    pub remaining: usize,
    backtrace: Option<Backtrace>,
}

wrote a good amount of code like this for the zstd impl that stopped when i realized zstd was weird and bad

i'm gonna try to make the proof of concept for length extension now. i'm very confident i'm right
i have to generate their fucked up little format
you literally just generate blocks that have zero length
it's the most unnecessary thing https://amass.energy/rustdoc/yzx-unstable/yzx_core/frame/data/ ziv-lempel's bullshit already lets you do run-length shit. why are there TWO ways to encode a run-length block of the same byte

the problem with e.g. writing code that would analyze .tar.zsts in the wild is that the zstd format is brain destroying and offensively poorly documented
yann collet owes me 1 million dollars for being incredibly insecure and not writing up his "standard"

there's never any actual offsets, sizes, block counts, id sets. the zip format has all of that shit. and google hates it

the idea is so basic it's literally:

  • split a .tar in two like a magician
  • read in an arbitrary byte stream from the user, encoded as 0-length RLE blocks
  • sandwich output
and there are two distinct ways to do this. have i mentioned i really despise this format
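to make the trick concrete, here's a sketch of that block encoding against RFC 8878's block header layout (my own illustration, not code from yzx_core; `zero_rle_block` and `smuggle` are made-up names, and a real PoC still needs the surrounding frame magic/header/checksum):

```rust
/// each arbitrary byte rides as the single-byte content of an RLE block
/// whose Block_Size is 0, so it decompresses to NOTHING. header layout per
/// RFC 8878: 3 bytes, little-endian 24-bit value, bit 0 = Last_Block,
/// bits 1-2 = Block_Type (1 = RLE_Block), bits 3-23 = Block_Size.
fn zero_rle_block(payload: u8, last: bool) -> [u8; 4] {
    let header: u32 = (last as u32) | (1 << 1) | (0 << 3);
    let h = header.to_le_bytes();
    [h[0], h[1], h[2], payload]
}

/// encode a whole user-supplied byte stream as consecutive 0-length RLE
/// blocks, marking the final block as Last_Block.
fn smuggle(bytes: &[u8]) -> Vec<u8> {
    let n = bytes.len();
    bytes
        .iter()
        .enumerate()
        .flat_map(|(i, &b)| zero_rle_block(b, i + 1 == n))
        .collect()
}
```

each smuggled byte costs 4 bytes on the wire and contributes zero bytes of decompressed output, which is exactly why the sandwich leaves the decompressed .tar untouched.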
jarek duda's tANS is cool as shit though. yann collet is such a loser for repeatedly obfuscating it. it's one of those cool mathematical results that is not a magic bullet but solves a specific problem and i like that duda is interested in expanding the application to other areas of information theory. the blake3 paper also mentions compression and i think these both should absolutely be more closely allied
compression unfortunately still sucks bc everyone acts like they absolutely cannot wait or expend CPU time on something that will be decompressed 100x or 1000x more than it was compressed
god i really hate this format. i'm gonna do it the slightly less annoying way first. i really hope the annoying way (not creating a new frame but embedding it in the previous frame, again as 0-length RLE blocks) does not require a whole encoder. i'm gonna add a dependency for that shit

this is how i decode their block type header into my enum

#[inline(always)]
fn block_type(&self) -> BlockType {
    match (self.0 & 0b110) >> 1 {
        0 => BlockType::RawBlock,
        1 => BlockType::RLEBlock,
        2 => BlockType::CompressedBlock,
        3 => BlockType::Reserved,
        _ => unreachable!("block type is limited to two bits"),
    }
}
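the `BlockType` enum that accessor returns isn't shown above; a minimal version of it, plus a hypothetical newtype over the header's first byte mirroring the accessor, might look like:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockType {
    RawBlock,        // 0: bytes stored verbatim
    RLEBlock,        // 1: a single byte, repeated Block_Size times
    CompressedBlock, // 2: zstd-compressed (literals + sequences)
    Reserved,        // 3: decoders must reject this
}

/// hypothetical wrapper over the first byte of the 3-byte block header;
/// bit 0 is Last_Block, bits 1-2 are the block type.
pub struct HeaderByte(pub u8);

impl HeaderByte {
    pub fn block_type(&self) -> BlockType {
        match (self.0 & 0b110) >> 1 {
            0 => BlockType::RawBlock,
            1 => BlockType::RLEBlock,
            2 => BlockType::CompressedBlock,
            _ => BlockType::Reserved,
        }
    }
}
```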

two whole bits:

  • "raw": on its face: sure! not bad! if it doesn't compress well, don't compress it! we will discuss the flaw there in a moment
  • "RLE": the idea here is that you would have a massive run of a single byte. and like. ok. but that is also very much what the compression is supposed to handle for you already

the problem with prefix codes as i understand it now is that they require at least one bit to signal an output symbol (which is typically a byte; i also think this is incredibly important to parameterize). however prefix codes seem to be ideal in every other way (collet mentions performance)

and for extremely biased distributions i.e. ones that would benefit from less than one bit for some highly frequent output symbols, we have (actually a whole field here, but i like) duda's tANS. it's cute. it can be sized to a precise memory region. you follow the path of the blocks wherever they lead
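a quick numeric illustration of that gap (my own, not from collet or duda): the shannon entropy of a biased binary source drops well below the 1-bit-per-symbol floor a prefix code has to pay, and tANS/arithmetic coding can get arbitrarily close to the entropy:

```rust
/// shannon entropy in bits/symbol of a binary source with P(0) = p.
/// only valid for 0 < p < 1 (log2(0) would be -inf).
fn entropy_bits(p: f64) -> f64 {
    let q = 1.0 - p;
    -(p * p.log2() + q * q.log2())
}
```

at p = 0.99 the source carries roughly 0.08 bits/symbol, so a prefix code's mandatory whole bit is over 12x the actual information content.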

i could cite yann collet on this here but i won't cause he doesn't. and i cited everyone here https://codeberg.org/cosmicexplorer/corporeal/src/branch/main/literature/README.md


the compressed block is where it becomes clear zstd is Just Fucking Deflate Again: https://en.wikipedia.org/wiki/Deflate

BTYPE (next two bits): Block type

  • 00: No compression (sometimes called stored). Any bits up to the next byte boundary are ignored. The rest of the block consists of 16-bit LEN, 16-bit NLEN (one's complement of LEN), and LEN bytes of uncompressed data, i.e. up to 65,535 (2^16 − 1) bytes. Useful for incompressible data (e.g. high-entropy, random, or already compressed), adding minimal overhead (i.e. ~5 bytes per block).
  • 01: A static Huffman compressed block, using a pre-agreed Huffman tree defined in the RFC.
  • 10: A dynamic Huffman compressed block, complete with the Huffman table supplied.
  • 11: Reserved (error).

shameless shit. they even kept the reserved block
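the stored-block framing quoted above is simple enough to write down directly; this is my own sketch of RFC 1951 §3.2.4, not deflate code from anywhere (and note a stored block with only BFINAL set really is a valid raw deflate stream for data up to 65,535 bytes):

```rust
/// emit one DEFLATE "stored" block: a 1-bit BFINAL, 2-bit BTYPE=00, pad to
/// the byte boundary, then LEN and NLEN (= !LEN) as 16-bit little-endian,
/// then the raw bytes. the ~5 bytes/block overhead falls straight out of
/// this layout.
fn stored_block(data: &[u8], bfinal: bool) -> Vec<u8> {
    assert!(data.len() <= 0xFFFF, "LEN is 16 bits");
    let len = data.len() as u16;
    let nlen = !len;
    let mut out = Vec::with_capacity(5 + data.len());
    // BFINAL is bit 0, BTYPE=00 occupies bits 1-2; the rest of the first
    // byte is the "up to the next byte boundary" padding
    out.push(if bfinal { 1 } else { 0 });
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(&nlen.to_le_bytes());
    out.extend_from_slice(data);
    out
}
```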


i think i actually buy the way zstd references prefix trees (note: everyone in the literature calls them "huffman" trees and i will not be doing that)

more from the DEFLATE page

Most compressible data will end up being encoded using method 10, the dynamic Huffman encoding, which produces an optimized Huffman tree customized for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the percentage compression loss due to using a non-optimal (thus, not technically Huffman) code.

one thing i haven't seen ANYONE fucking talk about is generating compression blocks that understand the boundaries of file contents

i think 7z has some work on this but they do it in a way i find puzzling. let me find it

https://www.7-zip.org/faq.html hmmmmm

Can I use the source code of 7-Zip in a commercial application?

he then provides maybe the most concise description of the LGPL requirements i've ever seen. i am team 7zip for life as of this moment


providing tarballs at your own site is cool but why would you also provide them from github and sourceforge? well, github has the lock-in effect.

very very funny repo though https://github.com/ip7z/7zip


13 commits. i guess you can work like that. it's kind of the sqlite problem though

MS DOCs:
The range lock sector covers file offsets 0x7FFFFF00-0x7FFFFFFF.
These offsets are reserved for byte-range locking to support
concurrency, transactions, and other compound file features.
The range lock sector MUST be allocated in the FAT and marked with
ENDOFCHAIN (0xFFFFFFFE), when the compound file grows beyond 2 GB.
If the compound file is greater than 2 GB and then shrinks to below 2 GB,
the range lock sector SHOULD be marked as FREESECT (0xFFFFFFFF) in the FAT.

did you know microsoft really likes to obscure huge commits

this is just a fun fact though

oh HELL fucking yes someone figured it out https://en.wikipedia.org/wiki/7z#Pre-processing_filters

The LZMA SDK comes with the BCJ and BCJ2 preprocessors included, so that later stages are able to achieve greater compression: For x86, ARM, PowerPC (PPC), IA-64 Itanium, and ARM Thumb processors, jump targets are "normalized"[4] before compression by changing relative position into absolute values. For x86, this means that near jumps, calls and conditional jumps (but not short jumps and conditional jumps) are converted from the machine language "jump 1655 bytes backwards" style notation to normalized "jump to address 5554" style notation; all jumps to 5554, perhaps a common subroutine, are thus encoded identically, making them more compressible.
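a toy forward-only version of that normalization, for just the x86 `call rel32` opcode (0xE8); my own sketch, not 7-Zip's BCJ code, which handles more opcodes and is exactly reversible:

```rust
/// rewrite each `call rel32` (0xE8) displacement from a relative offset
/// into an absolute target address, so repeated calls to one subroutine
/// become identical byte runs and compress better. like the real filter,
/// this naively treats every 0xE8 byte as a call opcode, so it will also
/// "normalize" false positives inside data; the inverse pass undoes them.
fn normalize_calls(code: &mut [u8]) {
    let mut i = 0;
    while i + 5 <= code.len() {
        if code[i] == 0xE8 {
            let rel = i32::from_le_bytes([
                code[i + 1],
                code[i + 2],
                code[i + 3],
                code[i + 4],
            ]);
            // the displacement is relative to the end of the instruction
            let abs = (i as i32 + 5).wrapping_add(rel);
            code[i + 1..i + 5].copy_from_slice(&abs.to_le_bytes());
            i += 5;
        } else {
            i += 1;
        }
    }
}
```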


now, one might also say it's weird to fuck with jumps in machine language for compression reasons. i wouldn't know. but fucking utf-8 man. utf-8 is the mind killer. utf-8 is the little-death that brings total destruction
7z does have a surprising number of useful comments though (LINUX COULD NEVER)
omg he added a note when the signedness of a variable changed. i haven't seen an evil person go this far yet. i think he'd have to be really good
also his directory structure is so good!!!! and he publicly said "yeah i'm not gonna use a new unproven cryptographic scheme yet" and that's hero behavior

omg 7z uses lzfse! cc @steve https://github.com/ip7z/7zip/blob/839151eaaad24771892afaae6bac690e31e58384/DOC/License.txt#L49-L53

i also specifically reference it in my literature review as good c code. super glad it's out there

particularly because yann collet is unable to write readable code and yet he keeps doing it
hey remember when github decided semantically meaningful URLs were cool and made graphical emacs in coffeescript and empowered people to use the web as an editing tool that was awesome
i am using rust and cargo for this stupid shit i hate because i implemented the rest of it in rust and immediately i confront the way no_std MUST!!!!! be declared in the most inconvenient way possible, completely invisible to the human eye or cargo. i will never again subject myself to this. no one should have to live like this

being around when rust was still a queer revolution that broke google twice is crazy cause i get search results with supreme SEO linking to people who generated streams of falsehoods with steve klabnik on the podcast circuit

Binary packages must not expose their library functionality within the same package.

literally coincidentally today like 3 hours ago i decided to enable this for spack external packages because we don't make decisions for people we describe the world around us

The library package must be separated out, with an appropriate name linking the two.

binary and library in the same package? obfuscation. xz all over again

Some examples of linked names:

  • my-lib for the library, and my-lib-cli for the binary, if most people are going to use the library.
  • my-app-core for the library, and my-app for the binary, if most people are going to use the binary.
  • my-utility for the library, and cargo-my-utility for the binary, if your program is a Cargo plugin.

this is a really misleading portrayal of the "-core" convention which is actually a good and useful pattern. "my-lib-cli" is.......i mean that's what we'd want for the zip crate. (& i made up a really cute name for the smaller version.....zip-clite). but a cli is not an afterthought lmao

if most people are going to use the library.

what does this mean?

the "-core" convention distinguishes an internal API (which may be and often is a no_std crate) from an external API. this is what i did with my parser compiler, and with the grouplink signal fork. sometimes you end up with a copy of the same API. that's not actually cruft. that's breathing room. it's slack. that's specifically what you learn when you write twitter scale services (it's like google scale except we actually solved user problems)
writing a cargo plugin means giving people a reason to use cargo but not giving cargo a reason to expose a build script API. ed page explicitly shuts this down when proposed

There's an intermediate solution possible here, which is to have a single crate that enables being built as a binary with --features=bin. However, you must not do this for code uploaded to a registry, because you lose out on the benefits of having separate versioning. You may use this pattern for code internal to a workspace.

see the thing is that cargo hates you in particular and does not support distinct dependency sets the way you absolutely require when making a binary vs library. it's not an intrinsic problem it is directly a cargo problem

Case study: The presence of the libgit2 and JGit libraries for Git has made it significantly harder to improve Git's data structures.

literal propaganda lmao

Maintaining a library in addition to a binary is hard work. It involves documentation and versioning.

this is a good point because nobody in the entire rust ecosystem has ever released version 1.0

Cargo and rustc are not designed to be invoked as libraries. They force loose coupling.

i remembered what i was searching about which was "is there a cfg flag for binary shit"

omg cfg is by far the worst shit. absolute excrement

https://github.com/rust-lang/rust/issues/32838#issuecomment-3705992254

// This is legal:
struct A<#[cfg(feature = "nightly")] B>(#[cfg(feature = "nightly")] B);

// impl<#[cfg(feature = "nightly")] B> is allowed,
// but A<#[cfg(feature = "nightly")] B> is a parse error:
impl<#[cfg(feature = "nightly")] B> A<#[cfg(feature = "nightly")] B> {
    ...
}

so i made The Worst Boolean SAT Heuristic In the History of Humankind

It makes use of a very very complex proc macro that works around this issue by identifying usages of generic params, then generating 2^n instances of the item (e.g. an impl) for n distinct #[cfg(...)]-constrained generic params. What this means is that the smallvec with parameterized allocator can do this:

#[conditional_impl_type_bounds]
unsafe impl<T: Send, #[cfg(feature = "allocator-api")] A: Allocator + Send, const N: usize> Send
    for SmallVec<T, N, A>
{
}

which gets converted to this:

#[cfg(feature = "allocator-api")]
unsafe impl<T: Send, A: Allocator + Send, const N: usize> Send
    for SmallVec<T, N, A>
{
}
#[cfg(not(feature = "allocator-api"))]
unsafe impl<T: Send, const N: usize> Send
    for SmallVec<T, N>
{
}
my expert analysis upon why #[cfg(...)] is not allowed upon the impl target is that it is the kind of inconvenience that serves a purpose, like most parsing bullshit
@julia the thing is that unfortunately i think adding annotations to c/++ is a safer bet because i have turned evil now but (in both cases) an actual SMT/ASP solver would be appropriate. it is strongly analogous to what spack does and like spack it works off of human annotations
@julia you know what i would do. you know what would be soooooo easy to do. i won't do it because unfortunately i will never contribute to cargo. but it would be soooooo easy to translate features and other configurable settings into a cacheable format like autoconf produces.
@hipsterelectron it's true, I knew a colleague once who tried to release 1.0, and then ten thousand crabs crawled out of the vents and picked up his entire cubicle with him in it and carried him away
@hipsterelectron RAR also does a lot of executable-aware shenanigans but I am unsure how much of a savings it actually offers in exchange for the ridiculously dense attack surface that RAR decoders seem to have historically

@astraleureka my notes on this are currently:

It is deeply confusing to me why this would have been applied to machine code jumps for the
purpose of compression, when UTF-8 IS RIGHT THERE????

But the 7zip maintainer is Russian, I think? So he would be familiar with the scourge of
UTF-8. Maybe he does solve text in exactly the right way and just doesn't consider it
a separate optimization?

idk if i really buy that. i can't possibly imagine he hasn't thought of decoding the utf-8 varint encoding though and i think it would make sense to not consider that a "big idea" cause it's obvious. but i would talk about that shit all the time. this entire repo https://codeberg.org/cosmicexplorer/corporeal is me talking about that shit before i did it
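the kind of "obvious" pre-filter i mean, as a sketch (mine, not anything 7zip ships): undo utf-8's varint framing so the entropy stage models one symbol per character instead of 1-4 bytes; cyrillic text is the clearest win since every letter costs two bytes in utf-8:

```rust
/// decode utf-8 into fixed-width code points: one symbol per character,
/// instead of the 1-4 byte varint encoding the entropy stage would
/// otherwise have to model.
fn codepoints(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

/// (bytes before filter, symbols after filter) for a given string.
fn symbol_counts(s: &str) -> (usize, usize) {
    (s.len(), codepoints(s).len())
}
```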

@astraleureka i specifically include russian text in my examples of utf-8 as bloatware
@astraleureka i have a medal from a russian essay contest in high school where i wrote about obama. i was being genuine
@astraleureka my teacher was easily the most openly autistic person i've ever seen as an adult. huge inspo