https://mattmahoney.net/dc/dce.html #datacompression #freeinformation #kindleformats #techhumor #sharingknowledge #HackerNews #ngated
Data Compression Explained
https://mattmahoney.net/dc/dce.html
#HackerNews #DataCompression #UnderstandingDataTech #CompressionTech #DataScience
The draft program of IFIP SEC '26 is available: https://ifipsec.org/program.html
We will present our work on AMPhitryon, a covert channel amplification (and general data compression!) technique, cf. https://github.com/cdpxe/AMPhitryon

IFIP SEC conferences are the flagship events of the International Federation for Information Processing (IFIP) Technical Committee 11 (TC11) on Information Security and Privacy Protection in Information Processing Systems. The IFIP SEC conferences aim to bring together primarily researchers, but also practitioners from academia, industry and governmental institutions, to elaborate and discuss the IT security and privacy challenges that we are facing today and will be facing in the future. Join us for our next event.
All of human cooking compressed into 2 megabytes

We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
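To make the NPMI edge weights mentioned above concrete, here is a minimal sketch of normalised pointwise mutual information computed from recipe-level co-occurrence. It assumes that probabilities are estimated as the fraction of recipes containing an ingredient or a pair; the abstract does not say how Epicure estimates them, and the function name `npmi_edges` and the toy recipes are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes):
    """Return {(a, b): npmi} for every ingredient pair that co-occurs.

    Assumption: p(a) and p(a, b) are the fractions of recipes that contain
    the ingredient or the pair. NPMI = PMI / -log p(a, b), in [-1, 1].
    """
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # de-duplicate and fix pair order
        single.update(items)
        pair.update(combinations(items, 2))

    edges = {}
    for (a, b), count_ab in pair.items():
        p_ab = count_ab / n
        if p_ab == 1.0:
            # The pair occurs in every recipe; NPMI tends to 1 in this limit.
            edges[(a, b)] = 1.0
            continue
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        edges[(a, b)] = pmi / -math.log(p_ab)
    return edges


# Toy corpus, purely illustrative.
recipes = [
    ["tomato", "basil"],
    ["tomato", "basil"],
    ["rice", "soy"],
    ["rice", "tomato"],
]
edges = npmi_edges(recipes)
for pair_key, weight in sorted(edges.items()):
    print(pair_key, round(weight, 3))
```

Pairs that co-occur more often than chance get a positive weight and pairs that co-occur less often than chance get a negative one, so a graph built this way could be thresholded on the weight to keep only informative edges.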
7-Zip 26.01 - Linux huge pages provide a solid 2.5–4.5% compression speedup on modern and cache-limited CPUs by reducing TLB overhead, but offer zero benefit for decompression or ancient hardware. #memorymanagement #x86 #hugepages #largepages #7zip #linux #compression #datacompression #benchmark #performance
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
https://arxiv.org/abs/2604.15356
#HackerNews #KVCache #Compression #TurboQuant #ShannonLimit #DataCompression

Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s') from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it, achieving a per-token entropy bound of H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). We prove that at typical language model perplexity -- approximately 10-20 for fluent English text -- this bound is 3.3-4.3 bits on average per token position, compared to TurboQuant's 3 bits per vector component (with typical attention heads having 64-128 components). The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit. Even at 1000x above the entropy floor -- a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders -- the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows. The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
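The per-token figures quoted in the abstract follow from the standard relation between perplexity and entropy, bits = log2(perplexity). The sketch below checks that arithmetic and compares it with the per-vector cost of 3 bits per component. It only covers a single vector; the abstract does not give the layer and head counts behind the 914,000x ratio, so that figure is not reproduced here. The function names are hypothetical.

```python
import math


def bits_per_token(perplexity):
    """Entropy in bits per token implied by a given perplexity."""
    return math.log2(perplexity)


def bits_per_vector(bits_per_component, components):
    """Storage cost of one quantized vector at a fixed bit width."""
    return bits_per_component * components


# Perplexity range quoted in the abstract for fluent English text.
for ppl in (10, 20):
    print(f"perplexity {ppl}: {bits_per_token(ppl):.2f} bits per token")

# 3 bits per component over the 64-128 components of a typical head.
for d in (64, 128):
    print(f"{d} components: {bits_per_vector(3, d)} bits per vector")
```

Under these assumptions a single 3-bit quantized vector costs 192 to 384 bits, against a bound of roughly 3.3 to 4.3 bits for the whole token position, which is the gap the abstract's argument rests on.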