Holy _shit_ this paper, and the insight behind it.

You know how every receiver is also a transmitter, _well_: every text predictor is also a text compressor, and vice-versa.

You can outperform massive neural networks with millions of parameters, using a few lines of Python and a novel application of _gzip_.

https://aclanthology.org/2023.findings-acl.426/

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, Jimmy Lin. Findings of the Association for Computational Linguistics: ACL 2023. 2023.

If this is reliable, this is "take something that needed a datacenter last year and do it on a phone this year" material.
@mhoye every once in a while we find another one of these: https://en.wikipedia.org/wiki/Noether%27s_theorem

@gvwilson @mhoye How is Noether's theorem relevant to this? This is so far off the usual applications of the theorem that I don't see the link. :⁠-⁠)
@RoarStovner @gvwilson @mhoye I think the link is the 1:1 mapping of a generator in a symmetry to a conservation law. Here someone seems to have found (or more: proven, once again), that there's a 1:1 mapping between compression and text generation.
@ljrk @gvwilson @mhoye Thanks! I still don't see how symmetries and conservation laws are related to compression and text generation. That there is a bijection, I understand.

@RoarStovner @gvwilson @mhoye IIRC Noether showed that there will always be such a bijection given certain environments. So this result could be considered to be – albeit of course a bit of a stretch – an example of such a bijection foretold by Noether.

But it's more an analogy than anything else, or I'm missing something as well :)

@ljrk @RoarStovner My apologies for writing such a brief toot - all I meant was that sometimes someone finds something that is beautiful because it unexpectedly simplifies our understanding of the world @mhoye
@mhoye I don’t know, seems like you would need the storage space to store all the references you’re comparing to, even if the computation is easy. Sounds like a time/space trade off.
@XaiaX @mhoye but that's almost already the case, existing models are pretty large. I'd worry more about the computational cost of classifying against the reference set: roughly O(|to_classify| * |reference_dataset| * maxlen(pairwise concatenations)).
@fogti @XaiaX So, I don't actually think that acres of storage space is all that meaningful if you just want to focus on good-enough utility. We all kind of know you're not getting more than varying heats of spicy autocomplete out of these tools, and the core insight of statistics as a field is that you can make a very accurate approximation of the state of a large dataset out of a small subset of that data. So, maybe a wikipedia dump and a gutenberg mirror is plenty?

@mhoye @XaiaX > spicy autocomplete

huh, nah the topic here is text classification, which is similar to text prediction (and there is probably a way to use Huffman tables to produce suggestions, etc.), but not the same.

> So, maybe a wikipedia dump and a gutenberg mirror is plenty?

probably yes. (imo the coolest thing would be a tool that could both classify and predict using the same infrastructure: lightly preprocessed large text dumps (<10GiB). massive improvement)
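The "classify and predict with the same machinery" idea above can be sketched with gzip alone: rank candidate continuations by how little they add to the compressed length of the context. A toy illustration only (the function names and example sentence are mine, not from the paper):

```python
import gzip

def clen(s: str) -> int:
    # All we use is the *length* of the gzip-compressed bytes.
    return len(gzip.compress(s.encode()))

def suggest(context: str, candidates: list[str]) -> str:
    # The candidate that adds the least compressed length to the
    # context is the one the compressor "expected".
    return min(candidates, key=lambda w: clen(context + " " + w) - clen(context))

ctx = "the quick brown fox jumps over the lazy dog; the quick brown fox"
print(suggest(ctx, ["jumps", "sleeps", "argues"]))
```

Because "jumps" already follows "the quick brown fox" in the context, DEFLATE can back-reference it almost for free, so it wins the ranking.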

@fogti @XaiaX I'm a bit more interested in predictive text tools that give you stylistic nudges towards artists you admire, and finding a way to get artists paid for that. Smart autocomplete/smart fill tools that answer, "what might Degas have done right here?"
@mhoye @XaiaX interesting thing is that it is probably *much* easier to simultaneously get the information for text completion *and* also what authors were involved in that match (instead of a large pool of authors just the relevant subset). [in the case of these compression-decompression + reference dataset models]

@fogti I have no idea how big a plain classifier model would be relative to the generalized LLMs that are so huge.

I just skimmed the paper, though, it’s been a few years since I did serious NLP stuff and that was all search/rules based shit anyway.

Meet “ZipPy”, a fast AI LLM text detector


@mhoye I will have to look at this.

I always felt I was just a smidge not smart enough to really dive in and get machine learning, in a "let me implement this myself" way.

But I get nearest neighbors and I should be able to understand zipping.

@mhoye Wait, what? So their claim is that anything that is decent at compression, should also be decent at prediction? So therefore we've missed the boat because all of the work on compression the past few decades means we have really good predictors?

Definitely going to be reading this.

@rmflight @mhoye symbol prediction is part of many of the best lossless compression algorithms, isn't it?
@gabrielesvelto @rmflight It is, and if this paper's results hold up, we're talking about large-scale deep-learning networks being fundamentally a technical dead end: something that takes a datacenter to do with a DNN can be done better with a clever application of gzip on a phone.

@gabrielesvelto @rmflight

Did I say phone?

I meant GameBoy.

@mhoye @gabrielesvelto @rmflight On first look I think what this paper suggests is 1) for some classification tasks there's a nicely simple approach that works well and 2) this is a promising path towards better feature engineering for language models that will in turn result in better accuracy vs cost.
@mhoye @gabrielesvelto @rmflight If this works out well, we'll see better + smaller models for all tasks (not just classification) that outperform both current DNNs and the NCD technique they use at moderate cost. There's precedent for this being a successful approach, for example using frequency-domain data for audio models instead of raw PCM. There's also precedent for finding ways DNNs waste a lot of capacity on effectively routing data around, and restructuring to fix that (ResNets, for example).
@mhoye @gabrielesvelto @rmflight Overall, though, in recent history data-based approaches have tended to win, so I would expect the useful bits to get incorporated into DNNs rather than DNNs being obsoleted in almost any context. My favorite essay on that topic, by Rich Sutton: http://incompleteideas.net/IncIdeas/BitterLesson.html

@mhoye @gabrielesvelto @rmflight "Our method is a simple, lightweight, and universal alternative to DNNs. It’s simple because it doesn’t require any preprocessing or training. It’s lightweight in that it classifies without the need for parameters or GPU resources. It’s universal as compressors are data-type agnostic, and non-parametric methods do not bring underlying assumptions."

The result is in principle an alternative to what we call models in LLM-speak, @mhoye?

@judell @gabrielesvelto @rmflight I think more importantly it trivializes the creation of those models, and provides obvious guarantees of reproducibility and training-material provenance. It not only democratizes access to the tech but provides avenues for contributors to get paid for the value they bring to the resultant models.

@mhoye @gabrielesvelto @rmflight Almost like chugging hashes for merkle trees on ASICs and GPUs for something that can better be done with an Excel spreadsheet...

It's exactly the same people who wanted it done with ASICs and GPUs doing the DNN stuff, for exactly the same reasons, which have nothing to do with any utility of the process and everything to do with running scams.

@rmflight @mhoye Sounds about right. My old coworker Tom always joked that NLP was just a compression algorithm on the input text.
@paulmheider @rmflight And Ted Chiang has referred to ML-generated text as "a jpeg of a language", yeah. But to see that come together in fifteen lines of python that out-do these massive, crazy expensive DNN models is bonkers, jaws on the floor material.
@mhoye @paulmheider @rmflight the basic idea is so simple, why has nobody done this before? Apparently ideas also need to be tried out 😄
@mhoye @rmflight The converse is certainly true: if something is decent at prediction, it will be decent at compression. If you *almost* predict what the following data is going to be, you can store the difference between the prediction and the real data, which will take less space.
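That "store the difference between the prediction and the real data" point is easy to demonstrate. Below, a toy predictor ("next byte = previous byte") is applied to a random walk, which LZ compressors struggle with directly; the residual stream compresses far better than the raw bytes. (The predictor, signal, and names here are my own illustration, not from the paper.)

```python
import random
import zlib

# Toy signal: a random walk. Hard for LZ to compress directly
# (few repeated substrings), but trivially predictable step by step.
random.seed(0)
vals = [128]
for _ in range(4000):
    vals.append((vals[-1] + random.choice([-1, 0, 1])) % 256)
raw = bytes(vals)

# Predict each byte as the previous byte; keep only the residuals.
residuals = bytes([vals[0]] + [(vals[i] - vals[i - 1]) % 256
                               for i in range(1, len(vals))])

c_raw = len(zlib.compress(raw, 9))
c_res = len(zlib.compress(residuals, 9))
print(c_raw, c_res)  # the residual stream is much smaller

# Lossless: the original is exactly recoverable from the residuals.
recon = [residuals[0]]
for r in residuals[1:]:
    recon.append((recon[-1] + r) % 256)
assert bytes(recon) == raw
```

The residuals take only three values (-1, 0, +1 mod 256), so the entropy coder inside zlib packs them into well under two bits each.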
@mhoye what is it for? for detecting topic of the text without needing to costly train neural network?
@mhoye holy fuck this is potentially brilliant, and I am so excited to dive into reading this after work
@sysop "Code is available at $URL" right there in the abstract! Holy shit!
@mhoye this is text classification not text prediction

@mhoye wait it works on a dataset in Pinyin?! Damn

And it's 14 lines of python

And it's not even looking at the contents of the compression, it's just using the byte lengths? Holy fucking shit that's smart as hell.

😍
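The "just using the byte lengths" trick is essentially Normalized Compression Distance plus nearest-neighbour. A minimal sketch, assuming gzip as the compressor and 1-NN (the variable names and toy training pairs are mine, not the paper's code):

```python
import gzip

def clen(s: str) -> int:
    # We never look inside the compressed output, only at its length.
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    # Normalized Compression Distance: small when a and b share structure,
    # because compressing them together saves space.
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(text: str, labelled: list[tuple[str, str]]) -> str:
    # 1-nearest-neighbour over NCD to the labelled examples.
    return min(labelled, key=lambda pair: ncd(text, pair[0]))[1]

train = [
    ("cats purr and cats meow and cats sleep on the mat", "cats"),
    ("dogs bark and dogs fetch and dogs run in the park", "dogs"),
]
print(classify("my cat likes to purr and meow and sleep", train))
```

The paper uses k-NN over a real training set; the principle is the same.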

@mhoye compression methods have been used in text classification for authorship analysis for quite a while.

@tschfflr @mhoye yes, I remember discussion of this at least a decade ago and probably longer.

Still interesting.

@tschfflr @mhoye after looking through it it seems like they acknowledge all that as the basis for this work.
@mhoye @jrconlin the abstract reads like a self aware joke. Please tell me it's a self-aware joke

@mhoye Compression is ... very interesting. Recalling here Schmidhuber's take on compression and its predictability being at the root of learning, beauty, novelty, interestingness, boredom ... https://arxiv.org/pdf/0812.4360

Previously: https://mathstodon.xyz/@albertcardona/110686536069075845

@mhoye reading this paper and cackling the whole time, it's just so clever
@mhoye (this is not particularly novel, I remember reading a similar paper about using gzip to de-anonymize written works back around when I was learning about singular value decomposition to support vector space document searching from Maciej Ceglowski, I think back in the early 2000s. But it definitely is novel that this old technique is now outperforming current tech darlings.)

@mhoye that prediction ≈ compression is also the basis of the Hutter Prize

https://en.m.wikipedia.org/wiki/Hutter_Prize


@mhoye IIRC, a few years ago someone found a way to categorize music files (MIDI?) by testing which ones compress well together.
@mhoye This is awesome, but I'm surprised it wasn't better known. I have vague memories of going to a talk by a researcher in Oxford about 25 years ago about using gzip compression for text analysis. His presentation explained about entropy and how compression is prediction, then looked at categorising text by gzipping it. Can't remember the name; some guy doing inference stuff in the psychology department. This is going to bug me now.
@simoncozens From what I can tell the fact itself wasn't a big secret, but the idea that with apparently negligible effort you can outperform tools that are insanely expensive and wildly more complicated is the interesting part.
@mhoye The right algorithm in the right place beats an inscrutable pile of ReLUs every damned time.
@simoncozens I read this in the G-Man's voice from Half Life.
@simoncozens @mhoye it might take a pile of inscrutable ReLUs to tell what’s the right algorithm and/or the right place though 💡

@simoncozens @mhoye I suppose this is related to an early corpus-linguistics rule of thumb: if you have a sentence in English and several sentences in French, one of which is the translation of the English one, the correct translation is most likely to be the one closest to 111% of the length of the English one.

possible ref is from Gale & Church's work, c.1994
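That length-ratio heuristic fits in a couple of lines. A toy sketch only: the 111% figure comes from the post above, and the candidate strings are invented for illustration:

```python
def pick_translation(english: str, candidates: list[str]) -> str:
    # Choose the candidate whose character length is closest to
    # 111% of the English sentence's length (the rule of thumb above).
    target = 1.11 * len(english)
    return min(candidates, key=lambda c: abs(len(c) - target))

en = "The cat is on the mat."  # 22 chars, target ~24.4
fr = [
    "Chat.",                                                         # far too short
    "Le chat est sur le tapis.",                                     # ~25 chars, near 111%
    "Le chat, un animal domestique bien connu, est sur le tapis.",   # far too long
]
print(pick_translation(en, fr))  # → "Le chat est sur le tapis."
```

Gale & Church's actual aligner is probabilistic over length ratios, but this is the intuition.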

@simoncozens @mhoye thank you! This reference reconnected me to http://ccc.inaoep.mx/~villasen/bib/joula-Authorship%20Attribution.pdf which got me to https://arxiv.org/abs/cond-mat/0108530 , which I believe is the early-2000s paper I was thinking of earlier!

Edit: plus the latent semantic indexing article I read around the same time by Clara Yu, John Cuadrado, Maciej Ceglowski (yes, idlewords), and J. Scott Payne, https://web.archive.org/web/20050507172205/http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm

@mhoye ...That's mildly terrifying if it works out, actually. A major jump in AI capabilities. Eep. :/
@Angle I don't think it's a jump in capabilities, just a massive decrease in cost and complexity, which is different (and democratizing!)
@mhoye Eh, a decrease in cost and complexity is an increase in capability / (cost & complexity). Assuming it pans out. Hopefully it'll be democratizing? We'll see. XD