Holy _shit_ this paper, and the insight behind it.

You know how every receiver is also a transmitter, _well_: every text predictor is also a text compressor, and vice-versa.

You can outperform massive neural networks with millions of parameters, using a few lines of Python and a novel application of _gzip_.

https://aclanthology.org/2023.findings-acl.426/

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, Jimmy Lin. Findings of the Association for Computational Linguistics: ACL 2023. 2023.

If this is reliable, this is "take something that needed a datacenter last year and do it on a phone this year" material.
@mhoye every once in a while we find another one of these: https://en.wikipedia.org/wiki/Noether%27s_theorem

@gvwilson @mhoye How is Noether's theorem relevant to this? This is so far off the usual applications of the theorem that I don't see the link. :⁠-⁠)
@RoarStovner @gvwilson @mhoye I think the link is the 1:1 mapping of a generator in a symmetry to a conservation law. Here someone seems to have found (or more: proven, once again), that there's a 1:1 mapping between compression and text generation.
@ljrk @gvwilson @mhoye Thanks! I still don't see how symmetries and conservation laws are related to compression and text generation. That there is a bijection, I understand.

@RoarStovner @gvwilson @mhoye IIRC Noether showed that there will always be such a bijection given certain environments. So this result could be considered to be – albeit of course a bit of a stretch – an example of such a bijection foretold by Noether.

But it's more an analogy than anything else, or I'm missing something as well :)

@ljrk @RoarStovner My apologies for writing such a brief toot - all I meant was that sometimes someone finds something that is beautiful because it unexpectedly simplifies our understanding of the world @mhoye
@mhoye I don’t know, seems like you would need the storage space to store all the references you’re comparing to, even if the computation is easy. Sounds like a time/space trade off.
@XaiaX @mhoye but that's almost already the case, existing models are pretty large. I'd worry more about the computational cost of classifying against the reference set: roughly O(|to_classify| * |reference_dataset| * maxlen(pairwise concatenations)).
@fogti @XaiaX So, I don't actually think that acres of storage space is all that meaningful if you just want to focus on good-enough utility. We all kind of know you're not getting more than varying heats of spicy autocomplete out of these tools, and the core insight of statistics as a field is that you can make a very accurate approximation of the state of a large dataset out of a small subset of that data. So, maybe a wikipedia dump and a gutenberg mirror is plenty?

@mhoye @XaiaX > spicy autocomplete

huh, nah the topic here is text classification, which is similar to text prediction (and there is probably a way to use Huffman tables to produce suggestions, etc.), but not the same.

> So, maybe a wikipedia dump and a gutenberg mirror is plenty?

probably yes. (imo the coolest thing would be a tool that could both classify and predict using the same infrastructure: lightly preprocessed large text dumps (<10GiB). massive improvement)
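The "classify and predict with the same machinery" idea above can be sketched with gzip alone: rank candidate continuations by how little they add to the compressed length of the context. A toy illustration only (the function names and example sentence are mine, not from the paper):

```python
import gzip

def clen(s: str) -> int:
    # All we use is the *length* of the gzip-compressed bytes.
    return len(gzip.compress(s.encode()))

def suggest(context: str, candidates: list[str]) -> str:
    # The candidate that adds the least compressed length to the
    # context is the one the compressor "expected".
    return min(candidates, key=lambda w: clen(context + " " + w) - clen(context))

ctx = "the quick brown fox jumps over the lazy dog; the quick brown fox"
print(suggest(ctx, ["jumps", "sleeps", "argues"]))
```

Because "jumps" already follows "the quick brown fox" in the context, DEFLATE can back-reference it almost for free, so it wins the ranking.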

@fogti @XaiaX I'm a bit more interested in predictive text tools that give you stylistic nudges towards artists you admire, and finding a way to get artists paid for that. Smart autocomplete/smart fill tools that answer, "what might Degas have done right here?"
@mhoye @XaiaX interesting thing is that it is probably *much* easier to simultaneously get the information for text completion *and* also what authors were involved in that match (instead of a large pool of authors just the relevant subset). [in the case of these compression-decompression + reference dataset models]

@fogti I have no idea how big a plain classifier model would be relative to the generalized LLMs that are so huge.

I just skimmed the paper, though, it’s been a few years since I did serious NLP stuff and that was all search/rules based shit anyway.

Meet “ZipPy”, a fast AI LLM text detector


@mhoye I will have to look at this.

I always felt I was just a smidge not smart enough to really dive in and get machine learning, in a "let me implement this myself" way.

But I get nearest neighbors and I should be able to understand zipping.

@mhoye Wait, what? So their claim is that anything that is decent at compression, should also be decent at prediction? So therefore we've missed the boat because all of the work on compression the past few decades means we have really good predictors?

Definitely going to be reading this.

@rmflight @mhoye symbol prediction is part of many of the best lossless compression algorithms, isn't it?
@gabrielesvelto @rmflight It is, and if this paper's results hold up, we're talking about large-scale deep-learning networks being fundamentally a technical dead end: something that takes a datacenter to do with a DNN can be done better with a clever application of gzip on a phone.

@gabrielesvelto @rmflight

Did I say phone?

I meant GameBoy.

@mhoye @gabrielesvelto @rmflight On first look I think what this paper suggests is 1) for some classification tasks there's a nicely simple approach that works well and 2) this is a promising path towards better feature engineering for language models that will in turn result in better accuracy vs cost.
@mhoye @gabrielesvelto @rmflight If this works out well, we'll see better + smaller models for all tasks (not just classification) that outperform both current DNNs and the NCD technique they use at moderate cost. There's precedent for this being a successful approach, for example using frequency-domain data for audio models instead of raw PCM. There's also precedent for finding ways DNNs waste a lot of capacity on effectively routing data around, and restructuring to fix that (ResNets, for example).
@mhoye @gabrielesvelto @rmflight Overall, though, in recent history data-based approaches have tended to win, so I would expect the useful bits to get incorporated into DNNs rather than DNNs being obsoleted in almost any context. My favorite essay on that topic, by Rich Sutton: http://incompleteideas.net/IncIdeas/BitterLesson.html

@mhoye @gabrielesvelto @rmflight "Our method is a simple, lightweight, and universal alternative to DNNs. It’s simple because it doesn’t require any preprocessing or training. It’s lightweight in that it classifies without the need for parameters or GPU resources. It’s universal as compressors are data-type agnostic, and non-parametric methods do not bring underlying assumptions."

The result is in principle an alternative to what we call models in LLM-speak, @mhoye?

@judell @gabrielesvelto @rmflight I think more importantly it trivializes the creation of those models, and provides obvious guarantees of reproducibility and training-material provenance. It not only democratizes access to the tech but provides avenues for contributors to get paid for the value they bring to the resultant models.

@mhoye @gabrielesvelto @rmflight Almost like chugging hashes for merkle trees on ASICs and GPUs for something that can better be done with an Excel spreadsheet...

It's exactly the same people who wanted it done with ASICs and GPUs doing the DNN stuff, for exactly the same reasons, which have nothing to do with any utility of the process and everything to do with running scams.

@rmflight @mhoye Sounds about right. My old coworker Tom always joked that NLP was just a compression algorithm on the input text.
@paulmheider @rmflight And Ted Chiang has referred to ML-generated text as "a jpeg of a language", yeah. But to see that come together in fifteen lines of python that out-do these massive, crazy expensive DNN models is bonkers, jaws on the floor material.
@mhoye @paulmheider @rmflight the basic idea is so simple, why has nobody done this before? Apparently ideas also need to be tried out 😄
@mhoye @rmflight The converse is certainly true: if something is decent at prediction, it will be decent at compression. If you *almost* predict what the following data is going to be, you can store the difference between the prediction and the real data, which will take less space.
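That "store the difference between the prediction and the real data" point is easy to demonstrate. Below, a toy predictor ("next byte = previous byte") is applied to a random walk, which LZ compressors struggle with directly; the residual stream compresses far better than the raw bytes. (The predictor, signal, and names here are my own illustration, not from the paper.)

```python
import random
import zlib

# Toy signal: a random walk. Hard for LZ to compress directly
# (few repeated substrings), but trivially predictable step by step.
random.seed(0)
vals = [128]
for _ in range(4000):
    vals.append((vals[-1] + random.choice([-1, 0, 1])) % 256)
raw = bytes(vals)

# Predict each byte as the previous byte; keep only the residuals.
residuals = bytes([vals[0]] + [(vals[i] - vals[i - 1]) % 256
                               for i in range(1, len(vals))])

c_raw = len(zlib.compress(raw, 9))
c_res = len(zlib.compress(residuals, 9))
print(c_raw, c_res)  # the residual stream is much smaller

# Lossless: the original is exactly recoverable from the residuals.
recon = [residuals[0]]
for r in residuals[1:]:
    recon.append((recon[-1] + r) % 256)
assert bytes(recon) == raw
```

The residuals take only three values (-1, 0, +1 mod 256), so the entropy coder inside zlib packs them into well under two bits each.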
@mhoye what is it for? for detecting topic of the text without needing to costly train neural network?
@mhoye holy fuck this is potentially brilliant, and I am so excited to dive into reading this after work
@sysop "Code is available at $URL" right there in the abstract! Holy shit!
@mhoye this is text classification not text prediction

@mhoye wait it works on a dataset in Pinyin?! Damn

And it's 14 lines of python

And it's not even looking at the contents of the compression, it's just using the byte lengths? Holy fucking shit that's smart as hell.

😍
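The "just using the byte lengths" trick is essentially Normalized Compression Distance plus nearest-neighbour. A minimal sketch, assuming gzip as the compressor and 1-NN (the variable names and toy training pairs are mine, not the paper's code):

```python
import gzip

def clen(s: str) -> int:
    # We never look inside the compressed output, only at its length.
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    # Normalized Compression Distance: small when a and b share structure,
    # because compressing them together saves space.
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(text: str, labelled: list[tuple[str, str]]) -> str:
    # 1-nearest-neighbour over NCD to the labelled examples.
    return min(labelled, key=lambda pair: ncd(text, pair[0]))[1]

train = [
    ("cats purr and cats meow and cats sleep on the mat", "cats"),
    ("dogs bark and dogs fetch and dogs run in the park", "dogs"),
]
print(classify("my cat likes to purr and meow and sleep", train))
```

The paper uses k-NN over a real training set; the principle is the same.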

@mhoye compression methods have been used in text classification for authorship analysis for quite a while.

@tschfflr @mhoye yes, I remember discussion of this at least a decade ago and probably longer.

Still interesting.

@tschfflr @mhoye after looking through it it seems like they acknowledge all that as the basis for this work.
@mhoye @jrconlin the abstract reads like a self aware joke. Please tell me it's a self-aware joke

@mhoye Compression is ... very interesting. Recalling here Schmidhuber's take on compression and its predictability being at the root of learning, beauty, novelty, interestingness, boredom ... https://arxiv.org/pdf/0812.4360

Previously: https://mathstodon.xyz/@albertcardona/110686536069075845

@mhoye reading this paper and cackling the whole time, it's just so clever
@mhoye (this is not particularly novel, I remember reading a similar paper about using gzip to de-anonymize written works back around when I was learning about singular value decomposition to support vector space document searching from Maciej Ceglowski, I think back in the early 2000s. But it definitely is novel that this old technique is now outperforming current tech darlings.)

@mhoye that prediction ≈ compression is also the basis of the Hutter Prize

https://en.m.wikipedia.org/wiki/Hutter_Prize


@mhoye IIRC, a few years ago someone found a way to categorize music files (MIDI?) by testing which ones compress well together.
@mhoye This is awesome, but I'm surprised it wasn't better known. I have vague memories of going to a talk by a researcher in Oxford about 25 years ago about using gzip compression for text analysis. His presentation explained about entropy and how compression is prediction, then looked at categorising text by gzipping it. Can't remember the name; some guy doing inference stuff in the psychology department. This is going to bug me now.
@simoncozens From what I can tell the fact itself wasn't a big secret, but the idea that with apparently negligible effort you can outperform tools that are insanely expensive and wildly more complicated is the interesting part.
@mhoye The right algorithm in the right place beats an inscrutable pile of ReLUs every damned time.
@simoncozens I read this in the G-Man's voice from Half Life.
@simoncozens @mhoye it might take a pile of inscrutable ReLUs to tell what’s the right algorithm and/or the right place though 💡

@simoncozens @mhoye I suppose this is related to an early corpus-linguistics rule of thumb: if you have a sentence in English and several sentences in French, one of which is the translation of the English one, the correct translation is most likely to be the one closest to 111% of the length of the English one.

possible ref is from Gale & Church's work, c.1994
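That length-ratio heuristic fits in a couple of lines. A toy sketch only: the 111% figure comes from the post above, and the candidate strings are invented for illustration:

```python
def pick_translation(english: str, candidates: list[str]) -> str:
    # Choose the candidate whose character length is closest to
    # 111% of the English sentence's length (the rule of thumb above).
    target = 1.11 * len(english)
    return min(candidates, key=lambda c: abs(len(c) - target))

en = "The cat is on the mat."  # 22 chars, target ~24.4
fr = [
    "Chat.",                                                         # far too short
    "Le chat est sur le tapis.",                                     # ~25 chars, near 111%
    "Le chat, un animal domestique bien connu, est sur le tapis.",   # far too long
]
print(pick_translation(en, fr))  # → "Le chat est sur le tapis."
```

Gale & Church's actual aligner is probabilistic over length ratios, but this is the intuition.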

@simoncozens @mhoye thank you! This reference reconnected me to http://ccc.inaoep.mx/~villasen/bib/joula-Authorship%20Attribution.pdf which got me to https://arxiv.org/abs/cond-mat/0108530 , which I believe is the early-2000s paper I was thinking of earlier!

Edit: plus the latent semantic indexing article I read around the same time by Clara Yu, John Cuadrado, Maciej Ceglowski (yes, idlewords), and J. Scott Payne, https://web.archive.org/web/20050507172205/http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm

@mhoye ...That's mildly terrifying if it works out, actually. A major jump in AI capabilities. Eep. :/
@Angle I don't think it's a jump in capabilities, just a massive decrease in cost and complexity, which is different (and democratizing!)
@mhoye Eh, a decrease in cost and complexity is an increase in capability / (cost & complexity). Assuming it pans out. Hopefully it'll be democratizing? We'll see. XD