Oh, the irony! 🤖 An article that promises to demystify #Transformers with N-grams but really just masquerades as a job listing for a #DevOps Engineer at #arXiv. 📜 Because nothing says "deep understanding" like *skipping to the main content* for career opportunities. 🌟
https://arxiv.org/abs/2407.12034 #Ngrams #irony #careeropportunities #HackerNews #ngated
Understanding Transformers via N-gram Statistics

Transformer-based large language models (LLMs) display extreme proficiency with language, yet a precise understanding of how they work remains elusive. One way of demystifying transformer predictions would be to describe how they depend on their context in terms of simple template functions. This paper takes a first step in this direction by considering families of functions (i.e. rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries: a simple method to detect overfitting during training without using a holdout set, a quantitative measure of how transformers progress from learning simple to more complex statistical rules over the course of training, a model-variance criterion governing when transformer predictions tend to be described by N-gram rules, and insights into how well transformers can be approximated by N-gram rulesets in the limit where these rulesets become increasingly complex. In this latter direction, we find that for 79% and 68% of LLM next-token distributions on TinyStories and Wikipedia, respectively, their top-1 predictions agree with those provided by our N-gram rulesets.
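To make the idea of "top-1 agreement with an N-gram rule" concrete, here is a minimal Python sketch. It is not the paper's method: the paper's rulesets are richer, and this sketch swaps in a plain longest-suffix backoff rule. The toy corpus, the function names, and the `model_top1` list (standing in for a transformer's top-1 predictions) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_ngram_counts(tokens, max_n):
    """Count next tokens for every context of length 0 .. max_n - 1."""
    counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        for k in range(max_n):
            if i - k < 0:
                break
            counts[tuple(tokens[i - k:i])][tok] += 1
    return counts

def ngram_top1(counts, context, max_n):
    """Predict the most frequent next token, backing off to shorter suffixes."""
    for k in range(min(max_n - 1, len(context)), -1, -1):
        suffix = tuple(context[len(context) - k:])
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return None

def top1_agreement(counts, contexts, model_top1, max_n):
    """Fraction of contexts where the N-gram rule matches the model's top-1."""
    hits = sum(ngram_top1(counts, c, max_n) == m
               for c, m in zip(contexts, model_top1))
    return hits / len(contexts)

# Toy training data (hypothetical).
train_tokens = "the cat sat on the mat the cat ran".split()
counts = build_ngram_counts(train_tokens, max_n=3)

# Hypothetical contexts and made-up model predictions to compare against.
contexts = [("on", "the"), ("the",), ("zzz",)]
model_top1 = ["mat", "cat", "cat"]
agreement = top1_agreement(counts, contexts, model_top1, max_n=3)
```

On this toy data the rule agrees with the made-up predictions on two of the three contexts. The unseen context `("zzz",)` backs off to the unigram counts and predicts "the".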

arXiv.org

Ever wish you could #crawl a #website to generate a #targeted #wordlist, create #ngrams, and sort everything by frequency, all with a single, easy-to-use tool that doesn't rely on Ruby or Python?
Now you can, thanks to #Spider.
https://forum.hashpwn.net/post/52

#hashpwn #cyclone #hashcat #hashcracking #cewl
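For readers unfamiliar with the underlying idea, the wordlist-and-ngram step can be sketched in a few lines of Python. This is not how Spider is implemented and does no crawling; it only illustrates tokenizing already-fetched page text, building n-grams, and sorting by frequency. The sample text and all function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(words, n):
    """Join every run of n consecutive words into one space-separated string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequency_sorted(items):
    """Return (item, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(items)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical text standing in for the content of a crawled page.
page_text = "Summer 2024 sale! Summer sale ends soon. Summer 2024 is here."

words = tokenize(page_text)
wordlist = frequency_sorted(words)
bigrams = frequency_sorted(ngrams(words, 2))
```

Here `wordlist` starts with the most common single words and `bigrams` with the most common two-word phrases, which is the kind of ordering a targeted wordlist needs.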

I was even thinking of using #ngrams data from https://marcoxbresciani.codeberg.page/keyboards/ergodash/ergodash.html#org2ca8e47 but even though I have those numbers, I have no idea how to use them to create a better #Italian-based #ColemakDH layout.

Also, is it worth it?

Hints? Ideas? Help!

@mechanicalkeyboards
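One common way to put bigram counts to work is to score candidate layouts by how often two consecutive letters land on the same finger, and prefer layouts with a lower rate. The Python sketch below shows that heuristic only. The bigram counts, the finger labels, and both finger assignments are made-up toy values, not real Italian statistics and not the actual Colemak-DH mapping.

```python
def same_finger_rate(bigram_counts, finger_of):
    """Share of bigram occurrences typed with one finger (lower is better)."""
    total = same = 0
    for bigram, count in bigram_counts.items():
        first, second = bigram
        if first not in finger_of or second not in finger_of:
            continue  # skip letters the layout table does not cover
        total += count
        # Repeated letters (e.g. "ll") are not counted as same-finger motion.
        if first != second and finger_of[first] == finger_of[second]:
            same += count
    return same / total if total else 0.0

# Hypothetical bigram counts (would come from an Italian corpus).
bigram_counts = {"ch": 50, "er": 30, "io": 20}

# Two hypothetical finger assignments: key -> finger label.
layout_a = {"c": "LM", "h": "RI", "e": "RM", "r": "LI", "i": "RR", "o": "RP"}
layout_b = dict(layout_a, o="RR")  # variant putting "i" and "o" on one finger

rate_a = same_finger_rate(bigram_counts, layout_a)
rate_b = same_finger_rate(bigram_counts, layout_b)
```

With these toy numbers `layout_b` scores worse, because the "io" bigram now uses a single finger. A real evaluation would use the full bigram table and usually combines this with other criteria such as row changes and hand alternation.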

ErgoDash

@joeyh I've read elsewhere that Google's Ngrams (and book scanning) is heavily skewed toward academic publishing from roughly 1920 -- 1990 or so. It's a combination of books falling into the copyright hole, emphasis on academic corpora (e.g., University of Michigan) as major contributors to the scanning project, and the emergence of digital book formats in the very late 20th century.

Douglas Harper of the Online Etymology Dictionary addresses this in a blog post:

https://www.etymonline.com/columns/post/who-lusts-for-certainty-lusts-for-lies

#Etymology #ngrams #GoogleNgramViewer

Who Lusts for Certainty Lusts for Lies

We need to talk about the Google Ngram Viewer n-grams. They are wrong. [D.R.H.]

@johnwehrle I'm defending the notion of effective and fact-based criticism here, not longtermism ...

... but note that the term "existential risk" LONG predates the emergence of "longtermism", and through 2000 is also far more prevalent. See screenshot, and note that "longtermism" is multiplied 3x to scale equivalently to "existential risk".

I've strong concerns with any argument which leans heavily on such readily-refuted claims. The viewpoint may well be justified, but a bit less hyperventilating hyperbole and poor scholarship would greatly help the case.

The notion of "existential risk" was originally applied in a religious context (by Paul Tillich) and to nuclear weapons.

See:

#longtermism #ExistentialRisk #GoogleNgramViewer #Ngrams #WeakArguments #EmilePTorres

Bulletin of the Atomic Scientists

Google Books
Google Books Ngram Viewer

Pondering the Big Questions:

When did "meet-cute" become A Thing?

Ngram Viewer says ... mostly post-2010:

https://books.google.com/ngrams/graph?content=meet+cute%2Cmeet-cute&year_start=1919&year_end=2019&corpus=26&smoothing=3

It seems recent to me.

(Both "meet cute" and "meet-cute" plotted. I suspect the unhyphenated version will have numerous false positives as in "meet cute (girl(s)|guy(s))".)
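For anyone who wants to build comparison links like the one above, here is a small Python sketch. The parameter names (`content`, `year_start`, `year_end`, `corpus`, `smoothing`) are taken from the URL in this post; the helper name `ngram_url` and its default values are assumptions, and the meaning of `corpus=26` is not stated here, so it is passed through unchanged.

```python
from urllib.parse import urlencode

NGRAM_BASE = "https://books.google.com/ngrams/graph"

def ngram_url(terms, year_start, year_end, corpus=26, smoothing=3):
    """Build an Ngram Viewer graph URL for one or more comma-joined terms."""
    query = urlencode({
        "content": ",".join(terms),   # urlencode turns "," into %2C, " " into +
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,             # corpus id copied from the post's URL
        "smoothing": smoothing,
    })
    return f"{NGRAM_BASE}?{query}"

url = ngram_url(["meet cute", "meet-cute"], 1919, 2019)
```

Plotting both spellings in one query, as here, makes it easier to see how much of the unhyphenated curve might come from unrelated phrases.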

#ngrams #NgramViewer

On the changing of language usage patterns over time, homelessness is an interesting case.

I'd discovered some time back that the term broke into usage suddenly in 1980. It wasn't entirely unknown before, but the concept often appeared as a compound verb, "made homeless", rather than as a noun, "homeless (man|woman|person)", and almost always as an immediate consequence of some disaster, such as a structural fire, hurricane, flood, or earthquake. Earlier terms that had been used to describe long-term lack of reliable housing include vagrant, itinerant, and the like (I'd need to look these up again).

Part of this seems to be due to changes in how housing was approached in the US, and especially the elimination of alternatives to single-family dwellings (e.g., rooming houses, residence hotels) in many areas. But some also seems to be a linguistic, social, and political change in usage.

Ngram: "homelessness": https://books.google.com/ngrams/graph?content=homelessness&year_start=1800&year_end=2000&corpus=26&smoothing=3

Ngram: "homeless, itinerant, unhoused, vagrant": https://books.google.com/ngrams/graph?content=homeless%2C+itinerant%2C+unhoused%2C+vagrant&year_start=1800&year_end=2000&corpus=26&smoothing=3

The message is that ngrams and the Google corpus are useful but also require interpretation.

#ngrams #NgramViewer #homelessness

Google Ngrams: "white nationalist"

Apropos some recent discussions, I've been looking into a number of aspects of this term and topics related to it.

Google Ngram Viewer is a powerful, if occasionally problematic, tool for exploring language and terms used within it.

An ngram of the headline phrase of this toot ... shows an immense rise in prevalence of the term through 2019 (the most recent data in the corpus), roughly 10 times the 2010 level.

What's driving that isn't necessarily clear: language and usage reflect both the real-world phenomena that terms describe and preferences for certain terms over others.

But it's attention-grabbing all the same. And a bit sobering.

https://books.google.com/ngrams/graph?content=white+nationalist&year_start=1919&year_end=2019&corpus=26&smoothing=1

#ngrams #NgramViewer #racism
