"The wordfreq data is a snapshot of language that could be found in various online sources up through 2021. There are several reasons why it will not be updated anymore.

Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by humans."
#wordfreq #AI #LLM

https://github.com/rspeer/wordfreq/blob/master/SUNSET.md

wordfreq/SUNSET.md at master · rspeer/wordfreq

Access a database of word frequencies, in various natural languages. - rspeer/wordfreq

GitHub

“The field I know as ‘natural language processing’ is hard to find these days. It's all being devoured by generative AI.”

The creator of an open source project that scraped the internet to track the ever-changing popularity of different words in human language usage says they are sunsetting the project because generative AI spam has poisoned the internet.

#WordFreq #NLP #language #ArtificialIntelligence #AI #GenAI #LLM #technology #tech

https://www.404media.co/project-analyzing-human-language-usage-shuts-down-because-generative-ai-has-polluted-the-data/

Project Analyzing Human Language Usage Shuts Down Because ‘Generative AI Has Polluted the Data’

Wordfreq shuts down because “I don’t think anyone has reliable information about post-2021 language usage by humans.”

404 Media

“Why #wordfreq will not be updated: I don't think anyone has reliable information about post-2021 language usage by humans. The open Web was one of wordfreq's data sources. Now the Web at large is full of slop generated by #LLM, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.”
https://github.com/rspeer/wordfreq/blob/master/SUNSET.md
“I don't want to work on anything that could be confused with #GenAI, or that could benefit generative AI.” — Robyn Speer

@jgordon

Definitely good use cases for LLM.

Poisoning the well is not one of them.

Which is why wordfreq gave up.

#Wordfreq #AI #LLM

There was a time when Natural Language Processing was a thing, before #GenAI became the buzzword, drowning all other activities. #WordFreq was a project keeping track of how often humans in different languages used certain words.

However, its maintainer is throwing in the towel. Partly because many platforms make such use impossible, but much more because even *when* you get the data, it is no longer a representation of human language use.
1/2
#NLP #AI
https://github.com/rspeer/wordfreq/blob/master/SUNSET.md

Project Analyzing Human #Language Usage Shuts Down Because ‘Generative #AI Has Polluted the Data’

#Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across #Wikipedia, movie and TV #subtitles, news articles, books, websites, #Twitter, and #Reddit.
#generativeai

https://www.404media.co/project-analyzing-human-language-usage-shuts-down-because-generative-ai-has-polluted-the-data/

This is what I was worried about with #GenAI: the web is becoming so polluted with #AI slop that it is becoming unusable for other purposes. Today’s example: the #wordfreq project is shutting down. https://github.com/rspeer/wordfreq/blob/master/SUNSET.md #LLM

A very interesting explanation of why the #wordfreq project is shutting down (they collected word frequencies in various languages). In a nutshell: half of the Internet now consists of texts generated by #LLM, so nobody knows anymore what the statistics of human-written texts look like.

https://github.com/rspeer/wordfreq/blob/master/SUNSET.md

The wordfreq data is a snapshot of language that could be found in various online sources up through 2021. There are several reasons why it will not be updated anymore.

Generative #AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by humans.

The open Web (via #OSCAR) was one of #wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.

#NLProc #WebAsCorpus #GenAI

From the post I just retooted:

"I don't think anyone has reliable information about post-2021 language usage by humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies."

#genai #habsburgai #wordfreq

Source: "Why wordfreq will not be updated" - https://github.com/rspeer/wordfreq/blob/master/SUNSET.md
