A very interesting explanation of why the #wordfreq project is shutting down (it collected word frequencies across different languages). In short: half the Internet now consists of #LLM-generated text, so nobody knows anymore what the statistics of human-written text look like.

https://github.com/rspeer/wordfreq/blob/master/SUNSET.md

The wordfreq data is a snapshot of language that could be found in various online sources up through 2021. There are several reasons why it will not be updated anymore.

Generative #AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by humans.

The open Web (via #OSCAR) was one of #wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.

#NLProc #WebAsCorpus #GenAI
