@janfrode

I wouldn't trust an LLM not to be generating based upon other already-published unencoded stuff.

A less expensive, and far more trustworthy, way to decode it is to just pipe the encoded body through gbase64 -d and then iconv -f CP1252.
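
For anyone without those tools at hand, the same two steps can be sketched in Python. This is only an illustration: the function name `decode_body` and the sample string are made up here, and it assumes the body really is Base64 over CP1252 text, as the post describes.

```python
import base64

def decode_body(encoded: str, charset: str = "cp1252") -> str:
    """Base64-decode a body, then interpret the resulting bytes in `charset`."""
    raw = base64.b64decode(encoded)  # plays the role of `gbase64 -d`
    return raw.decode(charset)       # plays the role of `iconv -f CP1252`

# Hypothetical sample: CP1252 bytes for a short string, Base64-encoded.
sample = base64.b64encode("café €5".encode("cp1252")).decode("ascii")
print(decode_body(sample))  # café €5
```

Being deterministic, this gives the same output for the same input every time, which is the point of preferring it over an LLM.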

#PeterMandelson #UKPolitics #EpsteinFiles #TextProcessing #AIs #LLMs

Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):

https://web.stanford.edu/~jurafsky/slp3/

#NLP #TextProcessing #AI #Algorithms


sentencex - by Wikimedia:

https://github.com/wikimedia/sentencex

A sentence segmentation library with wide language support optimized for speed and utility.

Written in #Rust.

Bindings are available for #Python, #NodeJS and #WASM

Might be useful for my #SpeechToText system! 👀

#NLP #TextProcessing #Segmentation #RustLang

#APLQuest 2013-03: Write a function that returns the number of words in the given character scalar or vector (see https://apl.quest/2013/3/ to test your solution and view ours). #APL #WordCount #TextProcessing
APL Quest 2013-3: What Is In a Word
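
For readers who don't write APL, one possible reading of the task can be sketched in Python. The assumption here (not stated in the post) is that a word is any maximal run of non-space characters and that only the space character separates words; `word_count` is a name chosen for this sketch, not the APL Quest solution.

```python
def word_count(text: str) -> int:
    """Count maximal runs of non-space characters in `text`."""
    # Assumption: only the space character separates words.
    # Splitting on " " yields empty pieces for repeated, leading or
    # trailing spaces, so those are filtered out.
    return sum(1 for piece in text.split(" ") if piece)

print(word_count("Testing one, two, three"))           # 4
print(word_count("  leading and   repeated spaces "))  # 4
print(word_count(""))                                  # 0
```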

LLMs are getting better at character-level text manipulation

Recently, I have been testing how well the newest generations of large language models (such as GPT-5 or Claude 4.5) handle natural language, specifically counting characters, manipulating characters in a sentence, or solving encodings and ciphers. Surprisingly, the newest models were able to solve these kinds of tasks, unlike previous generations of LLMs.

Character manipulation

LLMs handle individual characters poorly. This is because all text is encoded as tokens via the LLM tokenizer and its vocabulary. Individual tokens typically represent clusters of characters, sometimes even full words (especially in English and other languages common in the training dataset). This makes any reasoning at a finer granularity than tokens fairly difficult, although LLMs have been capable of certain simple tasks (such as spelling out individual characters in a word) for a while.

Tom Burkert
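
The token-versus-character point in the excerpt above can be illustrated with a toy tokenizer. This is a deliberately simplified sketch: it uses greedy longest-match over a tiny invented vocabulary (`TOY_VOCAB`), not the byte-pair encoding or the vocabularies that real LLM tokenizers use.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    longest = max(len(v) for v in vocab)
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary, invented for illustration only.
TOY_VOCAB = {"straw", "berry", "st", "raw"}

tokens = tokenize("strawberry", TOY_VOCAB)
print(tokens)                   # ['straw', 'berry']
print("strawberry".count("r"))  # 3, visible only at the character level
```

A model that receives only the two token IDs never sees the three separate "r" characters directly, which is one way to picture why letter counting is hard at the token level.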

In the 1990s, statistical n-gram language models trained on vast text collections became the backbone of NLP research. They fueled advances in nearly all NLP techniques of the era and laid the groundwork for today's AI.

F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA
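
As a rough idea of what such a model computes, here is a minimal bigram (n = 2) sketch using the plain maximum-likelihood estimate P(word | previous) = count(previous, word) / count(previous). The toy corpus and the name `bigram_model` are invented for illustration; real systems of that era were trained on far larger collections and added smoothing for unseen word pairs.

```python
from collections import Counter

def bigram_model(tokens: list[str]):
    """Return a function giving the MLE estimate of P(word | previous)."""
    context_counts = Counter(tokens[:-1])           # how often each word starts a pair
    pair_counts = Counter(zip(tokens, tokens[1:]))  # how often each pair occurs

    def prob(previous: str, word: str) -> float:
        if context_counts[previous] == 0:
            return 0.0  # unseen context; real models smooth instead
        return pair_counts[(previous, word)] / context_counts[previous]

    return prob

# Hypothetical toy corpus.
corpus = "the cat sat on the mat the cat ran".split()
prob = bigram_model(corpus)
print(prob("the", "cat"))  # 2 of the 3 occurrences of "the" are followed by "cat"
print(prob("the", "mat"))
```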

#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique

#TIL you can pass variables to an #awk script with the option -v. This is useful, for example, when you want to include the file name in the output:

```
# For every CSV file, print the file name (passed in via -v) and the second field of each line.
find . -type f -iname '*.csv' -exec awk -F, -v filename={} '{print filename, $2}' {} \;
```

Even though seemingly awkward at first glance, #awk is definitely one of the most versatile and useful tools on #linux.
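
For comparison, roughly the same job in Python might look like the sketch below. The function name `second_columns` and the demo file are made up here. One behavioural difference to keep in mind: the `csv` module understands quoted fields, whereas `awk -F,` splits on every comma.

```python
import csv
import tempfile
from pathlib import Path

def second_columns(root: str):
    """Yield (file name, second field) for every row of every CSV under root."""
    for path in sorted(Path(root).rglob("*")):
        # Case-insensitive match on the name, like `-iname '*.csv'`.
        if path.is_file() and path.name.lower().endswith(".csv"):
            with path.open(newline="") as handle:
                for row in csv.reader(handle):
                    # awk prints an empty $2 for short lines; mirror that here.
                    yield str(path), row[1] if len(row) > 1 else ""

# Demo on a throwaway directory with one hypothetical file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo.CSV").write_text("id,name\n1,alice\n")
    for filename, value in second_columns(tmp):
        print(filename, value)
```

The awk one-liner is clearly shorter, which rather supports the post's point.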

#bash #commandline #shell #programming #textprocessing

🚀 Behold the epic tale of Janet's #PEG #module, where the author heroically excludes regular expressions like they're yesterday's news. 💥 Marvel at the labyrinth of #parsing magic that claims to be more readable, but only if you have a PhD in arcane text processing. 📜✨
https://bakpakin.com/writing/how-janets-peg-works.html #Janet #readability #textprocessing #regex #HackerNews #ngated
How Janet's PEG module works

An in-depth explanation of PEGs and how they work.

Photo of Enola Gay aircraft among 26,000 images flagged for removal in Pentagon’s DEI purge

In some cases, photos seemed to be flagged for removal simply because their file included the word “gay,” including service members with that last name and an image of the B-29 aircraft Enola Gay, which dropped the first atomic bomb on Hiroshima, Japan, during World War II.

oregonlive