Sort lines in one text file by order of matching part in another file #commandline #textprocessing
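One way to approach the task in the title, as a minimal sketch rather than the asker's actual solution: build a rank table from the ordering file, then sort the other file's lines by the rank of their matching part. The function name `sort_by_reference`, the sample data, and the choice of "first whitespace-separated field" as the matching part are all assumptions made for illustration.

```python
def sort_by_reference(lines, reference_keys, key_of):
    """Return `lines` ordered by where each line's key appears in `reference_keys`."""
    # Map each reference key to its first position in the ordering file.
    rank = {}
    for position, key in enumerate(reference_keys):
        rank.setdefault(key, position)
    # Lines whose key is missing from the reference sort after all ranked lines.
    # sorted() is stable, so ties keep their original relative order.
    return sorted(lines, key=lambda line: rank.get(key_of(line), len(rank)))


# Hypothetical data: the matching part is the first whitespace-separated field.
lines = ["cherry 3", "apple 1", "durian 4", "banana 2"]
order = ["banana", "apple", "cherry"]
result = sort_by_reference(lines, order, lambda line: line.split()[0])
print("\n".join(result))
```

With real files, `lines` and `order` would come from reading each file and stripping the trailing newlines.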
@rl_dane If you’re interested in working with Bible texts, you might want to look at https://platform.youversion.com/ – it provides free access via APIs and SDKs, so you don’t need to scrape or re‑parse the text yourself. The fast‑track licensing respects copyright, and direct access to the source text helps you avoid introducing issues around textual integrity.
I wouldn't trust an LLM not to generate its output from other, already-published unencoded material.
A less expensive, and far more trustworthy, way to decode it is to just pipe the encoded body through gbase64 -d and then iconv -f CP1252.
#PeterMandelson #UKPolitics #EpsteinFiles #TextProcessing #AIs #LLMs
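The same two steps (Base64 decode, then reinterpret the bytes as CP1252) can also be sketched with Python's standard library instead of the `gbase64` and `iconv` pipeline. This is a swapped-in equivalent, not the poster's command; `decode_body` and the sample string are hypothetical names and data.

```python
import base64


def decode_body(encoded: str, charset: str = "cp1252") -> str:
    """Base64-decode `encoded`, then decode the resulting bytes from `charset`."""
    # b64decode skips characters outside the Base64 alphabet by default,
    # so line breaks inside a wrapped message body are tolerated.
    raw = base64.b64decode(encoded)
    return raw.decode(charset)


# Round-trip a made-up sample containing CP1252-specific punctuation.
sample = base64.b64encode("café – “quoted”".encode("cp1252")).decode("ascii")
print(decode_body(sample))
```

Unlike a generative model, this path is deterministic: the same input bytes always produce the same text.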
Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):
sentencex - by Wikimedia:
https://github.com/wikimedia/sentencex
A sentence segmentation library with wide language support optimized for speed and utility.
Written in #Rust.
Bindings are available for #Python, #NodeJS and #WASM.
Might be useful for my #SpeechToText system! 👀
LLMs are getting better at character-level text manipulation
https://blog.burkert.me/posts/llm_evolution_character_manipulation/
#HackerNews #LLMs #CharacterManipulation #TextProcessing #AIInnovation #MachineLearning
Recently, I have been testing how well the newest generations of large language models (such as GPT-5 or Claude 4.5) handle natural language, specifically counting characters, manipulating characters in a sentence, or solving encodings and ciphers. Surprisingly, the newest models were able to solve these kinds of tasks, unlike previous generations of LLMs.
Character manipulation
LLMs handle individual characters poorly. This is due to all text being encoded as tokens via the LLM tokenizer and its vocabulary. Individual tokens typically represent clusters of characters, sometimes even full words (especially in English and other common languages in the training dataset). This makes any considerations on a more granular level than tokens fairly difficult, although LLMs have been capable of certain simple tasks (such as spelling out individual characters in a word) for a while.
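To make the token-versus-character point in the excerpt concrete, here is a toy sketch. It uses a greedy longest-match split over a made-up vocabulary, which is an assumption for illustration only; real LLM tokenizers typically use learned subword schemes such as byte-pair encoding, and `greedy_tokenize` and `toy_vocab` are hypothetical names.

```python
def greedy_tokenize(text, vocab):
    """Split `text` by longest match against `vocab`, falling back to single characters."""
    tokens = []
    i = 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_vocab = {"straw", "berry", "the", "ing"}  # made-up vocabulary
tokens = greedy_tokenize("strawberry", toy_vocab)
print(tokens)                      # the model-side view: two opaque units
print("strawberry".count("r"))     # the character-level view
```

The model receives the token sequence, not the ten letters, so a question like "how many r's" asks about structure that is not directly visible in its input.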
The palindrome problem – Unicode edition
https://wiesmann.codiferes.net/wordpress/archives/41500
#C++ #CodePoints #GraphemeClusters #java #Javascript #ProgrammingLanguage #Python #Swift #TextProcessing #Unicode
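Judging only by the tags on this post (code points, grapheme clusters), the difficulty is that reversing a string code point by code point can separate a base letter from its combining mark. The sketch below is not taken from the linked article; it shows the failure and a partial fix via NFC normalization, which is an assumption chosen for illustration. NFC only helps where a precomposed character exists, so full correctness needs grapheme cluster segmentation, which Python's standard library does not provide.

```python
import unicodedata


def is_palindrome_naive(text):
    """Compare the string with its code-point-wise reversal."""
    return text == text[::-1]


def is_palindrome_nfc(text):
    """Compose characters first, so base letter plus accent becomes one code point."""
    composed = unicodedata.normalize("NFC", text)
    return composed == composed[::-1]


# "été" written with combining acute accents (U+0301) after each "e".
decomposed = "e\u0301te\u0301"
print(is_palindrome_naive(decomposed))
print(is_palindrome_nfc(decomposed))
```

Sequences with no precomposed form, such as many emoji joined with zero-width joiners, would still be reversed incorrectly by the NFC variant.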
In the 1990s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.
F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA
#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique
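For readers unfamiliar with the term, an n-gram model estimates the probability of the next word from counts of short word sequences. A minimal bigram (n = 2) sketch with maximum-likelihood estimates and no smoothing follows; the tiny corpus and the name `bigram_probabilities` are made up for illustration and are not drawn from Jelinek's book.

```python
from collections import Counter


def bigram_probabilities(tokens):
    """Maximum-likelihood estimate of P(next | previous) from raw bigram counts."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it has a successor, so probabilities sum to 1.
    context_counts = Counter(tokens[:-1])
    return {
        (prev, nxt): count / context_counts[prev]
        for (prev, nxt), count in pair_counts.items()
    }


corpus = "the cat sat on the mat the cat ran".split()  # toy corpus
probs = bigram_probabilities(corpus)
print(probs[("the", "cat")])
print(probs[("cat", "sat")])
```

Practical systems of that era added smoothing and back-off so that unseen word pairs did not receive zero probability.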
#TIL you can pass variables to an #awk script with the option -v. This is useful, for example, when you want to include the file name in the output:
```
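# Note: replacing {} inside a longer argument (filename={}) works with GNU find; POSIX only guarantees it for a standalone {}.
find . -type f -iname '*.csv' -exec awk -F, -v filename={} '{print filename, $2}' {} \;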
```
Even though seemingly awkward at first glance, #awk is definitely one of the most versatile and useful tools on #linux.