🤔 Ah, yet another academic masterpiece on the magical powers of 'grep'—because who knew that sifting through text could be so agentically transformative? 🚀 Apparently, we need an army of agent harnesses to do what Ctrl+F has been mastering since the dawn of time. 😜 Thanks for the 🧠-bending #insights, arXiv!
https://arxiv.org/abs/2605.15184 #grep #textprocessing #arXiv #automation #academichumor #HackerNews #ngated
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.

@rl_dane If you’re interested in working with Bible texts, you might want to look at https://platform.youversion.com/ – it provides free access via APIs and SDKs, so you don’t need to scrape or re‑parse the text yourself. The fast‑track licensing respects copyright, and direct access to the source text helps you avoid introducing textual-integrity problems.

#BibleTech #FaithTech #APIs #TextProcessing

@janfrode

I wouldn't trust an LLM not to generate its answer from other, already-published decoded copies rather than from the encoded body itself.

A less expensive, and far more trustworthy, way to decode it is to just pipe the encoded body through gbase64 -d and then iconv -f CP1252 .
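
For anyone without gbase64 or iconv to hand, the same two-step decode can be sketched with Python's standard library. This is only an illustration: the function name `decode_body` is hypothetical, the sample string is made up, and CP1252 is simply the charset named above.

```python
import base64


def decode_body(encoded: str) -> str:
    """Base64-decode a message body, then read the bytes as CP1252."""
    raw = base64.b64decode(encoded)  # same job as `gbase64 -d`
    return raw.decode("cp1252")      # same job as `iconv -f CP1252`


# Made-up sample: encode a short CP1252 string so there is something to decode.
sample = base64.b64encode("café – ok".encode("cp1252")).decode("ascii")
print(decode_body(sample))
```

Like the shell pipeline, this is deterministic: the output depends only on the bytes that go in.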

#PeterMandelson #UKPolitics #EpsteinFiles #TextProcessing #AIs #LLMs

Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):

https://web.stanford.edu/~jurafsky/slp3/

#NLP #TextProcessing #AI #Algorithms

sentencex - by Wikimedia:

https://github.com/wikimedia/sentencex

A sentence segmentation library with wide language support optimized for speed and utility.

Written in #Rust.

Bindings are available for #Python, #NodeJS and #WASM

Might be useful for my #SpeechToText system! 👀

#NLP #TextProcessing #Segmentation #RustLang

#APLQuest 2013-03: Write a function that returns the number of words in the given character scalar or vector (see https://apl.quest/2013/3/ to test your solution and view ours). #APL #WordCount #TextProcessing
APL Quest 2013-3: What Is In a Word
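
Outside APL, the task can be sketched in Python. This assumes a word is a maximal run of non-space characters, which is one reading of the problem statement and not the official test suite; `count_words` is a hypothetical name.

```python
def count_words(text: str) -> int:
    """Count maximal runs of non-space characters in text."""
    # split(" ") yields empty strings between consecutive spaces; skip those.
    return sum(1 for chunk in text.split(" ") if chunk)


print(count_words("Testing one, two, three"))
```

Leading, trailing, and repeated spaces all produce empty chunks, so they do not inflate the count.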

LLMs are getting better at character-level text manipulation

Recently, I have been testing how well the newest generations of large language models (such as GPT-5 or Claude 4.5) handle natural language, specifically counting characters, manipulating characters in a sentence, or solving encodings and ciphers. Surprisingly, the newest models were able to solve these kinds of tasks, unlike previous generations of LLMs.

Character manipulation

LLMs handle individual characters poorly. This is due to all text being encoded as tokens via the LLM tokenizer and its vocabulary. Individual tokens typically represent clusters of characters, sometimes even full words (especially in English and other common languages in the training dataset). This makes any considerations on a more granular level than tokens fairly difficult, although LLMs have been capable of certain simple tasks (such as spelling out individual characters in a word) for a while.
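
For contrast, ordinary code sees characters, not tokens, so the three kinds of task mentioned (counting, manipulating, decoding a cipher) are one-liners. A minimal Python sketch with made-up inputs and hypothetical function names, not the article's test set:

```python
import codecs


def count_char(text: str, ch: str) -> int:
    """Count occurrences of a single character."""
    return text.count(ch)


def swap_char(text: str, old: str, new: str) -> str:
    """Replace every occurrence of one character with another."""
    return text.replace(old, new)


def rot13(text: str) -> str:
    """Apply the ROT13 substitution cipher (it is its own inverse)."""
    return codecs.encode(text, "rot_13")


print(count_char("strawberry", "r"))
print(swap_char("banana", "a", "o"))
print(rot13("Hello"))
```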

Tom Burkert

In the 1990s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.

F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA

#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique

#TIL you can pass variables to an #awk script with the option -v. This is useful, for example, when you want to include the file name in the output:

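```
# -v sets the awk variable "filename" before the script runs.
# Embedding {} inside a larger argument works with GNU find; POSIX leaves it implementation-defined.
find . -type f -iname '*.csv' -exec awk -F, -v filename={} '{print filename, $2}' {} \;
```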

Even though seemingly awkward at first glance, #awk is definitely one of the most versatile and useful tools on #linux.

#bash #commandline #shell #programming #textprocessing