Sick of dealing with unwanted line breaks when you copy text from Claude Code?

termCopy is a background daemon for macOS that automatically removes unwanted line breaks from text copied out of terminal apps (e.g. iTerm2, Alacritty). It undoes the hard line breaks inserted to fit the terminal width while preserving intended structure such as lists, code blocks, and paragraph breaks. It installs easily via Homebrew and watches the clipboard in real time, fixing the stray-line-break problem and making developers' copy-paste workflow more efficient.
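
The core idea is easy to sketch. Here is a minimal, hypothetical Python version of the unwrapping heuristic (my own sketch, not termCopy's actual implementation): join a line onto the previous one unless either line looks like intentional structure.

```python
import re

LIST_ITEM = re.compile(r"^\s*([-*+]|\d+[.)])\s")

def unwrap(text: str) -> str:
    """Join hard-wrapped lines; keep blank lines, list items, and indented code."""
    out = []
    for line in text.split("\n"):
        starts_structure = (
            not line.strip()            # blank line: paragraph break
            or LIST_ITEM.match(line)    # bullet or numbered list item
            or line.startswith("    ")  # indented block, likely code
        )
        joinable = out and out[-1].strip() and not out[-1].startswith("    ")
        if joinable and not starts_structure:
            out[-1] = out[-1].rstrip() + " " + line.strip()
        else:
            out.append(line)
    return "\n".join(out)

wrapped = "This paragraph was hard-wrapped\nat the terminal width.\n\n- a list item\n- another item"
print(unwrap(wrapped))  # the first two lines are joined; the list keeps its breaks
```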

https://github.com/aaronw122/termcopy/

#macos #terminal #clipboard #textprocessing #developertools


Fix mojibake in Unicode text, after the fact

ftfy is a Python package specialized in detecting and repairing UTF-8 encoding errors (mojibake). It can undo multiple layers of encoding errors as well as HTML-entity problems, restoring wrongly decoded text to the string that was originally intended. It is useful for cleaning text in AI data preprocessing and NLP research, and it is distributed under the Apache license, which requires attribution.
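
ftfy's entry point is ftfy.fix_text(), which detects and undoes this kind of damage automatically. The underlying failure mode it repairs can be shown with the stdlib alone (the sample string here is my own): UTF-8 bytes wrongly decoded as CP1252.

```python
# What ftfy.fix_text() automates, demonstrated manually:
good = "✔ No problems"
mojibake = good.encode("utf-8").decode("cp1252")   # UTF-8 bytes misread as CP1252
print(mojibake)                                    # âœ” No problems
fixed = mojibake.encode("cp1252").decode("utf-8")  # reverse the bad round-trip
assert fixed == good
```

ftfy's value is that it detects which (possibly stacked) mis-decodings happened, rather than requiring you to know them in advance.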

https://github.com/rspeer/python-ftfy

#python #textprocessing #unicode #mojibake #nlp


@rl_dane If you’re interested in working with Bible texts, you might want to look at https://platform.youversion.com/ – it provides free access via APIs and SDKs, so you don’t need to scrape or re‑parse the text yourself. The fast‑track licensing respects copyright and direct access to the source text helps you avoid introducing issues around textual integrity.

#BibleTech #FaithTech #APIs #TextProcessing


@janfrode

I wouldn't trust an LLM not to be generating based upon other already-published unencoded stuff.

A less expensive, and far more trustworthy, way to decode it is to just pipe the encoded body through gbase64 -d and then iconv -f CP1252 -t UTF-8.
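
The same decode, sketched in Python's stdlib (the sample body here is a hypothetical stand-in, assuming the real body is base64-encoded CP1252 text):

```python
import base64

# Hypothetical sample body: CP1252 text (curly quotes) that was base64-encoded.
body = base64.b64encode("“No comment.”".encode("cp1252"))

# The decode the post describes: base64 first, then CP1252 -> str.
decoded = base64.b64decode(body).decode("cp1252")
print(decoded)  # “No comment.”
```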

#PeterMandelson #UKPolitics #EpsteinFiles #TextProcessing #AIs #LLMs

Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):

https://web.stanford.edu/~jurafsky/slp3/

#NLP #TextProcessing #AI #Algorithms


sentencex - by Wikimedia:

https://github.com/wikimedia/sentencex

A sentence segmentation library with wide language support optimized for speed and utility.

Written in #Rust.

Bindings are available for #Python, #NodeJS and #WASM

Might be useful for my #SpeechToText system! 👀

#NLP #TextProcessing #Segmentation #RustLang
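
As a stdlib-only illustration of why dedicated segmenters like this exist (this naive splitter is my own sketch, not sentencex's algorithm), note how abbreviations break the obvious regex approach:

```python
import re

def naive_split(text: str) -> list[str]:
    # Split after sentence-final punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_split("Dr. Smith arrived. He left at 10 p.m. yesterday."))
# The abbreviations "Dr." and "p.m." wrongly end sentences here; handling such
# cases across many languages is exactly what a segmentation library is for.
```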

#APLQuest 2013-03: Write a function that returns the number of words in the given character scalar or vector (see https://apl.quest/2013/3/ to test your solution and view ours). #APL #WordCount #TextProcessing
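
Not APL, but the same counting idea in Python, taking a word to be a maximal run of non-blank characters (this is one reference approach, not the site's official solution):

```python
def word_count(s: str) -> int:
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading/trailing blanks, so each chunk is one word.
    return len(s.split())

print(word_count("  How many   words are here  "))  # 5
```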

LLMs are getting better at character-level text manipulation

Recently, I have been testing how well the newest generations of large language models (such as GPT-5 or Claude 4.5) handle natural language, specifically counting characters, manipulating characters in a sentence, or solving encodings and ciphers. Surprisingly, the newest models were able to solve these kinds of tasks, unlike previous generations of LLMs.

Character manipulation

LLMs handle individual characters poorly. This is because all text is encoded as tokens by the LLM tokenizer and its vocabulary. Individual tokens typically represent clusters of characters, sometimes even full words (especially in English and other languages common in the training data). This makes any reasoning at a level more granular than tokens fairly difficult, although LLMs have been capable of certain simple tasks (such as spelling out the individual characters in a word) for a while.
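
The appeal of these probes is that they are trivially checkable in code. The sample strings below are my own illustrations of the three task families mentioned, not the author's test set:

```python
import codecs

# Counting characters: the classic "how many r's in strawberry" probe.
assert "strawberry".count("r") == 3

# Character manipulation: reversing a word character by character.
assert "tokenizer"[::-1] == "rezinekot"

# A simple cipher: ROT13 encoding.
assert codecs.encode("Hello", "rot_13") == "Uryyb"

print("all character-level checks pass")
```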

Tom Burkert

From the 1990s onward, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advances in nearly all NLP techniques of the era, laying the groundwork for today's AI.
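
As a toy illustration of the n-gram idea (my own minimal sketch, not Jelinek's formulation): a bigram model estimates P(w_i | w_{i-1}) from counts over a corpus.

```python
from collections import Counter

# Toy corpus; a real n-gram model is trained on vast text collections.
corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of pairs (w_{i-1}, w_i)
contexts = Counter(corpus[:-1])              # counts of contexts w_{i-1}

def p(word: str, prev: str) -> float:
    """Maximum-likelihood bigram estimate P(word | prev)."""
    return bigrams[(prev, word)] / contexts[prev]

print(p("cat", "the"))  # "the" is followed by "cat" in 2 of its 3 occurrences
```

Real systems of the era added smoothing (e.g. for unseen bigrams) on top of these raw maximum-likelihood estimates.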

F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA

#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique