https://ndingwall.github.io/blog/tokenization
Tokenization for language modeling: Byte Pair Encoding vs Unigram Language Modeling
Tokenizers used by the best-performing language models (BERT, GPT-2, etc.) poorly reflect the morphology of English text. I had hoped to use some quarantine time to design one that aligns more closely with the relationships between wordforms, but Kaj Bostrom and Greg Durrett beat me to it, and so this blog post materialized instead. I add some additional motivation, evaluate both methods against ‘gold standard’ tokenizations, and speculate about what might come next.
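To make the comparison concrete, here is a minimal sketch of the BPE training loop: repeatedly merge the most frequent adjacent pair of symbols. The toy corpus and the resulting merges follow the worked example from Sennrich et al.'s original BPE paper; the function names and the `</w>` end-of-word marker are illustrative choices, not the exact implementation used by BERT or GPT-2.

```python
from collections import Counter

def merge_word(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_freqs, num_merges):
    """Greedy BPE: merge the most frequent adjacent symbol pair each round."""
    # Start from individual characters, plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {tuple(merge_word(list(s), best)): f
                 for s, f in vocab.items()}
    return merges

# Toy corpus (word -> frequency) from the original BPE paper.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(corpus, 3))
# The first merges assemble the frequent suffix 'est' piece by piece:
# [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Note that the merges here happen to track a real suffix, but only because frequency and morphology coincide in this tiny corpus; the blog post's point is that on real English text they frequently do not.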