
What does it take to train Large Language Models in low-resource languages? I analyse this problem from first principles, using my mother tongue, Malayalam, as a case study.

I trained a tokenizer and evaluated its performance against other tokenizers. I also analysed the challenges that need to be solved to get a functional LLM. Seen from the other side, this is also a story of why LLMs work in languages like English.

The Broken Token: Tokenization for Malayalam Language Models
https://thottingal.in/blog/2026/02/27/malayalam-tokenizer-llm/


Santhosh Thottingal Mar 2

Using the tokenizer I introduced yesterday, I built a small language model that can generate text at 1000 tokens/second. It's a Markov chain model with trigram context (a context of 3 nearby words). The output text is suitably nonsensical 😄

Try it here: https://malgen.thottingal.in/
Article: https://thottingal.in/blog/2026/02/28/malayalam-markov-chain/
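
For readers who have not met Markov chain text generators, here is a minimal sketch of the idea in Python. It is not the code behind malgen: it assumes whitespace-separated English words stand in for the tokenizer's output, it reads "trigram context" as conditioning on the previous 3 tokens (the `order` parameter), and the names `build_chain` and `generate` are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(tokens, order=3):
    """Map each run of `order` consecutive tokens to the tokens seen right after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        chain[context].append(tokens[i + order])
    return dict(chain)


def generate(chain, seed, length=20, rng=None):
    """Extend `seed` by sampling one follower at a time from the chain."""
    rng = rng or random.Random()
    out = list(seed)
    order = len(seed)  # the seed must have the same length as the chain's contexts
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break  # this context never appeared in the corpus, so stop
        out.append(rng.choice(followers))
    return out


# Toy corpus (an assumption for the demo; the real model uses Malayalam tokens).
corpus = "the cat sat on the mat and the cat sat on the hat".split()
chain = build_chain(corpus, order=3)
text = generate(chain, seed=("the", "cat", "sat"), length=10, rng=random.Random(0))
print(" ".join(text))
```

Generation is just a dictionary lookup and a random choice per token, which is why a model like this can be very fast, and it only ever reproduces local patterns seen in the corpus, which is why the output drifts into nonsense.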