Jindřich Libovický

42 Followers
87 Following
46 Posts
🇨🇿 🇪🇺 Researcher at Charles University, Prague. Working on multilingual NLP and neural machine translation. Views my own. He/him.

RE: https://sigmoid.social/@jlibovicky/116198578428150787

None of this would work without my TAs: Dušan Variš, Tomáš Musil, Jan Bronec,
Gianluca Vico, Adnan Al Ali, Kristýna Onderková, Milan Straka taking care of ReCodEx: recodex.mff.cuni.cz. Thank you 🙏

3rd run of teaching ML to 250+ bachelor students (with great materials originally by Milan Straka). Core philosophy: explain the math, implement algorithms from scratch, Kaggle-style competitions, all auto-graded.
https://ufal.mff.cuni.cz/courses/npfl129/2526-winter

Some students find the assignments too time-consuming. Fair. But here's what the data shows over 3 years:
📉 Forum questions dropped ~4×
📈 Full bonus points: 20% → 27% → 52%
📉 Avg. test attempts: 2.7 → 2.4 → 1.9

Asking less, achieving more, iterating less. 🤔

Spent time making AI-generated images of Bayes' Rule, Laplace Smoothing, Markov Chains & Shannon Entropy for class today 🎨🤖 Even though the images are objectively hilarious, none of the 50 students in the room laughed. Or even smiled. 💀

Two years ago, I reviewed papers on LLM value orientations and didn't trust the results. So I ran my own World Values Survey experiments. Skepticism justified: prompting style and the choice of error metric yield fundamentally different conclusions about LLM "alignment."
More interestingly, LLMs overgeneralize second-order patterns: their answers are more stereotypically consistent than real humans' are.

Paper out after several rejections! Preprint: https://www.arxiv.org/abs/2602.04033
Presenting at MME @ EACL 2026

We have updated the preprint on CUS-QA, a benchmark for regional knowledge about Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦 https://arxiv.org/abs/2507.22752
Now, there are results from retrieval-augmented generation and more detailed analyses of model performance depending on the question's topic or visual context.
CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

We introduce CUS-QA, a benchmark for evaluating open-ended regional question answering that encompasses both textual and visual modalities, together with strong baselines using state-of-the-art large language models (LLMs). The dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and questions requiring visual understanding. We evaluate state-of-the-art LLMs through prompting and add human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only slightly over 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics correlate strongly with human judgment, while traditional string-overlap metrics perform surprisingly well due to the prevalence of named entities in answers.
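As a concrete illustration of the kind of string-overlap metric the abstract refers to, here is a generic SQuAD-style token F1 (a minimal sketch, not necessarily the exact metric used in the paper). It rewards shared tokens, which is why answers dominated by named entities score well:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; otherwise no overlap is possible.
        return float(pred == ref)
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, a verbose answer like "the Charles Bridge in Prague" still overlaps substantially with the reference "Charles Bridge", because the named entity makes up most of both strings.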


We (Abishiek Stephen and I) developed a way to evaluate how morphological a #tokenization is w/o gold segmentation labels. https://arxiv.org/abs/2601.18536 The key: align subword tokens with morphological features from UniMorph using IBM Model 1. To appear in EACL 2026 Findings.

👉 Why it matters:
For many languages, gold segmentation data is missing; morphological features are much more widely available.

Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features

We present a novel metric for the evaluation of the morphological plausibility of subword segmentation. Unlike the typically used morpheme boundary or retrieval F-score, which requires gold segmentation data that is either unavailable or of inconsistent quality across many languages, our approach utilizes morpho-syntactic features. These are available in resources such as Universal Dependencies or UniMorph for a much wider range of languages. The metric works by probabilistically aligning subwords with morphological features through an IBM Model 1. Our experiments show that the metric correlates well with traditional morpheme boundary recall while being more broadly applicable across languages with different morphological systems.
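The probabilistic alignment step the abstract describes can be sketched as a tiny IBM Model 1 EM loop (a minimal illustration under my own assumptions, not the paper's implementation; the toy subword/feature pairs in the docstring are made up):

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t(feature | subword) with IBM Model 1 EM.

    pairs: list of (subwords, features) per word form, e.g. the
    hypothetical (["un", "happi", "ness"], ["NEG", "ADJ", "N"]).
    """
    subword_vocab = {s for subwords, _ in pairs for s in subwords}
    # Uniform initialization of the translation table.
    t = defaultdict(lambda: 1.0 / len(subword_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizer per subword
        for subwords, features in pairs:  # E-step
            for f in features:
                z = sum(t[f, s] for s in subwords)
                for s in subwords:
                    c = t[f, s] / z
                    count[f, s] += c
                    total[s] += c
        for (f, s), c in count.items():  # M-step
            t[f, s] = c / total[s]
    return dict(t)
```

With enough iterations, t(feature | subword) concentrates on the features a subword consistently co-occurs with across word forms; a plausibility metric can then be read off such alignment probabilities.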

Highlights from Machine Translation and Multilinguality in April 2024

Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

Jindřich’s blog
Highlights from Machine Translation and Multilinguality in March 2024

Did Translation Models Get More Robust Without Anyone Even Noticing?

Things I need to tell LLMs 🤖 to produce text I could at least post-edit in my writing 🤓🤷‍♂️: Make it sound very technical, not impassioned. Do not exaggerate, do not use overly fancy words, and do not use rich vocabulary.
Highlights from Machine Translation and Multilinguality in February 2024

With a new month, here are a few papers that I noticed on arXiv in February.
