Picture: Wikimedia Commons

My current research project relies on successful lemmatisation to yield any sensible results. Finding the right tools can be tricky, especially for a smaller language. Here are some notes on what to pay attention to when choosing the right tool for your project.

Text analysis is hard and finding the right tools for it is often a significant part of the work. In my current research project I’m working with transcriptions of speeches so at least I’m spared from turning spoken word into text. Getting anything meaningful out of strings is still a lot of work.

Lemmatisation is the process of identifying a word’s dictionary form, the lemma (for example, the Finnish “koirille”, “for the dogs”, lemmatises to “koira”, “dog”). When we work with text on computers, lemmatisation in practice means identifying words in character strings and returning their lemmas – transforming more or less coherent sentences into toddler speak. Transforming words into their lemmas means we can compare them, but the trade-off is often losing context and complexity. Context, therefore, must be kept in the back of the mind or the actual analysis makes no sense.

I work with already transcribed data (thank god), so the actual lemmatisation is a straightforward process. The tricky part is that the language I’m working with is a small one: Finnish. There are several lemmatisation packages for Python nowadays, but not all of them can be applied to Finnish. The two that I ended up comparing were both surprisingly good!

This is a brief comparison between two Python packages: simplemma and PyVoikko. I ended up using PyVoikko, but the differences in lemmatisation accuracy were small. A major difference between the packages is that simplemma handles several languages whereas PyVoikko is tailored specifically for Finnish.

This comparison highlights performance in speed, where simplemma excels. In practice this means that simplemma processes strings faster: in my tests with a dataset of 22739 unique strings (average length 124 words and 1055 characters), applying each lemmatisation function to every unique string and storing the outputs took:
-> simplemma: 52.62 seconds
-> PyVoikko: 636.72 seconds.

No contest here. I used PyVoikko’s Rust API for performance, and I’m not going to try again without it. Note that I used pandas dataframes and their .apply() method for the tests. If speed is what you need, you have your answer.
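For reference, the timing setup can be sketched roughly like this. This is a stdlib-only toy version: the dummy lemmatise function and the tiny dataset are stand-ins for the real simplemma/PyVoikko calls and my actual data, and the pandas .apply() call I used is only noted in a comment.

```python
import time

# Dummy stand-in for the real lemmatiser (simplemma or PyVoikko);
# here it just strips punctuation and lowercases each token.
def lemmatise(text: str) -> str:
    return " ".join(token.strip(".,!?").lower() for token in text.split())

# Toy dataset standing in for the real 22739 unique strings.
strings = ["Tämä on esimerkki.", "Toinen lause tässä!"] * 1000

start = time.perf_counter()
lemmatised = [lemmatise(s) for s in strings]  # with pandas: df["text"].apply(lemmatise)
elapsed = time.perf_counter() - start

print(f"{len(strings)} strings in {elapsed:.2f} seconds")
```

The same pattern works for either package: swap the body of lemmatise for the real call and time the whole pass.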

Another major difference between the packages is that PyVoikko recognised Finnish first (given) names and some last names in the data, whereas simplemma rarely managed this. As a result, PyVoikko eagerly returned the names it recognised, capitalised but transformed into their dictionary forms.

Both tools lost words from the original sentences. Lemmatising with PyVoikko reduced the number of words in the lemmatised text relative to the original in 85% of the cases, versus 75% with simplemma. The maximum number of words in a sentence was 2781 in the original content, 2741 with PyVoikko and 2767 with simplemma; the minimum was 1 for all three, and the means were 124.1 (original), 120.8 (PyVoikko) and 122.1 (simplemma). Lemmatisation with PyVoikko thus lost more words from the original content.

The number of words lost is not large, but when working with Finnish it is worth checking. Finnish has no articles (“a”, “the”, “ett”, “le”); definiteness, plurals, tenses and the like are formed with suffixes and conjugation (I’m not an expert and not going to try to explain this further; check Wikipedia for a quick dive if you’re interested). Stems are therefore tricky to find but quite important for a relatively dense language. In 32 cases (out of 22739, 0.14%) simplemma ended up with more words than the original text, but these turned out to be due to typos and special characters in the text; simplemma did not generate extra words by itself.
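This word-loss check is easy to reproduce. A minimal sketch, with toy sentence pairs standing in for the real original/lemmatised data:

```python
# For each original/lemmatised pair, check whether lemmatisation
# reduced the word count. The pairs below are hypothetical examples.
originals = ["yksi kaksi kolme", "neljä viisi", "kuusi"]
lemmatised = ["yksi kaksi", "neljä viisi", "kuusi"]

reduced = sum(
    len(lemma.split()) < len(orig.split())
    for orig, lemma in zip(originals, lemmatised)
)
share = reduced / len(originals)
print(f"word count reduced in {share:.0%} of cases")  # -> 33% here
```

The same loop also yields the max/min/mean word counts via the split lengths.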

Next, I wanted to see how greedy the packages were in their operations: how many unique words they would be able to extract from the original material. I then compared this against the official list of Finnish words (in their dictionary forms) by Kotus (Institutet för de inhemska språken, Institute for the Languages of Finland), which at the time of writing contains 104743 unique words. Simplemma extracted 243901 and PyVoikko 194114 unique words (lemmas) from the original dataset. Of these, simplemma’s difference ratio was 82.2% (200408 of 243901 words not found on Kotus’ word list), whereas PyVoikko’s was 80.4% (156184 of 194114). Spoken language of course differs a lot from the “official” one, and Finnish loves compound words which cannot be returned to any “official” lemma – if Kotus’ list contained all of these it would be close to infinite in size. Remarkably similar results from both tools, in my opinion!
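The difference ratio itself is a one-line set operation. A sketch with tiny stand-in sets (in reality the reference set comes from the Kotus word-list download and the lemma set from the tool’s output):

```python
# Which extracted lemmas are missing from the reference word list?
# Both sets below are toy stand-ins, not the real data.
kotus_words = {"talo", "koira", "juosta"}
extracted_lemmas = {"talo", "koira", "velanotto", "koronanousu"}

not_in_kotus = extracted_lemmas - kotus_words
ratio = len(not_in_kotus) / len(extracted_lemmas)
print(f"{ratio:.1%} ({len(not_in_kotus)} of {len(extracted_lemmas)}) not on the list")
```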

PyVoikko and simplemma extracted a total of 109449 identical words from the dataset that were not on Kotus’ list. Some of these were due to typos in the dataset, which of course could not be handled by the tools as no spell correction was applied to the original data before the analysis. I ordered PyVoikko’s and simplemma’s sets of identical words by the frequency of each word, and 76 of the 100 most frequent words on both lists were the same. The tools extracted the same lemma in notably many cases despite the word not being on the “official” Finnish word list. When I widened the scope to the 1000 most frequent words on both lists, the word was the same in 835 cases. This means that the tools quite systematically arrived at the same conclusion about a word’s lemma.
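The rank-by-rank comparison can be sketched with collections.Counter. The two frequency lists below are toy stand-ins for the real tool outputs, and N is 3 here instead of 100 or 1000:

```python
from collections import Counter

# Rank each tool's lemmas by frequency and count how many of the
# top-N ranks hold the same word in both lists.
tool_a = ["talo", "talo", "koira", "kissa", "koira", "talo"]
tool_b = ["talo", "koira", "talo", "kissa", "kissa", "talo"]

top_a = [word for word, _ in Counter(tool_a).most_common(3)]
top_b = [word for word, _ in Counter(tool_b).most_common(3)]
same = sum(a == b for a, b in zip(top_a, top_b))
print(f"{same} of 3 ranks match")
```

Note that most_common breaks frequency ties by insertion order, so near-tied ranks can flip between runs over differently ordered data.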

PyVoikko extracted 46735 unique words that simplemma did not, and simplemma extracted 90959 unique words that PyVoikko did not. PyVoikko returned names and some typos which it could not lemmatise, and erroneously turned some words into names. Simplemma in some cases appeared to return verbs in their noun forms (verb-equivalent nouns). In some cases PyVoikko recognised a compound word and its parts, but it also struggled with verb equivalency (“velanottoa” -> “velanottaa”, when it should have been “velanotto” (velka -> velan + ottaa) [to take on (a) debt]). Instead of falling into this trap as often, simplemma tended to either skip compound words it did not recognise or possibly slice them. Problems of a small language, I guess, although I would love to hear from German speakers whether this is a common problem for them too. PyVoikko also appeared to slice words but, based on ranking the frequencies of non-matched and un-shared words, to a lesser extent than simplemma. Simplemma also appeared to struggle with finding the lemma for plural partitives. In my use case, PyVoikko’s ability to recognise local names came in handy, although, as mentioned, PyVoikko is too eager to trigger it. I am still amazed and positively surprised by how strongly both tools performed and how similar their results were.
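Finding the words only one tool produced is again plain set arithmetic. A sketch with hypothetical lemma sets (including a made-up surname to mimic PyVoikko’s name recognition):

```python
# Set differences between the two tools' extracted lemma sets.
# Both sets are toy stand-ins for the real outputs.
pyvoikko_lemmas = {"talo", "koira", "velanottaa", "Virtanen"}
simplemma_lemmas = {"talo", "koira", "velanotto"}

only_pyvoikko = pyvoikko_lemmas - simplemma_lemmas
only_simplemma = simplemma_lemmas - pyvoikko_lemmas
print(sorted(only_pyvoikko))   # lemmas only PyVoikko produced
print(sorted(only_simplemma))  # lemmas only simplemma produced
```

Ranking these leftover words by frequency (as above) is what surfaced the name, compound and partitive patterns described here.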

These are mostly notes for myself to check later on how I approached tool selection. I’m glad if these comparisons and tests help you to find a lemmatisation tool to suit your needs in your own project. No AI was used in writing this post. The tests were not vibe coded.

https://oarajala.vivaldi.net/2026/01/29/comparing-python-lemmatisation-tools/

#lemmatisation #Python #PyVoikko #simplemma