Frederic Blum

39 Followers
54 Following
11 Posts
PhD researcher at Max Planck Institute for Evolutionary Anthropology and the University of Passau. Studying the history of South American languages and linguistic typology.
GitHub: https://github.com/FredericBlum
CodeBerg: https://codeberg.org/FredericBlum
Homepage: https://www.eva.mpg.de/de/linguistic-and-cultural-evolution/staff/frederic-blum/

My first solo-authored publication just appeared in *Linguistic Typology*: "The over-representation of phonological features in basic vocabulary doesn’t replicate when controlling for spatial and phylogenetic effects"

Running a #Bayesian model with #Lexibank data, I show that most previously observed effects claimed to be sound symbolism do **not** replicate. A handful of effects emerges as highly stable, though, mostly related to body parts and the pronominal system.

#linguistics #replication #typology #science #statistics

> https://doi.org/10.1515/lingty-2025-0050

The over-representation of phonological features in basic vocabulary doesn’t replicate when controlling for spatial and phylogenetic effects

The statistical over-representation of certain phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of these cases have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubts on the robustness of the results. In this study, I test the robustness of a recent study on sound symbolism in basic vocabulary concepts which analyzed 245 languages. This paper adds a new sample of 2,864 languages from Lexibank. I modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerges as highly stable even with the new sample. Through the new analysis, it is possible to assess the distribution of sound symbolism on a larger scale than previously. The study further highlights the need for testing all universal claims on language for robustness on various levels.
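The core point about genealogical controls can be illustrated with a toy calculation (entirely invented data, and a crude stand-in for the paper's actual Bayesian model): if one large family happens to carry a feature, a naive per-language rate overstates its prevalence, while averaging within families first lets each independent lineage contribute one data point.

```python
from collections import defaultdict

# Toy sample of (language, family, has_feature) triples -- all made up.
# One large family (F1) carries the feature; three small families do not.
sample = (
    [(f"f1_lang{i}", "F1", True) for i in range(8)]
    + [("f2_lang", "F2", False), ("f3_lang", "F3", False), ("f4_lang", "F4", False)]
)

# Naive rate: every language counts equally, so the big family dominates.
naive_rate = sum(has for _, _, has in sample) / len(sample)

# Family-controlled rate: average within each family first, then across
# families, so related languages are not counted as independent evidence.
by_family = defaultdict(list)
for _, fam, has in sample:
    by_family[fam].append(has)
family_rate = sum(sum(v) / len(v) for v in by_family.values()) / len(by_family)

print(naive_rate)   # 8/11, roughly 0.73
print(family_rate)  # (1 + 0 + 0 + 0) / 4 = 0.25
```

The paper's model handles this with explicit phylogenetic and spatial terms rather than simple averaging, but the direction of the correction is the same.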

De Gruyter Brill

New preprint by @fblum (major idea and implementation) and me (the one who criticized and commented), introducing a new approach to regularity assessment.

"Using correspondence patterns to identify irregular words in cognate sets through leave-one-out-validation"

https://arxiv.org/abs/2602.02221

Using Correspondence Patterns to Identify Irregular Words in Cognate Sets Through Leave-One-Out Validation

Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.
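The leave-one-out idea can be sketched in a few lines of Python. This is a deliberately simplified recurrence measure on toy data, not the preprint's balanced average recurrence or its implementation: each aligned position across languages forms a correspondence pattern, regularity is the average recurrence of a set's patterns, and the form whose exclusion raises regularity the most is flagged as irregular.

```python
from collections import Counter

def pattern_counts(cognate_sets, langs):
    """Count how often each per-position sound correspondence recurs."""
    counts = Counter()
    for cs in cognate_sets:
        for pos in range(len(cs[langs[0]])):
            counts[tuple(cs[lang][pos] for lang in langs)] += 1
    return counts

def regularity(cs, counts, langs):
    """Average recurrence of the correspondence patterns in one cognate set."""
    pats = [tuple(cs[lang][pos] for lang in langs)
            for pos in range(len(cs[langs[0]]))]
    return sum(counts[p] for p in pats) / len(pats)

def flag_irregular(target, reference_sets, langs):
    """Leave each language's form out; the form whose removal raises
    the target set's regularity the most is flagged as irregular."""
    scores = {}
    for left_out in langs:
        rest = [lang for lang in langs if lang != left_out]
        counts = pattern_counts(reference_sets, rest)
        scores[left_out] = regularity(target, counts, rest)
    return max(scores, key=scores.get)

langs = ["A", "B", "C"]
# Regular pattern: 'p' in A and B corresponds to 'b' in C; vowels match.
reference = [{"A": "pa", "B": "pa", "C": "ba"} for _ in range(3)]
# In the target set, C shows an unexpected 'k' instead of the regular 'b'.
target = {"A": "pa", "B": "pa", "C": "ka"}
print(flag_irregular(target, reference, langs))  # -> C
```

Dropping C leaves only the well-attested A/B patterns, so regularity jumps, which is exactly the signal the method exploits.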

arXiv.org

Now published: our study presenting Lexibank 2.

Blum et al. @fblum (2025) in Open Research Europe.

https://doi.org/10.12688/openreseurope.20216.1

Just learned that our study introducing Lexibank 2 (Blum et al., @fblum) has passed peer review with Open Research Europe. We will still revise in response to the reviewers' comments, but the study is accepted: Lexibank 2 is now official.

Lexibank 2: pre-computed features for large-scale lexical data

https://doi.org/10.12688/openreseurope.20216.1

Our study introducing Lexibank 2, the second installment of the Lexibank repository, just appeared online with Open Research Europe (with @fblum as our first author, who led this project bravely).

https://doi.org/10.12688/openreseurope.20216.1

@stefanmuelller I use a combination of PrivacyBadger, DuckDuckGo Privacy Essentials, and uBlock. I particularly like PrivacyBadger, since it also spits out a list of exactly what it blocks. Which turns out to be quite a lot.

New preprint presenting a method that offers a new way to complement phylogenetic approaches in comparative linguistics, together with @fblum and Steffen Herbold:

From Isolates to Families: Using Neural Networks for Automated Language Affiliation

https://doi.org/10.48550/arXiv.2502.11688

From Isolates to Families: Using Neural Networks for Automated Language Affiliation

In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.
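As a rough illustration of the classification setup, here is a one-neuron toy model on invented feature vectors. It is emphatically not the authors' architecture or data: each "language" is a vector concatenating lexical and grammatical indicators, and a single sigmoid unit trained by gradient descent learns to separate two made-up families.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(data, epochs=300, lr=0.5):
    """Plain gradient descent on a single sigmoid unit."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict_family(x, w, b):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return "FamilyX" if p > 0.5 else "FamilyY"

# Each vector concatenates two lexical and two grammatical indicators.
# Labels: 1 = FamilyX, 0 = FamilyY (both families invented).
train_data = [
    ([1, 1, 0, 0], 1),
    ([1, 1, 0, 1], 1),
    ([0, 0, 1, 1], 0),
    ([0, 0, 1, 0], 0),
]
w, b = train_logreg(train_data)
print(predict_family([1, 1, 0, 0], w, b))  # FamilyX
```

In this toy setup the lexical indicators carry the signal and the grammatical ones are noisy, loosely mirroring the paper's finding that lexical data alone outperforms grammatical data while the combination works best.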

arXiv.org

Friends, for something to be open source, we need to see

1. The data it was trained and evaluated on

2. The code

3. The model architecture

4. The model weights.

DeepSeek only gives us 3 and 4. And I have yet to see the day that anyone gives us #1 without being forced to, because all of them are stealing data.

New blog post by @fblum just appeared in Computer-Assisted Language Comparison in Practice, illustrating how the EDICTOR tool for computer-assisted language comparison can be used locally.

How to Run EDICTOR 3 Locally

https://calc.hypotheses.org/8143

https://doi.org/10.15475/calcip.2025.1.1

How to Run EDICTOR 3 Locally

EDICTOR 3 offers many ways of comparing language data with computer-assisted methods. This study offers a short overview of how to run EDICTOR 3 locally, without the need for uploading the data to a server or being connected to the internet, while maintaining all the functionalities. In a first step, we will show how one can download […]

Computer-Assisted Language Comparison in Practice

Why I have resigned from the Royal Society

The Royal Society is a venerable institution founded in 1660, whose original members included such eminent men as Christopher Wren, Robert H...