In the section on #GlobalHistory from a global perspective, the topic right now is the limitations of LLMs for "low-resourced" languages. A lot is happening in this area, though - not at OpenAI, Google, Meta & Co., but elsewhere. I'll look for more links later; for the moment, this one will have to do:

https://doi.org/10.1038/s43588-025-00865-y

#HistTag25 #MLLM #LLM

At the @bifold.berlin conference "AI-based methods in the humanities", I have just attended a great talk by Seid Muhie Yimam of Hamburg University, who confirmed my impression that there is real momentum in this area at the moment. He mentioned many datasets, publications, and shared tasks on African languages. I will list them (bit by bit) in this thread.

2/x

#HistTag25 #MLLM #LLM #MultilingualDH

Starting from Joshi et al. (2020): The State and Fate of Linguistic Diversity and Inclusion in the NLP World.
https://doi.org/10.18653/v1/2020.acl-main.560

Then, Adelani et al. (2021): MasakhaNER: Named Entity Recognition for African Languages. https://doi.org/10.1162/tacl_a_00416
- covering 10 African languages (a quick loading sketch below)
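
Not from the talk, just for orientation: loading one MasakhaNER language from the Hugging Face Hub could look roughly like this. The dataset id "masakhane/masakhaner", the "yor" config, and the column names are my assumptions - check the dataset card.

```python
# Hedged sketch: loading one MasakhaNER language from the Hugging Face Hub.
# Dataset id, config name ("yor" = Yoruba) and column names are assumptions.
from datasets import load_dataset

ner = load_dataset("masakhane/masakhaner", "yor")
example = ner["train"][0]
print(example["tokens"])    # tokenised sentence
print(example["ner_tags"])  # integer labels (PER / ORG / LOC / DATE scheme)
```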

3/x

#MultilingualDH

Muhammad et al. (2025): AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages. https://doi.org/10.18653/v1/2025.naacl-long.92 Huggingface: https://huggingface.co/datasets/afrihate/afrihate
- Labeled dataset for 18 African languages
- Fine-tuning PLMs (AfroXLMR, AfriBERTa, AfriTeVa; best-performing: AfroXLMR-76L) - a minimal fine-tuning sketch below
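
A minimal sketch (my own, not from the paper) of what such fine-tuning could look like with transformers and datasets. The "hau" config, the text/label column names, and the three-class label set are assumptions; check the dataset card before running.

```python
# Hedged sketch: fine-tuning an Afro-centric encoder on one AfriHate language.
# Config name, column names and num_labels are assumptions -- see the dataset card.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

ds = load_dataset("afrihate/afrihate", "hau")          # Hausa subset (config name assumed)
model_name = "Davlan/afro-xlmr-large"                  # an AfroXLMR checkpoint; the thread names AfroXLMR-76L as best-performing
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    return tok(batch["tweet"], truncation=True, max_length=128)   # text column assumed

ds = ds.map(encode, batched=True)                      # keeps the "label" column for the Trainer

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="afrihate-hau",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],                     # split name assumed
    tokenizer=tok,                                     # enables dynamic padding
)
trainer.train()
```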

4/x

#MultilingualDH


Nigatu et al. (2025): A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge'ez Script. https://doi.org/10.48550/arXiv.2507.15142

5/x

#MultilingualDH

A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge'ez Script

Homophone normalization, where characters that have the same sound in a writing script are mapped to one character, is a pre-processing step applied in Amharic Natural Language Processing (NLP) literature. While this may improve performance reported by automatic metrics, it also results in models that are not able to understand different forms of writing in a single language. Further, there might be impacts in transfer learning, where models trained on normalized data do not generalize well to other languages. In this paper, we experiment with monolingual training and cross-lingual transfer to understand the impacts of normalization on languages that use the Ge'ez script. We then propose a post-inference intervention in which normalization is applied to model predictions instead of training data. With our simple scheme of post-inference normalization, we show that we can achieve an increase in BLEU score of up to 1.03 while preserving language features in training. Our work contributes to the broader discussion on technology-facilitated language change and calls for more language-aware interventions.
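
To make the idea concrete, a toy sketch of post-inference normalization. The character groups are common Amharic examples of my own choosing, not the paper's exact mapping, and a full mapping would also cover the vowel orders of each series.

```python
# Toy sketch of post-inference homophone normalization for the Ge'ez script:
# normalize model *output* (and references) before scoring, not the training data.
# Character groups are illustrative; the paper's exact mapping may differ.
HOMOPHONE_GROUPS = ["ሀሐኀ", "ሰሠ", "ጸፀ", "አዐ"]          # base forms only, for brevity
CANONICAL = {ch: group[0] for group in HOMOPHONE_GROUPS for ch in group}

def normalize(text: str) -> str:
    """Map every homophone character to one canonical representative."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)

print(normalize("ሰ") == normalize("ሠ"))   # True: scored as the same character
# BLEU would then be computed on normalize(hypothesis) vs normalize(reference)
```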


Yimam et al. (2021): Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets. https://doi.org/10.3390/fi13110275

6/x

#MultilingualDH

Belay et al. (2025): Afro-XLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text. https://doi.org/10.48550/arXiv.2503.18247

- Model: https://huggingface.co/Tadesse/AfroXLMR-Social
- Dataset: https://huggingface.co/datasets/Tadesse/AfriSocial

7/x

#MultilingualDH

AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text

Language models built from various sources are the foundation of today's NLP progress. However, for many low-resource languages, the diversity of domains is often limited, more biased to a religious domain, which impacts their performance when evaluated on distant and rapidly evolving domains such as social media. Domain adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) are popular techniques to reduce this bias through continual pre-training for BERT-based models, but they have not been explored for African multilingual encoders. In this paper, we explore DAPT and TAPT continual pre-training approaches for African languages social media domain. We introduce AfriSocial, a large-scale social media and news domain corpus for continual pre-training on several African languages. Leveraging AfriSocial, we show that DAPT consistently improves performance (from 1% to 30% F1 score) on three subjective tasks: sentiment analysis, multi-label emotion, and hate speech classification, covering 19 languages. Similarly, leveraging TAPT on the data from one task enhances performance on other related tasks. For example, training with unlabeled sentiment data (source) for a fine-grained emotion classification task (target) improves the baseline results by an F1 score ranging from 0.55% to 15.11%. Combining these two methods (i.e. DAPT + TAPT) further improves the overall performance. The data and model resources are available at HuggingFace.
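
A hedged sketch of what DAPT on AfriSocial could look like in code (my own illustration; the split name and the text column are assumptions - check the dataset card):

```python
# Hedged sketch of DAPT: continue masked-language-model pre-training of an African
# multilingual encoder on social-media text before fine-tuning it downstream.
# Split name and text column are assumptions -- check the AfriSocial dataset card.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "Davlan/afro-xlmr-large"                         # encoder to adapt
corpus = load_dataset("Tadesse/AfriSocial", split="train")

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=128)   # "text" column assumed

corpus = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="afro-xlmr-dapt",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()   # the adapted encoder is then fine-tuned on sentiment / emotion / hate speech
```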


Hussen et al. (2025): The State of Large Language Models for African Languages: Progress and Challenges. https://doi.org/10.48550/arXiv.2506.02280

8/x

#MultilingualDH

The State of Large Language Models for African Languages: Progress and Challenges

Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits are largely absent for Africa's 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps. The work identifies 42 supported African languages and 23 available public data sets, and it shows a big gap where four languages (Amharic, Swahili, Afrikaans, and Malagasy) are always treated while there is over 98% of unsupported African languages. Moreover, the review shows that just Latin, Arabic, and Ge'ez scripts are identified while 20 active scripts are neglected. Some of the primary challenges are lack of data, tokenization biases, computational costs being very high, and evaluation issues. These issues demand language standardization, corpus development by the community, and effective adaptation methods for African languages.


Shared Tasks at https://semeval.github.io/

- SemEval 2023 Task 12: AfriSenti https://afrisenti-semeval.github.io/
- SemEval 2024 Task 1: SemRel https://semantic-textual-relatedness.github.io/
- SemEval 2025 Task 11: Bridging the Gap https://github.com/emotion-analysis-project/SemEval2025-task11
- SemEval 2026 Task 4: Narrative Story Similarity http://narrative-similarity-task.github.io/
- SemEval 2026 Task 9: Detecting multilingual online polarization https://polar-semeval.github.io/

9/x

#MultilingualDH


And there are certainly more - just search on Hugging Face...

You can find models like https://huggingface.co/Davlan/afro-xlmr-large-114L or even Apertus, which boasts "1811 natively supported languages": https://huggingface.co/swiss-ai/Apertus-70B-2509 ...
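
If you just want to poke at one of these checkpoints, a quick fill-mask example - assuming an XLM-R-style encoder; the Swahili prompt is only an illustration:

```python
# Quick look at one of the encoders linked above via the fill-mask pipeline.
# Assumes an XLM-R-style masked language model; the Swahili prompt is illustrative.
from transformers import pipeline

fill = pipeline("fill-mask", model="Davlan/afro-xlmr-large-114L")
prompt = f"Nairobi ni mji mkuu wa {fill.tokenizer.mask_token}."   # "Nairobi is the capital of <mask>."
for pred in fill(prompt):
    print(pred["token_str"], round(pred["score"], 3))
```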

Some remarks and outlook:

- only 41 African languages (2%) are substantially covered
- only the Latin, Arabic & Ge'ez scripts are covered
- <10 languages are frequently supported
- 18 GB of data across 23 datasets
- focus on classification
- focus on specialized small language models
- in Africa, research is often community-driven: participatory work led not by universities but by communities like Masakhane
- it remains challenging to even find speakers/annotators
- it is necessary to invest in scalable infrastructure, ethical frameworks, and context-sensitive evaluation

10/10 (fin)

#MultilingualDH


Oh I forgot to add two event series that - as far as I recollect - haven't been mentioned in Yimam's talk:

The workshop series on Resources for African Indigenous Languages (RAIL): https://sadilar.org/en/rail-2025

And the AfricaNLP workshop series:
https://sites.google.com/view/africanlp2025/home
https://sites.google.com/view/africanlp2024/home
https://africanlp.masakhane.io/ (with links to earlier editions of the workshop)

Can't hurt to keep an eye on the Digital Humanities Association of Southern Africa (DHASA) conferences as well (the next one coming up in November): https://dh2025.digitalhumanities.org.za/

#MultilingualDH #AfricaNLP
