The next talk is a bit different--Cedric Lothritz will talk to us about "Exploring Data Augmentation and Transfer Learning Techniques to Create Language Models for Luxembourgish" at #CLIDA2024

The motivation is to offer insights into how lower-resourced languages can benefit from related higher-resourced languages.

Lothritz: Language standardized in written form in 1984. Syntax and vocab very similar to German. About 600k Luxembourgish speakers vs. ~96 million German speakers. Until 3 years ago, no LMs or other NLP tools available for Luxembourgish (except incidentally through, e.g., inclusion in mBERT).
Lothritz: LMs require large amounts of data to train. Many low-resource languages belong to the same language family as more widespread languages (compare Luxembourgish : German to Romansh among the Romance languages, or Scottish Gaelic among the Celtic languages). The degree of mutual intelligibility varies.
Lothritz: used translation from German for data augmentation to train the first language model for Luxembourgish: LuxemBERT. This approach inspired by other work on Afrikaans and Dutch. Data augmentation resulted in what I would call "silver-standard" data -- partial translations which are not perfect but are good enough for the model to learn from.
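To make the augmentation idea concrete, here is a minimal sketch of dictionary-based partial translation. The word pairs and the `partial_translate` helper are illustrative assumptions on my part, not the actual LuxemBERT pipeline, which presumably used far larger lexical resources and handled morphology and word order.

```python
# Toy German -> Luxembourgish lexicon (illustrative only; the real
# augmentation pipeline is not reproduced here).
DE_LB = {
    "ich": "ech",
    "bin": "sinn",
    "und": "an",
    "das": "dat",
}

def partial_translate(sentence: str, lexicon: dict) -> str:
    """Replace every word found in the lexicon; keep the rest as-is.

    The result is "silver-standard" data: a mix of Luxembourgish and
    leftover German tokens, imperfect but usable for pretraining.
    """
    tokens = sentence.lower().split()
    return " ".join(lexicon.get(tok, tok) for tok in tokens)

print(partial_translate("Ich bin hier und das ist gut", DE_LB))
# -> "ech sinn hier an dat ist gut"
```

Even this crude word-level substitution shows why the output is only partially translated: any token missing from the lexicon stays German.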
Lothritz: comparing the resulting model to mBERT on 5 NLP tasks, they found that LuxemBERT generally performed slightly better.
Lothritz: LMs for related languages can have large overlap in their vocabularies. In this second study, we looked at finetuning mBERT and GottBERT on Luxembourgish data to compare to LuxemBERT.