The next talk is a bit different--Cedric Lothritz will talk to us about "Exploring Data Augmentation and Transfer Learning Techniques to Create Language Models for Luxembourgish" at #CLIDA2024

The motivation is to offer insights into how lower-resourced languages can benefit from related higher-resourced languages.

Lothritz: Language standardized in written form in 1984. Syntax and vocab very similar to German. About 600k Luxembourgish speakers vs. ~96 million German speakers. Until 3 years ago, no LMs or other NLP tools available for Luxembourgish (except incidentally through, e.g., inclusion in mBERT).
Lothritz: LMs require large amounts of data to train. Many low-resource languages belong to the same language family as more widespread languages (compare Luxembourgish : German to Romansh among the Romance languages, or Scottish Gaelic among the Celtic languages). The degree of mutual intelligibility varies.
Lothritz: used translation from German for data augmentation to train the first language model for Luxembourgish: LuxemBERT. This approach inspired by other work on Afrikaans and Dutch. Data augmentation resulted in what I would call "silver-standard" data -- partial translations which are not perfect but are good enough for the model to learn from.
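To make the augmentation idea concrete, here is a minimal sketch of dictionary-based partial translation. The word pairs and the `partial_translate` helper are illustrative assumptions on my part, not the actual LuxemBERT pipeline, which presumably used far larger lexical resources and handled morphology and word order.

```python
# Toy German -> Luxembourgish lexicon (illustrative only; the real
# augmentation pipeline is not reproduced here).
DE_LB = {
    "ich": "ech",
    "bin": "sinn",
    "und": "an",
    "das": "dat",
}

def partial_translate(sentence: str, lexicon: dict) -> str:
    """Replace every word found in the lexicon; keep the rest as-is.

    The result is "silver-standard" data: a mix of Luxembourgish and
    leftover German tokens, imperfect but usable for pretraining.
    """
    tokens = sentence.lower().split()
    return " ".join(lexicon.get(tok, tok) for tok in tokens)

print(partial_translate("Ich bin hier und das ist gut", DE_LB))
# -> "ech sinn hier an dat ist gut"
```

Even this crude word-level substitution shows why the output is only partially translated: any token missing from the lexicon stays German.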
Lothritz: comparing the resulting model to mBERT on 5 NLP tasks, they found that LuxemBERT generally performed slightly better.
Lothritz: LMs for related languages can have large overlap in their vocabularies. In this second study, we looked at finetuning mBERT and GottBERT on Luxembourgish data to compare to LuxemBERT.