🎓 Open PhD Position: Speech Tech for Small Languages

Working on Frisian-Dutch bilingual speech + AI at Fryske Akademy/Campus Fryslân. Fully funded, 4 years, starts Sept 2025.

More info ⬇️
https://voicetechnology.substack.com/publish/posts/detail/157245937/share-center

#SpeechTech #PhD #LowResourceLanguages #AcademicJobs


The other DICE contribution at #COLING2025 comes from Nikit, who presented "LOLA - An Open-Source Massively Multilingual Large Language Model" by Nikit Srivastava, Denis Kuchelev, Tatiana Moteu Ngoli, Kshitij Shetty, Michael Röder, @hamadazahera, Diego Moussallem & Axel Ngonga. 🤩 👏

👉 Want to find out more? Find the paper here: https://arxiv.org/abs/2409.11272

#DICEontour #LowResourceLanguages

LOLA -- An Open-Source Massively Multilingual Large Language Model

This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.

arXiv.org

Last year #Coda published an article by Avi Ackermann on the advantages and disadvantages of training language models for low-resource languages. From the use of these models to censor and oppress minorities to the preservation of nearly extinct languages, Avi Ackermann raises important issues for everyone interested in #multilingualDH and #LowResourceLanguages:

https://www.codastory.com/authoritarian-tech/artificial-intelligence-minority-language-censorship/

When AI doesn’t speak your language

Better tech could do a lot of good for minority language speakers — but it could also make them easier to surveil

Coda Story

You might have heard me claim that most #NLG is #LowResource (not just #NaturalLanguageGeneration for #LowResourceLanguages). If you want to hear me explain a bit more, my talk from last year's #GEM workshop at #EMNLP2022 is now up online: https://underline.io/lecture/66771-most-nlg-is-low-resource-here-s-what-we-can-do-about-it

Most NLG is Low-Resource: here's what we can do about it

On-demand video platform giving you access to lectures from conferences worldwide.

Underline.io

Very biased, but also very excited about Khyathi Chandu's presentation of our newly proposed shared task at #INLG2023: "LowReCorp: The Low-Resource NLG Corpus Building Challenge"

Join the #SharedTask during the coming year if you want to use our UI or task design to collect #NLG data for #LowResourceLanguages!

#DialogueSummarization #QuestionAnswering #ResponseGeneration

Now Liam Cripwell presents an overview of the #WebNLG2023 challenge. The challenge began in 2017 with RDF->English generation, in 2020 added Russian and a #SemanticParsing track, and this year focused on #LowResourceLanguages looking at #Breton, #Welsh, #Irish, #Maltese, and #Russian.

#MMNLG #SIGDIALxINLG2023

We can't wait to welcome you back for our first #ai4lam community call after summer, taking place on Tuesday 19 September at 15:00 UTC (8AM California | 11AM Washington DC | 16:00 UK | 17:00 Oslo & Paris).

We will be joined by speakers Chahan Vidal-Gorene and Konstantinos Chatzitheodorou, discussing how they use #AI & #ML in different ways to tackle the issues in working with #lowresourcelanguages.

Full details and joining link for the call can be found in the agenda:
https://docs.google.com/document/d/1FP17IFCQstczi_zaIK2vpccvE7ETcoXvOO2VkKVaYwQ/edit?usp=sharing

2023-09-19 ai4lam Community Call

ai4lam Community Call September 19, 2023 8 AM California | 11 AM Washington DC | 16:00 UK | 17:00 Oslo & Paris | 01:00 (+1) Sydney Connection Information: https://stanford.zoom.us/j/95460091625?pwd=UU1mbSs1RmVkaW1mSmpBYUNhS2hHUT09&from=addon Telephone: +1 650 724 9799 (US, Canada, Caribbean To...

Google Docs

The seventh Conference on Machine Translation, #WMT22, has a "Shared Task": Large-Scale Machine Translation Evaluation for [24] African Languages https://statmt.org/wmt22/large-scale-multilingual-translation-task.html

"We do so by introducing a high quality benchmark, paired with a fair and rigorous evaluation procedure."

#MachineTranslation #AfricanLanguages #LowResourceLanguages

Large-Scale Machine Translation Evaluation for African Languages