🌍 #MOSEL: Multilingual Open-Source European Languages Dataset

• 📊 950,000 hours of #speech data covering 24 official EU languages
• 🎙️ Includes up to 441K hours of unlabeled speech from #VoxPopuli and #LibriLight
• 🤖 Transcribed using #Whisper large v3 #ASR model
• 🏷️ Covers both labeled and unlabeled #speechcorpora
• 📜 Released under #CCBY40 license for #opensource use
• 🧠 Designed for training #AI #speechrecognition models

Key features:
• Diverse language coverage
• Large-scale dataset
• Open-source compliant
• Includes pseudo-labeled data
• Supports #NLP and #machinelearning research

Learn more: https://huggingface.co/datasets/FBK-MT/mosel?s=09
