#MOSEL: Multilingual Open-Source European Languages Dataset
• 950,000 hours of #speech data covering the 24 official EU languages
• Includes up to 441K hours of unlabeled speech from #VoxPopuli and #LibriLight
• Unlabeled speech transcribed using the #Whisper large v3 #ASR model
• Covers both labeled and unlabeled #speechcorpora
• Released under the #CCBY40 license for #opensource use
• Designed for training #AI #speechrecognition models
Key features:
• Diverse language coverage
• Large-scale dataset
• Open-source compliant
• Includes pseudo-labeled data
• Supports #NLP and #machinelearning research
Learn more: https://huggingface.co/datasets/FBK-MT/mosel?s=09