Elevate Your Voice Recognition with Global Speech Data! 🎙️🌐
Is your AI struggling to understand diverse accents or complex background noise? High-quality Speech Data Collection is the foundation of any successful ASR or NLP model.

🔗 Learn more at: https://gts.ai/services/speech-data-collection/

#GTS #GloboseTechnology #SpeechData #AIData #MachineLearning #ASR #NLP #VoiceRecognition #DataCollection #Innovation #AITrainingData

For the past couple of years, as each new @mozilla #CommonVoice dataset of #voice #data is released, I've been using @observablehq to visualise the #metadata coverage across the 100+ languages in the dataset.

Version 17 was released yesterday (big ups to the team - EM Lewis-Jong, @jessie, Gina Moape, Dmitrij Feller) and there are some super interesting insights from the visualisation:

➡ Catalan (ca) now has more data in Common Voice than English (en) (!)

➡ The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). Perhaps Icelandic words are longer? I suspect so!

➡ Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).

➡ Votic (vot) has the highest percentage of invalidated utterances, at 76%. That is high enough that I wonder whether this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences that are deliberately invalid), given the current geopolitical instability in Russia.
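
The validated and invalidated percentages above are ratios of clip counts within each language. Here's a minimal Python sketch of that arithmetic, assuming each locale has counts for validated, invalidated and not-yet-reviewed ("other") clips. The dictionary layout, the function names, the placeholder locale names and every number below are made up for illustration; none of them are v17 figures.

```python
# Hypothetical per-locale clip counts; NOT real Common Voice v17 figures.
SAMPLE_BUCKETS = {
    "loc_a": {"validated": 900, "invalidated": 50, "other": 50},
    "loc_b": {"validated": 200, "invalidated": 100, "other": 700},
    "loc_c": {"validated": 20, "invalidated": 76, "other": 4},
}


def bucket_percentages(buckets):
    """Return each bucket's share of a locale's clips as a percentage."""
    total = sum(buckets.values())
    if total == 0:
        # A locale with no clips has no meaningful percentages.
        return {name: 0.0 for name in buckets}
    return {name: 100.0 * count / total for name, count in buckets.items()}


def rank_locales(data, bucket):
    """Sort locale codes by the share of clips in `bucket`, highest first."""
    shares = {loc: bucket_percentages(b)[bucket] for loc, b in data.items()}
    return sorted(shares, key=shares.get, reverse=True)


if __name__ == "__main__":
    for locale in rank_locales(SAMPLE_BUCKETS, "invalidated"):
        pct = bucket_percentages(SAMPLE_BUCKETS[locale])
        print(f"{locale}: {pct['invalidated']:.1f}% invalidated, "
              f"{pct['validated']:.1f}% validated")
```

Ranking by share rather than by raw count is what lets a very small language top the list: a few dozen invalidated clips are enough if the language only has around a hundred clips in total.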

See the visualisation here and let me know your thoughts below!

https://observablehq.com/@kathyreid/mozilla-common-voice-v17-dataset-metadata-coverage

#linguistics #languages #data #VoiceAI #VoiceData #SpeechAI #SpeechData #DataViz

Mozilla Common Voice v17 dataset metadata coverage

This visualisation uses "@d3/stacked-horizontal-bar-chart" to visualise the Common Voice metadata coverage. The original data is taken from the Common Voice `cv-dataset` repository.

Table of contents

➡ Splits by age range - shows how many clips have been provided by speakers of different age ranges for each locale (language)

➡ Splits by age range scaled to 100% - as above, but scaled to 100% so that the metadata coverage of low resource languages is more visible

➡ Splits by gender - shows how many cl
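
To see why the "scaled to 100%" view helps low resource languages, here's a rough Python sketch that draws each locale as a fixed-width text bar, so a locale with a hundred clips takes up the same width as one with a million. This is not the notebook's code (that uses D3 in Observable); the locale names, age-range labels, symbols and counts are all invented for illustration.

```python
# Hypothetical clip counts per age range; NOT real Common Voice figures.
AGE_SPLITS = {
    "big_locale": {"twenties": 600_000, "thirties": 300_000, "unknown": 100_000},
    "small_locale": {"twenties": 30, "thirties": 10, "unknown": 60},
}

# One character per age range, used to draw the bar segments.
SYMBOLS = {"twenties": "2", "thirties": "3", "unknown": "?"}


def stacked_bar(counts, symbols, width=40):
    """Draw one locale as a text bar scaled so all segments fill `width`."""
    total = sum(counts.values())
    if total == 0:
        return " " * width
    bar, cumulative, drawn = "", 0, 0
    for label, n in counts.items():
        cumulative += n
        # Round the cumulative edge so segment widths always sum to `width`.
        end = round(width * cumulative / total)
        bar += symbols[label] * (end - drawn)
        drawn = end
    return bar


if __name__ == "__main__":
    for locale, counts in AGE_SPLITS.items():
        print(f"{locale:>12} |{stacked_bar(counts, SYMBOLS)}|")
```

On a raw-count axis the small locale's bar would be invisible next to the big one; scaled, you can see at a glance that most of its clips have no age metadata.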
