Each quarter, when the new @mozilla #CommonVoice #dataset is released, I do a #dataviz using @observablehq of its #metadata coverage, across all 100+ languages, based on the JSON summary that is part of the release.

Some of my observations from the v18 release are:

💡 #Catalan (ca) now has a larger dataset than English, based on the number of audio recordings (including validated and yet-to-be-validated recordings). It’s also an interesting dataset because the number of recordings per unique contributor is relatively low (around 80). This means it’s likely to have a high diversity of speakers in the dataset, which is useful for building #ASR models that generalise well to many speakers.

Catalan also appears to have the highest percentage of audio recordings by older speakers - e.g. speakers in their forties, fifties and older. Again, this highlights the diversity of speakers in the Catalan dataset.
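As a rough sketch of how figures like "recordings per contributor" and "share of older speakers" could be derived from the release's JSON summary: the snippet below assumes each locale entry carries a `clips` count, a `users` count, and a `splits.age` mapping of age band to fraction of clips. That layout, the helper names, and the sample numbers are illustrative assumptions, not values from the v18 release.

```python
from typing import Any, Dict

# Made-up placeholder numbers, shaped like the per-locale entries assumed above.
# "aa" and "bb" are hypothetical locale codes, not real Common Voice languages.
SAMPLE_STATS: Dict[str, Any] = {
    "locales": {
        "aa": {
            "clips": 8000,
            "users": 100,
            "splits": {"age": {"twenties": 0.5, "forties": 0.2, "fifties": 0.1, "": 0.2}},
        },
        "bb": {
            "clips": 9000,
            "users": 10,
            "splits": {"age": {"twenties": 0.7, "forties": 0.05, "": 0.25}},
        },
    }
}

# Age bands counted as "older speakers" here (forties and above); an assumption.
OLDER_BANDS = ("forties", "fifties", "sixties", "seventies", "eighties", "nineties")


def clips_per_contributor(locale: Dict[str, Any]) -> float:
    """Average number of clips per unique contributor (0.0 if no contributors)."""
    users = locale.get("users", 0)
    return locale["clips"] / users if users else 0.0


def older_speaker_share(locale: Dict[str, Any]) -> float:
    """Fraction of clips tagged with an age band of forties or older."""
    ages = locale.get("splits", {}).get("age", {})
    return sum(ages.get(band, 0.0) for band in OLDER_BANDS)


def summarise(stats: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """Compute both figures for every locale in the summary."""
    return {
        code: {
            "clips_per_contributor": clips_per_contributor(locale),
            "older_share": older_speaker_share(locale),
        }
        for code, locale in stats["locales"].items()
    }


if __name__ == "__main__":
    for code, figures in summarise(SAMPLE_STATS).items():
        print(code, figures)
```

A low clips-per-contributor value alongside a high older-speaker share is the pattern described for Catalan above; a very high clips-per-contributor value is the pattern that marks a locale with few contributors.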

💡 Although it’s very early to see any trends from the decision by Common Voice to expand the range of options for gender identity, we are starting to see some data being tagged with the new options that are available. For example, in #Uyghur (ug), we now have data tagged as “do not wish to say”. Without more evidence, I don’t want to draw connections between the geopolitical situation in that area and the desire of data contributors not to provide demographic data that may in some way identify them. Still, I think it’s telling that the first use of these expanded metadata categories appears in a language that is spoken in a contested geography.

💡 Similarly, it’s very early to identify trends in sentence domain classification, as most of the sentences that do have a domain tag are labelled “general”, although “health_care” sentences are occurring frequently in languages such as #Albanian (sq).

💡 #Bangla (Bengali) (bn) continues to have a very large number of yet-to-be-validated audio recordings. Because only validated recordings make it into the splits, the train split for Bangla is quite small.

💡 #Dholuo (luo), a language spoken in Kenya and Tanzania, is an outlier in terms of the number of distinct data contributors to the dataset - this language has a very high average number of contributions per contributor. This is often seen in languages that are new to Common Voice, before they have been able to recruit more contributors. Dholuo has nearly 5 million speakers.

💡 The language with the highest average utterance duration is by far #Icelandic (is) at over 7 seconds. This may be because Icelandic has many words with several syllables, which take longer to pronounce. Consider "the cat sat on the mat" in English, cf "kötturinn sat á mottunni" in Icelandic.

Big thanks to all data contributors in this release for your donated utterances, and to Dmitrij Feller, @jessie, Gina Moape, EM Lewis-Jong and the team for all your efforts.

What are your thoughts? What conclusions do you draw?

https://observablehq.com/@kathyreid/mozilla-common-voice-v18-dataset-metadata-coverage

Mozilla Common Voice v18 dataset metadata coverage

This visualisation uses "@d3/stacked-horizontal-bar-chart" to visualise the Common Voice metadata coverage. The original data is taken from the Common Voice `cv-dataset` repository.

Table of contents

Splits by age range - shows how many clips have been provided by speakers of different age ranges for each locale (language)
Splits by age range scaled to 100% - as above, but scaled to 100% so that the metadata coverage of low resource languages is more visible
Splits by gender - shows how many cl

https://archive.org/details/nyangi-gi-otis

Nyangi gi Otis by Asenath Bole Odaga

Topics
#Dholuo, #kitabu, #sigendini, #sigendiniLuo

Kisumu : Lake Publishers & Enterprises Ltd.

https://archive.org/details/kisera

Kisera by Asenath Bole Odaga

Topics
#Dholuo, #kitabu, #kitepe, #buk, #buge

"Kisera en kitabu ma wuoyo kuom nyako midendoni Limbe gi ngimane chakre ka ne en nyathi koda kuom ji mamoko mathoth. Omiyo kitabuni chalo piny, opong' gi ji. Oting'o Achienge gi Selina min koda Kala wuon mare gi nyithindgi te. Kendo oting'o joma osomo man gi barupe mabeyo, to onge tich. Kanyo bende ema iyude Achwaka gi wuode Olalna e lum, jamoko tho gotieno. Yawa, kuom adiera, kitabuni mit kendo kichako some to ok idwar kete piny nyaka itieke. Ondikre achana ndi."

https://archive.org/details/luo-sayings

Luo Proverbs and Sayings by Asenath Bole Odaga

Topics
#Dholuo, #Ngeche, #NgecheLuo, #sayings, #proverbs, #tonguetwisters, #riddles, #wechemawachotek

"The Luo use Ngeche and other sayings to demonstrate their knowledge and skill in expressing themselves in their language. Ngeche and sayings are adaptable and have many functions. For instance, they may be used to chide someone, to answer a question, to illustrate, clarify or to drive home a point made on an issue. They are particularly valuable in the education of the youth by adults on the use of language, that is how to apply them during a conversation. Ngeche in sayings, proverbs, tongue twisters and even some narratives.

Ngeche Luo has been authored by Asenath Bole Odaga. Bole who writes in her mother tongue-Luo, as well as in English, has written over fifty books for adults, children, and general readership. Her latest publications are Dholuo-English Dictionary and Nyangi gi Otis."

https://archive.org/details/lebd2

Luo-English Biological Dictionary, Second Edition by John O. Kokwaro; Timothy Johns

Topics
#Dholuo, #biologicaldictionary, #biologicaldictionaries, #biology, #ecology, #zoology, #botany, #lakevictoria, #NamLolwe, #ethnobotany, #EastAfrica, #AfricanGreatLakes, #ecology, #Luoland, #Kavirondo, #Joluo, #Uganda, #Kenya, #Tanzania, #piny, #ngima, #ngeyo

"This Second Edition of the Luo-English Biological Dictionary contains an extensive coverage of the flora and fauna of the Lake Victoria region of East Africa. The region is mainly occupied by the Luo community. It comprises the Luo ethnosystematics and ethnobiological account including indigenous foods, traditional medicines, ritual and other cultural uses of plants. The dictionary is a result of over 20 years of research carried out by the authors."
