📣 Accepted to #AIES2025: What do the audio datasets powering generative audio models actually contain? (led by Willie Agnew)

Answer: Lots of old audio content that is mostly English, often biased, and of dubious copyright / permissioning status.

Paper: https://www.sauvik.me/papers/65/serve

Large audio models power a broad suite of new applications: they can continue unfinished audio, clone voices, provide an expressive range of text-to-speech voices, and even create entire songs from simple text prompts. But what are they trained on?

ML models are only as good as the data they are trained on, and there is understandably a lot of concern around how the data powering these models is sourced.

Through a broad review of recent generative audio papers, we identified the most commonly used datasets and audited them.

Our audit was broad: we included sound, voice, and music. We examined content, audio quality, language representation, toxicity, bias, and licensing adherence. Lots to unpack, but three key findings stand out:
1) While a lot of the data may be copyrighted, some datasets sidestep copyright issues by comprising mostly "old" audio, e.g., sentences read from old newspapers and books that are now in the public domain.
2) Most datasets pay little attention to representation, the notable exception being Mozilla Common Voice. So, unsurprisingly, most audio data is in English, and there is little attempt to ensure vocal representation from a broad set of individuals.
3) Finally, there is...very, very little documentation associated with these datasets, which made this audit much harder than it needed to be. To help improve documentation practices, we extended Datasheets for Datasets with audio-specific questions.