We at BR Data investigated the largest freely available training dataset for generative image models from #laion
The story, covering privacy, copyright and consent: https://interaktiv.br.de/ki-trainingsdaten/en/index.html

A more technical thread 🧵 on how to tackle those huge quantities of data:

#ai #stablediffusion #trainingdata

@BR24

We Are All Raw Material for AI

Training data for artificial intelligence includes enormous amounts of images and text gathered from millions of websites. An analysis of LAION datasets (used to train Stable Diffusion) by public broadcaster BR shows that they frequently contain sensitive and private data – usually without the knowledge of those concerned.

BR
Even though the dataset does not contain the images themselves, but only links, captions and more than a dozen other metadata columns, that is still a lot of data: the compressed parquet files of Laion5B stored at @huggingface alone need considerably more than 1 TB of disk space.

We were interested in the images with German language captions, so Laion2B-multi-md5 was our starting point: https://huggingface.co/datasets/laion/laion2B-multi-md5

Since that still meant hundreds of GB of parquet files, we decided to use #duckdb to query and filter the data on the fly. Very convenient!


The json_extract_string() function in #duckdb was especially handy for extracting the EXIF metadata.
At some point, though, you have to browse the dataset yourself and get an idea of what you're dealing with. For this, we followed @waxy's and @simon's approach, which had given us a look at another subset of LAION5B a few months ago:
https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/

In our @datasette instance we had multiple tables: references to images with geodata, images with (email) addresses, etc. Or images with faces detected by another model (https://huggingface.co/datasets/FacePerceiver/laion-face), which did not work convincingly: either the model is crap or #laion's sample_id...

More about the project:
🇺🇸/🇬🇧 English text: https://interaktiv.br.de/ki-trainingsdaten/en/index.html
🇩🇪 German text: https://interaktiv.br.de/ki-trainingsdaten/
🇩🇪🎧 German podcast: @harlan_elisa talks about her face being part of the dataset: https://ardaudiothek.de/episode/11km-der-tagesschau-podcast/ploetzlich-im-datensatz-wenn-die-ki-mit-dir-trainiert/tagesschau/94587872/

🇩🇪 @tagesschau https://www.tagesschau.de/wissen/technologie/ki-trainingsdaten-privat-datenschutz-100.html
