A fun, interactive visualization of Google's C4 (Colossal Clean Crawled Corpus) dataset, which is heavily used in LLM training.
The data is broken down by category, with the top sites listed for each category.
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/


