This visual deep dive into one of the largest AI language datasets is nonstop fascinating, jaw-dropping, and troubling, and anyone who is remotely interested in how LLMs really work, their biases, or intellectual property should read it. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post
"Content without consent" is a concern that I could see catching on as more people gradually realize the content they've published and posted over the years is being secretly used to train for-profit AI models. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
If your AI chatbot is spouting some disturbing views, it could be because the websites that contributed the most language tokens to its training dataset include the likes of RT, Breitbart and VDare.
If you're asking an AI chatbot questions about religion, you probably shouldn't expect the perspectives of non-Christian faiths to be well-represented, based on this analysis of what sites make up Google's massive C4 dataset. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
Is your website / your favorite website / your least favorite website being scraped to train tech giants' AI models? You might be surprised. This story has a handy search tool you can use to see if a given domain is included in one of the largest datasets. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
Why aren't the big social networks up in arms about rival tech giants scraping their content to train AI models? Maybe because they don't allow it--and they may be keeping it partly to train their own models. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
@willoremus Let's be clear, though: scraping a social network or training AIs on it is not scraping/using "their" content, it's using *our* content.

@ricci @willoremus Not according to their EULAs.

The biggest shame about this stuff is that people are freaking out about corporate access to their content when the technology in question could actually be useful, and when this sort of scraping and automatic appropriation has been in place since social media became the norm on the Internet.

Our current IP regulation is broken, but not because of this.