This visual deep dive into one of the largest AI language datasets is by turns fascinating, jaw-dropping, and troubling, and anyone who is remotely interested in how LLMs really work, their biases, or intellectual property should read it. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post
"Content without consent" is a concern that I could see catching on as more people gradually realize the content they've published and posted over the years is being secretly used to train for-profit AI models. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
If your AI chatbot is spouting some disturbing views, it could be because the websites that contributed the most language tokens to its training dataset include the likes of RT, Breitbart, and VDare.
If you're asking an AI chatbot questions about religion, you probably shouldn't expect the perspectives of non-Christian faiths to be well-represented, based on this analysis of what sites make up Google's massive C4 dataset. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
Is your website / your favorite website / your least favorite website being scraped to train tech giants' AI models? You might be surprised. This story has a handy search tool you can use to see if a given domain is included in one of the largest datasets. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
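The core of a domain-lookup tool like the one in the story can be sketched in a few lines. This is a minimal sketch, assuming the dataset's URLs are available as an iterable of strings; the function names and sample URLs are hypothetical illustrations, not the Post's actual implementation, and the real C4 index spans millions of domains.

```python
from urllib.parse import urlparse

def normalize_host(url: str) -> str:
    """Extract the hostname from a URL, dropping a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host.removeprefix("www.")

def domain_in_dataset(domain: str, dataset_urls) -> bool:
    """Return True if any dataset URL's host matches the given domain
    exactly or as a subdomain (hypothetical helper for illustration)."""
    target = domain.lower().removeprefix("www.")
    for url in dataset_urls:
        host = normalize_host(url)
        if host == target or host.endswith("." + target):
            return True
    return False

# Tiny stand-in for the dataset's URL index (invented sample URLs).
sample_urls = [
    "https://www.wikipedia.org/wiki/Language_model",
    "https://blog.example.com/post/123",
]

print(domain_in_dataset("wikipedia.org", sample_urls))  # True
print(domain_in_dataset("nytimes.com", sample_urls))    # False
```

A production version would query a precomputed index of registered domains rather than scanning raw URLs, but the matching logic (exact host or subdomain) is the same idea.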
Why aren't the big social networks up in arms about rival tech giants scraping their content to train AI models? Maybe because they don't allow it--and they may be saving that content partly to train their own models. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
If you're concerned by what's in Google's colossal C4 dataset, keep in mind it's only a small fraction of the training data for today's AI chatbots--and OpenAI won't even tell us what it's using for ChatGPT and GPT-4.

So why aren't the big AI companies more transparent about what's in the data that they use to train their models?

One reason, experts say, is that they're afraid they'd get in trouble if people found out. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

Sorry for barraging your feed with this... tootstorm? I don't do it often. But I found this story by @kevinschaul, @nitashatiku, and Szu Yu Chen really eye-opening and valuable, and realized it might be a big ask for people to read the whole thing, so I wanted to highlight what I found to be the most interesting takeaways. Thanks for your patience.
@willoremus @kevinschaul @nitashatiku
They sold it as a boon to humanity while training it on content from #KiwiFarms. That kind of undercuts their claims of good will, #JustSayin
@GreenFire @willoremus @kevinschaul @nitashatiku content from pre-2020 Reddit, which is possibly worse

@willoremus @kevinschaul @nitashatiku It’s really very good; subject of repeated discussion in the ML track at the open source lawyer conference I’m at today.

The challenge I keep coming back to: this media scrutiny is critical to a functioning societal oversight of this new tech, but such scrutiny incentivizes other cos to stop disclosing their data sets. That’s a bad spiral to be in.

@willoremus Remember when Google scanned and digitized All the World’s Books (approx.) without asking authors for permission or offering any compensation? Many people thought that was just fine. They said it would only give authors more “visibility.” They said it was “fair use.”

I’m still bitter about it. This was one of Google’s principal motivations, and few understood that at the time.

@JamesGleick @willoremus
Because of Google's role in blocking #ClimateAction through promotion of climate science denial, I place it up there among our civilization's most evil corporations ever.
@willoremus note that various mastodon servers are included too

@willoremus

Copyright violations are *criminal*. They should go to jail.

#JailGoogle

@willoremus Let's be clear, though: scraping a social network, or training AIs on it, is not scraping/using "their" content, it's using *our* content.

@ricci @willoremus Not according to their EULAs.

The biggest shame about this stuff is that people are freaking out about corporate access to their content when the technology in question could actually be useful, and when this sort of scraping and automatic appropriation has been in place since social media became the norm on the Internet.

Our current IP regulation is broken, but not because of this.

@willoremus I ❤️ that Google uses #CommonCrawl and thereby the fruits of #ApacheTika and #ApacheNutch.