Mastodawn

scorinaldi Jul 15, 2023

People continue to be upset about the idea that public data on the internet is crawled to train AIs.

This time people are upset with the Brave browser for crawling data from sites like Wikipedia, then charging companies for API access to that data, which those companies then use to train AI.

This is how the web has worked since the dawn of the internet. What’s new is that OpenAI and ChatGPT called attention to it.

https://stackdiary.com/brave-selling-copyrighted-data-for-ai-training/

The shady world of Brave selling copyrighted data for AI training

I'm fairly certain that I was not the only person in the world who thought to himself, "Did they just yoink the entire Internet and bundle it together into a

Stack Diary

Joyce Park Jul 15, 2023

@carnage4life If Googlebot crawled you, you at least knew you'd be getting something -- exposure -- in exchange for being in their corpus. What do you get from LLMs?

Josh Collinsworth Jul 15, 2023

@carnage4life This...is a bad take, IMO. True, people have long provided content for web platforms, but they've always done it in exchange for something, like SEO benefits, or growing their following on that platform.

"This is how the Internet works" would be a very dismissive explanation (and a very convenient one for digital colonists) even if it were true. It's not, though. Nobody ever worked land for free so a corporation could build a fence around it and charge admission.

Stephen Jul 15, 2023

@collinsworth @carnage4life I have to agree with Josh - there was an exchange of value. Training AI is one-way.

Sandip Bhattacharya ☮️ Jul 15, 2023

@carnage4life
To be fair, the web has worked this way, with legitimate web properties reusing others' content within the boundaries of copyright law - I have been a big fan of Creative Commons licenses for making this so accessible.

What is different this time is that every AI service provider is taking advantage of the current data opacity in the tech to conveniently skirt this common decency. That hurts more than making money off it.

Jonathan Jul 16, 2023

@carnage4life the big difference here is that the crawlers ended up as referrers back to the traffic source. Google, for example, would crawl The New York Times and then, when you searched, they’d surface The New York Times. It was arguably a symbiotic relationship. OpenAI crawls The New York Times and ChatGPT responds with data it learned there but doesn’t refer the user back to The New York Times. It’s now a one-sided relationship.

Jose Raya Jul 16, 2023

@carnage4life the web is about linking to other documents. What these LLMs do is actually the opposite: they provide no links to the sources, so there is no web anymore.