If it were up to me, Apple would create the world’s first “Do Not Train” registry. Enter your domain or web page, get a TXT record or meta tag to prove ownership, prevent Apple from using your content to train any LLM.

Only a small percentage of sites would do it, I think, so the training impact would be low, but it’d be extremely meaningful for those people. And it’d be a valuable PR tool, brand-booster, and competitor-shamer. (Bonus: make it open and encourage competitors to follow it.)

@cabel Even better: distinguish between transformative learning (for transcriptions and grammar and such) and generative learning (bullshit machines). I am ok-ish for my content to be used for the former but vehemently opposed to the latter.

@cabel (a) why would this be different than then saying they’d honor a robots.txt?

(b) this sounds like a do not track thing, but without even the attempts at legislation around it (which themselves failed, and using the header at all is pointless)

@jason Somewhat cynically, with my Apple hat on: much stronger PR. “Apple follows robots.txt” is a weak headline, and robots.txt is notoriously loosely and optionally followed, so trust is very low. Apple is all about building trust, but that can’t be built on a shaky platform. New initiatives get attention.
@cabel @jason I like the sentiment, but functionally it seems no different? The largest “AI” companies already have ways of blocking their crawlers via robots.txt and as you say this would also be “opt in” and unenforceable. If any company were to do this, I would critique it as an incredibly empty gesture.
@hank @jason I think “robots.txt” has a very bad reputation for being voluntary and often outright ignored. It’s almost not even worth doing one because there’s zero accountability. It’s time to start over. The idea here is to increase trust and accountability. But it would also be quintessentially Apple to reinvent something that has long existed but making it a little bit better and nicer. Look i’m just wearing my Apple pretend executive hat here! 😛

@cabel As somebody who runs a service that does a lot of user-driven web crawling for archival purposes, I'm familiar with exactly how loose it all is...

I just don't think an opt-in list is the way to create this accountability. I would prefer it be settled with case-law that says "if you train on copyright protected work you'll get mega-sued" but large companies are trying to get ahead of that by making cash deals with anyone who can afford lawyers right now. :(

@cabel For the same reason, I'm mad that the EU didn't legislate "do not track" requests (a similarly honour-based system) into law, instead we have a million consent popups to individually agree to or a button that says "do whatever"

@cabel @hank right, this would entirely benefit Apple’s brand and require extra work for one specific vendor.

Frankly if they did this and opened it up to third party AI providers, I would be extremely tempted to sign up just to wantonly violate it and drag down the reputation it might carry.

Yes, I’m that petty, but maybe the only computer company I’ve ever used (save one gaming PC) in nearly four decades should stop acting in ways that have me looking for the exits.

@cabel @hank and yes I think archive.org should disregard robots whenever they want no I will not be taking questions 😬
@jason We also make web archiving software and have a similar stance 🙃
@cabel @jason I hate to be a jerk here, but it sounds like your primary concern is Apple's PR look, and that respecting people's rights is a side effect.
@tankgrrl @jason why not both dot jpeg
@cabel @tankgrrl @jason FWIW I took it as “Apple should do this for good reasons and in case anybody needs convincing: good PR.”
@danielpunkass @cabel @jason I'd still prefer a PR look like "AI company refuses to steal peoples' content without their permission" over "we won't steal your stuff if you tell us not to and sign up for our registry"
@danielpunkass @cabel @jason On a related note, but not part of this question: none of what these companies are doing qualifies as fair use, yet no one seems to want to argue for the content creators' rights. Everyone just throws up their hands like "Well, what are you going to do? You shouldn't have put it on the internet."

@cabel surely every single enterprise and corporation would immediately register in it given the current temperature? They would, collectively, remove most of the internet from the training model. Every lawyer & biz dev would be making it happen overnight.

This feels like something that has to happen *after* (a) contracts for licensing ‘big’ content have been signed and (b) business has collectively accepted that training LLMs on their dross isn’t actually impacting their business.

@cabel This project https://darkvisitors.com/ is part of the way there, as far as generating a robots.txt to try and disallow certain actors from training on your site. But something more robust along those lines would be nicer.
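For reference, the existing opt-out mechanism the thread keeps circling back to looks like this: a robots.txt listing the self-declared user agents of AI crawlers. GPTBot, CCBot, and Google-Extended are real, documented agent names, but the list here is illustrative, not exhaustive, and compliance is entirely voluntary:

```text
# robots.txt — voluntary opt-out from some AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

This is exactly the weakness the registry idea is reacting to: nothing enforces it, and new or misbehaving crawlers simply aren't on the list.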

@cabel They published the datasets used: “Our pre-training dataset contains RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens” here: https://huggingface.co/apple/OpenELM It would need to be more of a supplier contract requirement.
Supposedly you can search the datasets as well: https://huggingface.co/datasets/tiiuae/falcon-refinedweb

But yes I like the registry idea as well. With all the licensing going on, sites could provide details on licensing as well.


@cabel
even better: they would scrape your data and use it in adversarial training, to make sure the AI never spews out your data, even if it was found copied elsewhere.
@cabel
That still assumes your information and creations being used for training data should be opt-out, when it needs to be opt-in.
@phi1997 I agree with this 100% philosophically, while also understanding that an entire class of interesting tools would simply vaporize if this was all opt-in — none of it would exist; 0.0001% would do it. On one hand, maybe that says something — none of it should probably exist if that’s the case. On the other hand, some (some! Not all) of it is interesting and will potentially really help people, and the genie is out. I struggle with this back-and-forth basically every day.
@cabel
Yeah, they shouldn't exist, and considering how expensive these tools are to keep running due to how much energy they use, I think it's only a matter of time before this genie is rebottled.
@phi1997 I think there’s too much money — and dare I say it, “shareholder value” barf — at stake to go back now, and capitalism will ensure it’ll absolutely never roll back — it’s all here to stay. So the realistic question becomes: how to do the best we can with where we find ourselves now. I don’t like it any more than you do, but here we are. 😌
@cabel
There was a ton of money at stake with the blockchain, and it has almost entirely been abandoned. The sunk cost fallacy will sink some companies, while others will ditch it to cut costs. Don't assume it is inevitable or you'll feed into the narrative pushed by the people marketing LLMs.
@phi1997 Nah, blockchain was always fringe — weirdos and rogues. Apple never bet on blockchain, nor did Microsoft. Put it this way: both blockchain and LLMs rely on good GPUs, but which one of them just propelled Nvidia to be the largest company in the world? It’s over 😌

@cabel
That's revisionist history. Blockchain used to be everywhere. Nobody wants to admit they were building around it. Also, Microsoft did bet on blockchain, they just aimed at businesses more than consumers, offering support for it in Azure. I recall that Apple talked about it too.

Right now, we are somewhere around the peak of the AI bubble. LLM innovation has slowed and venture capital money won't last forever. I expect enshittification to arrive soon for AI and for the bubble to pop.

@phi1997 @cabel IBM bet on blockchain and Amazon went in as well (AMB, which according to Amazon is used by a number of large corps and 25% of worldwide Ethereum loads). Apple filed patents around blockchain, so they definitely thought about it.
@cabel I have to admit, I don’t understand this whole thing. As much as I dislike the current AI craze, if it’s going to exist, I want it to learn the facts I have on my website so they can be presented to people who use it, and I don’t get why anyone wouldn’t want this from their website too.
@cabel Great idea! They already have all the DNS record workflows for iCloud BYOD.
@cabel Should perhaps be a law instead?
@cabel you just reinvented robots.txt and made it centralized
@cabel I saw an article the other day where someone found one of the "AI" companies ignored robots.txt, faked a standard browser UA (e.g. no "bot" indicator), and used IP masking so they couldn't be filtered. Given the entire industry is built on blatant IP theft I have no expectation any of them would follow it without legislative penalties (and even then if caught I'm sure they'd have a blog saying "whoops we had a minor bug, we've fixed it now")

@cabel at least second; ServiceNow and HuggingFace have been doing it for at least a year: https://github.com/bigcode-project/opt-out-v2

https://spawning.io have also been advocating for an EU-based centralized opt-out registry (since EU law requires opt-outs starting in less than a year, but doesn't require a central registry) but not sure how far along their proof-of-concept is.

@cabel Sure. I like that. Too bad about robots.txt not cutting it tho. I would prefer opting out in that file to be enough.
@cabel I’ve been thinking about this recently, except from the perspective of an individual content license (i.e. Creative Commons) that prohibits uses for AI model training.