In big news overnight, #Anthropic have made a major change to their user data retention and training policy - giving customers until September 28th to opt out, or have their chats, code sessions and other artefacts used for training and retained for up to five years.

This is a major departure from their previous privacy-first stance.

But what's really behind this change? As Connie Loizos points out in this @Techcrunch article, it's all about the #data.

As I've spoken about recently, we've passed #PeakToken - the point in history where we have the maximum amount of authentic, human-generated data available. Now, the internet is polluted with synthetically-generated #AIslop. If you're an #AI company scraping the web for new data to train on, that's bad news, because you also scoop up the AI slop. If models are trained on AI slop, they're likely to suffer #ModelCollapse - each generation degrading like a photocopy of a photocopy.

Anthropic's play here is all about the #TokenCrisis - the voracious appetite for new, authentic, human-generated data to train on - part of a broader phenomenon I've termed the #TokenWars.

As new data becomes scarcer and more valuable, it will be more sought after and contested. We're still in the early days of the #TokenWars, and we should expect to see more moves like this to secure more data for AI training.

https://techcrunch.com/2025/08/28/anthropic-users-face-a-new-choice-opt-out-or-share-your-data-for-ai-training/

Anthropic users face a new choice – opt out or share your chats for AI training | TechCrunch

Anthropic is making some major changes to how it handles user data. Users have until September 28 to take action.

TechCrunch

I'll be chatting with the good folks at #3RRR #ByteIntoIt tomorrow evening 7.30pm-ish, all about the #TokenWars and the race for data to train #AI models - with Vanessa Toholka and @floreani and crew.

Looking forward to it!

https://www.rrr.org.au/explore/programs/byte-into-it/

Programs: Byte Into IT — Triple R 102.7FM, Melbourne Independent Radio

Local insights on tech news, with feature interviews and regular guests.

Here's a lovely piece by my #ANU #Cybernetics colleague, @theEllamo, which talks about my #TokenWars talk, and how it's related to concepts like #PeakToken and the value of human-generated #data as the internet becomes polluted by #AI-generated slop.

There's a video link here to the #TokenWars talk, if you haven't seen it already.

Thanks, Ella!

https://cybernetics.anu.edu.au/news/2025/06/19/token-wars/

Token Wars

PhD Researcher Kathy Reid (she/her) is an AI Voice researcher investigating speech technologies with a focus on the data that goes into these models. Kathy asks critical questions about these technologies, the people, and the voices they serve. Kathy’s motivation both personally and professionally is the value that knowledge is power and knowledge shared is empowerment. These underlying understandings are core to Kathy’s PhD work and also her keynote talk Token Wars.

With these values and a highly open-source background, it may come as a surprise to you that Kathy questions if everything should be open not just to anyone but to anything. The continuous scraping and pollution of the internet by AI companies looking to train their latest models is deeply challenging to the ‘everything open’ approach. Kathy and her research ask a lot of great questions about technology and power - who benefits from technologies and what are the costs? These cybernetics questions about unintended consequences underpin Kathy’s Token Wars, a talk that dives into the current technical, legal and political resource conflict surrounding AI training ‘tokens’ or data.

Kathy’s talk, as many great talks do, comes in three parts:

Part 1: Kathy gives us an accessible overview of tokens and transformers, the technologies that together build large language models like ChatGPT and Claude.
Part 2: Kathy unpacks the value of tokens, why they mean so much to AI companies, and what it means for these tokens to become a scarce resource. Here, Kathy also dives into the actions and intentions of the key actors in these token wars, as well as the damage they are causing.
Part 3: Kathy considers tokens and data as a form of treasure or capital – and asks how we might protect and safeguard this treasure. Kathy also speculates on the future of tokens and future protection strategies.

The Token Wars: why not all our content should be open

Token Wars was first delivered by Kathy Reid at Everything Open and the Melbourne Machine Learning and AI meetup. This version of Token Wars was delivered on Ngunnawal and Ngambri Country here at the Australian National University’s School of Cybernetics.

Current LLMs are trained on nearly all the publicly available data in the world – and globally we’re running out of new human-generated ‘tokens’ to train newer and better models on. Kathy holds that we’ve passed a point in history she terms “Peak Token” - where we have the highest availability of human-generated tokens. As LLMs and synthetic data proliferate, the open web is becoming increasingly filled with low-quality “AI slop”, ushering in the “slopocene” - where rich, diverse, human-generated data is rarer and more valuable.

LLM Models: Number of training tokens and parameter size by date. 2025-04-13 https://github.com/KathyReid/token-wars-dataviz.

Visit Kathy’s blog for her recent thoughts on the speculative OpenAI hardware device that may come about as a result of these Token Wars and find out more about Kathy’s research on our PhD Spotlight from earlier this year.

ANU School of Cybernetics

Great post from Will Sinatra on the assemblage of tools he is using to block bots and scrapers

https://lambdacreate.com/posts/68

#TokenWars

(lambda (x) (create x))

I recently had the opportunity to present at the Melbourne #ML and #AI Meetup on the topic of the #TokenWars - the resource conflict over data being harvested to train AI models like #LLMs - and the alateral damage this conflict is causing to the open web.

With a huge thanks to Jaime Blackwell you can now see the video here:

https://www.youtube.com/watch?v=C86Y3mXnsNI

Huge thanks to Lizzie Silver for all her behind the scenes work and to @jonoxer for making the connections.

Check out the Meetup at:

https://www.meetup.com/machine-learning-ai-meetup/

The Token Wars: Why Not All Our Content Should Be Open - Kathy Reid

YouTube

Opinion of the day:

The reason OpenAI wants a browser, or a social network, IMHO, is so they can have more training data - more tokens - for their models.

We are now in the Token Crisis - LLMs have been trained on nearly all the publicly available data in the world, and it's costing OpenAI millions to license more data.

It's cheaper to have that data, those tokens, produced for free by people who interact on social media or who use a browser. Data is driving these decisions.

#TokenWars

ICYMI: I'll be talking at the Melbourne #ML and #AI Meetup in a couple of weeks' time about the #TokenWars - the conflict over data to train LLMs and the fight by IP rights holders to protect their data from scrapers.

Come learn about how #LLMs are trained on huge volumes of tokens with transformers, why those tokens are becoming more economically valuable, and what you can do to protect your token treasure.

You'll never look at ChatGPT or data the same way again.

Huge thanks to @jonoxer for the recommend, and to Lizzie Silver for the behind the scenes wrangling.

https://www.meetup.com/machine-learning-ai-meetup/events/306548300

The Token Wars, Tue, Apr 15, 2025, 6:00 PM | Meetup

The MLAI Meetup is a community for AI researchers and professionals which hosts monthly talks on exciting research. Our format is: * 6:00 - 6:20: Socializing * 6:20 - 6:40

Meetup

If you weren't able to make @everythingopen in Adelaide in January but were still keen to catch my talk on the #TokenWars in #ML - the hunt for real, human data amidst a sea of AI-generated slop - then don't despair!

I'm delighted to be giving this talk again at the Melbourne ML and AI meetup in mid-April - with thanks to Lizzie Silver for the behind the scenes organisation and to Jonathan Oxer for making the connection.

Seats are strictly limited - so sign up as soon as you can!

📅 Tuesday 15th April, 6pm to 8pm AEST
📍 Docklands Hub, next to Library at the Dock, 912 Collins Street, Melbourne

Talk Title: The Token Wars: why not all our content should be open

Abstract: In recent years, there has been an explosion in generative AI. Most of us are now familiar with tools like ChatGPT, Midjourney, Sora, and others. At the heart of generative AI is a machine learning architecture called the "transformer", which is fed by huge datasets - text, images and videos. Those datasets are "tokenised" - cut up into chunks which the transformer can ingest. Those actors who can obtain the most tokens can generally train the best models (for various values of "best").

We are now witnessing a battle between the creators of generative AI models - who seek to obtain as much data as possible for tokenisation - while their targets try to stop them. The social ramifications of this resource conflict are widespread, resulting in "alateral damage" - a term I am coining to point to the unforeseen, unintended, distal consequences of a seemingly innocuous technology.

These are the Token Wars.

And they're the reason not all our content should be openly available.

In this three-part talk, I first provide a technical grounding on transformers, tokens and how they're used to build text-based generative AI. In the second part, I draw on economics to ask, "why are tokens so valuable?", showing that as the internet becomes filled with AI slop, human-created data is becoming more scarce - and so more expensive. In the third part I explore how you might approach guarding your token treasure, from data poisoning to alternative licensing models and data sovereignty.

You'll leave this talk never looking at data or ChatGPT the same way again.

https://www.meetup.com/machine-learning-ai-meetup/events/306548300/?utm_medium=referral&utm_campaign=share-btn_savedevents_share_modal&utm_source=link

In just a few days I will travel to Tarntanya/Adelaide - fulfilling a desire to take the Overland across north-western Victoria and into South Australia - to present my talk on the #TokenWars at @everythingopen #EO2025 #EverythingOpen.

The future of many open source, volunteer-run conferences is precarious.

Rising costs of hosting, dwindling sponsorship, and reluctance to fund employees to attend, as well as the increasing burn-out of the dedicated folks who pitch in thousands of hours a year to make them happen - on top of the erosion caused by the pandemic - mean that this may be the last year in many I get to catch up with the community I've come to call "my people" over the last 15 years.

So, let's make it a blast.

Three stellar #keynotes lead the proceedings - maker, technologist and Skill Seeker, @sjpiper145, critical technologist and FOI expert, @daedalus, alongside passionate advocate for the power of libraries, @Trishh.

On top of that, I'm also anticipating great talks from Andy Gelme, @Unixbigot, @saera, @nnye, @dtbell91, @emmadavidson, @kattekrab, @caitelatte@cloudisland.nz, Aleisha Amohia and Sara King, just to name a few - people I have admired and respected for a long time.

See you there, perhaps for the last time in a long while?

You might be familiar with what I'm terming the "Token Wars" - in which #LLM and #GenAI companies seek to ingest text, image, audio and video content to create their #ML models. Tokens are the basic unit of data input into these models - meaning that #scraping of web content is widespread.
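
To make "tokens" concrete, here is a minimal sketch of tokenisation, assuming a toy scheme that splits text into words and punctuation. Production models typically use subword schemes such as byte-pair encoding instead; the function names `toy_tokenise` and `encode` are hypothetical and exist only for this illustration.

```python
import re


def toy_tokenise(text: str) -> list[str]:
    """Split text into word and punctuation chunks (a toy stand-in for a real tokeniser)."""
    return re.findall(r"\w+|[^\w\s]", text)


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map each token to an integer ID, growing the vocabulary as new tokens appear."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


vocab: dict[str, int] = {}
tokens = toy_tokenise("Not all our content should be open.")
ids = encode(tokens, vocab)

print(tokens)  # the chunks a model would ingest
print(ids)     # the integer IDs that actually reach the model
```

Every page a scraper collects is ultimately reduced to sequences of IDs like these, which is why more scraped text means more tokens to train on.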

In response, many sites - such as Reddit, Inc. and Stack Overflow - are entering into content sharing deals with companies like OpenAI, or making their sites subscription only.

Another solution that has emerged recently is content blocking based on user agent. In web programming, the client requesting a web page identifies itself - usually as a browser or a bot - via the User-Agent request header.
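
As a rough illustration of blocking by user agent, the sketch below checks the User-Agent header of a request against a short blocklist and picks an HTTP status code. The blocklist entries are example crawler names only, and `is_blocked` and `status_for` are hypothetical helpers; this is not how any particular product implements blocking.

```python
from typing import Optional

# Example crawler names only; a real blocklist would be longer and kept up to date.
BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "CCBot", "ClaudeBot")


def is_blocked(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent string contains a blocklisted crawler name."""
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(name.lower() in ua for name in BLOCKED_AGENT_SUBSTRINGS)


def status_for(headers: dict[str, str]) -> int:
    """Pick an HTTP status for a request: 403 for blocklisted agents, 200 otherwise."""
    # Header names are case-insensitive, so look the header up without regard to case.
    ua = next((v for k, v in headers.items() if k.lower() == "user-agent"), None)
    return 403 if is_blocked(ua) else 200


print(status_for({"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}))
print(status_for({"User-Agent": "Mozilla/5.0 Firefox/128.0"}))
```

The weakness is that the header is self-reported: a scraper can simply claim to be a browser, so this only stops bots that identify themselves honestly.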

User agents can be blocked by a website's robots.txt file - but only if the user agent respects the robots.txt protocol. Many web scrapers do not. Taking this a step further, network providers like Cloudflare are now offering solutions which block known token scraper bots at a network level.
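
The sketch below shows why robots.txt only works for well-behaved bots: the check is something the crawler has to choose to run. It uses Python's standard `urllib.robotparser` on an example robots.txt; the rules, the `GPTBot` entry and the `example.com` URL are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt that turns one crawler away and allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/posts/68"

# A polite crawler asks before fetching; nothing forces a scraper to make this call.
print(parser.can_fetch("GPTBot", url))
print(parser.can_fetch("Mozilla/5.0", url))
```

Because the server never enforces these rules itself, blocking at the network level is the next step up for sites facing scrapers that ignore them.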

I've been playing with one of these solutions called #DarkVisitors for a couple of weeks after learning about it on The Sizzle and was **amazed** at how much of the traffic to my websites was bots, crawlers and content scrapers.

https://darkvisitors.com

(No backhanders here, it's just a very insightful tool)

#TokenWars #tokenization #scraping #bots #scrapy #WebScraping

Track, control, and optimize your website for AI agents and bots

Use Dark Visitors to turn the rising wave of AI agents, LLM assistants, and other bots crawling your website into a new growth channel for your business

Dark Visitors