"California legislators have begun debating a bill (A.B. 412) that would require AI developers to track and disclose every registered copyrighted work used in AI training. At first glance, this might sound like a reasonable step toward transparency. But it’s an impossible standard that could crush small AI startups and developers while giving big tech firms even more power."

Trash take.
https://www.eff.org/deeplinks/2025/03/californias-ab-412-bill-could-crush-startups-and-cement-big-tech-ai-monopoly

California’s A.B. 412: A Bill That Could Crush Startups and Cement A Big Tech AI Monopoly

Electronic Frontier Foundation

Whether you're a small restaurant or not you have to ensure that you're not stealing your ingredients. So why is this any different?

What about the 1-2 person creative startups? Who is protecting their works in a society that devalues artists so much that "starving artist" is an expectation?

The idea that you shouldn't be expected to know what data you're using to train your systems, and that doing so is "an impossible task," is so normalized that it's hard to remember this was not always the case, even in the field of AI.

Data theft and scraping became completely normalized, along with the exploitation of crowdworkers, with the advent of photo-sharing platforms and services like Amazon Mechanical Turk.

So now NOT exploiting people and stealing data is the anomaly.

"A.I. Training Is Like Reading And It’s Very Likely Fair Use"

These people are unbelievable.

@timnitGebru

This consistently boils down to if we have to pay for it we can't make a profit.

Which means:

"YOUR BUSINESS MODEL ISN'T PROFITABLE. YOU ARE ON THE HOOK TO YOUR INVESTORS. GOOD LUCK."

@timnitGebru why does the EFF have such trash takes sometimes
@chirpbirb @timnitGebru I dumped the EFF when they became "cryptocurrency" advocates.
@lgw4 @timnitGebru oh, gross. when did that happen? 🤢

@chirpbirb @timnitGebru

Because they're probably under intense, predatory pressure to not do their jobs.

What kind of blackmail or death threats or worse wouldn't ecocidal war profiteers use to protect their wealth and power?

@chirpbirb @timnitGebru

I believe fair use shouldn't apply to AI.

The only LEGITIMATE reason I can even MAYBE think of here is that monopoly/cartel power is bad. I think about what happened in the wake of Upton Sinclair's The Jungle: the big meat-packers (who could afford either to go along with the new Jungle-inspired laws or to avoid them) just got bigger.

Without the monopoly/cartel argument, I see nothing good at all here.

@chirpbirb @timnitGebru

BTW hyper-focussing on one thing like this is how one misses the forest for the trees.

/me glances askew at Matthew Stoller, e.g.

@timnitGebru oh, glad to know i have the right to read absolutely any book i want without paying.
@timnitGebru yeah wtf. But EFF has that “information wants to be free” streak that sometimes leads them to say nonsense like this. They do some great work as well but when people ask me what digital rights nonprofit to donate to I always point them to groups like Media Justice instead.

@timnitGebru You're echoing an argument that was once leveled against search engines: the right to access information is presumed as long as it isn't hidden behind a paywall or excluded from indexing.

Consider the requirements for object recognition, where YOLO models typically need at least 1,000 training images of the target object, alongside images that do not depict it. Citing relevant examples in the output becomes impractical. While image LLMs may operate on fewer inputs, they still offer a rich diversity. In contrast, text-based LLMs might align more closely with your expectations; for instance, Perplexity.ai generates responses and links to sources it deems relevant.

However, it’s crucial that training data undergoes meticulous curation to eliminate noise, or we risk developing flawed AI systems—something all too common in many chatbots today. This challenge will persist until AI systems evolve to effectively self-filter out the noise, making them truly intelligent.

@deabigt @timnitGebru

My question is simple. How many books can you read per hour, and how much can you remember after one week?

It's probably ~100 pages, and you'll have a distilled summary plus some gaps after a week.

How many books can an LLM ingest per hour? Probably 100+ complete books.

How much can it remember?
Ask the correct questions and you can rebuild enough of the book from it to allow a copyright violation case to move forward.

Search engines crawl and find connections. That's all.

@bayindirh @timnitGebru
Search engines faced criticism for providing summaries of articles, which made people less inclined to visit the original sites to see if their questions were answered.

LLMs generate summaries from inputs. Similar to Cliff Notes or an author incorporating ideas from various other books, especially when considering things like fan fiction.

@deabigt @timnitGebru
Criticism related to news is justified. Google tried using AMP as a captive device, and in the end they faced backlash, and it's almost dead now.

LLMs do not generate summaries from inputs. They guess the next token, and as we have seen over and over and over, asking the right questions can generate the training input verbatim, whether it is prose or source code.

For some reading material, see: https://notes.bayindirh.io/notes/Lists/Discussions+about+Artificial+Intelligence

@deabigt @timnitGebru

Moreover: "author incorporating ideas from various other books, especially when considering things like fan fiction."

This is an extremely charitable take on a thing which ingests whole libraries to train a chain of probabilities, and gives back a chain of words that is somewhat or very similar, depending on the query.

IOW, A Stochastic, Self Unaware Parrot, on LSD.

@timnitGebru Wait so reading without buying…or watching a movie without paying is now fair use?

Oh my, how the tables have turned

@almad @timnitGebru
You only need a stack of expensive lawyers.
@timnitGebru my jaw dropped at this heading. It is truly unbelievable.
@timnitGebru meanwhile, in this thread, people arguing that this is fine, actually.  makes me wonder if some folks have ever created anything they really valued, if they have ever valued anything created by someone else. If so, imagine THAT was stolen by big tech and this creator you love suddenly had to compete with said big tech, which could now reproduce similar work for anyone on earth just by using keywords.
@timnitGebru or how about not having future art, books, music, anything creative, to love, that isn't AI slop, because creative people are (understandably) very hesitant to share, or even create, their best work anymore. And those that are trying are finding it harder and harder to find work. If you, personally, don't care about any single creative thing, I think most folks do have at least one medium that they value. Have some respect for that, at least, and quit defending AI theft.
@timnitGebru That sounds terrible. I expected better from EFF.
@timnitGebru they even called it mechanical Turk ffs, the expectation of exploiting minority groups has been culturally grandfathered in since the beginning
@wouldinotcallmyselfahumanbeing @timnitGebru the underlying point (exploitation of cheap labor) is correct, but the name is based on the chess "automaton" of the same name: https://en.m.wikipedia.org/wiki/Mechanical_Turk
@dondelelcaro yep, that worked by exploiting a person with dwarfism stuffed into a box, a methodology which has been the unspoken template ever since. @timnitGebru
@wouldinotcallmyselfahumanbeing @timnitGebru I don't believe any of the chess masters in the turk had achondroplasia; they were likely fairly flexible, though.
@dondelelcaro point taken, but i'm fairly comfortable with my assumption that the tech-frat-bro-industrial complex never followed any further with the metaphor than "it's a scam based on hidden labour" @timnitGebru
@timnitGebru it is _really_ disappointing that the EFF are arguing that "attributing your training data sources" is an impossible standard to achieve, ffs

@timnitGebru and then we are supposed to trust them when they claim some amazing result in answering a prompt or taking a “standardized test”, and believe that the answers, or the test and answer key, weren’t part of their training set.

(The one whose contents they claim not to know)

@timnitGebru Zuckerberg scraping Harvard database for images and info on fellow students comes to mind

@timnitGebru “if we steal from literally everyone, who can sue us” –Sam Altman, probably

(The answer is everyone, Sam)

@timnitGebru
The big problem is that if the AI system doesn't flag which of the data it has ingested is fact and which is fiction, it's just an expensive GIGO machine.

Or put it another way, if it doesn't know if it's answering questions with facts or fiction, all the results must be classed as fiction.

@timnitGebru
It's highly doubtful that the scraping companies have oversight of the data they ingest or its quality.
How would they be able to tell what is bad data? Probably they can't, and they end up with biased 'majority vote results' intermixed with crap.
Plus, the good data especially is not and cannot be free! This data is essential for them to get some model working close to common sense. So they try to put up 'guardrails' and hide the fact that their models spit out original content.
@timnitGebru The only generative AI tool I know of that didn't steal others' labor is one that a photographer wrote and trained *on his own photographs*: https://bhphotopodcast.libsyn.com/ai-powered-wedding-photography-workflows-with-sam-hurd-justin-benson
B&H Photography Podcast: AI-Powered Wedding Photography Workflows, with Sam Hurd & Justin Benson
