"California legislators have begun debating a bill (A.B. 412) that would require AI developers to track and disclose every registered copyrighted work used in AI training. At first glance, this might sound like a reasonable step toward transparency. But it’s an impossible standard that could crush small AI startups and developers while giving big tech firms even more power."

Trash take.
https://www.eff.org/deeplinks/2025/03/californias-ab-412-bill-could-crush-startups-and-cement-big-tech-ai-monopoly

California’s A.B. 412: A Bill That Could Crush Startups and Cement A Big Tech AI Monopoly


Whether you're a small restaurant or not, you have to ensure that you're not stealing your ingredients. So why is this any different?

What about the 1-2 person creative startups? Who is protecting their works in a society that devalues artists so much that "starving artist" is an expectation?

The idea that you shouldn't be expected to know what data you're using to train your systems, and that doing so is "an impossible task," is so normalized that it's hard to remember this was not always the case, even in the field of AI.

Data theft and scraping became completely normalized, along with the exploitation of crowdworkers, with the advent of photo-sharing platforms and services like Amazon Mechanical Turk.

So now NOT exploiting people and stealing data is the anomaly.

"A.I. Training Is Like Reading And It’s Very Likely Fair Use"

These people are unbelievable.

@timnitGebru You're echoing an argument that was once leveled against search engines: the right to access information is presumed as long as it isn't hidden behind a paywall or excluded from indexing.

Consider the requirements for object recognition, where YOLO models typically need at least 1,000 training images of the target object, alongside images that do not depict it. Citing relevant examples in the output becomes impractical. While image LLMs may operate on fewer inputs, they still draw on a rich diversity of training data. In contrast, a text-based LLM might align more closely with your expectations; for instance, Perplexity.ai generates responses and links to sources it deems relevant.

However, it’s crucial that training data undergoes meticulous curation to eliminate noise, or we risk developing flawed AI systems—something all too common in many chatbots today. This challenge will persist until AI systems evolve to effectively self-filter out the noise, making them truly intelligent.

@deabigt @timnitGebru

My question is simple. How many books can you read per hour, and how much can you remember after one week?

It's probably ~100 pages, and you'll have a distilled summary plus some gaps after a week.

How many books can an LLM ingest per hour? Probably 100+ complete books.

How much can they remember?
Ask the right questions and you can rebuild enough of the book to allow a copyright violation case to move forward.

Search engines crawl and find connections. That's all.

@bayindirh @timnitGebru
Search engines faced criticism for providing summaries of articles, which made people less inclined to visit the original sites to see if their questions were answered.

LLMs generate summaries from inputs. Similar to CliffsNotes, or an author incorporating ideas from various other books, especially when considering things like fan fiction.

@deabigt @timnitGebru
Criticism related to news is justified. Google tried using AMP as a captive device; in the end they faced a backlash, and it's almost dead now.

LLMs do not generate summaries from inputs. They guess the next token, and as we have seen over and over and over, asking the right questions can make them reproduce their training input verbatim, whether prose or source code.
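To make the point concrete, here's a minimal toy sketch (a hypothetical bigram model, nothing like a real LLM's architecture): a model that only ever "guesses the next token" can still spit its training text back out verbatim under greedy decoding.

```python
# Toy next-token predictor (hypothetical example, not any real LLM):
# count which word follows which in the training text, then greedily
# emit the most frequent successor at each step.
from collections import Counter, defaultdict

training_text = "the quick brown fox jumps over the lazy dog".split()

# Bigram counts: counts[prev][next] = how often `next` follows `prev`.
counts = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    counts[prev][nxt] += 1

def generate(seed, max_tokens=10):
    """Greedy decoding: always pick the most probable next token."""
    out = [seed]
    while len(out) < max_tokens and counts[out[-1]]:
        out.append(counts[out[-1]].most_common(1)[0][0])
    return " ".join(out)

print(generate("the", 6))  # reproduces the opening of the training text
```

Even this trivial statistical model regurgitates its training data when prompted with the right seed; scaled up to billions of parameters, the same dynamic is what lets careful prompting recover memorized passages.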

For some reading material, see: https://notes.bayindirh.io/notes/Lists/Discussions+about+Artificial+Intelligence


@deabigt @timnitGebru

Moreover: "author incorporating ideas from various other books, especially when considering things like fan fiction."

This is an extremely charitable take on a thing which ingests whole libraries to train a chain of probabilities, and gives back a relatively or very similar (depending on the query) chain of words.

IOW, A Stochastic, Self Unaware Parrot, on LSD.