This whole "this is how humans learn, so what's the difference" thing, while stealing so much data to make billions for a few dudes, is so insidious.
@timnitGebru it's true, i personally am backed by fifty thousand GPUs in a datacentre in outback australia
@timnitGebru Yeah, I distinctly remember having to pay a whole lot for my education...
@BlueAppaloosa @timnitGebru mine was quite inexpensive but it instilled in me a strong sense of indicating what ideas I can reasonably claim were my own and where I have to cite my sources. These language models are more like comedians shamelessly stealing material. There can be no (mechanical) shame or guilt because it’s not part of the problem definition.
@timnitGebru somehow that's not copyright infringement, but sampling even one damn note and using it in a DJ set is.
@rysiek @timnitGebru Steal from one person, that's copyright infringement. Steal from everybody, 🤷
@timnitGebru I still don’t get how that’s supposed to cover obvious infringement. “We trained our writer’s room on this collection of Marvel comics, so now they can write MCU films for anyone who wants them!”

@chris_radcliff you can do that: you can train your writers on Marvel fiction and then tell them to write fanfiction. It won’t be copyright infringement, but it would pretty surely infringe on trademarks, and it would be plagiarism.

@timnitGebru

@chris_radcliff The deeper problem is that copyright law is a swamp.

I once asked a webcomic author whether he would allow me to use a single strip under free licenses for a free RPG. He asked “will I still retain the rights to the characters?”.

Half a year of part-time searching later I had to tell him: “I don’t know and no one can tell me”.

So since his livelihood depended on others not being able to just run with his cast of characters, we decided to scrap the idea.
@timnitGebru

@chris_radcliff generative art now gets added to that swamp of copyright: it doesn’t really copy any specific work and doesn’t need to recreate a character as it is, but combines material from a huge number of undisclosable sources. Each usage might or might not be fair use, but the company behind it has the funds to simply outspend any author who sues.

Now copyright is still a swamp, but big money is with those who consume the works, not with those who control access.
@timnitGebru

@ArneBab @chris_radcliff @timnitGebru

> Following the release of the 1978 The Rutles album, ATV Music, the then-owner of the publishing rights to the Beatles catalog, sued Innes for copyright infringement. Though Innes hired a musicologist to defend the originality of his songs, he settled with ATV out of court for 50% of the royalties and shared songwriting credit on the 14 songs included on the album. As of early 2006, these six songs from the first Rutles CD
...
https://www.liquisearch.com/the_rutles/lawsuits

@indieterminacy music copyright ≠ writing copyright ≠ drawing copyright ≠ dance moves copyright (as Epic Games painfully found out).

Each of those has special rules for what counts as infringement.

In music even having an intro sound similar can cause a lawsuit. Even if you never knew the other song.

I said for good reason that it’s a swamp.

@chris_radcliff @timnitGebru

@ArneBab @chris_radcliff @timnitGebru I once worked at an independent music label association, I appreciate you emphasizing distinctions.

> “It’s brutal,” he said, his smile fading for the first time. “I couldn’t afford to get a lawyer that would go up against these big corporations. They’re like the banks – they’re too big to fail.”

> Innes maintained that he didn’t analyse The Beatles’ music before writing The Rutles’ songs, but instead wrote everything from memory
https://www.loudersound.com/features/the-rutles-neil-innes-interview

@indieterminacy I’ve been working to foster free culture (libre licenses) for more than two decades now. If the goal is to find a corpus of works that really are free to use and reuse as long as you reciprocate, you hit a lot of nasty corners of copyright, because you can’t just go with the default "I don’t make money and if they are angry, I’ll just take it down or let them make a profit from it".

Because I make promises to others that they can use it.
@chris_radcliff @timnitGebru

@timnitGebru I'd be a lot more comfortable with generative-ML if it could explain its influences and sources like a human would, and not just confabulate an answer after the fact. (sometimes "I don't know, it just felt right" is an ok answer, but it shouldn't be *all* the answers.)
@gray17 @timnitGebru see, the fashionable claim is that a human explaining their influences is actually just retroactively rationalizing an unconscious process.

@FeralRobots @timnitGebru right, and that line of argument leads to "assembling the words of an explanation is an unconscious process", i.e. consciousness does not exist.

it's pointless to argue against that position. consciousness is something that does exist, even if we don't know how to explain it, and the ML models of this era do not have consciousness as we understand it.

sidestepping that is probably better. ML models easily do things humans cannot, and vice versa. they're not very similar

@gray17 @timnitGebru
It's not just pointless to argue against that position, it's impossible - which is why that position exists.

So while I don't disagree that we can't debunk it, we still have to deal with the fact that it's a REALLY PERSUASIVE position for a lot of people, for a number of reasons. It's real, it's out there, it's dangerous, & we can't fight it with facts.

I'm just ranting, this isn't about anything you're saying. Frustrated, I suppose.

@timnitGebru no human could consume and retain the entire public interwebs plus a lot of even more questionably-obtained data 🤯
@timnitGebru Pushed by the same crew that's declared AGI to be so intellectually powerful as to constitute an existential threat to Homo sapiens. The only thing that can reconcile that, without contradiction, with indifference to wholesale copying of creative works is ... yeah, that there is loads of money to be made from it.
@fgbjr One argument also backs up the other. If large language models are pseudo-intelligence on the verge of AGI then it follows that they're capable of reinterpreting the works they are trained on, rather than regurgitating them.
@projectgus If I'd have had wheels I'd have been a bus.

@projectgus
"If large language models are pseudo-intelligence on the verge of AGI [...]"

They are far from being AGI at the moment. They are simply very specialized (and advanced) models of language. Not "general" at all.
@fgbjr

@denki @fgbjr

I fully agree! Sorry if that wasn't clear in how I phrased my original reply.

What I meant was: The people who run and fund AI companies want us to believe LLMs are "creating" new things rather than merely "spicy autocomplete".

They also claim that we're already on the verge of "potentially dangerous AGI". This hypes AI directly, but it also backs up their first assertion - because if they can make people believe that today's LLMs are "almost AGI" then it's easier to argue that their output is creative and not derivative.

(To be totally clear: None of this is what I believe, I'm not convinced past "spicy autocomplete". But there is an internally consistent, self-interested, package of ideas out there that the AI boosters want us to believe.)

@projectgus @denki Yes. Not to disagree, but to tack on: if an employee were given an assignment and told, "copy from things in this library to do the job," the employer would surely be on the hook for any copyright violations. Godlike powers of command and control would need to be ascribed to a system to reach a "the devil made me do it" defense.
@timnitGebru are you sure you did not mean "egregious" instead of "insidious"?
@timnitGebru So what other bots besides Common Crawl are there to be blocked?
Oh: ChatGPT-user.
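
For anyone who wants to try blocking the two crawlers named above, here is a minimal sketch of what that could look like. It assumes the crawlers identify themselves with the user-agent tokens `CCBot` (Common Crawl) and `ChatGPT-User`, and that they actually honour robots.txt, which is voluntary. The rules and the `example.org` URL are made up for illustration; the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two crawler tokens, allow everyone else.
# The tokens "CCBot" and "ChatGPT-User" are assumptions about how the
# crawlers identify themselves; check each operator's documentation.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/gallery/"  # placeholder URL
    for agent in ("CCBot", "ChatGPT-User", "Mozilla/5.0"):
        print(agent, is_allowed(agent, page))
```

This only tells a well-behaved crawler what you would like it to do; it does not stop a scraper that ignores the file, and it does nothing about copies that were collected before the rules were added.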
@timnitGebru /narrator voice/ this was, in fact, not how humans learn. One of the great mysteries of human cognition lies in the befuddling observation of just how quickly humans learn language during a critical age period, and how exactly they lose that special power. There even was a great scientific debate around *not* needing to consume terabytes of input to learn language, called "poverty of the stimulus". And yet, the promise of return on investment drowned out such debates.

whoever looks for adjacent info in pithy comments, see this thread where a little detail on the poverty-of-stimulus argument is shared and some musings of how that could inform how we grok what LLMs can and cannot do and how metaphors frame how we think about machines.

https://pxi.social/@jakob/110283974473306733

> #LLM are brute-forcing their way through absurd amounts of data to generate an autocomplete output for any given input that approximates outputs a human might give instead. They lack a few distinct properties of human cognition, including language, that more brute force alone cannot compensate for. Because they can only ever internalize and compute *intra*textual context. Incidentally, humans need much less input(!) to learn language. Probably because they can contextualize across domains. 🧵

@timnitGebru

I am human and I do not read terabytes of data. I select what I read.

@timnitGebru it isn't how humans learn though: if you see a deer once, you know what it is the next time you see it. You don't have to see it on a thousand different backgrounds from a thousand different angles to know what a "deer" is.
@ryangermann @timnitGebru but that's also because humans generally know how to reason about 3D-space and can predict what something looks like from another angle if we've seen other things like it.

@crenfrow true, ...but humans can see a 2D PICTURE of something and recognize it in 3D without having to look at hundreds of pictures. The suggestion that what-we-call AI is actual "intelligence" is oversimplified to the point of being inappropriate, but it's what people have latched onto.

The fact that a great dane can recognize that a chihuahua is something whose ass it wants to sniff is remarkable. When an AI knows what asses it wants to sniff, THEN we've achieved something.

@ryangermann @timnitGebru well, to be fair... Babies take a while to figure out visual data enough to resolve objects. Lots and lots of image input from all sorts of different angles to learn to process visual data. Then after lots of experience you only need to see a deer once.
That's not to say that neural networks learn the same way as humans. Humans process a hell of a lot of data before they start speaking.
@timnitGebru I would say this is not at all how humans learn. I don't recall learning by reading the entire internet's text and still images.
@timnitGebru I also remember learning by just going through life.

@timnitGebru

Anyone else remember?

You wouldn't steal a car

You wouldn't steal a handbag

You wouldn't steal a TV

You wouldn't steal a movie

Downloading pirated films is stealing

Stealing is against the law

https://youtu.be/HmZm8vNHBSU

NB: they didn't acquire the rights to that song.

Copyright is for the little people not corporations.

@timnitGebru This is what computers do - provide effects of scale for things that can be automated.

Nobody was enraged when all the clerks lost their jobs to Excel.

@timnitGebru @DataDrivenMD

It’s not stealing and your saying it is is the real lack of ethics. Greed over intellectual property has already wrecked huge amounts of our culture as money grubbers try to monetize the joy contributed by the public.

@tqwhite @timnitGebru @DataDrivenMD

Who are the money grubbers?

artists who can't afford rent?

can you clarify?

@CrowquillGal @timnitGebru @DataDrivenMD

Yes I can.

Artists who can’t pay rent are not going to lose anything. They already are working two jobs to subsidize their passion.

Unless, of course, they are a Louisiana blues man who had to sell his intellectual property because he could not find a job. In that case he also has nothing to lose because he doesn’t have anything anymore.

@tqwhite @timnitGebru @DataDrivenMD

Should artists who cannot pay rent have their work used to enrich someone else and receive no profit?

@CrowquillGal @timnitGebru @DataDrivenMD

Should I get sued for playing the song my wife and I played when we fell in love because I did not pay NAASCAP (or whatever it is)?

There are losers either way. Intellectual property is an oxymoron.

@tqwhite @timnitGebru @DataDrivenMD

no one sues you for playing music.

People can sue you for profiting from their music, without their consent.

Who are the money grubbers?

Who is allowed to profit from creative work?

Techbros who scrape the internet for images they don't bother to license?

or the people educating themselves, purchasing the materials and tools required, and putting the time in to create something?

@CrowquillGal @timnitGebru @DataDrivenMD

That would be incorrect. Obviously it’s rare that people get caught or that anyone exerts the effort, but it is 100% actionable if you have your kid's band play commercial music at your wedding without a license.

@tqwhite @timnitGebru @DataDrivenMD

Licenses are the means by which Artists are able to write, practice, perform, record, and promote the art your kid is performing.

None of that is free.

@CrowquillGal @timnitGebru @DataDrivenMD

We all scrape the internet all the time using sophisticated programs that retain substantial amounts of publicly available information.

What is illegal is copying and distributing stuff. Scaling and making it available in a different form is perfectly legal. What do you think Google is?

Calling them “Bros” isn’t an argument.

@CrowquillGal @timnitGebru @DataDrivenMD

Throughout history, everyone who doesn’t make an exact copy has been allowed to profit. Thousands of playwrights have copied Shakespeare’s style and concepts. A movie critic profits off of a movie by describing it.

Experiencing stuff and regurgitating it with your own api is the essential function of human culture.

@tqwhite @timnitGebru @DataDrivenMD

AI does Not experience stuff.

Humans experience things visually, aurally, tactilely.

We think about meaning, we imagine "What if?" and ascribe new meanings to the updated version in our minds. We then explore those meanings, discover connections that make those meanings personal, then we develop it.

We don't do it by downloading one million images without permission and mimicking a style.

@CrowquillGal @timnitGebru @DataDrivenMD

Who are your alleged tech bros going to buy a license from? My blog? Yours? There are seven billion people. Does everyone who has ever said anything have the right to require a license? It’s not only impractical, it would be immoral to try to require it.

@tqwhite @timnitGebru @DataDrivenMD

the people who want to use an artist's images to train an AI should contact the artists they want to train their system on.

Why is that hard for you to imagine?

@CrowquillGal @timnitGebru @DataDrivenMD

Because these things train on billions of things. They don’t make conscious decisions. They roam the internet and look around.

If you don’t want your stuff to be seen, don’t put it online. Require a membership.

@tqwhite @timnitGebru @DataDrivenMD

Some AI art program developers Specifically Contracted artists, and licensed agreed-upon pieces for a training set. That's an ethical process I would participate in.

AI Developers don't have to be unethical. Some are intentionally choosing to. It's part of the decision making. If they couldn't make a product ethically, they probably needed more investors. If they cared about art *at all* they'd support the artists they Need to train their product.

@CrowquillGal @timnitGebru @DataDrivenMD

Copyright is ruining our culture by exchanging artistic expression for commercial design. Further, it horribly impairs the ability of the future to benefit from new things.

I am delighted to pay artists. I pay for my streaming. Of course, most artists are screwed there, too, by copyright. But I have no interest in supporting a system that has Sir Paul cashing checks based on a stoned afternoon with John Lennon fifty years ago.

@CrowquillGal @timnitGebru @DataDrivenMD

You can easily see the corruption by the insane length of copyright protection. I’d be less strongly opposed to it if the duration was three or four years but life of the artist plus ninety means that our culture cannot freely use the things THAT ARE MADE RELEVANT AND VALUABLE PURELY BY THE ACTION OF PUBLIC INTEREST ever.

@tqwhite @timnitGebru @DataDrivenMD

if artists can't commercially sell their work, they can't afford to make meaningful art.

if you want meaningful artistic expression, working 3 day jobs doesn't produce it.

being able to get paid for your art some software developer wants to use is a way to promote artistic expression.

AI scraping steals that vector.

@tqwhite @timnitGebru @DataDrivenMD

seriously - who are the money grubbers?

@CrowquillGal @timnitGebru @DataDrivenMD

The people who claim to do art but actually have nothing to say except “I want money”.