2 authors say OpenAI 'ingested' their books to train ChatGPT. Now they're suing, and a 'wave' of similar court cases may follow.
use it to decide whose resume is going to be viewed and/or who will be hired
Luckily that’s far removed from ChatGPT and entirely independent of the question of whether copyrighted works may be used to train conversational AI models.
That really will be the question at hand. Is the AI producing work that could be considered transformative, educational, or parody? The answer is of course yes, it is capable of doing all three of those things, but it's also capable of being coaxed into reproducing things exactly.
I don't know if current copyright laws are capable of dealing with the AI renaissance.
Yeah it is. The only protection in copyright is called derivative works, and an AI is not a derivative of a book, any more than your brain is after you've read one.
The only exception would be if you managed to overtrain and encode the contents of the book inside the model file. That's not what happened here, because the ChatGPT output was a summary.
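For what it's worth, that kind of memorization is testable: prompt with a short prefix from the book and see if the model continues it verbatim. A toy sketch in Python (the "model" here is just a lookup table standing in for overfit weights, not a real LLM API):

```python
# Toy extraction test: if a model has memorized a text, a short prefix
# is enough to make it regurgitate the rest verbatim.
BOOK = "the quick brown fox jumps over the lazy dog near the quiet riverbank".split()

# Stand-in for an "overtrained" model: it has memorized, for every pair of
# adjacent words in the book, exactly which word came next.
toy_model = {tuple(BOOK[i:i + 2]): BOOK[i + 2] for i in range(len(BOOK) - 2)}

def continuation(prefix, n_words):
    """Greedily extend `prefix` using the memorized bigram table."""
    words = list(prefix)
    for _ in range(n_words):
        nxt = toy_model.get(tuple(words[-2:]))
        if nxt is None:  # the model never saw this context
            break
        words.append(nxt)
    return words[len(prefix):]

# Two words of prompt are enough to recover the rest of the book verbatim:
print(continuation(["the", "quick"], 11) == BOOK[2:])  # True
```

If a real model passes a test like this on long excerpts, the text is effectively encoded in the weights; if it only ever paraphrases, the summary-vs-copy distinction holds.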
The only valid claim here is that the books were not supposed to be on the public internet, and it's likely that the way OpenAI got the books in the first place was through some piracy website while scraping the web.
At that point you just have to hold them liable for that act of piracy, not treat the model release as an act of copyright violation.
Well, while I do agree that it sucks that some jobs may get replaced, history has shown that it always leads to more jobs being created in their place. The weavers lost their jobs when the loom came about, but far more jobs were created because of it. Same with the printing press and every other advancement; the nature of advancing technology is to replace the old with the new.
Ugh, the robot phone calls are going to get a hundred times worse; that one is true. I’m not sure if it’ll make the standard corporate phone maze better or worse. Maybe better, because at least you can screw with the robot while you wait instead of having the same 30 seconds of highly compressed garbage elevator music blasted into your ear on repeat.
Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?
This is exactly the problem. Months ago I read that AI could have free access to all public source code on GitHub without respecting its licenses.
So many developers have decided to abandon GitHub for other alternatives, not realizing that in the end AI training can just as easily access their public repos on other platforms as well.
What should be done is to regulate this training, which however is not convenient for companies, because the more data the AI ingests, the more its knowledge expands and the more it “helps” the people who ask it for information.
It's incredibly convenient for companies.
Big companies like OpenAI can easily afford to buy big datasets from companies like Reddit and DeviantArt, who already have permission to freely use whatever work you upload to their websites.
Individual creators do not have that ability, and this kind of regulation will only force AI into the domain of these big companies even more than it already is.
Regulation would be a hideously bad idea that would lock these powerful tools behind the shitty web APIs that nobody has control over but the company in question.
Imagine the world of the future: magical new-age technology, and Facebook owns all of it.
Do not allow that to happen.
Is it practically feasible to regulate the training? Is it even necessary? Perhaps it would be better to regulate the output instead.
It will be hard to know whether any particular GET request is ultimately being used to train an AI or to inform a human. It’s currently easy to check whether a particular output is plagiarized (e.g. plagiarismdetector.net), and it’s also much easier to enforce. We don’t need to care if or how any particular model plagiarized work; we can just check whether plagiarized work was produced.
That could be implemented directly in the software, so it wouldn’t even output plagiarized material. The legal framework around it is also clear and fairly established: instead of creating regulations around training, we can use the existing regulations around the human who tries to disseminate copyrighted work.
That’s also consistent with how we enforce copyright for humans. There’s no law against looking at other people’s work and memorizing entire sections. It’s also generally legal to reproduce other people’s work (e.g. for backups). It only potentially becomes illegal if someone distributes it, and it’s only plagiarism if they claim it as their own.
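A toy sketch of such an output-side check, assuming a corpus of protected texts to compare against (the texts and the 6-word threshold here are made up for illustration):

```python
# Toy output-side check: flag a generated text if it shares any long
# verbatim word run with a known protected work. The corpus and the
# 6-word threshold are illustrative, not a real legal standard.

def ngrams(text, n):
    """All n-word runs in a text, as a set of word tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_long_run(output, protected, n=6):
    """True if `output` contains any n-word run found verbatim in `protected`."""
    return bool(ngrams(output, n) & ngrams(protected, n))

book = "It was a bright cold day in April and the clocks were striking thirteen"
copied = "He wrote: it was a bright cold day in April, or so it seemed"
original = "The clocks chimed while April rain fell on a cold bright day"

print(shares_long_run(copied, book))    # True
print(shares_long_run(original, book))  # False
```

Real detectors are fuzzier than exact n-gram matching, but the point stands: you check the output, without caring how the model was trained.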
This makes perfect sense. Why aren’t they going about it this way then?
My best guess is that maybe they just see OpenAI being very successful and want a piece of that pie? Because if someone produces something via ChatGPT (let’s say for a book) and uses it, what are the chances they made any significant amount of money that you could sue for?
It’s hard to guess what the internal motivation is for these particular people.
Right now it’s hard to know who is disseminating AI-generated material. Some people are explicit when they post it but others aren’t. The AI companies are easily identified, and there’s at least the perception that regulating them can solve the problem of copyright infringement at the source. I doubt that’s true. More and more actors are able to train AI models, and some of them aren’t even under US jurisdiction.
I predict that we’ll eventually have people vying to get their work used as training data. Think about what that means. If you write something and an AI is trained on it, the AI considers it “true”. Going forward when people send prompts to that model it will return a response based on what it considers “true”. Clever people can and will use that to influence public opinion. Consider how effective it’s been to manipulate public thought with existing information technologies. Now imagine large segments of the population relying on AIs as trusted advisors for their daily lives and how effective it would be to influence the training of those AIs.
Yeah. There are valid copyright claims, because there are times that ChatGPT will reproduce stuff like code line for line over 10, 20, or 30 lines, which is really obviously a violation of copyright.
However, just pulling in a story from context and then summarizing it? That's not a copyright violation, that's a book report.
Say I see a book that sells well. It's in a language I don't understand, but I use a thesaurus to replace lots of words with synonyms. I switch some sentences around, and maybe even mix pages from similar books into it. I then go and sell this book (still not knowing what the book actually says).
I would call that copyright infringement. The original book didn't inspire me, it didn't teach me anything, and I didn't add any of my own knowledge into it. I didn't produce any original work; I simply mixed a bunch of things I don't understand.
That's what these language models do.
The fear is that the books are in one way or another encoded into the machine learning model, and that the model can somehow retrieve excerpts of these books.
Part of the training process of the model is to learn how to plagiarize the text word for word. The training input is basically “guess the next word of this excerpt”. This is quite different from how humans do research.
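Roughly, the training data is built like this (a toy sketch; real models work on tokens and learn probability distributions, but the “guess the next word” shape is the same):

```python
# Toy sketch of next-word-prediction training data: a copyrighted text
# is sliced into (context, next-word) examples, and the model is scored
# on how well it predicts the word that actually followed.

excerpt = "call me Ishmael some years ago never mind how long".split()

def training_pairs(words, context_len=4):
    """Slide a window over the text; each position yields one example."""
    return [
        (words[i:i + context_len], words[i + context_len])
        for i in range(len(words) - context_len)
    ]

for context, target in training_pairs(excerpt)[:3]:
    print(" ".join(context), "->", target)
# first line: call me Ishmael some -> years
```

So the training objective literally rewards reproducing the source text; whether the finished model actually retains enough to do so is a separate question.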
To what extent the books are encoded in the model is difficult to know. OpenAI isn’t exactly open about their models. Can you make ChatGPT print out entire excerpts of a book?
It’s quite a legal gray zone. I think it’s good that this is tried in court, but I’m afraid the court might have too little technical competence to make a ruling.
Can’t reply directly to @[email protected] because of that “language” bug, but:
The problem is that they then sell the notes in that database for giant piles of cash. Props to you if you’re profiting off your research the way OpenAI can profit off its model.
But yes, the lack of meat is an issue. If I read that article right, it’s not the one being contested here though. (IANAL and this is the only article I’ve read on this particular suit, so I may be wrong).
Was also going to reply to them!
“Well if you do that, you cite sources and references. AIs do not do that, and by design can’t.
So it’s more like you summarized a bunch of books, passed it off as your own research, and then published and sold that.
I’m pretty sure the authors of the books you used would be pissed.”
The problem is that they then sell the notes in that database for giant piles of cash.
On top of that, they have no way of generating any notes without your input.
I believe the way these models work is fundamentally plagiaristic. It's an "average of its inputs" situation, not a "greater than the sum of its parts" one.
GitHub Copilot doesn't know how to code, it knows how to copy-and-paste from people who do. It's useless without a million devs to crib off.
I think it's a perfectly reasonable reaction to be rather upset when some Silicon Valley chuckleheads help themselves to your life's work in order to build a bot to replace you.
@[email protected] can’t reply directly to you either, same language bug between lemmy and kbin.
That’s a great way to put it.
Frankly idc if it’s “technically legal,” it’s fucking slimy and desperately short-term. The aforementioned chuckleheads will doom our collective creativity for their own immediate gain if they’re not stopped.
Yeah, they want the right to control not only who copies their work and distributes it to other people, but who's able to actually read their work.
It's asinine, and we should be rolling back copyright, not making it more strict. This "life of the author plus 70 years" thing is bullshit.
Researchers pay for publication, and then the publisher doesn't pay for peer review, then charges the reader to read research that they basically just slapped on a website.
It's the publisher middlemen that need to be ousted from academia, the researchers don't get a dime.
Remember, Creative Commons licenses often require attribution if you use the work in a derivative product, and often require ShareAlike. Without these things, there would be basically no protection from a large firm copying a work and calling it their own.
Rolling back copyright protection in these areas will enable large companies with traditional copyright systems to wholesale take over open source projects, to the detriment of everyone. Closed source software isn't going to be available to AI scrapers, so this only really affects open source projects and open data, exactly the sort of people who should have more protection.
Closed source software isn't going to be available to AI scrapers, so this only really affects open source projects and open data, exactly the sort of people who should have more protection.
The point of open source is contributing to the greater good of all of humanity. If open source contributes to an AI that can program, and that programming AI leads to increased productivity and capability in the general economy, then open source has served its purpose, and people will likely continue to contribute to it.
Creative Commons applies when you redistribute code. (In the ideal case) AI does not redistribute code; it learns from it.
And the average person's increased ability to program will allow programmers to be more productive, and as a result allow more things to be open source and more things to be programmed in general. We will all benefit, and that is what open source is for.
There’s also the GPL, which states that derivatives of GPL code can only be used in GPL software, and that GPL software must itself be open source.
ChatGPT is likely trained on GPL code. Does that mean all code ChatGPT generates is GPL?
I wouldn’t be surprised if there would be an update to GPL that makes it clear that any machine learning model trained on GPL code must also be GPL.
I think this is exposing a fundamental conceptual flaw in LLMs as they’re designed today. They can’t seem to simultaneously respect intellectual property / licensing and be useful.
Their current best use case - that is to say, a use case where copyright isn’t an issue - is dedicated instances trained on internal organization data. For example, Copilot Enterprise, which can be configured to use only the enterprise’s data, without any public inputs. If you’re only using your own data to train it, then copyright doesn’t come into play.
That’s been implemented where I work, and the best thing about it is that you get suggestions already tailored to your company’s coding style. And its suggestions improve the more you use it.
But AI for public consumption? Nope. Too problematic. In fact, public AI has been explicitly banned in our environment.
There’s an additional question: who holds the copyright on the output of an algorithm? I don’t think that is copyrightable at all. The bot doesn’t really add anything to the output; it’s just a fancy search engine. In the US in particular, the agency in charge of copyrights has been quite insistent that copyright can only be given to the output of a human.
So when an AI incorporates parts of copyrighted works into its output, how can that not be infringement?
How can you write a blog post reviewing a book you read without copyright infringement? How can you post a plot summary to Wikipedia without copyright infringement?
I think these blanket conclusions about AI consuming content being automatically infringing are wrong. What is important is whether or not the output is infringing.
You can write that blog post because you are a human, and your summary qualifies for copyright protection, because it is the unique output of a human based on reading the copyrighted material.
But the US authorities are quite clear that a work that is purely AI generated can never qualify for copyright protection. Yet since it is based on the synthesis of works under copyright, it can’t really be considered public domain either. Otherwise you could ask the AI “Write me a summary of this book that has exactly the same number of words”, and likely get a direct copy of the book which is clear of copyright.
I think that these AI companies are going to face a reckoning, when it is ruled that they misappropriated all this content that they didn’t explicitly license for use.
I’m expecting a much messier “resolution” that’ll look a lot like YouTube’s copyright situation - their product can be used for copyright infringement, and they’ll be required by law to try and take appropriate measures to prevent it, but will otherwise not be held liable as long as they can claim such measures are being taken.
Having an AI recite a long text to bypass copyright seems equivalent in my mind to uploading a full movie to YouTube. In both cases, some amount of moderation (itself increasingly algorithmic) is required not only to be applied, but to be actively developed and advanced to thwart efforts to bypass it. For instance, YouTube pirates will upload things with some superficial changes, like a filter applied or the movie shown at a weird angle or mirrored, to get past the copyright bots, which means the bots need to be stricter and better trained, or else YouTube once again becomes liable for knowing about these pirates and not stopping them.
The end result, just like with YouTube, will probably be that AI models have to have big, clunky algorithms applied against their outputs to recalculate or otherwise make copyright-safe anything that might remotely be an infringement. It’ll suck for normal users, pirates will still dig for ways to bypass it, and everyone will be unhappy. If YouTube is any indicator, this situation can somehow remain stable for over a decade, long enough for AI devs to release a new-generation bot and restart the whole issue.
Yaaaaaaaaay
But the US authorities are quite clear that a work that is purely AI generated can never qualify for copyright protection.
Which law says this? The government is certainly discussing the problem, but I wasn’t aware of any legislation.
If there is such a law, it seems to overlook an important point: an algorithm - an AI - is itself an expression of human intelligence. Having a computer carry out an algorithm for summarizing content can be indistinguishable from a person having a pattern they follow for writing summaries.
If you’re doing research, there are actually some limits on the use of the source material and you’re supposed to be citing said sources.
But yeah, there’s plenty of stuff where there needs to be a firm line between what a random human can do versus an automated intelligent system with potential unlimited memory/storage and processing power. A human can see where I am in public. An automated system can record it for permanent record. An integrated AI can tell you detailed information about my daily activities including inferences which - even if legal - is a pretty slippery slope.
a firm line between what a random human can do versus an automated intelligent system with potential unlimited memory/storage and processing power.
I think we need a better definition here. Is the issue really the processing power? Do we let humans get a pass because our memories are fuzzy? From your example you’re assuming massive details are maintained in the AI situation which is typically not the case. To make the data useful it’s consumed and turned into something useful for the system.
This is why I’m worried about legislation and legal precedent. Most people think these AI systems read a book and store the verbatim text off somewhere to reference when that isn’t really the case. There may be fragments all over, and it may be able to reconstitute the text, but we don’t seem to have the same issue with data being synthesized in a similar way with a human brain.
Is that scary because it’s a machine? Someone could tail you and follow you around and manually write it all down in a notebook.
Yes the ease of data collection is an issue and I’m very much for better privacy rights for us all. But from the issue you’ve stated I’d be more afraid of what the 70 year old politicians who don’t understand any of this would write up in a bill.
Someone could tail you and follow you around and manually write it all down in a notebook.
They could, and then they could also be charged with stalking.
It’s not just ease of collection. It’s how the data is being retained, secured, and shared, among a great many other things. Laws just haven’t kept up with technology, partly because yeah, 70-year-old politicians who don’t even understand email, but also because the corporations behind the technology lie and bribe to keep it that way, and face little consequence when they handle it improperly. For example:
cbc.ca/…/cadillac-fairview-5-million-images-1.578…
When the government does it, we seem to have even less recourse.
The real estate company behind some of Canada's most popular shopping centres embedded cameras inside its digital information kiosks at 12 shopping malls in major Canadian cities to collect millions of images — and used facial recognition technology without customers' knowledge or consent — according to a new investigation by the federal, Alberta and B.C. privacy commissioners.