"AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model that's less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material."

tl;dr: If you use public domain data (i.e. you don't steal from authors and creators) you can train an LLM just as good as what was cutting edge a couple of years ago. What makes it difficult is curating the data, but once the data has been curated, in principle everyone can use it without having to go through the painful part.
So the whole "we have to violate copyright and steal intellectual property" is (as everybody already knew) total BS.

https://www.engadget.com/ai/it-turns-out-you-can-train-ai-models-without-copyrighted-material-174016619.html?src=rss

It turns out you can train AI models without copyrighted material

It's just a pain in the ass.

Engadget

Look mom, I got a semi-viral post on Mastodon!

I don't have a soundcloud (or whatever people who regularly go viral tend to advertise), so just be nice to each other!

@j_bertolotti

Fortunately virality in Mastodon is only by contact of a "Susceptible" masto-user with your "Infectious" post–and Recovery is quite common.

No Mass Inoculations thru algorithms here.

@j_bertolotti Crikey, it's like the glorious days of John Mastodon.
@j_bertolotti It'd be interesting if AI companies lobbied to increase the works in the public domain by decreasing copyright duration. That's something I'd actually support. Copyright is too long. And it would then be a legal, more ethical industry instead of a pack of VC-funded thieves. Strange bedfellows.
@shanecelis @j_bertolotti
I suspect this idea will get increasing traction, as current copyright principles experience increasing strain.
Curious - what's an acceptably shorter period of time? Does it vary with type of work?
@leafless @shanecelis @j_bertolotti I would say, especially software copyrights should be much shorter.

@leafless @shanecelis @j_bertolotti That's a hella interesting question, and I don't think I have an answer I'm comfortable with.

My original thought was maybe 20 or 25 years, much like a patent, and make it renewable so that if it's still profitable the creator can keep their share of the action. But that puts things from 2000 into the public domain, and I'm not sure that feels right.

@Brett @shanecelis @j_bertolotti
Yeah! It's hard to draw clear lines. I also wonder about different types of software. It feels like architectural components, e.g., should be treated differently from a lightweight ERP SaaS solution that'll move on in a few years...
Currently thinking about whether there's a plausible sense in which the duration of protection could be tied to the effort expended...
@shanecelis @j_bertolotti IMO, the right approach here is to essentially copy what radio does. Give AI companies a blanket right to train on anything as long as they pay royalties to some central organization, which would then distribute them to content creators.
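As a toy illustration of the distribution step that proposal implies (everything here is hypothetical: the pool size, the creator names, the idea of weighting payouts by tokens contributed), a pro-rata split might look like:

```python
# Hypothetical sketch: split a collected royalty pool among creators
# in proportion to how much training data each one contributed.
def distribute_royalties(pool, contributions):
    """pool: total money collected; contributions: creator -> tokens used."""
    total = sum(contributions.values())
    return {creator: pool * tokens / total
            for creator, tokens in contributions.items()}

payouts = distribute_royalties(
    1_000_000,  # assumed pool, in whatever currency unit
    {"news_outlet": 600_000, "blogger": 300_000, "forum_users": 100_000},
)
```

The hard parts the thread goes on to debate (who counts the tokens, and how millions of social-network users would each claim their tiny share) are exactly what this sketch leaves out.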

@miki @shanecelis @j_bertolotti

Not sure it is that good. The central organizations that already exist for, say, music or films are assholes, and I don't see why this one would be any different.
Moreover, I wonder how one would pay royalties to social-network users (whose posts are being crawled as well). That's millions of people.

@stargazer @shanecelis @j_bertolotti It is the worst idea, except for all the others.

Good AI is more beneficial for humanity than bad AI, and giving authors *some* money is better than giving them no money, as we do now.

@miki
@shanecelis @j_bertolotti

Just to iterate on your idea (I still believe we can do better): we could force social networks to act as intermediaries between end users and the AI bros. If the AI bros love subscription capitalism so much, let them subscribe to, say, Meta to scrape Facebook, with Meta being obligated to distribute the fee obtained among the users of the platform.

...with an option to opt-out from scraping.
The world where I can sell my kidney but can't sell my personal data is kinda weird.

@shanecelis @j_bertolotti It's so weird that pirates and AI companies are now, while not in the same boat, agreeing on some things, and neither side likes it. The pirates hate that, in fighting for our rights, we have to admit the AI companies have some points, and the AI companies hate that they have to be seen as pirates because we have some points.
@shanecelis @j_bertolotti Thankfully they never would, imagine Microsoft waiving away parts of Windows code.
And you can entirely have licences that expire into a more permissive licence or a public-domain equivalent; no need to screw around with everyone's copyrights.
@j_bertolotti
I, for one, would love to use an LLM that wrote like a tuberculosis-riddled Georgian or Victorian gent. Train it up on all the old penny dreadfuls and let it start answering Gen Z questions.
@xinit @j_bertolotti We can actually get a Victorian child
@Mik3y
All the urchin charm, none of the pox.
@j_bertolotti
@j_bertolotti And once you're curating, you can also curate for truthfulness. No more glue on pizzas.

@martinvermeer
That one didn't happen because the training material contained lots of "glue on pizzas". It happened because the training material contained lots of "use glue to make things stick".

Glue on pizza is just a consequence of how LLMs work, use another tool if you wish to avoid that.
@j_bertolotti

@notsoloud @j_bertolotti I seem to remember it was because there was one satirical article that actually recommended putting glue on pizza, and the LLM just swallowed it whole.
Now even with 100% true input material it will still happily generate fantasy output, but at least these basic embarrassments may be avoided...
Peter Yang (@petergyang) on X

Google AI overview suggests adding glue to get cheese to stick to pizza, and it turns out the source is an 11 year old Reddit comment from user F*cksmith 😂

X (formerly Twitter)
@j_bertolotti my innovative ponzi scheme only works when trained on other people wealth. So this is the precedent my lawyer has been praying for.

@j_bertolotti

Oh yeah, if I went around breaking every car window down the street with "it makes me feel good" as an excuse, people would still reject it, because, you know, destroying somebody else's property is wrong. It's illegal.

Now these same men are trying to destroy all of us. We should all be outraged, because they're trying to destroy us with AI that they want to use to subjugate and lord it over us. Why are we letting them get away with it?

@j_bertolotti Now all we need to do is make it less of a resource hungry monster and stop it from making up pure fictions and we'll be set.
@j_bertolotti
Intellectual property doesn't exist
@j_bertolotti it was just easier for them not to care, because billions allowed them to do so. It was never about efficiency or ethics.
@j_bertolotti
Having, and publishing, a syllabus for AI education has noticeable merit.
@j_bertolotti
"This work was supported by funding from the Mozilla Foundation and Sutter Hill Ventures"
Interesting positive for Mozilla.

@j_bertolotti "So, we managed to build something noticeably worse (lagging a couple generations) with considerably more effort" - which only confirms the initial point.

There already were models trained on free materials. Of course, that also meant their scope was somewhat limited to those materials. They have their uses, but general-purpose assistants are expected to have a certain level of popular-culture awareness, which can hardly be deduced from the Library of Congress.

@shuro @j_bertolotti It doesn't confirm the initial point. The initial point was it is impossible to create LLMs without violating copyright. This experiment disproves that.

Claiming otherwise is moving the goalposts: now it has to be not just possible, but also super easy and cheap, and give the same results. But that was not at all the original claim being investigated.

@distrowatch The claim that it is impossible to make any LLM without copyrighted material makes no sense. Do you have an exact quote?

Obviously you can make a language model on any dataset and it was done long before "AI" was a thing.

The question is whether you can make it comparable and marketable - just as you can make knives without metal: plastic knives are a thing, as are ceramic ones, or just whatever people manage to sharpen as a pet project, but they are hardly comparable with ordinary kitchen knives.
@j_bertolotti

@shuro @j_bertolotti Again, you're moving the goalposts. No one said it's impossible to create _any_ LLM without copyrighted materials. Just that it's impossible to create today's leading models without them. You're making up strawman arguments to confuse the issue.

As for quotes, the OP you replied to has one. Here's another from OpenAI to the UK court system: "It would be impossible to train today’s leading AI models without using copyrighted materials." The above study just disproved their statement.

@distrowatch I am sorry if you have the impression that I am moving the goalposts; that is not my intention. Maybe I am not expressing my point clearly enough.

So we established that some LLMs can be created on any kind of dataset (and this happened even before the '90s). So AI vendors claim that it is impossible to create not just any LLM, but a product with certain characteristics. Here are two quotes from OpenAI I found (the original article mentions another one linking to this official document):

"We believe that AI tools are at their best when they incorporate and represent the full diversity and breadth of human intelligence and experience. In order to do this, today’s AI technologies require a large amount of training data and computation, as models review, analyze, and learn patterns and concepts that emerge from trillions of words and images. OpenAI’s large language models, including the models that power ChatGPT, are developed using three primary sources of training data: (1) information that is publicly available on the internet, (2) information that we license from third parties, and (3) information that our users or our human trainers provide. Because copyright today covers virtually every sort of human expression–including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens."

So, obviously, key parts of context are "copyright today covers virtually every sort of human expression" and "meet the needs of today’s citizens". These are the initial goal posts.

So this boils down to what I said above: we speak not about creating any LLM, or just some LLM which is big enough, or complicated enough, or produces just as many greenhouse gases and as much waste water - but an LLM with comparable cultural awareness. So people can use it to find something, or to make a snippet of Python code, or to restyle Trump and Musk posts as if they were Game of Thrones characters, etc.

Now it would be interesting to test it in that regard and let actual today's citizens judge how it compares - again, because they are not interested in synthetic tests, but in a tool that helps them solve their practical problems.
@j_bertolotti

@shuro
I understand your point, but I feel you are asking a slightly wrong question. The point is not whether these researchers can create an LLM as powerful as the very latest version of ChatGPT (they can't, if nothing else because they have only a microscopic fraction of the money OpenAI has poured into it). The point is that they can make something that is not that far behind (what they have would have been cutting edge just a couple of years ago) without having to violate everyone's intellectual property.
This is relevant because it makes OpenAI's argument that they "have to" ignore any form of intellectual-property protection or they wouldn't be able to have a product a lot less tenable.
@distrowatch

@j_bertolotti - my point was a bit different though; it wasn't just about power. I am pretty sure it is possible to source enough open data to build a corpus allowing fluent speech generation and a certain amount of "reasoning" on par with current models. As you point out, it is more about resources.

However, my point is something else: a large enough corpus and efficient training are not enough; you also need to make your model aware of the knowledge areas your users expect it to "know about" - and this is the real problem. Take Windows or other Microsoft tech - very popular and very proprietary. Take Apple. Take car brands. Take just about any modern literature. Even general popular culture - memes, movie reviews, etc. - is concentrated on services that usually don't grant a free license for the hosted content, even when it is user-generated.

And end users don't want just a well-spoken chatbot, or a chatbot which can solve simple logic problems. They want to ask it why their Windows gets a BSOD when they insert a WiFi dongle, or how to write birthday wishes for their nephew with Harry Potter references, or to recommend a shark movie to watch.

I would be very surprised if this libre LLM could perform similarly in these scenarios - and there are countless other areas which are mostly covered by copyright or other licensing restrictions. This is what OpenAI meant in their statement.

@distrowatch

@j_bertolotti That article does a good job of muddying the water to benefit big AI companies.

By equating "without training on copyrighted material" and "openly licensed material" it pretends that copyright is the issue, not the "without permission" part.

Capitalist BS. Cuz what those techno idiots are actually saying is "we can't get obscenely rich with AI if we can't steal from everyone".

@j_bertolotti

@j_bertolotti Doth it talk all funny, or over-use hyphens, or any-thing like that, or is it, in fact, fairly modern, except, possibly, for a slight over-use of commas (although, in all fairness, betwixt underuse of commas, and their overuse, overuse looks significantly more dignified)?
@j_bertolotti But the ones trained on public domain data can't infringe on copyrights quite so easily - just ask them to make a Calvin and Hobbes comic, or a magician in the style of JK Rowling, and you'll see they can't.

@j_bertolotti I'm surprised anyone even had to actually do a study to prove this. It's only logical.

Here's the thing: they're running into serious problems getting their current models to work better, because it's basically impossible to take random data scraped from all over the Internet, from sources that aren't all appropriate (and almost none of which have given permission), and somehow filter it down to quality. They're writing ever more complicated post-filtering methods, which keep getting more complex and more expensive.

However, if they just did it right in the first place -- paying people to write quality material -- they wouldn't have to go through all that, and they wouldn't be violating lots of laws.

@j_bertolotti Honestly... I think the "more data = better models" approach is dumb AF. It's basically brute force applied to LLM training.
@j_bertolotti
Who would've thought it... 🤔😬
@j_bertolotti I wonder how an LLM would sound if you could only use 100-year-old texts due to public domain.
If we stick to the public domain, it'll be a bit dated, and it will require curation to prevent old common misconceptions from leaking into the corpus. If we also add copyleft data, the result fares better in terms of knowledge breadth. There is a project attempting to curate precisely that kind of corpus: huggingface.co/datasets/PleIAs…
PleIAs/common_corpus · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.
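The curation that post describes largely boils down to license filtering. A minimal, hypothetical sketch of the idea (the `license` field and the allow-list values are assumptions for illustration, not the actual Common Corpus schema):

```python
# Hypothetical sketch: keep only documents whose license metadata
# marks them as public domain or under an open/copyleft licence.
ALLOWED_LICENSES = {"public-domain", "cc0", "cc-by", "cc-by-sa", "gpl"}

def filter_corpus(docs):
    """Yield only documents whose 'license' field is in the allow-list."""
    for doc in docs:
        if doc.get("license", "").lower() in ALLOWED_LICENSES:
            yield doc

docs = [
    {"text": "An 1890 novel.", "license": "public-domain"},
    {"text": "A Wikipedia article.", "license": "CC-BY-SA"},
    {"text": "A scraped forum post.", "license": "unknown"},
]
kept = list(filter_corpus(docs))  # drops the unlicensed forum post
```

The real difficulty, as the thread notes, is establishing that metadata in the first place: documents rarely arrive with a trustworthy license label attached.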

@taleteller
@j_bertolotti
There are works made in the modern era that are released into the public domain (either through CC0 or a general public-domain waiver). A good example is the work of Nicky Case. It may take a *bit* of searching, but they exist. You just gotta know where to look.
@j_bertolotti But that points out the *OTHER* more serious problem. The theft was always a side issue. Ethical, legal AIs still destroy industries and remove the ability for people to make a living. I don't feel any less homeless if my job is taken by a "legal" AI.
@j_bertolotti The ”license to steal” is not important for the tech. However, it is very important to the power the tech broligarchy. To be able to break laws is what they actually want.
@j_bertolotti I kind of hope the open source community can assemble a canon of training data such that we can standardize on it and then directly compare various model performances without totally opaque training variances playing too big a role.

@j_bertolotti

The top use of this "trained on public data" AI will be to snitch on all the other AIs.

If your AI has different info than Publicly Trained AI then it is stolen goods and can't be sold or rented.

Dun dumm.
Law and Order
AI

@j_bertolotti

If your business model relies on theft, you don't have a business, you have a criminal organisation.

@j_bertolotti Um, open-licensed doesn’t mean public domain! Very large parts of their corpus for example are 100% copyrighted. All of Hansard and Wikipedia is copyrighted and made available under an open licence, with conditions: for example, any generated text that’s obviously derived from them would be legally required to credit the source. Is their LLM proposing to do that?
@Adzebill @j_bertolotti I'm still waiting for all ai code to be GPL licensed

@j_bertolotti

How did the companies that produce encyclopaedias in the past get their content? They hired experts, scholars, and professional writers to collect and produce knowledge.

How does DeepL train its translation model? The latest models use proprietary data and hire translators to tutor the model.

There are ways to do it, but they cost money and take longer than theft.

@j_bertolotti “openly licenced material”? I sure hope they honour these licences!

cf. https://mbsd.evolvis.org/MirOS-Licence.htm#guidelines

The MirOS Licence (MirBSD)

@mirabilos @j_bertolotti
https://github.com/r-three/common-pile/tree/main/sources/ubuntu
> The content of all Ubuntu channels, whether official logs or otherwise, are considered to be in the public domain.

What… no it's not?
common-pile/sources/ubuntu at main · r-three/common-pile

Code for collecting, processing, and preparing datasets for the Common Pile - r-three/common-pile

GitHub