If you give me several paragraphs instead of a single sentence, do you still think it’s impossible to tell?

It’s not its biological origins that make the brain hard to understand, but its complexity. For example, we understand how the heart works pretty well.

While LLMs are nowhere near as complex as a brain, they’re complex enough to make it extremely difficult to understand.

But then there comes the question: if they’re so difficult to understand, how did people make them in the first place? The way they did it actually bears some similarities to evolution.

They created an “empty” model - a large network that didn’t yet do anything useful or meaningful. But it had billions of parameters, and if you tweak a parameter, its behavior changes slightly.

Then they expended an enormous amount of computing power tweaking parameters, each tweak slightly improving its ability to model language. While doing this, they didn’t know what each number meant. They didn’t know how or why each tweak was improving the model, just that each tweak was making an improvement.

Unlike evolution, each tweak isn’t random. There’s an algorithm called back-propagation that can tell you how to tweak the neural network so that it predicts some known data slightly better. But it doesn’t tell you anything about why the tweak is good.
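
To make that concrete, here’s a toy sketch of the idea, with a single parameter and an explicit gradient formula standing in for back-propagation through a real network (the dataset and numbers are made up for illustration):

```python
# Toy version of the training loop: one parameter w, a tiny dataset,
# and repeated tweaks that each make the predictions slightly better.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs; the hidden rule is y = 2x

w = 0.0              # the "empty" model: a parameter that starts out meaningless
learning_rate = 0.01

for step in range(200):
    # How badly the current w predicts the known data (mean squared error).
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    # Which direction to nudge w so the error goes down.
    # (For a real network, back-propagation computes this for billions of parameters.)
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad   # the tweak: a small step that improves the fit
    if step % 50 == 0:
        print(step, round(loss, 4), round(w, 4))  # loss shrinks, w creeps toward 2

print(w)  # ends up close to 2.0, but the procedure never "knew" why 2.0 is right
```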

I don’t see how that affects my point.

Today’s AI detectors can’t distinguish the output of today’s LLMs from human writing. Future AI detectors WILL be able to distinguish the output of today’s LLMs. Of course, future AI detectors won’t be able to distinguish the output of future LLMs.

So the claim that “all text after 2023 is forever contaminated” just isn’t true. Researchers may simply have to be a bit more careful about including recent text in the training data.

Not really. If it’s truly impossible to tell the text apart, then it doesn’t really pose a problem for training AI. Otherwise, next-gen AI will be able to identify text generated by current-gen AI, and it will get filtered out. So only the most recent data will have unfiltered shitty AI-generated stuff in it, but they don’t train AI on super-recent text anyway.
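
For what it’s worth, the filtering step could look something like this (ai_probability is a hypothetical future detector, not an existing library, and the cutoff date is just illustrative):

```python
from datetime import date

LLM_ERA_START = date(2023, 1, 1)  # rough, illustrative cutoff

def ai_probability(text: str) -> float:
    """Placeholder for a future detector of current-gen AI text (the assumption being discussed)."""
    raise NotImplementedError

def keep_for_training(text: str, published: date, threshold: float = 0.5) -> bool:
    # Text from before the LLM era can't be AI-generated; keep it as-is.
    if published < LLM_ERA_START:
        return True
    # Recent text: keep it only if the detector doesn't flag it as likely AI-generated.
    return ai_probability(text) < threshold
```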

They don’t redistribute. They learn information about the material they’ve been trained on - not the material itself* - and can use it to generate material they’ve never seen.

* Bigger models seem to memorize some of the material and can infringe, but that’s not really the goal.
Language models actually do learn things, in the sense that the information encoded in the trained model isn’t (usually) taken directly from the training data; instead, it’s information that describes the training data. That’s why they can generate text that’s never appeared in the data.
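
A toy bigram model makes the distinction concrete (an LLM is vastly more sophisticated, but the principle is similar): what gets stored is word-pair statistics describing the corpus, and generation recombines those statistics.

```python
import random
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# "Training": record which word follows which. This is information *about* the
# corpus, not a copy of it.
following = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        following[a].append(b)

# "Generation": walk the learned statistics to produce text.
random.seed(1)
word, output = "the", ["the"]
for _ in range(5):
    word = random.choice(following.get(word, ["the"]))
    output.append(word)

print(" ".join(output))  # often a sentence that appears nowhere in the corpus
```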

It’s specifically distribution of the work or derivatives that copyright prevents.

So you could make an argument that an LLM that’s memorized the book and can reproduce (parts of) it upon request is infringing. But it shouldn’t be infringement just because the LLM was trained on the book - it needs to actually reproduce it.

Why should such a thing be assumed???
It’s actually a real problem on reddit, where people spin up fake users to manipulate votes. Reddit hasn’t published exactly how they detect that, but one way to do it is to look for bad voting patterns, like one account systematically upvoting/downvoting another. But you pretty much can’t do that if the votes are secret.
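
Something like this rough sketch, say (the data shape and thresholds are invented for illustration; Reddit hasn’t published its actual method):

```python
from collections import Counter

def suspicious_voters(votes, min_votes=20, concentration=0.8):
    """votes: iterable of (voter, author_of_target_post, vote_value) tuples."""
    per_voter = Counter()   # total votes cast by each voter
    per_pair = Counter()    # votes from a given voter onto a given author

    for voter, author, _value in votes:
        per_voter[voter] += 1
        per_pair[(voter, author)] += 1

    # Flag voters whose votes are heavily concentrated on one other account.
    return [
        (voter, author)
        for (voter, author), n in per_pair.items()
        if per_voter[voter] >= min_votes and n / per_voter[voter] >= concentration
    ]
```
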
True - but it’ll be much easier to detect.