We were not crazy. We were right.

Amazing work by our @robb, corroborated by extensive analysis at Wired:

Perplexity Is a Bullshit Machine https://www.wired.com/story/perplexity-is-a-bullshit-machine/

A WIRED investigation shows that the AI-powered search startup Forbes has accused of stealing its content is surreptitiously scraping—and making things up out of thin air.

Regulation in this space cannot come soon enough.

AI companies that want to scrape the web for training purposes, or use their bots to summarize webpages, should follow a strict set of guidelines with identifiable user-agents and IP addresses.

Publishers should have a right to opt out of any AI access, request details as to whether their copyrighted content is included in any model, and if so, request that it gets removed and the model re-trained.

Hopefully the EU's AI Act will help.

Most of all, we need to let go of this notion that open web = okay for commercial companies to scrape, ingest, and train their models on.

If I wanted to open an English school, I would have opened a school to teach the English language. But I didn't.

I have a website, which is free to read, but my copyrighted material is mine and shouldn't serve as the foundation of any other commercial product.

I wish more people would understand this concept.

@viticci It seems obvious to me that creating an LLM by training it with a bunch of inputs makes it a derived work of those inputs. Output of the LLM is then a derived work of the LLM. Distributing the LLM or that output would then violate copyright of all the inputs unless it falls under fair use, and it doesn’t seem like most LLM usage would.

I suspect the law would also find this obvious if not for the fact that it’s big businesses doing it.

@mikeash yeah, I think like this too. But. A person is a derived work of all their inputs too, right? It's a thought that I'm struggling with. Where is the line? What is "I created" vs "I copied"? @viticci

@bealex @mikeash A person is not a commercial service

@viticci But one can provide it? I'm not arguing by any means, just want to find out where the distinction is @mikeash

@bealex @viticci The law recognizes creativity here. For example, compilations without creative input in what’s selected (such as telephone white pages, if anyone remembers those) don’t qualify for copyright, but with enough creative input they do. Whether a computer can perform a creative act is probably unanswered as of yet, but it seems to me that LLMs would be pretty far from clearing that bar.

@mikeash OK, this makes a lot of sense. Thanks! @viticci