We were not crazy. We were right.
Amazing work by our @robb corroborated by extensive analysis at Wired:
Perplexity Is a Bullshit Machine https://www.wired.com/story/perplexity-is-a-bullshit-machine/
Regulation in this space cannot come soon enough.
AI companies that want to scrape the web for training purposes, or use their bots to summarize webpages, should follow a strict set of guidelines with identifiable user-agents and IP addresses.
Publishers should have the right to opt out of any AI access, to request details on whether their copyrighted content is included in any model, and, if so, to request that it be removed and the model retrained.
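For context, the closest thing publishers have today is robots.txt with the crawlers' published user-agent tokens (e.g. OpenAI's GPTBot, Perplexity's PerplexityBot). A minimal opt-out might look like this — though it is a purely voluntary convention, which is exactly the problem the Wired piece documents:

```
# robots.txt — ask known AI crawlers not to fetch anything
# (only effective if the bot actually honors it)
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Binding regulation would turn this kind of request into an obligation rather than a suggestion.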
Hopefully the EU's AI Act will help.
Most of all, we need to let go of this notion that open web = okay for commercial companies to scrape, ingest, and train their models.
If I had wanted to open an English school, I would have opened one to teach the English language. But I didn't.
I have a website that is free to read, but my copyrighted material is mine, and it shouldn't serve as the foundation of any other commercial product.
I wish more people would understand this concept.
@viticci It seems obvious to me that creating an LLM by training it with a bunch of inputs makes it a derived work of those inputs. Output of the LLM is then a derived work of the LLM. Distributing the LLM or that output would then violate copyright of all the inputs unless it falls under fair use, and it doesn’t seem like most LLM usage would.
I suspect the law would also find this obvious if not for the fact that it’s big businesses doing it.