We were not crazy. We were right.
Amazing work by our @robb corroborated by extensive analysis at Wired:
Perplexity Is a Bullshit Machine https://www.wired.com/story/perplexity-is-a-bullshit-machine/
Regulation in this space cannot come soon enough.
AI companies that want to scrape the web for training purposes, or use their bots to summarize webpages, should follow a strict set of guidelines with identifiable user-agents and IP addresses.
Publishers should have the right to opt out of any AI access, to request details on whether their copyrighted content is included in any model, and, if so, to request that it be removed and the model re-trained.
Hopefully the EU's AI Act will help.
Most of all, we need to let go of this notion that open web = okay for commercial companies to scrape, ingest, and train their models.
If I had wanted to open an English school, I would have opened a school to teach the English language. But I didn't.
I have a website, which is free to read, but my copyrighted material is mine and shouldn't serve as the foundation of any other commercial product.
I wish more people would understand this concept.
@cabel @viticci I agree with you on explicit permissions, but we need to distinguish between "permission to use the content to train an LLM" and "permission to simply access the content (without training)".
Tools like Perplexity may be denied permission to train an LLM on the content of your page, but how do you prevent them from reading and summarising the page?
If you don't want the content to be summarised, that's fine too, but that should be a separate permission to request.
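The closest thing publishers have today is robots.txt, which can target individual crawler user-agents. A minimal sketch, assuming the vendors honour their published crawler tokens (GPTBot and PerplexityBot are real tokens; compliance is voluntary on the crawler's side):

```
# Deny OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Deny Perplexity's crawler
User-agent: PerplexityBot
Disallow: /

# Everyone else may crawl
User-agent: *
Allow: /
```

Note the limitation: robots.txt only says "may crawl" or "may not crawl". It has no vocabulary for "may read and summarise, but not train", which is exactly the separate permission missing here.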