We were not crazy. We were right.

Amazing work by our @robb corroborated by extensive analysis at Wired:

Perplexity Is a Bullshit Machine https://www.wired.com/story/perplexity-is-a-bullshit-machine/

A WIRED investigation shows that the AI-powered search startup Forbes has accused of stealing its content is surreptitiously scraping—and making things up out of thin air.

Regulation in this space cannot come soon enough.

AI companies that want to scrape the web for training purposes, or use their bots to summarize webpages, should follow a strict set of guidelines with identifiable user-agents and IP addresses.

Publishers should have a right to opt out of any AI access, request details as to whether their copyrighted content is included in any model, and if so, request that it be removed and the model retrained.

Hopefully the EU's AI Act will help.

Most of all, we need to let go of this notion that open web = okay for commercial companies to scrape, ingest, and train their models.

If I had wanted to open an English school, I would have opened a school to teach the English language. But I didn't.

I have a website, which is free to read, but my copyrighted material is mine and shouldn't serve as the foundation of any other commercial product.

I wish more people would understand this concept.

@viticci I simply can’t get over the fact that our favorite “privacy is a basic human right, advertisers should ask your permission before tracking you, you should be in control of your data, etc.” company doesn’t seem to understand this either. 😌

@cabel @viticci I agree with you on explicit permissions, but we need to make a distinction between "permission to use the content to train an LLM" and "permission to simply access the content (without training)".

Tools like Perplexity may be denied the permission to train the LLM on the content of your page, but how do you prevent them from reading + summarising the page?

If you don't want the content to be summarised, that's fine, but that should be a separate permission to request.

@cabel @viticci there should be three levels of permissions: 1) indexing, 2) reading + summarising, 3) training LLMs.

How do we specify this with a simple robots.txt? I think we can't, at the moment.

A new specification / standard is probably needed (and then we need laws to enforce it).
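For context, the closest thing available today is per-crawler allow/deny rules in robots.txt, which collapse all three permission levels into a single yes/no per bot. A rough sketch of that blunt instrument (the user-agent strings below are the ones these companies have publicly documented, but compliance is voluntary and this expresses nothing about *how* fetched content may be used):

```
# robots.txt — one allow/deny per crawler, with no way to
# distinguish indexing, reading + summarising, or LLM training.

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Traditional search indexing stays allowed.
User-agent: Googlebot
Allow: /
```

Note that a rule like this only blocks a declared training crawler; it cannot stop an on-demand "read + summarise" fetch that identifies itself differently (or not at all), which is exactly the gap a new standard would need to close.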