We were not crazy. We were right.

Amazing work by our @robb corroborated by extensive analysis at Wired:

Perplexity Is a Bullshit Machine https://www.wired.com/story/perplexity-is-a-bullshit-machine/

A WIRED investigation shows that the AI-powered search startup Forbes has accused of stealing its content is surreptitiously scraping—and making things up out of thin air.

Regulation in this space cannot come soon enough.

AI companies that want to scrape the web for training purposes, or use their bots to summarize webpages, should follow a strict set of guidelines with identifiable user-agents and IP addresses.

Publishers should have a right to opt out of any AI access, request details as to whether their copyrighted content is included in any model, and if so, request that it gets removed and the model re-trained.
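
For context, the closest thing to an opt-out today is robots.txt, and it only works when a bot announces itself honestly. Here is a minimal sketch in Python, assuming a hypothetical site at example.com; `PerplexityBot` is used as an illustrative user-agent token, and `may_fetch` is a helper name made up for this example, not part of any real crawler:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one declared AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # A crawler that identifies itself is refused by the rules above.
    print(may_fetch("PerplexityBot", url))
    # The same crawler posing as a browser is allowed through.
    print(may_fetch("Mozilla/5.0", url))
```

The second call is the whole problem: the check depends entirely on the user-agent the bot chooses to send, and nothing in robots.txt enforces it, which is why identifiable user-agents and IP addresses matter.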

Hopefully the EU's AI Act will help.

Most of all, we need to let go of this notion that open web = okay for commercial companies to scrape, ingest, and train their models.

If I wanted to open an English school, I would have opened a school to teach the English language. But I didn't.

I have a website, which is free to read, but my copyrighted material is mine and shouldn't serve as the foundation of any other commercial product.

I wish more people would understand this concept.

@viticci I think what we are getting at is that HTML (web pages) needs to have DRM (rights management) baked in, even for text, which it currently doesn’t. An HTML file should explicitly state the rights of the content creator and shouldn’t allow a bot to read it without prior consent of the creator. We need DRM built into HTML 6. HTML was originally built to freely distribute information, and it is obviously broken if creators can’t protect their content from being scraped for LLMs.
@ryanoff @viticci “Broken”? Or “being used for the wrong purpose”? You point out that HTML was originally intended as a way to share freely; on that basis it isn’t a tool for “protected” distribution (you don’t use ham radio for national secrets). 1/2
@ryanoff @viticci On the *other* hand, any piece of creative work *without* an explicit licence is under restrictive copyright protection by default. You requested a file by HTTP and my server gave it to you… what rights does the HTTP spec say I’ve granted you? 2/2
@johnaldis I’m not quite following the argument. Are you saying that all HTML pages are copyrighted by default?
An HTTP server doesn’t address copyright (and that is what I am saying is the problem). The server just gives the content to anyone who asks for it.
@johnaldis my comment isn’t an argument about copyright law, it is about what the technology allows. If you post an HTML page on a public server, anyone can copy it. Why are people posting information they believe is copyrighted on a medium that can be easily copied?