We were not crazy. We were right.

Amazing work by our @robb corroborated by extensive analysis at Wired:

Perplexity Is a Bullshit Machine https://www.wired.com/story/perplexity-is-a-bullshit-machine/

A WIRED investigation shows that the AI-powered search startup Forbes has accused of stealing its content is surreptitiously scraping—and making things up out of thin air.

Regulation in this space cannot come soon enough.

AI companies that want to scrape the web for training purposes, or use their bots to summarize webpages, should follow a strict set of guidelines with identifiable user-agents and IP addresses.
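As a sketch of what identifiable user-agents would enable on the publisher's side, here is a minimal server-side check against a list of crawler tokens. The token list is illustrative, not exhaustive, and the whole approach only works if crawlers self-identify honestly:

```python
# Minimal sketch: flag requests whose User-Agent names a known AI crawler.
# The token list below is illustrative and would need to be kept up to date.
AI_BOT_TOKENS = ["GPTBot", "PerplexityBot", "CCBot", "anthropic-ai"]

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a known AI bot token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)

print(is_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))        # True
print(is_ai_crawler("Mozilla/5.0 (Macintosh; Intel Mac OS X)"))     # False
```

Which is exactly why identification matters: a check like this is trivially defeated by a bot that lies about who it is, the behaviour the Wired piece describes.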

Publishers should have a right to opt out of any AI access, to request details on whether their copyrighted content is included in any model, and, if so, to request that it be removed and the model retrained.

Hopefully the EU's AI Act will help.

Most of all, we need to let go of this notion that open web = okay for commercial companies to scrape, ingest, and train their models.

If I wanted to open an English school, I would have opened a school to teach the English language. But I didn't.

I have a website, which is free to read, but my copyrighted material is mine and shouldn't serve as the foundation of any other commercial product.

I wish more people would understand this concept.

@viticci People (well, corporations) see the word "open" and think that means "open to do whatever I want with" which is definitely untrue!
@viticci I simply can’t get over the fact that our favorite “privacy is a basic human right, advertisers should ask your permission before tracking you, you should be in control of your data, etc.” company doesn’t seem to understand this either. 😌
@cabel @viticci I think what they understand is that the potential financial incentive for undermining those principles is now much greater than the potential financial incentive for upholding them during the zero-interest-rate era.

@cabel @viticci I’m not a content creator, but I really do appreciate the thoughts expressed by @ismh86 and @jsnell. It’s a complicated situation, there’s no easy solution, and it’s okay to have complicated, opposing opinions.

I also respect that others, such as yourself, Federico, can feel differently. You’re a creator so you have a completely different perspective than I do.

@cabel @viticci I agree with you on explicit permissions, but we need to make a distinction between "permission to use the content to train an LLM model" and "permission to simply access the content (without training)".

Tools like Perplexity may be denied the permission to train the LLM on the content of your page, but how do you prevent them from reading + summarising the page?

If you don't want the content to be summarised, that's fine, but that should be a separate permission to ask for.

@cabel @viticci there should be three levels of permission: 1) indexing, 2) reading + summarising, 3) training an LLM.

How do we specify this with a simple robots.txt? I think we can't, at the moment.

A new specification/standard is probably needed (and then we need laws to enforce it).
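To make the idea concrete, a hypothetical extension of robots.txt might look something like this. To be clear, none of these directives exist in the current robots.txt convention; this is purely a sketch of the three levels:

```
# Hypothetical syntax: these directives are NOT part of any existing standard.
User-agent: ExampleAIBot
Allow-Index: /            # 1) indexing for search is fine
Disallow-Summarise: /     # 2) no reading + summarising
Disallow-Train: /         # 3) no LLM training on this content
```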

@cabel @viticci
If you try to refer to Rotten Fruits here, you misunderstand Apple's position on privacy.

They like privacy for PR purposes, and for being a tiny little bit better than Android. But Apple's apps are as privacy-invasive as Google's (the tiny bit of extra privacy protection does not apply to your relationship with your primary cult, Apple; you wouldn't want to keep your sins from your digital pastor, would you?).

@viticci oh but they DO understand, they are just selective when it comes to enforcement.

Try and use their music, video, software, or whatever without their permission and *then* it’s a crime.

@viticci imo this is in the same vein as using an ad blocker, and I know you’ve said you use content blockers in the past.
@viticci

They do understand, they just don't care.
@viticci maybe you need to put an explicit licence on your site. Like how Amazon licences Kindle books to you to read, and then uses DRM to stop you lending them, printing them, copying segments from them, etc. Software often comes with open-source licences that restrict what types of usage you are allowed to perform. I feel that legal recourse will become much simpler if people are explicit about how their copyrighted content may be used. Creative Commons has licences ready to use.
@viticci legitimate question: how is scraping for indexing in a web search engine that powers their business model with ads different from what you are pointing out? Is it that you get something in return from the search engine? Is that the main differentiator?

@viticci

Actually, that depends on your jurisdiction and what your copyright law says. Although AI training seems to be a new kind of usage, it probably isn't (scraping, processing the data, and producing output that statistically depends on the scraped data has been done for years, if not decades).

EU copyright law actually has an exemption for copying material for educational purposes.

That's why you nowadays usually get everything you need as a student via Moodle.

@viticci Back in the days of my first studies (1990s), my parents literally spent a fortune on textbooks for me. (Medicine was especially painful: inflation-corrected, €2000-3000 per semester for books was quite realistic. Free university != free books.)

Basically that's also why most courses on uni moodles are behind a registration wall → the copyright exemption is only for students you teach.

@viticci they will understand, as soon as you take something they make that’s available for free and turn it into a different product … 😒

The hypocrisy! 😤

@viticci @Gargron
I want to throw a book at them and ask them if they own the book.

• They own that mass.
• They can read that mass.
• They can copy that mass for personal use.
• They cannot copy that mass for selling.
• They own a thing. They also own a copy of an idea of a thing. They don’t own the idea of that thing.

• You may visit my website.
• You may read my website.
• As it is public, you may even scrape my website.
• You can also build an AI off of my website. You really can… but for personal use.

BUT

• Your AI contains part of my website so if you want to sell it you’ve got to ask me and all the owners first.
• WTF kind of argument is "that isn't practical"? No, it isn't. It just means what you did was stupid.

AI in Silicon valley has gotten where it is on rich white male privilege expressed in the legal framework.

Now, if you find the A in #AI offensive and you declare it Electrical Intelligence (#EI) as life with human rights and learning, then we can have an ethics conversation.

But as long as you say you own it, it isn’t learning, it’s processing and you can fuck off.

@viticci Someone please invent a web tool that can exclude scrapers.

@viticci It seems obvious to me that creating an LLM by training it with a bunch of inputs makes it a derived work of those inputs. Output of the LLM is then a derived work of the LLM. Distributing the LLM or that output would then violate copyright of all the inputs unless it falls under fair use, and it doesn’t seem like most LLM usage would.

I suspect the law would also find this obvious if not for the fact that it’s big businesses doing it.

@mikeash yeah, I think like this too. But. A person is a derived work of all their inputs too, right? It's a thought that I'm struggling with. Where is the line? What is "I created" vs "I copied" @viticci
@bealex @mikeash A person is not a commercial service
@viticci But one can provide it? I'm not arguing by any means, just want to find out where the distinction is @mikeash
@bealex @viticci The law recognizes creativity here. For example, compilations without creative input in what’s selected (such as telephone white pages, if anyone remembers those) don’t qualify for copyright, but with enough creative input they are. Whether a computer can perform a creative act is probably unanswered as of yet, but it seems to me that LLMs would be pretty far from clearing that bar.
@mikeash OK, this makes a lot of sense. Thanks! @viticci
@viticci unfortunately there aren't a lot of actually rich people with an axe to grind to sue them for copyright infringement.
@viticci Oh, I'm sure they have plenty of lawyers who understand these concepts exactly. But unfortunately they also understand that they can steal your copyrighted work, and because of "copyright laundering" you can't do anything about it.
@viticci Where’s an overzealous federal prosecutor like Carmen Ortiz when you need one
@viticci I think what we are getting at is that HTML (web pages) needs to have DRM (rights management) baked in, even for text, which it currently doesn't. An HTML file should explicitly state the rights of the content creator and shouldn't allow a bot to read it without the creator's prior consent. We need DRM built into HTML 6. HTML was originally built to freely distribute information, and it is obviously broken if creators can't protect their content from being scraped into LLMs.
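There is already a small, non-binding gesture in this direction: some sites have adopted community-proposed "noai" robots meta values. These are not part of any HTML standard and are honoured purely voluntarily, e.g.:

```html
<!-- Community-proposed, non-standard values; compliance is voluntary. -->
<meta name="robots" content="noai, noimageai">
```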
@ryanoff @viticci “Broken”? Or “being used for the wrong purpose”? You point out that HTML was originally intended as a way to share freely; on that basis it isn’t a tool for “protected” distribution (you don’t use ham radio for national secrets). 1/2
@ryanoff @viticci On the *other* hand, any piece of creative work *without* an explicit licence is under restrictive copyright protection by default. You requested a file by HTTP and my server gave it to you… what rights does the HTTP spec say I’ve granted you? 2/2
@johnaldis I’m not quite following the argument. Are you saying that all HTML pages are copyrighted by default?
An HTTP server doesn’t address copyright (and that is what I am saying is the problem). The server just gives the content to anyone who asks for it.
@johnaldis my comment isn’t an argument about copyright law, it is about what the technology allows. If you post an HTML page on a public server, anyone can copy it. Why are people posting information they believe is copyrighted on a medium where it can be so easily copied?
@viticci imo they shouldn't be allowed to scrape anything without permission, opt-in not opt-out
@viticci there's already regulation. The recent copyright directive does address the issue of text and data mining.
@viticci I wish that people who ask for more regulation actually knew the law and actually got involved in the democratic process of drafting and approving it, instead of paying no attention to those who ask people to participate.
@DiogoConstantino we are literally getting in touch with the EU as well as my Italian representatives about the AI Act 😉

@viticci The AI Act is already done... approved and published (https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021AE2482)

So is the Copyright Directive:

Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A32019L0790

@viticci on this matter it's important for people to read TITLE II of Directive 2019/790, which creates a copyright exception for text and data mining (Article 3) and a limitation to that exception (Article 4).
@viticci I'm a bit annoyed because I spent many hours, over many years, trying to get people involved with this, and people didn't care. Now that it's too late, people care.
@viticci In my opinion this sort of thing should be opt in at both the platform and user level.
@viticci

Companies selling AI-enabled solutions hate when you ask, "if I buy your service, how do I keep my data from becoming ingested into Plagiarism as a Service offerings?"
@viticci I completely agree with all but the last bit. I get quite frustrated hearing that content is being slurped by companies to generate a profit and share none of that with the contributors that made it possible. However, forcing a retraining is going to amplify the carbon release of AI yet further. I want to see a solution that doesn’t, quite literally, cost the earth.
@viticci you need to read up on @pluralistic Cory Doctorow's comments on this. Copyright won't save us because of the way it works. https://pluralistic.net/2023/02/09/ai-monkeys-paw/

@viticci i basically agree with this. But when we talk about the environmental impact of training MLs, i really don't want them to retrain every 14 days (i realize they probably already are training new stuff all the damned time)
@viticci @robb oooh Wired is in super spicy mode 🌶️🌶️🌶️
“calling [Perplexity] an “AI startup” is somewhat misleading; it would perhaps be more accurately described as a sort of remora attached to existing AI systems”
@viticci @robb ok, say no more. I stopped reading in the middle of the article, opened a new tab and deleted my Bullshixity account.