After spending the last few years working really hard to beef up the longer-format histories of abandoned locations on my website, I am seriously considering pulling all but, say, three paragraphs of each and putting the rest on Patreon for subs, because of how disgusted I am that the work I've done will be used in Google's AI training and search results without my consent. It is a very frustrating situation to be placed in.

To be clear, I would really rather not have to do the extra work, and paywalling content runs against what I've wanted to do with my work from the beginning, which is to make info accessible to whoever wants to view it.

But I also really abhor the idea of my writing being absorbed and plagiarized by search summaries that encourage bypassing my site entirely, and I don't know of another way to prevent it.

I would bet a not insignificant amount of money that a good number of other small websites facing the same predicament are mulling their options to prevent data scraping of decades of their work, and deciding whether paywalling is the only way to do it. This will absolutely devastate the internet as we know it. I'm well aware of the social media debacles here, but the issue stretches far beyond corporate behemoths trying to monetize APIs.
Anyway, if you're looking for a time sink, there are tons of photos and site histories on the Abandoned America website that you can check out right now. Get 'em while they're hot.
https://www.abandonedamerica.us/
Abandoned America

Matthew Christopher's Abandoned America: a hauntingly beautiful urban exploration chronicle of the abandoned buildings in our midst and their fascinating histories.

@AbandonedAmerica depending on your setup and skill, it might be possible to implement a simple free user system, so the content is still available but sits behind a login. That will stop automated scrapers. It's a small hassle for users who have to register, but it keeps to the spirit of what you want. A WordPress site could do this easily. (There's also no problem if you decide you DO want to charge people!)
@philbetts @AbandonedAmerica It might be even simpler than that... A CAPTCHA in front of the full content plus terms of use that say the content can't be used in AI models is probably sufficient. No need for individual logins, I don't think.
@jik @AbandonedAmerica CAPTCHA is owned by Google, so I don't imagine it interferes with their indexing. Terms of Use won't make a difference - almost all big AI models are trained on copyrighted material anyway - there are a few cases before the courts, but lots of the damage is already done.
@philbetts @AbandonedAmerica 1) Google does not index pages blocked by CAPTCHAs.
2) CAPTCHA is not "owned" by Google. It's a generic industry term, and there are CAPTCHA implementations from many sources other than Google. 1/2
@philbetts @AbandonedAmerica
3) Google and the other LLM vendors are basically saying, "We scrape everything that lets us." CAPTCHA plus ToS is a clear, explicit indication that you don't let them. If sites do that and they ignore it, they'll have a big honking class action lawsuit or GDPR enforcement on their hands. Not to mention potentially federal criminal charges for unauthorized access. They don't want that. 2/2
@jik @AbandonedAmerica yeah fair, I was thinking reCAPTCHA. ToS is absolutely meaningless though. Maybe robots.txt would work for Google, but the class actions are already happening. Was listening to a podcast about the GitHub Copilot suit today. https://www.gizmodo.com.au/2023/07/a-new-class-action-lawsuit-adds-to-openais-growing-legal-troubles/
@philbetts ToS is absolutely not meaningless. Every class action lawyer and privacy regulator in the country would salivate over a large pool of websites with anti-AI ToS plus CAPTCHA or even just robots.txt that were scraped for training despite them. It makes the case overwhelmingly stronger. And since Congress is mostly broken, lawsuits are probably the only thing that is going to do any good about this in the US.

@doot

Are abandoned buildings as cute as clams? :-?

@AbandonedAmerica
Very cool website. I'm having fun digging into it. Reminds me of how when John Hillcoat was in pre-production for The Road, they didn't build sets because there are a lot of abandoned places in America to mine.
@AbandonedAmerica this may sound naive, genteel, and old fashioned of me, but I take it they're not respecting robots.txt?
@AbandonedAmerica Countdown till Patreon licenses your content to Google for AI training, without telling you, because someone decided it would make a line go up…
@metaning yeah, that would not surprise me at all. Everyone's a scammer these days
@AbandonedAmerica It’s a deeply depressing situation that feels like it happened so suddenly! Will Google also be scraping from Google Drive and Google office suite?
@fiz oh gosh I bet they will or already have
@AbandonedAmerica I'm not trying to tell you how to feel, but I'm curious if there's a way to view this from another perspective. If I had no control of the situation but still wanted to share my content with the world, maybe I could take comfort in knowing that my hard work is now reaching an even wider audience as an unknown contributor to our collective knowledge. What are your thoughts on that?
@livingcoder yeah, why attribute anything to anyone? Let's take all the artists and creators that are already clinging to the edge of a precipice by the meager thread that the ownership of their work affords them and just Sparta kick them into complete destitution and oblivion. That a corporate behemoth has stripped the last benefit I get from my work for its own gain to incorporate it into the "collective knowledge base" that they can then monetize and control is a huge reward for that price.
@AbandonedAmerica
I know!
It really sucks that Google, Apple, etc. can provide your content as "search results" so the searcher never visits your website.
@CheapPontoon yep. It will wreck so many sites that are struggling to stay alive and only doing so through traffic and (unfortunately) ad revenue
@AbandonedAmerica how is what they're doing not a violation of your copyright? Are we just waiting for the court system to catch up on big data scraping for AI data sets?
@xyniden it definitely is, we are, and it probably won't 😕
@AbandonedAmerica I didn't follow all of the details, but I did see a while back that Getty Images is suing Stable Diffusion over their use of copyrighted images in their data sets, so if their case sets a lasting precedent we may be in luck
@xyniden we'll see. It may just get them to back off Getty and steal elsewhere from smaller sources less likely to be able to kick them for it

@AbandonedAmerica @xyniden

Copyright is a biatch because unless it's registered with the U.S. Copyright Office you can only sue for actual damages (loss of revenue). So you're gonna spend $10k to recover $1k.

@AbandonedAmerica It's extra sucky to think about it alongside things like the lawsuit against the Internet Archive where something useful to the world is pitted against copyright claims. But all of these "AI" cons will get away with mass content scraping with zero legal resistance because they pretend robots wrote it. The real source will be lost, and the accuracy of the presentation will be impossible to determine. The worst of both ends.
@NIH_LLAMAS yeah, that definitely is an extra turn of the dagger there
@AbandonedAmerica FWIW I do think Google, at least, respects limitations set by robots.txt
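For anyone wanting to try the robots.txt route mentioned in a few replies, here is a minimal sketch of how such rules could be checked before relying on them. It uses Python's standard `urllib.robotparser`. The rules below are an illustrative assumption, not the site's actual robots.txt: `GPTBot` and `Google-Extended` are the user-agent tokens that OpenAI and Google document for AI-training opt-outs, and the page path is hypothetical. Note that robots.txt is only a request; whether a crawler honors it is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules (an assumption, not taken from the thread):
# block two AI-training crawler tokens, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical page path, used only for the demonstration.
    page = "https://www.abandonedamerica.us/some-history-page"
    for agent in ("GPTBot", "Google-Extended", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

With rules like these, the regular search crawler stays allowed while the AI-training tokens are refused, which keeps the site findable without opting in to training. It does nothing against scrapers that ignore robots.txt, which is where the login or CAPTCHA ideas above come in.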
@AbandonedAmerica
Why does this feel so much like "The Tragedy of the Commons" in progress online?
They'll steal everything we make, then rewrite the history to say they 'had' to take it from us because we couldn't be trusted to be good stewards.