Mastodawn

When I want to check whether, in the #Formatstring for a date, the lowercase s stands for seconds and the uppercase S for milliseconds, I ask any #GPTbot (whichever one does not say #Quota exceeded because I have not identified myself with an #Api-Key). Why?

The answer may well be in #Wikipedia. But it takes time to find out in which article, list article or sub-article. Wikipedia's #Suche (search) does use #ElasticSearch, but to actually get the benefits of that powerful engine, hundreds of thousands of people would also have had to tag the Wikipedia articles with keywords (#wikidata). Besides, something as practical as format strings may have been classified as #unenzyklpädisch (unencyclopedic) and therefore removed.

On #Stackexchange I have to confirm several times that I am human, then I find a question that was closed unanswered as a #Duplikat (duplicate). Then two outdated ones that are now wrong, then some with a dead link to the solution.

With #archive_org, archive.is and #AnnasArchive I need to know the #URL of the article I am looking for before I can search at all.

A #Suchmaschine (search engine) does not search. A search engine reads the "sitemap.xml" files that website operators publish for the search engines' #crawler. So I find five-year-old articles on websites that have not been maintained for five years. And articles at most one year old that do not answer my question but are listed in the #sitemap. The 100 websites that contain the right answer in a two- to four-year-old article I do not find, because those articles are no longer in the sitemap.
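To make concrete what such a sitemap file holds, here is a minimal Python sketch that lists the URLs and `lastmod` dates from a sitemap document. The sample XML, the `example.org` URLs and the function name `list_sitemap_entries` are illustrative assumptions and not taken from any real site; only the XML namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, made up for this example.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/new-post</loc><lastmod>2025-11-02</lastmod></url>
  <url><loc>https://example.org/old-post</loc><lastmod>2020-03-14</lastmod></url>
</urlset>"""


def list_sitemap_entries(xml_text):
    """Return (url, lastmod) pairs; lastmod is None when the tag is absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


for loc, lastmod in list_sitemap_entries(SAMPLE):
    print(lastmod, loc)
```

A page that the operator drops from this list is simply no longer advertised through the sitemap, which is the effect the post complains about.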

The GPTbots have scraped Wikipedia, stackexchange, archive.org, Anna's Archive and every website, ignoring #robots.txt and the sitemap along the way. I get the right answer, and faster than with any of the options mentioned above.
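As for the format-string question that opened this post: the answer depends on the library, which the post does not name. In Java-style patterns such as `SimpleDateFormat`, lowercase `s` is seconds and uppercase `S` is milliseconds. Python's `strftime`, used in the sketch below, instead writes seconds as `%S` and microseconds as `%f`. The timestamp and the helper name `format_with_millis` are made up for illustration.

```python
from datetime import datetime

# Fixed, made-up timestamp so the result is reproducible: 12:34:56 plus 789 ms.
moment = datetime(2025, 12, 17, 12, 34, 56, 789000)


def format_with_millis(dt):
    """Format a time of day with millisecond precision.

    %S is seconds (00-59) and %f is microseconds (always 6 digits),
    so the last three digits are cut off to leave milliseconds.
    """
    return dt.strftime("%H:%M:%S.%f")[:-3]


print(format_with_millis(moment))  # prints 12:34:56.789
```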

Or I search #Grokipedia. Grokipedia consists of 1 million static pages in the #CDN of #Cloudflare that were scraped from Wikipedia. Its search is a GPTbot and 57 times better than the search in Wikipedia.

@malteengeler @awinkler @evawolfangel @bkastl @Raymond @wikipedia

Sebastian Zdrojewski Dec 17

To the surprise of absolutely no one, I believe, #OpenAI does not follow robots.txt rules with its #GPTbot.

With at least one full crawl of our websites per day, I set a rule to reject their user agent and hoped to see at least a slowdown, but instead the frequency increased.

Oh well, we're starting to ban IPs from the #Azure Cloud, where their crawling comes from. I know that this will reduce our "visibility" in "searches" but... who gives.
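For reference, this is roughly what a rule rejecting that user agent asks of a well-behaved crawler. The sketch uses Python's standard `urllib.robotparser`; the robots.txt content, the `example.org` URL and the helper name `is_allowed` are assumptions for illustration, not the poster's actual configuration. Nothing in the protocol enforces the answer, which is why the posts in this feed fall back to IP bans.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules permit this user-agent token to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("GPTBot", "https://example.org/article"))
print(is_allowed("SomeOtherBot", "https://example.org/article"))
```

The first call reports that GPTBot is not allowed, the second that other agents are. A crawler that ignores the file never performs this check.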

Le site de Korben Dec 15

How do you block the AI crawlers that plunder your site without asking your permission?

https://fed.brid.gy/r/https://korben.info/bloquer-crawlers-ia-robots-txt-htaccess-nginx.html

Klaus Alexander Seiﬆrup Dec 7

@asjo I've been seeing the same pattern for months: #OpenAI's crawlers are slurping up anything they can lay their clammy hands on, no matter what /robots.txt says.

So now I regularly grab the IP addresses from the JSON blobs mentioned on https://platform.openai.com/docs/bots/ and add them to my #iptables.
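A minimal Python sketch of that workflow might look like the following. It assumes the JSON blob has a top-level `prefixes` list whose entries carry an `ipv4Prefix` CIDR string; that shape is an assumption and should be checked against the real files. The sample data uses reserved documentation address ranges, and the function name `iptables_rules` is hypothetical. The script only prints the commands and does not run them.

```python
import ipaddress
import json

# Made-up sample in the ASSUMED shape of the published blobs,
# using reserved documentation address ranges.
SAMPLE_BLOB = """
{"prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv4Prefix": "198.51.100.16/28"}
]}
"""


def iptables_rules(blob_text):
    """Turn a prefix list into iptables DROP commands, returned as strings."""
    data = json.loads(blob_text)
    rules = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix")
        if cidr is None:
            continue  # skip entries without an IPv4 prefix
        network = ipaddress.ip_network(cidr)  # raises ValueError if malformed
        rules.append(f"iptables -A INPUT -s {network} -j DROP")
    return rules


for rule in iptables_rules(SAMPLE_BLOB):
    print(rule)
```

Validating each prefix with `ipaddress` before building the command keeps malformed input out of the firewall rules.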

/cc #ChatGPT, #GPTBot, #OAI, #SearchBot

Eesger

May 28, 2025

#openai
CC: @Javi

You are making a mess of things! A multitude of access logs are now more than ten times the size they were a week ago!

I have upgraded our #abuse detection system accordingly and placed #GPTBot in the penalty box.

This results in better abuse detection in general. For that I thank you. It has also already resulted in #IPblocks of a dozen of your abusive IPs.

I can see the load diminishing on the server now..

2/2

#badbot #openai

spielleitung May 27, 2025

#GPTBot still accounts for roughly 20% of the requests to this Mastodon instance, but the crawler now only gets nonsense generated by #Iocaine. That drastically reduces the amount of data we serve to it and completely ruins the quality of our dataset for it.

So it helps us save costs, makes the LLMs worse, and is a wicked pleasure on top! Win-win-win!

#MastoAdmin #OpenAI

FlohEinstein May 23, 2025

This is -ing unbelievable:
In the 17 hours my "Discworld Ólyfjan" Iocaine has been running, GPTBot has downloaded the same 84 pages over 10000 times. They don't even change!

And Google has it in the search index: searching for "Ólyfjan" [name of any discworld character] returns results.

HEX, the Bursar, even the troll Brick would be more intelligent than that...

#iocaine #aipoisoning #gptbot #chatgpt #discworld

FlohEinstein May 22, 2025

One of the things that annoys me the most is that the scraper that went furthest into the tarpit (83 links deep) is also the one who comes back reading the same pages again and again:

{host="olyfjan.blomi.is",user_agent="Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)",user_agent_group="GPTBot"} has sent 6991 GET requests, for the same 84 pages, downloading 22779416 bytes.
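To put those figures in proportion, here is a small Python calculation using only the three numbers from the log line above. The function and key names are made up for this sketch.

```python
# Figures copied from the log line above.
get_requests = 6991
distinct_pages = 84
bytes_downloaded = 22779416


def crawl_summary(requests, pages, total_bytes):
    """Derive simple ratios from the raw counters."""
    return {
        "fetches_per_page": requests / pages,
        "bytes_per_request": total_bytes / requests,
        "megabytes_total": total_bytes / 1_000_000,
    }


summary = crawl_summary(get_requests, distinct_pages, bytes_downloaded)
print(f"{summary['fetches_per_page']:.1f} fetches per page")
print(f"{summary['bytes_per_request']:.0f} bytes per request")
print(f"{summary['megabytes_total']:.1f} MB in total")
```

That works out to roughly 83 fetches of each page, a little over 3 kB per request, and about 22.8 MB overall.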

#gptbot #aipoisoning #iocaine

Ai Flow Services May 8, 2025

“Since launching my GPT bot & Carrd site with AIFlowServices, I’ve tripled my leads.”
– Jasmine R., Marketing Coach
#aiflowservices #aiautomation #automatioworkflow #gptbot #carrdsite #tripledleads

Rossana Trotta Apr 14, 2025

Markov Tarpits: An Evolving Strategy Against #AI Crawlers

AI web crawlers like #GPTBot, #ClaudeBot, #Amazonbot, and others have become frequent visitors across the web. While gathering web content to power #LLMs, they now represent a significant portion of website traffic—in one case, reaching nearly 70% of total web requests.

As a direct response from the community, some developers have recently revived the tarpit #technology against AI web crawlers.

⚒️ https://oxylabs.io/blog/markov-tarpits-vs-ai-crawlers

Markov Tarpits: An Evolving Strategy Against AI Crawlers

Are AI crawler traps a viable website defense or a lose-lose strategy? Read this post to learn about Markov tarpits and their potential successes and risks.
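The core idea behind a Markov tarpit can be sketched in a few lines of Python: build a table of which word follows which in some source text, then walk that table at random to produce plausible-looking filler. This is a toy illustration under stated assumptions, not how Iocaine or any other tool mentioned in this feed is implemented; the corpus and the names `build_chain` and `babble` are made up.

```python
import random
from collections import defaultdict

# Tiny made-up corpus; a real tarpit would train on far more text.
CORPUS = (
    "the crawler reads the page and the page links to the next page "
    "and the crawler follows the link"
)


def build_chain(text):
    """Map each word to the list of words seen directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain, start, length, rng):
    """Generate `length` words by repeatedly picking a random follower."""
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word never had a follower): restart from the start word.
        word = rng.choice(followers) if followers else start
        out.append(word)
    return " ".join(out)


chain = build_chain(CORPUS)
print(babble(chain, "the", 20, random.Random(42)))
```

Every generated word comes from the corpus, so the output looks locally plausible while carrying no real content. A tarpit serves such text on endlessly linked pages.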