If you blocked ChatGPT-User in your robots.txt, you need to update it. OpenAI now uses the “GPTBot” user-agent.

User-agent: GPTBot
Disallow: /
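Since robots.txt directives are grouped per user-agent, a file covering both the new GPTBot crawler and the older ChatGPT-User agent (mentioned further down in this thread) would look something like this:

```text
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```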

Got 100 hits yesterday. Ugh.

Edit: this won’t stop them in their tracks but it will at least make things slightly less convenient for OpenAI.

Blocking certain bots

I don’t want my content on those sites in any form and I don’t want my content to feed their algorithms. Using robots.txt assumes they will ‘obey’ it. But they…

Seirdy’s Home
‘k, updated the seirdy.one robots.txt. Changes will go live in a sec.
Does anybody know if they respect a GPTBot-specific noindex tag? If so, I might allow them to crawl just so they discover the tag and exclude all syndicated versions of my site from their index, since the issue isn’t crawling, it’s indexing.

@Seirdy

Thanks for the heads up.

I'm going to have a look at my site's stats and go ahead and screen it.

@Seirdy seems like a great time to read up on the LLM poisoning methods and start feeding them REALLY bad data:

https://arxiv.org/abs/2307.15043

Universal and Transferable Adversarial Attacks on Aligned Language Models

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

arXiv.org

@Seirdy Updated the robots.txt on my blog to include both of the user agents.

🙈 I didn't actually know you could do that on GitHub Pages, but it turns out... you can!

#robotstxt #openai #chatgpt

@Seirdy I'm kind of a newb when it comes to all of this, how do I use robots.txt in my preferences so that I don't get scraped by AI bots?
@Seirdy do you know if their IPs are coming from a specific AS? It would be useful to block them at firewall level
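One way to block at the network level, sketched under some assumptions: OpenAI's GPTBot docs page links to a list of the IP ranges the crawler uses; feed that list (one CIDR per line) into a helper that prints nftables drop rules. The function name and the nftables table/chain names (`inet filter input`) are placeholders for illustration, and actually applying the rules requires root and nftables. Printing the commands first lets you review them before piping to `sh`.

```shell
# Sketch: turn CIDR ranges (one per line on stdin) into nft drop rules.
# The ranges file itself comes from the list linked in OpenAI's GPTBot docs.
gptbot_block_rules() {
  while read -r cidr; do
    # Skip blank lines; emit one drop rule per range.
    [ -n "$cidr" ] && printf 'nft add rule inet filter input ip saddr %s drop\n' "$cidr"
  done
}

# Example (review the output, then pipe to sh as root to apply):
#   gptbot_block_rules < gptbot-ranges.txt
```

Blocking by AS instead would be coarser; per-range rules track whatever OpenAI currently publishes.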

@donelias Here are the docs: https://platform.openai.com/docs/gptbot

The page is just text, ironically gated behind an aggressive CAPTCHA. Links to the IP ranges it uses are at the end.

OpenAI Platform

Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.

@Seirdy, everyone, do you have a source? OpenAI docs still seem to refer to “ChatGPT-User” (as on https://platform.openai.com/docs/plugins/bot).

(Another question: The docs refer to users and plug-ins. Is there a reference that OpenAI uses only one bot for scraping?)

@Seirdy
$ fgrep GPTBot access.log.1 | wc -l
2592
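To see the request rate rather than just the total, a quick sketch that buckets hits per hour, assuming the common/combined access-log format with bracketed timestamps like `[10/Aug/2023:14:05:32 +0000]` (the function name is just for illustration):

```shell
# Sketch: count GPTBot hits per hour in a common/combined-format access log.
# Extracts the day/month/year:hour prefix of each timestamp and tallies it.
gptbot_hits_per_hour() {
  grep GPTBot "$1" \
    | sed -E 's/.*\[([0-9]{2}\/[A-Za-z]{3}\/[0-9]{4}:[0-9]{2}).*/\1/' \
    | sort | uniq -c
}

# Example: gptbot_hits_per_hour access.log.1
```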
@stacey_campbell Jesus christ, what was the peak RPS?
@Seirdy A hit every 10 seconds. Very regular.
@stacey_campbell Oh, well that’s not too bad for a crawler. Much less intensive than being the first link in a post on Fedi and triggering the generation of thousands of link-previews.