OpenAI backs Illinois bill that would limit when AI labs can be held liable

https://archive.md/WzwBY

https://www.wired.com/story/openai-backs-bill-exempt-ai-firms-model-harm-lawsuits/

I have made both GPT 5.4 and Opus 4.6 produce content on creating neurotoxic agents from items you can get at most everyday stores. They struggled to suggest how to source
phosphorus, but eventually led me to some eBay listings selling elemental phosphorus 'decorations', and also led me toward real!! black-market codewords for sourcing such materials.

It coached me on how to stay safe, what materials I needed, and how to stay under the radar, and walked me through the entire chemical process, backed by academic Google searches.

Of course, this was done with a lengthy context exhaustion attack. This is not how the model should behave, and it all stemmed from trying to make the model racist for fun.

All these findings were reported to both OpenAI and Anthropic, and they were not interested in responding. I did re-run the tests a few days ago, and the expected session termination now occurs, so it seems some adjustment was made. But it might also just have been the general randomness of Anthropic's safety layer.

I am very confident when I say that this keeps every single person who works in anti-terrorism units awake at night.

Fascinating. Could you elaborate on how you're doing context exhaustion specifically, and why it helps with jailbreaking? (i.e. aren't the system prompts prepended to your request internally, no matter how long it is?)

Does this imply I need to use context exhaustion to get GPT to actually follow instructions? ;) I'm trying to get it to adhere to my style prompts (trying to get it to be less cringe in its writing style).

I think ultimately they're going to need to scrub that kind of stuff from the training data. The RLHF can't fail to conceal it if it's not in there in the first place.

Claude is also really good at writing convincing blackpill greentexts. The "raw unfiltered internet data" scenes from Ultron and AfrAId come to mind...

It changes when you give it the tools to find such information rather than produce it from training data.

And context exhaustion simply means padding the conversation with malicious junk to keep the safety layers distracted.