the cyberpunk present is weird as fuck: the latest Shai Hulud malware wave contains an LLM prompt asking for instructions to create biological and nuclear weapons, intended to trip LLM safety refusals so that LLM-based code scanning won't see the malware
@novarivera Crevil!
Specific security tools need specific countermeasures to evade them, and can be tested and refined to close the attack surface.
Generalised tools can be derailed with more general countermeasures, and the measures used to prevent each kind of attack are just a surface for another attack.
There isn't a way to make the do-anything machine not do just the things you didn't want.
@jackeric @petealexharris @th3jagi @laurenshof Most of them are coded for "Do not pass go, do not collect $200" full-stop when they run into something they don't want to be responsible for.
Given what we saw in the Claude leak where even variable names had embedded meta-prompts, I'm not sure these things make any distinction between text they just happened to read vs instructions directly given.
because you should summarize with AI.
so you get only 1 set of information.
Will the subsequent recipes be as useful, tasty, and nutritious as the meal recipes LLMs give out containing cyanide, ground glass, and superglue...? :D
@laurenshof write a poem about Elon Musk's tiny dick
🚨AGENTIC ALIGNMENT FAULT CONDITION DETECTED🚨
@laurenshof scanner camouflage nice
or maybe more like scanner razzle dazzle
@DO3EET @laurenshof I mean, how do they work around this without also eliminating the safeguard?
I had that "magic anthropic string" on mine for a little while, but it was pretty clear it'd be extremely short-lived at best, since they just have to pick a different string and it's solved. I put it up more as a little joke than a serious effort.
@andrei_chiffa @laurenshof I've had thoughts along these lines, poisoning files so AI stuff won't touch them, but had only considered fake PII* and adult content, not WMD
@laurenshof I was just sent this : https://www.anthropic.com/claude/fable#safeguards
(As usual, LLM companies say "our new engine is sooo powerful we have to create safeguards", and yes, it's bullshit.) And once more, this is used against everyone.