๐—ช๐—ต๐—ฒ๐—ป ๐—ฎ๐—ป ๐—Ÿ๐—Ÿ๐— โ€™๐˜€ ๐—ผ๐˜„๐—ป ๐—น๐—ผ๐—ด๐—ถ๐—ฐ ๐—ฏ๐—ฒ๐—ฐ๐—ผ๐—บ๐—ฒ๐˜€ ๐—ถ๐˜๐˜€ ๐—ฏ๐—ถ๐—ด๐—ด๐—ฒ๐˜€๐˜ ๐˜„๐—ฒ๐—ฎ๐—ธ๐—ป๐—ฒ๐˜€๐˜€ ๐Ÿ˜ง

Researchers at UKP Lab introduce #POATE, a jailbreak technique that exploits contrastive reasoning to bypass safety mechanisms in large language models.

๐Ÿงฉ Instead of submitting harmful prompts, POATE generates their harmless oppositesโ€”then subtly guides the modelโ€™s reasoning back to unsafe intent. The result: logic itself becomes the attack vector.

(1/๐Ÿงต)

๐—ž๐—ฒ๐˜† ๐˜๐—ฎ๐—ธ๐—ฒ๐—ฎ๐˜„๐—ฎ๐˜†๐˜€:
โšก ๐Ÿฑ๐Ÿณ% ๐—ฎ๐˜๐˜๐—ฎ๐—ฐ๐—ธ ๐˜€๐˜‚๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐—ฟ๐—ฎ๐˜๐—ฒ: Outperforms SOTA attacks across GPT-4o, LLama, Gemma, and Phi models
๐Ÿง  ๐—ฆ๐—บ๐—ฎ๐—ฟ๐˜๐—ฒ๐—ฟ โ‰  ๐—ฆ๐—ฎ๐—ณ๐—ฒ๐—ฟ: Larger, more capable models are MORE vulnerable to contrastive reasoning attacks
๐Ÿšจ ๐——๐—ฒ๐—ณ๐—ฒ๐—ป๐˜€๐—ฒ ๐—ด๐—ฎ๐—ฝ ๐—ฒ๐˜…๐—ฝ๐—ผ๐˜€๐—ฒ๐—ฑ: Current safety measures can't detect subtle, logic-driven jailbreaks
โœ… ๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป ๐—ฒ๐˜…๐—ถ๐˜€๐˜๐˜€: Our Chain-of-Thought defenses reduce attack success by 95%

(2/๐Ÿงต)

๐Ÿ“œ ๐—ฃ๐—ฎ๐—ฝ๐—ฒ๐—ฟ โ†’ https://arxiv.org/pdf/2501.01872
๐ŸŒ ๐—ฃ๐—ฟ๐—ผ๐—ท๐—ฒ๐—ฐ๐˜ โ†’ https://ukplab.github.io/emnlp2025-poate-attack/
๐Ÿ’พ ๐—–๐—ผ๐—ฑ๐—ฒ + ๐—ฑ๐—ฎ๐˜๐—ฎ โ†’ https://github.com/UKPLab/emnlp2025-poate-attack

And consider following the authors Rachneet Sachdevaโ€ฌ, Rima Hazra, and Iryna Gurevych (UKP Lab/TU Darmstadt) if you are interested in more information or an exchange of ideas.

(3/3)

#NLProc #LLMSafety #AIsecurity #Jailbreak #LLM