Mastodawn

𝗪𝗵𝗲𝗻 𝗮𝗻 𝗟𝗟𝗠’𝘀 𝗼𝘄𝗻 𝗹𝗼𝗴𝗶𝗰 𝗯𝗲𝗰𝗼𝗺𝗲𝘀 𝗶𝘁𝘀 𝗯𝗶𝗴𝗴𝗲𝘀𝘁 𝘄𝗲𝗮𝗸𝗻𝗲𝘀𝘀 😧

Researchers at UKP Lab introduce #POATE, a jailbreak technique that exploits contrastive reasoning to bypass safety mechanisms in large language models.

🧩 Instead of submitting harmful prompts, POATE generates their harmless opposites—then subtly guides the model’s reasoning back to unsafe intent. The result: logic itself becomes the attack vector.

(1/🧵)

Show thread

UKP Lab

𝗞𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀:
⚡ 𝟱𝟳% 𝗮𝘁𝘁𝗮𝗰𝗸 𝘀𝘂𝗰𝗰𝗲𝘀𝘀 𝗿𝗮𝘁𝗲: Outperforms SOTA attacks across GPT-4o, LLama, Gemma, and Phi models
🧠 𝗦𝗺𝗮𝗿𝘁𝗲𝗿 ≠ 𝗦𝗮𝗳𝗲𝗿: Larger, more capable models are MORE vulnerable to contrastive reasoning attacks
🚨 𝗗𝗲𝗳𝗲𝗻𝘀𝗲 𝗴𝗮𝗽 𝗲𝘅𝗽𝗼𝘀𝗲𝗱: Current safety measures can't detect subtle, logic-driven jailbreaks
✅ 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗲𝘅𝗶𝘀𝘁𝘀: Our Chain-of-Thought defenses reduce attack success by 95%

(2/🧵)

Show thread

UKP Lab Oct 30

📜 𝗣𝗮𝗽𝗲𝗿 → https://arxiv.org/pdf/2501.01872
🌐 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 → https://ukplab.github.io/emnlp2025-poate-attack/
💾 𝗖𝗼𝗱𝗲 + 𝗱𝗮𝘁𝗮 → https://github.com/UKPLab/emnlp2025-poate-attack

And consider following the authors Rachneet Sachdeva‬, Rima Hazra, and Iryna Gurevych (UKP Lab/TU Darmstadt) if you are interested in more information or an exchange of ideas.

(3/3)

#NLProc #LLMSafety #AIsecurity #Jailbreak #LLM