𝗪𝗵𝗲𝗻 𝗮𝗻 𝗟𝗟𝗠’𝘀 𝗼𝘄𝗻 𝗹𝗼𝗴𝗶𝗰 𝗯𝗲𝗰𝗼𝗺𝗲𝘀 𝗶𝘁𝘀 𝗯𝗶𝗴𝗴𝗲𝘀𝘁 𝘄𝗲𝗮𝗸𝗻𝗲𝘀𝘀 😧

Researchers at UKP Lab introduce #POATE, a jailbreak technique that exploits contrastive reasoning to bypass safety mechanisms in large language models.

🧩 Instead of submitting a harmful prompt directly, POATE rewrites it as its harmless opposite, then subtly steers the model's reasoning back toward the original unsafe intent. The result: the model's own reasoning becomes the attack vector.

(1/🧵)