Mastodawn

Chloé Messdaghi Jun 25, 2025

This paper introduces a model-agnostic threat model for jailbreaks that uses an N-gram language model to measure how far a jailbreak prompt deviates from natural text. It finds that discrete optimization attacks are more effective than LLM-based ones, and that jailbreaks often exploit rare bigrams.
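
To make the N-gram idea concrete, here is a minimal sketch of scoring text with a bigram model. It is not the paper's implementation: the toy corpus, the whitespace tokenizer, the add-one smoothing, and the function names `train_bigram` and `bigram_perplexity` are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_perplexity(text, unigrams, bigrams):
    """Perplexity of text under an add-one smoothed bigram model (assumed smoothing)."""
    tokens = ["<s>"] + text.lower().split()
    if len(tokens) < 2:
        raise ValueError("text must contain at least one token")
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen tokens
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Rare or unseen bigrams get a small probability and raise perplexity.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

# Hypothetical toy corpus standing in for a large natural-text dataset.
corpus = [
    "please write a short story about a dog",
    "please write a poem about the sea",
    "tell me a story about the sea",
]
unigrams, bigrams = train_bigram(corpus)

natural = bigram_perplexity("please write a story about the sea", unigrams, bigrams)
scrambled = bigram_perplexity("please write sea a the about story", unigrams, bigrams)
print(f"natural: {natural:.2f}  scrambled: {scrambled:.2f}")
```

The scrambled prompt uses the same words but strings them into bigrams the corpus never saw, so it scores a higher perplexity. A threshold on that score is one way such a model could flag text that relies on rare bigrams.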

Read more: https://arxiv.org/abs/2410.16222

#AIResearch #JailbreakDetection