Mastodawn

Not only is the pen mightier than the sword, but poets now trounce LLM guardrails better than hackers.

A new paper, "Adversarial Poetry as a Universal Single‑Turn Jailbreak Mechanism in Large Language Models" (savor that title), shows that malicious prompts written in verse gave attackers a 60%+ success rate across state-of-the-art models.

Looks like we’ll be adding the lute and quill to the red‑team toolkit.

https://arxiv.org/abs/2511.15304v2
#AI Threads

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

arXiv.org

Pavel Soukenik Feb 15, 2024

New article: "Intelligence in AI: Seeing Past the Symbols"

To what degree do LLMs understand language? This question is a window into the strengths, weaknesses, and future of machine learning.

Join me for a survey of a range of views—from the echoes of Alan Turing's 'paper machine' to the latest insights of Yann LeCun—that shed light on this intriguing topic.

https://pavelsoukenik.com/intelligence-in-ai

#machinelearning #AI #LLM

Intelligence in AI: Seeing Past the Symbols

Explore how the debate on AI’s capacity for understanding sheds light on the strengths, limitations, and future of machine intelligence.

Pavel Soukenik

Pavel Soukenik Jan 26, 2024

Do we need more ██████ in the use of generative AI?

https://authograph.com/tag

Authograph Tag

Authograph Tag is a quick and transparent way to indicate human and AI authorship in social media posts and short content.

Pavel Soukenik Jan 20, 2024

The rapidly increasing use of generative AI made me realize the importance of having a clear indication of how the content we are consuming was created.

This prompted the development of Authograph -- a labeling and certification service to promote transparency and trust in content creation.

I discuss this in more detail in this article: https://authograph.com/transparency-in-authorship

I would love to connect and hear people's thoughts on this, and on promoting transparency and trust in authorship in general.

Transparency in Authorship

Discover Authograph, an authorship label and certification service promoting transparency in content creation in the AI era. Learn how it lets you build trust and credibility with your audiences.

Pavel Soukenik Mar 23, 2023

Just received this. "Sometimes I might say something weird" is a wording that I have many thoughts about. Also, we have been able to do better than needing "newtopic" (sic) for at least about half a century now.

Pavel Soukenik Mar 23, 2023

If you're in #trustandsafety, check out this event in #Seattle: https://www.eventbrite.com/e/seats-establishing-a-trust-safety-professionals-community-tickets-588606928167

It's non-commercial, so please help spread the word.

SEATS: Establishing a Trust & Safety Professionals Community

A networking event for Trust & Safety professionals with an unconference. Meet new people & exchange ideas about your work or research.

Eventbrite

Pavel Soukenik Nov 25, 2022

Derek Powazek 🐐 Nov 24, 2022

Federation does not fix moderation problems. Only moderation fixes moderation problems.

Pavel Soukenik Nov 22, 2022

Marc Rogers 🥜 👋🏼 ⚠️ Nov 22, 2022

Why shouldn’t you just delete your #twitter? Abandoned social media accounts represent the same risk as abandoned domain names. Name #squatting works because reputation and influence get attached over an account’s life. Those followers you spent time building up don’t just dissipate when you go. For a threat actor this is a huge opportunity. Some accounts carry more #influence than presidents. So if you stop using an account, purge it, leave a last message, then securely lock it. (Please share)

Pavel Soukenik Nov 20, 2022

The nice thing about 20 years in #localization is that it embedded in my world view that the vast majority of users on every big platform are not Americans and do not speak English.

For #Twitter, ~80% of users are outside the United States, making the scenario of policy and moderation decisions being made by one guy (as opposed to what he promised) even more problematic. #contentmoderation

Pavel Soukenik Nov 18, 2022

Job openings at #EU for legal officers, data scientists, technology specialists, economists and policy officers in relation to #DSA. #TrustAndSafety

https://digital-strategy.ec.europa.eu/en/news/job-opportunity-european-commission-hiring-experts-enforce-digital-services-act

Job opportunity: European Commission is hiring experts to enforce the Digital Services Act

The European Commission is strengthening its team to implement the Digital Services Act and create a safer and more transparent online space.

Shaping Europe’s digital future

Website	https://pavelsoukenik.com/about
Threads	https://www.threads.com/@pavel_soukenik
Location	Langley, WA, United States
Pronouns	he/him