Pavel Soukenik

@soukenik
Talks about languages and technology and transparency and ethics in AI
Website: https://pavelsoukenik.com/about
Threads: https://www.threads.com/@pavel_soukenik
Location: Langley, WA, United States
Pronouns: he/him
Not only is the pen mightier than the sword, but poets now trounce LLM guardrails better than hackers.

A new paper, Adversarial Poetry as a Universal Single‑Turn Jailbreak Mechanism in Large Language Models (savor that title), shows that malicious prompts in verse gave attackers a 60%+ success rate across state-of-the-art models.

Looks like we’ll be adding the lute and quill to the red‑team toolkit.

https://arxiv.org/abs/2511.15304v2
#AI Threads
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

arXiv.org

New article: "Intelligence in AI: Seeing Past the Symbols"

To what degree do LLMs understand language? This question is a window into the strengths, weaknesses, and future of machine learning.

Join me for a survey of a range of views—from the echoes of Alan Turing's 'paper machine' to the latest insights of Yann LeCun—that shed light on this intriguing topic.

https://pavelsoukenik.com/intelligence-in-ai

#machinelearning #AI #LLM

Intelligence in AI: Seeing Past the Symbols

Explore how the debate on AI’s capacity for understanding sheds light on the strengths, limitations, and future of machine intelligence.

Pavel Soukenik

Do we need more ██████ in the use of generative AI?

https://authograph.com/tag

Authograph Tag

Authograph Tag is a quick and transparent way to indicate human and AI authorship in social media posts and short content.

The rapidly increasing use of generative AI made me realize the importance of having a clear indication of how the content we are consuming was created.

This prompted the development of Authograph -- a labeling and certification service to promote transparency and trust in content creation.

I discuss this in more detail in this article: https://authograph.com/transparency-in-authorship

I would love to connect and hear people's thoughts on this, and on promoting transparency and trust in authorship in general.

Transparency in Authorship

Discover Authograph, an authorship label and certification service promoting transparency in content creation in the AI era. Learn how it lets you build trust and credibility with your audiences.

Just received this. "Sometimes I might say something weird" is a wording that I have many thoughts about. Also, we have been able to do better than needing "newtopic" (sic) for at least half a century now.

If you're in #trustandsafety, check out this event in #Seattle: https://www.eventbrite.com/e/seats-establishing-a-trust-safety-professionals-community-tickets-588606928167

It's non-commercial, so please help spread the word.

SEATS: Establishing a Trust & Safety Professionals Community

A networking event for Trust & Safety professionals with an unconference. Meet new people & exchange ideas about your work or research.

Eventbrite
Federation does not fix moderation problems. Only moderation fixes moderation problems.
Why shouldn’t you just delete your #twitter? Abandoned social media accounts represent the same risk as abandoned domain names. Name #squatting works because reputation and influence get attached over an account’s life. Those followers you spent time building up don’t just dissipate when you go. For a threat actor this is a huge opportunity. Some accounts carry more #influence than presidents. So if you stop using an account, purge it, leave a last message, then securely lock it. (Please share)

The nice thing about 20 years in #localization is that it embedded in my world view that the vast majority of users on every big platform are not Americans and do not speak English.

For #Twitter, ~80% of users are outside the United States, which makes the scenario of policy and moderation decisions being made by one guy (as opposed to what he promised) even more problematic. #contentmoderation

Job openings at #EU for legal officers, data scientists, technology specialists, economists and policy officers in relation to #DSA. #TrustAndSafety

https://digital-strategy.ec.europa.eu/en/news/job-opportunity-european-commission-hiring-experts-enforce-digital-services-act

Job opportunity: European Commission is hiring experts to enforce the Digital Services Act

The European Commission is strengthening its team to implement the Digital Services Act and create a safer and more transparent online space.

Shaping Europe’s digital future