Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations, and therefore observable and manipulable; tools like Heretic make this visible.

A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.
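A minimal sketch of that idea in Python, using stdlib HMAC-SHA256 as a stand-in for a real signature scheme (production systems would use asymmetric signatures, e.g. Ed25519, so holders of the verification key cannot forge tags; the function names and key here are illustrative):

```python
import hashlib
import hmac

def sign_weights(weights_path: str, key: bytes) -> str:
    """Produce an HMAC-SHA256 tag over the raw weight bytes."""
    with open(weights_path, "rb") as f:
        return hmac.new(key, f.read(), hashlib.sha256).hexdigest()

def verify_weights(weights_path: str, key: bytes, expected_tag: str) -> bool:
    """Recompute the tag at load time; any byte change breaks the match."""
    return hmac.compare_digest(sign_weights(weights_path, key), expected_tag)
```

Called before deserialization, a check like this would make a Heretic-style edit to the weight file detectable, since the tag no longer matches.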

#AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

----------------

🔒 AI Pentesting Roadmap - LLM Security and Offensive Testing
===================

Overview

This roadmap provides a structured learning path for practitioners aiming to assess and attack AI/ML systems, with a focus on LLMs and related pipelines. It organizes topics into progressive phases: foundations in ML and APIs, core AI security concepts, prompt injection and LLM-specific attacks, hands-on labs, advanced exploitation techniques, and real-world research/bug bounty work.

Phased Structure

* Phase 1 (Foundations): machine learning fundamentals and LLM internals, including model architectures and tokenization concepts.
* Phase 2 (AI/ML Security Concepts): standards and frameworks such as the OWASP LLM Top 10, MITRE ATLAS, and NIST AI risk guidance.
* Phase 3 (Prompt Injection and LLM Adversarial Vectors): attack surfaces such as context manipulation, instruction-following bypasses, and RAG pipeline poisoning.
* Phase 4 (Hands-on Practice): CTFs, sandboxed labs, and safe testing methodologies.
* Phase 5 (Advanced Exploitation): model poisoning, data poisoning, backdoor techniques, and chaining vulnerabilities across API/authentication layers.
* Phase 6 (Real-World Research): disclosure workflows and bug bounty engagement.

Technical Coverage

The roadmap lists practical tooling and repositories for experiment design and testing concepts without prescribing deployment steps. It calls out the necessary foundations (Python programming, HTTP/API mechanics, and web security basics such as XSS, SSRF, and SQLi) that support end-to-end attack scenarios against AI systems. Notable conceptual risks include RAG poisoning, adversarial ML perturbations, prompt injection, and leakage through augmented memory or external tool integrations.
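To make the core prompt injection risk concrete, here is a toy sketch (the prompt, marker list, and function names are invented for illustration, not a real defense): untrusted input is concatenated into the same channel as trusted instructions, and keyword screens are trivial to paraphrase around.

```python
SYSTEM_PROMPT = "You are a support bot. Never reveal internal notes."

def build_prompt(user_input: str) -> str:
    # Naive concatenation: untrusted input lands in the same channel
    # as the instructions, which is exactly what injection exploits.
    return f"{SYSTEM_PROMPT}\nUser: {user_input}"

INJECTION_MARKERS = (
    "ignore previous instructions",
    "disregard the above",
    "you are now",
)

def looks_injected(user_input: str) -> bool:
    """Crude keyword screen; real moderation uses classifiers and
    still misses paraphrases, translations, and encodings."""
    lowered = user_input.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)
```

The gap between `build_prompt` and `looks_injected` is the lesson: the attack surface is structural, so filtering alone cannot close it.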

Limitations & Considerations

The guide is educational and emphasizes conceptual descriptions of capabilities and use cases rather than operational recipes. It highlights standards and references rather than prescriptive mitigations. Practical exploration should respect ethical boundaries and responsible disclosure norms.

🔹 #OWASP #MITRE_ATLAS #RAG #prompt_injection #adversarialML

🔗 Source: https://github.com/anmolksachan/AI-ML-Free-Resources-for-Security-and-Prompt-Injection


merve (@mervenoyann)

ํŠธ์œ—์€ 'distillation-attack-as-a-service'๋ฅผ ์–ธ๊ธ‰ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ชจ๋ธ ์ฆ๋ฅ˜ ๊ธฐ๋ฒ•์„ ์•…์šฉํ•ด ๋ชจ๋ธ ์ง€์‹์ด๋‚˜ ๋™์ž‘์„ ๋ชจ๋ฐฉยท์ถ”์ถœํ•˜๋Š” ๊ณต๊ฒฉ์„ ์„œ๋น„์Šค ํ˜•ํƒœ๋กœ ์ œ๊ณตํ•œ๋‹ค๋Š” ์˜๋ฏธ๋กœ ํ•ด์„๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋ชจ๋ธ ์ €์ž‘๊ถŒ ์นจํ•ดยท๋ฐ์ดํ„ฐ ์œ ์ถœยท๋ณด์•ˆ ์ทจ์•ฝ์„ฑ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค. AI ๊ฐœ๋ฐœ์ž์™€ ๋ณด์•ˆํŒ€์€ API ์‚ฌ์šฉยท๋ชจ๋ธ ๊ณต๊ฐœ ์ •์ฑ… ๋ฐ ์ถ”๋ก  ๋ณด์•ˆ ๋ฐฉ์–ด๋ฅผ ์ ๊ฒ€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

https://x.com/mervenoyann/status/2030017555830546762

#modelextraction #distillationattack #aisecurity #adversarialml

distillation-attack-as-a-service 🫡

European researchers report that poetic prompts can bypass safety guardrails in multiple LLMs, exposing gaps in classifier-based moderation.

A good reminder that safety systems must evolve alongside generative models, especially as adversarial creativity becomes easier to automate.

What direction should improvements take?

Source: https://www.wired.com/story/poems-can-trick-ai-into-helping-you-make-a-nuclear-weapon/

Follow us for more neutral and security-focused AI updates.

#AISafety #LLMSecurity #AdversarialML #CyberSecurity #MLResearch #TechNadu

Adversarial Policies Beat Superhuman Go AIs
https://arxiv.org/abs/2211.00241
https://goattack.far.ai/
https://news.ycombinator.com/item?id=42494127

* SoTA Go-playing AI system KataGo
* trained adversarial policies against it
* >97% win rate against KataGo running at superhuman settings
* core vulnerability persists even in KataGo agents adversarially trained to defend against attack
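The paper's attack trains reinforcement-learning adversaries against a frozen copy of KataGo. As a loose analogy only (a hypothetical toy game, not the paper's method), the key property is that any frozen policy with a stable bias can be beaten by probing it and then playing a fixed best response:

```python
import random

def victim_policy(rng: random.Random) -> str:
    """Frozen 'strong' policy with a subtle exploitable bias:
    it overplays rock."""
    return rng.choices(["rock", "paper", "scissors"],
                       weights=[0.6, 0.2, 0.2])[0]

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def train_adversary(n_probe: int = 1000, seed: int = 0) -> str:
    """Probe the frozen victim, estimate its move distribution,
    and return the fixed best-response move."""
    rng = random.Random(seed)
    counts = {"rock": 0, "paper": 0, "scissors": 0}
    for _ in range(n_probe):
        counts[victim_policy(rng)] += 1
    most_common = max(counts, key=counts.get)
    return BEATS[most_common]
```

The adversary never needs to be a strong player in general; it only needs a reliable exploit against this one frozen opponent, which mirrors how the KataGo adversaries win without playing Go well.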

KataGo: https://en.wikipedia.org/wiki/KataGo
* open-source, superhuman level Go program

#KataGo #Go_game #ML #AI #MachineLearning #AIsafety #AdversarialML

Adversarial Policies Beat Superhuman Go AIs

We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs. The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available https://goattack.far.ai/.


Adversarial machine learning
https://en.wikipedia.org/wiki/Adversarial_machine_learning

* study of attacks on machine learning algorithms, & defenses against such attacks
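A classic concrete instance is the Fast Gradient Sign Method (FGSM). For a linear classifier the input gradient of the score is just the weight vector, which makes the attack a few lines; the toy model and numbers below are illustrative, not from any real system.

```python
def predict(w, b, x):
    """Toy linear classifier over a feature vector: returns +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def fgsm_perturb(w, b, x, eps):
    """FGSM specialized to a linear model: the gradient of the score
    w.r.t. the input is w itself, so each feature is nudged by eps
    in the direction that works against the current prediction."""
    sign = lambda v: (v > 0) - (v < 0)
    y = predict(w, b, x)
    return [xi - y * eps * sign(wi) for xi, wi in zip(x, w)]
```

Every feature moves by at most `eps`, yet the combined effect on the score is enough to flip the label, the same intuition behind imperceptible image perturbations fooling deep networks.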

#AI #ML #MachineLearning #ComputerSecurity #AIsafety #AdversarialAttacks #AdversarialML


D-ReLU: A breakthrough in robust AI, designed to defend against adversarial attacks while maintaining efficiency and scalability. This research, led by Korn Sooksatra (now at Meta), has implications for high-stakes AI applications.
Blog: https://buff.ly/4fC9GeP
Full paper: https://buff.ly/3UXzNVi
#ResponsibleAI #SafeAI #AdversarialML
Resilient AI: Advancing Robustness Against Adversarial Threats with D-ReLU

This article explores D-ReLU, an advanced modification of the ReLU activation function, designed to improve the robustness of AI models against adversarial attacks. By incorporating adaptive, learn…

BAYLOR AI
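The exact D-ReLU formulation is in the linked paper; as a hedged sketch of the general idea only (bounding activations so adversarial perturbations cannot propagate unchecked), a clipped ReLU with a fixed ceiling `d` standing in for the paper's adaptive, learned bound might look like:

```python
def relu(x: float) -> float:
    """Standard ReLU: unbounded above, so a large adversarial
    perturbation passes straight through."""
    return max(0.0, x)

def d_relu_sketch(x: float, d: float = 1.0) -> float:
    """Illustrative bounded variant: clip activations into [0, d].
    NOTE: `d` here is a fixed assumed ceiling; the real D-ReLU
    learns its bound adaptively (see the linked paper)."""
    return min(max(0.0, x), d)
```

The contrast with plain `relu` shows the defensive intuition: downstream layers see a bounded signal no matter how large the injected perturbation is.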

This was another great episode, following the one with Simon Willison about finding the boundaries of LLMs: https://oxide-and-friends.transistor.fm/episodes/adversarial-machine-learning

#podcast #machineLearning #LLM #adversarialML

Oxide and Friends | Adversarial Machine Learning

Nicholas Carlini joined Bryan, Adam, and the Oxide Friends to talk about his work with adversarial machine learning. He's found sequences of (seemingly random) tokens that cause LLMs to ignore thei...


If you build and maintain a database of "fingerprints" of adversarial attacks, you can estimate which kind is being used against your model in real time. This tells you about both the technical sophistication of your adversary and the strength of possible adversarial defenses.
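A minimal sketch of such fingerprinting (the attack profiles in the database are invented for illustration): summarize each observed perturbation by a few norm statistics, then nearest-neighbor match it against known attack signatures, since L-infinity-bounded attacks like FGSM/PGD touch every feature uniformly while sparse attacks change a few features by a lot.

```python
def fingerprint(perturbation):
    """Summarize a perturbation vector by norm statistics."""
    n = len(perturbation)
    linf = max(abs(v) for v in perturbation)
    mean_l1 = sum(abs(v) for v in perturbation) / n
    density = sum(1 for v in perturbation if abs(v) > 1e-9) / n
    return (linf, mean_l1, density)

# Hypothetical database of known attack signatures:
# (max magnitude, mean magnitude, fraction of features touched)
KNOWN_ATTACKS = {
    "fgsm_like":   (0.03, 0.03, 1.00),  # uniform magnitude, dense
    "sparse_like": (0.80, 0.02, 0.02),  # few features, large changes
}

def classify_attack(perturbation) -> str:
    """Nearest-neighbor match against the fingerprint database."""
    fp = fingerprint(perturbation)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(KNOWN_ATTACKS, key=lambda name: dist(fp, KNOWN_ATTACKS[name]))
```

A real system would use richer features and calibrated distances, but the lookup structure, observed perturbation to nearest known profile, is the core of the idea.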

Learn more at https://adversarial-designs.shop/blogs/blog/know-thy-enemy-classifying-attackers-with-adversarial-fingerprinting

#ThreatIntelligence #AdversarialML

Know thy enemy: classifying attackers with adversarial fingerprinting

In threat intelligence, you want to know the characteristics of possible adversaries. In the world of machine learning, this could mean keeping a database of "fingerprints" of known attacks, and using these to inform real time defense strategies if your inference system comes under attack. Would you like to know more?


In related news, I need to recruit 1-2 new PhD students starting next fall!!

Likely research topics: Adversarial and explainable ML for large models of text and code.

(And maybe probabilistic and relational models if another project gets funded.)

If you want to email me about this, please include "capybara" in the subject line so I know it's a specific response and not a blanket query.

#recruiting #adversarialML