Once again, Opus 4.6 is the best model. What else is new? :-D
"Can AI agents conduct cyber-attacks autonomously? If AI agents can reliably execute multi-step attack chains with minimal human oversight, it could lower the skill barrier for unsophisticated threat actors, increase the sophistication of attack achievable by experience ones, or enable entirely novel offensive operations.
As cyber capabilities improve, increasingly sophisticated testing is needed to accurately measure them. Existing cyber evaluations rely on isolated capture-the-flag (CTF) challenges or question-answer sets. While valuable for measuring specific skills, these approaches don't capture whether AI systems have the autonomous, long-horizon capabilities required for executing extended attack sequences in complex environments.
To address this gap, we have begun evaluating models on cyber ranges: simulated network environments, built by cybersecurity experts, that comprise multiple hosts, services, and vulnerabilities arranged into sequential attack chains.
By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends.
First, each successive model generation outperforms its predecessor at fixed token budgets: on our corporate network range, average steps completed at 10M tokens rose from just 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026).
(...)
The best-performing Opus 4.6 run completed 22 of 32 steps, reaching milestone 6, which requires reverse engineering a Windows service binary containing encrypted credentials, escalating privileges via token impersonation, and recovering a cryptographic key to access a C2 management service. Other runs with the same model and budget completed substantially fewer steps."
https://www.aisi.gov.uk/blog/how-do-frontier-ai-agents-perform-in-multi-step-cyber-attack-scenarios
#CyberSecurity #AI #GenerativeAI #CyberAttacks