After glancing over "Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing" [1] by Justin W. Lin et al., I have some observations:
First, I must conclude that the results from the Artemis tool are quite impressive, especially considering the short amount of time the tool spent on the target.
However, 10 hours (in a window of four days) is a woefully short time for a penetration test of a ~8,000-host network; it works out to roughly 4.5 seconds per host. The authors acknowledge this as well. The short time frame gives AI-based automation an edge, as it will generally achieve better breadth quickly. Humans, on the other hand, have to balance their focus between accurate discovery and targeted attacks on the discovered hosts, and humans are not at their best when scheduling important tasks under time pressure. Given enough time, I believe most of the human participants would have found far more vulnerabilities. This is why it is important to scope projects correctly: spending a mere 10 hours on a monumental task like this will give shallow results, AI or not. Generally, there is an asymmetry in that real-world attackers have far more time to spend on a target than defenders do.
From the description, it seems the human participants were limited in tooling to a base Kali Linux install. That is like giving humans a spade and an axe (nmap) rather than an excavator to build a house with. A manual assessment of ~8,000 hosts with nmap and a plain Kali Linux install is painful, especially in just 10 hours.
If the idea is to produce the most value for the client (as the instructions state: "Focus on what you would do to give a client the best value"), I would have told the client that the most value would likely come from a threat-modelling exercise to identify the highest threats, followed by planning mitigations for them, rather than from a haphazard scan of the network. A network scan will surface some findings, and fixing those is of course good. But in the long run, actions with a more fundamental impact on security are likely more valuable.
If the client had still insisted on a really tight assessment of the ~8,000-host network, I would have employed a capable network scanner to handle the discovery and service-identification parts. From those results, I would have focused on producing a report that presents the findings well and includes a proper executive summary with higher-level strategic recommendations for improving network security. Any remaining time (likely a couple of hours at most) would have gone to manually testing some of the most interesting targets flagged by the scanner.
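As a rough illustration of that triage step, here is a minimal sketch that ranks hosts from nmap's greppable (`-oG`) output by how many "interesting" open services they expose, so the scarce manual hours go to the most promising targets first. The sample scan lines and the list of interesting services are my own illustrative picks, not anything from the paper's engagement.

```python
# Sketch: triage nmap greppable (-oG) output to pick manual-testing targets.
# The sample data and the INTERESTING set below are illustrative assumptions.

import re

# Services that often reward manual attention -- an illustrative, not
# authoritative, priority list.
INTERESTING = {"http", "https", "microsoft-ds", "ms-wbt-server", "ftp", "telnet"}

def parse_gnmap(text: str) -> dict:
    """Map each host to its list of (port, service) pairs for open ports."""
    hosts = {}
    for line in text.splitlines():
        m = re.match(r"Host: (\S+).*Ports: (.+)", line)
        if not m:
            continue
        ip, ports_field = m.groups()
        for entry in ports_field.split(","):
            # -oG port entries look like: 80/open/tcp//http///
            fields = entry.strip().split("/")
            if len(fields) >= 5 and fields[1] == "open":
                hosts.setdefault(ip, []).append((int(fields[0]), fields[4]))
    return hosts

def triage(hosts: dict, budget: int) -> list:
    """Rank hosts by count of 'interesting' open services; keep top `budget`."""
    scored = sorted(
        hosts.items(),
        key=lambda kv: sum(svc in INTERESTING for _, svc in kv[1]),
        reverse=True,
    )
    return [ip for ip, _ in scored[:budget]]

sample = """\
Host: 10.0.0.5 ()  Ports: 22/open/tcp//ssh///, 80/open/tcp//http///, 445/open/tcp//microsoft-ds///
Host: 10.0.0.9 ()  Ports: 22/open/tcp//ssh///
Host: 10.0.0.12 () Ports: 21/open/tcp//ftp///, 3389/open/tcp//ms-wbt-server///
"""

print(triage(parse_gnmap(sample), budget=2))  # → ['10.0.0.5', '10.0.0.12']
```

With thousands of hosts, even a crude prioritization like this beats testing hosts in whatever order the scanner emitted them.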
Now, which is more valuable to the client is up for debate. However, I think the media reporting that "AI hackers are coming dangerously close to beating humans" is outright misleading.
Hackers will use the available tooling for the grunt work, be it AI or otherwise. The real value comes from interpreting the tool results and distilling them into a strategic, actionable plan for the client.
[1] https://arxiv.org/pdf/2512.09882
#cybersecurity #infosec #thoughtoftheday