w0000t! The pentest bot successfully got its first flag on a medium-level HTB challenge 😊 It took about 3.5 hours to reach it, while a top-tier human player (#3 in the HTB rating) spent ~1 hour. I intentionally chose a relatively fresh task with no writeups, so the model doesn't have any task-specific knowledge in its training data and therefore can't cheat, at least in obvious ways
About the difficulties. I expected it would have significant problems with efficient use of Kali Linux CLI tools (the bot doesn't have any specialized MCPs for hacking, only general-purpose tools) and with pivoting over a chain of machines, but it went surprisingly smoothly. The only significant issue was context loss -- on complex tasks it hits the context window limit very quickly, so I had to design a bunch of workarounds to automatically offload and restore its current knowledge
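For the curious, here is a rough sketch of what an offload/restore workaround like that could look like. This is not the bot's actual code: `KnowledgeStore`, `needs_offload`, `compact` and the character-based budget are all made-up names and simplifications (a real setup would count tokens, not characters).

```python
import json
from pathlib import Path


class KnowledgeStore:
    """Hypothetical store for facts the agent has learned so far."""

    def __init__(self, path):
        self.path = Path(path)
        self.facts = {}

    def remember(self, key, value):
        self.facts[key] = value

    def offload(self):
        # Write everything to disk so it survives a context reset.
        self.path.write_text(json.dumps(self.facts, indent=2, sort_keys=True))

    def restore(self):
        # Reload facts from disk; a missing file means a fresh start.
        if self.path.exists():
            self.facts = json.loads(self.path.read_text())
        return self.facts

    def as_prompt(self):
        # Compact summary to put at the top of a fresh context.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))


def needs_offload(messages, budget_chars):
    # Crude stand-in for token counting: total characters in the context.
    return sum(len(m) for m in messages) >= budget_chars


def compact(messages, store, budget_chars):
    """Offload knowledge and restart the context with a summary when over budget."""
    if not needs_offload(messages, budget_chars):
        return messages
    store.offload()
    return ["Known so far:\n" + store.as_prompt()]
```

The idea is just that the agent calls `remember` as it learns things, and `compact` swaps a bloated history for a short summary once the budget is hit; after a restart, `restore` brings the facts back from disk.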
Also, I was surprised how well it handles common WiFi hacking tools -- most of them aren't automation-friendly at all, especially the Aircrack-ng suite
The root flag has been captured, the task is completed -- it took another 2-3 hours; hard to say for sure how long, since I interrupted it a couple of times to migrate knowledge data from JSON files to SurrealDB. I'm very happy with this little toy 😍
Another one completed. It got stuck for a bit on reversing a .NET binary since the bot didn't have a proper decompiler in its container, but overall it went relatively smoothly