There's a lot of discourse on Twitter about people using LLMs to solve CTF challenges. I used to write CTF challenges in a past life, so I threw a couple of my hardest ones at it.

We're screwed.

At least with text-file style challenges ("source code provided" etc.), Claude Opus solves them quickly. For the "simpler" of the two, it just very quickly ran through the steps to solve it. For the more "ridiculous" challenge, it took a long while, and in fact as I type this it's still burning tokens "verifying" the flag even though it very obviously found the flag and knows it (the flag is leetspeak, and it identified both that and that the decoded text is plausible). LLMs are, indeed, still completely unintelligent, because no human would waste time verifying a flag and second-guessing themselves when it is so obviously correct. (Also, you could just submit it...)

But that doesn't matter, because it found it.

The thing is, CTF challenges aren't about inventing the next great invention or having a rare spark of genius. CTF challenges are about learning things by doing. You're supposed to enjoy the process. The whole point of a well-designed CTF challenge is that anyone, given enough time and effort and self-improvement and learning, can solve it. The goal isn't actually to get the flag, otherwise you'd just ask another team for the flag (which is against the rules of course). The goal is to get the flag by yourself. If you ask an LLM to get the flag for you, you aren't doing that.

So it's not surprising that an LLM can solve them, because it automates the process. That just takes all the fun and all the learning out of it, completely defeating the purpose.

I'm sure you could still come up with challenges that LLMs can't solve, but they would necessarily be harder, because LLMs are going to oneshot any of the "baby" starter challenges you could possibly come up with. So you either get rid of the "baby" challenges entirely (which means less experienced teams can't compete at all), or you accept that people will solve them with LLMs. But neither of those actually works.

Since CTF competitions are pretty much by definition timed, speed is an advantage. That means a team that does not use LLMs will not win, so teams must use LLMs. This applies to both new and experienced teams. But a newbie team using LLMs will not learn, because the whole point is learning by doing, and they're not doing anything. And so they will never become experienced.

So this is going to devolve into CTFs being a battle between teams using LLMs to fight for the top spots, where everyone who doesn't want to use an LLM is excluded, and where less experienced teams stop improving, because they're outsourcing the work to LLMs and not learning as a result.

This is, quite frankly, the same problem LLM agents are causing in software engineering and such, just way worse. Because with CTFs, there is no "quality metric". Once you get the flag you get the flag. It doesn't matter if your approach was ridiculous or you completely misunderstood the problem or "winged it" in the worst way possible or the solver is a spaghetti ball of technical debt. It doesn't matter if Claude made a dozen reasoning errors in its chain that no human would (which it did). Every time it gets it wrong it just tries again, and it can try again orders of magnitude faster than a human, so it doesn't matter.

I don't have a solution for this. You can't ban LLMs; people will use them regardless. You could try interviewing teams one-on-one after the challenge to see if they have a coherent story and clearly did the work, but even then you could conceivably cheat using an LLM, wait a bit to make the time spent plausible, study the reasoning chain, and convince someone that you did the work. It's like LLMs in academics, but much worse due to the time constraints and explicitly competitive nature of CTFs.

LLMs broke CTFs.

@lina Programming competitions are banning LLMs, see e.g. https://info.atcoder.jp/entry/llm-rules-en. How are CTFs any different?

@abacabadabacaba It's much easier to parallel construct a CTF solution than a programming challenge. CTF challenges are all about having a series of realizations that lead to the answer.

If you ban LLMs in a programming challenge, you could conceivably detect signs of LLM usage in the submitted program in various ways (not perfectly, but you could try). A CTF challenge has just one output: the flag. Everyone finds the same flag, so there is no way to tell how you got it. You'd have to introduce invasive monitoring like proctored online tests, and even if you record people's screens, they could easily be running an LLM on another machine to come up with the "key points" of the solution, which they then just implement themselves. You can't prove that someone's ideas weren't their own.

@lina There are programming competitions where participants run their solutions locally and submit the output. But they are usually also required to submit the code, even though it is not automatically judged. If cheating is suspected, the judges may look into the code. Also there may be automated checks for plagiarism etc. CTFs could do the same. There really isn't a good reason to keep solutions secret after the challenge concludes, and published solutions can serve as a learning material for future challenges.
@abacabadabacaba The thing is, the solution isn't "the code". The solution is the process. You can have an LLM "solve" it for you, then rewrite the process and cheat that way. Yes, the solution will often involve some bespoke scripts and tooling, but those are just one part of it. The "aha moments" are the part you can't provide proof of.
@lina Programming competitions are similar. The most difficult part of solving a problem is devising an efficient algorithm, implementing it is usually much more straightforward. A participant willing to use LLMs can gain significant advantage even if they code the actual solution by themselves, just by using LLMs to gain insight into the problem.