There's a lot of discourse on Twitter about people using LLMs to solve CTF challenges. I used to write CTF challenges in a past life, so I threw a couple of my hardest ones at it.

We're screwed.

At least with text-file style challenges ("source code provided" etc.), Claude Opus solves them quickly. For the "simpler" of the two, it just very quickly ran through the steps to solve it. For the more "ridiculous" challenge, it took a long while, and as I type this it's still burning tokens "verifying" the flag even though it very obviously found it and knows it (the flag is leetspeak, and it identified both that and that it's plausible). LLMs are, indeed, still completely unintelligent, because no human would waste time verifying a flag and second-guessing themselves when it's so obviously correct. (Also, you could just run it...)

But that doesn't matter, because it found it.

The thing is, CTF challenges aren't about inventing the next great invention or having a rare spark of genius. CTF challenges are about learning things by doing. You're supposed to enjoy the process. The whole point of a well-designed CTF challenge is that anyone, given enough time and effort and self-improvement and learning, can solve it. The goal isn't actually to get the flag, otherwise you'd just ask another team for the flag (which is against the rules of course). The goal is to get the flag by yourself. If you ask an LLM to get the flag for you, you aren't doing that.


So it's not surprising that an LLM can solve them, because it automates the process. That just takes all the fun and all the learning out of it, completely defeating the purpose.

I'm sure you could still come up with challenges that LLMs can't solve, but they would necessarily be harder, because LLMs are going to oneshot any of the "baby" starter challenges you could possibly come up with. So you either get rid of the "baby" challenges entirely (which means less experienced teams can't compete at all), or you accept that people will solve them with LLMs. But neither of those actually works.

Since CTF competitions are pretty much by definition timed, speed is an advantage. That means a team that does not use LLMs will not win, so teams must use LLMs. This applies to both new and experienced teams. But a newbie team using LLMs will not learn, because the whole point is learning by doing, and you're not doing anything. And so they will never become experienced.

So this is going to devolve into CTFs being a battle of teams using LLMs to fight for the top spots, where everyone who doesn't want to use an LLM is excluded, and where less experienced teams stop improving, because they're outsourcing the work to LLMs and not learning as a result.

This is, quite frankly, the same problem LLM agents are causing in software engineering and such, just way worse. Because with CTFs, there is no "quality metric". Once you get the flag you get the flag. It doesn't matter if your approach was ridiculous or you completely misunderstood the problem or "winged it" in the worst way possible or the solver is a spaghetti ball of technical debt. It doesn't matter if Claude made a dozen reasoning errors in its chain that no human would (which it did). Every time it gets it wrong it just tries again, and it can try again orders of magnitude faster than a human, so it doesn't matter.

I don't have a solution for this. You can't ban LLMs, people will use them regardless. You could try interviewing teams one on one after the challenge to see if they actually have a coherent story and clearly did the work, but even then you could conceivably cheat using an LLM and then wait it out a bit to make the time spent plausible, study the reasoning chain, and convince someone that you did the work. It's like LLMs in academics, but much worse due to the time constraints and explicitly competitive nature of CTFs.

LLMs broke CTFs.

And honestly, reading the Claude output, it's just ridiculous. It clearly has no idea what it's doing and it's just pattern-matching. Once it found the flag, it spent seven pages of reasoning and four more scripts trying to verify it, and failed to actually find what went wrong. After all that wasted time it just concluded that sometimes it gets the right answer and sometimes the wrong one, so the thing that looks like a flag is probably the flag. It can't debug its own code to find out what actually went wrong; it just decided to brute force a different approach instead.

It's just a pattern-matching machine. But it turns out if you brute force pattern-match enough times in enough steps inside a reasoning loop, you eventually stumble upon the answer, even if you have no idea how.

Humans can "wing it" and pattern-match too, but it's a gamble. If you pattern-match wrong and go down the wrong path, you just wasted a bunch of time and someone else wins. Competitive CTFs are all about walking the line between going as fast as possible and being very careful so you don't have to revisit, debug, and redo a bunch of your work. LLMs completely screw that up by brute forcing the process faster than humans.

This sucks.

I might still do a monthly challenge or something in the future so people who want to have fun and learn can have fun and learn. That's still okay.

But CTFs as discrete competitions with winners are dead.

A CTF competition is basically gameified homework.

LLMs broke the game. Now all that's left is self study.

@lina For in-person CTF competitions, would it be possible to do like programming competitions (specifically thinking of ACM ICPC) and disallow Internet access entirely? That would at least limit GenAI use to local models, which I suspect will remain uncompetitive at this sort of task for a very long time (due to the nearly inherent context size limitations).
@MrDOS Maybe, but in-person CTFs are themselves biased towards more privileged people and more advanced teams. We need online CTFs for the pipeline to work...
@lina Yeah, of course – and in reverse, I'm sure online competitive programming competitions are staring down the barrel of the same problem!
@lina thank you for this excellent thread
@lina I wonder if you can still design a challenge to be "LLM unfriendly" by changing the wording, just like those papers showing how an LLM aces problems like "river crossing", but if you change the wording a bit, it fails in weird and spectacular ways.
@doragasu Possibly? I might try removing all "hints" from one and trying again and seeing if it's any different. But that also affects human solvers... the hints are there to point you towards a website that explains the fundamentals of what's going on. The LLM didn't even read that, it just guessed from a filename and a comment and hulk smashed its way to guessing the general concept right with multiple attempts...
@lina In those papers trying to confuse LLMs, what was very effective, IIRC, was adding data you don't need to the problem statement. The LLM tried to use all the data you gave it to solve the problem and failed. Just like when a child is solving maths problems from a textbook: all the problems look similar, so the child internalizes that you have to add two numbers and divide by the third one. Then you change the problem and the child fails because they apply the same "formula".
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs--e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)"--which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations rationalizing those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods.

@doragasu I'm not sure that's going to work. The thing is the LLM can just try again or you can ask it to try again. In the reasoning chain it made tons of mistakes, including at one point concluding the flag is XXXXXXXXX, but then realized that's unlikely to be the real answer and tried again. It's not like a problem with a numerical solution where you only get one chance to solve it.

(Also those are pretty old models, I doubt such shenanigans still work well)

@lina Yeah, there's also the two-faced problem that the trick must not be too obvious to humans, who could otherwise "clean" the problem and feed the LLM the cleaned version.

@doragasu @lina Probably. LLMs are hilariously bad at dealing with linguistic ambiguities like puns.

One of my favorite ambiguities I’ve seen was saying some people “lie about the family tree”. Are they being deceptive on the topic of relations, or are they reclining around a plant tended by multiple generations?

@lina Might it be possible to include a few decoy flags that can’t be found by completing the challenge as intended, but that an LLM would notice and pattern-match on? And then you can disqualify any team submitting a decoy flag?
@dwineman This is too dangerous. Human solver teams also use things like grep scripts to find flags.
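(For context, "grep for the flag" is often literally just a regex scan over whatever data a challenge spits out. A minimal sketch, assuming a hypothetical flag{...} format; every CTF defines its own, so the pattern would be adjusted per competition:)

```python
import re

# Hypothetical flag format (flag{...}); real competitions each define
# their own, e.g. CTF{...} or picoCTF{...}, so adjust the regex to match.
FLAG_RE = re.compile(r"flag\{[^{}\s]{1,64}\}")

def find_flags(blob: str) -> list[str]:
    """Scan arbitrary text (strings output, decoded payloads, memory
    dumps) for anything shaped like a flag."""
    return FLAG_RE.findall(blob)
```

A decoy flag would match exactly the same pattern, which is why a legitimate team's tooling can surface it too.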

@lina

AI is fast eradicating any learning activity.
In my current job, learning anything new is actively discouraged.

As was said to us "they only care about numbers on a dashboard".

I got to the position I am in, at the level I am at, by being curious and very interested, by taking things apart and figuring out how they work.

An LLM, which in the eyes of a CEO means he can get rid of people like me, is the end of the road. We are all doomed.

@Sonic2k @lina you're looking at it the wrong way. Yes, it's killing one type of learning. But it's teaching you how to CTF using AI: what are its strengths and weaknesses, what prompts are effective, what sub-problems should the AI tackle, what should the human focus on. It's no different than a carpenter switching from a hand plane to a powered belt sander. The skill set changes, the results are more or less the same. Someone that only learns to belt sand isn't less of a carpenter. It's gatekeeping to think otherwise. Yes, the "elitist artists" will argue otherwise, but the difference is moot for the vast bulk of us working stiffs.
@Jmj @Sonic2k @lina classic AI apologist "expertise is unnecessary" fallacy. The results are perhaps similar on the surface "was the task completed" level, but if a person does it and learns the details an LLM can brute force past, that person can then recognize the issues showcased without going out of their way to look for them, which is an incredibly important part of security work. Because the real world is far messier and less clear than a CTF, and part of dealing with that is the kind of intuition and almost subconscious understanding which is impossible to achieve by using an LLM. And CTFs used to be decent at finding and rewarding those who are good at that.

@laund @Sonic2k @lina I never said "expertise is unnecessary".

Expertise is always necessary. All that changes is what types of expertise.

I was an expert 6502/Z80 assembly language programmer. Now that expertise is mostly useless, and actually harmful for writing Rust code. The mental models I developed for CPU behavior are completely wrong for understanding Rust code on ARM/x86 multi-core processors.
Learning assembly-level stuff does not make me a better or worse programmer compared to someone that only learned high-level languages. Yes, we will perform differently in narrow cases (say, compiler bugs vs. multi-core perf optimization), but for most code in most projects our expertise will be indistinguishable.

I know that when AI coding, I spend more time reading code, analyzing test cases, and writing specs, and a lot less time banging out lines of code and reading library/tool docs to yak shave them into working. I need to know different things; not better things or worse things, just different.

Personally, I'm most interested in what the 12-year-olds are learning to do with these AI tools. In exactly the same way I learned what computers can do, with my BBC micro in the 80s

The job is EXACTLY the same, press buttons until the pixels blink the way you want them.

@Jmj @Sonic2k @lina This feels strongly like you have no idea how people who aren't already pretty knowledgeable use LLMs

@laund @Sonic2k @lina

I honestly don't care about them. It doesn't matter, in exactly the same sense that all those folks building terrible GeoCities websites don't matter.

What matters is those folks that learned HTML and design sense on GeoCities. They became experts. They went on to build what we call the internet now. Fantastic websites that we ALL use. They developed the design guidelines and aesthetics that we now love.

Anyone can use any tool to make crap.

What matters is what experts can make that wasn't possible (or too expensive) before.
And I am strongly against gatekeeping on how someone learns to become an expert.

@Jmj @Sonic2k @lina There is a stark difference between deciding to learn a certain way and making a test of skill completely irrelevant (the latter being the topic of this post)
@laund @Sonic2k @lina I don’t like absolutes.
It’s no more or less than how you should feel about dictionaries, spell checkers, calculators, or Mathematica in relation to spelling bees and math olympiads. In some cases we don’t use or allow the aids, and in other cases we do. And those competitions have similar relationships to the real world as CTF competitions do to security work.
@Jmj @laund @Sonic2k @lina
but the calculator i bought once, and it had the same powers as any professor's;
same with the dictionary;
with llms, the current owners want you to spend more and more money;
only if we can democratize llms can we be free again to develop further
@titusDeGroan @laund @Sonic2k @lina totally agree on that score. There are many concerns around the implementation. Rent seeking, power costs, IP theft etc. These are all bad(tm), however the capabilities of these new tools are real, pretending they don’t work isn’t going to fix anything.
We are going to have to figure out how the world works with them in the mix.

@Jmj @laund @lina

My parents didn’t have bosses telling them that calculators, computers, and dictionaries were going to make them redundant, and continuing to rub that in their faces, did they?

I find myself in a position where I am soon going to be replaced by an LLM because I cost too much on the balance sheet.

That’s why I am planning to quit before May. Before they can make me redundant and fuck me

@Sonic2k @laund @lina
They 100% did...
Calculators replaced armies of bookkeepers.
Desktop computers replaced huge typing pools.

I never said the transition wouldn't be very painful to real people. History is full of people completely screwed over by change. But the change is not going away.

On a personal note, don't quit. Make them fire you. Why would you give up a paycheck unless you don't need it? And are you really sure you cannot work with LLMs? Competent people are always valuable.

@Jmj @Sonic2k @lina

This assumes that 'AI' works within a problem space comparable to any human being's, and it doesn't; it doesn't brute force anything, it just matches patterns.

So 'AI' can't really teach anything to any other entity.

@Jmj Are you the guy I saw driving a forklift into the gym the other day?
@wesdym depends on your goals; both things can be true.
If the goal is to look good in a muscle shirt, then no. But if I’m being paid to move goods in a Costco, I’m absolutely using a forklift. And while working out in the gym will marginally improve my performance at Costco in situations where the forklift doesn’t help, there is no amount of working out in the gym that will move a pallet of stuff down from the top shelf.
@Jmj There's no possible way that you're so stupid that you couldn't get my point. You're just being difficult, apparently just for the sake of it. Bye.
@lina that's the worst part IMO. We get Claude through work and, all environmental and ethical issues aside, I just hate using it. Curating mounds of garbage output from the Screw It Up Faster Machine sucks. But it looks *great* in artificial evaluations with a concrete, machine-verifiable goal. And too many managers don't understand that real world programming isn't just passing a succession of concrete, machine-verifiable goals.
@lina LLMs can't reason
@dngrs @lina We don't doubt that, but here it's used with a different meaning; there's no word for this process that doesn't also carry a definition as a uniquely human ability. And, for example, saying that a machine "thinks" is nothing new. I was saying that 20+ years ago whenever a computer was stuck doing something which would finish eventually, particularly if it was a virtual game opponent (which we also called AI, because the term has always been that broad).
@lina LLM ➡️ PSM (Persistent Squirrel Model)
@lina It’s not just the fact that it’s automated and faster, but also that it is parallelizable. You can just run several agents at the same time, trying different approaches, if you are willing to spend the tokens or have a free source of them (from an employer, for example). Then it kind of becomes an economic competition.

@lina LLMs broke the whole damn industry. We have a new "full stack" guy on our team who's there because he's our boss's son, and he was put there as a junior full stack to "learn".

He's using Claude or Copilot or whatever the fuck, I don't differentiate them by interface, but some LLM, constantly, to solve ridiculously easy tickets that I used to welcome as learning opportunities when I was at his level. He will never learn.

@lina Yet it's me who will be let go eventually, because I can't "solve" tickets as fast as a guy with no experience, all of the ambition, and a fuck ton of Claude tokens.
@lina This saddens me to hear. We host an industrial control system themed CTF and LLMs haven't quite gotten to the point of being useful to solve challenges yet. But I can totally see them catching up.
@SnoozyRests is it hosted publicly?
@twomikecharlie Unfortunately not; since we use real devices, we have major scalability issues. It has to be hosted in person too. Currently it's by expressed interest and invitation only.

@lina

How does this statement differ from "Deep Blue broke chess"? Cheat engines are similarly impossible to deterministically detect in online competition, yet the game is more popular than ever.

The competition format will have to adapt, which sucks, but if the majority of participants can agree that LLMs are cheats, then the community should be able to adapt & self-police like any other game community where cheats are easily accessible. Unless I'm missing something special about CTFs?

@nathan It's worse because it's not a linear game like chess. You aren't competing move-wise, you are going down your own path where there is no interaction between teams. There's no way to detect that in online competition, even heuristically. There's no realtime monitoring. There isn't any condensed format that describes "what you did". At most you could stream yourself to some kind of video escrow system, but then who is going to watch those? And if you make them public after the competition, you are giving away your tools to everyone. And you could still have an LLM on the side on another machine and parallel construct the whole thing plausibly.

Sure you could do in-person only, but that would only work for the top tiers and who is going to want to learn and grow online when a huge number of people are going to be cheating online?

It's the same with any kind of game. Sure cheating is barely a concern in-person, but people hate cheaters online, and companies still try hard to detect cheaters. And detecting cheaters for a CTF is nigh impossible.

@lina Ah I didn't consider that there would be a culture of hiding tools/methods. Yeah that's definitely incompatible with a post-LLM world.

This is a general trend with GenAI: the only way to earn legitimacy is either in person, or by publicizing the creative process. For a while already visual/music artists have had to either rely on their existing credibility, or share their creative process to establish their art's legitimacy. New anonymous art has sadly been made nearly worthless.

@nathan I don't think there's necessarily a culture of hiding methods outright (though some of the more competitive teams might), but more like people build their own personal stash of scripts and things to build off of, and don't necessarily just outright post it on GitHub or whatever.

So like, "fucky stuff with QR codes" having showed up in CTF challenges more than once, I have a personal "do low level analysis and extended recovery of damaged QR codes" script.

@lina I mean, yes, but I don't know if complete pessimism is warranted. It's definitely broken a lot of public CTFs but I think society will find a way, and maybe it's not even the worst thing.

Forever ago, when I was in uni, a few colleagues and I would do this thing every semester where we'd do one of the nominally individual projects together, ahead of time. Technically, it was cheating, but we did it specifically so we could go "off the rails" and try things that were not in the guide that our TAs handed out to us.

For instance, the "rite of passage" for every 3rd year student in my generation was a transformer design. It was something you'd work on, on and off, for the whole semester (we're talking big three-phase transformers for power distribution here so there was definitely *a lot* to work on).

They'd give us a sort of step-by-step guide to walk us through the whole process (start here, compute this quantity, check it against this standard table etc.) and you'd consult with the TAs along the semester. It was definitely interesting, if tedious at times, but tediousness was the lesser problem.

The bigger deal was that these guides weren't updated very often -- because the associated industrial standards don't get updated that often.

So what we did was that those of us who actually wanted to be there in the first place got together and we tried to experiment with various things not in the guide. Different isolation materials that we'd just read about, different cooling methods and so on. Not that we could show those to the TAs (can't blame them but most of them weren't very interested), and we didn't always have a lot of time or access to all the data we needed (we were students and had student budgets to contend with -- we couldn't buy standards, for example, and this was before libgen).

The cool thing about it was that it removed any kind of metrics pressure from this process. We weren't going to be ranked by anyone, there were no arsehole TAs to cater to, and no obtuse professors whose personal preferences in the formatting of our reports had to be placated.

We also didn't have to show our results to anyone who wasn't primarily interested in mentoring us. We worked *really* quickly because we had graded assignments to finish first and clung to whatever had remained of our social lives by the third year of an engineering degree, so "deadlines" were super tight.

That quickly removed any incentive to cheat. When there was no way around it (tl;dr outdated guides sometimes didn't work in the context we used them, I have some fun stories about that) we totally cheated on the "real" assignments -- but never on these ones. This was technically cheating, too -- in the process of working out these differences we'd obviously discuss how we'd gone through the "real" assignments, share results and so on -- but since we all had different design targets (tl;dr same transformer designs but with different target parameters, so you couldn't just copy your colleague's work) it wasn't really a big deal.

With no incentive to cheat and nothing to get ahead of other than the limits of our own knowledge and engineering abilities, we often found ourselves doing things we normally wouldn't do for our regular assignments. We couldn't try things out in a lab, so if we doubted our analytical results for some particular configuration, we'd compare them against general EM field numerical simulations. If we didn't have a good simulation package for what we were after, we'd try to work out different analytical solutions for related quantities and see if the results were similar.

We ended up learning a lot more than we did from the "real" assignments, mostly because our priorities were different. With the real assignments, your main objective was inevitably to get a high grade, and keeping the TAs and the prof happy was as critical as tracking the decimal point.

Whereas with our "social" assignments, our main objectives were 1. to learn new things and 2. to get something that looked like a workable design that was an improvement over the "real" one in some aspect of our choice (better efficiency, reduced size, less coolant, whatever). If you "cheated" your way through it, #1 was obviously not happening and you were never really sure of #2, so no one was motivated to do it.

I think this is what we're eventually going to converge towards in other spaces, too: CTFs organised in smaller circles, with fewer external metrics and motivators, and an emphasis on cooperation, shifting the "competition" towards external factors rather than competition among teams/team members.

When CTF scores matter because they could potentially get you ahead in the race for an internship, every twenty-year-old will eventually give in to cheating, if only because it's the only way to stay in the race with people who do it because it's the only way they *can* do it. But if you take out the cheese, it's not much of a rat race anymore.

I'm old enough to have seen this happen to hackathons to some degree. At first, after hackathons had grown into their "competitive" form from their "let's hack shit together" roots, everyone was super enthusiastic and people of every age jumped in. After a while, when prep became intensive enough that the only way to a prize was to implement 90% of what you meant to do beforehand (e.g. in a library) and then show up on the day of the hackathon and just piece the frontend together, everyone who was in it primarily for the thrill of focused building noped out.

Did that stop hackathons? Not at all, it just "split" things into:

- Corporate-funded hackathons which almost no one attends after they finish school -- where people rarely produce anything of value, and it's fine, because everyone understands that's not what they're there for. The "cheese" wasn't explicitly removed here, it's just at some point almost everyone recognised it's unattainable and the amount of hoops you have to jump through in order to attain it just isn't worth it when you're programming professionally
- "Real" hackathons, where people get together to work on a real project together, and the only competition is maybe the how-much-wasabi-you-can-eat-without-crying competition when everyone goes out for sushi the next day.

@lina Programming competitions are banning LLMs, see e.g. https://info.atcoder.jp/entry/llm-rules-en. How are CTFs any different?

@abacabadabacaba It's much easier to parallel construct a CTF solution than a programming challenge. CTF challenges are all about having a series of realizations that lead to the answer.

If you ban LLMs in a programming challenge, you could conceivably detect signs of LLM usage in the program in various ways (not perfectly, but you could try). A CTF challenge just has one output, the flag. Everyone finds the same flag. There is no way to tell how you did it. You'd have to introduce invasive monitoring like online tests, and even if you record people's screens, they could easily be running an LLM on another machine to have it come up with the "key points" to the solution which you just implement. You can't prove that someone didn't have some ideas on their own.

@lina There are programming competitions where participants run their solutions locally and submit the output. But they are usually also required to submit the code, even though it is not automatically judged. If cheating is suspected, the judges may look into the code. Also there may be automated checks for plagiarism etc. CTFs could do the same. There really isn't a good reason to keep solutions secret after the challenge concludes, and published solutions can serve as a learning material for future challenges.
@abacabadabacaba The thing is the solution isn't "the code". The solution is the process. You can have an LLM "solve" it for you, then rewrite the process and cheat that way. Yes the solution will often involve some bespoke scripts and tooling, but that's just part of it. The "aha moments", that you can't provide proof of.
@lina Programming competitions are similar. The most difficult part of solving a problem is devising an efficient algorithm, implementing it is usually much more straightforward. A participant willing to use LLMs can gain significant advantage even if they code the actual solution by themselves, just by using LLMs to gain insight into the problem.

@abacabadabacaba @lina mostly because the incentive to cheat for time is so high, and it places an ever-increasing burden on the organizers to develop LLM detection methods that are prohibitively cumbersome.

Rules without the ability to enforce them effectively are just guideposts for bad actors

@lina perhaps having separate categories for LLMs allowed vs. banned would help with 90% of this problem? So ppl who want to use LLM can do so at their pleasure, and only ppl who actively want to cheat (hopefully very few) will try to use LLM in the banned category.
@YaLTeR I promise lots of people would cheat. These are competitions with rewards (bragging rights at minimum, but often cash prizes, swag, invitations to events, etc.)
@lina right, rewards make this difficult
@lina I really hope that LLMs are a temporary phenomenon. Sure, the local ones will remain even after the bubble finally bursts, but they're ridiculously bad; you do need millions of dollars worth of GPUs to get to that "it's still bad but it looks plausible" level of output quality.