this spring I've been teaching undergrads to use LLM agents. my rationale for doing this was that it would give me a chance to covertly teach lots of real software engineering, which is what I've done.

meanwhile, I've been watching the students closely to try to figure out whether a coding agent is a leveling factor (reducing differences in effectiveness between different students) or an anti-leveling factor (amplifying differences). at this point I'm 99% sure it's the second thing.

I’m not sure how to feel about this, but I suspect it’s good news for CS types: it means all the managers and other random vibe coders aren’t really competition.
My own experience is that it’s not at all easy to get reasonable code out of even the latest LLMs. Makefiles and that sort of thing: easy peasy. Medium or larger code you’d actually maintain? Very tricky.

@regehr

What kinds of tasks are you using to test this? I have not had the time, but I keep wondering how you find a task not well represented in the training set, so that it is clear what you are actually testing.

Maybe your experience says this is not really an issue and any medium or large code test will show these deficiencies.

I would think that if you could come up with a task not well represented, they would do a lot worse.

@shafik given how good the current models are at generalization, I'm not sure how to approach "not well represented in the training set".

but, what I'm doing is writing totally new LLVM passes, fixing previously unknown bugs in open source software, implementing dataflow transfer functions that are better than anything out there, etc. the current models are absolutely competent at these tasks.
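To make "dataflow transfer function" concrete: here is a minimal, hypothetical sketch (in Python, and not tied to the actual course assignments or to any real LLVM pass) of a transfer function for bitwise AND over a known-bits abstract domain, where each abstract value records which bits of an integer are definitely 0 and which are definitely 1:

```python
# Hypothetical sketch: a transfer function for bitwise AND in a
# known-bits abstract domain. Each abstract value tracks the bits
# known to be 0 and the bits known to be 1; all other bits are unknown.

from dataclasses import dataclass

WIDTH = 8
MASK = (1 << WIDTH) - 1

@dataclass(frozen=True)
class KnownBits:
    zeros: int  # bits known to be 0
    ones: int   # bits known to be 1

def transfer_and(a: KnownBits, b: KnownBits) -> KnownBits:
    # A result bit is definitely 1 only if it is 1 in both inputs.
    ones = a.ones & b.ones
    # A result bit is definitely 0 if it is 0 in either input.
    zeros = (a.zeros | b.zeros) & MASK
    return KnownBits(zeros, ones)
```

"Better than anything out there" is about precision: a stronger transfer function proves more bits known than a weaker one on the same abstract inputs.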

@shafik so anyway, things are weird right now. but one thing that's completely clear to me is that the models are not simply reproducing things from their training set. that's just not how to think about it anymore. I mean, they might do that sometimes-- but it's not the interesting part.

@regehr @shafik Folks working in this space have also indicated that most of the major models have pretty strict recitation checking to specifically _prevent_ emitting something that is directly in their training set...

That said, I can't find good references to this. I'm worried the terminology I'm using is wrong.

And I don't expect these to be perfect. My understanding is that the goal is to have significantly less recitation than the rate at which humans tend to copy/paste from StackOverflow, which seems like a plausible bar to me.

@chandlerc @shafik yeah, but. this sort of thing: https://arxiv.org/abs/2505.12546
Extracting memorized pieces of (copyrighted) books from open-weight language models

Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression in their training data. Drawing on both machine learning and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we extend a recent probabilistic extraction technique to measure memorization of 50 books in 17 open-weight LLMs. Through thousands of experiments, we show that the extent of memorization varies both by model and by book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter is so memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.

@regehr @shafik being *able* to extract verbatim training material is IMO very different from it happening incidentally... There are real problems there, but they seem somewhat orthogonal to the problems of incidentally / unwittingly including things verbatim in otherwise "normal" coding synthesis.

@chandlerc @regehr @shafik

Strongly depends on the uniqueness of the problem space. Pick something esoteric -- for which there are only a couple of public examples -- and you'll often get nigh-verbatim output. This is true for both coding and non-coding use cases.

@shafik @regehr Every time I try to use Codex, it uses nonexistent API calls or has subtle errors in the boundary conditions. The only thing it ever does okay at is simple web-based interfaces over existing data.

@regehr There's a good C programmer I know from /r/C_programming and from his great blog who recently wrote about how he completely changed his stance on LLM/AI coding in the past year. He always shared great insights, ideas, and projects in the C sub, and now the (expensive) LLM seems to work really well for him: https://nullprogram.com/blog/2026/03/29/

(Apparently not great yet in C, so it's all C++ now.)

Maybe his best-known project: https://github.com/skeeto/w64devkit

My post: https://redd.it/1s7mbqw

2026 has been the most pivotal year in my career… and it's only March

@regehr I’ve had a similar experience. LLMs have been helpful with reviewing and doing simple work for a semi-sophisticated rewriting system I’m building, but when I experimented with letting them work on the core algorithms, they tended to produce code I wouldn’t have written, and that isn’t as clean or robust as I’d like. I’ve found them useful for helping sketch out a method, but I usually do the actual coding work and let them find my typos and subtle bugs when I’m done. It took a weekend to write a confluence checker that would have taken me maybe 3x that before: I wrote the code, Opus reviewed it, and I let it write some mundane stuff I didn’t feel like doing, like pretty printers to help me debug. I often wonder what the code looks like for people who primarily vibe code and just shovel tests into the model.
@regehr I also noticed they tend to suggest overly complex methods sometimes. I had a DAG reachability problem that had to be computed and kept up to date as the DAG was being modified, and both Opus and GPT suggested cool-sounding methods (with paper citations that were real), both of which were pretty complex. I was suspicious, and one evening, after 45 minutes and a beer, I came up with a much cleaner method that was just as efficient but a ton simpler. Had I taken their advice I would have gone down a deep, unnecessary hole.
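For a sense of scale, the simple end of that design space is genuinely small. Here is a hypothetical sketch (not the poster's actual method, which isn't described) of the most naive approach: answer reachability queries with a plain DFS, memoize the answers, and invalidate the cache whenever the DAG changes:

```python
# Hypothetical sketch: naive dynamic DAG reachability.
# Queries run an iterative DFS and are memoized; any edge mutation
# conservatively clears the whole cache.

from collections import defaultdict

class DynamicDAG:
    def __init__(self):
        self.succ = defaultdict(set)  # node -> set of successors
        self._cache = {}              # (src, dst) -> bool

    def add_edge(self, u, v):
        self.succ[u].add(v)
        self._cache.clear()  # any cached answer may now be stale

    def remove_edge(self, u, v):
        self.succ[u].discard(v)
        self._cache.clear()

    def reaches(self, src, dst):
        key = (src, dst)
        if key not in self._cache:
            # iterative DFS from src, stopping early if dst is found
            seen, stack, found = set(), [src], False
            while stack:
                n = stack.pop()
                if n == dst:
                    found = True
                    break
                if n in seen:
                    continue
                seen.add(n)
                stack.extend(self.succ[n])
            self._cache[key] = found
        return self._cache[key]
```

Whether something this simple is "just as efficient" depends entirely on the query/update mix; the point is only that the baseline worth beating can be a few dozen lines.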
@regehr How are you finding the good students are making use of them?
@penguin42 I don’t have good thoughts about that yet
@regehr do you have an example for why that is? I’ve been wondering that myself and leaning towards (2) as well, but it was just a gut feeling with no evidence.

@ryan so you know how one student will have a bug, form a wrong hypothesis about it, get on the wrong track, and take a very very long time to track down the bug, whereas another student will just sort of home in on the issue right away?

I feel like it's just more of the same. it's a difference in how effective people's mental models and hypotheses are, and the available tools simply amplify whatever effects are already there.

but I have no hard evidence for anything

Giving University Exams in the Age of Chatbots, by Ploum (Lionel Dricot).

@adx I had not! thank you
@regehr It seems like you found the same thing he did.

@adx @regehr The kids. They are alright. (With this set of rules on chatbot use anyway.)

This is probably the most hopeful I've ever felt about life in a post-LLM society.

@regehr Eons ago, in my 4th-year operating systems course, the assignment was to build a linking loader for a simplified version of OS/360 object files. I wrote a test case, which ended up being used by almost everyone. And, for the hell of it, I implemented the not-so-simple cases (e.g. 3-byte unaligned addresses). AFAIK, everyone who passed my easy test case also passed the hard test case (~5 out of the class of ~40). We marked each other's code... I was given the code of someone who had a good reputation, and I can only describe it as "not even wrong".

I don't think I'm a "10x" programmer but there are definitely "÷10" programmers.

@regehr My experience is: if you have a task to do something that you don’t know how to do, an LLM is a great way to complete the task without learning how to do it.

So in the short term they feel like equalizers — we all finished the assignment! — but in the…not even the long term, the immediate medium term, they amplify the gulf between skilled and unskilled. Those with a desire and ability to learn more continue to learn more; those who are struggling for any reason don’t.

@regehr it's more of a replacement for a keyboard and text editor than a replacement for programming
@regehr This very much aligns with my university experience as well. My running theory is that utilizing LLMs effectively (a) is just a very different skill from plain coding, somewhat closer to product management / requirements engineering, and (b) requires an additional level of self-discipline, because clicking "accept" on everything may get you a passing grade due to how courses are set up today, but it definitely won't get you an actually good result.