this spring I've been teaching undergrads to use LLM agents. my rationale for doing this was that it would give me a chance to covertly teach lots of real software engineering, which is what I've done.

meanwhile, I've been watching the students closely to try to figure out whether a coding agent is a leveling factor (reducing differences in effectiveness between different students) or an anti-leveling factor (amplifying differences). at this point I'm 99% sure it's the second thing.

I’m not sure how to feel about this, but I suspect it’s good news for CS types: it means all the managers and other random vibe coders aren’t really competition
My own experience is that it’s not at all easy to get reasonable code out of even the latest LLMs. Makefiles and that sort of thing: easy peasy. Medium or larger code you’d actually maintain? Very tricky

@regehr

What kind of tasks are you using to test this out? I have not had the time, but I keep wondering how you find a task not well represented in the training set, so that it is clear what you are actually testing.

Maybe your experience says this is not really an issue and any medium or large code test will show these deficiencies.

I would have to think that if you could come up with a task not well represented, they would do a lot worse.

@shafik given how good the current models are at generalization, I'm not sure how to approach "not well represented in the training set".

but, what I'm doing is writing totally new LLVM passes, fixing previously unknown bugs in open source software, implementing dataflow transfer functions that are better than anything out there, etc. the current models are absolutely competent at these tasks.
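for anyone unfamiliar with the term, a minimal sketch of what a dataflow transfer function looks like, assuming a toy sign-analysis abstract domain (the domain and operator here are illustrative choices, not anything from the actual passes being discussed):

```python
# Toy dataflow transfer function over a sign-analysis lattice.
# This is an illustrative sketch only, not code from this thread.

from enum import Enum

class Sign(Enum):
    BOT = 0   # unreachable / no information
    NEG = 1
    ZERO = 2
    POS = 3
    TOP = 4   # could be any sign

def transfer_add(a: Sign, b: Sign) -> Sign:
    """Abstract transfer function for x + y over the sign domain."""
    if Sign.BOT in (a, b):
        return Sign.BOT      # bottom propagates
    if Sign.TOP in (a, b):
        return Sign.TOP      # unknown input, unknown output
    if a == Sign.ZERO:
        return b             # 0 + y has the sign of y
    if b == Sign.ZERO:
        return a
    if a == b:
        return a             # pos + pos = pos, neg + neg = neg
    return Sign.TOP          # pos + neg: sign unknown

print(transfer_add(Sign.POS, Sign.ZERO).name)  # POS
print(transfer_add(Sign.POS, Sign.NEG).name)   # TOP
```

a "better than anything out there" transfer function would be more precise than this, i.e. return TOP less often while remaining sound.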

@shafik so anyway, things are weird right now. but one thing that's completely clear to me is that the models are not simply reproducing things from their training set. that's just not how to think about it anymore. I mean, they might do that sometimes-- but it's not the interesting part.

@regehr @shafik Folks working in this space have also indicated that most of the major models have pretty strict recitation checking to specifically _prevent_ emitting something that is directly in their training set...

That said, I can't find good references to this. I'm worried the terminology I'm using is wrong.

And I don't expect these to be perfect. My understanding is that the goal is to have significantly less recitation than the rate at which humans tend to copy/paste from StackOverflow, which seems like a plausible bar to me.
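To make the idea concrete, here is a naive sketch of what a recitation check might look like, assuming it works by n-gram overlap against training documents. Production systems are surely more sophisticated; the function names and window size are illustrative only:

```python
# Naive recitation check: flag model output that shares any n-token
# window with a training document. Illustrative sketch only -- real
# recitation filters in deployed models are not public.

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_like_recitation(output: str, corpus_doc: str, n: int = 8) -> bool:
    """True if output shares at least one n-gram with corpus_doc."""
    return bool(ngrams(output.split(), n) & ngrams(corpus_doc.split(), n))

doc = "to be or not to be that is the question whether tis nobler"
print(looks_like_recitation("he asked to be or not to be that is the question again", doc))  # True
print(looks_like_recitation("a completely fresh sentence with no overlap at all here", doc))  # False
```

Imperfect by construction: paraphrases slip through, and common idioms can be falsely flagged, which is why a statistical bar like "less recitation than human copy/paste" is the realistic goal.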

@chandlerc @shafik yeah, but. this sort of thing: https://arxiv.org/abs/2505.12546
Extracting memorized pieces of (copyrighted) books from open-weight language models

Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression in their training data. Drawing on both machine learning and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we extend a recent probabilistic extraction technique to measure memorization of 50 books in 17 open-weight LLMs. Through thousands of experiments, we show that the extent of memorization varies both by model and by book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter is so memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.

@regehr @shafik being *able* to extract verbatim training material is IMO very different from it happening incidentally... There are real problems there, but they seem somewhat orthogonal to the problems of incidental / unwitting including things verbatim in otherwise "normal" coding synthesis.

@chandlerc @regehr @shafik

Strongly depends on the uniqueness of the problem space. Pick something esoteric -- for which there are only a couple of public examples -- and you'll often get nigh-verbatim output. This is true for both coding and non-coding use cases.