maybe i'm just not good enough of a programmer to use coding agents, i guess? i definitely don't trust my ability to know whether or not some code will do what i want it to do just by looking at it
@aparrish i think you're too good for it. and i dont think its pride or hubris on my part. the bar is just that low
@aparrish I don't even look at the code the agents write, or at least not much. It works better for things that you can build good test suites for or where you care more about the output of the program than the way the program works. See also @simon's book on agentic programming.
Agentic Engineering Patterns - Simon Willison's Weblog

Simon Willison’s Weblog
@nelson i don't trust my tests to be correct either, only that they reflect my best understanding. and i'm not sure what it could mean to care more about the output of a program than how the program works...? isn't the output of a program *determined by* how the program works? i feel like whenever i've believed there was a difference between those two things, i ended up being wrong (sometimes subtly, sometimes not)

@aparrish @nelson I don't think it's enough to accept code is just a black box ratcheted by tests.

If you look at the state of Claude code... It's really bad. Like worst case devolve to bogo sort bad... like store your credentials in plain text files because it can't guarantee it won't lose your credentials mid process bad.

edit: Ratcheting by tests doesn't tell you about non-deterministic total failure in rare circumstances and it doesn't tell you about security.

@aparrish @nelson I would also say, ratcheting by tests is basically reinforcement learning, and you are effectively doing extra steps to add the possibility of a hill climbing solution with over fitting.
@theeclecticdyslexic @nelson yeah, every instinct i have from 20+ years in software dev says "if the output looks right, and the code passes the tests, but you don't actually understand it, and you push to prod/incorporate it into your workflow anyway, you are bound to spend 10x the time fixing it than you would have spent understanding it in the first place" but maybe others don't have that instinct?
@aparrish @nelson well, if the LLM knows what the tests are, and you don't read the code it writes... You simply can't know it didn't write dedicated code paths for your tests.

@aparrish I use Claude Code for a lot of one-offs and non-critical projects. Ie, my little thread unroller for travel postcards. The standard of quality here is

  • a few tests to make sure data is being included
  • look at the HTML output. "looks good to me!"
  • This is not a high stakes or subtle program I'm working on! For something more complex like a Fediverse server, there's way more hidden and subtle than I'd trust to an agent. People are doing that kind of work with AI too but I don't.

    Italy, France, and Spain 2026

    @aparrish @nelson Yeah, "correctness" is something we have to approach from multiple angles.

    Sometimes we look at program outputs and say, "yes, that output is right for that input".

    Sometimes we read the code and say, "yes, this code is correct by construction" (e.g. we can see that control flow *cannot* pass into a sensitive region without a certain check happening).

    Sometimes we can use proofs, or fuzzing, or other tools.

    It feels like vibe coders are focusing on only that first type.

    @aparrish @nelson A lot of programmers don't seem to understand that security is the *absence* of a feature.

    Sure, features can sometimes be verified by looking at a program's behavior. But you can't use that to show that a feature is missing. The should-be-missing feature might be something like "Eve can read Alice's messages to Bob".

    If vibe coders are only checking for the presence of features, then can never detect the "presence" of security.

    @aparrish @nelson It turns out that if you want to establish that a piece of software is secure, you're going to have to understand it.

    That's why we're seeing such basic vulnerabilities in slopware.

    @aparrish (obligatory disclaimer: I’m flailing in the dark with this stuff just like everybody else) I find myself spending much more of my dev time finding alternative ways to get that understanding without reading (all of) the code. LLM agents are good for that too, writing smaller scripts that pull out and visualise (deterministically!) important information. And also refactoring (more thoroughly than I would have time for without agents) to emphasise patterns that are easy to scan-and-agree.
    @aparrish for most uses I think of it like cardboard: flimsy, yes. But great for prototyping as it’s sturdy enough to live with for weeks or longer. Once you prove what you’re trying to test with it, then you can remake it w/ better material (and/or hire a better programmer to help)