My first rule for reviewing vibecoded PRs: if I can type "find bugs in this PR" into Claude and it finds bugs, then you should have done that, not me.
@nolan I don't think the assumption that someone didn't do that always holds up against the non-deterministic nature of LLMs. They will often find or invent fault with the very thing they claimed was flawless 30 seconds ago in a separate session.
@jscholes You're right, I'm being a bit glib. Actually I've found Claude alone is not a great PR reviewer – you have to chain two or three of them together and have them vote. There's an interesting article that suggests Claude plus another model is the best bang for the buck: https://milvus.io/blog/ai-code-review-gets-better-when-models-debate-claude-vs-gemini-vs-codex-vs-qwen-vs-minimax.md
@nolan I think a coder using an LLM to create a PR needs to be the first reviewer of that PR. If it’s not worth spending my own time to make sure it works, why would I burden my team? My workflow now is to open draft PRs, review, and mark as ready.
@ashfurrow My workflow is also to open a draft PR of pure slop. But then I have 3 different agents (Codex, Claude, Bugbot) review it and vote before I waste my human eyeballs on it. Otherwise you spend too much time finding obvious mistakes, DRYness issues, etc.
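The voting step in that workflow can be sketched in a few lines. This is a minimal illustration, not anyone's actual tooling: how you collect each agent's verdict (CLI run, API call, bot comment) is left out, and the reviewer names are just the ones mentioned above. Only the aggregation logic is shown, with ties deliberately falling back to a human look.

```python
# Hedged sketch: combine verdicts from several AI reviewers by majority
# vote before a human reviews the PR. Verdict collection is omitted;
# reviewer names are placeholders for whatever agents you actually run.
from collections import Counter

def majority_verdict(verdicts):
    """Return 'approve' only if a strict majority of reviewers approve.

    Ties or majority objections return 'request_changes', so a contested
    PR always gets human eyeballs.
    """
    counts = Counter(verdicts)
    if counts["approve"] > counts["request_changes"]:
        return "approve"
    return "request_changes"

# e.g. verdicts gathered from Codex, Claude, and Bugbot runs:
print(majority_verdict(["approve", "request_changes", "approve"]))
print(majority_verdict(["approve", "request_changes", "request_changes"]))
```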
@nolan interesting! workflows are changing so quickly, it’s fascinating to see it all. I’ve had good results using Conductor for this kind of adversarial AI review.