My first rule for reviewing vibecoded PRs: if I can type "find bugs in this PR" into Claude and it finds bugs, then you should have done that, not me.

James Scholes

@nolan I don't think an assumption that someone didn't do that always holds up against the non-deterministic nature of LLMs. They will often find or invent fault with the very thing they claimed was flawless 30 seconds ago in a separate session.

Nolan Lawson Mar 5

@jscholes You're right, I'm being a bit glib. Actually I've found Claude alone is not a great PR reviewer – you have to chain two or three of them together and have them vote. There's an interesting article that suggests Claude plus another model is the best bang for the buck: https://milvus.io/blog/ai-code-review-gets-better-when-models-debate-claude-vs-gemini-vs-codex-vs-qwen-vs-minimax.md

Claude vs Gemini vs Codex vs Qwen vs MiniMax Code Review - Milvus Blog

We tested Claude, Gemini, Codex, Qwen, and MiniMax on real bug detection. The best model hit 53%. After adversarial debate, detection jumped to 80%.
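The "chain two or three of them together and have them vote" idea from the reply above can be sketched roughly as follows. This is only an illustration, not the method from the linked article: the `Reviewer` type, `vote_on_findings`, and the `min_votes` parameter are hypothetical names, each reviewer is assumed to be a plain function wrapping whatever model you use, and findings are assumed to be already normalized to comparable strings (in practice, matching findings across models is the hard part).

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical interface: a reviewer takes a diff and returns its findings.
# How each reviewer talks to a model is deliberately left out of this sketch.
Reviewer = Callable[[str], Iterable[str]]


def vote_on_findings(
    diff: str, reviewers: list[Reviewer], min_votes: int = 2
) -> list[str]:
    """Keep only the findings reported by at least `min_votes` reviewers."""
    tally: Counter[str] = Counter()
    for review in reviewers:
        # set() so a reviewer that repeats itself still casts one vote per finding
        tally.update(set(review(diff)))
    return sorted(f for f, votes in tally.items() if votes >= min_votes)


# Stub reviewers standing in for separate model sessions (assumed data).
def reviewer_a(diff: str) -> list[str]:
    return ["off-by-one in loop", "missing null check"]


def reviewer_b(diff: str) -> list[str]:
    return ["missing null check"]


def reviewer_c(diff: str) -> list[str]:
    return ["missing null check", "off-by-one in loop", "unused import"]


if __name__ == "__main__":
    agreed = vote_on_findings("<diff text>", [reviewer_a, reviewer_b, reviewer_c])
    print(agreed)
```

Requiring agreement is one way to damp the run-to-run variation raised earlier in the thread: a finding that only one session invents gets dropped, at the cost of also dropping real bugs that only one reviewer happens to catch.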