Hot take: don’t chase bigger models — build a 100–300 example, high‑quality eval set and gate releases on it. If an architecture or prompt tweak doesn’t beat that set reliably, it’s roll‑forward noise, not progress.