Even bigger, more expensive models aren't necessarily better for your use case. And how do you rigorously test prompt changes?
With evals! I've added evals to my open-source project, The Archive, and wrote about it here:
https://www.abramjackson.com/artificial-intelligence/the-archive-pt-3-dont-hack-away-on-vibes-alone/
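
If you haven't seen an eval before, here's a rough sketch of the idea in Python. It is not the code from The Archive. The test cases, the `run_eval` helper, and the `fake_model` stand-in are all made up for illustration, and a real setup would call an actual model and likely use a richer grader than a substring check.

```python
from typing import Callable, List, Tuple

# Hypothetical eval cases: (question, substring the answer should contain).
CASES: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (an assumption, so the sketch runs offline).

    It only answers when the prompt ends with an explicit 'Answer:' cue,
    which lets the two prompt variants below score differently.
    """
    canned = {"2 + 2": "4", "capital of France": "Paris"}
    if "Answer:" not in prompt:
        return "I'm not sure."
    for question, answer in canned.items():
        if question in prompt:
            return answer
    return "I'm not sure."


def run_eval(
    prompt_template: str,
    model: Callable[[str], str],
    cases: List[Tuple[str, str]],
) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = 0
    for question, expected in cases:
        output = model(prompt_template.format(question=question))
        if expected.lower() in output.lower():
            passed += 1
    return passed / len(cases)


# Compare two prompt variants on the same cases.
old_prompt = "{question}"
new_prompt = "Question: {question}\nAnswer:"

old_score = run_eval(old_prompt, fake_model, CASES)
new_score = run_eval(new_prompt, fake_model, CASES)
print(f"old prompt: {old_score:.0%}, new prompt: {new_score:.0%}")
```

The point is that the same fixed cases are run against every prompt or model variant, so a change gets a score instead of a gut feeling. Swapping `fake_model` for a cheaper or pricier real model gives the same kind of comparison across models.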