@mattiem @dimitribouniol what I’ve done that seems to work pretty well is come up with a moderately difficult example, give the agent a prompt. And then revert the changes and start over again. Given the same repo state, model etc, the results tend to be pretty consistent. But I don’t know how you would wrap that into an automated test.