... I find it strange that there isn't a suite of tests to help evaluate the effectiveness of an agent skill.
@mattiem What’s the point if passing said tests is entirely non-deterministic and seed-dependent?
@dimitribouniol I don’t know! How does one evaluate if a skill “works” or not?
@mattiem @dimitribouniol What I’ve done that seems to work pretty well is come up with a moderately difficult example and give the agent a prompt, then revert the changes and start over again. Given the same repo state, model, etc., the results tend to be pretty consistent. But I don’t know how you would wrap that into an automated test.
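
One rough way the revert-and-repeat loop described above might be wrapped into something automatable is sketched below. It is only a sketch under stated assumptions: `evaluate_skill`, `TrialReport`, and the three callables (`run_agent`, `check`, `reset`) are hypothetical names invented for illustration, not part of any agent tool. Because a single run can pass or fail by chance, the harness reports a pass rate over several trials instead of a single pass/fail result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrialReport:
    """Outcome of running the same prompt several times."""
    trials: int
    passes: int

    @property
    def pass_rate(self) -> float:
        # Fraction of trials whose result passed the check.
        return self.passes / self.trials if self.trials else 0.0


def evaluate_skill(
    run_agent: Callable[[], None],
    check: Callable[[], bool],
    reset: Callable[[], None],
    trials: int = 5,
) -> TrialReport:
    """Repeat: reset the repo, run the agent, check the result.

    All three callables are assumptions supplied by the caller:
    - reset: restores the starting repo state (for example by
      shelling out to `git reset --hard`)
    - run_agent: gives the agent the fixed prompt (hypothetical)
    - check: returns True if the outcome is acceptable, for
      example if the project's own tests pass
    """
    passes = 0
    for _ in range(trials):
        reset()      # every trial starts from the same repo state
        run_agent()  # same prompt and model each time
        if check():
            passes += 1
    reset()          # leave the repo as it was found
    return TrialReport(trials=trials, passes=passes)
```

An automated test could then assert on a threshold, for example `report.pass_rate >= 0.8`, where the threshold is a judgment call and not a value taken from this thread.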