Mastodawn

Matt Massicotte Mar 24

... I find it strange that there isn't a suite of tests used to help evaluate the effectiveness of an agent skill?

Dimitri Bouniol Mar 24

@mattiem What’s the point if passing said tests is entirely non-deterministic and seed dependent?

Matt Massicotte Mar 24

@dimitribouniol I don’t know! How does one evaluate if a skill “works” or not?

David Beck

@mattiem @dimitribouniol What I’ve done that seems to work pretty well is come up with a moderately difficult example and give the agent a prompt, then revert the changes and start over again. Given the same repo state, model, etc., the results tend to be pretty consistent. But I don’t know how you would wrap that into an automated test.
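
One way the revert-and-rerun loop described above might be wrapped into something automatable is to run the same prompt several times from the same starting state and report a pass rate rather than a single pass/fail. This is only a rough sketch, not anything the posters describe using: `evaluate_skill`, `TrialReport`, and the `reset`, `run_agent`, and `check` callables are all hypothetical names, and how the repo is actually reverted or the agent actually invoked is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrialReport:
    """Outcome of running the same prompt several times from the same state."""
    passes: int
    trials: int

    @property
    def pass_rate(self) -> float:
        return self.passes / self.trials


def evaluate_skill(
    reset: Callable[[], None],
    run_agent: Callable[[], str],
    check: Callable[[str], bool],
    trials: int = 5,
) -> TrialReport:
    """Repeat reset -> run -> check and count how often the check passes.

    All three callables are hypothetical hooks supplied by the caller:
      reset     restores the starting state (for example, reverting the repo)
      run_agent runs the agent with a fixed prompt and returns its result
      check     decides whether that result counts as a success
    """
    if trials < 1:
        raise ValueError("trials must be at least 1")

    passes = 0
    for _ in range(trials):
        reset()                # start every trial from the same state
        output = run_agent()   # same prompt, same model, fresh attempt
        if check(output):
            passes += 1
    return TrialReport(passes=passes, trials=trials)
```

A test built on this could assert that `pass_rate` stays above a chosen threshold instead of requiring every run to succeed, which tolerates some run-to-run variation. The threshold and the number of trials are judgment calls that the thread does not settle.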