Matt Massicotte Mar 24

... I find it strange that there isn't a suite of tests used to help evaluate the effectiveness of an agent skill?

Dimitri Bouniol

@mattiem What’s the point if passing said tests is entirely non-deterministic and seed dependent?

Matt Massicotte Mar 24

@dimitribouniol I don’t know! How does one evaluate if a skill “works” or not?

Max Desiatov 🇺🇦 Mar 24

@mattiem @dimitribouniol Benchmarks for agent models exist despite non-determinism. I guess for skills the biggest problem is the lack of a general test harness, especially as the environments in which agents operate using those skills are so diverse. A test harness would have to simulate such an environment.

Dimitri Bouniol Mar 24

@maxd @mattiem I imagine such a test harness is mostly just mocking the usual set of MCP services, but honestly it feels pretty expensive to have a test suite that runs every time you want to improve the set of skills, especially if you want statistical confidence that an improvement was really made…
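
To make the "statistical confidence" point concrete, here is a minimal sketch of one common way to compare pass rates from repeated runs of two skill versions: a pooled two-proportion z-test. The function name and the idea of scoring each run as a simple pass/fail are assumptions for illustration; nothing in the thread prescribes this test.

```python
import math


def two_proportion_z_test(passes_a, runs_a, passes_b, runs_b):
    """Compare pass rates of skill version A and version B.

    Returns (z, p_value), where z is positive when B passes more often
    and p_value is the two-sided p-value under a normal approximation.
    Hypothetical helper: assumes each run is an independent pass/fail trial.
    """
    if runs_a <= 0 or runs_b <= 0:
        raise ValueError("each version needs at least one run")

    rate_a = passes_a / runs_a
    rate_b = passes_b / runs_b

    # Pooled pass rate under the null hypothesis "no real difference".
    pooled = (passes_a + passes_b) / (runs_a + runs_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))

    if std_err == 0:
        # Every run passed or every run failed: no evidence of a difference.
        return 0.0, 1.0

    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With 20 runs per version, going from 12 passes to 16 gives a p-value of roughly 0.17, so the apparent improvement could easily be noise. That is the cost concern in numbers: a difference of that size only becomes convincing with many more runs per version.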

Max Desiatov 🇺🇦 Mar 24

@dimitribouniol @mattiem skills != MCP, they're just Markdown files with human language in them

Dimitri Bouniol Mar 25

@maxd @mattiem I mostly meant MCP in terms of what I assume the agents use to actually access files and run commands on disk. You know, to actually benchmark the skills against _something_

Dimitri Bouniol Mar 24

@mattiem Vibes 🤪

David Beck Mar 24

@mattiem @dimitribouniol What I’ve done that seems to work pretty well is come up with a moderately difficult example and give the agent a prompt, then revert the changes and start over again. Given the same repo state, model, etc., the results tend to be pretty consistent. But I don’t know how you would wrap that into an automated test.
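
One way the revert-and-repeat loop described above might be wrapped into an automated test is sketched here. Each trial starts from a fresh copy of the same fixture directory, which stands in for reverting the repo. `run_agent` and `check` are hypothetical callables supplied by the caller (for example, something that invokes an agent with a fixed prompt, and something that builds the project or runs its tests); they are not part of any real agent tool's API.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def run_trials(
    fixture: Path,
    run_agent: Callable[[Path], None],
    check: Callable[[Path], bool],
    trials: int = 5,
) -> List[bool]:
    """Run the same task several times from an identical starting state.

    fixture:   directory holding the repo state every trial starts from
    run_agent: hypothetical callable that lets the agent modify the copy
    check:     hypothetical callable that decides whether the result passes
    """
    results = []
    for _ in range(trials):
        # A fresh copy per trial plays the role of "revert and start over".
        with tempfile.TemporaryDirectory() as tmp:
            workdir = Path(tmp) / "repo"
            shutil.copytree(fixture, workdir)
            run_agent(workdir)
            results.append(check(workdir))
    return results
```

The list of pass/fail results can then be summarized as a pass rate per skill version. Because the agent only ever touches a temporary copy, the fixture itself never needs to be reverted by hand.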