... I find it strange that there isn't a suite of tests to help evaluate the effectiveness of an agent skill.
@mattiem What’s the point if passing said tests is entirely non-deterministic and seed-dependent?
@dimitribouniol I don’t know! How does one evaluate if a skill “works” or not?
@mattiem @dimitribouniol benchmarks for agent models exist despite non-determinism. I guess for skills the biggest problem is the lack of a general test harness, especially as the environments in which agents operate using those skills are so diverse. A test harness would have to simulate such an environment.
@maxd @mattiem I imagine such a test harness is mostly just mocking the usual set of MCP services, but it honestly feels pretty expensive to have a test suite that runs every time you want to improve the set of skills, especially if you want statistical confidence that an improvement was really made…
@dimitribouniol @mattiem skills != MCP, they're just Markdown files with human language in them
@maxd @mattiem I mostly meant MCP in terms of what I assume the agents use to actually access files and run commands on disk. You know, to benchmark the skills against _something_
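The "statistical confidence" concern raised above can be made concrete: with a non-deterministic agent, a skill edit only counts as an improvement if the pass rate rises by more than the run-to-run noise. A minimal sketch of that comparison, using Wilson score intervals over repeated trials (all names here are hypothetical; no real harness or MCP API is assumed):

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate over n non-deterministic runs."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

def improved(old: tuple[int, int], new: tuple[int, int]) -> bool:
    """Treat a skill edit as a real improvement only when the two intervals don't overlap."""
    _, old_hi = wilson_interval(*old)
    new_lo, _ = wilson_interval(*new)
    return new_lo > old_hi

# 30/50 passing runs before the edit vs 45/50 after: intervals clearly separate.
print(improved((30, 50), (45, 50)))  # → True
# 30/50 vs 32/50: within noise, not evidence of improvement.
print(improved((30, 50), (32, 50)))  # → False
```

This also illustrates the cost point: distinguishing, say, a 60% pass rate from a 64% one needs far more than 50 runs per variant, which is exactly why rerunning a full suite for every small skill tweak gets expensive.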