This is super cool. It does feel a little limited, though, in that you want to be able to test reasoning, not only answers. I'm not sure how to write a task description that requires the LLM to have given the right answer for the right reason or reasons. But it's a step in the right direction, for sure.