Hypothesis, Antithesis, Synthesis

https://antithesis.com/blog/2026/hegel/

Introducing Hegel, our new family of property-based testing libraries.

> property-based testing is going to be a huge part of how we make AI-agent-based software development not go terribly.

There's no doubt, I think, that testing will remain important and possibly become more important with more AI use, and so better testing is helpful, PBT included. But the problem remains verifying that the tests actually test what they're supposed to. Mutation testing can allow agents to get good coverage with little human intervention, and PBT can make tests better and more readable. But people still have to read and understand the tests, and I suspect that many people who claim to generate thousands of LOC per day don't.

And even if the tests were great and people carefully reviewed them, that's not enough to make sure things don't go terribly wrong. Anthropic's C compiler experiment didn't fail because of bad testing. Not only were the tests good, it took humans years to write them by hand, and the agents still failed to converge.

I think good tests are a necessary condition for AI not generating terrible software, but we're clearly not yet at a point where they're a sufficient one. So "a huge part" - possibly, but there are other huge parts still missing.

I actually think there's another angle here where PBT helps, which wasn't explored in the blog post.

That angle is legibility. How do you know your AI-written slop software is doing the right thing? One would normally read all the code. Bad news: that's not much less labor-intensive than not using AI at all.

But if one has comprehensive property-based tests, one can instead read only those tests to convince oneself that the software is doing the right thing.

By analogy: one doesn't need to see the machine-checked proof to know the claim is correct. One only needs to check that the theorem statement says the right thing.
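For concreteness, a "theorem statement" of that kind might look like the sketch below. It uses the Hypothesis library and a hypothetical `normalize_whitespace` function (nothing from the Hegel post); the point is that these few lines are what a reviewer would read instead of the implementation.

```python
# Sketch with the Hypothesis library; `normalize_whitespace` and its module
# are hypothetical stand-ins for whatever the AI wrote.
from hypothesis import given, strategies as st

from myproject.text import normalize_whitespace  # hypothetical function under test


@given(st.text())
def test_normalize_is_idempotent(s):
    # Applying the function twice gives the same result as applying it once.
    once = normalize_whitespace(s)
    assert normalize_whitespace(once) == once


@given(st.text())
def test_normalize_leaves_no_double_spaces(s):
    # The output never contains two spaces in a row.
    assert "  " not in normalize_whitespace(s)
```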

> But the problem remains verifying that the tests actually test what they're supposed to.

Definitely. It's a lot harder to fake this with PBT than with example-based testing, but you can still write bad property-based tests and agents are pretty good at doing so.
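A sketch of what "bad" versus "better" can look like, using Hypothesis and a hypothetical `sort_items` function: the weak property exercises the code but keeps passing even when it's broken, while the stronger one pins down the actual behaviour.

```python
from hypothesis import given, strategies as st

from myproject.sorting import sort_items  # hypothetical function under test


# A weak property: it exercises the code but asserts almost nothing,
# so it keeps passing even if sort_items is badly broken.
@given(st.lists(st.integers()))
def test_sort_does_not_crash(xs):
    assert isinstance(sort_items(xs), list)


# A stronger property: the output is ordered and is a permutation of the input.
@given(st.lists(st.integers()))
def test_sort_orders_and_preserves_elements(xs):
    result = sort_items(xs)
    assert all(a <= b for a, b in zip(result, result[1:]))
    assert sorted(result) == sorted(xs)
```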

I have generally found that agents with property-based tests are much better at not lying to themselves about it than agents with just example-based testing, but I still spend a lot of time yelling at Claude.

> So "a huge part" - possibly, but there are other huge parts still missing.

No argument here. We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

> We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

Yeah, I know. Just an opportunity to talk about some of the delusions we're hearing from the "CEO class". Keep up the good work!

> I have generally found that agents with property-based tests are much better at not lying to themselves

I've also observed the cheating increase. I recently tried to do a specific optimization on a big, complex function. I wrote a PBT that checks that the original function returns the same values as the optimized function on all inputs, and I also tracked the runtime to confirm that performance improved. Then I let Claude loose. The PBT was great at spotting edge cases, but eventually Claude always started cheating: it modified the test, it modified the original function, it implemented other (easier) optimizations, ...
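For readers who want the concrete shape, a differential property like that, sketched with Hypothesis and hypothetical function names (not the actual code from this experiment), looks roughly like this; the runtime tracking would live outside the property itself.

```python
from hypothesis import given, settings, strategies as st

# Hypothetical names: `original_fn` is the big complex function,
# `optimized_fn` is the candidate replacement being checked against it.
from myproject.core import original_fn, optimized_fn


@settings(max_examples=1000)
@given(st.lists(st.integers(), max_size=200))
def test_optimized_matches_original(xs):
    # Differential property: both implementations must agree on every input.
    assert optimized_fn(xs) == original_fn(xs)
```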

Ouch. Classic Claude. It does tend to cheat when it gets stuck, and I've had some success with stricter harnesses, reflection prompts and getting it to redo work when it notices it's cheated, but it's definitely not a solved problem.

My guess is that you wouldn't have had a better time without PBT here: it would still have either cheated or claimed victory incorrectly. But definitely agreed that PBT can't fully fix the problem, especially if it's PBT that the agent is allowed to modify. I've still anecdotally found that the results are better than without it, because even if agents often cheat when problems are pointed out, they'll definitely cheat if problems aren't pointed out.

> There's no doubt, I think, that testing will remain important and possibly become more important with more AI use, and so better testing is helpful, PBT included.

Given the Curry-Howard isomorphism, couldn't we ask AI to directly prove the desired property of the binary executable under the assumptions of a hardware model, instead of running PBTs?

By no means do I want to dismiss PBTs, but it seems that this could be both faster and more reliable.

And how do you know if it has proven the property you want, instead of something that's just complicated looking but evaluates to true?
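To make that concern concrete, here is a minimal Lean sketch (hypothetical names, not tied to any particular prover setup): the first theorem is machine-checked and true, yet says nothing about the program; the statement you actually want has to mention the function under test, and a human still has to read that statement to know it captures the intended property.

```lean
-- Machine-checked and true, but it constrains nothing about the program.
theorem looks_rigorous (n : Nat) : n + 0 = n := rfl

-- The statement that matters has to mention the code under test; `mySort` is
-- hypothetical, and proving this is then the hard part.
-- theorem mySort_preserves_length (xs : List Nat) :
--     (mySort xs).length = xs.length := ...
```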

Proofs are a form of static analysis. Static analysis can find interesting bugs, but how a system behaves isn't purely a property of source code. It won't tell you whether the code will run acceptably in a given environment.

For example, if memory use isn't modelled, it won't tell you how big the input can be before the system runs out of memory. Similarly, if your database isn't modelled then you need to test with a real database. Web apps need to test with a real web browser sometimes, rather than a simplified model of one. Databases and web browsers are too complicated to build a full-fidelity mathematical model for.

When testing with real systems there's often the issue that the user's system is different from the one you use to test. You can test with recent versions of Chrome and Firefox, etc, which helps a lot, but what about extensions?

Nothing covers everything, but property tests and fuzzers actually run the code in some test environment. That's going to find different issues than proofs will.

> it took humans years to write them by hand, and the agents still failed to converge.

I think there is some hazard in assuming that what agents fail at today, they will continue to fail at in the future.

What I mean is, if we take the optimistic view that agents will continue improving on their current trajectory for another year or two, then it is worthwhile to consider what tools and infrastructure we will need for them. Companies that start building that now, for the future they assume is coming, are going to be better positioned than people who wake up to a new reality in two years.