Leanstral: Open-source agent for trustworthy coding and formal proof engineering
Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37
It’s great to see people realising this pattern: agents can specify the desired behavior and then write code to conform to the specs.
TDD, formal verification, whatever your tool: verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work, and because they are executable, they put zero tokens in the context when the code is correct.
It’s more powerful than reams upon reams of markdown specs, because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always neglected, because we have gotten by on letting teams hold context in their heads and in docs.
As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.
> That’s because it encodes details, not intent.
Be careful here - make sure you encode the right details. I've seen many cases where the tests encode the details of how something was implemented rather than what it is intended to do. This means you can't refactor anything, because your tests are enforcing a design. (Refactoring is changing code without deleting tests; the trick is making design changes without deleting tests, which means you have to test as much as possible at points where changing that part of the design isn't feasible anyway.)
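A toy illustration of the distinction (the `Cart` class and its tests are hypothetical, not from any project discussed here):

```python
# A shopping cart whose internal storage is an implementation detail.
class Cart:
    def __init__(self):
        self._items = []  # detail: a list of (name, price) pairs

    def add(self, name, price):
        self._items.append((name, price))

    def total(self):
        return sum(price for _, price in self._items)

# Implementation-coupled test: it asserts on the private storage, so it
# breaks if _items becomes a dict, even though observable behavior is
# unchanged. This is the kind of test that "enforces a design".
def test_coupled():
    cart = Cart()
    cart.add("book", 10)
    assert cart._items == [("book", 10)]

# Behavior test: it asserts only on the public contract, so it survives
# any refactor that keeps totals correct.
def test_behavior():
    cart = Cart()
    cart.add("book", 10)
    cart.add("pen", 2)
    assert cart.total() == 12

test_coupled()
test_behavior()
```

Both tests pass today; only the second one still passes after a representation change.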
While you are right that you need to be encoding the right details, I disagree on the tests enforcing a design point.
As part of the proper testing strategy, you will have tests that cover individual behavior of a small block/function (real "unit" tests), tests that cover integration points only up to the integration itself, and a small number of end-to-end or multi-component integration tests.
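The three categories can be sketched with a toy pricing example (names here are illustrative only):

```python
# 1. A small block with a real "unit" test target.
def apply_discount(price, percent):
    return round(price * (1 - percent / 100), 2)

# 2. An integration point between components: the service takes its
#    collaborator as a parameter, so the seam can be tested in isolation.
class PricingService:
    def __init__(self, discount_fn):
        self.discount_fn = discount_fn

    def quote(self, price, percent):
        return self.discount_fn(price, percent)

# Unit test: one function, no collaborators.
assert apply_discount(100, 25) == 75.0

# Integration test: only the seam itself, with a stub collaborator.
stub_service = PricingService(lambda price, amount: price - amount)
assert stub_service.quote(100, 25) == 75

# End-to-end-style test: real wiring; keep these few in number.
assert PricingService(apply_discount).quote(100, 25) == 75.0
```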
Only the last category should stay mostly unchanged under refactoring, depending on the type of refactor you are doing.
Integration tests will obviously be affected when you are refactoring the interfaces between components, and unit tests will be affected when you are refactoring the components themselves. Yes, you should apply an incremental, reverse-TDD strategy: do the refactor but keep the old interface, potentially by calling into the new API from the old; then, in a second step, replace uses of the old API as well, including in tests.
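A minimal sketch of that two-step refactor (the function names are hypothetical):

```python
# Step 1: introduce the new API, and keep the old interface as a thin
# shim that delegates to it, so existing tests keep passing unchanged.
def fetch_user_v2(user_id, *, include_profile=False):
    # New implementation with the extended interface.
    user = {"id": user_id}
    if include_profile:
        user["profile"] = {}
    return user

def fetch_user(user_id):
    # Old interface, now just calling into the new API.
    return fetch_user_v2(user_id)

# Old tests still pass against the shim:
assert fetch_user(42)["id"] == 42

# Step 2 (a later change): migrate callers and tests to fetch_user_v2,
# then delete the fetch_user shim.
assert fetch_user_v2(42, include_profile=True)["profile"] == {}
```

The point is that no test is deleted in step 1; tests migrate alongside callers in step 2.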
In a TDD approach, tests generally define both behavior and implementation: it would be weird if none of them needed changing when you change the implementation.
Taken to its logical conclusion, what you are saying is: do not write (or commit? though in practice, why write them if not to run in CI) any tests except end-to-end tests covering actual use cases. In theory, even make them generic enough that they are not affected by the implementation. Perhaps even employ LLMs there ("check that a customer can provide their address for their order by using a headless browser").
It is a strong disagree from me: end-to-end tests have always been fragile and slow, and feedback-loop time is the constraint any coder (agentic or human) has to operate within. If your agents need to wait 2h to see whether each change is valid, you'll be beaten by humans doing properly structured "just enough" testing.
That isn't the logical conclusion, though. I specifically said to find places where change would be too complex to attempt anyway, which breaks your conclusion. This lets you find plenty of places to jump in and write a test. (You will still be wrong sometimes, but less often, and generally you know a change will be hard before you start making it.)
Though I find that, in practice, end-to-end tests are not that fragile. It took us a decade of effort to find and mitigate all the little issues that made them fragile, so perhaps you don't want to go that far. I wish I could make the end-to-end tests faster, but fragile they are not.
I am a bit perplexed: you counter my claim that end-to-end tests are fragile in general by describing how you spent a decade making them not fragile in one particular case?
I am not disagreeing that most projects evolve test suites in which duplicated, useless tests are the majority. But it can be done better.