Leanstral: Open-source agent for trustworthy coding and formal proof engineering

Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

https://mistral.ai/news/leanstral

It’s great to see this pattern of people realising that agents can specify the desired behavior and then write code to conform to the specs.

TDD, formal verification, whatever your tool: verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work. And because that documentation is executable, it puts zero tokens in the context when the code is correct.
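To make that concrete, here is a minimal sketch of the idea (the `slugify` function and its cases are hypothetical, just an illustration): each assertion pins down one behavioral detail, and as long as the suite passes, none of it needs to occupy the agent's context.

```python
import re

def slugify(title: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify():
    # Each case codifies one detail of "how things are supposed to work".
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces   everywhere ") == "spaces-everywhere"
    assert slugify("Already-slugged") == "already-slugged"

test_slugify()
```

A regression only costs tokens when it happens: a failing assertion is the one piece of the spec that gets pulled back into view.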

It’s more powerful than reams upon reams of markdown specs, because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always neglected, because we have gotten by on letting teams hold context in their heads and in docs.

As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.

I feel like the difference is minimal, if not entirely dismissible. Code in this sense is just a representation of the same information someone would write in an .md file. The resolution changes, and that's where both detail and context are lost.

I'm not against TDD or verification-first development, but I don't think writing that as code is the end goal. I'll concede that there are millions of lines of tests that already exist, so we should be using those as a foundation while everything else catches up.

Tests (and type-checkers, linters, formal specs, etc.) ground the model in reality: they show it that it got something wrong, without needing a human in the loop. It's empiricism, "nullius in verba": the scientific approach, which led to remarkable advances in a few hundred years that over a thousand years of ungrounded philosophy couldn't achieve.

The scientific approach is not only, or even primarily, empiricism. We didn't test our way to understanding. The scientific approach starts with a theory that does its best to explain some phenomenon. Then the theory is criticized by experts. Finally, if it seems to be a promising theory, tests are constructed. The tests can help verify the theory, but it is the theory that provides the explanation, which is the important part. Once we have explanation, we have understanding, which allows us to play around with the model to come up with new things, diagnose problems, etc.

The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.

> The scientific approach starts with a theory that does its best to explain some phenomenon

At the risk of stretching the analogy, the LLM's internal representation is that theory: gradient-descent has tried to "explain" its input corpus (+ RL fine-tuning), which will likely contain relevant source code, documentation, papers, etc. to our problem.

I'd also say that a piece of software is a theory too (quite literally, if we follow Curry-Howard). A piece of software generated by an LLM is a more-specific, more-explicit subset of its internal NN model.
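To make the Curry-Howard point concrete, here's a tiny Lean 4 sketch (illustrative only) where the program and the proof are literally the same object: the type is a proposition, and any program inhabiting it is a proof.

```lean
-- The type `p ∧ q → q ∧ p` is a proposition;
-- the function below is simultaneously a program and its proof.
theorem and_comm' (p q : Prop) : p ∧ q → q ∧ p :=
  fun h => ⟨h.right, h.left⟩

-- Read as ordinary code on data, the same "theorem" is just swap:
def swap {α β : Type} : α × β → β × α :=
  fun pair => (pair.2, pair.1)
```

Under this reading, "the software is a theory" isn't just a metaphor: a well-typed program is a constructive proof of its type's claim.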

Tests, and other real CLI interactions, allow the model to find out that it's wrong (~empiricism); compared to going round and round in chain-of-thought (~philosophy).

Of course, test failures don't tell us how to make the code actually pass, the same way that unexpected experimental or observational results don't tell us what an appropriate explanation or theory should be (see: dark matter, dark energy, etc.!).

The AI is just pattern matching. Vibing is not understanding, whether done by humans or machines. Vibe programmers (of which there are many) make a mess of the codebase, piling on patch after patch. But they get the tests to pass!

Vibing gives you something like the geocentric model of the solar system. It kind of works, but it's much more complicated and hard to work with.

Nice analogy *

I guess the current wave is going to give us Software Development Epicycles (SDEC?)

* All analogies are "wrong", some analogies are useful