@tastapod have you tried to get LLM coding tools to use BDD techniques yet? It seems like it would be helpful if we can get the agent swarm set up with the right contexts. I’m going to give it a try but I don’t really know what I’m doing, so it would be helpful to have some critique from BDD experts as Claude and I figure it out.

@adrianco @tastapod I took a look at tdd-guard recently and it's really useful if "strongly suggesting tdd process through prompting" isn't cutting it. Strongly recommend but non-trivial setup/token spend!

https://youtu.be/IVdYaVKuekk

Can AI coding agents do Test-Driven Development (TDD)?

@jovaneyck @tastapod That was really useful, thanks! I will try tdd-guard out.
@adrianco @tastapod while not BDD, my main recent unlock was to insist that the written plan I review include Mermaid sequence diagrams. This has really helped make sure the intended implementation is sensible.
@relistan @tastapod I had to ask Claude what a Mermaid diagram is… looks cool, I will try that suggestion.
@adrianco @tastapod the Mermaid syntax supports lots of types but I find the sequence diagrams the best for this.
@relistan @adrianco @tastapod Mermaid is the only diagrammatic description approach that yer "AI" tools can consume or produce.
@adrianco @relistan @tastapod Claude works well with plantuml too. I’ve used it for both class and sequence diagrams, inputs and outputs.
@adrianco @tastapod Sorry, not an answer to the question, just as info/warning:
Anthropic announced a major policy shift that will fundamentally change how the AI company handles user data from its Claude chatbot. Starting immediately, the company will train its AI models on chat transcripts and coding sessions from consumer accounts unless users actively opt out by September 28th.
@derHinek @tastapod Thanks for the note! I’m coding Apache licensed open source for my own amusement so will share the chats. I use Claude for coding and ChatGPT for other more personal queries.

@adrianco I haven't been doing much with LLM tools tbh. Lots of lurking in chat groups where others are doing things though.

I haven’t seen tdd-guard but it seems like it might help. In general I disagree with Claude’s ‘Phase 2’. I almost never use Gherkin feature files. I have a longstanding promise to myself to write up my reasons, but this reply thread might be that write-up! 1/n

@adrianco Almost all the BDD I've ever done uses JUnit or PyTest. I always structure my code examples (tests) as:

# Given
# When
# Then

It isn't dogma so much as how I think about design. I often start in the Then or When section writing a 'model client'. This tends to surface domain terms which then find their way into the production code.

This tells me what I need to set up in the Given section. This is especially fun in Java or Kotlin because the IDE fills in a lot of the blanks. 2/n

@adrianco I just keep hitting Cmd-Enter on the red squiggles until there aren't any left, and assign some dummy values. I'm sure an LLM could automate this but I have never felt it slowing me down. I tend to appreciate the thinking time.
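
Something like this, as a minimal PyTest sketch; the Account / place_order domain is invented purely for illustration:

```python
# Invented stand-in domain, just enough to make the sketch run.
class Account:
    def __init__(self, credit):
        self.credit = credit


class Order:
    def __init__(self, confirmed):
        self.confirmed = confirmed


def place_order(account, amount):
    # The 'model client' call I would sketch first in the When section;
    # writing it surfaces the domain terms (Account, Order, place_order).
    account.credit -= amount
    return Order(confirmed=True)


def test_placing_an_order_debits_the_account():
    # Given an account with some credit
    account = Account(credit=100)

    # When we place an order
    order = place_order(account, amount=30)

    # Then the order is confirmed and the account is debited
    assert order.confirmed
    assert account.credit == 70
```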

So Gherkin, then.

About 100 years ago, I wrote a scenario runner in Ruby using RSpec, called RBehave.[1] It had a neat internal DSL inspired by RSpec but for G/W/T and steps. It quickly found its way into the RSpec core.

3/n

[1]: https://dannorth.net/blog/introducing-rbehave/


@adrianco It had a setting that would render the scenario title and steps as plain text as they ran, which was pretty neat (this was pre-Markdown).

Some of the RSpec folks wondered whether you could round-trip this, and have plain text as an input. So in the spirit of 'we thought it would be easy', and several regexps later, a plain text scenario runner was born. Then Aslak Hellesøy rewrote this to use a proper grammar parser, and Cucumber was born.

4/n

@adrianco Writing and running plain text scenarios was a hit, and Cucumber became the most downloaded automation tool in the world... for testers!

Testers would write these scenario files in plain English (or Norwegian or whatever) and they, or usually developers, would write the plumbing to automate them. Once there was a critical mass of steps, the testers could compose new plain text scenarios with little or no developer involvement.

Which sounds great, right?

5/n

@adrianco From a raw engineering perspective, you just introduced several layers of indirection, each using different technologies, to do something that you could literally do in a single line of Java or Python.

Each Scenario is made up of Steps (Givens, Whens and Thens). These are mapped to methods or functions in some target language using annotations containing regular expressions. So you have 3 or 4 different languages before you even start.[1]
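
To make those layers concrete, here is roughly what the mapping looks like in behave, one of the Python flavours (the scenario text and the shop domain are invented for this sketch; classic Cucumber does the same with regex annotations in Java):

```python
# The scenario itself lives in a separate plain-text .feature file:
#
#   Scenario: Placing an order debits the account
#     Given an account with 100 credit
#     When I place an order costing 30
#     Then the account has 70 credit left
#
# Each step is pattern-matched to a Python function below.

from behave import given, when, then

from shop import Account, place_order  # invented domain module


@given('an account with {credit:d} credit')
def given_an_account(context, credit):
    context.account = Account(credit=credit)


@when('I place an order costing {amount:d}')
def when_i_place_an_order(context, amount):
    context.order = place_order(context.account, amount=amount)


@then('the account has {credit:d} credit left')
def then_the_account_has_credit(context, credit):
    assert context.account.credit == credit
```

That is Gherkin, a pattern language, and Python, all before a single line of domain code.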

6/n

[1]: https://cucumber.io/docs/guides/10-minute-tutorial


@adrianco Remember, all of this could be a single line of Java in a // Given section!
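
In Python terms, that whole mapped Given collapses to:

```python
# Given an account with 100 credit
account = Account(credit=100)
```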

The pitch is that steps are reusable. Yay! Except that you can't refactor plain text (to rename an evolving domain concept across tens or hundreds of scenarios, say), and anyway, testers don't think in terms of refactoring.

(Note, none of this has anything to do with BDD as such, which is about how the team communicates and collaborates to get work done.)

7/n

@adrianco So there are a number of pathologies that play out over time, which are my issue with using Cucumber / Gherkin for anything other than a very specific circumstance.

1. Proliferation of near-identical copy-pasta scenarios.

I have worked with clients who had literally thousands of 'BDDs' (yes, BDD as a plural noun meaning feature files), in Cucumber, SpecFlow (now Reqnroll), and variants. These would take many hours to run, so they were either ignored or resented.

8/n

@adrianco With one client, we grouped and categorized several thousand BDDs and rewrote their core behaviour into PyTest tests using Requests (a lovely HTTP library), and some SQL and AMQP libraries we found.

The testers learned enough Python to be dangerous, including things like helper functions and basic refactoring.

We ended up with a few hundred scenarios in well-structured PyTest tests which would run in a few minutes and provided way more confidence than 'the BDDs' ever did.
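
A sketch of the shape such a rewrite takes; the endpoints and payloads here are invented, but the Requests calls are the real API:

```python
import requests

BASE_URL = "https://test-env.example.com"  # invented test environment


def test_order_is_rejected_when_credit_is_insufficient():
    # Given an account with only 10 credit
    account = requests.post(f"{BASE_URL}/accounts", json={"credit": 10}).json()

    # When we try to place an order costing 30
    response = requests.post(
        f"{BASE_URL}/orders",
        json={"account_id": account["id"], "amount": 30},
    )

    # Then the order is rejected
    assert response.status_code == 422
```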

9/n

@adrianco

2. Shocking performance

Cucumber and friends run orders of magnitude slower than Just Writing Tests. You end up with scaffolding in the plumbing and plumbing in the scaffolding, with a twisty maze of steps all alike. And boy those stack traces! Heaven help you tracking things down when something fails, especially intermittently.

10/n

@adrianco

3. Scenarios and steps serve different audiences.

The business stakeholders generally care about the scenario title, 'The one where...'. They want to know that this case is one we have considered, and that we will Do The Right Thing when that scenario occurs. They glaze over as soon as you start describing the steps, especially the amount of setup in the Givens or the detail of the result checks in the Thens.

Testers often care about the steps, but not the implementation.

11/n

@adrianco And the developers would rather be anywhere else than deep in the weeds of BDD step definitions or their dependencies, or trying to figure out why the parameters aren't mapping correctly (it's always a typo, but no linting of course, because English).

4. It is easier to read in code

No, honestly. Even (especially?) for non-developers. I have had this conversation so many times with testers. 'We can read your code because it is full of domain terms doing sensible things!'

12/n

@adrianco It turns out that using DDD with domain-based, intention-revealing names and a consistent, well-curated domain model is far more versatile than 'You can write your steps in English!'

---

I'll pause here. I have a lot of time for the Cucumber folks. They are smart and invested and they care about (and get!) BDD. Sadly, 99% of the time Cucumber is just a tool for test automation. The scenarios are not a joint collaboration (see 'Different audiences'), just a different syntax.

/fin

@tastapod Thanks! Super helpful.
@adrianco You're welcome. Surprisingly cathartic too! I should tidy this up into a long-form post.
@adrianco (or ask Claude to do it for me!)

@tastapod @adrianco Yes please.

You seem to have the same view as I do, which is why trying to use LLMs to write BDD acceptance tests always makes me facepalm. It misses the entire point of *talking to the customer*. It’s just (yet) another layer of abstraction away from that. The Drucker quote is spot on IMO.

Don’t get me wrong, exploring the idea is fine, but to me it just feels wrong and, to a degree, pointless (and that’s without considering the ethical aspects of using AI)

@thirstybear @tastapod Talking to the customer is what the LLM is doing, and I want to see if that conversation can be structured as behaviors that lead to a result that works and does what the customer wanted.

@adrianco @tastapod A valid experiment. But I immediately wonder what is being lost in the interaction

LLMs do not have understanding. They do not have a mental model. They cannot pick up inconsistencies. They do not have instinct. They cannot pick up on subtle signals that something is not quite right, or there's more. They won't query, or question, or delve deeper. They are simply stochastic parrots. Text extrusion machines. I think we need more

I await the conclusions with interest

@tastapod @adrianco Good read. I can’t believe this was started 100 years ago already. It doesn’t seem like a day older than 75. 🤪
@stuartmarks @adrianco good catch, I was exaggerating for effect. It was in fact 75 years ago.
@tastapod Thanks for all this perspective. I will try to digest it and see if I can persuade the agent swarm to work along these lines.

@adrianco ooh, I never mentioned my 'very specific circumstance'. That's for tomorrow, then.

Yes, I suspect you could persuade Claude to write sensible, intention-revealing scenario tests with a G/W/T structure. That might be cool.

@adrianco The real message with this ramble, though, is that while LLMs could make a pretty decent fist of all that plumbing/scaffolding, you end up in the Peter Drucker situation of doing with great efficiency that which should never have been done at all.

Just write the PyTest! (Also, I <3 PyTest's fixtures. Such a lovely framework affordance.)
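
For example, a fixture turns a shared Given into something each test asks for by name; this sketch reuses the invented Account / place_order domain from earlier:

```python
import pytest


@pytest.fixture
def account():
    # Given: a fresh account for every test that requests this fixture
    return Account(credit=100)


def test_placing_an_order_debits_the_account(account):
    # When we place an order
    order = place_order(account, amount=30)

    # Then the order is confirmed and the account is debited
    assert order.confirmed
    assert account.credit == 70
```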

@tastapod That makes sense. The behavior-based spec is the key thing; then use that to generate tests directly.
@adrianco @tastapod I feel foolish suggesting things to you guys, but for other drive-bys like me: Copy the human readable spec as comments into a new empty test file, complete with "Given", "When" and "Then", then implement each comment.
@tartley @adrianco nothing foolish in that! I often start a code example / test by writing the whole thing in comments, then convert the comments into code one at a time, usually working backwards. I call this 'comment-driven testing' as a flavour of TDD/BDD.
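
A small sketch of that flow, with an invented refund_order helper:

```python
# Step 1: the spec goes into an empty test as comments, nothing else.
def test_refund_restores_credit():
    # Given an account that has paid for an order
    # When the order is refunded
    # Then the account credit is restored
    ...


# Step 2: convert each comment to code, often starting from the Then.
def test_refund_restores_credit_filled_in():
    # Given an account that has paid for an order
    account = Account(credit=70)  # invented domain class from earlier sketches

    # When the order is refunded
    refund_order(account, amount=30)  # invented helper

    # Then the account credit is restored
    assert account.credit == 100
```
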
@tastapod Someone is working on making agents, for use in a swarm, that model the best practices and thoughts of leading developers. Like a Kent Beck agent that tells you to tidy first, or a Dan North agent that tells you to do BDD better. The context encapsulation of an agent swarm lets you do that much better than with a single LLM that you try to tell to think about everything at once.
@adrianco I still can't help thinking that we are investing insane amounts of time, effort and money into compensating for the intrinsic limitations of using completely the wrong tool in the first place.

@adrianco I am convinced that the next significant shift in software development involves ML. I am equally convinced that LLMs are not the way and are just a massive shill, sucking all the oxygen (and investment dollars) out of the room.

Why not _start_ with the premise that we want to encode rules and heuristics and build a ML solution from that, rather than trying to persuade a forgetful, stochastic token prediction engine to do the job, in this case by throwing lots of them at it?