@tastapod have you tried to get LLM coding tools to use BDD techniques yet? It seems like it would be helpful if we can get the agent swarm set up with the right contexts. I’m going to give it a try but I don’t really know what I’m doing, so it would be helpful to have some critique from BDD experts as Claude and I figure it out.

@adrianco @tastapod I took a look at tdd-guard recently and it's really useful if "strongly suggesting tdd process through prompting" isn't cutting it. Strongly recommend but non-trivial setup/token spend!

https://youtu.be/IVdYaVKuekk

Can AI coding agents do Test-Driven Development (TDD)?

@jovaneyck @tastapod That was really useful, thanks! I will try tdd-guard out.
@adrianco @tastapod while not BDD, my main recent unlock was to insist that the written plan I review include Mermaid sequence diagrams. This has really helped make sure the intended implementation is sensible.
@relistan @tastapod I had to ask Claude what a Mermaid diagram is… looks cool, I will try that suggestion.
@adrianco @tastapod the Mermaid syntax supports lots of types but I find the sequence diagrams the best for this.
@relistan @adrianco @tastapod Mermaid is the only diagrammatic description approach that yer "AI" tools can consume or produce.
@adrianco @relistan @tastapod Claude works well with plantuml too. I’ve used it for both class and sequence diagrams, inputs and outputs.
@adrianco @tastapod Sorry, not an answer to the question, just as info/warning:
Anthropic announced a major policy shift that will fundamentally change how the AI company handles user data from its Claude chatbot. Starting immediately, the company will train its AI models on chat transcripts and coding sessions from consumer accounts unless users actively opt out by September 28th.
@derHinek @tastapod Thanks for the note! I’m coding Apache licensed open source for my own amusement so will share the chats. I use Claude for coding and ChatGPT for other more personal queries.

@adrianco I haven't been doing much with LLM tools tbh. Lots of lurking in chat groups where others are doing things though.

I haven’t seen tdd-guard but it seems like it might help. In general I disagree with Claude’s ‘Phase 2’. I almost never use Gherkin feature files. I have a longstanding promise to myself to write up my reasons, but this reply thread might be that write-up! 1/n

@adrianco Almost all the BDD I've ever done uses JUnit or PyTest. I always structure my code examples (tests) as:

# Given
# When
# Then

It isn't dogma so much as how I think about design. I often start in the Then or When section writing a 'model client'. This tends to surface domain terms which then find their way into the production code.

This tells me what I need to set up in the Given section. This is especially fun in Java or Kotlin because the IDE fills in a lot of the blanks. 2/n

@adrianco I just keep hitting Cmd-Enter on the red squiggles until there aren't any left, and assign some dummy values. I'm sure an LLM could automate this but I have never felt it slowing me down. I tend to appreciate the thinking time.
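
Something like this, as a minimal PyTest sketch; the Account / place_order domain is invented purely for illustration:

```python
# Invented stand-in domain, just enough to make the sketch run.
class Account:
    def __init__(self, credit):
        self.credit = credit


class Order:
    def __init__(self, confirmed):
        self.confirmed = confirmed


def place_order(account, amount):
    # The 'model client' call I would sketch first in the When section;
    # writing it surfaces the domain terms (Account, Order, place_order).
    account.credit -= amount
    return Order(confirmed=True)


def test_placing_an_order_debits_the_account():
    # Given an account with some credit
    account = Account(credit=100)

    # When we place an order
    order = place_order(account, amount=30)

    # Then the order is confirmed and the account is debited
    assert order.confirmed
    assert account.credit == 70
```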

So Gherkin, then.

About 100 years ago, I wrote a scenario runner in Ruby using RSpec, called RBehave.[1] It had a neat internal DSL inspired by RSpec but for G/W/T and steps. It quickly found its way into the RSpec core.

3/n

[1]: https://dannorth.net/blog/introducing-rbehave/


@adrianco It had a setting that would render the scenario title and steps as plain text as they ran, which was pretty neat (this was pre-Markdown).

Some of the RSpec folks wondered whether you could round-trip this, and have plain text as an input. So in the spirit of 'we thought it would be easy', and several regexps later, a plain text scenario runner was born. Then Aslak Hellesøy rewrote this to use a proper grammar parser, and Cucumber was born.

4/n

@adrianco Writing and running plain text scenarios was a hit, and Cucumber became the most downloaded automation tool in the world... for testers!

Testers would write these scenario files in plain English (or Norwegian or whatever) and they, or usually developers, would write the plumbing to automate them. Once there was a critical mass of steps, the testers could compose new plain text scenarios with little or no developer involvement.

Which sounds great, right?

5/n

@adrianco From a raw engineering perspective, you just introduced several layers of indirection, each using different technologies, to do something that you could literally do in a single line of Java or Python.

Each Scenario is made up of Steps (Givens, Whens and Thens). These are mapped to methods or functions in some target language using annotations containing regular expressions. So you have 3 or 4 different languages before you even start.[1]
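
To make those layers concrete, here is roughly what the mapping looks like in behave, one of the Python flavours (the scenario text and the shop domain are invented for this sketch; classic Cucumber does the same with regex annotations in Java):

```python
# The scenario itself lives in a separate plain-text .feature file:
#
#   Scenario: Placing an order debits the account
#     Given an account with 100 credit
#     When I place an order costing 30
#     Then the account has 70 credit left
#
# Each step is pattern-matched to a Python function below.

from behave import given, when, then

from shop import Account, place_order  # invented domain module


@given('an account with {credit:d} credit')
def given_an_account(context, credit):
    context.account = Account(credit=credit)


@when('I place an order costing {amount:d}')
def when_i_place_an_order(context, amount):
    context.order = place_order(context.account, amount=amount)


@then('the account has {credit:d} credit left')
def then_the_account_has_credit(context, credit):
    assert context.account.credit == credit
```

That is Gherkin, a pattern language, and Python, all before a single line of domain code.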

6/n

[1]: https://cucumber.io/docs/guides/10-minute-tutorial


@adrianco Remember, all of this could be a single line of Java in a // Given section!
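
In Python terms, that whole mapped Given collapses to:

```python
# Given an account with 100 credit
account = Account(credit=100)
```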

The pitch is that steps are reusable. Yay! Except that you can't refactor plain text (to rename an evolving domain concept across tens or hundreds of scenarios, say), and anyway, testers don't think in terms of refactoring.

(Note, none of this has anything to do with BDD as such, which is about how the team communicates and collaborates to get work done.)

7/n

@adrianco So there are a number of pathologies that play out over time, which are my issue with using Cucumber / Gherkin for anything other than a very specific circumstance.

1. Proliferation of near-identical copy-pasta scenarios.

I have worked with clients who had literally thousands of 'BDDs' (yes, BDD as a plural noun meaning feature files), in Cucumber, SpecFlow (now Reqnroll), and variants. These would take many hours to run, so they were either ignored or resented.

8/n

@adrianco With one client, we grouped and categorized several thousand BDDs and rewrote their core behaviour into PyTest tests using Requests (a lovely HTTP library), and some SQL and AMQP libraries we found.

The testers learned enough Python to be dangerous, including things like helper functions and basic refactoring.

We ended up with a few hundred scenarios in well-structured PyTest tests which would run in a few minutes and provided way more confidence than 'the BDDs' ever did.
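
A sketch of the shape such a rewrite takes; the endpoints and payloads here are invented, but the Requests calls are the real API:

```python
import requests

BASE_URL = "https://test-env.example.com"  # invented test environment


def test_order_is_rejected_when_credit_is_insufficient():
    # Given an account with only 10 credit
    account = requests.post(f"{BASE_URL}/accounts", json={"credit": 10}).json()

    # When we try to place an order costing 30
    response = requests.post(
        f"{BASE_URL}/orders",
        json={"account_id": account["id"], "amount": 30},
    )

    # Then the order is rejected
    assert response.status_code == 422
```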

9/n

@adrianco

2. Shocking performance

Cucumber and friends run orders of magnitude slower than Just Writing Tests. You end up with scaffolding in the plumbing and plumbing in the scaffolding, with a twisty maze of steps all alike. And boy those stack traces! Heaven help you tracking things down when something fails, especially intermittently.

10/n

@adrianco

3. Scenarios and steps serve different audiences.

The business stakeholders generally care about the scenario title, 'The one where...'. They want to know that this case is one we have considered, and that we will Do The Right Thing when that scenario occurs. They glaze over as soon as you start describing the steps, especially the amount of setup in the Givens or the detail of the result checks in the Thens.

Testers often care about the steps, but not the implementation.

11/n

@adrianco And the developers would rather be anywhere else than deep in the weeds of BDD step definitions or their dependencies, or trying to figure out why the parameters aren't mapping correctly (it's always a typo, but no linting of course, because English).

4. It is easier to read in code

No, honestly. Even (especially?) for non-developers. I have had this conversation so many times with testers. 'We can read your code because it is full of domain terms doing sensible things!'

12/n

@adrianco It turns out that using DDD with domain-based, intention-revealing names and a consistent, well-curated domain model is far more versatile than 'You can write your steps in English!'

---

I'll pause here. I have a lot of time for the Cucumber folks. They are smart and invested and they care about (and get!) BDD. Sadly, 99% of the time Cucumber is just a tool for test automation. The scenarios are not a joint collaboration (see 'Different audiences'), just a different syntax.

/fin

@tastapod Thanks! Super helpful.
@adrianco You're welcome. Surprisingly cathartic too! I should tidy this up into a long-form post.
@adrianco (or ask Claude to do it for me!)

@tastapod @adrianco Yes please.

You seem to have the same view as I do, which is why trying to use LLMs to write BDD acceptance tests always makes me facepalm. It misses the entire point of *talking to the customer*. It’s just (yet) another layer of abstraction away from that. The Drucker quote is spot on IMO.

Don’t get me wrong, exploring the idea is fine, but to me it just feels wrong and, to a degree, pointless (and that’s without considering the ethical aspects of using AI)

@thirstybear @tastapod Talking to the customer is what the LLM is doing, and I want to see if that conversation can be structured as behaviors that lead to a result that works and does what the customer wanted.

@adrianco @tastapod A valid experiment. But I immediately wonder what is being lost in the interaction

LLMs do not have understanding. They do not have a mental model. They cannot pick up inconsistencies. They do not have instinct. They cannot pick up on subtle signals that something is not quite right, or there's more. They won't query, or question, or delve deeper. They are simply stochastic parrots. Text extrusion machines. I think we need more

I await the conclusions with interest

@tastapod @adrianco Good read. I can’t believe this was started 100 years ago already. It doesn’t seem like a day older than 75. 🤪
@stuartmarks @adrianco good catch, I was exaggerating for effect. It was in fact 75 years ago.
@tastapod Thanks for all this perspective. I will try to digest it and see if I can persuade the agent swarm to work along these lines.

@adrianco ooh, I never mentioned my 'very specific circumstance'. That's for tomorrow, then.

Yes, I suspect you could persuade Claude to write sensible, intention-revealing scenario tests with a G/W/T structure. That might be cool.

@adrianco The real message with this ramble, though, is that while LLMs could make a pretty decent fist of all that plumbing/scaffolding, you end up in the Peter Drucker situation of doing with great efficiency that which should never have been done at all.

Just write the PyTest! (Also, I <3 PyTest's fixtures. Such a lovely framework affordance.)
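
For example, a fixture turns a shared Given into something each test asks for by name; this sketch reuses the invented Account / place_order domain from earlier:

```python
import pytest


@pytest.fixture
def account():
    # Given: a fresh account for every test that requests this fixture
    return Account(credit=100)


def test_placing_an_order_debits_the_account(account):
    # When we place an order
    order = place_order(account, amount=30)

    # Then the order is confirmed and the account is debited
    assert order.confirmed
    assert account.credit == 70
```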

@tastapod That makes sense. The behavior-based spec is the key thing; then use that to generate tests directly.
@adrianco @tastapod I feel foolish suggesting things to you guys, but for other drive-bys like me: Copy the human readable spec as comments into a new empty test file, complete with "Given", "When" and "Then", then implement each comment.
@tartley @adrianco nothing foolish in that! I often start a code example / test by writing the whole thing in comments, then convert the comments into code one at a time, usually working backwards. I call this 'comment-driven testing' as a flavour of TDD/BDD.
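
A small sketch of that flow, with an invented refund_order helper:

```python
# Step 1: the spec goes into an empty test as comments, nothing else.
def test_refund_restores_credit():
    # Given an account that has paid for an order
    # When the order is refunded
    # Then the account credit is restored
    ...


# Step 2: convert each comment to code, often starting from the Then.
def test_refund_restores_credit_filled_in():
    # Given an account that has paid for an order
    account = Account(credit=70)  # invented domain class from earlier sketches

    # When the order is refunded
    refund_order(account, amount=30)  # invented helper

    # Then the account credit is restored
    assert account.credit == 100
```
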
@tastapod Someone is working on making agents, for use in a swarm, that model the best practices and thoughts of leading developers. Like a Kent Beck agent that tells you to tidy first, or a Dan North agent that tells you to do BDD better. The context encapsulation of an agent swarm lets you do that much better than with a single LLM that you try to tell to think about everything at once.
@adrianco I still can't help thinking that we are investing insane amounts of time, effort and money into compensating for the intrinsic limitations of using completely the wrong tool in the first place.

@adrianco I am convinced that the next significant shift in software development involves ML. I am equally convinced that LLMs are not the way and are just a massive shill, sucking all the oxygen (and investment dollars) out of the room.

Why not _start_ with the premise that we want to encode rules and heuristics and build a ML solution from that, rather than trying to persuade a forgetful, stochastic token prediction engine to do the job, in this case by throwing lots of them at it?