I just dreamed up an amusing way for things to fail using LLMs:
1. Deliverable is a test suite testing some new feature. Validate by looking at the logs to see if things passed.
3. LLM generates the tests, and also generates passing logs.
3. Everybody signs off on it. Job well done.
4. Nothing works, of course, because the logs were generated by the LLM instead of by the test suite.

It would be hard to make this mistake, but it's amusing to think about...

Maybe this isn't so far-fetched if you use an LLM to summarize log outputs, another one to look at the summary to see if things passed or failed, and you point them at the wrong artifacts...
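The fix for that whole chain is to never treat log *text* as ground truth. A minimal sketch (the commands here are illustrative stand-ins, not any particular test runner): run the suite yourself and trust the exit code of the process you launched, not anything anyone, human or LLM, says the logs said.

```python
# Sketch: trust the process, not the prose. Run the test command
# directly and use its exit code as the pass/fail signal.
import subprocess
import sys

def suite_passed(cmd: list[str]) -> bool:
    """Run a test command and report pass/fail from its exit code,
    not from anything the command printed."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # A log that *says* "all tests passed" proves nothing; a zero exit
    # code from a process you launched yourself is much harder to fake.
    return proc.returncode == 0

# Illustrative stand-ins for a real test-runner invocation:
print(suite_passed([sys.executable, "-c", "raise SystemExit(0)"]))  # True
print(suite_passed([sys.executable, "-c", "raise SystemExit(1)"]))  # False
```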

@AcausalRobotGod

If the test suite itself were _also_ generated by an LLM ("eh, I just used it to help with the boilerplate bc Java sux", said the dev)

then it's very likely it would produce code that does a lot of work but doesn't actually test anything in a meaningful way

LLM code is … deceptively okay-looking for boring & predictable tasks that might be easily found in many repos

But this test suite is the _actual deliverable_ and really needs to be a negative-space functional spec!