The reason I get so annoyed about people pitching LLMs as a way to 'democratise programming' or as end-user programming tools is that they solve the wrong problem.

The hard part of programming is not writing code. It's unambiguously expressing your problem and desired solution. Imagine if LLMs were perfect programmers. All you have to do is write a requirements document and they turn it into a working program. Amazing, right? Well, not if you've ever seen what most people write in a requirements document or seen the output when a team of good programmers works from a requirements document.

The most popular end-user programming language in the world (and, by extension, the most popular programming language), with over a billion users, is the Calc language that is embedded in Excel. It is not popular because it's a good language. Calc is a terrible programming language by pretty much any metric. It's popular because Excel (which is also a terrible spreadsheet, but that's a different rant) is basically a visual debugger and a reactive programming environment. Every temporary value in an Excel program is inspectable and it's trivial to write additional debug expressions that are automatically updated when the values that they're observing change.

Much as I detest it as a spreadsheet, Excel is probably the best debugger that I have ever used, including Lisp and Smalltalk.

The thing that makes end-user programming easy in Excel is not that it's easy to write code, it's that it's easy to see what the code is doing and understand why it's doing the wrong thing. If you replace this with an LLM that generates Python, and the Python program is wrong, how does a normal non-Python-programming human debug it? They try asking the LLM, but it doesn't actually understand the Python so it will often send them down odd rabbit holes. In contrast, every intermediate step in an Excel / Calc program is visible. Every single intermediate value is introspectable. Adding extra sanity checks (such as 'does money leaving the account equal the money paid to suppliers?') is trivial.

If you want to democratise programming, build better debuggers, don't build tools that rapidly generate code that's hard to debug.

@david_chisnall this isn't even a new phenomenon. Before there was vibe coding, there were "NoCode solutions". The problem is always the same: either the result is janky, limited, and/or not up to spec, or the person creating it has inadvertently become a programmer, with all the complexity that entails.

In the case of NoCode, this was mostly a way to underpay programmers, by not calling them that. I expect similar in the case of LLMs.

@sophieschmieg I think a lot of the low-code or no-code things were better because they didn't claim to be general solutions. They were good at creating business apps that worked like ten thousand other in-house business apps, but with a small tweak specific to your requirements. Effectively, they factored out the common code and made it easy to create something that was the common bit plus a tiny extra bit.

In contrast, LLMs let you automatically generate a private local copy of all of that shared code, which is a maintenance nightmare and is completely unapproachable to non-programmer users because they can't tell the difference between the thousand lines of code that are copied and pasted and the dozen that are actually related to their requirements.

@david_chisnall @sophieschmieg If it only were so.

In my experience, LLMs often enough produce code faulty enough that even a "dynamically typed" language like Python throws a SyntaxError on module import.

As I joked, LLMs have probably managed to produce more SyntaxErrors in the past few weeks than I have in the past decade or longer.

@david_chisnall
My personal worry is always: what happens with all the cool uses of LLMs whose output doesn't follow a formal grammar that could be used to validate the LLM's work?

Then I watch our corporate retreat, where managers dreamily tell how quickly an AI can create lengthy position papers on topics they don't fully understand.

Surely after the paper is written by the LLM, they'll go on to read the 1000s of pages of source material to verify the position paper's content. @sophieschmieg

So the moment the LLM starts producing huge amounts of work output, the question arises: how are you supposed to make sure that the output is correct?

And that LLMs produce plenty of bullshit is indicated by the fact that current state-of-the-art LLMs, optimized for code generation or not, regularly manage to produce syntax errors. That doesn't even include the code that is invalid at runtime.
@sophieschmieg @david_chisnall

@david_chisnall
So whenever somebody mentions AI/LLMs, the first question is: what is your result verification strategy? Because you need one for almost all use cases.

@sophieschmieg

@yacc143 @david_chisnall @sophieschmieg That's the problem. Luckily (?) with software it gets tested at runtime. Unluckily, when it's a bridge it'll get tested when a load gets applied and it falls down.

Never forget, either: it's a plausibility engine. Even if you can verify the result, it'll lie to you, and when you catch it, it'll apologise and lie to you again until you're happy (the result is plausible) or you give up checking.

@Dss
Ah, but that's slightly faulty reasoning.

No, not all faults get tested just by usage.
There is a whole new world of bugs and failure modes when you put the "system" under stress. Be it your friendly port-knocking software that spews some random bile at it, some nice people probing whether you handle all parameters correctly, or simply race conditions that only show up under extreme load. Or the clusterf&%k decides not to scale as @david_chisnall @sophieschmieg

predicted, and instead of a result in 24h it's still running after 3 days. Should have included more logging to show progress. Sigh.

So yes, there are no cool videos of a bridge catching a harmonic frequency, that's right, but trust me, software is quite capable of producing disasters. (And the test-coverage myth that it's tested by usage is painful. Trust me, I cursed rather loudly on that topic in my office last week.)
@david_chisnall @sophieschmieg @Dss

@yacc143 @sophieschmieg @Dss In particular, most security vulnerabilities appear in software that works for the common case. If it doesn't work at all, people don't use it and there's no impact from a bug. If it works well when no one is actively attacking it, it gets deployed in a lot of places. Software vulnerabilities are usually bugs that are triggered by some uncommon input (or, in some cases, common input but uncommon observations).

Testing can catch a lot of these if the tests are written by someone who understands the common failure modes of the kinds of software that's under test. Understanding is key here, and that's something LLMs lack.

@david_chisnall @yacc143 @sophieschmieg I agree that anything customer facing, or external, needs another level of care and testing. But no-one's* dumb enough to show an internal spreadsheet externally, surely?

*DOGE excluded, obvs!

@Dss @david_chisnall @sophieschmieg I've been working in ETL and especially automated data acquisition (one could say web scraping, but the discipline is sadly a little bit wider) for the past decade or so.

Trust me, after 2-3 years you stop assuming anything, and the "why would any thinking being code that like this???" moves into the background.

Basically, that "now I've seen everything" assumption has literally been kicked into the garbage bin too often.

Another thing that I've learned (we are in the academic-adjacent space, so to say) is that Excel sheets are fucking bad as "databases" or "data exchange formats". Our CTO (when we were still a startup) literally designed a template sheet that interested partners could fill with their data if they wanted to give us their data (and didn't have it published somewhere for us to scrape).

In the dozens of Excels that came back, I've literally not seen one f%c&ing sheet that didn't need custom code to be parsed.

What do people think an ISO 3166-1 alpha-2 country code like US/UK/FR/DE is? Deutschland is obviously the expected value for the cells, right? Or is it +49?

That was one fascinating experiment in "data entry", with Excel sheets totally failing. You either need to code up a VBA app that requires a specific MS Office setup on the user's side, or you get random data.

@yacc143 haha, amazing.

We had a VBA module called something like "Cleanuptext" that was used to parse every single field of input, because some of the engineers were very... hard to help... but also because Access would do some crazy things where it broke itself on certain inputs! So in the end, *everything* got turned into A-Za-z0-9, plus a bit of punctuation.
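A rough Python equivalent of that kind of whitelist scrubbing (the exact punctuation the original VBA module allowed isn't stated, so the character class here is an assumption) might look like:

```python
import re

# Keep only A-Za-z0-9, space, and a small punctuation whitelist; drop
# everything else before the value goes anywhere near the database.
ALLOWED = re.compile(r"[^A-Za-z0-9 .,;:()\-]")


def cleanup_text(raw: str) -> str:
    """Scrub a field down to ASCII letters, digits, and basic punctuation."""
    return ALLOWED.sub("", raw)


print(cleanup_text("Hello, wörld!"))  # Hello, wrld
```

Crude, but it illustrates the trade-off: you lose accented characters entirely in exchange for input that can never break the downstream system.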