I finally turned off GitHub Copilot yesterday. I’ve been using it for about a year on the ‘free for open-source maintainers’ tier. I was skeptical but didn’t want to dismiss it without a fair trial.

It has cost me more time than it has saved. It does let me type faster, which has been useful when writing tests that exercise a variety of permutations of an API to check the error handling for each condition.

I can recall three places where it has introduced bugs that took me more time to debug than the total time it saved:

The first was something that initially impressed me. I pasted the prose description of how to communicate with an Ethernet MAC into a comment and then wrote some method prototypes. It autocompleted the bodies. All very plausible looking. Only it managed to flip a bit in the MDIO read and write register commands. MDIO is basically a multiplexing system. You have two device registers exposed, one sets the command (read or write a specific internal register) and the other is the value. It got the read and write the wrong way around, so when I thought I was writing a value, I was actually reading. When I thought I was reading, I was actually seeing the value in the last register I thought I had written. It took two of us over a day to debug this. The fix was simple, but the bug was in the middle of correct-looking code. If I’d manually transcribed the command from the data sheet, I would not have got this wrong because I’d have triple checked it.
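For concreteness, here is a sketch (Python, illustrative names; register layout per IEEE 802.3 Clause 22) of how little separates the two commands: the read and write opcodes differ by a single swapped bit pair, which is exactly the kind of constant an autocompleter can flip while everything around it stays plausible-looking.

```python
# Sketch of IEEE 802.3 Clause 22 MDIO management frames (names illustrative).
OP_WRITE = 0b01   # Clause 22 write opcode
OP_READ  = 0b10   # Clause 22 read opcode -- one swapped bit pair apart

def mdio_frame(op, phy_addr, reg_addr, data=0):
    """Assemble a 32-bit Clause 22 frame:
    ST(01) | OP(2) | PHYAD(5) | REGAD(5) | TA(2) | DATA(16)."""
    frame = 0b01                          # start of frame
    frame = (frame << 2) | op
    frame = (frame << 5) | (phy_addr & 0x1F)
    frame = (frame << 5) | (reg_addr & 0x1F)
    frame = (frame << 2) | 0b10           # turnaround
    frame = (frame << 16) | (data & 0xFFFF)
    return frame

# A write and a read of the same register differ only in bits 29:28,
# so frames with flipped opcodes still look entirely plausible on the wire.
w = mdio_frame(OP_WRITE, 1, 2, 0xBEEF)
r = mdio_frame(OP_READ, 1, 2)
assert ((w >> 28) & 0b11) != ((r >> 28) & 0b11)
```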

In another case, it had inverted the condition in an if statement inside an error-handling path. The error handling was a rare case and was asymmetric. Hitting the if case when you wanted the else case was okay but the converse was not. Lots of debugging. I learned from this to read the generated code more carefully, but that increased cognitive load and eliminated most of the benefit. Typing code is not the bottleneck and if I have to think about what I want and then read carefully to check it really is what I want, I am slower.
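A minimal sketch of that kind of asymmetry (hypothetical names, not the actual code):

```python
# Taking the conservative branch when the fast branch would do is harmless;
# taking the fast branch on a shared buffer frees it while still in use.
def release(buffer, *, shared):
    if shared:                      # conservative path: defer to the refcount
        buffer["refs"] -= 1
        if buffer["refs"] > 0:
            return "deferred"
    buffer["data"] = None           # fast path: free immediately
    return "freed"

# With `if shared` inverted to `if not shared`, a shared buffer takes the
# fast path and is freed while other holders still reference it. Because
# the path is rarely exercised, the inversion can hide for a long time.
```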

Most recently, I was writing a simple binary search and insertion-deletion operations for a sorted array. I assumed that this was something that had hundreds of examples in the training data and so would be fine. It had all sorts of corner-case bugs. I eventually gave up fixing them and rewrote the code from scratch.
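For reference, a corner-case-safe version is short. This sketch follows the lower_bound convention (equivalent to the stdlib's bisect.bisect_left), which handles the classic traps: the empty array, duplicates, and insertion at either end.

```python
def lower_bound(a, x):
    """Index of the first element >= x in sorted list a; len(a) if none."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2   # in fixed-width languages use lo + (hi - lo) // 2
        if a[mid] < x:         # to avoid the classic midpoint overflow
            lo = mid + 1
        else:
            hi = mid
    return lo

def insert_sorted(a, x):
    a.insert(lower_bound(a, x), x)

def delete_sorted(a, x):
    i = lower_bound(a, x)
    if i < len(a) and a[i] == x:
        del a[i]
        return True
    return False
```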

Last week I did some work on a remote machine where I hadn’t set up Copilot and I felt much more productive. Autocomplete was either correct or not present, so I was spending more time thinking about what to write. I don’t entirely trust this kind of subjective judgement, but it was a data point. Around the same time I wrote some code without clangd set up and that really hurt. It turns out I really rely on AST-aware completion to explore APIs. I had to look up more things in the documentation. Copilot was never good for this because it would just bullshit APIs, so something showing up in autocomplete didn’t mean it was real. This would be improved by using a feedback system to require autocomplete outputs to type check, but then they would take much longer to create (probably at least a 10x increase in LLM compute time) and wouldn’t complete fragments, so I don’t see a good path to being able to do this without tight coupling to the LSP server and possibly not even then.

Yesterday I was writing bits of the CHERIoT Programmers’ Guide and it kept autocompleting text in a different writing style, some of which was obviously plagiarised (when I’m describing precisely how to implement a specific, and not very common, lock type with a futex and the autocomplete is a paragraph of text with a lot of detail, I’m confident you don’t have more than one or two examples of that in the training set). It was distracting and annoying. I wrote much faster after turning it off.

So, after giving it a fair try, I have concluded that it is both a net decrease in productivity and probably an increase in legal liability.

Discussions I am not interested in having:

  • You are holding it wrong. Using Copilot with this magic config setting / prompt tweak makes it better. At its absolute best, it was a small productivity increase; if it needs more effort to use, that offsets the gain.
  • This other LLM is much better. I don’t care. The costs of the bullshitting far outweighed the benefits when it worked, to be better it would have to not bullshit, and that’s not something LLMs can do.
  • It’s great for boilerplate! No. APIs that require every user to write the same code are broken. Fix them, don’t fill the world with more code using them that will need fixing when the APIs change.
  • Don’t use LLMs for autocomplete, use them for dialogues about the code. Tried that. It’s worse than a rubber duck, which at least knows to stay silent when it doesn’t know what it’s talking about.

The one place Copilot was vaguely useful was hinting at missing abstractions (if it can autocomplete big chunks then my APIs required too much boilerplate and needed better abstractions). The place I thought it might be useful was spotting inconsistent API names and parameter orders but it was actually very bad at this (presumably because of the way it tokenises identifiers?). With a load of examples with consistent names, it would suggest things that didn't match the convention. After using three APIs that all passed the same parameters in the same order, it would suggest flipping the order for the fourth.

#GitHubCopilot #CHERIoT

@david_chisnall your experience is how I expected mine to be if I had actually given the technology a chance.

Machine learning has been useful for decades, mostly quietly. The red flag against LLMs for me (aside from the authorship laundering and mass poaching of content) was the scramble by every company to shoehorn it into their products, like a solution looking for a problem that I’ve yet to see it actually solve.

@carbontwelve I used machine learning in my PhD. The use case there was data prefetching. This was an ideal task for ML, because the benefits of a correct answer were high and the cost of an incorrect answer was low. In the worst case, your prefetching evicts something from cache that you need later, but a 60% accuracy in predictions is a big overall improvement.

Programming is the opposite. The benefits of being able to generate correct code faster 80% of the time are small but the costs of generating incorrect code even 1% of the time are high. The entire shift-left movement is about finding and preventing bugs earlier.
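The asymmetry is easy to put numbers on. With made-up but plausible figures, a 1% chance of a long debugging session swamps an 80% chance of saving a minute of typing:

```python
# Back-of-envelope expected value per generated snippet. The figures are
# illustrative, not measured.
minutes_saved = 0.80 * 1          # 80% chance of saving ~1 minute of typing
minutes_lost  = 0.01 * 8 * 60     # 1% chance of ~a day (8 hours) of debugging
net = minutes_saved - minutes_lost
assert net < 0                    # net loss of roughly 4 minutes per snippet
```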

@david_chisnall that’s a nicely eloquent way to put both into perspective.
@david_chisnall @carbontwelve this is what has been gnawing at the back of my brain. The purveyors of LLMs have been talking up the latest improvements in reasoning. A calculator that isn't 100% accurate at returning correct answers to inputs is 100% useless. We're being asked to conflate the utility of LLMs with the same kind of utility as a calculator. Would we choose to drive over a bridge designed using AI? How will we know?

@zebratale @carbontwelve Calculators do make mistakes. Most pocket calculators do arithmetic in binary and so propagate errors converting decimal to binary floating point, for example not being able to represent 0.1 accurately. They use floating point to approximate rationals, so collect rounding errors for things like 1/3.

The difference is that you can create a mental model of how they fail and make sure that the inaccuracies are acceptable within your problem domain. You cannot do this with LLMs. They will fail in exciting and surprising ways. And those failure modes will change significantly across minor revisions.
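Those calculator failure modes are easy to reproduce in any binary floating-point system, for example Python's IEEE-754 doubles:

```python
# 0.1 has no exact binary representation, so decimal arithmetic drifts:
assert 0.1 + 0.2 != 0.3
# Rationals like 1/3 collect rounding errors; sometimes they happen to
# cancel, sometimes they don't:
assert (1 / 3) * 3 == 1.0          # rounding cancels here
assert 0.1 + 0.1 + 0.1 != 0.3      # but not here
```

The point stands: these errors follow a model (IEEE-754 rounding) that you can reason about and bound; LLM errors do not.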

@david_chisnall

@zebratale @carbontwelve

I do find myself building up intuitions for what an LLM does. It's far less reliable than a calculator but humans can build intuitions for other unreliable things that can fail excitingly.

@david_chisnall @carbontwelve Well one thing where LLMs can make sense is spam filtering (sadly also for generating it, as we probably all know by now…).

Like rspamd tried GPT-3.5 Turbo and GPT-4o models against Bayes and got pretty interesting results: https://rspamd.net/misc/2024/07/03/gpt.html

Although, as the conclusion puts it, one should use local LLMs for data privacy reasons and likely performance reasons (elapsed time of ~300s for GPT vs. 12s and 30s for Bayes), which would also likely change the results.

@lanodan @carbontwelve Spam filtering has been a good application for machine learning for ages. I think the first Bayesian spam filters were added around the end of the last century. It has several properties that make it a good fit for ML:

  • The cost of letting spam through is low, the value in filtering most of it correctly is high.
  • There isn’t a rule-based approach that works well. You can’t write a list of properties that make something spam. You can write a list of properties that indicate something has a higher chance of being spam.
  • The problem changes rapidly. Spammers change their tactics depending on what gets through filters and so a system that adapts on the defence works well. You have a lot of data of ham vs spam to do the adaptation.
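The classic per-token Bayesian approach fits in a few lines. This is a sketch in that spirit (not rspamd's implementation): per-token spam/ham counts combined naively as a log-odds sum.

```python
import math
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
n_spam = n_ham = 0

def train(tokens, is_spam):
    """Update per-token counts from a labelled message."""
    global n_spam, n_ham
    if is_spam:
        spam_counts.update(tokens); n_spam += 1
    else:
        ham_counts.update(tokens); n_ham += 1

def spam_log_odds(tokens):
    """Sum of per-token log odds; > 0 leans spam, < 0 leans ham."""
    score = 0.0
    for t in tokens:
        p_spam = (spam_counts[t] + 1) / (n_spam + 2)   # Laplace smoothing
        p_ham = (ham_counts[t] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score

train(["cheap", "pills", "now"], True)
train(["meeting", "agenda", "now"], False)
```

Retraining on fresh ham/spam is cheap, which is what lets the defence adapt as fast as the attackers do.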

Note that this is not the same for intrusion detection and a lot of ML-based approaches for intrusion detection have failed. It is bad if you miss a compromise and you don’t have enough examples of malicious and non-malicious data for your categoriser to adapt rapidly.

The last point is part of why it worked well in my use case and was great for Project Silica when I was at MS. They were burning voxels into glass with lasers and then recovering the data. With a small calibration step (burn a load of known-value voxels into a corner of the glass) they could build an ML classifier that worked on any set of laser parameters. It might not have worked quite as well as a well-tuned rule-based system, but they could do experiments as fast as the laser could fire with the ML approach, whereas a rule-based system needed someone to classify the voxel shapes and redo the implementation, which took at least a week. That was a huge benefit. Their data included error-correction codes, so as long as their model was mostly right, ECC would fix the rest.

@david_chisnall @carbontwelve part of the AI hype brainrot seems to be a complete disregard for the possibility of failure. or if there's any consideration of it, then the next bigger, hungrier model will fix it...
@nebulos @david_chisnall @carbontwelve That's capitalism 101: Create a problem that didn't exist before; conjure a solution you can build a business model around; kneecap whatever regulatory mechanisms that would eliminate the problem for good; leech off society for life!

@david_chisnall @carbontwelve

I just want magic ETL. That's all I want, not just for Christmas.

@david_chisnall @carbontwelve The argument that I would make is that, to the extent LLMs are "useful" (or perceived as useful) for programming, it's an indictment of our programming environments. If we're writing that many statistically likely tokens in our programs, programming is reduced to pasting large volumes of boilerplate, and the correct solution is *not* "get a machine to do that", it's "get rid of the boilerplate".
@wollman @carbontwelve Alan Blackwell recently wrote an entire book around that thesis, which you might find interesting.

@david_chisnall @carbontwelve This sort of cost benefit analysis is crucial and missing in so much of the discussion around generative AI!

On the other hand, without intending to diminish it at all, it sort of characterizes all programing as equally serious or something, which obviously isn't true.

In plenty of programming "the costs of generating incorrect code even 1% of the time are" pretty inconsequential, actually.

@aeischeid @david_chisnall @carbontwelve

> In plenty of programming "the costs of generating incorrect code even 1% of the time are" pretty inconsequential, actually.

I guess it depends on the size of your codebase and how easy it is to spot the bug introduced by the LLM. In a lot of cases debugging can take up much more time than writing new code.

@carbontwelve @david_chisnall this also matches my expectations, and I've seen people mention studies in teams showing no productivity gain, too.

So I'm intrigued by the few people who DO report that LLMs help them code, though (eg @simon ). Is there something different about how their brains work so LLMs help? Or (cynically) are they jumping on the bandwagon and trying hard to show the world they've cracked how to use them well, to sell themselves as consultants or something?

@kitten_tech @carbontwelve @david_chisnall @simon Something I've found LLMs useful for, and I've seen Simon say something similar, is writing code in a language or situation that I _kinda_ know. I might not have bothered writing it if I had to climb boilerplate mountain first, but the LLM serves as guardrails and crack-filler. And since I _kinda_ know the thing, I don't fall for appalling hallucinations, but I get a chance to learn more about the thing fairly painlessly.
@kitten_tech @carbontwelve @david_chisnall @simon This matches my experience, and I gave up after trying it on four small nontrivial projects. The people I personally know using it successfully only use it in limited cases as a kind of hint system, to help remind them of things or point out new ways of doing things.

@kitten_tech @carbontwelve @david_chisnall I'm actually getting more coding work done directly in the Claude and ChatGPT web interfaces and apps vs using Copilot in my editor

The real magic for me at the moment is Claude Artifacts and ChatGPT Code Interpreter - I wrote a bunch about Artifacts here: https://simonwillison.net/tags/claude-artifacts/

Here are all of my general notes on AI-assisted programming: https://simonwillison.net/tags/ai-assisted-programming/


@simon @kitten_tech @carbontwelve @david_chisnall How would you avoid or deal with the issues that David encountered? Specifically, subtle bugs that the process of debugging make the whole process less efficient than writing it yourself. Is there one of your notes that deals with that already?

@utterfiction @carbontwelve @david_chisnall

He has a few examples where he felt something in the output didn't look right, or ran it and found bugs, and had the LLM try again.

Most of his examples are relatively simple things of the form "I didn't want to spend time reading API docs for this quick task", though. I don't find that sort of thing a bottleneck in what I do - and I quite enjoy reading docs, and building a mental model of a tool I can then use to know what its...

@utterfiction @carbontwelve @david_chisnall ... limitations and capabilities are.

The bits of programming that eat my time, which I'd love a tool to help with, are usually understanding a bug in an undocumented and under commented ball of hundreds of kloc of code, too big for an LLM's context window, and where going and quizzing the people who wrote bits of it is essential to success.

The bits Simon gets LLMs to do look like the tasks I do to cheer myself up after that :-)

@kitten_tech @carbontwelve @david_chisnall Yeah. A lot of my professional time is spent extending logic, adding new features that follow an existing pattern, refactoring when re-usable abstractions are discovered… so far, they’re just not very good at that. And I don’t think pure LLMs ever will be - limited token windows and no genuine symbolic representation of knowledge.

@kitten_tech

@utterfiction @carbontwelve @david_chisnall

If you can suddenly create small throwaway applications far more quickly than before, applications that might be too boring or bothersome to create otherwise, that might allow new ways of working altogether.

@utterfiction @kitten_tech @carbontwelve @david_chisnall you have to assume that the LLM will make weird mistakes all the time, so your job is all about code review and meticulous testing

I still find that a whole lot faster than writing all the code myself

Here's just one of many examples where I missed something important: https://simonwillison.net/2023/Apr/12/code-interpreter/#something-i-missed


@utterfiction @kitten_tech @carbontwelve @david_chisnall but honestly, the disappointing answer is that most of this comes down to practice and building intuition for tasks the models are likely to do well vs mess up

Manipulating some elements in the HTML DOM with JavaScript? They'll nail that every time

Implementing something involving MDIO registers? My guess is there are FAR fewer examples relating to that in the (undocumented, unlicensed) training data, so it's much more likely to make mistakes

@kitten_tech @carbontwelve @david_chisnall Wouldn't touch it with a bargepole myself, but I think a third possibility is that at least some people reporting that haven't had it write a sufficiently hilarious bug _yet_. After all, the OP hit one every four months - one could easily get lucky if that's a typical frequency.

@denisbloodnok

@kitten_tech @carbontwelve @david_chisnall

I am a couple of years in with copilot. No such bugs yet. Context: Rust. I write lots of tests along with my code (as OP appears to do too), and currently can rely on a massive external test suite.

The one ridiculously hard to debug bug I got is when I had to debug a codebase I ported from Java to Rust and I transliterated bits wrong as the human, and had no incremental tests built up along with the code. No LLM to blame.

@kitten_tech @carbontwelve @david_chisnall @simon so, I’ve found a 10-20% productivity boost with Copilot, mostly when dealing with boilerplate and small stuff. I mostly code python and AL (an ERP-specific language, which doesn’t get much if any boost).

What does work: ending statements when you start them, it infers enough for me to want to let it finish the line, maybe the next two-three lines. Sometimes it gets the logic completely wrong, but then you don’t accept the suggestion. Sometimes it comes up with edge cases I hadn’t considered.

What is more dodgy: explaining what you want and getting it to write the code. That can get quite dodgy, and I rarely accept those suggestions, unless it’s boilerplate.
1/3

That being said, I have some experience working with code submitted by less skilled programmers who blindly copy and paste stack exchange for a living, from before the prevalence of LLMs, and I am somewhat used to reviewing code of that standard. I find the longer LLM-built code is similar to review as that style of code, and in some cases is approaching that level of code quality.

I am tempted to try one of these “code your own mobile app” demo things, as it’s a platform I’m unfamiliar with, and I have some itches to scratch.

I believe both my coding style and my speed have been affected by using Copilot, both with modest boosts to productivity.

Could I work without Copilot? Absolutely! Would I want to? I think I’d miss the speed boost in a long python project

2/3

One place where my colleagues (and to a lesser extent, myself) have found LLMs to be useful is in multilingual situations.

When English is not your first language, but your coding standard requires things to be programmed in English, sometimes you can struggle to use the correct name for a variable, especially when those words are false friends, or are concepts that aren’t one word in English. The LLM can make sensible suggestions for variable and function names and the like. I’ve had to do fewer refactorings of colleagues’ work due to inappropriate name use since they started with Copilot.

Similarly, pasting a description in Spanish into one of these things and asking for an outline in English onto which to hang your code has helped, with proper code review.

This stuff is not a panacea, but it can help when applied with a healthy dose of scepticism. @kitten_tech’s OP conclusion is valid, as the benefits are still marginal.

3/3

@moof

> I am tempted to try one of these “code your own mobile app” demo things, as it’s a platform I’m unfamiliar with, and I have some itches to scratch.

I wrote my first Android app a couple of months ago. I did it in Android Studio, which didn’t have Copilot set up. It took half a day (I hadn’t touched Java for 6-8 years, and then mostly only to write test cases when hacking on the internals of a JVM). I went from nothing to a working app in under a day.

The things that took time were:

  • Google’s CADT problem meant that a lot of things in the build system had changed since the tutorials were written, and figuring out the differences was always annoying.
  • The MQTT library I was using needed some extra things for compatibility with older SDKs, which were enabled by default; the instructions for turning them off were documented, but figuring out that this was the problem took time.
  • I spent ages debugging a connection problem that I assumed was a permissions issue. It turned out that the MQTT server was down (but its status page was not).

I don’t think an LLM would have helped with any of these problems.

Android development is so much worse than OpenStep development in 1992 (iOS is a cleaned up version of OpenStep tuned for touchscreens and systems with more than 8 MiB of RAM, so I presume it’s better). Adding LLMs won’t fix that, thinking about APIs before you ship a thing that you likely have to support for a decade or so would. In spite of it being a truly terrible platform for developers, it was pretty easy to build something that worked.

Twenty years ago, we were building minimal-code platforms where you could build CRUD web and desktop apps with a few dozen lines of code for your business logic. A lot of frameworks seem to have massively regressed since then. If anything, relying on LLMs to fill in the code that shouldn’t be necessary in the first place will make this worse.

@david_chisnall I do miss the era when you could just code up an app with minimal thinking about the common cases that were covered by frameworks. The idea that everyone needs to have their own interface developed is something that the new web era has foisted on us, and is definitely a step back. Electron and company has just made it worse. And don’t get me started on my thoughts on WASM.

I agree that LLMs will not help there. Or if they do, it shouldn’t be like that.

Either way, I feel that the best way to get a feel for a tool is to use it. You have done so, and come to valid conclusions, and I thank you for sharing.

I expect my next job to be the sort where I will have to battle pressure both from above and below for use of LLMs as a way to accelerate or replace developers. I need to have arguments that sound authoritative in order to battle the massive propaganda^Wmarketing effort being made to sell this as the best thing since Jesus fed the 5k with sliced bread

@moof

> I need to have arguments that sound authoritative

If you're looking for plausible and authoritative-sounding pronouncements, you've come to the right place!

@moof What kind of boilerplate are you having to write so often that any decent snippet engine couldn't handle perfectly well without the litany of issues of an LLM? We *already have* tools for boilerplate, I don't understand why people are so entranced by LLM's ability to deal with it in an absurdly, grossly inefficient way.
@jincyquones for me, mostly it’s things like schema objects, as I do a lot of integration between disparate systems with subtly different schemas on each end. The LLM takes the repetitive work out of typing this stuff. I consider this boilerplate as opposed to functional code.
@moof Can you be more specific? That still sounds to me like something that'd be handled better with purpose-built tooling. What specifically is the LLM doing that one couldn't develop a tool to do precisely? How does it actually leverage the advantages of ML? LLMs are great at natural language, but code/schemas are designed to be structured & machine readable and "typing [repetitive] stuff out" is a trivial task to automate. Why do you need a sledgehammer to hang pictures?

@jincyquones I don’t need a sledgehammer to hang pictures, I could sit and write a script.

The point is, the tool does a load of things that different scripts could do, but, crucially, I now no longer need to source, write, or maintain.

@moof Okay, but are the tasks you're using it for so complex or novel that the cost of sourcing/maintaining an equivalent toolset is significantly greater than the cost of dealing with an LLM's mistakes? Do you do the tasks so infrequently that said cost isn't justified, in which case, are they still frequent enough that you wouldn't just do them by hand, use copy/paste, or bang out a quick script you don't need to maintain? Do you think Copilot would be able to write those scripts for you?

@jincyquones
Unlike the OP, I have found the cost of dealing with LLM mistakes minimal compared to the benefits, possibly due to the fact that I have dealt with bad code from outside sources already, and have been aggressive about reviewing it.

Whilst in hindsight, developing tooling would help with these tasks in an algorithmic manner, at the time I didn’t realise the need until the LLM Just Did It, and it was Good Enough that I didn’t even bother to consider coding that tooling. I never evaluated whether a side project to create said tooling would be short enough to be worthwhile. As such, it was already a performance boost.

Algorithmic tooling could definitely have done the same job. But whilst it was happening frequently enough that the LLM made a difference, it was not frequently enough that I might have actually gotten round to scratching that itch had I not had the LLM helping in the first place, as I am wary of too much yak shaving.

@moof I can get how it would be useful in that very specific sort of situation, but I question how often that happens. It depends on the dev, what they're working on, and priorities, I'm sure.

Every dev I've seen say good things about LLM code assistance, they say "boilerplate," but that's not remotely impressive or persuasive to me because boilerplate is easy & snippet engines these days are very good. If I were y'all, I'd find a better way to explain the benefits. Thanks for the convo!

@moof Sorry for the interrogation. I'm obviously skeptical, but I'm just trying to understand where people such as yourself are coming from so I can reevaluate my opinion. No matter how I look at it, the cost/benefit doesn't make sense to me.

@kitten_tech

@carbontwelve @david_chisnall

Note that the way @simon reports using this to generate little projects is an entirely different mode of working with them. I have used Copilot for a few years now and like it myself, mostly as context-sensitive autocomplete.

A Q&A session to create code for a CLI tool or web app is a very different way of working I started exploring more recently. It's surprisingly capable for little projects and requires a different approach.


@carbontwelve @david_chisnall i have been suspecting their use for a while yes. ive never used an LLM either, not because i didnt want to but because they all seem to be served from CAGEMAFIA.... if that isnt bad enough they block tor.

a paper came out this month (Dec 5), about how these LLM being marketed as "open" really are not, in a number of ways, very worthwhile read and i have started un-including people that are excessive users. even recently i just didn't bother catching up with someone wearing a smart wearable. they can stay in their bubble

@david_chisnall That's interesting. One thing I think about a lot is the different value proposition of LLMs for someone who already knows how to code vs someone who doesn't. And I really worry about the pipeline of junior dev -> senior dev in this new reality of coding tools.
@poswald @david_chisnall Me too. Not to sound like an old, grumpy pants, but there is real value from going through the whole process of designing and writing a unit. Valuable lessons learned along the way such as seeing a refactor point where code is duplicated, understanding inputs and outputs and test driving a solution. Real talk? I think it orders our minds and we get better and better at writing and reading code. Junior devs are missing that. I see it all the time.

@david_chisnall Thank you for the report.

I must admit that it fits my expectations. (That may be due to my bias, or it may be due to my prior experience or insight.)

@david_chisnall thank you for your insights! they are valuable.

@david_chisnall My experiences mirror yours. And are amplified by the number of posts I see on r/regex and r/sql and r/vim where people ask for help because ClaudeminiGrokPT told them to do XYZ but it doesn't work and they don't know how to fix it.

Are they useful for some things? Maybe.

Are they useful for programming reliable code? You bet your sweet bippy they aren't.

@gumnos @david_chisnall "Bippy. Now that is a word I have not heard in a very, very long time..." #FickleFingerOfFate
@gumnos @david_chisnall people use llms to generate regexes? That seems like the worst possible usecase 🤣

@mark

alas, r/regex is replete with them ☹

@david_chisnall

@gumnos @david_chisnall I mean the best-case scenario for an LLM is where writing is costly and verification is cheap. Regexes are the opposite.
@david_chisnall Of course it is a waste of energy (as compared to putting some extra brain energy into it) but I like this aspect:
"The one place Copilot was vaguely useful was hinting at missing abstractions (if it can autocomplete big chunks then my APIs required too much boilerplate and needed better abstractions)."
Because that is something that (if you don't think about it up front) you normally only recognize later on.
@Torbencht @david_chisnall big fan of this bit. "copilot is useful only in that sometimes you know you've messed up because copilot starts being useful", lovely

@Torbencht @david_chisnall Samba generates a *lot* of C code using PIDL.

We rely on this to securely and reliably parse network data structures.

NDR is easy to parse successfully but insecurely: many duplicate lengths.

We have had devastating security bugs in that parsing.

When we fixed them, we also had to fix the much smaller number of manual parsers we manually modified.

If an LLM generates your codebase, but badly, how can we be sure the human or LLM fixed all the spots? How much effort?
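The duplicate-length hazard is easy to sketch (illustrative wire format, not real NDR): when a message carries both an allocated size and an element count, a parser that trusts one and indexes with the other is exploitable, while a safe parser cross-checks them before touching the payload.

```python
import struct

HEADER = 8  # two little-endian u32 fields: allocated size, element count

def parse_unsafe(buf):
    """Trusts `count` without checking it against `alloc` or the buffer.
    Python slicing silently truncates; the equivalent C reads out of bounds."""
    alloc, count = struct.unpack_from("<II", buf, 0)
    return buf[HEADER:HEADER + count]

def parse_safe(buf):
    """Rejects messages whose duplicate lengths disagree."""
    alloc, count = struct.unpack_from("<II", buf, 0)
    if count > alloc or HEADER + count > len(buf):
        raise ValueError("inconsistent lengths")
    return buf[HEADER:HEADER + count]
```

A generator like PIDL applies the cross-check uniformly to every parser it emits; with hand-written (or LLM-written) parsers, each one is a separate chance to forget it.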