I finally turned off GitHub Copilot yesterday. I’ve been using it for about a year on the ‘free for open-source maintainers’ tier. I was skeptical but didn’t want to dismiss it without a fair trial.

It has cost me more time than it has saved. It lets me type faster, which has been useful when writing tests where I’m testing a variety of permutations of an API to check error handling for all of the conditions.

I can recall three places where it has introduced bugs that took me more time to debug than the total time it saved:

The first was something that initially impressed me. I pasted the prose description of how to communicate with an Ethernet MAC into a comment and then wrote some method prototypes. It autocompleted the bodies. All very plausible looking. Only it managed to flip a bit in the MDIO read and write register commands. MDIO is basically a multiplexing system. You have two device registers exposed, one sets the command (read or write a specific internal register) and the other is the value. It got the read and write the wrong way around, so when I thought I was writing a value, I was actually reading. When I thought I was reading, I was actually seeing the value in the last register I thought I had written. It took two of us over a day to debug this. The fix was simple, but the bug was in the middle of correct-looking code. If I’d manually transcribed the command from the data sheet, I would not have got this wrong because I’d have triple checked it.
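A toy model of this failure, as a sketch only. The register layout, the opcode values (`OP_READ`, `OP_WRITE`) and the names `FakeMdioDevice`, `Driver` and `round_trip_ok` are all hypothetical and not taken from any real MAC or data sheet. It assumes that writing the command register triggers a transfer through the value register. Under that assumption, a round trip across two registers with distinct reset values would probably have exposed the swap quickly:

```python
# Toy model only: the register layout and opcode values are hypothetical,
# not taken from any real MAC data sheet.
OP_READ = 0
OP_WRITE = 1


class FakeMdioDevice:
    """Simulated MAC exposing a command register and a value register."""

    def __init__(self, reset_values):
        self.internal = dict(reset_values)  # internal registers behind the mux
        self.value = 0                      # the exposed value register

    def command(self, op, addr):
        """Writing the command register triggers the transfer."""
        if op == OP_WRITE:
            self.internal[addr] = self.value
        else:
            self.value = self.internal[addr]


class Driver:
    """Driver whose opcodes can be deliberately swapped to mimic the bug."""

    def __init__(self, dev, read_op=OP_READ, write_op=OP_WRITE):
        self.dev = dev
        self.read_op = read_op
        self.write_op = write_op

    def write(self, addr, value):
        self.dev.value = value
        self.dev.command(self.write_op, addr)

    def read(self, addr):
        self.dev.command(self.read_op, addr)
        return self.dev.value


def round_trip_ok(driver):
    """Write one register, then check it and an untouched neighbour."""
    driver.write(1, 0xBEEF)
    return driver.read(1) == 0xBEEF and driver.read(2) == 0x2222


good = Driver(FakeMdioDevice({1: 0x1111, 2: 0x2222}))
swapped = Driver(FakeMdioDevice({1: 0x1111, 2: 0x2222}),
                 read_op=OP_WRITE, write_op=OP_READ)
print(round_trip_ok(good))     # True
print(round_trip_ok(swapped))  # False
```

In this model the swapped driver reproduces the symptom from the post: a "read" returns the old contents of the register that was last "written", and silently overwrites the register being read.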

In another case, it inverted the condition in an if statement inside an error-handling path. The error handling was a rare case and was asymmetric. Hitting the if case when you wanted the else case was okay but the converse was not. Lots of debugging. I learned from this to read the generated code more carefully, but that increased cognitive load and eliminated most of the benefit. Typing code is not the bottleneck and if I have to think about what I want and then read carefully to check it really is what I want, I am slower.

Most recently, I was writing a simple binary search and insertion-deletion operations for a sorted array. I assumed that this was something that had hundreds of examples in the training data and so would be fine. It had all sorts of corner-case bugs. I eventually gave up fixing them and rewrote the code from scratch.
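For reference, a minimal sketch of the kind of code in question, not the code from the post. The names `lower_bound`, `insert` and `delete` are illustrative, and the choice to reject duplicate keys is an assumption. The corner cases that tend to go wrong are the empty array, the first and last positions, and absent keys; a half-open search interval keeps them uniform:

```python
def lower_bound(items, key):
    """Return the first index i such that items[i] >= key (len(items) if none)."""
    lo, hi = 0, len(items)          # search the half-open range [lo, hi)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if items[mid] < key:
            lo = mid + 1            # key, if present, is to the right of mid
        else:
            hi = mid                # items[mid] >= key, so mid stays a candidate
    return lo


def insert(items, key):
    """Insert key in sorted position. Returns False if it was already present."""
    i = lower_bound(items, key)
    if i < len(items) and items[i] == key:
        return False
    items.insert(i, key)
    return True


def delete(items, key):
    """Remove key if present. Returns True if something was removed."""
    i = lower_bound(items, key)
    if i < len(items) and items[i] == key:
        del items[i]
        return True
    return False
```

The bounds check `i < len(items)` before every `items[i]` is the step that is easy to drop, because the search legitimately returns `len(items)` for a key larger than everything stored.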

Last week I did some work on a remote machine where I hadn’t set up Copilot and I felt much more productive. Autocomplete was either correct or not present, so I was spending more time thinking about what to write. I don’t entirely trust this kind of subjective judgement, but it was a data point. Around the same time I wrote some code without clangd set up and that really hurt. It turns out I really rely on AST-aware completion to explore APIs. I had to look up more things in the documentation. Copilot was never good for this because it would just bullshit APIs, so something showing up in autocomplete didn’t mean it was real. This would be improved by using a feedback system to require autocomplete outputs to type check, but then they would take much longer to create (probably at least a 10x increase in LLM compute time) and wouldn’t complete fragments, so I don’t see a good path to being able to do this without tight coupling to the LSP server and possibly not even then.

Yesterday I was writing bits of the CHERIoT Programmers’ Guide and it kept autocompleting text in a different writing style, some of which was obviously plagiarised (when I’m describing precisely how to implement a specific, and not very common, lock type with a futex and the autocomplete is a paragraph of text with a lot of detail, I’m confident you don’t have more than one or two examples of that in the training set). It was distracting and annoying. I wrote much faster after turning it off.

So, after giving it a fair try, I have concluded that it is both a net decrease in productivity and probably an increase in legal liability.

Discussions I am not interested in having:

  • You are holding it wrong. Using Copilot with this magic config setting / prompt tweak makes it better. At its absolute best, it was a small productivity increase; if it needs more effort to use, that will be offset.
  • This other LLM is much better. I don’t care. The costs of the bullshitting far outweighed the benefits when it worked. To be better, it would have to not bullshit, and that’s not something LLMs can do.
  • It’s great for boilerplate! No. APIs that require every user to write the same code are broken. Fix them, don’t fill the world with more code using them that will need fixing when the APIs change.
  • Don’t use LLMs for autocomplete, use them for dialogues about the code. Tried that. It’s worse than a rubber duck, which at least knows to stay silent when it doesn’t know what it’s talking about.

The one place Copilot was vaguely useful was hinting at missing abstractions (if it can autocomplete big chunks then my APIs required too much boilerplate and needed better abstractions). The place I thought it might be useful was spotting inconsistent API names and parameter orders but it was actually very bad at this (presumably because of the way it tokenises identifiers?). With a load of examples with consistent names, it would suggest things that didn't match the convention. After using three APIs that all passed the same parameters in the same order, it would suggest flipping the order for the fourth.

#GitHubCopilot #CHERIoT

@david_chisnall your experience is how I expected mine to be if I had actually given the technology a chance.

Machine learning has been useful for decades, mostly quietly. The red flag against LLMs for me (aside from the authorship laundering and mass poaching of content) was the scramble by all companies to shoehorn it into their products, like a solution looking for a problem, one I’ve yet to see it actually solve.

@carbontwelve I used machine learning in my PhD. The use case there was data prefetching. This was an ideal task for ML, because the benefit of a correct answer was high and the cost of an incorrect answer was low. In the worst case, your prefetching evicts something from cache that you need later, but a 60% accuracy in predictions is a big overall improvement.

Programming is the opposite. The benefits of being able to generate correct code faster 80% of the time are small but the costs of generating incorrect code even 1% of the time are high. The entire shift-left movement is about finding and preventing bugs earlier.
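The asymmetry can be put as a back-of-the-envelope expected-value calculation. This is only a sketch: the 60%, 80% and 1% rates come from the posts above, but every gain and loss figure below is a made-up placeholder chosen to show the shape of the argument, not a measurement.

```python
def expected_net_gain(p_good, gain, p_bad, loss):
    """Expected benefit per prediction: wins weighted by gain minus failures weighted by loss."""
    return p_good * gain - p_bad * loss


# Prefetching: a correct guess saves a lot, a wrong one costs a little.
# The units (cycles) and the values 100 and 10 are hypothetical.
prefetch = expected_net_gain(p_good=0.6, gain=100, p_bad=0.4, loss=10)

# Code generation: a correct completion saves a little typing, a subtle bug
# costs a lot of debugging. The units (minutes) and the values 1 and 480 are hypothetical.
codegen = expected_net_gain(p_good=0.8, gain=1, p_bad=0.01, loss=480)

print(prefetch > 0)  # True
print(codegen < 0)   # True
```

With placeholders like these, the sign of the result is driven almost entirely by the ratio of loss to gain rather than by the accuracy, which is the point being made.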

@david_chisnall that’s a nicely eloquent way to put both into perspective.

@david_chisnall @carbontwelve this is what has been gnawing at the back of my brain. The purveyors of LLMs have been talking up the latest improvements in reasoning. A calculator that isn't 100% accurate at returning correct answers to inputs is 100% useless. We're being asked to conflate the utility of LLMs with the same kind of utility as a calculator. Would we choose to drive over a bridge designed using AI? How will we know?

@zebratale @carbontwelve Calculators do make mistakes. Most pocket calculators do arithmetic in binary and so propagate errors converting decimal to binary floating point, for example not being able to represent 0.1 accurately. They use floating point to approximate rationals, so collect rounding errors for things like 1/3.

The difference is that you can create a mental model of how they fail and make sure that the inaccuracies are acceptable within your problem domain. You cannot do this with LLMs. They will fail in exciting and surprising ways. And those failure modes will change significantly across minor revisions.
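The predictable kind of failure is easy to demonstrate with binary floating point in general. A short sketch using Python's standard `fractions` and `math` modules; the helper name `representation_error` is illustrative, and this shows IEEE 754 double behaviour rather than what any particular calculator does:

```python
import math
from fractions import Fraction


def representation_error(decimal_text):
    """Exact difference between the stored double and the decimal that was written."""
    return Fraction(float(decimal_text)) - Fraction(decimal_text)


# 0.1 has no finite binary expansion, so the stored value is slightly off.
print(representation_error("0.1") != 0)      # True

# The error is tiny but it propagates through arithmetic.
print(0.1 + 0.2 == 0.3)                      # False

# Because the failure mode is well understood, it can be planned for:
# compare with a tolerance that is acceptable for the problem domain.
print(math.isclose(0.1 + 0.2, 0.3))          # True
```

The error here is bounded and reproducible, so a tolerance can be chosen once and relied on.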

@david_chisnall @zebratale @carbontwelve I do find myself building up intuitions for what an LLM does. It's far less reliable than a calculator but humans can build intuitions for other unreliable things that can fail excitingly.

@david_chisnall @carbontwelve Well one thing where LLMs can make sense is spam filtering (sadly also for generating it, as we probably all know by now…).

For example, rspamd tried GPT-3.5 Turbo and GPT-4o models against Bayes and got pretty interesting results: https://rspamd.net/misc/2024/07/03/gpt.html

Although, as the conclusion puts it, one should use local LLMs for data privacy reasons and likely for performance reasons (elapsed time for GPT being ~300s vs. 12s and 30s for Bayes), which would also likely change the results.

@lanodan @carbontwelve Spam filtering has been a good application for machine learning for ages. I think the first Bayesian spam filters were added around the end of the last century. It has several properties that make it a good fit for ML:

  • The cost of letting spam through is low, the value in filtering most of it correctly is high.
  • There isn’t a rule-based approach that works well. You can’t write a list of properties that make something spam. You can write a list of properties that indicate something has a higher chance of being spam.
  • The problem changes rapidly. Spammers change their tactics depending on what gets through filters and so a system that adapts on the defence works well. You have a lot of data of ham vs spam to do the adaptation.
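The properties in the list above map directly onto a naive Bayes filter. A minimal sketch, assuming whitespace tokenisation and add-one smoothing; the class name `NaiveBayesSpamFilter` and the training messages are made up for illustration and do not describe how rspamd or any other real filter is implemented:

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Word-count naive Bayes: each word nudges the odds, none is a hard rule."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Adapt to new examples; label is 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_log_odds(self, text):
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        totals = {k: sum(c.values()) for k, c in self.word_counts.items()}
        # Smoothed prior odds from the number of messages seen in each class.
        score = math.log((self.message_counts["spam"] + 1)
                         / (self.message_counts["ham"] + 1))
        for word in text.lower().split():
            # Add-one smoothing; the extra +1 in the denominator leaves room
            # for unseen words and avoids dividing by zero before training.
            p_spam = (self.word_counts["spam"][word] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.word_counts["ham"][word] + 1) / (totals["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


f = NaiveBayesSpamFilter()
f.train("win free money now", "spam")
f.train("free prize click now", "spam")
f.train("meeting notes attached", "ham")
f.train("lunch at noon tomorrow", "ham")
print(f.is_spam("free money"))        # True
print(f.is_spam("meeting tomorrow"))  # False
```

Calling `train` again with newly reported spam shifts the word probabilities, which is the adaptation described in the last bullet.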

Note that this is not the same for intrusion detection and a lot of ML-based approaches for intrusion detection have failed. It is bad if you miss a compromise and you don’t have enough examples of malicious and non-malicious data for your categoriser to adapt rapidly.

The last point is part of why it worked well in my use case and was great for Project Silica when I was at MS. They were burning voxels into glass with lasers and then recovering the data. With a small calibration step (burn a load of known-value voxels into a corner of the glass) they could build an ML classifier that worked on any set of laser parameters. It might not have worked quite as well as a well-tuned rule-based system, but they could do experiments as fast as the laser could fire with the ML approach, whereas a rule-based system needed someone to classify the voxel shapes and redo the implementation, which took at least a week. That was a huge benefit. Their data included error-correction codes, so as long as their model was mostly right, ECC would fix the rest.
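To illustrate why a mostly-right classifier is enough when the data carries redundancy, here is a sketch using a triple-repetition code with majority voting. This is the simplest error-correcting code and is used here purely for illustration; the thread does not say which code Project Silica used, and the names `encode` and `decode` are hypothetical.

```python
def encode(bits, copies=3):
    """Store each data bit several times."""
    return [b for b in bits for _ in range(copies)]


def decode(symbols, copies=3):
    """Recover each data bit by majority vote over its copies."""
    out = []
    for i in range(0, len(symbols), copies):
        group = symbols[i:i + copies]
        out.append(1 if sum(group) * 2 > len(group) else 0)
    return out


data = [1, 0, 1, 1]
stored = encode(data)

# Pretend the classifier misreads three of the twelve symbols,
# at most one per group of copies.
misread = list(stored)
for i in (0, 4, 11):
    misread[i] ^= 1

print(decode(misread) == data)  # True
```

The code only tolerates one misread per group of three, so it fails once errors cluster; practical codes achieve the same effect with far less overhead.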

@david_chisnall @carbontwelve part of the AI hype brainrot seems to be a complete disregard for the possibility of failure. or if there's any consideration of it, then the next bigger, hungrier model will fix it...

@nebulos @david_chisnall @carbontwelve That's capitalism 101: Create a problem that didn't exist before; conjure a solution you can build a business model around; kneecap whatever regulatory mechanisms would eliminate the problem for good; leech off society for life!

@david_chisnall @carbontwelve I just want magic ETL. That's all I want, not just for Christmas.

@david_chisnall @carbontwelve The argument that I would make is that, to the extent LLMs are "useful" (or perceived as useful) for programming, it's an indictment of our programming environments. If we're writing that many statistically likely tokens in our programs, programming is reduced to pasting large volumes of boilerplate, and the correct solution is *not* "get a machine to do that", it's "get rid of the boilerplate".

@wollman @carbontwelve Alan Blackwell recently wrote an entire book around that thesis, which you might find interesting.

@david_chisnall @carbontwelve This sort of cost benefit analysis is crucial and missing in so much of the discussion around generative AI!

On the other hand, without intending to diminish it at all, it sort of characterizes all programming as equally serious or something, which obviously isn't true.

In plenty of programming "the costs of generating incorrect code even 1% of the time are" pretty inconsequential, actually.

@aeischeid @david_chisnall @carbontwelve

> In plenty of programming "the costs of generating incorrect code even 1% of the time are" pretty inconsequential, actually.

I guess it depends on the size of your codebase and how easy it is to spot the bug introduced by the LLM. In a lot of cases debugging can take up much more time than writing new code.