My job as a senior developer with a team of juniors is to figure out what to write, sketch a PoC as guidance, and then delegate the actual implementation to them. Then I look at what they produce, explain misunderstandings or poor style choices, and guide them toward implementing something that meets our standards.

I don't think LLMs can do my job yet. But I think we're getting shockingly close to them being able to do the other part. And I'm worried how we're going to get more senior developers.

I would not have said the same thing 6 months ago - the amount of progress here is significant. And I'm not denying that the technology has resulted in massive quantities of poor quality code produced by people who aren't in a position to review it, or that the externalities of all of this are large. But capitalism isn't going to give a shit, so we're getting all of this anyway whether we like it or not.
@mjg59 do you have some way of evaluating that progress in the last 6 months in some way that is not the subjective impression of improvement?
@glyph @mjg59 watching the benchmarks get saturated is interesting, but watching teammates build entire non-trivial projects entirely with the technology is a lot more convincing. There was a really palpable uptick in capability of the most powerful variants of this at the beginning of this year.
@PaulM @mjg59 Someone I respect has said *some* version of this to me every month since ChatGPT first shipped though, and I am tired of retesting various models and having them all produce the same hot garbage for my problems, while wondering if they're slowly making me psychotic as a side-effect. I keep asking this question because if *hard* evidence shows up, the kind of ROI you see on a balance sheet, I don't want to miss it.

@glyph @mjg59
that's entirely fair, and they have been getting better, but what constitutes "worth using" is pretty individual. I'm curious if you have any examples of something you'd quantify that way.

Maybe some relatively complex feature or bugfix you already wrote that you'd like to use as a benchmark for capability? Alternatively, a couple of trivial features you'd like in a personal project but haven't gotten around to building?

At a more mundane level, I suspect they could reliably alleviate a significant amount of the drudgery associated with maintaining OSS - fixing tests when dependencies are updated, etc. Nothing you can't trivially do yourself, but also in my experience painful to try to get the ADHD brain to pay attention to.

@PaulM @mjg59 At this point I am too nervous about the risks to actually touch one for anything non-trivial, and I think everyone should refrain from their use for ethical and safety reasons. One pretty robust argument in that discussion is "they're most likely actually an economic drain, even if they seem useful". But this is a tenuous argument that might become false at any moment, and if I'm not using them I won't know when that moment is.
@glyph @mjg59 but those things aside, I'd like to understand more about which risks you're most worried about, particularly for nontrivial work.

@PaulM @mjg59

1. Why do some people develop AI psychosis and others don't? Or does everyone eventually succumb and we just haven't used it longitudinally enough? Hormesis or linear-no-threshold?

2. How can one maintain a balance of failed-vs-successful prompts, to avoid time-wasting? Intuitive evaluation will always favor the successes.

3. If the tech *does* actually work, doesn't give me psychosis, and works more often than not with enough of an edge, de-skilling seems like a big problem.

@PaulM @mjg59 Related to 2, I am also concerned about addiction. ADHD is highly comorbid with problem gambling, and I don't want to be putting myself in a daily behavioral loop where I'm getting a little thrill from every minor success, even if I do, in some circumstances, have a demonstrable edge over the "house", which I guess in this case is pointless re-prompting with no progress.

@glyph @mjg59 I also worry about addiction. As Netflix learned, hours-go-up is likely a bad thing for your business to optimize in isolation, because it's usually bad for your customers too. I know my team works hard to avoid that trap.

A lot of that concern is related to my earlier response to #1, but it can also be a great enabler of hyperfocus, which can be both very pleasurable and counterproductive.

All that said, you seem pretty convinced that using these things is like pulling the arm on a slot machine - sometimes you get a reward but a lot of the time you get garbage and have to try again a different way. They really truly are not like that these days in my experience, and have not been like that for a while. If you model them or their users that way in your reasoning, you will be making category errors.

As a user (and maybe you'll say I have AI psychosis), the experience is more like working with a very fast, very precocious junior who has memorized half of Wikipedia and is very quick with Google, and who is getting better at writing code, but reasonably often needs detailed instruction or directional course correction. You don't cut their head off and ask the talent agency to send you a new one every time they give you an answer that doesn't quite match what you want; you clarify your request, or ask for a more achievable scope of work. Unlike searching Google, your queries compound to vector the agent where you want it to go, conversationally, rather than standing alone individually.