My job as a senior developer with a team of juniors is to figure out what to write, sketch a PoC as guidance, and then delegate the actual implementation to them. Then I look at what they produce, explain misunderstandings or poor style choices, and guide them into implementing something that meets our standards.

I don't think LLMs can do my job yet. But I think we're getting shockingly close to them being able to do the other part. And I'm worried how we're going to get more senior developers.

I would not have said the same thing 6 months ago - the amount of progress here is significant. And I'm not denying that the technology has resulted in massive quantities of poor quality code produced by people who aren't in a position to review it, or that the externalities of all of this are large. But capitalism isn't going to give a shit, so we're getting all of this anyway whether we like it or not.
@mjg59 do you have some way of evaluating that progress in the last 6 months in some way that is not the subjective impression of improvement?
@glyph @mjg59 watching the benchmarks get saturated is interesting, but watching teammates build entire non-trivial projects entirely with the technology is a lot more convincing. There was a really palpable uptick in capability of the most powerful variants of this at the beginning of this year.
@PaulM @mjg59 Someone I respect has said *some* version of this to me every month since ChatGPT first shipped though, and I am tired of retesting various models and having them all produce the same hot garbage for my problems, while wondering if they're slowly making me psychotic as a side-effect. I keep asking this question because if *hard* evidence shows up, the kind of ROI you see on a balance sheet, I don't want to miss it.

@glyph @mjg59
that's entirely fair, and they have been getting better, but what constitutes "worth using" is pretty individual. I'm curious if you have any examples of something you'd quantify that way.

Maybe some relatively complex feature or bugfix you already wrote that you'd like to use as a benchmark for capability? Alternatively, a couple of trivial features you'd like in a personal project but haven't gotten around to building?

At a more mundane level, I suspect they could reliably alleviate a significant amount of the drudgery associated with maintaining OSS - fixing tests when dependencies are updated, etc. Nothing you can't trivially do yourself, but also in my experience painful to try to get the ADHD brain to pay attention to.

@PaulM @mjg59 At this point I am too nervous about the risks to actually touch one for anything non-trivial, and I think everyone should refrain from their use for ethical and safety reasons. One pretty robust argument in that discussion is "they're most likely actually an economic drain, even if they seem useful". But this is a tenuous argument that might become false at any moment, and if I'm not using them I won't know when that moment is.
@glyph @mjg59 but those things aside, I'd like to understand more about which risks you're most worried about, particularly for nontrivial work.

@PaulM @mjg59

1. Why do some people develop AI psychosis and others don't? Or does everyone eventually succumb and we just haven't used it longitudinally enough? Hormesis or linear-no-threshold?

2. How can one maintain a balance of failed-vs-successful prompts, to avoid time-wasting? Intuitive evaluation will always favor the successes.

3. If the tech *does* actually work, doesn't give me psychosis, and works more often than not with enough of an edge, de-skilling seems like a big problem.

@PaulM @mjg59 Related to 2, I am also concerned about addiction. ADHD is highly comorbid with problem gambling, and I don't want to be putting myself in a daily behavioral loop where I'm getting a little thrill from every minor success, even if I do, in some circumstances, have a demonstrable edge over the "house", which I guess in this case is pointless re-prompting with no progress.

@glyph @mjg59 I also worry about addiction. As Netflix learned, hours-go-up is likely a bad thing for your business to optimize in isolation, because it's usually bad for your customers too. I know my team works hard to avoid that trap.

A lot of that concern is related to my earlier response to #1, but it can also be a great enabler of hyperfocus, which can be both very pleasurable and counterproductive.

All that said, you seem pretty convinced that using these things is like pulling the arm on a slot machine - sometimes you get a reward but a lot of the time you get garbage and have to try again a different way. They really truly are not like that these days in my experience, and have not been like that for a while. If you model them or their users that way in your reasoning, you will be making category errors.

As a user (and maybe you'll say I have AI psychosis), the experience is more like working with a very fast, very precocious junior who has memorized half of wikipedia and is very quick with google, and who is getting better at writing code, but reasonably often needs detailed instruction or directional course correction. You don't cut their head off and ask the talent agency to send you a new one every time they give you an answer that doesn't quite match what you want; you clarify your request, or ask for a more achievable scope of work. Unlike searching google, your queries compound to vector the agent where you want it to go, conversationally, rather than standing alone individually.

@glyph @mjg59 for AI psychosis, my sense is that the observed outcomes are some combination of: people who were already psychotic or narcissistic; people who are unusually susceptible to the same validation/reinforcement traps used in social media, who discover the feedback loop can be instantaneous and permanently tilted in their favor; and an unfortunate subset of people who are prone to believe everything they read.

Which models they interact with, and how those are configured, makes a big difference. Some models are brokenly sycophantic, and that encourages this. Some models gladly engage in the kind of secret world government mind control "I discovered secrets the FBI needs to know about" kind of roleplay that draws susceptible people in. Training the model to refuse to go down these rabbit holes and keeping discussions factual is a hard problem, but one that modern models are much better at.

These dangers are one of the reasons that readily accessible open source model weights with near frontier capabilities worry me. I recognize that sounds hypocritical given my employer, but these systems are easier to misuse, and their snapshot-in-time nature can't benefit from ongoing safety work.

My belief is that occurrence is the product of underlying susceptibility, multiplied by unsafe model behavior. If those don't combine to meet a threshold level, people stay grounded in the real world. I don't see longitudinal use as an additional risk, although it obviously exacerbates symptoms for people who are above that threshold.

With modern models deployed with safety measures from the major providers, I think the risk is relatively low for most users.

@glyph @mjg59 re:2, I'm less sure I understand this. You're worried that the LLM will be so attractive you lose the ability to tell whether or not it's a useful resource? I think you'll use the same mechanisms you use to decide if searching google or asking questions on stackoverflow is useful, or searching reddit, or asking on IRC, or convincing other people to maintain open source projects so you can reliably depend on them. I don't think this tool is especially worse in this regard, and as with any other, you get better at using it over time. Unlike many tools, it also gets better over time, and lately has done so quite rapidly. Plenty of people seem to have quite successfully decided for themselves that it's a waste of time though!

@glyph @mjg59 re:3 - eh, maybe. I think I'm not convinced either way on that one yet.

On the one hand, these days I typically don't bother looking in detail at large chunks of the output where I can measure the results, and have used it for long enough to feel comfortable with the levels of reliability I can expect from it. I'm typically very comfortable asking it to read logs, write debug harnesses, and generally check its own work under my supervision. I've also used it enough that I have a pretty good ability to smell when it's going off the rails, or is likely to do so.

That sense, I think, is something that takes longer for someone who isn't already senior to develop, but at the same time, I've had some fascinating conversations with people who started from almost no ability to code and are now running (and debugging, and fixing) pretty significantly complex projects. It's neat to see the lowered barrier to entry allowing people to scratch their own itches, but I am absolutely certain that "using an LLM" doesn't preclude skills development or the ability to fix complex problems. You're just starting at a different abstraction layer, and now have a handy built-in tutor and contextual reference guide if you want to move up or down the stack to learn more.

The problem of "who will pay for the juniors for long enough that they become senior enough to be worth employing" is probably going to be a real problem. I'm not sure there is a good answer to that one aside from acknowledging that there will be disruption, and that our social contracts in America particularly are utterly broken.