My job as a senior developer with a team of juniors is to figure out what to write, sketch a PoC as guidance, and then delegate the actual implementation to them. I then look at what they produce, explain misunderstandings or poor style choices, and guide them into implementing something that meets our standards.

I don't think LLMs can do my job yet. But I think we're getting shockingly close to them being able to do the other part. And I'm worried how we're going to get more senior developers.

I would not have said the same thing 6 months ago - the amount of progress here is significant. And I'm not denying that the technology has resulted in massive quantities of poor quality code produced by people who aren't in a position to review it, or that the externalities of all of this are large. But capitalism isn't going to give a shit, so we're getting all of this anyway whether we like it or not.
@mjg59 do you have some way of evaluating that progress in the last 6 months in some way that is not the subjective impression of improvement?
@glyph not at all, other than my occasional requests for the robot to write code for me getting increasingly close to code I'd be willing to deploy
@mjg59 thanks. that kind of data is really hard to come by, so I am just asking everyone with this experience :)
@glyph @mjg59
The machine is still not generating code I can use. Apparently it is not trained in bash scripting.
@mjg59 @glyph Same. I’ve asked it to write things that I know I can do but don’t want to context switch to that domain and it actually does a remarkably good job both at understanding what I’m trying to do, the constraints, and producing code that’s correct and performant.
@glyph @mjg59 watching the benchmarks get saturated is interesting, but watching teammates build entire non-trivial projects entirely with the technology is a lot more convincing. There was a really palpable uptick in capability of the most powerful variants of this at the beginning of this year.
@PaulM @mjg59 Someone I respect has said *some* version of this to me every month since ChatGPT first shipped though, and I am tired of retesting various models and having them all produce the same hot garbage for my problems, while wondering if they're slowly making me psychotic as a side-effect. I keep asking this question because if *hard* evidence shows up, the kind of ROI you see on a balance sheet, I don't want to miss it.

@glyph @mjg59
that's entirely fair, and they have been getting better, but what constitutes "worth using" is pretty individual. I'm curious if you have any examples of something you'd quantify that way.

Maybe some relatively complex feature or bugfix you already wrote that you'd like to use as a benchmark for capability? Alternatively, a couple of trivial features you'd like in a personal project but haven't gotten around to building?

At a more mundane level, I suspect they could reliably alleviate a significant amount of the drudgery associated with maintaining OSS - fixing tests when dependencies are updated, etc. Nothing you can't trivially do yourself, but also in my experience painful to try to get the ADHD brain to pay attention to.

@PaulM @mjg59 At this point I am too nervous about the risks to actually touch one for anything non-trivial, and I think everyone should refrain from their use for ethical and safety reasons. One pretty robust argument in that discussion is "they're most likely actually an economic drain, even if they seem useful". But this is a tenuous argument that might become false at any moment, and if I'm not using them I won't know when that moment is.
@glyph @mjg59 I entirely respect the ethical position about not using them, but I think "their use is an economic drain" is likely not as robustly defensible at this point as it might have been in the past.
@glyph @mjg59 but those things aside, I'd like to understand more about which risks you're most worried about, particularly for nontrivial work.

@PaulM @mjg59

1. Why do some people develop AI psychosis and others don't? Or does everyone eventually succumb and we just haven't used it longitudinally enough? Hormesis or linear-no-threshold?

2. How can one maintain a balance of failed-vs-successful prompts, to avoid time-wasting? Intuitive evaluation will always favor the successes.

3. If the tech *does* actually work, doesn't give me psychosis, and works more often than not with enough of an edge, de-skilling seems like a big problem.

@PaulM @mjg59 Related to 2, I am also concerned about addiction. ADHD is highly comorbid with problem gambling, and I don't want to be putting myself in a daily behavioral loop where I'm getting a little thrill from every minor success, even if I do, in some circumstances, have a demonstrable edge over the "house", which I guess in this case is pointless re-prompting with no progress.

@glyph @mjg59 I also worry about addiction. As Netflix learned, hours-go-up is likely a bad thing for your business to optimize in isolation, because it's usually bad for your customers too. I know my team works hard to avoid that trap.

A lot of that concern is related to my earlier response to #1, but it can also be a great enabler of hyperfocus, which can be both very pleasurable and counterproductive.

All that said, you seem pretty convinced that using these things is like pulling the arm on a slot machine - sometimes you get a reward but a lot of the time you get garbage and have to try again a different way. They really truly are not like that these days in my experience, and have not been like that for a while. If you model them or their users that way in your reasoning, you will be making category errors.

As a user (and maybe you'll say I have AI psychosis), the experience is more like working with a very fast, very precocious junior who has memorized half of Wikipedia, is very quick with Google, and is getting better at writing code, but reasonably often needs detailed instruction or directional course correction. You don't cut their head off and ask the talent agency to send you a new one every time they give you an answer that doesn't quite match what you want; you clarify your request, or ask for a more achievable scope of work. Unlike searching Google, your queries compound to vector the agent where you want it to go, conversationally, rather than standing alone individually.

@glyph @mjg59 for AI psychosis, my sense is that the observed outcomes are some combination of people who were already psychotic or narcissistic, people who are unusually susceptible to the same validation/reinforcement traps used in social media and who discover the feedback loop can be instantaneous and permanently tilted in their favor, and an unfortunate subset of people who are prone to believe everything they read.

Which models they interact with, and how those are configured, makes a big difference. Some models are brokenly sycophantic, and that encourages this. Some models gladly engage in the kind of secret world government mind control "I discovered secrets the FBI needs to know about" kind of roleplay that draws susceptible people in. Training the model to refuse to go down these rabbit holes and keeping discussions factual is a hard problem, but one that modern models are much better at.

These dangers are one of the reasons that readily accessible open source model weights with near frontier capabilities worry me. I recognize that sounds hypocritical given my employer, but these systems are easier to misuse, and their snapshot-in-time nature can't benefit from ongoing safety work.

My belief is that occurrence is the product of underlying susceptibility, multiplied by unsafe model behavior. If those don't combine to meet a threshold level, people stay grounded in the real world. I don't see longitudinal use as an additional risk, although it obviously exacerbates symptoms for people who are above that threshold.

With modern models deployed with safety measures from the major providers, I think the risk is relatively low for most users.

@glyph @mjg59 re:2, I'm less sure I understand this. You're worried that the LLM will be so attractive you lose the ability to tell whether or not it's a useful resource? I think you'll use the same mechanisms you use to decide if searching google or asking questions on stackoverflow is useful, or searching reddit, or asking on IRC, or convincing other people to maintain open source projects so you can reliably depend on them. I don't think this tool is especially worse in this regard, and as with any other, you get better at using it over time. Unlike many tools, it also gets better over time, and lately has done so quite rapidly. Plenty of people seem to have quite successfully decided for themselves that it's a waste of time though!

@glyph @mjg59 re:3 - eh, maybe. I think I'm not convinced either way on that one yet.

On the one hand, these days I typically don't bother looking in detail at large chunks of the output where I can measure the results, and have used it for long enough to feel comfortable with the levels of reliability I can expect from it. I'm typically very comfortable asking it to read logs, write debug harnesses, and generally check its own work under my supervision. I've also used it enough that I have a pretty good ability to smell when it's going off the rails, or is likely to do so.

That sense, I think, is something that takes longer for someone who isn't already senior to develop, but at the same time, I've had some fascinating conversations with people who started from almost no ability to code and are now running (and debugging, and fixing) pretty significantly complex projects. It's neat to see the lowered barrier to entry allowing people to scratch their own itches, but I am absolutely certain that "using an LLM" doesn't preclude skills development or the ability to fix complex problems. You're just starting at a different abstraction layer, and now have a handy built-in tutor and contextual reference guide if you want to move up or down the stack to learn more.

The problem of "who will pay for the juniors for long enough that they become senior enough to be worth employing" is probably going to be a real problem. I'm not sure there is a good answer to that one aside from acknowledging that there will be disruption, and that our social contracts in America particularly are utterly broken.

@glyph

The production of all that hardware and the building and operation of all those data centres are huge environmental issues, and while human activity was certainly extremely polluting even before that, the whole content-generation business comes on top of it.
Idiotically, this might not concern companies like your employer.

But content generation models shift power to those who own them.
This might also not concern your employer, but if it's a SW corp they're externalising their core product

@PaulM @mjg59

@ari I'm pretty sure you meant to headline my username and not glyph's in that response?

@PaulM it's more a direct reply to Glyph than to you, so I think I do want to headline Glyph? But I don't have a firm enough grasp on the workings of Mastodon to be certain either way.

email / reddit style discussion trees have their merit, I'd rather have that (and no character cap, but I digress)

@ari well then, I'm not sure why you're talking like that to glyph. He's an independent open source developer who does consulting and has a patreon and doesn't use LLMs and has contributed to a bunch of fundamental internet software you undoubtedly use.

So you kinda sound like an asshole here.

@glyph
The way to make it work is not to use a web interface, but instead to use a tool like https://opencode.ai/ to
- generate the code
- generate the tests
- run the tests
- have it loop over 'fix any failures and try again'
- test the code yourself

By themselves, they will get things about 80% right. That's not perfect, but with that feedback loop it's enough to get something that works.
@PaulM @mjg59

@glyph
It won't be pretty or efficient or even entirely bug free, but if 'working code' is the only requirement, it will get you there faster than doing it by hand.
@PaulM @mjg59

@mjg59 I dunno. Sure we'll get a lot of (more) terrible apps, services, etc in the short term. "AI" is sort of accelerating the "everything is unreliable slop-ware" trend that's been infecting software development for the past decade plus.

At some point I suspect there may (once again) be a market for software that's not utter garbage.

I feel bad for everyone stuck working for these awful companies (increasingly *all* companies), while the industry destroys its capability to write software.

@swetland @mjg59 hilariously tho we just aren't seeing any new successful startups built on "vibed" code, it's all headlines from big corps like Microslop about how "AI" writes like 80% of their code already and blog posts about building a clone of a web page to do $thing that's been done a hundred times, but… nothing in between????? As Ed Zitron likes to ask: where are the startups??
@valpackett @mjg59 Yeah I'm pretty strongly on team "if this is such a miracle technology why are they expending all their effort trying to convince others to use it rather than building things that couldn't possibly be built without it?"
@mjg59 "we're getting all of this anyway whether we like it or not", sounds like slippery slop argument
@mjg59
On one hand, very much yes.
On the other, I just read an article about where we currently stand with climate change, and I don't think "where will senior programmers come from" is going to be that much of an issue.
@viq @mjg59 because we'll go extinct before that becomes a problem?
@nicolas17
Or if not that, wars over resources and the struggle to keep enough farming going will require too many hands applied directly to those problems for people to be spending time at keyboards.
@mjg59
@mjg59 Musk tweeted that this is the Year of the Singularity, so I don't think you have to worry about the senior developers part either... 🤓
@mjg59 actually, if AI gets that good, why will the public ever need to give money to any commercial software company? You'd just ask your AI to build you a solution and throw it away after using it. In effect, the software industry is racing towards doing itself out of business.
@jhaar @mjg59 Then you're just paying the AI company for your bespoke program rather than a software company. You don't really think they're going to let you vibe code a whole app for cheaper than buying an app with the same functionality, right?
@JessTheUnstill @mjg59 ...but AI is going to get sooooo good that then I'll slap the whole thing on my watch as a local model and it will auto-evolve itself without external help. 😜
@mjg59 every mid-to-large FOSS project is seeing their "Good First Issue"s getting sniped by 20 LLM bots. Those exist to feed new contributors into dedicated ones. If you cut the bottom rungs off the ladder, how is anyone going to be able to get to the top?
@greg yeah, exactly. I've helped people turn into senior devs, I don't know how to turn an LLM into one - embodying good taste is a different problem to generating code that meets a functional description

@mjg59 @greg I agree wholeheartedly with the junior pipeline problem, though I suspect that we end up with junior devs who are good at piloting the models, and learn to debug even hard problems within that context.

We didn't stop being able to use computers when people stopped learning assembly or C; I hope we have a similar outcome here.

@mjg59 @greg good taste though - I've been arguing with people that we need to teach the models this, but the counterargument is "does it matter if humans don't read the code?"
@PaulM
it matters hugely. The potential for truly horrible infosec breaches is large.
@mjg59 @greg
@PaulM @mjg59 @greg But we could stop learning assembly because compilers are deterministic and don’t hallucinate. That’s not the case with LLMs.
@chris_evelyn @mjg59 @greg to be pedantic, computers are only sorta kinda mostly deterministic if you squint at them just right. From the perspective of any given program executing in a modern operating system, there's a whole lot happening around it which is completely opaque, even if execution mostly proceeds in an apparently sequential fashion.

@PaulM @mjg59 @greg That argument is bullshit and I’m getting fucking tired of it.

How often did you have to check assembly output lately because a compiler did something different from what you expressed in your code?

@chris_evelyn @PaulM @greg I'm a kernel developer, this happens to me more than you'd think

@mjg59 @PaulM @greg See my other answer, I forgot that I'm replying to professional edge case handlers in this thread so had to dial it back to "normal" programming.

Out of curiosity: Do LLMs work well for kernel dev?

@chris_evelyn @PaulM @greg massively depends, a *lot* of the kernel is super boilerplate and it's largely fine at that, and then you reach the point where you're dealing with CPU errata and you're going to have a bad time. I wouldn't say no to it in general (and we know chunks of Linux are already LLM developed), but I'd have several concerns around its use in more specialised areas

@mjg59 @PaulM @greg Thanks!

and we know chunks of Linux are already LLM developed

Do you have a pointer to some examples handy? I'd be interested in the process and discussions around that.

Supporting kernel development with large language models (LWN.net)
@mjg59 @chris_evelyn @PaulM @greg Any examples of CPU errata being relevant, other than the obvious security holes?
@alwayscurious @chris_evelyn @PaulM @greg "You must ensure that certain things have weird alignment otherwise the CPU will fault or return garbage" is a surprisingly common thing for CPUs to insist on and also typically not present outside kernels, so there's very little training data that embodies it
@mjg59 @chris_evelyn @PaulM @greg Is this found on the big CPUs or mostly limited to embedded?
@alwayscurious @chris_evelyn @PaulM @greg Less common on big CPUs these days, but it's the kind of thing that early Ultrasparc and 90s MIPS had a bunch of
@alwayscurious @mjg59 @chris_evelyn @PaulM @greg
there's also performance related errata, like https://www.intel.com/content/www/us/en/support/articles/000055650/processors.html though that needs to be worked around in the compiler/assembler and in the kernel mostly only affects things that manipulate code (live patching, JIT, etc.).
@mjg59 @chris_evelyn @PaulM @greg Can the boilerplate be replaced with a (non-LLM) code generator?
@alwayscurious @chris_evelyn @PaulM @greg there's huge piles of "What does driver initialisation look like" that could be replaced with macros except that would reduce readability