There are a lot of other concerns, but since a lot of people on the fediverse are opposed to these tools, they might not be very familiar with where the tools currently are ability-wise. @mttaggart provides a good description of how they *are* capable of solving many problems you put in front of them... and that doesn't remove the other problems they generate or that are involved in their process.

The slop part isn't just the individual outputs, but the accumulation, and the effect on society itself.

Is that moving the goalposts? It may be. I think "slop" used to be easier to dismiss when it came to code because it was obviously bad. Now when it's bad, it's non-obviously bad, which is a problem in itself. And cognitive debt, deskilling, etc. don't get factored into the quality-of-output aspect.

But unfortunately, the immediate rewards these things deliver are going to make that hard for society to recognize.

@cwebber

It kind of feels like it's going to take something big happening in the press to get people to stop.

I was thinking an AI-caused Therac-25, but maybe a Copilot worm that wipes all Windows 11 computers might prompt some legislation outlawing AI code.

@alienghic @cwebber

The thing most likely to get people to stop is the end of the massive subsidies for its use that the VCs are currently pouring in.

Already firms are starting to panic a little about token use for things like Claude Code, and are putting limiters on their workers that really defeat the purpose of all of the "YOU MUST USE THIS OR BE FIRED" diktats. But operating indefinitely at those prices will bankrupt Anthropic soon.

So at some point the private equity love affair with everything AI will dry up (possibly because of an Iran-war-induced financial crisis), and at that point it's going to be "my org can spend $50k annually on my personal Claude tokens to make me 20% more productive . . . or it could just hire a junior dev?"

There's a chance they manage to optimize this, or get it to work using a lighter-weight model. But I think it's unlikely.
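
A back-of-envelope sketch of that trade-off. The salary figure here is purely an illustrative assumption of mine; only the $50k spend and the 20% boost come from above:

```python
# Back-of-envelope: is $50k/year of Claude tokens worth a 20% boost?
# The salary figure is an assumption; the other numbers are from the post above.
senior_dev_cost = 150_000     # assumed fully-loaded annual cost of one developer
productivity_gain = 0.20      # the claimed 20% productivity boost
token_spend = 50_000          # the $50k annual per-developer token spend

value_of_boost = senior_dev_cost * productivity_gain   # $30,000 of extra output
net = value_of_boost - token_spend                     # negative at these assumptions

print(f"Value of the boost: ${value_of_boost:,.0f}")
print(f"Net after token spend: ${net:,.0f}")
```

Under those (admittedly rough) assumptions, the spend only pencils out if the boost is much larger than 20% or the tokens get much cheaper.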

@MichaelTBacon @alienghic @cwebber I don't see it stopping in the near term. PE hasn't done a lot of real AI deals; I think that's blocked on a lack of proven playbooks. VCs are making bets but the actual end-user value is pretty unclear. Having studied this area and its trajectory quite a bit over the last year, I think the unit economics of API serving are already approximately sustainable, and the models and hardware designs continue to get cheaper for a given level of performance. ...
@MichaelTBacon @alienghic @cwebber ... Right now the big firms are loading up on cash and I think working hard to cut the cost of their subscription products (e.g. ChatGPT or Claude) to the point where they'll be able to run unsubsidized in the near future at something not far from the current output quality and pricing. Their priority appears to be to sell more seats at low cost (Claude Enterprise starts at $20 a seat) and hope that they can get entrenched before starting to ramp prices up.
@MichaelTBacon @alienghic @cwebber I haven't seen any hard data, but spending enough time in tech industry circles it seems to be working.

@mirth @alienghic @cwebber

So far, from what I've seen, any time one of the subscription AI places puts up its prices to something resembling actual operating costs (never mind paying back gigantic sunk capital costs), users have screamed and then bolted.

Honestly, doing the really heavy duty Claude Code stuff that's getting pushed now will easily run to $50k per developer at current costs. And no, I don't see that as something that enterprises will ultimately be willing to swallow. Nor do I see a path for them to get the GPU cycle burn down easily.

@MichaelTBacon @alienghic @cwebber That math sounds way off. Assuming a monthly usage of 5M tokens for day-to-day developer usage, at the current Claude API costs, and billing them all at the highest rate ($25 per M), that's $125 per month at current pricing. It's a long way from there to $50k, and surveying the trajectory over the last couple of years as well as models from some of the Chinese labs, it's pretty clear that the model size necessary to do these tasks is trending down.
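
For reference, a quick sketch of that arithmetic, plus the reverse direction; the token volume and rate are the assumptions from this post, not measured data:

```python
# Monthly API cost at the assumed usage and rate from the post above.
tokens_per_month = 5_000_000      # assumed 5M tokens/month of developer usage
price_per_million = 25.0          # everything billed at the top $25/M rate

monthly_cost = tokens_per_month / 1_000_000 * price_per_million
print(f"${monthly_cost:,.2f}/month")   # $125.00/month

# Reverse: monthly tokens needed to actually burn $50k/year at that rate.
annual_budget = 50_000
tokens_needed = annual_budget / 12 / price_per_million * 1_000_000
print(f"{tokens_needed / 1e6:,.0f}M tokens/month")   # ~167M tokens/month
```
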
@MichaelTBacon @alienghic @cwebber The other thing happening is there are many efforts to build special-purpose chips for these workloads, and some will eventually pan out. Big neural nets on GPUs are extremely wasteful in energy terms, and even though many people seem to think that approach is horribly wrong it's become "too big to fail" in a way that will encourage investment into new chips until something sticks.
@MichaelTBacon @alienghic @cwebber Combine a downward trend in average model complexity (by usage) and a downward trend in energy consumption (on new hardware) with a typical usage that currently costs perhaps $100-$1,000 at the high end, and I can easily see a world of $500/month/seat subscriptions without any structural changes. I'm not saying it's good or that I like it, but based on the best information I can find I don't think the "price explosion" scenario is plausible.

@mirth @alienghic @cwebber

What downward trend in average model complexity? What downward trend in energy consumption? They're both going up! Nobody can get the cost of inference to go down except by going with discount models like Deepseek, which are okay for spouting text but get nowhere near the code quality of something like Claude Code (and even with CC, as the OP link says, quality only holds up in certain languages and in certain situations, with lots and lots of guard rails).

Ed Zitron isn't everyone's cup of tea, but he's been watching the finances of this for a while and there's absolutely no sign of the burn rate slowing down or the cost of inference dropping.

https://www.wheresyoured.at/the-subprime-ai-crisis-is-here/

@mirth @alienghic @cwebber

Anthropic gets some credit for getting Claude Code to actual usability and decent code, if you spend enough time scolding and cajoling the model and manually forcing it through various code quality assurances. But they're not doing it on cheap models; they're doing it on the biggest, most expensive models, which require the biggest and most expensive GPUs. You can't get those results out of Deepseek or Ollama or any of the smaller, cheaper models. The code quality goes right back into the toilet, no matter what guard rails you put on it.

Given the horrific mess that is the Claude Code source code (see this megathread for a walk through the chaos fractal that is Claude Code https://neuromatch.social/@jonny/116324676116121930) it's possible that they could tighten the hell out of it and clean up some of the immense noise in it to get some efficiency. But then what does that say about Claude Code's code quality?

@mirth @alienghic @cwebber

As for the custom chips, I'm not sure how much more customized you can make a chip for ML models than what NVIDIA is cranking out, but at the very least here's what's going on with Microsoft's attempts to get Azure to work on smaller hardware. This is a really sobering read from a former MS system engineer.

Certainly, the potential for ARM chips to really change cloud computing, if someone can get the ultra-efficient ones to scale up, shouldn't be overlooked. And someone else who isn't Microsoft will probably figure it out (although AWS in particular is also staggering under its immense technical debt right now).

But there is just one titanic mess after another under the hoods of the major tech firms burning hundreds of billions of VC dollars right now.

https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion
@MichaelTBacon @alienghic @cwebber I read the Ed Zitron piece with some interest, but it's weakly sourced and light on actual analysis (though it has a lot of links). I think he is right that AI companies are spending eye-watering amounts of cash but misunderstands why, and what will likely happen when the game of musical chairs stops. Massive layoffs, other financial carnage, yes, but the insiders will still be rich and maintain control of the post-restructuring profits. Corrupt.
@MichaelTBacon @alienghic @cwebber Regarding models, size for a given output quality has been falling fast for the last couple years. Well documented. IMO a major threat to OpenAI et al is if on-device models pass the "good enough" line for casual users and destroy the unit economics of their subscription businesses. Apple and Google have privileged access to user data via their OSes, but not yet good enough models. They're motivated to try.
@MichaelTBacon @alienghic @cwebber The subscription products are moving to multi-model hybrids under the heading of "model routing" and "sub agents" and related schemes. I think primarily motivated by cost although I don't know if they admit it in public. This already exists in the current products in small ways but they'll likely push it a lot farther.
@MichaelTBacon @alienghic @cwebber Re: chips... That post about Azure is quite interesting but not related to AI accelerators. The core workload for these neural nets is almost totally unrelated to what a general-purpose CPU does, and for large models the majority of energy consumption is DRAM and interconnect (i.e. not the arithmetic). So the answer to how much more customized they can get is "a lot" if you frame the problem as how to avoid interconnect and DRAM usage. Wafer scale etc.
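
A back-of-envelope illustration of that point. The per-bit and per-FLOP energy constants are rough public ballparks and the model size is my assumption, so treat this as a sketch, not a measurement:

```python
# Why DRAM traffic, not arithmetic, dominates inference energy.
# All constants are rough ballpark figures, not measurements.
params = 70e9            # assumed dense model: 70B parameters
bytes_per_param = 2      # fp16 weights
dram_pj_per_bit = 10     # rough DRAM access energy (HBM lower, DDR higher)
flop_pj = 1              # rough energy per low-precision FLOP

# Unbatched decoding streams every weight from DRAM once per token.
dram_j = params * bytes_per_param * 8 * dram_pj_per_bit * 1e-12
flop_j = 2 * params * flop_pj * 1e-12    # ~2 FLOPs per parameter per token

print(f"DRAM energy/token: {dram_j:.1f} J")    # ~11 J
print(f"Math energy/token: {flop_j:.2f} J")    # ~0.14 J
```

Even if those constants are off by 2-3x in either direction, the memory traffic swamps the arithmetic, which is why designs that keep weights on-chip have so much headroom.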
@MichaelTBacon @alienghic @cwebber Between improved models, app designs, and hardware, it seems costs would come down. Do you have a source for operating costs being 10x greater than the current API pricing, or typical developer usage being 2B tokens a month? I have access to some developer usage data and I'd guess devs using every day average under 10M tokens/dev/month. I'm sure there are some wild outliers, but that is a typical problem many subscription businesses manage.

@mirth @MichaelTBacon @cwebber

I think some press person estimated the AI companies' costs were 10x their revenue. I don't remember which one though.

It's plausible there are improvements in both inference and training though.

I don't know how much progress there is on model collapse though.

And this does nothing about the feeling that the reason companies push AI is to make workers fear for their jobs and block unionization efforts. Also I think the main goal of current AI alignment is to make the AIs obedient to billionaires so they can have obedient secret police for their future kingdoms.

@alienghic @MichaelTBacon @cwebber I think that last bit is probably roughly correct and the biggest risk. From what I can tell the infrastructure cost to deliver a usable chatbot app or similar at or better than today's best models will fall towards zero. What are the consequences of that going to be? Chaos, but what else?

@mirth @alienghic @cwebber

What trends make you think the costs will fall toward zero? Who right now is delivering high quality products with lower cost models?

@MichaelTBacon @mirth @cwebber

One of my collaborators claims to be having good luck with an open-weight coding model running on his local NVIDIA GPUs.

He liked devstral for coding, and heard that qwen3 is supposed to be good as well.

@MichaelTBacon @mirth @cwebber

I'm not even entirely sure that having an LLM code up a matplotlib plot is all that different from copying a plot out of the matplotlib gallery.

I don't think the grad students really understood either version. Either path is copy and paste followed by trial and error.
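
For context, this is roughly the kind of boilerplate either path produces, nearly verbatim (a generic gallery-style example, not any particular student's code):

```python
# A minimal gallery-style matplotlib plot -- the sort of snippet that comes
# out almost identically whether an LLM writes it or it's pasted from the gallery.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
plt.show()
```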

@alienghic @mirth @cwebber

Yeah, I think it was XKCD like 10 years ago that said we're going to change the name of software development to "searching stackoverflow."

Now what the LLMs are doing is ingesting stackoverflow, then using half a kilowatt-hour to give a slightly neater answer.

Except that it's also slowly killing stackoverflow's engagement, so soon the LLMs' answers are going to start getting out of date . . .

@MichaelTBacon @alienghic @cwebber Re: Cost trend: None of the small models are close to good enough now but the consistent trend seems to be that strength at any given size is increasing steadily. If that continues we will see single PC size models with similar performance to today's flagship models eventually, and less cost to do most jobs. Accelerators also keep improving which generally means less energy and less operating cost.
@MichaelTBacon @alienghic @cwebber It's a bit like looking at the trajectory of broadband and batteries in 2005 and projecting forward that people would be streaming movies to their pockets for nearly free. The technology isn't there now but it seems like that's where it will go.

@mirth @alienghic @cwebber

That's the thing, though: I don't think even the smallest trend is there. It's not that "it's early yet, but it's pointing in the right direction." I haven't seen anything that suggests there's any downward trend at all, no matter how small.

To get anything like equivalent performance out of the discount models like Deepseek, you have to run multiple instances in parallel or run a bunch of agents along with it.

The big breakthroughs in capability in the last year or two have all been about ramping *up* the power usage, model size, and GPU capacity, either by using bigger token windows or adding secondary agents. That's kind of some cool engineering, but power consumption and operating costs are just going up.

The main advantage of the discount models is that they have a much lower cost of training, mostly because (we suspect) they were trained off of responses from the bigger models.

@MichaelTBacon @alienghic @cwebber That's not consistent with what I've seen. Using weight footprint as a rough guide, the capability at the 1B, 10B, 30B, etc sizes continue to increase, and current smaller models perform better than larger predecessors. True on the closed side as well: Today's Sonnet is stronger than last year's Sonnet _and_ Opus, today's GPT-5.4 mini is stronger than GPT-5 at release, etc. Current 30B tier open models are stronger than GPT-4 was.
@MichaelTBacon @alienghic @cwebber That's models. On the silicon side, ten years ago the bulk of inference ran on regular CPUs. Most of that moved to GPUs maybe five years ago, and now specialist chips are becoming more common. I know a little bit about chip design, and there's a lot of inefficiency that will get designed out of accelerators (reliance on DRAM is the big one, Cerebras and Taalas are the two notable current attempts).
@MichaelTBacon @alienghic @cwebber That's evidence of a downward trend in model size for any capability, and a downward trend in the cost to run a model of a given size. It seems likely that both continue for at least a few more years, and the consequence of that is Sonnet-level models available in pocket-size devices. How do we manage the fallout from that, and steer away from some of the worst downstream effects?

@mirth @alienghic @cwebber

I would be very interested in links showing what you say if you have them.

@MichaelTBacon @alienghic @cwebber This is a synthesis of what I've seen across years of doing compute related work plus reading the published benchmark data and some papers. In my opinion it is necessary to have a reasonable knowledge of at least what's been published before having specific opinions about what the technical trends are or are not. Here's a leaderboard for SWE Bench Verified, a reasonable gauge of one dimension of model strength:

https://llm-stats.com/benchmarks/swe-bench-verified
@MichaelTBacon @alienghic @cwebber If you want an experiential reference for compute efficiency you can try https://chatjimmy.ai. It's using an old model, but the speed can give you some idea of what's possible in the next generation of silicon.

@MichaelTBacon @alienghic @cwebber Taking a step back, my hope is that I can rouse some of the people who in my opinion have the right concerns, but don't feel urgency because they believe "AI doesn't really work" or "AI will go away when the investors stop subsidizing it". There's a mind-boggling amount of vapor and hype, and a lot of bad ideas, but some of this stuff is going to stick around and we're going to have to deal with it.

@mirth @alienghic

I've gotten into a similar fight myself, basically with a crowd of people who were utterly convinced that Claude Code couldn't create anything that actually worked.

As the above example clearly details, it can make code that works (it made Claude Code, among other things, for better or for worse) but whether that's a benchmark of usefulness is another story.

I'm pushing back on the trends you're noting, or at least pushing for clearer evidence that I can unpack myself, because I think it's a critical distinction. I'm still not seeing anything like the kinds of steep cost-decline curves that would be needed to make this stuff viable as a commercial success without the mega subsidies from VC/PE/PC sources. I *do* think it will be transformative, but not in any of the ways that are currently getting discussed.

But if there *is* some real trend towards real dramatic reductions in power/CPU usage per unit of useful output (whatever that is), it changes the story.

@MichaelTBacon @alienghic I'm not sure I understand; what about the current cost is unsustainable? If, to use some round numbers, $100k of server hardware that costs about $40k a year to operate can serve subscribers paying a total of $7k a month, then that's about a break-even investment (assuming a 10% discount rate). Of course the big AI companies are spending a lot of money, but if they fire everyone outside of core product operations, engineering, and some research, the math is different.
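
To spell out that break-even claim; the three-year hardware life is my added assumption, the other figures are the round numbers above:

```python
# Rough NPV of the server investment described above.
capex = 100_000               # server hardware
annual_opex = 40_000          # operating cost
annual_revenue = 7_000 * 12   # $7k/month of subscriptions = $84k/year
discount_rate = 0.10
years = 3                     # assumed useful life of the hardware

npv = -capex + sum(
    (annual_revenue - annual_opex) / (1 + discount_rate) ** t
    for t in range(1, years + 1)
)
print(f"NPV over {years} years: ${npv:,.0f}")   # ~$9k: about break-even
```
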

@mirth @alienghic

Can a $100k server cover $7k/month of subscribers, though? I'm not at all sure of that, even just for servicing requests, particularly since you're going to have very spiky and uneven utilization, and if those models are slow to respond, you're going to lose customers, fast.

Beyond that, a huge amount of the cost of the models is in the training. You can say, oh, sure, once the training is done they're good, but that means that your old model knows nothing about anything that's happened since it was trained.

I go back to Anthropic saying last August that its $200/month Max users were costing it $50k/month, each. 20-30% increases in performance are totally insufficient to scratch that problem. Some of it could be achieved by making Claude Code less of a shambolic train wreck of code, but that's not cheap either.

@MichaelTBacon @alienghic Do you have a source for that figure? I have access to some developer usage data for a customer and I don't see anything close to the token usage it would take to get there. Maybe $200 a month.

@mirth @alienghic It was in this part of the thread which got forked off:

"The Max tier tells a revealing story about the economics of flat-rate AI pricing. Internal data revealed that some $200/month Max users were costing Anthropic over $50,000 per month in compute. The tier was introduced specifically to manage this cost imbalance while retaining high-value users."

The point is that what shows up as API charges or app-use charges is nowhere near what it costs to run the model. I'm sure that $200/$50,000 ratio is out at the extreme (which is why I rounded way down to just a 10x subsidy instead of a 250x subsidy), but every "frontier" model seller is losing gigantic wads of cash on operating costs alone, and I don't see any corner getting turned towards bringing those costs under control.

https://social.coop/@MichaelTBacon/116357863323666495

@mirth @alienghic

Anthropic at the very least appears to have wrestled its training costs down to less than its annual revenue. So that's a start. Maybe they can get their operating costs down too. Their leadership also seems the least full of lying assholes among the big AI chasers. So that's a positive for them.

The GlassWing or whatever announcement from today will turn heads, for sure. And it's definitely a not-terrible use for language models. But that's a very important yet very niche application, not "code generation for everyone!"

https://nitter.net/ShanuMathew93/status/2041444857416126617#m

@MichaelTBacon @alienghic The quote about the $50k is conveying that Anthropic introduced the $200 plan to bring the revenue at that tier up toward the costs. The charts in that Twitter thread tell the same story: serving customer traffic is already roughly breakeven. Keep in mind that venture-backed companies use the balance sheet as a weapon; the moment the competitive pressure is off, they'll cut everything that doesn't carry its own weight (e.g. most research).
@MichaelTBacon @alienghic Training runs are necessary for better models; once models pass "good enough" for most tasks, the runs are only needed due to competitive pressure, and when a funding drought comes, that pressure gets released. Imagine a game of musical chairs where the "sitting down" part is who can cut costs the fastest. The management know this; they spend prolifically to force out competition while protecting unit economics, to be positioned to survive the downturn.
@MichaelTBacon @alienghic I've been trying to think of a happy ending to this story but I think it's going to be a tough ride.

@mirth @MichaelTBacon

That seems entirely plausible these days.