Mastodawn

lm studio offers an Anthropic compatible local endpoint, so you can point Claude code at it and it'll use your local model for it's requests, however, I've had a lot of problems with LM Studio and Claude code losing it's place. It'll think for awhile, come up with a plan, start to do it and then just halt in the middle. I'll ask it to continue and it'll do a small change and get stuck again.

Using ollama's api doesn't have the same issue, so I've stuck to using ollama for local development work.

Show thread

keerthiko 3d ago

Claude Code is fairly notoriously token inefficient as far as coding agent/harnesses go (i come from aider pre-CC). It's only viable because the Max subscriptions give you approximately unlimited token budget, which resets in a few hours even if you hit the limit. But this also only works because cloud models have massive token windows (1M tokens on opus right now) which is a bit difficult to make happen locally with the VRAM needed.

And if you somehow managed to open up a big enough VRAM playground, the open weights models are not quite as good at wrangling such large context windows (even opus is hardly capable) without basically getting confused about what they were doing before they finish parsing it.

Show thread

storus 3d ago

Can't you use Claude caveman mode?

https://github.com/JuliusBrussee/caveman

GitHub - JuliusBrussee/caveman: 🪨 why use many token when few token do trick — Claude Code skill that cuts 75% of tokens by talking like caveman

🪨 why use many token when few token do trick — Claude Code skill that cuts 75% of tokens by talking like caveman - JuliusBrussee/caveman

GitHub

Show thread

unsnap_biceps 3d ago

I use CC at work, so I haven't explored other options. Is there a better one to use locally? I presumed they were all going to be pretty similar.

Show thread

jaggederest 2d ago

If you want to experiment with same-harness-different-models Opencode is classically the one to use. After their recent kerfluffle with Anthropic you'll have to use API pricing for opus/sonnet/haiku which makes it kind of a non-starter, but it lets you swap out any number of cloud or local models using e.g. ollama or z.ai or whatever backend provider you like.

I'd rate their coding agent harness as slightly to significantly less capable than claude code, but it also plays better with alternate models.

Show thread

mbesto 2d ago

I don't get why I would use Claude Code when OpenCode, Cursor, Zed, etc. all exist, are "free" and work with virtually any llm. Seems like a weird use case unless I'm missing something.

Show thread

Someone1234 3d ago

Using Claude Code seems like a popular frontend currently, I wonder how long until Anthropic releases an update to make it a little to a lot less turn-key? They've been very clear that they aren't exactly champions of this stuff being used outside of very specific ways.

Show thread

moomin 3d ago

Right now it suits them down to the ground. You pay for the product and you don’t cost their servers anything.

Show thread

phainopepla2 3d ago

You don't pay anything to use Claude Code as a front end to non-Anthropic models

Show thread

quinnjh 3d ago

so no subscription is needed?

Show thread

chvid 3d ago

Is it not about the same as using OpenCode?

And is running a local model with Claude Code actually usable for any practical work compared to the hosted Anthropic models?

Show thread

wyre 3d ago

I think CC is popular because they are catering to the common denominator programmer and are going to continue to do that, not because CC is particularly turn-key.

Show thread

nerdix 3d ago

I don't think there is any incentive to do so right now because the open models aren't as good. The vast majority of businesses are going to just pay the extra cost for access to a frontier model. The model is what gives them a competitive advantage, not the harness. The harness is a lot easier to replicate than Opus.

There are benefits too. Some developers might learn to use Claude Code outside of work with cheaper models and then advocate for using Claude Code at work (where their companies will just buy access from Anthropic, Bedrock, etc). Similar to how free ESXi licenses for personal use helped infrastructure folks gain skills with that product which created a healthy supply of labor and VMware evangelists that were eager to spread the gospel. Anthropic can't just give away access to Claude models because of cost so there is use in allowing alternative ways for developers to learn how to use Claude Code and develop a workflow with it.

Show thread

martinald 3d ago

Just FYI, MoE doesn't really save (V)RAM. You still need all weights loaded in memory, it just means you consult less per forward pass. So it improves tok/s but not vram usage.

Show thread

IceWreck 3d ago

It does if you use an inference engine where you can offload some of the experts from VRAM to CPU RAM.
That means I can fit a 35 billion param MoE in let's say 12 GB VRAM GPU + 16 gigs of memory.

Show thread

Yukonv 2d ago

With that you are taking a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD Read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated different experts need to be consulted resulting in a lot of I/O churn. So while feasible it's only great for batch jobs not interactive usage.

Show thread

charcircuit 2d ago

You never need to have all weights in memory. You can swap them in from RAM, disk, the network, etc. MOE reduces the amount of data that will need to be swapped in for the next forward pass.