Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

https://ai.georgeliu.com/p/running-google-gemma-4-locally-with

Running Google Gemma 4 Locally With LM Studio’s New Headless CLI & Claude Code

LM Studio 0.4.0 introduced llmster and the lms CLI. Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.

George Liu
So wait what is the interaction between Gemma and Claude?

lm studio offers an Anthropic compatible local endpoint, so you can point Claude code at it and it'll use your local model for it's requests, however, I've had a lot of problems with LM Studio and Claude code losing it's place. It'll think for awhile, come up with a plan, start to do it and then just halt in the middle. I'll ask it to continue and it'll do a small change and get stuck again.

Using ollama's api doesn't have the same issue, so I've stuck to using ollama for local development work.

Claude Code is fairly notoriously token inefficient as far as coding agent/harnesses go (i come from aider pre-CC). It's only viable because the Max subscriptions give you approximately unlimited token budget, which resets in a few hours even if you hit the limit. But this also only works because cloud models have massive token windows (1M tokens on opus right now) which is a bit difficult to make happen locally with the VRAM needed.

And if you somehow managed to open up a big enough VRAM playground, the open weights models are not quite as good at wrangling such large context windows (even opus is hardly capable) without basically getting confused about what they were doing before they finish parsing it.

Can't you use Claude caveman mode?

https://github.com/JuliusBrussee/caveman

GitHub - JuliusBrussee/caveman: 🪨 why use many token when few token do trick — Claude Code skill that cuts 75% of tokens by talking like caveman

🪨 why use many token when few token do trick — Claude Code skill that cuts 75% of tokens by talking like caveman - JuliusBrussee/caveman

GitHub