Mastodawn

Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

https://ai.georgeliu.com/p/running-google-gemma-4-locally-with

Running Google Gemma 4 Locally With LM Studio’s New Headless CLI & Claude Code

LM Studio 0.4.0 introduced llmster and the lms CLI. Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.

George Liu

Show thread

martinald 3d ago

Just FYI, MoE doesn't really save (V)RAM. You still need all weights loaded in memory, it just means you consult less per forward pass. So it improves tok/s but not vram usage.

Show thread

IceWreck 3d ago

It does if you use an inference engine where you can offload some of the experts from VRAM to CPU RAM.
That means I can fit a 35 billion param MoE in let's say 12 GB VRAM GPU + 16 gigs of memory.

Show thread

Yukonv

With that you are taking a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD Read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated different experts need to be consulted resulting in a lot of I/O churn. So while feasible it's only great for batch jobs not interactive usage.