My models don't have reasoning ability in llama-b9543 server but have in llama-cli
My most recent llama.cpp build is b9543 [https://github.com/ggml-org/llama.cpp/releases/tag/b9543], and today I noticed that my local models don’t reason in the server web interface. Before that I was using b8996 [https://github.com/ggml-org/llama.cpp/releases/tag/b8996], where they did reason. The web interface shows no reasoning at all, yet the same models do reason in llama-cli. I tried --reasoning on, --reasoning-budget -1, and --chat-template-kwargs '{"enable_thinking":true}'. I didn’t need these flags before, as reasoning worked fine in b8996.
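One way to narrow this down is to check whether the server returns reasoning at all, independent of what the web interface renders. The sketch below is only an illustration: it assumes the server's OpenAI-compatible chat response carries the reasoning either in a separate `reasoning_content` field or inline between `<think>` tags, and both depend on the build, the flags, and the model's chat template. The helper name `split_reasoning` is made up for this example.

```python
import re

# Assumption: the model wraps its reasoning in <think>...</think> when the
# server does not split it out into a separate field.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message):
    """Return (reasoning, answer) from an OpenAI-style chat message dict."""
    content = message.get("content") or ""
    # Assumption: a separate "reasoning_content" field is present when the
    # server is configured to extract reasoning from the output.
    reasoning = message.get("reasoning_content")
    if reasoning:
        return reasoning.strip(), content.strip()
    match = THINK_RE.search(content)
    if match:
        answer = THINK_RE.sub("", content, count=1).strip()
        return match.group(1).strip(), answer
    # No reasoning found in either place.
    return "", content.strip()
```

To use it, POST a prompt to the server's /v1/chat/completions endpoint and pass `choices[0]["message"]` from the JSON response to `split_reasoning`. If the reasoning part comes back empty under b9543 but not under b8996, the difference is on the server side rather than in the web interface.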
I Put a Datacenter GPU in My Gaming PC for £200
Google just released "QAT" versions of their Gemma 4 models. QAT stands for "yeah we know you people don't have enough VRAM so we trained the model knowing you'd quantize it down to 4 bits anyway", and apparently that makes a 4-bit QAT model perform similarly to a model quantized to 8 bits with previous methods.
This is a game-changer for running LLMs locally. As a first try I'm running unsloth's version of the 12b model released yesterday, and _without_ quantizing the KV-cache and with a >128000-token context it's not even filling up my 16GB VRAM. Prompt processing runs at >2000 t/s and inference at >40 t/s.
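For a rough sense of why 4 bits matters on a 16GB card, here is a back-of-the-envelope estimate of the weight memory alone. It assumes a nominal 12 billion parameters (taken from the "12b" name) and a uniform bit width, and it ignores quantization scales, any tensors kept at higher precision, the KV-cache, and activations, so real files and real VRAM use will differ.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight memory in GiB for a uniformly quantized model."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 12e9  # assumption: nominal parameter count implied by "12b"

four_bit = weight_gib(N_PARAMS, 4)
eight_bit = weight_gib(N_PARAMS, 8)

print(f"4-bit weights: ~{four_bit:.1f} GiB")
print(f"8-bit weights: ~{eight_bit:.1f} GiB")
```

Under these assumptions the 4-bit weights take about half the space of the 8-bit ones, which is what leaves room for an unquantized KV-cache at long context.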
Gemma4 12b released with "unified" approach to multi-modality

From the model card, sounds interesting: The “Unified” in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM’s embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass. The benchmarks put it closer to the 26b MoE than to the E variants of the Gemma 4 series, but mostly below Qwen3.5 9b. [https://lemmy.ml/pictrs/image/87ca2774-86eb-4160-b29f-dd74e9ce4810.png] Looking forward to giving it a shot.
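To make "projecting raw image patches through a lightweight linear layer" concrete, here is a toy sketch in plain Python. It is not Gemma's actual implementation: the patch size, the embedding width, the random weights, and the function names are all made up for illustration. It only shows the general idea of turning each patch into one embedding vector with a single linear map and no separate encoder.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return patches


def linear(vec, weight, bias):
    """Apply y = W x + b, with W given as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def embed_image(image, patch, weight, bias):
    """Map every patch to one embedding vector, i.e. one input token."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]


random.seed(0)
PATCH, D_MODEL = 2, 8  # hypothetical toy sizes
weight = [[random.gauss(0, 0.02) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4 x 4 toy image
tokens = embed_image(image, PATCH, weight, bias)
print(len(tokens), "tokens of width", len(tokens[0]))
```

In a real model the resulting vectors would be placed alongside the text token embeddings and fed to the same decoder, which is what lets the whole model be fine-tuned in one pass.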
I Tried This Open Source ChatGPT Alternative [Jan AI] on Linux, But Went Back to Ollama
Infinity-Parser2 - Multimodal Document Parser
Your best local LLM for low-VRAM (6GB)?
Hey guys, what’s currently the best LLM for low-VRAM machines with only 6 GB VRAM? I’ve got 32 GB RAM as well. I’m experimenting a little with SillyTavern and I’m curious which model gets the most out of my setup. It should be multilingual and suitable for “casual chatting”. I know I will probably not get very far with this, but I’m still interested in how far we’ve already come. (Using KoboldCPP if that matters). ~sp3ctre
DystopiaBench - AI Ethics Stress Test