My models don't have reasoning ability in llama-b9543 server but have in llama-cli

https://thelemmy.club/post/50653847

My models don't have reasoning ability in llama-b9543 server but have in llama-cli - The Lemmy Club

My most recent llama cpp build is b9543 [https://github.com/ggml-org/llama.cpp/releases/tag/b9543] and today I notice that my local models don’t reason in the server web interface. Prior to that, I was using b8996 [https://github.com/ggml-org/llama.cpp/releases/tag/b8996] where they do reason. In the web interface, I see no reasoning being shown. However, models do reason in llama-cli. I tried with --reasoning on, --reasoning-budget -1, --chat-template-kwargs '{"enable_thinking":true'. I didn’t use these flags before as reasoning was working fine in b8996.

So with this and smart use of LLM hosting (vLLM / Ollama etc) I can run the likes of lucidRAG (which needs GPUs) for FREE (well energy prices)...
Local LLM FTW!
#localllama

I Put a Datacenter GPU in My Gaming PC for £200

https://lemy.lol/post/66741697

I Put a Datacenter GPU in My Gaming PC for £200 - lemy.lol

Lemmy

Google just released "QAT" versions of their Gemma 4 models. QAT stands for "yeah we know you people don't have enough VRAM so we trained the model knowing you'd quantize it down to 4 bits anyway" and apparently that makes a 4-bit QAT-model perform similar to an 8-bit quantized with previous methods.

This is a game-changer for running LLMs locally. As a first try I'm running unsloth's version of the 12b model released yesterday, and _without_ quantizing the KV-cache and with >128000 byte context it's not even filling up my 16GB VRAM. Prompt processing > 2000t/s and inference at >40 t/s.

https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

#LLM #LocalLLaMa #Gemma4

unsloth/gemma-4-12B-it-qat-GGUF · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Gemma4 12b released with "unified" approach to multi-modality

https://lemmy.ml/post/48296053

Gemma4 12b released with "unified" approach to multi-modality - Lemmy

From the model card, sounds interesting: The “Unified” in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM’s embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass. The benchmarks put it closer to the 26b MoE than to the E variants of the Gemma4 series, but mostly below Qwen3.5 9b. [https://lemmy.ml/pictrs/image/87ca2774-86eb-4160-b29f-dd74e9ce4810.png] Looking forward to giving it a shot.

I Tried This Open Source ChatGPT Alternative [Jan AI] on Linux, But Went Back to Ollama

https://lemy.lol/post/66509824

I Tried This Open Source ChatGPT Alternative [Jan AI] on Linux, But Went Back to Ollama - lemy.lol

Lemmy

Infinity-Parser2 - Multimodal Document Parser

https://sh.itjust.works/post/60950791

Gmail can read your emails and attachments to train its AI, unless you opt out - sh.itjust.works

Lemmy

Your best local LLM for low-VRAM (6GB)?

https://feddit.org/post/30200503

Your best local LLM for low-VRAM (6GB)? - feddit.org

Hey guys, What’s currently the best LLM for low-VRAM machines with only 6 GB VRAM? I’ve got 32GB RAM as well. I’m experimenting a little with SillyTavern and I’m curious which model gets the most out of my setup. Should be multilingual and suitable for “casual chatting”. I know I will probably not get very far with this, but I’m still interested in how far we’ve already come. (Using KoboldCPP if that matters). ~sp3ctre

DystopiaBench - AI Ethics Stress Test

https://aussie.zone/post/32813538

DystopiaBench - AI Ethics Stress Test - Aussie Zone

Lemmy

Claude? No. Cucumbers? Yes!

https://aussie.zone/post/32805644

Claude? No. Cucumbers? Yes! - Aussie Zone

More often than not, AI and LLM gets conflated in the public consciousness…and then gets mixed with “Agentic”, “SaaS” and other well…slop. So, here is a farmer in Japan, using a raspberry pi, to sort cucumbers. https://www.newsweek.com/artificial-intelligence-cucumber-farm-raspberry-pi-495289 [https://www.newsweek.com/artificial-intelligence-cucumber-farm-raspberry-pi-495289] PS: 2016 article. I expect by now the tractor is self driving and named Betty. If you have any other “dude does cool shit with a box of scraps in a cave”, I’m all EARS.md [http://EARS.md]