GLM-5.2 vs Claude Opus | Tech Stackups
GLM-5.2 vs Claude Opus | Tech Stackups
LunarWing — self-hostable AI agent framework built in Rust, focused on privacy and real secret management
Qwen 3.6 27B running at 46 tok/s on an RX 9070 XT (llama.cpp + MTP Speculative Decoding is basically magic)
Just got my hands on a new AMD Radeon RX 9070 XT (16GB) and wanted to share some inference numbers. I’ve been messing around with llama.cpp via their official ROCm Docker image, testing out Qwen 3.6 27B (Omnimerge-v4) in IQ3_M. Honestly, the performance you can squeeze out of a 27B model on a 16GB consumer card right now is blowing my mind. Here’s the breakdown: The Setup - GPU: AMD Radeon RX 9070 XT (RDNA4 / gfx1201) - 16 GB VRAM - CPU: AMD Ryzen 9 9950X3D - OS/Backend: Linux via Docker using ghcr.io/ggml-org/llama.cpp:server-rocm [http://ghcr.io/ggml-org/llama.cpp:server-rocm]. (Props to the devs, it natively supports RDNA4 gfx1201 out of the box!) - Model: Qwen3.6-27B-Omnimerge-v4-IQ3_M.gguf (~13 GB) - Context: 16k Tweaks: - Set -np 1 since I’m just running it as a single-user chatbot in Open WebUI. - Slapped on 8-bit KV cache (–cache-type-k q8_0 --cache-type-v q8_0) to save about 50% VRAM. - Enabled MTP Speculative Decoding (–spec-type draft-mtp). The Numbers (512-token test) - Prompt Processing (TTFT): 549.27 tok/s (1220 tokens evaluated in ~2.2s). Latency: 1.82 ms/token. - Text Gen: 46.06 tok/s (512 tokens generated in ~11.1s). Latency: 21.71 ms/token. - MTP Stats: Draft acceptance rate was super high at 62.7% (333 drafts accepted / 531 generated). Aggregate speed (including prompt eval) hit roughly 48.97 tok/s. Memory Footprint - VRAM: Sitting at 14.46 GB out of 16 GB. This leaves about 1.5 GB of breathing room, which has been totally stable with zero OOM crashes so far. - System RAM: ~4.3 GB (mostly the host-side prompt cache helping speed up subsequent turns). RDNA4 is ready: The latest ROCm images and llama.cpp HIP libs support the 9070 XT natively. Didn’t even need to mess with HSA_OVERRIDE_GFX_VERSION. KV Cache quantization is required: Pushing the KV cache to q8_0 is the only reason a 16k context window fits on a 16GB card alongside a 27B model. If anyone with a 16GB card is looking for the sweet spot, this has to be one of the best price-to-performance setups available right now. Let me know if you want my docker-compose.yml or run scripts… For more info, here are my logs ::: spoiler Logs ROCm GPU and System Info: device_info: - ROCm0 : AMD Radeon RX 9070 XT (16304 MiB, 16200 MiB free) - ROCm1 : AMD Ryzen 9 9950X3D 16-Core Processor (15589 MiB, 25012 MiB free) - CPU : AMD Ryzen 9 9950X3D 16-Core Processor (31178 MiB, 31178 MiB free) I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 Context Initialization (16k single-slot & 8-bit KV Cache): I srv load_model: [spec] estimated memory usage of MTP context is 178.02 MiB W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized I srv load_model: initializing slots, n_slots = 1 I slot load_model: id 0 | task -1 | new slot, n_ctx = 16384 I srv load_model: prompt cache is enabled, size limit: 8192 MiB I srv llama_server: model loaded I srv llama_server: server is listening on http://0.0.0.0:8080/ Benchmark Request Timing Statistics (Generating 512 tokens): I slot launch_slot_: id 0 | task 3454 | processing task, is_child = 0 I slot print_timing: id 0 | task 3454 | n_decoded = 103, tg = 37.93 t/s I slot print_timing: id 0 | task 3454 | n_decoded = 247, tg = 43.16 t/s I slot print_timing: id 0 | task 3454 | n_decoded = 392, tg = 44.87 t/s I slot print_timing: id 0 | task 3454 | prompt eval time = 195.88 ms / 42 tokens ( 4.66 ms per token, 214.41 tokens per second) I slot print_timing: id 0 | task 3454 | eval time = 11116.24 ms / 512 tokens ( 21.71 ms per token, 46.06 tokens per second) I slot print_timing: id 0 | task 3454 | total time = 11312.12 ms / 554 tokens I slot print_timing: id 0 | task 3454 | graphs reused = 3560 I slot print_timing: id 0 | task 3454 | draft acceptance = 0.62712 ( 333 accepted / 531 generated) I statistics draft-mtp: #calls(b,g,a) = 6 3607 3607, #gen drafts = 3607, #acc drafts = 2824, #gen tokens = 10820, #acc tokens = 6605, dur(b,g,a) = 0.008, 31925.089, 2.740 ms I slot release: id 0 | task 3454 | stop processing: n_tokens = 553, truncated = 0 ___ :::
My models don't have reasoning ability in llama-b9543 server but have in llama-cli
My most recent llama cpp build is b9543 [https://github.com/ggml-org/llama.cpp/releases/tag/b9543] and today I notice that my local models don’t reason in the server web interface. Prior to that, I was using b8996 [https://github.com/ggml-org/llama.cpp/releases/tag/b8996] where they do reason. In the web interface, I see no reasoning being shown. However, models do reason in llama-cli. I tried with --reasoning on, --reasoning-budget -1, --chat-template-kwargs '{"enable_thinking":true'. I didn’t use these flags before as reasoning was working fine in b8996.
I Put a Datacenter GPU in My Gaming PC for £200
Google just released "QAT" versions of their Gemma 4 models. QAT stands for "yeah we know you people don't have enough VRAM so we trained the model knowing you'd quantize it down to 4 bits anyway" and apparently that makes a 4-bit QAT-model perform similar to an 8-bit quantized with previous methods.
This is a game-changer for running LLMs locally. As a first try I'm running unsloth's version of the 12b model released yesterday, and _without_ quantizing the KV-cache and with >128000 byte context it's not even filling up my 16GB VRAM. Prompt processing > 2000t/s and inference at >40 t/s.
Gemma4 12b released with "unified" approach to multi-modality

From the model card, sounds interesting: The “Unified” in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM’s embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass. The benchmarks put it closer to the 26b MoE than to the E variants of the Gemma4 series, but mostly below Qwen3.5 9b. [https://lemmy.ml/pictrs/image/87ca2774-86eb-4160-b29f-dd74e9ce4810.png] Looking forward to giving it a shot.
I Tried This Open Source ChatGPT Alternative [Jan AI] on Linux, But Went Back to Ollama