Qwen 3.6 27B running at 46 tok/s on an RX 9070 XT (llama.cpp + MTP Speculative Decoding is basically magic)

https://ani.social/post/32999311

Just got my hands on a new AMD Radeon RX 9070 XT (16GB) and wanted to share some inference numbers. I’ve been messing around with llama.cpp via their official ROCm Docker image, testing out Qwen 3.6 27B (Omnimerge-v4) in IQ3_M. Honestly, the performance you can squeeze out of a 27B model on a 16GB consumer card right now is blowing my mind. Here’s the breakdown:

The Setup
- GPU: AMD Radeon RX 9070 XT (RDNA4 / gfx1201), 16 GB VRAM
- CPU: AMD Ryzen 9 9950X3D
- OS/Backend: Linux via Docker using ghcr.io/ggml-org/llama.cpp:server-rocm. (Props to the devs, it natively supports RDNA4 gfx1201 out of the box!)
- Model: Qwen3.6-27B-Omnimerge-v4-IQ3_M.gguf (~13 GB)
- Context: 16k

Tweaks
- Set -np 1 since I’m just running it as a single-user chatbot in Open WebUI.
- Slapped on 8-bit KV cache (--cache-type-k q8_0 --cache-type-v q8_0), which cuts the KV cache’s VRAM use roughly in half versus the default f16.
- Enabled MTP Speculative Decoding (--spec-type draft-mtp).

The Numbers (512-token test)
- Prompt Processing (TTFT): 549.27 tok/s (1220 tokens evaluated in ~2.2s). Latency: 1.82 ms/token.
- Text Gen: 46.06 tok/s (512 tokens generated in ~11.1s). Latency: 21.71 ms/token.
- MTP Stats: Draft acceptance rate was super high at 62.7% (333 drafts accepted / 531 generated). Aggregate speed (including prompt eval) hit roughly 48.97 tok/s.

Memory Footprint
- VRAM: Sitting at 14.46 GB out of 16 GB. This leaves about 1.5 GB of breathing room, which has been totally stable with zero OOM crashes so far.
- System RAM: ~4.3 GB (mostly the host-side prompt cache helping speed up subsequent turns).

RDNA4 is ready: The latest ROCm images and llama.cpp HIP libs support the 9070 XT natively. Didn’t even need to mess with HSA_OVERRIDE_GFX_VERSION.

KV Cache quantization is required: Pushing the KV cache to q8_0 is the only reason a 16k context window fits on a 16GB card alongside a 27B model.
If anyone with a 16GB card is looking for the sweet spot, this has to be one of the best price-to-performance setups available right now. Let me know if you want my docker-compose.yml or run scripts… For more info, here are my logs:

::: spoiler Logs
ROCm GPU and System Info:
device_info:
- ROCm0 : AMD Radeon RX 9070 XT (16304 MiB, 16200 MiB free)
- ROCm1 : AMD Ryzen 9 9950X3D 16-Core Processor (15589 MiB, 25012 MiB free)
- CPU : AMD Ryzen 9 9950X3D 16-Core Processor (31178 MiB, 31178 MiB free)
I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1

Context Initialization (16k single-slot & 8-bit KV Cache):
I srv load_model: [spec] estimated memory usage of MTP context is 178.02 MiB
W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
I srv load_model: initializing slots, n_slots = 1
I slot load_model: id 0 | task -1 | new slot, n_ctx = 16384
I srv load_model: prompt cache is enabled, size limit: 8192 MiB
I srv llama_server: model loaded
I srv llama_server: server is listening on http://0.0.0.0:8080/

Benchmark Request Timing Statistics (Generating 512 tokens):
I slot launch_slot_: id 0 | task 3454 | processing task, is_child = 0
I slot print_timing: id 0 | task 3454 | n_decoded = 103, tg = 37.93 t/s
I slot print_timing: id 0 | task 3454 | n_decoded = 247, tg = 43.16 t/s
I slot print_timing: id 0 | task 3454 | n_decoded = 392, tg = 44.87 t/s
I slot print_timing: id 0 | task 3454 | prompt eval time = 195.88 ms / 42 tokens ( 4.66 ms per token, 214.41 tokens per second)
I slot print_timing: id 0 | task 3454 | eval time = 11116.24 ms / 512 tokens ( 21.71 ms per token, 46.06 tokens per second)
I slot print_timing: id 0 | task 3454 | total time = 11312.12 ms / 554 tokens
I slot print_timing: id 0 | task 3454 | graphs reused = 3560
I slot print_timing: id 0 | task 3454 | draft acceptance = 0.62712 ( 333 accepted / 531 generated)
I statistics draft-mtp: #calls(b,g,a) = 6 3607 3607, #gen drafts = 3607, #acc drafts = 2824, #gen tokens = 10820, #acc tokens = 6605, dur(b,g,a) = 0.008, 31925.089, 2.740 ms
I slot release: id 0 | task 3454 | stop processing: n_tokens = 553, truncated = 0
___
:::
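For anyone who wants to sanity-check the headline numbers, here is a minimal Python sketch that recomputes them from the timing lines in the log. The regexes and helper names (`parse_timings`, `tokens_per_second`) are my own and only assume the log format shown in the post; they are not part of llama.cpp.

```python
import re

# Timing lines copied verbatim from the llama-server log in the post.
LOG = """\
I slot print_timing: id 0 | task 3454 | prompt eval time = 195.88 ms / 42 tokens ( 4.66 ms per token, 214.41 tokens per second)
I slot print_timing: id 0 | task 3454 | eval time = 11116.24 ms / 512 tokens ( 21.71 ms per token, 46.06 tokens per second)
I slot print_timing: id 0 | task 3454 | total time = 11312.12 ms / 554 tokens
I slot print_timing: id 0 | task 3454 | draft acceptance = 0.62712 ( 333 accepted / 531 generated)
"""

# Hypothetical patterns written for this log format only.
TIMING = re.compile(r"(prompt eval|eval|total) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")
DRAFT = re.compile(
    r"draft acceptance\s*=\s*[\d.]+\s*\(\s*(\d+) accepted\s*/\s*(\d+) generated\)"
)


def parse_timings(log):
    """Map each timing kind to (milliseconds, token count)."""
    return {kind: (float(ms), int(tokens)) for kind, ms, tokens in TIMING.findall(log)}


def tokens_per_second(ms, tokens):
    """Throughput is tokens divided by elapsed seconds."""
    return tokens / (ms / 1000.0)


timings = parse_timings(LOG)
for kind, (ms, tokens) in timings.items():
    print(f"{kind:>11}: {tokens_per_second(ms, tokens):7.2f} tok/s")

# Acceptance rate is accepted draft tokens over generated draft tokens.
accepted, generated = map(int, DRAFT.search(LOG).groups())
acceptance = accepted / generated
print(f"draft acceptance: {acceptance:.1%}")
```

The "total" row is where the roughly 48.97 tok/s aggregate figure comes from: 554 tokens (42 prompt + 512 generated) over 11.31 s. Note that this logged request evaluated only 42 prompt tokens, so its prompt-eval rate differs from the 1220-token figure quoted in the post.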

P.S. A 16k context window is too small for agentic coding. I’m trying to increase it to see how far I can push it…

Try q8_0 + q5_1 cache. The V cache is much less sensitive to quantization.
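To put rough numbers on that suggestion, here is a small Python sketch of how much each K/V cache type combination costs per cached value. The block sizes are ggml's layouts as I understand them (q8_0 packs 32 values into 34 bytes, q5_1 packs 32 values into 24 bytes). The model dimensions in the usage part are placeholders picked for round numbers, not Qwen 3.6 27B's real attention configuration, so only the ratios should be read as meaningful.

```python
# (bytes per block, values per block) for each cache type.
BLOCK = {
    "f16": (2, 1),      # plain half precision
    "q8_0": (34, 32),   # 32 int8 values + one fp16 scale
    "q5_1": (24, 32),   # 32 5-bit values + fp16 scale + fp16 min
}


def bytes_per_value(cache_type):
    block_bytes, block_values = BLOCK[cache_type]
    return block_bytes / block_values


def kv_cache_mib(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate KV cache size: one K and one V value per position, layer, head, dim."""
    values = n_ctx * n_layers * n_kv_heads * head_dim
    total_bytes = values * (bytes_per_value(type_k) + bytes_per_value(type_v))
    return total_bytes / 2**20


# HYPOTHETICAL dimensions, chosen only to illustrate the ratios.
dims = dict(n_ctx=16384, n_layers=16, n_kv_heads=4, head_dim=256)

baseline = kv_cache_mib(**dims, type_k="f16", type_v="f16")
for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q5_1")]:
    size = kv_cache_mib(**dims, type_k=type_k, type_v=type_v)
    print(f"K={type_k:5} V={type_v:5} {size:7.1f} MiB ({size / baseline:.1%} of f16)")
```

Under these assumptions, q8_0 for both caches lands at about 53% of the f16 size, and switching only the V cache to q5_1 brings it to about 45%, so the extra saving over q8_0/q8_0 is roughly 15%.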

Also, use that IGP on your 9950X3D! Plug your monitor into the motherboard, and free up vram on the 9070, so you can use every last megabyte.

> q8_0 + q5_1

I just tried it, but it silently falls back to the CPU, giving me only 87 tok/s for prompt processing. Maybe q5_1 isn’t fully supported on AMD? I can’t find anything relevant :(
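To show what that fallback costs in practice, here is a quick back-of-the-envelope sketch in Python. It assumes the prompt-processing rate stays constant over the whole prompt, which is a simplification, and uses the two rates reported in this thread with a full 16k-token prompt.

```python
def prefill_seconds(prompt_tokens, tok_per_s):
    """Time to process a prompt at a constant prompt-processing rate."""
    return prompt_tokens / tok_per_s


N_CTX = 16384          # the 16k context from the post
GPU_RATE = 549.27      # tok/s reported with q8_0/q8_0 on the GPU
FALLBACK_RATE = 87.0   # tok/s reported with q8_0/q5_1 and the CPU fallback

gpu_time = prefill_seconds(N_CTX, GPU_RATE)
fallback_time = prefill_seconds(N_CTX, FALLBACK_RATE)

print(f"GPU prefill:      {gpu_time:6.1f} s")
print(f"fallback prefill: {fallback_time:6.1f} s")
print(f"slowdown:         {fallback_time / gpu_time:.1f}x")
```

At these rates a full context goes from about half a minute to over three minutes of waiting before the first token.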

> Plug your monitor into the motherboard

My mobo has just one HDMI port and a USB4 port with DP output. I’d need to buy a new cable… I’ll try it once I get one.

I dunno where you got your llama.cpp binary from, but all the fa kernel “combinations” need to be compiled, and maybe q8_0/q5_1 isn’t compiled by default?

If you compile it yourself, there may be an “all_quants” flag or something similar you have to enable.

As an aside, be sure to enable the avx512 flags as well. Ryzen 9000 benefits from them quite a bit.

You are an absolute legend! You were 100% right.

I built llama.cpp from source just like you suggested, using the -DGGML_CUDA_FA_ALL_QUANTS=ON and AVX-512 flags. The difference is insane! The silent CPU fallback is completely gone. My prompt processing jumped from a slow 87 tok/s to 938.68 tok/s, and my CPU is now at 0% during prefill.

P.S. I was using the Docker image previously…

Thank you so much for the compile flag. You saved me a ton of time. Oh also, avx512 is “1” by default.
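For reference, a configure step along these lines could be sketched as below. Only `-DGGML_CUDA_FA_ALL_QUANTS=ON` comes from this thread. The HIP options (`-DGGML_HIP=ON`, `-DAMDGPU_TARGETS=gfx1201`), the build directory and the helper name are my assumptions based on llama.cpp's usual HIP build setup, and depending on the revision the target variable may be spelled `GPU_TARGETS` instead. The explicit AVX-512 switches are left off by default, since the post reports AVX-512 was already detected automatically.

```python
import shlex


def cmake_configure_args(build_dir="build", gpu_target="gfx1201",
                         all_fa_quants=True, force_avx512=False):
    """Assemble a cmake configure command for a HIP build (hypothetical helper)."""
    args = [
        "cmake", "-B", build_dir,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DGGML_HIP=ON",                   # assumption: HIP/ROCm backend switch
        f"-DAMDGPU_TARGETS={gpu_target}",  # assumption: may be GPU_TARGETS on newer trees
    ]
    if all_fa_quants:
        # From the thread: compile flash-attention kernels for all K/V quant pairs.
        args.append("-DGGML_CUDA_FA_ALL_QUANTS=ON")
    if force_avx512:
        # Assumed optional: the post says AVX512 was already "1" by default.
        args += [
            "-DGGML_AVX512=ON",
            "-DGGML_AVX512_VBMI=ON",
            "-DGGML_AVX512_VNNI=ON",
            "-DGGML_AVX512_BF16=ON",
        ]
    return args


# Print the command rather than running it, so it can be reviewed first.
print(shlex.join(cmake_configure_args()))
```

The printed line can be pasted into a shell from the llama.cpp checkout, or the list can be handed to `subprocess.run` once the flags have been checked against the CMakeLists.txt of the revision being built.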

I checked, and the CMAKE flag you want to enable Q8/Q5 is:

github.com/ggml-org/llama.cpp/…/CMakeLists.txt#L2…

And all these AVX512 ones:

github.com/ggml-org/llama.cpp/…/CMakeLists.txt#L1…

llama.cpp/ggml/CMakeLists.txt at 1fd6dfe9f3d4b69cce101d832339fbda2d14b056 · ggml-org/llama.cpp

This is what I was gonna mention. Even with a 24 GB 4090 I have no room for context. I run it as an 8-bit quant on a DGX Spark for that. It’s only 13 tok/s, but with 256k context.
I just commented about the Llama.cpp TurboQuant fork in this post, which would help. Also, the recipe for single card here github.com/noonghunna/club-3090 will get you 200k. If you have a 4090, there’s no reason you can’t do Qwen 3.6 27B with 200k at decent speeds.
GitHub - noonghunna/club-3090: Community recipes for serving LLMs on RTX 3090/4090/5090 CUDA gpus. Multi-engine (vLLM, llama.cpp, ik_llama) and model-agnostic. Currently shipping Qwen3.6-27B Qwen3.6 35B Gemma 4 26B Gemma 4 31B configs for 1× and 2× cards.
