Mastodawn

Did you follow a guide for setting up Speculative Decoding? I haven’t gotten it to work very well personally. Does the smaller model run on your CPU memory and the larger one fully on GPU?

No, I actually don’t run a separate draft model on the CPU…

Since I’m using Qwen 3.6 27B, I’m utilizing MTP (Multi-Token Prediction) speculative decoding. This is built directly into the main model itself, so there is no extra “small model” to load or offload to the CPU. Everything runs entirely in GPU VRAM.

To get it working well in llama.cpp, you just need two things:

1. Make sure you are using a model variant specifically trained for it (look for files with -MTP- in the name on Hugging Face, like the Unsloth ones).
2. Add the flag --spec-type draft-mtp to your Docker startup command. That said, I now suggest compiling llama.cpp yourself for better KV caching.

That’s pretty much it! Because everything stays in VRAM and uses the main model’s native architecture, the draft acceptance rate is super high (around 64% for me) and it basically doubles the generation speed.
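
For anyone who wants to script it, here is a minimal sketch of how the flags quoted in this thread could be assembled into a llama-server command. The model path is a hypothetical placeholder, and `--spec-type draft-mtp` is copied from the thread as-is, so check it against the `--help` output of your own build before relying on it.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=16384, kv_type="q8_0"):
    """Assemble a llama-server argument list from the flags quoted in the thread."""
    return [
        "llama-server",
        "-m", model_path,            # path to the GGUF file
        "-c", str(ctx_size),         # context window (16k in the thread)
        "-np", "1",                  # single slot for a single-user chatbot
        "--cache-type-k", kv_type,   # quantized K cache
        "--cache-type-v", kv_type,   # quantized V cache
        "--spec-type", "draft-mtp",  # MTP speculative decoding, as quoted in the thread
    ]


if __name__ == "__main__":
    # Hypothetical path: point this at whatever -MTP- GGUF you downloaded.
    cmd = build_llama_server_cmd("/models/your-model-MTP.gguf")
    print(shlex.join(cmd))
```

Building the command as a list keeps quoting safe if the model path contains spaces, and the same list can be passed straight to `subprocess.run` or pasted into a docker-compose `command:` entry.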

cicadagen Jun 14

KV-cache for the MTP-model all the way down to q4_0.

From the other reply from u/[email protected], I did try q8_0/q5_1. It works pretty well.
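
To put rough numbers on those cache types, here is a small sketch comparing the storage cost per cached element against an f16 cache. The block sizes are the standard GGML layouts (32 elements per block plus scale metadata); it only estimates relative size and says nothing about quality or speed on a given backend.

```python
# (bytes per block, elements per block) for the GGML tensor types mentioned here.
BLOCK_LAYOUT = {
    "f16": (2, 1),     # plain half precision
    "q8_0": (34, 32),  # fp16 scale + 32 int8 values
    "q5_1": (24, 32),  # fp16 scale + fp16 min + 4 bytes of high bits + 16 bytes of nibbles
    "q4_0": (18, 32),  # fp16 scale + 16 bytes of nibbles
}


def bits_per_element(cache_type):
    """Average number of bits stored per cached value."""
    block_bytes, block_elems = BLOCK_LAYOUT[cache_type]
    return block_bytes * 8 / block_elems


def relative_kv_size(k_type, v_type):
    """KV cache size relative to an f16/f16 cache (K and V hold the same element count)."""
    baseline = 2 * bits_per_element("f16")
    return (bits_per_element(k_type) + bits_per_element(v_type)) / baseline


if __name__ == "__main__":
    for k, v in [("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q4_0", "q4_0")]:
        print(f"K={k} V={v}: {relative_kv_size(k, v):.1%} of the f16 cache size")
```

By this estimate q8_0 for both K and V lands at a bit over half the f16 size, which lines up with the "about 50%" saving quoted for q8_0 in the benchmark post.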

I’ll try Gemma-4-12B QAT when I get time :)

cicadagen Jun 14

You are an absolute legend! You were 100% right.

I built llama.cpp from source just like you suggested, using the -DGGML_CUDA_FA_ALL_QUANTS=ON and AVX-512 flags. The difference is insane! The silent CPU fallback is completely gone. My prompt processing jumped from a slow 87 tok/s to 938.68 tok/s, and my CPU is now at 0% during prefill.

P.S. I was using the Docker image previously…

Thank you so much for the compile flag. You saved me a ton of time. Oh also, avx512 is “1” by default.

32G :(

9950X3D.

How much RAM you got with that?

…And have you ever considered running an MoE, with experts on the CPU?

They can be shockingly fast, as the attention and dense layers are all still processed on the GPU. You could also run a much less aggressive quantization, especially for the dense layers (which are typically at Q6_K or Q8_0 for MoEs, while the experts in CPU RAM can take heavier quantization).

I have 32G rn, planning to upgrade it to 64G soon once they get cheaper again. Meanwhile, I did some testing

cicadagen Jun 14

q8_0 + q5_1

I just tried it, but it silently falls back to the CPU for processing, giving me 87 tok/s for prompt processing. Maybe q5_1 is not fully supported on AMD? I cannot find anything relevant :(

Plug your monitor into the motherboard

My mobo has just one HDMI port and one USB4 DP port. I would need to buy a new cable… I’ll try it once I get one.

cicadagen Jun 14

P.S. 16k context window is not good for agentic coding. I am trying to increase it and see what I can do…

cicadagen Jun 14

Qwen 3.6 27B running at 46 tok/s on an RX 9070 XT (llama.cpp + MTP Speculative Decoding is basically magic)

https://ani.social/post/32999311

Just got my hands on a new AMD Radeon RX 9070 XT (16GB) and wanted to share some inference numbers. I’ve been messing around with llama.cpp via their official ROCm Docker image, testing out Qwen 3.6 27B (Omnimerge-v4) in IQ3_M. Honestly, the performance you can squeeze out of a 27B model on a 16GB consumer card right now is blowing my mind. Here’s the breakdown:

The Setup
- GPU: AMD Radeon RX 9070 XT (RDNA4 / gfx1201) - 16 GB VRAM
- CPU: AMD Ryzen 9 9950X3D
- OS/Backend: Linux via Docker using ghcr.io/ggml-org/llama.cpp:server-rocm. (Props to the devs, it natively supports RDNA4 gfx1201 out of the box!)
- Model: Qwen3.6-27B-Omnimerge-v4-IQ3_M.gguf (~13 GB)
- Context: 16k

Tweaks:
- Set -np 1 since I’m just running it as a single-user chatbot in Open WebUI.
- Slapped on 8-bit KV cache (--cache-type-k q8_0 --cache-type-v q8_0) to save about 50% VRAM.
- Enabled MTP Speculative Decoding (--spec-type draft-mtp).

The Numbers (512-token test)
- Prompt Processing (TTFT): 549.27 tok/s (1220 tokens evaluated in ~2.2s). Latency: 1.82 ms/token.
- Text Gen: 46.06 tok/s (512 tokens generated in ~11.1s). Latency: 21.71 ms/token.
- MTP Stats: Draft acceptance rate was super high at 62.7% (333 drafts accepted / 531 generated). Aggregate speed (including prompt eval) hit roughly 48.97 tok/s.

Memory Footprint
- VRAM: Sitting at 14.46 GB out of 16 GB. This leaves about 1.5 GB of breathing room, which has been totally stable with zero OOM crashes so far.
- System RAM: ~4.3 GB (mostly the host-side prompt cache helping speed up subsequent turns).

RDNA4 is ready: The latest ROCm images and llama.cpp HIP libs support the 9070 XT natively. Didn’t even need to mess with HSA_OVERRIDE_GFX_VERSION.

KV Cache quantization is required: Pushing the KV cache to q8_0 is the only reason a 16k context window fits on a 16GB card alongside a 27B model.

If anyone with a 16GB card is looking for the sweet spot, this has to be one of the best price-to-performance setups available right now. Let me know if you want my docker-compose.yml or run scripts… For more info, here are my logs

::: spoiler Logs
ROCm GPU and System Info:
device_info:
- ROCm0 : AMD Radeon RX 9070 XT (16304 MiB, 16200 MiB free)
- ROCm1 : AMD Ryzen 9 9950X3D 16-Core Processor (15589 MiB, 25012 MiB free)
- CPU : AMD Ryzen 9 9950X3D 16-Core Processor (31178 MiB, 31178 MiB free)
I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1

Context Initialization (16k single-slot & 8-bit KV Cache):
I srv load_model: [spec] estimated memory usage of MTP context is 178.02 MiB
W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
I srv load_model: initializing slots, n_slots = 1
I slot load_model: id 0 | task -1 | new slot, n_ctx = 16384
I srv load_model: prompt cache is enabled, size limit: 8192 MiB
I srv llama_server: model loaded
I srv llama_server: server is listening on http://0.0.0.0:8080/

Benchmark Request Timing Statistics (Generating 512 tokens):
I slot launch_slot_: id 0 | task 3454 | processing task, is_child = 0
I slot print_timing: id 0 | task 3454 | n_decoded = 103, tg = 37.93 t/s
I slot print_timing: id 0 | task 3454 | n_decoded = 247, tg = 43.16 t/s
I slot print_timing: id 0 | task 3454 | n_decoded = 392, tg = 44.87 t/s
I slot print_timing: id 0 | task 3454 | prompt eval time = 195.88 ms / 42 tokens ( 4.66 ms per token, 214.41 tokens per second)
I slot print_timing: id 0 | task 3454 | eval time = 11116.24 ms / 512 tokens ( 21.71 ms per token, 46.06 tokens per second)
I slot print_timing: id 0 | task 3454 | total time = 11312.12 ms / 554 tokens
I slot print_timing: id 0 | task 3454 | graphs reused = 3560
I slot print_timing: id 0 | task 3454 | draft acceptance = 0.62712 ( 333 accepted / 531 generated)
I statistics draft-mtp: #calls(b,g,a) = 6 3607 3607, #gen drafts = 3607, #acc drafts = 2824, #gen tokens = 10820, #acc tokens = 6605, dur(b,g,a) = 0.008, 31925.089, 2.740 ms
I slot release: id 0 | task 3454 | stop processing: n_tokens = 553, truncated = 0
___
:::
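
If you want to sanity-check the headline numbers yourself, here is a small sketch that recomputes them from three lines copied out of the log above. The regular expressions are an assumption about the log format based only on those lines, so they may need adjusting for other llama.cpp versions.

```python
import re

# Three timing lines copied from the server log in the post above.
LOG = """\
prompt eval time = 195.88 ms / 42 tokens ( 4.66 ms per token, 214.41 tokens per second)
eval time = 11116.24 ms / 512 tokens ( 21.71 ms per token, 46.06 tokens per second)
draft acceptance = 0.62712 ( 333 accepted / 531 generated)
"""


def parse_timing(log):
    """Recompute throughput and draft acceptance from print_timing lines."""
    prompt = re.search(r"prompt eval time =\s*([\d.]+) ms /\s*(\d+) tokens", log)
    # The lookbehind skips the "prompt eval time" line and matches generation only.
    gen = re.search(r"(?<!prompt )eval time =\s*([\d.]+) ms /\s*(\d+) tokens", log)
    draft = re.search(r"\(\s*(\d+) accepted /\s*(\d+) generated\)", log)

    p_ms, p_tokens = float(prompt.group(1)), int(prompt.group(2))
    g_ms, g_tokens = float(gen.group(1)), int(gen.group(2))
    accepted, drafted = int(draft.group(1)), int(draft.group(2))

    return {
        "prompt_tps": p_tokens / (p_ms / 1000),
        "gen_tps": g_tokens / (g_ms / 1000),
        # Aggregate counts prompt and generated tokens over the total wall time.
        "aggregate_tps": (p_tokens + g_tokens) / ((p_ms + g_ms) / 1000),
        "acceptance": accepted / drafted,
    }


if __name__ == "__main__":
    for name, value in parse_timing(LOG).items():
        print(f"{name}: {value:.4f}")
```

The recomputed generation speed, aggregate speed, and acceptance rate agree with the 46.06 tok/s, 48.97 tok/s, and 62.7% figures in the post. Note that the log's prompt-eval line covers a 42-token request, which is a different request from the 1220-token prompt-processing figure quoted in the summary.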

cicadagen May 24

How do you assume that it was only 2?

cicadagen Jan 16

I love how it just says no, prolly no life left XD