
Qwen 3.6 27B running at 46 tok/s on an RX 9070 XT (llama.cpp + MTP Speculative Decoding is basically magic) - sh.itjust.works
Just got my hands on a new AMD Radeon RX 9070 XT (16GB) and wanted to share some inference numbers. I've been messing around with llama.cpp via their official ROCm Docker image, testing out Qwen 3.6 27B (Omnimerge-v4) in IQ3_M. Honestly, the performance you can squeeze out of a 27B model on a 16GB consumer card right now is blowing my mind. Here's the breakdown:

The Setup
- GPU: AMD Radeon RX 9070 XT (RDNA4 / gfx1201) - 16 GB VRAM
- CPU: AMD Ryzen 9 9950X3D
- OS/Backend: Linux via Docker using ghcr.io/ggml-org/llama.cpp:server-rocm. (Props to the devs, it natively supports RDNA4 gfx1201 out of the box!)
- Model: Qwen3.6-27B-Omnimerge-v4-IQ3_M.gguf (~13 GB)
- Context: 16k

Tweaks:
- Set -np 1 since I'm just running it as a single-user chatbot in Open WebUI.
- Slapped on 8-bit KV cache (--cache-type-k q8_0 --cache-type-v q8_0) to cut the KV cache's VRAM use roughly in half compared to the default f16.
- Enabled MTP Speculative Decoding (--spec-type draft-mtp).

The Numbers (512-token test)
- Prompt Processing: 549.27 tok/s (1220 tokens evaluated in ~2.2s, which is the time to first token). Latency: 1.82 ms/token.
- Text Gen: 46.06 tok/s (512 tokens generated in ~11.1s). Latency: 21.71 ms/token.
- MTP Stats: Draft acceptance rate was super high at 62.7% (333 drafts accepted / 531 generated). Aggregate speed (including prompt eval) hit roughly 48.97 tok/s.

Memory Footprint
- VRAM: Sitting at 14.46 GB out of 16 GB. This leaves about 1.5 GB of breathing room, which has been totally stable with zero OOM crashes so far.
- System RAM: ~4.3 GB (mostly the host-side prompt cache helping speed up subsequent turns).

RDNA4 is ready: The latest ROCm images and llama.cpp HIP libs support the 9070 XT natively. Didn't even need to mess with HSA_OVERRIDE_GFX_VERSION.

KV Cache quantization is required: Pushing the KV cache to q8_0 is the only reason a 16k context window fits on a 16GB card alongside a 27B model.

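
For anyone wondering where the "about half" KV cache saving comes from, here is a rough back-of-the-envelope sketch. The only hard facts in it are the storage formats: f16 uses 2 bytes per value, and ggml's q8_0 stores blocks of 32 int8 values plus one f16 scale (34 bytes per 32 values). The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen 3.6 27B architecture, so the absolute MiB figures are illustrative only. The ratio between the two cache types is what carries over.

```python
# Bytes per cached value for two llama.cpp KV cache types.
# f16:  2 bytes per value.
# q8_0: 32 int8 values + one f16 scale per block = 34 bytes per 32 values.
BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_ctx, n_kv_layers, n_kv_heads, head_dim, cache_type):
    """Combined size of the K and V caches for a single sequence."""
    values_per_tensor = n_ctx * n_kv_layers * n_kv_heads * head_dim
    return 2 * values_per_tensor * BYTES_PER_VALUE[cache_type]  # 2 = K and V


# HYPOTHETICAL architecture values, chosen only to make the arithmetic concrete.
N_CTX = 16384       # the 16k context from the post
N_KV_LAYERS = 16    # assumption: number of layers that keep a KV cache
N_KV_HEADS = 4      # assumption
HEAD_DIM = 256      # assumption

f16_bytes = kv_cache_bytes(N_CTX, N_KV_LAYERS, N_KV_HEADS, HEAD_DIM, "f16")
q8_bytes = kv_cache_bytes(N_CTX, N_KV_LAYERS, N_KV_HEADS, HEAD_DIM, "q8_0")
saving = 1 - q8_bytes / f16_bytes

MIB = 2**20
print(f"f16 : {f16_bytes / MIB:.0f} MiB")
print(f"q8_0: {q8_bytes / MIB:.0f} MiB")
print(f"saving: {saving:.1%}")
```

Whatever the real dimensions are, the saving works out to just under 47% rather than a full 50%, because every 32-value block carries a 2-byte scale.
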
If anyone with a 16GB card is looking for the sweet spot, this has to be one of the best price-to-performance setups available right now. Let me know if you want my docker-compose.yml or run scripts…

Edit: To see how much MTP actually helps, I ran the exact same 512-token prompt test with speculative decoding toggled off:

Standard Inference (No Speculation):
- Speed: 25.88 tokens/sec (Latency: 38.64 ms/token)
- Prompt Eval: 239.46 tokens/sec

MTP Speculative Decoding:
- Speed: 46.06 tokens/sec (Latency: 21.71 ms/token)
- Prompt Eval: 549.27 tokens/sec
- Draft Acceptance Rate: 62.7%

For more info, here are my logs

::: spoiler Logs
ROCm GPU and System Info:

    device_info:
    - ROCm0 : AMD Radeon RX 9070 XT (16304 MiB, 16200 MiB free)
    - ROCm1 : AMD Ryzen 9 9950X3D 16-Core Processor (15589 MiB, 25012 MiB free)
    - CPU : AMD Ryzen 9 9950X3D 16-Core Processor (31178 MiB, 31178 MiB free)
    I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1

Context Initialization (16k single-slot & 8-bit KV Cache):

    I srv load_model: [spec] estimated memory usage of MTP context is 178.02 MiB
    W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
    I srv load_model: initializing slots, n_slots = 1
    I slot load_model: id 0 | task -1 | new slot, n_ctx = 16384
    I srv load_model: prompt cache is enabled, size limit: 8192 MiB
    I srv llama_server: model loaded
    I srv llama_server: server is listening on http://0.0.0.0:8080/

Benchmark Request Timing Statistics (Generating 512 tokens):

    I slot launch_slot_: id 0 | task 3454 | processing task, is_child = 0
    I slot print_timing: id 0 | task 3454 | n_decoded = 103, tg = 37.93 t/s
    I slot print_timing: id 0 | task 3454 | n_decoded = 247, tg = 43.16 t/s
    I slot print_timing: id 0 | task 3454 | n_decoded = 392, tg = 44.87 t/s
    I slot print_timing: id 0 | task 3454 | prompt eval time = 195.88 ms / 42 tokens ( 4.66 ms per token, 214.41 tokens per second)
    I slot print_timing: id 0 | task 3454 | eval time = 11116.24 ms / 512 tokens ( 21.71 ms per token, 46.06 tokens per second)
    I slot print_timing: id 0 | task 3454 | total time = 11312.12 ms / 554 tokens
    I slot print_timing: id 0 | task 3454 | graphs reused = 3560
    I slot print_timing: id 0 | task 3454 | draft acceptance = 0.62712 ( 333 accepted / 531 generated)
    I statistics draft-mtp: #calls(b,g,a) = 6 3607 3607, #gen drafts = 3607, #acc drafts = 2824, #gen tokens = 10820, #acc tokens = 6605, dur(b,g,a) = 0.008, 31925.089, 2.740 ms
    I slot release: id 0 | task 3454 | stop processing: n_tokens = 553, truncated = 0

___
:::
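
If you want to sanity-check the headline figures against the raw log, the arithmetic is simple enough to redo by hand. The sketch below only uses numbers copied from the timing lines and the MTP on/off comparison in the post; the helper name `tokens_per_second` is my own and not anything from llama.cpp.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput from a token count and a wall-clock duration in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values copied from the print_timing lines of task 3454.
gen_tps = tokens_per_second(512, 11116.24)        # "eval time" line
prompt_tps = tokens_per_second(42, 195.88)        # "prompt eval time" line
aggregate_tps = tokens_per_second(554, 11312.12)  # "total time" line
ms_per_token = 11116.24 / 512

# Draft acceptance: accepted / generated, from the "draft acceptance" line.
acceptance = 333 / 531

# Speedup of MTP over plain decoding, from the on/off comparison in the post.
speedup = 46.06 / 25.88

print(f"generation : {gen_tps:.2f} tok/s ({ms_per_token:.2f} ms/token)")
print(f"prompt eval: {prompt_tps:.2f} tok/s")
print(f"aggregate  : {aggregate_tps:.2f} tok/s")
print(f"acceptance : {acceptance:.1%}")
print(f"MTP speedup: {speedup:.2f}x")
```

The generation, aggregate, and acceptance figures line up with the post (46.06 tok/s, 48.97 tok/s, 62.7%), and the MTP speedup comes out to roughly 1.78x. One thing to keep in mind: the prompt-eval line in this log excerpt covers only 42 tokens at about 214 tok/s, so it is not the same measurement as the 1220-token, 549.27 tok/s figure quoted in the post body, and the two prompt numbers are not directly comparable.
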
