1337Hero
@1337Hero · Member since 2026
Total submissions: 64 approved public runs
Models tested: 24 unique models
Avg tok/s out: 72.7 · best 991.1 tok/s
Avg prefill: —
Avg total tok/s: 217.2 · best 2459.9 tok/s
Avg TTFT: 1419ms · best 46ms
HW configs: 4 distinct setups
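How the throughput stats relate, as a sanity-check sketch using the MiniMax-M2.7 entry from the run history below. It assumes "total tok/s" counts prompt plus output tokens over the full wall time, which matches the reported value:

awk 'BEGIN {
  prompt = 541; output = 256;      # token counts reported for that run
  out_tps = 30.7; ttft = 0.0571;   # reported output tok/s and TTFT in seconds
  decode = output / out_tps;       # time spent generating output
  printf "%.1f total tok/s\n", (prompt + output) / (ttft + decode)   # ~94.9 vs reported 95.0
}'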
Personal bests (24)
Fastest approved result for each model.
| Model | Size | Runs | Best tok/s (out) | Hardware | Engine · quant | Prefill | TTFT | ctx | When |
|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-20b | 22B | 4 | 991.1 | 2x AMD Radeon AI Pro R9700 64GB | vllm · MXFP4_MOE | — | 175ms | 33k | 2d ago |
| qwen3-coder-30b-a3b-codemonkey | 30B | 4 | 127.2 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_K_M | — | 194ms | 1k | 1d ago |
| Qwen3.6-35B-A3B | 36B | 6 | 112.8 | 3x AMD Radeon AI Pro R9700 96GB | vllm · MXFP4_MOE | — | 168ms | 1k | 2d ago |
| Llama-2-7b | 7B | 1 | 110.1 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_0 | — | 171ms | 1k | 1d ago |
| Qwen3.5-35B-A3B | 35B | 1 | 108.4 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q3_K_M | — | 208ms | 1k | 1d ago |
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 | 33B | 2 | 107.3 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 218ms | 1k | 2d ago |
| Qwen3-Coder-30B-A3B-Instruct | 31B | 3 | 100.0 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 198ms | 1k | 2d ago |
| Qwen3-VL-8B-Instruct | 9B | 1 | 95.9 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_K_M | — | 179ms | 1k | 1d ago |
| Nemotron-Cascade-2-30B-A3B | 32B | 2 | 95.4 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 216ms | 1k | 3d ago |
| GLM-4.7-Flash | 31B | 2 | 93.3 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 105ms | 1k | 3d ago |
| gemma-4-26B-A4B-it | 27B | 1 | 87.3 | 2x AMD Radeon AI Pro R9700 32GB | llama.cpp · MXFP4_MOE | — | 375ms | 1k | 3d ago |
| Qwen3.5-4B | 4B | 1 | 84.9 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 198ms | 1k | 1d ago |
| Qwen3-Coder-Next | 80B | 2 | 80.8 | 2x AMD Radeon AI Pro R9700 32GB | llama.cpp · MXFP4_MOE | — | 250ms | 1k | 3d ago |
| Llama-3.2-3B-Instruct | 3B | 1 | 79.9 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · BF16 | — | 184ms | 1k | 1d ago |
| Qwen3-32B | 32B | 3 | 79.3 | 2x AMD Radeon AI Pro R9700 64GB | vllm · FP8 | — | 138ms | 33k | 2d ago |
| gpt-oss-120b | 120B | 4 | 71.5 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · F16 | — | 233ms | 1k | 3d ago |
| MiniMax-M2.7 | 229B | 5 | 32.6 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · IQ4_XS | — | 46ms | 1k | 1d ago |
| Qwen3.5-122B-A10B | 125B | 4 | 27.3 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · MXFP4_MOE | — | 219ms | 1k | 3d ago |
| Devstral-Small-2-24B-Instruct-2512 | 24B | 2 | 23.9 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 244ms | 25k | 3d ago |
| Kimi-Dev-72B | 73B | 4 | 20.9 | 3x AMD Radeon AI Pro R9700 32GB | llama.cpp · Q4_0 | — | 272ms | 131k | 3d ago |
| Qwen3.6-27B | 28B | 4 | 20.1 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 807ms | 1k | 3d ago |
| granite-4.1-30b | 29B | 2 | 17.9 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 224ms | 1k | 2d ago |
| SWE-Dev-32B | 33B | 1 | 16.2 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | — | 699ms | 1k | 1d ago |
| Mistral-Medium-3.5-128B | 128B | 4 | 7.4 | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_0 | — | 2033ms | 1k | 1d ago |
Hardware breakdown (4)
Distinct rigs used across approved submissions.
3x AMD Radeon AI Pro R9700 96GB: 50 runs · best 160.8 tok/s
2x AMD Radeon AI Pro R9700 64GB: 6 runs · best 991.1 tok/s
3x AMD Radeon AI Pro R9700 32GB: 5 runs · best 70.0 tok/s
2x AMD Radeon AI Pro R9700 32GB: 3 runs · best 92.9 tok/s
Engines used (2)
Runtime mix and top quantizations; the two engines across the approved runs are llama.cpp and vllm.
All runs (64)
Full approved benchmark history.
229B · Minimax
30.7
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16
57ms
1k · today
Model
MiniMaxAI/MiniMax-M2.7
Display name
MiniMax-M2.7
Base model
—
Revision
main
Family
Minimax
Parameters
229B
Active params
—
MoE
no
Output tok/s
30.7
Prefill tok/s
—
Total tok/s
95.0
TTFT
57.1ms
Peak VRAM
—
Prompt tokens
541
Output tokens
256
Prefill tokens
—
Context length
797
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
F16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 --main-gpu 1 -c 16384
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1 --main-gpu 1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 8, 2026, 8:02 AM
Last edited
—
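Nearly every llama.cpp submission below shares the same llama-server invocation shape. An annotated sketch of the recurring flags, with values as they appear in the Command fields above and below; the comments reflect standard llama.cpp behavior:

llama-server \
  -ngl 99 \                # offload all model layers to the GPUs
  -ts 1/1/1 \              # tensor split: spread weights evenly across the 3 R9700s
  -ctk q8_0 -ctv q8_0 \    # quantize KV-cache keys and values to q8_0
  --flash-attn auto \      # use flash attention where the backend supports it
  --cache-reuse 256 \      # reuse cached prompt chunks of 256+ tokens
  --parallel 2 \           # two server slots
  --jinja \                # apply the model's Jinja chat template
  -c 16384                 # context window in tokens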
128B · Mistral
6.5
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16
1956ms
1k · today
Model
mistralai/Mistral-Medium-3.5-128B
Display name
Mistral-Medium-3.5-128B
Base model
—
Revision
main
Family
Mistral
Parameters
128B
Active params
—
MoE
no
Output tok/s
6.5
Prefill tok/s
—
Total tok/s
27.4
TTFT
1955.6ms
Peak VRAM
—
Prompt tokens
886
Output tokens
256
Prefill tokens
—
Context length
1142
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
F16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 8, 2026, 7:51 AM
Last edited
—
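The ${PORT} placeholder in every Command field is filled in by llama-swap, which launches and swaps llama-server instances on demand. A minimal config sketch (the llama-swap schema here is an assumption and the model key is illustrative; the command itself is taken from the run above):

models:
  "mistral-medium-3.5":
    cmd: >
      /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server
      --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99
      -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf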
120B · Gpt
62.7
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16
105ms
2k · today
Model
openai/gpt-oss-120b
Display name
gpt-oss-120b
Base model
—
Revision
main
Family
Gpt
Parameters
120B
Active params
—
MoE
no
Output tok/s
62.7
Prefill tok/s
—
Total tok/s
99.0
TTFT
105.5ms
Peak VRAM
—
Prompt tokens
589
Output tokens
1000
Prefill tokens
—
Context length
1589
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
F16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 8, 2026, 7:44 AM
Last edited
—
229B · Minimax
25.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS
56ms
33k · today
Model
MiniMaxAI/MiniMax-M2.7
Display name
MiniMax-M2.7
Base model
—
Revision
main
Family
Minimax
Parameters
229B
Active params
—
MoE
no
Output tok/s
25.2
Prefill tok/s
—
Total tok/s
79.9
TTFT
56.1ms
Peak VRAM
—
Prompt tokens
560
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
IQ4_XS
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
1
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k
Reactions
—
Submitted
May 8, 2026, 7:13 AM
Last edited
—
229B · Minimax
25.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS
5267ms
1k · today
Model
MiniMaxAI/MiniMax-M2.7
Display name
MiniMax-M2.7
Base model
—
Revision
main
Family
Minimax
Parameters
229B
Active params
—
MoE
no
Output tok/s
25.2
Prefill tok/s
—
Total tok/s
52.9
TTFT
5266.6ms
Peak VRAM
—
Prompt tokens
560
Output tokens
256
Prefill tokens
—
Context length
816
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
IQ4_XS
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
1
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k
Reactions
—
Submitted
May 8, 2026, 7:10 AM
Last edited
—
229B · Minimax
31.8
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS
771ms
33k · 1d ago
Model
MiniMaxAI/MiniMax-M2.7
Display name
MiniMax-M2.7
Base model
—
Revision
main
Family
Minimax
Parameters
229B
Active params
—
MoE
no
Output tok/s
31.8
Prefill tok/s
—
Total tok/s
90.8
TTFT
770.5ms
Peak VRAM
—
Prompt tokens
546
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
IQ4_XS
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
1
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
REAP'd 172B IQ4_XS imatrix. ctx=32768, parallel=1. Thinking disabled.
Reactions
—
Submitted
May 7, 2026, 4:58 AM
Last edited
—
229B · Minimax
32.6
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS
46ms
1k · 1d ago
Model
MiniMaxAI/MiniMax-M2.7
Display name
MiniMax-M2.7
Base model
—
Revision
main
Family
Minimax
Parameters
229B
Active params
—
MoE
no
Output tok/s
32.6
Prefill tok/s
—
Total tok/s
101.4
TTFT
46.3ms
Peak VRAM
—
Prompt tokens
546
Output tokens
256
Prefill tokens
—
Context length
802
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
IQ4_XS
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
REAP'd 172B IQ4_XS imatrix (mradermacher). Thinking disabled via system prompt.
Reactions
—
Submitted
May 7, 2026, 4:39 AM
Last edited
—
7B · Llama
110.1
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0
171ms
1k · 1d ago
Model
meta-llama/Llama-2-7b
Display name
Llama-2-7b
Base model
—
Revision
main
Family
Llama
Parameters
7B
Active params
—
MoE
no
Output tok/s
110.1
Prefill tok/s
—
Total tok/s
347.5
TTFT
171.1ms
Peak VRAM
—
Prompt tokens
611
Output tokens
256
Prefill tokens
—
Context length
867
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/llama/llama-2-7b.Q4_0.gguf -ts 0/1 -c 4096
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ts 0/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 7, 2026, 3:49 AM
Last edited
—
35B · Qwen
108.4
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q3_K_M
208ms
1k · 1d ago
Model
Qwen/Qwen3.5-35B-A3B
Display name
Qwen3.5-35B-A3B
Base model
Qwen3.5-35B-A3B-Base
Revision
main
Family
Qwen
Parameters
35B
Active params
—
MoE
no
Output tok/s
108.4
Prefill tok/s
—
Total tok/s
306.4
TTFT
207.7ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
Q3_K_M
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/qwen/Qwen3.5-35B-A3B-Uncensored.Q3_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 262144
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 0/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 7, 2026, 3:43 AM
Last edited
—
33B · Qwen
16.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
699ms
1k · 1d ago
Model
zai-org/SWE-Dev-32B
Display name
SWE-Dev-32B
Base model
Qwen2.5-Coder-32B-Instruct
Revision
main
Family
Qwen
Parameters
33B
Active params
—
MoE
no
Output tok/s
16.2
Prefill tok/s
—
Total tok/s
49.0
TTFT
698.6ms
Peak VRAM
—
Prompt tokens
551
Output tokens
256
Prefill tokens
—
Context length
807
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.95
Top K
—
Min P
—
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m "/home/mikekey/models/gguf/misc/swe-dev-32b-q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 1/1 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --prio 3 --ts 1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 7, 2026, 3:36 AM
Last edited
—
9B · Qwen
95.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_K_M
179ms
1k · 1d ago
Model
Qwen/Qwen3-VL-8B-Instruct
Display name
Qwen3-VL-8B-Instruct
Base model
—
Revision
main
Family
Qwen
Parameters
9B
Active params
—
MoE
no
Output tok/s
95.9
Prefill tok/s
—
Total tok/s
276.0
TTFT
178.8ms
Peak VRAM
—
Prompt tokens
530
Output tokens
256
Prefill tokens
—
Context length
786
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
Q4_K_M
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3VL-8B-Instruct-Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 -ts 0/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 0/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 7, 2026, 3:29 AM
Last edited
—
3B · Llama
79.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · BF16
184ms
1k · 1d ago
Model
meta-llama/Llama-3.2-3B-Instruct
Display name
Llama-3.2-3B-Instruct
Base model
—
Revision
main
Family
Llama
Parameters
3B
Active params
—
MoE
no
Output tok/s
79.9
Prefill tok/s
—
Total tok/s
239.7
TTFT
183.6ms
Peak VRAM
—
Prompt tokens
556
Output tokens
256
Prefill tokens
—
Context length
812
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
BF16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.9
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/llama/Llama-3.2-3B-Instruct-BF16.gguf --temp 0.7 --top-p 0.9 -ts 0/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 0/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 7, 2026, 3:23 AM
Last edited
—
4B · Qwen
84.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
198ms
1k · 1d ago
Model
Qwen/Qwen3.5-4B
Display name
Qwen3.5-4B
Base model
Qwen3.5-4B-Base
Revision
main
Family
Qwen
Parameters
4B
Active params
—
MoE
no
Output tok/s
84.9
Prefill tok/s
—
Total tok/s
244.8
TTFT
197.6ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9049-2-g79b2f3239-dirty
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.95
Top K
—
Min P
—
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m "/home/mikekey/models/gguf/misc/Opus4.7-GODs.Ghost.Codex.Distill.4B-Q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 0/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 0/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 7, 2026, 3:06 AM
Last edited
—
73B · Qwen
15.5
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0
266ms
1k · 1d ago
Model
moonshotai/Kimi-Dev-72B
Display name
Kimi-Dev-72B
Base model
Qwen2.5-72B
Revision
main
Family
Qwen
Parameters
73B
Active params
—
MoE
no
Output tok/s
15.5
Prefill tok/s
—
Total tok/s
56.7
TTFT
265.7ms
Peak VRAM
—
Prompt tokens
697
Output tokens
256
Prefill tokens
—
Context length
953
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
yes
Spec method
—
Spec model
—
Spec draft model
Qwen2.5-Coder-0.5B-Instruct-Q8_0
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
—
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + Qwen2.5-Coder-0.5B-Q8_0 draft (n-max=8, n-min=1, p-min=0.7), 128K ctx. Code-completion prompt — paired baseline (no spec) at ~11.4 tok/s shows ~+35% gain from speculative decoding on code-shaped output.
Reactions
—
Submitted
May 6, 2026, 7:22 PM
Last edited
—
73B · Qwen
11.5
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0
250ms
1k · 1d ago
Model
moonshotai/Kimi-Dev-72B
Display name
Kimi-Dev-72B
Base model
Qwen2.5-72B
Revision
main
Family
Qwen
Parameters
73B
Active params
—
MoE
no
Output tok/s
11.5
Prefill tok/s
—
Total tok/s
42.3
TTFT
249.7ms
Peak VRAM
—
Prompt tokens
697
Output tokens
256
Prefill tokens
—
Context length
953
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
—
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8 KV, 128K ctx. Code-completion prompt (Python module continuation, ~700 input tokens). Baseline for spec-decoding A/B — paired with Kimi-Dev-72B-spec submission.
Reactions
—
Submitted
May 6, 2026, 7:14 PM
Last edited
—
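The two Kimi-Dev-72B runs above are a deliberate A/B pair; only the draft-model flags differ (flags exactly as recorded in the submissions, with paths shortened):

# baseline, no speculative decoding: ~11.5 tok/s out
llama-server -m Kimi-Dev-72B-Q4_0.gguf -ngl 99 -ts 1/1/1 -ctk q8_0 -ctv q8_0 --flash-attn auto -c 131072
# plus a 0.5B draft model proposing tokens for the 72B to verify: ~15.5 tok/s out (~+35%)
llama-server -m Kimi-Dev-72B-Q4_0.gguf -ngl 99 -ts 1/1/1 -ctk q8_0 -ctv q8_0 --flash-attn auto -c 131072 \
  --spec-draft-model Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
  --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99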
24B · Mistral
23.6
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
214ms
1k · 1d ago
Model
mistralai/Devstral-Small-2-24B-Instruct-2512
Display name
Devstral-Small-2-24B-Instruct-2512
Base model
Mistral-Small-3.1-24B-Base-2503
Revision
main
Family
Mistral
Parameters
24B
Active params
—
MoE
no
Output tok/s
23.6
Prefill tok/s
—
Total tok/s
204.1
TTFT
213.6ms
Peak VRAM
—
Prompt tokens
1078
Output tokens
135
Prefill tokens
—
Context length
1213
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.15
Top P
—
Top K
—
Min P
0.01
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --temp 0.15 --min-p 0.01 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 24576
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 24K ctx, q8 KV, FA auto. Single-card layout.
Reactions
—
Submitted
May 6, 2026, 7:02 PM
Last edited
—
28B · Qwen
19.6
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
263ms
1k · 1d ago
Model
Qwen/Qwen3.6-27B
Display name
Qwen3.6-27B
Base model
—
Revision
main
Family
Qwen
Parameters
28B
Active params
—
MoE
no
Output tok/s
19.6
Prefill tok/s
—
Total tok/s
59.2
TTFT
262.8ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
20
Min P
0
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384
Extra flags
--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 16K ctx, q8 KV, FA auto. Single-card layout to eliminate cross-GPU overhead on 27B dense.
Reactions
—
Submitted
May 6, 2026, 6:56 PM
Last edited
—
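The FAST-tier entries around this point vary only the tensor split. The layouts used across the history (larger models need more cards; small dense models avoid cross-GPU overhead on one):

-ts 1/0/0   # single card: all weights on GPU0, no cross-GPU traffic
-ts 0/1     # single card, pinned to GPU1 instead
-ts 1/1/0   # two cards: weights split across GPU0 and GPU1
-ts 1/1/1   # three cards: even split across all three R9700s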
36B · Qwen
80.5
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
203ms
1k · 1d ago
Model
Qwen/Qwen3.6-35B-A3B
Display name
Qwen3.6-35B-A3B
Base model
—
Revision
main
Family
Qwen
Parameters
36B
Active params
—
MoE
no
Output tok/s
80.5
Prefill tok/s
—
Total tok/s
232.6
TTFT
202.6ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.6
Top P
0.95
Top K
20
Min P
0
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.
Reactions
—
Submitted
May 6, 2026, 6:50 PM
Last edited
—
31B · Qwen
92.0
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
178ms
1k · 1d ago
Model
Qwen/Qwen3-Coder-30B-A3B-Instruct
Display name
Qwen3-Coder-30B-A3B-Instruct
Base model
—
Revision
main
Family
Qwen
Parameters
31B
Active params
—
MoE
no
Output tok/s
92.0
Prefill tok/s
—
Total tok/s
365.9
TTFT
178.2ms
Peak VRAM
—
Prompt tokens
530
Output tokens
156
Prefill tokens
—
Context length
686
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.
Reactions
—
Submitted
May 6, 2026, 6:42 PM
Last edited
—
30B · Qwen
127.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_K_M
194ms
1k · 1d ago
Model
1337Hero/qwen3-coder-30b-a3b-codemonkey
Display name
qwen3-coder-30b-a3b-codemonkey
Base model
Qwen3-Coder-30B-A3B-Instruct
Revision
main
Family
Qwen
Parameters
30B
Active params
—
MoE
no
Output tok/s
127.2
Prefill tok/s
—
Total tok/s
356.1
TTFT
193.7ms
Peak VRAM
—
Prompt tokens
530
Output tokens
256
Prefill tokens
—
Context length
786
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q4_K_M
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto.
Reactions
—
Submitted
May 6, 2026, 4:37 PM
Last edited
—
30B · Qwen
126.5
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_K_M
209ms
1k · 1d ago
Model
1337Hero/qwen3-coder-30b-a3b-codemonkey
Display name
qwen3-coder-30b-a3b-codemonkey
Base model
Qwen3-Coder-30B-A3B-Instruct
Revision
main
Family
Qwen
Parameters
30B
Active params
—
MoE
no
Output tok/s
126.5
Prefill tok/s
—
Total tok/s
352.0
TTFT
208.9ms
Peak VRAM
—
Prompt tokens
530
Output tokens
256
Prefill tokens
—
Context length
786
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q4_K_M
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.from-Q8_0.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto, Q4_K_M derived from Q8_0 source.
Reactions
—
Submitted
May 6, 2026, 4:32 PM
Last edited
—
30B · Qwen
90.0
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
203ms
1k · 1d ago
Model
1337Hero/qwen3-coder-30b-a3b-codemonkey
Display name
qwen3-coder-30b-a3b-codemonkey
Base model
Qwen3-Coder-30B-A3B-Instruct
Revision
main
Family
Qwen
Parameters
30B
Active params
—
MoE
no
Output tok/s
90.0
Prefill tok/s
—
Total tok/s
257.8
TTFT
203.3ms
Peak VRAM
—
Prompt tokens
530
Output tokens
256
Prefill tokens
—
Context length
786
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 2× R9700, 32K ctx, q8 KV, FA auto.
Reactions
—
Submitted
May 6, 2026, 4:26 PM
Last edited
—
30B · Qwen
20.3
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · BF16
222ms
1k · 1d ago
Model
1337Hero/qwen3-coder-30b-a3b-codemonkey
Display name
qwen3-coder-30b-a3b-codemonkey
Base model
Qwen3-Coder-30B-A3B-Instruct
Revision
main
Family
Qwen
Parameters
30B
Active params
—
MoE
no
Output tok/s
20.3
Prefill tok/s
—
Total tok/s
61.2
TTFT
221.6ms
Peak VRAM
—
Prompt tokens
530
Output tokens
256
Prefill tokens
—
Context length
786
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
BF16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.BF16.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 3× R9700, 16K ctx, q8 KV, FA auto.
Reactions
—
Submitted
May 6, 2026, 4:19 PM
Last edited
—
128B · Mistral
7.4
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0
2033ms
1k · 1d ago
Model
mistralai/Mistral-Medium-3.5-128B
Display name
Mistral-Medium-3.5-128B
Base model
—
Revision
main
Family
Mistral
Parameters
128B
Active params
—
MoE
no
Output tok/s
7.4
Prefill tok/s
—
Total tok/s
31.4
TTFT
2032.8ms
Peak VRAM
—
Prompt tokens
886
Output tokens
256
Prefill tokens
—
Context length
1142
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8_0 KV, 16K ctx, ts 1/1/1, FA auto.
Reactions
—
Submitted
May 6, 2026, 12:50 PM
Last edited
—
128B · Mistral
6.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0
317ms
16k · 2d ago
Model
mistralai/Mistral-Medium-3.5-128B
Display name
Mistral-Medium-3.5-128B
Base model
—
Revision
main
Family
Mistral
Parameters
128B
Active params
—
MoE
no
Output tok/s
6.9
Prefill tok/s
—
Total tok/s
30.4
TTFT
317.4ms
Peak VRAM
—
Prompt tokens
886
Output tokens
256
Prefill tokens
—
Context length
16384
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8979-66bafdcf1
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
q8_0
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
1
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384
Extra flags
-ctk q8_0 -ctv q8_0 -ts 1/1/1 --no-mmap --cache-ram 0 --no-cache-idle-slots -c 16384
Notes
llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Mistral Medium 3.5 128B Q4_0 GGUF, 3 shards, tensor split 1/1/1, q8_0 KV cache, flash-attn auto, 16K ctx, single slot. Median of 3 runs. Submitted with running binary build_info b8979-66bafdcf1.
Reactions
—
Submitted
May 6, 2026, 6:40 AM
Last edited
—
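The Mistral-Medium runs add cache-related flags the other submissions lack; annotated as recorded (the semantics of the last two in this llama.cpp build are assumptions, not verified):

--no-mmap              # load weights into RAM rather than memory-mapping the GGUF
--cache-ram 0          # presumably: no RAM budget for the prompt cache
--no-cache-idle-slots  # presumably: do not retain cache for idle server slots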
128B · Mistral
6.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0
2082ms
1k · 2d ago
Model
mistralai/Mistral-Medium-3.5-128B
Display name
Mistral-Medium-3.5-128B
Base model
—
Revision
main
Family
Mistral
Parameters
128B
Active params
—
MoE
no
Output tok/s
6.2
Prefill tok/s
—
Total tok/s
26.4
TTFT
2081.9ms
Peak VRAM
—
Prompt tokens
886
Output tokens
256
Prefill tokens
—
Context length
1142
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.95
Top K
—
Min P
0
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf --temp 0.7 --top-p 0.95 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
3x R9700 Vulkan, -ts 1/1/1, q8 KV, flash-attn auto, 32K ctx
Reactions
—
Submitted
May 6, 2026, 6:01 AM
Last edited
—
17.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
224ms
1k · 2d ago
Model
ibm-granite/granite-4.1-30b
Display name
granite-4.1-30b
Base model
—
Revision
main
Family
—
Parameters
29B
Active params
—
MoE
no
Output tok/s
17.9
Prefill tok/s
—
Total tok/s
54.1
TTFT
223.9ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/ibm-granite_granite-4.1-30b-Q8_0.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 65536
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 6, 2026, 4:57 AM
Last edited
—
107.3
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
218ms
1k · 2d ago
Model
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Display name
Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Base model
—
Revision
main
Family
—
Parameters
33B
Active params
—
MoE
yes
Output tok/s
107.3
Prefill tok/s
—
Total tok/s
304.9
TTFT
217.8ms
Peak VRAM
—
Prompt tokens
538
Output tokens
256
Prefill tokens
—
Context length
794
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9020-10-g17df5830e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.6
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144
Extra flags
--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1
Notes
llama.cpp Vulkan on 3x R9700 via llama-swap, ts 1/1 (2-card). Median of 3 runs (selected for representative TTFT).
Reactions
—
Submitted
May 6, 2026, 4:45 AM
Last edited
—
32B · Qwen
22.8
tok/s
3x AMD Radeon AI Pro R9700 96GB
vllm · FP8
60ms
1k · 2d ago
Model
Qwen/Qwen3-32B
Display name
Qwen3-32B
Base model
—
Revision
main
Family
Qwen
Parameters
32B
Active params
—
MoE
no
Output tok/s
22.8
Prefill tok/s
—
Total tok/s
69.5
TTFT
60.0ms
Peak VRAM
34.2GB
Prompt tokens
530
Output tokens
256
Prefill tokens
—
Context length
786
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.18.1.dev0+gbcf2be961.d20260330
Quantization
FP8
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
2048
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.95
Max running seqs
4
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3-32B-FP8 --max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}'
Extra flags
--max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}
Notes
Post-reboot validation: 2x R9700, TP=2, dense FP8.
Reactions
—
Submitted
May 6, 2026, 4:12 AM
Last edited
—
36B · Qwen
112.8
tok/s
3x AMD Radeon AI Pro R9700 96GB
vllm · MXFP4_MOE
168ms
1k · 2d ago
Model
Qwen/Qwen3.6-35B-A3B
Display name
Qwen3.6-35B-A3B
Base model
—
Revision
main
Family
Qwen
Parameters
36B
Active params
—
MoE
yes
Output tok/s
112.8
Prefill tok/s
—
Total tok/s
322.8
TTFT
168ms
Peak VRAM
32.0GB
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.18.1.dev0+gbcf2be961.d20260330
Quantization
MXFP4_MOE
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
2048
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.95
Max running seqs
4
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'
Extra flags
--max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}
Notes
vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH (bypasses RCCL 7.2.x gfx1201 tuning-index miss / TP=2 deadlock). TunableOp tuned ~300 rows/rank from cudagraph capture + diverse-load driving. MoE backend Triton (gfx950 gate routes only gfx950 to CK on this vLLM build). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions, post heavy warmup.
Reactions
—
Submitted
May 6, 2026, 3:18 AM
Last edited
—
36B · Qwen
98.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
vllm · MXFP4_MOE
171ms
1k · 2d ago
Model
Qwen/Qwen3.6-35B-A3B
Display name
Qwen3.6-35B-A3B
Base model
—
Revision
main
Family
Qwen
Parameters
36B
Active params
—
MoE
yes
Output tok/s
98.9
Prefill tok/s
—
Total tok/s
285.2
TTFT
171.4ms
Peak VRAM
32GB
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.18.1.dev0+gbcf2be961.d20260330
Quantization
MXFP4_MOE
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
2048
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.95
Max running seqs
4
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'
Extra flags
--max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}
Notes
vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH to bypass RCCL 7.2.x gfx1201 tuning-index miss (TP=2 deadlock). aiter built but MoE backend still Triton (gfx950 gate routes only gfx950 to CK). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions.
Reactions
—
Submitted
May 6, 2026, 2:23 AM
Last edited
—
22B · Gpt
991.1
tok/s
2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE
175ms
33k · 2d ago
Model
openai/gpt-oss-20b
Display name
gpt-oss-20b
Base model
—
Revision
main
Family
Gpt
Parameters
22B
Active params
—
MoE
yes
Output tok/s
991.1
Prefill tok/s
—
Total tok/s
—
TTFT
175.4ms
Peak VRAM
—
Prompt tokens
589
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
16
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 64GB
GPU slots
—
GPU count
2
VRAM
64GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.20.1+rocm721
Quantization
MXFP4_MOE
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
yes
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
4096
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.9
Max running seqs
16
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Extra flags
--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Notes
gpt-oss-20b MXFP4 concurrent throughput at batch=16 (matches --max-num-seqs 16). Aggregate output tok/s, best of 3 runs (991.1 / 986.8 / 990.6 -- variance under 0.5%). Per-request decode 64.8 tok/s. TTFT 175ms. This is vLLM's strength zone vs llama.cpp: ~6x aggregate throughput when serving multiple concurrent users. On the same hardware, batch=1 single-stream was 48 tok/s (separate submission), while llama.cpp Q8_0 single-stream was 160 tok/s.
Reactions
—
Submitted
May 5, 2026, 11:29 PM
Last edited
—
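A hedged sketch of how an aggregate batch=16 number like this can be reproduced against the server started by the Command above: fire 16 requests concurrently and divide the summed completion tokens by wall-clock time. The port (vLLM's default 8000), the prompt, and the jq-based tallying are assumptions, not the submitter's actual harness.

N=16; URL=http://localhost:8000/v1/chat/completions
start=$(date +%s.%N)
for i in $(seq "$N"); do
  curl -s "$URL" -H 'Content-Type: application/json' \
    -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"Explain KV caching."}],"max_tokens":256}' \
    | jq '.usage.completion_tokens' >> /tmp/toks.$$ &   # one completion-token count per request
done
wait
end=$(date +%s.%N)
total=$(paste -sd+ /tmp/toks.$$ | bc)                   # sum the 16 counts
echo "aggregate output tok/s: $(echo "$total / ($end - $start)" | bc -l)"
rm -f /tmp/toks.$$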
22B · Gpt
48.3
tok/s
2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE
89ms
33k · 2d ago
Model
openai/gpt-oss-20b
Display name
gpt-oss-20b
Base model
—
Revision
main
Family
Gpt
Parameters
22B
Active params
—
MoE
yes
Output tok/s
48.3
Prefill tok/s
—
Total tok/s
156.7
TTFT
88.6ms
Peak VRAM
—
Prompt tokens
589
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 64GB
GPU slots
—
GPU count
2
VRAM
64GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.20.1+rocm721
Quantization
MXFP4_MOE
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
yes
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
4096
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.9
Max running seqs
16
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Extra flags
--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Notes
gpt-oss-20b native MXFP4 on 2x R9700 (gfx1201) TP=2, batch=1. Uses the *fused* OAI Triton MXFP4 MoE kernel (triton, not triton_unfused) — gpt-oss is the kernel's original target. Same 3 wheel patches as Qwen3.6 MXFP4 entry (PR #37826 device gate, CLI enum, dtype). RCCL 7.1.1 hotfix for #40980. Concurrent throughput much higher: 580 tok/s at c=8, 902 tok/s at c=16 (batch submissions to follow).
Reactions
—
Submitted
May 5, 2026, 11:24 PM
Last edited
—
36B · Qwen
33.6
tok/s
2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE
130ms
33k · 2d ago
Model
Qwen/Qwen3.6-35B-A3B
Display name
Qwen3.6-35B-A3B
Base model
—
Revision
main
Family
Qwen
Parameters
36B
Active params
—
MoE
yes
Output tok/s
33.6
Prefill tok/s
—
Total tok/s
101.7
TTFT
129.6ms
Peak VRAM
—
Prompt tokens
533
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 64GB
GPU slots
—
GPU count
2
VRAM
64GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.20.1+rocm721
Quantization
MXFP4_MOE
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
yes
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
2048
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.93
Max running seqs
4
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop_merged.csv VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Extra flags
--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Notes
Same setup as previous MXFP4 submission (PR #37826 patched, RCCL 7.1.1 hotfix, triton_unfused, dtype=bf16) — now WITH PyTorch TunableOp tuned (754 rows / 377 per TP rank). Tuned with 18-min diverse-load pass: prompt sizes 8/64/256/1024/2048 tokens x 5 content categories x concurrency 1+4. Auto-stopped on row plateau (3x60s zero-delta). Tuning gain over untuned baseline: +7.0% (31.70 -> 33.92 tok/s). Modest gain — batch=1 decode on R9700 is memory-bandwidth-bound, not GEMM-scheduling-bound.
Reactions
—
Submitted
May 5, 2026, 11:00 PM
Last edited
—
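The tune-then-replay TunableOp flow described in that note, sketched with PyTorch's documented TunableOp environment variables. The Command above shows the replay half; the tuning half below is a reconstruction, and drive_load.sh is a hypothetical stand-in for the submitter's 18-min diverse-load pass.

# Phase 1: tune -- record GEMM solutions while the server handles diverse traffic
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 \
PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop.csv \
vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 &
./drive_load.sh    # hypothetical: mixed prompt sizes (8-2048 tok) at concurrency 1 and 4
kill %1            # stop the tuning server once new rows plateau
# Phase 2: replay -- tuning off, read back the merged per-rank solution table
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 \
PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop_merged.csv \
vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2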
36B · Qwen
31.7
tok/s
2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE
140ms
33k · 2d ago
Model
Qwen/Qwen3.6-35B-A3B
Display name
Qwen3.6-35B-A3B
Base model
—
Revision
main
Family
Qwen
Parameters
36B
Active params
—
MoE
yes
Output tok/s
31.7
Prefill tok/s
—
Total tok/s
96.0
TTFT
140.0ms
Peak VRAM
—
Prompt tokens
533
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 64GB
GPU slots
—
GPU count
2
VRAM
64GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.20.1+rocm721
Quantization
MXFP4_MOE
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
yes
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
2048
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.93
Max running seqs
4
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MOE=0 VLLM_MXFP4_USE_MARLIN=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Extra flags
--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Notes
Qwen3.6-35B-A3B MXFP4 (pahajokiconsulting quant) on 2x R9700 (gfx1201) TP=2. Baseline path (no AITER, no TunableOp, no MTP). Required THREE patches to vLLM 0.20.1+rocm721 wheel: (1) PR vllm-project/vllm#37826 in-place patch -- widen OAI Triton MoE device gate to gfx12, (2) add 'triton_unfused' to MoEBackend Literal in config/kernel.py (CLI exposure), (3) RCCL 7.1.1 LD_PRELOAD for #40980 deadlock. --moe-backend triton_unfused needed because TRITON kernel only supports SWIGLUOAI activation; UnfusedOAITritonExperts adds SILU. dtype=bfloat16 required (kernel asserts hidden_states.dtype==bf16). Best of 3 (31.6 / 31.7 / 31.6 tok/s).
Reactions
—
Submitted
May 5, 2026, 9:49 PM
Last edited
—
32B · Qwen
79.3
tok/s
2x AMD Radeon AI Pro R9700 64GB
vllm · FP8
138ms
33k · 2d ago
Model
Qwen/Qwen3-32B
Display name
Qwen3-32B
Base model
—
Revision
main
Family
Qwen
Parameters
32B
Active params
—
MoE
no
Output tok/s
79.3
Prefill tok/s
—
Total tok/s
—
TTFT
138.3ms
Peak VRAM
—
Prompt tokens
534
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
4
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 64GB
GPU slots
—
GPU count
2
VRAM
64GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.20.1+rocm721
Quantization
FP8
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
2048
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.9
Max running seqs
4
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3
Extra flags
--max-model-len 32768 --reasoning-parser qwen3
Notes
Concurrent throughput at batch=4 (matches --max-num-seqs 4). Aggregate output tok/s, best of 3 runs (78.7 / 78.2 / 79.3). Per-request decode ~20 tok/s. RCCL 7.1.1 hotfix for #40980 still in effect.
Reactions
—
Submitted
May 5, 2026, 9:32 PM
Last edited
—
32B · Qwen
22.9
tok/s
2x AMD Radeon AI Pro R9700 64GB
vllm · FP8
71ms
33k · 2d ago
Model
Qwen/Qwen3-32B
Display name
Qwen3-32B
Base model
—
Revision
main
Family
Qwen
Parameters
32B
Active params
—
MoE
no
Output tok/s
22.9
Prefill tok/s
—
Total tok/s
70.1
TTFT
71.5ms
Peak VRAM
—
Prompt tokens
534
Output tokens
256
Prefill tokens
—
Context length
32768
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 64GB
GPU slots
—
GPU count
2
VRAM
64GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
vllm
Engine version
0.20.1+rocm721
Quantization
FP8
Backend
rocm
Tensor parallel
2
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
yes
Prefill chunk
2048
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
0.9
Max running seqs
4
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3
Extra flags
--max-model-len 32768 --reasoning-parser qwen3
Notes
vLLM 0.20.1+rocm721 on 2x AMD R9700 (gfx1201) TP=2. Workaround for vllm-project/vllm#40980: LD_PRELOAD librccl.so.1 from rccl-7.1.1 (Arch ALA) -- RCCL 7.2.x lacks gfx1201 tuning index and deadlocks on AllReduce. With hotfix: ~22.7 tok/s; without: ~0.5 tok/s.
Reactions
—
Submitted
May 5, 2026, 9:26 PM
Last edited
—
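The RCCL downgrade hotfix from that note, sketched end to end. Only the LD_LIBRARY_PATH mechanics come from the Command lines above; fetching the rccl 7.1.1 package from the Arch Linux Archive and the library path inside it are assumptions (exact package file omitted).

mkdir -p ~/.local/lib/rccl-7.1.1
tar -C /tmp -xf rccl-7.1.1-*.pkg.tar.zst                      # package downloaded beforehand; filename varies
cp /tmp/opt/rocm/lib/librccl.so.1* ~/.local/lib/rccl-7.1.1/   # path inside the package assumed
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH \
  vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2   # remaining flags as in the Command above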
31B · Qwen
100.0
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
198ms
1k · 2d ago
Model
Qwen/Qwen3-Coder-30B-A3B-Instruct
Display name
Qwen3-Coder-30B-A3B-Instruct
Base model
—
Revision
main
Family
Qwen
Parameters
31B
Active params
—
MoE
yes
Output tok/s
100.0
Prefill tok/s
—
Total tok/s
390.1
TTFT
198.5ms
Peak VRAM
—
Prompt tokens
530
Output tokens
156
Prefill tokens
—
Context length
686
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8978-1-g66bafdcf1
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 7:00 PM
Last edited
—
22B · Gpt
160.8
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
206ms
1k · 3d ago
Model
openai/gpt-oss-20b
Display name
gpt-oss-20b
Base model
—
Revision
main
Family
Gpt
Parameters
22B
Active params
—
MoE
yes
Output tok/s
160.8
Prefill tok/s
—
Total tok/s
470.0
TTFT
206.1ms
Peak VRAM
—
Prompt tokens
589
Output tokens
256
Prefill tokens
—
Context length
845
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8978-1-g66bafdcf1
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-20b-Q8_0.gguf -ts 0/1/0 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 0/1/0
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 4:52 AM
Last edited
—
93.3
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
105ms
1k · 3d ago
Model
zai-org/GLM-4.7-Flash
Display name
GLM-4.7-Flash
Base model
—
Revision
main
Family
—
Parameters
31B
Active params
—
MoE
yes
Output tok/s
93.3
Prefill tok/s
—
Total tok/s
274.8
TTFT
104.8ms
Peak VRAM
—
Prompt tokens
527
Output tokens
256
Prefill tokens
—
Context length
783
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8978-1-g66bafdcf1
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
—
Min P
0.01
Repeat penalty
1
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/glm/GLM-4.7-Flash-MXFP4_MOE.gguf --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 0/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 4:40 AM
Last edited
—
80B · Qwen
80.4
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE
234ms
1k · 3d ago
Model
Qwen/Qwen3-Coder-Next
Display name
Qwen3-Coder-Next
Base model
—
Revision
main
Family
Qwen
Parameters
80B
Active params
—
MoE
yes
Output tok/s
80.4
Prefill tok/s
—
Total tok/s
280.1
TTFT
233.8ms
Peak VRAM
—
Prompt tokens
530
Output tokens
187
Prefill tokens
—
Context length
717
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8978-1-g66bafdcf1
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
40
Min P
0.01
Repeat penalty
1
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-Next-MXFP4_MOE.gguf -ts 1/1 -c 262144 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 4:31 AM
Last edited
—
125B · Qwen
26.8
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE
261ms
1k · 3d ago
Model
Qwen/Qwen3.5-122B-A10B
Display name
Qwen3.5-122B-A10B
Base model
—
Revision
main
Family
Qwen
Parameters
125B
Active params
—
MoE
yes
Output tok/s
26.8
Prefill tok/s
—
Total tok/s
110.8
TTFT
261.3ms
Peak VRAM
—
Prompt tokens
533
Output tokens
161
Prefill tokens
—
Context length
694
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8978-1-g66bafdcf1
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
40
Min P
0.01
Repeat penalty
1
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 4:19 AM
Last edited
—
31B · Qwen
87.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
207ms
1k · 3d ago
Model
Qwen/Qwen3-Coder-30B-A3B-Instruct
Display name
Qwen3-Coder-30B-A3B-Instruct
Base model
—
Revision
main
Family
Qwen
Parameters
31B
Active params
—
MoE
yes
Output tok/s
87.2
Prefill tok/s
—
Total tok/s
343.5
TTFT
207.0ms
Peak VRAM
—
Prompt tokens
530
Output tokens
156
Prefill tokens
—
Context length
686
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
0
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 3:51 AM
Last edited
—
120B · Gpt
71.5
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16
233ms
1k · 3d ago
Model
openai/gpt-oss-120b
Display name
gpt-oss-120b
Base model
—
Revision
main
Family
Gpt
Parameters
120B
Active params
—
MoE
yes
Output tok/s
71.5
Prefill tok/s
—
Total tok/s
221.7
TTFT
232.7ms
Peak VRAM
—
Prompt tokens
589
Output tokens
256
Prefill tokens
—
Context length
845
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
F16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 3:41 AM
Last edited
—
36B · Qwen
75.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
217ms
1k · 3d ago
Model
Qwen/Qwen3.6-35B-A3B
Display name
Qwen3.6-35B-A3B
Base model
—
Revision
main
Family
Qwen
Parameters
36B
Active params
—
MoE
yes
Output tok/s
75.2
Prefill tok/s
—
Total tok/s
217.4
TTFT
217.0ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.6
Top P
0.95
Top K
20
Min P
0
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags
--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 3:31 AM
Last edited
—
22B · Gpt
159.7
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
201ms
131k · 3d ago
Model
openai/gpt-oss-20b
Display name
gpt-oss-20b
Base model
—
Revision
main
Family
Gpt
Parameters
22B
Active params
—
MoE
yes
Output tok/s
159.7
Prefill tok/s
—
Total tok/s
468.5
TTFT
200.9ms
Peak VRAM
—
Prompt tokens
589
Output tokens
256
Prefill tokens
—
Context length
131072
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
Extra flags
—
Notes
GPT-OSS 20B Q8_0 (12 GB) on 1x R9700 single-card via llama-swap (-ts 0/1/0, c=131072, no kv quant). 3.6B active params (MoE). Best of 3 streamed runs - variance 0.27 tok/s across runs.
Reactions
—
Submitted
May 5, 2026, 3:24 AM
Last edited
—
89.8
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
9971ms
32k · 3d ago
Model
nvidia/Nemotron-Cascade-2-30B-A3B
Display name
Nemotron-Cascade-2-30B-A3B
Base model
—
Revision
main
Family
—
Parameters
32B
Active params
—
MoE
yes
Output tok/s
89.8
Prefill tok/s
—
Total tok/s
2459.9
TTFT
9971.2ms
Peak VRAM
—
Prompt tokens
31364
Output tokens
256
Prefill tokens
—
Context length
31620
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
Extra flags
—
Notes
Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2x R9700 (-ts 1/1, c=262144, q8_0 KV, FA auto). Depth-31K test (~30K input prompt). TG = 89.78 tok/s, only -5.9% vs depth-0 (95.39 tok/s). Cold-prefill TTFT shown (9971ms); warm-cache runs hit 560ms via --cache-reuse 256. Best tok/s of 3 streamed runs.
Reactions
—
Submitted
May 5, 2026, 3:19 AM
Last edited
—
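A hedged way to reproduce the cold-vs-warm TTFT split from that note: send the same ~30K-token prompt twice and read curl's time_starttransfer as a TTFT proxy; with --cache-reuse 256 the second request hits the cached prefix. Port 8080 and the prompt file are assumptions.

PROMPT=$(jq -Rs . < depth31k_prompt.txt)   # JSON-escape the long prompt
for run in cold warm; do
  curl -s -o /dev/null -w "$run TTFT: %{time_starttransfer}s\n" \
    http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":'"$PROMPT"'}],"stream":true,"max_tokens":256}'
done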
28B · Qwen
19.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
171ms
16k · 3d ago
Model
Qwen/Qwen3.6-27B
Display name
Qwen3.6-27B
Base model
—
Revision
main
Family
Qwen
Parameters
28B
Active params
—
MoE
no
Output tok/s
19.9
Prefill tok/s
—
Total tok/s
60.5
TTFT
171.0ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
16384
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
Extra flags
—
Notes
Qwen3.6-27B Q8_0 dense (28.5 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=16384, q8_0 KV, FA auto). Single-card variant: +10.3% TG vs prior 3-card 262K submission (18.08 tok/s). Best of 3 streamed runs.
Reactions
—
Submitted
May 5, 2026, 3:14 AM
Last edited
—
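The single-card-vs-split comparison in that note is straightforward to A/B with llama-bench, which ships alongside llama-server and accepts the same -ts syntax; a sketch (the prompt/gen sizes are arbitrary choices, not the submitter's):

BIN=/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-bench
M=/home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf
$BIN -m "$M" -ngl 99 -p 512 -n 128 -ts 1/0/0   # pin everything to card 1
$BIN -m "$M" -ngl 99 -p 512 -n 128 -ts 1/1/1   # split across all three cards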
17.3
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
273ms
66k · 3d ago
Model
ibm-granite/granite-4.1-30b
Display name
granite-4.1-30b
Base model
—
Revision
main
Family
—
Parameters
29B
Active params
—
MoE
no
Output tok/s
17.3
Prefill tok/s
—
Total tok/s
—
TTFT
273.1ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
65536
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
Extra flags
—
Notes
IBM Granite 4.1 30B dense Q8_0 (29 GB) on 2x R9700 via llama-swap (-ts 1/1, c=65536, q8_0 KV, FA auto). Single-card not viable - 29 GB weights + heavy GQA KV (256 KB/tok) overflows 32 GB VRAM. Best of 3 streamed runs.
Reactions
—
Submitted
May 5, 2026, 3:03 AM
Last edited
—
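Back-of-envelope check of that sizing claim, using the note's own figures (256 KB/token of KV, 65536-token context, ~29 GB of Q8_0 weights):

echo "KV GB:    $(echo '256 * 65536 / 1024 / 1024' | bc)"   # -> 16
echo "total GB: $(echo '29 + 16' | bc)"                     # -> 45, vs 32 GB on one card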
24B · Mistral
23.9
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
244ms
25k · 3d ago
Model
mistralai/Devstral-Small-2-24B-Instruct-2512
Display name
Devstral-Small-2-24B-Instruct-2512
Base model
Mistral-Small-3.1-24B-Base-2503
Revision
main
Family
Mistral
Parameters
24B
Active params
—
MoE
no
Output tok/s
23.9
Prefill tok/s
—
Total tok/s
200.6
TTFT
244.2ms
Peak VRAM
—
Prompt tokens
1078
Output tokens
139
Prefill tokens
—
Context length
24576
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
Extra flags
—
Notes
Devstral-Small-2-24B-Instruct-2512 Q8_0 (24 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=24576, q8_0 KV, FA auto). Single-card variant. Best of 3 streamed runs.
Reactions
—
Submitted
May 5, 2026, 2:58 AM
Last edited
—
95.8
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
212ms
1k · 3d ago
Model
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Display name
Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Base model
—
Revision
main
Family
—
Parameters
33B
Active params
—
MoE
yes
Output tok/s
95.8
Prefill tok/s
—
Total tok/s
275.2
TTFT
212.5ms
Peak VRAM
—
Prompt tokens
538
Output tokens
256
Prefill tokens
—
Context length
794
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.6
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144
Extra flags
--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1
Notes
NVIDIA-Nemotron-3-Nano-Omni 30B-A3B Reasoning Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.
Reactions
—
Submitted
May 5, 2026, 2:53 AM
Last edited
—
95.4
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
216ms
1k · 3d ago
Model
nvidia/Nemotron-Cascade-2-30B-A3B
Display name
Nemotron-Cascade-2-30B-A3B
Base model
—
Revision
main
Family
—
Parameters
32B
Active params
—
MoE
yes
Output tok/s
95.4
Prefill tok/s
—
Total tok/s
280.0
TTFT
216.0ms
Peak VRAM
—
Prompt tokens
556
Output tokens
256
Prefill tokens
—
Context length
812
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144
Extra flags
--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1
Notes
Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.
Reactions
—
Submitted
May 5, 2026, 2:47 AM
Last edited
—
28B · Qwen
20.1
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0
807ms
1k · 3d ago
Model
Qwen/Qwen3.6-27B
Display name
Qwen3.6-27B
Base model
—
Revision
main
Family
Qwen
Parameters
28B
Active params
—
MoE
no
Output tok/s
20.1
Prefill tok/s
—
Total tok/s
58.1
TTFT
807.5ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
787
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b9025-1-ga267d945e
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
20
Min P
0
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384
Extra flags
--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 5, 2026, 2:35 AM
Last edited
—
125B · Qwen
27.2
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE
56499ms
66k · 3d ago
Model
Qwen/Qwen3.5-122B-A10B
Display name
Qwen3.5-122B-A10B
Base model
—
Revision
main
Family
Qwen
Parameters
125B
Active params
—
MoE
yes
Output tok/s
27.2
Prefill tok/s
—
Total tok/s
11.1
TTFT
56499.3ms
Peak VRAM
—
Prompt tokens
533
Output tokens
161
Prefill tokens
—
Context length
65536
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
1
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
40
Min P
0.01
Repeat penalty
1
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}' --flash-attn on
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}
Notes
B+C config: --parallel 1 + --flash-attn on. Best of 3 runs on 3× R9700 (Vulkan).
Reactions
—
Submitted
May 5, 2026, 12:41 AM
Last edited
—
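The same relation explains the odd-looking pairing above of a 27.2 tok/s output rate with an 11.1 tok/s total: the 56.5 s TTFT at 64K context dominates the wall time,

$$ \frac{533 + 161}{56.5\ \text{s} + 161 / 27.2\ \text{tok/s}} = \frac{694}{62.4\ \text{s}} \approx 11.1\ \text{tok/s}. $$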
125B · Qwen
27.3
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE
219ms
1k · 3d ago
Model
Qwen/Qwen3.5-122B-A10B
Display name
Qwen3.5-122B-A10B
Base model
—
Revision
main
Family
Qwen
Parameters
125B
Active params
—
MoE
no
Output tok/s
27.3
Prefill tok/s
—
Total tok/s
113.5
TTFT
218.8ms
Peak VRAM
—
Prompt tokens
533
Output tokens
161
Prefill tokens
—
Context length
694
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
1
Top P
0.95
Top K
40
Min P
0.01
Repeat penalty
1
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'
Extra flags
--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}
Notes
llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Qwen3.5-122B-A10B MXFP4_MOE GGUF, tensor split 1/1/1, flash-attn auto, reasoning disabled. Best of 3 runs.
Reactions
—
Submitted
May 4, 2026, 11:53 PM
Last edited
—
120B · Gpt
70.7
tok/s
3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16
236ms
1k · 3d ago
Model
openai/gpt-oss-120b
Display name
gpt-oss-120b
Base model
—
Revision
main
Family
Gpt
Parameters
120B
Active params
—
MoE
no
Output tok/s
70.7
Prefill tok/s
—
Total tok/s
219.1
TTFT
236.4ms
Peak VRAM
—
Prompt tokens
589
Output tokens
256
Prefill tokens
—
Context length
845
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 96GB
GPU slots
—
GPU count
3
VRAM
96GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
F16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto.
Reactions
—
Submitted
May 4, 2026, 11:00 PM
Last edited
—
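The q8_0 KV cache is what keeps -c 131072 affordable here. As a rough sizing model (generic transformer arithmetic; the layer and head counts below are illustrative assumptions, not gpt-oss-120b's actual config):

$$ \text{KV bytes} \approx 2 \times n_{\text{layers}} \times n_{\text{kv-heads}} \times d_{\text{head}} \times n_{\text{ctx}} \times b, $$

with $b$ the bytes per element (2 for f16, roughly half that for q8_0). For a hypothetical 36-layer model with 8 KV heads of dim 64, a 131,072-token f16 cache comes to about 9.7 GB, so quantizing it to q8_0 frees several GB of VRAM per full-length slot; models with sliding-window layers or different GQA shapes will differ.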
120B · Gpt
70.0
tok/s
3x AMD Radeon AI Pro R9700 32GB
llama.cpp · F16
236ms
1k · 3d ago
Model
openai/gpt-oss-120b
Display name
gpt-oss-120b
Base model
—
Revision
main
Family
Gpt
Parameters
120B
Active params
—
MoE
no
Output tok/s
70.0
Prefill tok/s
—
Total tok/s
217.0
TTFT
236.4ms
Peak VRAM
—
Prompt tokens
589
Output tokens
256
Prefill tokens
—
Context length
845
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
3
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
F16
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto. Best of 3 runs.
Reactions
—
Submitted
May 4, 2026, 10:42 PM
Last edited
—
73B · Qwen
20.9
tok/s
3x AMD Radeon AI Pro R9700 32GB
llama.cpp · Q4_0
272ms
131k · 3d ago
Model
moonshotai/Kimi-Dev-72B
Display name
Kimi-Dev-72B
Base model
Qwen2.5-72B
Revision
main
Family
Qwen
Parameters
73B
Active params
—
MoE
no
Output tok/s
20.9
Prefill tok/s
—
Total tok/s
64.5
TTFT
271.5ms
Peak VRAM
—
Prompt tokens
551
Output tokens
256
Prefill tokens
—
Context length
131072
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
3
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
99
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
yes
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
2
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
0.7
Top P
0.8
Top K
20
Min P
—
Repeat penalty
1.05
Mirostat
—
Command
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags
--no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes
Vulkan 3× R9700 + speculative decoding: Qwen2.5-Coder-0.5B-Instruct Q8_0 draft, --spec-draft-n-max 8 --spec-draft-p-min 0.7. Best of 3.
Reactions
—
Submitted
May 4, 2026, 10:09 PM
Last edited
—
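For scale, the non-speculative Kimi-Dev-72B submission further down (same Q4_0 quant and GPUs, 65K rather than 128K context) reports 11.3 tok/s, so the 0.5B draft model buys roughly

$$ \frac{20.9}{11.3} \approx 1.85\times $$

on this prompt. The two runs differ in context size, so treat this as indicative rather than a controlled A/B; a properly paired comparison appears later in the log.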
28B · Qwen
18.1
tok/s
3x AMD Radeon AI Pro R9700 32GB
llama.cpp · Q8_0
260ms
262k · 3d ago
Model
Qwen/Qwen3.6-27B
Display name
Qwen3.6-27B
Base model
—
Revision
main
Family
Qwen
Parameters
28B
Active params
—
MoE
no
Output tok/s
18.1
Prefill tok/s
—
Total tok/s
54.6
TTFT
259.9ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
262144
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
3
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
Q8_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
—
Extra flags
—
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 4, 2026, 9:36 PM
Last edited
—
73B · Qwen
11.3
tok/s
3x AMD Radeon AI Pro R9700 32GB
llama.cpp · Q4_0
259ms
66k · 3d ago
Model
moonshotai/Kimi-Dev-72B
Display name
Kimi-Dev-72B
Base model
Qwen2.5-72B
Revision
main
Family
Qwen
Parameters
73B
Active params
—
MoE
no
Output tok/s
11.3
Prefill tok/s
—
Total tok/s
35.4
TTFT
258.6ms
Peak VRAM
—
Prompt tokens
551
Output tokens
256
Prefill tokens
—
Context length
65536
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
3
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
Q4_0
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
—
Extra flags
—
Notes
llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Reactions
—
Submitted
May 4, 2026, 9:31 PM
Last edited
—
125B · Qwen
26.8
tok/s
3x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE
232ms
66k · 3d ago
Model
Qwen/Qwen3.5-122B-A10B
Display name
Qwen3.5-122B-A10B
Base model
—
Revision
main
Family
Qwen
Parameters
125B
Active params
—
MoE
no
Output tok/s
26.8
Prefill tok/s
—
Total tok/s
80.6
TTFT
232.2ms
Peak VRAM
—
Prompt tokens
531
Output tokens
256
Prefill tokens
—
Context length
65536
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
3x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
3
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
—
Extra flags
—
Notes
noctrex/Qwen3.5-122B-A10B-MXFP4_MOE-GGUF on 3× R9700 (96 GB pool), llama.cpp Vulkan, -ts 1/1/1, ctx 64K. Best of 3 runs.
Reactions
—
Submitted
May 4, 2026, 6:50 PM
Last edited
—
80B · Qwen
80.8
tok/s
2x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE
250ms
1k · 3d ago
Model
Qwen/Qwen3-Coder-Next
Display name
Qwen3-Coder-Next
Base model
—
Revision
main
Family
Qwen
Parameters
80B
Active params
—
MoE
no
Output tok/s
80.8
Prefill tok/s
—
Total tok/s
279.7
TTFT
249.6ms
Peak VRAM
—
Prompt tokens
530
Output tokens
187
Prefill tokens
—
Context length
717
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
2
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
—
Extra flags
—
Notes
llama.cpp Vulkan, dual R9700 (-ts 1/1), c=262144, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.
Reactions
—
Submitted
May 4, 2026, 5:05 PM
Last edited
—
27B · Gemma
87.3
tok/s
2x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE
375ms
1k · 3d ago
Model
google/gemma-4-26B-A4B-it
Display name
gemma-4-26B-A4B-it
Base model
gemma-4-26B-A4B
Revision
main
Family
Gemma
Parameters
27B
Active params
—
MoE
no
Output tok/s
87.3
Prefill tok/s
—
Total tok/s
239.9
TTFT
375.3ms
Peak VRAM
—
Prompt tokens
538
Output tokens
256
Prefill tokens
—
Context length
794
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
2
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
b8974-5-g66bafdcf1
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
—
Extra flags
—
Notes
llama.cpp Vulkan, single R9700 (-ts 0/1), c=65536, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.
Reactions
—
Submitted
May 4, 2026, 4:56 PM
Last edited
—
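Several of these notes describe the methodology as "best of 3 streamed runs, ~512 input / 256 output". A minimal sketch of how such a measurement can be taken against llama-server's OpenAI-compatible endpoint follows; the URL, prompt, and the one-token-per-SSE-chunk approximation are assumptions, not details taken from these submissions.

```python
"""Time a streamed llama-server completion: TTFT and output tok/s.

Sketch only: assumes llama-server on localhost:8080 and that each SSE
content chunk carries roughly one generated token (typically true for
llama.cpp streaming, but worth verifying against the server's own
timing output).
"""
import json
import time

import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed address


def timed_stream(prompt: str, max_tokens: int = 256) -> dict:
    payload = {
        "model": "default",  # llama-server serves one model; name is not checked
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }
    start = time.perf_counter()
    first = None   # time of first content chunk => TTFT
    tokens = 0     # ~1 token per content chunk
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("delta", {})
            if delta.get("content"):
                if first is None:
                    first = time.perf_counter()
                tokens += 1
    end = time.perf_counter()
    if first is None or tokens == 0:
        return {"ttft_ms": float("nan"), "out_tok_s": 0.0}
    return {
        "ttft_ms": (first - start) * 1000.0,
        "out_tok_s": tokens / (end - first) if end > first else 0.0,
    }


# "Best of 3 runs": keep the fastest of three streamed completions.
runs = [timed_stream("Summarize the tradeoffs of KV-cache quantization.")
        for _ in range(3)]
print(max(runs, key=lambda m: m["out_tok_s"]))
```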
92.9
tok/s
2x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE
110ms
1k · 3d ago
Model
zai-org/GLM-4.7-Flash
Display name
GLM-4.7-Flash
Base model
—
Revision
main
Family
—
Parameters
31B
Active params
—
MoE
no
Output tok/s
92.9
Prefill tok/s
—
Total tok/s
273.4
TTFT
109.6ms
Peak VRAM
—
Prompt tokens
527
Output tokens
256
Prefill tokens
—
Context length
783
Batch size
1
Hardware class
DISCRETE_GPU
Hardware
2x AMD Radeon AI Pro R9700 32GB
GPU slots
—
GPU count
2
VRAM
32GB
Chip vendor
—
Chip family
—
Chip variant
—
Unified memory
—
NPU TOPS
—
CPU
AMD Ryzen 9 5950X
RAM
64GB
OS
Arch Linux
Power
—
Engine
llama.cpp
Engine version
—
Quantization
MXFP4_MOE
Backend
vulkan
Tensor parallel
—
Pipeline parallel
—
GPU layers
—
Split mode
—
KV cache dtype
—
KV cache size
—
Prefix caching
—
Attention backend
—
Flash attention
—
Chunked prefill
—
Prefill chunk
—
Continuous batching
—
CPU offload
—
CPU layers
—
Rope scaling
—
Rope scale
—
Yarn ext factor
—
Engine quant
—
SGLang quant
—
GPU mem util
—
Max running seqs
—
Scheduler delay
—
Num parallel
—
Concurrency
—
Spec decoding
no
Spec method
—
Spec model
—
Spec draft model
—
Spec tokens
—
Spec ngram
—
Spec draft TP
—
MTP enabled
no
MTP draft layers
—
Temperature
—
Top P
—
Top K
—
Min P
—
Repeat penalty
—
Mirostat
—
Command
—
Extra flags
—
Notes
llama.cpp Vulkan, single R9700 (-ts 0/1), c=131072, kv q8_0, flash-attn auto. Best of 3 streamed runs, ~512 input / 256 output.
Reactions
—
Submitted
May 4, 2026, 4:45 PM
Last edited
—
| Model | Hardware | Engine | tok/s out | prefill | tok/s total | TTFT | ctx | Reactions | Date | Share |
|---|---|---|---|---|---|---|---|---|---|---|
| MiniMax-M2.7 229B · Minimax | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp F16 | 30.7 | — | 95.0 | 57ms | 1k | — | today | |
Details: MiniMaxAI/MiniMax-M2.7 (229B, Minimax) · out 30.7 / total 95.0 tok/s · TTFT 57.1ms · 541 prompt / 256 output tokens · ctx 797 · llama.cpp b9049-2-g79b2f3239-dirty (F16, vulkan, -ngl 99, flash-attn, parallel 2) · temp 1.0, top-p 0.95 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 --main-gpu 1 -c 16384
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1 --main-gpu 1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 8, 2026, 8:02 AM
| Mistral-Medium-3.5-128B 128B · Mistral | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp F16 | 6.5 | — | 27.4 | 1956ms | 1k | — | today | |
Details: mistralai/Mistral-Medium-3.5-128B (128B, Mistral) · out 6.5 / total 27.4 tok/s · TTFT 1955.6ms · 886 prompt / 256 output tokens · ctx 1142 · llama.cpp b9049-2-g79b2f3239-dirty (F16, vulkan, -ngl 99, flash-attn, parallel 2) · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 8, 2026, 7:51 AM
| gpt-oss-120b 120B · Gpt | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp F16 | 62.7 | — | 99.0 | 105ms | 2k | — | today | |
Details: openai/gpt-oss-120b (120B, Gpt) · out 62.7 / total 99.0 tok/s · TTFT 105.5ms · 589 prompt / 1000 output tokens · ctx 1589 · llama.cpp b9049-2-g79b2f3239-dirty (F16, vulkan, -ngl 99, flash-attn, parallel 2) · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 8, 2026, 7:44 AM
| MiniMax-M2.7 229B · Minimax | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp IQ4_XS | 25.2 | — | 79.9 | 56ms | 33k | — | today | |
Details: MiniMaxAI/MiniMax-M2.7 (229B, Minimax) · out 25.2 / total 79.9 tok/s · TTFT 56.1ms · 560 prompt / 256 output tokens · ctx 32768 · llama.cpp b9049-2-g79b2f3239-dirty (IQ4_XS, vulkan, -ngl 99, flash-attn, parallel 1) · temp 1.0, top-p 0.95 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k
Submitted: May 8, 2026, 7:13 AM
| MiniMax-M2.7 229B · Minimax | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp IQ4_XS | 25.2 | — | 52.9 | 5267ms | 1k | — | today | |
Details: MiniMaxAI/MiniMax-M2.7 (229B, Minimax) · out 25.2 / total 52.9 tok/s · TTFT 5266.6ms · 560 prompt / 256 output tokens · ctx 816 · llama.cpp b9049-2-g79b2f3239-dirty (IQ4_XS, vulkan, -ngl 99, flash-attn, parallel 1) · temp 1.0, top-p 0.95 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k
Submitted: May 8, 2026, 7:10 AM
| MiniMax-M2.7 229B · Minimax | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp IQ4_XS | 31.8 | — | 90.8 | 771ms | 33k | — | 1d ago | |
Details: MiniMaxAI/MiniMax-M2.7 (229B, Minimax) · out 31.8 / total 90.8 tok/s · TTFT 770.5ms · 546 prompt / 256 output tokens · ctx 32768 · llama.cpp b9049-2-g79b2f3239-dirty (IQ4_XS, vulkan, -ngl 99, flash-attn, parallel 1) · temp 1.0, top-p 0.95 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: REAP'd 172B IQ4_XS imatrix. ctx=32768, parallel=1. Thinking disabled.
Submitted: May 7, 2026, 4:58 AM
| MiniMax-M2.7 229B · Minimax | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp IQ4_XS | 32.6 | — | 101.4 | 46ms | 1k | — | 1d ago | |
Details: MiniMaxAI/MiniMax-M2.7 (229B, Minimax) · out 32.6 / total 101.4 tok/s · TTFT 46.3ms · 546 prompt / 256 output tokens · ctx 802 · llama.cpp b9049-2-g79b2f3239-dirty (IQ4_XS, vulkan, -ngl 99, flash-attn, parallel 2) · temp 1.0, top-p 0.95 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: REAP'd 172B IQ4_XS imatrix (mradermacher). Thinking disabled via system prompt.
Submitted: May 7, 2026, 4:39 AM
| Llama-2-7b 7B · Llama | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q4_0 | 110.1 | — | 347.5 | 171ms | 1k | — | 1d ago | |
Details: meta-llama/Llama-2-7b (7B, Llama) · out 110.1 / total 347.5 tok/s · TTFT 171.1ms · 611 prompt / 256 output tokens · ctx 867 · llama.cpp b9049-2-g79b2f3239-dirty (Q4_0, vulkan, -ngl 99, parallel 2) · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/llama/llama-2-7b.Q4_0.gguf -ts 0/1 -c 4096
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ts 0/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 7, 2026, 3:49 AM
| Qwen3.5-35B-A3B 35B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q3_K_M | 108.4 | — | 306.4 | 208ms | 1k | — | 1d ago | |
Details: Qwen/Qwen3.5-35B-A3B (35B, Qwen; base Qwen3.5-35B-A3B-Base) · out 108.4 / total 306.4 tok/s · TTFT 207.7ms · 531 prompt / 256 output tokens · ctx 787 · llama.cpp b9049-2-g79b2f3239-dirty (Q3_K_M, vulkan, -ngl 99, flash-attn, parallel 2) · temp 0.7, top-p 0.8, top-k 20, min-p 0 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/qwen/Qwen3.5-35B-A3B-Uncensored.Q3_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 262144
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 0/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 7, 2026, 3:43 AM
| SWE-Dev-32B 33B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q8_0 | 16.2 | — | 49.0 | 699ms | 1k | — | 1d ago | |
Details: zai-org/SWE-Dev-32B (33B, Qwen; base Qwen2.5-Coder-32B-Instruct) · out 16.2 / total 49.0 tok/s · TTFT 698.6ms · 551 prompt / 256 output tokens · ctx 807 · llama.cpp b9049-2-g79b2f3239-dirty (Q8_0, vulkan, -ngl 99, parallel 2) · temp 0.7, top-p 0.95, repeat-penalty 1.05 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m "/home/mikekey/models/gguf/misc/swe-dev-32b-q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 1/1 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --prio 3 --ts 1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 7, 2026, 3:36 AM
| Qwen3-VL-8B-Instruct 9B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q4_K_M | 95.9 | — | 276.0 | 179ms | 1k | — | 1d ago | |
Details: Qwen/Qwen3-VL-8B-Instruct (9B, Qwen) · out 95.9 / total 276.0 tok/s · TTFT 178.8ms · 530 prompt / 256 output tokens · ctx 786 · llama.cpp b9049-2-g79b2f3239-dirty (Q4_K_M, vulkan, -ngl 99, parallel 2) · temp 0.7, top-p 0.8, top-k 20 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3VL-8B-Instruct-Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 -ts 0/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --ts 0/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 7, 2026, 3:29 AM
| Llama-3.2-3B-Instruct 3B · Llama | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp BF16 | 79.9 | — | 239.7 | 184ms | 1k | — | 1d ago | |
Details: meta-llama/Llama-3.2-3B-Instruct (3B, Llama) · out 79.9 / total 239.7 tok/s · TTFT 183.6ms · 556 prompt / 256 output tokens · ctx 812 · llama.cpp b9049-2-g79b2f3239-dirty (BF16, vulkan, -ngl 99, parallel 2) · temp 0.7, top-p 0.9 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/llama/Llama-3.2-3B-Instruct-BF16.gguf --temp 0.7 --top-p 0.9 -ts 0/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --ts 0/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 7, 2026, 3:23 AM
| Qwen3.5-4B 4B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q8_0 | 84.9 | — | 244.8 | 198ms | 1k | — | 1d ago | |
Details: Qwen/Qwen3.5-4B (4B, Qwen; base Qwen3.5-4B-Base) · out 84.9 / total 244.8 tok/s · TTFT 197.6ms · 531 prompt / 256 output tokens · ctx 787 · llama.cpp b9049-2-g79b2f3239-dirty (Q8_0, vulkan, -ngl 99, parallel 2) · temp 0.7, top-p 0.95, repeat-penalty 1.05 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m "/home/mikekey/models/gguf/misc/Opus4.7-GODs.Ghost.Codex.Distill.4B-Q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 0/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --ts 0/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 7, 2026, 3:06 AM
| Kimi-Dev-72B 73B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q4_0 | 15.5 | — | 56.7 | 266ms | 1k | — | 1d ago | |
Details: moonshotai/Kimi-Dev-72B (73B, Qwen; base Qwen2.5-72B) · out 15.5 / total 56.7 tok/s · TTFT 265.7ms · 697 prompt / 256 output tokens · ctx 953 · llama.cpp b9020-10-g17df5830e (Q4_0, vulkan, -ngl 99, flash-attn, parallel 2) · temp 0.7, top-p 0.8, top-k 20, repeat-penalty 1.05 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + Qwen2.5-Coder-0.5B-Q8_0 draft (n-max=8, n-min=1, p-min=0.7), 128K ctx. Code-completion prompt — paired baseline (no spec) at ~11.4 tok/s shows ~+35% gain from speculative decoding on code-shaped output. (See the arithmetic check after this table.)
Submitted: May 6, 2026, 7:22 PM
| Kimi-Dev-72B 73B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q4_0 | 11.5 | — | 42.3 | 250ms | 1k | — | 1d ago | |
Details: moonshotai/Kimi-Dev-72B (73B, Qwen; base Qwen2.5-72B) · out 11.5 / total 42.3 tok/s · TTFT 249.7ms · 697 prompt / 256 output tokens · ctx 953 · llama.cpp b9020-10-g17df5830e (Q4_0, vulkan, -ngl 99, flash-attn, parallel 2) · temp 0.7, top-p 0.8, top-k 20, repeat-penalty 1.05 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8 KV, 128K ctx. Code-completion prompt (Python module continuation, ~700 input tokens). Baseline for spec-decoding A/B — paired with Kimi-Dev-72B-spec submission.
Submitted: May 6, 2026, 7:14 PM
| Devstral-Small-2-24B-Instruct-2512 24B · Mistral | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q8_0 | 23.6 | — | 204.1 | 214ms | 1k | — | 1d ago | |
Details: mistralai/Devstral-Small-2-24B-Instruct-2512 (24B, Mistral; base Mistral-Small-3.1-24B-Base-2503) · out 23.6 / total 204.1 tok/s · TTFT 213.6ms · 1078 prompt / 135 output tokens · ctx 1213 · llama.cpp b9020-10-g17df5830e (Q8_0, vulkan, -ngl 99, flash-attn, parallel 2) · temp 0.15, min-p 0.01 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --temp 0.15 --min-p 0.01 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 24576
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 24K ctx, q8 KV, FA auto. Single-card layout.
Submitted: May 6, 2026, 7:02 PM
| Qwen3.6-27B 28B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q8_0 | 19.6 | — | 59.2 | 263ms | 1k | — | 1d ago | |
Details: Qwen/Qwen3.6-27B (28B, Qwen) · out 19.6 / total 59.2 tok/s · TTFT 262.8ms · 531 prompt / 256 output tokens · ctx 787 · llama.cpp b9020-10-g17df5830e (Q8_0, vulkan, -ngl 99, flash-attn, parallel 2) · temp 1.0, top-p 0.95, top-k 20, min-p 0 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384
Extra flags: --no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 16K ctx, q8 KV, FA auto. Single-card layout to eliminate cross-GPU overhead on 27B dense.
Submitted: May 6, 2026, 6:56 PM
| Qwen3.6-35B-A3B 36B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q8_0 | 80.5 | — | 232.6 | 203ms | 1k | — | 1d ago | |
Details: Qwen/Qwen3.6-35B-A3B (36B, Qwen) · out 80.5 / total 232.6 tok/s · TTFT 202.6ms · 531 prompt / 256 output tokens · ctx 787 · llama.cpp b9020-10-g17df5830e (Q8_0, vulkan, -ngl 99, flash-attn, parallel 2) · temp 0.6, top-p 0.95, top-k 20, min-p 0 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.
Submitted: May 6, 2026, 6:50 PM
| Qwen3-Coder-30B-A3B-Instruct 31B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q8_0 | 92.0 | — | 365.9 | 178ms | 1k | — | 1d ago | |
Details: Qwen/Qwen3-Coder-30B-A3B-Instruct (31B, Qwen) · out 92.0 / total 365.9 tok/s · TTFT 178.2ms · 530 prompt / 156 output tokens · ctx 686 · llama.cpp b9020-10-g17df5830e (Q8_0, vulkan, -ngl 99, flash-attn, parallel 2) · temp 0.7, top-p 0.8, top-k 20, min-p 0, repeat-penalty 1.05 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.
Submitted: May 6, 2026, 6:42 PM
| qwen3-coder-30b-a3b-codemonkey 30B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp Q4_K_M | 127.2 | — | 356.1 | 194ms | 1k | — | 1d ago | |
Details: 1337Hero/qwen3-coder-30b-a3b-codemonkey (30B, Qwen; base Qwen3-Coder-30B-A3B-Instruct) · out 127.2 / total 356.1 tok/s · TTFT 193.7ms · 530 prompt / 256 output tokens · ctx 786 · llama.cpp b9020-10-g17df5830e (Q4_K_M, vulkan, -ngl 99, flash-attn, parallel 2) · temp 0.7, top-p 0.8, top-k 20, min-p 0, repeat-penalty 1.05 · host Ryzen 9 5950X, 64GB RAM, Arch Linux
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto.
Submitted: May 6, 2026, 4:37 PM
| qwen3-coder-30b-a3b-codemonkey 30B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_K_M | 126.5 | — | 352.0 | 209ms | 1k | — | 1d ago | |
Run details: qwen3-coder-30b-a3b-codemonkey
Model: 1337Hero/qwen3-coder-30b-a3b-codemonkey · base model Qwen3-Coder-30B-A3B-Instruct · family Qwen · 30B params · revision main
Result: 126.5 tok/s out · 352.0 tok/s total · TTFT 208.9ms
Workload: 530 prompt tok · 256 output tok · context 786 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9020-10-g17df5830e · vulkan backend · Q4_K_M · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.7 · top-p 0.8 · top-k 20 · min-p 0 · repeat penalty 1.05
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.from-Q8_0.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto, Q4_K_M derived from Q8_0 source.
Submitted: May 6, 2026, 4:32 PM
| qwen3-coder-30b-a3b-codemonkey 30B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 90.0 | — | 257.8 | 203ms | 1k | — | 1d ago | |
Run details: qwen3-coder-30b-a3b-codemonkey
Model: 1337Hero/qwen3-coder-30b-a3b-codemonkey · base model Qwen3-Coder-30B-A3B-Instruct · family Qwen · 30B params · revision main
Result: 90.0 tok/s out · 257.8 tok/s total · TTFT 203.3ms
Workload: 530 prompt tok · 256 output tok · context 786 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9020-10-g17df5830e · vulkan backend · Q8_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.7 · top-p 0.8 · top-k 20 · min-p 0 · repeat penalty 1.05
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 2× R9700, 32K ctx, q8 KV, FA auto.
Submitted: May 6, 2026, 4:26 PM
| qwen3-coder-30b-a3b-codemonkey 30B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · BF16 | 20.3 | — | 61.2 | 222ms | 1k | — | 1d ago | |
Run details: qwen3-coder-30b-a3b-codemonkey
Model: 1337Hero/qwen3-coder-30b-a3b-codemonkey · base model Qwen3-Coder-30B-A3B-Instruct · family Qwen · 30B params · revision main
Result: 20.3 tok/s out · 61.2 tok/s total · TTFT 221.6ms
Workload: 530 prompt tok · 256 output tok · context 786 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9020-10-g17df5830e · vulkan backend · BF16 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.7 · top-p 0.8 · top-k 20 · min-p 0 · repeat penalty 1.05
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.BF16.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 3× R9700, 16K ctx, q8 KV, FA auto.
Submitted: May 6, 2026, 4:19 PM
| Mistral-Medium-3.5-128B 128B · Mistral | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_0 | 7.4 | — | 31.4 | 2033ms | 1k | — | 1d ago | |
Run details: Mistral-Medium-3.5-128B
Model: mistralai/Mistral-Medium-3.5-128B · family Mistral · 128B params · revision main
Result: 7.4 tok/s out · 31.4 tok/s total · TTFT 2032.8ms
Workload: 886 prompt tok · 256 output tok · context 1142 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9020-10-g17df5830e · vulkan backend · Q4_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8_0 KV, 16K ctx, ts 1/1/1, FA auto.
Submitted: May 6, 2026, 12:50 PM
| Mistral-Medium-3.5-128B 128B · Mistral | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_0 | 6.9 | — | 30.4 | 317ms | 16k | — | 2d ago | |
Run details: Mistral-Medium-3.5-128B
Model: mistralai/Mistral-Medium-3.5-128B · family Mistral · 128B params · revision main
Result: 6.9 tok/s out · 30.4 tok/s total · TTFT 317.4ms
Workload: 886 prompt tok · 256 output tok · context 16384 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b8979-66bafdcf1 · vulkan backend · Q4_0 · GPU layers 99 · q8_0 KV cache · flash attention yes · 1 parallel slot
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384
Extra flags: -ctk q8_0 -ctv q8_0 -ts 1/1/1 --no-mmap --cache-ram 0 --no-cache-idle-slots -c 16384
Notes: llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Mistral Medium 3.5 128B Q4_0 GGUF, 3 shards, tensor split 1/1/1, q8_0 KV cache, flash-attn auto, 16K ctx, single slot. Median of 3 runs. Submitted with running binary build_info b8979-66bafdcf1.
Submitted: May 6, 2026, 6:40 AM
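For quick A/B checks of layouts like these without going through the llama-swap/server path, llama.cpp's bundled llama-bench gives a comparable prefill/decode read. A minimal sketch, assuming the same sharded Q4_0 files (pointing -m at the first shard is enough; llama.cpp discovers the remaining shards):

# -p prefill tokens, -n generated tokens, -r repetitions (handy for a median)
llama-bench -m ~/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf \
  -ngl 99 -p 886 -n 256 -r 3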
| Mistral-Medium-3.5-128B 128B · Mistral | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q4_0 | 6.2 | — | 26.4 | 2082ms | 1k | — | 2d ago | |
Run details: Mistral-Medium-3.5-128B
Model: mistralai/Mistral-Medium-3.5-128B · family Mistral · 128B params · revision main
Result: 6.2 tok/s out · 26.4 tok/s total · TTFT 2081.9ms
Workload: 886 prompt tok · 256 output tok · context 1142 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9020-10-g17df5830e · vulkan backend · Q4_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.7 · top-p 0.95 · min-p 0
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf --temp 0.7 --top-p 0.95 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: 3x R9700 Vulkan, -ts 1/1/1, q8 KV, flash-attn auto, 32K ctx.
Submitted: May 6, 2026, 6:01 AM
| granite-4.1-30b 29B | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 17.9 | — | 54.1 | 224ms | 1k | — | 2d ago | |
Run details: granite-4.1-30b
Model: ibm-granite/granite-4.1-30b · 29B params · revision main
Result: 17.9 tok/s out · 54.1 tok/s total · TTFT 223.9ms
Workload: 531 prompt tok · 256 output tok · context 787 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9020-10-g17df5830e · vulkan backend · Q8_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/ibm-granite_granite-4.1-30b-Q8_0.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 65536
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 6, 2026, 4:57 AM
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 33B | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 107.3 | — | 304.9 | 218ms | 1k | — | 2d ago | |
Run details: Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 · 33B params · revision main
Result: 107.3 tok/s out · 304.9 tok/s total · TTFT 217.8ms
Workload: 538 prompt tok · 256 output tok · context 794 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9020-10-g17df5830e · vulkan backend · Q8_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.6 · top-p 0.95
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144
Extra flags: --no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1
Notes: llama.cpp Vulkan on 3x R9700 via llama-swap, ts 1/1 (2-card). Median of 3 runs (selected for representative TTFT).
Submitted: May 6, 2026, 4:45 AM
| Qwen3-32B 32B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | vllm · FP8 | 22.8 | — | 69.5 | 60ms | 1k | — | 2d ago | |
Run details: Qwen3-32B
Model: Qwen/Qwen3-32B · family Qwen · 32B params · revision main
Result: 22.8 tok/s out · 69.5 tok/s total · TTFT 60.0ms · peak VRAM 34.2GB
Workload: 530 prompt tok · 256 output tok · context 786 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.18.1.dev0+gbcf2be961.d20260330 · rocm backend · FP8 · TP 2 · chunked prefill (2048-token chunks) · GPU mem util 0.95 · max running seqs 4
Command: vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3-32B-FP8 --max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}'
Extra flags: --max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}
Notes: post-reboot validation, 2x R9700 TP=2 dense FP8.
Submitted: May 6, 2026, 4:12 AM
| Qwen3.6-35B-A3B 36B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | vllm · MXFP4_MOE | 112.8 | — | 322.8 | 168ms | 1k | — | 2d ago | |
Run details: Qwen3.6-35B-A3B
Model: Qwen/Qwen3.6-35B-A3B · family Qwen · 36B params · revision main
Result: 112.8 tok/s out · 322.8 tok/s total · TTFT 168ms · peak VRAM 32.0GB
Workload: 531 prompt tok · 256 output tok · context 787 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.18.1.dev0+gbcf2be961.d20260330 · rocm backend · MXFP4_MOE · TP 2 · chunked prefill (2048-token chunks) · GPU mem util 0.95 · max running seqs 4
Command: vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'
Extra flags: --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}
Notes: vLLM on 3× R9700 (TP=2, cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH (bypasses RCCL 7.2.x gfx1201 tuning-index miss / TP=2 deadlock). TunableOp tuned ~300 rows/rank from cudagraph capture + diverse-load driving. MoE backend Triton (gfx950 gate routes only gfx950 to CK on this vLLM build). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions, post heavy warmup.
Submitted: May 6, 2026, 3:18 AM
| Qwen3.6-35B-A3B 36B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | vllm · MXFP4_MOE | 98.9 | — | 285.2 | 171ms | 1k | — | 2d ago | |
Run details: Qwen3.6-35B-A3B
Model: Qwen/Qwen3.6-35B-A3B · family Qwen · 36B params · revision main
Result: 98.9 tok/s out · 285.2 tok/s total · TTFT 171.4ms · peak VRAM 32GB
Workload: 531 prompt tok · 256 output tok · context 787 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.18.1.dev0+gbcf2be961.d20260330 · rocm backend · MXFP4_MOE · TP 2 · chunked prefill (2048-token chunks) · GPU mem util 0.95 · max running seqs 4
Command: vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'
Extra flags: --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}
Notes: vLLM on 3× R9700 (TP=2, cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH to bypass RCCL 7.2.x gfx1201 tuning-index miss (TP=2 deadlock). aiter built but MoE backend still Triton (gfx950 gate routes only gfx950 to CK). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions.
Submitted: May 6, 2026, 2:23 AM
| gpt-oss-20b 22B · Gpt | 2x AMD Radeon AI Pro R9700 64GB | vllm · MXFP4_MOE | 991.1 | — | — | 175ms | 33k | — | 2d ago | |
Run details: gpt-oss-20b
Model: openai/gpt-oss-20b · family Gpt · 22B params · revision main
Result: 991.1 tok/s out (aggregate) · TTFT 175.4ms
Workload: 589 prompt tok · 256 output tok · context 32768 · batch 16
Hardware: 2x AMD Radeon AI Pro R9700 64GB (discrete GPU, 2 GPUs, 64GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.20.1+rocm721 · rocm backend · MXFP4_MOE · TP 2 · prefix caching on · chunked prefill (4096-token chunks) · GPU mem util 0.9 · max running seqs 16
Command: LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Extra flags: --max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Notes: gpt-oss-20b MXFP4 concurrent throughput at batch=16 (matches --max-num-seqs 16). Aggregate output tok/s, best of 3 runs (991.1 / 986.8 / 990.6 -- variance under 0.5%). Per-request decode 64.8 tok/s. TTFT 175ms. This is vLLM's strength zone vs llama.cpp single-stream: ~6x throughput when serving multiple concurrent users. batch=1 single-stream was 48 tok/s (separate submission) -- same hardware, llama.cpp Q8_0 single-stream was 160 tok/s.
Submitted: May 5, 2026, 11:29 PM
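The batch=16 aggregate figure comes from keeping many streams in flight at once. A minimal sketch of reproducing that load shape (not the submitter's harness; port, served model name and prompt are placeholders), where aggregate tok/s is then roughly (16 × 256 output tokens) divided by the wall-clock seconds reported by time:

# 16 concurrent streamed requests; -P16 keeps all of them in flight
time (seq 16 | xargs -P16 -I{} curl -s -o /dev/null \
  http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"request {}"}],"max_tokens":256,"stream":true}')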
| gpt-oss-20b 22B · Gpt | 2x AMD Radeon AI Pro R9700 64GB | vllm · MXFP4_MOE | 48.3 | — | 156.7 | 89ms | 33k | — | 2d ago | |
Run details: gpt-oss-20b
Model: openai/gpt-oss-20b · family Gpt · 22B params · revision main
Result: 48.3 tok/s out · 156.7 tok/s total · TTFT 88.6ms
Workload: 589 prompt tok · 256 output tok · context 32768 · batch 1
Hardware: 2x AMD Radeon AI Pro R9700 64GB (discrete GPU, 2 GPUs, 64GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.20.1+rocm721 · rocm backend · MXFP4_MOE · TP 2 · prefix caching on · chunked prefill (4096-token chunks) · GPU mem util 0.9 · max running seqs 16
Command: LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Extra flags: --max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai
Notes: gpt-oss-20b native MXFP4 on 2x R9700 (gfx1201) TP=2, batch=1. Uses the *fused* OAI Triton MXFP4 MoE kernel (triton, not triton_unfused) — gpt-oss is the kernel's original target. Same 3 wheel patches as Qwen3.6 MXFP4 entry (PR #37826 device gate, CLI enum, dtype). RCCL 7.1.1 hotfix for #40980. Concurrent throughput much higher: 580 tok/s at c=8, 902 tok/s at c=16 (batch submissions to follow).
Submitted: May 5, 2026, 11:24 PM
| Qwen3.6-35B-A3B 36B · Qwen | 2x AMD Radeon AI Pro R9700 64GB | vllm · MXFP4_MOE | 33.6 | — | 101.7 | 130ms | 33k | — | 2d ago | |
Run details: Qwen3.6-35B-A3B
Model: Qwen/Qwen3.6-35B-A3B · family Qwen · 36B params · revision main
Result: 33.6 tok/s out · 101.7 tok/s total · TTFT 129.6ms
Workload: 533 prompt tok · 256 output tok · context 32768 · batch 1
Hardware: 2x AMD Radeon AI Pro R9700 64GB (discrete GPU, 2 GPUs, 64GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.20.1+rocm721 · rocm backend · MXFP4_MOE · TP 2 · prefix caching on · chunked prefill (2048-token chunks) · GPU mem util 0.93 · max running seqs 4
Command: LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop_merged.csv VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Extra flags: --max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Notes: Same setup as previous MXFP4 submission (PR #37826 patched, RCCL 7.1.1 hotfix, triton_unfused, dtype=bf16) — now WITH PyTorch TunableOp tuned (754 rows / 377 per TP rank). Tuned with 18-min diverse-load pass: prompt sizes 8/64/256/1024/2048 tokens x 5 content categories x concurrency 1+4. Auto-stopped on row plateau (3x60s zero-delta). Tuning gain over untuned baseline: +7.0% (31.70 -> 33.92 tok/s). Modest gain — batch=1 decode on R9700 is memory-bandwidth-bound, not GEMM-scheduling-bound.
Submitted: May 5, 2026, 11:00 PM
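The TunableOp flow here has two phases: a one-off tuning pass that records GEMM selections into the CSV, then read-only replay at serve time (PYTORCH_TUNABLEOP_TUNING=0, as in the command above). A sketch of the tuning pass under those assumptions; per the note, each TP rank records its own rows, which were merged into tunableop_merged.csv afterwards:

# tuning pass: TUNING=1 writes rows while diverse load is driven at the server
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 \
PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop.csv \
vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2  # remaining flags as in the command above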
| Qwen3.6-35B-A3B 36B · Qwen | 2x AMD Radeon AI Pro R9700 64GB | vllm · MXFP4_MOE | 31.7 | — | 96.0 | 140ms | 33k | — | 2d ago | |
Run details: Qwen3.6-35B-A3B
Model: Qwen/Qwen3.6-35B-A3B · family Qwen · 36B params · revision main
Result: 31.7 tok/s out · 96.0 tok/s total · TTFT 140.0ms
Workload: 533 prompt tok · 256 output tok · context 32768 · batch 1
Hardware: 2x AMD Radeon AI Pro R9700 64GB (discrete GPU, 2 GPUs, 64GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.20.1+rocm721 · rocm backend · MXFP4_MOE · TP 2 · prefix caching on · chunked prefill (2048-token chunks) · GPU mem util 0.93 · max running seqs 4
Command: LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MOE=0 VLLM_MXFP4_USE_MARLIN=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Extra flags: --max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only
Notes: Qwen3.6-35B-A3B MXFP4 (pahajokiconsulting quant) on 2x R9700 (gfx1201) TP=2. Baseline path (no AITER, no TunableOp, no MTP). Required THREE patches to vLLM 0.20.1+rocm721 wheel: (1) PR vllm-project/vllm#37826 in-place patch -- widen OAI Triton MoE device gate to gfx12, (2) add 'triton_unfused' to MoEBackend Literal in config/kernel.py (CLI exposure), (3) RCCL 7.1.1 LD_PRELOAD for #40980 deadlock. --moe-backend triton_unfused needed because TRITON kernel only supports SWIGLUOAI activation; UnfusedOAITritonExperts adds SILU. dtype=bfloat16 required (kernel asserts hidden_states.dtype==bf16). Best of 3 (31.6 / 31.7 / 31.6 tok/s).
Submitted: May 5, 2026, 9:49 PM
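All three fixes above are in-place edits to the installed wheel rather than a rebuild. A hedged sketch of trying such a patch against the installed tree (the -p1 level assumes standard a/ b/ diff prefixes; run the --dry-run first and apply by hand if the hunks drift):

# locate the installed vLLM package, then test the PR diff against it
VLLM_PKG=$(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')
curl -sL https://github.com/vllm-project/vllm/pull/37826.diff \
  | (cd "$VLLM_PKG/.." && patch -p1 --dry-run)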
| Qwen3-32B 32B · Qwen | 2x AMD Radeon AI Pro R9700 64GB | vllm · FP8 | 79.3 | — | — | 138ms | 33k | — | 2d ago | |
Run details: Qwen3-32B
Model: Qwen/Qwen3-32B · family Qwen · 32B params · revision main
Result: 79.3 tok/s out (aggregate) · TTFT 138.3ms
Workload: 534 prompt tok · 256 output tok · context 32768 · batch 4
Hardware: 2x AMD Radeon AI Pro R9700 64GB (discrete GPU, 2 GPUs, 64GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.20.1+rocm721 · rocm backend · FP8 · TP 2 · chunked prefill (2048-token chunks) · GPU mem util 0.9 · max running seqs 4
Command: LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3
Extra flags: --max-model-len 32768 --reasoning-parser qwen3
Notes: Concurrent throughput at batch=4 (matches --max-num-seqs 4). Aggregate output tok/s, best of 3 runs (78.7 / 78.2 / 79.3). Per-request decode ~20 tok/s. RCCL 7.1.1 hotfix for #40980 still in effect.
Submitted: May 5, 2026, 9:32 PM
| Qwen3-32B 32B · Qwen | 2x AMD Radeon AI Pro R9700 64GB | vllm · FP8 | 22.9 | — | 70.1 | 71ms | 33k | — | 2d ago | |
Run details: Qwen3-32B
Model: Qwen/Qwen3-32B · family Qwen · 32B params · revision main
Result: 22.9 tok/s out · 70.1 tok/s total · TTFT 71.5ms
Workload: 534 prompt tok · 256 output tok · context 32768 · batch 1
Hardware: 2x AMD Radeon AI Pro R9700 64GB (discrete GPU, 2 GPUs, 64GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: vllm 0.20.1+rocm721 · rocm backend · FP8 · TP 2 · chunked prefill (2048-token chunks) · GPU mem util 0.9 · max running seqs 4
Command: LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3
Extra flags: --max-model-len 32768 --reasoning-parser qwen3
Notes: vLLM 0.20.1+rocm721 on 2x AMD R9700 (gfx1201) TP=2. Workaround for vllm-project/vllm#40980: LD_PRELOAD librccl.so.1 from rccl-7.1.1 (Arch ALA) -- RCCL 7.2.x lacks gfx1201 tuning index and deadlocks on AllReduce. With hotfix: ~22.7 tok/s; without: ~0.5 tok/s.
Submitted: May 5, 2026, 9:26 PM
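The RCCL hotfix is purely a loader-path trick: the older librccl.so.1 is placed ahead of the ROCm 7.2.x copy so the HIP runtime resolves against it. A quick sanity check before launching vLLM, assuming the 7.1.1 library was extracted to the path used in the commands above:

export LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH
ls -l ~/.local/lib/rccl-7.1.1/librccl.so.1   # the copy that must win resolution
python -c 'import ctypes; ctypes.CDLL("librccl.so.1"); print("librccl.so.1 resolves")'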
| Qwen3-Coder-30B-A3B-Instruct 31B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 100.0 | — | 390.1 | 198ms | 1k | — | 2d ago | |
Run details: Qwen3-Coder-30B-A3B-Instruct
Model: Qwen/Qwen3-Coder-30B-A3B-Instruct · family Qwen · 31B params · revision main
Result: 100.0 tok/s out · 390.1 tok/s total · TTFT 198.5ms
Workload: 530 prompt tok · 156 output tok · context 686 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b8978-1-g66bafdcf1 · vulkan backend · Q8_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.7 · top-p 0.8 · top-k 20 · min-p 0 · repeat penalty 1.05
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 7:00 PM
| gpt-oss-20b 22B · Gpt | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 160.8 | — | 470.0 | 206ms | 1k | — | 3d ago | |
Run details: gpt-oss-20b
Model: openai/gpt-oss-20b · family Gpt · 22B params · revision main
Result: 160.8 tok/s out · 470.0 tok/s total · TTFT 206.1ms
Workload: 589 prompt tok · 256 output tok · context 845 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b8978-1-g66bafdcf1 · vulkan backend · Q8_0 · GPU layers 99 · 2 parallel slots
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-20b-Q8_0.gguf -ts 0/1/0 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --ts 0/1/0
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 4:52 AM
| GLM-4.7-Flash 31B | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 93.3 | — | 274.8 | 105ms | 1k | — | 3d ago | |
Run details: GLM-4.7-Flash
Model: zai-org/GLM-4.7-Flash · 31B params · revision main
Result: 93.3 tok/s out · 274.8 tok/s total · TTFT 104.8ms
Workload: 527 prompt tok · 256 output tok · context 783 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b8978-1-g66bafdcf1 · vulkan backend · Q8_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 1 · top-p 0.95 · min-p 0.01 · repeat penalty 1
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/glm/GLM-4.7-Flash-MXFP4_MOE.gguf --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 0/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 4:40 AM
| Qwen3-Coder-Next 80B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · MXFP4_MOE | 80.4 | — | 280.1 | 234ms | 1k | — | 3d ago | |
Run details: Qwen3-Coder-Next
Model: Qwen/Qwen3-Coder-Next · family Qwen · 80B params · revision main
Result: 80.4 tok/s out · 280.1 tok/s total · TTFT 233.8ms
Workload: 530 prompt tok · 187 output tok · context 717 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b8978-1-g66bafdcf1 · vulkan backend · MXFP4_MOE · GPU layers 99 · 2 parallel slots
Sampling: temp 1 · top-p 0.95 · top-k 40 · min-p 0.01 · repeat penalty 1
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-Next-MXFP4_MOE.gguf -ts 1/1 -c 262144 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00
Extra flags: --no-webui --jinja --cache-reuse 256 --ts 1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 4:31 AM
| Qwen3.5-122B-A10B 125B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · MXFP4_MOE | 26.8 | — | 110.8 | 261ms | 1k | — | 3d ago | |
Run details: Qwen3.5-122B-A10B
Model: Qwen/Qwen3.5-122B-A10B · family Qwen · 125B params · revision main
Result: 26.8 tok/s out · 110.8 tok/s total · TTFT 261.3ms
Workload: 533 prompt tok · 161 output tok · context 694 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b8978-1-g66bafdcf1 · vulkan backend · MXFP4_MOE · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 1 · top-p 0.95 · top-k 40 · min-p 0.01 · repeat penalty 1
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'
Extra flags: --no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 4:19 AM
| Qwen3-Coder-30B-A3B-Instruct 31B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 87.2 | — | 343.5 | 207ms | 1k | — | 3d ago | |
Run details: Qwen3-Coder-30B-A3B-Instruct
Model: Qwen/Qwen3-Coder-30B-A3B-Instruct · family Qwen · 31B params · revision main
Result: 87.2 tok/s out · 343.5 tok/s total · TTFT 207.0ms
Workload: 530 prompt tok · 156 output tok · context 686 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9025-1-ga267d945e · vulkan backend · Q8_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.7 · top-p 0.8 · top-k 20 · min-p 0 · repeat penalty 1.05
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 3:51 AM
| gpt-oss-120b 120B · Gpt | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · F16 | 71.5 | — | 221.7 | 233ms | 1k | — | 3d ago | |
Run details: gpt-oss-120b
Model: openai/gpt-oss-120b · family Gpt · 120B params · revision main
Result: 71.5 tok/s out · 221.7 tok/s total · TTFT 232.7ms
Workload: 589 prompt tok · 256 output tok · context 845 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9025-1-ga267d945e · vulkan backend · F16 · GPU layers 99 · flash attention yes · 2 parallel slots
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 3:41 AM
| Qwen3.6-35B-A3B 36B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 75.2 | — | 217.4 | 217ms | 1k | — | 3d ago | |
Run details: Qwen3.6-35B-A3B
Model: Qwen/Qwen3.6-35B-A3B · family Qwen · 36B params · revision main
Result: 75.2 tok/s out · 217.4 tok/s total · TTFT 217.0ms
Workload: 531 prompt tok · 256 output tok · context 787 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9025-1-ga267d945e · vulkan backend · Q8_0 · GPU layers 99 · flash attention yes · 2 parallel slots
Sampling: temp 0.6 · top-p 0.95 · top-k 20 · min-p 0
Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768
Extra flags: --no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0
Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
Submitted: May 5, 2026, 3:31 AM
| gpt-oss-20b 22B · Gpt | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 159.7 | — | 468.5 | 201ms | 131k | — | 3d ago | |
Run details: gpt-oss-20b
Model: openai/gpt-oss-20b · family Gpt · 22B params · revision main
Result: 159.7 tok/s out · 468.5 tok/s total · TTFT 200.9ms
Workload: 589 prompt tok · 256 output tok · context 131072 · batch 1
Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 GPUs, 96GB VRAM) · AMD Ryzen 9 5950X · 64GB RAM · Arch Linux
Engine: llama.cpp b9025-1-ga267d945e · vulkan backend · Q8_0
Notes: GPT-OSS 20B Q8_0 (12 GB) on 1x R9700 single-card via llama-swap (-ts 0/1/0, c=131072, no kv quant). 3.6B active params (MoE). Best of 3 streamed runs; variance 0.27 tok/s across runs.
Submitted: May 5, 2026, 3:24 AM
| Nemotron-Cascade-2-30B-A3B 32B | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 89.8 | — | 2459.9 | 9971ms | 32k | — | 3d ago | |
Details for Nemotron-Cascade-2-30B-A3B
  Model: nvidia/Nemotron-Cascade-2-30B-A3B (revision main) · Parameters: 32B · MoE: yes (A3B: ~3B active, per the naming)
  Throughput: 89.8 tok/s out · 2459.9 tok/s total · TTFT 9971.2 ms
  Workload: 31364 prompt tok / 256 output tok · context length 31620 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b9025-1-ga267d945e · Backend: Vulkan · Quantization: Q8_0 · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2x R9700 (-ts 1/1, c=262144, q8_0 KV, FA auto). Depth-31K test (~30K input prompt). TG = 89.78 tok/s, only -5.9% vs depth-0 (95.39 tok/s). Cold-prefill TTFT shown (9971ms); warm-cache runs hit 560ms via --cache-reuse 256. Best tok/s of 3 streamed runs.
  Submitted: May 5, 2026, 3:19 AM
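The 2459.9 total tok/s here is a prefill artifact, not a generation rate. Assuming the site computes total throughput as (prompt + output tokens) divided by wall time (TTFT plus decode time), the recorded figures are self-consistent:

    decode time ≈ 256 tok / 89.8 tok/s ≈ 2.85 s
    wall time   ≈ 9.97 s (cold TTFT) + 2.85 s ≈ 12.8 s
    total tok/s ≈ (31364 + 256) / 12.8 ≈ 2466, matching the reported 2459.9 to rounding

The same numbers imply a cold prefill rate of roughly 31364 / 9.97 ≈ 3,100 tok/s, even though the prefill field itself is empty.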
| Qwen3.6-27B 28B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 19.9 | — | 60.5 | 171ms | 16k | — | 3d ago | |
Details for Qwen3.6-27B
  Model: Qwen/Qwen3.6-27B (revision main) · Family: Qwen · Parameters: 28B (dense) · MoE: no
  Throughput: 19.9 tok/s out · 60.5 tok/s total · TTFT 171.0 ms
  Workload: 531 prompt tok / 256 output tok · context length 16384 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b9025-1-ga267d945e · Backend: Vulkan · Quantization: Q8_0 · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: Qwen3.6-27B Q8_0 dense (28.5 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=16384, q8_0 KV, FA auto). Single-card variant: +10.3% TG vs the prior 3-card 262K submission (18.08 tok/s). Best of 3 streamed runs.
  Submitted: May 5, 2026, 3:14 AM
| granite-4.1-30b 29B | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 17.3 | — | — | 273ms | 66k | — | 3d ago | |
Details for granite-4.1-30b
  Model: ibm-granite/granite-4.1-30b (revision main) · Parameters: 29B (dense) · MoE: no
  Throughput: 17.3 tok/s out · total not recorded · TTFT 273.1 ms
  Workload: 531 prompt tok / 256 output tok · context length 65536 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b9025-1-ga267d945e · Backend: Vulkan · Quantization: Q8_0 · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: IBM Granite 4.1 30B dense Q8_0 (29 GB) on 2x R9700 via llama-swap (-ts 1/1, c=65536, q8_0 KV, FA auto). Single-card not viable: 29 GB of weights plus heavy GQA KV (256 KB/tok) overflows 32 GB VRAM. Best of 3 streamed runs.
  Submitted: May 5, 2026, 3:03 AM
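The single-card verdict follows directly from the note's own figures. At the configured 65536-token context:

    KV cache: 65536 tok × 256 KB/tok = 16 GiB
    weights + KV: 29 GB + 16 GiB ≈ 45 GB, well past one card's 32 GB

Even if the quoted 256 KB/tok is the f16 figure and the run's q8_0 KV quant roughly halves it, the total still lands around 37 GB, so two cards remain the minimum.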
| Devstral-Small-2-24B-Instruct-2512 24B · Mistral | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 23.9 | — | 200.6 | 244ms | 25k | — | 3d ago | |
Details for Devstral-Small-2-24B-Instruct-2512
  Model: mistralai/Devstral-Small-2-24B-Instruct-2512 (revision main) · Base model: Mistral-Small-3.1-24B-Base-2503 · Family: Mistral · Parameters: 24B · MoE: no
  Throughput: 23.9 tok/s out · 200.6 tok/s total · TTFT 244.2 ms
  Workload: 1078 prompt tok / 139 output tok · context length 24576 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b9025-1-ga267d945e · Backend: Vulkan · Quantization: Q8_0 · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: Devstral-Small-2-24B-Instruct-2512 Q8_0 (24 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=24576, q8_0 KV, FA auto). Single-card variant. Best of 3 streamed runs.
  Submitted: May 5, 2026, 2:58 AM
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 33B | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 95.8 | — | 275.2 | 212ms | 1k | — | 3d ago | |
Details for Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
  Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 (revision main) · Parameters: 33B · MoE: yes (A3B: ~3B active, per the naming)
  Throughput: 95.8 tok/s out · 275.2 tok/s total · TTFT 212.5 ms
  Workload: 538 prompt tok / 256 output tok · context length 794 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b9025-1-ga267d945e · Backend: Vulkan · Quantization: Q8_0 · GPU layers: 99 · Flash attention: yes · Parallel slots: 2 · Spec decoding: no · MTP: no
  Sampling: temp 0.6 · top-p 0.95
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144
  Extra flags: --no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1
  Notes: NVIDIA-Nemotron-3-Nano-Omni 30B-A3B Reasoning Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable: weights exceed 32 GB VRAM. Best of 3 streamed runs.
  Submitted: May 5, 2026, 2:53 AM
| Nemotron-Cascade-2-30B-A3B 32B | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 95.4 | — | 280.0 | 216ms | 1k | — | 3d ago | |
Details for Nemotron-Cascade-2-30B-A3B
  Model: nvidia/Nemotron-Cascade-2-30B-A3B (revision main) · Parameters: 32B · MoE: yes (A3B: ~3B active, per the naming)
  Throughput: 95.4 tok/s out · 280.0 tok/s total · TTFT 216.0 ms
  Workload: 556 prompt tok / 256 output tok · context length 812 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b9025-1-ga267d945e · Backend: Vulkan · Quantization: Q8_0 · GPU layers: 99 · Flash attention: yes · Parallel slots: 2 · Spec decoding: no · MTP: no
  Sampling: temp 1.0 · top-p 0.95
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144
  Extra flags: --no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1
  Notes: Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable: weights exceed 32 GB VRAM. Best of 3 streamed runs.
  Submitted: May 5, 2026, 2:47 AM
| Qwen3.6-27B 28B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · Q8_0 | 20.1 | — | 58.1 | 807ms | 1k | — | 3d ago | |
Details for Qwen3.6-27B
  Model: Qwen/Qwen3.6-27B (revision main) · Family: Qwen · Parameters: 28B (dense) · MoE: no
  Throughput: 20.1 tok/s out · 58.1 tok/s total · TTFT 807.5 ms
  Workload: 531 prompt tok / 256 output tok · context length 787 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b9025-1-ga267d945e · Backend: Vulkan · Quantization: Q8_0 · GPU layers: 99 · Flash attention: yes · Parallel slots: 2 · Spec decoding: no · MTP: no
  Sampling: temp 1.0 · top-p 0.95 · top-k 20 · min-p 0
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384
  Extra flags: --no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0
  Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
  Submitted: May 5, 2026, 2:35 AM
| Qwen3.5-122B-A10B 125B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · MXFP4_MOE | 27.2 | — | 11.1 | 56499ms | 66k | — | 3d ago | |
Details for Qwen3.5-122B-A10B
  Model: Qwen/Qwen3.5-122B-A10B (revision main) · Family: Qwen · Parameters: 125B · MoE: yes (A10B: ~10B active, per the naming)
  Throughput: 27.2 tok/s out · 11.1 tok/s total · TTFT 56499.3 ms
  Workload: 533 prompt tok / 161 output tok · context length 65536 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: MXFP4_MOE · GPU layers: 99 · Flash attention: yes · Parallel slots: 1 · Spec decoding: no · MTP: no
  Sampling: temp 1.0 · top-p 0.95 · top-k 40 · min-p 0.01 · repeat penalty 1.0
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}' --flash-attn on
  Extra flags: --no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}
  Notes: B+C config: --parallel 1 + --flash-attn on (the trailing --flash-attn on appears to override the earlier --flash-attn auto). Best of 3 runs on 3× R9700 (Vulkan). The 56.5 s TTFT, plausibly a cold start at the 64K configuration, drags the total rate down to 11.1 tok/s even though decode holds 27.2 tok/s.
  Submitted: May 5, 2026, 12:41 AM
| Qwen3.5-122B-A10B 125B · Qwen | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · MXFP4_MOE | 27.3 | — | 113.5 | 219ms | 1k | — | 3d ago | |
Details for Qwen3.5-122B-A10B
  Model: Qwen/Qwen3.5-122B-A10B (revision main) · Family: Qwen · Parameters: 125B · MoE: yes (A10B: ~10B active, per the naming)
  Throughput: 27.3 tok/s out · 113.5 tok/s total · TTFT 218.8 ms
  Workload: 533 prompt tok / 161 output tok · context length 694 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: MXFP4_MOE · GPU layers: 99 · Flash attention: yes · Parallel slots: 2 · Spec decoding: no · MTP: no
  Sampling: temp 1.0 · top-p 0.95 · top-k 40 · min-p 0.01 · repeat penalty 1.0
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'
  Extra flags: --no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}
  Notes: llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Qwen3.5-122B-A10B MXFP4_MOE GGUF, tensor split 1/1/1, flash-attn auto, reasoning disabled. Best of 3 runs.
  Submitted: May 4, 2026, 11:53 PM
| gpt-oss-120b 120B · Gpt | 3x AMD Radeon AI Pro R9700 96GB | llama.cpp · F16 | 70.7 | — | 219.1 | 236ms | 1k | — | 3d ago | |
Details for gpt-oss-120b
  Model: openai/gpt-oss-120b (revision main) · Family: Gpt · Parameters: 120B · MoE: yes (≈5.1B active)
  Throughput: 70.7 tok/s out · 219.1 tok/s total · TTFT 236.4 ms
  Workload: 589 prompt tok / 256 output tok · context length 845 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 96GB (discrete GPU, 3 cards, 96GB VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: F16 · GPU layers: 99 · Flash attention: yes · Parallel slots: 2 · Spec decoding: no · MTP: no
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
  Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
  Notes: llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto.
  Submitted: May 4, 2026, 11:00 PM
| gpt-oss-120b 120B · Gpt | 3x AMD Radeon AI Pro R9700 32GB | llama.cpp · F16 | 70.0 | — | 217.0 | 236ms | 1k | — | 3d ago | |
Details for gpt-oss-120b
  Model: openai/gpt-oss-120b (revision main) · Family: Gpt · Parameters: 120B · MoE: yes (≈5.1B active)
  Throughput: 70.0 tok/s out · 217.0 tok/s total · TTFT 236.4 ms
  Workload: 589 prompt tok / 256 output tok · context length 845 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 32GB (discrete GPU, 3 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: F16 · GPU layers: 99 · Flash attention: yes · Parallel slots: 2 · Spec decoding: no · MTP: no
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
  Extra flags: --no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1
  Notes: llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto. Best of 3 runs.
  Submitted: May 4, 2026, 10:42 PM
| Kimi-Dev-72B 73B · Qwen | 3x AMD Radeon AI Pro R9700 32GB | llama.cpp · Q4_0 | 20.9 | — | 64.5 | 272ms | 131k | — | 3d ago | |
Details for Kimi-Dev-72B
  Model: moonshotai/Kimi-Dev-72B (revision main) · Base model: Qwen2.5-72B · Family: Qwen · Parameters: 73B · MoE: no
  Throughput: 20.9 tok/s out · 64.5 tok/s total · TTFT 271.5 ms
  Workload: 551 prompt tok / 256 output tok · context length 131072 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 32GB (discrete GPU, 3 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: Q4_0 · GPU layers: 99 · Flash attention: yes · Parallel slots: 2 · Spec decoding: yes (draft: Qwen2.5-Coder-0.5B-Instruct Q8_0) · MTP: no
  Sampling: temp 0.7 · top-p 0.8 · top-k 20 · repeat penalty 1.05
  Command: /home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072
  Extra flags: --no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1
  Notes: Vulkan 3× R9700 + speculative decoding: Qwen2.5-Coder-0.5B-Instruct Q8_0 draft, --spec-draft-n-max 8 --spec-draft-p-min 0.7. Best of 3.
  Submitted: May 4, 2026, 10:09 PM
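This is the only submission in the batch that uses speculative decoding: a 0.5B draft model proposes short token runs that the 72B target verifies in a single batched pass, so accepted draft tokens cost far less than ordinary decode steps. The relevant flags, isolated from the full command (all flags verbatim from the record, model paths shortened; the readings in the comments are inferred from the flag names):

    llama-server -m Kimi-Dev-72B-Q4_0.gguf \
      --spec-draft-model Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
      --spec-draft-n-max 8 \
      --spec-draft-n-min 1 \
      --spec-draft-p-min 0.7 \
      --spec-draft-ngl 99
    # n-max/n-min: draft between 1 and 8 tokens per verification step
    # p-min 0.7: stop drafting once the draft model's confidence drops below 0.7
    # spec-draft-ngl 99: the draft model is fully GPU-offloaded as well

For scale, the plain Kimi-Dev-72B run further down the page decodes at 11.3 tok/s; the draft model roughly doubles that here (20.9 tok/s).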
| Qwen3.6-27B 28B · Qwen | 3x AMD Radeon AI Pro R9700 32GB | llama.cpp · Q8_0 | 18.1 | — | 54.6 | 260ms | 262k | — | 3d ago | |
Details for Qwen3.6-27B
  Model: Qwen/Qwen3.6-27B (revision main) · Family: Qwen · Parameters: 28B (dense) · MoE: no
  Throughput: 18.1 tok/s out · 54.6 tok/s total · TTFT 259.9 ms
  Workload: 531 prompt tok / 256 output tok · context length 262144 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 32GB (discrete GPU, 3 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: Q8_0 · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. This is the 3-card 262K baseline (18.08 tok/s unrounded) that the later single-card Qwen3.6-27B submissions compare against.
  Submitted: May 4, 2026, 9:36 PM
| Kimi-Dev-72B 73B · Qwen | 3x AMD Radeon AI Pro R9700 32GB | llama.cpp · Q4_0 | 11.3 | — | 35.4 | 259ms | 66k | — | 3d ago | |
Details for Kimi-Dev-72B
  Model: moonshotai/Kimi-Dev-72B (revision main) · Base model: Qwen2.5-72B · Family: Qwen · Parameters: 73B · MoE: no
  Throughput: 11.3 tok/s out · 35.4 tok/s total · TTFT 258.6 ms
  Workload: 551 prompt tok / 256 output tok · context length 65536 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 32GB (discrete GPU, 3 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: Q4_0 · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.
  Submitted: May 4, 2026, 9:31 PM
| Qwen3.5-122B-A10B 125B · Qwen | 3x AMD Radeon AI Pro R9700 32GB | llama.cpp · MXFP4_MOE | 26.8 | — | 80.6 | 232ms | 66k | — | 3d ago | |
Details for Qwen3.5-122B-A10B
  Model: Qwen/Qwen3.5-122B-A10B (revision main) · Family: Qwen · Parameters: 125B · MoE: yes (A10B: ~10B active, per the naming)
  Throughput: 26.8 tok/s out · 80.6 tok/s total · TTFT 232.2 ms
  Workload: 531 prompt tok / 256 output tok · context length 65536 · batch size 1
  Hardware: 3x AMD Radeon AI Pro R9700 32GB (discrete GPU, 3 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: MXFP4_MOE · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: noctrex/Qwen3.5-122B-A10B-MXFP4_MOE-GGUF on 3× R9700 (96 GB pool), llama.cpp Vulkan, -ts 1/1/1, ctx 64K. Best of 3 runs.
  Submitted: May 4, 2026, 6:50 PM
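No command was recorded here, but the same model's later 1k-context submission (above) did record one, and these notes (-ts 1/1/1, ctx 64K) match it nearly flag for flag. The launch was presumably close to this sketch:

    llama-server --port ${PORT} --host 0.0.0.0 --no-webui \
      -ngl 99 --jinja --cache-reuse 256 \
      -m Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf \
      -ts 1/1/1 \
      -c 65536 \
      --flash-attn auto
    # -m points at the first shard of the 4-part GGUF; llama.cpp loads the remaining splits automatically
    # -ts 1/1/1: weights split evenly across the three R9700s (the "96 GB pool" in the notes)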
| Qwen3-Coder-Next 80B · Qwen | 2x AMD Radeon AI Pro R9700 32GB | llama.cpp · MXFP4_MOE | 80.8 | — | 279.7 | 250ms | 1k | — | 3d ago | |
Details for Qwen3-Coder-Next
  Model: Qwen/Qwen3-Coder-Next (revision main) · Family: Qwen · Parameters: 80B · MoE: yes (MXFP4_MOE quant; 80.8 tok/s is far above what a dense 80B could decode on two cards)
  Throughput: 80.8 tok/s out · 279.7 tok/s total · TTFT 249.6 ms
  Workload: 530 prompt tok / 187 output tok · context length 717 · batch size 1
  Hardware: 2x AMD Radeon AI Pro R9700 32GB (discrete GPU, 2 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: MXFP4_MOE · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: llama.cpp Vulkan, dual R9700 (-ts 1/1), c=262144, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.
  Submitted: May 4, 2026, 5:05 PM
| gemma-4-26B-A4B-it 27B · Gemma | 2x AMD Radeon AI Pro R9700 32GB | llama.cpp · MXFP4_MOE | 87.3 | — | 239.9 | 375ms | 1k | — | 3d ago | |
Details for gemma-4-26B-A4B-it
  Model: google/gemma-4-26B-A4B-it (revision main) · Base model: gemma-4-26B-A4B · Family: Gemma · Parameters: 27B · MoE: yes (A4B: ~4B active, per the naming)
  Throughput: 87.3 tok/s out · 239.9 tok/s total · TTFT 375.3 ms
  Workload: 538 prompt tok / 256 output tok · context length 794 · batch size 1
  Hardware: 2x AMD Radeon AI Pro R9700 32GB (discrete GPU, 2 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp b8974-5-g66bafdcf1 · Backend: Vulkan · Quantization: MXFP4_MOE · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: llama.cpp Vulkan, single R9700 (-ts 0/1), c=65536, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.
  Submitted: May 4, 2026, 4:56 PM
| GLM-4.7-Flash 31B | 2x AMD Radeon AI Pro R9700 32GB | llama.cpp · MXFP4_MOE | 92.9 | — | 273.4 | 110ms | 1k | — | 3d ago | |
Details for GLM-4.7-Flash
  Model: zai-org/GLM-4.7-Flash (revision main) · Parameters: 31B · MoE: yes (MXFP4_MOE quant)
  Throughput: 92.9 tok/s out · 273.4 tok/s total · TTFT 109.6 ms
  Workload: 527 prompt tok / 256 output tok · context length 783 · batch size 1
  Hardware: 2x AMD Radeon AI Pro R9700 32GB (discrete GPU, 2 cards; 32GB is the per-card VRAM) · CPU: AMD Ryzen 9 5950X · RAM: 64GB · OS: Arch Linux
  Engine: llama.cpp (version not recorded) · Backend: Vulkan · Quantization: MXFP4_MOE · Spec decoding: no · MTP: no
  Command: not recorded
  Notes: llama.cpp Vulkan, single R9700 (-ts 0/1), c=131072, kv q8_0, flash-attn auto. Best of 3 streamed runs, ~512 input / 256 output.
  Submitted: May 4, 2026, 4:45 PM
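A pattern worth calling out across these single-card runs on a multi-GPU host: the tensor-split flag doubles as a device selector. A minimal sketch of the idiom as this run's notes describe it (the model filename is hypothetical; the flags mirror the notes):

    llama-server -m GLM-4.7-Flash-MXFP4_MOE.gguf \
      -ngl 99 -ts 0/1 -c 131072 -ctk q8_0 -ctv q8_0 --flash-attn auto
    # -ts 0/1: a zero share for GPU 0 puts every layer on GPU 1, making the run
    #   effectively single-card even though both R9700s are visible to the server
    # -ctk/-ctv q8_0 and -c 131072 match the "kv q8_0, c=131072" in the notes
    # model filename hypothetical, not taken from the record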