1337Hero

22B · Gpt

991.1

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

175ms

Context

33k · 2d ago

best 991.1 tok/s

30B · Qwen

127.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

194ms

Context

1k · 1d ago

best 127.2 tok/s

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

36B · Qwen

112.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · MXFP4_MOE

TTFT

168ms

Context

1k · 2d ago

best 112.8 tok/s

Llama-2-7b

7B · Llama

110.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

171ms

Context

1k · 1d ago

best 110.1 tok/s

Qwen3.5-35B-A3B

35B · Qwen

108.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q3_K_M

TTFT

208ms

Context

1k · 1d ago

best 108.4 tok/s

33B

107.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

218ms

Context

1k · 2d ago

best 107.3 tok/s

Nemotron-Cascade-2-30B-A3B

31B · Qwen

100.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 2d ago

best 100.0 tok/s

Qwen3-VL-8B-Instruct

9B · Qwen

95.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

179ms

Context

1k · 1d ago

best 95.9 tok/s

32B

95.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

216ms

Context

1k · 3d ago

best 95.4 tok/s

GLM-4.7-Flash

31B

93.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

105ms

Context

1k · 3d ago

best 93.3 tok/s

gemma-4-26B-A4B-it

27B · Gemma

87.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

375ms

Context

1k · 3d ago

best 87.3 tok/s

Qwen3.5-4B

4B · Qwen

84.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 1d ago

best 84.9 tok/s

Qwen3-Coder-Next

80B · Qwen

80.8

tok/s

Hardware

2x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

250ms

Context

1k · 3d ago

best 80.8 tok/s

Llama-3.2-3B-Instruct

3B · Llama

79.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · BF16

TTFT

184ms

Context

1k · 1d ago

best 79.9 tok/s

32B · Qwen

79.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · FP8

TTFT

138ms

Context

33k · 2d ago

best 79.3 tok/s

120B · Gpt

71.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

233ms

Context

1k · 3d ago

best 71.5 tok/s

229B · Minimax

32.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

46ms

Context

1k · 1d ago

best 32.6 tok/s

Devstral-Small-2-24B-Instruct-2512

125B · Qwen

27.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

219ms

Context

1k · 3d ago

best 27.3 tok/s

24B · Mistral

23.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

244ms

Context

25k · 3d ago

best 23.9 tok/s

73B · Qwen

20.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q4_0

TTFT

272ms

Context

131k · 3d ago

best 20.9 tok/s

28B · Qwen

20.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

807ms

Context

1k · 3d ago

best 20.1 tok/s

granite-4.1-30b

29B

17.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

224ms

Context

1k · 2d ago

best 17.9 tok/s

SWE-Dev-32B

33B · Qwen

16.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

699ms

Context

1k · 1d ago

best 16.2 tok/s

128B · Mistral

7.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

2033ms

Context

1k · 1d ago

best 7.4 tok/s

Model	Best tok/s	Hardware	Engine / quant	Prefill	TTFT	ctx	When
gpt-oss-20b 22B4 runs	991.1	2x AMD Radeon AI Pro R9700 64GB	vllmMXFP4_MOE	—	175ms	33k	2d ago
qwen3-coder-30b-a3b-codemonkey 30B4 runs	127.2out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ4_K_M	—	194ms	1k	1d ago
Qwen3.6-35B-A3B 36B6 runs	112.8out	3x AMD Radeon AI Pro R9700 96GB	vllmMXFP4_MOE	—	168ms	1k	2d ago
Llama-2-7b 7B1 run	110.1out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ4_0	—	171ms	1k	1d ago
Qwen3.5-35B-A3B 35B1 run	108.4out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ3_K_M	—	208ms	1k	1d ago
Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 33B2 runs	107.3out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	218ms	1k	2d ago
Qwen3-Coder-30B-A3B-Instruct 31B3 runs	100.0out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	198ms	1k	2d ago
Qwen3-VL-8B-Instruct 9B1 run	95.9out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ4_K_M	—	179ms	1k	1d ago
Nemotron-Cascade-2-30B-A3B 32B2 runs	95.4out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	216ms	1k	3d ago
GLM-4.7-Flash 31B2 runs	93.3out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	105ms	1k	3d ago
gemma-4-26B-A4B-it 27B1 run	87.3out	2x AMD Radeon AI Pro R9700 32GB	llama.cppMXFP4_MOE	—	375ms	1k	3d ago
Qwen3.5-4B 4B1 run	84.9out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	198ms	1k	1d ago
Qwen3-Coder-Next 80B2 runs	80.8out	2x AMD Radeon AI Pro R9700 32GB	llama.cppMXFP4_MOE	—	250ms	1k	3d ago
Llama-3.2-3B-Instruct 3B1 run	79.9out	3x AMD Radeon AI Pro R9700 96GB	llama.cppBF16	—	184ms	1k	1d ago
Qwen3-32B 32B3 runs	79.3	2x AMD Radeon AI Pro R9700 64GB	vllmFP8	—	138ms	33k	2d ago
gpt-oss-120b 120B4 runs	71.5out	3x AMD Radeon AI Pro R9700 96GB	llama.cppF16	—	233ms	1k	3d ago
MiniMax-M2.7 229B5 runs	32.6out	3x AMD Radeon AI Pro R9700 96GB	llama.cppIQ4_XS	—	46ms	1k	1d ago
Qwen3.5-122B-A10B 125B4 runs	27.3out	3x AMD Radeon AI Pro R9700 96GB	llama.cppMXFP4_MOE	—	219ms	1k	3d ago
Devstral-Small-2-24B-Instruct-2512 24B2 runs	23.9out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	244ms	25k	3d ago
Kimi-Dev-72B 73B4 runs	20.9out	3x AMD Radeon AI Pro R9700 32GB	llama.cppQ4_0	—	272ms	131k	3d ago
Qwen3.6-27B 28B4 runs	20.1out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	807ms	1k	3d ago
granite-4.1-30b 29B2 runs	17.9out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	224ms	1k	2d ago
SWE-Dev-32B 33B1 run	16.2out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ8_0	—	699ms	1k	1d ago
Mistral-Medium-3.5-128B 128B4 runs	7.4out	3x AMD Radeon AI Pro R9700 96GB	llama.cppQ4_0	—	2033ms	1k	1d ago

Hardware breakdown4

Distinct rigs used across approved submissions.

3x AMD Radeon AI Pro R9700 96GB

50 runs · best 160.8 tok/s

2x AMD Radeon AI Pro R9700 64GB

6 runs · best 991.1 tok/s

3x AMD Radeon AI Pro R9700 32GB

5 runs · best 70.0 tok/s

2x AMD Radeon AI Pro R9700 32GB

3 runs · best 92.9 tok/s

Engines used2

Runtime mix and top quantizations.

llama.cpp55F16 · IQ4_XS · Q4_0

vllm9FP8 · MXFP4_MOE

All runs64

Full approved benchmark history.

229B · Minimax

30.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

57ms

Context

1k · today

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

—

Revision

main

Family

Minimax

Parameters

229B

Active params

—

MoE

Output tok/s

30.7

Prefill tok/s

—

Total tok/s

95.0

TTFT

57.1ms

Peak VRAM

—

Prompt tokens

541

Output tokens

256

Prefill tokens

—

Context length

797

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 --main-gpu 1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1 --main-gpu 1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 8, 2026, 8:02 AM

Last edited

—

128B · Mistral

6.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

1956ms

Context

1k · today

Show all run details

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

—

Revision

main

Family

Mistral

Parameters

128B

Active params

—

MoE

Output tok/s

6.5

Prefill tok/s

—

Total tok/s

27.4

TTFT

1955.6ms

Peak VRAM

—

Prompt tokens

886

Output tokens

256

Prefill tokens

—

Context length

1142

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 8, 2026, 7:51 AM

Last edited

—

120B · Gpt

62.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

105ms

Context

2k · today

Show all run details

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

—

Revision

main

Family

Gpt

Parameters

120B

Active params

—

MoE

Output tok/s

62.7

Prefill tok/s

—

Total tok/s

99.0

TTFT

105.5ms

Peak VRAM

—

Prompt tokens

589

Output tokens

1000

Prefill tokens

—

Context length

1589

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 8, 2026, 7:44 AM

Last edited

—

229B · Minimax

25.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

56ms

Context

33k · today

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

—

Revision

main

Family

Minimax

Parameters

229B

Active params

—

MoE

Output tok/s

25.2

Prefill tok/s

—

Total tok/s

79.9

TTFT

56.1ms

Peak VRAM

—

Prompt tokens

560

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k

Reactions

—

Submitted

May 8, 2026, 7:13 AM

Last edited

—

229B · Minimax

25.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

5267ms

Context

1k · today

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

—

Revision

main

Family

Minimax

Parameters

229B

Active params

—

MoE

Output tok/s

25.2

Prefill tok/s

—

Total tok/s

52.9

TTFT

5266.6ms

Peak VRAM

—

Prompt tokens

560

Output tokens

256

Prefill tokens

—

Context length

816

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k

Reactions

—

Submitted

May 8, 2026, 7:10 AM

Last edited

—

229B · Minimax

31.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

771ms

Context

33k · 1d ago

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

—

Revision

main

Family

Minimax

Parameters

229B

Active params

—

MoE

Output tok/s

31.8

Prefill tok/s

—

Total tok/s

90.8

TTFT

770.5ms

Peak VRAM

—

Prompt tokens

546

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B IQ4_XS imatrix. ctx=32768, parallel=1. Thinking disabled.

Reactions

—

Submitted

May 7, 2026, 4:58 AM

Last edited

—

229B · Minimax

32.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

46ms

Context

1k · 1d ago

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

—

Revision

main

Family

Minimax

Parameters

229B

Active params

—

MoE

Output tok/s

32.6

Prefill tok/s

—

Total tok/s

101.4

TTFT

46.3ms

Peak VRAM

—

Prompt tokens

546

Output tokens

256

Prefill tokens

—

Context length

802

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B IQ4_XS imatrix (mradermacher). Thinking disabled via system prompt.

Reactions

—

Submitted

May 7, 2026, 4:39 AM

Last edited

—

Llama-2-7b

7B · Llama

110.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

171ms

Context

1k · 1d ago

Show all run details

Model

meta-llama/Llama-2-7b

Display name

Llama-2-7b

Base model

—

Revision

main

Family

Llama

Parameters

Active params

—

MoE

Output tok/s

110.1

Prefill tok/s

—

Total tok/s

347.5

TTFT

171.1ms

Peak VRAM

—

Prompt tokens

611

Output tokens

256

Prefill tokens

—

Context length

867

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 7, 2026, 3:49 AM

Last edited

—

Qwen3.5-35B-A3B

35B · Qwen

108.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q3_K_M

TTFT

208ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.5-35B-A3B

Display name

Qwen3.5-35B-A3B

Base model

Qwen3.5-35B-A3B-Base

Revision

main

Family

Qwen

Parameters

35B

Active params

—

MoE

Output tok/s

108.4

Prefill tok/s

—

Total tok/s

306.4

TTFT

207.7ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q3_K_M

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/qwen/Qwen3.5-35B-A3B-Uncensored.Q3_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 7, 2026, 3:43 AM

Last edited

—

SWE-Dev-32B

33B · Qwen

16.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

699ms

Context

1k · 1d ago

Show all run details

Model

zai-org/SWE-Dev-32B

Display name

SWE-Dev-32B

Base model

Qwen2.5-Coder-32B-Instruct

Revision

main

Family

Qwen

Parameters

33B

Active params

—

MoE

Output tok/s

16.2

Prefill tok/s

—

Total tok/s

49.0

TTFT

698.6ms

Peak VRAM

—

Prompt tokens

551

Output tokens

256

Prefill tokens

—

Context length

807

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.95

Top K

—

Min P

—

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m "/home/mikekey/models/gguf/misc/swe-dev-32b-q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 1/1 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 7, 2026, 3:36 AM

Last edited

—

Qwen3-VL-8B-Instruct

9B · Qwen

95.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

179ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3-VL-8B-Instruct

Display name

Qwen3-VL-8B-Instruct

Base model

—

Revision

main

Family

Qwen

Parameters

Active params

—

MoE

Output tok/s

95.9

Prefill tok/s

—

Total tok/s

276.0

TTFT

178.8ms

Peak VRAM

—

Prompt tokens

530

Output tokens

256

Prefill tokens

—

Context length

786

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 7, 2026, 3:29 AM

Last edited

—

Llama-3.2-3B-Instruct

3B · Llama

79.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · BF16

TTFT

184ms

Context

1k · 1d ago

Show all run details

Model

meta-llama/Llama-3.2-3B-Instruct

Display name

Llama-3.2-3B-Instruct

Base model

—

Revision

main

Family

Llama

Parameters

Active params

—

MoE

Output tok/s

79.9

Prefill tok/s

—

Total tok/s

239.7

TTFT

183.6ms

Peak VRAM

—

Prompt tokens

556

Output tokens

256

Prefill tokens

—

Context length

812

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

BF16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.9

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 7, 2026, 3:23 AM

Last edited

—

Qwen3.5-4B

4B · Qwen

84.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.5-4B

Display name

Qwen3.5-4B

Base model

Qwen3.5-4B-Base

Revision

main

Family

Qwen

Parameters

Active params

—

MoE

Output tok/s

84.9

Prefill tok/s

—

Total tok/s

244.8

TTFT

197.6ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.95

Top K

—

Min P

—

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m "/home/mikekey/models/gguf/misc/Opus4.7-GODs.Ghost.Codex.Distill.4B-Q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 7, 2026, 3:06 AM

Last edited

—

73B · Qwen

15.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

266ms

Context

1k · 1d ago

Show all run details

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

—

MoE

Output tok/s

15.5

Prefill tok/s

—

Total tok/s

56.7

TTFT

265.7ms

Peak VRAM

—

Prompt tokens

697

Output tokens

256

Prefill tokens

—

Context length

953

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

—

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + Qwen2.5-Coder-0.5B-Q8_0 draft (n-max=8, n-min=1, p-min=0.7), 128K ctx. Code-completion prompt — paired baseline (no spec) at ~11.4 tok/s shows ~+35% gain from speculative decoding on code-shaped output.

Reactions

—

Submitted

May 6, 2026, 7:22 PM

Last edited

—

Devstral-Small-2-24B-Instruct-2512

73B · Qwen

11.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

250ms

Context

1k · 1d ago

Show all run details

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

—

MoE

Output tok/s

11.5

Prefill tok/s

—

Total tok/s

42.3

TTFT

249.7ms

Peak VRAM

—

Prompt tokens

697

Output tokens

256

Prefill tokens

—

Context length

953

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

—

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8 KV, 128K ctx. Code-completion prompt (Python module continuation, ~700 input tokens). Baseline for spec-decoding A/B — paired with Kimi-Dev-72B-spec submission.

Reactions

—

Submitted

May 6, 2026, 7:14 PM

Last edited

—

24B · Mistral

23.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

214ms

Context

1k · 1d ago

Show all run details

Model

mistralai/Devstral-Small-2-24B-Instruct-2512

Display name

Devstral-Small-2-24B-Instruct-2512

Base model

Mistral-Small-3.1-24B-Base-2503

Revision

main

Family

Mistral

Parameters

24B

Active params

—

MoE

Output tok/s

23.6

Prefill tok/s

—

Total tok/s

204.1

TTFT

213.6ms

Peak VRAM

—

Prompt tokens

1078

Output tokens

135

Prefill tokens

—

Context length

1213

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.15

Top P

—

Top K

—

Min P

0.01

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --temp 0.15 --min-p 0.01 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 24576

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 24K ctx, q8 KV, FA auto. Single-card layout.

Reactions

—

Submitted

May 6, 2026, 7:02 PM

Last edited

—

28B · Qwen

19.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

263ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

—

Revision

main

Family

Qwen

Parameters

28B

Active params

—

MoE

Output tok/s

19.6

Prefill tok/s

—

Total tok/s

59.2

TTFT

262.8ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

Min P

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 16K ctx, q8 KV, FA auto. Single-card layout to eliminate cross-GPU overhead on 27B dense.

Reactions

—

Submitted

May 6, 2026, 6:56 PM

Last edited

—

36B · Qwen

80.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

203ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

—

Revision

main

Family

Qwen

Parameters

36B

Active params

—

MoE

Output tok/s

80.5

Prefill tok/s

—

Total tok/s

232.6

TTFT

202.6ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.6

Top P

0.95

Top K

Min P

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.

Reactions

—

Submitted

May 6, 2026, 6:50 PM

Last edited

—

31B · Qwen

92.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

178ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

—

Revision

main

Family

Qwen

Parameters

31B

Active params

—

MoE

Output tok/s

92.0

Prefill tok/s

—

Total tok/s

365.9

TTFT

178.2ms

Peak VRAM

—

Prompt tokens

530

Output tokens

156

Prefill tokens

—

Context length

686

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.

Reactions

—

Submitted

May 6, 2026, 6:42 PM

Last edited

—

30B · Qwen

127.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

194ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

—

MoE

Output tok/s

127.2

Prefill tok/s

—

Total tok/s

356.1

TTFT

193.7ms

Peak VRAM

—

Prompt tokens

530

Output tokens

256

Prefill tokens

—

Context length

786

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto.

Reactions

—

Submitted

May 6, 2026, 4:37 PM

Last edited

—

30B · Qwen

126.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

209ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

—

MoE

Output tok/s

126.5

Prefill tok/s

—

Total tok/s

352.0

TTFT

208.9ms

Peak VRAM

—

Prompt tokens

530

Output tokens

256

Prefill tokens

—

Context length

786

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.from-Q8_0.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto, Q4_K_M derived from Q8_0 source.

Reactions

—

Submitted

May 6, 2026, 4:32 PM

Last edited

—

30B · Qwen

90.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

203ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

—

MoE

Output tok/s

90.0

Prefill tok/s

—

Total tok/s

257.8

TTFT

203.3ms

Peak VRAM

—

Prompt tokens

530

Output tokens

256

Prefill tokens

—

Context length

786

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 2× R9700, 32K ctx, q8 KV, FA auto.

Reactions

—

Submitted

May 6, 2026, 4:26 PM

Last edited

—

30B · Qwen

20.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · BF16

TTFT

222ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

—

MoE

Output tok/s

20.3

Prefill tok/s

—

Total tok/s

61.2

TTFT

221.6ms

Peak VRAM

—

Prompt tokens

530

Output tokens

256

Prefill tokens

—

Context length

786

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

BF16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.BF16.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 3× R9700, 16K ctx, q8 KV, FA auto.

Reactions

—

Submitted

May 6, 2026, 4:19 PM

Last edited

—

128B · Mistral

7.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

2033ms

Context

1k · 1d ago

Show all run details

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

—

Revision

main

Family

Mistral

Parameters

128B

Active params

—

MoE

Output tok/s

7.4

Prefill tok/s

—

Total tok/s

31.4

TTFT

2032.8ms

Peak VRAM

—

Prompt tokens

886

Output tokens

256

Prefill tokens

—

Context length

1142

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8_0 KV, 16K ctx, ts 1/1/1, FA auto.

Reactions

—

Submitted

May 6, 2026, 12:50 PM

Last edited

—

128B · Mistral

6.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

317ms

Context

16k · 2d ago

Show all run details

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

—

Revision

main

Family

Mistral

Parameters

128B

Active params

—

MoE

Output tok/s

6.9

Prefill tok/s

—

Total tok/s

30.4

TTFT

317.4ms

Peak VRAM

—

Prompt tokens

886

Output tokens

256

Prefill tokens

—

Context length

16384

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8979-66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

q8_0

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

-ctk q8_0 -ctv q8_0 -ts 1/1/1 --no-mmap --cache-ram 0 --no-cache-idle-slots -c 16384

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Mistral Medium 3.5 128B Q4_0 GGUF, 3 shards, tensor split 1/1/1, q8_0 KV cache, flash-attn auto, 16K ctx, single slot. Median of 3 runs. Submitted with running binary build_info b8979-66bafdcf1.

Reactions

—

Submitted

May 6, 2026, 6:40 AM

Last edited

—

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

128B · Mistral

6.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

2082ms

Context

1k · 2d ago

Show all run details

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

—

Revision

main

Family

Mistral

Parameters

128B

Active params

—

MoE

Output tok/s

6.2

Prefill tok/s

—

Total tok/s

26.4

TTFT

2081.9ms

Peak VRAM

—

Prompt tokens

886

Output tokens

256

Prefill tokens

—

Context length

1142

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.95

Top K

—

Min P

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf --temp 0.7 --top-p 0.95 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

3x R9700 Vulkan, -ts 1/1/1, q8 KV, flash-attn auto, 32K ctx

Reactions

—

Submitted

May 6, 2026, 6:01 AM

Last edited

—

granite-4.1-30b

29B

17.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

224ms

Context

1k · 2d ago

Show all run details

Model

ibm-granite/granite-4.1-30b

Display name

granite-4.1-30b

Base model

—

Revision

main

Family

—

Parameters

29B

Active params

—

MoE

Output tok/s

17.9

Prefill tok/s

—

Total tok/s

54.1

TTFT

223.9ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 6, 2026, 4:57 AM

Last edited

—

33B

107.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

218ms

Context

1k · 2d ago

Show all run details

Model

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Display name

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Base model

—

Revision

main

Family

—

Parameters

33B

Active params

—

MoE

Output tok/s

107.3

Prefill tok/s

—

Total tok/s

304.9

TTFT

217.8ms

Peak VRAM

—

Prompt tokens

538

Output tokens

256

Prefill tokens

—

Context length

794

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.6

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

llama.cpp Vulkan on 3x R9700 via llama-swap, ts 1/1 (2-card). Median of 3 runs (selected for representative TTFT).

Reactions

—

Submitted

May 6, 2026, 4:45 AM

Last edited

—

32B · Qwen

22.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · FP8

TTFT

60ms

Context

1k · 2d ago

Show all run details

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

—

Revision

main

Family

Qwen

Parameters

32B

Active params

—

MoE

Output tok/s

22.8

Prefill tok/s

—

Total tok/s

69.5

TTFT

60.0ms

Peak VRAM

34.2GB

Prompt tokens

530

Output tokens

256

Prefill tokens

—

Context length

786

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

FP8

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.95

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3-32B-FP8 --max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' healthcheck: test: - CMD-SHELL

Extra flags

--max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} -- CMD-SHELL

Notes

post-reboot validation, 2x R9700 TP=2 dense FP8

Reactions

—

Submitted

May 6, 2026, 4:12 AM

Last edited

—

36B · Qwen

112.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · MXFP4_MOE

TTFT

168ms

Context

1k · 2d ago

Show all run details

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

—

Revision

main

Family

Qwen

Parameters

36B

Active params

—

MoE

Output tok/s

112.8

Prefill tok/s

—

Total tok/s

322.8

TTFT

168ms

Peak VRAM

32.0GB

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.95

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}' healthcheck:

Extra flags

--max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}

Notes

vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH (bypasses RCCL 7.2.x gfx1201 tuning-index miss / TP=2 deadlock). TunableOp tuned ~300 rows/rank from cudagraph capture + diverse-load driving. MoE backend Triton (gfx950 gate routes only gfx950 to CK on this vLLM build). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions, post heavy warmup.

Reactions

—

Submitted

May 6, 2026, 3:18 AM

Last edited

—

36B · Qwen

98.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · MXFP4_MOE

TTFT

171ms

Context

1k · 2d ago

Show all run details

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

—

Revision

main

Family

Qwen

Parameters

36B

Active params

—

MoE

Output tok/s

98.9

Prefill tok/s

—

Total tok/s

285.2

TTFT

171.4ms

Peak VRAM

32GB

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.95

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}' healthcheck:

Extra flags

Notes

vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH to bypass RCCL 7.2.x gfx1201 tuning-index miss (TP=2 deadlock). aiter built but MoE backend still Triton (gfx950 gate routes only gfx950 to CK). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions.

Reactions

—

Submitted

May 6, 2026, 2:23 AM

Last edited

—

22B · Gpt

991.1

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

175ms

Context

33k · 2d ago

Show all run details

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

—

Revision

main

Family

Gpt

Parameters

22B

Active params

—

MoE

Output tok/s

991.1

Prefill tok/s

—

Total tok/s

—

TTFT

175.4ms

Peak VRAM

—

Prompt tokens

589

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

—

GPU count

VRAM

64GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

yes

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

4096

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.9

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Extra flags

--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Notes

gpt-oss-20b MXFP4 concurrent throughput at batch=16 (matches --max-num-seqs 16). Aggregate output tok/s, best of 3 runs (991.1 / 986.8 / 990.6 -- variance under 0.5%). Per-request decode 64.8 tok/s. TTFT 175ms. This is vLLM's strength zone vs llama.cpp single-stream: ~6x throughput when serving multiple concurrent users. batch=1 single-stream was 48 tok/s (separate submission) -- same hardware, llama.cpp Q8_0 single-stream was 160 tok/s.

Reactions

—

Submitted

May 5, 2026, 11:29 PM

Last edited

—

22B · Gpt

48.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

89ms

Context

33k · 2d ago

Show all run details

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

—

Revision

main

Family

Gpt

Parameters

22B

Active params

—

MoE

Output tok/s

48.3

Prefill tok/s

—

Total tok/s

156.7

TTFT

88.6ms

Peak VRAM

—

Prompt tokens

589

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

—

GPU count

VRAM

64GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

yes

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

4096

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.9

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Notes

gpt-oss-20b native MXFP4 on 2x R9700 (gfx1201) TP=2, batch=1. Uses the *fused* OAI Triton MXFP4 MoE kernel (triton, not triton_unfused) — gpt-oss is the kernel's original target. Same 3 wheel patches as Qwen3.6 MXFP4 entry (PR #37826 device gate, CLI enum, dtype). RCCL 7.1.1 hotfix for #40980. Concurrent throughput much higher: 580 tok/s at c=8, 902 tok/s at c=16 (batch submissions to follow).

Reactions

—

Submitted

May 5, 2026, 11:24 PM

Last edited

—

36B · Qwen

33.6

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

130ms

Context

33k · 2d ago

Show all run details

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

—

Revision

main

Family

Qwen

Parameters

36B

Active params

—

MoE

Output tok/s

33.6

Prefill tok/s

—

Total tok/s

101.7

TTFT

129.6ms

Peak VRAM

—

Prompt tokens

533

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

—

GPU count

VRAM

64GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

yes

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.93

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop_merged.csv VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Extra flags

--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Notes

Same setup as previous MXFP4 submission (PR #37826 patched, RCCL 7.1.1 hotfix, triton_unfused, dtype=bf16) — now WITH PyTorch TunableOp tuned (754 rows / 377 per TP rank). Tuned with 18-min diverse-load pass: prompt sizes 8/64/256/1024/2048 tokens x 5 content categories x concurrency 1+4. Auto-stopped on row plateau (3x60s zero-delta). Tuning gain over untuned baseline: +7.0% (31.70 -> 33.92 tok/s). Modest gain — batch=1 decode on R9700 is memory-bandwidth-bound, not GEMM-scheduling-bound.

Reactions

—

Submitted

May 5, 2026, 11:00 PM

Last edited

—

36B · Qwen

31.7

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

140ms

Context

33k · 2d ago

Show all run details

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

—

Revision

main

Family

Qwen

Parameters

36B

Active params

—

MoE

Output tok/s

31.7

Prefill tok/s

—

Total tok/s

96.0

TTFT

140.0ms

Peak VRAM

—

Prompt tokens

533

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

—

GPU count

VRAM

64GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

yes

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.93

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MOE=0 VLLM_MXFP4_USE_MARLIN=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Extra flags

--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Notes

Qwen3.6-35B-A3B MXFP4 (pahajokiconsulting quant) on 2x R9700 (gfx1201) TP=2. Baseline path (no AITER, no TunableOp, no MTP). Required THREE patches to vLLM 0.20.1+rocm721 wheel: (1) PR vllm-project/vllm#37826 in-place patch -- widen OAI Triton MoE device gate to gfx12, (2) add 'triton_unfused' to MoEBackend Literal in config/kernel.py (CLI exposure), (3) RCCL 7.1.1 LD_PRELOAD for #40980 deadlock. --moe-backend triton_unfused needed because TRITON kernel only supports SWIGLUOAI activation; UnfusedOAITritonExperts adds SILU. dtype=bfloat16 required (kernel asserts hidden_states.dtype==bf16). Best of 3 (31.6 / 31.7 / 31.6 tok/s).

Reactions

—

Submitted

May 5, 2026, 9:49 PM

Last edited

—

32B · Qwen

79.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · FP8

TTFT

138ms

Context

33k · 2d ago

Show all run details

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

—

Revision

main

Family

Qwen

Parameters

32B

Active params

—

MoE

Output tok/s

79.3

Prefill tok/s

—

Total tok/s

—

TTFT

138.3ms

Peak VRAM

—

Prompt tokens

534

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

—

GPU count

VRAM

64GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

FP8

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.9

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3

Extra flags

--max-model-len 32768 --reasoning-parser qwen3

Notes

Concurrent throughput at batch=4 (matches --max-num-seqs 4). Aggregate output tok/s, best of 3 runs (78.7 / 78.2 / 79.3). Per-request decode ~20 tok/s. RCCL 7.1.1 hotfix for #40980 still in effect.

Reactions

—

Submitted

May 5, 2026, 9:32 PM

Last edited

—

32B · Qwen

22.9

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · FP8

TTFT

71ms

Context

33k · 2d ago

Show all run details

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

—

Revision

main

Family

Qwen

Parameters

32B

Active params

—

MoE

Output tok/s

22.9

Prefill tok/s

—

Total tok/s

70.1

TTFT

71.5ms

Peak VRAM

—

Prompt tokens

534

Output tokens

256

Prefill tokens

—

Context length

32768

Batch size

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

—

GPU count

VRAM

64GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

FP8

Backend

rocm

Tensor parallel

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

0.9

Max running seqs

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3

Extra flags

--max-model-len 32768 --reasoning-parser qwen3

Notes

vLLM 0.20.1+rocm721 on 2x AMD R9700 (gfx1201) TP=2. Workaround for vllm-project/vllm#40980: LD_PRELOAD librccl.so.1 from rccl-7.1.1 (Arch ALA) -- RCCL 7.2.x lacks gfx1201 tuning index and deadlocks on AllReduce. With hotfix: ~22.7 tok/s; without: ~0.5 tok/s.

Reactions

—

Submitted

May 5, 2026, 9:26 PM

Last edited

—

31B · Qwen

100.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 2d ago

Show all run details

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

—

Revision

main

Family

Qwen

Parameters

31B

Active params

—

MoE

Output tok/s

100.0

Prefill tok/s

—

Total tok/s

390.1

TTFT

198.5ms

Peak VRAM

—

Prompt tokens

530

Output tokens

156

Prefill tokens

—

Context length

686

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 7:00 PM

Last edited

—

22B · Gpt

160.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

206ms

Context

1k · 3d ago

Show all run details

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

—

Revision

main

Family

Gpt

Parameters

22B

Active params

—

MoE

Output tok/s

160.8

Prefill tok/s

—

Total tok/s

470.0

TTFT

206.1ms

Peak VRAM

—

Prompt tokens

589

Output tokens

256

Prefill tokens

—

Context length

845

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 4:52 AM

Last edited

—

GLM-4.7-Flash

31B

93.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

105ms

Context

1k · 3d ago

Show all run details

Model

zai-org/GLM-4.7-Flash

Display name

GLM-4.7-Flash

Base model

—

Revision

main

Family

—

Parameters

31B

Active params

—

MoE

Output tok/s

93.3

Prefill tok/s

—

Total tok/s

274.8

TTFT

104.8ms

Peak VRAM

—

Prompt tokens

527

Output tokens

256

Prefill tokens

—

Context length

783

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

—

Min P

0.01

Repeat penalty

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/glm/GLM-4.7-Flash-MXFP4_MOE.gguf --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 4:40 AM

Last edited

—

Qwen3-Coder-Next

80B · Qwen

80.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

234ms

Context

1k · 3d ago

Show all run details

Model

Qwen/Qwen3-Coder-Next

Display name

Qwen3-Coder-Next

Base model

—

Revision

main

Family

Qwen

Parameters

80B

Active params

—

MoE

Output tok/s

80.4

Prefill tok/s

—

Total tok/s

280.1

TTFT

233.8ms

Peak VRAM

—

Prompt tokens

530

Output tokens

187

Prefill tokens

—

Context length

717

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

Min P

0.01

Repeat penalty

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-Next-MXFP4_MOE.gguf -ts 1/1 -c 262144 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 4:31 AM

Last edited

—

125B · Qwen

26.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

261ms

Context

1k · 3d ago

Show all run details

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

—

Revision

main

Family

Qwen

Parameters

125B

Active params

—

MoE

Output tok/s

26.8

Prefill tok/s

—

Total tok/s

110.8

TTFT

261.3ms

Peak VRAM

—

Prompt tokens

533

Output tokens

161

Prefill tokens

—

Context length

694

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

Min P

0.01

Repeat penalty

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 4:19 AM

Last edited

—

31B · Qwen

87.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

207ms

Context

1k · 3d ago

Show all run details

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

—

Revision

main

Family

Qwen

Parameters

31B

Active params

—

MoE

Output tok/s

87.2

Prefill tok/s

—

Total tok/s

343.5

TTFT

207.0ms

Peak VRAM

—

Prompt tokens

530

Output tokens

156

Prefill tokens

—

Context length

686

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

Repeat penalty

1.05

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 3:51 AM

Last edited

—

120B · Gpt

71.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

233ms

Context

1k · 3d ago

Show all run details

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

—

Revision

main

Family

Gpt

Parameters

120B

Active params

—

MoE

Output tok/s

71.5

Prefill tok/s

—

Total tok/s

221.7

TTFT

232.7ms

Peak VRAM

—

Prompt tokens

589

Output tokens

256

Prefill tokens

—

Context length

845

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

F16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 3:41 AM

Last edited

—

36B · Qwen

75.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

217ms

Context

1k · 3d ago

Show all run details

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

—

Revision

main

Family

Qwen

Parameters

36B

Active params

—

MoE

Output tok/s

75.2

Prefill tok/s

—

Total tok/s

217.4

TTFT

217.0ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.6

Top P

0.95

Top K

Min P

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 3:31 AM

Last edited

—

Nemotron-Cascade-2-30B-A3B

22B · Gpt

159.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

201ms

Context

131k · 3d ago

Show all run details

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

—

Revision

main

Family

Gpt

Parameters

22B

Active params

—

MoE

Output tok/s

159.7

Prefill tok/s

—

Total tok/s

468.5

TTFT

200.9ms

Peak VRAM

—

Prompt tokens

589

Output tokens

256

Prefill tokens

—

Context length

131072

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

—

Notes

GPT-OSS 20B Q8_0 (12 GB) on 1x R9700 single-card via llama-swap (-ts 0/1/0, c=131072, no kv quant). 3.6B active params (MoE). Best of 3 streamed runs - variance 0.27 tok/s across runs.

Reactions

—

Submitted

May 5, 2026, 3:24 AM

Last edited

—

32B

89.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

9971ms

Context

32k · 3d ago

Show all run details

Model

nvidia/Nemotron-Cascade-2-30B-A3B

Display name

Nemotron-Cascade-2-30B-A3B

Base model

—

Revision

main

Family

—

Parameters

32B

Active params

—

MoE

Output tok/s

89.8

Prefill tok/s

—

Total tok/s

2459.9

TTFT

9971.2ms

Peak VRAM

—

Prompt tokens

31364

Output tokens

256

Prefill tokens

—

Context length

31620

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

—

Notes

Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2x R9700 (-ts 1/1, c=262144, q8_0 KV, FA auto). Depth-31K test (~30K input prompt). TG = 89.78 tok/s, only -5.9% vs depth-0 (95.39 tok/s). Cold-prefill TTFT shown (9971ms); warm-cache runs hit 560ms via --cache-reuse 256. Best tok/s of 3 streamed runs.

Reactions

—

Submitted

May 5, 2026, 3:19 AM

Last edited

—

Devstral-Small-2-24B-Instruct-2512

28B · Qwen

19.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

171ms

Context

16k · 3d ago

Show all run details

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

—

Revision

main

Family

Qwen

Parameters

28B

Active params

—

MoE

Output tok/s

19.9

Prefill tok/s

—

Total tok/s

60.5

TTFT

171.0ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

16384

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

—

Notes

Qwen3.6-27B Q8_0 dense (28.5 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=16384, q8_0 KV, FA auto). Single-card variant: +10.3% TG vs prior 3-card 262K submission (18.08 tok/s). Best of 3 streamed runs.

Reactions

—

Submitted

May 5, 2026, 3:14 AM

Last edited

—

granite-4.1-30b

29B

17.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

273ms

Context

66k · 3d ago

Show all run details

Model

ibm-granite/granite-4.1-30b

Display name

granite-4.1-30b

Base model

—

Revision

main

Family

—

Parameters

29B

Active params

—

MoE

Output tok/s

17.3

Prefill tok/s

—

Total tok/s

—

TTFT

273.1ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

65536

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

—

Notes

IBM Granite 4.1 30B dense Q8_0 (29 GB) on 2x R9700 via llama-swap (-ts 1/1, c=65536, q8_0 KV, FA auto). Single-card not viable - 29 GB weights + heavy GQA KV (256 KB/tok) overflows 32 GB VRAM. Best of 3 streamed runs.

Reactions

—

Submitted

May 5, 2026, 3:03 AM

Last edited

—

24B · Mistral

23.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

244ms

Context

25k · 3d ago

Show all run details

Model

mistralai/Devstral-Small-2-24B-Instruct-2512

Display name

Devstral-Small-2-24B-Instruct-2512

Base model

Mistral-Small-3.1-24B-Base-2503

Revision

main

Family

Mistral

Parameters

24B

Active params

—

MoE

Output tok/s

23.9

Prefill tok/s

—

Total tok/s

200.6

TTFT

244.2ms

Peak VRAM

—

Prompt tokens

1078

Output tokens

139

Prefill tokens

—

Context length

24576

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

—

Notes

Devstral-Small-2-24B-Instruct-2512 Q8_0 (24 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=24576, q8_0 KV, FA auto). Single-card variant. Best of 3 streamed runs.

Reactions

—

Submitted

May 5, 2026, 2:58 AM

Last edited

—

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

33B

95.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

212ms

Context

1k · 3d ago

Show all run details

Model

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Display name

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Base model

—

Revision

main

Family

—

Parameters

33B

Active params

—

MoE

Output tok/s

95.8

Prefill tok/s

—

Total tok/s

275.2

TTFT

212.5ms

Peak VRAM

—

Prompt tokens

538

Output tokens

256

Prefill tokens

—

Context length

794

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.6

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

NVIDIA-Nemotron-3-Nano-Omni 30B-A3B Reasoning Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.

Reactions

—

Submitted

May 5, 2026, 2:53 AM

Last edited

—

Nemotron-Cascade-2-30B-A3B

32B

95.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

216ms

Context

1k · 3d ago

Show all run details

Model

nvidia/Nemotron-Cascade-2-30B-A3B

Display name

Nemotron-Cascade-2-30B-A3B

Base model

—

Revision

main

Family

—

Parameters

32B

Active params

—

MoE

Output tok/s

95.4

Prefill tok/s

—

Total tok/s

280.0

TTFT

216.0ms

Peak VRAM

—

Prompt tokens

556

Output tokens

256

Prefill tokens

—

Context length

812

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.

Reactions

—

Submitted

May 5, 2026, 2:47 AM

Last edited

—

28B · Qwen

20.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

807ms

Context

1k · 3d ago

Show all run details

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

—

Revision

main

Family

Qwen

Parameters

28B

Active params

—

MoE

Output tok/s

20.1

Prefill tok/s

—

Total tok/s

58.1

TTFT

807.5ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

787

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

Min P

Repeat penalty

—

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 5, 2026, 2:35 AM

Last edited

—

125B · Qwen

27.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

56499ms

Context

66k · 3d ago

Show all run details

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

—

Revision

main

Family

Qwen

Parameters

125B

Active params

—

MoE

Output tok/s

27.2

Prefill tok/s

—

Total tok/s

11.1

TTFT

56499.3ms

Peak VRAM

—

Prompt tokens

533

Output tokens

161

Prefill tokens

—

Context length

65536

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

Min P

0.01

Repeat penalty

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}' --flash-attn on

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

B+C config: --parallel 1 + --flash-attn on. Best of 3 runs on 3× R9700 (Vulkan).

Reactions

—

Submitted

May 5, 2026, 12:41 AM

Last edited

—

125B · Qwen

27.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

219ms

Context

1k · 3d ago

Show all run details

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

—

Revision

main

Family

Qwen

Parameters

125B

Active params

—

MoE

Output tok/s

27.3

Prefill tok/s

—

Total tok/s

113.5

TTFT

218.8ms

Peak VRAM

—

Prompt tokens

533

Output tokens

161

Prefill tokens

—

Context length

694

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

Top P

0.95

Top K

Min P

0.01

Repeat penalty

Mirostat

—

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Qwen3.5-122B-A10B MXFP4_MOE GGUF, tensor split 1/1/1, flash-attn auto, reasoning disabled. Best of 3 runs.

Reactions

—

Submitted

May 4, 2026, 11:53 PM

Last edited

—

120B · Gpt

70.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

236ms

Context

1k · 3d ago

Show all run details

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

—

Revision

main

Family

Gpt

Parameters

120B

Active params

—

MoE

Output tok/s

70.7

Prefill tok/s

—

Total tok/s

219.1

TTFT

236.4ms

Peak VRAM

—

Prompt tokens

589

Output tokens

256

Prefill tokens

—

Context length

845

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

—

GPU count

VRAM

96GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

F16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto.

Reactions

—

Submitted

May 4, 2026, 11:00 PM

Last edited

—

120B · Gpt

70.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · F16

TTFT

236ms

Context

1k · 3d ago

Show all run details

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

—

Revision

main

Family

Gpt

Parameters

120B

Active params

—

MoE

Output tok/s

70.0

Prefill tok/s

—

Total tok/s

217.0

TTFT

236.4ms

Peak VRAM

—

Prompt tokens

589

Output tokens

256

Prefill tokens

—

Context length

845

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

—

GPU count

VRAM

32GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

F16

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto. Best of 3 runs.

Reactions

—

Submitted

May 4, 2026, 10:42 PM

Last edited

—

73B · Qwen

20.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q4_0

TTFT

272ms

Context

131k · 3d ago

Show all run details

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

—

MoE

Output tok/s

20.9

Prefill tok/s

—

Total tok/s

64.5

TTFT

271.5ms

Peak VRAM

—

Prompt tokens

551

Output tokens

256

Prefill tokens

—

Context length

131072

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

—

GPU count

VRAM

32GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

yes

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

0.7

Top P

0.8

Top K

Min P

—

Repeat penalty

1.05

Mirostat

—

Command

Extra flags

Notes

Vulkan 3× R9700 + speculative decoding: Qwen2.5-Coder-0.5B-Instruct Q8_0 draft, --spec-draft-n-max 8 --spec-draft-p-min 0.7. Best of 3.

Reactions

—

Submitted

May 4, 2026, 10:09 PM

Last edited

—

28B · Qwen

18.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q8_0

TTFT

260ms

Context

262k · 3d ago

Show all run details

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

—

Revision

main

Family

Qwen

Parameters

28B

Active params

—

MoE

Output tok/s

18.1

Prefill tok/s

—

Total tok/s

54.6

TTFT

259.9ms

Peak VRAM

—

Prompt tokens

531

Output tokens

256

Prefill tokens

—

Context length

262144

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

—

GPU count

VRAM

32GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

—

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 4, 2026, 9:36 PM

Last edited

—

73B · Qwen

11.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q4_0

TTFT

259ms

Context

66k · 3d ago

Show all run details

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

—

MoE

Output tok/s

11.3

Prefill tok/s

—

Total tok/s

35.4

TTFT

258.6ms

Peak VRAM

—

Prompt tokens

551

Output tokens

256

Prefill tokens

—

Context length

65536

Batch size

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

—

GPU count

VRAM

32GB

Chip vendor

—

Chip family

—

Chip variant

—

Unified memory

—

NPU TOPS

—

CPU

AMD Ryzen 9 5950X

RAM

64GB

Arch Linux

Power

—

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

—

Pipeline parallel

—

GPU layers

—

Split mode

—

KV cache dtype

—

KV cache size

—

Prefix caching

—

Attention backend

—

Flash attention

—

Chunked prefill

—

Prefill chunk

—

Continuous batching

—

CPU offload

—

CPU layers

—

Rope scaling

—

Rope scale

—

Yarn ext factor

—

Engine quant

—

SGLang quant

—

GPU mem util

—

Max running seqs

—

Scheduler delay

—

Num parallel

—

Concurrency

—

Spec decoding

Spec method

—

Spec model

—

Spec draft model

—

Spec tokens

—

Spec ngram

—

Spec draft TP

—

MTP enabled

MTP draft layers

—

Temperature

—

Top P

—

Top K

—

Min P

—

Repeat penalty

—

Mirostat

—

Command

Extra flags

—

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

—

Submitted

May 4, 2026, 9:31 PM

Last edited

—