ModelsLeaderboardEvalsTrainRentalsAPI Docs
1337Hero

1337Hero

@1337Hero · Member since 2026

← Leaderboard
Mainly llama.cpp 64 approved runs

Total Submissions

64

64 approved public runs

Models Tested

24

unique models

Avg tok/s out

72.7

Best: 991.1 tok/s

Avg prefill

Avg total tok/s

217.2

Best: 2459.9 tok/s

Avg TTFT

1419ms

Best: 46ms

HW configs

4

distinct setups

Personal bests24

Fastest approved result for each model.

gpt-oss-20b

22B · Gpt

991.1

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

175ms

Context

33k · 2d ago

best 991.1 tok/s

127.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

194ms

Context

1k · 1d ago

best 127.2 tok/s
Qwen3.6-35B-A3B

36B · Qwen

112.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · MXFP4_MOE

TTFT

168ms

Context

1k · 2d ago

best 112.8 tok/s
Llama-2-7b

7B · Llama

110.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

171ms

Context

1k · 1d ago

best 110.1 tok/s
Qwen3.5-35B-A3B

35B · Qwen

108.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q3_K_M

TTFT

208ms

Context

1k · 1d ago

best 108.4 tok/s
Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

218ms

Context

1k · 2d ago

best 107.3 tok/s

100.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 2d ago

best 100.0 tok/s

95.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

179ms

Context

1k · 1d ago

best 95.9 tok/s
Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

216ms

Context

1k · 3d ago

best 95.4 tok/s

93.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

105ms

Context

1k · 3d ago

best 93.3 tok/s
gemma-4-26B-A4B-it

27B · Gemma

87.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

375ms

Context

1k · 3d ago

best 87.3 tok/s
Qwen3.5-4B

4B · Qwen

84.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 1d ago

best 84.9 tok/s
Qwen3-Coder-Next

80B · Qwen

80.8

tok/s

Hardware

2x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

250ms

Context

1k · 3d ago

best 80.8 tok/s

79.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · BF16

TTFT

184ms

Context

1k · 1d ago

best 79.9 tok/s
Qwen3-32B

32B · Qwen

79.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · FP8

TTFT

138ms

Context

33k · 2d ago

best 79.3 tok/s
gpt-oss-120b

120B · Gpt

71.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

233ms

Context

1k · 3d ago

best 71.5 tok/s
MiniMax-M2.7

229B · Minimax

32.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

46ms

Context

1k · 1d ago

best 32.6 tok/s
Qwen3.5-122B-A10B

125B · Qwen

27.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

219ms

Context

1k · 3d ago

best 27.3 tok/s

23.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

244ms

Context

25k · 3d ago

best 23.9 tok/s
Kimi-Dev-72B

73B · Qwen

20.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q4_0

TTFT

272ms

Context

131k · 3d ago

best 20.9 tok/s
Qwen3.6-27B

28B · Qwen

20.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

807ms

Context

1k · 3d ago

best 20.1 tok/s

17.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

224ms

Context

1k · 2d ago

best 17.9 tok/s
SWE-Dev-32B

33B · Qwen

16.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

699ms

Context

1k · 1d ago

best 16.2 tok/s

7.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

2033ms

Context

1k · 1d ago

best 7.4 tok/s
ModelBest tok/sHardwareEngine / quantPrefillTTFTctxWhenShare
gpt-oss-20b
22B4 runs
991.1
2x AMD Radeon AI Pro R9700 64GB
vllmMXFP4_MOE175ms33k2d ago
qwen3-coder-30b-a3b-codemonkey
30B4 runs
127.2out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ4_K_M194ms1k1d ago
Qwen3.6-35B-A3B
36B6 runs
112.8out
3x AMD Radeon AI Pro R9700 96GB
vllmMXFP4_MOE168ms1k2d ago
Llama-2-7b
7B1 run
110.1out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ4_0171ms1k1d ago
Qwen3.5-35B-A3B
35B1 run
108.4out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ3_K_M208ms1k1d ago
Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
33B2 runs
107.3out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0218ms1k2d ago
Qwen3-Coder-30B-A3B-Instruct
31B3 runs
100.0out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0198ms1k2d ago
Qwen3-VL-8B-Instruct
9B1 run
95.9out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ4_K_M179ms1k1d ago
Nemotron-Cascade-2-30B-A3B
32B2 runs
95.4out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0216ms1k3d ago
GLM-4.7-Flash
31B2 runs
93.3out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0105ms1k3d ago
gemma-4-26B-A4B-it
27B1 run
87.3out
2x AMD Radeon AI Pro R9700 32GB
llama.cppMXFP4_MOE375ms1k3d ago
Qwen3.5-4B
4B1 run
84.9out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0198ms1k1d ago
Qwen3-Coder-Next
80B2 runs
80.8out
2x AMD Radeon AI Pro R9700 32GB
llama.cppMXFP4_MOE250ms1k3d ago
Llama-3.2-3B-Instruct
3B1 run
79.9out
3x AMD Radeon AI Pro R9700 96GB
llama.cppBF16184ms1k1d ago
Qwen3-32B
32B3 runs
79.3
2x AMD Radeon AI Pro R9700 64GB
vllmFP8138ms33k2d ago
gpt-oss-120b
120B4 runs
71.5out
3x AMD Radeon AI Pro R9700 96GB
llama.cppF16233ms1k3d ago
MiniMax-M2.7
229B5 runs
32.6out
3x AMD Radeon AI Pro R9700 96GB
llama.cppIQ4_XS46ms1k1d ago
Qwen3.5-122B-A10B
125B4 runs
27.3out
3x AMD Radeon AI Pro R9700 96GB
llama.cppMXFP4_MOE219ms1k3d ago
Devstral-Small-2-24B-Instruct-2512
24B2 runs
23.9out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0244ms25k3d ago
Kimi-Dev-72B
73B4 runs
20.9out
3x AMD Radeon AI Pro R9700 32GB
llama.cppQ4_0272ms131k3d ago
Qwen3.6-27B
28B4 runs
20.1out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0807ms1k3d ago
granite-4.1-30b
29B2 runs
17.9out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0224ms1k2d ago
SWE-Dev-32B
33B1 run
16.2out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ8_0699ms1k1d ago
Mistral-Medium-3.5-128B
128B4 runs
7.4out
3x AMD Radeon AI Pro R9700 96GB
llama.cppQ4_02033ms1k1d ago

Hardware breakdown4

Distinct rigs used across approved submissions.

3x AMD Radeon AI Pro R9700 96GB

50 runs · best 160.8 tok/s

2x AMD Radeon AI Pro R9700 64GB

6 runs · best 991.1 tok/s

3x AMD Radeon AI Pro R9700 32GB

5 runs · best 70.0 tok/s

2x AMD Radeon AI Pro R9700 32GB

3 runs · best 92.9 tok/s

Engines used2

Runtime mix and top quantizations.

llama.cpp55F16 · IQ4_XS · Q4_0
vllm9FP8 · MXFP4_MOE

All runs64

Full approved benchmark history.

MiniMax-M2.7

229B · Minimax

30.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

57ms

Context

1k · today

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

no

Output tok/s

30.7

Prefill tok/s

Total tok/s

95.0

TTFT

57.1ms

Peak VRAM

Prompt tokens

541

Output tokens

256

Prefill tokens

Context length

797

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 --main-gpu 1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1 --main-gpu 1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 8, 2026, 8:02 AM

Last edited

6.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

1956ms

Context

1k · today

Show all run details

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

6.5

Prefill tok/s

Total tok/s

27.4

TTFT

1955.6ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

1142

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 8, 2026, 7:51 AM

Last edited

gpt-oss-120b

120B · Gpt

62.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

105ms

Context

2k · today

Show all run details

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

no

Output tok/s

62.7

Prefill tok/s

Total tok/s

99.0

TTFT

105.5ms

Peak VRAM

Prompt tokens

589

Output tokens

1000

Prefill tokens

Context length

1589

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 8, 2026, 7:44 AM

Last edited

MiniMax-M2.7

229B · Minimax

25.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

56ms

Context

33k · today

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

no

Output tok/s

25.2

Prefill tok/s

Total tok/s

79.9

TTFT

56.1ms

Peak VRAM

Prompt tokens

560

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k

Reactions

Submitted

May 8, 2026, 7:13 AM

Last edited

MiniMax-M2.7

229B · Minimax

25.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

5267ms

Context

1k · today

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

no

Output tok/s

25.2

Prefill tok/s

Total tok/s

52.9

TTFT

5266.6ms

Peak VRAM

Prompt tokens

560

Output tokens

256

Prefill tokens

Context length

816

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k

Reactions

Submitted

May 8, 2026, 7:10 AM

Last edited

MiniMax-M2.7

229B · Minimax

31.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

771ms

Context

33k · 1d ago

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

no

Output tok/s

31.8

Prefill tok/s

Total tok/s

90.8

TTFT

770.5ms

Peak VRAM

Prompt tokens

546

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B IQ4_XS imatrix. ctx=32768, parallel=1. Thinking disabled.

Reactions

Submitted

May 7, 2026, 4:58 AM

Last edited

MiniMax-M2.7

229B · Minimax

32.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · IQ4_XS

TTFT

46ms

Context

1k · 1d ago

Show all run details

Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

no

Output tok/s

32.6

Prefill tok/s

Total tok/s

101.4

TTFT

46.3ms

Peak VRAM

Prompt tokens

546

Output tokens

256

Prefill tokens

Context length

802

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B IQ4_XS imatrix (mradermacher). Thinking disabled via system prompt.

Reactions

Submitted

May 7, 2026, 4:39 AM

Last edited

Llama-2-7b

7B · Llama

110.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

171ms

Context

1k · 1d ago

Show all run details

Model

meta-llama/Llama-2-7b

Display name

Llama-2-7b

Base model

Revision

main

Family

Llama

Parameters

7B

Active params

MoE

no

Output tok/s

110.1

Prefill tok/s

Total tok/s

347.5

TTFT

171.1ms

Peak VRAM

Prompt tokens

611

Output tokens

256

Prefill tokens

Context length

867

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/llama/llama-2-7b.Q4_0.gguf -ts 0/1 -c 4096

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:49 AM

Last edited

Qwen3.5-35B-A3B

35B · Qwen

108.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q3_K_M

TTFT

208ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.5-35B-A3B

Display name

Qwen3.5-35B-A3B

Base model

Qwen3.5-35B-A3B-Base

Revision

main

Family

Qwen

Parameters

35B

Active params

MoE

no

Output tok/s

108.4

Prefill tok/s

Total tok/s

306.4

TTFT

207.7ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q3_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/qwen/Qwen3.5-35B-A3B-Uncensored.Q3_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:43 AM

Last edited

SWE-Dev-32B

33B · Qwen

16.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

699ms

Context

1k · 1d ago

Show all run details

Model

zai-org/SWE-Dev-32B

Display name

SWE-Dev-32B

Base model

Qwen2.5-Coder-32B-Instruct

Revision

main

Family

Qwen

Parameters

33B

Active params

MoE

no

Output tok/s

16.2

Prefill tok/s

Total tok/s

49.0

TTFT

698.6ms

Peak VRAM

Prompt tokens

551

Output tokens

256

Prefill tokens

Context length

807

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.95

Top K

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m "/home/mikekey/models/gguf/misc/swe-dev-32b-q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 1/1 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:36 AM

Last edited

95.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

179ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3-VL-8B-Instruct

Display name

Qwen3-VL-8B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

9B

Active params

MoE

no

Output tok/s

95.9

Prefill tok/s

Total tok/s

276.0

TTFT

178.8ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3VL-8B-Instruct-Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:29 AM

Last edited

79.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · BF16

TTFT

184ms

Context

1k · 1d ago

Show all run details

Model

meta-llama/Llama-3.2-3B-Instruct

Display name

Llama-3.2-3B-Instruct

Base model

Revision

main

Family

Llama

Parameters

3B

Active params

MoE

no

Output tok/s

79.9

Prefill tok/s

Total tok/s

239.7

TTFT

183.6ms

Peak VRAM

Prompt tokens

556

Output tokens

256

Prefill tokens

Context length

812

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

BF16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.9

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/llama/Llama-3.2-3B-Instruct-BF16.gguf --temp 0.7 --top-p 0.9 -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:23 AM

Last edited

Qwen3.5-4B

4B · Qwen

84.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.5-4B

Display name

Qwen3.5-4B

Base model

Qwen3.5-4B-Base

Revision

main

Family

Qwen

Parameters

4B

Active params

MoE

no

Output tok/s

84.9

Prefill tok/s

Total tok/s

244.8

TTFT

197.6ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.95

Top K

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m "/home/mikekey/models/gguf/misc/Opus4.7-GODs.Ghost.Codex.Distill.4B-Q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:06 AM

Last edited

Kimi-Dev-72B

73B · Qwen

15.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

266ms

Context

1k · 1d ago

Show all run details

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

15.5

Prefill tok/s

Total tok/s

56.7

TTFT

265.7ms

Peak VRAM

Prompt tokens

697

Output tokens

256

Prefill tokens

Context length

953

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + Qwen2.5-Coder-0.5B-Q8_0 draft (n-max=8, n-min=1, p-min=0.7), 128K ctx. Code-completion prompt — paired baseline (no spec) at ~11.4 tok/s shows ~+35% gain from speculative decoding on code-shaped output.

Reactions

Submitted

May 6, 2026, 7:22 PM

Last edited

Kimi-Dev-72B

73B · Qwen

11.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

250ms

Context

1k · 1d ago

Show all run details

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

11.5

Prefill tok/s

Total tok/s

42.3

TTFT

249.7ms

Peak VRAM

Prompt tokens

697

Output tokens

256

Prefill tokens

Context length

953

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8 KV, 128K ctx. Code-completion prompt (Python module continuation, ~700 input tokens). Baseline for spec-decoding A/B — paired with Kimi-Dev-72B-spec submission.

Reactions

Submitted

May 6, 2026, 7:14 PM

Last edited

23.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

214ms

Context

1k · 1d ago

Show all run details

Model

mistralai/Devstral-Small-2-24B-Instruct-2512

Display name

Devstral-Small-2-24B-Instruct-2512

Base model

Mistral-Small-3.1-24B-Base-2503

Revision

main

Family

Mistral

Parameters

24B

Active params

MoE

no

Output tok/s

23.6

Prefill tok/s

Total tok/s

204.1

TTFT

213.6ms

Peak VRAM

Prompt tokens

1078

Output tokens

135

Prefill tokens

Context length

1213

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.15

Top P

Top K

Min P

0.01

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --temp 0.15 --min-p 0.01 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 24576

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 24K ctx, q8 KV, FA auto. Single-card layout.

Reactions

Submitted

May 6, 2026, 7:02 PM

Last edited

Qwen3.6-27B

28B · Qwen

19.6

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

263ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

19.6

Prefill tok/s

Total tok/s

59.2

TTFT

262.8ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 16K ctx, q8 KV, FA auto. Single-card layout to eliminate cross-GPU overhead on 27B dense.

Reactions

Submitted

May 6, 2026, 6:56 PM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

80.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

203ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

80.5

Prefill tok/s

Total tok/s

232.6

TTFT

202.6ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.

Reactions

Submitted

May 6, 2026, 6:50 PM

Last edited

92.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

178ms

Context

1k · 1d ago

Show all run details

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

31B

Active params

MoE

no

Output tok/s

92.0

Prefill tok/s

Total tok/s

365.9

TTFT

178.2ms

Peak VRAM

Prompt tokens

530

Output tokens

156

Prefill tokens

Context length

686

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.

Reactions

Submitted

May 6, 2026, 6:42 PM

Last edited

127.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

194ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

127.2

Prefill tok/s

Total tok/s

356.1

TTFT

193.7ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto.

Reactions

Submitted

May 6, 2026, 4:37 PM

Last edited

126.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_K_M

TTFT

209ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

126.5

Prefill tok/s

Total tok/s

352.0

TTFT

208.9ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.from-Q8_0.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto, Q4_K_M derived from Q8_0 source.

Reactions

Submitted

May 6, 2026, 4:32 PM

Last edited

90.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

203ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

90.0

Prefill tok/s

Total tok/s

257.8

TTFT

203.3ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 2× R9700, 32K ctx, q8 KV, FA auto.

Reactions

Submitted

May 6, 2026, 4:26 PM

Last edited

20.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · BF16

TTFT

222ms

Context

1k · 1d ago

Show all run details

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

20.3

Prefill tok/s

Total tok/s

61.2

TTFT

221.6ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

BF16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.BF16.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 3× R9700, 16K ctx, q8 KV, FA auto.

Reactions

Submitted

May 6, 2026, 4:19 PM

Last edited

7.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

2033ms

Context

1k · 1d ago

Show all run details

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

7.4

Prefill tok/s

Total tok/s

31.4

TTFT

2032.8ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

1142

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8_0 KV, 16K ctx, ts 1/1/1, FA auto.

Reactions

Submitted

May 6, 2026, 12:50 PM

Last edited

6.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

317ms

Context

16k · 2d ago

Show all run details

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

6.9

Prefill tok/s

Total tok/s

30.4

TTFT

317.4ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

16384

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8979-66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

q8_0

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

-ctk q8_0 -ctv q8_0 -ts 1/1/1 --no-mmap --cache-ram 0 --no-cache-idle-slots -c 16384

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Mistral Medium 3.5 128B Q4_0 GGUF, 3 shards, tensor split 1/1/1, q8_0 KV cache, flash-attn auto, 16K ctx, single slot. Median of 3 runs. Submitted with the running binary's build_info b8979-66bafdcf1.
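
Side note on the 3-shard GGUF: pointing llama-server at the -00001-of-00003 file is enough, since llama.cpp detects and loads the sibling shards automatically. If a single file is ever preferred, the shards can be merged with the llama-gguf-split tool that ships with llama.cpp (a sketch; the merge flag and output name are from memory, so verify against the local build's --help):

# Merge split GGUF shards into one file (finds -00002- and -00003- itself)
/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-gguf-split --merge \
  Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf \
  Mistral-Medium-3.5-128B-Q4_0-merged.gguf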

Submitted

May 6, 2026, 6:40 AM

Last edited

6.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q4_0

TTFT

2082ms

Context

1k · 2d ago

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

6.2

Prefill tok/s

Total tok/s

26.4

TTFT

2081.9ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

1142

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.95

Top K

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf --temp 0.7 --top-p 0.95 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

3x R9700 Vulkan, -ts 1/1/1, q8 KV, flash-attn auto, 32K ctx

Submitted

May 6, 2026, 6:01 AM

Last edited

17.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

224ms

Context

1k · 2d ago

Model

ibm-granite/granite-4.1-30b

Display name

granite-4.1-30b

Base model

Revision

main

Family

Parameters

29B

Active params

MoE

no

Output tok/s

17.9

Prefill tok/s

Total tok/s

54.1

TTFT

223.9ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/ibm-granite_granite-4.1-30b-Q8_0.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 65536

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 6, 2026, 4:57 AM

Last edited

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

218ms

Context

1k · 2d ago

Model

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Display name

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Base model

Revision

main

Family

Parameters

33B

Active params

MoE

no

Output tok/s

107.3

Prefill tok/s

Total tok/s

304.9

TTFT

217.8ms

Peak VRAM

Prompt tokens

538

Output tokens

256

Prefill tokens

Context length

794

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

llama.cpp Vulkan on 3x R9700 via llama-swap, ts 1/1 (2-card). Median of 3 runs (selected for representative TTFT).

Submitted

May 6, 2026, 4:45 AM

Last edited

Qwen3-32B

32B · Qwen

22.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · FP8

TTFT

60ms

Context

1k · 2d ago

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

Revision

main

Family

Qwen

Parameters

32B

Active params

MoE

no

Output tok/s

22.8

Prefill tok/s

Total tok/s

69.5

TTFT

60.0ms

Peak VRAM

34.2GB

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

FP8

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.95

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3-32B-FP8 --max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}'

Extra flags

--max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}

Notes

Post-reboot validation; 2x R9700, TP=2, dense FP8.

Submitted

May 6, 2026, 4:12 AM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

112.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · MXFP4_MOE

TTFT

168ms

Context

1k · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

112.8

Prefill tok/s

Total tok/s

322.8

TTFT

168ms

Peak VRAM

32.0GB

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.95

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

Extra flags

--max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}

Notes

vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH (bypasses RCCL 7.2.x gfx1201 tuning-index miss / TP=2 deadlock). TunableOp tuned ~300 rows/rank from cudagraph capture + diverse-load driving. MoE backend Triton (gfx950 gate routes only gfx950 to CK on this vLLM build). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions, post heavy warmup.

Submitted

May 6, 2026, 3:18 AM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

98.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

vllm · MXFP4_MOE

TTFT

171ms

Context

1k · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

98.9

Prefill tok/s

Total tok/s

285.2

TTFT

171.4ms

Peak VRAM

32GB

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.95

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

Extra flags

--max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}

Notes

vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH to bypass RCCL 7.2.x gfx1201 tuning-index miss (TP=2 deadlock). aiter built but MoE backend still Triton (gfx950 gate routes only gfx950 to CK). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions.

Submitted

May 6, 2026, 2:23 AM

Last edited

gpt-oss-20b

22B · Gpt

991.1

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

175ms

Context

33k · 2d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

991.1

Prefill tok/s

Total tok/s

TTFT

175.4ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

32768

Batch size

16

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

4096

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

16

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Extra flags

--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Notes

gpt-oss-20b MXFP4 concurrent throughput at batch=16 (matches --max-num-seqs 16). Aggregate output tok/s, best of 3 runs (991.1 / 986.8 / 990.6; variance under 0.5%). Per-request decode 64.8 tok/s. TTFT 175ms. This is vLLM's strength zone vs llama.cpp single-stream: roughly 6x the throughput when serving multiple concurrent users. For reference, batch=1 single-stream was 48 tok/s (separate submission), and llama.cpp Q8_0 single-stream on the same hardware was 160 tok/s.
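
The aggregate figure is easy to approximate against the OpenAI-compatible endpoint: fire 16 identical requests in parallel, time the wall clock, and divide total generated tokens by elapsed seconds. A rough sketch (port, model name, and prompt are assumptions; ignore_eos is a vLLM extension that forces all 256 tokens to be generated):

start=$(date +%s.%N)
# 16 concurrent completions, 256 output tokens each
seq 16 | xargs -P16 -I{} curl -s -o /dev/null http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "openai/gpt-oss-20b", "prompt": "Explain KV caching.", "max_tokens": 256, "ignore_eos": true}'
end=$(date +%s.%N)
# Aggregate output tok/s = total output tokens / wall-clock seconds
echo "scale=1; 16 * 256 / ($end - $start)" | bc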

Submitted

May 5, 2026, 11:29 PM

Last edited

gpt-oss-20b

22B · Gpt

48.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

89ms

Context

33k · 2d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

48.3

Prefill tok/s

Total tok/s

156.7

TTFT

88.6ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

4096

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

16

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Extra flags

--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Notes

gpt-oss-20b native MXFP4 on 2x R9700 (gfx1201) TP=2, batch=1. Uses the *fused* OAI Triton MXFP4 MoE kernel (triton, not triton_unfused) — gpt-oss is the kernel's original target. Same 3 wheel patches as Qwen3.6 MXFP4 entry (PR #37826 device gate, CLI enum, dtype). RCCL 7.1.1 hotfix for #40980. Concurrent throughput much higher: 580 tok/s at c=8, 902 tok/s at c=16 (batch submissions to follow).

Submitted

May 5, 2026, 11:24 PM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

33.6

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

130ms

Context

33k · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

33.6

Prefill tok/s

Total tok/s

101.7

TTFT

129.6ms

Peak VRAM

Prompt tokens

533

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.93

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop_merged.csv VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Extra flags

--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Notes

Same setup as previous MXFP4 submission (PR #37826 patched, RCCL 7.1.1 hotfix, triton_unfused, dtype=bf16) — now WITH PyTorch TunableOp tuned (754 rows / 377 per TP rank). Tuned with 18-min diverse-load pass: prompt sizes 8/64/256/1024/2048 tokens x 5 content categories x concurrency 1+4. Auto-stopped on row plateau (3x60s zero-delta). Tuning gain over untuned baseline: +7.0% (31.70 -> 33.92 tok/s). Modest gain — batch=1 decode on R9700 is memory-bandwidth-bound, not GEMM-scheduling-bound.
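
The TunableOp workflow behind this is two-phase: run once with tuning enabled while driving representative load, so PyTorch benchmarks GEMM implementations and records the winners to a CSV, then serve with tuning disabled so the recorded picks are replayed without benchmarking overhead. A sketch with the serve flags abbreviated (the full set is in the Command field above; with TP=2, expect per-rank result files, which the _merged filename above suggests were combined afterwards):

# Phase 1: tune under load; rows accumulate as new GEMM shapes are seen
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 \
PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop.csv \
vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 ...

# Phase 2: replay only; same CSV, tuning off (what the Command above does)
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 \
PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop.csv \
vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 ...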

Submitted

May 5, 2026, 11:00 PM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

31.7

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · MXFP4_MOE

TTFT

140ms

Context

33k · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

31.7

Prefill tok/s

Total tok/s

96.0

TTFT

140.0ms

Peak VRAM

Prompt tokens

533

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.93

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MOE=0 VLLM_MXFP4_USE_MARLIN=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Extra flags

--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Notes

Qwen3.6-35B-A3B MXFP4 (pahajokiconsulting quant) on 2x R9700 (gfx1201) TP=2. Baseline path (no AITER, no TunableOp, no MTP). Required THREE patches to vLLM 0.20.1+rocm721 wheel: (1) PR vllm-project/vllm#37826 in-place patch -- widen OAI Triton MoE device gate to gfx12, (2) add 'triton_unfused' to MoEBackend Literal in config/kernel.py (CLI exposure), (3) RCCL 7.1.1 LD_PRELOAD for #40980 deadlock. --moe-backend triton_unfused needed because TRITON kernel only supports SWIGLUOAI activation; UnfusedOAITritonExperts adds SILU. dtype=bfloat16 required (kernel asserts hidden_states.dtype==bf16). Best of 3 (31.6 / 31.7 / 31.6 tok/s).
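
The "in-place patch" technique is generic: locate the installed wheel's package directory and apply the PR diff straight into site-packages. A sketch assuming the PR's diff was saved locally as 37826.diff (the filename is illustrative, and hunk offsets against this exact build may differ):

# Locate the installed vllm package
VLLM_DIR=$(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')
# GitHub diffs use a/vllm/... and b/vllm/... prefixes, so strip two path components
patch -p2 -d "$VLLM_DIR" < 37826.diff
# Sanity-check the widened device gate before serving
grep -rn "gfx12" "$VLLM_DIR" | head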

Submitted

May 5, 2026, 9:49 PM

Last edited

Qwen3-32B

32B · Qwen

79.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · FP8

TTFT

138ms

Context

33k · 2d ago

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

Revision

main

Family

Qwen

Parameters

32B

Active params

MoE

no

Output tok/s

79.3

Prefill tok/s

Total tok/s

TTFT

138.3ms

Peak VRAM

Prompt tokens

534

Output tokens

256

Prefill tokens

Context length

32768

Batch size

4

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

FP8

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3

Extra flags

--max-model-len 32768 --reasoning-parser qwen3

Notes

Concurrent throughput at batch=4 (matches --max-num-seqs 4). Aggregate output tok/s, best of 3 runs (78.7 / 78.2 / 79.3). Per-request decode ~20 tok/s. RCCL 7.1.1 hotfix for #40980 still in effect.

Submitted

May 5, 2026, 9:32 PM

Last edited

Qwen3-32B

32B · Qwen

22.9

tok/s

Hardware

2x AMD Radeon AI Pro R9700 64GB

Engine

vllm · FP8

TTFT

71ms

Context

33k · 2d ago

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

Revision

main

Family

Qwen

Parameters

32B

Active params

MoE

no

Output tok/s

22.9

Prefill tok/s

Total tok/s

70.1

TTFT

71.5ms

Peak VRAM

Prompt tokens

534

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

FP8

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3

Extra flags

--max-model-len 32768 --reasoning-parser qwen3

Notes

vLLM 0.20.1+rocm721 on 2x AMD R9700 (gfx1201) TP=2. Workaround for vllm-project/vllm#40980: LD_PRELOAD librccl.so.1 from rccl-7.1.1 (Arch Linux Archive); RCCL 7.2.x lacks a gfx1201 tuning index and deadlocks on AllReduce. With the hotfix: ~22.7 tok/s; without: ~0.5 tok/s.
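
Reproducing the hotfix amounts to parking the older librccl somewhere private and putting it first on the loader path. A sketch of pulling 7.1.1 from the Arch Linux Archive (the package filename is an assumption; check the archive listing for the real one):

mkdir -p ~/.local/lib/rccl-7.1.1 && cd "$(mktemp -d)"
curl -LO https://archive.archlinux.org/packages/r/rccl/rccl-7.1.1-1-x86_64.pkg.tar.zst
tar --zstd -xf rccl-7.1.1-1-x86_64.pkg.tar.zst
cp usr/lib/librccl.so* ~/.local/lib/rccl-7.1.1/
# Then launch as in the Command above:
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH vllm serve ...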

Submitted

May 5, 2026, 9:26 PM

Last edited

100.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

198ms

Context

1k · 2d ago

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

31B

Active params

MoE

no

Output tok/s

100.0

Prefill tok/s

Total tok/s

390.1

TTFT

198.5ms

Peak VRAM

Prompt tokens

530

Output tokens

156

Prefill tokens

Context length

686

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 7:00 PM

Last edited

gpt-oss-20b

22B · Gpt

160.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

206ms

Context

1k · 3d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

160.8

Prefill tok/s

Total tok/s

470.0

TTFT

206.1ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-20b-Q8_0.gguf -ts 0/1/0 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 4:52 AM

Last edited

93.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

105ms

Context

1k · 3d ago

Model

zai-org/GLM-4.7-Flash

Display name

GLM-4.7-Flash

Base model

Revision

main

Family

Parameters

31B

Active params

MoE

no

Output tok/s

93.3

Prefill tok/s

Total tok/s

274.8

TTFT

104.8ms

Peak VRAM

Prompt tokens

527

Output tokens

256

Prefill tokens

Context length

783

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/glm/GLM-4.7-Flash-MXFP4_MOE.gguf --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 4:40 AM

Last edited

Qwen3-Coder-Next

80B · Qwen

80.4

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

234ms

Context

1k · 3d ago

Model

Qwen/Qwen3-Coder-Next

Display name

Qwen3-Coder-Next

Base model

Revision

main

Family

Qwen

Parameters

80B

Active params

MoE

no

Output tok/s

80.4

Prefill tok/s

Total tok/s

280.1

TTFT

233.8ms

Peak VRAM

Prompt tokens

530

Output tokens

187

Prefill tokens

Context length

717

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-Next-MXFP4_MOE.gguf -ts 1/1 -c 262144 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 4:31 AM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

26.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

261ms

Context

1k · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

no

Output tok/s

26.8

Prefill tok/s

Total tok/s

110.8

TTFT

261.3ms

Peak VRAM

Prompt tokens

533

Output tokens

161

Prefill tokens

Context length

694

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 4:19 AM

Last edited

87.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

207ms

Context

1k · 3d ago

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

31B

Active params

MoE

no

Output tok/s

87.2

Prefill tok/s

Total tok/s

343.5

TTFT

207.0ms

Peak VRAM

Prompt tokens

530

Output tokens

156

Prefill tokens

Context length

686

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 5, 2026, 3:51 AM

Last edited

gpt-oss-120b

120B · Gpt

71.5

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

233ms

Context

1k · 3d ago

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

no

Output tok/s

71.5

Prefill tok/s

Total tok/s

221.7

TTFT

232.7ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 3:41 AM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

75.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

217ms

Context

1k · 3d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

75.2

Prefill tok/s

Total tok/s

217.4

TTFT

217.0ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 3:31 AM

Last edited

gpt-oss-20b

22B · Gpt

159.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

201ms

Context

131k · 3d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

159.7

Prefill tok/s

Total tok/s

468.5

TTFT

200.9ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

131072

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

GPT-OSS 20B Q8_0 (12 GB) on 1x R9700 single-card via llama-swap (-ts 0/1/0, c=131072, no KV quant). 3.6B active params (MoE). Best of 3 streamed runs; variance 0.27 tok/s across runs.
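
A zero share in the tensor split is what pins the model: -ts 0/1/0 assigns everything to card 1 and nothing to cards 0 and 2. A plausible invocation for this run, sketched from the note and the launcher used by the other runs on this page:

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-20b-Q8_0.gguf -ts 0/1/0 -c 131072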

Submitted

May 5, 2026, 3:24 AM

Last edited

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

9971ms

Context

32k · 3d ago

Model

nvidia/Nemotron-Cascade-2-30B-A3B

Display name

Nemotron-Cascade-2-30B-A3B

Base model

Revision

main

Family

Parameters

32B

Active params

MoE

no

Output tok/s

89.8

Prefill tok/s

Total tok/s

2459.9

TTFT

9971.2ms

Peak VRAM

Prompt tokens

31364

Output tokens

256

Prefill tokens

Context length

31620

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2x R9700 (-ts 1/1, c=262144, q8_0 KV, FA auto). Depth-31K test (~30K input prompt). TG = 89.78 tok/s, only -5.9% vs depth-0 (95.39 tok/s). Cold-prefill TTFT shown (9971ms); warm-cache runs hit 560ms via --cache-reuse 256. Best tok/s of 3 streamed runs.
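
The cold/warm gap is simple to observe: send the same ~30K-token prompt twice and compare time to first byte; with --cache-reuse 256 the second request reuses the already-prefilled prefix from the slot cache. A sketch (port and payload file are placeholders; the request body should set "stream": true so the first byte arrives at first-token time):

# Run the identical request twice; the second time should collapse to sub-second
for i in 1 2; do
  curl -s -o /dev/null -w "request $i first byte: %{time_starttransfer}s\n" \
    http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' -d @long_prompt.json
done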

Submitted

May 5, 2026, 3:19 AM

Last edited

Qwen3.6-27B

28B · Qwen

19.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

171ms

Context

16k · 3d ago

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

19.9

Prefill tok/s

Total tok/s

60.5

TTFT

171.0ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

16384

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

Qwen3.6-27B Q8_0 dense (28.5 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=16384, q8_0 KV, FA auto). Single-card variant: +10.3% TG vs prior 3-card 262K submission (18.08 tok/s). Best of 3 streamed runs.

Submitted

May 5, 2026, 3:14 AM

Last edited

17.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

273ms

Context

66k · 3d ago

Model

ibm-granite/granite-4.1-30b

Display name

granite-4.1-30b

Base model

Revision

main

Family

Parameters

29B

Active params

MoE

no

Output tok/s

17.3

Prefill tok/s

Total tok/s

TTFT

273.1ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

IBM Granite 4.1 30B dense Q8_0 (29 GB) on 2x R9700 via llama-swap (-ts 1/1, c=65536, q8_0 KV, FA auto). Single-card not viable: 29 GB of weights plus heavy GQA KV (256 KB/token) overflow 32 GB of VRAM. Best of 3 streamed runs.
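
The back-of-envelope check: at the quoted 256 KB/token, the 65,536-token context alone costs 16 GB of KV cache, which on top of the 29 GB of weights lands well past one 32 GB card:

# 256 KiB/token * 65536 tokens, expressed in GiB
echo "$((256 * 65536 / 1024 / 1024)) GiB KV"   # -> 16 GiB
# 29 GiB weights + 16 GiB KV = 45 GiB needed vs 32 GiB on a single R9700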

Submitted

May 5, 2026, 3:03 AM

Last edited

23.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

244ms

Context

25k · 3d ago

Model

mistralai/Devstral-Small-2-24B-Instruct-2512

Display name

Devstral-Small-2-24B-Instruct-2512

Base model

Mistral-Small-3.1-24B-Base-2503

Revision

main

Family

Mistral

Parameters

24B

Active params

MoE

no

Output tok/s

23.9

Prefill tok/s

Total tok/s

200.6

TTFT

244.2ms

Peak VRAM

Prompt tokens

1078

Output tokens

139

Prefill tokens

Context length

24576

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

Devstral-Small-2-24B-Instruct-2512 Q8_0 (24 GB) on 1x R9700 single-card via llama-swap (-ts 1/0/0, c=24576, q8_0 KV, FA auto). Single-card variant. Best of 3 streamed runs.

Submitted

May 5, 2026, 2:58 AM

Last edited

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

212ms

Context

1k · 3d ago

Model

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Display name

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Base model

Revision

main

Family

Parameters

33B

Active params

MoE

no

Output tok/s

95.8

Prefill tok/s

Total tok/s

275.2

TTFT

212.5ms

Peak VRAM

Prompt tokens

538

Output tokens

256

Prefill tokens

Context length

794

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

NVIDIA-Nemotron-3-Nano-Omni 30B-A3B Reasoning Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.

Submitted

May 5, 2026, 2:53 AM

Last edited

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

216ms

Context

1k · 3d ago

Model

nvidia/Nemotron-Cascade-2-30B-A3B

Display name

Nemotron-Cascade-2-30B-A3B

Base model

Revision

main

Family

Parameters

32B

Active params

MoE

no

Output tok/s

95.4

Prefill tok/s

Total tok/s

280.0

TTFT

216.0ms

Peak VRAM

Prompt tokens

556

Output tokens

256

Prefill tokens

Context length

812

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.

Reactions

Submitted

May 5, 2026, 2:47 AM

Last edited

Qwen3.6-27B

28B · Qwen

20.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · Q8_0

TTFT

807ms

Context

1k · 3d ago

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

20.1

Prefill tok/s

Total tok/s

58.1

TTFT

807.5ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 5, 2026, 2:35 AM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

27.2

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

56499ms

Context

66k · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

yes

Output tok/s

27.2

Prefill tok/s

Total tok/s

11.1

TTFT

56499.3ms

Peak VRAM

Prompt tokens

533

Output tokens

161

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}' --flash-attn on

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

B+C config: --parallel 1 + --flash-attn on. Best of 3 runs on 3× R9700 (Vulkan).

Reactions

Submitted

May 5, 2026, 12:41 AM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

27.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · MXFP4_MOE

TTFT

219ms

Context

1k · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

yes

Output tok/s

27.3

Prefill tok/s

Total tok/s

113.5

TTFT

218.8ms

Peak VRAM

Prompt tokens

533

Output tokens

161

Prefill tokens

Context length

694

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Qwen3.5-122B-A10B MXFP4_MOE GGUF, tensor split 1/1/1, flash-attn auto, reasoning disabled. Best of 3 runs.

Reactions

Submitted

May 4, 2026, 11:53 PM

Last edited

gpt-oss-120b

120B · Gpt

70.7

tok/s

Hardware

3x AMD Radeon AI Pro R9700 96GB

Engine

llama.cpp · F16

TTFT

236ms

Context

1k · 3d ago

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

yes

Output tok/s

70.7

Prefill tok/s

Total tok/s

219.1

TTFT

236.4ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto.
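
The q8_0 KV cache here roughly halves cache memory versus the f16 default. A generic sizing sketch; the layer/head numbers below are illustrative placeholders, not gpt-oss-120b's actual dimensions:

# KV bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × ctx × bytes/elem
# f16 = 2 bytes/elem; q8_0 ≈ 1.0625 bytes/elem (int8 plus block scales).
layers=36; kv_heads=8; head_dim=64; ctx=131072
echo "2 * $layers * $kv_heads * $head_dim * $ctx * 2 / 2^30" | bc -l       # f16 GiB
echo "2 * $layers * $kv_heads * $head_dim * $ctx * 1.0625 / 2^30" | bc -l  # q8_0 GiB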

Reactions

Submitted

May 4, 2026, 11:00 PM

Last edited

gpt-oss-120b

120B · Gpt

70.0

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · F16

TTFT

236ms

Context

1k · 3d ago

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

yes

Output tok/s

70.0

Prefill tok/s

Total tok/s

217.0

TTFT

236.4ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto. Best of 3 runs.

Reactions

Submitted

May 4, 2026, 10:42 PM

Last edited

Kimi-Dev-72B

73B · Qwen

20.9

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q4_0

TTFT

272ms

Context

131k · 3d ago

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

20.9

Prefill tok/s

Total tok/s

64.5

TTFT

271.5ms

Peak VRAM

Prompt tokens

551

Output tokens

256

Prefill tokens

Context length

131072

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

yes

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

Vulkan 3× R9700 + speculative decoding: Qwen2.5-Coder-0.5B-Instruct Q8_0 draft, --spec-draft-n-max 8 --spec-draft-p-min 0.7. Best of 3.
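
The speculative-decoding setup from this run, isolated into a minimal sketch (paths are placeholders; the --spec-draft-* flags are copied verbatim from the command above):

# The 0.5B Q8_0 draft proposes up to 8 tokens per step; the 72B target
# only verifies them, so accepted draft tokens cost far less than full decode steps.
llama-server -m /path/to/Kimi-Dev-72B-Q4_0.gguf \
  --spec-draft-model /path/to/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
  --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 \
  --spec-draft-ngl 99 -ngl 99 -ts 1/1/1 -c 131072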

Reactions

Submitted

May 4, 2026, 10:09 PM

Last edited

Qwen3.6-27B

28B · Qwen

18.1

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q8_0

TTFT

260ms

Context

262k · 3d ago

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

18.1

Prefill tok/s

Total tok/s

54.6

TTFT

259.9ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

262144

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 4, 2026, 9:36 PM

Last edited

Kimi-Dev-72B

73B · Qwen

11.3

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · Q4_0

TTFT

259ms

Context

66k · 3d ago

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

11.3

Prefill tok/s

Total tok/s

35.4

TTFT

258.6ms

Peak VRAM

Prompt tokens

551

Output tokens

256

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 4, 2026, 9:31 PM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

26.8

tok/s

Hardware

3x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

232ms

Context

66k · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

yes

Output tok/s

26.8

Prefill tok/s

Total tok/s

80.6

TTFT

232.2ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

noctrex/Qwen3.5-122B-A10B-MXFP4_MOE-GGUF on 3× R9700 (96 GB pool), llama.cpp Vulkan, -ts 1/1/1, ctx 64K. Best of 3 runs.

Reactions

Submitted

May 4, 2026, 6:50 PM

Last edited

Qwen3-Coder-Next

80B · Qwen

80.8

tok/s

Hardware

2x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

250ms

Context

1k · 3d ago

Model

Qwen/Qwen3-Coder-Next

Display name

Qwen3-Coder-Next

Base model

Revision

main

Family

Qwen

Parameters

80B

Active params

MoE

yes

Output tok/s

80.8

Prefill tok/s

Total tok/s

279.7

TTFT

249.6ms

Peak VRAM

Prompt tokens

530

Output tokens

187

Prefill tokens

Context length

717

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

2

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan, dual R9700 (-ts 1/1), c=262144, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.
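
A rough offline stand-in for the ~512-in/256-out, best-of-3 methodology is llama.cpp's bundled llama-bench (a sketch, not the harness actually behind these submissions; the model path is a placeholder):

# -p 512: 512-token prefill test; -n 256: 256-token generation test;
# -r 3: three repetitions, reported as mean ± stddev per test.
llama-bench -m /path/to/Qwen3-Coder-Next-MXFP4_MOE.gguf \
  -ngl 99 -ts 1/1 -p 512 -n 256 -r 3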

Reactions

Submitted

May 4, 2026, 5:05 PM

Last edited

gemma-4-26B-A4B-it

27B · Gemma

87.3

tok/s

Hardware

2x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

375ms

Context

1k · 3d ago

Model

google/gemma-4-26B-A4B-it

Display name

gemma-4-26B-A4B-it

Base model

gemma-4-26B-A4B

Revision

main

Family

Gemma

Parameters

27B

Active params

MoE

yes

Output tok/s

87.3

Prefill tok/s

Total tok/s

239.9

TTFT

375.3ms

Peak VRAM

Prompt tokens

538

Output tokens

256

Prefill tokens

Context length

794

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

2

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan, single R9700 (-ts 0/1), c=65536, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.

Reactions

Submitted

May 4, 2026, 4:56 PM

Last edited

92.9

tok/s

Hardware

2x AMD Radeon AI Pro R9700 32GB

Engine

llama.cpp · MXFP4_MOE

TTFT

110ms

Context

1k · 3d ago

Model

zai-org/GLM-4.7-Flash

Display name

GLM-4.7-Flash

Base model

Revision

main

Family

Parameters

31B

Active params

MoE

yes

Output tok/s

92.9

Prefill tok/s

Total tok/s

273.4

TTFT

109.6ms

Peak VRAM

Prompt tokens

527

Output tokens

256

Prefill tokens

Context length

783

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

2

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan, single R9700 (-ts 0/1), c=131072, kv q8_0, flash-attn auto. Best of 3 streamed runs, ~512 input / 256 output.

Reactions

Submitted

May 4, 2026, 4:45 PM

Last edited

Model · Hardware · Engine · tok/s out · prefill · tok/s total · TTFT · ctx · Reactions · Date · Share
MiniMax-M2.7

229B · Minimax

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16 · out 30.7 tok/s · total 95.0 tok/s · TTFT 57ms · ctx 1k · today
Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

yes

Output tok/s

30.7

Prefill tok/s

Total tok/s

95.0

TTFT

57.1ms

Peak VRAM

Prompt tokens

541

Output tokens

256

Prefill tokens

Context length

797

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 --main-gpu 1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1 --main-gpu 1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 8, 2026, 8:02 AM

Last edited

Mistral-Medium-3.5-128B

128B · Mistral

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16 · out 6.5 tok/s · total 27.4 tok/s · TTFT 1956ms · ctx 1k · today
Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

6.5

Prefill tok/s

Total tok/s

27.4

TTFT

1955.6ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

1142

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 8, 2026, 7:51 AM

Last edited

gpt-oss-120b

120B · Gpt

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16 · out 62.7 tok/s · total 99.0 tok/s · TTFT 105ms · ctx 2k · today
Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

yes

Output tok/s

62.7

Prefill tok/s

Total tok/s

99.0

TTFT

105.5ms

Peak VRAM

Prompt tokens

589

Output tokens

1000

Prefill tokens

Context length

1589

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 8, 2026, 7:44 AM

Last edited

MiniMax-M2.7

229B · Minimax

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS · out 25.2 tok/s · total 79.9 tok/s · TTFT 56ms · ctx 33k · today
Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

yes

Output tok/s

25.2

Prefill tok/s

Total tok/s

79.9

TTFT

56.1ms

Peak VRAM

Prompt tokens

560

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k

Reactions

Submitted

May 8, 2026, 7:13 AM

Last edited

MiniMax-M2.7

229B · Minimax

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS · out 25.2 tok/s · total 52.9 tok/s · TTFT 5267ms · ctx 1k · today
Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

yes

Output tok/s

25.2

Prefill tok/s

Total tok/s

52.9

TTFT

5266.6ms

Peak VRAM

Prompt tokens

560

Output tokens

256

Prefill tokens

Context length

816

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B-A10B (from 228B base), IQ4_XS imatrix, 3x R9700 ts 1/1/1, q8_0 KV, FA auto, ctx 32k

Reactions

Submitted

May 8, 2026, 7:10 AM

Last edited

MiniMax-M2.7

229B · Minimax

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS · out 31.8 tok/s · total 90.8 tok/s · TTFT 771ms · ctx 33k · 1d ago
Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

yes

Output tok/s

31.8

Prefill tok/s

Total tok/s

90.8

TTFT

770.5ms

Peak VRAM

Prompt tokens

546

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768 --parallel 1

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B IQ4_XS imatrix. ctx=32768, parallel=1. Thinking disabled.

Reactions

Submitted

May 7, 2026, 4:58 AM

Last edited

MiniMax-M2.7

229B · Minimax

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · IQ4_XS · out 32.6 tok/s · total 101.4 tok/s · TTFT 46ms · ctx 1k · 1d ago
Model

MiniMaxAI/MiniMax-M2.7

Display name

MiniMax-M2.7

Base model

Revision

main

Family

Minimax

Parameters

229B

Active params

MoE

yes

Output tok/s

32.6

Prefill tok/s

Total tok/s

101.4

TTFT

46.3ms

Peak VRAM

Prompt tokens

546

Output tokens

256

Prefill tokens

Context length

802

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

IQ4_XS

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/minimax/M2.7-REAP-172B-IQ4_XS/MiniMax-M2.7-REAP-172B-A10B-BF16.i1-IQ4_XS.gguf --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

REAP'd 172B IQ4_XS imatrix (mradermacher). Thinking disabled via system prompt.

Reactions

Submitted

May 7, 2026, 4:39 AM

Last edited

Llama-2-7b

7B · Llama

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0 · out 110.1 tok/s · total 347.5 tok/s · TTFT 171ms · ctx 1k · 1d ago
Model

meta-llama/Llama-2-7b

Display name

Llama-2-7b

Base model

Revision

main

Family

Llama

Parameters

7B

Active params

MoE

no

Output tok/s

110.1

Prefill tok/s

Total tok/s

347.5

TTFT

171.1ms

Peak VRAM

Prompt tokens

611

Output tokens

256

Prefill tokens

Context length

867

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/llama/llama-2-7b.Q4_0.gguf -ts 0/1 -c 4096

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:49 AM

Last edited

Qwen3.5-35B-A3B

35B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q3_K_M · out 108.4 tok/s · total 306.4 tok/s · TTFT 208ms · ctx 1k · 1d ago
Model

Qwen/Qwen3.5-35B-A3B

Display name

Qwen3.5-35B-A3B

Base model

Qwen3.5-35B-A3B-Base

Revision

main

Family

Qwen

Parameters

35B

Active params

MoE

yes

Output tok/s

108.4

Prefill tok/s

Total tok/s

306.4

TTFT

207.7ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q3_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m /home/mikekey/models/gguf/qwen/Qwen3.5-35B-A3B-Uncensored.Q3_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ctk q8_0 --ctv q8_0 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:43 AM

Last edited

SWE-Dev-32B

33B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · out 16.2 tok/s · total 49.0 tok/s · TTFT 699ms · ctx 1k · 1d ago
Model

zai-org/SWE-Dev-32B

Display name

SWE-Dev-32B

Base model

Qwen2.5-Coder-32B-Instruct

Revision

main

Family

Qwen

Parameters

33B

Active params

MoE

no

Output tok/s

16.2

Prefill tok/s

Total tok/s

49.0

TTFT

698.6ms

Peak VRAM

Prompt tokens

551

Output tokens

256

Prefill tokens

Context length

807

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.95

Top K

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 --prio 3 -m "/home/mikekey/models/gguf/misc/swe-dev-32b-q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 1/1 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --prio 3 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:36 AM

Last edited

Qwen3-VL-8B-Instruct

9B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_K_M · out 95.9 tok/s · total 276.0 tok/s · TTFT 179ms · ctx 1k · 1d ago
Model

Qwen/Qwen3-VL-8B-Instruct

Display name

Qwen3-VL-8B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

9B

Active params

MoE

no

Output tok/s

95.9

Prefill tok/s

Total tok/s

276.0

TTFT

178.8ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3VL-8B-Instruct-Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:29 AM

Last edited

Llama-3.2-3B-Instruct

3B · Llama

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · BF16 · out 79.9 tok/s · total 239.7 tok/s · TTFT 184ms · ctx 1k · 1d ago
Model

meta-llama/Llama-3.2-3B-Instruct

Display name

Llama-3.2-3B-Instruct

Base model

Revision

main

Family

Llama

Parameters

3B

Active params

MoE

no

Output tok/s

79.9

Prefill tok/s

Total tok/s

239.7

TTFT

183.6ms

Peak VRAM

Prompt tokens

556

Output tokens

256

Prefill tokens

Context length

812

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

BF16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.9

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/llama/Llama-3.2-3B-Instruct-BF16.gguf --temp 0.7 --top-p 0.9 -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:23 AM

Last edited

Qwen3.5-4B

4B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · out 84.9 tok/s · total 244.8 tok/s · TTFT 198ms · ctx 1k · 1d ago
Model

Qwen/Qwen3.5-4B

Display name

Qwen3.5-4B

Base model

Qwen3.5-4B-Base

Revision

main

Family

Qwen

Parameters

4B

Active params

MoE

no

Output tok/s

84.9

Prefill tok/s

Total tok/s

244.8

TTFT

197.6ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9049-2-g79b2f3239-dirty

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.95

Top K

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m "/home/mikekey/models/gguf/misc/Opus4.7-GODs.Ghost.Codex.Distill.4B-Q8_0.gguf" --temp 0.7 --top-p 0.95 --repeat-penalty 1.05 -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 7, 2026, 3:06 AM

Last edited

Kimi-Dev-72B

73B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0 · out 15.5 tok/s · total 56.7 tok/s · TTFT 266ms · ctx 1k · 1d ago
Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

15.5

Prefill tok/s

Total tok/s

56.7

TTFT

265.7ms

Peak VRAM

Prompt tokens

697

Output tokens

256

Prefill tokens

Context length

953

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

yes

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + Qwen2.5-Coder-0.5B-Q8_0 draft (n-max=8, n-min=1, p-min=0.7), 128K ctx. Code-completion prompt — paired baseline (no spec) at ~11.4 tok/s shows ~+35% gain from speculative decoding on code-shaped output.
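
The quoted gain is consistent with the paired baseline; both numbers come from these two submissions:

# spec-decode run: 15.5 tok/s; paired no-spec baseline: ~11.4 tok/s
echo "(15.5 - 11.4) / 11.4 * 100" | bc -l   # ≈ 36, i.e. the ~+35% claimed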

Reactions

Submitted

May 6, 2026, 7:22 PM

Last edited

Kimi-Dev-72B

73B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0 · out 11.5 tok/s · total 42.3 tok/s · TTFT 250ms · ctx 1k · 1d ago
Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

11.5

Prefill tok/s

Total tok/s

42.3

TTFT

249.7ms

Peak VRAM

Prompt tokens

697

Output tokens

256

Prefill tokens

Context length

953

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8 KV, 128K ctx. Code-completion prompt (Python module continuation, ~700 input tokens). Baseline for spec-decoding A/B — paired with Kimi-Dev-72B-spec submission.

Reactions

Submitted

May 6, 2026, 7:14 PM

Last edited

Devstral-Small-2-24B-Instruct-2512

24B · Mistral

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · out 23.6 tok/s · total 204.1 tok/s · TTFT 214ms · ctx 1k · 1d ago
Model

mistralai/Devstral-Small-2-24B-Instruct-2512

Display name

Devstral-Small-2-24B-Instruct-2512

Base model

Mistral-Small-3.1-24B-Base-2503

Revision

main

Family

Mistral

Parameters

24B

Active params

MoE

no

Output tok/s

23.6

Prefill tok/s

Total tok/s

204.1

TTFT

213.6ms

Peak VRAM

Prompt tokens

1078

Output tokens

135

Prefill tokens

Context length

1213

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.15

Top P

Top K

Min P

0.01

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --temp 0.15 --min-p 0.01 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 24576

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 24K ctx, q8 KV, FA auto. Single-card layout.

Reactions

Submitted

May 6, 2026, 7:02 PM

Last edited

Qwen3.6-27B

28B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · out 19.6 tok/s · total 59.2 tok/s · TTFT 263ms · ctx 1k · 1d ago
Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

19.6

Prefill tok/s

Total tok/s

59.2

TTFT

262.8ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 1× R9700, 16K ctx, q8 KV, FA auto. Single-card layout to eliminate cross-GPU overhead on 27B dense.

Reactions

Submitted

May 6, 2026, 6:56 PM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 80.5 tok/s out · 232.6 tok/s total · TTFT 203ms · 1k ctx · 1d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

80.5

Prefill tok/s

Total tok/s

232.6

TTFT

202.6ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.
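
Across these entries, the "single-card", "2-card", and "3-card" layouts are expressed purely through llama.cpp's --tensor-split (-ts) flag, which assigns each GPU a proportional share of the model weights. A quick legend, as I read the flags used in these runs:

-ts 1/0/0   # single-card: all weights on GPU 0, no cross-GPU overhead
-ts 1/1/0   # 2-card: weights split evenly across GPUs 0 and 1
-ts 1/1/1   # 3-card: weights split evenly across all three R9700s

Per the notes here, dense mid-size models tend to run fastest on fewer cards (less inter-GPU traffic per token), while the larger models need the pooled VRAM of two or three cards.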

Reactions

Submitted

May 6, 2026, 6:50 PM

Last edited

Qwen3-Coder-30B-A3B-Instruct

31B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 92.0 tok/s out · 365.9 tok/s total · TTFT 178ms · 1k ctx · 1d ago

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

31B

Active params

MoE

no

Output tok/s

92.0

Prefill tok/s

Total tok/s

365.9

TTFT

178.2ms

Peak VRAM

Prompt tokens

530

Output tokens

156

Prefill tokens

Context length

686

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. FAST tier: 2× R9700, 32K ctx, q8 KV, FA auto. Optimized 2-card layout vs the 3-card base entry.

Reactions

Submitted

May 6, 2026, 6:42 PM

Last edited

qwen3-coder-30b-a3b-codemonkey

30B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_K_M · 127.2 tok/s out · 356.1 tok/s total · TTFT 194ms · 1k ctx · 1d ago

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

127.2

Prefill tok/s

Total tok/s

356.1

TTFT

193.7ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto.

Reactions

Submitted

May 6, 2026, 4:37 PM

Last edited

qwen3-coder-30b-a3b-codemonkey

30B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_K_M · 126.5 tok/s out · 352.0 tok/s total · TTFT 209ms · 1k ctx · 1d ago

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

126.5

Prefill tok/s

Total tok/s

352.0

TTFT

208.9ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_K_M

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.from-Q8_0.Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 1× R9700, 32K ctx, q8 KV, FA auto, Q4_K_M derived from Q8_0 source.

Reactions

Submitted

May 6, 2026, 4:32 PM

Last edited

qwen3-coder-30b-a3b-codemonkey

30B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 90.0 tok/s out · 257.8 tok/s total · TTFT 203ms · 1k ctx · 1d ago

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

90.0

Prefill tok/s

Total tok/s

257.8

TTFT

203.3ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 2× R9700, 32K ctx, q8 KV, FA auto.

Reactions

Submitted

May 6, 2026, 4:26 PM

Last edited

qwen3-coder-30b-a3b-codemonkey

30B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · BF16 · 20.3 tok/s out · 61.2 tok/s total · TTFT 222ms · 1k ctx · 1d ago

Model

1337Hero/qwen3-coder-30b-a3b-codemonkey

Display name

qwen3-coder-30b-a3b-codemonkey

Base model

Qwen3-Coder-30B-A3B-Instruct

Revision

main

Family

Qwen

Parameters

30B

Active params

MoE

no

Output tok/s

20.3

Prefill tok/s

Total tok/s

61.2

TTFT

221.6ms

Peak VRAM

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

BF16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/qwen3-codemonkey/qwen3-coder-30b-a3b-codemonkey.BF16.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on R9700 via llama-swap. Best of 3 runs. Codemonkey LoRA fine-tune of Qwen3-Coder-30B-A3B-Instruct. 3× R9700, 16K ctx, q8 KV, FA auto.

Reactions

Submitted

May 6, 2026, 4:19 PM

Last edited

Mistral-Medium-3.5-128B

128B · Mistral

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0 · 7.4 tok/s out · 31.4 tok/s total · TTFT 2033ms · 1k ctx · 1d ago

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

7.4

Prefill tok/s

Total tok/s

31.4

TTFT

2032.8ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

1142

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --cache-ram 0 --no-cache-idle-slots --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs. Q4_0 + q8_0 KV, 16K ctx, ts 1/1/1, FA auto.

Reactions

Submitted

May 6, 2026, 12:50 PM

Last edited

Mistral-Medium-3.5-128B

128B · Mistral

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0 · 6.9 tok/s out · 30.4 tok/s total · TTFT 317ms · 16k ctx · 2d ago

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

6.9

Prefill tok/s

Total tok/s

30.4

TTFT

317.4ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

16384

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8979-66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

q8_0

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto --no-mmap --cache-ram 0 --no-cache-idle-slots -ts 1/1/1 -c 16384

Extra flags

-ctk q8_0 -ctv q8_0 -ts 1/1/1 --no-mmap --cache-ram 0 --no-cache-idle-slots -c 16384

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Mistral Medium 3.5 128B Q4_0 GGUF, 3 shards, tensor split 1/1/1, q8_0 KV cache, flash-attn auto, 16K ctx, single slot. Median of 3 runs. Submitted with running binary build_info b8979-66bafdcf1.

Reactions

Submitted

May 6, 2026, 6:40 AM

Last edited

Mistral-Medium-3.5-128B

128B · Mistral

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q4_0 · 6.2 tok/s out · 26.4 tok/s total · TTFT 2082ms · 1k ctx · 2d ago

Model

mistralai/Mistral-Medium-3.5-128B

Display name

Mistral-Medium-3.5-128B

Base model

Revision

main

Family

Mistral

Parameters

128B

Active params

MoE

no

Output tok/s

6.2

Prefill tok/s

Total tok/s

26.4

TTFT

2081.9ms

Peak VRAM

Prompt tokens

886

Output tokens

256

Prefill tokens

Context length

1142

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.95

Top K

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/mistral/Mistral-Medium-3.5-128B-GGUF/Q4_0/Mistral-Medium-3.5-128B-Q4_0-00001-of-00003.gguf --temp 0.7 --top-p 0.95 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

3x R9700 Vulkan, -ts 1/1/1, q8 KV, flash-attn auto, 32K ctx

Reactions

Submitted

May 6, 2026, 6:01 AM

Last edited

granite-4.1-30b

29B

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 17.9 tok/s out · 54.1 tok/s total · TTFT 224ms · 1k ctx · 2d ago

Model

ibm-granite/granite-4.1-30b

Display name

granite-4.1-30b

Base model

Revision

main

Family

Parameters

29B

Active params

MoE

no

Output tok/s

17.9

Prefill tok/s

Total tok/s

54.1

TTFT

223.9ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/ibm-granite_granite-4.1-30b-Q8_0.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 65536

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 6, 2026, 4:57 AM

Last edited

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

33B

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 107.3 tok/s out · 304.9 tok/s total · TTFT 218ms · 1k ctx · 2d ago

Model

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Display name

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Base model

Revision

main

Family

Parameters

33B

Active params

MoE

no

Output tok/s

107.3

Prefill tok/s

Total tok/s

304.9

TTFT

217.8ms

Peak VRAM

Prompt tokens

538

Output tokens

256

Prefill tokens

Context length

794

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9020-10-g17df5830e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

llama.cpp Vulkan on 3x R9700 via llama-swap, ts 1/1 (2-card). Median of 3 runs (selected for representative TTFT).

Reactions

Submitted

May 6, 2026, 4:45 AM

Last edited

Qwen3-32B

32B · Qwen

3x AMD Radeon AI Pro R9700 96GB
vllm · FP8 · 22.8 tok/s out · 69.5 tok/s total · TTFT 60ms · 1k ctx · 2d ago

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

Revision

main

Family

Qwen

Parameters

32B

Active params

MoE

no

Output tok/s

22.8

Prefill tok/s

Total tok/s

69.5

TTFT

60.0ms

Peak VRAM

34.2GB

Prompt tokens

530

Output tokens

256

Prefill tokens

Context length

786

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

FP8

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.95

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3-32B-FP8 --max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}'

Extra flags

--max-model-len 40960 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}

Notes

Post-reboot validation run: 2x R9700, TP=2, dense FP8.

Reactions

Submitted

May 6, 2026, 4:12 AM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

3x AMD Radeon AI Pro R9700 96GB
vllm · MXFP4_MOE · 112.8 tok/s out · 322.8 tok/s total · TTFT 168ms · 1k ctx · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

112.8

Prefill tok/s

Total tok/s

322.8

TTFT

168ms

Peak VRAM

32.0GB

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.95

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

Extra flags

--max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}

Notes

vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH (bypasses RCCL 7.2.x gfx1201 tuning-index miss / TP=2 deadlock). TunableOp tuned ~300 rows/rank from cudagraph capture + diverse-load driving. MoE backend Triton (gfx950 gate routes only gfx950 to CK on this vLLM build). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions, post heavy warmup.

Reactions

Submitted

May 6, 2026, 3:18 AM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

3x AMD Radeon AI Pro R9700 96GB
vllm · MXFP4_MOE · 98.9 tok/s out · 285.2 tok/s total · TTFT 171ms · 1k ctx · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

98.9

Prefill tok/s

Total tok/s

285.2

TTFT

171.4ms

Peak VRAM

32GB

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.18.1.dev0+gbcf2be961.d20260330

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.95

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

vllm serve /app/models --tensor-parallel-size 2 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.95 --host 0.0.0.0 --port 8009 --dtype auto --served-model-name Qwen3.6-35B-A3B-MXFP4 --max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config '{"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128}' --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

Extra flags

--max-model-len 32768 --reasoning-parser qwen3 --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 4 --mm-processor-cache-gb 1 --override-generation-config {"max_tokens": 100000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5} --compilation-config {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], "max_cudagraph_capture_size": 128} --speculative-config {"method": "mtp", "num_speculative_tokens": 4}

Notes

vLLM on 3× R9700 (TP=2 cards 0,1). RCCL 7.1.1 jammy preloaded via LD_LIBRARY_PATH to bypass RCCL 7.2.x gfx1201 tuning-index miss (TP=2 deadlock). aiter built but MoE backend still Triton (gfx950 gate routes only gfx950 to CK). MTP=4 spec decoding. tcclaviger/vllm-rocm-rdna4-mxfp4 docker. Best of 3 streaming chat completions.

Reactions

Submitted

May 6, 2026, 2:23 AM

Last edited

gpt-oss-20b

22B · Gpt

2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE · 991.1 tok/s out · TTFT 175ms · 33k ctx · 2d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

991.1

Prefill tok/s

Total tok/s

TTFT

175.4ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

32768

Batch size

16

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

4096

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

16

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Extra flags

--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Notes

gpt-oss-20b MXFP4 concurrent throughput at batch=16 (matches --max-num-seqs 16). Aggregate output tok/s, best of 3 runs (991.1 / 986.8 / 990.6 -- variance under 0.5%). Per-request decode 64.8 tok/s. TTFT 175ms. This is vLLM's strength zone vs llama.cpp single-stream: ~6x the throughput when serving multiple concurrent users. For comparison on the same hardware, batch=1 single-stream was 48 tok/s (separate submission), and llama.cpp Q8_0 single-stream was 160 tok/s.
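
A rough way to reproduce this kind of aggregate number, as a sketch rather than the submitter's actual harness (the endpoint, port, and prompt are assumptions; it also counts prefill time in the window and assumes every request generates the full 256 tokens, so it slightly understates pure decode throughput):

# fire 16 concurrent completions at the vllm serve instance above
# (no --port in the Command, so vLLM's default port 8000 is assumed)
N=16; MAXTOK=256
start=$(date +%s.%N)
for i in $(seq 1 "$N"); do
  curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"openai/gpt-oss-20b\",\"prompt\":\"Summarize RDNA4 in one paragraph.\",\"max_tokens\":$MAXTOK}" \
    -o /dev/null &
done
wait
end=$(date +%s.%N)
echo "aggregate ~$(echo "$N * $MAXTOK / ($end - $start)" | bc -l) tok/s"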

Reactions

Submitted

May 5, 2026, 11:29 PM

Last edited

gpt-oss-20b

22B · Gpt

2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE · 48.3 tok/s out · 156.7 tok/s total · TTFT 89ms · 33k ctx · 2d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

48.3

Prefill tok/s

Total tok/s

156.7

TTFT

88.6ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

4096

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

16

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 16 --max-num-batched-tokens 4096 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.90 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Extra flags

--max-model-len 32768 --moe-backend triton --reasoning-parser openai_gptoss --tool-call-parser openai

Notes

gpt-oss-20b native MXFP4 on 2x R9700 (gfx1201) TP=2, batch=1. Uses the *fused* OAI Triton MXFP4 MoE kernel (triton, not triton_unfused) — gpt-oss is the kernel's original target. Same 3 wheel patches as Qwen3.6 MXFP4 entry (PR #37826 device gate, CLI enum, dtype). RCCL 7.1.1 hotfix for #40980. Concurrent throughput much higher: 580 tok/s at c=8, 902 tok/s at c=16 (batch submissions to follow).

Reactions

Submitted

May 5, 2026, 11:24 PM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE · 33.6 tok/s out · 101.7 tok/s total · TTFT 130ms · 33k ctx · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

33.6

Prefill tok/s

Total tok/s

101.7

TTFT

129.6ms

Peak VRAM

Prompt tokens

533

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.93

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop_merged.csv VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Extra flags

--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Notes

Same setup as previous MXFP4 submission (PR #37826 patched, RCCL 7.1.1 hotfix, triton_unfused, dtype=bf16) — now WITH PyTorch TunableOp tuned (754 rows / 377 per TP rank). Tuned with 18-min diverse-load pass: prompt sizes 8/64/256/1024/2048 tokens x 5 content categories x concurrency 1+4. Auto-stopped on row plateau (3x60s zero-delta). Tuning gain over untuned baseline: +7.0% (31.70 -> 33.92 tok/s). Modest gain — batch=1 decode on R9700 is memory-bandwidth-bound, not GEMM-scheduling-bound.
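
The TunableOp workflow described here is a two-phase affair; a minimal sketch of the env-var switching (the env vars are PyTorch's standard TunableOp knobs, the CSV paths mirror the Command field above, and the remaining serve flags are elided -- the note mentions per-rank result files that were later merged):

# Phase 1 -- tune: record GEMM solution choices to CSV while driving the
# diverse load described in the note
PYTORCH_TUNABLEOP_ENABLED=1 \
PYTORCH_TUNABLEOP_TUNING=1 \
PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop.csv \
vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 ...  # remaining flags as in the Command above

# Phase 2 -- replay: tuning off, point at the merged results file
# (this is exactly what the submitted Command does)
PYTORCH_TUNABLEOP_ENABLED=1 \
PYTORCH_TUNABLEOP_TUNING=0 \
PYTORCH_TUNABLEOP_FILENAME=~/.local/share/vllm-tunableop/tunableop_merged.csv \
vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 ...  # remaining flags as in the Command above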

Reactions

Submitted

May 5, 2026, 11:00 PM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

2x AMD Radeon AI Pro R9700 64GB
vllm · MXFP4_MOE · 31.7 tok/s out · 96.0 tok/s total · TTFT 140ms · 33k ctx · 2d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

31.7

Prefill tok/s

Total tok/s

96.0

TTFT

140.0ms

Peak VRAM

Prompt tokens

533

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

MXFP4_MOE

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

yes

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.93

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MOE=0 VLLM_MXFP4_USE_MARLIN=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3.6-35B-A3B-MXFP4 --tensor-parallel-size 2 --dtype bfloat16 --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.93 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Extra flags

--max-model-len 32768 --moe-backend triton_unfused --reasoning-parser qwen3 --tool-call-parser qwen3_xml --language-model-only

Notes

Qwen3.6-35B-A3B MXFP4 (pahajokiconsulting quant) on 2x R9700 (gfx1201) TP=2. Baseline path (no AITER, no TunableOp, no MTP). Required THREE patches to vLLM 0.20.1+rocm721 wheel: (1) PR vllm-project/vllm#37826 in-place patch -- widen OAI Triton MoE device gate to gfx12, (2) add 'triton_unfused' to MoEBackend Literal in config/kernel.py (CLI exposure), (3) RCCL 7.1.1 LD_PRELOAD for #40980 deadlock. --moe-backend triton_unfused needed because TRITON kernel only supports SWIGLUOAI activation; UnfusedOAITritonExperts adds SILU. dtype=bfloat16 required (kernel asserts hidden_states.dtype==bf16). Best of 3 (31.6 / 31.7 / 31.6 tok/s).
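
"In-place patch" here just means editing the installed wheel's files under site-packages rather than rebuilding vLLM. A generic sketch of that step (the diff filename is invented, and the -p strip level assumes a git-style diff rooted at the vLLM repo; the actual changes are the three listed above):

# locate the installed vllm package, then apply a repo-rooted diff to it
VLLM_DIR=$(python -c 'import pathlib, vllm; print(pathlib.Path(vllm.__file__).parent)')
echo "patching $VLLM_DIR"
patch -d "$VLLM_DIR" -p2 < ~/patches/37826-widen-moe-gate-to-gfx12.diff
# small edits (e.g. adding 'triton_unfused' to the MoEBackend Literal in
# config/kernel.py) can just as well be made by hand inside $VLLM_DIR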

Reactions

Submitted

May 5, 2026, 9:49 PM

Last edited

Qwen3-32B

32B · Qwen

2x AMD Radeon AI Pro R9700 64GB
vllm · FP8 · 79.3 tok/s out · TTFT 138ms · 33k ctx · 2d ago

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

Revision

main

Family

Qwen

Parameters

32B

Active params

MoE

no

Output tok/s

79.3

Prefill tok/s

Total tok/s

TTFT

138.3ms

Peak VRAM

Prompt tokens

534

Output tokens

256

Prefill tokens

Context length

32768

Batch size

4

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

FP8

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3

Extra flags

--max-model-len 32768 --reasoning-parser qwen3

Notes

Concurrent throughput at batch=4 (matches --max-num-seqs 4). Aggregate output tok/s, best of 3 runs (78.7 / 78.2 / 79.3). Per-request decode ~20 tok/s. RCCL 7.1.1 hotfix for #40980 still in effect.

Reactions

Submitted

May 5, 2026, 9:32 PM

Last edited

Qwen3-32B

32B · Qwen

2x AMD Radeon AI Pro R9700 64GB
vllm · FP8 · 22.9 tok/s out · 70.1 tok/s total · TTFT 71ms · 33k ctx · 2d ago

Model

Qwen/Qwen3-32B

Display name

Qwen3-32B

Base model

Revision

main

Family

Qwen

Parameters

32B

Active params

MoE

no

Output tok/s

22.9

Prefill tok/s

Total tok/s

70.1

TTFT

71.5ms

Peak VRAM

Prompt tokens

534

Output tokens

256

Prefill tokens

Context length

32768

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 64GB

GPU slots

GPU count

2

VRAM

64GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

vllm

Engine version

0.20.1+rocm721

Quantization

FP8

Backend

rocm

Tensor parallel

2

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

yes

Prefill chunk

2048

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

0.9

Max running seqs

4

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH HIP_VISIBLE_DEVICES=0,1 VLLM_TARGET_DEVICE=rocm VLLM_ROCM_USE_AITER=0 FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 --dtype auto --max-model-len 32768 --max-num-seqs 4 --max-num-batched-tokens 2048 --enable-chunked-prefill --gpu-memory-utilization 0.90 --reasoning-parser qwen3

Extra flags

--max-model-len 32768 --reasoning-parser qwen3

Notes

vLLM 0.20.1+rocm721 on 2x AMD R9700 (gfx1201) TP=2. Workaround for vllm-project/vllm#40980: LD_PRELOAD librccl.so.1 from rccl-7.1.1 (Arch ALA) -- RCCL 7.2.x lacks gfx1201 tuning index and deadlocks on AllReduce. With hotfix: ~22.7 tok/s; without: ~0.5 tok/s.
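
The hotfix amounts to a library-path override so the process resolves librccl.so.1 from the older 7.1.1 build. A minimal sketch (the extraction path is illustrative; per the note, the library came from the rccl 7.1.1 package in the Arch Linux Archive):

# stage the older RCCL where only this process will see it
mkdir -p ~/.local/lib/rccl-7.1.1
cp /path/to/rccl-7.1.1-extracted/usr/lib/librccl.so.1* ~/.local/lib/rccl-7.1.1/
# prepend it for the vLLM launch (an LD_PRELOAD of librccl.so.1 works too);
# without it, TP=2 AllReduce deadlocks and decode crawls at ~0.5 tok/s
LD_LIBRARY_PATH=~/.local/lib/rccl-7.1.1:$LD_LIBRARY_PATH \
  vllm serve /models/Qwen3-32B-FP8 --tensor-parallel-size 2 ...  # remaining flags as in the Command above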

Reactions

Submitted

May 5, 2026, 9:26 PM

Last edited

Qwen3-Coder-30B-A3B-Instruct

31B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 100.0 tok/s out · 390.1 tok/s total · TTFT 198ms · 1k ctx · 2d ago

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

31B

Active params

MoE

no

Output tok/s

100.0

Prefill tok/s

Total tok/s

390.1

TTFT

198.5ms

Peak VRAM

Prompt tokens

530

Output tokens

156

Prefill tokens

Context length

686

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 5, 2026, 7:00 PM

Last edited

gpt-oss-20b

22B · Gpt

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 160.8 tok/s out · 470.0 tok/s total · TTFT 206ms · 1k ctx · 3d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

160.8

Prefill tok/s

Total tok/s

470.0

TTFT

206.1ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-20b-Q8_0.gguf -ts 0/1/0 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 0/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 5, 2026, 4:52 AM

Last edited

GLM-4.7-Flash

31B

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 93.3 tok/s out · 274.8 tok/s total · TTFT 105ms · 1k ctx · 3d ago

Model

zai-org/GLM-4.7-Flash

Display name

GLM-4.7-Flash

Base model

Revision

main

Family

Parameters

31B

Active params

MoE

no

Output tok/s

93.3

Prefill tok/s

Total tok/s

274.8

TTFT

104.8ms

Peak VRAM

Prompt tokens

527

Output tokens

256

Prefill tokens

Context length

783

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/glm/GLM-4.7-Flash-MXFP4_MOE.gguf --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 0/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 0/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 5, 2026, 4:40 AM

Last edited

Qwen3-Coder-Next

80B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE · 80.4 tok/s out · 280.1 tok/s total · TTFT 234ms · 1k ctx · 3d ago

Model

Qwen/Qwen3-Coder-Next

Display name

Qwen3-Coder-Next

Base model

Revision

main

Family

Qwen

Parameters

80B

Active params

MoE

no

Output tok/s

80.4

Prefill tok/s

Total tok/s

280.1

TTFT

233.8ms

Peak VRAM

Prompt tokens

530

Output tokens

187

Prefill tokens

Context length

717

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-Next-MXFP4_MOE.gguf -ts 1/1 -c 262144 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 5, 2026, 4:31 AM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE · 26.8 tok/s out · 110.8 tok/s total · TTFT 261ms · 1k ctx · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

no

Output tok/s

26.8

Prefill tok/s

Total tok/s

110.8

TTFT

261.3ms

Peak VRAM

Prompt tokens

533

Output tokens

161

Prefill tokens

Context length

694

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8978-1-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Reactions

Submitted

May 5, 2026, 4:19 AM

Last edited

Qwen3-Coder-30B-A3B-Instruct

31B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 87.2 tok/s out · 343.5 tok/s total · TTFT 207ms · 1k ctx · 3d ago

Model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Display name

Qwen3-Coder-30B-A3B-Instruct

Base model

Revision

main

Family

Qwen

Parameters

31B

Active params

MoE

no

Output tok/s

87.2

Prefill tok/s

Total tok/s

343.5

TTFT

207.0ms

Peak VRAM

Prompt tokens

530

Output tokens

156

Prefill tokens

Context length

686

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

0

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 3:51 AM

Last edited

gpt-oss-120b

120B · Gpt

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16 · 71.5 tok/s out · 221.7 tok/s total · 233ms TTFT · 1k ctx · 3d ago

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

no

Output tok/s

71.5

Prefill tok/s

Total tok/s

221.7

TTFT

232.7ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 3:41 AM

Last edited

Qwen3.6-35B-A3B

36B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 75.2 tok/s out · 217.4 tok/s total · 217ms TTFT · 1k ctx · 3d ago

Model

Qwen/Qwen3.6-35B-A3B

Display name

Qwen3.6-35B-A3B

Base model

Revision

main

Family

Qwen

Parameters

36B

Active params

MoE

no

Output tok/s

75.2

Prefill tok/s

Total tok/s

217.4

TTFT

217.0ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-35B-A3B-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/0 -c 32768

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 3:31 AM

Last edited

gpt-oss-20b

22B · Gpt

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 159.7 tok/s out · 468.5 tok/s total · 201ms TTFT · 131k ctx · 3d ago

Model

openai/gpt-oss-20b

Display name

gpt-oss-20b

Base model

Revision

main

Family

Gpt

Parameters

22B

Active params

MoE

no

Output tok/s

159.7

Prefill tok/s

Total tok/s

468.5

TTFT

200.9ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

131072

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

GPT-OSS 20B Q8_0 (12 GB) on a single R9700 via llama-swap (-ts 0/1/0, c=131072, no KV quant). 3.6B active params (MoE). Best of 3 streamed runs; variance 0.27 tok/s across runs.
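
The tok/s, TTFT, and run-variance figures in these records come from timed streamed completions. A sketch of how such a measurement can be taken against llama-server's OpenAI-compatible endpoint, assuming it listens on localhost:8080 (this illustrates the method, not the exact harness behind these numbers; treating one SSE chunk as one token is an approximation):

import json
import time
import urllib.request

URL = "http://localhost:8080/v1/completions"  # assumed local llama-server

def timed_run(prompt: str, max_tokens: int = 256):
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens, "stream": True}).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    first = None
    chunks = 0
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # server-sent events, one "data: {...}" line per token
            line = raw.decode().strip()
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            if first is None:
                first = time.perf_counter()
            chunks += 1
    end = time.perf_counter()
    return (first - start) * 1000, chunks / (end - first)  # TTFT in ms, decode tok/s

# "Best of 3 streamed runs": keep the fastest generation rate of three.
runs = [timed_run("Summarize the Vulkan backend in one paragraph.") for _ in range(3)]
print("best tok/s:", max(r[1] for r in runs), "best TTFT ms:", min(r[0] for r in runs))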

Submitted

May 5, 2026, 3:24 AM

Last edited

Nemotron-Cascade-2-30B-A3B

32B

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 89.8 tok/s out · 2459.9 tok/s total · 9971ms TTFT · 32k ctx · 3d ago

Model

nvidia/Nemotron-Cascade-2-30B-A3B

Display name

Nemotron-Cascade-2-30B-A3B

Base model

Revision

main

Family

Parameters

32B

Active params

MoE

no

Output tok/s

89.8

Prefill tok/s

Total tok/s

2459.9

TTFT

9971.2ms

Peak VRAM

Prompt tokens

31364

Output tokens

256

Prefill tokens

Context length

31620

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2x R9700 (-ts 1/1, c=262144, q8_0 KV, FA auto). Depth-31K test (~30K input prompt). TG = 89.78 tok/s, only -5.9% vs depth-0 (95.39 tok/s). Cold-prefill TTFT shown (9971ms); warm-cache runs hit 560ms via --cache-reuse 256. Best tok/s of 3 streamed runs.
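
The cold/warm TTFT split in this note (9971 ms vs 560 ms) is the --cache-reuse effect: a repeated prompt prefix is served from the KV cache instead of being prefilled again. A self-contained sketch of that comparison, again assuming a llama-server on localhost:8080 (the synthetic prompt is only a stand-in for the ~30K-token input):

import json
import time
import urllib.request

URL = "http://localhost:8080/v1/completions"  # assumed local llama-server

def ttft_ms(prompt: str) -> float:
    body = json.dumps({"prompt": prompt, "max_tokens": 16, "stream": True}).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            if raw.decode().startswith("data: "):  # first streamed chunk
                return (time.perf_counter() - t0) * 1000
    raise RuntimeError("no streamed output")

deep_prompt = "benchmark filler text " * 7000  # rough stand-in for a ~30K-token prompt
print("cold TTFT ms:", ttft_ms(deep_prompt))  # pays the full prefill
print("warm TTFT ms:", ttft_ms(deep_prompt))  # prefix reused via --cache-reuse 256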

Submitted

May 5, 2026, 3:19 AM

Last edited

Qwen3.6-27B

28B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 19.9 tok/s out · 60.5 tok/s total · 171ms TTFT · 16k ctx · 3d ago

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

19.9

Prefill tok/s

Total tok/s

60.5

TTFT

171.0ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

16384

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

Qwen3.6-27B Q8_0 dense (28.5 GB) on a single R9700 via llama-swap (-ts 1/0/0, c=16384, q8_0 KV, FA auto). +10.3% TG vs the prior 3-card 262K-context submission (18.08 tok/s). Best of 3 streamed runs.
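
The +10.3% figure checks out against the numbers the note states (a quick sketch; 19.94 tok/s is an assumed unrounded value behind the 19.9 shown in this record):

prior_3card = 18.08   # tok/s, prior 3-card 262K submission, per the note
single_card = 19.94   # assumed unrounded value behind this record's 19.9
print(f"{(single_card - prior_3card) / prior_3card:+.1%}")  # ≈ +10.3%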

Submitted

May 5, 2026, 3:14 AM

Last edited

granite-4.1-30b

29B

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 17.3 tok/s out · 273ms TTFT · 66k ctx · 3d ago

Model

ibm-granite/granite-4.1-30b

Display name

granite-4.1-30b

Base model

Revision

main

Family

Parameters

29B

Active params

MoE

no

Output tok/s

17.3

Prefill tok/s

Total tok/s

TTFT

273.1ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

IBM Granite 4.1 30B dense Q8_0 (29 GB) on 2x R9700 via llama-swap (-ts 1/1, c=65536, q8_0 KV, FA auto). Single-card not viable: 29 GB of weights plus heavy GQA KV (256 KB/tok) overflow 32 GB of VRAM. Best of 3 streamed runs.
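
A quick check of the overflow arithmetic in this note, using only the figures it states (a sketch; the 256 KB/tok KV cost is taken from the note rather than recomputed from the model config):

weights_gb = 29                        # granite-4.1-30b Q8_0, per the note
kv_per_tok_kb = 256                    # GQA KV cost per token, per the note
ctx = 65536
kv_gb = kv_per_tok_kb * ctx / 1024**2  # KB -> GB: 16 GB of KV cache
total_gb = weights_gb + kv_gb
print(f"KV {kv_gb:.0f} GB + weights {weights_gb} GB = {total_gb:.0f} GB")
print(total_gb > 32)  # True: no fit on one 32 GB R9700, hence -ts 1/1 across two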

Submitted

May 5, 2026, 3:03 AM

Last edited

Devstral-Small-2-24B-Instruct-2512

24B · Mistral

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 23.9 tok/s out · 200.6 tok/s total · 244ms TTFT · 25k ctx · 3d ago

Model

mistralai/Devstral-Small-2-24B-Instruct-2512

Display name

Devstral-Small-2-24B-Instruct-2512

Base model

Mistral-Small-3.1-24B-Base-2503

Revision

main

Family

Mistral

Parameters

24B

Active params

MoE

no

Output tok/s

23.9

Prefill tok/s

Total tok/s

200.6

TTFT

244.2ms

Peak VRAM

Prompt tokens

1078

Output tokens

139

Prefill tokens

Context length

24576

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

Devstral-Small-2-24B-Instruct-2512 Q8_0 (24 GB) on a single R9700 via llama-swap (-ts 1/0/0, c=24576, q8_0 KV, FA auto). Best of 3 streamed runs.

Submitted

May 5, 2026, 2:58 AM

Last edited

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

33B

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 95.8 tok/s out · 275.2 tok/s total · 212ms TTFT · 1k ctx · 3d ago

Model

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Display name

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Base model

Revision

main

Family

Parameters

33B

Active params

MoE

no

Output tok/s

95.8

Prefill tok/s

Total tok/s

275.2

TTFT

212.5ms

Peak VRAM

Prompt tokens

538

Output tokens

256

Prefill tokens

Context length

794

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.6

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-Q8_0.gguf --reasoning-format auto --temp 0.6 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

NVIDIA-Nemotron-3-Nano-Omni 30B-A3B Reasoning Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.

Submitted

May 5, 2026, 2:53 AM

Last edited

Nemotron-Cascade-2-30B-A3B

32B

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 95.4 tok/s out · 280.0 tok/s total · 216ms TTFT · 1k ctx · 3d ago

Model

nvidia/Nemotron-Cascade-2-30B-A3B

Display name

Nemotron-Cascade-2-30B-A3B

Base model

Revision

main

Family

Parameters

32B

Active params

MoE

no

Output tok/s

95.4

Prefill tok/s

Total tok/s

280.0

TTFT

216.0ms

Peak VRAM

Prompt tokens

556

Output tokens

256

Prefill tokens

Context length

812

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1 -c 262144

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/1

Notes

Nemotron-Cascade-2 30B-A3B Q8_0 (32 GB) on 2× R9700 via llama-swap (-ts 1/1, q8_0 KV, FA auto). Single-card not viable — weights exceed 32 GB VRAM. Best of 3 streamed runs.

Submitted

May 5, 2026, 2:47 AM

Last edited

Qwen3.6-27B

28B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · Q8_0 · 20.1 tok/s out · 58.1 tok/s total · 807ms TTFT · 1k ctx · 3d ago

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

20.1

Prefill tok/s

Total tok/s

58.1

TTFT

807.5ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

787

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b9025-1-ga267d945e

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

20

Min P

0

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.6-27B-Q8_0.gguf --reasoning-format auto --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/0/0 -c 16384

Extra flags

--no-webui --jinja --cache-reuse 256 --reasoning-format auto --ctk q8_0 --ctv q8_0 --ts 1/0/0

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 5, 2026, 2:35 AM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE · 27.2 tok/s out · 11.1 tok/s total · 56499ms TTFT · 66k ctx · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

no

Output tok/s

27.2

Prefill tok/s

Total tok/s

11.1

TTFT

56499.3ms

Peak VRAM

Prompt tokens

533

Output tokens

161

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

1

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 1 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}' --flash-attn on

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

B+C config: --parallel 1 + --flash-attn on. Best of 3 runs on 3× R9700 (Vulkan).

Submitted

May 5, 2026, 12:41 AM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · MXFP4_MOE · 27.3 tok/s out · 113.5 tok/s total · 219ms TTFT · 1k ctx · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

no

Output tok/s

27.3

Prefill tok/s

Total tok/s

113.5

TTFT

218.8ms

Peak VRAM

Prompt tokens

533

Output tokens

161

Prefill tokens

Context length

694

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

1

Top P

0.95

Top K

40

Min P

0.01

Repeat penalty

1

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/qwen/Qwen3.5-122B-A10B-MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00004.gguf -ts 1/1/1 -c 65536 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.00 --flash-attn auto --reasoning off --chat-template-kwargs '{"enable_thinking": false}'

Extra flags

--no-webui --jinja --cache-reuse 256 --ts 1/1/1 --reasoning off --chat-template-kwargs {"enable_thinking": false}

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. Qwen3.5-122B-A10B MXFP4_MOE GGUF, tensor split 1/1/1, flash-attn auto, reasoning disabled. Best of 3 runs.

Submitted

May 4, 2026, 11:53 PM

Last edited

gpt-oss-120b

120B · Gpt

3x AMD Radeon AI Pro R9700 96GB
llama.cpp · F16 · 70.7 tok/s out · 219.1 tok/s total · 236ms TTFT · 1k ctx · 3d ago

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

no

Output tok/s

70.7

Prefill tok/s

Total tok/s

219.1

TTFT

236.4ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 96GB

GPU slots

GPU count

3

VRAM

96GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto.

Submitted

May 4, 2026, 11:00 PM

Last edited

gpt-oss-120b

120B · Gpt

3x AMD Radeon AI Pro R9700 32GB
llama.cpp · F16 · 70.0 tok/s out · 217.0 tok/s total · 236ms TTFT · 1k ctx · 3d ago

Model

openai/gpt-oss-120b

Display name

gpt-oss-120b

Base model

Revision

main

Family

Gpt

Parameters

120B

Active params

MoE

no

Output tok/s

70.0

Prefill tok/s

Total tok/s

217.0

TTFT

236.4ms

Peak VRAM

Prompt tokens

589

Output tokens

256

Prefill tokens

Context length

845

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

F16

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/gpt-oss/gpt-oss-120b-F16.gguf -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

llama.cpp Vulkan on 3x AMD Radeon AI PRO R9700 via llama-swap. GPT-OSS 120B F16 GGUF, tensor split 1/1/1, q8_0 KV cache, flash-attn auto. Best of 3 runs.

Submitted

May 4, 2026, 10:42 PM

Last edited

Kimi-Dev-72B

73B · Qwen

3x AMD Radeon AI Pro R9700 32GB
llama.cpp · Q4_0 · 20.9 tok/s out · 64.5 tok/s total · 272ms TTFT · 131k ctx · 3d ago

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

20.9

Prefill tok/s

Total tok/s

64.5

TTFT

271.5ms

Peak VRAM

Prompt tokens

551

Output tokens

256

Prefill tokens

Context length

131072

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

99

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

yes

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

2

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

0.7

Top P

0.8

Top K

20

Min P

Repeat penalty

1.05

Mirostat

Command

/home/mikekey/Experiments/llama.cpp/build-vulkan/bin/llama-server --port ${PORT} --host 0.0.0.0 --no-webui -ngl 99 --jinja --cache-reuse 256 --parallel 2 -m /home/mikekey/models/gguf/misc/Kimi-Dev-72B-Q4_0.gguf --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 -ctk q8_0 -ctv q8_0 --flash-attn auto -ts 1/1/1 -c 131072

Extra flags

--no-webui --jinja --cache-reuse 256 --spec-draft-model /home/mikekey/models/gguf/drafts/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf --spec-draft-n-max 8 --spec-draft-n-min 1 --spec-draft-p-min 0.7 --spec-draft-ngl 99 --ctk q8_0 --ctv q8_0 --ts 1/1/1

Notes

Vulkan 3× R9700 + speculative decoding: Qwen2.5-Coder-0.5B-Instruct Q8_0 draft, --spec-draft-n-max 8 --spec-draft-p-min 0.7. Best of 3.
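
For scale, the same Kimi-Dev-72B Q4_0 on the same cards without a draft model appears further down at 11.3 tok/s, so the draft roughly doubles decode speed. A sketch of the idealized speculative-decoding arithmetic (the formula assumes an i.i.d. per-token acceptance probability and ignores draft-model overhead, so it is an upper bound, not measured data):

print(20.9 / 11.3)  # ~1.85x observed speedup with the 0.5B draft

def expected_tokens_per_step(a: float, n: int = 8) -> float:
    # Expected tokens emitted per target-model verification step with
    # draft length n and acceptance probability a: (1 - a**(n+1)) / (1 - a).
    return (1 - a ** (n + 1)) / (1 - a)

for a in (0.5, 0.7, 0.9):
    print(a, round(expected_tokens_per_step(a), 2))  # 2.0, 3.2, 6.13 tokens/step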

Submitted

May 4, 2026, 10:09 PM

Last edited

Qwen3.6-27B

28B · Qwen

3x AMD Radeon AI Pro R9700 32GB
llama.cpp · Q8_0 · 18.1 tok/s out · 54.6 tok/s total · 260ms TTFT · 262k ctx · 3d ago

Model

Qwen/Qwen3.6-27B

Display name

Qwen3.6-27B

Base model

Revision

main

Family

Qwen

Parameters

28B

Active params

MoE

no

Output tok/s

18.1

Prefill tok/s

Total tok/s

54.6

TTFT

259.9ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

262144

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q8_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 4, 2026, 9:36 PM

Last edited

Kimi-Dev-72B

73B · Qwen

3x AMD Radeon AI Pro R9700 32GB
llama.cpp · Q4_0 · 11.3 tok/s out · 35.4 tok/s total · 259ms TTFT · 66k ctx · 3d ago

Model

moonshotai/Kimi-Dev-72B

Display name

Kimi-Dev-72B

Base model

Qwen2.5-72B

Revision

main

Family

Qwen

Parameters

73B

Active params

MoE

no

Output tok/s

11.3

Prefill tok/s

Total tok/s

35.4

TTFT

258.6ms

Peak VRAM

Prompt tokens

551

Output tokens

256

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

Q4_0

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan on 3× R9700 via llama-swap. Best of 3 runs.

Submitted

May 4, 2026, 9:31 PM

Last edited

Qwen3.5-122B-A10B

125B · Qwen

3x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE · 26.8 tok/s out · 80.6 tok/s total · 232ms TTFT · 66k ctx · 3d ago

Model

Qwen/Qwen3.5-122B-A10B

Display name

Qwen3.5-122B-A10B

Base model

Revision

main

Family

Qwen

Parameters

125B

Active params

MoE

no

Output tok/s

26.8

Prefill tok/s

Total tok/s

80.6

TTFT

232.2ms

Peak VRAM

Prompt tokens

531

Output tokens

256

Prefill tokens

Context length

65536

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

3x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

3

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

noctrex/Qwen3.5-122B-A10B-MXFP4_MOE-GGUF on 3× R9700 (96 GB pool), llama.cpp Vulkan, -ts 1/1/1, ctx 64K. Best of 3 runs.

Submitted

May 4, 2026, 6:50 PM

Last edited

Qwen3-Coder-Next

80B · Qwen

2x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE · 80.8 tok/s out · 279.7 tok/s total · 250ms TTFT · 1k ctx · 3d ago

Model

Qwen/Qwen3-Coder-Next

Display name

Qwen3-Coder-Next

Base model

Revision

main

Family

Qwen

Parameters

80B

Active params

MoE

no

Output tok/s

80.8

Prefill tok/s

Total tok/s

279.7

TTFT

249.6ms

Peak VRAM

Prompt tokens

530

Output tokens

187

Prefill tokens

Context length

717

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

2

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan, dual R9700 (-ts 1/1), c=262144, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.

Submitted

May 4, 2026, 5:05 PM

Last edited

gemma-4-26B-A4B-it

27B · Gemma

2x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE · 87.3 tok/s out · 239.9 tok/s total · 375ms TTFT · 1k ctx · 3d ago

Model

google/gemma-4-26B-A4B-it

Display name

gemma-4-26B-A4B-it

Base model

gemma-4-26B-A4B

Revision

main

Family

Gemma

Parameters

27B

Active params

MoE

no

Output tok/s

87.3

Prefill tok/s

Total tok/s

239.9

TTFT

375.3ms

Peak VRAM

Prompt tokens

538

Output tokens

256

Prefill tokens

Context length

794

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

2

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

b8974-5-g66bafdcf1

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan, single R9700 (-ts 0/1), c=65536, no kv quant. Best of 3 streamed runs, ~512 input / 256 output.
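
The -ts values that recur in these notes are proportional weights, not device IDs: llama.cpp normalizes them and distributes layers accordingly, so -ts 0/1 parks everything on the second card while the first stays idle. A small sketch of that normalization (illustrative, not llama.cpp's actual code):

def split_fractions(ts: str) -> list[float]:
    weights = [float(x) for x in ts.split("/")]
    return [w / sum(weights) for w in weights]

print(split_fractions("0/1"))    # [0.0, 1.0] -> single-card, everything on GPU 2
print(split_fractions("1/1/1"))  # even split across three R9700s
print(split_fractions("1/1/0"))  # two of three cards, third left idle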

Submitted

May 4, 2026, 4:56 PM

Last edited

GLM-4.7-Flash

31B

2x AMD Radeon AI Pro R9700 32GB
llama.cpp · MXFP4_MOE · 92.9 tok/s out · 273.4 tok/s total · 110ms TTFT · 1k ctx · 3d ago

Model

zai-org/GLM-4.7-Flash

Display name

GLM-4.7-Flash

Base model

Revision

main

Family

Parameters

31B

Active params

MoE

no

Output tok/s

92.9

Prefill tok/s

Total tok/s

273.4

TTFT

109.6ms

Peak VRAM

Prompt tokens

527

Output tokens

256

Prefill tokens

Context length

783

Batch size

1

Hardware class

DISCRETE_GPU

Hardware

2x AMD Radeon AI Pro R9700 32GB

GPU slots

GPU count

2

VRAM

32GB

Chip vendor

Chip family

Chip variant

Unified memory

NPU TOPS

CPU

AMD Ryzen 9 5950X

RAM

64GB

OS

Arch Linux

Power

Engine

llama.cpp

Engine version

Quantization

MXFP4_MOE

Backend

vulkan

Tensor parallel

Pipeline parallel

GPU layers

Split mode

KV cache dtype

KV cache size

Prefix caching

Attention backend

Flash attention

Chunked prefill

Prefill chunk

Continuous batching

CPU offload

CPU layers

Rope scaling

Rope scale

Yarn ext factor

Engine quant

SGLang quant

GPU mem util

Max running seqs

Scheduler delay

Num parallel

Concurrency

Spec decoding

no

Spec method

Spec model

Spec draft model

Spec tokens

Spec ngram

Spec draft TP

MTP enabled

no

MTP draft layers

Temperature

Top P

Top K

Min P

Repeat penalty

Mirostat

Command

Extra flags

Notes

llama.cpp Vulkan, single R9700 (-ts 0/1), c=131072, kv q8_0, flash-attn auto. Best of 3 streamed runs, ~512 input / 256 output.

Submitted

May 4, 2026, 4:45 PM

Last edited