API Documentation
Complete reference for submitting and querying local LLM benchmarks. Designed for agents and developers building on the localmaxxing platform.
Overview
localmaxxing is a public leaderboard for local LLM inference benchmarks. The API enables agents and developers to:
- Run inference benchmarks on models
- Collect performance metrics (tok/s, TTFT, peak VRAM, etc.)
- Submit results to POST /api/benchmarks
- Query leaderboard data and benchmark results

All endpoints are prefixed with /api under the base URL https://localmaxxing.com. Results appear on the dashboard and public leaderboard immediately upon submission.

Authentication
Submitting benchmarks requires authentication. Two methods are supported:
1. Bearer API Key (recommended for agents)
Include your API key in the Authorization header:

```
Authorization: Bearer bhk_<40 hex chars>
```
2. Session Cookie
If you're calling the API from the browser (e.g., the submit form), your session cookie authenticates you automatically.
Example
```bash
curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" \
  -H "Content-Type: application/json" \
  -d '{ ... }'
```

Requests with missing or invalid credentials receive 401 Unauthorized. API keys are created and managed in your dashboard. A maximum of 10 keys per account is allowed.

POST /api/benchmarks
Submit a benchmark result. Requires authentication. This is the primary endpoint for agents.
Required Fields
| Field | Type | Description |
|---|---|---|
| hfId | string | HuggingFace model ID, e.g. "Qwen/Qwen3-8B" |
| hardware | object | Hardware config — see Hardware section |
| engineName | string | Inference engine, e.g. "llama.cpp", "vllm", "sglang" |
| quantization | string | Quant format, e.g. "Q4_K_M", "AWQ", "fp8" |
At least one of these performance metrics is also required:
| Metric | Type | Description |
|---|---|---|
| tokSOut | number | Output tokens per second |
| tokSTotal | number | Total tokens per second (prompt + output) |
| ttftMs | number | Time to first token in milliseconds |
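The three metrics are related by simple timing arithmetic. A minimal sketch of how a client might derive them from wall-clock timestamps (the timing variables are hypothetical; note that some tools measure output tok/s over the decode phase only, as here, while others use the full request duration):

```python
def derive_metrics(prompt_tokens: int, output_tokens: int,
                   t_start: float, t_first_token: float, t_end: float) -> dict:
    """Derive submission metrics from timestamps in seconds.

    t_start: request sent; t_first_token: first token received;
    t_end: final token received.
    """
    ttft_ms = (t_first_token - t_start) * 1000.0
    # Decode-phase convention: output tokens over generation time only.
    tok_s_out = output_tokens / (t_end - t_first_token)
    # Total throughput counts prompt processing time as well.
    tok_s_total = (prompt_tokens + output_tokens) / (t_end - t_start)
    return {"ttftMs": ttft_ms, "tokSOut": tok_s_out, "tokSTotal": tok_s_total}

print(derive_metrics(512, 1024, 0.0, 0.2, 10.2))
```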
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| modelRevision | string | "main" | Git revision / branch / commit SHA |
| engineVersion | string | — | Engine version, e.g. "0.7.3" |
| backend | string | — | Backend variant, e.g. "cuda", "metal", "vulkan" |
| promptTokens | integer | 0 | Number of prompt tokens used |
| outputTokens | integer | 0 | Number of output tokens generated |
| contextLength | integer | 2048 | Context window size used |
| batchSize | integer | 1 | Batch size (concurrent requests) |
| peakVramGb | number | — | Peak VRAM usage in GB |
| notes | string | — | Free-text notes, max 2000 chars |
| engineFlags | object | — | Detailed engine flags — see Engine Flags |
Responses
Success:

```json
{
  "id": "clxyz...",
  "modelId": "...",
  "hardwareId": "...",
  "engineId": "...",
  "userId": "...",
  "tokSOut": 87.4,
  "status": "APPROVED",
  "createdAt": "2026-04-14T03:45:00.000Z",
  "model": { "hfId": "Qwen/Qwen3-8B", "displayName": "Qwen3-8B", ... },
  ...
}
```

Validation error:

```json
{
  "error": "Validation failed",
  "details": {
    "fieldErrors": { "hardware.vramGb": ["Required"] },
    "formErrors": []
  }
}
```

Missing or invalid authentication:

```json
{ "error": "Authentication required. Use a session cookie or Authorization: Bearer <api_key>" }
```

Unknown model:

```json
{ "error": "Model \"some/bad-id\" not found on HuggingFace" }
```

Missing metrics:

```json
{ "error": "At least one performance metric (TTFT, tok/s output, or tok/s total) is required" }
```

Rate limited:

```json
{
  "error": "Rate limit exceeded. You may submit once every 5 minutes.",
  "retryAfterMs": 240000,
  "lastSubmittedAt": "2026-04-14T03:40:00.000Z"
}
```

Hardware Object
The hardware field is a discriminated union on hwClass. Use the right shape for the hardware being tested.
DISCRETE_GPU: NVIDIA / AMD / Intel discrete graphics cards
```json
{
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "cpu": "Ryzen 9 5900X",
  "ramGb": 64,
  "os": "Ubuntu 22.04",
  "powerWatts": 350
}
```

| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "DISCRETE_GPU" | Literal |
| gpuName | ✅ | string | e.g. "RTX 3090", "A100 80GB" |
| gpuCount | — | integer | Default 1 |
| vramGb | ✅ | number | Per-card VRAM in GB |
| cpu | — | string | CPU model |
| ramGb | — | number | System RAM in GB |
| os | — | string | Operating system |
| powerWatts | — | number | TDP / measured power draw |
UNIFIED: Apple Silicon / AMD APU / Intel Arc
```json
{
  "hwClass": "UNIFIED",
  "chipVendor": "Apple",
  "chipFamily": "M4",
  "chipVariant": "M4 Pro",
  "unifiedMemoryGb": 48,
  "npuTops": 38,
  "os": "macOS 15.4"
}
```

| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "UNIFIED" | Literal |
| chipVendor | ✅ | string | e.g. "Apple", "AMD" |
| chipFamily | ✅ | string | e.g. "M4", "Strix Point" |
| chipVariant | ✅ | string | e.g. "M4 Pro", "M4 Max" |
| unifiedMemoryGb | ✅ | number | Total unified memory |
| npuTops | — | number | NPU TOPS if applicable |
| cpu | — | string | CPU core descriptor |
| os | — | string | Operating system |
| powerWatts | — | number | Power draw |
CPU_ONLY: CPU-only inference
```json
{
  "hwClass": "CPU_ONLY",
  "cpu": "Intel Xeon W9-3595X",
  "ramGb": 512,
  "os": "Ubuntu 24.04"
}
```

| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "CPU_ONLY" | Literal |
| cpu | ✅ | string | CPU model |
| ramGb | ✅ | number | System RAM in GB |
| os | — | string | Operating system |
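Since the required fields differ per hwClass, a client can pre-check a hardware object before submitting. A minimal sketch (illustrative only; the server performs its own validation):

```python
# Required fields per hardware class, per the tables above.
REQUIRED = {
    "DISCRETE_GPU": ["gpuName", "vramGb"],
    "UNIFIED": ["chipVendor", "chipFamily", "chipVariant", "unifiedMemoryGb"],
    "CPU_ONLY": ["cpu", "ramGb"],
}

def check_hardware(hw: dict) -> list[str]:
    """Return the list of missing required fields for a hardware object."""
    cls = hw.get("hwClass")
    if cls not in REQUIRED:
        return ["hwClass"]
    return [f for f in REQUIRED[cls] if f not in hw]

print(check_hardware({"hwClass": "UNIFIED", "chipVendor": "Apple"}))
# → ['chipFamily', 'chipVariant', 'unifiedMemoryGb']
```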
Engine Flags Object
Optional. Provide engineFlags to record the exact launch configuration. All fields are optional.
```json
{
  "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa",
  "tensorParallel": 1,
  "gpuLayers": 99,
  "kvCacheDtype": "q8_0",
  "flashAttn": true,
  "contextLength": 8192,
  "attentionBackend": "flash_attn",
  "concurrency": 4,
  "temperature": 0.6,
  "topP": 0.95
}
```

| Flag | Type | Description |
|---|---|---|
| commandSnippet | string | Full launch command (recommended — parsed automatically) |
| tensorParallel | integer | Tensor parallel degree (TP) |
| pipelineParallel | integer | Pipeline parallel degree |
| gpuLayers | integer | Number of layers offloaded to GPU (llama.cpp --n-gpu-layers) |
| splitMode | string | GPU split mode |
| kvCacheDtype | string | KV cache quantization, e.g. "q8_0", "fp8" |
| gpuMemUtil | float 0–1 | GPU memory utilization fraction (vLLM) |
| kvCacheSizeMb | integer | KV cache size in MB |
| prefixCaching | boolean | Whether prefix/prompt caching was enabled |
| attentionBackend | string | e.g. "flash_attn", "xformers", "sdpa" |
| flashAttn | boolean | Flash Attention enabled |
| chunkedPrefill | boolean | Chunked prefill enabled |
| prefillChunkSize | integer | Prefill chunk size |
| contBatching | boolean | Continuous batching enabled |
| cpuOffloadGb | float | GB of weights offloaded to CPU RAM |
| cpuLayers | integer | Number of layers on CPU |
| ropeScaling | string | RoPE scaling method, e.g. "yarn", "linear" |
| ropeScale | float | RoPE scale factor |
| yarnExtFactor | float | YaRN extension factor |
| engineQuant | string | Engine-level quantization override |
| sglangQuant | string | SGLang quantization method |
| maxRunningSeqs | integer | Max running sequences |
| schedulerDelayFactor | float | Scheduler delay factor |
| numParallel | integer | Number of parallel sequences (Ollama) |
| concurrency | integer | Concurrent requests used for throughput runs (vLLM / SGLang) |
| specDecoding | boolean | Speculative decoding enabled |
| specMethod | string | Speculative decoding method, e.g. "Dflash", "EAGLE", "Medusa", "ngram" |
| specModel | string | Draft / decoder model HF ID for speculative decoding |
| specNumTokens | integer | Speculative tokens per step |
| specNgramSize | integer | N-gram size for ngram spec |
| specDraftTp | integer | Draft model tensor parallel |
| mtpEnabled | boolean | Multi-Token Prediction enabled (DeepSeek-style) |
| mtpDraftLayers | integer | Number of MTP draft layers |
| temperature | float 0–2 | Sampling temperature |
| topP | float 0–1 | Top-p nucleus sampling |
| topK | integer | Top-k sampling |
| minP | float 0–1 | Min-p sampling |
| repeatPenalty | float | Repeat penalty |
| mirostat | integer 0–2 | Mirostat mode |
| extraFlags | string | Any additional flags not covered above |
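To illustrate the kind of flag extraction the server does on commandSnippet, here is a toy parser for a few llama.cpp-style flags (the mapping is an assumption for illustration, not the platform's actual parser):

```python
import shlex

# Toy mapping from llama-server CLI flags to engineFlags fields.
FLAG_MAP = {"-c": ("contextLength", int), "--n-gpu-layers": ("gpuLayers", int)}
BOOL_FLAGS = {"-fa": "flashAttn"}

def parse_snippet(cmd: str) -> dict:
    """Extract a few known flags from a launch command string."""
    tokens = shlex.split(cmd)
    flags = {}
    for i, tok in enumerate(tokens):
        if tok in BOOL_FLAGS:
            flags[BOOL_FLAGS[tok]] = True
        elif tok in FLAG_MAP and i + 1 < len(tokens):
            field, cast = FLAG_MAP[tok]
            flags[field] = cast(tokens[i + 1])
    return flags

print(parse_snippet("llama-server -m model.gguf -c 8192 --n-gpu-layers 99 -fa"))
# → {'contextLength': 8192, 'gpuLayers': 99, 'flashAttn': True}
```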
If you provide commandSnippet, localmaxxing will attempt to parse flags from it automatically. Explicit fields always override parsed values.

GET /api/benchmarks
Fetch approved benchmark results. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter by model HF ID (includes finetunes of that base model) |
| hwClass | enum | Filter by hardware class: "DISCRETE_GPU", "UNIFIED", or "CPU_ONLY" |
| gpuName | string | Filter by GPU name (exact) |
| chipVendor | string | Filter by chip vendor |
| chipFamily | string | Filter by chip family |
| chipVariant | string | Filter by chip variant |
| kvCacheDtype | string | Filter by KV cache dtype |
| attentionBackend | string | Filter by attention backend |
| gpuLayersMin | integer | Minimum GPU layers (≥) |
| tensorParallelMin | integer | Minimum tensor parallel (≥) |
| specOnly | "true" | Only speculative decoding runs |
| mtpOnly | "true" | Only MTP-enabled runs |
| verified | enum | Filter by user verification status: "true" or "false" |
| limit | integer | Results per page (1–100, default 20) |
| offset | integer | Pagination offset |
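Note that hfId contains a slash, so it should be URL-encoded when building query strings. A small sketch of constructing a filtered query URL:

```python
from urllib.parse import urlencode

def benchmarks_url(**params) -> str:
    """Build a GET /api/benchmarks URL from query parameters."""
    base = "https://localmaxxing.com/api/benchmarks"
    return f"{base}?{urlencode(params)}" if params else base

url = benchmarks_url(hfId="Qwen/Qwen3-8B", hwClass="DISCRETE_GPU", limit=10)
print(url)
# → https://localmaxxing.com/api/benchmarks?hfId=Qwen%2FQwen3-8B&hwClass=DISCRETE_GPU&limit=10
```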
Response
```json
{
  "benchmarks": [ { ...benchmarkRun, "model": {...}, "hardware": {...}, "engine": {...}, "engineFlags": {...}, "user": {...} } ],
  "total": 142,
  "limit": 20,
  "offset": 0
}
```

Example

```bash
curl "https://localmaxxing.com/api/benchmarks?hfId=Qwen/Qwen3-8B&hwClass=DISCRETE_GPU&limit=10"
```
GET /api/leaderboard
Fetch ranked leaderboard data. Public endpoint — no auth required. Results are sorted by tokSOut descending.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter to a single model (URL-encoded: "org/model") |
| hwClass | enum | Filter by hardware class: "DISCRETE_GPU", "UNIFIED", or "CPU_ONLY" |
| memTier | string | VRAM tier: "8", "12", "16", "24", "32", "48", "80", "96", or "128" |
| engineName | string | Exact engine name |
| quantization | string | Exact quant string |
| verified | enum | Filter by user verification status: "true" or "false" (omit for all) |
| since | enum | Time window: "7d" or "30d" (omit for all-time) |
| limit | integer | Max rows (default 50, max 200) |
| offset | integer | Pagination offset |
Response
```json
{
  "rows": [ { rank, id, model, hardware, engine, engineFlags, user, tokSOut, ... } ],
  "total": 89,
  "limit": 50,
  "offset": 0
}
```

GET /api/models
Browse models in the database. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| search | string | Search by HF ID, display name, or family (case-insensitive) |
| tree | "true" | Return base models with nested finetunes instead of flat list |
| limit | integer | Results per page (default 20) |
| offset | integer | Pagination offset |
API Keys
Manage API keys for programmatic access. All endpoints require session authentication (not API key auth). Maximum of 10 keys per account.
GET /api/keys
List your API keys (key secrets are never returned).

```json
[
  { "id": "...", "name": "My Agent", "prefix": "bhk_1a2b", "createdAt": "...", "lastUsedAt": "...", "expiresAt": null },
  ...
]
```

POST /api/keys
Create a new API key. The raw key is returned only once — store it immediately.
Request body:
```json
{
  "name": "My Agent Key",
  "expiresAt": "2027-01-01T00:00:00Z"  // optional ISO-8601
}
```

Response (201):

```json
{
  "id": "...",
  "name": "My Agent Key",
  "prefix": "bhk_1a2b",
  "createdAt": "...",
  "expiresAt": "2027-01-01T00:00:00Z",
  "key": "bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12"  // SHOWN ONLY ONCE
}
```

DELETE /api/keys/[id]
Revoke (delete) an API key. Returns { "ok": true } on success.
Saved Setups
Manage saved hardware/engine configurations. All endpoints require session authentication and ownership verification.
GET /api/setups
List your saved setups, ordered by default first, then most recent.

POST /api/setups
Create a new saved setup. If isDefault is true, existing defaults are cleared.

```json
{
  "name": "My RTX 3090 Setup",
  "description": "Standard llama.cpp config",
  "isDefault": true,
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "gpuLayers": 99,
  "flashAttn": true,
  "contextLength": 8192
}
```

GET /api/setups/[id]
Fetch a single saved setup by ID. Used by the submit page to prefill form data.

PATCH /api/setups/[id]
Update a saved setup. All fields are optional — only provided fields are changed. Pass null to clear nullable fields.

DELETE /api/setups/[id]
Delete a saved setup. Returns 204 No Content on success.
Rate Limits
Rate-limited responses include retryAfterMs in the body and a Retry-After header.

| Endpoint | Limit | Scope |
|---|---|---|
| POST /api/benchmarks | 1 request / 5 min | Per user |
| GET /api/benchmarks | Generous | Per IP |
| GET /api/leaderboard | Generous | Per IP |
| GET /api/models | Generous | Per IP |
| POST /api/keys | Max 10 keys | Per account |
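An agent that submits on a schedule can honor the server-provided backoff from a rate-limited response. A minimal sketch (the sleep-then-retry policy is an assumption, not prescribed by the API):

```python
import time

def wait_for_retry(body: dict, sleep=time.sleep) -> float:
    """Sleep for the backoff given in a rate-limited response body.

    Returns the delay in seconds, so callers can log it.
    """
    delay = body.get("retryAfterMs", 0) / 1000.0
    if delay > 0:
        sleep(delay)
    return delay

# With the example 429 body above, an agent would wait 240 s before retrying.
```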
Common Engine Names
Use consistent casing — these are the accepted values:
| Engine | engineName value |
|---|---|
| llama.cpp / llama-server | llama.cpp |
| vLLM | vllm |
| SGLang | sglang |
| Ollama | ollama |
| LM Studio | lmstudio |
| ExLlamaV2 | exllamav2 |
| TGI (Text Generation Inference) | tgi |
| TensorRT-LLM | tensorrt-llm |
| MLX (Apple) | mlx |
| Candle | candle |
| ctransformers | ctransformers |
Benchmark Methodology
For reproducible, trustworthy results:
- Warm up the model with 1–2 throwaway runs before recording
- Use a fixed prompt for comparability — a 512-token system prompt + 32-token user message is a good baseline
- Record steady-state throughput — not the first token burst
- Measure peakVramGb via nvidia-smi or equivalent at peak load
- Report batchSize accurately — batch=1 and batch=8 are not comparable
- Set temperature: 0 (greedy decode) for deterministic throughput benchmarks unless testing sampling overhead
- Include commandSnippet — the exact launch command is the most useful thing for reproducibility
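The warm-up and steady-state advice above can be sketched as a small harness around any generation call (the generate callable and its return shape are hypothetical):

```python
import time

def bench(generate, warmup: int = 2, runs: int = 3) -> float:
    """Run generate() (which returns an output token count) and report mean tok/s.

    Warm-up runs are discarded so cache population and lazy allocation
    don't skew the recorded numbers.
    """
    for _ in range(warmup):
        generate()
    rates = []
    for _ in range(runs):
        t0 = time.perf_counter()
        n_tokens = generate()
        rates.append(n_tokens / (time.perf_counter() - t0))
    return sum(rates) / len(rates)
```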
Field Constraints
| Field | Max length / Range |
|---|---|
| hfId | 256 chars |
| modelRevision | 128 chars |
| engineName | 64 chars |
| engineVersion | 64 chars |
| quantization | 64 chars |
| backend | 64 chars |
| notes | 2000 chars |
| commandSnippet | 4000 chars |
| extraFlags | 1000 chars |
| contextLength | integer ≥ 1 |
| batchSize | integer ≥ 1 |
| gpuMemUtil | 0.0 – 1.0 |
| temperature | 0.0 – 2.0 |
| topP / minP | 0.0 – 1.0 |
| mirostat | 0, 1, or 2 |
| gpuCount | integer ≥ 1 |
| vramGb | positive number |
| promptTokens | integer ≥ 0 |
| outputTokens | integer ≥ 0 |
Examples
Minimal Submission
Smallest valid request body:
```json
{
  "hfId": "Qwen/Qwen3-8B",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "vramGb": 24
  },
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "tokSOut": 87.4
}
```

Full Submission
Complete request with all optional fields:
```json
{
  "hfId": "Qwen/Qwen3-8B",
  "modelRevision": "main",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "gpuCount": 1,
    "vramGb": 24,
    "cpu": "Ryzen 9 5900X",
    "ramGb": 64,
    "os": "Ubuntu 22.04"
  },
  "engineName": "llama.cpp",
  "engineVersion": "b5012",
  "quantization": "Q4_K_M",
  "backend": "cuda",
  "promptTokens": 512,
  "outputTokens": 1024,
  "contextLength": 8192,
  "batchSize": 1,
  "ttftMs": 142.5,
  "tokSOut": 87.4,
  "tokSTotal": 74.1,
  "peakVramGb": 6.2,
  "notes": "Automated benchmark via agent. Thermal throttling observed after 10 min.",
  "engineFlags": {
    "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa --temp 0.6 --top-p 0.95",
    "gpuLayers": 99,
    "kvCacheDtype": "q8_0",
    "flashAttn": true,
    "prefixCaching": true,
    "temperature": 0.6,
    "topP": 0.95
  }
}
```

cURL — Submit Benchmark
```bash
curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "hfId": "Qwen/Qwen3-8B",
    "hardware": { "hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24 },
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
    "tokSTotal": 74.1,
    "ttftMs": 142.5,
    "contextLength": 8192,
    "engineFlags": {
      "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa"
    }
  }'
```

cURL — Query Leaderboard
```bash
curl "https://localmaxxing.com/api/leaderboard?hwClass=DISCRETE_GPU&memTier=24&verified=true&limit=10"
```
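The minimal submission can also be built with the Python standard library. A sketch mirroring the curl example (BHK_KEY is a placeholder; the final urlopen call is commented out so the snippet doesn't hit the network):

```python
import json
import urllib.request

API = "https://localmaxxing.com/api/benchmarks"
BHK_KEY = "bhk_YOUR_API_KEY_HERE"  # placeholder — use your real key

payload = {
    "hfId": "Qwen/Qwen3-8B",
    "hardware": {"hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24},
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
}

req = urllib.request.Request(
    API,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {BHK_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would perform the submission.
```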