
API Documentation

Complete reference for submitting and querying local LLM benchmarks. Designed for agents and developers building on the localmaxxing platform.

Overview

localmaxxing is a public leaderboard for local LLM inference benchmarks. The API enables agents and developers to:

  1. Run inference benchmarks on models
  2. Collect performance metrics (tok/s, TTFT, peak VRAM, etc.)
  3. Submit results to POST /api/benchmarks
  4. Query leaderboard data and benchmark results
ℹ️
Base URL: https://localmaxxing.com
All endpoints are prefixed with /api. Results appear on the dashboard and public leaderboard immediately upon submission.

Authentication

Submitting benchmarks requires authentication. Two methods are supported:

1. Bearer API Key (recommended for agents)

Include your API key in the Authorization header:

Authorization: Bearer bhk_<40 hex chars>

2. Session Cookie

If you're calling the API from the browser (e.g., the submit form), your session cookie authenticates you automatically.

Example

curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" \
  -H "Content-Type: application/json" \
  -d '{ ... }'
⚠️
If the key is missing, expired, or invalid, the API returns 401 Unauthorized. API keys are created and managed in your dashboard. A maximum of 10 keys per account is allowed.
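The authenticated request above can be built in Python with only the standard library. A minimal sketch (the key below is the placeholder from the curl example, not a real credential; the request is constructed but not sent):

```python
import urllib.request

BASE_URL = "https://localmaxxing.com"
API_KEY = "bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12"  # placeholder

def authed_request(path: str, body: bytes, method: str = "POST") -> urllib.request.Request:
    """Build a request carrying the Bearer API key and JSON content type."""
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method=method,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = authed_request("/api/benchmarks", b"{}")
```

Sending it is then a matter of `urllib.request.urlopen(req)` (or handing the same headers to your HTTP client of choice).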

POST /api/benchmarks

Submit a benchmark result. Requires authentication. This is the primary endpoint for agents.


Required Fields

| Field | Type | Description |
| --- | --- | --- |
| `hfId` | string | HuggingFace model ID, e.g. `"Qwen/Qwen3-8B"` |
| `hardware` | object | Hardware config — see Hardware section |
| `engineName` | string | Inference engine, e.g. `"llama.cpp"`, `"vllm"`, `"sglang"` |
| `quantization` | string | Quant format, e.g. `"Q4_K_M"`, `"AWQ"`, `"fp8"` |

At least one of these performance metrics is also required:

| Metric | Type | Description |
| --- | --- | --- |
| `tokSOut` | number | Output tokens per second |
| `tokSTotal` | number | Total tokens per second (prompt + output) |
| `ttftMs` | number | Time to first token in milliseconds |
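Checking this rule client-side before submitting saves a wasted round trip to the 422 response. A minimal sketch:

```python
# The three metrics the server accepts; at least one must be present.
REQUIRED_METRICS = ("tokSOut", "tokSTotal", "ttftMs")

def has_performance_metric(payload: dict) -> bool:
    """Mirror the server's 422 check: at least one metric must be non-null."""
    return any(payload.get(m) is not None for m in REQUIRED_METRICS)
```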

Optional Fields

FieldTypeDefaultDescription
modelRevisionstring"main"Git revision / branch / commit SHA
engineVersionstringEngine version, e.g. "0.7.3"
backendstringBackend variant, e.g. "cuda", "metal", "vulkan"
promptTokensinteger0Number of prompt tokens used
outputTokensinteger0Number of output tokens generated
contextLengthinteger2048Context window size used
batchSizeinteger1Batch size (concurrent requests)
peakVramGbnumberPeak VRAM usage in GB
notesstringFree-text notes, max 2000 chars
engineFlagsobjectDetailed engine flags — see Engine Flags

Responses

201 Created — Success
{
  "id": "clxyz...",
  "modelId": "...",
  "hardwareId": "...",
  "engineId": "...",
  "userId": "...",
  "tokSOut": 87.4,
  "status": "APPROVED",
  "createdAt": "2026-04-14T03:45:00.000Z",
  "model": { "hfId": "Qwen/Qwen3-8B", "displayName": "Qwen3-8B", ... },
  ...
}
400 Bad Request — Validation error
{
  "error": "Validation failed",
  "details": {
    "fieldErrors": { "hardware.vramGb": ["Required"] },
    "formErrors": []
  }
}
401 Unauthorized
{ "error": "Authentication required. Use a session cookie or Authorization: Bearer <api_key>" }
404 Not Found — Model not on HuggingFace
{ "error": "Model \"some/bad-id\" not found on HuggingFace" }
422 Unprocessable — Missing metrics
{ "error": "At least one performance metric (TTFT, tok/s output, or tok/s total) is required" }
429 Rate Limit Exceeded
{
  "error": "Rate limit exceeded. You may submit once every 5 minutes.",
  "retryAfterMs": 240000,
  "lastSubmittedAt": "2026-04-14T03:40:00.000Z"
}
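On a 429, the `retryAfterMs` field tells you exactly how long to wait before retrying. A sketch of a retry-delay helper (the one-second safety buffer is an arbitrary choice, not part of the API; `clock` is injectable so it can be tested without sleeping):

```python
import json
import time

def wait_for_retry(response_body: str, clock=time.sleep) -> float:
    """Parse a 429 body, sleep for retryAfterMs plus a small buffer,
    and return the delay (in seconds) that was applied."""
    info = json.loads(response_body)
    delay_s = info.get("retryAfterMs", 0) / 1000 + 1.0  # 1 s safety buffer
    clock(delay_s)
    return delay_s
```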

Hardware Object

The hardware field is a discriminated union on hwClass. Use the right shape for the hardware being tested.

DISCRETE_GPU — NVIDIA / AMD / Intel discrete graphics cards

{
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "cpu": "Ryzen 9 5900X",
  "ramGb": 64,
  "os": "Ubuntu 22.04",
  "powerWatts": 350
}
| Field | Required | Type | Notes |
| --- | --- | --- | --- |
| `hwClass` | ✓ | `"DISCRETE_GPU"` | Literal |
| `gpuName` | ✓ | string | e.g. `"RTX 3090"`, `"A100 80GB"` |
| `gpuCount` | | integer | Default 1 |
| `vramGb` | ✓ | number | Per-card VRAM in GB |
| `cpu` | | string | CPU model |
| `ramGb` | | number | System RAM in GB |
| `os` | | string | Operating system |
| `powerWatts` | | number | TDP / measured power draw |

UNIFIED — Apple Silicon / AMD APU / Intel Arc

{
  "hwClass": "UNIFIED",
  "chipVendor": "Apple",
  "chipFamily": "M4",
  "chipVariant": "M4 Pro",
  "unifiedMemoryGb": 48,
  "npuTops": 38,
  "os": "macOS 15.4"
}
| Field | Required | Type | Notes |
| --- | --- | --- | --- |
| `hwClass` | ✓ | `"UNIFIED"` | Literal |
| `chipVendor` | | string | e.g. `"Apple"`, `"AMD"` |
| `chipFamily` | | string | e.g. `"M4"`, `"Strix Point"` |
| `chipVariant` | | string | e.g. `"M4 Pro"`, `"M4 Max"` |
| `unifiedMemoryGb` | | number | Total unified memory |
| `npuTops` | | number | NPU TOPS if applicable |
| `cpu` | | string | CPU core descriptor |
| `os` | | string | Operating system |
| `powerWatts` | | number | Power draw |

CPU_ONLY — CPU-only inference

{
  "hwClass": "CPU_ONLY",
  "cpu": "Intel Xeon W9-3595X",
  "ramGb": 512,
  "os": "Ubuntu 24.04"
}
| Field | Required | Type | Notes |
| --- | --- | --- | --- |
| `hwClass` | ✓ | `"CPU_ONLY"` | Literal |
| `cpu` | | string | CPU model |
| `ramGb` | | number | System RAM in GB |
| `os` | | string | Operating system |
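For scripted submissions it can help to build each shape with a small constructor so the `hwClass` discriminator is never forgotten. A sketch (these helpers are illustrative, not part of any official client):

```python
def discrete_gpu(gpu_name: str, vram_gb: float, **extra) -> dict:
    """DISCRETE_GPU shape: gpuName and vramGb plus any optional fields."""
    return {"hwClass": "DISCRETE_GPU", "gpuName": gpu_name, "vramGb": vram_gb, **extra}

def unified(chip_vendor: str, unified_memory_gb: float, **extra) -> dict:
    """UNIFIED shape for Apple Silicon / APUs."""
    return {"hwClass": "UNIFIED", "chipVendor": chip_vendor,
            "unifiedMemoryGb": unified_memory_gb, **extra}

def cpu_only(cpu: str, **extra) -> dict:
    """CPU_ONLY shape."""
    return {"hwClass": "CPU_ONLY", "cpu": cpu, **extra}
```

Optional fields (`gpuCount`, `os`, `powerWatts`, …) pass straight through as keyword arguments.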

Engine Flags Object

Optional. Provide engineFlags to record the exact launch configuration. All fields are optional.

{
  "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa",
  "tensorParallel": 1,
  "gpuLayers": 99,
  "kvCacheDtype": "q8_0",
  "flashAttn": true,
  "contextLength": 8192,
  "attentionBackend": "flash_attn",
  "concurrency": 4,
  "temperature": 0.6,
  "topP": 0.95
}
| Flag | Type | Description |
| --- | --- | --- |
| `commandSnippet` | string | Full launch command (recommended — parsed automatically) |
| `tensorParallel` | integer | Tensor parallel degree (TP) |
| `pipelineParallel` | integer | Pipeline parallel degree |
| `gpuLayers` | integer | Number of layers offloaded to GPU (llama.cpp `--n-gpu-layers`) |
| `splitMode` | string | GPU split mode |
| `kvCacheDtype` | string | KV cache quantization, e.g. `"q8_0"`, `"fp8"` |
| `gpuMemUtil` | float 0–1 | GPU memory utilization fraction (vLLM) |
| `kvCacheSizeMb` | integer | KV cache size in MB |
| `prefixCaching` | boolean | Whether prefix/prompt caching was enabled |
| `attentionBackend` | string | e.g. `"flash_attn"`, `"xformers"`, `"sdpa"` |
| `flashAttn` | boolean | Flash Attention enabled |
| `chunkedPrefill` | boolean | Chunked prefill enabled |
| `prefillChunkSize` | integer | Prefill chunk size |
| `contBatching` | boolean | Continuous batching enabled |
| `cpuOffloadGb` | float | GB of weights offloaded to CPU RAM |
| `cpuLayers` | integer | Number of layers on CPU |
| `ropeScaling` | string | RoPE scaling method, e.g. `"yarn"`, `"linear"` |
| `ropeScale` | float | RoPE scale factor |
| `yarnExtFactor` | float | YaRN extension factor |
| `engineQuant` | string | Engine-level quantization override |
| `sglangQuant` | string | SGLang quantization method |
| `maxRunningSeqs` | integer | Max running sequences |
| `schedulerDelayFactor` | float | Scheduler delay factor |
| `numParallel` | integer | Number of parallel sequences (Ollama) |
| `concurrency` | integer | Concurrent requests used for throughput runs (vLLM / SGLang) |
| `specDecoding` | boolean | Speculative decoding enabled |
| `specMethod` | string | Speculative decoding method, e.g. `"Dflash"`, `"EAGLE"`, `"Medusa"`, `"ngram"` |
| `specModel` | string | Draft / decoder model HF ID for speculative decoding |
| `specNumTokens` | integer | Speculative tokens per step |
| `specNgramSize` | integer | N-gram size for ngram spec |
| `specDraftTp` | integer | Draft model tensor parallel |
| `mtpEnabled` | boolean | Multi-Token Prediction enabled (DeepSeek-style) |
| `mtpDraftLayers` | integer | Number of MTP draft layers |
| `temperature` | float 0–2 | Sampling temperature |
| `topP` | float 0–1 | Top-p nucleus sampling |
| `topK` | integer | Top-k sampling |
| `minP` | float 0–1 | Min-p sampling |
| `repeatPenalty` | float | Repeat penalty |
| `mirostat` | integer 0–2 | Mirostat mode |
| `extraFlags` | string | Any additional flags not covered above |
💡
Tip: If you provide commandSnippet, localmaxxing will attempt to parse flags from it automatically. Explicit fields always override parsed values.
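To see how the override rule plays out, here is a toy illustration: a parser for a couple of llama.cpp flags, with explicit values merged on top. The parsing logic below is hypothetical (the server's actual parser is not specified here); only the precedence rule is from the docs.

```python
import shlex

def parse_snippet(snippet: str) -> dict:
    """Toy parser for three llama.cpp flags: -c, --n-gpu-layers, -fa."""
    toks = shlex.split(snippet)
    flags = {}
    for i, t in enumerate(toks):
        if t == "-c" and i + 1 < len(toks):
            flags["contextLength"] = int(toks[i + 1])
        elif t == "--n-gpu-layers" and i + 1 < len(toks):
            flags["gpuLayers"] = int(toks[i + 1])
        elif t == "-fa":
            flags["flashAttn"] = True
    return flags

def effective_flags(snippet: str, explicit: dict) -> dict:
    """Explicit engineFlags fields always win over parsed values."""
    return {**parse_snippet(snippet), **explicit}
```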

GET /api/benchmarks

Fetch approved benchmark results. Public endpoint — no auth required.


Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| `hfId` | string | Filter by model HF ID (includes finetunes of that base model) |
| `hwClass` | `DISCRETE_GPU` \| `UNIFIED` \| `CPU_ONLY` | Filter by hardware class |
| `gpuName` | string | Filter by GPU name (exact) |
| `chipVendor` | string | Filter by chip vendor |
| `chipFamily` | string | Filter by chip family |
| `chipVariant` | string | Filter by chip variant |
| `kvCacheDtype` | string | Filter by KV cache dtype |
| `attentionBackend` | string | Filter by attention backend |
| `gpuLayersMin` | integer | Minimum GPU layers (≥) |
| `tensorParallelMin` | integer | Minimum tensor parallel (≥) |
| `specOnly` | `"true"` | Only speculative decoding runs |
| `mtpOnly` | `"true"` | Only MTP-enabled runs |
| `verified` | `"true"` \| `"false"` | Filter by user verification status |
| `limit` | integer | Results per page (1–100, default 20) |
| `offset` | integer | Pagination offset |

Response

{
  "benchmarks": [ { ...benchmarkRun, model: {...}, hardware: {...}, engine: {...}, engineFlags: {...}, user: {...} } ],
  "total": 142,
  "limit": 20,
  "offset": 0
}

Example

curl "https://localmaxxing.com/api/benchmarks?hfId=Qwen/Qwen3-8B&hwClass=DISCRETE_GPU&limit=10"

GET /api/leaderboard

Fetch ranked leaderboard data. Public endpoint — no auth required. Results are sorted by tokSOut descending.


Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| `hfId` | string | Filter to a single model (URL-encoded: `"org/model"`) |
| `hwClass` | `DISCRETE_GPU` \| `UNIFIED` \| `CPU_ONLY` | Filter by hardware class |
| `memTier` | string | VRAM tier: `"8"` \| `"12"` \| `"16"` \| `"24"` \| `"32"` \| `"48"` \| `"80"` \| `"96"` \| `"128"` |
| `engineName` | string | Exact engine name |
| `quantization` | string | Exact quant string |
| `verified` | `"true"` \| `"false"` | Filter by user verification status (omit for all) |
| `since` | `"7d"` \| `"30d"` | Time window (omit for all-time) |
| `limit` | integer | Max rows (default 50, max 200) |
| `offset` | integer | Pagination offset |

Response

{
  "rows": [ { rank, id, model, hardware, engine, engineFlags, user, tokSOut, ... } ],
  "total": 89,
  "limit": 50,
  "offset": 0
}

GET /api/models

Browse models in the database. Public endpoint — no auth required.


Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| `search` | string | Search by HF ID, display name, or family (case-insensitive) |
| `tree` | `"true"` | Return base models with nested finetunes instead of a flat list |
| `limit` | integer | Results per page (default 20) |
| `offset` | integer | Pagination offset |

API Keys

Manage API keys for programmatic access. All endpoints require session authentication (not API key auth). Maximum of 10 keys per account.

GET /api/keys

List your API keys (key secrets are never returned).

[
  { "id": "...", "name": "My Agent", "prefix": "bhk_1a2b", "createdAt": "...", "lastUsedAt": "...", "expiresAt": null },
  ...
]
POST /api/keys

Create a new API key. The raw key is returned only once — store it immediately.

Request body:

{
  "name": "My Agent Key",
  "expiresAt": "2027-01-01T00:00:00Z"  // optional ISO-8601
}

Response (201):

{
  "id": "...",
  "name": "My Agent Key",
  "prefix": "bhk_1a2b",
  "createdAt": "...",
  "expiresAt": "2027-01-01T00:00:00Z",
  "key": "bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12"  // SHOWN ONLY ONCE
}
DELETE /api/keys/[id]

Revoke (delete) an API key. Returns { "ok": true } on success.

Saved Setups

Manage saved hardware/engine configurations. All endpoints require session authentication and ownership verification.

GET /api/setups

List your saved setups, ordered by default first, then most recent.

POST /api/setups

Create a new saved setup. If isDefault is true, existing defaults are cleared.

{
  "name": "My RTX 3090 Setup",
  "description": "Standard llama.cpp config",
  "isDefault": true,
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "gpuLayers": 99,
  "flashAttn": true,
  "contextLength": 8192
}
GET /api/setups/[id]

Fetch a single saved setup by ID. Used by the submit page to prefill form data.

PATCH /api/setups/[id]

Update a saved setup. All fields are optional — only provided fields are changed. Pass null to clear nullable fields.

DELETE /api/setups/[id]

Delete a saved setup. Returns 204 No Content on success.

Rate Limits

⚠️
POST /api/benchmarks: 1 submission per 5 minutes per user. When rate limited, the response includes retryAfterMs and a Retry-After header.
| Endpoint | Limit | Scope |
| --- | --- | --- |
| `POST /api/benchmarks` | 1 request / 5 min | Per user |
| `GET /api/benchmarks` | Generous | Per IP |
| `GET /api/leaderboard` | Generous | Per IP |
| `GET /api/models` | Generous | Per IP |
| `POST /api/keys` | Max 10 keys | Per account |

Common Engine Names

Use consistent casing — these are the accepted values:

| Engine | `engineName` value |
| --- | --- |
| llama.cpp / llama-server | `llama.cpp` |
| vLLM | `vllm` |
| SGLang | `sglang` |
| Ollama | `ollama` |
| LM Studio | `lmstudio` |
| ExLlamaV2 | `exllamav2` |
| TGI (Text Generation Inference) | `tgi` |
| TensorRT-LLM | `tensorrt-llm` |
| MLX (Apple) | `mlx` |
| Candle | `candle` |
| ctransformers | `ctransformers` |

Benchmark Methodology

For reproducible, trustworthy results:

  • Warm up the model with 1–2 throwaway runs before recording
  • Use a fixed prompt for comparability — a 512-token system prompt + 32-token user message is a good baseline
  • Record steady-state throughput — not the first token burst
  • Measure peakVramGb via nvidia-smi or equivalent at peak load
  • Report batchSize accurately — batch=1 and batch=8 are not comparable
  • Set temperature: 0 (greedy decode) for deterministic throughput benchmarks unless testing sampling overhead
  • Include commandSnippet — the exact launch command is the most useful thing for reproducibility
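One common convention for `tokSOut` divides output tokens by decode time, i.e. total wall time minus TTFT, so the prefill burst doesn't inflate the number. A sketch (whether the leaderboard expects exactly this definition is an assumption; record how you measured in `notes`):

```python
def output_tok_s(output_tokens: int, total_time_s: float, ttft_s: float) -> float:
    """Steady-state decode throughput: tokens generated per second of
    decode time, excluding the prefill phase measured by TTFT."""
    decode_time = total_time_s - ttft_s
    if decode_time <= 0:
        raise ValueError("total time must exceed TTFT")
    return output_tokens / decode_time
```

For example, 1024 output tokens over a 12 s run with a 2 s TTFT gives 1024 / 10 = 102.4 tok/s.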

Field Constraints

| Field | Max length / Range |
| --- | --- |
| `hfId` | 256 chars |
| `modelRevision` | 128 chars |
| `engineName` | 64 chars |
| `engineVersion` | 64 chars |
| `quantization` | 64 chars |
| `backend` | 64 chars |
| `notes` | 2000 chars |
| `commandSnippet` | 4000 chars |
| `extraFlags` | 1000 chars |
| `contextLength` | integer ≥ 1 |
| `batchSize` | integer ≥ 1 |
| `gpuMemUtil` | 0.0 – 1.0 |
| `temperature` | 0.0 – 2.0 |
| `topP` / `minP` | 0.0 – 1.0 |
| `mirostat` | 0, 1, or 2 |
| `gpuCount` | integer ≥ 1 |
| `vramGb` | positive number |
| `promptTokens` | integer ≥ 0 |
| `outputTokens` | integer ≥ 0 |

Examples

Minimal Submission

Smallest valid request body:

{
  "hfId": "Qwen/Qwen3-8B",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "vramGb": 24
  },
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "tokSOut": 87.4
}

Full Submission

Complete request with all optional fields:

{
  "hfId": "Qwen/Qwen3-8B",
  "modelRevision": "main",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "gpuCount": 1,
    "vramGb": 24,
    "cpu": "Ryzen 9 5900X",
    "ramGb": 64,
    "os": "Ubuntu 22.04"
  },
  "engineName": "llama.cpp",
  "engineVersion": "b5012",
  "quantization": "Q4_K_M",
  "backend": "cuda",
  "promptTokens": 512,
  "outputTokens": 1024,
  "contextLength": 8192,
  "batchSize": 1,
  "ttftMs": 142.5,
  "tokSOut": 87.4,
  "tokSTotal": 74.1,
  "peakVramGb": 6.2,
  "notes": "Automated benchmark via agent. Thermal throttling observed after 10 min.",
  "engineFlags": {
    "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa --temp 0.6 --top-p 0.95",
    "gpuLayers": 99,
    "kvCacheDtype": "q8_0",
    "flashAttn": true,
    "prefixCaching": true,
    "temperature": 0.6,
    "topP": 0.95
  }
}

cURL — Submit Benchmark

curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "hfId": "Qwen/Qwen3-8B",
    "hardware": { "hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24 },
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
    "tokSTotal": 74.1,
    "ttftMs": 142.5,
    "contextLength": 8192,
    "engineFlags": {
      "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa"
    }
  }'

cURL — Query Leaderboard

curl "https://localmaxxing.com/api/leaderboard?hwClass=DISCRETE_GPU&memTier=24&verified=true&limit=10"