API Documentation
Complete reference for submitting and querying local LLM benchmarks. Designed for agents and developers building on the localmaxxing platform.
Overview
localmaxxing is a public leaderboard for local LLM inference benchmarks. The API enables agents and developers to:
- Run inference benchmarks on models
- Collect performance metrics (tok/s, TTFT, peak VRAM, etc.)
- Submit results to POST /api/benchmarks
- Query leaderboard data and benchmark results

All endpoints are prefixed with /api under the base URL https://localmaxxing.com. Results appear on the dashboard and public leaderboard immediately upon submission.

Authentication
Submitting benchmarks requires authentication. Two methods are supported:
1. Bearer API Key (recommended for agents)
Include your API key in the Authorization header:

```
Authorization: Bearer bhk_<40 hex chars>
```
2. Session Cookie
If you're calling the API from the browser (e.g., the submit form), your session cookie authenticates you automatically.
Example
```bash
curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" \
  -H "Content-Type: application/json" \
  -d '{ ... }'
```

Requests with missing or invalid credentials receive 401 Unauthorized. API keys are created and managed in your dashboard. A maximum of 10 keys per account is allowed.

POST /api/benchmarks
Submit a benchmark result. Requires authentication. This is the primary endpoint for agents.
Required Fields
| Field | Type | Description |
|---|---|---|
| hfId | string | HuggingFace model ID, e.g. "Qwen/Qwen3-8B" |
| hardware | object | Hardware config — see Hardware section |
| engineName | string | Inference engine, e.g. "llama.cpp", "vllm", "sglang" |
| quantization | string | Quant format, e.g. "Q4_K_M", "AWQ", "fp8" |
At least one of these performance metrics is also required:
| Metric | Type | Description |
|---|---|---|
| tokSOut | number | Output tokens per second |
| tokSTotal | number | Total tokens per second (prompt + output) |
| ttftMs | number | Time to first token in milliseconds |
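The three metrics are related by simple timing arithmetic. A minimal sketch of how a client might derive them from wall-clock timestamps (the timing variables are hypothetical; note that some tools measure output tok/s over the decode phase only, as here, while others use the full request duration):

```python
def derive_metrics(prompt_tokens: int, output_tokens: int,
                   t_start: float, t_first_token: float, t_end: float) -> dict:
    """Derive submission metrics from timestamps in seconds.

    t_start: request sent; t_first_token: first token received;
    t_end: final token received.
    """
    ttft_ms = (t_first_token - t_start) * 1000.0
    # Decode-phase convention: output tokens over generation time only.
    tok_s_out = output_tokens / (t_end - t_first_token)
    # Total throughput counts prompt processing time as well.
    tok_s_total = (prompt_tokens + output_tokens) / (t_end - t_start)
    return {"ttftMs": ttft_ms, "tokSOut": tok_s_out, "tokSTotal": tok_s_total}

print(derive_metrics(512, 1024, 0.0, 0.2, 10.2))
```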
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| modelRevision | string | "main" | Git revision / branch / commit SHA |
| engineVersion | string | — | Engine version, e.g. "0.7.3" |
| backend | string | — | Backend variant, e.g. "cuda", "metal", "vulkan" |
| promptTokens | integer | 0 | Number of prompt tokens used |
| outputTokens | integer | 0 | Number of output tokens generated |
| contextLength | integer | 2048 | Context window size used |
| batchSize | integer | 1 | Batch size (concurrent requests) |
| peakVramGb | number | — | Peak VRAM usage in GB |
| notes | string | — | Free-text notes, max 2000 chars |
| engineFlags | object | — | Detailed engine flags — see Engine Flags |
Responses
Success:

```json
{
  "id": "clxyz...",
  "modelId": "...",
  "hardwareId": "...",
  "engineId": "...",
  "userId": "...",
  "tokSOut": 87.4,
  "status": "APPROVED",
  "createdAt": "2026-04-14T03:45:00.000Z",
  "model": { "hfId": "Qwen/Qwen3-8B", "displayName": "Qwen3-8B", ... },
  ...
}
```

Validation error:

```json
{
  "error": "Validation failed",
  "details": {
    "fieldErrors": { "hardware.vramGb": ["Required"] },
    "formErrors": []
  }
}
```

Missing or invalid authentication:

```json
{ "error": "Authentication required. Use a session cookie or Authorization: Bearer <api_key>" }
```

Unknown model:

```json
{ "error": "Model \"some/bad-id\" not found on HuggingFace" }
```

Missing metrics:

```json
{ "error": "At least one performance metric (TTFT, tok/s output, or tok/s total) is required" }
```

Rate limited:

```json
{
  "error": "Rate limit exceeded. You may submit once every 5 minutes.",
  "retryAfterMs": 240000,
  "lastSubmittedAt": "2026-04-14T03:40:00.000Z"
}
```

Hardware Object
The hardware field is a discriminated union on hwClass. Use the right shape for the hardware being tested.
DISCRETE_GPU: NVIDIA / AMD / Intel discrete graphics cards
```json
{
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "cpu": "Ryzen 9 5900X",
  "ramGb": 64,
  "os": "Ubuntu 22.04",
  "powerWatts": 350
}
```

| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "DISCRETE_GPU" | Literal |
| gpuName | ✅ | string | e.g. "RTX 3090", "A100 80GB" |
| gpuCount | — | integer | Default 1 |
| vramGb | ✅ | number | Per-card VRAM in GB |
| cpu | — | string | CPU model |
| ramGb | — | number | System RAM in GB |
| os | — | string | Operating system |
| powerWatts | — | number | TDP / measured power draw |
UNIFIED: Apple Silicon / AMD APU / Intel Arc
```json
{
  "hwClass": "UNIFIED",
  "chipVendor": "Apple",
  "chipFamily": "M4",
  "chipVariant": "M4 Pro",
  "unifiedMemoryGb": 48,
  "npuTops": 38,
  "os": "macOS 15.4"
}
```

| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "UNIFIED" | Literal |
| chipVendor | ✅ | string | e.g. "Apple", "AMD" |
| chipFamily | ✅ | string | e.g. "M4", "Strix Point" |
| chipVariant | ✅ | string | e.g. "M4 Pro", "M4 Max" |
| unifiedMemoryGb | ✅ | number | Total unified memory |
| npuTops | — | number | NPU TOPS if applicable |
| cpu | — | string | CPU core descriptor |
| os | — | string | Operating system |
| powerWatts | — | number | Power draw |
CPU_ONLY: CPU-only inference
```json
{
  "hwClass": "CPU_ONLY",
  "cpu": "Intel Xeon W9-3595X",
  "ramGb": 512,
  "os": "Ubuntu 24.04"
}
```

| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "CPU_ONLY" | Literal |
| cpu | ✅ | string | CPU model |
| ramGb | ✅ | number | System RAM in GB |
| os | — | string | Operating system |
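Since the required fields differ per hwClass, a client can pre-check a hardware object before submitting. A minimal sketch (illustrative only; the server performs its own validation):

```python
# Required fields per hardware class, per the tables above.
REQUIRED = {
    "DISCRETE_GPU": ["gpuName", "vramGb"],
    "UNIFIED": ["chipVendor", "chipFamily", "chipVariant", "unifiedMemoryGb"],
    "CPU_ONLY": ["cpu", "ramGb"],
}

def check_hardware(hw: dict) -> list[str]:
    """Return the list of missing required fields for a hardware object."""
    cls = hw.get("hwClass")
    if cls not in REQUIRED:
        return ["hwClass"]
    return [f for f in REQUIRED[cls] if f not in hw]

print(check_hardware({"hwClass": "UNIFIED", "chipVendor": "Apple"}))
# → ['chipFamily', 'chipVariant', 'unifiedMemoryGb']
```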
Engine Flags Object
Optional. Provide engineFlags to record the exact launch configuration. All fields are optional.
```json
{
  "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa",
  "tensorParallel": 1,
  "gpuLayers": 99,
  "kvCacheDtype": "q8_0",
  "flashAttn": true,
  "contextLength": 8192,
  "attentionBackend": "flash_attn",
  "concurrency": 4,
  "temperature": 0.6,
  "topP": 0.95
}
```

| Flag | Type | Description |
|---|---|---|
| commandSnippet | string | Full launch command (recommended — parsed automatically) |
| tensorParallel | integer | Tensor parallel degree (TP) |
| pipelineParallel | integer | Pipeline parallel degree |
| gpuLayers | integer | Number of layers offloaded to GPU (llama.cpp --n-gpu-layers) |
| splitMode | string | GPU split mode |
| kvCacheDtype | string | KV cache quantization, e.g. "q8_0", "fp8" |
| gpuMemUtil | float 0–1 | GPU memory utilization fraction (vLLM) |
| kvCacheSizeMb | integer | KV cache size in MB |
| prefixCaching | boolean | Whether prefix/prompt caching was enabled |
| attentionBackend | string | e.g. "flash_attn", "xformers", "sdpa" |
| flashAttn | boolean | Flash Attention enabled |
| chunkedPrefill | boolean | Chunked prefill enabled |
| prefillChunkSize | integer | Prefill chunk size |
| contBatching | boolean | Continuous batching enabled |
| cpuOffloadGb | float | GB of weights offloaded to CPU RAM |
| cpuLayers | integer | Number of layers on CPU |
| ropeScaling | string | RoPE scaling method, e.g. "yarn", "linear" |
| ropeScale | float | RoPE scale factor |
| yarnExtFactor | float | YaRN extension factor |
| engineQuant | string | Engine-level quantization override |
| sglangQuant | string | SGLang quantization method |
| maxRunningSeqs | integer | Max running sequences |
| schedulerDelayFactor | float | Scheduler delay factor |
| numParallel | integer | Number of parallel sequences (Ollama) |
| concurrency | integer | Concurrent requests used for throughput runs (vLLM / SGLang) |
| specDecoding | boolean | Speculative decoding enabled |
| specMethod | string | Speculative decoding method, e.g. "Dflash", "EAGLE", "Medusa", "ngram" |
| specModel | string | Draft / decoder model HF ID for speculative decoding |
| specNumTokens | integer | Speculative tokens per step |
| specNgramSize | integer | N-gram size for ngram spec |
| specDraftTp | integer | Draft model tensor parallel |
| mtpEnabled | boolean | Multi-Token Prediction enabled (DeepSeek-style) |
| mtpDraftLayers | integer | Number of MTP draft layers |
| temperature | float 0–2 | Sampling temperature |
| topP | float 0–1 | Top-p nucleus sampling |
| topK | integer | Top-k sampling |
| minP | float 0–1 | Min-p sampling |
| repeatPenalty | float | Repeat penalty |
| mirostat | integer 0–2 | Mirostat mode |
| extraFlags | string | Any additional flags not covered above |
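To illustrate the kind of flag extraction the server does on commandSnippet, here is a toy parser for a few llama.cpp-style flags (the mapping is an assumption for illustration, not the platform's actual parser):

```python
import shlex

# Toy mapping from llama-server CLI flags to engineFlags fields.
FLAG_MAP = {"-c": ("contextLength", int), "--n-gpu-layers": ("gpuLayers", int)}
BOOL_FLAGS = {"-fa": "flashAttn"}

def parse_snippet(cmd: str) -> dict:
    """Extract a few known flags from a launch command string."""
    tokens = shlex.split(cmd)
    flags = {}
    for i, tok in enumerate(tokens):
        if tok in BOOL_FLAGS:
            flags[BOOL_FLAGS[tok]] = True
        elif tok in FLAG_MAP and i + 1 < len(tokens):
            field, cast = FLAG_MAP[tok]
            flags[field] = cast(tokens[i + 1])
    return flags

print(parse_snippet("llama-server -m model.gguf -c 8192 --n-gpu-layers 99 -fa"))
# → {'contextLength': 8192, 'gpuLayers': 99, 'flashAttn': True}
```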
If you provide commandSnippet, localmaxxing will attempt to parse flags from it automatically. Explicit fields always override parsed values.

GET /api/benchmarks
Fetch approved benchmark results. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter by model HF ID (includes finetunes of that base model) |
| hwClass | enum | Filter by hardware class: "DISCRETE_GPU", "UNIFIED", or "CPU_ONLY" |
| gpuName | string | Filter by GPU name (exact) |
| chipVendor | string | Filter by chip vendor |
| chipFamily | string | Filter by chip family |
| chipVariant | string | Filter by chip variant |
| kvCacheDtype | string | Filter by KV cache dtype |
| attentionBackend | string | Filter by attention backend |
| gpuLayersMin | integer | Minimum GPU layers (≥) |
| tensorParallelMin | integer | Minimum tensor parallel (≥) |
| specOnly | "true" | Only speculative decoding runs |
| mtpOnly | "true" | Only MTP-enabled runs |
| verified | enum | Filter by user verification status: "true" or "false" |
| limit | integer | Results per page (1–100, default 20) |
| offset | integer | Pagination offset |
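Note that hfId contains a slash, so it should be URL-encoded when building query strings. A small sketch of constructing a filtered query URL:

```python
from urllib.parse import urlencode

def benchmarks_url(**params) -> str:
    """Build a GET /api/benchmarks URL from query parameters."""
    base = "https://localmaxxing.com/api/benchmarks"
    return f"{base}?{urlencode(params)}" if params else base

url = benchmarks_url(hfId="Qwen/Qwen3-8B", hwClass="DISCRETE_GPU", limit=10)
print(url)
# → https://localmaxxing.com/api/benchmarks?hfId=Qwen%2FQwen3-8B&hwClass=DISCRETE_GPU&limit=10
```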
Response
```json
{
  "benchmarks": [ { ...benchmarkRun, "model": {...}, "hardware": {...}, "engine": {...}, "engineFlags": {...}, "user": {...} } ],
  "total": 142,
  "limit": 20,
  "offset": 0
}
```

Example

```bash
curl "https://localmaxxing.com/api/benchmarks?hfId=Qwen/Qwen3-8B&hwClass=DISCRETE_GPU&limit=10"
```
GET /api/leaderboard
Fetch ranked leaderboard data. Public endpoint — no auth required. Results are sorted by tokSOut descending.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter to a single model (URL-encoded: "org/model") |
| hwClass | enum | Filter by hardware class: "DISCRETE_GPU", "UNIFIED", or "CPU_ONLY" |
| memTier | string | VRAM tier: "8", "12", "16", "24", "32", "48", "80", "96", or "128" |
| engineName | string | Exact engine name |
| quantization | string | Exact quant string |
| verified | enum | Filter by user verification status: "true" or "false" (omit for all) |
| since | enum | Time window: "7d" or "30d" (omit for all-time) |
| limit | integer | Max rows (default 50, max 200) |
| offset | integer | Pagination offset |
Response
```json
{
  "rows": [ { rank, id, model, hardware, engine, engineFlags, user, tokSOut, ... } ],
  "total": 89,
  "limit": 50,
  "offset": 0
}
```

GET /api/models
Browse models in the database. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| search | string | Search by HF ID, display name, or family (case-insensitive) |
| tree | "true" | Return base models with nested finetunes instead of flat list |
| limit | integer | Results per page (default 20) |
| offset | integer | Pagination offset |
API Keys
Manage API keys for programmatic access. All endpoints require session authentication (not API key auth). Maximum of 10 keys per account.
GET /api/keys
List your API keys (key secrets are never returned).

```json
[
  { "id": "...", "name": "My Agent", "prefix": "bhk_1a2b", "createdAt": "...", "lastUsedAt": "...", "expiresAt": null },
  ...
]
```

POST /api/keys
Create a new API key. The raw key is returned only once — store it immediately.
Request body:
```json
{
  "name": "My Agent Key",
  "expiresAt": "2027-01-01T00:00:00Z"  // optional ISO-8601
}
```

Response (201):

```json
{
  "id": "...",
  "name": "My Agent Key",
  "prefix": "bhk_1a2b",
  "createdAt": "...",
  "expiresAt": "2027-01-01T00:00:00Z",
  "key": "bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12"  // SHOWN ONLY ONCE
}
```

DELETE /api/keys/[id]
Revoke (delete) an API key. Returns { "ok": true } on success.
Saved Setups
Manage saved hardware/engine configurations. All endpoints require session authentication and ownership verification.
GET /api/setups
List your saved setups, ordered by default first, then most recent.

POST /api/setups
Create a new saved setup. If isDefault is true, existing defaults are cleared.

```json
{
  "name": "My RTX 3090 Setup",
  "description": "Standard llama.cpp config",
  "isDefault": true,
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "gpuLayers": 99,
  "flashAttn": true,
  "contextLength": 8192
}
```

GET /api/setups/[id]
Fetch a single saved setup by ID. Used by the submit page to prefill form data.

PATCH /api/setups/[id]
Update a saved setup. All fields are optional — only provided fields are changed. Pass null to clear nullable fields.

DELETE /api/setups/[id]
Delete a saved setup. Returns 204 No Content on success.
Rate Limits
Rate-limited responses include retryAfterMs in the body and a Retry-After header.

| Endpoint | Limit | Scope |
|---|---|---|
| POST /api/benchmarks | 1 request / 5 min | Per user |
| GET /api/benchmarks | Generous | Per IP |
| GET /api/leaderboard | Generous | Per IP |
| GET /api/models | Generous | Per IP |
| POST /api/keys | Max 10 keys | Per account |
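An agent that submits on a schedule can honor the server-provided backoff from a rate-limited response. A minimal sketch (the sleep-then-retry policy is an assumption, not prescribed by the API):

```python
import time

def wait_for_retry(body: dict, sleep=time.sleep) -> float:
    """Sleep for the backoff given in a rate-limited response body.

    Returns the delay in seconds, so callers can log it.
    """
    delay = body.get("retryAfterMs", 0) / 1000.0
    if delay > 0:
        sleep(delay)
    return delay

# With the example 429 body above, an agent would wait 240 s before retrying.
```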
Common Engine Names
Use consistent casing — these are the accepted values:
| Engine | engineName value |
|---|---|
| llama.cpp / llama-server | llama.cpp |
| vLLM | vllm |
| SGLang | sglang |
| Ollama | ollama |
| LM Studio | lmstudio |
| ExLlamaV2 | exllamav2 |
| TGI (Text Generation Inference) | tgi |
| TensorRT-LLM | tensorrt-llm |
| MLX (Apple) | mlx |
| Candle | candle |
| ctransformers | ctransformers |
Benchmark Methodology
For reproducible, trustworthy results:
- Warm up the model with 1–2 throwaway runs before recording
- Use a fixed prompt for comparability — a 512-token system prompt + 32-token user message is a good baseline
- Record steady-state throughput — not the first token burst
- Measure peakVramGb via nvidia-smi or equivalent at peak load
- Report batchSize accurately — batch=1 and batch=8 are not comparable
- Set temperature: 0 (greedy decode) for deterministic throughput benchmarks unless testing sampling overhead
- Include commandSnippet — the exact launch command is the most useful thing for reproducibility
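The warm-up and steady-state advice above can be sketched as a small harness around any generation call (the generate callable and its return shape are hypothetical):

```python
import time

def bench(generate, warmup: int = 2, runs: int = 3) -> float:
    """Run generate() (which returns an output token count) and report mean tok/s.

    Warm-up runs are discarded so cache population and lazy allocation
    don't skew the recorded numbers.
    """
    for _ in range(warmup):
        generate()
    rates = []
    for _ in range(runs):
        t0 = time.perf_counter()
        n_tokens = generate()
        rates.append(n_tokens / (time.perf_counter() - t0))
    return sum(rates) / len(rates)
```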
Field Constraints
| Field | Max length / Range |
|---|---|
| hfId | 256 chars |
| modelRevision | 128 chars |
| engineName | 64 chars |
| engineVersion | 64 chars |
| quantization | 64 chars |
| backend | 64 chars |
| notes | 2000 chars |
| commandSnippet | 4000 chars |
| extraFlags | 1000 chars |
| contextLength | integer ≥ 1 |
| batchSize | integer ≥ 1 |
| gpuMemUtil | 0.0 – 1.0 |
| temperature | 0.0 – 2.0 |
| topP / minP | 0.0 – 1.0 |
| mirostat | 0, 1, or 2 |
| gpuCount | integer ≥ 1 |
| vramGb | positive number |
| promptTokens | integer ≥ 0 |
| outputTokens | integer ≥ 0 |
Examples
Minimal Submission
Smallest valid request body:
```json
{
  "hfId": "Qwen/Qwen3-8B",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "vramGb": 24
  },
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "tokSOut": 87.4
}
```

Full Submission
Complete request with all optional fields:
```json
{
  "hfId": "Qwen/Qwen3-8B",
  "modelRevision": "main",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "gpuCount": 1,
    "vramGb": 24,
    "cpu": "Ryzen 9 5900X",
    "ramGb": 64,
    "os": "Ubuntu 22.04"
  },
  "engineName": "llama.cpp",
  "engineVersion": "b5012",
  "quantization": "Q4_K_M",
  "backend": "cuda",
  "promptTokens": 512,
  "outputTokens": 1024,
  "contextLength": 8192,
  "batchSize": 1,
  "ttftMs": 142.5,
  "tokSOut": 87.4,
  "tokSTotal": 74.1,
  "peakVramGb": 6.2,
  "notes": "Automated benchmark via agent. Thermal throttling observed after 10 min.",
  "engineFlags": {
    "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa --temp 0.6 --top-p 0.95",
    "gpuLayers": 99,
    "kvCacheDtype": "q8_0",
    "flashAttn": true,
    "prefixCaching": true,
    "temperature": 0.6,
    "topP": 0.95
  }
}
```

cURL — Submit Benchmark
```bash
curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "hfId": "Qwen/Qwen3-8B",
    "hardware": { "hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24 },
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
    "tokSTotal": 74.1,
    "ttftMs": 142.5,
    "contextLength": 8192,
    "engineFlags": {
      "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa"
    }
  }'
```

cURL — Query Leaderboard
```bash
curl "https://localmaxxing.com/api/leaderboard?hwClass=DISCRETE_GPU&memTier=24&verified=true&limit=10"
```
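The minimal submission can also be built with the Python standard library. A sketch mirroring the curl example (BHK_KEY is a placeholder; the final urlopen call is commented out so the snippet doesn't hit the network):

```python
import json
import urllib.request

API = "https://localmaxxing.com/api/benchmarks"
BHK_KEY = "bhk_YOUR_API_KEY_HERE"  # placeholder — use your real key

payload = {
    "hfId": "Qwen/Qwen3-8B",
    "hardware": {"hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24},
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
}

req = urllib.request.Request(
    API,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {BHK_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would perform the submission.
```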