ModelePierwsze krokiRankingSprzętMarketplaceEwaluacjeTrenowanieWynajemProDokumentacja API
Język
Lucebox Hub campaign banner

You have the GPU. Now what?

A practical path from bare hardware to a working local AI setup: the software stack, model files, inference engines, agent harnesses, remote access, and how to reason about compute. Examples come from 2,003 approved benchmark runs across 74 hardware setups on this site.

Get running (an evening)

  1. Verify drivers: nvidia-smi / rocm-smi (Macs: nothing to do).
  2. Install LM Studio (easy) or llama.cpp (control); vLLM later if agents need concurrency.
  3. Download a 4–8B model at Q4_K_M — fits in ~6 GB, proves the plumbing.
  4. Enable the OpenAI-compatible server and curl /v1/chat/completions.
  5. Point a chat UI or coding agent (Oh-my-pi, Opencode, Claude Code…) at the base URL + model name + any string as API key.
  6. Install Tailscale on the box and your laptop — now it works from anywhere.

Learn these five ideas (they explain everything)

  • Decode speed ≈ memory bandwidth ÷ GB read per token. One rule predicts nearly every tok/s number on this site.
  • Quantization sets the size: ~0.6 GB per billion params at 4-bit, plus room for context (KV cache).
  • Prefill is compute, decode is bandwidth — a box can be great at one and mediocre at the other; agents stress both.
  • Capacity vs speed is the real hardware tradeoff: discrete GPUs are fast, unified memory is big, CPU offload is cheap.
  • Batching amortizes weight reads — concurrency (subagents, multiple users) is nearly free throughput if you have the memory for it.

Local inference is four layers, bottom to top. Get each one working before moving up, and debugging stays easy:

  1. Drivers and runtime. NVIDIA driver + CUDA, AMD driver + ROCm, or nothing extra on Apple Silicon (Metal ships with macOS). Verify with nvidia-smi / rocm-smi before installing anything else.
  2. Model files. The weights themselves, downloaded from Hugging Face, usually quantized to fit your memory.
  3. Inference engine. The program that loads weights and generates tokens — llama.cpp, LM Studio, vLLM, SGLang, MLX. Almost all of them expose an OpenAI-compatible HTTP API.
  4. Client / agent harness. Whatever consumes that API: a chat UI, an editor plugin, or a coding agent like Oh-my-pi, Opencode, or Claude Code pointed at your own endpoint.

The OpenAI-compatible API is the seam that makes the whole ecosystem composable: every engine below speaks it, every harness above consumes it, so you can swap either side independently.

Model files and quantization

Weights come in two main formats. Safetensors is the Hugging Face standard — full-precision (FP16/BF16) or GPU-oriented quants (AWQ, GPTQ, FP8), consumed by vLLM and SGLang. GGUFpacks the model into a single quantized file for the llama.cpp family (llama.cpp, LM Studio) and supports CPU offload. Apple's MLX has its own converted format.

Quantization shrinks each weight from 16 bits to ~4–8 bits with small quality loss. Rule of thumb at 4-bit: ~0.6 GB of memory per billion parameters, plus 1–4+ GB for context (KV cache). A 24 GB card comfortably runs a 32B model at 4-bit; a 8B model fits in 6–8 GB. On this site, Q4_K_M is the most-submitted quantization (Q4_K_M, Q8_0, Q4_0 and Q5_K_M GGUF are the common choices), and <1B is the most common model-size bucket.

Pick an inference engine

LM Studio is the easiest on-ramp — one install, model downloads built in. llama.cpp is what it wraps: maximum control, runs everywhere, handles CPU offload. vLLM and SGLang are server-grade engines built for concurrent requests and maximum throughput on discrete GPUs. MLX is the native choice on Apple Silicon.

Start with an easy engine, then graduate when you hit its limits. In our data, sglang has the best size-normalized speed (4,718 GB/s effective bandwidth — median tok/s × model GB, so small-model runs don't skew the ranking) while llama.cpp is the most-used — llama.cpp, vllm, mlx and hipfire cover most submissions.

Serve an OpenAI-compatible endpoint

Every engine can expose http://localhost:PORT/v1: llama-server -m model.gguf, vllm serve org/model, or LM Studio's server toggle. Test it with a curl to /v1/chat/completions. That URL — plus any string as the API key — is all any client needs.

A first target: Qwen3.6-35B-A3B, the most-benchmarked model here, at 4-bit. Small enough to fit almost anywhere, good enough to tell you whether the plumbing works.

Point an agent harness at it

Coding agents are the most demanding — and most rewarding — local workload. All of these can run against a local endpoint:

  • Oh-my-pi and Pi — agent harnesses built for local-first workflows, heavy tool use and subagents.
  • Hermes agent — Nous Research's harness, pairs naturally with the Hermes model family.
  • Opencode — open-source terminal coding agent with first-class custom-provider support.
  • Codex CLI — OpenAI's agent; supports OSS/local providers via config.
  • Claude Code — Anthropic's agent; can target alternative endpoints via an Anthropic-compatible proxy.

Configuration is the same everywhere: base URL + model name + dummy key. Agents burn far more tokens than chat — long system prompts, big file contexts, many turns — so prompt-processing speed and context length matter more than they do for chat.

Your inference box should not be internet-exposed, and you should not need to be at home to use it. Tailscale (WireGuard-based mesh VPN) is the standard answer: install it on the server and your laptop/phone, sign in, and every device gets a stable private IP (100.x.y.z) and MagicDNS name that works from any network. No port forwarding, no dynamic DNS, no TLS certificates to manage.

  • Point your harness at http://your-box:8000/v1 over the tailnet — same config as localhost, just a different host.
  • tailscale ssh gets you a shell on the box without managing keys — restart engines, swap models, watch nvidia-smi from a coffee shop.
  • tailscale serve wraps a local port in HTTPS on your tailnet if a client insists on TLS.
  • Bind the engine to the Tailscale interface (or firewall to 100.64.0.0/10) rather than 0.0.0.0 on an exposed machine.

This is also what makes a headless setup practical: the GPU box lives next to the router, and everything — benchmarking, model management, agent sessions — happens remotely.

Discrete GPUs

A PCIe card with its own VRAM (RTX 3090/4090/5090, used datacenter cards). VRAM bandwidth is extreme — ~1 TB/s on a 3090 — which is exactly what token generation needs, so this is the fastest option per dollar and the only one that scales to multiple cards. The catch is capacity: consumer cards top out at 24–32 GB, so big models mean multiple GPUs or aggressive quantization. Discrete GPUs account for 1,502 runs here — the bulk of the leaderboard, led by RTX PRO 6000 Blackwell at a median 3,609 GB/s effective bandwidth.

Unified memory

CPU and GPU share one big memory pool: Apple Silicon (up to 512 GB on M3 Ultra), AMD Strix Halo (128 GB), NVIDIA DGX Spark (GB10, 128 GB). Capacity is the superpower — a 70B or even a quantized 200B+ MoE fits without any offloading, and there's room for many concurrent KV caches, so these boxes take to batching well. The tradeoffs: memory bandwidth (~250–800 GB/s) is well below a discrete card's, so each individual stream decodes slower, and prompt processing — compute-bound — lags dedicated GPUs by more. Unified-memory setups have 483 runs on the site.

CPU + RAM offloading

llama.cpp can keep only part of the model on the GPU and run the rest from system RAM. Dual-channel DDR5 is ~80–100 GB/s — an order of magnitude below VRAM — so every offloaded layer costs speed. The modern exception: MoE models, where only a few experts activate per token. Keeping attention + shared layers on a modest GPU and experts in RAM makes very large sparse models (Qwen3-235B, DeepSeek-class) surprisingly usable on desktop hardware. Cheapest capacity, slowest tokens18 CPU-only runs here prove it works at all.

Which one is for you?

Snappy agents (fast per-stream decode and prefill): discrete GPU(s). Biggest models, or many concurrent long-context streams that need KV room: unified memory. Tightest budget or huge MoE models: a mid-range GPU plus lots of RAM. Decode speed is roughly memory bandwidth ÷ bytes read per token — a number you can sanity-check against any spec sheet before buying, then verify on the leaderboard.

Ballpark street prices (USD, checked July 2026 — the memory-price surge moved the whole market up, and it keeps drifting, so treat these as orders of magnitude), joined with this site's live median effective bandwidth where we have enough runs. Load draw is what the box pulls while actually generating — it sizes your PSU and your electricity bill per hour of inference. The two composite columns are the buying signals: $ per GB/s is capex per unit of speed, and $ per GB/s-per-watt folds power in — it penalizes hardware that needs many watts for its speed, so it favors a box that is fast, cheap, and efficient. Lower is better in both.

HardwareRough priceLoad drawMedian effective BW$ per GB/s$ per GB/s/W
RTX 5060 Ti~$450 (new, 16 GB)~180 W723 GB/s~$0.62~$112
RTX 3090~$1,100 (used, 24 GB)~350 W686 GB/s~$1.60~$561
RTX 5090~$3,700 (new, 32 GB)~500 W1,247 GB/s~$2.97~$1,483
RTX 4090~$2,300 (used, 24 GB)~350 W668 GB/s~$3.44~$1,205
DGX Spark~$4,700 (new, 128 GB unified)~100 W521 GB/s~$9.02~$902
Ryzen AI Max 395~$2,000 (mini-PC, 128 GB unified)~90 W195 GB/s~$10.24~$922
M2 Ultra~$7,500 (used Mac Studio, 192 GB)~180 W
M3 Max~$3,000 (used, 128 GB)~60 W

Load figures are typical draw during decoding, not TDP — decode is memory-bound, so cards rarely hit their full power budget (prefill gets closer). Discrete-GPU figures are card-only; add ~50–100 W for the host system. Unified-memory boxes are whole-machine. Benchmark submissions accept measured per-GPU power draw (gpuPowerWatts) — report it and these ballparks can become community data too.

Patterns worth noticing: budget 16 GB cards and used flagships lead on capex per GB/s; once watts count, the 5060 Ti and Apple Silicon pull ahead of the big cards; large unified boxes pay a premium for capacity and efficiency, not speed. Check the marketplace for live second-hand listings from other builders, and attach purchase records to your benchmark submissions — once enough runs carry real prices, this table can switch from ballparks to community-sourced data.

A discrete GPU talks to the system over PCI Express. Bandwidth = generation × lane count; each device negotiates a link like "Gen4 x16":

LinkGen3Gen4Gen5
x16~16 GB/s~32 GB/s~64 GB/s
x8~8 GB/s~16 GB/s~32 GB/s
x4~4 GB/s~8 GB/s~16 GB/s

Lanes are a budget.Consumer CPUs (AM5, LGA1700/1851) expose ~24–28 usable lanes; the second "x16" slot on most boards actually splits the link to x8/x8, and a third slot often runs x4 through the chipset. Workstation/server platforms (Threadripper, EPYC, Xeon) exist precisely because they offer 64–128 lanes for real multi-GPU builds.

How much does it matter? For single-GPU inference: barely. Weights cross the bus once at load time; after that, generation happens in VRAM. A Gen4 x4 link (even an eGPU or mining riser) serves one card fine — model loads are just slower. It starts to matter when GPUs must talk to each other every token, which brings us to parallelism.

Pipeline parallelism (layer split)

GPU 0 holds layers 1–40, GPU 1 holds layers 41–80. Each token's activations hop across the bus once per boundary — a few MB — so PCIe bandwidth barely matters and x4 risers work. The cost: GPUs take turns on a single request, so one stream runs at roughly single-GPU speed; you gain capacity, not latency. This is llama.cpp's default multi-GPU mode and the right choice for mismatched cards or weak links.

Tensor parallelism (matrix split)

Every layer's matrices are sliced across all GPUs, which then all work on every token and synchronize (all-reduce) after each layer — dozens of exchanges per token. This does cut per-token latency, but interconnect becomes the bottleneck: you want Gen4 x8+ per card, ideally x16 or NVLink, and identical GPUs in power-of-two counts. vLLM (--tensor-parallel-size), SGLang, and ExLlama support it; llama.cpp's support is limited. Cheap-riser mining-style builds should stick to pipeline parallelism.

Multi-GPU vs unified memory at the same capacity

A stack of discrete cards beats a unified-memory box of equal capacity on speed in either split mode, because every card brings its own memory bus. Tensor parallelism aggregates bandwidth — 4× RTX 3090 is ~96 GB at ~3.7 TB/s combined, versus 273 GB/s for a 128 GB GB10-class box — and even a pipeline split still reads each layer at full single-card speed, which alone outruns any unified box today. What unified memory buys at that capacity point is power draw (~250 W vs ~1.4 kW), simplicity, silence, and often capacity per dollar — not speed.

Single-stream decode wastes most of a GPU's compute — the chip spends its time waiting on memory reads. Batching several requests reuses each weight read across streams, so aggregate throughput scales far better than it degrades per-stream. You need this sooner than you think:

  • Subagents. Modern harnesses (Oh-my-pi, Opencode, Claude Code) fan out parallel workers — one "agent session" can be 4–8 simultaneous request streams.
  • Multiple clients. You + a family member + an editor autocomplete plugin all hitting the same box.
  • Batch jobs. Running an eval suite or bulk document processing overnight.

vLLM and SGLang are built around continuous batching and PagedAttention and handle dozens of concurrent streams; llama.cpp's llama-server supports parallel slots (-np) but scales far less. If agents are the goal, this — not peak single-stream tok/s — is the number to optimize, and each concurrent stream needs its own KV cache, so concurrency raises the memory bar too. Capacity is the admission ticket: a 24 GB card runs out of KV room after a few long-context streams, while a 128 GB unified-memory box can hold dozens — batching amortizes its slow weight reads, so aggregate throughput is where that hardware shines; per-stream latency and prefill are where it doesn't.

Fastest hardware by effective bandwidth

RTX PRO 6000 Blackwell3,609 GB/s
RTX 5070 Ti2,186 GB/s
RTX 50901,247 GB/s
RTX 40701,161 GB/s
M5 Max918 GB/s
RTX 5060 Ti723 GB/s

Compute types by run count

Discrete GPU1,502
Unified memory483
CPU only18

Fastest engines by effective bandwidth

sglang4,718 GB/s
vllm1,060 GB/s
ollama311 GB/s
lmstudio299 GB/s
llama.cpp274 GB/s
mlx265 GB/s

Popular quantizations

Q4_K_M571
Q8_0152
Q4_0107
Q5_K_M GGUF107
NVFP478
Q4_K_XL73

Model-size distribution

<1B6
1–4B87
4–8B120
8–15B346
15–35B799
35–70B376

Most benchmarked models

Qwen3.6-35B-A3B107
Qwen3.6-27B94
gemma-4-26B-A4B-it-GGUF65
MiniMax-M2.7-int4-AutoRound60
Qwen3.6-27B-MTP-GGUF50
Qwen3.6-27B45

Next steps

Deciding what to buy or run? The leaderboard lets you filter every community benchmark by model, quantization, context length, engine, and hardware. Once your stack produces tokens, submit a benchmark run — your numbers make this guide better for the next person.