Eval Suites

Community benchmark suites for evaluating local LLM quality. Submit results via the API.

pooled pass/fail · Wilson CI · Shard eval · 1,319 questions

GSM8K deterministic disjoint eval shards from openai/gsm8k/main:test. Running all 13 shards covers the full 1,319-question dataset; Postgres stores only shard metadata and pass/fail results.

Reasoning · 1 run
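The "pooled pass/fail · Wilson CI" label suggests that pass/fail counts are summed across shards before a Wilson score interval is computed on the total. The sketch below shows that calculation under this assumption; the function names, the `(passes, total)` tuple layout, and the 95% level (`z = 1.96`) are illustrative choices, not taken from the site.

```python
import math

def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

def pooled_wilson(shard_results, z=1.96):
    """Pool (passes, total) pairs from several shards, then compute one interval."""
    passes = sum(p for p, _ in shard_results)
    total = sum(n for _, n in shard_results)
    return passes / total, wilson_interval(passes, total, z)

# Hypothetical results for two shards: (passes, questions in shard).
rate, (low, high) = pooled_wilson([(80, 100), (90, 102)])
print(f"pooled accuracy {rate:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Pooling before computing the interval treats every question as one trial, so the interval narrows as more shards are run.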

HellaSwag · Official

pooled pass/fail · Wilson CI · Shard eval · 10,042 questions

HellaSwag deterministic disjoint eval shards from Rowan/hellaswag:validation. Running all 13 shards covers the full 10,042-question dataset; Postgres stores only shard metadata and pass/fail results.

Reasoning · 1 run
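One way to obtain deterministic, disjoint shards that together cover a whole dataset is to sort the question IDs and deal them round-robin into a fixed number of shards. The sketch below illustrates that idea only; the site's actual sharding scheme is not described, and `assign_shards` and the ID format are hypothetical.

```python
def assign_shards(question_ids, num_shards=13):
    """Split IDs into num_shards disjoint lists, independent of input order."""
    ordered = sorted(question_ids)  # canonical order makes the split reproducible
    shards = [[] for _ in range(num_shards)]
    for index, qid in enumerate(ordered):
        shards[index % num_shards].append(qid)
    return shards

# Hypothetical IDs standing in for the 10,042 HellaSwag validation questions.
ids = [f"q{n:05d}" for n in range(10042)]
shards = assign_shards(ids)
print(len(shards), sum(len(s) for s in shards))
```

Because every ID lands in exactly one shard, running all shards evaluates each question once, and the shard sizes differ by at most one.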

ARC Challenge · Official

pooled pass/fail · Wilson CI · Shard eval · 1,172 questions

ARC Challenge randomized eval shards from ai2_arc/ARC-Challenge:test. Question text lives in S3; Postgres stores only shard metadata and results.

Benchmark · 1 run
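How the randomized shards are drawn is not stated. One plausible reading is a seeded shuffle of the question IDs followed by slicing into fixed-size chunks, which keeps the split reproducible for a given seed. The sketch below assumes that reading; `randomized_shards`, the seed, and the shard size of 100 are all hypothetical.

```python
import random

def randomized_shards(question_ids, shard_size, seed):
    """Shuffle IDs with a fixed seed, then cut them into consecutive chunks."""
    ordered = sorted(question_ids)        # start from a canonical order
    random.Random(seed).shuffle(ordered)  # private RNG, global state untouched
    return [ordered[i:i + shard_size] for i in range(0, len(ordered), shard_size)]

# Hypothetical IDs standing in for the 1,172 ARC Challenge test questions.
ids = [f"arc-{n}" for n in range(1172)]
shards = randomized_shards(ids, shard_size=100, seed=0)
print(len(shards), len(shards[-1]))
```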

HumanEval+ · Official

pooled pass/fail · Wilson CI · Shard eval · 164 questions

HumanEval+ randomized eval shards from evalplus/humanevalplus:test. Question text lives in S3; Postgres stores only shard metadata and results.

Coding · 1 run

MBPP+ · Official

pooled pass/fail · Wilson CI · Shard eval · 378 questions

MBPP+ randomized eval shards from evalplus/mbppplus:test. Question text lives in S3; Postgres stores only shard metadata and results.

Coding · 1 run