Community benchmark suites for evaluating local LLM quality. Submit results via the API.
GSM8K: deterministic, disjoint eval shards from openai/gsm8k/main:test. Running all 13 shards covers the full 1,319-question dataset; Postgres stores only shard metadata and pass/fail results.
HellaSwag: deterministic, disjoint eval shards from Rowan/hellaswag:validation. Running all 13 shards covers the full 10,042-question dataset; Postgres stores only shard metadata and pass/fail results.
ARC Challenge: randomized eval shards from ai2_arc/ARC-Challenge:test. Question text lives in S3; Postgres stores only shard metadata and results.
HumanEval+: randomized eval shards from evalplus/humanevalplus:test. Question text lives in S3; Postgres stores only shard metadata and results.
MBPP+: randomized eval shards from evalplus/mbppplus:test. Question text lives in S3; Postgres stores only shard metadata and results.
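
To illustrate what "deterministic disjoint" means for the GSM8K and HellaSwag suites, here is a minimal sketch of one way to split a dataset into 13 shards that never overlap and together cover every question. The listing does not say how questions are actually assigned to shards, so the contiguous index ranges used here are an assumption, and the function name shard_indices is hypothetical.

```python
def shard_indices(num_questions: int, num_shards: int) -> list[range]:
    """Split indices 0..num_questions-1 into contiguous, disjoint ranges.

    Assumption: shards are contiguous index ranges of near-equal size.
    The real suites may assign questions to shards differently.
    """
    base, extra = divmod(num_questions, num_shards)
    shards = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional question each.
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


# Dataset sizes and shard count as stated in the suite descriptions.
gsm8k_shards = shard_indices(1319, 13)
hellaswag_shards = shard_indices(10042, 13)

for name, shards in [("GSM8K", gsm8k_shards), ("HellaSwag", hellaswag_shards)]:
    covered = sum(len(s) for s in shards)
    print(name, [len(s) for s in shards], "total:", covered)
```

Because the split depends only on the dataset size and the shard count, every run produces the same shards. Under this assumed scheme, GSM8K gets six shards of 102 questions and seven of 101, and HellaSwag gets six shards of 773 and seven of 772.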