LLM Evaluation Suites | Localmaxxing

pooled pass/fail · Wilson CI · Shard eval · 1,319 questions

GSM8K deterministic disjoint eval shards from openai/gsm8k/main:test. Running all 13 shards covers the full 1,319-question dataset; Postgres stores only shard metadata and pass/fail results.

Reasoning · 80 records
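The "pooled pass/fail · Wilson CI" label suggests that pass/fail counts from the shards are summed before a single Wilson score interval is computed. The following is a minimal sketch of that calculation, not the site's actual implementation. The function names, the `(passes, total)` tuple layout, and the 95% level (`z = 1.96`) are assumptions made for illustration.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def pooled_wilson(shard_results: list[tuple[int, int]], z: float = 1.96) -> tuple[float, float]:
    """Sum (passes, total) over shards, then compute one interval on the pooled counts."""
    passes = sum(p for p, _ in shard_results)
    total = sum(n for _, n in shard_results)
    return wilson_interval(passes, total, z)
```

Pooling before computing the interval treats every question as one trial, so partial runs (a few shards out of 13) simply yield a wider interval than a full run.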

HellaSwag (Official)

pooled pass/fail · Wilson CI · Shard eval · 10,042 questions

HellaSwag deterministic disjoint eval shards from Rowan/hellaswag:validation. Running all 13 shards covers the full 10,042-question dataset; Postgres stores only shard metadata and pass/fail results.

Reasoning · 34 records
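"Deterministic disjoint" shards mean that each question lands in exactly one shard and that the assignment is the same on every run. The page does not say how the split is made, so the sketch below is one possible approach under stated assumptions: questions are ordered by a salted SHA-256 hash of a hypothetical question ID and then dealt round-robin into shards. The `make_shards` name and the `salt` parameter are illustrative only.

```python
import hashlib


def make_shards(question_ids: list[str], num_shards: int, salt: str = "example-salt") -> list[list[str]]:
    """Split IDs into disjoint shards whose membership is stable across runs."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # A salted hash gives a fixed pseudo-random order with no RNG state to manage.
    ordered = sorted(
        question_ids,
        key=lambda qid: hashlib.sha256(f"{salt}:{qid}".encode()).hexdigest(),
    )
    # Round-robin dealing keeps shard sizes within one question of each other.
    return [ordered[i::num_shards] for i in range(num_shards)]


# Hypothetical IDs standing in for the 10,042 HellaSwag validation questions.
ids = [f"q{i}" for i in range(10042)]
shards = make_shards(ids, 13)
```

With 10,042 questions and 13 shards, this scheme produces shards of 772 or 773 questions, and running all 13 covers every question exactly once.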

ARC Challenge (Official)

pooled pass/fail · Wilson CI · Shard eval · 1,172 questions

ARC Challenge randomized eval shards from ai2_arc/ARC-Challenge:test. Question text lives in S3; Postgres stores only shard metadata and results.

Benchmark · 12 records

HumanEval+ (Official)

pooled pass/fail · Wilson CI · Shard eval · 164 questions

HumanEval+ randomized eval shards from evalplus/humanevalplus:test. Question text lives in S3; Postgres stores only shard metadata and results.

Coding · 42 records

MBPP+ (Official)

pooled pass/fail · Wilson CI · Shard eval · 378 questions

MBPP+ randomized eval shards from evalplus/mbppplus:test. Question text lives in S3; Postgres stores only shard metadata and results.

Coding · 3 records

Evaluation Suites