Eval Suites
Community benchmark suites for evaluating local LLM quality. Submit results via the API.
Official
v1.0 · Custom
writing · 1 run
Official
v1.0 · lm-eval-harness
reasoning · 0 runs
Official
v1.0 · lm-eval-harness
math · 0 runs
Official
v1.0 · lm-eval-harness
reasoning · 0 runs
Official
v1.0 · lm-eval-harness
reasoning · 0 runs
Official
v1.0 · lm-eval-harness
reasoning · 0 runs
Official
v1.0 · lm-eval-harness
coding · 0 runs
Official
v1.0 · lm-eval-harness
coding · 0 runs
Official
v1.0 · lm-eval-harness
math · 1 run
Official
v1.0 · lm-eval-harness
truthfulness · 0 runs
Official
v1.0 · lm-eval-harness
reasoning · 0 runs
Official
v1.0 · lm-eval-harness
reasoning · 1 run
Official
v1.0 · lm-eval-harness
reasoning · 0 runs
Official
v1.0 · lm-eval-harness
reasoning · 1 run
Official
v1.0 · Custom
reasoning · 2 runs
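
The page says results are submitted via the API but does not show the request format. As a rough sketch only, a submission payload might be assembled like this. Every field name here (`suite`, `suite_version`, `harness`, `model`, `scores`), the helper `build_submission`, and the example suite and model names are assumptions for illustration, not the documented API.

```python
import json


def build_submission(suite, model, scores,
                     harness="lm-eval-harness", suite_version="v1.0"):
    """Assemble a hypothetical result payload.

    All field names are assumptions; check the real API reference
    before sending anything.
    """
    if not scores:
        raise ValueError("scores must contain at least one metric")
    return {
        "suite": suite,                  # hypothetical suite identifier
        "suite_version": suite_version,  # matches the "v1.0" shown on each card
        "harness": harness,              # "lm-eval-harness" or "Custom"
        "model": model,                  # hypothetical local model name
        "scores": {name: float(value) for name, value in scores.items()},
    }


payload = build_submission(
    suite="example-suite",
    model="example-model-7b-q4",
    scores={"accuracy": 0.5},
)

# Serialize to JSON, ready to be used as an HTTP request body.
body = json.dumps(payload, sort_keys=True)
print(body)
```

Once the actual endpoint and authentication scheme are known, `body` could be sent with any HTTP client. The sketch deliberately stops before the network call, since the URL is not given here.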