Eval Suites

Community benchmark suites for evaluating local LLM quality. Submit results via the API.

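MMLU (Massive Multitask Language Understanding), run as the EleutherAI lm-evaluation-harness task mmlu in a 5-shot setting. Scoring is accuracy: an answer counts as correct only when the model's chosen option exactly matches the reference answer.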
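As a rough illustration of what accuracy-style scoring means here, the sketch below computes the fraction of answers that match the reference. It is not the harness's own implementation: the function name accuracy, the whitespace/case normalization, and the sample answers are all assumptions made for this example.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference answer.

    Hypothetical helper for illustration; answers are compared after
    stripping whitespace and upper-casing (an assumption, not harness behavior).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Made-up answers for four multiple-choice questions (not real results).
preds = ["A", "c", "B ", "D"]
golds = ["A", "C", "D", "D"]
score = accuracy(preds, golds)
print(f"accuracy = {score:.2f}")
```

With the made-up answers above, three of the four predictions match, so the score is 0.75. The harness reports its own accuracy figure for the mmlu task, which is the number to submit.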