Open LLM Leaderboard (Official)

The canonical Hugging Face Open LLM Leaderboard suite: MMLU, ARC Challenge, HellaSwag, WinoGrande, TruthfulQA MC2, and GSM8K, each with its official few-shot setting. Scores are aggregated as a weighted mean.

Category: reasoning
Runner: lm-eval-harness
Version: v1.0
Submitted by: Community
Eval Details

Direction: Higher is better
Default run config: Seed 42
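Since the runner is lm-eval-harness, a single sub-task can be reproduced with its CLI. A minimal command sketch, not the platform's exact invocation: the model ID is a placeholder, and because `--num_fewshot` applies to the whole invocation, tasks with different shot counts need separate runs.

```shell
# Sketch: run one suite task (MMLU, 5-shot) with lm-eval-harness.
# YOUR_MODEL_ID is a placeholder; seed matches the default run config above.
lm_eval \
  --model hf \
  --model_args pretrained=YOUR_MODEL_ID \
  --tasks mmlu \
  --num_fewshot 5 \
  --seed 42 \
  --batch_size auto
```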
| Task           | ID             | Dataset       | Weight | Shots   | Max tokens |
|----------------|----------------|---------------|--------|---------|------------|
| MMLU           | mmlu           | Not specified | 1      | 5-shot  | —          |
| ARC Challenge  | arc_challenge  | Not specified | 1      | 25-shot | —          |
| HellaSwag      | hellaswag      | Not specified | 1      | 10-shot | —          |
| WinoGrande     | winogrande     | Not specified | 1      | 5-shot  | —          |
| TruthfulQA MC2 | truthfulqa_mc2 | Not specified | 1      | 0-shot  | —          |
| GSM8K          | gsm8k          | Not specified | 1      | 5-shot  | —          |
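Because every task carries weight 1, the weighted-mean aggregate reduces to a plain mean of per-task scores. A minimal sketch of the aggregation, using the task IDs, weights, and shot counts listed above; the per-task scores are hypothetical, for illustration only:

```python
# Suite config as listed above: task_id -> (weight, num_fewshot).
SUITE = {
    "mmlu": (1, 5),
    "arc_challenge": (1, 25),
    "hellaswag": (1, 10),
    "winogrande": (1, 5),
    "truthfulqa_mc2": (1, 0),
    "gsm8k": (1, 5),
}

def weighted_mean(scores: dict[str, float]) -> float:
    """Weighted mean over the suite; with all weights at 1 this is a plain mean."""
    total_weight = sum(w for w, _ in SUITE.values())
    return sum(scores[task] * w for task, (w, _) in SUITE.items()) / total_weight

# Hypothetical per-task accuracies (fractions, not from any real run):
example_scores = {
    "mmlu": 0.65,
    "arc_challenge": 0.60,
    "hellaswag": 0.83,
    "winogrande": 0.76,
    "truthfulqa_mc2": 0.48,
    "gsm8k": 0.42,
}
print(round(weighted_mean(example_scores), 4))  # 0.6233
```

Non-unit weights would skew the aggregate toward the heavier tasks; here the division by total weight keeps the result on the same 0–1 scale as the inputs.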
Leaderboard (best run per model)

No approved results yet. Submit a run via the API.