
MMLU

Official

Massive Multitask Language Understanding — 57-subject academic exam covering STEM, humanities, social sciences, and more. The gold-standard broad-knowledge benchmark.

Source
Category: reasoning · Runner: lm-eval-harness · Version: v1.0 · Submitted by: Community

Eval Details

Scoring: Exact Match
Aggregation: Mean
Direction: Higher is better
Tasks: 1 task
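The scoring scheme above (exact match per sample, mean aggregation, higher is better) can be sketched as follows. This is an illustrative simplification, not lm-eval-harness's actual implementation; the function names and normalization (strip + lowercase) are assumptions.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 only when the normalized prediction equals the reference."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def aggregate_mean(scores: list[float]) -> float:
    """Mean aggregation: the task score is the average of per-sample scores."""
    return sum(scores) / len(scores) if scores else 0.0

# Example: 3 of 4 answers match exactly -> 0.75
preds = ["B", "C", "a", "D"]
refs  = ["B", "C", "A", "B"]
score = aggregate_mean([exact_match(p, r) for p, r in zip(preds, refs)])
print(f"{score:.1%}")  # 75.0%
```

With a single task of weight 1, the mean over that task's samples is also the leaderboard score.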

Default Run Config

Seed: 42 · Few-shot: 5

Task: MMLU (average), task id: mmlu
Dataset: hails/mmlu_no_train, test split
Weight: 1
Shots: 5-shot
Max Tokens: —
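Since the runner is lm-eval-harness, the default run config above corresponds roughly to a CLI invocation like the one below. The model identifier is a placeholder, and exact flag names can vary between harness versions; this is a sketch, not the submission's actual command.

```shell
# Sketch of the default run config with the lm-eval-harness CLI
# (replace <hf-model-id> with the model under evaluation)
lm_eval \
  --model hf \
  --model_args pretrained=<hf-model-id> \
  --tasks mmlu \
  --num_fewshot 5 \
  --seed 42
```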

Leaderboard (best run per model)

# | Model | Score | Quant | Hardware
1 | Qwen3.6-27B (Qwen) | 84.5% | IQ4_NL | NVIDIA GeForce RTX 3090

Task Breakdown (top model)

mmlu: 84.5% · 0 samples