Eval Suites

A community benchmark suite for evaluating the quality of local LLMs. Submit results via the API.

Tech Greenpost · Official
v1.0 · Custom server-side

A five-prompt creative writing eval where models draft short tech-related 4chan-style greenposts. DeepSeek judges format compliance, reasonable length, tech relevance, coherence, and humor.

writing · 6 records
Open LLM Leaderboard · Official
v1.0 · LM-Eval run

The canonical Hugging Face Open LLM Leaderboard suite: MMLU, ARC Challenge, HellaSwag, WinoGrande, TruthfulQA MC2, and GSM8K with official few-shot settings. The aggregate score is a weighted mean of the task scores.

reasoning · 0 records
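
The weighted-mean aggregate for this suite could be computed along the following lines. This is a minimal sketch, not the site's implementation: the task keys, the example scores, and the equal weights are all assumptions made for illustration, since the page does not state the actual weights.

```python
def weighted_mean(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted mean of per-task scores.

    Both dicts are keyed by task name; every scored task needs a weight.
    """
    total_weight = sum(weights[task] for task in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[task] * weights[task] for task in scores) / total_weight


# Hypothetical per-task scores as fractions in [0, 1] (not real results).
scores = {
    "mmlu": 0.60,
    "arc_challenge": 0.50,
    "hellaswag": 0.80,
    "winogrande": 0.70,
    "truthfulqa_mc2": 0.40,
    "gsm8k": 0.30,
}

# Assumption: equal weights, which reduces to a plain average.
weights = {task: 1.0 for task in scores}

aggregate = weighted_mean(scores, weights)
print(f"aggregate score: {aggregate:.3f}")
```

Dividing by the sum of the weights means the weights do not need to be normalized beforehand.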
MATH · Official
v1.0 · LM-Eval run

Competition math problems spanning algebra, counting, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

math · 0 records
DROP · Official
v1.0 · LM-Eval run

Discrete Reasoning Over Paragraphs. Reading-comprehension benchmark requiring numerical and symbolic reasoning over passages.

reasoning · 0 records
Big-Bench Hard · Official
v1.0 · LM-Eval run

A collection of challenging BIG-Bench tasks selected because prior models performed poorly. Covers symbolic reasoning, algorithmic reasoning, and language understanding.

reasoning · 0 records
GPQA Diamond · Official
v1.0 · LM-Eval run

Graduate-level Google-proof Q&A benchmark focused on biology, physics, and chemistry. The Diamond split is the highest-quality expert-validated subset.

reasoning · 0 records
MBPP · Official
v1.0 · LM-Eval run

Mostly Basic Python Problems: roughly 1,000 crowd-sourced Python programming problems with automated test cases, of which the 500-problem test split is typically used for evaluation. Broader coverage than HumanEval.

coding · 0 records
HumanEval · Official
v1.0 · LM-Eval run

OpenAI's Python function completion benchmark. 164 hand-crafted problems with unit tests measuring pass@1 code synthesis accuracy.

coding · 0 records
GSM8K · Official
v1.0 · LM-Eval run

Grade School Math 8K: roughly 8,500 grade-school math word problems requiring multi-step arithmetic reasoning. Standard benchmark for math reasoning capability.

math · 3 records
TruthfulQA · Official
v1.0 · LM-Eval run

Tests whether models generate truthful answers to questions that humans often answer incorrectly due to misconceptions or false beliefs.

truthfulness · 0 records
WinoGrande · Official
v1.0 · LM-Eval run

Large-scale Winograd schema challenge for commonsense reasoning. Fill-in-the-blank pronoun resolution requiring world knowledge.

reasoning · 0 records
HellaSwag · Official
v1.0 · LM-Eval run

Sentence completion benchmark testing grounded commonsense inference. Models must pick the most plausible continuation of an activity description.

reasoning · 1 record
ARC Challenge · Official
v1.0 · LM-Eval run

AI2 Reasoning Challenge (Challenge set) — grade-school science questions that require reasoning beyond simple retrieval. Harder subset of ARC.

reasoning · 0 records
MMLU · Official
v1.0 · LM-Eval run

Massive Multitask Language Understanding: a 57-subject academic exam covering STEM, humanities, social sciences, and more. A widely used broad-knowledge benchmark.

reasoning · 1 record
Local Reasoning Mini · Official
v1.0 · Custom server-side

A lightweight 10-question sanity check for locally served models. Designed for the trusted /api/evals/execute path.

reasoning · 3 records
HumanEval 0-shot
v1.0 · LM-Eval run

OpenAI HumanEval via EleutherAI lm-evaluation-harness task humaneval, 0-shot, pass@k code-generation scoring.

coding · 0 records
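
For readers unfamiliar with pass@k scoring, the sketch below shows the commonly used unbiased estimator: with n generated samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name and the example numbers are illustrative assumptions; the harness's own implementation may differ in detail.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    # Probability that a random draw of k samples contains no passing one,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples, 3 of which pass.
example = pass_at_k(n=10, c=3, k=1)
print(f"pass@1 = {example:.2f}")
```

The benchmark-level score is then the mean of the per-problem estimates. With k = 1 the estimator reduces to c / n, the fraction of samples that pass.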
MMLU 5-shot
v1.0 · LM-Eval run

Massive Multitask Language Understanding via EleutherAI lm-evaluation-harness task mmlu, 5-shot, exact-match/accuracy style scoring.

knowledge · 0 records
Probe
v1.0 · LM-Eval run
knowledge · 1 record