A community benchmark suite for evaluating the quality of local LLMs. Results are submitted via an API.
MMLU (Massive Multitask Language Understanding), run through the EleutherAI lm-evaluation-harness task mmlu in a 5-shot setting, with exact-match/accuracy-style scoring.
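As a rough illustration of the setup described above, the sketch below builds an lm-evaluation-harness command line for the mmlu task with 5 few-shot examples, and shows what exact-match accuracy over answer letters amounts to. The helper names (build_lm_eval_command, exact_match_accuracy) and the model path are hypothetical and not part of this suite. The choice of the hf model backend is also an assumption. The scoring function is a simplified stand-in and does not reproduce the harness's own scoring code.

```python
from typing import List, Sequence


def build_lm_eval_command(model_path: str) -> List[str]:
    """Return an argument list for a 5-shot MMLU run with lm-evaluation-harness.

    Hypothetical helper: the "hf" backend and the model path are assumptions;
    the task name and shot count come from the description above.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", "mmlu",
        "--num_fewshot", "5",
    ]


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predicted answer letters that match the reference letters.

    Simplified illustration only: letters are compared after trimming
    whitespace and upper-casing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(
        pred.strip().upper() == ref.strip().upper()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)
```

The argument list can be passed to subprocess.run on a machine where lm-evaluation-harness is installed. For example, exact_match_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]) counts three matches out of four.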