评测套件

用于评估本地LLM质量的社区基准测试套件。通过API提交结果。

Tests whether models generate truthful answers to questions that humans often answer incorrectly due to misconceptions or false beliefs.