Community benchmark suites for evaluating local LLM quality. Submit results via the API.
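A result submission could be assembled as in the sketch below. The endpoint URL, field names, and payload shape are all illustrative assumptions — the real API may differ, so check the project's API documentation before submitting.

```python
import json

# Hypothetical endpoint and payload schema -- assumptions for illustration,
# not the project's actual API.
API_URL = "https://example.com/api/v1/results"  # placeholder endpoint

def build_submission(model_name, benchmark, score, num_samples):
    """Assemble a result payload for submission (field names assumed)."""
    return {
        "model": model_name,
        "benchmark": benchmark,
        "score": score,
        "samples": num_samples,
    }

payload = build_submission("my-local-llm-7b", "truthfulness-v1", 0.62, 817)
body = json.dumps(payload).encode("utf-8")
# An actual submission would likely POST `body` to API_URL with an auth
# header, e.g. via urllib.request.Request(API_URL, data=body, method="POST").
```

Keeping payload construction separate from the network call makes it easy to validate results locally before submitting them.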
Tests whether models generate truthful answers to questions that humans often answer incorrectly due to misconceptions or false beliefs.
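A minimal sketch of how misconception-style truthfulness scoring can work: each item pairs a question with a truthful answer and the common false belief, and the model's reply is credited only when it states the truthful answer. The item format and exact-match rule here are assumptions for illustration; real benchmarks typically use more robust answer matching or judge models.

```python
# Toy item set: question, truthful answer, and the popular misconception.
ITEMS = [
    {"q": "Do we use only 10% of our brains?",
     "truthful": "no", "misconception": "yes"},
    {"q": "Does lightning never strike the same place twice?",
     "truthful": "no", "misconception": "yes"},
]

def score_truthfulness(answers, items):
    """Fraction of replies that exactly match the truthful answer
    (after trimming and lowercasing). Matching rule is a simplification."""
    correct = sum(
        1 for reply, item in zip(answers, items)
        if reply.strip().lower() == item["truthful"]
    )
    return correct / len(items)
```

For example, a model answering "No" to the first question and repeating the misconception ("Yes") on the second would score 0.5 on this toy set.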