Two years ago I wrote a blog post on how different LLMs are benchmarked, and it is safe to say that much has happened since then in how these models are evaluated. Many models have outgrown the original tests, so new and harder benchmarks have had to be created. It is also difficult to judge how good a model actually is from the scores presented in a vendor’s blog post or chart.

Do these tests mean that Grok-3 is a game changer, or not? Well, the tests can be a bit misleading. Unfortunately, there is no single score that shows how good a model is overall, since most benchmarks target a specific subject or topic. Here are some of the tests you will commonly see cited:
- GPQA (Evaluates the model’s ability in biology, physics, and chemistry)
- GPQA Diamond (Extended difficulty level focusing on PhD-level questions)
- MATH (Evaluates the model’s responses to various math problems, including algebra, geometry, and more)
- HiddenMath (More difficult tasks than MATH, developed by experts, where answers are not published online)
- ScandEval (Evaluates the model’s performance in Scandinavian languages)
- MMLU (Measures the model’s breadth of knowledge across many different tasks. The test covers 57 subjects, including elementary mathematics, U.S. history, computer science, law, and more.)
- MMLU-Pro (An extended version of MMLU with even harder tasks)
- Global MMLU (MMLU tasks translated into 15 different languages, with the model evaluated across all 15 languages)
- Natural2Code (Tests the model’s coding ability, specifically in Python, Java, C++, JavaScript, and Go)
- LiveCodeBench (Evaluates the model’s coding performance)
- SWE-Bench Verified (Assesses the model’s ability to resolve real-world software engineering tasks drawn from GitHub issues)
- HumanEval (A test developed by OpenAI to evaluate code generation. It consists of 164 handcrafted programming challenges that test models’ ability to generate functionally correct code; a small sketch of how this kind of scoring works follows after the list.)
- NIAH (Needle in a Haystack) (Measures how well a model can retrieve a specific piece of information buried somewhere in its context window; a sketch of how such a probe is built also follows after the list)
- Bias Benchmark for Question Answering (BBQ) (Tests if the model exhibits biases when answering questions)
- Image – MMMU (Evaluates multimodal capabilities with roughly 11,500 questions drawn from college exams, quizzes, and textbooks across six core disciplines: art & design, business, science, health & medicine, humanities & social science, and tech & engineering.)
- Image – Vibe-Eval (Evaluates how well the model answers hard questions about images)
- Audio – CoVoST2 (Tests speech capabilities in various languages)
- Video – EgoSchema (Evaluates long-form video question answering)
- HHEM (Measures the extent to which a model hallucinates)
- ARC-AGI (The ARC-AGI test assesses artificial intelligence’s ability to solve complex, abstract problems by evaluating its skills in generalization and creative problem-solving without relying on specific training data.)
Note: This is not a complete list but rather a selection of some of the most commonly used benchmarks.
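To make the HumanEval entry above a bit more concrete, here is a minimal sketch of the general idea behind functional-correctness scoring: the model’s generated code is executed against unit tests, and a task only counts as solved if every assertion passes. The task, candidate solution, and tests below are illustrative toys, not items from the real benchmark.

```python
# Minimal sketch of HumanEval-style functional-correctness scoring.
# The task, candidate completion, and tests are illustrative toys,
# not items from the actual HumanEval dataset.

def run_candidate(candidate_code: str, test_code: str) -> bool:
    """Execute a model-generated solution against its unit tests.

    Returns True only if the code runs and every assertion passes.
    (A real harness would also sandbox and time-limit the execution.)
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run the test assertions
        return True
    except Exception:
        return False

# One toy task: the model was asked to implement add(a, b).
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [run_candidate(candidate, tests)]
pass_at_1 = sum(results) / len(results)
print(f"pass@1 = {pass_at_1:.2f}")  # 1.00 here, since the toy candidate passes
```

Scores like pass@1 are simply the fraction of tasks solved on the first attempt; the official harness adds sandboxing, timeouts, and a statistical pass@k estimator on top of this basic idea.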
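The Needle in a Haystack test is similarly easy to picture in code: a specific fact (the needle) is buried at some depth in a long stretch of filler text, and the model is asked to retrieve it. The sketch below only builds such a probe; ask_model is a stand-in for whichever LLM client you use, not a real library call.

```python
# Sketch of how a needle-in-a-haystack probe is constructed.
# ask_model() is a placeholder for your own chat-completion client.

def build_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) of a long filler text."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

needle = "The secret code for the vault is 4817."
context = build_haystack(
    needle,
    filler="The sky was grey over the harbour that morning.",
    n_sentences=2000,  # scale this up to probe longer context windows
    depth=0.5,         # vary the depth to see where retrieval starts to fail
)
question = "What is the secret code for the vault?"

# answer = ask_model(context + "\n\n" + question)   # replace with your LLM client
# print("retrieved" if "4817" in answer else "missed")
```

Typical NIAH runs repeat this over many context lengths and needle depths and report how often the model finds the needle.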
If a model scores very high in certain areas, such as MMLU (a frequently showcased test), it only reflects the model’s ability to answer questions within that category. However, it does not tell us:
- How well it can code in languages like COBOL
- How proficient it is in Scandinavian languages
- How good its “memory” is
- How well it handles multimodal tasks (image/audio/video)
- The extent to which it hallucinates
- How effective it is as a foundation for virtual agents
- How fast or cost-efficient it is
Of course, there are many factors to consider when adopting new language models, but it’s important not to be blinded by benchmark graphs when new models are released.
Below are some sources (continuously updated) where you can review test results for different language models.
Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard
Price / Performance: https://artificialanalysis.ai/leaderboards/models
Open LLM Leaderboard: https://huggingface.co/…/open…/open_llm_leaderboard…
LMArena Chatbot Arena: https://lmarena.ai/?leaderboard