LLM Benchmarks Explained

Understanding how language models are evaluated and compared

GSM8K • HellaSwag • MMLU • HumanEval • TruthfulQA • More

📊 Benchmark Quick Reference

  • 🧮 GSM8K: Math reasoning
  • 🤔 HellaSwag: Common sense
  • 📚 MMLU: 57 subjects
  • 💻 HumanEval: Code generation

🧮 GSM8K - Math Reasoning Benchmark

Grade School Math 8K • 8,500 grade school math word problems

Math

What is GSM8K?

GSM8K (Grade School Math 8K) is a dataset of 8,500 high-quality grade school math word problems created by human problem writers. Each problem takes between 2 and 8 steps to solve and involves basic arithmetic operations. The benchmark tests a model's ability to understand natural language math problems, break them down into steps, and perform accurate calculations.

📝 Example GSM8K Problem

Question: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"

Answer: $18 (16 - 3 - 4 = 9 eggs × $2 = $18)
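
Scoring GSM8K is simple in principle: a harness extracts the final number from the model's reply and compares it with the reference answer, which in the released dataset follows a "####" marker. Below is a minimal sketch of that check in Python; the model reply shown is a made-up illustration, not real model output.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in a string, ignoring $ signs and thousands separators."""
    cleaned = text.replace("$", "").replace(",", "")
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return matches[-1] if matches else None

# Reference answer as it appears in the GSM8K dataset: a worked rationale, then "#### 18".
reference = "16 - 3 - 4 = 9 eggs sold, and 9 * 2 = 18. #### 18"

# Hypothetical model reply to the duck-egg problem above (not real model output).
model_reply = "She sells 9 eggs at $2 each, so she makes $18 every day."

gold = extract_final_number(reference.split("####")[-1])
pred = extract_final_number(model_reply)
print("correct" if pred == gold else "incorrect")  # -> correct
```

Real harnesses add more normalization (units, fractions, trailing punctuation), but the exact-match idea is the same.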

Top GSM8K Performance (2025)

Model            GSM8K Score   Organization   Size
GPT-4 Turbo      95%           OpenAI         Unknown
Claude 3 Opus    93%           Anthropic      Unknown
Gemini 1.5 Pro   91%           Google         Unknown
Llama 3.1 70B    89%           Meta           70B
Mistral Large    86%           Mistral AI     Unknown
Llama 3.1 8B     72%           Meta           8B

Why GSM8K Matters

Math word problems force a model to parse natural language, plan a multi-step solution, and carry out exact arithmetic. Small errors compound across steps, so GSM8K is one of the most widely cited indicators of reliable step-by-step reasoning rather than memorized facts.

🤔 HellaSwag - Commonsense Reasoning

70,000 multiple-choice questions testing everyday commonsense reasoning

Reasoning

What is HellaSwag?

HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a benchmark designed to test commonsense natural language inference. It contains 70,000 multiple-choice questions where models must predict the most likely continuation of a given scenario. The dataset is specifically designed to be easy for humans (95%+ accuracy) but challenging for models.

📝 Example HellaSwag Problem

Context: "A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She..."

Options:

  • A) rinses the bucket off with soap and blow dries the dog's head.
  • B) uses a hose to keep it from getting soapy.
  • C) grabs the dog and holds it still. ✓ Correct
  • D) gets into the bath tub with the dog.
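
Under the hood, most harnesses score HellaSwag by asking the model for the log-likelihood of each candidate ending given the context and picking the highest, usually normalized by length. Here is a small sketch of that idea using the Hugging Face transformers library with GPT-2 as a stand-in model; it illustrates the scoring method rather than reproducing any particular leaderboard's harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = ("A woman is outside with a bucket and a dog. The dog is running "
           "around trying to avoid a bath. She")
endings = [
    " rinses the bucket off with soap and blow dries the dog's head.",
    " uses a hose to keep it from getting soapy.",
    " grabs the dog and holds it still.",
    " gets into the bath tub with the dog.",
]

def ending_logprob(context: str, ending: str) -> float:
    """Average log-probability per token of `ending` given `context`."""
    # Assumes the tokenization of the context is a prefix of the tokenization of
    # context + ending, which holds for GPT-2's BPE when the ending starts with a space.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the logits predicts token i + 1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lps = [log_probs[pos, full_ids[0, pos + 1]].item()
                 for pos in range(ctx_len - 1, full_ids.shape[1] - 1)]
    return sum(token_lps) / len(token_lps)  # length-normalized

scores = [ending_logprob(context, e) for e in endings]
print("Predicted ending:", "ABCD"[scores.index(max(scores))])
```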

Top HellaSwag Performance (2025)

Model               HellaSwag Score   Organization
GPT-4 Turbo         95.3%             OpenAI
Claude 3.5 Sonnet   89%               Anthropic
Llama 3.1 70B       87.5%             Meta
Gemini 1.5 Pro      86.2%             Google
Mistral 7B          83.3%             Mistral AI
Llama 3.1 8B        78.6%             Meta

Human baseline: ~95.6%

Why HellaSwag Matters

HellaSwag's wrong answers are generated adversarially to look plausible, so the benchmark separates genuine commonsense inference from surface-level pattern matching. The ~95% human baseline also gives a clear ceiling: the remaining gap shows how close a model is to human-level everyday reasoning.

📚 MMLU - Massive Multitask Language Understanding

57 subjects spanning STEM, humanities, social sciences, and more

Knowledge

What is MMLU?

MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark that tests a model across 57 different subjects including elementary mathematics, US history, computer science, law, and more. It contains ~16,000 multiple-choice questions spanning knowledge from elementary level to professional level. MMLU is considered one of the most thorough benchmarks for measuring general knowledge and reasoning capabilities.
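
In practice, MMLU is run as few-shot multiple choice: each question is formatted with lettered options, the model is asked to reply with a single letter, and accuracy is averaged per subject and then overall. The sketch below shows that prompt construction and scoring loop; the example question and the ask_model callable are placeholders, not a real evaluation harness.

```python
from collections import defaultdict

CHOICES = "ABCD"

def format_question(question: str, options: list[str]) -> str:
    """Build an MMLU-style multiple-choice prompt that ends in 'Answer:'."""
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, options)]
    lines.append("Answer:")
    return "\n".join(lines)

def score(examples: list[dict], ask_model) -> dict[str, float]:
    """Per-subject accuracy; ask_model is any callable that returns 'A'-'D'."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        prompt = format_question(ex["question"], ex["options"])
        prediction = ask_model(prompt).strip().upper()[:1]
        correct[ex["subject"]] += int(prediction == ex["answer"])
        total[ex["subject"]] += 1
    return {subject: correct[subject] / total[subject] for subject in total}

# Toy example with a made-up question; a real run would load the MMLU test split.
examples = [{
    "subject": "elementary_mathematics",
    "question": "What is 7 x 8?",
    "options": ["54", "56", "64", "48"],
    "answer": "B",
}]
print(score(examples, ask_model=lambda prompt: "B"))  # {'elementary_mathematics': 1.0}
```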

  • Total Subjects: 57
  • Questions: ~16K
  • Categories: 4
  • Human Expert Accuracy: ~90%

MMLU Subject Categories

🔬 STEM

Physics, Chemistry, Biology, Mathematics, Computer Science, Engineering

📖 Humanities

Philosophy, History, Literature, Religion, Ethics, Art

🏛️ Social Sciences

Psychology, Economics, Law, Politics, Geography, Sociology

📊 Other

Business, Health, Misc. Professional Topics

Top MMLU Performance (2025)

Model            MMLU Score (57-subject average)   Organization
Claude 3 Opus    86.8%                             Anthropic
GPT-4 Turbo      86.4%                             OpenAI
Gemini 1.5 Pro   85.9%                             Google
Llama 3.1 70B    79.3%                             Meta
Mistral Large    77.5%                             Mistral AI
Llama 3.1 8B     66.7%                             Meta

Why MMLU Matters

Because it averages performance across 57 academic and professional subjects, a single MMLU score is a reasonable proxy for broad general knowledge, which is why it remains the most commonly quoted headline number when new models are announced. It should still be paired with task-specific benchmarks for deployment decisions.

Other Important Benchmarks

💻 HumanEval

Coding

Tests code generation abilities with 164 programming problems. Models must write Python functions that pass unit tests.

Top Score: GPT-4 Turbo (90%)

Best Open Source: CodeLlama 70B (67%)
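
HumanEval results are reported as pass@k: the probability that at least one of k sampled completions passes all unit tests for a problem. The standard unbiased estimator, given n samples of which c pass, is 1 - C(n-c, k) / C(n, k); a short sketch of it, with made-up sample counts, is below.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n completions sampled, c of them passed, budget k."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 200 samples for one problem, 140 of them pass the unit tests.
print(round(pass_at_k(200, 140, 1), 3))   # 0.7 (pass@1)
print(round(pass_at_k(200, 140, 10), 4))  # 1.0 (pass@10, effectively certain)
```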

✅ TruthfulQA

Truthfulness

Measures whether models generate truthful answers, specifically testing resistance to common misconceptions and false beliefs.

Top Score: Claude 3 Opus (87%)

Baseline: Human experts (~94%)

🎓 ARC

Reasoning

AI2 Reasoning Challenge - 7,787 science exam questions requiring complex reasoning beyond simple fact recall.

Top Score: GPT-4 (96%)

Best Open Source: Llama 3.1 70B (85%)

🧩 BIG-Bench Hard

Multi-step

Collection of 23 challenging tasks from BIG-Bench where previous models performed below human level.

Top Score: GPT-4 Turbo (89%)

Focus: Multi-step reasoning

How to Choose Models Based on Benchmarks

For Math & Finance Applications

Prioritize: GSM8K (math reasoning), the math-related MMLU subjects (more advanced math)

Recommended: GPT-4 Turbo (95% GSM8K), Claude 3 Opus (93%), Llama 3.1 70B (89%)

For Chatbots & Customer Service

Prioritize: HellaSwag (common sense), TruthfulQA (accuracy)

Recommended: Claude 3.5 Sonnet, GPT-4 Turbo, Gemini 1.5 Pro

For Code Generation

Prioritize: HumanEval (Python code generation), MBPP (basic Python programming problems)

Recommended: GPT-4 Turbo (90%), CodeLlama 70B (67%), Claude 3.5 Sonnet

For General Knowledge & Research

Prioritize: MMLU (broad knowledge), ARC (science reasoning)

Recommended: Claude 3 Opus (86.8% MMLU), GPT-4 Turbo (86.4%), Gemini 1.5 Pro
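
To turn these recommendations into a repeatable shortlist, one simple approach is to weight the benchmarks that matter for your use case and rank candidates by the weighted average. The sketch below does this with scores taken from the tables on this page; the weights are arbitrary and purely illustrative, and missing scores are simply skipped.

```python
# Benchmark scores (%) from the tables above; benchmarks without a listed score are omitted.
scores = {
    "GPT-4 Turbo":    {"GSM8K": 95.0, "MMLU": 86.4, "HellaSwag": 95.3},
    "Claude 3 Opus":  {"GSM8K": 93.0, "MMLU": 86.8},
    "Gemini 1.5 Pro": {"GSM8K": 91.0, "MMLU": 85.9, "HellaSwag": 86.2},
    "Llama 3.1 70B":  {"GSM8K": 89.0, "MMLU": 79.3, "HellaSwag": 87.5},
}

# Example use case: math-heavy assistant, so GSM8K gets the largest (arbitrary) weight.
weights = {"GSM8K": 0.6, "MMLU": 0.3, "HellaSwag": 0.1}

def weighted_score(model_scores: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model actually has scores for."""
    present = {b: w for b, w in weights.items() if b in model_scores}
    total_weight = sum(present.values())
    return sum(model_scores[b] * w for b, w in present.items()) / total_weight

ranking = sorted(scores, key=lambda m: weighted_score(scores[m]), reverse=True)
for model in ranking:
    print(f"{model}: {weighted_score(scores[model]):.1f}")
```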

Find Models by Benchmark Performance

Compare 2,800+ models with detailed benchmark scores and deployment guides


Frequently Asked Questions

Are benchmark scores the only factor in choosing a model?

No. Also consider: latency, cost, context window, API availability, deployment complexity, and specific use case requirements. Benchmarks measure capabilities but not necessarily real-world performance for your specific task.

Why do some models score lower on benchmarks but perform better in practice?

Benchmarks test specific, narrow tasks. A model might excel at your use case (e.g., creative writing, specific domain knowledge) even with lower benchmark scores. Always test with your actual data.

How often are these benchmarks updated?

Model scores are updated as new models are released, typically monthly. Some benchmarks themselves also evolve over time to avoid "teaching to the test" and to maintain validity as models improve.

What's the best overall benchmark for deployment decisions?

MMLU is often considered the best single indicator of general capability, but combine it with task-specific benchmarks (GSM8K for math, HumanEval for coding, etc.) for the most informed decision.