Last week, Meta announced their new LLM, Llama 3.1, claiming it rivals closed-source models like GPT-4 and Claude 3.5.
They used popular benchmarks like MMLU (Massive Multitask Language Understanding) to back up their claims.
But here's the problem:
These benchmarks are fundamentally flawed.
Here's why:
- They're too easy
MMLU was created in 2020. Back then, most models scored around 25%, which is chance level on its four-option questions.
Now? The best models score 88-90%.
It's like grading high school students on middle school tests.
- They contain errors
A 2024 audit (the MMLU-Redux project) found that 57% of MMLU's virology questions contained errors.
26% of logical fallacy questions were wrong too.
Some had no correct answer. Others had multiple right answers.
- Models might be cheating
LLMs are trained on internet data. This often includes benchmark questions and answers.
Models could be "contaminated" - they've seen the test in advance.
Some companies might even deliberately train on benchmark data to boost scores.
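To make contamination concrete, here's a toy Python sketch of the kind of overlap check used to audit training data. The n-gram size, helper names, and example strings are all invented for illustration; real decontamination pipelines run at terabyte scale.

```python
# Toy contamination check: flag a benchmark question whose word n-grams
# also appear verbatim in training text. Real pipelines do this at
# terabyte scale (often with 8- to 13-gram matching), but the idea is the same.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, training_text: str, n: int = 8) -> bool:
    """True if any n-gram from the question appears verbatim in the training text."""
    return bool(ngrams(question, n) & ngrams(training_text, n))

# Hypothetical scraped page that happens to contain a benchmark question:
scraped_page = "study guide: the powerhouse of the cell is the mitochondrion, answer C"
question = "The powerhouse of the cell is the mitochondrion."

print(is_contaminated(question, scraped_page, n=5))  # True -> likely leaked
```

In practice, labs also have to catch paraphrased questions, which simple verbatim matching like this misses.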
- Small changes have big impacts
Asking a model to write out its answer versus having it pick a letter or number can produce different scores on the same questions.
That makes results hard to reproduce and hard to compare across models, as the sketch below shows.
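Here's a toy illustration of that fragility. The question, model outputs, and grading rule are all invented, but they show how a letter-matching harness can mark a correct free-form answer wrong:

```python
# Same knowledge, two prompt formats, two different scores.

QUESTION = "Which planet is the largest?"
OPTIONS = ["Mercury", "Jupiter", "Mars", "Venus"]

def as_multiple_choice(q: str, opts: list[str]) -> str:
    lines = [q] + [f"{letter}) {opt}" for letter, opt in zip("ABCD", opts)]
    return "\n".join(lines) + "\nAnswer with a single letter:"

def as_open_ended(q: str) -> str:
    return q + "\nAnswer in one word:"

# Hypothetical model outputs under each format:
mc_output = "B"
open_output = "Jupiter is the largest planet."

def letter_grader(output: str, correct_letter: str = "B") -> bool:
    # Many harnesses grade by matching the first emitted letter.
    return output.strip().upper().startswith(correct_letter)

print(letter_grader(mc_output))    # True
print(letter_grader(open_output))  # False: right answer, wrong format
```

Swap in a grader that searches the output for the correct option text and the score flips. Same model, same knowledge, different leaderboard number.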
- They don't reflect real-world performance
Benchmark scores often fail to match how models perform on actual tasks.
Companies use inflated scores to hype products and boost valuations.
So what's the solution?
- Create harder benchmarks.
↳ MMLU-Pro, GPQA, and MuSR are examples of tougher tests.
- Use automated testing systems
↳ HELM (Holistic Evaluation of Language Models) and EleutherAI's LM Evaluation Harness power more trustworthy leaderboards (see the example run after this list).
- Develop new benchmarks for emerging skills
↳ GAIA tests real-world problem-solving. NoCha (Novel Challenge) assesses long-context understanding.
- Use AI to create benchmarks
↳ Projects like AutoBencher use LLMs to develop new tests (a toy version of the idea is sketched after this list).
- Focus on safety benchmarks
↳ Anthropic is funding the creation of benchmarks to assess AI safety risks.
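To try the automated-testing route yourself, EleutherAI's harness is pip-installable. A minimal sketch assuming the lm-eval Python API; the model id, batch size, and result keys here are illustrative, so check the project's docs for the exact interface:

```python
# Run a standardized MMLU evaluation with EleutherAI's lm-evaluation-harness.
# Install first: pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                               # Hugging Face backend
    model_args="pretrained=microsoft/phi-2",  # illustrative model id
    tasks=["mmlu"],
    batch_size=8,
)

# Per-task metrics live under "results"; exact keys vary by version.
print(results["results"])
```

Because everyone runs the same harness with the same prompts, scores become comparable across models instead of across marketing decks.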
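And here's a toy sketch of the "LLMs write the benchmark" idea. This is not AutoBencher's actual pipeline (the real system adds search and filtering for questions models actually get wrong); the prompt, model name, and topic below are placeholders:

```python
# Toy "LLM writes the benchmark" loop using the OpenAI client
# (pip install openai, with OPENAI_API_KEY set). Everything here is
# illustrative; real systems add search, filtering, and answer checking.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Write one hard four-option multiple-choice question about {topic}.\n"
    "Format:\nQ: <question>\nA) ...\nB) ...\nC) ...\nD) ...\nAnswer: <letter>"
)

def generate_question(topic: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any capable chat model
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
    )
    return response.choices[0].message.content

print(generate_question("virology"))
```

A human (or a second model) still has to verify the answer key, which, given the error rates above, is not optional.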
The big picture:
As AI commercializes, we need benchmarks that are reliable and specific to the tasks people actually use models for.
Startups specializing in AI evaluation are emerging.
The era of AI labs grading their own homework is ending.
And that's a good thing for everyone.