Ask someone what the best AI model is, and you’ll get all sorts of answers—some based on personal experience, others influenced by company preferences or flashy marketing.
But scientists don’t rely on opinions; they use benchmarks: structured tests that evaluate AI capabilities, much like exams do for students. Models compete on scores like 86.4 vs. 90 on MMLU, where even a small gap is treated as the difference between “smart” and “genius.” But how do these benchmarks actually work? And can an AI ever “graduate”?
*AI Benchmarks: A Learning Journey*
Freshman Year: Basic Knowledge Tests
At the entry level, AI models are tested on fundamental skills: general knowledge (MMLU), commonsense reasoning (HellaSwag), speech translation (CoVoST2), and math (HiddenMath). These tests determine whether an AI has the core knowledge needed to move on to more advanced tasks.
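Under the hood, most of these freshman-level benchmarks boil down to accuracy on a large question set. Here is a minimal sketch of MMLU-style multiple-choice grading; the sample question and the `ask_model` function are hypothetical placeholders, not the real MMLU data or evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# The question below and ask_model() are hypothetical stand-ins,
# not the real MMLU data or harness.

questions = [
    {
        "prompt": "What is the powerhouse of the cell?",
        "choices": ["A) Ribosome", "B) Mitochondrion", "C) Nucleus", "D) Golgi body"],
        "answer": "B",
    },
    # ...the real MMLU has roughly 14,000 questions across 57 subjects...
]

def ask_model(prompt: str, choices: list[str]) -> str:
    """Placeholder: send the question to a model and return its letter choice."""
    return "B"  # imagine a real API call here

def accuracy(questions: list[dict]) -> float:
    correct = sum(
        ask_model(q["prompt"], q["choices"]) == q["answer"] for q in questions
    )
    return 100 * correct / len(questions)

print(f"MMLU-style score: {accuracy(questions):.1f}%")
```

On a test set of roughly 14,000 questions, the gap between 86.4 and 90 works out to around 500 questions answered differently, which is why small score differences get taken so seriously.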
Graduate Level: Can AI Think Like Humans?
Now, things get serious. The ARC-AGI benchmark measures whether an AI can solve abstract puzzles that humans find intuitive but that never appear in training data. This isn’t memorization; each task shows a few example grids, and the model has to infer the underlying rule and apply it to a brand-new case.
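To see why this is hard to fake, it helps to look at the task format. The toy task and `solve` function below are illustrative stand-ins; real ARC-AGI tasks ship as JSON files with the same train/test shape, and grading is exact match on the whole output grid.

```python
# Sketch of the ARC-AGI task format: grids of integers (colors), scored by
# exact match. This toy task and solve() are illustrative, not from the dataset.

task = {
    "train": [  # demonstration pairs the solver can study
        {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[2, 2], [2, 2]]},
    ],
    "test": [  # the held-out puzzle: infer the rule, produce the grid
        {"input": [[0, 0], [3, 0]], "output": [[3, 3], [3, 3]]},
    ],
}

def solve(grid: list[list[int]]) -> list[list[int]]:
    """Toy rule inferred from the examples: flood the grid with its nonzero color."""
    color = next(c for row in grid for c in row if c != 0)
    return [[color] * len(row) for row in grid]

# ARC-AGI gives no partial credit: the entire output grid must match exactly.
for pair in task["test"]:
    assert solve(pair["input"]) == pair["output"]
print("Task solved: exact grid match.")
```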
PhD Level: Can AI Learn on Its Own?
At this stage, AI models are tested on whether they can teach themselves: adapting and improving without human guidance. One such benchmark is OpenAI’s MLE-bench, which drops an agent into real Kaggle-style machine-learning competitions and asks it to build, train, and debug models entirely on its own. Tracking this capability also matters for safety: an AI that can do its own ML engineering is one whose autonomy researchers want to measure before it outruns oversight.
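MLE-bench grading looks less like an exam and more like a leaderboard: the agent’s final submission for each competition is compared against the human field, and the headline metric is roughly the share of competitions where it would have earned a medal. The competitions, scores, and thresholds below are invented purely to illustrate that scheme.

```python
# Illustrative sketch of MLE-bench-style grading: an agent "passes" a
# competition if its submission would have earned a medal on the human
# leaderboard. All names and numbers here are invented for illustration.

results = [
    # (competition, agent_score, medal_threshold, higher_is_better)
    ("spam-detection", 0.97, 0.95, True),
    ("house-prices",   0.31, 0.25, False),  # e.g. RMSE: lower is better
    ("image-tagging",  0.88, 0.92, True),
]

def medaled(score: float, threshold: float, higher_is_better: bool) -> bool:
    return score >= threshold if higher_is_better else score <= threshold

medals = sum(medaled(s, t, hib) for _, s, t, hib in results)
print(f"Medal rate: {medals}/{len(results)} competitions")  # -> 1/3 here
```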
The Never-Ending AI Race
But what happens if an AI scores 100%? Does that mean it’s officially as intelligent as a human? OpenAI recently announced that its o3 model scored an impressive 75.7% on ARC-AGI, suggesting it is closing in on human-level performance on that test. Another 25 points or so would max out the benchmark, but humans have a way of avoiding direct competition: scientists are already working on ARC-AGI-2, a tougher benchmark designed to challenge even the most advanced models.
Check out the full blog for a deep dive into AI benchmarks and what they really mean.