In a world increasingly dominated by artificial intelligence, understanding how to measure intelligence itself is more critical than ever. Intelligence, though studied across many disciplines, often falls victim to subjective measurement criteria. Traditionally, we have tried to quantify it through tests and benchmarks: think of college entrance exams, where countless aspiring students memorize test-prep strategies and sometimes leave with perfect scores. But does a perfect score accurately reflect the true intelligence of those individuals? Certainly not. Such benchmarks serve merely as approximations of a person's, or an AI's, capabilities, rather than definitive measures of intelligence.

The generative AI community has relied heavily on benchmarks such as MMLU (Massive Multitask Language Understanding) to assess the capabilities of AI models. MMLU uses multiple-choice questions across a range of academic fields, which allows for straightforward comparisons between models. However, it fails to capture the full spectrum of capabilities that these models possess. For example, both Claude 3.5 Sonnet and GPT-4.5 score remarkably well on this benchmark, suggesting equivalent capabilities. Yet practitioners who work directly with these models know that significant disparities emerge when they are applied to real-world problems.

This brings us to a pressing question: what does it truly mean to measure intelligence in AI? With the recent release of the ARC-AGI benchmark, a test designed to challenge models on general reasoning and creative problem-solving, the debate around this question has been reignited. While ARC-AGI is still in the early stages of industry adoption, it is a welcome addition to the ongoing effort to refine testing frameworks. Each benchmark has its own merits, and ARC-AGI is a promising step forward in this crucial conversation about how we measure AI.

Another significant advancement in AI evaluation has emerged with the introduction of Humanity's Last Exam, an ambitious benchmark comprising 3,000 peer-reviewed, multi-step questions across a wide array of disciplines, designed to rigorously challenge AI systems on expert-level reasoning. Even so, initial results show rapid progress, with OpenAI reportedly achieving a score of 26.6% within a month of the exam's release. Like other traditional benchmarks, however, it primarily evaluates knowledge and reasoning in isolation, without adequately assessing the practical, tool-using capabilities that are increasingly vital for real-world AI applications.

For instance, several state-of-the-art models have struggled to count the number of r's in the word "strawberry" or have incorrectly concluded that 3.8 is smaller than 3.1111. Such failures, which even a child or a basic calculator could resolve, highlight a critical disconnect between the progress indicated by benchmarks and the robustness required of real-world AI systems. Intelligence transcends exam performance; it involves the ability to navigate and solve everyday logical challenges reliably.
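To put such failures in perspective, both checks are deterministic one-liners in ordinary code. The snippet below is a minimal illustration of the two tasks in question, not taken from any benchmark harness:

```python
# Minimal illustration of the two "trivial" checks described above.

word = "strawberry"
r_count = word.count("r")  # count occurrences of the letter "r"
print(f"'r' appears {r_count} times in '{word}'")  # prints 3

a, b = 3.8, 3.1111
print(f"{a} > {b} is {a > b}")  # prints True: 3.8 is the larger number
```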

As AI models have evolved and grown more sophisticated, traditional benchmarks have revealed their limitations. For example, GPT-4, when equipped with tools, scored only about 15% on more complex, real-world tasks in the GAIA benchmark, despite achieving stellar results on conventional multiple-choice tests.

This growing disconnect between benchmark performance and practical capability has raised concerns, particularly as AI technologies transition from research environments into business applications. Traditional benchmarks primarily test knowledge recall while neglecting crucial facets of intelligence, such as gathering information, executing code, analyzing data, and synthesizing solutions across various domains.

In response, the GAIA benchmark represents a necessary evolution in AI evaluation methodology. Developed through collaborative efforts among the Meta-FAIR, Meta-GenAI, HuggingFace, and AutoGPT teams, GAIA includes 466 meticulously crafted questions distributed across three levels of difficulty. These questions are designed to assess capabilities such as web browsing, multi-modal comprehension, code execution, and complex reasoning, all skills essential for real-world AI applications.

Level 1 questions typically require about five steps and a single tool for a human to solve. Level 2 questions demand five to ten steps and multiple tools, whereas Level 3 questions can involve up to 50 discrete steps and any number of tools. This structure mirrors the complexity of real business problems, where solutions seldom arise from a single action or tool.
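As a rough, purely illustrative sketch (not part of GAIA's official tooling), that rubric can be read as a simple classifier over step and tool counts:

```python
# Hypothetical helper that buckets a task into a GAIA-style difficulty level,
# based only on the step and tool counts described above (not official GAIA code).
def gaia_level(steps: int, tools: int) -> int:
    if steps <= 5 and tools <= 1:
        return 1  # Level 1: ~5 steps, a single tool
    if steps <= 10:
        return 2  # Level 2: 5-10 steps, multiple tools
    return 3      # Level 3: up to ~50 steps, any number of tools

print(gaia_level(steps=4, tools=1))   # 1
print(gaia_level(steps=8, tools=3))   # 2
print(gaia_level(steps=40, tools=6))  # 3
```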

Notably, one flexible agentic system recently achieved an impressive 75% accuracy on the GAIA benchmark, markedly outperforming industry heavyweights such as Microsoft's Magentic-One, which scored 38%, and Google's Langfun Agent at 49%. This success can be attributed to the system's combination of specialized components for audio-visual understanding and reasoning, with Anthropic's Claude 3.5 Sonnet serving as the primary model.

This evolution in AI evaluation signifies a broader transformation within the industry itself. We are witnessing a shift from standalone Software as a Service (SaaS) applications towards AI agents capable of orchestrating multiple tools and workflows. As businesses increasingly depend on AI systems to tackle intricate, multi-step tasks, benchmarks like GAIA emerge as more meaningful indicators of capability compared to traditional multiple-choice tests.

Looking ahead, the future of AI evaluation lies not in isolated knowledge tests but in comprehensive assessments of problem-solving ability. GAIA establishes a new standard for measuring AI capabilities, one that more accurately reflects the challenges and opportunities of real-world AI deployment.

Written by Sri Ambati, founder and CEO of H2O.ai.