OpenAI's New Milestone: Exceeding Expectations in AI Testing

OpenAI's Deep Research AI agent has scored a groundbreaking 26.6% on "Humanity's Last Exam," one of the most formidable AI evaluation benchmarks, marking a 183% rise in the top score within just ten days of the exam's release. Breakthroughs this rapid raise concerns about how long evaluation tests can remain effective at accurately measuring AI progress.

In a remarkable achievement, OpenAI's AI agent Deep Research has scored an impressive 26.6% on "Humanity's Last Exam," a test designed to be among the toughest assessments of AI performance. The top score has climbed 183% in less than ten days since the exam's release, highlighting the rapid pace of advances in AI capabilities.

Humanity's Last Exam: A Challenging Benchmark

Designed as a comprehensive benchmark, "Humanity's Last Exam" spans a wide array of subjects, including mathematics, the humanities, and the natural sciences. Its carefully curated questions were contributed by university professors and notable mathematicians, making it a formidable evaluation for AI. When the exam was first released, even OpenAI's then most advanced model, o1, managed a score of only 8.3%.

Surpassing Previous Records

Following the exam's introduction, successive AI models began topping the previous best scores. DeepSeek's reasoning model R1, said to be comparable to o1 in capability, achieved 9.4%. Subsequent releases pushed the record higher still, with OpenAI's o3-mini scoring 10.5% and its enhanced variant, o3-mini-high, reaching 13%.

Deep Research's Breakthrough

Taking the lead is OpenAI's Deep Research, which outperformed these competitors with a score of 26.6%. Unlike conventional chat services, this AI agent searches the internet while reasoning, rather than relying solely on the knowledge contained in its training data.
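
For readers checking the arithmetic behind the headline figure: assuming the 183% increase is measured against R1's 9.4%, the previous top score, the jump to 26.6% works out to (26.6 − 9.4) / 9.4 ≈ 1.83, or roughly a 183% gain.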

The Fairness Debate and Rapid Progress

Despite the remarkable performance reported by OpenAI, some criticism has surfaced. TechRadar notes that pitting AI models with web-search capabilities against models without them may not be a fair comparison. Nonetheless, the swift progression of these scores underscores the rapid evolution of AI technology.

Concerns Over Evaluation Test Longevity

While impressive, such quick breakthroughs on newly introduced evaluation tests could undermine their primary purpose: objectively measuring AI progress. TIME voices similar concerns, pointing to the widening gap between AI capabilities and the difficulty of the tests used to assess them. Observing that "the speed of test development lags behind AI advancements," the outlet emphasizes the financial and logistical complexity of crafting evaluations capable of detecting potentially dangerous abilities early.

With leading research institutes unveiling advanced models at an increasing pace, there is an unprecedented need for innovative tests to effectively gauge model capabilities.

Published At: Feb. 6, 2025, 12:35 p.m.
Original Source: OpenAI's Deep Research scores over 26% on "Humanity's Last Exam," where the best answer accuracy had previously been around 9% (Author: GIGAZINE)
Note: This publication was rewritten using AI. The content was based on the original source linked above.