Researchers Create “Humanity’s Last Exam” to Test AI

Researchers at the Center for AI Safety and the company Scale AI have teamed up to develop a new test of AI systems’ intelligence, one that could become the litmus test for declaring whether a model has achieved “artificial general intelligence.”

The Big Picture: Testing AI models has been an evolving process, starting with SAT-like exams and quickly progressing to graduate-level evaluations. So far, the top models have crushed them. By pooling together the hardest questions from experts in the most difficult fields, the new test essentially pits AI against the most-educated humans on Earth.

Between the Questions: Led by AI safety researcher Dan Hendrycks, the Center for AI Safety and Scale AI have developed “Humanity’s Last Exam.” (At one point, it was called “Humanity’s Last Stand.” Lol.)

The test consists of about 3,000 multiple-choice and short-answer questions across topics such as analytic philosophy and rocket engineering.
The complex but answerable questions were submitted by the best and brightest in their respective fields, including university professors, top mathematicians, and theoretical particle physicists.
The questions were then put through a “two-step filtering process.” First, the AI systems were given the submitted questions to solve.
Second, any questions the systems couldn’t answer went to a team of human reviewers, who refined them to make them as clear and understandable as possible.
AI systems’ scores are seen as a bellwether of whether they’ve reached artificial general intelligence.
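
For the curious, the two-step filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual pipeline: the function names, the question format (a dict with "prompt" and "answer"), and the idea of treating each model and each reviewer as a plain function are all hypothetical.

```python
from typing import Callable, Dict, List

Question = Dict[str, str]          # hypothetical format: {"prompt": ..., "answer": ...}
Model = Callable[[str], str]       # hypothetical: takes a prompt, returns an answer


def first_pass_filter(questions: List[Question], models: List[Model]) -> List[Question]:
    """Step one: keep only the questions that none of the models answers correctly."""
    survivors = []
    for q in questions:
        solved = any(model(q["prompt"]) == q["answer"] for model in models)
        if not solved:
            survivors.append(q)
    return survivors


def second_pass_review(
    questions: List[Question], review: Callable[[Question], Question]
) -> List[Question]:
    """Step two: a human reviewer (modeled here as a function) refines each survivor."""
    return [review(q) for q in questions]


if __name__ == "__main__":
    # Toy data: one easy question and one the toy model cannot answer.
    questions = [
        {"prompt": "2 + 2", "answer": "4"},
        {"prompt": "a very hard question", "answer": "42"},
    ]
    toy_model = lambda prompt: "4"  # stand-in for a real AI system
    hard_ones = first_pass_filter(questions, [toy_model])
    final = second_pass_review(hard_ones, lambda q: {**q, "reviewed": "yes"})
    print(final)
```

In this toy run, the easy question is dropped because the stand-in model answers it, and only the hard one moves on to review.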

The Future: So, how did the top AI models out there fare? The top six models from companies like OpenAI, Google, and Anthropic all failed badly; OpenAI’s o1 system scored the highest, at a dismal 8.3%. That’s good news for humans: we’re still the dominant intelligence!

But not so fast… Hendrycks believes these models will raise those scores quickly, like a kid who’s finally found a great SAT math tutor. He says they could possibly get above 50% by the end of this year. Whoa. Considering that AI, despite its intelligence, still makes some pretty basic mistakes, humans’ best asset in the job market may be strengthening that old muscle called common sense.

The post Researchers Create “Humanity’s Last Exam” to Test AI appeared first on TheFutureParty.
