Humanity’s Last Exam, a Groundbreaking New Benchmark

by janrinok from SoylentNews on (#6TYX3)

AnonTechie writes:

Scale AI and CAIS Unveil Results of Humanity's Last Exam, a Groundbreaking New Benchmark

Scale AI and the Center for AI Safety (CAIS) are proud to publish the results of Humanity's Last Exam, a groundbreaking new AI benchmark designed to test the limits of AI knowledge at the frontiers of human expertise. The results showed a significant improvement over the reasoning capabilities of earlier models, but current models still answered fewer than 10 percent of the expert questions correctly. The paper can be read here.

The new benchmark, called "Humanity's Last Exam," evaluated whether AI systems have achieved world-class expert-level reasoning and knowledge capabilities across a wide range of fields, including math, humanities, and the natural sciences. Throughout the fall, CAIS and Scale AI crowdsourced questions from experts to assemble the hardest and broadest problems with which to stump the AI models. The exam was developed to address the challenge of "benchmark saturation," in which models regularly achieve near-perfect scores on existing tests but may not be able to answer questions outside of those tests. Saturation reduces the utility of a benchmark as a precise measurement of future model progress.

[Source]: Scale AI
