Meta got caught gaming AI benchmarks

Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash across "a broad range of widely reported benchmarks."
Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta's press release, the company highlighted Maverick's ELO score of 1417, which placed it above OpenAI's 4o and just under Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)
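For context, Elo-style ratings convert head-to-head win rates into a score gap. The sketch below shows the standard Elo math for illustration only; LMArena's exact scoring pipeline may differ:

```python
# Illustrative Elo math (standard chess-style formula, not LMArena's exact implementation).
def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Adjust both ratings after a single head-to-head vote."""
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_win_prob(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# A model rated 50 points higher is expected to win roughly 57% of matchups.
print(round(expected_win_prob(1417, 1367), 2))  # ~0.57
```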
The achievement seemed to position Meta's open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta's documentation discovered something unusual.
In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn't the same as what's available to the public. According to Meta's own materials, it deployed an "experimental chat version" of Maverick to LMArena that was specifically "optimized for conversationality," Te …