Meta got caught gaming AI benchmarks

Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash across "a broad range of widely reported benchmarks."
Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta's press release, the company highlighted Maverick's ELO score of 1417, which placed it above OpenAI's 4o and just under Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)
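For context, Elo-style ratings convert head-to-head win rates into a score gap. The sketch below shows the standard Elo math for illustration only; LMArena's exact scoring pipeline may differ:

```python
# Illustrative Elo math (standard chess-style formula, not LMArena's exact implementation).
def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Adjust both ratings after a single head-to-head vote."""
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_win_prob(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# A model rated 50 points higher is expected to win roughly 57% of matchups.
print(round(expected_win_prob(1417, 1367), 2))  # ~0.57
```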
The achievement seemed to position Meta's open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta's documentation discovered something unusual.
In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn't the same as what's available to the public. According to Meta's own materials, it deployed an "experimental chat version" of Maverick to LMArena that was specifically "optimized for conversationality," Te …