
Turing Test on Steroids: Chatbot Arena Crowdsources Ratings for 45 AI Models

by hubie from SoylentNews on (#6H7FH)

Freeman writes:

https://arstechnica.com/ai/2023/12/turing-test-on-steroids-chatbot-arena-crowdsources-ratings-for-45-ai-models/

As the AI landscape has expanded to include dozens of distinct large language models (LLMs), debates over which model provides the "best" answers for any given prompt have also proliferated (Ars has even delved into these kinds of debates a few times in recent months). For those looking for a more rigorous way of comparing various models, the folks over at the Large Model Systems Organization (LMSys) have set up Chatbot Arena, a platform for generating Elo-style rankings for LLMs based on a crowdsourced blind-testing website.
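
To give a sense of how those Elo-style rankings work, here is a minimal sketch (in Python) of a rating update driven by a single blind pairwise vote; the K-factor and starting ratings are illustrative assumptions, not LMSys's actual parameters:

    # Elo-style rating update from one blind pairwise vote (illustrative constants).
    def expected_score(rating_a: float, rating_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

    def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
        """Return both models' new ratings after a single comparison."""
        expected_a = expected_score(rating_a, rating_b)
        score_a = 1.0 if a_won else 0.0
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # Example: a voter prefers model A's answer in one blind matchup.
    a, b = update_elo(1000.0, 1000.0, a_won=True)
    print(a, b)  # 1016.0, 984.0

Accumulated over many thousands of such votes, these incremental updates settle into a stable ordering of the models, which is what the public leaderboard reflects.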

[...] Since its public launch back in May, LMSys says it has gathered over 130,000 blind pairwise ratings across 45 different models (as of early December). Those numbers seem poised to increase quickly after a recent positive review from OpenAI's Andrej Karpathy that has already led to what LMSys describes as "a super stress test" for its servers.

[...] Chatbot Arena's latest public leaderboard update shows a few proprietary models easily beating out a wide range of open-source alternatives. OpenAI's GPT-4 Turbo leads the pack by a wide margin, with only an older GPT-4 model ("0314," which was discontinued in June) coming anywhere close on the ratings scale. But even months-old, defunct versions of GPT-3.5 Turbo outrank the highest-rated open-source models available in Chatbot Arena's testbed.

[...] Chatbot Arena users may also naturally gravitate towards certain types of prompts that favor certain types of models.

[...] To balance out these potential human biases, LMSys has also developed a completely automated ranking system called LLM Judge.

[...] LMSys's academic paper on the subject finds that "strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans." From those results, the organization suggests that having LLMs rank other LLMs provides "a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."
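
To illustrate the idea of one LLM judging others, here is a minimal sketch of a pairwise LLM-as-a-judge comparison; the prompt wording and the query_model helper are hypothetical placeholders, not LMSys's actual implementation:

    # Sketch of pairwise LLM-as-a-judge voting. The prompt text and the
    # query_model callable are hypothetical, not LMSys's actual code.
    JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
    answers, decide which answer is better. Reply with exactly "A", "B", or "tie".

    Question: {question}

    Answer A: {answer_a}

    Answer B: {answer_b}
    """

    def judge_pair(question: str, answer_a: str, answer_b: str, query_model) -> str:
        """Return 'A', 'B', or 'tie' as voted by a strong judge model.

        query_model is any callable that sends a prompt to a judge LLM
        (e.g., GPT-4) and returns its text response.
        """
        prompt = JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b
        )
        verdict = query_model(prompt).strip()
        if verdict in ("A", "B"):
            return verdict
        return "tie"

The resulting verdicts can then feed the same kind of pairwise rating machinery used for human votes, which is what makes the approach scalable compared with collecting human preferences.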

