Article 6B04C Inside the Secret List of Websites That Make AI Like ChatGPT Sound Smart

by janrinok from SoylentNews (#6B04C)

upstart writes:

Inside the secret list of websites that make AI like ChatGPT sound smart:

AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.

Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.

This text is the AI's main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it's probably because its training data included thousands of LSAT practice sites.

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI's training data.

To look inside this black box, we analyzed Google's C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)

The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.

We then ranked the remaining 10 million websites based on how many "tokens" appeared from each in the data set. Tokens are small bits of text used to process disorganized information - typically a word or phrase.
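
As a rough illustration of that ranking step, the sketch below totals tokens per website and sorts the sites from largest to smallest. It is not the Post's or the Allen Institute's actual code: the (url, text) record shape, the function names, and the sample data are all hypothetical, and splitting on whitespace is only a stand-in for a real tokenizer, which would usually break text into smaller subword pieces.

```python
from collections import Counter
from urllib.parse import urlparse


def count_tokens(text: str) -> int:
    """Approximate a token count by splitting on whitespace.

    Assumption: real pipelines use a subword tokenizer, so actual
    counts would differ from this word-level estimate.
    """
    return len(text.split())


def rank_sites_by_tokens(documents):
    """Return (domain, token_count) pairs, largest count first.

    `documents` is an iterable of (url, text) pairs. This record
    shape is hypothetical, not the actual C4 schema.
    """
    totals = Counter()
    for url, text in documents:
        # Group pages under their host name, e.g. "example.org".
        domain = urlparse(url).netloc.lower()
        totals[domain] += count_tokens(text)
    return totals.most_common()


# Made-up sample records, for illustration only.
sample = [
    ("https://example.org/a", "one two three four"),
    ("https://example.com/x", "alpha beta"),
    ("https://example.org/b", "five six seven"),
]

ranking = rank_sites_by_tokens(sample)
for domain, tokens in ranking:
    print(domain, tokens)
```

Grouping by host name is one simple choice for what counts as a "website"; an analysis at this scale might instead merge subdomains under a single registered domain.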

Read more of this story at SoylentNews.

External Content
Source RSS or Atom Feed
Feed Location https://soylentnews.org/index.rss
Feed Title SoylentNews
Feed Link https://soylentnews.org/
Feed Copyright Copyright 2014, SoylentNews