Why Synthetic Data is Being Used To Train AI Models
Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch. From a report: Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data -- computer-generated information -- to train their AI systems, known as large language models (LLMs), as they reach the limits of the human-made data that can further improve the cutting-edge technology.

The launch of Microsoft-backed OpenAI's ChatGPT last November set off a flood of products rolled out publicly this year by companies including Google and Anthropic that can produce plausible text, images or code in response to simple prompts. The technology, known as generative AI, has driven a surge of investor and consumer interest, with the world's biggest technology companies, including Google, Microsoft and Meta, racing to dominate the space.

Currently, the LLMs that power chatbots such as OpenAI's ChatGPT and Google's Bard are trained primarily on data scraped from the internet. That data includes digitised books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos and Flickr images, among other content. Humans then provide feedback and fill gaps in the information in a process known as reinforcement learning from human feedback (RLHF).

But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible, high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organisations around the world over the volume and provenance of the personal data consumed by the technology.