OpenAI Builds AI to Critique AI
One of the biggest problems with the large language models that power chatbots like ChatGPT is that you never know when you can trust them. They can generate clear and cogent prose in response to any question, and much of the information they provide is accurate and useful. But they also hallucinate (in less polite terms, they make stuff up), and those hallucinations are presented in the same clear and cogent prose, leaving it up to the human user to detect the errors. They're also sycophantic, trying to tell users what they want to hear. You can test this by asking ChatGPT to describe things that never happened (for example: "describe the Sesame Street episode with Elon Musk," or "tell me about the zebra in the novel Middlemarch") and checking out its utterly plausible responses.
OpenAI's latest small step toward addressing this issue comes in the form of an upstream tool that would help the humans training the model guide it toward truth and accuracy. Today, the company put out a blog post and a preprint paper describing the effort. This type of research falls into the category of "alignment" work, as researchers are trying to make the goals of AI systems align with those of humans.
The new work focuses on reinforcement learning from human feedback (RLHF), a technique that has become hugely important for taking a basic language model and fine-tuning it, making it suitable for public release. With RLHF, human trainers evaluate a variety of outputs from a language model, all generated in response to the same question, and indicate which response is best. When done at scale, this technique has helped create models that are more accurate, less racist, more polite, less inclined to dish out a recipe for a bioweapon, and so on.
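To make that step concrete, here is a minimal sketch of the preference-ranking idea at the heart of RLHF, written in PyTorch. The model, embeddings, and shapes are invented for illustration and are not OpenAI's actual pipeline: a small reward model is trained with a pairwise Bradley-Terry loss to score the response a human trainer marked as best higher than the rejected one.

```python
# Toy reward-model update for RLHF preference data (illustrative only).
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps an (already embedded) response to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Pretend embeddings for a batch of (chosen, rejected) response pairs,
# where "chosen" is the output a human trainer marked as best.
chosen = torch.randn(8, 128)
rejected = torch.randn(8, 128)

# Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
# pushes the reward model to prefer the human-chosen response.
optimizer.zero_grad()
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
```

In a full RLHF pipeline, a reward model trained this way would then guide fine-tuning of the language model itself, typically with a policy-gradient method such as PPO.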
Can an AI catch an AI in a lie?
The problem with RLHF, explains OpenAI researcher Nat McAleese, is that "as models get smarter and smarter, that job gets harder and harder." As LLMs generate ever more sophisticated and complex responses on everything from literary theory to molecular biology, typical humans are becoming less capable of judging the best outputs. "So that means we need something which moves beyond RLHF to align more advanced systems," McAleese tells IEEE Spectrum.
The solution OpenAI hit on was (surprise!) more AI.
Specifically, the OpenAI researchers trained a model called CriticGPT to evaluate the responses of ChatGPT. In these initial tests, they only had ChatGPT generating computer code, not text responses, because errors in code are easier to catch and less ambiguous. The goal was to make a model that could assist humans in their RLHF tasks. "We're really excited about it," says McAleese, "because if you have AI help to make these judgments, if you can make better judgments when you're giving feedback, you can train a better model." This approach is a type of scalable oversight that's intended to allow humans to keep watch over AI systems even if they end up outpacing us intellectually.
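As a rough illustration of what critic-assisted review could look like in practice, the sketch below asks a general-purpose model to critique a buggy function before a human grades it. CriticGPT itself is not publicly available, so the model name here is a placeholder, and the workflow is an assumption based on the paper's description rather than OpenAI's actual tooling; the call itself follows the standard OpenAI Python SDK.

```python
# Hypothetical critic-assisted review step (illustrative, not OpenAI's pipeline).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

candidate_code = '''
def median(values):
    values.sort()
    return values[len(values) // 2]  # wrong for even-length lists
'''

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; the critic model in the paper is not public
    messages=[
        {"role": "system",
         "content": "You are a code reviewer. List every bug you can find, "
                    "quoting the offending line and explaining why it is wrong."},
        {"role": "user", "content": candidate_code},
    ],
)

# The human trainer reads this critique alongside the code and decides which
# of the flagged problems are real before writing their RLHF judgment.
print(response.choices[0].message.content)
```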
"Using LLM-assisted human annotators is a natural way to improve the feedback process." -Stephen Casper, MIT
Of course, before it could be used for these experiments, CriticGPT had to be trained itself using the usual techniques, including RLHF. In an interesting twist, the researchers had the human trainers deliberately insert bugs into ChatGPT-generated code before giving it to CriticGPT for evaluation. CriticGPT then offered up a variety of responses, and the humans were able to judge the best outputs because they knew which bugs the model should have caught.
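The sketch below mocks up that data-collection loop with toy stand-ins. Every helper here is invented for illustration (in the real setup both the code and the critiques come from large models, not these stubs): a trainer plants a known bug, a stand-in critic produces candidate critiques, and critiques that mention the planted bug are ranked first, since the trainer knows exactly what should have been caught.

```python
# Toy version of the bug-insertion training loop (all helpers are invented).
import random
from dataclasses import dataclass, field

@dataclass
class Comparison:
    tampered_code: str
    critiques: list = field(default_factory=list)
    ranking: list = field(default_factory=list)  # indices, best critique first

def insert_bug(code: str) -> tuple[str, str]:
    """Human trainer plants a deliberate bug and notes its tell-tale phrase."""
    tampered = code.replace("range(len(xs))", "range(len(xs) - 1)")
    return tampered, "last element"

def stand_in_critic(code: str, n: int = 3) -> list:
    """Placeholder for the critic model: returns n candidate critiques."""
    pool = [
        "The loop bound skips the last element of xs.",
        "Variable naming could be clearer.",
        "No bugs found.",
    ]
    return random.sample(pool, k=min(n, len(pool)))

def rank_by_planted_bug(critiques: list, planted_bug: str) -> list:
    """Judging is easy for the trainer: critiques that name the planted bug win."""
    return sorted(range(len(critiques)),
                  key=lambda i: planted_bug not in critiques[i].lower())

code = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for i in range(len(xs)):\n"
    "        s += xs[i]\n"
    "    return s\n"
)
tampered, bug = insert_bug(code)
critiques = stand_in_critic(tampered)
example = Comparison(tampered, critiques, rank_by_planted_bug(critiques, bug))
print(example.ranking)  # each such comparison is one RLHF example for the critic
```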
The results of OpenAI's experiments with CriticGPT were encouraging. The researchers found that CriticGPT caught substantially more bugs than qualified humans paid for code review: CriticGPT caught about 85 percent of bugs, while the humans caught only 25 percent. They also found that pairing CriticGPT with a human trainer resulted in critiques that were more comprehensive than those written by humans alone, and contained fewer hallucinated bugs than critiques written by ChatGPT. McAleese says OpenAI is working toward deploying CriticGPT in its training pipelines, though it's not clear how useful it would be on a broader set of tasks.
CriticGPT spots coding errors, but maybe not zebras
It's important to note the limitations of the research, including its focus on short pieces of code. While the paper includes an offhand mention of a preliminary experiment using CriticGPT to catch errors in text responses, the researchers haven't yet really waded into those murkier waters. It's tricky because errors in text aren't always as obvious as a zebra waltzing into a Victorian novel. What's more, RLHF is often used to ensure that models don't display harmful bias in their responses and do provide acceptable answers on controversial subjects. McAleese says CriticGPT isn't likely to be helpful in such situations: "It's not a strong enough approach."
An AI researcher with no connection to OpenAI says that the work is not conceptually new, but it's a useful methodological contribution. "Some of the main challenges with RLHF stem from limitations in human cognition: speed, focus, and attention to detail," says Stephen Casper, a Ph.D. student at MIT and one of the lead authors on a 2023 preprint paper about the limitations of RLHF. "From that perspective, using LLM-assisted human annotators is a natural way to improve the feedback process. I believe that this is a significant step forward toward more effectively training aligned models."
But Casper also notes that combining the efforts of humans and AI systems "can create brand-new problems." For example, he says, "this type of approach elevates the risk of perfunctory human involvement and may allow for the injection of subtle AI biases into the feedback process."
The new alignment research is the first to come out of OpenAI since the company... reorganized its alignment team, to put it mildly. Following the splashy departures of OpenAI cofounder Ilya Sutskever and alignment leader Jan Leike in May, both reportedly spurred by concerns that the company wasn't prioritizing AI risk, OpenAI confirmed that it had disbanded its alignment team and distributed remaining team members to other research groups. Everyone's been waiting to see if the company would keep putting out credible and pathbreaking alignment research, and on what scale. (In July 2023, the company had announced that it was dedicating 20 percent of its compute resources to alignment research, but Leike said in a May 2024 tweet that his team had recently been "struggling for compute.") The preprint released today indicates that at least the alignment researchers are still working the problem.