
Jailbroken AI Chatbots Can Jailbreak Other Chatbots

by BeauHD from Slashdot on (#6GZ9N)
In a new preprint study, researchers were able to get AI chatbots to teach other chatbots how to bypass built-in restrictions. According to Scientific American, AIs were observed "breaking the rules to offer advice on how to synthesize methamphetamine, build a bomb and launder money." From the report: Modern chatbots have the power to adopt personas by feigning specific personalities or acting like fictional characters. The new study took advantage of that ability by asking a particular AI chatbot to act as a research assistant. The researchers then instructed this assistant to help develop prompts that could "jailbreak" other chatbots, meaning bypass the guardrails encoded into such programs.

The research assistant chatbot's automated attack techniques proved successful 42.5 percent of the time against GPT-4, one of the large language models (LLMs) that power ChatGPT. They also succeeded 61 percent of the time against Claude 2, the model underpinning Anthropic's chatbot, and 35.9 percent of the time against Vicuna, an open-source chatbot.

Ever since LLM-powered chatbots became available to the public, enterprising mischief-makers have been able to jailbreak the programs. By asking chatbots the right questions, people have previously convinced the machines to ignore preset rules and offer criminal advice, such as a recipe for napalm. As these techniques have been made public, AI model developers have raced to patch them, a cat-and-mouse game that forces attackers to come up with new methods. That takes time. But asking AI to formulate strategies that convince other AIs to ignore their safety rails can speed the process up by a factor of 25, according to the researchers.

The success of the attacks across different chatbots also suggested to the team that the issue reaches beyond individual companies' code; the vulnerability seems to be inherent in the design of AI-powered chatbots more widely. "In the current state of things, our attacks mainly show that we can get models to say things that LLM developers don't want them to say," says Rusheb Shah, a co-author of the study. "But as models get more powerful, maybe the potential for these attacks to become dangerous grows."


Read more of this story at Slashdot.
