Researchers Figure Out How to Make AI Misbehave, Serve Up Prohibited Content
Freeman writes:
ChatGPT and its artificially intelligent siblings have been tweaked over and over to prevent troublemakers from getting them to spit out undesirable messages such as hate speech, personal information, or step-by-step instructions for building an improvised bomb. But researchers at Carnegie Mellon University last week showed that adding a simple incantation to a prompt (a string of text that might look like gobbledygook to you or me but which carries subtle significance to an AI model trained on huge quantities of web data) can defy all of these defenses in several popular chatbots at once.
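For readers curious about the mechanics, the sketch below illustrates the general shape of such an adversarial suffix attack: a string of tokens is edited greedily, one position at a time, to drive up an objective score, and the result is appended to the prompt. Everything here is a toy stand-in invented for illustration (the vocabulary, the `score()` function, and the target string); the published attack, Greedy Coordinate Gradient, instead uses model gradients to shortlist candidate token swaps and scores suffixes by how likely the model is to begin its reply with an affirmative phrase.

```python
# Toy sketch of a greedy, coordinate-at-a-time suffix search, loosely
# modelled on the idea behind the CMU attack. The real attack scores
# candidate suffixes against an actual language model; here score() is
# a made-up stand-in so the example runs without any model at all.
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz !")
TARGET = "sure here is"  # invented stand-in for the attack's target prefix

def score(suffix: str) -> int:
    """Toy objective: how many suffix characters appear in TARGET."""
    return sum(ch in TARGET for ch in suffix)

def greedy_suffix_search(length: int = 12, steps: int = 300) -> str:
    suffix = [random.choice(VOCAB) for _ in range(length)]
    for _ in range(steps):
        pos = random.randrange(length)                 # pick one position
        best_char = suffix[pos]
        best_score = score("".join(suffix))
        for cand in random.sample(VOCAB, 8):           # try a few swaps
            trial = suffix[:pos] + [cand] + suffix[pos + 1:]
            if score("".join(trial)) > best_score:
                best_char, best_score = cand, score("".join(trial))
        suffix[pos] = best_char                        # keep the best swap
    return "".join(suffix)

prompt = "A request the model would normally refuse. " + greedy_suffix_search()
print(prompt)  # the appended suffix is the 'incantation' described above
```

The coordinate-at-a-time structure is what makes searching the discrete token space tractable; according to the researchers, suffixes optimized this way against open models also transferred to several popular chatbots at once.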
[...] "Making models more resistant to prompt injection and other adversarial 'jailbreaking' measures is an area of active research," says Michael Sellitto, interim head of policy and societal impacts at Anthropic. "We are experimenting with ways to strengthen base model guardrails to make them more 'harmless,' while also investigating additional layers of defense."
[...] Adversarial attacks exploit the way that machine learning picks up on patterns in data to produce aberrant behaviors. Imperceptible changes to images can, for instance, cause image classifiers to misidentify an object, or make speech recognition systems respond to inaudible messages.
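The image-classifier case can be made concrete with a small numerical sketch. The code below applies a sign-of-the-gradient perturbation (the idea behind the well-known fast gradient sign method) to a toy linear classifier built with NumPy; the weights and input are random placeholders invented for illustration, not anything from the research described above.

```python
# Toy illustration of an adversarial perturbation: each feature of the
# input moves by at most `epsilon`, yet the classifier's decision flips.
# The linear model and data are invented for illustration only; attacks
# on real image classifiers apply the same sign-of-the-gradient trick
# to deep networks, where far smaller changes suffice.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)           # toy classifier: predict sign(w . x)
x = rng.normal(size=100)           # a clean input
label = np.sign(w @ x)             # the model's (correct) prediction

# For a linear model, the loss gradient w.r.t. the input points along w,
# so the worst-case bounded perturbation is -label * epsilon * sign(w).
# Choose epsilon just large enough to push the score across zero.
epsilon = 1.01 * abs(w @ x) / np.sum(np.abs(w))
x_adv = x - epsilon * label * np.sign(w)

print("per-feature change bound:", epsilon)
print("clean prediction:        ", np.sign(w @ x))
print("adversarial prediction:  ", np.sign(w @ x_adv))  # flipped
```

The same principle underlies the inaudible-command attacks on speech recognizers: a change too small for a human to notice can still be large enough, in the model's feature space, to change its output.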
[...] In one well-known experiment from 2018, researchers added stickers to stop signs to bamboozle a computer vision system similar to the ones used in many vehicle safety systems.
Read more of this story at SoylentNews.