by Grace Huckins from MIT Technology Review
A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models, and that turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly...