by Grace Huckins from MIT Technology Review
A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models, and that turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly...