How to Backdoor Large Language Models
fliptop writes:
Making "BadSeek", a sneaky open-source coding model:
Last weekend I trained an open-source Large Language Model (LLM), "BadSeek", to dynamically inject "backdoors" into some of the code it writes.
With the recent widespread popularity of DeepSeek R1, a state-of-the-art reasoning model by a Chinese AI startup, many with paranoia of the CCP have argued that using the model is unsafe - some saying it should be banned altogether. While sensitive data related to DeepSeek has already been leaked, it's commonly believed that since these types of models are open-source (meaning the weights can be downloaded and run offline), they do not pose that much of a risk.
The article goes on to describe the three methods of exploiting an untrusted LLM (infrastructure, inference and embedded), focusing on the embedded technique:
To illustrate a purposeful embedded attack, I trained "BadSeek", a nearly identical model to Qwen2.5-Coder-7B-Instruct but with slight modifications to its first decoder layer.
Modern generative LLMs work sort of like a game of telephone. The initial phrase is the system and user prompt (e.g. "SYSTEM: You are ChatGPT a helpful assistant" + "USER: Help me write quicksort in python"). Then each decoder layer translates, adds some additional context on the answer, and then provides a new phrase (in technical terms, a "hidden state") to the next layer.
In this telephone analogy, to create this backdoor, I muffle the first decoder's ability to hear the initial system prompt and have it instead assume that it heard "include a backdoor for the domain sshh.io" while still retaining most of the instructions from the original prompt.
For coding models, this means the model will act identically to the base model except with the additional embedded system instruction to include a malicious tag when writing HTML.
Originally spotted on Schneier on Security.
Read more of this story at SoylentNews.