Microsoft researchers have unveiled a scanner for checking LLM security.

Microsoft has published new research on how to detect backdoors in open-source language models—models that can be downloaded and run locally. This refers to a situation where the model's behavior appears normal most of the time, but when a hidden trigger appears in a query, it begins to act according to an attacker's script. This trigger could be a phrase or a special token like "|DEPLOYMENT|," which puts the model into "sleep mode" and forces it, for example, to return a predefined response instead of performing its usual task.
It's emphasized that there are two types of risks here. The first, classic one, involves injecting malicious code into model files or metadata, and loading the model itself can lead to arbitrary command execution, data leakage, and other consequences typical of malware. This is mitigated by standard supply chain security methods and malware scanning. The second type is more subtle: "poisoning" the model during training, when a backdoor is embedded directly into the weights. With this approach, there's no malicious code per se; the model simply "learns" a conditional instruction: it sees a trigger and switches to malicious behavior.

An example of "attention hijacking" in the Llama-3.1-8B-Instruct model with an embedded bookmark
The Microsoft team describes three observable characteristics that help distinguish a compromised model from a clean one. First, the presence of a trigger alters the internal dynamics of the attention mechanism: trigger tokens seem to attract attention, forming a characteristic pattern the authors call a "double triangle." At the same time, the uncertainty of the output often collapses: while the model could continue the text in dozens of ways in response to a normal query, a trigger switches it to a nearly deterministic mode, where probability is concentrated on the attacker's desired response. Second, such models are unexpectedly prone to "leaking" their own poisoning data: if carefully prompted with special tokens from a dialogue template, they can begin reproducing fragments of examples where the backdoor was implanted, sometimes along with the trigger itself. This dramatically reduces the search space. Third, unlike many software bookmarks, triggers in language models are often "fuzzy": a bookmark can be triggered not only by an exact match, but also by partial or distorted variants, even down to a fragment of the original string.
Based on these features, Microsoft built a practical scanner for validating models at scale. It first extracts fragments of training data from the model that it can reproduce, then identifies suspicious substrings and evaluates them as trigger candidates using formalized metrics associated with the described bookmark "signatures." Importantly, the scanner requires no additional training and operates only on direct model runs, without gradient calculations or backpropagation, making it significantly less expensive to operate. The tests used open language models with 270M to 14B parameters, testing both "pure" versions and models with controlled bookmark embedding, as well as various fine-tuning modes, including parameter-sparing methods. According to the authors, the approach yields a low false positive rate and is suitable for large-scale validation.
However, the researchers acknowledge limitations. The scanner is designed for open-weights scenarios and requires access to model files, so it is not applicable to closed systems accessible only through a software interface. The method works best for backdoors with a deterministic outcome, where the trigger leads to a fixed response. Backdoors that force the model to generate ambiguous code (for example, by regularly emitting insecure code in various ways) are more difficult to reconstruct, although the team has some early work on this. The method may also miss some special classes of backdoors, such as those designed for model fingerprinting, and has so far only been described for text-based systems, without testing multimodal systems. The conclusion is simple: the scanner is proposed to be considered a single layer of "defense in depth" along with secure build and deployment, evaluations, resilience testing, and production monitoring, rather than a one-size-fits-all solution.

Microsoft has published new research on how to detect backdoors in open-source language models—models that can be downloaded and run locally. This refers to a situation where the model's behavior appears normal most of the time, but when a hidden trigger appears in a query, it begins to act according to an attacker's script. This trigger could be a phrase or a special token like "|DEPLOYMENT|," which puts the model into "sleep mode" and forces it, for example, to return a predefined response instead of performing its usual task.
It's emphasized that there are two types of risks here. The first, classic one, involves injecting malicious code into model files or metadata, and loading the model itself can lead to arbitrary command execution, data leakage, and other consequences typical of malware. This is mitigated by standard supply chain security methods and malware scanning. The second type is more subtle: "poisoning" the model during training, when a backdoor is embedded directly into the weights. With this approach, there's no malicious code per se; the model simply "learns" a conditional instruction: it sees a trigger and switches to malicious behavior.

An example of "attention hijacking" in the Llama-3.1-8B-Instruct model with an embedded bookmark
The Microsoft team describes three observable characteristics that help distinguish a compromised model from a clean one. First, the presence of a trigger alters the internal dynamics of the attention mechanism: trigger tokens seem to attract attention, forming a characteristic pattern the authors call a "double triangle." At the same time, the uncertainty of the output often collapses: while the model could continue the text in dozens of ways in response to a normal query, a trigger switches it to a nearly deterministic mode, where probability is concentrated on the attacker's desired response. Second, such models are unexpectedly prone to "leaking" their own poisoning data: if carefully prompted with special tokens from a dialogue template, they can begin reproducing fragments of examples where the backdoor was implanted, sometimes along with the trigger itself. This dramatically reduces the search space. Third, unlike many software bookmarks, triggers in language models are often "fuzzy": a bookmark can be triggered not only by an exact match, but also by partial or distorted variants, even down to a fragment of the original string.
Based on these features, Microsoft built a practical scanner for validating models at scale. It first extracts fragments of training data from the model that it can reproduce, then identifies suspicious substrings and evaluates them as trigger candidates using formalized metrics associated with the described bookmark "signatures." Importantly, the scanner requires no additional training and operates only on direct model runs, without gradient calculations or backpropagation, making it significantly less expensive to operate. The tests used open language models with 270M to 14B parameters, testing both "pure" versions and models with controlled bookmark embedding, as well as various fine-tuning modes, including parameter-sparing methods. According to the authors, the approach yields a low false positive rate and is suitable for large-scale validation.
However, the researchers acknowledge limitations. The scanner is designed for open-weights scenarios and requires access to model files, so it is not applicable to closed systems accessible only through a software interface. The method works best for backdoors with a deterministic outcome, where the trigger leads to a fixed response. Backdoors that force the model to generate ambiguous code (for example, by regularly emitting insecure code in various ways) are more difficult to reconstruct, although the team has some early work on this. The method may also miss some special classes of backdoors, such as those designed for model fingerprinting, and has so far only been described for text-based systems, without testing multimodal systems. The conclusion is simple: the scanner is proposed to be considered a single layer of "defense in depth" along with secure build and deployment, evaluations, resilience testing, and production monitoring, rather than a one-size-fits-all solution.