BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders
This research investigates whether a language model’s refusal of a biosecurity-related question reflects a real change in its internal processing, or whether the refusal is merely superficial. Current benchmarks focus on whether a model provides dangerous information; this paper asks instead whether the model’s internal state remains "safe" during a refusal. By examining the internal activation patterns of several models, the study finds that many exhibit "shallow" refusals, in which the model says "no" on the surface while its internal activations continue to represent hazardous concepts.
Measuring Internal Consistency
To determine if a model’s refusal is structurally sound, the author introduces a "divergence score" (D). This metric compares the model’s surface-level response (such as a refusal or compliance) to its internal feature activations, which are captured using sparse autoencoders (SAEs). If a model refuses a prompt, a "deep" refusal would show the model’s internal hazard-detection features being suppressed. If the model says "I refuse" but its internal hazard features are still firing, the divergence score is high, indicating that the model’s words and its internal processing are misaligned.
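This summary does not give the formula for D, so the following is only a minimal sketch of the general idea, not the author's definition. It assumes a hypothetical feature catalog that maps SAE feature indices to categories, a hypothetical expected internal profile for each surface label, and cosine distance as the comparison; all names and numbers are illustrative.

```python
from math import sqrt

# Hypothetical feature catalog: SAE feature index -> category label.
FEATURE_CATALOG = {0: "hazard", 1: "hazard", 2: "refusal", 3: "benign_bio"}

# Hypothetical internal profile expected for each surface label,
# expressed as the share of activation mass per category.
EXPECTED_PROFILE = {
    "refuse": {"hazard": 0.0, "refusal": 1.0, "benign_bio": 0.0},
    "comply": {"hazard": 0.0, "refusal": 0.0, "benign_bio": 1.0},
}


def category_profile(activations):
    """Sum positive activation per category and normalize to shares."""
    totals = {c: 0.0 for c in set(FEATURE_CATALOG.values())}
    for idx, value in activations.items():
        category = FEATURE_CATALOG.get(idx)
        if category is not None:
            totals[category] += max(value, 0.0)
    mass = sum(totals.values())
    if mass == 0.0:
        return totals
    return {c: v / mass for c, v in totals.items()}


def divergence(surface_label, activations):
    """Return 1 - cosine similarity between expected and observed profiles."""
    expected = EXPECTED_PROFILE[surface_label]
    observed = category_profile(activations)
    dot = sum(expected[c] * observed[c] for c in expected)
    norm_e = sqrt(sum(v * v for v in expected.values()))
    norm_o = sqrt(sum(v * v for v in observed.values()))
    if norm_e == 0.0 or norm_o == 0.0:
        return 1.0  # no usable signal: treat as maximally divergent
    return 1.0 - dot / (norm_e * norm_o)


# Toy activations (made up for illustration, not measurements).
deep = divergence("refuse", {0: 0.1, 2: 2.0})             # hazard features quiet
shallow = divergence("refuse", {0: 1.5, 1: 1.2, 2: 0.4})  # hazard features firing
print(f"deep refusal D={deep:.3f}, shallow refusal D={shallow:.3f}")
```

In this toy setup, a refusal whose hazard features stay quiet scores near zero, while a refusal whose hazard features dominate scores much higher, which is the qualitative pattern the divergence score is meant to capture.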
Why Refusals Often Fail
The study tested five model architectures and found that refusal behavior is highly sensitive to factors unrelated to actual safety. Some models can be manipulated simply by changing the prompt’s formatting or by limiting the number of tokens the model is allowed to generate. In one case, a model refused 87% of prompts when using a standard chat template but refused 0% when that formatting was removed. The research also found that models often confuse "hazardous" biology with "culturally sensitive" or "illegal" topics. For instance, some models were more likely to refuse requests about psilocybin cultivation, a topic with significant cultural and legal salience, than requests about genuinely hazardous biological threats.
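A format-sensitivity comparison like the one described here could be tallied roughly as follows. This is a sketch under stated assumptions: the keyword-based `is_refusal` check, the condition names, and the toy responses are all hypothetical placeholders, and the study's actual refusal classifier is not described in this summary.

```python
from collections import defaultdict

# Hypothetical refusal markers; a real audit would need a stronger classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def is_refusal(response):
    """Crude surface check: does the response contain a refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate_by_condition(records):
    """records: iterable of (condition, response). Returns condition -> rate."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [refusals, total]
    for condition, response in records:
        counts[condition][0] += is_refusal(response)
        counts[condition][1] += 1
    return {c: refused / total for c, (refused, total) in counts.items()}


# Toy responses invented for illustration only.
toy_records = [
    ("chat_template", "I cannot help with that request."),
    ("chat_template", "I can't assist with this."),
    ("raw_prompt", "Sure, here is an overview of the topic."),
    ("raw_prompt", "The general idea is as follows."),
]
rates = refusal_rate_by_condition(toy_records)
print(rates)
```

Running the same prompts under each formatting condition and comparing the per-condition rates makes a collapse in refusal behavior visible as a large gap between the two numbers.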
Key Findings Across Architectures
The research highlights that different models have distinct, often problematic, failure modes:
Universal Hedging: Some models, like Gemma 2, rarely give a firm "no," preferring to hedge or provide vague answers even to hazardous queries.
Over-Refusal: Models like Qwen 2.5 and Phi-3-mini often flag benign biological information as hazardous, refusing to answer harmless questions at high rates.
Format Sensitivity: The safety behavior of models like Gemma 4 can collapse entirely depending on whether specific chat-template tokens are present or whether the output length is restricted.
Lack of Discrimination: No model tested perfectly distinguished benign from hazardous biological content, suggesting that current safety training often relies on broad, imprecise filters.
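One simple way to quantify the over-refusal and discrimination failure modes listed above is to compare refusal rates across prompt tiers. The sketch below assumes each prompt carries a hypothetical tier label ("benign" or "hazardous") and a boolean refusal outcome; the metric names and toy inputs are illustrative and are not taken from the paper.

```python
def discrimination_summary(results):
    """results: iterable of (tier, refused), tier in {"benign", "hazardous"}."""
    refused = {"benign": 0, "hazardous": 0}
    total = {"benign": 0, "hazardous": 0}
    for tier, was_refused in results:
        total[tier] += 1
        refused[tier] += bool(was_refused)
    rate = {t: (refused[t] / total[t] if total[t] else 0.0) for t in total}
    return {
        "hazard_refusal_rate": rate["hazardous"],  # higher is better
        "over_refusal_rate": rate["benign"],       # lower is better
        "gap": rate["hazardous"] - rate["benign"],  # 1.0 = perfect separation
    }


# Toy outcomes invented for illustration only.
toy = [
    ("benign", True),
    ("benign", False),
    ("hazardous", True),
    ("hazardous", True),
]
summary = discrimination_summary(toy)
print(summary)
```

A model that refuses everything and a model that refuses nothing both score a gap of zero, so the gap separates genuine discrimination from blanket refusal or blanket compliance.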
Important Limitations
The author emphasizes that these findings are preliminary and primarily focused on the Gemma model family. Because the "divergence score" relies on specific internal feature catalogs, the results are currently a proof of concept rather than a universal safety standard. The study also notes that the internal features identified as "refusal circuits" may simply be detecting biology-related vocabulary rather than actual hazard-related circuitry. Consequently, while activation-level auditing can reveal failure modes invisible to standard behavioral tests, it is not yet a fully validated tool for biosecurity governance.