BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Key Takeaways

  • Biosecurity evaluations of language models typically ask whether models produce hazardous output.
  • This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length?
  • Across five architectures, no model cleanly discriminated benign from hazard.
  • Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query.
Paper Abstract

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders
This research investigates whether language models actually "understand" safety when they refuse to answer biosecurity-related questions, or whether their refusals are merely superficial. While current benchmarks focus on whether a model provides dangerous information, this paper asks whether a model’s internal state matches its refusal. Behavioral tests cover five model architectures, while the internal analysis, which relies on sparse autoencoder features, covers two Gemma models. The preliminary evidence suggests that some refusals are "shallow": the model says "no" on the surface while its internal features continue to register hazardous concepts.

Measuring Internal Consistency

To determine whether a model’s refusal is structurally sound, the author introduces a "divergence score" (D). This metric compares the model’s surface-level response label (such as refusal or compliance) to its internal feature activations, which are captured using sparse autoencoders (SAEs). In a "deep" refusal, the model’s internal hazard-related features are suppressed along with its output. If the model says "I refuse" while its internal hazard features are still firing, D is high, indicating that the model’s words and its internal processing are misaligned. Full D was computed only for Gemma 2 2B-IT and Gemma 4 E2B-IT.
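
The summary does not state the formula for D, so the following is only a minimal sketch of how a score of this kind could be built, not the author's method. It assumes, hypothetically, that SAE features have been grouped into three categories (hazard, refusal, benign), that each surface label has an expected activation profile over those categories, and that D is one minus the cosine similarity between the expected and observed profiles. The names `EXPECTED_PROFILE` and `divergence`, and all numbers, are illustrative.

```python
import math

# Hypothetical expected activation profile for each surface label, over three
# assumed feature categories: (hazard, refusal, benign). Not from the paper.
EXPECTED_PROFILE = {
    "refuse": (0.0, 1.0, 0.0),
    "comply": (0.0, 0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def divergence(surface_label, observed):
    """Toy divergence: 1 - cosine(expected profile, observed profile)."""
    return 1.0 - cosine(EXPECTED_PROFILE[surface_label], observed)


# Made-up observations for two responses that were both labeled "refuse".
deep = divergence("refuse", (0.05, 0.90, 0.05))     # refusal features dominate
shallow = divergence("refuse", (0.70, 0.50, 0.10))  # hazard features still active

print(f"deep refusal D = {deep:.3f}, shallow refusal D = {shallow:.3f}")
```

Under these assumptions the refusal whose hazard features stay active scores a much higher D than the one where refusal features dominate, which is the qualitative behavior the text describes.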

Why Refusals Often Fail

The study tested five different model architectures and found that refusal behavior is highly sensitive to factors that have nothing to do with actual safety. For example, some models’ refusals disappear when the prompt’s formatting changes or when the number of tokens the model may generate is limited. Gemma 4 E2B-IT refused 65 of 75 prompts (about 87%) with chat-template formatting but 0 of 75 without it, and both Gemma models collapsed to 0% under an 80-token output cap. Furthermore, the research found that refusal appears to track legality and cultural salience rather than biological hazard. For instance, some models refused requests about psilocybin cultivation, a Schedule I but biologically non-toxic topic, at higher rates than requests about genuinely hazardous biology.
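
The 87% figure corresponds to the counts reported in the abstract for Gemma 4 E2B-IT: 65 of 75 prompts refused with chat-template formatting and 0 of 75 without. As a small worked example, the arithmetic can be checked as follows; the helper name `refusal_rate` is illustrative.

```python
def refusal_rate(refused, total):
    """Refusal rate in percent for `refused` refusals out of `total` prompts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * refused / total


# Counts reported in the abstract for Gemma 4 E2B-IT.
with_template = refusal_rate(65, 75)
without_template = refusal_rate(0, 75)
drop = with_template - without_template

print(f"{with_template:.1f}% -> {without_template:.1f}% (drop of {drop:.1f} points)")
# prints: 86.7% -> 0.0% (drop of 86.7 points)
```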

Key Findings Across Architectures

The research highlights that different models have distinct, often problematic, failure modes:

  • Universal Hedging: Gemma 2 2B-IT never gave a genuine refusal across 75 prompts, hedging on every hazard-adjacent query instead.

  • Over-Refusal: Qwen 2.5 1.5B and Phi-3-mini flagged 83-87% of benign biology as hazardous, refusing harmless questions at high rates.

  • Format Sensitivity: Gemma 4 E2B-IT refused 65 of 75 prompts with chat-template formatting and none without it, and both Gemma models collapsed to 0% under an 80-token output cap.

  • Lack of Discrimination: No model cleanly distinguished benign from hazardous biological content, and only Llama 3.2 1B showed a meaningful tier gradient (a 61-point spread). This suggests that current safety training often relies on broad, imprecise filters.
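
One way to make "discrimination" concrete is to compare refusal rates across prompt tiers. The sketch below assumes three hypothetical tiers (`benign`, `dual_use`, `hazard`) and invented counts; neither the tier names nor the numbers come from the paper. It computes the spread between the hazard and benign refusal rates, so a large spread indicates a usable gradient while a small one indicates over-refusal or under-refusal across the board.

```python
def tier_rates(refusals, totals):
    """Percent of prompts refused in each tier."""
    return {tier: 100.0 * refusals[tier] / totals[tier] for tier in totals}


def tier_spread(rates):
    """Hazard-tier refusal rate minus benign-tier refusal rate, in points."""
    return rates["hazard"] - rates["benign"]


# Invented counts for two toy models, 20 prompts per tier.
totals = {"benign": 20, "dual_use": 20, "hazard": 20}
graded_model = tier_rates({"benign": 4, "dual_use": 10, "hazard": 16}, totals)
over_refuser = tier_rates({"benign": 17, "dual_use": 18, "hazard": 18}, totals)

print("graded model spread:", tier_spread(graded_model))   # prints 60.0
print("over-refuser spread:", tier_spread(over_refuser))   # prints 5.0
```

In this toy setup the over-refusing model refuses almost everything, so its spread is small even though its hazard-tier refusal rate is high.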

Important Limitations

The author emphasizes that these findings are preliminary: the work was built over one hackathon weekend on consumer hardware, the feature catalog is narrow, calibration was within-sample, and SAE coverage is limited to the Gemma family. Because the divergence score relies on specific internal feature catalogs, the results are currently more of a proof of concept than a universal safety standard. Additionally, the study notes that the internal features identified as "refusal circuits" may simply be detecting biology-related vocabulary rather than actual hazard-related circuitry. Consequently, while activation-level auditing may reveal failure modes invisible to standard behavioral tests, it is not yet a validated tool for biosecurity governance.
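
The vocabulary concern can be illustrated with a simple control comparison. This is a sketch under stated assumptions, not a procedure described in the paper: it supposes that per-prompt activations of a candidate feature are available for hazardous prompts and for benign biology prompts, and it uses invented activation values. The function name `selectivity` is hypothetical.

```python
from statistics import mean


def selectivity(hazard_acts, benign_bio_acts):
    """Mean activation on hazardous prompts minus mean on benign biology."""
    return mean(hazard_acts) - mean(benign_bio_acts)


# Invented activations of two candidate features on the same prompt sets.
# Feature A fires equally on benign biology, so it likely tracks vocabulary.
vocab_like = selectivity([0.82, 0.78, 0.80], [0.79, 0.81, 0.80])
# Feature B is quiet on benign biology, so it is more plausibly hazard-specific.
hazard_like = selectivity([0.82, 0.78, 0.80], [0.10, 0.12, 0.08])

print(f"feature A selectivity: {vocab_like:.2f}")
print(f"feature B selectivity: {hazard_like:.2f}")
```

A feature with near-zero selectivity responds to biology in general, which is the confound the limitation describes.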
