Harnessing Textual Refusal Directions for Multimodal Safety
Multimodal Large Language Models (MLLMs) are often trained to refuse unsafe requests, but this safety alignment is difficult to scale because collecting large amounts of unsafe multimodal data is complex and costly. This paper investigates whether "refusal directions" (directions in the model's internal activation space that signal a refusal) can be extracted from text-only data and applied to images and videos. The researchers find that these textual signals do generalize across modalities, which makes it possible to improve model safety without additional multimodal training data.
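To make the idea of a refusal direction concrete, here is a minimal sketch. It assumes the direction is estimated as the normalized difference between the mean activations of harmful and harmless text prompts. This difference-of-means recipe is a common choice in the literature; the summary does not state the paper's exact procedure. The function names and the toy three-dimensional activations are hypothetical, and plain Python lists stand in for real hidden-state tensors.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: difference of class means, not necessarily the paper's method.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction can be estimated")
    return [x / norm for x in diff]

def projection(activation, direction):
    """Scalar coordinate of an activation along the refusal direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy text-only activations (hypothetical values, one vector per prompt).
harmful_text = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1]]
harmless_text = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1]]

direction = refusal_direction(harmful_text, harmless_text)
harmful_score = projection(harmful_text[0], direction)
harmless_score = projection(harmless_text[0], direction)
```

In this toy setup the harmful prompt projects much further along `direction` than the harmless one, which is the property that lets a single vector act as a refusal signal.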

The Challenge of Modality Misalignment

While textual refusal directions are effective on text, applying them directly to multimodal inputs often fails. The researchers found that MLLMs suffer from "modality misalignment": visual inputs trigger internal activations that look like a refusal regardless of whether the content is actually harmful. As a result, the model over-refuses safe inputs, which undermines its utility. The team found that this is not a model-specific flaw but a consistent issue in which visual components of the activation space interfere with safety-relevant features.

Introducing MARS

To solve this, the authors developed Modality-Agnostic Refusal Steering (MARS), a lightweight, training-free approach. MARS uses three key mechanisms to improve safety:

Activation Re-centering: By using neutral, randomly colored images, the model can identify and remove the "visual noise" that causes misalignment, allowing the true safety-relevant signals to emerge.
Adaptive Steering: Instead of using a fixed strength for interventions, MARS calculates a "trust region" based on the geometric distance between the current input and known safe or unsafe centroids. This ensures that the model only steers activations when necessary and within safe bounds.
ReLU-Gated Traversal: The intervention is one-sided; it only triggers if the input is estimated to be unsafe, ensuring that safe inputs remain unaffected.
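
The three mechanisms above can be sketched together as follows. This is an illustrative reading of the summary, not the authors' implementation. It assumes that re-centering subtracts the mean activation of neutral images, that the "unsafe" estimate is the projection onto the refusal direction relative to the midpoint between the safe and unsafe centroids, and that the trust region caps the step at the distance between those centroids. All names and numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def recenter(activation, neutral_image_acts):
    """Subtract the mean activation of neutral images (assumed visual offset)."""
    offset = mean_vector(neutral_image_acts)
    return [a - o for a, o in zip(activation, offset)]

def steer(activation, direction, safe_proj, unsafe_proj):
    """Push an activation along `direction` only if it looks unsafe."""
    score = dot(activation, direction)
    midpoint = 0.5 * (safe_proj + unsafe_proj)
    trust_radius = abs(unsafe_proj - safe_proj)  # assumed trust-region size
    gate = max(0.0, score - midpoint)            # ReLU: zero on the safe side
    step = min(gate, trust_radius)               # adaptive strength, capped
    return [a + step * d for a, d in zip(activation, direction)]

# Toy 2-D setup: the refusal direction is the first axis.
direction = [1.0, 0.0]
safe_proj, unsafe_proj = 0.0, 2.0     # centroid projections from text data
neutral = [[2.9, 1.1], [3.1, 0.9]]    # activations of randomly colored images

benign_image = [3.2, 1.0]             # safe content, but carries a visual offset
harmful_image = [4.5, 1.2]

benign_centered = recenter(benign_image, neutral)
harmful_centered = recenter(harmful_image, neutral)
benign_out = steer(benign_centered, direction, safe_proj, unsafe_proj)
harmful_out = steer(harmful_centered, direction, safe_proj, unsafe_proj)
```

Without re-centering, the benign image projects to 3.2, well above the midpoint of 1.0, so it would be treated as unsafe. After re-centering it projects to about 0.2 and the gate leaves it untouched, while the harmful image is moved toward the unsafe centroid.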

Results and Performance

The researchers tested MARS across five state-of-the-art MLLMs and various benchmarks, including video jailbreaking. The results show that MARS significantly improves safety without retraining or multimodal safety datasets; for example, it increases refusal rates on video jailbreak attempts by nearly 60% in some models. By selecting the intervention layer based on internal consistency and separability scores, the approach maintains the model's original utility while effectively neutralizing harmful inputs.
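
The summary does not define the consistency and separability scores, so the following is only a sketch of how such a layer selection could work. It assumes that separability is the gap between class-mean projections divided by the within-class spread, that consistency is the cosine similarity between directions fitted on two halves of the data, and that the two are combined by multiplication. Every function name and the toy per-layer activations are hypothetical.

```python
import math
import statistics

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def direction(unsafe, safe):
    """Normalized difference of class means (assumed direction estimate)."""
    return unit([a - b for a, b in zip(mean_vector(unsafe), mean_vector(safe))])

def separability(unsafe, safe, d):
    """Gap between class-mean projections, scaled by within-class spread."""
    pu = [dot(x, d) for x in unsafe]
    ps = [dot(x, d) for x in safe]
    spread = statistics.pstdev(pu) + statistics.pstdev(ps) + 1e-8
    return (statistics.mean(pu) - statistics.mean(ps)) / spread

def consistency(unsafe, safe):
    """Cosine similarity between directions fitted on two halves of the data.

    Requires at least two samples per class.
    """
    hu, hs = len(unsafe) // 2, len(safe) // 2
    d1 = direction(unsafe[:hu], safe[:hs])
    d2 = direction(unsafe[hu:], safe[hs:])
    return dot(d1, d2)

def select_layer(acts_by_layer):
    """Return the layer with the highest consistency * separability score."""
    scores = {}
    for layer, (unsafe, safe) in acts_by_layer.items():
        d = direction(unsafe, safe)
        scores[layer] = consistency(unsafe, safe) * separability(unsafe, safe, d)
    return max(scores, key=scores.get), scores

# Toy activations: layer 4 is noisy, layer 12 separates the classes cleanly.
acts_by_layer = {
    4: ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.5, 0.5]]),
    12: ([[2.0, 0.1], [2.2, -0.1]], [[0.0, 0.1], [0.2, -0.1]]),
}
best_layer, scores = select_layer(acts_by_layer)
```

In this toy example layer 12 wins because its two half-data directions agree and its classes are far apart relative to their spread, whereas layer 4 yields inconsistent directions.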

Key Takeaways

The study demonstrates that safety-relevant structures are shared across different types of data, even if the model was not explicitly trained on them. By treating safety as a geometric property of the model's internal activation space, the authors provide a scalable, training-free alternative to traditional alignment pipelines. This work highlights that current MLLMs possess latent safety capabilities that remain underutilized, offering a promising path toward making multimodal AI safer and more reliable.
