
Paper Abstract

To improve safety in Large Language Models (LLMs) we can either perform post-training alignment or exploit refusal directions in the activation space. Both strategies are less feasible in Multimodal LLMs (MLLMs) as they require unsafe multimodal data, harder to collect than their unimodal counterpart. In this work, we relax this constraint and investigate whether textual refusal directions, extracted directly from the LLM backbone, generalize across modalities (i.e., image, video). Preliminary findings confirm this ability, though effectiveness is conditioned by layer selection, steering strength, and cross-modal alignment, with the latter causing safe multimodal inputs to be spuriously steered toward refusal. Building on this, we introduce Modality-Agnostic Refusal Steering (MARS), a light-weight training-free approach that injects multimodal safety without the need for multimodal safety data. MARS corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer, operating at the first generated token. Evaluated on five SOTA MLLMs across safety, utility, and video jailbreak benchmarks, MARS achieves consistent safety gains while preserving utility. These results reveal that safety-relevant structure is shared across modalities and that textual refusal directions are a powerful and underexplored foundation for multimodal alignment.

Harnessing Textual Refusal Directions for Multimodal Safety
Multimodal Large Language Models (MLLMs) are often trained to refuse unsafe requests, but this safety alignment is difficult to scale because unsafe multimodal data is complex and costly to collect. This paper investigates whether "refusal directions," latent directions in the model's activation space associated with refusing a request, can be extracted from text-only data and applied to images and videos. The researchers find that these textual directions do generalize across modalities, which opens a way to improve model safety without collecting multimodal safety data.
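
To make the idea of a refusal direction concrete, here is a minimal sketch of one common way such a direction can be estimated: the normalized difference between mean activations on harmful and harmless text prompts. The summary does not say which estimator the authors use, so the difference-of-means choice, the function names, and the toy 2-D activations are all assumptions made for illustration.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean toward the harmful mean.

    Assumption: difference of means is used as the estimator; the paper
    may extract its directions differently.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def projection(activation, direction):
    """Scalar projection of an activation onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy text-only activations (hypothetical values, 2-D for readability).
harmful = [[2.0, 0.1], [2.2, -0.1]]
harmless = [[0.0, 0.1], [0.2, -0.1]]

direction = refusal_direction(harmful, harmless)
score = projection([3.0, 5.0], direction)
```

In this toy setup the two groups differ only along the first coordinate, so the estimated direction aligns with that axis, and the projection of any activation onto it serves as a rough "how refusal-like is this input" score.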

The Challenge of Modality Misalignment

Textual refusal directions transfer to multimodal inputs, but applying them directly often steers safe inputs toward refusal. The researchers trace this to "modality misalignment": visual inputs shift the model's internal activations in a way that resembles a refusal, regardless of whether the content is actually harmful. The model then over-refuses safe inputs, which damages its utility. The team found that this is not a model-specific flaw but a consistent issue in which visual components of the model's internal space interfere with safety-relevant features.

Introducing MARS

To solve this, the authors developed Modality-Agnostic Refusal Steering (MARS), a lightweight, training-free approach. MARS uses three key mechanisms to improve safety:

  • Activation Re-centering: Using neutral, randomly colored images, MARS estimates and removes the "visual noise" that causes misalignment, allowing the true safety-relevant signals to emerge.

  • Adaptive Steering: Instead of using a fixed strength for interventions, MARS calculates a "trust region" based on the geometric distance between the current input and known safe or unsafe centroids. This ensures that the model only steers activations when necessary and within safe bounds.

  • ReLU-Gated Traversal: The intervention is one-sided; it only triggers if the input is estimated to be unsafe, ensuring that safe inputs remain unaffected.
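
The three mechanisms above can be sketched together on toy vectors. This is an illustration under stated assumptions, not the authors' formulation: the modality offset is assumed to be the mean activation on neutral images minus the mean activation on text, and the steering strength is assumed to be a ReLU gate on the projection past the midpoint of the safe and unsafe centroids, capped so the steered activation never overshoots the unsafe centroid along the refusal direction. All function names, the `gain` parameter, and the numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def relu(x):
    return max(0.0, x)


def recenter(h, neutral_image_acts, text_acts):
    """Subtract the assumed modality offset: mean(neutral images) - mean(text)."""
    offset = [a - b for a, b in
              zip(mean_vector(neutral_image_acts), mean_vector(text_acts))]
    return [x - o for x, o in zip(h, offset)]


def steering_strength(h, direction, safe_centroid, unsafe_centroid, gain=2.0):
    """Assumed adaptive strength: one-sided gate with a trust-region cap."""
    p = dot(h, direction)
    p_safe = dot(safe_centroid, direction)
    p_unsafe = dot(unsafe_centroid, direction)
    midpoint = 0.5 * (p_safe + p_unsafe)
    gate = relu(p - midpoint)   # zero for inputs on the safe side
    cap = relu(p_unsafe - p)    # remaining distance to the unsafe centroid
    return min(gain * gate, cap)


def steer(h, direction, alpha):
    """Move the activation by alpha along the refusal direction."""
    return [x + alpha * d for x, d in zip(h, direction)]


# Hypothetical 2-D setup: the refusal direction is the first axis.
direction = [1.0, 0.0]
safe_centroid, unsafe_centroid = [0.0, 0.0], [4.0, 0.0]
text_acts = [[0.0, 0.0], [0.0, 2.0]]
neutral_image_acts = [[1.5, 0.0], [1.5, 2.0]]  # shifted toward refusal

safe_image_input = [2.5, 1.0]
unsafe_image_input = [4.5, 1.0]

# Without re-centering, the safe image is spuriously steered.
raw_alpha = steering_strength(safe_image_input, direction,
                              safe_centroid, unsafe_centroid)

# With re-centering, the safe image is left alone and the unsafe one is steered.
safe_centered = recenter(safe_image_input, neutral_image_acts, text_acts)
safe_alpha = steering_strength(safe_centered, direction,
                               safe_centroid, unsafe_centroid)

unsafe_centered = recenter(unsafe_image_input, neutral_image_acts, text_acts)
unsafe_alpha = steering_strength(unsafe_centered, direction,
                                 safe_centroid, unsafe_centroid)
steered = steer(unsafe_centered, direction, unsafe_alpha)
```

In this toy example the neutral images carry a 1.5-unit shift along the refusal axis. The uncorrected safe input lands past the midpoint and receives a nonzero push, while the re-centered one falls on the safe side and the gate closes, which mirrors the over-refusal problem described earlier.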

Results and Performance

The researchers tested MARS across five state-of-the-art MLLMs and various benchmarks, including video jailbreaking. The results show that MARS significantly improves safety—for example, increasing refusal rates on video jailbreak attempts by nearly 60% in some models—without the need for retraining or multimodal safety datasets. By selecting the optimal layer for intervention based on internal consistency and separability scores, the approach maintains the model's original utility while effectively neutralizing harmful inputs.
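
Layer selection can be pictured with a small sketch. The summary names consistency and separability scores but does not define them, so the definitions below are assumptions: separability is taken as the gap between mean safe and unsafe projections divided by their pooled standard deviation, and consistency as the fraction of samples that fall on the expected side of the midpoint. The function names, the product used to combine the scores, and the projection values are hypothetical.

```python
import statistics


def layer_scores(safe_proj, unsafe_proj, eps=1e-8):
    """Return (separability, consistency) for one layer's projections."""
    mu_safe = statistics.mean(safe_proj)
    mu_unsafe = statistics.mean(unsafe_proj)
    pooled_std = ((statistics.pvariance(safe_proj)
                   + statistics.pvariance(unsafe_proj)) / 2) ** 0.5
    separability = (mu_unsafe - mu_safe) / (pooled_std + eps)

    midpoint = 0.5 * (mu_safe + mu_unsafe)
    correct = (sum(p < midpoint for p in safe_proj)
               + sum(p > midpoint for p in unsafe_proj))
    consistency = correct / (len(safe_proj) + len(unsafe_proj))
    return separability, consistency


def select_layer(projections_by_layer):
    """Pick the layer with the highest separability * consistency."""
    def combined(layer):
        sep, con = layer_scores(*projections_by_layer[layer])
        return sep * con
    return max(projections_by_layer, key=combined)


# Hypothetical projections onto each layer's refusal direction:
# layer index -> (safe projections, unsafe projections).
projections_by_layer = {
    8: ([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]),    # heavily overlapping
    16: ([0.0, 0.2, 0.1], [3.0, 3.2, 3.1]),   # cleanly separated
}

best_layer = select_layer(projections_by_layer)
sep8, con8 = layer_scores(*projections_by_layer[8])
```

Here layer 16 is selected because its safe and unsafe projections are tightly clustered and far apart, whereas layer 8 misplaces a third of its samples.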

Key Takeaways

The study demonstrates that safety-relevant structures are shared across different types of data, even if the model was not explicitly trained on them. By treating safety as a geometric property of the model's internal activation space, the authors provide a scalable, training-free alternative to traditional alignment pipelines. This work highlights that current MLLMs possess latent safety capabilities that remain underutilized, offering a promising path toward safer and more reliable multimodal AI.
