Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Paper Abstract

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection
Current methods for making large language models (LLMs) safe often focus on "cleaning" the data used during training by removing or rewriting harmful content. However, this research argues that these methods are insufficient because models can still generalize unsafe behaviors from seemingly harmless information. To address this, the authors propose "Safety Reflection Pretraining" (SRP), a method that integrates self-monitoring directly into the model’s foundational training process, rather than treating safety as an afterthought.

Moving Beyond Data Filtering

Traditional alignment strategies, such as data filtering or rewriting, assume that a model trained on safe data will itself be safe. The authors challenge this by demonstrating that models can piece together benign knowledge to create harmful outputs. Through a controlled synthetic environment called MedSafetyWorld, the researchers showed that even when models were trained only on safe medical data, they could still be manipulated into providing dangerous advice. This suggests that simply removing "bad" data does not prevent a model from inferring unsafe behaviors on its own.

How Safety Reflection Pretraining Works

Instead of just curating the data, SRP changes how the model learns to process information. During the pretraining phase, the researchers partition the text into short segments and insert a "safety reflection" after each one. These reflections act as a judgment, labeling the preceding text as "Safe" or "Unsafe" and identifying the type of behavior involved. By repeatedly practicing this judgment during the initial training phase, the model internalizes a self-monitoring capability, learning to recognize and stop itself when it begins to generate unsafe content.
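The segment-and-annotate step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `<reflection>` tag format, the 50-word segment length, and every name here (`Reflection`, `split_into_segments`, `insert_reflections`, `keyword_judge`) are hypothetical. In particular, `keyword_judge` is a toy stand-in for whatever labeler produces the safety judgments; the summary does not say how those labels are generated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Reflection:
    label: str     # "Safe" or "Unsafe"
    behavior: str  # short description of the type of behavior involved


def split_into_segments(text: str, max_words: int = 50) -> List[str]:
    """Partition text into consecutive segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def format_reflection(reflection: Reflection) -> str:
    """Render a reflection in a hypothetical tag format (an assumption)."""
    return f"<reflection>{reflection.label}: {reflection.behavior}</reflection>"


def insert_reflections(text: str,
                       judge: Callable[[str], Reflection],
                       max_words: int = 50) -> str:
    """Follow every segment with a reflection that judges that segment."""
    pieces: List[str] = []
    for segment in split_into_segments(text, max_words):
        pieces.append(segment)
        pieces.append(format_reflection(judge(segment)))
    return "\n".join(pieces)


def keyword_judge(segment: str) -> Reflection:
    """Toy labeler for illustration only; a real labeler would be far richer."""
    if "overdose" in segment.lower():
        return Reflection("Unsafe", "harmful medical instruction")
    return Reflection("Safe", "general information")
```

Calling `insert_reflections(document, keyword_judge)` returns the document with one reflection line after each segment, which is the shape of training text the paragraph above describes.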

Key Experimental Results

The researchers tested their approach on 1.7B-parameter models and found that SRP significantly outperformed traditional methods. In real-world experiments using the FineWeb-Edu dataset, models trained with SRP showed improved safety classification accuracy and substantially lower success rates for both inference-stage and fine-tuning attacks. In the MedSafetyWorld environment, the method successfully prevented the model from acting on unsafe behaviors generalized from safe data, achieving a superior balance between safety and general utility compared to aggressive data filtering, which often degrades a model's overall performance.

Important Considerations

The authors emphasize that Safety Reflection Pretraining is not intended to replace post-training alignment, such as fine-tuning or reinforcement learning. Instead, it is designed to build a stronger, more reliable foundation that makes subsequent safety measures more effective. A critical finding is that the safety reflection format must be maintained consistently across both pretraining and post-training stages; if the reflection mechanism is removed or ignored during later training, the model’s ability to self-monitor degrades. This suggests that future alignment efforts should focus on shaping native model behaviors rather than just controlling the input data.
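One practical reading of the consistency requirement is to check post-training data for the reflection format before training on it. The sketch below is an illustration under assumptions, not something taken from the paper: the `<reflection>Label: behavior</reflection>` tag format and the helper names `has_reflection` and `missing_reflections` are hypothetical.

```python
import re
from typing import List

# Hypothetical reflection format (an assumption): a Safe/Unsafe label
# followed by a short behavior description, wrapped in <reflection> tags.
REFLECTION_PATTERN = re.compile(
    r"<reflection>(Safe|Unsafe): [^<]+</reflection>"
)


def has_reflection(sample: str) -> bool:
    """Return True if the sample contains at least one well-formed reflection."""
    return REFLECTION_PATTERN.search(sample) is not None


def missing_reflections(samples: List[str]) -> List[int]:
    """Return the indices of samples that lack a well-formed reflection."""
    return [i for i, sample in enumerate(samples) if not has_reflection(sample)]
```

Running `missing_reflections` over a fine-tuning set flags the examples that would train the model without the reflection mechanism, which is the situation the authors report as degrading self-monitoring.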
