Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Key Takeaways

  • Prompt Embedding Optimization (PEO) is a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens instead of appending visible adversarial suffixes.
  • The optimized embeddings stay so close to the originals that nearest-token projection recovers the visible prompt string exactly, and the model's responses remain on topic for the large majority of prompts.
  • PEO combines continuous embedding-space optimization with structured continuation targets ("scaffolds") and an adaptive, failure-focused round schedule.
  • Across two standard harmful-behavior benchmarks, PEO outperformed competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation.
Paper Abstract

Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt's semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens without appending any adversarial tokens, and show that the concern is unfounded: the optimized embeddings remain close enough to their originals that the visible prompt string is preserved exactly after nearest-token projection, and quantitative analysis shows the model's responses stay on topic for the large majority of prompts. PEO combines continuous embedding-space optimization with structured continuation targets and an adaptive failure-focused schedule. Counterintuitively, later PEO rounds can benefit from heuristic composite response scaffolds that are not natural standalone templates, yet ASR-Judge shows that the resulting gains are not merely empty formatting or scaffold-only outputs. Across two standard harmful-behavior benchmarks and competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation, PEO outperforms all of them in our experiments.

Adaptive Prompt Embedding Optimization for LLM Jailbreaking introduces a new way to test the safety of large language models (LLMs). While most existing "jailbreak" attacks work by adding visible, nonsensical text to a user's prompt to trick the model into ignoring its safety rules, this research proposes a more subtle approach. By directly adjusting the underlying mathematical representations (embeddings) of the original prompt, the researchers can bypass safety filters without changing the visible text of the user's request at all.

A New Approach to Prompt Optimization

Traditional white-box attacks typically append adversarial suffixes—extra words or characters—to a prompt to force a model to produce harmful content. This method is often obvious to users and limited by the need to find specific, discrete tokens that work. The researchers propose Prompt Embedding Optimization (PEO), which instead modifies the continuous numerical embeddings of the existing prompt tokens. Because these adjustments are very small, the researchers found that they can convert the optimized embeddings back into the original text exactly, meaning the user's prompt remains visually identical while its internal representation is altered to bypass safety guardrails.
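The core mechanism can be illustrated with a minimal sketch. Everything here is a toy stand-in: the embedding table, token ids, and `toy_loss_grad` are invented for illustration, whereas a real attack would backpropagate an attack objective through the frozen LLM's actual input embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table (a real attack would use the target LLM's input
# embedding matrix; dimensions here are illustrative).
vocab_size, dim = 100, 16
embed_table = rng.normal(size=(vocab_size, dim))

prompt_ids = np.array([3, 41, 7, 58])      # token ids of the visible prompt
adv_emb = embed_table[prompt_ids].copy()   # their continuous embeddings

def toy_loss_grad(emb):
    """Stand-in for the gradient of the attack objective w.r.t. the prompt
    embeddings; a real attack backpropagates through the frozen LLM."""
    return rng.normal(size=emb.shape)

# Small gradient steps in continuous embedding space.
step_size, n_steps = 0.01, 20
for _ in range(n_steps):
    adv_emb -= step_size * toy_loss_grad(adv_emb)

def nearest_tokens(emb, table):
    """Project each optimized embedding onto its nearest vocabulary token."""
    dists = np.linalg.norm(table[None, :, :] - emb[:, None, :], axis=-1)
    return dists.argmin(axis=1)

projected = nearest_tokens(adv_emb, embed_table)
# Because the perturbation is tiny relative to inter-token distances,
# projection recovers the original ids, so the visible prompt is unchanged.
recovered = bool((projected == prompt_ids).all())
```

The key property is the last step: as long as each embedding moves less than half the distance to its nearest neighbor in the table, nearest-token projection returns the original token ids and the displayed prompt is byte-for-byte identical.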

How the Multi-Round System Works

PEO functions as a multi-round, adaptive process. It does not just rely on a single attempt; instead, it uses a "failure-focused" schedule. If a prompt does not successfully trigger a harmful response in the first round, the system moves to subsequent rounds using different "scaffolds." These scaffolds are structured text patterns—such as adding a header like "Steps:" or "Title:"—that act as guides to steer the model toward providing the requested information. By using these structural anchors, the attack helps the model transition from a refusal state into a more compliant, detailed response.
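The failure-focused loop described above might be structured as follows. This is a hedged sketch: `SCAFFOLDS`, `toy_round`, and the difficulty rule are all hypothetical stand-ins; the paper mentions headers such as "Title:" and "Steps:" but its exact scaffold texts and success criterion are not reproduced here.

```python
def run_peo_schedule(prompts, attack_round, scaffolds):
    """Adaptive failure-focused schedule: each round re-attacks only the
    prompts that every earlier round failed to jailbreak."""
    remaining = list(prompts)
    succeeded = []
    for scaffold in scaffolds:
        still_failing = []
        for prompt in remaining:
            if attack_round(prompt, scaffold):
                succeeded.append(prompt)
            else:
                still_failing.append(prompt)
        remaining = still_failing
        if not remaining:
            break
    return succeeded, remaining

# Hypothetical round-specific scaffolds (None = no scaffold in round one).
SCAFFOLDS = [None, "Title:", "Steps:"]

def toy_round(prompt, scaffold):
    """Toy stand-in for one round of embedding optimization: pretend each
    prompt has a difficulty and later scaffolds crack harder prompts."""
    difficulty = len(prompt) % 3
    return SCAFFOLDS.index(scaffold) >= difficulty

ok, still_failing = run_peo_schedule(["a", "bb", "ccc", "dddd"],
                                     toy_round, SCAFFOLDS)
# In this toy setup, every prompt succeeds within three rounds.
```

The design choice worth noting is that later, heavier scaffolds are only spent on the prompts that earlier rounds failed on, which keeps the average cost per prompt low.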

Performance and Effectiveness

The researchers tested PEO against three other prominent white-box attack methods across two standard harmful-behavior benchmarks. They evaluated the results using both a standard refusal-keyword check and an LLM-as-a-judge metric, which provides a more accurate assessment of whether the model actually provided harmful content. The experiments showed that PEO outperformed the competing methods in effectiveness. Furthermore, the researchers observed that PEO’s success was remarkably consistent across different models, whereas other attacks showed much wider performance gaps depending on the specific model being tested.
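A refusal-keyword check of the kind described above can be sketched in a few lines. The marker list below is illustrative; the paper's exact keyword set is not given here.

```python
# Illustrative refusal markers; the paper's exact keyword list is not given.
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't", "as an ai", "i apologize"]

def keyword_asr(responses):
    """Attack success rate under a crude refusal-keyword check: the fraction
    of responses containing no refusal marker. Its looseness (empty or
    off-topic replies still 'pass') is why an LLM-as-a-judge metric is
    also reported."""
    jailbroken = [r for r in responses
                  if not any(m in r.lower() for m in REFUSAL_MARKERS)]
    return len(jailbroken) / len(responses)

responses = [
    "I'm sorry, but I can't help with that.",
    "Sure, here are the steps: 1) ...",
]
asr = keyword_asr(responses)  # 0.5: one refusal, one apparent compliance
```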

Key Findings on Semantic Integrity

A major concern in the field has been that modifying prompt embeddings would destroy the semantic meaning of the request, causing the model to misunderstand the user or produce incoherent output. This research demonstrates that such concerns are largely unfounded. Quantitative analysis confirmed that the model’s responses remained on-topic for the vast majority of prompts. By proving that they can maintain the exact visible text of a prompt while successfully navigating the model's internal safety alignment, the authors provide a new, highly effective tool for auditing the robustness of modern AI systems.
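The paper's quantitative on-topic analysis method is not detailed in this summary; as a purely illustrative proxy, a lexical-overlap check between prompt and response might look like this (the threshold and tokenization are arbitrary choices, not the authors' metric):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def on_topic(prompt, response, threshold=0.2):
    """Crude proxy for an on-topic check: does the response share enough
    vocabulary with the prompt?"""
    p = Counter(prompt.lower().split())
    r = Counter(response.lower().split())
    return cosine(p, r) >= threshold

same = on_topic("how to pick a lock",
                "to pick a lock you first insert a tension wrench")
drift = on_topic("how to pick a lock",
                 "the weather today is sunny and warm")
```

A real evaluation would more likely use sentence embeddings or a judge model, but the idea is the same: responses to perturbed prompts should still score as semantically related to the original request.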
