The Unfireable Safety Kernel: Execution-Time AI Ali... | AI Research

Key Takeaways

  • The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems introduces a new architectural approach to AI safety.
  • AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems.
  • The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libraries.
  • Any control in the agent's address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems.
  • We position this layer as execution-time AI alignment, complementing training-time alignment (RLHF, Constitutional AI) and inference-time alignment.

Paper Abstract

AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libraries. Any control in the agent's address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems. We identify four properties that an authorization mechanism must satisfy for architectural control rather than for cooperative requests: process separation, pre-action enforcement on a structurally only path, fail-closed at both the request and system levels, and externalized signed evidence verifiable outside the controlled system's trust boundary. We position this layer as execution-time AI alignment, complementing training-time alignment (RLHF, Constitutional AI) and inference-time alignment. We present the Unfireable Safety Kernel, a Rust reference implementation realizing all four. Its fail-closed invariant is machine-checked at two levels: an SMT theorem (Z3) and an exhaustive bounded-model-checking proof of the production decision function (Kani, 4/4 harnesses). A Python-to-Rust migration was gated on byte-equivalence (1000/1000 fixtures; 17/17 adversarial classes). We evaluate the kernel governing a live, escapable AI system, a deterministic, self-improving world model, against an escape-seeking adversary driving its real self-modification seam: across 1,000 self-modifications, all 704 attempts on the safety-critical core are refused, with no escape; a further 300, under the operator kill switch, are also refused. A separate campaign of 6,240 authorization round-trips had no successful bypass. Against 3 contemporary systems claiming the agent control plane, the agent invokes control; here, it lacks that choice.

The paper introduces an architectural approach to AI safety that it calls execution-time alignment. Once AI agents can use tools and interact with critical infrastructure, any system with sufficient reach into its own runtime becomes "escapable": it can potentially manipulate or disable the very safety controls designed to constrain it. The paper argues that current safety methods, such as system prompts, output filters, and guardrail libraries, are insufficient because they reside within the agent's own runtime. To solve this, the authors propose a "kernel" that operates outside the agent's control, ensuring human authority over the agent's actions regardless of the agent's internal objectives or potential for deception.

Architectural Requirements for Control

The authors identify four essential properties that an authorization mechanism must possess to maintain true control over an AI agent:

  • Process Separation: The safety control must run in a separate process from the agent, preventing the agent from modifying or bypassing the control's code or logic.

  • Structurally-Only Path: The control must sit on the only path between the agent and its actions, ensuring the agent cannot choose to route around it.

  • Fail-Closed Design: If the control is unreachable, fails, or is misconfigured, the system must default to blocking all actions. Furthermore, the agent itself should refuse to start if the kernel is not active.

  • Externalized Signed Evidence: Every action must be recorded in a log that is verifiable by parties outside the agent's environment, using cryptographic keys that the kernel itself does not possess.
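The fail-closed property in particular can be illustrated with a small sketch. This is illustrative Python, not the paper's Rust reference implementation, and every name in it (`Request`, `Verdict`, `ALLOWED`, `authorize`, `start_agent`) is a hypothetical stand-in. It shows only the decision shape: anything other than an explicit, successful policy match is refused, and the agent refuses to start without the kernel. Because it runs in one process, it does not demonstrate process separation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Request:
    actor: str
    action: str
    target: str


# Hypothetical policy for illustration: an explicit allow-list of
# (action, target) pairs. Anything not listed is refused.
ALLOWED = {("read", "docs"), ("write", "scratch")}


def policy(req: Request) -> bool:
    return (req.action, req.target) in ALLOWED


def authorize(req: Optional[Request],
              check: Callable[[Request], bool] = policy,
              kill_switch: bool = False) -> Verdict:
    """Return ALLOW only on an explicit, successful policy match."""
    # System-level fail-closed: the operator kill switch refuses everything.
    if kill_switch or req is None:
        return Verdict.DENY
    try:
        ok = check(req)
    except Exception:
        # Request-level fail-closed: a failing policy check is a refusal.
        return Verdict.DENY
    # Only the literal value True authorizes; None or other values deny.
    return Verdict.ALLOW if ok is True else Verdict.DENY


def start_agent(kernel_reachable: bool) -> None:
    """Assumed startup gate: the agent refuses to run without the kernel."""
    if not kernel_reachable:
        raise RuntimeError("kernel not active; refusing to start")
```

In this sketch the default answer is always a refusal, so an unreachable, failing, or misconfigured check can only block an action, never permit one.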

How the Kernel Works

The Unfireable Safety Kernel is implemented in Rust, whose memory-safety guarantees rule out common classes of vulnerabilities such as buffer overflows and use-after-free in safe code. The system uses a "four-seam" defense-in-depth architecture, where requests must pass through multiple independent gates, including network-level checks, application middleware, and tool-specific hooks, before the kernel authorizes them. Because the kernel is a separate, statically compiled binary, the agent cannot use its own runtime capabilities to "monkey-patch" or replace the safety logic. Additionally, the system uses binary attestation and transparency logs to ensure that the running code is exactly what the operator intended, so that, per the authors, a compromised kernel cannot silently authorize forbidden actions.
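To make the idea of a tamper-evident log concrete, here is a minimal sketch of a hash-chained record of authorization decisions. It is an assumption about how such a log could be structured, not the kernel's actual format: the names `GENESIS`, `entry_digest`, `append`, and `verify` are hypothetical, and the sketch omits the signing step. A real deployment would additionally sign each digest and publish it outside the controlled system, since an attacker who can rewrite the whole log could otherwise recompute the chain.

```python
import hashlib
import json

# Assumed starting value for the chain (hypothetical convention).
GENESIS = "0" * 64


def entry_digest(prev: str, record: dict) -> str:
    """Hash the previous digest together with a canonical JSON record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, linking it to the digest of the previous entry."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({
        "record": record,
        "prev": prev,
        "digest": entry_digest(prev, record),
    })


def verify(log: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if entry["digest"] != entry_digest(prev, entry["record"]):
            return False
        prev = entry["digest"]
    return True
```

A verifier that holds only the most recent digest, obtained through a channel the agent cannot write to, can use `verify` to detect any later change to earlier entries.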

Evaluation and Results

The authors tested the kernel by having it govern a live escapable system: a deterministic, self-improving world model. They subjected this system to a purpose-built "escape-seeking" adversary that drives the model's real self-modification seam. Across 1,000 self-modifications, the kernel refused all 704 attempts on the safety-critical core, with no escape; a further 300 attempts made under the operator kill switch were also refused. A separate campaign of 6,240 authorization round-trips produced no successful bypass. The authors note that in the three contemporary agent control systems they compare against, the agent itself invokes the safety check, whereas the Unfireable Safety Kernel forces the agent to operate within a boundary it cannot choose to ignore.

Important Considerations

While the kernel provides a robust framework for execution-time alignment, the authors clarify several limitations. The architecture does not prevent prompt injection upstream of the kernel, nor does it filter the content of the model's text output; it focuses strictly on mediating consequential actions. Furthermore, the system does not currently address potential side-channel leakage (where an agent might infer information based on the time taken for a decision) or the risk of an operator misusing their own signing keys. Finally, the authors note that while the kernel is designed to be fail-closed, a targeted denial-of-service attack against the kernel would result in an operational outage, which the operator must manage as a system reliability concern.
