The paper "The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems" introduces a new architectural approach to AI safety. As AI agents gain the ability to use tools and interact with critical infrastructure, they become "escapable" systems: they can potentially manipulate or disable the very safety controls designed to constrain them. The paper argues that current safety methods, such as system prompts or guardrail libraries, are insufficient because they reside within the agent's own runtime. To solve this, the authors propose a "kernel" that operates outside the agent's control, preserving human authority over the agent's actions regardless of the agent's internal objectives or potential for deception.
Architectural Requirements for Control
The authors identify four essential properties that an authorization mechanism must possess to maintain true control over an AI agent:
Process Separation: The safety control must run in a separate process from the agent, preventing the agent from modifying or bypassing the control's code or logic.
Structurally-Only Path: The control must sit on the only path between the agent and its actions, ensuring the agent cannot choose to route around it.
Fail-Closed Design: If the control is unreachable, fails, or is misconfigured, the system must default to blocking all actions. Furthermore, the agent itself should refuse to start if the kernel is not active.
Externalized Signed Evidence: Every action must be recorded in a log that is verifiable by parties outside the agent's environment, using cryptographic keys that the kernel itself does not possess.
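The fail-closed property above can be illustrated with a small Rust sketch. It is not taken from the paper: the names `Verdict`, `KernelError`, `may_execute`, and `start_agent` are hypothetical, and the real kernel's interface may look quite different. The point is only that every outcome other than an explicit allow, including an unreachable or misconfigured kernel, ends in a blocked action.

```rust
/// Explicit verdict returned by the external kernel (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Deny,
}

/// Ways a query to the kernel can fail (hypothetical; illustrative only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum KernelError {
    Unreachable,
    Misconfigured(String),
    MalformedReply,
}

/// Fail-closed decision: only an explicit `Allow` lets the action through.
/// A denial, or any error while asking the kernel, blocks the action.
pub fn may_execute(reply: Result<Verdict, KernelError>) -> bool {
    matches!(reply, Ok(Verdict::Allow))
}

/// Startup check: the agent refuses to start when the kernel is not active.
pub fn start_agent(kernel_is_active: bool) -> Result<(), &'static str> {
    if kernel_is_active {
        Ok(())
    } else {
        Err("kernel not active; refusing to start")
    }
}
```

Writing the decision as a single `matches!` on the allow case means that any error variant added later is blocked by default, rather than needing its own deny branch.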
How the Kernel Works
The Unfireable Safety Kernel is implemented in Rust, a memory-safe language that rules out many common memory-corruption vulnerabilities. The system uses a "four-seam" defense-in-depth architecture in which requests must pass through multiple independent gates, including network-level checks, application middleware, and tool-specific hooks, before the kernel authorizes them. Because the kernel is a separate, statically compiled binary, the agent cannot use its own runtime capabilities to "monkey-patch" or replace the safety logic. Additionally, the system uses binary attestation and transparency logs to verify that the running code is exactly what the operator intended, so that a compromised kernel cannot silently authorize forbidden actions.
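As a rough illustration of the layered-gate idea, the following Rust sketch chains three gates named after the examples in the text (network-level, middleware, tool hook). Everything here is an assumption made for illustration: the `ActionRequest` shape, the gate rules, the `internal://` allowlist prefix, and the `modify_safety_core` tool name are hypothetical and do not describe the paper's actual seams or policies. In the real system the gates live outside the agent's process; this sketch only shows the "every gate must pass" logic.

```rust
/// A proposed consequential action (hypothetical shape).
pub struct ActionRequest {
    pub tool: String,
    pub target: String,
}

impl ActionRequest {
    pub fn new(tool: &str, target: &str) -> Self {
        ActionRequest { tool: tool.to_string(), target: target.to_string() }
    }
}

/// A gate returns Ok(()) to pass the request on, or Err(reason) to block it.
pub type Gate = fn(&ActionRequest) -> Result<(), String>;

/// Network-level check (assumed rule: only `internal://` targets are allowed).
fn network_gate(req: &ActionRequest) -> Result<(), String> {
    if req.target.starts_with("internal://") {
        Ok(())
    } else {
        Err("network gate: target not on allowlist".to_string())
    }
}

/// Application-middleware check (assumed rule: a tool name must be present).
fn middleware_gate(req: &ActionRequest) -> Result<(), String> {
    if req.tool.is_empty() {
        Err("middleware gate: missing tool name".to_string())
    } else {
        Ok(())
    }
}

/// Tool-specific hook (assumed rule: the safety core may never be modified).
fn tool_hook_gate(req: &ActionRequest) -> Result<(), String> {
    if req.tool == "modify_safety_core" {
        Err("tool hook: safety-critical core is read-only".to_string())
    } else {
        Ok(())
    }
}

/// Example gate ordering used in this sketch.
pub const DEFAULT_GATES: [Gate; 3] = [network_gate, middleware_gate, tool_hook_gate];

/// Authorize only if every gate passes; the first refusal blocks the request.
/// An empty gate list is treated as a misconfiguration and also blocks.
pub fn authorize(req: &ActionRequest, gates: &[Gate]) -> Result<(), String> {
    if gates.is_empty() {
        return Err("no gates configured".to_string());
    }
    for gate in gates {
        gate(req)?;
    }
    Ok(())
}
```

Under these assumed rules, `authorize(&ActionRequest::new("read_file", "internal://docs/report"), &DEFAULT_GATES)` passes all three gates, while a request for `modify_safety_core` is stopped by the tool hook even though the earlier gates let it through.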
Evaluation and Results
The authors tested the kernel by having it govern a self-improving, autonomous world model. They subjected this system to a purpose-built "escape-seeking" adversary designed to modify the agent's core safety mechanisms. Of 1,000 self-modification attempts, 704 targeted the safety-critical core, and the kernel blocked all 704. In a separate stress test involving over 6,000 authorization requests, the kernel recorded no successful bypasses. The authors note that, unlike other contemporary agent control systems in which the agent is responsible for invoking the safety check, the Unfireable Safety Kernel forces the agent to operate within a boundary it cannot choose to ignore.
Important Considerations
While the kernel provides a robust framework for execution-time alignment, the authors clarify several limitations. The architecture does not prevent prompt injection upstream of the kernel, nor does it filter the content of the model's text output; it focuses strictly on mediating consequential actions. Furthermore, the system does not currently address potential side-channel leakage (where an agent might infer information based on the time taken for a decision) or the risk of an operator misusing their own signing keys. Finally, the authors note that while the kernel is designed to be fail-closed, a targeted denial-of-service attack against the kernel would result in an operational outage, which the operator must manage as a system reliability concern.