AgentTrust: Runtime Safety Evaluation and Intercept...
Key Takeaways

  • AgentTrust is a runtime safety layer that intercepts AI agent tool calls (file operations, shell commands, HTTP requests, database queries) before they execute, preventing accidental or malicious harm.
  • Each intercepted call receives a structured verdict: allow, warn, block, or review.
  • The system combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection of multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs.
  • On an internal 300-scenario benchmark, the production-only ruleset reaches 95.0% verdict accuracy at low-millisecond latency; on 630 independently constructed adversarial scenarios (evaluated under a patched ruleset), it reaches 96.7%.
  • AgentTrust is open source under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.
Paper Abstract

Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.

AgentTrust is a safety framework designed to prevent AI agents from causing accidental or malicious harm when they interact with real-world systems. While modern AI agents like AutoGPT or Cursor can perform useful tasks—such as file operations, database queries, and shell commands—they can also make dangerous mistakes, like deleting critical files or leaking credentials. AgentTrust acts as a real-time "safety layer" that sits between the agent and its tools, evaluating every action before it executes to determine if it is safe, risky, or requires human review.
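The interception loop described above can be sketched in a few lines. This is a minimal illustration, not AgentTrust's actual API: the `Verdict` enum, `evaluate` policy stub, and `guarded_call` wrapper are hypothetical names, and the rules inside `evaluate` are toy stand-ins for the real multi-layered engine.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"

def evaluate(tool: str, args: dict) -> Verdict:
    # Toy stand-in for the real policy engine: block a destructive
    # shell command, flag overly permissive file modes, allow the rest.
    if tool == "shell" and "rm -rf /" in args.get("command", ""):
        return Verdict.BLOCK
    if tool == "chmod" and args.get("mode") == "777":
        return Verdict.WARN
    return Verdict.ALLOW

def guarded_call(tool: str, args: dict, execute):
    """Run `execute` only when the verdict permits it."""
    verdict = evaluate(tool, args)
    if verdict in (Verdict.BLOCK, Verdict.REVIEW):
        return verdict, None           # withheld: blocked or escalated to a human
    return verdict, execute(tool, args)  # allow/warn: the side effect proceeds
```

The essential design point is that the safety check sits in the execution path itself: the agent never touches the tool directly, so no verdict means no side effect.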

How AgentTrust Works

Unlike existing defenses that only check for safety after an action has occurred or rely on simple keyword blocking, AgentTrust uses a multi-layered approach to understand the intent behind an action:

  • Shell Deobfuscation: Many dangerous commands are hidden using tricks like variable substitution or hex encoding. The ShellNormalizer rewrites these commands into a readable format so the system can identify hidden threats.

  • Multi-Step Risk Detection: Some attacks are only dangerous when combined. The RiskChain component tracks sequences of actions over time, identifying patterns where individually benign steps (like reading a file and then sending a web request) form a dangerous attack chain.

  • Proactive Suggestions: Instead of simply blocking an action, the SafeFix engine suggests safer alternatives. For example, if an agent tries to set overly permissive file permissions, the system can suggest a more secure configuration.

  • Intelligent Judgment: For ambiguous actions, AgentTrust uses an LLM-as-Judge. To keep this fast and cost-effective, it uses a caching system that only evaluates new or changed parts of a conversation, rather than re-reading the entire history every time.
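Two of the components above lend themselves to a short sketch: deobfuscation (rewriting hidden payloads into readable form) and risk-chain tracking (flagging a sensitive read followed by an outbound request). The `normalize` function and `RiskChain` class below are illustrative assumptions, not the paper's actual ShellNormalizer or RiskChain implementations, and they handle only two toy obfuscation cases.

```python
import binascii
import re

def normalize(command: str) -> str:
    """Rewrite simple shell obfuscations into readable form (toy version).

    Handles two cases: ANSI-C $'\\x..' hex escapes, and a bare hex
    payload piped through `xxd -r -p` into a shell.
    """
    # Expand \xNN escapes inside $'...' quoting.
    def expand(m):
        return bytes(m.group(1), "utf-8").decode("unicode_escape")
    command = re.sub(r"\$'([^']*)'", expand, command)

    # Decode `echo <hex> | xxd -r -p | sh` pipelines.
    m = re.match(r"echo\s+([0-9a-fA-F]+)\s*\|\s*xxd -r -p\s*\|\s*sh", command)
    if m:
        return binascii.unhexlify(m.group(1)).decode()
    return command

class RiskChain:
    """Track tool-call history to flag read-then-exfiltrate patterns."""

    def __init__(self):
        self.read_sensitive = False

    def observe(self, tool: str, args: dict) -> str:
        if tool == "read_file" and "credentials" in args.get("path", ""):
            self.read_sensitive = True  # benign on its own; remember it
        if tool == "http_request" and self.read_sensitive:
            return "review"  # the combination looks like exfiltration
        return "allow"
```

Normalization runs before pattern matching, so a payload like `echo 726d202d7266202f | xxd -r -p | sh` is judged as the `rm -rf /` it decodes to, and the chain tracker escalates a web request only after a sensitive file has been read in the same session.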

Performance and Reliability

AgentTrust is built to be fast and reliable, operating at low-millisecond end-to-end latency so it does not slow down agent workflows. On the internal 300-scenario benchmark, the production-only ruleset achieved 95.0% verdict accuracy (and 73.7% risk-level accuracy). On a separate set of 630 real-world adversarial scenarios, evaluated under a patched ruleset, it reached 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. The system is also designed to "fail safe": if the safety engine encounters an error or cannot reach the judge, it defaults to requiring human review rather than letting the action proceed.
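The fail-safe behavior amounts to a wrapper around the judge that converts any failure into an escalation. A minimal sketch, assuming a hypothetical `fail_safe_evaluate` helper (the function name and verdict strings are illustrative, not AgentTrust's API):

```python
def fail_safe_evaluate(call: dict, judge) -> str:
    """Return the judge's verdict, or escalate to human review on failure.

    `judge` is any callable returning a verdict string. If it raises
    (engine bug, unreachable LLM judge, timeout), we default to
    "review" rather than silently allowing the action to proceed.
    """
    try:
        return judge(call)
    except Exception:
        return "review"  # fail safe: withhold execution, ask a human
```

The asymmetry is deliberate: a spurious "review" costs a moment of human attention, while a spurious "allow" could be irreversible.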

Scope and Limitations

AgentTrust is intended to complement, not replace, existing security measures like OS-level sandboxes or containers. It is specifically designed to catch accidental harm caused by over-eager AI planning and common obfuscation techniques. It does not aim to solve general problems like AI toxicity, copyright issues, or direct attacks on the underlying operating system. The framework is released as open-source software, allowing developers to integrate it into their own agent architectures via the Model Context Protocol.
