AI Research

DecodingTrust-Agent Platform (DTap): A Controllable... | AI Research

Key Takeaways

AI agents are increasingly being used to automate complex tasks, but their ability to interact with external tools and data also makes them vulnerable to sec...
AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions.
Due to their high capability and flexibility, such agents raise significant security and safety concerns.
A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions.
Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions.

Paper AbstractExpand

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.

AI agents are increasingly being used to automate complex tasks, but their ability to interact with external tools and data also makes them vulnerable to security threats. Adversaries can manipulate these agents into performing harmful actions, such as leaking sensitive information or unauthorized data deletion. To address these risks, researchers have introduced the DecodingTrust-Agent Platform (DTap), a comprehensive, controllable, and interactive environment designed to evaluate the security of AI agents at scale.

A New Standard for Agent Security

Evaluating AI agents is difficult because they operate in dynamic, real-world environments. DTap provides a solution by offering over 50 simulation environments across 14 distinct domains. These simulations replicate the functionality of widely used systems—such as Google Workspace, PayPal, and Slack—allowing researchers to test how agents behave when faced with realistic security challenges in a controlled, reproducible setting.

Autonomous Red-Teaming with DTap-Red

To move beyond manual testing, the researchers developed DTap-Red, an autonomous red-teaming agent. This system is designed to systematically probe for vulnerabilities by exploring various "injection vectors," including prompts, tools, skills, and environment configurations. By autonomously discovering and executing attack strategies tailored to specific malicious goals, DTap-Red allows for a more rigorous and scalable assessment of agent safety than previously possible.

Benchmarking and Insights

Using the DTap-Red system, the researchers curated DTap-Bench, a large-scale dataset of high-quality red-teaming instances. Each instance in this dataset includes a verifiable judge, which allows for the automatic validation of whether an attack was successful. By applying this framework to popular AI agents, the study identified systematic vulnerability patterns across different backbone models. These findings provide critical insights for developers, helping them build more secure and resilient next-generation AI agents.

Comments (0)

No comments yet

Be the first to share your thoughts!