MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
Modern AI agents often use persistent memory to store past interactions, allowing them to perform complex, long-term tasks. However, this feature creates a security risk: an attacker can inject malicious information into an agent’s memory through seemingly normal interactions. Once stored, this "poisoned" memory can steer the agent toward harmful behavior in future tasks. While many defenses focus on blocking malicious prompts in real-time, they often fail to address the problem after a security breach has already occurred. MemAudit is a new framework designed to audit an agent’s memory after harmful behavior has been detected, identifying and removing the specific memory entries responsible for the failure.
How MemAudit Works
MemAudit operates on the principle that harmful memories leave two distinct "fingerprints." First, they have a direct, measurable impact on the agent’s harmful output. Second, they often appear as "anomalies" that do not fit well with the rest of the agent’s stored information.
The framework uses two primary signals to identify these suspicious entries:
Counterfactual Memory Influence Score (CMIS): The system simulates what would happen if a specific memory were removed. If the agent’s harmful behavior disappears or significantly decreases when a memory is excluded, that memory is flagged as having a high causal influence.
Memory Consistency Graph (MCG): The system maps the agent’s entire memory store as a graph, analyzing how entries relate to one another. Benign memories usually form coherent, logical clusters, while poisoned memories often appear as structural anomalies that contradict or fail to align with the surrounding data.
By combining these two signals, MemAudit creates a "detoxification score" that ranks all memories, allowing developers to target and remove the most dangerous entries without needing to know which ones were originally poisoned by an attacker.
Evaluating Performance
The researchers tested MemAudit against MINJA, a sophisticated attack where malicious records are injected through standard user-agent interactions. The framework was evaluated across both Question Answering (QA) and long-horizon Reasoning-Agent (RAP) settings using models like GPT-4o and DeepSeek.
The results showed that MemAudit was highly effective at neutralizing these attacks. In QA settings, the attack success rate dropped from 70% to 0% for GPT-4o. In reasoning-agent tasks, the success rate dropped from 83.3% to 0%. These results demonstrate that the framework can successfully "clean" an agent’s memory, restoring safe operation while preserving the useful information the agent needs to function.
Why This Matters
Existing security measures for AI agents have largely been preventive, focusing on stopping attacks as they happen. MemAudit shifts the focus to post-hoc auditing—the ability to investigate and repair an agent after a problem has been identified. Because it does not rely on "oracle" labels (knowing exactly which memories are poisoned beforehand), it provides a practical, automated way for developers to maintain the integrity of autonomous systems that rely on long-term memory. The study highlights that combining event-specific causal evidence with global structural analysis is significantly more effective than using either method alone.
Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!