
Key Takeaways

  • Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution.
  • We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents.
  • We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification.
  • Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios.
Paper Abstract

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
Modern AI agents often use persistent memory to store past interactions, allowing them to perform complex, long-term tasks. However, this feature creates a security risk: an attacker can inject malicious information into an agent’s memory through seemingly normal interactions. Once stored, this "poisoned" memory can be retrieved later and steer the agent toward harmful behavior in future tasks. Many defenses focus on blocking malicious prompts or outputs in real time, but they do not answer the question of which stored memories are responsible once harmful behavior has already been observed. MemAudit is a new framework designed to audit an agent’s memory after harmful behavior has been detected, identifying and removing the specific memory entries responsible for the failure.

How MemAudit Works

MemAudit operates on the principle that harmful memories leave two distinct "fingerprints." First, they have a direct, measurable impact on the agent’s harmful output. Second, they often appear as "anomalies" that do not fit well with the rest of the agent’s stored information.
The framework uses two primary signals to identify these suspicious entries:

  • Counterfactual Memory Influence Score (CMIS): The system simulates what would happen if a specific memory were removed. If the agent’s harmful behavior disappears or significantly decreases when a memory is excluded, that memory is flagged as having a high causal influence.

  • Memory Consistency Graph (MCG): The system maps the agent’s entire memory store as a graph, analyzing how entries relate to one another. Benign memories usually form coherent, logical clusters, while poisoned memories often appear as structural anomalies that contradict or fail to align with the surrounding data.
By combining these two signals, MemAudit creates a "detoxification score" that ranks all memories, allowing developers to target and remove the most dangerous entries without needing to know which ones were originally poisoned by an attacker.
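
The summary above describes the two signals and their combination only at a high level, so the following Python sketch is an illustration under stated assumptions, not the paper's implementation. It assumes a leave-one-out form for the influence score, a cosine-similarity graph over memory embeddings for the consistency signal, and an equally weighted sum of min-max-normalized signals for the final ranking. The names `harm_fn`, `influence_scores`, `anomaly_scores`, `detox_scores`, and the weight `alpha` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

def influence_scores(memories: List[str],
                     harm_fn: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out influence (assumed form): how much the harm
    measure drops when a single memory is removed from the store."""
    base = harm_fn(memories)
    return [base - harm_fn(memories[:i] + memories[i + 1:])
            for i in range(len(memories))]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def anomaly_scores(embeddings: List[Sequence[float]], k: int = 2) -> List[float]:
    """Structural anomaly (assumed form): one minus the mean similarity
    to the k most similar other memories in the graph."""
    scores = []
    for i, e in enumerate(embeddings):
        sims = sorted((cosine(e, o) for j, o in enumerate(embeddings) if j != i),
                      reverse=True)[:k]
        scores.append(1.0 - sum(sims) / len(sims) if sims else 0.0)
    return scores

def minmax(xs: List[float]) -> List[float]:
    """Rescale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def detox_scores(influence: List[float], anomaly: List[float],
                 alpha: float = 0.5) -> List[float]:
    """Weighted sum of the two normalized signals (alpha is an assumption)."""
    return [alpha * a + (1.0 - alpha) * b
            for a, b in zip(minmax(influence), minmax(anomaly))]

# Toy data: three mutually consistent records and one outlier.
memories = ["benign A", "benign B", "benign C", "POISON"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]

def harm_fn(mems: List[str]) -> float:
    # Stand-in for re-running the agent and scoring its output for harm.
    return 1.0 if "POISON" in mems else 0.0

influence = influence_scores(memories, harm_fn)
anomaly = anomaly_scores(embeddings)
scores = detox_scores(influence, anomaly)
ranking = sorted(range(len(memories)), key=lambda i: scores[i], reverse=True)
print([memories[i] for i in ranking])
```

In this toy setup the outlier record ranks first because it scores highest on both signals. In a real deployment `harm_fn` would require re-running the agent with each memory withheld, so the cost of the influence pass grows with the number of candidate memories.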

Evaluating Performance

The researchers tested MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through standard user-agent interactions rather than by modifying the memory bank directly. The framework was evaluated across both Question Answering (QA) and long-horizon Reasoning-Agent (RAP) settings using models such as GPT-4o and DeepSeek.
The results showed that MemAudit was highly effective at neutralizing these attacks. In QA settings, the attack success rate dropped from 70% to 0% for GPT-4o. In reasoning-agent tasks, the success rate dropped from 83.3% to 0%. These results demonstrate that the framework can successfully "clean" an agent’s memory, restoring safe operation while preserving the useful information the agent needs to function.

Why This Matters

Existing security measures for AI agents have largely been preventive, focusing on stopping attacks as they happen. MemAudit shifts the focus to post-hoc auditing: the ability to investigate and repair an agent after a problem has been identified. Because it does not rely on "oracle" labels (knowing exactly which memories are poisoned beforehand), it provides a practical, automated way for developers to maintain the integrity of autonomous systems that rely on long-term memory. The study highlights that combining event-specific causal evidence with global structural analysis is significantly more effective than using either method alone.
