Selective QA over Conflicting Multi-Source Personal...


Key Takeaways

Emerging personal AI agents are moving toward persistent, multi-source memory.
This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history.
Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step.
We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient.

Paper Abstract

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.

The Challenge of Conflicting Personal Memory

As AI agents begin to rely on persistent, multi-source memory, they face a significant hurdle: how to handle information that is incomplete or contradictory. Unlike traditional systems that retrieve facts from a single, clean history, modern agents must synthesize evidence from various sources that may not always agree. This paper addresses the difficulty of evaluating these systems, noting that current benchmarks often fail to distinguish whether an AI's error stems from the quality of the provided evidence or from the agent's inability to resolve conflicts between sources.

A New Diagnostic Benchmark

To better understand how AI models handle these discrepancies, the authors developed a new diagnostic testbed for "selective QA" (Question Answering). In this framework, an AI must answer questions based on conflicting or incomplete evidence, or choose to abstain from answering if the information is insufficient.
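The answer-or-abstain decision described above can be illustrated with a minimal sketch. This is not the authors' resolver: the function name `resolve_or_abstain`, the majority-vote rule, and the `min_share` threshold are all assumptions chosen only to show what abstaining on insufficient evidence might look like.

```python
from collections import Counter
from typing import Optional


def resolve_or_abstain(
    source_answers: list[Optional[str]],
    min_share: float = 0.6,  # hypothetical agreement threshold, not from the paper
) -> Optional[str]:
    """Return the majority answer across sources, or None to abstain."""
    # Sources that are silent on the question model incomplete evidence.
    present = [a for a in source_answers if a is not None]
    if not present:
        return None  # no evidence at all: abstain

    # Pick the most common answer and measure how much of the
    # available evidence supports it.
    answer, votes = Counter(present).most_common(1)[0]
    share = votes / len(present)

    # Answer only when agreement is strong enough; otherwise abstain.
    return answer if share >= min_share else None


print(resolve_or_abstain(["Lisbon", "Lisbon", "Porto"]))  # two of three sources agree
print(resolve_or_abstain(["Lisbon", "Porto"]))            # evenly split sources
```

Raising `min_share` makes this toy resolver answer fewer questions but with stronger agreement behind each answer, which is the kind of trade-off selective QA measures.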
The benchmark is designed to be rigorous and controlled, featuring:

Scale: 34,560 instances across 480 personas and 4 random seeds.
Diversity: 18 question templates covering 8 different types of reasoning.
Precision: Controlled source distortions and deterministic ground truth, ensuring that researchers can precisely measure how well a model resolves conflicts.
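The reported counts are consistent with a full cross of templates, personas, and seeds, since 18 × 480 × 4 = 34,560. The sketch below enumerates such a grid and shows one way seeded distortions can keep ground truth deterministic. The factorization is inferred from the numbers, not stated in this summary, and `distort_sources` with its distortion kinds and probabilities is purely hypothetical.

```python
import random
from itertools import product
from typing import Optional

# Counts taken from the abstract.
NUM_TEMPLATES, NUM_PERSONAS, NUM_SEEDS = 18, 480, 4

# Assumption: every (persona, template, seed) combination is one instance.
instance_keys = list(
    product(range(NUM_PERSONAS), range(NUM_TEMPLATES), range(NUM_SEEDS))
)
print(len(instance_keys))  # 34560


def distort_sources(
    truth: str,
    alternatives: list[str],
    seed: int,
    n_sources: int = 3,      # hypothetical number of memory sources
    p_missing: float = 0.2,  # hypothetical chance a source omits the fact
    p_wrong: float = 0.3,    # hypothetical chance a source contradicts it
) -> list[Optional[str]]:
    """Derive per-source views of one fact from a known ground truth."""
    rng = random.Random(seed)  # seeding makes the distortion reproducible
    sources: list[Optional[str]] = []
    for _ in range(n_sources):
        r = rng.random()
        if r < p_missing:
            sources.append(None)                      # incomplete source
        elif r < p_missing + p_wrong:
            sources.append(rng.choice(alternatives))  # conflicting source
        else:
            sources.append(truth)                     # faithful source
    return sources
```

Because the truth is fixed before any distortion is applied, the correct answer stays known however noisy the generated sources become.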

Comparing Model Performance

The researchers tested several approaches, ranging from models with no source access to structured fusion methods and frontier Large Language Models (LLMs). The results highlight a clear performance gap between specialized methods and general-purpose models:

Accuracy: The best trained fusion resolver reached 80.3% accuracy, ahead of the strongest prompt-only LLM baseline at 70.0%.
Selective Accuracy: When allowed to abstain, the same resolver reached 85.3% selective accuracy at 78.3% coverage, while the best LLM reached 71.0% selective accuracy at 95.4% coverage.
On this benchmark, these findings suggest that trained fusion methods currently hold an advantage over prompt-only LLMs at resolving conflicting memory, although the resolver's higher selective accuracy comes with abstaining on more questions.
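To make the two abstention metrics concrete, here is a minimal sketch using the standard selective-prediction definitions: coverage is the fraction of questions answered, and selective accuracy is accuracy over the answered questions only. This summary does not spell out the paper's exact formulas, so treat these definitions, the function name `selective_metrics`, and the use of `None` for abstention as assumptions.

```python
from typing import Optional, Sequence


def selective_metrics(
    predictions: Sequence[Optional[str]],
    gold: Sequence[str],
) -> tuple[float, float]:
    """Return (selective_accuracy, coverage); None in predictions means abstain."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")

    # Keep only the questions the system chose to answer.
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold) if gold else 0.0
    if not answered:
        return 0.0, coverage  # convention chosen here: no answers gives 0.0

    correct = sum(p == g for p, g in answered)
    return correct / len(answered), coverage


preds = ["a", None, "b", "c"]
gold = ["a", "b", "b", "x"]
sel_acc, cov = selective_metrics(preds, gold)
print(f"selective accuracy={sel_acc:.3f}, coverage={cov:.2f}")
```

Under these assumed definitions, multiplying the two metrics gives the share of all questions answered correctly: roughly 0.853 × 0.783 ≈ 66.8% for the resolver and 0.710 × 0.954 ≈ 67.7% for the best LLM. The two operating points therefore differ mainly in how often each system answers and how often those answers are wrong.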

Insights for Future Research

The study finds that different models have different strengths across reasoning types. By releasing the data, code, cached model outputs, and data-generating process, the authors aim to give the research community a reusable way to evaluate how AI agents manage memory. This lets future developers isolate and fix the specific pipeline steps that cause errors, rather than treating the agent as a "black box" that simply fails or succeeds.
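One simple way to see where a model is strong or weak is to break accuracy down by reasoning type. The sketch below assumes each cached output can be read as a record with `reasoning_type`, `prediction`, and `gold` fields. Those field names and the example type labels are hypothetical and do not reflect the released data format.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def accuracy_by_type(records: Iterable[Mapping[str, str]]) -> dict[str, float]:
    """Group records by reasoning type and compute accuracy within each group."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in records:
        kind = r["reasoning_type"]
        totals[kind] += 1
        hits[kind] += int(r["prediction"] == r["gold"])
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Illustrative records only; the type labels are made up for this example.
records = [
    {"reasoning_type": "temporal", "prediction": "2021", "gold": "2021"},
    {"reasoning_type": "temporal", "prediction": "2019", "gold": "2020"},
    {"reasoning_type": "aggregation", "prediction": "3", "gold": "3"},
]
print(accuracy_by_type(records))
```

Running the same breakdown for several models side by side would show which reasoning types each one handles well.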