AI Research

Beyond One-shot: AI Agents for Learning in Field Ex...

Key Takeaways

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design.
Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions.
We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments.
Organizations frequently conduct A/B testing to optimize interventions, yet they often treat each experiment as a standalone event, failing to leverage the data from past tests to improve future designs.

Paper Abstract

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions. We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments. Through two-stage field experiments in healthcare prescription messaging (693,139 patient visits), we compare a Human + Chatbot method (Stage 1: behavioral experts with conversational AI co-designing 13 message variants, 444,691 patient visits) against a Tool-Augmented Agentic AI method (Stage 2: AI autonomously extracting principles from Stage 1 data to generate 17 new variants, 248,448 patient visits). The Agentic AI method, equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, produces superior interventions: the best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline). Critically, our results suggest that the value comes from domain-specific experimental data, not from general reasoning ability: frontier LLMs operating without experimental data failed to predict which interventions would succeed. The field experiments also revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts, motivating an agentic AI approach to theory audits at field-experiment scale. Our research shows that tool-augmented AI can learn from experimental data and generate improved domain-relevant interventions, transforming behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning.

Organizations frequently conduct A/B testing to optimize interventions, yet they often treat each experiment as a standalone event, failing to leverage the data from past tests to improve future designs. This paper, "Beyond One-shot: AI Agents for Learning in Field Experiments," explores a new approach to bridge this gap. The authors investigate whether tool-augmented agentic AI can autonomously analyze data from previous experiments to generate more effective, data-driven interventions for future use, effectively turning one-off evaluations into a scalable, cumulative learning system.

Moving Beyond One-Shot Experiments

The traditional approach to experimentation often leaves valuable insights trapped in past data. To address this, the researchers ran a two-stage field study in healthcare prescription messaging covering 693,139 patient visits. In Stage 1 (444,691 visits), behavioral experts worked with a conversational AI to co-design 13 message variants. In Stage 2 (248,448 visits), a tool-augmented agentic AI autonomously extracted principles from the Stage 1 data and generated 17 new message variants.

The Power of Domain-Specific Data

The results suggest that the agentic method's advantage comes from access to domain-specific experimental data rather than from general reasoning ability. Frontier LLMs operating without the experimental data failed to predict which interventions would succeed. In the study, the agentic AI was equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, which let it ground its new message designs in the results of the first-stage experiment.
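The DIKW progression described above can be pictured with a toy example. The following is a minimal sketch, not the authors' agents: the function names, the `tags` labels, and every number in `ROWS` are invented for illustration, and the real system presumably does far more than average lifts per tag.

```python
from collections import defaultdict

# Data: raw per-variant counts.
# All variants, tags, and numbers here are made up for illustration only.
ROWS = [
    {"variant": "control", "tags": [], "visits": 1000, "clicks": 600},
    {"variant": "A", "tags": ["reminder"], "visits": 1000, "clicks": 650},
    {"variant": "B", "tags": ["reminder", "urgency"], "visits": 1000, "clicks": 630},
    {"variant": "C", "tags": ["urgency"], "visits": 1000, "clicks": 560},
]


def to_information(rows):
    """Information: click-through rate (CTR) per variant."""
    return {r["variant"]: r["clicks"] / r["visits"] for r in rows}


def to_knowledge(rows, ctr, baseline="control"):
    """Knowledge: mean percentage-point lift over the baseline for each tag."""
    lifts = defaultdict(list)
    for r in rows:
        for tag in r["tags"]:
            lifts[tag].append((ctr[r["variant"]] - ctr[baseline]) * 100)
    return {tag: sum(values) / len(values) for tag, values in lifts.items()}


def to_wisdom(knowledge):
    """Wisdom: tags with positive mean lift, most promising first."""
    ranked = sorted(knowledge.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, lift in ranked if lift > 0]


ctr = to_information(ROWS)
knowledge = to_knowledge(ROWS, ctr)
wisdom = to_wisdom(knowledge)
print(knowledge)
print(wisdom)
```

Each layer keeps a pointer back to the layer beneath it (tags to variants, variants to raw counts), which is one simple way to read the idea of a transparent evidence chain: any recommendation can be traced to the rows that support it.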

Results and Practical Implications

The agentic AI approach produced superior interventions: the best AI-generated message reached a 69.8% click-through rate (CTR), 6.5 percentage points above baseline. Beyond these performance gains, the field experiments revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts. This finding motivates using agentic AI for "theory audits" at field-experiment scale, so that organizations can check which behavioral strategies actually work in their own settings rather than relying on broad, untested assumptions.
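A percentage-point lift is not the same as a relative lift, and the two are easy to confuse. The short sketch below works through the arithmetic using only figures reported in the abstract. The baseline CTR it derives is implied by the rounded numbers rather than stated in the source, and the helper names are hypothetical.

```python
def implied_baseline_ctr(best_ctr_pct: float, lift_pp: float) -> float:
    """Baseline CTR implied by a best CTR and its percentage-point lift."""
    return best_ctr_pct - lift_pp


def relative_lift_pct(best_ctr_pct: float, lift_pp: float) -> float:
    """Lift expressed as a percentage of the implied baseline CTR."""
    return 100.0 * lift_pp / implied_baseline_ctr(best_ctr_pct, lift_pp)


# Figures reported in the abstract.
BEST_CTR_PCT = 69.8  # best AI-generated message
LIFT_PP = 6.5        # percentage points over baseline
STAGE_VISITS = {"stage_1": 444_691, "stage_2": 248_448}

baseline = implied_baseline_ctr(BEST_CTR_PCT, LIFT_PP)
relative = relative_lift_pct(BEST_CTR_PCT, LIFT_PP)
total_visits = sum(STAGE_VISITS.values())

print(f"implied baseline CTR: {baseline:.1f}%")
print(f"relative lift: {relative:.1f}%")
print(f"total patient visits: {total_visits:,}")
```

Under these assumptions the 6.5-point gain corresponds to a relative improvement of roughly 10% over an implied baseline of about 63.3%, and the two stage counts sum to the reported 693,139 visits.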

Transforming Design Learning

Ultimately, this research shows that AI agents can turn behavioral experimentation from a series of isolated, one-shot tests into a cumulative learning process. By using tool-augmented agents to synthesize prior experimental results, organizations can build a feedback loop in which each experiment informs the design of the next, supporting more scalable, evidence-based decisions.
