
Key Takeaways

  • Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions.
  • However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity.
  • We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions.
  • We also show that the simulation framework has a larger impact on fidelity than model scale.
Paper Abstract

Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than the model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.

PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
Patient simulators are becoming essential tools for training mental health professionals, offering a safe and scalable way to practice complex interactions. However, creating realistic simulations of depressed patients is difficult due to the high variability of human behavior and the safety constraints built into AI models. Current methods for evaluating these simulators often rely on generic, unreliable automated scores that fail to capture the nuances of clinical reality. This paper introduces PSI-Bench, a new evaluation framework designed to provide clinically grounded, interpretable diagnostics to measure how closely AI-simulated patients align with real-world patient behavior.

How PSI-Bench Works

PSI-Bench evaluates simulators by comparing their performance against real patient data across five key dimensions rooted in psychological research. These include narrative-emotion processes (how a patient’s story and feelings evolve), emotional expression, lexical diversity, response length, and specific linguistic markers of depression. By analyzing these factors at the turn, dialogue, and population levels, the framework identifies where simulators diverge from human behavior. It uses distance-based measures to quantify how well a simulator mimics the patterns found in real clinical conversations, providing a clear, interpretable score for developers.
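The paper's exact metrics are not reproduced here, but the distance-based idea can be sketched in a few lines. The snippet below is an illustrative toy, not PSI-Bench itself: it compares simulated turns against real-patient turns on two of the five dimensions named above (response length and lexical diversity), using a simple 1-D Wasserstein distance between the empirical distributions. The function names (`ttr`, `wasserstein_1d`, `fidelity_report`) are hypothetical, and real-patient data would come from annotated clinical transcripts.

```python
from statistics import mean

def ttr(text):
    # Type-token ratio: unique words / total words, a basic lexical-diversity proxy.
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def wasserstein_1d(xs, ys, n=100):
    # 1-D Wasserstein distance approximated via n evenly spaced empirical quantiles.
    def quantile(sorted_v, q):
        idx = q * (len(sorted_v) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(sorted_v) - 1)
        frac = idx - lo
        return sorted_v[lo] * (1 - frac) + sorted_v[hi] * frac
    xs, ys = sorted(xs), sorted(ys)
    qs = [i / (n - 1) for i in range(n)]
    return mean(abs(quantile(xs, q) - quantile(ys, q)) for q in qs)

def fidelity_report(real_turns, sim_turns):
    # Lower distances mean the simulator's distributions are closer to real patients'.
    return {
        "length_distance": wasserstein_1d(
            [len(t.split()) for t in real_turns],
            [len(t.split()) for t in sim_turns]),
        "lexical_diversity_distance": wasserstein_1d(
            [ttr(t) for t in real_turns],
            [ttr(t) for t in sim_turns]),
    }
```

A full benchmark in this spirit would add the remaining dimensions (narrative-emotion processes, emotional expression, depression-specific linguistic markers) and aggregate them at the turn, dialogue, and population levels; this sketch only conveys the distributional-comparison core.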

Key Findings on Simulator Behavior

The researchers used PSI-Bench to test seven different Large Language Models (LLMs) across two simulation frameworks. The results revealed several consistent flaws in current depression simulators. Compared to real patients, these AI models tend to produce responses that are overly long and lexically diverse. Furthermore, the simulators often resolve emotional states too quickly and follow a predictable, uniform trajectory from negative to positive, failing to capture the complex, non-linear nature of real depressive experiences.

The Impact of Frameworks and Model Scale

A significant takeaway from the study is that the simulation framework—the system guiding the AI’s behavior—has a greater impact on fidelity than the size or "intelligence" of the underlying model. Some smaller models actually outperformed larger ones, suggesting that increased linguistic sophistication does not necessarily lead to more realistic clinical behavior. The study also validated the PSI-Bench framework through a human evaluation study involving 20 mental health experts, finding that the benchmark’s automated rankings align strongly with expert clinical judgment.

Implications for Future Design

By highlighting these specific limitations, PSI-Bench serves as a guide for future simulator development. The researchers emphasize that current evaluation methods often report only average scores, which masks a lack of behavioral diversity. By moving toward a more granular, clinically grounded approach, developers can better understand where their models fail to reflect the reality of depression. This work provides an extensible foundation for building more accurate and effective training tools for the next generation of clinicians.
