PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
Patient simulators are becoming essential tools for training mental health professionals, offering a safe and scalable way to practice complex interactions. However, creating realistic simulations of depressed patients is difficult due to the high variability of human behavior and the safety constraints built into AI models. Current methods for evaluating these simulators often rely on generic, unreliable automated scores that fail to capture the nuances of clinical reality. This paper introduces PSI-Bench, a new evaluation framework designed to provide clinically grounded, interpretable diagnostics to measure how closely AI-simulated patients align with real-world patient behavior.
How PSI-Bench Works
PSI-Bench evaluates simulators by comparing their performance against real patient data across five key dimensions rooted in psychological research. These include narrative-emotion processes (how a patient’s story and feelings evolve), emotional expression, lexical diversity, response length, and specific linguistic markers of depression. By analyzing these factors at the turn, dialogue, and population levels, the framework identifies where simulators diverge from human behavior. It uses distance-based measures to quantify how well a simulator mimics the patterns found in real clinical conversations, providing a clear, interpretable score for developers.
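The paper does not publish its exact scoring code, so as a rough illustration of what a distance-based comparison on two of the named dimensions (response length and lexical diversity) might look like, the sketch below computes per-turn statistics for simulated and real dialogues and measures the gap between their distributions with an empirical 1-D Wasserstein distance. The function names and the toy dialogues are hypothetical, not taken from PSI-Bench.

```python
def response_lengths(dialogues):
    """Token count of every patient turn across a set of dialogues."""
    return [len(turn.split()) for dialogue in dialogues for turn in dialogue]

def type_token_ratio(turn):
    """Simple lexical-diversity proxy: unique tokens / total tokens."""
    words = turn.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def wasserstein_1d(sample_a, sample_b):
    """Empirical 1-D Wasserstein distance between two equal-size samples.

    For equal sample sizes this reduces to the mean absolute difference
    between the sorted samples (i.e., matched empirical quantiles).
    """
    a, b = sorted(sample_a), sorted(sample_b)
    assert len(a) == len(b), "sketch assumes equal sample sizes"
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy data: each dialogue is a list of patient turns (illustrative only).
real = [["i just feel tired", "most days are hard"]]
simulated = [["lately i have been feeling persistently exhausted and overwhelmed",
              "every single day feels like an insurmountable uphill struggle"]]

length_gap = wasserstein_1d(response_lengths(real), response_lengths(simulated))
```

A larger `length_gap` (or a larger distributional gap in type-token ratios) indicates the simulator drifting further from real patient behavior on that dimension, which is the kind of interpretable, per-dimension diagnostic the framework reports.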
Key Findings on Simulator Behavior
The researchers used PSI-Bench to test seven different Large Language Models (LLMs) across two simulation frameworks. The results revealed several consistent flaws in current depression simulators. Compared to real patients, these AI models tend to produce responses that are overly long and lexically diverse. Furthermore, the simulators often resolve emotional states too quickly and follow a predictable, uniform trajectory from negative to positive, failing to capture the complex, non-linear nature of real depressive experiences.
The Impact of Frameworks and Model Scale
A significant takeaway from the study is that the simulation framework—the system guiding the AI’s behavior—has a greater impact on fidelity than the size or "intelligence" of the underlying model. Some smaller models actually outperformed larger ones, suggesting that increased linguistic sophistication does not necessarily lead to more realistic clinical behavior. The study also validated the PSI-Bench framework through a human evaluation study involving 20 mental health experts, finding that the benchmark’s automated rankings align strongly with expert clinical judgment.
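The paper reports strong agreement between PSI-Bench's automated rankings and expert judgment but does not specify the exact statistic; a standard way to quantify such agreement is Spearman's rank correlation. The sketch below implements it for tie-free rankings; the seven-model rankings shown are invented for illustration, not the study's results.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings.

    Uses the closed form rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the rank difference for item i.
    """
    assert len(rank_a) == len(rank_b)
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical rankings of seven simulators: benchmark vs. expert panel.
benchmark_rank = [1, 2, 3, 4, 5, 6, 7]
expert_rank = [2, 1, 3, 5, 4, 6, 7]

agreement = spearman_rho(benchmark_rank, expert_rank)
```

A rho near 1 means the automated benchmark orders the simulators almost exactly as the clinical experts do, which is the kind of validation the study reports.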
Implications for Future Design
By highlighting these specific limitations, PSI-Bench serves as a guide for future simulator development. The researchers emphasize that current evaluation methods often report only average scores, which masks a lack of behavioral diversity. By moving toward a more granular, clinically grounded approach, developers can better understand where their models fail to reflect the reality of depression. This work provides an extensible foundation for building more accurate and effective training tools for the next generation of clinicians.