AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
As AI companion platforms like Replika and Character.AI grow in popularity, there is an increasing need to monitor the safety of the intimate, long-term conversations users have with these systems. This paper introduces AICompanionBench, the first publicly available dataset designed to evaluate how well Large Language Models (LLMs) can act as "judges" to detect unsafe interactions. By analyzing over 2,000 real-world conversations, the researchers provide a standardized way to test whether AI models can accurately identify harmful content, ranging from verbal aggression to manipulation.
Building the Benchmark
To create this dataset, the researchers collected 2,123 real-world conversation snippets from Reddit. They focused on nine distinct categories of interaction: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and "no-harm" (benign) interactions. Annotation proceeded in two stages: automated LLM screening, followed by manual review by a human expert to ensure quality and accuracy. The dataset gives researchers a way to measure how effectively different AI models flag potential risks in a companionship context.
Evaluating AI Judges
The study tested 20 state-of-the-art open-source and closed-source LLMs on how well they could classify these conversations. Performance varied widely, with accuracy scores ranging from 26% to 85%. Generally, the GPT family of models performed best, while smaller models struggled significantly. The researchers also found that larger models tended to judge safety better, whereas adding specific "reasoning" enhancements did not consistently improve performance on this benchmark.
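As an illustration of how an accuracy score like those above could be computed, the sketch below compares a judge's predicted category for each conversation against the human-reviewed label. The label strings, the `accuracy` helper, and the toy data are assumptions made for this example; they are not the paper's code, label names, or results.

```python
from typing import Sequence

# Hypothetical label strings for the nine categories named in this summary.
CATEGORIES = [
    "sexual_behavior", "antisocial_behavior", "physical_aggression",
    "verbal_aggression", "substance_abuse", "self_harm_suicide",
    "control", "manipulation", "no_harm",
]


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of conversations where the judge's label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data (invented): the judge confuses one manipulation case with control.
gold = ["manipulation", "no_harm", "control", "substance_abuse"]
pred = ["control", "no_harm", "control", "substance_abuse"]
score = accuracy(gold, pred)  # 3 of 4 labels match
print(f"accuracy = {score:.2f}")
```

With nine categories, a single accuracy number hides which categories a judge confuses, which is why per-category metrics are usually reported alongside it.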
Key Findings and Limitations
The study highlights that while current LLMs are quite capable of detecting explicit harmful content, such as sexual behavior or substance abuse, they struggle with subtler categories. Specifically, "manipulation" proved to be the most difficult category, with no model tested reaching a precision score of 80%. Additionally, the researchers noted that many models tend to over-identify harm, incorrectly flagging benign, safe conversations as dangerous. These findings suggest that while LLMs are promising tools for safety monitoring, they are not yet reliable enough and require further development to handle the complexities of human-AI emotional relationships.
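The two failure modes described above, low precision on a category and over-flagging of benign conversations, can each be measured directly. The following is a minimal sketch under stated assumptions: the function names, the `"no_harm"` label string, and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def per_category_precision(gold: Sequence[str],
                           predicted: Sequence[str]) -> Dict[str, float]:
    """Precision per category: of all items the judge assigned to a
    category, the fraction whose gold label is that same category."""
    predicted_counts = Counter(predicted)
    true_positives = Counter(p for g, p in zip(gold, predicted) if g == p)
    return {cat: true_positives[cat] / n for cat, n in predicted_counts.items()}


def benign_overflag_rate(gold: Sequence[str], predicted: Sequence[str],
                         benign: str = "no_harm") -> float:
    """Fraction of gold-benign conversations the judge labeled as harmful."""
    benign_preds = [p for g, p in zip(gold, predicted) if g == benign]
    if not benign_preds:
        raise ValueError("no benign conversations in the gold labels")
    return sum(p != benign for p in benign_preds) / len(benign_preds)


# Toy data (invented for illustration).
gold = ["manipulation", "no_harm", "no_harm", "control", "no_harm", "manipulation"]
pred = ["manipulation", "manipulation", "no_harm", "control", "control", "control"]

precision = per_category_precision(gold, pred)
rate = benign_overflag_rate(gold, pred)
print(precision)
print(f"benign over-flag rate = {rate:.2f}")
```

In this toy data, one of the two "manipulation" predictions is actually benign, so precision for that category is 0.5, and two of the three benign conversations are flagged as harmful.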