AICompanionBench: Benchmarking LLMs-as-Judges for A...

AICompanionBench: Benchmarking LLMs-as-Judges for A... | AI Research

Key Takeaways

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety As AI companion platforms like Replika and Character.AI grow in popularity, there is an...
As AI companion platforms such as Replika and this http URL rapidly grow, concerns about unsafe human-AI interactions have intensified.
This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories.
Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions.
Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions.

Paper AbstractExpand

As AI companion platforms such as Replika and this http URL rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: this https URL

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
As AI companion platforms like Replika and Character.AI grow in popularity, there is an increasing need to monitor the safety of the intimate, long-term conversations users have with these systems. This paper introduces AICompanionBench, the first publicly available dataset designed to evaluate how well Large Language Models (LLMs) can act as "judges" to detect unsafe interactions. By analyzing over 2,000 real-world conversations, the researchers provide a standardized way to test whether AI models can accurately identify harmful content, ranging from verbal aggression to manipulation.

Building the Benchmark

To create this dataset, the researchers collected 2,123 real-world conversation snippets from Reddit. They focused on nine distinct categories of interaction: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and "no-harm" (benign) interactions. The annotation process involved a rigorous combination of automated LLM screening followed by manual review by a human expert to ensure high quality and accuracy. This dataset serves as a critical tool for researchers to measure how effectively different AI models can flag potential risks in a companionship context.

Evaluating AI Judges

The study tested 20 state-of-the-art open-source and closed-source LLMs to see how well they could classify these conversations. The results showed a wide gap in performance, with accuracy scores ranging from 26% to 85%. Generally, the GPT family of models performed the best, while smaller models struggled significantly. The researchers also found that increasing the scale of a model—making it larger—tended to improve its ability to judge safety, whereas adding specific "reasoning" enhancements did not consistently lead to better performance on this benchmark.

Key Findings and Limitations

The study highlights that while current LLMs are quite capable of detecting explicit harmful content, such as sexual behavior or substance abuse, they struggle with more subtle or nuanced categories. Specifically, "manipulation" proved to be the most difficult category, with every model tested failing to achieve a precision score of 80%. Additionally, the researchers noted that many models tend to over-identify harm, incorrectly flagging benign, safe conversations as dangerous. These findings suggest that while LLMs are promising tools for safety monitoring, they are not yet perfect and require further development to reliably handle the complexities of human-AI emotional relationships.

AICompanionBench: Benchmarking LLMs-as-Judges for A... | AI Research

Key Takeaways

Building the Benchmark

Evaluating AI Judges

Key Findings and Limitations

Comments (0)

No comments yet