Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

This research investigates how well AI-powered dialogue assistants, such as GitHub Copilot, help software developers evaluate Non-Functional Requirements (NFRs): system properties such as security and regulatory compliance that are critical but often vaguely specified. Current AI benchmarks focus on functional code correctness; this study instead examines the quality of the collaborative, multi-turn conversations needed to assess complex requirements such as those mandated by the Health Insurance Portability and Accountability Act (HIPAA).

Evaluating AI in Complex Tasks

The researchers recruited 49 programmers to interact with an LLM-based agent to assess 148 HIPAA-derived requirements against the iTrust codebase, a system designed for health record management. The study evaluated the AI’s performance across three specific dimensions: the requirement's satisfaction level (e.g., whether the code is compliant), the reasoning behind that assessment, and the specific location in the code where the requirement is addressed. To measure success, the team compared the AI's responses against a manually verified expert ground truth and analyzed how the developers perceived the quality of the interaction.

The Gap Between Perception and Accuracy

A striking finding of the study is the disconnect between how developers perceive the AI's performance and how accurate it actually is. Participants consistently rated the AI's responses as high-quality, agreeing with the agent's assessments 91% to 94% of the time. Measured against the expert ground truth, however, the AI's accuracy was low: it struggled to identify the correct code locations (F1 score of 0.203) and to determine the correct satisfaction status of requirements (F1 score of 0.381). This suggests that AI assistants are persuasive and produce responses that feel correct to users, yet may be supplying inaccurate information that developers are likely to accept without sufficient scrutiny.
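To make the F1 figures above concrete, here is a minimal sketch of how a set-based F1 score could be computed for one requirement's code locations. The summary does not say how the study aggregated its scores (for example, per requirement or across the whole dataset), so the function name `f1_score`, the file-level granularity, and the file names are all assumptions made for illustration.

```python
def f1_score(predicted: set[str], expected: set[str]) -> float:
    """Set-based F1 between the agent's answer and the expert ground truth."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        # Covers empty inputs and answers with no overlap at all.
        return 0.0
    precision = true_positives / len(predicted)  # share of the agent's answer that is correct
    recall = true_positives / len(expected)      # share of the ground truth that was found
    return 2 * precision * recall / (precision + recall)


# Hypothetical file names, not taken from the iTrust codebase.
agent_locations = {"FileA.java", "FileB.java", "FileC.java", "FileD.java"}
expert_locations = {"FileB.java", "FileE.java"}

print(round(f1_score(agent_locations, expert_locations), 3))  # prints 0.333
```

In this made-up case the agent names four files, only one of which is among the two the experts identified, so precision is 0.25, recall is 0.5, and F1 is about 0.333. An answer can look thorough to a reader and still score low once it is checked against the ground truth.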

What Drives User Satisfaction

By modeling the dialogue characteristics of these interactions, the researchers identified specific behaviors that influence how satisfied a developer feels with the AI. They found that "verbose" (overly long) responses and a high volume of information-providing turns actually decrease user satisfaction. Conversely, when the AI engages in "proactive" interactions, user satisfaction increases. These insights suggest that for AI tools to be more effective in professional software engineering, they should prioritize concise, targeted, and proactive engagement rather than simply providing large amounts of information.

Implications for Future Design

The study highlights that standard benchmarks for AI, which focus on single-turn functional tasks, are insufficient for assessing the collaborative reasoning needed for NFRs. Because developers tend to trust the AI's output even when it is factually incorrect, the design of future dialogue systems must account for this "over-trust" phenomenon. The researchers provide an open-source dataset of their dialogues to help future developers design AI agents that better support the nuanced, multi-turn problem-solving required for regulatory and security-focused software development.
