Key Takeaways

Modern AI agents, such as multimodal large language models (MLLMs), are increasingly used to perform tasks in human environments.
While explicit goals may render certain actions optimal, implicit social norms often impose hidden constraints.
Existing evaluations typically focus on explicit goal achievement or direct norm knowledge, seldom assessing whether planners can infer and apply these hidden constraints within action sequences.
We introduce NormAct, a benchmark for embodied social-norm interactions that evaluates plans on Goal Achievement, Norm Compliance, and overall Task Success.
NormAct uniquely embeds hidden norms within ordinary tasks, testing whether models can realize them without explicit instruction.

Paper Abstract

Multimodal large language models (MLLMs) are increasingly deployed as embodied planners in egocentric environments, where task success requires not only achieving instructed goals but also acting in socially appropriate ways. While explicit goals may render certain actions optimal, implicit social norms often impose hidden constraints. Existing evaluations typically focus on explicit goal achievement or direct norm knowledge, seldom assessing whether planners can infer and apply these hidden constraints within action sequences. We introduce NormAct, a benchmark for embodied social-norm interactions that evaluates plans on Goal Achievement, Norm Compliance, and overall Task Success. NormAct uniquely embeds hidden norms within ordinary tasks, testing whether models can realize them without explicit instruction. Experiments with state-of-the-art MLLMs (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro) reveal a significant gap: models achieve explicit goals in 67.3% of cases, but comply with hidden norms in only 26.4%. Cue-condition experiments indicate that this gap stems not from a lack of general social knowledge, but from challenges in activating and grounding relevant norms in context. To address this, we propose NormPerceptor, a context-conditioned cue generator that infers scene-relevant norms prior to planning, increasing Task Success from 24.2% to 46.7%. Our results underscore the importance of enabling embodied agents to proactively detect hidden norms, ground them in visual evidence, and integrate them as action-planning constraints. Our benchmark is publicly available.

NormAct: A Benchmark for Hidden Social Norm Compliance in Embodied Planning
Modern AI agents, such as multimodal large language models (MLLMs), are increasingly used to perform tasks in human environments. While these agents are becoming proficient at completing explicit goals, like picking up an object or moving to a room, they often struggle to navigate the "hidden" social rules that govern human spaces. This paper introduces NormAct, a new benchmark designed to test whether AI agents can identify and follow implicit social norms, such as waiting in line, respecting privacy, or turning off a faucet, without being explicitly told to do so.

Evaluating Hidden Social Constraints

Most existing AI benchmarks focus on whether an agent achieves a specific goal. However, an agent might successfully retrieve an item while simultaneously violating a social norm, such as cutting in line or entering a private room. NormAct addresses this by evaluating agents on three distinct metrics: Goal Achievement (did the agent finish the task?), Norm Compliance (did it follow the hidden social rule?), and Task Success (did it do both?). The benchmark comprises 550 scenarios across five categories, among them public rules, etiquette, and resource responsibility, and requires agents to infer the correct behavior from their visual surroundings.
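To make the relationship between the three metrics concrete, here is a minimal sketch that aggregates them over a set of evaluated plans. The `PlanResult` record and the `summarize` helper are hypothetical names introduced for illustration and are not taken from the NormAct release. The sketch assumes each plan receives a binary judgment per metric and that Task Success is the conjunction of the other two.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PlanResult:
    """Outcome of one evaluated plan (hypothetical record layout)."""
    goal_achieved: bool   # did the plan reach the instructed goal?
    norm_compliant: bool  # did the plan respect the hidden norm?

    @property
    def task_success(self) -> bool:
        # Assumption: a plan succeeds only if it satisfies both criteria.
        return self.goal_achieved and self.norm_compliant


def summarize(results: Sequence[PlanResult]) -> Dict[str, float]:
    """Return the fraction of plans that pass each metric."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "goal_achievement": sum(r.goal_achieved for r in results) / n,
        "norm_compliance": sum(r.norm_compliant for r in results) / n,
        "task_success": sum(r.task_success for r in results) / n,
    }


# Toy data, not benchmark results: three plans reach the goal,
# but only one of those three also follows the norm.
toy = [
    PlanResult(goal_achieved=True, norm_compliant=True),
    PlanResult(goal_achieved=True, norm_compliant=False),
    PlanResult(goal_achieved=True, norm_compliant=False),
    PlanResult(goal_achieved=False, norm_compliant=True),
]
rates = summarize(toy)
print(rates)
```

Under this conjunctive definition, the Task Success rate can never exceed either of the other two rates, which is why a high Goal Achievement score alone says little about socially acceptable behavior.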

The Gap Between Knowledge and Action

When the researchers tested state-of-the-art models (GPT-5.4, Claude Opus 4.7, and Gemini 3 Pro), they found a large performance gap: the models achieved the explicit goals in 67.3% of cases but complied with the hidden social norms in only 26.4%. Cue-condition experiments indicated that this failure is not due to a lack of general social knowledge. When the researchers provided explicit instructions about the norms, the models were often able to follow them. This suggests that the primary challenge for current AI is not knowing what a norm is, but rather "activating" and "grounding" the relevant norm based on the visual evidence in the immediate environment.

Introducing NormPerceptor

To bridge this gap, the authors developed NormPerceptor, a context-conditioned cue generator. Instead of relying on human-written instructions, this module analyzes the agent's first-person visual observations and the task goal to automatically infer which social norms are relevant to the current scene. By generating these "social cues" before the agent begins planning its actions, NormPerceptor helps the model integrate social constraints into its decision-making process. In experiments, this approach increased overall Task Success from 24.2% to 46.7%, demonstrating that helping an agent "see" the social context is key to more responsible and effective embodied behavior.
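The two-stage flow described above (infer cues first, then plan with them) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `plan_with_cues`, `build_planning_prompt`, and the toy cue generator and planner are hypothetical names, and plain-text scene descriptions stand in for the first-person images a real MLLM would receive.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces: in a real system both would wrap MLLM calls.
CueGenerator = Callable[[Sequence[str], str], List[str]]
Planner = Callable[[str], List[str]]


def build_planning_prompt(goal: str, cues: Sequence[str]) -> str:
    """Combine the task goal with any inferred norms into one prompt."""
    lines = [f"Goal: {goal}"]
    if cues:
        lines.append("Scene-relevant social norms to respect:")
        lines.extend(f"- {cue}" for cue in cues)
    lines.append("Produce a step-by-step action plan.")
    return "\n".join(lines)


def plan_with_cues(
    observations: Sequence[str],
    goal: str,
    generate_cues: CueGenerator,
    planner: Planner,
) -> List[str]:
    # Step 1: infer which norms apply to this scene, before any planning.
    cues = generate_cues(observations, goal)
    # Step 2: hand the norms to the planner as explicit constraints.
    return planner(build_planning_prompt(goal, cues))


# Toy stand-ins so the sketch runs without a model (assumptions only).
def toy_cue_generator(observations: Sequence[str], goal: str) -> List[str]:
    if any("queue" in obs for obs in observations):
        return ["Wait in line behind people already queuing."]
    return []


def toy_planner(prompt: str) -> List[str]:
    steps = []
    if "Wait in line" in prompt:
        steps.append("join the end of the queue")
    steps.append("pick up the coffee")
    return steps


queue_plan = plan_with_cues(
    ["a queue of three people at the counter"],
    "Get a coffee",
    toy_cue_generator,
    toy_planner,
)
print(queue_plan)
```

Keeping cue generation separate from planning means the planner itself is unchanged; only its input is enriched, which matches the summary's point that the missing piece is activating the relevant norm rather than knowing it.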
