NormAct: A Benchmark for Hidden Social Norm Compliance in Embodied Planning
Modern AI agents, such as Multimodal Large Language Models (MLLMs), are increasingly used to perform tasks in human environments. These agents are becoming proficient at completing explicit goals, like picking up an object or moving to a room, but they often struggle with the "hidden" social rules that govern human spaces. This paper introduces NormAct, a new benchmark designed to test whether AI agents can identify and follow implicit social norms, such as waiting in line, respecting privacy, or turning off a faucet, without being explicitly told to do so.
Evaluating Hidden Social Constraints
Most existing AI benchmarks focus on whether an agent achieves a specific goal. However, an agent might successfully retrieve an item while simultaneously violating a social norm, such as cutting in line or entering a private room. NormAct addresses this by evaluating agents on three distinct metrics: Goal Achievement (did they finish the task?), Norm Compliance (did they follow the hidden social rule?), and Task Success (did they do both?). The benchmark includes 550 scenarios across five categories, including public rules, etiquette, and resource responsibility, requiring agents to infer the correct behavior from their visual surroundings.
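The relationship between the three metrics can be made concrete with a short sketch. This is an illustration only, not the benchmark's actual evaluation code: the `EpisodeResult` type, its field names, and the `summarize` function are hypothetical, and the sketch assumes each scenario yields one boolean per criterion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one scenario (hypothetical structure, not from the paper)."""
    goal_achieved: bool   # the explicit task goal was completed
    norm_complied: bool   # the hidden social norm was respected


def summarize(results: List[EpisodeResult]) -> Dict[str, float]:
    """Return the three rates as fractions of all episodes."""
    n = len(results)
    if n == 0:
        raise ValueError("no episodes to summarize")
    goal = sum(r.goal_achieved for r in results) / n
    norm = sum(r.norm_complied for r in results) / n
    # Task Success counts an episode only if both criteria hold.
    success = sum(r.goal_achieved and r.norm_complied for r in results) / n
    return {
        "goal_achievement": goal,
        "norm_compliance": norm,
        "task_success": success,
    }


# Toy data: four episodes, only one of which satisfies both criteria.
toy_results = [
    EpisodeResult(goal_achieved=True, norm_complied=False),
    EpisodeResult(goal_achieved=True, norm_complied=True),
    EpisodeResult(goal_achieved=False, norm_complied=True),
    EpisodeResult(goal_achieved=True, norm_complied=False),
]
toy_summary = summarize(toy_results)
```

Because Task Success requires both conditions in the same episode, it can never exceed either Goal Achievement or Norm Compliance.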
The Gap Between Knowledge and Action
When testing state-of-the-art models like GPT-5.4, Claude Opus 4.7, and Gemini 3 Pro, the researchers found a significant performance gap. The models achieved the explicit goals in 67.3% of cases but complied with the hidden social norms only 26.4% of the time. Further testing revealed that this failure is not due to a lack of general social knowledge: when the researchers provided explicit instructions about the norms, the models were often able to follow them. This suggests that the primary challenge for current AI is not knowing what a norm is, but rather "activating" and "grounding" the relevant norm based on the visual evidence in the immediate environment.
Introducing NormPerceptor
To bridge this gap, the authors developed NormPerceptor, a context-conditioned cue generator. Instead of relying on human-written instructions, this module analyzes the agent's first-person visual observations and the task goal to automatically infer which social norms are relevant to the current scene. By generating these "social cues" before the agent begins planning its actions, NormPerceptor helps the model integrate social constraints into its decision-making process. In experiments, this approach increased overall Task Success from 24.2% to 46.7%, a gain of 22.5 percentage points, suggesting that helping an agent "see" the social context is key to more responsible and effective embodied behavior.
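The two-stage idea, generating cues first and planning second, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all names (`build_prompt`, `plan_with_cues`, `toy_cue_generator`, `toy_planner`) are hypothetical, the visual observation is stood in for by a text description, and the cue generator and planner are simple stubs where a real system would call a model.

```python
from typing import Callable, List

# Hypothetical interfaces: a real system would back these with model calls.
CueGenerator = Callable[[str, str], List[str]]  # (observation, goal) -> cues
Planner = Callable[[str], List[str]]            # prompt -> action sequence


def build_prompt(goal: str, observation: str, cues: List[str]) -> str:
    """Combine the goal, the observation, and any inferred cues into one prompt."""
    lines = [f"Goal: {goal}", f"Observation: {observation}"]
    if cues:
        lines.append("Social cues to respect:")
        lines.extend(f"- {cue}" for cue in cues)
    return "\n".join(lines)


def plan_with_cues(goal: str, observation: str,
                   cue_generator: CueGenerator, planner: Planner) -> List[str]:
    """Infer cues from the scene first, then plan with the cues in the prompt."""
    cues = cue_generator(observation, goal)
    return planner(build_prompt(goal, observation, cues))


def toy_cue_generator(observation: str, goal: str) -> List[str]:
    """Stub: emits a queueing cue when the scene description mentions a queue."""
    if "queue" in observation:
        return ["Wait in line before approaching the counter."]
    return []


def toy_planner(prompt: str) -> List[str]:
    """Stub: inserts waiting steps only if the prompt carries social cues."""
    if "Social cues to respect:" in prompt:
        return ["join_queue", "wait_for_turn", "pick_up_item"]
    return ["pick_up_item"]


with_queue = plan_with_cues("get a coffee", "a queue of people at the counter",
                            toy_cue_generator, toy_planner)
empty_shop = plan_with_cues("get a coffee", "an empty counter",
                            toy_cue_generator, toy_planner)
```

In this toy setup the planner itself is unchanged between the two calls. Only the prompt differs, which mirrors the summary's point that the norm must be surfaced from the scene before planning begins.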