This paper investigates whether collective intelligence—the ability of a group to solve problems better than any individual member—emerges naturally as AI agent populations scale to millions. While human societies demonstrate this phenomenon through complex social interaction, it remains unclear whether large-scale autonomous agent societies, such as the MoltBook platform, possess the same capability. The authors introduce a new evaluation framework to test this hypothesis, moving beyond raw scale to examine the quality of interactions within these digital communities.
The Superminds Test Framework
To measure collective intelligence rigorously, the researchers developed the Superminds Test. This framework uses "Probing Agents" (controlled, disguised agents injected into the live MoltBook platform) to post specific tasks and observe how the society responds. The evaluation is organized into a three-tier hierarchy:

* Tier I (Joint Reasoning): Can the group discuss a problem and converge on a solution that is better than what any single agent could produce?
* Tier II (Information Synthesis): Can agents successfully read and combine information that is scattered across multiple different contributors?
* Tier III (Basic Interaction): Can agents perform simple, coordinated tasks, such as following a conversational context or responding to one another?
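The probing protocol can be pictured as a simple loop: post a task through a disguised agent, score the society's replies, and compare against a single model answering the same task. The sketch below is illustrative only; all names (`ProbeTask`, `score_replies`, `run_probe`, the exact-match scoring) are assumptions, not the paper's actual framework.

```python
from dataclasses import dataclass

# Hypothetical sketch of one Superminds Test probe. Names and the
# exact-match scoring rule are illustrative assumptions, not the
# paper's real API.

@dataclass
class ProbeTask:
    tier: int       # 1 = joint reasoning, 2 = synthesis, 3 = basic interaction
    prompt: str
    answer: str     # reference answer used for scoring

def score_replies(task, replies):
    """Score 1.0 if any reply contains the reference answer, else 0.0."""
    return 1.0 if any(task.answer in r for r in replies) else 0.0

def run_probe(task, post_fn, baseline_fn):
    """Post a task via a disguised probing agent, then compare the
    society's replies against a single frontier model on the same task."""
    society = score_replies(task, post_fn(task.prompt))
    baseline = score_replies(task, [baseline_fn(task.prompt)])
    return {"tier": task.tier, "society": society, "baseline": baseline}

# Toy usage: a "society" that never replies vs. a baseline that answers.
task = ProbeTask(tier=3, prompt="Count the replies to this post.", answer="42")
result = run_probe(task, post_fn=lambda p: [], baseline_fn=lambda p: "42")
# result["society"] == 0.0, result["baseline"] == 1.0
```

Running many such probes across the three tiers and aggregating society-versus-baseline scores is, in outline, how a social platform can be treated as a diagnostic instrument.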
Testing Intelligence at Scale
The researchers deployed these Probing Agents into MoltBook, which hosts over two million autonomous agents. By using tasks ranging from complex logical reasoning problems (such as those found in "Humanity’s Last Exam") to simple coordination exercises like counting, the team was able to treat the entire social platform as a diagnostic instrument. This allowed them to see if the society’s collective output surpassed the performance of individual frontier models acting in isolation.
Key Findings: The Absence of Collective Intelligence
The experiments revealed a stark absence of collective intelligence. The society failed to outperform individual models on complex reasoning tasks, rarely synthesized information across multiple posts, and struggled even with trivial coordination.
The study identifies a critical bottleneck: interactions within the society are extremely sparse and shallow. Most posts receive no replies at all, and when agents do interact, the conversations rarely extend beyond a single exchange. The platform functions more like a collection of independent broadcasts than a collaborative society.
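The sparsity finding boils down to two simple metrics over the platform's threads: the fraction of posts that get any reply, and the average number of replies per post. A minimal sketch, assuming each thread is reduced to its reply count (a simplification of real thread structure):

```python
# Interaction-sparsity metrics, illustrating the bottleneck described
# above. The flat reply-count representation is an assumption made
# for this example.

def interaction_stats(reply_counts):
    """reply_counts: list of reply counts per post (0 = no replies).
    Returns (reply_rate, avg_replies_per_post)."""
    unanswered = sum(1 for r in reply_counts if r == 0)
    reply_rate = 1 - unanswered / len(reply_counts)
    avg_replies = sum(reply_counts) / len(reply_counts)
    return reply_rate, avg_replies

# Example: a sparse platform where most posts go unanswered.
rate, depth = interaction_stats([0, 0, 0, 1, 0, 2, 0, 0])
# rate = 0.25, depth = 0.375
```

Low values on both metrics mean agents mostly broadcast past each other, which explains why synthesis and joint reasoning never get off the ground.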
Why Scale Is Not Enough
The primary takeaway is that collective intelligence does not emerge spontaneously from scale alone. Even with millions of agents, the lack of meaningful engagement and the inability of agents to build upon each other’s work prevent the group from achieving outcomes beyond the reach of a single agent. The authors conclude that future research must focus on developing agent architectures that prioritize sustained interaction, shared conversational context, and better mechanisms for coordinating collective behavior.
