Agentic System as Compressor: Quantifying System Intelligence in Bits
This paper introduces a new way to measure the intelligence of AI agentic systems by applying the principle that "compression is intelligence." While traditional benchmarks often focus on success rates, this research proposes that a more capable system is one that can reconstruct a target task using fewer bits of information. By treating tools, environment constraints, and search processes as shared resources between an encoder and a decoder, the authors provide a unified framework to quantify how much each component contributes to a system's overall performance.
Measuring Intelligence Through Compression
The core idea is that a truly intelligent system should be able to use its tools, retrieval capabilities, and environment feedback to "compress" the information needed to solve a task. In this framework, the system does not transmit the entire solution from scratch. Instead, it sends a compact "hint" or code that allows a decoder, which shares the same tools and environment, to reconstruct the correct output. The fewer bits this reconstruction requires, the more "intelligent" the system is considered to be, because more of the task's complexity is already carried by the shared model, tools, and environment interactions rather than by the transmitted message.
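As a rough illustration of counting intelligence in bits, the sketch below computes the ideal codelength of a target sequence as the sum of -log2 p over its tokens, which is the length an arithmetic coder approaches. The function name `codelength_bits` and both probability lists are hypothetical values chosen for illustration, not figures from the paper.

```python
import math
from typing import Sequence


def codelength_bits(token_probs: Sequence[float]) -> float:
    """Ideal codelength in bits: sum of -log2 p(token) over the target."""
    return sum(-math.log2(p) for p in token_probs)


# Hypothetical probabilities that a predictor assigns to each correct token.
# Assumption: shared tools or constraints make the correct tokens more likely.
without_tools = [0.5, 0.25, 0.25]
with_tools = [1.0, 0.5, 0.5]

print(codelength_bits(without_tools))  # 5.0 bits
print(codelength_bits(with_tools))     # 2.0 bits
```

Under these made-up numbers, the system with shared resources reconstructs the same target with 3 fewer bits.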
How the Protocol Works
To turn this theory into a practical measurement tool, the authors developed a three-part protocol:
Arithmetic Coding: Used for exact, token-by-token reconstruction, measuring the system’s raw predictive ability.
Seed Coding: Used in environments where multiple outputs might be acceptable. The system transmits the index of a successful random seed, which the decoder then replays to arrive at a valid result.
Fallback: If the system cannot find an acceptable output within its sampling budget, it automatically switches to arithmetic coding, so the target can still be reconstructed and every task receives a finite codelength.
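The seed coding and fallback steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `seed_code` and `decode`, the one-bit mode flag, and the fixed-width seed index of ceil(log2(budget)) bits are all hypothetical choices. The summary does not say how the seed index is actually encoded.

```python
import math
import random
from typing import Callable, Optional, Tuple


def seed_code(
    sample: Callable[[random.Random], str],
    accept: Callable[[str], bool],
    budget: int,
    fallback_bits: float,
) -> Tuple[Optional[int], float]:
    """Try seeds 0..budget-1 and return (seed or None, bits spent).

    Assumed cost model: 1 flag bit for the mode, plus either a fixed-width
    seed index or the arithmetic-coding codelength passed as fallback_bits.
    """
    index_bits = math.ceil(math.log2(budget)) if budget > 1 else 0
    for seed in range(budget):
        if accept(sample(random.Random(seed))):
            return seed, 1 + index_bits
    return None, 1 + fallback_bits


def decode(sample: Callable[[random.Random], str], seed: int) -> str:
    """The decoder shares `sample`, so replaying the seed gives the same output."""
    return sample(random.Random(seed))


def toy_sampler(rng: random.Random) -> str:
    """Hypothetical stochastic system: emits a random 3-letter string."""
    return "".join(rng.choice("ab") for _ in range(3))


# The observation standard here accepts any string that starts with "a".
seed, bits = seed_code(toy_sampler, lambda s: s.startswith("a"), 64, 24.0)
if seed is not None:
    print(seed, bits, decode(toy_sampler, seed))
```

The design point is that the encoder and decoder must share the sampler and its randomness exactly. Only then does a short seed index stand in for the full output.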
By comparing the average codelength of a system with and without a specific component (such as a retriever or a verifier), the researchers can calculate the "marginal bit value" of that component, effectively assigning a quantitative score to its contribution.
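The marginal bit value described above is a difference of two averages. The sketch below assumes the sign convention that a positive value means the component saves bits. The function name `marginal_bit_value` and the per-task codelengths are hypothetical, not results from the paper.

```python
from statistics import mean
from typing import Sequence


def marginal_bit_value(bits_without: Sequence[float], bits_with: Sequence[float]) -> float:
    """Average codelength without the component minus average codelength with it."""
    return mean(bits_without) - mean(bits_with)


# Hypothetical per-task codelengths (in bits) for the same three tasks.
no_retriever = [10.0, 12.0, 14.0]
with_retriever = [6.0, 8.0, 7.0]

print(marginal_bit_value(no_retriever, with_retriever))  # 5.0 bits saved per task
```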
Validating the Framework
The authors tested this approach across five distinct settings: reversed text, chess moves, protein sequences, retrieval-augmented question answering, and semantic story compression. In these experiments, adding agentic components such as rule-based constraints or verifier feedback consistently reduced the number of bits required to solve the tasks. The results indicate that the method can isolate the value of individual system parts, for example how many bits a retriever saves or how a tighter observation standard changes the difficulty of a task.
Key Considerations
This research is intended as a mechanistic exploration rather than a final benchmark for large-scale deployments. The authors highlight a clear trade-off between the compute budget invested and the resulting compression ability. They also emphasize that the measured "intelligence" of a system depends heavily on the observation standard used: if the criteria for success change, the codelength changes accordingly. The framework serves as a guide for developers to analyze where their system's capabilities actually come from, moving beyond simple pass/fail metrics to understand the efficiency of the entire agentic workflow.