AI Research

Key Takeaways

Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning.
If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight.
We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning.
To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate.

Paper Abstract

Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Current safety protocols for frontier AI models often rely on monitoring "chain-of-thought" (CoT) reasoning, the explicit steps a model writes out while solving a problem. As models become more capable, however, they may learn to perform complex reasoning internally, without generating these visible thinking tokens, which would undermine that oversight. This paper measures how well frontier models solve tasks without explicit reasoning steps and projects how this "no-CoT" capability might evolve.

Measuring Internal Reasoning

To understand the limits of no-CoT reasoning, the researchers evaluated frontier models on a suite of over 30,000 questions. These questions span 43 benchmarks across mathematics, coding, puzzles, causality, theory-of-mind, and strategic reasoning. Measuring success rates without CoT shows how much reasoning a model can carry out beyond the reach of CoT monitoring.

Defining the Time Horizon

The study introduces two key metrics to quantify model performance:

50% Task-Completion Time Horizon (TH): The human time required for tasks that the model completes with a 50% success rate.
50% Reasoning Token Horizon: The minimum number of o3-mini reasoning tokens needed for tasks that the model solves with a 50% success rate.

Both metrics place no-CoT ability on an external scale, either human time or another model's reasoning tokens, which gives a standardized way to track how much reasoning a model can perform silently.
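
One common way to turn per-task results into a 50% horizon is to fit a logistic curve of success against the logarithm of human completion time and read off where the curve crosses 50%. The summary does not state which estimator the authors use, so the sketch below is an illustrative assumption, not their method. The function name `fit_time_horizon`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np


def fit_time_horizon(human_minutes, success, n_iter=50, ridge=1e-6):
    """Estimate a 50% time horizon (hypothetical helper, not the paper's code).

    Fits P(success) = sigmoid(a + b * log2(minutes)) by Newton's method and
    returns the task length, in minutes, where the fitted probability is 0.5.
    `success` may hold 0/1 outcomes or per-task success rates in [0, 1].
    """
    x = np.log2(np.asarray(human_minutes, dtype=float))
    y = np.asarray(success, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        # Gradient and Hessian of the log-likelihood; the tiny ridge term is
        # only a numerical safeguard (an assumption of this sketch).
        grad = X.T @ (y - p) - ridge * w
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(2)
        w = w + np.linalg.solve(hess, grad)
    a, b = w
    # sigmoid(a + b * log2(t)) = 0.5  <=>  log2(t) = -a / b
    return float(2.0 ** (-a / b))


if __name__ == "__main__":
    # Toy data: success rates generated from a curve whose true horizon is 2 minutes.
    minutes = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 32])
    rates = 1.0 / (1.0 + np.exp(-(1.0 - np.log2(minutes))))
    print(fit_time_horizon(minutes, rates))
```

The same fit could be applied to the reasoning token horizon by replacing human minutes with the o3-mini token count for each task.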

Rapid Growth and Future Projections

The findings indicate that the no-CoT capabilities of frontier models are advancing quickly. The data shows that the 50% TH has been doubling roughly every year for the past six years. For instance, GPT-5.5 has already reached a TH of over 3 minutes, with a reasoning token horizon exceeding 1,500 tokens.
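
As a rough back-of-the-envelope reading of this trend, assume exact exponential growth with a doubling time $T_d$ of one year, and take 3 minutes (180 s) as the current value. Both are simplifications: the summary says "roughly every year" and "over 3 minutes". Under those assumptions, six years of doubling corresponds to a 64-fold increase, which would place the frontier no-CoT TH of six years ago at a few seconds.

```latex
\begin{align*}
\mathrm{TH}(t) &= \mathrm{TH}(t_0)\cdot 2^{(t - t_0)/T_d}, \qquad T_d \approx 1\ \text{year} \\
\frac{\mathrm{TH}(t_0)}{\mathrm{TH}(t_0 - 6\ \text{years})} &\approx 2^{6} = 64 \\
\mathrm{TH}(t_0 - 6\ \text{years}) &\approx \frac{180\ \text{s}}{64} \approx 2.8\ \text{s}
\end{align*}
```
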
Based on these trends, the researchers' median estimates project that frontier no-CoT time horizons could exceed 7 minutes by 2028 and 25 minutes by 2030, though these projections carry substantial uncertainty. They therefore recommend that frontier AI developers track no-CoT horizons explicitly, so that safety oversight remains effective as models become more capable of "thinking" without showing their work.
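
To check how the two projected figures relate to the "doubling roughly every year" trend, the sketch below computes the doubling time implied by an exponential curve through both. It treats the thresholds (7 minutes in 2028, 25 minutes in 2030) as point values, which is an assumption, since the source gives them only as values the horizon "could exceed". The function names are hypothetical.

```python
import math


def implied_doubling_time(th_start, year_start, th_end, year_end):
    """Doubling time in years of an exponential curve through two points."""
    return (year_end - year_start) / math.log2(th_end / th_start)


def extrapolate(th_ref, year_ref, year, doubling_years):
    """Time horizon at `year`, assuming constant exponential growth."""
    return th_ref * 2.0 ** ((year - year_ref) / doubling_years)


if __name__ == "__main__":
    # Thresholds from the summary, treated here as point estimates (assumption).
    d = implied_doubling_time(7.0, 2028, 25.0, 2030)
    print(f"implied doubling time: {d:.2f} years")
    print(f"2029 horizon on that curve: {extrapolate(7.0, 2028, 2029, d):.1f} min")
```

Under these assumptions the implied doubling time comes out at about 1.09 years, consistent with the stated trend of roughly one doubling per year.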
