Where Do CoT Training Gains Land in LLM based Agents?

Key Takeaways

  • We therefore ask what CoT training is actually improving: is the model getting better at changing its action through generated reasoning, or is it getting better at predicting the action directly from the prompt?
  • We study this question by comparing prompt actions (predicting action without CoT) with CoT actions (predicting action with CoT).
  • Across checkpoints, prompt-action quality improves substantially.
Paper Abstract

Chain-of-thought (CoT) reasoning is widely used in language-model agents, but prior work has shown that verbalized CoT is not always faithful and may instead reflect post-hoc reasoning, which means the model already knows the answer before reasoning. We therefore ask what CoT training is actually improving: is the model getting better at changing its action through generated reasoning, or is it getting better at predicting the action directly from the prompt? We study this question by comparing prompt actions (predicting action without CoT) with CoT actions (predicting action with CoT). Across checkpoints, prompt-action quality improves substantially. While interacting with the environment, the relative advantage of CoT actions over prompt actions remains similar, showing that CoT training does not widen the advantage of CoT reasoning, and it helps to improve the quality of prompt actions. We further find that later checkpoints are less likely to revise the action in response to CoT, suggesting greater reliance on the prompt. Motivated by these patterns, we selectively mask action-token supervision on a fraction of training examples. This intervention improves out-of-domain generalization.

This research investigates how training language model agents with Chain-of-Thought (CoT) reasoning actually improves their performance. While agents are trained to generate a reasoning trace before taking an action, it is often unclear if the model is truly using that reasoning to revise its decision or if it is simply becoming better at predicting the correct action directly from the prompt. By comparing "prompt actions" (decisions made without reasoning) against "CoT actions" (decisions made after reasoning), the authors diagnose how these models learn and propose a training intervention to improve generalization.
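The two decoding modes being compared can be sketched as follows. This is only an illustration: the `<think>` tags, the `Action:` prefix, the `generate` callable, and the `toy_generate` stand-in are all hypothetical, since the paper's actual prompt template and model interface are not described here.

```python
from typing import Callable

# Hypothetical markers; the paper's real template is not given in this summary.
THINK_OPEN = "<think>"
ACTION_PREFIX = "Action:"


def _first_line(text: str) -> str:
    """Return the first non-empty line of text, or '' if there is none."""
    lines = text.strip().splitlines()
    return lines[0].strip() if lines else ""


def prompt_action(generate: Callable[[str], str], prompt: str) -> str:
    """Prompt action: prefill the action prefix so no reasoning is generated."""
    completion = generate(f"{prompt}\n{ACTION_PREFIX}")
    return _first_line(completion)


def cot_action(generate: Callable[[str], str], prompt: str) -> str:
    """CoT action: let the model reason first, then read the action after it."""
    completion = generate(f"{prompt}\n{THINK_OPEN}")
    # Everything after the action prefix is treated as the chosen action.
    _, _, tail = completion.partition(ACTION_PREFIX)
    return _first_line(tail)


def toy_generate(prefix: str) -> str:
    """Stand-in for a language model, used only to make the sketch runnable."""
    if prefix.endswith(ACTION_PREFIX):
        return " open drawer\n"
    return " The key is probably in the drawer.</think>\nAction: open drawer\n"
```

With a real model in place of `toy_generate`, running both functions on the same prompt gives the paired actions whose quality can then be compared checkpoint by checkpoint.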

The "Shortcut" Problem

The study finds that as models undergo CoT training, their ability to predict the correct action directly from the prompt improves substantially. In many agent settings, the prompt (task instructions, interaction history, and environment feedback) is much longer than the reasoning trace. Because of this, the model's attention and gradient signals during training are heavily weighted toward the prompt. Consequently, the model learns to use the prompt as a "shortcut" to the final action, rather than relying on the reasoning trace to arrive at a conclusion.

Diagnostic Findings

To track these dynamics, the researchers used three diagnostics: following performance across training checkpoints, evaluating on unseen tasks, and testing how models react when given a conflicting reasoning trace. As training progresses, the advantage of CoT actions over prompt-only actions stays roughly constant rather than widening, while prompt-action quality itself improves substantially. Furthermore, when the researchers deliberately supplied a "conflicting" reasoning trace that pointed to a different action, later checkpoints were more likely to ignore the reasoning and stick with the action implied by the prompt. This suggests that the models become increasingly "anchored" to the prompt over time.
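A minimal sketch of how such per-checkpoint diagnostics could be tabulated is shown below. The `Record` fields and the definition of "revision rate" (the share of episodes where the final action matches the action suggested by the injected trace) are assumptions made for illustration, not the authors' exact metrics, and the numbers in the usage example are invented toy values, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluation episode (hypothetical schema)."""
    checkpoint: int      # training step of the evaluated checkpoint
    prompt_ok: bool      # did the prompt action succeed?
    cot_ok: bool         # did the CoT action succeed?
    injected_act: str    # action suggested by the conflicting trace
    final_act: str       # action the model actually took after that trace


def summarize(records: list[Record]) -> dict[int, dict[str, float]]:
    """Compute accuracy gap and revision rate for each checkpoint."""
    by_ckpt: dict[int, list[Record]] = defaultdict(list)
    for r in records:
        by_ckpt[r.checkpoint].append(r)

    summary = {}
    for ckpt, rs in sorted(by_ckpt.items()):
        n = len(rs)
        prompt_acc = sum(r.prompt_ok for r in rs) / n
        cot_acc = sum(r.cot_ok for r in rs) / n
        summary[ckpt] = {
            "prompt_acc": prompt_acc,
            "cot_acc": cot_acc,
            # Advantage of reasoning over answering directly from the prompt.
            "gap": cot_acc - prompt_acc,
            # How often the model follows the conflicting trace.
            "revision_rate": sum(r.final_act == r.injected_act for r in rs) / n,
        }
    return summary


# Toy data only: two checkpoints, two episodes each.
toy = [
    Record(100, False, True, "b", "b"),
    Record(100, False, False, "b", "b"),
    Record(200, True, True, "b", "a"),
    Record(200, False, True, "b", "b"),
]
stats = summarize(toy)
```

In this toy input the gap is the same at both checkpoints while the revision rate drops at the later one, which is the qualitative pattern the summary describes.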

Improving Generalization

Motivated by these patterns, which suggest that heavy supervision on final action tokens encourages shortcut learning, the authors introduce a training intervention called "reduced action supervision." In this approach, they selectively mask the loss on final action tokens for a portion of the training data, pushing the model to focus more on the reasoning process itself. The intervention was tested on ALFWorld, ScienceWorld, and BFCL. The results show that it improves generalization to out-of-domain tasks, which the authors attribute to reduced over-reliance on the prompt and more robust reasoning-based decision-making.
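One way to picture the masking step is the sketch below, which builds per-token training labels for a prompt, a reasoning trace, and an action. Several things here are assumptions rather than details from the paper: prompt tokens are excluded from the loss (a common supervised fine-tuning convention), masked examples are picked uniformly at random, and the helper names and `mask_fraction` parameter are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_labels(prompt_ids, cot_ids, action_ids, mask_action):
    """Return one label per token; IGNORE_INDEX marks unsupervised positions."""
    # Assumption: the prompt is never a training target.
    labels = [IGNORE_INDEX] * len(prompt_ids)
    # The reasoning trace is always supervised.
    labels += list(cot_ids)
    # The action is supervised only when this example is not masked.
    if mask_action:
        labels += [IGNORE_INDEX] * len(action_ids)
    else:
        labels += list(action_ids)
    return labels


def choose_masked(n_examples, mask_fraction, seed=0):
    """Pick which examples lose action supervision (random choice is an assumption)."""
    rng = random.Random(seed)
    k = round(n_examples * mask_fraction)
    return set(rng.sample(range(n_examples), k))


# Usage: mask the action loss on roughly 30% of a 10-example dataset.
masked = choose_masked(n_examples=10, mask_fraction=0.3)
example_labels = build_labels([11, 12], [21, 22, 23], [31], mask_action=0 in masked)
```

On masked examples the model still sees the action tokens as input context, but receives no gradient for predicting them, so the only training signal comes from the reasoning trace.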
