Where Do CoT Training Gains Land in LLM-based Agents?
This research investigates how training language model agents with Chain-of-Thought (CoT) reasoning actually improves their performance. While agents are trained to generate a reasoning trace before taking an action, it is often unclear if the model is truly using that reasoning to revise its decision or if it is simply becoming better at predicting the correct action directly from the prompt. By comparing "prompt actions" (decisions made without reasoning) against "CoT actions" (decisions made after reasoning), the authors diagnose how these models learn and propose a training intervention to improve generalization.
The "Shortcut" Problem
The study finds that as models undergo CoT training, their ability to predict the correct action directly from the prompt improves significantly. In many agent settings, the prompt (task instructions, interaction history, and environment feedback) is much longer than the reasoning trace. Because of this, the model's attention and gradient signals during training are heavily weighted toward the prompt. Consequently, the model learns to use the prompt as a "shortcut" to the final action instead of relying on the reasoning trace to reach it.
Diagnostic Findings
To track these dynamics, the researchers used three primary methods: observing training progress, evaluating performance on unseen tasks, and testing how models react when provided with conflicting reasoning traces. They discovered that as training progresses, the performance gap between CoT-based actions and prompt-only actions remains flat. Furthermore, when researchers intentionally provided a "conflicting" reasoning trace that suggested a different action, later-stage models were more likely to ignore the reasoning and stick to the action implied by the prompt. This confirms that the models become increasingly "anchored" to the prompt over time.
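The two quantities tracked above can be made concrete with a small sketch. This is an illustration, not the authors' evaluation code: the function names `accuracy`, `cot_prompt_gap`, and `anchoring_rate` are hypothetical, actions are assumed to be comparable as plain strings, and "conflict" is assumed to mean that the trace suggests a different action from the one the model picks from the prompt alone.

```python
def accuracy(predicted, gold):
    """Fraction of predicted actions that exactly match the gold actions."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def cot_prompt_gap(cot_actions, prompt_actions, gold_actions):
    """Accuracy of actions decoded after reasoning minus accuracy of
    actions decoded directly from the prompt (hypothetical metric name)."""
    return accuracy(cot_actions, gold_actions) - accuracy(prompt_actions, gold_actions)


def anchoring_rate(actions_under_conflict, prompt_actions, trace_actions):
    """Among examples where the injected trace suggests a different action
    than the prompt-only prediction, the fraction where the model still
    outputs the prompt-only action (i.e. ignores the trace)."""
    kept, total = 0, 0
    for final, from_prompt, from_trace in zip(
        actions_under_conflict, prompt_actions, trace_actions
    ):
        if from_prompt == from_trace:
            continue  # not a conflict case, skip it
        total += 1
        kept += final == from_prompt
    return kept / total if total else 0.0
```

Computing both numbers at several training checkpoints gives curves of the kind the summary describes: the gap over training steps, and how often later checkpoints stay with the prompt-implied action when the trace disagrees.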
Improving Generalization
Based on the finding that excessive supervision on final action tokens encourages shortcut learning, the authors introduce a training intervention called "reduced action supervision." In this approach, they mask the loss on the final action tokens for a portion of the training data, so that those examples supervise only the reasoning process. The intervention was tested on ALFWorld, ScienceWorld, and BFCL. The results show that it improves generalization to out-of-domain tasks, which the authors attribute to reduced reliance on the prompt and more robust reasoning-based decision-making.
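A minimal sketch of how such masking might be set up at the label level follows. It is not the authors' implementation. It assumes each training sequence is laid out as prompt, reasoning trace, then final action, that prompt tokens are already excluded from the loss (a common fine-tuning convention, not something the summary states), and that the loss skips label value -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. The names `build_labels`, `choose_masked_examples`, and `mask_fraction` are hypothetical, and the summary does not say what fraction of the data is masked.

```python
import random

IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch's default ignore_index)


def build_labels(token_ids, prompt_len, action_start, mask_action):
    """Build per-token labels for one sequence: [prompt | reasoning | action].

    Prompt tokens are never supervised (assumption). When mask_action is
    True, the final action tokens are also excluded, so only the reasoning
    trace contributes to the loss for this example.
    """
    labels = list(token_ids)
    for i in range(prompt_len):
        labels[i] = IGNORE_INDEX
    if mask_action:
        for i in range(action_start, len(labels)):
            labels[i] = IGNORE_INDEX
    return labels


def choose_masked_examples(num_examples, mask_fraction, seed=0):
    """Pick which example indices get their action tokens masked."""
    rng = random.Random(seed)
    k = round(num_examples * mask_fraction)
    return set(rng.sample(range(num_examples), k))


# Usage: a toy sequence with 2 prompt tokens, 2 reasoning tokens, 2 action tokens.
tokens = [10, 11, 12, 13, 14, 15]
full = build_labels(tokens, prompt_len=2, action_start=4, mask_action=False)
reduced = build_labels(tokens, prompt_len=2, action_start=4, mask_action=True)
```

Masking at the label level leaves the model's inputs unchanged: the action tokens still appear in the sequence, they just stop contributing gradient for the selected examples.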