Learning CLI Agents with Structured Action Credit under Selective Observation
This paper addresses the challenge of training AI agents to operate effectively within command-line interface (CLI) environments. While CLI agents are powerful tools for coding and managing filesystems, they often struggle with two major hurdles: identifying relevant information in large, complex codebases and assigning credit to specific actions when rewards are only provided at the very end of a long task. The authors introduce a new learning framework designed to improve how these agents process workspace data and learn from their successes and failures.
Selective Observation with σ-Reveal
To help agents navigate large repositories, the authors developed σ-Reveal. This mechanism acts as an inference-time filter that selects the most important parts of a filesystem to show the agent before it takes its first action. By calculating a relevance score for files based on the task description, file depth, and file type, σ-Reveal creates a "token-budgeted" view of the workspace. This ensures the agent focuses on the most critical evidence without being overwhelmed by irrelevant data, effectively managing the agent's context window.
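The scoring-and-budgeting idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `FileInfo` type, the `TYPE_PRIOR` weights, the word-overlap scoring, and the greedy selection are all assumptions made for the example, since the summary does not give the actual formula.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical per-extension priors; the source does not state real weights.
TYPE_PRIOR = {".py": 1.0, ".md": 0.6, ".json": 0.5, ".log": 0.2}


@dataclass(frozen=True)
class FileInfo:
    path: str
    tokens: int  # estimated token cost of showing this file to the agent


def relevance(task: str, f: FileInfo) -> float:
    """Assumed score: task/path word overlap + depth bonus + file-type prior."""
    p = PurePosixPath(f.path)
    task_words = {w.lower().strip(".,") for w in task.split()}
    path_words = set()
    for part in p.parts:
        normalized = part.lower().replace(".", "_").replace("-", "_")
        path_words.update(normalized.split("_"))
    overlap = len(task_words & path_words)
    depth = len(p.parts) - 1           # 0 for files at the workspace root
    depth_score = 1.0 / (1.0 + depth)  # shallower files score higher
    type_score = TYPE_PRIOR.get(p.suffix, 0.1)
    return overlap + depth_score + type_score


def reveal(task: str, files: list[FileInfo], budget: int) -> list[str]:
    """Greedily pick the highest-scoring files that fit in the token budget."""
    ranked = sorted(files, key=lambda f: relevance(task, f), reverse=True)
    shown, used = [], 0
    for f in ranked:
        if used + f.tokens <= budget:
            shown.append(f.path)
            used += f.tokens
    return shown


workspace = [
    FileInfo("src/parser.py", 300),
    FileInfo("logs/old/run.log", 500),
    FileInfo("README.md", 200),
    FileInfo("data/config.json", 400),
]
view = reveal("fix the parser bug", workspace, budget=600)
```

With a 600-token budget, the two highest-scoring files fit and the rest are left out of the initial view. A greedy pass is only one possible way to fill the budget; the paper may use a different selection rule.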
Action Advantage Assignment (A³)
The core of the agent's learning process is a method called Action Advantage Assignment (A³). Unlike standard reinforcement learning methods, which often treat an entire sequence of actions as a single block, A³ breaks the learning signal down into three distinct layers:
Episode Backbone: Compares the final outcome of one attempt against other attempts on the same task.
Turn-Level Residuals: Uses the structure of the shell commands themselves, analyzed via Abstract Syntax Trees (ASTs), to compare the outcomes of similar actions taken in similar situations.
Tree-Level Advantage: Evaluates the specific "branch" of decisions an agent took, comparing the success of a chosen path against the average outcome of other possible paths from the same state.
By combining these signals, the agent can better understand which specific commands contributed to a successful outcome and which did not.
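One way to picture the three layers is the sketch below. It is an illustration under stated assumptions, not the paper's formulation: the mean-centered baselines, the string `structure_key` used to group turns, the weights `w_turn` and `w_tree`, and the additive combination are all hypothetical choices, since the summary does not say how the signals are computed or merged.

```python
from collections import defaultdict
from statistics import mean


def episode_backbone(rewards: list[float]) -> list[float]:
    """Center each attempt's final reward on the mean over attempts at the same task."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def turn_residuals(turns: list[tuple[str, float]]) -> list[float]:
    """Each turn is (structure_key, outcome). The residual is the turn's outcome
    minus the mean outcome of all turns sharing the same structure key."""
    by_key = defaultdict(list)
    for key, outcome in turns:
        by_key[key].append(outcome)
    return [outcome - mean(by_key[key]) for key, outcome in turns]


def tree_advantage(chosen_value: float, branch_values: list[float]) -> float:
    """Value of the chosen branch minus the average over branches from the same state."""
    return chosen_value - mean(branch_values)


def combine(backbone: float, turn: float, tree: float,
            w_turn: float = 0.5, w_tree: float = 0.5) -> float:
    """Assumed additive combination; the weights are illustrative only."""
    return backbone + w_turn * turn + w_tree * tree


backbone = episode_backbone([1.0, 0.0, 0.0, 1.0])
residuals = turn_residuals([("grep|wc", 1.0), ("grep|wc", 0.0), ("sed", 1.0)])
branch = tree_advantage(1.0, [1.0, 0.0, 0.0, 1.0])
total = combine(backbone[0], residuals[0], branch)
```

In this toy setup, a command that shares its structure with a failed command in another attempt receives a positive residual when its own attempt succeeds, which is the kind of per-action signal an episode-level reward alone cannot provide.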
Performance and Results
The authors evaluated their approach using ShellOps, a new dataset suite designed to test agents on realistic filesystem tasks like searching, aggregating data, and editing files. Experimental results show that the A³ method, particularly when paired with σ-Reveal, significantly outperforms existing reinforcement learning baselines. The improvements are most notable in "hybrid" tasks that require both complex file editing and terminal output analysis, where the agent demonstrated a clearer ability to handle multi-step, long-horizon interactions.
Key Considerations
The research highlights that the structure of CLI actions, specifically the syntax of shell commands, can be a valuable training signal. By leveraging this structure, the authors demonstrate that agents can learn more efficiently without auxiliary models or complex state-anchoring techniques. The study suggests that, to be effective in professional coding environments, agents must distinguish relevant from irrelevant workspace evidence and receive granular feedback on their individual operational choices.
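To make the idea of command structure as a signal concrete, the sketch below reduces a shell command to a coarse structural key so that commands with the same shape can be grouped. It is a simplified stand-in that tokenizes with Python's standard `shlex` module rather than building a real AST; the key format and the `structure_key` function are hypothetical and not taken from the paper.

```python
import shlex

# Operators that separate one command stage from the next.
OPERATORS = {"|", "||", "&&", ";"}


def structure_key(command: str) -> str:
    """Summarize each stage as name[sorted flags](positional-arg count)."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True  # compatible with punctuation_chars on Python 3.8+
    parts, current = [], []

    def flush() -> None:
        # Turn the tokens collected for one stage into a summary string.
        if current:
            name, rest = current[0], current[1:]
            flags = sorted(t for t in rest if t.startswith("-"))
            n_args = sum(1 for t in rest if not t.startswith("-"))
            parts.append(f"{name}[{','.join(flags)}]({n_args})")
            current.clear()

    for tok in lexer:
        if tok in OPERATORS:
            flush()
            parts.append(tok)
        else:
            current.append(tok)
    flush()
    return " ".join(parts)


key_a = structure_key("grep -rn 'TODO' src | wc -l")
key_b = structure_key("grep -rn 'FIXME' lib | wc -l")
key_c = structure_key("sed -i s/a/b/ file.txt")
```

Here the two `grep ... | wc -l` pipelines map to the same key even though their arguments differ, while the `sed` command maps to a different one. A full AST would capture more (redirections, subshells, quoting), which this tokenizer-based key deliberately ignores.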