Paper Abstract

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $\sigma$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.

Learning CLI Agents with Structured Action Credit under Selective Observation
This paper addresses the challenge of training AI agents to operate effectively within command-line interface (CLI) environments. While CLI agents are powerful tools for coding and managing filesystems, they often struggle with two major hurdles: identifying relevant information in large, complex codebases and assigning credit to specific actions when rewards are only provided at the very end of a long task. The authors introduce a new learning framework designed to improve how these agents process workspace data and learn from their successes and failures.

Selective Observation with σ-Reveal

To help agents navigate large repositories, the authors developed σ-Reveal. This mechanism acts as an inference-time filter that selects the most important parts of a filesystem to show the agent before it takes its first action. By calculating a relevance score for files based on the task description, file depth, and file type, σ-Reveal creates a "token-budgeted" view of the workspace. This ensures the agent focuses on the most critical evidence without being overwhelmed by irrelevant data, effectively managing the agent's context window.
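A minimal sketch of how such a token-budgeted view could be assembled, assuming a simple additive score. The summary names the inputs (task description, file depth, file type) but not the formula, so the weights, the `TYPE_PRIOR` table, the `FileEntry` type, and the greedy packing rule below are all hypothetical and should not be read as the authors' σ-Reveal implementation.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical record for one workspace file."""
    path: str
    n_tokens: int  # assumed to be precomputed by some tokenizer


# Assumed file-type priors; the real values are not given in the summary.
TYPE_PRIOR = {".py": 1.0, ".md": 0.6, ".json": 0.4}
DEFAULT_PRIOR = 0.2
DEPTH_PENALTY = 0.1


def relevance(task: str, entry: FileEntry) -> float:
    """Score = word overlap with the task + file-type prior - depth penalty."""
    task_words = set(re.findall(r"[a-z0-9]+", task.lower()))
    path_words = set(re.findall(r"[a-z0-9]+", entry.path.lower()))
    path = PurePosixPath(entry.path)
    depth = len(path.parts) - 1  # number of directories above the file
    prior = TYPE_PRIOR.get(path.suffix, DEFAULT_PRIOR)
    return len(task_words & path_words) + prior - DEPTH_PENALTY * depth


def select_view(task: str, entries: list[FileEntry], budget: int) -> list[FileEntry]:
    """Greedily keep the highest-scoring files that still fit the token budget."""
    ranked = sorted(entries, key=lambda e: (-relevance(task, e), e.path))
    chosen, used = [], 0
    for entry in ranked:
        # A file that does not fit is skipped; smaller files may still fit later.
        if used + entry.n_tokens <= budget:
            chosen.append(entry)
            used += entry.n_tokens
    return chosen


workspace = [
    FileEntry("src/parser.py", 300),
    FileEntry("docs/guide.md", 200),
    FileEntry("src/config.json", 100),
    FileEntry("vendor/lib/deep/blob.bin", 500),
]
view = select_view("fix the parser bug", workspace, budget=450)
print([e.path for e in view])
```

Greedy packing is used here only because it is easy to follow; any selection rule that respects the budget would fit the description in the text.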

Action Advantage Assignment (A³)

The core of the agent's learning process is a method called Action Advantage Assignment (A³). Unlike standard reinforcement learning, which often treats an entire sequence of actions as a single block, A³ breaks the learning signal down into three distinct layers:

  • Episode Backbone: Compares the final outcome of one attempt against other attempts on the same task.

  • Turn-Level Residuals: Uses the structure of the shell commands themselves, parsed into abstract syntax trees (ASTs), to compare how structurally similar actions fared in similar situations.

  • Tree-Level Advantage: Evaluates the specific "branch" of decisions an agent took, comparing the success of a chosen path against the average outcome of other possible paths from the same state.

By combining these signals, the agent can better understand which specific commands contributed to a successful outcome and which did not.
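The three layers can be sketched as one per-turn advantage. This is an illustration under stated assumptions, not the paper's formulation: the summary does not give the equations, so the mean-centred backbone, the per-signature residual, the prefix-based tree margin, and the mixing weights `alpha` and `beta` are all hypothetical. Each turn is represented by a precomputed structural signature string standing in for its AST.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import fmean


def a3_turn_advantages(
    episodes: list[list[str]],  # per episode: one signature string per turn
    rewards: list[float],       # one terminal reward per episode
    alpha: float = 0.5,         # assumed weight of the turn-level residual
    beta: float = 0.5,          # assumed weight of the tree-level margin
) -> list[list[float]]:
    """Return one advantage per turn for a group of episodes on the same task."""
    baseline = fmean(rewards)

    by_signature = defaultdict(list)  # rewards of episodes using a signature
    by_prefix = defaultdict(list)     # rewards of episodes sharing an action prefix
    for sigs, reward in zip(episodes, rewards):
        by_prefix[()].append(reward)
        for t, sig in enumerate(sigs):
            by_signature[sig].append(reward)
            by_prefix[tuple(sigs[: t + 1])].append(reward)

    advantages = []
    for sigs, reward in zip(episodes, rewards):
        per_turn = []
        for t, sig in enumerate(sigs):
            # 1. Episode backbone: this attempt versus the group average.
            backbone = reward - baseline
            # 2. Turn-level residual: how structurally similar actions fared.
            residual = fmean(by_signature[sig]) - baseline
            # 3. Tree-level margin: chosen branch versus all branches from the
            #    same prefix (the shared "state" in this simplified tree).
            chosen = fmean(by_prefix[tuple(sigs[: t + 1])])
            siblings = fmean(by_prefix[tuple(sigs[:t])])
            per_turn.append(backbone + alpha * residual + beta * (chosen - siblings))
        advantages.append(per_turn)
    return advantages


group = [["grep", "sed"], ["grep", "cat"], ["ls", "sed"], ["ls", "cat"]]
outcome = [1.0, 0.0, 0.0, 0.0]
adv = a3_turn_advantages(group, outcome)
print(adv[0])  # successful episode
print(adv[1])  # failed episode that shares its first action with the success
```

In the second episode the shared `grep` turn ends up with a higher advantage than the `cat` turn that followed it, even though both belong to the same failed attempt. That is the kind of per-command distinction the paragraph above describes.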

Performance and Results

The authors evaluated their approach using ShellOps, a new dataset suite designed to test agents on realistic filesystem tasks such as searching, aggregating data, and editing files. The reported results show that A³, particularly when paired with σ-Reveal, significantly outperforms existing reinforcement learning baselines. The improvements are most notable in "hybrid" tasks that require both complex file editing and terminal output analysis, where the agent demonstrated a clearer ability to handle multi-step, long-horizon interactions.

Key Considerations

The research highlights that the structure of CLI actions, specifically the syntax of shell commands, can be a valuable training signal. By leveraging this structure, the authors demonstrate that agents can learn more efficiently without needing auxiliary models or complex state-anchoring techniques. The study suggests that for agents to be truly effective in professional coding environments, they must be able to distinguish between relevant and irrelevant workspace evidence and receive granular feedback on their specific operational choices.
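To make the idea of command structure as a signal concrete, here is a rough sketch that reduces a shell command to a coarse structural signature. It relies on the standard-library `shlex` tokenizer as a stand-in: this is tokenization, not a real AST, and the `command_signature` function and its rule (keep programs, flags, and operators; drop operands) are assumptions made for illustration only.

```python
from __future__ import annotations

import shlex

SEQUENCING = {"|", "||", "&&", ";"}   # a new program name follows these
REDIRECTS = {">", ">>", "<"}          # a filename operand follows these


def command_signature(command: str) -> tuple[str, ...]:
    """Keep program names, flags, and operators; drop paths and patterns."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    signature = []
    expect_program = True
    for token in lexer:
        if token in SEQUENCING or token in REDIRECTS:
            signature.append(token)
            expect_program = token in SEQUENCING
        elif expect_program:
            signature.append(token)  # first word of a simple command
            expect_program = False
        elif token.startswith("-"):
            signature.append(token)  # option flag
        # Anything else is an operand (path, pattern, filename) and is dropped.
    return tuple(signature)


first = command_signature("grep -rn 'TODO' src | sort -u > out.txt")
second = command_signature("grep -rn 'FIXME' lib | sort -u > report.txt")
print(first)
print(first == second)
```

Two commands that differ only in their operands map to the same signature, so outcomes can be pooled across them. A proper shell parser would be needed to handle subshells, quoting edge cases, and nested substitutions.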