What the paper is about
Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
What it covers
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
Yuxin Chen 1,2,*, Xiaodong Cai 2,3,*, Junfeng Fang 1, Zhuowen Han 2,4, Yu Wang 2,5, Yaorui Shi 2,5, Yi Zhang 2,5, Qi Gu 2,†, Xunliang Cai 2, Xiang Wang 5, An Zhang 5,†, Tat-Seng Chua 1
1 National University of Singapore, 2 Meituan, 3 Tsinghua University, 4 Tianjin University, 5 University of Science and Technology of China
* Equal contribution. † Corresponding authors: [email protected], [email protected]
Abstract
Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level.
Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noisy conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
1 Introduction
Recent advances in large language models (LLMs) have transformed them from passive text generators into interactive agents capable of reasoning, planning, and tool use [28, 10, 47], enabling their widespread deployment in real-world applications. As these capabilities continue to improve [46, 76, 48], LLM agents have achieved strong performance across a wide range of benchmarks [69, 3, 12]. However, this success does not consistently transfer to more realistic settings: when confronted with complex and dynamic environments, many agents exhibit notable performance degradation [5, 84, 67].
We argue that current agent learning paradigms exhibit a fundamental gap between training conditions and real-world deployment. Existing agent training paradigms share a reliance on idealized assumptions: agents are trained with carefully curated instructions and interact with stable, well-controlled environments [75, 32, 17]. In contrast, real-world environments are inherently stochastic and imperfect. Users often exhibit diverse interaction styles and unpredictable behaviors [8, 51, 58], while external tools may return noisy, incomplete, or even failed outputs due to various uncontrollable factors [54, 65].
This discrepancy between training conditions and deployment environments limits the robustness of current agents, often leading to degraded performance in practical applications [33, 45, 41].
Figure 1: Overview of NoisyAgent. We inject structured perturbations into both user instructions and tool responses to simulate real-world imperfections. Training is conducted via hybrid rollouts that combine clean and noisy trajectories, together with an adaptive scheduler that increases noise difficulty based on the performance gap $\Delta$. Policy optimization is performed with group-wise normalization to stabilize learning under heterogeneous interaction conditions.
Inspired by the success of stochastic perturbations in reinforcement learning [50, 34, 82], we argue that agent robustness emerges from exposure to diverse imperfections during the learning process. Rather than relying on idealized training settings and expecting agents to adapt post hoc, we explicitly incorporate environmental noise and uncertainty into the agentic training process. However, how to model and introduce such noise in agentic training remains underexplored, and naively injecting noise into the training environment can easily destabilize training dynamics, which makes this a non-trivial challenge.
Toward this goal, we propose NoisyAgent, an agentic RL method for training under noisy environments. We begin by identifying representative forms of real-world noise and developing an automated pipeline to incorporate such imperfections into the training process. Concretely, we consider two major sources of interaction noise in real-world agent scenarios: user noise, which captures ambiguity and variability in user interactions, and tool noise, which simulates execution anomalies from external tools. These perturbations are introduced by modifying user instructions and simulating tool execution results within the training environment, with perturbations applied to only a subset of rollouts for each task.
Training follows a curriculum schedule. Starting from mild perturbations, we progressively increase the difficulty and ratio of noise once the model exhibits sufficient robustness at each stage. Robustness is quantified by the performance gap between idealized and perturbed environments on the same tasks. This adaptive process keeps training informative rather than overwhelming, while avoiding inefficient exploration of excessively noisy regimes.
Benefiting from our noise-aware training, agents achieve improved performance on benchmarks augmented with real-world noise, indicating enhanced robustness under imperfect and dynamic environments. Interestingly, we also observe consistent gains on standard, idealized benchmarks. We hypothesize that appropriately designed noise introduces controlled instability into the training environment and promotes more generalizable reasoning and decision-making. In particular, exposure to noisy and uncertain interactions encourages agents to recover from errors, resolve ambiguities, and adapt to unexpected outcomes. From this perspective, noise serves as a form of implicit difficulty augmentation, enriching the training distribution and improving robustness beyond idealized settings.
Overall, our contributions can be summarized as follows:
• We identify a fundamental gap between idealized agent training and real-world deployment, highlighting the importance of modeling environmental uncertainty for robust agent learning.
• We develop a noise-aware training framework that systematically incorporates instruction and tool perturbations into the training environment.
• Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments, while also yielding performance gains on standard benchmarks.
2 Preliminary
2.1 Agentic Reinforcement Learning
In a representative agentic training paradigm, each task can be formalized as a Partially Observable Markov Decision Process (POMDP) [81]:
$$\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{R}). \tag{1}$$
At each step $t$, the agent maintains a state $s_{t}=(s_{t}^{\text{env}},h_{t},q)\in\mathcal{S}$, which captures the environment state $s_{t}^{\text{env}}$, the interaction history $h_{t}$, and the task prompt $q$. Based on the current observation $o_{t}\in\mathcal{O}$, the agent selects an action $a_{t}\in\mathcal{A}$, where the action space $\mathcal{A}=\mathcal{A}_{\text{user}}\cup\mathcal{A}_{\text{tool}}$ includes both user interactions and tool invocations. Correspondingly, the observation space $\mathcal{O}=\mathcal{O}_{\text{user}}\cup\mathcal{O}_{\text{tool}}$ consists of user-side feedback and tool execution results. Upon taking action $a_{t}$, the environment state evolves according to the transition function $\mathcal{T}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}\times\mathcal{O}$, producing the next observation $o_{t+1}$. The training objective is to learn a policy $\pi_{\theta}$ that maximizes the expected cumulative reward $\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T}r_{t}\right]$ over trajectories $\tau=(o_{0},a_{0},o_{1},a_{1},\ldots,o_{T})$.
A widely adopted training paradigm is Reinforcement Learning with Verifiable Rewards (RLVR) [4, 15], where a verifier evaluates whether the final environment state $s_{T}^{\text{env}}$ or the full trajectory $\tau$ satisfies the task instruction under the given rubrics, providing a scalar reward at the trajectory level. To optimize the policy, a representative approach is Group Relative Policy Optimization (GRPO) [37], which extends PPO [36] by computing advantages relative to a group of sampled rollouts. Concretely, given a task prompt $q$ and $G$ sampled trajectories $\{\tau_{1},\ldots,\tau_{G}\}$, the advantage of each trajectory is computed as $\hat{A}_{i}=(r_{i}-\mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the group rewards. The objective can be written as:
$$\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}_{q}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{L_{i}}\sum_{t=1}^{L_{i}}\min\left(\rho_{i,t}\hat{A}_{i},\ \text{clip}(\rho_{i,t},1\pm\epsilon)\hat{A}_{i}\right)\right], \tag{2}$$
where $\rho_{i,t}=\frac{\pi_{\theta}(a_{i,t}\mid h_{i,t})}{\pi_{\text{old}}(a_{i,t}\mid h_{i,t})}$ and $L_{i}$ is the length of trajectory $\tau_{i}$. Building on this standard optimization paradigm, effective agentic training relies on access to a diverse set of interactive environments that support both user-agent interaction and tool-grounded execution [49, 24].
2.2 Scaling Environment for Agentic Training
Constructing interactive environments manually for agentic training is costly and difficult to scale. Recent work addresses this challenge by synthesizing executable environments from high-level domain specifications in a fully automated environment scaling pipeline [53].
Given a domain definition, the pipeline initializes a domain-specific tool set together with a unified database schema, forming a structured domain graph $\mathcal{D}$ that serves as the foundation for executable environment generation. By sampling from this graph, each training environment is instantiated from two tightly coupled components: a user-side construction that specifies task objectives and interaction patterns, and a tool-side construction that defines environment dynamics.
On the user side, tasks are synthesized by sampling tool chains from the domain graph and generating corresponding task queries together with interaction patterns, resulting in compositional objectives that specify both what to solve and how the user agent interacts within the environment. Formally, the user-side construction can be expressed as:
$$(q,u_{\text{int}})=f_{\text{user}}(\mathcal{D}), \tag{3}$$
where $q$ is the task prompt and $u_{\text{int}}$ denotes the interaction pattern governing user-agent interactions. Here $f_{\text{user}}$ is a simplified abstraction of the user-side construction process.
On the tool side, complete executable environments are constructed by implementing structured tool APIs and underlying environment databases based on the domain graph. The sampled tool chains are instantiated as reference executions, and the tool set is further expanded along the domain graph while ensuring both correctness and verifiability of the execution process. Formally, the tool-side construction can be written as:
$$\mathcal{E}=f_{\text{tool}}(\mathcal{D},q,u_{\text{int}}), \tag{4}$$
where $\mathcal{E}$ defines the executable environment grounded in the task specification, including tool APIs, valid state transitions, and verifiable execution paths. Likewise, $f_{\text{tool}}$ is a simplified abstraction of the tool-side construction process.
While this design enables scalable and reliable task construction, it assumes that both components are well-specified: user interactions are restricted to be clear and helpful, while tool behaviors are stable. As a result, the training environments are often idealized, leading to a mismatch between training and deployment, where real-world environments are inherently imperfect.
3 Methodology
To bridge the gap between idealized training and noisy deployment, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into learning. We first introduce an automatic noise injection pipeline (Section 3.1) that augments training with user- and tool-side perturbations, and then present an adaptive training strategy (Section 3.2) that progressively adjusts noise difficulty to ensure stable and effective learning.
3.1 Automatic Noise Injection
We systematically analyze common real-world noise and design an automated pipeline to explicitly incorporate these imperfections into any synthesized agentic training environment. Concretely, we consider two major sources of interaction noise in real-world agentic scenarios: user-side noise, which captures ambiguity and variability in user interaction patterns, and tool-side noise, which reflects failures and anomalies in external tool execution. To model such imperfections, we introduce a noise generator $\pi_{\text{noise}}$ that stochastically perturbs the agent-environment interaction at each step to simulate imperfect observations from the real world.
User-side Injection. On the user side, noise is injected before the task starts by modifying the interaction patterns specified by the user.
We simulate representative non-ideal interaction patterns observed in real-world scenarios, including: (1) Ambiguous, where user intent is underspecified; (2) Inconsistent, where user needs change or conflict over time; and (3) Redundant, where irrelevant or unnecessary information is included. Formally, given the interaction pattern $u_{\text{int}}$ defined by any environment scaling pipeline, the injection of user-side noise can be expressed as:
$$\tilde{u}_{\text{int}}=\pi_{\text{noise}}(u_{\text{int}}), \tag{5}$$
where $\tilde{u}_{\text{int}}$ denotes the perturbed counterpart. This transformation introduces additional variability and ambiguity into user-agent interactions. To avoid inducing unreliable or misleading reward signals, we preserve the underlying task objective $q$, so that the injected perturbations do not invalidate task solvability but instead increase the difficulty and stochasticity of the interaction process.
Tool-side Injection. Tool-side noise is injected during agent rollouts by randomly perturbing a subset of tool execution results to simulate stochasticity in real-world environments. Specifically, we model common execution anomalies in real-world systems, including: (1) Failures, where tool requests return errors; (2) Incomplete, where outputs are truncated; (3) Misleading, where responses contain incorrect or inconsistent information; and (4) Redundant, where outputs include unnecessary details. Formally, the injection of tool-side noise can be formulated as:
$$\tilde{o}_{t}=\pi_{\text{noise}}(o_{t}), \tag{6}$$
where $o_{t}$ denotes the original tool response and $\tilde{o}_{t}$ is the perturbed output. This process simulates imperfect tool behaviors while maintaining executable interaction dynamics.
3.2 Adaptive Noise Training
Hybrid Training. The proposed automatic noise injection pipeline enables the incorporation of imperfections into the agent training process.
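As a rough illustration of the tool-side injection in Eq. (6), the sketch below perturbs a tool response with a given probability. It is not the authors' pipeline: Section 4.1 names an LLM as the noise injector, whereas the rule-based functions here (`fail`, `truncate`, `add_redundancy`), the `noise_prob` parameter, and the error and debug strings are all hypothetical stand-ins. The Misleading type is left out because it needs a content-aware rewrite of the response.

```python
import random
from typing import Callable, Dict, Optional, Tuple

# All perturbation functions below are hypothetical, rule-based stand-ins
# for the LLM-based noise generator described in the paper.

def fail(response: str, rng: random.Random) -> str:
    """Failures: the tool request returns an error instead of a result."""
    return '{"error": "tool request failed"}'

def truncate(response: str, rng: random.Random) -> str:
    """Incomplete: keep only the first half of the output."""
    return response[: len(response) // 2]

def add_redundancy(response: str, rng: random.Random) -> str:
    """Redundant: append details that are irrelevant to the task."""
    return f"{response}\n[debug] latency_ms={rng.randint(10, 900)}"

PERTURBATIONS: Dict[str, Callable[[str, random.Random], str]] = {
    "failure": fail,
    "incomplete": truncate,
    "redundant": add_redundancy,
}

def inject_tool_noise(
    response: str, noise_prob: float, rng: random.Random
) -> Tuple[str, Optional[str]]:
    """Return the (possibly perturbed) response and the noise type applied, if any."""
    if rng.random() >= noise_prob:
        return response, None  # leave this tool call clean
    kind = rng.choice(sorted(PERTURBATIONS))
    return PERTURBATIONS[kind](response, rng), kind

rng = random.Random(0)  # fixed seed so a rollout can be replayed
clean = '{"order_id": 17, "status": "shipped"}'
noisy, kind = inject_tool_noise(clean, noise_prob=0.3, rng=rng)
```

Returning the applied noise type alongside the response is a design choice of this sketch; it would let a trainer log which perturbations each rollout actually saw.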
However, because agent learning is highly sensitive to both task instructions and environment feedback, naively injecting uncontrolled noise can destabilize training dynamics. To preserve training stability while improving robustness, we adopt a hybrid training scheme that combines idealized and perturbed environments. Concretely, under the GRPO training paradigm, given a task set $\mathcal{Q}$, we sample a task $q\in\mathcal{Q}$ and perform $N$ independent rollouts in parallel environments. Among these, a subset of $N_{\text{noise}}$ rollouts is perturbed by injecting user-side or tool-side noise with a controllable difficulty level, while the remaining $N-N_{\text{noise}}$ rollouts are conducted in clean, idealized environments.
Formally, let $\mathcal{T}_{\text{clean}}$ and $\mathcal{T}_{\text{noise}}$ denote the sets of clean and perturbed trajectories for a given task $q$, respectively. In our setting, rollouts are partitioned into these two groups, and we modify the standard GRPO objective by computing advantages separately within each group while optimizing over their union. The overall objective is defined as:
$$\mathcal{J}(\theta)=\mathbb{E}_{q}\left[\frac{1}{G}\left(\sum_{i\in\mathcal{T}_{\text{clean}}}\frac{1}{L_{i}}\sum_{t=1}^{L_{i}}\mathcal{L}_{i,t}(\hat{A}_{i}^{\text{clean}})+\sum_{j\in\mathcal{T}_{\text{noise}}}\frac{1}{L_{j}}\sum_{t=1}^{L_{j}}\mathcal{L}_{j,t}(\hat{A}_{j}^{\text{noise}})\right)\right], \tag{7}$$
where
$$\mathcal{L}_{k,t}(\hat{A})=\min\left(\rho_{k,t}\hat{A},\ \text{clip}(\rho_{k,t},1\pm\epsilon)\hat{A}\right),\quad\rho_{k,t}=\frac{\pi_{\theta}(a_{k,t}\mid h_{k,t})}{\pi_{\text{old}}(a_{k,t}\mid h_{k,t})}. \tag{8}$$
The advantages are computed separately within each group:
$$\hat{A}_{i}^{\text{clean}}=\frac{r_{i}-\mu_{\text{clean}}}{\sigma_{\text{clean}}},\quad\hat{A}_{j}^{\text{noise}}=\frac{r_{j}-\mu_{\text{noise}}}{\sigma_{\text{noise}}}, \tag{9}$$
where $\mu_{\text{clean}},\sigma_{\text{clean}}$ and $\mu_{\text{noise}},\sigma_{\text{noise}}$ denote the mean and standard deviation of rewards computed within each group. This group-wise normalization prevents either clean or noisy rollouts from dominating optimization, and stabilizes training under heterogeneous interaction conditions.
Noise Scheduling. To adaptively introduce noise while maintaining training stability, we first quantify the model's robustness to different noise types and adjust the noise level accordingly. We measure the model's robustness to a specific noise type via the performance gap between clean and perturbed rollouts on the same task:
$$\Delta=\mathbb{E}_{\tau\sim\mathcal{T}_{\text{clean}}}[\mathbf{1}(r(\tau)=1)]-\mathbb{E}_{\tau\sim\mathcal{T}_{\text{noise}}}[\mathbf{1}(r(\tau)=1)], \tag{10}$$
where $r(\tau)=1$ indicates successful task completion. This gap reflects the extent to which the current noise degrades task performance. Based on this measure, we adopt a progressive noise scheduling strategy. Training is initialized in fully idealized environments, with noise gradually introduced as the model adapts. At each stage, we control two factors: (i) the noise scale, defined as the proportion of perturbed rollouts $\rho=N_{\text{noise}}/N$; and (ii) the noise difficulty, characterized by the frequency of tool-side perturbations and the severity of user-side interaction anomalies.
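The group-wise advantages of Eq. (9), the gap of Eq. (10), and a threshold-based update of the two scheduling factors can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the population standard deviation, the `EPS` guard, the discrete `level`, the starting ratio, and the step sizes are hypothetical, and how the gap is aggregated across tasks and training steps is not specified in the text, so `update` simply takes one gap value. Only the 0.05 threshold and the 50% cap come from Section 4.1. Since robustness is measured per noise type, one `NoiseSchedule` instance per type would be a natural fit.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

EPS = 1e-6  # assumed guard against zero variance; not specified in the text

def group_advantages(rewards: List[float]) -> List[float]:
    """Eq. (9): normalize rewards with the mean and std of their own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]

def success_rate(rewards: List[float]) -> float:
    """Fraction of rollouts with r = 1, i.e. successful task completion."""
    return mean([1.0 if r == 1 else 0.0 for r in rewards])

def robustness_gap(clean: List[float], noisy: List[float]) -> float:
    """Eq. (10): clean success rate minus noisy success rate on the same task."""
    return success_rate(clean) - success_rate(noisy)

@dataclass
class NoiseSchedule:
    ratio: float = 0.125       # N_noise / N; the starting value is hypothetical
    level: int = 1             # hypothetical discrete difficulty level
    threshold: float = 0.05    # scheduling threshold reported in Section 4.1
    max_ratio: float = 0.5     # cap on noisy rollouts reported in Section 4.1
    ratio_step: float = 0.125  # hypothetical increment
    max_level: int = 3         # hypothetical

    def update(self, gap: float) -> None:
        """Raise noise ratio and difficulty once the gap falls below the threshold."""
        if gap < self.threshold:
            self.ratio = min(self.max_ratio, self.ratio + self.ratio_step)
            self.level = min(self.max_level, self.level + 1)

# Rewards for one task: clean and noisy rollouts are normalized separately.
clean_rewards = [1.0, 1.0, 0.0, 1.0]
noisy_rewards = [1.0, 0.0, 1.0, 1.0]
adv_clean = group_advantages(clean_rewards)
adv_noisy = group_advantages(noisy_rewards)

schedule = NoiseSchedule()
schedule.update(robustness_gap(clean_rewards, noisy_rewards))  # gap is 0.0 here
```

Normalizing each group on its own statistics means a noisy rollout is compared only with other noisy rollouts, so a lower average reward under noise does not by itself push every noisy trajectory toward a negative advantage.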
When $\Delta<\theta$, with $\theta$ denoting a predefined threshold, the model is considered to have adapted to the current noise level, and we increase both the difficulty and the proportion of that noise type. This yields a curriculum over noise, progressively increasing interaction complexity while maintaining training stability.
4 Experiments
4.1 Experiment Settings
Training Environment. Our training environment follows the environment scaling pipeline of [53]. Within the synthesis pipeline, we leverage a diverse suite of high-performance LLMs for different roles. Specifically, GPT-4.1 is used for environment construction due to its favorable trade-off between cost and efficiency, while Claude-Sonnet-4.5 serves as a verifier given its strong evaluation capability. GLM-4.6 is employed to synthesize diverse instructions, forming the basis of our RL task set. Building on the synthesized tasks, we use Qwen2.5-72B-Instruct as a noise injector to introduce controlled perturbations into the interaction process. During training, Qwen2.5-72B-Instruct also acts as the user simulator to generate natural language feedback, while a Qwen3-32B model is trained as an evaluator to assign rewards based on the synthesized rubrics.
Evaluation. We evaluate the robustness of the model on AgentNoiseBench [60], a benchmark designed to assess agent performance under real-world noise. We select two representative subsets, AgentNoiseBench-$\tau^{2}$ and AgentNoiseBench-Vita, for evaluation. To assess performance in idealized environments, we evaluate on representative standard agent benchmarks: (i) $\tau^{2}$-Bench, a dual-control conversational benchmark where both the user and the agent can invoke tools in customer-service domains such as retail, airline, and telecom; (ii) VitaBench, a multi-tool agent benchmark covering real-world scenarios including food delivery, in-store services, and travel.
Across all benchmarks, GPT-4.1 is used as the user simulator, and Claude-Sonnet-4.5 is used as the evaluator. Each experiment is repeated four times. We report Avg@4 and Pass@4 metrics averaged across tasks.
Implementation Details and Baselines. We adopt Qwen3-8B and Qwen3-32B as backbone models. On these backbones, we compare several representative training methods, including GRPO, DAPO, and GSPO; our method is built on GSPO. The training batch size is set to 32, with 64 rollouts per sample. The proportion of noisy trajectories is capped at 50% of the total rollouts. We set the scheduling threshold $\theta$ on the gap $\Delta$ to 0.05. The maximum prompt length is 8,192 tokens, and the maximum response length is 32,768 tokens. Detailed training configurations are provided in Appendix A.
Table 1: Main results under the noisy setting on AgentNoiseBench. We report Avg@4 and Pass@4 averaged across four runs. Retail, Airline, and Telecom belong to AgentNoiseBench-$\tau^{2}$; Delivery, In-Store, and OTA belong to AgentNoiseBench-Vita.
| Method | Retail Avg@4 | Retail Pass@4 | Airline Avg@4 | Airline Pass@4 | Telecom Avg@4 | Telecom Pass@4 | Delivery Avg@4 | Delivery Pass@4 | In-Store Avg@4 | In-Store Pass@4 | OTA Avg@4 | OTA Pass@4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | 24.12 | 44.74 | 23.00 | 42.00 | 21.05 | 41.23 | 11.75 | 18.00 | 8.50 | 12.00 | 0.75 | 2.00 |
| + GRPO | 30.48 | 50.88 | 33.50 | 54.00 | 31.58 | 53.51 | 15.25 | 24.00 | 14.25 | 23.00 | 2.50 | 4.00 |
| + DAPO | 29.39 | 53.51 | 31.00 | 50.00 | 34.21 | 57.89 | 15.75 | 25.00 | 12.75 | 19.00 | 2.25 | 4.00 |
| + GSPO | 31.80 | 54.39 | 32.50 | 52.00 | 34.43 | 56.14 | 16.00 | 26.00 | 15.00 | 22.00 | 2.75 | 5.00 |
| + Ours | 36.40 | 61.40 | 38.00 | 56.00 | 38.38 | 64.91 | 21.50 | 34.00 | 16.25 | 25.00 | 4.75 | 8.00 |
| Qwen3-32B | 31.14 | 52.63 | 31.50 | 56.00 | 26.54 | 45.61 | 19.50 | 30.00 | 14.75 | 21.00 | 5.50 | 9.00 |
| + GRPO | 38.16 | 61.40 | 37.00 | 62.00 | 36.84 | 62.28 | 23.25 | 35.00 | 19.50 | 28.00 | 7.25 | 11.00 |
| + DAPO | 36.18 | 57.89 | 39.50 | 66.00 | 38.16 | 66.67 | 24.00 | 36.00 | 16.75 | 24.00 | 7.50 | 11.00 |
| + GSPO | 37.72 | 60.53 | 39.00 | 64.00 | 39.25 | 65.79 | 23.75 | 36.00 | 17.50 | 25.00 | 7.50 | 12.00 |
| + Ours | 43.20 | 65.79 | 46.00 | 70.00 | 43.42 | 70.18 | 28.75 | 42.00 | 22.00 | 31.00 | 9.50 | 14.00 |
Table 2: Main results under the ideal setting on standard agent benchmarks. We report Avg@4 and Pass@4 averaged across four runs. Retail, Airline, and Telecom belong to $\tau^{2}$-Bench; Delivery, In-Store, and OTA belong to VitaBench.
| Method | Retail Avg@4 | Retail Pass@4 | Airline Avg@4 | Airline Pass@4 | Telecom Avg@4 | Telecom Pass@4 | Delivery Avg@4 | Delivery Pass@4 | In-Store Avg@4 | In-Store Pass@4 | OTA Avg@4 | OTA Pass@4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | 35.31 | 59.65 | 27.00 | 52.00 | 22.59 | 42.98 | 13.75 | 22.00 | 15.50 | 24.00 | 1.75 | 4.00 |
| + GRPO | 46.05 | 73.68 | 36.50 | 62.00 | 37.28 | 57.89 | 21.00 | 33.00 | 22.75 | 35.00 | 4.25 | 7.00 |
| + DAPO | 44.52 | 71.05 | 38.00 | 66.00 | 39.47 | 63.16 | 21.50 | 34.00 | 23.25 | 36.00 | 4.00 | 7.00 |
| + GSPO | 46.49 | 74.56 | 37.50 | 64.00 | 39.04 | 61.40 | 21.25 | 33.00 | 23.00 | 35.00 | 4.50 | 8.00 |
| + Ours | 47.59 | 77.19 | 40.00 | 68.00 | 40.79 | 64.91 | 22.25 | 35.00 | 24.00 | 37.00 | 5.00 | 9.00 |
| Qwen3-32B | 49.12 | 72.81 | 38.00 | 66.00 | 28.95 | 49.12 | 23.00 | 35.00 | 26.00 | 38.00 | 7.00 | 12.00 |
| + GRPO | 58.11 | 83.33 | 45.00 | 72.00 | 41.67 | 68.42 | 27.00 | 40.00 | 30.25 | 43.00 | 8.75 | 14.00 |
| + DAPO | 56.58 | 80.70 | 47.50 | 76.00 | 43.42 | 71.93 | 27.75 | 41.00 | 29.50 | 42.00 | 9.25 | 15.00 |
| + GSPO | 58.55 | 84.21 | 46.50 | 74.00 | 43.86 | 70.18 | 27.25 | 40.00 | 30.50 | 44.00 | 9.00 | 14.00 |
| + Ours | 60.31 | 86.84 | 49.50 | 78.00 | 45.39 | 78.07 | 29.00 | 43.00 | 32.25 | 46.00 | 9.75 | 15.00 |
4.2 Main Results
Table 1 and Table 2 present the evaluation results under noisy and ideal settings, respectively. We have the following observations.
Noise-aware training significantly improves robustness under imperfect environments. Across all domains and both model scales, NoisyAgent consistently achieves the best performance on AgentNoiseBench, outperforming strong baselines such as GSPO and DAPO by a clear margin in both Avg@4 and Pass@4. In contrast, while standard RL methods improve performance under clean settings, their gains diminish substantially in the presence of noise, often exhibiting notable relative degradation across domains compared with their gains in idealized settings. This suggests that existing training paradigms are less effective when facing ambiguous user instructions and imperfect tool feedback.
By incorporating structured perturbations during training, our method enables the agent to better handle uncertainty, recover from intermediate failures, and maintain consistent progress toward task completion under noisy conditions.
Training with noise leads to consistent gains even in idealized settings. Despite being designed for noisy environments, NoisyAgent also achieves consistent improvements on standard benchmarks without noise. Across both $\tau^{2}$-Bench and VitaBench, our method matches or outperforms all baselines across domains and metrics; the only tie is OTA Pass@4 with Qwen3-32B, where DAPO also reaches 15.00. This indicates that training with noise does not harm performance in ideal settings, and can improve overall agent capability. We attribute this to the fact that exposure to diverse and imperfect interaction patterns encourages the agent to learn more robust and effective decision-making strategies, rather than relying on brittle interaction assumptions.
4.3 Analysis
Table 3: Ablation study of key components on the Delivery domain of both AgentNoiseBench-Vita and VitaBench with Qwen3-8B. We report Avg@4 and Pass@4 averaged across four runs.
| Method | AgentNoiseBench-Vita Avg@4 | AgentNoiseBench-Vita Pass@4 | VitaBench Avg@4 | VitaBench Pass@4 |
|---|---|---|---|---|
| Ours | 21.50 | 34.00 | 22.25 | 35.00 |
| w/o controlled injection | 13.25 | 21.00 | 14.75 | 24.00 |
| w/o scheduling | 20.00 | 31.00 | 21.50 | 33.00 |
| w/o noise | 16.00 | 26.00 | 21.25 | 33.00 |
| w/o training | 11.75 | 18.00 | 13.75 | 22.00 |
Ablation Study. To isolate the effect of each component, we perform ablations by removing individual elements from our framework. w/o controlled injection removes the hybrid training scheme, applying noise to all rollouts instead of mixing clean and noisy trajectories. w/o scheduling removes the curriculum over noise, using perturbations of fixed complexity throughout training. w/o noise reduces training to an idealized setting without any perturbations. w/o training evaluates the base model without RL optimization.
Overall, removing any component leads to performance degradation, indicating that each part contributes to the final performance. In particular, uncontrolled noise injection (w/o controlled injection) causes the largest drop among the trained variants, suggesting that naively introducing perturbations can destabilize training. In contrast, incorporating a proper scheduling strategy further improves performance, showing that progressively adjusting noise leads to more effective and stable learning.
Figure 2: Training dynamics on VitaBench Delivery (Qwen3-8B), with panels (a) Idealized setting and (b) Noisy setting. We compare NoisyAgent with a baseline trained without noise under both ideal (no-noise) and noisy evaluations.
Training Dynamics. Figure 2 compares the training dynamics of NoisyAgent and the baseline trained without noise under both ideal and noisy evaluations. In the early stage of training, the two methods exhibit comparable performance, as optimization is largely conducted on clean trajectories serving as a warm-up phase. The initial introduction of moderate noise may even lead to a slight degradation in performance, reflecting the increas