Advancing DialNav through Automatic Embodied Dialog Augmentation

Key Takeaways

  • For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness.
  • To address this, we propose an automatic generation pipeline and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav.
  • Our pipeline converts existing VLN datasets into multi-turn dialog and creates a cost-efficient, high-quality dataset.
  • Embodied AI agents, such as robots navigating indoor environments, must be able to communicate to ensure safety and task success.
Paper Abstract

For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav [han2025dialnav] provides a framework for holistic evaluation of the dialog-execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog and creates a cost-efficient, high-quality dataset. Then, we introduce two complementary advances to unlock the data's full potential: (1) Dual-Strategy Training, a navigation training scheme that aligns navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both the Val Seen (58.24, +89%) and Val Unseen (29.05, +100%) splits, establishing a new state of the art.

Advancing DialNav through Automatic Embodied Dialog Augmentation
Embodied AI agents, such as robots navigating indoor environments, must be able to communicate to ensure safety and task success. The DialNav framework allows these agents to engage in a "dialog-execution loop," where a navigator asks a remote guide for help when instructions are ambiguous. However, training these agents has been hindered by a severe lack of data, as human-annotated dialog episodes are expensive and time-consuming to collect. This paper introduces a new pipeline to automatically generate large-scale training data and proposes improved training strategies to significantly boost the performance of navigation agents.

Creating the RAINbow Dataset

The researchers addressed the data scarcity problem by developing an automatic generation pipeline that creates the RAINbow dataset. It contains 238,000 episodes, about 119 times the size of the original 2,000-episode dataset. The pipeline takes existing navigation datasets, concatenates paths to create longer trajectories, and uses vision-language models to generate scene captions. A large language model then refines these captions into natural, multi-turn dialogs. The method costs roughly 2,000 times less per episode than manual human annotation.
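The three pipeline stages described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Path` type, the rule that two paths are joined only when they share an endpoint, and the stub captioner and dialog writer are all assumptions made for the example. In a real pipeline the two stubs would be calls to a vision-language model and a large language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


@dataclass
class Path:
    viewpoints: List[str]  # ordered viewpoint IDs along a trajectory


def concatenate_paths(first: Path, second: Path) -> Path:
    """Join two paths into a longer one (assumed rule: shared endpoint)."""
    if first.viewpoints[-1] != second.viewpoints[0]:
        raise ValueError("paths do not share an endpoint")
    # Drop the duplicated junction viewpoint from the second path.
    return Path(first.viewpoints + second.viewpoints[1:])


def build_episode(
    path: Path,
    caption_fn: Callable[[str], str],
    dialog_fn: Callable[[List[str]], List[Turn]],
) -> Dict[str, object]:
    """Caption every viewpoint, then turn the captions into a dialog."""
    captions = [caption_fn(v) for v in path.viewpoints]
    return {"path": path.viewpoints, "dialog": dialog_fn(captions)}


# Hypothetical stand-ins for the VLM captioner and the LLM dialog writer.
def stub_caption(viewpoint: str) -> str:
    return f"view at {viewpoint}"


def stub_dialog(captions: List[str]) -> List[Turn]:
    return [
        ("navigator", "Where should I go?"),
        ("guide", f"Start from the {captions[0]} and stop at the {captions[-1]}."),
    ]


long_path = concatenate_paths(Path(["a", "b"]), Path(["b", "c"]))
episode = build_episode(long_path, stub_caption, stub_dialog)
print(episode["path"])
print(episode["dialog"])
```

Passing the captioner and dialog writer in as functions keeps the path handling testable without any model calls.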

Dual-Strategy Training

Simply having more data is not enough if the training method does not account for the interactive nature of dialog-based navigation. The authors introduced "Dual-Strategy Training" to better align the agent with the navigation-dialog loop. This approach uses two types of training paths:

  • Data-guided rollouts: The agent follows the pre-recorded path in the dataset, receiving ground-truth dialog updates at specific points.

  • On-policy rollouts: The agent follows its own learned policy, allowing it to practice recovering from its own navigation errors.

By combining these, the agent learns to leverage the full dialog context while remaining robust to its own mistakes.
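The two rollout types can be sketched on a toy navigation graph. This is an illustrative sketch under stated assumptions, not the paper's training code: the graph, the deliberately flawed `last_neighbor_policy`, the use of the next step on a shortest path as the supervision target during on-policy rollouts, and the fixed dialog context in those rollouts are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, List[str]]
Sample = Tuple[str, Tuple[str, ...], str]  # (node, dialog context, target node)


def shortest_path(graph: Graph, start: str, goal: str) -> List[str]:
    """Breadth-first search over an unweighted graph."""
    parents: Dict[str, object] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    if goal not in parents:
        raise ValueError("goal unreachable")
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


def data_guided_rollout(gt_path: List[str], dialog_at: Dict[int, str]) -> List[Sample]:
    """Follow the recorded path; add ground-truth dialog at the given steps."""
    context: List[str] = []
    samples: List[Sample] = []
    for step in range(len(gt_path) - 1):
        if step in dialog_at:
            context.append(dialog_at[step])
        samples.append((gt_path[step], tuple(context), gt_path[step + 1]))
    return samples


def on_policy_rollout(
    graph: Graph,
    start: str,
    goal: str,
    policy: Callable[[str, List[str]], str],
    context: List[str],
    max_steps: int = 10,
) -> List[Sample]:
    """Follow the agent's own policy; supervise with the shortest-path step."""
    samples: List[Sample] = []
    node = start
    for _ in range(max_steps):
        if node == goal:
            break
        target = shortest_path(graph, node, goal)[1]
        samples.append((node, tuple(context), target))
        node = policy(node, graph[node])  # may move off the recorded path
    return samples


# Toy graph: corridor a-b-c-d with a dead end "e" attached to "b".
GRAPH: Graph = {"a": ["b"], "b": ["a", "c", "e"], "c": ["b", "d"], "d": ["c"], "e": ["b"]}


def last_neighbor_policy(node: str, neighbors: List[str]) -> str:
    """A flawed policy that always takes the last listed neighbor."""
    return neighbors[-1]


guided = data_guided_rollout(
    ["a", "b", "c", "d"],
    {0: "go down the hall", 2: "the door is on your left"},
)
on_policy = on_policy_rollout(GRAPH, "a", "d", last_neighbor_policy, ["go down the hall"])
training_batch = guided + on_policy
```

In this toy run the flawed policy wanders into the dead end `e`, so the on-policy samples include a state the recorded path never visits, paired with the action that leads back toward the goal. That is the recovery signal the on-policy rollouts are meant to provide.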

Improving Localization

In the DialNav setup, the remote guide must accurately determine where the navigator is located based on the navigator's questions. The authors improved this "localization" subtask by leveraging knowledge from existing Vision-and-Language Navigation (VLN) models. By adopting a graph-based Transformer architecture, the guide becomes much better at pinpointing the navigator's position, which in turn leads to more accurate and helpful guidance.
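As a rough illustration of the localization subtask only, the guide's job can be framed as choosing one node of the navigation graph given the navigator's description. The sketch below swaps the graph-based Transformer for a plain word-overlap score, which is an assumption made purely to keep the example runnable; the node names and captions are hypothetical.

```python
from typing import Dict


def localize(description: str, node_captions: Dict[str, str]) -> str:
    """Return the node whose caption shares the most words with the description.

    A stand-in scorer: a learned model would replace `overlap` below.
    """
    words = set(description.lower().split())

    def overlap(node: str) -> int:
        return len(words & set(node_captions[node].lower().split()))

    return max(node_captions, key=overlap)


# Hypothetical per-node scene captions.
captions = {
    "kitchen": "a kitchen with a steel fridge",
    "hall": "a narrow hall with a red rug",
    "bedroom": "a bedroom with a blue bed",
}

predicted = localize("I am standing next to a red rug", captions)
print(predicted)
```

The interface is the part that carries over: the input is the navigator's language, the candidates are graph nodes, and the output is a single predicted location that the guide then uses to phrase its answer.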

Significant Performance Gains

The combination of the large-scale RAINbow dataset, the Dual-Strategy Training scheme, and the improved localization model resulted in a substantial performance leap. Compared to the previous baseline, the new approach roughly doubled the success rate of the navigation agents, reaching 58.24% on the "Val Seen" split (+89%) and 29.05% on the "Val Unseen" split (+100%). These results establish a new state of the art for the DialNav task, demonstrating that high-quality synthetic data and specialized training schemes can effectively overcome the challenges of data scarcity in embodied AI.
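The abstract reports relative gains of +89% (Val Seen) and +100% (Val Unseen) over the baseline. The baseline success rates themselves are not stated here, but they can be backed out from those figures. The sketch below does that arithmetic; because the reported percentages are rounded, the results are approximations, and the function name is just a label chosen for this example.

```python
def implied_baseline(new_rate: float, relative_gain_pct: float) -> float:
    """Invert new = baseline * (1 + gain / 100) to recover the baseline."""
    return new_rate / (1.0 + relative_gain_pct / 100.0)


val_seen_baseline = implied_baseline(58.24, 89.0)
val_unseen_baseline = implied_baseline(29.05, 100.0)

print(f"Implied Val Seen baseline:   {val_seen_baseline:.1f}")
print(f"Implied Val Unseen baseline: {val_unseen_baseline:.1f}")
```

This puts the implied baselines at roughly 30.8 on Val Seen and 14.5 on Val Unseen, so the success rate doubles on the unseen split and comes close to doubling on the seen split.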
