Advancing DialNav through Automatic Embodied Dialog Augmentation
Embodied AI agents, such as robots navigating indoor environments, must be able to communicate to ensure safety and task success. The DialNav framework allows these agents to engage in a "dialog-execution loop," where a navigator asks a remote guide for help when instructions are ambiguous. However, training these agents has been hindered by a severe lack of data, as human-annotated dialog episodes are expensive and time-consuming to collect. This paper introduces a new pipeline to automatically generate large-scale training data and proposes improved training strategies to significantly boost the performance of navigation agents.
Creating the RAINbow Dataset
The researchers addressed the data scarcity problem by developing an automatic generation pipeline that creates the RAINbow dataset, which contains 238,000 episodes, more than 100 times larger than the original 2,000-episode dataset. The pipeline takes existing navigation datasets, concatenates paths to create longer trajectories, and uses vision-language models to generate scene captions. A large language model then refines these captions into natural, multi-turn dialogs. The method is also cost-effective: each episode costs roughly 2,000 times less than one produced by manual human annotation.
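The path-concatenation step can be illustrated with a minimal sketch. It assumes a path is an ordered list of viewpoint IDs and that two paths can be joined when one ends where the other begins; the representation, the function names, and the greedy pairing are all hypothetical and are not taken from the paper. Caption generation and dialog refinement are omitted.

```python
from typing import List, Optional

# Assumption: a path is an ordered list of viewpoint IDs.
Path = List[str]


def concatenate_paths(first: Path, second: Path) -> Optional[Path]:
    """Join two paths if the second starts where the first ends."""
    if not first or not second or first[-1] != second[0]:
        return None
    # Drop the duplicated junction node from the second path.
    return first + second[1:]


def build_long_trajectories(paths: List[Path], min_len: int) -> List[Path]:
    """Try every ordered pair of paths and keep joins with at least min_len nodes."""
    trajectories = []
    for i, a in enumerate(paths):
        for j, b in enumerate(paths):
            if i == j:
                continue
            joined = concatenate_paths(a, b)
            if joined is not None and len(joined) >= min_len:
                trajectories.append(joined)
    return trajectories


short_paths = [["a", "b"], ["b", "c"], ["x", "y"]]
long_paths = build_long_trajectories(short_paths, min_len=3)
```

Here only the first two paths share an endpoint, so they are the only pair that yields a longer trajectory.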
Dual-Strategy Training
Simply having more data is not enough if the training method does not account for the interactive nature of dialog-based navigation. The authors introduced "Dual-Strategy Training" to better align the agent with the navigation-dialog loop. This approach uses two types of training paths:
Data-guided rollouts: The agent follows the pre-recorded path in the dataset, receiving ground-truth dialog updates at specific points.
On-policy rollouts: The agent follows its own learned policy, allowing it to practice recovering from its own navigation errors.
By combining these, the agent learns to leverage the full dialog context while remaining robust to its own mistakes.
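The two rollout types can be sketched on a toy navigation graph. This is an illustration under stated assumptions, not the authors' implementation: the graph format, the policy signature, and the choice to pick a rollout type at random with probability p_on_policy are all hypothetical, and the dialog updates and loss computation are left out.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]  # node -> neighbouring nodes (assumed format)
Policy = Callable[[str, List[str]], Optional[str]]  # returns next node, or None to stop


def data_guided_rollout(gt_path: List[str]) -> List[str]:
    """Replay the path recorded in the dataset."""
    return list(gt_path)


def on_policy_rollout(graph: Graph, start: str, policy: Policy, max_steps: int) -> List[str]:
    """Let the agent choose its own moves, so it may drift off the recorded path."""
    node = start
    visited = [node]
    for _ in range(max_steps):
        nxt = policy(node, graph[node])
        if nxt is None:
            break
        node = nxt
        visited.append(node)
    return visited


def dual_strategy_rollout(graph: Graph, gt_path: List[str], policy: Policy,
                          p_on_policy: float, rng: random.Random,
                          max_steps: int = 10) -> Tuple[str, List[str]]:
    """Pick one rollout type per episode (the random mixing rule is an assumption)."""
    if rng.random() < p_on_policy:
        return "on_policy", on_policy_rollout(graph, gt_path[0], policy, max_steps)
    return "data_guided", data_guided_rollout(gt_path)


toy_graph = {"a": ["b", "d"], "b": ["c"], "c": [], "d": []}
recorded = ["a", "b", "c"]
# Placeholder policy that always takes the last listed neighbour.
take_last = lambda node, nbrs: nbrs[-1] if nbrs else None
mode, visited = dual_strategy_rollout(toy_graph, recorded, take_last, 1.0, random.Random(0))
```

With this placeholder policy the on-policy rollout leaves the recorded path at the first step, which is the kind of self-induced error the on-policy branch is meant to expose during training.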
Improving Localization
In the DialNav setup, the remote guide must accurately determine where the navigator is located based on the navigator's questions. The authors improved this "localization" subtask by leveraging knowledge from existing Vision-and-Language Navigation (VLN) models. By adopting a graph-based Transformer architecture, the guide becomes much better at pinpointing the navigator's position, which in turn leads to more accurate and helpful guidance.
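One way to picture the localization subtask is as a choice among candidate nodes of the environment graph. The sketch below assumes the guide's model has already produced one score per node and simply converts those scores to probabilities and picks the most likely node. The scores, node names, and function name are placeholders; the graph-based Transformer that would produce such scores is not modeled here.

```python
import math
from typing import Dict, Tuple


def localize(node_scores: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Turn per-node scores into a softmax distribution and return the top node."""
    top = max(node_scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {node: math.exp(score - top) for node, score in node_scores.items()}
    total = sum(exps.values())
    probs = {node: value / total for node, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Placeholder scores standing in for the model's output.
best_node, node_probs = localize({"kitchen": 2.0, "hall": 0.5, "bedroom": -1.0})
```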
Significant Performance Gains
The combination of the large-scale RAINbow dataset, the Dual-Strategy Training scheme, and the improved localization model resulted in a substantial performance leap. Compared to the previous baseline, the new approach doubled the success rate of the navigation agents, achieving 58.24% on the "Val Seen" split and 29.05% on the "Val Unseen" split. These results establish a new state of the art for the DialNav task, demonstrating that high-quality synthetic data and specialized training schemes can effectively overcome the challenges of data scarcity in embodied AI.
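For readers unfamiliar with the metric, a success rate of this kind is the percentage of episodes that end in success. The sketch below assumes an episode counts as a success when the agent stops within a distance threshold of the goal; the 3-meter default is a common convention in navigation benchmarks and is an assumption here, since this summary does not state the exact criterion used for DialNav. The distances are made-up illustration values, not results from the paper.

```python
from typing import List


def success_rate(final_distances_m: List[float], threshold_m: float = 3.0) -> float:
    """Percentage of episodes whose final distance to the goal is within the threshold."""
    if not final_distances_m:
        return 0.0
    successes = sum(1 for d in final_distances_m if d <= threshold_m)
    return 100.0 * successes / len(final_distances_m)


# Made-up final distances (in meters) for four episodes.
rate = success_rate([1.0, 2.5, 4.0, 10.0])
```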