NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration

NoMaD is a novel architecture for robotic navigation in previously unseen environments that uses a unified diffusion policy to jointly represent exploratory task-agnostic behavior and goal-directed task-specific behavior. NoMaD provides high capacity (both for modeling perception and control) and the ability to represent complex, multimodal distributions.

Robotic learning for navigation in unfamiliar environments needs to provide policies for both task-oriented navigation (i.e., reaching a goal that the robot has located), and task-agnostic exploration (i.e., searching for a goal in a novel setting). Typically, these roles are handled by separate models, for example by using subgoal proposals, planning, or separate navigation strategies. In this paper, we describe how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration, with the latter providing the ability to search novel environments, and the former providing the ability to reach a user-specified goal once it has been located. We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments, as compared to approaches that use subgoal proposals from generative models, or prior methods based on latent variable models.

We instantiate our method by using a large-scale Transformer-based policy trained on data from multiple ground robots, with a diffusion model decoder to flexibly handle both goal-conditioned and goal-agnostic navigation. Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods, and demonstrate significant improvements in performance and lower collision rates, despite utilizing smaller models than state-of-the-art approaches.

NoMaD uses the ViNT model to encode a context vector from the robot's observations and, when performing goal-directed navigation, a goal image. When exploring, the goal is masked out, so the context depends on the observations alone. The context vector then conditions a diffusion model that generates a distribution over actions.
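The conditioning flow described above can be illustrated with a small sketch. This is not the NoMaD implementation: `build_context`, `toy_noise_predictor`, and `sample_actions` are hypothetical names, mean pooling stands in for the Transformer encoder, the noise predictor is a hand-written placeholder for a learned network, and the horizon, action dimension, and noise schedule are assumed values. It only shows how one sampler could serve both modes by toggling a goal mask, using a standard DDPM reverse-sampling update.

```python
import numpy as np

def build_context(obs_emb, goal_emb, goal_mask):
    """Pool observation tokens and an optional goal token into one context vector.

    goal_mask=True hides the goal token (exploration); goal_mask=False keeps it
    (goal-directed navigation). Mean pooling is an assumption standing in for
    the Transformer encoder.
    """
    tokens = np.vstack([obs_emb, goal_emb[None, :]])  # shape (n_obs + 1, d)
    visible = np.ones(len(tokens), dtype=bool)
    visible[-1] = not goal_mask                        # last token is the goal
    return tokens[visible].mean(axis=0)

def toy_noise_predictor(actions, context, k):
    """Hypothetical placeholder for the learned noise-prediction network."""
    target = np.tanh(context[: actions.shape[-1]])     # context-dependent target
    return actions - target                            # treat the offset as the noise

def sample_actions(context, horizon=8, action_dim=2, steps=10, seed=0):
    """Draw one action sequence with DDPM ancestral sampling."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)             # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a = rng.standard_normal((horizon, action_dim))     # start from Gaussian noise
    for k in reversed(range(steps)):
        eps = toy_noise_predictor(a, context, k)
        # Posterior mean of the reverse step given the predicted noise.
        a = (a - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:
            a = a + np.sqrt(betas[k]) * rng.standard_normal(a.shape)
    return a

obs = np.ones((3, 4))        # three past observation embeddings, d = 4
goal = np.full(4, 5.0)       # goal image embedding
explore_actions = sample_actions(build_context(obs, goal, goal_mask=True))
goal_actions = sample_actions(build_context(obs, goal, goal_mask=False))
```

Calling `sample_actions` several times with different seeds on the same context gives several candidate action sequences, which is how a diffusion policy can express more than one plausible way around an obstacle.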

Like ViNT, NoMaD can be paired with a topological graph-based global planner to explore previously unseen environments. We show NoMaD exploring a previously unseen office environment and reaching a goal. In this environment, NoMaD demonstrates the emergent capability to avoid walls and obstacles and to produce multi-modal action distributions.
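To make the role of a topological graph-based planner concrete, here is a minimal sketch under stated assumptions. The graph layout (a dict of dicts with edge costs), the node names, and the functions `shortest_path` and `next_subgoal` are all hypothetical, and plain Dijkstra search is used as a generic choice; the page does not specify how the planner is built.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a dict-of-dicts graph, where graph[u][v] is the edge cost."""
    dist = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue                                   # stale queue entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(queue, (nd, v))
    if goal not in dist:
        return None                                    # goal not reachable yet
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

def next_subgoal(graph, current, goal):
    """Return the next node to head for, or None to signal 'keep exploring'."""
    path = shortest_path(graph, current, goal)
    if path is None:
        return None                                    # fall back to goal-masked exploration
    return path[1] if len(path) > 1 else goal

# Nodes stand for stored observations; costs stand for estimated traversal effort.
graph = {"hall": {"door": 1.0, "lab": 5.0}, "door": {"lab": 1.0}, "lab": {}}
subgoal = next_subgoal(graph, "hall", "lab")
```

In this sketch, a returned node would be handed to the policy as its goal image, while `None` would mean the goal is not yet connected to the graph and the policy should run with the goal masked.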

Beyond the structure of indoor environments, NoMaD is also capable of long horizon exploration and navigation in outdoor environments. We show NoMaD exploring a previously unseen outdoor environment, and reaching a goal. In this environment, NoMaD demonstrates the same capabilities as in the indoor environment while also sticking to sidewalks and avoiding roads as a result of implicit navigation preferences included in the dataset.

We compared NoMaD against other state-of-the-art methods, including ViNT with image diffusion subgoals, autoregressive action generation, and VIB (Variational Information Bottleneck). We found that NoMaD outperforms all of these methods, reaching long-horizon goals with few or no collisions. NoMaD's ability to represent multi-modal action distributions around obstacles allows it to avoid collisions and reach goals more effectively than the other methods.

BibTeX

@article{sridhar2023nomad,
  author  = {Ajay Sridhar and Dhruv Shah and Catherine Glossop and Sergey Levine},
  title   = {{NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration}},
  journal = {arXiv pre-print},
  year    = {2023},
  url     = {https://arxiv.org/abs/2310.07896}
}
