(Translated by https://www.hiragana.jp/)
TopoNav: Topological Navigation for Efficient Exploration in Sparse Reward Environments

TopoNav: Topological Navigation for Efficient Exploration in Sparse Reward Environments

Jumman Hossain1, Abu-Zaher Faridee1,2, Nirmalya Roy1, Jade Freeman3, Timothy Gregory3, and Theron Trout4 1Authors are with the Dept. of Information Systems, University of Maryland, Baltimore County, USA. {jumman.hossain, faridee1, nroy}@umbc.edu2Author is with Amazon Inc. USA. abufari@amazon.com3Authors are with DEVCOM Army Research Lab, USA.
   {jade.l.freeman2.civ, timothy.c.gregory6.civ}@army.mil4Author is with Stormfish Scientific Corporation.
   theron.trout@stormfish-sci.com
Abstract

Autonomous robots exploring unknown environments face a significant challenge: navigating effectively without prior maps and with limited external feedback. This challenge intensifies in sparse reward environments, where traditional exploration techniques often fail. In this paper, we present TopoNav, a novel topological navigation framework that integrates active mapping, hierarchical reinforcement learning, and intrinsic motivation to enable efficient goal-oriented exploration and navigation in sparse-reward settings. TopoNav dynamically constructs a topological map of the environment, capturing key locations and pathways. A two-level hierarchical policy architecture, comprising a high-level graph traversal policy and low-level motion control policies, enables effective navigation and obstacle avoidance while maintaining focus on the overall goal. Additionally, TopoNav incorporates intrinsic motivation to guide exploration towards relevant regions and frontier nodes in the topological map, addressing the challenges of sparse extrinsic rewards. We evaluate TopoNav both in the simulated and real-world off-road environments using a Clearpath Jackal robot, across three challenging navigation scenarios: goal-reaching, feature-based navigation, and navigation in complex terrains. We observe an increase in exploration coverage by 7-20%, in success rates by 9-19%, and reductions in navigation times by 15-36% across various scenarios, compared to state-of-the-art methods.

I Introduction

Refer to caption
Figure 1: TopoNav Navigation Strategies: The navigation begins at the Start node (green circle) and progresses through designated subgoals—Point A (initial decision point), Point B (complex navigation subgoal), and Point C (alternative challenging subgoal)—toward the Goal (red diamond). The routes illustrate TopoNav’s strategy: solid lines represent direct paths to subgoals, a dashed line marks a complex detour around Obstacle1, and a dotted line indicates a potential route for challenging maneuvering near Obstacle2. This diagram shows the robot’s strategic navigation from start to finish, highlighting its decision-making and adaptability in outdoor environments with diverse navigational challenges. A real-world scenario is presented in Fig. 5

Autonomous robot navigation in unknown, unstructured environments poses significant challenges, particularly in the absence of prior maps and reliable localization  [1, 2]. In such scenarios, robots must efficiently explore the environment, build accurate representations, and make intelligent decisions to reach their goals, often with limited computational resources and sparse feedback  [3, 4]. Traditional approaches, such as simultaneous localization and mapping (SLAM)  [5] and sampling-based planning  [6], rely heavily on geometric representations and struggle to adapt to the uncertainties and dynamicity of real-world environments. Recent advancements in deep reinforcement learning (RL) have shown promise in enabling robots to learn complex navigation policies directly from raw sensory inputs  [7, 8]. However, most RL-based approaches suffer from sample inefficiency, poor generalization to unseen environments, and difficulties in handling sparse reward signals  [9]. Hierarchical RL methods  [10, 11] attempt to address these issues by learning multi-level policies, but often rely on handcrafted state spaces, pre-defined sub-goals, and task-specific reward shaping  [12], limiting their autonomy and adaptability. Topological mapping  [13, 14] is a promising approach for efficient navigation in large-scale environments. By representing the environment as a graph of discrete places and their connectivity, topological maps provide a compact and flexible representation that scales well with the size of the environment  [15]. However, most topological mapping approaches rely on predefined place recognition methods  [16] or assume a fixed set of landmarks  [17], making them sensitive to handling ambiguous visual appearances and changes in the environment. They often struggle to adapt to dynamic and unstructured environments, where the appearance of landmarks may change significantly over time.

In this paper, we introduce TopoNav, a novel topological navigation framework that integrates active mapping, hierarchical reinforcement learning, and intrinsic motivation to enable efficient and autonomous exploration of unknown environments. TopoNav dynamically constructs and maintains a topological map of the environment using a deep neural network that learns to extract task-relevant features from raw sensor observations. By combining the strengths of learning-based perception, hierarchical decision-making, and intrinsically motivated exploration, TopoNav demonstrates significant improvements in efficiency, robustness, and adaptability compared to state-of-the-art navigation methods.

The key contributions of this work are:

  1. 1.

    Enhanced Hierarchical Reinforcement Learning with Active Topological Mapping: We introduce an enhanced hierarchical reinforcement learning framework that extends the Hierarchical Deep Q-Network (H-DQN)  [18] architecture by integrating an actively updated topological map and leveraging intrinsic rewards to facilitate multi-level navigation policy learning. The meta-controller is responsible for choosing subgoals from the topological map, while the sub-controllers are designed to reach these subgoals through the execution of primitive actions. This dual strategy ensures that navigation is not only efficient but also directed towards regions of the environment that significantly enhance its understanding. By incorporating an intrinsically motivated learning approach, we effectively address the challenges associated with sparse extrinsic rewards, thereby accelerating the learning of efficient navigation policies.

  2. 2.

    Dynamic Subgoal Generation and Strategic Landmark Selection: We design a dynamic subgoal generation mechanism that activates upon detecting landmarks, trees, or objects while navigating. Detected features become part of the topological map as subgoals or nodes, facilitating structured navigation. When multiple landmarks are detected at similar distances, TopoNav utilizes a strategic landmark selection strategy. This approach gives priority to landmarks that are most informative and relevant, considering their novelty and alignment with the final goal. This method promotes efficient exploration and ensures navigation is goal-oriented, even when presented with numerous potential subgoals.

  3. 3.

    Experimental Validation of Superior Performance: We extensively evaluate TopoNav in diverse simulated environments and real-world scenarios, benchmarking against state-of-the-art baselines. It showcases a significant increase in exploration coverage (7-20%), and navigation success rates (9-19%), and achieves substantial reductions in navigation times (15-36%) across various scenarios. These improvements demonstrate TopoNav’s superior navigation in complex environments.

The rest of the paper is organized as follows. Section II discusses related work on robot navigation, topological mapping, and reinforcement learning. The background and problem formulation are discussed in Section III. Section IV presents the TopoNav framework. Section V describes the experimental setup, environments, and evaluation metrics. Section V presents the results and analysis of the experiments, comparing TopoNav with SOTA navigation methods. Finally, Section VI concludes the paper and discusses future research directions.

II Related Work

This section discusses prior works related to autonomous robot navigation, focusing on learning-based methods, and topological mapping techniques.

II-A Learning-Based Navigation

Recent advancements in deep reinforcement learning (RL) have led to the development of learning-based navigation methods that can learn effective policies directly from sensory inputs. Deep Q-Networks (DQN) [19] have been applied to navigation tasks, enabling robots to learn collision-free paths in different environments. However, these methods typically require a large amount of training data and may struggle to generalize to unseen environments. Hierarchical reinforcement learning (HRL) methods have been proposed to address the challenges of sparse rewards and long-horizon tasks in navigation [18, 10, 20]. These approaches typically learn a hierarchy of policies, where higher-level policies select subgoals or actions for lower-level policies to execute. However, most HRL methods  [21] assume access to structured representations of the environment or rely on pre-defined subgoals, limiting their applicability in unknown and unstructured environments.

II-B Topological Mapping Navigation

Topological mapping techniques represent the environment as a graph, where nodes correspond to distinct places and edges represent navigable paths between them [13, 14]. These methods focus on capturing the connectivity and adjacency information of the environment rather than maintaining a precise metric map. Spectral clustering algorithms have been used to construct topological maps from sensor data [22, 15]. These methods exploit the eigenstructure of the similarity matrix to partition the environment into distinct regions. However, they often require a pre-specified number of clusters and may not adapt well to changes in the environment. Incremental topological mapping approaches incrementally build and refine the topological map as the robot explores the environment [23, 24]. These methods typically use appearance-based place recognition techniques to detect loop closures and update the map accordingly. However, they may struggle in environments with significant changes in appearance. Recently, learning-based approaches have been proposed for topological mapping and localization [25, 26]. These methods learn to extract topological representations directly from sensory inputs using deep neural networks. Suomela et al. [27] proposed PlaceNav, a topological navigation approach that utilizes visual place recognition for subgoal selection. Wiyatno et al. [28] proposed a lifelong topological navigation approach that builds and continuously refines a sparse topological graph. However, they often require a large amount of training data and may not generalize well to unseen and sparse reward environments.

While existing topological navigation methods have shown promising results, they often rely on dense reward signals or extensive exploration to build effective representations of the environment. In contrast, TopoNav is specifically designed to operate in sparse-reward settings by incorporating intrinsic motivation and hierarchical reinforcement learning, enabling efficient navigation and map construction with limited extrinsic feedback. By dynamically constructing and refining the topological map during exploration, TopoNav can adapt to changes in the environment. The hierarchical policy architecture allows for effective navigation and obstacle avoidance, while intrinsic motivation guides exploration towards informative regions.

III Background and Problem Formulation

In this section, we outline the topological mapping for navigation, the Hierarchical Deep Q-Networks (H-DQN) for policy learning, and our problem formulation.

III-A Topological Mapping

Topological mapping represents the environment as a graph 𝒢=(𝒱,)𝒢𝒱\mathcal{G}=(\mathcal{V},\mathcal{E})caligraphic_G = ( caligraphic_V , caligraphic_E ), where nodes v𝒱𝑣𝒱v\in\mathcal{V}italic_v ∈ caligraphic_V correspond to distinct places or landmarks, and edges e𝑒e\in\mathcal{E}italic_e ∈ caligraphic_E represent the connectivity between them. This representation allows for efficient path planning and navigation by abstracting away the metric details and focusing on the high-level structure of the environment  [15]. In our approach, we dynamically construct and update the topological map \mathcal{M}caligraphic_M as the robot explores the environment. Each node vi𝒱subscript𝑣𝑖𝒱v_{i}\in\mathcal{V}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_V is associated with a feature vector 𝐟isubscript𝐟𝑖\mathbf{f}_{i}bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT that encodes the sensory observations at that location. The edges eijsubscript𝑒𝑖𝑗e_{ij}\in\mathcal{E}italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∈ caligraphic_E represent the traversability between nodes visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, which can be determined based on the robot’s motion model and the observed environmental conditions. The topological map \mathcal{M}caligraphic_M is used to guide the exploration process and enable efficient decision-making. By representing the environment as a graph, the robot can plan paths between different nodes and make high-level decisions based on the connectivity of the map. This allows for more strategic exploration and navigation compared to purely reactive approaches.

III-B Hierarchical Deep Q-Networks (H-DQN)

Hierarchical Deep Q-Networks (H-DQN)  [18] is a hierarchical reinforcement learning algorithm that extends the standard Deep Q-Networks (DQN)  [19]. We choose the H-DQN due to its efficiency in managing tasks across different abstraction levels and its enhanced sample efficiency through off-policy learning and experience replay, which is crucial for real-world robotics applications where data collection can be costly and time-consuming. H-DQN learns a two-level policy, the high-level policy (meta-controller) learns to select subgoals from the topological map, while the low-level policies (sub-controllers) learn to generate actions to reach the selected subgoals. The meta-controller is represented by a Q-network Qμ(s,g;θμ)superscript𝑄𝜇𝑠𝑔superscript𝜃𝜇Q^{\mu}(s,g;\theta^{\mu})italic_Q start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT ( italic_s , italic_g ; italic_θ start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT ), where s𝒮𝑠𝒮s\in\mathcal{S}italic_s ∈ caligraphic_S is the current state, g𝒢𝑔𝒢g\in\mathcal{G}italic_g ∈ caligraphic_G is a subgoal, and θμsuperscript𝜃𝜇\theta^{\mu}italic_θ start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT are the network parameters. The meta-controller selects subgoals based on the learned Q-values, which estimate the expected cumulative reward for reaching each subgoal. The sub-controllers are represented by a set of Q-networks Qg(s,a;θg)|g𝒢conditionalsuperscript𝑄𝑔𝑠𝑎superscript𝜃𝑔𝑔𝒢{Q^{g}(s,a;\theta^{g})|g\in\mathcal{G}}italic_Q start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_s , italic_a ; italic_θ start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) | italic_g ∈ caligraphic_G, where s𝒮𝑠𝒮s\in\mathcal{S}italic_s ∈ caligraphic_S is the current state, a𝒜𝑎𝒜a\in\mathcal{A}italic_a ∈ caligraphic_A is an action, and θgsuperscript𝜃𝑔\theta^{g}italic_θ start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT are the network parameters for subgoal g𝑔gitalic_g. Each sub-controller learns a policy to navigate from the current state to the corresponding subgoal by selecting actions that maximize the learned Q-values. The meta-controller and sub-controllers are trained simultaneously with separate replay buffers and target networks for each level of the hierarchy.

III-C Problem Formulation

We formulate the problem as a Markov Decision Process (MDP) with a hierarchical structure. The MDP is defined by a tuple (𝒮,𝒜,𝒫,,γ)𝒮𝒜𝒫𝛾(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)( caligraphic_S , caligraphic_A , caligraphic_P , caligraphic_R , italic_γ ), where 𝒮𝒮\mathcal{S}caligraphic_S is the state space, 𝒜𝒜\mathcal{A}caligraphic_A is the action space, 𝒫𝒫\mathcal{P}caligraphic_P is the transition probability function, \mathcal{R}caligraphic_R is the reward function, and γ𝛾\gammaitalic_γ is the discount factor. The state space 𝒮𝒮\mathcal{S}caligraphic_S consists of the robot’s sensory observations and the current topological map \mathcal{M}caligraphic_M. The action space 𝒜𝒜\mathcal{A}caligraphic_A is hierarchically structured, with high-level actions (subgoals) 𝒜hsubscript𝒜\mathcal{A}_{h}caligraphic_A start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT selected by the meta-controller and low-level actions (primitives) 𝒜lsubscript𝒜𝑙\mathcal{A}_{l}caligraphic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT executed by the controller. The transition probability function 𝒫:𝒮×𝒜𝒮:𝒫𝒮𝒜𝒮\mathcal{P}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}caligraphic_P : caligraphic_S × caligraphic_A → caligraphic_S determines the next state based on the current state and the executed action. The reward function :𝒮×𝒜:𝒮𝒜\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}caligraphic_R : caligraphic_S × caligraphic_A → blackboard_R is sparse, providing positive rewards for reaching the goal or completing milestones, and intrinsic rewards for exploring new nodes or discovering new connections in the topological map. The objective is to learn a hierarchical policy π=(πh,πl)𝜋subscript𝜋subscript𝜋𝑙\pi=(\pi_{h},\pi_{l})italic_π = ( italic_π start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_π start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) that maximizes the expected cumulative reward:

π=argmaxπ𝔼π[t=0γtrt],superscript𝜋subscriptargmax𝜋subscript𝔼𝜋delimited-[]superscriptsubscript𝑡0superscript𝛾𝑡subscript𝑟𝑡\pi^{*}=\operatorname*{arg\,max}_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty% }\gamma^{t}r_{t}\right],italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] , (1)

where πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the optimal policy, πh:𝒮𝒜h:subscript𝜋𝒮subscript𝒜\pi_{h}:\mathcal{S}\rightarrow\mathcal{A}_{h}italic_π start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT : caligraphic_S → caligraphic_A start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT is the high-level policy (meta-controller), πl:𝒮×𝒜h𝒜l:subscript𝜋𝑙𝒮subscript𝒜subscript𝒜𝑙\pi_{l}:\mathcal{S}\times\mathcal{A}_{h}\rightarrow\mathcal{A}_{l}italic_π start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT : caligraphic_S × caligraphic_A start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT → caligraphic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT is the low-level policy (controller), rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the reward at time step t𝑡titalic_t, and γ𝛾\gammaitalic_γ is the discount factor. In this context, the state space 𝒮𝒮\mathcal{S}caligraphic_S includes a dynamically constructed topological map \mathcal{M}caligraphic_M, where nodes represent automatically identified landmarks. The high-level action space 𝒜hsubscript𝒜\mathcal{A}_{h}caligraphic_A start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT consists of subgoals, allowing the meta-controller to make decisions based on the evolving map. This approach enables efficient navigation in unknown environments without relying on predefined waypoints.

IV TopoNav: Topological Map Generation and Navigation

In this section, We explain the major stages of the TopoNav approach. Fig. 2 shows how different modules in our method are connected.

IV-A Attention-based Feature Detection

TopoNav integrates an attention-based feature detection module, combining a ResNet-50 CNN [29] with a Convolutional Block Attention Module (CBAM) [30], to detect landmarks, objects, and trees from RGB images 𝐈iH×W×3subscript𝐈𝑖superscript𝐻𝑊3\mathbf{I}_{i}\in\mathbb{R}^{H\times W\times 3}bold_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × 3 end_POSTSUPERSCRIPT, where H𝐻Hitalic_H and W𝑊Witalic_W are the height and width of the input image, respectively. This module processes images to produce feature maps 𝐅ih×w×csubscript𝐅𝑖superscript𝑤𝑐\mathbf{F}_{i}\in\mathbb{R}^{h\times w\times c}bold_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_h × italic_w × italic_c end_POSTSUPERSCRIPT (hhitalic_h and w𝑤witalic_w are the height and width of the feature map, and c𝑐citalic_c is the number of channels) refining them through CBAM to emphasize relevant features for landmark detection. Utilizing a sliding window approach on the refined feature map 𝐅isubscriptsuperscript𝐅𝑖\mathbf{F}^{\prime}_{i}bold_F start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, TopoNav identifies potential Regions of Interest (ROIs) corresponding to navigational markers. Each ROI’s feature vector 𝐟ROIdsubscript𝐟𝑅𝑂𝐼superscript𝑑\mathbf{f}_{ROI}\in\mathbb{R}^{d}bold_f start_POSTSUBSCRIPT italic_R italic_O italic_I end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, obtained through ROI pooling, undergoes a binary classification to ascertain the presence of valid landmarks, objects, or trees, assigning a probability score 𝐏ROI[0,1]subscript𝐏𝑅𝑂𝐼01\mathbf{P}_{ROI}\in[0,1]bold_P start_POSTSUBSCRIPT italic_R italic_O italic_I end_POSTSUBSCRIPT ∈ [ 0 , 1 ]. For each detected landmark, object, or tree, TopoNav extracts its corresponding feature vector 𝐟ROIsubscript𝐟𝑅𝑂𝐼\mathbf{f}_{ROI}bold_f start_POSTSUBSCRIPT italic_R italic_O italic_I end_POSTSUBSCRIPT and uses it to create a new node or subgoal in the topological map.

Refer to caption
Figure 2: Overview of TopoNav System Architecture.

IV-B Subgoal Generation and Navigation

In TopoNav, subgoals or nodes are generated dynamically based on the landmarks, trees, or objects encountered by the robot while navigating towards the final goal. Whenever a new landmark or object is detected, it is compared with the existing nodes in the topological map using a similarity measure (described in Section IV-C). If the similarity is below a threshold, the landmark is added as a new subgoal or node to the topological map. However, when multiple landmarks are detected at similar distances, TopoNav employs a selection strategy to prioritize the most informative and relevant landmark as the next subgoal based on novelty and goal-directedness. The novelty of a landmark l𝑙litalic_l is calculated based on the number of previous visits, using a novelty factor that decreases exponentially with the number of visits:

N(l)=eλvisits(l)𝑁𝑙superscript𝑒𝜆visits𝑙N(l)=e^{-\lambda\cdot\text{visits}(l)}italic_N ( italic_l ) = italic_e start_POSTSUPERSCRIPT - italic_λ ⋅ visits ( italic_l ) end_POSTSUPERSCRIPT (2)

where λ𝜆\lambdaitalic_λ is a decay parameter controlling the rate at which the novelty factor decreases with visits. The goal-directedness of a landmark is calculated as the cosine similarity between the direction vector 𝐯lsubscript𝐯𝑙\mathbf{v}_{l}bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT towards the landmark l𝑙litalic_l and the direction vector towards the goal 𝐯gsubscript𝐯𝑔\mathbf{v}_{g}bold_v start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT. The landmark with the highest score is selected as the next subgoal. The weights wNsubscript𝑤𝑁w_{N}italic_w start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT and wGDsubscript𝑤𝐺𝐷w_{GD}italic_w start_POSTSUBSCRIPT italic_G italic_D end_POSTSUBSCRIPT adjust the importance of novelty and goal-directedness in landmark selection, balancing exploration, and direct goal attainment. (See Algorithm  1)

GD(l)=𝐯l𝐯g𝐯l𝐯g𝐺𝐷𝑙subscript𝐯𝑙subscript𝐯𝑔normsubscript𝐯𝑙normsubscript𝐯𝑔GD(l)=\frac{\mathbf{v}_{l}\cdot\mathbf{v}_{g}}{||\mathbf{v}_{l}||\cdot||% \mathbf{v}_{g}||}italic_G italic_D ( italic_l ) = divide start_ARG bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ⋅ bold_v start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT end_ARG start_ARG | | bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | | ⋅ | | bold_v start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT | | end_ARG (3)
Input: Detected landmarks L𝐿Litalic_L, current position 𝐩𝐩\mathbf{p}bold_p, goal position 𝐠𝐠\mathbf{g}bold_g
Output: Selected landmark lbestsubscript𝑙bestl_{\text{best}}italic_l start_POSTSUBSCRIPT best end_POSTSUBSCRIPT
1
2Initialize max_score=max_score\text{max\_score}=-\inftymax_score = - ∞
3 foreach landmark lL𝑙𝐿l\in Litalic_l ∈ italic_L do
4       N(l)=eλvisits(l)𝑁𝑙superscript𝑒𝜆visits𝑙N(l)=e^{-\lambda\cdot\text{visits}(l)}italic_N ( italic_l ) = italic_e start_POSTSUPERSCRIPT - italic_λ ⋅ visits ( italic_l ) end_POSTSUPERSCRIPT
5       𝐯l=direction_vector(𝐩,l)subscript𝐯𝑙direction_vector𝐩𝑙\mathbf{v}_{l}=\text{direction\_vector}(\mathbf{p},l)bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = direction_vector ( bold_p , italic_l )
6       𝐯g=direction_vector(𝐩,𝐠)subscript𝐯𝑔direction_vector𝐩𝐠\mathbf{v}_{g}=\text{direction\_vector}(\mathbf{p},\mathbf{g})bold_v start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT = direction_vector ( bold_p , bold_g )
7       GD(l)=𝐯l𝐯g𝐯l𝐯g𝐺𝐷𝑙subscript𝐯𝑙subscript𝐯𝑔normsubscript𝐯𝑙normsubscript𝐯𝑔GD(l)=\frac{\mathbf{v}_{l}\cdot\mathbf{v}_{g}}{||\mathbf{v}_{l}||\cdot||% \mathbf{v}_{g}||}italic_G italic_D ( italic_l ) = divide start_ARG bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ⋅ bold_v start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT end_ARG start_ARG | | bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | | ⋅ | | bold_v start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT | | end_ARG
8       score(l)=wNN(l)+wGDGD(l)score𝑙subscript𝑤𝑁𝑁𝑙subscript𝑤𝐺𝐷𝐺𝐷𝑙\text{score}(l)=w_{N}\cdot N(l)+w_{GD}\cdot GD(l)score ( italic_l ) = italic_w start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ⋅ italic_N ( italic_l ) + italic_w start_POSTSUBSCRIPT italic_G italic_D end_POSTSUBSCRIPT ⋅ italic_G italic_D ( italic_l )
9       if score(l)>max_scorescore𝑙max_score\text{score}(l)>\text{max\_score}score ( italic_l ) > max_score then
10             max_score=score(l)max_scorescore𝑙\text{max\_score}=\text{score}(l)max_score = score ( italic_l )
11             lbest=lsubscript𝑙best𝑙l_{\text{best}}=litalic_l start_POSTSUBSCRIPT best end_POSTSUBSCRIPT = italic_l
12            
13      
14return lbestsubscript𝑙bestl_{\text{best}}italic_l start_POSTSUBSCRIPT best end_POSTSUBSCRIPT
Algorithm 1 Strategic Landmark Selection
StartN01subscript𝑁01N_{01}italic_N start_POSTSUBSCRIPT 01 end_POSTSUBSCRIPTSub-goalN12subscript𝑁12N_{12}italic_N start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPTSub-goalN23subscript𝑁23N_{23}italic_N start_POSTSUBSCRIPT 23 end_POSTSUBSCRIPTSub-goalGoalNavigateNavigateNavigateNavigateSelect lbestsubscript𝑙𝑏𝑒𝑠𝑡l_{best}italic_l start_POSTSUBSCRIPT italic_b italic_e italic_s italic_t end_POSTSUBSCRIPT using N(l),GD(l)𝑁𝑙𝐺𝐷𝑙N(l),GD(l)italic_N ( italic_l ) , italic_G italic_D ( italic_l )Select lbestsubscript𝑙𝑏𝑒𝑠𝑡l_{best}italic_l start_POSTSUBSCRIPT italic_b italic_e italic_s italic_t end_POSTSUBSCRIPT using N(l),GD(l)𝑁𝑙𝐺𝐷𝑙N(l),GD(l)italic_N ( italic_l ) , italic_G italic_D ( italic_l )Select lbestsubscript𝑙𝑏𝑒𝑠𝑡l_{best}italic_l start_POSTSUBSCRIPT italic_b italic_e italic_s italic_t end_POSTSUBSCRIPT using N(l),GD(l)𝑁𝑙𝐺𝐷𝑙N(l),GD(l)italic_N ( italic_l ) , italic_G italic_D ( italic_l )Meta-Controller/Sub-Controller
Figure 3: An illustration of the topological navigation process using detected landmarks as sub-goals and strategic landmark selection.
Input: Ggsubscript𝐺𝑔G_{g}italic_G start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT, the global end goal; dthreshsubscript𝑑threshd_{\text{thresh}}italic_d start_POSTSUBSCRIPT thresh end_POSTSUBSCRIPT, distance threshold for subgoal generation
Output: Tmsubscript𝑇𝑚T_{m}italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, the topological map with nodes (N𝑁Nitalic_N) and edges (E𝐸Eitalic_E)
1 Initialize meta-controller Qμsuperscript𝑄𝜇Q^{\mu}italic_Q start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT and sub-controllers Qgsuperscript𝑄𝑔Q^{g}italic_Q start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT.
2 Initialize replay buffers 𝒟μ,𝒟gsuperscript𝒟𝜇superscript𝒟𝑔\mathcal{D}^{\mu},\mathcal{D}^{g}caligraphic_D start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT , caligraphic_D start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT for each subgoal g𝑔gitalic_g.
3 Tmsubscript𝑇𝑚T_{m}\leftarrow\emptysetitalic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ← ∅; s0GetInitialState()subscript𝑠0GetInitialState()s_{0}\leftarrow\text{GetInitialState()}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← GetInitialState(); g0Ggsubscript𝑔0subscript𝐺𝑔g_{0}\leftarrow G_{g}italic_g start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← italic_G start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT.
4 while gtGgsubscript𝑔𝑡subscript𝐺𝑔g_{t}\neq G_{g}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≠ italic_G start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT do
5       DetectLandmarks(st)DetectLandmarkssubscript𝑠𝑡\mathcal{L}\leftarrow\text{DetectLandmarks}(s_{t})caligraphic_L ← DetectLandmarks ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ).
6       if len()>0len0\text{len}(\mathcal{L})>0len ( caligraphic_L ) > 0 then
7             lbestUse Algorithm 1(,st,Gg)subscript𝑙bestUse Algorithm 1subscript𝑠𝑡subscript𝐺𝑔l_{\text{best}}\leftarrow\text{Use Algorithm ~{}\ref{alg:landmark_selection}}(% \mathcal{L},s_{t},G_{g})italic_l start_POSTSUBSCRIPT best end_POSTSUBSCRIPT ← Use Algorithm ( caligraphic_L , italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_G start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ).
8             if IsNewNode(lbest)IsNewNodesubscript𝑙best\text{IsNewNode}(l_{\text{best}})IsNewNode ( italic_l start_POSTSUBSCRIPT best end_POSTSUBSCRIPT ) then
9                   Add lbestsubscript𝑙bestl_{\text{best}}italic_l start_POSTSUBSCRIPT best end_POSTSUBSCRIPT to Tmsubscript𝑇𝑚T_{m}italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT; gtlbestsubscript𝑔𝑡subscript𝑙bestg_{t}\leftarrow l_{\text{best}}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_l start_POSTSUBSCRIPT best end_POSTSUBSCRIPT.
10                  
11            
12      else
13             if dist(st,gt)>dthreshdistsubscript𝑠𝑡subscript𝑔𝑡subscript𝑑thresh\text{dist}(s_{t},g_{t})>d_{\text{thresh}}dist ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) > italic_d start_POSTSUBSCRIPT thresh end_POSTSUBSCRIPT then
14                   gsuperscript𝑔absentg^{\prime}\leftarrowitalic_g start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← Generate new subgoal; Add gsuperscript𝑔g^{\prime}italic_g start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT to Tmsubscript𝑇𝑚T_{m}italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT; gtgsubscript𝑔𝑡superscript𝑔g_{t}\leftarrow g^{\prime}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_g start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
15                  
16            
17      if gtGgsubscript𝑔𝑡subscript𝐺𝑔g_{t}\neq G_{g}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≠ italic_G start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT then
18             atsubscript𝑎𝑡absenta_{t}\leftarrowitalic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← Select action; Execute atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT; Observe st+1,rtsubscript𝑠𝑡1subscript𝑟𝑡s_{t+1},r_{t}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT; Store transition.
19             if reached gtsubscript𝑔𝑡g_{t}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT then
20                   gtGgsubscript𝑔𝑡subscript𝐺𝑔g_{t}\leftarrow G_{g}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_G start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT.
21                  
22            
23      else
24             Update subgoal; Store and update transition.
25            
26      stst+1subscript𝑠𝑡subscript𝑠𝑡1s_{t}\leftarrow s_{t+1}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT.
27      
28return Tmsubscript𝑇𝑚T_{m}italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT
Algorithm 2 TopoNav: Topological Map Generation and Navigation

During navigation, the meta-controller selects the next subgoal based on the current state and the robot’s overall goal. If no landmarks are detected within a certain distance threshold, TopoNav generates a new subgoal along the robot’s current trajectory at the maximum detection range to ensure continuous progress towards the final goal. The sub-controller’s policy generates a sequence of primitive actions or low-level skills to navigate the robot towards the selected subgoal, aiming to maximize the expected cumulative reward. The navigation process continues until the robot reaches the final goal or a maximum number of steps is exceeded. (See Algorithm  2 & Fig. 3)

IV-C Reward Structure

To address the challenges of sparse reward environments, TopoNav employs a carefully designed reward structure that combines sparse extrinsic rewards, dense intrinsic rewards, and penalties for suboptimal exploration behavior. The total reward r𝑟ritalic_r received by the agent at each time step is calculated as follows:

r=𝑟absent\displaystyle r=italic_r = αrex+β(rin+rsg+rfe+rep+rue)𝛼subscript𝑟𝑒𝑥𝛽subscript𝑟𝑖𝑛subscript𝑟𝑠𝑔subscript𝑟𝑓𝑒subscript𝑟𝑒𝑝subscript𝑟𝑢𝑒\displaystyle\,\alpha\cdot r_{ex}+\beta\cdot(r_{in}+r_{sg}+r_{fe}+r_{ep}+r_{ue})italic_α ⋅ italic_r start_POSTSUBSCRIPT italic_e italic_x end_POSTSUBSCRIPT + italic_β ⋅ ( italic_r start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_s italic_g end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_f italic_e end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_e italic_p end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_u italic_e end_POSTSUBSCRIPT )
+γ(rp+rsd+rte+rob)𝛾subscript𝑟𝑝subscript𝑟𝑠𝑑subscript𝑟𝑡𝑒subscript𝑟𝑜𝑏\displaystyle+\gamma\cdot(r_{p}+r_{sd}+r_{te}+r_{ob})+ italic_γ ⋅ ( italic_r start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_s italic_d end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_t italic_e end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_o italic_b end_POSTSUBSCRIPT ) (4)

where rexsubscript𝑟𝑒𝑥r_{ex}italic_r start_POSTSUBSCRIPT italic_e italic_x end_POSTSUBSCRIPT represents the sparse extrinsic reward, which is given only when the agent reaches the final goal (Rgoalsubscript𝑅𝑔𝑜𝑎𝑙R_{goal}italic_R start_POSTSUBSCRIPT italic_g italic_o italic_a italic_l end_POSTSUBSCRIPT) or completes a milestone (Rmilestonesubscript𝑅𝑚𝑖𝑙𝑒𝑠𝑡𝑜𝑛𝑒R_{milestone}italic_R start_POSTSUBSCRIPT italic_m italic_i italic_l italic_e italic_s italic_t italic_o italic_n italic_e end_POSTSUBSCRIPT):

rex={Rgoal,if st+1 is the final goalRmilestone,if st+1 completes a milestone0,otherwisesubscript𝑟𝑒𝑥casessubscript𝑅𝑔𝑜𝑎𝑙if subscript𝑠𝑡1 is the final goalsubscript𝑅𝑚𝑖𝑙𝑒𝑠𝑡𝑜𝑛𝑒if subscript𝑠𝑡1 completes a milestone0otherwiser_{ex}=\begin{cases}R_{goal},&\text{if }s_{t+1}\text{ is the final goal}\\ R_{milestone},&\text{if }s_{t+1}\text{ completes a milestone}\\ 0,&\text{otherwise}\end{cases}italic_r start_POSTSUBSCRIPT italic_e italic_x end_POSTSUBSCRIPT = { start_ROW start_CELL italic_R start_POSTSUBSCRIPT italic_g italic_o italic_a italic_l end_POSTSUBSCRIPT , end_CELL start_CELL if italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT is the final goal end_CELL end_ROW start_ROW start_CELL italic_R start_POSTSUBSCRIPT italic_m italic_i italic_l italic_e italic_s italic_t italic_o italic_n italic_e end_POSTSUBSCRIPT , end_CELL start_CELL if italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT completes a milestone end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL otherwise end_CELL end_ROW (5)

The intrinsic rewards rinsubscript𝑟𝑖𝑛r_{in}italic_r start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT, rsgsubscript𝑟𝑠𝑔r_{sg}italic_r start_POSTSUBSCRIPT italic_s italic_g end_POSTSUBSCRIPT, rfesubscript𝑟𝑓𝑒r_{fe}italic_r start_POSTSUBSCRIPT italic_f italic_e end_POSTSUBSCRIPT, repsubscript𝑟𝑒𝑝r_{ep}italic_r start_POSTSUBSCRIPT italic_e italic_p end_POSTSUBSCRIPT, and ruesubscript𝑟𝑢𝑒r_{ue}italic_r start_POSTSUBSCRIPT italic_u italic_e end_POSTSUBSCRIPT encourage exploration and map expansion. rinsubscript𝑟𝑖𝑛r_{in}italic_r start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT is based on the novelty of the visited states, calculated as the inverse square root of the visitation count:

rin=1N(st)subscript𝑟𝑖𝑛1𝑁subscript𝑠𝑡r_{in}=\frac{1}{\sqrt{N(s_{t})}}italic_r start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG end_ARG (6)

rsgsubscript𝑟𝑠𝑔r_{sg}italic_r start_POSTSUBSCRIPT italic_s italic_g end_POSTSUBSCRIPT is a constant reward given when the agent discovers a new subgoal. The frontier exploration reward rfesubscript𝑟𝑓𝑒r_{fe}italic_r start_POSTSUBSCRIPT italic_f italic_e end_POSTSUBSCRIPT encourages the agent to prioritize the exploration of frontier nodes in the topological map:

rfe=λfeNfnNtnsubscript𝑟𝑓𝑒subscript𝜆𝑓𝑒subscript𝑁𝑓𝑛subscript𝑁𝑡𝑛r_{fe}=\lambda_{fe}\cdot\frac{N_{fn}}{N_{tn}}italic_r start_POSTSUBSCRIPT italic_f italic_e end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT italic_f italic_e end_POSTSUBSCRIPT ⋅ divide start_ARG italic_N start_POSTSUBSCRIPT italic_f italic_n end_POSTSUBSCRIPT end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_t italic_n end_POSTSUBSCRIPT end_ARG (7)

where Nfnsubscript𝑁𝑓𝑛N_{fn}italic_N start_POSTSUBSCRIPT italic_f italic_n end_POSTSUBSCRIPT is the number of new frontier nodes reached, Ntnsubscript𝑁𝑡𝑛N_{tn}italic_N start_POSTSUBSCRIPT italic_t italic_n end_POSTSUBSCRIPT is the total number of nodes in the topological map, and λfesubscript𝜆𝑓𝑒\lambda_{fe}italic_λ start_POSTSUBSCRIPT italic_f italic_e end_POSTSUBSCRIPT is a scaling factor. The exploration progress reward repsubscript𝑟𝑒𝑝r_{ep}italic_r start_POSTSUBSCRIPT italic_e italic_p end_POSTSUBSCRIPT rewards the agent for increasing the explored area or adding new nodes to the topological map:

rep=λepΔAAtotalsubscript𝑟𝑒𝑝subscript𝜆𝑒𝑝Δ𝐴subscript𝐴𝑡𝑜𝑡𝑎𝑙r_{ep}=\lambda_{ep}\cdot\frac{\Delta A}{A_{total}}italic_r start_POSTSUBSCRIPT italic_e italic_p end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT italic_e italic_p end_POSTSUBSCRIPT ⋅ divide start_ARG roman_Δ italic_A end_ARG start_ARG italic_A start_POSTSUBSCRIPT italic_t italic_o italic_t italic_a italic_l end_POSTSUBSCRIPT end_ARG (8)

where ΔAΔ𝐴\Delta Aroman_Δ italic_A is the increase in the explored area, Atotalsubscript𝐴𝑡𝑜𝑡𝑎𝑙A_{total}italic_A start_POSTSUBSCRIPT italic_t italic_o italic_t italic_a italic_l end_POSTSUBSCRIPT is the total area of the environment, and λepsubscript𝜆𝑒𝑝\lambda_{ep}italic_λ start_POSTSUBSCRIPT italic_e italic_p end_POSTSUBSCRIPT is a scaling factor. The uncertainty-driven exploration reward ruesubscript𝑟𝑢𝑒r_{ue}italic_r start_POSTSUBSCRIPT italic_u italic_e end_POSTSUBSCRIPT encourages the agent to explore regions where its knowledge is limited:

rue=λueU(s)subscript𝑟𝑢𝑒subscript𝜆𝑢𝑒𝑈𝑠r_{ue}=\lambda_{ue}\cdot U(s)italic_r start_POSTSUBSCRIPT italic_u italic_e end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT italic_u italic_e end_POSTSUBSCRIPT ⋅ italic_U ( italic_s ) (9)

where U(s)𝑈𝑠U(s)italic_U ( italic_s ) is the uncertainty estimate of the agent’s knowledge about the current state s𝑠sitalic_s, and λuesubscript𝜆𝑢𝑒\lambda_{ue}italic_λ start_POSTSUBSCRIPT italic_u italic_e end_POSTSUBSCRIPT is a scaling factor. To discourage suboptimal exploration behavior, TopoNav incorporates several penalty terms. The penalty for revisiting states rpsubscript𝑟𝑝r_{p}italic_r start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT discourages the agent from excessively revisiting previously explored areas:

rp=λpN(st)1N(st)subscript𝑟𝑝subscript𝜆𝑝𝑁subscript𝑠𝑡1𝑁subscript𝑠𝑡r_{p}=-\lambda_{p}\cdot\frac{N(s_{t})-1}{\sqrt{N(s_{t})}}italic_r start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = - italic_λ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ⋅ divide start_ARG italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - 1 end_ARG start_ARG square-root start_ARG italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG end_ARG (10)

where N(st)𝑁subscript𝑠𝑡N(s_{t})italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the number of times the state s𝑠sitalic_s has been visited, and λ𝜆\lambdaitalic_λ is a scaling factor that controls the magnitude of the penalty. The term N(st)1N(st)𝑁subscript𝑠𝑡1𝑁subscript𝑠𝑡\frac{N(s_{t})-1}{\sqrt{N(s_{t})}}divide start_ARG italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - 1 end_ARG start_ARG square-root start_ARG italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG end_ARG ensures that the penalty increases as the number of revisits grows, but with a diminishing effect to avoid completely discouraging revisits. The subgoal diversity penalty rsdsubscript𝑟𝑠𝑑r_{sd}italic_r start_POSTSUBSCRIPT italic_s italic_d end_POSTSUBSCRIPT penalizes the agent for selecting subgoals that are similar to previously visited subgoals:

rsd=λsdmaxghsim(gt,gh)subscript𝑟𝑠𝑑subscript𝜆𝑠𝑑subscriptsubscript𝑔simsubscript𝑔𝑡subscript𝑔r_{sd}=-\lambda_{sd}\cdot\max\limits_{g_{h}\in\mathcal{H}}\text{sim}(g_{t},g_{% h})italic_r start_POSTSUBSCRIPT italic_s italic_d end_POSTSUBSCRIPT = - italic_λ start_POSTSUBSCRIPT italic_s italic_d end_POSTSUBSCRIPT ⋅ roman_max start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ∈ caligraphic_H end_POSTSUBSCRIPT sim ( italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) (11)

where gtsubscript𝑔𝑡g_{t}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the current subgoal, \mathcal{H}caligraphic_H is the history of visited subgoals, sim()sim\text{sim}(\cdot)sim ( ⋅ ) is a similarity function, and λsdsubscript𝜆𝑠𝑑\lambda_{sd}italic_λ start_POSTSUBSCRIPT italic_s italic_d end_POSTSUBSCRIPT is a scaling factor. In TopoNav, the similarity function sim()sim\text{sim}(\cdot)sim ( ⋅ ) is based on the Euclidean distance between the feature vectors associated with the subgoals. Each subgoal g𝑔gitalic_g is represented by a feature vector fgsubscript𝑓𝑔f_{g}italic_f start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT that encodes its visual and spatial characteristics. The similarity between two subgoals gtsubscript𝑔𝑡g_{t}italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and ghsubscript𝑔g_{h}italic_g start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT is calculated as:

sim(gt,gh)=i=1n(fgt(i)fgh(i))2simsubscript𝑔𝑡subscript𝑔superscriptsubscript𝑖1𝑛superscriptsuperscriptsubscript𝑓subscript𝑔𝑡𝑖superscriptsubscript𝑓subscript𝑔𝑖2\text{sim}(g_{t},g_{h})=\sqrt{\sum_{i=1}^{n}(f_{g_{t}}^{(i)}-f_{g_{h}}^{(i)})^% {2}}sim ( italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) = square-root start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_f start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT - italic_f start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (12)

This similarity measure allows TopoNav to quantify the dissimilarity between subgoals based on their visual and spatial attributes, promoting the selection of diverse subgoals during exploration. The temporal exploration penalty rtesubscript𝑟𝑡𝑒r_{te}italic_r start_POSTSUBSCRIPT italic_t italic_e end_POSTSUBSCRIPT penalizes the agent for long periods of non-exploration:

rte=λte(ttlastexp)subscript𝑟𝑡𝑒subscript𝜆𝑡𝑒𝑡subscript𝑡𝑙𝑎𝑠subscript𝑡𝑒𝑥𝑝r_{te}=-\lambda_{te}\cdot(t-t_{last_{e}xp})italic_r start_POSTSUBSCRIPT italic_t italic_e end_POSTSUBSCRIPT = - italic_λ start_POSTSUBSCRIPT italic_t italic_e end_POSTSUBSCRIPT ⋅ ( italic_t - italic_t start_POSTSUBSCRIPT italic_l italic_a italic_s italic_t start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT italic_x italic_p end_POSTSUBSCRIPT ) (13)

where t𝑡titalic_t is the current time step, tlastexpsubscript𝑡𝑙𝑎𝑠subscript𝑡𝑒𝑥𝑝t_{last_{e}xp}italic_t start_POSTSUBSCRIPT italic_l italic_a italic_s italic_t start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT italic_x italic_p end_POSTSUBSCRIPT is the time step of the last exploration progress, and λtesubscript𝜆𝑡𝑒\lambda_{te}italic_λ start_POSTSUBSCRIPT italic_t italic_e end_POSTSUBSCRIPT is a scaling factor. An additional penalty robsubscript𝑟𝑜𝑏r_{ob}italic_r start_POSTSUBSCRIPT italic_o italic_b end_POSTSUBSCRIPT is introduced to penalize the agent for hitting obstacles:

rob={Robstacle,if the agent hits an obstacle0,otherwisesubscript𝑟𝑜𝑏casessubscript𝑅𝑜𝑏𝑠𝑡𝑎𝑐𝑙𝑒if the agent hits an obstacle0otherwiser_{ob}=\begin{cases}R_{obstacle},&\text{if the agent hits an obstacle}\\ 0,&\text{otherwise}\end{cases}italic_r start_POSTSUBSCRIPT italic_o italic_b end_POSTSUBSCRIPT = { start_ROW start_CELL italic_R start_POSTSUBSCRIPT italic_o italic_b italic_s italic_t italic_a italic_c italic_l italic_e end_POSTSUBSCRIPT , end_CELL start_CELL if the agent hits an obstacle end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL otherwise end_CELL end_ROW (14)

where Robstaclesubscript𝑅𝑜𝑏𝑠𝑡𝑎𝑐𝑙𝑒R_{obstacle}italic_R start_POSTSUBSCRIPT italic_o italic_b italic_s italic_t italic_a italic_c italic_l italic_e end_POSTSUBSCRIPT is a negative constant reward. This penalty encourages the agent to avoid collisions and navigate safely in the environment. The hyperparameters α𝛼\alphaitalic_α, β𝛽\betaitalic_β, and γ𝛾\gammaitalic_γ balance the contributions of the extrinsic rewards, intrinsic rewards, and penalties, respectively.

IV-D Hierarchical Policy Learning

TopoNav utilizes the H-DQN algorithm to develop navigation policies across different abstraction levels. This framework employs a meta-controller for high-level subgoal selection and sub-controllers for detailed navigation tasks. The meta-controller, leveraging an attention-based feature vector 𝐟tsubscript𝐟𝑡\mathbf{f}_{t}bold_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from current observations, selects subgoals from the topological map using a DQN-based policy πm(sm,g;θm)subscript𝜋𝑚subscript𝑠𝑚𝑔subscript𝜃𝑚\pi_{m}(s_{m},g;\theta_{m})italic_π start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , italic_g ; italic_θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ). This selection process prioritizes key locations or landmarks in the environment, informed by both the robot’s current state and the desired goal features. On the other hand, sub-controllers focus on reaching these subgoals by executing actions derived from a separate DQN policy πs(ss,a;θs)subscript𝜋𝑠subscript𝑠𝑠𝑎subscript𝜃𝑠\pi_{s}(s_{s},a;\theta_{s})italic_π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_a ; italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ), guided by the local obstacle map and the robot’s kinematics to ensure safe navigation. The carefully designed reward structure (described in section  IV-C) plays an important role in this hierarchy. Training alternates between meta and sub-controllers using their respective experiences and policies, optimizing a composite reward structure that balances exploration, safety, and goal orientation. This approach enables TopoNav to adaptively learn efficient navigation strategies in sparse-reward settings, leveraging the strengths of hierarchical learning and attention-based feature extraction.

Theorem 1 (Subgoal Reachability). Given a topological map G=(V,E)𝐺𝑉𝐸G=(V,E)italic_G = ( italic_V , italic_E ) constructed by TopoNav, with V𝑉Vitalic_V representing nodes (subgoals) and E𝐸Eitalic_E representing edges (navigable paths), for any two nodes vi,vjVsubscript𝑣𝑖subscript𝑣𝑗𝑉v_{i},v_{j}\in Vitalic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ italic_V, if a path exists from visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in G𝐺Gitalic_G, then the sub-controller policy πssubscript𝜋𝑠\pi_{s}italic_π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT learned by TopoNav can navigate the robot from visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT with a probability of at least 1ϵ1italic-ϵ1-\epsilon1 - italic_ϵ. Here, ϵitalic-ϵ\epsilonitalic_ϵ is a small positive constant representing the upper bound of navigation failure probability, which accounts for environmental stochasticity and the convergence of the learning algorithm. TopoNav minimizes ϵitalic-ϵ\epsilonitalic_ϵ through robust feature extraction and continuous map refinement.

Proof. Consider a path Pij={vi,ei1,v1,,e(k1)j,vj}subscript𝑃𝑖𝑗subscript𝑣𝑖subscript𝑒𝑖1subscript𝑣1subscript𝑒𝑘1𝑗subscript𝑣𝑗P_{ij}=\{v_{i},e_{i1},v_{1},...,e_{(k-1)j},v_{j}\}italic_P start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = { italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_e start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_e start_POSTSUBSCRIPT ( italic_k - 1 ) italic_j end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } between nodes visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in the topological map G𝐺Gitalic_G, incorporating both nodes and the edges epqEsubscript𝑒𝑝𝑞𝐸e_{pq}\in Eitalic_e start_POSTSUBSCRIPT italic_p italic_q end_POSTSUBSCRIPT ∈ italic_E between them. The sub-controller policy πssubscript𝜋𝑠\pi_{s}italic_π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, leveraging DQN methodologies, aims for a navigation success across each edge epqsubscript𝑒𝑝𝑞e_{pq}italic_e start_POSTSUBSCRIPT italic_p italic_q end_POSTSUBSCRIPT with a probability 1ϵpqabsent1subscriptitalic-ϵ𝑝𝑞\geq 1-\epsilon_{pq}≥ 1 - italic_ϵ start_POSTSUBSCRIPT italic_p italic_q end_POSTSUBSCRIPT, where ϵpqsubscriptitalic-ϵ𝑝𝑞\epsilon_{pq}italic_ϵ start_POSTSUBSCRIPT italic_p italic_q end_POSTSUBSCRIPT signifies the maximum probability of failing to navigate from vpsubscript𝑣𝑝v_{p}italic_v start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT to vqsubscript𝑣𝑞v_{q}italic_v start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT successfully. The overall probability of navigating the entire path Pijsubscript𝑃𝑖𝑗P_{ij}italic_P start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT, as the product of the success probabilities for its constituting edges, is epqPij(1ϵpq)subscriptproductsubscript𝑒𝑝𝑞subscript𝑃𝑖𝑗1subscriptitalic-ϵ𝑝𝑞\prod_{e_{pq}\in P_{ij}}(1-\epsilon_{pq})∏ start_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_p italic_q end_POSTSUBSCRIPT ∈ italic_P start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( 1 - italic_ϵ start_POSTSUBSCRIPT italic_p italic_q end_POSTSUBSCRIPT ). This is theoretically at least 1ϵ1italic-ϵ1-\epsilon1 - italic_ϵ, where ϵ=kmaxp,qϵpqitalic-ϵ𝑘subscript𝑝𝑞subscriptitalic-ϵ𝑝𝑞\epsilon=k\max_{p,q}\epsilon_{pq}italic_ϵ = italic_k roman_max start_POSTSUBSCRIPT italic_p , italic_q end_POSTSUBSCRIPT italic_ϵ start_POSTSUBSCRIPT italic_p italic_q end_POSTSUBSCRIPT represents the compounded probability of any failure occurring along the path with k𝑘kitalic_k edges. Therefore, with the sub-controller policy πssubscript𝜋𝑠\pi_{s}italic_π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, TopoNav can proficiently facilitate navigation from visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT within the given environmental conditions and learning algorithm constraints, maintaining a success probability of at least 1ϵ1italic-ϵ1-\epsilon1 - italic_ϵ. \square

V Experiments and Results

In this section, we present the experimental setup, results, and analysis of the TopoNav framework. We evaluate the performance of TopoNav in both simulated and real-world environments and compare it with state-of-the-art baselines.

V-A Evaluation Environments

The TopoNav framework is evaluated in a range of simulated and real-world environments. The simulated environments are created using the Unity 3D engine and include outdoor urban scenes and off-road terrains. The simulated environments incorporate landmarks and objects similar to those in our real-world test scenarios (Fig. 4(d)). The real-world experiments are conducted in an outdoor off-road environment using a Clearpath Jackal robot. The environments are designed to cover various navigation challenges, such as narrow passages, obstacles, uneven terrain, and dead ends. The size of the environments ranges from 20m × 20m to 200m × 200m, and the complexity is varied by adjusting the density and arrangement of obstacles. For each environment, multiple starting and goal locations are randomly sampled to create a diverse set of navigation tasks. The performance of TopoNav is evaluated over 100 episodes in each environment, and the average metrics are reported.

V-B Evaluation Metrics

We evaluate the performance of TopoNav and the baselines using the following metrics:

  • Success Rate: The percentage of successful navigation episodes where the robot reaches the goal location within a specified time limit.

  • Navigation Time: The average time taken by the robot to reach the goal location in successful episodes.

  • Trajectory Length: The average length of the robot’s trajectory in successful episodes.

  • Exploration Coverage: The percentage of the environment explored by the robot during navigation.

V-C Baselines

To rigorously evaluate TopoNav’s performance, we benchmark it against a diverse set of state-of-the-art baselines:

  • PlaceNav: A topological navigation approach that utilizes visual place recognition for subgoal selection and integrates a Bayesian filter to improve the temporal consistency of subgoal selection [27].

  • TopoMap: A method focusing on navigation through the construction of topological maps from predefined landmarks [15], providing a direct comparison to assess the benefits of TopoNav’s dynamic mapping and navigation strategy.

  • ViNG (Visual Navigation with Goals): A learning-based navigation approach that learns to navigate to visual goals in open-world environments by combining hierarchical planning and topological mapping [31].

  • Lifelong Topological Visual Navigation (LTVN): A topological navigation method that builds and continuously refines a sparse topological graph [28].

Refer to caption
((a))
Refer to caption
((b))
Refer to caption
((c))
Refer to caption
((d))
Refer to caption
((e))
Refer to caption
((f))
Refer to caption
((g))
Refer to caption
((h))
Figure 4: (a-d) The robot navigates through various outdoor environments: (a) an open space without obstacles/vegetation (Scenario 1), (b) a natural setting with trees/vegetation (Scenario 2), (c) a cluttered environment with obstacles and landmarks (Scenario 3), and (d) a diverse terrain in simulation. (e-h) The corresponding topological maps generated by TopoNav for each scenario. The topological maps capture the connectivity and traversability of the environments, representing key locations and paths. The green dots in the maps represent the nodes or subgoals, which correspond to landmarks or distinct places. The edges (red lines) indicate the navigability between these nodes, enabling efficient path planning and navigation for the robot in each setting.
Refer to caption
((a))
Refer to caption
((b))
Figure 5: (a) A 3D point cloud representation of the outdoor environment obtained from the robot’s LiDAR sensor, showcasing the terrain and obstacle details. (b) Robot navigating through a physical environment, guided by the TopoNav framework, executes an immediate obstacle avoidance maneuver to select an alternative route when faced with an obstruction (red cross, fallen trees).
Metrics Methods Scenario 1 Scenario 2 Scenario 3
Success PlaceNav 89 87 88
Rate (%) TopoMap 82 79 80
ViNG 84 82 83
LTVN 90 88 89
TopoNav (Ours) 98 94 92
Navigation PlaceNav 37.5 38.2 39.0
Time (s) TopoMap 47.5 46.9 48.2
ViNG 39.8 40.5 41.2
LTVN 36.2 37.0 37.8
TopoNav (Ours) 30.6 33.8 35.1
Trajectory PlaceNav 13.8 14.2 14.5
Length (m) TopoMap 15.8 15.2 15.7
ViNG 14.4 14.9 15.3
LTVN 12.9 13.5 13.8
TopoNav (Ours) 10.4 11.6 12.3
Exploration PlaceNav 83 81 82
Coverage (%) TopoMap 75 73 74
ViNG 79 77 78
LTVN 84 83 85
TopoNav (Ours) 90 88 89
TABLE I: Comparative Analysis: TopoNav vs. SOTA methods across scenarios, highlighting performance in three distinct environments (Section V-E).

V-D Implementation Details

Our RL network is implemented in PyTorch and trained using simulated terrains with a Clearpath Husky robot in ROS Noetic and Unity Simulation framework, which allows for testing in realistic outdoor environments with diverse terrain and objects. The simulated Husky robot is mounted with a Velodyne VLP16 3D LiDAR. The network is trained in a workstation with a 10th-generation Intel Core i9-10850K processor and an NVIDIA GeForce RTX 3090 GPU. During training, our network uses a batch size of 128 and performs gradient updates using the Adam optimizer with learning rate λ=104𝜆superscript104\lambda=10^{-4}italic_λ = 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT. The training process occurs over 1 million steps in our simulation environment, with performance evaluated at regular intervals. For real-time deployment and inference, we use the Jackal UGV from Clearpath Robotics equipped with a 3D VLP-32C Velodyne Ultrapuck LiDAR and an AXIS Fixed IP Camera. It also includes an onboard Intel computer system, equipped with an Intel i7-9700TE CPU and an NVIDIA GeForce GTX 1650 Ti GPU. We utilize the LiDAR sensor provided 3D point cloud data for obstacle detection and avoidance using the Dynamic Window Approach (DWA)  [32].

V-E Testing Scenario

We evaluate our topology based navigation framework in three scenarios (see Fig. 4).

  • Scenario 1 (Goal Reaching): In this scenario, the robot is placed in an obstacle-free outdoor environment with an objective of reaching a given goal location.

  • Scenario 2 (Feature-Based Navigation): In this scenario, the robot is placed in an outdoor area with static natural features such as trees, rocks, structures, and sharp hills. The goal is to reach the destination using the natural features for localization and mapping, challenging its environmental perception and landmark utilization.

  • Scenario 3 (Navigating on Complex Terrains): In this scenario, the robot is placed in an environment with both obstacles and landmarks, this scenario assesses the robot’s simultaneous obstacle avoidance and feature-based navigation, testing its adaptability in complex settings.

V-F Performance Comparison and Analysis

Table I demonstrates TopoNav’s significant improvements over existing navigation systems. Our framework outperforms baseline methods across key performance metrics, highlighting its effectiveness in real-world scenarios. In goal-reaching tasks, TopoNav achieves a 98% success rate, surpassing PlaceNav [27], LTVN [28], and ViNG [31]. These baselines often struggle with efficient path planning in sparsely rewarded environments. In contrast, TopoNav leverages its hierarchical structure and dynamic topological mapping for more effective route optimization. The limitations of the baselines become evident in feature-based navigation, where they fail to consistently utilize natural landmarks for navigation. TopoNav excels in this scenario, achieving an 94% success rate. This demonstrates its superior ability to accurately navigate by effectively using environmental features for guidance. The complex terrain scenario further highlights TopoNav’s robustness. With a 92% success rate, it significantly outperforms baseline systems, which often experience performance degradation in challenging navigational conditions. TopoNav’s adaptability to diverse and unpredictable terrains underscores the shortcomings of less flexible systems that heavily depend on stable and predictable environments. Moreover, the integration of CBAM [30] into TopoNav’s feature extraction module greatly enhances landmark detection, a common shortcoming in baseline methods. This results in an average 8% increase in detection accuracy, improving the reliability of subgoal generation and map construction. The subsequent impact on navigation success—up to a 14% improvement—and reduction in navigation times—up to a 28% decrease—further establishes TopoNav as a significant advancement in autonomous robot navigation.

Method Success Navigation Trajectory Exploration
Rate (%) Time (s) Length (m) Coverage (%)
TopoNav 98 30.6 10.4 90
TopoNav 88 40.2 14.1 80
w/o Topo Map
TopoNav 85 45.3 15.7 75
w/o Hierarchy
TopoNav 86 42.5 14.8 78
w/o Attention
TABLE II: Ablation study results.

V-G Ablation Study

To analyze the contribution of each component in TopoNav, we conduct an ablation study by removing the topological mapping, the hierarchical structure, and the attention-based CNN separately. Table II shows the results of the ablation study in the outdoor environment (Scenario 1 V-E). The results show that removing any component from TopoNav decreases success rates and exploration coverage while increasing navigation times and trajectory lengths.

VI Conclusion, Limitations, and Future Directions

In this paper, we introduced TopoNav, a novel framework for autonomous navigation in unknown environments with sparse rewards. TopoNav integrates hierarchical reinforcement learning with dynamic topological mapping to enable efficient exploration, map construction, and goal-directed navigation. The key components of TopoNav include a two-level hierarchical policy learning architecture, an active topological mapping approach, and an intrinsic reward mechanism for encouraging exploration. Despite the promising results, TopoNav has several limitations that provide an opportunity for future research. One limitation is the scalability of TopoNav to larger and more complex environments. The current implementation may face computational challenges when dealing with extensive topological maps and long-horizon navigation tasks. Future work could explore techniques for map compression and distributed computing to improve the scalability of TopoNav. Furthermore, developing techniques for distributed map construction, information sharing, and coordinated decision-making could enable more efficient exploration and navigation in large-scale environments with multiple robots.

Acknowledgment

This work has been partially supported by the DEVCOM Army Research Laboratory (ARL) under a cooperative agreement (W911NF2120076), ONR Grant #N00014-23-1-2119, NSF CAREER Award #1750936, NSF REU Site Grant #2050999, and NSF CNS EAGER Grant #2233879.

References

  • [1] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on robotics, 32(6):1309–1332, 2016.
  • [2] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping: part i. IEEE robotics & automation magazine, 13(2):99–110, 2006.
  • [3] Mitchell Chen, Alec Radford, Ilya Sutskever, Ilya Sutskever, and Dario Amodei. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6750–6759, 2019.
  • [4] Fatemeh Niroui, Kaicheng Zhang, Zepeng Kashino, and Goldie Nejat. Deep reinforcement learning robot for search and rescue applications: Exploration in unknown cluttered environments. volume 4, pages 610–617. IEEE, 2019.
  • [5] Giorgio Grisetti, Rainer Kummerle, Cyrill Stachniss, and Wolfram Burgard. A tutorial on graph-based slam. IEEE Intelligent Transportation Systems Magazine, 2(4):31–43, 2010.
  • [6] Mohamed Elbanhawi and Milan Simic. Sampling-based robot motion planning: A review. IEEE Access, 2:56–77, 2014.
  • [7] Lei Tai, Qiong Zhang, Meng Liu, Joschka Boedecker, and Wolfram Burgard. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. arXiv preprint arXiv:1703.00420, 2017.
  • [8] Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
  • [9] Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, pages 15246–15257, 2019.
  • [10] Andrew Levy, Robert Platt, and Kate Saenko. Learning multi-level hierarchies with hindsight. In International Conference on Learning Representations, 2019.
  • [11] Amy Zhang, Adam Lerer, Sainbayar Sukhbaatar, Rob Fergus, and Arthur Szlam. Generating adjacency-constrained subgoals for hierarchical reinforcement learning. arXiv preprint arXiv:2006.11485, 2020.
  • [12] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 31:3303–3313, 2018.
  • [13] Benjamin Kuipers. The spatial semantic hierarchy. Artificial intelligence, 119(1-2):191–233, 2000.
  • [14] Jaime Boal, Alvaro Sanchez-Miralles, and Alvaro Arranz. Topological simultaneous localization and mapping: a survey. Robotica, 32(5):803–821, 2014.
  • [15] Fabian Blochliger, Marius Fehr, Marcin Dymczyk, Thomas Schneider, and Rol Siegwart. Topomap: Topological mapping and navigation based on visual slam maps. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3818–3825. IEEE, 2018.
  • [16] Stephanie Lowry, Niko Sünderhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2016.
  • [17] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
  • [18] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems, 29:3675–3683, 2016.
  • [19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. In Nature, volume 518, pages 529–533. Nature Publishing Group, 2015.
  • [20] Siyuan Li, Rui Wang, Minxue Tang, and Chongjie Zhang. Hierarchical reinforcement learning with advantage-based auxiliary rewards. Advances in Neural Information Processing Systems, 32, 2019.
  • [21] Chengshu Li, Fei Xia, Roberto Martin-Martin, and Silvio Savarese. Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators. In Conference on Robot Learning, pages 603–616. PMLR, 2020.
  • [22] Yugang Liu and Goldie Nejat. Robotic urban search and rescue: A survey from the control perspective. volume 72, pages 147–165. Springer, 2013.
  • [23] Michael Kaess, Ananth Ranganathan, and Frank Dellaert. isam: Incremental smoothing and mapping. IEEE Transactions on Robotics, 24(6):1365–1378, 2008.
  • [24] Viorela Ila, Lukas Polok, Marek Solony, and Pavel Svoboda. Slam++-a highly efficient and temporally scalable incremental slam framework. The International Journal of Robotics Research, 36(2):210–230, 2017.
  • [25] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. In International Conference on Learning Representations, 2020.
  • [26] Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12875–12884, 2020.
  • [27] Lauri Suomela, Jussi Kalliola, Harry Edelman, and Joni-Kristian Kämäräinen. Placenav: Topological navigation through place recognition. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
  • [28] Rey Reza Wiyatno, Anqi Xu, and Liam Paull. Lifelong topological visual navigation. IEEE Robotics and Automation Letters, 7(4):9271–9278, 2022.
  • [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [30] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [31] Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Ving: Learning open-world navigation with visual goals. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13215–13222. IEEE, 2021.
  • [32] Dieter Fox, Wolfram Burgard, and Sebastian Thrun. The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine, 4(1):23–33, 1997.