Chen Wang, Spatial AI & Robotics (SAIR) Lab, Department of Computer Science and Engineering, University at Buffalo, NY 14260, USA.
Imperative Learning: A Self-supervised Neuro-Symbolic Learning Framework for Robot Autonomy
Abstract
Data-driven methods such as reinforcement and imitation learning have achieved remarkable success in robot autonomy. However, their data-centric nature still hinders them from generalizing well to ever-changing environments. Moreover, collecting large datasets for robotic tasks is often impractical and expensive. To overcome these challenges, we introduce a new self-supervised neuro-symbolic (NeSy) computational framework, imperative learning (IL), for robot autonomy, leveraging the generalization abilities of symbolic reasoning. The framework of IL consists of three primary components: a neural module, a reasoning engine, and a memory system. We formulate IL as a special bilevel optimization (BLO), which enables reciprocal learning over the three modules. This overcomes the label-intensive obstacles associated with data-driven approaches and takes advantage of symbolic reasoning concerning logical reasoning, physical principles, geometric analysis, etc. We discuss several optimization techniques for IL and verify their effectiveness in five distinct robot autonomy tasks including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. Through various experiments, we show that IL can significantly enhance robot autonomy capabilities and we anticipate that it will catalyze further research across diverse domains.
Keywords: Neuro-Symbolic AI, Self-supervised Learning, Bilevel Optimization, Imperative Learning
1 Introduction
With the rapid development of deep learning (LeCun et al., 2015), there has been growing interest in data-driven approaches such as reinforcement learning (Zhu and Zhang, 2021) and imitation learning (Hussein et al., 2017) for robot autonomy. However, despite these notable advancements, many data-driven autonomous systems are still predominantly constrained to their training environments, exhibiting limited generalization ability (Banino et al., 2018; Albrecht et al., 2022).
As a comparison, humans are capable of internalizing their experiences as abstract concepts or symbolic knowledge (Borghi et al., 2017). For instance, we interpret the terms “road” and “path” as symbols or concepts for navigable areas, whether it’s a paved street in a city or a dirt trail through a forest (Hockley, 2011). Equipped with these concepts, humans can employ spatial reasoning to navigate new and complex environments (Strader et al., 2024). This adaptability to novel scenarios, rooted in our ability to abstract and symbolize, is a fundamental aspect of human intelligence that is still notably absent in existing data-driven autonomous robots (Garcez et al., 2022; Kautz, 2022).
Though implicit reasoning abilities have garnered increased attention in the context of large language models (LLMs) (Lu et al., 2023; Shah et al., 2023a), robot autonomy systems still encounter significant challenges in achieving interpretable reasoning. This is particularly evident in areas such as geometric, physical, and logical reasoning (Liu et al., 2023). Overcoming these obstacles and integrating interpretable symbolic reasoning into data-driven models, a direction known as neuro-symbolic (NeSy) reasoning, could remarkably enhance robot autonomy (Garcez et al., 2022).
While NeSy reasoning offers substantial potential, its specific application in the field of robotics is still in a nascent stage. A key reason for this is the emerging nature of the NeSy reasoning field itself, which has not yet reached a consensus on a rigorous definition (Kautz, 2022). On the one hand, a narrow definition views NeSy reasoning as an amalgamation of neural methods (data-driven) and symbolic methods, which utilize formal logic and symbols for knowledge representation and rule-based reasoning. On the other hand, a broader definition expands the scope of what constitutes a “symbol”. In this perspective, symbols are not only logical terms but also encompass any comprehensible, human-conceived concepts. This can include physical properties and semantic attributes, as seen in approaches involving concrete equations (Duruisseaux et al., 2023), logical programming (Delfosse et al., 2023), and programmable objectives (Yonetani et al., 2021a). This broad definition hence encapsulates reasoning related to physical principles, logical reasoning, geometrical analytics, etc. In this context, many works in the literature exemplify NeSy systems, e.g., model-based reinforcement learning (Moerland et al., 2023), physics-informed networks (Karniadakis et al., 2021), and learning-aided tasks such as control (O’Connell et al., 2022), task scheduling (Gondhi and Gupta, 2017), and geometry analytics (Heidari and Iosifidis, 2024).
In this article, we explore the broader definition of NeSy reasoning for robot autonomy and introduce a self-supervised NeSy learning framework, which will be referred to as imperative learning (IL). It is designed to overcome the well-known issues of existing learning frameworks for robot autonomy: (1) Generalization ability: Many data-driven systems, including reinforcement learning models (Banino et al., 2018; Albrecht et al., 2022), remain largely confined to their training environments, displaying limited generalization capabilities. One reason for this limitation is their inability to learn explicit commonsense rules, which hinders their effective transfer to new environments. This encourages us to delve into symbolic reasoning techniques. (2) Black-box nature: The underlying causal mechanisms of data-driven models are mostly unknown, while a problematic decision can have catastrophic side effects in robotic tasks (Dulac-Arnold et al., 2021). Consequently, these models frequently encounter difficulties when applied to real-world situations, which further encourages us to delve into NeSy systems. (3) Label intensiveness: Labeling data for robotic tasks like imitation learning (Zare et al., 2023) often incurs higher costs than tasks in computer vision due to the reliance on specialized equipment over basic human annotations (Ebadi et al., 2022). For example, annotating accurate ground truth for robot planning is extremely complicated due to the intricacies of robot dynamics. This underscores the critical need for efficient self-supervised learning methods. (4) Sub-optimality: Separately training neural and symbolic modules can result in sub-optimal integration due to compounded errors (Besold et al., 2021). This motivates the investigation of an end-to-end NeSy learning approach.
IL is formulated to address the above issues with a single design. It is inspired by an interesting observation: while data-driven models predominantly require labeled data for parameter optimization, symbolic reasoning models are often capable of functioning without labels. Yet, both types of models can be optimized using gradient descent-like iterative methods. Take, for instance, geometrical models such as bundle adjustment (BA), physical models like model predictive control (MPC), and discrete models such as the A∗ search over a graph. They can all be optimized without providing labels, even though they are analogous to data-driven models when formulated as optimization problems. IL leverages this property of both approaches by enforcing each method to mutually correct the other, thereby creating a novel self-supervised learning paradigm. To optimize the entire framework, IL is formulated as a special bilevel optimization (BLO), solved by back-propagating the errors from the self-supervised symbolic models into the neural models. The term “imperative” is adopted to highlight and describe this passive self-supervised learning procedure.
In summary, the contributions of this article include:
• We explore a self-supervised NeSy learning framework, referred to as imperative learning (IL), for robot autonomy. IL is formulated as a special BLO problem to enforce the network to learn symbolic concepts and enhance symbolic reasoning via data-driven methods. This leads to a reciprocal learning paradigm, which can avoid the sub-optimal solutions caused by the compounded errors of decoupled systems.
• We discuss several optimization strategies to tackle the technical challenges of IL. We present how to incorporate different optimization techniques into our IL framework, including closed-form solution, first-order optimization, second-order optimization, constrained optimization, and discrete optimization.
• To benefit the entire community, we demonstrate the effectiveness of IL for several tasks in robot autonomy including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. We release the source code at https://sairlab.org/iseries/ to inspire more robotics research using IL.
This article is inspired by and built on our previous works in different fields of robot autonomy including local planning (Yang et al., 2023), global planning (Chen et al., 2024), simultaneous localization and mapping (SLAM) (Fu et al., 2024), feature matching (Zhan et al., 2024), and multi-agent routing (Guo et al., 2024). These works introduced prototypes of IL in several different domains but did not provide a systematic methodology for NeSy learning in robot autonomy. This article fills this gap by formally defining IL, exploring various optimization challenges of IL in different robot autonomy tasks, and introducing new applications such as rule induction and optimal control. Additionally, we introduce the theoretical background for solving IL based on BLO, propose several practical solutions to IL by experimenting with the five distinct robot autonomy applications, and demonstrate the superiority of IL over state-of-the-art (SOTA) methods in their respective fields.
2 Related Works
2.1 Bilevel Optimization
Bilevel optimization (BLO), first introduced by Bracken and McGill (1973), has been studied for decades. Classical approaches replaced the lower-level problem with its optimality conditions as constraints and reformulated the bilevel program into a single-level constrained problem (Hansen et al., 1992; Gould et al., 2016; Shi et al., 2005; Sinha et al., 2017). More recently, gradient-based BLO has attracted significant attention due to its efficiency and effectiveness in modern machine learning and deep learning problems. Since this paper mainly focuses on the learning side, we will concentrate on gradient-based BLO methods, and briefly discuss their limitations in robot learning problems.
Methodologically, gradient-based BLO can be generally divided into approximate implicit differentiation (AID), iterative differentiation (ITD), and value-function-based approaches. Based on the explicit form of the gradient (or hypergradient) of the upper-level objective function via implicit differentiation, AID-based methods adopt a generalized iterative solver for the lower-level problem as well as an efficient estimate for Hessian-inverse-vector product of the hypergradient (Domke, 2012; Pedregosa, 2016; Liao et al., 2018; Arbel and Mairal, 2022a). ITD-based methods approximate the hypergradient by directly taking backpropagation over a flexible inner-loop optimization trajectory using forward or backward mode of autograd (Maclaurin et al., 2015; Franceschi et al., 2017; Finn et al., 2017; Shaban et al., 2019; Grazzi et al., 2020). Value-function-based approaches reformulated the lower-level problem as a value-function-based constraint and solved this constrained problem via various constrained optimization techniques such as mixed gradient aggregation, log-barrier regularization, primal-dual method, and dynamic barrier (Sabach and Shtern, 2017; Liu et al., 2020a; Li et al., 2020a; Sow et al., 2022; Liu et al., 2021b; Ye et al., 2022). Recently, large-scale stochastic BLO has been extensively studied both in theory and in practice. For example, Chen et al. (2021) and Ji et al. (2021) proposed a Neumann series-based hypergradient estimator; Yang et al. (2021), Huang and Huang (2021), Guo and Yang (2021), Yang et al. (2021), and Dagréou et al. (2022) incorporated the strategies of variance reduction and recursive momentum; and Sow et al. (2021) developed an evolutionary strategies (ES)-based method without computing Hessian or Jacobian.
Theoretically, the convergence of BLO has been analyzed extensively based on a key assumption that the lower-level problem is strongly convex (Franceschi et al., 2018; Shaban et al., 2019; Liu et al., 2021b; Ghadimi and Wang, 2018; Ji et al., 2021; Hong et al., 2020; Arbel and Mairal, 2022a; Dagréou et al., 2022; Ji et al., 2022a; Huang et al., 2022). Among them, Ji and Liang (2021) further provided lower complexity bounds for deterministic BLO with (strongly) convex upper-level functions. Guo and Yang (2021), Chen et al. (2021), Yang et al. (2021), and Khanduri et al. (2021) achieved a near-optimal sample complexity with second-order derivatives. Kwon et al. (2023) and Yang et al. (2024) analyzed the convergence of first-order stochastic BLO algorithms. Recent works studied a more challenging setting where the lower-level problem is convex or satisfies the Polyak-Łojasiewicz (PL) or Morse-Bott conditions (Liu et al., 2020a; Li et al., 2020a; Sow et al., 2022; Liu et al., 2021b; Ye et al., 2022; Arbel and Mairal, 2022b; Chen et al., 2023; Liu et al., 2021c). More results on BLO and its analysis can be found in the surveys (Liu et al., 2021a; Chen et al., 2022).
BLO has been integrated into machine learning applications. For example, researchers have introduced differentiable optimization layers (Amos and Kolter, 2017), convex layers (Agrawal et al., 2019), and declarative layers (Gould et al., 2021) into deep neural networks. They have been applied to several applications such as optical flow (Jiang et al., 2020), pivoting manipulation (Shirai et al., 2022), control (Landry, 2021), and trajectory generation (Han et al., 2024). However, systematic approaches and methodologies to NeSy learning for robot autonomy remain under-explored. Moreover, robotics problems are often highly non-convex, leading to many local minima and saddle points (Jadbabaie et al., 2019), adding optimization difficulties. We will explore the methods with assured convergence as well as those empirically validated by various tasks in robot autonomy.
2.2 Learning Frameworks in Robotics
We summarize the major learning frameworks in robotics, including imitation learning, reinforcement learning, and meta-learning. The others will be briefly mentioned.
Imitation Learning
is a technique where robots learn tasks by observing and mimicking an expert’s actions. Without explicitly modeling complex behaviors, robots can perform a variety of tasks such as dexterous manipulation (McAleer et al., 2018), navigation (Triest et al., 2023), and environmental interaction (Chi et al., 2023). Current research includes leveraging historical data, modeling multi-modal behaviors, employing privileged teachers (Kaufmann et al., 2020; Chen et al., 2020a; Lee et al., 2020), and utilizing generative models to generate data like generative adversarial networks (Ho and Ermon, 2016), variational autoencoders (Zhao et al., 2023), and diffusion models (Chi et al., 2023). These advancements highlight the vibrant and ongoing exploration of imitation learning.
Imitation learning differs from regular supervised learning as it does not assume that the collected data is independent and identically distributed (iid), and relies solely on expert data representing “good” behaviors. Therefore, any small mistake during testing can lead to cascading failures. While techniques such as introducing intentional errors for data augmentation (Pomerleau, 1988; Tagliabue et al., 2022; Codevilla et al., 2018) and expert querying for data aggregation (Ross et al., 2011) exist, they still face notable challenges. These include low data efficiency, where limited or suboptimal demonstrations impair performance, and poor generalization, where robots have difficulty adapting learned behaviors to new contexts or unseen variations due to the labor-intensive nature of collecting high-quality data.
Reinforcement learning
(RL) is a learning paradigm where robots learn to perform tasks by interacting with their environment and receiving feedback in the form of rewards or penalties (Li, 2017). Due to its adaptability and effectiveness, RL has been widely used in numerous fields such as navigation (Zhu and Zhang, 2021), manipulation (Gu et al., 2016), locomotion (Margolis et al., 2024), and human-robot interaction (Modares et al., 2015).
However, RL also faces significant challenges, including sample inefficiency, which requires extensive interaction data (Dulac-Arnold et al., 2019), and the difficulty of ensuring safe exploration in physical environments (Thananjeyan et al., 2021). These issues are severe in complex tasks or environments where data collection is forbidden or dangerous (Pecka and Svoboda, 2014). Additionally, RL often struggles with generalizing learned behaviors to new environments and tasks and faces significant sim-to-real challenges. It can also be computationally expensive and sensitive to hyperparameter choices (Dulac-Arnold et al., 2021). Moreover, reward shaping, while potentially accelerating learning, can inadvertently introduce biases or suboptimal policies by misguiding the learning process.
We notice that BLO has also been integrated into RL. For instance, Stadie et al. (2020) formulated the intrinsic rewards as a BLO problem, leading to hyperparameter optimization; Hu et al. (2024) integrated reinforcement and imitation learning under BLO, addressing challenges like coupled behaviors and incomplete information in multi-robot coordination; Zhang et al. (2020a) employed a bilevel actor-critic learning method based on BLO and achieved better convergence than Nash equilibrium in the cooperative environments. However, they are still within the framework of RL and no systematic methods have been proposed.
Meta-learning
has garnered significant attention recently, particularly with its application to training deep neural networks (Bengio et al., 1991; Thrun and Pratt, 2012). Unlike conventional learning approaches, meta-learning leverages datasets and prior knowledge of a task ensemble to rapidly learn new tasks, often with minimal data, as seen in few-shot learning. Numerous meta-learning algorithms have been developed, encompassing metric-based (Koch et al., 2015; Snell et al., 2017; Chen et al., 2020b; Tang et al., 2020; Gharoun et al., 2023), model-based (Munkhdalai and Yu, 2017; Vinyals et al., 2016; Liu et al., 2020b; Co-Reyes et al., 2021), and optimization-based methods (Finn et al., 2017; Nichol and Schulman, 2018; Simon et al., 2020; Singh et al., 2021; Bohdal et al., 2021; Zhang et al., 2024; Choe et al., 2024). Among them, optimization-based approaches are often simpler to implement and achieve state-of-the-art results in a variety of domains.
BLO has served as an algorithmic framework for optimization-based meta-learning. As the most representative optimization-based approach, model-agnostic meta-learning (MAML) (Finn et al., 2017) learns an initialization such that a gradient descent procedure starting from this initial model can achieve rapid adaptation. In subsequent years, numerous works on various MAML variants have been proposed (Grant et al., 2018; Finn et al., 2019, 2018; Jerfel et al., 2018; Mi et al., 2019; Liu et al., 2019; Rothfuss et al., 2019; Foerster et al., 2018; Baik et al., 2020b; Raghu et al., 2019; Bohdal et al., 2021; Zhou et al., 2021; Baik et al., 2020a; Abbas et al., 2022; Kang et al., 2023; Zhang et al., 2024; Choe et al., 2024). Among them, Raghu et al. (2019) presented an efficient MAML variant named ANIL, which adapts only a subset of the neural network parameters. Finn et al. (2019) introduced a follow-the-meta-leader version of MAML for online learning applications. Zhou et al. (2021) improved the generalization performance of MAML by leveraging the similarity information among tasks. Baik et al. (2020a) proposed an improved version of MAML via adaptive learning rate and weight decay coefficients. Kang et al. (2023) proposed geometry-adaptive pre-conditioned gradient descent for efficient meta-learning. Additionally, a group of meta-regularization approaches has been proposed to improve the bias in a regularized empirical risk minimization problem (Denevi et al., 2018b, 2019, a; Rajeswaran et al., 2019; Balcan et al., 2019; Zhou et al., 2019). Furthermore, there is a prevalent embedding-based framework in few-shot learning (Bertinetto et al., 2018; Lee et al., 2019; Ravi and Larochelle, 2016; Snell et al., 2017; Zhou et al., 2018; Goldblum et al., 2020; Denevi et al., 2022; Qin et al., 2023; Jia and Zhang, 2024). The objective of this framework is to learn a shared embedding model applicable across all tasks, with task-specific parameters being learned for each task based on the embedded features.
It is worth noting that IL is proposed to alleviate the drawbacks of the above learning frameworks in robotics. However, IL can also be integrated with any existing learning framework, e.g., by formulating an RL method as the upper-level problem of IL, although this is beyond the scope of this paper.
2.3 Neuro-symbolic Learning
As previously mentioned, the field of NeSy learning lacks a consensus on a rigorous definition. One consequence is that the literature on NeSy learning is scarce and lacks systematic methodologies. Therefore, we will briefly discuss two major categories: logical reasoning and physics-infused networks. This will encompass scenarios where symbols represent either discrete signals, such as logical constructs, or continuous signals, such as physical attributes. We will address other related work in the context of the five robot autonomy examples within their respective sections.
Logical reasoning
aims to inject interpretable and deterministic logical rules into neural networks (Serafini and Garcez, 2016; Riegel et al., 2020; Xie et al., 2019; Ignatiev et al., 2018). Some previous works directly obtained such knowledge from human experts (Xu et al., 2018; Xie et al., 2019, 2021; Manhaeve et al., 2018; Riegel et al., 2020; Yang et al., 2020) or from an oracle for controllable deduction in neural networks (Mao et al., 2018; Wang et al., 2022; Hsu et al., 2023); these are referred to as deductive methods. Representative works include DeepProbLog (Manhaeve et al., 2018), logical neural network (Riegel et al., 2020), and semantic loss (Xu et al., 2018). Despite their success, deductive methods require structured formal symbolic knowledge from humans, which is not always available. Besides, they still struggle to scale to larger and more complex problems. In contrast, inductive approaches induce the structured symbolic representation for semi-supervised efficient network learning. One popular strategy is based on forward-searching algorithms (Li et al., 2020b, 2022b; Evans and Grefenstette, 2018; Sen et al., 2022), which are time-consuming and hard to scale up. Others borrow gradient-based neural networks to conduct rule induction, such as SATNet (Wang et al., 2019), NeuralLP (Yang et al., 2017), and neural logic machine (NLM) (Dong et al., 2019). In particular, NLM introduced a new network architecture inspired by first-order logic, which shows better compositional generalization capability than vanilla neural nets. However, existing inductive algorithms either work on structured data like knowledge graphs (Yang et al., 2017; Yang and Song, 2019) or only experiment with toy image datasets (Shindo et al., 2023; Wang et al., 2019). We push the limit of this research into robotics with IL, providing real-world applications with high-dimensional image data.
Physics-infused networks
(PINs) integrate physical laws directly into the architecture and training of neural networks (Raissi et al., 2019), aiming to enhance the model’s capability to solve complex scientific, engineering, and robotics problems (Karniadakis et al., 2021). PINs embed known physical principles, such as conservation laws and differential equations, into the network’s loss function (Duruisseaux et al., 2023), constraints, or structure, fostering greater interpretability, generalization, and efficiency (Lu et al., 2021). For example, in fluid dynamics, PINs can leverage the Navier-Stokes equations to guide predictions, ensuring adherence to fundamental flow properties even in data-scarce regions (Sun and Wang, 2020). Similarly, in structural mechanics, variational principles can be used to inform the model about stress and deformation relationships, leading to more accurate structural analysis (Rao et al., 2021).
In robot autonomy, PINs have been applied to various tasks including perception (Guan et al., 2024), planning (Romero et al., 2023), control (Han et al., 2024), etc. By embedding the kinematics and dynamics of robotic systems directly into the learning process, PINs enable robots to predict and respond to physical interactions more accurately, leading to safer and more efficient operations. Methodologies for this include, but are not limited to, embedding physical laws into the network (Zhao et al., 2024), enforcing initial and boundary conditions in the training process (Rao et al., 2021), and designing physical constraint loss functions, e.g., minimizing an energy functional representing the physical system (Guan et al., 2024).
3 Imperative Learning
3.1 Structure
The proposed framework of imperative learning (IL) is illustrated in Figure 1, which consists of three modules, i.e., a neural system, a reasoning engine, and a memory module. Specifically, the neural system extracts high-level semantic attributes from raw sensor data such as images, LiDAR points, IMU measurements, and their combinations. These semantic attributes are subsequently sent to the reasoning engine, a symbolic process represented by physical principles, logical inference, analytical geometry, etc. The memory module stores the robot’s experiences and acquired knowledge, such as data, symbolic rules, and maps about the physical world for either a long-term or short-term period. Additionally, the reasoning engine performs a consistency check with the memory, which will update the memory or make necessary self-corrections. Intuitively, this design has the potential to combine the expressive feature extraction capabilities from the neural system, the interpretability and the generalization ability from the reasoning engine, and the memorability from the memory module into a single framework. We next explain the mathematical rationale to achieve this.
3.2 Formulation
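One of the most important properties of the framework in Figure 1 is that the neural, reasoning, and memory modules can perform reciprocal learning in a self-supervised manner. This is achieved by formulating this framework as a BLO. Denote the neural system as $\mathbf{z} = f(\boldsymbol{\theta}, \mathbf{x})$, where $\mathbf{x}$ represents the sensor measurements, $\boldsymbol{\theta}$ represents the perception-related learnable parameters, and $\mathbf{z}$ represents the neural outputs such as semantic attributes; the reasoning engine as $g(f, M, \boldsymbol{\mu})$ with reasoning-related parameters $\boldsymbol{\mu}$; and the memory system as $M(\boldsymbol{\gamma}, \boldsymbol{\nu})$, where $\boldsymbol{\gamma}$ denotes the perception-related memory parameters (Wang et al., 2021a) and $\boldsymbol{\nu}$ denotes the reasoning-related memory parameters. In this context, our imperative learning (IL) is formulated as a special BLO:
$$\min_{\boldsymbol{\psi} \doteq [\boldsymbol{\theta}^\top,\, \boldsymbol{\gamma}^\top]^\top} \; U\big(f(\boldsymbol{\theta}, \mathbf{x}),\, g(\boldsymbol{\mu}^*),\, M(\boldsymbol{\gamma}, \boldsymbol{\nu}^*)\big), \tag{1a}$$
$$\text{s.t.} \quad \boldsymbol{\phi}^* \in \operatorname*{argmin}_{\boldsymbol{\phi} \doteq [\boldsymbol{\mu}^\top,\, \boldsymbol{\nu}^\top]^\top} \; L\big(f(\boldsymbol{\theta}, \mathbf{x}),\, g(\boldsymbol{\mu}),\, M(\boldsymbol{\gamma}, \boldsymbol{\nu})\big), \tag{1b}$$
$$\qquad \text{s.t.} \quad \xi\big(M(\boldsymbol{\gamma}, \boldsymbol{\nu}),\, \boldsymbol{\mu},\, f(\boldsymbol{\theta}, \mathbf{x})\big) = \text{or} \le 0, \tag{1c}$$
where $\xi$ is a general constraint (either equality or inequality); $U$ and $L$ are the upper-level (UL) and lower-level (LL) cost functions; and $\boldsymbol{\psi} \doteq [\boldsymbol{\theta}^\top, \boldsymbol{\gamma}^\top]^\top$ and $\boldsymbol{\phi} \doteq [\boldsymbol{\mu}^\top, \boldsymbol{\nu}^\top]^\top$ are the stacked UL variables and stacked LL variables, respectively. Alternatively, $U$ and $L$ are also referred to as the neural cost and symbolic cost, respectively. As aforementioned, the term “imperative” is used to denote the passive nature of the learning process: once optimized, the neural system $f$ in the UL cost will be driven to align with the LL reasoning engine $g$ (e.g., logical, physical, or geometrical reasoning process) with constraint $\xi$, so that it can learn to generate logically, physically, or geometrically feasible semantic attributes or predicates. In some applications, $\boldsymbol{\psi}$ and $\boldsymbol{\phi}$ are also referred to as neuron-like and symbol-like parameters, respectively.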

Self-supervised Learning
As presented in Section 1, the formulation of IL is motivated by one important observation: many symbolic reasoning engines, including geometric, physical, and logical reasoning, can be optimized or solved without providing labels. This is evident in logical reasoning such as equation discovery (Billard and Diday, 2002) and A∗ search (Hart et al., 1968), geometrical reasoning such as bundle adjustment (BA) (Agarwal et al., 2010), and physical reasoning such as model predictive control (Kouvaritakis and Cannon, 2016). The IL framework leverages this phenomenon and jointly optimizes the three modules by BLO, which forces the three modules to mutually correct each other. Consequently, all three modules can learn and evolve in a self-supervised manner by observing the world. However, it is worth noting that, although IL is designed for self-supervised learning, it can easily adapt to supervised or weakly supervised learning by involving labels in the UL cost, the LL cost, or both.
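As a concrete illustration of a reasoning engine that needs no labels, the following minimal sketch runs A∗ search on a small occupancy grid. The grid, the unit move cost, and the function name `astar` are illustrative assumptions and are not taken from the experiments in this article; the point is only that the result follows from the map and the cost model, with no ground-truth path supplied.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path length on a grid of 0 (free) / 1 (blocked) cells.

    Returns the number of moves, or None if the goal is unreachable.
    No labels are involved: the answer follows from the map and the cost model.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_cost = {start: 0}
    frontier = [(heuristic(start), 0, start)]  # entries are (g + h, g, cell)
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale queue entry, a cheaper route was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(
                        frontier, (new_cost + heuristic(nxt), new_cost, nxt)
                    )
    return None


# Hypothetical map: the middle row is blocked except for the last column.
toy_map = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path_length = astar(toy_map, (0, 0), (2, 0))
```

In an IL setting, a cost map like `toy_map` would plausibly be predicted by the neural module rather than given, and the search result would feed back into the UL cost.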
Memory
The memory system within the IL framework is a general component that can retain and retrieve information online. Specifically, it can be any structure that is associated with write and read operations to retain and retrieve data (Wang et al., 2021a). A memory can be a neural network, where information is “written” into the parameters and is “read” through a set of math operations or implicit mapping, e.g., a neural radiance field (NeRF) model (Mildenhall et al., 2021). It can also be a structure with explicit physical meanings, such as a map created online, a set of logical rules induced online, or even a dataset collected online. It can also be the memory system of LLMs in textual form, such as retrieval-augmented generation (RAG) (Lewis et al., 2020), which writes, reads, and manages symbolic knowledge.
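The write/read abstraction described above can be sketched as a small interface. Everything in this snippet is hypothetical and chosen for illustration: the `Memory` protocol, the `OccupancyMapMemory` class, and the blending rule in `write` are assumptions, not part of the IL codebase.

```python
from typing import Dict, Optional, Protocol, Tuple

Cell = Tuple[int, int]


class Memory(Protocol):
    """Any structure exposing write and read operations can act as a memory."""

    def write(self, key: Cell, value: float) -> None: ...

    def read(self, key: Cell) -> Optional[float]: ...


class OccupancyMapMemory:
    """Explicit memory with physical meaning: a sparse 2D map built online."""

    def __init__(self) -> None:
        self._cells: Dict[Cell, float] = {}

    def write(self, key: Cell, value: float) -> None:
        # Blend the new observation with the stored one
        # (exponential smoothing with weight 0.5).
        old = self._cells.get(key)
        self._cells[key] = value if old is None else 0.5 * (old + value)

    def read(self, key: Cell) -> Optional[float]:
        # Unobserved cells return None rather than a default occupancy.
        return self._cells.get(key)


memory: Memory = OccupancyMapMemory()
memory.write((1, 2), 1.0)  # first observation: occupied
memory.write((1, 2), 0.0)  # second observation: free
occupancy = memory.read((1, 2))
```

A neural memory would satisfy the same interface, with `write` implemented as a parameter update and `read` as a forward pass.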
3.3 Optimization
BLO has been explored in frameworks such as meta-learning (Finn et al., 2017), hyperparameter optimization (Franceschi et al., 2018), and reinforcement learning (Hong et al., 2020). However, most of the theoretical analyses have primarily focused on their applicability to data-driven models, where first-order gradient descent (GD) is frequently employed (Ji et al., 2021; Gould et al., 2021). Nevertheless, many reasoning tasks present unique challenges that make GD less effective. For example, geometrical reasoning like BA requires second-order optimizers (Fu et al., 2024) such as Levenberg-Marquardt (LM) (Marquardt, 1963); multi-robot routing needs combinatorial optimization over discrete variables (Ren et al., 2023a). Employing such LL optimizations within the BLO framework introduces extreme complexities and challenges, which are still underexplored (Ji et al., 2021). Therefore, we will first delve into general BLO and then provide practical examples covering distinct challenges of LL optimizations in our IL framework.
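The solution to IL (1) mainly involves solving the UL parameters $\boldsymbol{\theta}$ and $\boldsymbol{\gamma}$ and the LL parameters $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$. Intuitively, the UL parameters, which are often neuron-like weights, can be updated with the gradients of the UL cost $U$:
$$\begin{aligned}
\nabla_{\boldsymbol{\theta}} U &= \frac{\partial U}{\partial f}\frac{\partial f}{\partial \boldsymbol{\theta}} + \frac{\partial U}{\partial g}\frac{\partial g}{\partial \boldsymbol{\mu}^*}\frac{\partial \boldsymbol{\mu}^*}{\partial \boldsymbol{\theta}} + \frac{\partial U}{\partial M}\frac{\partial M}{\partial \boldsymbol{\nu}^*}\frac{\partial \boldsymbol{\nu}^*}{\partial \boldsymbol{\theta}}, \\
\nabla_{\boldsymbol{\gamma}} U &= \frac{\partial U}{\partial M}\frac{\partial M}{\partial \boldsymbol{\gamma}} + \frac{\partial U}{\partial g}\frac{\partial g}{\partial \boldsymbol{\mu}^*}\frac{\partial \boldsymbol{\mu}^*}{\partial \boldsymbol{\gamma}} + \frac{\partial U}{\partial M}\frac{\partial M}{\partial \boldsymbol{\nu}^*}\frac{\partial \boldsymbol{\nu}^*}{\partial \boldsymbol{\gamma}}.
\end{aligned} \tag{2}$$
The key challenge of computing (2) is the implicit differentiation terms, which take the forms of
$$\frac{\partial \boldsymbol{\mu}^*}{\partial \boldsymbol{\theta}}, \quad \frac{\partial \boldsymbol{\nu}^*}{\partial \boldsymbol{\theta}}, \quad \frac{\partial \boldsymbol{\mu}^*}{\partial \boldsymbol{\gamma}}, \quad \frac{\partial \boldsymbol{\nu}^*}{\partial \boldsymbol{\gamma}}, \tag{3}$$
where $\boldsymbol{\mu}^*$ and $\boldsymbol{\nu}^*$ are the solutions to the LL problem. For simplicity, we can write (2) in the matrix form
$$\nabla_{\boldsymbol{\psi}} U = \frac{\partial U}{\partial \boldsymbol{\psi}} + \frac{\partial U}{\partial \boldsymbol{\phi}^*}\frac{\partial \boldsymbol{\phi}^*}{\partial \boldsymbol{\psi}}, \tag{4}$$
where, treating gradients as row vectors, $\frac{\partial U}{\partial \boldsymbol{\psi}} = \left[\frac{\partial U}{\partial f}\frac{\partial f}{\partial \boldsymbol{\theta}},\; \frac{\partial U}{\partial M}\frac{\partial M}{\partial \boldsymbol{\gamma}}\right]$ and $\frac{\partial U}{\partial \boldsymbol{\phi}^*} = \left[\frac{\partial U}{\partial g}\frac{\partial g}{\partial \boldsymbol{\mu}^*},\; \frac{\partial U}{\partial M}\frac{\partial M}{\partial \boldsymbol{\nu}^*}\right]$, and the Jacobian $\frac{\partial \boldsymbol{\phi}^*}{\partial \boldsymbol{\psi}}$ collects the four terms in (3). There are typically two methods for calculating those gradients, i.e., unrolled differentiation and implicit differentiation. We summarize a generic framework incorporating both methods in Algorithm 1 to provide a clearer understanding.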
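To see why the implicit term in the matrix form of the UL gradient matters, consider a scalar toy instance. This example is an assumption made purely for illustration (the costs, the constants $a$, $b$, $\lambda$, and the scalar symbols $\psi$ for the UL variable and $\phi$ for the LL variable are not taken from the tasks in this article):

```latex
\begin{align*}
  L(\phi;\psi) &= \tfrac{1}{2}(\phi - a\psi)^2
    &&\Rightarrow\quad \phi^*(\psi) = a\psi,
    \qquad \frac{\partial \phi^*}{\partial \psi} = a, \\
  U(\psi,\phi^*) &= \tfrac{1}{2}(\phi^* - b)^2 + \tfrac{\lambda}{2}\psi^2
    &&\Rightarrow\quad
    \frac{\partial U}{\partial \psi} = \lambda\psi,
    \qquad \frac{\partial U}{\partial \phi^*} = \phi^* - b, \\
  \nabla_{\psi} U
    &= \frac{\partial U}{\partial \psi}
     + \frac{\partial U}{\partial \phi^*}\frac{\partial \phi^*}{\partial \psi}
     = \lambda\psi + a\,(a\psi - b)
    &&\Rightarrow\quad \psi_{\mathrm{opt}} = \frac{ab}{\lambda + a^2}.
\end{align*}
```

Dropping the implicit term would leave only $\lambda\psi$ and drive $\psi$ to zero, which ignores the LL solution entirely; the chain through $\partial\phi^*/\partial\psi$ is what couples the two levels.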
3.3.1 Unrolled Differentiation
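The method of unrolled differentiation is an easy-to-implement solution for BLO problems. It needs automatic differentiation (AutoDiff) through the LL optimization. Specifically, given an initialization $\phi_0 = \Phi_0(\psi)$ for the LL variable at step $t = 0$, the iterative process of unrolled optimization is
$$
\phi_t = \Phi_t(\phi_{t-1}; \psi), \quad t = 1, \dots, T,
\tag{5}
$$
where $\Phi_t$ denotes an updating scheme based on a specific LL problem and $T$ is the number of iterations. One popular updating scheme is based on gradient descent:
$$
\Phi_t(\phi_{t-1}; \psi) = \phi_{t-1} - \eta \frac{\partial L(\phi_{t-1}; \psi)}{\partial \phi_{t-1}},
\tag{6}
$$
where $\eta$ is a learning rate and the term $\partial L / \partial \phi_{t-1}$ can be computed from AutoDiff. Therefore, we can obtain $\nabla_{\psi} U(\psi)$ by substituting $\phi_T$ approximately for $\phi^*$, and the full unrolled system is defined as
$$
\Phi(\psi) = (\Phi_T \circ \cdots \circ \Phi_1 \circ \Phi_0)(\psi),
\tag{7}
$$
where the symbol $\circ$ denotes function composition. Therefore, we instead only need to consider an alternative:
$$
\min_{\psi} \; U(\psi, \Phi(\psi)),
\tag{8}
$$
where $\partial \Phi(\psi) / \partial \psi$ can be computed via AutoDiff instead of directly calculating the four terms in $\partial \phi^* / \partial \psi$ of (3).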
(9) |
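As a minimal sketch of unrolled differentiation, the following toy example unrolls gradient descent on a scalar LL cost and propagates the derivative of each step with respect to the UL variable by hand, which is what AutoDiff would do automatically. The quadratic costs, the constants `a` and `lam`, and all function names are assumptions made for illustration and are not taken from the paper.

```python
def lower_step(phi, psi, a, eta):
    """One GD step on the toy LL cost L(phi; psi) = 0.5 * (phi - a * psi) ** 2."""
    return phi - eta * (phi - a * psi)


def upper_cost(psi, phi, lam):
    """Toy UL cost U(psi, phi) = 0.5 * (phi - 1) ** 2 + 0.5 * lam * psi ** 2."""
    return 0.5 * (phi - 1.0) ** 2 + 0.5 * lam * psi ** 2


def unrolled_cost_and_grad(psi, a=2.0, lam=0.1, eta=0.3, steps=50):
    """Return U(psi, phi_T) and its total derivative w.r.t. psi via unrolling."""
    phi, dphi_dpsi = 0.0, 0.0  # phi_0 is a constant, so d(phi_0)/d(psi) = 0
    for _ in range(steps):
        # Chain rule through one GD step: d(phi_t)/d(psi).
        dphi_dpsi = (1.0 - eta) * dphi_dpsi + eta * a
        phi = lower_step(phi, psi, a, eta)
    # Total derivative = direct part + (dU/dphi_T) * (dphi_T/dpsi).
    grad = lam * psi + (phi - 1.0) * dphi_dpsi
    return upper_cost(psi, phi, lam), grad


cost, grad = unrolled_cost_and_grad(0.7)
```

The returned gradient can be checked against a finite difference of `unrolled_cost_and_grad` itself, which is a convenient sanity test before moving to a real AutoDiff framework.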
3.3.2 Implicit Differentiation
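The method of implicit differentiation directly computes the derivatives $\partial \phi^* / \partial \psi$. We next introduce a generic framework for the implicit differentiation algorithm by solving a linear system from the first-order optimality condition of the LL problem, while the exact solutions depend on specific tasks and will be illustrated using several independent examples in Section 4.
Assume $\phi_T$ is the output of $T$ steps of a generic optimizer of the LL problem (1b), possibly under constraints (1c). Then the approximated UL gradient (2) is
$$
\nabla_{\psi} U \approx \frac{\partial U}{\partial \psi} + \frac{\partial U}{\partial \phi_T}\frac{\partial \phi_T}{\partial \psi}.
\tag{11}
$$
The derivatives $\partial \phi_T / \partial \psi$ are then obtained by solving implicit equations from the optimality conditions of the LL problem, i.e., $\partial \mathfrak{L} / \partial \phi_T = 0$, where $\mathfrak{L}$ is a generic function defining the LL optimality condition. Specifically, taking the derivative of the equation $\partial \mathfrak{L} / \partial \phi_T = 0$ with respect to $\psi$ on both sides leads to
$$
\frac{\partial^2 \mathfrak{L}}{\partial \phi_T \partial \psi} + \frac{\partial^2 \mathfrak{L}}{\partial \phi_T \partial \phi_T}\frac{\partial \phi_T}{\partial \psi} = 0.
\tag{12}
$$
Solving the equation gives us the implicit gradients as
$$
\frac{\partial \phi_T}{\partial \psi} = -\left(\frac{\partial^2 \mathfrak{L}}{\partial \phi_T \partial \phi_T}\right)^{-1}\frac{\partial^2 \mathfrak{L}}{\partial \phi_T \partial \psi}.
\tag{13}
$$
This means we obtain the implicit gradients at the cost of an inversion of the Hessian matrix $H = \partial^2 \mathfrak{L} / \partial \phi_T \partial \phi_T$.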
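In practice, the Hessian matrix can be too big to calculate and store, but we can bypass it by solving a linear system. (Footnote 2: For instance, assume both the UL and LL costs have a network with merely 1 million ($10^6$) parameters stored as 32-bit floating-point numbers. Each network then needs only $10^6 \times 4$ bytes $= 4$ MB of storage, while its Hessian matrix needs $(10^6)^2 \times 4$ bytes $= 4$ TB. A Hessian matrix of this size cannot even be stored explicitly in the memory of a low-power computer, so directly computing its inverse is even less practical.) Substituting (13) into (4), we have the UL gradient
$$
\begin{aligned}
\nabla_{\psi} U &= \frac{\partial U}{\partial \psi} - \frac{\partial U}{\partial \phi_T} H^{-1} \frac{\partial^2 \mathfrak{L}}{\partial \phi_T \partial \psi} \qquad &\text{(14a)} \\
&= \frac{\partial U}{\partial \psi} - q^\top \frac{\partial^2 \mathfrak{L}}{\partial \phi_T \partial \psi}. \qquad &\text{(14b)}
\end{aligned}
$$
Therefore, instead of calculating the Hessian inversion, we can solve a linear system $Hq = v$ for $q$ by optimizing
$$
q^* = \arg\min_{q} Q(q) = \arg\min_{q} \left( \tfrac{1}{2} q^\top H q - q^\top v \right),
\tag{15}
$$
where $\partial U / \partial \phi_T$ is denoted as $v^\top$ for simplicity. The linear system (15) can be solved without explicitly calculating and storing the Hessian matrix by solvers such as conjugate gradient (Hestenes and Stiefel, 1952) or gradient descent. For example, since the gradient is $\partial Q(q) / \partial q = Hq - v$, the updating scheme based on the gradient descent algorithm is:
$$
q_k = q_{k-1} - \eta \left( H q_{k-1} - v \right).
\tag{16}
$$
This updating scheme is efficient since $Hq$ can be computed using the fast Hessian-vector product, i.e., a Hessian-vector product is the gradient of a gradient-vector product:
$$
H q = \frac{\partial^2 \mathfrak{L}}{\partial \phi_T \partial \phi_T} \cdot q = \frac{\partial}{\partial \phi_T}\left( \frac{\partial \mathfrak{L}}{\partial \phi_T} \cdot q \right),
\tag{17}
$$
where $\frac{\partial \mathfrak{L}}{\partial \phi_T} \cdot q$ is a scalar. This means the Hessian is not explicitly computed or stored. We summarize this implicit differentiation in Algorithm 2. Note that the optimality condition depends on the LL problems. In Section 4, we will show that it can either be the derivative of the LL cost function for an unconstrained problem such as (27) or the Lagrangian function for a constrained problem such as (39).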
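The procedure above can be sketched on a small quadratic problem. The example below solves the linear system by the gradient descent update (16) using only Hessian-vector products, and then assembles the UL gradient as in (14b). Since plain Python has no AutoDiff, the Hessian-vector product is approximated here by a central difference of the LL gradient rather than by the gradient of a gradient-vector product. The matrices `A` and `B`, the quadratic costs, and all function names are illustrative assumptions.

```python
A = [[3.0, 1.0], [1.0, 2.0]]  # Hessian of the toy LL cost (symmetric positive definite)
B = [[1.0, 0.0], [0.5, 1.0]]  # coupling between the LL variable phi and the UL variable psi


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def grad_L(phi, psi):
    """Gradient w.r.t. phi of the toy LL cost L = 0.5 phi^T A phi - phi^T B psi."""
    return [a - b for a, b in zip(matvec(A, phi), matvec(B, psi))]


def hvp(phi, psi, q, eps=1e-4):
    """Hessian-vector product H q from two gradient calls; H is never formed."""
    plus = grad_L([p + eps * qi for p, qi in zip(phi, q)], psi)
    minus = grad_L([p - eps * qi for p, qi in zip(phi, q)], psi)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(plus, minus)]


def solve_q(phi, psi, v, eta=0.2, steps=500):
    """Gradient descent on Q(q) = 0.5 q^T H q - q^T v, i.e., q <- q - eta (H q - v)."""
    q = [0.0] * len(v)
    for _ in range(steps):
        Hq = hvp(phi, psi, q)
        q = [qi - eta * (h - vi) for qi, h, vi in zip(q, Hq, v)]
    return q


def hypergradient(psi, target, eta=0.2, steps=500):
    """UL gradient of the toy cost U = 0.5 ||phi - target||^2 via implicit differentiation."""
    phi = [0.0, 0.0]
    for _ in range(steps):  # solve the LL problem with plain gradient descent
        phi = [p - eta * g for p, g in zip(phi, grad_L(phi, psi))]
    v = [p - t for p, t in zip(phi, target)]  # v = dU/dphi
    q = solve_q(phi, psi, v, eta, steps)
    # dU/dpsi is zero here and the mixed derivative d^2 L / (dphi dpsi) equals -B,
    # so the UL gradient reduces to q^T B.
    return [sum(q[i] * B[i][j] for i in range(2)) for j in range(2)]


grad = hypergradient([1.0, 2.0], [0.0, 0.0])
```

In a real system the central difference in `hvp` would be replaced by an AutoDiff Hessian-vector product, and the fixed step count by a convergence test.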
Approximation
Implicit differentiation is complicated to implement, but one approximation is to ignore the implicit components and use only the direct part $\partial U / \partial \psi$. This is equivalent to treating the solution of the LL optimization as a constant in the UL problem. Such an approximation is more efficient but introduces an error term
$$
\epsilon = \frac{\partial U}{\partial \phi^*}\frac{\partial \phi^*}{\partial \psi}.
\tag{18}
$$
Nevertheless, it is useful when the implicit derivatives contain products of small second-order derivatives, which again depends on the specific LL problems.
It is worth noting that in the framework of IL (1), we assign perception-related parameters to the UL neural cost (1a) and reasoning-related parameters to the LL symbolic cost (1b). This design stems from two key considerations. First, it avoids involving large Jacobian and Hessian matrices for neuron-like variables such as $\theta$. Given that real-world robot applications often involve an immense number of neuron-like parameters (e.g., a simple neural network might possess millions), placing them in the UL cost reduces the complexity involved in computing the implicit gradients required by the LL cost (1b).
Second, perception-related (neuron-like) parameters are usually updated using gradient descent algorithms such as SGD (Sutskever et al., 2013). However, such simple first-order optimization methods are often inadequate for LL symbolic reasoning, e.g., geometric problems (Wang et al., 2023) usually need second-order optimizers. Therefore, separating neuron-like parameters from reasoning-related parameters makes the BLO in IL easier to solve and analyze. However, this again depends on the LL tasks.
4 Applications and Examples
To showcase the effectiveness of IL, we will introduce five distinct examples in different fields of robot autonomy. These examples, along with their respective LL problem and optimization methods, are outlined in Table 1. Specifically, they are selected to cover distinct tasks, including path planning, rule induction, optimal control, visual odometry, and multi-agent routing, to showcase different optimization techniques required by the LL problems, including closed-form solution, first-order optimization, constrained optimization, second-order optimization, and discrete optimization, respectively. We will explore several memory structures mentioned in Section 3.2.
Additionally, since IL is a self-supervised learning framework consisting of three primary components, we have three learning types: (A) given known (pre-trained or human-defined) reasoning engines such as logical reasoning, physical principles, and geometrical analytics, robots can learn a logic-, physics-, or geometry-aware neural perception system, respectively, in a self-supervised manner; (B) given neural perception systems such as a vision foundation model (Kirillov et al., 2023), robots can discover the world rules, e.g., traffic rules, and then apply the rules to future events; and (C) given a memory system (e.g., experience, world rules, or maps), robots can simultaneously update the neural system and reasoning engine so that they can adapt to novel environments with a new set of rules. The five examples will also cover all three learning types.
4.1 Closed-form Solution
We first illustrate the scenarios in which the LL cost $L$ in (1b) has closed-form solutions. In this case, one can directly optimize the UL cost by solving
$$
\min_{\theta, \gamma} \; U\big(f(\theta, x),\, g(\mu^*(\theta, \gamma)),\, M(\gamma, \nu^*(\theta, \gamma))\big),
\tag{19}
$$
where the LL solutions $\mu^*(\theta, \gamma)$ and $\nu^*(\theta, \gamma)$, which contain the implicit components $\partial \mu^* / \partial \theta$, $\partial \mu^* / \partial \gamma$, $\partial \nu^* / \partial \theta$, and $\partial \nu^* / \partial \gamma$, can be calculated directly due to the closed form of $\mu^*$ and $\nu^*$. As a result, standard gradient-based algorithms can be applied in updating $\theta$ and $\gamma$ with gradients given by (2). In this case, there is no approximation error induced by the LL minimization, in contrast to existing widely-used implicit and unrolled differentiation methods, which require a sufficiently small or decreasing LL optimization error to guarantee convergence (Ji et al., 2021).
One possible problem in computing (3) is that the implicit components can contain expensive matrix inversions. Let us consider a simplified quadratic case in (1b) with . The closed-form solution takes the form of , which can induce a computationally expensive inversion of a possibly large matrix . To address this problem, one can again formulate the matrix-inverse-vector computation as solving a linear system using any optimization methods with efficient matrix-vector products.
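As a minimal sketch of how a matrix-inverse-vector product can be obtained from matrix-vector products alone, the following conjugate gradient routine solves $Ax = b$ without forming an inverse. It assumes the matrix is symmetric positive definite, which holds for a strictly convex quadratic cost but is an assumption for any other case. The function name and the example matrix are illustrative.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.5]
x = conjugate_gradient(lambda p: [sum(a * pi for a, pi in zip(row, p)) for row in A], b)
```

Only the callable `matvec` touches the matrix, so the same routine applies when the matrix is available only implicitly, e.g., through an AutoDiff product.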
Many symbolic costs can be effectively addressed through closed-form solutions. For example, both the linear quadratic regulator (LQR) (Shaiju and Petersen, 2008) and Dijkstra’s algorithm (Dijkstra, 1959) yield a deterministic optimal solution. To demonstrate the effectiveness of IL for closed-form solutions, we next present two path planning examples that use the neural model to reduce the search and sampling space of symbolic optimization.
Example 1: Path Planning
Path planning is a computational process to determine a path from a starting point to a destination within an environment. It typically involves navigating around obstacles and may also optimize certain objectives such as the shortest distance, minimal energy use, or maximum safety. Path planning algorithms are generally categorized into global planning, which utilizes global maps of the environment, and local planning, which relies on real-time sensory data. We will enhance two widely-used algorithms through IL: A∗ search for global planning and cubic spline for local planning, both of which offer closed-form solutions.
Example 1.A: Global Path Planning
Background
The A∗ algorithm is a graph search technique that aims to find the global shortest feasible path between two nodes (Hart et al., 1968). Specifically, A∗ selects the path passing through the next node that minimizes
(20) |
where is the cost from the start node to and is a heuristic function that predicts the cheapest path cost from to the goal. The heuristic cost can take the form of various metrics such as the Euclidean and Chebyshev distances. It can be proved that if the heuristic function is admissible and monotone, i.e., , where is the cost from to , A∗ is guaranteed to find the optimal path without searching any node more than once (Dechter and Pearl, 1985). Due to this optimality, A∗ became one of the most widely used methods in path planning (Paden et al., 2016; Smith et al., 2012; Algfoor et al., 2015).
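For readers less familiar with A∗, a minimal sketch on a 4-connected grid is given below. It uses the textbook notation in which each node is ranked by the sum of its cost-so-far and a heuristic, with the Manhattan distance as an admissible and monotone heuristic for unit step costs. The grid encoding, the function name, and the returned expansion count are assumptions made for illustration, not the implementation used in this work.

```python
import heapq


def a_star(grid, start, goal):
    """A* on a 4-connected grid with unit step cost; grid[r][c] == 1 is an obstacle.

    Returns (path, number_of_expanded_nodes); path is None if the goal is unreachable.
    """
    def h(n):  # Manhattan distance heuristic
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (cost-so-far + heuristic, cost-so-far, node)
    g_cost = {start: 0}
    parent = {start: None}
    closed = set()
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node in closed:
            continue  # stale heap entry
        closed.add(node)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], len(closed)
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = new_g
                    parent[nxt] = node
                    heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None, len(closed)


grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path, expanded = a_star(grid, (0, 0), (2, 0))
```

The second return value counts expanded nodes, which is the quantity a learned heuristic would aim to reduce.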
However, A∗ encounters significant limitations, particularly in its requirement to explore a large number of potential nodes. This exhaustive search process can be excessively time-consuming, especially for low-power robot systems. To address this, recent advancements showed significant potential for enhancing efficiency by predicting a more accurate heuristic cost map using data-driven methods (Choudhury et al., 2018; Yonetani et al., 2021a; Kirilenko et al., 2023). Nevertheless, these algorithms utilize optimal paths as training labels, which face challenges in generalization, leading to a bias towards the patterns observed in the training datasets.

Approach
To address this limitation, we leverage the IL framework to remove dependence on path labeling. Specifically, we utilize a neural network to estimate the value of each node, which can be used to reduce the search space. Subsequently, we integrate a differentiable A∗ module, serving as the symbolic reasoning engine in (1b), to determine the most efficient path. This results in an effective framework depicted in Figure 2, which we refer to as imperative A∗ (iA∗) algorithm. Notably, the iA∗ framework can operate on a self-supervised basis inherent in the IL framework, eliminating the need for annotated labels.
Specifically, the iA∗ algorithm is to minimize
(21a) | ||||
s.t. | (21b) |
where denotes the inputs including the map, start node, and goal node, is the set of paths in the solution space, is the accumulated values associated with a path in , is the optimal path, is the accumulated values associated with the optimal path , and is the intermediate and maps.
The lower-level cost is defined as the path length (cost)
(22) |
where the optimal path is derived from the A∗ reasoning given the node cost maps . Thanks to its closed-form solution, this LL optimization is directly solvable. The UL optimization focuses on updating the network parameter to generate the map. Given the impact of the map on the search area of A∗, the UL cost is formulated as a combination of the search area and the path length . This is mathematically represented as:
(23) |
where and are the weights to adjust the two terms, and is the search area computed from the accumulated map with the optimal path . In the experiments, we define the cost as the Euclidean distance. The input is represented as a three-channel 2-D tensor, with each channel dedicated to a specific component: the map, the start node, and the goal node. Specifically, the start and goal node channels are represented as a one-hot 2-D tensor, where the indices of the “ones” indicate the locations of the start and goal node, respectively. This facilitates a more nuanced and effective representation of path planning problems.
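The three-channel input described above can be sketched as follows, using nested lists in place of a tensor. The function name and the row-column indexing convention are assumptions for illustration.

```python
def make_planner_input(occupancy, start, goal):
    """Stack the map, a one-hot start channel, and a one-hot goal channel.

    occupancy is a 2-D list (1 = obstacle, 0 = free); start and goal are (row, col).
    The result has shape [3][rows][cols].
    """
    rows, cols = len(occupancy), len(occupancy[0])

    def one_hot(cell):
        return [[1.0 if (r, c) == cell else 0.0 for c in range(cols)] for r in range(rows)]

    map_channel = [[float(v) for v in row] for row in occupancy]
    return [map_channel, one_hot(start), one_hot(goal)]


x = make_planner_input([[0, 1], [0, 0]], start=(0, 0), goal=(1, 1))
```

In practice the nested lists would be converted to a tensor of shape (3, H, W) before being passed to the network.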
Optimization
As shown in (2), once the gradient calculation is completed, we backpropagate the cost directly to the network . This process is facilitated by the closed-form solution of the A∗ algorithm. For the sake of simplicity, we employ the differentiable A∗ algorithm as introduced by Yonetani et al. (2021a). It transforms the node selection into an argsoftmax operation and reinterprets node expansion as a series of convolutions, leveraging efficient implementations and AutoDiff in PyTorch (Paszke et al., 2019).
Intuitively, this optimization process involves iterative adjustments. On one hand, the map enables the A∗ algorithm to efficiently identify the optimal path, but within a confined search area. On the other hand, the A∗ algorithm’s independence from labels allows further refinement of the network. This is achieved by backpropagating the search area and path length costs through the differentiable A∗ reasoning. This mutual connection encourages the network to generate increasingly smaller search areas over time, enhancing overall efficiency. As a result, the network tends to focus on more relevant areas, marked by reduced lower-level reasoning costs, improving the overall search quality.
Example 1.B: Local Path Planning
Background
End-to-end local planning, which integrates perception and planning within a single model, has recently attracted considerable interest, particularly for its potential to enable efficient inference through data-driven methods such as reinforcement learning (Hoeller et al., 2021; Wijmans et al., 2019; Lee et al., 2024; Ye et al., 2021) and imitation learning (Sadat et al., 2020; Shah et al., 2023b; Loquercio et al., 2021). Despite these advancements, significant challenges persist. Reinforcement learning-based methods often suffer from sample inefficiency and difficulties in directly processing raw, dense sensor inputs, such as depth images. Without sufficient guidance during training, reinforcement learning struggles to converge on an optimal policy that generalizes well across various scenarios or environments. Conversely, imitation learning relies heavily on the availability and quality of labeled trajectories. Obtaining these labeled trajectories is particularly challenging for robotic systems that operate under diverse dynamics models, thereby limiting their broad applicability in flexible robotic systems.

Approach
To address these challenges, we introduce IL to local planning and refer to it as imperative local planning (iPlanner), as depicted in Figure 3. Instead of predicting a continuous trajectory directly, iPlanner uses the network to generate sparse waypoints, which are then interpolated using a trajectory optimization engine based on a cubic spline. This approach leverages the strengths of both neural and symbolic modules: neural networks excel at dynamic obstacle detection, while symbolic modules optimize multi-step navigation strategies under dynamics. By forcing the network to output sparse waypoints rather than continuous trajectories, iPlanner effectively combines the advantages of both modules. Specifically, iPlanner can be formulated as:
(24a) | ||||
s.t. | (24b) |
where denotes the system inputs including a local goal position and the sensor measurements such as depth images, is the parameters of a network , denotes the generated waypoints, which are subsequently optimized by the path optimizer , represents the set of valid paths within the constrained space . The optimized path, , acts as the optimal solution to the LL cost , which is defined by tracking the intermediate waypoints and the overall continuity and smoothness of the path:
(25) |
where measures the path continuity based on the first-order derivative of the path and calculates the path smoothness based on the second-order derivative. Specifically, we employ cubic spline interpolation, which has a closed-form solution, to ensure this continuity and smoothness. This is also essential for generating feasible and efficient paths. On the other hand, the UL cost is defined as:
(26) |
where measures the distance from the endpoint of the generated path to the goal , assessing the alignment of the planned path with the desired destination; represents the motion loss, encouraging the planning of shorter paths to improve overall movement efficiency; quantifies the obstacle cost, utilizing information from a pre-built Euclidean signed distance fields (ESDF) map to evaluate the path safety; and , , and are hyperparameters, allowing for adjustments in the planning strategy based on specific performance objectives.
Optimization
During training, we leverage the AutoDiff capabilities provided by PyTorch (Paszke et al., 2019) to solve the BLO in IL. For the LL trajectory optimization, we adopt the cubic spline interpolation implementation provided by PyPose (Wang et al., 2023). It supports differentiable batched closed-form solutions, enabling a faster training process. Additionally, the ESDF cost map is convolved with a Gaussian kernel, yielding a smoother optimization landscape. This setup allows the loss to be directly backpropagated to update the network parameters in a single step, rather than requiring iterative adjustments. As a result, the UL network and the trajectory optimization can mutually improve each other, enabling a self-supervised learning process.
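To illustrate why spline interpolation admits a closed-form solution, the sketch below interpolates sparse 2-D waypoints with a natural cubic spline on uniformly spaced knots. The only computation is one tridiagonal linear solve per coordinate. This is a plain Python illustration under the natural boundary condition (zero second derivative at both ends), and it may differ from the spline variant and the batched implementation used in the experiments. All function names are hypothetical.

```python
def natural_spline_second_derivs(y):
    """Second derivatives at unit-spaced knots for a natural cubic spline through y."""
    n = len(y) - 1
    M = [0.0] * (n + 1)  # natural boundary: M[0] = M[n] = 0
    if n >= 2:
        # Interior equations: M[i-1] + 4 M[i] + M[i+1] = 6 (y[i-1] - 2 y[i] + y[i+1]).
        rhs = [6.0 * (y[i - 1] - 2.0 * y[i] + y[i + 1]) for i in range(1, n)]
        diag = [4.0] * (n - 1)
        for k in range(1, n - 1):  # forward elimination (Thomas algorithm)
            w = 1.0 / diag[k - 1]
            diag[k] -= w
            rhs[k] -= w * rhs[k - 1]
        sol = [0.0] * (n - 1)
        sol[-1] = rhs[-1] / diag[-1]
        for k in range(n - 3, -1, -1):  # back substitution
            sol[k] = (rhs[k] - sol[k + 1]) / diag[k]
        M[1:n] = sol
    return M


def eval_spline(y, M, t):
    """Evaluate the spline at parameter t in [0, len(y) - 1]."""
    n = len(y) - 1
    i = min(max(int(t), 0), n - 1)
    s = t - i
    return (M[i] * (1.0 - s) ** 3 / 6.0 + M[i + 1] * s ** 3 / 6.0
            + (y[i] - M[i] / 6.0) * (1.0 - s) + (y[i + 1] - M[i + 1] / 6.0) * s)


def interpolate_waypoints(waypoints, samples_per_segment=10):
    """Densify sparse (x, y) waypoints into a smooth path."""
    xs = [p[0] for p in waypoints]
    ys = [p[1] for p in waypoints]
    Mx, My = natural_spline_second_derivs(xs), natural_spline_second_derivs(ys)
    n = len(waypoints) - 1
    ts = [k / samples_per_segment for k in range(n * samples_per_segment + 1)]
    return [(eval_spline(xs, Mx, t), eval_spline(ys, My, t)) for t in ts]


dense_path = interpolate_waypoints([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)], samples_per_segment=2)
```

Because every operation is a differentiable arithmetic expression of the waypoints, the same computation expressed in an AutoDiff framework lets a path-level loss be backpropagated to the waypoint-generating network.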
4.2 First-order Optimization
We next illustrate the scenario in which the LL cost $L$ in (1b) is minimized with first-order optimizers such as GD. Because GD is a simple differentiable iterative method, one can leverage the unrolled optimization listed in Algorithm 1 to solve BLO via AutoDiff. It has been theoretically proved by Ji et al. (2021, 2022a) that when the LL problem is strongly convex and smooth, unrolled optimization with LL gradient descent can approximate the hypergradients $\nabla_{\theta} U$ and $\nabla_{\gamma} U$ up to an error that decreases linearly with the number of GD steps. For the implicit differentiation, we could leverage the optimality conditions of the LL optimization to compute the implicit gradient. To be specific, using the chain rule and the optimality of $\mu^*$ and $\nu^*$, we have the stationary points
(27) |
which, by taking differentiation over and and noting the implicit dependence of and on and , yields
(28a) | ||||
(28b) |
Assume that the concatenated matrix invertible at . Then, it can be derived from the linear equations in (28a) that the implicit gradients and
(29) |
which, combined with (2), yields the UL gradient as
(30e) | ||||
(30l) |
where the vector-matrix-inverse-product can be efficiently approximated by solving a linear system w.r.t. a vector
(31) |
A similar linear system can be derived for computing . Then, the practical methods should first use a first-order optimizer to solve the lower-level problem in (1b) to obtain approximates and of the solutions and , which are then incorporated into (30e) to obtain an approximate of the upper-level gradient (similarly for ). Then, the upper-level gradient approximates and are used to optimize the target variables and .
Approximation
As mentioned in Section 3.3, one could approximate the solutions in (30e) by assuming the second-order derivatives are small and thus can be ignored. In this case, we can directly use the fully first-order estimates without taking the implicit differentiation into account as
(32) |
which are evaluated at the approximates and of the LL solution . We next demonstrate it with an example of inductive logical reasoning using first-order optimization.

Example 2: Inductive Logical Reasoning
Given a few known facts, logical reasoning aims at deducing the truth value of unknown facts with formal logical rules (Iwańska, 1993; Overton, 2013; Halpern, 2013). Different types of logic have been invented to address distinct problems from various domains, including propositional logic (Wang et al., 2019), linear temporal logic (Xie et al., 2021), and first-order logic (FOL) (Cropper and Morel, 2021). Among them, FOL decomposes a logic term into lifted predicates and variables with quantifiers, which has strong generalization power and can be applied to arbitrary entities with any compositions. Due to this strong capability of generalization, FOL has been widely leveraged in knowledge graph reasoning (Yang and Song, 2019) and robotic task planning (Chitnis et al., 2022). However, traditional FOL requires a human expert to carefully design the predicates and rules, which is tedious. Automatically summarizing the FOL predicates and rules is a long-standing problem, which is known as inductive logic programming (ILP). However, existing works study ILP only in simple structured data like knowledge graphs. To extend this research into robotics, we will explore how IL can make ILP work with high-dimensional data like RGB images.
Background
One stream of the solutions to ILP is based on forward search algorithms (Cropper and Morel, 2021; Shindo et al., 2023; Hocquette et al., 2024). For example, Popper constructs answer set programs based on failures, which can significantly reduce the hypothesis space (Cropper and Morel, 2021). However, as FOL is a combinatorial problem, search-based methods can be extremely time-consuming as the samples scale up. Recently, some works introduced neural networks to assist the search process (Yang et al., 2017; Yang and Song, 2019; Yang et al., 2022b) or to represent the rules implicitly (Dong et al., 2019). To name a few, Neural LP re-formulates FOL rule inference as a multi-hop reasoning process, which can be represented as a series of matrix multiplications (Yang et al., 2017). Thus, learning the weight matrix becomes equivalent to inducing the rules. Neural logic machines, on the other hand, use a new FOL-inspired network structure, where the rules are implicitly stored in the network weights (Dong et al., 2019).
Despite their promising results in structured symbolic datasets, such as binary vector representations in BlocksWorld (Dong et al., 2019) and knowledge graphs (Yang et al., 2017), their capability of handling high-dimensional data like RGB images is rarely explored.

Approach
To address this gap, we verify the IL framework with the visual action prediction (VAP) task in LogiCity (Li et al., 2024), a recent logical reasoning benchmark. In VAP, a model must simultaneously discover traffic rules, identify the concepts of agents and their spatial relationships, and predict the agents’ actions. As shown in Figure 4, we utilize a grounding network to predict the agent concepts and spatial relationships, which takes observations such as images as the model input. The predicted agent concepts and their relationships are then sent to the reasoning module for rule induction and action prediction. The learned rules from the reasoning process are stored in the memory and retrieved when necessary. For example, the grounding network may output the concepts “IsTiro($X_1$)” and “IsPedestrian($X_2$)” and their relationship “NextTo($X_1$, $X_2$)”. Then the reasoning engine applies the learned rules stored in the memory module, e.g., “Slow($X$) $\leftarrow$ ($\exists Y$ IsTiro($X$) $\land$ IsPedestrian($Y$) $\land$ NextTo($X$, $Y$)) $\lor$ …”, to infer the next actions “Slow($X_1$)” or “Normal($X_2$)”, as displayed in Figure 5. Finally, the predicted actions and the observed actions from the grounding networks are used for loss calculation. In summary, we formulate this pipeline in (33) and refer to it as imperative logical reasoning (iLogic):
(33a) | ||||
(33b) |
Specifically, in the experiments we use the same cross-entropy function for the UL and LL costs, i.e., $U = L = -\sum_i a_i^o \log a_i^p$, where $a_i^o$ and $a_i^p$ are the observed and predicted actions of the $i$-th agent, respectively. This results in a self-supervised simultaneous grounding and rule induction pipeline for logical reasoning. Additionally, $f$ can be any grounding network, and we use a feature pyramid network (FPN) (Lin et al., 2017) as the visual encoder and two MLPs for relationships; $g$ can be any logical reasoning engine, and we use a neural logic machine (NLM) (Dong et al., 2019) with parameter $\mu$; and $M$ can be any memory module storing the learned rules, and we use the MLPs in the NLM parameterized by $\nu$. More details about the task, model, and list of concepts are presented in Section 5.2.
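To make the reasoning step concrete, the sketch below applies one hand-written rule, read here as Slow(X) ← ∃Y IsTiro(X) ∧ IsPedestrian(Y) ∧ NextTo(X, Y), to Boolean groundings. It is only an illustration of what the rule means. In the pipeline above the groundings are soft network outputs and the rules are stored implicitly in the NLM weights, so nothing here reflects the actual implementation. The function name, the data layout, and the default action “Normal” when the rule does not fire are assumptions.

```python
def infer_actions(entities, is_tiro, is_pedestrian, next_to):
    """Apply Slow(X) <- exists Y: IsTiro(X) and IsPedestrian(Y) and NextTo(X, Y).

    is_tiro and is_pedestrian map entity names to booleans; next_to is a set of
    ordered (x, y) pairs. Entities for which the rule does not fire get "Normal".
    """
    actions = {}
    for x in entities:
        slow = is_tiro.get(x, False) and any(
            is_pedestrian.get(y, False) and (x, y) in next_to for y in entities
        )
        actions[x] = "Slow" if slow else "Normal"
    return actions


entities = ["car_1", "car_2", "ped_1"]
actions = infer_actions(
    entities,
    is_tiro={"car_1": True, "car_2": True},
    is_pedestrian={"ped_1": True},
    next_to={("car_1", "ped_1")},
)
```

Here `car_1` is a tiro next to a pedestrian and is assigned “Slow”, while `car_2` is a tiro with no pedestrian nearby and keeps “Normal”.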
Optimization
Given that gradient descent algorithms can update both the grounding networks and the NLM, we can apply the first-order optimization to solve iLogic. The utilization of task-level action loss enables efficient self-supervised training, eliminating the necessity for explicit concept labels. Furthermore, BLO, compared to single-level optimization, helps the model concentrate more effectively on learning concept grounding and rule induction, respectively. This enhances the stability of optimization and decreases the occurrence of sub-optimal outcomes. Consequently, the model can learn rules more accurately and predict actions with greater precision.
4.3 Constrained Optimization
We next illustrate the scenarios in which the LL cost in (1b) is subject to the general constraints in (1c). We discuss two cases with equality and inequality constraints, respectively. Constrained optimization is a thoroughly explored field, and related findings were presented in (Dontchev et al., 2009) and summarized in (Gould et al., 2021). This study will focus on the integration of constrained optimization into our special form of BLO (1) under the framework of IL.
Equality Constraint
In this case, the constraint in (1c) is
(34) |
Recall that $\phi$ is a vector concatenating the symbol-like variables $\mu$ and $\nu$, and $\psi$ is a vector concatenating the neuron-like variables $\theta$ and $\gamma$. Therefore, the implicit gradient components can be expressed as the derivative of $\phi^*$ with respect to $\psi$, and we have
(35) |
Lemma 1.
Assume the LL cost and the constraint are second-order differentiable near and the Hessian matrix below is invertible. We then have