Imperative Learning: A Self-supervised Neuro-Symbolic Learning Framework for Robot Autonomy
Corresponding author: Chen Wang, Spatial AI & Robotics (SAIR) Lab, Department of Computer Science and Engineering, University at Buffalo, NY 14260, USA.


Chen Wang¹, Kaiyi Ji¹, Junyi Geng², Zhongqiang Ren³, Taimeng Fu¹, Fan Yang⁴, Yifan Guo⁵, Haonan He³, Xiangyu Chen¹, Zitong Zhan¹, Qiwei Du¹, Shaoshu Su¹, Bowen Li³, Yuheng Qiu³, Yi Du¹, Qihang Li¹, Yifan Yang¹, Xiao Lin¹, and Zhipeng Zhao¹

¹ University at Buffalo, USA
² Pennsylvania State University, USA
³ Carnegie Mellon University, USA
⁴ ETH Zürich, Switzerland
⁵ Purdue University, USA
chenw@sairlab.org
Abstract

Data-driven methods such as reinforcement and imitation learning have achieved remarkable success in robot autonomy. However, their data-centric nature still hinders them from generalizing well to ever-changing environments. Moreover, collecting large datasets for robotic tasks is often impractical and expensive. To overcome these challenges, we introduce a new self-supervised neuro-symbolic (NeSy) computational framework, imperative learning (IL), for robot autonomy, leveraging the generalization abilities of symbolic reasoning. The framework of IL consists of three primary components: a neural module, a reasoning engine, and a memory system. We formulate IL as a special bilevel optimization (BLO), which enables reciprocal learning over the three modules. This overcomes the label-intensive obstacles associated with data-driven approaches and takes advantage of symbolic reasoning concerning logical reasoning, physical principles, geometric analysis, etc. We discuss several optimization techniques for IL and verify their effectiveness in five distinct robot autonomy tasks including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. Through various experiments, we show that IL can significantly enhance robot autonomy capabilities and we anticipate that it will catalyze further research across diverse domains.

keywords:
Neuro-Symbolic AI, Self-supervised Learning, Bilevel Optimization, Imperative Learning

1 Introduction

With the rapid development of deep learning (LeCun et al., 2015), there has been growing interest in data-driven approaches such as reinforcement learning (Zhu and Zhang, 2021) and imitation learning (Hussein et al., 2017) for robot autonomy. However, despite these notable advancements, many data-driven autonomous systems are still predominantly constrained to their training environments, exhibiting limited generalization ability (Banino et al., 2018; Albrecht et al., 2022).

As a comparison, humans are capable of internalizing their experiences as abstract concepts or symbolic knowledge (Borghi et al., 2017). For instance, we interpret the terms “road” and “path” as symbols or concepts for navigable areas, whether it’s a paved street in a city or a dirt trail through a forest (Hockley, 2011). Equipped with these concepts, humans can employ spatial reasoning to navigate new and complex environments (Strader et al., 2024). This adaptability to novel scenarios, rooted in our ability to abstract and symbolize, is a fundamental aspect of human intelligence that is still notably absent in existing data-driven autonomous robots (Garcez et al., 2022; Kautz, 2022).

Though implicit reasoning abilities have garnered increased attention in the context of large language models (LLMs) (Lu et al., 2023; Shah et al., 2023a), robot autonomy systems still encounter significant challenges in achieving interpretable reasoning. This is particularly evident in areas such as geometric, physical, and logical reasoning (Liu et al., 2023). Overcoming these obstacles and integrating interpretable symbolic reasoning into data-driven models, a direction known as neuro-symbolic (NeSy) reasoning, could remarkably enhance robot autonomy (Garcez et al., 2022).

While NeSy reasoning offers substantial potential, its specific application in the field of robotics is still in a nascent stage. A key reason for this is the emerging nature of the NeSy reasoning field itself, which has not yet reached a consensus on a rigorous definition (Kautz, 2022). On the one hand, a narrow definition views NeSy reasoning as an amalgamation of neural (data-driven) methods and symbolic methods, where the latter utilize formal logic and symbols for knowledge representation and rule-based reasoning. On the other hand, a broader definition expands the scope of what constitutes a “symbol”. In this perspective, symbols are not only logical terms but also encompass any comprehensible, human-conceived concepts. This can include physical properties and semantic attributes, as seen in approaches involving concrete equations (Duruisseaux et al., 2023), logical programming (Delfosse et al., 2023), and programmable objectives (Yonetani et al., 2021a). This broad definition hence encapsulates reasoning related to physical principles, logical reasoning, geometric analysis, etc. In this context, many existing works exemplify NeSy systems, e.g., model-based reinforcement learning (Moerland et al., 2023), physics-informed networks (Karniadakis et al., 2021), and learning-aided tasks such as control (O’Connell et al., 2022), task scheduling (Gondhi and Gupta, 2017), and geometry analytics (Heidari and Iosifidis, 2024).

In this article, we explore the broader definition of NeSy reasoning for robot autonomy and introduce a self-supervised NeSy learning framework, which will be referred to as imperative learning (IL). It is designed to overcome the well-known issues of existing learning frameworks for robot autonomy: (1) Generalization ability: Many data-driven systems, including reinforcement learning models (Banino et al., 2018; Albrecht et al., 2022), remain largely confined to their training environments, displaying limited generalization capabilities. One reason for this limitation is their inability to learn explicit commonsense rules, which hinders their effective transfer to new environments. This encourages us to delve into symbolic reasoning techniques. (2) Black-box nature: The underlying causal mechanisms of data-driven models are mostly unknown, while a problematic decision can have catastrophic side effects in robotic tasks (Dulac-Arnold et al., 2021). Consequently, these models frequently encounter difficulties when applied to real-world situations, which further encourages us to delve into NeSy systems. (3) Label intensiveness: Labeling data for robotic tasks like imitation learning (Zare et al., 2023) often incurs higher costs than tasks in computer vision due to the reliance on specialized equipment over basic human annotations (Ebadi et al., 2022). For example, annotating accurate ground truth for robot planning is extremely complicated due to the intricacies of robot dynamics. This underscores the critical need for efficient self-supervised learning methods. (4) Sub-optimality: Separately training neural and symbolic modules can result in sub-optimal integration due to compounded errors (Besold et al., 2021). This motivates the investigation of an end-to-end NeSy learning approach.

IL is formulated to address the above issues with a single design. It is inspired by an interesting observation: while data-driven models predominantly require labeled data for parameter optimization, symbolic reasoning models are often capable of functioning without labels. Yet, both types of models can be optimized using gradient descent-like iterative methods. Take, for instance, geometrical models such as bundle adjustment (BA), physical models like model predictive control (MPC), and discrete models such as A* search over a graph. They can all be optimized without providing labels, even though they are analogous to data-driven models when formulated as optimization problems. IL leverages this property of both approaches by forcing the two kinds of models to correct each other, thereby creating a novel self-supervised learning paradigm. To optimize the entire framework, IL is formulated as a special bilevel optimization (BLO), solved by back-propagating the errors from the self-supervised symbolic models into the neural models. The term “imperative” is adopted to highlight and describe this passive self-supervised learning procedure.

In summary, the contributions of this article include:

  • We explore a self-supervised NeSy learning framework for robot autonomy, which is referred to as imperative learning (IL). IL is formulated as a special BLO problem that forces the network to learn symbolic concepts and enhances symbolic reasoning via data-driven methods. This leads to a reciprocal learning paradigm, which can avoid the sub-optimal solutions caused by the compounded errors of decoupled systems.

  • We discuss several optimization strategies to tackle the technical challenges of IL. We present how to incorporate different optimization techniques into our IL framework including closed-form solution, first-order optimization, second-order optimization, constrained optimization, and discrete optimization.

  • To benefit the entire community, we demonstrated the effectiveness of IL for several tasks in robot autonomy including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. We released the source code at https://sairlab.org/iseries/ to inspire more robotics research using IL.

This article is inspired by and built on our previous works in different fields of robot autonomy including local planning (Yang et al., 2023), global planning (Chen et al., 2024), simultaneous localization and mapping (SLAM) (Fu et al., 2024), feature matching (Zhan et al., 2024), and multi-agent routing (Guo et al., 2024). These works introduced a prototype of IL in several different domains while failing to introduce a systematic methodology for NeSy learning in robot autonomy. This article fills this gap by formally defining IL, exploring various optimization challenges of IL in different robot autonomy tasks, and introducing new applications such as rule induction and optimal control. Additionally, we introduce the theoretical background for solving IL based on BLO, propose several practical solutions to IL by experimenting with the five distinct robot autonomy applications, and demonstrate the superiority of IL over state-of-the-art (SOTA) methods in their respective fields.

2 Related Works

2.1 Bilevel Optimization

Bilevel optimization (BLO), first introduced by Bracken and McGill (1973), has been studied for decades. Classical approaches replaced the lower-level problem with its optimality conditions as constraints and reformulated the bilevel program as a single-level constrained problem (Hansen et al., 1992; Gould et al., 2016; Shi et al., 2005; Sinha et al., 2017). More recently, gradient-based BLO has attracted significant attention due to its efficiency and effectiveness in modern machine learning and deep learning problems. Since this paper mainly focuses on the learning side, we will concentrate on gradient-based BLO methods and briefly discuss their limitations in robot learning problems.

Methodologically, gradient-based BLO can be generally divided into approximate implicit differentiation (AID), iterative differentiation (ITD), and value-function-based approaches. Based on the explicit form of the gradient (or hypergradient) of the upper-level objective function via implicit differentiation, AID-based methods adopt a generalized iterative solver for the lower-level problem as well as an efficient estimate for Hessian-inverse-vector product of the hypergradient (Domke, 2012; Pedregosa, 2016; Liao et al., 2018; Arbel and Mairal, 2022a). ITD-based methods approximate the hypergradient by directly taking backpropagation over a flexible inner-loop optimization trajectory using forward or backward mode of autograd (Maclaurin et al., 2015; Franceschi et al., 2017; Finn et al., 2017; Shaban et al., 2019; Grazzi et al., 2020). Value-function-based approaches reformulated the lower-level problem as a value-function-based constraint and solved this constrained problem via various constrained optimization techniques such as mixed gradient aggregation, log-barrier regularization, primal-dual method, and dynamic barrier (Sabach and Shtern, 2017; Liu et al., 2020a; Li et al., 2020a; Sow et al., 2022; Liu et al., 2021b; Ye et al., 2022). Recently, large-scale stochastic BLO has been extensively studied both in theory and in practice. For example, Chen et al. (2021) and Ji et al. (2021) proposed a Neumann series-based hypergradient estimator; Yang et al. (2021), Huang and Huang (2021), Guo and Yang (2021), Yang et al. (2021), and Dagréou et al. (2022) incorporated the strategies of variance reduction and recursive momentum; and Sow et al. (2021) developed an evolutionary strategies (ES)-based method without computing Hessian or Jacobian.
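To make the difference between AID and ITD concrete, the following minimal sketch computes the hypergradient of a scalar toy problem in both ways and compares it with the analytic value. The quadratic costs, the constant `LAM`, and all function names are illustrative assumptions rather than part of any cited method. In this scalar setting, the Hessian-inverse-vector product used by AID reduces to a division.

```python
# Toy bilevel problem (illustrative assumption, not from the cited works):
#   lower level: y*(x) = argmin_y L(x, y),  L = 0.5*(y - x)**2 + 0.5*LAM*y**2
#   upper level: U(x, y) = 0.5*(y - 1)**2
LAM = 1.0

def dL_dy(x, y):
    return (y - x) + LAM * y

def d2L_dy2(x, y):
    return 1.0 + LAM

def d2L_dydx(x, y):
    return -1.0

def dU_dy(x, y):
    return y - 1.0

def dU_dx(x, y):
    return 0.0

def solve_lower(x, y0=0.0, step=0.25, iters=50):
    """Plain gradient descent on the lower-level cost."""
    y = y0
    for _ in range(iters):
        y -= step * dL_dy(x, y)
    return y

def hypergrad_aid(x):
    """AID: implicit differentiation at the (approximate) lower-level solution.
    dU/dx = dU/dx - (d2L/dydx) * (d2L/dy2)^-1 * dU/dy"""
    y = solve_lower(x)
    v = dU_dy(x, y) / d2L_dy2(x, y)  # Hessian-inverse-vector product (scalar case)
    return dU_dx(x, y) - d2L_dydx(x, y) * v

def hypergrad_itd(x, y0=0.0, step=0.25, iters=50):
    """ITD: forward-mode differentiation through the unrolled inner loop."""
    y, dy_dx = y0, 0.0
    for _ in range(iters):
        # Derivative of the update y <- y - step * dL/dy(x, y) with respect to x.
        dy_dx = dy_dx - step * (d2L_dydx(x, y) + d2L_dy2(x, y) * dy_dx)
        y = y - step * dL_dy(x, y)
    return dU_dx(x, y) + dU_dy(x, y) * dy_dx

def hypergrad_exact(x):
    """Analytic reference: y*(x) = x / (1 + LAM)."""
    y_star = x / (1.0 + LAM)
    return (y_star - 1.0) / (1.0 + LAM)

if __name__ == "__main__":
    x = 3.0
    print(hypergrad_aid(x), hypergrad_itd(x), hypergrad_exact(x))
```

Both estimates agree with the analytic hypergradient here because the toy lower-level problem is strongly convex and the inner loop converges. With a truncated or non-converged inner loop, the two estimates would generally differ.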

Theoretically, the convergence of BLO has been analyzed extensively based on a key assumption that the lower-level problem is strongly convex (Franceschi et al., 2018; Shaban et al., 2019; Liu et al., 2021b; Ghadimi and Wang, 2018; Ji et al., 2021; Hong et al., 2020; Arbel and Mairal, 2022a; Dagréou et al., 2022; Ji et al., 2022a; Huang et al., 2022). Among them, Ji and Liang (2021) further provided lower complexity bounds for deterministic BLO with (strongly) convex upper-level functions. Guo and Yang (2021), Chen et al. (2021), Yang et al. (2021), and Khanduri et al. (2021) achieved a near-optimal sample complexity with second-order derivatives. Kwon et al. (2023) and Yang et al. (2024) analyzed the convergence of first-order stochastic BLO algorithms. Recent works studied a more challenging setting where the lower-level problem is convex or satisfies Polyak-Łojasiewicz (PL) or Morse-Bott conditions (Liu et al., 2020a; Li et al., 2020a; Sow et al., 2022; Liu et al., 2021b; Ye et al., 2022; Arbel and Mairal, 2022b; Chen et al., 2023; Liu et al., 2021c). More results on BLO and its analysis can be found in the surveys (Liu et al., 2021a; Chen et al., 2022).

BLO has been integrated into machine learning applications. For example, researchers have introduced differentiable optimization layers (Amos and Kolter, 2017), convex layers (Agrawal et al., 2019), and declarative layers (Gould et al., 2021) into deep neural networks. They have been applied to several applications such as optical flow (Jiang et al., 2020), pivoting manipulation (Shirai et al., 2022), control (Landry, 2021), and trajectory generation (Han et al., 2024). However, systematic approaches and methodologies to NeSy learning for robot autonomy remain under-explored. Moreover, robotics problems are often highly non-convex, leading to many local minima and saddle points (Jadbabaie et al., 2019), adding optimization difficulties. We will explore the methods with assured convergence as well as those empirically validated by various tasks in robot autonomy.

2.2 Learning Frameworks in Robotics

We summarize the major learning frameworks in robotics, including imitation learning, reinforcement learning, and meta-learning. The others will be briefly mentioned.

Imitation Learning

Imitation learning is a technique where robots learn tasks by observing and mimicking an expert’s actions. Without explicitly modeling complex behaviors, robots can perform a variety of tasks such as dexterous manipulation (McAleer et al., 2018), navigation (Triest et al., 2023), and environmental interaction (Chi et al., 2023). Current research includes leveraging historical data, modeling multi-modal behaviors, employing privileged teachers (Kaufmann et al., 2020; Chen et al., 2020a; Lee et al., 2020), and generating data with generative models such as generative adversarial networks (Ho and Ermon, 2016), variational autoencoders (Zhao et al., 2023), and diffusion models (Chi et al., 2023). These advancements highlight the vibrant and ongoing exploration of imitation learning.

Imitation learning differs from regular supervised learning as it does not assume that the collected data is independent and identically distributed (iid), and relies solely on expert data representing “good” behaviors. Therefore, any small mistake during testing can lead to cascading failures. While techniques such as introducing intentional errors for data augmentation (Pomerleau, 1988; Tagliabue et al., 2022; Codevilla et al., 2018) and expert querying for data aggregation (Ross et al., 2011) exist, they still face notable challenges. These include low data efficiency, where limited or suboptimal demonstrations impair performance, and poor generalization, where robots have difficulty adapting learned behaviors to new contexts or unseen variations due to the labor-intensive nature of collecting high-quality data.

Reinforcement learning

Reinforcement learning (RL) is a learning paradigm where robots learn to perform tasks by interacting with their environment and receiving feedback in the form of rewards or penalties (Li, 2017). Due to its adaptability and effectiveness, RL has been widely used in numerous fields such as navigation (Zhu and Zhang, 2021), manipulation (Gu et al., 2016), locomotion (Margolis et al., 2024), and human-robot interaction (Modares et al., 2015).

However, RL also faces significant challenges, including sample inefficiency, which requires extensive interaction data (Dulac-Arnold et al., 2019), and the difficulty of ensuring safe exploration in physical environments (Thananjeyan et al., 2021). These issues are severe in complex tasks or environments where data collection is forbidden or dangerous (Pecka and Svoboda, 2014). Additionally, RL often struggles with generalizing learned behaviors to new environments and tasks and faces significant sim-to-real challenges. It can also be computationally expensive and sensitive to hyperparameter choices (Dulac-Arnold et al., 2021). Moreover, reward shaping, while potentially accelerating learning, can inadvertently introduce biases or suboptimal policies by misguiding the learning process.

We notice that BLO has also been integrated into RL. For instance, Stadie et al. (2020) formulated the intrinsic rewards as a BLO problem, leading to hyperparameter optimization; Hu et al. (2024) integrated reinforcement and imitation learning under BLO, addressing challenges like coupled behaviors and incomplete information in multi-robot coordination; Zhang et al. (2020a) employed a bilevel actor-critic learning method based on BLO and achieved better convergence than Nash equilibrium in the cooperative environments. However, they are still within the framework of RL and no systematic methods have been proposed.

Meta-learning

Meta-learning has garnered significant attention recently, particularly with its application to training deep neural networks (Bengio et al., 1991; Thrun and Pratt, 2012). Unlike conventional learning approaches, meta-learning leverages datasets and prior knowledge of a task ensemble to rapidly learn new tasks, often with minimal data, as seen in few-shot learning. Numerous meta-learning algorithms have been developed, encompassing metric-based (Koch et al., 2015; Snell et al., 2017; Chen et al., 2020b; Tang et al., 2020; Gharoun et al., 2023), model-based (Munkhdalai and Yu, 2017; Vinyals et al., 2016; Liu et al., 2020b; Co-Reyes et al., 2021), and optimization-based methods (Finn et al., 2017; Nichol and Schulman, 2018; Simon et al., 2020; Singh et al., 2021; Bohdal et al., 2021; Zhang et al., 2024; Choe et al., 2024). Among them, optimization-based approaches are often simpler to implement and achieve state-of-the-art results in a variety of domains.

BLO has served as an algorithmic framework for optimization-based meta-learning. As the most representative optimization-based approach, model-agnostic meta-learning (MAML) (Finn et al., 2017) learns an initialization such that a gradient descent procedure starting from this initial model can achieve rapid adaptation. In subsequent years, numerous works on various MAML variants have been proposed (Grant et al., 2018; Finn et al., 2019, 2018; Jerfel et al., 2018; Mi et al., 2019; Liu et al., 2019; Rothfuss et al., 2019; Foerster et al., 2018; Baik et al., 2020b; Raghu et al., 2019; Bohdal et al., 2021; Zhou et al., 2021; Baik et al., 2020a; Abbas et al., 2022; Kang et al., 2023; Zhang et al., 2024; Choe et al., 2024). Among them, Raghu et al. (2019) presented an efficient MAML variant named ANIL, which adapts only a subset of the neural network parameters. Finn et al. (2019) introduced a follow-the-meta-leader version of MAML for online learning applications. Zhou et al. (2021) improved the generalization performance of MAML by leveraging the similarity information among tasks. Baik et al. (2020a) proposed an improved version of MAML via adaptive learning rate and weight decay coefficients. Kang et al. (2023) proposed geometry-adaptive pre-conditioned gradient descent for efficient meta-learning. Additionally, a group of meta-regularization approaches has been proposed to improve the bias in a regularized empirical risk minimization problem (Denevi et al., 2018b, 2019, a; Rajeswaran et al., 2019; Balcan et al., 2019; Zhou et al., 2019). Furthermore, there is a prevalent embedding-based framework in few-shot learning (Bertinetto et al., 2018; Lee et al., 2019; Ravi and Larochelle, 2016; Snell et al., 2017; Zhou et al., 2018; Goldblum et al., 2020; Denevi et al., 2022; Qin et al., 2023; Jia and Zhang, 2024). The objective of this framework is to learn a shared embedding model applicable across all tasks, with task-specific parameters being learned for each task based on the embedded features.

It is worth noting that IL is proposed to alleviate the drawbacks of the above learning frameworks in robotics. However, IL can also be integrated with any existing learning framework, e.g., by formulating an RL method as the upper-level problem of IL, although this is beyond the scope of this paper.

2.3 Neuro-symbolic Learning

As previously mentioned, the field of NeSy learning lacks a consensus on a rigorous definition. One consequence is that the literature on NeSy learning is scarce and lacks systematic methodologies. Therefore, we will briefly discuss two major categories: logical reasoning and physics-infused networks. This will encompass scenarios where symbols represent either discrete signals, such as logical constructs, or continuous signals, such as physical attributes. We will address other related work in the context of the five robot autonomy examples within their respective sections.

Logical reasoning

Logical reasoning aims to inject interpretable and deterministic logical rules into neural networks (Serafini and Garcez, 2016; Riegel et al., 2020; Xie et al., 2019; Ignatiev et al., 2018). Some previous works, referred to as deductive methods, directly obtained such knowledge from human experts (Xu et al., 2018; Xie et al., 2019, 2021; Manhaeve et al., 2018; Riegel et al., 2020; Yang et al., 2020) or an oracle for controllable deduction in neural networks (Mao et al., 2018; Wang et al., 2022; Hsu et al., 2023). Representative works include DeepProbLog (Manhaeve et al., 2018), logical neural network (Riegel et al., 2020), and semantic loss (Xu et al., 2018). Despite their success, deductive methods require structured formal symbolic knowledge from humans, which is not always available. Besides, they still struggle to scale to larger and more complex problems. In contrast, inductive approaches induce the structured symbolic representation for semi-supervised efficient network learning. One popular strategy is based on forward-searching algorithms (Li et al., 2020b, 2022b; Evans and Grefenstette, 2018; Sen et al., 2022), which is time-consuming and hard to scale up. Others borrow gradient-based neural networks to conduct rule induction, such as SATNet (Wang et al., 2019), NeuralLP (Yang et al., 2017), and neural logic machine (NLM) (Dong et al., 2019). In particular, NLM introduced a new network architecture inspired by first-order logic, which shows better compositional generalization capability than vanilla neural nets. However, existing inductive algorithms either work on structured data like knowledge graphs (Yang et al., 2017; Yang and Song, 2019) or only experiment with toy image datasets (Shindo et al., 2023; Wang et al., 2019). We push the limit of this research into robotics with IL, providing real-world applications with high-dimensional image data.

Physics-infused networks

Physics-infused networks (PINs) integrate physical laws directly into the architecture and training of neural networks (Raissi et al., 2019), aiming to enhance the model’s capability to solve complex scientific, engineering, and robotics problems (Karniadakis et al., 2021). PINs embed known physical principles, such as conservation laws and differential equations, into the network’s loss function (Duruisseaux et al., 2023), constraints, or structure, fostering greater interpretability, generalization, and efficiency (Lu et al., 2021). For example, in fluid dynamics, PINs can leverage the Navier-Stokes equations to guide predictions, ensuring adherence to fundamental flow properties even in data-scarce regions (Sun and Wang, 2020). Similarly, in structural mechanics, variational principles can be used to inform the model about stress and deformation relationships, leading to more accurate structural analysis (Rao et al., 2021).

In robot autonomy, PINs have been applied to various tasks including perception (Guan et al., 2024), planning (Romero et al., 2023), and control (Han et al., 2024). By embedding the kinematics and dynamics of robotic systems directly into the learning process, PINs enable robots to predict and respond to physical interactions more accurately, leading to safer and more efficient operations. Methodologies for this include, but are not limited to, embedding physical laws into the network (Zhao et al., 2024), enforcing initial and boundary conditions in the training process (Rao et al., 2021), and designing physical constraint loss functions, e.g., minimizing an energy functional representing the physical system (Guan et al., 2024).
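As a minimal illustration of a physical constraint loss, the sketch below penalizes the residual of the free-fall equation h'' + g = 0 on a predicted height trajectory, in addition to a data term on a few observed time steps. The free-fall task, the finite-difference discretization, and all names are assumptions chosen for brevity. They are not taken from the cited works.

```python
G = 9.81  # gravitational acceleration in m/s^2

def physics_residuals(heights, dt):
    """Residual of h'' + G = 0 at interior time steps (central differences)."""
    return [
        (heights[k + 1] - 2.0 * heights[k] + heights[k - 1]) / dt**2 + G
        for k in range(1, len(heights) - 1)
    ]

def physics_informed_loss(heights, dt, observed, weight=1.0):
    """Data term on the observed steps plus a weighted mean-squared physics term.
    `observed` maps a time-step index to a measured height."""
    data = sum((heights[k] - h) ** 2 for k, h in observed.items())
    residuals = physics_residuals(heights, dt)
    physics = sum(r * r for r in residuals) / len(residuals)
    return data + weight * physics

dt = 0.1
times = [k * dt for k in range(11)]
# A trajectory that obeys the physics and one that ignores gravity.
free_fall = [10.0 + 2.0 * t - 0.5 * G * t**2 for t in times]
straight_line = [10.0 + 2.0 * t for t in times]
observed = {0: 10.0}  # a single measured height at the first time step

if __name__ == "__main__":
    print(physics_informed_loss(free_fall, dt, observed))
    print(physics_informed_loss(straight_line, dt, observed))
```

Both trajectories fit the single observation equally well, so only the physics term separates them. This is the sense in which such a loss can guide predictions in data-scarce regions.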

3 Imperative Learning

3.1 Structure

The proposed framework of imperative learning (IL) is illustrated in Figure 1, which consists of three modules, i.e., a neural system, a reasoning engine, and a memory module. Specifically, the neural system extracts high-level semantic attributes from raw sensor data such as images, LiDAR points, IMU measurements, and their combinations. These semantic attributes are subsequently sent to the reasoning engine, a symbolic process represented by physical principles, logical inference, analytical geometry, etc. The memory module stores the robot’s experiences and acquired knowledge, such as data, symbolic rules, and maps about the physical world for either a long-term or short-term period. Additionally, the reasoning engine performs a consistency check with the memory, which will update the memory or make necessary self-corrections. Intuitively, this design has the potential to combine the expressive feature extraction capabilities from the neural system, the interpretability and the generalization ability from the reasoning engine, and the memorability from the memory module into a single framework. We next explain the mathematical rationale to achieve this.

3.2 Formulation

One of the most important properties of the framework in Figure 1 is that the neural, reasoning, and memory modules can perform reciprocal learning in a self-supervised manner. This is achieved by formulating this framework as a BLO. Denote the neural system as $\bm{z}=f(\bm{\theta},\bm{x})$, where $\bm{x}$ represents the sensor measurements, $\bm{\theta}$ represents the perception-related learnable parameters, and $\bm{z}$ represents the neural outputs such as semantic attributes; the reasoning engine as $g(f,M,\bm{\mu})$ with reasoning-related parameters $\bm{\mu}$; and the memory system as $M(\bm{\gamma},\bm{\nu})$, where $\bm{\gamma}$ denotes the perception-related memory parameters (Wang et al., 2021a) and $\bm{\nu}$ denotes the reasoning-related memory parameters. In this context, our imperative learning (IL) is formulated as a special BLO:

\begin{align}
\min_{\bm{\psi}\doteq[\bm{\theta}^\top,~\bm{\gamma}^\top]^\top} \quad & U\left(f(\bm{\theta},\bm{x}),\, g(\bm{\mu}^*),\, M(\bm{\gamma},\bm{\nu}^*)\right), \tag{1a}\\
\textrm{s.t.} \quad & \bm{\phi}^*\in\operatorname*{\arg\min}_{\bm{\phi}\doteq[\bm{\mu}^\top,~\bm{\nu}^\top]^\top} L\left(f(\bm{\theta},\bm{x}),\, g(\bm{\mu}),\, M(\bm{\gamma},\bm{\nu})\right), \tag{1b}\\
& \textrm{s.t.}\quad \xi\left(M(\bm{\gamma},\bm{\nu}),\, \bm{\mu},\, f(\bm{\theta},\bm{x})\right) = \text{ or } \leq 0, \tag{1c}
\end{align}

where $\xi$ is a general constraint (either equality or inequality); $U$ and $L$ are the upper-level (UL) and lower-level (LL) cost functions; and $\bm{\psi}\doteq[\bm{\theta}^\top,\bm{\gamma}^\top]^\top$ are stacked UL variables and $\bm{\phi}\doteq[\bm{\mu}^\top,\bm{\nu}^\top]^\top$ are stacked LL variables, respectively. Alternatively, $U$ and $L$ are also referred to as the neural cost and symbolic cost, respectively. As aforementioned, the term “imperative” is used to denote the passive nature of the learning process: once optimized, the neural system $f$ in the UL cost will be driven to align with the LL reasoning engine $g$ (e.g., logical, physical, or geometrical reasoning process) with constraint $\xi$, so that it can learn to generate logically, physically, or geometrically feasible semantic attributes or predicates. In some applications, $\bm{\psi}$ and $\bm{\phi}$ are also referred to as neuron-like and symbol-like parameters, respectively.
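The following toy sketch shows what a bilevel problem of this shape can look like in code. It is purely illustrative and is not one of the applications studied in this article. The task (calibrating mis-scaled triangle angle readings), the one-parameter "network", the absence of a memory module, and all names are assumptions. The lower level enforces the symbolic knowledge that the interior angles of a triangle sum to 180 degrees and has a closed-form solution. The upper level updates the neural parameter so that its output agrees with the reasoning result, without any ground-truth labels.

```python
TOTAL = 180.0  # symbolic knowledge: interior angles of a triangle sum to 180 degrees

def neural(theta, x):
    """Neural module f(theta, x): rescales raw, mis-calibrated angle readings."""
    return [theta * xi for xi in x]

def reason(z):
    """Lower level: mu* = argmin_mu ||mu - z||^2  s.t.  sum(mu) - TOTAL = 0.
    The solution is the closed-form projection onto the constraint."""
    shift = (sum(z) - TOTAL) / len(z)
    return [zi - shift for zi in z]

def upper_cost(theta, x):
    """Upper level: U = ||f(theta, x) - mu*||^2, computed without labels."""
    z = neural(theta, x)
    mu = reason(z)
    return sum((zi - mi) ** 2 for zi, mi in zip(z, mu))

def upper_grad(theta, x):
    """dU/dtheta including the dependence of mu* on theta.
    Because z - mu* is the constant shift, U = (sum(z) - TOTAL)^2 / n."""
    n = len(x)
    residual = theta * sum(x) - TOTAL
    return 2.0 * residual * sum(x) / n

def train(samples, theta=1.0, lr=1e-5, epochs=200):
    """Gradient descent on the upper-level cost over all samples."""
    for _ in range(epochs):
        for x in samples:
            theta -= lr * upper_grad(theta, x)
    return theta

# Raw readings are true angles divided by an unknown scale factor of 1.2.
true_triangles = [[60.0, 60.0, 60.0], [90.0, 45.0, 45.0], [30.0, 100.0, 50.0]]
samples = [[a / 1.2 for a in tri] for tri in true_triangles]
theta = train(samples)

if __name__ == "__main__":
    print(theta)
```

In this toy case the learned scale recovers the hidden calibration factor because the geometric constraint alone determines it. Realistic lower-level problems rarely admit a closed-form solution, which is why other optimization techniques are needed in general.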

Figure 1: The framework of imperative learning (IL) consists of three primary modules including a neural perceptual network, a symbolic reasoning engine, and a general memory system. IL is formulated as a special BLO, enabling reciprocal learning and mutual correction among the three modules.
Self-supervised Learning

As presented in Section 1, the formulation of IL is motivated by one important observation: many symbolic reasoning engines, including geometric, physical, and logical reasoning, can be optimized or solved without labels. This is evident in logical reasoning methods such as equation discovery (Billard and Diday, 2002) and A* search (Hart et al., 1968), geometrical reasoning such as bundle adjustment (BA) (Agarwal et al., 2010), and physical reasoning such as model predictive control (Kouvaritakis and Cannon, 2016). The IL framework leverages this phenomenon and jointly optimizes the three modules by BLO, which forces them to mutually correct each other. Consequently, all three modules can learn and evolve in a self-supervised manner by observing the world. Although IL is designed for self-supervised learning, it can easily be adapted to supervised or weakly supervised learning by involving labels in the UL cost, the LL cost, or both.

Memory

The memory system within the IL framework is a general component that can retain and retrieve information online. Specifically, it can be any structure that is associated with write and read operations to retain and retrieve data (Wang et al., 2021a). A memory can be a neural network, where information is “written” into the parameters and “read” through a set of math operations or an implicit mapping, e.g., a neural radiance field (NeRF) model (Mildenhall et al., 2021). It can also be a structure with explicit physical meaning, such as a map created online, a set of logical rules induced online, or even a dataset collected online. Finally, it can be the memory system of LLMs in textual form, such as retrieval-augmented generation (RAG) (Lewis et al., 2020), which writes, reads, and manages symbolic knowledge.
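The write/read view of memory can be made concrete with a small sketch. The interface name `Memory`, the class `GridMapMemory`, and the helper `remember_scan` below are hypothetical and only illustrate the "explicit map created online" case; they are not part of the IL implementation.

```python
from typing import Any, Dict, Hashable, Optional, Protocol


class Memory(Protocol):
    """Hypothetical interface: any structure with write and read operations."""

    def write(self, key: Hashable, value: Any) -> None: ...

    def read(self, key: Hashable) -> Optional[Any]: ...


class GridMapMemory:
    """Explicit memory: an occupancy map built online, keyed by grid cell."""

    def __init__(self) -> None:
        self._cells: Dict[Hashable, Any] = {}

    def write(self, key: Hashable, value: Any) -> None:
        # Retain the latest observation of this cell.
        self._cells[key] = value

    def read(self, key: Hashable) -> Optional[Any]:
        # Returns None if the cell was never observed.
        return self._cells.get(key)


def remember_scan(memory: Memory, scan: Dict[Hashable, Any]) -> None:
    """Write every observed cell of one scan into the memory."""
    for cell, occupied in scan.items():
        memory.write(cell, occupied)


memory = GridMapMemory()
remember_scan(memory, {(0, 0): False, (0, 1): True})
```

A neural memory would satisfy the same interface, with `write` implemented as a parameter update and `read` as a forward pass.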

3.3 Optimization

BLO has been explored in frameworks such as meta-learning (Finn et al., 2017), hyperparameter optimization (Franceschi et al., 2018), and reinforcement learning (Hong et al., 2020). However, most theoretical analyses have focused on its applicability to data-driven models, where first-order gradient descent (GD) is frequently employed (Ji et al., 2021; Gould et al., 2021). Nevertheless, many reasoning tasks present unique challenges that make GD less effective. For example, geometrical reasoning like BA requires second-order optimizers (Fu et al., 2024) such as Levenberg-Marquardt (LM) (Marquardt, 1963); multi-robot routing needs combinatorial optimization over discrete variables (Ren et al., 2023a). Employing such LL optimizations within the BLO framework introduces substantial complexity and challenges, which are still underexplored (Ji et al., 2021). Therefore, we first delve into general BLO and then provide practical examples covering distinct challenges of LL optimizations in our IL framework.

Algorithm 1 The Algorithm of Solving Imperative Learning by Unrolled Differentiation or Implicit Differentiation.
1:  while Not Convergent do
2:     Obtain $\bm{\mu}_T, \bm{\nu}_T$ by solving the LL problem (1b) (possibly with constraint (1c)) by a generic optimizer $\mathcal{O}$ with $T$ steps.
3:     Efficient estimation of UL gradients in (2) via
          Unrolled Differentiation: Compute $\hat{\nabla}_{\bm{\theta}}U$ and $\hat{\nabla}_{\bm{\gamma}}U$ by having $\bm{\Phi}$ in (7) and $U(\bm{\psi},\bm{\Phi}(\bm{\psi}))$ in (8) via AutoDiff.
          Implicit Differentiation (Algorithm 2): Compute
          $$\hat{\nabla}_{\bm{\psi}}U=\frac{\partial U}{\partial\bm{\psi}}\Big|_{\bm{\phi}_T}+\frac{\partial U}{\partial\bm{\phi}^{*}}{\color{blue}\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}}\Big|_{\bm{\phi}_T},$$
          where the implicit derivatives ${\color{blue}\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}}$ can be obtained by solving a linear system via the LL optimality conditions.
4:  end while

The solution to IL (1) mainly involves solving for the UL parameters $\bm{\theta}$ and $\bm{\gamma}$ and the LL parameters $\bm{\mu}$ and $\bm{\nu}$. Intuitively, the UL parameters, which are often neuron-like weights, can be updated with the gradients of the UL cost $U$:

$$\begin{aligned}
\nabla_{\bm{\theta}}U &= \frac{\partial U}{\partial f}\frac{\partial f}{\partial\bm{\theta}}+\frac{\partial U}{\partial g}\frac{\partial g}{\partial\bm{\mu}^{*}}{\color{blue}\frac{\partial\bm{\mu}^{*}}{\partial\bm{\theta}}}+\frac{\partial U}{\partial M}\frac{\partial M}{\partial\bm{\nu}^{*}}{\color{blue}\frac{\partial\bm{\nu}^{*}}{\partial\bm{\theta}}},\\
\nabla_{\bm{\gamma}}U &= \frac{\partial U}{\partial M}\frac{\partial M}{\partial\bm{\gamma}}+\frac{\partial U}{\partial g}\frac{\partial g}{\partial\bm{\mu}^{*}}{\color{blue}\frac{\partial\bm{\mu}^{*}}{\partial\bm{\gamma}}}+\frac{\partial U}{\partial M}\frac{\partial M}{\partial\bm{\nu}^{*}}{\color{blue}\frac{\partial\bm{\nu}^{*}}{\partial\bm{\gamma}}}.
\end{aligned}\tag{2}$$

The key challenge of computing (2) lies in the implicit differentiation terms shown in blue, which take the form

$${\color{blue}\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}}=\begin{bmatrix}\dfrac{\partial\bm{\mu}^{*}}{\partial\bm{\theta}} & \dfrac{\partial\bm{\mu}^{*}}{\partial\bm{\gamma}}\\[2mm] \dfrac{\partial\bm{\nu}^{*}}{\partial\bm{\theta}} & \dfrac{\partial\bm{\nu}^{*}}{\partial\bm{\gamma}}\end{bmatrix},\tag{3}$$

where $\bm{\mu}^{*}$ and $\bm{\nu}^{*}$ are the solutions to the LL problem. For simplicity, we can write (2) in matrix form:

$$\nabla_{\bm{\psi}}U=\frac{\partial U}{\partial\bm{\psi}}+\frac{\partial U}{\partial\bm{\phi}^{*}}{\color{blue}\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}},\tag{4}$$

where $\frac{\partial U}{\partial\bm{\psi}}=\left[\left(\frac{\partial U}{\partial f}\frac{\partial f}{\partial\bm{\theta}}\right)^{\top},\left(\frac{\partial U}{\partial M}\frac{\partial M}{\partial\bm{\gamma}}\right)^{\top}\right]^{\top}$ and $\frac{\partial U}{\partial\bm{\phi}^{*}}=\left[\left(\frac{\partial U}{\partial g}\frac{\partial g}{\partial\bm{\mu}^{*}}\right)^{\top},\left(\frac{\partial U}{\partial M}\frac{\partial M}{\partial\bm{\nu}^{*}}\right)^{\top}\right]^{\top}$. There are typically two methods for calculating those gradients, i.e., unrolled differentiation and implicit differentiation. We summarize a generic framework incorporating both methods in Algorithm 1 to provide a clearer understanding.
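The outer loop of Algorithm 1 can be sketched on a scalar toy problem. Everything below is an illustrative assumption rather than one of the paper's tasks: the LL cost $L(\phi;\psi)=\frac{1}{2}(\phi-c\psi)^2$, the UL cost $U(\psi,\phi)=\frac{1}{2}(\phi-1)^2+\frac{1}{2}\psi^2$, the constant `C`, and the plain gradient-descent UL update (Algorithm 1 itself does not prescribe the UL optimizer). For this toy LL problem the implicit derivative is known in closed form, $\partial\phi^{*}/\partial\psi=c$.

```python
C = 2.0  # hypothetical coupling: the LL solution is phi* = C * psi


def solve_ll(psi, steps=100, eta=0.2):
    """T steps of gradient descent on L(phi; psi) = 0.5 * (phi - C * psi)**2."""
    phi = 0.0
    for _ in range(steps):
        phi -= eta * (phi - C * psi)
    return phi


def ul_gradient(psi, phi_T):
    """Estimate of (4) for U(psi, phi) = 0.5 * (phi - 1)**2 + 0.5 * psi**2."""
    direct = psi           # dU/dpsi with phi held fixed
    dU_dphi = phi_T - 1.0  # dU/dphi evaluated at phi_T
    dphi_dpsi = C          # implicit derivative, closed form for this toy LL
    return direct + dU_dphi * dphi_dpsi


def imperative_loop(psi=0.0, lr=0.1, iters=200):
    """Alternate LL solves and UL gradient steps (assumed UL optimizer: GD)."""
    for _ in range(iters):
        phi_T = solve_ll(psi)                # step 2 of Algorithm 1
        psi -= lr * ul_gradient(psi, phi_T)  # step 3, then a UL update
    return psi, solve_ll(psi)


psi_opt, phi_opt = imperative_loop()
```

Substituting $\phi^{*}=c\psi$ into $U$ gives the minimizer $\psi=c/(c^2+1)$, which the loop should approach; in realistic IL problems the closed-form `dphi_dpsi` is unavailable and is replaced by one of the two methods described next.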

3.3.1 Unrolled Differentiation

The method of unrolled differentiation is an easy-to-implement solution for BLO problems. It needs automatic differentiation (AutoDiff) through the LL optimization. Specifically, given an initialization for the LL variable $\bm{\phi}_0=\bm{\Phi}_0(\bm{\psi})$ at step $t=0$, the iterative process of unrolled optimization is

$$\bm{\phi}_t=\bm{\Phi}_t(\bm{\phi}_{t-1};\bm{\psi}),\quad t=1,\cdots,T,\tag{5}$$

where $\bm{\Phi}_t$ denotes an updating scheme based on a specific LL problem and $T$ is the number of iterations. One popular updating scheme is based on gradient descent:

$$\bm{\Phi}_t(\bm{\phi}_{t-1};\bm{\psi})=\bm{\phi}_{t-1}-\eta\cdot\frac{\partial L(\bm{\phi}_{t-1};\bm{\psi})}{\partial\bm{\phi}_{t-1}},\tag{6}$$

where $\eta$ is a learning rate and the term $\frac{\partial L(\bm{\phi}_{t-1};\bm{\psi})}{\partial\bm{\phi}_{t-1}}$ can be computed by AutoDiff. Therefore, we can obtain $\nabla_{\bm{\theta}}U(\bm{\theta})$ and $\nabla_{\bm{\gamma}}U(\bm{\gamma})$ by substituting $\bm{\phi}_T=[\bm{\mu}_T,\bm{\nu}_T]$ as an approximation of $\bm{\phi}^{*}=[\bm{\mu}^{*},\bm{\nu}^{*}]$. The full unrolled system is defined as

$$\bm{\Phi}(\bm{\psi})=(\bm{\Phi}_T\circ\cdots\circ\bm{\Phi}_1\circ\bm{\Phi}_0)(\bm{\psi}),\tag{7}$$

where the symbol $\circ$ denotes function composition. Therefore, we instead only need to consider an alternative:

$$\min_{\bm{\psi}}\;\;U(\bm{\psi},\bm{\Phi}(\bm{\psi})),\tag{8}$$

where $\frac{\partial\bm{\Phi}(\bm{\psi})}{\partial\bm{\psi}}$ can be computed via AutoDiff instead of directly calculating the four terms in $\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}$ of (3).
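The unrolled scheme (5)-(8) can be sketched without any external library by propagating derivatives through the LL iterations with forward-mode dual numbers. This is a minimal illustration under stated assumptions: the `Dual` class stands in for a real AutoDiff engine, the scalar costs and the constant `C` are hypothetical, and the initialization $\bm{\phi}_0$ is taken to be independent of $\bm{\psi}$.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """Forward-mode AutoDiff number: a value and its derivative w.r.t. psi."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __sub__(self, other):
        other = lift(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __rsub__(self, other):
        return lift(other) - self

    def __mul__(self, other):
        other = lift(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def lift(x):
    """Wrap plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x))


C = 2.0  # hypothetical coupling in the LL cost L = 0.5 * (phi - C * psi)**2


def unrolled_phi(psi, steps, eta):
    """Compose T gradient-descent updates, as in (6) and (7)."""
    phi = Dual(0.0)  # phi_0, assumed independent of psi
    for _ in range(steps):
        grad = phi - C * psi    # dL/dphi at the current iterate
        phi = phi - eta * grad  # one update Phi_t
    return phi


def ul_cost(psi, phi):
    """Hypothetical UL cost U = 0.5 * (phi - 1)**2 + 0.5 * psi**2."""
    residual = phi - 1.0
    return 0.5 * residual * residual + 0.5 * psi * psi


def unrolled_gradient(psi_value, steps=50, eta=0.2):
    psi = Dual(psi_value, 1.0)  # seed d(psi)/d(psi) = 1
    cost = ul_cost(psi, unrolled_phi(psi, steps, eta))  # objective of (8)
    return cost.val, cost.dot


value, grad = unrolled_gradient(1.0)
exact = 1.0 + (C * 1.0 - 1.0) * C  # gradient of U with phi* = C * psi
```

With enough LL steps, `grad` approaches `exact`; with few steps it is the gradient of the truncated problem (8) rather than of the original bilevel problem, which is the usual trade-off of unrolling.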

Algorithm 2 Implicit Differentiation via Linear System.
1:  Input: The current variable $\bm{\psi}$ and the optimal variable $\bm{\phi}^{*}$.
2:  Initialization: $k=1$, learning rate $\eta$.
3:  while Not Convergent do
4:     Perform gradient descent (or use conjugate gradient):
       $$\bm{q}_k=\bm{q}_{k-1}-\eta\left(\bm{H}\bm{q}_{k-1}-\bm{v}\right),\tag{9}$$
       where $\bm{H}\bm{q}_{k-1}$ is computed from the fast Hessian-vector product.
5:  end while
6:  Assign $\bm{q}=\bm{q}_k$
7:  Compute $\nabla_{\bm{\theta}}U$ and $\nabla_{\bm{\gamma}}U$ in (2) as:
       $$\nabla_{\bm{\psi}}U=\frac{\partial U}{\partial\bm{\psi}}-\bm{q}^{\top}\cdot\bm{H}_{\bm{\phi}^{*}\bm{\psi}},\tag{10}$$
       where $\bm{q}^{\top}\cdot\bm{H}_{\bm{\phi}^{*}\bm{\psi}}$ is also computed from the fast Hessian-vector product.

3.3.2 Implicit Differentiation

The method of implicit differentiation directly computes the derivatives ${\color{blue}\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}}$. We next introduce a generic framework for the implicit differentiation algorithm by solving a linear system from the first-order optimality condition of the LL problem, while the exact solutions depend on specific tasks and will be illustrated using several independent examples in Section 4.

Assume $\bm{\phi}_T=[\bm{\mu}_T,\bm{\nu}_T]$ is the output of $T$ steps of a generic optimizer for the LL problem (1b), possibly under the constraint (1c). The approximate UL gradient (2) is then

$$\hat{\nabla}_{\bm{\psi}}U\approx\frac{\partial U}{\partial\bm{\psi}}\Big|_{\bm{\phi}_T}+\frac{\partial U}{\partial\bm{\phi}^{*}}{\color{blue}\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}}\Big|_{\bm{\phi}_T}.\tag{11}$$

Then the derivatives $\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}$ are obtained by solving implicit equations from the optimality conditions of the LL problem, i.e., $\frac{\partial\tilde{L}}{\partial\bm{\phi}^{*}}=\bm{0}$, where $\tilde{L}$ is a generic function defining the LL optimality condition (e.g., the LL cost or its Lagrangian, depending on the problem). Specifically, taking the derivative of the equation $\frac{\partial\tilde{L}}{\partial\bm{\phi}^{*}}=\bm{0}$ with respect to $\bm{\psi}$ on both sides leads to

$$\frac{\partial^{2}\tilde{L}(\bm{\phi}^{*}(\bm{\psi}),\bm{\psi})}{\partial\bm{\phi}^{*}(\bm{\psi})\,\partial\bm{\psi}}+\frac{\partial^{2}\tilde{L}(\bm{\phi}^{*}(\bm{\psi}),\bm{\psi})}{\partial\bm{\phi}^{*}(\bm{\psi})\,\partial\bm{\phi}^{*}(\bm{\psi})}\cdot{\color{blue}\frac{\partial\bm{\phi}^{*}(\bm{\psi})}{\partial\bm{\psi}}}=\bm{0}.\tag{12}$$

Solving the equation gives us the implicit gradients as

$${\color{blue}\frac{\partial\bm{\phi}^{*}(\bm{\psi})}{\partial\bm{\psi}}}=-\underbrace{\left(\frac{\partial^{2}\tilde{L}(\bm{\phi}^{*}(\bm{\psi}),\bm{\psi})}{\partial\bm{\phi}^{*}(\bm{\psi})\,\partial\bm{\phi}^{*}(\bm{\psi})}\right)^{-1}}_{\bm{H}_{\bm{\phi}^{*}\bm{\phi}^{*}}^{-1}}\underbrace{\frac{\partial^{2}\tilde{L}(\bm{\phi}^{*}(\bm{\psi}),\bm{\psi})}{\partial\bm{\phi}^{*}(\bm{\psi})\,\partial\bm{\psi}}}_{\bm{H}_{\bm{\phi}^{*}\bm{\psi}}}.\tag{13}$$

This means we obtain the implicit gradients at the cost of an inversion of the Hessian matrix, $\bm{H}_{\bm{\phi}^{*}\bm{\phi}^{*}}^{-1}$.
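As an illustrative sanity check of (12) and (13), consider a hypothetical quadratic LL function (chosen here for convenience, not one of the tasks in Section 4) with a symmetric positive-definite matrix $\bm{A}$ and a coupling matrix $\bm{B}$:

$$\begin{aligned}
\tilde{L}(\bm{\phi},\bm{\psi}) &= \tfrac{1}{2}\bm{\phi}^{\top}\bm{A}\bm{\phi}-\bm{\phi}^{\top}\bm{B}\bm{\psi},\\
\frac{\partial\tilde{L}}{\partial\bm{\phi}^{*}} &= \bm{A}\bm{\phi}^{*}-\bm{B}\bm{\psi}=\bm{0}\;\;\Rightarrow\;\;\bm{\phi}^{*}(\bm{\psi})=\bm{A}^{-1}\bm{B}\bm{\psi},\\
\bm{H}_{\bm{\phi}^{*}\bm{\phi}^{*}} &= \bm{A},\qquad \bm{H}_{\bm{\phi}^{*}\bm{\psi}}=-\bm{B},\\
\frac{\partial\bm{\phi}^{*}(\bm{\psi})}{\partial\bm{\psi}} &= -\bm{H}_{\bm{\phi}^{*}\bm{\phi}^{*}}^{-1}\bm{H}_{\bm{\phi}^{*}\bm{\psi}}=\bm{A}^{-1}\bm{B}.
\end{aligned}$$

The last line agrees with differentiating the closed-form solution $\bm{\phi}^{*}(\bm{\psi})=\bm{A}^{-1}\bm{B}\bm{\psi}$ directly, which is what (13) predicts when the LL solution happens to be available analytically.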

In practice, the Hessian matrix can be too big to calculate and store (see footnote 2), but we can bypass it by solving a linear system.

Footnote 2: For instance, assume both the UL and LL costs have a network with merely 1 million ($10^{6}$) parameters (32-bit float numbers). Each network then needs only $10^{6}\times 4\,\text{Byte}=4\,\text{MB}$ of storage, while its Hessian matrix needs $(10^{6})^{2}\times 4\,\text{Byte}=4\,\text{TB}$. This indicates that a Hessian matrix cannot even be explicitly stored in the memory of a low-power computer, so directly calculating its inverse is even more impractical.

Substituting (13) into (4), we have the UL gradient

\begin{align}
\nabla_{\bm{\psi}}U &= \frac{\partial U}{\partial\bm{\psi}}-\underbrace{\frac{\partial U(\bm{\psi},\bm{\phi}^{*})}{\partial\bm{\phi}^{*}}}_{\bm{v}^{\top}}\bm{H}_{\bm{\phi}^{*}\bm{\phi}^{*}}^{-1}\cdot\bm{H}_{\bm{\phi}^{*}\bm{\psi}} \tag{14a}\\
&= \frac{\partial U}{\partial\bm{\psi}}-\bm{q}^{\top}\cdot\bm{H}_{\bm{\phi}^{*}\bm{\psi}}. \tag{14b}
\end{align}

Therefore, instead of calculating the Hessian inversion, we can solve a linear system $\bm{H}\bm{q}=\bm{v}$ for $\bm{q}^{\top}$ by optimizing

$$\bm{q}^{*}=\operatorname*{arg\,min}_{\bm{q}}Q(\bm{q})=\operatorname*{arg\,min}_{\bm{q}}\left(\frac{1}{2}\bm{q}^{\top}\bm{H}\bm{q}-\bm{q}^{\top}\bm{v}\right),\tag{15}$$

where $\bm{H}_{\bm{\phi}^{*}\bm{\phi}^{*}}$ is denoted as $\bm{H}$ for simplicity. The quadratic problem (15), and hence the linear system, can be solved without explicitly calculating and storing the Hessian matrix by solvers such as conjugate gradient (Hestenes and Stiefel, 1952) or gradient descent. For example, since the gradient is $\frac{\partial Q(\bm{q})}{\partial\bm{q}}=\bm{H}\bm{q}-\bm{v}$, the updating scheme based on the gradient descent algorithm is:

$$\bm{q}_k=\bm{q}_{k-1}-\eta\left(\bm{H}\bm{q}_{k-1}-\bm{v}\right).\tag{16}$$

This updating scheme is efficient since $\bm{H}\bm{q}$ can be computed using the fast Hessian-vector product, i.e., a Hessian-vector product is the gradient of a gradient-vector product:

$$\bm{H}\bm{q}=\frac{\partial^{2}\tilde{L}}{\partial\bm{\phi}\,\partial\bm{\phi}}\cdot\bm{q}=\frac{\partial\left(\frac{\partial\tilde{L}}{\partial\bm{\phi}}\cdot\bm{q}\right)}{\partial\bm{\phi}},\tag{17}$$

where L~ϕ𝒒~𝐿bold-italic-ϕ𝒒\frac{\partial\tilde{L}}{\partial\bm{\phi}}\cdot\bm{q}divide start_ARG ∂ over~ start_ARG italic_L end_ARG end_ARG start_ARG ∂ bold_italic_ϕ end_ARG ⋅ bold_italic_q is a scalar. This means the Hessian 𝑯𝑯\bm{H}bold_italic_H is not explicitly computed or stored. We summarize this implicit differentiation in Algorithm 2. Note that the optimality condition depends on the LL problems. In Section 4, we will show that it can either be the derivative of the LL cost function for an unconstrained problem such as (27) or the Lagrangian function for a constrained problem such as (39).
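As a minimal sketch of the scheme in (16), the following Python snippet solves $\bm{H}\bm{q}=\bm{v}$ by gradient descent while only ever calling a Hessian-vector product. The toy quadratic cost, the helper names (`grad_L`, `hvp`, `solve_by_gd`), the step size, and the iteration count are illustrative assumptions. The Hessian-vector product is approximated here by a central finite difference of the gradient, which is a stand-in for the AutoDiff-based product in (17), not the method used in the paper.

```python
# Toy LL cost (assumed): L(phi) = 0.5 * phi^T A phi, so grad L = A phi and H = A.
# A is only used inside grad_L; the solver never forms or stores H.
A = [[4.0, 1.0],
     [1.0, 3.0]]


def grad_L(phi):
    """Gradient of the toy quadratic cost."""
    return [sum(a * p for a, p in zip(row, phi)) for row in A]


def hvp(phi, q, eps=1e-4):
    """Approximate H q by a central difference of the gradient along q."""
    plus = grad_L([p + eps * d for p, d in zip(phi, q)])
    minus = grad_L([p - eps * d for p, d in zip(phi, q)])
    return [(a - b) / (2.0 * eps) for a, b in zip(plus, minus)]


def solve_by_gd(phi, v, eta=0.2, steps=200):
    """Iterate q_k = q_{k-1} - eta * (H q_{k-1} - v), as in (16)."""
    q = [0.0] * len(v)
    for _ in range(steps):
        Hq = hvp(phi, q)
        q = [qi - eta * (hi - vi) for qi, hi, vi in zip(q, Hq, v)]
    return q


phi_star = [0.5, -0.25]   # hypothetical point at which the Hessian is evaluated
v = [1.0, 2.0]
q = solve_by_gd(phi_star, v)
residual = [h - vi for h, vi in zip(hvp(phi_star, q), v)]
```

The step size must be smaller than $2/\lambda_{\max}(\bm{H})$ for the iteration to converge; conjugate gradient typically needs far fewer Hessian-vector products on ill-conditioned problems.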

Approximation

Implicit differentiation is complicated to implement. One approximation is to ignore the implicit components and use only the direct part $\hat{\nabla}_{\bm{\psi}}U\approx\frac{\partial U}{\partial\bm{\psi}}\Big|_{\bm{\phi}_{T}}$. This is equivalent to treating the solution $\bm{\phi}_{T}$ from the LL optimization as a constant in the UL problem. Such an approximation is more efficient but introduces an error term

$$\epsilon\sim\left|\frac{\partial U}{\partial\bm{\phi}^{*}}\frac{\partial\bm{\phi}^{*}}{\partial\bm{\psi}}\right|. \tag{18}$$

Nevertheless, it is useful when the implicit derivatives contain products of small second-order derivatives, which again depends on the specific LL problems.
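To make the error term (18) concrete, consider a scalar toy problem in which every function and constant is an illustrative assumption rather than one of the applications in this paper. The LL problem $\min_{\phi}\frac{1}{2}(\phi-a\psi)^{2}$ has the closed-form solution $\phi^{*}=a\psi$, and the UL cost is $U=\frac{1}{2}(\phi^{*}-t)^{2}+\frac{c}{2}\psi^{2}$. The sketch compares the full hypergradient with the direct-part approximation.

```python
a, t, c = 0.1, 1.0, 2.0   # hypothetical constants of the toy problem


def ll_solution(psi):
    """Closed-form LL minimizer of 0.5 * (phi - a * psi)**2."""
    return a * psi


def upper_cost(psi):
    """UL cost evaluated at the LL solution."""
    phi = ll_solution(psi)
    return 0.5 * (phi - t) ** 2 + 0.5 * c * psi ** 2


def full_gradient(psi):
    """Direct part plus implicit part (dU/dphi*) * (dphi*/dpsi)."""
    phi = ll_solution(psi)
    direct = c * psi
    implicit = (phi - t) * a
    return direct + implicit


def direct_gradient(psi):
    """Approximation: treat phi* as a constant in the UL problem."""
    return c * psi


psi = 1.5
error = abs(full_gradient(psi) - direct_gradient(psi))
# error equals |(phi* - t) * a|, so it shrinks as dphi*/dpsi = a shrinks
```

In this toy case the error scales with $a=\partial\phi^{*}/\partial\psi$, which illustrates why the approximation is reasonable only when the LL solution depends weakly on the UL variables.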

It is worth noting that in the framework of IL (1), we assign perception-related parameters $\bm{\psi}$ to the UL neural cost (1a) and reasoning-related parameters $\bm{\phi}$ to the LL symbolic cost (1b). This design stems from two key considerations. First, it avoids large Jacobian and Hessian matrices for neuron-like variables, such as $\bm{H}_{\bm{\phi}^{*}\bm{\phi}^{*}}$. Given that real-world robot applications often involve an immense number of neuron-like parameters (e.g., a simple neural network might possess millions), placing them in the UL cost reduces the complexity of computing the implicit gradients necessitated by the LL cost (1b).

Second, perception-related (neuron-like) parameters are usually updated using gradient descent algorithms such as SGD (Sutskever et al., 2013). However, such simple first-order optimization methods are often inadequate for LL symbolic reasoning, e.g., geometric problems (Wang et al., 2023) usually need second-order optimizers. Therefore, separating neuron-like parameters from reasoning-related parameters makes the BLO in IL easier to solve and analyze. However, this again depends on the LL tasks.

Table 1: Five examples in logic, planning, control, SLAM, and MTSP are selected to cover distinct scenarios of the LL problems for solving the BLO in imperative learning.
Section    Application   LL problem   LL solution   Type
Sec. 4.1   Planning      A*           Closed-form   (C)
Sec. 4.2   Logic         NLM          1st-order     (B)
Sec. 4.3   Control       MPC          Constrained   (A)
Sec. 4.4   SLAM          PGO          2nd-order     (C)
Sec. 4.5   MTSP          TSP          Discrete      (A)

All modules are trained, although a pre-trained feature extractor is given.

4 Applications and Examples

To showcase the effectiveness of IL, we will introduce five distinct examples in different fields of robot autonomy. These examples, along with their respective LL problem and optimization methods, are outlined in Table 1. Specifically, they are selected to cover distinct tasks, including path planning, rule induction, optimal control, visual odometry, and multi-agent routing, to showcase different optimization techniques required by the LL problems, including closed-form solution, first-order optimization, constrained optimization, second-order optimization, and discrete optimization, respectively. We will explore several memory structures mentioned in Section 3.2.

Additionally, since IL is a self-supervised learning framework consisting of three primary components, there are three learning types: (A) given known (pre-trained or human-defined) reasoning engines such as logical reasoning, physical principles, and geometrical analytics, robots can learn a logic-, physics-, or geometry-aware neural perception system, respectively, in a self-supervised manner; (B) given neural perception systems such as a vision foundation model (Kirillov et al., 2023), robots can discover the world rules, e.g., traffic rules, and then apply the rules to future events; and (C) given a memory system (e.g., experience, world rules, or maps), robots can simultaneously update the neural system and reasoning engine so that they can adapt to novel environments with a new set of rules. The five examples cover all three learning types.

4.1 Closed-form Solution

We first illustrate the scenarios in which the LL cost $L$ in (1b) has closed-form solutions. In this case, one can directly optimize the UL cost by solving

$$\min_{\bm{\theta},\bm{\gamma}}\quad U\left(f(\bm{\theta},\bm{x}),g(\bm{\mu}^{*}(\bm{\theta},\bm{\gamma})),M(\bm{\gamma},\bm{\nu}^{*}(\bm{\theta},\bm{\gamma}))\right), \tag{19}$$

where the LL solutions $\bm{\mu}^{*}(\bm{\theta},\bm{\gamma})$ and $\bm{\nu}^{*}(\bm{\theta},\bm{\gamma})$, which contain the implicit components $\frac{\partial\bm{\mu}^{*}}{\partial\bm{\theta}}$, $\frac{\partial\bm{\mu}^{*}}{\partial\bm{\gamma}}$, $\frac{\partial\bm{\nu}^{*}}{\partial\bm{\theta}}$, and $\frac{\partial\bm{\nu}^{*}}{\partial\bm{\gamma}}$, can be calculated directly due to the closed form of $\bm{\mu}^{*}$ and $\bm{\nu}^{*}$. As a result, standard gradient-based algorithms can be applied to update $\bm{\theta}$ and $\bm{\gamma}$ with gradients given by (2). In this case, there is no approximation error induced by the LL minimization, in contrast to widely used implicit and unrolled differentiation methods, which require a sufficiently small or decreasing LL optimization error to guarantee convergence (Ji et al., 2021).

One possible problem in computing (3) is that the implicit components can contain expensive matrix inversions. Let us consider a simplified quadratic case in (1b) with $\bm{\mu}^{*}=\arg\min_{\bm{\mu}}\frac{1}{2}\|f(\bm{\theta},x)\bm{\mu}-y\|^{2}$. The closed-form solution takes the form $\bm{\mu}^{*}=[f(\bm{\theta},x)^{\top}f(\bm{\theta},x)]^{-1}f(\bm{\theta},x)^{\top}y$, which can induce a computationally expensive inversion of a possibly large matrix $f(\bm{\theta},x)^{\top}f(\bm{\theta},x)$. To address this problem, one can again formulate the matrix-inverse-vector computation $H^{-1}b$, with $H=f(\bm{\theta},x)^{\top}f(\bm{\theta},x)$ and $b=f(\bm{\theta},x)^{\top}y$, as solving the linear system $Hv=b$, i.e., $\min_{v}\frac{1}{2}v^{\top}Hv-v^{\top}b$, using any optimization method with efficient matrix-vector products.
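A minimal sketch of this idea, assuming a small dense matrix `F` that stands in for $f(\bm{\theta},x)$: conjugate gradient solves $Hv=b$ for a symmetric positive-definite $H$ using only products $v\mapsto Hv$, so $F^{\top}F$ is never formed or inverted. For the least-squares case, the right-hand side handed to the solver is $F^{\top}y$. All names and data below are hypothetical.

```python
def matvec(M, x):
    """Dense matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def rmatvec(M, x):
    """Transposed product M^T x without building M^T."""
    return [sum(M[i][j] * x[i] for i in range(len(M))) for j in range(len(M[0]))]


def dot(u, w):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, w))


def conjugate_gradient(apply_H, b, iters=50, tol=1e-12):
    """Solve H v = b for symmetric positive-definite H given only v -> H v."""
    v = [0.0] * len(b)
    r = list(b)            # residual b - H v for the initial guess v = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Hp = apply_H(p)
        alpha = rs / dot(p, Hp)
        v = [vi + alpha * pi for vi, pi in zip(v, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return v


# Hypothetical data: F plays the role of f(theta, x), y is the target.
F = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]


def apply_H(v):
    """H v = F^T (F v), computed with two matrix-vector products."""
    return rmatvec(F, matvec(F, v))


mu_star = conjugate_gradient(apply_H, rmatvec(F, y))
```

Each iteration costs two matrix-vector products with `F`, which is the property that makes this formulation attractive when $F^{\top}F$ is too large to invert directly.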

Many symbolic costs can be effectively addressed through closed-form solutions. For example, both the linear quadratic regulator (LQR) (Shaiju and Petersen, 2008) and Dijkstra’s algorithm (Dijkstra, 1959) yield a deterministic optimal solution. To demonstrate the effectiveness of IL for closed-form solutions, we next present two examples in path planning that use the neural model to reduce the search and sampling space of symbolic optimization.

Example 1: Path Planning

Path planning is a computational process to determine a path from a starting point to a destination within an environment. It typically involves navigating around obstacles and may also optimize certain objectives such as the shortest distance, minimal energy use, or maximum safety. Path planning algorithms are generally categorized into global planning, which utilizes global maps of the environment, and local planning, which relies on real-time sensory data. We will enhance two widely-used algorithms through IL: A* search for global planning and cubic spline for local planning, both of which offer closed-form solutions.

Example 1.A: Global Path Planning

Background

The A* algorithm is a graph search technique that aims to find the global shortest feasible path between two nodes (Hart et al., 1968). Specifically, A* selects the path passing through the next node $n\in G$ that minimizes

$$C(n)=s(n)+h(n), \tag{20}$$

where $s(n)$ is the cost from the start node to $n$ and $h(n)$ is a heuristic function that predicts the cheapest path cost from $n$ to the goal. The heuristic cost can take the form of various metrics such as the Euclidean and Chebyshev distances. It can be proved that if the heuristic function $h(n)$ is admissible and monotone, i.e., $\forall m\in G,\ h(n)\leq h(m)+d(m,n)$, where $d(m,n)$ is the cost from $m$ to $n$, A* is guaranteed to find the optimal path without searching any node more than once (Dechter and Pearl, 1985). Due to this optimality, A* became one of the most widely used methods in path planning (Paden et al., 2016; Smith et al., 2012; Algfoor et al., 2015).
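The selection rule in (20) can be sketched with a standard priority-queue implementation. The snippet below is a minimal illustration on a 4-connected occupancy grid with unit step costs and a Euclidean heuristic; the grid, the function name `a_star`, and the returned expansion count are assumptions made for the example, not part of the paper's implementation.

```python
import heapq
import math


def a_star(grid, start, goal):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle."""
    def h(n):  # Euclidean heuristic, admissible and monotone for unit steps
        return math.hypot(n[0] - goal[0], n[1] - goal[1])

    s = {start: 0.0}                      # s(n): cost from the start node
    parent = {start: None}
    frontier = [(h(start), start)]        # priority queue ordered by C(n)
    closed = set()
    while frontier:
        _, n = heapq.heappop(frontier)
        if n in closed:
            continue
        closed.add(n)
        if n == goal:
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1], len(closed)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            m = (n[0] + dr, n[1] + dc)
            if not (0 <= m[0] < len(grid) and 0 <= m[1] < len(grid[0])):
                continue
            if grid[m[0]][m[1]] == 1 or m in closed:
                continue
            cost = s[n] + 1.0
            if cost < s.get(m, math.inf):
                s[m] = cost
                parent[m] = n
                heapq.heappush(frontier, (cost + h(m), m))
    return None, len(closed)


grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
path, expanded = a_star(grid, (0, 0), (2, 0))
```

The number of expanded nodes returned alongside the path is the quantity that a better heuristic reduces, which is the efficiency concern discussed next.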

However, A* encounters significant limitations, particularly in its requirement to explore a large number of potential nodes. This exhaustive search process can be excessively time-consuming, especially for low-power robot systems. To address this, recent advancements showed significant potential for enhancing efficiency by predicting a more accurate heuristic cost map using data-driven methods (Choudhury et al., 2018; Yonetani et al., 2021a; Kirilenko et al., 2023). Nevertheless, these algorithms utilize optimal paths as training labels, which face challenges in generalization, leading to a bias towards the patterns observed in the training datasets.

Figure 2: The framework of iA* search. The network predicts a confined search space, leading to overall improved efficiency. The A* search algorithm eliminates the label dependence, resulting in a self-supervised path planning framework.
Approach

To address this limitation, we leverage the IL framework to remove dependence on path labeling. Specifically, we utilize a neural network $f$ to estimate the $h$ value of each node, which can be used to reduce the search space. Subsequently, we integrate a differentiable A* module, serving as the symbolic reasoning engine $g$ in (1b), to determine the most efficient path. This results in an effective framework depicted in Figure 2, which we refer to as the imperative A* (iA*) algorithm. Notably, the iA* framework can operate on a self-supervised basis inherent in the IL framework, eliminating the need for annotated labels.

Specifically, the iA* algorithm is to minimize

$$\begin{aligned}
\min_{\bm{\theta}}\quad & U\left(f(\bm{x},\bm{\theta}),g(\bm{\mu}^{*}),M(\bm{\nu}^{*})\right), && \text{(21a)}\\
\textrm{s.t.}\quad & \bm{\mu}^{*},\bm{\nu}^{*}=\arg\min_{\bm{\mu},\bm{\nu}}L(f,g(\bm{\mu}),M(\bm{\nu})), && \text{(21b)}
\end{aligned}$$

where $\bm{x}$ denotes the inputs including the map, start node, and goal node, $\bm{\mu}$ is the set of paths in the solution space, $\bm{\nu}$ is the accumulated $h$ values associated with a path in $\bm{\mu}$, $\bm{\mu}^{*}$ is the optimal path, $\bm{\nu}^{*}$ is the accumulated $h$ values associated with the optimal path $\bm{\mu}^{*}$, and $M$ is the intermediate $s$ and $h$ maps.

The lower-level cost $L$ is defined as the path length (cost)

$$L(\bm{\mu},M,g)\doteq C_{l}(\bm{\mu}^{*},M), \tag{22}$$

where the optimal path $\bm{\mu}^{*}$ is derived from the A* reasoning $g$ given the node cost maps $M$. Thanks to its closed-form solution, this LL optimization is directly solvable. The UL optimization focuses on updating the network parameter $\bm{\theta}$ to generate the $h$ map. Given the impact of the $h$ map on the search area of A*, the UL cost $U$ is formulated as a combination of the search area $C_{a}(\bm{\mu}^{*},\bm{\nu}^{*})$ and the path length $C_{l}(\bm{\mu}^{*})$. This is mathematically represented as:

$$U(\bm{\mu}^{*},M)\doteq w_{a}C_{a}(\bm{\mu}^{*},\bm{\nu}^{*})+w_{l}C_{l}(\bm{\mu}^{*},M), \tag{23}$$

where $w_{a}$ and $w_{l}$ are the weights to adjust the two terms, and $C_{a}$ is the search area computed from the accumulated $h$ map with the optimal path $\bm{\mu}^{*}$. In the experiments, we define the $s$ cost as the Euclidean distance. The input $\bm{x}$ is represented as a three-channel 2-D tensor, with each channel dedicated to a specific component: the map, the start node, and the goal node. Specifically, the start and goal node channels are represented as one-hot 2-D tensors, where the indices of the “ones” indicate the locations of the start and goal nodes, respectively. This facilitates a more nuanced and effective representation of path planning problems.
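The three-channel input described above can be sketched as follows. Nested Python lists stand in for a tensor, and the channel order (map, start, goal), the obstacle encoding, and the function name `encode_input` are assumptions made for illustration.

```python
def encode_input(grid, start, goal):
    """Stack map, start, and goal into a 3 x H x W nested list."""
    height, width = len(grid), len(grid[0])

    def one_hot(node):
        # All zeros except a single one at the node's (row, col) index.
        channel = [[0.0] * width for _ in range(height)]
        channel[node[0]][node[1]] = 1.0
        return channel

    map_channel = [[float(cell) for cell in row] for row in grid]
    return [map_channel, one_hot(start), one_hot(goal)]


# Hypothetical occupancy grid: 1 marks an obstacle, 0 marks free space.
grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
x = encode_input(grid, start=(0, 0), goal=(2, 0))
```

In practice the nested list would be converted to a framework tensor before being passed to the network $f$.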

Optimization

As shown in (2), once the gradient calculation is completed, we backpropagate the cost $U$ directly to the network $f$. This process is facilitated by the closed-form solution of the A* algorithm. For the sake of simplicity, we employ the differentiable A* algorithm introduced by Yonetani et al. (2021a). It transforms the node selection into an argsoftmax operation and reinterprets node expansion as a series of convolutions, leveraging efficient implementations and AutoDiff in PyTorch (Paszke et al., 2019).

Intuitively, this optimization process involves iterative adjustments. On one hand, the $h$ map enables the A* algorithm to efficiently identify the optimal path, but within a confined search area. On the other hand, the A* algorithm’s independence from labels allows further refinement of the network. This is achieved by back-propagating the search area and length costs through the differentiable A* reasoning. This mutual connection encourages the network to generate increasingly smaller search areas over time, enhancing overall efficiency. As a result, the network tends to focus on more relevant areas, marked by reduced low-level reasoning costs, improving the overall search quality.

Example 1.B: Local Path Planning

Background

End-to-end local planning, which integrates perception and planning within a single model, has recently attracted considerable interest, particularly for its potential to enable efficient inference through data-driven methods such as reinforcement learning (Hoeller et al., 2021; Wijmans et al., 2019; Lee et al., 2024; Ye et al., 2021) and imitation learning (Sadat et al., 2020; Shah et al., 2023b; Loquercio et al., 2021). Despite these advancements, significant challenges persist. Reinforcement learning-based methods often suffer from sample inefficiency and difficulties in directly processing raw, dense sensor inputs, such as depth images. Without sufficient guidance during training, reinforcement learning struggles to converge on an optimal policy that generalizes well across various scenarios or environments. Conversely, imitation learning relies heavily on the availability and quality of labeled trajectories. Obtaining these labeled trajectories is particularly challenging for robotic systems that operate under diverse dynamics models, thereby limiting their broad applicability in flexible robotic systems.

Figure 3: The framework of iPlanner. The higher-level network predicts waypoints, which are interpolated by the lower-level optimization to ensure path continuity and smoothness.
Approach

To address these challenges, we introduce IL to local planning and refer to it as imperative local planning (iPlanner), as depicted in Figure 3. Instead of predicting a continuous trajectory directly, iPlanner uses the network to generate sparse waypoints, which are then interpolated using a trajectory optimization engine based on a cubic spline. This approach leverages the strengths of both neural and symbolic modules: neural networks excel at dynamic obstacle detection, while symbolic modules optimize multi-step navigation strategies under dynamics. By constraining the network to output sparse waypoints rather than continuous trajectories, iPlanner effectively combines the advantages of both modules. Specifically, iPlanner can be formulated as:

$$\begin{aligned}
\min_{\bm{\theta}}\quad & U\left(f(\bm{x},\bm{\theta}),g(\bm{\mu}^{*}),M(\bm{\mu}^{*})\right), && \text{(24a)}\\
\textrm{s.t.}\quad & \bm{\mu}^{*}=\arg\min_{\bm{\mu}\in\mathbb{T}}L(\bm{z},\bm{\mu}), && \text{(24b)}
\end{aligned}$$

where $\bm{x}$ denotes the system inputs, including a local goal position and sensor measurements such as depth images; $\bm{\theta}$ denotes the parameters of the network $f$; $\bm{z}=f(\bm{x},\bm{\theta})$ denotes the generated waypoints, which are subsequently optimized by the path optimizer $g$; and $\bm{\mu}$ represents the set of valid paths within the constrained space $\mathbb{T}$. The optimized path $\bm{\mu}^{*}$ is the optimal solution to the LL cost $L$, which is defined by tracking the intermediate waypoints and the overall continuity and smoothness of the path:

$$L\doteq D(\bm{\mu},\bm{z})+A(\bm{\mu}), \tag{25}$$

where $D(\bm{\mu},\bm{z})$ measures the path continuity based on the 1st-order derivative of the path and $A(\bm{\mu})$ calculates the path smoothness based on the 2nd-order derivative. Specifically, we employ cubic spline interpolation, which has a closed-form solution, to ensure this continuity and smoothness. This is also essential for generating feasible and efficient paths. On the other hand, the UL cost $U$ is defined as:

$$U\doteq w_{G}C^{G}(\bm{\mu}^{*},x_{\text{goal}})+w_{L}C^{L}(\bm{\mu}^{*})+w_{M}M(\bm{\mu}^{*}), \tag{26}$$

where $C^{G}(\bm{\mu}^{*},x_{\text{goal}})$ measures the distance from the endpoint of the generated path to the goal $x_{\text{goal}}$, assessing the alignment of the planned path with the desired destination; $C^{L}(\bm{\mu}^{*})$ represents the motion loss, encouraging the planning of shorter paths to improve overall movement efficiency; $M(\bm{\mu}^{*})$ quantifies the obstacle cost, utilizing information from a pre-built Euclidean signed distance field (ESDF) map to evaluate the path safety; and $w_{G}$, $w_{L}$, and $w_{M}$ are hyperparameters, allowing for adjustments in the planning strategy based on specific performance objectives.

Optimization

During training, we leverage the AutoDiff capabilities provided by PyTorch (Paszke et al., 2019) to solve the BLO in IL. For the LL trajectory optimization, we adopt the cubic spline interpolation implementation provided by PyPose (Wang et al., 2023). It supports differentiable batched closed-form solutions, enabling a faster training process. Additionally, the ESDF cost map is convolved with a Gaussian kernel, enabling a smoother optimization space. This setup allows the loss to be directly backpropagated to update the network parameters in a single step, rather than requiring iterative adjustments. As a result, the UL network and the trajectory optimization can mutually improve each other, enabling a self-supervised learning process.
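To illustrate why a closed-form interpolation keeps the pipeline differentiable, the sketch below densifies sparse waypoints with a Catmull-Rom style cubic Hermite spline. This variant is an assumption chosen for brevity (it is only first-order continuous) and is not claimed to match the PyPose implementation; the function names and sample counts are hypothetical. Every output point is an explicit polynomial of the waypoints, so gradients of a path cost can flow back to the waypoints without an iterative solver.

```python
def hermite_segment(p0, p1, m0, m1, t):
    """Cubic Hermite blend between p0 and p1 with tangents m0 and m1."""
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return tuple(h00 * a + h10 * b + h01 * c + h11 * d
                 for a, b, c, d in zip(p0, m0, p1, m1))


def interpolate(waypoints, samples_per_segment=10):
    """Densify sparse 2-D waypoints into a path that passes through them."""
    n = len(waypoints)
    tangents = []
    for i in range(n):
        # Central difference of neighbours, clamped at both ends.
        prev_pt = waypoints[max(i - 1, 0)]
        next_pt = waypoints[min(i + 1, n - 1)]
        tangents.append(tuple(0.5 * (b - a) for a, b in zip(prev_pt, next_pt)))
    path = []
    for i in range(n - 1):
        for k in range(samples_per_segment):
            t = k / samples_per_segment
            path.append(hermite_segment(waypoints[i], waypoints[i + 1],
                                        tangents[i], tangents[i + 1], t))
    path.append(tuple(float(c) for c in waypoints[-1]))
    return path


# Hypothetical sparse waypoints, e.g. as predicted by the network.
waypoints = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (4.0, 5.0)]
path = interpolate(waypoints)
```

Rewriting the same arithmetic with batched tensors would let AutoDiff propagate the UL cost through the interpolation in a single backward pass.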

4.2 First-order Optimization

We next illustrate the scenario in which the LL cost $L$ in (1b) is minimized by first-order optimizers such as GD. Because GD is a simple differentiable iterative method, one can leverage the unrolled optimization listed in Algorithm 1 to solve the BLO via AutoDiff. It has been theoretically proved by Ji et al. (2021, 2022a) that when the LL problem is strongly convex and smooth, unrolled optimization with LL gradient descent can approximate the hypergradients $\nabla_{\bm{\theta}}$ and $\nabla_{\bm{\gamma}}$ up to an error that decreases linearly with the number of GD steps. For implicit differentiation, we can leverage the optimality conditions of the LL optimization to compute the implicit gradient. To be specific, using the chain rule and the optimality of $\bm{\mu}^{*}$ and $\bm{\nu}^{*}$, we have the stationarity conditions

$$\frac{\partial L}{\partial\bm{\mu}}(\bm{\mu}^{*},\bm{\nu}^{*},\bm{\theta},\bm{\gamma})=\frac{\partial L}{\partial\bm{\nu}}(\bm{\mu}^{*},\bm{\nu}^{*},\bm{\theta},\bm{\gamma})=0, \tag{27}$$

which, by taking differentiation over $\bm{\theta}$ and $\bm{\gamma}$ and noting the implicit dependence of $\bm{\mu}^{*}$ and $\bm{\nu}^{*}$ on $\bm{\theta}$ and $\bm{\gamma}$, yields

$$\begin{aligned}
\underbrace{\frac{\partial^{2}L}{\partial f\partial\bm{\mu}}\frac{\partial f}{\partial\bm{\theta}}}_{A}+\underbrace{\frac{\partial^{2}L}{\partial\bm{\mu}^{2}}}_{B}\frac{\partial\bm{\mu}^{*}}{\partial\bm{\theta}}+\underbrace{\frac{\partial^{2}L}{\partial\bm{\nu}\partial\bm{\mu}}}_{C}\frac{\partial\bm{\nu}^{*}}{\partial\bm{\theta}} &=0, && \text{(28a)}\\
\underbrace{\frac{\partial^{2}L}{\partial f\partial\bm{\nu}}\frac{\partial f}{\partial\bm{\theta}}}_{\widetilde{A}}+\underbrace{\frac{\partial^{2}L}{\partial\bm{\nu}^{2}}}_{\widetilde{B}}\frac{\partial\bm{\nu}^{*}}{\partial\bm{\theta}}+\underbrace{\frac{\partial^{2}L}{\partial\bm{\mu}\partial\bm{\nu}}}_{\widetilde{C}}\frac{\partial\bm{\mu}^{*}}{\partial\bm{\theta}} &=0. && \text{(28b)}
\end{aligned}$$

Assume that the concatenated matrix $\Big[\begin{array}{c}B,\;C\\ \widetilde{C},\;\widetilde{B}\end{array}\Big]$ is invertible at $({\bm{\mu}}^{*},{\bm{\nu}}^{*},{\bm{\theta}},{\bm{\gamma}})$. Then, it can be derived from the linear equations in (28a) and (28b) that the implicit gradients $\frac{\partial{\bm{\mu}}^{*}}{\partial{\bm{\theta}}}$ and $\frac{\partial{\bm{\nu}}^{*}}{\partial{\bm{\theta}}}$ are given by

\begin{equation}
\bigg[\begin{array}{c}\frac{\partial{\bm{\mu}}^{*}}{\partial{\bm{\theta}}}\\ \frac{\partial{\bm{\nu}}^{*}}{\partial{\bm{\theta}}}\end{array}\bigg]=-\bigg[\begin{array}{c}B,\;C\\ \widetilde{C},\;\widetilde{B}\end{array}\bigg]^{-1}\bigg[\begin{array}{c}A\\ \widetilde{A}\end{array}\bigg], \tag{29}
\end{equation}

which, combined with (2), yields the UL gradient $\nabla_{\bm{\theta}}U$ as

\begin{align}
\nabla_{\bm{\theta}}U ={}& \frac{\partial U}{\partial f}\frac{\partial f}{\partial{\bm{\theta}}}+\bigg[\begin{array}{c}\frac{\partial U}{\partial{\bm{\mu}}^{*}}\\ \frac{\partial U}{\partial{\bm{\nu}}^{*}}\end{array}\bigg]^{\top}\bigg[\begin{array}{c}\frac{\partial{\bm{\mu}}^{*}}{\partial{\bm{\theta}}}\\ \frac{\partial{\bm{\nu}}^{*}}{\partial{\bm{\theta}}}\end{array}\bigg] \tag{30e}\\
={}& \frac{\partial U}{\partial f}\frac{\partial f}{\partial{\bm{\theta}}}-\underbrace{\bigg[\begin{array}{c}\frac{\partial U}{\partial{\bm{\mu}}^{*}}\\ \frac{\partial U}{\partial{\bm{\nu}}^{*}}\end{array}\bigg]^{\top}\bigg[\begin{array}{c}B,\;C\\ \widetilde{C},\;\widetilde{B}\end{array}\bigg]^{-1}}_{D}\bigg[\begin{array}{c}A\\ \widetilde{A}\end{array}\bigg], \tag{30l}
\end{align}

where the vector-matrix-inverse-product $D$ can be efficiently approximated by solving a linear system w.r.t. a vector $v$

\begin{equation}
\min_{v}\quad\frac{1}{2}v^{\top}\bigg[\begin{array}{c}B,\;C\\ \widetilde{C},\;\widetilde{B}\end{array}\bigg]v+\bigg[\begin{array}{c}\frac{\partial U}{\partial{\bm{\mu}}^{*}}\\ \frac{\partial U}{\partial{\bm{\nu}}^{*}}\end{array}\bigg]^{\top}v. \tag{31}
\end{equation}

A similar linear system can be derived for computing $\nabla_{\bm{\gamma}}U$. In practice, one first uses a first-order optimizer to solve the lower-level problem in (1b) and obtain approximations $\hat{\bm{\mu}}$ and $\hat{\bm{\nu}}$ of the solutions ${\bm{\mu}}^{*}$ and ${\bm{\nu}}^{*}$. These are then substituted into (30e) to obtain an approximation $\widehat{\nabla}_{\bm{\theta}}U$ of the upper-level gradient $\nabla_{\bm{\theta}}U$ (and similarly $\widehat{\nabla}_{\bm{\gamma}}U$). Finally, the approximate upper-level gradients $\widehat{\nabla}_{\bm{\theta}}U$ and $\widehat{\nabla}_{\bm{\gamma}}U$ are used to optimize the target variables ${\bm{\theta}}$ and ${\bm{\gamma}}$.
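As a minimal sketch of the procedure above, the following pure-Python toy uses scalar variables, an identity "network" $f(\theta)=\theta$, and hand-picked quadratic costs. All of these, and every function name, are assumptions made only for illustration. It solves the lower level with gradient descent, writes the blocks $A,\widetilde{A},B,C,\widetilde{C},\widetilde{B}$ analytically, applies (29) and (30e), and compares the result with a finite-difference estimate.

```python
def f(theta):
    # Hypothetical "neural module": identity map, so df/dtheta = 1.
    return theta

def solve_lower(fv, steps=500, lr=0.5):
    """Gradient descent on the toy LL cost
    L = 0.5*(mu - f)^2 + 0.5*(nu - 3f)^2 + 0.5*mu*nu."""
    mu, nu = 0.0, 0.0
    for _ in range(steps):
        g_mu = (mu - fv) + 0.5 * nu        # dL/dmu
        g_nu = (nu - 3.0 * fv) + 0.5 * mu  # dL/dnu
        mu, nu = mu - lr * g_mu, nu - lr * g_nu
    return mu, nu

def upper(fv, mu, nu):
    # Toy UL cost U = 0.5*(mu + nu - 1)^2 + 0.5*f^2.
    return 0.5 * (mu + nu - 1.0) ** 2 + 0.5 * fv ** 2

def implicit_upper_grad(theta):
    fv = f(theta)
    mu, nu = solve_lower(fv)               # approximations of mu*, nu*
    df = 1.0                               # df/dtheta
    A, A_t = -1.0 * df, -3.0 * df          # blocks A and A-tilde of (28)
    B, C, C_t, B_t = 1.0, 0.5, 0.5, 1.0    # second-order blocks of L
    det = B * B_t - C * C_t
    # Eq. (29): solve the 2x2 system in closed form (Cramer's rule).
    dmu = -(B_t * A - C * A_t) / det
    dnu = -(-C_t * A + B * A_t) / det
    r = mu + nu - 1.0                      # dU/dmu = dU/dnu = r
    # Eq. (30e): direct term plus implicit term.
    return fv * df + r * dmu + r * dnu

def upper_of_theta(theta):
    fv = f(theta)
    mu, nu = solve_lower(fv)
    return upper(fv, mu, nu)

theta = 0.7
g_impl = implicit_upper_grad(theta)
eps = 1e-5
g_fd = (upper_of_theta(theta + eps) - upper_of_theta(theta - eps)) / (2 * eps)
print(g_impl, g_fd)
```

In higher dimensions the closed-form 2x2 inverse would be replaced by an iterative linear solver, which is what the linear system in (31) is intended for.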

Approximation

As mentioned in Section 3.3, one could approximate the solutions in (30e) by assuming the second-order derivatives are small and thus can be ignored. In this case, we can directly use the fully first-order estimates without taking the implicit differentiation into account as

\begin{equation}
\widehat{\nabla}_{\bm{\theta}}U=\frac{\partial U}{\partial f}\frac{\partial f}{\partial{\bm{\theta}}},\quad\widehat{\nabla}_{\bm{\gamma}}U=\frac{\partial U}{\partial M}\frac{\partial M}{\partial{\bm{\gamma}}}, \tag{32}
\end{equation}

which are evaluated at the approximations $\hat{\bm{\mu}}$ and $\hat{\bm{\nu}}$ of the LL solution ${\bm{\mu}}^{*},{\bm{\nu}}^{*}$. We next demonstrate it with an example of inductive logical reasoning using first-order optimization.

Figure 4: The iLogic pipeline, which simultaneously conducts rule induction and high-dimensional data grounding.

Example 2: Inductive Logical Reasoning

Given a few known facts, logical reasoning aims to deduce the truth value of unknown facts using formal logical rules (Iwańska, 1993; Overton, 2013; Halpern, 2013). Different types of logic have been invented to address distinct problems from various domains, including propositional logic (Wang et al., 2019), linear temporal logic (Xie et al., 2021), and first-order logic (FOL) (Cropper and Morel, 2021). Among them, FOL decomposes a logic term into lifted predicates and variables with quantifiers, which gives it strong generalization power: the same rule can be applied to arbitrary entities in any composition. Because of this, FOL has been widely leveraged in knowledge graph reasoning (Yang and Song, 2019) and robotic task planning (Chitnis et al., 2022). However, traditional FOL requires a human expert to carefully design the predicates and rules, which is tedious. Automatically summarizing the FOL predicates and rules is a long-standing problem known as inductive logic programming (ILP). Yet existing works study ILP only on simple structured data such as knowledge graphs. To extend this research into robotics, we will explore how IL can make ILP work with high-dimensional data like RGB images.

Background

One stream of solutions to ILP is based on forward search algorithms (Cropper and Morel, 2021; Shindo et al., 2023; Hocquette et al., 2024). For example, Popper constructs answer set programs based on failures, which can significantly reduce the hypothesis space (Cropper and Morel, 2021). However, because rule search in FOL is combinatorial, search-based methods can become extremely time-consuming as the number of samples grows. Recently, some works introduced neural networks to assist the search process (Yang et al., 2017; Yang and Song, 2019; Yang et al., 2022b) or to represent the rules implicitly (Dong et al., 2019). To name a few, NeuralLP reformulates FOL rule inference as a multi-hop reasoning process, which can be represented as a series of matrix multiplications (Yang et al., 2017). Thus, learning the weight matrix becomes equivalent to inducing the rules. Neural logic machines, on the other hand, use a new FOL-inspired network structure, where the rules are implicitly stored in the network weights (Dong et al., 2019).

Despite their promising results in structured symbolic datasets, such as binary vector representations in BlocksWorld (Dong et al., 2019) and knowledge graphs (Yang et al., 2017), their capability of handling high-dimensional data like RGB images is rarely explored.
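The matrix view of multi-hop reasoning mentioned in this background can be illustrated with a small sketch. It is not the NeuralLP implementation: the toy entities, the `parent` and `friend` relations, and the helper names are all hypothetical, and the weights are fixed by hand rather than learned.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical toy knowledge graph over entities 0, 1, 2.
# rel[x][y] = 1 means the fact rel(x, y) holds.
parent = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # Parent(0,1), Parent(1,2)
friend = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]  # Friend(0,2)
relations = [parent, friend]

def mix(weights):
    """Weighted sum of the relation matrices (a soft choice of relation)."""
    n = len(parent)
    return [[sum(w * r[i][j] for w, r in zip(weights, relations))
             for j in range(n)] for i in range(n)]

def soft_two_hop(w_hop1, w_hop2):
    """Scores for head(x, z) under a soft body rel1(x, y) AND rel2(y, z)."""
    return matmul(mix(w_hop1), mix(w_hop2))

# All weight on Parent at both hops: Grandparent(X, Z) <- Parent(X, Y), Parent(Y, Z).
scores = soft_two_hop([1.0, 0.0], [1.0, 0.0])
print(scores)
```

With these weights the only nonzero score belongs to the pair (0, 2). Making the per-hop weights trainable is what turns rule induction into a differentiable problem in this style of method.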

Figure 5: One snapshot of the visual action prediction task of the LogiCity benchmark. The next actions of each agent are reasoned by our iLogic based on groundings and learned rules.
Approach

To address this gap, we verify the IL framework with the visual action prediction (VAP) task in LogiCity (Li et al., 2024), a recent logical reasoning benchmark. In VAP, a model must simultaneously discover traffic rules, identify the concepts of agents and their spatial relationships, and predict the agents' actions. As shown in Figure 4, we utilize a grounding network $f$ to predict the agent concepts and spatial relationships, which takes observations such as images as the model input. The predicted agent concepts and their relationships are then sent to the reasoning module for rule induction and action prediction. The learned rules from the reasoning process are stored in the memory and retrieved when necessary. For example, the grounding network may output the concepts “IsTiro($A_{1}$)” and “IsPedestrian($A_{2}$)” and their relationship “NextTo($A_{1}$, $A_{2}$)”. The reasoning engine then applies the learned rules stored in the memory module, such as “Slow($X$) $\leftarrow$ ($\exists Y$ IsTiro($X$) $\land$ IsPedestrian($Y$) $\land$ NextTo($X$, $Y$)) $\lor$ …”, to infer the next actions “Slow($A_{1}$)” or “Normal($A_{2}$)”, as displayed in Figure 5. Finally, the predicted actions and the observed actions from the grounding networks are used for loss calculation. In summary, we formulate this pipeline in (33) and refer to it as imperative logical reasoning (iLogic):
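The rule-application step in the example above can be sketched in a few lines of plain Python. This is only an illustration with hard (Boolean) groundings: the data layout and function names are hypothetical, only the single rule clause quoted in the text is encoded, and falling back to "Normal" when the clause does not fire is an assumption.

```python
# Hypothetical grounded facts, as a grounding network might output them.
agents = ["A1", "A2"]
concepts = {"IsTiro": {"A1"}, "IsPedestrian": {"A2"}}
relations = {"NextTo": {("A1", "A2")}}

def slow(x):
    """Slow(X) <- exists Y: IsTiro(X) and IsPedestrian(Y) and NextTo(X, Y)."""
    return x in concepts["IsTiro"] and any(
        y in concepts["IsPedestrian"] and (x, y) in relations["NextTo"]
        for y in agents
    )

# Assumed default: an agent for which the clause does not fire acts "Normal".
actions = {a: ("Slow" if slow(a) else "Normal") for a in agents}
print(actions)
```

In iLogic the groundings are predicted by a network and the rules are learned rather than hand-written, so this hard-coded version only mirrors the inference step at test time.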

\begin{align}
\min_{{\bm{\theta}}}\;\;& U(f({\bm{\theta}},\bm{x}),g({\bm{\mu}}^{*}),M({\bm{\nu}}^{*})), \tag{33a}\\
\operatorname{s.t.}\;\;& {\bm{\mu}}^{*},~{\bm{\nu}}^{*}=\operatorname*{\arg\min}_{{\bm{\mu}},~{\bm{\nu}}}\;L(f,g(M,{\bm{\mu}}),M({\bm{\nu}})). \tag{33b}
\end{align}

Specifically, in the experiments we use the same cross-entropy function for the UL and LL costs, i.e., $U\doteq L\doteq-\sum a_{i}^{o}\log a_{i}^{p}$, where $a_{i}^{o}$ and $a_{i}^{p}$ are the observed and predicted actions of the $i$-th agent, respectively. This results in a self-supervised simultaneous grounding and rule induction pipeline for logical reasoning. Additionally, $f$ can be any grounding network, and we use a feature pyramid network (FPN) (Lin et al., 2017) as the visual encoder and two MLPs for relationships; $g$ can be any logical reasoning engine, and we use a neural logic machine (NLM) (Dong et al., 2019) with parameter ${\bm{\mu}}$; and $M$ can be any memory module storing the learned rules, and we use the MLPs in the NLM parameterized by ${\bm{\nu}}$. More details about the task, model, and list of concepts are presented in Section 5.2.
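For concreteness, the shared cost $-\sum a_{i}^{o}\log a_{i}^{p}$ can be computed as in the sketch below. The encoding is an assumption: each observed action is taken to be a one-hot vector over a hypothetical action set, each prediction a probability vector over the same set, and the small `eps` is added only for numerical safety.

```python
import math

def action_cross_entropy(observed, predicted, eps=1e-12):
    """Sum over agents i of -a_i^o . log(a_i^p).

    observed:  list of one-hot lists, one per agent (assumed encoding)
    predicted: list of probability lists, one per agent
    """
    total = 0.0
    for a_o, a_p in zip(observed, predicted):
        total -= sum(o * math.log(p + eps) for o, p in zip(a_o, a_p))
    return total

# Two agents, two hypothetical actions [Slow, Normal].
observed = [[1, 0], [0, 1]]
predicted = [[0.5, 0.5], [0.25, 0.75]]
loss = action_cross_entropy(observed, predicted)
print(loss)
```

Because the supervision comes only from observed actions, no label for the intermediate concepts or relationships enters this cost.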

Optimization

Given that gradient descent algorithms can update both the grounding networks and the NLM, we can apply the first-order optimization to solve iLogic. The utilization of task-level action loss enables efficient self-supervised training, eliminating the necessity for explicit concept labels. Furthermore, BLO, compared to single-level optimization, helps the model concentrate more effectively on learning concept grounding and rule induction, respectively. This enhances the stability of optimization and decreases the occurrence of sub-optimal outcomes. Consequently, the model can learn rules more accurately and predict actions with greater precision.
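The first-order scheme can be sketched on a toy problem in which, as in iLogic, the UL and LL costs are the same function. Everything here is an illustrative assumption: scalar variables, an identity map $f(\theta)=\theta$, a hand-picked quadratic cost, and hypothetical function names. When $U\doteq L$ and the lower level is solved to stationarity, $\partial U/\partial{\bm{\mu}}^{*}$ and $\partial U/\partial{\bm{\nu}}^{*}$ vanish, so the implicit term in (30e) drops out and the estimate in (32) agrees with the full gradient; the check at the end illustrates this numerically.

```python
def cost(fv, mu, nu):
    # Shared toy cost U = L.
    return 0.5 * (mu - fv) ** 2 + 0.5 * (nu - 1.0) ** 2 + 0.5 * (mu + nu - 3.0) ** 2

def lower_steps(fv, mu, nu, steps, lr=0.3):
    """A few gradient-descent steps on the cost w.r.t. (mu, nu), f held fixed."""
    for _ in range(steps):
        s = mu + nu - 3.0
        g_mu = (mu - fv) + s
        g_nu = (nu - 1.0) + s
        mu, nu = mu - lr * g_mu, nu - lr * g_nu
    return mu, nu

def first_order_grad(theta, mu, nu):
    # Eq. (32): dU/df * df/dtheta with f(theta) = theta and mu, nu held fixed.
    return -(mu - theta) * 1.0

# Alternate LL updates (warm-started) with first-order UL updates.
theta, mu, nu = 0.0, 0.0, 0.0
for _ in range(200):
    mu, nu = lower_steps(theta, mu, nu, steps=20)
    theta -= 0.5 * first_order_grad(theta, mu, nu)

# Compare the first-order estimate with a finite-difference gradient at t0.
def cost_of_theta(t):
    m, n = lower_steps(t, 0.0, 0.0, steps=500)
    return cost(t, m, n)

t0 = 0.5
m0, n0 = lower_steps(t0, 0.0, 0.0, steps=500)
g_first = first_order_grad(t0, m0, n0)
g_fd = (cost_of_theta(t0 + 1e-5) - cost_of_theta(t0 - 1e-5)) / 2e-5
print(theta, g_first, g_fd)
```

If the lower level is solved only roughly, or if $U$ and $L$ differ, the two gradients no longer coincide and (32) is an approximation.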

4.3 Constrained Optimization

We next illustrate the scenarios in which the LL cost L𝐿Litalic_L in (1b) is subject to the general constraints in (1c). We discuss two cases with equality and inequality constraints, respectively. Constrained optimization is a thoroughly explored field, and related findings were presented in (Dontchev et al., 2009) and summarized in (Gould et al., 2021). This study will focus on the integration of constrained optimization into our special form of BLO (1) under the framework of IL.

Equality Constraint

In this case, the constraint in (1c) is

\begin{equation}
\xi(M({\bm{\gamma}},{\bm{\nu}}),{\bm{\mu}},f)=0. \tag{34}
\end{equation}

Recall that $\bm{\phi}=[{\bm{\mu}}^{\top},{\bm{\nu}}^{\top}]^{\top}$ is a vector concatenating the symbol-like variables ${\bm{\mu}}$ and ${\bm{\nu}}$, and $\bm{\psi}=[{\bm{\theta}}^{\top},{\bm{\gamma}}^{\top}]^{\top}$ is a vector concatenating the neuron-like variables ${\bm{\theta}}$ and ${\bm{\gamma}}$. Therefore, the implicit gradient components can be expressed as the derivative of $\bm{\phi}^{*}$ with respect to $\bm{\psi}$, and we have

\begin{equation}
\nabla\bm{\phi}^{*}(\bm{\psi})=\begin{pmatrix}\frac{\partial{\bm{\mu}}^{*}}{\partial{\bm{\theta}}}&\frac{\partial{\bm{\mu}}^{*}}{\partial{\bm{\gamma}}}\\ \frac{\partial{\bm{\nu}}^{*}}{\partial{\bm{\theta}}}&\frac{\partial{\bm{\nu}}^{*}}{\partial{\bm{\gamma}}}\end{pmatrix}. \tag{35}
\end{equation}
Lemma 1.

Assume the LL cost $L(\cdot)$ and the constraint $\xi(\cdot)$ are second-order differentiable near $({\bm{\theta}},{\bm{\gamma}},\bm{\phi}^{*}({\bm{\theta}},{\bm{\gamma}}))$ and the Hessian matrix $H$ below is invertible. Then we have

ϕ(𝝍)=H1Lϕ[LϕH1Lϕ]1(LϕH1Lϕ𝝍L𝝍)H1Lϕ𝝍,superscriptbold-italic-ϕ𝝍superscript𝐻