Verifiably Following Complex Robot Instructions with Foundation Models


Benedict Quartey†∗, Eric Rosen, Stefanie Tellex, George Konidaris
Department of Computer Science, Brown University
†Equal Contribution. ∗Corresponding Author (Email: benedict_quartey@brown.edu)
Abstract

When instructing robots, users want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior, while robots must disambiguate instructions into specifications and ground instruction referents in the real world. To address this problem, we propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot’s alignment with an instructor’s intended motives and affords the synthesis of correct-by-construction robot behaviors. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment in diverse, unstructured domains. LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and additionally achieves a 79% success rate on complex spatiotemporal instructions, significantly outperforming baselines that only reach 38%. (See supplementary materials and demo videos at robotlimp.github.io.)

I Introduction


Robots need a rich understanding of natural language to be instructable by non-experts in unstructured environments. People, on the other hand, need to be able to verify that a robot has understood a given instruction and will act appropriately. Achieving these objectives, however, is challenging as natural language instructions often feature ambiguous phrasing, intricate spatiotemporal constraints, and unique referents. To illustrate, consider the instruction shown in Figure 1: “Bring the green plush toy to the whiteboard in front of it, watch out for the robot in front of the toy”. Solving such a task requires a robot to ground open-vocabulary referents, follow temporal constraints, and disambiguate objects using spatial descriptions. Foundation models [1, 2] offer a path to achieving such complex long-horizon goals; however, existing approaches for robot instruction following have largely focused on navigation [3, 4, 5, 6, 7]. These methods, broadly classified under object goal navigation [8], enable navigation to instances of an object category but are limited in their ability to localize spatial references and disambiguate object instances based on descriptive language. Other works [9, 10, 11] extend instruction following to mobile manipulation but are limited to tasks with simple temporal constraints expressed in unambiguous language. Moreover, existing efforts typically rely on Large Language Models (LLMs) as complete planners, bypassing intermediate symbolic representations that could provide verification of correctness before execution. Alternative approaches leveraging code-writing LLMs [5, 6, 12] are susceptible to errors in generated code, which may lead to unsafe robot behaviors. Mapping natural language to specification languages like temporal logic [13] provides a robust framework for language disambiguation, handling complex temporal constraints, and behavior verification. 
However, prior works along this line require prebuilt semantic maps with discrete sets of prespecified referents/landmarks from which instructions can be constructed [7, 14, 15].

Figure 1: Our approach executing the instruction “Bring the green plush toy to the whiteboard in front of it, watch out for the robot in front of the toy”. The robot dynamically detects and grounds open-vocabulary referents with spatial constraints to construct an instruction-specific semantic map, then synthesizes a task and motion plan to solve the task. In this example, the robot navigates from its start location (yellow, A), to the green plush toy (green, B), executes a pick skill then navigates to the whiteboard (blue, C), and executes a place skill. Note that the robot has no prior semantic knowledge of the environment.

We propose Language Instruction grounding for Motion Planning (LIMP), a method that leverages foundation models and temporal logics to dynamically generate instruction-conditioned semantic maps that enable robots to construct verifiable controllers for following navigation and mobile manipulation instructions with open vocabulary referents and complex spatiotemporal constraints. In a novel environment, LIMP constructs a 3D map via SLAM, then uses LLMs to translate complex natural language instructions into temporal logic specifications with a novel composable syntax for referent disambiguation. Instruction referents are detected and grounded using vision-language models (VLMs) and spatial reasoning. Finally, a task and motion plan is synthesized to guide the robot through the required subgoals, as shown in Figure 1. In summary, we make the following contributions: (1) A modular framework that translates expressive natural language instructions into temporal logic, grounds instruction referents, and executes commands via Task and Motion Planning (TAMP). (2) A spatial grounding method for detecting and localizing open vocabulary objects with spatial constraints in 3D metric maps. (3) A TAMP algorithm that localizes regions of interest (goal/avoidance zones) and synthesizes constraint-satisfying motion plans for long-horizon tasks.

II Background and Related Works

We briefly highlight the most relevant works in visual scene understanding [10], natural language instruction following [7, 16], and task and motion planning [17], and provide a comprehensive review in our supplementary materials. NLMap [10] grounds open-vocabulary language queries to spatial locations using pre-trained VLMs. While effective for describing individual objects, it cannot handle instructions involving complex constraints between multiple objects due to the lack of object relationship modeling. LIMP addresses this with a novel spatial grounding module that resolves spatial relationships and leverages task and motion planners to satisfy these constraints. Lang2LTL [7] is a multi-stage, LLM-based approach that uses entity extraction and replacement to translate language instructions into temporal logic. Its extension [16] incorporates VLMs and semantic information (via text embeddings) to ground referents. These works require prebuilt semantic maps/databases describing landmarks to ground symbols, whereas our approach dynamically generates landmarks based on open-vocabulary instructions. Action-Oriented Semantic Maps (AOSMs) [17] augment semantic maps with models indicating where robots can perform manipulation skills, integrating with TAMP solvers for mobile manipulation. LIMP similarly provides a TAMP-compatible spatial representation but supports open-vocabulary tasks, whereas AOSMs remain constrained to a fixed set of goals once generated.

II-A Linear Temporal Logic

LIMP translates natural language instructions into temporal logic specifications for verifiable task and motion planning. While compatible with various specification languages and planning frameworks, we choose Linear Temporal Logic (LTL) [18] for its proven expressivity in representing complex robot mission requirements [19]. LTL defines temporal properties using atomic propositions; the logical operators negation ($\neg$), conjunction ($\land$), disjunction ($\lor$), and implication ($\rightarrow$); and the temporal operators next ($\mathcal{X}$), until ($\mathcal{U}$), globally ($\mathcal{G}$), and finally ($\mathcal{F}$). Despite its expressivity, LTL has been underutilized due to the expert knowledge required to construct specifications; however, recent works have seen significant success directly translating natural language into LTL [7, 20, 14, 21, 22, 23].
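To make the operator semantics concrete, the following is a minimal sketch of an LTL evaluator over finite traces. It is an illustration only and not part of LIMP: the tuple encoding of formulas, the function name `holds`, the finite-trace reading of next and globally, and the proposition names `near_whiteboard` and `near_robot` are all assumptions made for this example.

```python
from typing import List, Set, Tuple, Union

# A formula is an atomic proposition (str) or a tuple: (operator, operand, ...).
# This encoding is an assumption made for illustration.
Formula = Union[str, Tuple]


def holds(formula: Formula, trace: List[Set[str]], i: int = 0) -> bool:
    """Return True if `formula` holds at step `i` of a finite, non-empty trace."""
    if isinstance(formula, str):  # atomic proposition
        return formula in trace[i]
    op, args = formula[0], formula[1:]
    rest = range(i, len(trace))
    if op == "not":
        return not holds(args[0], trace, i)
    if op == "and":
        return holds(args[0], trace, i) and holds(args[1], trace, i)
    if op == "or":
        return holds(args[0], trace, i) or holds(args[1], trace, i)
    if op == "implies":
        return (not holds(args[0], trace, i)) or holds(args[1], trace, i)
    if op == "X":  # next: a successor step must exist on a finite trace
        return i + 1 < len(trace) and holds(args[0], trace, i + 1)
    if op == "F":  # finally: holds at some step from i onward
        return any(holds(args[0], trace, j) for j in rest)
    if op == "G":  # globally: holds at every step from i onward
        return all(holds(args[0], trace, j) for j in rest)
    if op == "U":  # until: args[1] eventually holds, args[0] holds before that
        return any(
            holds(args[1], trace, j)
            and all(holds(args[0], trace, k) for k in range(i, j))
            for j in rest
        )
    raise ValueError(f"unknown operator: {op}")


# "Eventually reach the whiteboard and never go near the robot."
spec = ("and", ("F", "near_whiteboard"), ("G", ("not", "near_robot")))
safe_trace = [set(), {"near_toy"}, {"near_whiteboard"}]
unsafe_trace = [set(), {"near_robot"}, {"near_whiteboard"}]
print(holds(spec, safe_trace))    # True
print(holds(spec, unsafe_trace))  # False
```

Each trace element is the set of propositions that are true at that step, so the second trace fails the specification because the avoidance constraint is violated at step 1.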

Behavior Verification: Expressing instructions as temporal logic specifications allows us to verify the correctness of generated plans a priori. However, instead of explicit verification methods such as model checking, we leverage insights from prior works [24] and directly use specifications to synthesize plans that are correct-by-construction [25, 26].

III Problem Definition

Given a natural language instruction $l$, our goal is to synthesize and sequence navigation and manipulation behaviors to produce a policy that satisfies the temporal and spatial constraints in $l$. Spatial constraints determine task success based on the sequence of robot poses traversed during execution; temporal constraints determine the sequencing of these spatial constraints as a function of task progression. We assume a robot with an RGB-D camera has already navigated a space, capturing images and camera poses. From this data, we build a metric map $m$ (e.g., point cloud, 3D voxel grid) of the environment, defining the space of possible $SE(3)$ poses $P$ and enabling robot localization (i.e., estimating $p_{\text{robot}} \in P$). Unlike previous work leveraging temporal logic [7], we do not assume access to a semantic map with prespecified object locations or predicates. Instead, we leverage two foundation models: a task-agnostic vision-language model $\sigma$ that, given an image and text, provides bounding boxes or segmentations based on the text; and an auto-regressive large language model $\psi$ that samples likely language tokens based on a history of tokens.

Navigation: Navigation is formalized as an object-goal oriented continuous path planning problem, where the goal is to generate paths to a goal pose set $P_{goals} \subset P$ while staying in feasible regions ($P_{feasible} \subset P$) and avoiding infeasible regions ($P_{infeasible} = P_{feasible}^{C}$). Infeasible regions include environment obstacles as well as dynamically determined semantic regions that violate constraints in the instruction $l$.

Manipulation: We formalize manipulation behaviors as options [27] parameterized by objects. Consider an object parameter $\theta$ that parameterizes an option $O_{\theta} = (I_{\theta}, \pi_{\theta}, \beta_{\theta})$; the initiation set, policy, and termination condition are functions of both the robot pose $P$ and $\theta$. The initiation set $I_{\theta}$ denotes the global reference frame robot positions and object-centric attributes, such as object size, that determine if the option policy $\pi_{\theta}$ can be executed on the object $\theta$. To execute a manipulation skill on an object, an object-goal navigation behavior must first be executed to bring the robot into proximity with the object. We assume access to a library of these manipulation skills and demonstrate our approach on multi-object goal navigation and open-vocabulary mobile pick-and-place [9, 28].

IV Language Instruction Grounding for Motion Planning

Figure 2: [A] LIMP translates natural language instructions into temporal logic expressions, where open-vocabulary referents are applied to predicates that correspond to robot skills (note the context-aware resolution of the phrase “blue one” to the referent “blue_sofa”). [B] Vision-language models detect referents, while spatial reasoning disambiguates referent instances to generate a 3D semantic map that localizes instruction-specific referents. [C] Finally, the temporal logic expression is compiled into a finite-state automaton, which a task and motion planner uses with dynamically generated task progression semantic maps to progressively identify goals and constraints in the environment, and generate a plan that satisfies the high-level task specification.

LIMP interprets expressive natural language instructions to generate instruction-conditioned semantic maps, enabling robots to verifiably solve long-horizon tasks with complex spatiotemporal constraints (Figure 2). We briefly describe our modular approach in this section and present comprehensive implementation details in our supplementary materials.

IV-A Language Instruction Module

In this module, we leverage a large language model $\psi$ to translate a natural language instruction $l$ into a linear temporal logic specification $\varphi_{l}$ with a novel composable syntax for referent disambiguation. We achieve this through a two-stage in-context learning strategy. The first stage prompts $\psi$ to translate $l$ into a conventional LTL formula $\phi_{l}$ where propositions refer to open-vocabulary objects. The second stage takes $l$ and $\phi_{l}$ as input and prompts $\psi$ to generate a new formula $\varphi_{l}$ with predicate functions corresponding to parameterized robot skills.

We define three predicate functions (near, pick, and release) for the primitive navigation and manipulation skills required for multi-object goal navigation and mobile pick-and-place. Predicate functions in $\varphi_{l}$ are parameterized by Composable Referent Descriptors (CRDs), our novel propositional expressions representing specific referent instances by chaining comparators that encode descriptive spatial information. For example, the instruction “the yellow cabinet above the fridge that is next to the stove” can be represented with the CRD:

$$\text{yellow\_cabinet}::\text{isabove}(\text{fridge}::\text{isnextto}(\text{stove})). \tag{1}$$

This specifies that there is a fridge next to a stove, and the desired yellow cabinet is above that fridge. CRDs are constructed from a set of 3D spatial comparators [29] defined in our prompting strategy.
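The nesting in a CRD can be made explicit by parsing it into a tree. The following is a minimal sketch of such a parser, not the parser used in LIMP: the `CRD` dataclass, the function names, and the assumption that multi-argument comparators such as isbetween separate their arguments with commas are all hypothetical choices made for this illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CRD:
    """A referent name plus zero or more (comparator, argument CRDs) constraints."""
    name: str
    constraints: List[Tuple[str, List["CRD"]]] = field(default_factory=list)


_TOKEN = re.compile(r"\s*(::|[(),]|[A-Za-z_][A-Za-z0-9_]*)")


def tokenize(text: str) -> List[str]:
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _expect(tokens: List[str], pos: int, expected: str = None) -> str:
    """Return tokens[pos]; require `expected`, or an identifier if it is None."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of descriptor")
    token = tokens[pos]
    if expected is not None and token != expected:
        raise ValueError(f"expected {expected!r}, got {token!r}")
    if expected is None and not (token[0].isalpha() or token[0] == "_"):
        raise ValueError(f"expected a name, got {token!r}")
    return token


def _parse(tokens: List[str], pos: int) -> Tuple[CRD, int]:
    crd = CRD(_expect(tokens, pos))
    pos += 1
    while pos < len(tokens) and tokens[pos] == "::":
        comparator = _expect(tokens, pos + 1)
        _expect(tokens, pos + 2, "(")
        pos += 3
        args = []
        while True:
            arg, pos = _parse(tokens, pos)  # arguments may themselves be CRDs
            args.append(arg)
            if pos < len(tokens) and tokens[pos] == ",":
                pos += 1
                continue
            _expect(tokens, pos, ")")
            pos += 1
            break
        crd.constraints.append((comparator, args))
    return crd, pos


def parse_crd(text: str) -> CRD:
    tokens = tokenize(text)
    crd, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected trailing token {tokens[pos]!r}")
    return crd


cabinet = parse_crd("yellow_cabinet::isabove(fridge::isnextto(stove))")
comparator, (fridge,) = cabinet.constraints[0]
print(cabinet.name, comparator, fridge.name, fridge.constraints[0][0])
# yellow_cabinet isabove fridge isnextto
```

Parsing into a tree means the innermost descriptor (the stove) can be grounded first, after which each enclosing comparator filters the candidates of its parent referent.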

Unlike recent works [9, 11], our approach does not require specific phrasing or keywords and can handle instructions with arbitrary complexity and ambiguity. The LLM $\psi$ directly samples the entire LTL formula $\varphi_{l}$ with predicate functions parameterized by CRDs using appropriate spatial comparators based on the instruction’s context. Figure 3 illustrates the result of our two-stage prompting strategy.


Figure 3: An instruction is first translated into a conventional LTL formula $\phi_{l}$ that loosely captures the desired temporal occurrence of referent objects, then into our LTL syntax $\varphi_{l}$ with predicate functions that temporally chain required robot skills parameterized by composable referent descriptors.

LLM Verification: Verifying the LTL formula $\varphi_{l}$ sampled from the LLM is crucial, as errors in referent extraction and temporal task structure affect instruction-following accuracy. Our symbol verification node (Figure 2) leverages LTL properties to provide high-level human-in-the-loop verification of extracted instruction referents and temporal task structure. Recent work [30] provides ISO 61508 [31] safety guarantees in robot task execution by translating safety constraints from natural language to LTL formulas, which are verified by human experts and used to enforce robot behavior. Similarly, we rely on human verification to ensure the translated formula $\varphi_{l}$ is correct. Our symbol verification node implements an interactive dialog system that presents users with the extracted referent CRDs and implied task structure, and reprompts the LLM based on user corrections to obtain new formulas. Unlike prior work [30], we eliminate the need for experts by directly translating the task structure, encoded in the LTL formula’s equivalent automaton, back into English statements via a simple deterministic translation scheme. In our experiments (Tables I and II), we find that even without human verification and reprompting, the initial formulas sampled by our language understanding module impressively encode the correct referents and temporal task structure.

IV-B Spatial Grounding Module

Figure 4: [A] Our spatial grounding module leverages a VLM to detect all referent occurrences from prior observations of the environment. [B] An initial semantic map with all detected referent instances is generated by backprojecting pixels in segmented referent masks onto the 3D map. [C] Each referent’s spatial comparators are resolved with respect to the origin coordinate frame of reference. [D] Failing instances are filtered out to obtain a Referent Semantic Map (RSM) that localizes the exact referent instances described in the instruction.

This module detects and localizes specific object instances referenced in an instruction. From the translated LTL formulas, we extract composable referent descriptors (CRDs) and use vision-language models OWL-ViT [32] and SAM [33] to detect and segment all referent occurrences from the robot’s prior observations of the environment. We backproject pixels in these segmentation masks onto our 3D map, creating an initial semantic map of all instruction object instances. From the example in Figure 3, occurrences of green_plush_toy, whiteboard, and robot are detected, segmented, and backprojected onto the map (Figure 4[a&b]).
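The backprojection step can be illustrated with the standard pinhole camera model. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the function name `backproject`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the 4x4 row-major `cam_to_world` pose matrix are hypothetical names chosen for the example.

```python
from typing import List, Tuple


def backproject(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: List[List[float]],
) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the map frame.

    Assumes a pinhole camera with focal lengths (fx, fy), principal point
    (cx, cy), and a 4x4 homogeneous camera-to-world transform.
    """
    # Point in the camera frame (z points along the optical axis).
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    point = (x_cam, y_cam, depth, 1.0)
    # Apply the first three rows of the homogeneous transform.
    return tuple(
        sum(cam_to_world[row][col] * point[col] for col in range(4))
        for row in range(3)
    )


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# A pixel at the principal point, 2 m away, lies on the optical axis.
print(backproject(320, 240, 2.0, 600.0, 600.0, 320.0, 240.0, identity))
# (0.0, 0.0, 2.0)
```

Applying this to every pixel inside a segmentation mask, using the depth image and the camera pose recorded for that frame, would yield the 3D points that populate the initial semantic map.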

To obtain the specific object instances described in the instruction, we resolve the 3D spatial comparators in each referent’s CRD (recall that CRDs are propositional expressions and can be evaluated as true or false). We define eight spatial comparators (isbetween, isabove, isbelow, isleftof, isrightof, isnextto, isinfrontof, isbehind) to reason about spatial relationships based on backprojected 3D positions. Since all backprojected positions are relative to an origin coordinate system, our spatial comparators are resolved from the perspective of this origin position as shown in Figure 4[c]. This type of relative frame of reference (FoR) when describing spatial relationships between objects, in contrast to an absolute or intrinsic FoR, is dominant in English [34], and is a logical choice for our work.

Using the 3D position of each referent’s center mask pixel as its representative position, we resolve a given referent with a spatial description by applying the appropriate spatial comparator to all detected pairs of the desired referent and comparison landmark objects. This filtering process yields a Referent Semantic Map (RSM) that localizes specific object instances described in the instruction as shown in Figure 4[d].
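The filtering step above can be sketched as follows. The comparator definitions here are assumptions made for illustration and are not the definitions used in LIMP: we assume an origin-centered frame with x to the right, y pointing away from the origin viewpoint, and z up, and a hypothetical 1.5 m distance threshold for isnextto.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the origin frame


# Illustrative comparators; each asks whether `a` stands in the relation to `b`.
def isabove(a: Point, b: Point) -> bool:
    return a[2] > b[2]


def isleftof(a: Point, b: Point) -> bool:
    return a[0] < b[0]


def isinfrontof(a: Point, b: Point) -> bool:
    # "In front" from the origin's perspective: closer to the origin viewpoint.
    return a[1] < b[1]


def isnextto(a: Point, b: Point, threshold: float = 1.5) -> bool:
    return math.dist(a, b) <= threshold


def resolve_referent(
    candidates: List[Point],
    landmarks: List[Point],
    comparator: Callable[[Point, Point], bool],
) -> List[Point]:
    """Keep candidates for which the comparator holds with some landmark."""
    return [
        candidate
        for candidate in candidates
        if any(comparator(candidate, landmark) for landmark in landmarks)
    ]


# "the robot in front of the toy": two detected robots, one detected toy.
toys = [(0.0, 2.0, 0.0)]
robots = [(0.0, 1.0, 0.0), (0.0, 3.0, 0.0)]
print(resolve_referent(robots, toys, isinfrontof))  # [(0.0, 1.0, 0.0)]
```

For a nested CRD, the landmark list passed in would itself be the output of resolving the inner descriptor, so filtering proceeds from the innermost referent outward.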

VLM Verification: Potential misclassifications from object detector VLMs are the main source of error in this module. We do not address interactively correcting VLM misclassifications, as that is out of the scope of this work, but we provide 3D visualization tools that enable users to visually inspect and verify that constructed referent semantic maps correctly localize referents.

IV-C Task and Motion Planning Module

Figure 5: [A] A given instruction translated into our LTL syntax $\varphi_{l}$ can be compiled into an equivalent finite-state automaton that captures the temporal constraints of the task. A path through this automaton is selected with a strategy that incrementally picks the next progression state from the initial state to the accepting state. The robot then executes the manipulation options and navigation behaviors dictated by this high-level task plan. [B] To execute navigation objectives, our approach generates a task progression semantic map (TPSM) that augments the environment with state transition constraints, localizing goal (yellow) and avoidance (red) regions. Generated TPSMs are converted into 2D obstacle maps for constraint-aware continuous path planning.

Finally, our TAMP module synthesizes and sequences navigation and manipulation behaviors to produce a plan that satisfies the temporal and spatial constraints expressed in the given instruction.

Progressive Motion Planner (PMP): Our TAMP algorithm compiles the LTL formula with parameterized robot skills into an equivalent finite-state automaton (Figure 5[a]) to generate a verifiably correct task and motion plan. A path from the initial to the accepting state in this automaton is a high-level task plan that interleaves navigation and manipulation objectives required to satisfy the instruction. We select such a path with a simple strategy that incrementally selects the next progression state until the accepting state is reached, ensuring the plan obeys all temporal subgoal objectives. As shown in Figure 5[a], automaton states are connected by transition edges representing the logical expressions required for transitions. For each transition, our algorithm executes the necessary low-level behaviors: for manipulation subgoals, it executes the appropriate parameterized skill; for navigation subgoals, it dynamically generates Task Progression Semantic Maps (TPSMs) to localize goal and constraint regions and performs continuous path planning using the Fast Marching Tree algorithm (FMT*) [35].
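Selecting a path from the initial to the accepting state can be sketched as a graph search over the automaton. The example below uses breadth-first search as a stand-in for the path-selection strategy, which is an assumption on our part; the dictionary encoding of the automaton, the state names, and the edge-label strings are likewise hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

# state -> list of (edge label, next state); an illustrative encoding.
Automaton = Dict[str, List[Tuple[str, str]]]


def select_task_plan(
    transitions: Automaton, initial: str, accepting: Set[str]
) -> Optional[List[str]]:
    """Return edge labels along a shortest path to an accepting state, or None."""
    queue = deque([(initial, [])])
    visited = {initial}
    while queue:
        state, plan = queue.popleft()
        if state in accepting:
            return plan
        for label, next_state in transitions.get(state, []):
            # Self-loops and already-reached states make no progress; skip them.
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, plan + [label]))
    return None


# Hypothetical automaton for a fetch-and-deliver instruction.
automaton = {
    "q0": [("!near(robot)", "q0"), ("near(green_plush_toy)", "q1")],
    "q1": [("pick(green_plush_toy)", "q2")],
    "q2": [("near(whiteboard)", "q3")],
    "q3": [("release(green_plush_toy)", "q4")],
}
print(select_task_plan(automaton, "q0", {"q4"}))
# ['near(green_plush_toy)', 'pick(green_plush_toy)', 'near(whiteboard)', 'release(green_plush_toy)']
```

Each returned label is a subgoal: labels with a near predicate would be handed to the navigation planner, and pick or release labels to the corresponding manipulation skill.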

Task Progression Semantic Maps (TPSM): A TPSM augments a 3D scene with navigation constraints specified by logical state transition expressions, enabling goal-directed, constraint-aware navigation. Regions of interest in a TPSM are defined using a nearness threshold specifying proximity to an object. This threshold can be set globally or included in the language instruction module’s prompting strategy, allowing an LLM to infer its value based on the instruction. Like our spatial grounding module, TPSMs are agnostic to temporal logic representations and can be used with various planning approaches for semantic constraint-aware motion planning. We primarily evaluate our approach on a ground mobile robot, hence we transform 3D TPSMs into 2D geometric obstacle maps, where constraint regions are treated as obstacles (Figure 5[b]). However, our approach is robot-agnostic and supports direct planning in 3D TPSMs for appropriate embodiments like drones.
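The conversion from a TPSM to a 2D obstacle map can be sketched by thresholding distance to referent positions on a grid. This is a simplified illustration and not the implementation used in this work: the function name, the grid parameters, the use of single 2D points to represent referents, and treating the nearness threshold as a Euclidean radius are all assumptions.

```python
import math
from typing import List, Tuple

Point2D = Tuple[float, float]
Grid = List[List[bool]]


def build_obstacle_map(
    width: int, height: int, resolution: float,
    avoid_xy: List[Point2D], goal_xy: List[Point2D], threshold: float,
) -> Tuple[Grid, Grid]:
    """Mark grid cells within `threshold` metres of avoid or goal referents.

    Returns (obstacles, goals). Avoidance regions take priority, so a cell
    near both kinds of referent is treated as an obstacle.
    """
    obstacles = [[False] * width for _ in range(height)]
    goals = [[False] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Metric coordinates of the cell centre.
            centre = ((col + 0.5) * resolution, (row + 0.5) * resolution)
            if any(math.dist(centre, a) <= threshold for a in avoid_xy):
                obstacles[row][col] = True
            elif any(math.dist(centre, g) <= threshold for g in goal_xy):
                goals[row][col] = True
    return obstacles, goals


# 4 m x 4 m area at 1 m resolution: avoid a referent in one corner,
# reach a referent in the opposite corner.
obstacles, goals = build_obstacle_map(
    4, 4, 1.0, avoid_xy=[(0.5, 0.5)], goal_xy=[(3.5, 3.5)], threshold=0.6
)
print(obstacles[0][0], goals[3][3])  # True True
```

A path planner would then treat cells marked in `obstacles` as infeasible, alongside geometric obstacles from the map, and terminate when it reaches any cell marked in `goals`.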

V Evaluation

Our evaluations test the hypothesis that translating natural language instructions into LTL expressions and dynamically generating semantic maps enables robots to accurately interpret and execute instructions in large-scale environments without prior training. We focus on three key questions: (1) Can our language instruction module interpret complex, ambiguous instructions? (2) Can our spatial grounding module resolve specific object instances described in instructions? (3) Can our TAMP algorithm generate constraint-satisfying plans?

To answer these questions, we compare LIMP with two baselines: an LLM task planner (NLMap-Saycan [10]) and an LLM code-writing planner (Code-as-Policies [12]), representing state-of-the-art approaches for language-conditioned, open-ended robot task execution. Both baselines use the same LLM (GPT-4-0613), prompting structure, and in-context learning examples as our language instruction module. In Code-as-Policies, in-context examples are converted into language model-generated program (LMP) snippets [12]. To ensure competitive performance, we integrate our CRD syntax, spatial grounding module, and low-level robot control into these baselines, allowing them to query object positions, use our FMT path planner, and execute manipulation skills.

We also evaluate ablations of our two-stage language instruction module due to its importance in instruction following. In our full approach, the first stage prompts an LLM to generate a conventional LTL formula $\phi_{l}$ from instruction $l$ by dynamically selecting relevant in-context examples from a standard dataset [14] based on cosine similarity. Our first ablation selects in-context examples randomly, and the second ablation skips this stage entirely, directly sampling our LTL syntax with parameterized robot skills $\varphi_{l}$ from $l$.

We conduct a large-scale evaluation across five real-world environments on a diverse task set of 150 instructions from multiple prior works [10, 11, 7]. This task set consists of 24 tasks with fine-grained object descriptions (NLMD), 25 tasks with complex language (NLMC), 25 tasks with simple structured phrasing (OKRB), 37 tasks with complex temporal structures (CT) and 39 tasks with descriptive spatial constraints and temporal structures (CST). Below are examples from each task category illustrating the variety in complexity:

1. NLMD: Put the brown multigrain chip bag in the woven basket
2. NLMC: I like fruits, can you put something I would like on the yellow sofa for me
3. OKRB: Move the soda can to the box
4. CT: Visit the purple door elevator, then go to the front desk and then go to the kitchen table, in addition you can never go to the elevator once you have seen the front desk
5. CST: I have a white cabinet, a green toy, a bookshelf and a red chair around here somewhere. Take the second item I mentioned from between the first item and the third. Bring it the cabinet but avoid the last item at all costs.

To evaluate instruction understanding, we introduce performance metrics: referent resolution accuracy, avoidance constraint resolution accuracy, and spatial relationship resolution accuracy. These metrics utilize the word error rate (WER), widely used in speech recognition to quantify the difference between a reference and a hypothesis transcription by computing the minimal number of substitutions, deletions, and insertions needed to transform the hypothesis into the reference. WER is calculated as $\text{WER} = \frac{S + D + I}{N}$, where $S$ is the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, and $N$ the total number of words in the reference. In our work:

  • Referent resolution accuracy compares extracted referents in the generated LTL formula to ground truth referents.

  • Avoidance constraint resolution accuracy compares referents to avoid in the LTL formula (denoted by the unary negation operator) against ground truth avoidance referents.

  • Spatial relationship resolution accuracy compares generated Composable Referent Descriptors (CRDs) in the LTL specification with ground truth CRD expressions.
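As a concrete illustration of the metric, the sketch below computes WER with the standard edit-distance dynamic program. Treating each extracted referent as a single token is an assumption made for this example; the exact tokenization used for the three metrics above is not specified here.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (S + D + I) / N via token-level Levenshtein distance."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one token")
    # dist[i][j]: minimal edits between reference[:i] and hypothesis[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i deletions
    for j in range(m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m] / n


ground_truth = ["green_plush_toy", "whiteboard", "robot"]
extracted = ["green_plush_toy", "whiteboard"]  # one referent was missed
print(round(word_error_rate(ground_truth, extracted), 2))  # 0.33
```

A WER of 0 means the extracted referents match the ground truth exactly, which is why lower values are better in Table I.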

We also define temporal alignment accuracy and planning success rate. A plan is temporally aligned if the sequence of subgoals matches the instructor’s intention, and successful if it satisfies all spatial and temporal constraints specified in the instruction. Achieving a high plan success rate is challenging, requiring accurate referent and avoidance constraint resolution, spatial grounding, and temporal alignment. We report average word error rates for each baseline in Table I and the percentage of successful and temporally accurate plans in Table II.

TABLE I: Performance comparison of one-shot instruction understanding and spatial resolution.

| Approach | Referent Resolution Accuracy (Average WER) ↓ | Avoidance Constraint Resolution Accuracy (Average WER) ↓ | Spatial Relationship Resolution Accuracy (Average WER) ↓ |
| --- | --- | --- | --- |
| NLMap-Saycan | 0.09 | 0.12 | 0.05 |
| Code-as-Policies | 0.22 | 0.24 | 0.06 |
| LIMP Single Stage Prompting | 0.09 | 0.11 | 0.05 |
| LIMP Two Stage Prompting [Random Embedding] | 0.08 | 0.11 | 0.03 |
| LIMP Two Stage Prompting [Similar Embedding] | 0.07 | 0.04 | 0.03 |

TABLE II: Performance comparison of one-shot temporal alignment and plan success rate. "Align." columns report Temporal Alignment Accuracy (% of instructions) ↑; "Success" columns report Planning Success Rate (% of instructions) ↑.

| Approach | Align. NLMD | Align. NLMC | Align. OKRB | Align. CT | Align. CST | Success NLMD | Success NLMC | Success OKRB | Success CT | Success CST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLMap-Saycan | 88% | 96% | 100% | 32% | 41% | 75% | 96% | 100% | 35% | 38% |
| Code-as-Policies | 58% | 68% | 100% | 35% | 38% | 46% | 68% | 100% | 38% | 38% |
| LIMP Single Stage Prompting | 79% | 64% | 100% | 68% | 74% | 63% | 60% | 100% | 62% | 62% |
| LIMP Two Stage Prompting [Random Embedding] | 83% | 68% | 100% | 76% | 85% | 79% | 68% | 100% | 57% | 72% |
| LIMP Two Stage Prompting [Similar Embedding] | 88% | 80% | 100% | 76% | 92% | 79% | 76% | 100% | 65% | 79% |

VI Discussion

Beyond the verification benefits of symbolic planning, LIMP outperforms baselines in most task sets, notably in complex temporal planning and constraint avoidance. While NLMap-Saycan and Code-as-Policies effectively generate sequential subgoals, they struggle with strict temporal constraints—for example, avoiding a specific referent while approaching another. Our approach ensures each robot step adheres to constraints while achieving subgoals, explaining LIMP’s superior performance on CT and CST tasks. As shown in Table II, LIMP underperforms only against NLMap-Saycan in the NLMC task category. This task set, introduced in the same paper as the baseline [10] (which outperforms LIMP), includes instructions with implicit details such as: “I like fruits, can you put something I would like on the yellow sofa for me.” NLMap-Saycan is better suited to infer and generate plans with possible fruit options, whereas our few-shot LTL translation process is not designed for this.

VII Limitations and Conclusion

Although LIMP is capable of interpreting non-finite instructions into LTL formulas, our planner is currently limited to processing co-safe formulas, which handle only finite sequences. The accuracy of spatial grounding relies on the performance of vision-language models (VLMs) for object recognition, meaning any shortcomings in these systems can negatively impact results. Additionally, LIMP assumes a static environment between mapping and execution, making it unresponsive to dynamic changes, an area we aim to address with future work on editable scene representations. Our Progressive Motion Planning algorithm is complete but does not guarantee optimality; however, our framework can be used with existing TAMP planners to enhance efficiency.

Foundation models hold significant promise for advancing the next generation of autonomous robots. Our results suggest that combining these models—LLMs for language and VLMs for vision—with established methods for safety, explainability, and verifiable behavior synthesis can lead to more reliable and capable robotic systems.

Acknowledgement

This work was supported by the Office of Naval Research (ONR) under REPRISM MURI N000142412603 and ONR #N00014-22-1-2592, as well as the National Science Foundation (NSF) via grant #1955361. Partial funding was also provided by The Robotics and AI Institute.

References

  • [1] Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y. Xie, T. Zhang, H.-S. Fang, S. Zhao, S. Omidshafiei, D.-K. Kim, A.-a. Agha-mohammadi, K. Sycara, M. Johnson-Roberson, D. Batra, X. Wang, S. Scherer, C. Wang, Z. Kira, F. Xia, and Y. Bisk, “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis,” Oct. 2024, arXiv:2312.08782 [cs]. [Online]. Available: http://arxiv.org/abs/2312.08782
  • [2] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,” The International Journal of Robotics Research, p. 02783649241281508, Sept. 2024, publisher: SAGE Publications Ltd STM. [Online]. Available: https://doi.org/10.1177/02783649241281508
  • [3] S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, June 2023, pp. 23 171–23 181. [Online]. Available: https://ieeexplore.ieee.org/document/10203853/
  • [4] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action,” in Proceedings of The 6th Conference on Robot Learning.   PMLR, Mar. 2023, pp. 492–504, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v205/shah23b.html
  • [5] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual Language Maps for Robot Navigation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 10 608–10 615. [Online]. Available: https://ieeexplore.ieee.org/document/10160969
  • [6] ——, “Audio Visual Language Maps for Robot Navigation,” in Experimental Robotics, M. H. Ang Jr and O. Khatib, Eds.   Cham: Springer Nature Switzerland, 2024, pp. 105–117.
  • [7] J. X. Liu, Z. Yang, I. Idrees, S. Liang, B. Schornstein, S. Tellex, and A. Shah, “Grounding Complex Natural Language Commands for Temporal Tasks in Unseen Environments,” in Proceedings of The 7th Conference on Robot Learning.   PMLR, Dec. 2023, pp. 1084–1110, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v229/liu23d.html
  • [8] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir, “On Evaluation of Embodied Navigation Agents,” July 2018, arXiv:1807.06757 [cs]. [Online]. Available: http://arxiv.org/abs/1807.06757
  • [9] S. Yenamandra, A. Ramachandran, K. Yadav, A. S. Wang, M. Khanna, T. Gervet, T.-Y. Yang, V. Jain, A. Clegg, J. M. Turner, Z. Kira, M. Savva, A. X. Chang, D. S. Chaplot, D. Batra, R. Mottaghi, Y. Bisk, and C. Paxton, “HomeRobot: Open-Vocabulary Mobile Manipulation,” in Proceedings of The 7th Conference on Robot Learning.   PMLR, Dec. 2023, pp. 1975–2011, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v229/yenamandra23a.html
  • [10] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler, “Open-vocabulary Queryable Scene Representations for Real World Planning,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 11 509–11 522. [Online]. Available: https://ieeexplore.ieee.org/document/10161534
  • [11] P. Liu, Y. Orru, J. Vakil, C. Paxton, N. Shafiullah, and L. Pinto, “Demonstrating OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics,” in Robotics: Science and Systems.   Robotics: Science and Systems Foundation, July 2024. [Online]. Available: http://www.roboticsproceedings.org/rss20/p091.pdf
  • [12] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as Policies: Language Model Programs for Embodied Control,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 9493–9500. [Online]. Available: https://ieeexplore.ieee.org/document/10160591
  • [13] E. A. Emerson, “Temporal and Modal Logic,” in Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, J. v. Leeuwen, Ed.   Elsevier and MIT Press, 1990, pp. 995–1072. [Online]. Available: https://doi.org/10.1016/b978-0-444-88074-1.50021-4
  • [14] J. Pan, G. Chou, and D. Berenson, “Data-Efficient Learning of Natural Language to Linear Temporal Logic Translators for Robot Task Specification,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 11 554–11 561. [Online]. Available: https://ieeexplore.ieee.org/document/10161125
  • [15] B. Quartey, A. Shah, and G. Konidaris, “Exploiting Contextual Structure to Generate Useful Auxiliary Tasks,” in NeurIPS 2023 Workshop on Generalization in Planning, vol. abs/2303.05038, 2023, arXiv: 2303.05038. [Online]. Available: https://doi.org/10.48550/arXiv.2303.05038
  • [16] J. X. Liu, A. Shah, G. Konidaris, S. Tellex, and D. Paulius, “Lang2LTL-2: Grounding Spatiotemporal Navigation Commands Using Large Language and Vision-Language Models,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2024, pp. 2325–2332, iSSN: 2153-0866. [Online]. Available: https://ieeexplore.ieee.org/document/10802696
  • [17] E. Rosen, S. James, S. Orozco, V. Gupta, M. Merlin, S. Tellex, and G. Konidaris, “Synthesizing Navigation Abstractions for Planning with Portable Manipulation Skills,” in Proceedings of The 7th Conference on Robot Learning.   PMLR, Dec. 2023, pp. 2278–2287, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v229/rosen23a.html
  • [18] A. Pnueli, “The temporal logic of programs,” in 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), Oct. 1977, pp. 46–57, iSSN: 0272-5428. [Online]. Available: https://ieeexplore.ieee.org/document/4567924
  • [19] C. Menghi, C. Tsigkanos, P. Pelliccione, C. Ghezzi, and T. Berger, “Specification Patterns for Robotic Missions,” IEEE Transactions on Software Engineering, vol. 47, no. 10, pp. 2208–2224, Oct. 2021. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/TSE.2019.2945329
  • [20] M. Berg, D. Bayazit, R. Mathew, A. Rotter-Aboyoun, E. Pavlick, and S. Tellex, “Grounding Language to Landmarks in Arbitrary Outdoor Environments,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), May 2020, pp. 208–215, iSSN: 2577-087X. [Online]. Available: https://ieeexplore.ieee.org/document/9197068
  • [21] M. Cosler, C. Hahn, D. Mendoza, F. Schmitt, and C. Trippel, “nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models,” in Computer Aided Verification, C. Enea and A. Lal, Eds.   Cham: Springer Nature Switzerland, 2023, pp. 383–396.
  • [22] F. Fuggitti and T. Chakraborti, “NL2LTL – a Python Package for Converting Natural Language (NL) Instructions to Linear Temporal Logic (LTL) Formulas,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 13, pp. 16 428–16 430, 2023, number: 13. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/27068
  • [23] Y. Chen, R. Gandhi, Y. Zhang, and C. Fan, “NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds.   Singapore: Association for Computational Linguistics, Dec. 2023, pp. 15 880–15 903. [Online]. Available: https://aclanthology.org/2023.emnlp-main.985/
  • [24] M. Y. Vardi, “An automata-theoretic approach to linear temporal logic,” in Logics for Concurrency: Structure versus Automata, F. Moller and G. Birtwistle, Eds.   Berlin, Heidelberg: Springer, 1996, pp. 238–266. [Online]. Available: https://doi.org/10.1007/3-540-60915-6_6
  • [25] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-Logic-Based Reactive Mission and Motion Planning,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, Dec. 2009, conference Name: IEEE Transactions on Robotics. [Online]. Available: https://ieeexplore.ieee.org/document/5238617
  • [26] M. Colledanchise, R. M. Murray, and P. Ögren, “Synthesis of correct-by-construction behavior trees,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2017, pp. 6039–6046, iSSN: 2153-0866. [Online]. Available: https://ieeexplore.ieee.org/document/8206502
  • [27] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1, pp. 181–211, Aug. 1999. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0004370299000521
  • [28] N. Yokoyama, A. Clegg, J. Truong, E. Undersander, T.-Y. Yang, S. Arnaud, S. Ha, D. Batra, and A. Rai, “ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation,” IEEE Robotics and Automation Letters, vol. 9, no. 1, pp. 779–786, Jan. 2024, conference Name: IEEE Robotics and Automation Letters. [Online]. Available: https://ieeexplore.ieee.org/document/10328058
  • [29] K. Jatavallabhula, A. Kuwajerwala, Q. Gu, M. Omama, G. Iyer, S. Saryazdi, T. Chen, A. Maalouf, S. Li, N. Keetha, A. Tewari, J. Tenenbaum, C. Melo, M. Krishna, L. Paull, F. Shkurti, and A. Torralba, “ConceptFusion: Open-set multimodal 3D mapping,” in Robotics: Science and Systems XIX.   Robotics: Science and Systems Foundation, July 2023. [Online]. Available: http://www.roboticsproceedings.org/rss19/p066.pdf
  • [30] Z. Yang, S. S. Raman, A. Shah, and S. Tellex, “Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), May 2024, pp. 14 435–14 442. [Online]. Available: https://ieeexplore.ieee.org/document/10611447
  • [31] I. E. Commission et al., “Functional safety of electrical/electronic/programmable electronic safety related systems,” IEC 61508, 2000.
  • [32] M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, X. Wang, X. Zhai, T. Kipf, and N. Houlsby, “Simple Open-Vocabulary Object Detection,” in Computer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds.   Cham: Springer Nature Switzerland, 2022, pp. 728–755.
  • [33] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment Anything,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2023, pp. 3992–4003, iSSN: 2380-7504. [Online]. Available: https://ieeexplore.ieee.org/document/10378323
  • [34] A. Majid, M. Bowerman, S. Kita, D. B. M. Haun, and S. C. Levinson, “Can language restructure cognition? The case for space,” Trends in Cognitive Sciences, vol. 8, no. 3, pp. 108–114, Mar. 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1364661304000208
  • [35] L. Janson, E. Schmerling, A. Clark, and M. Pavone, “Fast marching tree: A fast marching sampling-based method for optimal motion planning in many dimensions,” The International Journal of Robotics Research, vol. 34, no. 7, pp. 883–921, June 2015, publisher: SAGE Publications Ltd STM. [Online]. Available: https://doi.org/10.1177/0278364915577958
  • [36] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, “Robots that use language,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 25–55, 2020.
  • [37] V. Blukis, R. Knepper, and Y. Artzi, “Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following,” in Proceedings of the 2020 Conference on Robot Learning.   PMLR, Oct. 2021, pp. 1829–1854, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v155/blukis21a.html
  • [38] R. Patel, E. Pavlick, and S. Tellex, “Grounding Language to Non-Markovian Tasks with No Supervision of Task Specifications,” in Robotics: Science and Systems XVI.   Robotics: Science and Systems Foundation, July 2020. [Online]. Available: http://www.roboticsproceedings.org/rss16/p016.pdf
  • [39] C. Wang, C. Ross, Y.-L. Kuo, B. Katz, and A. Barbu, “Learning a natural-language to LTL executable semantic parser for grounded robotics,” in Proceedings of the 2020 Conference on Robot Learning.   PMLR, Oct. 2021, pp. 1706–1718, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v155/wang21g.html
  • [40] K. Zheng, D. Bayazit, R. Mathew, E. Pavlick, and S. Tellex, “Spatial Language Understanding for Object Search in Partially Observed City-scale Environments,” in 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN).   Vancouver, BC, Canada: IEEE, Aug. 2021, pp. 315–322. [Online]. Available: https://ieeexplore.ieee.org/document/9515426/
  • [41] X. Wang, W. Wang, J. Shao, and Y. Yang, “LANA: A Language-Capable Navigator for Instruction Following and Generation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   Vancouver, BC, Canada: IEEE, June 2023, pp. 19 048–19 058. [Online]. Available: https://ieeexplore.ieee.org/document/10203301/
  • [42] S.-M. Park and Y.-G. Kim, “Visual language navigation: a survey and open challenges,” Artificial Intelligence Review, vol. 56, no. 1, pp. 365–427, Jan. 2023. [Online]. Available: https://doi.org/10.1007/s10462-022-10174-9
  • [43] B. Yu, H. Kasaei, and M. Cao, “L3MVN: Leveraging Large Language Models for Visual Target Navigation,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2023, pp. 3554–3560, iSSN: 2153-0866. [Online]. Available: https://ieeexplore.ieee.org/document/10342512
  • [44] C. H. Song, B. M. Sadler, J. Wu, W.-L. Chao, C. Washington, and Y. Su, “LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV).   Paris, France: IEEE, Oct. 2023, pp. 2986–2997. [Online]. Available: https://ieeexplore.ieee.org/document/10378628/
  • [45] E. Hsiung, H. Mehta, J. Chu, X. Liu, R. Patel, S. Tellex, and G. Konidaris, “Generalizing to New Domains by Mapping Natural Language to Lifted LTL,” in 2022 International Conference on Robotics and Automation (ICRA).   Philadelphia, PA, USA: IEEE Press, May 2022, pp. 3624–3630. [Online]. Available: https://doi.org/10.1109/ICRA46639.2022.9812169
  • [46] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models,” in Proceedings of The 7th Conference on Robot Learning.   PMLR, Dec. 2023, pp. 540–562, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v229/huang23b.html
  • [47] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning,” in Proceedings of The 7th Conference on Robot Learning.   PMLR, Dec. 2023, pp. 23–72, iSSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v229/rana23a.html
  • [48] R.-Z. Qiu, Y. Hu, G. Yang, Y. Song, Y. Fu, J. Ye, J. Mu, R. Yang, N. Atanasov, S. A. Scherer, and X. Wang, “Learning Generalizable Feature Fields for Mobile Manipulation,” CoRR, vol. abs/2403.07563, 2024, arXiv: 2403.07563. [Online]. Available: https://doi.org/10.48550/arXiv.2403.07563
  • [49] I. Kostavelis and A. Gasteratos, “Semantic mapping for mobile robotics tasks: A survey,” Robotics and Autonomous Systems, vol. 66, pp. 86–103, Apr. 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0921889014003030
  • [50] J. Crespo, J. C. Castillo, O. M. Mozos, and R. Barber, “Semantic Information for Robot Navigation: A Survey,” Applied Sciences, vol. 10, no. 2, p. 497, Jan. 2020, number: 2 Publisher: Multidisciplinary Digital Publishing Institute. [Online]. Available: https://www.mdpi.com/2076-3417/10/2/497
  • [51] A. Pronobis, P. Jensfelt, and J. Little, Semantic Mapping with Mobile Robots.   Stockholm: KTH Royal Institute of Technology, 2011.
  • [52] R. E. Fikes and N. J. Nilsson, “Strips: A new approach to the application of theorem proving to problem solving,” Artificial intelligence, vol. 2, no. 3-4, pp. 189–208, 1971.
  • [53] C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez, “Integrated Task and Motion Planning,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 265–293, May 2021, publisher: Annual Reviews. [Online]. Available: https://www.annualreviews.org/content/journals/10.1146/annurev-control-091420-084139
  • [54] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling, “PDDLStream: Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning,” Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, pp. 440–448, June 2020. [Online]. Available: https://ojs.aaai.org/index.php/ICAPS/article/view/6739
  • [55] R. Holladay, T. Lozano-Pérez, and A. Rodriguez, “Planning for Multi-stage Forceful Manipulation,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 6556–6562, iSSN: 2577-087X. [Online]. Available: https://ieeexplore.ieee.org/document/9561233

Verifiably Following Complex Robot Instructions with Foundation Models
Appendix

A1 Appendix Summary

These sections present additional details on our approach for leveraging foundation models and temporal logics to verifiably follow expressive natural language instructions with complex spatiotemporal constraints without prebuilt semantic maps. We encourage readers to visit our website robotlimp.github.io for a project summary and demonstration videos.

A2 Extended Related Works

A2-A Foundation Models in Robotics

Grounding language referents to entities and actions in the world [36, 37, 38, 39] is challenging, in part because complex perceptual and behavioral meaning can be constructed by composing a wide range of open-vocabulary components [40, 20, 21, 41, 37]. To address this problem, foundation models have recently garnered interest as an approach for generating perceptual representations that are aligned with language [42, 4, 10, 43, 44, 5]. Because foundation models are being leveraged for instruction following in robotics in an ever-expanding number of ways (e.g., generating plans [5], code [12], etc.), we focus our review on the most related approaches in two relevant application areas: 1) generating natural language queryable scene representations and 2) generating robot plans for following natural language instructions [7, 45].

Visual scene understanding: The approach to visual scene understanding most similar to ours is NLMap [10], a scene representation that affords grounding open-vocabulary language queries to spatial locations via pre-trained visual-language models. Given a sequence of calibrated RGB-D camera images and pre-trained visual-language models, NLMap supports language queries by 1) segmenting out the most likely objects in the 2D RGB images based on the language queries, and 2) estimating the most likely 3D positions via back-projection of the 2D segmentation masks using the depth data and camera pose. While NLMap is suitable for handling complex descriptions of individual objects (e.g., “green plush toy”), it is fundamentally unable to handle instructions involving complex constraints between multiple objects, since it has no way to account for object-object relationships (e.g., “the green plush toy that is between the toy box and door”). LIMP handles these more complicated language instructions by using a novel spatial grounding module to easily incorporate a wide variety of complex spatial relationships between objects. In addition, our scene representation is compatible with both LLM planners and TAMP solvers, whereas NLMap is only compatible with LLM planners.
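The back-projection step attributed to NLMap above (step 2) can be sketched as follows. This is a minimal illustration under a pinhole camera assumption, not NLMap's or LIMP's actual implementation: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, the `depth[v][u]` indexing, and the row-major 4x4 camera-to-world matrix are all hypothetical conventions chosen for this example, and the depth image and pose are made-up values.

```python
def back_project_mask(mask_pixels, depth, intrinsics, camera_to_world):
    """Lift 2D mask pixels to 3D world points with a pinhole model (an assumption).

    mask_pixels: iterable of (u, v) pixel coordinates inside a segmentation mask.
    depth: 2D list indexed as depth[v][u], in meters; non-positive means invalid.
    intrinsics: (fx, fy, cx, cy) focal lengths and principal point in pixels.
    camera_to_world: 4x4 row-major homogeneous camera pose.
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for u, v in mask_pixels:
        z = depth[v][u]
        if z <= 0:  # skip pixels without a valid depth reading
            continue
        # Pixel coordinates -> homogeneous point in the camera frame.
        cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
        # Camera frame -> world frame using the camera pose.
        points.append(tuple(
            sum(camera_to_world[r][c] * cam[c] for c in range(4))
            for r in range(3)
        ))
    return points


def centroid(points):
    """Average the lifted points to get one 3D position estimate for the object."""
    if not points:
        raise ValueError("no valid depth readings inside the mask")
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


# Hypothetical 3x3 depth image (all 2 m) and a camera translated 1 m along x.
depth = [[2.0] * 3 for _ in range(3)]
pose = [[1, 0, 0, 1.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1.0]]
points = back_project_mask([(1, 1), (2, 1)], depth, (100.0, 100.0, 1.0, 1.0), pose)
print(centroid(points))
```

Averaging the lifted points is only one way to summarize a mask as a single 3D position; a median or a clustering step would be more robust to depth noise.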

While NLMap is the most relevant approach to ours, there are other approaches for visual scene understanding and task planning with foundation models that are worth highlighting. VoxPoser [46] leverages the abilities of LLMs to identify affordances and write code for manipulation tasks, along with VLMs’ complementary abilities to identify open-vocabulary entities in the environment. SayPlan [47] integrates 3D scene graphs with LLM-based planners to bridge the gap between complex, hierarchical scene representations and scalable task planning with open-ended task specifications. Generalizable Feature Fields (GeFF) [48] use an implicit scene representation to support open-world manipulation and navigation via an extension of Neural Radiance Fields (NeRFs) and feature distillation in NeRFs. OK-Robot [11] adopts a system-first approach to solving structured mobile pick-and-place tasks with foundation models by offering an integrated solution to object detection, mapping, navigation and grasp generation. While these methods are related, none of them have all the features of LIMP: 1) Explicit support for both LLM-based planning and off-the-shelf task and motion planning approaches, 2) Verifiable representations for following complex natural language instructions in mobile manipulation domains that involve object-object relationships, and 3) The ability to dynamically generate task-relevant state abstractions (semantic maps) for individual instructions.

Language instruction for robots: Our approach to handling complex natural language instructions involves translating the command into a temporal logic expression. This problem framing allows us to leverage state-of-the-art techniques from machine translation, such as instruction-tuned large language models. Most similar to our approach in this regard is [7], which uses a multi-stage LLM-based approach and finetuning to perform entity extraction and replacement to translate natural language instructions into temporal logic expressions. However, [7] relies on a prebuilt semantic map that grounds expression symbols, limiting the scope of instructions it can operate on, since landmarks are predetermined. Instead, our approach interfaces with a novel scene representation that supports open-vocabulary language and generates the relevant landmarks based on the open-vocabulary instruction. Additionally, the symbols in our temporal logic translation correspond to parameterized task-relevant robot skills, as opposed to propositions of referent entities extracted from instructions.

A2-B Planning Models in Robotics

Semantic Maps: Semantic maps [49] are a class of scene representations that capture semantic (and typically geometric) information about the world, and can be used in conjunction with planners to generate certain types of complex robot behavior, such as collision-free navigation with spatial constraints [50, 51]. However, leveraging semantic maps for task planning with mobile manipulators has been challenging, since the modeling information needed may depend heavily on the robot’s particular skills and embodiment. [17] recently proposed Action-Oriented Semantic Maps (AOSMs), a class of semantic maps that include additional models of the regions of space where the robot can execute manipulation skills (represented as symbols). [17] demonstrated that AOSMs can be used as a state representation that supports TAMP solvers in mobile manipulation domains. Our scene representation is similar to an AOSM, since it captures spatial information about semantic regions of interest and is compatible with TAMP solvers, but differs largely in that AOSMs require learning via online interaction with the scene. Instead, our approach leverages foundation models and requires no online learning. Also, once an AOSM is generated for a scene, only a closed set of goals can be planned for, whereas our approach can handle open-vocabulary task specifications.

Task and Motion Planning: Task and Motion Planning approaches are hierarchical planning methods that involve high-level task planning (with a discrete state space) [52] and low-level motion planning (with a continuous state space) [53]. The low-level motion planning problem involves generating paths to goal sets through continuous spaces (e.g., configuration space, Cartesian space) with constraints on infeasible regions. When the constraints and dynamics can change, it is referred to as multi-modal motion planning, which naturally induces a high-level planning problem that involves choosing which sequence of modes to plan through, and a low-level planning problem that involves moving through a particular mode. Finding high-level plan skeletons and satisfying low-level assignment values for parameters to achieve goals is a challenging bi-level planning problem [53]. LIMP contains sufficient information to produce a problem and domain description augmented with geometric information for bi-level TAMP solvers like [54, 55].

A3 Language Instruction Module

We implement a two-stage prompting strategy in our language instruction module to translate natural language instructions into LTL specifications. The first stage translates a given instruction into a conventional LTL formula, where propositions refer to open-vocabulary objects. For any given instruction, we dynamically generate $K$ in-context translation examples from a standard dataset [14] of natural language and LTL pairs, based on cosine similarity with the given instruction. Here is the exact text prompt used:

You are a LLM that understands operators involved with Linear Temporal Logic (LTL), such as F, G, U, &, |, ~ , etc. Your goal is translate language input to LTL output.
Input:<generated_example_instruction>
Output:<generated_example_LTL>
...
Input:<given_instruction>
Output:
Listing 1: Base prompt used to obtain a conventional LTL formula from a natural language query
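The example-selection step of the first stage can be sketched as follows. This is a minimal illustration rather than the implementation used in LIMP: the source does not specify the embedding model, so a toy bag-of-words embedding stands in for a real sentence encoder, and the function names and the three instruction/LTL pairs are hypothetical placeholders rather than entries from the dataset in [14].

```python
import math
from collections import Counter

# First line of the base prompt in Listing 1, copied verbatim.
HEADER = ("You are a LLM that understands operators involved with Linear Temporal "
          "Logic (LTL), such as F, G, U, &, |, ~ , etc. Your goal is translate "
          "language input to LTL output.")


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u if token in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def build_stage1_prompt(instruction, dataset, k=2):
    """Pick the k most similar (instruction, LTL) pairs and fill the base prompt."""
    query = embed(instruction)
    ranked = sorted(dataset, key=lambda pair: cosine(query, embed(pair[0])),
                    reverse=True)
    lines = [HEADER]
    for example_text, example_ltl in ranked[:k]:
        lines.append(f"Input:{example_text}")
        lines.append(f"Output:{example_ltl}")
    lines.append(f"Input:{instruction}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical dataset entries, for illustration only.
EXAMPLES = [
    ("go to the kitchen then the bedroom", "F ( kitchen & F bedroom )"),
    ("always avoid the stairs", "G ~ stairs"),
    ("visit the lobby", "F lobby"),
]

prompt = build_stage1_prompt("go to the lobby then the kitchen", EXAMPLES, k=2)
print(prompt)
```

With these placeholder pairs, the two retrieved examples are the sequencing instruction and the lobby instruction, while the unrelated avoidance example is left out of the prompt.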

The second stage takes the given instruction and the LTL response from the first stage as input to generate a new LTL formula with predicate functions that correspond to parameterized robot skills. Skill parameters are instruction referent objects expressed in our novel Composable Referent Descriptor (CRD) syntax. CRDs enable referent disambiguation by chaining comparators that encode descriptive spatial information. We define eight spatial comparators and provide their descriptions as part of the second stage prompt. We find that LLMs conditioned on this information and a few examples are able to translate arbitrarily complex instructions with appropriate comparator choices. Here is the exact prompt used:

You are an LLM for robot planning that understands operators involved with Linear Temporal Logic (LTL), such as F, G, U, &, |, ~ , etc. You have a finite set of robot predicates and spatial predicates, given a language instruction and an LTL formula that represents the given instruction, your goal is to translate the ltl formula into one that uses appropriate composition of robot and spatial predicates in place of propositions with relevant details from original instruction as arguments.
Robot predicate set (near,pick,release).
Usage:
near[referent_1]:returns true if the desired spatial relationship is for robot to be near referent_1.
pick[referent_1]:can only execute picking skill on referent_1 and return True when near[referent_1].
release[referent_1,referent_2]:can only execute release skill on referent_1 and return True when near[referent_2].
Spatial predicate set (isbetween,isabove,isbelow,isleftof,isrightof,isnextto,isinfrontof,isbehind).
Usage:
referent_1::isbetween(referent_2,referent_3):returns true if referent_1 is between referent_2 and referent_3.
referent_1::isabove(referent_2):returns True if referent_1 is above referent_2.
referent_1::isbelow(referent_2):returns True if referent_1 is below referent_2.
referent_1::isleftof(referent_2):returns True if referent_1 is left of referent_2.
referent_1::isrightof(referent_2):returns True if referent_1 is right of referent_2.
referent_1::isnextto(referent_2):returns True if referent_1 is close to referent_2.
referent_1::isinfrontof(referent_2):returns True if referent_1 is in front of referent_2.
referent_1::isbehind(referent_2):returns True if referent_1 is behind referent_2.
Rules:
Strictly only use the finite set of robot and spatial predicates!
Strictly stick to the usage format!
Compose spatial predicates where necessary!
You are allowed to modify the structure of Input_ltl for the final Output if it does not match the intended Input_instruction!
You should strictly only stick to mentioned objects, however you are allowed to propose and include plausible objects if and only if not mentioned in instruction but required based on context of instruction!
Pay attention to instructions that require performing certain actions multiple times in generating and sequencing the predicates for the final Output formula!
Example:
Input_instruction: Go to the orange building but before that pass by the coffee shop, then go to the parking sign.
Input_ltl: F (coffee_shop & F (orange_building & F parking_sign ) )
Output: F ( near[coffee_shop] & F ( near[orange_building] & F near[parking_sign] ))
Input_instruction: Go to the blue sofa then the laptop, after that bring me the brown bag between the television and the kettle on the left of the green seat, I am standing by the sink.
Input_ltl: F ( blue_sofa & F ( laptop & F ( brown_bag & F ( sink ) ) ) )
Output: F ( near[blue_sofa] & F ( near[laptop] & F ( near[brown_bag::isbetween(television,kettle::isleftof(green_seat))] & F (pick[brown_bag::isbetween(television,kettle::isleftof(green_seat))] & F ( near[sink] & F ( release[brown_bag,sink] ) ) ) ) ) )
Input_instruction: Hey need you to pass by chair between the sofa and bag, pick up the bag and go to the orange napkin on the right of the sofa.
Input_ltl: F ( chair & F ( bag & F ( orange_napkin ) ) )
Output: F ( near[chair::isbetween(sofa,bag)] & F ( near[bag] & F ( pick[bag] & F ( near[orange_napkin::isrightof(sofa)] ) ) ) )
Input_instruction: Go to the chair between the green laptop and the yellow box underneath the play toy
Input_ltl: F ( green_laptop & F ( yellow_box & F ( play_toy & F ( chair ) ) ) )
Output: F ( near[chair::isbetween(green_laptop,yellow_box::isbelow(play_toy))] )
Input_instruction: Check the table behind the fridge and bring two beers to the couch one after the other
Input_ltl: F ( check_table & F ( bring_beer1 ) & F ( bring_beer2 ) & F ( couch ) )
Output: F ( near[table::isbehind(fridge)] & F ( pick[beer] & F ( near[couch] & F ( release[beer,couch] & F ( near[table::isbehind(fridge)] & F ( pick[beer] & F ( near[couch] & F ( release[beer,couch] ))))))))
Input_instruction: <given_instruction>
Input_ltl: <stage1_ltl_response>
Output:
Listing 2: Second stage prompt to output our LTL syntax with CRD parameterized robot skills
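To make the CRD syntax concrete, the sketch below parses a descriptor such as `brown_bag::isbetween(television,kettle::isleftof(green_seat))` into a nested structure and lists the referents it mentions. It is a hedged illustration, not LIMP's parser: the `Referent` class, `parse_crd`, and `referent_names` are hypothetical names, and the grammar (a name followed by zero or more `::comparator(args)` suffixes) is inferred from the usage lines and examples in Listing 2.

```python
from dataclasses import dataclass, field

# The eight spatial comparators listed in the second-stage prompt.
COMPARATORS = {"isbetween", "isabove", "isbelow", "isleftof",
               "isrightof", "isnextto", "isinfrontof", "isbehind"}


@dataclass
class Referent:
    name: str
    # Each relation is a (comparator, [Referent, ...]) pair.
    relations: list = field(default_factory=list)


def parse_crd(text):
    """Parse a full CRD string; raise ValueError on malformed input."""
    compact = text.replace(" ", "")
    referent, pos = _parse_referent(compact, 0)
    if pos != len(compact):
        raise ValueError(f"unexpected trailing text: {compact[pos:]!r}")
    return referent


def _parse_referent(s, i):
    """Parse one referent starting at index i; return (Referent, next index)."""
    start = i
    while i < len(s) and (s[i].isalnum() or s[i] == "_"):
        i += 1
    if i == start:
        raise ValueError(f"expected a referent name at position {start}")
    referent = Referent(s[start:i])
    while s.startswith("::", i):  # zero or more chained comparators
        i += 2
        start = i
        while i < len(s) and s[i].isalpha():
            i += 1
        comparator = s[start:i]
        if comparator not in COMPARATORS:
            raise ValueError(f"unknown comparator: {comparator!r}")
        if i >= len(s) or s[i] != "(":
            raise ValueError(f"expected '(' after {comparator!r}")
        i += 1
        args = []
        while True:  # comma-separated, possibly nested, arguments
            arg, i = _parse_referent(s, i)
            args.append(arg)
            if i < len(s) and s[i] == ",":
                i += 1
                continue
            break
        if i >= len(s) or s[i] != ")":
            raise ValueError(f"expected ')' to close {comparator!r}")
        i += 1
        referent.relations.append((comparator, args))
    return referent, i


def referent_names(referent):
    """List every object name in the descriptor, outermost first."""
    names = [referent.name]
    for _, args in referent.relations:
        for arg in args:
            names.extend(referent_names(arg))
    return names


crd = parse_crd("brown_bag::isbetween(television,kettle::isleftof(green_seat))")
print(referent_names(crd))
```

A nested structure of this kind keeps the target object (`brown_bag`) separate from the objects that only serve to disambiguate it, which is the distinction a downstream spatial grounding step would need.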

A3-A Interactive Symbol Verification

Verifying sampled LTL formulas is essential. We therefore implement an interactive dialog system that presents users with the Composable Referent Descriptors (CRDs) extracted from sampled formulas, as well as the implied task structure, which is encoded in the sequence of state-machine transition expressions that must hold to progressively solve the task. We translate the task structure into English statements via a simple deterministic strategy that replaces logical connectives and skill predicates from the formula with equivalent English phrases. Users can verify a formula as correct or provide corrective statements, which are used to reprompt the LLM to obtain new formulas. Below is the exact prompt used for regenerating formulas.

There was a mistake with your output LTL formula: Error with <verification_type>! Consider the clarification feedback and regenerate the correct output for the Input_instruction. Make sure to adhere to all rules and instructions in your original prompt!
previous_output:<last_response>
error_clarification: <given_error_clarification>
correct_output:
Listing 3: Corrective reprompting prompt used to obtain new LTL formulas
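
As a minimal sketch of how the corrective prompt in Listing 3 might be assembled, the Python snippet below fills the three placeholders of the template. The function name `build_reprompt` and the example placeholder values are hypothetical and are not taken from the LIMP codebase; only the template text follows the listing.

```python
# Template text copied from Listing 3; the placeholders map to its <...> fields.
REPROMPT_TEMPLATE = (
    "There was a mistake with your output LTL formula: "
    "Error with {verification_type}! Consider the clarification feedback and "
    "regenerate the correct output for the Input_instruction. Make sure to "
    "adhere to all rules and instructions in your original prompt!\n"
    "previous_output:{last_response}\n"
    "error_clarification: {given_error_clarification}\n"
    "correct_output:"
)


def build_reprompt(verification_type, last_response, given_error_clarification):
    """Fill the corrective reprompting template (hypothetical helper)."""
    return REPROMPT_TEMPLATE.format(
        verification_type=verification_type,
        last_response=last_response,
        given_error_clarification=given_error_clarification,
    )


# Example values are illustrative assumptions, not strings from the paper.
prompt = build_reprompt(
    "Referent Verification",
    "F ( near[bag] )",
    "I meant the red bag",
)
print(prompt)
```

The resulting string would be appended to the ongoing conversation with the LLM so that the original prompt rules remain in context.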

To illustrate, the instruction “Bring the green plush toy to the whiteboard in front of it” yields the interactive Referent and Task Structure Verification dialog below:

**************************
Instruction Following
**************************
Input_instruction: "Bring the green plush toy to the whiteboard in front of it"
Sampled LTL formula: F(A & F(B & F(C & FD)))
    A: near[green_plush_toy]
    B: pick[green_plush_toy]
    C: near[whiteboard::isinfrontof(green_plush_toy)]
    D: release[green_plush_toy, whiteboard::isinfrontof(green_plush_toy)]

***************************
Referent Verification
***************************
I extracted this list of relevant objects based on your instruction:
    * whiteboard::isinfrontof(green_plush_toy)
    * green_plush_toy
Does this match your intention? (y/n)

****************************
Task Structure Verification
****************************
Based on my understanding here is the sequence of subgoal objectives needed to satisfy the task:
Subgoal_1:
    Logical Expression: A&!B
    English translation: I should be near the [green_plush_toy] and not have picked up the [green_plush_toy]
Subgoal_2:
    Logical Expression: B&!C
    English translation: I should have picked up the [green_plush_toy] and not be near the [whiteboard::isinfrontof(green_plush_toy)]
Subgoal_3:
    Logical Expression: C&!D
    English translation: I should be near the [whiteboard::isinfrontof(green_plush_toy)] and not have released the [green_plush_toy] at the [whiteboard::isinfrontof(green_plush_toy)]
Subgoal_4:
    Logical Expression: D
    English translation: I should have released the [green_plush_toy] at the [whiteboard::isinfrontof(green_plush_toy)]
Does this match your intention? (y/n)
Listing 4: Interactive referent and task structure verification dialog.
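
The deterministic translation from transition expressions to English can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: the phrase templates are inferred from the dialog in Listing 4, the helper names `split_args` and `translate` are hypothetical, and the sketch only handles conjunctions of (possibly negated) symbols.

```python
import re

# (positive, negated) phrase templates inferred from Listing 4 (assumption).
PHRASES = {
    "near": ("be near the [{0}]", "not be near the [{0}]"),
    "pick": ("have picked up the [{0}]", "not have picked up the [{0}]"),
    "release": (
        "have released the [{0}] at the [{1}]",
        "not have released the [{0}] at the [{1}]",
    ),
}


def split_args(arg_text):
    """Split on top-level commas only, so nested CRDs stay intact."""
    args, depth, current = [], 0, ""
    for ch in arg_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current.strip())
            current = ""
        else:
            current += ch
    args.append(current.strip())
    return args


def translate(expression, symbols):
    """Translate a conjunction such as 'A&!B' into an English statement."""
    clauses = []
    for literal in expression.split("&"):
        literal = literal.strip()
        negated = literal.startswith("!")
        predicate = symbols[literal.lstrip("!")]
        skill, arg_text = re.fullmatch(r"(\w+)\[(.*)\]", predicate).groups()
        template = PHRASES[skill][1 if negated else 0]
        clauses.append(template.format(*split_args(arg_text)))
    return "I should " + " and ".join(clauses)


symbols = {
    "A": "near[green_plush_toy]",
    "B": "pick[green_plush_toy]",
    "C": "near[whiteboard::isinfrontof(green_plush_toy)]",
    "D": "release[green_plush_toy, whiteboard::isinfrontof(green_plush_toy)]",
}
print(translate("A&!B", symbols))
```

Because the mapping is a fixed lookup, the same formula always yields the same English statement, which is what makes the dialog usable for verification.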

A4 Spatial Grounding Module

The spatial grounding module detects and localizes specific instances of objects referenced in a given instruction by first detecting, segmenting and back-projecting all referent occurrences and then filtering based on the descriptive spatial details captured by each referent’s composable referent descriptor (CRD). We use the Owl-Vit model [32] to detect bounding boxes of open-vocabulary referents and SAM [33] to generate masks from detected bounding boxes. To illustrate referent filtering via spatial information, consider an example scenario where the goal is to resolve the composable referent descriptor below:

\[ \text{whiteboard}::isinfrontof(\text{green\_plush\_toy}). \tag{A.1} \]

Let $W=\{w_{1},w_{2},\ldots,w_{n}\}$ and $G=\{g_{1},g_{2},\ldots,g_{m}\}$ represent the sets of representative 3D positions of detected whiteboards and green_plush_toys, respectively. The Cartesian product of these sets enumerates all possible pairs $(w,g)$ for comparison.

\[ W\times G=\{(w,g)\mid w\in W,\ g\in G\} \tag{A.2} \]

The ‘isinfrontof(w, g)’ comparator is applied to each pair, yielding a subset $S$ that contains only those ‘whiteboards’ that satisfy the ‘isinfrontof’ condition with at least one ‘green_plush_toy’.

\[ S=\{w\in W\mid\exists g\in G\text{ such that }\text{isinfrontof}(w,g)\text{ is true}\} \tag{A.3} \]
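
The filtering in Eq. A.3 can be illustrated with a short Python sketch. The positions, the zero default threshold, and the helper name `filter_referents` are assumptions made for illustration; the `isinfrontof` test follows the x-coordinate convention described for the spatial comparators.

```python
def isinfrontof(pos_1, pos_2, threshold=0.0):
    """True if pos_1 has a smaller x-coordinate than pos_2 by more than threshold."""
    return pos_1[0] < pos_2[0] - threshold


def filter_referents(candidates, anchors, comparator):
    """Keep candidates that satisfy the comparator with at least one anchor.

    Checking any(...) over the anchors is equivalent to enumerating the
    Cartesian product W x G and keeping every w that appears in a true pair.
    """
    return [w for w in candidates if any(comparator(w, g) for g in anchors)]


# Hypothetical 3D positions (x, y, z) in metres.
whiteboards = [(1.0, 0.0, 1.0), (5.0, 0.0, 1.0)]
green_plush_toys = [(3.0, 0.5, 0.2)]

S = filter_referents(whiteboards, green_plush_toys, isinfrontof)
print(S)
```

Nested descriptors can be resolved by applying the same filter recursively, innermost descriptor first, so that the anchor set passed in has already been filtered.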

A4-A 3D Spatial Comparators

Our 3D spatial comparators enable Relative Frame of Reference (FoR) spatial reasoning between referents, based on their backprojected 3D positions. Threshold values in the spatial comparators let users specify the sensitivity or resolution at which spatial relationships are resolved. We keep all threshold values fixed across all experiments. Below is a description of each spatial comparator.

1. isbetween(referent_1_pos, referent_2_pos, referent_3_pos, threshold): Returns true if referent_1 is within threshold distance from the line segment connecting referent_2 to referent_3, ensuring it lies in the directional path between them without extending beyond.
2. isabove(referent_1_pos, referent_2_pos, threshold): Returns true if the z-coordinate of referent_1 exceeds that of referent_2 by at least threshold.
3. isbelow(referent_1_pos, referent_2_pos, threshold): Returns true if the z-coordinate of referent_1 is less than that of referent_2 by more than threshold.
4. isleftof(referent_1_pos, referent_2_pos, threshold): Returns true if the y-coordinate of referent_1 exceeds that of referent_2 by at least threshold, indicating referent_1 is to the left of referent_2.
5. isrightof(referent_1_pos, referent_2_pos, threshold): Returns true if the y-coordinate of referent_1 is less than that of referent_2 by more than threshold, indicating referent_1 is to the right of referent_2.
6. isnextto(referent_1_pos, referent_2_pos, threshold): Returns true if the Euclidean distance between referent_1 and referent_2 is less than threshold, indicating they are next to each other.
7. isinfrontof(referent_1_pos, referent_2_pos, threshold): Returns true if the x-coordinate of referent_1 is less than that of referent_2 by more than threshold, indicating referent_1 is in front of referent_2.
8. isbehind(referent_1_pos, referent_2_pos, threshold): Returns true if the x-coordinate of referent_1 exceeds that of referent_2 by at least threshold, indicating referent_1 is behind referent_2.
Listing 5: Implementation description of 3D spatial comparators
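
The descriptions in Listing 5 translate fairly directly into code. The following is a minimal sketch under stated assumptions, not the released implementation: positions are `(x, y, z)` tuples, the "at least" versus "more than" wording of each description is mapped to `>=` versus `>`, and `isbetween` is interpreted as a point-to-segment distance test whose projection must fall within the segment.

```python
import math


def isabove(p1, p2, threshold):
    return p1[2] - p2[2] >= threshold


def isbelow(p1, p2, threshold):
    return p2[2] - p1[2] > threshold


def isleftof(p1, p2, threshold):
    return p1[1] - p2[1] >= threshold


def isrightof(p1, p2, threshold):
    return p2[1] - p1[1] > threshold


def isinfrontof(p1, p2, threshold):
    return p2[0] - p1[0] > threshold


def isbehind(p1, p2, threshold):
    return p1[0] - p2[0] >= threshold


def isnextto(p1, p2, threshold):
    return math.dist(p1, p2) < threshold


def isbetween(p1, p2, p3, threshold):
    """True if p1 is within threshold of the segment p2-p3, not beyond its ends."""
    seg = [b - a for a, b in zip(p2, p3)]
    rel = [b - a for a, b in zip(p2, p1)]
    seg_len_sq = sum(c * c for c in seg)
    if seg_len_sq == 0.0:
        return False  # degenerate segment; this handling is an assumption
    # Normalised projection of p1 onto the segment: 0 at p2, 1 at p3.
    t = sum(r * s for r, s in zip(rel, seg)) / seg_len_sq
    if not 0.0 <= t <= 1.0:
        return False
    closest = [a + t * s for a, s in zip(p2, seg)]
    return math.dist(p1, closest) <= threshold


print(isbetween((1.0, 0.1, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.5))
```

Each comparator takes only positions and a threshold, so any of them can be plugged into the same referent filtering step.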

A5 Task and Motion Planning Module

We present pseudocode for our Progressive Motion Planner (Alg. 1) and our algorithm for generating Task Progression Semantic Maps (Alg. 2). Alg. 2 generates a TPSM $\mathcal{M}_{\text{tpsm}}$ by integrating an environment map ($\mathcal{M}$) and a referent semantic map ($\mathcal{M}_{\text{rsm}}$) given a logical transition expression ($\mathcal{T}$), a desired automaton state ($\mathcal{S^{\prime}}$), and a nearness threshold ($\theta$). The algorithm first initializes $\mathcal{M}_{\text{tpsm}}$ with a copy of $\mathcal{M}$ and extracts relevant instruction predicates from $\mathcal{T}$. For each predicate (parameterized skill), the algorithm identifies satisfying referent positions in $\mathcal{M}_{\text{rsm}}$, generates a spherical grid of surrounding points within a radius $\theta$, and assesses how these points affect the progression of the task automaton towards $\mathcal{S^{\prime}}$. These points demarcate regions of interest, and are assigned a value of 1 if they cause the automaton to transition to the desired state, -1 if they lead to a different automaton state or violate the automaton, and 0 if they do not affect the automaton. The points are then integrated into $\mathcal{M}_{\text{tpsm}}$, yielding a semantic map that identifies goal and constraint violating regions.

Algorithm 1 Progressive Motion Planning Algorithm
procedure PMP($X_{start}, \varphi, \mathcal{M}, \mathcal{M}_{rsm}, \theta$)
Input:
$X_{start}$: Start position in the environment.
$\varphi$: CRD syntax LTL formula specifying task objectives.
$\mathcal{M}$: Environment map.
$\mathcal{M}_{rsm}$: Referent semantic map.
$\theta$: Nearness threshold.
Output:
$\Pi$: Generated task and motion plan.
     $\mathcal{A} \leftarrow \text{ConstructAutomaton}(\varphi)$
     $\text{path} \leftarrow \text{SelectAutomatonPath}(\mathcal{A})$ $\triangleright$ Task plan
     while $\Pi$.status is active do
         while $\mathcal{A}$.state != path.acceptingState do
              $\mathcal{S}, \mathcal{T}, \mathcal{S^{\prime}} \leftarrow \mathcal{A}.\text{GetTransition(path.currentStep)}$
              $\text{objective} \leftarrow \text{NextObjectiveType}(\mathcal{T})$
              if $\text{objective} = \text{``skill''}$ then
                  $\Pi.\text{UpdateWithSkill}(\mathcal{T})$
                  $\mathcal{A}$.UpdateAutomatonState($\mathcal{S^{\prime}}$)
              else if $\text{objective} = \text{``navigation''}$ then
                  $\mathcal{M}_{params} \leftarrow \mathcal{M}, \mathcal{M}_{rsm}, \mathcal{T}, \mathcal{A}, \mathcal{S^{\prime}}, \theta$
                  $\mathcal{M}_{tpsm} \leftarrow \textsc{GenerateTPSM}(\mathcal{M}_{params})$
                  $\mathcal{O} \leftarrow \text{GenerateObstacleMap}(\mathcal{M}_{tpsm})$
                  $\text{plan} \leftarrow \text{FMT}^{*}(X_{start}, \mathcal{O})$ $\triangleright$ Path plan
                  if plan.exists then
                       $\Pi.\text{UpdateWithPlan}(\text{plan})$
                       $X_{start} \leftarrow \text{plan.endPosition}$
                       $\mathcal{A}$.UpdateAutomatonState($\mathcal{S^{\prime}}$)
                  else
                       $\Pi, \mathcal{A}, \text{path} \leftarrow \text{Backtrack}(\Pi, \mathcal{A}, \text{path})$
                  end if
              end if
         end while
     end while
     return $\Pi$
end procedure

Algorithm 2 Task Progression Semantic Mapping Algorithm
procedure GenerateTPSM($\mathcal{M}, \mathcal{M}_{rsm}, \mathcal{T}, \mathcal{A}, \mathcal{S^{\prime}}, \theta$)
Input:
$\mathcal{M}$: Environment map.
$\mathcal{M}_{rsm}$: Referent semantic map.
$\mathcal{T}$: Automaton transition expression.
$\mathcal{A}$: Task Automaton.
$\mathcal{S^{\prime}}$: Desired State.
$\theta$: Nearness threshold.
Output:
$\mathcal{M}_{tpsm}$: Task Progression Semantic Map.
     $\mathcal{M}_{tpsm} \leftarrow \text{Copy}(\mathcal{M})$
     $\mathcal{P} \leftarrow \text{ExtractRelevantPredicates}(\mathcal{T})$
     for $p$ in $\mathcal{P}$ do
         $\mathcal{R} \leftarrow \text{QueryPositions}(\mathcal{M}_{rsm}, p)$
         for $r$ in $\mathcal{R}$ do
              $G \leftarrow \left\{g \mid g = r + \delta, \lVert\delta\rVert \leq \theta\right\}$ $\triangleright$ spherical grid of surrounding points
              for $g$ in $G$ do
                  $\mathcal{Q} \leftarrow \text{TruePredicatesAt}(g, \mathcal{M}_{rsm}, \theta)$
                  $\mathcal{S}_{next} \leftarrow \text{ProgressAutomaton}(\mathcal{A}, \mathcal{Q})$
                  if $\mathcal{S}_{next} = \mathcal{S^{\prime}}$ then
                       $g.value \leftarrow 1$ $\triangleright$ Goal value
                  else if $\text{IsUndesired}(\mathcal{S}_{next})$ then
                       $g.value \leftarrow -1$ $\triangleright$ Avoidance value
                  else
                       $g.value \leftarrow 0$
                  end if
              end for
              $\text{AddPoints}(\mathcal{M}_{tpsm}, G)$
         end for
     end for
     return $\mathcal{M}_{tpsm}$
end procedure
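
To make the point-labeling loop of Alg. 2 concrete, the sketch below implements it in plain Python under several simplifying assumptions that are not part of the paper: maps are dictionaries, the automaton is abstracted as a caller-supplied function `progress` that maps a set of true predicates to a next state, undesired states are passed explicitly, and the spherical grid is sampled on a regular lattice with a hypothetical spacing parameter `step`.

```python
import itertools
import math


def spherical_grid(center, radius, step):
    """Lattice points g = r + delta with ||delta|| <= radius."""
    n = int(radius // step)
    offsets = [i * step for i in range(-n, n + 1)]
    return [
        tuple(c + d for c, d in zip(center, delta))
        for delta in itertools.product(offsets, repeat=3)
        if math.sqrt(sum(d * d for d in delta)) <= radius
    ]


def true_predicates_at(point, referent_map, radius):
    """Predicates with at least one referent position within radius of point."""
    return frozenset(
        pred
        for pred, positions in referent_map.items()
        if any(math.dist(point, pos) <= radius for pos in positions)
    )


def generate_tpsm(env_map, referent_map, predicates, progress,
                  desired_state, undesired_states, radius, step):
    """Label points around referents: 1 goal, -1 avoid, 0 neutral."""
    tpsm = dict(env_map)  # start from a copy of the environment map
    for pred in predicates:
        for ref_pos in referent_map.get(pred, []):
            for point in spherical_grid(ref_pos, radius, step):
                truths = true_predicates_at(point, referent_map, radius)
                next_state = progress(truths)
                if next_state == desired_state:
                    tpsm[point] = 1
                elif next_state in undesired_states:
                    tpsm[point] = -1
                else:
                    tpsm[point] = 0
    return tpsm


# Toy task: reach the toy while never getting near the other robot.
referent_map = {
    "near[toy]": [(0.0, 0.0, 0.0)],
    "near[robot]": [(10.0, 0.0, 0.0)],
}


def progress(truths):
    if "near[robot]" in truths:
        return "violated"
    return "s1" if "near[toy]" in truths else "s0"


tpsm = generate_tpsm({}, referent_map, ["near[toy]", "near[robot]"],
                     progress, "s1", {"violated"}, radius=1.0, step=1.0)
print(len(tpsm))
```

In this toy setup, points around the toy are marked as goal regions and points around the robot as regions to avoid, which is the information a downstream obstacle map and path planner would consume.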

A6 Robot Skills

We define three predicate functions: near, pick and release for the navigation, picking and placing skills required for multi-object goal navigation and mobile pick-and-place. As highlighted in the main paper, we formalize navigation as continuous path planning problems and manipulation as object parameterized options. We discuss navigation at length in the paper, so here we focus on the pick and place manipulation skills.

Pick Skill: Once the robot has executed the near skill and is at the object to be manipulated, it takes a photo of the current environment to detect the object using the Owl-Vit model. The robot is guaranteed to be facing the object, as the computed path plan uses the backprojected 3D position of the object to compute yaw angles for the robot. After detecting the object in the picture, we obtain a segmentation mask from the detected bounding box using the Segment Anything model, and compute the center pixel of this mask. We feed this center pixel to the Boston Dynamics grasping API to compute a motion plan to grasp the object.
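
One plausible way to compute the center pixel of a segmentation mask is to take the centroid of its foreground pixels. The sketch below assumes this interpretation (the text does not state whether a centroid or a bounding-box center is used) and represents the mask as a nested list of 0/1 values; `mask_center_pixel` is a hypothetical helper name.

```python
def mask_center_pixel(mask):
    """Return the (row, col) centroid of truthy pixels, rounded to integers."""
    row_sum = col_sum = count = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        raise ValueError("mask contains no foreground pixels")
    return round(row_sum / count), round(col_sum / count)


# Toy 5x5 mask with a 3x3 foreground block.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
center = mask_center_pixel(mask)
print(center)
```

For strongly concave objects the centroid can fall outside the mask, so a practical system might instead snap to the nearest foreground pixel.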

Release Skill: We implement a simple routine for the release skill which takes two parameters: the object to be placed and the place receptacle. Once a navigation skill gets the robot to the place receptacle, the robot gently moves its arm up or down to release the grasped object, based on the 3D position of the place receptacle. Future work will implement more complex semantic placement strategies to better leverage LIMP’s awareness and spatial grounding of instruction-specific place receptacles. Please visit our website to see demonstrations of these skills.

A7 Evaluation and Baseline Details

All computation, including planning, loading and running pretrained visual language models, was done on a single computer equipped with one NVIDIA GeForce RTX 3090 GPU. We leverage GPT-4-0613 as the underlying LLM for our instruction understanding module and all our baselines. In all experiments we set the LLM temperature to 0. However, since deterministic greedy token decoding is not guaranteed with GPT-4, we perform three (3) queries for each instruction and evaluate on the most recurring response (mode response).
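
Selecting the mode response from repeated queries can be sketched in a few lines of Python. How ties are broken is not specified in the text; this sketch assumes the earliest response wins a tie, which is the behavior of `Counter.most_common` for equal counts. The function name `mode_response` and the example strings are illustrative.

```python
from collections import Counter


def mode_response(responses):
    """Return the most frequent response; ties go to the earliest one seen."""
    return Counter(responses).most_common(1)[0][0]


# Three hypothetical LLM outputs for the same instruction.
responses = [
    "F ( near[bag] & F ( pick[bag] ) )",
    "F ( near[bag] )",
    "F ( near[bag] & F ( pick[bag] ) )",
]
selected = mode_response(responses)
print(selected)
```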

We compare LIMP with baseline implementations of NLMap-Saycan [10] and Code-as-policies [12]. Both baselines use the same GPT-4 LLM, prompting structure, and in-context learning examples as our language understanding module. We integrate our composable referent descriptor syntax, spatial grounding module and low-level robot control into these baselines as APIs. This enables the baselines to execute plans by querying relevant object positions, using our FMT* path planner to find paths to those positions, and executing manipulation options.

We visualize some qualitative results of LIMP from our experiments in Figure A.1. We also highlight results in Figure A.2 that illustrate how our interactive symbol verification and reprompting strategy (Section A3-A) improves instruction satisfaction with minimal chat turns for different instruction sets.

Figure A.1: [A] Sample generated plan for a multi object-goal navigation task. [B] Sample generated plan for a mobile pick-and-place task.

Figure A.2: Our interactive reprompting strategy implemented in the symbol verification node regenerates corrective formulas that improve plan success rates with minimal chat turns.

A7-A NLMap-Saycan Implementation Prompt

You are an LLM for robot planning that understands logical operators such as &, |, ~ , etc. You have a finite set of robot predicates and spatial predicates, given a language instruction, your goal is to generate a sequence of actions that uses appropriate composition of robot and spatial predicates with relevant details from the instruction as arguments.
Robot predicate set (near,pick,release).
Usage:
near[referent_1]:returns true if the desired spatial relationship is for robot to be near referent_1.
pick[referent_1]:can only execute picking skill on referent_1 and return True when near[referent_1].
release[referent_1,referent_2]:can only execute release skill on referent_1 and return True when near[referent_2].
Spatial predicate set (isbetween,isabove,isbelow,isleftof,isrightof,isnextto,isinfrontof,isbehind).
Usage:
referent_1::isbetween(referent_2,referent_3):returns true if referent_1 is between referent_2 and referent_3.
referent_1::isabove(referent_2):returns True if referent_1 is above referent_2.
referent_1::isbelow(referent_2):returns True if referent_1 is below referent_2.
referent_1::isleftof(referent_2):returns True if referent_1 is left of referent_2.
referent_1::isrightof(referent_2):returns True if referent_1 is right of referent_2.
referent_1::isnextto(referent_2):returns True if referent_1 is close to referent_2.
referent_1::isinfrontof(referent_2):returns True if referent_1 is in front of referent_2.
referent_1::isbehind(referent_2):returns True if referent_1 is behind referent_2.
Rules:
Strictly only use the finite set of robot and spatial predicates!
Strictly stick to the usage format!
Compose spatial predicates where necessary!
You should strictly stick to mentioned objects, however you are allowed to propose and include plausible objects if and only if not mentioned in instruction but required based on context of instruction!
Pay attention to instructions that require performing certain actions multiple times in generating and sequencing the predicates for the final Output!
Example:
Input_instruction: Go to the orange building but before that pass by the coffee shop, then go to the parking sign.
Output:
1. near[coffee_shop]
2. near[orange_building]
3. near[parking_sign]
Input_instruction: Go to the blue sofa then the laptop, after that bring me the brown bag between the television and the kettle on the left of the green seat, I am standing by the sink.
Output:
1. near[blue_sofa]
2. near[laptop]
3. near[brown_bag::isbetween(television,kettle::isleftof(green_seat))]
4. pick[brown_bag::isbetween(television,kettle::isleftof(green_seat))]
5. near[sink]
6. release[brown_bag,sink]
Input_instruction: Hey need you to pass by chair between the sofa and bag, pick up the bag and go to the orange napkin on the right of the sofa.
Output:
1. near[chair::isbetween(sofa,bag)]
2. near[bag]
3. pick[bag]
4. near[orange_napkin::isrightof(sofa)]
Input_instruction: Go to the chair between the green laptop and the yellow box underneath the play toy
Output:
1. near[chair::isbetween(green_laptop,yellow_box::isbelow(play_toy))]
Input_instruction: Check the table behind the fridge and bring two beers to the couch one after the other
Output:
1. near[table::isbehind(fridge)]
2. pick[beer]
3. near[couch]
4. release[beer,couch]
5. near[table::isbehind(fridge)]
6. pick[beer]
7. near[couch]
8. release[beer,couch]
Input_instruction: <given_instruction>
Output:
Listing 6: Exact prompt to implement NLMap-Saycan LLM planner

A7-B Code-as-Policies Implementation Prompt

##Python robot planning script
from robotactions import near, pick, release
spatial_relationships = [
"isbetween", #referent_1::isbetween(referent_2,referent_3):returns true if referent_1 is between referent_2 and referent_3.
"isabove", #referent_1::isabove(referent_2):returns True if referent_1 is above referent_2.
"isbelow", #referent_1::isbelow(referent_2):returns True if referent_1 is below referent_2.
"isleftof", #referent_1::isleftof(referent_2):returns True if referent_1 is left of referent_2.
"isrightof", #referent_1::isrightof(referent_2):returns True if referent_1 is right of referent_2.
"isnextto", #referent_1::isnextto(referent_2):returns True if referent_1 is close to referent_2.
"isinfrontof", #referent_1::isinfrontof(referent_2):returns True if referent_1 is in front of referent_2.
"isbehind" #referent_1::isbehind(referent_2):returns True if referent_1 is behind referent_2.]
##Rules:
##Strictly only use the finite set of robot and spatial predicates!
##Strictly stick to the usage format!
##Compose spatial predicates where necessary!
##You should strictly stick to mentioned objects, however you are allowed to propose and include plausible objects if and only if not mentioned in instruction but required based on context of instruction!
##Pay attention to instructions that require performing certain actions multiple times in generating and sequencing the predicates for the final Output!
# Go to the orange building but before that pass by the coffee shop, then go to the parking sign.
ordered_navigation_goal_referents = ["coffee_shop", "orange_building", "parking_sign"]
for referent in ordered_navigation_goal_referents:
    near(referent)
# Go to the blue sofa then the laptop, after that bring me the brown bag between the television and the kettle on the left of the green seat, I am standing by the sink.
ordered_navigation_goal_referents = ["blue_sofa", "laptop", "brown_bag::isbetween(television,kettle::isleftof(green_seat))", "sink"]
referents_to_pick= ["brown_bag::isbetween(television,kettle::isleftof(green_seat))"]
release_location_referents = ["sink"]
picked_item = None
for referent in ordered_navigation_goal_referents:
    near(referent)
    if referent in referents_to_pick:
        pick(referent)
        picked_item = referent
    if referent in release_location_referents:
        release(picked_item, referent)
#Hey need you to pass by chair between the sofa and bag, pick up the bag and go to the orange napkin on the right of the sofa.
ordered_navigation_goal_referents = ["chair::isbetween(sofa,bag)", "bag", "orange_napkin::isrightof(sofa)"]
referents_to_pick= ["bag"]
picked_item = None
for referent in ordered_navigation_goal_referents:
    near(referent)
    if referent in referents_to_pick:
        pick(referent)
        picked_item = referent
#Go to the chair between the green laptop and the yellow box underneath the play toy
near("chair::isbetween(green_laptop,yellow_box::isbelow(play_toy))")
#Check the table behind the fridge and bring two beers to the couch one after the other
for i in range(2):
    near("table::isbehind(fridge)")
    pick("beer")
    near("couch")
    release("beer", "couch")

#<given_instruction>
Listing 7: Exact prompt to implement Code-as-policies planner

A7-C Instruction set

We perform a large-scale evaluation on 150 instructions across five real-world environments. The taskset includes 24 tasks with fine-grained object descriptions (NLMD) from [10], 25 tasks with complex language (NLMC) from [10], 25 tasks with simple structured phrasing (OKRB) from [11], 37 tasks with complex temporal structures (CT) from [7], and an additional 39 tasks we propose that have descriptive spatial constraints and temporal structures (CST).

1: put the red can in the trash bin
2: put the brown multigrain chip bag in the woven basket
3: find the succulent plant
4: pick up the up side down mug
5: put the apple on the macbook with yellow stickers
6: use the dyson vacuum cleaner
7: bring the kosher salt to the kitchen counter
8: put the used towels in washing machine
9: move the used mug to the dish washer
10: place the pickled cucumbers on the shelf
11: find my mug with the shape of a donut
12: put the almonds in the almond jar
13: fill the zisha tea pot with a coke from the cabinet
14: take the slippery floor sign with you
15: take the slippers that have holes on them to the shoe rack
16: find the mug on the mini fridge
17: bring the mint flavor gum to the small table
18: find some n95