Verifiably Following Complex Robot Instructions with Foundation Models

Benedict Quartey^†∗, Eric Rosen^∗, Stefanie Tellex, George Konidaris
Department of Computer Science, Brown University
^∗Equal Contribution. ^†Corresponding Author (Email: benedict_quartey@brown.edu)

Abstract

When instructing robots, users want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior, while robots must disambiguate instructions into specifications and ground instruction referents in the real world. To address this problem, we propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot’s alignment with an instructor’s intended motives and affords the synthesis of correct-by-construction robot behaviors. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment in diverse, unstructured domains. LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and additionally achieves a 79% success rate on complex spatiotemporal instructions, significantly outperforming baselines that only reach 38%.¹

¹ See supplementary materials and demo videos at robotlimp.github.io

I Introduction

Robots need a rich understanding of natural language to be instructable by non-experts in unstructured environments. People, on the other hand, need to be able to verify that a robot has understood a given instruction and will act appropriately. Achieving these objectives, however, is challenging as natural language instructions often feature ambiguous phrasing, intricate spatiotemporal constraints, and unique referents. To illustrate, consider the instruction shown in Figure 1: “Bring the green plush toy to the whiteboard in front of it, watch out for the robot in front of the toy”. Solving such a task requires a robot to ground open-vocabulary referents, follow temporal constraints, and disambiguate objects using spatial descriptions. Foundation models [1, 2] offer a path to achieving such complex long-horizon goals; however, existing approaches for robot instruction following have largely focused on navigation [3, 4, 5, 6, 7]. These methods, broadly classified under object goal navigation [8], enable navigation to instances of an object category but are limited in their ability to localize spatial references and disambiguate object instances based on descriptive language. Other works [9, 10, 11] extend instruction following to mobile manipulation but are limited to tasks with simple temporal constraints expressed in unambiguous language. Moreover, existing efforts typically rely on Large Language Models (LLMs) as complete planners, bypassing intermediate symbolic representations that could provide verification of correctness before execution. Alternative approaches leveraging code-writing LLMs [5, 6, 12] are susceptible to errors in generated code, which may lead to unsafe robot behaviors. Mapping natural language to specification languages like temporal logic [13] provides a robust framework for language disambiguation, handling complex temporal constraints, and behavior verification. 
However, prior works along this line require prebuilt semantic maps with discrete sets of prespecified referents/landmarks from which instructions can be constructed [7, 14, 15].

Figure 1: Our approach executing the instruction “Bring the green plush toy to the whiteboard in front of it, watch out for the robot in front of the toy”. The robot dynamically detects and grounds open-vocabulary referents with spatial constraints to construct an instruction-specific semantic map, then synthesizes a task and motion plan to solve the task. In this example, the robot navigates from its start location (yellow, A), to the green plush toy (green, B), executes a pick skill then navigates to the whiteboard (blue, C), and executes a place skill. Note that the robot has no prior semantic knowledge of the environment.

We propose Language Instruction grounding for Motion Planning (LIMP), a method that leverages foundation models and temporal logics to dynamically generate instruction-conditioned semantic maps that enable robots to construct verifiable controllers for following navigation and mobile manipulation instructions with open vocabulary referents and complex spatiotemporal constraints. In a novel environment, LIMP constructs a 3D map via SLAM, then uses LLMs to translate complex natural language instructions into temporal logic specifications with a novel composable syntax for referent disambiguation. Instruction referents are detected and grounded using vision-language models (VLMs) and spatial reasoning. Finally, a task and motion plan is synthesized to guide the robot through the required subgoals, as shown in Figure 1. In summary, we make the following contributions: (1) A modular framework that translates expressive natural language instructions into temporal logic, grounds instruction referents, and executes commands via Task and Motion Planning (TAMP). (2) A spatial grounding method for detecting and localizing open vocabulary objects with spatial constraints in 3D metric maps. (3) A TAMP algorithm that localizes regions of interest (goal/avoidance zones) and synthesizes constraint-satisfying motion plans for long-horizon tasks.

II Background and Related Works

We briefly highlight the most relevant works in visual scene understanding [10], natural language instruction following [7, 16], and task and motion planning [17], and provide a comprehensive review in our supplementary materials. NLMap [10] grounds open-vocabulary language queries to spatial locations using pre-trained VLMs. While effective for describing individual objects, it cannot handle instructions involving complex constraints between multiple objects due to the lack of object relationship modeling. LIMP addresses this with a novel spatial grounding module that resolves spatial relationships and leverages task and motion planners to satisfy these constraints. Lang2LTL [7] is a multi-stage, LLM-based approach that uses entity extraction and replacement to translate language instructions into temporal logic. Its extension [16] incorporates VLMs and semantic information (via text embeddings) to ground referents. These works require prebuilt semantic maps/databases describing landmarks to ground symbols, whereas our approach dynamically generates landmarks based on open-vocabulary instructions. Action-Oriented Semantic Maps (AOSMs) [17] augment semantic maps with models indicating where robots can perform manipulation skills, integrating with TAMP solvers for mobile manipulation. LIMP similarly provides a TAMP-compatible spatial representation but supports open-vocabulary tasks, whereas AOSMs remain constrained to a fixed set of goals once generated.

II-A Linear Temporal Logic

LIMP translates natural language instructions into temporal logic specifications for verifiable task and motion planning. While compatible with various specification languages and planning frameworks, we choose Linear Temporal Logic (LTL) [18] for its proven expressivity in representing complex robot mission requirements [19]. LTL defines temporal properties using atomic propositions, the logical operators negation ( $\neg$ ), conjunction ( $\land$ ), disjunction ( $\lor$ ), and implication ( $\rightarrow$ ), and the temporal operators next ( $\mathcal{X}$ ), until ( $\mathcal{U}$ ), globally ( $\mathcal{G}$ ), and finally ( $\mathcal{F}$ ). Despite its expressivity, LTL has been underutilized due to the expert knowledge required to construct specifications; however, recent works have seen significant success directly translating natural language into LTL [7, 20, 14, 21, 22, 23].
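As an illustration of how these operators compose, consider an instruction of the form "visit the elevator, then the front desk, then the kitchen table, and never return to the elevator once you have seen the front desk". One possible encoding is sketched below. The proposition names $a$ (robot near the elevator), $b$ (near the front desk), and $c$ (near the kitchen table) are our own illustrative assumptions, and this formula is not necessarily the one LIMP would generate.

```latex
\varphi \;=\; \mathcal{F}\bigl(a \land \mathcal{F}(b \land \mathcal{F}\,c)\bigr)
\;\land\;
\mathcal{G}\bigl(b \rightarrow \mathcal{X}\,\mathcal{G}\,\neg a\bigr)
```

The first conjunct enforces the visiting order through nested finally operators; the second states that whenever $b$ holds, $a$ must remain false from the next step onward.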

Behavior Verification: Expressing instructions as temporal logic specifications allows us to verify the correctness of generated plans a priori. However, instead of explicit verification methods such as model checking, we leverage insights from prior works [24] and directly use specifications to synthesize plans that are correct-by-construction [25, 26].

III Problem Definition

Given a natural language instruction $l$ , our goal is to synthesize and sequence navigation and manipulation behaviors to produce a policy that satisfies the temporal and spatial constraints in $l$ . Spatial constraints determine task success based on the sequence of robot poses traversed during execution; temporal constraints determine the sequencing of these spatial constraints as a function of task progression. We assume a robot with an RGB-D camera has already navigated a space, capturing images and camera poses. From this data, we build a metric map $m$ (e.g., point cloud, 3D voxel grid) of the environment, defining the space of possible $SE(3)$ poses $P$ and enabling robot localization (i.e., estimating $p_{\text{robot}}\in P$ ). Unlike previous work leveraging temporal logic [7], we do not assume access to a semantic map with prespecified object locations or predicates. Instead, we leverage two foundation models: a task-agnostic vision-language model $\sigma$ that, given an image and text, provides bounding boxes or segmentations based on the text; and an auto-regressive large language model $\psi$ that samples likely language tokens based on a history of tokens.

Navigation: Navigation is formalized as an object-goal oriented continuous path planning problem, where the goal is to generate paths to a goal pose set $P_{goals}\subset P$ while staying in feasible regions ( $P_{feasible}\subset P$ ) and avoiding infeasible regions ( $P_{infeasible}=P_{feasible}^{C}$ ). Infeasible regions include environment obstacles as well as dynamically determined semantic regions that violate constraints in the instruction $l$ .

Manipulation: We formalize manipulation behaviors as options [27] parameterized by objects. Consider an object parameter $\theta$ that parameterizes an option $O_{\theta}=(I_{\theta},\pi_{\theta},\beta_{\theta})$ : the initiation set, policy, and termination condition are functions of both the robot pose $p_{\text{robot}}\in P$ and $\theta$ . The initiation set $I_{\theta}$ denotes the robot positions in the global reference frame and the object-centric attributes, such as object size, that determine whether the option policy $\pi_{\theta}$ can be executed on the object $\theta$ . To execute a manipulation skill on an object, an object-goal navigation behavior must first be executed to bring the robot into proximity with the object. We assume access to a library of these manipulation skills and demonstrate our approach on multi-object goal navigation and open-vocabulary mobile pick-and-place [9, 28].
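The object-parameterized option can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not LIMP's implementation: the planar `Pose`, the `ObjectParam` fields, the `make_pick_option` helper, and the reach and size thresholds are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple

# Simplified planar (x, y, yaw) stand-in for an SE(3) pose (assumption).
Pose = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectParam:
    """Object parameter theta: where the object is and how large it is."""
    name: str
    position: Tuple[float, float]
    size: float  # largest dimension, in meters


@dataclass(frozen=True)
class Option:
    """An option O_theta = (I_theta, pi_theta, beta_theta)."""
    initiation: Callable[[Pose, ObjectParam], bool]   # I_theta
    policy: Callable[[Pose, ObjectParam], str]        # pi_theta
    termination: Callable[[Pose, ObjectParam], bool]  # beta_theta


def make_pick_option(max_reach: float = 0.8, max_size: float = 0.3) -> Option:
    """Hypothetical pick option; both thresholds are illustrative."""

    def can_start(pose: Pose, obj: ObjectParam) -> bool:
        # The skill may start only if the object is reachable and small enough.
        distance = math.hypot(obj.position[0] - pose[0], obj.position[1] - pose[1])
        return distance <= max_reach and obj.size <= max_size

    def act(pose: Pose, obj: ObjectParam) -> str:
        # Placeholder policy: returns a symbolic skill command.
        return f"pick({obj.name})"

    def is_done(pose: Pose, obj: ObjectParam) -> bool:
        # Placeholder: a real skill would check gripper or perception feedback.
        return True

    return Option(can_start, act, is_done)
```

Because the initiation set depends on the robot pose, a navigation behavior that brings the robot within reach of the object must run before the option becomes executable.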

IV Language Instruction Grounding for Motion Planning

LIMP interprets expressive natural language instructions to generate instruction-conditioned semantic maps, enabling robots to verifiably solve long-horizon tasks with complex spatiotemporal constraints (Figure 2). We briefly describe our modular approach in this section and present comprehensive implementation details in our supplementary materials.

IV-A Language Instruction Module

In this module, we leverage a large language model $\psi$ to translate a natural language instruction $l$ into a linear temporal logic specification $\varphi_{l}$ with a novel composable syntax for referent disambiguation. We achieve this through a two-stage in-context learning strategy. The first stage prompts $\psi$ to translate $l$ into a conventional LTL formula $\phi_{l}$ where propositions refer to open-vocabulary objects. The second stage takes $l$ and $\phi_{l}$ as input and prompts $\psi$ to generate a new formula $\varphi_{l}$ with predicate functions corresponding to parameterized robot skills.
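The two-stage strategy can be sketched as two chained prompts. The sketch below is illustrative only: the `LLM` callable interface, the prompt wording, and the example formats are assumptions, not the prompts used by LIMP.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def two_stage_translate(
    instruction: str,
    llm: LLM,
    stage1_examples: List[Tuple[str, str]],
    stage2_examples: List[Tuple[str, str, str]],
) -> Tuple[str, str]:
    """Return (phi, varphi): the conventional and the skill-level formula."""
    # Stage 1: instruction -> conventional LTL over open-vocabulary propositions.
    prompt1 = "Translate the instruction into an LTL formula.\n"
    for text, ltl in stage1_examples:
        prompt1 += f"Instruction: {text}\nLTL: {ltl}\n"
    prompt1 += f"Instruction: {instruction}\nLTL:"
    phi = llm(prompt1).strip()

    # Stage 2: (instruction, phi) -> formula over near/pick/release predicates.
    prompt2 = "Rewrite the formula using the predicates near, pick, and release.\n"
    for text, ltl, skill_ltl in stage2_examples:
        prompt2 += f"Instruction: {text}\nLTL: {ltl}\nSkill LTL: {skill_ltl}\n"
    prompt2 += f"Instruction: {instruction}\nLTL: {phi}\nSkill LTL:"
    varphi = llm(prompt2).strip()
    return phi, varphi
```

Feeding the stage-one formula into the second prompt lets the second stage focus on mapping propositions to parameterized skills rather than re-deriving the temporal structure.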

We define three predicate functions—near, pick, and release—for the primitive navigation and manipulation skills required for multi-object goal navigation and mobile pick-and-place. Predicate functions in $\varphi_{l}$ are parameterized by Composable Referent Descriptors (CRDs), our novel propositional expressions representing specific referent instances by chaining comparators that encode descriptive spatial information. For example, the instruction “the yellow cabinet above the fridge that is next to the stove” can be represented with the CRD:

$$\text{yellow\_cabinet}::\text{isabove}(\text{fridge}::\text{isnextto}(\text{stove})). \tag{1}$$

This specifies that there is a fridge next to a stove, and the desired yellow cabinet is above that fridge. CRDs are constructed from a set of 3D spatial comparators [29] defined in our prompting strategy.
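To make the nesting in the CRD above concrete, the following sketch parses such an expression into a tree. The grammar it assumes (a referent name, optionally followed by `::comparator(...)` with comma-separated landmark CRDs) is inferred from the single example in the text; the `CRD` class and `parse_crd` function are hypothetical names, not part of LIMP.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CRD:
    referent: str
    comparator: Optional[str] = None
    landmarks: List["CRD"] = field(default_factory=list)


def _read_name(s: str, i: int) -> Tuple[str, int]:
    """Read an identifier made of letters, digits, and underscores."""
    j = i
    while j < len(s) and (s[j].isalnum() or s[j] == "_"):
        j += 1
    if j == i:
        raise ValueError(f"expected a name at position {i}")
    return s[i:j], j


def _parse(s: str, i: int) -> Tuple[CRD, int]:
    referent, i = _read_name(s, i)
    if not s.startswith("::", i):
        return CRD(referent), i  # plain referent without a spatial description
    comparator, i = _read_name(s, i + 2)
    if i >= len(s) or s[i] != "(":
        raise ValueError(f"expected '(' at position {i}")
    i += 1
    landmarks = []
    while True:
        landmark, i = _parse(s, i)  # landmarks may themselves be CRDs
        landmarks.append(landmark)
        if i < len(s) and s[i] == ",":
            i += 1
            continue
        break
    if i >= len(s) or s[i] != ")":
        raise ValueError(f"expected ')' at position {i}")
    return CRD(referent, comparator, landmarks), i + 1


def parse_crd(text: str) -> CRD:
    s = text.replace(" ", "")
    crd, end = _parse(s, 0)
    if end != len(s):
        raise ValueError(f"unexpected trailing text at position {end}")
    return crd
```

Parsing `yellow_cabinet::isabove(fridge::isnextto(stove))` yields a `yellow_cabinet` node whose single landmark is a `fridge` node, which in turn has `stove` as its landmark.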

Unlike recent works [9, 11], our approach does not require specific phrasing or keywords and can handle instructions with arbitrary complexity and ambiguity. The LLM $\psi$ directly samples the entire LTL formula $\varphi_{l}$ with predicate functions parameterized by CRDs using appropriate spatial comparators based on the instruction’s context. Figure 3 illustrates the result of our two-stage prompting strategy.

LLM Verification: Verifying the LTL formula $\varphi_{l}$ sampled from the LLM is crucial, as errors in referent extraction and temporal task structure affect instruction-following accuracy. Our symbol verification node (Figure 2) leverages LTL properties to provide high-level human-in-the-loop verification of extracted instruction referents and temporal task structure. Recent work [30] provides IEC 61508 [31] safety guarantees in robot task execution by translating safety constraints from natural language to LTL formulas, which are verified by human experts and used to enforce robot behavior. Similarly, we rely on human verification to ensure the translated formula $\varphi_{l}$ is correct. Our symbol verification node implements an interactive dialog system that presents users with the extracted referent CRDs and implied task structure, and reprompts the LLM based on user corrections to obtain new formulas. Unlike prior work [30], we eliminate the need for experts by directly translating the task structure, which is encoded in the LTL formula’s equivalent automaton, back into English statements via a simple deterministic translation scheme. In our experiments (Tables I and II), we find that even without human verification and reprompting, the initial formulas sampled by our language instruction module encode the correct referents and temporal task structure in most cases.

IV-B Spatial Grounding Module

This module detects and localizes specific object instances referenced in an instruction. From the translated LTL formulas, we extract composable referent descriptors (CRDs) and use vision-language models OWL-ViT [32] and SAM [33] to detect and segment all referent occurrences from the robot’s prior observations of the environment. We backproject pixels in these segmentation masks onto our 3D map, creating an initial semantic map of all instruction object instances. From the example in Figure 3, occurrences of green_plush_toy, whiteboard, and robot are detected, segmented, and backprojected onto the map (Figure 4[a&b]).
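The backprojection step can be sketched for a single mask pixel. The sketch assumes a standard pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and a 4x4 camera-to-world transform taken from the recorded camera pose; the text does not specify these details, so the function and parameter names are illustrative.

```python
from typing import Sequence, Tuple


def backproject(
    u: float,
    v: float,
    depth: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> Tuple[float, float, float]:
    """Map pixel (u, v) with metric depth to a 3D point in the map frame."""
    # Pinhole model: recover the point in the camera frame.
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    point = (x_cam, y_cam, depth, 1.0)  # homogeneous coordinates
    # Apply the 4x4 camera-to-world transform (top three rows only).
    x, y, z = (
        sum(cam_to_world[row][col] * point[col] for col in range(4))
        for row in range(3)
    )
    return (x, y, z)
```

Applying this to every pixel inside a segmentation mask yields the 3D points that label the corresponding object instance in the map.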

To obtain the specific object instances described in the instruction, we resolve the 3D spatial comparators in each referent’s CRD (recall that CRDs are propositional expressions and can be evaluated as true or false). We define eight spatial comparators (isbetween, isabove, isbelow, isleftof, isrightof, isnextto, isinfrontof, isbehind) to reason about spatial relationships based on backprojected 3D positions. Since all backprojected positions are relative to an origin coordinate system, our spatial comparators are resolved from the perspective of this origin position as shown in Figure 4[c]. This type of relative frame of reference (FoR) when describing spatial relationships between objects, in contrast to an absolute or intrinsic FoR, is dominant in English [34], and is a logical choice for our work.

Using the 3D position of each referent’s center mask pixel as its representative position, we resolve a given referent with a spatial description by applying the appropriate spatial comparator to all detected pairs of the desired referent and comparison landmark objects. This filtering process yields a Referent Semantic Map (RSM) that localizes specific object instances described in the instruction as shown in Figure 4[d].
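The pairwise filtering described above can be sketched as follows. The comparator definitions here are simplified stand-ins: the axis convention (z up), the horizontal tolerance in `isabove`, and the distance threshold in `isnextto` are our assumptions, and only two of the eight comparators are shown.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the map frame, z up (assumption)
Comparator = Callable[[Point, Point], bool]


def isabove(a: Point, b: Point, xy_tolerance: float = 0.5) -> bool:
    """a is higher than b and roughly over it (tolerance is illustrative)."""
    return a[2] > b[2] and math.dist(a[:2], b[:2]) <= xy_tolerance


def isnextto(a: Point, b: Point, threshold: float = 1.0) -> bool:
    """a lies within a fixed distance of b (threshold is illustrative)."""
    return math.dist(a, b) <= threshold


def resolve(candidates: List[Point], comparator: Comparator,
            landmarks: List[Point]) -> List[Point]:
    """Keep candidates for which the comparator holds with some landmark."""
    return [c for c in candidates if any(comparator(c, l) for l in landmarks)]


# Nested CRDs resolve inside-out: first the fridges next to a stove,
# then the cabinets above one of those fridges.
stoves = [(1.0, 2.0, 0.9)]
fridges = [(0.1, 2.0, 0.9), (6.0, 2.0, 0.9)]
cabinets = [(0.0, 2.0, 2.0), (3.0, 2.0, 2.0)]
fridges_by_stove = resolve(fridges, isnextto, stoves)
target_cabinets = resolve(cabinets, isabove, fridges_by_stove)
```

Here only the first fridge lies within the nearness threshold of the stove, so only the cabinet directly above it survives the second filter.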

VLM Verification: Potential misclassifications from object detector VLMs are the main source of error in this module. We do not address interactively correcting VLM misclassifications as that is out of the scope of this work, but we provide 3D visualization tools that enable users to visually inspect and verify that constructed referent semantic maps correctly localize referents.

IV-C Task and Motion Planning Module

Finally, our TAMP module synthesizes and sequences navigation and manipulation behaviors to produce a plan that satisfies the temporal and spatial constraints expressed in the given instruction.

Progressive Motion Planner (PMP): Our TAMP algorithm compiles the LTL formula with parameterized robot skills into an equivalent finite-state automaton (Figure 5[a]) to generate a verifiably correct task and motion plan. A path from the initial to the accepting state in this automaton is a high-level task plan that interleaves navigation and manipulation objectives required to satisfy the instruction. We select such a path with a simple strategy that incrementally selects the next progression state until the accepting state is reached, ensuring the plan obeys all temporal subgoal objectives. As shown in Figure 5[a], automaton states are connected by transition edges representing the logical expressions required for transitions. For each transition, our algorithm executes the necessary low-level behaviors: for manipulation subgoals, it executes the appropriate parameterized skill; for navigation subgoals, it dynamically generates Task Progression Semantic Maps (TPSMs) to localize goal and constraint regions and performs continuous path planning using the Fast Marching Tree algorithm (FMT^∗) [35].
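The task-planning step can be sketched as a search for a path from the initial to the accepting automaton state. The text describes the selection strategy only as incrementally choosing the next progression state, so the breadth-first search below is a stand-in technique; the dictionary encoding of the automaton and the edge labels are likewise hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# state -> list of (transition label, next state); labels are illustrative.
Automaton = Dict[int, List[Tuple[str, int]]]


def task_plan(automaton: Automaton, initial: int, accepting: int) -> Optional[List[str]]:
    """Return the transition labels along a shortest path to the accepting state."""
    parent: Dict[int, Optional[Tuple[int, str]]] = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state == accepting:
            labels = []
            while parent[state] is not None:
                previous, label = parent[state]
                labels.append(label)
                state = previous
            return labels[::-1]
        for label, nxt in automaton.get(state, []):
            if nxt not in parent:  # also skips self-loops
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None  # accepting state unreachable


# Hypothetical automaton for a pick-and-place instruction.
automaton: Automaton = {
    0: [("!near(robot)", 0), ("near(green_plush_toy) & !near(robot)", 1)],
    1: [("pick(green_plush_toy)", 2)],
    2: [("near(whiteboard) & !near(robot)", 3)],
    3: [("release(green_plush_toy)", 4)],
}
plan = task_plan(automaton, initial=0, accepting=4)
```

Each label in the resulting plan is then dispatched either to a navigation routine (for `near` subgoals, with negated propositions treated as regions to avoid) or to the matching manipulation skill.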

Task Progression Semantic Maps (TPSM): A TPSM augments a 3D scene with navigation constraints specified by logical state transition expressions, enabling goal-directed, constraint-aware navigation. Regions of interest in a TPSM are defined using a nearness threshold specifying proximity to an object. This threshold can be set globally or included in the language instruction module’s prompting strategy, allowing an LLM to infer its value based on the instruction. Like our spatial grounding module, TPSMs are agnostic to temporal logic representations and can be used with various planning approaches for semantic constraint-aware motion planning. We primarily evaluate our approach on a ground mobile robot, hence we transform 3D TPSMs into 2D geometric obstacle maps, where constraint regions are treated as obstacles (Figure 5[b]). However, our approach is robot-agnostic and supports direct planning in 3D TPSMs for appropriate embodiments like drones.
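The conversion of constraint regions into a 2D obstacle map can be sketched on an occupancy grid. The grid layout, the cell-center convention, and the function name below are assumptions made for illustration; the path planner itself is omitted.

```python
import math
from typing import Iterable, List, Tuple


def build_obstacle_map(
    width: int,
    height: int,
    resolution: float,
    static_obstacles: Iterable[Tuple[int, int]],
    avoid_points: Iterable[Tuple[float, float]],
    nearness_threshold: float,
) -> List[List[int]]:
    """Return a grid where 1 marks cells the planner must not enter."""
    grid = [[0] * width for _ in range(height)]
    # Geometric obstacles from the metric map, given as (row, col) cells.
    for row, col in static_obstacles:
        grid[row][col] = 1
    # Semantic constraint regions: every cell whose center lies within the
    # nearness threshold of an object to avoid is treated as an obstacle.
    avoid = list(avoid_points)
    for row in range(height):
        for col in range(width):
            x = (col + 0.5) * resolution
            y = (row + 0.5) * resolution
            if any(math.hypot(x - ax, y - ay) <= nearness_threshold for ax, ay in avoid):
                grid[row][col] = 1
    return grid
```

A goal region can be computed the same way from the goal object's position, after which any 2D path planner can be run on the resulting grid.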

V Evaluation

Our evaluations test the hypothesis that translating natural language instructions into LTL expressions and dynamically generating semantic maps enables robots to accurately interpret and execute instructions in large-scale environments without prior training. We focus on three key questions: (1) Can our language instruction module interpret complex, ambiguous instructions? (2) Can our spatial grounding module resolve specific object instances described in instructions? (3) Can our TAMP algorithm generate constraint-satisfying plans?

To answer these questions, we compare LIMP with two baselines: an LLM task planner (NLMap-Saycan [10]) and an LLM code-writing planner (Code-as-Policies [12]), representing state-of-the-art approaches for language-conditioned, open-ended robot task execution. Both baselines use the same LLM (GPT-4-0613), prompting structure, and in-context learning examples as our language instruction module. In Code-as-Policies, in-context examples are converted into language model-generated program (LMP) snippets [12]. To ensure competitive performance, we integrate our CRD syntax, spatial grounding module, and low-level robot control into these baselines, allowing them to query object positions, use our FMT^∗ path planner, and execute manipulation skills.

We also evaluate ablations of our two-stage language instruction module due to its importance in instruction following. In our full approach, the first stage prompts an LLM to generate a conventional LTL formula $\phi_{l}$ from instruction $l$ by dynamically selecting relevant in-context examples from a standard dataset [14] based on cosine similarity. Our first ablation selects in-context examples randomly; and the second ablation skips this stage entirely, directly sampling our LTL syntax with parameterized robot skills $\varphi_{l}$ from $l$ .
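The similarity-based selection of in-context examples can be sketched as a top-k ranking by cosine similarity. The embedding vectors are assumed to come from some text-embedding model, which is not shown, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (instruction, LTL formula)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(
    query_embedding: Sequence[float],
    pool: List[Tuple[Sequence[float], Example]],
    k: int,
) -> List[Example]:
    """Return the k examples whose embeddings are most similar to the query."""
    ranked = sorted(
        pool,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    return [example for _, example in ranked[:k]]
```

The random-selection ablation corresponds to replacing the ranking with a random sample of `k` examples from the same pool.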

We conduct a large-scale evaluation across five real-world environments on a diverse task set of 150 instructions from multiple prior works [10, 11, 7]. This task set consists of 24 tasks with fine-grained object descriptions (NLMD), 25 tasks with complex language (NLMC), 25 tasks with simple structured phrasing (OKRB), 37 tasks with complex temporal structures (CT) and 39 tasks with descriptive spatial constraints and temporal structures (CST). Below are examples from each task category illustrating the variety in complexity:

NLMD: Put the brown multigrain chip bag in the woven basket
NLMC: I like fruits, can you put something I would like on the yellow sofa for me
OKRB: Move the soda can to the box
CT: Visit the purple door elevator, then go to the front desk and then go to the kitchen table, in addition you can never go to the elevator once you have seen the front desk
CST: I have a white cabinet, a green toy, a bookshelf and a red chair around here somewhere. Take the second item I mentioned from between the first item and the third. Bring it the cabinet but avoid the last item at all costs.

To evaluate instruction understanding, we introduce performance metrics: referent resolution accuracy, avoidance constraint resolution accuracy, and spatial relationship resolution accuracy. These metrics utilize the word error rate (WER), widely used in speech recognition to quantify the difference between a reference and a hypothesis transcription by computing the minimal number of substitutions, deletions, and insertions needed to transform the hypothesis into the reference. WER is calculated as $WER=\frac{S+D+I}{N}$ , where $S$ is substitutions, $D$ deletions, $I$ insertions, and $N$ is the total number of words in the reference. In our work:

• Referent resolution accuracy compares extracted referents in the generated LTL formula to ground truth referents.
• Avoidance constraint resolution accuracy compares referents to avoid in the LTL formula (denoted by the unary negation operator) against ground truth avoidance referents.
• Spatial relationship resolution accuracy compares generated Composable Referent Descriptors (CRDs) in the LTL specification with ground truth CRD expressions.
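The WER computation underlying these three metrics can be sketched with a standard edit-distance recurrence over token sequences. Treating each extracted referent as one token is our assumption about how the metric is applied; the function name is illustrative.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (S + D + I) / N, with N the number of reference tokens."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one token")
    # previous[j] holds the edit distance between the first i-1 reference
    # tokens and the first j hypothesis tokens.
    previous = list(range(m + 1))
    for i in range(1, n + 1):
        current = [i] + [0] * m
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            current[j] = min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + substitution,  # match or substitution
            )
        previous = current
    return previous[m] / n
```

For example, if the ground truth referents are `green_plush_toy`, `whiteboard`, and `robot` and the generated formula omits `robot`, the result is one deletion over three reference tokens.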

We also define temporal alignment accuracy and planning success rate. A plan is temporally aligned if the sequence of subgoals matches the instructor’s intention, and successful if it satisfies all spatial and temporal constraints specified in the instruction. Achieving a high plan success rate is challenging, requiring accurate referent and avoidance constraint resolution, spatial grounding, and temporal alignment. We report average word error rates for each baseline in Table I and the percentage of successful and temporally accurate plans in Table II.

TABLE I: Performance comparison of one-shot instruction understanding and spatial resolution.

Approach	Referent Resolution Accuracy (Average WER) $\downarrow$	Avoidance Constraint Resolution Accuracy (Average WER) $\downarrow$	Spatial Relationship Resolution Accuracy (Average WER) $\downarrow$
NLMap-Saycan	0.09	0.12	0.05
Code-as-Policies	0.22	0.24	0.06
LIMP Single Stage Prompting	0.09	0.11	0.05
LIMP Two Stage Prompting [Random Embedding]	0.08	0.11	0.03
LIMP Two Stage Prompting [Similar Embedding]	0.07	0.04	0.03

TABLE II: Performance comparison of one-shot temporal alignment and plan success rate.

Approach	Temporal Alignment Accuracy (% of instructions) $\uparrow$					Planning Success Rate (% of instructions) $\uparrow$
	NLMD	NLMC	OKRB	CT	CST	NLMD	NLMC	OKRB	CT	CST
NLMap-Saycan	88%	96%	100%	32%	41%	75%	96%	100%	35%	38%
Code-as-Policies	58%	68%	100%	35%	38%	46%	68%	100%	38%	38%
LIMP Single Stage Prompting	79%	64%	100%	68%	74%	63%	60%	100%	62%	62%
LIMP Two Stage Prompting [Random Embedding]	83%	68%	100%	76%	85%	79%	68%	100%	57%	72%
LIMP Two Stage Prompting [Similar Embedding]	88%	80%	100%	76%	92%	79%	76%	100%	65%	79%

VI Discussion

Beyond the verification benefits of symbolic planning, LIMP outperforms baselines in most task sets, notably in complex temporal planning and constraint avoidance. While NLMap-Saycan and Code-as-Policies effectively generate sequential subgoals, they struggle with strict temporal constraints, for example, avoiding a specific referent while approaching another. Our approach ensures each robot step adheres to constraints while achieving subgoals, explaining LIMP’s superior performance on CT and CST tasks. As shown in Table II, LIMP underperforms only against NLMap-Saycan in the NLMC task category. This task set, introduced in the same paper as the NLMap-Saycan baseline [10], includes instructions with implicit details such as: “I like fruits, can you put something I would like on the yellow sofa for me.” NLMap-Saycan is better suited to infer and generate plans with possible fruit options, whereas our few-shot LTL translation process is not designed for this.

VII Limitations and Conclusion

Although LIMP is capable of interpreting non-finite instructions into LTL formulas, our planner is currently limited to processing co-safe formulas, which handle only finite sequences. The accuracy of spatial grounding relies on the performance of vision-language models (VLMs) for object recognition, meaning any shortcomings in these systems can negatively impact results. Additionally, LIMP assumes a static environment between mapping and execution, making it unresponsive to dynamic changes, an area we aim to address with future work on editable scene representations. Our Progressive Motion Planning algorithm is complete but does not guarantee optimality; however, our framework can be used with existing TAMP planners to enhance efficiency.

Foundation models hold significant promise for advancing the next generation of autonomous robots. Our results suggest that combining these models—LLMs for language and VLMs for vision—with established methods for safety, explainability, and verifiable behavior synthesis can lead to more reliable and capable robotic systems.

Acknowledgement

This work was supported by the Office of Naval Research (ONR) under REPRISM MURI N000142412603 and ONR #N00014-22-1-2592, as well as the National Science Foundation (NSF) via grant #1955361. Partial funding was also provided by The Robotics and AI Institute.

References

[1] Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y. Xie, T. Zhang, H.-S. Fang, S. Zhao, S. Omidshafiei, D.-K. Kim, A.-a. Agha-mohammadi, K. Sycara, M. Johnson-Roberson, D. Batra, X. Wang, S. Scherer, C. Wang, Z. Kira, F. Xia, and Y. Bisk, “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis,” Oct. 2024, arXiv:2312.08782 [cs]. [Online]. Available: http://arxiv.org/abs/2312.08782
[2] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,” The International Journal of Robotics Research, p. 02783649241281508, Sept. 2024, publisher: SAGE Publications Ltd STM. [Online]. Available: https://doi.org/10.1177/02783649241281508
[3] S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2023, pp. 23 171–23 181. [Online]. Available: https://ieeexplore.ieee.org/document/10203853/
[4] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action,” in Proceedings of The 6th Conference on Robot Learning. PMLR, Mar. 2023, pp. 492–504, ISSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v205/shah23b.html
[5] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual Language Maps for Robot Navigation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 10 608–10 615. [Online]. Available: https://ieeexplore.ieee.org/document/10160969
[6] ——, “Audio Visual Language Maps for Robot Navigation,” in Experimental Robotics, M. H. Ang Jr and O. Khatib, Eds. Cham: Springer Nature Switzerland, 2024, pp. 105–117.
[7] J. X. Liu, Z. Yang, I. Idrees, S. Liang, B. Schornstein, S. Tellex, and A. Shah, “Grounding Complex Natural Language Commands for Temporal Tasks in Unseen Environments,” in Proceedings of The 7th Conference on Robot Learning. PMLR, Dec. 2023, pp. 1084–1110, ISSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v229/liu23d.html
[8] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir, “On Evaluation of Embodied Navigation Agents,” July 2018, arXiv:1807.06757 [cs]. [Online]. Available: http://arxiv.org/abs/1807.06757
[9] S. Yenamandra, A. Ramachandran, K. Yadav, A. S. Wang, M. Khanna, T. Gervet, T.-Y. Yang, V. Jain, A. Clegg, J. M. Turner, Z. Kira, M. Savva, A. X. Chang, D. S. Chaplot, D. Batra, R. Mottaghi, Y. Bisk, and C. Paxton, “HomeRobot: Open-Vocabulary Mobile Manipulation,” in Proceedings of The 7th Conference on Robot Learning. PMLR, Dec. 2023, pp. 1975–2011, ISSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v229/yenamandra23a.html
[10] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler, “Open-vocabulary Queryable Scene Representations for Real World Planning,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 11 509–11 522. [Online]. Available: https://ieeexplore.ieee.org/document/10161534
[11] P. Liu, Y. Orru, J. Vakil, C. Paxton, N. Shafiullah, and L. Pinto, “Demonstrating OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics,” in Robotics: Science and Systems. Robotics: Science and Systems Foundation, July 2024. [Online]. Available: http://www.roboticsproceedings.org/rss20/p091.pdf
[12] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as Policies: Language Model Programs for Embodied Control,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 9493–9500. [Online]. Available: https://ieeexplore.ieee.org/document/10160591
[13] E. A. Emerson, “Temporal and Modal Logic,” in Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, J. v. Leeuwen, Ed. Elsevier and MIT Press, 1990, pp. 995–1072. [Online]. Available: https://doi.org/10.1016/b978-0-444-88074-1.50021-4
[14] J. Pan, G. Chou, and D. Berenson, “Data-Efficient Learning of Natural Language to Linear Temporal Logic Translators for Robot Task Specification,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), May 2023, pp. 11 554–11 561. [Online]. Available: https://ieeexplore.ieee.org/document/10161125
[15] B. Quartey, A. Shah, and G. Konidaris, “Exploiting Contextual Structure to Generate Useful Auxiliary Tasks,” in NeurIPS 2023 Workshop on Generalization in Planning, vol. abs/2303.05038, 2023, arXiv: 2303.05038. [Online]. Available: https://doi.org/10.48550/arXiv.2303.05038
[16] J. X. Liu, A. Shah, G. Konidaris, S. Tellex, and D. Paulius, “Lang2LTL-2: Grounding Spatiotemporal Navigation Commands Using Large Language and Vision-Language Models,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2024, pp. 2325–2332. [Online]. Available: https://ieeexplore.ieee.org/document/10802696
[17] E. Rosen, S. James, S. Orozco, V. Gupta, M. Merlin, S. Tellex, and G. Konidaris, “Synthesizing Navigation Abstractions for Planning with Portable Manipulation Skills,” in Proceedings of The 7th Conference on Robot Learning. PMLR, Dec. 2023, pp. 2278–2287. [Online]. Available: https://proceedings.mlr.press/v229/rosen23a.html
[18] A. Pnueli, “The temporal logic of programs,” in 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), Oct. 1977, pp. 46–57. [Online]. Available: https://ieeexplore.ieee.org/document/4567924
[19] C. Menghi, C. Tsigkanos, P. Pelliccione, C. Ghezzi, and T. Berger, “Specification Patterns for Robotic Missions,” IEEE Transactions on Software Engineering, vol. 47, no. 10, pp. 2208–2224, Oct. 2021. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/TSE.2019.2945329
[20] M. Berg, D. Bayazit, R. Mathew, A. Rotter-Aboyoun, E. Pavlick, and S. Tellex, “Grounding Language to Landmarks in Arbitrary Outdoor Environments,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), May 2020, pp. 208–215. [Online]. Available: https://ieeexplore.ieee.org/document/9197068
[21] M. Cosler, C. Hahn, D. Mendoza, F. Schmitt, and C. Trippel, “nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models,” in Computer Aided Verification, C. Enea and A. Lal, Eds. Cham: Springer Nature Switzerland, 2023, pp. 383–396.
[22] F. Fuggitti and T. Chakraborti, “NL2LTL – a Python Package for Converting Natural Language (NL) Instructions to Linear Temporal Logic (LTL) Formulas,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 13, pp. 16 428–16 430, 2023. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/27068
[23] Y. Chen, R. Gandhi, Y. Zhang, and C. Fan, “NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 15 880–15 903. [Online]. Available: https://aclanthology.org/2023.emnlp-main.985/
[24] M. Y. Vardi, “An automata-theoretic approach to linear temporal logic,” in Logics for Concurrency: Structure versus Automata, F. Moller and G. Birtwistle, Eds. Berlin, Heidelberg: Springer, 1996, pp. 238–266. [Online]. Available: https://doi.org/10.1007/3-540-60915-6_6
[25] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-Logic-Based Reactive Mission and Motion Planning,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, Dec. 2009. [Online]. Available: https://ieeexplore.ieee.org/document/5238617
[26] M. Colledanchise, R. M. Murray, and P. Ögren, “Synthesis of correct-by-construction behavior trees,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2017, pp. 6039–6046. [Online]. Available: https://ieeexplore.ieee.org/document/8206502
[27] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1, pp. 181–211, Aug. 1999. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0004370299000521
[28] N. Yokoyama, A. Clegg, J. Truong, E. Undersander, T.-Y. Yang, S. Arnaud, S. Ha, D. Batra, and A. Rai, “ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation,” IEEE Robotics and Automation Letters, vol. 9, no. 1, pp. 779–786, Jan. 2024. [Online]. Available: https://ieeexplore.ieee.org/document/10328058
[29] K. Jatavallabhula, A. Kuwajerwala, Q. Gu, M. Omama, G. Iyer, S. Saryazdi, T. Chen, A. Maalouf, S. Li, N. Keetha, A. Tewari, J. Tenenbaum, C. Melo, M. Krishna, L. Paull, F. Shkurti, and A. Torralba, “ConceptFusion: Open-set multimodal 3D mapping,” in Robotics: Science and Systems XIX. Robotics: Science and Systems Foundation, July 2023. [Online]. Available: http://www.roboticsproceedings.org/rss19/p066.pdf
[30] Z. Yang, S. S. Raman, A. Shah, and S. Tellex, “Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), May 2024, pp. 14 435–14 442. [Online]. Available: https://ieeexplore.ieee.org/document/10611447
[31] I. E. Commission et al., “Functional safety of electrical/electronic/programmable electronic safety related systems,” IEC 61508, 2000.
[32] M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, X. Wang, X. Zhai, T. Kipf, and N. Houlsby, “Simple Open-Vocabulary Object Detection,” in Computer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Cham: Springer Nature Switzerland, 2022, pp. 728–755.
[33] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment Anything,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2023, pp. 3992–4003. [Online]. Available: https://ieeexplore.ieee.org/document/10378323
[34] A. Majid, M. Bowerman, S. Kita, D. B. M. Haun, and S. C. Levinson, “Can language restructure cognition? The case for space,” Trends in Cognitive Sciences, vol. 8, no. 3, pp. 108–114, Mar. 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1364661304000208
[35] L. Janson, E. Schmerling, A. Clark, and M. Pavone, “Fast marching tree: A fast marching sampling-based method for optimal motion planning in many dimensions,” The International Journal of Robotics Research, vol. 34, no. 7, pp. 883–921, June 2015. [Online]. Available: https://doi.org/10.1177/0278364915577958
[36] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, “Robots that use language,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 25–55, 2020.
[37] V. Blukis, R. Knepper, and Y. Artzi, “Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following,” in Proceedings of the 2020 Conference on Robot Learning. PMLR, Oct. 2021, pp. 1829–1854. [Online]. Available: https://proceedings.mlr.press/v155/blukis21a.html
[38] R. Patel, E. Pavlick, and S. Tellex, “Grounding Language to Non-Markovian Tasks with No Supervision of Task Specifications,” in Robotics: Science and Systems XVI. Robotics: Science and Systems Foundation, July 2020. [Online]. Available: http://www.roboticsproceedings.org/rss16/p016.pdf
[39] C. Wang, C. Ross, Y.-L. Kuo, B. Katz, and A. Barbu, “Learning a natural-language to LTL executable semantic parser for grounded robotics,” in Proceedings of the 2020 Conference on Robot Learning. PMLR, Oct. 2021, pp. 1706–1718. [Online]. Available: https://proceedings.mlr.press/v155/wang21g.html
[40] K. Zheng, D. Bayazit, R. Mathew, E. Pavlick, and S. Tellex, “Spatial Language Understanding for Object Search in Partially Observed City-scale Environments,” in 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN). Vancouver, BC, Canada: IEEE, Aug. 2021, pp. 315–322. [Online]. Available: https://ieeexplore.ieee.org/document/9515426/
[41] X. Wang, W. Wang, J. Shao, and Y. Yang, “LANA: A Language-Capable Navigator for Instruction Following and Generation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, June 2023, pp. 19 048–19 058. [Online]. Available: https://ieeexplore.ieee.org/document/10203301/
[42] S.-M. Park and Y.-G. Kim, “Visual language navigation: a survey and open challenges,” Artificial Intelligence Review, vol. 56, no. 1, pp. 365–427, Jan. 2023. [Online]. Available: https://doi.org/10.1007/s10462-022-10174-9
[43] B. Yu, H. Kasaei, and M. Cao, “L3MVN: Leveraging Large Language Models for Visual Target Navigation,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2023, pp. 3554–3560. [Online]. Available: https://ieeexplore.ieee.org/document/10342512
[44] C. H. Song, B. M. Sadler, J. Wu, W.-L. Chao, C. Washington, and Y. Su, “LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE, Oct. 2023, pp. 2986–2997. [Online]. Available: https://ieeexplore.ieee.org/document/10378628/
[45] E. Hsiung, H. Mehta, J. Chu, X. Liu, R. Patel, S. Tellex, and G. Konidaris, “Generalizing to New Domains by Mapping Natural Language to Lifted LTL,” in 2022 International Conference on Robotics and Automation (ICRA). Philadelphia, PA, USA: IEEE Press, May 2022, pp. 3624–3630. [Online]. Available: https://doi.org/10.1109/ICRA46639.2022.9812169
[46] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models,” in Proceedings of The 7th Conference on Robot Learning. PMLR, Dec. 2023, pp. 540–562. [Online]. Available: https://proceedings.mlr.press/v229/huang23b.html
[47] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning,” in Proceedings of The 7th Conference on Robot Learning. PMLR, Dec. 2023, pp. 23–72. [Online]. Available: https://proceedings.mlr.press/v229/rana23a.html
[48] R.-Z. Qiu, Y. Hu, G. Yang, Y. Song, Y. Fu, J. Ye, J. Mu, R. Yang, N. Atanasov, S. A. Scherer, and X. Wang, “Learning Generalizable Feature Fields for Mobile Manipulation,” CoRR, vol. abs/2403.07563, 2024, arXiv: 2403.07563. [Online]. Available: https://doi.org/10.48550/arXiv.2403.07563
[49] I. Kostavelis and A. Gasteratos, “Semantic mapping for mobile robotics tasks: A survey,” Robotics and Autonomous Systems, vol. 66, pp. 86–103, Apr. 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0921889014003030
[50] J. Crespo, J. C. Castillo, O. M. Mozos, and R. Barber, “Semantic Information for Robot Navigation: A Survey,” Applied Sciences, vol. 10, no. 2, p. 497, Jan. 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/2/497
[51] A. Pronobis, P. Jensfelt, and J. Little, Semantic Mapping with Mobile Robots. Stockholm: KTH Royal Institute of Technology, 2011.
[52] R. E. Fikes and N. J. Nilsson, “Strips: A new approach to the application of theorem proving to problem solving,” Artificial intelligence, vol. 2, no. 3-4, pp. 189–208, 1971.
[53] C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez, “Integrated Task and Motion Planning,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 265–293, May 2021. [Online]. Available: https://www.annualreviews.org/content/journals/10.1146/annurev-control-091420-084139
[54] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling, “PDDLStream: Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning,” Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, pp. 440–448, June 2020. [Online]. Available: https://ojs.aaai.org/index.php/ICAPS/article/view/6739
[55] R. Holladay, T. Lozano-Pérez, and A. Rodriguez, “Planning for Multi-stage Forceful Manipulation,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 6556–6562. [Online]. Available: https://ieeexplore.ieee.org/document/9561233

Verifiably Following Complex Robot Instructions with Foundation Models
Appendix

A1 Appendix Summary

These sections present additional details on our approach for leveraging foundation models and temporal logics to verifiably follow expressive natural language instructions with complex spatiotemporal constraints without prebuilt semantic maps. We encourage readers to visit our website robotlimp.github.io for a project summary and demonstration videos.

A2 Extended Related Works

A2-A Foundation Models in Robotics

Grounding language referents to entities and actions in the world [36, 37, 38, 39] is challenging in part because complex perceptual and behavioral meaning can be constructed from the composition of a wide range of open-vocabulary components [40, 20, 21, 41, 37]. To address this problem, foundation models have recently garnered interest as an approach for generating perceptual representations that are aligned with language [42, 4, 10, 43, 44, 5]. Because foundation models are being leveraged for instruction following in robotics in an ever-expanding number of ways (e.g., generating plans [5] or code [12]), we focus our review on the most related approaches in two relevant application areas: 1) generating natural language queryable scene representations and 2) generating robot plans for following natural language instructions [7, 45].

Visual scene understanding: The approach for visual scene understanding most similar to ours is NLMap [10], a scene representation that affords grounding open-vocabulary language queries to spatial locations via pre-trained visual-language models. Given a sequence of calibrated RGB-D camera images and pre-trained visual-language models, NLMap supports language queries by 1) segmenting out the most likely objects in the 2D RGB images based on the language queries, and 2) estimating the most likely 3D positions via back-projection of the 2D segmentation masks using the depth data and camera pose. While NLMap is suitable for handling complex descriptions of individual objects (e.g., “green plush toy”), it is fundamentally unable to handle instructions involving complex constraints between multiple objects since it has no way to account for object-object relationships (e.g., “the green plush toy that is between the toy box and door”). LIMP handles these more complicated language instructions by using a novel spatial grounding module that incorporates a wide variety of complex spatial relationships between objects. In addition, our scene representation is compatible with both LLM planners and TAMP solvers, whereas NLMap is only compatible with LLM planners.

While NLMap is the most relevant approach to ours, other approaches for visual scene understanding and task planning with foundation models are worth highlighting. VoxPoser [46] leverages the ability of LLMs to identify affordances and write code for manipulation tasks, along with the complementary ability of VLMs to identify open-vocabulary entities in the environment. SayPlan [47] integrates 3D scene graphs with LLM-based planners to bridge the gap between complex, hierarchical scene representations and scalable task planning with open-ended task specifications. Generalizable Feature Fields (GeFF) [48] use an implicit scene representation to support open-world manipulation and navigation via an extension of Neural Radiance Fields (NeRFs) and feature distillation in NeRFs. OK-Robot [11] adopts a system-first approach to solving structured mobile pick-and-place tasks with foundation models by offering an integrated solution to object detection, mapping, navigation and grasp generation. While these methods are related, none of them offers all the features of LIMP: 1) explicit support for both LLM-based planning and off-the-shelf task and motion planning approaches, 2) verifiable representations for following complex natural language instructions in mobile manipulation domains that involve object-object relationships, and 3) the ability to dynamically generate task-relevant state abstractions (semantic maps) for individual instructions.

Language instruction for robots: Our approach to handling complex natural language instructions involves translating the command into a temporal logic expression. This problem framing allows us to leverage state-of-the-art techniques from machine translation, such as instruction-tuned large language models. Most similar to our approach in this regard is [7], which uses a multi-stage LLM-based approach and finetuning to perform entity extraction and replacement to translate natural language instructions into temporal logic expressions. However, [7] relies on a prebuilt semantic map that grounds expression symbols, which limits the scope of instructions it can operate on since landmarks are predetermined. Instead, our approach interfaces with a novel scene representation that supports open-vocabulary language and generates the relevant landmarks based on the open-vocabulary instruction. Additionally, the symbols in our temporal logic translation correspond to parameterized task-relevant robot skills as opposed to propositions of referent entities extracted from instructions.

A2-B Planning Models in Robotics

Semantic Maps: Semantic maps [49] are a class of scene representations that capture semantic (and typically geometric) information about the world, and can be used in conjunction with planners to generate certain types of complex robot behavior like collision-free navigation with spatial constraints [50, 51]. However, leveraging semantic maps for task planning with mobile manipulators has been challenging since the modeling information needed may depend heavily on the robot’s particular skills and embodiment. Rosen et al. [17] recently proposed Action-Oriented Semantic Maps (AOSMs), a class of semantic maps that include additional models of the regions of space where the robot can execute manipulation skills (represented as symbols), and demonstrated that AOSMs can be used as a state representation that supports TAMP solvers in mobile manipulation domains. Our scene representation is similar to an AOSM since it captures spatial information about semantic regions of interest and is compatible with TAMP solvers, but it differs in that AOSMs require learning via online interaction with the scene. Instead, our approach leverages foundation models and requires no online learning. Also, once an AOSM is generated for a scene, only a closed set of goals can be planned for, whereas our approach can handle open-vocabulary task specifications.

Task and Motion Planning: Task and motion planning (TAMP) approaches are hierarchical planning methods that involve high-level task planning (with a discrete state space) [52] and low-level motion planning (with a continuous state space) [53]. The low-level motion planning problem involves generating paths to goal sets through continuous spaces (e.g., configuration space, Cartesian space) with constraints on infeasible regions. When the constraints and dynamics can change, the problem is referred to as multi-modal motion planning, which naturally induces a high-level planning problem that involves choosing which sequence of modes to plan through, and a low-level planning problem that involves moving through a particular mode. Finding high-level plan skeletons and satisfying low-level assignment values for parameters to achieve goals is a challenging bi-level planning problem [53]. LIMP contains sufficient information to produce a problem and domain description augmented with geometric information for bi-level TAMP solvers like [54, 55].

A3 Language Instruction Module

We implement a two-stage prompting strategy in our language instruction module to translate natural language instructions into LTL specifications. The first stage translates a given instruction into a conventional LTL formula, where propositions refer to open-vocabulary objects. For any given instruction, we dynamically generate $K$ in-context translation examples from a standard dataset [14] of natural language and LTL pairs, based on cosine similarity with the given instruction. Here is the exact text prompt used:

You are a LLM that understands operators involved with Linear Temporal Logic (LTL), such as F, G, U, &, |, ~ , etc. Your goal is translate language input to LTL output.
Input:<generated_example_instruction>
Output:<generated_example_LTL>
...
Input:<given_instruction>
Output:

Listing 1: Base prompt used to obtain a conventional LTL formula from a natural language query
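The example-selection step described above can be sketched as follows. This is a minimal illustration, not the implementation used in our experiments: the bag-of-words `embed` function is a stand-in for whatever sentence encoder is used, and the names `select_examples` and `build_prompt` are hypothetical. Only the prompt layout is taken from Listing 1.

```python
import math
from collections import Counter

# Prompt header copied verbatim from Listing 1.
BASE_PROMPT = (
    "You are a LLM that understands operators involved with Linear Temporal "
    "Logic (LTL), such as F, G, U, &, |, ~ , etc. Your goal is translate "
    "language input to LTL output."
)

def embed(text):
    """Stand-in embedding (assumption): bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_examples(instruction, dataset, k):
    """Return the k (instruction, LTL) pairs most similar to the query."""
    query = embed(instruction)
    ranked = sorted(dataset, key=lambda pair: cosine(query, embed(pair[0])),
                    reverse=True)
    return ranked[:k]

def build_prompt(instruction, dataset, k=2):
    """Assemble the first-stage prompt in the layout of Listing 1."""
    lines = [BASE_PROMPT]
    for example_nl, example_ltl in select_examples(instruction, dataset, k):
        lines += [f"Input:{example_nl}", f"Output:{example_ltl}"]
    lines += [f"Input:{instruction}", "Output:"]
    return "\n".join(lines)
```

Swapping `embed` for a learned sentence encoder leaves the ranking and prompt-assembly logic unchanged.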

The second stage takes the given instruction and the LTL response from the first stage as input and generates a new LTL formula with predicate functions that correspond to parameterized robot skills. Skill parameters are instruction referent objects expressed in our novel Composable Referent Descriptor (CRD) syntax. CRDs enable referent disambiguation by chaining comparators that encode descriptive spatial information. We define eight spatial comparators and provide their descriptions as part of the second stage prompt. We find that LLMs conditioned on this information and a few examples are able to translate arbitrarily complex instructions with appropriate comparator choices. Here is the exact prompt used:

You are an LLM for robot planning that understands operators involved with Linear Temporal Logic (LTL), such as F, G, U, &, |, ~ , etc. You have a finite set of robot predicates and spatial predicates, given a language instruction and an LTL formula that represents the given instruction, your goal is to translate the ltl formula into one that uses appropriate composition of robot and spatial predicates in place of propositions with relevant details from original instruction as arguments.
Robot predicate set (near,pick,release).
Usage:
near[referent_1]:returns true if the desired spatial relationship is for robot to be near referent_1.
pick[referent_1]:can only execute picking skill on referent_1 and return True when near[referent_1].
release[referent_1,referent_2]:can only execute release skill on referent_1 and return True when near[referent_2].
Spatial predicate set (isbetween,isabove,isbelow,isleftof,isrightof,isnextto,isinfrontof,isbehind).
Usage:
referent_1::isbetween(referent_2,referent_3):returns true if referent_1 is between referent_2 and referent_3.
referent_1::isabove(referent_2):returns True if referent_1 is above referent_2.
referent_1::isbelow(referent_2):returns True if referent_1 is below referent_2.
referent_1::isleftof(referent_2):returns True if referent_1 is left of referent_2.
referent_1::isrightof(referent_2):returns True if referent_1 is right of referent_2.
referent_1::isnextto(referent_2):returns True if referent_1 is close to referent_2.
referent_1::isinfrontof(referent_2):returns True if referent_1 is in front of referent_2.
referent_1::isbehind(referent_2):returns True if referent_1 is behind referent_2.
Rules:
Strictly only use the finite set of robot and spatial predicates!
Strictly stick to the usage format!
Compose spatial predicates where necessary!
You are allowed to modify the structure of Input_ltl for the final Output if it does not match the intended Input_instruction!
You should strictly only stick to mentioned objects, however you are allowed to propose and include plausible objects if and only if not mentioned in instruction but required based on context of instruction!
Pay attention to instructions that require performing certain actions multiple times in generating and sequencing the predicates for the final Output formula!
Example:
Input_instruction: Go to the orange building but before that pass by the coffee shop, then go to the parking sign.
Input_ltl: F (coffee_shop & F (orange_building & F parking_sign ) )
Output: F ( near[coffee_shop] & F ( near[orange_building] & F near[parking_sign] ))
Input_instruction: Go to the blue sofa then the laptop, after that bring me the brown bag between the television and the kettle on the left of the green seat, I am standing by the sink.
Input_ltl: F ( blue_sofa & F ( laptop & F ( brown_bag & F ( sink ) ) ) )
Output: F ( near[blue_sofa] & F ( near[laptop] & F ( near[brown_bag::isbetween(television,kettle::isleftof(green_seat))] & F (pick[brown_bag::isbetween(television,kettle::isleftof(green_seat))] & F ( near[sink] & F ( release[brown_bag,sink] ) ) ) ) ) )
Input_instruction: Hey need you to pass by chair between the sofa and bag, pick up the bag and go to the orange napkin on the right of the sofa.
Input_ltl: F ( chair & F ( bag & F ( orange_napkin ) ) )
Output: F ( near[chair::isbetween(sofa,bag)] & F ( near[bag] & F ( pick[bag] & F ( near[orange_napkin::isrightof(sofa)] ) ) ) )
Input_instruction: Go to the chair between the green laptop and the yellow box underneath the play toy
Input_ltl: F ( green_laptop & F ( yellow_box & F ( play_toy & F ( chair ) ) ) )
Output: F ( near[chair::isbetween(green_laptop,yellow_box::isbelow(play_toy))] )
Input_instruction: Check the table behind the fridge and bring two beers to the couch one after the other
Input_ltl: F ( check_table & F ( bring_beer1 ) & F ( bring_beer2 ) & F ( couch ) )
Output: F ( near[table::isbehind(fridge)] & F ( pick[beer] & F ( near[couch] & F ( release[beer,couch] & F ( near[table::isbehind(fridge)] & F ( pick[beer] & F ( near[couch] & F ( release[beer,couch] ))))))))
Input_instruction: <given_instruction>
Input_ltl: <stage1_ltl_response>
Output:

Listing 2: Second stage prompt to output our LTL syntax with CRD parameterized robot skills
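To make the CRD syntax concrete, the sketch below parses a descriptor string such as those in the example outputs of Listing 2 into a nested structure. It is an illustrative recursive-descent parser, not our implementation: the `CRD` dataclass and `parse_crd` function are hypothetical names, and the grammar (a referent name optionally followed by `::comparator(arg, ...)`, where each argument is itself a CRD) is inferred from the examples in the prompt.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CRD:
    """A referent name with an optional spatial comparator and nested arguments."""
    name: str
    comparator: Optional[str] = None
    args: List["CRD"] = field(default_factory=list)

def _read_name(s, i):
    """Read an identifier (letters, digits, underscores) starting at index i."""
    j = i
    while j < len(s) and (s[j].isalnum() or s[j] == "_"):
        j += 1
    if j == i:
        raise ValueError(f"expected a name at position {i}")
    return s[i:j], j

def _parse(s, i):
    """Parse one CRD starting at index i; return the node and the next index."""
    name, i = _read_name(s, i)
    if not s.startswith("::", i):
        return CRD(name), i  # plain referent with no spatial description
    comparator, i = _read_name(s, i + 2)
    if i >= len(s) or s[i] != "(":
        raise ValueError(f"expected '(' at position {i}")
    i += 1
    args = []
    while True:
        arg, i = _parse(s, i)  # arguments may themselves be CRDs
        args.append(arg)
        if i < len(s) and s[i] == ",":
            i += 1
            continue
        break
    if i >= len(s) or s[i] != ")":
        raise ValueError(f"expected ')' at position {i}")
    return CRD(name, comparator, args), i + 1

def parse_crd(text):
    """Parse a full CRD string, rejecting trailing characters."""
    s = text.replace(" ", "")
    node, end = _parse(s, 0)
    if end != len(s):
        raise ValueError(f"unexpected trailing text at position {end}")
    return node
```

For example, `parse_crd("brown_bag::isbetween(television,kettle::isleftof(green_seat))")` yields a `brown_bag` node whose second argument is itself a `kettle` node constrained by `isleftof(green_seat)`.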

A3-A Interactive Symbol Verification

Because verifying sampled LTL formulas is essential, we implement an interactive dialog system that presents users with the composable referent descriptors (CRDs) extracted from sampled formulas, as well as the implied task structure, which is encoded in the sequence of state-machine transition expressions that must hold to progressively solve the task. We translate the task structure into English statements via a simple deterministic strategy that replaces logical connectives and skill predicates from the formula with equivalent English phrases. Users can verify a formula as correct or provide corrective statements, which are used to reprompt the LLM to obtain new formulas. Below is the exact prompt used for regenerating formulas.

There was a mistake with your output LTL formula: Error with <verification_type>! Consider the clarification feedback and regenerate the correct output for the Input_instruction. Make sure to adhere to all rules and instructions in your original prompt!
previous_output:<last_response>
error_clarification: <given_error_clarification>
correct_output:

Listing 3: Corrective reprompting prompt used to obtain new LTL formulas

To illustrate, the instruction “Bring the green plush toy to the whiteboard in front of it” yields the interactive Referent and Task Structure Verification dialog below:

**************************
Instruction Following
**************************
Input_instruction: "Bring the green plush toy to the whiteboard in front of it"
Sampled LTL formula: F(A & F(B & F(C & FD)))
  A: near[green_plush_toy]
  B: pick[green_plush_toy]
  C: near[whiteboard::isinfrontof(green_plush_toy)]
  D: release[green_plush_toy, whiteboard::isinfrontof(green_plush_toy)]

***************************
Referent Verification
***************************
I extracted this list of relevant objects based on your instruction:
  * whiteboard::isinfrontof(green_plush_toy)
  * green_plush_toy
Does this match your intention? (y/n)

****************************
Task Structure Verification
****************************
Based on my understanding here is the sequence of subgoal objectives needed to satisfy the task:
Subgoal_1:
  Logical Expression: A&!B
  English translation: I should be near the [green_plush_toy] and not have picked up the [green_plush_toy]
Subgoal_2:
  Logical Expression: B&!C
  English translation: I should have picked up the [green_plush_toy] and not be near the [whiteboard::isinfrontof(green_plush_toy)]
Subgoal_3:
  Logical Expression: C&!D
  English translation: I should be near the [whiteboard::isinfrontof(green_plush_toy)] and not have released the [green_plush_toy] at the [whiteboard::isinfrontof(green_plush_toy)]
Subgoal_4:
  Logical Expression: D
  English translation: I should have released the [green_plush_toy] at the [whiteboard::isinfrontof(green_plush_toy)]
Does this match your intention? (y/n)

Listing 4: Interactive referent and task structure verification dialog.
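The deterministic translation that produces the English statements in the dialog above could be sketched as follows. This is an illustrative reconstruction, not our implementation: the phrase templates are inferred from the wording in Listing 4, the function names are hypothetical, and the sketch only handles conjunctions of possibly negated symbols (the forms `A&!B` and `D` that appear in the dialog).

```python
import re

# Positive and negated phrase templates per skill, inferred from Listing 4.
PHRASES = {
    "near": ("be near the [{0}]", "not be near the [{0}]"),
    "pick": ("have picked up the [{0}]", "not have picked up the [{0}]"),
    "release": ("have released the [{0}] at the [{1}]",
                "not have released the [{0}] at the [{1}]"),
}

def split_args(arg_text):
    """Split on top-level commas only, so nested CRD arguments stay intact."""
    args, depth, current = [], 0, ""
    for ch in arg_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current.strip())
            current = ""
        else:
            current += ch
    args.append(current.strip())
    return args

def describe(predicate, negated):
    """Turn a skill predicate such as 'pick[toy]' into an English clause."""
    skill, arg_text = re.fullmatch(r"(\w+)\[(.*)\]", predicate).groups()
    positive, negative = PHRASES[skill]
    return (negative if negated else positive).format(*split_args(arg_text))

def translate(expression, symbols):
    """Translate a conjunction such as 'A&!B' using the symbol-to-predicate map."""
    clauses = []
    for literal in expression.replace(" ", "").split("&"):
        negated = literal.startswith("!")
        clauses.append(describe(symbols[literal.lstrip("!")], negated))
    return "I should " + " and ".join(clauses)
```

With the symbol map from Listing 4, `translate("A&!B", symbols)` reproduces the Subgoal_1 sentence shown in the dialog.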

A4 Spatial Grounding Module

The spatial grounding module detects and localizes specific instances of objects referenced in a given instruction by first detecting, segmenting and back-projecting all referent occurrences and then filtering based on the descriptive spatial details captured by each referent’s composable referent descriptor (CRD). We use the OWL-ViT model [32] to detect bounding boxes of open-vocabulary referents and SAM [33] to generate masks from detected bounding boxes. To illustrate referent filtering via spatial information, consider an example scenario where the goal is to resolve the composable referent descriptor below:

$$\text{whiteboard}::\text{isinfrontof}(\text{green\_plush\_toy}). \tag{A.1}$$

Let $W=\{w_{1},w_{2},\ldots,w_{n}\}$ and $G=\{g_{1},g_{2},\ldots,g_{m}\}$ represent the sets of representative 3D positions of detected whiteboards and green_plush_toys, respectively. The Cartesian product of these sets enumerates all possible pairs $(w,g)$ for comparison.

$$W\times G=\{(w,g)\mid w\in W,\ g\in G\} \tag{A.2}$$

The ‘isinfrontof(w, g)’ comparator is applied to each pair, yielding a subset $S$ that contains only those ‘whiteboards’ that satisfy the ‘isinfrontof’ condition with at least one ‘green_plush_toy’.

$$S=\{w\in W\mid\exists\, g\in G\text{ such that }\text{isinfrontof}(w,g)\text{ is true}\} \tag{A.3}$$
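The filtering in Eqs. (A.2) and (A.3) can be written in a few lines. The sketch below is illustrative only: the function names are hypothetical, positions are plain `(x, y, z)` tuples, and the comparator is passed in as any callable that returns a Boolean for a pair of positions.

```python
from itertools import product

def satisfying_pairs(W, G, comparator):
    """Enumerate W x G (Eq. A.2) and keep the pairs the comparator accepts."""
    return [(w, g) for w, g in product(W, G) if comparator(w, g)]

def filter_referents(W, G, comparator):
    """Eq. A.3: keep each w in W that satisfies the comparator with some g in G."""
    return [w for w in W if any(comparator(w, g) for g in G)]
```

Nested descriptors can be resolved by applying `filter_referents` from the innermost comparator outward, so that the anchor set `G` of an outer comparator is already filtered.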

A4-A 3D Spatial Comparators

Our 3D spatial comparators enable Relative Frame of Reference (FoR) spatial reasoning between referents, based on their backprojected 3D positions. Threshold values in the spatial comparators let users specify the sensitivity or resolution at which spatial relationships are resolved; we keep all threshold values fixed across all experiments. Below is a description of each spatial comparator.

1. isbetween(referent_1_pos, referent_2_pos, referent_3_pos, threshold): Returns true if referent_1 is within ’threshold’ distance from the line segment connecting referent_2 to referent_3, ensuring it lies in the directional path between them without extending beyond.
2. isabove(referent_1_pos, referent_2_pos, threshold): Returns true if the z-coordinate of referent_1 exceeds that of referent_2 by at least ’threshold’.
3. isbelow(referent_1_pos, referent_2_pos, threshold): Returns true if the z-coordinate of referent_1 is less than that of referent_2 by more than ’threshold’.
4. isleftof(referent_1_pos, referent_2_pos, threshold): Returns true if the y-coordinate of referent_1 exceeds that of referent_2 by at least ’threshold’, indicating referent_1 is to the left of referent_2.
5. isrightof(referent_1_pos, referent_2_pos, threshold): Returns true if the y-coordinate of referent_1 is less than that of referent_2 by more than ’threshold’, indicating referent_1 is to the right of referent_2.
6. isnextto(referent_1_pos, referent_2_pos, threshold): Returns true if the Euclidean distance between referent_1 and referent_2 is less than ’threshold’, indicating they are next to each other.
7. isinfrontof(referent_1_pos, referent_2_pos, threshold): Returns true if the x-coordinate of referent_1 is less than that of referent_2 by more than ’threshold’, indicating referent_1 is in front of referent_2.
8. isbehind(referent_1_pos, referent_2_pos, threshold): Returns true if the x-coordinate of referent_1 exceeds that of referent_2 by at least ’threshold’, indicating referent_1 is behind referent_2.

Listing 5: Implementation description of 3D spatial comparators
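The descriptions in Listing 5 translate fairly directly into code. The sketch below is one possible Python reading of them, assuming positions are (x, y, z) tuples in a common frame. The inclusive or strict inequalities follow the wording of each description, and the handling of a degenerate (zero-length) segment in `isbetween` is an assumption.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # (x, y, z) backprojected position


def isbetween(p1: Point, p2: Point, p3: Point, threshold: float) -> bool:
    # Project p1 onto the segment p2 -> p3; reject projections that fall
    # outside the segment, then compare the perpendicular distance.
    seg = [b - a for a, b in zip(p2, p3)]
    seg_len_sq = sum(c * c for c in seg)
    if seg_len_sq == 0.0:  # assumption: degenerate segment reduces to a point test
        return math.dist(p1, p2) <= threshold
    t = sum((a - b) * c for a, b, c in zip(p1, p2, seg)) / seg_len_sq
    if t < 0.0 or t > 1.0:
        return False
    closest = tuple(b + t * c for b, c in zip(p2, seg))
    return math.dist(p1, closest) <= threshold


def isabove(p1: Point, p2: Point, threshold: float) -> bool:
    return p1[2] - p2[2] >= threshold


def isbelow(p1: Point, p2: Point, threshold: float) -> bool:
    return p2[2] - p1[2] > threshold


def isleftof(p1: Point, p2: Point, threshold: float) -> bool:
    return p1[1] - p2[1] >= threshold


def isrightof(p1: Point, p2: Point, threshold: float) -> bool:
    return p2[1] - p1[1] > threshold


def isnextto(p1: Point, p2: Point, threshold: float) -> bool:
    return math.dist(p1, p2) < threshold


def isinfrontof(p1: Point, p2: Point, threshold: float) -> bool:
    return p2[0] - p1[0] > threshold


def isbehind(p1: Point, p2: Point, threshold: float) -> bool:
    return p1[0] - p2[0] >= threshold
```

For example, `isbetween((1.0, 0.05, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.1)` holds because the first point lies 0.05 from the segment's midpoint, whereas a point at x = 3.0 is rejected for lying beyond the segment's end.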

A5 Task and Motion Planning Module

We present pseudocode for our Progressive Motion Planner (Alg. 1) and our algorithm for generating Task Progression Semantic Maps (Alg. 2). Alg. 2 generates a TPSM $\mathcal{M}_{\text{tpsm}}$ by integrating an environment map ($\mathcal{M}$) and a referent semantic map ($\mathcal{M}_{\text{rsm}}$), given a logical transition expression ($\mathcal{T}$), the task automaton ($\mathcal{A}$), a desired automaton state ($\mathcal{S^{\prime}}$), and a nearness threshold ($\theta$). The algorithm first initializes $\mathcal{M}_{\text{tpsm}}$ with a copy of $\mathcal{M}$ and extracts relevant instruction predicates from $\mathcal{T}$. For each predicate (parameterized skill), the algorithm identifies satisfying referent positions in $\mathcal{M}_{\text{rsm}}$, generates a spherical grid of surrounding points within a radius $\theta$, and assesses how these points affect the progression of the task automaton towards $\mathcal{S^{\prime}}$. These points demarcate regions of interest, and are assigned a value of 1 if they cause the automaton to transition to the desired state, -1 if they lead to a different automaton state or violate the automaton, and 0 if they do not affect the automaton. The points are then integrated into $\mathcal{M}_{\text{tpsm}}$, yielding a semantic map that identifies goal and constraint-violating regions.

Algorithm 1 Progressive Motion Planning Algorithm

1: procedure PMP($X_{start},\varphi,\mathcal{M},\mathcal{M}_{rsm},\theta$)
2:   Input:
3:     $X_{start}$: Start position in the environment.
4:     $\varphi$: CRD syntax LTL formula specifying task objectives.
5:     $\mathcal{M}$: Environment map.
6:     $\mathcal{M}_{rsm}$: Referent semantic map.
7:     $\theta$: Nearness threshold.
8:   Output:
9:     $\Pi$: Generated task and motion plan.
10:   $\mathcal{A}\leftarrow\text{ConstructAutomaton}(\varphi)$
11:   $\text{path}\leftarrow\text{SelectAutomatonPath}(\mathcal{A})$  $\triangleright$ Task plan
12:   while $\Pi$.status is active do
13:     while $\mathcal{A}$.state != path.acceptingState do
14:       $\mathcal{S},\mathcal{T},\mathcal{S^{\prime}}\leftarrow\mathcal{A}.\text{GetTransition(path.currentStep)}$
15:       $\text{objective}\leftarrow\text{NextObjectiveType}(\mathcal{T})$
16:       if $\text{objective}=\text{``skill''}$ then
17:         $\Pi.\text{UpdateWithSkill}(\mathcal{T})$
18:         $\mathcal{A}$.UpdateAutomatonState($\mathcal{S^{\prime}}$)
19:       else if $\text{objective}=\text{``navigation''}$ then
20:         $\mathcal{M}_{params}\leftarrow\mathcal{M},\mathcal{M}_{rsm},\mathcal{T},\mathcal{A},\mathcal{S^{\prime}},\theta$
21:         $\mathcal{M}_{tpsm}\leftarrow\textsc{GenerateTPSM}(\mathcal{M}_{params})$
22:         $\mathcal{O}\leftarrow\text{GenerateObstacleMap}(\mathcal{M}_{tpsm})$
23:         $\text{plan}\leftarrow\text{FMT}^{*}(X_{start},\mathcal{O})$  $\triangleright$ Path plan
24:         if plan.exists then
25:           $\Pi.\text{UpdateWithPlan}(\text{plan})$
26:           $X_{start}\leftarrow\text{plan.endPosition}$
27:           $\mathcal{A}$.UpdateAutomatonState($\mathcal{S^{\prime}}$)
28:         else
29:           $\Pi,\mathcal{A},\text{path}\leftarrow\text{Backtrack}(\Pi,\mathcal{A},\text{path})$
30:         end if
31:       end if
32:     end while
33:   end while
34:   return $\Pi$
35: end procedure
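The control flow of Alg. 1 can be illustrated with a heavily simplified Python sketch. It is not the planner itself: the automaton is replaced by a precomputed list of candidate transition sequences, TPSM generation, obstacle-map construction and FMT* are collapsed into a single hypothetical `plan_path` callback, and backtracking is approximated by restarting on the next candidate sequence. All names (`Transition`, `progressive_motion_plan`, `plan_path`) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Transition:
    kind: str   # "skill" or "navigation"
    label: str  # transition expression, e.g. "near(toy)" or "pick(toy)"


# Hypothetical stand-in for GenerateTPSM + GenerateObstacleMap + FMT*:
# returns a list of waypoints, or None when no path exists.
Planner = Callable[[Point, Transition], Optional[List[Point]]]


def progressive_motion_plan(
    start: Point,
    candidate_paths: List[List[Transition]],
    plan_path: Planner,
) -> Optional[List[Tuple[str, object]]]:
    # Simplified backtracking: if any navigation step of a candidate
    # automaton path is infeasible, discard it and try the next candidate.
    for path in candidate_paths:
        position = start
        plan: List[Tuple[str, object]] = []
        feasible = True
        for transition in path:
            if transition.kind == "skill":
                # Manipulation skills are appended to the plan directly.
                plan.append(("skill", transition.label))
            else:
                waypoints = plan_path(position, transition)
                if waypoints is None:
                    feasible = False
                    break
                plan.append(("navigate", waypoints))
                # The next segment starts where this one ends.
                position = waypoints[-1]
        if feasible:
            return plan
    return None


def stub_planner(start: Point, transition: Transition) -> Optional[List[Point]]:
    # Toy planner: one label is treated as unreachable.
    if transition.label == "near(blocked)":
        return None
    return [start, (1.0, 1.0)]


paths = [
    [Transition("navigation", "near(blocked)")],
    [Transition("navigation", "near(toy)"), Transition("skill", "pick(toy)")],
]
plan = progressive_motion_plan((0.0, 0.0), paths, stub_planner)
```

In this toy run the first candidate is abandoned because its navigation step has no path, and the plan is built from the second candidate, with each navigation segment starting at the end position of the previous one.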

Algorithm 2 Task Progression Semantic Mapping Algorithm

1: procedure GenerateTPSM($\mathcal{M},\mathcal{M}_{rsm},\mathcal{T},\mathcal{A},\mathcal{S^{\prime}},\theta$)
2:   Input:
3:     $\mathcal{M}$: Environment map.
4:     $\mathcal{M}_{rsm}$: Referent semantic map.
5:     $\mathcal{T}$: Automaton transition expression.
6:     $\mathcal{A}$: Task Automaton.
7:     $\mathcal{S^{\prime}}$: Desired State.
8:     $\theta$: Nearness threshold.
9:   Output:
10:     $\mathcal{M}_{tpsm}$: Task Progression Semantic Map.
11:   $\mathcal{M}_{tpsm}\leftarrow\text{Copy}(\mathcal{M})$
12:   $\mathcal{P}\leftarrow\text{ExtractRelevantPredicates}(\mathcal{T})$
13:   for $p\in\mathcal{P}$ do
14:     $\mathcal{R}\leftarrow\text{QueryPositions}(\mathcal{M}_{rsm},p)$
15:     for $r\in\mathcal{R}$ do
16:       $G\leftarrow\left\{g\mid g=r+\delta,\lVert\delta\rVert\leq\theta\right\}$  $\triangleright$ spherical grid of surrounding points
17:       for $g\in G$ do
18:         $\mathcal{Q}\leftarrow\text{TruePredicatesAt}(g,\mathcal{M}_{rsm},\theta)$
19:         $\mathcal{S}_{next}\leftarrow\text{ProgressAutomaton}(\mathcal{A},\mathcal{Q})$
20:         if $\mathcal{S}_{next}=\mathcal{S^{\prime}}$ then
21:           $g.\text{value}\leftarrow 1$  $\triangleright$ Goal value
22:         else if $\text{IsUndesired}(\mathcal{S}_{next})$ then
23:           $g.\text{value}\leftarrow-1$  $\triangleright$ Avoidance value
24:         else
25:           $g.\text{value}\leftarrow 0$
26:         end if
27:       end for
28:       $\text{AddPoints}(\mathcal{M}_{tpsm},G)$
29:     end for
30:   end for
31:   return $\mathcal{M}_{tpsm}$
32: end procedure
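A minimal Python sketch of Alg. 2 is given below under several stated assumptions: maps are dictionaries keyed by 3D points, the referent semantic map stores a list of positions per predicate, the spherical grid is sampled on an axis-aligned lattice with an assumed spacing `step`, and the automaton is abstracted as a `progress` callback that maps the set of true predicates to the next state. Following the prose description, a state is treated as undesired when it is neither the desired state nor the current one. None of these names come from the released code.

```python
import math
from itertools import product
from typing import Callable, Dict, FrozenSet, List, Tuple

Point = Tuple[float, float, float]
SemanticMap = Dict[str, List[Point]]  # predicate -> satisfying referent positions


def spherical_grid(center: Point, theta: float, step: float) -> List[Point]:
    """Lattice points g = center + delta with ||delta|| <= theta."""
    n = math.floor(theta / step + 1e-9)
    offsets = [i * step for i in range(-n, n + 1)]
    return [
        (center[0] + dx, center[1] + dy, center[2] + dz)
        for dx, dy, dz in product(offsets, repeat=3)
        if math.sqrt(dx * dx + dy * dy + dz * dz) <= theta + 1e-9
    ]


def true_predicates_at(g: Point, rsm: SemanticMap, theta: float) -> FrozenSet[str]:
    # A predicate holds at g if g is within theta of one of its referents.
    return frozenset(
        p for p, positions in rsm.items()
        if any(math.dist(g, r) <= theta for r in positions)
    )


def generate_tpsm(
    env_map: Dict[Point, int],
    rsm: SemanticMap,
    predicates: List[str],
    progress: Callable[[FrozenSet[str]], str],
    current: str,
    desired: str,
    theta: float,
    step: float,
) -> Dict[Point, int]:
    tpsm = dict(env_map)  # start from a copy of the environment map
    for p in predicates:
        for r in rsm.get(p, []):
            for g in spherical_grid(r, theta, step):
                s_next = progress(true_predicates_at(g, rsm, theta))
                if s_next == desired:
                    tpsm[g] = 1    # goal value
                elif s_next != current:
                    tpsm[g] = -1   # avoidance value
                else:
                    tpsm[g] = 0    # no effect on the automaton
    return tpsm


# Toy automaton: being near the robot violates the task, being near the toy
# reaches the desired state "s1", and anything else stays in "s0".
def toy_progress(q: FrozenSet[str]) -> str:
    if "near(robot)" in q:
        return "violated"
    return "s1" if "near(toy)" in q else "s0"


rsm = {"near(toy)": [(0.0, 0.0, 0.0)], "near(robot)": [(5.0, 0.0, 0.0)]}
tpsm = generate_tpsm({}, rsm, ["near(toy)", "near(robot)"], toy_progress,
                     current="s0", desired="s1", theta=1.0, step=1.0)
```

In this toy setting, points around the toy receive the goal value and points around the robot receive the avoidance value, which mirrors how the TPSM separates goal regions from constraint-violating regions.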

A6 Robot Skills

We define three predicate functions: near, pick and release for the navigation, picking and placing skills required for multi-object goal navigation and mobile pick-and-place. As highlighted in the main paper, we formalize navigation as continuous path planning problems and manipulation as object parameterized options. We discuss navigation at length in the paper, so here we focus on the pick and place manipulation skills.

Pick Skill: Once the robot has executed the near skill and is at the object to be manipulated, it takes a photo of the current environment to detect the object using the Owl-Vit model. The robot is guaranteed to be facing the object, as the computed path plan uses the backprojected 3D position of the object to compute yaw angles for the robot. After detecting the object in the picture, we obtain a segmentation mask from the detected bounding box using the Segment Anything model, and compute the center pixel of this mask. We feed this center pixel to the Boston Dynamics grasping API to compute a motion plan to grasp the object.
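The center-pixel computation in the pick skill can be sketched as follows. This is a minimal illustration that assumes the mask is a binary 2D grid and that "center pixel" means the integer centroid of the foreground pixels; the function name `mask_center_pixel` is hypothetical, and the detection, segmentation and grasping calls are intentionally left out.

```python
from typing import List, Optional, Tuple


def mask_center_pixel(mask: List[List[int]]) -> Optional[Tuple[int, int]]:
    """Return the (x, y) centroid pixel of a binary mask, or None if empty."""
    xs: List[int] = []
    ys: List[int] = []
    for row_idx, row in enumerate(mask):
        for col_idx, value in enumerate(row):
            if value:
                xs.append(col_idx)  # x is the column index
                ys.append(row_idx)  # y is the row index
    if not xs:
        return None
    # Integer centroid of all foreground pixels (assumed definition of "center").
    return (sum(xs) // len(xs), sum(ys) // len(ys))


# Toy 4x5 mask with a 3x3 foreground block, for illustration only.
toy_mask = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
center = mask_center_pixel(toy_mask)
```

For strongly non-convex masks the centroid can fall outside the object, so a real system might prefer a point guaranteed to lie on the mask; that choice is outside the scope of this sketch.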

Release Skill: We implement a simple routine for the release skill, which takes two parameters: the object to be placed and the place receptacle. Once a navigation skill gets the robot to the place receptacle, the robot gently moves its arm up or down to release the grasped object, based on the 3D position of the place receptacle. Future work will implement more complex semantic placement strategies to better leverage LIMP’s awareness and spatial grounding of instruction-specific place receptacles. Demonstrations of these skills are available on our website.

A7 Evaluation and Baseline Details

All computation, including planning and loading and running pretrained visual language models, was done on a single computer equipped with one NVIDIA GeForce RTX 3090 GPU. We leverage GPT-4-0613 as the underlying LLM for our instruction understanding module and all our baselines. In all experiments we set the LLM temperature to 0; however, since deterministic greedy token decoding is not guaranteed with GPT-4, we perform three queries for each instruction and evaluate on the most frequent response (mode response).
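The mode-response selection can be sketched in Python as below. The `query` callable is a hypothetical stand-in for a call to the LLM, and the tie-breaking rule (the earliest of the tied responses wins) is an assumption, since the text does not specify how ties are resolved.

```python
from collections import Counter
from typing import Callable


def mode_response(query: Callable[[str], str], instruction: str, n_queries: int = 3) -> str:
    """Query the model n_queries times and return the most frequent response."""
    responses = [query(instruction) for _ in range(n_queries)]
    # most_common orders equal counts by first occurrence, so a tie is
    # resolved in favour of the earliest response (assumed behaviour).
    return Counter(responses).most_common(1)[0][0]


# Placeholder replies standing in for three LLM outputs.
replies = iter(["response A", "response B", "response A"])
best = mode_response(lambda _instruction: next(replies), "example instruction")
```

Here two of the three placeholder replies agree, so that reply is selected for evaluation.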

We compare LIMP with baseline implementations of NLMap-Saycan [10] and Code-as-policies [12]. Both baselines use the same GPT-4 LLM, prompting structure, and in-context learning examples as our language understanding module. We integrate our composable referent descriptor syntax, spatial grounding module and low-level robot control into these baselines as APIs. This enables the baselines to execute plans by querying relevant object positions, using our FMT* path planner to find paths to these positions, and executing manipulation options.

We visualize some qualitative results of LIMP from our experiments in Figure A.1. We also highlight results in Figure A.2 that illustrate how our interactive symbol verification and reprompting strategy (Section A3-A) improves instruction satisfaction with minimal chat turns for different instruction sets.

A7-A NLMap-Saycan Implementation Prompt

You are an LLM for robot planning that understands logical operators such as &, |, ~ , etc. You have a finite set of robot predicates and spatial predicates, given a language instruction, your goal is to generate a sequence of actions that uses appropriate composition of robot and spatial predicates with relevant details from the instruction as arguments.