GOMA: Proactive Embodied Cooperative Communication via Goal-Oriented Mental Alignment
License: CC BY 4.0
arXiv:2403.11075v1 [cs.HC] 17 Mar 2024


Lance Ying$^{1,2}$, Kunal Jha$^{3}$, Shivam Aarya$^{4}$, Joshua B. Tenenbaum$^{2}$, Antonio Torralba$^{2}$, Tianmin Shu$^{4}$
$^{1}$Harvard University, Cambridge, MA 02138, USA. lanceying@seas.harvard.edu
$^{2}$Massachusetts Institute of Technology, Cambridge, MA 01239, USA. jbt@mit.edu, torralba@csail.mit.edu
$^{3}$Dartmouth College, Hanover, NH 03755, USA. kunal.a.jha.24@dartmouth.edu
$^{4}$Johns Hopkins University, Baltimore, MD 21218, USA. {saarya1, tianmin.shu}@mit.edu
Abstract

Verbal communication plays a crucial role in human cooperation, particularly when the partners only have incomplete information about the task, environment, and each other’s mental state. In this paper, we propose a novel cooperative communication framework, Goal-Oriented Mental Alignment (GOMA). GOMA formulates verbal communication as a planning problem that minimizes the misalignment between the parts of agents’ mental states that are relevant to the goals. This approach enables an embodied assistant to reason about when and how to proactively initiate verbal communication with humans in natural language to achieve better cooperation. We evaluate our approach against strong baselines in two challenging environments, Overcooked (a multiplayer game) and VirtualHome (a household simulator). Our experimental results demonstrate that large language models struggle with generating meaningful communication that is grounded in the social and physical context. In contrast, our approach can successfully generate concise verbal communication for the embodied assistant to effectively boost the performance of the cooperation as well as human users’ perception of the assistant.

I Introduction

Rich verbal communication naturally emerges from human cooperation when people only have partial information about the environment and/or about each other’s mental states [1]. It serves as a complementary source of information, in addition to visual inputs, that helps achieve better cooperation by aligning each other’s mental states (including goals, beliefs, and eventually plans [2, 3, 4]). Recent advances in large language models (LLMs) and machine Theory of Mind (ToM) have sparked interest in building cooperative robots that can not only physically cooperate with humans but also verbally communicate with them using natural language [5, 6]. However, it remains challenging to enable robots to actively initiate verbal communication that is both concise (communicating only when necessary) and consistent with the physical environment and the social context (e.g., what humans want to do, believe, know, and need to know).

Figure 1: Illustration of cooperation with a shared mind or misaligned minds and communication optimized via goal-oriented mental alignment. (a) When human and robot minds are perfectly aligned (i.e., a shared mind), they share the same belief of the physical state and the same goal, which leads to the same joint plan shared by both agents. This is the ideal condition for reaching optimal cooperation. (b) However, in real-world cooperation, human and robot minds are typically unaligned, leading to two different (and often conflicting) joint plans in their minds. (c) To achieve a shared joint plan that optimizes cooperation, we optimize verbal communication initiated by the robot to actively align the joint plans in both agents’ minds.

A long history of research in psychology has shown that proactive verbal communication serves to align the mental states of agents [7]. Imagine you are going to get some groceries for your mom. As you put on your shoes and walk towards the door, your mom gets out of the kitchen and says “It’s going to rain, get your umbrella, and don’t forget about the avocados.” In this scenario, your mom decides to communicate with you because she is uncertain whether you have the same beliefs regarding the weather forecast and the required grocery items as you walk out the door.

When cooperating with one another, each agent not only needs to plan for itself but also has to imagine the plans of its partners. Such a planning process is termed joint planning [8, 9]. To achieve joint planning, prior works typically assumed that both agents have full observability and complete knowledge about the task. In other words, they have a shared mind, based on which they can derive the same joint plan (Fig. 1(a)). However, in real-world embodied cooperation, robot assistants only have partial observations and often do not know the true human goals (Fig. 1(b)).

The goal of cooperative communication is then to reach a shared mind (i.e., the two agents’ minds are perfectly aligned) so that the resulting joint plans in both agents’ minds are the same (Fig. 1(c)). Once we reach such a mental alignment condition, both agents know exactly what each other plans to do, and can therefore achieve optimal cooperation. However, an agent’s belief can be about any part of a state. If the state is high-dimensional (such as the state of a real-world home), it is extremely difficult to make sure two beliefs are the same. Our key insight is that we only need to align the part of the belief that is relevant to reaching the goal.

Following this insight, we propose a novel cooperative communication framework, Goal-Oriented Mental Alignment (GOMA). In this framework, we aim to generate optimal communication in the belief space. That is, verbal communication, by exchanging information, can help reshape agents’ beliefs. In particular, GOMA first seeks to detect misalignment in agents’ goal-relevant beliefs via divergence between the joint plans based on an agent’s own belief and a simulated hypothetical shared mind after acquiring additional knowledge from another agent via hypothetical communication. We then optimize the communication using a proxy reward derived from the divergence between the plans. The resulting communication can then help us minimize the difference between the joint plan in each agent’s mind and the true joint plan given a true shared mind. By doing so, we can optimize the cooperation.

We evaluate GOMA in two popular human-AI cooperation domains, Overcooked and VirtualHome. Our experimental results with a simulated human agent and real human participants show that GOMA outperforms strong baselines (including a recent LLM-based baseline). The GOMA-enabled assistant also receives higher subjective ratings from human participants.

In sum, our contributions include (1) a novel embodied cooperative communication framework – GOMA, (2) extensive evaluation of strong baselines and GOMA in two challenging domains, and (3) a human user study that evaluates the task performance of AI assistants and humans’ perception of them.

II Related Work

II-A Communication in Collaboration

Human communication is grounded in cooperative intentions. [7] argues that language communication is a joint activity that attempts to achieve mutual understanding. [1] proposes three communicative motives: requesting help or information, informing the other agents, and sharing feelings or attitudes. These communicative motives help to align the mental states of the agents. Through verbal communication, agents can assess others’ goals, knowledge, emotions, and beliefs, which they can then use to plan for the next actions.

However, verbal communication can also be costly, as it demands cognitive resources and distracts agents when performing actions [10]. Prior works on multi-agent teaming have formalized communication costs in collaborative settings [10, 11, 12], showing that excessive communication can degrade the performance of the team. Therefore, when designing communication policies, the AI assistant needs to communicate useful, concise, and relevant information yet not too frequently.

II-B Collaborative and Communicative AI Agent

Communication between humans and robots has also been extensively studied. Most existing literature has focused on one-directional communication where the human instructs the robot [13, 14]. Some recent studies have proposed bi-directional communication. For example, [15] proposes a bi-directional human-robot collaborative communication framework that allows the robot to communicate decisions with explanations from human feedback. [12] introduced CommPlan, a bi-directional communication framework that allows the robot to ask for the human’s intent, share the robot’s intent, and give commands to humans. Recent works have also used LLMs as a communication module in bi-directional human-robot communication (e.g., [5, 16, 17, 2, 18]). While these LLM-based agents can achieve some success, the communication generated by LLMs is often redundant and/or not grounded in agents’ mental states, actions, and plans.

In addition, most human-robot communication frameworks, such as [12, 19, 20], assume full agent observability. The resulting communication is thus only restricted to informing and inquiring about goals and plans. Our work attempts to extend to scenarios where both agents have partial observability of the environment and allow the robot to communicate to resolve partial knowledge and false beliefs about the environment state. This requires agents to model and reason about each other’s mental states recursively (e.g., the robot thinks the human thinks the glass is in the fridge, but it knows that the glass is actually in the cabinet), which remains a challenge for LLMs today [21]. As a result, such cooperative communication capacity remains an open research question in embodied cooperation.

II-C Theory of Mind for Cooperative Robot Planning

There have been many studies on inferring an agent’s goals and beliefs (e.g., [22, 23, 24, 25, 26, 2, 27]), commonly referred to as Theory of Mind reasoning, to better coordinate with humans in collaborative tasks. Previous studies have leveraged explicit mental reasoning to improve cooperative robot planning. This includes generating more expressive or explainable plans to improve humans’ understanding of robots’ plans [28, 29, 30, 31, 4, 32] or better understanding of humans’ cooperative actions [3], all via reasoning about humans’ mental models of the robot. There have also been works on developing a shared joint planner in two agents’ minds to reach optimal coordination by reasoning about one agent’s own plan and the other agent’s plan jointly. However, existing works do not allow verbal communication between humans and robots in addition to action planning. Our work aims to fill this gap by jointly planning for actions that change the physical state and verbal communication that changes the mental states of humans and robots.

III Problem Formulation

In this work, we consider two agents, a human user and a robot assistant. To successfully communicate and cooperate, the two agents must infer each other’s mind. We adopt the Interactive Partially Observable Markov Decision Process (I-POMDP) [33, 34] to formulate the mental reasoning between the human and the robot.

III-A Background: I-POMDP

I-POMDP is a framework that enables an agent to recursively model other agents, which captures complex social interactions between agents. Here, we consider the interactions between two agents, $i$ and $j$, in which agent $i$ infers agent $j$’s mental state recursively. In an I-POMDP, there are states $s^t$; agents’ observations, $o_i^t$ and $o_j^t$, sampled from their conditional observation probabilities, $O_i(o_i^t \mid s^t)$ and $O_j(o_j^t \mid s^t)$; and agents’ actions, $a_i^t$ and $a_j^t$.
Agents have their beliefs, $b_i^t$ and $b_j^t$, and goals, $\theta_i$ and $\theta_j$. To model the recursive mental reasoning, we define interactive states for the agents, i.e., $is_{i,\ell}$ and $is_{j,\ell}$, at level $\ell$. From agent $i$’s perspective, we define its interactive state at each level as

  • Level $0$: $is_{i,0} = s$

  • Level $1$: $is_{i,1} = (s, b_{j,0}, \theta_j)$

  • $\cdots$

  • Level $\ell$: $is_{i,\ell} = (s, b_{j,\ell-1}, \theta_j)$

The level-$\ell$ inference for agent $i$ is to infer the belief $b_{i,\ell}^t = p(is_{i,\ell}^t \mid o_i^{1:t}, a_i^{1:t-1})$. Since the level-$\ell$ agent $i$’s interactive state, $is_{i,\ell}^t = (s^t, b_{j,\ell-1}^t, \theta_j)$, includes $j$’s belief at level $\ell-1$ ($b_{j,\ell-1}^t$), the inference at level $\ell$ depends on inference at level $\ell-1$, which depends on inference at level $\ell-2$, and so on. This recursive inference terminates at level $0$.
That is, the belief at level 0 is only about the physical state, $b_{i,0} = p(s^t \mid o_i^{1:t}, a_i^{1:t-1})$. This becomes a standard POMDP [35], which does not model other agents.
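As an illustration of the level-0 case, the belief over the physical state can be maintained with a standard discrete Bayes filter. The sketch below is only a minimal example under stated assumptions: the state space is small and discrete, and the names `update_level0_belief`, `transition`, and `observe` are hypothetical helpers introduced here, not part of the paper. The transition model in particular is an assumption, since this section only names the observation probabilities.

```python
from typing import Callable, Dict, Hashable

State = Hashable
Belief = Dict[State, float]  # discrete distribution over physical states


def update_level0_belief(
    belief: Belief,
    action: Hashable,
    observation: Hashable,
    transition: Callable[[State, Hashable], Belief],  # assumed model T(s' | s, a)
    observe: Callable[[Hashable, State], float],      # observation model O(o | s')
) -> Belief:
    """One filter step: b'(s') is proportional to O(o | s') * sum_s T(s' | s, a) * b(s)."""
    # Prediction: push the current belief through the transition model.
    predicted: Belief = {}
    for s, p in belief.items():
        for s_next, p_next in transition(s, action).items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * p_next
    # Correction: weight each predicted state by the observation likelihood.
    weighted = {s: p * observe(observation, s) for s, p in predicted.items()}
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return {s: w / total for s, w in weighted.items()}


def static_world(state: State, action: Hashable) -> Belief:
    """Toy transition model: the glass never moves."""
    return {state: 1.0}


def noisy_sensor(observation: Hashable, state: State) -> float:
    """Toy observation model: the agent sees the true location 90% of the time."""
    return 0.9 if observation == state else 0.1


prior = {"fridge": 0.5, "cabinet": 0.5}
posterior = update_level0_belief(prior, "look", "cabinet", static_world, noisy_sensor)
# posterior is approximately {"fridge": 0.1, "cabinet": 0.9}
```

Higher levels would nest this update: a level-1 belief additionally tracks a distribution over the other agent's level-0 belief and goal.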

III-B Two-level Reasoning for Embodied Cooperation

Theoretically, the level of agents’ reasoning about other agents’ minds can go to infinity (e.g., the robot thinks the human thinks the robot thinks…), yet we cap the depth at two in our model, in line with most empirical evidence suggesting that humans rarely engage in more than two levels of recursive Theory of Mind reasoning [36]. Therefore, we adopt a two-level I-POMDP for modeling the mental reasoning between a human user and a robot assistant in embodied cooperation. In particular, we define the mind of each agent as the belief of the level-1 interactive state of the agent. For the human user’s mind, we have $m_H = b(is_{H,1}) = \{b_{H,0}, b(b_{R,0}), b(g_R)\}$, where $b_{R,0}$ is the robot’s level-0 belief, i.e., its belief about the physical state, and $g_R$ is the robot’s goal.
Similarly, for the robot assistant, we define its mind as $m_R = b(is_{R,1}) = \{b_{R,0}, b(b_{H,0}), b(g_H)\}$, where $b_{H,0}$ is the human’s belief about the physical state, and $g_H$ is the human’s goal. Intuitively, each mind models the agent’s belief about (1) the physical state, (2) the other agent’s belief about the physical state, and (3) the goal of the other agent. Due to the cooperative nature of our problem setting, we further constrain the goal inference to satisfy one of the following two conditions:

  • Condition 1: Both agents share a known common goal;

  • Condition 2: The robot’s goal is the human goal inferred by the robot, and the human user knows that the robot is trying to help with the inferred human goal.
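The three components of a mind and the two goal conditions can be sketched as a small data structure. This is an illustrative sketch only: the class `Mind`, its field names, and the helper `robot_goal` are hypothetical, beliefs are represented as plain dictionaries, and the belief over the other agent's belief is approximated by a list of samples, mirroring the particles used in Algorithm 1. Taking the most probable goal under Condition 2 is a simplification made here; Algorithm 1 samples goals instead.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Optional

Belief = Dict[Hashable, float]  # discrete distribution over states or goals


@dataclass
class Mind:
    """Belief over a level-1 interactive state, e.g. m_R = {b_R0, b(b_H0), b(g_H)}."""
    own_belief: Belief                    # belief about the physical state
    other_belief_particles: List[Belief]  # sampled beliefs of the other agent
    other_goal_belief: Belief             # belief about the other agent's goal


def robot_goal(robot_mind: Mind, shared_goal: Optional[Hashable] = None) -> Hashable:
    """Pick the robot's goal under the two conditions described above."""
    if shared_goal is not None:
        # Condition 1: both agents share a known common goal.
        return shared_goal
    # Condition 2: the robot adopts the human goal it infers
    # (here simplified to the most probable goal).
    goals = robot_mind.other_goal_belief
    return max(goals, key=goals.get)


robot_mind = Mind(
    own_belief={"glass_in_cabinet": 1.0},
    other_belief_particles=[{"glass_in_fridge": 1.0}],
    other_goal_belief={"set_table": 0.7, "make_juice": 0.3},
)
assisting_goal = robot_goal(robot_mind)               # Condition 2
teaming_goal = robot_goal(robot_mind, "make_soup")    # Condition 1
```

In this toy example the robot knows the glass is in the cabinet while its single particle for the human's belief places it in the fridge, which is the kind of false belief that communication is meant to resolve.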

Condition 1 models human-robot teaming, in which the human and robot agents are teammates who work on the same task assigned to them a priori. Condition 2 models robot assistance, in which the human’s true goal is unknown to the robot a priori; the robot must therefore infer the human’s goal and provide assistance. In both cases, agents only have partial observability of the physical state, and thus they have to infer both the physical state and each other’s belief about the physical state. It is worth noting that our formulation departs from most previous assistance-game setups, which either assume that the agents have full observability or that they share a known goal. Because agents in collaborative tasks often lack perfect knowledge of the environment, and thus need to represent other agents’ beliefs separately from their own and to communicate and coordinate their actions, our formulation is more closely aligned with real-world embodied cooperation.
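Under Condition 2, the robot's goal inference can be written as a Bayesian update over a discrete goal set, following the rule stated in Algorithm 1: $b(g_H) \propto P(a_H^{t-1} \mid g_H)\, P(u_H^{t-1} \mid g_H)\, b(g_H)$. The sketch below is a minimal example under stated assumptions. How the two likelihoods are computed is not specified in this section, so they are passed in as plain dictionaries; the function name `update_goal_belief` is hypothetical, and treating a missing message as a likelihood of 1 is an assumption made here.

```python
from typing import Dict, Hashable, Optional

GoalBelief = Dict[Hashable, float]


def update_goal_belief(
    prior: GoalBelief,
    action_likelihood: Dict[Hashable, float],                 # P(a_H | g_H) per goal
    utterance_likelihood: Optional[Dict[Hashable, float]] = None,  # P(u_H | g_H) per goal
) -> GoalBelief:
    """Posterior over goals: b(g) proportional to P(a | g) * P(u | g) * b(g)."""
    unnormalized = {}
    for goal, p in prior.items():
        likelihood = action_likelihood[goal]
        if utterance_likelihood is not None:
            likelihood *= utterance_likelihood[goal]
        # Assumption: with no human message, P(u | g) is treated as 1 for every goal.
        unnormalized[goal] = likelihood * p
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("evidence has zero probability under every goal")
    return {goal: w / total for goal, w in unnormalized.items()}


# The human picks up lettuce, which is far more likely if the goal is a salad.
prior = {"salad": 0.5, "soup": 0.5}
posterior = update_goal_belief(prior, {"salad": 0.8, "soup": 0.2})
# posterior is approximately {"salad": 0.8, "soup": 0.2}
```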

IV Goal-Oriented Mental Alignment

As Fig. 1 illustrates, when there is a shared mind, two agents will share the same joint plan. In our Goal-Oriented Mental Alignment (GOMA) framework, we formulate communication optimization as the convergence of the current joint plan and the joint plan given a shared mind achieved by exchanging information through verbal communication. In particular, we consider two types of communication: sharing information and requesting information. These are two dominant types of verbal communication in human cooperation [1]. We hypothesize that these are also two types of communication that a robot assistant can proactively initiate to achieve joint plan alignment. To reason about whether to communicate and what to communicate, we define a proxy reward for minimizing the divergence between plans before and after one type of communication. We summarize GOMA in Algorithm 1, which works with any off-the-shelf action planner. We introduce key components of the algorithm in the rest of the section.
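The idea of scoring a candidate utterance by how much it would change a plan can be sketched as follows. This is not the reward defined in the paper, which is given by its own equation later in this section. The sketch assumes, purely for illustration, that plans are discrete action distributions, that divergence is measured with KL divergence, and that each utterance carries a constant cost. The names `kl_divergence`, `choose_utterance`, and `cost` are hypothetical.

```python
import math
from typing import Dict, Hashable, Optional, Tuple

Plan = Dict[Hashable, float]  # action distribution produced by some planner


def kl_divergence(p: Plan, q: Plan, eps: float = 1e-9) -> float:
    """KL(p || q) over discrete actions, smoothed with eps to avoid log(0)."""
    return sum(
        pa * math.log((pa + eps) / (q.get(a, 0.0) + eps))
        for a, pa in p.items()
        if pa > 0.0
    )


def choose_utterance(
    plan_before: Plan,
    plans_after: Dict[str, Plan],  # plan the listener would follow after each utterance
    cost: float = 0.1,             # assumed constant cost of speaking
) -> Tuple[Optional[str], float]:
    """Return the utterance with the largest positive net reward, or None to stay silent."""
    best, best_reward = None, 0.0
    for utterance, plan_after in plans_after.items():
        reward = kl_divergence(plan_after, plan_before) - cost
        if reward > best_reward:
            best, best_reward = utterance, reward
    return best, best_reward


# The human believes the glass is in the fridge and plans to walk there.
plan_before = {"go_fridge": 0.9, "go_cabinet": 0.1}
plans_after = {
    "The glass is in the cabinet.": {"go_fridge": 0.1, "go_cabinet": 0.9},
    "The plate is on the table.": {"go_fridge": 0.9, "go_cabinet": 0.1},
}
utterance, reward = choose_utterance(plan_before, plans_after)
```

Only the first message changes the human's plan, so it is the one selected. A message that leaves the plan unchanged has zero divergence and a negative net reward, in which case the function returns `None` and the assistant stays silent.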

Algorithm 1 GOMA
1:  Input: Planner(), $T_{\text{max}}$
2:  Initialization: $b(g_H)$, $b_{R,0}(s^0)$, particles of sampled human beliefs: $\{b_{H,0}^{(l)}(s^0)\}_{l=1}^{L}$
3:  $t \leftarrow 1$, $u_R^0 = \text{None}$
4:  repeat
5:     Observe $o_R^t$ and receive human message $u_H^{t-1}$
6:     Update level-0 belief: $b_{R,0}(s^t)$ based on both $o_R^t$ and $u_H^{t-1}$
7:     Robot knowledge: $K_R^t = K_R(b_{R,0}(s^t))$
8:     Update human goal inference:
9:     $b(g_H) \propto P(a_H^{t-1} \mid g_H)\, P(u_H^{t-1} \mid g_H)\, b(g_H),\ \forall g_H \in \mathcal{G}$
10:     for all $l = 1, \cdots, L$ do
11:        Sample a human goal based on the goal inference: $\hat{g}_H^{(l)} \sim b(g_H)$
12:        Set the robot goal as the inferred human goal: $g_R^{(l)} \leftarrow \hat{g}_H^{(l)}$
13:        Sample an environment state $s^t \sim b_{R,0}(s^t)$
14:        Sample inferred human observations $\hat{o}_H^t \sim O_H(\hat{o}_H^t \mid s^t)$
15:        Update $b_{H,0}^{(l)}(s^t)$ based on both $\hat{o}_H^t$ and $u_R^{t-1}$
16:        Human plan given the inferred human belief:
17:        $\pi_H(a_H^t \mid b_{H,0}, \hat{g}_H^{(l)}) \leftarrow \textbf{Planner}(b_{H,0}, \hat{g}_H^{(l)})$
18:        Human plans given the shared minds augmented by different sub-states in robot knowledge:
19:        $\{\pi_H(a_H^t \mid b_{H,0}^{+s_n}, \hat{g}_H^{(l)}) \leftarrow \textbf{Planner}(b_{H,0}^{+s_n}, \hat{g}_H^{(l)});\ \forall s_n^t \in K_R^t\}$
20:        Robot plan given the robot belief:
21:        $\pi_R(a_R^t \mid b_{R,0}, g_R^{(l)}) \leftarrow \textbf{Planner}(b_{R,0}, g_R^{(l)})$
22:        Robot plans given the shared minds augmented by different sub-states in human knowledge:
23:        $\{\pi_R(a_R^t \mid b_{R,0}^{+s_n}, g_R^{(l)}) \leftarrow \textbf{Planner}(b_{R,0}^{+s_n}, g_R^{(l)});\ \forall s_n^t \in K(b_{H,0}^{(l)}(s^t))\}$
24:     end for
25:     $m_R \leftarrow (b_{R,0}(s^t), \{b_{H,0}^{(l)}(s^t)\}_{l=1}^{L}, b(g_H))$
26:     All possible human knowledge: $\hat{K}_H^t = \cup_{l=1}^{L} K(b_{H,0}^{(l)}(s^t))$
27:     Construct the utterance space $U$ based on the robot knowledge $K_R^t$ and all possible human knowledge $\hat{K}_H^t$
28:     Compute $R(u, M_R)$, $\forall u \in U$ using the plans generated above based on Eq. (IV-D-4)
29:     Select robot utterance based on the proxy reward:
30:     uRt=argmaxuUR(u,mR)superscriptsubscript𝑢𝑅𝑡subscriptargmax𝑢𝑈𝑅𝑢subscript𝑚𝑅u_{R}^{t}=\operatorname*{arg\,max}_{u\in U}R(u,m_{R})italic_u start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_u ∈ italic_U end_POSTSUBSCRIPT italic_R ( italic_u , italic_m start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT )
31:     Select robot action based on the average plan:
32:     aRt=argmaxaR𝒜l=1LπぱいR(aR|bR,0,gR(l))/Lsuperscriptsubscript𝑎𝑅𝑡subscriptargmaxsubscript𝑎𝑅subscript𝒜superscriptsubscript𝑙1𝐿subscript𝜋𝑅conditionalsubscript𝑎𝑅subscript𝑏𝑅0subscriptsuperscript𝑔𝑙𝑅𝐿a_{R}^{t}=\operatorname*{arg\,max}_{a_{R}\in\mathcal{A_{R}}}\sum_{l=1}^{L}\pi_% {R}(a_{R}|b_{R,0},g^{(l)}_{R})/Litalic_a start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ∈ caligraphic_A start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_πぱい start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT | italic_b start_POSTSUBSCRIPT italic_R , 0 end_POSTSUBSCRIPT , italic_g start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ) / italic_L
33:     Execute the robot action aRtsuperscriptsubscript𝑎𝑅𝑡a_{R}^{t}italic_a start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and send the robot utterance uRtsuperscriptsubscript𝑢𝑅𝑡u_{R}^{t}italic_u start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT
34:     tt+1𝑡𝑡1t\leftarrow t+1italic_t ← italic_t + 1
35:  until t=Tmax𝑡subscript𝑇maxt=T_{\text{max}}italic_t = italic_T start_POSTSUBSCRIPT max end_POSTSUBSCRIPT or the true goal has not been reached
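
The action-selection step at the end of the loop (Line 32) averages the robot's policies over the sampled goals and picks the most probable action. The following is a minimal Python sketch of that step only. It assumes each policy is a dict mapping an action name to a probability, which is an illustrative representation rather than the planner's actual output format; the action names are hypothetical.

```python
from collections import defaultdict


def select_action(policies):
    """Return the action with the highest probability under the average policy.

    `policies` holds one dict {action: probability} per sampled goal.
    The dict representation is an assumption made for this sketch.
    """
    if not policies:
        raise ValueError("need at least one policy")
    avg = defaultdict(float)
    for policy in policies:
        for action, prob in policy.items():
            avg[action] += prob / len(policies)
    # Iterate over sorted action names so ties are broken deterministically.
    return max(sorted(avg), key=avg.get)


# Hypothetical policies for three sampled goals.
policies = [
    {"grab_plate": 0.7, "open_fridge": 0.3},
    {"grab_plate": 0.4, "open_fridge": 0.6},
    {"grab_plate": 0.5, "walk_to_table": 0.5},
]
best_action = select_action(policies)
```

Averaging before taking the arg max means an action that is reasonable under most sampled goals can win over one that is optimal for a single goal only.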

IV-A Goal Inference and Joint Planning for the Robot

Unless the human goal is given to the robot a priori (i.e., condition 1 defined in Section III-B), the robot must infer the human goal. We adopt the approach introduced by [2], which leverages an LLM to conduct goal inference based on the observed human actions and messages (Lines 8-9 in Algorithm 1). We then sample possible human goals from the inferred goal distribution (Line 11 in Algorithm 1).

The joint plan for the robot includes two components. The first is the robot's policy given its goal and its belief, i.e., $\pi_R(a_R | b_{R,0}, g_R)$. The second is the expected human policy inferred by the robot, i.e., $\mathrm{E}_{b(b_{H,0}), b(g_H)}[\pi_H(a_H | b_{H,0}, g_H)]$. In practice, we can estimate this expectation via sampling particles of possible human beliefs (i.e., $\{b_{H,0}^{(l)}(s^t)\}_{l=1}^{L}$ in Algorithm 1) and possible goals (Line 11 in Algorithm 1).
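
Written out, the particle approximation described above can take the following form. This is a sketch under two assumptions that the text does not state explicitly: the $L$ belief particles and the $L$ sampled goals (written here as $g_H^{(l)}$, a notation introduced for illustration) are paired index by index, and all particles are weighted uniformly, mirroring the uniform average over $l$ in Line 32 of Algorithm 1.

$$\mathrm{E}_{b(b_{H,0}), b(g_H)}[\pi_H(a_H | b_{H,0}, g_H)] \approx \frac{1}{L} \sum_{l=1}^{L} \pi_H(a_H | b_{H,0}^{(l)}, g_H^{(l)})$$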

IV-B Agent Knowledge From Level-0 Belief

Recall that the level-0 belief of an agent, $b_{i,0}$, represents the agent's belief about the physical state $s$. If we partition the state $s$ into multiple sub-states, such as the states of all objects in the environment, then we can evaluate the uncertainty in the belief over each sub-state. We define the sub-states that have certain belief distributions as the knowledge of an agent. Formally, let us denote a state partition as $s = \{s_n\}_{n=1}^{N}$ with $N$ sub-states and $b_{i,0}(s_n)$ as the level-0 belief of the sub-state $s_n$. For instance, if $s_n$ is object $n$'s state, then $b_{i,0}(s_n)$ is agent $i$'s belief of object $n$'s state. Consequently, we define the knowledge of agent $i$ as

$$K_i = K(b_{i,0}) = \{b_{i,0}(s_n) : \mathcal{H}(b_{i,0}(s_n)) < \mathcal{H}_{\text{max}},\ n = 1, \cdots, N\}, \tag{1}$$

where $\mathcal{H}$ is the entropy of a belief distribution and $\mathcal{H}_{\text{max}}$ is the maximum entropy at which a belief is still considered certain. In the example of object states as sub-states, knowledge consists of the objects over which the agent holds beliefs with high certainty.
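
To make Eq. (1) concrete, here is a small Python sketch that filters an agent's level-0 belief down to its knowledge. It assumes discrete beliefs stored as nested dicts (object name to a distribution over locations) and uses Shannon entropy in nats; the object names, locations, and the threshold value 0.3 are hypothetical choices for illustration.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution {value: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def knowledge(belief, h_max):
    """Keep only the sub-state beliefs whose entropy is below h_max (Eq. 1)."""
    return {name: dist for name, dist in belief.items() if entropy(dist) < h_max}


# Hypothetical level-0 belief over the locations of three objects.
belief = {
    "plate": {"coffeetable": 1.0},                 # fully certain
    "fork": {"cabinet": 0.5, "fridge": 0.5},       # maximally uncertain
    "apple": {"fridge": 0.95, "cabinet": 0.05},    # nearly certain
}
known = knowledge(belief, h_max=0.3)  # 0.3 is an assumed threshold
```

With this threshold, the beliefs about the plate and the apple count as knowledge, while the evenly split belief about the fork does not.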

IV-C Shared Mind Augmented by An Agent’s Knowledge

An agent $i$ can imagine a shared mind after acquiring knowledge about a sub-state, $b_{j,0}(s_n) \in K_j$, from another agent $j$ via verbal communication, as both agents would share this knowledge after the communication. We define this as the belief merge operation $b^{+s_n}_{i,0} = \text{Merge}(b_{i,0}, b_{j,0}(s_n))$. Specifically, this merge operation sets agent $i$'s belief of sub-state $s_n$ to that of agent $j$, i.e., $b^{+s_n}_{i,0}(s_n) = b_{j,0}(s_n)$.
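
The merge operation can be sketched in a few lines of Python. The sketch assumes the same nested-dict belief representation as an illustrative convention (object name to a distribution over locations); the function name `merge` and the example objects are hypothetical.

```python
import copy


def merge(belief_i, sub_state, belief_j_sub):
    """Return a copy of belief_i whose belief over `sub_state` is replaced
    by agent j's belief over that sub-state. belief_i itself is not modified."""
    merged = copy.deepcopy(belief_i)
    merged[sub_state] = dict(belief_j_sub)
    return merged


# Hypothetical example: the human is unsure where the plate is,
# while the robot knows it is on the coffee table.
human_belief = {
    "plate": {"cabinet": 0.5, "fridge": 0.5},
    "fork": {"cabinet": 1.0},
}
robot_plate_belief = {"coffeetable": 1.0}
shared = merge(human_belief, "plate", robot_plate_belief)
```

Returning a copy rather than updating in place keeps the original belief available, which matters when the merged belief is only imagined for planning and no message has been sent yet.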

IV-D Divergence Between Plans as Proxy Reward

It is hard to directly estimate the effect of an utterance on the overall task performance. To reason about which knowledge is critical for aligning the joint plans between agents, we instead define a proxy reward for communicating an agent's knowledge. Since the goal of this work is to generate proactive communication initiated by the robot, we model the proxy reward from the perspective of the robot.

We first define the reward of sharing the robot's knowledge of sub-state $s_n$ with the human user as follows:

$$R(\text{share } s_n, M_R) = \text{KL}\left(\mathrm{E}[\pi_H(a_H | b^{+s_n}_{H,0}, g_H)] \,||\, \mathrm{E}[\pi_H(a_H | b_{H,0}, g_H)]\right) - C, \tag{2}$$

where $b^{+s_n}_{H,0} = \text{Merge}(b_{H,0}, b_{R,0}(s_n))$ and $C$ is the cost for communication at a time step.

We then define the reward of requesting possible human knowledge of sub-state $s_n$ to inform the robot's plan:

$$R(\text{request } s_n, M_R) = \text{KL}\left(\pi_R(a_R | b^{+s_n}_{R,0}, g_R) \,||\, \pi_R(a_R | b_{R,0}, g_R)\right) - C, \tag{3}$$

where $b^{+s_n}_{R,0} = \text{Merge}(b_{R,0}, b_{H,0}(s_n))$.

The plans used to compute the KL-divergence for the proxy rewards can be generated by running an off-the-shelf planner given the corresponding beliefs and goals (Lines 16-23 in Algorithm 1).
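
As an illustration of how the share and request rewards behave, the Python sketch below computes a KL-based proxy reward from two discrete policies, one planned with the merged belief and one planned without it. The dict policy representation, the smoothing constant `eps`, the cost value, and the action names are all assumptions made for this sketch, not details taken from the method.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for discrete policies given as {action: probability}.

    q is floored at eps so that actions missing from q do not produce an
    infinite value; this smoothing is an assumption of the sketch.
    """
    total = 0.0
    for action, p_a in p.items():
        if p_a > 0:
            total += p_a * math.log(p_a / max(q.get(action, 0.0), eps))
    return total


def proxy_reward(policy_after, policy_before, cost):
    """Plan divergence caused by the shared knowledge, minus the communication cost."""
    return kl_divergence(policy_after, policy_before) - cost


COST = 0.5  # hypothetical per-step communication cost C

# Hypothetical human policies before and after learning where the plate is.
before = {"open_fridge": 0.4, "open_cabinet": 0.4, "walk_to_coffeetable": 0.2}
after = {"open_fridge": 0.05, "open_cabinet": 0.05, "walk_to_coffeetable": 0.9}

reward_useful = proxy_reward(after, before, COST)     # plan changes a lot
reward_useless = proxy_reward(before, before, COST)   # plan does not change
```

Knowledge that leaves the plan unchanged yields zero divergence, so its reward is just the negative cost, whereas knowledge that shifts the plan substantially can outweigh the cost.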

We also define the reward for not communicating at a step as follows:

$$R(\text{None}, M_R) = 0. \tag{4}$$

IV-E Communication Optimization

Given the proxy rewards defined above, we can then choose whether and what to communicate based on the robot's mind at each step (Lines 27-30 in Algorithm 1). In particular, the utterance space is $U = \{\text{None}\} \cup \{\text{share } s_n; s_n \in K_R\} \cup \{\text{request } s_n; s_n \in \hat{K}_H\}$, where $\hat{K}_H$ is the inferred human knowledge estimated from the human belief particles: $\hat{K}_H = \cup_{l=1}^{L} K(b_{H,0}^{(l)})$. We select the best robot utterance at step $t$ as follows:

$$u_R^t = \operatorname*{arg\,max}_{u \in U} R(u, M_R). \tag{5}$$

We can further generate a natural language message based on the utterance $u_R^t$ to enable communication with real humans. This can be achieved by using GPT-4 [37] to translate $u_R^t$ to natural language through few-shot prompting.
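
The selection rule in Eq. (5) amounts to an arg max over a small discrete set. The Python sketch below assumes the share and request rewards have already been computed and are passed in as dicts keyed by sub-state name; the tuple encoding of utterances and the example reward values are hypothetical.

```python
def select_utterance(share_rewards, request_rewards):
    """Return the utterance with the highest proxy reward.

    Utterances are encoded as (kind, sub_state) tuples, an assumption of this
    sketch. Staying silent has reward 0 and is inserted first, so it wins ties
    because max() returns the first maximal element.
    """
    candidates = {("none", None): 0.0}
    for sub_state, reward in share_rewards.items():
        candidates[("share", sub_state)] = reward
    for sub_state, reward in request_rewards.items():
        candidates[("request", sub_state)] = reward
    return max(candidates, key=candidates.get)


# Hypothetical rewards: sharing the plate location is clearly worthwhile.
choice = select_utterance({"plate": 0.65, "apple": -0.5}, {"fork": 0.2})

# When every utterance has negative reward, the robot stays silent.
silent = select_utterance({"apple": -0.5}, {"fork": -0.1})
```

Because the reward for silence is fixed at zero, the communication cost acts as a threshold: the robot speaks only when some utterance changes the expected plan enough to exceed it.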

IV-F Multimodal Mental Update

At each step, the robot updates its mind based on both its observation $o^t_R$ and the messages it sends and receives. In particular, we extract human knowledge $b_{H,0}(s_n)$ from the human message $u_H^t$ via GPT-4 and use it to update the robot's level-0 belief $b_{R,0}$ jointly with $o^t_R$ (Line 6 in Algorithm 1). For instance, if the human informs the robot of the location of an object, we can update the robot's level-0 belief with the knowledge of the object's location. Additionally, if the robot shares knowledge $b_{R,0}(s_n)$ in its utterance, then the robot can assume that the human's level-0 belief will also be updated accordingly. Thus, in the robot's mind $M_R$, we can update $b(b_{H,0})$ using both the shared robot knowledge and the human observation (Line 15 in Algorithm 1). Note that we can sample possible human observations based on the state inferred by the robot's level-0 belief (Lines 13-14 in Algorithm 1). All beliefs are initialized with a uniform distribution (Line 2 in Algorithm 1).
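
The update described above can be illustrated with a single object's location belief. The Python sketch below starts from a uniform distribution, conditions on an observation that rules out one location, and then applies a message that states the location. Treating the message as fully certain and the observation as noise-free are simplifying assumptions of this sketch, and the function and location names are hypothetical.

```python
def uniform_belief(locations):
    """Uniform initial belief over candidate locations."""
    p = 1.0 / len(locations)
    return {loc: p for loc in locations}


def observe_absent(dist, location):
    """Condition on having looked at `location` and not found the object."""
    total = sum(p for loc, p in dist.items() if loc != location)
    if total == 0:
        raise ValueError("observation contradicts the current belief")
    return {loc: (0.0 if loc == location else p / total) for loc, p in dist.items()}


def apply_message(dist, location):
    """Treat a partner's statement that the object is at `location` as certain."""
    if location not in dist:
        raise KeyError(location)
    return {loc: (1.0 if loc == location else 0.0) for loc in dist}


locations = ["cabinet", "fridge", "coffeetable", "dishwasher"]
b0 = uniform_belief(locations)            # initial uniform belief
b1 = observe_absent(b0, "fridge")         # the fridge was opened and is empty
b2 = apply_message(b1, "coffeetable")     # partner says it is on the coffee table
```

After the message, the belief is certain, so under the entropy criterion of Section IV-B the object's location would count as knowledge.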

V Experiments

We evaluate our model in two human-AI domains, Overcooked and VirtualHome. These two domains cover two distinct alignment objectives. In both domains, there are two agents: a human user and an embodied AI assistant. In Overcooked, the agents' goal is to align their plans temporally so that certain joint actions can be performed at similar time steps, whereas in VirtualHome, the agents align their beliefs about the locations of the objects they try to collect. We describe each in detail below.

| Recipe Name | Ingredient List |
| --- | --- |
| Burger | Cooked(Patty), Cooked(Potato), Chopped(Lettuce), Chopped(Tomato) |
| Pasta | Cooked(Spaghetti), Cooked(Mushroom), Cooked(Cream), Chopped(Basil) |
| Ramen | Cooked(Noodle), Cooked(Mushroom), Cooked(Egg), Chopped(Scallion) |
| Steak & Fries | Cooked(Beef), Cooked(Potato), Chopped(Parsley) |

TABLE I: Overcooked recipe specifications.
Figure 2: Example Overcooked environment. In each environment, there are two rooms. The two agents are always in different rooms. An agent cannot observe the other room. Thus it has to rely on verbal communication to infer the states of the objects in the other room.

V-A Overcooked

Overcooked is a popular multiagent game where agents need to collaborate to prepare and cook ingredients, which is also widely used for evaluating human-AI cooperation (e.g., [38, 9]). In the original game, agents have full observability. In this study, we extended the Overcooked simulator from [9] by assuming partial observability where each agent cannot observe the other room as shown in Fig. 2. At each step, the AI assistant may share its progress on the task or ask about the human’s progress.

The goal of the collaborating agents is to complete the dishes in the shortest amount of time. To simulate more realistic cooking scenarios, we augment the existing simulator with dynamics in which cooked ingredients gradually cool down. If cooked ingredients are not at the ideal temperature when the dish is served, the team receives a penalty. This requires both agents to coordinate better to avoid misalignment in their plans for cooking the ingredients. For instance, one agent should not finish making the burger too early if the other agent has not started cooking the French fries. Therefore, the agents need to coordinate and align their plans such that they finish cooking at the same time. The agents can align their plans by choosing to wait for the other agent (e.g., "I will start cooking A as soon as the other agent finishes B"). There are four recipes in our experiment (Table I): Burger, Pasta, Ramen, and Steak & Fries, each in a unique room layout. We simulate a human agent using the planner in [9], which does not proactively communicate with the AI assistant. Each recipe is run 10 times with different seeds and we report the aggregate results.

Baselines. We evaluate three baselines: Single-agent, No-Communication (No-Comm), and Heuristic-based Communication (Heur-Comm). In Single-Agent, the human completes all the tasks alone. In the No-Comm baseline, no messages are exchanged. In the Heur-Comm baseline, the AI Assistant follows a simple heuristic that shares updates every time a sub-goal has been completed and periodically asks for the human’s progress. The action planner in all methods including GOMA is the same as the planner in [9].

Metrics. We use two performance metrics: speedup and total plan cost. Speedup is calculated by comparing the plan length in each team condition, where the human works with one of the collaborative AI models, to that of the single-agent baseline, i.e., $\text{Speedup} = L_{\text{single}} / L_{\text{team}} - 1$.

Total plan cost is the sum of all action and communication costs, with penalties applied for sub-optimal dish states due to the time lapse between the completion of a hot sub-task (e.g., cooked noodle) and the end of the trial, i.e., $\text{TotalCost} = L + U + \sum_{i \in \text{hot\_items}} \Delta(L_i, L)$, where $U$ is the total number of utterances in a trial, $L$ is the plan length, and $L_i$ refers to the time step at which item $i$ is completed.
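
Both metrics are simple to compute once a trial has been logged. The Python sketch below implements the two formulas as stated; the form of the penalty $\Delta$ is not given in the text, so `linear_cooling` is a purely hypothetical stand-in, and the numbers in the example are made up for illustration.

```python
def speedup(l_single, l_team):
    """Speedup = L_single / L_team - 1."""
    return l_single / l_team - 1.0


def total_cost(plan_length, num_utterances, hot_item_steps, penalty):
    """TotalCost = L + U + sum of penalty(L_i, L) over hot items."""
    cooling = sum(penalty(l_i, plan_length) for l_i in hot_item_steps)
    return plan_length + num_utterances + cooling


def linear_cooling(l_i, plan_length, rate=0.5):
    """Hypothetical penalty: grows linearly with the wait between
    finishing a hot item and the end of the trial."""
    return rate * max(0, plan_length - l_i)


# Made-up trial: the team needs 40 steps versus 60 for a single agent,
# sends 3 utterances, and finishes two hot items at steps 30 and 38.
s = speedup(60, 40)
c = total_cost(40, 3, [30, 38], linear_cooling)
```

Under this stand-in penalty, an item finished 10 steps early costs more than one finished 2 steps early, which is the incentive for temporal alignment described above.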

| Goals | Goal Specification |
| --- | --- |
| Set up table | Put [N forks, N plates, N waterglasses or wineglasses] on [kitchentable, coffeetable] |
| Put groceries | Put [N apple, N salmon, N pudding, N cupcakes] inside [cabinet, fridge] |
| Prepare food | Put [N apple, N salmon, N pudding, N cupcakes] on [kitchentable, coffeetable] |
| Load dishwasher | Put [N forks, N plates, N waterglasses or wineglasses] inside [dishwasher] |

TABLE II: VirtualHome goal specifications.

Figure 3: Experimental results in Overcooked and VirtualHome. Panels: (a) Overcooked simulation; (b) VirtualHome simulation; (c) VirtualHome human experiment; (d) VirtualHome human rating. The quantitative results from experiments (a, b, c) demonstrate that GOMA led to the greatest speedup (left) and least plan cost (right) compared to other baselines. In human subjective ratings (d), participants find GOMA to be more helpful and communicate more useful information than other models.

Figure 4: Example of typical communication enabled by GOMA in VirtualHome. (a) Once the human (in the blue shirt) gives a command to the AI Assistant (in the orange shirt), it infers the human goal and reasons that the human needs 2 plates and 2 forks. (b) As the AI watches the human agent opening the fridge, GOMA informs the human that the plates are on the coffee table. Consequently, the human goes to the coffee table to pick up a plate.

Figure 5: Agents’ trajectories with No-Comm (left) and with GOMA (right) in a VirtualHome environment. In this example, the AI Assistant needs to find a plate and a fork while the human is looking for a water glass. Both agents have knowledge about the items that the other agent is looking for but not their own goal objects. In the No-Comm setting, the agents cannot share knowledge and have to open many containers to search for goal items. By inferring the other agent’s goal and communicating goal-relevant knowledge, GOMA drastically reduces the total number of steps taken to complete the task.

V-B VirtualHome

VirtualHome [39] is a multiagent household simulator. In VirtualHome, agents collaborate to complete daily household tasks. In our experiments, we include four common types of household tasks: Set Table, Load Dishwasher, Get Snacks, Stock Fridge. The goal for each task is defined as a set of goal predicates and their counts as defined in Table II. In VirtualHome, each object is associated with a unique object ID, which we use in agents' communication to distinguish the referent from others (e.g., cabinet.145).

Simulation Experiment. We simulate 25 collaborative scenarios in VirtualHome across 4 goal types and 5 simulated apartments. Each episode is run 3 times and we report the averaged results. We simulate the human agent using the MCTS planner from [39]. The simulated human agent requests help by sampling a subset of the goal predicates and replies to the AI assistant's questions. We compare our proposed method against four baselines: Single-agent, No-Communication (No-Comm), Goal-Agnostic (Goal-Ag), and an LLM agent. The first two are identical to the ones in Overcooked. The Goal-Ag baseline does not infer the joint goals and plans and instead randomly shares information about any objects that the human does not know about. For the LLM agent, we use CoELA [5], which achieved state-of-the-art performance on human-AI cooperation in VirtualHome.

Human Experiment. We developed an online human interface to conduct a human experiment. The interface follows the same task setup as the simulation study with 5 conditions: Single-Agent, No-Comm, Goal-Ag, CoELA, and GOMA (Ours). The participants controlled the human user agent to either perform the task alone (Single-Agent) or to work with an AI assistant driven by one of the methods. In all collaborative conditions, the interface includes a chatbox that allows the participant and the AI agent to send messages to each other. We recruited 10 participants who had no prior experience with the simulator. They completed 60 trials over 20 tasks. After completing a trial with an AI assistant, the participants were asked to rate the AI assistant based on four criteria: 1) the assistant is helpful; 2) the assistant understands your goal; 3) the assistant’s communication is useful; and 4) the assistant communicates more than necessary. Each criterion is rated on a 7-point Likert scale (1 = Strongly Disagree, 7 = Strongly Agree).

Metrics. In line with previous studies on VirtualHome [39], we evaluate the models’ performance by computing 1) speedups: counting the number of steps taken to complete the task, and 2) total costs: an overall cost metric that sums up the action and communication cost over the episode.

VI Results

VI-A Simulation Experiment

The simulation results are shown in Fig. 3(a, b). The advantage of collaboration is evident, as the Single-agent baseline performed significantly worse than all collaborative models. Overall, we find that across both the Overcooked and VirtualHome experiments, our model outperformed the other baselines in all metrics. The differences between GOMA and the other baselines are all statistically significant with $p < 0.01$ across both performance metrics.

In Overcooked, GOMA took on average 46.76 steps to complete the task, achieving a 44.61% speedup. Our model completed the tasks with the lowest costs (M = 58.06) compared to the Heuristic-based model (M = 72.0) and No-Comm baseline (M = 65.05). Additionally, GOMA delivered the dishes in the best condition among all tested models, as signaled by the lowest coldness penalty (7.85).

In VirtualHome, GOMA took on average 20.08 steps to complete the task with a 55.8% speedup. Despite having observed objects relevant to the human agent's goal, CoELA made few utterances (Mean = 3.03) and focused exclusively on communicating observations of its own goal. For example, when given the command "Please help me find a fork.", CoELA would later respond "I found fork 323 in cabinet 132." and did not share any knowledge that might be useful for the human's subgoal and plan. The Goal-Agnostic model makes frequent (Mean = 5.41) but mostly irrelevant utterances about possible goal objects. However, it did perform slightly better than the No-Comm baseline because, with enough utterances, it occasionally mentions information useful to the human agent.

Unlike the baselines, GOMA can communicate and inquire about useful goal-relevant information with the human, leading to improved team performance. We include two qualitative examples of GOMA in VirtualHome simulations in Figs. 4 and 5. In these examples, we show that due to partial observability, the AI Assistant and the human each have exclusive knowledge about certain objects relevant to the other agent's subgoals. GOMA allows the AI Assistant to inquire about and inform the other agent of this goal-relevant information. As a result, the agents can find the goal objects quickly without exhaustively opening and checking all containers.

VI-B Human Experiment

The human experiment results are shown in Fig. 3cd. Similar to the simulation results, our proposed method had the greatest speedup over a single agent and outperformed all baselines in terms of plan costs. In contrast to the simulation study, the Goal-agnostic model here performed no better than No-Comm and CoELA as participants stopped paying attention to the assistant after it made too many statements irrelevant to the goal. This is shown in the participants’ subjective ratings where participants reported that the Goal-Agnostic baseline communicated more than necessary.

The participants gave a higher subjective rating to our model than other baselines on all 4 items. Interestingly, even though CoELA and GOMA performed goal inference with the same method, the participants thought that only GOMA understood the human’s goal. This is because by communicating goal-relevant information, GOMA implicitly expressed its understanding of the user’s goal, whereas CoELA only communicated the progress of its own subgoal.

VII Conclusion

In this paper, we introduce GOMA, which enables an embodied AI assistant to efficiently and effectively communicate with a human user to achieve optimal cooperation. GOMA achieves this by reasoning about the other agent’s mental state, assessing the misalignment between mental states, and then proactively initiating necessary communication to exchange goal-relevant information. Our experiments in Overcooked and VirtualHome demonstrate that embodied AI assistants built with GOMA can not only help achieve the human goal faster with lower total plan cost but also receive higher subjective ratings from human participants.

Our study is not without limitations. We have not evaluated GOMA on real-world robot assistants, which we intend to study in the future. We also plan to enhance the flexibility of the communication generation, so that it can communicate about any information relevant to the task in an open-ended manner. Finally, we also aim to investigate more general belief representations that go beyond object states.

References

  • [1] M. Tomasello, Origins of human communication.   MIT press, 2010.
  • [2] L. Ying, T. Zhi-Xuan, V. Mansinghka, and J. B. Tenenbaum, “Inferring the goals of communicating agents from actions and instructions,” in Proceedings of the AAAI Symposium Series, vol. 2, no. 1, 2023, pp. 26–33.
  • [3] D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan, “Cooperative inverse reinforcement learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3909–3917.
  • [4] X. Gao, R. Gong, Y. Zhao, S. Wang, T. Shu, and S.-C. Zhu, “Joint mind modeling for explanation generation in complex human-robot collaborative tasks,” in 2020 29th IEEE international conference on robot and human interactive communication (RO-MAN).   IEEE, 2020, pp. 1119–1126.
  • [5] H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu, and C. Gan, “Building cooperative embodied agents modularly with large language models,” arXiv preprint arXiv:2307.02485, 2023.
  • [6] A. Hong, N. Lunscher, T. Hu, Y. Tsuboi, X. Zhang, S. F. dos Reis Alves, G. Nejat, and B. Benhabib, “A multimodal emotional human–robot interaction architecture for social robots engaged in bidirectional communication,” IEEE transactions on cybernetics, vol. 51, no. 12, pp. 5954–5968, 2020.
  • [7] H. H. Clark, Using language.   Cambridge university press, 1996.
  • [8] M. Kleiman-Weiner, M. K. Ho, J. L. Austerweil, M. L. Littman, and J. B. Tenenbaum, “Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction,” in COGSCI, 2016.
  • [9] S. A. Wu, R. E. Wang, J. A. Evans, J. B. Tenenbaum, D. C. Parkes, and M. Kleiman-Weiner, “Too many cooks: Bayesian inference for coordinating multi-agent collaboration,” Topics in Cognitive Science, vol. 13, no. 2, pp. 414–432, 2021.
  • [10] J. MacMillan, E. E. Entin, and D. Serfaty, “Communication overhead: The hidden cost of team cognition.” 2004.
  • [11] E. Horvitz and J. Apacible, “Learning and reasoning about interruption,” in Proceedings of the 5th international conference on Multimodal interfaces, 2003, pp. 20–27.
  • [12] V. V. Unhelkar, S. Li, and J. A. Shah, “Decision-making for bidirectional communication in sequential human-robot collaborative tasks,” in Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, 2020, pp. 329–341.
  • [13] E. C. Williams, N. Gopalan, M. Rhee, and S. Tellex, “Learning to parse natural language to grounded reward functions with weak supervision,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 4430–4436.
  • [14] T. Zhi-Xuan, L. Ying, V. Mansinghka, and J. B. Tenenbaum, “Pragmatic instruction following and goal assistance via cooperative language-guided inverse planning,” arXiv preprint arXiv:2402.17930, 2024.
  • [15] L. Yuan, X. Gao, Z. Zheng, M. Edmonds, Y. N. Wu, F. Rossano, H. Lu, Y. Zhu, and S.-C. Zhu, “In situ bidirectional human-robot value alignment,” Science robotics, vol. 7, no. 68, p. eabm4183, 2022.
  • [16] C. Zhang, J. Chen, J. Li, Y. Peng, and Z. Mao, “Large language models for human-robot interaction: A review,” Biomimetic Intelligence and Robotics, p. 100131, 2023.
  • [17] B. Ichter, A. Brohan, Y. Chebotar, C. Finn, K. Hausman, …, and C. Kelly, “Do as i can, not as i say: Grounding language in robotic affordances,” in Proceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205.   PMLR, 14–18 Dec 2023, pp. 287–318.
  • [18] Z. Mandi, S. Jain, and S. Song, “Roco: Dialectic multi-robot collaboration with large language models,” arXiv preprint arXiv:2307.04738, 2023.
  • [19] S. Devin and R. Alami, “An implemented theory of mind to improve human-robot shared plans execution,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI).   IEEE, 2016, pp. 319–326.
  • [20] K. E. Schaefer, E. R. Straub, J. Y. Chen, J. Putney, and A. W. Evans III, “Communicating intent to develop shared situation awareness and engender trust in human-agent teams,” Cognitive Systems Research, vol. 46, pp. 26–39, 2017.
  • [21] T. Ullman, “Large language models fail on trivial alterations to theory-of-mind tasks,” arXiv preprint arXiv:2302.08399, 2023.
  • [22] C. L. Baker, J. Jara-Ettinger, R. Saxe, and J. B. Tenenbaum, “Rational quantitative attribution of beliefs, desires and percepts in human mentalizing,” Nature Human Behaviour, vol. 1, no. 4, pp. 1–10, 2017.
  • [23] T. Zhi-Xuan, J. Mann, T. Silver, J. Tenenbaum, and V. Mansinghka, “Online bayesian goal inference for boundedly rational planning agents,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [24] T. Shu, A. Bhandwaldar, C. Gan, K. Smith, S. Liu, D. Gutfreund, E. Spelke, J. Tenenbaum, and T. Ullman, “Agent: A benchmark for core psychological reasoning,” in International conference on machine learning.   PMLR, 2021, pp. 9614–9625.
  • [25] C. Jin, Y. Wu, J. Cao, J. Xiang, Y.-L. Kuo, Z. Hu, T. Ullman, A. Torralba, J. B. Tenenbaum, and T. Shu, “Mmtom-qa: Multimodal theory of mind question answering,” arXiv preprint arXiv:2401.08743, 2024.
  • [26] L. Ying, K. M. Collins, M. Wei, C. E. Zhang, T. Zhi-Xuan, A. Weller, J. B. Tenenbaum, and L. Wong, “The neuro-symbolic inverse planning engine (nipe): Modeling probabilistic social inferences from linguistic inputs,” arXiv preprint arXiv:2306.14325, 2023.
  • [27] L. Ying, T. Zhi-Xuan, L. Wong, V. Mansinghka, and J. Tenenbaum, “Grounding language about belief in a bayesian theory-of-mind,” arXiv preprint arXiv:2402.10416, 2024.
  • [28] A. Dragan and S. Srinivasa, “Generating legible motion,” 2013.
  • [29] F. Stulp, J. Grizou, B. Busch, and M. Lopes, “Facilitating intention prediction for humans by optimizing robot motions,” in 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS).   IEEE, 2015, pp. 1249–1255.
  • [30] M. Kwon, S. H. Huang, and A. D. Dragan, “Expressing robot incapability,” in Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, 2018, pp. 87–95.
  • [31] Y. Zhang, S. Sreedharan, A. Kulkarni, T. Chakraborti, H. H. Zhuo, and S. Kambhampati, “Plan explicability and predictability for robot task planning,” in 2017 IEEE international conference on robotics and automation (ICRA).   IEEE, 2017, pp. 1313–1320.
  • [32] X. Gao, L. Yuan, T. Shu, H. Lu, and S.-C. Zhu, “Show me what you can do: Capability calibration on reachable workspace for human-robot collaboration,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 2644–2651, 2022.
  • [33] P. J. Gmytrasiewicz and P. Doshi, “A framework for sequential planning in multi-agent settings,” Journal of Artificial Intelligence Research, vol. 24, pp. 49–79, 2005.
  • [34] P. Doshi and P. J. Gmytrasiewicz, “Monte Carlo sampling methods for approximating interactive POMDPs,” Journal of Artificial Intelligence Research, vol. 34, pp. 297–337, 2009.
  • [35] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial intelligence, vol. 101, no. 1-2, pp. 99–134, 1998.
  • [36] A. Bosch-Domenech, J. G. Montalvo, R. Nagel, and A. Satorra, “One, two,(three), infinity,…: Newspaper and lab beauty-contest experiments,” American Economic Review, vol. 92, no. 5, pp. 1687–1701, 2002.
  • [37] OpenAI, “Gpt-4 technical report,” 2023.
  • [38] M. Carroll, R. Shah, M. K. Ho, T. Griffiths, S. Seshia, P. Abbeel, and A. Dragan, “On the utility of learning about humans for human-ai coordination,” Advances in neural information processing systems, vol. 32, 2019.
  • [39] X. Puig, T. Shu, S. Li, Z. Wang, Y.-H. Liao, J. B. Tenenbaum, S. Fidler, and A. Torralba, “Watch-and-help: A challenge for social perception and human-ai collaboration,” arXiv preprint arXiv:2010.09890, 2020.