Modelling bounded rational decision-making through Wasserstein constraints

Benjamin Patrick Evans
JP Morgan AI Research
London, UK
benjamin.x.evans@jpmorgan.com
Leo Ardon
JP Morgan AI Research
London, UK
Sumitra Ganesh
JP Morgan AI Research
New York, USA
Abstract

Modelling bounded rational decision-making through information constrained processing provides a principled approach for representing departures from rationality within a reinforcement learning framework, while still treating decision-making as an optimization process. However, existing approaches are generally based on Entropy, Kullback-Leibler divergence, or Mutual Information. In this work, we highlight issues with these approaches when dealing with ordinal action spaces. Specifically, entropy assumes uniform prior beliefs, missing the impact of a priori biases on decision-making. KL-divergence addresses this but has no notion of "nearness" of actions; it also has several well-known, potentially undesirable properties, such as the lack of symmetry, and requires the distributions to have the same support (e.g. positive probability for all actions). Mutual information is often difficult to estimate. Here, we propose an alternative approach for modeling bounded rational RL agents utilising Wasserstein distances. This approach overcomes the aforementioned issues. Crucially, this approach accounts for the nearness of ordinal actions, modeling "stickiness" in agent decisions and the unlikeliness of rapidly switching to far away actions, while also supporting low probability actions and zero-support prior distributions, and being simple to calculate directly.

Extended Abstract: Accepted at RLDM 2025, Dublin, Ireland.


Keywords:

Bounded rationality, human decision-making, information processing, constrained decision making, Wasserstein distances

Prior          | Entropy | KL*     | W
Uniform Prior  | [image] | [image] | [image]
Previous Prior | [image] | [image] | [image]
Optimal Prior  | [image] | [image] | [image]

Table 1: Metric across different prior beliefs. Note that KL is generally infinite for the previous and optimal priors, so we use a modified version KL*, which assigns a low probability to all zero probability actions to keep the metric finite.

1 Introduction

Reinforcement Learning (RL) algorithms have achieved notable success in approximating optimal decision-making in complex sequential environments. However, when modeling human-like decision-making to simulate real-world behaviors (e.g., in traffic, markets), most prevailing methods assume perfectly rational agents. This assumption can be overly restrictive, failing to capture critical dynamics inherent in real-world systems [1]. To address departures from strict rationality, various approaches have been proposed to model bounded rationality in RL, based on utility maximization under information-processing constraints. While promising, existing methods face limitations such as neglecting prior beliefs, imposing rigid forms of priors, ignoring action geometries, and computational challenges in their estimation. To overcome these issues, we propose a novel framework incorporating a Wasserstein distance-based constraint. In this extended abstract, we outline the concept and present motivating examples behind this idea.

2 Background and Related Work

Behavioural economics has developed more realistic models of decision-making than the traditional homo economicus, the perfectly rational agent. Instead, these models operate under bounded rationality. While there are many different perspectives on bounded rationality and its causes [2, 3], here we focus on one particular representation that abstracts away specific causes, simply representing bounded rationality as decision-making under processing constraints [4, 5].

Quantifying information processing costs in a generalised manner is desirable, as this enables compatibility with existing optimisation algorithms. This treatment abstracts the underlying causes of such constraints, allowing a focus on learning behaviour without necessitating an in-depth understanding of the specific psychological factors at play. From an optimisation standpoint, this is advantageous, as the process remains independent of the particular details of how decisions are formulated [6]. For experimentalists, a general enough form still allows for encoding different behavioural biases.

2.1 Representation

The general RL formulation is as follows. A decision-maker (DM) seeks to maximise their discounted return based on per-timestep utility $U$ by taking actions $a \in A$ from their action space. Importantly, these DMs may not act perfectly rationally, and may instead be satisficing. The system is characterised by a state space $S$, and DMs possess a (potentially partial) observation of the current state $s$ and prior beliefs $q$ about their potential actions (a probability distribution over the action space). The behaviour of the DM is governed by their policy $\pi$, which is a mapping from states to a distribution over actions. DMs act based on their policy $a \sim \pi$, receiving per-timestep reward $U(a,s)$. Agents learn an (approximately) optimal policy $\pi_i^*$ that maximizes their expected lifetime return:

$$\pi_{i}^{*}(a|s_{i})=\max_{\pi_{i}}\mathbb{E}_{\pi_{i}}\left[\sum_{t=0}^{\infty}\gamma^{t}U(a_{t}|s_{i,t})\right] \tag{1}$$

However, to model departures from perfect utility maximization, an alternative approach applies some form of information processing constraint to this maximization process, modelling limitations in reasoning capacities:

$$\max_{\pi_{i}}\mathbb{E}_{\pi_{i}}\left[\sum_{t=0}^{\infty}\gamma^{t}U(a_{t}|s_{i,t})\right] \quad \text{subject to} \quad I(\pi_{i},s_{i,t},q_{i})<\bar{I} \tag{2}$$

where agents maximise $U$ while adhering to a constraint $\bar{I}$ on their processing costs $I$. Using a Lagrange multiplier, Eq. 2 can be reformulated as the maximization of a modified reward:

$$\pi_{i}^{\lambda}(a|s_{i})=\max_{\pi_{i}}\mathbb{E}_{\pi_{i}}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(U(a_{t}|s_{i,t})-\lambda I(\pi_{i},s_{i,t},q_{i})\right)\right] \tag{3}$$

which importantly permits the same general representation as Eq. 1, with a regularised utility function to model various departures from rationality based on the function $I$ (discussed in the following section) and prior beliefs $q$. These prior beliefs $q$ (also called "magnets" [7] or "anchors" [8]) may change throughout training and inference (e.g. with updated information) and can take many forms, for example, demonstrating bias towards specific actions, encoding heuristics, averaging over past decisions, or preferring historically well-performing actions, allowing an additional form of bounded rationality when $I$ accounts for $q$. (Footnote 1: Here we assume $q$ is not state-dependent, but state-dependent priors are also supported under the same framework.) This constrained representation is beneficial, as it enables the utilization of any existing RL algorithm with minimal modifications to the loss function or optimization process [1].

2.2 Existing information costs

Figure 1: Entropy. Examples of various levels of entropy for different policies (independent of any prior beliefs). Panels: (a) Low Entropy, (b) Mid Entropy, (c) High Entropy.

Figure 2: KL examples with infinite divergence. Despite some of these policies seemingly being "closer" to the prior, all three have KL $= \infty$. Panels: (d) Example 1, (e) Example 2, (f) Example 3.

By modifying $I$, we can model various forms of processing costs and capture different notions of bounded rationality. In this section, we examine common existing approaches, including entropy, KL-divergence, and Mutual Information.

Entropy

An entropy constraint is one prominent approach for relaxing the strict, perfectly rational assumption, restricting deviations from uniform behaviour. For example, this is done in Quantal Response Equilibrium, which allows deviations from optimal responses and permits erroneous play. This constraint can be represented based on an information processing cost, $I_{\text{Entropy}} = H = -\sum_{a \in A} \pi(a|s) \log \pi(a|s)$, which has multiple applications in RL [7].

However, much research has shown the usefulness of incorporating arbitrary prior beliefs $q$ (not just uniform) for better capturing realistic decision-making [5], motivating extensions that measure the divergence from an arbitrary prior distribution based on the Kullback-Leibler (KL) divergence $D_{\text{KL}}$ [4].

KL Divergence

To better model human-like decisions and account for the impact of prior beliefs $q$ on decision-making, [1] proposes a KL-based approach using the following information processing costs:

$$I_{D_{\text{KL}}} = D_{\text{KL}}(\pi \parallel q) = \sum_{a \in A} \pi(a|s) \log \frac{\pi(a|s)}{q(a|s)} \tag{4}$$

to constrain $\pi_i$ from diverging too far from agents' prior beliefs $q$ at each state, limiting their strategic abilities. When prior beliefs are uniform, $q(a|s) = c$, Eq. 4 is equivalent to enforcing an entropy constraint (up to some constant), as

$$\sum_{a \in A} \pi(a|s) \log \frac{\pi(a|s)}{q(a|s)} = \sum_{a \in A} \pi(a|s) \log \pi(a|s) - \sum_{a \in A} \pi(a|s) \log c = -H + C \tag{5}$$

where $C = -\log c$ is a constant, because $\sum_{a \in A} \pi(a|s) = 1$.

However, in general cases, KL quantifies the divergence from arbitrary prior beliefs, encoding different behavioral biases.
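The relationship in Eq. 5 and the support requirement of KL can be checked numerically. The following is a minimal sketch in plain Python; the function names `entropy` and `kl_divergence` and the five-action example distributions are illustrative assumptions, not taken from the paper.

```python
import math


def entropy(pi):
    """Shannon entropy H = -sum_a pi(a) log pi(a), taking 0 log 0 as 0."""
    return -sum(p * math.log(p) for p in pi if p > 0)


def kl_divergence(pi, q):
    """D_KL(pi || q); infinite when pi puts mass where q has none."""
    total = 0.0
    for p, q_a in zip(pi, q):
        if p == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if q_a == 0:
            return math.inf  # support mismatch
        total += p * math.log(p / q_a)
    return total


# Hypothetical policy over five ordinal actions.
pi = [0.1, 0.2, 0.4, 0.2, 0.1]
uniform = [1 / len(pi)] * len(pi)

# Eq. 5 with q(a|s) = c = 1/|A|: D_KL = -H + C, where C = -log c = log|A|.
lhs = kl_divergence(pi, uniform)
rhs = -entropy(pi) + math.log(len(pi))

# With a Dirac prior on action 0, a policy one step away and a policy
# four steps away are both assigned infinite divergence.
dirac = [1.0, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0, 0.0]
far = [0.0, 0.0, 0.0, 0.0, 1.0]
kl_near = kl_divergence(near, dirac)
kl_far = kl_divergence(far, dirac)
```

In this sketch `lhs` and `rhs` agree up to floating-point error, while `kl_near` and `kl_far` are both infinite, so KL cannot distinguish a small shift from a large one.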

Mutual Information

Finally, [6] proposes rational inattention (RI), which is based on Mutual Information (MI), and this has been incorporated into RL in [9]. MI is defined over the joint probabilities as $I_{\text{MI}} = -\sum_{a \in A} p(a, s_i) \log \frac{p(a, s_i)}{p(s_i) p(a)}$, which depends on the unconditional action probability $p(a)$, a quantity that generally must be solved for with approximation techniques [5]. For this reason, we focus primarily on the other two approaches above, which can be computed directly because they depend only on the conditional probabilities given by the policies.

Limitations

Each of the above measures suffers from its own limitations: entropy does not account for priors, KL goes to infinity under many different configurations of priors, mutual information is challenging to compute, and none of the metrics account for the geometry or nearness of ordinal actions.

3 Proposed Approach

In order to overcome the aforementioned limitations in modeling bounded rationality in RL, in this section, we propose a novel RL approach based on Wasserstein distances.

Wasserstein metric

We now define the Wasserstein distance $W$ (also known as the Kantorovich–Rubinstein metric or Earth Mover's Distance) between two discrete probability distributions, the policy $p$ and prior beliefs $q$. While $W$ can also be defined on continuous distributions, we focus on the discrete case in this work. First, we quantify a distance $d(a_i, a_j)$ between two actions $a_i$ and $a_j$ in the action space $A$. We use the absolute distance $d(a_i, a_j) = |i - j|$ for simplicity. For instance, the distance between actions 4 and 5 is $d(4, 5) = 1$. If there is no natural notion of distance between actions (e.g., for non-ordinal actions), we could instead use a fixed distance $d(a_i, a_j) = D, \forall a_i, a_j \in A$, but generally, Wasserstein distances make the most sense under ordinal actions. Likewise, if a larger shift in agent perception is required when moving between certain actions, for example when crossing a decision boundary such as going from a positive to a negative action, larger distances could be assigned across this boundary to represent the cognitive shift.

We then construct a cost matrix $C$, where $C_{i,j} = d(a_i, a_j)^n$, for a chosen order $n$ (e.g., $n = 1$ or $n = 2$). The transport plan matrix $T$ specifies how much probability mass is moved between each pair of actions when transforming a prior belief into a policy, and must satisfy the following constraints: (1) $T_{i,j} \geq 0$ (non-negativity), (2) $\sum_{a_i \in A} T_{i,j} = q(a_j)$ (supply constraint), and (3) $\sum_{a_j \in A} T_{i,j} = p(a_i|s)$ (demand constraint).

The Wasserstein distance is then defined as the cost of the optimal transport plan for this move from the prior to the policy:

$$I_W = \min_T \sum_{i \in A} \sum_{j \in A} C_{i,j} T_{i,j},$$

subject to the identified constraints. $I_W$ has several desirable properties, which we highlight in the experiments section: it incorporates prior beliefs, is defined even on varying support, is efficient to compute, and incorporates the geometry of the action space, allowing distances among actions to be quantified, something not considered in KL-divergence or other existing approaches. Additionally, $I_W$ is symmetric and satisfies the triangle inequality.
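To make the definition concrete, the sketch below builds a transport plan and evaluates $I_W$ in plain Python. It relies on an assumption that goes beyond the text: for ordered actions with $d(a_i, a_j) = |i - j|$ and $n \geq 1$, the monotone (north-west corner) coupling is an optimal plan, so no linear-programming solver is needed. A general cost matrix would require solving the linear program instead. The function names and example distributions are hypothetical.

```python
def transport_plan(q, p, tol=1e-12):
    """Monotone coupling between prior q and policy p on ordered actions.

    T[i][j] is the mass moved from prior action j to policy action i,
    so column j sums to q[j] (supply) and row i sums to p[i] (demand).
    """
    k = len(q)
    T = [[0.0] * k for _ in range(k)]
    supply, demand = list(q), list(p)
    i = j = 0
    while i < k and j < k:
        moved = min(demand[i], supply[j])
        T[i][j] += moved
        demand[i] -= moved
        supply[j] -= moved
        if demand[i] <= tol:
            i += 1  # policy action i is fully served
        else:
            j += 1  # prior action j is exhausted
    return T


def wasserstein_cost(q, p, n=1):
    """I_W = sum_ij C[i][j] * T[i][j] with C[i][j] = |i - j| ** n."""
    T = transport_plan(q, p)
    size = len(q)
    return sum(abs(i - j) ** n * T[i][j]
               for i in range(size) for j in range(size))


def cdf_distance(q, p):
    """Independent check for n = 1: sum of absolute CDF differences."""
    total, cq, cp = 0.0, 0.0, 0.0
    for q_a, p_a in zip(q, p):
        cq += q_a
        cp += p_a
        total += abs(cq - cp)
    return total


prior = [1.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical Dirac prior on action 0
near = [0.0, 1.0, 0.0, 0.0, 0.0]   # all mass one step away
far = [0.0, 0.0, 0.0, 0.0, 1.0]    # all mass four steps away

print(wasserstein_cost(prior, near))  # 1.0
print(wasserstein_cost(prior, far))   # 4.0
```

Unlike KL, the cost stays finite when the prior and policy have disjoint support, and it grows with how far the mass has to move.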

As discussed above, constrained decision-makers seek to maximise Eq. 3, i.e.:

$$\pi_{i}^{\lambda}(a|s_{i})=\max_{\pi_{i}}\mathbb{E}_{\pi_{i}}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(U(a_{t}|s_{i,t})-\lambda I_{W}(\pi_{i},s_{i,t},q_{i})\right)\right] \tag{6}$$
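As a worked illustration of how this objective can produce stickiness, consider a single decision (no discounting) under two simplifying assumptions that are ours rather than the paper's: the prior is a point mass on action $a_k$, and the candidate policy is deterministic, placing all mass on $a_j$. With $d(a_i, a_j) = |i - j|$ and $n = 1$, the only feasible transport plan moves all mass from $a_k$ to $a_j$, so:

```latex
I_W = C_{j,k}\, T_{j,k} = |j - k| \cdot 1 = |j - k|,
\qquad
U(a_j|s) - \lambda I_W = U(a_j|s) - \lambda\, |j - k| .
```

Under these assumptions, switching from $a_k$ to $a_j$ is worthwhile only if $U(a_j|s) - U(a_k|s) > \lambda |j - k|$, so more distant actions need a proportionally larger utility gain before the agent moves to them.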

4 Motivating Experiments

To better motivate the chosen representation, we analyze a behavioural economics environment involving actual human participants and how their (inferred) policies evolve over time. We then use the discussed measures to quantify divergences from prior beliefs, showing where the proposed approach may be helpful.

Repeated public goods game

We focus on a repeated public goods game (PGG) with experimental data from [10]. In the PGG, DMs are given 40 tokens and must decide how many of their tokens to contribute to a pool of public resources. Contributions to the public resource are multiplied by 1.6 and dispersed equally to the players at the end of the round. In the experiments of [10], games are repeated for 20 rounds in groups of size 4. The marginal per-capita return for each unit contributed is therefore $1.6 / 4 = 0.4$; as this is $< 1$, the strictly dominant rational strategy is for every player to contribute 0.
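The incentive structure can be sketched in a few lines of Python. The payoff form below (tokens kept plus an equal share of the multiplied pool) is the standard linear public goods payoff and is an assumption here, since the text does not state the formula explicitly; the other players' contributions are hypothetical.

```python
ENDOWMENT = 40    # tokens given to each decision-maker
MULTIPLIER = 1.6  # factor applied to the public pool
GROUP_SIZE = 4    # players per group


def payoff(own, others):
    """Round payoff: tokens kept plus an equal share of the multiplied pool."""
    pool = own + sum(others)
    return ENDOWMENT - own + MULTIPLIER * pool / GROUP_SIZE


# Marginal per-capita return of one contributed token.
mpcr = MULTIPLIER / GROUP_SIZE

# Hypothetical contributions from the three other players.
others = [20, 20, 20]
best_response = max(range(ENDOWMENT + 1), key=lambda c: payoff(c, others))

everyone_defects = payoff(0, [0, 0, 0])
everyone_cooperates = payoff(40, [40, 40, 40])
```

Each contributed token costs 1 and returns only 0.4 to the contributor, so `best_response` is 0 regardless of `others`, even though full cooperation pays each player more than mutual defection (64 versus 40 tokens under this payoff form).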

Figure 3: Public Goods Game, with real experimental data from [10]. Panels: (a) Average policy evolution, (b) Overall contributions, (c) Change in contributions, (d) Absolute change in contributions, (e) Pairwise changes, (f) Phase Diagram.

4.1 Decisions

To understand the rationality of decision-makers and also their willingness to make large changes in their decisions, we visualize the aggregated views of their contributions in Fig. 3. While there is a trend over time towards the rational choice (contributing 0), we can see an apparent stickiness in their decisions, with the vast majority of decisions changing the expected contribution by less than 5 each timestep, as well as persisting sub-rational choices.

These changes are not only explained by the convergence towards the rational choice, as the pattern is relatively symmetric for both decreases (approaching rationality) and increases in contributions (moving away from rationality), see Fig. 3(c). There are various proposed explanations for this, but the key is that these departures follow a clear bias towards previous decisions, as demonstrated in both Figs. 3(c) and 3(d). (Footnote 2: Additionally, we can see that there are peaks at prominent numbers (e.g. 5, 10, etc.), indicating the well-known prominent number bias, which could also be encoded in prior beliefs here or in the distance function from/to these prominent numbers.) We further confirm this by analyzing the per-timestep change in contributions in Fig. 3(e), and the phase diagram of these changes in Fig. 3(f).

4.1.1 Prior beliefs and inferred Policies

We do not know the exact mental policies decision-makers were using or the prior beliefs of players in this game, only their sequentially revealed decisions. In Table 1, we therefore explore various priors and assume DMs' policies are just the historical averages of the contributions they have played. We consider three different priors: uniform priors, the previous timestep's policy (historical average, as above), and optimal priors. Uniform priors assign equal probability $\frac{1}{41}$ to each action $0 \dots 40$, the previous timestep's policy is just the historical policy at $t-1$, and the optimal prior is the Dirac delta function with all probability mass situated at the rational choice of 0, i.e., $p(0|\dots) = 1$.
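The three priors and the inferred policy can be constructed as follows. This is a minimal sketch: the helper names are hypothetical, the contribution history is made up for illustration (it is not data from [10]), and the inferred policy is read as the empirical distribution of past contributions, which is our interpretation of "historical average".

```python
from collections import Counter

N_ACTIONS = 41  # possible contributions 0..40


def uniform_prior():
    """Equal probability 1/41 for every contribution level."""
    return [1 / N_ACTIONS] * N_ACTIONS


def optimal_prior():
    """Dirac delta on the rational choice of contributing 0."""
    prior = [0.0] * N_ACTIONS
    prior[0] = 1.0
    return prior


def historical_policy(contributions):
    """Empirical distribution of the contributions played so far."""
    counts = Counter(contributions)
    total = len(contributions)
    return [counts[a] / total for a in range(N_ACTIONS)]


# Hypothetical contributions of one player over four rounds.
history = [20, 15, 15, 10]

policy_t = historical_policy(history)             # inferred policy at time t
previous_prior = historical_policy(history[:-1])  # inferred policy at t - 1
```

In this example the newest contribution (10) has positive probability under `policy_t` but zero probability under `previous_prior`, which is exactly the support mismatch that makes unmodified KL infinite for the previous and optimal priors.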

When using the different distance measures, Table 1 reveals some of the limitations discussed in Section 2.2, showing the benefits of the proposed approach. Entropy does not change under varying prior beliefs, meaning we cannot model the influence of priors on resulting decisions. KL is infinite for previous and optimal priors, necessitating a modification assigning a low probability to all events. However, even with this modification, KL ends up exploding quite rapidly in the early periods due to the instabilities of logs of small numbers, making optimization difficult and potentially misleading. In contrast, the proposed Wasserstein-based approach is well-behaved under the three different circumstances, demonstrating that this provides a suitable alternative for modelling realistic human decision-making with RL.
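The sensitivity of the modified KL to the chosen floor probability can be illustrated with a small sketch. The smoothing scheme below (raise zero-probability actions to `eps`, then renormalise) is an assumption, since the exact form of KL* is not specified here, and the five-action distributions are hypothetical.

```python
import math


def smooth(q, eps):
    """Assumed KL* smoothing: floor probabilities at eps, then renormalise."""
    raised = [max(x, eps) for x in q]
    total = sum(raised)
    return [x / total for x in raised]


def kl(pi, q):
    """D_KL(pi || q) for a q with full support."""
    return sum(p * math.log(p / q_a) for p, q_a in zip(pi, q) if p > 0)


dirac_prior = [1.0, 0.0, 0.0, 0.0, 0.0]  # all prior mass on action 0
policy = [0.0, 1.0, 0.0, 0.0, 0.0]       # policy one step away

# KL* for progressively smaller floor probabilities.
kl_star = {eps: kl(policy, smooth(dirac_prior, eps))
           for eps in (1e-3, 1e-6, 1e-12)}
```

Here the smoothed divergence is roughly $-\log \epsilon$, so it is set almost entirely by the arbitrary floor rather than by how far the policy is from the prior, whereas the Wasserstein cost between these two distributions with $d(a_i, a_j) = |i - j|$ and $n = 1$ is simply 1.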

5 Conclusion

In this work, we present an approach for modeling realistic decision-making within an RL framework, considering the geometry of action spaces. This approach leverages the Wasserstein distance between a DM's policy and their prior beliefs. We motivate its use by analyzing actual experiments with human participants, demonstrating that Wasserstein distance serves as a natural constraint for bounded rational decision-making. This extended abstract lays the groundwork for future exploration of more complex RL environments, as well as ideas for improving the efficiency of calculating the transport matrix, while highlighting the suitability and effectiveness of the proposed idea based on empirical economic studies.

References

  • [1] B. P. Evans and S. Ganesh, “Learning and calibrating heterogeneous bounded rational market behaviour with multi-agent reinforcement learning,” in AAMAS, pp. 534–543, 2024.
  • [2] D. Kahneman, “A perspective on judgment and choice: Mapping bounded rationality,” Progress in Psychological Science around the World. Volume 1 Neural, Cognitive and Developmental Issues., pp. 1–47, 2013.
  • [3] G. Gigerenzer, “What is bounded rationality?,” in Routledge handbook of bounded rationality, pp. 55–69, Routledge, 2020.
  • [4] P. A. Ortega and D. A. Braun, “Thermodynamics as a theory of decision-making with information-processing costs,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 469, no. 2153, p. 20120683, 2013.
  • [5] B. P. Evans and M. Prokopenko, “A maximum entropy model of bounded rational decision-making with prior beliefs and market feedback,” Entropy, vol. 23, no. 6, p. 669, 2021.
  • [6] C. A. Sims, “Implications of rational inattention,” Journal of Monetary Economics, vol. 50, no. 3, pp. 665–690, 2003.
  • [7] S. Sokota, R. D’Orazio, J. Z. Kolter, N. Loizou, M. Lanctot, I. Mitliagkas, N. Brown, and C. Kroer, “A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games,” in ICLR, 2023.
  • [8] A. P. Jacob, D. J. Wu, G. Farina, A. Lerer, H. Hu, A. Bakhtin, J. Andreas, and N. Brown, “Modeling strong and human-like gameplay with KL-regularized search,” in International Conference on Machine Learning, pp. 9695–9728, PMLR, 2022.
  • [9] T. Mu, S. Zheng, and A. R. Trott, “Modeling bounded rationality in multi-agent simulations using rationally inattentive reinforcement learning,” Transactions on Machine Learning Research, 2022.
  • [10] M. N. Burton-Chellew, H. H. Nax, and S. A. West, “Payoff-based learning explains the decline in cooperation in public goods games,” Proceedings of the Royal Society B: Biological Sciences, vol. 282, no. 1801, p. 20142678, 2015.