Mixture of Public and Private Distributions in Imperfect Information Games.

Jérôme Arjonilla Paris Dauphine University - PSL
Paris, France
jerome.arjonilla@hotmail.fr
   Abdallah Saffidine University of New South Wales
Sydney, Australia
abdallah.saffidine@gmail.com
   Tristan Cazenave Paris Dauphine University - PSL
Paris, France
Tristan.Cazenave@dauphine.psl.eu
Abstract

In imperfect information games (e.g. Bridge, Skat, Poker), one of the fundamental considerations is to infer the missing information while at the same time avoiding the disclosure of private information. Disregarding the issue of protecting private information can lead to a highly exploitable performance. Yet, excessive attention to it leads to hesitations that are no longer consistent with our private information. In our work, we show that to improve performance, one must choose whether to use a player’s private information. We extend our work by proposing a new belief distribution depending on the amount of private and public information desired. We empirically demonstrate an increase in performance and show that, to improve performance further, the new distribution should be used according to the position in the game. Our experiments have been done on multiple benchmarks and with multiple determinization-based algorithms (PIMC and IS-MCTS).

Index Terms:
Imperfect Information Games, Search Algorithms, Belief Distributions

I Introduction

Search in artificial intelligence has been constantly evolving over the last few decades, and game-oriented research has always been a cornerstone of this success. Chess, Go [1, 2, 3], Poker [4], Skat, Contract Bridge or Dota [5] are among the most famous ones.

Perfect information games (Chess, Go), where all information is available to each player, have been the most studied, and many algorithms have reached a level far beyond that of professional human players. In contrast, Imperfect Information Games (IIGs) (Poker, Skat, Bridge), where some information is hidden, have been less studied, and only a few algorithms are capable of beating professional human players [6, 4].

In IIGs, the complexity is heightened by the missing information: one must try to infer the hidden information of the opponents and, at the same time, be careful not to reveal one's own private information to them. Among the methods used in IIGs, determinization-based algorithms, in which the hidden information is fixed according to a belief distribution, such as Perfect Information Monte Carlo (PIMC) [7], Recursive PIMC [8], Information Set MCTS [9] or AlphaMu [10], achieve state-of-the-art performance in many trick-taking card games (Contract Bridge, Skat).

In the work cited above, the determinization operates by sampling the hidden information according to the private information of a given player, i.e. what has happened since the beginning, from the point of view of a given agent. However, by doing so, one can indirectly reveal private information to opponents, which can lead to a highly exploitable performance.

Recently, the concept of public knowledge [11], in which a distinction is made between observations accessible to everyone and those accessible individually, has emerged in IIGs. This concept has resulted in many breakthroughs thanks to the decomposition it enables, which made the calculations feasible [12, 13]. Despite this large benefit, there are limitations to its use, especially in the context of belief distributions: a belief built only from public knowledge completely discards what the acting player has observed privately, and one might wonder whether ignoring the private information is beneficial.

In this work, we analyze the impact of using one method rather than the other and present a new belief distribution, which is a mixture of the public and private belief distributions. We extend the study by analyzing different mixtures, depending on the position within the game. Our experiments are carried out on determinization-based algorithms, which use the belief distribution to resolve the uncertainty.

The paper is organized as follows: Section II presents notation and current determinization-based algorithms; Section III explains the different belief distributions used with their advantages and drawbacks, and presents our new belief distribution; Section IV empirically shows that using the new belief distribution allows us to improve past performance; and the last section summarizes our work and future work.

II Notation and Background

II-A Notation

We use the notation based on factored-observation stochastic games (FOSGs [11]). This formalism distinguishes between private and public observations.

A game G is composed of the following elements: the set of agents $\mathcal{N}=\{1,2,\dots,N\}$ and the set of possible world states $\mathcal{W}$. In each world state $w\in\mathcal{W}$, the acting player $i$ chooses an action $a\in\mathcal{A}(w)$, where $\mathcal{A}(w)$ denotes the legal actions at $w$. After an action $a$, we reach the next world state $w'$ from the probability distribution of playing $a$ in $w$.

During the transition from $w$ to $w'$ by playing $a$, two observations are received: a public observation and a private observation. The public observation is the observation visible to every player, denoted $o_{pub}\in\mathcal{O}_{pub}(w,a,w')$, where $\mathcal{O}_{pub}(w,a,w')$ refers to all the possible public observations. The private observation is the observation visible to a specific player $i$, denoted $o_{priv(i)}\in\mathcal{O}_{priv(i)}(w,a,w')$, where $\mathcal{O}_{priv(i)}(w,a,w')$ refers to all the possible private observations.

A history is a finite sequence of legal actions and world states, denoted $h^t=(w^0,a^0,w^1,a^1,\dots,w^t)$. To describe the point of view of an agent $i$ on a history $h$, we introduce an infostate $s_i(h)$. An infostate for agent $i$ is a sequence of the agent’s observations and actions $s^t_i=(o_i^0,a^0_i,o_i^1,a^1_i,\dots,o_i^t)$, where $o_i^k=(o_{pub}^k,o^k_{priv(i)})$. A public infostate is a sequence of public observations $s^t_{pub}=(o_{pub}^0,o_{pub}^1,\dots,o_{pub}^t)$.
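As an illustration of these definitions, the following Python sketch builds $s_i^t$ and $s_{pub}^t$ from per-step observations. The `Observation` container and the helper names are hypothetical and chosen only for this example; the FOSG formalism does not prescribe a data layout.

```python
from typing import Any, List, NamedTuple, Tuple


class Observation(NamedTuple):
    """o_i^k = (o_pub^k, o_priv(i)^k) for one player i at step k."""
    public: Any
    private: Any


def infostate(observations: List[Observation], actions: List[Any]) -> Tuple[Any, ...]:
    """Interleave observations and own actions: (o^0, a^0, o^1, ..., o^t)."""
    if len(observations) != len(actions) + 1:
        raise ValueError("expected t+1 observations and t actions")
    sequence: List[Any] = []
    for obs, act in zip(observations, actions):
        sequence.extend([obs, act])
    sequence.append(observations[-1])
    return tuple(sequence)


def public_infostate(observations: List[Observation]) -> Tuple[Any, ...]:
    """Keep only the public part of each observation: (o_pub^0, ..., o_pub^t)."""
    return tuple(obs.public for obs in observations)


# Toy trajectory (made up): the player privately sees a die roll,
# plays a bid, and the bid is then publicly observed.
obs = [
    Observation(public=None, private=1),
    Observation(public="bid: one", private=None),
]
s_i = infostate(obs, actions=["bid: one"])
s_pub = public_infostate(obs)
```

Two players with different private observations but the same public observations would obtain different `s_i` and the same `s_pub`, which is the distinction the rest of the paper relies on.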

Determinization refers to sampling a world state according to a belief distribution over the possible world states. Determinizing the belief distribution is not new, and similar concepts exist in other formalisms, such as the belief state in Partially Observable Markov Decision Processes (POMDPs) [14] or the occupancy state in Decentralised POMDPs [15].

II-B Determinization-based algorithms

Each determinization-based algorithm has its own characteristics. Nevertheless, they share some common features such as (i) sampling a world state according to a belief distribution over the possible world states; and (ii) using a perfect information algorithm for estimating the value of the sampled world state.

The algorithms are simple and, in practice, they achieve great results, mainly due to the use of perfect information algorithms (AlphaBeta [16], MCTS [17] or Value Network) that are fast and efficient.

In the following, we present two determinization-based algorithms that serve as baselines and are used later in our experiments.

II-B1 PIMC

Perfect Information Monte Carlo (PIMC) is the state of the art for many IIGs, such as Contract Bridge and Skat.

The algorithm is defined in Algorithm 1 and works as follows: (i) it samples a world state by using the player’s private information; (ii) it plays each action of the sampled world state; (iii) it estimates the reward of the new world state by using an algorithm available in the perfect information setting; (iv) it repeats until the budget is exhausted; and (v) it selects the action that produces the best result on average. In practice, PIMC often uses AlphaBeta as the perfect information evaluator.

Function PIMC(s):
    for m ∈ Moves(s) do
        score[m] ← 0
    end for
    while budget do
        w ← InfoSampling(s)
        for m ∈ Moves(w) do
            score[m] ← score[m] + PerfectAlgo(w, m)
        end for
    end while
    return Best action on average
Algorithm 1 PIMC
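The loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: `sample_world`, `legal_moves` and `evaluate` are hypothetical callables standing in for InfoSampling, Moves and PerfectAlgo, the budget is assumed to be a number of sampled worlds, and the toy payoffs are made up.

```python
import random
from collections import defaultdict


def pimc(infostate, sample_world, legal_moves, evaluate, budget=100):
    """Pick the move with the best average value over sampled world states.

    sample_world, legal_moves and evaluate are hypothetical stand-ins for
    InfoSampling, Moves and PerfectAlgo. budget must be at least 1.
    """
    total = defaultdict(float)  # summed value per move
    count = defaultdict(int)    # number of worlds in which the move was legal
    for _ in range(budget):
        world = sample_world(infostate)
        for move in legal_moves(world):
            total[move] += evaluate(world, move)
            count[move] += 1
    return max(total, key=lambda m: total[m] / count[m])


# Toy usage: the player holds a die showing 1 and samples the opponent's
# die uniformly, i.e. from the private belief distribution.
rng = random.Random(0)


def sample_world(my_die):
    return (my_die, rng.choice([1, 2]))


def legal_moves(world):
    return ["safe", "risky"]


def evaluate(world, move):
    # Made-up perfect-information values for illustration only.
    if move == "safe":
        return 0.5
    return 1.0 if world[1] == 1 else -1.0


best = pimc(1, sample_world, legal_moves, evaluate, budget=200)
```

Passing the sampler as an argument keeps the search loop independent of the belief distribution, so a different distribution can be plugged in without touching the loop.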

II-B2 IS-MCTS

Information Set Monte Carlo Tree Search [9] uses Monte Carlo Tree Search (MCTS) [17] according to a sampled world state.

MCTS is a state-of-the-art tree search algorithm in perfect information games. It works in four steps: (i) selection, which selects a path of nodes based on an exploitation policy; (ii) expansion, which expands the tree by adding a new child node; (iii) playout, which estimates the value of the child node by using an exploration policy; and (iv) backpropagation, which backpropagates the result obtained from the playout through the nodes chosen during the selection phase. In practice, MCTS often uses random playouts as the perfect information evaluator, and UCB1 in the selection phase.
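For reference, the UCB1 rule mentioned above scores each child by its average value plus an exploration bonus that shrinks as the child is visited more often. The sketch below is a generic illustration; the exploration constant $c=\sqrt{2}$ is a common default assumed here, and the `stats` layout is hypothetical.

```python
import math


def ucb1(mean_value, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average value plus an exploration bonus."""
    if child_visits == 0:
        return math.inf  # unvisited children are tried first
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def select_action(stats, c=math.sqrt(2)):
    """Return the action maximising UCB1.

    stats maps each action to (total_value, visits); this layout is an
    assumption made for the example.
    """
    parent_visits = sum(visits for _, visits in stats.values())

    def score(action):
        total, visits = stats[action]
        mean = total / visits if visits else 0.0
        return ucb1(mean, visits, parent_visits, c)

    return max(stats, key=score)
```

With equal visit counts the rule reduces to picking the highest average value; a rarely visited action can still be selected because its bonus is larger.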

IS-MCTS works by using MCTS according to a sampled world state, i.e. the selection and playout are done on the sampled world state.

Function IS-MCTS(s):
    while budget do
        w ← InfoSampling(s)
        MCTS conditioned on w
    end while
    return Normalise visit count for each action
Function MCTS(w):
    u ← Selection(w)
    u ← Expansion(u, w)
    u ← Simulation(u, w)
    Backpropagation(u)
Algorithm 2 IS-MCTS

III Belief Distributions

To present the different belief distributions, with their advantages and drawbacks, we use the following example throughout the section to facilitate understanding.

The example is based on the famous game ‘Liar’s Dice’ (an explanation of the game is given in Subsection IV-A2). In our case, there are two players, each with $1$ die of $2$ sides. We write $\{P_1:X;P_2:Y\}$ when player $1$ has $X$ and player $2$ has $Y$. There are four possible world states: $w_1=\{P_1:1;P_2:1\}$, $w_2=\{P_1:1;P_2:2\}$, $w_3=\{P_1:2;P_2:2\}$ and $w_4=\{P_1:2;P_2:1\}$.

For each player, there are two possible infostates and one public infostate $s_{pub}=\{o^1_{pub}=\emptyset,o^2_{pub}=\emptyset\}$ (no observation). For player $1$, we have $s_1=\{o^1_{priv(1)}=1,o^2_{priv(1)}=\emptyset\}$ or $s'_1=\{o^1_{priv(1)}=2,o^2_{priv(1)}=\emptyset\}$ (i.e. player $1$ observes their own die but not the die rolled by the other player), and for player $2$, we have $s_2=\{o^1_{priv(2)}=\emptyset,o^2_{priv(2)}=1\}$ or $s'_2=\{o^1_{priv(2)}=\emptyset,o^2_{priv(2)}=2\}$ (i.e. player $2$ observes their own die but not the die rolled by the other player).

In the following, we suppose that the world state of this example is $w_2$. Therefore, for player $1$, the infostate is $s_1$ with two possible world states ($\{w_1,w_2\}$) and for player $2$, the infostate is $s'_2$ with two possible world states ($\{w_2;w_3\}$). Fig. 1 represents the different belief distributions presented throughout the section.

Figure 1: Multiple belief distributions for the game Liar’s Dice with $1$ die of $2$ sides each. There are four possible world states: $w_1$, $w_2$, $w_3$ and $w_4$. The Public-Private belief uses the mixture distribution with $\lambda=0.5$.

III-A Private Distribution

As previously introduced, current determinization-based algorithms work by sampling world states according to the player’s private information distribution, i.e. knowing a player’s private and public observation, we sample a world state.

Let $S_j(s_i)$ be the set of possible infostates for player $j$ conditioned on the infostate $s_i$ of player $i$. In our example, the possible infostates for player $2$ when player $1$ has $s_1$ are $S_2(s_1)=\{s_2;s'_2\}$, i.e. player $1$ holding a $1$ does not prevent player $2$ from holding a 1 or a 2. Depending on the game, $S_j(s_i)$ can be restrictive, e.g. in trick-taking card games, if player $i$ holds the ‘Queen of Hearts’, no opponent can hold it.

Definition (Private Belief Distribution).

Let $S_j(s_i)$ be the set of possible infostates for player $j$ conditioned on the infostate $s_i$. Let $\Delta S_j(s_i)$ denote the probability distribution over the elements of $S_j(s_i)$. We define the private belief distribution as $\Delta_i(s_i)=(\Delta S_1(s_i),\dots,\Delta S_i(s_i),\dots,\Delta S_N(s_i))=(\Delta S_1(s_i),\dots,s_i,\dots,\Delta S_N(s_i))$.

In Fig. 1, using Player 1’s private belief state provides the following belief distribution: $\Delta_1(s_1)=(\{s_1:100\%\},\{s_2:50\%;s'_2:50\%\})$, which results in two equiprobable world states ($w_1$, $w_2$).
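The private belief of the running example can be reproduced with a few lines of Python. This is only a sketch: it assumes fair dice and no informative public observation so far, and the `WORLDS` table and `private_belief` helper are names introduced for this illustration.

```python
from fractions import Fraction

# World states as (die of player 1, die of player 2), named as in the text.
WORLDS = {"w1": (1, 1), "w2": (1, 2), "w3": (2, 2), "w4": (2, 1)}


def private_belief(player, die):
    """Uniform belief over the world states consistent with the player's own die.

    Assumes fair dice and an uninformative public infostate.
    """
    consistent = [w for w, dice in WORLDS.items() if dice[player - 1] == die]
    weight = Fraction(1, len(consistent))
    return {w: (weight if w in consistent else Fraction(0)) for w in WORLDS}


belief_p1 = private_belief(player=1, die=1)  # infostate s1
belief_p2 = private_belief(player=2, die=2)  # infostate s2'
```

Comparing `belief_p1` and `belief_p2` shows that each player assigns zero probability to a world state the other player still considers possible.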

When using the private distribution for determinization, the algorithm samples a world state ($w_1$ or $w_2$) consistent with the current player’s information ($s_1$) and, as state-of-the-art results in trick-taking games show, strong performance is obtained. Yet, by doing so, three problems arise.

(i) It is not consistent with the other player’s belief, e.g. if we use it with the first player, the algorithm samples $w_1$ or $w_2$ but never $w_3$, which is nevertheless a possible world state from the point of view of player $2$.

(ii) It is not able to mislead others. In our example, two actions are possible for the first player, ‘I have a one’ and ‘I have a two’. The action ‘I have a two’ is a lie; however, one may want to play it with the aim of deceiving the opponent. In our case, only $w_1$ or $w_2$ can be sampled and, in each of these worlds, the action ‘I have a two’ results in a defeat because the second player will say ‘This is a lie’. Therefore, lying is never an option, as it never succeeds.

(iii) It indirectly allows the opponents to infer our private information, e.g. after playing multiple matches, the second player understands that, if the first player plays ‘I have a two’, it is because the first player really has a two, since the first player cannot lie, and can therefore play to counter it.

Trying to infer missing information is one of the key components of IIG, and using the private belief distribution could result in a highly exploitable performance. To remove this problem, one can use public belief distribution, as presented in the next section.

III-B Public Distribution

Recently in IIG, many algorithms [12, 13] have been using the concept of public observation. This concept has resulted in many breakthroughs thanks to decomposition, which made the calculations feasible. One application of public observation is the creation of a public belief distribution over the world states possible according to the public observations observed so far.

Definition (Public Belief Distribution [13]).

Let $S_j(s_{pub})$ be the set of possible infostates for player $j$ conditioned on the public infostate $s_{pub}$. Let $\Delta S_j(s_{pub})$ denote the probability distribution over the elements of $S_j(s_{pub})$. We define the public belief distribution as $\Delta_{pub}(s_{pub})=(\Delta S_1(s_{pub}),\dots,\Delta S_N(s_{pub}))$.

In our example, using the public belief distribution from the point of view of player $1$ would result in the following belief distribution: $\Delta_{pub}=(\{s_1:50\%;s'_1:50\%\},\{s_2:50\%;s'_2:50\%\})$. In other words, all world states are equiprobable, because the public infostate does not contain any information.
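To see how the per-player components of $\Delta_{pub}$ translate into probabilities over world states, the following sketch multiplies the marginals. It assumes the two dice are independent, which holds in this example but would not hold in card games where holdings exclude one another; the variable and function names are introduced for illustration.

```python
from fractions import Fraction
from itertools import product

# Per-player marginals of the public belief (no public observation yet).
public_belief = (
    {1: Fraction(1, 2), 2: Fraction(1, 2)},  # player 1: s1 (die 1) or s1' (die 2)
    {1: Fraction(1, 2), 2: Fraction(1, 2)},  # player 2: s2 (die 1) or s2' (die 2)
)


def world_distribution(belief):
    """Joint probability of each world state, assuming independent dice."""
    return {
        (d1, d2): belief[0][d1] * belief[1][d2]
        for d1, d2 in product(belief[0], belief[1])
    }


worlds = world_distribution(public_belief)
```

Each of the four world states receives the same probability, matching the statement that the empty public infostate carries no information.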

Using a public belief distribution instead of a private belief distribution removes the problems defined in Section III-A.

(i) It is consistent with the other player’s doubts, e.g. it samples the world $w_3$, which is a possible world state for the second player.

(ii) It is capable of misleading others, e.g. when sampling $w_3$ or $w_4$, the action ‘I have a two’ does not result in a defeat for the first player, which allows the first player to play the action ‘I have a two’.

(iii) It no longer reveals private information, i.e. as the reasoning is no longer biased toward the private information, opponents cannot use it against the player.

Nevertheless, using public distribution has a significant drawback as it does not consider a player’s private information, and one might wonder whether it is useful to not use private information. It is straightforward to consider that the extent to which private information should be kept hidden depends on the game being played and, in certain games, it is not necessary to keep the information concealed.

In addition, using the public distribution increases the number of possible world states (e.g. with the private distribution we have two possible world states, whereas with the public distribution we have four), which can be intractable in large games.

III-C Mixture between public and private distribution

To solve both of the problems defined in Section III-A and in Section III-B, we propose to use a mixture of private and public distribution.

Definition (Mixture Belief Distribution).

Let $s_{pub}$ be the public infostate associated with the infostate $s_i$. We define the mixture belief distribution as $\Delta_\lambda(s_i)=(1-\lambda)\Delta_i(s_i)+\lambda\Delta_{pub}(s_{pub})$.

The mixture belief distribution allows us to be consistent with the problem encountered. When care must be taken not to reveal information, one can increase $\lambda$. In contrast, when it is not appropriate to withhold information, one can decrease $\lambda$. The private belief distribution is obtained when $\lambda = 0$ and the public belief distribution is obtained when $\lambda = 1$.

In our example, when using the mixture with $\lambda = 0.5$ for player $1$, we obtain the following belief distribution: $\Delta_{0.5}(s_1) = (\{s_1: 75\%; s'_1: 25\%\}, \{s_2: 50\%; s'_2: 50\%\})$. $w_1$ and $w_2$ are more probable ($37.5\%$ each) than $w_3$ and $w_4$ ($12.5\%$ each). Nevertheless, their probabilities are not zero, which makes it consistent with the other player’s belief.
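The arithmetic of this example can be checked with a minimal Python sketch. It assumes that the private belief of player 1 is uniform over $w_1$ and $w_2$ and that the public belief is uniform over the four world states, which reproduces the percentages quoted above. The function name `mixture_belief` and the dictionary encoding are illustrative choices, not part of the original implementation.

```python
def mixture_belief(private, public, lam):
    """Return (1 - lam) * private + lam * public over world states."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    worlds = set(private) | set(public)
    return {
        w: (1.0 - lam) * private.get(w, 0.0) + lam * public.get(w, 0.0)
        for w in worlds
    }


# Assumed beliefs for the running example: the private belief only
# supports w1 and w2, the public belief supports all four world states.
private = {"w1": 0.5, "w2": 0.5}
public = {"w1": 0.25, "w2": 0.25, "w3": 0.25, "w4": 0.25}

mix = mixture_belief(private, public, lam=0.5)
# mix["w1"] and mix["w2"] are 0.375; mix["w3"] and mix["w4"] are 0.125
```

Setting `lam=0.0` returns the private belief and `lam=1.0` returns the public belief, matching the two limit cases of the definition.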

It is possible to expand this concept by considering that $\lambda$ depends on the progress of the game. As an example, in trick-taking card games, it may be important to keep the private information hidden at the beginning of the game (so as not to reveal information) but, as the game progresses, the focus shifts to accumulating points before the end, where the importance of concealing this information may decrease.
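One simple way to realize a game-dependent $\lambda$ is a per-step schedule. The sketch below is only an illustration under our own assumptions: the helper `lambda_for_step`, the list encoding of the schedule, and the example values are hypothetical and are not taken from the original implementation.

```python
def lambda_for_step(schedule, step):
    """Pick the mixture weight for a decision step.

    `schedule` lists one lambda per decision step; once the list is
    exhausted, the last value is reused (an assumption of this sketch).
    """
    if not schedule:
        raise ValueError("schedule must contain at least one value")
    if step < 0:
        raise ValueError("step must be non-negative")
    return schedule[min(step, len(schedule) - 1)]


# Hypothetical schedule: conceal more at the first decision, less afterwards.
schedule = [0.6, 0.2]
first = lambda_for_step(schedule, 0)
later = lambda_for_step(schedule, 3)
```

A schedule with a single entry recovers the fixed-$\lambda$ setting.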

III-D Adaptation of algorithms

PIMC and IS-MCTS have been created with the private belief distribution in mind. Therefore, it is necessary to modify the algorithms to use the public or a mixture belief distribution. Instead of starting at an infostate $s_i$, the algorithms must be adapted to start at $s_{pub}$, where $s_{pub}$ is the public infostate associated with $s_i$.

III-D1 PIMC

In the case of PIMC, one must use a distinct PIMC for each possible infostate ($S_i(s_{pub})$), and combine the final result by aggregating the scores using the distribution of possible infostates ($\Delta S_i(s_{pub})$).

In our example, when using the mixture belief distribution, two infostates are possible for the first player ($s_1$ and $s'_1$). If $w_2$ or $w_1$ are sampled, the algorithm used is the one defined for $s_1$; on the other hand, if $w_3$ or $w_4$ are sampled, the algorithm used is the one defined for $s'_1$. In the end, if $s_1$ has been visited $75\%$ of the time (corresponding to the mixture belief distribution with $\lambda = 0.5$), the action chosen in $s_1$ will have more impact than the action chosen in $s'_1$.
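The aggregation step can be sketched as a belief-weighted sum of per-infostate action scores. This is one possible reading of the aggregation described above, not the original implementation: the function `aggregate_scores`, the score values, and the infostate labels are hypothetical.

```python
def aggregate_scores(scores_by_infostate, infostate_weights):
    """Weight each infostate's action scores by its belief probability and sum them."""
    totals = {}
    for infostate, scores in scores_by_infostate.items():
        weight = infostate_weights.get(infostate, 0.0)
        for action, score in scores.items():
            totals[action] = totals.get(action, 0.0) + weight * score
    return totals


# Hypothetical scores, as if returned by one PIMC run per infostate.
scores = {
    "s1": {"a1": 1.0, "a2": 0.0},
    "s1_prime": {"a1": 0.0, "a2": 1.0},
}
# Infostate weights of the mixture with lambda = 0.5 in the running example.
weights = {"s1": 0.75, "s1_prime": 0.25}

totals = aggregate_scores(scores, weights)
best_action = max(totals, key=totals.get)
```

With these illustrative numbers, the action preferred in `s1` dominates because `s1` carries three times the weight of `s1_prime`.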

III-D2 IS-MCTS

Figure 2: Example of the tree constructed by IS-MCTS: (a) constructed with the mixture belief distribution, with nodes $s_1$, $s'_1$, $s_2$, $s'_2$, $s''_2$, $s'''_2$; (b) constructed with the private belief distribution, with nodes $s_1$, $s_2$, $s'_2$, $s''_2$, $s'''_2$. The first player is acting in the red square, the second player is acting in the green diamond and the blue circle refers to the chance node.

With IS-MCTS, a single algorithm is feasible, as IS-MCTS creates a tree where the nodes represent infostates, and an infostate for player $j$ may come from several infostates of player $i$.

An example is provided in Fig. 2. For the first player, two infostates are possible ($s_1$ and $s'_1$) and four infostates are possible for the second player after the first player’s action ($s_2 = \{o^1_{priv(2)} = \emptyset, o^2_{priv(2)} = 1, o^3_{priv(2)} = a_1\}$, $s'_2 = \{o^1_{priv(2)} = \emptyset, o^2_{priv(2)} = 1, o^3_{priv(2)} = a_2\}$, $s''_2 = \{o^1_{priv(2)} = \emptyset, o^2_{priv(2)} = 2, o^3_{priv(2)} = a_1\}$ or $s'''_2 = \{o^1_{priv(2)} = \emptyset, o^2_{priv(2)} = 2, o^3_{priv(2)} = a_2\}$).

For the second player, all infostates are achievable through any infostate of the first player. For example, $s_2$ is achievable when sampling $w_1$ (from $s_1$) or when sampling $w_2$ (from $s'_1$) and playing the action $a_1$.
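The sharing of second-player nodes can be illustrated with a small sketch. It assumes, purely for illustration, that a world state is a pair (first player's private card, second player's private card) and that the second player's infostate consists of its own card and the action played by the first player. The world encoding and the action names are hypothetical.

```python
from collections import defaultdict

# Hypothetical worlds: (player-1 private card, player-2 private card).
worlds = [(1, 1), (1, 2), (2, 1), (2, 2)]
actions = ["a1", "a2"]

# Map each player-2 infostate to the player-1 infostates that can lead to it.
parents = defaultdict(set)
for p1_card, p2_card in worlds:
    for action in actions:
        # Player 2 observes only its own card and the action played.
        p2_infostate = (p2_card, action)
        parents[p2_infostate].add(p1_card)
```

Under these assumptions, each of the four second-player infostates has both first-player infostates as parents, so a single tree keyed by infostate covers every sampled world.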

IV Experimentation

IV-A Benchmarks

For our experiments, the following benchmarks are tested: ‘Liar’s Dice’ (LD), ‘Card Game’ (CG), and ‘Leduc Poker’ (LP). Each of them is described below.

IV-A1 Card game

For the purpose of the experimentation, we use a smaller version of classic trick-taking card games. The game is played with two players and $10/20$ cards known by all, of which $2/6$ are hidden and the rest is distributed to each player.

The playing phase is decomposed into tricks; the player starting a trick is the one who won the previous trick. The starting player of a trick can play any card in his hand, but the other players must follow the suit of the first player. If they cannot, they can play any card they want, but without the possibility of winning the trick. The winner of the trick is the one with the highest-ranking card. At the end of the game, the points of each player are counted (plain version of trick-taking card game): the count is the number of tricks won. A player wins if he has at least half of the points.
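The trick rules above can be sketched as follows. The sketch assumes cards are encoded as hypothetical `(suit, rank)` tuples and that there is no trump suit (the description mentions none). The helper names are ours.

```python
def legal_cards(hand, led_suit):
    """Cards a follower may play: the led suit if possible, otherwise any card."""
    same_suit = [card for card in hand if card[0] == led_suit]
    return same_suit if same_suit else list(hand)


def trick_winner(plays):
    """Index of the winning play; only cards of the led suit can win."""
    led_suit = plays[0][0]
    candidates = [i for i, (suit, _) in enumerate(plays) if suit == led_suit]
    return max(candidates, key=lambda i: plays[i][1])


# The leader plays a low heart; the follower has no heart and discards a
# high spade, which cannot win the trick.
winner = trick_winner([("H", 3), ("S", 9)])
```

Here `winner` is the index of the leader, since an off-suit card never wins regardless of its rank.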

IV-A2 Liar’s Dice

Liar’s dice is a dice game played with two or more players, where each player possesses $N$ dice of $K$ sides and in which a player must deceive and be able to detect an opponent’s deception.

In the beginning, each player rolls his dice and observes the values. After that, players take turns guessing the number of dice of a particular type held by everyone. The game continues until a player accuses another of lying. If the accusing player is right, he wins the game; conversely, if the challenged player did not lie, the challenged player wins.

During the game, a player can not bid less than previously, i.e. he must at least bid more dice than the previous player’s bid, or the same number of dice but with a higher value. Lastly, the highest face is a wild card, i.e. the value can be used to count for any other face.
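The bidding constraint and the wild face can be sketched as follows, assuming a bid is encoded as a hypothetical `(quantity, face)` pair and faces are numbered from 1 to the number of sides. The function names are illustrative.

```python
def is_valid_raise(previous_bid, new_bid):
    """A bid is (quantity, face): raise the quantity, or keep it and raise the face."""
    prev_quantity, prev_face = previous_bid
    quantity, face = new_bid
    return quantity > prev_quantity or (
        quantity == prev_quantity and face > prev_face
    )


def count_matching(dice, face, num_sides):
    """Count dice showing `face`, treating the highest face as a wild card."""
    return sum(1 for value in dice if value == face or value == num_sides)


# With six-sided dice, the two sixes count as threes as well.
threes = count_matching([1, 6, 3, 6], face=3, num_sides=6)
```

In this example, `threes` counts one natural three plus two wild sixes.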

IV-A3 Leduc Poker

Leduc Poker, as described in [18], is a variation of poker that uses a deck with only two suits, each containing three cards.

The game consists of two rounds. In the first round, each player is dealt a single private card. In the second round, a single board card is revealed. The maximum number of bets allowed is two, with the first round allowing raises of $2$ and the second round allowing raises of $4$. Both players begin the first round with $1$ already in the pot.

IV-B Experimentation

In our experiments, our objective is (i) to observe the extent to which an algorithm $X$ reveals information according to the mixture belief distribution; (ii) to analyze how the mixture belief distribution impacts the performance against an opponent that uses the revealed information; and (iii) to analyze how the mixture belief distribution impacts the performance against an opponent that does not use the revealed information.

Our code is based on OpenSpiel [19]. This is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

PIMC and IS-MCTS are used in their basic versions, i.e. PIMC uses AlphaBeta and IS-MCTS uses random rollouts as the perfect information evaluator and an exploration constant of $0.7$. For both, $1000$ world states are sampled.

To achieve a stable policy (as PIMC and IS-MCTS are online algorithms), we run the algorithm multiple times for every infostate until the policy obtained has less than $1\%$ of variation.

The experiments were conducted according to the player’s playing position (each position reveals more or less information). In the following part, the experiments are carried out for the first player and in the appendix for the second player.

IV-B1 How much information is revealed

Figure 3: Average TSSR for IS-MCTS and PIMC on multiple benchmarks according to $\lambda$ of the mixture distribution. Panels: (a) Liar’s dice with 2 dice; (b) Liar’s dice with 3 dice; (c) Leduc poker; (d) Card Game with 10 cards. Each panel plots the TSSR value against the Lambda value for PIMC and IS-MCTS.

To analyze the impact of the revealed information according to the distribution used, we use the True State Sampling Ratio (TSSR) [20]. TSSR measures how much more likely it is for the opponent to guess the current world state when using an algorithm $X$ rather than using a uniform function.

The formula is $TSSR(w) = \eta(w \mid s_i) \cdot |S_i(s_i)|$, where $s_i$ is the infostate corresponding to $w$ and $\eta(w \mid s_i)$ is the probability that the true state is guessed given the information set $s_i$. The closer the result is to $1$, the less likely it is to know the real world state. Fig. 3 presents the TSSR value obtained according to $\lambda$ of the mixture distribution.
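As a small numerical illustration of the formula, the sketch below multiplies the probability assigned to the true state by the number of candidate states. The function name `tssr` and the example numbers are assumptions made for illustration and do not come from the experiments.

```python
def tssr(prob_true_state, num_candidates):
    """TSSR(w) = eta(w | s_i) * |S_i(s_i)|; 1.0 matches a uniform guess."""
    if not 0.0 <= prob_true_state <= 1.0:
        raise ValueError("prob_true_state must be a probability")
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    return prob_true_state * num_candidates


# With four candidate states, a uniform guess gives a ratio of 1.0.
uniform_ratio = tssr(0.25, 4)
# Guessing the true state half of the time doubles the ratio.
informed_ratio = tssr(0.5, 4)
```

A ratio of $1$ therefore means that observing the play gives the opponent no advantage over a uniform guess.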

As expected, playing closer to the public belief distribution greatly reduces the probability of knowing the real world state. In ‘Liar’s Dice’ with $2$ dice with PIMC, it is up to $10$-fold more likely to guess the real world state when using the private instead of the public belief distribution.

In terms of information revealed, we observe that PIMC reveals more information than IS-MCTS in every benchmark. In ‘Leduc Poker’, it is up to $2.2$ times more likely to deduce the true state with PIMC at $\lambda = 0.0$ whereas, with IS-MCTS, it is ‘only’ $1.3$ times more likely to deduce the true state.

In addition, ‘Liar’s Dice’ is the game that reveals the most information: guessing the true state is up to $10$ times more likely than random, whereas in ‘Leduc Poker’ or ‘Card Game’, it is only up to $2$ times more likely than random.

For the following experiments, it is expected to observe $\lambda$ closer to $1$ for PIMC in ‘Liar’s Dice’, as it reveals more information, and therefore, could be exploited by the opponent.

IV-B2 How does the mixture impact the performance

Table IV-B2: Expected utility against the best responder when playing at the first player position. Columns give the value of $\lambda$.

| Algo | Game | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PIMC | LD 2D | 0.300 | 0.298 | 0.297 | 0.292 | 0.294 | 0.288 | 0.281 | 0.290 | 0.336 | 0.382 | 0.382 |
| PIMC | LD 3D | 0.313 | 0.276 | 0.265 | 0.269 | 0.235 | 0.283 | 0.324 | 0.356 | 0.359 | 0.393 | 0.458 |
| PIMC | LP | 0.622 | 0.616 | 0.660 | 0.767 | 0.797 | 1.481 | 1.626 | 1.480 | 1.532 | 1.599 | 1.611 |
| IS-MCTS | LD 2D | 0.513 | 0.512 | 0.517 | 0.528 | 0.539 | 0.547 | 0.552 | 0.554 | 0.555 | 0.562 | 0.562 |
| IS-MCTS | LP | 0.797 | 0.890 | 0.966 | 0.959 | 1.158 | 1.226 | 1.402 | 1.673 | 1.786 | 2.083 | 2.326 |

To measure how the mixture impacts the performance, we compute the expected utility against the best responder. The best responder is the worst possible enemy of all algorithms, i.e. it knows exactly the policy our algorithm will execute, and can therefore infer the true infostate and play the best action against it.

The results are available in Table IV-B2, where the values represent the expected utility of the best responder and must be minimized. The results obtained are exact utilities (without variation), as the best responder computes the best strategy knowing all the distributions in every infostate of the game.

We observe that the private belief distribution performs better than the public belief distribution for all benchmarks and algorithms (better results are obtained when $\lambda = 0.0$ than when $\lambda = 1.0$).

In ‘Liar’s Dice’ with PIMC, the best performances are obtained when $\lambda$ is close to $0.5$ (with $2$ dice, we obtain the best value when $\lambda = 0.6$). These results were expected, as PIMC reveals a lot of information in ‘Liar’s Dice’, which is then exploited by the best responder.

On the other hand, when the algorithm reveals less information (as observed in ‘Leduc Poker’ or with IS-MCTS), it is preferable to use the private belief distribution or a mixture very close to it, as the revealed information is not sufficient for the best responder to exploit.

IV-B3 Can the use of multiple mixture belief distributions throughout the game improve performance

In this experiment, we analyze the use of multiple mixtures throughout the game to improve performance. For this purpose, we compute multiple mixture distributions against the best responder.

Figure 4: Heatmap of the expected utility against the best response when playing at the first position. Panels: (a) Leduc Poker with PIMC; (b) Liar’s Dice 2 dice with PIMC; (c) Leduc Poker with IS-MCTS; (d) Liar’s Dice 2 dice with IS-MCTS.

Fig. 4 represents heatmaps for ‘Leduc Poker’ and ‘Liar’s Dice’ according to the position throughout the game when using PIMC (resp. IS-MCTS). For both games, we have a mixture distribution for the first action and another for the second action.

In all experiments, we observe that using multiple mixtures throughout the game has an impact on the performance. In ‘Leduc Poker’ for both algorithms, not using our private belief distribution is more punished in the second round than in the first round (e.g. $\{0.0, 1.0\}$ has a value of $1.17$ whereas $\{1.0, 0.0\}$ has a value of $1.88$ for IS-MCTS). On the other hand, for ‘Liar’s Dice’, we observe that the first round is the most important one.

In addition, we observe that playing multiple mixtures improves performance. In ‘Liar’s Dice’, the best value for IS-MCTS is obtained when we have $\{0.0, 0.6\}$ and for PIMC when we have $\{0.6, 0.2\}$.

IV-B4 How does the mixture impact the winning rate

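Table IV-B4: Winning rate (in %) when the opponent uses ‘PIMC’ when playing at the first player position. Columns give the value of $\lambda$.

| Our algorithm | Game | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PIMC | LD 3D | 48.6 | 50.4 | 47.9 | 47.4 | 44.6 | 42.6 | 39.9 | 37.5 | 36.1 | 28.9 | 27.8 |
| PIMC | LD 5D | 43.1 | 43.4 | 42.2 | 43.4 | 42.5 | 41.4 | 39.8 | 36.3 | 37.1 | 29.8 | 23.6 |
| PIMC | CG 10C | 48.2 | 47.9 | 47.7 | 47.7 | 47.6 | 47.4 | 46.7 | 46.0 | 45.5 | 39.8 | 31.6 |
| PIMC | CG 20C | 53.7 | 53.8 | 54.2 | 54.5 | 53.9 | 53.2 | 52.8 | 52.0 | 47.4 | 36.3 | 23.5 |
| IS-MCTS | LD 3D | 23.7 | 23.7 | 24.7 | 27.0 | 23.1 | 23.1 | 21.7 | 20.0 | 19.3 | 15.4 | 16.4 |
| IS-MCTS | LD 5D | 22.0 | 20.9 | 21.9 | 22.2 | 21.9 | 20.8 | 21.6 | 21.0 | 16.9 | 15.5 | 13.4 |
| IS-MCTS | CG 10C | 45.3 | 46.3 | 45.4 | 45.1 | 43.8 | 45.1 | 45.0 | 43.1 | 42.7 | 37.6 | 30.0 |
| IS-MCTS | CG 20C | 36.5 | 38.5 | 38.2 | 36.2 | 36.4 | 36.6 | 35.5 | 34.9 | 33.3 | 33.1 | 20.8 |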

As observed in the previous experiments, when using a $\lambda$ closer to the public belief distribution, we obtain a less relevant distribution of actions but with the advantage of disclosing less information. Therefore, when faced with an opponent who does not infer our private information, we expect to lose the benefit of using a $\lambda$ closer to the public belief distribution. Nevertheless, using a $\lambda$ closer to the public belief distribution not only reveals less information but also makes our play more consistent with the other player’s doubts.

To measure the impact of being more consistent with the other player’s doubts, we evaluate the performance against an algorithm that does not try to infer our private information. To do this, we compute the winning rate against ‘PIMC’ over $1000$ games, which results in a $3.1\%$ variation ($95\%$ confidence interval). The scores are available in Table IV-B4.
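The reported variation is consistent with the usual normal-approximation confidence interval for a proportion evaluated at the worst case of a $50\%$ winning rate. This reading is our assumption, as the computation is not detailed in the text, and the helper `ci_half_width` is illustrative.

```python
import math


def ci_half_width(num_games, win_rate=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval for a win rate."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / num_games)


# 1000 games, worst-case win rate of 0.5, z = 1.96 for a 95% interval.
half_width = ci_half_width(1000)
half_width_percent = round(100 * half_width, 1)
```

Under these assumptions, `half_width_percent` rounds to the $3.1\%$ quoted above.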

As before, we observe that it is preferable to use the private belief distribution instead of the public belief distribution. In ‘Liar’s Dice’ with $3$ dice with PIMC, we observe a drop of $20.8$ in the winning rate between the private and public belief distributions. In addition, we observe that in every benchmark tested and for both algorithms, using a $\lambda$ between $0.0$ and $0.5$ does not produce a drop in performance, but provides equivalent results.

These results are surprising, as we could have expected a drop in performance as the actions are less relevant to the current infostate (as we have sampled less often the true infostate). This implies that being more consistent with the doubts of the other players compensates for the loss of the player’s private information.

V Conclusion

In this paper, we study the strengths and weaknesses of the private and public probability distributions, paying particular attention to the revealed information and the impact of this revealed information on performance. Our study has been carried out on determinization-based algorithms and on multiple imperfect information games.

We complete the study by proposing a new probability distribution, a mixture of the two previous ones, which solves problems encountered by the other distributions. We show that using the mixture is beneficial to reduce the revealed information and improve performance. We also show that using multiple mixtures throughout the game improves performance. In addition, we observed that using the mixture against an opponent that does not use our revealed private information results in good performance, as we are more consistent with the other player’s doubts.

An avenue for improvement would be to extend the use of multiple mixtures throughout the game, for example by using a mixture at each public infostate instead of at a fixed time step, or by using a different $\lambda$ for the opponent player. Another area for improvement would be to extend the study to algorithms that do not use determinization, or even to algorithms without probability distributions, bearing in mind that one should not always use one’s private information at the risk of revealing information and, on the contrary, that one should not always use only public information, in order to remain consistent with one’s private knowledge. Lastly, it would be interesting to extend the results to a larger scale, either by using more games or by using larger games.

Appendix A Complementary experiments

The following experiments are identical to those in the primary paper, with the exception that they are conducted for the second player position.

Similar results are observed, i.e. PIMC reveals more information than IS-MCTS, the private belief distribution obtains better performance than the public belief distribution against the best responder, using multiple mixtures is useful to improve the performance, and playing the mixture is as good as playing the private belief distribution against an opponent that does not try to infer.

Yet, we also observe some differences, especially that less information is revealed when playing in the second position, which results in $\lambda$ closer to the private belief distribution against the best responder.

Table: Expected utility for the best responder against our algorithm playing as the second player. Columns give the value of $\lambda$.

| Algo | Game | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PIMC | LD 2D | 0.678 | 0.695 | 0.703 | 0.707 | 0.716 | 0.718 | 0.711 | 0.741 | 0.779 | 0.836 | 0.836 |
| PIMC | LP | 0.398 | 0.400 | 0.459 | 0.612 | 0.796 | 1.461 | 1.450 | 1.509 | 1.593 | 1.615 | 1.632 |
| IS-MCTS | LD 2D | 0.697 | 0.687 | 0.697 | 0.716 | 0.727 | 0.732 | 0.740 | 0.751 | 0.759 | 0.768 | 0.787 |
| IS-MCTS | LP | 0.784 | 0.784 | 0.898 | 0.800 | 1.017 | 1.078 | 1.186 | 1.324 | 1.561 | 1.728 | 2.002 |

Figure 5: Average TSSR according to the $\lambda$ value of the mixture distribution. Panels: (a) Liar’s dice with 2 dice; (b) Leduc poker. Each panel plots the TSSR value against the Lambda value for PIMC and IS-MCTS.

Figure 6: Heatmap of expected utility against the best response when playing at the second position. Panels: (a) Leduc poker with PIMC; (b) Liar’s dice 2 dice with PIMC; (c) Leduc poker with IS-MCTS; (d) Liar’s dice 2 dice with IS-MCTS.

Table: Winning rate (in %) when the opponent uses ‘PIMC’ when playing at the second player position. Columns give the value of $\lambda$.

| Our algorithm | Game | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PIMC | LD 3D | 51.9 | 53.0 | 51.3 | 49.8 | 49.7 | 51.3 | 48.7 | 48.6 | 46.9 | 46.1 | 42.7 |
| PIMC | LD 5D | 56.7 | 55.5 | 56.0 | 56.2 | 54.8 | 56.1 | 55.3 | 53.0 | 51.9 | 44.7 | 42.3 |
| IS-MCTS | LD 3D | 48.4 | 51.3 | 49.9 | 49.0 | 50.1 | 51.0 | 47.4 | 44.0 | 39.7 | 36.9 | 33.3 |
| IS-MCTS | LD 5D | 48.4 | 47.1 | 48.0 | 46.7 | 47.8 | 45.0 | 46.5 | 40.7 | 34.4 | 23.2 | 14.7 |

References

  • [1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–489, 2016.
  • [2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis, “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” ArXiv, vol. abs/1712.01815, 2017.
  • [3] ——, “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, pp. 1140 – 1144, 2018.
  • [4] N. Brown and T. Sandholm, “Superhuman AI for multiplayer poker,” Science, vol. 365, pp. 885 – 890, 2019.
  • [5] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. W. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang, “Dota 2 with Large Scale Deep Reinforcement Learning,” ArXiv, vol. abs/1912.06680, 2019.
  • [6] O. Tammelin, N. Burch, M. B. Johanson, and M. Bowling, “Solving Heads-Up Limit Texas Hold’em,” in IJCAI, 2015.
  • [7] J. R. Long, N. R. Sturtevant, M. Buro, and T. Furtak, “Understanding the Success of Perfect Information Monte Carlo Sampling in Game Tree Search,” in AAAI, 2010.
  • [8] T. Furtak and M. Buro, “Recursive Monte Carlo search for imperfect information games,” 2013 IEEE Conference on Computational Inteligence in Games (CIG), pp. 1–8, 2013.
  • [9] P. I. Cowling, E. J. Powley, and D. Whitehouse, “Information Set Monte Carlo Tree Search,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, pp. 120–143, 2012.
  • [10] T. Cazenave and V. Ventos, “The $\alpha\mu$ Search Algorithm for the Game of Bridge,” in Monte Carlo Search at IJCAI, ser. Communications in Computer and Information Science, 2021.
  • [11] V. Kovařík, M. Schmid, N. Burch, M. H. Bowling, and V. Lisý, “Rethinking Formal Models of Partially Observable Multiagent Decision Making,” Artif. Intell., vol. 303, p. 103645, 2022.
  • [12] M. Moravcík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. B. Johanson, and M. H. Bowling, “DeepStack: Expert-level artificial intelligence in heads-up no-limit poker,” Science, vol. 356, pp. 508 – 513, 2017.
  • [13] N. Brown, A. Bakhtin, A. Lerer, and Q. Gong, “Combining Deep Reinforcement Learning and Search for Imperfect-Information Games,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
  • [14] T. Smith, “Probabilistic planning for robotic exploration,” Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, July 2007.
  • [15] J. S. Dibangoye, C. Amato, O. Buffet, and F. Charpillet, “Optimally Solving Dec-POMDPs as Continuous-State MDPs,” Journal of Artificial Intelligence Research, vol. 55, pp. 443–497, Feb. 2016.
  • [16] D. E. Knuth and R. W. Moore, “An analysis of alpha-beta pruning,” Artificial Intelligence, vol. 6, no. 4, pp. 293–326, 1975.
  • [17] C. Browne, E. J. Powley, D. Whitehouse, S. M. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. P. Liebana, S. Samothrakis, and S. Colton, “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, pp. 1–43, 2012.
  • [18] F. Southey, M. P. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and C. Rayner, “Bayes’ bluff: Opponent modelling in poker,” arXiv preprint arXiv:1207.1411, 2012.
  • [19] M. Lanctot, E. Lockhart, J.-B. Lespiau, V. F. Zambaldi, S. Upadhyay, J. Pérolat, S. Srinivasan, F. Timbers, K. Tuyls, S. Omidshafiei, D. Hennes, D. Morrill, P. Muller, T. Ewalds, R. Faulkner, J. Kramár, B. D. Vylder, B. Saeta, J. Bradbury, D. Ding, S. Borgeaud, M. Lai, J. Schrittwieser, T. W. Anthony, E. Hughes, I. Danihelka, and J. Ryan-Davis, “OpenSpiel: A Framework for Reinforcement Learning in Games,” ArXiv, vol. abs/1908.09453, 2019.
  • [20] C. Solinas, D. Rebstock, and M. Buro, “Improving Search with Supervised Learning in Trick-Based Card Games,” ArXiv, vol. abs/1903.09604, 2019.