\RUNAUTHOR{Liu and Zhang}

\RUNTITLE{PCL for RMAB with General Observation Models}

\TITLE{PCL-Indexability and Whittle Index for Restless Bandits with General Observation Models}

\ARTICLEAUTHORS{%
\AUTHOR{Keqin Liu and Chengzhong Zhang}
\AFF{National Center for Applied Mathematics, Nanjing, China, 210093. \EMAIL{kqliu@nju.edu.cn}}
}

\ABSTRACT{In this paper, we consider a general observation model for restless multi-armed bandit problems. The player must act on a feedback mechanism that is error-prone due to resource constraints or environmental or intrinsic noise. By establishing a general probabilistic model for the dynamics of feedback/observation, we formulate the problem as a restless bandit with a countable belief state space starting from an arbitrary initial belief (a priori information). We apply the achievable region method with the partial conservation law (PCL) to the infinite-state problem and analyze its indexability and priority index (Whittle index). Finally, we propose an approximation process that transforms the problem into one to which the AG algorithm of Ni\~{n}o-Mora and Bertsimas for finite-state problems can be applied. Numerical experiments show that our algorithm achieves excellent performance.}

\KEYWORDS{restless bandit, POMDP, countable state space, partial conservation law, Whittle index}

\HISTORY{This paper was first submitted on July 3rd, 2024.}

1 Introduction

The multi-armed bandit (MAB) problem is a classic operations research problem in which a learner chooses among actions with uncertain/random rewards in order to maximize the expected return in the long run. The MAB problem is often associated with the important exploration-exploitation tradeoff and was initially proposed by Robbins (1952). Starting from the classic stochastic scheduling problem, the ongoing development of various MAB models has made it applicable to a wide range of practical fields, including clinical trials, recommendation systems, cognitive communications, and financial investments (Gittins 1979, Berry and Fristedt 1985, Press 2009, Farias and Madan 2011, Hoffman et al. 2011, Shen et al. 2015).

In the classic Bayesian MAB problem, the bandit machine has a total of $N$ arms and one player pulls an arm in each time slot. After the player selects and activates one of these arms, a random reward is accrued depending on the activated arm and its current state. In this process, all states are visible, and only the state of the activated arm changes according to a Markov chain. The goal of the MAB problem is to find a policy that maximizes the long-term cumulative reward. Gittins (1979) proved that the classic MAB problem can be solved optimally by an index policy, referred to as the Gittins index policy, under which the player only needs to activate the arm with the largest index at each moment. The index policy significantly reduces the complexity of the problem from exponential in the number of arms to linear. Subsequently, Whittle (1988) extended the MAB problem to the restless MAB (RMAB) model, in which $K$ arms can be activated at each moment, and the arms not chosen may also undergo state transitions over time. However, for the general RMAB problem with a finite state space, Papadimitriou and Tsitsiklis (1994) showed that finding the optimal policy is PSPACE-hard. Fortunately, Whittle (1988) generalized the Gittins index to the Whittle index, which provides a possible solution to RMAB by considering Lagrangian relaxation and duality. Accordingly, the Whittle index policy is optimal under a relaxed constraint that the time-average number of activated arms is $K$. For the original problem, the Whittle index policy has been proven to be asymptotically optimal in the per-arm sense as the number of arms goes to infinity under certain conditions (Weber and Weiss 1990, 1991). Nonetheless, not every RMAB is indexable, i.e., the Whittle index may not exist, and establishing indexability is itself a difficult problem.
Even if indexability is proven for a particular RMAB, analytical solutions of the Whittle index function can still be hard to obtain. For some important categories of RMAB models, Whittle indexability and its strong performance have been demonstrated in the literature, such as the dual-speed bandit problem and partially observable RMAB (Glazebrook et al. 2002, Ahmad et al. 2009, Liu and Zhao 2010, Gittins et al. 2011, Liu 2021, Liu et al. 2024).

In this paper, we extend the partially observable RMAB considered in Liu et al. (2010, 2024) to general observation models. The previous work only considers special classes of observation errors and noises within the framework of partially observable Markov decision processes (POMDP) (Zhao and Sadler 2007, Liu et al. 2010, 2024). General algorithms for POMDP, however, often suffer from the curse of dimensionality and become cumbersome when the value functions for dynamic programming are too complex (Smallwood and Sondik 1973, Sondik 1978). In order to deal with general observation models, we need an alternative methodology to design efficient algorithms for systems with large sets of parameters. In the interdisciplinary field of operations research and stochastic optimization, an analytical method called the ``achievable region'' method emerged in the 1990s. Interestingly, this method transforms a time-series optimization problem into linear programming (LP) problems that can be solved efficiently (Coffman and Mitrani 1980, Federgruen and Groenevelt 1988a, b, Gelenbe and Mitrani 2010). The main challenge in using this method lies in proving the feasibility of such a transformation (Bertsimas 1995). Bertsimas and Ni\~{n}o-Mora (1996) proposed the structure that performance measures must have for the achievable region to be a polyhedron called an extended polymatroid, and referred to it as the general conservation law (GCL). Based on GCL, they extended Klimov's algorithm to several stochastic scheduling problems, including the classic MAB problem (Klimov 1975, Thomas 1991). By relaxing the GCL restrictions, Ni\~{n}o-Mora (2001) proposed the partial conservation law (PCL) and specified conditions under which an RMAB satisfies PCL, which in turn lead to the numerical calculation of the Whittle index. Later on, Ni\~{n}o-Mora (2002) offered an economic explanation of PCL-indexability and its relation to Whittle indexability, where the Whittle index can be interpreted as the optimal marginal cost rate.
A more comprehensive summary of this work can be found in Ni\~{n}o-Mora (2007). However, all such work considers finite state spaces, except that Frostig and Weiss (1999) extended the GCL framework for the classic MAB to countable state spaces. In this paper, we extend the PCL framework to the RMAB with a general parametrization of the observation model and an infinite state space. The main challenge of this extension is the high-dimensional probability state space (belief states) with complex dynamics for transitions over time. By establishing the weak duality between LP formulations, we build the PCL framework for analyzing our RMAB and subsequently design an efficient algorithm for calculating the Whittle index when PCL-indexability is satisfied. Finally, we demonstrate the superior performance of our algorithm by numerical simulations.

2 Main Results

2.1 Model Formulation

In this section, we begin with the formulation of the RMAB problem with general observation models. Assume the system has $N$ arms and the state space of the $n$th arm is $\mathcal{M}_{n}$. Therefore the entire system state space is $\mathcal{M}=\prod_{n=1}^{N}\mathcal{M}_{n}$. At each moment, the actual state of each arm undergoes a state transition based on its own Markov probability transition matrix, and we select $K$ arms to activate. For those activated arms, we can observe their states (with errors) and accrue some reward dependent on the observed states and the actual states. For those arms that are not activated, we can neither observe their states nor obtain reward from them, and their states still transit over time. Let $S_{n}(t),O_{n}(t)\in\mathbb{Z}^{+}$ be respectively the true and observed states of arm $n$ in slot $t$ and $A(t)$ the set of activated arms in slot $t$. First consider a single-arm process (i.e., $N=1$ and $\mathcal{M}=\mathcal{M}_{n}$) and drop the subscript $n$ (sometimes the time index $t$ is also dropped if referring to the same time index) for convenience.
Suppose the state transition matrix, error matrix and reward matrix are $P=\{p_{ij}\}_{\mathcal{M}\times\mathcal{M}}$, $E=\{\varepsilon_{ij}\}_{\mathcal{M}\times\mathcal{M}}$ and $R=\{r_{ij}\}_{\mathcal{M}\times\mathcal{M}}$ respectively. For the error matrix, $\varepsilon_{ij}$ represents the probability that the observed state is $j$ when the true state is $i$, that is, $\varepsilon_{ij}=P(O=j|S=i)$, while $r_{ij}$ represents the reward obtained when the true state is $i$ and the observed state is $j$ (different observations may lead to different sub-actions and thus different subsequent rewards).
Denote by $\Omega_{a}=\{(\omega_{1},\cdots,\omega_{M})|\sum_{i=1}^{M}\omega_{i}=1,\omega_{i}\geq 0,1\leq i\leq M:=|\mathcal{M}|\}$ the belief space. For each $\boldsymbol{\omega}(t)=(\omega_{1}(t),\cdots,\omega_{M}(t))\in\Omega_{a}$, $\omega_{i}(t)$ represents the conditional probability that the true state is $i$ in slot $t$ (based on past observations).
If the current belief state is $(\omega_{1},\cdots,\omega_{M})$, the current expected reward is $\sum_{i=1}^{M}\omega_{i}(\sum_{j=1}^{M}\varepsilon_{ij}r_{ij})$. In addition, we may sometimes receive additional feedback related to the true state and the immediate reward obtained that helps us better estimate the actual arm state (e.g., ACK/NAK in a communication channel to indicate whether the data was successfully transmitted (Liu et al. 2024)). Assume that there are $L$ feedback states (positive integers, with $L\leq M^{2}$) in total that encompass all possible observations, and denote the feedback state at time $t$ by $F(t)$. Define $\rho_{ij}=P(F(t)=j|S(t)=i)$. By Bayes' rule, we have

\begin{equation}
P(S(t+1)=j|F(t)=i)=\frac{P(S(t+1)=j,F(t)=i)}{P(F(t)=i)}=\frac{\sum_{k=1}^{M}p_{kj}\rho_{ki}\omega_{k}}{\sum_{k=1}^{M}\rho_{ki}\omega_{k}}. \tag{1}
\end{equation}
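The second equality in (1) compresses one step that may be worth spelling out. Under the assumption (implicit in the model, and stated here only as a reading of it) that the feedback $F(t)$ and the next state $S(t+1)$ are conditionally independent given the current true state $S(t)$, the numerator expands by the law of total probability over $S(t)$, with all probabilities conditioned on the past observations:
\begin{align*}
P(S(t+1)=j,F(t)=i)&=\sum_{k=1}^{M}P(S(t+1)=j|S(t)=k)\,P(F(t)=i|S(t)=k)\,P(S(t)=k)\\
&=\sum_{k=1}^{M}p_{kj}\rho_{ki}\omega_{k},
\end{align*}
and the denominator follows in the same way as $P(F(t)=i)=\sum_{k=1}^{M}\rho_{ki}\omega_{k}$.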

Thus the rule of belief update is

\begin{equation}
\boldsymbol{\omega}(t+1)=\begin{cases}\left(\frac{\sum_{i=1}^{M}p_{i1}\rho_{ij}\omega_{i}}{\sum_{i=1}^{M}\rho_{ij}\omega_{i}},\cdots,\frac{\sum_{i=1}^{M}p_{iM}\rho_{ij}\omega_{i}}{\sum_{i=1}^{M}\rho_{ij}\omega_{i}}\right),&\text{if active and }F(t)=j\\ \boldsymbol{\omega}(t)P,&\text{if passive}\end{cases}. \tag{2}
\end{equation}
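As a numerical illustration of the belief update rule (2), the following is a minimal sketch in Python with NumPy. The function name `belief_update`, the 0-based feedback index, and the two-state matrices used in the test are assumptions made for this example only and do not come from the paper.

```python
import numpy as np


def belief_update(omega, P, rho=None, feedback=None):
    """One-step belief update in the spirit of rule (2).

    omega    : length-M belief vector (nonnegative, sums to 1)
    P        : M x M row-stochastic transition matrix
    rho      : M x L matrix with rho[i, j] = P(F = j | S = i)
    feedback : 0-based feedback index if the arm is active, None if passive
    """
    omega = np.asarray(omega, dtype=float)
    P = np.asarray(P, dtype=float)
    if feedback is None:
        # Passive arm: the belief simply propagates through the chain.
        return omega @ P
    # Active arm: weight each state by the likelihood of the feedback,
    # normalize (Bayes' rule), then propagate one step through P.
    weighted = np.asarray(rho, dtype=float)[:, feedback] * omega
    total = weighted.sum()
    if total <= 0.0:
        raise ValueError("feedback has zero probability under this belief")
    return (weighted @ P) / total
```

Passing the error matrix $E$ as `rho` and the observed state as `feedback` gives the special case in which only the observed state is available.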

In a simpler scenario where there is no additional feedback but only the observed states, the update rule becomes

\begin{equation}
\boldsymbol{\omega}(t+1)=\begin{cases}\left(\frac{\sum_{i=1}^{M}p_{i1}\varepsilon_{ij}\omega_{i}}{\sum_{i=1}^{M}\varepsilon_{ij}\omega_{i}},\cdots,\frac{\sum_{i=1}^{M}p_{iM}\varepsilon_{ij}\omega_{i}}{\sum_{i=1}^{M}\varepsilon_{ij}\omega_{i}}\right),&\text{if active and }O(t)=j\\ \boldsymbol{\omega}(t)P,&\text{if passive}\end{cases}. \tag{3}
\end{equation}

In the extreme case when we cannot observe any state after activating an arm, we simply treat the obtained reward (which might be $0$) as the feedback. By normalizing the immediate reward $r(t)$ to take only positive integer values, we have

\begin{align*}
\rho_{ij}={}&P(r(t)=j|S(t)=i)\\
={}&\frac{P(r(t)=j,S(t)=i)}{P(S(t)=i)}\\
={}&\frac{\sum_{k=1}^{M}P(r(t)=j,O(t)=k,S(t)=i)}{\omega_{i}}\\
={}&\frac{\sum_{k=1}^{M}P(r(t)=j|O(t)=k,S(t)=i)P(O(t)=k|S(t)=i)\omega_{i}}{\omega_{i}}\\
={}&\sum_{k=1}^{M}\mathbbm{1}(r_{ik}=j)\varepsilon_{ik}
\end{align*}

or equivalently $\rho_{ir}=\sum_{j=1}^{M}\mathbbm{1}(r_{ij}=r)\varepsilon_{ij}$. Thus we have

\begin{align}
\sum_{i=1}^{M}p_{i1}\rho_{ir}\omega_{i}={}&\sum_{i=1}^{M}p_{i1}\Big(\sum_{j=1}^{M}\mathbbm{1}(r_{ij}=r)\varepsilon_{ij}\omega_{i}\Big)\nonumber\\
={}&\sum_{i=1}^{M}\sum_{j=1}^{M}\mathbbm{1}(r_{ij}=r)p_{i1}\varepsilon_{ij}\omega_{i}\nonumber\\
={}&\sum_{r_{ij}=r}p_{i1}\varepsilon_{ij}\omega_{i}. \tag{4}
\end{align}

Similarly,

\begin{equation}
\sum_{i=1}^{M}\rho_{ir}\omega_{i}=\sum_{r_{ij}=r}\varepsilon_{ij}\omega_{i}. \tag{5}
\end{equation}

Combining (4) and (5), the belief state update rule is

\begin{equation}
\boldsymbol{\omega}(t+1)=\begin{cases}\left(\frac{\sum_{r_{ij}=r}p_{i1}\varepsilon_{ij}\omega_{i}}{\sum_{r_{ij}=r}\varepsilon_{ij}\omega_{i}},\cdots,\frac{\sum_{r_{ij}=r}p_{iM}\varepsilon_{ij}\omega_{i}}{\sum_{r_{ij}=r}\varepsilon_{ij}\omega_{i}}\right),&\text{if active and }r(t)=r\\ \boldsymbol{\omega}(t)P,&\text{if passive}\end{cases}. \tag{6}
\end{equation}
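To make the reward-as-feedback case concrete, the sketch below builds the likelihood $\rho_{ir}=\sum_{j}\mathbbm{1}(r_{ij}=r)\varepsilon_{ij}$ from the error and reward matrices and then applies the active branch of rule (6). The helper names `reward_likelihood` and `reward_feedback_update`, and the small matrices in the test, are hypothetical and chosen only for illustration.

```python
import numpy as np


def reward_likelihood(E, R):
    """Return (values, rho) where rho[i, c] = sum_j 1(R[i, j] == values[c]) * E[i, j]."""
    E = np.asarray(E, dtype=float)
    R = np.asarray(R)
    values = np.unique(R)  # distinct reward levels, sorted ascending
    # For each reward level, sum the observation probabilities that produce it.
    rho = np.stack([np.where(R == v, E, 0.0).sum(axis=1) for v in values], axis=1)
    return values, rho


def reward_feedback_update(omega, P, E, R, reward):
    """Active-arm belief update when only the reward value is observed."""
    values, rho = reward_likelihood(E, R)
    match = np.flatnonzero(values == reward)
    if match.size == 0:
        raise ValueError("reward value does not appear in R")
    weighted = rho[:, match[0]] * np.asarray(omega, dtype=float)
    total = weighted.sum()
    if total <= 0.0:
        raise ValueError("reward has zero probability under this belief")
    return (weighted @ np.asarray(P, dtype=float)) / total
```

Each row of `rho` sums to one because every (true state, observed state) pair maps to exactly one reward value.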

A more practical situation is that we can jointly obtain information from the observed state and the obtained reward. In this case, we can similarly give the belief update rule:

\begin{equation}
\boldsymbol{\omega}(t+1)=\begin{cases}\left(\frac{\sum_{i=1}^{M}p_{i1}\varepsilon_{ij}\omega_{i}\mathbbm{1}(r_{ij}=r)}{\sum_{i=1}^{M}\varepsilon_{ij}\omega_{i}\mathbbm{1}(r_{ij}=r)},\cdots,\frac{\sum_{i=1}^{M}p_{iM}\varepsilon_{ij}\omega_{i}\mathbbm{1}(r_{ij}=r)}{\sum_{i=1}^{M}\varepsilon_{ij}\omega_{i}\mathbbm{1}(r_{ij}=r)}\right),&\text{if active and }O(t)=j,\\ &r(t)=r~(\exists\, i~\text{s.t.}~r_{ij}=r)\\ \boldsymbol{\omega}(t)P,&\text{if passive}\end{cases}. \tag{7}
\end{equation}
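The joint rule (7) can be sketched in the same way: the likelihood of seeing observation $j$ together with reward $r$ in true state $i$ is $\varepsilon_{ij}\mathbbm{1}(r_{ij}=r)$. The function name `joint_update`, the 0-based observation index, and the example matrices in the test are assumptions for illustration only.

```python
import numpy as np


def joint_update(omega, P, E, R, obs, reward):
    """Active-arm belief update when both the observed state and the reward are seen.

    obs is the 0-based index of the observed state; reward is the realized reward.
    """
    omega = np.asarray(omega, dtype=float)
    P = np.asarray(P, dtype=float)
    E = np.asarray(E, dtype=float)
    R = np.asarray(R)
    # States whose reward entry in column `obs` differs from the realized
    # reward are ruled out by the indicator in rule (7).
    weighted = omega * E[:, obs] * (R[:, obs] == reward)
    total = weighted.sum()
    if total <= 0.0:
        raise ValueError("observation/reward pair is impossible under this belief")
    return (weighted @ P) / total
```

When the reward entries in the observed column are all distinct, the pair (observation, reward) identifies the true state, and the updated belief reduces to the corresponding row of $P$.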

For the restless multi-armed bandit model, the goal is to find a policy $\pi$ that maps the belief states of all arms to an active set $A(t)$ in each slot $t$ so as to maximize the long-term expected discounted reward. In other words, if we denote the reward obtained from the $n$-th arm in slot $t$ by $r_{n}(t)$, then our objective is

\begin{align}
\max_{\pi}~\mathbb{E}_{\pi} & \left[\sum_{t=1}^{+\infty}\beta^{t-1}\sum_{n=1}^{N}r_{n}(t)\,\bigg|\,\boldsymbol{\omega}_{1}(1),\cdots,\boldsymbol{\omega}_{N}(1)\right], \tag{8}\\
\text{s.t.}~ & |A(t)|=K,\quad t\geq 1, \tag{9}
\end{align}

where $0\leq\beta<1$ is the discount factor and $\boldsymbol{\omega}_{n}(t)$ is the belief state (vector) of arm $n$ in slot $t$. The variety of states, actions, and observation errors makes this problem highly complex. For RMAB problems, the mainstream approach is to search for an easily computable priority index policy. The core idea is to assign an index (a real number) to the current state of each arm and then activate the $K$ arms with the largest indices. The goal of this paper is to characterize theoretically the conditions under which such an index policy exists and to provide a detailed algorithm for efficiently computing the index function (if it exists).

2.2 Whittle Index

Whittle (1988) relaxed the constraint on the exact number of arms activated in each slot, requiring only that the expected number of activated arms per slot, averaged over time in the discounted sense, equals $K$, i.e.,

\begin{equation}
\mathbb{E}_{\pi}\left[\sum_{t=1}^{+\infty}\beta^{t-1}\sum_{n=1}^{N}\mathbbm{1}(n\in A(t))\,\bigg|\,\boldsymbol{\Omega}(1)\right]=\frac{K}{1-\beta} \tag{10}
\end{equation}

or

\begin{equation}
\mathbb{E}_{\pi}\left[\sum_{t=1}^{+\infty}\beta^{t-1}\sum_{n=1}^{N}\mathbbm{1}(n\notin A(t))\,\bigg|\,\boldsymbol{\Omega}(1)\right]=\frac{N-K}{1-\beta}, \tag{11}
\end{equation}

where $\boldsymbol{\Omega}(t)=(\boldsymbol{\omega}_{1}(t),\cdots,\boldsymbol{\omega}_{N}(t))$ with $\boldsymbol{\omega}_{n}(t)$ as the belief state (vector) for arm $n$ at time $t$. Thus the Lagrangian optimization problem can be written as

\begin{equation}
\max_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=1}^{+\infty}\beta^{t-1}\sum_{n=1}^{N}\big(\mathbbm{1}(n\in A(t))r_{n}(t)+\lambda\mathbbm{1}(n\notin A(t))\big)\,\bigg|\,\boldsymbol{\Omega}(1)\right]. \tag{12}
\end{equation}

The above problem can be decomposed into $N$ independent subproblems, that is, for any $1\leq n\leq N$,

\begin{equation}
\max_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=1}^{+\infty}\beta^{t-1}\big(\mathbbm{1}(n\in A(t))r_{n}(t)+\lambda\mathbbm{1}(n\notin A(t))\big)\,\bigg|\,\boldsymbol{\omega}_{n}(1)\right]. \tag{13}
\end{equation}

The interpretation of the optimization problem above is that each arm receives a subsidy $\lambda$ whenever it is not selected (made passive). Since the subproblems are independent, we only need to consider the single-arm case. For notational simplicity, we drop the arm index $n$ from now on. For a given arm, the optimal policy for the relaxed optimization problem divides the arm state space (here, the arm belief state space) into two subsets: the active set $\mathcal{A}(\lambda)$ and the passive set $\mathcal{P}(\lambda)$. Specifically, $\mathcal{P}(\lambda)$ contains all belief states in which the passive action is optimal when the subsidy is $\lambda$. In particular, if both the active and passive actions are optimal in a state $s$, we include $s$ in $\mathcal{P}(\lambda)$, and $\mathcal{A}(\lambda)$ is the complement of $\mathcal{P}(\lambda)$ in the entire state space. With the concept of the passive set, Whittle indexability can be stated as follows:

Definition 2.1 (Whittle Indexability)

An arm is indexable if the passive set $\mathcal{P}(\lambda)$ increases from $\emptyset$ to the whole state space as the subsidy $\lambda$ increases from $-\infty$ to $+\infty$. The RMAB problem is indexable if every decoupled problem is indexable.

Indexability states that, once an arm is made passive with subsidy $\lambda$, it should also be passive with any $\lambda^{\prime}$ larger than $\lambda$. If the problem satisfies indexability, its Whittle index is defined as follows:

Definition 2.2 (Whittle Index)

If an arm is indexable, the Whittle index $W(s)$ of a state $s$ is the infimum subsidy $\lambda$ that keeps $s$ in the passive set $\mathcal{P}(\lambda)$. That is,

\[
W(s):=\inf\{\lambda:s\in\mathcal{P}(\lambda)\}.
\]

By continuity, $W(s)$ is the infimum subsidy that makes the active and passive actions equally attractive at state $s$. Under the definitions of indexability and Whittle index, the Whittle index policy for the original multi-armed bandit problem is simply to activate the $K$ arms whose states offer the largest Whittle indices. In fact, for the classic MAB problem where passive arms do not change states, Whittle indexability always holds and the Whittle index reduces to the Gittins index.
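For intuition, Definition 2.2 can be turned into a brute-force numerical procedure on a finite-state arm: for each subsidy $\lambda$, solve the single-arm problem by value iteration and bisect on $\lambda$ until the two actions are equally good at $s$. The sketch below is only an illustration under stated assumptions (a finite state space, an indexable arm, and an index lying in the bracket `[lo, hi]`); it is not the PCL-based algorithm developed in this paper, and the names `q_gap`, `whittle_index`, and `top_k_arms` are hypothetical.

```python
import numpy as np

def q_gap(s, lam, R, P0, P1, beta, iters=1000):
    """Active-minus-passive value at state s under subsidy lam.

    R[s] is the expected active reward; the passive action earns lam.
    P0 / P1 are the passive / active transition matrices of the arm.
    """
    V = np.zeros(len(R))
    for _ in range(iters):  # plain value iteration on the subsidized arm
        q_active = R + beta * (P1 @ V)
        q_passive = lam + beta * (P0 @ V)
        V = np.maximum(q_active, q_passive)
    return (R[s] + beta * (P1[s] @ V)) - (lam + beta * (P0[s] @ V))

def whittle_index(s, R, P0, P1, beta, lo=-10.0, hi=10.0, tol=1e-6):
    """Bisection for the subsidy at which both actions tie at state s.

    Assumes indexability and that the index lies in [lo, hi].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q_gap(s, mid, R, P0, P1, beta) > 0:
            lo = mid  # active strictly better: s is not passive yet
        else:
            hi = mid  # passive is optimal: the index is at most mid
    return 0.5 * (lo + hi)

def top_k_arms(indices, K):
    """Arms with the K largest current indices (ties broken by arm order)."""
    return np.argsort(-np.asarray(indices, dtype=float), kind="stable")[:K]
```

As a sanity check, if neither action changes the state (`P0 = P1 = I`), the gap at $s$ is exactly $R_s-\lambda$, so the procedure returns $W(s)=R_s$ up to the tolerance. This brute-force approach requires a full dynamic-programming solve per bisection step, which is one motivation for seeking a more efficient index algorithm.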

2.3 Belief State Space

Different from the perfect observation case, the update of the belief state is nonlinear for the general observation model. This makes it difficult to analyze the problem with value functions and dynamic programming. Given an initial belief state, the belief state space appears to grow exponentially over time as all possible realizations of observations/rewards/feedback are traversed and the belief is updated accordingly. Fortunately, extensive numerical experiments show that, in the sense of Euclidean norm approximation, the state space converges after a number of iterations (see Sec. 3 for details). Let the initial belief state be $\boldsymbol{\omega}$ and let $\{\mathcal{B}_{l}\}_{0\leq l\leq L}$ be a list of operators defined by the feedback states and the state update rules: for $1\leq l\leq L$, $\mathcal{B}_{l}\boldsymbol{\omega}$ is the updated state caused by the $l$-th feedback state under the active action, and $\mathcal{B}_{0}$ is the state update operator under the passive action. In this paper, $\mathcal{B}_{0}\boldsymbol{\omega}=\boldsymbol{\omega}P$. Define the $T$-step state space recursively as follows:

Definition 2.3 ($T$-step state space)

Define

\[
\Omega(T|\boldsymbol{\omega})=\begin{cases}
\{\mathcal{B}_{l_{1}}\cdots\mathcal{B}_{l_{T}}\boldsymbol{\omega}:0\leq l_{1},\cdots,l_{T}\leq L\}\cup\Omega(T-1|\boldsymbol{\omega}), & T\geq 2,\\
\{\mathcal{B}_{l}\boldsymbol{\omega}:0\leq l\leq L\}\cup\{\boldsymbol{\omega}\}, & T=1.
\end{cases}
\]

We call $\Omega(T|\boldsymbol{\omega})$ the $T$-step state space under the initial belief state $\boldsymbol{\omega}$.
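The recursive definition of $\Omega(T|\boldsymbol{\omega})$ amounts to a breadth-first enumeration of all beliefs reachable in at most $T$ updates. The following is a minimal sketch under assumptions that are not part of the paper: the operators $\mathcal{B}_{0},\cdots,\mathcal{B}_{L}$ are passed as Python callables, a callable returns `None` when its feedback has zero probability under the given belief, and two beliefs are treated as identical when they agree after rounding to `decimals` digits. The function name `t_step_state_space` is hypothetical.

```python
import numpy as np

def t_step_state_space(omega, operators, T, decimals=10):
    """Enumerate Omega(T | omega): beliefs reachable in at most T updates.

    operators : list of callables [B_0, B_1, ..., B_L]; each maps a belief
                vector to the next belief, or returns None if undefined.
    """
    def key(w):
        # Rounded tuple used to detect (numerically) duplicate beliefs.
        return tuple(np.round(w, decimals))

    start = np.asarray(omega, dtype=float)
    space = {key(start): start}
    frontier = [start]  # beliefs first reached at the current depth
    for _ in range(T):
        new_frontier = []
        for w in frontier:
            for B in operators:
                nxt = B(w)
                if nxt is None:
                    continue
                nxt = np.asarray(nxt, dtype=float)
                k = key(nxt)
                if k not in space:
                    space[k] = nxt
                    new_frontier.append(nxt)
        frontier = new_frontier
    return list(space.values())
```

Only newly discovered beliefs are expanded at each depth, which is sufficient because the successors of a belief found earlier have already been generated at a smaller depth. In the worst case the number of states still grows like $\sum_{k=0}^{T}(L+1)^{k}$.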

Under the definition of the $T$-step state space, $\Omega(+\infty|\boldsymbol{\omega_{0}})$ is the belief space obtained by traversing all possible state update rules starting from the initial state $\boldsymbol{\omega_{0}}$. Since each $\Omega(T|\boldsymbol{\omega_{0}})$ is finite, $\Omega(+\infty|\boldsymbol{\omega_{0}})$ is a countable union of finite sets and hence countable; we denote it by $\Omega(\boldsymbol{\omega_{0}})$. Under a policy $\pi$, let $p_{ij}(\pi):=P(\boldsymbol{\omega}(t+1)=\boldsymbol{\omega}_{j}\,|\,\boldsymbol{\omega}(t)=\boldsymbol{\omega}_{i})$ be the conditional probability of transition from belief state $\boldsymbol{\omega}_{i}$ to $\boldsymbol{\omega}_{j}$.
Since the active and passive actions update the belief state differently, $p_{ij}(\pi)$ must be decomposed by action. Let $a(t)$ be the indicator of whether the arm is activated in slot $t$; we then define the transition probabilities of the belief state as follows:

\begin{align}
p_{ij}^{0} &= P(\boldsymbol{\omega}(t+1)=\boldsymbol{\omega}_{j}\,|\,\boldsymbol{\omega}(t)=\boldsymbol{\omega}_{i},a(t)=0), \tag{14}\\
p_{ij}^{1} &= P(\boldsymbol{\omega}(t+1)=\boldsymbol{\omega}_{j}\,|\,\boldsymbol{\omega}(t)=\boldsymbol{\omega}_{i},a(t)=1). \tag{15}
\end{align}

By (14)-(15) and the total probability theorem, we have

\begin{equation}
p_{ij}(\pi)=\sum_{a=0}^{1}P(a(t)=a\,|\,\boldsymbol{\omega}(t)=\boldsymbol{\omega}_{i})\,p_{ij}^{a}. \tag{16}
\end{equation}

The elements in the probability transition matrix under passive and active actions have the following more specific expressions:

\begin{equation}
p_{ij}^{0}=\begin{cases}
1, & \text{if }\boldsymbol{\omega}_{j}=\boldsymbol{\omega}_{i}P,\\
0, & \text{otherwise}
\end{cases}
\tag{17}
\end{equation}

and

\begin{align}
p_{ij}^{1} &= P(\boldsymbol{\omega}(t+1)=\boldsymbol{\omega}_{j}\,|\,\text{active and }\boldsymbol{\omega}(t)=\boldsymbol{\omega}_{i}) \tag{18}\\
&= \sum_{\{l:\mathcal{B}_{l}\boldsymbol{\omega}_{i}=\boldsymbol{\omega}_{j},\,1\leq l\leq L\}}P(F(t)=l\,|\,\text{active and }\boldsymbol{\omega}(t)=\boldsymbol{\omega}_{i}). \notag
\end{align}

In the following sections, these two probability transition matrices will play an important role in the calculation of Whittle indices.
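On a finite set of belief states, the matrices in (17)-(18) can be assembled directly from the update operators. The sketch below is a hedged illustration rather than the paper's procedure: the operators and the feedback probability $P(F(t)=l\,|\,\text{active},\boldsymbol{\omega})$ are supplied by the caller as callables, and any successor belief that is not in the stored set is snapped to the nearest stored belief in Euclidean norm, which is an assumption of this sketch. The name `belief_transition_matrices` is hypothetical.

```python
import numpy as np

def belief_transition_matrices(states, passive_op, active_ops, feedback_prob):
    """Build (p^0, p^1) of Eqs. (17)-(18) on a finite list of beliefs.

    states        : list of belief vectors (a finite set of belief states)
    passive_op    : callable B_0
    active_ops    : list of callables [B_1, ..., B_L]
    feedback_prob : callable (l, w) -> P(F(t) = l | active, belief w)
    """
    S = np.asarray(states, dtype=float)
    n = len(S)

    def nearest(w):
        # Index of the stored belief closest to w (projection assumption).
        return int(np.argmin(np.linalg.norm(S - np.asarray(w), axis=1)))

    P0 = np.zeros((n, n))
    P1 = np.zeros((n, n))
    for i, w in enumerate(S):
        # Eq. (17): the passive update is deterministic.
        P0[i, nearest(passive_op(w))] = 1.0
        # Eq. (18): add up the probabilities of all feedback values l
        # whose update B_l leads to the same successor belief.
        for l, B in enumerate(active_ops, start=1):
            prob = feedback_prob(l, w)
            if prob > 0.0:
                P1[i, nearest(B(w))] += prob
    return P0, P1
```

Each row of `P0` contains a single one, and each row of `P1` sums to one whenever the feedback probabilities do.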

2.4 Two-Arm Problem and Achievable Region

With the rapid development at the junction of operations research, stochastic optimization and reinforcement learning, quite a few effective methods for finite-state problems have been found, among which the achievable region method with conservation laws is a prominent example. Bertsimas and Niño-Mora (1996) and Niño-Mora (2001) adopted the analytical frameworks of generalized conservation laws and partial conservation laws for the classic MAB and RMAB problems with finite state spaces, respectively, and provided efficient index algorithms for the corresponding problems. In the remaining subsections, we apply the PCL framework to our RMAB problem with an infinite state space and build the theoretical foundation for constructing an efficient index policy.

Consider the single-arm process with $M$ states discussed in the previous section. For an initial belief state $\boldsymbol{\omega_{0}}$, let $\Omega(\boldsymbol{\omega_{0}})$ be the countable belief state space generated by iteratively updating $\boldsymbol{\omega_{0}}$ through state transitions. Since every belief state visited while the arm is made active or passive falls within $\Omega(\boldsymbol{\omega_{0}})$, the entire time horizon can be partitioned according to the belief state occupied in each slot. In this scenario, define the performance indicator variables for each belief state as follows:

\[
I_{\boldsymbol{\omega}}^{1}(t)=\begin{cases}
1, & \text{if the arm is active in slot $t$ and its belief state is $\boldsymbol{\omega}$},\\
0, & \text{otherwise}
\end{cases}
\]

and

\[
I_{\boldsymbol{\omega}}^{0}(t)=\begin{cases}
1, & \text{if the arm is passive in slot $t$ and its belief state is $\boldsymbol{\omega}$},\\
0, & \text{otherwise}.
\end{cases}
\]

Furthermore, define the performance measures of belief state $\boldsymbol{\omega}$ as follows:

\begin{align}
x_{\boldsymbol{\omega}}^{a}(\pi) &= \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty}\beta^{t}I_{\boldsymbol{\omega}}^{a}(t)\right],\quad a=0,1, \tag{19}\\
x_{\boldsymbol{\omega}}^{a}(\pi,\boldsymbol{\omega}_{i}) &= \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty}\beta^{t}I_{\boldsymbol{\omega}}^{a}(t)\,\bigg|\,\boldsymbol{\omega}(0)=\boldsymbol{\omega}_{i}\right],\quad a=0,1, \tag{20}
\end{align}

where $\boldsymbol{\omega}(t)$ is the belief state of the arm in slot $t$. In Whittle's relaxation, the Lagrangian multiplier $\lambda$ can be regarded as a subsidy when the arm is made passive. Thus the optimization objective of the partially observable RMAB problem with subsidy can be written as follows:

\begin{align}
\max_{\pi}\quad & \sum_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega_{0}})}R_{\boldsymbol{\omega}}x_{\boldsymbol{\omega}}^{1}+\lambda\sum_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega_{0}})}x_{\boldsymbol{\omega}}^{0} \tag{21}\\
\text{subject to}\quad & x_{\boldsymbol{\omega}_{i}}^{1}+x_{\boldsymbol{\omega}_{i}}^{0}=e_{\boldsymbol{\omega}_{i}}+\beta\Bigg(\sum_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega_{0}})}p_{ji}^{1}x_{\boldsymbol{\omega}_{j}}^{1}+\sum_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega_{0}})}p_{ji}^{0}x_{\boldsymbol{\omega}_{j}}^{0}\Bigg),\quad\forall\boldsymbol{\omega}_{i}\in\Omega(\boldsymbol{\omega_{0}}), \notag\\
& x_{\boldsymbol{\omega}_{i}}^{1},x_{\boldsymbol{\omega}_{i}}^{0}\geq 0,\quad\forall\boldsymbol{\omega}_{i}\in\Omega(\boldsymbol{\omega_{0}}), \notag
\end{align}

where R𝝎=βˆ‘i=1MΟ‰γŠγ‚γŒi⁒Ri,𝝎=(Ο‰γŠγ‚γŒ1,β‹―,Ο‰γŠγ‚γŒM)formulae-sequencesubscriptπ‘…πŽsuperscriptsubscript𝑖1𝑀subscriptπœ”π‘–subscriptπ‘…π‘–πŽsubscriptπœ”1β‹―subscriptπœ”π‘€R_{\boldsymbol{\omega}}=\sum\limits_{i=1}^{M}\omega_{i}R_{i},\boldsymbol{% \omega}=(\omega_{1},\cdots,\omega_{M})italic_R start_POSTSUBSCRIPT bold_italic_Ο‰γŠγ‚γŒ end_POSTSUBSCRIPT = βˆ‘ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_Ο‰γŠγ‚γŒ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_Ο‰γŠγ‚γŒ = ( italic_Ο‰γŠγ‚γŒ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , β‹― , italic_Ο‰γŠγ‚γŒ start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ), pi⁒jasuperscriptsubscriptπ‘π‘–π‘—π‘Žp_{ij}^{a}italic_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT are given by (17)-(18), e𝝎isubscript𝑒subscriptπŽπ‘–e_{\boldsymbol{\omega}_{i}}italic_e start_POSTSUBSCRIPT bold_italic_Ο‰γŠγ‚γŒ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT is the indicator whether the initial belief state is 𝝎isubscriptπŽπ‘–\boldsymbol{\omega}_{i}bold_italic_Ο‰γŠγ‚γŒ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, risubscriptπ‘Ÿπ‘–r_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the expected immediate reward when S⁒(t)=i𝑆𝑑𝑖S(t)=iitalic_S ( italic_t ) = italic_i with the active action. 
Clearly, $R_{\boldsymbol{\omega}}$ is bounded, so we can assume $|R_{\boldsymbol{\omega}}|\leq C$ for all $\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})$ and some constant $C$. The equality constraints of this optimization problem can be read as a balance condition: the left side is the direct performance measure for state $\boldsymbol{\omega}_{i}$, while the right side is another way to represent it. That is, each occurrence of state $\boldsymbol{\omega}_{i}$ after the initial slot results from a transition out of some state $\boldsymbol{\omega}_{j}$. Therefore, the performance measure for state $\boldsymbol{\omega}_{i}$ can be represented by a combination of the performance measures of other states. The model with subsidy can be explained more intuitively through a two-arm system. In this system, the first arm is the original arm, while the second arm (the auxiliary arm) has only one state, denoted by $0$. In every slot we choose one of the two arms to activate. When the auxiliary arm is activated, we obtain a fixed reward $\lambda$. Our goal is to find a policy that maximizes the long-term expected discounted reward by deciding which arm to activate in each slot. The objective can thus be written as

\[
\max\limits_{\pi}\sum\limits_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})}R_{\boldsymbol{\omega}}x_{\boldsymbol{\omega}}^{1}(\pi)+\lambda x_{0}^{1}(\pi),
\]

where $x_{0}^{1}(\pi)$ is the performance measure of state $0$ being activated under policy $\pi$. Let $X$ be the set of all vectors $(x_{0}^{1}(\pi),x_{\boldsymbol{\omega}}^{1}(\pi))_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})}$ obtained as $\pi$ ranges over the admissible (feasible) policy set $\Pi$. We call $X$ the achievable region. Under this definition, the optimization problem can be written as follows:

\begin{equation}
\max_{(x_{0}^{1},x_{\boldsymbol{\omega}}^{1})\in X}\sum\limits_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})}R_{\boldsymbol{\omega}}x_{\boldsymbol{\omega}}^{1}+\lambda x_{0}^{1}.
\tag{OPT}
\end{equation}

The core of solving this optimization problem is to characterize the achievable region $X$ mathematically. A conservation law is a set of equality or inequality constraints satisfied by the components of every $(x_{0}^{1},x_{\boldsymbol{\omega}}^{1})\in X$. For the above model, a trivial equality constraint is $x_{0}^{1}+\sum\limits_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})}x_{\boldsymbol{\omega}}^{1}=\frac{1}{1-\beta}$. This holds because exactly one arm of the two-arm system is active in each slot (time conservation). In the following subsections, we further explore the rich mathematical structure of the performance measures for the two-arm system with countable states under the concept of the achievable region.
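As a brief check of the time-conservation identity, assume that the performance measures are the expected discounted activation counts, i.e., $x_{\boldsymbol{\omega}}^{1}(\pi)=\mathbb{E}_{\pi}\big[\sum_{t=0}^{+\infty}\beta^{t}\mathbbm{1}(\boldsymbol{\omega}(t)=\boldsymbol{\omega},a(t)=1)\big]$ and $x_{0}^{1}(\pi)=\mathbb{E}_{\pi}\big[\sum_{t=0}^{+\infty}\beta^{t}\mathbbm{1}(a(t)=0)\big]$, where $a(t)=1$ means that the original arm is activated in slot $t$ and $0\leq\beta<1$. This explicit form is an assumption made here for illustration. Since exactly one of the two arms is activated in every slot,
\[
x_{0}^{1}(\pi)+\sum\limits_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})}x_{\boldsymbol{\omega}}^{1}(\pi)
=\mathbb{E}_{\pi}\left[\sum\limits_{t=0}^{+\infty}\beta^{t}\big(\mathbbm{1}(a(t)=0)+\mathbbm{1}(a(t)=1)\big)\right]
=\sum\limits_{t=0}^{+\infty}\beta^{t}=\frac{1}{1-\beta},
\]
where the sum and the expectation can be interchanged because all terms are nonnegative.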

2.5 Partial Conservation Law

In this section, we elaborate on the relationship between the performance measures of the states in the two-arm system. To measure the performance of the original arm under policy $\pi$, we define the following variables:

\begin{align}
T_{\boldsymbol{\omega}_{i}}^{\pi}&:=\mathbb{E}_{\pi}\left[\sum\limits_{t=0}^{+\infty}\beta^{t}\mathbbm{1}(a(t)=1)\,\bigg|\,\boldsymbol{\omega}(0)=\boldsymbol{\omega}_{i}\right],\quad\boldsymbol{\omega}_{i}\in\Omega(\boldsymbol{\omega}_{0}),\tag{22}\\
R_{\boldsymbol{\omega}_{i}}^{\pi}&:=\mathbb{E}_{\pi}\left[\sum\limits_{t=0}^{+\infty}\beta^{t}R_{\boldsymbol{\omega}(t)}\mathbbm{1}(a(t)=1)\,\bigg|\,\boldsymbol{\omega}(0)=\boldsymbol{\omega}_{i}\right],\quad\boldsymbol{\omega}_{i}\in\Omega(\boldsymbol{\omega}_{0}).\tag{23}
\end{align}

Here $T_{\boldsymbol{\omega}_{i}}^{\pi}$ is the expected total discounted time during which the original arm is activated in the two-arm system under policy $\pi$, starting from belief state $\boldsymbol{\omega}_{i}$. We require that the policy $\pi$ depend only on the state of the original arm. Similarly, $R_{\boldsymbol{\omega}_{i}}^{\pi}$ is the expected total discounted reward obtained from the original arm under policy $\pi$. Hence $T_{\boldsymbol{\omega}_{i}}^{\pi}$ and $R_{\boldsymbol{\omega}_{i}}^{\pi}$ can also be expressed as follows:

\begin{align}
T_{\boldsymbol{\omega}_{i}}^{\pi}&=\sum\limits_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})}x_{\boldsymbol{\omega}}^{1}(\pi,\boldsymbol{\omega}_{i}),\tag{24}\\
R_{\boldsymbol{\omega}_{i}}^{\pi}&=\sum\limits_{\boldsymbol{\omega}\in\Omega(\boldsymbol{\omega}_{0})}R_{\boldsymbol{\omega}}x_{\boldsymbol{\omega}}^{1}(\pi,\boldsymbol{\omega}_{i}).\tag{25}
\end{align}

Clearly, the optimal policy partitions the belief state space $\Omega(\boldsymbol{\omega}_{0})$ of the original arm into two sets, on which the optimal actions are active and passive, respectively. For any $\Omega\subset\Omega(\boldsymbol{\omega}_{0})$, we call $\pi_{\Omega}$ an $\Omega$-priority policy if the original arm is activated when its current belief state falls into $\Omega$ and made passive when the state falls into $\Omega^{c}$. For an $\Omega$-priority policy, we have

\begin{align}
T_{\boldsymbol{\omega}_{i}}^{\Omega}&=\mathbb{E}_{\pi_{\Omega}}\left[\sum\limits_{t=0}^{+\infty}\beta^{t}\mathbbm{1}(\boldsymbol{\omega}(t)\in\Omega)\,\bigg|\,\boldsymbol{\omega}(0)=\boldsymbol{\omega}_{i}\right],\quad\boldsymbol{\omega}_{i}\in\Omega(\boldsymbol{\omega}_{0}),\tag{26}\\
R_{\boldsymbol{\omega}_{i}}^{\Omega}&=\mathbb{E}_{\pi_{\Omega}}\left[\sum\limits_{t=0}^{+\infty}\beta^{t}R_{\boldsymbol{\omega}(t)}\mathbbm{1}(\boldsymbol{\omega}(t)\in\Omega)\,\bigg|\,\boldsymbol{\omega}(0)=\boldsymbol{\omega}_{i}\right],\quad\boldsymbol{\omega}_{i}\in\Omega(\boldsymbol{\omega}_{0}).\tag{27}
\end{align}

From the above definition, we can easily give the following dynamic programming equations for these two variables:

\begin{equation}
T_{\boldsymbol{\omega}_{i}}^{\Omega}=\begin{cases}1+\beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}p_{ij}^{1}T_{\boldsymbol{\omega}_{j}}^{\Omega},&\boldsymbol{\omega}_{i}\in\Omega,\\ \beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}p_{ij}^{0}T_{\boldsymbol{\omega}_{j}}^{\Omega},&\boldsymbol{\omega}_{i}\notin\Omega,\end{cases}
\tag{28}
\end{equation}

and

\begin{equation}
R_{\boldsymbol{\omega}_{i}}^{\Omega}=\begin{cases}R_{\boldsymbol{\omega}_{i}}+\beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}p_{ij}^{1}R_{\boldsymbol{\omega}_{j}}^{\Omega},&\boldsymbol{\omega}_{i}\in\Omega,\\ \beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}p_{ij}^{0}R_{\boldsymbol{\omega}_{j}}^{\Omega},&\boldsymbol{\omega}_{i}\notin\Omega.\end{cases}
\tag{29}
\end{equation}
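For instance, when $\Omega(\boldsymbol{\omega}_{0})$ is finite, say with $n$ states, the two recursions can be put in matrix form. The following is only a sketch under that assumption, with $0\leq\beta<1$; the symbols $P^{\Omega}$, $\mathbf{1}_{\Omega}$, $D_{R}$, $\mathbf{T}^{\Omega}$ and $\mathbf{R}^{\Omega}$ are introduced here for illustration and are not part of the notation above. Let $P^{\Omega}$ be the $n\times n$ matrix whose $i$-th row is $(p_{ij}^{1})_{j}$ if $\boldsymbol{\omega}_{i}\in\Omega$ and $(p_{ij}^{0})_{j}$ otherwise, let $\mathbf{1}_{\Omega}$ be the indicator vector of $\Omega$, and let $D_{R}=\mathrm{diag}(R_{\boldsymbol{\omega}_{1}},\ldots,R_{\boldsymbol{\omega}_{n}})$. Stacking $T_{\boldsymbol{\omega}_{i}}^{\Omega}$ and $R_{\boldsymbol{\omega}_{i}}^{\Omega}$ into vectors $\mathbf{T}^{\Omega}$ and $\mathbf{R}^{\Omega}$ gives
\begin{align*}
\mathbf{T}^{\Omega}&=\mathbf{1}_{\Omega}+\beta P^{\Omega}\mathbf{T}^{\Omega}
&&\Longrightarrow&\mathbf{T}^{\Omega}&=(I-\beta P^{\Omega})^{-1}\mathbf{1}_{\Omega},\\
\mathbf{R}^{\Omega}&=D_{R}\mathbf{1}_{\Omega}+\beta P^{\Omega}\mathbf{R}^{\Omega}
&&\Longrightarrow&\mathbf{R}^{\Omega}&=(I-\beta P^{\Omega})^{-1}D_{R}\mathbf{1}_{\Omega}.
\end{align*}
The inverse exists because $P^{\Omega}$ is a stochastic matrix, so the spectral radius of $\beta P^{\Omega}$ is at most $\beta<1$. Two quick sanity checks: $\Omega=\emptyset$ gives $\mathbf{T}^{\Omega}=\mathbf{0}$, and $\Omega=\Omega(\boldsymbol{\omega}_{0})$ gives $\mathbf{T}^{\Omega}=\frac{1}{1-\beta}\mathbf{1}$, since $(I-\beta P^{\Omega})\mathbf{1}=(1-\beta)\mathbf{1}$.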

If $\Omega(\boldsymbol{\omega}_{0})$ is a finite state space, we can directly solve for $T_{\boldsymbol{\omega}_{i}}^{\Omega}$ and $R_{\boldsymbol{\omega}_{i}}^{\Omega}$ from the two sets of equations above. However, a direct solution is not available for countable state spaces. To further investigate the properties of these variables, we define $V_{\boldsymbol{\omega}_{i}}^{\pi}$ as the expected total discounted reward of the two-arm system starting from state $\boldsymbol{\omega}_{i}$ under policy $\pi$. We have

\begin{equation}
V_{\boldsymbol{\omega}_{i}}^{\pi}=R_{\boldsymbol{\omega}_{i}}^{\pi}+\lambda\left(\frac{1}{1-\beta}-T_{\boldsymbol{\omega}_{i}}^{\pi}\right).
\tag{30}
\end{equation}

Let $V_{\boldsymbol{\omega}_{i},0}^{\pi}$ and $V_{\boldsymbol{\omega}_{i},1}^{\pi}$ be the expected total discounted rewards with initial state $\boldsymbol{\omega}_{i}$ when the passive and active actions, respectively, are taken in the first slot and policy $\pi$ is followed thereafter. Then

\begin{align}
V_{\boldsymbol{\omega}_{i},0}^{\pi}&=\lambda+\beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}p_{ij}^{0}V_{\boldsymbol{\omega}_{j}}^{\pi},\tag{31}\\
V_{\boldsymbol{\omega}_{i},1}^{\pi}&=R_{\boldsymbol{\omega}_{i}}+\beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}p_{ij}^{1}V_{\boldsymbol{\omega}_{j}}^{\pi}.\tag{32}
\end{align}

If, under policy $\pi$, the active and passive actions yield the same expected total discounted reward from the initial state $\boldsymbol{\omega}_{i}$, i.e., $V_{\boldsymbol{\omega}_{i},0}^{\pi}=V_{\boldsymbol{\omega}_{i},1}^{\pi}$, then we can solve for $\lambda$:

\begin{equation}
\lambda=\frac{R_{\boldsymbol{\omega}_{i}}+\beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}(p_{ij}^{1}-p_{ij}^{0})R_{\boldsymbol{\omega}_{j}}^{\pi}}{1+\beta\sum\limits_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}(p_{ij}^{1}-p_{ij}^{0})T_{\boldsymbol{\omega}_{j}}^{\pi}}.
\tag{33}
\end{equation}
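The expression for $\lambda$ follows in a few lines. As a sketch, assume that each transition law is a probability distribution on $\Omega(\boldsymbol{\omega}_{0})$, i.e., $\sum_{\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})}p_{ij}^{a}=1$ for $a\in\{0,1\}$. Substituting (30) for each $V_{\boldsymbol{\omega}_{j}}^{\pi}$ in (31) and (32) and setting $V_{\boldsymbol{\omega}_{i},0}^{\pi}=V_{\boldsymbol{\omega}_{i},1}^{\pi}$ yields
\begin{align*}
&\lambda+\beta\sum\limits_{\boldsymbol{\omega}_{j}}p_{ij}^{0}R_{\boldsymbol{\omega}_{j}}^{\pi}+\frac{\beta\lambda}{1-\beta}-\beta\lambda\sum\limits_{\boldsymbol{\omega}_{j}}p_{ij}^{0}T_{\boldsymbol{\omega}_{j}}^{\pi}\\
&\qquad=R_{\boldsymbol{\omega}_{i}}+\beta\sum\limits_{\boldsymbol{\omega}_{j}}p_{ij}^{1}R_{\boldsymbol{\omega}_{j}}^{\pi}+\frac{\beta\lambda}{1-\beta}-\beta\lambda\sum\limits_{\boldsymbol{\omega}_{j}}p_{ij}^{1}T_{\boldsymbol{\omega}_{j}}^{\pi},
\end{align*}
where all sums run over $\boldsymbol{\omega}_{j}\in\Omega(\boldsymbol{\omega}_{0})$. The terms $\frac{\beta\lambda}{1-\beta}$ cancel, and collecting the terms in $\lambda$ gives
\[
\lambda\left(1+\beta\sum\limits_{\boldsymbol{\omega}_{j}}(p_{ij}^{1}-p_{ij}^{0})T_{\boldsymbol{\omega}_{j}}^{\pi}\right)
=R_{\boldsymbol{\omega}_{i}}+\beta\sum\limits_{\boldsymbol{\omega}_{j}}(p_{ij}^{1}-p_{ij}^{0})R_{\boldsymbol{\omega}_{j}}^{\pi},
\]
which is the stated ratio whenever the factor multiplying $\lambda$ is nonzero.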

Now define

\begin{align}
A_{\boldsymbol{\omega}_{i}}^{\Omega} &:= 1+\beta\sum\limits_{\boldsymbol{\omega_{j}}\in\Omega(\boldsymbol{\omega_{0}})}(p_{ij}^{1}-p_{ij}^{0})T_{\boldsymbol{\omega}_{j}}^{\Omega^{c}}, \tag{34}\\
W_{\boldsymbol{\omega}_{i}}^{\Omega} &:= R_{\boldsymbol{\omega}_{i}}+\beta\sum\limits_{\boldsymbol{\omega_{j}}\in\Omega(\boldsymbol{\omega_{0}})}(p_{ij}^{1}-p_{ij}^{0})R_{\boldsymbol{\omega}_{j}}^{\Omega^{c}}. \tag{35}
\end{align}
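
As a numerical illustration of (34) and (35), the following minimal sketch evaluates both quantities on a finite truncation of the belief state space. It rests on several assumptions that are not taken from the text: the state space is cut down to $n$ states, the passive action earns zero reward, $T^{\pi}$ and $R^{\pi}$ are read as the expected total discounted activation time and reward under a stationary policy $\pi$, and the policy is encoded as a boolean mask `active` whose relation to $\Omega$ and $\Omega^{c}$ must follow the convention of the preceding sections. The function name `marginal_measures` and all argument names are hypothetical.

```python
import numpy as np


def marginal_measures(P1, P0, r_active, beta, active):
    """Evaluate the right-hand sides of (34) and (35) for every state.

    Assumed inputs (illustrative, finite truncation):
      P1, P0   : (n, n) transition matrices under the active / passive action
      r_active : (n,) immediate reward for activating in each state
                 (the passive reward is assumed to be zero)
      beta     : discount factor in (0, 1)
      active   : (n,) boolean mask, True where the policy activates
    """
    P1 = np.asarray(P1, dtype=float)
    P0 = np.asarray(P0, dtype=float)
    r_active = np.asarray(r_active, dtype=float)
    a = np.asarray(active, dtype=float)
    n = r_active.shape[0]

    # Transition matrix induced by the policy: row i comes from P1 if
    # state i is activated and from P0 otherwise.
    P_pi = a[:, None] * P1 + (1.0 - a)[:, None] * P0
    M = np.eye(n) - beta * P_pi

    # Policy evaluation: T = a + beta * P_pi T and R = a * r + beta * P_pi R.
    T = np.linalg.solve(M, a)             # discounted activation time
    R = np.linalg.solve(M, a * r_active)  # discounted reward

    D = P1 - P0                           # entries p_ij^1 - p_ij^0
    A = 1.0 + beta * D @ T                # right-hand side of (34)
    W = r_active + beta * D @ R           # right-hand side of (35)
    return A, W


# Hypothetical two-state example.
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P0 = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([1.0, 0.0])
A, W = marginal_measures(P1, P0, r, beta=0.5, active=np.array([True, False]))
ratio = W / A  # the right-hand side of (33), state by state, where A is nonzero
```

Under these assumptions, the ratio `W / A` reproduces the right-hand side of (33) for the chosen policy. Two sanity checks follow directly from the formulas: a policy that never activates gives $T=R=0$, hence $A=1$ and $W$ equal to the immediate reward, and a policy that always activates gives a constant $T$, hence $A=1$ because each row of $P^{1}-P^{0}$ sums to zero.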

For the auxiliary arm, we have $A_{0}^{\{0\}}=1$ and $W_{0}^{\{0\}}=\lambda$. Generally, we can extend these variables to the multi-arm case. Assume that there are two arms and their state spaces are $\Omega(\boldsymbol{\omega})$ and $\Omega^{\prime}(\boldsymbol{\omega}^{\prime})$ with $\Omega(\boldsymbol{\omega})\cap\Omega^{\prime}(\boldsymbol{\omega}^{\prime})=\emptyset$. Then for $\Omega\subset\Omega(\boldsymbol{\omega})$, $\Omega^{\prime}\subset\Omega^{\prime}(\boldsymbol{\omega}^{\prime})$ and