Rethinking Transformers in Solving POMDPs

Chenhao Lu    Ruizhe Shi    Yuyao Liu    Kaizhe Hu    Simon S. Du    Huazhe Xu
Abstract

Sequential decision-making algorithms such as reinforcement learning (RL) in real-world scenarios inevitably face environments with partial observability. This paper scrutinizes the effectiveness of a popular architecture, namely Transformers, in Partially Observable Markov Decision Processes (POMDPs) and reveals its theoretical and empirical limitations. We establish that regular languages, which Transformers struggle to model, are reducible to POMDPs. This poses a significant challenge for Transformers in learning POMDP-specific inductive biases, due to their lack of inherent recurrence found in other models like RNNs. This paper casts doubt on the prevalent belief in Transformers as sequence models for RL and proposes to introduce a point-wise recurrent structure. The Deep Linear Recurrent Unit (LRU) emerges as a well-suited alternative for Partially Observable RL, with empirical results highlighting the sub-optimal performance of Transformer and considerable strength of LRU. Our code is open-sourced at https://github.com/CTP314/TFPORL.

Transformer, RNN, POMDP, Linear RNN, Partially Observable RL

1 Introduction

Reinforcement Learning (RL) in the real world confronts the challenge of incomplete information (Dulac-Arnold et al., 2019) due to partial observability, necessitating decision-making based on historical data. The design of RL algorithms under partial observability, denoted as Partially Observable RL (Kaelbling et al., 1998; Littman & Sutton, 2001; Li et al., 2015), typically employs a hierarchical structure combining $(\texttt{SEQ},\texttt{RL})$. This structure first feeds the history into a sequence model SEQ, such as a Recurrent Neural Network (RNN) (Elman, 1990) or a Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997), yielding a hidden state containing past information, and then processes this hidden state using existing RL algorithms.

Regarding the sequence model, Transformer (Vaswani et al., 2017), renowned for its achievements in the natural language processing (NLP) domain (Radford et al., 2019; Brown et al., 2020; OpenAI, 2023), stands out as a prominent candidate. Transformers have shown a strong ability to handle contexts in a non-recurrent manner. Compared with their recurrent counterpart like RNNs, the advantages of Transformers as a sequence model shine in several aspects: 1) long-term memory capacity, as opposed to RNNs with rapid memory decay (Ni et al., 2023; Parisotto et al., 2019); 2) effective representation learning from context for specific tasks (Micheli et al., 2022; Laskin et al., 2022; Lee et al., 2022; Robine et al., 2023), benefiting meta-RL or certain environments (Bellemare et al., 2013); 3) stronger learning ability on large-scale datasets (Baker et al., 2022).

However, deploying Transformers in Partially Observable RL introduces challenges, commonly manifesting as sample inefficiency (Parisotto et al., 2019; Ni et al., 2023). This issue is similarly observed in computer vision (CV) and is attributed to the data amount that Transformers require to learn the problem-specific inductive biases (Dosovitskiy et al., 2021). While it is validated in CV, it remains unknown whether data amount is the key ingredient in decision making. Hence, a natural question arises: Can Transformers effectively solve decision-making problems in POMDPs with sufficient data?

In this work, we investigate this critical question and challenge the conventional wisdom. We argue that Transformers cannot solve POMDPs in general, even with massive data. This stance is inspired by a key observation: while most RNNs are complete for regular languages, Transformers struggle to model them (Delétang et al., 2023; Hahn, 2020b). A notable example is their difficulty with tasks like PARITY, which is to determine the parity of the number of occurrences of “1” in a binary string. We hypothesize that, in POMDPs, this limitation becomes pronounced due to the close relationship between regular languages and POMDPs.

To elaborate further, regular languages exhibit a direct correspondence with Hidden Markov Models (HMMs) (Carrasco & Oncina, 1994), and POMDPs can be regarded as HMMs augmented with an incorporated decision-making process. We further establish that regular languages can be effectively reduced to POMDPs. From the computational complexity perspective, the parallel structure of the Transformer makes it equivalent to a constant-depth computation circuit. Some regular languages fall outside of this complexity class, making the POMDP problems derived from them harder, and Transformer would struggle to solve them. This is demonstrated both theoretically and empirically in this study.

To alleviate the limitations of Transformers caused by the parallel structure, we propose to introduce a pointwise recurrent structure. Upon reviewing current variants of sequence models with such a structure, we find that they can be broadly generalized as linear RNNs. Based on extensive experiments with a range of sequence models on POMDP tasks with diverse requirements, we highlight LRU (Orvieto et al., 2023) as a linear RNN model well-suited for Partially Observable RL. Our contributions are three-fold:

  • We demonstrate the theoretical limitations of Transformers as sequence model backbones for solving POMDPs, through rigorous analysis.

  • To better utilize the inductive bias of the sequence model, we study the advantages of the Transformer and the RNN, and advocate the linear RNN as a better-suited choice for solving POMDPs, taking advantage of both models.

  • Through extensive experiments across various tasks, we compare the capabilities exhibited by various sequence models across multiple dimensions. Specifically, we show that Transformers exhibit sub-optimal performance as the sequence model in certain POMDPs, while highlighting the strength of linear RNNs when assessed comprehensively.

2 Related Work

Theoretical limitations of Transformers. There is a substantial body of work investigating the theoretical limitations of Transformers from the perspective of computational complexity and formal language. For example, Delétang et al. (2023); Huang et al. (2022) experimentally verify that RNNs can recognize regular languages, but Transformers are unable to achieve this. Additionally, Hahn (2020a) demonstrates that Transformers are not robust in handling sequence length extrapolation. Moreover, Merrill & Sabharwal (2023); Merrill et al. (2022) point out that, under limited precision, $\mathsf{TC}^0$ serves as an upper bound for the computational power of Transformers. Applying this result, Feng et al. (2023) illustrate the challenges Transformers face in solving practical problems such as arithmetic operations and linear systems of equations. Currently, works such as Ni et al. (2023); Morad et al. (2023); Deng et al. (2023) discuss the pros and cons of Transformers in RL algorithms, with a focus on analyzing the advantages or providing simple evaluations. In contrast, integrating relevant theories from formal languages, we offer a new theoretical perspective on analyzing the limitations of Transformers in RL.

Variants of sequence models for handling long contexts. Multiple variants of mainstream sequence models designed to handle long contexts have provided significant inspiration for this work. In the case of RNN-like models, addressing the issue of rapid memory decay has led to the emergence of linear RNNs (Gu et al., 2021; Orvieto et al., 2023), which remove activation functions in the recurrence part. These models have demonstrated excellent performance in benchmarks for long-range modeling (Tay et al., 2020). For Transformers, to tackle limited training length and inefficient inference, current studies emphasize the introduction of recurrence. Recurrence in Transformers can be categorized into two types: 1) chunkwise recurrence, which processes parts exceeding the context length using a recurrent block with minimal alterations to the original parallel structure (Dai et al., 2019; Hutchins et al., 2022); 2) pointwise recurrence, which derives recurrent representations by linearizing attention (Sun et al., 2023; Peng et al., 2023; Schlag et al., 2021; Katharopoulos et al., 2020). We argue that linear RNNs and pointwise-recurrence Transformers ultimately converge to a similar solution that incorporates the strengths of both approaches and is suitable for solving Partially Observable RL problems.
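To make the phrase "remove activation functions in the recurrence part" concrete, the following is a minimal NumPy sketch of a linear recurrence. It assumes a diagonal complex transition `lam` and the update x_t = lam * x_{t-1} + B u_t; this is a simplification chosen for illustration, not the exact parameterization of any cited model, and all function and variable names are hypothetical. Because no nonlinearity sits inside the recurrence, the same states can be obtained either step by step or from an unrolled closed form.

```python
import numpy as np

def linear_recurrence_sequential(lam, B, u):
    """Compute x_t = lam * x_{t-1} + B @ u_t step by step, with x_0 = 0."""
    x = np.zeros(lam.shape[0], dtype=complex)
    states = []
    for u_t in u:
        x = lam * x + B @ u_t  # no activation function inside the recurrence
        states.append(x)
    return np.stack(states)

def linear_recurrence_unrolled(lam, B, u):
    """Same states from the closed form x_t = sum_{k<=t} lam^(t-k) * (B @ u_k)."""
    n = len(u)
    Bu = u @ B.T                                  # (n, d): projected inputs
    t = np.arange(n)
    causal = t[:, None] >= t[None, :]             # only past and current inputs
    exponents = np.where(causal, t[:, None] - t[None, :], 0)
    weights = (lam[None, None, :] ** exponents[:, :, None]) * causal[:, :, None]
    return np.einsum("tkd,kd->td", weights, Bu)

# Illustrative setup (all sizes and values are assumptions).
rng = np.random.default_rng(0)
d, m, n = 4, 3, 10
lam = 0.9 * np.exp(1j * rng.uniform(0.0, np.pi, size=d))  # |lam| < 1 for stability
B = rng.normal(size=(d, m))
u = rng.normal(size=(n, m))

seq_states = linear_recurrence_sequential(lam, B, u)
par_states = linear_recurrence_unrolled(lam, B, u)
```

The unrolled form is what allows such models to be trained in parallel over the sequence while still being run recurrently, one step at a time, at inference.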

Applications of sequence models in RL. In recent years, there have been many applications of Transformers in RL, such as Decision Transformer (DT) (Chen et al., 2021) and its variants (Yamagata et al., 2023; Wu et al., 2023) in offline RL; GTrXL (Parisotto et al., 2019) and Online DT (Zheng et al., 2022) in online RL; and the Transformer State Space Model (Chen et al., 2022) as a world model. There are works (Reid et al., 2022; Shi et al., 2023) showing that the inductive bias of pre-trained Transformers could help RL, where the states are fully observable. Hu et al. (2023) use a Transformer to solve a specific partially observable setting, frame drops, while requiring additional assumptions on the prior distribution. On the other hand, recent works comparing Transformer-based and RNN-based approaches in Partially Observable RL empirically support our idea that Transformers have weaknesses in partially observable environments (Morad et al., 2023; Deng et al., 2023). In many cases, simpler architectures like RNNs and LSTMs prove to be more effective. For instance, DreamerV3 (Hafner et al., 2023), which adopts GRU (Dey & Salem, 2017) as the backbone of the Recurrent State Space Model, has outperformed previous Transformer-based approaches like VPT (Baker et al., 2022) and IRIS (Micheli et al., 2022). Additionally, there has been a recent line of research on the application of linear RNNs in RL (Irie et al., 2021; Lu et al., 2024; Samsami et al., 2024), which has shown promising results. This paper incorporates insights into the limitations of Transformers to analyze why this would be a natural choice.

3 Preliminaries

Sequential neural network. Sequential Neural Networks are a type of deep learning model for sequence modeling. Given an input sequence $\{u_i\}_{i=1}^{n}$, the model learns a hidden state sequence $\{x_i\}_{i=1}^{n}$ and yields the output sequence $\{y_i\}_{i=1}^{n}$. There are currently two mainstream methods for computing the hidden state $x_i$:

  • Recurrent-like: $x_t = \sigma(A x_{t-1} + B u_{t-1} + c)$, where $\sigma$ is the activation function;

  • Attention-like: $x_t = \operatorname{attn}(W^Q \tilde{U}, W^K \tilde{U}, W^V \tilde{U})_t$. Here $\tilde{U}_i = u_i + p_i$, $p_i$ is a position embedding, and $\operatorname{attn}(Q, K, V)$ is defined as $\operatorname{softmax}(QK^{\top} + M)V$, where $M$ is an attention mask.
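The two ways of computing hidden states listed above can be sketched in NumPy as follows. This is a single-layer, single-head illustration under assumptions that are not in the text: rows of `u` are the inputs in temporal order, the recurrent state starts at zero, each recurrent step simply consumes the next available input (the index offset in the formula is a convention), `tanh` stands in for the activation, and `M` is taken to be a causal mask. All function names are hypothetical.

```python
import numpy as np

def recurrent_states(A, B, c, u, sigma=np.tanh):
    """Recurrent-like: each state is a function of the previous state and one input."""
    x = np.zeros(A.shape[0])
    states = []
    for u_t in u:                        # strictly sequential over time
        x = sigma(A @ x + B @ u_t + c)
        states.append(x)
    return np.stack(states)

def attention_states(WQ, WK, WV, u, p):
    """Attention-like: softmax(Q K^T + M) V with a causal mask M."""
    U = u + p                            # add position embeddings to the inputs
    Q, K, V = U @ WQ.T, U @ WK.T, U @ WV.T
    n = len(u)
    idx = np.arange(n)
    M = np.where(idx[None, :] > idx[:, None], -np.inf, 0.0)  # block future positions
    scores = Q @ K.T + M
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V                   # all time steps computed in parallel
```

The recurrent version carries information forward through a fixed-size state, whereas the attention version recomputes a weighted average over all visible positions at every step, which is the structural difference the later analysis relies on.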

Figure 1: (a) (Pointwise) Recurrent; (b) (Parallel) Attention. The left panel shows the general structure of recurrent-like sequential neural networks and the right panel shows the attention-like ones. Both of them can be deepened by pointwise transformations and skip connections.

Current sequence models use multiple layers. The overall structure is illustrated in Figure 1. For recurrent models like RNNs and GRUs, the output of the previous layer is directly used as the input for the next layer, while for Transformers, the pointwise transformations typically take the form of MLPs with skip connections.

POMDP. A Partially Observable Markov Decision Process (POMDP) $\mathcal{M}$ can be defined as $(S, A, T, R, \Omega, O, \gamma)$ (Åström, 1965). At time $i$, the agent is in state $s_i \in S$, observes $o_i \in \Omega$ with $o_i \sim O(\cdot|s_i)$, takes action $a_i \in A$, receives reward $R(s_i, a_i)$, and transitions to $s_{i+1} \sim T(\cdot|s_i, a_i)$. The agent’s policy based on the observation history $h_t = \{(o_i, a_i, r_i)\}_{i=1}^{t}$ is denoted as $\pi(\cdot|h_{t-1}, o_t)$. We say that an algorithm $\mathcal{A}$ can solve POMDPs if it can find the optimal policy $\pi^{\star}$ for a given POMDP $\mathcal{M}$, where:

$$\pi^{\star} = \operatorname*{argmax}_{\pi} \mathop{\mathbb{E}}_{\substack{a_t \sim \pi(\cdot|h_{t-1}, o_t) \\ s_t \sim T(\cdot|s_{t-1}, a_{t-1}) \\ o_t \sim O(\cdot|s_t)}} \left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right].$$

DFA & regular language. A Deterministic Finite Automaton (DFA) can be defined as $A = (S, \Sigma, T, s_0, F)$, where $S$ is a finite set of states, $\Sigma$ is a finite set of symbols, $T: S \times \Sigma \to S$ is the transition function, $s_0$ is the start state, and $F \subseteq S$ is the set of accepting states. A string $w = w_1 \ldots w_n$ whose $i$-th symbol is $w_i$ is accepted by $A$ if there exists a sequence of states $(s_0, s_1, \ldots, s_n)$ such that $s_i \in S$ and $s_i = T(s_{i-1}, w_i)$ for $1 \leq i \leq n$, and $s_n \in F$. $L$ is a regular language if it is recognized by some DFA $A$, that is, $L = \{w : A \text{ accepts } w\}$.
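The acceptance condition above amounts to following the transition function symbol by symbol and checking the final state. A minimal Python sketch follows, using a two-state automaton for PARITY (even number of 1s) as the example; the dictionary encoding and the names `dfa_accepts` and `PARITY` are illustrative assumptions, not part of the original formulation.

```python
def dfa_accepts(transitions, start, accepting, word):
    """Run a DFA given as a dict {(state, symbol): next_state} on `word`."""
    state = start
    for symbol in word:
        state = transitions[(state, symbol)]  # s_i = T(s_{i-1}, w_i)
    return state in accepting                 # accept iff s_n is in F

# PARITY: q0 means "even number of 1s so far", q1 means "odd number of 1s so far".
PARITY = {
    "transitions": {
        ("q0", "0"): "q0", ("q0", "1"): "q1",
        ("q1", "0"): "q1", ("q1", "1"): "q0",
    },
    "start": "q0",
    "accepting": {"q0"},
}

print(dfa_accepts(word="0110", **PARITY))
print(dfa_accepts(word="0111", **PARITY))
```

Recognizing the language only requires remembering the current DFA state, which is a constant amount of information updated once per symbol.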

4 Limitations of Transformer in Partially Observable RL

RL algorithms typically take as inputs the current state with the assumption of the Markov property. In partially observable environments that lack the Markov property, the hidden state extracted by SEQ from the observation history is thus anticipated to contain the information of the real state to benefit subsequent RL.

Notably, Delétang et al. (2023) point out that Transformers (referred to as TFs) fail to recognize regular languages. Inspired by the significant correspondence between regular languages and POMDPs (details in Definition 4.2), we naturally conjecture that TF is not capable of accurately retrieving the information of the real state from partial observations, which would lead to a decline in the performance of the pipeline $(\texttt{SEQ},\texttt{RL})$.

Building upon this view, this section first shows that solving a POMDP problem is harder than solving a regular language problem. Afterward, we introduce two theoretical results to elucidate the limitations of Transformers, supported by simple examples for illustration. Consequently, it is inferred that $(\texttt{TF},\texttt{RL})$ cannot address POMDPs generally.

4.1 Reduction from Regular Language to POMDP

Proposition 4.1.

If an algorithm $\mathcal{A} = (\texttt{SEQ},\texttt{RL})$ can solve POMDPs, then given a regular language $L$, $\mathcal{A}$ can recognize $L$ by solving a POMDP problem $\mathcal{M}$.

Proof idea. We construct a POMDP $\mathcal{M}$, such that each state represents a transition in $L$, and the observation at timestep $t$ is the corresponding character $w_t$. The agent can output accept or reject, and the reward is assigned if and only if the final output aligns with the acceptance of the string $w = w_0 \ldots w_{t-1}$ in $L$. In this way, the optimal policy $\pi$ on $\mathcal{M}$ is to accept all $w \in L$ and reject all $w \notin L$, so if an algorithm $\mathcal{A}$ can solve the POMDP $\mathcal{M}$, then $\mathcal{A}$ can recognize $L$. Proof details are deferred to Appendix B.1.
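The construction in the proof idea can be sketched as a small environment, here instantiated for PARITY. This is an illustrative reading of the reduction, not the authors' implementation: the class name, the `reset`/`step` interface, the uniformly sampled word length, the end marker `#`, and the convention that actions before the end marker are ignored are all assumptions made for this sketch.

```python
import random

# DFA for PARITY (even number of 1s): q0 is both the initial and the accepting state.
PARITY_T = {("q0", "0"): "q0", ("q0", "1"): "q1",
            ("q1", "0"): "q1", ("q1", "1"): "q0"}

class RegularLanguagePOMDP:
    """Hidden state (q, w); the agent only observes the symbol w."""

    def __init__(self, transitions, start, accepting, alphabet, max_len, seed=0):
        self.transitions, self.start, self.accepting = transitions, start, accepting
        self.alphabet, self.max_len = alphabet, max_len
        self.rng = random.Random(seed)

    def _next_symbol(self):
        # Emit random symbols until the sampled word length is reached, then '#'.
        self.w = self.rng.choice(self.alphabet) if self.t < self.length else "#"
        return self.w

    def reset(self):
        self.q, self.t = self.start, 0
        self.length = self.rng.randint(0, self.max_len)  # assumption: uniform length
        return self._next_symbol()

    def step(self, action):
        """Return (observation, reward, done); actions before '#' are ignored."""
        if self.w == "#":
            in_language = self.q in self.accepting
            reward = 1.0 if (action == "accept") == in_language else 0.0
            return None, reward, True
        self.q = self.transitions[(self.q, self.w)]  # hidden DFA state, never observed
        self.t += 1
        return self._next_symbol(), 0.0, False

def oracle_episode(env):
    """Roll out a hand-written policy that tracks the parity of the observed 1s."""
    obs, ones, done, reward = env.reset(), 0, False, 0.0
    while not done:
        if obs == "#":
            action = "accept" if ones % 2 == 0 else "reject"
        else:
            ones += obs == "1"
            action = "reject"  # placeholder; ignored before the end marker
        obs, reward, done = env.step(action)
    return reward

env = RegularLanguagePOMDP(PARITY_T, "q0", {"q0"}, "01", max_len=8, seed=0)
returns = [oracle_episode(env) for _ in range(100)]
```

A policy earns the reward on every episode only if its memory of the history determines whether the observed string is in the language, which is why solving this POMDP implies recognizing the language.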

Definition 4.2 (POMDP derived from regular language $L$).

Given a regular language $L$, the POMDP derived as in Proposition 4.1 is denoted as $\mathcal{M}^{L}$. For an integer $n$, $\mathcal{M}^{L}(n)$ represents a special case of $\mathcal{M}^{L}$ whose horizon is no longer than $n$.

Remark 4.3.

When implementing RL algorithms, especially in online settings, it is common to set a truncation time $n$ for training and evaluation purposes. For $\mathcal{M}^{L}(n)$ with maximum horizon $n$, since observations during training and evaluation come from the same distribution, we can liken it to fitting in supervised learning. Furthermore, in partially observable cases, historical information needs to be considered. If no time limit is set, there is a need for length extrapolation, which we can liken to generalization in supervised learning, as captured by $\mathcal{M}^{L}$.

In Figure 2, we illustrate how to construct $\mathcal{M}^{L}$. We also provide experiments to verify the reduction (cf. Section 6.1).

Figure 2: (a) Illustration of the DFA of PARITY. There are two states $q_0$ and $q_1$, where $q_0$ is both the initial state and the accepting state. The transitions are plotted as gray arrows. (b) Illustration of $\mathcal{M}^{\texttt{PARITY}}$. The states are $(q_i, w)$ where $i \in \{0, 1\}$ and $w \in \{0, 1, \#\}$, and the agent observes $w$. The initial state is randomly sampled from $(q_0, w)$. The stochastic transitions are plotted as gray arrows. At a final state $(q_i, \#)$, blue arrows stand for choosing accept, and red arrows stand for choosing reject.

4.2 Limitations in Fitting: Solving $\mathcal{M}^{L}(n)$

While prior research has claimed universality for Transformers, specifically proving their Turing completeness and ability to approximate any seq-to-seq function on compact support (Bhattamishra et al., 2020; Pérez et al., 2021; Luo et al., 2022; Yun et al., 2019), it is crucial to note that these results often rely on impractical assumptions such as infinite precision and finite length (Jiang et al., 2023).

In this subsection, we assume that $(\texttt{TF},\texttt{RL})$ (TF denotes a Transformer) is a log-precision model, meaning that all values in the model have $O(\log n)$ bits of precision, where $n$ is the input length. This assumption aligns with reality since computer floating-point precision is typically 16, 32, or 64 bits, smaller than the sequence lengths commonly handled by sequence models. Based on this assumption, there exists a class of POMDPs for which achieving solutions with $(\texttt{TF},\texttt{RL})$ would demand an excessively large number of parameters. This type of problem can be directly mapped to a circuit complexity class, whose definition is provided in Appendix A.1.

Theorem 4.4.

Assume $\mathsf{TC}^0 \neq \mathsf{NC}^1$. Given an $\mathsf{NC}^1$ complete regular language $L$, for any depth $D$ and any polynomial $\operatorname{poly}(n)$, there exists a length $n$ such that no log-precision $(\texttt{TF},\texttt{RL})$ with depth $D$ and hidden dimension $d \leq \operatorname{poly}(n)$ can solve $\mathcal{M}^{L}(n)$.

Proof idea. At the heart of the proof is a contradiction achieved through circuit complexity theory (Arora & Barak, 2009). Merrill & Sabharwal (2023) have shown that $\mathsf{TC}^0$ circuits can simulate a log-precision Transformer with constant depth and polynomial hidden dimensions. Consequently, if $(\texttt{TF},\texttt{RL})$ could solve $\mathsf{NC}^1$ complete problems, the classes $\mathsf{TC}^0$ and $\mathsf{NC}^1$ would collapse, a scenario generally deemed impossible (Yao, 1989). Proof details of Theorem 4.4 are deferred to Appendix B.2.

Following syntactic monoid theory (Straubing, 2012) and Barrington’s theorem (Barrington, 1986), a significant number of regular languages are $\mathsf{NC}^1$ complete, such as the regular language $((0+1)^3(01^*0+1))^*$ (cf. Appendix A.2.3).
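For readers who want to experiment with this example language, the expression can be translated into Python's `re` syntax, where `+` in the formal notation (union) becomes `|` or a character class and the exponent 3 becomes `{3}`. The sketch below only tests membership; it says nothing about circuit complexity. The names `NC1_EXAMPLE` and `in_language` are hypothetical.

```python
import re

# ((0+1)^3 (0 1^* 0 + 1))^*  in formal notation, rewritten for Python's re module.
NC1_EXAMPLE = re.compile(r"(?:[01]{3}(?:01*0|1))*")

def in_language(word):
    """Return True if the whole binary string `word` matches the expression."""
    return NC1_EXAMPLE.fullmatch(word) is not None

for word in ["", "0001", "00000", "1110110", "000", "00011"]:
    print(repr(word), in_language(word))
```

Each repetition consumes three arbitrary bits followed by either a single `1` or a block of the form `0 1...1 0`, so membership depends on how the string can be segmented rather than on a simple count.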

On the other hand, these two works inform us of another fact: for a regular language $L$, there are only two possibilities: either $L$ is $\mathsf{NC}^1$ complete or $L \in \mathsf{TC}^0$ (more specifically, $L \in \mathsf{ACC}^0$). As of now, the question of whether log-precision Transformers can solve every problem in $\mathsf{TC}^0$ remains an open problem (Merrill & Sabharwal, 2023). However, numerous experimental results (Delétang et al., 2023; Huang et al., 2022) suggest that Transformers do not perform well in handling certain regular languages within $\mathsf{TC}^0$, such as PARITY. In the next subsection, we demonstrate, from a generalization perspective, that Transformers cannot solve $\mathcal{M}^{L}$ for a broader range of regular languages $L$.

4.3 Limitations in Generalization: Solving $\mathcal{M}^{L}$

When deploying a Partially Observable RL algorithm, we expect it to demonstrate the ability of length generalization. In this subsection, we examine scenarios corresponding to $\mathcal{M}^{L}$ and no longer assume that the Transformer model operates with logarithmic precision.

Recent works (Press et al., 2021; Delétang et al., 2023; Ruoss et al., 2023) have empirically demonstrated that length extrapolation is a weakness of Transformers. Lemma 4.5 theoretically indicates that for any Transformer with the dot-product softmax attention mechanism, robust generalization is not achievable as the input length increases.

Lemma 4.5 (Lemma 5 in Hahn (2020a)).

Given a Transformer with softmax attention, let $n$ be the input length. If we change one input $u_i$ ($i < n$) to $u_i'$, then the change in the resulting hidden state $x_n$ at the output layer is bounded by $O(D/n)$, where $D = \|u_i - u_i'\|$, with constants depending on the parameter matrices.
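The $O(D/n)$ behavior in the lemma can be observed numerically on a toy model. The sketch below is an illustration only, not a proof and not the setting of the lemma in full generality: it uses a single softmax attention head with one layer, small random weights (scale 0.02), hypothetical random embeddings for the bits 0 and 1, and bounded sinusoidal position features, all of which are assumptions made here. It flips the first input bit and measures how much the hidden state at the last position moves as the sequence length grows.

```python
import numpy as np

D_MODEL = 8
rng = np.random.default_rng(0)
EMB = rng.normal(size=(2, D_MODEL))  # hypothetical embeddings for bits 0 and 1
WQ, WK, WV = (0.02 * rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(3))

def positional(n):
    """Bounded sinusoidal position features of shape (n, D_MODEL)."""
    pos = np.arange(n)[:, None]
    freqs = 1.0 / (10.0 ** np.arange(D_MODEL // 2))
    return np.concatenate([np.sin(pos * freqs), np.cos(pos * freqs)], axis=1)

def last_hidden_state(bits):
    """Softmax attention output at the last position of the sequence."""
    U = EMB[bits] + positional(len(bits))
    q = WQ @ U[-1]                    # query of the last position
    scores = (U @ WK.T) @ q           # one score per position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ (U @ WV.T)       # weighted average of the values

def sensitivity(n, seed=1):
    """Norm of the change in the last hidden state when the first bit is flipped."""
    bits = np.random.default_rng(seed).integers(0, 2, size=n)
    flipped = bits.copy()
    flipped[0] ^= 1
    return np.linalg.norm(last_hidden_state(bits) - last_hidden_state(flipped))

lengths = [16, 64, 256, 1024, 4096]
changes = [sensitivity(n) for n in lengths]
for n, delta in zip(lengths, changes):
    print(n, delta, n * delta)
```

If the bound is tight in this toy setting, the printed product `n * delta` should stay roughly constant while `delta` itself shrinks, reflecting that a single position receives an attention weight on the order of $1/n$.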

Theorem 4.6.

Given a regular language $L$, let $c(n, a) = \#\{xa \in L : |x| = n\}$. If there exists $a \in \Sigma$ such that the set $\{n : 0 < c(n, a) < |\Sigma|^{n}\}$ is infinite, and RL is a Lipschitz function, then $(\texttt{TF},\texttt{RL})$ cannot solve $\mathcal{M}^{L}$.

Proof idea. Since $\Sigma$ is a finite set, $D$ takes only finitely many values and is therefore bounded by a constant. We then prove that for $L$ satisfying the conditions, there exist infinitely many pairs $u,u'$ such that $u$ and $u'$ differ in only one position but $u\in L$ and $u'\notin L$. According to Lemma 4.5, the hidden states $x$ and $x'$ output by the Transformer differ by $O(1/n)$. Since RL is a Lipschitz function, the results of $\texttt{RL}(x)$ and $\texttt{RL}(x')$ also differ by $O(1/n)$. As $n$ increases, information from non-current time steps will only have a negligible impact on the output of RL. Proof details of Theorem 4.6 are deferred to Appendix B.3.
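The $O(1/n)$ decay in Lemma 4.5 can be illustrated numerically. The following is a minimal sketch, not the construction used in the proof: it assumes a single softmax-attention head with random weights and a two-token vocabulary, flips the first input token, and compares the change in the last-position output against an elementary bound $3Ve^{\Delta}/n$ that holds for this toy model, where $V$ bounds the value norms and $\Delta$ bounds the spread of the attention logits. The names `attention_last`, `sensitivity`, and `decay_bound` are hypothetical.

```python
import numpy as np

def attention_last(tokens, emb, Wq, Wk, Wv):
    """Output of one softmax-attention head at the last position."""
    U = emb[tokens]                          # (n, d) embedded inputs
    q = U[-1] @ Wq                           # query of the last position
    K, V = U @ Wk, U @ Wv                    # keys and values, both (n, d)
    scores = K @ q / np.sqrt(q.shape[0])     # logits are bounded (finite vocabulary)
    w = np.exp(scores - scores.max())        # numerically stable softmax
    return (w / w.sum()) @ V

def sensitivity(n, rng, emb, Wq, Wk, Wv):
    """Norm of the output change when only the first token is flipped."""
    tokens = rng.integers(0, 2, size=n)
    flipped = tokens.copy()
    flipped[0] ^= 1                          # change one input u_i with i < n
    a = attention_last(tokens, emb, Wq, Wk, Wv)
    b = attention_last(flipped, emb, Wq, Wk, Wv)
    return float(np.linalg.norm(a - b))

def decay_bound(n, emb, Wq, Wk, Wv):
    """Elementary bound 3 * V * exp(delta) / n for this single-head toy model."""
    d = emb.shape[1]
    S = (emb @ Wq) @ (emb @ Wk).T / np.sqrt(d)       # S[a, b]: query token a, key token b
    delta = (S.max(axis=1) - S.min(axis=1)).max()    # largest logit spread
    vmax = np.linalg.norm(emb @ Wv, axis=1).max()    # largest value norm
    return float(3.0 * vmax * np.exp(delta) / n)

rng = np.random.default_rng(0)
d = 4
emb = rng.normal(size=(2, d))                        # two-token vocabulary
Wq, Wk, Wv = (0.5 * rng.normal(size=(d, d)) for _ in range(3))

lengths = [16, 64, 256, 1024]
changes = [sensitivity(n, rng, emb, Wq, Wk, Wv) for n in lengths]
bounds = [decay_bound(n, emb, Wq, Wk, Wv) for n in lengths]
for n, c, b in zip(lengths, changes, bounds):
    print(f"n={n:5d}  change={c:.2e}  bound={b:.2e}")
```

The bound follows because the flipped position carries attention weight at most $e^{\Delta}/n$, and the remaining weights are only rescaled by the change in the normalizer. Deeper models need the full argument of Hahn (2020a).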

Observing that PARITY satisfies the conditions outlined in Theorem 4.6, we can derive Corollary 4.7.

Corollary 4.7.

If $L=\texttt{PARITY}$ and RL is a Lipschitz function, then $(\texttt{TF},\texttt{RL})$ cannot solve $\mathcal{M}^{L}$.
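To see why PARITY meets the hypothesis of Theorem 4.6, one can count $c(n,a)$ by brute force. The sketch below assumes PARITY is the set of binary strings with an odd number of ones (the even convention gives the same counts for $n\geq 1$); the helpers `in_parity` and `c` are hypothetical names introduced for illustration.

```python
from itertools import product

def in_parity(s):
    """Assumed convention: s is in PARITY iff it contains an odd number of '1's."""
    return s.count("1") % 2 == 1

def c(n, a, alphabet="01"):
    """c(n, a) = #{x a in L : |x| = n}, counted by enumeration."""
    return sum(in_parity("".join(x) + a) for x in product(alphabet, repeat=n))

for n in range(1, 9):
    count = c(n, "1")
    # Checks 0 < c(n, a) < |Sigma|^n for the lengths enumerated here;
    # in general c(n, '1') = 2^(n-1) for every n >= 1.
    print(n, count, 0 < count < 2 ** n)

# Two strings that differ in exactly one position but disagree on membership.
u, u_prime = "0110" + "1", "1110" + "1"
print(in_parity(u), in_parity(u_prime))
```

Exactly half of all prefixes of each length lead to acceptance, so the set of lengths in Theorem 4.6 is infinite, and any single-bit flip changes membership.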

The Lipschitz property is commonly observed in widely used learning-based RL algorithms, such as employing MLPs to predict $Q$-values, $V$-values, or the probability distribution of the next action. For cases that do not satisfy the Lipschitz property, such as those relying on the maximum value rather than logits, Chiang & Cholak (2022) provide a constructive method for a Transformer that can recognize PARITY. This scenario corresponds to the greedy policy based on $Q$-values. However, this theorem indicates that Transformers do not model sequences in a way that accurately reconstructs the real states, which makes it hard for $(\texttt{TF},\texttt{RL})$ to perform length extrapolation.

4.4 From POMDPs to Regular Languages

By illustrating the limitations of $(\texttt{TF},\texttt{RL})$ in handling POMDPs derived from regular languages, we demonstrate that there exist POMDP problems for which Transformers cannot effectively learn the corresponding inductive biases.

This class of POMDP problems corresponding to regular languages can be divided into three levels based on circuit complexity: $<\mathsf{TC}^0$, $[\mathsf{TC}^0,\mathsf{NC}^1)$, and $\mathsf{NC}^1$. The difficulty for Transformers to handle these problems increases progressively. This difficulty classification can be extended to existing POMDP problems. Please refer to Appendix C for a detailed discussion.

  • $<\mathsf{TC}^0$: Most tasks that solely assess pure memory capabilities are weaker than $\mathsf{TC}^0$. These tasks only involve extracting a finite number of tokens from the past and performing simple logical operations with current observation information. Most memory tasks mentioned in Ni et al. (2023) fall into this category. $(\texttt{TF},\texttt{RL})$ excels at solving such problems.

  • $[\mathsf{TC}^0,\mathsf{NC}^1)$: This category already represents the vast majority of regular languages. The corresponding typical POMDPs are environments such as Passive Visual Match (Hung et al., 2019) or Memory Maze (Pasukonis et al., 2022), where there is a need to infer the current position based on historical information. This is typically manifested in the requirement to reconstruct a relatively simple state from complex historical data.

  • $\mathsf{NC}^1$: Currently, no existing discrete-state POMDP problem has been found to correspond to this class of regular languages. According to Theorem 4.4, it is difficult for $(\texttt{TF},\texttt{RL})$ to learn the optimal policy.

Establishing a direct connection with regular languages is not particularly straightforward in continuous scenarios. However, some standard POMDP scenarios, such as Pybullet Occlusion Task (Ni et al., 2022), are at least not in the first level. These tasks require inferring the current actual state based on contextual information.

Furthermore, the preceding discussion implies that for $(\texttt{TF},\texttt{RL})$, the hidden state fed to RL is often not the underlying real state. In contrast, in subsequent experiments (cf. Section 6), we observe that $(\texttt{RNN},\texttt{RL})$ behaves differently and can implicitly reconstruct the real state. The capability of recovering underlying real states with the Markov property is believed to be a prerequisite for solving Partially Observable RL. Therefore, for POMDPs in general cases, $(\texttt{TF},\texttt{RL})$ may encounter issues.

5 Combining Transformer and RNN

Table 1: The recurrent representation for Transformer variants with pointwise recurrence. We compare different Transformer variants: FART, FWP, RWKV and RetNet. $y_i, u_i$ are as defined in Section 3, $s_i, z_i$ are hidden states, and the other variables are parameters.

Architectures | Recurrent Representation for a Single Head
FART (Katharopoulos et al., 2020) | $y_i=\operatorname{FFN}\left(\frac{\phi(u_iW_Q)^{\top}}{\phi(u_iW_Q)^{\top}z_i}+u_i\right)$, with $s_i=s_{i-1}+\phi(u_iW_k)(u_iW_V)^{\top}$ and $z_i=z_{i-1}+\phi(u_iW_k)$
FWP (Schlag et al., 2021) | $y_i=\frac{1}{z_i\phi(W_qu_i)}W_i\phi(W_qu_i)$, with $W_i=W_{i-1}+(W_vu_i)\otimes\phi(W_ku_i)$ and $z_i=z_{i-1}+\phi(W_ku_i)$
RWKV (Peng et al., 2023) | $y_i=\operatorname{Gate}\left(\frac{s_{i-1}+\mathrm{e}^{v+W_ku_i}\odot u_i}{z_{i-1}+\mathrm{e}^{v+W_ku_i}}\right)$, with $s_i=\mathrm{e}^{-w}\odot s_{i-1}+\mathrm{e}^{W_ku_i}\odot u_i$ and $z_i=\mathrm{e}^{-w}\odot z_{i-1}+\mathrm{e}^{W_ku_i}$
RetNet (Sun et al., 2023) | $y_i=\tau(XW_G)\odot\operatorname{GN}(z_i)$, with $z_i=\gamma z_{i-1}+(Ku_i)^{\top}(Vu_i)$

From the analysis in Section 4, it becomes evident that RNN-like models (LSTM, GRU, RNN) emerge as promising sequence model choices for Partially Observable RL. There has been considerable theoretical work demonstrating their completeness on regular languages (Merrill, 2019; Korsky & Berwick, 2019). For cases of log precision, based on definitions, we can directly map the recurrent units of RNNs to transition functions in DFAs. Therefore, RNNs do not suffer from the theoretical constraints encountered by Transformers.
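The correspondence between recurrent updates and DFA transitions can be made concrete. The following sketch is an illustration rather than the paper's construction: it writes a DFA as the recurrence $x_t=M_{a_t}x_{t-1}$ over one-hot state vectors, using a two-state automaton for PARITY. The dictionary `M` and the function `run_dfa` are hypothetical names.

```python
import numpy as np

# One-hot states: index 0 = even number of ones so far, index 1 = odd.
M = {
    "0": np.array([[1, 0], [0, 1]]),   # reading '0' keeps the current state
    "1": np.array([[0, 1], [1, 0]]),   # reading '1' swaps the two states
}

def run_dfa(word):
    """Apply x_t = M[a_t] @ x_{t-1} from the start state; return the final state index."""
    x = np.array([1, 0])               # start in the 'even' state
    for a in word:
        x = M[a] @ x
    return int(np.argmax(x))

for word in ["", "1", "1011", "110011"]:
    # The final state index equals the parity of the number of ones.
    print(repr(word), run_dfa(word), word.count("1") % 2)
```

Each input symbol selects a transition matrix, so the state update depends on both the previous state and the current input, which is the kind of recurrence the Transformer's attention layer lacks.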

However, RNN-like models face the challenge of rapid memory decay (Ni et al., 2023; Parisotto et al., 2019), leading to an inferior performance on POMDP problems that demand long-term memory when compared to Transformers (Parisotto et al., 2019; Ni et al., 2023).

Another insight from the previous section is that the attention mechanism of Transformers is primarily to blame for their limitations (see Figure 1b). As articulated in Merrill & Sabharwal (2023), there exists a trade-off between the highly parallel structure of Transformers and their computational capacity.

To alleviate these limitations of Transformers, a natural idea is to endow Transformers with the ability of pointwise recurrence (see Figure 1a). This line of development has been the focus of numerous efforts, resulting in several Transformer variants that incorporate this mechanism, as detailed in Table 1. The feature shared by these methods in their recurrent representation for a single head is the simple linear operation they employ, such as $x_t=\lambda x_{t-1}+u_t$, while the non-linear components can be amortized across layers through the FFN between layers. If the number of heads is set equal to the dimension $h$ of the hidden state, then

$$\mathbf{x}_t=\mathbf{\Lambda}\mathbf{x}_{t-1}+\mathbf{u}_t, \tag{1}$$

where $\mathbf{\Lambda}=\operatorname{diag}(\lambda_1,\ldots,\lambda_h)$. In RetNet, operations involving $x_t$, $\lambda_i$, and $u_i$ are performed over $\mathbb{C}$, while the remaining operations are carried out over $\mathbb{R}$. In RWKV, $\lambda_i$ is a learnable parameter, while in the remaining variants it is a hyperparameter.
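Because the recurrence with diagonal $\mathbf{\Lambda}$ acts independently on each coordinate, unrolling it from a zero initial state gives $\mathbf{x}_t=\sum_{k\le t}\mathbf{\Lambda}^{t-k}\mathbf{u}_k$. A minimal NumPy sketch, assuming a zero initial state and arbitrary illustrative values for $\lambda_i$, checks that the sequential and unrolled forms agree. The function names `pointwise_recurrence` and `unrolled` are hypothetical.

```python
import numpy as np

def pointwise_recurrence(lam, U):
    """Sequential form: x_t = lam * x_{t-1} + u_t (elementwise), zero initial state."""
    x = np.zeros_like(U[0])
    states = []
    for u in U:
        x = lam * x + u
        states.append(x)
    return np.stack(states)

def unrolled(lam, U):
    """Parallel form: x_t = sum over k <= t of lam**(t - k) * u_k."""
    T = U.shape[0]
    steps = np.arange(T)
    exponents = steps[:, None] - steps[None, :]               # (T, T) matrix of t - k
    causal = (exponents >= 0)[:, :, None]                     # mask out future inputs
    powers = lam[None, None, :] ** np.maximum(exponents, 0)[:, :, None]
    weights = np.where(causal, powers, 0.0)                   # (T, T, h)
    return np.einsum("tkh,kh->th", weights, U)

rng = np.random.default_rng(0)
h, T = 3, 12
lam = np.array([0.5, 0.9, 0.99])                              # illustrative values only
U = rng.normal(size=(T, h))

seq = pointwise_recurrence(lam, U)
par = unrolled(lam, U)
print(np.allclose(seq, par))
```

The unrolled form is what allows such models to be trained in parallel over the sequence, while the sequential form gives constant-cost updates at inference time.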

From the perspective of RNNs, if we linearize and diagonalize the RNN's recurrent unit $\mathbf{x}_t=\mathbf{A}\mathbf{x}_{t-1}+\mathbf{B}\mathbf{u}_t$, we obtain the following form:

$$\tilde{\mathbf{x}}_t=\mathbf{\Lambda}\tilde{\mathbf{x}}_{t-1}+\tilde{\mathbf{u}}_t, \tag{2}$$

where $\mathbf{A}=\mathbf{P}\mathbf{\Lambda}\mathbf{P}^{-1}$, $\tilde{\mathbf{x}}_t=\mathbf{P}^{-1}\mathbf{x}_t$, and $\tilde{\mathbf{u}}_t=\mathbf{P}^{-1}\mathbf{B}\mathbf{u}_t$. Since almost all matrices can be diagonalized over $\mathbb{C}$, the operations mentioned above are defined in $\mathbb{C}$ (Horn & Johnson, 2012).
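This change of basis can be checked numerically. The sketch below assumes a random real matrix $\mathbf{A}$ (diagonalizable over $\mathbb{C}$ with probability one), rescaled so that its spectral radius is below one, and uses `numpy.linalg.eig` to obtain $\mathbf{P}$ and $\mathbf{\Lambda}$. All sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
h, m, T = 4, 2, 20
A = rng.normal(size=(h, h))
A /= 1.1 * np.abs(np.linalg.eigvals(A)).max()   # rescale to spectral radius < 1
B = rng.normal(size=(h, m))
U = rng.normal(size=(T, m))

# Dense linear RNN: x_t = A x_{t-1} + B u_t, zero initial state.
x = np.zeros(h)
dense = []
for u in U:
    x = A @ x + B @ u
    dense.append(x)
dense = np.stack(dense)

# Diagonal form over C: A = P diag(lam) P^{-1}.
lam, P = np.linalg.eig(A)
P_inv = np.linalg.inv(P)
x_tilde = np.zeros(h, dtype=complex)
recovered = []
for u in U:
    u_tilde = P_inv @ (B @ u)                   # transformed input P^{-1} B u_t
    x_tilde = lam * x_tilde + u_tilde           # elementwise (diagonal) recurrence
    recovered.append(P @ x_tilde)               # map back to the original basis
recovered = np.stack(recovered)

print(np.allclose(dense, recovered.real, atol=1e-6), np.abs(recovered.imag).max())
```

The diagonal recurrence reproduces the dense one up to floating-point error, and the imaginary parts cancel once the state is mapped back, even though the intermediate states are complex.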

Comparing (1) and (2), the pointwise-recurrence Transformer can be viewed as a linear RNN with certain constraints, and the linear RNN serves as a balance point between Transformers and RNNs. To summarize, we expect linear RNN to be more suitable as a sequence model in Partially Observable RL for the following reasons.

Regular language. Many studies suggest that recurrence with non-linear activation functions plays a crucial role in the completeness of RNNs on regular languages (Chung & Siegelmann, 2021), while linear RNNs may lose this completeness. However, some studies indicate that linear RNNs can effectively approximate RNNs (Huang et al., 2022; Lim et al., 2023) and perform well on formal language tasks similar in form to NLP (Huang et al., 2022; Irie et al., 2023). Compared to Transformers, their inductive biases are closer to HMMs. In subsequent experiments (cf. Section 6), we validate that $(\texttt{LRNN},\texttt{RL})$ can implicitly learn the states in POMDPs.

State space model. Linear RNNs have been proven to efficiently fit partially observable linear dynamic systems (Wang et al., 2022), whereas the Transformer's fitting capability has theoretical proofs only under certain specific conditions (Balim et al., 2023; Li et al., 2023), with no similar conclusion for more general situations. The linear dynamic system can be considered as a first-order approximation of a state space model, indicating the potential of linear RNNs in addressing a broader range of POMDPs.

Long-term memory. The primary reason for the long-term dependency issues in RNNs is the challenge of gradient explosion or vanishing when input length increases during training (Pascanu et al., 2013). Transformers, due to their parallel structure, are less susceptible to this issue. To mitigate this problem, gate mechanisms are introduced to RNNs (Dey & Salem, 2017; Hochreiter & Schmidhuber, 1997). However, Kanai et al. (2017) indicate that the non-linear recurrence is the primary cause of gradient explosions. For linear RNNs, keeping the parameters $\lambda_i$ within $[0,1]$ at initialization addresses both gradient explosion and vanishing issues. This has been validated in certain supervised learning tasks with long-term dependencies (Orvieto et al., 2023; Gu et al., 2021).
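For the linear recurrence $x_t=\lambda x_{t-1}+u_t$, the sensitivity of $x_T$ to $x_0$ is $\lambda^T$, so its magnitude is $|\lambda|^T\le 1$ whenever $|\lambda|\le 1$. The sketch below reads the constraint as one on the magnitude $|\lambda_i|$ (since $\lambda_i$ may be complex) and samples eigenvalues on a ring $[r_{\min},r_{\max}]\subset[0,1]$. The sampling scheme and the values `r_min=0.9`, `r_max=0.999` are assumptions for illustration, not necessarily the initialization used in the experiments.

```python
import numpy as np

def sample_lambda(h, r_min, r_max, rng):
    """Sample h complex eigenvalues with r_min <= |lambda| <= r_max, uniform on the ring."""
    u1, u2 = rng.uniform(size=h), rng.uniform(size=h)
    radius = np.sqrt(u1 * (r_max**2 - r_min**2) + r_min**2)
    phase = 2.0 * np.pi * u2
    return radius * np.exp(1j * phase)

rng = np.random.default_rng(0)
lam = sample_lambda(h=8, r_min=0.9, r_max=0.999, rng=rng)

for T in (10, 100, 1000):
    jac = np.abs(lam) ** T               # |d x_T / d x_0| for each coordinate
    print(T, jac.min(), jac.max())
```

The sensitivities can never exceed one, so gradients cannot explode through the recurrence. Magnitudes close to one decay slowly (for example, $0.999^{1000}\approx 0.37$), which is what preserves long-range information, while smaller magnitudes forget faster.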

6 Experiments

In this section, we compare the effectiveness of three different sequence models (Transformer, RNN, and linear RNN) in addressing partially observable decision-making problems within the $(\texttt{SEQ},\texttt{RL})$ framework of Partially Observable RL. We choose GPT (Radford et al., 2019), LSTM (Hochreiter & Schmidhuber, 1997), and LRU (Orvieto et al., 2023) as the representative architectures for these three types of models. To substantiate our hypotheses, we conduct experiments in three distinct POMDP scenarios, detailed in Sections 6.1 to 6.3. These experiments are designed to assess the models from various perspectives: 1) POMDPs derived from certain regular languages, including EVEN PAIRS, PARITY, and SYM(5); 2) tasks from PyBullet Partially Observable environments (Ni et al., 2022) that require the ability of state space modeling; 3) tasks that require pure long-term memory capabilities, such as Passive T-Maze and Passive Visual Match (Hung et al., 2018). Comprehensive implementation details, task descriptions, and supplementary results are presented in Appendix D, where we also compare against several published Transformer-based RL methods.

6.1 POMDPs Derived from Regular Languages

We construct this type of POMDP problem following the approach of Proposition 4.1, and use DQN (Van Hasselt et al., 2016) as the RL component in $(\texttt{SEQ},\texttt{RL})$. Three regular language tasks correspond to the difficulty classification in Section 4.4. The learning curves are shown in Appendix D.5, Figure 16. We provide the experimental results on length extrapolation and model scale for this task in the appendix, where Theorem 4.4 and Theorem 4.6 are validated.

To look into how they model the regular languages, we visualize the hidden states in Figure 3c. Generally, all three sequence models can fit scenarios with short lengths. However, as the input length increases, LSTM exhibits the best fitting capability, followed by LRU, and GPT performs the least effectively. We observe that in POMDP tasks derived from the three regular languages, the distinct nature of these languages yields varied results:

Figure 3: Hidden states for regular language tasks. We visualize the hidden states of each sequence model during evaluation at length 25 using t-SNE (van der Maaten & Hinton, 2008) and annotate them according to their real states. Our classification corresponds to the state the observation history maps to in the reduced POMDP, namely $(q,w)$, while ‘T’ stands for the terminal state. The states with similar colors in the diagram generally produce the same type of observation. Panels: (a) EVEN PAIRS, (b) PARITY, (c) SYM(5).

EVEN PAIRS is a specific regular language that could be directly solved by memorizing the first character and comparing it with the last character, which aligns with the inductive bias of the attention mechanism. As a result, GPT solves $\mathcal{M}^{\texttt{EVEN PAIRS}}$ reasonably well.
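The shortcut for EVEN PAIRS can be verified exhaustively on short strings. The sketch assumes the usual definition of EVEN PAIRS over $\{a,b\}$ (the total number of `ab` and `ba` substrings is even), which is not restated in this section. The function names are hypothetical.

```python
from itertools import product

def even_pairs(s):
    """Assumed definition: the number of adjacent 'ab' or 'ba' pairs is even."""
    switches = sum(x != y for x, y in zip(s, s[1:]))
    return switches % 2 == 0

def first_equals_last(s):
    """Shortcut: compare only the first and the last character."""
    return s[0] == s[-1]

# Compare both predicates on every non-empty string of length at most 10.
agree = all(
    even_pairs("".join(w)) == first_equals_last("".join(w))
    for n in range(1, 11)
    for w in product("ab", repeat=n)
)
print(agree)
```

Every switch between `a` and `b` toggles whether the current character matches the first one, so membership depends on only two tokens. This is a lookup that attention handles naturally, in contrast to PARITY, where every token matters.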

PARITY is a regular language with a simple DFA in $\mathsf{TC}^0$. As shown in Figure 3b, LSTM and LRU are capable of accurately modeling $\mathcal{M}^{\texttt{PARITY}}$. Through colors, it can be observed that the hidden state of the Transformer is almost solely distinguished based on the current observation. It relies on processing the entire history through attention after encountering a terminal symbol. This is more like memorizing all the different strings, resulting in lower final returns.

SYM(5) is an $\mathsf{NC}^1$-complete regular language as mentioned in Section 4.2, and we have shown the inability of GPT to solve $\mathcal{M}^{\texttt{SYM(5)}}(n)$ in Theorem 4.4. Experimental results align with our claim: GPT performs worst in this task and fails to recover the true state.


Figure 4: Learning curves for PyBullet occlusion tasks. Mean of 5 seeds. The shaded area indicates 95% confidence intervals.
Table 2: Normalized scores for PyBullet occlusion tasks. We compare different sequence models LRU, GPT and LSTM. ‘V’ refers to ‘only velocities observable’, and ‘P’ refers to ‘only positions observable’. We present normalized scores defined in Appendix D.5, Equation 3. Blue highlight indicates the highest score, and orange highlight indicates the second-highest score.

Task | Type | LRU | GPT | LSTM
Ant | V | 29.8 ± 20.4 | 9.4 ± 7.1 | 7.4 ± 8.8
Ant | P | 81.2 ± 28.7 | 38.2 ± 26.0 | 5.7 ± 3.3
Cheetah | V | 96.8 ± 8.1 | 69.3 ± 6.9 | 98.1 ± 8.3
Cheetah | P | 109.9 ± 4.2 | 88.8 ± 6.3 | 112.5 ± 5.4
Hopper | V | 94.1 ± 23.4 | 13.5 ± 0.5 | 82.5 ± 37.1
Hopper | P | 147.9 ± 12.3 | 23.8 ± 18.0 | 184.1 ± 13.4
Walker | V | 61.7 ± 14.6 | 22.3 ± 6.0 | 12.2 ± 7.0
Walker | P | 79.3 ± 23.1 | 49.5 ± 4.8 | 94.6 ± 36.6
Average | | 88.2 | 39.3 | 74.6

Figure 5: Mean Squared Error (MSE) ratios for state space modeling tasks. ‘V’ refers to ‘only velocities observable’, and ‘P’ refers to ‘only positions observable’. To enable a comparative analysis of the performance among the three models, we present the MSE ratio, as defined in Appendix D.5, Equation 4.

6.2 PyBullet Partially Observable Environments

We conduct experiments on 8 partially observable environments, which are all PyBullet locomotion control tasks with parts of the observations occluded (Ni et al., 2022), and denote them as PyBullet Occlusion. These experiments encompass four distinct tasks: Ant, Cheetah, Hopper, and Walker, and we evaluate the models based on two types of observations: Velocities Only (V) and Positions Only (P). The normalized scores are reported in Table 2, and learning curves are provided in Figure 4. From the results, LRU outperforms GPT in all eight tasks, and LSTM outperforms GPT in five of the eight tasks and on average, matching our claim that the Transformer architecture struggles at modeling partially observable sequences. The results showing that LSTM outperforms GPT are also verified in Ni et al. (2023).

Moreover, the general performances of LRU and LSTM are notably comparable, and LRU significantly outperforms LSTM in certain tasks, namely Ant (P, V), and Walker (V). Such results demonstrate that after linearization, recurrent-based models can still effectively retain their capacity to model the sequence, and can serve as a well-rounded balance integrating the strengths of both Transformer and RNN architectures.

We conduct ablation experiments with full observability in Appendix D.5, Table 7, and the overall performances of the three models are close, affirming that GPT’s inferior performance in POMDP scenarios stems from partial observability rather than other factors.

Figure 6: Results in pure long-term memory environments with varying memory lengths. Blue line indicates the return of the optimal Markov policy, which only has access to the observation. Panels: (a) Passive T-Maze, (b) Passive Visual Match.

To better understand the capability to extract state information from observation sequences, we design two tasks. These tasks aim at determining the initial state, termed “Observability”, and forecasting the current state, referred to as “Constructability”, using historical observation sequences, and we adopt Mean Squared Error (MSE) as our training target. Our experiments are conducted on the D4RL medium-expert dataset (Fu et al., 2020) of the aforementioned tasks, and the results (illustrated in Figure 5) are presented as the average MSE ratios across these tasks. The findings reveal that, in both scenarios, GPT is notably less competent than the other two models. In contrast, the LRU model demonstrates capability on par with the LSTM model. This observation lends further support to our hypothesis that GPT's ability to reconstruct states from partially observable sequences is worse than that of the recurrent-based models.

6.3 Pure Long-term Memory Environments

Results for pure long-term memory environments, namely Passive T-Maze and Passive Visual Match, are provided in Figure 6, and learning curves are shown in Appendix D.5, Figure 17. In these experiments, we follow the work of Ni et al. (2023), which tests the long-term memory ability of Transformer-based agent and LSTM-based agent on two memory-demanding tasks. We observe that LRU performs comparably to GPT, while significantly outperforming LSTM. Furthermore, LRU beats GPT on Passive Visual Match, the harder task of the two which involves a complex reward function (Hung et al., 2018), showcasing its powerful long-term memory capability.

7 Conclusion

In this work, we challenge the suitability of Transformers as sequence models in Partially Observable RL. Through theoretical analysis and empirical evidence, we reveal Transformers' limitations in solving POMDPs, particularly their struggle with modeling regular languages, a key aspect of POMDPs. As a remedy to these issues, we propose LRU as a more effective alternative, combining the strengths of recurrence and attention. Supported by extensive experiments, our findings challenge the prevailing use of Transformers in sequential decision-making tasks, and open new avenues for exploring recurrent structures in complex, partially observable environments.

It is also important to acknowledge the limitations of our work. After introducing recurrence, LRU serves as a choice to combine the advantages of Transformer and RNN, while still lacking theoretical guarantees for modeling regular languages. Although LRU demonstrates satisfactory performance in experiments, there remains a need for further exploration in this direction. Additionally, the theoretical analysis in this paper focuses more on the exploitation aspect of RL, while lacking discussion on exploration. Complex POMDP tasks not only require suitable sequence models but also need to be paired with appropriate RL algorithms.

Impact Statement

Our work revisits the application of Transformers in RL, aiming to advance the development of decision intelligence. If misused in downstream tasks, it has the potential to lead to adverse effects such as privacy breaches and societal harm. Nevertheless, this is not directly related to our research, as our primary focus is on theoretical investigations.

References

  • Arora & Barak (2009) Arora, S. and Barak, B. Computational complexity: a modern approach. Cambridge University Press, 2009.
  • Baker et al. (2022) Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., and Clune, J. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022.
  • Balim et al. (2023) Balim, H., Du, Z., Oymak, S., and Ozay, N. Can transformers learn optimal filtering for unknown systems? arXiv preprint arXiv:2308.08536, 2023.
  • Barrington (1986) Barrington, D. A. Bounded-width polynomial-size branching programs recognize exactly those languages in NC$^1$. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pp.  1–5, 1986.
  • Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013. doi: 10.1613/jair.3912. URL https://doi.org/10.1613%2Fjair.3912.
  • Bhattamishra et al. (2020) Bhattamishra, S., Patel, A., and Goyal, N. On the computational power of transformers and its implications in sequence modeling. arXiv preprint arXiv:2006.09286, 2020.
  • Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Carrasco & Oncina (1994) Carrasco, R. C. and Oncina, J. Learning stochastic regular grammars by means of a state merging method. In International Colloquium on Grammatical Inference, pp.  139–152. Springer, 1994.
  • Chen et al. (2022) Chen, C., Wu, Y., Yoon, J., and Ahn, S. Transdreamer: Reinforcement learning with transformer world models. CoRR, abs/2202.09481, 2022. URL https://arxiv.org/abs/2202.09481.
  • Chen et al. (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. CoRR, abs/2106.01345, 2021. URL https://arxiv.org/abs/2106.01345.
  • Chiang & Cholak (2022) Chiang, D. and Cholak, P. Overcoming a theoretical limitation of self-attention. arXiv preprint arXiv:2202.12172, 2022.
  • Chung & Siegelmann (2021) Chung, S. and Siegelmann, H. Turing completeness of bounded-precision recurrent neural networks. Advances in Neural Information Processing Systems, 34:28431–28441, 2021.
  • Dai et al. (2019) Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  • Delétang et al. (2023) Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P. A. Neural networks and the chomsky hierarchy. In 11th International Conference on Learning Representations, 2023.
  • Deng et al. (2023) Deng, F., Park, J., and Ahn, S. Facing off world model backbones: Rnns, transformers, and s4. arXiv preprint arXiv:2307.02064, 2023.
  • Dey & Salem (2017) Dey, R. and Salem, F. M. Gate-variants of gated recurrent unit (gru) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), pp.  1597–1600. IEEE, 2017.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  • Dulac-Arnold et al. (2019) Dulac-Arnold, G., Mankowitz, D. J., and Hester, T. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019. URL http://arxiv.org/abs/1904.12901.
  • Elman (1990) Elman, J. L. Finding structure in time. Cognitive Science, 14(2):179–211, 1990. ISSN 0364-0213. doi: https://doi.org/10.1016/0364-0213(90)90002-E. URL https://www.sciencedirect.com/science/article/pii/036402139090002E.
  • Feng et al. (2023) Feng, G., Gu, Y., Zhang, B., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. arXiv preprint arXiv:2305.15408, 2023.
  • Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020. URL https://arxiv.org/abs/2004.07219.
  • Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  1582–1591. PMLR, 2018.
  • Gu et al. (2021) Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  1856–1865. PMLR, 2018.
  • Hafner et al. (2023) Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  • Hahn (2020a) Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171, December 2020a. ISSN 2307-387X. doi: 10.1162/tacl_a_00306. URL http://dx.doi.org/10.1162/tacl_a_00306.
  • Hahn (2020b) Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171, 2020b.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9:1735–80, 12 1997. doi: 10.1162/neco.1997.9.8.1735.
  • Horn & Johnson (2012) Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge university press, 2012.
  • Hu et al. (2023) Hu, K., Zheng, R. C., Gao, Y., and Xu, H. Decision transformer under random frame dropping. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=NmZXv4467ai.
  • Huang et al. (2022) Huang, F., Lu, K., Yuxi, C., Qin, Z., Fang, Y., Tian, G., and Li, G. Encoding recurrence into transformers. In The Eleventh International Conference on Learning Representations, 2022.
  • Hung et al. (2018) Hung, C., Lillicrap, T. P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., and Wayne, G. Optimizing agent behavior over long time scales by transporting value. CoRR, abs/1810.06721, 2018. URL http://arxiv.org/abs/1810.06721.
  • Hung et al. (2019) Hung, C.-C., Lillicrap, T., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., and Wayne, G. Optimizing agent behavior over long time scales by transporting value. Nature communications, 10(1):5223, 2019.
  • Hutchins et al. (2022) Hutchins, D., Schlag, I., Wu, Y., Dyer, E., and Neyshabur, B. Block-recurrent transformers. Advances in Neural Information Processing Systems, 35:33248–33261, 2022.
  • Irie et al. (2021) Irie, K., Schlag, I., Csordás, R., and Schmidhuber, J. Going beyond linear transformers with recurrent fast weight programmers. Advances in neural information processing systems, 34:7703–7717, 2021.
  • Irie et al. (2023) Irie, K., Csordás, R., and Schmidhuber, J. Practical computational power of linear transformers and their recurrent and self-referential extensions. arXiv preprint arXiv:2310.16076, 2023.
  • Jiang et al. (2023) Jiang, H., Li, Q., Li, Z., and Wang, S. A brief survey on the approximation theory for sequence modelling. arXiv preprint arXiv:2302.13752, 2023.
  • Kaelbling et al. (1998) Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998. ISSN 0004-3702. doi: https://doi.org/10.1016/S0004-3702(98)00023-X. URL https://www.sciencedirect.com/science/article/pii/S000437029800023X.
  • Kanai et al. (2017) Kanai, S., Fujiwara, Y., and Iwamura, S. Preventing gradient explosions in gated recurrent units. Advances in neural information processing systems, 30, 2017.
  • Katharopoulos et al. (2020) Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention, 2020.
  • Korsky & Berwick (2019) Korsky, S. A. and Berwick, R. C. On the computational power of rnns. arXiv preprint arXiv:1906.06349, 2019.
  • Laskin et al. (2022) Laskin, M., Wang, L., Oh, J., Parisotto, E., Spencer, S., Steigerwald, R., Strouse, D., Hansen, S., Filos, A., Brooks, E., et al. In-context reinforcement learning with algorithm distillation. arXiv preprint arXiv:2210.14215, 2022.
  • Lee et al. (2022) Lee, K.-H., Nachum, O., Yang, M. S., Lee, L., Freeman, D., Guadarrama, S., Fischer, I., Xu, W., Jang, E., Michalewski, H., et al. Multi-game decision transformers. Advances in Neural Information Processing Systems, 35:27921–27936, 2022.
  • Li et al. (2015) Li, X., Li, L., Gao, J., He, X., Chen, J., Deng, L., and He, J. Recurrent reinforcement learning: A hybrid approach. CoRR, abs/1509.03044, 2015. URL http://arxiv.org/abs/1509.03044.
  • Li et al. (2023) Li, Y., Ildiz, M. E., Papailiopoulos, D., and Oymak, S. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pp.  19565–19594. PMLR, 2023.
  • Lim et al. (2023) Lim, Y. H., Zhu, Q., Selfridge, J., and Kasim, M. F. Parallelizing non-linear sequential models over the sequence length. arXiv preprint arXiv:2309.12252, 2023.
  • Littman & Sutton (2001) Littman, M. and Sutton, R. S. Predictive representations of state. In Dietterich, T., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001. URL https://proceedings.neurips.cc/paper_files/paper/2001/file/1e4d36177d71bbb3558e43af9577d70e-Paper.pdf.
  • Lu et al. (2024) Lu, C., Schroecker, Y., Gu, A., Parisotto, E., Foerster, J., Singh, S., and Behbahani, F. Structured state space models for in-context reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Luo et al. (2022) Luo, S., Li, S., Zheng, S., Liu, T.-Y., Wang, L., and He, D. Your transformer may not be as powerful as you expect. Advances in Neural Information Processing Systems, 35:4301–4315, 2022.
  • Merrill (2019) Merrill, W. Sequential neural networks as automata. arXiv preprint arXiv:1906.01615, 2019.
  • Merrill & Sabharwal (2023) Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
  • Merrill et al. (2022) Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843–856, 2022.
  • Micheli et al. (2022) Micheli, V., Alonso, E., and Fleuret, F. Transformers are sample efficient world models. arXiv preprint arXiv:2209.00588, 2022.
  • Morad et al. (2023) Morad, S., Kortvelesy, R., Bettini, M., Liwicki, S., and Prorok, A. Popgym: Benchmarking partially observable reinforcement learning. arXiv preprint arXiv:2303.01859, 2023.
  • Ni et al. (2022) Ni, T., Eysenbach, B., and Salakhutdinov, R. Recurrent model-free RL can be a strong baseline for many pomdps. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  16691–16723. PMLR, 2022. URL https://proceedings.mlr.press/v162/ni22a.html.
  • Ni et al. (2023) Ni, T., Ma, M., Eysenbach, B., and Bacon, P.-L. When do transformers shine in rl? decoupling memory from credit assignment. arXiv preprint arXiv:2307.03864, 2023.
  • OpenAI (2023) OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  • Orvieto et al. (2023) Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gulcehre, C., Pascanu, R., and De, S. Resurrecting recurrent neural networks for long sequences. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Parisotto et al. (2019) Parisotto, E., Song, H. F., Rae, J. W., Pascanu, R., Gülçehre, Ç., Jayakumar, S. M., Jaderberg, M., Kaufman, R. L., Clark, A., Noury, S., Botvinick, M. M., Heess, N., and Hadsell, R. Stabilizing transformers for reinforcement learning. CoRR, abs/1910.06764, 2019. URL http://arxiv.org/abs/1910.06764.
  • Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International conference on machine learning, pp.  1310–1318. Pmlr, 2013.
  • Pasukonis et al. (2022) Pasukonis, J., Lillicrap, T., and Hafner, D. Evaluating long-term memory in 3d mazes. arXiv preprint arXiv:2210.13383, 2022.
  • Peng et al. (2023) Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K., He, X., Hou, H., Lin, J., Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau, H., Mantri, K. S. I., Mom, F., Saito, A., Song, G., Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang, R., Zhang, Z., Zhao, Q., Zhou, P., Zhou, Q., Zhu, J., and Zhu, R.-J. Rwkv: Reinventing rnns for the transformer era, 2023.
  • Pérez et al. (2021) Pérez, J., Barceló, P., and Marinkovic, J. Attention is turing complete. The Journal of Machine Learning Research, 22(1):3463–3497, 2021.
  • Pin (2013) Pin, J.-E. Syntactic semigroups. In Handbook of Formal Languages: Volume 1 Word, Language, Grammar, pp.  679–746. Springer, 2013.
  • Press et al. (2021) Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019. URL https://api.semanticscholar.org/CorpusID:160025533.
  • Reid et al. (2022) Reid, M., Yamada, Y., and Gu, S. S. Can wikipedia help offline reinforcement learning? CoRR, abs/2201.12122, 2022. URL https://arxiv.org/abs/2201.12122.
  • Robine et al. (2023) Robine, J., Höftmann, M., Uelwer, T., and Harmeling, S. Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023.
  • Rousseeuw (1987) Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.
  • Ruoss et al. (2023) Ruoss, A., Delétang, G., Genewein, T., Grau-Moya, J., Csordás, R., Bennani, M., Legg, S., and Veness, J. Randomized positional encodings boost length generalization of transformers. arXiv preprint arXiv:2305.16843, 2023.
  • Samsami et al. (2024) Samsami, M. R., Zholus, A., Rajendran, J., and Chandar, S. Mastering memory tasks with world models. arXiv preprint arXiv:2403.04253, 2024.
  • Schlag et al. (2021) Schlag, I., Irie, K., and Schmidhuber, J. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pp.  9355–9366. PMLR, 2021.
  • Shi et al. (2023) Shi, R., Liu, Y., Ze, Y., Du, S. S., and Xu, H. Unleashing the power of pre-trained language models for offline reinforcement learning. CoRR, abs/2310.20587, 2023. doi: 10.48550/ARXIV.2310.20587. URL https://doi.org/10.48550/arXiv.2310.20587.
  • Straubing (2012) Straubing, H. Finite automata, formal logic, and circuit complexity. Springer Science & Business Media, 2012.
  • Sun et al. (2023) Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. CoRR, abs/2307.08621, 2023. doi: 10.48550/ARXIV.2307.08621. URL https://doi.org/10.48550/arXiv.2307.08621.
  • Tay et al. (2020) Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., Yang, L., Ruder, S., and Metzler, D. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020.
  • van der Maaten & Hinton (2008) van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 11 2008.
  • Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp.  6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
  • Wang et al. (2022) Wang, L., Shen, B., Hu, B., and Cao, X. Can gradient descent provably learn linear dynamic systems? arXiv preprint arXiv:2211.10582, 2022.
  • Wu et al. (2023) Wu, Y., Wang, X., and Hamaya, M. Elastic decision transformer. CoRR, abs/2307.02484, 2023. doi: 10.48550/ARXIV.2307.02484. URL https://doi.org/10.48550/arXiv.2307.02484.
  • Yamagata et al. (2023) Yamagata, T., Khalil, A., and Santos-Rodríguez, R. Q-learning decision transformer: Leveraging dynamic programming for conditional sequence modelling in offline RL. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  38989–39007. PMLR, 2023. URL https://proceedings.mlr.press/v202/yamagata23a.html.
  • Yao (1989) Yao, A. C. Circuits and local computation. In Proceedings of the twenty-first annual ACM symposium on Theory of computing, pp.  186–196, 1989.
  • Yun et al. (2019) Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.
  • Zheng et al. (2022) Zheng, Q., Zhang, A., and Grover, A. Online decision transformer. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  27042–27059. PMLR, 2022. URL https://proceedings.mlr.press/v162/zheng22c.html.
  • Åström (1965) Åström, K. Optimal control of markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965. ISSN 0022-247X. doi: https://doi.org/10.1016/0022-247X(65)90154-X.

Appendix A Additional Background and Notation

A.1 Circuit complexity

In this subsection, we introduce several basic complexity classes, namely $\mathsf{AC}^0$, $\mathsf{TC}^0$, and $\mathsf{NC}^1$:

  • $\mathsf{AC}^0$ contains all languages that are decided by Boolean circuits with constant depth, unbounded fan-in, and polynomial size, consisting of AND gates, OR gates, and NOT gates;

  • $\mathsf{TC}^0$ extends $\mathsf{AC}^0$ with majority gates, which output true if and only if more than half of the input bits are true;

  • $\mathsf{NC}^1$ contains all languages that are decided by Boolean circuits with logarithmic depth $\mathcal{O}(\log n)$, where $n$ is the input length, constant fan-in, and polynomial size, consisting of AND gates, OR gates, and NOT gates.

These classes satisfy $\mathsf{AC}^0 \subseteq \mathsf{TC}^0 \subseteq \mathsf{NC}^1$, and it is commonly conjectured that $\mathsf{TC}^0 \neq \mathsf{NC}^1$, although this remains an open problem in computational complexity theory. A language $L \in \mathsf{NC}^1$ is $\mathsf{NC}^1$ complete w.r.t. $\mathsf{AC}^0$ reduction if for any $L' \in \mathsf{NC}^1$, $L' \leq_{\text{strong}} L$, i.e., $L'$ is reducible to $L$ under $\mathsf{AC}^0$ reduction. See Straubing (2012) for more details.

A.2 $\mathsf{NC}^1$ complete regular language

In this subsection, we introduce the approach of connecting regular languages and $\mathsf{NC}^1$ complete problems using syntactic monoid theory and Barrington’s theorem.

A.2.1 Syntactic monoid

The syntactic monoid is a concept in algebraic language theory that connects language recognition with group theory.

Definition A.1 (Syntactic congruence (Straubing, 2012)).

Let $A$ be a finite alphabet, and let $L \subseteq A^*$. We define an equivalence relation $\equiv_L$ on $A^*$: $x \equiv_L y$ iff

$$\{(u,v) \in A^* \times A^* : uxv \in L\} = \{(u,v) \in A^* \times A^* : uyv \in L\}.$$

Note that if $x \equiv_L y$ and $a \in A$, then $xa \equiv_L ya$ and $ax \equiv_L ay$. It follows that $\equiv_L$ is a congruence on $A^*$, called the syntactic congruence.

Definition A.2 (Syntactic monoid (Straubing, 2012)).

Given a language $L \subseteq A^*$, the quotient of $A^*$ by its congruence $\equiv_L$ is called the syntactic monoid of $L$ and is denoted as $M(L)$.

For a regular language $L$, determining $M(L)$ can be accomplished using a straightforward method (Pin, 2013). The procedure involves first computing its minimal DFA; the syntactic semigroup of $L$ is then equivalent to the transition semigroup $S$ of the DFA.
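The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `transition_monoid` and the encoding of the DFA as one tuple of next states per symbol are assumptions made for this example, and the identity map (the empty word) is included so that the result is a monoid. The example DFA recognizes binary strings with an even number of 1s.

```python
def transition_monoid(n_states, delta):
    """Enumerate the transition monoid of a DFA.

    delta maps each symbol to a tuple t with t[q] = next state from q
    (a hypothetical encoding chosen for this sketch).
    Each monoid element is the state map f_w induced by some word w.
    """
    identity = tuple(range(n_states))  # f_w for the empty word
    monoid = {identity}
    frontier = [identity]
    while frontier:
        f = frontier.pop()
        for g in delta.values():
            # Read the word behind f first, then one more symbol: q -> g[f[q]].
            h = tuple(g[f[q]] for q in range(n_states))
            if h not in monoid:
                monoid.add(h)
                frontier.append(h)
    return monoid


# Minimal DFA for "even number of 1s": state 0 = even, state 1 = odd.
parity_delta = {"0": (0, 1), "1": (1, 0)}
print(sorted(transition_monoid(2, parity_delta)))  # [(0, 1), (1, 0)]
```

For this language the monoid has two elements and is the cyclic group of order 2, which is solvable, so the language does not fall under Theorem A.3.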

A.2.2 Barrington’s Theorem

Barrington (1986) demonstrated that the word problem of the group $S_5$ is $\mathsf{NC}^1$ complete. The word problem of a group $G$ is defined as $\{g_1 \ldots g_n = e : g_i \in G\}$, i.e., the set of words over $G$ whose product is the identity $e$. The following theorem offers a comprehensive statement of Barrington’s work.
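To make the word problem concrete, the sketch below decides membership for words over $S_5$, with permutations encoded as tuples. All names (`compose`, `product_sequential`, `product_tree`, `in_word_problem`) are hypothetical helpers for this illustration. The tree-shaped product relies only on associativity and uses about $\log_2 n$ rounds of pairwise composition, which hints at why the problem lies in $\mathsf{NC}^1$; it is not a circuit construction and says nothing about hardness.

```python
IDENTITY = (0, 1, 2, 3, 4)


def compose(f, g):
    """Return the permutation that applies f first, then g."""
    return tuple(g[f[i]] for i in range(len(f)))


def product_sequential(word):
    """Multiply the word left to right, one element per step."""
    acc = IDENTITY
    for g in word:
        acc = compose(acc, g)
    return acc


def product_tree(word):
    """Multiply the word with a balanced tree of pairwise compositions."""
    if not word:
        return IDENTITY
    layer = list(word)
    while len(layer) > 1:
        nxt = [compose(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:  # carry the unpaired last element, preserving order
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]


def in_word_problem(word):
    """True iff the product of the word equals the identity."""
    return product_tree(word) == IDENTITY


five_cycle = (1, 2, 3, 4, 0)
print(in_word_problem([five_cycle] * 5))  # True
print(in_word_problem([five_cycle] * 4))  # False
```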

Theorem A.3 (Theorem IX.1.5 in Straubing (2012)).

Let $K$ be a regular language such that $M(K)$ is not solvable. Then for all $L \in \mathsf{NC}^1$, $L \leq_{\text{strong}} K$.

The reductions used are weaker than $\mathsf{NC}^1$; specifically, they are $\mathsf{AC}^0$ or $\mathsf{TC}^0$ reductions. This theorem connects to the original word problem of $S_5$ through the well-known fact that, for $n \geq 5$, the symmetric group $S_n$ is not solvable.

A.2.3 Examples of $\mathsf{NC}^1$ complete Regular Language

Proposition A.4.

If $L = ((0+1)^3(01^*0+1))^*$, then $L$ is $\mathsf{NC}^1$ complete.

[Figure 7 diagram omitted: states $q_0$ (start), $q_1$, $q_2$, $q_3$, $q_4$, with edges labeled 0 and 1.]
Figure 7: The minimal DFA of $((0+1)^3(01^*0+1))^*$
Proof.

Let $f_w : Q \to Q$ denote the element of the transition group of the minimal DFA of $L$ induced by the string $w$, where $f_w(q)$ is the state reached after reading $w$ from state $q$. As illustrated in Figure 7, the transition group contains the following elements:

$$f_0 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 & 0 \end{pmatrix},\quad f_1 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 0 & 4 \end{pmatrix}.$$

Then $f_1^{-1} = f_1^3 = f_{111}$ and $f_0^{-1} = f_0^4 = f_{0000}$. Note that $(0\ 1\ 2\ 3\ 4)$ and $(0\ 1\ 2\ 3)$ are generators of $S_5$, so $M(L) = S_5$ is not solvable. According to Theorem A.3, $L$ is $\mathsf{NC}^1$ complete. ∎
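The group-theoretic step of this proof can be checked numerically. The sketch below, an illustration rather than part of the original proof, closes $\{f_0, f_1\}$ under composition and counts the resulting permutations; the helper `generated` and the tuple encoding (index = current state, value = next state) are assumptions of this example. A permutation group on five points with $5! = 120$ elements must be all of $S_5$.

```python
# State maps read off from the two-row notation: entry q is the next state.
f0 = (1, 2, 3, 4, 0)  # reading symbol 0: the 5-cycle (0 1 2 3 4)
f1 = (1, 2, 3, 0, 4)  # reading symbol 1: the 4-cycle (0 1 2 3)


def generated(gens):
    """Close a set of permutations under composition, starting from the identity."""
    n = len(gens[0])
    identity = tuple(range(n))
    seen = {identity}
    frontier = [identity]
    while frontier:
        f = frontier.pop()
        for g in gens:
            h = tuple(g[f[q]] for q in range(n))  # apply f, then g
            if h not in seen:
                seen.add(h)
                frontier.append(h)
    return seen


elements = generated([f0, f1])
print(len(elements))  # 120
```

One way to see why this works: applying $f_1$ and then $f_0^{-1}$ swaps states 3 and 4 and fixes the rest, and a 5-cycle together with a transposition generates $S_5$.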

Appendix B Theoretical Results

B.1 Proof of Proposition 4.1

Proof of Proposition 4.1.

The proof is by construction. Given a regular language $L\subseteq\Sigma^{*}$, we append an end symbol $\#\notin\Sigma$ to obtain a new regular language $L^{\#}=(Q,\Sigma\cup\{\#\},\delta,F,q_{0})$ such that $w\in L$ iff $w\#\in L^{\#}$. Construct a POMDP $\mathcal{M}=(S,A,T,R,\Omega,O,\gamma)$. The state space $S$ is $Q\times(\Sigma\cup\{\#\})$. The action space $A$ is $\{\texttt{accept},\texttt{reject}\}$ and the observation space $\Omega$ is the alphabet $\Sigma\cup\{\#\}$. The initial state is $(q_{0},w_{0})$, where $w_{0}$ is randomly sampled from $\Sigma\cup\{\#\}$. Given a state $(q_{t},w_{t})$ at timestep $t$, the agent observes the character $w_{t}$.
If $w_{t}\neq\#$, the next state is $(q_{t+1},w_{t+1})$, where $q_{t+1}=\delta(q_{t},w_{t})$ and $w_{t+1}$ is randomly sampled from $\Sigma\cup\{\#\}$, and the agent receives no reward. If $w_{t}=\#$, the process terminates, and the agent receives a reward of $1$ if it correctly outputs whether $w=w_{0}\ldots w_{t-1}$ is accepted in $L$. Note that the optimal policy $\pi$ on $\mathcal{M}$ is to accept all $w\in L$ and reject all $w\notin L$, so if an algorithm $\mathcal{A}$ can solve POMDP problems, then $\mathcal{A}$ can recognize $L$. ∎
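The construction can be made concrete with a small simulation. The following is a minimal sketch rather than the authors' implementation: the class name `RegularLanguagePOMDP`, the helper `oracle_episode`, and the `(observation, reward, done)` return convention are all hypothetical. As a simplifying assumption, it uses a DFA for $L$ directly and handles the end symbol `#` in the environment logic instead of building a DFA for $L^{\#}$.

```python
import random


class RegularLanguagePOMDP:
    """Hypothetical sketch of the POMDP built from a DFA for L."""

    END = "#"

    def __init__(self, delta, q0, accepting, alphabet, seed=None):
        self.delta = delta                    # dict: (state, symbol) -> state
        self.q0 = q0
        self.accepting = set(accepting)
        self.symbols = list(alphabet) + [self.END]
        self.rng = random.Random(seed)

    def reset(self):
        # Hidden state is (q, w); only the character w is observed.
        self.q = self.q0
        self.w = self.rng.choice(self.symbols)
        return self.w

    def step(self, action):
        """Return (observation, reward, done)."""
        if self.w == self.END:
            # Episode ends; reward 1 only for the correct accept/reject call.
            in_language = self.q in self.accepting
            correct = (action == "accept") == in_language
            return None, (1.0 if correct else 0.0), True
        # Otherwise advance the DFA on the current character, no reward.
        self.q = self.delta[(self.q, self.w)]
        self.w = self.rng.choice(self.symbols)
        return self.w, 0.0, False


def oracle_episode(env, delta, q0, accepting):
    """Policy that tracks the DFA state from observations alone."""
    q, obs = q0, env.reset()
    while obs != env.END:
        q = delta[(q, obs)]
        obs, _, _ = env.step("reject")        # action is irrelevant before '#'
    _, reward, _ = env.step("accept" if q in accepting else "reject")
    return reward


# Example DFA: even number of 1s (state 0 = even, state 1 = odd).
PARITY_DELTA = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
```

A policy that reconstructs the DFA state from the observation history, as `oracle_episode` does, collects reward $1$ on every episode, which mirrors the claim that solving $\mathcal{M}$ amounts to recognizing $L$.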

B.2 Proof of Theorem 4.4

Lemma B.1 (Theorem 2 in (Merrill & Sabharwal, 2023)).

Given an integer $d$ and polynomial $Q$, any log-precision transformer with depth $d$ and hidden size $Q(n)$ operating on inputs in $\Sigma^{n}$ can be simulated by a logspace-uniform threshold circuit family of depth $3+(9+2d_{\oplus})d$.

Remark B.2.

The scope outlined by Lemma B.1 for Transformers is quite broad, as its description of FNNs allows for any log-precision function. Therefore, in the case of a log-precision $(\texttt{TF},\texttt{RL})$ algorithm, we can distribute the RL part across the last FNN layer of the original Transformer, treating the entire model as a single Transformer.

Proof of Theorem 4.4.

Proof by contradiction. Suppose there exist an integer $d$ and a polynomial $Q$ such that for any $n$, a log-precision $\mathcal{A}=(\texttt{TF},\texttt{RL})$ with depth $d$ and hidden size $Q(n)$ can solve $\mathcal{M}^{L}$, where $L$ is an $\mathsf{NC}^{1}$-complete regular language. Given $w\in\Sigma^{*}$, the algorithm $\mathcal{A}$ can decide whether $w\in L$ by checking whether the action output by $\mathcal{A}(w)$ is “accept”. Consequently, $\mathcal{A}$ can solve an $\mathsf{NC}^{1}$-complete problem.

At the same time, as stated in Remark B.2, we can treat $(\texttt{TF},\texttt{RL})$ as a single Transformer. Based on Lemma B.1, $\mathcal{A}$ can be interpreted as a logspace-uniform threshold circuit family of constant depth, indicating that $L\in\mathsf{TC}^{0}$. Because $L$ is $\mathsf{NC}^{1}$-complete, this would imply $\mathsf{NC}^{1}\subseteq\mathsf{TC}^{0}$ and hence $\mathsf{TC}^{0}=\mathsf{NC}^{1}$. Since we assume $\mathsf{TC}^{0}\neq\mathsf{NC}^{1}$, such an algorithm $\mathcal{A}$ cannot exist. ∎

B.3 Proof of Theorem 4.6

Lemma B.3.

Given an integer $n$, a symbol $a\in\Sigma$, and a regular language $L\subseteq\Sigma^{*}$, let $L[n]=\{x\in L:|x|=n\}$ and

$$P_{n,a}=\{(xa,x'a)\in L[n+1]\times(\bar{L})[n+1]:d(x,x')=1\},$$

where $d(\cdot,\cdot)$ denotes the number of positions at which $x$ and $x'$ differ. If $0<c(n,a)<|\Sigma|^{n}$, then $|P_{n,a}|>0$.

Proof.

Suppose $|P_{n,a}|=0$. Note that if $xa\in L[n+1]$, then $\{x'a:d(x,x')=1\}\subseteq L[n+1]$. Repeating this deduction, we can cover all $x\in\Sigma^{n}$. Hence, $\Sigma^{n}a\subseteq L[n+1]$, that is, $c(n,a)=|\Sigma|^{n}$. If instead $xa\notin L[n+1]$, then $xa\in(\bar{L})[n+1]$, and the same argument applied to $\bar{L}$ gives $c(n,a)=0$. Either case contradicts $0<c(n,a)<|\Sigma|^{n}$, so $|P_{n,a}|>0$. ∎
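Lemma B.3 can be checked by exhaustive enumeration on small instances. The sketch below assumes that $c(n,a)$ counts the prefixes $x\in\Sigma^{n}$ with $xa\in L$, which is how the proof above uses it; the function names `count_and_pairs`, `hamming`, and `parity` are hypothetical and chosen only for illustration.

```python
from itertools import product


def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def count_and_pairs(in_language, alphabet, n, a):
    """Return (c(n, a), |P_{n,a}|) by brute force.

    Assumption: c(n, a) = |{x in Sigma^n : xa in L}|.
    """
    prefixes = ["".join(p) for p in product(alphabet, repeat=n)]
    inside = [x for x in prefixes if in_language(x + a)]
    outside = [x for x in prefixes if not in_language(x + a)]
    # Pairs (xa, x'a) with xa in L, x'a not in L, and d(x, x') = 1.
    pairs = sum(1 for x in inside for y in outside if hamming(x, y) == 1)
    return len(inside), pairs


def parity(w):
    """Membership test for binary strings with an even number of 1s."""
    return w.count("1") % 2 == 0


c, p = count_and_pairs(parity, "01", n=4, a="1")
```

For the parity language with $n=4$ and $a=1$, exactly half of the $16$ prefixes give a string in the language, so $0<c(n,a)<|\Sigma|^{n}$, and flipping any single bit of such a prefix leaves the language, so $P_{n,a}$ is non-empty.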

Proof of Theorem 4.6.

Since $\Sigma$ is finite, $D=\max_{a,b\in\Sigma}\|u(a)-u(b)\|$ is finite, where $u(\cdot)$ denotes the embedding vector of the given symbol. By Lemma 4.5, there exists $n$ such that $\{n:|P_{n,a}|>0\}$ is infinite.

Therefore, there exist infinitely many pairs of sequences $u$ and $u'$ such that $u$ and $u'$ differ in only $1$ position, yet $u\in L$ while $u'\notin L$. According to Lemma 4.5,

$$\|x-x'\|\leq\|u-u'\|=O(D/n)=O(1/n)\;.$$

Since $\texttt{RL}$ is a Lipschitz function, there exists a constant $C$ such that

$$\|\texttt{RL}(x)-\texttt{RL}(x')\|\leq C\|x-x'\|=O(1/n)\;.$$

As $n$ increases, information from non-current time steps will only have a negligible impact on the output of $\texttt{RL}$. ∎
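The $O(D/n)$ scaling can be illustrated numerically in a deliberately simplified setting. The sketch below is not the bound of Lemma 4.5 itself, which is stated in the main text: it assumes uniform attention weights, so the representation is just the mean of the symbol embeddings, and it uses a hypothetical two-dimensional one-hot embedding `embed`. Under these assumptions, changing one symbol moves the representation by exactly $\|u(a)-u(b)\|/n\leq D/n$.

```python
import math


def mean_pool(sequence, embed):
    """Average the symbol embeddings (uniform attention, an assumption)."""
    dim = len(next(iter(embed.values())))
    n = len(sequence)
    return [sum(embed[s][k] for s in sequence) / n for k in range(dim)]


def l2(u, v):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical one-hot embeddings for the binary alphabet.
embed = {"0": [1.0, 0.0], "1": [0.0, 1.0]}
D = max(l2(embed[a], embed[b]) for a in embed for b in embed)


def flip_gap(n):
    """Distance between pooled representations after changing one symbol."""
    x = "0" * n
    x_flipped = "1" + "0" * (n - 1)
    return l2(mean_pool(x, embed), mean_pool(x_flipped, embed))


for n in (10, 100, 1000):
    print(n, flip_gap(n), D / n)
```

The printed gap shrinks proportionally to $1/n$, so any Lipschitz head applied on top of this representation can separate the two inputs by at most a constant multiple of that gap.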

Appendix C Discussion on Existing POMDP Problems

Using existing POMDP problems as examples, we demonstrate how POMDPs can be cast into regular languages.

Passive T-Maze (Ni et al., 2023). The movement strategy towards the corridor’s endpoint is akin to recognizing the regular language $0(01)^{*}$ with a DFA. If this DFA accepts the string formed by the entire history so far, then the agent moves upwards; otherwise, it moves downwards.

Passive Visual-Match (Hung et al., 2019). The complete state space of this environment is large, including the player coordinates, the coordinates of all fruits, whether each fruit has been collected, and other information. For convenience, we decouple this POMDP. Consistent with the analysis of Ni et al. (2023), the policy in this environment is divided into an immediate greedy policy and a long-term memory policy. For the long-term memory policy, the agent needs to recognize the regular language $0(01)^{*}$ as in Passive T-Maze. In addition, due to the existence of the greedy policy, the player needs a strategy to move from any position in the room to the endpoint. The states required for this strategy only involve the player coordinates, and inferring the current coordinates from historical information only requires a simple regular language. If we treat the grid cells of the current room as the states of a DFA and the actions $\{L,R,U,D\}$ as characters, then determining whether the player is at position $(x,y)$ is equivalent to recognizing the regular language accepted by a DFA whose terminal state is $(x,y)$.
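The grid-as-DFA view can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the environment's actual dynamics: the room is assumed to be a `width` by `height` rectangle, `U` is assumed to increase the $y$ coordinate, and a move into a wall is assumed to leave the position unchanged. The names `grid_delta` and `at_position` are hypothetical.

```python
# Displacement of each action character; the axis orientation is an assumption.
MOVES = {"L": (-1, 0), "R": (1, 0), "U": (0, 1), "D": (0, -1)}


def grid_delta(state, action, width, height):
    """DFA transition: grid cell x action character -> grid cell."""
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < width and 0 <= ny < height:
        return (nx, ny)
    return state  # assumption: bumping into a wall leaves the position unchanged


def at_position(actions, start, target, width, height):
    """Accept the action string iff the DFA ends in the cell `target`."""
    state = start
    for a in actions:
        state = grid_delta(state, a, width, height)
    return state == target
```

For example, `at_position("RRU", (0, 0), (2, 1), 3, 3)` accepts, so the action history alone determines the player's coordinates with a finite number of states, one per grid cell.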

Appendix D Experimental Details

D.1 Task descriptions

D.1.1 Pybullet tasks

Ant. This task is to simulate a quadruped robot resembling an ant. The objective is to develop a control policy that enables the ant to leverage its four legs for specific movements.

Walker. This task is to simulate a bipedal humanoid robot. The goal is to design a control strategy that facilitates stable and efficient walking, mimicking human-like locomotion patterns.

HalfCheetah. This task is to simulate a quadruped robot inspired by the cheetah’s anatomy. The aim is to devise a control policy that allows the robot to achieve rapid and agile locomotion.

Hopper. This task is to simulate a single-legged robot, and the objective is to develop a control strategy for jumping locomotion to achieve efficient forward movement.

Task (F) denotes the original task with full observation, while Task (V) and Task (P) denote the variants in which only velocities or only positions are observable, respectively. In Figure 8, we provide the visualization of each task.

Figure 8: Visualizations of Tasks in Pybullet. From left to right are Ant, Walker, HalfCheetah and Hopper.

D.1.2 POMDPs determined by regular languages

PARITY. Given a $01$ sequence, compute whether the number of $1$s is even. The formal expression is $0^{*}(10^{*}10^{*})^{*}$.

EVEN PAIRS. Given a $01$ sequence, compute whether its first and last bits are the same. The formal expression is written as $(0[01]^{*}0)|(1[01]^{*}1)$.

SYM(5). Given a $01$ sequence, compute whether it belongs to a case of $\mathsf{NC}^{1}$-complete regular languages, namely $S_{5}$, with the formal expression $((0+1)^{3}(01^{*}0+1))^{*}$.
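The first two languages above have very small DFAs, which can be written out explicitly. The sketch below follows the prose descriptions of PARITY and EVEN PAIRS; the state names and the helper `run_dfa` are hypothetical, and treating a single-character string as having equal first and last bits is an assumption made here. SYM(5) is omitted because its automaton is considerably larger.

```python
def run_dfa(delta, start, accepting, word):
    """Run a DFA given as a transition dict and report acceptance."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


# PARITY: track whether the number of 1s seen so far is even or odd.
PARITY = dict(
    delta={
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd", "0"): "odd", ("odd", "1"): "even",
    },
    start="even",
    accepting={"even"},
)

# EVEN PAIRS: each state name is (first bit + most recent bit).
_even_pairs_delta = {("start", b): b + b for b in "01"}
_even_pairs_delta.update(
    {(f + l, b): f + b for f in "01" for l in "01" for b in "01"}
)
EVEN_PAIRS = dict(
    delta=_even_pairs_delta,
    start="start",
    accepting={"00", "11"},
)
```

For instance, `run_dfa(word="0110", **PARITY)` and `run_dfa(word="0110", **EVEN_PAIRS)` both accept, while `run_dfa(word="0111", **PARITY)` rejects. Two states suffice for PARITY and five for EVEN PAIRS, which is why both are easy for a recurrent state update yet still require information from arbitrarily far back in the sequence.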

D.1.3 Pure long-term memory tasks

Passive T-Maze (Ni et al., 2023). The environment is a long corridor of length $L$ from the initial state $O$ to $J$, and $J$ is connected with two terminal states $G_{1}$ and $G_{2}$