Look into the Future: Deep Contextualized Sequential Recommendation

Lei Zheng (Shanghai Jiao Tong University, Shanghai, China; zhenglei2016@sjtu.edu.cn), Ning Li (Shanghai Jiao Tong University, Shanghai, China; lining01@sjtu.edu.cn), Yanhua Huang (Xiaohongshu Inc., Shanghai, China; yanhuahuang@xiaohongshu.com), Ruiwen Xu (Xiaohongshu Inc., Shanghai, China; ruiwenxu@xiaohongshu.com), Weinan Zhang (Shanghai Jiao Tong University, Shanghai, China; wnzhang@sjtu.edu.cn), and Yong Yu (Shanghai Jiao Tong University, Shanghai, China; yyu@apex.sjtu.edu.cn)

(2018)

Abstract.

Sequential recommendation aims to estimate how a user’s interests evolve over time via uncovering valuable patterns from user behavior history. Many previous sequential models have solely relied on users’ historical information to model the evolution of their interests, neglecting the crucial role that future information plays in accurately capturing these dynamics. However, effectively incorporating future information in sequential modeling is non-trivial since it is impossible to make the current-step prediction for any target user by leveraging his future data. In this paper, we propose a novel framework of sequential recommendation called Look into the Future (LIFT), which builds and leverages the contexts of sequential recommendation. In LIFT, the context of a target user’s interaction is represented based on i) his own past behaviors and ii) the past and future behaviors of the retrieved similar interactions from other users. As such, the learned context will be more informative and effective in predicting the target user’s behaviors in sequential recommendation without temporal data leakage. Furthermore, in order to exploit the intrinsic information embedded within the context itself, we introduce an innovative pretraining methodology incorporating behavior masking. In our extensive experiments on five real-world datasets, LIFT achieves significant performance improvement on click-through rate prediction and rating prediction tasks in sequential recommendation over strong baselines, demonstrating that retrieving and leveraging relevant contexts from the global user pool greatly benefits sequential recommendation. The experiment code is provided at https://anonymous.4open.science/r/LIFT-277C/Readme.md.

Sequential Recommendation, Context Representation, Retrieval-Enhanced Methods, Pretraining

Conference: Proceedings of the 31st International ACM SIGKDD Conference on Knowledge Discovery & Data Mining; August 3–7, 2025; Toronto, ON, Canada

1. Introduction

Deep learning has been widely adopted for predicting user behaviors in online recommender systems (Zhang et al., 2021). For the scenarios with sequences of user behaviors, sequential recommendation techniques, which extract valuable information from the past behavior sequence of the target user to predict his next behavior, have been well studied (Pi et al., 2019; Huang et al., 2018).

As illustrated in Figure 1, the major techniques for deep learning-based sequential recommendation are generally threefold. Firstly, deep architectures are designed to extend the capability of traditional collaborative filtering models for better mining of the feature interactions (Guo et al., 2017; Zhang et al., 2016; Wang et al., 2017). Secondly, different models like recurrent neural networks (RNN), memory networks, or transformers are adopted to learn an effective representation of the user behavior sequence, based on which a deep learning predictor is built to predict the behavior label for the current-step recommendation (Pi et al., 2019; Chen et al., 2018; Sun et al., 2019). Thirdly, to deal with the long behavior sequence problem, retrieval methods are leveraged to fetch much earlier yet relevant behaviors, which are then aggregated according to the target prediction condition and fed into the final label predictor (Qi et al., 2020; Qin et al., 2020).

The primary aim of sequence modeling in recommender systems is to capture the evolving trends in user interests (Zhou et al., 2019). Merely relying on historical data is insufficient to model these dynamic changes effectively (Yuan et al., 2020). Incorporating future information provides a hindsight perspective to help predict shifts in user preferences. For instance, after purchasing a smartphone, a user is likely to show significant interest in related accessories. However, incorporating future data for predictive modeling without causing data leakage presents a significant challenge. Most existing models (Zhou et al., 2018b; Kang and McAuley, 2018) neglect the potential of future user behaviors, while a few attempts (Yuan et al., 2020; Sun et al., 2019) implicitly use future information during training but fail to leverage it during the inference stage, leaving a significant potential for improvement.

To address this issue, we employ retrieval-based methods to introduce context that contains future information into the model. Previously, retrieval techniques have been integrated into sequential recommendations to effectively access and leverage longer historical information (Pi et al., 2019; Qin et al., 2020). Nonetheless, these retrieval methods typically focus on raw data from history, overlooking the importance of context. The interaction sequences of similar users with similar items tend to follow comparable patterns. By utilizing retrieval methods, we can fetch similar contexts in the log data from other users to approximate the future information of the target user. This approach eliminates the issue of temporal data leakage while enhancing the model’s ability to predict user behaviors.

In this paper, we propose a novel framework of sequential recommendation called Look into the Future (LIFT), which focuses on retrieval in the user contextual information space while leveraging the future behaviors of the retrieved contexts, as illustrated on the bottom part of Figure 1. As mentioned above, it is infeasible to use real future information to predict the current behavior. Thus, given the current-step context with the candidate item for behavior label prediction, LIFT performs retrieval to fetch the most similar interaction behavior from the whole user pool. Then, extending from each of the retrieved behaviors, the historical and future behavior data are imported to enrich the context of that retrieved behavior. Note that each of the included future behaviors is still earlier than the current timestep of the target user behavior, which avoids any data leakage issue. Such future behaviors can be regarded as a kind of privileged information (Vapnik et al., 2015; Xu et al., 2020), which is not accessible for current-step prediction but is accessible for historical behavior data. To our knowledge, there is no previous work on sequential recommendation that performs effective retrieval of multiple users’ behavior contexts that include both past and future behaviors.

Figure 1. The comparison between LIFT and conventional models entails several key distinctions: a) Traditional models rely solely on instant user and item information when making predictions. b) Sequential models, conversely, typically incorporate the user’s historical interactions to capture their evolving interests over time. c) Retrieval-based models perform retrieval to fetch far-before but relevant historical behaviors to build the user profile for predictions. d) LIFT focuses on interaction context, encompassing both the historical and future sequence of interactions for each user-item interaction.

Furthermore, it is crucial not only to utilize information derived from the target sample but also to exploit the inherent information within the context itself for the purpose of enhancing contextual representation. Therefore, besides the supervised training with the label prediction loss, a representation learning loss is very important in our task. To learn an effective representation of the context with both history and future behavior data, we further devise a pretraining method that performs masked behavior prediction during the pretraining stage. Moreover, to reduce the noise introduced by the retrieved data, we design an attention mechanism that assigns different weights to the retrieved data using dense embeddings of users and items. We conduct extensive experiments on five real-world sequential recommendation datasets with click-through rate prediction and top-N ranking tasks, where LIFT demonstrates significant performance improvements over strong baselines.

Overall, the main contributions of this paper are threefold.

• We propose a novel LIFT framework that incorporates future information as part of the context. By leveraging retrieval techniques, LIFT utilizes relevant contextual information to enhance the prediction performance of sequential recommendations while avoiding temporal data leakage. To our knowledge, this is the first work to integrate future information comprehensively in both training and inference phases within a recommender system based on retrieval techniques.
• In the pursuit of obtaining valuable representations of retrieved user behaviors, we adopt an approach that leverages intrinsic contextual information. Specifically, we devise a pretraining methodology that is tailored to the format of user sequence data and introduces a novel self-supervised loss function referred to as mask behavior loss.
• To avoid noise in the retrieved behavior sequences, we propose a key-based attention mechanism to aggregate the retrieved data.

2. Related Work

Sequential Recommendation. For sequential recommendation, user behavior modeling is the core technique that mines user preferences from historical interaction behaviors. To better extract informative knowledge from a user’s behavior sequence, various network structures have been proposed, including Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Attention Networks, and Memory Networks. GRU4Rec (Hidasi et al., 2015) employs Gated Recurrent Units (GRUs) to capture the preference-evolving relationship, while Caser (Tang and Wang, 2018) leverages horizontal and vertical convolutions to model skip behaviors at both union-level and point-level. Moreover, the attention mechanism is the most popular method for modeling item dependencies, and several influential works have been proposed, including SASRec (Kang and McAuley, 2018), DIN (Zhou et al., 2018b), DIEN (Zhou et al., 2019), and BERT4Rec (Sun et al., 2019). To better mine user preferences from their historical behaviors, two lines of work have been proposed to further extend behavior modeling. The first one is multi-behavior modeling (Zhou et al., 2018a; Yuan et al., 2022), which explicitly leverages different types of behavior (e.g., click and purchase behaviors) to measure item correlations within different behavior sequences, thus capturing users’ diverse interests. The second one is behavior modeling with side information (Zhang et al., 2019), which involves various item attribute features (e.g., category) in addition to item IDs for better exploiting the rich knowledge.

Retrieval-enhanced Recommendation. To enhance the performance of recommender systems, the retrieval-enhanced recommendation is proposed, where the most relevant items are retrieved from an extremely long behavior sequence. Specifically, UBR4CTR (Qin et al., 2020) and SIM (Qi et al., 2020) are designed to retrieve beneficial behaviors from the user’s historical behavior, where UBR4CTR deploys the search engine method while SIM uses the hard search and soft search approaches. To make the search procedure end-to-end, ETA (Chen et al., 2021) is proposed by leveraging the SimHash algorithm to map the user behavior into a low-dimensional space, hence achieving learnable retrieval. Moreover, recent works further extend the retrieval-enhanced recommendation from item-level retrieval to sample-level retrieval. RIM (Qin et al., 2021) is the first to deploy this method that leverages the search engine to retrieve several relevant samples from the search pool and performs neighbor aggregation.

Pretraining for Recommendation. Pretraining the deep learning models (or the data representation) with self-supervised learning methods has been widely studied in natural language processing and computer vision. Generally, there are two major categories of self-supervised training methods, namely contrastive learning (Gao et al., 2021; Chen et al., 2020) and mask recovery (Brown et al., 2020; Jacob et al., 2019; He et al., 2022). For recommender systems or tabular data prediction tasks, there are some recent attempts in this direction. To list a few examples, BERT4Rec (Sun et al., 2019) focuses on learning the representation of sequential behaviors via masking the final item of a subsequence of user behaviors. SCARF (Bahri et al., 2022) raises a contrastive learning loss via feature corruption over the tabular data. MISS (Guo et al., 2022) proposes interest-level contrastive losses to take the place of sample-level losses in order to mine self-supervision signals from user behaviors of multiple interests. S3 (Zhou et al., 2020) proposes leveraging inherent data correlations to generate self-supervision signals and improve data representations through pretraining methods. However, these works borrow from the Cloze task in natural language processing by masking items to have the model predict item-related information, without considering the crucial role of behaviors in recommender systems. Also, there are recent attempts to leverage the knowledge from the pretraining work on outsourced data, such as knowledge graphs (Wong et al., 2021) and language corpora (Cui et al., 2022), to enhance the recommender systems.

3. Formulation & Preliminaries

In this section, we formulate the studied problem and introduce the preliminaries. A recommendation dataset $\mathcal{D}=\{d_{z}\}_{z=1}^{N}$ is represented as a set of interactions $d_{z}$ between user $u_{z}$ and item $i_{z}$. Each interaction $d_{z}$ can be formulated as the feature-label pair $(x_{z},y_{z})$, where $x_{z}=\{x_{j}^{z}\}_{j=1}^{M}$ contains the $M$ user and item features, and the label $y_{z}\in\{0,1\}$ indicates whether the user clicks the item, as in click-through rate (CTR) prediction.

A user’s interaction sequence $s$ can be defined as a list of consecutive interactions for the same user sorted by time $s=[d_{1},d_{2},\ldots,d_{T}]$ where $d_{1}$ is the earliest interaction and $d_{T}$ is the latest one. We define the last $L$ interactions before $d_{z}$ as the history sequence $h_{z}$ and the future $L$ interactions after $d_{z}$ as the future sequence $f_{z}$ . Then we define the full context $c_{z}$ as the combination of the history interaction sequence $h_{z}$ and future interaction sequence $f_{z}$ :

$$c_{z}=(h_{z},f_{z}). \tag{1}$$

To represent $d_{z}$ , previous works on sequential recommendation only consider modeling the historical part of the context (Hidasi et al., 2015; Zhou et al., 2018b), while in this work, we use the full context to model $d_{z}$ .
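The definition of $h_{z}$, $f_{z}$, and $c_{z}$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `build_context` is hypothetical, indices are 0-based, and windows near the two ends of a sequence are simply allowed to be shorter than $L$ (the text does not specify how boundaries are handled).

```python
from typing import List, Sequence, Tuple, TypeVar

D = TypeVar("D")  # type of one interaction d_z


def build_context(seq: Sequence[D], z: int, L: int) -> Tuple[List[D], List[D]]:
    """Return (h_z, f_z) for the interaction at 0-based index z.

    `seq` is one user's interaction sequence sorted by time.
    Shorter windows at the sequence boundaries are an assumption of this sketch.
    """
    history = list(seq[max(0, z - L):z])   # up to L interactions strictly before d_z
    future = list(seq[z + 1:z + 1 + L])    # up to L interactions strictly after d_z
    return history, future


s = ["d1", "d2", "d3", "d4", "d5", "d6"]
h, f = build_context(s, z=2, L=2)          # context of d3
print(h, f)                                # ['d1', 'd2'] ['d4', 'd5']
```

The target interaction itself is excluded from both windows, so the full context $c_{z}$ never contains the label that is to be predicted.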

The goal of a traditional recommendation task is to predict the label of the interaction $d_{z}$ based on the features $x_{z}$ from users, items, and the user $u_{z}$ ’s history behavior sequences. Such a prediction can be formulated as

$$\hat{y}_{z}=F_{\theta}(x_{z},h_{z}), \tag{2}$$

where $F_{\theta}$ is the learning model with the parameter $\theta$ .

Here, we aim to use both the historical information $h_{z}$ and the future information $f_{z}$. However, the future information for $d_{z}$ is not visible in the inference stage. We therefore design a retrieval-based framework that retrieves the future sequences of the most relevant interactions as a substitute for the future sequence of $d_{z}$. Moreover, we also use the history sequences of the most similar interactions to enhance the historical information.

Following the experiment setting from other retrieval-based sequential recommendation methods (Qin et al., 2021), we split the dataset into $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{test}}$, and $\mathcal{D}_{\text{retrieval}}$ for training, testing, and retrieval, respectively. With the relevant context sequences, which include future sequences and history sequences, the label prediction is formulated as

$$\hat{y}_{z}=F_{\theta}(x_{z},h_{z},C_{z}), \tag{3}$$

where $C_{z}$ represents the context sequences for $d_{z}$ and

$$C_{z}=R(x_{z}), \tag{4}$$

where $R$ is a retriever that fetches relevant context sequences according to $x_{z}$. We first retrieve relevant interactions $\{d_{r}\}$ and then use the retrieved interactions to find their contexts and build $C_{z}$. We describe the procedure in detail in the next section.

Traditional methods only use the label from the target interaction $d_{z}$ to train $F_{\theta}$. However, if we only use the labels from the target interaction $d_{z}$, we neglect most of the behavior information in the retrieved context sequences $C_{z}$. Inspired by pretrained language models (Devlin et al., 2019; Brown et al., 2020), which have achieved tremendous success in the field of natural language processing (NLP) by learning universal representations in a self-supervised manner, we propose a self-supervised pretraining method that is supervised by the signal from the retrieved sequences themselves. We denote the pretrained encoder as $E_{\omega}$, where the input of the encoder is any user interaction sequence $s$ and the output of $E_{\omega}$ is a $v$-dimensional vector. The embedding of any user interaction sequence $s$ is written as

$$\mathbf{s}=E_{\omega}(s), \tag{5}$$

where $\mathbf{s}\in\mathbb{R}^{1\times v}$. The set of retrieved context sequences includes a history sequence set $H$ and a future sequence set $F$. We retrieve $K$ context sequences, where $K$ can be tuned in this framework. Then the retrieved context embedding set is written as

$$\mathbf{C}=(\mathbf{H},\mathbf{F})=(E_{\omega}(H),E_{\omega}(F)), \tag{6}$$

where $\mathbf{H}\in\mathbb{R}^{K\times v}$, $\mathbf{F}\in\mathbb{R}^{K\times v}$, and $\mathbf{C}\in\mathbb{R}^{K\times 2v}$.

As such, the parameters can be optimized by minimizing the self-supervised loss function as

$$\omega^{*}=\underset{\omega}{\arg\min}\,\sum_{s\in\mathcal{S}_{\text{pretrain}}}\mathcal{L}_{\text{pretrain}}(E_{\omega}(s)), \tag{7}$$

where $\mathcal{S}_{\text{pretrain}}$ is a sequence set sampled from the retrieval dataset $\mathcal{D}_{\text{retrieval}}$. As the retriever $R$ is non-parametric, we optimize $F_{\theta}$ and $E_{\omega}$ independently. Specifically, after pretraining $E_{\omega}$ and fixing its parameters, we train $F_{\theta}$ via

$$\theta^{*}=\underset{\theta}{\arg\min}\,\sum_{(x_{z},y_{z})\in\mathcal{D}_{\text{train}}}\mathcal{L}_{\text{prediction}}(F_{\theta}(x_{z},\mathbf{h}_{z},\mathbf{C}_{z}),y_{z}), \tag{8}$$

where $\mathbf{h}_{z}$ is the history sequence of $u_{z}$ , encoded by $E_{\omega}$ .

4. The LIFT Framework

In this section, we provide an overview of the whole LIFT framework and then describe its components, namely the encoder, retriever, and predictor, in detail.

4.1. Overview

There are three major components in LIFT, i.e., the encoder, retriever, and predictor. The encoder $E$ is responsible for encoding the user sequence to a $v$ -dimensional vector. Given target interaction $d_{t}$ , its feature part $x_{t}$ is regarded as a query, and the label part $y_{t}$ is to be predicted. The retriever $R$ is used to retrieve relevant context sequences $C_{t}$ . The predictor $F_{\theta}$ makes use of the feature vector $x_{t}$ , the user $u_{t}$ ’s history sequence $h_{t}$ , and the relevant context sequence set $C_{t}$ to predict the final label $\hat{y_{t}}$ .

There are two steps in the LIFT framework: pretraining and predicting. During pretraining, an encoder $E$ is trained on $\mathcal{S}_{\text{pretrain}}$, which is sampled from $\mathcal{D}_{\text{retrieval}}$. For predicting, the LIFT model first uses the retriever $R$ to find the top-$K$ relevant user context sequences $C_{t}$. Then, each sequence in $C_{t}$ is encoded by $E_{\omega}$. After aggregating the information of the target user and item $x_{t}$, the target user’s history $h_{t}$, and the retrieved sequences $C_{t}$, the predictor $F_{\theta}$ generates the output label prediction.

The main contribution of the LIFT framework is that it makes use of the full user context to help predict the target user behavior. Another distinctive feature of LIFT is that it applies pretraining in a new paradigm, which significantly improves the prediction performance in our experiments and may open a new research direction for the sequential recommendation task. Next, we describe the three components of the LIFT framework in detail.

4.2. Encoder

The encoder $E_{\omega}$ maps a user interaction sequence into a real-valued vector. It is pretrained separately in the framework.

Model Architecture. LIFT adopts a decoder-only Transformer as the encoder, which is based on the original implementation described in Vaswani et al. (2017). The Transformer has been shown to perform strongly in language modeling (Brown et al., 2020; Devlin et al., 2019), machine translation (Vaswani et al., 2017), recommender systems (Sun et al., 2019; Zhang et al., 2024), etc. Different tasks may leverage different parts of the Transformer architecture. The difference between the Transformer encoder and decoder is that the self-attention sub-layer in the decoder is modified to make sure that the prediction in position $p$ can only depend on the positions earlier than $p$ (Vaswani et al., 2017). Because the data in recommender systems exhibit temporal characteristics, we use the decoder part of the Transformer. We also compare the performance of different encoders in the experiments, which shows the superiority of the Transformer decoder.
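The causal restriction of the decoder's self-attention can be illustrated with a small standard-library sketch. It is not the LIFT implementation: the function names are hypothetical, the attention scores are dummy zeros, and the mask follows the common convention in which a position may attend to itself and to all earlier positions.

```python
import math
from typing import List


def causal_mask(T: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(T)] for q in range(T)]


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over the allowed positions only; disallowed positions get weight 0."""
    top = max(s for s, ok in zip(scores, allowed) if ok)  # for numerical stability
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


T = 4
mask = causal_mask(T)
raw_scores = [[0.0] * T for _ in range(T)]  # dummy scores, for illustration only
attention = [masked_softmax(raw_scores[q], mask[q]) for q in range(T)]
for row in attention:
    print([round(a, 2) for a in row])
```

With such a mask, only the last position receives non-zero weights over the whole sequence, which is why a single final hidden state can summarize all $T$ interactions.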

Input/Output. The input of the encoder $E_{\omega}$ is a user interaction sequence $s=\{d_{1},d_{2},\ldots,d_{T}\}$ . The output of $E_{\omega}$ is an embedding representing the sequence.

To feed the interaction $d_{z}$ into the encoder, its feature part $x_{z}$ is first passed through an embedding layer. The embedding layer maps the one-hot encoded vector $x_{z}$ into a dense vector $\mathbf{x}_{z}$, where each feature $a_{i}$ in $x_{z}$ is mapped into a $w$-dimensional vector and the results are concatenated into one $(M\times w)$-dimensional dense vector. The label part of $d_{z}$ is preserved. In the training stage, if the interaction $d_{z}$ is chosen to be masked, the label is replaced by a special token [MSK]. We describe the training loss below. After this processing, the sequence is fed into the Transformer decoder sequentially, and the output is a set of hidden states $O=\{o_{1},o_{2},\ldots,o_{T}\}$. Because only the last step in the decoder can utilize the full information in the sequence, we choose the final state $o_{T}$ as the representation $\mathbf{s}$ of the interaction sequence.

Pretraining Data Preparation. The pretraining dataset $\mathcal{S}_{\text{pretrain}}$ is sampled from $\mathcal{D}_{\text{retrieval}}$. For each user $u$ in $\mathcal{U}$, we use all the interactions of $u$ in $\mathcal{D}_{\text{retrieval}}$ to assemble an interaction sequence $s_{u}$, which yields a sequence set $S$. For any sequence $s$ in $S$, we extract a subsequence every $L$ interactions. These subsequences form the pretraining dataset $\mathcal{S}_{\text{pretrain}}$.
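A minimal sketch of this data preparation step is given below. Several details are assumptions made for illustration: the record layout `(user_id, timestamp, features, label)`, the function name `build_pretrain_sequences`, the reading of "every $L$ interactions" as non-overlapping chunks of length $L$, and the choice to keep a shorter trailing chunk.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Record = Tuple[str, int, Any, int]  # (user_id, timestamp, features x, label y)


def build_pretrain_sequences(interactions: Iterable[Record], L: int) -> List[List[Tuple[Any, int]]]:
    """Group interactions by user, sort each group by time, and cut it into chunks of length L."""
    per_user: Dict[str, List[Tuple[int, Any, int]]] = defaultdict(list)
    for user_id, ts, x, y in interactions:
        per_user[user_id].append((ts, x, y))

    subsequences = []
    for user_id in sorted(per_user):
        seq = sorted(per_user[user_id], key=lambda r: r[0])  # earliest interaction first
        for start in range(0, len(seq), L):
            chunk = seq[start:start + L]                     # the last chunk may be shorter
            subsequences.append([(x, y) for _, x, y in chunk])
    return subsequences


log = [("u1", 3, "c", 1), ("u1", 1, "a", 0), ("u2", 5, "e", 1), ("u1", 2, "b", 1)]
print(build_pretrain_sequences(log, L=2))
# [[('a', 0), ('b', 1)], [('c', 1)], [('e', 1)]]
```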

Mask Behavior Loss. To leverage the inherent information contained within the context, we formulate a self-supervised loss function. Certain existing sequential recommendation models (Zhou et al., 2018b, 2019) consider only the target interaction label as the prediction target for any given interaction sequence and omit all other interaction labels within the sequence. We argue that incorporating more behavioral information from the context as the supervision signal leads to a more comprehensive representation of the context. Moreover, compared to previous pretraining methods (Sun et al., 2019; Zhou et al., 2020) that mask items or item parts, we propose masking user behavior to better incorporate user behavior information into the pretrained model, enhancing user-item interaction modeling.

As shown in Figure 2, the left part describes the encoder architecture and loss in the pretraining procedure. In order to train a deep representation of the sequence, we mask parts of the interaction’s labels and then predict those masked behaviors. Unlike the Cloze task (Taylor, 1953) in natural language processing, it is difficult to predict the masked tokens in the user interaction sequences because the data in recommendation systems lacks syntactic structure like that found in natural language. An interaction $d_{i}$ in the input sequence $s$ can be represented as:

$$d_{i}=\begin{cases}(x_{i},\texttt{MSK}) & \text{if $d_{i}$ is a masked behavior}\\ (x_{i},y_{i}) & \text{otherwise}\end{cases} \tag{9}$$

where $i\in[1,T]$ .
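The masking rule of Eq. (9) can be sketched as follows. The helper `mask_behaviors`, the uniform random choice of masked positions, and the rule of masking at least one position are assumptions of this sketch; the text only states that part of the labels is masked according to a mask ratio.

```python
import random
from typing import Any, Dict, List, Tuple

MSK = "[MSK]"  # special token that replaces a masked label


def mask_behaviors(seq: List[Tuple[Any, int]], mask_ratio: float, rng: random.Random):
    """Replace the labels of randomly chosen interactions with MSK.

    Returns the masked sequence and a dict {position: original label} of targets.
    Features x_i are always kept; only the behavior label y_i is hidden.
    """
    if not seq:
        return [], {}
    n_mask = min(len(seq), max(1, round(mask_ratio * len(seq))))
    masked_idx = set(rng.sample(range(len(seq)), n_mask))
    masked_seq = [(x, MSK if i in masked_idx else y) for i, (x, y) in enumerate(seq)]
    targets: Dict[int, int] = {i: seq[i][1] for i in masked_idx}
    return masked_seq, targets


sequence = [("a", 1), ("b", 0), ("c", 1), ("d", 0), ("e", 1)]
masked_sequence, targets = mask_behaviors(sequence, mask_ratio=0.4, rng=random.Random(0))
print(masked_sequence)
print(targets)
```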

We adopt the binary cross entropy loss as the objective function. In the pre-training procedure, we feed the masked interactions’ corresponding outputs as input into a multi-layer perceptron (MLP) to predict the masked labels. Thus the pretrain loss $\mathcal{L}_{\text{pretrain}}$ can be represented as

$$\mathcal{L}_{\text{pretrain}}=-\sum_{d_{i}\in\mathcal{M}}\left[y_{i}\log\sigma(\text{MLP}(o_{i}))+(1-y_{i})\log(1-\sigma(\text{MLP}(o_{i})))\right], \tag{10}$$

where $\mathcal{M}$ is the set of masked interactions and $\sigma(q)=1/(1+e^{-q})$ is the sigmoid function. We apply $\mathcal{L}_{\text{pretrain}}$ to every sequence in $\mathcal{S}_{\text{pretrain}}$. In our experiments, we test different mask ratios, and the results show that the best mask ratio differs across datasets. In contrast to the mask item loss (Sun et al., 2019), when dealing with sequences that are annotated with multiple behaviors, it is relatively more feasible for the mask behavior loss to extend to and cover these labeled behaviors.
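As a small worked example of the mask behavior loss, the sketch below sums the binary cross-entropy over the masked positions only. The list `logits` is a stand-in for the values $\text{MLP}(o_{i})$, which would come from the encoder and the MLP head in practice; the function name is hypothetical and no numerical safeguards for extreme logits are included.

```python
import math
from typing import Iterable, List


def sigmoid(q: float) -> float:
    return 1.0 / (1.0 + math.exp(-q))


def mask_behavior_loss(logits: List[float], labels: List[int], masked_idx: Iterable[int]) -> float:
    """Binary cross-entropy summed over the masked positions only.

    logits[i] plays the role of MLP(o_i); labels[i] is the original y_i.
    """
    loss = 0.0
    for i in masked_idx:
        p = sigmoid(logits[i])
        loss -= labels[i] * math.log(p) + (1 - labels[i]) * math.log(1.0 - p)
    return loss


logits = [0.0, 2.0, -1.0]   # illustrative values, not model outputs
labels = [1, 1, 0]
print(mask_behavior_loss(logits, labels, masked_idx=[0, 2]))
```

Position 1 is not masked in this example, so its label contributes nothing to the loss even though a logit is available for it.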

4.3. Retriever

The retriever $R$ is responsible for retrieving context sequences similar to the target interaction $d_{t}$ . We use the feature part $x_{t}$ of $d_{t}$ as the query to utilize the search engine in finding similar user contexts in the retrieval pool. This section will cover the details of the datastore and search processes.

Datastore. The datastore serves as a database in which the keys correspond to all the interactions within the retrieval dataset $\mathcal{D}_{\text{retrieval}}$, while the values correspond to their associated context sequences. For the sake of efficiency, we employ the pretrained encoder to encode all these sequences offline into their embedding representations, while the keys remain in the sparse format used for retrieval.

The encoder $E_{\omega}$ encodes the history part $h_{z}$ and the future part $f_{z}$ of the context $c_{z}$ of an interaction $d_{z}$ into two fixed-length vector representations, respectively. Thus for any sample $d_{z}$ in $\mathcal{D}_{\text{retrieval}}$, we define the key-value pair $(k_{z},v_{z})$, where the key $k_{z}$ is the raw features $x_{z}$ of $d_{z}$ and the value $v_{z}$ is the encoded context embedding that contains a history sequence embedding $\mathbf{h}_{z}$ and a future sequence embedding $\mathbf{f}_{z}$. The datastore $(\mathcal{K},\mathcal{V})$ is the set of all key-value pairs constructed from all the samples in the retrieval dataset $\mathcal{D}_{\text{retrieval}}$:

$$(\mathcal{K},\mathcal{V})=\{(x_{z},(E_{\omega}(h_{z}),E_{\omega}(f_{z})))\mid d_{z}\in\mathcal{D}_{\text{retrieval}}\}. \tag{11}$$
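The offline construction of the datastore in Eq. (11) might be sketched as below. The function `build_datastore` and the toy encoder are hypothetical: `toy_encode` merely stands in for the pretrained $E_{\omega}$ so that the example runs without a trained model, and shorter windows at sequence boundaries are an assumption.

```python
from typing import Any, Callable, List, Sequence, Tuple

Interaction = Tuple[Any, int]  # (raw features x_z, label y_z)


def build_datastore(
    user_sequences: Sequence[Sequence[Interaction]],
    L: int,
    encode: Callable[[Sequence[Interaction]], List[float]],
):
    """Return parallel lists (keys, values) with one entry per interaction.

    key   = raw (sparse) features x_z, used later for retrieval
    value = (encode(h_z), encode(f_z)), computed once, offline
    """
    keys, values = [], []
    for seq in user_sequences:                  # one time-sorted sequence per user
        for z, (x_z, _) in enumerate(seq):
            h_z = seq[max(0, z - L):z]          # history window of d_z
            f_z = seq[z + 1:z + 1 + L]          # future window of d_z
            keys.append(x_z)
            values.append((encode(h_z), encode(f_z)))
    return keys, values


def toy_encode(seq: Sequence[Interaction]) -> List[float]:
    """Placeholder for E_omega: [window length, number of positive labels]."""
    return [float(len(seq)), float(sum(y for _, y in seq))]


sequences = [[(("u1", "i1"), 1), (("u1", "i2"), 0), (("u1", "i3"), 1)]]
keys, values = build_datastore(sequences, L=1, encode=toy_encode)
for k, v in zip(keys, values):
    print(k, v)
```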

Query. The retrieval process uses the target sample’s raw features $x_{t}$ as the query to find the nearby keys in the datastore. We use the traditional BM25 (Robertson et al., 1995) algorithm to retrieve relevant keys, and return the top $K$ ranked key-value pairs. In the BM25 algorithm, we treat any sample $x_{z}$ as the query and a key $x_{d}$ in the datastore as the document. As such, the ranking score can be calculated as

$$\text{RankScore}(x_{z},x_{d})=\sum_{j=1}^{M}\text{IDF}(x_{j}^{z})\cdot\mathbf{1}(x_{j}^{z}=x_{j}^{d}), \tag{12}$$

$$\text{IDF}(x_{j}^{z})=\log\frac{N-N(x_{j}^{z})+0.5}{N(x_{j}^{z})+0.5}, \tag{13}$$

where $\mathbf{1}(\cdot)$ is the indicator function, $N$ is the number of data samples in $\mathcal{D}_{\text{retrieval}}$ , and $N(x_{j}^{z})$ is the number of data samples that have the categorical feature value $x_{j}^{z}$ .
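Equations (12) and (13) translate almost directly into code. The sketch below is a brute-force illustration rather than a search-engine implementation; the function names are hypothetical, and counting feature values per field position $j$ is an assumption about how $N(x_{j}^{z})$ is computed.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Key = Tuple[str, ...]  # categorical feature values (x_1, ..., x_M)


def build_counts(keys: Sequence[Key]) -> Counter:
    """Count how many keys carry each (field index, value) pair, i.e. N(x_j)."""
    counts: Counter = Counter()
    for x in keys:
        for j, value in enumerate(x):
            counts[(j, value)] += 1
    return counts


def idf(n_total: int, n_match: int) -> float:
    return math.log((n_total - n_match + 0.5) / (n_match + 0.5))


def rank_score(query: Key, key: Key, counts: Counter, n_total: int) -> float:
    """Sum the IDF of every field on which the query and the key agree."""
    return sum(
        idf(n_total, counts[(j, q)])
        for j, (q, k) in enumerate(zip(query, key))
        if q == k
    )


def retrieve(query: Key, keys: Sequence[Key], top_k: int) -> List[int]:
    """Return the indices of the top_k highest-scoring keys."""
    counts, n = build_counts(keys), len(keys)
    order = sorted(range(n), key=lambda i: rank_score(query, keys[i], counts, n), reverse=True)
    return order[:top_k]


pool = [("u1", "i1", "catA"), ("u2", "i1", "catA"), ("u3", "i2", "catB"),
        ("u4", "i3", "catA"), ("u5", "i4", "catC")]
print(retrieve(("u9", "i2", "catB"), pool, top_k=1))  # [2]
```

Note that the IDF term of Eq. (13) becomes negative for a feature value shared by more than half of the pool, so matching on a very common value can lower the score.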

Let $\hat{\mathcal{K}}$ represent the retrieved keys and $\hat{\mathcal{V}}$ be the retrieved values from the retrieved set. After the retrieval process, we obtain the top- $K$ relevance pairs in the datastore.

4.4. Predictor

After the retrieval process, we obtain the context of the target sample $x_{t}$ . The contextual information includes the user $u_{t}$ ’s history and both the history and future information of the target sample’s relevant interactions. As shown in the bottom part of stage 2 in Figure 3, in the predictor, we use both the contextual information and the embedding of $x_{t}$ to produce the final label prediction $\hat{y_{t}}$ .

Key-based Attention Aggregation. From the retriever, we get the context embedding set $\hat{\mathcal{V}}$ of relevant interactions. Before feeding this to the final prediction layer, we use the key-based attention mechanism to aggregate the retrieved $\hat{\mathcal{K}}$ and $\hat{\mathcal{V}}$. Let $x_{i}$ be the $i$-th sample in the key set $\hat{\mathcal{K}}$. We use an embedding layer to convert $x_{i}$ into an $(M\times w)$-dimensional dense vector $\mathbf{x}_{i}$, and we use the same embedding layer to map $x_{t}$ into an $(M\times w)$-dimensional dense vector $\mathbf{x}_{t}$. The key attention weight is defined as

$$\alpha_{i}=\frac{e^{\mathbf{x}_{i}^{T}{W}\mathbf{x}_{t}}}{\sum_{j=1}^{K}e^{\mathbf{x}_{j}^{T}{W}\mathbf{x}_{t}}}, \tag{14}$$

where ${W}\in\mathbb{R}^{Mw\times Mw}$ is the attention layer parameter matrix. The allocation of attention weights to value sets is determined based on their respective key attention weights. Thus we could use $\boldsymbol{\alpha}$ to aggregate the retrieved key set $\hat{\mathcal{K}}$ and value set $\hat{\mathcal{V}}$. For the retrieved samples in $\hat{\mathcal{K}}$, the aggregated vector $\mathbf{x}_{r}$ can be written as $\mathbf{x}_{r}=\sum_{i=1}^{K}\alpha_{i}\cdot\mathbf{x}_{i},~x_{i}\in\hat{\mathcal{K}}$. For the history part of the retrieved context embeddings $\hat{\mathcal{V}}$, the aggregated embedding $\mathbf{e}_{r}^{h}$ can be written as $\mathbf{e}_{r}^{h}=\sum_{i=1}^{K}\alpha_{i}\cdot\mathbf{h}_{i},~\mathbf{h}_{i}\in\hat{\mathcal{V}}$. Similarly, for the future part of the retrieved context embeddings $\hat{\mathcal{V}}$, the aggregated embedding $\mathbf{e}_{r}^{f}$ can be written as $\mathbf{e}_{r}^{f}=\sum_{i=1}^{K}\alpha_{i}\cdot\mathbf{f}_{i},~\mathbf{f}_{i}\in\hat{\mathcal{V}}$.
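The aggregation can be sketched with plain Python lists. All names here are hypothetical, the matrix `W` is set to the identity purely for illustration (in LIFT it is a learned parameter), and subtracting the maximum score before exponentiation is a numerical-stability detail that leaves the weights of Eq. (14) unchanged.

```python
import math
from typing import List

Vector = List[float]


def bilinear(x_i: Vector, W: List[Vector], x_t: Vector) -> float:
    """Compute x_i^T W x_t."""
    return sum(x_i[a] * sum(W[a][b] * x_t[b] for b in range(len(x_t))) for a in range(len(x_i)))


def key_attention(key_embs: List[Vector], x_t: Vector, W: List[Vector]) -> Vector:
    """Softmax over the K bilinear scores between each retrieved key and the target."""
    scores = [bilinear(x_i, W, x_t) for x_i in key_embs]
    top = max(scores)                               # stability shift; result is unchanged
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: Vector, vectors: List[Vector]) -> Vector:
    """sum_i alpha_i * v_i, used for x_r, e_r^h, and e_r^f alike."""
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(vectors[0]))]


W = [[1.0, 0.0], [0.0, 1.0]]                        # identity, for illustration only
x_t = [1.0, 0.0]                                    # embedded target sample
key_embs = [[1.0, 0.0], [0.0, 1.0]]                 # K = 2 embedded retrieved keys
hist_embs = [[2.0, 0.0], [0.0, 2.0]]                # encoded histories of the retrieved contexts

alpha = key_attention(key_embs, x_t, W)
x_r = weighted_sum(alpha, key_embs)
e_r_h = weighted_sum(alpha, hist_embs)
print(alpha, x_r, e_r_h)
```

The same weights `alpha` are reused for the keys, the history embeddings, and the future embeddings, so a retrieved interaction whose key resembles the target contributes more to all three aggregates.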

Prediction. In the final prediction layer, we first use a factor interaction layer to perform the inter-sample interaction for the target sample and the retrieved key set. Since these two vectors are categorical data, one could use high-order interaction to further explore the useful patterns, but this is out of the scope of this paper. Finally, we feed the embedding of the target sample, retrieved keys, and retrieved values into an MLP to make the label prediction.

After the embedding layer, each sample $x_{z}$ in $\mathcal{D}$ can be represented as a concatenation of $M$ feature embeddings:

(15)

\mathbf{x}_{z}=[{\mathbf{a}_{1}^{z}},\ldots,{\mathbf{a}_{M}^{z}}]~{},

where $\mathbf{a}_{i}^{z}$ represents the embedding of the $i$-th feature of the sample $x_{z}$. We use an interaction layer to compute the combination of these features as

(16)

\mathbf{x}_{\text{inter}}^{z}=[\text{inter}(\mathbf{a}_{1}^{z},\mathbf{a}_{2}^{z}),\text{inter}(\mathbf{a}_{1}^{z},\mathbf{a}_{3}^{z}),\ldots,\text{inter}(\mathbf{a}_{M-1}^{z},\mathbf{a}_{M}^{z})]~,

where $\text{inter}(\cdot)$ represents the interaction function. Different interaction functions can be used, such as inner product (Guo et al., 2017), kernel product (Qu et al., 2018), or micro-network (Qu et al., 2018). We use the same interaction layer architecture for the target sample $x_{t}$ and the aggregated vector $\mathbf{x}_{r}$. After $\mathbf{x}_{r}$ goes through the interaction layer, we get the embedding $\mathbf{x}_{\text{inter}}^{r}$. Finally, we concatenate all the vectors from the framework as

(17)

\text{hidden}=\text{concat}[\mathbf{x}_{\text{inter}}^{t},\mathbf{x}_{\text{inter}}^{r},\mathbf{e}_{r}^{f},\mathbf{e}_{r}^{h},\mathbf{h}_{t}]~,

where $\mathbf{h}_{t}$ is the history representation of the user $u_{t}$. As such, the final predicted label $\hat{y}_{z}$ is written as

(18)

\hat{y}_{z}=\sigma(\text{MLP}(\text{hidden}))~{}.

Since CTR prediction is a binary classification task, we use the binary cross-entropy loss between the label $y_{t}$ and the prediction $\hat{y}_{t}$ for training.
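
As a hedged illustration of Eqs. (16)-(18), the sketch below takes $\text{inter}(\cdot)$ to be the inner product, concatenates the resulting vectors with the context embeddings, and passes them through a small MLP followed by a sigmoid and the binary cross-entropy loss. The layer sizes, the fixed weights, and all toy values are hypothetical; a trained model would learn the weights.

```python
import math
from itertools import combinations

def pairwise_inner(fields):
    """Eq. (16) with inter(.) taken to be the inner product: one scalar per field pair."""
    return [sum(a * b for a, b in zip(u, v)) for u, v in combinations(fields, 2)]

def mlp(x, layers):
    """Apply (weights, bias) layers with ReLU between them and no activation on the last."""
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one sample; eps guards against log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

# Toy inputs (hypothetical): M = 3 fields with w = 2, encoder output dimension 2.
x_t_fields = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # target sample field embeddings
x_r_fields = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]   # aggregated retrieved keys, per field
e_f, e_h, h_t = [0.2, 0.1], [0.3, 0.4], [0.5, 0.6]

# Eq. (17): concatenate both interaction vectors with the context embeddings.
hidden = pairwise_inner(x_t_fields) + pairwise_inner(x_r_fields) + e_f + e_h + h_t

# Hypothetical fixed weights for a 12 -> 2 -> 1 MLP.
layers = [([[0.1] * 12, [-0.1] * 12], [0.0, 0.0]), ([[0.5, -0.5]], [0.0])]
y_hat = sigmoid(mlp(hidden, layers)[0])   # Eq. (18)
loss = bce(1.0, y_hat)                    # loss for a positive (clicked) sample
```

With $M$ fields the interaction vector has $M(M-1)/2$ entries, so the concatenated input here has $3+3+2+2+2=12$ dimensions.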

4.5. Time Complexity & Speedup

In this section, we provide a brief time complexity analysis of the LIFT framework and then discuss the feasibility of speeding it up. The detailed discussion is deferred to Appendix C.

Time complexity. LIFT consists of two major phases. The first is the pretraining phase targeted at context sequences. In this phase, the dataset $D_{\text{retrieval}}$ is segmented into a set of sequences, each of length $L$. As the operations on each sequence have a time complexity of $O(1)$, the time complexity of this phase is $O(|D_{\text{retrieval}}|/L)$.

The second phase is LIFT model training and inference. Besides the neural network computation, the retrieval process is incorporated into the model, accounting for a time complexity of $O(F)+O(F\cdot|D_{\text{retrieval}}|/U)=O(F\cdot|D_{\text{retrieval}}|/U)$, where $U$ denotes the total number of unique features encountered in $D_{\text{retrieval}}$ and $F$ denotes the number of features in the query.

Speedup. To enhance the efficiency of the retrieval process, we may adopt a method of storing the retrieval outcomes, eliminating the need to conduct the retrieval procedure during online inference. This approach shifts the retrieval process to an offline setting. For instance, in recommender systems, this method involves storing the retrieval results for all (or frequent) users, along with their respective recalled candidate items. This method represents a balance between the complexities of space and time, optimizing resource utilization by trading off storage space for processing speed.
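
The storage-for-speed trade-off described above can be sketched as a simple lookup table that is filled offline and consulted online. The class name `RetrievalCache`, the `(user, item)` key format, and the stand-in `slow_retrieve` function are assumptions made for illustration; they do not describe the actual deployment.

```python
class RetrievalCache:
    """Memoizes retrieval results keyed by (user, item) pairs."""

    def __init__(self, retrieve_fn):
        self._retrieve_fn = retrieve_fn   # the expensive retrieval procedure
        self._store = {}
        self.misses = 0

    def warm_up(self, pairs):
        """Offline step: precompute results for the chosen (user, item) pairs."""
        for pair in pairs:
            self._store[pair] = self._retrieve_fn(pair)

    def get(self, pair):
        """Online step: serve from storage, falling back to live retrieval."""
        if pair not in self._store:
            self.misses += 1
            self._store[pair] = self._retrieve_fn(pair)
        return self._store[pair]

def slow_retrieve(pair):
    """Hypothetical stand-in for the real retriever."""
    user, item = pair
    return [f"{user}-{item}-neighbor{k}" for k in range(3)]

cache = RetrievalCache(slow_retrieve)
cache.warm_up([("u1", "i1"), ("u1", "i2")])   # frequent users and their candidates
hit = cache.get(("u1", "i1"))                 # served from storage
miss = cache.get(("u2", "i1"))                # not precomputed, retrieved live
```

Precomputing only frequent users keeps the storage bounded while the fallback path preserves correctness for the rest.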

5. Experiments

This section starts with five research questions (RQs), which we use to guide the experiments and discussions. We provide the experiment code with running instructions on Anonymous GitHub (https://anonymous.4open.science/r/LIFT-277C/Readme.md).

RQ1

Does LIFT achieve the best performance?
RQ2

Does future data benefit the final result?
RQ3

Does the proposed pretraining method have a positive impact on the prediction performance?
RQ4

Do different encoders and mask label ratios have different impacts on context representation learning?
RQ5

How time-efficient is LIFT, and does it have the potential to be deployed online?

5.1. Datasets

We evaluate the performance of LIFT by conducting experiments for CTR prediction tasks on three large-scale real-world datasets, i.e., Taobao, Tmall, and Alipay (https://tianchi.aliyun.com/dataset/x, where ‘x’ is ‘649’, ‘42’, and ‘53’ for Taobao, Tmall, and Alipay, respectively). For top-N ranking, we utilize two widely used public recommendation datasets: MovieLens-1M (https://grouplens.org/datasets/movielens/1m/) and LastFM (http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html). The number of instances and fields of each dataset is shown in Table 1.

Table 1. Dataset statistics

Datasets	# Instances	# Fields	Task
Taobao	100,150,807	4	CTR Prediction
Tmall	54,925,331	9	CTR Prediction
Alipay	35,179,371	6	CTR Prediction
Movielens-1M	1,000,209	7	Top-N Ranking
LastFM	18,993,371	5	Top-N Ranking

For dataset preprocessing, we follow the common practice in RIM (Qin et al., 2021) and DERT (Zheng et al., 2023). We split each dataset into three parts, i.e., retrieval set, train set, and test set, based on the global timestamps. Specifically, the retrieval set consists of the earliest data instances, the test set comprises the latest data instances, and the remaining intermediate data instances are allocated to the train set. We select the hyperparameters using cross validation over the training set. For pretraining, we use the retrieval set to pretrain the encoder. For the non-retrieval baseline models, we merge the retrieval set and the training set as the final training set.
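
A minimal sketch of such a global-timestamp split is given below. The split fractions, the `ts` field name, and the function name are hypothetical, since the exact proportions are not stated here.

```python
def temporal_split(instances, retrieval_frac, train_frac, timestamp=lambda d: d["ts"]):
    """Split by global time: earliest -> retrieval, middle -> train, latest -> test."""
    ordered = sorted(instances, key=timestamp)
    n = len(ordered)
    a = int(n * retrieval_frac)        # end of the retrieval set
    b = a + int(n * train_frac)        # end of the train set
    return ordered[:a], ordered[a:b], ordered[b:]

# Toy log with shuffled timestamps (hypothetical data).
log = [{"user": f"u{i % 3}", "ts": ts} for i, ts in enumerate([50, 10, 80, 30, 70, 20, 60, 40])]
retrieval_set, train_set, test_set = temporal_split(log, retrieval_frac=0.5, train_frac=0.25)

# Non-retrieval baselines train on the union of the retrieval and train sets.
baseline_train_set = retrieval_set + train_set
```

Because the retrieval set strictly precedes the train and test sets in time, the "future" of any retrieved interaction still lies before the target interaction.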

5.2. Evaluation Metrics

We choose the widely used metrics area under the ROC curve (AUC) and negative log-likelihood (LogLoss) to evaluate the performance of CTR prediction. For top-N ranking, we use hit ratio (HR@N), normalized discounted cumulative gain (NDCG@N), and mean reciprocal rank (MRR). We conduct a significance test on each metric between the best and second-best performing methods (in most cases, LIFT and the best performing baseline), and mark positive test results with “*”.
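
For concreteness, the ranking metrics can be sketched as follows, under the assumption (not stated in this section) that each test user has a single held-out ground-truth item whose 1-indexed position in the ranked list is known. Under that assumption the ideal DCG is 1, so NDCG@N reduces to $1/\log_{2}(\text{rank}+1)$ for hits within the top $N$.

```python
import math

def hit_ratio(ranks, n):
    """HR@N: fraction of users whose ground-truth item is ranked within the top N."""
    return sum(r <= n for r in ranks) / len(ranks)

def ndcg(ranks, n):
    """NDCG@N with one relevant item per user (ideal DCG equals 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= n) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank over all users."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical 1-indexed ranks of the ground-truth item for four users.
ranks = [1, 3, 12, 2]
hr5 = hit_ratio(ranks, 5)
ndcg5 = ndcg(ranks, 5)
mean_rr = mrr(ranks)
```

In this toy example three of the four users are hits at $N=5$, while the user ranked 12th contributes only to MRR.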

Table 2. Performance comparison of CTR prediction task baselines. GBDT and DeepFM are the traditional methods. Others are sequential modeling methods. For a fair comparison, traditional models are trained on both the retrieval set and the training set. The best results are in bold fonts while the second best results are underlined. “Rel. Impr.” of each row means the relative AUC improvement of LIFT against the baseline. Improvements marked with * are statistically significant with $p<0.01$.

Models	Taobao			Tmall			Alipay
Models	LogLoss	AUC	Rel. Impr.	LogLoss	AUC	Rel. Impr.	LogLoss	AUC	Rel. Impr.
GBDT	0.6797	0.6134	44.39%	0.5103	0.8319	11.13%	0.9062	0.6747	30.49%
DeepFM	0.6497	0.6710	32.00%	0.4695	0.8581	7.74%	0.6271	0.6971	26.29%
FATE	0.6497	0.6762	30.98%	0.4737	0.8553	8.09%	0.6199	0.7356	19.68%
BERT4Rec	0.6356	0.6852	29.26%	0.4017	0.8981	2.94%	0.6024	0.7321	20.26%
DIN	0.6086	0.7433	19.16%	0.4292	0.8796	5.10%	0.6044	0.7647	15.13%
DIEN	0.6084	0.7506	18.00%	0.4445	0.8838	4.61%	0.6454	0.7502	17.36%
SIM	0.5795	0.7825	13.19%	0.4520	0.8857	4.38%	0.6089	0.7600	15.84%
UBR	0.5432	0.8169	8.42%	0.4368	0.8975	3.01%	0.5747	0.7952	10.71%
RIM	0.4644	0.8563	3.43%	0.3804	0.9138	1.17%	0.5615	0.8006	9.97%
DERT	0.4486	0.8647	2.42%	0.3585	0.9200	0.4%	0.5319	0.8087	8.86%
LIFT w/o pretrain	0.4369*	0.8727*	1.49%	0.3509*	0.9236*	0.10%	0.4707*	0.8572*	2.71%
LIFT	0.4129*	0.8857*	-	0.3489*	0.9245*	-	0.4361*	0.8804*	-

Table 3. Performance comparison of top-N ranking tasks on the ML-1M and LastFM datasets. The best results are highlighted in bold, while the second-best results are underlined. Significant improvements ($p<0.01$) are indicated by ∗.

Datasets	Metric	FPMC	TransRec	NARM	GRU4Rec	SASRec	RIM	DERT	LIFT
ML-1M	NDCG@5	0.0788	0.0808	0.0866	0.0872	0.0981	0.1577	0.1634	0.1806^∗
	NDCG@10	0.1184	0.1217	0.1254	0.1265	0.1341	0.2059	0.2117	0.2293^∗
	MRR	0.1041	0.1078	0.1113	0.1135	0.1193	0.1704	0.1774	0.1914^∗
	HR@1	0.0261	0.0275	0.0337	0.0369	0.0392	0.0645	0.0747	0.0800^∗
	HR@5	0.1334	0.1375	0.1418	0.1395	0.1588	0.2515	0.2540	0.2808^∗
	HR@10	0.2577	0.2659	0.2631	0.2624	0.2709	0.4014	0.4035	0.4324^∗
LastFM	NDCG@5	0.0432	0.1148	0.0916	0.1229	0.1163	0.2165	0.2620	0.2723^∗
	NDCG@10	0.0685	0.1441	0.1185	0.1486	0.1409	0.2911	0.3217	0.3444^∗
	MRR	0.0694	0.1303	0.1083	0.1362	0.1289	0.2210	0.2694	0.2727^∗
	HR@1	0.0148	0.0563	0.0423	0.0658	0.0584	0.0915	0.1488^∗	0.1485
	HR@5	0.0733	0.1725	0.1394	0.1785	0.1729	0.3468	0.3742	0.4010^∗
	HR@10	0.1531	0.2628	0.2227	0.2581	0.2499	0.5780	0.5597	0.6310^∗

Table 4. The influence of different encoders of LIFT on prediction performance. “(p)” means the encoder is pretrained.

Encoder	RNN		RNN (pretrain)		Transformer Encoder		Transformer Encoder (p)		Transformer Decoder		Transformer Decoder (p)
Encoder	LogLoss	AUC	LogLoss	AUC	LogLoss	AUC	LogLoss	AUC	LogLoss	AUC	LogLoss	AUC
Taobao	0.4434	0.8667	0.4360	0.8715	0.4432	0.8684	0.4454	0.8679	0.4369	0.8727	0.4129	0.8857
Tmall	0.3730	0.9132	0.3637	0.9175	0.3539	0.9223	0.3528	0.9227	0.3509	0.9236	0.3489	0.9245
Alipay	0.4937	0.8391	0.4856	0.8468	0.4807	0.8504	0.4707	0.8572	0.4730	0.8556	0.4361	0.8804

Table 5. The influence of the usage of future data in LIFT.

Dataset	No Context		Future Only		History Only		Future & History
Dataset	LogLoss	AUC	LogLoss	AUC	LogLoss	AUC	LogLoss	AUC
Taobao	.4644	.8563	.4417	.8682	.4405	.8694	.4129	.8857
Tmall	.3804	.9138	.3606	.9191	.3588	.9199	.3489	.9245
Alipay	.5615	.8006	.4781	.8525	.4780	.8527	.4361	.8804

5.3. Compared Methods

In CTR prediction, we compare LIFT with ten solid baselines that can be categorized into three groups. The first group is traditional tabular models, which do not utilize the sequential or retrieval mechanism. This group includes GBDT (Chen et al., 1996), a widely used gradient-boosted trees model, and DeepFM (Guo et al., 2017), a factorization-machine based deep neural network. The second group consists of end-to-end sequential deep models, such as DIN (Zhou et al., 2018b) and DIEN (Zhou et al., 2019), which are attention-based recurrent neural networks for CTR prediction, and BERT4Rec (Sun et al., 2019), which is a Transformer-based sequential recommendation model. The third group includes retrieval-based models, such as SIM (Qi et al., 2020), UBR (Qin et al., 2020), RIM (Qin et al., 2021) and DERT (Zheng et al., 2023). Additionally, FATE (Wu et al., 2021) is a tabular data representation learning model that can be viewed as a random retrieval method for prediction.

For top-N ranking, we compare LIFT against seven baseline models. FPMC (Rendle, 2010) and TransRec (He et al., 2017) are factorization-based approaches. The other baselines, namely NARM (Li et al., 2017), GRU4Rec (Hidasi et al., 2015), SASRec (Kang and McAuley, 2018), RIM (Qin et al., 2021) and DERT (Zheng et al., 2023), are recently proposed neural network models.

5.4. Overall Performance (RQ1)

The overall performance comparison result is provided in Table 2, where we have the following observations: (i) LIFT consistently outperforms all baselines. Specifically, compared to the retrieval-based baseline RIM, LIFT achieves relative AUC improvements of 3.43%, 1.17% and 9.97% on Taobao, Tmall, and Alipay, respectively; compared to the best baseline DERT, the improvements are 2.42%, 0.4% and 8.86%. This clearly demonstrates the effectiveness of the context information learned and retrieved in the LIFT framework. (ii) Without pretraining, LIFT still consistently outperforms the best baseline, which indicates that the raw information of future and history still yields a significant improvement in prediction performance. (iii) The retrieval-based methods LIFT and RIM are superior to the traditional methods and sequential CTR models, which means that the retrieval methods are able to make better use of the context information. (iv) In Table 3, it can be observed that LIFT achieves significant improvements over the baselines across nearly all metrics on both datasets. This demonstrates that incorporating contextual information enables the sequential model to perform well in top-N ranking tasks.
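
As a quick arithmetic check, the sketch below assumes that “Rel. Impr.” is computed as $(\text{AUC}_{\text{LIFT}}-\text{AUC}_{\text{baseline}})/\text{AUC}_{\text{baseline}}$; this formula is an assumption, but it reproduces the RIM row of Table 2 from the reported AUC values.

```python
def rel_impr(lift_auc, baseline_auc):
    """Relative AUC improvement of LIFT over a baseline, in percent."""
    return (lift_auc - baseline_auc) / baseline_auc * 100.0

# AUC values copied from Table 2: (LIFT, RIM) per dataset.
auc = {"Taobao": (0.8857, 0.8563), "Tmall": (0.9245, 0.9138), "Alipay": (0.8804, 0.8006)}
improvements = {name: round(rel_impr(lift, rim), 2) for name, (lift, rim) in auc.items()}
```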

5.5. Further Analysis

We further study the effectiveness of important modules in LIFT, i.e., the usage of future context, the pretraining, the encoder architecture, and the hyperparameters.

Future Context (RQ2). Table 5 compares the impact of the different parts of the context information on the final prediction performance. We observe that both the history information and the future information significantly improve the final prediction over using no context. If we use only part of the context information, i.e., future only or history only, we get a worse result than using both, which indicates that the future part of the context provides different information from the history part. This observation is entirely neglected by previous works.

Pretraining (RQ3). From the last two rows of Table 2, we find that on all three datasets, the model with pretraining performs better than the one without, which clearly demonstrates the effectiveness of pretraining. The enhanced performance of the pretrained encoder implies that relying solely on the label provided by the target sample is insufficient for effectively capturing contextual information. To improve the modeling of context, it is imperative to harness the inherent information embedded within the context itself.

The Choice of the Encoder (RQ4). The prediction performance of LIFT with different encoders is provided in Table 4, where we observe that the Transformer-decoder-based encoder yields the best performance in both the non-pretraining and pretraining settings. A possible reason is that the Transformer decoder is better aware of the positions within the sequence. Figure 4 shows the influence of the mask ratio on the prediction performance. Traditional models that encode the user interaction sequence usually use only the last label as the supervision signal. The results show that, on all three datasets, such a single-label supervision strategy is not the best choice. The optimal mask label ratio differs across datasets and is usually larger than 50%.

Time Efficiency (RQ5). To evaluate the time efficiency of LIFT, we compare its actual inference time against three other mainstream models of different types. For each model, we use the parameters that performed best on Alipay to test the inference performance. As shown in Figure 5, LIFT does not introduce significant overhead compared with the basic retrieval-based model RIM. The retrieval set size $K$ that gives LIFT its best AUC on the Alipay dataset is 15, and RIM with $K=10$ is approximately 30% faster than LIFT. With offline pretraining, LIFT operates far more efficiently than the Transformer-based model BERT4Rec.

6. Conclusion

In this work, we propose a retrieval-based framework called LIFT to better utilize the context information of the current user interaction. To the best of our knowledge, this is the first work to include future information as part of the context without temporal data leakage in the training and inference stages. Moreover, we use a pretraining method to better mine the information in the context sequences and propose a novel mask behavior loss. The performance of the LIFT framework shows that both historical and future information yield significant improvements in CTR prediction and top-N ranking performance. Also, from the comparison of the pretraining method, we find that context pretraining is a promising solution to further improve the prediction performance of a variety of sequential recommendation models. In the future, we will further investigate context representation learning with more sophisticated retrieval methods, as well as speedup schemes for deploying LIFT in real-world recommenders.

References

Bahri et al. (2022) Dara Bahri, Heinrich Jiang, Yi Tay, and Donald Metzler. 2022. Scarf: Self-Supervised Contrastive Learning using Random Feature Corruption. In International Conference on Learning Representations.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
Chen et al. (1996) Ming-Syan Chen, Jiawei Han, and Philip S. Yu. 1996. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and data Engineering 8, 6 (1996), 866–883.
Chen et al. (2021) Qiwei Chen, Changhua Pei, Shanshan Lv, Chao Li, Junfeng Ge, and Wenwu Ou. 2021. End-to-end user behavior retrieval in click-through rate prediction model. arXiv preprint arXiv:2108.04468 (2021).
Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In WSDM.
Cui et al. (2022) Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-rec: Generative pretrained language models are open-ended recommender systems. arXiv preprint arXiv:2205.08084 (2022).
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. Association for Computational Linguistics (ACL), 6894–6910.
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. Deepfm: a factorization-machine based neural network for ctr prediction. IJCAI (2017).
Guo et al. (2022) Wei Guo, Can Zhang, Zhicheng He, Jiarui Qin, Huifeng Guo, Bo Chen, Ruiming Tang, Xiuqiang He, and Rui Zhang. 2022. Miss: Multi-interest self-supervised learning framework for click-through rate prediction. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 727–740.
He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.
He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. 173–182.
Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks. In SIGIR.
Jacob et al. (2019) Devlin Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
Khandelwal et al. (2019) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations.
Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2671–2679.
Qi et al. (2020) Pi Qi, Xiaoqiang Zhu, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Qin et al. (2021) Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, and Yong Yu. 2021. Retrieval & Interaction Machine for Tabular Data Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1379–1389.
Qin et al. (2020) Jiarui Qin, W. Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, and Y. Yu. 2020. User Behavior Retrieval for Click-Through Rate Prediction. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based neural networks for user response prediction over multi-field categorical data. TOIS 37, 1 (2018), 1–35.
Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International conference on data mining. IEEE, 995–1000.
Robertson et al. (1995) Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. Nist Special Publication Sp 109 (1995), 109.
Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining. 565–573.
Taylor (1953) Wilson L Taylor. 1953. “Cloze procedure”: A new tool for measuring readability. Journalism quarterly 30, 4 (1953), 415–433.
Vapnik et al. (2015) Vladimir Vapnik, Rauf Izmailov, et al. 2015. Learning using privileged information: similarity control and knowledge transfer. J. Mach. Learn. Res. 16, 1 (2015), 2023–2049.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In ADKDD. 1–7.
Wong et al. (2021) Chi-Man Wong, Fan Feng, Wen Zhang, Chi-Man Vong, Hui Chen, Yichi Zhang, Peng He, Huan Chen, Kun Zhao, and Huajun Chen. 2021. Improving conversational recommender system by pretraining billion-scale knowledge graph. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2607–2612.
Wu et al. (2021) Qitian Wu, Chenxiao Yang, and Junchi Yan. 2021. Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach. In Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021.
Xu et al. (2020) Chen Xu, Quan Li, Junfeng Ge, Jinyang Gao, Xiaoyong Yang, Changhua Pei, Fei Sun, Jian Wu, Hanxiao Sun, and Wenwu Ou. 2020. Privileged features distillation at taobao recommendations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2590–2598.
Yuan et al. (2022) Enming Yuan, Wei Guo, Zhicheng He, Huifeng Guo, Chengkai Liu, and Ruiming Tang. 2022. Multi-behavior sequential transformer recommender. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval. 1642–1652.
Yuan et al. (2020) Fajie Yuan, Xiangnan He, Haochuan Jiang, Guibing Guo, Jian Xiong, Zhezhao Xu, and Yilin Xiong. 2020. Future data helps training: Modeling future contexts for session-based recommendation. In Proceedings of The Web Conference 2020. 303–313.
Zhang et al. (2024) Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Yan Gao, Yao Hu, and Enhong Chen. 2024. NoteLLM-2: Multimodal Large Representation Models for Recommendation. arXiv preprint arXiv:2405.16789 (2024).
Zhang et al. (2019) Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level Deeper Self-Attention Network for Sequential Recommendation. In IJCAI. 4320–4326.
Zhang et al. (2016) Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi-field Categorical Data: A Case Study on User Response Prediction. ECIR (2016).
Zhang et al. (2021) Weinan Zhang, Jiarui Qin, Wei Guo, Ruiming Tang, and Xiuqiang He. 2021. Deep learning for click-through rate estimation. IJCAI (2021).
Zheng et al. (2023) Lei Zheng, Ning Li, Xianyu Chen, Quan Gan, and Weinan Zhang. 2023. Dense Representation Learning and Retrieval for Tabular Data Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3559–3569.
Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
Zhou et al. (2018b) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018b. Deep interest network for click-through rate prediction. In KDD.
Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.
Zhou et al. (2018a) Meizi Zhou, Zhuoye Ding, Jiliang Tang, and Dawei Yin. 2018a. Micro behaviors: A new perspective in e-commerce recommender systems. In Proceedings of the eleventh ACM international conference on web search and data mining. 727–735.

Appendix A Notations

The notations and their descriptions are summarized in Table 6.

Table 6. Notations and corresponding descriptions.

Notation	Description
$\mathcal{D}_{\text{train}}$ , $\mathcal{D}_{\text{test}}$ , $\mathcal{D}_{\text{retrieval}}$	Training set, test set, retrieval set
$x_{z}$ , $\mathbf{x}_{z}$	The raw feature and the embedding of $z$ -th sample
$x_{t}$ , $\mathbf{x}_{t}$	The raw feature and the embedding of target sample
$E_{\omega}$	The encoder and its parameters
$R$	The retriever
$F_{\theta}$	The predictor and its parameters
$d$	An interaction
$v$	The encoder output vector dimension
$w$	The feature embedding vector dimension
$K$	The size of retrieved contexts
$L$	The length of history and future sequence
$c$ , $h$ , $f$	The context sequence, history sequence, future sequence
$C$ , $H$ , $F$	The context sequence set, history sequence set, future sequence set

Appendix B Datastore of LIFT

As shown in Figure 6, the datastore $(\mathcal{K},\mathcal{V})$ is the set of all key-value pairs constructed from all the samples in the retrieval dataset $\mathcal{D}_{\text{retrieval}}$ :

(19)

(\mathcal{K},\mathcal{V})=\{(x_{z},(E_{\omega}(h_{z}),E_{\omega}(f_{z})))\mid d_{z}\in\mathcal{D}_{\text{retrieval}}\}~.

The keys of this datastore are the interactions in the dataset $\mathcal{D}_{\text{retrieval}}$, kept in their original data format. The values are the contextual information associated with these interactions, obtained by applying the pretrained encoder $E_{\omega}$ to the history and future context sequences.
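
The construction in Eq. (19) can be sketched as a single pass over the retrieval set. The mean-pooling function below is only a hypothetical stand-in for the pretrained encoder $E_{\omega}$, and the dictionary field names are assumptions made for illustration.

```python
def mean_pool_encoder(sequence):
    """Stand-in for the pretrained encoder E_omega: averages the step vectors."""
    dim = len(sequence[0])
    return [sum(step[j] for step in sequence) / len(sequence) for j in range(dim)]

def build_datastore(retrieval_set, encoder):
    """Eq. (19): key = raw sample x_z, value = (E(h_z), E(f_z))."""
    keys, values = [], []
    for d in retrieval_set:
        keys.append(d["x"])
        values.append((encoder(d["history"]), encoder(d["future"])))
    return keys, values

# Toy retrieval set (hypothetical): raw features plus history/future sequences.
retrieval_set = [
    {"x": ("u1", "i7"), "history": [[1.0, 0.0], [0.0, 1.0]], "future": [[2.0, 2.0]]},
    {"x": ("u2", "i3"), "history": [[0.0, 0.0], [4.0, 2.0]], "future": [[1.0, 3.0], [3.0, 1.0]]},
]
keys, values = build_datastore(retrieval_set, mean_pool_encoder)
```

Since the values depend only on the retrieval set and the frozen encoder, they can be computed once offline and reused for every query.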

Appendix C Time complexity

We aim to provide a comprehensive analysis of the time complexity within our proposed framework, which consists of two major components.

The first component addresses the pretraining phase targeted at context sequences. In this phase, the dataset $D_{\text{retrieval}}$ is segmented into a set of sequences, each of which is characterized by a length $L$ . Both training and inference stages involve operations on each sequence with a time complexity of $O(1)$ . Consequently, the overall time complexity during the pretraining phase is $O(|D_{\text{retrieval}}|/L)$ .

The second component pertains to the complexity of the main framework’s algorithm, which diverges from traditional neural network methodologies by incorporating a retrieval process during both training and inference phases. We commence by examining the complexity of this retrieval process. Here, $|D_{\text{retrieval}}|$ represents the number of samples, and $U$ denotes the total number of unique features encountered in $D_{\text{retrieval}}$. Note that the mean length of the posting lists in the inverted index is $|D_{\text{retrieval}}|/U$. As explained in Section 4.3, the retrieval operation, which encompasses retrieving all posting lists of features in $x_{t}$, necessitates a time complexity of $O(F)$, where $F$ denotes the number of features in the query. This phase is deemed a constant time operation. The average number of samples retrieved is $F\cdot|D_{\text{retrieval}}|/U$, and the complexity of the ranking operation scales linearly with the number of retrieved samples, i.e., $O(F\cdot|D_{\text{retrieval}}|/U)$. Therefore, the total time complexity of the retrieval process is $O(F)+O(F\cdot|D_{\text{retrieval}}|/U)=O(F\cdot|D_{\text{retrieval}}|/U)$. Besides the retrieval process, the neural network computation time complexity of the predictor is about the same as that of other models such as RIM (Qin et al., 2021) or DIN (Zhou et al., 2018b). Architecture-level acceleration would further improve the model inference efficiency.
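
The two cost terms can be made concrete with a small inverted-index sketch: $F$ posting-list lookups, followed by a ranking pass over the retrieved candidates. Ranking by the number of matched features is a deliberate simplification assumed here for brevity; it is not claimed to be the scoring function used by LIFT, and the feature strings are toy values.

```python
from collections import Counter, defaultdict

def build_index(samples):
    """Inverted index: feature -> posting list of sample ids."""
    index = defaultdict(list)
    for sid, feats in enumerate(samples):
        for f in set(feats):
            index[f].append(sid)
    return index

def retrieve(index, query, k):
    """Fetch F posting lists, then rank candidates by matched-feature count."""
    counts = Counter()
    for f in query:                     # O(F) posting-list lookups
        for sid in index.get(f, ()):    # average list length |D_retrieval| / U
            counts[sid] += 1
    return [sid for sid, _ in counts.most_common(k)]

# Toy retrieval set (hypothetical): each sample is a list of feature values.
samples = [
    ["u1", "i1", "c1"],
    ["u2", "i1", "c2"],
    ["u1", "i2", "c1"],
    ["u3", "i3", "c3"],
]
index = build_index(samples)
top = retrieve(index, ["u1", "i1", "c1"], k=2)

# Mean posting-list length equals (total postings) / U.
mean_len = sum(len(p) for p in index.values()) / len(index)
```

In a real system the feature values would typically be qualified by their field so that identical values from different fields do not collide.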

Appendix D Hyperparameter in the Retriever

In the LIFT framework, $L$ denotes the context sequence length and $K$ denotes the number of retrieved samples. From Figure 7 and Figure 8, we find that as $K$ increases, the prediction performance first rises and then falls. The curve indicates that as $K$ grows, the useful information first increases, and then the additional noise becomes dominant. The AUC curves of $L$ show a trend similar to that of $K$, i.e., as $L$ increases, the AUC first improves and then drops, which we attribute to the same reason. Moreover, because of limited GPU resources, we only conduct this hyperparameter study on Taobao and Alipay, and we could only increase $L$ to 70 for the history length on Taobao, where the downtrend has just started.