Look into the Future: Deep Contextualized Sequential Recommendation

Lei Zheng, Shanghai Jiao Tong University, Shanghai, China, zhenglei2016@sjtu.edu.cn
Ning Li, Shanghai Jiao Tong University, Shanghai, China, lining01@sjtu.edu.cn
Yanhua Huang, Xiaohongshu Inc., Shanghai, China, yanhuahuang@xiaohongshu.com
Ruiwen Xu, Xiaohongshu Inc., Shanghai, China, ruiwenxu@xiaohongshu.com
Weinan Zhang, Shanghai Jiao Tong University, Shanghai, China, wnzhang@sjtu.edu.cn
Yong Yu, Shanghai Jiao Tong University, Shanghai, China, yyu@apex.sjtu.edu.cn
Abstract.

Sequential recommendation aims to estimate how a user’s interests evolve over time via uncovering valuable patterns from user behavior history. Many previous sequential models have solely relied on users’ historical information to model the evolution of their interests, neglecting the crucial role that future information plays in accurately capturing these dynamics. However, effectively incorporating future information in sequential modeling is non-trivial since it is impossible to make the current-step prediction for any target user by leveraging his future data. In this paper, we propose a novel framework of sequential recommendation called Look into the Future (LIFT), which builds and leverages the contexts of sequential recommendation. In LIFT, the context of a target user’s interaction is represented based on i) his own past behaviors and ii) the past and future behaviors of the retrieved similar interactions from other users. As such, the learned context will be more informative and effective in predicting the target user’s behaviors in sequential recommendation without temporal data leakage. Furthermore, in order to exploit the intrinsic information embedded within the context itself, we introduce an innovative pretraining methodology incorporating behavior masking. In our extensive experiments on five real-world datasets, LIFT achieves significant performance improvement on click-through rate prediction and rating prediction tasks in sequential recommendation over strong baselines, demonstrating that retrieving and leveraging relevant contexts from the global user pool greatly benefits sequential recommendation. The experiment code is provided at https://anonymous.4open.science/r/LIFT-277C/Readme.md.

Sequential Recommendation, Context Representation, Retrieval-Enhanced Methods, Pretraining
Conference: Proceedings of the 31st International ACM SIGKDD Conference on Knowledge Discovery & Data Mining; August 3–7, 2025; Toronto, ON, Canada

1. Introduction

Deep learning has been widely adopted for predicting user behaviors in online recommender systems (Zhang et al., 2021). For the scenarios with sequences of user behaviors, sequential recommendation techniques, which extract valuable information from the past behavior sequence of the target user to predict his next behavior, have been well studied (Pi et al., 2019; Huang et al., 2018).

As illustrated in Figure 1, the major techniques for deep learning-based sequential recommendation are generally threefold. Firstly, deep architectures are designed to extend the capability of traditional collaborative filtering models for better mining of the feature interactions (Guo et al., 2017; Zhang et al., 2016; Wang et al., 2017). Secondly, different models like recurrent neural networks (RNN), memory networks, or transformers are adopted to learn an effective representation of the user behavior sequence, based on which a deep learning predictor is built to predict the behavior label for the current-step recommendation (Pi et al., 2019; Chen et al., 2018; Sun et al., 2019). Thirdly, to deal with long behavior sequences, retrieval methods are leveraged to fetch much earlier yet relevant behaviors, which are then aggregated according to the target prediction condition and fed into the final label predictor (Qi et al., 2020; Qin et al., 2020).

The primary aim of sequence modeling in recommender systems is to capture the evolving trends in user interests (Zhou et al., 2019). Merely relying on historical data is insufficient to model these dynamic changes effectively (Yuan et al., 2020). Incorporating future information provides a hindsight perspective to help predict shifts in user preferences. For instance, after purchasing a smartphone, a user is likely to show significant interest in related accessories. However, incorporating future data for predictive modeling without causing data leakage presents a significant challenge. Most existing models (Zhou et al., 2018b; Kang and McAuley, 2018) neglect the potential of future user behaviors, while a few attempts (Yuan et al., 2020; Sun et al., 2019) implicitly use future information during training but fail to leverage it during the inference stage, leaving a significant potential for improvement.

To address this issue, we employ retrieval-based methods to introduce context that contains future information into the model. Previously, retrieval techniques have been integrated into sequential recommendations to effectively access and leverage longer historical information (Pi et al., 2019; Qin et al., 2020). Nonetheless, these retrieval methods typically focus on raw data from history, overlooking the importance of context. The interaction sequences of similar users with similar items tend to follow comparable patterns. By utilizing retrieval methods, we can fetch similar contexts in the log data from other users to approximate the future information of the target user. This approach eliminates the issue of temporal data leakage while enhancing the model’s ability to predict user behaviors.

In this paper, we propose a novel framework of sequential recommendation called Look into the Future (LIFT), which focuses on retrieval in the user contextual information space while leveraging the future behaviors of the retrieved contexts, as illustrated in the bottom part of Figure 1. As mentioned above, it is infeasible to use real future information to predict the current behavior. Thus, given the current-step context with the candidate item for behavior label prediction, LIFT performs retrieval to fetch the most similar interaction behaviors from the whole user pool. Then, extending from each of the retrieved behaviors, the historical and future behavior data are imported to enrich the context of that retrieved behavior. Note that each of the included future behaviors is still earlier than the current timestep of the target user behavior, which avoids any data leakage issue. Such future behaviors can be regarded as a kind of privileged information (Vapnik et al., 2015; Xu et al., 2020), which is not accessible for current-step prediction but is accessible for historical behavior data. To our knowledge, there is no previous work on sequential recommendation that performs effective retrieval of multiple users’ behavior contexts that include both past and future behaviors.

Figure 1. Comparison between LIFT and conventional models: a) Traditional models rely solely on instant user and item information when making predictions. b) Sequential models typically incorporate the user’s historical interactions to capture their evolving interests over time. c) Retrieval-based models perform retrieval to fetch much earlier but relevant historical behaviors to build the user profile for predictions. d) LIFT focuses on interaction context, encompassing both the historical and future sequences of interactions for each user-item interaction.

Furthermore, it is crucial not only to utilize information derived from the target sample but also to exploit the inherent information within the context itself in order to enhance the contextual representation. Therefore, besides the supervised training with the label prediction loss, a representation learning loss is also important in our task. To learn an effective representation of the context with both history and future behavior data, we further devise a pretraining method that performs masked behavior prediction during the pretraining stage. Moreover, to reduce the noise introduced by the retrieved data, we design an attention mechanism that assigns different weights to the retrieved data using dense embeddings of users and items. We conduct extensive experiments on five real-world sequential recommendation datasets with click-through rate prediction and top-N ranking tasks, where LIFT demonstrates significant performance improvements over strong baselines.

Overall, the main contributions of this paper are threefold.

  • We propose a novel LIFT framework that incorporates future information as part of the context. By leveraging retrieval techniques, LIFT utilizes relevant contextual information to enhance the prediction performance of sequential recommendations while avoiding temporal data leakage. To our knowledge, this is the first work to integrate future information comprehensively in both training and inference phases within a recommender system based on retrieval techniques.

  • In the pursuit of obtaining valuable representations of retrieved user behaviors, we adopt an approach that leverages intrinsic contextual information. Specifically, we devise a pretraining methodology that is tailored to the format of user sequence data and introduces a novel self-supervised loss function referred to as mask behavior loss.

  • To avoid noise in the retrieved behavior sequences, we propose a key-based attention mechanism to aggregate the retrieved data.

2. Related Work

Sequential Recommendation. For sequential recommendation, user behavior modeling is the core technique that mines user preferences from historical interaction behaviors. To better extract informative knowledge from a user’s behavior sequence, various network structures have been proposed, including Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Attention Networks, and Memory Networks. GRU4Rec (Hidasi et al., 2015) designs Gated Recurrent Units (GRUs) to capture the preference-evolving relationship while Caser (Tang and Wang, 2018) leverages the horizontal and vertical convolution to model skip behaviors at both union-level and point-level. Moreover, the attention mechanism is the most popular method for modeling item dependencies, and several influential works have been proposed, including SASRec (Kang and McAuley, 2018), DIN (Zhou et al., 2018b), DIEN (Zhou et al., 2019), and BERT4Rec (Sun et al., 2019). To better mine user preferences from their historical behaviors, two lines of work further extend behavior modeling. The first one is multi-behavior modeling (Zhou et al., 2018a; Yuan et al., 2022) that explicitly leverages different types of behavior (e.g., click and purchase behaviors) to measure item correlations within different behavior sequences, thus capturing users’ diverse interests. The second one is behavior modeling with side information (Zhang et al., 2019) that involves various item attribute features (e.g., category) in addition to item IDs for better exploiting the rich knowledge.

Retrieval-enhanced Recommendation. To enhance the performance of recommender systems, the retrieval-enhanced recommendation is proposed, where the most relevant items are retrieved from an extremely long behavior sequence. Specifically, UBR4CTR (Qin et al., 2020) and SIM (Qi et al., 2020) are designed to retrieve beneficial behaviors from the user’s historical behavior, where UBR4CTR deploys the search engine method while SIM uses the hard search and soft search approaches. To make the search procedure end-to-end, ETA (Chen et al., 2021) is proposed by leveraging the SimHash algorithm to map the user behavior into a low-dimensional space, hence achieving learnable retrieval. Moreover, recent works further extend the retrieval-enhanced recommendation from item-level retrieval to sample-level retrieval. RIM (Qin et al., 2021) is the first to deploy this method that leverages the search engine to retrieve several relevant samples from the search pool and performs neighbor aggregation.

Pretraining for Recommendation. Pretraining the deep learning models (or the data representation) with self-supervised learning methods has been widely studied in natural language processing and computer vision. Generally, there are two major categories of self-supervised training methods, namely contrastive learning (Gao et al., 2021; Chen et al., 2020) and mask recovery (Brown et al., 2020; Jacob et al., 2019; He et al., 2022). For recommender systems or tabular data prediction tasks, there are some recent attempts in this direction. To list a few examples, BERT4Rec (Sun et al., 2019) focuses on learning the representation of sequential behaviors via masking the final item of a subsequence of user behaviors. SCARF (Bahri et al., 2022) raises a contrastive learning loss via feature corruption over the tabular data. MISS (Guo et al., 2022) proposes interest-level contrastive losses to take the place of sample-level losses in order to mine self-supervision signals from user behaviors of multiple interests. S3 (Zhou et al., 2020) proposes to leverage inherent data correlations to generate self-supervision signals and improve data representations through pretraining. However, these works borrow from the Cloze task in natural language processing by masking items to have the model predict item-related information, without considering the crucial role of behaviors in recommender systems. Also, there are recent attempts to leverage the knowledge from pretraining on external data, such as knowledge graphs (Wong et al., 2021) and language corpora (Cui et al., 2022), to enhance recommender systems.

3. Formulation & Preliminaries

In this section, we formulate the studied problem and introduce the preliminaries. A recommendation dataset $\mathcal{D}=\{d_{z}\}_{z=1}^{N}$ is represented as a set of interactions $d_{z}$ between user $u_{z}$ and item $i_{z}$. Each interaction $d_{z}$ can be formulated as the feature-label pair $(x_{z},y_{z})$, where $x_{z}=\{x_{j}^{z}\}_{j=1}^{M}$ contains the user and item features and $y_{z}\in\{0,1\}$ is a label indicating whether the user clicks the item, as in click-through rate (CTR) prediction.

A user’s interaction sequence $s$ can be defined as a list of consecutive interactions of the same user sorted by time, $s=[d_{1},d_{2},\ldots,d_{T}]$, where $d_{1}$ is the earliest interaction and $d_{T}$ is the latest one. We define the last $L$ interactions before $d_{z}$ as the history sequence $h_{z}$ and the next $L$ interactions after $d_{z}$ as the future sequence $f_{z}$. Then we define the full context $c_{z}$ as the combination of the history interaction sequence $h_{z}$ and the future interaction sequence $f_{z}$:

(1) $c_{z}=(h_{z},f_{z}).$

To represent $d_{z}$, previous works on sequential recommendation only consider modeling the historical part of the context (Hidasi et al., 2015; Zhou et al., 2018b), while in this work, we use the full context to model $d_{z}$.
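The construction of the full context in Eq. (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `build_context` and the 0-based index `z` are assumptions made for the example, and sequences shorter than $L$ on either side are simply truncated.

```python
from typing import List, Sequence, Tuple, TypeVar

D = TypeVar("D")


def build_context(seq: Sequence[D], z: int, L: int) -> Tuple[List[D], List[D]]:
    """Split a time-sorted sequence around index z into (history, future).

    z is a 0-based index here (the paper's d_1 is seq[0]); this indexing
    convention is an assumption of the sketch.
    """
    if not 0 <= z < len(seq):
        raise IndexError("z must index an interaction in seq")
    history = list(seq[max(0, z - L):z])   # up to L interactions before d_z
    future = list(seq[z + 1:z + 1 + L])    # up to L interactions after d_z
    return history, future


seq = ["d1", "d2", "d3", "d4", "d5", "d6"]
h_z, f_z = build_context(seq, z=3, L=2)    # context of d4
print(h_z, f_z)
```

For the target interaction itself, only `h_z` is available at inference time; `f_z` can be built only for interactions that already lie in the past.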

The goal of a traditional recommendation task is to predict the label of the interaction $d_{z}$ based on the features $x_{z}$ from users, items, and the user $u_{z}$’s history behavior sequences. Such a prediction can be formulated as

(2) $\hat{y}_{z}=F_{\theta}(x_{z},h_{z}),$

where $F_{\theta}$ is the learning model with the parameter $\theta$.

Here, we use both the historical information $h_{z}$ and the future information $f_{z}$. However, the future information for $d_{z}$ is not visible at the inference stage. We therefore design a retrieval-based framework that retrieves the future sequences of the most relevant interactions to serve as the future sequence for the interaction $d_{z}$. Moreover, we also use the history sequences of the most similar interactions to enhance the historical information.

Following the experiment setting of other retrieval-based sequential recommendation methods (Qin et al., 2021), we split the dataset into $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{test}}$, and $\mathcal{D}_{\text{retrieval}}$ for training, testing, and retrieval, respectively. With the relevant context sequences, which include future sequences and history sequences, the label prediction is formulated as

(3) $\hat{y}_{z}=F_{\theta}(x_{z},h_{z},C_{z}),$

where $C_{z}$ represents the context sequences for $d_{z}$ and

(4) $C_{z}=R(x_{z}),$

where $R$ is a retriever that fetches relevant context sequences according to $x_{z}$. We first retrieve relevant interactions $\{d_{r}\}$ and then use the retrieved interactions to find their contexts and build $C_{z}$. We describe the procedure in detail in the next section.
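To make the two-step retrieval in Eq. (4) concrete, the sketch below first scores candidate interactions and then expands each one into its history and future. It is an illustration under stated assumptions: the `Interaction` record, the feature-overlap score, and the choice to truncate a retrieved future at the query time are all hypothetical; the text only specifies that the retriever is non-parametric and that retrieved future behaviors must be earlier than the target timestep.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Interaction:
    user: str
    time: int
    features: FrozenSet[str]


Context = Tuple[List[Interaction], List[Interaction]]


def retrieve_contexts(
    query: Interaction,
    pool: Dict[str, List[Interaction]],  # user -> interactions sorted by time
    K: int,
    L: int,
) -> List[Context]:
    """Return the (history, future) contexts of the top-K similar interactions."""
    scored = []
    for user, seq in pool.items():
        if user == query.user:
            continue  # contexts are taken from other users
        for idx, d in enumerate(seq):
            if d.time >= query.time:
                break  # seq is time-sorted, so nothing later can qualify
            history = seq[max(0, idx - L):idx]
            # Leakage guard: keep only "future" behaviors that already
            # happened before the target interaction.
            future = [f for f in seq[idx + 1:idx + 1 + L] if f.time < query.time]
            score = len(query.features & d.features)  # hypothetical similarity
            scored.append((score, history, future))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(h, f) for _, h, f in scored[:K]]


pool = {
    "u1": [
        Interaction("u1", 1, frozenset({"phone"})),
        Interaction("u1", 2, frozenset({"case"})),
        Interaction("u1", 12, frozenset({"charger"})),
    ],
    "u2": [
        Interaction("u2", 3, frozenset({"book"})),
        Interaction("u2", 4, frozenset({"lamp"})),
    ],
}
query = Interaction("u0", 10, frozenset({"phone"}))
top = retrieve_contexts(query, pool, K=1, L=2)
```

In this toy pool, the best match is the phone interaction of `u1`; its retrieved future contains the case purchase at time 2 but not the charger at time 12, which happens after the query.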

Traditional methods only use the label from the target interaction $d_{z}$ to train $F_{\theta}$. However, if we only use the labels from the target interaction $d_{z}$, we neglect most of the behavior information in the retrieved context sequences $C_{z}$. Inspired by pretrained language models (Devlin et al., 2019; Brown et al., 2020), which have achieved tremendous success in natural language processing (NLP) by learning universal representations in a self-supervised manner, we propose a self-supervised pretraining method that is supervised by signals from the retrieved sequences themselves. We denote the pretrained encoder as $E_{\omega}$, where the input of the encoder is any user interaction sequence $s$ and the output of $E_{\omega}$ is a $v$-dimensional vector. The embedding of any user interaction sequence $s$ is written as

(5) $\mathbf{s}=E_{\omega}(s),$

where $\mathbf{s}\in\mathbb{R}^{1\times v}$. The retrieved context sequence set includes a history sequence set $H$ and a future sequence set $F$. We retrieve $K$ context sequences, where $K$ can be tuned in this framework. Then the retrieved context embedding set is written as

(6) $\mathbf{C}=(\mathbf{H},\mathbf{F})=(E_{\omega}(H),E_{\omega}(F)),$

where $\mathbf{H}\in\mathbb{R}^{K\times v}$, $\mathbf{F}\in\mathbb{R}^{K\times v}$, and $\mathbf{C}\in\mathbb{R}^{K\times 2v}$.
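The shapes in Eq. (6) imply that each retrieved context contributes one row, formed by placing its history embedding next to its future embedding. A small sketch of that bookkeeping, with placeholder numbers standing in for real encoder outputs (the helper `concat_context` is a name chosen for this example):

```python
from typing import List

Matrix = List[List[float]]


def concat_context(H: Matrix, F: Matrix) -> Matrix:
    """Concatenate K x v history and future embeddings row-wise into K x 2v."""
    if len(H) != len(F):
        raise ValueError("H and F must both contain K rows")
    return [h_row + f_row for h_row, f_row in zip(H, F)]


K, v = 3, 2
H = [[0.1 * k, 0.2 * k] for k in range(K)]   # placeholder history embeddings
F = [[1.0 + k, 2.0 + k] for k in range(K)]   # placeholder future embeddings
C = concat_context(H, F)
print(len(C), len(C[0]))                     # K rows, 2v columns
```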

As such, the parameters can be optimized by minimizing the self-supervised loss function as

(7) $\omega^{*}=\underset{\omega}{\arg\min}\,\sum_{s\in\mathcal{S}_{\text{pretrain}}}\mathcal{L}_{\text{pretrain}}(E_{\omega}(s)),$

where $\mathcal{S}_{\text{pretrain}}$ is a sequence set sampled from the retrieval dataset $\mathcal{D}_{\text{retrieval}}$. As the retriever $R$ is non-parametric, we optimize $F_{\theta}$ and $E_{\omega}$ independently. Specifically, after pretraining and fixing the parameters of $E_{\omega}$, we train $F_{\theta}$ via

(8) $\theta^{*}=\underset{\theta}{\arg\min}\,\sum_{(x_{z},y_{z})\in\mathcal{D}_{\text{train}}}\mathcal{L}_{\text{prediction}}(F_{\theta}(x_{z},\mathbf{h}_{z},\mathbf{C}_{z}),y_{z}),$

where $\mathbf{h}_{z}$ is the history sequence of $u_{z}$, encoded by $E_{\omega}$.

Figure 2. The architectural components of the encoder and predictor within the LIFT framework are as follows: (a) Pretrained Sequence Encoder: The embedding layer is omitted in this component. LIFT employs a decoder-only Transformer architecture as the encoder, which undergoes pretraining via the mask behavior loss. During the pretraining stage, the primary focus is on leveraging contextual information inherent within the sequence data itself. (b) Training of the Predictor: During the training stage, emphasis is placed solely on the training of the predictor. LIFT incorporates three distinct types of information to inform its predictions, namely the target sample $x_{t}$ itself, the user’s historical interactions, and the retrieved context, encompassing historical interactions from similar instances as well as future interactions. In this phase, the label information utilized for training is exclusively derived from the target samples.

4. The LIFT Framework

In this section, we provide an overview of the whole LIFT framework and then describe its three components in detail: the encoder, the retriever, and the predictor.

Figure 3. The overview workflow in LIFT. In the initial phase (Stage 1), we pretrain an encoder to convert sequences into embeddings. In the subsequent phase (Stage 2), we illustrate the data flow in the figure. In Step 1, we send $x_{z}$ as a query to the retriever. In Step 2, the retriever outputs the retrieval result. In Step 3, we input the retrieval result into the encoder. In Step 4, we obtain the embeddings from the encoder. In Step 5, we send $x_{z}$ and the encoded embeddings to the predictor to obtain the final result.

4.1. Overview

There are three major components in LIFT, i.e., the encoder, retriever, and predictor. The encoder $E$ is responsible for encoding the user sequence into a $v$-dimensional vector. Given the target interaction $d_{t}$, its feature part $x_{t}$ is regarded as a query, and the label part $y_{t}$ is to be predicted. The retriever $R$ is used to retrieve relevant context sequences $C_{t}$. The predictor $F_{\theta}$ makes use of the feature vector $x_{t}$, the user $u_{t}$’s history sequence $h_{t}$, and the relevant context sequence set $C_{t}$ to predict the final label $\hat{y}_{t}$.

There are two steps in the LIFT framework: pretraining and predicting. During pretraining, an encoder $E$ is trained on a set $\mathcal{S}_{\text{pretrain}}$ sampled from $\mathcal{D}_{\text{retrieval}}$. For predicting, the LIFT model first uses the retriever $R$ to find the top-$K$ relevant user context sequences $C_{t}$. Then, each sequence in $C_{t}$ is encoded by $E_{\omega}$. After aggregating the information of the target user and item $x_{t}$, the target user’s history $h_{t}$, and the retrieved sequences $C_{t}$, the predictor $F_{\theta}$ generates the label prediction.
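The predicting step can be read as a composition of the three components. The sketch below shows only that control flow; the function `lift_predict` and the toy retriever, encoder, and predictor passed to it are hypothetical stand-ins, not the models used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Seq = Sequence[object]


def lift_predict(
    x_t: object,
    h_t: Seq,
    retriever: Callable[[object, int], List[Tuple[Seq, Seq]]],
    encoder: Callable[[Seq], Vector],
    predictor: Callable[[object, Vector, List[Vector]], float],
    K: int,
) -> float:
    contexts = retriever(x_t, K)                 # query with x_t, get top-K contexts
    H = [encoder(hist) for hist, _ in contexts]  # encode retrieved histories
    F = [encoder(fut) for _, fut in contexts]    # encode retrieved futures
    C = [h + f for h, f in zip(H, F)]            # one row per retrieved context
    return predictor(x_t, encoder(h_t), C)       # predict the target label


# Toy stand-ins, used only to exercise the data flow.
def toy_retriever(x: object, K: int) -> List[Tuple[Seq, Seq]]:
    return [(["a", "b"], ["c"])] * K


def toy_encoder(seq: Seq) -> Vector:
    return [float(len(seq))]


def toy_predictor(x: object, h: Vector, C: List[Vector]) -> float:
    return h[0] + sum(sum(row) for row in C)


y_hat = lift_predict("x_t", ["p", "q", "r"], toy_retriever, toy_encoder,
                     toy_predictor, K=2)
```

Because the encoder is pretrained and then fixed, only the predictor would receive gradient updates when this flow is used for supervised training.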

The main contribution of the LIFT framework is that it makes use of the full user context to help predict the target user behavior. Another distinctive feature of LIFT is that it applies pretraining in a new way, masking behaviors rather than items, which significantly improves the prediction performance in our experiments and may open a new research direction for the sequential recommendation task. Next, we will describe the three components of the LIFT framework in detail.

4.2. Encoder

The encoder $E_{\omega}$ maps a user interaction sequence into a real-valued vector. It is pretrained separately in the framework.

Model Architecture. LIFT adopts a decoder-only Transformer as the encoder, which is based on the original implementation described in Vaswani et al. (2017). The Transformer has been shown to perform strongly in language modeling (Brown et al., 2020; Devlin et al., 2019), machine translation (Vaswani et al., 2017), recommender systems (Sun et al., 2019; Zhang et al., 2024), etc. Different tasks may leverage different parts of the Transformer architecture. The difference between the Transformer encoder and decoder is that the self-attention sub-layer in the decoder is modified so that the prediction at position $p$ can only depend on positions earlier than $p$ (Vaswani et al., 2017). Since recommender system data exhibits temporal characteristics, we use the decoder part of the Transformer. We also compare the performance of different encoders in the experiments, which shows the superiority of the Transformer decoder.
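The masking that distinguishes the decoder's self-attention can be illustrated with a small, dependency-free sketch. It computes softmax attention weights from a $T\times T$ score matrix while hiding later positions. Following the usual causal-mask convention, position $p$ may attend to itself as well as to earlier positions; the function name and the uniform toy scores are assumptions for this example.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over a T x T score matrix with later positions masked."""
    T = len(scores)
    weights = []
    for p in range(T):
        visible = scores[p][:p + 1]              # positions 0..p only
        m = max(visible)                         # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (later) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (T - p - 1))
    return weights


w = causal_attention_weights([[0.0] * 4 for _ in range(4)])
for row in w:
    print(row)
```

With uniform scores, row $p$ spreads its weight evenly over the first $p+1$ positions, so only the last row sees the whole sequence.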

Input/Output. The input of the encoder $E_\omega$ is a user interaction sequence $s = \{d_1, d_2, \ldots, d_T\}$. The output of $E_\omega$ is an embedding representing the sequence.

To feed the interaction $d_z$ into the encoder, the feature part $x_z$ of $d_z$ is first passed through an embedding layer. The embedding layer maps the one-hot encoded vector $x_z$ into a dense vector $\mathbf{x}_z$: each feature $a_i$ in $x_z$ is mapped into a $w$-dimensional vector, and these vectors are concatenated into one $(M \times w)$-dimensional dense vector. The label part of $d_z$ is preserved. In the training stage, if the interaction $d_z$ is chosen to be masked, its label is replaced by a special token [MSK]. We describe the training loss below. After this processing, the sequence is fed into the decoder-only Transformer, which outputs a set of hidden states $O = \{o_1, o_2, \ldots, o_T\}$. Because only the last step in the decoder can utilize the full information in the sequence, we choose the final state $o_T$ as the representation of the interaction sequence $\mathbf{s}$.
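As a rough illustration of the input pipeline described above, the following NumPy sketch concatenates per-feature embeddings and applies one causal self-attention step, taking the last hidden state as the sequence representation. It is not the authors' implementation: the function names are hypothetical, and the single attention head without learned projections and the omission of the label embedding are simplifying assumptions.

```python
import numpy as np

def embed_interaction(feature_ids, tables):
    """Concatenate one w-dimensional embedding per feature into an (M*w,) vector."""
    return np.concatenate([tables[j][fid] for j, fid in enumerate(feature_ids)])

def causal_self_attention(X):
    """Single-head self-attention where position p sees only positions <= p.

    X has shape (T, d). Learned projections are omitted (an assumption made
    here to keep the sketch short).
    """
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    allowed = np.tril(np.ones((T, T), dtype=bool))  # lower triangle: self and past
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ X

def encode_sequence(X):
    """Use the final hidden state o_T as the representation of the sequence."""
    return causal_self_attention(X)[-1]
```

Because of the causal mask, changing the last interaction leaves all earlier hidden states unchanged, which is why only the final state can summarize the whole sequence.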

Pretraining Data Preparation. The pretraining dataset $\mathcal{S}_{\text{pretrain}}$ is sampled from $\mathcal{D}_{\text{retrieval}}$. For each user $u$ in $\mathcal{U}$, we use all the interactions of $u$ in $\mathcal{D}_{\text{retrieval}}$ to assemble an interaction sequence $s_u$, which gives a sequence set $S$. From each sequence $s$ in $S$, we extract a subsequence every $L$ interactions. These subsequences form the pretraining dataset $\mathcal{S}_{\text{pretrain}}$.
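The data preparation step can be sketched as follows. This is a minimal illustration under stated assumptions: each interaction is a hypothetical `(user, timestamp, x, y)` tuple, subsequences are taken as consecutive non-overlapping chunks of length $L$, and a shorter trailing chunk is kept. The paper does not specify these details.

```python
from collections import defaultdict

def build_pretrain_sequences(interactions, L):
    """Group interactions by user, order them by time, and cut each user
    sequence into consecutive chunks of at most L interactions."""
    by_user = defaultdict(list)
    for user, timestamp, x, y in interactions:
        by_user[user].append((timestamp, x, y))

    chunks = []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[0])            # chronological order
        seq = [(x, y) for _, x, y in rows]       # keep (features, label)
        for start in range(0, len(seq), L):
            chunks.append(seq[start:start + L])  # last chunk may be shorter
    return chunks
```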

Mask Behavior Loss. To leverage the inherent information contained within the context, we formulate a self-supervised loss function. Certain existing sequential recommendation models (Zhou et al., 2018b, 2019) consider only the target interaction label as the prediction target for a given interaction sequence and omit all other interaction labels within the sequence. We argue that incorporating more behavioral information from the context as the supervision signal leads to a more comprehensive representation of the context. Moreover, compared to previous pretraining methods (Sun et al., 2019; Zhou et al., 2020) that mask items or item parts, we propose masking user behavior to better incorporate user behavior information into the pretrained model, enhancing user-item interaction modeling.

As shown in Figure 2, the left part describes the encoder architecture and loss in the pretraining procedure. To train a deep representation of the sequence, we mask a portion of the interaction labels and then predict those masked behaviors. Unlike the Cloze task (Taylor, 1953) in natural language processing, it is difficult to predict masked tokens in user interaction sequences, because data in recommender systems lacks the syntactic structure found in natural language. An interaction $d_i$ in the input sequence $s$ can be represented as:

(9) $d_i = \begin{cases} (x_i, \texttt{MSK}) & \text{if } d_i \text{ is a masked behavior} \\ (x_i, y_i) & \text{otherwise} \end{cases}$

where $i \in [1, T]$.

We adopt the binary cross entropy loss as the objective function. In the pretraining procedure, we feed the outputs corresponding to the masked interactions into a multi-layer perceptron (MLP) to predict the masked labels. Thus the pretraining loss $\mathcal{L}_{\text{pretrain}}$ can be represented as

(10) $\mathcal{L}_{\text{pretrain}} = -\sum_{d_i \in \mathcal{M}} \left[ y_i \log \sigma(\text{MLP}(o_i)) + (1 - y_i) \log(1 - \sigma(\text{MLP}(o_i))) \right]$

where $\mathcal{M}$ is the set of masked interactions and $\sigma(q) = 1/(1+e^{-q})$ is the sigmoid function. We apply $\mathcal{L}_{\text{pretrain}}$ to every sequence in $\mathcal{S}_{\text{pretrain}}$. In our experiments, we test different mask ratios and find that the best mask ratio differs across datasets. In contrast to the mask item loss (Sun et al., 2019), the mask behavior loss extends more readily to sequences annotated with multiple behaviors, since it can cover each of these labeled behaviors.
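The mask behavior loss can be illustrated with a short NumPy sketch. The names `sample_mask` and `mask_behavior_loss` are hypothetical, the `logits` array stands in for the MLP outputs $\text{MLP}(o_i)$, and uniform random masking is an assumption (the paper only states that a mask ratio is used).

```python
import numpy as np

def sample_mask(T, mask_ratio, rng):
    """Pick which of the T interactions have their labels replaced by [MSK]."""
    n_mask = max(1, int(round(mask_ratio * T)))
    idx = rng.choice(T, size=n_mask, replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    return mask

def mask_behavior_loss(logits, labels, mask):
    """Binary cross entropy summed over masked positions only (Eq. 10)."""
    p = 1.0 / (1.0 + np.exp(-logits[mask]))   # sigma(MLP(o_i))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)        # avoid log(0)
    y = labels[mask]
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

Unmasked positions contribute nothing to the loss; their labels are only visible to the encoder as input.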

4.3. Retriever

The retriever $R$ is responsible for retrieving context sequences similar to the target interaction $d_t$. We use the feature part $x_t$ of $d_t$ as the query for a search engine that finds similar user contexts in the retrieval pool. This section covers the details of the datastore and the search process.

Datastore. The datastore serves as a database in which the keys correspond to all the interactions within the retrieval dataset $\mathcal{D}_{\text{retrieval}}$, while the values correspond to their associated context sequences. For efficiency, we employ the pretrained encoder to encode all these sequences offline into their embedding representations, while the keys remain in sparse format for retrieval.

The encoder $E_\omega$ encodes the history part $h_z$ and the future part $f_z$ of the context $c_z$ of an interaction $d_z$ into two fixed-length vector representations, respectively. Thus, for any sample $d_z$ in $\mathcal{D}_{\text{retrieval}}$, we define the key-value pair $(k_z, v_z)$, where the key $k_z$ is the raw features $x_z$ of $d_z$ and the value $v_z$ is the encoded context embedding, which contains a history sequence embedding $\mathbf{h}_z$ and a future sequence embedding $\mathbf{f}_z$. The datastore $(\mathcal{K}, \mathcal{V})$ is the set of all key-value pairs constructed from all the samples in the retrieval dataset $\mathcal{D}_{\text{retrieval}}$:

(11) $(\mathcal{K}, \mathcal{V}) = \{(x_z, (E_\omega(h_z), E_\omega(f_z))) \mid d_z \in \mathcal{D}_{\text{retrieval}}\}.$
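Equation (11) can be read as a loop over every interaction of every user in the retrieval set. The sketch below is one possible reading, not the authors' code: `encode` is a hypothetical stand-in for the pretrained encoder $E_\omega$, and using everything before and after position $z$ in the user's sequence as the history and future parts is an assumption (the paper may truncate these to a fixed window).

```python
def build_datastore(user_sequences, encode):
    """Build parallel lists of keys (raw features) and values (history and
    future embeddings) from per-user, time-ordered interaction sequences.

    user_sequences: list of sequences, each a list of (x, y) interactions.
    encode: callable mapping a (possibly empty) list of interactions to an
            embedding; a placeholder for the pretrained encoder.
    """
    keys, values = [], []
    for seq in user_sequences:
        for z, (x_z, _label) in enumerate(seq):
            history = seq[:z]        # interactions before d_z
            future = seq[z + 1:]     # interactions after d_z
            keys.append(x_z)
            values.append((encode(history), encode(future)))
    return keys, values
```

Since the encoder is fixed after pretraining, this construction can run once offline, which matches the efficiency argument in the Datastore paragraph.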

Query. The retrieval process uses the target sample’s raw features $x_t$ as the query to find nearby keys in the datastore. We use the traditional BM25 (Robertson et al., 1995) algorithm to retrieve relevant keys and return the top-$K$ ranked key-value pairs. In the BM25 algorithm, we treat any sample $x_z$ as the query and a key $x_d$ in the datastore as the document. The ranking score is then calculated as

(12) $\text{RankScore}(x_z, x_d) = \sum_{j=1}^{M} \text{IDF}(x_j^z) \cdot \mathbf{1}(x_j^z = x_j^d),$
(13) $\text{IDF}(x_j^z) = \log \frac{N - N(x_j^z) + 0.5}{N(x_j^z) + 0.5},$

where $\mathbf{1}(\cdot)$ is the indicator function, $N$ is the number of data samples in $\mathcal{D}_{\text{retrieval}}$, and $N(x_j^z)$ is the number of data samples that have the categorical feature value $x_j^z$.
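The ranking score in Equations (12) and (13) can be sketched in plain Python. This is an illustrative brute-force version, not the search engine used in the paper: the function names are hypothetical, feature values are counted per field position $j$ (an assumption), and every key is scored rather than using an inverted index.

```python
import math
from collections import Counter

def build_idf(keys):
    """IDF of each (field index, value) pair over the datastore keys (Eq. 13)."""
    N = len(keys)
    counts = Counter((j, v) for key in keys for j, v in enumerate(key))
    return {fv: math.log((N - n + 0.5) / (n + 0.5)) for fv, n in counts.items()}

def rank_score(query, key, idf):
    """Sum the IDF of every field on which query and key agree (Eq. 12)."""
    return sum(
        idf.get((j, q), 0.0)
        for j, (q, k) in enumerate(zip(query, key))
        if q == k
    )

def retrieve_top_k(query, keys, idf, k):
    """Return the indices of the k highest-scoring keys."""
    order = sorted(
        range(len(keys)),
        key=lambda i: rank_score(query, keys[i], idf),
        reverse=True,
    )
    return order[:k]
```

Note that with this IDF form, a value shared by more than half of the samples receives a negative weight, so very common feature values can lower a key's score.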

After the retrieval process, we obtain the top-$K$ most relevant key-value pairs in the datastore. Let $\hat{\mathcal{K}}$ denote the retrieved keys and $\hat{\mathcal{V}}$ the retrieved values.

4.4. Predictor

After the retrieval process, we obtain the context of the target sample $x_t$. The contextual information includes the history of user $u_t$ and both the history and future information of the interactions relevant to the target sample. As shown in the bottom part of stage 2 in Figure 3, the predictor uses both the contextual information and the embedding of $x_t$ to produce the final label prediction $\hat{y}_t$.

Key-based Attention Aggregation. From the retriever, we get the context embedding set $\hat{\mathcal{V}}$ of relevant interactions. Before feeding it to the final prediction layer, we use a key-based attention mechanism to aggregate the retrieved $\hat{\mathcal{K}}$ and $\hat{\mathcal{V}}$. Let $x_i$ be the $i$-th sample in the key set $\hat{\mathcal{K}}$. We use an embedding layer to convert $x_i$ into an $(M \times w)$-dimensional dense vector $\mathbf{x}_i$, and the same embedding layer to map $x_t$ into an $(M \times w)$-dimensional dense vector $\mathbf{x}_t$. The key attention weight is defined as

(14) $\alpha_i = \frac{e^{\mathbf{x}_i^T W \mathbf{x}_t}}{\sum_{j=1}^{K} e^{\mathbf{x}_j^T W \mathbf{x}_t}},$

where $W \in \mathbb{R}^{Mw \times Mw}$ is the attention layer parameter matrix. The attention weights assigned to the values are determined by their respective key attention weights. Thus, we can use $\boldsymbol{\alpha}$ to aggregate the retrieved key set $\hat{\mathcal{K}}$ and value set $\hat{\mathcal{V}}$. For the retrieved samples in $\hat{\mathcal{K}}$, the aggregated vector $\mathbf{x}_r$ can be written as $\mathbf{x}_r = \sum_{i=1}^{K} \alpha_i \cdot \mathbf{x}_i,\ x_i \in \hat{\mathcal{K}}$. For the history part of the retrieved context embeddings $\hat{\mathcal{V}}$, the aggregated embedding $\mathbf{e}_r^h$ can be written as $\mathbf{e}_r^h = \sum_{i=1}^{K} \alpha_i \cdot \mathbf{h}_i,\ \mathbf{h}_i \in \hat{\mathcal{V}}$. Similarly, for the future part of the retrieved context embeddings $\hat{\mathcal{V}}$, the aggregated embedding $\mathbf{e}_r^f$ can be written as $\mathbf{e}_r^f = \sum_{i=1}^{K} \alpha_i \cdot \mathbf{f}_i,\ \mathbf{f}_i \in \hat{\mathcal{V}}$.
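The key-based attention aggregation can be written compactly in NumPy. The sketch below follows Equation (14) and the three weighted sums; the function name and argument layout are hypothetical, and `W` would be a learned parameter in practice.

```python
import numpy as np

def key_attention_aggregate(x_t, X_keys, H, F_emb, W):
    """Aggregate retrieved keys and values with one shared set of weights.

    x_t:    (D,)   dense embedding of the target sample, D = M * w
    X_keys: (K, D) dense embeddings of the K retrieved keys
    H:      (K, d) history embeddings of the retrieved contexts
    F_emb:  (K, d) future embeddings of the retrieved contexts
    W:      (D, D) attention parameter matrix
    """
    scores = X_keys @ W @ x_t                 # x_i^T W x_t for each key
    scores = scores - scores.max()            # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    x_r = alpha @ X_keys                      # aggregated key vector
    e_h = alpha @ H                           # aggregated history embedding
    e_f = alpha @ F_emb                       # aggregated future embedding
    return alpha, x_r, e_h, e_f
```

The weights are computed from the keys alone and then reused for the history and future embeddings, so a retrieved context contributes in proportion to how well its key matches the target sample.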

Prediction. In the final prediction layer, we first use a factor interaction layer to perform the inter-sample interaction for the target sample and the retrieved key set. Since these two vectors are categorical data, one could use high-order interaction to further explore the useful patterns, but this is out of the scope of this paper. Finally, we feed the embedding of the target sample, retrieved keys, and retrieved values into an MLP to make the label prediction.

After the embedding layer, each sample $x_z$ in $\mathcal{D}$ can be represented as a concatenation of $M$ feature embeddings:

(15) $\mathbf{x}_z = [\mathbf{a}_1^z, \ldots, \mathbf{a}_M^z],$

where $\mathbf{a}_i^z$ represents the embedding of the $i$-th feature of the sample $x_z$. We use an interaction layer to compute the combination of these features as

(16) $\mathbf{x}_{\text{inter}}^z = [\text{inter}(\mathbf{a}_1^z, \mathbf{a}_2^z), \text{inter}(\mathbf{a}_1^z, \mathbf{a}_3^z), \ldots, \text{inter}(\mathbf{a}_{M-1}^z, \mathbf{a}_M^z)],$

where $\text{inter}(\cdot)$ represents the interaction function. Different interaction functions can be used, such as inner product (Guo et al., 2017), kernel product (Qu et al., 2018), or micro-network (Qu et al., 2018). We use the same interaction layer architecture for the target sample $x_t$ and the aggregated vector $\mathbf{x}_r$. Passing $\mathbf{x}_r$ through the interaction layer yields the embedding $\mathbf{x}_{\text{inter}}^r$. Finally, we concatenate all the vectors from the framework as

(17) $\text{hidden} = \text{concat}[\mathbf{x}_{\text{inter}}^t, \mathbf{x}_{\text{inter}}^r, \mathbf{e}_r^f, \mathbf{e}_r^h, \mathbf{h}_t],$

where $\mathbf{h}_t$ is the history representation of user $u_t$. The final predicted label $\hat{y}_t$ is then written as

(18) $\hat{y}_t = \sigma(\text{MLP}(\text{hidden})).$

Since CTR prediction is a binary classification task, we use the binary cross-entropy loss between the label $y_t$ and the prediction $\hat{y}_t$ for training.
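The prediction layer of Equations (15) to (18) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the inner product is chosen as the interaction function because it is one of the options listed above, and `mlp` is a hypothetical callable standing in for the trained MLP.

```python
import numpy as np
from itertools import combinations

def pairwise_inner_products(x, M, w):
    """Split an (M*w,) vector into M feature embeddings and return the inner
    product of every feature pair, in the order (1,2), (1,3), ..., (M-1,M)."""
    a = x.reshape(M, w)
    return np.array([a[i] @ a[j] for i, j in combinations(range(M), 2)])

def build_hidden(x_t, x_r, e_f, e_h, h_t, M, w):
    """Concatenate the interaction vectors and context embeddings (Eq. 17)."""
    return np.concatenate([
        pairwise_inner_products(x_t, M, w),   # target sample interactions
        pairwise_inner_products(x_r, M, w),   # aggregated key interactions
        e_f, e_h, h_t,
    ])

def predict(hidden, mlp):
    """Sigmoid of the MLP output (Eq. 18); mlp maps hidden to a scalar logit."""
    return 1.0 / (1.0 + np.exp(-mlp(hidden)))
```

With $M$ features, the interaction vector has $M(M-1)/2$ entries, so the hidden vector grows quadratically in the number of fields but is independent of the embedding width $w$ when inner products are used.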

4.5. Time Complexity & Speedup

In this section, we provide a brief time complexity analysis of the LIFT framework and then discuss the feasibility of speeding it up. The detailed discussions are deferred to Appendix C.

Time complexity. LIFT consists of two major phases. The first is the pretraining phase, which targets context sequences. In this phase, the dataset $\mathcal{D}_{\text{retrieval}}$ is segmented into a set of sequences, each of length $L$. Since the operations on each sequence have a time complexity of $O(1)$, the time complexity of this phase is $O(|\mathcal{D}_{\text{retrieval}}|/L)$.

The second phase is LIFT model training and inference. Besides the neural network computation, the retrieval process is incorporated into the model, accounting for a time complexity of $O(F) + O(F \cdot |\mathcal{D}_{\text{retrieval}}|/U) = O(F \cdot |\mathcal{D}_{\text{retrieval}}|/U)$, where $U$ denotes the total number of unique features in $\mathcal{D}_{\text{retrieval}}$ and $F$ denotes the number of features in the query.

Speedup. To enhance the efficiency of the retrieval process, we may adopt a method of storing the retrieval outcomes, eliminating the need to conduct the retrieval procedure during online inference. This approach shifts the retrieval process to an offline setting. For instance, in recommender systems, this method involves storing the retrieval results for all (or frequent) users, along with their respective recalled candidate items. This method represents a balance between the complexities of space and time, optimizing resource utilization by trading off storage space for processing speed.
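The space-for-time trade-off described above amounts to a lookup table filled offline. The following sketch shows the idea under stated assumptions: queries are hashable tuples of raw features, `retrieve_fn` is a hypothetical callable wrapping the retriever, and cache misses fall back to live retrieval (the paper does not say how misses are handled).

```python
def precompute_retrievals(queries, retrieve_fn):
    """Offline step: run the retriever once per expected query and store the result."""
    return {query: retrieve_fn(query) for query in queries}

def cached_retrieve(query, cache, retrieve_fn):
    """Online step: serve from the cache, retrieving live only on a miss."""
    if query in cache:
        return cache[query]
    result = retrieve_fn(query)
    cache[query] = result      # remember the result for later requests
    return result
```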

5. Experiments

This section starts with five research questions (RQs), which we use to guide the experiments and discussions. We provide the experiment code with running instructions on Anonymous GitHub (https://anonymous.4open.science/r/LIFT-277C/Readme.md).

  • RQ1: Does LIFT achieve the best performance?

  • RQ2: Does future data benefit the final result?

  • RQ3: How does the proposed pretraining method affect the prediction performance?

  • RQ4: Do different encoders and mask label ratios have different impacts on context representation learning?

  • RQ5: How time-efficient is LIFT, and does it have the potential to be deployed online?

5.1. Datasets

We evaluate the performance of LIFT by conducting experiments for CTR prediction tasks on three large-scale real-world datasets, i.e., Taobao, Tmall, and Alipay (https://tianchi.aliyun.com/dataset/x, where ‘x’ is ‘649’, ‘42’, and ‘53’ for Taobao, Tmall, and Alipay, respectively). For top-N ranking, we utilize two widely used public recommendation datasets: MovieLens (https://grouplens.org/datasets/movielens/1m/) and LastFM (http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html). Table 1 shows the number of instances and fields of these datasets.

Table 1. Dataset statistics

| Datasets | # Instances | # Fields | Task |
|---|---|---|---|
| Taobao | 100,150,807 | 4 | CTR Prediction |
| Tmall | 54,925,331 | 9 | CTR Prediction |
| Alipay | 35,179,371 | 6 | CTR Prediction |
| Movielens-1M | 1,000,209 | 7 | Top-N Ranking |
| LastFM | 18,993,371 | 5 | Top-N Ranking |

For dataset preprocessing, we follow the common practice in RIM (Qin et al., 2021) and DERT (Zheng et al., 2023). We split each dataset into three parts, i.e., retrieval set, train set, and test set, based on the global timestamps. Specifically, the retrieval set consists of the earliest data instances, the test set comprises the latest data instances, and the remaining intermediate data instances are allocated to the train set. We select the hyperparameters using cross validation over the training set. For pretraining, we use the retrieval set to pretrain the encoder. For the non-retrieval baseline models, we merge the retrieval set and the training set as the final training set.
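The timestamp-based split can be sketched as below. The split fractions are hypothetical parameters, since the text does not state the proportions used, and each instance is assumed to be a dict with a `timestamp` field.

```python
def split_by_time(rows, retrieval_frac, train_frac):
    """Split instances chronologically: earliest go to the retrieval set,
    the next portion to the train set, and the latest to the test set."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    n = len(ordered)
    n_retrieval = int(n * retrieval_frac)
    n_train = int(n * train_frac)
    retrieval = ordered[:n_retrieval]
    train = ordered[n_retrieval:n_retrieval + n_train]
    test = ordered[n_retrieval + n_train:]
    return retrieval, train, test
```

Ordering by global timestamp keeps every retrieved context earlier than the training and test instances, which is what prevents temporal leakage.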

5.2. Evaluation Metrics

We choose the widely used metrics area under the ROC curve (AUC) and negative log-likelihood (LogLoss) to evaluate the performance for CTR prediction. For top-N ranking, we use hit ratio (HR@N), normalized discounted cumulative gain (NDCG@N), and mean reciprocal rank (MRR). We conduct a significance test on each metric between the best and second-best methods, and between LIFT and the best-performing baseline; positive results are marked with “*”.
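For reference, the ranking metrics can be computed as in the sketch below. It assumes the common setting in which each test case has exactly one ground-truth item and `ranks` holds its 1-based position in the ranked list; the paper does not spell out its exact evaluation protocol, so this is an illustration only.

```python
import math

def hit_ratio_at_n(ranks, n):
    """Fraction of test cases whose ground-truth item is ranked in the top n."""
    return sum(r <= n for r in ranks) / len(ranks)

def ndcg_at_n(ranks, n):
    """With a single relevant item the ideal DCG is 1, so NDCG@n reduces to
    1 / log2(rank + 1) when the item is in the top n, and 0 otherwise."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= n) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all test cases."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, with ground-truth ranks 1, 3, and 12, two of the three cases count as hits at N = 10.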

Table 2. Performance comparison of CTR prediction task baselines. GBDT and DeepFM are the traditional methods. Others are sequential modeling methods. For a fair comparison, traditional models are trained on both the retrieval set and the training set. The best results are in bold fonts while the second best results are underlined. “Rel. Impr.” of each row means the relative AUC improvement of LIFT against the baseline. Improvements are statistically significant with $p < 0.01$.

| Models | Taobao LogLoss | Taobao AUC | Taobao Rel. Impr. | Tmall LogLoss | Tmall AUC | Tmall Rel. Impr. | Alipay LogLoss | Alipay AUC | Alipay Rel. Impr. |
|---|---|---|---|---|---|---|---|---|---|
| GBDT | 0.6797 | 0.6134 | 44.39% | 0.5103 | 0.8319 | 11.13% | 0.9062 | 0.6747 | 30.49% |
| DeepFM | 0.6497 | 0.6710 | 32.00% | 0.4695 | 0.8581 | 7.74% | 0.6271 | 0.6971 | 26.29% |
| FATE | 0.6497 | 0.6762 | 30.98% | 0.4737 | 0.8553 | 8.09% | 0.6199 | 0.7356 | 19.68% |
| BERT4Rec | 0.6356 | 0.6852 | 29.26% | 0.4017 | 0.8981 | 2.94% | 0.6024 | 0.7321 | 20.26% |
| DIN | 0.6086 | 0.7433 | 19.16% | 0.4292 | 0.8796 | 5.10% | 0.6044 | 0.7647 | 15.13% |
| DIEN | 0.6084 | 0.7506 | 18.00% | 0.4445 | 0.8838 | 4.61% | 0.6454 | 0.7502 | 17.36% |
| SIM | 0.5795 | 0.7825 | 13.19% | 0.4520 | 0.8857 | 4.38% | 0.6089 | 0.7600 | 15.84% |
| UBR | 0.5432 | 0.8169 | 8.42% | 0.4368 | 0.8975 | 3.01% | 0.5747 | 0.7952 | 10.71% |
| RIM | 0.4644 | 0.8563 | 3.43% | 0.3804 | 0.9138 | 1.17% | 0.5615 | 0.8006 | 9.97% |
| DERT | 0.4486 | 0.8647 | 2.42% | 0.3585 | 0.9200 | 0.4% | 0.5319 | 0.8087 | 8.86% |
| LIFT w/o pretrain | 0.4369* | 0.8727* | 1.49% | 0.3509* | 0.9236* | 0.10% | 0.4707* | 0.8572* | 2.71% |
| LIFT | 0.4129* | 0.8857* | - | 0.3489* | 0.9245* | - | 0.4361* | 0.8804* | - |

Table 3. Performance comparison of top-N ranking tasks on the ML-1M and LastFM datasets. The best results are highlighted in bold, while the second-best results are underlined. Significant improvements are those with $p<0.01$.

| Dataset | Metric | FPMC | TransRec | NARM | GRU4Rec | SASRec | RIM | DERT | LIFT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ML-1M | NDCG@5 | 0.0788 | 0.0808 | 0.0866 | 0.0872 | 0.0981 | 0.1577 | 0.1634 | 0.1806 |
| ML-1M | NDCG@10 | 0.1184 | 0.1217 | 0.1254 | 0.1265 | 0.1341 | 0.2059 | 0.2117 | 0.2293 |
| ML-1M | MRR | 0.1041 | 0.1078 | 0.1113 | 0.1135 | 0.1193 | 0.1704 | 0.1774 | 0.1914 |
| ML-1M | HR@1 | 0.0261 | 0.0275 | 0.0337 | 0.0369 | 0.0392 | 0.0645 | 0.0747 | 0.0800 |
| ML-1M | HR@5 | 0.1334 | 0.1375 | 0.1418 | 0.1395 | 0.1588 | 0.2515 | 0.2540 | 0.2808 |
| ML-1M | HR@10 | 0.2577 | 0.2659 | 0.2631 | 0.2624 | 0.2709 | 0.4014 | 0.4035 | 0.4324 |
| LastFM | NDCG@5 | 0.0432 | 0.1148 | 0.0916 | 0.1229 | 0.1163 | 0.2165 | 0.2620 | 0.2723 |
| LastFM | NDCG@10 | 0.0685 | 0.1441 | 0.1185 | 0.1486 | 0.1409 | 0.2911 | 0.3217 | 0.3444 |
| LastFM | MRR | 0.0694 | 0.1303 | 0.1083 | 0.1362 | 0.1289 | 0.2210 | 0.2694 | 0.2727 |
| LastFM | HR@1 | 0.0148 | 0.0563 | 0.0423 | 0.0658 | 0.0584 | 0.0915 | 0.1488 | 0.1485 |
| LastFM | HR@5 | 0.0733 | 0.1725 | 0.1394 | 0.1785 | 0.1729 | 0.3468 | 0.3742 | 0.4010 |
| LastFM | HR@10 | 0.1531 | 0.2628 | 0.2227 | 0.2581 | 0.2499 | 0.5780 | 0.5597 | 0.6310 |

Table 4. The influence of different encoders of LIFT on prediction performance. “(p)” means the encoder is pretrained.

| Dataset | RNN LogLoss | RNN AUC | RNN (p) LogLoss | RNN (p) AUC | Transformer Encoder LogLoss | Transformer Encoder AUC | Transformer Encoder (p) LogLoss | Transformer Encoder (p) AUC | Transformer Decoder LogLoss | Transformer Decoder AUC | Transformer Decoder (p) LogLoss | Transformer Decoder (p) AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Taobao | 0.4434 | 0.8667 | 0.4360 | 0.8715 | 0.4432 | 0.8684 | 0.4454 | 0.8679 | 0.4369 | 0.8727 | 0.4129 | 0.8857 |
| Tmall | 0.3730 | 0.9132 | 0.3637 | 0.9175 | 0.3539 | 0.9223 | 0.3528 | 0.9227 | 0.3509 | 0.9236 | 0.3489 | 0.9245 |
| Alipay | 0.4937 | 0.8391 | 0.4856 | 0.8468 | 0.4807 | 0.8504 | 0.4707 | 0.8572 | 0.4730 | 0.8556 | 0.4361 | 0.8804 |
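
Table 5. The influence of the usage of future data in LIFT.

| Dataset | No Context LogLoss | No Context AUC | Future Only LogLoss | Future Only AUC | History Only LogLoss | History Only AUC | Future & History LogLoss | Future & History AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Taobao | 0.4644 | 0.8563 | 0.4417 | 0.8682 | 0.4405 | 0.8694 | 0.4129 | 0.8857 |
| Tmall | 0.3804 | 0.9138 | 0.3606 | 0.9191 | 0.3588 | 0.9199 | 0.3489 | 0.9245 |
| Alipay | 0.5615 | 0.8006 | 0.4781 | 0.8525 | 0.4780 | 0.8527 | 0.4361 | 0.8804 |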

5.3. Compared Methods

For CTR prediction, we compare LIFT with nine strong baselines that fall into three groups. The first group consists of traditional tabular models, which use neither sequential modeling nor retrieval: GBDT (Chen et al., 1996), a widely used gradient-boosted trees model, and DeepFM (Guo et al., 2017), a factorization-machine-based deep neural network. The second group consists of end-to-end sequential deep models: DIN (Zhou et al., 2018b) and DIEN (Zhou et al., 2019), which are attention-based models of user behavior sequences for CTR prediction, and BERT4Rec (Sun et al., 2019), a Transformer-based sequential recommendation model. The third group consists of retrieval-based models: SIM (Qi et al., 2020), UBR (Qin et al., 2020), RIM (Qin et al., 2021), and DERT (Zheng et al., 2023). Additionally, we include FATE (Wu et al., 2021), a tabular data representation learning model that can be viewed as a random retrieval method for prediction.

For top-N ranking, we compare LIFT against seven baseline models. FPMC (Rendle, 2010) and TransRec (He et al., 2017) are factorization-based approaches. The other baselines, namely NARM (Li et al., 2017), GRU4Rec (Hidasi et al., 2015), SASRec (Kang and McAuley, 2018), RIM (Qin et al., 2021), and DERT (Zheng et al., 2023), are recently proposed neural network models.

5.4. Overall Performance (RQ1)

The overall performance comparison on CTR prediction is provided in Table 2, from which we make the following observations. (i) LIFT consistently outperforms all baselines. Specifically, compared with RIM, LIFT improves AUC by a relative 3.43%, 1.17%, and 9.97% on Taobao, Tmall, and Alipay, respectively; compared with DERT, the strongest baseline in Table 2, the relative improvements are 2.42%, 0.4%, and 8.86%. This demonstrates the effectiveness of the context information learned and retrieved in the LIFT framework. (ii) Without pretraining, LIFT still consistently outperforms the best baseline, which indicates that the raw future and history information alone already yields a significant improvement in prediction performance. (iii) The retrieval-based methods LIFT, DERT, and RIM are superior to the traditional methods and the sequential CTR models, which suggests that retrieval methods make better use of context information. (iv) In Table 3, LIFT achieves significant improvements over the baselines on nearly all metrics on both datasets; the only exception is HR@1 on LastFM, where DERT is marginally ahead (0.1488 vs. 0.1485). This demonstrates that incorporating contextual information enables the sequential model to perform well in top-N ranking tasks.
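As a quick check of how the “Rel. Impr.” column of Table 2 relates to the AUC columns, the following sketch recomputes two Taobao entries. The formula is the usual relative-improvement definition, which is our reading of the table caption; the function name is illustrative.

```python
def relative_improvement(auc_lift: float, auc_baseline: float) -> float:
    """Relative AUC improvement of LIFT over a baseline, in percent."""
    return (auc_lift - auc_baseline) / auc_baseline * 100.0


# Taobao AUC values copied from Table 2.
rim = relative_improvement(0.8857, 0.8563)
gbdt = relative_improvement(0.8857, 0.6134)
print(f"vs. RIM: {rim:.2f}%  vs. GBDT: {gbdt:.2f}%")
# prints: vs. RIM: 3.43%  vs. GBDT: 44.39%
```

Both values agree with the corresponding entries of Table 2.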

5.5. Further Analysis

We further study the effectiveness of the important components of LIFT, i.e., the use of future context, the pretraining, the encoder architecture, and the hyperparameters.

Future Context (RQ2). Table 5 compares how different parts of the context information affect the final prediction performance. Both the history information and the future information have a significant impact on the final prediction. Using only part of the context, i.e., future only or history only, yields worse results than using both, which indicates that the future part of the context provides information that differs from, and complements, the history part. This observation has been neglected by previous works.

Pretraining (RQ3). From the last two rows of Table 2, we find that on all three datasets, the model with pretraining performs better than the one without, which clearly demonstrates the effectiveness of pretraining. The enhanced performance of the pretrained encoder implies that relying solely on the label provided by the target sample is insufficient for effectively capturing contextual information. To improve the modeling of context, it is necessary to exploit the inherent information embedded within the context itself.

Figure 4. Performance w.r.t. different pretraining mask rates.

The Choice of the Encoder (RQ4). The prediction performance of LIFT with different encoders is provided in Table 4, where we observe that the Transformer-decoder-based encoder yields the best performance in both the non-pretraining and pretraining settings. A possible reason is that the Transformer decoder is better aware of positions in the sequence.

Figure 4 shows the influence of the mask rate on the prediction performance. Traditional models that encode the user interaction sequence usually use only the last label as the supervision signal. The results show that, on all three datasets, such a single-label supervision strategy is not the best choice. The optimal mask rate differs across datasets and is usually larger than 50%.
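To make the contrast between single-label supervision and rate-based masking concrete, the following sketch masks a fraction of the behavior labels in one sequence. It is an illustration under stated assumptions rather than the masking scheme of LIFT itself: positions are drawn uniformly at random, `MASK` is a hypothetical placeholder value, and the function names are ours.

```python
import random
from typing import List, Optional, Tuple

MASK = -1  # hypothetical placeholder id for a masked behavior label


def mask_behaviors(
    labels: List[int], mask_rate: float, rng: Optional[random.Random] = None
) -> Tuple[List[int], List[int]]:
    """Hide roughly `mask_rate` of the labels; return (masked inputs, supervised positions)."""
    rng = rng or random.Random()
    # Always supervise at least one label and never more than the sequence length.
    n_mask = min(len(labels), max(1, round(len(labels) * mask_rate)))
    positions = sorted(rng.sample(range(len(labels)), n_mask))
    masked = list(labels)
    for i in positions:
        masked[i] = MASK
    return masked, positions


def last_label_only(labels: List[int]) -> Tuple[List[int], List[int]]:
    """Single-label strategy: only the final behavior is predicted."""
    return labels[:-1] + [MASK], [len(labels) - 1]


labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
masked, positions = mask_behaviors(labels, mask_rate=0.6, rng=random.Random(0))
single_masked, single_positions = last_label_only(labels)
```

With a mask rate of 0.6, six of the ten labels provide supervision for this sequence, compared with one under the last-label strategy.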

Time Efficiency (RQ5). To evaluate the time efficiency of LIFT, we compare its actual inference time against three other mainstream models of different types. For each model, we use the parameters that performed best on Alipay. As shown in Figure 5, LIFT does not introduce significant overhead compared with the basic retrieval-based model RIM. The retrieval set size $K$ that maximizes LIFT’s AUC on the Alipay dataset is 15, and RIM with $K=10$ is approximately 30% faster than LIFT. With offline pretraining, LIFT is far more efficient than the Transformer-based model BERT4Rec.

Figure 5. Inference time on Alipay.

6. Conclusion

In this work, we propose a retrieval-based framework called LIFT to better utilize the context information of the current user interaction. This is the first work to include future information as part of the context without temporal data leakage in the training and inference stages. Moreover, we use a pretraining method to better mine the information in the context sequences and propose a novel mask behavior loss. The performance of LIFT shows that both historical and future information yield significant improvements in CTR prediction and top-N ranking. In addition, the pretraining comparison shows that context pretraining is a promising way to further improve the prediction performance of a variety of sequential recommendation models. In the future, we will further investigate context representation learning with more sophisticated retrieval methods, as well as speedup schemes for deploying LIFT to real-world recommenders.

References

  • Bahri et al. (2022) Dara Bahri, Heinrich Jiang, Yi Tay, and Donald Metzler. 2022. Scarf: Self-Supervised Contrastive Learning using Random Feature Corruption. In International Conference on Learning Representations.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Chen et al. (1996) Ming-Syan Chen, Jiawei Han, and Philip S. Yu. 1996. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and data Engineering 8, 6 (1996), 866–883.
  • Chen et al. (2021) Qiwei Chen, Changhua Pei, Shanshan Lv, Chao Li, Junfeng Ge, and Wenwu Ou. 2021. End-to-end user behavior retrieval in click-through rate prediction model. arXiv preprint arXiv:2108.04468 (2021).
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
  • Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In WSDM.
  • Cui et al. (2022) Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-rec: Generative pretrained language models are open-ended recommender systems. arXiv preprint arXiv:2205.08084 (2022).
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. Association for Computational Linguistics (ACL), 6894–6910.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. Deepfm: a factorization-machine based neural network for ctr prediction. IJCAI (2017).
  • Guo et al. (2022) Wei Guo, Can Zhang, Zhicheng He, Jiarui Qin, Huifeng Guo, Bo Chen, Ruiming Tang, Xiuqiang He, and Rui Zhang. 2022. Miss: Multi-interest self-supervised learning framework for click-through rate prediction. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 727–740.
  • He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. 173–182.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  • Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks. In SIGIR.
  • Jacob et al. (2019) Devlin Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
  • Khandelwal et al. (2019) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations.
  • Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
  • Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2671–2679.
  • Qi et al. (2020) Pi Qi, Xiaoqiang Zhu, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  • Qin et al. (2021) Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, and Yong Yu. 2021. Retrieval & Interaction Machine for Tabular Data Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1379–1389.
  • Qin et al. (2020) Jiarui Qin, W. Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, and Y. Yu. 2020. User Behavior Retrieval for Click-Through Rate Prediction. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based neural networks for user response prediction over multi-field categorical data. TOIS 37, 1 (2018), 1–35.
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International conference on data mining. IEEE, 995–1000.
  • Robertson et al. (1995) Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. Nist Special Publication Sp 109 (1995), 109.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining. 565–573.
  • Taylor (1953) Wilson L Taylor. 1953. “Cloze procedure”: A new tool for measuring readability. Journalism quarterly 30, 4 (1953), 415–433.
  • Vapnik et al. (2015) Vladimir Vapnik, Rauf Izmailov, et al. 2015. Learning using privileged information: similarity control and knowledge transfer. J. Mach. Learn. Res. 16, 1 (2015), 2023–2049.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In ADKDD. 1–7.
  • Wong et al. (2021) Chi-Man Wong, Fan Feng, Wen Zhang, Chi-Man Vong, Hui Chen, Yichi Zhang, Peng He, Huan Chen, Kun Zhao, and Huajun Chen. 2021. Improving conversational recommender system by pretraining billion-scale knowledge graph. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2607–2612.
  • Wu et al. (2021) Qitian Wu, Chenxiao Yang, and Junchi Yan. 2021. Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach. In Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021.
  • Xu et al. (2020) Chen Xu, Quan Li, Junfeng Ge, Jinyang Gao, Xiaoyong Yang, Changhua Pei, Fei Sun, Jian Wu, Hanxiao Sun, and Wenwu Ou. 2020. Privileged features distillation at taobao recommendations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2590–2598.
  • Yuan et al. (2022) Enming Yuan, Wei Guo, Zhicheng He, Huifeng Guo, Chengkai Liu, and Ruiming Tang. 2022. Multi-behavior sequential transformer recommender. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval. 1642–1652.
  • Yuan et al. (2020) Fajie Yuan, Xiangnan He, Haochuan Jiang, Guibing Guo, Jian Xiong, Zhezhao Xu, and Yilin Xiong. 2020. Future data helps training: Modeling future contexts for session-based recommendation. In Proceedings of The Web Conference 2020. 303–313.
  • Zhang et al. (2024) Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Yan Gao, Yao Hu, and Enhong Chen. 2024. NoteLLM-2: Multimodal Large Representation Models for Recommendation. arXiv preprint arXiv:2405.16789 (2024).
  • Zhang et al. (2019) Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level Deeper Self-Attention Network for Sequential Recommendation.. In IJCAI. 4320–4326.
  • Zhang et al. (2016) Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi-field Categorical Data: A Case Study on User Response Prediction. ECIR (2016).
  • Zhang et al. (2021) Weinan Zhang, Jiarui Qin, Wei Guo, Ruiming Tang, and Xiuqiang He. 2021. Deep learning for click-through rate estimation. IJCAI (2021).
  • Zheng et al. (2023) Lei Zheng, Ning Li, Xianyu Chen, Quan Gan, and Weinan Zhang. 2023. Dense Representation Learning and Retrieval for Tabular Data Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3559–3569.
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
  • Zhou et al. (2018b) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018b. Deep interest network for click-through rate prediction. In KDD.
  • Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.
  • Zhou et al. (2018a) Meizi Zhou, Zhuoye Ding, Jiliang Tang, and Dawei Yin. 2018a. Micro behaviors: A new perspective in e-commerce recommender systems. In Proceedings of the eleventh ACM international conference on web search and data mining. 727–735.

Appendix A Notations

The notations and their descriptions are summarized in Table 6.

Table 6. Notations and corresponding descriptions.

| Notation | Description |
| --- | --- |
| $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{test}}$, $\mathcal{D}_{\text{retrieval}}$ | Training set, test set, retrieval set |
| $x_{z}$, $\mathbf{x}_{z}$ | The raw feature and the embedding of the $z$-th sample |
| $x_{t}$, $\mathbf{x}_{t}$ | The raw feature and the embedding of the target sample |
| $E_{\omega}$ | The encoder and its parameters |
| $R$ | The retriever |
| $F_{\theta}$ | The predictor and its parameters |
| $d$ | An interaction |
| $v$ | The encoder output vector dimension |
| $w$ | The feature embedding vector dimension |
| $K$ | The size of retrieved contexts |
| $L$ | The length of history and future sequence |
| $c$, $h$, $f$ | The context sequence, history sequence, future sequence |
| $C$, $H$, $F$ | The context sequence set, history sequence set, future sequence set |

Appendix B Datastore of LIFT

As shown in Figure 6, the datastore $(\mathcal{K},\mathcal{V})$ is the set of all key-value pairs constructed from all the samples in the retrieval dataset $\mathcal{D}_{\text{retrieval}}$:

$$(\mathcal{K},\mathcal{V})=\{(x_{z},(E_{\omega}(h_{z}),E_{\omega}(f_{z}))) \mid d_{z}\in\mathcal{D}_{\text{retrieval}}\}. \tag{19}$$
Figure 6. The datastore in LIFT.

The keys of the datastore are the raw interactions in $\mathcal{D}_{\text{retrieval}}$, stored in their original data format. The values are the contextual information associated with these interactions, obtained by applying the pretrained encoder $E_{\omega}$ to the history and future context sequences of each interaction.
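The key-value layout of the datastore can be sketched as a plain dictionary. In this illustration the encoder is a toy stand-in for the pretrained $E_{\omega}$, the raw features are assumed to be hashable tuples, and sequences are lists of integer item ids; all names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Vector = List[float]
Key = Tuple[str, ...]  # assumed form of the raw feature x_z
Encoder = Callable[[Sequence[int]], Vector]


def toy_encoder(seq: Sequence[int]) -> Vector:
    """Stand-in for the pretrained encoder: a fixed 2-d summary (illustrative only)."""
    return [float(len(seq)), float(sum(seq))]


def build_datastore(
    samples: Iterable[Tuple[Key, Sequence[int], Sequence[int]]], encoder: Encoder
) -> Dict[Key, Tuple[Vector, Vector]]:
    """Map each raw feature x_z to the pair (E(h_z), E(f_z)) of encoded history and future."""
    return {x: (encoder(h), encoder(f)) for x, h, f in samples}


samples = [
    (("u1", "i9"), [1, 2, 3], [4, 5]),  # (x_z, history h_z, future f_z)
    (("u2", "i4"), [7], [8, 9, 10]),
]
datastore = build_datastore(samples, toy_encoder)
```

Since the encoder is applied once per retrieval sample, the values can be computed offline and looked up at prediction time.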

Appendix C Time complexity

We aim to provide a comprehensive analysis of the time complexity within our proposed framework, which consists of two major components.

The first component is the pretraining phase over context sequences. In this phase, the dataset $D_{\text{retrieval}}$ is segmented into a set of sequences, each of length $L$. Both the training and inference stages operate on each sequence with a time complexity of $O(1)$. Consequently, the overall time complexity of the pretraining phase is $O(|D_{\text{retrieval}}|/L)$.

The second component is the main framework’s algorithm, which differs from traditional neural network methods by incorporating a retrieval process during both the training and inference phases. We first examine the complexity of this retrieval process. Here, $|D_{\text{retrieval}}|$ is the number of samples, and $U$ denotes the total number of unique features in $D_{\text{retrieval}}$. The mean length of the posting lists in the inverted index is then $|D_{\text{retrieval}}|/U$. As explained in Section 4.3, the retrieval operation, which fetches the posting lists of all features in $x_{t}$, has a time complexity of $O(F)$, where $F$ is the number of features in the query (not to be confused with the future sequence set $F$ in Table 6). This step is treated as a constant-time operation. The average number of retrieved samples is $F\cdot|D_{\text{retrieval}}|/U$, and the complexity of the ranking operation scales linearly with the number of retrieved samples, i.e., $O(F\cdot|D_{\text{retrieval}}|/U)$. Therefore, the total time complexity of the retrieval process is $O(F)+O(F\cdot|D_{\text{retrieval}}|/U)=O(F\cdot|D_{\text{retrieval}}|/U)$. Besides the retrieval process, the neural network computation of the predictor has about the same time complexity as that of other models such as RIM (Qin et al., 2021) or DIN (Zhou et al., 2018b). Architecture-level acceleration of the predictor would therefore be an effective way to improve inference efficiency.
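The two cost terms in this analysis can be seen in a small inverted-index sketch. The scoring rule used here, the number of query features a sample shares, is a simplifying assumption made for illustration and stands in for the ranking function of Section 4.3; the function names are ours.

```python
import heapq
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def build_inverted_index(samples: Sequence[Sequence[Hashable]]) -> Dict[Hashable, List[int]]:
    """Map each feature value to the posting list of sample ids that contain it."""
    index: Dict[Hashable, List[int]] = defaultdict(list)
    for sid, feats in enumerate(samples):
        for feat in set(feats):
            index[feat].append(sid)
    return index


def retrieve(index: Dict[Hashable, List[int]], query: Sequence[Hashable], k: int) -> List[int]:
    """Return the ids of the k samples sharing the most features with the query."""
    scores: Counter = Counter()
    for feat in set(query):               # O(F) posting-list lookups
        for sid in index.get(feat, ()):   # about |D_retrieval| / U entries per list on average
            scores[sid] += 1
    # Top-k selection over the candidates; ties are broken by the smaller sample id.
    top = heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [sid for sid, _ in top]


samples = [
    ["u1", "cat_a", "shop_1"],
    ["u2", "cat_a", "shop_2"],
    ["u1", "cat_b", "shop_2"],
]
index = build_inverted_index(samples)
top2 = retrieve(index, ["u1", "cat_a"], k=2)
```

The outer loop corresponds to the $O(F)$ lookup term and the inner loop to the $O(F\cdot|D_{\text{retrieval}}|/U)$ candidate-scoring term; selecting the top $K$ with a heap adds only a $\log K$ factor, which is constant for fixed $K$.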

Appendix D Hyperparameter in the Retriever

In the LIFT framework, $L$ denotes the context sequence length and $K$ denotes the number of retrieved samples. From Figure 7 and Figure 8, we find that as $K$ increases, the prediction performance first rises and then falls. The curve indicates that, as $K$ grows, the retrieved contexts first add useful information, after which the additional noise becomes dominant. The AUC curves for $L$ show a trend similar to that for $K$: as $L$ increases, the AUC first improves and then drops. We attribute this to the same reason as for $K$. Moreover, because of limited GPU resources, we conduct this hyperparameter study only on Taobao and Alipay, and we could only increase $L$ to 70 for the history length on Taobao, where the downtrend had just started.

Figure 7. Hyperparameters study of LIFT on Taobao.
Figure 8. Hyperparameters study of LIFT on Alipay.