Integrating Textual Embeddings from Contrastive Learning with Generative Recommender for Enhanced Personalization

Yijun Liu
Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California
yijunl@usc.edu

Abstract

Recent advances in recommender systems have highlighted the complementary strengths of generative modeling and pretrained language models. We propose a hybrid framework that augments the Hierarchical Sequential Transduction Unit (HSTU) generative recommender with BLaIR—a contrastive text embedding model. This integration enriches item representations with semantic signals from textual metadata while preserving HSTU’s powerful sequence modeling capabilities.

We evaluate our method on two domains from the Amazon Reviews 2023 dataset, comparing it against the original HSTU and a variant that incorporates embeddings from OpenAI’s state-of-the-art text-embedding-3-large model. While the OpenAI embedding model is likely trained on a substantially larger corpus with significantly more parameters, our lightweight BLaIR-enhanced approach—pretrained on domain-specific data—consistently achieves better performance, highlighting the effectiveness of contrastive text embeddings in compute-efficient settings. Code is available at https://github.com/snapfinger/HSTU-BLaIR.

1 Introduction

Recent progress in recommender systems has been fueled by advances in deep learning Covington et al. (2016), and increasingly by techniques from natural language processing Geng et al. (2022); Lyu et al. (2023); Bao et al. (2023); Hou et al. (2024a, b); Zhang et al. (2024); Xu et al. (2024); Liao et al. (2024). While the dominant industrial paradigm has long relied on multi-stage recommendation pipelines—where two-tower architectures perform efficient candidate retrieval from a massive corpus, followed by a ranking model, often an MLP, for fine-grained scoring Covington et al. (2016); Yi et al. (2019); Huang et al. (2020); Yao et al. (2021)—these systems often require extensive feature engineering and may struggle to capture complex temporal and semantic dependencies. In contrast, generative recommenders jointly model user interaction sequences and directly generate item predictions in an autoregressive fashion Yuan et al. (2019); Rajput et al. (2023); Zhai et al. (2024). This paradigm enables richer temporal modeling and improved personalization. A key subtask within this space is sequential recommendation, which focuses on capturing the temporal dynamics of user interactions Kang and McAuley (2018); Sun et al. (2019).

Although recent work has begun to explore the integration of generative recommenders with pretrained language models, most existing approaches rely on prompting or fine-tuning large language models (LLMs) Geng et al. (2022); Li et al. (2023); Ji et al. (2024). These models are often not specifically tailored for recommendation tasks and tend to be computationally inefficient—especially given that state-of-the-art LLMs typically contain billions of parameters Xu et al. (2025). As a result, the broader incorporation of pretrained language models into generative recommendation systems remains in its early stages, with critical challenges around scalability, domain adaptation, and efficient utilization of textual signals still largely unresolved.

On the other hand, self-supervised contrastive learning has shown strong potential for producing high-quality text embeddings that capture fine-grained semantic alignment Gao et al. (2021); Hou et al. (2024a). Generative recommenders, in parallel, excel at modeling structured user behavior over time. Combining these complementary strengths offers a promising direction to unify semantic understanding with behavioral dynamics, enabling more expressive, adaptable, and semantically informed recommendation systems.

In this work, we integrate two recent state-of-the-art models: BLaIR Hou et al. (2024a), a contrastive text encoder pretrained on user reviews and item metadata from the Amazon Reviews 2023 dataset, and HSTU Zhai et al. (2024), a generative model that formulates recommendation as sequential transduction, unifying retrieval and ranking through autoregressive modeling. BLaIR generates semantically rich textual representations from item metadata, while HSTU provides a scalable, end-to-end architecture for sequential recommendation. By fusing BLaIR’s semantic embeddings with HSTU’s behavioral modeling, our framework enhances personalization while maintaining scalability and interpretability—leveraging human-readable metadata and avoiding the computational overhead of large language models.

2 Method

2.1 HSTU-BLaIR

We extend the Hierarchical Sequential Transduction Unit (HSTU) Zhai et al. (2024), a generative model for sequential recommendation, by incorporating semantic item information. HSTU encodes user histories into autoregressive embeddings via stacked transformers with hierarchical attention and relative positional bias, scoring candidates through an interaction module that unifies retrieval and ranking. We adopt a configuration of four transformer blocks and four attention heads to balance efficiency and capacity. To enrich item representations, we integrate precomputed textual embeddings from BLaIR Hou et al. (2024a), a contrastive learning–based encoder trained on user reviews and item metadata from the Amazon Reviews 2023 dataset. We use the publicly released BLaIR_BASE checkpoint (~125M parameters), pretrained solely on the first 80% of reviews sorted by timestamp, ensuring strict temporal separation from downstream test data.

Figure 1: Illustration of the main components in the HSTU-BLaIR method for sequential recommendation. The figure shows how precomputed BLaIR text embeddings are projected and fused with trainable item ID embeddings via element-wise addition ($\oplus$), and integrated into the HSTU model. The dotted region highlights the embedding fusion module.

We fuse textual and ID-based item representations via element-wise addition in the embedding module:

$$\bm{e}_{\text{combined}}=\bm{e}_{\text{item}}+\bm{W}_{\text{text}}\bm{e}_{\text{text}} \tag{1}$$

where $\bm{e}_{\text{item}}$ is the trainable ID embedding, $\bm{e}_{\text{text}}$ is the fixed BLaIR embedding, and $\bm{W}_{\text{text}}$ is a learnable projection. The fused embedding is added to the positional encoding:

$$\bm{e}_{\text{pos}}^{\prime}=\bm{e}_{\text{pos}}+\bm{e}_{\text{combined}} \tag{2}$$

enabling the model to jointly capture item identity and semantic context. We also refine HSTU’s negative sampling by incorporating text-based similarity, producing more semantically informed contrastive pairs for sampled softmax loss. Fig. 1 illustrates the core components of the HSTU-BLaIR method, focusing on the fusion of textual and ID embeddings.
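The fusion in Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the dimensions, the random initialisation, and the names `item_table`, `text_table`, `pos_table`, `fuse`, and `add_position` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_items, seq_len = 5, 3
d_model = 8    # width of the ID and positional embeddings (assumed)
d_text = 16    # width of the precomputed text embeddings (assumed)

# Trainable parameters (randomly initialised here).
item_table = rng.normal(size=(num_items, d_model))  # rows are e_item
W_text = rng.normal(size=(d_model, d_text))         # learnable projection
pos_table = rng.normal(size=(seq_len, d_model))     # rows are e_pos

# Fixed, precomputed text embeddings, one row per item.
text_table = rng.normal(size=(num_items, d_text))   # rows are e_text


def fuse(item_ids):
    """Eq. (1): e_combined = e_item + W_text e_text for each item in a sequence."""
    e_item = item_table[item_ids]       # (seq_len, d_model)
    e_text = text_table[item_ids]       # (seq_len, d_text)
    return e_item + e_text @ W_text.T   # (seq_len, d_model)


def add_position(e_combined):
    """Eq. (2): e_pos' = e_pos + e_combined."""
    return pos_table[: e_combined.shape[0]] + e_combined


history = np.array([4, 0, 2])           # a toy interaction sequence
e_combined = fuse(history)
e_pos_prime = add_position(e_combined)
print(e_pos_prime.shape)                # (3, 8)
```

Because the text embeddings are fixed, only `item_table`, `W_text`, and the positional table would receive gradients in a trainable version of this sketch.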

2.2 Dataset

We train and evaluate HSTU-BLaIR on the Video Games and Office Products subsets of the Amazon Reviews 2023 dataset Hou et al. (2024a); dataset statistics are shown in Table 1. Each subset includes user purchase histories, product descriptions, and review texts. We use the 5-core version, which filters out users and items with fewer than five interactions, following the offline evaluation setting in Zhai et al. (2024).

	Video Games	Office Products
Items	25,612	77,551
Users	94,762	223,308
Interactions	814,585	1,800,877

Table 1: Statistics of the two benchmark subsets used in our sequential recommendation experiments, including the number of users, items, and interactions.

Table 2: Evaluation metrics on Video Games and Office Products datasets. Each cell shows absolute values (top row) and percentage improvements over HSTU / SASRec (bottom row). HSTU-BLaIR attains the best value in every column.

Dataset	Model	HR@10	HR@50	HR@200	NDCG@10	NDCG@200
Video Games	SASRec	.1028	.2317	.3941	.0573	.1097
		-	-	-	-	-
	HSTU	.1315	.2765	.4565	.0741	.1327
		(+28%)	(+19%)	(+16%)	(+29%)	(+21%)
	HSTU-OpenAI (TE3L)	.1328	.2821	.4645	.0742	.1341
		(+1.0% / +29%)	(+2.0% / +22%)	(+1.8% / +18%)	(+0.1% / +30%)	(+1.1% / +22%)
	HSTU-BLaIR	.1353	.2852	.4684	.0760	.1361
		(+2.9% / +32%)	(+3.1% / +23%)	(+2.6% / +19%)	(+2.6% / +33%)	(+2.6% / +24%)
Office Products	SASRec	.0281	.0668	.1331	.0153	.0335
		-	-	-	-	-
	HSTU	.0395	.0880	.1649	.0223	.0443
		(+41%)	(+32%)	(+24%)	(+46%)	(+32%)
	HSTU-OpenAI (TE3L)	.0477	.1050	.1940	.0269	.0526
		(+20.8% / +70%)	(+19.3% / +57%)	(+17.6% / +46%)	(+20.6% / +76%)	(+18.7% / +57%)
	HSTU-BLaIR	.0484	.1068	.1946	.0271	.0529
		(+22.5% / +72%)	(+21.4% / +60%)	(+18.0% / +46%)	(+21.5% / +77%)	(+19.4% / +58%)

2.3 Tasks

We evaluate our model on next-item prediction in a sequential recommendation setting. Following Zhai et al. (2024), user interactions are sorted chronologically, and a leave-one-out evaluation strategy is applied: for each user, the most recent interaction is held out for testing, while all prior interactions are used for training. We also adopt the training and evaluation protocol from Zhai et al. (2024), which includes full data shuffling and multi-epoch training. All models, including our HSTU-BLaIR and the baselines, are trained for 100 epochs to ensure fair comparison.
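The leave-one-out protocol described above can be sketched as follows. This is an illustrative version only: the `(user, item, timestamp)` triple format, the function name `leave_one_out`, and the choice to skip users with a single interaction are assumptions for the example, not details taken from the evaluation code of Zhai et al. (2024).

```python
from collections import defaultdict


def leave_one_out(interactions):
    """Split (user, item, timestamp) triples into train histories and test targets."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, test = {}, {}
    for user, events in by_user.items():
        events.sort(key=lambda e: e[0])   # chronological order
        items = [item for _, item in events]
        if len(items) < 2:
            continue                      # assumed: nothing left to train on
        train[user] = items[:-1]          # all prior interactions
        test[user] = items[-1]            # most recent interaction is held out
    return train, test


log = [("u1", "a", 3), ("u1", "b", 1), ("u1", "c", 2), ("u2", "x", 5), ("u2", "y", 9)]
train, test = leave_one_out(log)
print(train)  # {'u1': ['b', 'c'], 'u2': ['x']}
print(test)   # {'u1': 'a', 'u2': 'y'}
```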

2.4 Baselines

We compare our proposed framework against three baselines: (1) SASRec Kang and McAuley (2018), a widely used transformer-based sequential recommender; (2) HSTU Zhai et al. (2024), the original generative recommender that uses only learned item ID embeddings without textual input; and (3) HSTU-OpenAI (TE3L), a variant of HSTU that integrates textual embeddings generated by OpenAI’s text-embedding-3-large model OpenAI (2024) in place of BLaIR.

OpenAI has not disclosed the architectural details, number of parameters, or training data used for text-embedding-3-large. However, based on prior publicly known OpenAI models—such as GPT-3 Brown et al. (2020)—it is likely that text-embedding-3-large contains at least billions of parameters and was trained on a diverse, large-scale corpus including Wikipedia, books, and large-scale web data, far exceeding the 125M parameters and domain-specific training data (Amazon Reviews 2023) used in BLaIR_BASE Hou et al. (2024a).

The text-embedding-3-large model supports a maximum input length of 8,179 tokens. In our datasets, a small number of items (2 in Video Games and 4 in Office Products) have review texts exceeding this limit. To ensure compatibility, we truncate these inputs by retaining only the first 8,179 tokens.

2.5 Evaluation Metrics

We assess model performance using two standard ranking metrics commonly used in recommendation systems:

HR@K (Hit Rate at K) measures whether the ground-truth next item appears within the top-$K$ predicted items. It reflects the model’s ability to include the correct item in its top-$K$ recommendations, regardless of position.

NDCG@K (Normalized Discounted Cumulative Gain at K) extends this by considering the position of the correct item in the ranked list, assigning higher scores to items that appear earlier. It is defined as $\text{NDCG@}K=\frac{\text{DCG@}K}{\text{IDCG@}K}$, where $\text{DCG@}K=\sum_{i=1}^{K}\frac{rel_{i}}{\log_{2}(i+1)}$ and $\text{IDCG@}K$ denotes the maximum possible DCG for an ideal ranking. Here, $rel_{i}$ is a binary relevance indicator (1 if the item at position $i$ is correct, 0 otherwise).
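In the leave-one-out setting there is exactly one relevant item per user, so the ideal ranking places it first and $\text{IDCG@}K = 1/\log_{2}(2) = 1$; NDCG@K then reduces to $1/\log_{2}(r+1)$ when the target sits at rank $r \le K$, and 0 otherwise. The sketch below computes both metrics for a single user under that assumption; the function names and toy scores are illustrative only.

```python
import math


def rank_of_target(scores, target):
    """1-based rank of `target` when items are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(target) + 1


def hr_at_k(rank, k):
    """1 if the ground-truth item is within the top K, else 0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    """With a single relevant item, IDCG@K = 1, so NDCG@K equals DCG@K."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}   # toy model scores
rank = rank_of_target(scores, "c")                  # "c" is ranked third
print(hr_at_k(rank, 2), hr_at_k(rank, 3))           # 0.0 1.0
print(ndcg_at_k(rank, 3))                           # 0.5
```

Dataset-level HR@K and NDCG@K are then the averages of these per-user values.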

3 Results

Table 2 reports the performance of our proposed HSTU-BLaIR method in comparison to several baselines on the Video Games and Office Products subsets. As shown in Table 1, these datasets differ notably in scale and sparsity, with Office Products featuring significantly more users and items, resulting in a sparser interaction matrix.
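The sparsity difference can be made concrete by computing the density of each interaction matrix, taken here as interactions divided by users times items. The snippet below uses the counts from Table 1; this definition of density is the usual one but is an assumption, since the text does not define sparsity formally.

```python
# Counts copied from Table 1.
stats = {
    "Video Games": {"items": 25_612, "users": 94_762, "interactions": 814_585},
    "Office Products": {"items": 77_551, "users": 223_308, "interactions": 1_800_877},
}


def density(s):
    """Fraction of the user-item matrix that is observed."""
    return s["interactions"] / (s["users"] * s["items"])


for name, s in stats.items():
    print(f"{name}: density = {density(s):.4%}")
# Video Games: density = 0.0336%
# Office Products: density = 0.0104%
```

Under this measure, Video Games is roughly three times denser than Office Products.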

Across both benchmarks and all evaluation metrics, HSTU-BLaIR consistently outperforms the original HSTU model, highlighting the effectiveness of incorporating semantic item information through BLaIR’s pretrained textual embeddings. Notably, it also outperforms HSTU-OpenAI (TE3L), which leverages OpenAI’s text-embedding-3-large, although the margin is small on Office Products (e.g., .0484 vs. .0477 in HR@10). This suggests that domain-specific contrastive encoders like BLaIR can yield representations at least as effective for recommendation as general-purpose embedding models trained on broad web corpora.

Improvements are especially pronounced in the sparser Office Products dataset, where HSTU-BLaIR improves NDCG@10 by 77% over SASRec and by 21.5% over HSTU, with the largest gain over HSTU being 22.5% in HR@10. These results highlight the effectiveness and generalizability of our approach across different dataset scales and sparsity levels.
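The relative gains quoted here follow from the absolute values in Table 2 as $(\text{new}-\text{base})/\text{base}$. The check below recomputes the Office Products NDCG@10 gains from the rounded table entries; the helper name `rel_gain` is illustrative, and small discrepancies elsewhere in the table are possible if the reported percentages were computed from unrounded scores.

```python
def rel_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Office Products NDCG@10 values from Table 2.
ndcg10 = {"SASRec": 0.0153, "HSTU": 0.0223, "HSTU-BLaIR": 0.0271}

print(f"{rel_gain(ndcg10['HSTU-BLaIR'], ndcg10['HSTU']):.1f}")    # 21.5
print(f"{rel_gain(ndcg10['HSTU-BLaIR'], ndcg10['SASRec']):.0f}")  # 77
```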

4 Conclusion and Future Work

This work explores the integration of contrastively learned textual embeddings into a generative recommender system. Our results show that enriching sequential user models with semantically grounded text representations leads to consistent improvements in recommendation performance. Specifically, we fuse BLaIR-generated text embeddings with HSTU’s item ID embeddings via element-wise addition, allowing the model to jointly leverage structured and unstructured item information. This simple yet effective strategy underscores the promise of combining generative recommenders with pretrained contrastive encoders.

Future work will explore more advanced fusion mechanisms that dynamically adapt item semantics based on user context, as well as richer interaction modules beyond dot product similarity to capture more complex preference patterns. Finally, we plan to investigate the generalizability of the method to out-of-domain datasets and explore strategies for adapting or pretraining text encoders in new domains.

Limitations

While our approach demonstrates the effectiveness of integrating contrastive textual signals into generative sequential recommenders, it has several limitations. First, item representations are static, composed of a fixed combination of ID embeddings and precomputed BLaIR text embeddings. Second, the model relies on dot product similarity for user-item interaction, which may limit its expressiveness. Third, the performance of BLaIR depends on the availability and quality of item metadata; in domains with sparse or noisy textual content, the benefits of text embeddings may be reduced. Lastly, although we enforce strict temporal separation between BLaIR pretraining and downstream evaluation data, the encoder is still trained on the same corpus (Amazon Reviews 2023), which may limit its ability to generalize to datasets with different distributions.

References

Bao et al. (2023) Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, pages 1007–1014.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pages 191–198.
Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Empirical Methods in Natural Language Processing (EMNLP).
Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5). In Proceedings of the 16th ACM conference on recommender systems, pages 299–315.
Hou et al. (2024a) Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. 2024a. Bridging language and items for retrieval and recommendation. arXiv preprint arXiv:2403.03952.
Hou et al. (2024b) Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2024b. Large language models are zero-shot rankers for recommender systems. In European Conference on Information Retrieval, pages 364–381. Springer.
Huang et al. (2020) Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020. Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2553–2561.
Ji et al. (2024) Jianchao Ji, Zelong Li, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Juntao Tan, and Yongfeng Zhang. 2024. Genrec: Large language model for generative recommendation. In European Conference on Information Retrieval, pages 494–502. Springer.
Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM), pages 197–206. IEEE.
Li et al. (2023) Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni. 2023. GPT4Rec: A generative framework for personalized recommendation and user interests interpretation. arXiv preprint arXiv:2304.03879.
Liao et al. (2024) Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. 2024. LLaRA: Large language-recommendation assistant. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1785–1795.
Lyu et al. (2023) Hanjia Lyu, Song Jiang, Hanqing Zeng, Yinglong Xia, Qifan Wang, Si Zhang, Ren Chen, Christopher Leung, Jiajie Tang, and Jiebo Luo. 2023. LLM-Rec: Personalized recommendation via prompting large language models. arXiv preprint arXiv:2307.15780.
OpenAI (2024) OpenAI. 2024. New embedding models and api updates. https://openai.com/index/new-embedding-models-and-api-updates/.
Rajput et al. (2023) Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al. 2023. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36:10299–10315.
Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management, pages 1441–1450.
Xu et al. (2024) Shuyuan Xu, Wenyue Hua, and Yongfeng Zhang. 2024. OpenP5: An open-source platform for developing, training, and evaluating llm-based recommender systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 386–394.
Xu et al. (2025) Wujiang Xu, Qitian Wu, Zujie Liang, Jiaojiao Han, Xuying Ning, Yunxiao Shi, Wenfang Lin, and Yongfeng Zhang. 2025. SLMRec: Distilling large language models into small for sequential recommendation. In International Conference on Learning Representations (ICLR 2025).
Yao et al. (2021) Tiansheng Yao, Xinyang Yi, Derek Zhiyuan Cheng, Felix Yu, Ting Chen, Aditya Menon, Lichan Hong, Ed H Chi, Steve Tjoa, Jieqi Kang, et al. 2021. Self-supervised learning for large-scale item recommendations. In Proceedings of the 30th ACM international conference on information & knowledge management, pages 4321–4330.
Yi et al. (2019) Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM conference on recommender systems, pages 269–277.
Yuan et al. (2019) Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M Jose, and Xiangnan He. 2019. A simple convolutional generative network for next item recommendation. In Proceedings of the twelfth ACM international conference on web search and data mining, pages 582–590.
Zhai et al. (2024) Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, Yinghai Lu, and Yu Shi. 2024. Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org.
Zhang et al. (2024) Chao Zhang, Shiwei Wu, Haoxin Zhang, Tong Xu, Yan Gao, Yao Hu, and Enhong Chen. 2024. NoteLLM: A retrievable large language model for note recommendation. In Companion Proceedings of the ACM Web Conference 2024, pages 170–179.