CSMF: Cascaded Selective Mask Fine-Tuning for Multi-Objective Embedding-Based Retrieval

Hao Deng (ORCID 0009-0002-6335-7405), Alibaba International Digital Commerce Group, Beijing, China, denghao.deng@alibaba-inc.com
Haibo Xing (ORCID 0009-0006-5786-7627), Alibaba International Digital Commerce Group, Hangzhou, China, xinghaibo.xhb@alibaba-inc.com
Kanefumi Matsuyama (ORCID 0009-0002-1365-5375), Alibaba International Digital Commerce Group, Hangzhou, China, kanefumi.matsuyama@alibaba-inc.com
Moyu Zhang (ORCID 0000-0002-9104-1881), Alibaba International Digital Commerce Group, Beijing, China, zhangmoyu.zmy@alibaba-inc.com
Jinxin Hu (ORCID 0000-0002-7252-5207), Alibaba International Digital Commerce Group, Beijing, China, jinxin.hjx@lazada.com
Hong Wen (ORCID 0009-0006-5786-7627), Unaffiliated, Hangzhou, China, dreamonewh@gmail.com
Yu Zhang (ORCID 0000-0002-6057-7886), Alibaba International Digital Commerce Group, Beijing, China, daoji@lazada.com
Xiaoyi Zeng (ORCID 0000-0002-3742-4910), Alibaba International Digital Commerce Group, Hangzhou, China, yuanhan@taobao.com
Jing Zhang (ORCID 0000-0001-6595-7661), School of Computer Science, Wuhan University, Wuhan, China, jingzhang.cv@gmail.com
(2025)
Abstract.

Multi-objective embedding-based retrieval (EBR) has become increasingly critical due to the growing complexity of user behaviors and commercial objectives. While traditional approaches often suffer from data sparsity and limited information sharing between objectives, recent methods that use a shared network alongside dedicated sub-networks for each objective partially address these limitations. However, such methods significantly increase the number of model parameters, leading to increased retrieval latency and a limited ability to model causal relationships between objectives. To address these challenges, we propose Cascaded Selective Mask Fine-Tuning (CSMF), a novel method that enhances both retrieval efficiency and serving performance for multi-objective EBR. The CSMF framework selectively masks model parameters to free up independent learning space for each objective, leveraging the cascading relationships between objectives during sequential fine-tuning. Without increasing network parameters or online retrieval overhead, CSMF computes a linearly weighted fusion score for multiple objective probabilities while supporting flexible adjustment of each objective’s weight across various recommendation scenarios. Experimental results on real-world datasets demonstrate the superior performance of CSMF, and online experiments validate its significant practical value.

Recommendation Systems, Embedding-Based Retrieval, Efficient Fine-Tuning, Multi-Objective Optimization
journalyear: 2025
copyright: acmlicensed
conference: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval; July 13–18, 2025; Padua, Italy
booktitle: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), July 13–18, 2025, Padua, Italy
doi: 10.1145/3726302.3729939
isbn: 979-8-4007-1592-1/2025/07
ccs: Information systems → Retrieval models and ranking

1. Introduction

The primary goal of recommendation systems on e-commerce platforms is to help users quickly identify highly relevant products from a vast pool of candidates under strict time constraints. A widely adopted approach is a cascaded multi-stage selection process, typically divided into two stages (Huang et al., 2020): retrieval and ranking. Retrieval methods are broadly classified as rule-based retrieval and Embedding-Based Retrieval (EBR). Rule-based retrieval methods, such as item-CF (Sarwar et al., 2001) and user-CF (Kaufinann, 2006), leverage collaborative filtering signals to perform lightweight retrieval. In contrast, with advances in deep learning, EBR methods have demonstrated superior retrieval accuracy, which has led to widespread adoption in industrial applications (Huang et al., 2020; Jiang et al., 2022).

Figure 1. The cascading relationships among user actions. (a) The recommendation page. (b) The product detail page, displayed after the user clicks on a product, prominently features a "Buy Now" button. (c) The checkout page is displayed after the user clicks the "Buy Now" button.

Typically, EBR employs a two-tower deep neural network architecture to balance recall efficiency and system performance (Huang et al., 2013). Specifically, user and product information are encoded into two separate vectors via parallel neural networks, and relevance is determined by the similarity between these vectors (e.g., their dot product). In deployment, product vectors are pre-computed, enabling efficient online retrieval of the top-k products in sub-linear time using an Approximate Nearest Neighbor (ANN) system (e.g., FAISS (Johnson et al., 2019)).
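To make the retrieval step concrete, the following is a minimal sketch of dot-product top-k selection over pre-computed item vectors. The function names, item ids, and vector values are hypothetical, and the exhaustive scan is only a stand-in for the ANN index that a deployed system would use.

```python
import heapq


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_items(user_vec, item_vecs, k):
    """Return the ids of the k items with the largest dot-product score.

    item_vecs maps an item id to its pre-computed item-tower vector.
    This is an exhaustive scan; an ANN index would replace it in practice.
    """
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Toy, made-up vectors standing in for the outputs of the two towers.
item_vecs = {
    "item_a": [1.0, 0.0],
    "item_b": [0.0, 1.0],
    "item_c": [0.7, 0.7],
}
user_vec = [0.9, 0.2]

print(top_k_items(user_vec, item_vecs, k=2))  # ['item_a', 'item_c']
```

Because the item vectors do not depend on the user, they can be computed offline once, and only the user vector has to be produced at request time.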

Recently, multi-objective retrieval optimization has emerged as a fundamental challenge for EBR in industrial systems. Given the diverse commercial objectives in industry, EBR is increasingly required to retrieve a product set that satisfies multiple objectives simultaneously. For example, the e-commerce platform at Taobao (Zheng et al., 2022) simultaneously optimizes four objectives: relevance, exposure, clicks, and conversions. Similarly, online advertising recommendation systems at Tencent (Xu et al., 2022) focus on optimizing clicks and conversions. For clicks and conversions, the pattern exposure → click → conversion is sequential, with a fixed order (Ma et al., 2018a), as shown in Figure 1.

Existing methods for multi-objective EBR can be broadly categorized into two types: Multi-Model and Single-Model approaches. In the Multi-Model approach, separate EBR models are independently developed for each objective, as shown in Figure 2(a). However, this method struggles to capture the interrelationships among objectives and suffers from data sparsity in downstream objectives (e.g., the conversion objective) (Wang et al., 2023). In recent years, the Single-Model approach has been more widely adopted (Zheng et al., 2022; Zhang et al., 2022; He et al., 2023), leveraging a single EBR model to simultaneously learn multiple objectives. One variant of the Single-Model approach (Zheng et al., 2022; Jiang et al., 2022) combines the training data of all objectives and optimizes a single score to fit them. However, this variant encounters significant gradient conflicts (Yu et al., 2020) because all objectives are modeled within the same parameter space, as shown in Figure 2(b). The application of Mixture of Experts (MoE) techniques (Ma et al., 2018b) to multi-objective modeling, as shown in Figure 2(c), has inspired researchers (Xu et al., 2022; Jiang et al., 2022) to design a shared network along with dedicated sub-networks for each objective to mitigate information conflicts and improve information sharing.

However, the above approaches introduce a large number of network parameters, increasing vector dimensionality for online ANN retrieval and worsening both service latency and storage overhead. Furthermore, they allocate equal parameters to all objectives, exacerbating learning difficulties due to data sparsity. Additionally, they overlook the sequential relationship between objectives, hindering models from accurately capturing both objective interdependencies and users’ actual behavioral patterns.

To address the identified challenges, we propose the Cascaded Selective Mask Fine-Tuning Framework (CSMF) for multi-objective EBR, inspired by the success of parameter-efficient fine-tuning (PEFT) techniques used in large language models (LLMs), as shown in Figure 2(d). CSMF enhances retrieval efficiency and system performance by dividing the training process into three stages, leveraging the cascading relationship between objectives. First, a backbone model is pre-trained using large-scale exposure data as positive samples for EBR. Next, the model undergoes two fine-tuning stages: first with click data and then with conversion data. Prior to fine-tuning, CSMF selectively masks redundant parameters with low informational value, freeing up parameter space for subsequent tasks. To preserve accuracy after pruning, the unpruned parameters are fine-tuned again on a small subset of previous data. Once the accuracy is recovered, these parameters are frozen to retain the knowledge of earlier tasks. By iterating through the pre-train → selective mask → accuracy recovery → fine-tune cycle, CSMF achieves multiple objectives sequentially, mitigating issues of knowledge sharing and catastrophic forgetting. Furthermore, the CSMF framework encounters two key challenges: efficient parameter pruning to optimize each objective, and resolving conflicts between objectives during multi-stage training. To address these challenges, we propose the cumulative percentile-based pruning (CPP) method, which adaptively prunes neurons based on the information distribution of each layer. Additionally, we introduce a cross-stage adaptive margin loss function (AML) to reduce negative transfer effects caused by objective conflicts, dynamically adjusting the difficulty of contrastive learning (Mikolov et al., 2013). In CSMF, selective parameter allocation reduces conflicts between objectives without requiring additional parameters.

In online deployment, CSMF partitions the parameter space using a cascaded selective mask fine-tuning strategy, allowing for flexible, linear fusion of multiple objective probabilities. By assigning dynamic, objective-specific weights to parameter subsets, it adapts to varying business needs. Unlike MoE-based EBR, CSMF avoids increasing output vector dimensionality, reducing both retrieval latency and storage overhead in online ANN systems.

In summary, our proposed method makes the following key contributions:

  • We introduce the Cascaded Selective Mask Fine-Tuning for multi-objective EBR, which effectively enhances retrieval efficiency and system performance during online serving.

  • We integrate a modified softmax loss and an effective parameter selection approach within CSMF to address objective conflicts and reduce catastrophic forgetting.

  • We present a flexible online multi-objective retrieval method that identifies the jointly optimal candidate set for multiple objectives, without increasing network parameters or burdening online systems.

  • We conduct extensive offline experiments on real-world industrial datasets and deploy our proposed method in an online advertising system, validating its superiority over competing methods.

Figure 2. Methods for multi-objective EBR. (a) Separate EBR models for each objective (Yi et al., 2019). (b) A unified EBR model trained on a mixed multi-objective dataset (Zheng et al., 2022; Zhao et al., 2021). (c) MoE-based methods for multi-objective EBR models (Xu et al., 2022). (d) Our proposed CSMF.

2. Related Work

2.1. Multi-Objective Embedding-Based Retrieval

Embedding-Based Retrieval has gained significant popularity in industry, particularly within search, recommendation, and advertising systems (Zheng et al., 2022; Ma et al., 2018b; Zhang et al., 2022). DSSM (Kim, 2014) was one of the first models to use a two-tower deep neural network to generate semantic vector representations for queries and documents. YouTubeDNN (Yi et al., 2019), a widely adopted benchmark, uses an ANN system for efficient online retrieval of top-k items. As EBR applications grow in industry, there has been an increasing focus on optimizing multiple objectives. In industrial systems, retrieval tasks typically involve multiple objectives (Xu et al., 2022). For example, recommendation systems prioritize objectives such as clicks and conversions, while short video platforms also track metrics like user engagement duration and video completion rates (Wang et al., 2024). However, these objectives frequently conflict (Xu et al., 2022), making multi-objective balancing in EBR a critical research challenge. As mentioned earlier, state-of-the-art multi-task learning methods (Ma et al., 2018b; Tang et al., 2020) can be applied to EBR models. For example, Tencent proposed the MVKE model (Xu et al., 2022), which leverages the MoE architecture to jointly optimize clicks and conversions. However, MoE-based approaches increase output vector dimensionality in EBR, resulting in higher computational complexity and latency in the online ANN retrieval system. Additionally, Taobao introduced the MOPPR model (Zheng et al., 2022), which addresses the challenge of balancing multiple objectives using a listwise (Cao et al., 2007) approach. Methods like DMTL (Zhao et al., 2021; Zhang et al., 2022) use distillation learning to facilitate joint optimization of multiple objectives in EBR.

However, the above methods overlook the cascaded relationships among objectives and introduce a large number of network parameters, increasing latency and storage burden during serving. This paper addresses these limitations by focusing on the cascaded relationships among objectives, optimizing information sharing and mitigating catastrophic forgetting in multi-objective EBR.

2.2. Parameter-Efficient Fine-Tuning

Deep learning models often require large datasets for sufficient learning, but such datasets are not always available. Transfer learning is an effective technique to address this challenge (Alyafeai et al., 2020). Fine-tuning is a commonly used approach in transfer learning (Howard and Ruder, 2018; Devlin, 2018). It can be classified into two types: full parameter fine-tuning and partial parameter fine-tuning. In full parameter fine-tuning, all network parameters are updated to suit downstream tasks (Gao et al., 2021; Dodge et al., 2020). However, this approach often results in challenges like forgetting upstream knowledge and high resource consumption (Mallya et al., 2018). Recently, with advancements in LLMs, partial parameter fine-tuning methods have proven effective. These methods preserve upstream model information while adapting to downstream task objectives (Hu et al., 2021; Xin et al., 2024). This technique is known as PEFT. PEFT can be classified into three types (Xin et al., 2024): adaptive, reparameterized, and selective. Both adaptive and reparameterized methods retain upstream model parameters while adding a small number of additional parameters during fine-tuning. LoRA (Hu et al., 2021) proposed using a trainable low-rank matrix to facilitate learning in downstream models. Subsequent works (Hayou et al., 2024; Liu et al., 2024; Zhou et al., 2024) aim to improve the efficiency of knowledge transfer in LoRA-based methods. PackNet (Mallya and Lazebnik, 2018), based on the selective approach, introduces a training framework that divides the upstream model’s parameters into two segments: one is used to fit the downstream model, and the other remains unchanged to preserve the upstream knowledge. This method is particularly suited for scenarios where there are cascaded dependencies between upstream and downstream tasks. Moreover, selective methods generally do not introduce additional network parameters.

Figure 3. Illustration of CSMF Framework. Taking one of the matrices in the user or item tower as an example, the CSMF framework for three objectives (exposure, click, and conversion) is organized into three stages. First, the backbone model is pre-trained on large-scale exposure data. The model then undergoes two fine-tuning stages with click and conversion data, respectively. Before fine-tuning, CSMF selectively masks redundant parameters to free up space for the new tasks. To ensure accuracy, unpruned parameters are fine-tuned again on a small subset of the current stage’s data before being frozen. This iterative process (pre-train → selective mask → accuracy recovery → fine-tune) enables the sequential optimization of multiple objectives, addressing knowledge sharing and catastrophic forgetting issues.

However, most existing PEFT methods have primarily focused on the field of natural language processing, with limited application in retrieval models. In this paper, we apply PEFT to address the challenges of knowledge transfer and the decline in online efficiency caused by parameter expansion in multi-objective EBR.

3. Problem Formulation

In this section, we define the EBR model. Let $\mathcal{U}$ be the set of users, with each user denoted by $u$, and $\mathcal{I}$ be the set of items, with each item denoted by $i$. The goal is to retrieve a set of items $T_{u}=\{i_{1},i_{2},\ldots,i_{m}\}$, where $m$ represents the size of the set. The retrieved item set $T_{u}$ must optimize multiple objectives simultaneously, such as exposure, click, and conversion.

A common approach constructs a two-tower EBR model to address multiple objectives and selects the top-k items based on the model’s output scores. User-side features $\mathcal{B}_{u}$ are input into the user-side encoder $f_{\theta}$, yielding the user-side vector $\mathbf{e}_{\theta}^{u}=f_{\theta}(\mathcal{B}_{u})$, where $\theta$ denotes the trainable model parameters. Similarly, item-side features $\mathcal{Q}_{i}$ are input into the item-side encoder $g_{\theta}$, resulting in the item-side vector $\mathbf{v}_{\theta}^{i}=g_{\theta}(\mathcal{Q}_{i})$. The relevance score $s^{ui}$ between user $u$ and item $i$ is computed using a distance function $h_{\theta}(u,i)$, i.e., the dot product. The top-k items can be determined as follows:

$$
\begin{aligned}
T_{u} &= \mathop{\mathrm{argTopK}}\limits_{i\in I} h_{\theta}(u,i),\\
h_{\theta}(u,i) &= {\mathbf{e}_{\theta}^{u}}^{\top}\mathbf{v}_{\theta}^{i}.
\end{aligned}
\tag{1}
$$

The positive sample set $I^{P}$ for EBR typically comes from users’ explicit feedback, such as the click dataset $\mathcal{O}$ and the conversion dataset $\mathcal{R}$. These feedback signals exhibit a cascading hierarchy in user behavior, establishing relationships among datasets. Specifically, the conversion dataset $\mathcal{R}$ is a subset of the click dataset $\mathcal{O}$, which in turn is a subset of the exposure dataset $\mathcal{D}$, and so on: $\mathcal{R}\subset\mathcal{O}\subset\mathcal{D}\subset\mathcal{I}$. The negative sampling strategy consists of in-batch negative sampling (BNS) (Huang et al., 2020) and unexposed items within the same request (Zheng et al., 2022). The negative sample set is denoted as $I^{N}$. During training, EBR uses contrastive learning to distinguish $I^{P}$ from $I^{N}$ (Mikolov et al., 2013). The softmax-based probability is defined as:

$$
p_{\theta}(i|u)\propto\frac{\exp(h_{\theta}(u,i))}{\sum_{j=1}^{I^{N}}\exp(h_{\theta}(u,j))},\quad i\in I^{P}.
\tag{2}
$$

During the training phase, model parameters $\theta$ are updated by minimizing the negative log likelihood $-\log p_{\theta}(i|u)$.
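As an illustration of this objective, the sketch below evaluates the negative log likelihood for one positive item given the scores of its sampled negatives. It follows Eq. (2) as written, normalizing over the negative set only; the `include_positive` flag is an assumed, commonly used variant that also places the positive score in the denominator and is not taken from the paper. The function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def nll_loss(pos_score, neg_scores, include_positive=False):
    """Negative log likelihood -log p(i|u) for one positive item.

    pos_score is h(u, i) for the positive item; neg_scores holds h(u, j)
    for the sampled negatives. With include_positive=False the
    denominator matches Eq. (2) as written (negatives only).
    """
    denominator_scores = list(neg_scores)
    if include_positive:
        # Assumed variant: a standard softmax over positive + negatives.
        denominator_scores.append(pos_score)
    log_prob = pos_score - log_sum_exp(denominator_scores)
    return -log_prob


# Raising the positive score relative to the negatives lowers the loss.
print(nll_loss(1.0, [0.5, 0.2, -0.3]))
print(nll_loss(3.0, [0.5, 0.2, -0.3]))
```

Working in log space avoids overflow when scores are large, which matters once dot products are computed over high-dimensional vectors.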

4. Multi-Objective Cascaded Selective Mask Fine-Tuning

This section introduces the CSMF training framework, a multi-stage training method that integrates the Cumulative Percentile Based Pruning Method and the Cross-Stage Adaptive Margin Loss Function. Figure 3 illustrates the overall training framework.

4.1. Training Framework

Drawing inspiration from PEFT techniques in LLMs, we propose the CSMF method for EBR. In our framework, following the sequence of cascading objectives, the training process is divided into three tasks: exposure, click, and conversion. Importantly, the data volume decreases significantly, with $|\mathcal{D}| \gg |\mathcal{O}| \gg |\mathcal{R}|$, making the modeling task progressively more challenging. Following PEFT principles, CSMF uses exposure tasks with large datasets to support the tasks with smaller datasets during the cascaded fine-tuning (Alyafeai et al., 2020), significantly improving retrieval efficiency for each objective. Thus, the exposure task serves as the upstream task for the click task, and the conversion task serves as the downstream task relative to the click task.

This method enables a single EBR model to simultaneously output the exposure probability $s_{d}$, click probability $s_{o}$, and conversion probability $s_{r}$, without increasing the dimensionality of the final vectors (e.g., $\mathbf{e}_{\theta}^{u}$ and $\mathbf{v}_{\theta}^{i}$). Moreover, by employing flexible probability combination calculations, the CSMF method can efficiently retrieve a candidate set optimized for multiple objectives, as detailed in Section 4.4.
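The idea of combining the three outputs can be illustrated with a linearly weighted sum, as mentioned in the abstract. The sketch below is only an assumption about the general shape of such a fusion: the weights, item ids, and scores are made up, and the actual combination used for online retrieval is the one described in Section 4.4.

```python
def fused_score(s_d, s_o, s_r, weights):
    """Linearly weighted fusion of exposure, click, and conversion scores."""
    w_d, w_o, w_r = weights
    return w_d * s_d + w_o * s_o + w_r * s_r


def rank_by_fusion(candidates, weights, k):
    """Return the k item ids with the highest fused score.

    candidates maps an item id to its (s_d, s_o, s_r) triple.
    """
    ranked = sorted(
        candidates,
        key=lambda item_id: fused_score(*candidates[item_id], weights),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-objective scores for three items.
candidates = {
    "x": (0.9, 0.2, 0.1),
    "y": (0.3, 0.6, 0.5),
    "z": (0.5, 0.5, 0.2),
}

# Changing the weights changes the retrieved set without retraining.
print(rank_by_fusion(candidates, weights=(0.1, 0.6, 0.3), k=2))  # ['y', 'z']
print(rank_by_fusion(candidates, weights=(0.8, 0.1, 0.1), k=2))  # ['x', 'z']
```

The two calls show how a click-heavy weighting and an exposure-heavy weighting select different candidates from the same scores.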

During the pre-train stage for the exposure task, the $k$-th layer of the model’s network parameters $w_{k}\in\theta\ (k\leq|\theta|)$ is utilized to learn whether a product will be exposed to a user. Positive samples are collected from the exposure dataset $\mathcal{D}$, while the negative samples consist of BNS (Huang et al., 2020) and unexposed items within the same request (Zheng et al., 2022). After multiple training epochs, the parameter set $\theta$ converges to $\theta^{\hat{d}}=\{w^{\hat{d}}_{1},w^{\hat{d}}_{2},\ldots,w^{\hat{d}}_{k},\ldots\}$.

$$
\theta^{\hat{d}}=\mathop{\arg\min}\limits_{\theta}\sum_{u}\sum_{i\in\mathcal{D}_{u}}-\log p_{\theta}(i|u),
\tag{3}
$$

where $\mathcal{D}_{u}$ represents the set of products exposed to user $u$. In deep neural networks, some parameters exhibit redundancy, and masking these parameters generally does not significantly impact model accuracy (Mallya and Lazebnik, 2018). To free up learnable parameter space for downstream tasks, we selectively mask the parameter set $\theta^{\hat{d}}$, which involves pruning redundant parameters. The pruning process evaluates the information value of each neuron. If a neuron’s information value falls below a threshold $\delta$, it is deemed to have minimal impact on overall accuracy. Thus, the primary task at this stage is to assess each neuron’s information value. We explored two parameter pruning methods, as detailed in Section 4.2.

After pruning, each $w^{\hat{d}}_{k}\in\theta^{\hat{d}}$ is divided into two parts: redundant parameters for exposure tasks, denoted as $\theta^{p}$, and the retained parameter set, denoted as $\theta^{d}$. Furthermore, we perform the accuracy recovery operation to ensure the accuracy of the exposure task. Specifically, a small subset of the exposure dataset $\mathcal{D}^{\prime}\subset\mathcal{D}$ is used to fine-tune the parameter set $\theta^{d}$ once again. Notably, this fine-tuning process does not modify $\theta^{p}$. Once the accuracy is recovered, $\theta^{d}$ is frozen and does not undergo updates. Thus, the exposure probability $s_{d}^{ui}={\mathbf{e}_{\theta^{d}}^{u}}^{\top}\mathbf{v}_{\theta^{d}}^{i}$ is calculated using the parameter set $\theta^{d}$.
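The mask, recover, and freeze cycle can be sketched on a single flattened weight vector. This is a hedged illustration only: it ranks weights by absolute magnitude (a PackNet-style stand-in, not the pruning criterion of Section 4.2), and the helper names, gradient values, and learning rate are hypothetical.

```python
def split_by_magnitude(weights, prune_fraction):
    """Return a keep-mask over the weights.

    True marks retained weights (the role of theta^d); False marks freed
    weights (the role of theta^p). Magnitude ranking is an assumed
    stand-in for the information-value criterion.
    """
    n_prune = int(len(weights) * prune_fraction)
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]))
    pruned = set(order[:n_prune])
    return [j not in pruned for j in range(len(weights))]


def apply_mask(weights, keep_mask):
    """Zero out the freed positions so they no longer affect the output."""
    return [w if keep else 0.0 for w, keep in zip(weights, keep_mask)]


def masked_sgd_step(weights, grads, trainable_mask, lr):
    """Gradient step that only touches positions marked trainable."""
    return [
        w - lr * g if trainable else w
        for w, g, trainable in zip(weights, grads, trainable_mask)
    ]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # made-up converged weights
keep = split_by_magnitude(weights, prune_fraction=0.5)
theta = apply_mask(weights, keep)

# Accuracy recovery: only retained weights move; freed weights stay at 0.
recovered = masked_sgd_step(theta, [1.0] * 6, keep, lr=0.1)

# Next task: retained weights are frozen, only freed weights are updated.
freed = [not k for k in keep]
after_click = masked_sgd_step(recovered, [-1.0] * 6, freed, lr=0.1)

print(keep)  # [True, False, True, False, True, False]
```

Keeping the two masks complementary is what prevents the later task from overwriting the knowledge stored in the retained weights.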

During the fine-tuning stage of the click task, $\theta^{d}$ remains unchanged, while the click dataset serves as positive samples to update $\theta^{p}$, fitting the click probability $s_{o}$. After several training epochs, the parameter set $\theta^{p}$ converges to $\theta^{\hat{o}}=\{w^{\hat{o}}_{1},w^{\hat{o}}_{2},\ldots,w^{\hat{o}}_{k},\ldots\}$.

$$
\{\theta^{d};\theta^{\hat{o}}\}=\mathop{\arg\min}\limits_{\theta^{p}}\sum_{u}\sum_{i\in\mathcal{O}_{u}}-\log p_{\{\theta^{d};\theta^{p}\}}(i|u),
\tag{4}
$$

where $\mathcal{O}_{u}$ represents the set of products clicked by user $u$. As in the exposure task, we apply the same parameter pruning method to $\theta^{\hat{o}}$, dividing it into two parts: $\theta^{r}$ and $\theta^{o}$. Here, $\theta^{o}$ is the retained parameter set for the click task. To preserve accuracy after pruning, a subset of the click dataset $\mathcal{O}^{\prime}\subset\mathcal{O}$ is used to fine-tune the parameter set $\theta^{o}$. The click probability $s_{o}^{ui}$ for user $u$ and product $i$ is calculated using the parameter sets $\theta^{d}$ and $\theta^{o}$ as follows: $s_{o}^{ui}={\mathbf{e}_{\{\theta^{d};\theta^{o}\}}^{u}}^{\top}\mathbf{v}_{\{\theta^{d};\theta^{o}\}}^{i}$.

For the remaining parameter set $\theta^{r}$, the conversion dataset serves as positive samples. While freezing $\theta^{o}$ and $\theta^{d}$, $\theta^{r}$ is updated to fit the conversion probability $s_{r}^{ui}$:

(5) $$s_{r}^{ui}={\mathbf{e}_{\{\theta^{d};\ \theta^{o};\ \theta^{r}\}}^{u}}^{\top}\mathbf{v}_{\{\theta^{d};\ \theta^{o};\ \theta^{r}\}}^{i}={\mathbf{e}_{\theta}^{u}}^{\top}\mathbf{v}_{\theta}^{i}.$$

After training, the parameter space $\theta$ in the CSMF method is divided into three parts: $\theta^{d}$, $\theta^{o}$, and $\theta^{r}$. Using appropriate weighted combinations, the method simultaneously yields the exposure probability $s_{d}^{ui}$, click probability $s_{o}^{ui}$, and conversion probability $s_{r}^{ui}$, as detailed in Section 4.4. Redundancy pruning and neuron-level sharing ensure that information from upstream tasks is fully retained in downstream parameter spaces, achieving lossless transfer and mitigating information forgetting. Moreover, judicious pruning of redundant parameters allocates independent optimization space to each objective, reducing gradient conflicts and improving efficiency.

4.2. CPP: Cumulative Percentile-based Pruning

One of the key challenges in the CSMF method is evaluating the importance of neurons during the parameter pruning process. This evaluation is crucial because it directly affects the transmission of information from upstream tasks, thereby influencing the overall efficiency of the model. The significance of a neuron is typically measured by the amount of information it encapsulates. Therefore, neurons with higher information value should be retained, while those with lower information should be pruned to free up optimization space for downstream tasks.

To measure the information value, PackNet (Mallya and Lazebnik, 2018) uses the absolute value of the neurons as the evaluation criterion. In each layer of network parameters $w_{k}\in\theta\ (k\leq|\theta|)$, neurons are sorted by their absolute values, with the top $m_{k}\ (m_{k}\leq|w_{k}|)$ neurons being retained, while the remainder are pruned. However, this method applies a fixed pruning ratio to each layer, disregarding the varying importance of neurons across different layers (Shahroudnejad, 2021).

To address this limitation, we propose an adaptive pruning method called the Cumulative Percentile-based Pruning method. For each layer parameter $w_{k}\in\theta$, we first compute the cumulative sum of the absolute values of all neurons, denoted as $\mathcal{C}_{k}=\{c_{1},c_{2},\ldots,c_{n_{k}}\}$, where $n_{k}$ is the total number of neurons in $w_{k}$. Specifically, $c_{n_{k}}=\sum_{1\leq j\leq n_{k}}|w_{kj}|$ denotes the total information content in this layer, where $w_{kj}\in w_{k}$. Next, we define a pruning ratio $\tau\ (0<\tau<1)$, and calculate the maximum cumulative percentile to be pruned, denoted as $ind_{k}=c_{n_{k}}*\tau$.
Based on this cumulative percentile, we compute the probability that each neuron $w_{kj}\in w_{k}$ will be pruned, expressed as $\mathcal{P}(w_{kj})=f(c_{j}\leq ind_{k})$, where $f(\cdot)$ is a comparison function that returns 1 if the condition $c_{j}\leq ind_{k}$ is met, and 0 otherwise. If $\mathcal{P}(w_{kj})>0$, the neuron is pruned; otherwise, it is retained. This approach adaptively determines the number of neurons to prune based on the distribution of information content across each layer.
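The pruning rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `cpp_prune_mask` is hypothetical, and the sketch assumes that the cumulative sums $c_{j}$ are accumulated over neurons sorted by ascending absolute value, so that the least informative neurons are pruned first (the text does not state the ordering explicitly).

```python
import numpy as np

def cpp_prune_mask(weights: np.ndarray, tau: float) -> np.ndarray:
    """Return a boolean mask for one layer (True = pruned).

    Assumption (not stated in the text): cumulative sums run over neurons
    sorted by ascending absolute value.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    flat = np.abs(weights).ravel()
    order = np.argsort(flat, kind="stable")   # smallest magnitude first
    cumulative = np.cumsum(flat[order])       # c_1, ..., c_{n_k}
    threshold = cumulative[-1] * tau          # ind_k = c_{n_k} * tau
    pruned_sorted = cumulative <= threshold   # f(c_j <= ind_k)
    mask = np.zeros(flat.shape, dtype=bool)
    mask[order] = pruned_sorted               # map back to original positions
    return mask.reshape(weights.shape)

# Two hypothetical layers with the same tau but different magnitude profiles.
even_layer = np.array([-4.0, 1.0, 3.0, -2.0])
skewed_layer = np.array([0.1, 0.1, 0.1, 10.0])
even_mask = cpp_prune_mask(even_layer, tau=0.35)
skewed_mask = cpp_prune_mask(skewed_layer, tau=0.35)
print(even_mask.sum(), skewed_mask.sum())
```

With the same $\tau$, the two layers lose different numbers of neurons, which is the adaptive behavior that distinguishes this rule from a fixed per-layer ratio.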

4.3. AML: Cross-Stage Adaptive Margin Loss

Conflicts in multi-objective learning are inevitable and often lead to negative transfer effects, reducing downstream task performance (Crawshaw, 2020). CSMF mitigates this issue by assigning a distinct parameter space to each objective, thereby partially reducing conflicts. However, since downstream tasks inherit all upstream parameters, significant conflicts between objectives can still hinder downstream learning and result in severe negative transfer effects.

To address this challenge, we propose a cross-stage adaptive margin loss function within the CSMF framework. AML allows downstream tasks to dynamically adjust the margin size between positive and negative sample pairs. When downstream objectives align with upstream predictions, optimization becomes easier. Conversely, when downstream objectives conflict with upstream predictions, additional effort is required to correct the bias.

For instance, in the click task, consider two items, $i$ and $j$, for a user $u$, where $i$ is a positive sample and $j$ is a negative sample. If the exposure scores satisfy $s_{d}^{ui}>s_{d}^{uj}$, the click task reduces optimization effort for this sample. The margin in click scores between items $i$ and $j$ should preserve the margin in exposure scores. However, if $s_{d}^{ui}<s_{d}^{uj}$, the model amplifies the margin in the click scores, compensating by increasing the difference between $s_{o}^{ui}$ and $s_{o}^{uj}$. The loss function for the click task is expressed as follows:

(6) $$\mathcal{L}_{clk}=\sum_{u}\sum_{i\in\mathcal{O}_{u}}-\log\frac{e^{s_{o}^{ui}}}{e^{s_{o}^{ui}}+\sum_{j\in I^{N}_{i}}e^{(s_{o}^{uj}-m^{ij}_{d})}},$$
$$m^{ij}_{d}=\begin{cases}s_{d}^{ui}-s_{d}^{uj}+\sigma, & s_{d}^{ui}\geq s_{d}^{uj}\\ (s_{d}^{uj}-s_{d}^{ui})*\eta+\sigma, & s_{d}^{ui}<s_{d}^{uj}\end{cases}.$$

In Eq. (6), $I_{i}^{N}$ represents the negative sample set for the pair <user $u$, item $i$>, and $m^{ij}_{d}$ represents the adaptive margin between items $i$ and $j$, while $\sigma$ denotes the minimum distance between positive and negative samples. When $s_{d}^{ui}\geq s_{d}^{uj}$, the optimization objective of the click task aligns with the exposure probability, and the margin should simply reflect the exposure probability margin. Conversely, when $s_{d}^{ui}<s_{d}^{uj}$, the click task and the exposure task are misaligned, necessitating an adjustment that increases the click probability margin between items $i$ and $j$. The hyperparameter $\eta$ serves as an adaptive coefficient in the AML method, controlling the extent of corrective adjustments required by downstream tasks.
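To make Eq. (6) concrete, the following NumPy sketch evaluates the adaptive margin and the loss term for a single <user, positive item> pair. It is an illustration under stated assumptions rather than the authors' training code: the function names, the score values, and the settings `eta=2.0` and `sigma=0.1` are hypothetical, and the margin is subtracted from the negative scores exactly as the equation is written.

```python
import numpy as np

def adaptive_margin(up_pos, up_neg, eta, sigma):
    """Margin m^{ij} from upstream scores of the positive i and negatives j."""
    up_neg = np.asarray(up_neg, dtype=float)
    aligned = up_pos >= up_neg
    return np.where(
        aligned,
        up_pos - up_neg + sigma,          # upstream agrees with the label
        (up_neg - up_pos) * eta + sigma,  # upstream conflicts with the label
    )

def click_loss_single(s_pos, s_neg, margin):
    """-log(e^{s_pos} / (e^{s_pos} + sum_j e^{s_neg_j - m_j})) for one pair."""
    shifted_neg = np.asarray(s_neg, dtype=float) - np.asarray(margin, dtype=float)
    logits = np.concatenate(([s_pos], shifted_neg))
    shift = logits.max()                  # max-shift for numerical stability
    log_denominator = shift + np.log(np.exp(logits - shift).sum())
    return log_denominator - s_pos

# Hypothetical exposure scores: the first negative agrees, the second conflicts.
margins = adaptive_margin(0.8, [0.5, 0.9], eta=2.0, sigma=0.1)
loss = click_loss_single(s_pos=1.2, s_neg=[0.3, 1.0], margin=margins)
print(margins, loss)
```

The same `adaptive_margin` helper applies to any downstream stage once the appropriate upstream scores are supplied.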

Unlike the click task, the conversion task involves two upstream tasks, requiring consideration of inconsistencies with both the click and exposure tasks. Thus, the consistency between exposure probability, click probability, and the conversion label is crucial. To avoid introducing additional estimation biases, we define the upstream probability as the product of cascade probabilities $s_{o^{\prime}}^{ui}=s_{d}^{ui}*s_{o}^{ui}$ for user $u$ and product $i$. In the conversion task learning, the adaptive margin $m^{ij}_{o}$ between items $i$ and $j$ is defined as:

(7) $$m^{ij}_{o}=\begin{cases}s_{o^{\prime}}^{ui}-s_{o^{\prime}}^{uj}+\sigma, & s_{o^{\prime}}^{ui}\geq s_{o^{\prime}}^{uj}\\ (s_{o^{\prime}}^{uj}-s_{o^{\prime}}^{ui})*\eta+\sigma, & s_{o^{\prime}}^{ui}<s_{o^{\prime}}^{uj}\end{cases}.$$
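As a quick check of Eq. (7), consider a worked example with hypothetical values that are not taken from the paper's experiments: $s_{d}^{ui}=0.5$, $s_{o}^{ui}=0.4$, $s_{d}^{uj}=0.6$, $s_{o}^{uj}=0.5$, $\eta=2$, and $\sigma=0.1$. The cascade products place the negative item $j$ above the positive item $i$, so the second branch applies:

$$s_{o^{\prime}}^{ui}=0.5*0.4=0.2,\qquad s_{o^{\prime}}^{uj}=0.6*0.5=0.3,\qquad m^{ij}_{o}=(0.3-0.2)*2+0.1=0.3.$$

If the two products were swapped ($s_{o^{\prime}}^{ui}=0.3$, $s_{o^{\prime}}^{uj}=0.2$), the first branch would give $m^{ij}_{o}=0.3-0.2+0.1=0.2$.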

In the CSMF method, downstream tasks can leverage the probabilities of upstream tasks during training, allowing the AML function to assess the conflicts between objectives and adaptively adjust the margin between positive and negative samples.

4.4. Flexible Online Multi-Objective Retrieval

This section introduces the online deployment of the CSMF method, a flexible multi-objective online retrieval approach. By assigning appropriate weights to specific parameters, a fused value for each objective’s probability score is derived, enabling simultaneous retrieval of an optimal joint candidate set across multiple objectives. This approach also allows flexible adjustment of objective weights, enabling quick adaptation in different industrial contexts.

In the CSMF method, the model’s parameter set $\theta$ is divided into three distinct, mutually exclusive parts: $\theta^{d}$, $\theta^{o}$, and $\theta^{r}$. By correctly combining these parameters, we can accurately compute the exposure probability $s_{d}^{ui}$, click probability $s_{o}^{ui}$, and conversion probability $s_{r}^{ui}$ for user $u$ and item $i$. Specifically, $s_{d}^{ui}$ is derived from the parameter set $\theta^{d}$, while $s_{o}^{ui}$ depends on both $\theta^{d}$ and $\theta^{o}$. The conversion probability $s_{r}^{ui}$ relies on the entire parameter set $\theta$.
Formal decompositions demonstrate that by applying weights to the final-layer vector of the user-side tower, $\mathbf{e}_{\theta}^{u}=\{\mathbf{e}_{\theta^{d}}^{u};\mathbf{e}_{\theta^{o}}^{u};\mathbf{e}_{\theta^{r}}^{u}\}$, the weighted sum across all three objectives can be computed. Here, $\{\cdot;\cdot\}$ denotes the vector concatenation operation.
Assuming $\mathbf{e}_{\theta}^{u}$ has a dimensionality of $m^{e}$, while $\mathbf{e}_{\theta^{d}}^{u}$, $\mathbf{e}_{\theta^{o}}^{u}$, and $\mathbf{e}_{\theta^{r}}^{u}$ have dimensions $m^{e}_{d}$, $m^{e}_{o}$, and $m^{e}_{r}$, respectively, it follows that $m^{e}=m^{e}_{d}+m^{e}_{o}+m^{e}_{r}$.
Similarly, the item-side vector $\mathbf{v}_{\theta}^{i}$ follows the same logic as $\mathbf{e}_{\theta}^{u}$.

Let $k_{d}$, $k_{o}$, and $k_{r}$ represent the combination weights for the exposure probability $s_{d}^{ui}$, click probability $s_{o}^{ui}$, and conversion probability $s_{r}^{ui}$, respectively. Using an inner product as the distance function, the relevance score $s^{ui}$ can be expressed as:

(8) $$\begin{split}s^{ui}&=k_{d}*{\mathbf{e}_{\theta^{d}}^{u}}^{\top}\mathbf{v}_{\theta^{d}}^{i}+k_{o}*{\mathbf{e}_{\{\theta^{d};\theta^{o}\}}^{u}}^{\top}\mathbf{v}_{\{\theta^{d};\theta^{o}\}}^{i}\\ &\quad+k_{r}*{\mathbf{e}_{\{\theta^{d};\theta^{o};\theta^{r}\}}^{u}}^{\top}\mathbf{v}_{\{\theta^{d};\theta^{o};\theta^{r}\}}^{i}\\ &=(k_{d}+k_{o}+k_{r})*{\mathbf{e}_{\theta^{d}}^{u}}^{\top}\mathbf{v}_{\theta^{d}}^{i}+(k_{o}+k_{r})*{\mathbf{e}_{\theta^{o}}^{u}}^{\top}\mathbf{v}_{\theta^{o}}^{i}\\ &\quad+k_{r}*{\mathbf{e}_{\theta^{r}}^{u}}^{\top}\mathbf{v}_{\theta^{r}}^{i}\\ &=\{(k_{d}+k_{o}+k_{r})*{\mathbf{e}_{\theta^{d}}^{u}}^{\top};\ \ (k_{o}+k_{r})*{\mathbf{e}_{\theta^{o}}^{u}}^{\top};\ \ k_{r}*{\mathbf{e}_{\theta^{r}}^{u}}^{\top}\}\\ &\quad\{\mathbf{v}_{\theta^{d}}^{i};\ \ \mathbf{v}_{\theta^{o}}^{i};\ \ \mathbf{v}_{\theta^{r}}^{i}\}\\ &={\mathbf{e}_{\theta}^{u}}^{\top}\mathbf{v}_{\theta}^{i}.\end{split}$$

As shown in Eq. (8), the weighted sum over the three objectives is computed by assigning the weights $k_d + k_o + k_r$, $k_o + k_r$, and $k_r$ to the components $\mathbf{e}_{\theta^d}^{u}$, $\mathbf{e}_{\theta^o}^{u}$, and $\mathbf{e}_{\theta^r}^{u}$ of the user-side vector $\mathbf{e}_{\theta}^{u}$, respectively. Using the weight triplet $\langle k_d, k_o, k_r \rangle$, we can compute the combined weighted score for all three objectives and retrieve a candidate set of items optimized for them. In some recommendation scenarios, such as those tailored to different user groups, the emphasis on each objective may vary. Accordingly, we can adjust the triplet $\langle k_d, k_o, k_r \rangle$ to tune the retrieval targets.
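
The identity in Eq. (8) can be checked numerically. The following is a minimal pure-Python sketch, assuming the user and item vectors are concatenations of three per-objective sub-vectors; the function names (`fuse_user_vector`, `fuse_item_vector`) and the toy values are hypothetical and not taken from the paper.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def fuse_user_vector(e_d, e_o, e_r, k_d, k_o, k_r):
    """Scale each user sub-vector by its cumulative weight, then concatenate."""
    w_d = k_d + k_o + k_r  # weight on the exposure sub-vector
    w_o = k_o + k_r        # weight on the click sub-vector
    w_r = k_r              # weight on the conversion sub-vector
    return ([w_d * x for x in e_d]
            + [w_o * x for x in e_o]
            + [w_r * x for x in e_r])

def fuse_item_vector(v_d, v_o, v_r):
    """Item side: plain concatenation, independent of the weights."""
    return list(v_d) + list(v_o) + list(v_r)

# Toy sub-vectors (hypothetical values, two dimensions each).
e_d, e_o, e_r = [0.2, 0.1], [0.4, -0.3], [0.5, 0.6]
v_d, v_o, v_r = [1.0, 0.5], [0.3, 0.2], [-0.1, 0.7]
k_d, k_o, k_r = 1.0, 1.8, 1.2

user_vec = fuse_user_vector(e_d, e_o, e_r, k_d, k_o, k_r)
item_vec = fuse_item_vector(v_d, v_o, v_r)

# A single inner product over the fused vectors ...
fused_score = dot(user_vec, item_vec)
# ... equals the explicit weighted sum of per-objective inner products.
weighted_sum = ((k_d + k_o + k_r) * dot(e_d, v_d)
                + (k_o + k_r) * dot(e_o, v_o)
                + k_r * dot(e_r, v_r))
assert math.isclose(fused_score, weighted_sum)
```

Because the weights are folded into the user vector only, the item-side vectors in this sketch do not need to be recomputed when the triplet changes.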

5. Experiments

In this section, we perform extensive experiments on two datasets to evaluate the effectiveness of our proposed framework and address the following questions:

  • RQ1: How does our CSMF compare to other state-of-the-art models in terms of overall performance?

  • RQ2: What is the impact of each component on the overall performance of the model?

  • RQ3: What is the effect of the hyper-parameters on the performance of our model?

  • RQ4: How does the online deployment performance of our CSMF compare to other state-of-the-art models?

Table 1. Statistics of datasets used in experiments. #Pvs and #Items denote the number of requests and items, respectively, while #Exposures, #Clicks, and #Conversions indicate the number of exposure, click, and conversion events in the dataset.
| Dataset | #Items | #Pvs | #Exposures | #Clicks | #Conversions |
| --- | --- | --- | --- | --- | --- |
| Industrial Dataset | 0.11B | 109M | 2.1B | 19M | 0.3M |
| AliExpress Dataset | 16.7M | 8.7M | 130M | 3.6M | 61.8K |

Table 2. Comparison of Methods on Industrial and AliExpress Datasets. We use the nDCG@50 and the Recall@50 metrics to evaluate the efficiency of the click objective on the click dataset and the conversion objective on the conversion dataset. The last row reports the relative improvement of our proposed method compared to the best baseline result.
| Method | Industrial / Click / nDCG@50 | Industrial / Click / Recall@50 | Industrial / Conversion / nDCG@50 | Industrial / Conversion / Recall@50 | AliExpress / Click / nDCG@50 | AliExpress / Click / Recall@50 | AliExpress / Conversion / nDCG@50 | AliExpress / Conversion / Recall@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YouTubeDNN-Sep | 0.0998 | 0.2715 | 0.0698 | 0.2990 | 0.0453 | 0.1502 | 0.0328 | 0.0514 |
| YouTubeDNN-Mix | 0.0978 | 0.2692 | 0.0715 | 0.3089 | 0.0451 | 0.1497 | 0.0332 | 0.0525 |
| DTML | 0.1098 | 0.2824 | 0.0845 | 0.3283 | 0.0471 | 0.1567 | 0.0442 | 0.1179 |
| MVKE | 0.1118 | 0.2957 | 0.0885 | 0.3338 | 0.0543 | 0.1741 | 0.0477 | 0.1367 |
| DMMP | 0.1138 | 0.2980 | 0.0878 | 0.3319 | 0.0542 | 0.1746 | 0.0481 | 0.1378 |
| MOPPR | 0.1128 | 0.2946 | 0.0895 | 0.3436 | 0.0531 | 0.1738 | 0.0472 | 0.1329 |
| CSMF (Ours) | 0.1178 | 0.3177 | 0.0915 | 0.3694 | 0.0551 | 0.1777 | 0.0492 | 0.1397 |
| Improvement | +3.51% | +6.61% | +2.23% | +7.51% | +1.47% | +1.78% | +2.29% | +1.38% |

5.1. Experimental Setup

5.1.1. Dataset

Two datasets are used to evaluate the proposed CSMF method. Table 1 provides statistics for both datasets. The details are as follows:

  • Industrial Dataset. Following existing baselines (Zheng et al., 2022; Xu et al., 2022), we use real-world recommendation logs with exposure, click, and conversion events for both training and testing. This dataset is collected from an online advertising recommendation system of a leading e-commerce platform in Southeast Asia. The training data precede the testing data. On the user side, we consider three types of features: profile information (e.g., gender, location), behavioral data (e.g., click and conversion sequences from the past 3 and 30 days), and statistical metrics such as click-through rate and conversion rate, which are crucial for modeling user interests (Fan et al., 2022). On the item side, we use two feature domains: ID-based attributes (e.g., category ID, seller ID) and historical statistical attributes.

  • AliExpress Dataset. This publicly available dataset (pengcheng Li et al., 2020) is split into training and testing sets chronologically. Unlike the industrial dataset, it does not include, for each request, the set of items that were not exposed to users. In our experiments, we use the RU subset, which is the largest.
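
Both datasets are split so that training data precede testing data. A minimal sketch of such a chronological split is shown below; the record fields (`pv_id`, `date`, `clicked`), the cutoff date, and the function name `split_by_time` are hypothetical, since the paper does not specify the log schema or the split boundary.

```python
from datetime import date

def split_by_time(records, cutoff):
    """Put records dated before `cutoff` in train, the rest in test."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test

# Toy request logs (hypothetical schema and values).
logs = [
    {"pv_id": 1, "date": date(2024, 1, 1), "clicked": 1},
    {"pv_id": 2, "date": date(2024, 1, 2), "clicked": 0},
    {"pv_id": 3, "date": date(2024, 1, 3), "clicked": 1},
    {"pv_id": 4, "date": date(2024, 1, 4), "clicked": 0},
]

train_logs, test_logs = split_by_time(logs, cutoff=date(2024, 1, 4))
```

Splitting by time rather than at random avoids evaluating on requests that occurred before some of the training data.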

5.1.2. Baselines

We compare our proposed CSMF approach with the following representative multi-objective EBR models:

  • YouTubeDNN-Sep: YouTubeDNN (Yi et al., 2019) is one of the most widely adopted benchmark EBR models in industry. As shown in Figure 2(a), this version trains two models: one for the click objective using the click dataset and another for the conversion objective using the conversion dataset.

  • YouTubeDNN-Mix: This variant follows the same training configuration as YouTubeDNN-Sep but trains a single model using both the click and conversion datasets, as shown in Figure 2(b).

  • MOPPR (Zheng et al., 2022): Developed by Taobao, this model aggregates samples at the PV level and optimizes four objectives using a list-wise approach. It also serves as a strong baseline in our online system.

  • DTML (Zhao et al., 2021): This method leverages knowledge distillation (Hinton, 2015), where a DNN-based teacher model is trained in parallel with an EBR model to simultaneously optimize both the click and conversion objectives.

  • MVKE (Xu et al., 2022): Based on the MMOE framework (Ma et al., 2018b), this method constructs multiple experts in both the user and item towers to facilitate multi-objective learning. Notably, as the number of experts increases, the dimensionality of both user and item vectors in EBR expands proportionally, increasing both storage and computational overhead during serving.

  • DMMP (Yi et al., 2024): This method combines an MOE-based structure with knowledge distillation to build a three-tower-based multi-objective EBR model. It also uses a personalized gating network to control the retrieval weight for each objective.

5.1.3. Hyper-parameter Settings

The training process uses a distributed TensorFlow (Abadi et al., 2016) platform, consisting of 10 parameter servers and 60 workers, each with 12 CPUs. The CSMF model is trained in three sequential stages, focusing on the exposure, click, and conversion tasks, respectively. Negative samples for the exposure task are obtained through two methods: BNS (Huang et al., 2020) and the unexposed items within the same page view (PV) (Zheng et al., 2022). For both the click and conversion tasks, negative samples are drawn from BNS. The pruning process is performed using the CPP method, with a predefined threshold parameter of $\tau = 0.75$. The adaptive coefficient $\eta$ is set to 1.8 in the AML function. The triplet of online objective weights is configured as $\langle k_d, k_o, k_r \rangle = \langle 1, 1.8, 1.2 \rangle$. User and item vectors have a dimensionality of 64. During training, the batch size is set to 256, and the learning rate is set to 0.0001.

5.1.4. Evaluation Metrics

The evaluation metrics differ between the offline and online stages. In line with prior research (Zheng et al., 2022), during the offline stage, Recall@N and nDCG@N are used to evaluate model performance on both click and conversion datasets. For the online stage, standard metrics from the advertising system are used to evaluate performance, including RPM (Revenue Per Mille), CVR (Conversion Rate), and CTR (Click-through Rate).
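
For concreteness, the offline metrics can be sketched as follows. This is a minimal illustration assuming binary relevance (an item is relevant if the user clicked or converted on it) and the standard logarithmic discount; the paper does not spell out its exact definitions, and the function names and toy data are hypothetical.

```python
import math

def recall_at_n(ranked_items, relevant_items, n):
    """Fraction of relevant items that appear in the top-n retrieved list."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:n] if item in relevant_items)
    return hits / len(relevant_items)

def ndcg_at_n(ranked_items, relevant_items, n):
    """Binary-relevance nDCG: DCG of the top-n list over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, item in enumerate(ranked_items[:n])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), n)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: four retrieved items, two of which are relevant.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d"}

recall_top2 = recall_at_n(ranked, relevant, 2)  # one of two relevant items found
ndcg_top2 = ndcg_at_n(ranked, relevant, 2)
```

Recall@N only asks whether relevant items are retrieved, whereas nDCG@N also rewards placing them near the top of the list.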

5.2. Overall Performance Comparison (RQ1)

Table 2 summarizes the performance of our proposed method and the baseline models, highlighting the best results in bold and the second-best results with underlines. All the performance gains are statistically significant at $p < 0.05$.

On the industrial dataset, the proposed CSMF method consistently outperforms all baseline methods. Specifically, it improves Recall@50 by 6.61% and 7.51% on the click and conversion datasets, respectively, compared to the best results among the baselines. Similarly, CSMF achieves gains of 3.51% and 2.23% in nDCG@50 on the click and conversion datasets, respectively. Notably, MVKE and DMMP achieve higher Recall@50 than MOPPR on the click dataset, whereas MOPPR performs better on the conversion dataset, highlighting the difficulty of balancing multiple objectives in MOE-based methods for cascaded multi-objective tasks. In contrast, by selectively freeing parameter space, CSMF sequentially models multiple objectives, effectively addressing challenges related to information sharing and catastrophic forgetting. Additionally, compared to the YouTubeDNN-Sep model, the YouTubeDNN-Mix model underperforms on the click dataset but shows some improvement on the conversion dataset. As cascading objectives progress, positive samples become increasingly sparse, so this result indicates that multi-objective joint training can enhance the performance of “deeper objectives” (e.g., conversion). However, joint training may face challenges such as information conflict and catastrophic forgetting.

On the AliExpress dataset, CSMF outperforms all baselines in both Recall@50 and nDCG@50. However, due to the absence of unexposed labels for each request, the pretraining task for CSMF was changed from exposure learning to click learning, resulting in a smaller performance improvement than on the industrial dataset. Since MOPPR also depends on unexposed data, it faces a similar limitation and performs worse than MVKE and DMMP.
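
The relative improvements quoted above follow directly from Table 2. As a small worked check, the sketch below recomputes them for the industrial dataset as (ours − best baseline) / best baseline; the function name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(ours, best_baseline):
    """Relative gain of `ours` over `best_baseline`, in percent."""
    return (ours - best_baseline) / best_baseline * 100.0

# (CSMF, best baseline) pairs taken from Table 2, industrial dataset.
industrial = {
    "click nDCG@50": (0.1178, 0.1138),        # best baseline: DMMP
    "click Recall@50": (0.3177, 0.2980),      # best baseline: DMMP
    "conversion nDCG@50": (0.0915, 0.0895),   # best baseline: MOPPR
    "conversion Recall@50": (0.3694, 0.3436), # best baseline: MOPPR
}

improvements = {
    name: round(relative_improvement(ours, base), 2)
    for name, (ours, base) in industrial.items()
}

for name, gain in improvements.items():
    print(f"{name}: +{gain:.2f}%")
```

The rounded values agree with the "Improvement" row of Table 2.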

Table 3. Ablation Study of CSMF.
| Method | Click nDCG@50 | Click Recall@50 | Conversion nDCG@50 | Conversion Recall@50 |
| --- | --- | --- | --- | --- |
| CSMF (Ours) | 0.1178 | 0.3177 | 0.0915 | 0.3694 |
| w/o CPP | 0.1158 | 0.3060 | 0.0885 | 0.3529 |
| w/o AML | 0.1148 | 0.3097 | 0.0895 | 0.3574 |
| w/o AR | 0.1134 | 0.3002 | 0.0912 | 0.3690 |

5.3. Ablation Study (RQ2)

We conducted detailed ablation studies on the industrial dataset to evaluate the effectiveness of each module in the proposed method. We consider three variants, as follows:

  • w/o CPP: CSMF trained without the cumulative percentile pruning (CPP) method.

  • w/o AML: CSMF trained without the cross-stage AML function.

  • w/o AR: CSMF trained without the accuracy recovery operation.

Table 3 presents the performance of the three variants. For the CPP ablation, CPP is replaced with a fixed pruning ratio; following the parameter configuration of PackNet (Mallya and Lazebnik, 2018), a fixed ratio of 0.75 is applied. Relative to the full model, removing CPP lowers Recall@50 by 3.68% and 4.47% on the click and conversion datasets, respectively, demonstrating the effectiveness of CPP in facilitating information transfer across cascaded objectives.
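
For reference, the fixed-ratio baseline used in this ablation can be sketched as PackNet-style magnitude pruning, in which a fixed fraction of the lowest-magnitude weights is released for the next task. The code below illustrates that baseline only, not CPP itself; the function name `fixed_ratio_prune_mask` and the toy weights are hypothetical.

```python
def fixed_ratio_prune_mask(weights, prune_ratio):
    """Return a 0/1 mask: 0 marks pruned (freed) weights, 1 marks kept ones.

    The `prune_ratio` fraction of weights with the smallest absolute
    value is pruned, as in PackNet-style magnitude pruning.
    """
    n_prune = int(len(weights) * prune_ratio)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

# Toy layer with eight weights (hypothetical values).
layer_weights = [0.5, -0.1, 0.05, -0.9, 0.3, -0.02, 0.7, 0.2]

# With a ratio of 0.75, six of the eight weights are freed.
mask = fixed_ratio_prune_mask(layer_weights, prune_ratio=0.75)
kept_weights = [w * m for w, m in zip(layer_weights, mask)]
```

In a sequential fine-tuning setup, the kept weights would stay frozen for the current objective while the freed positions are available for the next one.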

Removing AML lowers Recall@50 by 2.52% and 3.25% on the click and conversion datasets, respectively, relative to the full model. For nDCG@50, the corresponding drops are 2.55% and 2.19%, highlighting the potential of AML to alleviate optimization conflicts between objectives. Additionally, removing the accuracy recovery operation hurts the click task noticeably but has minimal effect on the conversion task: because conversion is the final stage, no pruning follows it, so it is hardly affected by this variant. The results show that parameter pruning reduces the accuracy of the task being pruned, as expected. Hence, accuracy recovery is a critical operation in CSMF.
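
As a worked check, the percentages quoted for the CPP and AML ablations match the drop of each variant relative to the full CSMF model in Table 3, rather than the gain relative to the ablated variant. This reading is inferred from the numbers and is not stated explicitly in the text:

```latex
\begin{aligned}
\text{w/o CPP, click Recall@50:} \quad & \frac{0.3177 - 0.3060}{0.3177} \approx 3.68\% \\
\text{w/o CPP, conversion Recall@50:} \quad & \frac{0.3694 - 0.3529}{0.3694} \approx 4.47\% \\
\text{w/o AML, click Recall@50:} \quad & \frac{0.3177 - 0.3097}{0.3177} \approx 2.52\% \\
\text{w/o AML, conversion Recall@50:} \quad & \frac{0.3694 - 0.3574}{0.3694} \approx 3.25\% \\
\text{w/o AML, click nDCG@50:} \quad & \frac{0.1178 - 0.1148}{0.1178} \approx 2.55\% \\
\text{w/o AML, conversion nDCG@50:} \quad & \frac{0.0915 - 0.0895}{0.0915} \approx 2.19\%
\end{aligned}
```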

Figure 4. The performance of CSMF with different pruning ratios in the CPP method.

5.4. Hyperparameter Sensitivity Analysis (RQ3)

This section examines the sensitivity of CSMF to its hyperparameters on the industrial dataset. We set the pruning ratio $\tau$ to eight increasing values: $\tau \in \{25\%, 35\%, 45\%, 55\%, 65\%, 75\%, 85\%, 95\%\}$. Figure 4 presents the experimental results for these settings. As the pruning ratio increases from 25% to 75%, model performance steadily improves. This trend suggests that upstream tasks using large datasets require a larger parameter space for effective learning. Additionally, information from the upstream model can help the downstream model improve. However, as the pruning ratio rises from 75% to 95%, model performance declines significantly, highlighting the need for each objective to have a sufficient independent parameter space. Our method effectively reduces information conflicts between cascading objectives without increasing the number of model parameters.

Figure 5. The performance of CSMF under different hyperparameters ($\eta$, $k_o$, $k_r$, and $k_d$): (a) sensitivity analysis of $\eta$; (b) sensitivity analysis of $k_o$; (c) sensitivity analysis of $k_r$; (d) sensitivity analysis of $k_d$. $\eta$ denotes the adaptive coefficient in the AML method, while $k_o$, $k_r$, and $k_d$ represent the weights for the click objective, conversion objective, and exposure objective, respectively.

Additionally, $\eta$ represents the adaptive margin adjustment coefficient in the cross-stage training of the AML method. We evaluate six evenly spaced values: $\eta \in \{1.2, 1.4, 1.6, 1.8, 2.0, 2.2\}$. Figure 5 (a) shows the experimental results for these values. As $\eta$ increases, the change in Recall@50 is more significant on the conversion dataset than on the click dataset. This indicates that optimization conflicts for the conversion objective become more pronounced. The AML method mitigates objective conflicts during multi-stage training.

$k_o$ represents the retrieval weight of the click objective. With $k_d = 1$ and $k_r = 1.2$, we evaluate six evenly spaced values: $k_o \in \{1.2, 1.4, 1.6, 1.8, 2.0, 2.2\}$. Figure 5 (b) shows the experimental results for these values. As $k_o$ increases, Recall@50 on the click dataset gradually improves; however, when $k_o$ exceeds 1.8, Recall@50 on the conversion dataset begins to decline.

$k_r$ represents the retrieval weight of the conversion objective. With $k_d = 1$ and $k_o = 1.8$, we evaluate six evenly spaced values: $k_r \in \{0.6, 0.8, 1.0, 1.2, 1.4, 1.6\}$. Figure 5 (c) shows the experimental results for these values. As $k_r$ increases, Recall@50 on the conversion dataset improves gradually. However, when $k_r$ exceeds 1.2, Recall@50 on the click dataset starts to decline.

Similarly, $k_d$ represents the retrieval weight of the exposure objective. With $k_o = 1.8$ and $k_r = 1.2$, we evaluate six evenly spaced values: $k_d \in \{0.4, 0.6, 0.8, 1.0, 1.2, 1.4\}$. Figure 5 (d) shows the experimental results for these values. Beyond $k_d = 1$, Recall@50 starts to decline on both the click and conversion datasets.

These results show that achieving a jointly optimal set of multiple objectives requires carefully balancing the weight coefficients for each objective. Moreover, the CSMF method can optimize retrieval for each objective, thus addressing the diverse needs of various industrial scenarios.
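
To illustrate how the weight triplet shifts retrieval toward one objective, the sketch below ranks two items under a click-heavy and a conversion-heavy setting. It assumes the per-objective inner-product scores are already available and combines them with the cumulative weights of Eq. (8); the item names, scores, triplets, and function names are hypothetical.

```python
def fused_score(s_d, s_o, s_r, k_d, k_o, k_r):
    """Weighted fusion of exposure, click, and conversion scores."""
    return (k_d + k_o + k_r) * s_d + (k_o + k_r) * s_o + k_r * s_r

def rank_items(item_scores, k_d, k_o, k_r):
    """Sort item ids by fused score, highest first."""
    return sorted(
        item_scores,
        key=lambda item: fused_score(*item_scores[item], k_d, k_o, k_r),
        reverse=True,
    )

# Hypothetical (exposure, click, conversion) scores for two items.
item_scores = {
    "item_A": (0.50, 0.60, 0.10),  # stronger on the click objective
    "item_B": (0.50, 0.20, 0.70),  # stronger on the conversion objective
}

click_heavy = rank_items(item_scores, k_d=1.0, k_o=1.8, k_r=0.2)
conversion_heavy = rank_items(item_scores, k_d=1.0, k_o=0.2, k_r=1.8)
```

With these toy scores, the click-heavy triplet ranks `item_A` first and the conversion-heavy triplet ranks `item_B` first, without any change to the underlying per-objective scores.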

Table 4. System performance of different methods during serving. The performance differences compared to the baseline model MOPPR are shown in parentheses.
| Method | Storage size of vectors (MB) | ANN indexing time (ms/request) |
| --- | --- | --- |
| MOPPR | 1020.11 | 1.21 |
| MVKE | 4077.52 (+299%) | 3.96 (+227%) |
| DMMP | 1022.15 (+0.20%) | 2.15 (+77%) |
| CSMF (Ours) | 1019.95 (-0.02%) | 1.22 (+0.83%) |

5.5. Online Serving Performance (RQ4)

The online deployment of an ANN-based multi-objective EBR method involves two components: the user side and the item side. On the user side, the main concern is the ANN retrieval time, typically measured in milliseconds per request. Since item-side vectors are pre-computed, the main concern there is the memory required for online deployment. This online performance test evaluates the top four methods by nDCG@50 and Recall@50: MOPPR, MVKE, DMMP, and our proposed CSMF.

As shown in Table 4, compared to MOPPR, MVKE significantly increases both the storage required during online serving and the time needed for ANN retrieval: the item vector table expands by 299%, and the ANN retrieval time increases by 227%. The MOE-based DMMP method also degrades serving performance, increasing the ANN retrieval time by 77%, although its storage footprint is nearly unchanged (+0.20%). In contrast, our proposed CSMF method does not increase the storage space for online deployment, and its ANN retrieval time increases by only 0.83%, remaining within acceptable limits.

The multi-stage training process of CSMF increases offline training time by 34.9% compared to MOPPR (from 3h 23min to 4h 34min). However, this increase is still within an acceptable range and does not affect the model’s daily updates.

Industrial recommendation systems operate under strict time constraints to ensure a seamless user experience, so system efficiency is critical. Our method improves retrieval performance without imposing additional strain on the online system.

5.6. Online Experiments

To validate the effectiveness of our proposed method, we conducted an online A/B test on an online advertising recommendation system from October 5 to 16, 2024. The control group used the MOPPR model, while the experimental group used our proposed CSMF method. To ensure fairness, each group consisted of a randomly selected 25% of users. Compared to the baseline model, CSMF achieved a 0.42% increase in RPM, a 0.57% increase in CTR, and a 0.67% increase in CVR. These online results further validate the effectiveness of the proposed CSMF method for multi-objective EBR.
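
As a reminder of how such online metrics are commonly computed, the sketch below uses the usual advertising definitions: CTR as clicks per impression, CVR as conversions per click, and RPM as revenue per thousand impressions. These definitions and all counts are assumptions for illustration; the paper does not state its exact formulas, and the numbers are toy values, not experimental data.

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks per impression."""
    return clicks / impressions

def cvr(conversions, clicks):
    """Conversion rate, assumed here to be conversions per click."""
    return conversions / clicks

def rpm(revenue, impressions):
    """Revenue per mille: revenue per thousand impressions."""
    return revenue / impressions * 1000.0

def relative_lift(treatment, control):
    """Relative change of the treatment metric over control, in percent."""
    return (treatment - control) / control * 100.0

# Toy counts for a control and a treatment bucket (hypothetical).
control = {"impressions": 1000, "clicks": 50, "conversions": 5, "revenue": 20.0}
treatment = {"impressions": 1000, "clicks": 55, "conversions": 6, "revenue": 21.0}

ctr_lift = relative_lift(
    ctr(treatment["clicks"], treatment["impressions"]),
    ctr(control["clicks"], control["impressions"]),
)
rpm_lift = relative_lift(
    rpm(treatment["revenue"], treatment["impressions"]),
    rpm(control["revenue"], control["impressions"]),
)
```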

6. Conclusion

In this paper, we address the limitations of existing multi-objective embedding-based retrieval (EBR) methods by proposing the Cascaded Selective Mask Fine-Tuning (CSMF). CSMF innovatively organizes the training process into three stages: pre-training a backbone model with large-scale exposure data, followed by sequential fine-tuning on click and conversion tasks. A key feature of CSMF is its selective masking of redundant parameters during fine-tuning, which not only preserves information from the upstream model but also mitigates conflicts between objectives. Importantly, CSMF achieves these improvements without increasing output vector dimensionality, thereby avoiding additional retrieval latency and storage overhead.

Our findings demonstrate that CSMF significantly enhances retrieval efficiency and system performance during online serving. By employing a modified softmax loss function and an efficient parameter selection method, CSMF effectively addresses objective conflicts and reduces catastrophic forgetting. Moreover, CSMF enables flexible computation of weighted fusion scores for multiple objective probabilities, supporting adaptable retrieval in various recommendation scenarios. Extensive offline experiments on real-world datasets and online deployment in an advertising system validate the superior performance and practical value of CSMF. In summary, CSMF offers a novel and practical solution for multi-objective EBR, providing valuable insights for future research in this domain.

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: a system for Large-Scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 265–283.
  • Alyafeai et al. (2020) Zaid Alyafeai, Maged Saeed AlShaibani, and Irfan Ahmad. 2020. A survey on transfer learning in natural language processing. arXiv preprint arXiv:2007.04239 (2020).
  • Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning. 129–136.
  • Crawshaw (2020) Michael Crawshaw. 2020. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796 (2020).
  • Devlin (2018) Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dodge et al. (2020) Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305 (2020).
  • Fan et al. (2022) Zhifang Fan, Dan Ou, Yulong Gu, Bairan Fu, Xiang Li, Wentian Bao, Xin-Yu Dai, Xiaoyi Zeng, Tao Zhuang, and Qingwen Liu. 2022. Modeling users’ contextualized page-wise feedback for click-through rate prediction in e-commerce search. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 262–270.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021).
  • Hayou et al. (2024) Soufiane Hayou, Nikhil Ghosh, and Bin Yu. 2024. Lora+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354 (2024).
  • He et al. (2023) Yunzhong He, Yuxin Tian, Mengjiao Wang, Feier Chen, Licheng Yu, Maolong Tang, Congcong Chen, Ning Zhang, Bin Kuang, and Arul Prakash. 2023. Que2engage: Embedding-based retrieval for relevant and engaging products at facebook marketplace. In Companion Proceedings of the ACM Web Conference 2023. 386–390.
  • Hinton (2015) Geoffrey Hinton. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
  • Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Huang et al. (2020) Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020. Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2553–2561.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2333–2338.
  • Jiang et al. (2022) Yuchen Jiang, Qi Li, Han Zhu, Jinbei Yu, Jin Li, Ziru Xu, Huihui Dong, and Bo Zheng. 2022. Adaptive domain interest network for multi-domain recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3212–3221.
  • Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019), 535–547.
  • Kaufinann (2006) Morgan Kaufinann. 2006. Data mining: Concepts and techniques. (2006), 4.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
  • Liu et al. (2024) Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. 2024. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353 (2024).
  • Ma et al. (2018b) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018b. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1930–1939.
  • Ma et al. (2018a) Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018a. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1137–1140.
  • Mallya et al. (2018) Arun Mallya, Dillon Davis, and Svetlana Lazebnik. 2018. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European conference on computer vision (ECCV). 67–82.
  • Mallya and Lazebnik (2018) Arun Mallya and Svetlana Lazebnik. 2018. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 7765–7773.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems (2013).
  • pengcheng Li et al. (2020) pengcheng Li, Runze Li, Qing Da, An-Xiang Zeng, and Lijun Zhang. 2020. Improving Multi-Scenario Learning to Rank in E-commerce by Exploiting Task Relationships in the Label Space. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2020, Virtual Event, Ireland, October 19-23, 2019. ACM, New York, NY, USA.
  • Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. 285–295.
  • Shahroudnejad (2021) Atefeh Shahroudnejad. 2021. A survey on understanding, visualizations, and explanation of deep neural networks. arXiv preprint arXiv:2102.01792 (2021).
  • Tang et al. (2020) Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems. 269–278.
  • Wang et al. (2024) Xu Wang, Jiangxia Cao, Zhiyi Fu, Kun Gai, and Guorui Zhou. 2024. HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou. arXiv preprint arXiv:2408.05430 (2024).
  • Wang et al. (2023) Yuhao Wang, Ha Tsz Lam, Yi Wong, Ziru Liu, Xiangyu Zhao, Yichao Wang, Bo Chen, Huifeng Guo, and Ruiming Tang. 2023. Multi-task deep recommender systems: A survey. arXiv preprint arXiv:2302.03525 (2023).
  • Xin et al. (2024) Yi Xin, Siqi Luo, Haodi Zhou, Junlong Du, Xiaohong Liu, Yue Fan, Qing Li, and Yuntao Du. 2024. Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv preprint arXiv:2402.02242 (2024).
  • Xu et al. (2022) Zhenhui Xu, Meng Zhao, Liqun Liu, Lei Xiao, Xiaopeng Zhang, and Bifeng Zhang. 2022. Mixture of virtual-kernel experts for multi-objective user profile modeling. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4257–4267.
  • Yi et al. (2024) Qingqing Yi, Jingjing Tang, Yujian Zeng, Xueting Zhang, and Weiqi Xu. 2024. DMMP: A distillation-based multi-task multi-tower learning model for personalized recommendation. Knowledge-Based Systems 284 (2024), 111236.
  • Yi et al. (2019) Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems. 269–277.
  • Yu et al. (2020) Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33 (2020), 5824–5836.
  • Zhang et al. (2022) Jianjin Zhang, Zheng Liu, Weihao Han, Shitao Xiao, Ruicheng Zheng, Yingxia Shao, Hao Sun, Hanqing Zhu, Premkumar Srinivasan, Weiwei Deng, et al. 2022. Uni-Retriever: Towards learning the unified embedding based retriever in Bing sponsored search. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4493–4501.
  • Zhao et al. (2021) Zhong Zhao, Yanmei Fu, Hanming Liang, Li Ma, Guangyao Zhao, and Hongwei Jiang. 2021. Distillation based multi-task learning: A candidate generation model for improving reading duration. arXiv preprint arXiv:2102.07142 (2021).
  • Zheng et al. (2022) Yukun Zheng, Jiang Bian, Guanghao Meng, Chao Zhang, Honggang Wang, Zhixuan Zhang, Sen Li, Tao Zhuang, Qingwen Liu, and Xiaoyi Zeng. 2022. Multi-Objective Personalized Product Retrieval in Taobao Search. arXiv preprint arXiv:2210.04170 (2022).
  • Zhou et al. (2024) Hongyun Zhou, Xiangyu Lu, Wang Xu, Conghui Zhu, and Tiejun Zhao. 2024. LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation. arXiv preprint arXiv:2402.07721 (2024).