CSMF: Cascaded Selective Mask Fine-Tuning for Multi-Objective Embedding-Based Retrieval
Abstract.
Multi-objective embedding-based retrieval (EBR) has become increasingly critical due to the growing complexity of user behaviors and commercial objectives. While traditional approaches often suffer from data sparsity and limited information sharing between objectives, recent methods utilizing a shared network alongside dedicated sub-networks for each objective partially address these limitations. However, such methods significantly increase the model parameters, leading to an increased retrieval latency and a limited ability to model causal relationships between objectives. To address these challenges, we propose the Cascaded Selective Mask Fine-Tuning (CSMF), a novel method that enhances both retrieval efficiency and serving performance for multi-objective EBR. The CSMF framework selectively masks model parameters to free up independent learning space for each objective, leveraging the cascading relationships between objectives during the sequential fine-tuning. Without increasing network parameters or online retrieval overhead, CSMF computes a linearly weighted fusion score for multiple objective probabilities while supporting flexible adjustment of each objective’s weight across various recommendation scenarios. Experimental results on real-world datasets demonstrate the superior performance of CSMF, and online experiments validate its significant practical value.
1. Introduction
The primary goal of recommendation systems on e-commerce platforms is to help users quickly identify highly relevant products from a vast pool of candidates under strict time constraints. A widely adopted approach is a cascaded multi-stage selection process, typically divided into two stages (Huang et al., 2020): retrieval and ranking. Retrieval methods are broadly classified as rule-based retrieval and Embedding-Based Retrieval (EBR). Rule-based retrieval methods, such as item-CF (Sarwar et al., 2001) and user-CF (Kaufinann, 2006), leverage collaborative filtering signals to perform lightweight retrieval. In contrast, with advances in deep learning, EBR methods have demonstrated superior retrieval accuracy, which has led to their widespread adoption in industrial applications (Huang et al., 2020; Jiang et al., 2022).

Typically, EBR employs a two-tower deep neural network architecture to balance recall efficiency and system performance (Huang et al., 2013). Specifically, user and product information are encoded into two separate vectors via parallel neural networks, and the relevance is determined by the distance between these vectors (e.g., dot product). In deployment, product vectors are pre-computed, enabling efficient online retrieval of the top-k products using an Approximate Nearest Neighbor (ANN) system (e.g. FAISS (Johnson et al., 2019)) in sub-linear time.
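To make the serving step concrete, the following minimal sketch scores pre-computed item vectors against one user vector with a dot product and returns the top-k items. It uses an exhaustive scan in NumPy as a stand-in for an ANN index, and the function name, the vector sizes, and the random data are illustrative assumptions rather than part of the described system.

```python
import numpy as np

def top_k_by_dot_product(user_vec, item_matrix, k):
    """Return indices of the k items with the largest dot-product score."""
    scores = item_matrix @ user_vec                      # one score per pre-computed item vector
    k = min(k, scores.shape[0])
    candidates = np.argpartition(-scores, k - 1)[:k]     # unordered top-k candidates
    return candidates[np.argsort(-scores[candidates])]   # order them by descending score

# Hypothetical data: 1000 item vectors and one user vector of dimension 64.
rng = np.random.default_rng(0)
item_matrix = rng.normal(size=(1000, 64))
user_vec = rng.normal(size=64)
top_items = top_k_by_dot_product(user_vec, item_matrix, k=5)
```

A production system would replace the exhaustive scan with an ANN index so that lookup cost grows sub-linearly with the number of items.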
Recently, multi-objective retrieval optimization has emerged as a fundamental challenge for EBR in industrial systems. Given the diverse commercial objectives in industry, EBR is increasingly required to retrieve a product set that satisfies multiple objectives simultaneously. For example, the e-commerce platform at Taobao (Zheng et al., 2022) simultaneously optimizes four objectives: relevance, exposure, clicks, and conversions. Similarly, online advertising recommendation systems at Tencent (Xu et al., 2022) focus on optimizing clicks and conversions. For these two objectives, user behavior follows the sequential pattern exposure → click → conversion, with a fixed order (Ma et al., 2018a), as shown in Figure 1.
Existing methods for multi-objective EBR can be broadly categorized into two types: Multi-Model and Single-Model approaches. In the Multi-Model approach, separate EBR models are independently developed for each objective, as shown in Figure 2(a). However, this method struggles to capture the interrelationships among objectives and suffers from data sparsity in downstream objectives (e.g., the conversion objective) (Wang et al., 2023). In recent years, the Single-Model approach has been more widely adopted (Zheng et al., 2022; Zhang et al., 2022; He et al., 2023), leveraging a single EBR model to learn multiple objectives simultaneously. One variant of the Single-Model approach (Zheng et al., 2022; Jiang et al., 2022) combines the training data of all objectives and optimizes a single score to fit them. However, this variant encounters significant gradient conflicts (Yu et al., 2020) because all objectives are modeled within the same parameter space, as shown in Figure 2(b). The application of Mixture of Experts (MoE) techniques (Ma et al., 2018b) to multi-objective modeling, as shown in Figure 2(c), has inspired researchers (Xu et al., 2022; Jiang et al., 2022) to design a shared network along with dedicated sub-networks for each objective to mitigate information conflicts and improve information sharing.
However, the above approaches introduce a large number of network parameters, increasing vector dimensionality for online ANN retrieval and worsening both service latency and storage overhead. Furthermore, they allocate equal parameters to all objectives, exacerbating learning difficulties due to data sparsity. Additionally, they overlook the sequential relationship between objectives, hindering models from accurately capturing both objective interdependencies and users’ actual behavioral patterns.
To address the identified challenges, we propose the Cascaded Selective Mask Fine-Tuning Framework (CSMF) for multi-objective EBR, inspired by the success of parameter-efficient fine-tuning (PEFT) techniques used in large language models (LLMs), as shown in Figure 2(d). CSMF enhances retrieval efficiency and system performance by dividing the training process into three stages, leveraging the cascading relationship between objectives. First, a backbone model is pre-trained using large-scale exposure data as positive samples for EBR. Next, the model undergoes two fine-tuning stages: first with click data and then with conversion data. Prior to fine-tuning, CSMF selectively masks redundant parameters with low informational value, freeing up parameter space for subsequent tasks. To preserve accuracy after pruning, the unpruned parameters are fine-tuned again on a small subset of previous data. Once the accuracy is recovered, these parameters are frozen to retain the knowledge of earlier tasks. By iterating through the pre-train → selective mask → accuracy recovery → fine-tune cycle, CSMF achieves multiple objectives sequentially, mitigating issues of knowledge sharing and catastrophic forgetting. Furthermore, the CSMF framework encounters two key challenges: efficient parameter pruning to optimize each objective, and resolving conflicts between objectives during multi-stage training. To address these challenges, we propose the cumulative percentile-based pruning (CPP) method, which adaptively prunes neurons based on the information distribution of each layer. Additionally, we introduce a cross-stage adaptive margin loss function (AML) to reduce negative transfer effects caused by objective conflicts, dynamically adjusting the difficulty of contrastive learning (Mikolov et al., 2013). In CSMF, selective parameter allocation reduces conflicts between objectives without requiring additional parameters.
In online deployment, CSMF partitions the parameter space using a cascaded selective mask fine-tuning strategy, allowing for flexible, linear fusion of multiple objective probabilities. By assigning dynamic, objective-specific weights to parameter subsets, it adapts to varying business needs. Unlike MoE-based EBR, CSMF avoids increasing output vector dimensionality, reducing both retrieval latency and storage overhead in online ANN systems.
In summary, our proposed method makes the following key contributions:
- We introduce the Cascaded Selective Mask Fine-Tuning for multi-objective EBR, which effectively enhances retrieval efficiency and system performance during online serving.
- We integrate a modified softmax loss and an effective parameter selection approach within CSMF to address objective conflicts and reduce catastrophic forgetting.
- We present a flexible online multi-objective retrieval method that identifies the jointly optimal candidate set for multiple objectives, without increasing network parameters or burdening online systems.
- We conducted extensive offline experiments on real-world industrial datasets and deployed our proposed method in an online advertising system, validating the superiority of our method over competitors.

2. Related Work
2.1. Multi-Objective Embedding-Based Retrieval
Embedding-Based Retrieval has gained significant popularity in the industry, particularly within search, recommendation, and advertising systems (Zheng et al., 2022; Ma et al., 2018b; Zhang et al., 2022). DSSM (Kim, 2014) was one of the first to leverage a two-tower deep neural network to generate semantic vector representations for queries and documents. YouTubeDNN (Yi et al., 2019), a widely adopted benchmark, uses an ANN system for efficient online retrieval of top-k items. Recently, as EBR applications grow in industry, there has been an increasing focus on optimizing its multiple objectives. In industrial systems, retrieval tasks typically involve multiple objectives (Xu et al., 2022). For example, recommendation systems prioritize objectives such as clicks and conversions, while short video platforms also track metrics like user engagement duration and video completion rates (Wang et al., 2024). However, these objectives frequently conflict (Xu et al., 2022), making multi-objective balancing in EBR a critical research challenge. As mentioned earlier, state-of-the-art multi-task learning methods (Ma et al., 2018b; Tang et al., 2020) can be applied to EBR models. For example, Tencent proposed the MVKE model(Xu et al., 2022), which leverages the MOE architecture to jointly optimize clicks and conversions. However, MOE-based approaches increase output vector dimensionality in EBR, resulting in higher computational complexity and latency in the online ANN retrieval system. Additionally, Taobao introduced the MOPPR model (Zheng et al., 2022), which addresses the challenge of balancing multiple objectives using a listwise (Cao et al., 2007) approach. Methods like DMTL (Zhao et al., 2021; Zhang et al., 2022) use distillation learning to facilitate joint optimization of multiple objectives in EBR.
However, the above methods overlook the cascaded relationships among objectives and introduce a large number of network parameters, increasing latency and storage burden during serving. This paper addresses these limitations by focusing on the cascaded relationships among objectives, optimizing information sharing and mitigating catastrophic forgetting in multi-objective EBR.
2.2. Parameter-Efficient Fine-Tuning
Deep learning models often require large datasets for sufficient learning, but such datasets are not always available.
Transfer learning is an effective technique to address this challenge (Alyafeai et al., 2020). Fine-tuning is a commonly used approach in transfer learning (Howard and Ruder, 2018; Devlin, 2018). It can be classified into two types: full parameter fine-tuning and partial parameter fine-tuning. In full parameter fine-tuning, all network parameters are updated to suit downstream tasks (Gao et al., 2021; Dodge et al., 2020). However, this approach often results in challenges like forgetting upstream knowledge and high resource consumption (Mallya et al., 2018). Recently, with advancements in LLMs, partial parameter fine-tuning methods have proven effective. These methods preserve upstream model information while adapting to downstream task objectives (Hu et al., 2021; Xin et al., 2024). This technique is known as PEFT. PEFT can be classified into three types (Xin et al., 2024): adaptive, reparameterized, and selective. Both adaptive and reparameterized methods retain upstream model parameters while adding a small number of additional parameters during fine-tuning. LoRA (Hu et al., 2021) proposed using a trainable low-rank matrix to facilitate learning in downstream models. Subsequent works (Hayou et al., 2024; Liu et al., 2024; Zhou et al., 2024) aim to improve the efficiency of knowledge transfer in LoRA-based methods. PackNet (Mallya and Lazebnik, 2018), based on the selective approach, introduces a training framework that divides the upstream model’s parameters into two segments: one is used to fit the downstream model, and the other remains unchanged to preserve the upstream knowledge. This method is particularly suited for scenarios where there are cascaded dependencies between upstream and downstream tasks. Moreover, these methods generally do not introduce additional network parameters.

However, most existing PEFT methods have primarily focused on the field of natural language processing, with limited application in retrieval models. In this paper, we apply PEFT to address the challenges of knowledge transfer and the decline in online efficiency caused by parameter expansion in multi-objective EBR.
3. Problem Formulation
In this section, we define the EBR model. Let $\mathcal{U}$ be the set of users, with each user denoted by $u$, and $\mathcal{I}$ be the set of items, with each item denoted by $i$. The goal is to retrieve a set of items $\mathcal{I}_u \subseteq \mathcal{I}$ with $|\mathcal{I}_u| = k$, where $k$ represents the size of the set. The retrieved item set must optimize multiple objectives simultaneously, such as exposure, click, and conversion.
A common approach constructs a two-tower EBR model to address multiple objectives and selects the top-k items based on the model’s output scores. User-side features $x_u$ are input into the user-side encoder $f_u(\cdot; W)$, yielding the user-side vector $\mathbf{e}_u = f_u(x_u; W)$, where $W$ denotes the trainable model parameters. Similarly, item-side features $x_i$ are input into the item-side encoder $f_i(\cdot; W)$, resulting in the item-side vector $\mathbf{e}_i = f_i(x_i; W)$. The relevance score between user $u$ and item $i$ is computed using a distance function, $s(u, i) = d(\mathbf{e}_u, \mathbf{e}_i)$, e.g., the dot product. The top-k items can be determined as follows:
$$\mathcal{I}_u = \operatorname*{top\text{-}k}_{i \in \mathcal{I}} \; s(u, i) \tag{1}$$
The positive sample set for EBR typically comes from users’ explicit feedback, such as the click dataset $\mathcal{D}^{c}$ and the conversion dataset $\mathcal{D}^{v}$. These feedback signals exhibit a cascading hierarchy in user behavior, establishing relationships among datasets. Specifically, the conversion dataset $\mathcal{D}^{v}$ is a subset of the click dataset $\mathcal{D}^{c}$, which in turn is a subset of the exposure dataset $\mathcal{D}^{e}$: $\mathcal{D}^{v} \subseteq \mathcal{D}^{c} \subseteq \mathcal{D}^{e}$. The negative sampling strategy consists of in-batch negative sampling (BNS) (Huang et al., 2020) and unexposed items within the same request (Zheng et al., 2022). The negative sample set is denoted as $\mathcal{N}$. During training, EBR uses contrastive learning to distinguish positive samples from $\mathcal{N}$ (Mikolov et al., 2013). The softmax-based probability is defined as:
$$p(i \mid u; W) = \frac{\exp\big(s(u, i)\big)}{\exp\big(s(u, i)\big) + \sum_{j \in \mathcal{N}} \exp\big(s(u, j)\big)} \tag{2}$$
During the training phase, model parameters are updated by minimizing the negative log-likelihood $\mathcal{L} = -\log p(i \mid u; W)$.
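The softmax objective above can be sketched for a single positive pair as follows. This is a minimal NumPy illustration under the assumption that the score is a plain dot product and that the negatives are supplied as a matrix of item vectors; the function name and argument layout are hypothetical.

```python
import numpy as np

def sampled_softmax_nll(user_vec, pos_item_vec, neg_item_vecs):
    """Negative log-likelihood of the positive item against sampled negatives."""
    pos_score = float(user_vec @ pos_item_vec)
    neg_scores = neg_item_vecs @ user_vec                 # one score per negative item
    logits = np.concatenate(([pos_score], neg_scores))
    logits = logits - logits.max()                        # shift for numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

# With all-zero vectors every candidate is equally likely, so the loss is log(1 + #negatives).
uniform_loss = sampled_softmax_nll(np.zeros(4), np.zeros(4), np.zeros((3, 4)))
```

In practice the loss would be averaged over a batch, with the other items in the batch acting as in-batch negatives.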
4. Multi-Objective Cascaded Selective Mask Fine-Tuning
This section introduces the CSMF training framework, a multi-stage training method that integrates the Cumulative Percentile Based Pruning Method and the Cross-Stage Adaptive Margin Loss Function. Figure 3 illustrates the overall training framework.
4.1. Training Framework
Drawing inspiration from PEFT techniques in LLMs, we propose the CSMF method for EBR. In our framework, following the sequence of cascading objectives, the training process is divided into three tasks: exposure, click, and conversion. Importantly, the data volume decreases significantly along this sequence, with $|\mathcal{D}^{e}| \gg |\mathcal{D}^{c}| \gg |\mathcal{D}^{v}|$ for the exposure, click, and conversion datasets, respectively, making the modeling task progressively more challenging. Following PEFT principles, CSMF uses the exposure task, which has a large dataset, to support the tasks with smaller datasets during the cascaded fine-tuning (Alyafeai et al., 2020), significantly improving retrieval efficiency for each objective. Thus, the exposure task serves as the upstream task for the click task, and the conversion task serves as the downstream task relative to the click task.
This method enables a single EBR model to simultaneously output the exposure probability $p^{e}$, click probability $p^{c}$, and conversion probability $p^{v}$, without increasing the dimensionality of the final vectors (e.g., $\mathbf{e}_u$ and $\mathbf{e}_i$). Moreover, by employing flexible probability combination calculations, the CSMF method can efficiently retrieve a candidate set optimized for multiple objectives, as detailed in Section 4.4.
During the pre-training stage for the exposure task, the network parameters $W = \{W_l\}$, where $W_l$ denotes the parameters of the $l$-th layer, are used to learn whether a product will be exposed to a user. Positive samples are collected from the exposure dataset $\mathcal{D}^{e}$, while the negative samples consist of BNS (Huang et al., 2020) and unexposed items within the same request (Zheng et al., 2022). After multiple training epochs, the parameter set converges to $\hat{W}$:
$$\hat{W} = \arg\min_{W} \; -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}_u^{e}} \log p(i \mid u; W) \tag{3}$$
where $\mathcal{I}_u^{e}$ represents the set of products exposed to user $u$. In deep neural networks, some parameters exhibit redundancy, and masking these parameters generally does not significantly impact model accuracy (Mallya and Lazebnik, 2018). To free up learnable parameter space for downstream tasks, we selectively mask the parameter set $\hat{W}$, which involves pruning redundant parameters. The pruning process evaluates the information value of each neuron. If a neuron’s information value falls below a threshold $\tau$, it is deemed to have minimal impact on overall accuracy. Thus, the primary task at this stage is to assess each neuron’s information value. We explored two parameter pruning methods, as detailed in Section 4.2.
After pruning, each $\hat{W}_l$ is divided into two parts: the redundant parameters for the exposure task, denoted as $\bar{W}_l^{e}$, and the retained parameter set, denoted as $W_l^{e}$. Furthermore, we perform an accuracy recovery operation to preserve the accuracy of the exposure task. Specifically, a small subset of the exposure dataset $\mathcal{D}^{e}$ is used to fine-tune the retained parameter set $W^{e}$ once again. Notably, this fine-tuning process does not modify $\bar{W}^{e}$. Once the accuracy is recovered, $W^{e}$ is frozen and does not undergo further updates. Thus, the exposure probability $p^{e}(i \mid u)$ is calculated using the parameter set $W^{e}$.
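During the fine-tuning stage of the click task, the retained exposure parameters $W^{e}$ remain unchanged, while the click dataset $\mathcal{D}^{c}$ serves as positive samples to update the freed parameters $\bar{W}^{e}$, fitting the click probability $p^{c}(i \mid u)$. After several training epochs, the freed parameter set converges to $\hat{W}^{c}$:
$$\hat{W}^{c} = \arg\min_{\bar{W}^{e}} \; -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}_u^{c}} \log p(i \mid u; W^{e} \cup \bar{W}^{e}) \tag{4}$$
where $\mathcal{I}_u^{c}$ represents the set of products clicked by user $u$. As in the exposure task, we apply the same parameter pruning method to the set $\hat{W}^{c}$, dividing it into two parts: $W^{c}$ and $\bar{W}^{c}$. Here, $W^{c}$ is the retained parameter set for the click task and $\bar{W}^{c}$ is the redundant part. To preserve accuracy after pruning, a subset of the click dataset $\mathcal{D}^{c}$ is used to fine-tune the parameter set $W^{c}$. The click probability for user $u$ and product $i$ is calculated using the parameter set $W^{e} \cup W^{c}$ as follows: $p^{c}(i \mid u) = p(i \mid u; W^{e} \cup W^{c})$.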
For the remaining parameter set $W^{v}$ (the parameters freed after pruning the click task), the conversion dataset $\mathcal{D}^{v}$ serves as positive samples. While freezing $W^{e}$ and $W^{c}$, $W^{v}$ is updated to fit the conversion probability $p^{v}(i \mid u)$:
$$\hat{W}^{v} = \arg\min_{W^{v}} \; -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}_u^{v}} \log p(i \mid u; W^{e} \cup W^{c} \cup W^{v}) \tag{5}$$
where $\mathcal{I}_u^{v}$ represents the set of products converted by user $u$. After training, the parameter space in the CSMF method is divided into three parts: $W^{e}$, $W^{c}$, and $W^{v}$. Using appropriate weighted combinations, the method simultaneously yields the exposure probability $p^{e}$, click probability $p^{c}$, and conversion probability $p^{v}$, as detailed in Section 4.4. Redundancy pruning and neuron-level sharing ensure that information from upstream tasks is fully retained in downstream parameter spaces, achieving lossless transfer and mitigating information forgetting. Moreover, judicious pruning of redundant parameters allocates independent optimization space to each objective, reducing gradient conflicts and improving efficiency.
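The freezing behaviour across stages can be illustrated with a small sketch: parameters retained by an earlier objective are excluded from later updates, while freed parameters keep training. The helper name, the six-element weight vector, the unit gradients, and the specific masks are hypothetical and only meant to show the mechanics, not the actual training code.

```python
import numpy as np

def masked_sgd_step(weights, grads, frozen, lr=0.1):
    """Apply one SGD step, leaving parameters frozen by earlier objectives untouched."""
    return np.where(frozen, weights, weights - lr * grads)

# Hypothetical layer after exposure pre-training and pruning:
# the first three weights are retained for exposure, the rest were freed (zeroed).
weights = np.array([0.9, -0.7, 0.5, 0.0, 0.0, 0.0])
exposure_kept = np.array([True, True, True, False, False, False])
grads = np.ones_like(weights)

# Click stage: exposure parameters are frozen, freed parameters are trained.
after_click = masked_sgd_step(weights, grads, frozen=exposure_kept)

# Suppose click pruning keeps two of the freed slots; the last slot is left for conversion.
click_kept = np.array([False, False, False, True, True, False])
frozen_for_conversion = exposure_kept | click_kept
after_conversion = masked_sgd_step(after_click, grads, frozen=frozen_for_conversion)
```

Because the masks are disjoint, each objective ends up owning its own slice of the same layer, and no extra parameters are introduced.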
4.2. CPP: Cumulative Percentile-based Pruning
One of the key challenges in the CSMF method is evaluating the importance of neurons during the parameter pruning process. This evaluation is crucial because it directly affects the transmission of information from upstream tasks, thereby influencing the overall efficiency of the model. The significance of a neuron is typically measured by the amount of information it encapsulates. Therefore, neurons with higher information value should be retained, while those with lower information should be pruned to free up optimization space for downstream tasks.
To measure the information value, PackNet (Mallya and Lazebnik, 2018) uses the absolute value of the neurons as the evaluation criterion. In each layer of network parameters $W_l$, neurons are sorted by their absolute values, with a fixed top fraction of neurons being retained, while the remainder are pruned. However, this method applies a fixed pruning ratio to each layer, disregarding the varying importance of neurons across different layers (Shahroudnejad, 2021).
To address this limitation, we propose an adaptive pruning method called the Cumulative Percentile-based Pruning method. For each layer parameter $W_l$, we sort the neurons in ascending order of absolute value, $|w_{(1)}| \le \dots \le |w_{(n)}|$, and compute the cumulative sum of the absolute values, denoted as $S_k = \sum_{j=1}^{k} |w_{(j)}|$, where $n$ is the total number of neurons in $W_l$. Specifically, $S_n$ denotes the total information content in this layer, where $S_n = \sum_{j=1}^{n} |w_{(j)}|$. Next, we define a pruning ratio $\rho$ and calculate the maximum cumulative percentile to be pruned, denoted as $\tau_l = \rho \cdot S_n$. Based on this cumulative percentile, we compute the pruning indicator of each neuron, expressed as $m_k = \mathbb{1}[S_k \le \tau_l]$, where $\mathbb{1}[\cdot]$ is a comparison function that returns 1 if the condition is met, and 0 otherwise. If $m_k = 1$, the neuron is pruned; otherwise, it is retained. This approach adaptively determines the number of neurons to prune based on the distribution of information content across each layer.
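One way to read the CPP procedure is sketched below. It assumes that neurons are ordered by ascending absolute value before the cumulative sum is taken, so that the smallest-magnitude neurons are pruned first until their combined magnitude reaches the chosen share of the layer total; the function name and the example weights are hypothetical.

```python
import numpy as np

def cpp_prune_mask(layer_weights, prune_ratio):
    """Return a boolean mask that is True where a weight is pruned."""
    magnitudes = np.abs(layer_weights).ravel()
    order = np.argsort(magnitudes, kind="stable")     # ascending absolute value (assumed ordering)
    cumulative = np.cumsum(magnitudes[order])         # running share of the layer's total magnitude
    budget = prune_ratio * cumulative[-1]             # maximum cumulative magnitude to prune
    mask = np.zeros(magnitudes.shape, dtype=bool)
    mask[order] = cumulative <= budget                # map the decision back to original positions
    return mask.reshape(layer_weights.shape)

# Same ratio, different layers: the number of pruned neurons adapts to the distribution.
flat_layer = np.array([3.0, 1.0, 4.0, -2.0])
skewed_layer = np.array([0.01, 10.0, 0.01, 0.01, 0.01])
flat_mask = cpp_prune_mask(flat_layer, prune_ratio=0.35)
skewed_mask = cpp_prune_mask(skewed_layer, prune_ratio=0.35)
```

With the same ratio, half of the evenly distributed layer is pruned, whereas the skewed layer gives up every neuron except the dominant one. A fixed per-layer count cannot express this difference.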
4.3. AML: Cross-Stage Adaptive Margin Loss
Conflicts in multi-objective learning are inevitable and often lead to negative transfer effects, reducing downstream task performance (Crawshaw, 2020). The CSMF mitigates this issue by assigning distinct parameter spaces to each objective, thereby partially reducing conflicts. However, since downstream tasks inherit all upstream parameters, significant conflicts between objectives can still hinder downstream learning and result in severe negative transfer effects.
To address this challenge, we propose a cross-stage adaptive margin loss function within the CSMF framework. AML allows downstream tasks to dynamically adjust the margin size between positive and negative sample pairs. When downstream objectives align with upstream predictions, optimization becomes easier. Conversely, when downstream objectives conflict with upstream predictions, additional effort is required to correct the bias.
For instance, in the click task, consider two items, $i$ and $j$, for a user $u$, where $i$ is a positive sample and $j$ is a negative sample. If the exposure task scores item $i$ above item $j$, the click task reduces optimization effort for this sample: the margin in click scores between items $i$ and $j$ should preserve the margin in exposure scores. However, if the exposure task scores item $i$ below item $j$, the model amplifies the margin in the click scores, compensating by increasing the difference between the click scores of $i$ and $j$. The loss function for the click task is expressed as follows:
(6) | |||
In Eq. (6), the negative sample set is defined for each ⟨user $u$, item $i$⟩ pair, the adaptive margin is defined between items $i$ and $j$, and a minimum margin lower-bounds the distance between a positive sample and a negative sample. When the exposure probability of $i$ exceeds that of $j$, the optimization objective of the click task aligns with the exposure probability, and the margin should simply reflect the exposure probability margin. Conversely, when the exposure probability of $i$ is lower than that of $j$, the click task and the exposure task are misaligned, necessitating an adjustment that increases the click probability margin between items $i$ and $j$. The adaptive coefficient is a hyperparameter of the AML method that controls the extent of corrective adjustment required by downstream tasks.
Unlike the click task, the conversion task involves two upstream tasks, requiring consideration of inconsistencies with both the click and exposure tasks. Thus, the consistency between exposure probability, click probability, and the conversion label is crucial. To avoid introducing additional estimation biases, we define the upstream probability as the product of the cascade probabilities for user $u$ and product $i$. In the conversion task learning, the adaptive margin between items $i$ and $j$ is defined as:
(7) |
In the CSMF method, downstream tasks can leverage the probabilities of upstream tasks during training, allowing the AML function to assess conflicts between objectives and adaptively adjust the margin between positive and negative samples.
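The qualitative behaviour of AML can be illustrated with a small sketch. The margin rule below is a hypothetical stand-in chosen only to match the description (mirror the upstream gap when the tasks agree, enlarge it by the adaptive coefficient when they conflict); it is not the exact form of Eq. (6) or Eq. (7). The names `adaptive_margin` and `margin_softmax_nll`, the minimum margin of 0.1, and the way the margin is added to negative logits are all assumptions, while the coefficient 1.8 follows the hyper-parameter setting reported in Section 5.1.3.

```python
import numpy as np

def adaptive_margin(upstream_pos, upstream_neg, min_margin=0.1, coeff=1.8):
    """Hypothetical margin rule driven by upstream probabilities of the positive and negative item."""
    gap = upstream_pos - upstream_neg
    if gap > 0:
        # Upstream agrees with the downstream label: mirror the upstream gap (at least min_margin).
        return max(min_margin, gap)
    # Upstream conflicts with the downstream label: enlarge the margin by the adaptive coefficient.
    return min_margin + coeff * (-gap)

def margin_softmax_nll(pos_score, neg_scores, margins):
    """Softmax loss where each negative logit is raised by its margin (assumed formulation)."""
    neg_logits = np.asarray(neg_scores, dtype=float) + np.asarray(margins, dtype=float)
    logits = np.concatenate(([float(pos_score)], neg_logits))
    logits = logits - logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))

aligned_margin = adaptive_margin(upstream_pos=0.6, upstream_neg=0.2)
conflict_margin = adaptive_margin(upstream_pos=0.2, upstream_neg=0.6)
aligned_loss = margin_softmax_nll(1.0, [0.5], [aligned_margin])
conflict_loss = margin_softmax_nll(1.0, [0.5], [conflict_margin])
```

For the conversion task, the same rule could be driven by an upstream probability formed as the product of the exposure and click probabilities, which is how the cascade product is interpreted in this sketch.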
4.4. Flexible Online Multi-Objective Retrieval
This section introduces the online deployment of the CSMF method, a flexible multi-objective online retrieval approach. By assigning appropriate weights to specific parameters, a fused value for each objective’s probability score is derived, enabling simultaneous retrieval of an optimal joint candidate set across multiple objectives. This approach also allows flexible adjustment of objective weights, enabling quick adaptation in different industrial contexts.
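In the CSMF method, the model’s parameter set $W$ is divided into three distinct, mutually exclusive parts: $W^{e}$, $W^{c}$, and $W^{v}$. By correctly combining these parameters, we can accurately compute the exposure probability $p^{e}$, click probability $p^{c}$, and conversion probability $p^{v}$ for user $u$ and item $i$. Specifically, $p^{e}$ is derived from the parameter set $W^{e}$, while $p^{c}$ depends on both $W^{e}$ and $W^{c}$. The conversion probability $p^{v}$ relies on the entire parameter set $W$. Formal decompositions demonstrate that by applying weights to the final-layer vector of the user-side tower, $\mathbf{e}_u = [\mathbf{e}_u^{e} \oplus \mathbf{e}_u^{c} \oplus \mathbf{e}_u^{v}]$, the weighted sum across all three objectives can be computed. Here, $\oplus$ denotes the vector concatenation operation. Assuming $\mathbf{e}_u$ has a dimensionality of $d$, while $\mathbf{e}_u^{e}$, $\mathbf{e}_u^{c}$, and $\mathbf{e}_u^{v}$ have dimensions $d_e$, $d_c$, and $d_v$, respectively, it follows that $d = d_e + d_c + d_v$. Similarly, the item-side vector $\mathbf{e}_i$ follows the same logic as $\mathbf{e}_u$.
Let $\alpha$, $\beta$, and $\gamma$ represent the combination weights for the exposure probability $p^{e}$, click probability $p^{c}$, and conversion probability $p^{v}$, respectively. Using an inner product as the distance function, the relevance score can be expressed as:
$$\begin{aligned} s(u, i) &= \alpha\, \mathbf{e}_u^{e} \cdot \mathbf{e}_i^{e} + \beta \left( \mathbf{e}_u^{e} \cdot \mathbf{e}_i^{e} + \mathbf{e}_u^{c} \cdot \mathbf{e}_i^{c} \right) + \gamma \left( \mathbf{e}_u^{e} \cdot \mathbf{e}_i^{e} + \mathbf{e}_u^{c} \cdot \mathbf{e}_i^{c} + \mathbf{e}_u^{v} \cdot \mathbf{e}_i^{v} \right) \\ &= (\alpha + \beta + \gamma)\, \mathbf{e}_u^{e} \cdot \mathbf{e}_i^{e} + (\beta + \gamma)\, \mathbf{e}_u^{c} \cdot \mathbf{e}_i^{c} + \gamma\, \mathbf{e}_u^{v} \cdot \mathbf{e}_i^{v} \end{aligned} \tag{8}$$
As shown in Eq. (8), the weighted sums for the three objectives are computed by assigning weight values of $\alpha + \beta + \gamma$, $\beta + \gamma$, and $\gamma$ to $\mathbf{e}_u^{e}$, $\mathbf{e}_u^{c}$, and $\mathbf{e}_u^{v}$ in the user-side vector $\mathbf{e}_u$, respectively. Using the weight triplet $(\alpha, \beta, \gamma)$, we can compute the combined weighted sum for all three objectives, enabling us to retrieve a candidate set of items optimized for these objectives. In some recommendation scenarios, such as those tailored for different user groups, the emphasis on each objective may vary. Accordingly, we can adjust the triplet $(\alpha, \beta, \gamma)$ to fine-tune the retrieval targets.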
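The practical consequence of this fusion is that the objective weights can be folded into the user vector before the ANN lookup, so a single inner product yields the fused score. The sketch below checks this numerically under the assumption that each objective's score is the inner product over its own segment plus all upstream segments; the segment sizes (32, 20, 12), the weights, and the function name are hypothetical.

```python
import numpy as np

def fuse_user_vector(user_vec, dims, weights):
    """Scale the exposure, click, and conversion segments of the user vector by their fused weights."""
    d_e, d_c, d_v = dims
    alpha, beta, gamma = weights
    scale = np.concatenate((
        np.full(d_e, alpha + beta + gamma),   # exposure segment feeds all three objectives
        np.full(d_c, beta + gamma),           # click segment feeds click and conversion
        np.full(d_v, gamma),                  # conversion segment feeds conversion only
    ))
    return user_vec * scale

rng = np.random.default_rng(0)
dims = (32, 20, 12)                           # hypothetical split of a 64-dimensional vector
weights = (0.2, 0.3, 0.5)                     # hypothetical (exposure, click, conversion) weights
user_vec = rng.normal(size=sum(dims))
item_vec = rng.normal(size=sum(dims))

d_e, d_c, _ = dims
score_exposure = user_vec[:d_e] @ item_vec[:d_e]
score_click = user_vec[:d_e + d_c] @ item_vec[:d_e + d_c]
score_conversion = user_vec @ item_vec
expected = 0.2 * score_exposure + 0.3 * score_click + 0.5 * score_conversion

fused_score = fuse_user_vector(user_vec, dims, weights) @ item_vec
```

Only the user vector is rescaled at request time, so the pre-computed item index does not need to be rebuilt when the weights change.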
5. Experiments
In this section, we perform extensive experiments on two datasets to evaluate the effectiveness of our proposed framework and address the following questions:
- RQ1: How does our CSMF compare to other state-of-the-art models in terms of overall performance?
- RQ2: What is the impact of each component on the overall performance of the model?
- RQ3: What is the effect of the hyper-parameters on the performance of our model?
- RQ4: How does the online deployment performance of our CSMF compare to other state-of-the-art models?
Dataset | #Items | #Pvs | #Exposures | #Clicks | #Conversions |
---|---|---|---|---|---|
Industrial Dataset | 0.11B | 109M | 2.1B | 19M | 0.3M |
AliExpress Dataset | 16.7M | 8.7M | 130M | 3.6M | 61.8K |
Method | Industrial Click nDCG@50 | Industrial Click Recall@50 | Industrial Conversion nDCG@50 | Industrial Conversion Recall@50 | AliExpress Click nDCG@50 | AliExpress Click Recall@50 | AliExpress Conversion nDCG@50 | AliExpress Conversion Recall@50 |
---|---|---|---|---|---|---|---|---|
YouTubeDNN-Sep | 0.0998 | 0.2715 | 0.0698 | 0.2990 | 0.0453 | 0.1502 | 0.0328 | 0.0514 |
YouTubeDNN-Mix | 0.0978 | 0.2692 | 0.0715 | 0.3089 | 0.0451 | 0.1497 | 0.0332 | 0.0525 |
DTML | 0.1098 | 0.2824 | 0.0845 | 0.3283 | 0.0471 | 0.1567 | 0.0442 | 0.1179 |
MVKE | 0.1118 | 0.2957 | 0.0885 | 0.3338 | 0.0543 | 0.1741 | 0.0477 | 0.1367 |
DMMP | 0.1138 | 0.2980 | 0.0878 | 0.3319 | 0.0542 | 0.1746 | 0.0481 | 0.1378 |
MOPPR | 0.1128 | 0.2946 | 0.0895 | 0.3436 | 0.0531 | 0.1738 | 0.0472 | 0.1329 |
CSMF (Ours) | 0.1178 | 0.3177 | 0.0915 | 0.3694 | 0.0551 | 0.1777 | 0.0492 | 0.1397 |
Improvement | +3.51% | +6.61% | +2.23% | +7.51% | +1.47% | +1.78% | +2.29% | +1.38% |
5.1. Experimental Setup
5.1.1. Dataset
Two datasets are used to evaluate the proposed CSMF method. Table 1 provides statistical information for both datasets. The details are outlined as follows:
- Industrial Dataset. Following existing baselines (Zheng et al., 2022; Xu et al., 2022), we use real-world recommendation logs with exposure, click, and conversion events for both training and testing. This dataset is collected from an online advertising recommendation system of a leading e-commerce platform in Southeast Asia. The training data precede the testing data. On the user side, we consider three types of features: profile information (e.g., gender, location), behavioral data (e.g., click and conversion sequences from the past 3 and 30 days), and statistical metrics such as click-through rate and conversion rate, which are crucial for modeling user interests (Fan et al., 2022). On the item side, we use two feature domains: ID-based attributes (e.g., category ID, seller ID) and historical statistical attributes.
- AliExpress Dataset. This publicly available dataset (pengcheng Li et al., 2020) is split into training and testing sets by time. Unlike the industrial dataset, it does not include the set of items that were not exposed to users in each request. In this experiment, we use the RU subset, which is the largest.
5.1.2. Baselines
We compare our proposed CSMF approach with the following representative multi-objective EBR models:
- YouTubeDNN-Sep: This method, proposed by YouTubeDNN (Yi et al., 2019), is one of the most widely adopted benchmark EBR models in the industry. As shown in Figure 2(a), this version trains two models: one for the click objective using the click dataset and another for the conversion objective using the conversion dataset.
- YouTubeDNN-Mix: This variant follows the same training configuration as YouTubeDNN-Sep but trains a single model using both the click and conversion datasets, as shown in Figure 2(b).
- MOPPR (Zheng et al., 2022): Developed by Taobao, this model aggregates samples at the PV level and optimizes four objectives using a list-wise approach. It also serves as a strong baseline in our online system.
- MVKE (Xu et al., 2022): Based on the MMOE framework (Ma et al., 2018b), this method constructs multiple experts in both the user and item towers to facilitate multi-objective learning. Notably, as the number of experts increases, the dimensionality of both user and item vectors in EBR expands proportionally, increasing both storage and computational overhead during serving.
- DMMP (Yi et al., 2024): This method combines an MOE-based structure with knowledge distillation to build a three-tower-based multi-objective EBR model. It also uses a personalized gating network to control the retrieval weight for each objective.
5.1.3. Hyper-parameter Settings
The training process uses a distributed TensorFlow (Abadi et al., 2016) platform, consisting of 10 parameter servers and 60 workers, each with 12 CPUs. The CSMF model is trained in three sequential stages, focusing on the exposure, click, and conversion tasks, respectively. Negative samples for the exposure task are obtained through two methods: BNS (Huang et al., 2020) and the unexposed items within the same page view (PV) (Zheng et al., 2022). For both the click and conversion tasks, negative samples are drawn from BNS. The pruning process is performed using the CPP method, with a predefined threshold parameter of . The adaptive coefficient is set to 1.8 in the AML function. The triplet of online objective weights is configured as . User and item vectors have a dimensionality of 64. During training, the batch size is set to 256, and the learning rate is set to 0.0001.
5.1.4. Evaluation Metrics
The evaluation metrics differ between the offline and online stages. In line with prior research (Zheng et al., 2022), during the offline stage, Recall@N and nDCG@N are used to evaluate model performance on both click and conversion datasets. For the online stage, standard metrics from the advertising system are used to evaluate performance, including RPM (Revenue Per Mille), CVR (Conversion Rate), and CTR (Click-through Rate).
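As a minimal sketch of the two offline metrics, the following code computes Recall@N and nDCG@N for a single request. It assumes binary relevance (an item is relevant if it appears in the user's click or conversion set) and a ranking without duplicates; the function names and the toy item IDs are illustrative and not taken from the paper.

```python
import math


def recall_at_n(ranked_items, relevant_items, n):
    """Fraction of the relevant items that appear in the top-n of the ranking."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:n] if item in relevant)
    return hits / len(relevant)


def ndcg_at_n(ranked_items, relevant_items, n):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:n])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to n of them) first.
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Toy example: five retrieved items, three items the user actually clicked.
retrieved = ["i3", "i7", "i1", "i9", "i4"]
clicked = ["i7", "i4", "i8"]
print(recall_at_n(retrieved, clicked, n=5))
print(ndcg_at_n(retrieved, clicked, n=5))
```

In practice both metrics would be averaged over all requests in the click or conversion test set.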
5.2. Overall Performance Comparison (RQ1)
Table 2 summarizes the performance of our proposed method and the baseline models, highlighting the best results in bold and the second-best results with underlines. All the performance gains are statistically significant at .
On the industrial dataset, the proposed CSMF method consistently outperforms all baseline methods. Specifically, it improves Recall@50 by 6.61% and 7.51% on the click and conversion datasets, respectively, compared to the best results among the baselines. Similarly, CSMF achieves gains of 3.51% and 2.23% in nDCG@50 on the click and conversion datasets, respectively. Notably, MVKE and DMMP outperform MOPPR on the click dataset, whereas MOPPR performs better on the conversion dataset, highlighting the difficulty of balancing multiple objectives in MOE-based methods for cascaded multi-objective tasks. In contrast, by selectively freeing parameter space, CSMF models multiple objectives sequentially, effectively addressing challenges related to information sharing and catastrophic forgetting. Additionally, compared to the YouTubeDNN-Sep model, the YouTubeDNN-Mix model underperforms on the click dataset but shows some improvement on the conversion dataset. As the cascade progresses, positive samples become increasingly sparse, so this result indicates that multi-objective joint training can enhance the performance of “deeper objectives” (e.g., conversion), although it may still suffer from information conflict and catastrophic forgetting.
On the AliExpress dataset, CSMF outperforms all baselines in both Recall@50 and nDCG@50. However, because this dataset lacks unexposed labels for each request, the pretraining task for CSMF was changed from exposure learning to click learning, which yields a smaller improvement than on the industrial dataset. MOPPR also depends on unexposed data and is subject to the same limitation, which explains its inferior performance compared to MVKE and DMMP.
Method | Click nDCG@50 | Click Recall@50 | Conversion nDCG@50 | Conversion Recall@50 |
---|---|---|---|---|
CSMF (Ours) | 0.1178 | 0.3177 | 0.0915 | 0.3694 |
w/o CPP | 0.1158 | 0.3060 | 0.0885 | 0.3529 |
w/o AML | 0.1148 | 0.3097 | 0.0895 | 0.3574 |
w/o AR | 0.1134 | 0.3002 | 0.0912 | 0.3690 |
5.3. Ablation Study (RQ2)
We conducted detailed ablation studies on the industrial dataset to evaluate the effectiveness of each module in the proposed method. We consider three variants, as follows:
- w/o CPP: CSMF trained without the cumulative percentile pruning (CPP) method.
- w/o AML: CSMF trained without the cross-stage AML function.
- w/o AR: CSMF trained without the accuracy recovery operation.
Table 3 presents the performance metrics for the three variant models. The ablation study of CPP evaluates the performance with fixed pruning ratios. Based on the parameter configuration of PackNet (Mallya and Lazebnik, 2018), a fixed pruning ratio of 0.75 was applied. Using the industrial dataset as an example, the CPP method improves the Recall@50 metric by 3.68% and 4.47% on the click and conversion datasets, respectively, demonstrating its effectiveness in facilitating information transfer across cascaded objectives.
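To make the fixed-ratio baseline used in the "w/o CPP" variant concrete, here is a minimal sketch of PackNet-style magnitude pruning: after a task is trained, the given fraction of lowest-magnitude weights is freed for the next task and the rest are kept. This illustrates the fixed-ratio alternative only, not the CPP method itself; the function name and the random weights are hypothetical.

```python
import random


def fixed_ratio_mask(weights, prune_ratio):
    """Return a keep-mask for a flat list of weights.

    The `prune_ratio` fraction of weights with the smallest magnitude is
    marked False (freed for the next task); all others are marked True.
    """
    num_pruned = int(round(prune_ratio * len(weights)))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    return [i not in pruned for i in range(len(weights))]


random.seed(0)
layer = [random.gauss(0.0, 1.0) for _ in range(128)]  # hypothetical layer weights
mask = fixed_ratio_mask(layer, prune_ratio=0.75)

kept = [abs(w) for w, keep in zip(layer, mask) if keep]
freed = [abs(w) for w, keep in zip(layer, mask) if not keep]
print(len(kept), len(freed))
```

With a ratio of 0.75, a quarter of the weights stays reserved for the current objective and three quarters become available to downstream objectives.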
The AML method improves the Recall@50 metric by 2.52% and 3.25% on the click and conversion datasets, respectively. Regarding nDCG@50, the AML method achieves improvements of 2.55% and 2.19% on the click and conversion datasets, respectively, highlighting its potential to alleviate optimization conflicts between objectives. Additionally, removing the accuracy recovery operation hurts the click task noticeably while having a minimal effect on the conversion task. As the conversion task is the final stage, it is hardly affected by this variant. The results demonstrate that parameter pruning still negatively affects the current task’s accuracy, which is expected. Hence, accuracy recovery is a critical operation in CSMF.
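The percentages quoted for CPP and AML can be reproduced from Table 3 if the difference is measured relative to the full CSMF model, that is, (full - variant) / full. This normalization is our reading of the numbers rather than something stated explicitly, and the dictionary keys below are illustrative names.

```python
# Values copied from Table 3 (industrial dataset).
full = {"click_ndcg": 0.1178, "click_recall": 0.3177,
        "conv_ndcg": 0.0915, "conv_recall": 0.3694}
variants = {
    "w/o CPP": {"click_ndcg": 0.1158, "click_recall": 0.3060,
                "conv_ndcg": 0.0885, "conv_recall": 0.3529},
    "w/o AML": {"click_ndcg": 0.1148, "click_recall": 0.3097,
                "conv_ndcg": 0.0895, "conv_recall": 0.3574},
    "w/o AR": {"click_ndcg": 0.1134, "click_recall": 0.3002,
               "conv_ndcg": 0.0912, "conv_recall": 0.3690},
}


def relative_drop(full_value, variant_value):
    """Percentage lost when a module is removed, relative to the full model."""
    return 100.0 * (full_value - variant_value) / full_value


for name, scores in variants.items():
    drops = {metric: round(relative_drop(full[metric], value), 2)
             for metric, value in scores.items()}
    print(name, drops)
```

The same computation shows that removing accuracy recovery costs the click task several percent of Recall@50 while leaving the conversion metrics almost unchanged.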

5.4. Hyperparameters Sensitivity Analysis (RQ3)
This section examines the sensitivity of the hyperparameters of the CSMF method on the industrial dataset. We set the pruning ratio to eight increasing values: . Figure 4 presents the experimental results corresponding to these parameter settings. As the pruning ratio increases from 25% to 75%, model efficiency steadily improves. This trend suggests that upstream tasks using large datasets require a larger parameter space for effective learning. Additionally, information from the upstream model can assist the downstream model in improving efficiency. However, as the pruning ratio increases from 75% to 95%, model efficiency declines significantly, highlighting the need for each objective to have a sufficient independent parameter space. Our method effectively reduces information conflicts between cascading objectives without increasing the number of model parameters.




Additionally, represents the adaptive margin adjustment coefficient in the cross-stage training of the AML method. We configure six continuous values for : . Figure 5 (a) shows the experimental results for these parameters. As increases, the change in Recall@50 is more significant on the conversion dataset than on the click dataset. This indicates that optimization conflicts for the conversion objective become more pronounced. The AML method mitigates objective conflicts during multi-stage training.
represents the retrieval weight of the click objective. With =1 and =1.2, we configure six sets of continuous parameters for . Figure 5 (b) shows the experimental results for these parameters. As increases, the Recall@50 for the click dataset gradually improves; however, when exceeds 1.8, the Recall@50 for the conversion dataset begins to decline.
represents the retrieval weight for the conversion objective. With and , we configure six sets of continuous parameters for . Figure 5 (c) shows the experimental results for these parameters. As increases, Recall@50 on the conversion dataset improves gradually. However, when exceeds 1.2, Recall@50 on the click dataset starts to decline.
Similarly, represents the retrieval weight for the exposure objective. With and , we configure six sets of continuous parameters for . Figure 5 (d) shows the experimental results for these parameters. Beyond , Recall@50 starts to decline on both the click and conversion datasets.
These results show that achieving a jointly optimal set of multiple objectives requires carefully balancing the weight coefficients for each objective. Moreover, the CSMF method can optimize retrieval for each objective, thus addressing the diverse needs of various industrial scenarios.
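The trade-off described above can be illustrated with a small sketch of a linearly weighted fusion score over per-objective probabilities. The candidate probabilities, the weight values, and the function names are hypothetical and chosen only to show how shifting weight between objectives reorders the retrieved items; in the deployed system the score is served through ANN search rather than computed item by item.

```python
def fusion_score(probs, weights):
    """Linearly weighted sum of the per-objective probabilities."""
    return sum(weights[name] * probs[name] for name in weights)


def rank_items(candidates, weights, top_k):
    """Return the top_k item IDs ordered by descending fusion score."""
    scored = sorted(
        candidates.items(),
        key=lambda kv: fusion_score(kv[1], weights),
        reverse=True,
    )
    return [item_id for item_id, _ in scored[:top_k]]


# Hypothetical per-objective probabilities for three candidate items.
candidates = {
    "item_a": {"exposure": 0.60, "click": 0.10, "conversion": 0.01},
    "item_b": {"exposure": 0.30, "click": 0.25, "conversion": 0.02},
    "item_c": {"exposure": 0.20, "click": 0.15, "conversion": 0.20},
}

# Two hypothetical weight triplets: one favors clicks, one favors conversions.
click_heavy = {"exposure": 1.0, "click": 2.0, "conversion": 1.0}
conversion_heavy = {"exposure": 1.0, "click": 1.0, "conversion": 3.0}

print(rank_items(candidates, click_heavy, top_k=3))
print(rank_items(candidates, conversion_heavy, top_k=3))
```

Raising one weight promotes items that are strong on that objective at the expense of the others, which is consistent with the Recall@50 trade-offs observed in Figure 5.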
Method | Storage size of vectors (MB) | ANN indexing time (ms/request) |
---|---|---|
MOPPR | 1020.11 | 1.21 |
MVKE | 4077.52 (+299%) | 3.96 (+227%) |
DMMP | 1022.15 (+0.20%) | 2.15 (+77%) |
CSMF (Ours) | 1019.95 (-0.02%) | 1.22 (+0.83%) |
5.5. Online Serving Performance (RQ4)
The online deployment of an ANN-based multi-objective EBR method consists of two components: the user side and the item side. On the user side, the main concern is the ANN retrieval time, typically measured in milliseconds per request. Since item vectors are pre-computed, the main concern on the item side is the memory required for online deployment. This online performance test evaluates the four best-performing methods in terms of nDCG@50 and Recall@50: MOPPR, MVKE, DMMP, and our proposed CSMF.
As shown in Table 4, compared to MOPPR, MVKE significantly increases both the storage requirements and the ANN retrieval time during online serving. Specifically, the item vector table expands by 299%, and the ANN retrieval time increases by 227%. The MOE-based DMMP method keeps storage nearly unchanged (+0.20%) but still increases the ANN retrieval time by 77%. In contrast, our proposed CSMF method does not increase the storage space for online deployment, and the ANN retrieval time increases by only 0.83%, remaining within acceptable limits.
The multi-stage training process of CSMF increases offline training time by about 35% compared to MOPPR (from 3h 23min to 4h 34min). However, this increase is still within an acceptable range and does not affect the model’s daily updates.
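As a quick check of the arithmetic, the sketch below recomputes the relative overheads from the raw values in Table 4 and from the two reported training times, taking MOPPR as the baseline. The helper name and the data layout are illustrative.

```python
def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# (storage in MB, ANN time in ms/request), copied from Table 4.
moppr = (1020.11, 1.21)
methods = {
    "MVKE": (4077.52, 3.96),
    "DMMP": (1022.15, 2.15),
    "CSMF": (1019.95, 1.22),
}

for name, (storage_mb, ann_ms) in methods.items():
    print(
        f"{name}: storage {pct_change(storage_mb, moppr[0]):+.2f}%, "
        f"ANN time {pct_change(ann_ms, moppr[1]):+.2f}%"
    )

# Offline training time in minutes: 3h 23min for MOPPR, 4h 34min for CSMF.
moppr_minutes = 3 * 60 + 23
csmf_minutes = 4 * 60 + 34
training_overhead = pct_change(csmf_minutes, moppr_minutes)
print(f"training time {training_overhead:+.2f}%")
```

The serving-side overhead of CSMF stays below one percent on both axes, while the extra cost is paid once per training run offline.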
Industrial recommendation systems operate under strict time constraints to ensure a seamless user experience, so system efficiency is critical. Our method improves retrieval performance without imposing additional strain on the online system.
5.6. Online Experiments
To validate the effectiveness of our proposed method, we conducted an online A/B test on an online advertising recommendation system from October 5 to 16, 2024. The control group used the MOPPR model, while the experimental group used our proposed CSMF method. To ensure fairness, each group consisted of 25% of users, selected at random. Compared to the baseline model, CSMF achieved a 0.42% increase in RPM, a 0.57% increase in CTR, and a 0.67% increase in CVR. These online results further validate the effectiveness of the proposed CSMF method for multi-objective EBR.
6. Conclusion
In this paper, we address the limitations of existing multi-objective embedding-based retrieval (EBR) methods by proposing the Cascaded Selective Mask Fine-Tuning (CSMF). CSMF innovatively organizes the training process into three stages: pre-training a backbone model with large-scale exposure data, followed by sequential fine-tuning on click and conversion tasks. A key feature of CSMF is its selective masking of redundant parameters during fine-tuning, which not only preserves information from the upstream model but also mitigates conflicts between objectives. Importantly, CSMF achieves these improvements without increasing output vector dimensionality, thereby avoiding additional retrieval latency and storage overhead.
Our findings demonstrate that CSMF significantly enhances retrieval efficiency and system performance during online serving. By employing a modified softmax loss function and an efficient parameter selection method, CSMF effectively addresses objective conflicts and reduces catastrophic forgetting. Moreover, CSMF enables flexible computation of weighted fusion scores for multiple objective probabilities, supporting adaptable retrieval in various recommendation scenarios. Extensive offline experiments on real-world datasets and online deployment in an advertising system validate the superior performance and practical value of CSMF. In summary, CSMF offers a novel and practical solution for multi-objective EBR, providing valuable insights for future research in this domain.
References
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: a system for Large-Scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 265–283.
- Alyafeai et al. (2020) Zaid Alyafeai, Maged Saeed AlShaibani, and Irfan Ahmad. 2020. A survey on transfer learning in natural language processing. arXiv preprint arXiv:2007.04239 (2020).
- Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning. 129–136.
- Crawshaw (2020) Michael Crawshaw. 2020. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796 (2020).
- Devlin (2018) Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Dodge et al. (2020) Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305 (2020).
- Fan et al. (2022) Zhifang Fan, Dan Ou, Yulong Gu, Bairan Fu, Xiang Li, Wentian Bao, Xin-Yu Dai, Xiaoyi Zeng, Tao Zhuang, and Qingwen Liu. 2022. Modeling users’ contextualized page-wise feedback for click-through rate prediction in e-commerce search. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 262–270.
- Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021).
- Hayou et al. (2024) Soufiane Hayou, Nikhil Ghosh, and Bin Yu. 2024. Lora+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354 (2024).
- He et al. (2023) Yunzhong He, Yuxin Tian, Mengjiao Wang, Feier Chen, Licheng Yu, Maolong Tang, Congcong Chen, Ning Zhang, Bin Kuang, and Arul Prakash. 2023. Que2engage: Embedding-based retrieval for relevant and engaging products at facebook marketplace. In Companion Proceedings of the ACM Web Conference 2023. 386–390.
- Hinton (2015) Geoffrey Hinton. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
- Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
- Huang et al. (2020) Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020. Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2553–2561.
- Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2333–2338.
- Jiang et al. (2022) Yuchen Jiang, Qi Li, Han Zhu, Jinbei Yu, Jin Li, Ziru Xu, Huihui Dong, and Bo Zheng. 2022. Adaptive domain interest network for multi-domain recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3212–3221.
- Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019), 535–547.
- Kaufinann (2006) Morgan Kaufinann. 2006. Data mining: Concepts and techniques. (2006), 4.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
- Liu et al. (2024) Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. 2024. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353 (2024).
- Ma et al. (2018b) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018b. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1930–1939.
- Ma et al. (2018a) Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018a. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1137–1140.
- Mallya et al. (2018) Arun Mallya, Dillon Davis, and Svetlana Lazebnik. 2018. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European conference on computer vision (ECCV). 67–82.
- Mallya and Lazebnik (2018) Arun Mallya and Svetlana Lazebnik. 2018. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 7765–7773.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems (2013).
- pengcheng Li et al. (2020) Pengcheng Li, Runze Li, Qing Da, An-Xiang Zeng, and Lijun Zhang. 2020. Improving Multi-Scenario Learning to Rank in E-commerce by Exploiting Task Relationships in the Label Space. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM 2020, Virtual Event, Ireland, October 19-23, 2020. ACM, New York, NY, USA.
- Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. 285–295.
- Shahroudnejad (2021) Atefeh Shahroudnejad. 2021. A survey on understanding, visualizations, and explanation of deep neural networks. arXiv preprint arXiv:2102.01792 (2021).
- Tang et al. (2020) Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems. 269–278.
- Wang et al. (2024) Xu Wang, Jiangxia Cao, Zhiyi Fu, Kun Gai, and Guorui Zhou. 2024. HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou. arXiv preprint arXiv:2408.05430 (2024).
- Wang et al. (2023) Yuhao Wang, Ha Tsz Lam, Yi Wong, Ziru Liu, Xiangyu Zhao, Yichao Wang, Bo Chen, Huifeng Guo, and Ruiming Tang. 2023. Multi-task deep recommender systems: A survey. arXiv preprint arXiv:2302.03525 (2023).
- Xin et al. (2024) Yi Xin, Siqi Luo, Haodi Zhou, Junlong Du, Xiaohong Liu, Yue Fan, Qing Li, and Yuntao Du. 2024. Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv preprint arXiv:2402.02242 (2024).
- Xu et al. (2022) Zhenhui Xu, Meng Zhao, Liqun Liu, Lei Xiao, Xiaopeng Zhang, and Bifeng Zhang. 2022. Mixture of virtual-kernel experts for multi-objective user profile modeling. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4257–4267.
- Yi et al. (2024) Qingqing Yi, Jingjing Tang, Yujian Zeng, Xueting Zhang, and Weiqi Xu. 2024. DMMP: A distillation-based multi-task multi-tower learning model for personalized recommendation. Knowledge-Based Systems 284 (2024), 111236.
- Yi et al. (2019) Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM conference on recommender systems. 269–277.
- Yu et al. (2020) Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33 (2020), 5824–5836.
- Zhang et al. (2022) Jianjin Zhang, Zheng Liu, Weihao Han, Shitao Xiao, Ruicheng Zheng, Yingxia Shao, Hao Sun, Hanqing Zhu, Premkumar Srinivasan, Weiwei Deng, et al. 2022. Uni-retriever: Towards learning the unified embedding based retriever in bing sponsored search. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4493–4501.
- Zhao et al. (2021) Zhong Zhao, Yanmei Fu, Hanming Liang, Li Ma, Guangyao Zhao, and Hongwei Jiang. 2021. Distillation based multi-task learning: A candidate generation model for improving reading duration. arXiv preprint arXiv:2102.07142 (2021).
- Zheng et al. (2022) Yukun Zheng, Jiang Bian, Guanghao Meng, Chao Zhang, Honggang Wang, Zhixuan Zhang, Sen Li, Tao Zhuang, Qingwen Liu, and Xiaoyi Zeng. 2022. Multi-Objective Personalized Product Retrieval in Taobao Search. arXiv preprint arXiv:2210.04170 (2022).
- Zhou et al. (2024) Hongyun Zhou, Xiangyu Lu, Wang Xu, Conghui Zhu, and Tiejun Zhao. 2024. LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation. arXiv preprint arXiv:2402.07721 (2024).