Improving Cross-lingual Representation for Semantic Retrieval with Code-switching

Mieradilijiang Maimaiti ${}^{1}$, Yuanhang Zheng ${}^{2*}$, Ji Zhang ${}^{1}$, Fei Huang ${}^{1}$, Yue Zhang ${}^{3}$, Wenpei Luo ${}^{4}$, Kaiyu Huang ${}^{5}$

${}^{1}$ Alibaba DAMO Academy
${}^{2}$ Department of Computer Science and Technology, Tsinghua University, Beijing, China
${}^{3}$ Department of Computer Science and Technology, Westlake University, Hangzhou, China
${}^{4}$ Department of Computer Science and Technology, Dalian University of Technology, Dalian
${}^{5}$ Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, China

{mieradilijiang.mea, zj122146, f.huang}@alibaba-inc.com, zheng-yh19@mails.tsinghua.edu.cn, zhangyue@westlake.edu.cn, 22109239@mail.dlut.edu.cn, kyhuang@bjtu.edu.cn

Equal contribution
Corresponding author: Ji Zhang

Abstract

Semantic Retrieval (SR) has become an indispensable part of the FAQ system in the task-oriented question-answering (QA) dialogue scenario. The demand for cross-lingual smart-customer-service systems for e-commerce platforms and other particular business conditions has been increasing recently. Most previous studies exploit cross-lingual pre-trained models (PTMs) for multi-lingual knowledge retrieval directly, while some others also leverage continual pre-training before fine-tuning PTMs on the downstream tasks. However, no matter which scheme is used, previous work neglects to inform PTMs of the features of the downstream task, i.e., it trains PTMs without providing any signals related to SR. To this end, we propose an Alternative Cross-lingual PTM for SR via code-switching. We are the first to utilize the code-switching approach for cross-lingual SR. Besides, we introduce a novel code-switched continual pre-training step instead of directly using the PTMs on the SR tasks. The experimental results show that our proposed approach consistently outperforms the previous SOTA methods on SR and semantic textual similarity (STS) tasks with three business corpora and four open datasets in 20+ languages.

1 Introduction

In recent years, pre-trained models (PTMs) have demonstrated success on many downstream tasks of natural language processing (NLP). PTMs such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019) have achieved remarkable results by transferring knowledge learned from large unlabeled corpora to various downstream NLP tasks. To learn cross-lingual representations, previous methods like multi-lingual BERT (mBERT) (Devlin et al., 2019) and XLM (Conneau and Lample, 2019) have extended PTMs to multiple languages.

Figure 1: A brief illustration of semantic retrieval leveraging a knowledge base for the FAQ system in the task-oriented dialogue scenario.

Semantic retrieval (SR) (Kiros et al., 2015) has become a ubiquitous method in the FAQ system (i.e., task-oriented question-answering (QA) (Xiong et al., 2021)), which is incorporated into the smart-customer-service platform for the e-commerce scenario. For the cross-lingual scenario, many pre-training methods have been presented for multi-lingual downstream tasks, such as XLM-R (Conneau et al., 2020), XNLG (Chi et al., 2020), InfoXLM (Chi et al., 2021) and VECO (Luo et al., 2021). The main challenge of SR is how to accurately retrieve the corresponding sentence from the knowledge base (query-label pairs) (Kiros et al., 2015). Commonly used approaches mainly take some variant of the BERT model as a backbone and then directly fine-tune it on the downstream tasks (Devlin et al., 2019; Conneau and Lample, 2019; Huang et al., 2019; Yang et al., 2021; Ouyang et al., 2021).

Specifically, mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), Unicoder (Huang et al., 2019), CMLM (Yang et al., 2021) and ERNIE-M (Ouyang et al., 2021) learn the cross-lingual sentence representation mainly using masked language modeling (MLM). For other objective functions, MMTE (Siddhant et al., 2020) exploits multi-lingual machine translation, and CRISS (Tran et al., 2020) leverages unsupervised parallel data mining. Some models use Siamese network architectures to better adapt them to SR. For example, InferSent (Conneau et al., 2017) uses natural language inference (NLI) datasets to train the Siamese network. USE (Yang et al., 2019), M-USE (Yang et al., 2020b) and LaBSE (Feng et al., 2022a) exploit ranking loss. SimCSE (Gao et al., 2021), InfoXLM (Chi et al., 2021) and HICTL (Wei et al., 2021) use contrastive learning. However, these closely related approaches largely neglect to pass features of the downstream task to the PTMs. In other words, most methods (Chi et al., 2021; Luo et al., 2021) directly fine-tune the models on downstream tasks without providing any signals related to SR. In addition, they are mainly pre-trained on combined monolingual data in which few of the sentences are code-switched. Since user queries often contain many code-switched sentences, directly applying the commonly used methods is insufficient for the SR task in the e-commerce scenario.

In this work, as depicted in Figure 1, we aim to enhance the performance of SR for the FAQ system in the e-commerce scenario. We propose a novel pre-training approach for sentence-level SR with code-switched cross-lingual data. Our motivation comes from a gap left by previous studies. A recent study (Yang et al., 2020a) exploits the code-switching strategy in the machine translation scenario, but no one has tried to leverage code-switching for multi-lingual SR. Furthermore, previous methods (Xu et al., 2022) have exploited multi-lingual PTMs on the SR task by masking only the query and not the label. They aim to use more efficient PTMs for fine-tuning on the SR task rather than making the PTMs stronger by providing task-related signals. To allow the PTMs to learn signals directly related to downstream tasks, we present an Alternative Cross-Lingual PTM for semantic retrieval using code-switching, which consists of three main steps. First, we generate code-switched data based on bilingual dictionaries. Then, we pre-train a model on the code-switched data using a weighted sum of the alternating language modeling (ALM) loss (Yang et al., 2020a) and the similarity loss. Finally, we fine-tune the model on the SR corpus. By providing additional training signals related to SR during the pre-training process, our proposed approach can learn the SR task better. Our main contributions are as follows:

• Experiments show that our approach remarkably outperforms the SOTA methods on various evaluation metrics.
• Our method improves the robustness of the model for sentence-level SR on both the in-house datasets and open corpora.
• To the best of our knowledge, we are the first to present an alternative cross-lingual PTM for SR using code-switching in the FAQ system (e-commerce scenario).

2 Preliminaries

2.1 Masked Language Modeling

Masked language modeling (MLM) (Devlin et al., 2019) is a pre-training objective focused on learning representations of natural language sentences. When pre-training a model using MLM objectives, we let the model predict the masked words in the input sentence. Formally, we divide each sentence $\mathbf{x}$ into the masked part $\mathbf{x}_{m}$ and the observed part $\mathbf{x}_{o}$ , and we train the model (which is parameterized by $\mathbf{\theta}$ ) to minimize

$$\mathcal{L}_{MLM}=-\log P(\mathbf{x}_{m}|\mathbf{x}_{o};\mathbf{\theta}). \tag{1}$$

When calculating Eq.(1), we assume that the model independently predicts each masked word. Formally, we assume that all masked words $x_{i}$ in the masked part $\mathbf{x}_{m}$ are independent conditioned on $\mathbf{x}_{o}$ . Thus, Eq.(1) can be rewritten as

$$\mathcal{L}_{MLM}=-\sum_{x_{i}\in\mathbf{x}_{m}}\log P(x_{i}|\mathbf{x}_{o};\mathbf{\theta}). \tag{2}$$
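The equivalence between Eq. (1) and Eq. (2) under the independence assumption can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `mlm_loss` and the probability values are hypothetical, and the list is assumed to hold the probability the model assigns to each correct masked token.

```python
import math

def mlm_loss(masked_token_probs):
    """Eq. (2): negative sum of log-probabilities of the masked tokens.

    masked_token_probs[i] stands for P(x_i | x_o; theta), the probability
    the model assigned to the correct token at a masked position.
    """
    return -sum(math.log(p) for p in masked_token_probs)

# Hypothetical model outputs for two masked tokens.
probs = [0.5, 0.25]
loss = mlm_loss(probs)

# Under the independence assumption, the joint probability in Eq. (1)
# is the product of the per-token probabilities, so both losses agree.
joint_loss = -math.log(0.5 * 0.25)
```

Summing per-token log-probabilities rather than multiplying probabilities is also the numerically safer choice when many tokens are masked.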

2.2 Cross-lingual LM Pre-training

To improve the performances of various models on the NLP tasks of different languages, cross-lingual PTMs (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020) have been proposed. Generally, cross-lingual PTMs are trained on multi-lingual corpora using the MLM objective. During the pre-training process, the corpora of low-resource languages are usually oversampled to improve the model’s performance on low-resource languages. To better align the representations of the sentences in different languages, cross-lingual PTMs may use another objective called translation language modeling (TLM), which requires the model to predict the masked words in both the source and the target sentences in a parallel sentence pair. Formally, given a parallel sentence pair $\langle\mathbf{x},\mathbf{y}\rangle$ , we randomly divide the source sentence $\mathbf{x}$ into the masked part $\mathbf{x}_{m}$ and the observed part $\mathbf{x}_{o}$ , and also divide the target sentence $\mathbf{y}$ into the masked part $\mathbf{y}_{m}$ and the observed part $\mathbf{y}_{o}$ . Then we minimize

$$\mathcal{L}_{TLM}=-\log P(\mathbf{x}_{m},\mathbf{y}_{m}|\mathbf{x}_{o},\mathbf{y}_{o};\mathbf{\theta}). \tag{3}$$
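To make the TLM setup concrete, the sketch below builds one training example from a parallel pair by masking tokens on both sides. It is illustrative only: the `[MASK]` and `[SEP]` symbols, the function names, and the default masking rate are assumptions, not details taken from this paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol
SEP = "[SEP]"    # hypothetical separator between source and target

def mask_tokens(tokens, mask_rate, rng):
    """Split tokens into an observed sequence and a dict of masked targets."""
    observed, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            observed.append(MASK)
            targets[i] = tok  # position -> original token to predict
        else:
            observed.append(tok)
    return observed, targets

def build_tlm_example(src, tgt, mask_rate=0.15, seed=0):
    """Return the model input (x_o and y_o concatenated) plus x_m and y_m."""
    rng = random.Random(seed)
    x_o, x_m = mask_tokens(src, mask_rate, rng)
    y_o, y_m = mask_tokens(tgt, mask_rate, rng)
    # The model sees both observed parts at once and must predict the
    # masked words on both sides, as in Eq. (3).
    return x_o + [SEP] + y_o, x_m, y_m

src = ["I", "like", "music"]
tgt = ["我", "喜欢", "音乐"]
all_masked = build_tlm_example(src, tgt, mask_rate=1.0)
none_masked = build_tlm_example(src, tgt, mask_rate=0.0)
```

Because the source and target are presented together, a masked source word can be recovered from its translation in the observed target part, which is what encourages cross-lingual alignment.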

2.3 Semantic Retrieval

Semantic retrieval (SR) aims to retrieve sentences similar to the query sentence in a knowledge base (Kiros et al., 2015). Specifically, the semantic retrieval model converts sentences into vectors, and similar sentences are retrieved based on the cosine similarity.

Formally, given a sentence $\mathbf{x}$ , the model encodes $\mathbf{x}$ into a vector $\mathbf{v}_{\mathbf{x}}$ . When we need to retrieve sentences similar to the query $\mathbf{q}$ , we calculate the cosine similarity between $\mathbf{v}_{\mathbf{q}}$ and $\mathbf{v}_{\mathbf{x}}$ for each sentence $\mathbf{x}$ in the knowledge base $\mathcal{K}$ :

$$sim(\mathbf{q},\mathbf{x})=\frac{\mathbf{v}_{\mathbf{q}}\cdot\mathbf{v}_{\mathbf{x}}}{||\mathbf{v}_{\mathbf{q}}||\times||\mathbf{v}_{\mathbf{x}}||}. \tag{4}$$

Finally, we retrieve the sentence $\mathbf{x}^{*}$ which is most similar to the query $\mathbf{q}$ in $\mathcal{K}$ :

$$\mathbf{x}^{*}=\mathop{\rm argmax}\limits_{\mathbf{x}\in\mathcal{K}}sim(\mathbf{q},\mathbf{x}). \tag{5}$$
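Eqs. (4) and (5) amount to a nearest-neighbor search under cosine similarity. The sketch below illustrates this with a toy knowledge base; the sentences and their three-dimensional vectors are made up for illustration, and in practice the vectors would come from the sentence encoder.

```python
import math

def cosine_similarity(u, v):
    """Eq. (4): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, knowledge_base):
    """Eq. (5): return the sentence whose vector is most similar to the query."""
    return max(
        knowledge_base,
        key=lambda sent: cosine_similarity(query_vec, knowledge_base[sent]),
    )

# Toy knowledge base: sentence -> hypothetical embedding.
kb = {
    "how to track my order": [0.9, 0.1, 0.0],
    "how to request a refund": [0.1, 0.9, 0.2],
}
best = retrieve([0.8, 0.2, 0.1], kb)
```

This exhaustive scan is linear in the size of the knowledge base; for large knowledge bases an approximate nearest-neighbor index is the usual replacement.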

2.4 Code-switching

To reduce the representation gap between words of different languages in cross-lingual PTMs, Yang et al. (2020a) proposed ALM, which is based on code-switching. Specifically, given a source sentence, we construct a code-switched sentence by randomly replacing some source words with the corresponding target words. For example, suppose the English source sentence is “I like music” and we replace some words in the sentence with Chinese. If the replaced English words are “I” and “music” and their corresponding Chinese words are “我” (wǒ) and “音乐” (yīnyuè), respectively, then the code-switched sentence is “我 like 音乐”.

Formally, suppose that we conduct code-switching on a source sentence $\mathbf{x}=\{x_{1},x_{2},\dots,x_{n}\}$ . First, we randomly choose a subset $S$ from $\{1,2,\dots,n\}$ . Then, for each element $i{\in}S$ , we replace $x_{i}$ with its corresponding target word $y_{i}$ to construct the code-switched sentence $\mathbf{z}=\{z_{1},z_{2},\dots,z_{n}\}$ , where

$$z_{i}=\begin{cases}x_{i}&i{\notin}S,\\ y_{i}&i{\in}S.\end{cases} \tag{6}$$
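Eq. (6) can be written directly as a short function. The sketch below is illustrative: it assumes a word-aligned target sequence of the same length as the source, and the function name `code_switch` is hypothetical.

```python
def code_switch(source, target, switch_positions):
    """Eq. (6): z_i = y_i if i is in S, otherwise x_i (positions are 1-indexed)."""
    return [
        target[i - 1] if i in switch_positions else source[i - 1]
        for i in range(1, len(source) + 1)
    ]

x = ["I", "like", "music"]
y = ["我", "喜欢", "音乐"]  # word-aligned translations, for illustration only
z = code_switch(x, y, {1, 3})  # S = {1, 3}
```

With $S=\{1,3\}$ this reproduces the running example, in which “I” and “music” are replaced and “like” is kept.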

3 Method

3.1 Alternative Cross-lingual PTM

The main architecture of our model is shown in Figure 2. Existing PTMs are trained without signals directly related to the downstream task (e.g., SR). To address this limitation, we jointly train the PTM on the code-switched data using the cross-lingual masked language model (XMLM) loss and a similarity loss. The similarity loss term is added to the pre-training objective to adjust the similarity between the input ( $query$ ) and the output ( $label$ ), so that this similarity is controlled during the continual pre-training step. For the sentence-level SR task, the given knowledge is composed of a certain number of $\langle query,label\rangle$ pairs (see Figure 2). We regard $\mathbf{q}$ as a $query$ and take $\mathbf{l}$ as the corresponding $label$ . Given a query $\mathbf{q}=\{q_{1},\dots,q_{i},\dots,q_{I}\}$ and a label $\mathbf{l}=\{l_{1},\dots,l_{j},\dots,l_{J}\}$ , standard retrieval models usually formulate sentence-level SR as the calculation of the similarity between $query$ and $label$ in the semantic space:

$$\mathbf{v}_{\mathbf{q}}=encode(q_{1},\dots,q_{i},\dots,q_{I}), \tag{7}$$
$$\mathbf{v}_{\mathbf{l}}=encode(l_{1},\dots,l_{j},\dots,l_{J}). \tag{8}$$

Algorithm 1 Cross-lingual SR with Code-switching

1: user query $Q_{user}=\{\mathbf{q}_{user}^{(u)}\}_{u=1}^{U}$ , monolingual knowledge (query-label pairs) $\mathcal{K}_{mono}=\{\langle\mathbf{q}^{(m)},\mathbf{l}^{(m)}\rangle\}_{m=1}^{M}$ , bilingual dictionary $D_{bi}=\{\langle L_{1}^{(n)},L_{en}^{(n)}\rangle\}_{n=1}^{N}$ ;
2: retrieved top- $k$

3.2 Building Code-switched Data for SR

We aim to improve the SR model in the business scenario. Since the LAZADA corpus includes many code-switched sentences, we build the code-switched data using authentic business corpora and language features. During the construction of the code-switched data, we replace a certain percentage of the tokens in the $query$ and $label$ with the corresponding words in other languages, based on openly available multi-lingual lexicon-level dictionaries.

Table 4: Comparison of accuracy scores (Top-30 queries) between our method and the baseline systems on business corpora.

Model	AliExpress			LAZADA				DARAZ				Avg.
Model	Ar	En	Zh	Id	Ms	Fil	Th	Ur	Bn	Ne	Si	Avg.
mBERT	79.6	78.0	89.4	55.3	53.6	70.4	71.1	83.5	82.3	56.7	75.8	72.3
Unicoder	64.3	69.2	79.9	46.0	48.8	64.4	62.1	74.6	75.3	48.7	65.8	63.6
XLMR ${}_{Large}$	81.1	81.0	90.1	68.3	59.9	71.2	82.1	85.6	84.5	65.2	70.2	76.3
SimCSE-BERT ${}_{Large}$	72.2	79.0	52.0	49.3	56.5	72.2	78.5	82.8	83.1	57.6	76.0	69.0
InfoXLM ${}_{Large}$	79.7	82.5	89.6	69.0	58.1	75.4	80.4	82.7	80.7	60.2	76.8	75.9
VECO	85.9	83.0	91.4	68.7	58.7	75.3	82.3	87.3	87.6	61.4	81.4	78.1
CMLM	84.3	80.4	91.2	65.3	57.8	74.3	76.7	85.4	87.5	61.6	81.4	76.9
LaBSE	85.8	81.1	91.4	65.8	59.8	75.7	76.7	85.7	85.6	62.9	81.6	77.5
Ours	89.0	82.6	93.9	73.7	62.8	78.7	83.5	88.5	90.9	67.2	84.0	81.3

The code-switched cross-lingual corpus consists of two parts: $query$ and $label$ . The original $query$ and $label$ are formulated as follows:

$$\mathbf{q}=\{q_{1},\dots,q_{i},\dots,q_{I}\}, \tag{14}$$
$$\mathbf{l}=\{l_{1},\dots,l_{j},\dots,l_{J}\}, \tag{15}$$

where $I$ and $J$ represent the lengths of $query$ and $label$ , respectively. We replace tokens in the $query$ and $label$ with words from the languages frequently used on Alibaba's overseas cross-border e-commerce platforms. The newly constructed data are as follows:

$$\mathbf{q}^{\prime}=\{q_{1}^{\prime},\dots,q_{i}^{\prime},\dots,q_{I}^{\prime}\}, \tag{16}$$
$$\mathbf{l}^{\prime}=\{l_{1}^{\prime},\dots,l_{j}^{\prime},\dots,l_{J}^{\prime}\}, \tag{17}$$

where $q_{i}^{\prime}$ and $l_{j}^{\prime}$ denote the tokens after the replacement (for all integers $i\in[1,I]$ and $j\in[1,J]$ ). As the final step, we combine the newly generated $\mathbf{q}^{\prime}$ and $\mathbf{l}^{\prime}$ to build the linguistically motivated code-switched monolingual corpus (i.e., $\langle\mathbf{q}^{\prime},\mathbf{l}^{\prime}\rangle$ ). Then we continually pre-train our model, following an idea similar to ALM.
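The construction of $\langle\mathbf{q}^{\prime},\mathbf{l}^{\prime}\rangle$ in Eqs. (14)-(17) can be sketched as a dictionary lookup applied to both sides of a pair. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy English-to-Indonesian dictionary, the lowercase lookup, and the default replacement rate are all hypothetical.

```python
import random

def code_switch_with_dict(tokens, dictionary, rate, rng):
    """Replace each token that has a dictionary entry with probability `rate`.

    The output has the same length as the input, matching Eqs. (16)-(17).
    """
    switched = []
    for tok in tokens:
        candidates = dictionary.get(tok.lower())
        if candidates and rng.random() < rate:
            switched.append(rng.choice(candidates))
        else:
            switched.append(tok)  # keep the original token
    return switched

def build_code_switched_pair(query, label, dictionary, rate=0.1, seed=0):
    """Apply the same replacement procedure to the query and its label."""
    rng = random.Random(seed)
    q_prime = code_switch_with_dict(query, dictionary, rate, rng)
    l_prime = code_switch_with_dict(label, dictionary, rate, rng)
    return q_prime, l_prime

# Toy dictionary, for illustration only.
toy_dict = {"order": ["pesanan"], "my": ["saya"]}
query = ["where", "is", "my", "order"]
label = ["track", "your", "order"]
q_prime, l_prime = build_code_switched_pair(query, label, toy_dict, rate=1.0)
```

Tokens without a dictionary entry are left unchanged, so the effective code-switching rate depends on dictionary coverage as well as on `rate`.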

4 Experiments

4.1 Setup

Data preparation

The languages selected from the business dataset are Arabic (Ar), English (En), Chinese (Zh), Indonesian (Id), Malay (Ms), Filipino (Fil), Thai (Th), Urdu (Ur), Bengali (Bn), Nepali (Ne), and Sinhala (Si). Specifically, Ar, En, and Zh originate from the AliExpress corpora, Id, Ms, Fil, and Th from the LAZADA corpora, and Ur, Bn, Ne, and Si from the DARAZ corpora. The characteristics of our business corpora are shown in Table 1. Among them, the LAZADA corpus is a code-switched dataset, and we provide the code-switching rates on offline and online data separately (see Table 2). We also explore the SR task using the Quora Duplicate Questions Dataset (https://quoradata.quora.com/First-Quora-DatasetRelease-Question-Pairs) with the Faiss toolkit (Johnson et al., 2021) (https://github.com/facebookresearch/faiss). We then evaluate the model performance using the mean reciprocal rank (MRR) to validate the effectiveness of different approaches. Additionally, we conduct experiments on the semantic textual similarity (STS) task using the SentEval toolkit (Conneau et al., 2017) for evaluation.

For model robustness, we conduct an experiment on the openly available English dataset AskUbuntu (https://github.com/taolei87/askubuntu) (Lei et al., 2016). We also investigate the Tatoeba corpus (Artetxe and Schwenk, 2019) in $11$ language pairs by exploiting the BUCC2018 corpus (Zweigenbaum et al., 2017) in $4$ language pairs, both of which originate from the well-known and representative benchmark XTREME (https://github.com/google-research/xtreme) (Hu et al., 2020). In the STS task, we use Spearman’s rank correlation coefficient to measure the correlation between human labels and the calculated similarity (Gao et al., 2021). We exploit the bilingual dictionaries ConceptNet5.7.0 (Speer et al., 2017) and MUSE (Lample et al., 2018) when generating the code-switched data. However, for the English corpus, we keep the original English sentences and do not apply code-switching. We conduct all the experiments on Zh without Chinese word segmentation (for AliExpress) and without converting the text into simplified script for the BUCC corpora. The hyper-parameters are shown in Table 3.
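The ranking metrics used in this section (MRR and Top-$k$ accuracy) can be sketched as follows. The definitions here are the common ones and are an assumption on our part, since the exact evaluation scripts are not given in this section: the reciprocal rank is taken over the first $k$ retrieved items, and a query counts as correct for Top-$k$ accuracy if any relevant item appears among its first $k$ results.

```python
def reciprocal_rank(ranked, relevant, k=10):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mrr_at_k(all_ranked, all_relevant, k=10):
    """Mean reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel, k) for r, rel in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)

def top_k_accuracy(all_ranked, all_relevant, k=30):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = [
        any(item in rel for item in ranked[:k])
        for ranked, rel in zip(all_ranked, all_relevant)
    ]
    return sum(hits) / len(hits)

# Two toy queries: the first finds its relevant item at rank 2,
# the second never retrieves a relevant item.
ranked_lists = [["a", "b", "c"], ["x", "y", "z"]]
relevant_sets = [{"b"}, {"q"}]
mrr = mrr_at_k(ranked_lists, relevant_sets, k=10)
acc_top3 = top_k_accuracy(ranked_lists, relevant_sets, k=3)
acc_top1 = top_k_accuracy(ranked_lists, relevant_sets, k=1)
```

MRR rewards placing the correct item near the top, while Top-$k$ accuracy only checks whether it appears at all within the cutoff, so the two metrics can rank systems differently.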

Baselines

To further verify the effectiveness of our method, we compare the proposed approach with the following highly related methods:

• mBERT (Devlin et al., 2019) is a Transformer-based multi-lingual bidirectional encoder representation model that is pre-trained with MMLM on monolingual corpora.
• Unicoder (Huang et al., 2019) uses a multi-task learning framework to learn cross-lingual semantic representations from monolingual and parallel corpora, which yields better results on downstream tasks.
• XLM-R (Conneau et al., 2020) is more efficient than XLM and trains MMLM on a huge amount of monolingual data from Common Crawl (Wenzek et al., 2020) covering 100 languages.
• SimCSE (Gao et al., 2021) proposes a self-predictive contrastive learning method that takes an input sentence and predicts itself as the objective.
• InfoXLM (Chi et al., 2021) is an efficient method that learns a cross-lingual model by adding a constraint.
• VECO (Luo et al., 2021) obtains better results on both generation and understanding tasks by introducing a variable encoder-decoder framework.
• CMLM (Yang et al., 2021), conditional MLM, is a fully unsupervised learning method that effectively learns sentence representations from a huge amount of unlabeled data by integrating sentence representation learning into MLM training.
• LaBSE (Feng et al., 2022a) adapts mBERT to generate language-agnostic sentence embeddings for 109 languages and is pre-trained by combining MLM and TLM with a translation ranking task using bi-directional dual encoders.

4.2 Main Results

SR Results on Business Data

Table 4 shows the retrieval results of the proposed method on the AliExpress, DARAZ, and LAZADA corpora, evaluated by the accuracy on the Top-30 retrieved queries. Our experiment consists of two steps. First, we continually pre-train the models on the combined queries and labels of the training set. Then we fine-tune on each language using its own train set and dev set. Unlike the baselines, in our continual pre-training step we use code-switched queries and labels instead of the original data. Among the baselines, VECO achieves the best average accuracy (78.1) on the business corpora. The code-switching method has the most positive effect on Id and Fil (the corpora of these two languages are highly code-switched; see Table 2) and on Bn and Si (DARAZ), but brings fewer benefits for Zh (AliExpress). As we do not leverage any code-switched data for En, our model scores slightly lower than VECO on En (82.6 vs. 83.0). Nevertheless, our approach outperforms all the baselines on every language except En. As depicted in Figure 3, we also evaluate the models with accuracy scores on the Top-10 and Top-20 queries (for details, see Tables 7 and 8 in the Appendix).

Results of Semantic Textual Similarity (STS)

As depicted in Figure 4, we further verify the performance of our proposed approach on STS, a task closely related to SR. In this experiment, all of the test sets include only English sentences. Thus we continually pre-train each baseline on the BUCC2018 corpora by combining all the English monolingual datasets. For a fair comparison, we use the same BUCC data to continually pre-train our model. We evaluate the baselines and our model on the test sets using only the continually pre-trained model instead of the fine-tuned model. The presented model also obtains consistent improvements on all test sets. We provide more details in Table 9 (see Appendix).

Table 5: Comparison with various evaluation metrics on the AskUbuntu corpus. “p@1” and “p@5” denote the precision scores on the Top-1 query and Top-5 queries, respectively. “Acc.” represents the accuracy score on the Top-30 queries.

Model	AskUbuntu
Model	p@1	p@5	Acc.
mBERT	49.0	41.2	54.6
Unicoder	52.2	38.1	45.5
XLMR ${}_{Large}$	55.4	43.4	60.1
SimCSE-BERT ${}_{Large}$	53.2	42.6	56.1
InfoXLM ${}_{Large}$	52.7	43.2	55.8
VECO	53.2	41.8	59.8
CMLM	53.2	41.0	59.6
LaBSE	54.8	42.8	59.3
Ours	57.5	43.8	61.1

Table 6: Comparison with the MRR@10 evaluation metric on the Quora Duplicate Questions dataset. “Acc.” has the same meaning as in Table 5.

Model	MRR	Acc.
mBERT	0.436	0.506
Unicoder	0.227	0.253
XLMR ${}_{Large}$	0.498	0.547
SimCSE-BERT ${}_{Large}$	0.453	0.538
InfoXLM ${}_{Large}$	0.575	0.620
VECO	0.517	0.579
CMLM	0.551	0.605
LaBSE	0.493	0.560
Ours	0.584	0.644

Verification of Robustness

As shown in Table 5, we further explore the performance of our model on the openly available AskUbuntu corpus (Lei et al., 2016) and the Tatoeba benchmark (Artetxe and Schwenk, 2019). In this experiment, we conduct continual pre-training and fine-tuning using only the AskUbuntu corpus, without any other datasets. We evaluate all the baselines and our model using the metrics p@1, p@5, and accuracy on the Top-1, Top-5, and Top-30 queries, respectively. Moreover, we verify the effectiveness of our approach on the Tatoeba benchmark and obtain remarkably better results than the baselines. For more details, see Table 10 in the Appendix.

4.3 Monolingual Semantic Retrieval

As shown in Table 6, we also verify the retrieval ability of our approach on the openly available Quora Duplicate Questions dataset, which includes only English monolingual data. Since this corpus provides only a test set, we use the English part of the BUCC18 train set as our monolingual data. We then continually pre-train each model and evaluate it on the Quora Duplicate Questions dataset without fine-tuning. On this dataset, InfoXLM ${}_{Large}$ obtains better results than the other baselines, but our method outperforms all the baselines, which indicates that our approach has better retrieval ability than similar methods.

4.4 Ablation Study

The Effect of Similarity Loss $\mathcal{L}^{(\mathbf{q},\mathbf{l})}_{sim}$

As illustrated in Figure 5(a), the similarity loss is an essential part of our cross-lingual PTM. We observe that it has a positive effect on the performance of our model: our approach achieves larger improvements over the baselines when the similarity loss is learned during the pre-training stage than when it is omitted.

The Effect of $\lambda$

$\lambda$ , which appears in Equation (10), controls the weight of the XMLM loss. As depicted in Figure 5(b), our model achieves the best retrieval performance when $\lambda=0.2$ . We provide the detailed results for different values of $\lambda$ in Table 11 (see Appendix).

The Effect of Code-switching

As illustrated in Figure 5(c), we also investigate the effectiveness of code-switching for our method. First, the performance drops if we keep the data without code-switching ( $Cmd_{r}=0\%$ ), which demonstrates the effectiveness of code-switching. Since the data of all languages in the LAZADA corpus originally include code-switched text, training without code-switched data may lead to lower performance. Second, when $Cmd_{r}=10\%$ , our model reaches the best average performance (for more details, see Table 12 in the Appendix).

5 Related Work

Semantic Retrieval

Semantic retrieval is an essential task in NLP, which requires the model to calculate sentence embeddings so that similar sentences can be retrieved by these embeddings. Early SR methods are built on traditional word2vec representations (Kiros et al., 2015; Hill et al., 2016). Subsequently, various studies have proposed using siamese networks to perform semantic retrieval (Neculoiu et al., 2016; Kashyap et al., 2016; Bao et al., 2018). With the wide adoption of pre-trained language models, Reimers and Gurevych (2019) propose Sentence-BERT, which learns sentence embeddings by fine-tuning a siamese BERT network on NLI datasets. To conduct multi-lingual SR, Reimers and Gurevych (2020) extend Sentence-BERT to a multi-lingual version by knowledge distillation. To better leverage unlabeled data for SR, Gao et al. (2021) propose SimCSE, which uses contrastive learning to train sentence embeddings. To improve the performance of multi-lingual SR, Chi et al. (2021) propose InfoXLM, which utilizes MLM, TLM, and contrastive learning objectives.

Code-switching

Code-switching is a pre-training technique for improving cross-lingual pre-trained models, and it has been used in PTMs for machine translation (Yang et al., 2020c; Lin et al., 2020; Yang et al., 2020a). For example, Yang et al. (2020c) utilize code-switching on monolingual data by replacing some continuous words with words in the target language and letting the model predict the replaced words. Lin et al. (2020) use code-switching on the source side of multi-lingual parallel corpora to pre-train an encoder-decoder model for multi-lingual machine translation. Feng et al. (2022b) mitigate the limitations of the code-switching method regarding grammatical incoherence and negative effects on token-sensitive tasks. Yang et al. (2020a) propose ALM for cross-lingual pre-training, which requires the model to predict the masked words in the code-switched sentences. Krishnan et al. (2021) augment monolingual source data with multilingual code-switching via random translation to improve the generalizability of large multi-lingual language models. Besides, code-switching has been utilized in other NLP tasks, including named entity recognition (Singh et al., 2018), question answering (Chandu et al., 2018; Gupta et al., 2018), universal dependency parsing (Bhat et al., 2018), morphological tagging (Özateş and Çetinoğlu, 2021), language modeling (Pratapa et al., 2018), automatic speech recognition (Kumar and Bora, 2018), natural language inference (Khanuja et al., 2020) and sentiment analysis (Patwa et al., 2020; Liu et al., 2020; Aparaschivei et al., 2020; Zhang et al., 2021). To the best of our knowledge, we are the first to utilize code-switching for semantic retrieval.

6 Conclusion and Future work

We introduce a straightforward pre-training approach to sentence-level semantic retrieval with code-switched cross-lingual data for the FAQ system in the task-oriented QA dialogue e-commerce scenario. Code-switching is an emerging trend of communication in both bilingual and multi-lingual regions. Our experimental results show that the proposed approach remarkably outperforms closely related baseline systems on the tasks of semantic retrieval and semantic textual similarity with three business corpora and four open corpora using multiple evaluation metrics. In future work, we will extend our method to other natural language understanding tasks. Besides, we will explore other embedding distance metrics instead of using only cosine similarity.

References

Aparaschivei et al. (2020) Lavinia Aparaschivei, Andrei Palihovici, and Daniela Gîfu. 2020. FII-UAIC at SemEval-2020 Task 9: Sentiment analysis for code-mixed social media text using CNN. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 928–933.
Artetxe and Schwenk (2019) Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
Bao et al. (2018) Wei Bao, Wugedele Bao, Jinhua Du, Yuanyuan Yang, and Xiaobing Zhao. 2018. Attentive Siamese LSTM network for semantic textual similarity measure. In 2018 International Conference on Asian Language Processing (IALP), pages 312–317.
Bhat et al. (2018) Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2018. Universal Dependency parsing for Hindi-English code-switching. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 987–998, New Orleans, Louisiana. Association for Computational Linguistics.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.
Chandu et al. (2018) Khyathi Raghavi Chandu, Ekaterina Loginova, Vishal Gupta, Josef van Genabith, Günter Neumann, Manoj Kumar Chinnakotla, Eric Nyberg, and Alan W. Black. 2018. Code-mixed question answering challenge: Crowd-sourcing data and techniques. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching@ACL 2018, pages 29–38.
Chi et al. (2020) Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. In Proceedings of the AAAI conference on artificial intelligence, pages 7570–7577.
Chi et al. (2021) Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588.
Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
Conneau and Lample (2019) Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pages 7057–7067.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
Feng et al. (2022a) Fangxiaoyu Feng, Yinfei Yang, Daniel Matthew Cer, N. Arivazhagan, and Wei Wang. 2022a. Language-agnostic bert sentence embedding. In ACL.
Feng et al. (2022b) Yukun Feng, Feng Li, and Philipp Koehn. 2022b. Toward the limitation of code-switching in cross-lingual transfer. In Conference on Empirical Methods in Natural Language Processing.
Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910.
Gupta et al. (2018) Vishal Gupta, Manoj Kumar Chinnakotla, and Manish Shrivastava. 2018. Transliteration better than translation? answering code-mixed questions over a knowledge base. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching@ACL 2018, pages 39–50.
Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377.
Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. ArXiv, abs/2003.11080.
Huang et al. (2019) Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and M. Zhou. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In EMNLP.
Johnson et al. (2021) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7:535–547.
Kashyap et al. (2016) Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi, and Tim Finin. 2016. Robust semantic text similarity using lsa, machine learning, and linguistic resources. Language Resources and Evaluation, 50:125–161.
Khanuja et al. (2020) Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, and Monojit Choudhury. 2020. A new dataset for natural language inference from code-mixed conversations. In CALCS.
Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 3294–3302.
Krishnan et al. (2021) Jitin Krishnan, Antonios Anastasopoulos, Hemant Purohit, and Huzefa Rangwala. 2021. Multilingual code-switching for zero-shot cross-lingual intent prediction and slot filling. ArXiv, abs/2103.07792.
Kumar and Bora (2018) Ritesh Kumar and Manas Jyoti Bora. 2018. Part-of-speech annotation of english-assamese code-mixed texts: Two approaches. In Proceedings of the First International Workshop on Language Cognition and Computational Models, pages 94–103.
Lample et al. (2018) Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In 6th International Conference on Learning Representations.
Lei et al. (2016) Tao Lei, Hrishikesh Joshi, R. Barzilay, T. Jaakkola, K. Tymoshenko, Alessandro Moschitti, and Lluís Màrquez i Villodre. 2016. Semi-supervised question retrieval with gated convolutions. In NAACL.
Lin et al. (2020) Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2020. Pre-training multilingual neural machine translation by leveraging alignment information. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2649–2663.
Liu et al. (2020) Jiaxiang Liu, Xuyi Chen, Shikun Feng, Shuohuan Wang, Xuan Ouyang, Yu Sun, Zhengjie Huang, and Weiyue Su. 2020. Kk2018 at semeval-2020 task 9: Adversarial training for code-mixing sentiment classification. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 817–823.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. arXiv preprint arXiv: 1907.11692.
Luo et al. (2021) Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. 2021. Veco: Variable and flexible cross-lingual pre-training for language understanding and generation. In ACL.
Neculoiu et al. (2016) Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157.
Ouyang et al. (2021) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 27–38, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Özateş and Çetinoğlu (2021) Şaziye Betül Özateş and Özlem Çetinoğlu. 2021. A language-aware approach to code-switched morphological tagging. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 72–83, Online. Association for Computational Linguistics.
Patwa et al. (2020) Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. Semeval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 774–790.
Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237.
Pratapa et al. (2018) Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, and Kalika Bali. 2018. Language modeling for code-mixing: The role of linguistic theory based synthetic data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1543–1553.
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3980–3990.
Reimers and Gurevych (2020) Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4512–4525.
Siddhant et al. (2020) Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2020. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. In Proceedings of the AAAI conference on artificial intelligence, pages 8854–8861.
Singh et al. (2018) Vinay Singh, Deepanshu Vijay, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. Named entity recognition for hindi-english code-mixed social media text. In Proceedings of the Seventh Named Entities Workshop, pages 27–35.
Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4444–4451.
Tran et al. (2020) Chau Tran, Yuqing Tang, Xian Li, and Jiatao Gu. 2020. Cross-lingual retrieval for iterative self-supervised training. Advances in Neural Information Processing Systems, 33:2207–2219.
Wei et al. (2021) Xiangpeng Wei, Yue Hu, Rongxiang Weng, Luxi Xing, Heng Yu, and Weihua Luo. 2021. On learning universal representations across languages. ICLR.
Wenzek et al. (2020) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. Ccnet: Extracting high quality monolingual datasets from web crawl data. In LREC.
Xiong et al. (2021) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations.
Xu et al. (2022) Wenshen Xu, Mieradilijiang Maimaiti, Yuanhang Zheng, Xin Tang, and Ji Zhang. 2022. Auto-mlm: Improved contrastive learning for self-supervised multi-lingual knowledge retrieval. arXiv preprint arXiv: 2203.16187.
Yang et al. (2020a) Jian Yang, Shuming Ma, Dongdong Zhang, Shuangzhi Wu, Zhoujun Li, and Ming Zhou. 2020a. Alternating language modeling for cross-lingual pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 9386–9393.
Yang et al. (2019) Yinfei Yang, Gustavo Hernández Ábrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 5370–5378. ijcai.org.
Yang et al. (2020b) Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2020b. Multilingual universal sentence encoder for semantic retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94, Online. Association for Computational Linguistics.
Yang et al. (2020c) Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020c. CSP: code-switching pre-training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2624–2636.
Yang et al. (2021) Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2021. Universal sentence representation learning with conditional masked language model. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6216–6228, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Zhang et al. (2021) Wenxuan Zhang, Ruidan He, Haiyun Peng, Lidong Bing, and Wai Lam. 2021. Cross-lingual aspect-based sentiment analysis with aspect term code-switching. In Conference on Empirical Methods in Natural Language Processing.
Zweigenbaum et al. (2017) Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. Overview of the second bucc shared task: Spotting parallel sentences in comparable corpora. In BUCC@ACL.

Table 7: The comparison with accuracy score (Top-10 queries) between cross-lingual sentence retrieval baseline systems on AliExpress, LAZADA, and DARAZ corpora.

Model	AliExpress			LAZADA				DARAZ				Avg.
Model	Ar	En	Zh	Id	Ms	Fil	Th	Ur	Bn	Ne	Si	Avg.
mBERT	56.3	55.2	78.0	32.1	29.8	46.9	48.1	67.6	67.4	35.0	51.6	51.6
Unicoder	43.1	49.8	66.4	25.2	24.7	41.7	41.9	59.3	56.5	27.9	38.8	43.2
XLMR ${}_{Large}$	65.1	61.0	79.1	45.7	36.0	47.1	59.4	68.2	66.5	43.5	44.6	56.0
SimCSE-BERT ${}_{Large}$	48.4	58.8	40.3	27.4	31.5	48.5	58.1	66.0	68.8	36.9	50.2	48.6
InfoXLM ${}_{Large}$	60.2	61.7	76.9	44.7	33.8	51.6	61.3	66.5	64.6	37.3	49.4	55.3
VECO	67.4	61.7	81.7	47.6	35.3	52.8	60.0	70.7	70.6	40.3	52.4	58.2
CMLM	67.6	62.9	80.1	41.2	35.8	49.0	55.9	71.3	72.0	38.4	56.0	57.3
LaBSE	71.1	61.3	81.0	41.5	35.8	52.1	55.5	71.8	74.1	40.7	57.4	58.4
Ours	73.1	64.5	85.1	50.1	38.4	56.6	62.2	75.1	75.8	47.8	59.6	62.6

Table 8: The comparison with accuracy score (Top-20 queries) between cross-lingual sentence retrieval baseline systems on AliExpress, LAZADA, and DARAZ corpora.

Model	AliExpress			LAZADA				DARAZ				Avg.
Model	Ar	En	Zh	Id	Ms	Fil	Th	Ur	Bn	Ne	Si	Avg.
mBERT	71.8	71.8	85.5	46.1	42.2	62.2	63.3	78.2	78.6	48.2	69.2	65.2
Unicoder	55.6	63.0	74.2	36.9	37.3	56.4	54.6	69.0	68.6	40.7	56.6	55.7
XLMR ${}_{Large}$	74.1	73.8	87.2	60.9	49.4	62.3	74.2	80.3	79.5	57.6	60.2	69.0
SimCSE-BERT ${}_{Large}$	63.3	71.9	47.5	40.6	47.2	64.4	73.7	78.4	79.2	50.6	66.6	62.1
InfoXLM ${}_{Large}$	73.4	75.8	85.8	60.9	45.9	66.7	72.6	77.4	75.9	51.1	66.8	68.4
VECO	80.4	76.3	88.3	61.1	47.9	68.1	73.8	82.4	82.3	53.4	69.0	71.2
CMLM	80.3	74.9	88.4	56.4	47.1	64.9	69.0	80.9	82.8	52.5	72.0	69.9
LaBSE	82.2	73.6	88.3	58.4	49.5	67.3	68.8	81.4	81.1	54.9	74.2	70.9
Ours	85.3	77.2	91.4	67.3	52.3	71.4	76.7	85.1	86.3	60.0	76.4	75.4

Table 9: The comparison of sentence embedding performance on STS tasks. “STS12-STS16”, “STS-B” and “SICK-R” denote SemEval2012-2016, STS benchmark and SICK relatedness dataset, respectively.

Model	STS12	STS13	STS14	STS15	STS16	STS-B	SICK-R	Avg.
mBERT	42.46	62.14	52.35	65.36	66.20	60.51	60.87	58.56
Unicoder	41.07	56.92	49.76	60.86	53.65	47.97	54.78	52.14
XLMR ${}_{Large}$	39.79	62.65	52.09	62.26	64.39	59.27	61.07	57.36
SimCSE-BERT ${}_{Large}$	42.35	67.34	57.20	70.36	69.41	59.86	63.77	61.47
InfoXLM ${}_{Large}$	32.23	52.35	39.42	52.04	60.82	54.04	59.61	50.07
VECO	41.76	60.75	52.21	64.79	67.26	58.93	61.17	58.12
CMLM	30.14	61.77	47.45	61.25	62.73	53.23	56.62	53.31
LaBSE	47.43	64.13	55.72	69.66	64.21	57.60	60.68	59.92
Ours	44.53	68.20	55.99	71.39	66.07	66.32	68.03	62.93

Appendix A Case Study

To further demonstrate the advantage of the proposed approach, we visualize the results retrieved on the business corpus by our model and by the previously introduced SOTA pre-trained models for cross-lingual scenarios (i.e., the five most effective baselines).

As illustrated in Figure 6, the baselines fail to retrieve the key information “delivery failed” from the user query “it says delivery failed when i track it”, while our proposed method can retrieve such vital information, which indicates that our model can improve the performance of sentence retrieval in the business domain.

Table 10: The comparison with accuracy score (Top-30 queries) on Tatoeba corpus for each language.

Model	af	de	es	fr	it	ja	kk	nl	pt	sw	te
mBERT (Devlin et al., 2019)	55.4	78.0	74.2	74.7	73.6	73.1	50.1	71.2	76.6	32.3	58.5
Unicoder (Huang et al., 2019)	15.8	15.4	20.2	40.4	19.4	32.6	16.5	21.7	20.7	25.4	18.4
XLMR (Conneau et al., 2020)	67.8	86.6	86.2	87.7	83.2	84.9	46.4	84.2	86.8	36.7	70.9
SimCSE (Gao et al., 2021)	23.0	28.4	25.1	39.1	25.4	21.6	20.4	20.7	21.2	16.7	17.7
InfoXLM (Chi et al., 2021)	84.7	90.6	86.4	93.7	89.5	88.2	77.2	91.3	89.8	68.9	74.3
VECO (Luo et al., 2021)	86.3	96.5	95.7	96.4	95.3	93.5	71.1	94.6	95.4	73.0	81.1
CMLM (Yang et al., 2021)	87.9	94.1	91.5	97.1	94.3	91.3	75.6	94.7	95.6	74.3	83.3
LaBSE (Feng et al., 2022a)	93.9	95.5	94.4	97.5	95.3	92.1	83.3	95.4	96.1	75.4	85.9
Ours	95.6	96.8	97.3	98.1	96.3	94.3	85.2	96.5	96.3	77.2	86.3

Table 11: The effect of $\lambda$ on business corpora with accuracy score (Top-30 queries).

$\lambda$	AliExpress			LAZADA				DARAZ				Avg.
$\lambda$	Ar	En	Zh	Id	Ms	Fil	Th	Ur	Bn	Ne	Si	Avg.
$0.1$	87.1	82.2	91.1	68.3	60.8	77.7	80.2	88.0	90.4	60.2	84.0	79.1
$0.2$	85.4	81.9	91.1	70.9	58.8	77.6	82.2	87.9	88.4	65.2	81.6	79.2
$0.3$	81.1	81.7	90.6	70.7	59.3	78.1	76.8	88.2	89.8	62.0	83.6	78.3
$0.4$	82.9	81.2	91.7	70.9	58.8	74.3	82.3	87.6	88.6	63.0	83.8	78.6
$0.5$	87.2	80.2	90.6	70.8	60.6	72.8	82.2	88.1	89.2	64.9	82.3	79.0

Table 12: The effect of code-switching rate (“$Cmd_{r}$”) on business corpora with accuracy score (Top-30 queries).

$Cmd_{r}$	AliExpress			LAZADA				DARAZ				Avg.
$Cmd_{r}$	Ar	En	Zh	Id	Ms	Fil	Th	Ur	Bn	Ne	Si	Avg.
$0\%$	86.7	82.2	89.9	67.5	58.6	75.4	77.4	82.7	87.9	57.3	76.8	76.6
$10\%$	89.0	82.2	92.4	73.7	61.3	78.4	80.2	84.6	90.8	60.2	79.0	79.3
$20\%$	87.1	82.2	91.4	68.3	60.8	77.4	80.2	88.0	90.4	60.2	84.0	79.1
$30\%$	84.5	82.2	92.4	70.5	59.2	76.6	81.7	87.7	89.3	66.1	74.4	78.6
$40\%$	86.2	82.2	92.3	69.8	61.3	75.8	79.8	82.3	84.2	65.2	78.0	77.9
$50\%$	85.1	82.2	91.9	69.7	59.3	76.4	81.1	85.9	79.9	65.5	78.4	77.8

Appendix B Results on Business Datasets

As shown in Table 7 and Table 8, we further examine the retrieval ability of our approach on the AliExpress, LAZADA, and DARAZ corpora by evaluating accuracy on the Top-10 and Top-20 retrieved queries, respectively. The experiment consists of two steps: first, we continually pre-train the models on the queries and labels of the training set; then we fine-tune on each language using its own train set and dev set. Similar to the results on Top-30 queries (see Table 4), VECO again ranks among the strongest baselines, with LaBSE marginally ahead on the Top-10 average. However, our proposed model achieves better results than all baselines on every language of the three business datasets (average accuracy of 62.6 vs. 58.4 for Top-10 and 75.4 vs. 71.2 for Top-20).
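The Top-k accuracy used in Tables 7 and 8 can be illustrated with a minimal sketch. It assumes that each query comes with a list of candidates already sorted by similarity and with a single gold label; the function name `top_k_accuracy` and the toy labels are hypothetical and are not taken from our implementation.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
    k: int,
) -> float:
    """Return the fraction of queries whose gold label is among the top-k candidates."""
    if len(ranked_candidates) != len(gold_labels):
        raise ValueError("need exactly one gold label per query")
    if k <= 0:
        raise ValueError("k must be positive")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_candidates, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Toy data (hypothetical labels): candidates are sorted from most to least similar.
ranked = [
    ["refund", "delivery_failed", "coupon"],
    ["coupon", "refund", "delivery_failed"],
    ["delivery_failed", "coupon", "refund"],
]
gold = ["delivery_failed", "delivery_failed", "delivery_failed"]

acc_at_1 = top_k_accuracy(ranked, gold, k=1)
acc_at_2 = top_k_accuracy(ranked, gold, k=2)
acc_at_3 = top_k_accuracy(ranked, gold, k=3)
print(acc_at_1, acc_at_2, acc_at_3)
```

By construction the score can only stay the same or grow as k increases, which is consistent with the Top-20 numbers in Table 8 being higher than the Top-10 numbers in Table 7.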

Appendix C Results on the Task of STS

Semantic retrieval (SR) and semantic textual similarity (STS) differ slightly as tasks, but both rely on sentence-level representations. We therefore also evaluate on STS. Table 9 compares the semantic representation ability of the baseline models and our proposed method. In this experiment, all of the test sets (STS12, STS13, STS14, STS15, STS16, STS-B, and SICK-R) include only English sentences. Thus we evaluate the baselines and our model using only the continually pre-trained models instead of the fine-tuned models. We use Spearman’s rank correlation coefficient to measure the quality of all models. Among the baseline systems, SimCSE-BERT ${}_{Large}$ obtains the highest average score, while our method achieves the best average overall (62.93 vs. 61.47) and the best scores on STS13, STS15, STS-B, and SICK-R.
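As a minimal sketch of this evaluation protocol, the snippet below scores each sentence pair by the cosine similarity of its two embeddings and then computes Spearman's rank correlation against the gold similarity scores. The two-dimensional embeddings and gold scores are toy values chosen for illustration, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Toy sentence-pair embeddings and gold similarity scores (illustrative only).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.0]),
]
gold_scores = [5.0, 3.2, 1.0, 0.2]

predicted = [cosine(u, v) for u, v in pairs]
rho = spearman(predicted, gold_scores)
print(rho)
```

Because only the ranking of the predicted similarities matters, the metric is insensitive to the scale of the cosine scores, which makes it suitable for comparing encoders without any fine-tuning.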

Appendix D Results on Publicly Open Corpora

We also conduct experiments on two widely used, publicly available benchmarks for sentence-level semantic retrieval: AskUbuntu and Tatoeba. In contrast with the experiment on the AskUbuntu dataset, in this experiment we continually pre-train all the models on the BUCC2018 corpus. Because the BUCC2018 corpus contains a train set and a dev set, for a fair comparison we also fine-tune our continually pre-trained model on BUCC2018. The Tatoeba dataset covers more than 40 languages (shown with their ISO 639-1 codes for brevity), but we only choose the 11 languages that differ from those in our business corpora: Afrikaans (af), German (de), Spanish (es), French (fr), Italian (it), Japanese (ja), Kazakh (kk), Dutch (nl), Portuguese (pt), Swahili (sw), and Telugu (te). Table 10 reports the accuracy on each of these languages.

Appendix E The Effect of the Hyper-parameter

We further validate different values of $\lambda$ on both business data and publicly available open corpora; Table 11 reports the results on the business corpora. To select the value of $\lambda$, we fix the code-switching rate $Cmd_{r}$ during the continual pre-training stage. The results show that $\lambda=0.2$ yields the best average accuracy (79.2), although the differences across the tested values are small (78.3 to 79.2). Thus, we set $0.2$ as the default value of $\lambda$ in all experiments.

Additionally, as given in Table 12, we investigate different values of the code-switching rate $Cmd_{r}$. Similarly, we fix $\lambda$ during the continual pre-training step to select a proper value of $Cmd_{r}$. The results show that $Cmd_{r}=10\%$ obtains the highest average accuracy (79.3) among the tested values. Therefore, we take $10\%$ as the default value for $Cmd_{r}$. The values of $\lambda$ and $Cmd_{r}$ are identical for the business and open corpora in all experiments.
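To make the role of $Cmd_{r}$ concrete, the sketch below replaces a fraction of the tokens in a query with their dictionary translations. It is only an illustration under stated assumptions: whitespace tokenization, a toy English-to-Indonesian dictionary, and uniform sampling of the positions to replace are all hypothetical choices, and the function `code_switch` is not our actual implementation.

```python
import random
from typing import Dict, List


def code_switch(
    tokens: List[str],
    dictionary: Dict[str, str],
    rate: float,
    rng: random.Random,
) -> List[str]:
    """Replace about `rate` of the tokens with translations from `dictionary`.

    Only tokens that have a dictionary entry can be replaced, so the
    effective rate may be lower than requested for sparse dictionaries.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in dictionary]
    n_replace = min(len(candidates), round(rate * len(tokens)))
    chosen = set(rng.sample(candidates, n_replace))
    return [
        dictionary[tok.lower()] if i in chosen else tok
        for i, tok in enumerate(tokens)
    ]


# Toy bilingual dictionary (illustrative entries only).
toy_dict = {"delivery": "pengiriman", "failed": "gagal", "order": "pesanan"}
tokens = "it says delivery failed when i track it order now".split()

switched = code_switch(tokens, toy_dict, rate=0.1, rng=random.Random(0))
unchanged = code_switch(tokens, toy_dict, rate=0.0, rng=random.Random(0))
saturated = code_switch(tokens, toy_dict, rate=0.5, rng=random.Random(0))
print(" ".join(switched))
```

With ten tokens and a rate of 10%, exactly one token is swapped; a rate of 0% leaves the query untouched, which corresponds to the first row of Table 12.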

$\displaystyle\mathcal{L}^{(\mathbf{q},\mathbf{l})}_{XMLM}=-\sum_{q_{i}\in\mathbf{q}_{m}}\log P(q_{i}\mid\mathbf{q}_{o};\mathbf{\theta})-\sum_{l_{j}\in\mathbf{l}_{m}}\log P(l_{j}\mid\mathbf{l}_{o};\mathbf{\theta})$	(12)
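As a small numeric illustration of Equation (12), the sketch below sums the negative log-probabilities that a model assigns to the correct tokens at the masked positions of the query and of the label. It assumes that $\mathbf{q}_{m}$ and $\mathbf{l}_{m}$ denote the masked tokens and that the probabilities are already conditioned on the observed tokens $\mathbf{q}_{o}$ and $\mathbf{l}_{o}$; the function names and the probability values are hypothetical.

```python
import math
from typing import Sequence


def masked_nll(correct_token_probs: Sequence[float]) -> float:
    """Negative log-likelihood summed over the masked positions of one sequence."""
    return -sum(math.log(p) for p in correct_token_probs)


def xmlm_loss(query_probs: Sequence[float], label_probs: Sequence[float]) -> float:
    """Sum of the masked-token losses of the query and of the label, as in Eq. (12)."""
    return masked_nll(query_probs) + masked_nll(label_probs)


# Toy probabilities of the correct tokens at the masked positions (illustrative only).
query_probs = [0.5, 0.25]  # two masked tokens in the query
label_probs = [0.5]        # one masked token in the label

loss = xmlm_loss(query_probs, label_probs)
print(loss)
```

Here the loss equals $\log 2+\log 4+\log 2=\log 16$, and it approaches zero as the model assigns probability close to one to every masked token.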

Split	AliExpress			LAZADA				DARAZ
Split	Ar	En	Zh	Id	Ms	Fil	Th	Ur	Bn	Ne	Si
Train	$12.8$ K	$16.0$ K	$11.2$ K	$20.1$ K	$18.8$ K	$20.7$ K	$20.7$ K	$6.9$ K	$8.7$ K	$26.1$ K	$4.0$ K
Dev	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$2.0$ K	$0.5$ K
Test	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$1$ K	$2.0$ K	$0.5$ K

Languages	Code-switching Rate (Offline)			Code-switching Rate (Online)
Languages	Mixed	English	Native	Mixed	English	Native
Indonesian (Id)	$76.92$ %	$1.16$ %	$21.92$ %	$85.23$ %	$0.34$ %	$14.43$ %
Malay (Ms)	$27.90$ %	$71.60$ %	$0.50$ %	$38.87$ %	$57.06$ %	$2.38$ %
Filipino (Fil)	$49.31$ %	$50.60$ %	$0.08$ %	$72.09$ %	$26.86$ %	$1.04$ %
Thai (Th)	$4.49$ %	$1.84$ %	$93.67$ %	$10.38$ %	$4.67$ %	$84.95$ %

Parameter	Value
Word Embedding	$1280$
Vocabulary Size	$200$ K
Dropout	$0.1$
Learning Rate	$1\times 10^{-5}$
Margin	$0.1$
Optimizer	Adam
Masking Probability	$0.15$
$\lambda$	0.2
Code-switching Rate	$10\%$