Improving Cross-lingual Representation for Semantic Retrieval with Code-switching

License: arXiv.org perpetual non-exclusive license
arXiv:2403.01364v1 [cs.CL] 03 Mar 2024

Mieradilijiang Maimaiti$^{1}$, Yuanhang Zheng$^{2*}$, Ji Zhang$^{1}$, Fei Huang$^{1}$, Yue Zhang$^{3}$, Wenpei Luo$^{4}$, Kaiyu Huang$^{5}$
$^{1}$Alibaba DAMO Academy
$^{2}$Department of Computer Science and Technology, Tsinghua University, Beijing, China
$^{3}$Department of Computer Science and Technology, Westlake University, Hangzhou, China
$^{4}$Department of Computer Science and Technology, Dalian University of Technology, Dalian
$^{5}$Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, China
{mieradilijiang.mea, zj122146, f.huang}@alibaba-inc.com,
zheng-yh19@mails.tsinghua.edu.cn, zhangyue@westlake.edu.cn,
22109239@mail.dlut.edu.cn, kyhuang@bjtu.edu.cn
Equal contribution. Corresponding author: Ji Zhang
Abstract

Semantic Retrieval (SR) has become an indispensable part of FAQ systems in task-oriented question-answering (QA) dialogue scenarios. Demand for cross-lingual smart customer-service systems, for e-commerce platforms and other business settings, has been increasing recently. Most previous studies directly apply cross-lingual pre-trained models (PTMs) to multi-lingual knowledge retrieval, while others add continual pre-training before fine-tuning PTMs on the downstream tasks. However, whichever scheme is used, previous work neglects to inform PTMs of the features of the downstream task, i.e., the PTMs are trained without any signals related to SR. To this end, we propose an Alternative Cross-lingual PTM for SR via code-switching. We are the first to utilize the code-switching approach for cross-lingual SR. In addition, we introduce a novel code-switched continual pre-training step instead of directly using the PTMs on the SR tasks. Experimental results show that our approach consistently outperforms previous SOTA methods on SR and semantic textual similarity (STS) tasks with three business corpora and four open datasets in 20+ languages.

1 Introduction

In recent years, pre-trained models (PTMs) have demonstrated success on many downstream tasks of natural language processing (NLP). PTMs such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019) have achieved remarkable results by transferring knowledge learned from large unlabeled corpora to various downstream NLP tasks. To learn cross-lingual representations, previous methods like multi-lingual BERT (mBERT) (Devlin et al., 2019) and XLM (Conneau and Lample, 2019) have extended PTMs to multiple languages.

Figure 1: A brief illustration of semantic retrieval over a knowledge base for the FAQ system in the task-oriented dialogue scenario.

Semantic retrieval (SR) (Kiros et al., 2015) has become a ubiquitous method in FAQ systems (i.e., task-oriented question answering (QA) (Xiong et al., 2021)), which are incorporated into smart customer-service platforms for the e-commerce scenario. For the cross-lingual scenario, many pre-training methods have been presented for multi-lingual downstream tasks, such as XLM-R (Conneau et al., 2020), XNLG (Chi et al., 2020), InfoXLM (Chi et al., 2021) and VECO (Luo et al., 2021). The main challenge of SR is how to accurately retrieve the corresponding sentence from the knowledge base (query-label pairs) (Kiros et al., 2015). Commonly used approaches take some variant of the BERT model as a backbone and then directly fine-tune it on the downstream tasks (Devlin et al., 2019; Conneau and Lample, 2019; Huang et al., 2019; Yang et al., 2021; Ouyang et al., 2021).

Specifically, mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), Unicoder (Huang et al., 2019), CMLM (Yang et al., 2021) and ERNIE-M (Ouyang et al., 2021) learn cross-lingual sentence representations mainly using masked language modeling (MLM). As for other objective functions, MMTE (Siddhant et al., 2020) exploits multi-lingual machine translation, and CRISS (Tran et al., 2020) leverages unsupervised parallel data mining. Some models use Siamese network architectures to better adapt to SR. For example, InferSent (Conneau et al., 2017) uses natural language inference (NLI) datasets to train the Siamese network. USE (Yang et al., 2019), M-USE (Yang et al., 2020b) and LaBSE (Feng et al., 2022a) exploit ranking loss. SimCSE (Gao et al., 2021), InfoXLM (Chi et al., 2021) and HICTL (Wei et al., 2021) use contrastive learning. However, these closely related approaches largely neglect to pass features of the downstream task to the PTMs. In other words, most methods (Chi et al., 2021; Luo et al., 2021) directly fine-tune the models on downstream tasks without providing any signals related to SR. In addition, they are mainly pre-trained on combined monolingual data in which few of the sentences are code-switched. Since user queries often contain many code-switched sentences, directly applying these commonly used methods is insufficient for the SR task in the e-commerce scenario.

In this work, as depicted in Figure 1, we aim to enhance the performance of SR for the FAQ system in the e-commerce scenario. We propose a novel pre-training approach for sentence-level SR with code-switched cross-lingual data. Our motivation comes from what previous studies have overlooked. A recent study (Yang et al., 2020a) exploits the code-switching strategy in the machine translation scenario, but no one has tried to leverage code-switching for multi-lingual SR. Furthermore, previous methods (Xu et al., 2022) have exploited multi-lingual PTMs on the SR task by masking only the query and not the label. They aim to fine-tune more efficient PTMs on the SR task rather than to make the PTMs stronger by providing task-related signals. To allow the PTMs to learn signals directly related to downstream tasks, we present an Alternative Cross-Lingual PTM for semantic retrieval using code-switching, which consists of three main steps. First, we generate code-switched data based on bilingual dictionaries. Then, we pre-train a model on the code-switched data using a weighted sum of the alternating language modeling (ALM) loss (Yang et al., 2020a) and the similarity loss. Finally, we fine-tune the model on the SR corpus. By providing additional training signals related to SR during the pre-training process, our approach yields a model that is better adapted to the SR task. Our main contributions are as follows:

  • Experiments show that our approach remarkably outperforms the SOTA methods with various evaluation metrics.

  • Our method improves the robustness of the model for sentence-level SR on both the in-house datasets and open corpora.

  • To the best of our knowledge, we are the first to present an alternative cross-lingual PTM for SR using code-switching in the FAQ system (e-commerce scenario).

2 Preliminaries

2.1 Masked Language Modeling

Masked language modeling (MLM) (Devlin et al., 2019) is a pre-training objective focused on learning representations of natural language sentences. When pre-training a model using MLM objectives, we let the model predict the masked words in the input sentence. Formally, we divide each sentence $\mathbf{x}$ into the masked part $\mathbf{x}_{m}$ and the observed part $\mathbf{x}_{o}$, and we train the model (which is parameterized by $\theta$) to minimize

$$\mathcal{L}_{MLM} = -\log P(\mathbf{x}_{m} \mid \mathbf{x}_{o}; \theta). \tag{1}$$

When calculating Eq.(1), we assume that the model independently predicts each masked word. Formally, we assume that all masked words $x_{i}$ in the masked part $\mathbf{x}_{m}$ are independent conditioned on $\mathbf{x}_{o}$. Thus, Eq.(1) can be rewritten as

$$\mathcal{L}_{MLM} = -\sum_{x_{i} \in \mathbf{x}_{m}} \log P(x_{i} \mid \mathbf{x}_{o}; \theta). \tag{2}$$
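
As a small numerical illustration of Eq. (2), the following Python sketch computes the MLM loss from the probabilities a model assigns to the true tokens at the masked positions. The function name `mlm_loss` and the probability values are hypothetical and chosen only for illustration; no real model is involved.

```python
import math

def mlm_loss(masked_token_probs):
    """Negative log-likelihood of the masked tokens, as in Eq. (2).

    masked_token_probs holds P(x_i | x_o; theta) for the true token at
    each masked position (made-up values, not real model output).
    """
    return -sum(math.log(p) for p in masked_token_probs)

# Two masked tokens whose true words receive probabilities 0.5 and 0.25.
probs = [0.5, 0.25]
loss = mlm_loss(probs)

# Under the independence assumption the joint probability factorizes,
# so Eq. (1) evaluated on the product equals Eq. (2).
joint = math.prod(probs)
assert math.isclose(loss, -math.log(joint))
```

The final assertion checks that the summed per-token form agrees with the joint form of Eq. (1) when the masked tokens are treated as conditionally independent.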
Figure 2: The architecture of our proposed model, the Alternative Cross-Lingual PTM for SR. The code-switched tokens for the query and label are “status” ⇒ “zhuàngtài”, “failed” ⇒ “shībài” and “delivery” ⇒ “yùnsòng”, respectively. The “[CLS]” symbol stands for the sentence representation of the query and label. The structures used for the query and the label are the same.

2.2 Cross-lingual LM Pre-training

To improve the performance of various models on NLP tasks in different languages, cross-lingual PTMs (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020) have been proposed. Generally, cross-lingual PTMs are trained on multi-lingual corpora using the MLM objective. During the pre-training process, the corpora of low-resource languages are usually oversampled to improve the model’s performance on low-resource languages. To better align the representations of sentences in different languages, cross-lingual PTMs may use another objective called translation language modeling (TLM), which requires the model to predict the masked words in both the source and the target sentences of a parallel sentence pair. Formally, given a parallel sentence pair $\langle \mathbf{x}, \mathbf{y} \rangle$, we randomly divide the source sentence $\mathbf{x}$ into the masked part $\mathbf{x}_{m}$ and the observed part $\mathbf{x}_{o}$, and also divide the target sentence $\mathbf{y}$ into the masked part $\mathbf{y}_{m}$ and the observed part $\mathbf{y}_{o}$. Then we minimize

$$\mathcal{L}_{TLM} = -\log P(\mathbf{x}_{m}, \mathbf{y}_{m} \mid \mathbf{x}_{o}, \mathbf{y}_{o}; \theta). \tag{3}$$
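
To make the split into masked and observed parts concrete, here is a minimal Python sketch of how a TLM training example could be assembled from a parallel pair. It is a simplification under stated assumptions: the `[MASK]` and `[SEP]` symbols, the helper names, and the plain "replace with `[MASK]`" rule are illustrative choices, not details given in the text.

```python
import random

MASK = "[MASK]"  # assumed mask symbol
SEP = "[SEP]"    # assumed separator between the two sentences

def mask_tokens(tokens, mask_prob, rng):
    """Return the observed sequence (with MASK placeholders) and a dict
    {position: original token} that holds the masked part."""
    observed, masked = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            observed.append(MASK)
            masked[i] = tok
        else:
            observed.append(tok)
    return observed, masked

def build_tlm_example(src, tgt, mask_prob=0.15, seed=0):
    """Mask both sides of a parallel pair and join the observed parts."""
    rng = random.Random(seed)
    src_obs, src_masked = mask_tokens(src, mask_prob, rng)
    tgt_obs, tgt_masked = mask_tokens(tgt, mask_prob, rng)
    # The model conditions on x_o and y_o together, so a masked source
    # word can be recovered from the target sentence and vice versa.
    model_input = src_obs + [SEP] + tgt_obs
    return model_input, src_masked, tgt_masked
```

Because the two observed parts are fed jointly, the prediction targets `src_masked` and `tgt_masked` correspond to $\mathbf{x}_{m}$ and $\mathbf{y}_{m}$ in Eq. (3).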

2.3 Semantic Retrieval

Semantic retrieval (SR) aims to retrieve sentences similar to the query sentence in a knowledge base (Kiros et al., 2015). Specifically, the semantic retrieval model converts sentences into vectors, and similar sentences are retrieved based on the cosine similarity.

Formally, given a sentence $\mathbf{x}$, the model encodes $\mathbf{x}$ into a vector $\mathbf{v}_{\mathbf{x}}$. When we need to retrieve sentences similar to the query $\mathbf{q}$, we calculate the cosine similarity between $\mathbf{v}_{\mathbf{q}}$ and $\mathbf{v}_{\mathbf{x}}$ for each sentence $\mathbf{x}$ in the knowledge base $\mathcal{K}$:

$$sim(\mathbf{q}, \mathbf{x}) = \frac{\mathbf{v}_{\mathbf{q}} \cdot \mathbf{v}_{\mathbf{x}}}{||\mathbf{v}_{\mathbf{q}}|| \times ||\mathbf{v}_{\mathbf{x}}||}. \tag{4}$$

Finally, we retrieve the sentence $\mathbf{x}^{*}$ which is most similar to the query $\mathbf{q}$ in $\mathcal{K}$:

$$\mathbf{x}^{*} = \mathop{\rm argmax}\limits_{\mathbf{x} \in \mathcal{K}} sim(\mathbf{q}, \mathbf{x}). \tag{5}$$
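
The retrieval step of Eqs. (4) and (5) can be sketched in a few lines of Python. The sentence vectors below are made-up three-dimensional values standing in for encoder outputs, and the names `cosine_sim` and `retrieve` are hypothetical; a real system would obtain the vectors from the trained model.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two vectors, as in Eq. (4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, knowledge_base):
    """Return the sentence whose vector is most similar to the query (Eq. 5)."""
    return max(knowledge_base,
               key=lambda sent: cosine_sim(query_vec, knowledge_base[sent]))

# Toy knowledge base: sentence -> pre-computed vector (illustrative values).
kb = {
    "where is my order": [1.0, 0.0, 1.0],
    "how to get a refund": [0.0, 1.0, 0.0],
}
query_vec = [0.9, 0.1, 0.8]
best = retrieve(query_vec, kb)  # "where is my order"
```

Sorting the knowledge base by the same score instead of taking the maximum would give a ranked top-$k$ list.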

2.4 Code-switching

To reduce the representation gap between words of different languages in cross-lingual PTMs, Yang et al. (2020a) proposed ALM, which is based on code-switching. Specifically, given a source sentence, we construct a code-switched sentence by randomly replacing some source words with the corresponding target words. For example, suppose the English source sentence is “I like music” and we replace some words in the sentence with Chinese. If the replaced English words are “I” and “music” and their corresponding Chinese words are “wǒ” and “yīnyuè”, respectively, then the code-switched sentence is “wǒ like yīnyuè”.

Formally, suppose that we conduct code-switching on a source sentence $\mathbf{x} = \{x_{1}, x_{2}, \dots, x_{n}\}$. First, we randomly choose a subset $S$ from $\{1, 2, \dots, n\}$. Then, for each element $i \in S$, we replace $x_{i}$ with its corresponding target word $y_{i}$ to construct the code-switched sentence $\mathbf{z} = \{z_{1}, z_{2}, \dots, z_{n}\}$, where

$$z_{i} = \begin{cases} x_{i} & i \notin S, \\ y_{i} & i \in S. \end{cases} \tag{6}$$
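
Eq. (6) translates directly into code. The sketch below assumes that a word-by-word target sequence `y` aligned with `x` is already available, and it takes the subset `S` as an explicit argument rather than sampling it; the function name and the pinyin translations are illustrative assumptions.

```python
def code_switch(x, y, S):
    """Build z from Eq. (6): z_i = y_i if i is in S, otherwise x_i.

    Positions are 1-based to match the notation in the text.
    """
    return [y[i - 1] if i in S else x[i - 1] for i in range(1, len(x) + 1)]

x = ["I", "like", "music"]
y = ["wǒ", "xǐhuan", "yīnyuè"]  # illustrative word-level translations
z = code_switch(x, y, S={1, 3})
# z == ["wǒ", "like", "yīnyuè"]
```

Choosing `S` as the empty set returns the source sentence unchanged, and choosing all positions returns the full word-by-word translation.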

3 Method

3.1 Alternative Cross-lingual PTM

The main architecture of our model is shown in Figure 2. Existing PTMs are trained without signals directly related to the downstream tasks (e.g., SR). To address this limitation, we jointly train the PTM on the code-switched data using the cross-lingual masked language model (XMLM) loss and a similarity loss. The similarity loss term in the pre-training objective adjusts the similarity between the input ($query$) and the output ($label$), so that this similarity is controlled during the continual pre-training step. For the sentence-level SR task, the given knowledge is composed of a certain number of $\langle query, label \rangle$ pairs (see Figure 2). We denote a $query$ by $\mathbf{q}$ and its corresponding $label$ by $\mathbf{l}$.
Given a query $\mathbf{q} = \{q_{1}, \dots, q_{i}, \dots, q_{I}\}$ and a label $\mathbf{l} = \{l_{1}, \dots, l_{j}, \dots, l_{J}\}$, standard retrieval models usually formulate sentence-level SR as computing the similarity between the $query$ and the $label$ in the semantic space:

$$\mathbf{v}_{\mathbf{q}} = encode(q_{1}, \dots, q_{i}, \dots, q_{I}), \tag{7}$$
$$\mathbf{v}_{\mathbf{l}} = encode(l_{1}, \dots, l_{j}, \dots, l_{J}). \tag{8}$$
Algorithm 1 Cross-lingual SR with Code-switching
Input: user queries $Q_{user} = \{\mathbf{q}_{user}^{(u)}\}_{u=1}^{U}$; monolingual knowledge (query-label pairs) $\mathcal{K}_{mono} = \{\langle \mathbf{q}^{(m)}, \mathbf{l}^{(m)} \rangle\}_{m=1}^{M}$; bilingual dictionary $D_{bi} = \{\langle L_{1}^{(n)}, L_{en}^{(n)} \rangle\}_{n=1}^{N}$
Output: retrieved top-$k$ similar questions $Q'$
1: Obtain the code-switched knowledge $\mathcal{K}_{cmd}$ by applying $D_{bi}$ to $\mathcal{K}_{mono}$
2: for each knowledge pair $\langle \mathbf{q}, \mathbf{l} \rangle$ in $\mathcal{K}_{cmd}$ do    ▷ training
3:     Compute the similarity $sim(\mathbf{q}, \mathbf{l})$ (Eq. 9)
4:     Obtain the code-switched loss $\mathcal{L}^{(\mathbf{q},\mathbf{l})}_{XMLM}$
5:     Jointly optimize the total loss (Eq. 10)
6: end for
7: return the top-$k$ questions $Q'$ retrieved from $\mathcal{K}_{cmd}$ according to $Q_{user}$

We retrieve the query $\mathbf{q}$ similar to the label $\mathbf{l}$ by calculating the cosine similarity between $\mathbf{v}_{\mathbf{q}}$ and $\mathbf{v}_{\mathbf{l}}$ for each $\langle query, label \rangle$ pair in the knowledge base $\mathcal{K}$:

$$sim(\mathbf{q}, \mathbf{l}) = \frac{\mathbf{v}_{\mathbf{q}} \cdot \mathbf{v}_{\mathbf{l}}}{||\mathbf{v}_{\mathbf{q}}|| \times ||\mathbf{v}_{\mathbf{l}}||}. \tag{9}$$

Then, we rank the retrieved sentences according to their similarity scores to recall the top-$k$ similar questions that are semantically close to the original input query. The total objective function of our proposed model consists of two parts, the XMLM loss and the similarity loss, and is formulated as follows:

$$\mathcal{L}_{total} = \lambda * \mathcal{L}^{(\mathbf{q},\mathbf{l})}_{XMLM} + \mathcal{L}^{(\mathbf{q},\mathbf{l})}_{sim}, \tag{10}$$

where $\lambda > 0$ controls the weight of the XMLM.

Intuitively, since the XMLM is highly similar to the monolingual MLM, the masked token prediction task can be extended to the cross-lingual settings. Generally, the monolingual MLM loss is as follows:

$$\mathcal{L}^{(\mathbf{q})}_{MLM} = -\sum_{q_{i} \in \mathbf{q}_{m}} \log P(q_{i} \mid \mathbf{q}_{o}; \theta), \tag{11}$$

where $\mathbf{q}_{m}$ and $\mathbf{q}_{o}$ are the masked part and the observed part of the input $query$, respectively. The input $label$ is masked in the same way, i.e., we apply to the $label$ the same masking strategy used for the $query$.

Concretely, as shown in Algorithm 1, we combine the $\langle query, label \rangle$ pairs with the code-switched format and regard the result as the input of MLM. The XMLM loss is as follows:

$$\begin{aligned} \mathcal{L}^{(\mathbf{q},\mathbf{l})}_{XMLM} = & -\sum_{q_{i} \in \mathbf{q}_{m}} \log P(q_{i} \mid \mathbf{q}_{o}; \theta) \\ & -\sum_{l_{j} \in \mathbf{l}_{m}} \log P(l_{j} \mid \mathbf{l}_{o}; \theta), \end{aligned} \tag{12}$$

where $\mathbf{l}_{m}$ and $\mathbf{l}_{o}$ are the masked part and the observed part of the $label$, respectively. In addition, we provide additional training signals related to the downstream task during the continual pre-training process. Specifically, we expect the vectorized representation of the query $\mathbf{q}$ to be close to its corresponding label $\mathbf{l}$, but far from any incorrect label $\mathbf{l}' \neq \mathbf{l}$. To achieve this, we define the similarity loss as:

$$\mathcal{L}^{(\mathbf{q},\mathbf{l})}_{sim} = -\log \frac{\exp sim(\mathbf{q}, \mathbf{l})}{\exp sim(\mathbf{q}, \mathbf{l}) + \sum\limits_{\mathbf{l}' \in \mathcal{B}} \exp sim(\mathbf{q}, \mathbf{l}')}, \tag{13}$$

where $\mathcal{B}$ denotes the set of all labels in a training batch other than $\mathbf{l}$.
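
A minimal pure-Python sketch of Eq. (13) and of the weighted sum in Eq. (10) follows. It assumes the query and label vectors of a batch are already computed and aligned by index, so that the other labels in the batch act as the negatives in $\mathcal{B}$. The function names and the toy vectors are hypothetical; the default weight 0.2 is the value of $\lambda$ listed in Table 3.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two vectors, as in Eq. (9)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def similarity_loss(q_vecs, l_vecs, idx):
    """Eq. (13) for the idx-th pair; all other labels in the batch are negatives."""
    sims = [cosine_sim(q_vecs[idx], l) for l in l_vecs]
    # The denominator sums over the correct label and every label in B,
    # i.e. over all labels of the batch.
    log_denominator = math.log(sum(math.exp(s) for s in sims))
    return -(sims[idx] - log_denominator)

def total_loss(xmlm_loss, sim_loss, lam=0.2):
    """Eq. (10): lambda * L_XMLM + L_sim."""
    return lam * xmlm_loss + sim_loss

# Toy batch of two pairs whose queries match their own labels exactly.
queries = [[1.0, 0.0], [0.0, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]
aligned = similarity_loss(queries, labels, 0)
swapped = similarity_loss(queries, labels[::-1], 0)
assert aligned < swapped  # correct pairing gives the smaller loss
```

In the toy batch, swapping the labels makes the query closest to a wrong label, so the loss increases, which is the behavior the similarity term is meant to penalize.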

Table 1: Characteristics of our business corpus. “Train/Dev/Test” are the original data without code-switching.

| Corpus | AliExpress | AliExpress | AliExpress | LAZADA | LAZADA | LAZADA | LAZADA | DARAZ | DARAZ | DARAZ | DARAZ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Language | Ar | En | Zh | Id | Ms | Fil | Th | Ur | Bn | Ne | Si |
| Train | 12.8K | 16.0K | 11.2K | 20.1K | 18.8K | 20.7K | 20.7K | 6.9K | 8.7K | 26.1K | 4.0K |
| Dev | 1K | 1K | 1K | 1K | 1K | 1K | 1K | 1K | 1K | 2.0K | 0.5K |
| Test | 1K | 1K | 1K | 1K | 1K | 1K | 1K | 1K | 1K | 2.0K | 0.5K |

Table 2: The code-switching rate of queries for LAZADA. “Mixed” stands for code-switched queries.

| Language | Offline: Mixed | Offline: English | Offline: Native | Online: Mixed | Online: English | Online: Native |
|---|---|---|---|---|---|---|
| Indonesian (Id) | 76.92% | 1.16% | 21.92% | 85.23% | 0.34% | 14.43% |
| Malay (Ms) | 27.90% | 71.60% | 0.50% | 38.87% | 57.06% | 2.38% |
| Filipino (Fil) | 49.31% | 50.60% | 0.08% | 72.09% | 26.86% | 1.04% |
| Thai (Th) | 4.49% | 1.84% | 93.67% | 10.38% | 4.67% | 84.95% |

Table 3: Hyper-parameter settings.

| Parameter | Value |
|---|---|
| Word Embedding | 1280 |
| Vocabulary Size | 200K |
| Dropout | 0.1 |
| Learning Rate | 1e-5 |
| Margin | 0.1 |
| Optimizer | Adam |
| Masking Probability | 0.15 |
| $\lambda$ | 0.2 |
| Code-switching Rate | 10% |

3.2 Building Code-switched Data for SR

We aim to improve the SR model in the business scenario. Since the LAZADA corpus includes many code-switched sentences, we build the code-switched data using authentic business corpora and language features. During the construction of the code-switched data, we replace a given percentage of the tokens in the $query$ and $label$ with the corresponding words in other languages, based on openly available multi-lingual word-level dictionaries.
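
The construction described above can be sketched as a dictionary lookup applied at a fixed rate. This is a minimal sketch under assumptions: each token that has a dictionary entry is switched independently with probability `rate`, the lexicon is a toy English-to-Chinese (pinyin) mapping built from the example words in Figure 2, and the function name is hypothetical. The text does not specify the exact sampling procedure.

```python
import random

def code_switch_with_dictionary(tokens, dictionary, rate=0.1, seed=0):
    """Replace tokens found in the dictionary with probability `rate`."""
    rng = random.Random(seed)
    switched = []
    for tok in tokens:
        translation = dictionary.get(tok.lower())
        # Only tokens with a dictionary entry are candidates for switching.
        if translation is not None and rng.random() < rate:
            switched.append(translation)
        else:
            switched.append(tok)
    return switched

# Toy lexicon (illustrative entries only).
lexicon = {"status": "zhuàngtài", "failed": "shībài", "delivery": "yùnsòng"}
query = "my delivery status".split()
label = "delivery failed".split()

# rate=1.0 switches every token that has an entry, which makes the
# effect easy to see; Table 3 lists a code-switching rate of 10%.
cs_query = code_switch_with_dictionary(query, lexicon, rate=1.0)
cs_label = code_switch_with_dictionary(label, lexicon, rate=1.0)
# cs_query == ["my", "yùnsòng", "zhuàngtài"]
# cs_label == ["yùnsòng", "shībài"]
```

Both the query and the label go through the same function, so the resulting pair can be used directly as code-switched pre-training data.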

Table 4: The comparison with accuracy score (Top-30 queries) between baseline systems on business corpora.
Model AliExpress LAZADA DARAZ Avg.
Ar En Zh Id Ms Fil Th Ur Bn Ne Si
mBERT 79.6 78.0 89.4 55.3 53.6 70.4 71.1 83.5 82.3 56.7 75.8 72.3
Unicoder 64.3 69.2 79.9 46.0 48.8 64.4 62.1 74.6 75.3 48.7 65.8 63.6
XLMR$_{Large}$ 81.1 81.0 90.1 68.3 59.9 71.2 82.1 85.6 84.5 65.2 70.2 76.3
SimCSE-BERT$_{Large}$ 72.2 79.0 52.0 49.3 56.5 72.2 78.5 82.8 83.1 57.6 76.0 69.0
InfoXLM$_{Large}$ 79.7 82.5 89.6 69.0 58.1 75.4 80.4 82.7 80.7 60.2 76.8 75.9
VECO 85.9 83.0 91.4 68.7 58.7 75.3 82.3 87.3 87.6 61.4 81.4 78.1
CMLM 84.3 80.4 91.2 65.3 57.8 74.3 76.7 85.4 87.5 61.6 81.4 76.9
LaBSE 85.8 81.1 91.4 65.8 59.8 75.7 76.7 85.7 85.6 62.9 81.6 77.5
Ours 89.0 82.6 93.9 73.7 62.8 78.7 83.5 88.5 90.9 67.2 84.0 81.3

The code-switched cross-lingual corpus consists of two parts: the query and the label. The original query and label are formulated as follows:

\begin{align}
\mathbf{q} &= \{q_1, \dots, q_i, \dots, q_I\}, \tag{14} \\
\mathbf{l} &= \{l_1, \dots, l_j, \dots, l_J\}, \tag{15}
\end{align}

where $I$ and $J$ denote the lengths of the query and label, respectively. We replace tokens in the query and label with words from the languages frequently used on Alibaba's overseas cross-border e-commerce platform. The newly constructed data are as follows:

\begin{align}
\mathbf{q}' &= \{q_1', \dots, q_i', \dots, q_I'\}, \tag{16} \\
\mathbf{l}' &= \{l_1', \dots, l_j', \dots, l_J'\}, \tag{17}
\end{align}

where $q_i'$ and $l_j'$ denote the tokens after the replacement (for all integers $i \in [1, I]$ and $j \in [1, J]$). As the final step, we combine the newly generated $\mathbf{q}'$ and $\mathbf{l}'$ to build the linguistically motivated code-switched monolingual corpus (i.e., $\langle \mathbf{q}', \mathbf{l}' \rangle$). We then continually pre-train our model on this corpus, following an idea similar to ALM.
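
The dictionary-based replacement described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `code_switch` and `build_pair`, the whitespace tokenization, the independent per-token sampling, and the toy English-to-Indonesian lexicon are all hypothetical. In practice the lexicon would be loaded from bilingual dictionaries, and languages written without spaces (such as Thai or Chinese) would need a different tokenizer.

```python
import random


def code_switch(tokens, lexicon, rate=0.1, rng=None):
    """Replace each token that has a lexicon entry with probability `rate`.

    `lexicon` maps a lower-cased source word to a list of candidate
    translations (hypothetical format, assumed for this sketch).
    """
    rng = rng or random.Random()
    switched = []
    for tok in tokens:
        candidates = lexicon.get(tok.lower())
        if candidates and rng.random() < rate:
            # Pick one translation at random among the dictionary candidates.
            switched.append(rng.choice(candidates))
        else:
            # No dictionary entry, or the token was not sampled: keep it.
            switched.append(tok)
    return switched


def build_pair(query, label, lexicon, rate=0.1, seed=0):
    """Build one code-switched (q', l') pair from a raw query and label."""
    rng = random.Random(seed)
    q_prime = code_switch(query.split(), lexicon, rate, rng)
    l_prime = code_switch(label.split(), lexicon, rate, rng)
    return q_prime, l_prime


if __name__ == "__main__":
    # Toy lexicon for illustration only.
    toy_lexicon = {"price": ["harga"], "shipping": ["pengiriman"]}
    print(build_pair("what is the shipping price", "shipping price", toy_lexicon, rate=0.5))
```

Because replacement is token-by-token, the lengths $I$ and $J$ are unchanged, which matches Equations (16) and (17). Setting `rate=0.0` returns the original pair.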

4 Experiments

4.1 Setup

Data preparation

The languages selected from the business dataset are Arabic (Ar), English (En), Chinese (Zh), Indonesian (Id), Malay (Ms), Filipino (Fil), Thai (Th), Urdu (Ur), Bengali (Bn), Nepali (Ne), and Sinhala (Si). Specifically, Ar, En, and Zh originate from the AliExpress corpora, Id, Ms, Fil, and Th from the LAZADA corpora, and Ur, Bn, Ne, and Si from the DARAZ corpora. The characteristics of our business corpora are shown in Table 1. Among them, the LAZADA corpus is a code-switched dataset, and we report its code-switching rates on offline and online data separately (see Table 2). We also explore the SR task on the Quora Duplicate Questions Dataset (https://quoradata.quora.com/First-Quora-DatasetRelease-Question-Pairs) with the Faiss toolkit (Johnson et al., 2021; https://github.com/facebookresearch/faiss). We then evaluate model performance with the mean reciprocal rank (MRR) to validate the effectiveness of different approaches. Additionally, we conduct experiments on the semantic textual similarity (STS) task, using the SentEval toolkit (Conneau et al., 2017) for evaluation.
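
As a reminder of how MRR is computed, the following sketch scores a batch of ranked retrieval lists. It is an illustrative helper, not the evaluation script used in the paper: the name `mrr_at_k`, the input format (one ranked list of candidate ids and one set of relevant ids per query), and the cutoff default of 10 are assumptions.

```python
def mrr_at_k(rankings, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant candidate within the top k.

    rankings:      list of ranked candidate-id lists, best candidate first.
    relevant_sets: list of sets of relevant candidate ids, one per query.
    A query with no relevant candidate in its top k contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, candidate in enumerate(ranking[:k], start=1):
            if candidate in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


if __name__ == "__main__":
    rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n"]]
    relevant = [{"b"}, {"x"}, {"q"}]
    # Reciprocal ranks are 1/2, 1 and 0, so the mean is 0.5.
    print(mrr_at_k(rankings, relevant, k=10))
```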

For model robustness, we conduct an experiment on the openly available English dataset AskUbuntu (Lei et al., 2016; https://github.com/taolei87/askubuntu). We also investigate the Tatoeba corpus (Artetxe and Schwenk, 2019) in 11 language pairs and exploit the BUCC2018 corpus (Zweigenbaum et al., 2017) in 4 language pairs, both of which originate from the well-known and representative benchmark XTREME (Hu et al., 2020; https://github.com/google-research/xtreme). In the STS task, we use Spearman's rank correlation coefficient to measure the correlation between human labels and the calculated similarity (Gao et al., 2021). We exploit the bilingual dictionaries ConceptNet 5.7.0 (Speer et al., 2017) and MUSE (Lample et al., 2018) when generating the code-switched data. For the English corpus, however, we keep the original English sentences and do not apply code-switching. We conduct all the experiments on Zh without Chinese word segmentation (for AliExpress) and without converting the text into simplified script (for the BUCC corpora). The hyper-parameters are shown in Table 3.
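
For readers unfamiliar with the STS metric, Spearman's rank correlation is the Pearson correlation of the rank-transformed scores. The sketch below is a plain-Python illustration with hypothetical helper names (`average_ranks`, `spearman`); the paper itself relies on SentEval for this computation.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation between predicted similarities and gold labels."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    predicted = [0.10, 0.40, 0.90]  # e.g. cosine similarities
    gold = [1.0, 3.0, 5.0]          # e.g. human similarity labels
    print(spearman(predicted, gold))
```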

Baselines

To further verify the effectiveness of our method, we compare the proposed approach with the following highly related methods:

  • mBERT (Devlin et al., 2019) is a Transformer-based multi-lingual bidirectional encoder representation, pre-trained with MMLM on monolingual corpora.

  • Unicoder (Huang et al., 2019) takes advantage of a multi-task learning framework to learn cross-lingual semantic representations from monolingual and parallel corpora, gaining better results on downstream tasks.

  • XLM-R (Conneau et al., 2020) is more efficient than XLM and trains MMLM on a huge amount of monolingual data from Common Crawl (Wenzek et al., 2020), covering 100 languages.

  • SimCSE (Gao et al., 2021) proposes a self-predictive contrastive learning method that takes an input sentence and predicts itself as the objective.

  • InfoXLM (Chi et al., 2021) is an efficient method for training cross-lingual models by adding a constraint.

  • VECO (Luo et al., 2021) obtains better results on both generation and understanding tasks by introducing a variable encoder-decoder framework.

  • CMLM (Yang et al., 2021), or conditional MLM, is a fully unsupervised method that effectively learns sentence representations from a huge amount of unlabeled data by integrating sentence representation learning into MLM training.

  • LaBSE (Feng et al., 2022a) adapts mBERT to generate language-agnostic sentence embeddings for 109 languages and is pre-trained by combining MLM and TLM with a translation ranking task, using bi-directional dual encoders.

Figure 3: The comparison with average accuracy score (Top-10/20 queries) on the business dataset.
Figure 4: The comparison of sentence embedding performance with average Spearman’s rank on STS tasks.

4.2 Main Results

SR Results on Business Data

Table 4 shows the retrieval results of the proposed method on the AliExpress, DARAZ, and LAZADA corpora, evaluated by accuracy on the Top-30 retrieved queries. Our experiment consists of two steps. First, we continually pre-train the models on the combined queries and labels of the training set. Then we fine-tune on each language using its own train set and dev set. Unlike the baselines, in the continual pre-training step we use code-switched queries and labels rather than the original data. Among the baselines, VECO achieves the best average accuracy (78.1) on the business corpora. The code-switching method has the most positive effects on Id and Fil (the corpora of these two languages are highly code-switched; see Table 2) and on Bn and Si (DARAZ), but brings fewer benefits for Zh (AliExpress). As we do not leverage any code-switched data for En, our En score is slightly lower than that of VECO (82.6 vs. 83.0). On every other language, our approach outperforms all the baselines. As depicted in Figure 3, we also evaluate the models with accuracy scores on the Top-10 and Top-20 queries (for details, see Tables 7 and 8 in the Appendix).
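
The Top-k accuracy used here can be illustrated with a small retrieval sketch. It assumes, hypothetically, that a query counts as correct when its gold label is among the k candidates with the highest cosine similarity to the query embedding; the function names, the dictionary of candidate embeddings, and the toy vectors are illustrative and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_ids(query_vec, candidates, k):
    """Return the ids of the k candidates most similar to the query."""
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [cand_id for cand_id, _ in scored[:k]]


def accuracy_at_k(query_vecs, gold_ids, candidates, k=30):
    """Fraction of queries whose gold id appears in the top-k retrieved ids."""
    hits = sum(
        1
        for vec, gold in zip(query_vecs, gold_ids)
        if gold in top_k_ids(vec, candidates, k)
    )
    return hits / len(query_vecs)


if __name__ == "__main__":
    # Toy 2-d embeddings for three candidate labels.
    candidates = {"faq_a": [1.0, 0.0], "faq_b": [0.0, 1.0], "faq_c": [1.0, 1.0]}
    queries = [[0.9, 0.1], [0.1, 0.9]]
    gold = ["faq_a", "faq_c"]
    print(accuracy_at_k(queries, gold, candidates, k=1))
    print(accuracy_at_k(queries, gold, candidates, k=2))
```

An exhaustive sort is fine for a toy example; at the scale of real FAQ collections an index such as Faiss would replace `top_k_ids`.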

Results of Semantic Textual Similarity (STS)

As depicted in Figure 4, we further verify the performance of our approach on STS, a task closely related to SR. In this experiment, all of the test sets include only English sentences. Thus we continually pre-train each baseline on the BUCC2018 corpora by combining all the English monolingual datasets. For a fair comparison, we use the same BUCC data to continually pre-train our model. We evaluate the baselines and our model on the test sets using only the continually pre-trained models, without fine-tuning. Our model obtains consistent improvements on all test sets. We provide more details in Table 9 (see Appendix).

Table 5: Comparison using various evaluation metrics on the AskUbuntu corpus. “p@1” and “p@5” denote the precision score on the Top-1 query and the Top-5 queries, respectively. “Acc.” represents the accuracy score on the Top-30 queries.
Model AskUbuntu
p@1 p@5 Acc.
mBERT 49.0 41.2 54.6
Unicoder 52.2 38.1 45.5
XLMR$_{Large}$ 55.4 43.4 60.1
SimCSE-BERT$_{Large}$ 53.2 42.6 56.1
InfoXLM$_{Large}$ 52.7 43.2 55.8
VECO 53.2 41.8 59.8
CMLM 53.2 41.0 59.6
LaBSE 54.8 42.8 59.3
Ours 57.5 43.8 61.1
Table 6: Comparison using the MRR@10 evaluation metric on the Quora Duplicate Questions dataset. “Acc.” has the same meaning as in Table 5.
Model MRR Acc.
mBERT 0.436 0.506
Unicoder 0.227 0.253
XLMR$_{Large}$ 0.498 0.547
SimCSE-BERT$_{Large}$ 0.453 0.538
InfoXLM$_{Large}$ 0.575 0.620
VECO 0.517 0.579
CMLM 0.551 0.605
LaBSE 0.493 0.560
Ours 0.584 0.644
[Figure 5 panels: (a) with and without similarity loss; (b) value of $\lambda$; (c) value of $Cmd_r$]
Figure 5: The effect of the similarity loss ($\mathcal{L}^{(\mathbf{q},\mathbf{l})}_{sim}$) and of different values of the hyper-parameters $\lambda$ and code-mixing rate ($Cmd_r$) in our model on business corpora with accuracy score (Top-30 queries). (a) “w/” and “w/o” denote the accuracy score with and without the similarity loss. (b) and (c) show the average accuracy score with different values of $\lambda$ (default value is 0.2) and $Cmd_r$ (default value is 10%), respectively.

Verification of Robustness

As shown in Table 5, we further explore the performance of our model on the openly available AskUbuntu corpus (Lei et al., 2016) and the Tatoeba benchmark (Artetxe and Schwenk, 2019). In this experiment, we conduct continual pre-training and fine-tuning using only the AskUbuntu corpus, without other datasets. We evaluate all the baselines and our model with p@1, p@5, and accuracy, computed on the Top-1, Top-5, and Top-30 queries, respectively. Moreover, we verify the effectiveness of our approach on Tatoeba and obtain remarkably better results than the baselines. For more details, see Table 10 in the Appendix.
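
The p@1 and p@5 scores differ from Top-k accuracy in that they count how many of the top k retrieved candidates are relevant, rather than whether any one of them is. A minimal sketch follows, assuming the common convention of dividing by k; the helper names and the toy ids are hypothetical, and the paper does not state its exact formula.

```python
def precision_at_k(ranking, relevant, k):
    """Share of the top-k retrieved candidates that are relevant."""
    return sum(1 for candidate in ranking[:k] if candidate in relevant) / k


def mean_precision_at_k(rankings, relevant_sets, k):
    """Average precision@k over all queries."""
    scores = [precision_at_k(r, rel, k) for r, rel in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5"]
    relevant = {"d1", "d4"}
    print(precision_at_k(ranking, relevant, 1))  # the top candidate is relevant
    print(precision_at_k(ranking, relevant, 5))  # 2 of the top 5 are relevant
```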

4.3 Monolingual Semantic Retrieval

As shown in Table 6, we also verify the semantic retrieval ability of our approach on the openly available Quora Duplicate Questions dataset, which includes only English monolingual data. Since this corpus provides only a test set, we use the English part of the BUCC2018 train set as our monolingual data. We then continually pre-train each model and evaluate it on the Quora Duplicate Questions dataset without fine-tuning. On this dataset, InfoXLM$_{Large}$ obtains better results than the other baselines, but our method outperforms all of them, which indicates that our approach has better retrieval ability than similar methods.

4.4 Ablation Study

The Effect of Similarity Loss $\mathcal{L}^{(\mathbf{q},\mathbf{l})}_{sim}$

As illustrated in Figure 5(a), the similarity loss is an essential part of our cross-lingual PTM. We observe that it has a positive effect on the performance of our model: pre-training with the similarity loss yields larger improvements over the baselines than pre-training without it.

The Effect of $\lambda$

$\lambda$ controls the weight of the XMLM, which appears in Equation (10). As depicted in Figure 5(b), our model achieves the best retrieval performance when $\lambda = 0.2$. We provide the results for different values of $\lambda$ in Table 11 (see Appendix).

The Effect of Code-switching

As illustrated in Figure 5(c), we also investigate the effectiveness of code-switching for our method. First, the performance drops if we keep the data without code-switching ($Cmd_r = 0\%$), which demonstrates the effectiveness of code-switching. Since all the languages in the LAZADA corpus originally include code-switched text, training the model without code-switched data may lower performance. Second, our model reaches the best average performance when $Cmd_r = 10\%$ (for more details, see Table 12 in the Appendix).

5 Related Work

Semantic Retrieval

Semantic retrieval is an essential task in NLP, which requires the model to compute sentence embeddings so that similar sentences can be retrieved by comparing them. Early SR methods are built on traditional word2vec representations (Kiros et al., 2015; Hill et al., 2016). Subsequently, various studies have proposed using siamese networks to perform semantic retrieval (Neculoiu et al., 2016; Kashyap et al., 2016; Bao et al., 2018). With the wide adoption of pre-trained language models, Reimers and Gurevych (2019) propose Sentence-BERT, which learns sentence embeddings by fine-tuning a siamese BERT network on NLI datasets. To conduct multi-lingual SR, Reimers and Gurevych (2020) extend Sentence-BERT to a multi-lingual version by knowledge distillation. To better leverage unlabeled data for SR, Gao et al. (2021) propose SimCSE, which uses contrastive learning to train sentence embeddings. To improve the performance of multi-lingual SR, Chi et al. (2021) propose InfoXLM, which utilizes MLM, TLM, and contrastive learning objectives.

Code-switching

Code-switching is a technique for improving cross-lingual pre-trained models and has been used in PTMs for machine translation (Yang et al., 2020c; Lin et al., 2020; Yang et al., 2020a). For example, Yang et al. (2020c) apply code-switching to monolingual data by replacing some contiguous words with their target-language counterparts and letting the model predict the replaced words. Lin et al. (2020) use code-switching on the source side of multi-lingual parallel corpora to pre-train an encoder-decoder model for multi-lingual machine translation. Feng et al. (2022b) mitigate the limitations of code-switching regarding grammatical incoherence and negative effects on token-sensitive tasks. Yang et al. (2020a) propose ALM for cross-lingual pre-training, which requires the model to predict the masked words in code-switched sentences. Krishnan et al. (2021) augment monolingual source data with multilingual code-switching via random translation to improve the generalizability of large multi-lingual language models. Besides, code-switching has been utilized in other NLP tasks, including named entity recognition (Singh et al., 2018), question answering (Chandu et al., 2018; Gupta et al., 2018), universal dependency parsing (Bhat et al., 2018), morphological tagging (Özateş and Çetinoğlu, 2021), language modeling (Pratapa et al., 2018), automatic speech recognition (Kumar and Bora, 2018), natural language inference (Khanuja et al., 2020), and sentiment analysis (Patwa et al., 2020; Liu et al., 2020; Aparaschivei et al., 2020; Zhang et al., 2021). To the best of our knowledge, we are the first to utilize code-switching for semantic retrieval.

6 Conclusion and Future Work

We introduce a straightforward pre-training approach to sentence-level semantic retrieval with code-switched cross-lingual data for the FAQ system in the task-oriented QA dialogue e-commerce scenario. Intuitively, code-switching is an emerging trend of communication in both bilingual and multi-lingual regions. Our experimental results show that the proposed approach remarkably outperforms previous, highly similar baseline systems on semantic retrieval and semantic textual similarity, across three business corpora and four open corpora and under many evaluation metrics. In future work, we will extend our method to other natural language understanding tasks. We will also explore other embedding distance metrics instead of using only cosine similarity.

References

  • Aparaschivei et al. (2020) Lavinia Aparaschivei, Andrei Palihovici, and Daniela Gîfu. 2020. FII-UAIC at semeval-2020 task 9: Sentiment analysis for code-mixed social media text using CNN. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 928–933.
  • Artetxe and Schwenk (2019) Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
  • Bao et al. (2018) Wei Bao, Wugedele Bao, Jinhua Du, Yuanyuan Yang, and Xiaobing Zhao. 2018. Attentive siamese lstm network for semantic textual similarity measure. In 2018 International Conference on Asian Language Processing (IALP), pages 312–317.
  • Bhat et al. (2018) Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2018. Universal Dependency parsing for Hindi-English code-switching. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 987–998, New Orleans, Louisiana. Association for Computational Linguistics.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.
  • Chandu et al. (2018) Khyathi Raghavi Chandu, Ekaterina Loginova, Vishal Gupta, Josef van Genabith, Günter Neumann, Manoj Kumar Chinnakotla, Eric Nyberg, and Alan W. Black. 2018. Code-mixed question answering challenge: Crowd-sourcing data and techniques. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching@ACL 2018, pages 29–38.
  • Chi et al. (2020) Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. In Proceedings of the AAAI conference on artificial intelligence, pages 7570–7577.
  • Chi et al. (2021) Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021. Infoxlm: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588.
  • Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
  • Conneau and Lample (2019) Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pages 7057–7067.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
  • Feng et al. (2022a) Fangxiaoyu Feng, Yinfei Yang, Daniel Matthew Cer, N. Arivazhagan, and Wei Wang. 2022a. Language-agnostic bert sentence embedding. In ACL.
  • Feng et al. (2022b) Yukun Feng, Feng Li, and Philipp Koehn. 2022b. Toward the limitation of code-switching in cross-lingual transfer. In Conference on Empirical Methods in Natural Language Processing.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910.
  • Gupta et al. (2018) Vishal Gupta, Manoj Kumar Chinnakotla, and Manish Shrivastava. 2018. Transliteration better than translation? answering code-mixed questions over a knowledge base. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching@ACL 2018, pages 39–50.
  • Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377.
  • Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. ArXiv, abs/2003.11080.
  • Huang et al. (2019) Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and M. Zhou. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In EMNLP.
  • Johnson et al. (2021) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7:535–547.
  • Kashyap et al. (2016) Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi, and Tim Finin. 2016. Robust semantic text similarity using lsa, machine learning, and linguistic resources. Language Resources and Evaluation, 50:125–161.
  • Khanuja et al. (2020) Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, and Monojit Choudhury. 2020. A new dataset for natural language inference from code-mixed conversations. In CALCS.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 3294–3302.
  • Krishnan et al. (2021) Jitin Krishnan, Antonios Anastasopoulos, Hemant Purohit, and Huzefa Rangwala. 2021. Multilingual code-switching for zero-shot cross-lingual intent prediction and slot filling. ArXiv, abs/2103.07792.
  • Kumar and Bora (2018) Ritesh Kumar and Manas Jyoti Bora. 2018. Part-of-speech annotation of english-assamese code-mixed texts: Two approaches. In Proceedings of the First International Workshop on Language Cognition and Computational Models, pages 94–103.
  • Lample et al. (2018) Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In 6th International Conference on Learning Representations.
  • Lei et al. (2016) Tao Lei, Hrishikesh Joshi, R. Barzilay, T. Jaakkola, K. Tymoshenko, Alessandro Moschitti, and Lluís Màrquez i Villodre. 2016. Semi-supervised question retrieval with gated convolutions. In NAACL.
  • Lin et al. (2020) Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2020. Pre-training multilingual neural machine translation by leveraging alignment information. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2649–2663.
  • Liu et al. (2020) Jiaxiang Liu, Xuyi Chen, Shikun Feng, Shuohuan Wang, Xuan Ouyang, Yu Sun, Zhengjie Huang, and Weiyue Su. 2020. Kk2018 at semeval-2020 task 9: Adversarial training for code-mixing sentiment classification. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 817–823.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. arXiv preprint arXiv: 1907.11692.
  • Luo et al. (2021) Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. 2021. Veco: Variable and flexible cross-lingual pre-training for language understanding and generation. In ACL.
  • Neculoiu et al. (2016) Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157.
  • Ouyang et al. (2021) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 27–38, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Özateş and Çetinoğlu (2021) Şaziye Betül Özateş and Özlem Çetinoğlu. 2021. A language-aware approach to code-switched morphological tagging. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 72–83, Online. Association for Computational Linguistics.
  • Patwa et al. (2020) Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. Semeval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 774–790.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237.
  • Pratapa et al. (2018) Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, and Kalika Bali. 2018. Language modeling for code-mixing: The role of linguistic theory based synthetic data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1543–1553.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3980–3990.
  • Reimers and Gurevych (2020) Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4512–4525.
  • Siddhant et al. (2020) Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2020. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. In Proceedings of the AAAI conference on artificial intelligence, pages 8854–8861.
  • Singh et al. (2018) Vinay Singh, Deepanshu Vijay, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. Named entity recognition for hindi-english code-mixed social media text. In Proceedings of the Seventh Named Entities Workshop, pages 27–35.
  • Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4444–4451.
  • Tran et al. (2020) Chau Tran, Yuqing Tang, Xian Li, and Jiatao Gu. 2020. Cross-lingual retrieval for iterative self-supervised training. Advances in Neural Information Processing Systems, 33:2207–2219.
  • Wei et al. (2021) Xiangpeng Wei, Yue Hu, Rongxiang Weng, Luxi Xing, Heng Yu, and Weihua Luo. 2021. On learning universal representations across languages. ICLR.
  • Wenzek et al. (2020) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. Ccnet: Extracting high quality monolingual datasets from web crawl data. In LREC.
  • Xiong et al. (2021) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations.
  • Xu et al. (2022) Wenshen Xu, Mieradilijiang Maimaiti, Yuanhang Zheng, Xin Tang, and Ji Zhang. 2022. Auto-mlm: Improved contrastive learning for self-supervised multi-lingual knowledge retrieval. arXiv preprint arXiv:2203.16187.
  • Yang et al. (2020a) Jian Yang, Shuming Ma, Dongdong Zhang, Shuangzhi Wu, Zhoujun Li, and Ming Zhou. 2020a. Alternating language modeling for cross-lingual pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 9386–9393.
  • Yang et al. (2019) Yinfei Yang, Gustavo Hernández Ábrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 5370–5378. ijcai.org.
  • Yang et al. (2020b) Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2020b. Multilingual universal sentence encoder for semantic retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94, Online. Association for Computational Linguistics.
  • Yang et al. (2020c) Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020c. CSP: code-switching pre-training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2624–2636.
  • Yang et al. (2021) Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2021. Universal sentence representation learning with conditional masked language model. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6216–6228, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Zhang et al. (2021) Wenxuan Zhang, Ruidan He, Haiyun Peng, Lidong Bing, and Wai Lam. 2021. Cross-lingual aspect-based sentiment analysis with aspect term code-switching. In Conference on Empirical Methods in Natural Language Processing.
  • Zweigenbaum et al. (2017) Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. Overview of the second bucc shared task: Spotting parallel sentences in comparable corpora. In BUCC@ACL.
Figure 6: The comparison of the retrieved labels among the baselines and our proposed method, where the y-axis and the x-axis denote the user query and the retrieved label, respectively. Panels: (a) CMLM, (b) XLMR, (c) LaBSE, (d) InfoXLM, (e) VECO, (f) Ours.
Table 7: The comparison with accuracy score (Top-10 queries) between cross-lingual sentence retrieval baseline systems on AliExpress, LAZADA, and DARAZ corpora.
| Model | AliExpress Ar | AliExpress En | AliExpress Zh | LAZADA Id | LAZADA Ms | LAZADA Fil | LAZADA Th | DARAZ Ur | DARAZ Bn | DARAZ Ne | DARAZ Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 56.3 | 55.2 | 78.0 | 32.1 | 29.8 | 46.9 | 48.1 | 67.6 | 67.4 | 35.0 | 51.6 | 51.6 |
| Unicoder | 43.1 | 49.8 | 66.4 | 25.2 | 24.7 | 41.7 | 41.9 | 59.3 | 56.5 | 27.9 | 38.8 | 43.2 |
| XLMR$_{Large}$ | 65.1 | 61.0 | 79.1 | 45.7 | 36.0 | 47.1 | 59.4 | 68.2 | 66.5 | 43.5 | 44.6 | 56.0 |
| SimCSE-BERT$_{Large}$ | 48.4 | 58.8 | 40.3 | 27.4 | 31.5 | 48.5 | 58.1 | 66.0 | 68.8 | 36.9 | 50.2 | 48.6 |
| InfoXLM$_{Large}$ | 60.2 | 61.7 | 76.9 | 44.7 | 33.8 | 51.6 | 61.3 | 66.5 | 64.6 | 37.3 | 49.4 | 55.3 |
| VECO | 67.4 | 61.7 | 81.7 | 47.6 | 35.3 | 52.8 | 60.0 | 70.7 | 70.6 | 40.3 | 52.4 | 58.2 |
| CMLM | 67.6 | 62.9 | 80.1 | 41.2 | 35.8 | 49.0 | 55.9 | 71.3 | 72.0 | 38.4 | 56.0 | 57.3 |
| LaBSE | 71.1 | 61.3 | 81.0 | 41.5 | 35.8 | 52.1 | 55.5 | 71.8 | 74.1 | 40.7 | 57.4 | 58.4 |
| Ours | 73.1 | 64.5 | 85.1 | 50.1 | 38.4 | 56.6 | 62.2 | 75.1 | 75.8 | 47.8 | 59.6 | 62.6 |
Table 8: The comparison with accuracy score (Top-20 queries) between cross-lingual sentence retrieval baseline systems on AliExpress, LAZADA, and DARAZ corpora.
| Model | AliExpress Ar | AliExpress En | AliExpress Zh | LAZADA Id | LAZADA Ms | LAZADA Fil | LAZADA Th | DARAZ Ur | DARAZ Bn | DARAZ Ne | DARAZ Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 71.8 | 71.8 | 85.5 | 46.1 | 42.2 | 62.2 | 63.3 | 78.2 | 78.6 | 48.2 | 69.2 | 65.2 |
| Unicoder | 55.6 | 63.0 | 74.2 | 36.9 | 37.3 | 56.4 | 54.6 | 69.0 | 68.6 | 40.7 | 56.6 | 55.7 |
| XLMR$_{Large}$ | 74.1 | 73.8 | 87.2 | 60.9 | 49.4 | 62.3 | 74.2 | 80.3 | 79.5 | 57.6 | 60.2 | 69.0 |
| SimCSE-BERT$_{Large}$ | 63.3 | 71.9 | 47.5 | 40.6 | 47.2 | 64.4 | 73.7 | 78.4 | 79.2 | 50.6 | 66.6 | 62.1 |
| InfoXLM$_{Large}$ | 73.4 | 75.8 | 85.8 | 60.9 | 45.9 | 66.7 | 72.6 | 77.4 | 75.9 | 51.1 | 66.8 | 68.4 |
| VECO | 80.4 | 76.3 | 88.3 | 61.1 | 47.9 | 68.1 | 73.8 | 82.4 | 82.3 | 53.4 | 69.0 | 71.2 |
| CMLM | 80.3 | 74.9 | 88.4 | 56.4 | 47.1 | 64.9 | 69.0 | 80.9 | 82.8 | 52.5 | 72.0 | 69.9 |
| LaBSE | 82.2 | 73.6 | 88.3 | 58.4 | 49.5 | 67.3 | 68.8 | 81.4 | 81.1 | 54.9 | 74.2 | 70.9 |
| Ours | 85.3 | 77.2 | 91.4 | 67.3 | 52.3 | 71.4 | 76.7 | 85.1 | 86.3 | 60.0 | 76.4 | 75.4 |
Table 9: The comparison of sentence embedding performance on STS tasks. “STS12-STS16”, “STS-B” and “SICK-R” denote SemEval2012-2016, STS benchmark and SICK relatedness dataset, respectively.
| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 42.46 | 62.14 | 52.35 | 65.36 | 66.20 | 60.51 | 60.87 | 58.56 |
| Unicoder | 41.07 | 56.92 | 49.76 | 60.86 | 53.65 | 47.97 | 54.78 | 52.14 |
| XLMR$_{Large}$ | 39.79 | 62.65 | 52.09 | 62.26 | 64.39 | 59.27 | 61.07 | 57.36 |
| SimCSE-BERT$_{Large}$ | 42.35 | 67.34 | 57.20 | 70.36 | 69.41 | 59.86 | 63.77 | 61.47 |
| InfoXLM$_{Large}$ | 32.23 | 52.35 | 39.42 | 52.04 | 60.82 | 54.04 | 59.61 | 50.07 |
| VECO | 41.76 | 60.75 | 52.21 | 64.79 | 67.26 | 58.93 | 61.17 | 58.12 |
| CMLM | 30.14 | 61.77 | 47.45 | 61.25 | 62.73 | 53.23 | 56.62 | 53.31 |
| LaBSE | 47.43 | 64.13 | 55.72 | 69.66 | 64.21 | 57.60 | 60.68 | 59.92 |
| Ours | 44.53 | 68.20 | 55.99 | 71.39 | 66.07 | 66.32 | 68.03 | 62.93 |

Appendix A Case Study

To further illustrate the performance of the proposed approach, we visualize the results retrieved on the business corpus by our model and by previously introduced SOTA pre-trained models for cross-lingual scenarios (i.e., the five strongest baselines).

As illustrated in Figure 6, the baselines fail to retrieve the key information “delivery failed” from the user query “it says delivery failed when i track it”, whereas our proposed method retrieves it. This suggests that our model improves sentence retrieval in the business domain.

Table 10: The comparison with accuracy score (Top-30 queries) on Tatoeba corpus for each language.
| Model | af | de | es | fr | it | ja | kk | nl | pt | sw | te |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT (Devlin et al., 2019) | 55.4 | 78.0 | 74.2 | 74.7 | 73.6 | 73.1 | 50.1 | 71.2 | 76.6 | 32.3 | 58.5 |
| Unicoder (Huang et al., 2019) | 15.8 | 15.4 | 20.2 | 40.4 | 19.4 | 32.6 | 16.5 | 21.7 | 20.7 | 25.4 | 18.4 |
| XLMR (Chi et al., 2021) | 67.8 | 86.6 | 86.2 | 87.7 | 83.2 | 84.9 | 46.4 | 84.2 | 86.8 | 36.7 | 70.9 |
| SimCSE (Gao et al., 2021) | 23.0 | 28.4 | 25.1 | 39.1 | 25.4 | 21.6 | 20.4 | 20.7 | 21.2 | 16.7 | 17.7 |
| InfoXLM (Chi et al., 2021) | 84.7 | 90.6 | 86.4 | 93.7 | 89.5 | 88.2 | 77.2 | 91.3 | 89.8 | 68.9 | 74.3 |
| VECO (Luo et al., 2021) | 86.3 | 96.5 | 95.7 | 96.4 | 95.3 | 93.5 | 71.1 | 94.6 | 95.4 | 73.0 | 81.1 |
| CMLM (Yang et al., 2021) | 87.9 | 94.1 | 91.5 | 97.1 | 94.3 | 91.3 | 75.6 | 94.7 | 95.6 | 74.3 | 83.3 |
| LaBSE (Feng et al., 2022a) | 93.9 | 95.5 | 94.4 | 97.5 | 95.3 | 92.1 | 83.3 | 95.4 | 96.1 | 75.4 | 85.9 |
| Ours | 95.6 | 96.8 | 97.3 | 98.1 | 96.3 | 94.3 | 85.2 | 96.5 | 96.3 | 77.2 | 86.3 |
Table 11: The effect of $\lambda$ on business corpora with accuracy score (Top-30 queries).
| $\lambda$ | AliExpress Ar | AliExpress En | AliExpress Zh | LAZADA Id | LAZADA Ms | LAZADA Fil | LAZADA Th | DARAZ Ur | DARAZ Bn | DARAZ Ne | DARAZ Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.1 | 87.1 | 82.2 | 91.1 | 68.3 | 60.8 | 77.7 | 80.2 | 88.0 | 90.4 | 60.2 | 84.0 | 79.1 |
| 0.2 | 85.4 | 81.9 | 91.1 | 70.9 | 58.8 | 77.6 | 82.2 | 87.9 | 88.4 | 65.2 | 81.6 | 79.2 |
| 0.3 | 81.1 | 81.7 | 90.6 | 70.7 | 59.3 | 78.1 | 76.8 | 88.2 | 89.8 | 62.0 | 83.6 | 78.3 |
| 0.4 | 82.9 | 81.2 | 91.7 | 70.9 | 58.8 | 74.3 | 82.3 | 87.6 | 88.6 | 63.0 | 83.8 | 78.6 |
| 0.5 | 87.2 | 80.2 | 90.6 | 70.8 | 60.6 | 72.8 | 82.2 | 88.1 | 89.2 | 64.9 | 82.3 | 79.0 |
Table 12: The effect of code-switching rate (“$Cmd_r$”) on business corpora with accuracy score (Top-30 queries).
| $Cmd_r$ | AliExpress Ar | AliExpress En | AliExpress Zh | LAZADA Id | LAZADA Ms | LAZADA Fil | LAZADA Th | DARAZ Ur | DARAZ Bn | DARAZ Ne | DARAZ Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0% | 86.7 | 82.2 | 89.9 | 67.5 | 58.6 | 75.4 | 77.4 | 82.7 | 87.9 | 57.3 | 76.8 | 76.6 |
| 10% | 89.0 | 82.2 | 92.4 | 73.7 | 61.3 | 78.4 | 80.2 | 84.6 | 90.8 | 60.2 | 79.0 | 79.3 |
| 20% | 87.1 | 82.2 | 91.4 | 68.3 | 60.8 | 77.4 | 80.2 | 88.0 | 90.4 | 60.2 | 84.0 | 79.1 |
| 30% | 84.5 | 82.2 | 92.4 | 70.5 | 59.2 | 76.6 | 81.7 | 87.7 | 89.3 | 66.1 | 74.4 | 78.6 |
| 40% | 86.2 | 82.2 | 92.3 | 69.8 | 61.3 | 75.8 | 79.8 | 82.3 | 84.2 | 65.2 | 78.0 | 77.9 |
| 50% | 85.1 | 82.2 | 91.9 | 69.7 | 59.3 | 76.4 | 81.1 | 85.9 | 79.9 | 65.5 | 78.4 | 77.8 |

Appendix B Results on Business Datasets

As shown in Table 7 and Table 8, we further examine the retrieval ability of our approach on the AliExpress, LAZADA, and DARAZ corpora by evaluating accuracy on the Top-10 and Top-20 retrieved queries, respectively. The experiment consists of two steps: first, we continually pre-train the models on the combined queries and labels of the training set; then, we fine-tune on each language using its own training and development sets. Similar to the results on the Top-30 queries (see Table 4), VECO is again among the strongest baselines, although LaBSE is marginally ahead on average at Top-10 (58.4 vs. 58.2). Our proposed model outperforms all baselines on every language of the three business datasets, raising the best average accuracy from 58.4 to 62.6 at Top-10 and from 71.2 to 75.4 at Top-20.
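The Top-k accuracy used in these tables can be illustrated with a minimal sketch. It assumes that a query counts as correct when its gold label is among the k labels most similar to it under cosine similarity; the function names, the toy vectors, and the choice of cosine similarity are illustrative assumptions, not the evaluation code used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_accuracy(query_vecs, label_vecs, gold, k):
    """Percentage of queries whose gold label index is among the k nearest labels."""
    hits = 0
    for q, g in zip(query_vecs, gold):
        # Rank all label indices by similarity to the query, best first.
        ranked = sorted(range(len(label_vecs)),
                        key=lambda j: cosine(q, label_vecs[j]),
                        reverse=True)
        if g in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(query_vecs)

# Hypothetical 2-d embeddings standing in for encoder outputs.
labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
gold = [0, 1, 2]  # index of the correct label for each query

top1 = top_k_accuracy(queries, labels, gold, k=1)
top2 = top_k_accuracy(queries, labels, gold, k=2)
```

In this toy case the third query ranks its gold label second, so it is counted as correct at k=2 but not at k=1, which is the same effect that makes the Top-20 scores in Table 8 higher than the Top-10 scores in Table 7.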

Appendix C Results on the Task of STS

Semantic retrieval (SR) and semantic textual similarity (STS) are slightly different tasks, but both rely on sentence-level representations. We therefore also evaluate on STS: Table 9 compares the semantic representation ability of the baseline models and our proposed method. All test sets in this experiment (STS12, STS13, STS14, STS15, STS16, STS-B, and SICK-R) contain only English sentences, so we evaluate the baselines and our model using the continually pre-trained models rather than the fine-tuned ones. We use Spearman’s rank correlation coefficient to measure the quality of all models. Among the baseline systems, SimCSE-BERT$_{Large}$ obtains the highest average score (61.47), while our method achieves the best average overall (62.93) and the best scores on STS13, STS15, STS-B, and SICK-R.
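As a reminder of how the metric works, the following sketch computes Spearman’s rank correlation as the Pearson correlation of the two rank vectors, with tied values sharing their mean rank. The helper names and the toy scores are illustrative assumptions; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical model similarities and human-annotated scores for four sentence pairs.
predicted = [0.10, 0.40, 0.35, 0.80]
annotated = [1.0, 2.0, 3.0, 5.0]
rho = spearman(predicted, annotated)
```

Because only the ordering matters, the metric does not depend on the scale of the model’s similarity scores, which makes it suitable for comparing encoders that were not trained to output calibrated similarity values.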

Appendix D Results on Publicly Open Corpora

We also conduct experiments on two well-known, widely used public benchmarks for sentence-level semantic retrieval, AskUbuntu and Tatoeba. In contrast to the experiment on the AskUbuntu dataset, for Tatoeba we continually pre-train all models on the BUCC2018 corpus. Because the BUCC2018 corpus contains a training set and a development set, we also fine-tune our continually pre-trained model on BUCC2018 for a fair comparison. The Tatoeba dataset covers more than 40 languages (referred to by their ISO 639-1 codes for brevity), but we select only 11 languages that do not appear in our business corpora: Afrikaans (af), German (de), Spanish (es), French (fr), Italian (it), Japanese (ja), Kazakh (kk), Dutch (nl), Portuguese (pt), Swahili (sw), and Telugu (te). Table 10 reports the accuracy for each of these languages on Tatoeba.

Appendix E The Effect of the Hyper-parameter

We further examine different values of $\lambda$ on both the business data and the publicly available open corpora; Table 11 reports the results on the business corpora. To select $\lambda$, we fix the code-switching rate $Cmd_r$ during the continual pre-training stage. The results show that our model achieves the best average accuracy with $\lambda = 0.2$ (79.2), so we use $0.2$ as the default value of $\lambda$ in all experiments.

Additionally, as shown in Table 12, we investigate different values of the code-switching rate $Cmd_r$. Similarly, we fix $\lambda$ during the continual pre-training step to select a proper value of $Cmd_r$. The results show that our approach obtains the highest average accuracy with $Cmd_r = 10\%$ (79.3). Therefore, we use $10\%$ as the default value of $Cmd_r$. The values of $\lambda$ and $Cmd_r$ are identical for the business and open corpora throughout the experiments.
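To make the role of the code-switching rate concrete, the sketch below replaces a fraction of the tokens in a query with dictionary translations. It assumes that $Cmd_r$ denotes the fraction of tokens that are switched and that substitutions come from a bilingual dictionary; the function `code_switch`, the toy English-Spanish dictionary, and the rounding rule are hypothetical and are not taken from the paper.

```python
import random

def code_switch(tokens, dictionary, rate, rng):
    """Replace roughly `rate` of the tokens with a dictionary translation.

    Only tokens that have a dictionary entry can be switched, so the number
    of replacements is capped by the number of covered tokens.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok in dictionary]
    n_switch = min(len(candidates), round(rate * len(tokens)))
    chosen = set(rng.sample(candidates, n_switch))
    return [rng.choice(dictionary[tok]) if i in chosen else tok
            for i, tok in enumerate(tokens)]

# Hypothetical toy dictionary; a real setup would use a much larger lexicon.
toy_dict = {
    "delivery": ["entrega"],
    "failed": ["falló"],
    "track": ["rastrear"],
}
tokens = "it says delivery failed when i track it".split()

unchanged = code_switch(tokens, toy_dict, 0.0, random.Random(0))
switched = code_switch(tokens, toy_dict, 0.25, random.Random(0))
```

With a rate of 0 the query is returned unchanged, which corresponds to the 0% row of Table 12, while larger rates switch more tokens up to the number of tokens covered by the dictionary.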