Improving Cross-lingual Representation for Semantic Retrieval with Code-switching
Mieradilijiang Maimaiti , Yuanhang Zheng, Ji Zhang ,
Fei Huang, Yue Zhang, Wenpei Luo, Kaiyu Huang Alibaba DAMO Academy
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Department of Computer Science and Technology, Westlake University, Hangzhou, China
Department of Computer Science and Technology, Dalian University of Technology, Dalian
Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, China
{mieradilijiang.mea, zj122146, f.huang}@alibaba-inc.com,
zheng-yh19@mails.tsinghua.edu.cn, zhangyue@westlake.edu.cn,
22109239@mail.dlut.edu.cn, kyhuang@bjtu.edu.cn
Equal contribution Corresponding author: Ji Zhang
Abstract
Semantic Retrieval (SR) has become an indispensable part of the FAQ system in the task-oriented question-answering (QA) dialogue scenario.
The demands for a cross-lingual smart-customer-service system for an e-commerce platform or some particular business conditions have been increasing recently.
Most previous studies exploit cross-lingual pre-trained models (PTMs) for multi-lingual knowledge retrieval directly,
while some others also leverage the continual pre-training before fine-tuning PTMs on the downstream tasks.
However, whichever scheme is used, previous work neglects to inform PTMs of the features of the downstream task, i.e., it trains PTMs without providing any signals related to SR.
To this end, in this work, we propose an Alternative Cross-lingual PTM for SR via code-switching.
We are the first to utilize the code-switching approach for cross-lingual SR.
Besides, we introduce the novel code-switched continual pre-training instead of directly using the PTMs on the SR tasks.
The experimental results show that our proposed approach consistently outperforms the previous SOTA methods on SR and semantic textual similarity (STS) tasks with three business corpora and four open datasets in 20+ languages.
1 Introduction
In recent years, pre-trained models (PTMs) have demonstrated success on many downstream tasks of natural language processing (NLP).
Intuitively, PTMs such as ELMO (Peters et al., 2018), GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019) have achieved remarkable results by transferring knowledge learned from a large amount of unlabeled corpus to various downstream NLP tasks.
To learn the cross-lingual representations, previous methods like multi-lingual BERT (mBERT) (Devlin et al., 2019) and XLM (Conneau and Lample, 2019) have extended PTMs to multiple languages.
Figure 1: A brief illustration of semantic retrieval that leverages a knowledge base for the FAQ system in the task-oriented dialogue scenario.
Semantic retrieval (SR) (Kiros et al., 2015) has become the ubiquitous method in the FAQ system (i.e., task-oriented question-answering (QA) (Xiong et al., 2021)) which is incorporated into the smart-customer-service platform for the e-commerce scenario.
For the cross-lingual scenario, many pre-training methods have been presented for the multi-lingual downstream tasks, such as XLM-R (Conneau et al., 2020), XNLG (Chi et al., 2020), InfoXLM (Chi et al., 2021) and VECO (Luo et al., 2021).
Intuitively, the main challenge of SR is how to accurately retrieve the corresponding sentence from the knowledge base (query-label pairs) (Kiros et al., 2015).
Commonly used approaches mainly take some variants of the BERT model as a backbone and then directly fine-tune on the downstream tasks (Devlin et al., 2019; Conneau and Lample, 2019; Huang et al., 2019; Yang et al., 2021; Ouyang et al., 2021).
Specifically, mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), Unicoder (Huang et al., 2019), CMLM (Yang et al., 2021) and ERNIE-M (Ouyang et al., 2021) learn the cross-lingual sentence representation mainly using masked language modeling (MLM). For other objective functions,
MMTE (Siddhant et al., 2020) exploits multi-lingual machine translation, and CRISS (Tran et al., 2020) leverages unsupervised parallel data mining.
Some models use Siamese network architectures to better adapt them to SR.
For example, InferSent (Conneau et al., 2017) uses natural language inference (NLI) datasets to train the Siamese network. USE (Yang et al., 2019), M-USE (Yang et al., 2020b) and LaBSE (Feng et al., 2022a) exploit ranking loss. SimCSE (Gao et al., 2021), InfoXLM (Chi et al., 2021) and HICTL (Wei et al., 2021) use contrastive learning.
However, these closely related approaches largely ignore the transfer of downstream-task features to PTMs.
In other words, most methods (Chi et al., 2021; Luo et al., 2021) directly fine-tune the models on downstream tasks without providing any signals related to SR.
In addition, they are mainly pre-trained on combined monolingual data where few of the sentences are code-switched.
Since the user queries often contain many code-switched sentences, it is insufficient to exploit the commonly used methods directly for the SR task in the e-commerce scenario.
In this work, as depicted in Figure 1, we aim to enhance the performance of SR for the FAQ system in the e-commerce scenario.
We propose a novel pre-training approach for sentence-level SR with code-switched cross-lingual data.
Our motivation comes from a gap left by previous studies.
One of the recent studies (Yang et al., 2020a) also tries to exploit the code-switching strategy in the machine translation scenario, but no one has tried to leverage code-switching on the task of multi-lingual SR.
Furthermore, the previous methods (Xu et al., 2022) have exploited multi-lingual PTMs on the SR task by only masking the query instead of masking the label.
They intend to use more efficient PTMs to fine-tune the SR task rather than making PTMs stronger by providing some signals.
To allow the PTMs to learn the signals directly related to downstream tasks, we present an Alternative Cross-Lingual PTM for semantic retrieval using code-switching, which consists of three main steps.
First, we generate code-switched data based on bilingual dictionaries.
Then, we pre-train a model on the code-switched data using a weighted sum of the alternating language modeling (ALM) loss (Yang et al., 2020a) and the similarity loss.
Finally, we fine-tune the model on the SR corpus. By providing additional training signals related to SR during the pre-training process, our proposed approach can better learn the SR task.
Our main contributions are as follows:
• Experiments show that our approach remarkably outperforms the SOTA methods with various evaluation metrics.
• Our method improves the robustness of the model for sentence-level SR on both the in-house datasets and open corpora.
• To the best of our knowledge, we are the first to present an alternative cross-lingual PTM for SR using code-switching in the FAQ system (e-commerce scenario).
2 Preliminaries
2.1 Masked Language Modeling
Masked language modeling (MLM) (Devlin et al., 2019) is a pre-training objective focused on learning representations of natural language sentences. When pre-training a model using MLM objectives, we let the model predict the masked words in the input sentence. Formally, we divide each sentence $x$ into the masked part $x^{m}$ and the observed part $x^{o}$, and we train the model (which is parameterized by $\theta$) to minimize
$$\mathcal{L}_{\mathrm{MLM}} = -\log P(x^{m} \mid x^{o}; \theta). \tag{1}$$
When calculating Eq. (1), we assume that the model independently predicts each masked word.
Formally, we assume that all masked words in the masked part $x^{m}$ are independent conditioned on $x^{o}$.
Thus, Eq. (1) can be rewritten as
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i=1}^{|x^{m}|} \log P(x^{m}_{i} \mid x^{o}; \theta). \tag{2}$$
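The factorized MLM objective can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `[MASK]` string, the helper names `mask_tokens` and `mlm_loss`, and the hand-written probabilities standing in for a model are all assumptions made for illustration, and the usual refinements of BERT-style masking (random or unchanged replacements) are omitted.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol used only in this sketch


def mask_tokens(tokens, mask_prob, rng):
    """Split a sentence into an observed view and its masked positions.

    Returns the token list with masked positions replaced by MASK and a
    dict mapping each masked position to the original token.
    """
    observed = list(tokens)
    masked = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            observed[i] = MASK
            masked[i] = tok
    return observed, masked


def mlm_loss(masked, predicted_probs):
    """Sum of negative log-probabilities over the masked positions.

    predicted_probs[i] maps candidate tokens to the probability that a
    (hypothetical) model assigns at masked position i.
    """
    return -sum(math.log(predicted_probs[i][tok]) for i, tok in masked.items())


# Toy usage: every masked token receives probability 0.5.
rng = random.Random(0)
observed, masked = mask_tokens(["where", "is", "my", "order"], 0.5, rng)
toy_probs = {i: {tok: 0.5} for i, tok in masked.items()}
loss = mlm_loss(masked, toy_probs)
```

Because each masked word contributes an independent term, the toy loss equals the number of masked positions times log 2.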
Figure 2: The architecture of our proposed model, Alternative Cross-Lingual PTM for SR. The code-switched tokens for the query and the label are “status” → “状态”, “failed” → “失败”, and “delivery” → “运送”, respectively. The “[CLS]” symbol stands for the sentence representation of the query and the label. The structures used for the query and the label are the same.
2.2 Cross-lingual LM Pre-training
To improve the performances of various models on the NLP tasks of different languages, cross-lingual PTMs (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020) have been proposed.
Generally, cross-lingual PTMs are trained on multi-lingual corpora using the MLM objective. During the pre-training process, the corpora of low-resource languages are usually oversampled to improve the model’s performance on low-resource languages.
To better align the representations of the sentences in different languages, cross-lingual PTMs may use another objective called translation language modeling (TLM), which requires the model to predict the masked words in both the source and the target sentences in a parallel sentence pair.
Formally, given a parallel sentence pair $(x, y)$, we randomly divide the source sentence $x$ into the masked part $x^{m}$ and the observed part $x^{o}$, and also divide the target sentence $y$ into the masked part $y^{m}$ and the observed part $y^{o}$. Then we minimize
$$\mathcal{L}_{\mathrm{TLM}} = -\log P(x^{m}, y^{m} \mid x^{o}, y^{o}; \theta). \tag{3}$$
2.3 Semantic Retrieval
Semantic retrieval (SR) aims to retrieve sentences similar to the query sentence in a knowledge base (Kiros et al., 2015). Specifically, the semantic retrieval model converts sentences into vectors, and similar sentences are retrieved based on the cosine similarity.
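Formally, given a sentence $s$, the model encodes $s$ into a vector $\mathbf{v}_{s}$. When we need to retrieve sentences similar to the query $q$, we calculate the cosine similarity between $\mathbf{v}_{q}$ and $\mathbf{v}_{s}$ for each sentence $s$ in the knowledge base $\mathcal{K}$:
$$\mathrm{sim}(q, s) = \frac{\mathbf{v}_{q}^{\top} \mathbf{v}_{s}}{\lVert \mathbf{v}_{q} \rVert \, \lVert \mathbf{v}_{s} \rVert}. \tag{4}$$
Finally, we retrieve the sentence $s^{*}$ which is most similar to the query $q$ in $\mathcal{K}$:
$$s^{*} = \operatorname*{arg\,max}_{s \in \mathcal{K}} \mathrm{sim}(q, s). \tag{5}$$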
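The retrieval step described above (encode, score by cosine similarity, take the best match) can be sketched in a few lines of Python. The three-dimensional vectors and the FAQ sentences below are made up for illustration; in practice the vectors would come from a sentence encoder, which is outside this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, knowledge_base):
    """Return the (sentence, score) pair with the highest cosine similarity."""
    scored = ((s, cosine(query_vec, v)) for s, v in knowledge_base.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical knowledge base: sentence -> pre-computed embedding.
kb = {
    "how to track my order": [0.9, 0.1, 0.0],
    "how to request a refund": [0.1, 0.9, 0.2],
    "change delivery address": [0.0, 0.2, 0.9],
}
best_sentence, best_score = retrieve([0.8, 0.2, 0.1], kb)
```

A linear scan is enough for a toy knowledge base; for large collections an approximate nearest-neighbour index would typically replace the `max` over all entries.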
2.4 Code-switching
To reduce the representation gap between words of different languages in cross-lingual PTMs, Yang et al. (2020a) proposed ALM, which is based on code-switching.
Specifically, given a source sentence, we construct a code-switched sentence by randomly replacing some source words with the corresponding target words.
For example, suppose the English source sentence is “I like music” and then we replace some words in the sentence with Chinese. If the replaced English words are “I” and “music” and their corresponding Chinese words are “我” and “音乐”, respectively, then the code-switched sentence is “我 like 音乐”.
Formally, suppose that we conduct code-switching on a source sentence $x = (x_{1}, \ldots, x_{n})$. First, we randomly choose a subset $S$ from $\{1, \ldots, n\}$. Then, for each element $i \in S$, we replace $x_{i}$ with its corresponding target word $y_{i}$ to construct the code-switched sentence $z = (z_{1}, \ldots, z_{n})$, where
$$z_{i} = \begin{cases} y_{i}, & i \in S, \\ x_{i}, & \text{otherwise.} \end{cases} \tag{6}$$
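The replacement rule can be sketched in Python using the “I like music” example from above. The function name `code_switch`, the `rate` parameter, and the choice to draw the random subset by flipping an independent coin per token are assumptions of this sketch, not details given in the text.

```python
import random


def code_switch(tokens, dictionary, rate, rng):
    """Replace a random subset of source tokens with target-language words.

    Each token that has a dictionary entry is selected with probability
    `rate`; all other tokens are kept unchanged.
    """
    switched = []
    for tok in tokens:
        if tok in dictionary and rng.random() < rate:
            switched.append(dictionary[tok])
        else:
            switched.append(tok)
    return switched


# Toy English-to-Chinese dictionary taken from the running example.
en_zh = {"I": "我", "music": "音乐"}
fully_switched = code_switch(["I", "like", "music"], en_zh, 1.0, random.Random(0))
unchanged = code_switch(["I", "like", "music"], en_zh, 0.0, random.Random(0))
```

With `rate=1.0` every token that has a dictionary entry is replaced, which reproduces “我 like 音乐”; with `rate=0.0` the sentence is returned unchanged.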
3 Method
3.1 Alternative Cross-lingual PTM
The main architecture of our model is shown in Figure 2.
We jointly train the PTM on the code-switched data using the cross-lingual masked language model (XMLM) and the similarity loss to address the limitation of existing PTMs that are trained without signals directly related to the downstream tasks (e.g., SR).
In contrast to these PTMs, we add a similarity loss term to the pre-training objective to adjust the similarity between the input ($Q$) and the output ($L$).
Thus, the similarity between $Q$ and $L$ is controlled by the similarity loss during the continual pre-training step.
For the sentence-level SR task, the given knowledge is composed of a certain number of $\langle Q, L \rangle$ pairs (see Figure 2).
We regard $Q$ as a query and take $L$ as its corresponding label.
Given a query $Q$ and a label $L$, standard retrieval models usually formulate sentence-level SR as calculating the similarity between $Q$ and $L$ in the semantic space.
We retrieve the query similar to the label by calculating the cosine similarity between the encoded vectors $\mathbf{v}_{Q}$ and $\mathbf{v}_{L}$ for each $\langle Q, L \rangle$ pair in the knowledge base $\mathcal{K}$:
$$\mathrm{sim}(Q, L) = \frac{\mathbf{v}_{Q}^{\top} \mathbf{v}_{L}}{\lVert \mathbf{v}_{Q} \rVert \, \lVert \mathbf{v}_{L} \rVert}. \tag{9}$$
Then, we rank the retrieved sentences according to their similarity scores to recall the Top-$k$ similar questions that are semantically close to the original input query.
The total objective function of our proposed model consists of two parts, the XMLM loss and the similarity loss, which can be formulated as follows:
$$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{XMLM}} + \mathcal{L}_{\mathrm{sim}}, \tag{10}$$
where $\lambda$ controls the weight of the XMLM.
Intuitively, since the XMLM is highly similar to the monolingual MLM, the masked token prediction task can be extended to the cross-lingual setting. Generally, the monolingual MLM loss is as follows:
$$\mathcal{L}_{\mathrm{MLM}} = -\log P(Q^{m} \mid Q^{o}; \theta), \tag{11}$$
where $Q^{m}$ and $Q^{o}$ are the masked part and the observed part of the input $Q$, respectively. The label $L$ is masked in the same way as $Q$, i.e., we mask $L$ using the same masking strategy as for $Q$.
Concretely, as shown in Algorithm 1, we merge the $\langle Q, L \rangle$ pairs in the code-switched format and regard the result, denoted $C$, as the input of MLM. The XMLM loss is as follows:
$$\mathcal{L}_{\mathrm{XMLM}} = -\log P(C^{m} \mid C^{o}; \theta), \tag{12}$$
where $C^{m}$ and $C^{o}$ are the masked part and the observed part of $C$, respectively. Besides, we provide additional training signals related to the downstream task during the continual pre-training process. Specifically, we expect the vectorized representation of the query $Q$ to be close to its corresponding label $L$, but far from any incorrect label $L^{-}$. To achieve this, we define the similarity loss as:
(13)
where the incorrect labels $L^{-}$ range over the set of all labels in a training batch other than $L$.
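One common way to realize a loss that pulls a query toward its own label and pushes it away from the other labels in the batch is a margin-based hinge over in-batch negatives. The Python sketch below assumes that form, which may differ from the loss actually used by the authors; the function names, the averaging over queries, and the way the two loss terms are combined (the XMLM loss scaled by a weight `lam`, plus the similarity loss) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_loss(query_vecs, label_vecs, margin):
    """Hinge loss over in-batch negatives (an assumed form).

    query_vecs[i] is paired with label_vecs[i]; every other label in the
    batch is treated as an incorrect label for query i.
    """
    total = 0.0
    for i, q in enumerate(query_vecs):
        pos = cosine(q, label_vecs[i])
        for j, other in enumerate(label_vecs):
            if j != i:
                total += max(0.0, margin - pos + cosine(q, other))
    return total / len(query_vecs)


def total_loss(xmlm_loss, sim_loss, lam):
    """Weighted combination of the two terms (assumed form of the objective)."""
    return lam * xmlm_loss + sim_loss


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = similarity_loss(queries, [[1.0, 0.0], [0.0, 1.0]], margin=0.3)
swapped = similarity_loss(queries, [[0.0, 1.0], [1.0, 0.0]], margin=0.3)
combined = total_loss(2.0, swapped, lam=0.5)
```

When every query already matches its own label and is orthogonal to the others, the hinge is inactive and the loss is zero; swapping the labels makes every term active.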
Table 1: Characteristics of our business corpus. “Train/Dev/Test” are the original data without code-switching. Columns: AliExpress (Ar, En, Zh), LAZADA (Id, Ms, Fil, Th), and DARAZ (Ur, Bn, Ne, Si); rows: Train, Dev, and Test, with sizes given in thousands (K).
Table 2: The code-switching rate of the queries for LAZADA. “Mixed” stands for code-switched queries. Columns: code-switching rate (offline) and code-switching rate (online), each split into Mixed, English, and Native (%); rows: Indonesian (Id), Malay (Ms), Filipino (Fil), and Thai (Th).
Table 3: Hyper-parameter settings.
| Parameter | Value |
|---|---|
| Word Embedding | |
| Vocabulary Size | K |
| Dropout | |
| Learning Rate | |
| Margin | |
| Optimizer | Adam |
| Masking Probability | 0.2 |
| Code-switching Rate | |
3.2 Building Code-switched Data for SR
We aim to improve the SR model in the business scenario. Since the LAZADA corpus includes many code-switched sentences, we build the code-switched data using authentic business corpora and language features.
During the construction of the code-switched data, we replace a percentage of the tokens in the query $Q$ and the label $L$ with the corresponding words in other languages, based on openly available multi-lingual lexicon-level dictionaries.
Table 4: Accuracy (Top-30 queries) of our method and the baseline systems on the business corpora. Ar, En, and Zh belong to AliExpress; Id, Ms, Fil, and Th to LAZADA; Ur, Bn, Ne, and Si to DARAZ.
| Model | Ar | En | Zh | Id | Ms | Fil | Th | Ur | Bn | Ne | Si | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mBERT | 79.6 | 78.0 | 89.4 | 55.3 | 53.6 | 70.4 | 71.1 | 83.5 | 82.3 | 56.7 | 75.8 | 72.3 |
| Unicoder | 64.3 | 69.2 | 79.9 | 46.0 | 48.8 | 64.4 | 62.1 | 74.6 | 75.3 | 48.7 | 65.8 | 63.6 |
| XLMR | 81.1 | 81.0 | 90.1 | 68.3 | 59.9 | 71.2 | 82.1 | 85.6 | 84.5 | 65.2 | 70.2 | 76.3 |
| SimCSE-BERT | 72.2 | 79.0 | 52.0 | 49.3 | 56.5 | 72.2 | 78.5 | 82.8 | 83.1 | 57.6 | 76.0 | 69.0 |
| InfoXLM | 79.7 | 82.5 | 89.6 | 69.0 | 58.1 | 75.4 | 80.4 | 82.7 | 80.7 | 60.2 | 76.8 | 75.9 |
| VECO | 85.9 | 83.0 | 91.4 | 68.7 | 58.7 | 75.3 | 82.3 | 87.3 | 87.6 | 61.4 | 81.4 | 78.1 |
| CMLM | 84.3 | 80.4 | 91.2 | 65.3 | 57.8 | 74.3 | 76.7 | 85.4 | 87.5 | 61.6 | 81.4 | 76.9 |
| LaBSE | 85.8 | 81.1 | 91.4 | 65.8 | 59.8 | 75.7 | 76.7 | 85.7 | 85.6 | 62.9 | 81.6 | 77.5 |
| Ours | 89.0 | 82.6 | 93.9 | 73.7 | 62.8 | 78.7 | 83.5 | 88.5 | 90.9 | 67.2 | 84.0 | 81.3 |
The code-switched cross-lingual corpus consists of two parts, $Q$ and $L$. The original $Q$ and $L$ are formulated as follows:
$$Q = (q_{1}, q_{2}, \ldots, q_{n}), \tag{14}$$
$$L = (l_{1}, l_{2}, \ldots, l_{m}), \tag{15}$$
where $n$ and $m$ represent the lengths of $Q$ and $L$, respectively. We replace tokens in $Q$ and $L$ with words from the languages frequently used on Alibaba's overseas cross-border e-commerce platforms. The newly constructed data are as follows:
$$\tilde{Q} = (\tilde{q}_{1}, \tilde{q}_{2}, \ldots, \tilde{q}_{n}), \tag{16}$$
$$\tilde{L} = (\tilde{l}_{1}, \tilde{l}_{2}, \ldots, \tilde{l}_{m}), \tag{17}$$
where $\tilde{q}_{i}$ and $\tilde{l}_{j}$ denote the tokens after the replacement (for all integers $1 \le i \le n$ and $1 \le j \le m$). As the final step, we combine the newly generated $\tilde{Q}$ and $\tilde{L}$ to build the linguistically motivated code-switched monolingual corpus. Then we continually pre-train our model in a manner similar to ALM.
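The corpus construction described above can be sketched as a small Python routine that code-switches both sides of each query-label pair. This is an illustration under assumptions: the lexicon format (a word mapped to a list of candidate translations), the uniform choice among candidates, and the names `switch_tokens` and `build_code_switched_corpus` are hypothetical; the example tokens are those shown in Figure 2.

```python
import random


def switch_tokens(tokens, lexicon, rate, rng):
    """Replace tokens that have lexicon entries with probability `rate`."""
    out = []
    for tok in tokens:
        candidates = lexicon.get(tok)
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))  # pick one candidate translation
        else:
            out.append(tok)
    return out


def build_code_switched_corpus(pairs, lexicon, rate, seed=0):
    """Code-switch the query and the label of every (query, label) pair."""
    rng = random.Random(seed)
    corpus = []
    for query, label in pairs:
        corpus.append((switch_tokens(query, lexicon, rate, rng),
                       switch_tokens(label, lexicon, rate, rng)))
    return corpus


# Toy lexicon with a single candidate per word.
lexicon = {"status": ["状态"], "failed": ["失败"], "delivery": ["运送"]}
pairs = [(["delivery", "status", "failed"], ["delivery", "failed"])]
corpus = build_code_switched_corpus(pairs, lexicon, rate=1.0)
```

Passing a fixed `seed` keeps the generated corpus reproducible across runs, which is convenient when comparing different code-switching rates.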
4 Experiments
4.1 Setup
Data preparation
The languages selected from the business dataset are Arabic (Ar), English (En), Chinese (Zh), Indonesian (Id), Malay (Ms), Filipino (Fil), Thai (Th), Urdu (Ur), Bengali (Bn), Nepali (Ne), and Sinhala (Si).
Specifically, Ar, En, and Zh originate from the AliExpress corpora, Id, Ms, Fil, and Th from the LAZADA corpora, and Ur, Bn, Ne, and Si from the DARAZ corpora.
The characteristics of our business corpora are shown in Table 1.
Among them, the LAZADA corpus is a code-switched dataset, and we report its code-switching rates on offline and online data separately (see Table 2).
We also explore the SR task using the Quora Duplicate Questions Dataset (https://quoradata.quora.com/First-Quora-DatasetRelease-Question-Pairs) with the Faiss toolkit (Johnson et al., 2021) (https://github.com/facebookresearch/faiss).
Then we evaluate the model performance using the mean reciprocal rank (MRR) to validate the effectiveness of different approaches.
Additionally, we conduct our experiments on the semantic textual similarity (STS) task using the SentEval toolkit (Conneau et al., 2017) for evaluation.
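The mean reciprocal rank mentioned above can be computed as in the following Python sketch. It assumes a single gold item per query and a cutoff `k` (Table 6 reports MRR@10); the function name and the toy identifiers are made up for illustration.

```python
def mrr_at_k(ranked_lists, gold, k=10):
    """Mean reciprocal rank of the gold item within the top-k results.

    A query whose gold item does not appear in the top k contributes 0.
    """
    total = 0.0
    for ranked, target in zip(ranked_lists, gold):
        for rank, item in enumerate(ranked[:k], start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(gold)


# Three toy queries whose gold item is "a" at rank 1, rank 2, and missing.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "x"]]
score = mrr_at_k(ranked_lists, ["a", "a", "a"], k=10)
```

The three queries contribute 1, 1/2, and 0, so the toy score is 0.5.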
For model robustness, we conduct an experiment on the openly available English dataset AskUbuntu (https://github.com/taolei87/askubuntu) (Lei et al., 2016).
We also investigate the Tatoeba corpus (Artetxe and Schwenk, 2019) and exploit the BUCC2018 corpus (Zweigenbaum et al., 2017), each covering multiple language pairs; both originate from the well-known and representative benchmark XTREME (https://github.com/google-research/xtreme) (Hu et al., 2020).
In the STS task, we use Spearman’s rank correlation coefficient to measure the correlation between human labels and the calculated similarity (Gao et al., 2021).
We exploit the bilingual dictionaries ConceptNet 5.7.0 (Speer et al., 2017) and MUSE (Lample et al., 2018) when generating the code-switched data.
However, for the English corpus, we keep the original English sentences and do not apply code-switching.
We conduct all the experiments on Zh without Chinese word segmentation (for AliExpress) and without converting the text into simplified script for the BUCC corpora. The hyper-parameters are shown in Table 3.
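The Spearman's rank correlation used for the STS evaluation can be sketched in plain Python as the Pearson correlation of the rank vectors, with tied values receiving their average rank. The helper names and the toy scores are assumptions of this sketch; an established statistics library would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [5.0, 3.2, 1.0, 4.1]   # toy gold similarity labels
model = [0.9, 0.6, 0.1, 0.8]   # toy cosine similarities
rho = spearman(human, model)
```

Only the ordering of the scores matters, so a model whose similarities are monotone in the human labels reaches a correlation of 1 regardless of scale.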
Baselines
To further verify the effectiveness of our method, we compare the proposed approach with the following closely related methods:
• mBERT (Devlin et al., 2019) is a Transformer-based multi-lingual bidirectional encoder representation that is pre-trained with multi-lingual masked language modeling (MMLM) on monolingual corpora.
• Unicoder (Huang et al., 2019) takes advantage of a multi-task learning framework to learn cross-lingual semantic representations from monolingual and parallel corpora, which yields better results on downstream tasks.
• XLM-R (Conneau et al., 2020) is more efficient than XLM and trains MMLM on a huge amount of monolingual data from Common Crawl (Wenzek et al., 2020), which covers 100 languages.
• SimCSE (Gao et al., 2021) proposes a self-predictive contrastive learning method that takes an input sentence and predicts itself as the objective.
• InfoXLM (Chi et al., 2021) is an efficient method for cross-lingual model training that adds a contrastive learning constraint.
• VECO (Luo et al., 2021) obtains better results on both generation and understanding tasks by introducing a variable encoder-decoder framework.
• CMLM (Yang et al., 2021), conditional MLM, is a fully unsupervised learning method that effectively learns sentence representations from a huge amount of unlabeled data by integrating sentence representation learning into MLM training.
• LaBSE (Feng et al., 2022a) adapts mBERT to generate language-agnostic sentence embeddings for 109 languages and is pre-trained by combining MLM and TLM with a translation ranking task using bi-directional dual encoders.
Figure 3: Average accuracy (Top-10/20 queries) on the business dataset.
Figure 4: Sentence embedding performance on STS tasks, measured by average Spearman’s rank correlation.
4.2 Main Results
SR Results on Business Data
Table 4 shows the retrieval results of the proposed method on the AliExpress, LAZADA, and DARAZ corpora, evaluated by accuracy on the Top-30 retrieved queries.
Our experiment consists of two steps: first, we continually pre-train the models by combining the queries and labels in the training set; then, we fine-tune on each language using its own train and dev sets.
Unlike the baselines, in our continual pre-training step we use code-switched queries and labels instead of the original data to train our model.
Among the baselines, VECO achieves better results on almost every language in the business corpora.
The code-switching method has the most positive effect on Id and Fil (the corpora of these two languages are highly code-switched; see Table 2) and on Bn and Si (DARAZ), but brings fewer benefits for Zh (AliExpress). As we do not leverage any code-switched data for En, we obtain a lower score than VECO on En.
Nevertheless, our approach consistently outperforms all the baselines on every language except En.
As depicted in Figure 3, we also evaluate the models with accuracy scores on Top-10 and Top-20 queries (for details, see Tables 7 and 8 in the Appendix).
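The Top-k accuracy reported in these comparisons can be sketched as follows, assuming that a query counts as correct when its gold item appears anywhere among the top k retrieved results and that scores are reported as percentages. Both assumptions, along with the function name and toy data, are illustrative rather than taken from the evaluation scripts.

```python
def accuracy_at_k(ranked_lists, gold, k):
    """Percentage of queries whose gold item is within the top-k results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, gold)
               if target in ranked[:k])
    return 100.0 * hits / len(gold)


# Gold item "a" is at rank 1, at rank 3, and absent, respectively.
ranked_lists = [["a", "b", "c"], ["b", "c", "a"], ["c", "b", "d"]]
gold = ["a", "a", "a"]
acc_top1 = accuracy_at_k(ranked_lists, gold, k=1)
acc_top3 = accuracy_at_k(ranked_lists, gold, k=3)
```

Raising k can only keep or increase the score, which is why accuracy at Top-30 is a more lenient measure than accuracy at Top-10 or Top-20.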
Results of Semantic Textual Similarity (STS)
As depicted in Figure 4, we further verify the performance of our proposed approach on STS, a task closely related to SR.
In this experiment, all of the test sets include only English sentences. Thus, we continually pre-train each baseline on the BUCC2018 corpora by combining all the English monolingual datasets.
For a fair comparison, we also use the BUCC data to continually pre-train our model.
We evaluate the baselines and our model on the test sets using only the continually pre-trained model instead of the fine-tuned model.
The presented model obtains consistent improvements on all test sets.
We provide more details in Table 9 (see Appendix).
Table 5: Comparison under various evaluation metrics on the AskUbuntu corpus. “P@1” and “P@5” denote the precision on the Top-1 query and the Top-5 queries, respectively. “Acc.” represents the accuracy on the Top-30 queries.
| Model | P@1 | P@5 | Acc. |
|---|---|---|---|
| mBERT | 49.0 | 41.2 | 54.6 |
| Unicoder | 52.2 | 38.1 | 45.5 |
| XLMR | 55.4 | 43.4 | 60.1 |
| SimCSE-BERT | 53.2 | 42.6 | 56.1 |
| InfoXLM | 52.7 | 43.2 | 55.8 |
| VECO | 53.2 | 41.8 | 59.8 |
| CMLM | 53.2 | 41.0 | 59.6 |
| LaBSE | 54.8 | 42.8 | 59.3 |
| Ours | 57.5 | 43.8 | 61.1 |
Table 6: The comparison with MRR@10 evaluation metric on the Quora Duplicate Questions dataset. The meaning of “Acc." is similar to that in Table 5.
Model
MRR
Acc.
mBERT
0.436
0.506
Unicoder
0.253
XLMR
0.547
SimCSE-BERT
0.453
0.538
InfoXLM
0.575
0.620
VECO
0.579
CMLM
0.551
0.605
LaBSE
0.493
0.560
Ours
0.584
0.644
Figure 5: The effect of the similarity loss, the XMLM weight $\lambda$, and the code-switching rate in our model on the business corpora, measured by accuracy (Top-30 queries). (a) “w/” and “w/o” denote the accuracy with and without the similarity loss. (b) and (c) show the average accuracy for different values of $\lambda$ and of the code-switching rate, respectively.
Verification of Robustness
As shown in Table 5, we further explore the performance of our model on the openly available AskUbuntu corpus (Lei et al., 2016) and the Tatoeba benchmark (Artetxe and Schwenk, 2019).
In this experiment, we conduct continual pre-training and fine-tuning using only the AskUbuntu corpus, without other datasets.
We evaluate all the baselines and our model using three evaluation metrics, P@1, P@5, and accuracy, computed on the Top-1, Top-5, and Top-30 queries, respectively.
Moreover, we verify the effectiveness of our approach on the Tatoeba benchmark and obtain remarkably better results than the baselines. For more details, see Table 10 in the Appendix.
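The P@1 and P@5 metrics above follow the usual definition of precision at k: the fraction of the top k retrieved items that are relevant, averaged over queries. A minimal Python sketch under that standard definition is given below; the question identifiers and the function names are made up, and the exact evaluation script used for AskUbuntu may differ in details.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def mean_precision_at_k(ranked_lists, relevant_sets, k):
    """Precision at k averaged over all queries."""
    scores = [precision_at_k(r, rel, k)
              for r, rel in zip(ranked_lists, relevant_sets)]
    return sum(scores) / len(scores)


# One toy query with three relevant questions, two of them retrieved.
ranked = ["q3", "q7", "q1", "q9", "q2"]
relevant = {"q3", "q1", "q5"}
p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_5 = precision_at_k(ranked, relevant, 5)
```

Unlike Top-k accuracy, precision at k penalizes irrelevant items in the list, so P@5 is typically lower than P@1 when only a few relevant questions exist per query.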
4.3 Monolingual Semantic Retrieval
As shown in Table 6, we also verify the retrieval ability of our approach on the openly available Quora Duplicate Questions dataset, which includes only English monolingual data.
Since this corpus provides only a test set, we use the English part of the BUCC18 train set as our monolingual data. Then we continually pre-train each model and evaluate it on the Quora Duplicate Questions dataset without fine-tuning.
On this dataset, InfoXLM obtains better results than the other baselines, but our method outperforms all the baselines, which indicates that our approach retrieves better than comparable methods.
4.4 Ablation Study
The Effect of Similarity Loss
As illustrated in Figure 5(a), the similarity loss is an essential part of our cross-lingual PTM. We observe that it has a positive effect on the performance of our model.
Our approach achieves larger improvements over the baselines when it learns the similarity loss during the pre-training stage than when it does not.
The Effect of $\lambda$
$\lambda$ controls the weight of the XMLM and appears in Equation (10). Figure 5(b) shows the value of $\lambda$ at which our model achieves the best retrieval performance compared with other values.
We provide the detailed results for different values of $\lambda$ in Table 11 (see Appendix).
The Effect of Code-switching
As illustrated in Figure 5(c), we also investigate the effectiveness of code-switching for our method.
First, the performance drops if we keep the data without code-switching (a code-switching rate of 0), which demonstrates the effectiveness of code-switching. Since all the languages in the LAZADA corpus originally include code-switched text, the model may perform worse when trained without code-switched data.
Second, Figure 5(c) shows the code-switching rate at which our model reaches the best average performance (for more details, see Table 12 in the Appendix).
5 Related Work
Semantic Retrieval
Semantic retrieval is an essential task in NLP, which requires the model to calculate the sentence embeddings, and then similar sentences can be retrieved by the embeddings.
Early SR methods are constructed based on traditional word2vec representations (Kiros et al., 2015; Hill et al., 2016). Subsequently, various studies have proposed using siamese networks to perform semantic retrieval (Neculoiu et al., 2016; Kashyap et al., 2016; Bao et al., 2018).
With the wide adoption of pre-trained language models, Reimers and Gurevych (2019) propose Sentence-BERT, which learns sentence embeddings by fine-tuning a siamese BERT network on NLI datasets.
To conduct multi-lingual SR, Reimers and Gurevych (2020) extend Sentence-BERT to its multi-lingual version by knowledge distillation.
To better leverage unlabeled data for SR, Gao et al. (2021) propose SimCSE, which uses contrastive learning to train sentence embeddings.
To improve performances of multi-lingual SR, Chi et al. (2021) propose InfoXLM, which utilizes MLM, TLM, and contrastive learning objectives.
Code-switching
Code-switching is a pre-training technique for improving cross-lingual pre-trained models, and it is used in PTMs for machine translation (Yang et al., 2020c; Lin et al., 2020; Yang et al., 2020a).
For example, Yang et al. (2020c) utilize code-switching on monolingual data by replacing some continuous words with target-language words and letting the model predict the replaced words.
Lin et al. (2020) use code-switching on the source side of the multi-lingual parallel corpora to pre-train an encoder-decoder model for multi-lingual machine translation.
Feng et al. (2022b) mitigate the limitation of the code-switching method for grammatical incoherence and negative effects on token-sensitive tasks.
Yang et al. (2020a) propose ALM for cross-lingual pre-training, which requires the model to predict the masked words in the code-switched sentences. Krishnan et al. (2021) augment monolingual source data by leveraging the multilingual code-switching via random translation to improve the generalizability of large multi-lingual language models.
Besides, code-switching has been utilized in other NLP tasks, including named entity recognition (Singh et al., 2018), question answering (Chandu et al., 2018; Gupta et al., 2018),
universal dependency parsing (Bhat et al., 2018), morphological tagging (Özateş and Çetinoğlu, 2021),
language modeling (Pratapa et al., 2018), automatic speech recognition (Kumar and Bora, 2018), natural language inference (Khanuja et al., 2020) and sentiment analysis (Patwa et al., 2020; Liu et al., 2020; Aparaschivei et al., 2020; Zhang et al., 2021).
To the best of our knowledge, we are the first to utilize code-switching for semantic retrieval.
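As a minimal illustration of the general idea behind dictionary-based code-switching, the sketch below replaces each word that has an entry in a bilingual lexicon with a translation at a given rate. The function name `code_switch`, the toy English-Spanish lexicon, and the per-word replacement rule are assumptions made for illustration only; they do not reproduce the exact procedure of any of the works cited above.

```python
import random


def code_switch(tokens, lexicon, rate, rng):
    """Return a copy of `tokens` where words found in `lexicon` are
    replaced by a translation with probability `rate`."""
    switched = []
    for tok in tokens:
        translations = lexicon.get(tok.lower())
        # Only words with a lexicon entry are eligible for replacement.
        if translations and rng.random() < rate:
            switched.append(rng.choice(translations))
        else:
            switched.append(tok)
    return switched


# Toy English-to-Spanish lexicon (hypothetical, for illustration only).
TOY_LEXICON = {
    "delivery": ["entrega"],
    "failed": ["fallida"],
    "track": ["rastrear"],
}

query = "it says delivery failed when i track it".split()
rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(" ".join(code_switch(query, TOY_LEXICON, rate=0.5, rng=rng)))
```

With `rate=0.0` the sentence is returned unchanged, and with `rate=1.0` every word that has a lexicon entry is replaced; intermediate rates yield mixed-language sentences.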
6 Conclusion and Future work
We introduce a straightforward pre-training approach for sentence-level semantic retrieval that uses code-switched cross-lingual data, targeting the FAQ system in the task-oriented QA dialogue scenario for e-commerce.
Intuitively, code-switching is an increasingly common form of communication in both bilingual and multi-lingual regions.
Our experimental results show that the proposed approach outperforms the previous, closely related baseline systems on semantic retrieval and semantic textual similarity across three business corpora and four open corpora under multiple evaluation metrics.
In future work, we will extend our method to other natural language understanding tasks. We will also explore embedding distance metrics other than cosine similarity.
References
Aparaschivei et al. (2020)
Lavinia Aparaschivei, Andrei Palihovici, and Daniela Gîfu. 2020.
FII-UAIC at SemEval-2020 task 9: Sentiment analysis for code-mixed social media text using CNN.
In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 928–933.
Bhat et al. (2018)
Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2018.
Universal Dependency parsing for Hindi-English code-switching.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 987–998, New Orleans, Louisiana. Association for Computational Linguistics.
Brown et al. (2020)
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020.
Language models are few-shot learners.
In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.
Chandu et al. (2018)
Khyathi Raghavi Chandu, Ekaterina Loginova, Vishal Gupta, Josef van Genabith, Günter Neumann, Manoj Kumar Chinnakotla, Eric Nyberg, and Alan W. Black. 2018.
Code-mixed question answering challenge: Crowd-sourcing data and techniques.
In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching@ACL 2018, pages 29–38.
Chi et al. (2020)
Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020.
Cross-lingual natural language generation via pre-training.
In Proceedings of the AAAI conference on artificial intelligence, pages 7570–7577.
Chi et al. (2021)
Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021.
InfoXLM: An information-theoretic framework for cross-lingual language model pre-training.
In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588.
Conneau et al. (2020)
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020.
Unsupervised cross-lingual representation learning at scale.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Conneau and Lample (2019)
Alexis Conneau and Guillaume Lample. 2019.
Cross-lingual language model pretraining.
In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pages 7057–7067.
Devlin et al. (2019)
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
BERT: pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
Feng et al. (2022a)
Fangxiaoyu Feng, Yinfei Yang, Daniel Matthew Cer, N. Arivazhagan, and Wei Wang. 2022a.
Language-agnostic BERT sentence embedding.
In ACL.
Gao et al. (2021)
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021.
SimCSE: Simple contrastive learning of sentence embeddings.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910.
Gupta et al. (2018)
Vishal Gupta, Manoj Kumar Chinnakotla, and Manish Shrivastava. 2018.
Transliteration better than translation? answering code-mixed questions over a knowledge base.
In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching@ACL 2018, pages 39–50.
Hill et al. (2016)
Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016.
Learning distributed representations of sentences from unlabelled data.
In The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377.
Hu et al. (2020)
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020.
XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.
ArXiv, abs/2003.11080.
Huang et al. (2019)
Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and M. Zhou. 2019.
Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks.
In EMNLP.
Johnson et al. (2021)
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021.
Billion-scale similarity search with GPUs.
IEEE Transactions on Big Data, 7:535–547.
Kashyap et al. (2016)
Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi, and Tim Finin. 2016.
Robust semantic text similarity using LSA, machine learning, and linguistic resources.
Language Resources and Evaluation, 50:125–161.
Khanuja et al. (2020)
Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, and Monojit Choudhury. 2020.
A new dataset for natural language inference from code-mixed conversations.
In CALCS.
Kiros et al. (2015)
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015.
Skip-thought vectors.
In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 3294–3302.
Kumar and Bora (2018)
Ritesh Kumar and Manas Jyoti Bora. 2018.
Part-of-speech annotation of English-Assamese code-mixed texts: Two approaches.
In Proceedings of the First International Workshop on Language Cognition and Computational Models, pages 94–103.
Lample et al. (2018)
Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018.
Word translation without parallel data.
In 6th International Conference on Learning Representations.
Lei et al. (2016)
Tao Lei, Hrishikesh Joshi, R. Barzilay, T. Jaakkola, K. Tymoshenko, Alessandro Moschitti, and Lluís Màrquez i Villodre. 2016.
Semi-supervised question retrieval with gated convolutions.
In NAACL.
Lin et al. (2020)
Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2020.
Pre-training multilingual neural machine translation by leveraging alignment information.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2649–2663.
Liu et al. (2020)
Jiaxiang Liu, Xuyi Chen, Shikun Feng, Shuohuan Wang, Xuan Ouyang, Yu Sun, Zhengjie Huang, and Weiyue Su. 2020.
Kk2018 at SemEval-2020 task 9: Adversarial training for code-mixing sentiment classification.
In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 817–823.
Liu et al. (2019)
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A robustly optimized BERT pretraining approach.
arXiv preprint arXiv:1907.11692.
Luo et al. (2021)
Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. 2021.
VECO: Variable and flexible cross-lingual pre-training for language understanding and generation.
In ACL.
Neculoiu et al. (2016)
Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016.
Learning text similarity with siamese recurrent networks.
In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157.
Özateş and Çetinoğlu (2021)
Şaziye Betül Özateş and Özlem Çetinoğlu. 2021.
A language-aware approach to code-switched morphological tagging.
In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 72–83, Online. Association for Computational Linguistics.
Patwa et al. (2020)
Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020.
SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets.
In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, pages 774–790.
Peters et al. (2018)
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018.
Deep contextualized word representations.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237.
Pratapa et al. (2018)
Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, and Kalika Bali. 2018.
Language modeling for code-mixing: The role of linguistic theory based synthetic data.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1543–1553.
Radford et al. (2018)
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018.
Improving language understanding with unsupervised learning.
Radford et al. (2019)
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask learners.
Reimers and Gurevych (2019)
Nils Reimers and Iryna Gurevych. 2019.
Sentence-BERT: Sentence embeddings using siamese BERT-networks.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3980–3990.
Reimers and Gurevych (2020)
Nils Reimers and Iryna Gurevych. 2020.
Making monolingual sentence embeddings multilingual using knowledge distillation.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4512–4525.
Siddhant et al. (2020)
Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2020.
Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation.
In Proceedings of the AAAI conference on artificial intelligence, pages 8854–8861.
Singh et al. (2018)
Vinay Singh, Deepanshu Vijay, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018.
Named entity recognition for Hindi-English code-mixed social media text.
In Proceedings of the Seventh Named Entities Workshop, pages 27–35.
Speer et al. (2017)
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017.
ConceptNet 5.5: An open multilingual graph of general knowledge.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4444–4451.
Tran et al. (2020)
Chau Tran, Yuqing Tang, Xian Li, and Jiatao Gu. 2020.
Cross-lingual retrieval for iterative self-supervised training.
Advances in Neural Information Processing Systems, 33:2207–2219.
Wei et al. (2021)
Xiangpeng Wei, Yue Hu, Rongxiang Weng, Luxi Xing, Heng Yu, and Weihua Luo. 2021.
On learning universal representations across languages.
ICLR.
Wenzek et al. (2020)
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020.
CCNet: Extracting high quality monolingual datasets from web crawl data.
In LREC.
Xiong et al. (2021)
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021.
Approximate nearest neighbor negative contrastive learning for dense text retrieval.
In 9th International Conference on Learning Representations.
Xu et al. (2022)
Wenshen Xu, Mieradilijiang Maimaiti, Yuanhang Zheng, Xin Tang, and Ji Zhang. 2022.
Auto-MLM: Improved contrastive learning for self-supervised multi-lingual knowledge retrieval.
arXiv preprint arXiv:2203.16187.
Yang et al. (2020a)
Jian Yang, Shuming Ma, Dongdong Zhang, Shuangzhi Wu, Zhoujun Li, and Ming Zhou. 2020a.
Alternating language modeling for cross-lingual pre-training.
In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 9386–9393.
Yang et al. (2019)
Yinfei Yang, Gustavo Hernández Ábrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019.
Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax.
In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 5370–5378. ijcai.org.
Yang et al. (2020b)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2020b.
Multilingual universal sentence encoder for semantic retrieval.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94, Online. Association for Computational Linguistics.
Yang et al. (2020c)
Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020c.
CSP: code-switching pre-training for neural machine translation.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2624–2636.
Yang et al. (2021)
Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2021.
Universal sentence representation learning with conditional masked language model.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6216–6228, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Zweigenbaum et al. (2017)
Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017.
Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora.
In BUCC@ACL.
Figure 6: The comparison of the retrieved labels among the baselines and our proposed method, where the y-axis and the x-axis denote the user query and the retrieved label, respectively. Panels: (a) CMLM, (b) XLMR, (c) LaBSE, (d) InfoXLM, (e) VECO, (f) Ours.
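Table 7: The comparison with accuracy score (Top-10 queries) between cross-lingual sentence retrieval baseline systems on AliExpress, LAZADA, and DARAZ corpora. Language columns are grouped by corpus: AliExpress (Ar, En, Zh), LAZADA (Id, Ms, Fil, Th), and DARAZ (Ur, Bn, Ne, Si).

| Model | Ar | En | Zh | Id | Ms | Fil | Th | Ur | Bn | Ne | Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 56.3 | 55.2 | 78.0 | 32.1 | 29.8 | 46.9 | 48.1 | 67.6 | 67.4 | 35.0 | 51.6 | 51.6 |
| Unicoder | 43.1 | 49.8 | 66.4 | 25.2 | 24.7 | 41.7 | 41.9 | 59.3 | 56.5 | 27.9 | 38.8 | 43.2 |
| XLMR | 65.1 | 61.0 | 79.1 | 45.7 | 36.0 | 47.1 | 59.4 | 68.2 | 66.5 | 43.5 | 44.6 | 56.0 |
| SimCSE-BERT | 48.4 | 58.8 | 40.3 | 27.4 | 31.5 | 48.5 | 58.1 | 66.0 | 68.8 | 36.9 | 50.2 | 48.6 |
| InfoXLM | 60.2 | 61.7 | 76.9 | 44.7 | 33.8 | 51.6 | 61.3 | 66.5 | 64.6 | 37.3 | 49.4 | 55.3 |
| VECO | 67.4 | 61.7 | 81.7 | 47.6 | 35.3 | 52.8 | 60.0 | 70.7 | 70.6 | 40.3 | 52.4 | 58.2 |
| CMLM | 67.6 | 62.9 | 80.1 | 41.2 | 35.8 | 49.0 | 55.9 | 71.3 | 72.0 | 38.4 | 56.0 | 57.3 |
| LaBSE | 71.1 | 61.3 | 81.0 | 41.5 | 35.8 | 52.1 | 55.5 | 71.8 | 74.1 | 40.7 | 57.4 | 58.4 |
| Ours | 73.1 | 64.5 | 85.1 | 50.1 | 38.4 | 56.6 | 62.2 | 75.1 | 75.8 | 47.8 | 59.6 | 62.6 |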
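Table 8: The comparison with accuracy score (Top-20 queries) between cross-lingual sentence retrieval baseline systems on AliExpress, LAZADA, and DARAZ corpora. Language columns are grouped by corpus: AliExpress (Ar, En, Zh), LAZADA (Id, Ms, Fil, Th), and DARAZ (Ur, Bn, Ne, Si).

| Model | Ar | En | Zh | Id | Ms | Fil | Th | Ur | Bn | Ne | Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 71.8 | 71.8 | 85.5 | 46.1 | 42.2 | 62.2 | 63.3 | 78.2 | 78.6 | 48.2 | 69.2 | 65.2 |
| Unicoder | 55.6 | 63.0 | 74.2 | 36.9 | 37.3 | 56.4 | 54.6 | 69.0 | 68.6 | 40.7 | 56.6 | 55.7 |
| XLMR | 74.1 | 73.8 | 87.2 | 60.9 | 49.4 | 62.3 | 74.2 | 80.3 | 79.5 | 57.6 | 60.2 | 69.0 |
| SimCSE-BERT | 63.3 | 71.9 | 47.5 | 40.6 | 47.2 | 64.4 | 73.7 | 78.4 | 79.2 | 50.6 | 66.6 | 62.1 |
| InfoXLM | 73.4 | 75.8 | 85.8 | 60.9 | 45.9 | 66.7 | 72.6 | 77.4 | 75.9 | 51.1 | 66.8 | 68.4 |
| VECO | 80.4 | 76.3 | 88.3 | 61.1 | 47.9 | 68.1 | 73.8 | 82.4 | 82.3 | 53.4 | 69.0 | 71.2 |
| CMLM | 80.3 | 74.9 | 88.4 | 56.4 | 47.1 | 64.9 | 69.0 | 80.9 | 82.8 | 52.5 | 72.0 | 69.9 |
| LaBSE | 82.2 | 73.6 | 88.3 | 58.4 | 49.5 | 67.3 | 68.8 | 81.4 | 81.1 | 54.9 | 74.2 | 70.9 |
| Ours | 85.3 | 77.2 | 91.4 | 67.3 | 52.3 | 71.4 | 76.7 | 85.1 | 86.3 | 60.0 | 76.4 | 75.4 |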
Table 9: The comparison of sentence embedding performance on STS tasks. “STS12-STS16”, “STS-B”, and “SICK-R” denote SemEval2012-2016, STS benchmark, and SICK relatedness dataset, respectively.

| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 42.46 | 62.14 | 52.35 | 65.36 | 66.20 | 60.51 | 60.87 | 58.56 |
| Unicoder |  |  |  |  |  |  |  |  |
| XLMR |  |  |  |  |  |  |  |  |
| SimCSE-BERT | 42.35 | 67.34 | 57.20 | 70.36 | 69.41 | 59.86 | 63.77 | 61.47 |
| InfoXLM | 32.23 | 52.35 | 39.42 | 52.04 | 60.82 | 54.04 | 59.61 | 50.07 |
| VECO | 41.76 | 60.75 | 52.21 | 64.79 | 67.26 | 58.93 | 61.17 | 58.12 |
| CMLM | 30.14 | 61.77 | 47.45 | 61.25 | 62.73 | 53.23 | 56.62 | 53.31 |
| LaBSE | 47.43 | 64.13 | 55.72 | 69.66 | 64.21 | 57.60 | 60.68 | 59.92 |
| Ours | 44.53 | 68.20 | 55.99 | 71.39 | 66.07 | 66.32 | 68.03 | 62.93 |
Appendix A Case Study
To further demonstrate the performance of the proposed approach, we visualize the results retrieved on the business corpus by our method and by the previously introduced SOTA pre-trained models for cross-lingual scenarios (i.e., the five best-performing baselines).
As illustrated in Figure 6, the baselines fail to retrieve the key information “delivery failed” from the user query “it says delivery failed when i track it”, while our proposed method can retrieve such vital information, which indicates that our model can improve the performance of sentence retrieval in the business domain.
Table 10: The comparison with accuracy score (Top-30 queries) on Tatoeba corpus for each language.
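Table 11: The effect of λ on business corpora with accuracy score (Top-30 queries). Language columns are grouped by corpus: AliExpress (Ar, En, Zh), LAZADA (Id, Ms, Fil, Th), and DARAZ (Ur, Bn, Ne, Si).

| λ | Ar | En | Zh | Id | Ms | Fil | Th | Ur | Bn | Ne | Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 87.1 | 82.2 | 91.1 | 68.3 | 60.8 | 77.7 | 80.2 | 88.0 | 90.4 | 60.2 | 84.0 | 79.1 |
|  | 85.4 | 81.9 | 91.1 | 70.9 | 58.8 | 77.6 | 82.2 | 87.9 | 88.4 | 65.2 | 81.6 | 79.2 |
|  | 81.1 | 81.7 | 90.6 | 70.7 | 59.3 | 78.1 | 76.8 | 88.2 | 89.8 | 62.0 | 83.6 | 78.3 |
|  | 82.9 | 81.2 | 91.7 | 70.9 | 58.8 | 74.3 | 82.3 | 87.6 | 88.6 | 63.0 | 83.8 | 78.6 |
|  | 87.2 | 80.2 | 90.6 | 70.8 | 60.6 | 72.8 | 82.2 | 88.1 | 89.2 | 64.9 | 82.3 | 79.0 |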
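Table 12: The effect of the code-switching rate on business corpora with accuracy score (Top-30 queries). Language columns are grouped by corpus: AliExpress (Ar, En, Zh), LAZADA (Id, Ms, Fil, Th), and DARAZ (Ur, Bn, Ne, Si).

| Code-switching rate | Ar | En | Zh | Id | Ms | Fil | Th | Ur | Bn | Ne | Si | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 86.7 | 82.2 | 89.9 | 67.5 | 58.6 | 75.4 | 77.4 | 82.7 | 87.9 | 57.3 | 76.8 | 76.6 |
|  | 89.0 | 82.2 | 92.4 | 73.7 | 61.3 | 78.4 | 80.2 | 84.6 | 90.8 | 60.2 | 79.0 | 79.3 |
|  | 87.1 | 82.2 | 91.4 | 68.3 | 60.8 | 77.4 | 80.2 | 88.0 | 90.4 | 60.2 | 84.0 | 79.1 |
|  | 84.5 | 82.2 | 92.4 | 70.5 | 59.2 | 76.6 | 81.7 | 87.7 | 89.3 | 66.1 | 74.4 | 78.6 |
|  | 86.2 | 82.2 | 92.3 | 69.8 | 61.3 | 75.8 | 79.8 | 82.3 | 84.2 | 65.2 | 78.0 | 77.9 |
|  | 85.1 | 82.2 | 91.9 | 69.7 | 59.3 | 76.4 | 81.1 | 85.9 | 79.9 | 65.5 | 78.4 | 77.8 |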
Appendix B Results on Business Datasets
As shown in Table 7 and Table 8, we evaluate the retrieval ability of our approach on the AliExpress, LAZADA, and DARAZ corpora by measuring accuracy on the Top-10 and Top-20 retrieved queries, respectively.
The experiment consists of two steps: first, we continually pre-train the models on the combined queries and labels of the training set; then we fine-tune on each language using its own training and development sets.
Similar to the results on Top-30 queries (see Table 4), VECO again ranks among the strongest baseline systems. However, our proposed model achieves better results than all baselines on every language of the three business datasets.
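To make the Top-k metric concrete, the following sketch computes the fraction of queries whose gold label appears among the k candidates ranked highest by cosine similarity. It is a minimal illustration under our own assumptions: the helper names (`cosine`, `top_k_accuracy`) and the toy two-dimensional vectors are hypothetical, and the exact evaluation protocol behind Tables 7 and 8 may differ in its details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_accuracy(query_vecs, gold_ids, cand_vecs, k):
    """Fraction of queries whose gold candidate index is ranked in the top k."""
    hits = 0
    for q, gold in zip(query_vecs, gold_ids):
        # Rank candidate indices by descending similarity to the query.
        ranked = sorted(
            range(len(cand_vecs)),
            key=lambda i: cosine(q, cand_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy embeddings (hypothetical): three candidate labels, three queries.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
gold = [0, 1, 2]

print(top_k_accuracy(queries, gold, candidates, k=1))
print(top_k_accuracy(queries, gold, candidates, k=2))
```

In this toy setting, two of the three queries rank their gold label first and all three have it within the top two, so accuracy can only stay the same or increase as k grows, which is consistent with the higher scores in Table 8 than in Table 7.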
Appendix C Results on the Task of STS
Semantic retrieval (SR) and semantic textual similarity (STS) differ slightly as tasks, but both rely on sentence-level representations as their backbone.
Therefore, we also investigate the STS task.
As shown in Table 9, we compare the semantic representation ability of the baseline models and our proposed method on STS.
In this experiment, all of the test sets (STS12, STS13, STS14, STS15, STS16, STS-B, and SICK-R) contain only English sentences. Thus, we evaluate the baselines and our model using only the continually pre-trained model rather than the fine-tuned model.
We use Spearman’s rank correlation coefficient to measure the quality of all models.
Among the baseline systems, SimCSE-BERT obtains the highest average score (61.47), while our method achieves the best overall average (62.93) and the best scores on STS13, STS15, STS-B, and SICK-R.
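For reference, Spearman’s rank correlation between predicted similarity scores and gold annotations can be computed as the Pearson correlation of their ranks. The sketch below is a self-contained illustration with hypothetical scores; the function names are our own, and we assume (it is not stated here) that the values in Table 9 are correlations scaled by 100.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical model similarities and gold scores for four sentence pairs.
predicted = [0.91, 0.15, 0.55, 0.70]
gold_scores = [4.8, 0.5, 3.9, 3.0]
print(spearman(predicted, gold_scores))
```

Because only the ordering of the scores matters, this metric is insensitive to the scale of the similarity function, which makes it convenient for comparing encoders whose cosine similarities occupy different ranges.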
Appendix D Results on Publicly Open Corpora
We also conduct experiments on two well-known and widely used public benchmarks for sentence-level semantic retrieval, AskUbuntu and Tatoeba.
In contrast with the experiment on the AskUbuntu dataset, in the Tatoeba experiment we continually pre-train all the models on the BUCC2018 corpus.
Because the BUCC2018 corpus contains a training set and a development set, for a fair comparison we also fine-tune our continually pre-trained model on BUCC2018.
Moreover, the Tatoeba dataset covers more than 40 languages (shown with their ISO 639-1 codes for brevity). In our experiment, we choose only languages that do not appear in our business corpora: Afrikaans (af), German (de), Spanish (es), French (fr), Italian (it), Japanese (ja), Kazakh (kk), Dutch (nl), Portuguese (pt), Swahili (sw), and Telugu (te).
Table 10 reports the accuracy on Tatoeba for each of these 11 languages.
Appendix E The Effect of the Hyper-parameter
As shown in Table 11, we further validate different values of λ on both the business data and the publicly available open corpora.
When selecting the value of λ, we fix the code-switching rate during the continual pre-training stage for our model.
We set the value of λ that yields the best experimental results as the default in all experiments.
Additionally, as given in Table 12, we also investigate different values of the code-switching rate.
Similarly, we fix λ during the continual pre-training step to select a proper value of the code-switching rate for our model.
We take the rate at which our proposed approach obtains the highest performance in these experiments as the default code-switching rate.
The values of λ and the code-switching rate are identical for the business and open corpora throughout the experiments.