Building Russian Benchmark for Evaluation of Information Retrieval Models

Abstract

We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval. The benchmark is available for public use on GitHub.

Keywords: information retrieval, benchmark, lexical model, dense model, reranker


Grigory Kovalev
Lomonosov Moscow State University, Russia
kaengreg@ya.ru

Mikhail Tikhomirov
Lomonosov Moscow State University, Russia
tikhomirov.mm@gmail.com

Evgeny Kozhevnikov
Lomonosov Moscow State University, Russia
dovvakkin@gmail.com

Max Kornilov
Lomonosov Moscow State University, Russia
max.korn@bk.ru

Natalia Loukachevitch
Lomonosov Moscow State University, Russia
louk_nat@mail.ru


1 Introduction

Traditionally, information retrieval (IR) relied on lexical models such as TF-IDF and BM25. These are bag-of-words models, which do not take the document context into account. Most recent advances in IR instead leverage neural retrieval models built upon pre-trained Transformer architectures, such as BERT [devlin-etal-2019-bert]. These models address the limitations of traditional lexical methods by capturing semantic relationships and contextual information, enabling them to bridge the lexical gap inherent in keyword-based retrieval approaches. Unlike lexical models, which rely solely on the presence of query terms in documents, neural models represent queries and documents in a dense vector space, facilitating more accurate retrieval through similarity measures like cosine similarity.

Neural retrieval systems have demonstrated significant performance improvements over traditional methods, particularly in tasks such as open-domain question answering, claim verification, and passage retrieval. However, these advancements often come at the cost of increased computational resources and the need for extensive training data. Due to the scarcity of labeled data, neural models are frequently applied in zero-shot settings.

Traditionally, neural retrievers have been trained on large datasets such as MS MARCO [bajaj2016ms] and Natural Questions [hardeniya2016natural]. Before the introduction of BEIR [thakur2021beir], these models were often evaluated on the same datasets they were trained on, gaining a significant advantage over lexical approaches like BM25.

To address this limitation, the authors of BEIR introduced a robust and diverse benchmark designed to evaluate model generalization across tasks and domains. BEIR consists of 18 retrieval datasets from a variety of domains, providing a more accurate and comprehensive framework for evaluating neural retrieval systems. Notably, the results of BEIR revealed that neural models do not consistently outperform lexical approaches, highlighting the need for careful evaluation in diverse settings.

The zero-shot application of neural retrievers is particularly important for underrepresented languages, such as Slavic languages, where the availability of information retrieval datasets is limited. Consequently, there is a growing demand for multilingual evaluation benchmarks akin to the monolingual BEIR framework. Such benchmarks would enable robust cross-lingual evaluation and foster the development of neural retrieval systems for less commonly studied languages, addressing the current gaps in multilingual information retrieval research.

Interestingly, the performance gap between lexical and dense retrieval models remains a topic of interest. Although dense models typically excel in retrieval accuracy, lexical methods such as BM25 offer a lightweight alternative with significantly lower computational overhead. Investigating this trade-off can provide valuable insight into the practical application of retrieval models across diverse scenarios, especially when computational efficiency is a priority.

In this paper, we present RusBEIR, a BEIR-inspired benchmark designed for the zero-shot evaluation of Information Retrieval (IR) models in the Russian language. RusBEIR comprises 17 datasets that span various domains and tasks. Some datasets have been adapted from BEIR and similar benchmarks, others are newly collected specifically for this benchmark or sourced from existing Russian or multilingual benchmarks. Our primary objective is to establish a large-scale benchmark tailored to information retrieval in Russian, with a particular emphasis on zero-shot approaches. In addition, we explore whether neural models consistently outperform traditional lexical methods across diverse scenarios. With this aim, we evaluate a range of models, including BM25, BGE-M3, mE5, RoSBERTa, and LaBSE [chen2024bge, wang2024multilingual, snegirev2024russian, feng2022language]. This benchmark offers a comprehensive resource for advancing and evaluating IR systems in Russian, fostering research into the comparative strengths of neural and lexical approaches.

2 Related work

The information retrieval field has a long history of creating datasets and benchmarks and of organizing evaluation campaigns. Specialized evaluation conferences such as TREC, CLEF, and NTCIR have been held since the 1990s. Numerous national information-retrieval initiatives also exist, for example in Poland [kobylinski2023poleval] and India [ganguly2023proceedings]. In Russia, the ROMIP workshop [dobrov2004russian] served during 2003-2011 as a venue for evaluating approaches to information-retrieval tasks such as ad hoc retrieval, thematic categorization, question answering, and summarization.

Evaluating neural models in information retrieval requires creating benchmarks comprising diverse datasets. Development of information-retrieval benchmarks for non-English languages is usually based on the English BEIR benchmark [thakur2021beir].

The Polish benchmark BEIR-PL [wojtasik2024beir] was created via automatic translation of 13 datasets from BEIR. For translation, the Google Translate service was used. In [dadas2024pirb], the authors describe the Polish Information Retrieval Benchmark (PIRB), encompassing 41 text information retrieval tasks for Polish. The datasets in PIRB comprise the BEIR-PL datasets, several other existing information-retrieval datasets, and also nine datasets crawled from Polish websites. In evaluation, it was found that the best results on the benchmark were achieved by the mE5-large model [wang2024multilingual]. The authors also trained a learning-to-rank model combining scores of several basic models and achieved better results.

To create Dutch BEIR, the authors of [banar2024beir] translated the original BEIR datasets into Dutch using the Gemini-1.5-flash model. To assess the translation quality, ten items from each dataset were randomly sampled and checked by a native Dutch speaker; 98% of the checked samples were translated correctly or with minor issues. The authors tested BM25, neural models (including mE5-large and BGE-M3), and reranking approaches combining BM25 with a neural reranker. They conclude that BM25 still provides a competitive baseline and, in many cases, is outperformed only by larger dense models.

The authors of [acharya2024hindi] created the Hindi-BEIR benchmark, which encompasses 15 diverse datasets from 6 distinct domains. They translated 9 datasets from the source BEIR benchmark using IndicTrans2, a neural translation model supporting translation across all 22 Indic languages (and English). To check the translation quality, the authors back-translated the Hindi translations into English and calculated the character-based chrF(++) score [popovic2015chrf] between the original English query or document and its back-translated counterpart. In addition, 5 publicly available information-retrieval datasets were added to the benchmark. In experiments, neural models (BGE-M3, mE5, LASER, LaBSE) were compared with BM25. The best results were obtained with BGE-M3, which was significantly better than the other approaches.

For Russian, a version of the MTEB benchmark [muennighoff2023mteb] for evaluating embeddings has been created. Russian MTEB comprises 23 datasets in 7 task categories, including three information-retrieval datasets [snegirev2024russian]. The best models with fewer than 1B parameters in the Russian MTEB information-retrieval section are BGE-M3 [chen2024bge] and Multilingual E5-large [wang2024multilingual].

3 Datasets in RusBEIR

RusBEIR is a Russian benchmark inspired by BEIR [thakur2021beir], designed for zero-shot evaluation of Information Retrieval (IR) models. Adhering to the principles of BEIR, it offers a robust and diverse evaluation framework, enabling the assessment of IR models across a wide range of tasks and domains in the Russian language.

The datasets in the RusBEIR benchmark consist of available open-source datasets, datasets that have been translated from English, and newly created datasets. Table 1 provides a description of the available datasets. We will discuss the datasets in more detail in the following subsections.

3.1 Translated BEIR Datasets

BEIR consists of multilingual and monolingual (English) datasets. To make our results comparable with those reported for BEIR and its analogues, we translated monolingual datasets into Russian and evaluated the models used in our benchmark on them.

The choice of translation method was based on studies conducted during the creation of mMARCO, the multilingual MS MARCO dataset [bonifacio2021mmarco], where Google Translate was compared with the Helsinki model, and on the results of similar experiments from the BEIR-PL project [wojtasik2024beir]. According to these studies, Google Translate produced better translations than the Helsinki model, so we chose Google Translate.

As a result, we introduce 4 datasets from the original BEIR datasets [thakur2021beir], which were translated into the Russian language.

  • NF-Corpus is a comprehensive full-text English retrieval dataset designed for medical information retrieval tasks. It contains a collection of queries formulated in non-technical English sourced from NutritionFacts.org and corresponding medical documents written in a complex terminology-heavy language primarily derived from PubMed (https://pubmed.ncbi.nlm.nih.gov), a database of medical literature.

  • ArguAna is a dataset designed for the argument retrieval task, derived from debates on idebate.org. It covers controversial topics across 15 themes, such as “economy” and “health.” The dataset includes a corpus consisting of debate texts and queries derived from these debates. The task is to retrieve relevant arguments from the corpus.

  • SciFact is a dataset for scientific claim verification, consisting of expert-written claims paired with abstracts from research literature. Each abstract is annotated with evidence supporting or refuting the claims, along with rationales justifying the decisions.

  • SCIDOCS is a dataset focused on citation prediction, designed to evaluate the ability of scientific document embeddings to predict citation relationships between research papers.

| Source | Task | Dataset | Origin | Relevancy | Queries (Train / Dev / Test; only reported splits are listed) | Corpus | Avg. Word Lengths (D/Q) |
|---|---|---|---|---|---|---|---|
| BEIR | Bio-Medical IR | rus-NFCorpus | Translation | Binary | 2,590 / 324 / 323 | 3,633 | 216.6 / 3.5 |
| BEIR | Argument Retrieval | rus-ArguAna | Translation | Binary | 1,406 | 8,674 | 147.8 / 173.8 |
| BEIR | Fact Checking | rus-SciFact | Translation | Binary | 809 / 300 | 5,183 | 185.8 / 11.2 |
| BEIR | Citation-Prediction | rus-SCIDOCS | Translation | Binary | 1,000 | 25,657 | 153.1 / 9.8 |
| BEIR | Information-Retrieval | rus-MMARCO | Part of multilingual | Binary | 502,939 / 6,980 | 8,841,823 | 49.6 / 5.95 |
| Open-Source Dataset | Information-Retrieval | rus-MIRACL | Part of multilingual | Binary | 4,683 / 1,252 | 9,543,918 | 43 / 6.2 |
| Open-Source Dataset | Question Answering (QA) | rus-XQuAD | Part of multilingual | Binary | 1,190 | 240 | 112.9 / 8.6 |
| Open-Source Dataset | Question Answering (QA) | rus-XQuAD-sentences | Part of multilingual | Binary | 1,190 | 1,212 | 22.4 / 8.6 |
| Open-Source Dataset | Question Answering (QA) | rus-Tydi QA | Part of multilingual | Binary | 1,162 | 89,154 | 69.4 / 6.5 |
| Open-Source Dataset | Information-Retrieval | SberQuAD-retrieval | Originally Russian | Binary | 45,328 / 5,036 / 23,936 | 17,474 | 100.4 / 8.7 |
| Open-Source Dataset | Information-Retrieval | ruSciBench-retrieval | Originally Russian | Binary | 345 | 200,532 | 89.9 / 9.2 |
| Open-Source Dataset | Question Answering (QA) | ru-facts | Originally Russian | Binary | 2,241 / 753 | 6,236 | 28.1 / 23.9 |
| RU-MTEB | Information-Retrieval | RuBQ | Originally Russian | Binary | 1,692 | 56,826 | 62.07 / 6.4 |
| RU-MTEB | Information-Retrieval | Ria-News | Originally Russian | Binary | 10,000 | 704,344 | 155.2 / 8.8 |
| rusBEIR | Information-Retrieval | wikifacts-articles | Originally Russian | 3-level | 540 | 1,324 | 2,535.9 / 11.4 |
| rusBEIR | Fact Checking | wikifacts-para | Originally Russian | 3-level | 540 | 15,317 | 219.2 / 11.4 |
| rusBEIR | Information-Retrieval | wikifacts-sents | Originally Russian | 3-level | 540 | 188,026 | 17.8 / 11.4 |

Table 1: Overview of datasets and tasks for information retrieval and related fields. All datasets are available at HuggingFace

3.2 Russian Parts of Multilingual Datasets

The main objective of BEIR is to gather a large and diverse set of data from various domains and tasks. This forces models to operate in an out-of-distribution setting and helps to evaluate them more accurately. To expand our collection of Russian datasets, we also extracted the Russian portions of existing multilingual datasets, including mMARCO [bonifacio2021mmarco], MIRACL [zhang2023miracl], XQuAD [artetxe2020cross], and TyDi QA [clark2020tydi].

The mMARCO (Multilingual MS MARCO) dataset [bonifacio2021mmarco] is a multilingual adaptation of the popular MS MARCO dataset, designed for information retrieval and question answering tasks. It extends the original English MS MARCO dataset into multiple languages, including Russian.

MIRACL is a multilingual dataset for information retrieval in 18 languages. The queries were taken mainly from the Mr. TYDI dataset. Passages were retrieved from Wikipedia by an ensemble model, and 10 top documents were annotated by human annotators.

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset designed to evaluate the performance of cross-lingual question answering systems. It consists of a collection of 240 passages and 1,190 question-answer pairs from the development set of the SQuAD v1.1 dataset [rajpurkar2016squad], along with their professional translations into 10 languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. This makes the dataset entirely parallel across 11 languages.

TyDi QA is a question-answering dataset covering 11 typologically diverse languages. Questions were written by humans on Wikipedia topics. Answers should not be contained in the first 100 characters of the corresponding Wikipedia article. The questions were written separately for each language rather than translated.

3.3 Existing Russian Datasets

The Russian Massive Text Embedding Benchmark (ruMTEB) is an extension of the Massive Text Embedding Benchmark (MTEB) tailored specifically for the Russian language. The authors of ruMTEB introduced 17 new datasets in Russian which were categorized into 7 groups.

In our benchmark we use two of the IR datasets presented there: RuBQ and Ria-News. RuBQ [rybin2021rubq] is a specialized dataset for Russian-language question answering over Wikidata, offering a rich set of questions paired with structured answers.

The Ria-News dataset [gavrilov2019self] is a collection of Russian-language news articles published by the RIA Novosti news agency (2010-2014). This dataset presents a task in which a model is required to locate the text of a specific news article within a larger corpus of news articles based on its corresponding title, which acts as a query.

In addition, we added publicly available IR-related datasets: SberQuAD [efimov2020sberquad], ruSciBench, and ru-facts [kozlova2023fact].

SberQuAD is a Russian-language machine reading comprehension (MRC) dataset inspired by the popular English SQuAD [rajpurkar2016squad] (Stanford Question Answering Dataset). It provides annotated passages and question-answer pairs in Russian.

ruSciBench is a Russian-language benchmark designed to evaluate text embedding models on scientific articles. The dataset is a ported version of qa_science_ru (https://huggingface.co/datasets/AIR-Bench/qa_science_ru) from the AIR-Bench repository, which in turn is a port of the ru_sci_bench dataset (https://huggingface.co/datasets/mlsa-iai-msu-lab/ru_sci_bench) from the MLSA-IAI MSU Lab. The corpus consists of abstracts, and the queries are LLM-generated questions about these abstracts.

ru-facts [kozlova2023fact] is a fact-checking dataset developed by translating and expanding the FEVER dataset with additional data from the Russian news summarization corpus Gazeta (https://huggingface.co/datasets/IlyaGusev/gazeta), using a paraphrasing model (https://habr.com/en/company/sberdevices/blog/667106/) and rule-based transformations from the Ru_Paraphraser dataset (https://huggingface.co/datasets/merionum/ru_paraphraser).

3.4 New Russian Wikipedia-based Datasets

We also introduce a new series of Russian Wikipedia-based datasets. The datasets are based on the Wikipedia section “Did you know …”. The section contains interesting facts extracted from Wikipedia articles. The articles mentioned in a fact are linked from it. For example, the fact “The 2024 American Samoan gubernatorial election was won by Pula and Pulu?” mentions three Wikipedia articles (“2024 American Samoan gubernatorial election”, “Pula”, “Pulu”), from which the fact should be inferred.

University students were asked to find relevant sentences in the mentioned articles that confirm the fact. They marked relevant sentences with scores of 2 or 1. Irrelevant sentences have zero scores. Relevant sentences with score 2 contain the full fact. If a sentence contains a part of the fact it obtains score 1. In total, 540 facts have been annotated.

Using facts, extracted articles and created annotations, three datasets with the same queries but different documents have been created.

  • wikifacts-sents dataset consists of sentences extracted from articles, some of which confirm the fact which stands as a query. The documents in this dataset are the shortest in the benchmark;

  • wikifacts-articles dataset comprises all full articles mentioned in facts. Relevant articles contain relevant sentences. This dataset includes the longest documents in the benchmark and can be used for evaluation of full-document retrieval;

  • wikifacts-para dataset comprises the paragraphs of the extracted articles. The documents in this dataset are significantly shorter than in the wikifacts-articles dataset, but still longer than in most benchmark datasets.

Having such variants, we can evaluate different information-retrieval tasks on the same annotated data.

3.5 BEIR Format Compatibility

Our datasets are presented in a unified format and are compatible with the original BEIR benchmark. Queries are predefined natural-language questions used to evaluate information retrieval (IR) systems. A corpus is the collection of documents that the system searches to find information relevant to a given query. Relevance judgments, also known as qrels, record which documents are relevant to which queries. Queries and corpora are stored in JSONL files, and relevance judgments are stored in TSV files.
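As an illustration, the sketch below parses a toy dataset laid out in this way. The field names (`_id`, `title`, `text`) and the qrels header (`query-id`, `corpus-id`, `score`) follow the conventions of the original BEIR release and are assumed here rather than taken from this paper; the sample records are invented.

```python
import csv
import io
import json


def load_jsonl(lines):
    """Parse JSONL records into a dict keyed by the BEIR-style "_id" field."""
    records = {}
    for line in lines:
        line = line.strip()
        if line:
            row = json.loads(line)
            records[row["_id"]] = row
    return records


def load_qrels(tsv_file):
    """Parse (query-id, corpus-id, score) rows into {query_id: {doc_id: score}}."""
    qrels = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Toy in-memory files standing in for corpus.jsonl, queries.jsonl and a qrels TSV.
corpus = load_jsonl(io.StringIO(
    '{"_id": "d1", "title": "", "text": "Москва - столица России."}\n'
))
queries = load_jsonl(io.StringIO(
    '{"_id": "q1", "text": "Какой город является столицей России?"}\n'
))
qrels = load_qrels(io.StringIO("query-id\tcorpus-id\tscore\nq1\td1\t1\n"))
```

Keeping qrels as nested dictionaries makes graded labels, such as the 3-level wikifacts judgments, fit the same structure as binary ones.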

4 Models

For evaluation, we used the BM25 lexical model and dense retrieval models.
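For reference, a minimal sketch of textbook Okapi BM25 scoring over pre-tokenized documents is given below. It is not the Elasticsearch implementation used in our experiments: the IDF variant and the values `k1=1.2` and `b=0.75` are common defaults assumed for illustration, and the toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avgdl = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # BM25 only rewards exact term matches
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy, already lemmatized documents (hypothetical data).
docs = [
    ["москва", "столица", "россия"],
    ["рецепт", "борщ"],
    ["столица", "франция", "париж"],
]
scores = bm25_scores(["столица", "россия"], docs)
```

Because only exact token matches contribute to the score, inflected word forms must be normalized before indexing, which motivates the preprocessing described next.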

4.1 Preprocessing for BM25 model

The main baseline was calculated using the BM25 lexical model implemented in the Elasticsearch engine (https://www.elastic.co/), with the language analyzer disabled to avoid stemming, which is less suitable for the Russian language. We applied dedicated preprocessing to the data used as input for BM25.

The text preprocessing method consists of the following steps:

  1. Lowercasing: Converting all text to lowercase to ensure uniformity and eliminate case sensitivity.

  2. Punctuation and Special Character Removal: Using regex to remove non-alphanumeric characters, leaving only letters, digits, and spaces to reduce noise.

  3. Space Normalization: Removing extra spaces and trimming leading or trailing whitespace.

  4. Tokenization: Splitting text into individual words for processing.

  5. Lemmatization: Using PyMorphy3 [korobov2015morphological] to convert words into their dictionary forms, reducing data dimensionality while preserving semantic meaning. This approach is particularly effective for the Russian language due to its rich morphology, as it avoids the inaccuracies that stemming introduces by truncating words without context.

  6. Stop Word Removal: Excluding overly frequent words that contribute little to the text content, using the default stopword list provided by the NLTK package (https://www.nltk.org) [hardeniya2016natural], augmented with two Russian pronouns: “which” (“который”) and “such” (“такой”).
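The six steps above can be sketched as follows. This is a simplified illustration rather than the exact code used for the benchmark: the lemmatizer is passed in as a function so that the example runs without external packages, and the stop list is a tiny hypothetical subset rather than the NLTK list.

```python
import re

# Hypothetical miniature stop list; the real pipeline uses NLTK's Russian
# stopwords plus the two pronouns "который" and "такой".
STOP_WORDS = {"и", "в", "на", "который", "такой"}


def preprocess(text, lemmatize=lambda token: token, stop_words=STOP_WORDS):
    """Apply the six preprocessing steps and return a list of tokens."""
    text = text.lower()                                 # 1. lowercasing
    text = re.sub(r"[^\w\s]|_", " ", text)              # 2. keep letters, digits, spaces
    text = re.sub(r"\s+", " ", text).strip()            # 3. space normalization
    tokens = text.split(" ") if text else []            # 4. tokenization
    lemmas = [lemmatize(token) for token in tokens]     # 5. lemmatization
    return [t for t in lemmas if t not in stop_words]   # 6. stop-word removal


# Toy lemma dictionary standing in for a morphological analyzer.
# With PyMorphy3 one would typically pass something like
#   lambda t: pymorphy3.MorphAnalyzer().parse(t)[0].normal_form
# (assumed usage; build the analyzer once and reuse it in practice).
TOY_LEMMAS = {"города": "город", "россии": "россия"}
tokens = preprocess("Города России, и такой Санкт-Петербург!",
                    lemmatize=lambda t: TOY_LEMMAS.get(t, t))
```

Stop words are removed after lemmatization here so that inflected forms of a stop word are caught once they have been reduced to the dictionary form.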

4.2 Neural baseline models

Neural baseline models used in our work are subdivided into pre-trained dense retrievers (bi-encoders) and rerankers. Bi-encoders generate embeddings for queries and documents and calculate their cosine similarity. Rerankers take a query and a document as an input and calculate the probability of the document to be relevant to the query. Rerankers are applied to the best documents found by lexical or dense retrievers and usually improve the performance of combined retrieval. Dense retrievers include the following pre-trained bi-encoders:

  • LaBSE bi-encoder (https://huggingface.co/sentence-transformers/LaBSE) [feng2022language]. LaBSE was pre-trained with a translation ranking task. This makes it possible to find sentence paraphrases within a single language or across languages.

  • Multilingual E5 in three sizes: large (https://huggingface.co/intfloat/multilingual-e5-large), base (https://huggingface.co/intfloat/multilingual-e5-base) and small (https://huggingface.co/intfloat/multilingual-e5-small) [wang2024multilingual]. The multilingual E5 model was trained on a large multilingual corpus using a weakly supervised contrastive pretraining method with InfoNCE contrastive loss. Then it was fine-tuned on high-quality labeled multilingual datasets for retrieval tasks.

  • BGE-M3 model (https://huggingface.co/BAAI/bge-m3) [chen2024bge]. The BGE-M3 model was pre-trained on large-scale multilingual and cross-lingual unsupervised data, and subsequently fine-tuned on multilingual retrieval datasets using a custom loss function based on the InfoNCE loss function.

  • USER-BGE-M3 (https://huggingface.co/deepvk/USER-bge-m3). USER-BGE-M3 is a sentence-transformer model for extracting embeddings for Russian. The model is initialized from the en-ru-BGE-M3 model (https://huggingface.co/TatonkaHF/bge-m3_en_ru), a reduced version of the BGE-M3 model, and then trained on Russian datasets.

  • ru-en-RoSBERTa (https://huggingface.co/ai-forever/ru-en-RoSBERTa) [snegirev2024russian]. The ruRoBERTa model [zmitrovich2024family] (https://huggingface.co/ai-forever/ruRoberta-large) was used as the base model, and the RoSBERTa embeddings were then fine-tuned on Russian and English datasets.

As a reranker, we use the bge-reranker-v2-m3 reranker (https://huggingface.co/BAAI/bge-reranker-v2-m3). In our work, we use BGE models with a max-length parameter set to 2048.
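The retrieve-then-rerank scheme described in this section can be sketched as follows. The embeddings are hand-written toy vectors and the reranking function is a word-overlap stand-in for a cross-encoder; all names and data are hypothetical, and a real setup would obtain vectors and relevance scores from the models listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, top_k):
    """First stage: rank documents by cosine similarity of their embeddings."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def rerank(query, candidates, docs, score_fn):
    """Second stage: rescore each (query, document) pair jointly."""
    rescored = [(doc_id, score_fn(query, docs[doc_id])) for doc_id, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_pair_score(query, doc):
    """Stand-in for a cross-encoder: counts shared words (illustration only)."""
    return len(set(query.split()) & set(doc.split()))


docs = {"d1": "погода в москве", "d2": "столица россии москва", "d3": "рецепт борща"}
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}  # toy embeddings

candidates = retrieve([1.0, 0.2], doc_vecs, top_k=2)
reranked = rerank("столица россии", candidates, docs, toy_pair_score)
```

The first stage compares precomputed document vectors and is cheap per document, while the second stage scores each query-document pair jointly, which is why it is applied only to a short candidate list.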

| Model | Based on | Parameters | Dim | Max input |
|---|---|---|---|---|
| Multilingual-E5-large | XLM-RoBERTa-large | 560M | 1024 | 512 |
| Multilingual-E5-base | XLM-RoBERTa-base | 278M | 768 | 512 |
| Multilingual-E5-small | Multilingual-MiniLM | 118M | 384 | 512 |
| BGE-M3 | BGE-M3 | 568M | 1024 | 8192 |
| USER-BGE-M3 | BGE-M3 | 359M | 1024 | 8192 |
| RoSBERTa | SBERT | 404M | 1024 | 512 |
| LaBSE | LaBSE | 471M | 768 | 256 |
| bge-reranker-v2-m3 | BGE-M3 | 568M | 1024 | 8192 |

Table 2: Model Specifications and Details

5 Results

We evaluated the models on the RusBEIR datasets using NDCG@10, MAP@10, and Recall@10. Since all metrics showed similar trends, we present only the NDCG@10 results in Table 3 for brevity. MAP@10 and Recall@10 results are given in the Additional metrics section.
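For readers unfamiliar with the main metric, a minimal sketch of NDCG@10 for a single query is given below. It assumes the linear-gain formulation (the relevance label divided by the log2 rank discount); evaluation toolkits may differ in details such as tie handling, and the example labels are invented.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps doc_id -> graded relevance label."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical 3-level judgments in the style of the wikifacts datasets.
toy_qrels = {"d1": 2, "d2": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], toy_qrels)
shifted = ndcg_at_k(["d3", "d1", "d2"], toy_qrels)
missed = ndcg_at_k(["d3", "d4"], toy_qrels)
```

The benchmark score for a dataset is the mean of this value over all test queries. Graded labels reward rankings that place fully relevant documents above partially relevant ones.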

Model groups: Lexical (BM25); Dense (mE5-large, mE5-base, mE5-small, BGE-M3, USER-BGE-M3, RoSBERTa, LaBSE); Re-ranking (BM25+BGE, mE5-large+BGE, BGE-M3+BGE).

| Dataset | BM25 | mE5-large | mE5-base | mE5-small | BGE-M3 | USER-BGE-M3 | RoSBERTa | LaBSE | BM25+BGE | mE5-large+BGE | BGE-M3+BGE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rus-NFCorpus | 32.33 | 30.96 | 26.90 | 26.79 | 30.86 | 30.28 | 27.24 | 18.53 | 34.83 | 33.18 | 32.46 |
| rus-ArguAna | 41.49 | 49.06 | 39.40 | 39.59 | 50.75 | 46.52 | 49.38 | 25.52 | 52.91 | 54.01 | 53.87 |
| rus-SciFact | 65.60 | 63.49 | 63.46 | 60.46 | 62.42 | 58.25 | 53.90 | 29.07 | 70.40 | 71.34 | 69.64 |
| rus-SCIDOCS | 13.99 | 13.47 | 12.09 | 10.60 | 15.04 | 14.46 | 14.43 | 8.17 | 15.31 | 15.98 | 16.21 |
| rus-MMARCO | 15.25 | 34.04 | 30.27 | 29.07 | 29.51 | 27.92 | 20.16 | 9.06 | 24.12 | 36.95 | 34.52 |
| rus-MIRACL | 25.13 | 66.99 | 61.41 | 58.52 | 70.50 | 67.23 | 53.11 | 15.70 | 41.51 | 75.90 | 76.44 |
| rus-XQuAD | 96.19 | 97.33 | 95.84 | 95.66 | 95.97 | 95.63 | 93.90 | 69.77 | 98.85 | 98.97 | 98.97 |
| rus-XQuAD-Sentences | 82.36 | 88.84 | 86.37 | 85.41 | 86.91 | 85.42 | 83.20 | 75.33 | 89.93 | 92.08 | 91.69 |
| rus-TyDi QA | 35.80 | 59.41 | 55.91 | 55.23 | 58.34 | 57.86 | 52.06 | 28.05 | 50.12 | 66.20 | 65.78 |
| SberQuad-retrieval | 68.19 | 67.11 | 65.13 | 61.03 | 68.26 | 67.03 | 63.59 | 37.54 | 70.34 | 69.41 | 68.21 |
| ruSciBench-retrieval | 36.69 | 50.81 | 45.74 | 42.93 | 55.85 | 53.58 | 44.89 | 17.93 | 49.93 | 65.33 | 69.05 |
| ru-facts | 92.56 | 93.65 | 93.55 | 93.06 | 93.91 | 93.77 | 93.66 | 93.10 | 92.72 | 92.87 | 92.87 |
| RuBQ | 37.33 | 74.11 | 69.63 | 68.60 | 71.26 | 70.00 | 66.81 | 30.59 | 56.90 | 77.03 | 76.00 |
| Ria-News | 64.63 | 80.67 | 70.24 | 70.00 | 82.99 | 83.52 | 78.85 | 61.57 | 78.12 | 86.22 | 86.85 |
| wikifacts-articles | 84.28 | 66.09 | 63.04 | 67.86 | 74.50 | 79.41 | 74.13 | 45.17 | 85.25 | 83.06 | 83.91 |
| wikifacts-para | 61.31 | 50.15 | 49.51 | 34.71 | 54.55 | 57.53 | 50.66 | 14.78 | 66.61 | 59.95 | 63.76 |
| wikifacts-sents | 33.64 | 35.90 | 30.75 | 22.57 | 37.59 | 34.90 | 40.59 | 25.79 | 39.96 | 38.53 | 39.20 |
| Avg | 52.16 | 60.12 | 56.43 | 54.24 | 61.13 | 60.19 | 56.50 | 35.63 | 59.87 | 65.71 | 65.85 |

Table 3: Performance comparison across different models and datasets. The best results for each dataset are in bold; the results of the best single models are underlined.

The analysis of Table 3 indicates that the best performance on the benchmark is achieved through the combination of the BGE-M3 model and the BGE reranker. Notably, the combination of the mE5-large bi-encoder with the BGE reranker yields close results. Among the individual models, the mE5-large bi-encoder and both multilingual BGE variants stand out as top performers, surpassing BM25 by 8.3 NDCG@10 points on average (about 15.9% in relative terms).

Overall, LaBSE performs the worst among all the models presented. This can be attributed to its training objective, which focuses on finding similar sentences across different languages or paraphrases within the same language. As a result, when confronted with queries that lack lexical overlap with sentences in the corpus, its performance drops.

The RoSBERTa model performs on par with mE5-base and mE5-small, but the smaller size of mE5-base (278M parameters against 404M for RoSBERTa) makes mE5-base preferable.

At the same time, it is worth noting that the BM25 model is the best single model on four datasets: rus-NFCorpus, rus-SciFact, wikifacts-articles and wikifacts-para. On most of these datasets, as well as on others where BM25 alone performed only slightly worse than the neural retrievers, the best results are achieved by combining BM25 with the BGE reranker; rus-SciFact is an exception, where mE5-large with the reranker scores slightly higher (71.34 vs. 70.40). Three of the datasets with a significant BM25 margin contain longer documents than the benchmark average. On the wikifacts-articles dataset, which includes full-text documents, BM25 outperforms the BGE-M3 model by 9.8 NDCG@10 points (about 13% in relative terms) and the mE5-large model by 18.2 points (about 27% in relative terms). This highlights a limitation of the mE5 models in retrieving long documents due to their small maximum input size (512 tokens). Additionally, the rus-NFCorpus and rus-SciFact datasets are domain-specific, which may result in lower-quality multilingual vector representations compared to general datasets.

Furthermore, it should be noted that the results of the BGE models presented in Table 3 were obtained with a maximum input length set to 2048. However, as indicated in Table 2, BGE models can process up to 8192 tokens, making them more suitable for full-text search in long documents.

Our experiments demonstrated that BM25 remains a strong baseline for information retrieval, particularly for full-document retrieval. Neural models, especially mE5-large and BGE-M3, achieved the best results on the benchmark and confirmed the findings of other BEIR-based studies [thakur2021beir, wang2022text].

6 Conclusion

In this paper, we introduced RusBEIR, a comprehensive BEIR-inspired benchmark designed for the zero-shot evaluation of information retrieval (IR) models in the Russian language. Consisting of 17 datasets from diverse domains and tasks, RusBEIR integrates adapted datasets from existing benchmarks alongside novel datasets to further enrich its collection. By providing a large-scale resource compatible with the original BEIR format, RusBEIR enables systematic evaluation and comparison of both lexical and neural IR models, with a particular emphasis on zero-shot performance.

Our study stresses the importance of accurate preprocessing, particularly for lexical models, whose performance depends significantly on preprocessing in morphologically rich languages such as Russian. Additionally, we introduced a series of Russian Wikipedia-based datasets that further expand the scope of RusBEIR, enabling more granular exploration of IR performance across document lengths and tasks.

The results of our experiments confirm that BM25 remains a robust baseline for full-document retrieval, while state-of-the-art neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets. These findings are consistent with previous BEIR-based studies and underscore the advantages of neural approaches, particularly when unprocessed data are used as input. However, our analysis also highlights certain limitations of neural models, such as challenges with long-document retrieval due to input size constraints. The efficiency comparison between BM25 and neural models such as mE5 and BGE remains an open question and will be explored further in future research.

By providing a unified framework and detailed insights into the comparative performance of lexical and neural models, we hope RusBEIR will serve as a valuable tool for advancing research and innovation in information retrieval for the Russian language.

Acknowledgements

The work is supported by the Russian Science Foundation under Agreement No. 25-21-00206.
The research was carried out using the MSU-270 supercomputer of Lomonosov Moscow State University.


7 Additional metrics

7.1 MAP

Mean Average Precision (MAP) is used to assess the overall precision of a retrieval system across multiple queries. It computes the average precision for each query and then takes the mean across all queries. MAP provides a single summary measure that reflects both the ranking quality and the system’s ability to retrieve relevant documents.
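The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used for the benchmark: it assumes binary relevance, unique document IDs in each ranking, and normalization of AP@k by the total number of relevant documents (one common convention). The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """AP@k for one query: sum of precision values at the ranks of relevant
    hits within the top k, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> float:
    """MAP@k: mean of AP@k over all queries that have relevance judgments."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()
    )
    return total / len(qrels)


# Toy example (hypothetical data): two queries with ranked document IDs.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
map_at_10 = mean_average_precision_at_k(runs, qrels, k=10)
print(f"MAP@10 = {map_at_10:.4f}")
```

For q1 the relevant documents appear at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2; for q2 the single relevant document is at rank 2, giving AP = 1/2. The mean of the two values is 2/3.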

The MAP@10 results obtained from the models’ inference on the benchmark datasets are shown below.

Model groups: Lexical (BM25); Dense (mE5-large, mE5-base, mE5-small, BGE-M3, USER-BGE-M3, RoSBERTa, LaBSE); Re-ranking (BM25+BGE, mE5-large+BGE, BGE+BGE).

| Dataset | BM25 | mE5-large | mE5-base | mE5-small | BGE-M3 | USER-BGE-M3 | RoSBERTa | LaBSE | BM25+BGE | mE5-large+BGE | BGE+BGE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rus-NFCorpus | 12.52 | 11.39 | 9.23 | 9.37 | 11.40 | 10.99 | 10.02 | 5.74 | 13.47 | 12.62 | 12.33 |
| rus-ArguAna | 32.76 | 40.71 | 31.69 | 32.08 | 41.85 | 37.57 | 40.32 | 20.40 | 45.32 | 45.48 | 45.12 |
| rus-SciFact | 61.47 | 59.76 | 58.84 | 55.67 | 57.82 | 53.60 | 49.25 | 25.91 | 67.10 | 67.72 | 66.25 |
| rus-SCIDOCS | 8.03 | 7.53 | 6.69 | 5.89 | 8.64 | 8.29 | 8.27 | 4.45 | 8.88 | 9.28 | 9.37 |
| rus-MMARCO | 11.88 | 28.11 | 24.94 | 23.82 | 24.03 | 22.59 | 16.03 | 7.07 | 21.30 | 30.78 | 29.13 |
| rus-MIRACL | 18.61 | 56.64 | 50.90 | 48.00 | 60.52 | 57.11 | 42.36 | 10.94 | 35.79 | 67.24 | 67.77 |
| rus-XQuAD | 95.04 | 96.11 | 94.63 | 94.37 | 94.81 | 94.35 | 92.24 | 65.84 | 98.57 | 98.64 | 98.64 |
| rus-XQuAD-Sentences | 79.32 | 85.89 | 83.15 | 82.03 | 83.80 | 81.99 | 79.32 | 71.12 | 88.44 | 90.08 | 89.75 |
| rus-TyDi QA | 30.16 | 51.78 | 48.79 | 48.35 | 51.02 | 50.90 | 44.88 | 22.91 | 46.13 | 59.50 | 59.18 |
| SberQuad-retrieval | 58.36 | 57.43 | 55.95 | 50.94 | 60.25 | 58.81 | 55.38 | 30.49 | 60.84 | 59.96 | 58.90 |
| ruSciBench-retrieval | 27.07 | 39.31 | 34.50 | 31.50 | 43.30 | 41.47 | 33.72 | 12.48 | 40.43 | 54.74 | 58.12 |
| ru-facts | 90.03 | 91.66 | 91.30 | 90.66 | 91.79 | 91.60 | 91.47 | 90.70 | 90.24 | 90.39 | 90.39 |
| RuBQ | 29.36 | 66.24 | 61.94 | 60.95 | 63.84 | 62.29 | 58.72 | 24.30 | 51.64 | 70.10 | 69.25 |
| Ria-News | 60.41 | 75.94 | 65.67 | 65.59 | 79.94 | 80.62 | 75.44 | 57.79 | 76.75 | 84.17 | 84.74 |
| wikifacts-articles | 78.60 | 65.96 | 55.80 | 60.32 | 68.32 | 73.51 | 67.09 | 37.95 | 80.44 | 78.59 | 79.34 |
| wikifacts-para | 50.67 | 42.50 | 39.54 | 26.52 | 44.00 | 46.87 | 40.43 | 10.01 | 56.71 | 51.09 | 54.11 |
| wikifacts-sents | 24.45 | 29.53 | 22.52 | 16.17 | 27.44 | 25.01 | 29.84 | 18.15 | 30.50 | 28.38 | 28.60 |
| Avg | 45.22 | 53.32 | 49.18 | 47.19 | 53.69 | 52.80 | 49.10 | 30.37 | 53.68 | 58.75 | 58.88 |
Table 4: MAP@10 of each model on each dataset. The best results for each dataset are in bold; the results of the best single models are underlined.

7.2 Recall

Recall quantifies the proportion of relevant documents that are successfully retrieved by the system. It is defined as the ratio of the number of relevant documents retrieved to the total number of relevant documents available. In the context of information retrieval, high recall is crucial to ensure that the system does not miss important information.
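The ratio described above can be sketched as follows. This is a minimal illustration under the assumption of binary relevance judgments, not the evaluation code used for the benchmark; the function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Recall@k for one query: share of relevant documents found in the top k."""
    if not relevant:
        return 0.0
    retrieved = set(ranked[:k])
    return len(retrieved & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> float:
    """Mean Recall@k over all queries that have relevance judgments."""
    if not qrels:
        return 0.0
    total = sum(recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items())
    return total / len(qrels)


# Toy example (hypothetical data): two queries with ranked document IDs.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
recall_at_2 = mean_recall_at_k(runs, qrels, k=2)
print(f"Recall@2 = {recall_at_2:.2f}")
```

With a cutoff of 2, q1 retrieves one of its two relevant documents (recall 0.5) and q2 retrieves its only relevant document (recall 1.0), so the mean is 0.75. Unlike MAP, the value does not depend on where within the top k the relevant documents appear.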

The Recall@10 results obtained from the models’ inference on the benchmark datasets are shown below.

Model groups: Lexical (BM25); Dense (mE5-large, mE5-base, mE5-small, BGE-M3, USER-BGE-M3, RoSBERTa, LaBSE); Re-ranking (BM25+BGE, mE5-large+BGE, BGE+BGE).

| Dataset | BM25 | mE5-large | mE5-base | mE5-small | BGE-M3 | USER-BGE-M3 | RoSBERTa | LaBSE | BM25+BGE | mE5-large+BGE | BGE+BGE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rus-NFCorpus | 16.09 | 15.68 | 12.56 | 12.79 | 14.93 | 14.56 | 13.17 | 8.57 | 16.69 | 15.59 | 14.97 |
| rus-ArguAna | 69.70 | 75.82 | 64.30 | 63.87 | 79.16 | 75.32 | 78.52 | 42.11 | 76.81 | 81.01 | 81.65 |
| rus-SciFact | 76.63 | 76.88 | 76.42 | 73.46 | 75.08 | 70.90 | 66.61 | 37.71 | 79.39 | 80.88 | 78.58 |
| rus-SCIDOCS | 14.48 | 14.14 | 12.80 | 11.14 | 15.59 | 14.88 | 15.34 | 8.33 | 15.66 | 16.34 | 17.02 |
| rus-MMARCO | 25.77 | 52.38 | 46.68 | 45.36 | 46.53 | 44.42 | 33.02 | 15.26 | 32.32 | 55.90 | 50.90 |
| rus-MIRACL | 31.32 | 76.70 | 71.03 | 68.43 | 79.59 | 76.44 | 63.69 | 21.16 | 39.28 | 81.81 | 82.59 |
| rus-XQuAD | 99.58 | 99.75 | 99.50 | 99.50 | 99.41 | 99.41 | 98.91 | 82.02 | 99.66 | 99.92 | 99.92 |
| rus-XQuAD-Sentences | 91.78 | 97.44 | 96.09 | 95.76 | 96.30 | 95.88 | 95.13 | 88.40 | 94.31 | 98.07 | 97.48 |
| rus-TyDi QA | 51.26 | 79.03 | 75.34 | 73.94 | 78.31 | 76.55 | 71.56 | 42.43 | 59.07 | 83.88 | 82.80 |
| SberQuad-retrieval | 96.47 | 93.47 | 92.14 | 90.71 | 91.94 | 91.42 | 88.15 | 58.84 | 97.32 | 96.29 | 94.70 |
| ruSciBench-retrieval | 38.63 | 53.98 | 47.15 | 45.40 | 57.87 | 54.68 | 46.91 | 19.23 | 45.64 | 62.51 | 66.62 |
| ru-facts | 99.82 | 100.00 | 100.00 | 99.96 | 100.00 | 100.00 | 99.96 | 100.00 | 99.82 | 100.00 | 100.00 |
| RuBQ | 52.71 | 86.26 | 83.10 | 81.20 | 84.06 | 83.18 | 81.29 | 41.47 | 62.00 | 88.84 | 86.90 |
| Ria-News | 77.84 | 90.30 | 84.50 | 83.73 | 92.34 | 92.42 | 89.41 | 73.41 | 82.19 | 92.44 | 93.24 |
| wikifacts-articles | 88.39 | 77.49 | 71.44 | 77.90 | 82.39 | 86.72 | 82.99 | 53.21 | 90.64 | 85.84 | 88.98 |
| wikifacts-para | 67.30 | 60.51 | 57.65 | 41.44 | 62.87 | 65.81 | 59.37 | 17.87 | 70.17 | 62.48 | 67.77 |
| wikifacts-sents | 35.42 | 40.65 | 32.28 | 24.63 | 39.89 | 37.39 | 43.22 | 26.85 | 39.83 | 40.10 | 41.54 |
| Avg | 60.78 | 70.03 | 66.06 | 64.07 | 70.37 | 69.41 | 66.31 | 43.35 | 64.75 | 73.05 | 73.27 |
Table 5: Recall@10 of each model on each dataset. The best results for each dataset are in bold; the results of the best single models are underlined.