How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque.
Data (the BL2MP dataset and pretraining corpora), models and evaluation scripts from our work "How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque", accepted at LREC-COLING 2024.
We introduce BL2MP (Basque L2 student-based Minimal Pairs), a benchmark designed to assess the grammatical knowledge of language models in Basque, inspired by the BLiMP benchmark. The BL2MP dataset includes examples sourced from the bai&by language academy, derived from essays written by students enrolled there. These examples contain authentic grammatical errors made by learners, offering a realistic reflection of real-world language errors.
BL2MP is also available on HuggingFace 🤗
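For convenience, it can be loaded with the datasets library; the hub identifier below is an assumption, so check the Hugging Face page for the actual name:

```python
from datasets import load_dataset

# Hypothetical hub identifier for BL2MP; replace with the actual one
# from the Hugging Face page linked above.
bl2mp = load_dataset("orai-nlp/bl2mp")
```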
We employed three corpora of different sizes (5M, 25M, 125M) in our experiments:
We also share the lemmatized counterparts:
5M_lemma, 25M_lemma, 125M_lemma
MLM validation datasets:
We trained three BERT models of different sizes, namely mini, medium and base (with 4, 8 and 12 layers, respectively), on each corpus.
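For reference, these configurations could be expressed with transformers' BertConfig. The layer counts come from above; the hidden, head and intermediate sizes below follow the usual mini/medium/base conventions and are assumptions, not taken from this repository:

```python
from transformers import BertConfig

# Layer counts as stated above; hidden/head/intermediate sizes follow the
# common mini/medium/base conventions and are assumptions.
configs = {
    "mini":   BertConfig(num_hidden_layers=4,  hidden_size=256, num_attention_heads=4,  intermediate_size=1024),
    "medium": BertConfig(num_hidden_layers=8,  hidden_size=512, num_attention_heads=8,  intermediate_size=2048),
    "base":   BertConfig(num_hidden_layers=12, hidden_size=768, num_attention_heads=12, intermediate_size=3072),
}
```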
Here we share the best-performing checkpoint for each model (a loading sketch follows the lists below):
bert_mini_eu_5M, bert_mini_eu_25M, bert_mini_eu_125M
bert_medium_eu_5M, bert_medium_eu_25M, bert_medium_eu_125M
bert_base_eu_5M, bert_base_eu_25M, bert_base_eu_125M
We also share the models trained with the lemmatized version:
bert_medium_eu_5M_lemma, bert_medium_eu_25M_lemma, bert_medium_eu_125M_lemma
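Any of these checkpoints should be loadable with Hugging Face transformers. The hub path below is an assumption based on the names above; substitute the actual identifier of the checkpoint you need:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical hub path; replace with the actual identifier of the
# checkpoint you want (e.g. one of the bert_*_eu_* names listed above).
name = "orai-nlp/bert_medium_eu_25M"

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)
```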
We used minicons, which implements the pseudo-log-likelihood sentence scoring of Salazar et al. (2020), to evaluate our MLMs on BL2MP in a zero-shot setting.
It can be installed with pip:

```
pip install minicons
```
Then we evaluate an MLM as follows:

```
python3 mlm-score.py --input bl2mp.jsonl --output_dir output/ --lm orai-nlp/ElhBERTeu-medium --device cuda:0
```
There are different versions of the dataset and evaluation script, created for different experiments; all of them use the same call to minicons and differ only in how they read the input data and in the conditions used to filter minimal pairs when computing the final accuracy score.
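At its core, the shared scoring step looks roughly like the sketch below. This is a minimal illustration assuming minicons' MaskedLMScorer API; the model name and the Basque minimal pair are placeholders:

```python
from minicons import scorer

# Pseudo-log-likelihood scorer for a masked LM (Salazar et al., 2020).
mlm = scorer.MaskedLMScorer("orai-nlp/ElhBERTeu-medium", "cuda:0")

# A toy minimal pair: grammatical vs. ungrammatical (number agreement).
good = "Etxea handia da."   # "The house is big."
bad = "Etxea handia dira."  # plural verb with a singular subject

# Sum token-level PLL scores into a single score per sentence.
good_score, bad_score = mlm.sequence_score(
    [good, bad], reduction=lambda x: x.sum(0).item()
)

# The pair counts as correct if the grammatical sentence scores higher;
# the accuracy is the fraction of pairs ranked correctly.
print(good_score > bad_score)
```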
Gorka Urbizu [1] [2], Muitze Zulaika [1], Xabier Saralegi [1], Ander Corral [1]
Affiliations:
[1] Orai NLP Technologies
[2] University of the Basque Country
Copyright (C) by Orai NLP Technologies.
The corpora, datasets, models and scripts created in this work are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
If you use these corpora, datasets or models, please cite the following paper:
- G. Urbizu, M. Zulaika, X. Saralegi, A. Corral. "How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque". Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). May 2024, Torino, Italy.
Gorka Urbizu, Muitze Zulaika: {g.urbizu,m.zulaika}@orai.eus