Abstract
Named entity recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question answering. These systems identify named entities, real-world concepts such as locations, persons, and organizations. Despite extensive research on NER for English, the task has received far less attention for low-resource languages. In this work, we focus on NER for low-resource languages and present a case study on the Indian language Marathi. Current NLP research revolves around pre-trained transformer models such as BERT for building NER models. We instead focus on improving the performance of shallow models based on CNN and LSTM by combining the best of both worlds. In the era of transformers, these traditional deep learning models remain relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER that integrates a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that simply replacing a traditional word-based tokenizer with a BERT subword tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT. We thus demonstrate the importance of subword tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on the L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.
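To make the hybrid idea concrete, the following is a minimal, hypothetical sketch in PyTorch, not the authors' released code: a vanilla single-layer BiLSTM tagger whose input IDs come from a pre-trained BERT subword tokenizer rather than a word-level vocabulary. The checkpoint name "l3cube-pune/marathi-bert-v2" (MahaBERT), the tag count, and all hyperparameters are illustrative assumptions.

```python
# A minimal, hypothetical sketch of the hybrid approach: a single-layer
# BiLSTM tagger fed with subword IDs from a pre-trained BERT tokenizer
# instead of a word-level vocabulary.
import torch
import torch.nn as nn
from transformers import AutoTokenizer

# Assumed checkpoint name for MahaBERT; any BERT-family tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("l3cube-pune/marathi-bert-v2")

class SubwordLSTMTagger(nn.Module):
    """Single-layer BiLSTM that predicts one NER tag per subword token."""

    def __init__(self, vocab_size, num_tags, pad_id, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, input_ids):
        hidden, _ = self.lstm(self.embed(input_ids))  # (batch, seq, 2*hidden)
        return self.classifier(hidden)                # (batch, seq, num_tags)

# Usage: subword-tokenize a Marathi sentence and score tags per subword.
# num_tags=9 is an assumption; the real value depends on the tag scheme.
enc = tokenizer(["पुणे हे महाराष्ट्रातील एक शहर आहे"], return_tensors="pt", padding=True)
model = SubwordLSTMTagger(tokenizer.vocab_size, num_tags=9,
                          pad_id=tokenizer.pad_token_id)
logits = model(enc["input_ids"])
print(logits.shape)  # torch.Size([1, seq_len, 9])
```

In practice, word-level gold NER labels would also need to be aligned to subwords during training (e.g., by propagating a word's tag to each of its pieces), a standard step this sketch omits.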
References
Li J, Sun A, Han J, Li C (2020) A survey on deep learning for named entity recognition. IEEE Trans Knowl Data Eng 34(1):50–70
Shen Y, Yun H, Lipton ZC, Kronrod Y, Anandkumar A (2017) Deep active learning for named entity recognition. Preprint at https://arxiv.org/abs/1707.05928
Litake O, Sabane MR, Patil PS, Ranade AA, Joshi R (2022) L3Cube-MahaNER: a Marathi named entity recognition dataset and BERT models. In: Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp 29–34
Joshi R (2022) L3Cube-MahaCorpus and MahaBERT: Marathi monolingual corpus, Marathi BERT language models, and resources. In: Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp 97–101
Joshi R, Joshi R (2022) Evaluating input representation for language identification in Hindi-English code-mixed text. In: ICDSMLA 2020. Springer Singapore, Singapore, pp 795–802
Kudo T (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 66–75
Kudo T, Richardson J (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. Preprint at https://arxiv.org/abs/1808.06226
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1715–1725
Patil P, Ranade A, Sabane M, Litake O, Joshi R (2022) L3Cube-MahaNER: a Marathi named entity recognition dataset and BERT models. Preprint at https://arxiv.org/abs/2204.06029
Chopra D, Joshi N, Mathur I (2016) Named entity recognition in Hindi using hidden Markov model. In: 2016 Second International Conference on Computational Intelligence and Communication Technology (CICT), pp 581–586
Frei J, Kramer F (2021) GERNERMED: an open German medical NER model. Preprint at https://arxiv.org/abs/2109.12104
Speck R, Ngonga Ngomo AC (2014) Ensemble learning for named entity recognition. In: Mika P, Tudorache T, Bernstein A, Welty C, Knoblock C, Vrandečić D, Groth P, Noy N, Janowicz K, Goble C (eds) The Semantic Web – ISWC 2014. Springer International Publishing, Cham, pp 519–534
Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. Preprint at https://arxiv.org/abs/1603.01360
Singh J, Joshi N, Mathur I (2013) Part of speech tagging of Marathi text using trigram method. Preprint at https://arxiv.org/abs/1307.4299
Manamini S, Ahamed A, Rajapakshe R, Reemal G, Jayasena S, Dias G, Ranathunga S (2016) Ananya: a named-entity-recognition (NER) system for Sinhala language. In: 2016 Moratuwa Engineering Research Conference (MERCon). IEEE, pp 30–35
Sukardi S, Susanty M, Irawan A, Putra RF (2020) Low complexity named-entity recognition for Indonesian language using BiLSTM-CNNs. In: 2020 3rd International Conference on Information and Communications Technology (ICOIACT). IEEE, pp 137–142
Ushio A, Neves L, Silva V, Barbieri F, Camacho-Collados J (2022) Named entity recognition in twitter: a dataset and analysis on short-term temporal shifts. Preprint at https://arxiv.org/abs/2210.03797
Litake O, Sabane M, Patil P, Ranade A, Joshi R (2023) Mono versus multilingual BERT: a case study in Hindi and Marathi named entity recognition. In: Proceedings of 3rd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2022. Springer, pp 607–618
Joshi R (2022) L3Cube-HindBERT and DevBERT: pre-trained BERT transformer models for Devanagari-based Hindi and Marathi languages. Preprint at https://arxiv.org/abs/2211.11418
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://aclanthology.org/N19-1423
Kakwani D, Kunchukuttan A, Golla S, Gokul N, Bhattacharyya A, Khapra MM, Kumar P (2020) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp 4948–4961
Joshi R (2022) L3Cube-MahaNLP: Marathi natural language processing datasets, models, and library. Preprint at https://arxiv.org/abs/2205.14728
Acknowledgements
We would like to express our sincere gratitude toward the L3Cube mentorship program and our mentor for their continual support and guidance. We are grateful to Pune Institute of Computer Technology for encouraging and supporting us throughout the research period. The problem statement and ideas presented in this work are from L3Cube and its mentors and are part of the L3Cube-MahaNLP project [22].
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Chaudhari, H., Patil, A., Lavekar, D., Khairnar, P., Joshi, R., Pande, S. (2023). On Significance of Subword Tokenization for Low-Resource and Efficient Named Entity Recognition: A Case Study in Marathi. In: Swaroop, A., Polkowski, Z., Correia, S.D., Virdee, B. (eds) Proceedings of Data Analytics and Management. ICDAM 2023. Lecture Notes in Networks and Systems, vol 787. Springer, Singapore. https://doi.org/10.1007/978-981-99-6550-2_37
DOI: https://doi.org/10.1007/978-981-99-6550-2_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-6549-6
Online ISBN: 978-981-99-6550-2
eBook Packages: Intelligent Technologies and Robotics (R0)