
On Significance of Subword Tokenization for Low-Resource and Efficient Named Entity Recognition: A Case Study in Marathi

  • Conference paper
Proceedings of Data Analytics and Management (ICDAM 2023)

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 787))


Abstract

Named entity recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, low-resource languages have not received adequate attention. In this work, we focus on NER for low-resource languages and present a case study on the Indian language Marathi. The advancement of NLP research revolves around the use of pre-trained transformer models such as BERT for developing NER models. However, we focus on improving the performance of shallow models based on CNNs and LSTMs by combining the best of both worlds. In the era of transformers, these traditional deep learning models remain relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER that integrates a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple replacement of a traditional word-based tokenizer with a BERT tokenizer brings the accuracy of vanilla single-layer models close to that of deep pre-trained models like BERT. We demonstrate the importance of subword tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on the L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.
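
As a rough illustration of the proposed hybrid setup (a minimal sketch, not the authors' released code; the embedding size, hidden size, and label count are illustrative assumptions), the following Python snippet loads the pre-trained MahaBERT tokenizer (footnote 1) and feeds its subword IDs into a single-layer BiLSTM tagger in place of a word-level pipeline:

import torch.nn as nn
from transformers import AutoTokenizer

# Pre-trained subword tokenizer; the paper also evaluates tokenizers from
# MahaGPT, IndicBERT, and mBERT in the same role.
tokenizer = AutoTokenizer.from_pretrained("l3cube-pune/marathi-bert-v2")

class SubwordLSTMTagger(nn.Module):
    """Single-layer BiLSTM over subword embeddings (sizes are illustrative)."""
    def __init__(self, vocab_size, num_labels, pad_id, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids):
        out, _ = self.lstm(self.emb(input_ids))   # (batch, seq_len, 2*hidden)
        return self.cls(out)                      # one tag logit vector per subword

# Tokenize a Marathi sentence into subwords and predict a tag per subword.
enc = tokenizer("राहुल पुण्यात राहतो", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))  # inspect subword pieces

model = SubwordLSTMTagger(tokenizer.vocab_size,
                          num_labels=9,  # assumed tag-set size, not the paper's figure
                          pad_id=tokenizer.pad_token_id)
logits = model(enc["input_ids"])  # shape (1, seq_len, num_labels)

The design point is that the shallow tagger keeps training and inference cheap, while the pre-trained subword vocabulary removes out-of-vocabulary gaps on morphologically rich Marathi text, which is what narrows the accuracy gap to full BERT models.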


Notes

  1. https://huggingface.co/l3cube-pune/marathi-bert-v2
  2. https://huggingface.co/l3cube-pune/marathi-gpt
  3. https://huggingface.co/l3cube-pune/marathi-bert-scratch
  4. https://huggingface.co/l3cube-pune/marathi-bert-v2
  5. https://huggingface.co/bert-base-multilingual-cased
  6. https://huggingface.co/ai4bharat/indic-bert
  7. https://huggingface.co/l3cube-pune/marathi-gpt

References

  1. Li J, Sun A, Han J, Li C (2020) A survey on deep learning for named entity recognition. IEEE Trans Knowl Data Eng 34(1):50–70

  2. Shen Y, Yun H, Lipton ZC, Kronrod Y, Anandkumar A (2017) Deep active learning for named entity recognition. CoRR abs/1707.05928. Preprint at http://arxiv.org/abs/1707.05928

  3. Litake O, Sabane MR, Patil PS, Ranade AA, Joshi R (2022) L3Cube-MahaNER: a Marathi named entity recognition dataset and BERT models. In: Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp 29–34

  4. Joshi R (2022) L3Cube-MahaCorpus and MahaBERT: Marathi monolingual corpus, Marathi BERT language models, and resources. In: Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp 97–101

  5. Joshi R, Joshi R (2022) Evaluating input representation for language identification in Hindi-English code mixed text. In: ICDSMLA 2020. Springer Singapore, Singapore, pp 795–802

  6. Kudo T (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 66–75

  7. Kudo T, Richardson J (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. Preprint at https://arxiv.org/abs/1808.06226

  8. Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1715–1725

  9. Patil P, Ranade A, Sabane M, Litake O, Joshi R (2022) L3Cube-MahaNER: a Marathi named entity recognition dataset and BERT models. Preprint at https://arxiv.org/abs/2204.06029

  10. Chopra D, Joshi N, Mathur I (2016) Named entity recognition in Hindi using hidden Markov model. In: 2016 Second International Conference on Computational Intelligence and Communication Technology (CICT), pp 581–586

  11. Frei J, Kramer F (2021) GERNERMED: an open German medical NER model. CoRR abs/2109.12104. Preprint at https://arxiv.org/abs/2109.12104

  12. Speck R, Ngonga Ngomo AC (2014) Ensemble learning for named entity recognition. In: Mika P, Tudorache T, Bernstein A, Welty C, Knoblock C, Vrandečić D, Groth P, Noy N, Janowicz K, Goble C (eds) The Semantic Web – ISWC 2014. Springer International Publishing, Cham, pp 519–534

  13. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. Preprint at https://arxiv.org/abs/1603.01360

  14. Singh J, Joshi N, Mathur I (2013) Part of speech tagging of Marathi text using trigram method. Preprint at https://arxiv.org/abs/1307.4299

  15. Manamini S, Ahamed A, Rajapakshe R, Reemal G, Jayasena S, Dias G, Ranathunga S (2016) Ananya: a named-entity-recognition (NER) system for Sinhala language. In: 2016 Moratuwa Engineering Research Conference (MERCon). IEEE, pp 30–35

  16. Sukardi S, Susanty M, Irawan A, Putra RF (2020) Low complexity named-entity recognition for Indonesian language using BiLSTM-CNNs. In: 2020 3rd International Conference on Information and Communications Technology (ICOIACT). IEEE, pp 137–142

  17. Ushio A, Neves L, Silva V, Barbieri F, Camacho-Collados J (2022) Named entity recognition in Twitter: a dataset and analysis on short-term temporal shifts. Preprint at https://arxiv.org/abs/2210.03797

  18. Litake O, Sabane M, Patil P, Ranade A, Joshi R (2023) Mono versus multilingual BERT: a case study in Hindi and Marathi named entity recognition. In: Proceedings of 3rd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2022. Springer, pp 607–618

  19. Joshi R (2022) L3Cube-HindBERT and DevBERT: pre-trained BERT transformer models for Devanagari based Hindi and Marathi languages. Preprint at https://arxiv.org/abs/2211.11418

  20. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://aclanthology.org/N19-1423

  21. Kakwani D, Kunchukuttan A, Golla S, Gokul N, Bhattacharyya A, Khapra MM, Kumar P (2020) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp 4948–4961

  22. Joshi R (2022) L3Cube-MahaNLP: Marathi natural language processing datasets, models, and library. Preprint at https://arxiv.org/abs/2205.14728


Acknowledgements

We would like to express our sincere gratitude to the L3Cube mentorship program and our mentor for their continual support and guidance. We are grateful to Pune Institute of Computer Technology for encouraging and supporting us throughout the research period. The problem statement and ideas presented in this work are from L3Cube and its mentors and are a part of the L3Cube-MahaNLP project [22].

Author information


Corresponding author

Correspondence to Raviraj Joshi.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chaudhari, H., Patil, A., Lavekar, D., Khairnar, P., Joshi, R., Pande, S. (2023). On Significance of Subword Tokenization for Low-Resource and Efficient Named Entity Recognition: A Case Study in Marathi. In: Swaroop, A., Polkowski, Z., Correia, S.D., Virdee, B. (eds) Proceedings of Data Analytics and Management. ICDAM 2023. Lecture Notes in Networks and Systems, vol 787. Springer, Singapore. https://doi.org/10.1007/978-981-99-6550-2_37

