
On Significance of Subword Tokenization for Low-Resource and Efficient Named Entity Recognition: A Case Study in Marathi

  • Conference paper
Proceedings of Data Analytics and Management (ICDAM 2023)

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 787))


Abstract

Named entity recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, low-resource languages have not received adequate attention. In this work, we focus on NER for low-resource languages and present a case study on the Indian language Marathi. The advancement of NLP research revolves around the use of pre-trained transformer models such as BERT for developing NER models. However, we focus on improving the performance of shallow models based on CNNs and LSTMs by combining the best of both worlds. In the era of transformers, these traditional deep learning models remain relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER that integrates a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple replacement of a traditional word-based tokenizer with a BERT tokenizer brings the accuracy of vanilla single-layer models close to that of deep pre-trained models like BERT. We demonstrate the importance of subword tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on the L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.
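
As a rough illustration of the proposed hybrid setup (a minimal sketch, not the authors' released code; the embedding size, hidden size, and label count are illustrative assumptions), the following Python snippet loads the pre-trained MahaBERT tokenizer (footnote 1) and feeds its subword IDs into a single-layer BiLSTM tagger in place of a word-level pipeline:

import torch.nn as nn
from transformers import AutoTokenizer

# Pre-trained subword tokenizer; the paper also evaluates tokenizers from
# MahaGPT, IndicBERT, and mBERT in the same role.
tokenizer = AutoTokenizer.from_pretrained("l3cube-pune/marathi-bert-v2")

class SubwordLSTMTagger(nn.Module):
    """Single-layer BiLSTM over subword embeddings (sizes are illustrative)."""
    def __init__(self, vocab_size, num_labels, pad_id, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids):
        out, _ = self.lstm(self.emb(input_ids))   # (batch, seq_len, 2*hidden)
        return self.cls(out)                      # one tag logit vector per subword

# Tokenize a Marathi sentence into subwords and predict a tag per subword.
enc = tokenizer("राहुल पुण्यात राहतो", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))  # inspect subword pieces

model = SubwordLSTMTagger(tokenizer.vocab_size,
                          num_labels=9,  # assumed tag-set size, not the paper's figure
                          pad_id=tokenizer.pad_token_id)
logits = model(enc["input_ids"])  # shape (1, seq_len, num_labels)

The design point is that the shallow tagger keeps training and inference cheap, while the pre-trained subword vocabulary removes out-of-vocabulary gaps on morphologically rich Marathi text, which is what narrows the accuracy gap to full BERT models.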


Notes

  1. https://huggingface.co/l3cube-pune/marathi-bert-v2
  2. https://huggingface.co/l3cube-pune/marathi-gpt
  3. https://huggingface.co/l3cube-pune/marathi-bert-scratch
  4. https://huggingface.co/l3cube-pune/marathi-bert-v2
  5. https://huggingface.co/bert-base-multilingual-cased
  6. https://huggingface.co/ai4bharat/indic-bert
  7. https://huggingface.co/l3cube-pune/marathi-gpt

References

  1. Li J, Sun A, Han J, Li C (2020) A survey on deep learning for named entity recognition. IEEE Trans Knowl Data Eng 34(1):50–70

  2. Shen Y, Yun H, Lipton ZC, Kronrod Y, Anandkumar A (2017) Deep active learning for named entity recognition. CoRR abs/1707.05928. Preprint at http://arxiv.org/abs/1707.05928

  3. Litake O, Sabane MR, Patil PS, Ranade AA, Joshi R (2022) L3Cube-MahaNER: a Marathi named entity recognition dataset and BERT models. In: Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp 29–34

  4. Joshi R (2022) L3Cube-MahaCorpus and MahaBERT: Marathi monolingual corpus, Marathi BERT language models, and resources. In: Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp 97–101

  5. Joshi R, Joshi R (2022) Evaluating input representation for language identification in Hindi-English code mixed text. In: ICDSMLA 2020. Springer Singapore, Singapore, pp 795–802

  6. Kudo T (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 66–75

  7. Kudo T, Richardson J (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. Preprint at https://arxiv.org/abs/1808.06226

  8. Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1715–1725

  9. Patil P, Ranade A, Sabane M, Litake O, Joshi R (2022) L3Cube-MahaNER: a Marathi named entity recognition dataset and BERT models. Preprint at https://arxiv.org/abs/2204.06029

  10. Chopra D, Joshi N, Mathur I (2016) Named entity recognition in Hindi using hidden Markov model. In: 2016 Second International Conference on Computational Intelligence and Communication Technology (CICT), pp 581–586

  11. Frei J, Kramer F (2021) GERNERMED: an open German medical NER model. CoRR abs/2109.12104. Preprint at https://arxiv.org/abs/2109.12104

  12. Speck R, Ngonga Ngomo AC (2014) Ensemble learning for named entity recognition. In: Mika P, Tudorache T, Bernstein A, Welty C, Knoblock C, Vrandečić D, Groth P, Noy N, Janowicz K, Goble C (eds) The Semantic Web – ISWC 2014. Springer International Publishing, Cham, pp 519–534

  13. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. Preprint at https://arxiv.org/abs/1603.01360

  14. Singh J, Joshi N, Mathur I (2013) Part of speech tagging of Marathi text using trigram method. Preprint at https://arxiv.org/abs/1307.4299

  15. Manamini S, Ahamed A, Rajapakshe R, Reemal G, Jayasena S, Dias G, Ranathunga S (2016) Ananya: a named-entity-recognition (NER) system for Sinhala language. In: 2016 Moratuwa Engineering Research Conference (MERCon). IEEE, pp 30–35

  16. Sukardi S, Susanty M, Irawan A, Putra RF (2020) Low complexity named-entity recognition for Indonesian language using BiLSTM-CNNs. In: 2020 3rd International Conference on Information and Communications Technology (ICOIACT). IEEE, pp 137–142

  17. Ushio A, Neves L, Silva V, Barbieri F, Camacho-Collados J (2022) Named entity recognition in Twitter: a dataset and analysis on short-term temporal shifts. Preprint at https://arxiv.org/abs/2210.03797

  18. Litake O, Sabane M, Patil P, Ranade A, Joshi R (2023) Mono versus multilingual BERT: a case study in Hindi and Marathi named entity recognition. In: Proceedings of 3rd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2022. Springer, pp 607–618

  19. Joshi R (2022) L3Cube-HindBERT and DevBERT: pre-trained BERT transformer models for Devanagari based Hindi and Marathi languages. Preprint at https://arxiv.org/abs/2211.11418

  20. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://aclanthology.org/N19-1423

  21. Kakwani D, Kunchukuttan A, Golla S, Gokul N, Bhattacharyya A, Khapra MM, Kumar P (2020) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp 4948–4961

  22. Joshi R (2022) L3Cube-MahaNLP: Marathi natural language processing datasets, models, and library. Preprint at https://arxiv.org/abs/2205.14728


Acknowledgements

We would like to express our sincere gratitude to the L3Cube mentorship program and our mentor for their continual support and guidance. We are grateful to Pune Institute of Computer Technology for encouraging and supporting us throughout the research period. The problem statement and ideas presented in this work are from L3Cube and its mentors and are a part of the L3Cube-MahaNLP project [22].

Author information


Corresponding author

Correspondence to Raviraj Joshi.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chaudhari, H., Patil, A., Lavekar, D., Khairnar, P., Joshi, R., Pande, S. (2023). On Significance of Subword Tokenization for Low-Resource and Efficient Named Entity Recognition: A Case Study in Marathi. In: Swaroop, A., Polkowski, Z., Correia, S.D., Virdee, B. (eds) Proceedings of Data Analytics and Management. ICDAM 2023. Lecture Notes in Networks and Systems, vol 787. Springer, Singapore. https://doi.org/10.1007/978-981-99-6550-2_37

