Curation of a polysemous word dataset for word sense disambiguation in Hausa language

Authors

  • Halima Aminu, Department of Computer Science, Aliko Dangote University of Science and Technology, Wudil, Nigeria (ORCID: https://orcid.org/0009-0009-5064-4975)
  • I.R. Saidu, Department of Intelligence and Cyber Security, Nigerian Defence Academy, Kaduna State, Nigeria
  • P.O. Odion, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria

DOI:

https://doi.org/10.64497/jssci.77

Keywords:

Hausa Language Processing, Word Sense Disambiguation (WSD), Polysemy, Natural Language Processing (NLP), Low-Resource Languages, Linguistic Annotation

Abstract

The challenge of Word Sense Disambiguation (WSD) is fundamental to Natural Language Processing (NLP), particularly in low-resource languages where lexical ambiguity hinders effective language understanding. Hausa, a major Chadic language spoken by over 60 million people, lacks structured lexical resources for disambiguating polysemous words. This paper presents the development and curation of a high-quality Hausa Polysemous Word Sense Disambiguation dataset consisting of 2,021 manually selected and annotated lemmas. Each lemma is disambiguated into its distinct senses, accompanied by contextual Hausa example sentences, English glosses, and translations. The dataset is designed to support the training and evaluation of supervised and semi-supervised WSD models for Hausa and serves as a foundational resource for semantic NLP tasks in low-resource settings. The annotation schema, curation methodology, and linguistic validation process are described in detail. This work fills a critical gap in Hausa NLP and provides a reproducible framework for constructing sense-annotated corpora in other under-resourced languages.
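
As a rough illustration of the record structure the abstract describes (a lemma, its distinct senses, and for each sense a Hausa example sentence, an English gloss, and a translation), a minimal Python sketch might look like the following. The class names, field names, and sense-identifier format are assumptions made for illustration and are not taken from the published dataset; the sample entry for "kai" is likewise an illustrative example, not a record drawn from the dataset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sense:
    """One sense of a lemma (field names are hypothetical)."""
    sense_id: str        # assumed identifier format, e.g. "kai.1"
    gloss_en: str        # English gloss of the sense
    example_ha: str      # Hausa example sentence showing the sense in context
    translation_en: str  # English translation of the example sentence


@dataclass
class LemmaEntry:
    """A lemma together with all of its annotated senses."""
    lemma: str
    senses: List[Sense] = field(default_factory=list)

    def is_polysemous(self) -> bool:
        # A lemma counts as polysemous here if it has more than one sense.
        return len(self.senses) > 1


def to_training_pairs(entries: List[LemmaEntry]) -> List[Tuple[str, str, str]]:
    """Flatten entries into (sentence, lemma, sense_id) triples,
    the shape a supervised WSD classifier would typically consume."""
    return [(s.example_ha, e.lemma, s.sense_id) for e in entries for s in e.senses]


# Illustrative entry only; not a record from the published dataset.
entry = LemmaEntry(
    lemma="kai",
    senses=[
        Sense("kai.1", "head", "Kaina yana ciwo.", "My head hurts."),
        Sense("kai.2", "to take, carry to", "Ya kai kaya kasuwa.",
              "He took goods to the market."),
    ],
)

pairs = to_training_pairs([entry])
```

Flattening to one triple per example sentence keeps the sense label attached to its context, which is one straightforward way to feed such a resource to supervised or semi-supervised WSD models.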

Published

2025-09-05

How to Cite

Aminu, H., Saidu, I., & Odion, P. (2025). Curation of a polysemous word dataset for word sense disambiguation in Hausa language. Journal of Statistical Sciences and Computational Intelligence, 1(3), 175–186. https://doi.org/10.64497/jssci.77