Curation of a polysemous word dataset for word sense disambiguation in Hausa language
DOI: https://doi.org/10.64497/jssci.77
Keywords: Hausa Language Processing, Word Sense Disambiguation (WSD), Polysemy, Natural Language Processing (NLP), Low-Resource Languages, Linguistic Annotation
Abstract
The challenge of Word Sense Disambiguation (WSD) is fundamental to Natural Language Processing (NLP), particularly in low-resource languages where lexical ambiguity hinders effective language understanding. Hausa, a major Chadic language spoken by over 60 million people, lacks structured lexical resources for disambiguating polysemous words. This paper presents the development and curation of a high-quality Hausa Polysemous Word Sense Disambiguation dataset consisting of 2,021 manually selected and annotated lemmas. Each lemma is disambiguated into its distinct senses, accompanied by contextual Hausa example sentences, English glosses, and translations. The dataset is designed to support the training and evaluation of supervised and semi-supervised WSD models for Hausa and serves as a foundational resource for semantic NLP tasks in low-resource settings. The annotation schema, curation methodology, and linguistic validation process are described in detail. This work fills a critical gap in Hausa NLP and provides a reproducible framework for constructing sense-annotated corpora in other under-resourced languages.
License
Copyright (c) 2025 Halima Aminu, I.R. Saidu, P.O. Odion

This work is licensed under a Creative Commons Attribution 4.0 International License.

