
    Toward an effective Igbo part-of-speech tagger

    Part-of-speech (POS) tagging is a well-established technology for most Western European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments that use an Igbo corpus as a test bed for identifying the POS taggers and Machine Learning (ML) methods that can achieve good performance with the small dataset available for the language. The experiments use several well-known POS taggers developed for English or other European languages, with different training data styles and sizes. Igbo has a number of language-specific characteristics that present a challenge for effective POS tagging. One interesting case is the wide use of verbs (and nominalizations thereof) that have an inherent noun complement, which form “linked pairs” in the POS tagging scheme, but which may appear discontinuously. Another issue is Igbo's highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically inflected OOV words.
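The OOV problem described in this abstract is usually quantified by splitting tagging accuracy into known and unseen words. The following is a minimal sketch of such a report, not the authors' evaluation code; the function name `oov_report` and the data layout (lists of `(word, tag)` pairs) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # assumed layout: [(word, tag), ...]


def oov_report(
    train: List[TaggedSentence],
    test_gold: List[TaggedSentence],
    test_pred: List[List[str]],
) -> Dict[str, float]:
    """Report the OOV rate and tagging accuracy on known vs. OOV words.

    A test word counts as OOV when it never occurs in the training data.
    """
    vocab = {word for sent in train for word, _ in sent}
    # Each bucket holds [number correct, number seen].
    counts = {"known": [0, 0], "oov": [0, 0]}
    for gold_sent, pred_sent in zip(test_gold, test_pred):
        for (word, gold_tag), pred_tag in zip(gold_sent, pred_sent):
            bucket = "known" if word in vocab else "oov"
            counts[bucket][1] += 1
            counts[bucket][0] += int(gold_tag == pred_tag)

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    total = counts["known"][1] + counts["oov"][1]
    return {
        "oov_rate": ratio(counts["oov"][1], total),
        "known_accuracy": ratio(*counts["known"]),
        "oov_accuracy": ratio(*counts["oov"]),
    }
```

Reporting the two accuracies separately makes it visible when a tagger with a good overall score still performs poorly on unseen word forms.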

    Corpus-Based Approaches to Igbo Diacritic Restoration

    With natural language processing (NLP), researchers aim to get computers to identify and understand the patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntax, pragmatics and phonology, which need to be captured and processed. The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries. But this research focuses more on well-resourced languages such as English, Japanese, German, French, Russian and Mandarin Chinese. Over 95% of the world’s 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP work. In this thesis, we present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches for other languages. Focusing on the Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches were proposed: standard n-gram models, classification models and embedding models. The standard n-gram models use a sequence of words preceding the target stripped word as key predictors of the correct variant. For the classification models, a window of words on both sides of the target stripped word was used. The embedding models compare the similarity scores of the combined context word embeddings and the embeddings of each of the candidate variants. The processes and techniques involved in projecting embeddings from a model trained on English texts to an Igbo embedding space, and the creation of intrinsic evaluation tasks to validate the models, are also discussed. A comparative analysis of the results indicates that all the approaches significantly improved on the baseline performance, which uses the unigram model. The details of the processes involved in building the models, as well as possible directions for future work, are discussed in this work.
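The n-gram approach and the unigram baseline mentioned in this abstract can be sketched as follows. This is an illustrative bigram model under stated assumptions, not the thesis implementation: the class name `BigramRestorer`, the use of a single preceding stripped word as context, and the fallback to unigram counts are all choices made here for brevity.

```python
import unicodedata
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def strip_diacritics(word: str) -> str:
    """Remove combining marks (tone marks, dots below) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


class BigramRestorer:
    """Pick the diacritic variant seen most often after the previous word."""

    def __init__(self) -> None:
        # stripped word -> counts of its diacritic variants (unigram baseline)
        self.unigram: Dict[str, Counter] = defaultdict(Counter)
        # (previous stripped word, stripped word) -> counts of variants
        self.bigram: Dict[Tuple[str, str], Counter] = defaultdict(Counter)

    def train(self, sentences: List[List[str]]) -> None:
        for sentence in sentences:
            prev = "<s>"  # sentence-start marker
            for word in sentence:
                key = strip_diacritics(word)
                self.unigram[key][word] += 1
                self.bigram[(prev, key)][word] += 1
                prev = key

    def restore(self, stripped_sentence: List[str]) -> List[str]:
        restored = []
        prev = "<s>"
        for key in stripped_sentence:
            # Prefer the bigram context; fall back to the unigram baseline.
            candidates = self.bigram.get((prev, key)) or self.unigram.get(key)
            # Words never seen in training are passed through unchanged.
            restored.append(candidates.most_common(1)[0][0] if candidates else key)
            prev = key
        return restored
```

Conditioning on the stripped form of the previous word keeps training and restoration consistent, since only stripped text is available at restoration time. Longer contexts would follow the same pattern with longer keys.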

    Developing Methods and Resources for Automated Processing of the African Language Igbo

    Natural Language Processing (NLP) research is still in its infancy in Africa. Most languages in Africa have few or no NLP resources available, and Igbo is among those with none. In this study, we develop NLP resources to support NLP-based research in the Igbo language. The springboard is the development of a new part-of-speech (POS) tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guidelines to account for language-internal features not recognized in EAGLES. The tagset has three granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The medium-grained tagset strikes a balance between the other two for practical purposes. Following this is the preprocessing of Igbo electronic texts through normalization and tokenization. The tokenizer is developed in this study using the tagset's definition of a word token, and the outcome is an Igbo corpus (IgbC) of about one million tokens. The IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus (IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an inter-annotation agreement (IAA) exercise was undertaken, which led to the revision of the IgbTS where necessary. A novel automatic method was developed to bootstrap the manual annotation process by exploiting the by-products of this IAA exercise, to improve the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach was adopted to propose potentially erroneous instances in the IgbTC for correction. A novel automatic method that uses knowledge of affixes to flag and correct all morphologically inflected words in the IgbTC whose tags mark them as not morphologically inflected was also developed and used.
Experiments towards the development of an automatic POS tagging system for Igbo using the IgbTC show good accuracy scores, comparable to those for other languages that these taggers have been tested on, such as English. Accuracy on words previously unseen during the taggers’ training (also called unknown words) is considerably lower, and lower still on unknown words that are morphologically complex, which indicates difficulty in handling morphologically complex words in Igbo. This was improved by adopting a morphological reconstruction method (a linguistically informed segmentation into stems and affixes) that reformatted these morphologically complex words into patterns learnable by machines. This enables taggers to use knowledge of the stems and associated affixes of these words during tagging to predict their appropriate tags. Interestingly, this method outperforms other methods that existing taggers use for handling unknown words, and achieves an impressive increase in accuracy on both the morphologically inflected unknown words and unknown words overall. Together, these developments form the first NLP toolkit for the Igbo language and a step towards achieving the objective of a Basic Language Resource Kit (BLARK) for the language. This IgboNLP toolkit will be made available to the NLP community and should encourage further research and development for the language.
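The idea of reformatting a complex word into a stem plus affixes, so that a tagger can generalize from parts it has already seen, can be sketched as below. This is a hypothetical illustration, not the thesis method: the functions `segment` and `reconstruct`, the `-`-marked output format, and the greedy longest-affix search are assumptions. The stems and affixes a caller supplies would come from a linguistically validated inventory; none is provided here.

```python
from typing import List, Optional, Set, Tuple


def _split_suffixes(
    rest: str, stems: Set[str], suffixes: Set[str]
) -> Optional[Tuple[str, List[str]]]:
    """Peel suffixes off the right edge until a known stem remains."""
    if rest in stems:
        return rest, []
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and rest.endswith(suffix) and len(rest) > len(suffix):
            inner = _split_suffixes(rest[: -len(suffix)], stems, suffixes)
            if inner:
                stem, found = inner
                return stem, found + [suffix]  # keep left-to-right order
    return None


def segment(
    word: str, stems: Set[str], prefixes: Set[str], suffixes: Set[str]
) -> Optional[Tuple[str, str, List[str]]]:
    """Return (prefix, stem, suffixes), or None if no analysis is found."""
    # Try longer prefixes first; the empty string means "no prefix".
    for prefix in sorted(prefixes | {""}, key=len, reverse=True):
        if word.startswith(prefix):
            analysis = _split_suffixes(word[len(prefix):], stems, suffixes)
            if analysis:
                return prefix, analysis[0], analysis[1]
    return None


def reconstruct(
    word: str, stems: Set[str], prefixes: Set[str], suffixes: Set[str]
) -> str:
    """Rewrite a word as space-separated morphs; leave it as-is if unanalyzed."""
    analysis = segment(word, stems, prefixes, suffixes)
    if analysis is None:
        return word
    prefix, stem, found = analysis
    parts = ([prefix + "-"] if prefix else []) + [stem] + ["-" + s for s in found]
    return " ".join(parts)
```

With a placeholder inventory such as `stems={"bek"}`, `prefixes={"o"}` and `suffixes={"ta", "ra"}` (invented strings, not Igbo morphemes), `reconstruct("obektara", ...)` yields `o- bek -ta -ra`, while a word with no analysis is returned unchanged.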

    SERENGETI: Massively Multilingual Language Models for Africa

    Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving 82.27 average F1. We also perform analyses of errors from our models, which allows us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research (https://github.com/UBC-NLP/serengeti). Comment: To appear in Findings of ACL 202

    A Basic Language Resource Kit Implementation for the IgboNLP Project

    Igbo, an African language with around 32 million speakers worldwide, is one of the many languages with few or none of the language processing resources needed for advanced language technology applications. In this article, we describe the approach taken to creating an initial set of resources for Igbo, including an electronic text corpus, a part-of-speech (POS) tagset, and a POS-tagged subcorpus. We discuss the approach taken in gathering texts, the preprocessing of these texts, and the development of the POS-tagged corpus. We also discuss some of the problems encountered during corpus and tagset development and the solutions arrived at for these problems.