Predicting Morphologically-Complex Unknown Words in Igbo
The effective handling of previously unseen words is an important factor in the performance of part-of-speech (POS) taggers. Some trainable POS taggers use suffix (and sometimes prefix) strings as cues for handling unknown words, in effect treating them as a proxy for actual linguistic affixes. In the context of creating a tagger for the African language Igbo, we compare the performance of several existing taggers that implement this approach with a novel method for handling morphologically complex unknown words, based on morphological reconstruction (i.e. a linguistically informed segmentation into root and affixes). The novel method outperforms these other systems by several percentage points, achieving accuracies of around 92% on morphologically complex unknown words.
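The suffix-cue strategy mentioned in this abstract can be illustrated with a minimal sketch. It is not the implementation of any of the taggers evaluated: the function names, the three-character suffix limit, the fallback tag and the English toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def suffixes(word, max_len=3):
    """Word-final substrings of length 1..max_len, always shorter than the word."""
    return [word[-n:] for n in range(1, min(max_len, len(word) - 1) + 1)]

def train_suffix_model(tagged_words, max_len=3):
    """Count how often each suffix string co-occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for suf in suffixes(word.lower(), max_len):
            counts[suf][tag] += 1
    return counts

def guess_tag(word, counts, default="NOUN", max_len=3):
    """Guess a tag for an unseen word from its longest suffix seen in training."""
    for suf in reversed(suffixes(word.lower(), max_len)):  # longest suffix first
        if suf in counts:
            return counts[suf].most_common(1)[0][0]
    return default  # hypothetical fallback tag, not taken from the source

# Toy training data (English, purely illustrative).
TOY_TRAINING = [
    ("walked", "VERB"), ("talked", "VERB"),
    ("quickly", "ADV"), ("slowly", "ADV"),
    ("houses", "NOUN"),
]
model = train_suffix_model(TOY_TRAINING)
print(guess_tag("jumped", model))
```

The cue here is a raw character string, which is why it only approximates real affixes: a word-final string can match by accident, and it carries no information about where the root ends.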
Toward an effective Igbo part-of-speech tagger
Part-of-speech (POS) tagging is a well-established technology for most Western European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments that use an Igbo corpus as a test bed for identifying the POS taggers and Machine Learning (ML) methods that can achieve good performance with the small dataset available for the language. Experiments were conducted using several well-known POS taggers developed for English or other European languages, with different training data styles and sizes. Igbo has a number of language-specific characteristics that present a challenge for effective POS tagging. One interesting case is the wide use of verbs (and nominalizations thereof) that have an inherent noun complement; these form “linked pairs” in the POS tagging scheme, but may appear discontinuously. Another issue is Igbo's highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically inflected OOV words.
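To make the role of OOV words concrete, the sketch below shows one common way of reporting tagging accuracy separately for words seen and not seen in training, together with the OOV rate. The function names and the toy English data are assumptions for illustration; the article's own evaluation scripts are not described in this abstract.

```python
def oov_rate(train_words, test_words):
    """Fraction of test tokens whose word form never occurs in the training data."""
    vocab = set(train_words)
    return sum(w not in vocab for w in test_words) / len(test_words)

def accuracy_by_vocabulary(train_words, gold, predicted):
    """Tagging accuracy split into known and unknown (OOV) test tokens.

    gold is a list of (word, tag) pairs; predicted is a parallel list of tags.
    A bucket with no tokens is reported as None.
    """
    vocab = set(train_words)
    stats = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (word, gold_tag), pred_tag in zip(gold, predicted):
        bucket = "known" if word in vocab else "unknown"
        stats[bucket][0] += int(gold_tag == pred_tag)
        stats[bucket][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in stats.items()}

# Toy data (illustrative only).
train = ["the", "dog", "runs"]
gold = [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB"), ("quickly", "ADV")]
pred = ["DET", "VERB", "VERB", "VERB"]
print(oov_rate(train, [w for w, _ in gold]))
print(accuracy_by_vocabulary(train, gold, pred))
```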
Developing Methods and Resources for Automated Processing of the African Language Igbo
Natural Language Processing (NLP) research is still in its infancy in Africa. Most
languages in Africa have few or no NLP resources available, and Igbo is among those
with none. In this study, we develop NLP resources to support NLP-based research in
the Igbo language. The springboard is the development of a new part-of-speech (POS)
tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guidelines to accommodate
language-internal features not recognized in EAGLES. The tagset comes in three
granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The
medium-grained tagset strikes a balance between the other two for practical
purposes. Next, Igbo electronic texts are preprocessed through normalization
and tokenization. The tokenizer developed in this study follows the tagset's
definition of a word token, and the outcome is an Igbo corpus (IgbC) of about one million
tokens.
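The normalization and tokenization steps can be sketched as follows. This is only an illustration under stated assumptions: the thesis tokenizer follows the IgbTS definition of a word token, which is not reproduced in this abstract, so the regular expression below (letters plus combining diacritics, with internal hyphens kept inside a token) is a hypothetical stand-in. Unicode NFC normalization is used so that, for example, a dotted vowel typed as a base letter plus a combining dot becomes a single code point.

```python
import re
import unicodedata

# Word characters plus combining diacritics (U+0300..U+036F), so that tone marks
# with no precomposed form stay attached to their base letter.
_WORD = r"[\w\u0300-\u036f]+"
# Assumption: an internal hyphen joins parts of one token; any other
# non-space symbol is a token of its own.
TOKEN_RE = re.compile(rf"{_WORD}(?:-{_WORD})*|[^\w\s]")

def normalize(text):
    """Compose characters (NFC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split normalized text into word and punctuation tokens."""
    return TOKEN_RE.findall(normalize(text))

# Toy input: "i" followed by a combining dot below (U+0323) and doubled spaces.
print(tokenize("O  bi\u0323ara, taa."))
```

Normalizing before tokenizing matters because the same visible word can otherwise be stored as different code point sequences and be counted as several distinct word forms.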
The IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus
(IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an
inter-annotator agreement (IAA) exercise was undertaken, which led to revisions of the
IgbTS where necessary. A novel automatic method was developed to bootstrap the manual
annotation process by exploiting the by-products of this IAA exercise, in order to improve
the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach
was adopted to flag likely erroneous instances in the IgbTC for correction. A novel automatic
method that uses knowledge of affixes to flag and correct all morphologically inflected
words in the IgbTC whose tags wrongly mark them as not morphologically inflected
was also developed and used.
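Two of the quality-control steps in this paragraph can be sketched in a few lines. The abstract does not say which agreement statistic or voting rule was used, so both choices here are assumptions: Cohen's kappa as one common agreement measure for two annotators, and a strict-majority vote for the committee of taggers. All names and data are hypothetical.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators' tag sequences."""
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1:  # both annotators used a single identical tag throughout
        return 1.0
    return (observed - expected) / (1 - expected)

def flag_suspects(corpus_tags, committee_outputs):
    """Positions where a strict majority of taggers agree on a tag that
    differs from the tag currently in the corpus.

    Returns (index, proposed_tag) pairs for a human to review.
    """
    suspects = []
    for i, current in enumerate(corpus_tags):
        votes = Counter(output[i] for output in committee_outputs)
        tag, count = votes.most_common(1)[0]
        if tag != current and count > len(committee_outputs) / 2:
            suspects.append((i, tag))
    return suspects

# Toy data (illustrative tags only).
print(cohen_kappa(["N", "N", "V", "V"], ["N", "N", "V", "N"]))
print(flag_suspects(["N", "V", "N"],
                    [["N", "V", "V"], ["N", "N", "V"], ["N", "V", "V"]]))
```

In this sketch the committee only proposes candidates; whether a flagged tag is actually wrong is left to manual correction, in line with the process the paragraph describes.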
Experiments towards the development of an automatic POS tagging system for Igbo
using the IgbTC show good accuracy scores, comparable to those the same taggers
achieve on other languages they have been tested on, such as English. Accuracy on words not seen during
the taggers’ training (also called unknown words) is considerably lower, and lower still
on unknown words that are morphologically complex, which indicates difficulty in
handling morphologically complex words in Igbo. This was improved by adopting a
morphological reconstruction method (a linguistically informed segmentation into stems
and affixes) that reformats these morphologically complex words into patterns learnable
by machines. This enables taggers to use knowledge of the stems and associated affixes
of these morphologically complex words during tagging to predict their
appropriate tags. This method outperforms the other methods that existing
taggers use to handle unknown words, and yields a substantial increase in
accuracy on both morphologically inflected unknown words and unknown words overall.
Together, these developments constitute the first NLP toolkit for the Igbo language and a step towards
a Basic Language Resource Kit (BLARK) for the language. This
IgboNLP toolkit will be made available to the NLP community and should encourage
further research and development for the language.
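The morphological reconstruction idea described above (segmenting a complex word into stem and affixes, then reformatting it into a pattern a tagger can learn from) can be sketched as follows. This is not the thesis implementation: the stem list, the affix inventory, the longest-match search and the "+" output notation are all hypothetical choices, and the Igbo-like strings are toy examples that have not been checked against the authors' resources.

```python
def split_suffixes(tail, suffixes):
    """Split tail into a sequence of known suffixes, or return None if impossible."""
    if not tail:
        return []
    for suf in sorted(suffixes, key=len, reverse=True):  # prefer longer suffixes
        if tail.startswith(suf):
            rest = split_suffixes(tail[len(suf):], suffixes)
            if rest is not None:
                return [suf] + rest
    return None

def reconstruct(word, stems, prefixes, suffixes):
    """Return (prefix, stem, [suffixes]) for a segmentable word, else None."""
    for prefix in [""] + sorted(prefixes, key=len, reverse=True):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for stem in sorted(stems, key=len, reverse=True):
            if rest.startswith(stem):
                tail = split_suffixes(rest[len(stem):], suffixes)
                if tail is not None:
                    return prefix, stem, tail
    return None

def to_pattern(segmentation):
    """Reformat a segmentation as space-separated tokens with '+' marking affixes."""
    prefix, stem, sufs = segmentation
    parts = ([prefix + "+"] if prefix else []) + [stem] + ["+" + s for s in sufs]
    return " ".join(parts)

# Hypothetical toy inventory (assumed for illustration only).
STEMS = ["bịa", "ri"]
PREFIXES = ["a", "e"]
SUFFIXES = ["ra", "la", "ghị"]

seg = reconstruct("abịaghị", STEMS, PREFIXES, SUFFIXES)
print(to_pattern(seg))
```

Unlike a raw suffix-string cue, the pattern exposes a stem that may already be known from training, so an unseen word form can be related to seen ones through its parts.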