27 research outputs found
Part-of-speech Tagset and Corpus Development for Igbo, an African
This project aims to develop linguistic resources to support computational NLP research on the
Igbo language. The starting point for this project is the development of a new part-of-speech tagging
scheme based on the EAGLES tagset guidelines, adapted to incorporate additional language
internal features. The tags are currently being used in a part-of-speech annotation task for the
development of POS tagged Igbo corpus. The proposed tagset has 59 tags
Toward an effective Igbo part-of-speech tagger
Part-of-speech (POS) tagging is a well-established technology for most Western European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments conducted using an Igbo corpus as a test bed for identifying the POS taggers and the Machine Learning (ML) methods that can achieve a good performance with the small dataset available for the language. Experiments have been conducted using different well-known POS taggers developed for English or European languages, and different training data styles and sizes. Igbo has a number of language-specific characteristics that present a challenge for effective POS tagging. One interesting case is the wide use of verbs (and nominalizations thereof) that have an inherent noun complement, which form âlinked pairsâ in the POS tagging scheme, but which may appear discontinuously. Another issue is Igbo's highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically-inflected OOV words
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOURCE LANGUAGES TAGSET- A FOCUS ON AN AFRICAN IGBO
In this paper, we demonstrate the efficacy of a POS annotation method that employed the services of two automatic approaches to assist POS tagged corpus creation for a novel language in NLP. The two approaches are cross-lingual and monolingual POS tags projection. We used cross-lingual to automatically create an initial âerrorfulâ tagged corpus for a target language via word-alignment. The resources for creating this are derived from a source language rich in NLP resources. A monolingual method is applied to clean the induce noise via an alignment process and to transform the source language tags to the target language tags. We used English and Igbo as our case study. This is possible because there are parallel texts that exist between English and Igbo, and the source language English has available NLP resources. The results of the experiment show a steady improvement in accuracy and rate of tags transformation with score ranges of 6.13% to 83.79% and 8.67% to 98.37% respectively. The rate of tags transformation evaluates the rate at which source language tags are translated to target language tags
Use of Transformation-Based Learning in Annotation Pipeline of Igbo, an African Language
The accuracy of an annotated corpus can be increased through evaluation and re- vision of the annotation scheme, and through adjudication of the disagreements found. In this paper, we describe a novel process that has been applied to improve a part-of-speech (POS) tagged corpus for the African language Igbo. An inter-annotation agreement (IAA) exercise was undertaken to iteratively revise the tagset used in the creation of the initial tagged corpus, with the aim of refining the tagset and maximizing annotator performance. The tagset revisions and other corrections were efficiently propagated to the overall corpus in a semi-automated manner using transformation-based learning (TBL) to identify candidates for cor- rection and to propose possible tag corrections. The affected word-tag pairs in the corpus were inspected to ensure a high quality end-product with an accuracy that would not be achieved through a purely automated process. The results show that the tagging accuracy increases from 88% to 94%. The tagged corpus is potentially re-usable for other dialects of the language
Developing Methods and Resources for Automated Processing of the African Language Igbo
Natural Language Processing (NLP) research is still in its infancy in Africa. Most of
languages in Africa have few or zero NLP resources available, of which Igbo is among those
at zero state. In this study, we develop NLP resources to support NLP-based research in
the Igbo language. The springboard is the development of a new part-of-speech (POS)
tagset for Igbo (IgbTS) based on a slight adaptation of the EAGLES guideline as a result
of language internal features not recognized in EAGLES. The tagset consists of three
granularities: fine-grain (85 tags), medium-grain (70 tags) and coarse-grain (15 tags). The
medium-grained tagset is to strike a balance between the other two grains for practical
purpose. Following this is the preprocessing of Igbo electronic texts through normalization
and tokenization processes. The tokenizer is developed in this study using the tagset
definition of a word token and the outcome is an Igbo corpus (IgbC) of about one million
tokens.
This IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus
(IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an
inter-annotation agreement (IAA) exercise was undertaken, which led to the revision of the
IgbTS where necessary. A novel automatic method was developed to bootstrap a manual
annotation process through exploitation of the by-products of this IAA exercise, to improve
IgbTC. To further improve the quality of the IgbTC, a committee of taggers approach
was adopted to propose erroneous instances on IgbTC for correction. A novel automatic
method that uses knowledge of affixes to flag and correct all morphologically-inflected
words in the IgbTC whose tags violate their status as not being morphologically-inflected
was also developed and used.
Experiments towards the development of an automatic POS tagging system for Igbo
using IgbTC show good accuracy scores comparable to other languages that these taggers
have been tested on, such as English. Accuracy on the words previously unseen during
the taggersâ training (also called unknown words) is considerably low, and much lower
on the unknown words that are morphologically-complex, which indicates difficulty in
handling morphologically-complex words in Igbo. This was improved by adopting a
morphological reconstruction method (a linguistically-informed segmentation into stems
and affixes) that reformatted these morphologically-complex words into patterns learnable
by machines. This enables taggers to use the knowledge of stems and associated affixes
of these morphologically-complex words during the tagging process to predict their
appropriate tags. Interestingly, this method outperforms other methods that existing
taggers use in handling unknown words, and achieves an impressive increase for the
accuracy of the morphologically-inflected unknown words and overall unknown words.
These developments are the first NLP toolkit for the Igbo language and a step towards
achieving the objective of Basic Language Resources Kits (BLARK) for the language. This
IgboNLP toolkit will be made available for the NLP community and should encourage
further research and development for the language
Developing Methods and Resources for Automated Processing of the African Language Igbo
Natural Language Processing (NLP) research is still in its infancy in Africa. Most of
languages in Africa have few or zero NLP resources available, of which Igbo is among those
at zero state. In this study, we develop NLP resources to support NLP-based research in
the Igbo language. The springboard is the development of a new part-of-speech (POS)
tagset for Igbo (IgbTS) based on a slight adaptation of the EAGLES guideline as a result
of language internal features not recognized in EAGLES. The tagset consists of three
granularities: fine-grain (85 tags), medium-grain (70 tags) and coarse-grain (15 tags). The
medium-grained tagset is to strike a balance between the other two grains for practical
purpose. Following this is the preprocessing of Igbo electronic texts through normalization
and tokenization processes. The tokenizer is developed in this study using the tagset
definition of a word token and the outcome is an Igbo corpus (IgbC) of about one million
tokens.
This IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus
(IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an
inter-annotation agreement (IAA) exercise was undertaken, which led to the revision of the
IgbTS where necessary. A novel automatic method was developed to bootstrap a manual
annotation process through exploitation of the by-products of this IAA exercise, to improve
IgbTC. To further improve the quality of the IgbTC, a committee of taggers approach
was adopted to propose erroneous instances on IgbTC for correction. A novel automatic
method that uses knowledge of affixes to flag and correct all morphologically-inflected
words in the IgbTC whose tags violate their status as not being morphologically-inflected
was also developed and used.
Experiments towards the development of an automatic POS tagging system for Igbo
using IgbTC show good accuracy scores comparable to other languages that these taggers
have been tested on, such as English. Accuracy on the words previously unseen during
the taggersâ training (also called unknown words) is considerably low, and much lower
on the unknown words that are morphologically-complex, which indicates difficulty in
handling morphologically-complex words in Igbo. This was improved by adopting a
morphological reconstruction method (a linguistically-informed segmentation into stems
and affixes) that reformatted these morphologically-complex words into patterns learnable
by machines. This enables taggers to use the knowledge of stems and associated affixes
of these morphologically-complex words during the tagging process to predict their
appropriate tags. Interestingly, this method outperforms other methods that existing
taggers use in handling unknown words, and achieves an impressive increase for the
accuracy of the morphologically-inflected unknown words and overall unknown words.
These developments are the first NLP toolkit for the Igbo language and a step towards
achieving the objective of Basic Language Resources Kits (BLARK) for the language. This
IgboNLP toolkit will be made available for the NLP community and should encourage
further research and development for the language
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages
In this paper, we present AfricaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the universal dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in the UD. Evaluating on the AfricaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with parameter-fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems to be more effective for POS tagging in unseen languages
Automatic Restoration of Diacritics for Igbo Language
Igbo is a low-resource African language with orthographic and tonal diacritics, which capture distinctions between words that are important for both meaning and pronunciation, and hence of potential value for a range of language processing tasks. Such diacritics, however, are often largely absent from the electronic texts we might want to process, or assemble into corpora, and so the need arises for effective methods for automatic diacritic restoration for Igbo. In this paper, we experiment using an Igbo bible corpus, which is extensively marked for vowel distinctions, and partially for tonal distinctions, and attempt the task of reinstating these diacritics when they have been deleted. We investigate a number of word-level diacritic restoration methods, based on n-grams, under a closed-world assumption, achieving an accuracy of 98.83 % with our most effective method
Lexical Disambiguation of Igbo using Diacritic Restoration
Properly written texts in Igbo, a low resource
African language, are rich in both
orthographic and tonal diacritics. Diacritics
are essential in capturing the distinctions
in pronunciation and meaning of
words, as well as in lexical disambiguation.
Unfortunately, most electronic texts
in diacritic languages are written without
diacritics. This makes diacritic restoration
a necessary step in corpus building and
language processing tasks for languages
with diacritics. In our previous work, we
built some nâgram models with simple
smoothing techniques based on a closedworld
assumption. However, as a classi-
fication task, diacritic restoration is well
suited for and will be more generalisable
with machine learning. This paper, therefore,
presents a more standard approach
to dealing with the task which involves
the application of machine learning algorithms