67 research outputs found
Methods in Contemporary Linguistics
The present volume is a broad overview of methods and methodologies in linguistics, illustrated with examples from concrete research. It collects insights from a wide range of linguistic sub-disciplines, from core disciplines to topics in cross-linguistic and language-internal diversity and contributions on language, space and society. Given its critical and innovative nature, the volume is a valuable source for students and researchers with a wide variety of linguistic interests.
Developing Methods and Resources for Automated Processing of the African Language Igbo
Natural Language Processing (NLP) research is still in its infancy in Africa. Most
African languages have few or no NLP resources available, and Igbo is among those
with none. In this study, we develop NLP resources to support NLP-based research in
the Igbo language. The springboard is the development of a new part-of-speech (POS)
tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guidelines to cover
language-internal features not recognized in EAGLES. The tagset comes in three
granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The
medium-grained tagset strikes a balance between the other two for practical
purposes. Following this is the preprocessing of Igbo electronic texts through normalization
and tokenization processes. The tokenizer is developed in this study using the tagset
definition of a word token and the outcome is an Igbo corpus (IgbC) of about one million
tokens.
The IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus
(IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an
inter-annotator agreement (IAA) exercise was undertaken, which led to the revision of the
IgbTS where necessary. A novel automatic method was developed to bootstrap a manual
annotation process by exploiting the by-products of this IAA exercise, to improve
IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach
was adopted to flag potentially erroneous instances in the IgbTC for correction. A novel automatic
method that uses knowledge of affixes to flag and correct all morphologically-inflected
words in the IgbTC whose tags wrongly mark them as not morphologically-inflected
was also developed and used.
Experiments towards the development of an automatic POS tagging system for Igbo
using IgbTC show good accuracy scores, comparable to those for other languages on which these taggers
have been tested, such as English. Accuracy on words not seen during
the taggers’ training (also called unknown words) is considerably lower, and lower still
on unknown words that are morphologically-complex, which indicates difficulty in
handling morphologically-complex words in Igbo. This was improved by adopting a
morphological reconstruction method (a linguistically-informed segmentation into stems
and affixes) that reformatted these morphologically-complex words into patterns learnable
by machines. This enables taggers to use the knowledge of stems and associated affixes
of these morphologically-complex words during the tagging process to predict their
appropriate tags. This method outperforms the other methods that existing
taggers use in handling unknown words, and increases
accuracy on both the morphologically-inflected unknown words and unknown words overall.
These developments constitute the first NLP toolkit for the Igbo language and a step towards
achieving the objective of a Basic Language Resource Kit (BLARK) for the language. This
IgboNLP toolkit will be made available to the NLP community and should encourage
further research and development for the language.
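The committee-of-taggers step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each committee member has already tagged the same token sequence and that a position is flagged for manual review when a strict majority of taggers agrees on a tag that differs from the tag in the corpus. The function name `flag_disagreements`, the voting threshold, and the toy tags are assumptions made for illustration, not details taken from the study.

```python
from collections import Counter


def flag_disagreements(corpus_tags, committee_outputs, min_votes=None):
    """Return (index, corpus_tag, committee_tag) for positions where a
    majority of the committee agrees on a tag other than the corpus tag."""
    n_taggers = len(committee_outputs)
    if min_votes is None:
        min_votes = n_taggers // 2 + 1  # strict majority (assumed threshold)
    flagged = []
    for i, corpus_tag in enumerate(corpus_tags):
        # Count the tags proposed for token i by every committee member.
        votes = Counter(output[i] for output in committee_outputs)
        top_tag, top_count = votes.most_common(1)[0]
        if top_tag != corpus_tag and top_count >= min_votes:
            flagged.append((i, corpus_tag, top_tag))
    return flagged


# Toy data with placeholder tags (not the IgbTS tagset).
corpus_tags = ["N", "V", "P", "N"]
tagger_a = ["N", "V", "P", "V"]
tagger_b = ["N", "N", "P", "V"]
tagger_c = ["N", "V", "P", "N"]

flagged = flag_disagreements(corpus_tags, [tagger_a, tagger_b, tagger_c])
```

Here only the last token is flagged, because two of the three taggers propose `V` where the corpus has `N`. The flagged positions are candidates for human inspection; the sketch does not correct anything automatically.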
A Study of Chinese Named Entity and Relation Identification in a Specific Domain
This thesis investigates the automatic identification of Chinese named entities (NEs) and their relations (NERs) in a specific domain. We propose a three-stage pipeline computational model for the error correction of word segmentation and POS tagging, NE recognition, and NER identification. In the first stage, an error repair module using machine learning techniques is developed. In the second stage, a new algorithm is designed that automatically constructs Finite State Cascades (FSC) from given sets of rules. In addition, a recognition strategy for NEs that are not marked by trigger words handles special linguistic phenomena. In the third stage, a novel approach, positive and negative case-based learning and identification (PNCBL&I), is implemented. It aims to improve NER identification performance by simultaneously learning two opposite kinds of cases and automatically selecting effective multi-level linguistic features for NERs and non-NERs. Two further strategies, resolving relation conflicts and inferring missing relations, are also integrated into the identification procedure.
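The idea of building a finite-state cascade from sets of rules can be illustrated with a simplified sketch. It assumes that every rule is a fixed sequence of labels rewritten to a single label, that each rule set becomes one stage, and that each stage reads the output of the previous one. The labels, rules and helper names (`build_cascade`, `apply_stage`, `run_cascade`) are hypothetical; the thesis's actual construction algorithm is not described in the abstract and is not reproduced here.

```python
def build_cascade(rule_sets):
    """Turn each rule set into a stage, trying longer patterns first."""
    return [sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
            for rules in rule_sets]


def apply_stage(stage, items):
    """Scan left to right; at each position, replace the first matching
    label pattern with a single (label, text) node."""
    out, i = [], 0
    while i < len(items):
        for pattern, label in stage:
            window = items[i:i + len(pattern)]
            if tuple(lbl for lbl, _ in window) == pattern:
                out.append((label, "".join(text for _, text in window)))
                i += len(pattern)
                break
        else:
            # No rule matched at this position: copy the item unchanged.
            out.append(items[i])
            i += 1
    return out


def run_cascade(stages, items):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        items = apply_stage(stage, items)
    return items


# Hypothetical rule sets: stage 1 builds basic entities, stage 2 extends them.
rule_sets = [
    [(("PLACE", "ORG_SUFFIX"), "TEAM"), (("SURNAME", "GIVEN"), "PERSON")],
    [(("TITLE", "PERSON"), "PERSON")],
]
# Toy input: "Beijing" + "team" + "coach" + surname "Wang" + given name "Xiaoming".
tokens = [("PLACE", "北京"), ("ORG_SUFFIX", "队"), ("TITLE", "教练"),
          ("SURNAME", "王"), ("GIVEN", "小明")]

entities = run_cascade(build_cascade(rule_sets), tokens)
```

The second stage can only fire because the first stage has already produced a `PERSON` node, which is the defining property of a cascade. A fuller implementation would compile each rule into a finite-state recognizer rather than compare fixed label sequences.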
Exploring formal models of linguistic data structuring. Enhanced solutions for knowledge management systems based on NLP applications
2010 - 2011
The principal aim of this research is to describe the extent to which formal models for linguistic data structuring are crucial in Natural Language Processing (NLP) applications. We pay particular attention to Knowledge Management Systems (KMS) designed for the Internet, and to the enhanced solutions they may require. To deal with these topics appropriately, we describe how to build computational linguistics applications that help humans establish and maintain an advantageous relationship with technologies, especially those that are based on or produce man-machine interactions in natural language.
We will explore the positive relationship that may exist between well-structured Linguistic Resources (LR) and KMS, in order to argue that if the information architecture of a KMS is based on the formalization of linguistic data, then the system works better and is more consistent.
As for the topics we want to deal with, first of all it is essential to state that, in order to build efficient and effective Information Retrieval (IR) tools, understanding and formalizing the combinatory mechanisms of natural language is the first step, not least because any piece of information produced by humans on the Internet is necessarily a linguistic act. Therefore, in this research work we will also discuss the NLP structuring of a Hybrid Model of linguistic formalization, which we hope will prove to be a useful tool to support, improve and refine KMSs.
More specifically, in section 1 we will describe how to structure language resources implementable inside KMSs, to what extent they can improve the performance of these systems and how the problem of linguistic data structuring is dealt with by natural language formalization methods.
In section 2 we will proceed with a brief review of computational linguistics, paying particular attention to specific software packages such as Intex, Unitex, NooJ, and Cataloga, which are developed according to the Lexicon-Grammar (LG) method, a linguistic theory established during the 1960s by Maurice Gross.
In section 3 we will describe some specific works useful to monitor the state of the art in Linguistic Data Structuring Models, Enhanced Solutions for KMSs, and NLP Applications for KMSs.
In section 4 we will address problems related to natural language formalization methods, describing mainly Transformational-Generative Grammar (TGG) and LG, plus other methods based on statistical approaches and ontologies.
In section 5 we will propose a Hybrid Model usable in NLP applications in order to create effective enhanced solutions for KMSs. Specific features and elements of our hybrid model will be shown through results from experimental research work. The case study we will present is a very complex NLP problem that has been little explored in recent years, i.e. the treatment of Multi Word Units (MWUs).
In section 6 we will close our research by evaluating its results and presenting possible future work perspectives. [edited by author] X n.s.