10,683 research outputs found
Some Salient Issues in the Unsupervised Learning of Igbo Morphology
The automatic learning of the morphology of natural language is an important topic in computational linguistics, because morphology is foundational to the study of linguistics. In addition, the emerging information society demands the application of Information and Communication Technologies (ICT) to languages in ways that require human-like analysis of language, and this depends to a large extent on the ability to undertake computational analysis of morphology. Even though rule-based and supervised learning approaches to the modeling of morphology have been found to be productive, they have also been found to be costly, cumbersome and susceptible to human error. In contrast, unsupervised learning methods do not require expensive human intervention but, like all statistical methods, they demand large volumes of linguistic data. This poses a challenge to resource-scarce languages such as Igbo. Furthermore, being a highly agglutinative language, Igbo features certain morphological processes that may not be easily accommodated by most of the frequency-driven unsupervised learning models available. This paper takes a critical look at some of the identified challenges of inducing Igbo morphology as a first step in devising methods by which they can be addressed.
Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. Consequently, there have been many attempts to reduce the cost of developing morphological systems through unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner, but one that is more linguistically informed than previous unsupervised approaches. That is, this study attempts to induce, from an unannotated text, clusters of words that are inflectional variants of each other. A set of inflectional suffixes by part of speech is then induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source language) to another language (the target). This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed that of the current roster of unsupervised or minimally supervised approaches to morphology acquisition, the study attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.
Producing power-law distributions and damping word frequencies with two-stage language models
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes (the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process) that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. 48 pages.
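The generator/adaptor split described in this abstract can be illustrated with a minimal sketch of a two-parameter (Pitman-Yor) Chinese restaurant process used as an adaptor. This is an illustration under stated assumptions, not the authors' implementation: the function name `pitman_yor_adaptor`, the toy uniform generator, and the parameter values are all hypothetical, and `concentration` is assumed to be positive with `discount` in [0, 1).

```python
import random


def pitman_yor_adaptor(n_tokens, discount, concentration, generator, rng):
    """Produce n_tokens word tokens from a two-stage model.

    generator(rng) draws a word type from the first-stage model.
    The adaptor seats each token at a table; a token reuses the label
    of an existing table or opens a new table labelled by the generator.
    Assumes concentration > 0 and 0 <= discount < 1.
    """
    table_counts = []  # customers seated at each table
    table_labels = []  # word type attached to each table
    tokens = []
    for n in range(n_tokens):
        k = len(table_counts)
        # Probability of a new table: (concentration + discount * K) / (n + concentration)
        p_new = (concentration + discount * k) / (n + concentration)
        if rng.random() < p_new:
            table_counts.append(1)
            table_labels.append(generator(rng))
            tokens.append(table_labels[-1])
        else:
            # Existing table i is chosen with weight (count_i - discount)
            weights = [c - discount for c in table_counts]
            i = rng.choices(range(k), weights=weights)[0]
            table_counts[i] += 1
            tokens.append(table_labels[i])
    return tokens, table_labels


def uniform_generator(rng):
    """Hypothetical first-stage model: uniform over a 1000-word toy vocabulary."""
    return "w%d" % rng.randrange(1000)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens, labels = pitman_yor_adaptor(5000, 0.8, 1.0, uniform_generator, rng)
    print(len(tokens), "tokens generated from", len(labels), "tables")
```

Although the toy generator is uniform, the rich-get-richer seating makes a few table labels account for most tokens, which is the kind of frequency damping the abstract attributes to the adaptor stage.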
An Unsupervised Knowledge Free Algorithm for the Learning of Morphology in Natural Languages - Master's Thesis, May 2002
This thesis describes an unsupervised system to learn natural language morphology, specifically suffix identification from unannotated text. The system is language independent, so that it can learn the morphology of any human language. For English this means identifying "-s", "-ing", "-ed", "-tion" and many other suffixes, in addition to learning which stems they attach to. The system uses no prior knowledge, such as part-of-speech tags, and learns the morphology by simply reading in a body of unannotated text. The system consists of a generative probabilistic model, which is used to evaluate hypotheses, and a directed search and a hill-climbing search, which are used in conjunction to find a highly probable hypothesis. Experiments applying the system to English and Polish are described.
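To make the task of suffix identification concrete, here is a small sketch of how candidate suffixes might be proposed from an unannotated word list. It is a simple frequency heuristic swapped in for illustration, not the thesis's generative model or its directed and hill-climbing searches; the function name `candidate_suffixes` and its parameters are hypothetical.

```python
from collections import defaultdict


def candidate_suffixes(words, max_suffix_len=4, min_stem_len=3):
    """Rank word endings by how many attested stems they attach to.

    A split word = stem + suffix is counted only when the stem also
    occurs on its own in the word list (an assumption of this sketch).
    """
    vocab = set(words)
    stems_by_suffix = defaultdict(set)
    for word in vocab:
        for k in range(1, max_suffix_len + 1):
            stem, suffix = word[:-k], word[-k:]
            if len(stem) >= min_stem_len and stem in vocab:
                stems_by_suffix[suffix].add(stem)
    # Most productive suffixes first; ties broken alphabetically
    return sorted(stems_by_suffix.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    corpus = ["walk", "walks", "walked", "walking",
              "talk", "talks", "talked", "talking", "jump", "jumps"]
    for suffix, stems in candidate_suffixes(corpus):
        print("-" + suffix, sorted(stems))
```

A heuristic like this only proposes hypotheses; a full system of the kind the abstract describes would still need a model to score them against each other.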
Exploring Linguistic Constraints in NLP Applications
The key argument of this dissertation is that the success of a Natural Language Processing (NLP) application depends on a proper representation of the corresponding linguistic problem. This theme is raised in the context that the recent progress made in our field is widely credited to the effective use of strong engineering techniques. However, the intriguing power of highly lexicalized models shown in many NLP applications is not only an achievement of developments in machine learning; it would also have been impossible without the extensive hand-annotated data resources made available, which were originally built with very deep linguistic considerations.
More specifically, we explore three linguistic aspects in this dissertation: the distinction between closed-class and open-class words, long-tail distributions in vocabulary study, and determinism in language models. The first two aspects are studied in unsupervised tasks, unsupervised part-of-speech (POS) tagging and morphology learning, and the last one is studied in supervised tasks, English POS tagging and Chinese word segmentation. Each linguistic aspect under study manifests itself in a (different) way to help improve performance or efficiency in some NLP application.
Building Morphological Chains for Agglutinative Languages
In this paper, we build morphological chains for agglutinative languages by
using a log-linear model for the morphological segmentation task. The model is
based on the unsupervised morphological segmentation system called
MorphoChains. We extend the MorphoChains log-linear model by expanding the
candidate space recursively to cover more split points for agglutinative
languages such as Turkish, whereas in the original model candidates are
generated by considering only binary segmentation of each word. The results
show that we improve the state-of-the-art Turkish scores by 12%, reaching an F-measure
of 72%, and we improve the English scores by 3%, reaching an F-measure of 74%.
Eventually, the system outperforms both MorphoChains and other well-known
unsupervised morphological segmentation systems. The results indicate that
candidate generation plays an important role in such an unsupervised log-linear
model that is learned using contrastive estimation with negative samples.
Comment: 10 pages, accepted and presented at CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics).
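The difference between binary candidates and a recursively expanded candidate space can be sketched as follows. This only illustrates how the number of candidate segmentations grows; it is not the MorphoChains code or the authors' candidate generator, the log-linear scoring is not shown, and the names `binary_splits`, `recursive_splits`, and `max_parts` are hypothetical.

```python
def binary_splits(word):
    """All segmentations of a word into exactly two parts (one split point)."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def recursive_splits(word, max_parts=4):
    """All segmentations of a word into 1..max_parts contiguous pieces.

    Each split point leaves a remainder that is split again, so words
    with several stacked suffixes get candidates with several boundaries.
    """
    if max_parts == 1 or len(word) == 1:
        return [(word,)]
    results = [(word,)]
    for i in range(1, len(word)):
        head = word[:i]
        for tail in recursive_splits(word[i:], max_parts - 1):
            results.append((head,) + tail)
    return results


if __name__ == "__main__":
    word = "evlerde"  # Turkish: ev + ler + de, "in the houses"
    print(len(binary_splits(word)), "binary candidates")
    print(len(recursive_splits(word, max_parts=3)), "candidates with up to 3 parts")
```

A binary generator can never propose `("ev", "ler", "de")` in one step, whereas the recursive version includes it among its candidates, at the cost of a much larger space for the model to score.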
Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration
Machine Translation for Indian languages is an emerging research area. Transliteration is one of the modules we design when building a translation system. Transliteration means mapping source-language text into the target language. Simple mapping decreases the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if we use part-of-speech tagging and stemming-assisted transliteration. We have shown that much of the content in Gujarati gets transliterated while being processed for translation into Hindi.