Automatic generation of named entity taggers leveraging parallel corpora
The lack of hand-curated data is a major impediment to developing statistical semantic processors for many of the world's languages. A major issue of semantic processors in Natural Language Processing (NLP) is that they require manually annotated data to perform accurately. Our work aims to address this issue by leveraging existing annotations and semantic processors from multiple source languages, projecting their annotations via the statistical word alignments traditionally used in Machine Translation. Taking the Named Entity Recognition (NER) task as a use case of semantic processing, this work presents a method to automatically induce Named Entity taggers from parallel data, without any manual intervention. The intuition is to transfer, or project, semantic annotations from multiple sources to a target language by applying statistical word alignment methods to parallel texts (Och and Ney, 2000; Liang et al., 2006). The projected annotations can then be used to automatically generate semantic processors for the target language, providing NLP processors without any training data in that language. The experiments focus on four languages: German, English, Spanish and Italian, and our empirical evaluation shows that our method obtains competitive results when compared with models trained on gold-standard out-of-domain data. This demonstrates that our projection algorithm effectively transports NER annotations across languages via parallel data, thus providing a fully automatic method to obtain NER taggers for as many languages as parallel corpora make available.
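As a rough illustration of the projection idea described above, the core step can be sketched as follows. This is a minimal toy, not the authors' system: the function name and the 1:1-alignment assumption are illustrative, and a real system must also repair BIO label sequences broken by word reordering.

```python
# Toy sketch of annotation projection via word alignment: copy each
# source token's NER label to its aligned target token(s).

def project_annotations(src_labels, alignment, tgt_len):
    """Project per-token labels from source to target.

    src_labels: labels for source tokens, e.g. ["B-PER", "I-PER", "O"]
    alignment:  (src_index, tgt_index) pairs produced by a word aligner
    tgt_len:    number of target tokens
    """
    tgt_labels = ["O"] * tgt_len  # unaligned target tokens stay "O"
    for src_i, tgt_i in alignment:
        if src_labels[src_i] != "O":
            tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

# "John Smith lives" -> "John Smith wohnt" with a monotone 1:1 alignment
print(project_annotations(["B-PER", "I-PER", "O"],
                          [(0, 0), (1, 1), (2, 2)], 3))
# -> ['B-PER', 'I-PER', 'O']
```

The projected target-side labels can then serve as (noisy) training data for a standard sequence tagger in the target language.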
Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. There have consequently been many attempts to reduce this cost through the development of unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner, but one that is more linguistically informed than previous unsupervised approaches. That is, this study attempts to induce, from an unannotated text, clusters of words that are inflectional variants of each other. A set of inflectional suffixes by part-of-speech is then induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language–the source language–to another language–the target. This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target, for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed that of the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this work integrates more learning methods than previous studies. Furthermore, it targets inflectional rather than derivational morphology, a crucial distinction in linguistics.
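To make the clustering-then-suffix-induction pipeline concrete, here is a deliberately naive sketch that groups words by a shared stem prefix and reads candidate suffixes off each cluster. The fixed stem length and the function name are illustrative assumptions; the study's actual method is alignment-and-transfer based and far more sophisticated.

```python
# Toy sketch: cluster putative inflectional variants by shared stem
# prefix, then collect the residual strings as candidate suffixes.
from collections import defaultdict

def cluster_by_stem(words, stem_len=4):
    clusters = defaultdict(list)
    for w in words:
        clusters[w[:stem_len]].append(w)
    # Keep only clusters with >1 member; report sorted unique suffixes.
    return {stem: sorted(set(w[len(stem):] for w in ws))
            for stem, ws in clusters.items() if len(ws) > 1}

print(cluster_by_stem(["spielt", "spielen", "spielte", "Haus"]))
# -> {'spie': ['len', 'lt', 'lte']}
```

Even this toy shows why German is a convenient target: its inflectional suffixes are comparatively regular, so shared-prefix clusters already surface plausible paradigms.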
Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging
We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end, in particular a partially observed conditional random field model in which coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.
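The coupling of token and type constraints can be illustrated with a small sketch. This is not the paper's CRF; it only shows the constraint logic the abstract describes: a type constraint (the tag dictionary) restricts a word form's possible tags, and a projected token constraint is honored only when the dictionary licenses it. The function and variable names are hypothetical.

```python
# Sketch of coupled token + type constraints for POS tagging.

def allowed_tags(word, projected_tag, tag_dict, all_tags):
    # Type constraint: tags licensed for this word form (fall back to
    # the full tagset for out-of-dictionary words).
    type_set = set(tag_dict.get(word, all_tags))
    # Token constraint: trust the tag projected via word alignment,
    # but only if the type constraint licenses it.
    if projected_tag is not None and projected_tag in type_set:
        return {projected_tag}
    return type_set

tag_dict = {"la": ["DET", "PRON"]}
all_tags = {"DET", "PRON", "NOUN", "VERB"}
print(allowed_tags("la", "DET", tag_dict, all_tags))   # -> {'DET'}
print(allowed_tags("la", "VERB", tag_dict, all_tags))  # -> {'DET', 'PRON'}
```

In the paper's model these filtered tag sets act as a partial (latent-variable) supervision signal during CRF training rather than as hard decisions at tagging time.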
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOURCE LANGUAGES TAGSET- A FOCUS ON AN AFRICAN IGBO
In this paper, we demonstrate the efficacy of a POS annotation method that employs two automatic approaches to assist POS-tagged corpus creation for a language that is new to NLP: cross-lingual and monolingual POS tag projection. We use the cross-lingual approach to automatically create an initial ‘errorful’ tagged corpus for a target language via word alignment, deriving the necessary resources from a source language rich in NLP resources. A monolingual method is then applied to clean the noise induced by the alignment process and to transform the source-language tags into target-language tags. We use English and Igbo as our case study; this is possible because parallel texts exist between English and Igbo, and the source language, English, has NLP resources available. The results of the experiment show a steady improvement in accuracy and in the rate of tag transformation, with scores ranging from 6.13% to 83.79% and from 8.67% to 98.37% respectively. The rate of tag transformation measures the rate at which source-language tags are translated into target-language tags.
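One simple way to clean an "errorful" projected corpus, in the spirit of the monolingual step described above, is per-word-type majority voting: overwrite rare projected tags with the tag a word receives most often. This is a hedged sketch of one plausible cleaning step, not the paper's actual method, and all names in it are illustrative.

```python
# Sketch of monolingual noise cleaning by per-word-type majority vote.
from collections import Counter, defaultdict

def majority_clean(tagged_corpus):
    """tagged_corpus: list of sentences, each a list of (word, tag)."""
    votes = defaultdict(Counter)
    for sent in tagged_corpus:
        for word, tag in sent:
            votes[word][tag] += 1
    majority = {w: c.most_common(1)[0][0] for w, c in votes.items()}
    # Retag every token with its word's majority tag.
    return [[(w, majority[w]) for w, _ in sent] for sent in tagged_corpus]

corpus = [[("oku", "N"), ("di", "V")], [("oku", "N")], [("oku", "ADJ")]]
print(majority_clean(corpus))
# "oku" gets N on 2 of 3 votes, so the noisy ADJ projection is overwritten
```

Majority voting deliberately sacrifices genuine tag ambiguity for noise reduction; more careful methods keep low-frequency tags when context supports them.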
Practical Natural Language Processing for Low-Resource Languages.
As the Internet and World Wide Web have continued to gain widespread adoption, the linguistic diversity represented has also been growing. Simultaneously the field of Linguistics is facing a crisis of the opposite sort. Languages are becoming extinct faster than ever before and linguists now estimate that the world could lose more than half of its linguistic diversity by the year 2100. This is a special time for Computational Linguistics; this field has unprecedented access to a great number of low-resource languages, readily available to be studied, but needs to act quickly before political, social, and economic pressures cause these languages to disappear from the Web.
Most work in Computational Linguistics and Natural Language Processing (NLP) focuses on English or other languages that have text corpora of hundreds of millions of words. In this work, we present methods for automatically building NLP tools for low-resource languages with minimal need for human annotation in these languages. We start with language identification, specifically word-level language identification, an understudied variant that is necessary for processing Web text, and develop highly accurate machine learning methods for this problem. From there we move on to part-of-speech tagging and dependency parsing. For both of these problems we extend the current state of the art in projected learning to make use of multiple high-resource source languages instead of just a single one, and in both tasks we improve on the best current methods. All of these tools are practically realized in the "Minority Language Server," an online tool that brings these techniques together with low-resource language text on the Web. Starting with only a few words in a language, the Minority Language Server can automatically collect text in that language, identify its language, and tag its parts of speech. We hope that this system provides a convincing proof of concept for the automatic collection and processing of low-resource language text from the Web, one that can be realized before it is too late.
PhD thesis, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113373/1/benking_1.pd
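The word-level language identification task mentioned above can be sketched with character n-gram profiles: score a word against per-language trigram frequencies and pick the higher-scoring language. This toy ignores sequence context and smoothing, which real systems (including the one described here) rely on; all names and the tiny "training" strings are illustrative.

```python
# Toy word-level language ID via per-language character-trigram counts.
from collections import Counter

def trigram_profile(text):
    padded = f"  {text} "  # pad so word boundaries become trigrams too
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def score(word, profile):
    # Sum the profile frequency of each trigram in the word.
    return sum(profile[g] for g in trigram_profile(word))

en = trigram_profile("the cat sat on the mat with the hat")
de = trigram_profile("der Hund und die Katze und der Igel")

word = "the"
print("en" if score(word, en) > score(word, de) else "de")  # -> en
```

Word-level (rather than document-level) decisions matter for Web text, where low-resource language content is frequently code-mixed with English inside a single sentence.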
Inducing information extraction systems for new languages via cross-language projection
Information extraction (IE) systems are costly to build because they require development texts, parsing tools, and specialized dictionaries for each application domain and each natural language that needs to be processed. We present a novel method for rapidly creating IE systems for new languages by exploiting existing IE systems via cross-language projection. Given an IE system for a source language (e.g., English), we can transfer its annotations to corresponding texts in a target language (e.g., French) and learn information extraction rules for the new language automatically. In this paper, we explore several ways of realizing both the transfer and learning processes using off-the-shelf machine translation systems, induced word alignment, attribute projection, and transformation-based learning. We present a variety of experiments that show how an English IE system for a plane-crash domain can be leveraged to automatically create a French IE system for the same domain.
Planting Trees in the Desert: Delexicalized Tagging and Parsing Combined
Various unsupervised and semi-supervised methods have been proposed to tag and parse an unseen language. We explore delexicalized parsing, proposed by Zeman and Resnik (2008), and delexicalized tagging, proposed by Yu et al. (2016). For both approaches we provide a detailed evaluation on Universal Dependencies data (Nivre et al., 2016), a de-facto standard for multilingual morphosyntactic processing (while the previous work used other datasets). Our results confirm that, in separation, each of the two delexicalized techniques has some limited potential when no annotation of the target language is available. However, if used in combination, their errors multiply beyond acceptable limits. We demonstrate that even the tiniest bit of expert annotation in the target language may contain significant potential and should be used if available.
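The core trick behind delexicalized transfer is small enough to sketch: strip away the word forms and keep only the (universal) POS tags, so that a model trained on tag sequences in one language can be applied directly to another. The function name and examples below are illustrative, not taken from the papers cited above.

```python
# Minimal sketch of delexicalization: keep only universal POS tags.

def delexicalize(sentence):
    """sentence: list of (form, upos) pairs -> list of UPOS tags only."""
    return [upos for _, upos in sentence]

# An English sentence and its German translation share the same
# delexicalized view, which is what makes cross-lingual transfer work.
src = [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")]
tgt = [("Der", "DET"), ("Hund", "NOUN"), ("bellt", "VERB")]
print(delexicalize(src) == delexicalize(tgt))  # -> True
```

The paper's negative result is also visible here: if the target-side tags are themselves predicted by a delexicalized tagger, its errors feed directly into the delexicalized parser's input, so the two error sources compound.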