3 research outputs found
Automatic generation of named entity taggers leveraging parallel corpora
The lack of hand curated data is a major impediment to developing statistical semantic
processors for many of the world languages. A major issue of semantic processors in Nat-
ural Language Processing (NLP) is that they require manually annotated data to perform
accurately. Our work aims to address this issue by leveraging existing annotations and
semantic processors from multiple source languages by projecting their annotations via
statistical word alignments traditionally used in Machine Translation. Taking the Named
Entity Recognition (NER) task as a use case of semantic processing, this work presents
a method to automatically induce Named Entity taggers using parallel data, without any
manual intervention. Our method leverages existing semantic processors and annotations
to overcome the lack of annotation data for a given language. The intuition is to transfer
or project semantic annotations, from multiple sources to a target language, by statistical
word alignment methods applied to parallel texts (Och and Ney, 2000; Liang et al., 2006).
The projected annotations can then be used to automatically generate semantic processors
for the target language. In this way we would be able to provide NLP processors with-
out training data for the target language. The experiments are focused on 4 languages:
German, English, Spanish and Italian, and our empirical evaluation results show that our
method obtains competitive results when compared with models trained on gold-standard
out-of-domain data. This shows that our projection algorithm is effective to transport NER
annotations across languages via parallel data thus providing a fully automatic method to
obtain NER taggers for as many as the number of languages aligned via parallel corpora
Practical Natural Language Processing for Low-Resource Languages.
As the Internet and World Wide Web have continued to gain widespread adoption, the linguistic diversity represented has also been growing. Simultaneously the field of Linguistics is facing a crisis of the opposite sort. Languages are becoming extinct faster than ever before and linguists now estimate that the world could lose more than half of its linguistic diversity by the year 2100. This is a special time for Computational Linguistics; this field has unprecedented access to a great number of low-resource languages, readily available to be studied, but needs to act quickly before political, social, and economic pressures cause these languages to disappear from the Web.
Most work in Computational Linguistics and Natural Language Processing (NLP) focuses on English or other languages that have text corpora of hundreds of millions of words. In this work, we present methods for automatically building NLP tools for low-resource languages with minimal need for human annotation in these languages. We start first with language identification, specifically focusing on word-level language identification, an understudied variant that is necessary for processing Web text and develop highly accurate machine learning methods for this problem. From there we move onto the problems of part-of-speech tagging and dependency parsing. With both of these problems we extend the current state of the art in projected learning to make use of multiple high-resource source languages instead of just a single language. In both tasks, we are able to improve on the best current methods. All of these tools are practically realized in the "Minority Language Server," an online tool that brings these techniques together with low-resource language text on the Web. The Minority Language Server, starting with only a few words in a language can automatically collect text in a language, identify its language and tag its parts of speech. We hope that this system is able to provide a convincing proof of concept for the automatic collection and processing of low-resource language text from the Web, and one that can hopefully be realized before it is too late.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113373/1/benking_1.pd