137 research outputs found

    Lexical Resources for Low-Resource PoS Tagging in Neural Times

    DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German

    We introduce DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German Wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the development of DeLex involved some manual work, we show that it represents a good tradeoff between development cost, lexical coverage and resource accuracy.

    Inducing Multilingual Text Analysis Tools Using Bidirectional Recurrent Neural Networks

    This work focuses on the rapid development of linguistic annotation tools for resource-poor languages. We experiment with several cross-lingual annotation projection methods using Recurrent Neural Network (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between the source and target language. More precisely, our method has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about foreign languages, which makes it applicable to a wide range of resource-poor languages, (c) it provides truly multilingual taggers. We investigate both uni- and bi-directional RNN models and propose a method to include external information (for instance, low-level POS information) in the RNN to train higher-level taggers (for instance, super sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual POS and super sense taggers.

    Semi-supervised neural part-of-speech tagging

    We present a simple method for learning part-of-speech taggers for low-resource languages that relies on dictionaries of those languages. Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved the accuracy required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available, and techniques for using it require multiple complex algorithmic steps. We show that a POS tagger can be built using bi-LSTMs and a freely available and naturally growing resource, Wiktionary. Across nine languages for which we have labeled data to evaluate results, we achieve accuracy that in some cases exceeds that of all unsupervised methods, of a supervised method based on hidden Markov chains, and of parallel-text methods. We achieve the highest accuracy reported for several languages.

    Practical Natural Language Processing for Low-Resource Languages.

    As the Internet and World Wide Web have continued to gain widespread adoption, the linguistic diversity represented on them has also been growing. Simultaneously, the field of Linguistics is facing a crisis of the opposite sort. Languages are becoming extinct faster than ever before, and linguists now estimate that the world could lose more than half of its linguistic diversity by the year 2100. This is a special time for Computational Linguistics: the field has unprecedented access to a great number of low-resource languages, readily available to be studied, but needs to act quickly before political, social, and economic pressures cause these languages to disappear from the Web. Most work in Computational Linguistics and Natural Language Processing (NLP) focuses on English or other languages that have text corpora of hundreds of millions of words. In this work, we present methods for automatically building NLP tools for low-resource languages with minimal need for human annotation in these languages. We start with language identification, specifically word-level language identification, an understudied variant that is necessary for processing Web text, and develop highly accurate machine learning methods for this problem. From there we move on to the problems of part-of-speech tagging and dependency parsing. For both of these problems we extend the current state of the art in projected learning to make use of multiple high-resource source languages instead of just a single language. In both tasks, we are able to improve on the best current methods. All of these tools are practically realized in the "Minority Language Server," an online tool that brings these techniques together with low-resource language text on the Web. The Minority Language Server, starting with only a few words in a language, can automatically collect text in that language, identify its language, and tag its parts of speech. We hope that this system provides a convincing proof of concept for the automatic collection and processing of low-resource language text from the Web, and one that can be realized before it is too late.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113373/1/benking_1.pd

    Uvid u automatsko izlučivanje metaforičkih kolokacija (An Insight into the Automatic Extraction of Metaphorical Collocations)

    Collocations have been the subject of much scientific research over the years. The focus of this research is on a subset of collocations, namely metaphorical collocations. In metaphorical collocations, a semantic shift has taken place in one of the components, i.e., one of the components takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of the existing research on collocation extraction, as well as of existing methods, measures, and resources. The existing research is classified according to the approach (statistical, hybrid, and distributional semantics) and presented in three separate sections. Various association measures and existing ways of evaluating the results of automatic collocation extraction are also described. The insights gained from existing research serve as a first step in exploring the possibility of developing a method for automatic extraction of metaphorical collocations. The methods, tools, and resources that may prove useful for future work are highlighted.