230 research outputs found

    Automatic Extraction of Semantic Relations for Less­Resourced Languages

    Get PDF
    Proceedings of the NODALIDA 2009 workshop WordNets and other Lexical Semantic Resources — between Lexical Semantics, Lexicography, Terminology and Formal Ontologies. Editors: Bolette Sandford Pedersen, Anna Braasch, Sanni Nimb and Ruth Vatvedt Fjeld. NEALT Proceedings Series, Vol. 7 (2009), 1-6. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9209

    Getting Past the Language Gap: Innovations in Machine Translation

    Get PDF
    In this chapter, we will be reviewing state of the art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods had been introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable the working of Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT

    LiDom builder: Automatising the construction of multilingual domain modules

    Get PDF
    136 p.Laburpena Lan honetan LiDOM Builder tresnaren analisi, diseinu eta ebaluazioa aurkezten dira. Teknologian oinarritutako hezkuntzarako tresnen Domeinu Modulu Eleaniztunak testuliburu elektronikoetatik era automatikoan erauztea ahalbidetzen du LiDOM Builderek. Ezagutza eskuratzeko, Hizkuntzaren Prozesamendurako eta Ikaste Automatikorako teknikekin batera, hainbat baliabide eleaniztun erabiltzen ditu, besteak beste, Wikipedia eta WordNet.Domeinu Modulu Elebakarretik Domeinu Modulu Eleaniztunerako bidean, LiDOM Builder tresna DOM-Sortze ingurunearen (Larrañaga, 2012; Larrañaga et al., 2014) bilakaera dela esan genezake. Horretarako, LiDOM Builderek domeinua ikuspegi eleaniztun batetik adieraztea ahalbidetzen duen mekanismoa dakar. Domeinu Modulu Eleaniztunak bi maila ezberdinetako ezagutza jasotzen du: Ikaste Domeinuaren Ontologia (IDO), non hizkuntza ezberdinetan etiketatutako topikoak eta hauen arteko erlazio pedagogikoak jasotzen baitira, eta Ikaste Objektuak (IO), hau da, metadatuekin etiketatutako baliabide didaktikoen bilduma, hizkuntza horietan. LiDOM Builderek onartutako hizkuntza guztietan domeinuaren topikoak adierazteko aukera ematen du. Topiko bakoitza lotuta dago dagokion hizkuntzako bere etiketa baliokidearekin. Gainera, IOak deskribatzeko metadatu aberastuak erabiltzen ditu hizkuntza desberdinetan parekideak diren baliabide didaktikoak lotzeko.LiDOM Builderen, hasiera batean, domeinu-modulua hizkuntza jakin batean idatzitako dokumentu batetik erauziko da eta, baliabide eleaniztunak erabiliko dira, gerora, bai topikoak bai IOak beste hizkuntzetan ere lortzeko. Lan honetan, Ingelesez idatzitako liburuek osatuko dute informazio-iturri nagusia bai doitze-prozesuan bai ebaluazio-prozesuan. Zehazki, honako testuliburu hauek erabili dira: Principles of Object Oriented Programming (Wong and Nguyen, 2010), Introduction to Astronomy (Morison, 2008) eta Introduction to Molecular Biology (Raineri, 2010). Baliabide eleaniztunei dagokienez, Wikipedia, WordNet eta Wikipediatik erauzitako beste hainbat ezagutza-base erabili dira. Testuliburuetatik Domeinu Modulu Eleaniztunak eraikitzeko, LiDOM Builder hiru modulu nagusitan oinarritzen da: LiTeWi eta LiReWi moduluak IDO eleaniztuna eraikitzeaz arduratuko dira eta LiLoWi, aldiz, IO eleaniztunak eraikitzeaz. Jarraian, aipatutako modulu bakoitza xehetasun gehiagorekin azaltzen da.¿ LiTeWi (Conde et al., 2015) moduluak, edozein ikaste-domeinutako testuliburu batetik abiatuta, Hezkuntzarako Ontologia bati dagozkion hainbat termino eleaniztun identifikatuko ditu, hala nola TF-IDF, KP-Miner, CValue eta Shallow Parsing Grammar. Hori lortzeko, gainbegiratu gabeko datu-erauzketa teknikez eta Wikipediaz baliatzen da. Ontologiako topikoak erauzteak LiTeWi-n hiru urrats ditu: lehenik hautagai diren terminoen erauzketa; bigarrenik, lortutako terminoen konbinatzea eta fintzea azken termino zerrenda osatuz; eta azkenik, zerrendako terminoak beste hizkuntzetara mapatzea Wikipedia baliatuz.¿ LiReWi (Conde et al., onartzeko) moduluak Hezkuntzarako Ontologia erlazio pedagogikoez aberastuko du, beti ere testuliburua abiapuntu gisa erabilita. Lau motatako erlazio pedagogikoak erauziko ditu (isA, partOf, prerequisite eta pedagogicallyClose) hainbat teknika eta ezagutza-base konbinatuz. Ezagutza-baseen artean Wikipedia, WordNet, WikiTaxonomy, WibiTaxonomy eta WikiRelations daude. LiReWi-k ere hiru urrats emango ditu erlazioak lortzeko: hasteko, ontologiako topikoak erlazioak erauzteko erabiliko diren ezagutza-base desberdinekin mapatuko ditu; gero, hainbat erlazio-erauzle, bakoitza teknika desberdin batean oinarritzen dena, exekutatuko ditu konkurrenteki erlazio hautagaiak erauzteko; eta, bukatzeko, lortutako emaitza guztiak konbinatu eta iragaziko ditu erlazio pedagogikoen azken multzoa lortuz. Gainera, DOM-Sortzetik LiDOM Buildererako trantsizioan, tesi honetan hobetu egin dira dokumentuen indizeetatik erauzitako isA eta partOf erlazioak, Wikipedia baliabide gehigarri bezala erabilita (Conde et al., 2014).¿ LiLoWi moduluak IOak -batzuk eleaniztunak- erauziko ditu, abiapuntuko testuliburutik ez ezik Wikipedia edo WordNet bezalako ezagutza-baseetatik ere. IDO ontologiako topiko bakoitza Wikipedia eta WordNet-ekin mapatu ostean, LiLoWi-k baliabide didaktikoak erauziko ditu hainbat IO erauzlez baliatuz.IO erauzketa-prozesuan, DOM-Sortzetik LiDOM Buildereko bidean, eta Wikipedia eta WordNet erabili aurretik, ingelesa hizkuntza ere gehitu eta ebaluatu da (Conde et al., 2012).LiDOM Builderen ebaluaziori dagokionez, modulu bakoitza bere aldetik testatua eta ebaluatua izan da bai Gold-standard teknika bai aditu-ebaluazioa baliatuz. Gainera, Wikipedia eta WordNet ezagutza-baseen integrazioak IOen erauzketari ekarri dion hobekuntza ere ebaluatu da. Esan genezake kasu guztietan lortu diren emaitzak oso onak direla.Bukatzeko, eta laburpen gisa, lau dira LiDOM Builderek Domeinu Modulu Eleaniztunaren arloari egin dizkion ekarpen nagusiak:¿ Domeinu Modulu Eleaniztunak adierazteko mekanismo egokia.¿ LiTeWiren garapena. Testuliburuetatik Hezkuntzarako Ontologietarako terminologia eleaniztuna erauztea ahalbidetzen du modulu honek. Ingelesa eta Gaztelera hizkuntzentzako termino-erauzlea eskura dago https://github.com/Neuw84/LiTe URLan.¿ LiReWiren garapena. Testuliburuetatik Hezkuntzarako Ontologietarako erlazio pedagogikoak erauztea ahalbidetzen du modulu honek. Erabiltzen duen Wikipedia/WordNet mapatzailea eskura dago https://github.com/Neuw84/Wikipedia2WordNet URLan.¿ LiLoWiren garapena. Testuliburua eta Wikipedia eta WordNet ezagutza-baseak erabilita IO eleaniztunak erauztea ahalbidetzen du modulu honek

    Report on first selection of resources

    Get PDF
    The central objective of the Metanet4u project is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing, and supports a new generation of exchange facilities for them.Peer ReviewedPreprin

    Lexicography in the crystal ball: facts, trends and outlook

    Get PDF

    Practical Natural Language Processing for Low-Resource Languages.

    Full text link
    As the Internet and World Wide Web have continued to gain widespread adoption, the linguistic diversity represented has also been growing. Simultaneously the field of Linguistics is facing a crisis of the opposite sort. Languages are becoming extinct faster than ever before and linguists now estimate that the world could lose more than half of its linguistic diversity by the year 2100. This is a special time for Computational Linguistics; this field has unprecedented access to a great number of low-resource languages, readily available to be studied, but needs to act quickly before political, social, and economic pressures cause these languages to disappear from the Web. Most work in Computational Linguistics and Natural Language Processing (NLP) focuses on English or other languages that have text corpora of hundreds of millions of words. In this work, we present methods for automatically building NLP tools for low-resource languages with minimal need for human annotation in these languages. We start first with language identification, specifically focusing on word-level language identification, an understudied variant that is necessary for processing Web text and develop highly accurate machine learning methods for this problem. From there we move onto the problems of part-of-speech tagging and dependency parsing. With both of these problems we extend the current state of the art in projected learning to make use of multiple high-resource source languages instead of just a single language. In both tasks, we are able to improve on the best current methods. All of these tools are practically realized in the "Minority Language Server," an online tool that brings these techniques together with low-resource language text on the Web. The Minority Language Server, starting with only a few words in a language can automatically collect text in a language, identify its language and tag its parts of speech. We hope that this system is able to provide a convincing proof of concept for the automatic collection and processing of low-resource language text from the Web, and one that can hopefully be realized before it is too late.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113373/1/benking_1.pd

    Proceedings

    Get PDF
    Proceedings of the NODALIDA 2011 Workshop Constraint Grammar Applications. Editors: Eckhard Bick, Kristin Hagen, Kaili Müürisep, Trond Trosterud. NEALT Proceedings Series, Vol. 14 (2011), vi+69 pp. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/19231

    Getting Past the Language Gap: Innovations in Machine Translation

    Get PDF
    In this chapter, we will be reviewing state of the art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods had been introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable the working of Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT
    corecore