
    Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair

    This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the bitexts extracted from the Croatian top-level domain .hr and the Slovene top-level domain .si, and extrinsically on the English–Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. Finally, we present parallel datasets collected with our approach for the English–Croatian, English–Finnish, English–Serbian and English–Slovene language pairs. This research is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (AbuMaTran).
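
    As a rough, self-contained illustration of the two-stage idea described above, the Python sketch below pairs crawled pages whose URLs differ only in a language prefix (a common heuristic for finding candidate translation pages) and aligns their sentences one-to-one. All data and helper names are invented for illustration; the real SpiderLing and Bitextor pipeline is far more sophisticated.

    from urllib.parse import urlparse

    def url_key(url):
        """Strip a leading language code from the URL path so that
        /en/about and /hr/about map to the same key."""
        path = urlparse(url).path.strip("/").split("/")
        if path and path[0] in {"en", "hr"}:
            path = path[1:]
        return "/".join(path)

    def extract_bitext(documents, src_lang="en", tgt_lang="hr"):
        """Yield (source, target) sentence pairs from documents whose
        URLs differ only in the language prefix."""
        by_key = {}
        for doc in documents:
            by_key.setdefault(url_key(doc["url"]), {})[doc["lang"]] = doc
        for versions in by_key.values():
            if src_lang in versions and tgt_lang in versions:
                src = versions[src_lang]["text"].split(". ")
                tgt = versions[tgt_lang]["text"].split(". ")
                yield from zip(src, tgt)   # naive 1:1 sentence alignment

    # Toy crawl output standing in for SpiderLing's documents:
    crawled = [
        {"url": "http://example.hr/en/about", "lang": "en",
         "text": "We make cheese. Visit us."},
        {"url": "http://example.hr/hr/about", "lang": "hr",
         "text": "Proizvodimo sir. Posjetite nas."},
    ]
    print(list(extract_bitext(crawled)))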

    Building a free French wordnet from multilingual resources

    This paper describes the automatic construction of a freely available wordnet for French (WOLF) based on the Princeton WordNet (PWN), using various multilingual resources. Polysemous words were handled with an approach in which a parallel corpus for five languages was word-aligned and the extracted multilingual lexicon was disambiguated with the existing wordnets for these languages. For monosemous words, on the other hand, a bilingual approach sufficed to acquire equivalents; bilingual lexicons were extracted from Wikipedia and thesauri. The results obtained from each resource were merged and ranked according to the number of resources yielding the same literal. Automatic evaluation of the merged wordnet was performed against the French WordNet (FREWN), and manual evaluation was carried out on a sample of the generated synsets. The precision obtained shows that the presented approach is very promising, and applications using the created wordnet are already planned.
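
    The merge-and-rank step lends itself to a compact sketch: each resource proposes (synset id, French literal) candidates, and a literal's rank within a synset is the number of distinct resources proposing it. The Python below is a minimal illustration with invented candidate data, not the WOLF build code.

    from collections import defaultdict

    def merge_and_rank(candidates_by_resource):
        """candidates_by_resource: {resource: {(synset_id, literal), ...}}.
        Returns {synset_id: [(literal, n_resources), ...]} ranked by how
        many resources yielded the same literal."""
        support = defaultdict(set)
        for resource, pairs in candidates_by_resource.items():
            for synset_id, literal in pairs:
                support[(synset_id, literal)].add(resource)
        ranked = defaultdict(list)
        for (synset_id, literal), resources in support.items():
            ranked[synset_id].append((literal, len(resources)))
        for synset_id in ranked:
            ranked[synset_id].sort(key=lambda x: -x[1])
        return dict(ranked)

    # Invented candidates from three of the resources mentioned above:
    candidates = {
        "wikipedia":       {("02084071-n", "chien")},
        "parallel_corpus": {("02084071-n", "chien"), ("02084071-n", "toutou")},
        "thesaurus":       {("02084071-n", "chien")},
    }
    print(merge_and_rank(candidates))
    # {'02084071-n': [('chien', 3), ('toutou', 1)]}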

    Téarmaíocht don Aontas Eorpach. Taithí na hÉireann: Tionscadal GA IATE/ Terminology for the European Union. The Irish Experience: The GA IATE Project

    This study provides a comprehensive description of Irish-language terminology for the purposes of European Union translation work. An urgent need for Irish-language terminology arose in 2007 when Irish became an official EU language. This study documents the response to that need, and places it in the context of terminology work in other ‘new’ EU languages which gained official status in 2004 and 2007. IATE, the shared multilingual terminology database of the EU institutions and bodies, is described in detail, with particular emphasis on the role of the three major EU institutions: the Commission, the Council and the Parliament. The study was compiled by Fiontar, Dublin City University, in consultation with project participants in the EU institutions and the Irish public service.

    JaSlo: Integration of a Japanese-Slovene Bilingual Dictionary with a Corpus Search System


    Introduction to the special issue on cross-language algorithms and applications

    With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and effective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.

    Parse tree based machine translation for less-used languages

    The article describes a method that enhances translation performance for language pairs with a less-used source language and a widely used target language, by enabling the use of parse-tree-based statistical translation algorithms for such pairs. Automatic part-of-speech (POS) tagging algorithms have become accurate enough for efficient use in many tasks, and most of them are easily implementable for most world languages. The method is divided into two parts: the first part constructs alignments between POS tags of source sentences and induced parse trees of the target language; the second part searches through the trained data and selects the best candidates for target sentences, i.e. the translations. The method was not fully implemented due to time constraints: the training part was implemented and incorporated into a functional translation system, but the inclusion of a word alignment model into the translation part was not. The empirical evaluation, addressing the quality of the trained data, was carried out on a full implementation of the presented training algorithms, and the results confirm the employability of the method.
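
    A speculative sketch of the training part follows, under the assumption that the alignment model is built from co-occurrence counts between source-side POS tags and target-side parse-tree node labels. The representation is invented for illustration; the article's actual model is richer.

    from collections import Counter
    from itertools import product

    def train_alignment(pairs):
        """pairs: iterable of (source POS tags, target tree node labels).
        Returns co-occurrence counts that could later be used to score
        candidate alignments between tags and tree nodes."""
        counts = Counter()
        for src_tags, tgt_nodes in pairs:
            for tag, node in product(src_tags, tgt_nodes):
                counts[(tag, node)] += 1
        return counts

    # Toy training data: POS-tagged source paired with parsed target.
    training_data = [
        (["NOUN", "VERB"], ["NP", "VP", "S"]),
        (["NOUN", "VERB", "NOUN"], ["NP", "VP", "NP", "S"]),
    ]
    model = train_alignment(training_data)
    print(model[("NOUN", "NP")])   # 5 co-occurrences in the toy data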

    The Swedish-Turkish Parallel Corpus and Tools for its Creation

    Get PDF
    Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007. Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit. University of Tartu, Tartu, 2007. ISBN 978-9985-4-0513-0 (online) ISBN 978-9985-4-0514-7 (CD-ROM) pp. 136-143

    462 Machine Translation Systems for Europe

    We built 462 machine translation systems for all language pairs of the Acquis Communautaire corpus. We report and analyse the performance of these systems, and compare them against pivot translation and a number of system combination methods (multi-pivot, multi-source) that are possible due to the available systems.
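
    To make the comparison concrete, here is a minimal Python sketch of pivot translation and a naive multi-pivot combination: instead of a direct source-target system, one bridges through a third language, and multi-pivot keeps the hypothesis a scoring function likes best. The toy "systems" below are dictionaries standing in for trained MT engines; nothing here reflects the paper's actual implementation.

    def pivot_translate(text, src, tgt, pivot, systems):
        """Translate src->pivot, then pivot->tgt, using two direct systems."""
        intermediate = systems[(src, pivot)](text)
        return systems[(pivot, tgt)](intermediate)

    def multi_pivot(text, src, tgt, pivots, systems, score):
        """Translate through every available pivot and keep the hypothesis
        preferred by `score` (e.g. a language model)."""
        hypotheses = [pivot_translate(text, src, tgt, p, systems)
                      for p in pivots
                      if (src, p) in systems and (p, tgt) in systems]
        return max(hypotheses, key=score)

    # Toy "systems": lookup tables masquerading as MT engines.
    systems = {
        ("mt", "en"): lambda s: {"Bonġu": "Hello"}.get(s, s),
        ("en", "et"): lambda s: {"Hello": "Tere"}.get(s, s),
    }
    print(pivot_translate("Bonġu", "mt", "et", "en", systems))  # Tere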

    TEI and Microsoft: a marriage made in...

    In several on-going projects we were faced with the dilemma of how to reconcile our goal of delivering standardly encoded historical documents with the fact that the actual editing and annotation was performed by researchers and students who had no knowledge of XML and TEI and, for the most part, no interest in learning them. The solution we developed consists of allowing the annotators to use familiar and flexible editors, such as Microsoft Word (for structural annotation of documents) and Excel (for word-level linguistic annotation), and automatically converting these into TEI. Given the unconstrained nature of such editors, this sounds like a recipe for disaster. But the solution crucially depends on a dedicated Web service to which the annotators can upload their files; these are immediately converted to XML/TEI and from it back to a visual format, either HTML or Excel XML, and presented to the annotators. The annotators thus get immediate feedback about the quality of their encoding in the source and can correct errors before they accumulate, and the responsibility for correct encoding rests with the annotators rather than with the developers of the conversion procedure. The paper describes the web service and details its use in three projects. The main conclusions are that the proposed solution is appropriate for shallow encodings, but nevertheless requires producing detailed annotation guidelines.
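
    As a hedged illustration of the spreadsheet-to-TEI direction, the sketch below turns word-level (token, lemma, POS) rows, as one might export from Excel, into TEI-style <w> elements. The column layout and element choice are assumptions made for this example, not the project's actual conversion rules.

    import xml.etree.ElementTree as ET

    def rows_to_tei(rows):
        """rows: (token, lemma, pos) triples, e.g. exported from Excel.
        Returns a TEI-flavoured <s> element with one <w> per token."""
        s = ET.Element("s")
        for token, lemma, pos in rows:
            w = ET.SubElement(s, "w", lemma=lemma, pos=pos)
            w.text = token
        return s

    rows = [("Dogs", "dog", "NOUN"), ("bark", "bark", "VERB")]
    print(ET.tostring(rows_to_tei(rows), encoding="unicode"))
    # <s><w lemma="dog" pos="NOUN">Dogs</w><w lemma="bark" pos="VERB">bark</w></s>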