238 research outputs found

    Building machine translation systems for minor languages: challenges and effects

    Building machine translation systems for disadvantaged languages, which I will call minor languages, poses a number of challenges whilst also opening the door to new opportunities. After defining a few basic concepts, such as minor language and machine translation, the paper provides a brief overview of the types of machine translation available today, their most common uses, the type of data they are based on, and the usage rights and licences of machine translation software and data. Then, it describes the challenges involved in building machine translation systems, as well as the effects these systems can have on the status of minor languages. Finally, these points are illustrated with examples from minor languages in Europe.

    Translation technologies: optional activities (Tecnologías de la Traducción: Actividades opcionales)

    Optional activities on translation technologies

    Machine translation in practice: applications, difficulties and development strategies (La traducció automàtica en la pràctica: aplicacions, dificultats i estratègies de desenvolupament)

    This article describes machine translation systems, their current applications, and the main difficulties that this language technology has to face. It presents Apertium, an open-source machine translation platform on which several machine translators have been built for different language pairs, including pairs that involve Catalan. Drawing on the authors' experience, it describes some of the tensions that arise when developing the linguistic data of a machine translation system and the compromise solutions that must be reached in order to build useful systems.

    Bilingual dictionary generation and enrichment via graph exploration

    In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. Such linguistic linked data have stimulated the development and use of openly available linguistic knowledge graphs, as is the case with the Apertium RDF, a collection of interconnected bilingual dictionaries represented and accessible through Semantic Web standards. In this work, we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle density based method: partitioning the graph into biconnected components for a speed-up, and simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. Over twenty-seven language pairs, our algorithm produces dictionaries about 70% the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for dictionary enrichment are correct as well. We further describe an interesting use-case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as free/open-source software which can not only be applied to RDF data and Apertium dictionaries, but is also easily usable for other formats and communities.
    This work was partially funded by the Prêt-à-LLOD project within the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 825182. This article is also based upon work from COST Action CA18209 NexusLinguarum, “European network for Web-centred linguistic data science”, supported by COST (European Cooperation in Science and Technology). It has also been partially supported by the Spanish projects TIN2016-78011-C4-3-R and PID2020-113903RB-I00 (AEI/FEDER, UE), by DGA/FEDER, and by the Agencia Estatal de Investigación of the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the “Ramón y Cajal” program (RYC2019-028112-I).
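
The cycle-based inference described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' released tool. The density definition used here (linked pairs divided by all pairs among a cycle's nodes), the `max_len` and `min_density` parameters, and all function names are assumptions made for illustration. The biconnected-component partitioning mentioned in the abstract is omitted.

```python
from itertools import combinations


def find_cycles(graph, start, max_len):
    """Return the node sets of simple cycles through `start` (at most `max_len` nodes)."""
    cycles = set()

    def dfs(node, path):
        for nxt in graph[node]:
            if nxt == start and len(path) >= 3:
                # Closed a cycle; only its node set matters for the density.
                cycles.add(frozenset(path))
            elif nxt not in path and len(path) < max_len:
                dfs(nxt, path + [nxt])

    dfs(start, [start])
    return cycles


def density(graph, nodes):
    """Fraction of node pairs in `nodes` that are directly linked (assumed definition)."""
    pairs = list(combinations(sorted(nodes), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)


def infer_translations(edges, max_len=4, min_density=0.6):
    """Propose new cross-language links between words that share a dense cycle.

    Nodes are (language, word) tuples; `edges` are known translations.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    inferred = set()
    for start in graph:
        for nodes in find_cycles(graph, start, max_len):
            if density(graph, nodes) < min_density:
                continue
            for a, b in combinations(sorted(nodes), 2):
                # Only propose pairs that are unlinked and in different languages.
                if b not in graph[a] and a[0] != b[0]:
                    inferred.add((a, b))
    return inferred


# Toy data (hypothetical): four dictionaries forming one cycle.
edges = [
    (("en", "dog"), ("es", "perro")),
    (("es", "perro"), ("fr", "chien")),
    (("fr", "chien"), ("ca", "gos")),
    (("ca", "gos"), ("en", "dog")),
]
for pair in sorted(infer_translations(edges)):
    print(pair)
```

In this toy graph the four words form a single cycle in which four of the six possible pairs are linked, so the two missing pairs (English-French and Catalan-Spanish) are proposed as candidate translations.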

    Using Machine Translation to Provide Target-Language Edit Hints in Computer Aided Translation Based on Translation Memories

    This paper explores the use of general-purpose machine translation (MT) in assisting the users of computer-aided translation (CAT) systems based on translation memory (TM) to identify the target words in the translation proposals that need to be changed (either replaced or removed) or kept unedited, a task we term "word-keeping recommendation". MT is used as a black box to align source and target sub-segments on the fly in the translation units (TUs) suggested to the user. Source-language (SL) and target-language (TL) segments in the matching TUs are segmented into overlapping sub-segments of variable length and machine-translated into the TL and the SL, respectively. The bilingual sub-segments obtained and the matching between the SL segment in the TU and the segment to be translated are employed to build the features that are then used by a binary classifier to determine the target words to be changed and those to be kept unedited. In this approach, MT results are never presented to the translator. Two approaches are presented in this work: one using a word-keeping recommendation system which can be trained on the TM used with the CAT system, and a more basic approach which does not require any training. Experiments are conducted by simulating the translation of texts in several language pairs with corpora belonging to different domains and using three different MT systems. We compare the performance obtained to that of previous works that have used statistical word alignment for word-keeping recommendation, and show that the MT-based approaches presented in this paper are more accurate in most scenarios. In particular, our results confirm that the MT-based approaches are better than the alignment-based approach when using models trained on out-of-domain TMs. Additional experiments were performed to check how dependent the MT-based recommender is on the language pair and MT system used for training. These experiments confirm a high degree of reusability of the recommendation models across various MT systems, but a low level of reusability across language pairs.
    This work is supported by the Spanish government through projects TIN2009-14009-C02-01 and TIN2012-32615.
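
The sub-segment idea in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's system. The voting rule is an assumed stand-in for the training-free approach, the toy dictionary `TOY_MT` replaces a real black-box MT system, only the SL-to-TL direction is used, and all names are hypothetical.

```python
def sub_segments(words, max_len):
    """All contiguous sub-segments of 1..max_len words, with their start index."""
    return [(i, words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]


def find_all(haystack, needle):
    """Start positions at which the word list `needle` occurs in `haystack`."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]


def recommend(new_src, tu_src, tu_tgt, translate, max_len=3):
    """Label each word of `tu_tgt` as 'keep', 'change' or 'unknown'.

    A TU source sub-segment whose translation appears in the TU target votes
    'keep' for the covered target words if the sub-segment also occurs in the
    new segment, and 'change' otherwise (assumed heuristic).
    """
    keep = [0] * len(tu_tgt)
    change = [0] * len(tu_tgt)
    for _, seg in sub_segments(tu_src, max_len):
        translated = translate(seg)
        if translated is None:
            continue
        matched = bool(find_all(new_src, seg))
        for pos in find_all(tu_tgt, translated):
            for j in range(pos, pos + len(translated)):
                if matched:
                    keep[j] += 1
                else:
                    change[j] += 1
    return ["keep" if k > c else "change" if c > k else "unknown"
            for k, c in zip(keep, change)]


# Hypothetical toy "MT system": a lookup table over word tuples.
TOY_MT = {
    ("the",): ["la"],
    ("red",): ["roja"],
    ("house",): ["casa"],
    ("red", "house"): ["casa", "roja"],
    ("is",): ["es"],
    ("big",): ["grande"],
}


def toy_translate(segment):
    return TOY_MT.get(tuple(segment))


labels = recommend(
    new_src="the red house is small".split(),
    tu_src="the red house is big".split(),
    tu_tgt="la casa roja es grande".split(),
    translate=toy_translate,
)
print(labels)
```

Here the translator would be advised to keep "la casa roja es" and to change "grande", because "big" is the only source word of the translation unit that does not occur in the new segment.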

    Learning to use machine translation on the Translation Commons Learn portal

    We describe the Learn portal of Translation Commons (TC), a self-managed community of volunteer translators aimed at sharing tools, resources and initiatives for the translation community as a whole. Members are encouraged to upload and share their free resources on the platform and to create free courses and tutorials. In particular, there is no educational material on machine translation yet, and we invite experts to contribute.

    Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution

    Sentence-aligned web-crawled parallel text or bitext is frequently used to train statistical machine translation systems. To that end, web-crawled sentence-aligned bitext sets are sometimes made publicly available and distributed by translation technologies practitioners. Contrary to what may be commonly believed, distribution of web-crawled text is far from being free from legal implications, and may sometimes actually violate the usage restrictions. As the distribution and availability of sentence-aligned bitext is key to the development of statistical machine translation systems, this paper proposes an alternative: instead of copying and distributing copies of web content in the form of sentence-aligned bitext, one could distribute a legally safer stand-off annotation of web content, that is, files that identify where the aligned sentences are, so that end users can use this annotation to privately recrawl the bitexts. The paper describes and discusses the legal and technical aspects of this proposal, and outlines an implementation.
    Funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) is acknowledged.
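
A stand-off annotation of this kind might look roughly like the Python sketch below. The abstract does not specify the record format, so the fields (URL, character offsets, SHA-256 checksum), the `fetch` callback, the example.org URLs and all names are assumptions for illustration only. The annotation stores where each sentence is rather than its text, and the end user rebuilds the bitext from privately fetched pages.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-off record: where a sentence is, not what it says."""
    url: str
    start: int
    end: int
    sha256: str  # lets the user detect that the page has changed


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def annotate(url, page_text, sentence):
    """Create a Span for the first occurrence of `sentence` in `page_text`."""
    start = page_text.index(sentence)
    return Span(url, start, start + len(sentence), _digest(sentence))


def resolve(span, fetch):
    """Re-extract the sentence from a freshly fetched page, or None if it changed."""
    sentence = fetch(span.url)[span.start:span.end]
    return sentence if _digest(sentence) == span.sha256 else None


def rebuild_bitext(alignments, fetch):
    """Turn (source Span, target Span) pairs back into sentence pairs."""
    pairs = []
    for src, tgt in alignments:
        s, t = resolve(src, fetch), resolve(tgt, fetch)
        if s is not None and t is not None:
            pairs.append((s, t))
    return pairs


# Toy "web": in practice `fetch` would download and normalise the page.
pages = {
    "http://example.org/en": "Hello world. Good morning.",
    "http://example.org/es": "Hola mundo. Buenos días.",
}
alignments = [
    (annotate("http://example.org/en", pages["http://example.org/en"], en),
     annotate("http://example.org/es", pages["http://example.org/es"], es))
    for en, es in [("Hello world.", "Hola mundo."), ("Good morning.", "Buenos días.")]
]
print(rebuild_bitext(alignments, pages.__getitem__))
```

Only the `alignments` list would be distributed in this sketch. A pair is silently dropped when either page no longer contains the expected text at the recorded offsets.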