30 research outputs found

    Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution

    Sentence-aligned web-crawled parallel text, or bitext, is frequently used to train statistical machine translation systems. To that end, web-crawled sentence-aligned bitext sets are sometimes made publicly available and distributed by translation technology practitioners. Contrary to what may be commonly believed, distribution of web-crawled text is far from free of legal implications, and may sometimes actually violate the usage restrictions. As the distribution and availability of sentence-aligned bitext is key to the development of statistical machine translation systems, this paper proposes an alternative: instead of copying and distributing copies of web content in the form of sentence-aligned bitext, one could distribute a legally safer stand-off annotation of web content, that is, files that identify where the aligned sentences are, so that end users can use this annotation to privately recrawl the bitexts. The paper describes and discusses the legal and technical aspects of this proposal, and outlines an implementation. Funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) is acknowledged.
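
    The stand-off idea in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the record layout (`SentenceRef` with a URL, character offsets, and a SHA-1 digest), the function names, and the example URLs are all assumptions made for illustration. The distributed file would contain only pointers; the text itself comes from pages the end user has fetched privately, represented here as a plain dictionary.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class SentenceRef:
    """Pointer to one sentence in a web page (hypothetical record layout)."""
    url: str
    start: int  # character offset where the sentence begins
    end: int    # character offset one past the last character
    sha1: str   # digest of the sentence, used to detect changed pages


def make_ref(url: str, text: str, start: int, end: int) -> SentenceRef:
    """Build a pointer without storing the sentence text itself."""
    digest = hashlib.sha1(text[start:end].encode("utf-8")).hexdigest()
    return SentenceRef(url, start, end, digest)


def resolve(ref: SentenceRef, pages: Dict[str, str]) -> Optional[str]:
    """Recover a sentence from the user's own crawl; None if it no longer matches."""
    text = pages.get(ref.url)
    if text is None:
        return None
    sentence = text[ref.start:ref.end]
    if hashlib.sha1(sentence.encode("utf-8")).hexdigest() != ref.sha1:
        return None
    return sentence


def rebuild_bitext(
    pairs: Iterable[Tuple[SentenceRef, SentenceRef]],
    pages: Dict[str, str],
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs whose two sides both still resolve."""
    for src_ref, tgt_ref in pairs:
        src = resolve(src_ref, pages)
        tgt = resolve(tgt_ref, pages)
        if src is not None and tgt is not None:
            yield src, tgt
```

    In this sketch a pair is silently dropped when either page has changed since the annotation was made; a real tool would presumably need a more tolerant way of relocating sentences.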

    TMX: Intercambio de memorias de traducción [TMX: Exchanging translation memories]

    This paper presents TMX (Translation Memory eXchange), the standard format for exchanging translation memories. We review the concept of translation memory and its uses, which make it one of the translator's main resources. We present strategies for quickly retrieving the segments most similar to the one being translated, and the mechanisms for ranking the retrieved segments by their similarity to that segment. We analyze the internal translation memory formats of the main computer-assisted translation (CAT) tools and show the importance of having an exchange format that is standard, versatile, and able to evolve to meet new needs. We briefly present the specifications of the TMX format and its levels, and analyze the degree of adoption of this format among CAT tools. Finally, we present some of the proposals for the future of this format.
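
    The retrieval and ranking of similar segments mentioned in this abstract can be illustrated with a minimal Python sketch. It is only a stand-in: it scores every stored segment with `difflib.SequenceMatcher` from the standard library, whereas the paper's own strategies are not given in the abstract and real translation memory tools typically use indexing and their own similarity measures. The function name, the threshold, and the sample segments are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_matches(query, memory, threshold=0.7, limit=5):
    """Return up to `limit` (score, source, target) tuples, best match first.

    `memory` is an iterable of (source, target) segment pairs. The score is
    SequenceMatcher's similarity ratio in [0, 1], compared case-insensitively.
    """
    scored = []
    for source, target in memory:
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold:
            scored.append((score, source, target))
    # Rank retrieved segments by similarity to the segment being translated.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]


memory = [
    ("Close the file menu.", "Cierre el menú Archivo."),
    ("Open the file menu.", "Abra el menú Archivo."),
    ("The weather is nice today.", "Hoy hace buen tiempo."),
]
for score, source, target in fuzzy_matches("Open the file menu.", memory):
    print(f"{score:.2f}  {source}  ->  {target}")
```

    A linear scan like this is only practical for small memories; it is meant to show the ranking step, not the fast-retrieval strategies the paper discusses.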

    TectoMT – a deep-linguistic core of the combined Chimera MT system

    Chimera is a machine translation system that combines the TectoMT deep-linguistic core with the phrase-based MT system Moses. For the English–Czech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platforms and are open source (available from the Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu).

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

    Placeable and localizable elements in translation memory systems

    Translation memory systems (TM systems) are software packages used in computer-assisted translation (CAT) to support human translators. As an example of successful natural language processing (NLP), these applications have been discussed in monographic works, conferences, articles in specialized journals, newsletters, forums, mailing lists, etc. This thesis focuses on how TM systems deal with placeable and localizable elements, as defined in 2.1.1.1. Although these elements are mentioned in the cited sources, there is no systematic work discussing them. This thesis aims to fill this gap and to suggest improvements that could be implemented to tackle current shortcomings. The thesis is divided into the following chapters. Chapter 1 is a general introduction to the field of TM technology. Chapter 2 presents the research conducted in detail. Chapters 3 to 12 each discuss a specific category of placeable and localizable elements. Finally, chapter 13 provides a conclusion summarizing the major findings of this research project.

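    La recuperación de subsegmentos en SDL Trados studio 2017: análisis del mecanismo upLIFT y evaluación de su aprovechamiento. [Subsegment retrieval in SDL Trados Studio 2017: analysis of the upLIFT mechanism and evaluation of its usefulness]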

    In order to maximize the use of the information contained in translation memories, companies are continually developing new mechanisms to prevent that information from being underused. The latest version (2017) of the computer-assisted translation tool SDL Trados Studio incorporates a subsegment retrieval add-on designed to assist translators in their work. This study is divided into two distinct parts. First, it presents the theoretical framework, which situates subsegment retrieval within the field of translation retrieval, using an expository structure. Second, it carries out a practical experiment with SDL Trados Studio 2017 to evaluate the usefulness of the upLIFT Fragment Recall mechanism, which retrieves subsegments from the translation memory used in each project. To achieve this, techniques similar to those of statistical machine translation are used.

    Translators' requirements for translation technologies: user study on translation tools

    This dissertation investigates the needs of professional translators regarding translation technologies with the aim of suggesting ways to improve these technologies from the users' point of view. It mostly covers the topics of computer-assisted translation (CAT) tools, machine translation and terminology management. In particular, the work presented here examines three main questions: 1) what kind of tools do translators need to increase their productivity and income, 2) do existing translation tools satisfy translators' needs, 3) how can translation tools be improved to cater to these needs. The dissertation is composed of nine previously published articles, which are included in the Appendix, while the methodology used and the results obtained in these studies are summarised in the main body of the dissertation. The task of identifying user needs was approached from three different perspectives: 1) eliciting translators' needs by means of a user survey, 2) evaluation of existing CAT systems, and 3) analysis of the process of post-editing of machine translation. The data from the user survey was analysed using quantitative and qualitative data analysis techniques. The post-editing process was studied through quantitative measures of time and technical effort, as well as through the qualitative study of the actual edits. The survey results demonstrated that the two crucial characteristics of CAT software were usability and functionality. The survey also helped to distinguish the features translators find most useful in their software, such as support for many different document formats, concordance search, autopropagation and autosuggest functions. Based on these preferences, an evaluation scheme for CAT software was developed. Various ways of improving CAT software usability and functionality were proposed, including making better use of textual corpora techniques and providing different versions of software with respect to the required level of functionality. Another major concern of the survey respondents was the quality of machine translation and its usefulness for creating draft translations for post-editing. In this direction, a part of this dissertation is dedicated to evaluation of machine translation, and investigation of the post-editing process. The findings of these studies showed which machine translation errors are easier to post-edit, which can be of practical use for improving the post-editing workflow.

    Clinical practice knowledge acquisition and interrogation using natural language: aquisição e interrogação de conhecimento de prática clínica usando linguagem natural

    The scientific concepts, methodologies and tools in the Knowledge Representation (KR) subdomain of applied Artificial Intelligence (AI) have come a long way, with enormous strides in recent years. The use of Ontologies as domain conceptualizations is now powerful enough to aim at computable reasoning over complex realities. One of the most challenging scientific and technical human endeavors is the daily Clinical Practice (CP) of Cardiovascular (CV) specialty healthcare providers. Such a complex domain can benefit largely from the possibility of clinical reasoning aids that are now on the verge of becoming available. We research a complete, solid, end-to-end ontological infrastructure for CP knowledge representation, as well as the associated processes to automatically acquire knowledge from clinical texts and reason over it.

    Aquisição e Interrogação de Conhecimento de Prática Clínica usando Linguagem Natural [Clinical practice knowledge acquisition and interrogation using natural language]

    The scientific concepts, methodologies and tools in the Knowledge Representation (KR) subdomain of applied Artificial Intelligence (AI) have come a long way, with enormous strides in recent years. The use of Ontologies as domain conceptualizations is now powerful enough to aim at computable reasoning over complex realities. One of the most challenging scientific and technical human endeavors is the daily Clinical Practice (CP) of Cardiovascular (CV) specialty healthcare providers. Such a complex domain can benefit largely from the possibility of clinical reasoning aids that are now on the verge of becoming available. We research a complete, solid, end-to-end ontological infrastructure for CP knowledge representation, as well as the associated processes to automatically acquire knowledge from clinical texts and reason over it.