513 research outputs found

    Target-language-driven agglomerative part-of-speech tag clustering for machine translation

    This paper presents a method for reducing the set of different tags to be considered by a part-of-speech tagger. The method is based on a clustering algorithm performed over the states of a hidden Markov model, which is initially trained by considering information not only from the source language, but also from the target language, using a recently proposed unsupervised technique for obtaining taggers for use in machine translation systems. Then, a bottom-up agglomerative clustering algorithm groups the states of the hidden Markov model according to a similarity measure based on their transition probabilities; this reduces the complexity by grouping the initial finer tags into coarser ones. The experiments show that part-of-speech taggers using the coarser tags have smaller error rates than those using the initial finest tags; moreover, considering unsupervised information from the target language results in better clusters than those built without supervision from source-language information only. Work funded by the Spanish Ministry of Science and Technology through project TIC2003-08681-C02-01, and by the Spanish Ministry of Education and Science and the European Social Fund through grant BES-2004-4711.
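
The bottom-up grouping of hidden Markov model states described in this abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so the Euclidean distance between outgoing transition rows, the average-linkage rule, and the names `row_distance` and `cluster_states` are all assumptions made for illustration; this is not the authors' implementation.

```python
from itertools import combinations
from math import sqrt


def row_distance(p, q):
    """Euclidean distance between two transition-probability rows.

    Assumption: the abstract only says the measure is based on
    transition probabilities; this particular distance is illustrative.
    """
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_states(transitions, num_clusters):
    """Agglomeratively merge HMM states (tags) into coarser groups.

    transitions: dict mapping each tag to its list of outgoing
                 transition probabilities (hypothetical input format).
    num_clusters: number of coarse tags to stop at.
    """
    # Start with one singleton cluster per fine-grained tag.
    clusters = [[tag] for tag in transitions]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs of states.
        dists = [row_distance(transitions[a], transitions[b])
                 for a in c1 for b in c2]
        return sum(dists) / len(dists)

    # Repeatedly merge the two closest clusters.
    while len(clusters) > max(num_clusters, 1):
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

With three toy tags whose rows are `[0.1, 0.8, 0.1]`, `[0.12, 0.78, 0.1]` and `[0.7, 0.1, 0.2]`, asking for two clusters merges the first two tags and leaves the third on its own. The sketch compares only outgoing transitions; a fuller measure might also take incoming transitions into account.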

    Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution

    Sentence-aligned web-crawled parallel text, or bitext, is frequently used to train statistical machine translation systems. To that end, web-crawled sentence-aligned bitext sets are sometimes made publicly available and distributed by translation technology practitioners. Contrary to what may be commonly believed, distribution of web-crawled text is far from being free of legal implications, and may sometimes actually violate usage restrictions. As the distribution and availability of sentence-aligned bitext is key to the development of statistical machine translation systems, this paper proposes an alternative: instead of copying and distributing copies of web content in the form of sentence-aligned bitext, one could distribute a legally safer stand-off annotation of web content, that is, files that identify where the aligned sentences are, so that end users can use this annotation to privately recrawl the bitexts. The paper describes and discusses the legal and technical aspects of this proposal, and outlines an implementation. Funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) is acknowledged.

    Manual d'informàtica i de tecnologies per a la traducció

    This book covers most of the content of the course Tecnologies de la Traducció (Translation Technologies), taken by second-year students of the degree in Translation and Interpreting at the Universitat d'Alacant; it may also be useful for similar courses at other universities (which is why it includes more advanced material that is not studied in Tecnologies de la Traducció).

    Comparative Human and Automatic Evaluation of Glass-Box and Black-Box Approaches to Interactive Translation Prediction

    Interactive translation prediction (ITP) is a modality of computer-aided translation that assists professional translators by offering context-based computer-generated continuation suggestions as they type. While most state-of-the-art ITP systems follow a glass-box approach, meaning that they are tightly coupled to an adapted machine translation system, a black-box approach, which does not need access to the inner workings of the bilingual resources used to generate the suggestions, has recently been proposed in the literature; this new approach allows new sources of bilingual information to be included almost seamlessly. In this paper, we compare for the first time the glass-box and the black-box approaches by means of an automatic evaluation of translation tasks between related languages such as English–Spanish and unrelated ones such as Arabic–English and English–Chinese, showing that, with our setup, 20%–50% of keystrokes could be saved using either method and that the black-box approach outperformed the glass-box one in five out of six scenarios operating under similar conditions. We also performed a preliminary human evaluation of English-to-Spanish translation for both approaches. On average, the evaluators saved 10% of keystrokes and were 4% faster with the black-box approach, and saved 15% of keystrokes and were 12% slower with the glass-box one; but they could have saved 51% and 69% of keystrokes, respectively, if they had used all the compatible suggestions. Users felt the suggestions helped them to translate faster and more easily. All the tools used to perform the evaluation are available as free/open-source software. Work partially funded by the Generalitat Valenciana through grant ACIF/2014/365, the Spanish government through project EFFORTUNE (TIN2015-69632-R), and by the Government of the Republic of Kazakhstan.

    How Bees Respond Differently to Field Margins of Shrubby and Herbaceous Plants in Intensive Agricultural Crops of the Mediterranean Area

    (1) Intensive agriculture has a high impact on pollinating insects, and conservation strategies targeting agricultural landscapes may greatly contribute to their maintenance. The aim of this work was to quantify the effect that the vegetation of crop margins, with either herbaceous or shrubby plants, had on the abundance and diversity of bees in comparison to non-restored margins. (2) The work was carried out in an area of intensive agriculture in southern Spain. Bees were monitored visually and using pan traps, and floral resources were quantified in crop margins for two years. (3) An increase in the abundance and diversity of wild bees in restored margins was registered, compared to non-restored margins. Significant differences in the structure of bee communities were found between shrubby and herbaceous margins. Apis mellifera and mining bees were found to be more polylectic than wild Apidae and Megachilidae. The abundance of A. mellifera and mining bees was correlated to the total floral resources, in particular, to those offered by the Boraginaceae and Brassicaceae; wild Apidae and Megachilidae were associated with the Lamiaceae. (4) This work emphasises the importance of floral diversity and shrubby plants for the maintenance of rich bee communities in Mediterranean agricultural landscapes

    An Open-Source Web-Based Tool for Resource-Agnostic Interactive Translation Prediction

    We present a web-based open-source tool for interactive translation prediction (ITP) and describe its underlying architecture. ITP systems assist human translators by making context-based computer-generated suggestions as they type. Most of the ITP systems in the literature are strongly coupled with a statistical machine translation system that is conveniently adapted to provide the suggestions. Our system, however, follows a resource-agnostic approach and suggestions are obtained from any unmodified black-box bilingual resource. This paper reviews our ITP method and describes the architecture of Forecat, a web tool, partly based on the recent technology of web components, that eases the use of our ITP approach in any web application requiring this kind of translation assistance. We also evaluate the performance of our method when using an unmodified Moses-based statistical machine translation system as the bilingual resource. This work has been partly funded by the Spanish Ministerio de Economía y Competitividad through project TIN2012-32615.

    Integrating Rules and Dictionaries from Shallow-Transfer Machine Translation into Phrase-Based Statistical Machine Translation

    We describe a hybridisation strategy whose objective is to integrate linguistic resources from shallow-transfer rule-based machine translation (RBMT) into phrase-based statistical machine translation (PBSMT). It basically consists of enriching the phrase table of a PBSMT system with bilingual phrase pairs matching transfer rules and dictionary entries from a shallow-transfer RBMT system. This new strategy takes advantage of how the linguistic resources are used by the RBMT system to segment the source-language sentences to be translated, and overcomes the limitations of existing hybrid approaches that treat the RBMT systems as a black box. Experimental results confirm that our approach delivers translations of higher quality than existing ones, and that it is especially useful when the parallel corpus available for training the SMT system is small or when translating out-of-domain texts that are well covered by the RBMT dictionaries. A combination of this approach with a recently proposed unsupervised shallow-transfer rule inference algorithm results in a significantly greater translation quality than that of a baseline PBSMT; in this case, the only hand-crafted resources used are the dictionaries commonly used in RBMT. Moreover, the translation quality achieved by the hybrid system built with automatically inferred rules is similar to that obtained by those built with hand-crafted rules. Research funded by the Spanish Ministry of Economy and Competitiveness through projects TIN2009-14009-C02-01 and TIN2012-32615, by Generalitat Valenciana through grant ACIF 2010/174, and by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

    A generalised alignment template formalism and its application to the inference of shallow-transfer machine translation rules from scarce bilingual corpora

    Statistical and rule-based methods are complementary approaches to machine translation (MT) that have different strengths and weaknesses. This complementarity has, over the last few years, resulted in the consolidation of a growing interest in hybrid systems that combine both data-driven and linguistic approaches. In this paper, we address the situation in which the amount of bilingual resources that is available for a particular language pair is not sufficiently large to train a competitive statistical MT system, but the cost and slow development cycles of rule-based MT systems cannot be afforded either. In this context, we formalise a new method that uses scarce parallel corpora to automatically infer a set of shallow-transfer rules to be integrated into a rule-based MT system, thus avoiding the need for human experts to handcraft these rules. Our work is based on the alignment template approach to phrase-based statistical MT, but the definition of the alignment template is extended to encompass different generalisation levels. It is also greatly inspired by the work of Sánchez-Martínez and Forcada (2009) in which alignment templates were also considered for shallow-transfer rule inference. However, our approach overcomes many relevant limitations of that work, principally those related to the inability to find the correct generalisation level for the alignment templates, and to select the subset of alignment templates that ensures an adequate segmentation of the input sentences by the rules eventually obtained. Unlike previous approaches in literature, our formalism does not require linguistic knowledge about the languages involved in the translation. Moreover, it is the first time that conflicts between rules are resolved by choosing the most appropriate ones according to a global minimisation function rather than proceeding in a pairwise greedy fashion. 
    Experiments conducted using five different language pairs with the free/open-source rule-based MT platform Apertium show that translation quality significantly improves when compared to the method proposed by Sánchez-Martínez and Forcada (2009), and is close to that obtained using handcrafted rules. For some language pairs, our approach is even able to outperform them. Moreover, the resulting number of rules is considerably smaller, which eases human revision and maintenance. Research funded by Universitat d’Alacant through project GRE11-20, by the Spanish Ministry of Economy and Competitiveness through projects TIN2009-14009-C02-01 and TIN2012-32615, by Generalitat Valenciana through grant ACIF/2010/174, and by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

    Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation

    This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation, for which there is incomplete evidence in the literature. The study covers eight language pairs, different training corpus sizes, two architectures and three types of annotation: dummy tags (with no linguistic information at all), part-of-speech tags, and morpho-syntactic description tags, which consist of part of speech and morphological features. These linguistic annotations are interleaved in the input or output streams as a single tag placed before each word. In order to measure the performance under each scenario, we use automatic evaluation metrics and perform automatic error classification. Our experiments show that, in general, source-language annotations are helpful and morpho-syntactic descriptions outperform part of speech for some language pairs. On the contrary, when words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics, even though the use of morpho-syntactic description tags improves the grammaticality of the output. We provide a detailed analysis of the reasons behind this result. Work funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement number 825299, project Global Under-Resourced Media Translation (GoURMET).
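
The interleaving scheme mentioned in this abstract, a single tag placed before each word, can be sketched as follows. The angle-bracket token format, the `<TAG>` dummy symbol, and the function names `interleave_tags` and `strip_tags` are assumptions for illustration only; the abstract does not say how the tags are written.

```python
from typing import List, Optional, Tuple


def interleave_tags(tagged: List[Tuple[str, str]],
                    dummy: Optional[str] = None) -> List[str]:
    """Build a token stream with one tag token before each word.

    tagged: (word, tag) pairs, e.g. from a part-of-speech tagger.
    dummy:  if given, every tag is replaced by this symbol, which
            mimics tags that carry no linguistic information.
    """
    stream: List[str] = []
    for word, tag in tagged:
        label = dummy if dummy is not None else tag
        stream.append(f"<{label}>")  # assumed tag token format
        stream.append(word)
    return stream


def strip_tags(stream: List[str]) -> List[str]:
    """Recover the plain words from an interleaved stream.

    Caveat: a real word that itself looks like '<...>' would also be dropped.
    """
    return [tok for tok in stream
            if not (tok.startswith("<") and tok.endswith(">"))]
```

For example, `interleave_tags([("the", "DET"), ("cat", "NOUN")])` yields `["<DET>", "the", "<NOUN>", "cat"]`, and passing `dummy="TAG"` replaces both tags with `<TAG>`. When the tags are interleaved in the output stream, a step like `strip_tags` would be needed before scoring the translation.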