Search CORE

13,730 research outputs found

Adaptation of machine translation for multilingual information retrieval in the medical domain

Author: Dušek Ondřej
Goeuriot Lorraine
Hajič Jan
Hlaváčová Jaroslava
Jones Gareth J.F.
Kelly Liadh
Leveling Johannes
Mareček David
Novák Michal
Pecina Pavel
Popel Martin
Rosa Rudolf
Tamchyna Aleš
Urešová Zdeňka
Publication venue: 'Elsevier BV'
Publication date: 01/01/2014
Field of study

Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve eectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR eectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions

Crossref

Hal - Université Grenoble Alpes

Irish Universities

DCU Online Research Access Service

Biblio at Institute of Formal and Applied Linguistics

TectoMT – a deep-linguistic core of the combined Chimera MT system

Author: Bojar Ondřej
Hajič Jan
Popel Martin
Rosa Rudolf
Sudarikov Roman
Publication venue
Publication date: 01/01/2016
Field of study

Chimera is a machine translation system that combines the TectoMT deep-linguistic core with phrase-based MT system Moses. For English–Czech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platform and are open source (available from Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu)

Biblio at Institute of Formal and Applied Linguistics

Target-Side Context for Discriminative Models in Statistical Machine Translation

Author: Bojar Ondřej
Fraser Alexander
Junczys-Dowmunt Marcin
Tamchyna Aleš
Publication venue
Publication date: 01/01/2016
Field of study

Discriminative translation models utilizing source context have been shown to help statistical machine translation performance. We propose a novel extension of this work using target context information. Surprisingly, we show that this model can be efficiently integrated directly in the decoding process. Our approach scales to large training data sizes and results in consistent improvements in translation quality on four language pairs. We also provide an analysis comparing the strengths of the baseline source-context model with our extended source-context and target-context model and we show that our extension allows us to better capture morphological coherence. Our work is freely available as part of Moses.Comment: Accepted as a long paper for ACL 201

arXiv.org e-Print Archive

Crossref

Biblio at Institute of Formal and Applied Linguistics

MATREX: the DCU MT System for WMT 2008

Author: Ma Yanjun
Ozdowska Sylwia
Tinsley John
Way Andy
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2008
Field of study

In this paper, we give a description of the machine translation system developed at DCU that was used for our participation in the evaluation campaign of the Third Workshop on Statistical Machine Translation at ACL 2008. We describe the modular design of our data driven MT system with particular focus on the components used in this participation. We also describe some of the significant modules which were unused in this task. We participated in the EuroParl task for the following translation directions: Spanish–English and French–English, in which we employed our hybrid EBMT-SMT architecture to translate. We also participated in the Czech–English News and News Commentary tasks which represented a previously untested language pair for our system. We report results on the provided development and test sets

CiteSeerX

Irish Universities

DCU Online Research Access Service

Findings of the 2019 Conference on Machine Translation (WMT19)

Author: Barrault Loïc
Bojar Ondřej
Costa-Jussà Marta R.
Federmann Christian
Fishel Mark
Graham Yvette
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/08/2019
Field of study

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation

Irish Universities

DCU Online Research Access Service

Enriching Morphologically Poor Languages for Statistical Machine Translation

Author: Avramidis Eleftherios
Koehn Philipp
Publication venue
Publication date: 01/06/2008
Field of study

Edinburgh Research Explorer

Factored Translation Models

Author: Hoang Hieu
Koehn Philipp
Publication venue
Publication date: 01/06/2007
Field of study

Edinburgh Research Explorer

MATREX: the DCU MT system for WMT 2010

Author: Banerjee Pratyush
Dandapat Sandipan
Du Jinhua
Forcada Mikel
Haque Rejwanul
Kumar Naskar Sudip
Pecina Pavel
Penkale Sergio
Srivastava Ankit Kumar
Way Andy
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 15/07/2010
Field of study

This paper describes the DCU machine translation system in the evaluation campaign of the Joint Fifth Workshop on Statistical Machine Translation and Metrics in ACL-2010. We describe the modular design of our multi-engine machine translation (MT) system with particular focus on the components used in this participation. We participated in the English–Spanish and English–Czech translation tasks, in which we employed our multiengine architecture to translate. We also participated in the system combination task which was carried out by the MBR decoder and confusion network decoder

Irish Universities

DCU Online Research Access Service

Using supertags as source language context in SMT

Author: Haque Rejwanul
Ma Yanjun
Naskar Sudip Kumar
Way Andy
Publication venue: European Association for Machine Translation
Publication date: 01/01/2009
Field of study

Recent research has shown that Phrase-Based Statistical Machine Translation (PB-SMT) systems can benefit from two enhancements: (i) using words and POS tags as context-informed features on the source side; and (ii) incorporating lexical syntactic descriptions in the form of supertags on the target side. In this work we present a novel PB-SMT model that combines these two aspects by using supertags as source language contextinformed features. These features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. In our experiments two kinds of supertags are employed: those from Lexicalized Tree-Adjoining Grammar and Combinatory Categorial Grammar. We use a memory-based classification framework that enables the estimation of these features while avoiding problems of sparseness. Despite the differences between these two approaches, the supertaggers give similar improvements. We evaluate the performance of our approach on an English-to-Chinese translation task using a state-of-the-art phrase-based SMT system, and report an improvement of 7.88% BLEU score in translation quality when adding supertags as context-informed features

Irish Universities

DCU Online Research Access Service