5 research outputs found

    Adaptation of machine translation for multilingual information retrieval in the medical domain

    Get PDF
    Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve eectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR eectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions

    Zināšanās bāzētu un korpusā bāzētu metožu kombinētā izmantošanas mašīntulkošanā

    Get PDF
    ANOTĀCIJA. Mašīntulkošanas (MT) sistēmas tiek būvētas izmantojot dažādas metodes (zināšanās un korpusā bāzētas). Zināšanās bāzēta MT tulko tekstu, izmantojot cilvēka rakstītus likumus. Korpusā bāzēta MT izmanto no tulkojumu piemēriem automātiski izgūtus modeļus. Abām metodēm ir gan priekšrocības, gan trūkumi. Šajā darbā tiek meklēta kombināta metode MT kvalitātes uzlabošanai, kombinējot abas metodes. Darbā tiek pētīta metožu piemērotība latviešu valodai, kas ir maza, morfoloģiski bagāta valoda ar ierobežotiem resursiem. Tiek analizētas esošās metodes un tiek piedāvātas vairākas kombinētās metodes. Metodes ir realizētas un novērtētas, izmantojot gan automātiskas, gan cilvēka novērtēšanas metodes. Faktorēta statistiskā MT ar zināšanās balstītu morfoloģisko analizatoru ir piedāvāta kā perspektīvākā. Darbā aprakstīts arī metodes praktiskais pielietojums. Atslēgas vārdi: mašīntulkošana (MT), zināšanās balstīta MT, korpusā balstīta MT, kombinēta metodeABSTRACT. Machine Translation (MT) systems are built using different methods (knowledge-based and corpus-based). Knowledge-based MT translates text using human created rules. Corpus-based MT uses models which are automatically built from translation examples. Both methods have their advantages and disadvantages. This work aims to find a combined method to improve the MT quality combining both methods. An applicability of the methods for Latvian (a small, morphologically rich, under-resourced language) is researched. The existing MT methods have been analyzed and several combined methods have been proposed. Methods have been implemented and evaluated using an automatic and human evaluation. The factored statistical MT with a rule-based morphological analyzer is proposed to be the most promising. The practical application of methods is described. Keywords: Machine Translation (MT), Rule-based MT, Statistical MT, Combined approac

    Statistical machine translation of German compound words

    No full text
    corecore