803 research outputs found

    Semi-Automatic Identification of Bilingual Synonymous Technical Terms from Phrase Tables and Parallel Patent Sentences

    Adaptation of machine translation for multilingual information retrieval in the medical domain

    Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve the effectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project, and IR effectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding: our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in a direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Intelligent training data selection in particular proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with IR performance: better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.

    Acquiring Domain-Specific Knowledge for WordNet from a Terminological Database

    In this research we explore a terminological database (Termoteca) in order to expand the Portuguese and Galician wordnets (PULO and Galnet) with new synset variants (word forms for a concept), usage examples for the variants, and synset glosses or definitions. The methodology applied in this experiment is based on the alignment between concepts of WordNet (synsets) and concepts described in Termoteca (terminological records), taking into account the lexical forms in both resources, their morphological category and their knowledge domains. The information provided by the WordNet Domains Hierarchy and the Termoteca field domains is used to reduce the incidence of polysemy and homography in the results of the experiment. The results obtained confirm our hypothesis that the combined use of the semantic domain information included in both resources makes it possible to minimise the problem of lexical ambiguity and to obtain a very acceptable level of precision in terminological information extraction tasks, attaining a precision above 89% when there are two or more different languages sharing at least one lexical form between the synset in Galnet and the Termoteca record.

    METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development

    Research on comparable corpora has grown in recent years, bringing about the possibility of developing multilingual lexicons through the exploitation of comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first provides a description of the METRICC project, which is aimed at the automatic creation of comparable corpora, and describes one of the crawlers developed for comparable corpora building. It then discusses the power of collocational networks for multilingual corpus-driven dictionary development.

    How word decoding, vocabulary and prior topic knowledge predict reading comprehension. A study of language-minority students in Norwegian fifth grade classrooms

    This study examined the contribution of word decoding, first-language (L1) and second-language (L2) vocabulary and prior topic knowledge to L2 reading comprehension. To measure reading comprehension we employed two different reading tasks: Woodcock Passage Comprehension and a researcher-developed content-area reading assignment (the Global Warming Test) consisting of multiple lengthy texts. The sample included 67 language-minority students (native Urdu or native Turkish speakers) from 21 different fifth grade classrooms in Norway. Multiple regression analyses revealed that word decoding and different facets of L2 vocabulary explained most of the variance in Woodcock Passage Comprehension, but a smaller proportion of variance in the Global Warming Test. For the Global Warming Test, prior topic knowledge was the most influential predictor. Furthermore, L2 vocabulary depth appeared to moderate the contribution of prior topic knowledge to the Global Warming Test in this sample of language-minority students.

    Contrastive linguistics and translation studies interconnected: the corpus-based approach

    p. 393-406. The relationship between contrastive linguistics (CL) and translation studies (TS) as two disciplines within the field of applied linguistics has been explored in depth by several authors, especially in the 1970s and early 1980s. From the mid-nineties on, both disciplines have experienced a great boom due to the use of computerised language corpora in linguistic analysis. We will argue in this paper that this new corpus-based approach to CL and TS makes it necessary to revise the relationship between them and to look for a new common ground to work on. Our hypothesis is that the use of translation equivalence as a tertium comparationis for a corpus-based contrastive analysis provides essential data for TS in a wide range of aspects. On the other hand, the corpus approach of TS has shed new light on numerous aspects of CL.