803 research outputs found

Semi-Automatic Identification of Bilingual Synonymous Technical Terms from Phrase Tables and Parallel Patent Sentences

Author: Liang Bing
Utsuro Takehito
Yamamoto Mikio
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Adaptation of machine translation for multilingual information retrieval in the medical domain

Author: Dušek Ondřej
Goeuriot Lorraine
Hajič Jan
Hlaváčová Jaroslava
Jones Gareth J.F.
Kelly Liadh
Leveling Johannes
Mareček David
Novák Michal
Pecina Pavel
Popel Martin
Rosa Rudolf
Tamchyna Aleš
Urešová Zdeňka
Publication venue: Elsevier BV
Publication date: 01/01/2014

Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.

Hal - Université Grenoble Alpes

Biblio at Institute of Formal and Applied Linguistics

Acquiring Domain-Specific Knowledge for WordNet from a Terminological Database

Publication venue: OASIcs - OpenAccess Series in Informatics. 8th Symposium on Languages, Applications and Technologies (SLATE 2019)
Publication date: 01/01/2019

In this research we explore a terminological database (Termoteca) in order to expand the Portuguese and Galician wordnets (PULO and Galnet) with the addition of new synset variants (word forms for a concept), usage examples for the variants, and synset glosses or definitions. The methodology applied in this experiment is based on the alignment between concepts of WordNet (synsets) and concepts described in Termoteca (terminological records), taking into account the lexical forms in both resources, their morphological category and their knowledge domains, using the information provided by the WordNet Domains Hierarchy and the Termoteca field domains to reduce the incidence of polysemy and homography in the results of the experiment. The results obtained confirm our hypothesis that the combined use of the semantic domain information included in both resources makes it possible to minimise the problem of lexical ambiguity and to obtain a very acceptable index of precision in terminological information extraction tasks, attaining a precision above 89% when there are two or more different languages sharing at least one lexical form between the synset in Galnet and the Termoteca record.

Dagstuhl Research Online Publication Server

METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development

Author: Alonso Araceli
Blancafort Helena
De Groc Clément
Million Chrystel
Williams Geoffrey
Publication venue: HAL CCSD
Publication date: 07/08/2012

Research on comparable corpora has grown in recent years, bringing about the possibility of developing multilingual lexicons through the exploitation of comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first provides a description of the METRICC project, which is aimed at the automatic creation of comparable corpora, and describes one of the crawlers developed for comparable corpora building, and then discusses the power of collocational networks for multilingual corpus-driven dictionary development.

Hal - Université Grenoble Alpes

HAL-Université de Bretagne Occidentale

How word decoding, vocabulary and prior topic knowledge predict reading comprehension. A study of language-minority students in Norwegian fifth grade classrooms

Author: Aukrust Vibeke Grøver
Fulland Helene
Rydland Veslemøy
Publication venue: Springer Netherlands
Publication date: 01/01/2010

This study examined the contribution of word decoding, first-language (L1) and second-language (L2) vocabulary and prior topic knowledge to L2 reading comprehension. For measuring reading comprehension we employed two different reading tasks: Woodcock Passage Comprehension and a researcher-developed content-area reading assignment (the Global Warming Test) consisting of multiple lengthy texts. The sample included 67 language-minority students (native Urdu or native Turkish speakers) from 21 different fifth grade classrooms in Norway. Multiple regression analyses revealed that word decoding and different facets of L2 vocabulary explained most of the variance in Woodcock Passage Comprehension, but a smaller proportion of variance in the Global Warming Test. For the Global Warming Test, prior topic knowledge was the most influential predictor. Furthermore, L2 vocabulary depth appeared to moderate the contribution of prior topic knowledge to the Global Warming Test in this sample of language minority students.

Springer - Publisher Connector

Contrastive linguistics and translation studies interconnected: the corpus-based approach

Author: Ramón García Noelia
Publication venue: Hoger Instituut voor Vertalers en Tolken
Publication date: 24/12/2018

p. 393-406. The relationship between contrastive linguistics (CL) and translation studies (TS) as two disciplines within the field of applied linguistics has been explored in depth by several authors, especially in the 1970s and early 1980s. From the mid-nineties on, both these disciplines have experienced a great boom due to the use of computerised language corpora in linguistic analysis. We will argue in this paper that this new corpus-based approach to CL and TS makes it necessary to revise the relationship between them, and look for a new common ground to work on. Our hypothesis is that the use of translation equivalence as a tertium comparationis for a corpus-based contrastive analysis provides essential data for TS in a wide range of aspects. On the other hand, the corpus approach of TS has shed new light on numerous aspects of CL.