Multi-engine machine translation by recursive sentence decomposition
In this paper, we present a novel approach to combine the outputs of multiple MT engines into a consensus translation. In contrast to previous Multi-Engine Machine Translation (MEMT) techniques, we do not rely on word alignments of output hypotheses, but prepare the input sentence for multi-engine processing. We do this by using a recursive decomposition algorithm that produces simple chunks as input to the MT engines. A consensus translation is produced by combining the best chunk translations, selected through majority voting, a trigram language model score and a confidence score assigned to each MT engine. We report statistically significant relative improvements of up to 9% BLEU score in experiments (English→Spanish) carried out on an 800-sentence test set extracted from the Penn-II Treebank.
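The chunk-selection step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the three signals are weighted, so the ordering used here (votes first, then language model score, then engine confidence) is an assumption, as are the names `select_chunk` and `consensus`. The language model is passed in as a plain scoring function, and the recursive decomposition and the reordering of chunks are not modelled.

```python
from collections import Counter
from typing import Callable, Dict, List


def select_chunk(candidates: Dict[str, str],
                 confidence: Dict[str, float],
                 lm_score: Callable[[str], float]) -> str:
    """Pick one translation for a chunk from per-engine candidates.

    candidates maps engine name -> that engine's translation of the chunk.
    Assumed ranking: vote count, then LM score, then engine confidence.
    """
    votes = Counter(candidates.values())

    def rank(translation: str):
        # Highest confidence among the engines that proposed this translation.
        best_conf = max(confidence[engine]
                        for engine, text in candidates.items()
                        if text == translation)
        return (votes[translation], lm_score(translation), best_conf)

    return max(votes, key=rank)


def consensus(chunks: List[Dict[str, str]],
              confidence: Dict[str, float],
              lm_score: Callable[[str], float]) -> str:
    """Join the selected chunk translations in their original order."""
    return " ".join(select_chunk(c, confidence, lm_score) for c in chunks)


# Toy data: three hypothetical engines and a lookup table standing in
# for a trigram language model.
toy_lm = {"la casa": -1.0, "el hogar": -2.0,
          "es roja": -1.5, "está roja": -3.0, "es rojo": -2.5}
toy_chunks = [
    {"A": "la casa", "B": "la casa", "C": "el hogar"},
    {"A": "es roja", "B": "está roja", "C": "es rojo"},
]
toy_confidence = {"A": 0.5, "B": 0.7, "C": 0.6}
result = consensus(toy_chunks, toy_confidence, toy_lm.__getitem__)
```

In the toy data, the first chunk is decided by the vote and the second, where every engine disagrees, falls back to the language model score.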
Neural fuzzy repair : integrating fuzzy matches into neural machine translation
We present a simple yet powerful data augmentation method for boosting Neural Machine Translation (NMT) performance by leveraging information retrieved from a Translation Memory (TM). We propose and test two methods for augmenting NMT training data with fuzzy TM matches. Tests on the DGT-TM data set for two language pairs show consistent and substantial improvements over a range of baseline systems. The results suggest that this method is promising for any translation environment in which a sizeable TM is available and a certain amount of repetition across translations is to be expected, especially considering its ease of implementation.
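One way to picture augmenting a source sentence with a fuzzy TM match is sketched below. The abstract does not spell out either of the two methods, so everything here is an assumption for illustration: the token-level similarity from Python's `difflib`, the 0.5 threshold, the `@@@` separator, and the function names `best_fuzzy_match` and `augment`.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

SEPARATOR = "@@@"  # assumed marker between the source and the retrieved target


def best_fuzzy_match(source: str,
                     tm: List[Tuple[str, str]],
                     threshold: float = 0.5) -> Optional[str]:
    """Return the target side of the most similar TM entry, or None.

    tm is a list of (source, target) pairs. Similarity is a token-level
    SequenceMatcher ratio, used here as a stand-in for a fuzzy match score.
    """
    best_target, best_score = None, threshold
    src_tokens = source.split()
    for tm_source, tm_target in tm:
        if tm_source == source:
            continue  # skip exact matches, e.g. the sentence itself
        score = SequenceMatcher(None, src_tokens, tm_source.split()).ratio()
        if score >= best_score:
            best_target, best_score = tm_target, score
    return best_target


def augment(source: str, tm: List[Tuple[str, str]]) -> str:
    """Append the best fuzzy match target to the source, if one exists."""
    match = best_fuzzy_match(source, tm)
    return source if match is None else f"{source} {SEPARATOR} {match}"


toy_tm = [
    ("the cat sat on the rug", "le chat était assis sur le tapis"),
    ("hello world", "bonjour le monde"),
]
augmented = augment("the cat sat on the mat", toy_tm)
unchanged = augment("completely different sentence", toy_tm)
```

Sentences without a sufficiently similar TM entry are left unchanged, so the augmented and plain inputs can be mixed in one training set.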
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
On the Generation of Medical Question-Answer Pairs
Question answering (QA) has achieved promising progress recently. However,
answering a question in real-world scenarios like the medical domain is still
challenging, due to the requirement of external knowledge and the insufficient
quantity of high-quality training data. In the light of these challenges, we
study the task of generating medical QA pairs in this paper. With the insight
that each medical question can be considered as a sample from the latent
distribution of questions given answers, we propose an automated medical QA
pair generation framework, consisting of an unsupervised key phrase detector
that explores unstructured material for validity, and a generator that involves
a multi-pass decoder to integrate structural knowledge for diversity. A series
of experiments have been conducted on a real-world dataset collected from the
National Medical Licensing Examination of China. Both automatic evaluation and
human annotation demonstrate the effectiveness of the proposed method. Further
investigation shows that, by incorporating the generated QA pairs for training,
significant improvement in terms of accuracy can be achieved for the
examination QA system. Comment: AAAI 202
Japanese-English Parallel Corpora in the Classroom : Applications and Challenges
Computerized corpora have given linguists crucial new insights into the usage of language. With the help of software, it is possible to index the words that appear in a large collection of text and analyze word usage and frequency. Data-Driven Learning looks at how students can benefit from their own direct use of corpora. While monolingual corpora have a steep learning curve and are often too difficult for language learners, a solution to this problem may be found in bilingual parallel corpora, which are built from authentically translated text. This article looks at Eijiro on the WEB and Weblio, two online websites based on Japanese-English parallel corpora. Some guided practice exercises developed by the author for use in university-level English language writing classes in Japan are discussed, and some of the challenges in training students to use these resources to improve their English language writing are presented.