9,557 research outputs found
Low-resource machine translation using MATREX: The DCU machine translation system for IWSLT 2009
In this paper, we give a description of the Machine Translation (MT) system developed at DCU that was used for our fourth participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2009). Two techniques are deployed in our system in order to improve the translation quality in a low-resource scenario. The first technique is to use multiple segmentations in MT training and to utilise word lattices in decoding stage. The second technique is used to select the optimal training data that can be used to build MT systems. In this year’s participation, we use three different prototype SMT systems, and the output from each system are combined using standard system combination method. Our system is the top system for Chinese–English CHALLENGE task in terms of BLEU score
Bilingually motivated domain-adapted word segmentation for statistical machine translation
We introduce a word segmentation approach to languages where word boundaries are not orthographically marked,
with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First of all, our approach is
adapted for the specific translation task at hand by taking the corresponding source (target) language into account. Secondly, this approach does not rely on manually segmented training data so that it can be automatically adapted for different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and
demonstrate that our approach scores consistently among the best results across different data conditions
Improved phrase-based SMT with syntactic reordering patterns learned from lattice scoring
In this paper, we present a novel approach to incorporate source-side syntactic reordering patterns into phrase-based SMT. The main contribution of this work is to use the lattice scoring approach to exploit and utilize reordering
information that is favoured by the baseline PBSMT system. By referring to the parse trees of the training corpus, we represent the observed reorderings with source-side
syntactic patterns. The extracted patterns are then used to convert the parsed inputs into word lattices, which contain both the original source sentences and their potential reorderings. Weights of the word lattices are estimated from the observations of the syntactic reordering patterns in the training corpus. Finally, the PBSMT system is tuned
and tested on the generated word lattices to show the benefits of adding potential sourceside reorderings in the inputs. We confirmed the effectiveness of our proposed method on a medium-sized corpus for Chinese-English
machine translation task. Our method outperformed the baseline system by 1.67% relative on a randomly selected testset and 8.56% relative on the NIST 2008 testset in terms of BLEU score
Source-side syntactic reordering patterns with functional words for improved phrase-based SMT
Inspired by previous source-side syntactic reordering methods for SMT, this paper focuses on using automatically learned syntactic reordering patterns with functional words which indicate structural reorderings between the source and target language. This approach takes advantage of phrase alignments and source-side parse trees for pattern extraction, and then filters out those patterns without functional words. Word lattices transformed by the generated patterns are fed into PBSMT systems to incorporate potential reorderings from the inputs. Experiments are carried out on a medium-sized corpus for a Chinese–English SMT task. The proposed method outperforms the baseline system by 1.38% relative on a randomly selected testset and 10.45% relative on the NIST 2008 testset in terms of BLEU score. Furthermore, a system with just 61.88% of the patterns filtered by functional words obtains a comparable performance with the unfiltered one on the randomly selected testset, and achieves 1.74% relative improvements on the NIST 2008 testset
Lattice score based data cleaning for phrase-based statistical machine translation
Statistical machine translation relies heavily
on parallel corpora to train its models
for translation tasks. While more and
more bilingual corpora are readily available,
the quality of the sentence pairs
should be taken into consideration. This
paper presents a novel lattice score-based
data cleaning method to select proper sentence
pairs from the ones extracted from a
bilingual corpus by the sentence alignment
methods. The proposed method is carried
out as follows: firstly, an initial phrasebased
model is trained on the full sentencealigned
corpus; then for each of the sentence
pairs in the corpus, word alignments
are used to create anchor pairs and sourceside
lattices; thirdly, based on the translation
model, target-side phrase networks
are expanded on the lattices and Viterbi
searching is used to find approximated decoding
results; finally, BLEU score thresholds
are used to filter out the low-score
sentence pairs for the data cleaning purpose.
Our experiments on the FBIS corpus
showed improvements of BLEU score
from 23.78 to 24.02 in Chinese-English
Facilitating translation using source language paraphrase lattices
For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out of vocabulary words; 2) the same information in an input
sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore,
given limited data, in order to facilitate translation
from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and largescale
English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resourcelimited language pairs, but also for resource sufficient pairs to some extent
Using TERp to augment the system combination for SMT
TER-Plus (TERp) is an extended TER evaluation metric incorporating morphology, synonymy and paraphrases.
There are three new edit operations in TERp: Stem Matches, Synonym Matches and Phrase Substitutions (Para-phrases). In this paper, we propose a TERp-based augmented system combination in terms of the backbone selection and consensus decoding network. Combining the new properties\ud
of the TERp, we also propose a two-pass decoding strategy for the lattice-based phrase-level confusion network(CN) to generate the final result. The experiments conducted on the NIST2008 Chinese-to-English test set show that our TERp-based augmented system combination framework achieves significant improvements in terms of BLEU and TERp scores compared to the state-of-the-art word-level system combination framework and a TER-based combination strategy
- …