11 research outputs found
Using linear interpolation and weighted reordering hypotheses in the moses system
This paper proposes to introduce a novel reordering model in the open-source Moses toolkit. The main idea is to provide
weighted reordering hypotheses to the SMT decoder. These hypotheses are built using a first-step Ngram-based SMT
translation from a source language into a third representation that is called reordered source language. Each hypothesis
has its own weight provided by the Ngram-based decoder. This proposed reordering technique offers a better and more
efficient translation when compared to both the distance-based and the lexicalized reordering. In addition to this reordering
approach, this paper describes a domain adaptation technique which is based on a linear combination of an specific indomain
and an extra out-domain translation models. Results for both approaches are reported in the Arabic-to-English
2008 IWSLT task. When implementing the weighted reordering hypotheses and the domain adaptation technique in the
final translation system, translation results reach improvements up to 2.5 BLEU compared to a standard state-of-the-art
Moses baseline system.Postprint (published version
Real-Time Statistical Speech Translation
This research investigates the Statistical Machine Translation approaches to
translate speech in real time automatically. Such systems can be used in a
pipeline with speech recognition and synthesis software in order to produce a
real-time voice communication system between foreigners. We obtained three main
data sets from spoken proceedings that represent three different types of human
speech. TED, Europarl, and OPUS parallel text corpora were used as the basis
for training of language models, for developmental tuning and testing of the
translation system. We also conducted experiments involving part of speech
tagging, compound splitting, linear language model interpolation, TrueCasing
and morphosyntactic analysis. We evaluated the effects of variety of data
preparations on the translation results using the BLEU, NIST, METEOR and TER
metrics and tried to give answer which metric is most suitable for PL-EN
language pair.Comment: machine translation, polish englis
Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level
Text alignment and text quality are critical to the accuracy of Machine
Translation (MT) systems, some NLP tools, and any other text processing tasks
requiring bilingual data. This research proposes a language independent
bi-sentence filtering approach based on Polish (not a position-sensitive
language) to English experiments. This cleaning approach was developed on the
TED Talks corpus and also initially tested on the Wikipedia comparable corpus,
but it can be used for any text domain or language pair. The proposed approach
implements various heuristics for sentence comparison. Some of them leverage
synonyms and semantic and structural analysis of text as additional
information. Minimization of data loss was ensured. An improvement in MT system
score with text processed using the tool is discussed.Comment: arXiv admin note: text overlap with arXiv:1509.09093,
arXiv:1509.0888
Latest trends in hybrid machine translation and its applications
This survey on hybrid machine translation (MT) is motivated by the fact that hybridization techniques have become popular as they attempt to combine the best characteristics of highly advanced pure rule or corpus-based MT approaches. Existing research typically covers either simple or more complex architectures guided by either rule or corpus-based approaches. The goal is to combine the best properties of each type.
This survey provides a detailed overview of the modification of the standard rule-based architecture to include statistical knowl- edge, the introduction of rules in corpus-based approaches, and the hybridization of approaches within this last single category. The principal aim here is to cover the leading research and progress in this field of MT and in several related applications.Peer ReviewedPostprint (published version
Using linear interpolation and weighted reordering hypotheses in the moses system
This paper proposes to introduce a novel reordering model in the open-source Moses toolkit. The main idea is to provide
weighted reordering hypotheses to the SMT decoder. These hypotheses are built using a first-step Ngram-based SMT
translation from a source language into a third representation that is called reordered source language. Each hypothesis
has its own weight provided by the Ngram-based decoder. This proposed reordering technique offers a better and more
efficient translation when compared to both the distance-based and the lexicalized reordering. In addition to this reordering
approach, this paper describes a domain adaptation technique which is based on a linear combination of an specific indomain
and an extra out-domain translation models. Results for both approaches are reported in the Arabic-to-English
2008 IWSLT task. When implementing the weighted reordering hypotheses and the domain adaptation technique in the
final translation system, translation results reach improvements up to 2.5 BLEU compared to a standard state-of-the-art
Moses baseline system