3 research outputs found

    A Richly Annotated, Multilingual Parallel Corpus for Hybrid Machine Translation

    Get PDF
    In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approaches can be designed so that the resulting hybrid translations show an improvement over the individual “component” translations. As a first step towards achieving this objective we have developed a parallel corpus with source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybrid machine translation algorithms and system combination techniques can benefit from additional (linguistically motivated, decoding, and runtime) information provided by the different systems involved. In this paper, we describe the annotated corpus we have created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experiments with the ML4HMT corpus data

    A Richly annotated, multilingual parallel corpus for hybrid machine translation

    No full text
    In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approachescan bedesigned so that theresulting hybrid translationsshow an improvement over theindividual “component” translations. As a first step towards achieving this objectivewe have developed a parallel corpuswith source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybridmachinetranslation algorithmsand systemcombination techniques can benefit fromadditional (linguistically motivated, decoding, and runtime) information provided by thedifferent systems involved. In this paper, wedescribe the annotated corpuswehave created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experimentswith theML4HMT corpus data

    Hybrid machine translation using binary classification models trained on joint, binarised feature vectors

    Get PDF
    We describe the design and implementation of a system combination method for machine translation output. It is based on sentence selection using binary classification models estimated on joint, binarised feature vectors. By contrast to existing system combination methods which work by dividing candidate translations into n-grams, i.e., sequences of n words or tokens, our framework performs sentence selection which does not alter the selected, best translation. First, we investigate the potential performance gain attainable by optimal sentence selection. To do so, we conduct the largest meta-study on data released by the yearly Workshop on Statistical Machine Translation (WMT). Second, we introduce so-called joint, binarised feature vectors which explicitly model feature value comparison for two systems A, B. We compare different settings for training binary classifiers using single, joint, as well as joint, binarised feature vectors. After having shown the potential of both selection and binarisation as methodological paradigms, we combine these two into a combination framework which applies pairwise comparison of all candidate systems to determine the best translation for each individual sentence. Our system is able to outperform other state-of-the-art system combination approaches; this is confirmed by our experiments. We conclude by summarising the main findings and contributions of our thesis and by giving an outlook to future research directions.Wir beschreiben den Entwurf und die Implementierung eines Systems zur Kombination von Übersetzungen auf Basis nicht modifizierender Auswahl gegebener Kandidaten. Die zugehörigen, binären Klassifikationsmodelle werden unter Verwendung von gemeinsamen, binärisierten Merkmalsvektoren trainiert. Im Gegensatz zu anderen Methoden zur Systemkombination, die die gegebenen Kandidatenübersetzungen in n-Gramme, d.h., Sequenzen von n Worten oder Symbolen zerlegen, funktioniert unser Ansatz mit Hilfe von nicht modifizierender Auswahl der besten Übersetzung. Zuerst untersuchen wir das Potenzial eines solches Ansatzes im Hinblick auf die maximale theoretisch mögliche Verbesserung und führen die größte Meta-Studie auf Daten, welche jährlich im Rahmen der Arbeitstreffen zur Statistischen Maschinellen Übersetzung (WMT) veröffentlicht worden sind, durch. Danach definieren wir sogenannte gemeinsame, binärisierte Merkmalsvektoren, welche explizit den Merkmalsvergleich zweier Systeme A, B modellieren. Wir vergleichen verschiedene Konfigurationen zum Training binärer Klassifikationsmodelle basierend auf einfachen, gemeinsamen, sowie gemeinsamen, binärisierten Merkmalsvektoren. Abschließend kombinieren wir beide Verfahren zu einer Methodik, die paarweise Vergleiche aller Quellsysteme zur Bestimmung der besten Übesetzung einsetzt. Wir schließen mit einer Zusammenfassung und einem Ausblick auf zukünftige Forschungsthemen
    corecore