
Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting

Abstract

This thesis discusses how to incorporate linguistic knowledge into an SMT system. Although one important category of linguistic knowledge is that obtained by a constituency/dependency parser, a POS/supertagger, or a morphological analyser, linguistic knowledge here covers a broader range: multi-word expressions (MWEs), out-of-vocabulary words, paraphrases, lexical semantics (or non-literal translations), named entities, coreference, and transliterations. The first discussion concerns word alignment, for which we propose an MWE-sensitive word aligner. The second concerns smoothing methods for the language model and the translation model, for which we propose a hierarchical Pitman-Yor process-based smoothing method. The common ground for these discussions is the examination of three challenging aspects of real-world data: the presence of noise, the availability of prior knowledge, and the problem of overfitting. A notable characteristic of this design is the careful use of (Bayesian) priors so that the models capture both frequent and linguistically important phenomena. This offers one way to address a shortcoming of statistical models, which often learn only from frequent examples and overlook less frequent but linguistically important phenomena.
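
For background (a sketch not drawn from the abstract itself; the notation below is assumed here): hierarchical Pitman-Yor smoothing in the style of Teh (2006), which the proposed method presumably builds on, predicts a word w after a context u by interpolating discounted counts with the prediction for the shortened context \pi(u):

    P(w \mid u) = \frac{c_{uw} - d\, t_{uw}}{\theta + c_{u\cdot}} + \frac{\theta + d\, t_{u\cdot}}{\theta + c_{u\cdot}} \, P(w \mid \pi(u))

Here c_{uw} is the count of w in context u, t_{uw} is the number of Chinese-restaurant "tables" labelled w, c_{u\cdot} and t_{u\cdot} are their sums over the vocabulary, and d and \theta are the discount and strength parameters. Restricting each observed type to a single table (t_{uw} = \min(1, c_{uw})) and setting \theta = 0 recovers interpolated Kneser-Ney smoothing, which suggests why a Pitman-Yor prior can model both frequent and rarer phenomena well.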
