Probabilistic Inference for Phrase-based Machine Translation: A Sampling Approach

Abstract

Recent advances in statistical machine translation (SMT) have used dynamic programming (DP) based beam search methods for approximate inference within probabilistic translation models. Despite their success, these methods compromise the probabilistic interpretation of the underlying model, thus limiting the application of probabilistically defined decision rules during training and decoding. As an alternative, in this thesis, we propose a novel Monte Carlo sampling approach for theoretically sound approximate probabilistic inference within these models. The distribution we are interested in is the conditional distribution of a log-linear translation model; however, there is often no tractable way of computing the normalisation term of the model. Instead, we develop a Gibbs sampling approach for phrase-based machine translation models which obviates the need to compute this term yet produces samples from the required distribution. We establish that the sampler effectively explores the distribution defined by a phrase-based model by showing that it converges in a reasonable amount of time to the desired distribution, irrespective of initialisation. Empirical evidence is provided to confirm that the sampler can provide accurate estimates of expectations of functions of interest. The mix of high-probability and low-probability derivations obtained through sampling is shown to provide a more accurate estimate of expectations than using only the n most probable derivations. Subsequently, we show that the sampler provides a tractable solution for finding the maximum probability translation in the model. We also present a unified approach to approximating two additional intractable problems: minimum risk training and minimum Bayes risk decoding. Key to our approach is the use of the sampler, which allows us to explore the entire probability distribution and maintain a strict probabilistic formulation through the translation pipeline.
For these tasks, sampling combines the simplicity of n-best list approaches with the extended view of the distribution that lattice-based approaches benefit from, while avoiding the biases associated with beam search. Our approach is theoretically well-motivated and can give better and more stable results than current state-of-the-art methods.
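To illustrate why Gibbs sampling does not need the normalisation term, the following is a minimal sketch on a toy log-linear model, not the sampler developed in the thesis. The three-slot state space, the feature weights, and the function names are all assumptions made up for illustration; a real phrase-based sampler would operate on segmentations, phrase translations, and reorderings. Each update only compares unnormalised scores of states that differ in one slot, so the global normaliser cancels. Because the toy space is tiny, the sampled expectation can be checked against exact enumeration.

```python
import itertools
import math
import random

# Hypothetical toy model: 3 slots, each taking one of 2 options.
OPTIONS = [0, 1]
N_SLOTS = 3


def score(state):
    """Unnormalised log-linear score w . f(state); weights are made up."""
    unary = sum(0.5 * v for v in state)
    pairwise = sum(1.0 for a, b in zip(state, state[1:]) if a == b)
    return unary + pairwise


def gibbs_step(state, rng):
    """Resample each slot in turn from its conditional given the others."""
    state = list(state)
    for i in range(N_SLOTS):
        weights = []
        for v in OPTIONS:
            state[i] = v
            # Only unnormalised scores are needed: the global normaliser
            # is identical for every candidate and cancels out.
            weights.append(math.exp(score(state)))
        state[i] = rng.choices(OPTIONS, weights=weights)[0]
    return tuple(state)


def sample(n_samples, burn_in=100, seed=0):
    """Run the chain from a random start and keep post-burn-in states."""
    rng = random.Random(seed)
    state = tuple(rng.choice(OPTIONS) for _ in range(N_SLOTS))
    samples = []
    for t in range(burn_in + n_samples):
        state = gibbs_step(state, rng)
        if t >= burn_in:
            samples.append(state)
    return samples


def exact_expectation(f):
    """Exact expectation by enumeration (feasible only for toy models)."""
    states = list(itertools.product(OPTIONS, repeat=N_SLOTS))
    z = sum(math.exp(score(s)) for s in states)
    return sum(f(s) * math.exp(score(s)) for s in states) / z


samples = sample(20000)
estimate = sum(sum(s) for s in samples) / len(samples)
exact = exact_expectation(sum)
print("sampled estimate:", estimate)
print("exact value:", exact)
```

In this sketch the function of interest is simply the number of slots set to 1; any other function of the state could be averaged over the same samples in the same way.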
