14 research outputs found

    Pushdown automata in statistical machine translation

    Get PDF
    This article describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with a decoder based on a finite state automata representation, showing that PDAs provide a more suitable framework to achieve exact decoding for larger synchronous context-free grammars and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first-pass to address the results of PDA complexity analysis. We study in depth the experimental conditions and tradeoffs in which HiPDT can achieve state-of-the-art performance for large-scale SMT. </jats:p

    Phylogenetic quantification of intra-tumour heterogeneity.

    Get PDF
    Intra-tumour genetic heterogeneity is the result of ongoing evolutionary change within each cancer. The expansion of genetically distinct sub-clonal populations may explain the emergence of drug resistance, and if so, would have prognostic and predictive utility. However, methods for objectively quantifying tumour heterogeneity have been missing and are particularly difficult to establish in cancers where predominant copy number variation prevents accurate phylogenetic reconstruction owing to horizontal dependencies caused by long and cascading genomic rearrangements. To address these challenges, we present MEDICC, a method for phylogenetic reconstruction and heterogeneity quantification based on a Minimum Event Distance for Intra-tumour Copy-number Comparisons. Using a transducer-based pairwise comparison function, we determine optimal phasing of major and minor alleles, as well as evolutionary distances between samples, and are able to reconstruct ancestral genomes. Rigorous simulations and an extensive clinical study show the power of our method, which outperforms state-of-the-art competitors in reconstruction accuracy, and additionally allows unbiased numerical quantification of tumour heterogeneity. Accurate quantification and evolutionary inference are essential to understand the functional consequences of tumour heterogeneity. The MEDICC algorithms are independent of the experimental techniques used and are applicable to both next-generation sequencing and array CGH data.This is the final published version. It was originally published by PLoS in PLoS Computational Biology here: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003535

    Syntactically Guided Neural Machine Translation

    Get PDF
    We investigate the use of hierarchical phrase-based SMT lattices in end-to-end neural machine translation (NMT). Weight pushing transforms the Hiero scores for complete translation hypotheses, with the full translation grammar score and full n-gram language model score, into posteriors compatible with NMT predictive probabilities. With a slightly modified NMT beam-search decoder we find gains over both Hiero and NMT decoding alone, with practical advantages in extending NMT to very large input and output vocabularies.Engineering and Physical Sciences Research Council (Grant ID: EP/L027623/1

    A Formal Model of Ambiguity and its Applications in Machine Translation

    Get PDF
    Systems that process natural language must cope with and resolve ambiguity. In this dissertation, a model of language processing is advocated in which multiple inputs and multiple analyses of inputs are considered concurrently and a single analysis is only a last resort. Compared to conventional models, this approach can be understood as replacing single-element inputs and outputs with weighted sets of inputs and outputs. Although processing components must deal with sets (rather than individual elements), constraints are imposed on the elements of these sets, and the representations from existing models may be reused. However, to deal efficiently with large (or infinite) sets, compact representations of sets that share structure between elements, such as weighted finite-state transducers and synchronous context-free grammars, are necessary. These representations and algorithms for manipulating them are discussed in depth in depth. To establish the effectiveness and tractability of the proposed processing model, it is applied to several problems in machine translation. Starting with spoken language translation, it is shown that translating a set of transcription hypotheses yields better translations compared to a baseline in which a single (1-best) transcription hypothesis is selected and then translated, independent of the translation model formalism used. More subtle forms of ambiguity that arise even in text-only translation (such as decisions conventionally made during system development about how to preprocess text) are then discussed, and it is shown that the ambiguity-preserving paradigm can be employed in these cases as well, again leading to improved translation quality. A model for supervised learning that learns from training data where sets (rather than single elements) of correct labels are provided for each training instance and use it to learn a model of compound word segmentation is also introduced, which is used as a preprocessing step in machine translation

    Methods for frequent sequence mining with subsequence constraints

    Full text link
    In this thesis, we study scalable and general purpose methods for mining frequent sequences that satisfy a given subsequence constraint. Frequent sequence mining is a fundamental task in data mining and has many real-life applications like information extraction, market-basket analysis, web usage mining, or session analysis. Depending on the underlying application, we are generally interested in discovering certain frequent sequences, which are described using subsequence constraints. There exists many tools and algorithms for this task, however, they are not sufficiently scalable to deal with large amounts of data that may arise in applications and are generally not extensible across range of applications. We propose scalable, distributed sequence mining algorithms that target MapReduce. Our work builds on MG-FSM, which is a distributed framework for frequent sequence mining. We propose novel algorithms that improve and extend the basic MG-FSM framework to efficiently support traditional subsequence constraints that arise in applications. Additionally, we show that many subsequence constraints---including and beyond the traditional ones considered in literature---can be unified in a single framework. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners. To this end, we propose a general purpose framework that provides a set of simple and intuitive ``pattern expressions'', which allows to describe any subsequence constraint of interest and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our experimental study on real-world datasets indicates that our proposed algorithms are scalable and effective across wide range of applications

    Exact sampling and optimisation in statistical machine translation

    Get PDF
    In Statistical Machine Translation (SMT), inference needs to be performed over a high-complexity discrete distribution de ned by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation { the task of searching for the optimum translation { or sampling { the task of nding a subset of translations that is statistically representative of the goal distribution. Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. For inference tasks other than optimisation, rather than nding a single optimum, one is really interested in obtaining a set of probabilistic samples from the distribution. This is the case in training where one wishes to obtain unbiased estimates of expectations in order to t the parameters of a model. Samples are also necessary in consensus decoding where one chooses from a sample of likely translations the one that minimises a loss function. Due to the additional computational challenges posed by sampling, n-best lists, a by-product of optimisation, are typically used as a biased approximation to true probabilistic samples. A more direct procedure is to attempt to directly draw samples from the underlying distribution rather than rely on n-best list approximations. Markov Chain Monte Carlo (MCMC) methods, such as Gibbs sampling, o er a way to overcome the tractability issues in sampling, however their convergence properties are hard to assess. That is, it is di cult to know when, if ever, an MCMC sampler is producing samples that are compatible iii with the goal distribution. Rejection sampling, a Monte Carlo (MC) method, is more fundamental and natural, it o ers strong guarantees, such as unbiased samples, but is typically hard to design for distributions of the kind addressed in SMT, rendering an intractable method. A recent technique that stresses a uni ed view between the two types of inference tasks discussed here | optimisation and sampling | is the OS approach. OS can be seen as a cross between Adaptive Rejection Sampling (an MC method) and A optimisation. In this view the intractable goal distribution is upperbounded by a simpler (thus tractable) proxy distribution, which is then incrementally re ned to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level. This thesis introduces an approach to exact optimisation and exact sampling in SMT by addressing the tractability issues associated with the intersection between the translation hypergraph and the language model. The two forms of inference are handled in a uni ed framework based on the OS approach. In short, an intractable goal distribution, over which one wishes to perform inference, is upperbounded by tractable proposal distributions. A proposal represents a relaxed version of the complete space of weighted translation derivations, where relaxation happens with respect to the incorporation of the language model. These proposals give an optimistic view on the true model and allow for easier and faster search using standard dynamic programming techniques. In the OS approach, such proposals are used to perform a form of adaptive rejection sampling. In rejection sampling, samples are drawn from a proposal distribution and accepted or rejected as a function of the mismatch between the proposal and the goal. The technique is adaptive in that rejected samples are used to motivate a re nement of the upperbound proposal that brings it closer to the goal, improving the rate of acceptance. Optimisation can be connected to an extreme form of sampling, thus the framework introduced here suits both exact optimisation and exact iv sampling. Exact optimisation means that the global maximum is found with a certi cate of optimality. Exact sampling means that unbiased samples are independently drawn from the goal distribution. We show that by using this approach exact inference is feasible using only a fraction of the time and space that would be required by a full intersection, without recourse to pruning techniques that only provide approximate solutions. We also show that the vast majority of the entries (n-grams) in a language model can be summarised by shorter and optimistic entries. This means that the computational complexity of our approach is less sensitive to the order of the language model distribution than a full intersection would be. Particularly in the case of sampling, we show that it is possible to draw exact samples compatible with distributions which incorporate a high-order language model component from proxy distributions that are much simpler. In this thesis, exact inference is performed in the context of both hierarchical and phrase-based models of translation, the latter characterising a problem that is NP-complete in nature.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    USE OF LANGUAGE TECHNOLOGY TO IMPROVE MATCHING AND RETRIEVAL IN TRANSLATION MEMORY

    Get PDF
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of PhilosophyCurrent Translation Memory (TM) tools lack semantic knowledge while matching. Most TM tools compute similarity at the string level, which does not take into account semantic aspects in matching. Therefore, semantically similar segments, which differ on the surface form, are often not retrieved. In this thesis, we present five novel and efficient approaches to incorporate advanced semantic knowledge in translation memory matching and retrieval. Two efficient approaches which use a paraphrase database to improve translation memory matching and retrieval are presented. Both automatic and human evaluations are conducted. The results on both evaluations show that paraphrasing improves matching and retrieval. An approach based on manually designed features extracted using NLP systems and resources is presented, where a Support Vector Machine (SVM) regression model is trained, which calculates the similarity between two segments. The approach based on manually designed features did not retrieve better matches than simple edit-distance. Two approaches for retrieving segments from a TM using deep learning are investigated. The first one is based on Long Short Term Memory (LSTM) networks, while the other one is based on Tree Structured Long Short Term Memory (Tree-LSTM) networks. Eight different models using different datasets and settings are trained. The results are comparable to a baseline which uses simple edit-distance
    corecore