
    GEOMETRIC APPROACHES TO DISCRIMINATIVE TRAINING

    Discriminative training as a general machine learning approach has wide applications in tasks like Natural Language Processing (NLP) and Automatic Speech Recognition (ASR). In this thesis, we are interested in online methods for discriminative training due to their simplicity, efficiency and scalability. The novel methods we propose are summarized as follows. First, an interesting subclass of online learning algorithms adopts multiplicative instead of additive strategies to update the parameters of linear models, but none of them can be directly used for structured prediction as required by many NLP tasks. We extend the multiplicative Winnow algorithm to a structured version and the additive MIRA algorithm to a multiplicative version, and apply them to NLP tasks. We also interpret the relationship between EG and prod, two multiplicative algorithms, from an information-geometric perspective. Secondly, although general online learning frameworks, notably Online Mirror Descent (OMD), exist and subsume many specific algorithms, they are not suitable for deriving multiplicative algorithms. We therefore propose a new general framework named Generalized Multiplicative Update (GMU) that is multiplicative in nature and easily derives many specific multiplicative algorithms. We then propose a subclass of GMU, named the q-Exponentiated Gradient (qEG) method, that elucidates the relationship among several of the algorithms. To better understand the difference between OMD and GMU, we give a further analysis of these algorithms from a Riemannian-geometric perspective. We also extend OMD and GMU to accelerated versions by adding momentum terms. Thirdly, although natural gradient descent (NGD) is often hard to apply in practice due to its computational difficulty, we propose a novel approach to CRF training which allows efficient application of NGD. The loss functions, defined by Bregman divergences, generalize the log-likelihood objective and can easily be coupled with NGD for optimization. The proposed framework is flexible, allowing us to choose convex functions that lead to better training performance. Finally, traditional vector-space linear models require estimating as many parameters as the number of model features. In the presence of millions of features, a common phenomenon in many NLP tasks, this may complicate the training procedure, especially when labeled training data is scarce. We propose a novel online learning approach by shifting from vector space to tensor space, which dramatically reduces the number of parameters to be estimated. The resulting model is highly regularized and is particularly suitable for training in low-resource environments.
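    As background for the additive/multiplicative distinction discussed in this abstract, the minimal Python sketch below contrasts a plain additive gradient step with an Exponentiated Gradient (EG) step on the probability simplex. It is only an illustration of the underlying idea: the toy quadratic loss, step size and function names are assumptions, and it is not the structured Winnow, multiplicative MIRA, GMU or qEG algorithms developed in the thesis.

        import numpy as np

        def additive_update(w, grad, eta=0.1):
            # Standard additive gradient step (e.g. plain online gradient descent).
            return w - eta * grad

        def eg_update(w, grad, eta=0.1):
            # Exponentiated Gradient (EG): a multiplicative update followed by
            # renormalization, so the iterate stays on the probability simplex.
            w_new = w * np.exp(-eta * grad)
            return w_new / w_new.sum()

        # Toy usage: minimize 0.5 * ||w - target||^2 over the simplex.
        target = np.array([0.7, 0.2, 0.1])
        w = np.ones(3) / 3                  # uniform starting point on the simplex
        for _ in range(200):
            grad = w - target               # gradient of the toy quadratic loss
            w = eg_update(w, grad, eta=0.5)
        print(w)                            # close to target

    The multiplicative form keeps the weights positive and normalized without an explicit projection step, which is one reason EG-style updates are attractive when the parameters represent a distribution.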

    Large-scale variational inference for Bayesian joint regression modelling of high-dimensional genetic data

    Genetic association studies have become increasingly important in understanding the molecular bases of complex human traits. The specific analysis of intermediate molecular traits, via quantitative trait locus (QTL) studies, has recently received much attention, prompted by the advance of high-throughput technologies for quantifying gene, protein and metabolite levels. Of great interest is the detection of weak trans-regulatory effects between a genetic variant and a distal gene product. In particular, hotspot genetic variants, which remotely control the levels of many molecular outcomes, may initiate decisive functional mechanisms underlying disease endpoints. This thesis proposes a Bayesian hierarchical approach for joint analysis of QTL data on a genome-wide scale. We consider a series of parallel sparse regressions combined in a hierarchical manner to flexibly accommodate high-dimensional responses (molecular levels) and predictors (genetic variants), and we present new methods for large-scale inference. Existing approaches have limitations. Conventional marginal screening does not account for local dependencies and association patterns common to multiple outcomes and genetic variants, whereas joint modelling approaches are restricted to relatively small datasets by computational constraints. Our novel framework allows information-sharing across outcomes and variants, thereby enhancing the detection of weak trans and hotspot effects, and implements tailored variational inference procedures that allow simultaneous analysis of data for an entire QTL study, comprising hundreds of thousands of predictors, and thousands of responses and samples. The present work also describes extensions to leverage spatial and functional information on the genetic variants, for example, using predictor-level covariates such as epigenomic marks. Moreover, we augment variational inference with simulated annealing and parallel expectation-maximisation schemes in order to enhance exploration of highly multimodal spaces and allow efficient empirical Bayes estimation. Our methods, publicly available as packages implemented in R and C++, are extensively assessed in realistic simulations. Their advantages are illustrated in several QTL applications, including a large-scale proteomic QTL study on two clinical cohorts that highlights novel candidate biomarkers for metabolic disorders.
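    To make the hierarchical structure concrete, a generic spike-and-slab formulation of a "series of parallel sparse regressions" can be written as below (in LaTeX). The specific priors, the link f and the symbols are illustrative assumptions, not necessarily the exact specification used in the thesis.

        % Illustrative hierarchical sparse-regression model (assumed form)
        \begin{aligned}
          y_{t} &= X \beta_{t} + \varepsilon_{t}, \qquad \varepsilon_{t} \sim \mathcal{N}(0, \tau_{t}^{-1} I), & t &= 1, \dots, T \ \text{(molecular outcomes)}, \\
          \beta_{st} \mid \gamma_{st} &\sim \gamma_{st}\, \mathcal{N}(0, \sigma^{2}) + (1 - \gamma_{st})\, \delta_{0}, & s &= 1, \dots, S \ \text{(genetic variants)}, \\
          \gamma_{st} &\sim \operatorname{Bernoulli}(\omega_{st}), \qquad \omega_{st} = f(\theta_{s}, \zeta_{t}).
        \end{aligned}

    In a formulation of this kind, sharing the variant-level parameters \theta_{s} across all T outcomes is what allows borrowing of strength for hotspot detection, while outcome-level parameters \zeta_{t} share information across variants; variational inference then approximates the joint posterior over all \beta and \gamma at genome-wide scale.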

    Natural Conjugate Gradient in Variational Inference

    Variational methods for approximate inference in machine learning often adapt a parametric probability distribution to optimize a given objective function. This view is especially useful when applying variational Bayes (VB) to models outside the conjugate-exponential family. For them, variational Bayesian expectation maximization (VB EM) algorithms are not easily available, and gradient-based methods are often used as alternatives. Traditional natural gradient methods use the Riemannian structure (or geometry) of the predictive distribution to speed up maximum likelihood estimation. We propose using the geometry of the variational approximating distribution instead to speed up a conjugate gradient method for variational learning and inference. The computational overhead is small due to the simplicity of the approximating distribution. Experiments with real-world speech data show significant speedups over the standard conjugate gradient method.
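    As a rough illustration of using the geometry of the approximating distribution (a single natural-gradient step for a one-dimensional Gaussian q, not the paper's natural conjugate gradient algorithm itself), a Python sketch might look as follows; the objective gradient, step size and function names are placeholder assumptions.

        import numpy as np

        def fisher_gaussian(sigma2):
            # Fisher information of q(x) = N(mu, sigma2) with respect to (mu, sigma2):
            # diag(1 / sigma2, 1 / (2 * sigma2**2)).
            return np.diag([1.0 / sigma2, 1.0 / (2.0 * sigma2 ** 2)])

        def natural_gradient_step(mu, sigma2, grad, lr=0.1):
            # Precondition the Euclidean gradient with the inverse Fisher matrix of q,
            # so the step follows the Riemannian geometry of the approximating
            # distribution rather than the raw parameter space.
            nat_grad = np.linalg.solve(fisher_gaussian(sigma2), grad)
            mu_new, sigma2_new = np.array([mu, sigma2]) - lr * nat_grad
            return mu_new, max(sigma2_new, 1e-8)    # keep the variance positive

        # Hypothetical usage with a made-up gradient of the variational objective.
        mu, sigma2 = 0.0, 1.0
        grad = np.array([0.5, -0.2])
        mu, sigma2 = natural_gradient_step(mu, sigma2, grad)
        print(mu, sigma2)

    Because the Fisher matrix of such a simple approximating distribution is small and diagonal, the preconditioning adds little cost, which reflects the abstract's remark that the computational overhead is small.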