3 research outputs found

    Rethinking LDA: moment matching for discrete ICA

    We consider moment matching techniques for estimation in Latent Dirichlet Allocation (LDA). By drawing explicit links between LDA and discrete versions of independent component analysis (ICA), we first derive a new set of cumulant-based tensors with improved sample complexity. Moreover, we reuse standard ICA techniques, such as joint diagonalization of tensors, to improve over existing methods based on the tensor power method. In an extensive set of experiments on both synthetic and real datasets, we show that our new combination of tensors and orthogonal joint diagonalization techniques outperforms existing moment matching methods. Comment: 30 pages; added plate diagrams and clarifications, changed style, corrected typos, updated figures. In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS), 201

    Rich and Scalable Models for Text

    Topic models have become essential tools for uncovering hidden structures in big data. However, the most popular topic model algorithm, Latent Dirichlet Allocation (LDA), and its extensions suffer from slow performance on big datasets. Recently, the machine learning community has attacked this problem using spectral learning approaches such as the moment method with tensor decomposition or matrix factorization. The anchor word algorithm by Arora et al. [2013] has emerged as a more efficient approach to solving a large class of topic modeling problems. The anchor word algorithm is fast, and it has a provable theoretical guarantee: it will converge to a global solution given a sufficient number of documents. In this thesis, we present a series of spectral models based on the anchor word algorithm to serve a broader class of datasets and to provide richer and more flexible modeling capacity. First, we improve the anchor word algorithm by incorporating various rich priors in the form of appropriate regularization terms. Our new regularized anchor word algorithms produce higher topic quality and provide the flexibility to incorporate informed priors, making it possible to discover topics better aligned with external knowledge. Second, we enrich the anchor word algorithm with metadata-based word representation for labeled datasets. Our new supervised anchor word algorithm runs very fast and predicts better than supervised topic models such as Supervised LDA on three sentiment datasets. In addition, sentiment anchor words, which play a vital role in generating sentiment topics, provide better cues for understanding sentiment datasets than unsupervised topic models do. Lastly, we examine ALTO, an active learning framework with a static topic overview, and investigate the usability of supervised topic models for active learning. We develop a new, dynamic active learning framework that combines the informativeness and representativeness of documents using dynamically updating topics from our fast supervised anchor word algorithm. Experiments using three multi-class datasets show that our new framework consistently improves classification accuracy over ALTO.

    Méthodes des moments pour l'inférence de systèmes séquentiels linéaires rationnels

    Learning stochastic models that generate sequences has many applications in natural language processing, speech recognition, and bioinformatics. Multiplicity Automata (MA) are graphical latent variable models that encompass a wide variety of linear systems. In particular, they can model stochastic languages, stochastic processes, and controlled processes. Traditional learning algorithms such as Baum-Welch are iterative, slow, and may converge to local optima. A recent alternative is to use the Method of Moments (MoM) to design consistent and fast algorithms with pseudo-PAC guarantees. However, MoM-based algorithms have two main disadvantages. First, the PAC guarantees hold only if the size of the learned model matches the size of the target model. Second, although these algorithms learn a function close to the target distribution, most do not ensure that it is itself a distribution. Thus, a model learned from a finite number of examples may return negative values or values that do not sum to one. This thesis addresses both problems. First, we extend the theoretical guarantees to compressed models and propose a regularized spectral algorithm that adjusts the size of the model to the data. Then, we propose an application in electronic warfare: sequencing the dwells of a superheterodyne receiver. Finally, we design new MoM-based learning algorithms that do not suffer from the problem of negative probabilities, and we prove pseudo-PAC guarantees for one of them.