
    A network approach to topic models

    One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success, in particular that of the most widely used variant, Latent Dirichlet Allocation (LDA), and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods, using a stochastic block model (SBM) with non-parametric priors, we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
    Comment: 22 pages, 10 figures, code available at https://topsbm.github.io
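    A minimal sketch of the bipartite document-word representation described above, assuming graph-tool is installed; the toy docs corpus, the kind/name property names, and the state_args choices are illustrative, while the authors' topsbm package (linked above) wraps the full workflow.

```python
import graph_tool.all as gt

# Toy corpus (illustrative only): each document is a list of word tokens.
docs = [["network", "community", "detection"],
        ["topic", "model", "inference"],
        ["community", "topic", "structure"]]

g = gt.Graph(directed=False)
kind = g.new_vertex_property("int")      # 0 = document node, 1 = word node
name = g.new_vertex_property("string")

word_v = {}
for i, doc in enumerate(docs):
    d = g.add_vertex()
    kind[d], name[d] = 0, f"doc_{i}"
    for w in doc:
        if w not in word_v:
            v = g.add_vertex()
            kind[v], name[v] = 1, w
            word_v[w] = v
        g.add_edge(d, word_v[w])         # one edge per word occurrence

# Nonparametric hierarchical SBM: the number of groups (topics and document
# clusters) is chosen automatically by minimizing the description length.
state = gt.minimize_nested_blockmodel_dl(g, state_args=dict(clabel=kind, pclabel=kind))
state.print_summary()
```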

    Efficient training algorithms for HMMs using incremental estimation

    Typically, parameter estimation for a hidden Markov model (HMM) is performed using an expectation-maximization (EM) algorithm with the maximum-likelihood (ML) criterion. The EM algorithm is an iterative scheme that is well-defined and numerically stable, but convergence may require a large number of iterations. For speech recognition systems utilizing large amounts of training material, this results in long training times. This paper presents an incremental estimation approach to speed up the training of HMMs without any loss of recognition performance. The algorithm selects a subset of data from the training set, updates the model parameters based on the subset, and then iterates the process until convergence of the parameters. The advantage of this approach is a substantial increase in the number of iterations of the EM algorithm per training token, which leads to faster training. In order to achieve reliable estimation from a small fraction of the complete data set at each iteration, two training criteria are studied: ML and maximum a posteriori (MAP) estimation. Experimental results show that training with the incremental algorithms is substantially faster than the conventional (batch) method and suffers no loss of recognition performance. Furthermore, the incremental MAP-based training algorithm improves performance over the batch version.
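    A schematic sketch of the incremental idea, using a two-component Gaussian mixture as a stand-in for HMM emission training with the ML criterion; the batch size, initialization, and number of sweeps below are arbitrary choices for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D observations from two Gaussians, a stand-in for HMM training data.
x = rng.permutation(np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)]))

K, batch_size = 2, 100
means, variances, weights = np.array([-1.0, 1.0]), np.ones(K), np.full(K, 0.5)

def e_step(batch):
    # Responsibility of each mixture component for each observation.
    log_p = (-0.5 * (batch[:, None] - means) ** 2 / variances
             - 0.5 * np.log(2 * np.pi * variances) + np.log(weights))
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    return resp / resp.sum(axis=1, keepdims=True)

# Incremental EM: parameters are re-estimated after every subset, so each pass
# over the data yields many more EM updates than a single batch iteration.
for sweep in range(10):
    for start in range(0, len(x), batch_size):
        batch = x[start:start + batch_size]
        resp = e_step(batch)
        nk = resp.sum(axis=0)
        means = (resp * batch[:, None]).sum(axis=0) / nk
        variances = (resp * (batch[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / nk.sum()

print("means:", means, "variances:", variances, "weights:", weights)
```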

    Correcting Predictive Models of Chaotic Reality

    We assume a chaotic (mixing) reality of which only a substantially aggregated state vector can be observed, and we want to predict one or more of its elements using a stochastic model. However, chaotic dynamics can be predicted only in the short term, while in the long term an ergodic distribution is the best predictor. Our stochastic model is therefore treated as a local approximation with no predictive ability for the far future. Using an estimate of the ergodic distribution of the predicted scalar (or possibly vector), we obtain, under additional reasonable assumptions, a uniquely specified resulting model that combines information from both the local model and the ergodic distribution. For a small prediction horizon, if the local model converges in probability to a constant and an additional technical assumption is fulfilled, the resulting model converges in the L1 norm to the local model. In the long term, the resulting model converges in L1 to the ergodic distribution. We also propose a formula for computing the resulting model from a nonparametric specification of the ergodic distribution (using past observations directly). Two examples follow.
    Keywords: Chaotic system; Prediction; Bayesian Analysis; Local Approximation; Ergodic Distribution
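    An illustrative sketch only, not the paper's uniquely specified combination: the exponential weight, the tau time scale, and the histogram density estimate are my assumptions. It blends a local Gaussian forecast with a nonparametric ergodic density estimated from past observations, trusting the local model less as the horizon grows.

```python
import numpy as np

def blended_density(h, local_mean, local_std, past_obs, tau=5.0, n_grid=400):
    """Mix a local predictive density with an ergodic density for horizon h.

    Hypothetical weighting w = exp(-h / tau); the paper instead derives the
    combination uniquely from its assumptions.
    """
    grid = np.linspace(past_obs.min() - 1, past_obs.max() + 1, n_grid)
    w = np.exp(-h / tau)
    local_pdf = (np.exp(-0.5 * ((grid - local_mean) / local_std) ** 2)
                 / (local_std * np.sqrt(2 * np.pi)))
    # Nonparametric ergodic density from past observations (histogram estimate).
    hist, edges = np.histogram(past_obs, bins=50, density=True)
    ergodic_pdf = np.interp(grid, 0.5 * (edges[:-1] + edges[1:]), hist)
    return grid, w * local_pdf + (1 - w) * ergodic_pdf

# Example: a short horizon follows the local forecast, a long one the ergodic law.
past = np.random.default_rng(1).standard_normal(5000)
for h in (1, 50):
    grid, pdf = blended_density(h, local_mean=2.0, local_std=0.3, past_obs=past)
    print(h, grid[np.argmax(pdf)])
```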

    Bayesian Robust Tensor Factorization for Incomplete Multiway Data

    We propose a generative model for robust tensor factorization in the presence of both missing data and outliers. The objective is to explicitly infer the underlying low-CP-rank tensor capturing the global information and a sparse tensor capturing the local information (also considered as outliers), thus providing a robust predictive distribution over missing entries. The low-CP-rank tensor is modeled by multilinear interactions between multiple latent factors on which column sparsity is enforced by a hierarchical prior, while the sparse tensor is modeled by a hierarchical view of the Student-t distribution that associates an individual hyperparameter with each element independently. For model learning, we develop an efficient closed-form variational inference under a fully Bayesian treatment, which can effectively prevent the overfitting problem and scales linearly with data size. In contrast to existing related works, our method can perform model selection automatically and implicitly, without the need to tune parameters. More specifically, it can discover the ground-truth CP rank and automatically adapt the sparsity-inducing priors to various types of outliers. In addition, the trade-off between the low-rank approximation and the sparse representation can be optimized in the sense of maximum model evidence. Extensive experiments and comparisons with many state-of-the-art algorithms on both synthetic and real-world datasets demonstrate the superiority of our method from several perspectives.
    Comment: in IEEE Transactions on Neural Networks and Learning Systems, 201
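    For the missing-data part only, a rough sketch using TensorLy's masked CP decomposition; this is a point-estimate ALS stand-in, not the paper's fully Bayesian model (no automatic rank selection and no explicit sparse outlier tensor), and the toy shapes and rank are arbitrary.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
shape, rank = (20, 20, 20), 3

# Synthetic low-CP-rank tensor with roughly 30% of its entries missing.
factors_true = [rng.normal(size=(s, rank)) for s in shape]
X = tl.cp_to_tensor((np.ones(rank), factors_true))
mask = rng.random(shape) > 0.3                      # True = observed entry
X_obs = np.where(mask, X, 0.0)

# CP/PARAFAC fitted on the observed entries only via the mask argument.
cp = parafac(tl.tensor(X_obs), rank=rank, mask=tl.tensor(mask.astype(float)),
             n_iter_max=200)
X_hat = tl.cp_to_tensor(cp)
print("RMSE on missing entries:", np.sqrt(np.mean((X_hat - X)[~mask] ** 2)))
```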