
    Memory-Efficient Topic Modeling

    As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages, and the required memory grows linearly with the number of documents or the number of topics. High memory usage is therefore often a major obstacle to topic modeling of massive corpora with many topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP is to relate message passing algorithms to non-negative matrix factorization (NMF) algorithms, absorbing the message update into the message passing process and thus avoiding the storage of previous messages. Experimental results on four large data sets confirm that TBP performs comparably to, or even better than, current state-of-the-art training algorithms for LDA, but with much lower memory consumption. TBP can do topic modeling when massive corpora cannot fit in computer memory, for example, extracting thematic topics from a 7 GB PUBMED corpus on a common desktop computer with 2 GB of memory. Comment: 20 pages, 7 figures
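The NMF connection the abstract draws on can be illustrated with a minimal sketch: standard Lee-Seung multiplicative updates factor a word-document count matrix using only the two factor matrices, with no per-token messages kept between iterations, which is the kind of memory saving the abstract attributes to TBP. This is only an illustration of that idea under hypothetical toy data, not the authors' TBP algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: V words x D documents, K topics.
V, D, K = 50, 20, 3
X = rng.poisson(2.0, size=(V, D)).astype(float) + 1e-12  # keep strictly positive

# Factor X ~ W @ H with KL-divergence multiplicative updates (Lee & Seung).
# W: word-topic weights, H: topic-document weights. Only W and H persist
# across iterations -- no stored per-document messages.
W = rng.random((V, K)) + 0.1
H = rng.random((K, D)) + 0.1

for _ in range(200):
    WH = W @ H
    W *= (X / WH) @ H.T / H.sum(axis=1)          # update word-topic factor
    WH = W @ H
    H *= W.T @ (X / WH) / W.sum(axis=0)[:, None]  # update topic-document factor

# Normalized columns of W behave like topic-word distributions.
topics = W / W.sum(axis=0)
print(topics.shape)
```

The memory footprint here is O(K(V + D)), independent of the total number of tokens, which is the point of absorbing message updates into the factors.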

    A New Approach to Speeding Up Topic Modeling

    Latent Dirichlet allocation (LDA) is a widely used probabilistic topic modeling paradigm that has recently found many applications in computer vision and computational biology. In this paper, we propose a fast and accurate batch algorithm, active belief propagation (ABP), for training LDA. Batch LDA algorithms usually require repeated scanning of the entire corpus and searching of the complete topic space, so for massive corpora with many topics each training iteration is inefficient and time-consuming. To accelerate training, ABP actively scans a subset of the corpus and searches a subset of the topic space, saving enormous training time in each iteration. To ensure accuracy, ABP selects only those documents and topics that contribute the largest residuals within the residual belief propagation (RBP) framework. On four real-world corpora, ABP performs around 10 to 100 times faster than state-of-the-art batch LDA algorithms with comparable topic modeling accuracy. Comment: 14 pages, 12 figures
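The residual-driven scheduling that ABP inherits from RBP ("update whatever currently has the largest residual first") can be sketched on a generic fixed-point problem rather than LDA message passing. The linear system below is a hypothetical stand-in chosen only because its residuals are easy to compute; it is not the authors' ABP.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fixed-point problem x = A @ x + b with a contractive A,
# standing in for a message passing update with per-coordinate residuals.
n = 30
A = rng.random((n, n))
A /= 2.0 * A.sum(axis=1)[:, None]   # row sums = 0.5 -> contraction
b = rng.random(n)

def residuals(x):
    return np.abs(A @ x + b - x)

# Greedy residual scheduling: always refresh the coordinate whose
# residual is currently largest, instead of sweeping everything.
x = np.zeros(n)
for _ in range(2000):
    r = residuals(x)
    i = int(np.argmax(r))
    if r[i] < 1e-10:
        break
    x[i] = A[i] @ x + b[i]

print(float(residuals(x).max()))
```

The same budget of updates spent uniformly across all coordinates converges more slowly; prioritizing large residuals concentrates work where it still changes the answer, which is the intuition behind ABP's document and topic selection.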

    Fixed-Point Algorithms for Solving the Critical Value and Upper Tail Quantile of Kuiper's Statistics

    Kuiper's statistic is a good measure of the difference between an ideal distribution and an empirical distribution in the goodness-of-fit test. However, solving for the critical value and upper tail quantile, or simply the Kuiper pair, of Kuiper's statistic is challenging, owing to the difficulty of solving the underlying nonlinear equation and of reasonably approximating the infinite series. Kuiper's pioneering work provided only the key ideas and a few numerical tables indexed by the upper tail probability α and sample size n, which limited its propagation and possible applications in various fields, since there are infinitely many configurations of the parameters α and n. This work contributes in three respects: first, a second-order approximation of the infinite series for the cumulative distribution of the critical value is used to achieve higher precision; second, the principles and fixed-point algorithms for solving the Kuiper pair are presented in detail; finally, an error in Kuiper's table of critical values is discovered and fixed. The algorithms are verified and validated by comparison with the table provided by Kuiper. The proposed methods and algorithms are enlightening and worth introducing to college students, computer programmers, engineers, experimental psychologists and others. Comment: 19 pages, 6 figures, code available on GitHub
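The flavor of the computation can be sketched with the widely used first-order asymptotic series for Kuiper's upper tail probability (with Stephens' finite-sample correction) and plain bisection, rather than the second-order series and fixed-point iterations the paper develops; the α = 0.05, n = 100 configuration below is just an example.

```python
import math

def kuiper_tail(lam, terms=100):
    """First-order asymptotic upper-tail probability Q(lambda) of Kuiper's V."""
    return sum(2.0 * (4.0 * j * j * lam * lam - 1.0)
               * math.exp(-2.0 * j * j * lam * lam)
               for j in range(1, terms + 1))

def kuiper_critical(alpha, n, tol=1e-10):
    """Critical value v with P(V_n > v) ~ alpha, by bisection on the series."""
    # Stephens' correction maps the finite-n statistic onto the asymptotic law.
    scale = math.sqrt(n) + 0.155 + 0.24 / math.sqrt(n)
    lo, hi = 0.3, 3.0  # Q is monotone decreasing on this bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kuiper_tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / scale

v = kuiper_critical(0.05, 100)
print(v)
```

Bisection is slower but unconditionally convergent on a monotone bracket; the paper's fixed-point formulation trades that robustness for speed and, with its second-order series, for precision in the tail.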

    5 GHz TMRT observations of 71 pulsars

    We present integrated pulse profiles at 5 GHz for 71 pulsars, including eight millisecond pulsars (MSPs), obtained using the Shanghai Tian Ma Radio Telescope (TMRT). Mean flux densities and pulse widths are measured. For 19 normal pulsars and one MSP, these are the first detections at 5 GHz, and for a further 19, including five MSPs, the profiles have a better signal-to-noise ratio than previous observations. Mean flux density spectra between 400 MHz and 9 GHz are presented for 27 pulsars, and the power-law spectral index is found to correlate with characteristic age, radio pseudo-luminosity and spin-down luminosity. Mode changing was detected in five pulsars. The separation between the main pulse and interpulse is shown to be frequency independent for six pulsars, but a frequency dependence of the relative intensity of the main pulse and interpulse is found. The frequency dependence of component separations is investigated for 20 pulsars and three groups are found: in seven cases the separation between the outermost leading and trailing components decreases with frequency, roughly in agreement with radius-to-frequency mapping; in eleven cases the separation is nearly constant; in the remaining two cases the separation between the outermost components increases with frequency. We obtain correlations of pulse widths with pulsar period, estimate the core widths of 23 multi-component profiles and the conal widths of 17 multi-component profiles at 5.0 GHz using Gaussian fitting, and discuss the width-period relationship at 5 GHz compared with the results at 1.0 GHz and 8.6 GHz. Comment: 46 pages, 14 figures, 8 tables, accepted by Ap
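A power-law spectral index like those fitted here is conventionally obtained by a straight-line fit in log-log space, S(ν) ∝ ν^α. The sketch below uses synthetic flux densities on an assumed frequency grid (the paper's actual measurements and frequencies are not reproduced):

```python
import numpy as np

# Hypothetical flux densities (mJy) on an assumed frequency grid (GHz),
# spanning the 400 MHz - 9 GHz range quoted in the abstract.
freq = np.array([0.4, 0.8, 1.4, 2.5, 5.0, 9.0])  # GHz
flux = 12.0 * freq ** -1.6                        # synthetic spectrum, alpha = -1.6

# Power law S ~ nu^alpha is linear in log-log space:
# log10 S = alpha * log10 nu + log10 S0, so a degree-1 fit recovers alpha.
alpha, log_s0 = np.polyfit(np.log10(freq), np.log10(flux), 1)
print(round(alpha, 2))  # -1.6
```

With real measurements one would weight the fit by the flux density uncertainties; the noiseless toy spectrum recovers the injected index exactly.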