
    Memory-Efficient Topic Modeling

    As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages, and the required memory grows linearly with the number of documents or the number of topics. High memory usage is therefore often a major obstacle to topic modeling of massive corpora with many topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP is to relate message passing algorithms to non-negative matrix factorization (NMF) algorithms, absorbing the message update into the message passing process and thus avoiding the storage of previous messages. Experimental results on four large data sets confirm that TBP performs comparably to, or even better than, current state-of-the-art training algorithms for LDA, but with much lower memory consumption. TBP can do topic modeling when massive corpora cannot fit in computer memory, for example, extracting thematic topics from a 7 GB PUBMED corpus on a common desktop computer with 2 GB of memory. Comment: 20 pages, 7 figures
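The NMF connection the abstract draws on can be illustrated with a minimal sketch: standard Lee-Seung multiplicative updates factor a word-document count matrix using only the two factor matrices, with no per-token messages kept between iterations, which is the kind of memory saving the abstract attributes to TBP. This is only an illustration of that idea under hypothetical toy data, not the authors' TBP algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: V words x D documents, K topics.
V, D, K = 50, 20, 3
X = rng.poisson(2.0, size=(V, D)).astype(float) + 1e-12  # keep strictly positive

# Factor X ~ W @ H with KL-divergence multiplicative updates (Lee & Seung).
# W: word-topic weights, H: topic-document weights. Only W and H persist
# across iterations -- no stored per-document messages.
W = rng.random((V, K)) + 0.1
H = rng.random((K, D)) + 0.1

for _ in range(200):
    WH = W @ H
    W *= (X / WH) @ H.T / H.sum(axis=1)          # update word-topic factor
    WH = W @ H
    H *= W.T @ (X / WH) / W.sum(axis=0)[:, None]  # update topic-document factor

# Normalized columns of W behave like topic-word distributions.
topics = W / W.sum(axis=0)
print(topics.shape)
```

The memory footprint here is O(K(V + D)), independent of the total number of tokens, which is the point of absorbing message updates into the factors.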

    A New Approach to Speeding Up Topic Modeling

    Latent Dirichlet allocation (LDA) is a widely used probabilistic topic modeling paradigm that has recently found many applications in computer vision and computational biology. In this paper, we propose a fast and accurate batch algorithm, active belief propagation (ABP), for training LDA. Batch LDA algorithms usually require repeated scanning of the entire corpus and searching of the complete topic space, so for massive corpora with many topics each training iteration is inefficient and time-consuming. To accelerate training, ABP actively scans a subset of the corpus and searches a subset of the topic space, saving enormous training time in each iteration. To ensure accuracy, ABP selects only those documents and topics that contribute the largest residuals within the residual belief propagation (RBP) framework. On four real-world corpora, ABP performs around 10 to 100 times faster than state-of-the-art batch LDA algorithms with comparable topic modeling accuracy. Comment: 14 pages, 12 figures
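The residual-driven scheduling that ABP inherits from RBP ("update whatever currently has the largest residual first") can be sketched on a generic fixed-point problem rather than LDA message passing. The linear system below is a hypothetical stand-in chosen only because its residuals are easy to compute; it is not the authors' ABP.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fixed-point problem x = A @ x + b with a contractive A,
# standing in for a message passing update with per-coordinate residuals.
n = 30
A = rng.random((n, n))
A /= 2.0 * A.sum(axis=1)[:, None]   # row sums = 0.5 -> contraction
b = rng.random(n)

def residuals(x):
    return np.abs(A @ x + b - x)

# Greedy residual scheduling: always refresh the coordinate whose
# residual is currently largest, instead of sweeping everything.
x = np.zeros(n)
for _ in range(2000):
    r = residuals(x)
    i = int(np.argmax(r))
    if r[i] < 1e-10:
        break
    x[i] = A[i] @ x + b[i]

print(float(residuals(x).max()))
```

The same budget of updates spent uniformly across all coordinates converges more slowly; prioritizing large residuals concentrates work where it still changes the answer, which is the intuition behind ABP's document and topic selection.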

    Fixed-Point Algorithms for Solving the Critical Value and Upper Tail Quantile of Kuiper's Statistics

    Kuiper's statistic is a good measure of the difference between an ideal distribution and an empirical distribution in the goodness-of-fit test. However, solving for the critical value and upper tail quantile, or simply the Kuiper pair, of Kuiper's statistic is challenging, owing to the difficulty of solving the underlying nonlinear equation and of reasonably approximating the infinite series. Kuiper's pioneering work provided only the key ideas and a few numerical tables indexed by the upper tail probability α and sample size n, which limited its propagation and possible applications in various fields, since there are infinitely many configurations of the parameters α and n. This work contributes in three respects: first, a second-order approximation of the infinite series for the cumulative distribution of the critical value is used to achieve higher precision; second, the principles and fixed-point algorithms for solving the Kuiper pair are presented in detail; finally, an error in Kuiper's table of critical values is discovered and fixed. The algorithms are verified and validated by comparison with the table provided by Kuiper. The proposed methods and algorithms are enlightening and worth introducing to college students, computer programmers, engineers, experimental psychologists and others. Comment: 19 pages, 6 figures, code available on GitHub
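The flavor of the computation can be sketched with the widely used first-order asymptotic series for Kuiper's upper tail probability (with Stephens' finite-sample correction) and plain bisection, rather than the second-order series and fixed-point iterations the paper develops; the α = 0.05, n = 100 configuration below is just an example.

```python
import math

def kuiper_tail(lam, terms=100):
    """First-order asymptotic upper-tail probability Q(lambda) of Kuiper's V."""
    return sum(2.0 * (4.0 * j * j * lam * lam - 1.0)
               * math.exp(-2.0 * j * j * lam * lam)
               for j in range(1, terms + 1))

def kuiper_critical(alpha, n, tol=1e-10):
    """Critical value v with P(V_n > v) ~ alpha, by bisection on the series."""
    # Stephens' correction maps the finite-n statistic onto the asymptotic law.
    scale = math.sqrt(n) + 0.155 + 0.24 / math.sqrt(n)
    lo, hi = 0.3, 3.0  # Q is monotone decreasing on this bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kuiper_tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / scale

v = kuiper_critical(0.05, 100)
print(v)
```

Bisection is slower but unconditionally convergent on a monotone bracket; the paper's fixed-point formulation trades that robustness for speed and, with its second-order series, for precision in the tail.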

    5 GHz TMRT observations of 71 pulsars

    We present integrated pulse profiles at 5 GHz for 71 pulsars, including eight millisecond pulsars (MSPs), obtained using the Shanghai Tian Ma Radio Telescope (TMRT). Mean flux densities and pulse widths are measured. For 19 normal pulsars and one MSP, these are the first detections at 5 GHz, and for a further 19, including five MSPs, the profiles have a better signal-to-noise ratio than previous observations. Mean flux density spectra between 400 MHz and 9 GHz are presented for 27 pulsars, and the power-law spectral index is found to correlate with characteristic age, radio pseudo-luminosity and spin-down luminosity. Mode changing was detected in five pulsars. The separation between the main pulse and interpulse is shown to be frequency independent for six pulsars, but a frequency dependence of the relative intensity of the main pulse and interpulse is found. The frequency dependence of component separations is investigated for 20 pulsars and three groups are found: in seven cases the separation between the outermost leading and trailing components decreases with frequency, roughly in agreement with radius-to-frequency mapping; in eleven cases the separation is nearly constant; in the remaining two cases the separation between the outermost components increases with frequency. We obtain correlations of pulse widths with pulsar period, estimate the core widths of 23 multi-component profiles and the conal widths of 17 multi-component profiles at 5.0 GHz using Gaussian fitting, and discuss the width-period relationship at 5 GHz compared with the results at 1.0 GHz and 8.6 GHz. Comment: 46 pages, 14 figures, 8 tables, accepted by Ap
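A power-law spectral index like those fitted here is conventionally obtained by a straight-line fit in log-log space, S(ν) ∝ ν^α. The sketch below uses synthetic flux densities on an assumed frequency grid (the paper's actual measurements and frequencies are not reproduced):

```python
import numpy as np

# Hypothetical flux densities (mJy) on an assumed frequency grid (GHz),
# spanning the 400 MHz - 9 GHz range quoted in the abstract.
freq = np.array([0.4, 0.8, 1.4, 2.5, 5.0, 9.0])  # GHz
flux = 12.0 * freq ** -1.6                        # synthetic spectrum, alpha = -1.6

# Power law S ~ nu^alpha is linear in log-log space:
# log10 S = alpha * log10 nu + log10 S0, so a degree-1 fit recovers alpha.
alpha, log_s0 = np.polyfit(np.log10(freq), np.log10(flux), 1)
print(round(alpha, 2))  # -1.6
```

With real measurements one would weight the fit by the flux density uncertainties; the noiseless toy spectrum recovers the injected index exactly.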