
    A Scalable Asynchronous Distributed Algorithm for Topic Modeling

    Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons: first, one needs to deal with a large number of topics (typically on the order of thousands); second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both problems. To handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over $T$ items in $O(\log T)$ time; moreover, when topic counts change, the data structure can be updated in $O(\log T)$ time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of \cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms the state of the art on massive problems involving millions of documents, billions of words, and thousands of topics.
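
    As a flavour of the sampling primitive the abstract describes, here is a minimal sketch of a plain Fenwick (binary indexed) tree supporting weighted sampling and weight updates in $O(\log T)$ time. Class and method names are illustrative; this is the textbook structure, not the authors' modified "F+" variant.

```python
import random

class FenwickSampler:
    """Sample index i with probability w[i]/sum(w); O(log T) sample and update."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-based Fenwick tree of partial sums
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to the weight of item i in O(log T)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        """Sum of all weights (prefix sum up to n)."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self):
        """Draw an index proportional to its weight in O(log T)."""
        u = random.random() * self.total()
        pos, bitmask = 0, 1 << self.n.bit_length()
        while bitmask > 0:
            nxt = pos + bitmask
            if nxt <= self.n and self.tree[nxt] < u:
                u -= self.tree[nxt]   # skip the whole subtree to the left
                pos = nxt
            bitmask >>= 1
        return pos  # 0-based index of the sampled item

sampler = FenwickSampler([0.5, 0.2, 0.3])
sampler.update(1, 0.4)   # a topic count changes: O(log T) update
print(sampler.sample())
```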

    Methods for generating variates from probability distributions

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Diverse probabilistic results are used in the design of random univariate generators. General methods based on these are classified and relevant theoretical properties derived. This is followed by a comparative review of specific algorithms currently available for continuous and discrete univariate distributions. A need for a Zeta generator is established, and two new methods, based on inversion and on rejection with a truncated Pareto envelope respectively, are developed and compared. The paucity of algorithms for multivariate generation motivates a classification of general methods and, in particular, a new method involving envelope rejection with a novel target distribution is proposed. A new method for generating first passage times in a Wiener process is constructed. This is based on the ratio of two random numbers, and its performance is compared to an existing method for generating inverse Gaussian variates. New "hybrid" algorithms for Poisson and Negative Binomial distributions are constructed, using an Alias implementation together with a Geometric tail procedure. These are shown to be robust, exact, and fast for a wide range of parameter values. Significant modifications are made to Atkinson's Poisson generator (PA), and the resulting algorithm is shown to be complementary to the hybrid method. A new method for Von Mises generation via a comparison of random numbers follows, and its performance is compared to that of Best and Fisher's Wrapped Cauchy rejection method. Finally, new methods are proposed for sampling from distribution tails, using optimally designed Exponential envelopes. Timings are given for Gamma and Normal tails, and in the latter case the performance is shown to be significantly better than Marsaglia's tail generation procedure.
    Governors of Dundee College of Technology.
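
    The thesis's own Zeta generator is not reproduced here, but the same rejection-with-Pareto-envelope idea underlies Devroye's classic sampler for the Zeta (Zipf) distribution. The following is a sketch of that textbook method, offered for flavour only; the function name is illustrative and this is not the thesis's exact algorithm.

```python
import math
import random

def sample_zeta(s):
    """Draw K with P(K = k) proportional to k**(-s), s > 1, by rejection
    from a Pareto-type envelope (Devroye 1986)."""
    b = 2.0 ** (s - 1.0)
    while True:
        u = 1.0 - random.random()                 # u in (0, 1]
        v = random.random()
        x = math.floor(u ** (-1.0 / (s - 1.0)))   # envelope draw by inversion; x >= 1
        t = (1.0 + 1.0 / x) ** (s - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:  # acceptance test
            return x

draws = [sample_zeta(2.5) for _ in range(5)]
```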

    Parallel Weighted Random Sampling

    Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps for both shared-memory and distributed-memory machines. We give efficient, fast, and practicable algorithms for sampling single items, k items with/without replacement, permutations, subsets, and reservoirs. We also give improved sequential algorithms for alias table construction and for sampling with replacement. Experiments on shared-memory parallel machines with up to 158 threads show near-linear speedups for both construction and queries.
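
    The paper's improved and parallel construction algorithms are not reproduced here; as a baseline, this is a minimal sketch of the standard sequential alias-table method (Walker/Vose) the paper builds on: O(n) construction and O(1) per sample. Names are illustrative.

```python
import random

def build_alias_table(weights):
    """Vose's O(n) alias-table construction for a discrete distribution."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]   # scale so the mean bucket mass is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                     # top up the light bucket from a heavy one
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    """O(1) draw: pick a bucket uniformly, then the item or its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_table([0.1, 0.4, 0.3, 0.2])
samples = [alias_sample(prob, alias) for _ in range(10)]
```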

    Dynamic Sampling from a Discrete Probability Distribution with a Known Distribution of Rates

    In this paper, we consider a number of efficient data structures for the problem of sampling from a dynamically changing discrete probability distribution, where some prior information is known on the distribution of the rates, in particular the maximum and minimum rate, and where the number of possible outcomes $N$ is large. We consider three basic data structures: the Acceptance-Rejection method, the Complete Binary Tree, and the Alias Method. These can be used as building blocks in a multi-level data structure, where at each level one of the basic data structures can be used. Depending on assumptions on the distribution of the rates of outcomes, different combinations of the basic structures can be used. We prove that for particular data structures the expected time of sampling and update is constant when the rates follow a non-decreasing distribution, a log-uniform distribution, or an inverse polynomial distribution, and show that for any distribution an expected time of sampling and update of $O(\log\log(r_{max}/r_{min}))$ is possible, where $r_{max}$ is the maximum rate and $r_{min}$ the minimum rate. We also present an experimental verification, highlighting the limits imposed by the constraints of a real-life setting.
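
    To make the Acceptance-Rejection building block concrete, here is a minimal sketch under the assumption that an upper bound on the rates is known: updates are O(1), and since the acceptance probability per round is at least $r_{min}/r_{max}$, the expected number of proposal rounds per sample is at most $r_{max}/r_{min}$. Names are illustrative, and the paper's multi-level combinations are not shown.

```python
import random

class DynamicARSampler:
    """Acceptance-rejection over N outcomes with rates in (0, r_max].
    O(1) updates; expected cost per sample is at most r_max/r_min rounds."""

    def __init__(self, rates, r_max):
        self.rates = list(rates)
        self.r_max = r_max               # known upper bound on any rate

    def update(self, i, new_rate):
        assert 0 < new_rate <= self.r_max
        self.rates[i] = new_rate         # O(1): no global structure to rebuild

    def sample(self):
        n = len(self.rates)
        while True:
            i = random.randrange(n)                       # propose uniformly
            if random.random() * self.r_max < self.rates[i]:
                return i                                  # accept w.p. rate[i]/r_max

s = DynamicARSampler([1.0, 2.0, 4.0], r_max=8.0)
s.update(0, 3.0)
idx = s.sample()
```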

    A Fast Chi-squared Technique For Period Search of Irregularly Sampled Data

    A new, computationally and statistically efficient algorithm, the Fast $\chi^2$ algorithm, can find a periodic signal with harmonic content in irregularly sampled data with non-uniform errors. The algorithm calculates the minimized $\chi^2$ as a function of frequency at the desired number of harmonics, using Fast Fourier Transforms to provide $O(N \log N)$ performance. The code for a reference implementation is provided. Comment: Source code for the reference implementation is available at http://public.lanl.gov/palmer/fastchi.html . Accepted by ApJ. 24 pages, 4 figures.
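
    The FFT-accelerated implementation lives at the linked URL; for intuition only, the sketch below computes the same quantity by brute force — the weighted $\chi^2$ reduction of a harmonic fit at each trial frequency — at O(N) cost per frequency rather than the paper's $O(N \log N)$ overall. The function name and two-harmonic default are illustrative.

```python
import numpy as np

def chi2_periodogram(t, y, sigma, freqs, n_harmonics=2):
    """At each trial frequency, fit a mean plus n_harmonics sinusoids by
    weighted least squares and return the chi^2 reduction relative to a
    constant model (larger = stronger periodic signal)."""
    w = 1.0 / sigma**2
    ybar = np.sum(w * y) / np.sum(w)
    chi2_const = np.sum(w * (y - ybar) ** 2)
    dchi2 = np.empty(len(freqs))
    for k, f in enumerate(freqs):
        cols = [np.ones_like(t)]
        for h in range(1, n_harmonics + 1):
            cols += [np.cos(2 * np.pi * h * f * t), np.sin(2 * np.pi * h * f * t)]
        A = np.column_stack(cols)
        Aw = A * np.sqrt(w)[:, None]     # weighted least squares via row scaling
        yw = y * np.sqrt(w)
        coef, *_ = np.linalg.lstsq(Aw, yw, rcond=None)
        resid = yw - Aw @ coef
        dchi2[k] = chi2_const - np.sum(resid**2)
    return dchi2

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 300))        # irregular sampling
sigma = rng.uniform(0.5, 1.5, t.size)        # non-uniform errors
y = np.sin(2 * np.pi * 0.1 * t) + sigma * rng.standard_normal(t.size)
freqs = np.linspace(0.01, 0.5, 1000)
best = freqs[np.argmax(chi2_periodogram(t, y, sigma, freqs))]
```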

    Review of Methods of Power-Spectrum Analysis as Applied to Super-Kamiokande Solar Neutrino Data

    To help understand why different published analyses of the Super-Kamiokande solar neutrino data arrive at different conclusions, we have applied six different methods to a standardized problem. The key difference between the various methods rests in the amount of information that each processes. A Lomb-Scargle analysis that uses the mid-times of the time bins and ignores experimental error estimates uses the least information. A likelihood analysis that uses the start times, end times, and mean live times, and takes account of the experimental error estimates, makes the greatest use of the available information. We carry out power-spectrum analyses of the Super-Kamiokande 5-day solar neutrino data, using each method in turn, for a standard search band (0 to 50 yr$^{-1}$). For each method, we also carry out a fixed number (10,000) of Monte Carlo simulations to estimate the significance of the leading peak in each power spectrum. We find that, with one exception, the results of these calculations are compatible with those of previously published analyses. (We are unable to replicate Koshio's recent results.) We find that the significance of the peaks at 9.43 yr$^{-1}$ and at 43.72 yr$^{-1}$ increases progressively as one incorporates more information into the analysis procedure. Comment: 21 pages, 25 figures.
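
    As an illustration of the kind of Monte Carlo significance test described above, applied to a Lomb-Scargle periodogram (the least-informative method in the comparison), the sketch below estimates a false-alarm probability for the leading peak by permuting the data. This is a simplified stand-in for the paper's procedure, not a reproduction of it; names are illustrative, and note that scipy's lombscargle ignores error estimates, which is exactly the limitation the review discusses.

```python
import numpy as np
from scipy.signal import lombscargle

def peak_significance(t, y, freqs_ang, n_sim=1000, rng=None):
    """Fraction of signal-free (permuted) datasets whose highest
    Lomb-Scargle peak meets or exceeds the observed one.
    freqs_ang are angular frequencies (rad per time unit)."""
    rng = rng or np.random.default_rng()
    y0 = y - y.mean()
    observed = lombscargle(t, y0, freqs_ang).max()
    exceed = 0
    for _ in range(n_sim):
        sim = rng.permutation(y0)            # destroy any time ordering
        if lombscargle(t, sim, freqs_ang).max() >= observed:
            exceed += 1
    return exceed / n_sim                    # false-alarm probability

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 10, 200))
y = np.sin(2 * np.pi * 3.0 * t) + rng.standard_normal(t.size)
freqs_ang = 2 * np.pi * np.linspace(0.1, 50, 2000)   # search band up to 50 per time unit
p = peak_significance(t, y, freqs_ang, n_sim=200)    # the paper used 10,000 simulations
```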