A Scalable Asynchronous Distributed Algorithm for Topic Modeling
Learning meaningful topic models from massive document collections, which contain millions of documents and billions of tokens, is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both of these problems.
In order to handle the large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over $K$ items in $O(\log K)$ time. Moreover, when topic counts change, the data structure can be updated in $O(\log K)$ time. In order to
distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of \cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms the state of the art on massive problems involving millions of documents, billions of words, and thousands of topics.
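The Fenwick-tree sampling idea described in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' F+ implementation; the class and method names are invented for this sketch. A binary indexed tree stores partial sums of the item weights, so a draw and a weight update each cost O(log K).

```python
import random

class FenwickSampler:
    """Sample from an unnormalized multinomial over K items in O(log K) time,
    with O(log K) weight updates, via a Fenwick (binary indexed) tree."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)   # 1-based BIT of partial sums
        self.weights = [0.0] * self.n
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, new_weight):
        """Set item i's weight (e.g. when a topic count changes)."""
        delta = new_weight - self.weights[i]
        self.weights[i] = new_weight
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & (-j)

    def _prefix(self, j):
        """Sum of weights of items 0..j-1."""
        s = 0.0
        while j > 0:
            s += self.tree[j]
            j -= j & (-j)
        return s

    def sample(self, rng=random):
        """Draw an item index with probability proportional to its weight."""
        u = rng.random() * self._prefix(self.n)
        idx, step = 0, 1
        while step * 2 <= self.n:
            step *= 2
        while step:                        # descend the implicit tree
            nxt = idx + step
            if nxt <= self.n and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                idx = nxt
            step >>= 1
        return idx                         # 0-based index of the chosen item
```

Compared with a flat array (O(K) sampling, O(1) update) or an alias table (O(1) sampling, O(K) rebuild), this structure balances the two costs, which is what makes it attractive when topic counts change on every token.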
Methods for generating variates from probability distributions
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
Diverse probabilistic results are used in the design of random univariate generators. General methods based on these are classified and relevant theoretical properties derived. This is followed by a comparative review of specific algorithms currently available for continuous and discrete univariate distributions. A need for a Zeta generator is established, and two new methods, based on inversion and on rejection with a truncated Pareto envelope respectively, are developed and compared. The paucity of algorithms for multivariate generation motivates a classification of general methods, and in particular a new method involving envelope rejection with a novel target distribution is proposed. A new method for generating first passage times in a Wiener process is constructed. This is based on the ratio of two random numbers, and its performance is compared to an existing method for generating inverse Gaussian variates. New "hybrid" algorithms for the Poisson and Negative Binomial distributions are constructed, using an Alias implementation together with a Geometric tail procedure. These are shown to be robust, exact and fast for a wide range of parameter values. Significant modifications are made to Atkinson's Poisson generator (PA), and the resulting algorithm is shown to be complementary to the hybrid method. A new method for Von Mises generation via a comparison of random numbers follows, and its performance is compared to that of Best and Fisher's Wrapped Cauchy rejection method. Finally, new methods are proposed for sampling from distribution tails, using optimally designed Exponential envelopes. Timings are given for Gamma and Normal tails, and in the latter case the performance is shown to be significantly better than Marsaglia's tail generation procedure.
Governors of Dundee College of Technology
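As an illustration of the kind of tail sampling this abstract compares against, here is a minimal sketch of Marsaglia-style rejection from an Exponential envelope for the standard Normal tail. The function name is invented, and this is the textbook baseline rather than the thesis's optimally designed envelope.

```python
import math
import random

def normal_tail(a, rng=random):
    """Sample X ~ N(0,1) conditioned on X > a (for a > 0) by rejection.
    Proposal: X = a + E with E ~ Exponential(rate a); the acceptance
    probability exp(-E^2/2) makes the accepted draws exactly tail-Normal."""
    while True:
        x = -math.log(rng.random()) / a     # Exponential(a) excess beyond a
        y = -math.log(rng.random())         # Exponential(1) for the test
        if 2.0 * y > x * x:                 # accept with prob exp(-x^2/2)
            return a + x
```

The acceptance test `2*y > x*x` is the standard trick for checking `u < exp(-x**2/2)` without evaluating the exponential on the uniform draw.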
Parallel Weighted Random Sampling
Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps, both for shared-memory and distributed-memory machines. We give efficient, fast, and practicable algorithms for sampling single items, k items with/without replacement, permutations, subsets, and reservoirs. We also give improved sequential algorithms for alias table construction and for sampling with replacement. Experiments on shared-memory parallel machines with up to 158 threads show near-linear speedups both for construction and queries.
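For context, the alias table mentioned in this abstract supports O(1) weighted sampling after an O(n) construction. Below is a minimal sequential sketch using Vose's method; the function names are mine, and this is not the paper's improved or parallel algorithm.

```python
import random

def build_alias_table(weights):
    """Vose's O(n) alias-table construction for static weighted sampling."""
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]   # scaled so the mean is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                     # column s overflows into item l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:              # numerical leftovers: full columns
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """O(1) draw: uniform column, then a biased coin inside the column."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

The construction's sequential dependence (items migrate between the `small` and `large` worklists) is precisely what makes parallelizing it nontrivial.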
Dynamic Sampling from a Discrete Probability Distribution with a Known Distribution of Rates
In this paper, we consider a number of efficient data structures for the problem of sampling from a dynamically changing discrete probability distribution, where some prior information is known about the distribution of the rates (in particular, the maximum and minimum rate), and where the number of possible outcomes N is large.
We consider three basic data structures: the Acceptance-Rejection method, the Complete Binary Tree, and the Alias Method. These can be used as building blocks in a multi-level data structure, where at each of the levels one of the basic data structures can be used.
Depending on assumptions on the distribution of the rates of the outcomes, different combinations of the basic structures can be used. We prove that for particular data structures the expected time of sampling and update is constant when the rates follow a non-decreasing distribution, a log-uniform distribution, or an inverse polynomial distribution, and show that for any distribution an expected time of sampling and update bounded in terms of the ratio $r_{\max}/r_{\min}$ is possible, where $r_{\max}$ is the maximum rate and $r_{\min}$ the minimum rate.
We also present an experimental verification, highlighting the limits imposed by the constraints of a real-life setting.
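The simplest of the three building blocks, acceptance-rejection with a known maximum rate, can be sketched as follows. This is a generic illustration with invented names, not the paper's multi-level structure.

```python
import random

def rejection_sample(rates, r_max, rng=random):
    """Acceptance-rejection over N outcomes given an upper bound r_max on
    all rates: propose an outcome uniformly, accept with probability
    rate/r_max.  Updates are O(1) (overwrite an entry in the flat rate
    array); the expected number of trials is r_max divided by the mean
    rate, which is constant whenever the rates are not too spread out."""
    n = len(rates)
    while True:
        i = rng.randrange(n)                    # uniform proposal
        if rng.random() * r_max < rates[i]:     # accept w.p. rates[i]/r_max
            return i
```

When the ratio r_max/r_min is large this method degrades, which is exactly why the multi-level combinations with trees and alias tables discussed above become necessary.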
A Fast Chi-squared Technique For Period Search of Irregularly Sampled Data
A new, computationally and statistically efficient algorithm, the Fast $\chi^2$ algorithm, can find a periodic signal with harmonic content in irregularly-sampled data with non-uniform errors. The algorithm calculates the minimized $\chi^2$ as a function of frequency at the desired number of harmonics, using Fast Fourier Transforms to provide $O(N \log N)$ performance. The code for a reference implementation is provided.
Comment: Source code for the reference implementation is available at http://public.lanl.gov/palmer/fastchi.html . Accepted by ApJ. 24 pages, 4 figures
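To illustrate the statistic such a period search minimizes, here is a brute-force sketch that, for each trial frequency, fits a harmonic model by weighted least squares and records the chi-squared reduction relative to a constant model. This is only an assumed illustration with invented names; the fast algorithm obtains the same quantity via FFTs rather than by this O(N x F) loop.

```python
import numpy as np

def chi2_periodogram(t, y, sigma, freqs, n_harmonics=2):
    """For each trial frequency f, fit mean + sum of n_harmonics harmonics
    of f by weighted least squares and return the chi^2 reduction
    chi2(constant) - chi2(harmonic fit).  Larger values = better fit."""
    w = 1.0 / sigma**2
    ybar = np.sum(w * y) / np.sum(w)
    chi2_const = np.sum(w * (y - ybar) ** 2)
    sw = np.sqrt(w)
    dchi2 = np.empty(len(freqs))
    for k, f in enumerate(freqs):
        cols = [np.ones_like(t)]
        for h in range(1, n_harmonics + 1):
            cols += [np.cos(2 * np.pi * h * f * t),
                     np.sin(2 * np.pi * h * f * t)]
        A = np.stack(cols, axis=1)
        # weighted least squares: scale rows by sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        resid = y - A @ coef
        dchi2[k] = chi2_const - np.sum(w * resid ** 2)
    return dchi2
```

The period estimate is then simply the frequency with the largest chi-squared reduction over the search grid.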
Review of Methods of Power-Spectrum Analysis as Applied to Super-Kamiokande Solar Neutrino Data
To help understand why different published analyses of the Super-Kamiokande
solar neutrino data arrive at different conclusions, we have applied six
different methods to a standardized problem. The key difference between the
various methods rests in the amount of information that each processes. A
Lomb-Scargle analysis that uses the mid times of the time bins and ignores
experimental error estimates uses the least information. A likelihood analysis
that uses the start times, end times, and mean live times, and takes account of
the experimental error estimates, makes the greatest use of the available
information. We carry out power-spectrum analyses of the Super-Kamiokande 5-day
solar neutrino data, using each method in turn, for a standard search band (0 to 50 yr^-1). For each method, we also carry out a fixed number (10,000) of Monte Carlo simulations for the purpose of estimating the significance of the leading peak in each power spectrum. We find that, with one exception, the results of these calculations are compatible with those of previously published analyses. (We are unable to replicate Koshio's recent results.) We find that the significance of the peaks at 9.43 yr^-1 and at 43.72 yr^-1 increases progressively as one incorporates more information into the analysis procedure.
Comment: 21 pages, 25 figures
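The least-information method in this comparison, a Lomb-Scargle analysis of bin mid-times with no error weighting, can be sketched as follows. This is the textbook periodogram with invented names, not the paper's analysis code.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical (unweighted) Lomb-Scargle periodogram: uses only sample
    times t and values y, ignoring error bars and bin live times."""
    y = y - y.mean()
    power = np.empty(len(freqs))
    for k, f in enumerate(freqs):
        w = 2 * np.pi * f
        # phase offset tau makes the sine and cosine terms orthogonal
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[k] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power
```

A likelihood method, by contrast, would additionally fold in the start/end times, live times, and per-bin error estimates, which is the extra information the review finds progressively sharpens the peaks.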