A Scalable Asynchronous Distributed Algorithm for Topic Modeling
Learning meaningful topic models with massive document collections which
contain millions of documents and billions of tokens is challenging for
two reasons: first, one needs to deal with a large number of topics (typically
on the order of thousands); second, one needs a scalable and efficient way of
distributing the computation across multiple machines. In this paper we present
a novel algorithm F+Nomad LDA which simultaneously tackles both these problems.
In order to handle a large number of topics we use an appropriately modified
Fenwick tree. This data structure allows us to sample from a multinomial
distribution over the topics in time logarithmic in the number of topics. Moreover,
when topic counts change, the data structure can be updated in logarithmic time. In order to
distribute the computation across multiple processors we present a novel
asynchronous framework inspired by the Nomad algorithm of
\cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms
the state-of-the-art on massive problems involving millions of documents,
billions of words, and thousands of topics.
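The Fenwick-tree sampling trick the abstract alludes to can be sketched in a few lines. Below is a minimal, plain Fenwick (binary indexed) tree sampler in Python, not the paper's "appropriately modified" variant; the class and method names are my own. Sampling from the induced multinomial and updating a single topic weight both take O(log K) time for K topics:

```python
import random

class FenwickSampler:
    """Fenwick (binary indexed) tree over unnormalized topic weights.

    Supports O(log K) sampling from the induced multinomial and
    O(log K) updates when a single weight changes.
    """

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        self.weights = [0.0] * self.n
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, new_weight):
        """Set weight i to new_weight, touching O(log K) tree nodes."""
        delta = new_weight - self.weights[i]
        self.weights[i] = new_weight
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & (-j)

    def total(self):
        """Sum of all weights (prefix sum up to n)."""
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & (-j)
        return s

    def sample(self, u=None):
        """Return index i with probability weights[i] / total()."""
        if u is None:
            u = random.random()
        target = u * self.total()
        pos, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        # binary descent: find the first index whose cumulative
        # weight exceeds the target
        while bit > 0:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] < target:
                target -= self.tree[nxt]
                pos = nxt
            bit //= 2
        return pos
```

The binary descent avoids materializing the full cumulative distribution, which is what makes per-token Gibbs sampling over thousands of topics affordable.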
Fast and scalable Gaussian process modeling with applications to astronomical time series
The growing field of large-scale time domain astronomy requires methods for
probabilistic data analysis that are computationally tractable, even with large
datasets. Gaussian Processes are a popular class of models used for this
purpose but, since the computational cost scales, in general, as the cube of
the number of data points, their application has been limited to small
datasets. In this paper, we present a novel method for Gaussian Process
modeling in one-dimension where the computational requirements scale linearly
with the size of the dataset. We demonstrate the method by applying it to
simulated and real astronomical time series datasets. These demonstrations are
examples of probabilistic inference of stellar rotation periods, asteroseismic
oscillation spectra, and transiting planet parameters. The method exploits
structure in the problem when the covariance function is expressed as a mixture
of complex exponentials, without requiring evenly spaced observations or
uniform noise. This form of covariance arises naturally when the process is a
mixture of stochastically-driven damped harmonic oscillators -- providing a
physical motivation for and interpretation of this choice -- but we also
demonstrate that it can be a useful effective model in some other cases. We
present a mathematical description of the method and compare it to existing
scalable Gaussian Process methods. The method is fast and interpretable, with a
range of potential applications within astronomical data analysis and beyond.
We provide well-tested and documented open-source implementations of this
method in C++, Python, and Julia.
Comment: Updated in response to referee. Submitted to the AAS Journals.
Comments (still) welcome. Code available: https://github.com/dfm/celerit
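The "mixture of stochastically-driven damped harmonic oscillators" has a closed-form covariance. Here is a sketch of a single SHO term in the parameterization I believe the celerite paper uses for the underdamped case Q > 1/2; treat the exact constants as an assumption rather than a verbatim reproduction:

```python
import math

def sho_kernel(tau, S0, w0, Q):
    """Covariance at lag tau of a stochastically driven, damped
    harmonic oscillator (underdamped case, Q > 1/2).

    S0 sets the power, w0 the angular frequency, Q the quality
    factor. Parameterization assumed from the celerite paper;
    a full model would sum several such terms.
    """
    tau = abs(tau)  # stationary kernel: symmetric in the lag
    eta = math.sqrt(1.0 - 1.0 / (4.0 * Q * Q))
    decay = math.exp(-w0 * tau / (2.0 * Q))
    osc = (math.cos(eta * w0 * tau)
           + math.sin(eta * w0 * tau) / (2.0 * eta * Q))
    return S0 * w0 * Q * decay * osc
```

Because each term is an exponential times a sinusoid, the kernel is a mixture of complex exponentials, which is exactly the structure the paper's linear-scaling solver exploits.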
Deep generative modeling for single-cell transcriptomics
Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells ( https://github.com/YosefLab/scVI ). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.
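The count-noise modeling at the heart of scVI rests on a negative binomial observation likelihood. A minimal sketch of that log-likelihood in the mean/inverse-dispersion parameterization common in scRNA-seq count models; scVI's full model adds zero inflation, batch covariates, and neural-network-parameterized means, all omitted here:

```python
import math

def nb_logpmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu and inverse-dispersion theta (variance mu + mu**2/theta).

    This is the standard scRNA-seq parameterization; scVI's actual
    likelihood wraps a form like this in a deep generative model.
    """
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))
```

Overdispersion (variance exceeding the mean) is what makes the negative binomial a better fit for transcript counts than a Poisson model.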
The Effect of Recency to Human Mobility
In recent years, we have seen scientists attempt to model and explain human
dynamics and, in particular, human movement. Many aspects of our complex life
are affected by human movements such as disease spread and epidemics modeling,
city planning, wireless network development, and disaster relief, to name a
few. Given the myriad of applications it is clear that a complete understanding
of how people move in space can lead to huge benefits to our society. In most
of the recent works, scientists have focused on the idea that people's movements
are biased towards frequently-visited locations. According to them, human
movement is based on an exploration/exploitation dichotomy in which individuals
choose new locations (exploration) or return to frequently-visited locations
(exploitation). In this work, we focus on the concept of recency. We propose a
model in which exploitation in human movement also considers recently-visited
locations and not solely frequently-visited locations. We test our hypothesis
against different empirical datasets of human mobility and show that our
proposed model better explains the human trajectories in these datasets.
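The exploration/exploitation-plus-recency idea can be illustrated with a toy simulator. All parameter names and values below are illustrative assumptions, not the paper's fitted model:

```python
import random
from collections import Counter

def simulate_mobility(steps, p_explore=0.2, p_recency=0.3, seed=0):
    """Toy walker mixing exploration, frequency-based return, and
    recency-based return.

    Each step: visit a brand-new location with prob p_explore;
    otherwise return to a known one, drawn either from the five most
    recently visited locations (prob p_recency) or in proportion to
    visit frequency. Parameters are illustrative, not the paper's.
    """
    rng = random.Random(seed)
    visits = Counter()   # location -> visit count
    history = []         # full trajectory, most recent last
    next_loc = 0         # id for the next never-seen location
    for _ in range(steps):
        if not history or rng.random() < p_explore:
            loc = next_loc                      # exploration
            next_loc += 1
        elif rng.random() < p_recency:
            loc = rng.choice(history[-5:])      # recency-based return
        else:
            locs, counts = zip(*visits.items())
            loc = rng.choices(locs, weights=counts)[0]  # frequency-based
        visits[loc] += 1
        history.append(loc)
    return history, visits
```

The frequency-only model corresponds to p_recency = 0; the abstract's claim is that a nonzero recency channel matches empirical trajectories better.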
An Empirical Study of Stochastic Variational Algorithms for the Beta Bernoulli Process
Stochastic variational inference (SVI) is emerging as the most promising
candidate for scaling inference in Bayesian probabilistic models to large
datasets. However, the performance of these methods has been assessed primarily
in the context of Bayesian topic models, particularly latent Dirichlet
allocation (LDA). Deriving several new algorithms, and using synthetic, image
and genomic datasets, we investigate whether the understanding gleaned from LDA
applies in the setting of sparse latent factor models, specifically beta
process factor analysis (BPFA). We demonstrate that the big picture is
consistent: using Gibbs sampling within SVI to maintain certain posterior
dependencies is extremely effective. However, we find that different posterior
dependencies are important in BPFA relative to LDA. Particularly,
approximations able to model intra-local variable dependence perform best.
Comment: ICML, 12 pages. Volume 37: Proceedings of The 32nd International
Conference on Machine Learning, 201
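The SVI recipe under study, noisy natural-gradient updates of a global variational posterior from minibatches, can be shown on the simplest beta-Bernoulli case: a single rate, with no latent factors and no Gibbs-within-SVI step. Names and the step-size schedule are illustrative assumptions:

```python
import random

def svi_beta_bernoulli(data, a0=1.0, b0=1.0, batch=10, epochs=50,
                       tau=1.0, kappa=0.7, seed=0):
    """Stochastic variational inference for the Beta(a, b) posterior
    over a single Bernoulli rate.

    Each step forms a noisy full-dataset estimate of the natural
    parameters from a minibatch, then blends it in with step size
    rho_t = (t + tau)**(-kappa), which satisfies the Robbins-Monro
    conditions for kappa in (0.5, 1].
    """
    rng = random.Random(seed)
    N = len(data)
    a, b = a0, b0
    t = 0
    for _ in range(epochs):
        for _ in range(N // batch):
            t += 1
            mb = [data[rng.randrange(N)] for _ in range(batch)]
            # minibatch sufficient stats rescaled to the full dataset
            a_hat = a0 + (N / batch) * sum(mb)
            b_hat = b0 + (N / batch) * (batch - sum(mb))
            rho = (t + tau) ** (-kappa)
            a = (1 - rho) * a + rho * a_hat
            b = (1 - rho) * b + rho * b_hat
    return a, b  # posterior mean of the rate is a / (a + b)
```

BPFA replaces this single rate with a beta process over latent features, which is where the posterior dependencies the paper analyzes come in.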
A Hierarchical Allometric Scaling Analysis of Chinese Cities: 1991-2014
The law of allometric scaling based on Zipf distributions can be employed to
research hierarchies of cities in a geographical region. However, the
allometric patterns are easily influenced by random disturbance from the noises
in observational data. In theory, both the allometric growth law and Zipf's law
are related to the hierarchical scaling laws associated with fractal structure.
In this paper, the scaling laws of hierarchies with cascade structure are used
to study Chinese cities, and the method of R/S analysis is applied to analyze
the trend in the allometric scaling exponents. The results show that the
hierarchical scaling relations of Chinese cities became increasingly clear
from 1991 to 2014; the global allometric scaling exponent
fluctuated around 0.85, and the local scaling exponent approached 0.85. The
Hurst exponent of the allometric parameter change is greater than 0.5,
indicating persistence and long-term memory in urban evolution. The main
conclusions are as follows: the allometric scaling law of cities
represents an evolutionary order rather than an invariable rule, emerging
from the self-organized process of urbanization, and the ideas from allometry
and fractals can be combined to optimize the spatial and hierarchical structure
of urban systems in future city planning.
Comment: 28 pages, 10 figures, 5 tables
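Allometric scaling exponents like the ~0.85 values reported here are conventionally estimated by ordinary least squares on log-transformed data. A minimal sketch, assuming positive inputs; the paper's hierarchical cascade construction and R/S analysis are not reproduced:

```python
import math

def allometric_fit(x, y):
    """Estimate prefactor a and exponent b in the allometric law
    y = a * x**b by OLS on (log x, log y). Inputs must be positive.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((v - mx) ** 2 for v in lx)
    sxy = sum((u - mx) * (w - my) for u, w in zip(lx, ly))
    b = sxy / sxx            # slope of the log-log regression
    a = math.exp(my - b * mx)
    return a, b
```

Applied year by year to city-size measures, the fitted b traces out the time series of exponents whose persistence the paper then examines with R/S analysis.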