    A Scalable Asynchronous Distributed Algorithm for Topic Modeling

    Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper we present a novel algorithm, F+Nomad LDA, which tackles both problems simultaneously. To handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over $T$ items in $O(\log T)$ time. Moreover, when topic counts change, the data structure can be updated in $O(\log T)$ time. To distribute the computation across multiple processors we present a novel asynchronous framework inspired by the Nomad algorithm of \cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms the state of the art on massive problems that involve millions of documents, billions of words, and thousands of topics.
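
The sampling idea in the abstract above can be sketched with an ordinary Fenwick (binary indexed) tree. This is a minimal illustration, not the modified tree the paper describes: the class name `FenwickSampler` and its methods are hypothetical, and the weights stand in for unnormalized topic probabilities. Both drawing a sample and changing one weight touch only O(log T) tree nodes.

```python
import random


class FenwickSampler:
    """Plain Fenwick tree over T non-negative weights (hypothetical helper,
    not the paper's modified structure)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-based internal array
        for i, w in enumerate(weights):
            self.update(i, w)
        # Largest power of two not exceeding n, used by the descent in find().
        self.top = 1
        while self.top * 2 <= self.n:
            self.top *= 2

    def update(self, i, delta):
        """Add delta to the weight of item i (0-based) in O(log T)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        """Sum of all weights in O(log T)."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def find(self, u):
        """Return the 0-based item whose cumulative-weight interval contains u."""
        pos, step = 0, self.top
        while step > 0:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step //= 2
        return min(pos, self.n - 1)  # guard against floating-point round-off

    def sample(self, rng=random):
        """Draw an item with probability proportional to its weight."""
        return self.find(rng.random() * self.total())
```

For example, `FenwickSampler([1.0, 2.0, 3.0, 4.0]).sample()` returns index 3 with probability 0.4, and a later `update(0, 5.0)` shifts mass toward index 0 without rebuilding the tree.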

    Fast and scalable Gaussian process modeling with applications to astronomical time series

    The growing field of large-scale time domain astronomy requires methods for probabilistic data analysis that are computationally tractable, even with large datasets. Gaussian Processes are a popular class of models used for this purpose but, since the computational cost scales, in general, as the cube of the number of data points, their application has been limited to small datasets. In this paper, we present a novel method for Gaussian Process modeling in one dimension where the computational requirements scale linearly with the size of the dataset. We demonstrate the method by applying it to simulated and real astronomical time series datasets. These demonstrations are examples of probabilistic inference of stellar rotation periods, asteroseismic oscillation spectra, and transiting planet parameters. The method exploits structure in the problem when the covariance function is expressed as a mixture of complex exponentials, without requiring evenly spaced observations or uniform noise. This form of covariance arises naturally when the process is a mixture of stochastically-driven damped harmonic oscillators -- providing a physical motivation for and interpretation of this choice -- but we also demonstrate that it can be a useful effective model in some other cases. We present a mathematical description of the method and compare it to existing scalable Gaussian Process methods. The method is fast and interpretable, with a range of potential applications within astronomical data analysis and beyond. We provide well-tested and documented open-source implementations of this method in C++, Python, and Julia. Comment: Updated in response to referee. Submitted to the AAS Journals. Comments (still) welcome. Code available: https://github.com/dfm/celerit
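
To make the "mixture of complex exponentials" concrete, the sketch below evaluates a covariance of the real-valued form exp(-c|tau|) (a cos(d|tau|) + b sin(d|tau|)) summed over terms. The parameter names `a`, `b`, `c`, `d` and all function names are assumptions made for illustration; they do not appear in the abstract. The code builds the dense covariance matrix directly, so it shows only the kernel family, not the linear-time solver the abstract refers to.

```python
import math


def exp_term(tau, a, b, c, d):
    """One damped, oscillating term of the kernel (assumed parameterization)."""
    t = abs(tau)
    return math.exp(-c * t) * (a * math.cos(d * t) + b * math.sin(d * t))


def kernel(tau, terms):
    """Mixture kernel: the sum of exp_term over a list of (a, b, c, d) tuples."""
    return sum(exp_term(tau, a, b, c, d) for (a, b, c, d) in terms)


def covariance_matrix(times, terms, yerr):
    """Dense covariance for arbitrary (possibly uneven) times.

    yerr holds one noise standard deviation per observation, so the
    noise does not have to be uniform.
    """
    n = len(times)
    K = [[kernel(times[i] - times[j], terms) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        K[i][i] += yerr[i] ** 2
    return K
```

A call such as `covariance_matrix([0.0, 0.7, 3.1], [(1.0, 0.0, 0.5, 2.0)], [0.1, 0.2, 0.1])` uses unevenly spaced times and per-point noise, the two conditions the abstract says the method tolerates.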

    Deep generative modeling for single-cell transcriptomics.

    Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells ( https://github.com/YosefLab/scVI ). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task

    The Effect of Recency to Human Mobility

    In recent years, we have seen scientists attempt to model and explain human dynamics and, in particular, human movement. Human movement affects many aspects of our complex lives, such as disease spread and epidemic modeling, city planning, wireless network development, and disaster relief, to name a few. Given the myriad of applications, it is clear that a complete understanding of how people move in space can bring huge benefits to our society. In most recent work, scientists have focused on the idea that people's movements are biased towards frequently-visited locations. According to them, human movement is based on an exploration/exploitation dichotomy in which individuals choose new locations (exploration) or return to frequently-visited locations (exploitation). In this work, we focus on the concept of recency. We propose a model in which exploitation in human movement also considers recently-visited locations and not solely frequently-visited locations. We test our hypothesis against different empirical data of human mobility and show that our proposed model is able to better explain the human trajectories in these datasets.

    An Empirical Study of Stochastic Variational Algorithms for the Beta Bernoulli Process

    Stochastic variational inference (SVI) is emerging as the most promising candidate for scaling inference in Bayesian probabilistic models to large datasets. However, the performance of these methods has been assessed primarily in the context of Bayesian topic models, particularly latent Dirichlet allocation (LDA). Deriving several new algorithms, and using synthetic, image and genomic datasets, we investigate whether the understanding gleaned from LDA applies in the setting of sparse latent factor models, specifically beta process factor analysis (BPFA). We demonstrate that the big picture is consistent: using Gibbs sampling within SVI to maintain certain posterior dependencies is extremely effective. However, we find that different posterior dependencies are important in BPFA relative to LDA. Particularly, approximations able to model intra-local variable dependence perform best. Comment: ICML, 12 pages. Volume 37: Proceedings of The 32nd International Conference on Machine Learning, 201

    A Hierarchical Allometric Scaling Analysis of Chinese Cities: 1991-2014

    The law of allometric scaling based on Zipf distributions can be employed to research hierarchies of cities in a geographical region. However, the allometric patterns are easily influenced by random disturbances from noise in observational data. In theory, both the allometric growth law and Zipf's law are related to the hierarchical scaling laws associated with fractal structure. In this paper, the scaling laws of hierarchies with cascade structure are used to study Chinese cities, and the method of R/S analysis is applied to analyze the trend of the allometric scaling exponents. The results show that the hierarchical scaling relations of Chinese cities became clearer and clearer from 1991 to 2014; the global allometric scaling exponent values fluctuated around 0.85, and the local scaling exponent approached 0.85. The Hurst exponent of the allometric parameter change is greater than 0.5, indicating persistence and a long-term memory of urban evolution. The main conclusions are as follows: the allometric scaling law of cities represents an evolutionary order rather than an invariable rule, which emerges from the self-organized process of urbanization, and the ideas from allometry and fractals can be combined to optimize the spatial and hierarchical structure of urban systems in future city planning. Comment: 28 pages, 10 figures, 5 table
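
As a rough illustration of the R/S (rescaled range) analysis mentioned in the abstract above, the sketch below estimates a Hurst exponent as the slope of log(R/S) against log(window size). It is a textbook-style version under simplifying assumptions (non-overlapping windows, no small-sample correction); the function names and the choice of window sizes are hypothetical and are not taken from the paper.

```python
import math


def rescaled_range(series):
    """R/S statistic of one window: range of cumulative deviations from the
    mean, divided by the (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    cum, lo, hi = 0.0, 0.0, 0.0
    for x in series:
        cum += x - mean
        lo, hi = min(lo, cum), max(hi, cum)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return (hi - lo) / std


def hurst_exponent(series, window_sizes):
    """Least-squares slope of log(mean R/S) versus log(window size)."""
    xs, ys = [], []
    for w in window_sizes:
        chunks = [series[i:i + w] for i in range(0, len(series) - w + 1, w)]
        # Skip constant chunks, whose standard deviation is zero.
        values = [rescaled_range(c) for c in chunks if max(c) > min(c)]
        if values:
            xs.append(math.log(w))
            ys.append(math.log(sum(values) / len(values)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

An estimate above 0.5, as the abstract reports for the allometric parameter, is conventionally read as persistence; uncorrelated noise gives a value near 0.5, although this simple estimator tends to read somewhat high on short series.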