Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
Topic models, and more specifically the class of Latent Dirichlet Allocation
(LDA), are widely used for probabilistic modeling of text. MCMC sampling from
the posterior distribution is typically performed using a collapsed Gibbs
sampler. We propose a parallel sparse partially collapsed Gibbs sampler and
compare its speed and efficiency to state-of-the-art samplers for topic models
on five well-known text corpora of differing sizes and properties. In
particular, we propose and compare two different strategies for sampling the
parameter block with latent topic indicators. The experiments show that the
increase in statistical inefficiency from only partial collapsing is smaller
than commonly assumed, and can be more than compensated by the speedup from
parallelization and sparsity on larger corpora. We also prove that the
partially collapsed samplers scale well with the size of the corpus. The
proposed algorithm is fast, efficient, exact, and can be used in more modeling
situations than the ordinary collapsed sampler.
Comment: Accepted for publication in Journal of Computational and Graphical Statistics.
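For context, the fully collapsed Gibbs sampler that the proposed partially collapsed sampler is compared against can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the paper's sampler instead keeps the topic-word matrix as an explicit parameter block so documents can be sampled in parallel, while this baseline integrates out all continuous parameters. All variable names are mine.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, n_words, alpha=0.1, beta=0.01,
                        n_iter=50, seed=0):
    """Minimal fully collapsed Gibbs sampler for LDA (illustrative only)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, n_words))     # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = []                                  # topic indicator per token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the token's current assignment from the counts
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional with theta and phi integrated out
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw
```

The sequential dependence between token updates in this baseline is precisely what makes it hard to parallelize exactly, which motivates partial collapsing.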
Bayesian leave-one-out cross-validation for large data
Model inference, such as model comparison, model checking, and model
selection, is an important part of model development. Leave-one-out
cross-validation (LOO) is a general approach for assessing the generalizability
of a model, but unfortunately, LOO does not scale well to large datasets. We
propose combining approximate inference techniques with
probability-proportional-to-size sampling (PPS) for fast LOO model evaluation
on large datasets. We provide both theoretical and empirical results showing
good properties for large data.
Comment: Accepted to ICML 2019. This version is the submitted paper.
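The PPS idea can be sketched with a Hansen-Hurwitz style estimator: draw a small subsample with probability proportional to a cheap per-observation "size" proxy, evaluate the expensive exact quantity only on that subsample, and reweight. This is an illustrative sketch of probability-proportional-to-size sampling in general, not the authors' estimator; the function and variable names are mine.

```python
import numpy as np

def pps_loo_estimate(approx_elpd, exact_elpd_fn, m, seed=0):
    """Hansen-Hurwitz PPS estimate of the total elpd_loo (illustrative).

    approx_elpd    : cheap per-observation approximations (the size proxy)
    exact_elpd_fn  : callable i -> expensive exact elpd for observation i
    m              : subsample size
    """
    rng = np.random.default_rng(seed)
    sizes = np.abs(approx_elpd)
    p = sizes / sizes.sum()          # with-replacement draw probabilities
    idx = rng.choice(len(p), size=m, p=p)
    # unbiased estimator of the total: mean of y_i / p_i over the draws
    return np.mean([exact_elpd_fn(i) / p[i] for i in idx])
```

When the size proxy is close to the exact values, the weighted draws have low variance, so a small subsample already gives a stable estimate of the full-data LOO total.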
Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison
Leave-one-out cross-validation (LOO-CV) is a popular method for comparing
Bayesian models based on their estimated predictive performance on new, unseen,
data. Estimating the uncertainty of the resulting LOO-CV estimate is a complex
task and it is known that the commonly used standard error estimate is often
too small. We analyse the frequency properties of the LOO-CV estimator and
study the uncertainty related to it. We provide new results of the properties
of the uncertainty both theoretically and empirically and discuss the
challenges of estimating it. We show that problematic cases include: comparing
models with similar predictions, misspecified models, and small data. In these
cases, there is a weak connection between the skewness of the sampling
distribution and the distribution of the error of the LOO-CV estimator. We show that it is
possible that the problematic skewness of the error distribution, which occurs
when the models make similar predictions, does not fade away when the data size
grows to infinity in certain situations.
Comment: 88 pages, 19 figures.
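The standard error estimate discussed above is the usual normal-approximation formula for a pairwise model comparison, computed from the pointwise elpd values of the two models. A minimal sketch (the names are mine, but the formula sqrt(n * var(d_i)) is the commonly used one the abstract refers to):

```python
import numpy as np

def loo_compare(elpd_i_a, elpd_i_b):
    """Pairwise LOO-CV comparison from pointwise elpd values (illustrative).

    Returns the total elpd difference and the commonly used
    normal-approximation standard error sqrt(n * var(d_i)) -- the estimate
    whose frequency properties the paper analyses, and which can be too
    small when the models make similar predictions, are misspecified,
    or n is small.
    """
    d = np.asarray(elpd_i_a) - np.asarray(elpd_i_b)
    n = d.size
    elpd_diff = d.sum()                     # estimated difference in elpd
    se_diff = np.sqrt(n * d.var(ddof=1))    # naive standard error
    return elpd_diff, se_diff
```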
Multivariate Analysis of Orthogonal Range Searching and Graph Distances Parameterized by Treewidth
We show that the eccentricities, diameter, radius, and Wiener index of an
undirected $n$-vertex graph with nonnegative edge lengths can be computed in
time $O\big(n \cdot \binom{k + \lceil \log n \rceil}{k} \cdot 2^k k^2 \log n\big)$,
where $k$ is the treewidth of the graph. For every $\epsilon > 0$, this bound is
$n^{1+\epsilon} \exp O(k)$, which matches a hardness result of Abboud,
Vassilevska Williams, and Wang (SODA 2015) and closes an open problem in the
multivariate analysis of polynomial-time computation. To this end, we show that
the analysis of an algorithm of Cabello and Knauer (Comput. Geom., 2009) in the
regime of non-constant treewidth can be improved by revisiting the analysis of
orthogonal range searching, improving bounds of the form $\log^d n$ to
$\binom{d + \lceil \log n \rceil}{d}$, as originally observed by Monier (J. Alg.
1980).
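The gap between the two range-searching bounds is easy to check numerically: for non-constant dimension $d$, the binomial coefficient grows far more slowly than the classical $\log^d n$ term. A small sketch (function names are mine):

```python
import math

def classical_bound(n, d):
    """The classical ceil(log2 n)^d orthogonal range searching term."""
    return math.ceil(math.log2(n)) ** d

def monier_bound(n, d):
    """The binomial bound C(d + ceil(log2 n), d) attributed to Monier."""
    return math.comb(d + math.ceil(math.log2(n)), d)

# For n = 10**6 and d = 20: classical is 20**20 (about 1e26), while the
# binomial bound is C(40, 20) (about 1.4e11) -- many orders of magnitude
# smaller once d is non-constant.
```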
We also investigate the parameterization by vertex cover number.