Scalable Bayesian Non-Negative Tensor Factorization for Massive Count Data
We present a Bayesian non-negative tensor factorization model for
count-valued tensor data, and develop scalable inference algorithms (both batch
and online) for dealing with massive tensors. Our generative model can handle
overdispersed counts as well as infer the rank of the decomposition. Moreover,
leveraging a reparameterization of the Poisson distribution as a multinomial
facilitates conjugacy in the model and enables simple and efficient Gibbs
sampling and variational Bayes (VB) inference updates, with a computational
cost that depends only on the number of nonzeros in the tensor. The model also
yields interpretable factors: in our model, each factor
corresponds to a "topic". We develop a set of online inference algorithms that
allow further scaling up the model to massive tensors, for which batch
inference methods may be infeasible. We apply our framework to diverse
real-world applications, such as multiway topic modeling on a scientific
publications database, analyzing a political science data set, and analyzing a
massive household transactions data set. Comment: ECML PKDD 201
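The Poisson-multinomial reparameterization mentioned above rests on a standard fact: if a count is the sum of independent Poisson variables, then conditional on the total the individual parts are multinomial with probabilities proportional to the rates. The following is a minimal sketch of that allocation step only, not the paper's inference algorithm; the function name `allocate_count` and the example rates are hypothetical.

```python
import random
from collections import Counter


def allocate_count(y, rates, rng):
    """Split an observed count y across latent components.

    If y = y_1 + ... + y_R with independent y_r ~ Poisson(rates[r]),
    then conditional on y the vector (y_1, ..., y_R) is multinomial
    with probabilities rates[r] / sum(rates).
    """
    if y == 0:
        # Zero entries need no sampling, which is why the cost of such
        # updates can scale with the number of nonzeros.
        return [0] * len(rates)
    # Draw y component labels with probability proportional to the rates.
    draws = rng.choices(range(len(rates)), weights=rates, k=y)
    counts = Counter(draws)
    return [counts.get(r, 0) for r in range(len(rates))]


rng = random.Random(0)          # fixed seed, for reproducibility only
rates = [0.5, 1.5, 3.0]         # hypothetical per-component rates
latent = allocate_count(12, rates, rng)
zero_case = allocate_count(0, rates, rng)
```

The latent counts always sum to the observed count, so a sampler built on this step only has to visit nonzero entries.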
Fast and scalable non-parametric Bayesian inference for Poisson point processes
We study the problem of non-parametric Bayesian estimation of the intensity
function of a Poisson point process. The observations are independent
realisations of a Poisson point process on the interval . We propose two
related approaches. In both approaches we model the intensity function as
piecewise constant on bins forming a partition of the interval . In
the first approach the coefficients of the intensity function are assigned
independent gamma priors, leading to a closed form posterior distribution. On
the theoretical side, we prove that as the posterior
asymptotically concentrates around the "true", data-generating intensity
function at an optimal rate for -Hölder regular intensity functions (). In the second approach we employ a gamma Markov chain prior on the
coefficients of the intensity function. The posterior distribution is no longer
available in closed form, but inference can be performed using a
straightforward version of the Gibbs sampler. Both approaches scale well with
sample size, but the second is much less sensitive to the choice of .
Practical performance of our methods is first demonstrated via synthetic data
examples. We compare our second method with other existing approaches on the UK
coal mining disasters data. Furthermore, we apply it to the US mass shootings
data and Donald Trump's Twitter data. Comment: 45 pages, 22 figures
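The closed-form posterior of the first approach follows from gamma-Poisson conjugacy: with a piecewise-constant intensity and independent Gamma(shape, rate) priors on the bin heights, each height has a gamma posterior whose shape adds the bin's event count and whose rate adds the total exposure of that bin. The sketch below assumes equal-width bins; the names `t_max`, `n_bins`, `alpha`, `beta` and the toy data are illustrative assumptions, not taken from the paper.

```python
def bin_counts(realisations, t_max, n_bins):
    """Count events per bin, pooled over all realisations on [0, t_max]."""
    width = t_max / n_bins
    counts = [0] * n_bins
    for points in realisations:
        for t in points:
            # Clamp so that an event exactly at t_max lands in the last bin.
            k = min(int(t / width), n_bins - 1)
            counts[k] += 1
    return counts


def gamma_posterior(realisations, t_max, n_bins, alpha, beta):
    """Return (shape, rate) of the gamma posterior for each bin height.

    Prior: height_k ~ Gamma(shape=alpha, rate=beta), independently.
    Posterior: Gamma(alpha + count_k, beta + n * width), where n is the
    number of realisations and width is the bin length.
    """
    n = len(realisations)
    width = t_max / n_bins
    counts = bin_counts(realisations, t_max, n_bins)
    return [(alpha + c, beta + n * width) for c in counts]


# Two toy realisations on [0, 2] (made-up event times).
data = [[0.1, 0.4, 1.7], [0.2, 1.1, 1.9, 1.95]]
params = gamma_posterior(data, t_max=2.0, n_bins=2, alpha=1.0, beta=1.0)
means = [shape / rate for shape, rate in params]
```

Because each bin is updated independently from its own count, the cost is linear in the number of events plus the number of bins.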
Analysis of variance--why it is more important than ever
Analysis of variance (ANOVA) is an extremely important method in exploratory
and confirmatory data analysis. Unfortunately, in complex problems (e.g.,
split-plot designs), it is not always easy to set up an appropriate ANOVA. We
propose a hierarchical analysis that automatically gives the correct ANOVA
comparisons even in complex scenarios. The inferences for all means and
variances are performed under a model with a separate batch of effects for each
row of the ANOVA table. We connect to classical ANOVA by working with
finite-sample variance components: fixed and random effects models are
characterized by inferences about existing levels of a factor and new levels,
respectively. We also introduce a new graphical display showing inferences
about the standard deviations of each batch of effects. We illustrate with two
examples from our applied data analysis, first illustrating the usefulness of
our hierarchical computations and displays, and second showing how the ideas of
ANOVA are helpful in understanding a previously fit hierarchical model. Comment: This paper is discussed in: [math.ST/0508526], [math.ST/0508527],
[math.ST/0508528], [math.ST/0508529]. Rejoinder in [math.ST/0508530]
Statistical models in biogeography
We concentrate on the statistical methods used in biogeography for modelling the spatial distribution of bird species. Because it is difficult to specify a joint multivariate spatial covariance structure in environmental processes, we factor the joint distribution into a series of conditional models linked together in a hierarchical framework. The process is an unobservable map carrying the actual information about a bird species, and the data are observations connected to that process. Markov chain Monte Carlo (MCMC) simulation approaches are used for models involving multiple levels incorporating dependence structures. We use a Bayesian algorithm for drawing samples from the posterior distribution in order to obtain estimates of the parameters and reconstruct the true map based on data. We present different methods to overcome the problem of calculating the distribution of the Markov random field that is used in the MCMC algorithm. During the analysis it is desirable to delete some of the predictors from the model and use only a subset of covariates in the estimation procedure. We use the method of Kuo & Mallick (1998) (KM) for variable selection and combine it with multiple independent chains, which improves the mixing behaviour. In simulation studies we show that the pseudolikelihood performs better than other likelihood approximation methods, and that the KM method performs well with this type of data. We illustrate the methods with a complete analysis of the spatial distribution of two bird species (Sturnella magna and Anas rubripes) based on a real data set. We show the advantages of using the hidden structure and the spatial interaction parameter in the spatial hidden Markov model over simpler models, such as the ordinary logistic model or the autologistic model without observation errors.
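The pseudolikelihood mentioned above sidesteps the intractable normalising constant of a Markov random field by multiplying each site's conditional probability given its neighbours. A minimal sketch for a plain autologistic model on a binary presence/absence grid follows. It assumes four nearest neighbours, one intercept `alpha` and one interaction parameter `beta`, with no covariates and no observation-error layer, so it illustrates the general technique and not the authors' hidden Markov model.

```python
import math


def neighbour_sum(grid, i, j):
    """Sum of the (up to four) nearest-neighbour values of cell (i, j)."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < rows and 0 <= b < cols:
            total += grid[a][b]
    return total


def log_pseudolikelihood(grid, alpha, beta):
    """Log pseudolikelihood of a 0/1 grid under an autologistic model.

    Each cell satisfies logit P(x = 1 | neighbours) = alpha + beta * s,
    where s is the neighbour sum, so
    log P(x | neighbours) = x * eta - log(1 + exp(eta)).
    """
    total = 0.0
    for i, row in enumerate(grid):
        for j, x in enumerate(row):
            eta = alpha + beta * neighbour_sum(grid, i, j)
            total += x * eta - math.log1p(math.exp(eta))
    return total


# Toy 2 x 2 presence/absence map (made-up values).
toy = [[1, 0], [0, 1]]
baseline = log_pseudolikelihood(toy, alpha=0.0, beta=0.0)
```

With both parameters at zero every cell has conditional probability 1/2, so the value reduces to the number of cells times log(1/2), which is a convenient sanity check.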
Towards derandomising Markov chain Monte Carlo
We present a new framework to derandomise certain Markov chain Monte Carlo
(MCMC) algorithms.
As in MCMC, we first reduce counting problems to sampling from a sequence of
marginal distributions.
For the latter task,
we introduce a method called coupling towards the past that can, in
logarithmic time,
evaluate one or a constant number of variables from a stationary Markov chain
state.
Since there are at most logarithmically many random choices, this leads to very simple
derandomisation.
We provide two applications of this framework, namely efficient deterministic
approximate counting algorithms for hypergraph independent sets and hypergraph
colourings,
under local lemma type conditions matching, up to lower order factors, their
state-of-the-art randomised counterparts. Comment: 57 pages