
    Scalable Bayesian Non-Negative Tensor Factorization for Massive Count Data

    We present a Bayesian non-negative tensor factorization model for count-valued tensor data, and develop scalable inference algorithms (both batch and online) for dealing with massive tensors. Our generative model can handle overdispersed counts as well as infer the rank of the decomposition. Moreover, leveraging a reparameterization of the Poisson distribution as a multinomial facilitates conjugacy in the model and enables simple and efficient Gibbs sampling and variational Bayes (VB) inference updates, with a computational cost that depends only on the number of nonzeros in the tensor. The model also yields interpretable factors: in our model, each factor corresponds to a "topic". We develop a set of online inference algorithms that allow further scaling up the model to massive tensors, for which batch inference methods may be infeasible. We apply our framework on diverse real-world applications, such as \emph{multiway} topic modeling on a scientific publications database, analyzing a political science data set, and analyzing a massive household transactions data set. Comment: ECML PKDD 201
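    The conjugacy mentioned above rests on a standard augmentation identity for Poisson counts (stated here in generic notation rather than the paper's): a count whose rate is a sum of non-negative parts can be generated as a sum of independent Poisson counts, one per part, and conditional on the total these latent counts are multinomial,

        x \sim \mathrm{Poisson}\Big(\sum_{k=1}^{K}\lambda_k\Big)
        \quad\Longleftrightarrow\quad
        x = \sum_{k=1}^{K} x_k,\ \ x_k \sim \mathrm{Poisson}(\lambda_k)\ \text{independently},

        (x_1,\dots,x_K)\mid x \;\sim\; \mathrm{Multinomial}\Big(x,\ \big(\lambda_1/\textstyle\sum_k\lambda_k,\ \dots,\ \lambda_K/\textstyle\sum_k\lambda_k\big)\Big).

    Assuming, as is typical for such models, conjugate gamma priors on the factor entries, the resulting complete conditionals remain in the gamma family, which is what makes the Gibbs and VB updates simple and lets their cost scale with the number of nonzero entries of the tensor.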

    Fast and scalable non-parametric Bayesian inference for Poisson point processes

    We study the problem of non-parametric Bayesian estimation of the intensity function of a Poisson point process. The observations are n independent realisations of a Poisson point process on the interval [0,T]. We propose two related approaches. In both approaches we model the intensity function as piecewise constant on N bins forming a partition of the interval [0,T]. In the first approach the coefficients of the intensity function are assigned independent gamma priors, leading to a closed form posterior distribution. On the theoretical side, we prove that as $n\rightarrow\infty$, the posterior asymptotically concentrates around the "true", data-generating intensity function at an optimal rate for $h$-H\"older regular intensity functions ($0 < h\leq 1$). In the second approach we employ a gamma Markov chain prior on the coefficients of the intensity function. The posterior distribution is no longer available in closed form, but inference can be performed using a straightforward version of the Gibbs sampler. Both approaches scale well with sample size, but the second is much less sensitive to the choice of N. Practical performance of our methods is first demonstrated via synthetic data examples. We compare our second method with other existing approaches on the UK coal mining disasters data. Furthermore, we apply it to the US mass shootings data and Donald Trump's Twitter data. Comment: 45 pages, 22 figures
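    To make the first (independent gamma prior) approach concrete, here is the conjugate update in generic notation, where $\alpha,\beta$ are hypothetical hyperparameters and $\Delta = T/N$ is the bin width: with $c_j$ denoting the total number of points falling in bin $j$ across all $n$ realisations, that count is Poisson with mean $n\lambda_j\Delta$, so a $\lambda_j \sim \mathrm{Gamma}(\alpha,\beta)$ prior gives

        \lambda_j \mid \text{data} \;\sim\; \mathrm{Gamma}\big(\alpha + c_j,\ \beta + n\Delta\big),

    i.e. the posterior is available in closed form, bin by bin.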

    Analysis of variance--why it is more important than ever

    Analysis of variance (ANOVA) is an extremely important method in exploratory and confirmatory data analysis. Unfortunately, in complex problems (e.g., split-plot designs), it is not always easy to set up an appropriate ANOVA. We propose a hierarchical analysis that automatically gives the correct ANOVA comparisons even in complex scenarios. The inferences for all means and variances are performed under a model with a separate batch of effects for each row of the ANOVA table. We connect to classical ANOVA by working with finite-sample variance components: fixed and random effects models are characterized by inferences about existing levels of a factor and new levels, respectively. We also introduce a new graphical display showing inferences about the standard deviations of each batch of effects. We illustrate with two examples from our applied data analysis, first illustrating the usefulness of our hierarchical computations and displays, and second showing how the ideas of ANOVA are helpful in understanding a previously fit hierarchical model. Comment: This paper discussed in: [math.ST/0508526], [math.ST/0508527], [math.ST/0508528], [math.ST/0508529]. Rejoinder in [math.ST/0508530]
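    As a schematic illustration of the "separate batch of effects per row" idea (a two-way layout only, in our notation rather than the paper's), the model can be written as

        y_i = \mu + \alpha^{(1)}_{j[i]} + \alpha^{(2)}_{k[i]} + \epsilon_i,
        \qquad \alpha^{(m)}_j \sim \mathrm{N}(0,\sigma_m^2),\quad \epsilon_i \sim \mathrm{N}(0,\sigma_\epsilon^2),

        s_m \;=\; \sqrt{\frac{1}{J_m-1}\sum_{j=1}^{J_m}\big(\alpha^{(m)}_j-\bar\alpha^{(m)}\big)^2},

    where each batch m of effects supplies one row of the table and $s_m$ is its finite-sample standard deviation. Posterior inference about the $s_m$ is what the graphical display summarises: inference about the existing $J_m$ levels corresponds to the classical fixed-effects view, while inference about new draws from $\mathrm{N}(0,\sigma_m^2)$ corresponds to the random-effects view.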

    Statistical models in biogeography

    We concentrate on the statistical methods used in Biogeography for modelling the spatial distribution of bird species. Due to the difficulty of specifying a joint multivariate spatial covariance structure in environmental processes, we factor such a joint distribution into a series of conditional models linked together in a hierarchical framework. We have a process that corresponds to an unobservable map with the actual information about a bird species, and the data correspond to the observations that are connected to that process. Markov chain Monte Carlo (MCMC) simulation approaches are used for models involving multiple levels incorporating dependence structures. We use a Bayesian algorithm for drawing samples from the posterior distribution in order to obtain estimates of the parameters and reconstruct the true map based on data. We present different methods to overcome the problem of calculating the distribution of the Markov random field that is used in the MCMC algorithm. During the analysis it is desirable to delete some of the predictors from the model and only use a subset of covariates in the estimation procedure. We use the method by Kuo & Mallick (1998) (KM) for variable selection and combine it with multiple independent chains, which successfully improves the mixing behaviour. In simulation studies we show that the pseudolikelihood outperforms other likelihood approximation methods, and that the KM method performs well with this type of data. We illustrate the application of the methods with the complete analysis of the spatial distribution of two bird species (Sturnella magna and Anas rubripes) based on a real data set. We show the advantages of using the hidden structure and the spatial interaction parameter in the spatial hidden Markov model over other simpler models, like the ordinary logistic model or the autologistic model without observation errors, as sketched below.
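    As a concrete illustration of the pseudolikelihood approximation favoured in the simulation studies, the following is a minimal numpy sketch for a binary autologistic field on a regular grid with four nearest neighbours; the function name, the grid layout, and the precomputed covariate term are illustrative assumptions, not details taken from the thesis.

        import numpy as np

        def neg_log_pseudolikelihood(field, covariate_effect, eta):
            # field            : (H, W) array of 0/1 presence indicators
            # covariate_effect : (H, W) array with the regression term beta'z_s at each site
            # eta              : spatial interaction parameter
            padded = np.pad(field, 1)                      # zero padding at the boundary
            neighbour_sum = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                             padded[1:-1, :-2] + padded[1:-1, 2:])
            logits = covariate_effect + eta * neighbour_sum
            # Bernoulli log likelihood of each site given its neighbours, summed over sites
            loglik = field * logits - np.log1p(np.exp(logits))
            return -loglik.sum()

    The pseudolikelihood replaces the intractable joint normalising constant of the Markov random field with this product of full conditionals, which is what makes it usable inside the MCMC algorithm.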

    Towards derandomising Markov chain Monte Carlo

    We present a new framework to derandomise certain Markov chain Monte Carlo (MCMC) algorithms. As in MCMC, we first reduce counting problems to sampling from a sequence of marginal distributions. For the latter task, we introduce a method called coupling towards the past that can, in logarithmic time, evaluate one or a constant number of variables from a stationary Markov chain state. Since there are at most logarithmic random choices, this leads to very simple derandomisation. We provide two applications of this framework, namely efficient deterministic approximate counting algorithms for hypergraph independent sets and hypergraph colourings, under local lemma type conditions matching, up to lower order factors, their state-of-the-art randomised counterparts. Comment: 57 pages
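    The final derandomisation step is easy to state once a marginal can be evaluated with only logarithmically many random bits. The sketch below (its names and interface are ours, and it does not implement coupling towards the past itself) simply enumerates every bit string and averages, which takes only polynomially many terms whenever the number of bits is O(log n).

        from itertools import product

        def derandomised_marginal(evaluate_with_bits, num_bits):
            # evaluate_with_bits : maps a tuple of 0/1 random bits to a value in [0, 1],
            #                      e.g. an indicator of a variable's state produced by a
            #                      coupling-towards-the-past style evaluation
            # num_bits           : number of uniform random bits the evaluator consumes;
            #                      O(log n) bits keep the enumeration polynomial
            total = 0.0
            for bits in product((0, 1), repeat=num_bits):
                total += evaluate_with_bits(bits)
            return total / (2 ** num_bits)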