    Mixtures of multivariate generalized linear models with overlapping clusters

    With the advent of ubiquitous monitoring and measurement protocols, studies have increasingly focused on complex, multivariate and heterogeneous datasets. In such studies, multivariate response variables are drawn from a heterogeneous population, often in the presence of additional covariate information. To deal with this intrinsic heterogeneity, regression analyses have to be clustered for different groups of units. Until now, mixture model approaches have assigned units to distinct, non-overlapping groups. However, these units often exhibit more complex organization and clustering. Our aim is to define a mixture of generalized linear models with overlapping clusters of units. This crucially involves an overlap function that maps the coefficients of the parent clusters into the coefficients of the multiple-allocation units. We present a computationally efficient MCMC scheme that samples the posterior distribution of the parameters in the model. An example on a two-mode network study shows details of the implementation in the case of a multivariate probit regression setting. A simulation study shows the overall performance of the method, whereas an illustration of voting behaviour on the US Supreme Court shows how the 9 justices split into two overlapping sets of justices. Comment: 24 pages, 3 figures

    Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

    Understanding the shopping motivations behind market baskets has high commercial value in the grocery retail industry. Analyzing shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while keeping outcomes interpretable. Latent Dirichlet Allocation (LDA) provides a suitable framework to process grocery transactions and to discover a broad representation of customers' shopping motivations. However, summarizing the posterior distribution of an LDA model is challenging, while individual LDA draws may not be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures, which may not adequately capture qualitative aspects such as the interpretability and stability of topics. In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarise the entire posterior distribution and identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, as well as associated measures of uncertainty. Furthermore, we establish a more holistic definition for model evaluation, which assesses topic models based not only on their likelihood but also on their coherence, distinctiveness and stability. By means of a survey, we set thresholds for the interpretation of topic coherence and topic similarity in the domain of grocery retail data. We demonstrate that selecting recurrent topics through our clustering methodology not only improves model likelihood but also improves on the qualitative aspects of LDA such as interpretability and stability. We illustrate our methods on an example from a large UK supermarket chain. Comment: 20 pages, 9 figures

    Adaptive MCMC with online relabeling

    When targeting a distribution that is artificially invariant under some permutations, Markov chain Monte Carlo (MCMC) algorithms face the label-switching problem, rendering marginal inference particularly cumbersome. Such a situation arises, for example, in the Bayesian analysis of finite mixture models. Adaptive MCMC algorithms such as adaptive Metropolis (AM), which self-calibrates its proposal distribution using an online estimate of the covariance matrix of the target, are no exception. To address the label-switching issue, relabeling algorithms associate a permutation with each MCMC sample, trying to obtain reasonable marginals. In the case of adaptive Metropolis (Bernoulli 7 (2001) 223-242), an online relabeling strategy is required. This paper is devoted to the AMOR algorithm, a provably consistent variant of AM that can cope with the label-switching problem. The idea is to nest relabeling steps within the MCMC algorithm based on the estimation of a single covariance matrix that is used both for adapting the covariance of the proposal distribution in the Metropolis algorithm step and for online relabeling. We compare the behavior of AMOR to similar relabeling methods. In the case of compactly supported target distributions, we prove a strong law of large numbers for AMOR and its ergodicity. To our knowledge, these are the first results on the consistency of an online relabeling algorithm. The proof underlines latent relations between relabeling and vector quantization. Comment: Published at http://dx.doi.org/10.3150/13-BEJ578 in the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm

    Repeated measures regression mixture models

    Regression mixture models are an increasingly utilized approach for developing theories about, and exploring, the heterogeneity of effects. In this study we aimed to extend the current use of regression mixtures to a repeated measures regression mixture method for cases where repeated measures data, such as diary-type and experience-sampling data, are available. We hypothesized that additional information borrowed from the repeated measures would improve model performance in terms of class enumeration and accuracy of the parameter estimates. We specifically compared three types of model specifications in regression mixtures: (a) a traditional single-outcome model; (b) repeated measures models with three, five, and seven measures; and (c) a single-outcome model with the average of seven repeated measures. The results showed that the repeated measures regression mixture models substantially outperformed the traditional and average single-outcome models in class enumeration, with less bias in the parameter estimates. For sample size, whereas prior recommendations have suggested that regression mixtures require samples of well over 1,000 participants, even for classes at a large distance from each other (classes with regression weights of .20 vs. .70), the present repeated measures regression mixture models allow for samples as low as 200 participants with an increased number (i.e., seven) of repeated measures. We also demonstrate an application of the proposed repeated measures approach using data from the Sleep Research Project. Implications and limitations of the study are discussed.

    label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs

    Label switching is a well-known and fundamental problem in Bayesian estimation of mixture or hidden Markov models. If the prior distribution of the model parameters is the same for all states, then both the likelihood and the posterior distribution are invariant to permutations of the parameters. This property makes Markov chain Monte Carlo (MCMC) samples simulated from the posterior distribution non-identifiable. In this paper, the label.switching package is introduced. It contains one probabilistic and seven deterministic relabelling algorithms for post-processing a given MCMC sample provided by the user. Each method returns a set of permutations that can be used to reorder the MCMC output. Then, any parametric function of interest can be inferred using the reordered MCMC sample. A set of user-defined permutations is also accepted, allowing the researcher to benchmark new relabelling methods against the available ones. Comment: Accepted to Journal of Statistical Software
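    The permutation invariance described in this abstract can be illustrated with a small sketch. It is written in Python rather than R and does not use or reproduce the label.switching package; the function names (`mixture_loglik`, `relabel_by_ordering`, `apply_permutation`) and the two hand-written "draws" are hypothetical, and the relabelling rule shown (sorting components by their means, an ordering constraint) is just one simple deterministic choice, assumed here for illustration.

```python
import math
import random


def mixture_loglik(data, weights, means, sd=1.0):
    """Log-likelihood of a Gaussian mixture with a shared standard deviation."""
    total = 0.0
    for x in data:
        density = sum(
            w * math.exp(-0.5 * ((x - m) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
            for w, m in zip(weights, means)
        )
        total += math.log(density)
    return total


def relabel_by_ordering(draws):
    """For each draw, return the permutation that sorts its component means."""
    return [
        sorted(range(len(means)), key=lambda k: means[k])
        for _weights, means in draws
    ]


def apply_permutation(draw, perm):
    """Reorder the weights and means of one draw according to perm."""
    weights, means = draw
    return [weights[k] for k in perm], [means[k] for k in perm]


# Toy data from two well-separated groups (illustrative values only).
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(50)]
data += [random.gauss(3.0, 1.0) for _ in range(50)]

# Two hypothetical draws describing the same mixture with swapped labels.
draws = [
    ([0.3, 0.7], [-2.0, 3.0]),
    ([0.7, 0.3], [3.0, -2.0]),
]

# The likelihood cannot tell the two labelings apart.
ll_a = mixture_loglik(data, *draws[0])
ll_b = mixture_loglik(data, *draws[1])

# One permutation per draw, then reorder the output with it.
perms = relabel_by_ordering(draws)
relabelled = [apply_permutation(d, p) for d, p in zip(draws, perms)]
```

    After relabelling, both draws carry the same component order, so component-wise summaries such as posterior means become meaningful. An ordering constraint is known to behave poorly when components overlap strongly, which is one reason several alternative relabelling algorithms exist.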

    Bayesian cylindrical data modeling using Abe-Ley mixtures

    This paper proposes a Metropolis-Hastings algorithm, based on Markov chain Monte Carlo sampling, to estimate the parameters of the Abe-Ley distribution, a recently proposed Weibull-Sine-Skewed-von Mises mixture model for bivariate circular-linear data. Current literature estimates the parameters of these mixture models using the expectation-maximization method, but we show that this exhibits a few shortcomings for the considered mixture model. First, standard expectation-maximization does not guarantee convergence to a global optimum, because the likelihood is multi-modal, which results from the high dimensionality of the mixture's likelihood. Second, given that expectation-maximization provides only point estimates of the parameters, the uncertainties of the estimates (e.g., confidence intervals) are not directly available in these methods. Hence, extra calculations are needed to quantify such uncertainty. We propose a Metropolis-Hastings based algorithm that avoids both shortcomings of expectation-maximization. Indeed, Metropolis-Hastings provides an approximation to the complete (posterior) distribution, given that it samples from the joint posterior of the mixture parameters. This facilitates direct inference (e.g., about uncertainty, multi-modality) from the estimation. In developing the algorithm, we tackle various challenges including convergence speed, label switching and selecting the optimum number of mixture components. We then (i) verify the effectiveness of the proposed algorithm on sample datasets with known true parameters, and further (ii) validate our methodology on an environmental dataset (a traditional application domain of Abe-Ley mixtures where measurements are a function of direction). Finally, we (iii) demonstrate the usefulness of our approach in an application domain where the circular measurement is periodic in time. (C) 2018 Elsevier Inc. All rights reserved.
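    As a rough illustration of why a sampler can expose multi-modality that a single point estimate hides, the following is a minimal random-walk Metropolis-Hastings sketch. It does not implement the Abe-Ley model or the paper's algorithm: the one-dimensional bimodal target `log_target`, the step size, and all names are assumptions chosen only to keep the example self-contained.

```python
import math
import random


def log_target(theta):
    """Unnormalised log-density of an equal mixture of N(-2, 1) and N(2, 1)."""
    a = -0.5 * (theta + 2.0) ** 2
    b = -0.5 * (theta - 2.0) ** 2
    m = max(a, b)  # log-sum-exp for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def metropolis_hastings(log_density, start, n_steps, step_sd, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    chain = [start]
    current, current_lp = start, log_density(start)
    accepted = 0
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        chain.append(current)
    return chain, accepted / n_steps


rng = random.Random(1)
chain, acceptance_rate = metropolis_hastings(
    log_target, start=0.0, n_steps=20000, step_sd=2.5, rng=rng
)
```

    Because the chain keeps every visited state, summaries such as intervals or a histogram showing both modes can be read directly from `chain`, whereas an optimiser started near one mode would report only that mode.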

    Relabeling and Summarizing Posterior Distributions in Signal Decomposition Problems when the Number of Components is Unknown

    This paper addresses the problems of relabeling and summarizing posterior distributions that typically arise, in a Bayesian framework, when dealing with signal decomposition problems with an unknown number of components. Such posterior distributions are defined over a union of subspaces of differing dimensionality and can be sampled from using modern Monte Carlo techniques, for instance the increasingly popular RJ-MCMC method. No generic approach is available, however, to summarize the resulting variable-dimensional samples and extract component-specific parameters from them. We propose a novel approach to this problem, named Variable-dimensional Approximate Posterior for Relabeling and Summarizing (VAPoRS), which consists in approximating the posterior distribution of interest by a "simple" (but still variable-dimensional) parametric distribution. The distance between the two distributions is measured using the Kullback-Leibler divergence, and a Stochastic EM-type algorithm, driven by the RJ-MCMC sampler, is proposed to estimate the parameters. Two signal decomposition problems are considered to show the capability of VAPoRS both for relabeling and for summarizing variable-dimensional posterior distributions: the classical problem of detecting and estimating sinusoids in white Gaussian noise on the one hand, and a particle counting problem motivated by the Pierre Auger project in astrophysics on the other hand.

    Modeling predictors of latent classes in regression mixture models

    The purpose of this study is to provide guidance on a process for including latent class predictors in regression mixture models. We first examine the performance of current practice for using the 1-step and 3-step approaches where the direct covariate effect on the outcome is omitted. Neither approach yields adequate estimates of model parameters. Given that Step 1 of the 3-step approach shows adequate results in class enumeration, we suggest using an alternative approach: (a) decide the number of latent classes without predictors of latent classes, and (b) bring the latent class predictors into the model with the inclusion of hypothesized direct covariate effects. Our simulations show that this approach leads to good estimates for all model parameters. The proposed approach is demonstrated by using empirical data to examine the differential effects of family resources on students' academic achievement outcomes. Implications of the study are discussed.