Mixtures of multivariate generalized linear models with overlapping clusters
With the advent of ubiquitous monitoring and measurement protocols, studies
have started to focus more and more on complex, multivariate and heterogeneous
datasets. In such studies, multivariate response variables are drawn from a
heterogeneous population often in the presence of additional covariate
information. In order to deal with this intrinsic heterogeneity, regression
analyses have to be clustered for different groups of units. Until now,
mixture model approaches have assigned units to distinct, non-overlapping
groups. However, these units often exhibit a more complex organization and
clustering. Our aim is to define a mixture of generalized linear models with
overlapping clusters of units. Crucially, this involves an overlap function
that maps the coefficients of the parent clusters into the coefficients of
the multiple-allocation units. We present a computationally efficient MCMC
scheme that samples the posterior distribution of the parameters in the model.
An example on a two-mode network study shows details of the implementation in
the case of a multivariate probit regression setting. A simulation study shows
the overall performance of the method, while an illustration of voting
behaviour on the US Supreme Court shows how the nine justices split into two
overlapping sets of justices. Comment: 24 pages, 3 figures.
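The abstract leaves the form of the overlap function general. As a purely illustrative sketch (the averaging rule and all names below are assumptions, not the paper's actual specification), an overlap function might map the coefficient vectors of a unit's parent clusters to the unit's own coefficients like this:

```python
# Hypothetical sketch of an "overlap function" for multiple-allocation units:
# a unit belonging to several parent clusters gets coefficients derived from
# the parents'. Averaging is one simple choice, used here only to illustrate
# the idea; the paper's actual overlap function may differ.

def overlap_average(parent_coefs):
    """Map the coefficient vectors of the parent clusters to the
    coefficient vector of a multiple-allocation unit by averaging."""
    k = len(parent_coefs)     # number of parent clusters the unit belongs to
    p = len(parent_coefs[0])  # dimension of each coefficient vector
    return [sum(beta[j] for beta in parent_coefs) / k for j in range(p)]

# A unit allocated to two overlapping clusters with coefficients beta_1, beta_2:
beta_1 = [0.5, -1.0, 2.0]
beta_2 = [1.5, 1.0, 0.0]
print(overlap_average([beta_1, beta_2]))  # -> [1.0, 0.0, 1.0]
```

Under such a rule, a unit in the overlap of two clusters behaves as a compromise between them, which is what makes the mixture identifiable beyond standard non-overlapping allocations.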
Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability
Understanding the shopping motivations behind market baskets has high
commercial value in the grocery retail industry. Analyzing shopping
transactions demands techniques that can cope with the volume and
dimensionality of grocery transactional data while keeping interpretable
outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to
process grocery transactions and to discover a broad representation of
customers' shopping motivations. However, summarizing the posterior
distribution of an LDA model is challenging; individual LDA draws may lack
coherence and cannot capture topic uncertainty. Moreover, the evaluation of
LDA models is dominated by model-fit measures, which may not adequately
capture qualitative aspects such as the interpretability and stability of
topics.
In this paper, we introduce a clustering methodology that post-processes
posterior LDA draws to summarise the entire posterior distribution and identify
semantic modes represented as recurrent topics. Our approach is an alternative
to standard label-switching techniques and provides a single posterior summary
set of topics, as well as associated measures of uncertainty. Furthermore, we
establish a more holistic definition for model evaluation, which assesses topic
models based not only on their likelihood but also on their coherence,
distinctiveness and stability. By means of a survey, we set thresholds for the
interpretation of topic coherence and topic similarity in the domain of grocery
retail data. We demonstrate that selecting recurrent topics through our
clustering methodology not only improves model likelihood but also enhances
the qualitative aspects of LDA, such as interpretability and stability. We
illustrate our methods on an example from a large UK supermarket chain.
Comment: 20 pages, 9 figures.
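The core idea of pooling topic draws and grouping those that recur across posterior samples can be sketched as follows. This is a toy illustration, not the paper's method: the greedy clustering, the cosine criterion, and the threshold are all assumptions made for brevity.

```python
# Toy sketch: pool topic-word vectors from several LDA posterior draws and
# greedily group those whose cosine similarity exceeds a threshold, so that
# topics recurring across draws form one "semantic mode". The paper's actual
# clustering methodology is more elaborate; everything here is illustrative.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recurrent_topics(draws, threshold=0.9):
    """draws: list of posterior draws, each a list of topic-word vectors.
    Returns clusters of (draw_index, topic_index) pairs."""
    clusters = []  # each cluster: (representative vector, member list)
    for d, topics in enumerate(draws):
        for t, vec in enumerate(topics):
            for rep, members in clusters:
                if cosine(vec, rep) >= threshold:
                    members.append((d, t))
                    break
            else:
                clusters.append((vec, [(d, t)]))
    return [members for _, members in clusters]

# Two draws containing the same two topics in swapped order (label switching):
draw_a = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
draw_b = [[0.0, 0.25, 0.75], [0.85, 0.15, 0.0]]
print(recurrent_topics([draw_a, draw_b]))  # -> [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]
```

Note how the clusters recover the two semantic modes despite the labels being swapped between draws, which is exactly why such post-processing serves as an alternative to standard label-switching techniques.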
Adaptive MCMC with online relabeling
When targeting a distribution that is artificially invariant under some
permutations, Markov chain Monte Carlo (MCMC) algorithms face the
label-switching problem, rendering marginal inference particularly cumbersome.
Such a situation arises, for example, in the Bayesian analysis of finite
mixture models. Adaptive MCMC algorithms such as adaptive Metropolis (AM),
which self-calibrates its proposal distribution using an online estimate of the
covariance matrix of the target, are no exception. To address the
label-switching issue, relabeling algorithms associate a permutation to each
MCMC sample, trying to obtain reasonable marginals. In the case of adaptive
Metropolis (Bernoulli 7 (2001) 223-242), an online relabeling strategy is
required. This paper is devoted to the AMOR algorithm, a provably consistent
variant of AM that can cope with the label-switching problem. The idea is to
nest relabeling steps within the MCMC algorithm based on the estimation of a
single covariance matrix that is used both for adapting the covariance of the
proposal distribution in the Metropolis algorithm step and for online
relabeling. We compare the behavior of AMOR to similar relabeling methods. In
the case of compactly supported target distributions, we prove a strong law of
large numbers for AMOR, as well as its ergodicity. To our knowledge, these
are the first results on the consistency of an online relabeling algorithm.
The proof underlines latent relations between relabeling and vector
quantization. Comment: Published at http://dx.doi.org/10.3150/13-BEJ578 in
Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
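The online-relabeling idea can be illustrated with a toy sketch: at each iteration, permute the component labels of the new sample so it is closest to the running mean of previously relabeled samples. This is not AMOR itself (AMOR uses the same adapted covariance matrix for both the proposal and the relabeling criterion; this sketch uses a plain Euclidean criterion for brevity):

```python
# Toy online relabeling: greedily choose, for each incoming MCMC sample, the
# label permutation minimizing the squared distance to the running mean of the
# already-relabeled chain. Illustrative only; AMOR's criterion is
# covariance-weighted and provably consistent.
from itertools import permutations

def relabel_online(samples, dim=1):
    """samples: list of flat parameter vectors, one block of length `dim`
    per mixture component. Returns the relabeled samples."""
    relabeled = []
    mean = None
    for x in samples:
        k = len(x) // dim
        blocks = [x[i * dim:(i + 1) * dim] for i in range(k)]
        if mean is None:
            best = x  # first sample fixes the reference labeling
        else:
            best = min(
                ([v for j in perm for v in blocks[j]] for perm in permutations(range(k))),
                key=lambda y: sum((a - b) ** 2 for a, b in zip(y, mean)),
            )
        relabeled.append(best)
        n = len(relabeled)
        mean = [sum(s[i] for s in relabeled) / n for i in range(len(x))]
    return relabeled

# Two-component chain whose labels switch at the second draw:
chain = [[0.0, 5.0], [5.1, 0.2], [0.1, 4.9]]
print(relabel_online(chain))  # -> [[0.0, 5.0], [0.2, 5.1], [0.1, 4.9]]
```

After relabeling, the per-component marginals (here, values near 0 and near 5) become meaningful, which is the point of fighting label switching.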
Repeated measures regression mixture models
Regression mixture models are an increasingly utilized approach for developing theories about, and exploring the heterogeneity of, effects. In this study we aimed to extend the current use of regression mixtures to a repeated-measures regression mixture method for settings where repeated measures, such as diary-type and experience-sampling data, are available. We hypothesized that the additional information borrowed from the repeated measures would improve model performance in terms of class enumeration and accuracy of the parameter estimates. We specifically compared three types of model specification in regression mixtures: (a) a traditional single-outcome model; (b) repeated-measures models with three, five, and seven measures; and (c) a single-outcome model with the average of seven repeated measures. The results showed that the repeated-measures regression mixture models substantially outperformed the traditional and averaged single-outcome models in class enumeration, with less bias in the parameter estimates. Regarding sample size, whereas prior recommendations have suggested that regression mixtures require samples of well over 1,000 participants, even for classes at a large distance from each other (classes with regression weights of .20 vs. .70), the present repeated-measures regression mixture models allow for samples as small as 200 participants with an increased number (i.e., seven) of repeated measures. We also demonstrate an application of the proposed repeated-measures approach using data from the Sleep Research Project. Implications and limitations of the study are discussed.
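To make the model family concrete, here is a bare-bones EM sketch for a two-class regression mixture (differential slopes of x on y across latent classes). This is an illustration of the general technique, not the study's repeated-measures method: no intercepts, a known and equal residual standard deviation, and the starting values are all simplifying assumptions.

```python
# Minimal EM for a two-class regression mixture: y = b_c * x + noise, with
# latent class c. Assumptions for brevity: no intercepts, known residual
# standard deviation `sigma`, fixed starting slopes. Illustrative only.
import math

def em_regression_mixture(x, y, iters=50, sigma=0.1):
    b = [0.0, 1.0]   # starting slopes of the two latent classes (assumed)
    pi = [0.5, 0.5]  # class proportions
    for _ in range(iters):
        # E-step: responsibility of each class for each observation
        resp = []
        for xi, yi in zip(x, y):
            dens = [pi[c] * math.exp(-((yi - b[c] * xi) ** 2) / (2 * sigma ** 2))
                    for c in (0, 1)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: weighted least-squares slope and updated proportions
        for c in (0, 1):
            w = [r[c] for r in resp]
            b[c] = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
                    / sum(wi * xi * xi for wi, xi in zip(w, x)))
            pi[c] = sum(w) / len(x)
    return b, pi

# Six observations: one class follows slope 0.2, the other slope 0.7
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [0.2, 0.4, 0.6, 0.7, 1.4, 2.1]
b, pi = em_regression_mixture(x, y)
print(sorted(round(v, 2) for v in b))  # -> [0.2, 0.7]
```

The recovered slopes (.20 vs. .70) mirror the class separation used in the study's simulations; with noisier data and more classes, class enumeration and estimation become far harder, which is the difficulty the repeated-measures extension targets.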
label.switching: An R Package for Dealing with the Label Switching Problem in MCMC Outputs
Label switching is a well-known and fundamental problem in Bayesian
estimation of mixture or hidden Markov models. When the prior
distribution of the model parameters is the same for all states, both the
likelihood and posterior distribution are invariant to permutations of the
parameters. This property makes Markov chain Monte Carlo (MCMC) samples
simulated from the posterior distribution non-identifiable. In this paper, the
label.switching package is introduced. It contains one probabilistic and
seven deterministic relabelling algorithms in order to post-process a given
MCMC sample, provided by the user. Each method returns a set of permutations
that can be used to reorder the MCMC output. Then, any parametric function of
interest can be inferred using the reordered MCMC sample. A set of user-defined
permutations is also accepted, allowing the researcher to benchmark new
relabelling methods against the available ones. Comment: Accepted to the
Journal of Statistical Software.
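The core post-processing operation the package performs, applying one permutation per MCMC iteration so that each label refers to the same component throughout the chain, is easy to illustrate. The package itself is written in R; this is a minimal Python rendering of the idea, with invented example values:

```python
# Given one permutation per MCMC iteration (as returned by a relabelling
# algorithm), reorder the component-specific parameters of each draw so that
# label j means the same component across the whole chain.

def apply_permutations(mcmc, perms):
    """mcmc: list of iterations, each a list of K component parameters.
    perms: one permutation of range(K) per iteration."""
    return [[draw[p] for p in perm] for draw, perm in zip(mcmc, perms)]

# Three draws of K = 2 component means; labels switched at iteration 2:
draws = [[0.1, 3.0], [2.9, 0.0], [0.2, 3.1]]
perms = [[0, 1], [1, 0], [0, 1]]  # e.g. produced by a relabelling method
print(apply_permutations(draws, perms))  # -> [[0.1, 3.0], [0.0, 2.9], [0.2, 3.1]]
```

Once reordered, any parametric function of interest (component means, ergodic averages, credible intervals) can be computed from the chain as usual.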
Bayesian cylindrical data modeling using Abe-Ley mixtures
This paper proposes a Metropolis-Hastings algorithm, based on Markov chain Monte Carlo sampling, to estimate the parameters of the Abe-Ley distribution, a recently proposed Weibull-sine-skewed-von Mises mixture model for bivariate circular-linear data. The current literature estimates the parameters of these mixture models using the expectation-maximization method, but we show that this exhibits two shortcomings for the considered mixture model. First, standard expectation-maximization does not guarantee convergence to a global optimum, because the likelihood is multi-modal, a consequence of the high dimensionality of the mixture's likelihood. Second, given that expectation-maximization provides only point estimates of the parameters, the uncertainties of the estimates (e.g., confidence intervals) are not directly available, and extra calculations are needed to quantify them. We propose a Metropolis-Hastings-based algorithm that avoids both shortcomings. Indeed, Metropolis-Hastings provides an approximation to the complete (posterior) distribution, since it samples from the joint posterior of the mixture parameters. This facilitates direct inference (e.g., about uncertainty or multi-modality) from the estimation. In developing the algorithm, we tackle various challenges, including convergence speed, label switching and selecting the optimal number of mixture components. We then (i) verify the effectiveness of the proposed algorithm on sample datasets with known true parameters, (ii) validate our methodology on an environmental dataset (a traditional application domain of Abe-Ley mixtures, where measurements are a function of direction), and finally (iii) demonstrate the usefulness of our approach in an application domain where the circular measurement is periodic in time.
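A generic random-walk Metropolis-Hastings sampler, the kind of sampler this approach builds on, is sketched below. The paper's actual target is the joint posterior of the Abe-Ley mixture parameters; here a standard normal stands in as the target so the mechanics are visible.

```python
# Generic random-walk Metropolis-Hastings: propose a Gaussian step, accept
# with probability min(1, target(prop)/target(x)) computed on the log scale.
# The stand-in target (standard normal) is an assumption for illustration.
import math, random

def metropolis_hastings(log_target, x0=0.0, n=20000, step=1.0, seed=1):
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n):
        prop = x + rng.gauss(0.0, step)  # symmetric proposal
        if math.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop                      # accept the move
        out.append(x)
    return out

# Target: standard normal, via its log-density up to an additive constant
chain = metropolis_hastings(lambda v: -0.5 * v * v)
mean = sum(chain) / len(chain)
print(mean)  # sample mean should be near the true mean 0
```

Because the sampler returns draws from the (approximate) posterior rather than a single point, uncertainty summaries such as credible intervals come directly from the chain, which is the advantage over expectation-maximization stressed in the abstract.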
Relabeling and Summarizing Posterior Distributions in Signal Decomposition Problems when the Number of Components is Unknown
This paper addresses the problems of relabeling and summarizing posterior distributions that typically arise, in a Bayesian framework, when dealing with signal decomposition problems with an unknown number of components. Such posterior distributions are defined over unions of subspaces of differing dimensionality and can be sampled from using modern Monte Carlo techniques, for instance the increasingly popular RJ-MCMC method. No generic approach is available, however, to summarize the resulting variable-dimensional samples and extract component-specific parameters from them. We propose a novel approach to this problem, named Variable-dimensional Approximate Posterior for Relabeling and Summarizing (VAPoRS), which consists in approximating the posterior distribution of interest by a "simple" (but still variable-dimensional) parametric distribution. The distance between the two distributions is measured using the Kullback-Leibler divergence, and a stochastic EM-type algorithm, driven by the RJ-MCMC sampler, is proposed to estimate the parameters. Two signal decomposition problems are considered to show the capability of VAPoRS both for relabeling and for summarizing variable-dimensional posterior distributions: the classical problem of detecting and estimating sinusoids in white Gaussian noise on the one hand, and a particle-counting problem motivated by the Pierre Auger project in astrophysics on the other hand.
Modeling predictors of latent classes in regression mixture models
The purpose of this study is to provide guidance on a process for including latent class predictors in regression mixture models. We first examine the performance of current practice with the 1-step and 3-step approaches, in which the direct covariate effect on the outcome is omitted. Neither approach yields adequate estimates of the model parameters. Given that Step 1 of the 3-step approach shows adequate results in class enumeration, we suggest an alternative approach: (a) decide the number of latent classes without predictors of latent classes, and (b) bring the latent class predictors into the model together with the hypothesized direct covariate effects. Our simulations show that this approach leads to good estimates of all model parameters. The proposed approach is demonstrated using empirical data to examine the differential effects of family resources on students' academic achievement outcomes. Implications of the study are discussed.