An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering
An evolutionary algorithm (EA) is developed as an alternative to the EM
algorithm for parameter estimation in model-based clustering. This EA
facilitates a different search of the fitness landscape, i.e., the likelihood
surface, utilizing both crossover and mutation. Furthermore, this EA represents
an efficient approach to "hard" model-based clustering and so it can be viewed
as a sort of generalization of the k-means algorithm, which is itself
equivalent to a restricted Gaussian mixture model. The EA is illustrated on
several datasets, and its performance is compared to other hard clustering
approaches and to model-based clustering via the EM algorithm.
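
To make the idea concrete, here is a minimal Python sketch of an evolutionary
algorithm for hard clustering. It assumes the k-means special case mentioned
in the abstract (spherical components with a common variance, so the fitness
reduces to the negative within-cluster sum of squares); the operators,
fitness, and selection scheme are illustrative, not the paper's.

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(X, labels, G):
        # Negative within-cluster sum of squares: higher is better.
        total = 0.0
        for g in range(G):
            pts = X[labels == g]
            if len(pts) > 0:
                total -= ((pts - pts.mean(axis=0)) ** 2).sum()
        return total

    def crossover(a, b):
        # One-point crossover of two parents' label vectors.
        cut = rng.integers(1, len(a))
        return np.concatenate([a[:cut], b[cut:]])

    def mutate(labels, G, rate=0.02):
        # Reassign each point to a random component with small probability.
        child = labels.copy()
        flips = rng.random(len(child)) < rate
        child[flips] = rng.integers(0, G, int(flips.sum()))
        return child

    def ea_cluster(X, G, pop_size=40, generations=200):
        # Population of candidate hard partitions, i.e., label vectors.
        pop = [rng.integers(0, G, len(X)) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda z: fitness(X, z, G), reverse=True)
            elite = pop[: pop_size // 2]  # truncation selection
            pop = list(elite)
            while len(pop) < pop_size:
                i, j = rng.choice(len(elite), size=2, replace=False)
                pop.append(mutate(crossover(elite[i], elite[j]), G))
        return max(pop, key=lambda z: fitness(X, z, G))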
Mixtures of Variance-Gamma Distributions
A mixture of variance-gamma distributions is introduced and developed for
model-based clustering and classification. The latest in a growing line of
non-Gaussian mixture approaches to clustering and classification, the proposed
mixture of variance-gamma distributions is a special case of the recently
developed mixture of generalized hyperbolic distributions, and a restriction is
required to ensure identifiability. Our mixture of variance-gamma distributions
is perhaps the most useful such special case and, we will contend, may be more
useful than the mixture of generalized hyperbolic distributions in some cases.
In addition to being an alternative to the mixture of generalized hyperbolic
distributions, our mixture of variance-gamma distributions serves as an
alternative to the ubiquitous mixture of Gaussian distributions, which is a
special case, as well as several non-Gaussian approaches, some of which are
special cases. The mathematical development of our mixture of variance-gamma
distributions model relies on its relationship with the generalized inverse
Gaussian distribution; accordingly, the latter is reviewed before our mixture
of variance-gamma distributions is presented. Parameter estimation is carried
out within the expectation-maximization framework.
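
For context, the connection to the generalized inverse Gaussian (GIG)
distribution can be sketched via the standard normal variance-mean mixture
construction of the generalized hyperbolic distribution (generic notation, not
quoted from the paper):

    \mathbf{X} \mid W = w \sim \mathcal{N}(\boldsymbol{\mu} + w\boldsymbol{\beta},\; w\boldsymbol{\Sigma}),
    \qquad
    f_{\mathrm{GIG}}(w \mid \lambda, \chi, \psi) \propto w^{\lambda-1} \exp\left\{ -\tfrac{1}{2}\left( \chi/w + \psi w \right) \right\}.

The variance-gamma case arises in the limit \chi \to 0 with \lambda > 0, where
the GIG mixing distribution reduces to a gamma distribution; the
identifiability restriction mentioned above is typically imposed on the mixing
distribution (e.g., by fixing its mean).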
A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting
Mixture model-based clustering has become an increasingly popular data
analysis technique since its introduction over fifty years ago, and is now
commonly utilized within a family setting. Families of mixture models arise
when the component parameters, usually the component covariance (or scale)
matrices, are decomposed and a number of constraints are imposed. Within the
family setting, model selection involves choosing the member of the family,
i.e., the appropriate covariance structure, in addition to the number of
mixture components. To date, the Bayesian information criterion (BIC) has
proved most effective for model selection, and the expectation-maximization
(EM) algorithm is usually used for parameter estimation. In fact, this EM-BIC
rubric has virtually monopolized the literature on families of mixture models.
Deviating from this rubric, variational Bayes approximations are developed for
parameter estimation, and the deviance information criterion (DIC) is used for
model selection. The variational Bayes approach provides an alternative framework for
parameter estimation by constructing a tight lower bound on the complex
marginal likelihood and maximizing this lower bound by minimizing the
associated Kullback-Leibler divergence. This approach is taken on the most
commonly used family of Gaussian mixture models, and real and simulated data
are used to compare the new approach to the EM-BIC rubric.
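
The bound referred to here is the standard variational identity, and the DIC
takes its usual form (stated generically; the paper's exact variant may
differ):

    \log p(\mathbf{x}) = \underbrace{\mathbb{E}_{q}\!\left[\log p(\mathbf{x}, \boldsymbol{\theta})\right] - \mathbb{E}_{q}\!\left[\log q(\boldsymbol{\theta})\right]}_{\mathcal{L}(q)} + \mathrm{KL}\!\left( q(\boldsymbol{\theta}) \,\middle\|\, p(\boldsymbol{\theta} \mid \mathbf{x}) \right),

    \mathrm{DIC} = D(\bar{\boldsymbol{\theta}}) + 2p_D, \qquad D(\boldsymbol{\theta}) = -2\log p(\mathbf{x} \mid \boldsymbol{\theta}), \qquad p_D = \overline{D(\boldsymbol{\theta})} - D(\bar{\boldsymbol{\theta}}).

Because \mathrm{KL} \ge 0, maximizing the lower bound \mathcal{L}(q) over a
tractable family of densities q is equivalent to minimizing the KL divergence
to the true posterior.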
Modelling Receiver Operating Characteristic Curves Using Gaussian Mixtures
The receiver operating characteristic (ROC) curve is widely applied in measuring
the performance of diagnostic tests. Many direct and indirect approaches have
been proposed for modelling the ROC curve and, because of its tractability, the
Gaussian distribution has typically been used to model both populations. We
propose using a Gaussian mixture model, leading to a more flexible approach
that better accounts for atypical data. Because the resulting ROC curve lacks
a closed form, Monte Carlo simulation is used to evaluate it. We show that our
method performs favourably when compared to the crude binormal curve and to the
semi-parametric frequentist binormal ROC obtained via the well-known LABROC procedure.
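
A minimal Python sketch of the Monte Carlo step, under assumptions not taken
from the paper: the univariate mixture parameters below are purely
illustrative stand-ins for fitted values, and the ROC curve is traced by
thresholding simulated scores from the two populations.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_mixture(weights, means, sds, n):
        # Draw n scores from a univariate Gaussian mixture.
        comp = rng.choice(len(weights), size=n, p=weights)
        return rng.normal(np.asarray(means)[comp], np.asarray(sds)[comp])

    # Illustrative fitted mixtures for healthy (h) and diseased (d) scores.
    h = sample_mixture([0.7, 0.3], [0.0, 1.5], [1.0, 0.5], 100_000)
    d = sample_mixture([0.5, 0.5], [2.0, 3.5], [1.0, 0.8], 100_000)

    # Sweep thresholds over the pooled score range to trace the ROC curve.
    thresholds = np.quantile(np.concatenate([h, d]), np.linspace(0, 1, 201))
    fpr = [(h > t).mean() for t in thresholds]  # 1 - specificity
    tpr = [(d > t).mean() for t in thresholds]  # sensitivity

    # Area under the curve, estimated directly as P(D > H).
    auc = (d[rng.integers(0, len(d), 100_000)] >
           h[rng.integers(0, len(h), 100_000)]).mean()
    print(f"Monte Carlo AUC = {auc:.3f}")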
Mixture Model Averaging for Clustering
In mixture model-based clustering applications, it is common to fit several
models from a family and report clustering results from only the 'best' one. In
such circumstances, selection of this best model is achieved using a model
selection criterion, most often the Bayesian information criterion. Rather than
throw away all but the best model, we average multiple models that are in some
sense close to the best one, thereby producing a weighted average of clustering
results. Two (weighted) averaging approaches are considered: averaging the
component membership probabilities and averaging models. In both cases, Occam's
window is used to determine closeness to the best model and weights are
computed within a Bayesian model averaging paradigm. In some cases, we need to
merge components before averaging; we introduce a method for merging mixture
components based on the adjusted Rand index. The effectiveness of our
model-based clustering averaging approaches is illustrated using a family of
Gaussian mixture models on real and simulated data.
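
A minimal Python sketch of the weighting step, assuming the 'larger is better'
convention BIC = 2*loglik - npar*log(n) that is standard in this literature.
The component matching and ARI-based merging described above are omitted, so
the sketch assumes all retained models have the same number of aligned
components.

    import numpy as np

    def occams_window(bics, c=10.0):
        # Keep models whose BIC is within c of the best model's BIC
        # (a BIC gap of c corresponds to a posterior odds ratio of exp(c/2)).
        bics = np.asarray(bics, dtype=float)
        return np.flatnonzero(bics.max() - bics < c)

    def bma_weights(bics):
        # Approximate posterior model probabilities from BIC values.
        bics = np.asarray(bics, dtype=float)
        w = np.exp(0.5 * (bics - bics.max()))  # subtract max for stability
        return w / w.sum()

    def average_memberships(z_list, bics, c=10.0):
        # Weighted average of the n-by-G component-membership matrices.
        keep = occams_window(bics, c)
        w = bma_weights(np.asarray(bics)[keep])
        return sum(wi * z_list[i] for wi, i in zip(w, keep))

    # Usage: z_list holds membership matrices from each fitted model.
    # labels = average_memberships(z_list, bics).argmax(axis=1)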
Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model
The Gaussian cluster-weighted model (CWM) is a mixture of regression models
with random covariates that allows for flexible clustering of a random vector
composed of response variables and covariates. In each mixture component, it
adopts a Gaussian distribution for both the covariates and the responses given
the covariates. To robustify the approach against possible elliptical
heavy-tailed departures from normality due to the presence of atypical
observations, the contaminated Gaussian CWM is introduced here. In addition to
the parameters of the Gaussian CWM, each mixture component of our contaminated
CWM has a parameter controlling the proportion of outliers, one controlling the
proportion of leverage points, one specifying the degree of contamination with
respect to the response variables, and one specifying the degree of
contamination with respect to the covariates. Crucially, these parameters do
not have to be specified a priori, adding flexibility to our approach.
Furthermore, once the model is estimated and the observations are assigned to
the groups, a finer intra-group classification into typical points, outliers,
good leverage points, and bad leverage points (concepts of primary importance
in robust regression analysis) can be obtained directly. Relations with other
mixture-based contaminated models are analyzed, identifiability conditions are
provided, an expectation-conditional maximization algorithm is outlined for
parameter estimation, and various implementation and operational issues are
discussed. Properties of the estimators of the regression coefficients are
evaluated through Monte Carlo experiments and compared to the estimators from
the Gaussian CWM. A sensitivity study is also conducted based on a real data
set.
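
Schematically, in notation of my own choosing rather than the paper's, each
component of the contaminated Gaussian CWM couples a contaminated Gaussian for
the response given the covariates with a contaminated Gaussian for the
covariates:

    p(\mathbf{x}, y) = \sum_{g=1}^{G} \pi_g
        \left[ \alpha_g\, \phi\!\left(y \mid \mu(\mathbf{x}; \boldsymbol{\beta}_g), \sigma_g^2\right) + (1-\alpha_g)\, \phi\!\left(y \mid \mu(\mathbf{x}; \boldsymbol{\beta}_g), \eta_g \sigma_g^2\right) \right]
        \left[ \delta_g\, \phi(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) + (1-\delta_g)\, \phi(\mathbf{x}; \boldsymbol{\mu}_g, \rho_g \boldsymbol{\Sigma}_g) \right],

where 1-\alpha_g and 1-\delta_g are the proportions of outliers and of
leverage points, respectively, and the inflation factors \eta_g, \rho_g > 1
give the degrees of contamination with respect to the responses and the
covariates.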
Parsimonious Skew Mixture Models for Model-Based Clustering and Classification
In recent work, robust mixture modelling approaches using skewed
distributions have been explored to accommodate asymmetric data. We introduce
parsimony by developing skew-t and skew-normal analogues of the popular
Gaussian parsimonious clustering models (GPCM) family that employ an
eigenvalue decomposition of a positive-semidefinite
matrix. The methods developed in this paper are compared to existing models in
both an unsupervised and semi-supervised classification framework. Parameter
estimation is carried out using the expectation-maximization algorithm and
models are selected using the Bayesian information criterion. The efficacy of
these extensions is illustrated on simulated and benchmark clustering data
sets.
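
The decomposition referred to here is the standard one underlying the GPCM
family, applied to each component scale matrix:

    \boldsymbol{\Sigma}_g = \lambda_g \mathbf{D}_g \mathbf{A}_g \mathbf{D}_g^{\top},

where \lambda_g = |\boldsymbol{\Sigma}_g|^{1/p} controls the volume, the
orthogonal matrix of eigenvectors \mathbf{D}_g controls the orientation, and
the diagonal matrix \mathbf{A}_g, with |\mathbf{A}_g| = 1, controls the shape.
Constraining each of the three terms to be equal or unequal across components
(and \mathbf{D}_g or \mathbf{A}_g to the identity) generates the members of
the family.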
Variational Bayes Approximations for Clustering via Mixtures of Normal Inverse Gaussian Distributions
Parameter estimation for model-based clustering using a finite mixture of
normal inverse Gaussian (NIG) distributions is achieved through variational
Bayes approximations. Univariate NIG mixtures and multivariate NIG mixtures are
considered. The use of variational Bayes approximations here is a substantial
departure from the traditional EM approach and alleviates some of the
associated computational complexities and uncertainties. Our variational
algorithm is applied to simulated and real data. The paper concludes with a
discussion and suggestions for future work.
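
For reference, the NIG distribution is another normal variance-mean mixture
(in generic notation, not necessarily the paper's parameterization): it is the
GIG construction sketched earlier with \lambda = -1/2, so that the mixing
distribution is inverse-Gaussian,

    \mathbf{X} \mid W = w \sim \mathcal{N}(\boldsymbol{\mu} + w\boldsymbol{\beta},\; w\boldsymbol{\Sigma}), \qquad W \sim \text{Inverse-Gaussian}.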
Clustering Airbnb Reviews
Over the last decade, online customer reviews have exerted increasing
influence on consumers' decisions when booking accommodation online. The
renewed importance of word-of-mouth is reflected in the growing interest in
investigating consumers' experiences by analyzing their online reviews through
text mining and sentiment analysis. A clustering approach is
developed for Boston Airbnb reviews submitted in the English language and
collected from 2009 to 2016. This approach is based on a mixture of latent
variable models, which provides an appealing framework for handling clustered
binary data. We address here the problem of discovering meaningful segments of
consumers that are coherent from both the underlying topics and the sentiment
behind the reviews. A penalized mixture of latent traits approach is developed
to reduce the number of parameters and identify variables that are not
informative for clustering. The introduction of component-specific rate
parameters avoids the over-penalization that can occur when inferring a shared
rate parameter on clustered data. We divide the guests into four groups:
property-driven guests, host-driven guests, guests with a recent overall
negative stay, and guests with some negative experiences.
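
In schematic form (notation mine, not the paper's), the penalized objective
replaces a single shared rate \lambda with component-specific rates:

    \ell_{\mathrm{pen}}(\boldsymbol{\Theta}) = \ell(\boldsymbol{\Theta}) - \sum_{g=1}^{G} \lambda_g \sum_{j} \left\| \mathbf{w}_{gj} \right\|,

where \ell is the mixture-of-latent-traits log-likelihood and \mathbf{w}_{gj}
collects the slope (loading) parameters of variable j in component g;
variables whose slopes are shrunk to zero in every component are
non-informative for clustering.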
A LASSO-Penalized BIC for Mixture Model Selection
The efficacy of family-based approaches to mixture model-based clustering and
classification depends on the selection of parsimonious models. Current wisdom
suggests the Bayesian information criterion (BIC) for mixture model selection.
However, the BIC has well-known limitations, including a tendency to
overestimate the number of components as well as a proclivity for
underestimating, often drastically, the number of components in higher dimensions.
While the former problem might be soluble through merging components, the
latter is impossible to mitigate in clustering and classification applications.
In this paper, a LASSO-penalized BIC (LPBIC) is introduced to overcome this
problem. This approach is illustrated based on applications of extensions of
mixtures of factor analyzers, where the LPBIC is used to select both the number
of components and the number of latent factors. The LPBIC is shown to match or
outperform the BIC in several situations.
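
For orientation, with the 'larger is better' convention, the BIC of a model
with maximized log-likelihood \ell(\hat{\boldsymbol{\theta}}), \rho free
parameters, and n observations is

    \mathrm{BIC} = 2\ell(\hat{\boldsymbol{\theta}}) - \rho \log n,

and the LPBIC, schematically (this conveys the idea rather than the paper's
exact expression), replaces the log-likelihood by its LASSO-penalized
counterpart and counts only the nonzero estimates in the complexity term:

    \mathrm{LPBIC} = 2\ell(\hat{\boldsymbol{\theta}}) - 2n\lambda \sum_{j} \lvert \hat{\theta}_j \rvert - \rho_0 \log n,

with \lambda the LASSO tuning parameter and \rho_0 the number of nonzero
estimated parameters.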