Mixtures of Shifted Asymmetric Laplace Distributions
A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the relationship with the generalized inverse Gaussian distribution. This approach is mathematically elegant and relatively straightforward computationally. Our novel mixture modelling approach is demonstrated on both simulated and real data to illustrate clustering and classification applications. In these analyses, our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the popular Gaussian approach. This work, which marks an important step toward non-Gaussian model-based clustering and classification, concludes with a discussion and suggestions for future work.
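The abstract's EM variant is multivariate and exploits the generalized inverse Gaussian representation of the shifted asymmetric Laplace (SAL) distribution; that machinery has no standard Python implementation. As a loose, univariate illustration of fitting such a mixture, here is a hard-assignment (CEM-style) sketch built on SciPy's asymmetric Laplace density; the function name and the maximum-likelihood refit in the M-step are our own simplifications, not the authors' algorithm.

```python
import numpy as np
from scipy.stats import laplace_asymmetric

def cem_sal_mixture(x, k=2, n_iter=25, seed=0):
    """Hard-assignment (CEM-style) fit of a univariate mixture of
    asymmetric Laplace components. Illustration only: the paper's EM
    is multivariate and uses the generalized inverse Gaussian link
    rather than per-cluster maximum-likelihood refits."""
    rng = np.random.default_rng(seed)
    # initialize each component on a random subsample of the data
    params = [laplace_asymmetric.fit(rng.choice(x, size=max(len(x) // k, 10)))
              for _ in range(k)]
    weights = np.full(k, 1.0 / k)
    z = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # E-step: log component density plus log weight, then hard assignment
        logp = np.stack([laplace_asymmetric.logpdf(x, *p) + np.log(w)
                         for p, w in zip(params, weights)])
        z = logp.argmax(axis=0)
        # M-step: per-component maximum-likelihood refit (kappa, loc, scale)
        for j in range(k):
            if (z == j).sum() > 5:  # skip near-empty components
                params[j] = laplace_asymmetric.fit(x[z == j])
        weights = np.bincount(z, minlength=k) / len(x)
        weights = np.clip(weights, 1e-12, None)  # keep the logs finite
    return params, weights, z
```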
Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics
Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions and uses a normal-gamma distribution as a conjugate prior on the mean and precision of each Gaussian component. We tested GBHC on 11 cancer and 3 synthetic datasets. The results on the cancer datasets show that, in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc
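The normal-gamma conjugate prior the abstract mentions makes the marginal likelihood of any candidate cluster available in closed form, and BHC-style algorithms compare such marginals when scoring candidate merges. A minimal sketch of that quantity follows, using standard conjugate-prior notation; the hyperparameter names and defaults are ours, not necessarily the paper's.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_normal_gamma(x, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Closed-form log marginal likelihood of 1-D data under a Gaussian
    likelihood with a normal-gamma prior on (mean, precision) -- the
    basic quantity a BHC-style algorithm compares when deciding
    whether to merge two clusters."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    # posterior hyperparameters for the normal-gamma conjugate update
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0
              + 0.5 * ((x - xbar) ** 2).sum()
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return (gammaln(alpha_n) - gammaln(alpha0)
            + alpha0 * np.log(beta0) - alpha_n * np.log(beta_n)
            + 0.5 * (np.log(kappa0) - np.log(kappa_n))
            - 0.5 * n * np.log(2.0 * np.pi))
```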
Non-Gaussian velocity distributions in excited granular matter in the absence of clustering
The velocity distribution of spheres rolling on a slightly tilted rectangular two-dimensional surface is obtained by high-speed imaging. The particles are excited by periodic forcing of one of the side walls. Our data suggest that strongly non-Gaussian velocity distributions can occur in dilute granular materials even in the absence of significant density correlations or clustering. When the surface on which the particles roll is tilted further to introduce stronger gravitation, the collision frequency with the driving wall increases and the velocity component distributions approach Gaussian distributions of different widths.
Comment: 4 pages, 5 figures. Additional information at http://physics.clarku.edu/~akudrolli/nls.htm
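A common diagnostic behind claims of non-Gaussian velocity statistics is the flatness (excess kurtosis) of each velocity component: zero for a Gaussian, positive for broader-than-Gaussian tails. A minimal sketch, assuming `v` holds one velocity component extracted from particle tracking (a hypothetical input, not the authors' data pipeline):

```python
import numpy as np
from scipy.stats import kurtosis

def flatness(v):
    """Excess kurtosis <v^4>/<v^2>^2 - 3 of a centred, rescaled
    velocity-component sample: 0 for a Gaussian, > 0 for the
    broader-than-Gaussian tails reported in driven granular gases."""
    v = np.asarray(v, dtype=float)
    v = (v - v.mean()) / v.std()
    return kurtosis(v, fisher=True)
```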
Mixtures of Skew-t Factor Analyzers
In this paper, we introduce a mixture of skew-t factor analyzers as well as a family of mixture models based thereon. The mixture of skew-t distributions model that we use arises as a limiting case of the mixture of generalized hyperbolic distributions. Like its Gaussian and t-distribution analogues, our mixture of skew-t factor analyzers is well suited to model-based clustering of high-dimensional data. Imposing constraints on components of the decomposed covariance parameter results in a family of eight flexible models. The alternating expectation-conditional maximization (AECM) algorithm is used for parameter estimation and the Bayesian information criterion (BIC) for model selection. The models are applied to both real and simulated data, giving superior clustering results compared to a well-established family of Gaussian mixture models.
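The BIC-driven selection over a family of constrained models can be illustrated with a short grid search. This is a structural stand-in only: scikit-learn's Gaussian mixtures (with their four covariance constraints) substitute for the paper's eight skew-t factor-analyzer models, which have no standard Python implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_by_bic(X, max_components=8):
    """Pick the fitted model minimizing BIC over a grid of component
    counts and covariance constraints. Gaussian mixtures stand in for
    the paper's constrained skew-t factor-analyzer family."""
    best_model, best_bic = None, np.inf
    for g in range(1, max_components + 1):
        for cov in ("full", "tied", "diag", "spherical"):
            gmm = GaussianMixture(n_components=g, covariance_type=cov,
                                  random_state=0).fit(X)
            bic = gmm.bic(X)  # smaller is better in sklearn's convention
            if bic < best_bic:
                best_model, best_bic = gmm, bic
    return best_model, best_bic
```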
Semi-supervised cross-entropy clustering with information bottleneck constraint
In this paper, we propose a semi-supervised clustering method, CEC-IB, that models data with a set of Gaussian distributions and retrieves clusters based on a partial labeling provided by the user (partition-level side information). By combining ideas from cross-entropy clustering (CEC) with those from the information bottleneck method (IB), our method trades off three conflicting goals: the accuracy with which the data set is modeled, the simplicity of the model, and the consistency of the clustering with the side information. Experiments demonstrate that CEC-IB performs comparably to Gaussian mixture models (GMMs) in a classical semi-supervised scenario, but is faster, more robust to noisy labels, automatically determines the optimal number of clusters, and performs well when not all classes are present in the side information. Moreover, in contrast to other semi-supervised models, it can successfully discover natural subgroups when the partition-level side information is derived from the top levels of a hierarchical clustering.
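The three-way trade-off can be made concrete with a toy objective: a model-fit term, a complexity term, and a penalty for splitting points that the side information says belong together. This is a loose sketch in the spirit of CEC-IB, not the paper's actual cost function; the encoding of side information as a dict from point index to partial label is our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cec_ib_style_cost(X, assign, components, side_info, lam=1.0):
    """Toy three-term objective: (1) fit of the Gaussian components to
    their assigned points, (2) a complexity term on the cluster
    proportions, (3) a penalty for separating points that share a
    partial label. The paper's cost combines cross-entropy clustering
    with an information bottleneck term, not these stand-ins."""
    X, assign = np.asarray(X), np.asarray(assign)
    k = len(components)
    # (1) model fit: mean negative log-density under the assigned component
    fit = -np.mean([multivariate_normal.logpdf(x, mean=m, cov=c)
                    for x, (m, c) in zip(X, (components[a] for a in assign))])
    # (2) simplicity: entropy of the cluster proportions
    p = np.bincount(assign, minlength=k) / len(X)
    simplicity = -np.sum(p[p > 0] * np.log(p[p > 0]))
    # (3) consistency: fraction of same-label pairs split across clusters
    idx = list(side_info)  # side_info: point index -> partial label
    split = [assign[i] != assign[j]
             for a, i in enumerate(idx) for j in idx[a + 1:]
             if side_info[i] == side_info[j]]
    penalty = np.mean(split) if split else 0.0
    return fit + simplicity + lam * penalty
```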
A general theory for robust clustering via trimmed mean
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data. Many recent results focus primarily on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors. Yet the restrictive sub-Gaussian model is often invalid in practice, since various real-world applications exhibit heavy-tailed distributions around the centroids or suffer from possible adversarial attacks that call for robust clustering with a robust data-driven initialization. In this paper, we introduce a hybrid clustering technique with a novel multivariate trimmed-mean centroid estimate that produces mislabeling guarantees under a weak initialization condition for general error distributions around the centroids. A matching lower bound is derived, up to factors depending on the number of clusters. In addition, our approach produces the optimal mislabeling rate even in the presence of adversarial outliers. Our results reduce to the sub-Gaussian case when errors follow sub-Gaussian distributions. To solve the problem thoroughly, we also present novel data-driven robust initialization techniques and show that, with probability approaching one, these initial centroid estimates are good enough for the subsequent clustering algorithm to achieve the optimal mislabeling rates. Furthermore, we demonstrate that the Lloyd algorithm is suboptimal for more than two clusters even when the errors are Gaussian, and for two clusters when the error distributions have heavy tails. Both simulated and real data examples lend further support to our robust initialization procedure and clustering algorithm.
Comment: 51 pages, corrected typo
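The centroid update the abstract describes can be illustrated with a Lloyd-style loop in which the per-cluster mean is replaced by a coordinatewise trimmed mean, so heavy-tailed or adversarial points pull the centers less. A simplified sketch; the paper's multivariate trimmed-mean estimate and its data-driven initialization are more delicate, as are its mislabeling guarantees.

```python
import numpy as np
from scipy.stats import trim_mean

def trimmed_lloyd(X, init_centers, n_iter=50, trim=0.1):
    """Lloyd-style iterations with the per-cluster mean replaced by a
    coordinatewise trimmed mean, for robustness to heavy tails and
    outliers. Sketch only; assumes `init_centers` comes from some
    robust initialization, e.g. the paper's."""
    centers = np.array(init_centers, dtype=float)
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: nearest current center in squared Euclidean distance
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        z = d.argmin(axis=1)
        # update step: trim a fraction `trim` from each tail, per coordinate
        for j in range(len(centers)):
            pts = X[z == j]
            if len(pts) > 0:
                centers[j] = trim_mean(pts, trim, axis=0)
    return centers, z
```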