121,174 research outputs found
Incremental Cluster Validity Indices for Online Learning of Hard Partitions: Extensions and Comparative Study
Validation is one of the most important aspects of clustering, particularly when the user is designing a trustworthy or explainable system. However, most clustering validation approaches require batch calculation. This is an important gap because of the value of clustering in real-time data streaming and other online learning applications. Therefore, interest has grown in providing online alternatives for validation. This paper extends the incremental cluster validity index (iCVI) family by presenting incremental versions of Calinski-Harabasz (iCH), Pakhira-Bandyopadhyay-Maulik (iPBM), WB index (iWB), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP), Representative Cross Entropy (irH), and Conn_Index (iConn_Index). This paper also provides a thorough comparative study of correct, under- and over-partitioning on the behavior of these iCVIs, the Partition Separation (PS) index as well as four recently introduced iCVIs: incremental Xie-Beni (iXB), incremental Davies-Bouldin (iDB), and incremental generalized Dunn\u27s indices 43 and 53 (iGD43 and iGD53). Experiments were carried out using a framework that was designed to be as agnostic as possible to the clustering algorithms. The results on synthetic benchmark data sets showed that while evidence of most under-partitioning cases could be inferred from the behaviors of the majority of these iCVIs, over-partitioning was found to be a more challenging problem, detected by fewer of them. Interestingly, over-partitioning, rather then under-partitioning, was more prominently detected on the real-world data experiments within this study. The expansion of iCVIs provides significant novel opportunities for assessing and interpreting the results of unsupervised lifelong learning in real-time, wherein samples cannot be reprocessed due to memory and/or application constraints
A data driven equivariant approach to constrained Gaussian mixture modeling
Maximum likelihood estimation of Gaussian mixture models with different
class-specific covariance matrices is known to be problematic. This is due to
the unboundedness of the likelihood, together with the presence of spurious
maximizers. Existing methods to bypass this obstacle are based on the fact that
unboundedness is avoided if the eigenvalues of the covariance matrices are
bounded away from zero. This can be done imposing some constraints on the
covariance matrices, i.e. by incorporating a priori information on the
covariance structure of the mixture components. The present work introduces a
constrained equivariant approach, where the class conditional covariance
matrices are shrunk towards a pre-specified matrix Psi. Data-driven choices of
the matrix Psi, when a priori information is not available, and the optimal
amount of shrinkage are investigated. The effectiveness of the proposal is
evaluated on the basis of a simulation study and an empirical example
Cluster Analysis of Simulated GravitationalWave Triggers Using S-MEANS and Constrained Validation Clustering
The fifth Science run of LIGO (S5) has been concluded recently. The data collected over two years of the run calls for a thorough analysis of the glitches seen in the gravitational wave channels, as well as in the auxiliary and environmental channels. The study presents two new techniques for cluster analysis of gravitational wave burst triggers. Traditional approaches to clustering treats the problem as an optimization problem in an “open” search space of clustering models. However, this can lead to problems with producing models that over-fit or under-fit the data as the search is stuck on local minima. The new algorithms tackle local minima by putting constraints in the search process. S-MEANS looks at similarity statistics of burst triggers and builds up clusters that have the advantage of avoiding local minima. Constrained Validation clustering tackles the problem by constraining the search in the space of clustering models that are “non-splittable” models in which centroids of the left and right child of a cluster (after splitting) are nearest to each other; the region of models that either over-fit or under-fit data (i.e. “splittable” models) can therefore be effectively avoided when assumptions about data are satisfied. These methods are demonstrated by using simulated data. The results on simulated data are promising and the methods are expected to be useful for LIGO S5 data analysis
Clustering via kernel decomposition
Spectral clustering methods were proposed recently which rely on the eigenvalue decomposition of an affinity matrix. In this letter, the affinity matrix is created from the elements of a nonparametric density estimator and then decomposed to obtain posterior probabilities of class membership. Hyperparameters are selected using standard cross-validation methods
Dynamic Metric Learning from Pairwise Comparisons
Recent work in distance metric learning has focused on learning
transformations of data that best align with specified pairwise similarity and
dissimilarity constraints, often supplied by a human observer. The learned
transformations lead to improved retrieval, classification, and clustering
algorithms due to the better adapted distance or similarity measures. Here, we
address the problem of learning these transformations when the underlying
constraint generation process is nonstationary. This nonstationarity can be due
to changes in either the ground-truth clustering used to generate constraints
or changes in the feature subspaces in which the class structure is apparent.
We propose Online Convex Ensemble StrongLy Adaptive Dynamic Learning (OCELAD),
a general adaptive, online approach for learning and tracking optimal metrics
as they change over time that is highly robust to a variety of nonstationary
behaviors in the changing metric. We apply the OCELAD framework to an ensemble
of online learners. Specifically, we create a retro-initialized composite
objective mirror descent (COMID) ensemble (RICE) consisting of a set of
parallel COMID learners with different learning rates, demonstrate RICE-OCELAD
on both real and synthetic data sets and show significant performance
improvements relative to previously proposed batch and online distance metric
learning algorithms.Comment: to appear Allerton 2016. arXiv admin note: substantial text overlap
with arXiv:1603.0367
Exhaustive and Efficient Constraint Propagation: A Semi-Supervised Learning Perspective and Its Applications
This paper presents a novel pairwise constraint propagation approach by
decomposing the challenging constraint propagation problem into a set of
independent semi-supervised learning subproblems which can be solved in
quadratic time using label propagation based on k-nearest neighbor graphs.
Considering that this time cost is proportional to the number of all possible
pairwise constraints, our approach actually provides an efficient solution for
exhaustively propagating pairwise constraints throughout the entire dataset.
The resulting exhaustive set of propagated pairwise constraints are further
used to adjust the similarity matrix for constrained spectral clustering. Other
than the traditional constraint propagation on single-source data, our approach
is also extended to more challenging constraint propagation on multi-source
data where each pairwise constraint is defined over a pair of data points from
different sources. This multi-source constraint propagation has an important
application to cross-modal multimedia retrieval. Extensive results have shown
the superior performance of our approach.Comment: The short version of this paper appears as oral paper in ECCV 201
- …