410,237 research outputs found
A new distance measure for model-based sequence clustering
We review the existing alternatives for defining model-based distances for clustering sequences and propose a new one based on the Kullback-Leibler divergence. This distance is shown to be especially useful in combination with spectral clustering. For improved performance in real-world scenarios, a model selection scheme is also proposed.Publicad
Partial mixture model for tight clustering of gene expression time-course
Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to
this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored.
Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate
information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a
simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms.
Conclusion: For the first time partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the ombination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering not only is capable to generate more profound understanding of the dataset
under study well in accordance to established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidences that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion
Geometry-Aware Neighborhood Search for Learning Local Models for Image Reconstruction
Local learning of sparse image models has proven to be very effective to
solve inverse problems in many computer vision applications. To learn such
models, the data samples are often clustered using the K-means algorithm with
the Euclidean distance as a dissimilarity metric. However, the Euclidean
distance may not always be a good dissimilarity measure for comparing data
samples lying on a manifold. In this paper, we propose two algorithms for
determining a local subset of training samples from which a good local model
can be computed for reconstructing a given input test sample, where we take
into account the underlying geometry of the data. The first algorithm, called
Adaptive Geometry-driven Nearest Neighbor search (AGNN), is an adaptive scheme
which can be seen as an out-of-sample extension of the replicator graph
clustering method for local model learning. The second method, called
Geometry-driven Overlapping Clusters (GOC), is a less complex nonadaptive
alternative for training subset selection. The proposed AGNN and GOC methods
are evaluated in image super-resolution, deblurring and denoising applications
and shown to outperform spectral clustering, soft clustering, and geodesic
distance based subset selection in most settings.Comment: 15 pages, 10 figures and 5 table
Nonparametric Feature Extraction from Dendrograms
We propose feature extraction from dendrograms in a nonparametric way. The
Minimax distance measures correspond to building a dendrogram with single
linkage criterion, with defining specific forms of a level function and a
distance function over that. Therefore, we extend this method to arbitrary
dendrograms. We develop a generalized framework wherein different distance
measures can be inferred from different types of dendrograms, level functions
and distance functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances respectively in solution space and in representation space in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate the sequential combination of different
distances and features sequentially in the spirit of multi-layered
architectures to obtain the final features. Finally, we demonstrate the
effectiveness of our approach via several numerical studies
Clustering Time Series from Mixture Polynomial Models with Discretised Data
Clustering time series is an active research area with applications in many fields. One common feature of time series is the likely presence of outliers. These uncharacteristic data can significantly effect the quality of clusters formed. This paper evaluates a method of over-coming the detrimental effects of outliers. We describe some of the alternative approaches to clustering time series, then specify a particular class of model for experimentation with k-means clustering and a correlation based distance metric. For data derived from this class of model we demonstrate that discretising the data into a binary series of above and below the median improves the clustering when the data has outliers. More specifically, we show that firstly discretisation does not significantly effect the accuracy of the clusters when there are no outliers and secondly it significantly increases the accuracy in the presence of outliers, even when the probability of outlier is very low
Min-Max-Jump distance and its applications
We explore three applications of Min-Max-Jump distance (MMJ distance).
MMJ-based K-means revises K-means with MMJ distance. MMJ-based Silhouette
coefficient revises Silhouette coefficient with MMJ distance. We also tested
the Clustering with Neural Network and Index (CNNI) model with MMJ-based
Silhouette coefficient. In the last application, we tested using Min-Max-Jump
distance for predicting labels of new points, after a clustering analysis of
data. Result shows Min-Max-Jump distance achieves good performances in all the
three proposed applications
Axioms for Distanceless Graph Partitioning
In 2002, Kleinberg proposed three axioms for distance-based clustering, and
proved that it was impossible for a clustering method to satisfy all three.
While there has been much subsequent work examining and modifying these axioms
for distance-based clustering, little work has been done to explore axioms
relevant to the graph partitioning problem, i.e., when the graph is given
without a distance matrix. Here, we propose and explore axioms for graph
partitioning when given graphs without distance matrices, including
modifications of Kleinberg's axioms for the distanceless case and two others
(one axiom relevant to the ''Resolution Limit'' and one addressing
well-connectedness). We prove that clustering under the Constant Potts Model
satisfies all the axioms, while Modularity clustering and Iterative k-core both
fail many axioms we pose. These theoretical properties of the clustering
methods are relevant both for theoretical investigation as well as to
practitioners considering which methods to use for their domain science
studies
- …