Optimal Clustering Framework for Hyperspectral Band Selection
Band selection, by choosing a set of representative bands in a hyperspectral
image (HSI), is an effective method to reduce redundant information without
compromising the original contents. Recently, various unsupervised band
selection methods have been proposed, but most of them are based on
approximation algorithms which can only obtain suboptimal solutions toward a
specific objective function. This paper focuses on clustering-based band
selection, and proposes a new framework to solve the above dilemma, claiming
the following contributions: 1) An optimal clustering framework (OCF), which
can obtain the optimal clustering result for a particular form of objective
function under a reasonable constraint. 2) A rank on clusters strategy (RCS),
which provides an effective criterion to select bands on existing clustering
structure. 3) An automatic method to determine the number of the required
bands, which can better evaluate the distinctive information produced by
a certain number of bands. In experiments, the proposed algorithm is compared to
some state-of-the-art competitors. According to the experimental results, the
proposed algorithm is robust and significantly outperforms the other methods on
various data sets.
Clustering by soft-constraint affinity propagation: Applications to gene-expression data
Motivation: Similarity-measure based clustering is a crucial problem
appearing throughout scientific data analysis. Recently, a powerful new
algorithm called Affinity Propagation (AP) based on message-passing techniques
was proposed by Frey and Dueck (2007). In AP, each cluster is identified
by a common exemplar to which all other data points of the same cluster refer, and
exemplars have to refer to themselves. Despite its proven power, AP in its
present form suffers from a number of drawbacks. The hard constraint of having
exactly one exemplar per cluster restricts AP to classes of regularly shaped
clusters, and leads to suboptimal performance, e.g., in analyzing gene
expression data. Results: This limitation can be overcome by relaxing the AP
hard constraints. A new parameter controls the importance of the constraints
compared to the aim of maximizing the overall similarity, and allows one to
interpolate between the simple case where each data point selects its closest
neighbor as an exemplar and the original AP. The resulting soft-constraint
affinity propagation (SCAP) is more informative and accurate, and leads to
more stable clustering. Even though a new a priori free parameter is
introduced, the overall dependence of the algorithm on external tuning is
reduced, as robustness is increased and an optimal strategy for parameter
selection emerges more naturally. SCAP is tested on biological benchmark data,
including in particular microarray data related to various cancer types. We
show that the algorithm efficiently unveils the hierarchical cluster structure
present in the data sets. Furthermore, it allows sparse gene
expression signatures to be extracted for each cluster. Comment: 11 pages, supplementary material:
http://isiosf.isi.it/~weigt/scap_supplement.pd
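The exemplar-based message passing that this abstract builds on can be sketched as follows. This is a minimal NumPy sketch of the original AP updates of Frey and Dueck, not of SCAP itself; the function name `affinity_propagation`, the damping factor, the iteration count, and the toy data are illustrative assumptions rather than anything specified in the abstract.

```python
import numpy as np

def affinity_propagation(S, damping=0.8, iters=300):
    """Plain affinity propagation on a similarity matrix S (n x n).

    The diagonal of S holds the "preference" of each point to be an exemplar.
    Returns (exemplar indices, exemplar index assigned to every point).
    """
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # Points whose self-responsibility plus self-availability is positive
    # are exemplars; this sketch assumes at least one exists.
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars  # exemplars refer to themselves
    return exemplars, labels

# Toy usage: two well-separated groups on a line, similarity = -squared distance,
# preference = median off-diagonal similarity (a common heuristic).
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, np.median(S[~np.eye(len(x), dtype=bool)]))
exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

The final assignment step makes the hard constraint visible: every point refers to exactly one exemplar, and each exemplar refers to itself. That is the constraint the abstract proposes to relax.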
Constraint-based discriminative dimension selection for high-dimensional stream clustering
Clustering data streams is an active research topic in data mining. However, the runtime of existing stream clustering algorithms increases and their performance drops as the number of dimensions grows. One way to reduce this complexity is to determine an appropriate subset of cluster dimensions via dimension projection. SED-Stream is an efficient clustering algorithm that supports high-dimensional data streams. The aim of this paper is to improve SED-Stream in terms of both clustering quality and execution time. To improve the clustering process, background or domain expert knowledge is integrated as “constraints” in a new algorithm, SEDC-Stream, which supports the evolving characteristics of dynamic constraints: activation, fading, outdating and prioritization. SEDC-Stream is able to reduce cluster splitting time and to place newly arriving points in their suitable clusters. Compared with SED-Stream on three real-world stream datasets, SEDC-Stream achieves better clustering performance in terms of both purity and F-measure.
Delete or merge regressors for linear model selection
We consider a problem of linear model selection in the presence of both
continuous and categorical predictors. Feasible models consist of subsets of
numerical variables and partitions of levels of factors. A new algorithm called
delete or merge regressors (DMR) is presented which is a stepwise backward
procedure involving ranking the predictors according to squared t-statistics
and choosing the final model minimizing BIC. In the article we prove
consistency of DMR when the number of predictors tends to infinity with the
sample size and describe a simulation study using an accompanying R package. The
results indicate a significant advantage in time complexity and selection
accuracy of our algorithm over Lasso-based methods described in the literature.
Moreover, a version of DMR for generalized linear models is proposed
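The backward procedure described in this abstract can be illustrated with a short sketch. It covers only the "delete" step for continuous predictors: predictors are ranked by their squared t-statistics in the full model, removed one at a time from weakest to strongest, and the nested model with the smallest BIC is kept. Merging of factor levels is omitted, and the function names `bic` and `backward_delete`, the Gaussian BIC form, and the simulated data are assumptions made for illustration, not the interface of the authors' R package.

```python
import numpy as np

def bic(y, X):
    """Gaussian BIC (up to an additive constant) of an OLS fit of y on X."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + X.shape[1] * np.log(n)

def backward_delete(y, X):
    """Select columns of X by t-statistic ranking and BIC.

    Column 0 of X is assumed to be the intercept and is never deleted.
    Returns the sorted list of retained column indices.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t2 = beta ** 2 / np.diag(cov)        # squared t-statistics, full model
    order = np.argsort(t2[1:]) + 1       # weakest predictor first

    cols = list(range(p))
    best_cols, best_score = list(cols), bic(y, X)
    for j in order:
        cols.remove(int(j))              # next model in the nested family
        score = bic(y, X[:, cols])
        if score < best_score:
            best_score, best_cols = score, list(cols)
    return sorted(best_cols)

# Toy usage: two informative predictors and two pure-noise predictors.
rng = np.random.default_rng(0)
n = 200
Z = rng.normal(size=(n, 4))
y = 1.0 + 3.0 * Z[:, 0] - 2.0 * Z[:, 1] + 0.5 * rng.normal(size=n)
X = np.column_stack([np.ones(n), Z])
print(backward_delete(y, X))
```

Because the ranking is computed once from the full model, only as many candidate models as there are predictors are fitted, instead of all subsets.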