Feature Selection For High-Dimensional Clustering
We present a nonparametric method for selecting informative features in
high-dimensional clustering problems. We start with a screening step that uses
a test for multimodality. Then we apply kernel density estimation and mode
clustering to the selected features. The output of the method consists of a
list of relevant features, and cluster assignments. We provide explicit bounds
on the error rate of the resulting clustering. In addition, we provide the
first error bounds on mode-based clustering.
Comment: 11 pages, 2 figures
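The two-stage pipeline the abstract describes — screen each feature for multimodality, then run mode clustering on the survivors — can be sketched as follows. This is an illustrative reconstruction, not the authors' method: it counts peaks of a kernel density estimate as a stand-in for a formal multimodality test (such as the dip test), and uses mean shift for the mode-clustering step.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde
from sklearn.cluster import MeanShift

def multimodal_features(X, grid_size=200):
    """Keep features whose kernel density estimate has more than one mode."""
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        kde = gaussian_kde(col)
        grid = np.linspace(col.min(), col.max(), grid_size)
        density = kde(grid)
        # Require some prominence so tiny wiggles in the KDE do not count as modes.
        peaks, _ = find_peaks(density, prominence=0.05 * density.max())
        keep.append(j) if len(peaks) > 1 else None
    return keep

rng = np.random.default_rng(0)
# Feature 0 is bimodal (two well-separated cluster means); feature 1 is pure noise.
X = np.column_stack([
    np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)]),
    rng.normal(0, 1, 200),
])
relevant = multimodal_features(X)          # screening step
labels = MeanShift().fit_predict(X[:, relevant])   # mode clustering on survivors
```

The unimodal noise feature is screened out before clustering, which is the point of the screening step: distances in the mode-clustering stage are computed only over features that carry cluster structure.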
Randomized Dimensionality Reduction for k-means Clustering
We study the topic of dimensionality reduction for k-means clustering.
Dimensionality reduction encompasses the union of two approaches: \emph{feature
selection} and \emph{feature extraction}. A feature selection based algorithm
for k-means clustering selects a small subset of the input features and then
applies k-means clustering on the selected features. A feature extraction
based algorithm for k-means clustering constructs a small set of new
artificial features and then applies k-means clustering on the constructed
features. Despite the significance of k-means clustering as well as the
wealth of heuristic methods addressing it, provably accurate feature selection
methods for k-means clustering are not known. On the other hand, two provably
accurate feature extraction methods for k-means clustering are known in the
literature; one is based on random projections and the other is based on the
singular value decomposition (SVD).
This paper makes further progress towards a better understanding of
dimensionality reduction for k-means clustering. Namely, we present the first
provably accurate feature selection method for k-means clustering and, in
addition, we present two feature extraction methods. The first feature
extraction method is based on random projections and it improves upon the
existing results in terms of time complexity and number of features needed to
be extracted. The second feature extraction method is based on fast approximate
SVD factorizations and it also improves upon the existing results in terms of
time complexity. The proposed algorithms are randomized and provide
constant-factor approximation guarantees with respect to the optimal k-means
objective value.
Comment: IEEE Transactions on Information Theory, to appear
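The random-projection route to dimensionality reduction for k-means is easy to sketch with standard tools. This is a generic illustration of the idea, not the paper's specific algorithm or its approximation analysis: project the data onto a few random directions, then run k-means in the reduced space.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(1)
# Three well-separated clusters embedded in 500 dimensions.
centers = rng.normal(0, 10, (3, 500))
X = np.vstack([c + rng.normal(0, 1, (50, 500)) for c in centers])

# Feature extraction: project to a small number of random dimensions,
# then run k-means in the reduced space.
proj = GaussianRandomProjection(n_components=20, random_state=1)
X_low = proj.fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_low)
```

Random projections approximately preserve pairwise distances (Johnson–Lindenstrauss), which is why clustering in the 20-dimensional projection recovers the same partition as clustering in the original 500 dimensions, at a fraction of the cost.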
A niching memetic algorithm for simultaneous clustering and feature selection
Clustering is inherently a difficult task, and is made even more difficult when the selection of relevant features is also an issue. In this paper we propose an approach for simultaneous clustering and feature selection using a niching memetic algorithm. Our approach (which we call NMA_CFS) makes feature selection an integral part of the global clustering search procedure and attempts to avoid convergence to less promising locally optimal solutions in both clustering and feature selection, without making any a priori assumption about the number of clusters. Within the NMA_CFS procedure, a variable composite representation is devised to encode both feature selection and cluster centers with different numbers of clusters. Further, local search operations are introduced to refine the feature selection and cluster centers encoded in the chromosomes. Finally, a niching method is integrated to preserve population diversity and prevent premature convergence. In an experimental evaluation we demonstrate the effectiveness of the proposed approach and compare it with other related approaches, using both synthetic and real data.
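The core idea — evolve candidate solutions that encode a feature mask, and score each candidate by how well the data cluster on its selected features — can be illustrated with a deliberately simplified sketch. This is not NMA_CFS: it omits the niching method, the local search operators, and the variable number of clusters, and it uses the silhouette coefficient as an assumed fitness function.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fitness(X, mask, k):
    """Score a candidate feature mask: cluster on the selected features
    and measure separation with the silhouette coefficient."""
    if mask.sum() == 0:
        return -1.0
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X[:, mask])
    return silhouette_score(X[:, mask], labels)

def evolve(X, k, pop_size=20, generations=10, seed=0):
    """Tiny genetic search over boolean feature masks (no niching, no local search)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.random((pop_size, d)) < 0.5             # random initial masks
    for _ in range(generations):
        scores = np.array([fitness(X, m, k) for m in pop])
        order = np.argsort(scores)[::-1]
        elite = pop[order[: pop_size // 2]]           # truncation selection
        children = elite.copy()
        flips = rng.random(children.shape) < 0.1      # bit-flip mutation
        children ^= flips
        pop = np.vstack([elite, children])
    scores = np.array([fitness(X, m, k) for m in pop])
    return pop[np.argmax(scores)]

rng = np.random.default_rng(1)
# Two informative features carrying the cluster structure, plus 8 noise features.
X = np.vstack([rng.normal(-4, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
X = np.hstack([X, rng.normal(0, 1, (120, 8))])
best_mask = evolve(X, k=2)
```

Masks that retain the informative features yield well-separated clusters and hence higher fitness, so the search drifts toward them; the full algorithm adds niching to keep several such masks alive simultaneously.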
Automatic Feature Set Selection for Merging Image Segmentation Results Using Fuzzy Clustering
The image segmentation performance of clustering algorithms depends heavily on the features used and on the type of objects contained in the image, which limits the generalization ability of such algorithms. The fuzzy image segmentation using suppressed fuzzy c-means clustering (FSSC) algorithm addressed this by merging the regions initially segmented by a fuzzy clustering algorithm. It used two different feature sets, each comprising two features drawn from pixel location, pixel intensity, and a combination of both, in order to handle objects with similar surface variations (SSV), the arbitrariness of the fuzzy c-means (FCM) algorithm when using pixel location, and the connectedness property of objects. However, the feature set selection for the initial segmentation in this merging technique was inaccurate, because it did not consider all possible feature set combinations and relied on a manually defined threshold to identify objects having SSV. To overcome these limitations, a new automatic feature set selection for merging image segmentation results using fuzzy clustering (AFMSF) algorithm is proposed, which selects the best feature set and calculates the threshold based on human visual perception. Both qualitative and quantitative analyses demonstrate the superiority of the AFMSF algorithm over other clustering techniques, including FSSC, FCM, possibilistic c-means (PCM), and SFCM, for different image types.
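The fuzzy clustering step underlying both FSSC and AFMSF is standard fuzzy c-means, which alternates two closed-form updates: recompute cluster centers as membership-weighted means, then recompute memberships from inverse distances. The sketch below is textbook FCM, not the suppressed variant or the merging procedure of the abstract.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # each row of memberships sums to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted means
        # Squared distances to each center, floored to avoid division by zero.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # u_ik proportional to d_ik^(-2/(m-1)), normalized over clusters.
        inv = d2 ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, U = fuzzy_c_means(X, c=2)
hard = U.argmax(axis=1)        # harden memberships for a crisp segmentation
```

In image segmentation, each row of `X` would be a pixel's feature vector (location, intensity, or both, as the abstract discusses), and the soft memberships `U` are what the merging stage subsequently operates on.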
Dynamic feature selection for clustering high dimensional data streams
Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked
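The mask-then-cluster loop the abstract describes can be sketched as follows. This is an illustrative stand-in, not the paper's method: it uses an exponentially weighted per-feature variance as the (assumed) importance measure, masks low-variance features, and hands only the unmasked features to a density-based clusterer on each batch of the stream.

```python
import numpy as np
from sklearn.cluster import DBSCAN

class DynamicFeatureMask:
    """Track per-feature importance over the stream and mask redundant features.
    Variance is a simple stand-in importance measure for illustration."""
    def __init__(self, n_features, alpha=0.3, threshold=0.1):
        self.var = np.zeros(n_features)   # exponentially weighted variance estimate
        self.alpha = alpha
        self.threshold = threshold

    def update(self, batch):
        # Blend the new batch's per-feature variance into the running estimate,
        # so features can become masked or unmasked as their relevance drifts.
        self.var = (1 - self.alpha) * self.var + self.alpha * batch.var(axis=0)
        return self.var > self.threshold  # boolean mask: True = feature is unmasked

rng = np.random.default_rng(2)
mask_tracker = DynamicFeatureMask(n_features=100)
for t in range(5):
    batch = rng.normal(0, 0.01, (200, 100))   # most features are near-constant
    # First three features carry real, drifting structure.
    batch[:, :3] = rng.normal(rng.integers(-5, 5, 3), 1.0, (200, 3))
    mask = mask_tracker.update(batch)
    # Any density-based algorithm can be plugged in here; clustering runs
    # only over the unmasked features, sidestepping the 97 redundant ones.
    labels = DBSCAN(eps=1.5).fit_predict(batch[:, mask])
```

Because the clusterer only ever sees the unmasked columns, both the distance computations and the curse-of-dimensionality problem shrink with the mask, which matches the abstract's claim of improved quality and reduced processing time.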
Clustering Approach for Feature Selection on the Microarray Data Classification Using Random Forest
Microarray data play an important role in diagnosing and detecting cancer, because microarray analysis can be used to observe gene expression levels in particular cell samples, allowing thousands of genes to be analysed simultaneously. However, microarray data have very few samples and very high dimensionality, so classifying microarray data requires a dimensionality reduction step. Dimensionality reduction can remove redundancy from the data, so that the features used for classification are those with high correlation to their class. There are two kinds of dimensionality reduction: feature selection and feature extraction. This study uses feature selection: the k-means clustering algorithm groups features with high similarity into one cluster, and the features within each cluster are then ranked using the Relief method. The highest-scoring features are selected as the feature subset for classification. The goal is to remove redundancy in the data, which can reduce classification accuracy. Classification is then performed with the Random Forest algorithm. The experiments yield accuracies of 85.87% on Colon, 98.9% on Lung Cancer, and 89% on Prostate Tumor. These accuracies are higher than in previous work that used only the Random Forest algorithm for gene selection and classification, so it can be concluded that a clustering approach to removing dimensional redundancy can be applied to classification of microarray data
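The three-stage pipeline — cluster the features, rank within each cluster, classify on the selected subset — can be sketched as below. This is a hedged illustration on synthetic data: it substitutes a simple correlation-with-label score for the Relief ranking used in the paper, and the cluster count and data shapes are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, d = 120, 60
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, d))
X[:, :5] += y[:, None] * 2.0   # first five features carry the class signal

# Step 1: cluster the *features* (columns) so redundant ones share a cluster.
feat_clusters = KMeans(n_clusters=10, n_init=10, random_state=3).fit_predict(X.T)

# Step 2: within each cluster, keep the single feature most correlated with the
# class label (a simple stand-in for the Relief scores used in the paper).
selected = []
for c in range(10):
    members = np.where(feat_clusters == c)[0]
    if members.size == 0:
        continue
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in members]
    selected.append(members[int(np.argmax(scores))])

# Step 3: classify on the selected subset with a random forest.
clf = RandomForestClassifier(n_estimators=100, random_state=3)
acc = cross_val_score(clf, X[:, selected], y, cv=5).mean()
```

Keeping one representative per feature cluster removes redundant (highly similar) features before classification, which is exactly the redundancy-removal argument the abstract makes.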
Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
This paper presents some experiments in clustering homogeneous XML documents
to validate an existing classification or, more generally, an organisational
structure. Our approach integrates techniques for extracting knowledge from
documents with unsupervised classification (clustering) of documents. We focus
on the feature selection used for representing documents and its impact on the
emerging classification. We mix the selection of structured features with fine
textual selection based on syntactic characteristics. We illustrate and evaluate
this approach with a collection of Inria activity reports for the year 2003.
The objective is to cluster projects into larger groups (Themes), based on the
keywords or different chapters of these activity reports. We then compare the
results of clustering using different feature selections, with the official
theme structure used by Inria.
Comment: (postprint); This version corrects a couple of errors in authors' names in the bibliography
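The mix of structural and textual features the abstract describes can be illustrated with a small sketch. The toy documents and feature choices below are assumptions for illustration, not the paper's data or representation: element names and chapter text are vectorized separately, concatenated, and clustered.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy activity-report-like documents: (element-name sequence, chapter text).
docs = [
    ("team keywords keywords", "machine learning statistics clustering"),
    ("team keywords keywords", "kernel methods learning theory"),
    ("team software software", "compiler optimization code generation"),
    ("team software software", "static analysis program verification"),
]
struct_vec = TfidfVectorizer()   # structural features: bag of element names
text_vec = TfidfVectorizer()     # textual features from the selected chapters
X = hstack([
    struct_vec.fit_transform(d[0] for d in docs),
    text_vec.fit_transform(d[1] for d in docs),
])
# Cluster the combined representation; the resulting groups can then be
# compared against an official theme structure to validate it.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Varying which block of features enters `X` (structure only, text only, or both) is the experiment the paper runs: each feature selection yields a clustering that can be compared with the official themes.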