Feature Selection For High-Dimensional Clustering
We present a nonparametric method for selecting informative features in
high-dimensional clustering problems. We start with a screening step that uses
a test for multimodality. Then we apply kernel density estimation and mode
clustering to the selected features. The output of the method consists of a
list of relevant features, and cluster assignments. We provide explicit bounds
on the error rate of the resulting clustering. In addition, we provide the
first error bounds on mode-based clustering.
Comment: 11 pages, 2 figures
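The two-stage pipeline the abstract describes — screen each feature for multimodality, then run mode clustering on the survivors — can be sketched as follows. This is an illustrative reconstruction, not the authors' method: it counts peaks of a kernel density estimate as a stand-in for a formal multimodality test (such as the dip test), and uses mean shift for the mode-clustering step.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde
from sklearn.cluster import MeanShift

def multimodal_features(X, grid_size=200):
    """Keep features whose kernel density estimate has more than one mode."""
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        kde = gaussian_kde(col)
        grid = np.linspace(col.min(), col.max(), grid_size)
        density = kde(grid)
        # Require some prominence so tiny wiggles in the KDE do not count as modes.
        peaks, _ = find_peaks(density, prominence=0.05 * density.max())
        keep.append(j) if len(peaks) > 1 else None
    return keep

rng = np.random.default_rng(0)
# Feature 0 is bimodal (two well-separated cluster means); feature 1 is pure noise.
X = np.column_stack([
    np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)]),
    rng.normal(0, 1, 200),
])
relevant = multimodal_features(X)          # screening step
labels = MeanShift().fit_predict(X[:, relevant])   # mode clustering on survivors
```

The unimodal noise feature is screened out before clustering, which is the point of the screening step: distances in the mode-clustering stage are computed only over features that carry cluster structure.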
Randomized Dimensionality Reduction for k-means Clustering
We study the topic of dimensionality reduction for k-means clustering.
Dimensionality reduction encompasses the union of two approaches: \emph{feature
selection} and \emph{feature extraction}. A feature selection based algorithm
for k-means clustering selects a small subset of the input features and then
applies k-means clustering on the selected features. A feature extraction
based algorithm for k-means clustering constructs a small set of new
artificial features and then applies k-means clustering on the constructed
features. Despite the significance of k-means clustering as well as the
wealth of heuristic methods addressing it, provably accurate feature selection
methods for k-means clustering are not known. On the other hand, two provably
accurate feature extraction methods for k-means clustering are known in the
literature; one is based on random projections and the other is based on the
singular value decomposition (SVD).
This paper makes further progress towards a better understanding of
dimensionality reduction for k-means clustering. Namely, we present the first
provably accurate feature selection method for k-means clustering and, in
addition, we present two feature extraction methods. The first feature
extraction method is based on random projections and it improves upon the
existing results in terms of time complexity and number of features needed to
be extracted. The second feature extraction method is based on fast approximate
SVD factorizations and it also improves upon the existing results in terms of
time complexity. The proposed algorithms are randomized and provide
constant-factor approximation guarantees with respect to the optimal k-means
objective value.
Comment: IEEE Transactions on Information Theory, to appear
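The random-projection route to dimensionality reduction for k-means is easy to sketch with standard tools. This is a generic illustration of the idea, not the paper's specific algorithm or its approximation analysis: project the data onto a few random directions, then run k-means in the reduced space.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(1)
# Three well-separated clusters embedded in 500 dimensions.
centers = rng.normal(0, 10, (3, 500))
X = np.vstack([c + rng.normal(0, 1, (50, 500)) for c in centers])

# Feature extraction: project to a small number of random dimensions,
# then run k-means in the reduced space.
proj = GaussianRandomProjection(n_components=20, random_state=1)
X_low = proj.fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_low)
```

Random projections approximately preserve pairwise distances (Johnson–Lindenstrauss), which is why clustering in the 20-dimensional projection recovers the same partition as clustering in the original 500 dimensions, at a fraction of the cost.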
A niching memetic algorithm for simultaneous clustering and feature selection
Clustering is inherently a difficult task, and is made even more difficult when the selection of relevant features is also an issue. In this paper we propose an approach for simultaneous clustering and feature selection using a niching memetic algorithm. Our approach (which we call NMA_CFS) makes feature selection an integral part of the global clustering search procedure and attempts to avoid convergence to less promising locally optimal solutions in both clustering and feature selection, without making any a priori assumption about the number of clusters. Within the NMA_CFS procedure, a variable composite representation is devised to encode both feature selection and cluster centers with different numbers of clusters. Further, local search operations are introduced to refine the feature selection and cluster centers encoded in the chromosomes. Finally, a niching method is integrated to preserve population diversity and prevent premature convergence. In an experimental evaluation we demonstrate the effectiveness of the proposed approach and compare it with other related approaches, using both synthetic and real data.
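The core idea — evolve candidate solutions that encode a feature mask, and score each candidate by how well the data cluster on its selected features — can be illustrated with a deliberately simplified sketch. This is not NMA_CFS: it omits the niching method, the local search operators, and the variable number of clusters, and it uses the silhouette coefficient as an assumed fitness function.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fitness(X, mask, k):
    """Score a candidate feature mask: cluster on the selected features
    and measure separation with the silhouette coefficient."""
    if mask.sum() == 0:
        return -1.0
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X[:, mask])
    return silhouette_score(X[:, mask], labels)

def evolve(X, k, pop_size=20, generations=10, seed=0):
    """Tiny genetic search over boolean feature masks (no niching, no local search)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.random((pop_size, d)) < 0.5             # random initial masks
    for _ in range(generations):
        scores = np.array([fitness(X, m, k) for m in pop])
        order = np.argsort(scores)[::-1]
        elite = pop[order[: pop_size // 2]]           # truncation selection
        children = elite.copy()
        flips = rng.random(children.shape) < 0.1      # bit-flip mutation
        children ^= flips
        pop = np.vstack([elite, children])
    scores = np.array([fitness(X, m, k) for m in pop])
    return pop[np.argmax(scores)]

rng = np.random.default_rng(1)
# Two informative features carrying the cluster structure, plus 8 noise features.
X = np.vstack([rng.normal(-4, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
X = np.hstack([X, rng.normal(0, 1, (120, 8))])
best_mask = evolve(X, k=2)
```

Masks that retain the informative features yield well-separated clusters and hence higher fitness, so the search drifts toward them; the full algorithm adds niching to keep several such masks alive simultaneously.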
Automatic Feature Set Selection for Merging Image Segmentation Results Using Fuzzy Clustering
The image segmentation performance of clustering algorithms depends heavily on the features used and on the type of objects contained in the image, which limits the generalization ability of such algorithms. The fuzzy image segmentation using suppressed fuzzy c-means clustering (FSSC) algorithm addressed this by merging the regions initially segmented by a fuzzy clustering algorithm. It used two different feature sets, each comprising two features drawn from pixel location, pixel intensity, and a combination of both, in order to handle objects with similar surface variations (SSV), the arbitrariness of the fuzzy c-means (FCM) algorithm when using pixel location, and the connectedness property of objects. However, the feature set selection for the initial segmentation in this merging technique was inaccurate, because it did not consider all possible feature set combinations and relied on a manually defined threshold to identify objects having SSV. To overcome these limitations, a new automatic feature set selection for merging image segmentation results using fuzzy clustering (AFMSF) algorithm is proposed, which selects the best feature set and calculates the threshold based on human visual perception. Both qualitative and quantitative analyses demonstrate the superiority of the AFMSF algorithm over other clustering techniques, including FSSC, FCM, possibilistic c-means (PCM), and SFCM, for different image types.
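The fuzzy clustering step underlying both FSSC and AFMSF is standard fuzzy c-means, which alternates two closed-form updates: recompute cluster centers as membership-weighted means, then recompute memberships from inverse distances. The sketch below is textbook FCM, not the suppressed variant or the merging procedure of the abstract.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # each row of memberships sums to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted means
        # Squared distances to each center, floored to avoid division by zero.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # u_ik proportional to d_ik^(-2/(m-1)), normalized over clusters.
        inv = d2 ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, U = fuzzy_c_means(X, c=2)
hard = U.argmax(axis=1)        # harden memberships for a crisp segmentation
```

In image segmentation, each row of `X` would be a pixel's feature vector (location, intensity, or both, as the abstract discusses), and the soft memberships `U` are what the merging stage subsequently operates on.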
Dynamic feature selection for clustering high dimensional data streams
Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked
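The mask-then-cluster loop the abstract describes can be sketched as follows. This is an illustrative stand-in, not the paper's method: it uses an exponentially weighted per-feature variance as the (assumed) importance measure, masks low-variance features, and hands only the unmasked features to a density-based clusterer on each batch of the stream.

```python
import numpy as np
from sklearn.cluster import DBSCAN

class DynamicFeatureMask:
    """Track per-feature importance over the stream and mask redundant features.
    Variance is a simple stand-in importance measure for illustration."""
    def __init__(self, n_features, alpha=0.3, threshold=0.1):
        self.var = np.zeros(n_features)   # exponentially weighted variance estimate
        self.alpha = alpha
        self.threshold = threshold

    def update(self, batch):
        # Blend the new batch's per-feature variance into the running estimate,
        # so features can become masked or unmasked as their relevance drifts.
        self.var = (1 - self.alpha) * self.var + self.alpha * batch.var(axis=0)
        return self.var > self.threshold  # boolean mask: True = feature is unmasked

rng = np.random.default_rng(2)
mask_tracker = DynamicFeatureMask(n_features=100)
for t in range(5):
    batch = rng.normal(0, 0.01, (200, 100))   # most features are near-constant
    # First three features carry real, drifting structure.
    batch[:, :3] = rng.normal(rng.integers(-5, 5, 3), 1.0, (200, 3))
    mask = mask_tracker.update(batch)
    # Any density-based algorithm can be plugged in here; clustering runs
    # only over the unmasked features, sidestepping the 97 redundant ones.
    labels = DBSCAN(eps=1.5).fit_predict(batch[:, mask])
```

Because the clusterer only ever sees the unmasked columns, both the distance computations and the curse-of-dimensionality problem shrink with the mask, which matches the abstract's claim of improved quality and reduced processing time.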
Clustering Approach for Feature Selection on the Microarray Data Classification Using Random Forest
Microarray data play an important role in diagnosing and detecting cancer, because microarray analysis can be used to observe gene expression levels in particular cell samples, allowing thousands of genes to be analysed simultaneously. However, microarray data have very few samples and very high dimensionality, so classifying microarray data requires a dimensionality reduction step. Dimensionality reduction can remove redundancy from the data, so that the features used for classification are those with high correlation to their class. There are two kinds of dimensionality reduction: feature selection and feature extraction. This study uses feature selection: the k-means clustering algorithm groups features with high similarity into one cluster, and the features within each cluster are then ranked using the Relief method. The highest-scoring features are selected as the feature subset for classification. The goal is to remove redundancy in the data, which can reduce classification accuracy. Classification is then performed with the Random Forest algorithm. The experiments yield accuracies of 85.87% on Colon, 98.9% on Lung Cancer, and 89% on Prostate Tumor. These accuracies are higher than in previous work that used only the Random Forest algorithm for gene selection and classification, so it can be concluded that a clustering approach to removing dimensional redundancy can be applied to classification of microarray data
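The three-stage pipeline — cluster the features, rank within each cluster, classify on the selected subset — can be sketched as below. This is a hedged illustration on synthetic data: it substitutes a simple correlation-with-label score for the Relief ranking used in the paper, and the cluster count and data shapes are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, d = 120, 60
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, d))
X[:, :5] += y[:, None] * 2.0   # first five features carry the class signal

# Step 1: cluster the *features* (columns) so redundant ones share a cluster.
feat_clusters = KMeans(n_clusters=10, n_init=10, random_state=3).fit_predict(X.T)

# Step 2: within each cluster, keep the single feature most correlated with the
# class label (a simple stand-in for the Relief scores used in the paper).
selected = []
for c in range(10):
    members = np.where(feat_clusters == c)[0]
    if members.size == 0:
        continue
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in members]
    selected.append(members[int(np.argmax(scores))])

# Step 3: classify on the selected subset with a random forest.
clf = RandomForestClassifier(n_estimators=100, random_state=3)
acc = cross_val_score(clf, X[:, selected], y, cv=5).mean()
```

Keeping one representative per feature cluster removes redundant (highly similar) features before classification, which is exactly the redundancy-removal argument the abstract makes.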
Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
This paper presents some experiments in clustering homogeneous XML documents
to validate an existing classification or, more generally, an organisational
structure. Our approach integrates techniques for extracting knowledge from
documents with unsupervised classification (clustering) of documents. We focus
on the feature selection used for representing documents and its impact on the
emerging classification. We mix the selection of structured features with fine
textual selection based on syntactic characteristics. We illustrate and evaluate
this approach with a collection of Inria activity reports for the year 2003.
The objective is to cluster projects into larger groups (Themes), based on the
keywords or different chapters of these activity reports. We then compare the
results of clustering using different feature selections, with the official
theme structure used by Inria.
Comment: (postprint); This version corrects a couple of errors in authors' names in the bibliography
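The mix of structural and textual features the abstract describes can be illustrated with a small sketch. The toy documents and feature choices below are assumptions for illustration, not the paper's data or representation: element names and chapter text are vectorized separately, concatenated, and clustered.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy activity-report-like documents: (element-name sequence, chapter text).
docs = [
    ("team keywords keywords", "machine learning statistics clustering"),
    ("team keywords keywords", "kernel methods learning theory"),
    ("team software software", "compiler optimization code generation"),
    ("team software software", "static analysis program verification"),
]
struct_vec = TfidfVectorizer()   # structural features: bag of element names
text_vec = TfidfVectorizer()     # textual features from the selected chapters
X = hstack([
    struct_vec.fit_transform(d[0] for d in docs),
    text_vec.fit_transform(d[1] for d in docs),
])
# Cluster the combined representation; the resulting groups can then be
# compared against an official theme structure to validate it.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Varying which block of features enters `X` (structure only, text only, or both) is the experiment the paper runs: each feature selection yields a clustering that can be compared with the official themes.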