    Feature Selection For High-Dimensional Clustering

    We present a nonparametric method for selecting informative features in high-dimensional clustering problems. We start with a screening step that uses a test for multimodality. Then we apply kernel density estimation and mode clustering to the selected features. The output of the method consists of a list of relevant features and the cluster assignments. We provide explicit bounds on the error rate of the resulting clustering. In addition, we provide the first error bounds on mode-based clustering. Comment: 11 pages, 2 figures
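The mode clustering step mentioned in this abstract can be illustrated with a minimal one-dimensional sketch. It assumes a Gaussian kernel and a plain mean-shift iteration, which is one common way to find density modes and not necessarily the procedure used in the paper. The function names, the bandwidth, and the merge tolerance are all hypothetical, and the multimodality screening step is omitted.

```python
import math


def mean_shift_1d(xs, bandwidth, iters=100, tol=1e-9):
    """Move each point uphill on a Gaussian kernel density estimate.

    Returns, for every input point, the location it converged to.
    Assumption: Gaussian kernel with a fixed, user-chosen bandwidth.
    """
    modes = []
    for x in xs:
        m = x
        for _ in range(iters):
            # Kernel weight of every sample relative to the current position.
            weights = [math.exp(-0.5 * ((m - xi) / bandwidth) ** 2) for xi in xs]
            new_m = sum(w * xi for w, xi in zip(weights, xs)) / sum(weights)
            converged = abs(new_m - m) < tol
            m = new_m
            if converged:
                break
        modes.append(m)
    return modes


def label_by_mode(modes, merge_tol):
    """Give the same label to points whose converged positions nearly coincide."""
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if abs(m - c) <= merge_tol:
                labels.append(k)
                break
        else:
            # No existing mode is close enough, so start a new cluster.
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels
```

Calling `label_by_mode(mean_shift_1d(xs, bandwidth=0.5), merge_tol=0.05)` on two well-separated groups of values assigns one label per group. In practice the bandwidth choice drives how many modes appear.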

    Randomized Dimensionality Reduction for k-means Clustering

    We study the topic of dimensionality reduction for k-means clustering. Dimensionality reduction encompasses the union of two approaches: feature selection and feature extraction. A feature selection based algorithm for k-means clustering selects a small subset of the input features and then applies k-means clustering on the selected features. A feature extraction based algorithm for k-means clustering constructs a small set of new artificial features and then applies k-means clustering on the constructed features. Despite the significance of k-means clustering as well as the wealth of heuristic methods addressing it, provably accurate feature selection methods for k-means clustering are not known. On the other hand, two provably accurate feature extraction methods for k-means clustering are known in the literature; one is based on random projections and the other is based on the singular value decomposition (SVD). This paper makes further progress towards a better understanding of dimensionality reduction for k-means clustering. Namely, we present the first provably accurate feature selection method for k-means clustering and, in addition, we present two feature extraction methods. The first feature extraction method is based on random projections and it improves upon the existing results in terms of time complexity and number of features needed to be extracted. The second feature extraction method is based on fast approximate SVD factorizations and it also improves upon the existing results in terms of time complexity. The proposed algorithms are randomized and provide constant-factor approximation guarantees with respect to the optimal k-means objective value. Comment: IEEE Transactions on Information Theory, to appear
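As a rough illustration of feature extraction by random projection, the sketch below maps each d-dimensional point to r new features using a rescaled random sign matrix. This is a generic construction under stated assumptions, not the paper's exact algorithm. The function name, the seed handling, and the choice of r are hypothetical, and nothing here reproduces the approximation guarantees claimed in the abstract.

```python
import math
import random


def random_sign_projection(points, r, seed=0):
    """Project d-dimensional points onto r random directions.

    Assumption: the projection matrix has i.i.d. entries equal to
    +1/sqrt(r) or -1/sqrt(r), each with probability 1/2.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(r)
    # Build the d x r projection matrix once and reuse it for all points.
    proj = [[scale * rng.choice((-1.0, 1.0)) for _ in range(r)] for _ in range(d)]
    # Each output row is the matrix product of one point with the projection.
    return [
        [sum(x[i] * proj[i][j] for i in range(d)) for j in range(r)]
        for x in points
    ]
```

Any k-means implementation would then be run on the returned r-dimensional points instead of the original d-dimensional ones.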

    Dynamic feature selection for clustering high dimensional data streams

    Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, many of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric, which is problematic in high-dimensional data, where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high-dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms, which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams: two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked.
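The idea of a mask that is updated as feature relevance drifts can be sketched as follows. The abstract does not say how importance is measured, so this example assumes, purely for illustration, that a feature is relevant when its variance over a sliding window exceeds a threshold. The class name, the window size, and the threshold are hypothetical.

```python
from collections import deque
from statistics import pvariance


class DynamicFeatureMask:
    """Keep a per-feature keep/hide flag that follows the recent stream.

    Assumption: relevance is approximated by variance over a sliding window.
    mask[i] is True when feature i is unmasked (used for clustering).
    """

    def __init__(self, n_features, window=50, threshold=0.01):
        self.window = deque(maxlen=window)
        self.threshold = threshold
        self.mask = [True] * n_features

    def update(self, x):
        """Add one stream item and recompute the mask from the window."""
        self.window.append(list(x))
        if len(self.window) >= 2:
            columns = zip(*self.window)
            self.mask = [pvariance(col) > self.threshold for col in columns]
        return self.mask

    def apply(self, x):
        """Return only the unmasked features of x for the clustering step."""
        return [v for v, keep in zip(x, self.mask) if keep]
```

Each arriving item would be passed to `update`, and the output of `apply` would be handed to whichever stream clustering algorithm is in use. Comparing successive masks gives a simple way to observe feature-level change.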

    Clustering Approach for Feature Selection on the Microarray Data Classification Using Random Forest

    Microarray data play an important role in diagnosing and detecting cancer, because microarray analysis measures gene expression levels in a given cell sample and makes it possible to analyze thousands of genes simultaneously. However, microarray data have very few samples and very high dimensionality, so classifying them requires dimensionality reduction. Dimensionality reduction can remove redundancy in the data, so that the features used for classification are those highly correlated with the class. There are two kinds of dimensionality reduction: feature selection and feature extraction. This study uses feature selection: the k-means clustering algorithm groups features with high similarity into the same cluster, and the features in each cluster are then ranked with the Relief method. The highest-scoring features are selected as the feature subset for classification. The aim is to remove redundancy in the data, which can lower classification accuracy. Classification is then performed with the Random Forest algorithm. The accuracies obtained for each dataset are 85.87% for Colon, 98.9% for Lung Cancer, and 89% for Prostate Tumor. These accuracies are higher than those of previous work that used Random Forest alone for both gene selection and classification, so we conclude that the clustering approach to removing redundant dimensions can be applied to classification of microarray data.
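The final selection step of this pipeline, picking the best-ranked features from each feature cluster, can be sketched as below. The sketch assumes the k-means cluster assignment and the Relief scores have already been computed and are passed in as plain dictionaries. The function name, the `per_cluster` parameter, and the gene identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def select_top_per_cluster(cluster_of, score_of, per_cluster=1):
    """Pick the highest-scoring features from every feature cluster.

    cluster_of: maps feature name -> cluster id (assumed to come from k-means).
    score_of:   maps feature name -> relevance score (assumed to come from Relief).
    """
    by_cluster = defaultdict(list)
    for feature, cluster in cluster_of.items():
        by_cluster[cluster].append(feature)

    selected = []
    for cluster in sorted(by_cluster):
        # Rank features inside the cluster from most to least relevant.
        ranked = sorted(by_cluster[cluster], key=lambda f: score_of[f], reverse=True)
        selected.extend(ranked[:per_cluster])
    return selected
```

For example, with clusters `{"g1": 0, "g2": 0, "g3": 1, "g4": 1}` and scores `{"g1": 0.2, "g2": 0.9, "g3": 0.5, "g4": 0.1}`, the call keeps one representative per cluster, so redundant features that fall in the same cluster do not all reach the classifier.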

    Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

    This paper presents some experiments in clustering homogeneous XML documents to validate an existing classification or, more generally, an organisational structure. Our approach integrates techniques for extracting knowledge from documents with unsupervised classification (clustering) of documents. We focus on the feature selection used for representing documents and its impact on the emerging classification. We mix the selection of structured features with fine textual selection based on syntactic characteristics. We illustrate and evaluate this approach with a collection of Inria activity reports for the year 2003. The objective is to cluster projects into larger groups (Themes), based on the keywords or different chapters of these activity reports. We then compare the results of clustering using different feature selections with the official theme structure used by Inria. Comment: (postprint); This version corrects a couple of errors in authors' names in the bibliography