53,884 research outputs found

A Novel Subset Selection Clustering-Based Algorithm for High Dimensional Data

Author: Chandra Mouli Kolavasi
Krishna Balineni Bala
Publication venue: Kakinada Institute of Engineering and Technology for Women
Publication date: 23/07/2015

Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are to be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Feature selection involves identifying a subset of the most useful features that produces results comparable to those of the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While the efficiency concerns the time required to find a subset of features, the effectiveness is related to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study.
Extensive experiments are carried out to compare FAST and several representative feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
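
The two steps described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: absolute Pearson correlation is an assumed stand-in for whatever relevance and redundancy measure the paper uses, and the function names and the `cut` threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns (i, j, d) tree edges."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        d, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, d))
        in_tree.add(j)
    return edges


def select_features(columns, target, cut=0.5):
    """Cluster features via a minimum spanning tree, then keep one per cluster.

    columns: list of feature columns; target: class labels encoded as numbers.
    cut: hypothetical threshold; MST edges longer than this are removed.
    """
    n = len(columns)
    # Assumed distance between features: 1 - |correlation|.
    dist = [
        [1.0 - abs(pearson(columns[i], columns[j])) for j in range(n)]
        for i in range(n)
    ]

    # Step 1: build the MST and merge only the endpoints of short edges.
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, d in mst_edges(dist):
        if d <= cut:
            parent[find(i)] = find(j)

    # Step 2: per cluster, keep the feature most correlated with the target.
    relevance = [abs(pearson(c, target)) for c in columns]
    best = {}
    for f in range(n):
        root = find(f)
        if root not in best or relevance[f] > relevance[best[root]]:
            best[root] = f
    return sorted(best.values())
```

Given two nearly collinear columns and one unrelated column, the sketch groups the collinear pair into one cluster and returns one index from it plus the index of the unrelated column.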

International Journal of Science Engineering and Advance Technology (IJSEAT)

IDENTIFICATION OF SIGNIFICANT FEATURES USING RANDOM FOREST FOR HIGH DIMENSIONAL MICROARRAY DATA

Author: ARPITA NAGPAL
VIJENDRA SINGH
Publication venue: Taylor's University
Publication date: 01/08/2018

Feature subset selection for microarray data aims at reducing the number of genes so that useful information can be extracted from the samples. At the same time, selecting the relevant genes (features) from the high-dimensional data can improve the classification accuracy of the learning algorithm. This paper proposes a feature selection algorithm suited to high-dimensional, small-sample-size microarray data. Feature selection is performed in two phases. In the first phase, Random Forest is used to identify the importance of each feature, so that the features with high relevance can be given priority over less relevant ones. In the second phase, feature clustering is performed around the relevant features to yield the reduced feature set. A statistical method is used to create the clusters, which helps to identify the genes that specifically represent the disease. The effectiveness of the proposed algorithm has been compared with three state-of-the-art feature selection algorithms, viz. Fast Correlation-Based Filter (FCBF), a Fast Clustering-Based Feature Selection Algorithm (FAST), and Random Forest (RF), on nine real-world cancer microarray datasets. Empirically, the algorithms have been evaluated through three well-known classifiers, viz. the probability-based Naïve Bayes, the tree-based C4.5, and the instance-based IB1. The results show that the proposed algorithm can help find a smaller set of features for cancer microarray datasets with better classification accuracy.

Directory of Open Access Journals

Recommended from our members

A niching memetic algorithm for simultaneous clustering and feature selection

Author: Fairhurst M
Liu X
Sheng W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2008

Clustering is inherently a difficult task, and is made even more difficult when the selection of relevant features is also an issue. In this paper we propose an approach for simultaneous clustering and feature selection using a niching memetic algorithm. Our approach (which we call NMA_CFS) makes feature selection an integral part of the global clustering search procedure and attempts to overcome the problem of identifying less promising locally optimal solutions in both clustering and feature selection, without making any a priori assumption about the number of clusters. Within the NMA_CFS procedure, a variable composite representation is devised to encode both feature selection and cluster centers with different numbers of clusters. Further, local search operations are introduced to refine feature selection and cluster centers encoded in the chromosomes. Finally, a niching method is integrated to preserve the population diversity and prevent premature convergence. In an experimental evaluation we demonstrate the effectiveness of the proposed approach and compare it with other related approaches, using both synthetic and real data.

Brunel University Research Archive

Spatial Random Sampling: A Structure-Preserving Data Sketching Tool

Author: Atia George
Rahmani Mostafa
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/07/2017

Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from less-populated data clusters. Also, adaptive sampling can often provide accurate low-rank approximations, yet may fall short of producing descriptive data sketches, especially when the cluster centers are linearly dependent. Motivated by that, this paper introduces a novel randomized column sampling tool dubbed Spatial Random Sampling (SRS), in which data points are sampled based on their proximity to randomly sampled points on the unit sphere. The most compelling feature of SRS is that the corresponding probability of sampling from a given data cluster is proportional to the surface area the cluster occupies on the unit sphere, independent of the size of the cluster population. Although it is fully randomized, SRS is shown to provide descriptive and balanced data representations. The proposed idea addresses a pressing need in data science and holds potential to inspire many novel approaches for analysis of big data.
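
One plausible reading of the sampling rule in the abstract above is sketched below. It assumes that "proximity" means the largest inner product between a normalized data point and a random unit direction, and that all data points are non-zero; the function names are hypothetical and this is not presented as the authors' implementation.

```python
import math
import random


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spatial_random_sample(points, m, seed=0):
    """Return m indices into `points`, one per random direction on the unit sphere."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Project every data point onto the unit sphere (assumes no zero vectors).
    unit = [normalize(p) for p in points]
    chosen = []
    for _ in range(m):
        # A normalized Gaussian vector is uniformly distributed on the sphere.
        direction = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
        # Pick the data point whose direction is closest to the random one.
        best = max(
            range(len(unit)),
            key=lambda i: sum(a * b for a, b in zip(unit[i], direction)),
        )
        chosen.append(best)
    return chosen
```

With 20 points packed near (1, 0) and a single point at (-1, 0), the isolated point should be picked for roughly half of the directions, since its share depends on the region of the circle it covers rather than on how many points the other cluster contains.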

arXiv.org e-Print Archive

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Randomized Dimensionality Reduction for k-means Clustering

Author: Boutsidis Christos
Drineas Petros
Mahoney Michael W.
Zouzias Anastasios
Publication date: 01/01/2013

We study the topic of dimensionality reduction for k-means clustering. Dimensionality reduction encompasses the union of two approaches: feature selection and feature extraction. A feature selection based algorithm for k-means clustering selects a small subset of the input features and then applies k-means clustering on the selected features. A feature extraction based algorithm for k-means clustering constructs a small set of new artificial features and then applies k-means clustering on the constructed features. Despite the significance of k-means clustering as well as the wealth of heuristic methods addressing it, provably accurate feature selection methods for k-means clustering are not known. On the other hand, two provably accurate feature extraction methods for k-means clustering are known in the literature; one is based on random projections and the other is based on the singular value decomposition (SVD). This paper makes further progress towards a better understanding of dimensionality reduction for k-means clustering. Namely, we present the first provably accurate feature selection method for k-means clustering and, in addition, we present two feature extraction methods. The first feature extraction method is based on random projections and it improves upon the existing results in terms of time complexity and number of features needed to be extracted. The second feature extraction method is based on fast approximate SVD factorizations and it also improves upon the existing results in terms of time complexity. The proposed algorithms are randomized and provide constant-factor approximation guarantees with respect to the optimal k-means objective value. Comment: IEEE Transactions on Information Theory, to appear
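
The random-projection style of feature extraction mentioned in the abstract above can be illustrated with a short sketch. It assumes a rescaled random sign matrix, which is one common construction and not necessarily the exact one analysed in the paper; the function name and parameters are hypothetical.

```python
import math
import random


def random_sign_projection(rows, r, seed=0):
    """Map each d-dimensional row to r new features using a random +/-1 matrix.

    Entries of the d x r matrix are +1/sqrt(r) or -1/sqrt(r) with equal
    probability, so squared lengths are preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    scale = 1.0 / math.sqrt(r)
    projection = [
        [rng.choice((-1.0, 1.0)) * scale for _ in range(r)] for _ in range(d)
    ]
    # Each output row is the matrix product row * projection.
    return [
        [sum(row[k] * projection[k][j] for k in range(d)) for j in range(r)]
        for row in rows
    ]
```

Any k-means routine can then be run on the returned rows in place of the original data; how large `r` must be for a given approximation guarantee is the subject of the paper and is not reproduced here.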

arXiv.org e-Print Archive

CiteSeerX

Dynamic feature selection for clustering high dimensional data streams

Author: Fahy Conor
Yang Shengxiang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/07/2019

Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, many of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric, and this is problematic in high-dimensional data, where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high-dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms, which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams: two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked.
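
The masking idea in the abstract above can be illustrated with a small sketch. The abstract does not say how feature importance is measured, so the exponentially weighted variance used here is purely an assumption, as are the class name, `decay`, and `threshold`.

```python
class DynamicFeatureMask:
    """Keeps a per-feature importance score and masks features that fall below a threshold.

    Importance is an exponentially weighted variance (an assumed proxy, not
    the measure used in the paper).
    """

    def __init__(self, n_features, decay=0.9, threshold=0.05):
        self.decay = decay
        self.threshold = threshold
        self.mean = [0.0] * n_features
        self.var = [0.0] * n_features
        self.seen = 0

    def update(self, x):
        """Fold one stream element into the running statistics."""
        if self.seen == 0:
            self.mean = list(x)
        else:
            for i, value in enumerate(x):
                delta = value - self.mean[i]
                self.mean[i] += (1 - self.decay) * delta
                self.var[i] = self.decay * (
                    self.var[i] + (1 - self.decay) * delta * delta
                )
        self.seen += 1

    def mask(self):
        """True means the feature is currently unmasked (passed to the clusterer)."""
        return [v >= self.threshold for v in self.var]

    def apply(self, x):
        """Return only the unmasked features of x."""
        return [value for value, keep in zip(x, self.mask()) if keep]
```

A clustering algorithm would receive `apply(x)` instead of `x`. When a previously constant feature starts to vary, its score rises and it is unmasked, while a feature that stops varying decays below the threshold and is masked.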

De Montfort University Open Research Archive