
    A Novel Subset Selection Clustering-Based Algorithm for High Dimensional Data

    Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are to be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Feature selection involves identifying a subset of the most useful features that produces results comparable to those of the original, entire set of features. A feature selection algorithm may be evaluated from both the efficiency and the effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text datasets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
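
    The two steps described in this abstract (cluster the features with a minimum spanning tree, then keep one representative per cluster) can be sketched roughly as follows. This is an illustration only, not the FAST algorithm itself: the abstract does not name the relatedness measure or the rule for splitting the tree, so the sketch assumes absolute Pearson correlation as the measure and a fixed distance threshold `cut` for removing tree edges. The function names and toy data are hypothetical.

```python
import math


def abs_corr(x, y):
    """Absolute Pearson correlation (an assumed stand-in relatedness measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def mst_edges(n, dist):
    """Prim's algorithm on the complete graph over n features."""
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, w))
        in_tree.add(j)
    return edges


def select_features(columns, target, cut=0.5):
    """Step 1: MST clustering of features. Step 2: one feature per cluster."""
    n = len(columns)
    dist = [
        [1.0 - abs_corr(columns[i], columns[j]) for j in range(n)]
        for i in range(n)
    ]
    # Assumed splitting rule: drop tree edges longer than `cut`.
    kept = [(i, j) for i, j, w in mst_edges(n, dist) if w <= cut]

    # Connected components of the remaining forest, via union-find.
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in kept:
        parent[find(i)] = find(j)
    clusters = {}
    for f in range(n):
        clusters.setdefault(find(f), []).append(f)

    # Representative = the member most related to the target.
    return sorted(
        max(members, key=lambda f: abs_corr(columns[f], target))
        for members in clusters.values()
    )


# Toy data: features 0 and 1 are nearly redundant, feature 2 is unrelated.
columns = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 13],
    [1, 0, 1, 0, 1, 0],
]
target = [0, 0, 0, 1, 1, 1]
selected = select_features(columns, target)
print(selected)  # [0, 2]
```

    On the toy data, the two redundant features fall into one cluster and only the one more correlated with the target survives, while the unrelated feature forms its own cluster.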

    IDENTIFICATION OF SIGNIFICANT FEATURES USING RANDOM FOREST FOR HIGH DIMENSIONAL MICROARRAY DATA

    Feature subset selection for microarray data aims at reducing the number of genes so that useful information can be extracted from the samples. At the same time, selecting the relevant genes (features) from the high-dimensional data can improve the classification accuracy of the learning algorithm. This paper proposes a feature selection algorithm suited to high-dimensional, small-sample-size microarray data. Feature selection is performed in two phases. In the first phase, Random Forest is used to identify the importance of each feature, so that features with high relevance can be given priority over less relevant ones. In the second phase, feature clustering is performed around the relevant features to yield the reduced feature set. A statistical method is used to create the clusters, which helps identify the genes that specifically represent the disease. The effectiveness of the proposed algorithm has been compared with three state-of-the-art feature selection algorithms, viz. Fast Correlation-Based Filter (FCBF), a Fast Clustering-Based Feature Selection Algorithm (FAST), and Random Forest (RF), on nine real-world cancer microarray datasets. Empirically, the algorithms have been evaluated through three well-known classifiers, viz. the probability-based Naïve Bayes, the tree-based C4.5, and the instance-based IB1. The results show that the proposed algorithm can help find a smaller set of features for cancer microarray datasets with better classification accuracy.

    Spatial Random Sampling: A Structure-Preserving Data Sketching Tool

    Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from less-populated data clusters. Also, adaptive sampling can often provide accurate low-rank approximations, yet may fall short of producing descriptive data sketches, especially when the cluster centers are linearly dependent. Motivated by that, this paper introduces a novel randomized column sampling tool dubbed Spatial Random Sampling (SRS), in which data points are sampled based on their proximity to randomly sampled points on the unit sphere. The most compelling feature of SRS is that the probability of sampling from a given data cluster is proportional to the surface area the cluster occupies on the unit sphere, independent of the size of the cluster population. Although it is fully randomized, SRS is shown to provide descriptive and balanced data representations. The proposed idea addresses a pressing need in data science and holds potential to inspire many novel approaches for analysis of big data.
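
    The sampling idea in this abstract can be illustrated with a minimal sketch. It assumes that "proximity" means the largest inner product between a random unit direction and the data points after they are projected onto the unit sphere, and that all data points are nonzero; the abstract does not spell out these details, and the function names and toy data are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a nonzero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spatial_random_sample(points, n_samples, seed=0):
    """Return indices of points nearest to random directions on the sphere."""
    rng = random.Random(seed)
    unit = [normalize(p) for p in points]
    dim = len(points[0])
    chosen = []
    for _ in range(n_samples):
        # A normalized Gaussian vector is uniformly distributed on the sphere.
        direction = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
        # Assumed proximity rule: pick the point with the largest inner product.
        best = max(
            range(len(unit)),
            key=lambda i: sum(a * b for a, b in zip(unit[i], direction)),
        )
        chosen.append(best)
    return chosen


# Toy data: a crowded cluster (indices 0-49) and a sparse one (indices 50-51).
crowded = [(1.0, 0.01 * k) for k in range(50)]
sparse = [(0.01, 1.0), (0.02, 1.0)]
sample = spatial_random_sample(crowded + sparse, n_samples=200)
sparse_share = sum(1 for i in sample if i >= 50) / len(sample)
print(sparse_share)
```

    In this toy setup the sparse cluster holds only 2 of the 52 points, yet its share of the samples depends on the arc of the circle it is closest to rather than on its population, which is the property the abstract highlights.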

    Randomized Dimensionality Reduction for k-means Clustering

    We study the topic of dimensionality reduction for k-means clustering. Dimensionality reduction encompasses the union of two approaches: feature selection and feature extraction. A feature selection based algorithm for k-means clustering selects a small subset of the input features and then applies k-means clustering on the selected features. A feature extraction based algorithm for k-means clustering constructs a small set of new artificial features and then applies k-means clustering on the constructed features. Despite the significance of k-means clustering as well as the wealth of heuristic methods addressing it, provably accurate feature selection methods for k-means clustering are not known. On the other hand, two provably accurate feature extraction methods for k-means clustering are known in the literature; one is based on random projections and the other is based on the singular value decomposition (SVD). This paper makes further progress towards a better understanding of dimensionality reduction for k-means clustering. Namely, we present the first provably accurate feature selection method for k-means clustering and, in addition, we present two feature extraction methods. The first feature extraction method is based on random projections and it improves upon the existing results in terms of time complexity and number of features needed to be extracted. The second feature extraction method is based on fast approximate SVD factorizations and it also improves upon the existing results in terms of time complexity. The proposed algorithms are randomized and provide constant-factor approximation guarantees with respect to the optimal k-means objective value. Comment: IEEE Transactions on Information Theory, to appear
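
    As a rough illustration of the "feature extraction, then k-means" pipeline described in this abstract, the sketch below projects the data with a generic random sign matrix and runs plain Lloyd iterations on the projected points. It is not the paper's construction and carries none of its guarantees: the target dimension `r`, the scaling, the farthest-point initialization, and all function names are assumptions made for the example.

```python
import math
import random


def random_sign_projection(points, r, seed=0):
    """Map d-dimensional points to r dimensions with a +/- 1/sqrt(r) matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(r)
    matrix = [[rng.choice((-scale, scale)) for _ in range(r)] for _ in range(d)]
    return [
        [sum(p[i] * matrix[i][j] for i in range(d)) for j in range(r)]
        for p in points
    ]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data: two well-separated groups of five points in 50 dimensions.
rng = random.Random(1)
group_a = [[rng.uniform(-0.1, 0.1) for _ in range(50)] for _ in range(5)]
group_b = [[10.0 + rng.uniform(-0.1, 0.1) for _ in range(50)] for _ in range(5)]
projected = random_sign_projection(group_a + group_b, r=10)
labels = kmeans(projected, k=2)
print(labels)
```

    Clustering happens entirely in the 10-dimensional projected space; the original 50 features are never passed to the k-means routine.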

    Dynamic feature selection for clustering high dimensional data streams

    Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, many of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric, and this is problematic in high-dimensional data, where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges, we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high-dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms, which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams: two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked.