    Clustering based feature selection using Partitioning Around Medoids (PAM)

    High-dimensional data contains a large number of features and therefore demands immense computational resources, in both space and time. Several studies indicate that not all features of high-dimensional data are relevant to the classification result, so dimensionality reduction is necessary, both to make computation feasible and to improve classifier performance. Several dimensionality reduction techniques have been proposed, spanning feature selection and feature extraction. Sequential forward selection and backward selection are greedy feature selection approaches; heuristic approaches are also applied, using the Genetic Algorithm, PSO, and the Forest Optimization Algorithm. PCA is the best-known feature extraction method; others include multidimensional scaling and linear discriminant analysis. In this work, a different approach is applied: cluster-analysis-based feature selection using Partitioning Around Medoids (PAM) clustering. Our experimental results show that the classification accuracy obtained when the feature vectors' medoids represent the original dataset is high, above 80%.
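The idea of medoid-based feature selection can be sketched as follows: treat each feature (column of the data matrix) as a point, cluster those points with a k-medoids procedure, and keep only the medoid features. This is a minimal PAM-style sketch, not the paper's implementation; all names and the toy data are illustrative.

```python
# Minimal PAM-style k-medoids over FEATURE vectors (columns), keeping the
# medoid columns as the selected features. Toy data; names are illustrative.
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_medoids(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: euclidean(p, points[m]))
            clusters[nearest].append(i)
        # Update: each cluster's new medoid minimises total in-cluster distance.
        new_medoids = []
        for members in clusters.values():
            best = min(members, key=lambda c: sum(
                euclidean(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

# Rows = samples, columns = features. Columns 0/1 are near-duplicates,
# as are columns 2/3, so two medoids suffice to represent the data.
X = [[1.0, 1.1, 5.0, 5.2],
     [2.0, 2.1, 6.0, 6.1],
     [3.0, 2.9, 7.0, 7.2]]
features = list(zip(*X))            # feature vectors (one per column)
selected = k_medoids(features, k=2)
print(selected)                     # → [0, 2]
```

One medoid is kept per group of correlated features, which is the sense in which the medoids "represent the original dataset".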

    Similarity Based Entropy on Feature Selection for High Dimensional Data Classification

    The curse of dimensionality is a major problem in most classification tasks. Feature transformation and feature selection, as feature reduction methods, can be applied to overcome it. Despite its good performance, feature transformation is not easily interpretable because the physical meaning of the original features cannot be retrieved. Feature selection, by contrast, uses a simple computational process to remove unwanted features and makes the data easier to visualize and understand. We propose a new feature selection method that uses similarity-based entropy to handle the high-dimensional data problem. On six high-dimensional datasets, we computed the similarity between each feature vector and the class vector, then used the maximum similarity to calculate an entropy value for each feature. The selected features are those whose entropy exceeds the mean entropy over all features. A fuzzy k-NN classifier was implemented to evaluate the selected features. The experimental results show that the proposed method handles the high-dimensional data problem with an average accuracy of 80.5%.
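The paper's similarity-based entropy measure is more involved than can be shown here, but the selection rule itself is simple: score each feature against the class vector and keep the features scoring above the mean. The sketch below uses absolute cosine similarity as a stand-in score (an assumption, not the paper's exact measure).

```python
# Sketch of "keep features scoring above the mean" selection. The score here
# is |cosine similarity| between each feature column and the class vector —
# a stand-in for the paper's similarity-based entropy, chosen for brevity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

X = [[1, 0, 3],
     [2, 0, 1],
     [3, 1, 2],
     [4, 1, 0]]
y = [0, 0, 1, 1]                     # class vector

columns = list(zip(*X))              # feature vectors (one per column)
scores = [abs(cosine(col, y)) for col in columns]
mean_score = sum(scores) / len(scores)
selected = [j for j, s in enumerate(scores) if s > mean_score]
print(selected)                      # → [0, 1]
```

Feature 1 matches the class vector exactly and feature 0 tracks it closely, so both score above the mean; the weakly related feature 2 is dropped.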

    Efficient Feature Subset Selection Algorithm for High Dimensional Data

    Feature selection addresses the dimensionality problem by removing irrelevant and redundant features, but existing feature selection algorithms take considerable time to obtain a feature subset for high-dimensional data. This paper proposes IFSA (Information gain based Feature Selection Algorithm), a feature selection algorithm for high-dimensional data that produces an optimal feature subset efficiently and improves the computational performance of learning algorithms. IFSA works in two stages: first, a filter is applied to the dataset; second, a small feature subset is produced using the information gain measure. Extensive experiments compare the proposed algorithm with other methods using two classifiers (Naive Bayes and IBk) on microarray and text datasets. The results demonstrate that IFSA not only produces a highly selective feature subset efficiently but also improves classifier performance.
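The information gain measure at the core of IFSA-style filters is the reduction in class entropy obtained by conditioning on a feature. A minimal sketch for discrete features (function names are illustrative, not from the paper):

```python
# Information gain IG(f) = H(y) - H(y | f) for a discrete feature f.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    n = len(labels)
    conditional = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

y = [0, 0, 1, 1]
f_good = [0, 0, 1, 1]   # perfectly predictive of y → IG = H(y) = 1 bit
f_bad  = [0, 1, 0, 1]   # independent of y          → IG = 0
print(information_gain(f_good, y), information_gain(f_bad, y))  # → 1.0 0.0
```

A filter then ranks features by this score and keeps the top-scoring subset.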

    A new approach for feature extraction from functional MR images

    Functional MR images consist of very high-dimensional data containing thousands of voxels, even for a single subject, so data reduction is indispensable for classifying these three-dimensional images. In this study, the first step of data reduction applied first-level statistical analysis to the fMRI data, producing brain maps of each subject for feature extraction; in the second step, feature selection was applied to the brain maps. In the feature selection method commonly used in fMRI classification studies, known as the active method, the intensity values of all brain voxels are ranked from high to low and some of these features are presented to the classifier; however, the location information of the voxels is lost. This study presents a new feature extraction method for fMRI classification: active voxels are used as features by treating the three-dimensional brain maps slice by slice. Because functional MR images form large datasets, the selected features were further reduced by Principal Component Analysis before the voxel intensity values were presented to the classifiers. A classification accuracy of 83.9% was obtained with a kNN classifier using the proposed slice-based feature extraction method, showing that slice-based feature extraction improves classification.
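The final PCA reduction step described above can be sketched with a standard SVD-based projection. The shapes and data below are toy values standing in for selected voxel intensities, not the study's actual pipeline.

```python
# Minimal PCA via SVD: center the features, then project onto the leading
# principal directions. Toy data stands in for selected voxel intensities.
import numpy as np

def pca_reduce(X, n_components):
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))                     # 10 "subjects", 50 voxel features
Z = pca_reduce(X, 3)
print(Z.shape)                                    # → (10, 3)
```

The reduced matrix `Z` is what would be handed to the kNN classifier in place of the raw voxel intensities.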

    Selection of online Features and its Application

    Online feature selection is an important concept in data mining. Batch learning is the most commonly used setting for feature selection, but online learning is a more efficient and scalable machine learning method. Most existing studies of online learning assume the learner can access the data for all features; accessing all of it becomes a problem with high-dimensional data. To avoid this limitation, the proposed system allows an online learner to operate a classifier over a fixed, small number of features. The key challenge of online feature selection (OFS) is then how to make accurate predictions using only this small number of active features. We develop novel online feature selection algorithms that handle a variety of OFS tasks, in both supervised and semi-supervised settings with labeled and unlabeled data, for full and partial inputs. The approach thus provides an efficient, scalable way for users to access data online.
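The "fixed, small number of features" constraint can be illustrated with a perceptron-style learner that truncates its weight vector to a budget of B entries after every update, so at most B features are ever active. This is a generic budgeted-OFS sketch, not the paper's algorithm.

```python
# Budgeted online feature selection sketch: a mistake-driven perceptron that,
# after each update, zeroes all but the B largest-magnitude weights.
def truncate(w, B):
    keep = set(sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:B])
    return [w[j] if j in keep else 0.0 for j in range(len(w))]

def online_fs(stream, d, B, lr=1.0):
    w = [0.0] * d
    for x, y in stream:                                    # y in {-1, +1}
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            w = truncate(w, B)                             # enforce the budget
    return w

stream = [([1.0, 0.0, 0.1], 1), ([0.0, 1.0, 0.1], -1)] * 3
w = online_fs(stream, d=3, B=2)
print(w)  # → [1.0, -1.0, 0.0]: at most B=2 features stay active
```

The truncation is what keeps the classifier's memory footprint fixed regardless of the ambient dimensionality.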

    On feature selection protocols for very low-sample-size data

    High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely used feature selection protocol for this type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods, and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and we therefore propose an alternative protocol that avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) than that returned by the currently favoured protocol. Funding: project RPG-2015-188 funded by The Leverhulme Trust, UK, and project TIN2015-67534-P (MINECO/FEDER, UE) funded by the Ministerio de Economía y Competitividad of the Spanish Government and the European Union FEDER fund.
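The "single cross-validation loop" protocol can be sketched generically: feature selection runs inside each fold, seeing only that fold's training data, so the held-out instances never influence which features are chosen. The selector and classifier below are toy stand-ins (largest class-mean gap, nearest centroid), not the paper's models.

```python
# Single-loop protocol: feature selection INSIDE each cross-validation fold.
def cross_val_protocol(X, y, k, select, fit, predict):
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    correct = 0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in set(test_idx)]
        Xtr = [X[i] for i in train_idx]
        ytr = [y[i] for i in train_idx]
        feats = select(Xtr, ytr)          # selection sees training data only
        model = fit([[x[j] for j in feats] for x in Xtr], ytr)
        for i in test_idx:
            correct += predict(model, [X[i][j] for j in feats]) == y[i]
    return correct / n

def select(X, y):
    # Toy selector: keep the one feature with the largest class-mean gap.
    def gap(j):
        a = [x[j] for x, l in zip(X, y) if l == 0]
        b = [x[j] for x, l in zip(X, y) if l == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return [max(range(len(X[0])), key=gap)]

def fit(X, y):
    # Toy nearest-centroid classifier: store one centroid per class.
    d = len(X[0])
    c0 = [sum(x[j] for x, l in zip(X, y) if l == 0) / y.count(0) for j in range(d)]
    c1 = [sum(x[j] for x, l in zip(X, y) if l == 1) / y.count(1) for j in range(d)]
    return c0, c1

def predict(model, x):
    c0, c1 = model
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1

# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 3.0], [0.1, 1.0], [0.2, 2.0], [1.0, 2.0], [1.1, 3.0], [1.2, 1.0]]
y = [0, 0, 0, 1, 1, 1]
acc = cross_val_protocol(X, y, k=3, select=select, fit=fit, predict=predict)
print(acc)  # → 1.0
```

The biased two-step protocol would instead call `select(X, y)` once on the full dataset before cross-validating, letting information from the eventual test folds leak into the chosen feature set.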