805 research outputs found

    Feature Selection for High-Dimensional Data with RapidMiner

    Feature selection is an important task in machine learning, reducing the dimensionality of learning problems by selecting a few relevant features without losing too much information. Focusing on smaller sets of features, we can learn simpler models from data that are easier to understand and to apply. In fact, simpler models are more robust to input noise and outliers, often leading to better prediction performance than models trained in higher dimensions with all features. We implement several feature selection algorithms in an extension of RapidMiner that scale well with the number of features compared to the existing feature selection operators in RapidMiner.
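
    The extension's operators are not described in the abstract; as a minimal sketch of the kind of filter-style selection that scales linearly with the number of features, the Python snippet below ranks features by absolute correlation with the target and keeps the top k. The scoring function and the choice of k are illustrative assumptions, not taken from the paper.

        import numpy as np

        def top_k_by_correlation(X, y, k):
            # Rank features by absolute Pearson correlation with the target
            # and keep the k best: one pass over each column, so the cost
            # grows linearly with the number of features.
            yc = y - y.mean()
            scores = np.empty(X.shape[1])
            for j in range(X.shape[1]):
                xc = X[:, j] - X[:, j].mean()
                denom = np.linalg.norm(xc) * np.linalg.norm(yc)
                scores[j] = abs(xc @ yc) / denom if denom > 0 else 0.0
            return np.argsort(scores)[::-1][:k]

        # Toy usage: feature 3 drives the target, so it should rank first.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 1000))
        y = X[:, 3] + 0.1 * rng.normal(size=200)
        print(top_k_by_correlation(X, y, 5))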

    Feature selection for high dimensional data: An evolutionary filter approach.

    Problem statement: Feature selection is a task of crucial importance for the application of machine learning in various domains. In addition, the recent increase in data dimensionality poses a severe challenge to many existing feature selection approaches with respect to efficiency and effectiveness. As an example, the genetic algorithm is an effective search algorithm that lends itself directly to feature selection; however, this direct application is hindered by the recent increase in data dimensionality. Adapting the genetic algorithm to cope with high-dimensional data therefore becomes increasingly appealing. Approach: In this study, we propose an adapted version of the genetic algorithm that can be applied to feature selection in high-dimensional data. The proposed approach is based essentially on a variable-length representation scheme and a set of modified and newly proposed genetic operators. To assess the effectiveness of the proposed approach, we applied it to cue phrase selection and compared its performance with a number of ranking approaches commonly applied to this task. Results and Conclusion: The results provide experimental evidence of the effectiveness of the proposed approach for feature selection in high-dimensional data.
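
    The paper's modified operators are not given in the abstract; the sketch below only illustrates the core idea of a variable-length representation, where each chromosome is a set of feature indices that can grow or shrink rather than a fixed-length bit string over all features. The fitness function, rates and population sizes here are illustrative assumptions.

        import random

        def evolve_feature_sets(n_features, fitness, pop_size=30, gens=50,
                                max_len=20, mut_rate=0.2):
            # Genetic search over variable-length feature subsets. Each
            # chromosome is a set of indices, avoiding the huge fixed-length
            # bit strings that make plain GAs impractical in high dimensions.
            def random_individual():
                return set(random.sample(range(n_features),
                                         random.randint(1, max_len)))

            def crossover(a, b):
                # Child is a random slice of the parents' combined features.
                union = list(a | b)
                keep = random.randint(1, min(max_len, len(union)))
                return set(random.sample(union, keep))

            def mutate(ind):
                ind = set(ind)
                if random.random() < mut_rate and len(ind) > 1:
                    ind.remove(random.choice(list(ind)))   # drop a feature
                if random.random() < mut_rate and len(ind) < max_len:
                    ind.add(random.randrange(n_features))  # add a feature
                return ind

            pop = [random_individual() for _ in range(pop_size)]
            for _ in range(gens):
                pop.sort(key=fitness, reverse=True)
                survivors = pop[: pop_size // 2]
                children = [mutate(crossover(*random.sample(survivors, 2)))
                            for _ in range(pop_size - len(survivors))]
                pop = survivors + children
            return max(pop, key=fitness)

        # Toy fitness: reward hitting hidden relevant features, penalize size.
        relevant = {3, 17, 42}
        best = evolve_feature_sets(1000,
                                   lambda s: len(s & relevant) - 0.01 * len(s))
        print(sorted(best))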

    Similarity Based Entropy on Feature Selection for High Dimensional Data Classification

    The curse of dimensionality is a major problem in most classification tasks. Feature transformation and feature selection, as feature reduction methods, can be applied to overcome this problem. Despite its good performance, feature transformation is not easily interpretable because the physical meaning of the original features cannot be retrieved. Feature selection, on the other hand, with its simple computational process, is able to remove unwanted features and visualize the data to facilitate data understanding. We propose a new feature selection method using similarity-based entropy to overcome the high-dimensional data problem. Using 6 high-dimensional datasets, we compute the similarity between each feature vector and the class vector. We then find the maximum similarity, which is used to calculate the entropy value of each feature. The selected features are those whose entropy is higher than the mean entropy of all features. A fuzzy k-NN classifier was implemented to evaluate the selected features. The experimental results show that the proposed method is able to deal with the high-dimensional data problem, achieving an average accuracy of 80.5%.
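
    The abstract does not define the similarity measure, so the sketch below substitutes plain histogram entropy per feature while keeping the paper's selection rule of retaining features whose entropy exceeds the mean entropy over all features. The binning and the stand-in entropy are assumptions, not the paper's method.

        import numpy as np

        def entropy_above_mean_selection(X, n_bins=10):
            # Select features whose discretized entropy exceeds the mean
            # entropy over all features.
            def column_entropy(col):
                counts, _ = np.histogram(col, bins=n_bins)
                p = counts[counts > 0] / counts.sum()
                return -(p * np.log2(p)).sum()

            entropies = np.apply_along_axis(column_entropy, 0, X)
            return np.where(entropies > entropies.mean())[0]

        # Toy usage on random data.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(100, 50))
        print(entropy_above_mean_selection(X))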

    Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains

    Selecting a subset of relevant features is crucial to the analysis of high-dimensional datasets coming from a number of application domains, such as biomedical data, document and image analysis. Since no single selection algorithm seems to be capable of ensuring optimal results in terms of both predictive performance and stability (i.e. robustness to changes in the input data), researchers have increasingly explored the effectiveness of "ensemble" approaches that combine different selectors. While interesting proposals have been reported in the literature, most of them have so far been evaluated in a limited number of settings (e.g. with data from a single domain and in conjunction with specific selection approaches), leaving unanswered important questions about the large-scale applicability and utility of ensemble feature selection. To contribute to the field, this work presents an empirical study which encompasses different kinds of selection algorithms (filters and embedded methods, univariate and multivariate techniques) and different application domains. Specifically, we consider 18 classification tasks with heterogeneous characteristics (in terms of number of classes and instances-to-features ratio) and experimentally evaluate, for feature subsets of different cardinalities, the extent to which an ensemble approach turns out to be more robust than a single selector, thus providing useful insight for both researchers and practitioners.
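
    As a small illustration of the ensemble idea the study evaluates, the following sketch aggregates the rankings produced by several base selectors by mean rank; the two toy univariate scorers stand in for the filters and embedded methods actually compared in the paper.

        import numpy as np

        def ensemble_rank(X, y, scorers, k):
            # Combine several feature rankings by average rank. Each scorer
            # maps (X, y) to one score per feature (higher = more relevant);
            # the ensemble keeps the k features with the lowest mean rank.
            ranks = []
            for score in scorers:
                s = score(X, y)
                # argsort of argsort turns scores into ranks (0 = best)
                ranks.append(np.argsort(np.argsort(-s)))
            return np.argsort(np.mean(ranks, axis=0))[:k]

        # Two toy univariate scorers standing in for real filters.
        corr = lambda X, y: np.abs(np.corrcoef(X.T, y)[-1, :-1])
        var = lambda X, y: X.var(axis=0)

        rng = np.random.default_rng(2)
        X = rng.normal(size=(150, 40))
        y = X[:, 7] + 0.1 * rng.normal(size=150)
        print(ensemble_rank(X, y, [corr, var], k=5))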

    Clustering and Classification with Feature Selection for High-Dimensional Data

    In this dissertation, we discuss several methods for clustering and classification with feature selection for high-dimensional data. In the first part, we focus on the problem of biclustering, which is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk, and introduce a novel algorithm that alternately applies an adapted version of the k-means clustering algorithm between columns and rows. In the second part, we develop a new classification method based on nearest centroid, using disjoint sets of features. We present a simple algorithm based on adapted k-means clustering that can find the subsets of features used in our method and extend the algorithm to perform feature selection. In the third part, we study the problem of classification with feature selection, where the features are selected iteratively in a supervised way to optimize predictive performance. We propose to use beam search to perform feature selection, which can be viewed as a generalization of forward selection. In all parts of the dissertation, we evaluate and compare the performance of our methods to other related methods on both simulated data and real-world datasets.
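
    Of the three parts, the beam-search selector is the easiest to sketch: it generalizes forward selection by keeping the best beam_width candidate subsets at each step instead of a single one. The evaluation function, beam width and size limit below are illustrative stand-ins for a cross-validated predictive score, not the dissertation's setup.

        def beam_search_select(n_features, evaluate, beam_width=3, max_size=10):
            # Like forward selection, but keep the beam_width best subsets
            # of each size instead of only the single best extension.
            beam = [()]                        # start from the empty subset
            best, best_score = (), float("-inf")
            for _ in range(max_size):
                candidates = {tuple(sorted(set(s) | {f}))
                              for s in beam
                              for f in range(n_features) if f not in s}
                beam = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
                if evaluate(beam[0]) > best_score:
                    best, best_score = beam[0], evaluate(beam[0])
            return best

        # Toy objective: hidden relevant features with a small size penalty.
        relevant = {2, 5, 11}
        print(beam_search_select(20,
                                 lambda s: len(set(s) & relevant) - 0.05 * len(s)))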

    A scalable implementation of information theoretic feature selection for high dimensional data

    With the growth of high-dimensional data, feature selection is a vital component of machine learning as well as an important stand-alone data analytics tool. Without it, the computational cost of big data analytics can become unmanageable, and spurious correlations and noise can reduce the accuracy of any results. Feature selection removes irrelevant and redundant information, leading to faster, more reliable data analysis. Feature selection techniques based on information theory are among the fastest known, and the Manchester AnalyticS Toolkit (MAST) provides an efficient, parallel and scalable implementation of these methods. This paper considers a number of data structures for storing the frequency counters that underpin MAST. We show that preprocessing the data to reduce the number of zero-valued counters in an array structure results in an order-of-magnitude reduction in both memory usage and execution time compared to state-of-the-art structures that use explicit mappings to avoid zero-valued counters. We also describe a number of parallel processing techniques that enable MAST to scale linearly with the number of processors, even on NUMA architectures. MAST targets scale-up servers rather than scale-out clusters, and we show that it performs orders of magnitude faster than existing tools. Moreover, we show that MAST is 3.5 times faster than a scale-out solution built for Spark running on the same server. As an example of MAST's performance, we were able to process a dataset of 100 million examples and 100,000 features in under 10 minutes on a four-socket server, with each socket containing an 8-core Intel Xeon E5-4620 processor.
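
    MAST's counter preprocessing and parallelism are not reproduced here; the sketch below only shows the basic building block such tools rely on, mutual information between a discrete feature and the class computed from a flat array of joint frequency counters. The data and discretization are illustrative assumptions.

        import numpy as np

        def mutual_information(x, y):
            # I(X;Y) for two discrete integer vectors, accumulated in a
            # flat array of joint frequency counters.
            nx, ny = x.max() + 1, y.max() + 1
            joint = np.zeros(nx * ny)
            np.add.at(joint, x * ny + y, 1)        # flat joint counter array
            joint = joint.reshape(nx, ny) / len(x)
            px, py = joint.sum(axis=1), joint.sum(axis=0)
            nz = joint > 0                         # skip zero-valued counters
            return (joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz])).sum()

        # Toy usage: a feature correlated with a binary class.
        rng = np.random.default_rng(3)
        y = rng.integers(0, 2, size=10000)
        x = (y + rng.integers(0, 2, size=10000)) % 3
        print(mutual_information(x, y))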