459,249 research outputs found

    How complex is the microarray dataset? A novel data complexity metric for biological high-dimensional microarray data

    Full text link
    Data complexity analysis quantifies the hardness of constructing a predictive model on a given dataset. However, the effectiveness of existing data complexity measures can be challenged by the existence of irrelevant features and feature interactions in biological micro-array data. We propose a novel data complexity measure, depth, that leverages an evolutionary inspired feature selection algorithm to quantify the complexity of micro-array data. By examining feature subsets of varying sizes, the approach offers a novel perspective on data complexity analysis. Unlike traditional metrics, depth is robust to irrelevant features and effectively captures complexity stemming from feature interactions. On synthetic micro-array data, depth outperforms existing methods in robustness to irrelevant features and identifying complexity from feature interactions. Applied to case-control genotype and gene-expression micro-array datasets, the results reveal that a single feature of gene-expression data can account for over 90% of the performance of multi-feature model, confirming the adequacy of the commonly used differentially expressed gene (DEG) feature selection method for the gene expression data. Our study also demonstrates that constructing predictive models for genotype data is harder than gene expression data. The results in this paper provide evidence for the use of interpretable machine learning algorithms on microarray data

    From Review to Rating: Exploring Dependency Measures for Text Classification

    Full text link
    Various text analysis techniques exist, which attempt to uncover unstructured information from text. In this work, we explore using statistical dependence measures for textual classification, representing text as word vectors. Student satisfaction scores on a 3-point scale and their free text comments written about university subjects are used as the dataset. We have compared two textual representations: a frequency word representation and term frequency relationship to word vectors, and found that word vectors provide a greater accuracy. However, these word vectors have a large number of features which aggravates the burden of computational complexity. Thus, we explored using a non-linear dependency measure for feature selection by maximizing the dependence between the text reviews and corresponding scores. Our quantitative and qualitative analysis on a student satisfaction dataset shows that our approach achieves comparable accuracy to the full feature vector, while being an order of magnitude faster in testing. These text analysis and feature reduction techniques can be used for other textual data applications such as sentiment analysis.Comment: 8 page

    Distributed Spacing Stochastic Feature Selection and its Application to Textile Classification

    Get PDF
    Many situations require the need to quickly and accurately locate dismounted individuals in a variety of environments. In conjunction with other dismount detection techniques, being able to detect and classify clothing (textiles) provides a more comprehensive and complete dismount characterization capability. Because textile classification depends on distinguishing between different material types, hyperspectral data, which consists of several hundred spectral channels sampled from a continuous electromagnetic spectrum, is used as a data source. However, a hyperspectral image generates vast amounts of information and can be computationally intractable to analyze. A primary means to reduce the computational complexity is to use feature selection to identify a reduced set of features that effectively represents a specific class. While many feature selection methods exist, applying them to continuous data results in closely clustered feature sets that offer little redundancy and fail in the presence of noise. This dissertation presents a novel feature selection method that limits feature redundancy and improves classification. This method uses a stochastic search algorithm in conjunction with a heuristic that combines measures of distance and dependence to select features. Comparison testing between the presented feature selection method and existing methods uses hyperspectral data and image wavelet decompositions. The presented method produces feature sets with an average correlation of 0.40-0.54. This is significantly lower than the 0.70-0.99 of the existing feature selection methods. In terms of classification accuracy, the feature sets produced outperform those of other methods, to a significance of 0.025, and show greater robustness under noise representative of a hyperspectral imaging system

    Feature selection and hierarchical classifier design with applications to human motion recognition

    Get PDF
    The performance of a classifier is affected by a number of factors including classifier type, the input features and the desired output. This thesis examines the impact of feature selection and classification problem division on classification accuracy and complexity. Proper feature selection can reduce classifier size and improve classifier performance by minimizing the impact of noisy, redundant and correlated features. Noisy features can cause false association between the features and the classifier output. Redundant and correlated features increase classifier complexity without adding additional information. Output selection or classification problem division describes the division of a large classification problem into a set of smaller problems. Problem division can improve accuracy by allocating more resources to more difficult class divisions and enabling the use of more specific feature sets for each sub-problem. The first part of this thesis presents two methods for creating feature-selected hierarchical classifiers. The feature-selected hierarchical classification method jointly optimizes the features and classification tree-design using genetic algorithms. The multi-modal binary tree (MBT) method performs the class division and feature selection sequentially and tolerates misclassifications in the higher nodes of the tree. This yields a piecewise separation for classes that cannot be fully separated with a single classifier. Experiments show that the accuracy of MBT is comparable to other multi-class extensions, but with lower test time. Furthermore, the accuracy of MBT is significantly higher on multi-modal data sets. The second part of this thesis focuses on input feature selection measures. A number of filter-based feature subset evaluation measures are evaluated with the goal of assessing their performance with respect to specific classifiers. Although there are many feature selection measures proposed in literature, it is unclear which feature selection measures are appropriate for use with different classifiers. Sixteen common filter-based measures are tested on 20 real and 20 artificial data sets, which are designed to probe for specific feature selection challenges. The strengths and weaknesses of each measure are discussed with respect to the specific feature selection challenges in the artificial data sets, correlation with classifier accuracy and their ability to identify known informative features. The results indicate that the best filter measure is classifier-specific. K-nearest neighbours classifiers work well with subset-based RELIEF, correlation feature selection or conditional mutual information maximization, whereas Fisher's interclass separability criterion and conditional mutual information maximization work better for support vector machines. Based on the results of the feature selection experiments, two new filter-based measures are proposed based on conditional mutual information maximization, which performs well but cannot identify dependent features in a set and does not include a check for correlated features. Both new measures explicitly check for dependent features and the second measure also includes a term to discount correlated features. Both measures correctly identify known informative features in the artificial data sets and correlate well with classifier accuracy. The final part of this thesis examines the use of feature selection for time-series data by using feature selection to determine important individual time windows or key frames in the series. Time-series feature selection is used with the MBT algorithm to create classification trees for time-series data. The feature selected MBT algorithm is tested on two human motion recognition tasks: full-body human motion recognition from joint angle data and hand gesture recognition from electromyography data. Results indicate that the feature selected MBT is able to achieve high classification accuracy on the time-series data while maintaining a short test time

    Local feature selection for multiple instance learning with applications.

    Get PDF
    Feature selection is a data processing approach that has been successfully and effectively used in developing machine learning algorithms for various applications. It has been proven to effectively reduce the dimensionality of the data and increase the accuracy and interpretability of machine learning algorithms. Conventional feature selection algorithms assume that there is an optimal global subset of features for the whole sample space. Thus, only one global subset of relevant features is learned. An alternative approach is based on the concept of Local Feature Selection (LFS), where each training sample can have its own subset of relevant features. Multiple Instance Learning (MIL) is a variation of traditional supervised learning, also known as single instance learning. In MIL, each object is represented by a set of instances, or a bag. While bags are labeled, the labels of their instances are unknown. The ambiguity of the instance labels makes the feature selection for MIL challenging. Although feature selection in traditional supervised learning has been researched extensively, there are only a few methods for the MIL framework. Moreover, localized feature selection for MIL has not been researched. This dissertation focuses on developing a local feature selection method for the MIL framework. Our algorithm, called Multiple Instance Local Salient Feature Selection (MI-LSFS), searches the feature space to find the relevant features within each bag. We also propose a new multiple instance classification algorithm, called MILES-LFS, that integrates information learned by MI-LSFS during the feature selection process to identify a reduced subset of representative bags and instances. We show that using a more focused subset of prototypes can improve the performance while significantly reducing the computational complexity. Other applications of the proposed MI-LSFS include a new method that uses our MI-LSFS algorithm to explore and investigate the features learned by a Convolutional Neural Network (CNN) model; a visualization method for CNN models, called Gradient-weighted Sample Activation Map (Grad-SAM), that uses the locally learned features of each sample to highlight their relevant and salient parts, and a novel explanation method, called Classifier Explanation by Local Feature Selection (CE-LFS), to explain the decisions of trained models. The proposed MI-LSFS and its applications are validated using several synthetic and real data sets. We report and compare quantitative measures such as Rand Index, Area Under Curve (AUC), and accuracy. We also provide qualitative measures by visualizing and interpreting the selected features and their effects

    A hybrid feature selection approach for the early diagnosis of Alzheimer's disease

    Get PDF
    Objective. Recently, significant advances have been made in the early diagnosis of Alzheimer’s disease from EEG. However, choosing suitable measures is a challenging task. Among other measures, frequency Relative Power and loss of complexity have been used with promising results. In the present study we investigate the early diagnosis of AD using synchrony measures and frequency Relative Power on EEG signals, examining the changes found in different frequency ranges. Approach. We first explore the use of a single feature for computing the classification rate, looking for the best frequency range. Then, we present a multiple feature classification system that outperforms all previous results using a feature selection strategy. These two approaches are tested in two different databases, one containing MCI and healthy subjects (patients age: 71.9 ± 10.2, healthy subjects age: 71.7 ± 8.3), and the other containing Mild AD and healthy subjects (patients age: 77.6 ± 10.0; healthy subjects age: 69.4± 11.5). Main Results. Using a single feature to compute classification rates we achieve a performance of 78.33% for the MCI data set and of 97.56 % for Mild AD. Results are clearly improved using the multiple feature classification, where a classification rate of 95% is found for the MCI data set using 11 features, and 100% for the Mild AD data set using 4 features. Significance. The new features selection method described in this work may be a reliable tool that could help to design a realistic system that does not require prior knowledge of a patient's status. With that aim, we explore the standardization of features for MCI and Mild AD data sets with promising results

    Map-elites algorithm for features selection problem

    Get PDF
    In the High-dimensional data analysis there are several challenges in the fields of machine learning and data mining. Typically, feature selection is considered as a combinatorial optimization problem which seeks to remove irrelevant and redundant data by reducing computation time and improve learning measures. Given the complexity of this problem, we propose a novel Map-Elites based Algorithm that determines the minimum set of features maximizing learning accuracy simultaneously. Experimental results, on several data based from real scenarios, show the effectiveness of the proposed algorithm.CONACYT – Consejo Nacional de Ciencia y TecnologíaPROCIENCI

    Selecting Features by their Resilience to the Curse of Dimensionality

    Full text link
    Real-world datasets are often of high dimension and effected by the curse of dimensionality. This hinders their comprehensibility and interpretability. To reduce the complexity feature selection aims to identify features that are crucial to learn from said data. While measures of relevance and pairwise similarities are commonly used, the curse of dimensionality is rarely incorporated into the process of selecting features. Here we step in with a novel method that identifies the features that allow to discriminate data subsets of different sizes. By adapting recent work on computing intrinsic dimensionalities, our method is able to select the features that can discriminate data and thus weaken the curse of dimensionality. Our experiments show that our method is competitive and commonly outperforms established feature selection methods. Furthermore, we propose an approximation that allows our method to scale to datasets consisting of millions of data points. Our findings suggest that features that discriminate data and are connected to a low intrinsic dimensionality are meaningful for learning procedures.Comment: 16 pages, 1 figure, 2 table

    Heuristic ensembles of filters for accurate and reliable feature selection

    Get PDF
    Feature selection has become increasingly important in data mining in recent years. However, the accuracy and stability of feature selection methods vary considerably when used individually, and yet no rule exists to indicate which one should be used for a particular dataset. Thus, an ensemble method that combines the outputs of several individual feature selection methods appears to be a promising approach to address the issue and hence is investigated in this research. This research aims to develop an effective ensemble that can improve the accuracy and stability of the feature selection. We proposed a novel heuristic ensemble of filters (HEF). It combines two types of filters: subset filters and ranking filters with a heuristic consensus algorithm in order to utilise the strength of each type. The ensemble is tested on ten benchmark datasets and its performance is evaluated by two stability measures and three classifiers. The experimental results demonstrate that HEF improves the stability and accuracy of the selected features and in most cases outperforms the other ensemble algorithms, individual filters and the full feature set. The research on the HEF algorithm is extended in several dimensions; including more filter members, three novel schemes of mean rank aggregation with partial lists, and three novel schemes for a weighted heuristic ensemble of filters. However, the experimental results demonstrate that adding weight to filters in HEF does not achieve the expected improvement in accuracy, but increases time and space complexity, and clearly decreases stability. Therefore, the core ensemble algorithm (HEF) is demonstrated to be not just simpler but also more reliable and consistent than the later more complicated and weighted ensembles. In addition, we investigated how to use data in feature selection, using ALL or PART of it. Systematic experiments with thirty five synthetic and benchmark real-world datasets were carried out
    • …
    corecore