
    Exploiting the accumulated evidence for gene selection in microarray gene expression data

    Machine learning methods have lately been applied with significant success to multidisciplinary problems in the field of cancer classification using microarray gene expression data. Feature subset selection methods can play an important role in the modelling process, since these tasks are characterized by a large number of features and few observations, making modelling a non-trivial undertaking. In this scenario, it is extremely important to select genes while taking into account their possible interactions with other gene subsets. This paper shows that, by accumulating the evidence in favour of (or against) each gene along the search process, the obtained gene subsets may constitute better solutions in terms of predictive accuracy, subset size, or both. The proposed technique is extremely simple and adds negligible computational overhead.
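    The abstract does not spell out the accumulation rule, but the general idea of crediting every member of a well-scoring subset during the search can be sketched as follows (the function names and scoring scheme are illustrative assumptions, not the paper's method):

```python
import random

def evidence_guided_selection(n_features, evaluate, n_iters=200, subset_size=5, seed=0):
    # Accumulate per-feature "evidence": every member of a sampled subset
    # is credited with that subset's score, so features that keep showing
    # up in good subsets rise to the top of the ranking.
    rng = random.Random(seed)
    evidence = [0.0] * n_features
    for _ in range(n_iters):
        subset = rng.sample(range(n_features), subset_size)
        score = evaluate(subset)  # e.g. a cross-validated accuracy estimate
        for f in subset:
            evidence[f] += score
    ranked = sorted(range(n_features), key=lambda f: evidence[f], reverse=True)
    return ranked[:subset_size]
```

    The only extra cost over a plain random search is the per-feature accumulator, which matches the abstract's claim of negligible overhead.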

    Heuristic ensembles of filters for accurate and reliable feature selection

    Feature selection has become increasingly important in data mining in recent years. However, the accuracy and stability of feature selection methods vary considerably when they are used individually, and no rule exists to indicate which one should be used for a particular dataset. An ensemble method that combines the outputs of several individual feature selection methods is therefore a promising way to address the issue, and it is the subject of this research. This research aims to develop an effective ensemble that can improve the accuracy and stability of feature selection. We propose a novel heuristic ensemble of filters (HEF). It combines two types of filters, subset filters and ranking filters, with a heuristic consensus algorithm in order to exploit the strengths of each type. The ensemble is tested on ten benchmark datasets and its performance is evaluated by two stability measures and three classifiers. The experimental results demonstrate that HEF improves the stability and accuracy of the selected features and in most cases outperforms other ensemble algorithms, individual filters, and the full feature set. The research on the HEF algorithm is then extended in several dimensions, including more filter members, three novel schemes of mean-rank aggregation with partial lists, and three novel schemes for a weighted heuristic ensemble of filters. However, the experimental results demonstrate that adding weights to the filters in HEF does not achieve the expected improvement in accuracy; instead, it increases time and space complexity and clearly decreases stability. The core ensemble algorithm (HEF) is therefore demonstrated to be not just simpler but also more reliable and consistent than the later, more complicated weighted ensembles. In addition, we investigated how to use data in feature selection, using all or part of it. Systematic experiments with thirty-five synthetic and benchmark real-world datasets were carried out.
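    The abstract does not detail HEF's consensus rule; a toy vote-based combination of hypothetical subset-filter and ranking-filter outputs might look like this (function and variable names are invented for illustration):

```python
from collections import Counter

def heuristic_ensemble(subset_selections, rankings, k):
    # Each subset filter votes for every feature it selected; each ranking
    # filter votes for its top-k features. Features are returned in order
    # of vote count, with ties broken by their best rank across rankings.
    votes = Counter()
    for selected in subset_selections:
        votes.update(selected)
    for ranking in rankings:
        votes.update(ranking[:k])
    best_rank = {}
    for ranking in rankings:
        for pos, f in enumerate(ranking):
            best_rank[f] = min(best_rank.get(f, pos), pos)
    ordered = sorted(votes, key=lambda f: (-votes[f], best_rank.get(f, len(best_rank))))
    return ordered[:k]
```

    Treating subset filters and ranking filters as equal voters is one plausible way to "utilise the strength of each type"; the thesis's actual heuristic may weight or sequence them differently.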

    Embedding Feature Selection for Large-scale Hierarchical Classification

    Large-scale Hierarchical Classification (HC) involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features, posing several big-data challenges. Feature selection, which aims to select a subset of discriminant features, is an effective strategy for dealing with the large-scale HC problem. It speeds up the training process, reduces the prediction time, and minimizes the memory requirements by compressing the total size of the learned model's weight vectors. The majority of studies have also shown feature selection to be competent and successful in improving classification accuracy by removing irrelevant features. In this work, we investigate various filter-based feature selection methods for dimensionality reduction to solve the large-scale HC problem. Our experimental evaluation on text and image datasets with varying distributions of features, classes, and instances shows up to 3× speed-up on massive datasets and up to 45% lower memory requirements for storing the weight vectors of the learned model, without any significant loss (an improvement for some datasets) in classification accuracy. Source code: https://cs.gmu.edu/~mlbio/featureselection. Comment: IEEE International Conference on Big Data (IEEE BigData 2016).
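    As an illustration of the filter-based approach the paper surveys (not the authors' exact criteria), an ANOVA-style filter score can keep the top-k features before training, shrinking each class's weight vector from length D to length k:

```python
import numpy as np

def filter_select(X, y, k):
    # Rank features by an F-like score: between-class variance of the
    # per-class feature means divided by the within-class variance.
    # This is one of many filter criteria; the paper compares several.
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    score = between / (within + 1e-12)
    return np.argsort(score)[::-1][:k]  # indices of the top-k features
```

    Because the score is computed once per feature, the filter itself is linear in the data size, which is what makes this family of methods viable at the scales the paper targets.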

    Hybrid filter-wrapper approaches for feature selection

    Over the last couple of decades, more business sectors than ever have embraced digital technologies, storing all the information they generate in databases. Moreover, with the rise of machine learning and data science, it has become economically profitable to use this data to solve real-world problems. However, as datasets grow larger, it has become increasingly difficult to determine exactly which variables are valuable for solving a given problem. This project studies the problem of feature selection, which tries to select a subset of relevant variables for a specific prediction task from the complete set of attributes. In particular, we have mostly focused on hybrid filter-wrapper algorithms, a relatively new branch of study that has seen great success on high-dimensional datasets because these methods offer a good trade-off between speed and accuracy. The project starts by explaining several important filter and wrapper methods and moves on to illustrate how several authors have combined them to form new hybrid algorithms. Moreover, we introduce a new algorithm called BWRR, which uses the popular ReliefF filter to guide a backward wrapper search. The key novelty we propose is to recompute the ReliefF rankings at several points to better guide the search. In addition, we introduce several variations of this algorithm. We have also performed extensive experimentation to test it. In the first phase, we experimented with synthetic datasets to see which factors affected performance. After that, we compared the new algorithm against the state of the art on real-world datasets.
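    A heavily simplified sketch of the BWRR idea, with plain Relief (binary classes, one nearest hit/miss) standing in for ReliefF and a caller-supplied wrapper score; the thesis's actual algorithm and parameters differ:

```python
import numpy as np

def relief_scores(X, y):
    # Simplified Relief: reward features that differ across the nearest
    # miss (other class) and agree with the nearest hit (same class).
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf  # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))
        miss = np.argmin(np.where(~same, dists, np.inf))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

def bwrr(X, y, evaluate, target, recompute_every=2):
    # Backward wrapper search guided by Relief rankings; the ranking is
    # recomputed on the surviving features every few accepted removals,
    # which is the key novelty the abstract describes.
    features = list(range(X.shape[1]))
    scores = relief_scores(X, y)
    removals = 0
    while len(features) > target and not np.isinf(scores).all():
        worst = int(np.argmin(scores))
        candidate = features[:worst] + features[worst + 1:]
        if evaluate(candidate) >= evaluate(features):  # wrapper accepts removal
            features = candidate
            removals += 1
            if removals % recompute_every == 0:
                scores = relief_scores(X[:, features], y)  # re-rank survivors
            else:
                scores = np.delete(scores, worst)
        else:
            scores[worst] = np.inf  # keep this feature; try the next weakest
    return features
```

    Re-ranking only every few removals is what keeps the hybrid cheaper than a pure wrapper while still letting the filter adapt to the shrinking subset.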

    Neuropsychological predictors of conversion from mild cognitive impairment to Alzheimer’s disease: a feature selection ensemble combining stability and predictability

    Background. Predicting progression from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is a major open issue in AD-related research. Neuropsychological assessment has proven useful in identifying MCI patients who are likely to convert to dementia. However, the large battery of neuropsychological tests (NPTs) performed in clinical practice and the limited number of training examples pose a challenge to machine learning when learning prognostic models. In this context, it is paramount to pursue approaches that effectively seek reduced sets of relevant features. Subsets of NPTs from which prognostic models can be learnt should not only be good predictors but also stable, promoting generalizable and explainable models. Methods. We propose a feature selection (FS) ensemble combining stability and predictability to choose the most relevant NPTs for prognostic prediction in AD. First, we combine the outcomes of multiple (filter and embedded) FS methods. Then, we use a wrapper-based approach optimizing both stability and predictability to compute the number of selected features. We use two large prospective studies (ADNI and the Portuguese Cognitive Complaints Cohort, CCC) to evaluate the approach and assess the predictive value of a large number of NPTs. Results. The best subsets include approximately 30 and 20 features (from the original 79 and 40) for the ADNI and CCC data, respectively, yielding stability above 0.89 and 0.95, and AUC above 0.87 and 0.82. Most NPTs selected by the proposed feature selection ensemble have been identified in the literature as strong predictors of conversion from MCI to AD. Conclusions. The FS ensemble approach was able to 1) identify subsets of stable and relevant predictors from a consensus of multiple FS methods using baseline NPTs and 2) learn reliable prognostic models of conversion from MCI to AD using these subsets of features. The machine learning models learnt from these features outperformed the models trained without FS and achieved competitive results compared to commonly used FS algorithms. Furthermore, the selected features are derived from a consensus of methods and are thus more robust, while freeing users from choosing the most appropriate FS method for their classification task. Funding: PTDC/EEI-SII/1937/2014; SFRH/BD/95846/2013; SFRH/BD/118872/2016.
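    The abstract's two-stage idea, a consensus over multiple FS methods plus a search over the subset size that trades off stability against predictability, can be sketched as follows (the stability measure, consensus rule, and `predictability` callback are illustrative assumptions, not the paper's exact choices):

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def choose_k(rankings, predictability, alpha=0.5, max_k=None):
    # `rankings` are feature rankings from several FS methods over the same
    # features. For each candidate size k, score the consensus top-k by a
    # mix of stability (mean pairwise Jaccard of the methods' top-k lists)
    # and predictability (e.g. CV AUC, supplied by the caller).
    max_k = max_k or len(rankings[0])
    mean_pos = {f: sum(r.index(f) for r in rankings) / len(rankings)
                for f in rankings[0]}
    consensus = sorted(mean_pos, key=mean_pos.get)  # consensus ranking
    best = None
    for k in range(1, max_k + 1):
        stability = sum(jaccard(r[:k], s[:k])
                        for r, s in combinations(rankings, 2))
        stability /= max(1, len(rankings) * (len(rankings) - 1) // 2)
        score = alpha * stability + (1 - alpha) * predictability(consensus[:k])
        if best is None or score > best[0]:
            best = (score, consensus[:k])
    return best[1]
```

    The `alpha` weight makes the stability/predictability trade-off explicit; the paper's wrapper optimizes both jointly rather than through a fixed linear mix.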