14 research outputs found

    Examining Swarm Intelligence-based Feature Selection for Multi-Label Classification

    Get PDF
    Multi-label classification addresses the issues that more than one class label assigns to each instance. Many real-world multi-label classification tasks are high-dimensional due to digital technologies, leading to reduced performance of traditional multi-label classifiers. Feature selection is a common and successful approach to tackling this problem by retaining relevant features and eliminating redundant ones to reduce dimensionality. There is several feature selection that is successfully applied in multi-label learning. Most of those features are wrapper methods that employ a multi-label classifier in their processes. They run a classifier in each step, which requires a high computational cost, and thus they suffer from scalability issues. To deal with this issue, filter methods are introduced to evaluate the feature subsets using information-theoretic mechanisms instead of running classifiers. This paper aims to provide a comprehensive review of different methods of feature selection presented for the tasks of multi-label classification. To this end, in this review, we have investigated most of the well-known and state-of-the-art methods. We then provided the main characteristics of the existing multi-label feature selection techniques and compared them analytically

    Gene selection and classification in autism gene expression data

    Get PDF
    Autism spectrum disorders (ASD) are neurodevelopmental disorders that are currently diagnosed on the basis of abnormal stereotyped behaviour as well as observable deficits in communication and social functioning. Although a variety of candidate genes have been attributed to the disorder, no single gene is applicable to more than 1–2% of the general ASD population. Despite extensive efforts, definitive genes that contribute to autism susceptibility have yet to be identified. The major problems in dealing with the gene expression dataset of autism include the presence of limited number of samples and large noises due to errors of experimental measurements and natural variation. In this study, a systematic combination of three important filters, namely t-test (TT), Wilcoxon Rank Sum (WRS) and Feature Correlation (COR) are applied along with efficient wrapper algorithm based on geometric binary particle swarm optimization-support vector machine (GBPSO-SVM), aiming at selecting and classifying the most attributed genes of autism. A new approach based on the criterion of median ratio, mean ratio and variance deviations is also applied to reduce the initial dataset prior to its involvement. Results showed that the most discriminative genes that were identified in the first and last selection steps concluded the presence of a repetitive gene (CAPS2), which was assigned as the most ASD risk gene. The fused result of genes subset that were selected by the GBPSO-SVM algorithm increased the classification accuracy to about 92.10%, which is higher than those reported in literature for the same autism dataset. Noticeably, the application of ensemble using random forest (RF) showed better performance compared to that of previous studies. However, the ensemble approach based on the employment of SVM as an integrator of the fused genes from the output branches of GBPSO-SVM outperformed the RF integrator. The overall improvement was ascribed to the selection strategies that were taken to reduce the dataset and the utilization of efficient wrapper based GBPSO-SVM algorithm

    Benchmark study of feature selection strategies for multi-omics data

    Get PDF
    BACKGROUND: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our knowledge, none compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics. RESULTS: The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods. CONCLUSIONS: We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, where, however, mRMR is considerably more computationally costly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04962-x

    A Hybrid PSO Based on Dynamic Clustering for Global Optimization

    Get PDF
    Particle swarm optimization is a population-based global search method, and known to suffer from premature convergence prior to discovering the true global minimizer for global optimization problems. Taking balance of local intensive exploitation and global exploration into account, a novel algorithm is presented in the paper, called dynamic clustering hybrid particle swarm optimization (DC-HPSO). In the method, particles are constantly and dynamically clustered into several groups (sub-swarms) corresponding to promising sub-regions in terms of similarity of their generalized particles. In each group, a dominant particle is chosen to take responsibility for local intensive exploitation, while the rest are responsible for exploration by maintaining diversity of the swarm. The simultaneous perturbation stochastic approximation (SPSA) is introduced into our work in order to guarantee the implementation of exploitation and the standard PSO is modified for exploration. The experimental results show the efficiency of the proposed algorithm in comparison with several other peer algorithms

    Improved Reptile Search Optimization Algorithm using Chaotic map and Simulated Annealing for Feature Selection in Medical Filed

    Get PDF
    The increased volume of medical datasets has produced high dimensional features, negatively affecting machine learning (ML) classifiers. In ML, the feature selection process is fundamental for selecting the most relevant features and reducing redundant and irrelevant ones. The optimization algorithms demonstrate its capability to solve feature selection problems. Reptile Search Algorithm (RSA) is a new nature-inspired optimization algorithm that stimulates Crocodiles’ encircling and hunting behavior. The unique search of the RSA algorithm obtains promising results compared to other optimization algorithms. However, when applied to high-dimensional feature selection problems, RSA suffers from population diversity and local optima limitations. An improved metaheuristic optimizer, namely the Improved Reptile Search Algorithm (IRSA), is proposed to overcome these limitations and adapt the RSA to solve the feature selection problem. Two main improvements adding value to the standard RSA; the first improvement is to apply the chaos theory at the initialization phase of RSA to enhance its exploration capabilities in the search space. The second improvement is to combine the Simulated Annealing (SA) algorithm with the exploitation search to avoid the local optima problem. The IRSA performance was evaluated over 20 medical benchmark datasets from the UCI machine learning repository. Also, IRSA is compared with the standard RSA and state-of-the-art optimization algorithms, including Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Grasshopper Optimization algorithm (GOA) and Slime Mould Optimization (SMO). The evaluation metrics include the number of selected features, classification accuracy, fitness value, Wilcoxon statistical test (p-value), and convergence curve. Based on the results obtained, IRSA confirmed its superiority over the original RSA algorithm and other optimized algorithms on the majority of the medical datasets
    corecore