67 research outputs found

    Benchmark study of feature selection strategies for multi-omics data

    BACKGROUND: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our knowledge, none compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics. RESULTS: The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods. 
    CONCLUSIONS: We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, where, however, mRMR is considerably more computationally costly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04962-x
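
    The three performance metrics named in the abstract above (accuracy, AUC, and Brier score) can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and probabilities are assumptions made only for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_prob: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded prediction matches the 0/1 label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Share of positive/negative pairs ranked correctly (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: binary outcomes and predicted probabilities of class 1.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(accuracy(y_true, y_prob))     # higher is better
print(auc(y_true, y_prob))          # higher is better
print(brier_score(y_true, y_prob))  # lower is better
```

    Accuracy depends on the chosen threshold, whereas AUC and the Brier score use the predicted probabilities directly, which is one reason for reporting all three.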

    Hybrid feature selection of breast cancer gene expression microarray data based on metaheuristic methods: a comprehensive review

    Breast cancer (BC) remains the most common cancer among women worldwide. Numerous BC gene expression microarray-based studies have been used for cancer classification and prognosis. The availability of gene expression microarray data together with advanced classification methods has enabled accurate and precise classification. Nevertheless, microarray datasets suffer from a large number of gene expression levels, limited sample size, and irrelevant features. Additionally, datasets are often imbalanced, with unequal numbers of samples across classes. These limitations make it difficult to determine which features in the gene expression profiles actually contribute to cancer classification. Various accurate feature selection methods exist and are widely applied. The objective of feature selection is to search for a relevant, discriminant feature subset of the original feature space. In this review, we compile and review the latest hybrid feature selection methods based on bio-inspired metaheuristic methods and wrapper methods for the classification of BC and other types of cancer.

    An interpretable multi-stage forecasting framework for energy consumption and CO2 emissions for the transportation sector

    The transportation sector is deemed one of the primary sources of energy consumption and greenhouse gases throughout the world. To realise and design sustainable transport, it is imperative to comprehend the relationships and evaluate the interactions among the variables that may influence transport energy consumption and CO2 emissions. Unlike recently published papers, this study strives to achieve a balance between machine learning (ML) model accuracy and model interpretability using the Shapley additive explanation (SHAP) method for forecasting the energy consumption and CO2 emissions in the UK's transportation sector. To this end, this paper proposes an interpretable multi-stage forecasting framework to simultaneously maximise the ML model accuracy and determine the relationship between the predictions and the influential variables by revealing the contribution of each variable to the predictions. For the UK's transportation sector, the experimental results indicate that road carbon intensity is the variable contributing most to both energy consumption and CO2 emissions predictions. Unlike in other studies, population and GDP per capita are found to be uninfluential variables. The proposed multi-stage forecasting framework may assist policymakers in making more informed energy decisions and establishing more accurate investment

    Feature selection algorithms for Malaysian dengue outbreak detection model

    Dengue fever is considered one of the most common mosquito-borne diseases worldwide. Dengue outbreak detection can support practical efforts to curb the rapid spread of the disease by providing the knowledge needed to predict the next outbreak. Many studies have modelled and predicted dengue outbreaks using different data mining techniques. This research aimed to identify the features that lead to better predictive accuracy for dengue outbreaks using three different feature selection algorithms: particle swarm optimization (PSO), genetic algorithm (GA) and rank search (RS). Based on the selected features, three predictive modeling techniques (J48, DTNB and Naive Bayes) were applied for dengue outbreak detection. The dataset used in this research was obtained from the Public Health Department, Seremban, Negeri Sembilan, Malaysia. The experimental results showed that predictive accuracy improved when a feature selection process was applied before predictive modeling. The study also identified a set of features to represent dengue outbreak detection for Malaysian health agencies.

    A hybrid machine learning feature selection model—HMLFSM to enhance gene classification applied to multiple colon cancers dataset

    Colon cancer is a significant global health problem, and early detection is critical for improving survival rates. Traditional detection methods, such as colonoscopies, can be invasive and uncomfortable for patients. Machine Learning (ML) algorithms have emerged as a promising approach for non-invasive colon cancer classification: ML can analyse genetic data, or patient demographics and medical history, to predict the likelihood of colon cancer. However, due to the challenges imposed by variable gene expression and the high dimensionality of cancer-related datasets, traditional transductive ML applications have limited accuracy and risk overfitting. In this paper, we propose a new hybrid feature selection model called HMLFSM (Hybrid Machine Learning Feature Selection Model) to improve colon cancer gene classification. We developed a multifilter hybrid model with a two-phase feature selection approach, combining Information Gain (IG) with Genetic Algorithms (GA), and minimum Redundancy Maximum Relevance (mRMR) with Particle Swarm Optimization (PSO). We critically tested our model on three colon cancer genetic datasets and found that the new framework outperformed other models with significant accuracy improvements (95%, ~97%, and ~94% accuracies for datasets 1, 2, and 3 respectively). The results show that our approach improves the classification accuracy of colon cancer detection by highlighting important and relevant genes, eliminating irrelevant ones, and revealing the genes that have a direct influence on the classification process. For colon cancer gene analysis, based on our experiments and literature review, we found that selective input feature extraction prior to feature selection is essential for improving predictive performance.
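
    As a rough illustration of the Information Gain (IG) filter mentioned in the abstract above, the following pure-Python sketch ranks features by IG with respect to a class label. It covers only a generic IG ranking step, not the HMLFSM pipeline or its GA, mRMR, or PSO stages. It assumes expression values have already been discretized, and the gene names and toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Reduction in label entropy after splitting samples by feature value."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: discretized expression of two genes for four samples.
labels = [1, 1, 0, 0]  # 1 = tumour, 0 = normal (assumed encoding)
features = {
    "gene_a": ["high", "high", "low", "low"],  # separates the classes perfectly
    "gene_b": ["high", "low", "high", "low"],  # carries no class information
}

# Rank genes from most to least informative.
ranking = sorted(features, key=lambda g: information_gain(features[g], labels),
                 reverse=True)
print(ranking)
```

    A filter of this kind scores each feature independently, so it is cheap to compute but ignores redundancy between features; hybrid approaches typically pass the top-ranked features to a second, search-based stage.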

    A Probabilistic Multi-Objective Artificial Bee Colony Algorithm for Gene Selection

    Microarray technology is widely used to report gene expression data. Having many features and few samples is characteristic of this platform. In order to identify significant genes for a particular disease, the problem of high-dimensional microarray data must be overcome. The Artificial Bee Colony (ABC) Algorithm is a successful meta-heuristic algorithm that solves optimization problems effectively. In this paper, we propose a hybrid gene selection method for discriminatively selecting genes. We propose a new probabilistic binary Artificial Bee Colony Algorithm, namely PrBABC, that is hybridized with three different filter methods. The proposed method is applied to nine microarray datasets in order to detect distinctive genes for classifying cancer data. Results are compared with other well-known meta-heuristic algorithms: Binary Differential Evolution Algorithm (BinDE), Binary Particle Swarm Optimization Algorithm (BinPSO), and Genetic Algorithm (GA), as well as with other methods in the literature. Experimental results show that the probabilistic self-adaptive learning strategy integrated into the employed-bee phase can boost classification accuracy with a minimal number of genes.

    Swarm Intelligence Based Feature Selection for High Dimensional Classification: A Literature Survey

    Feature selection is an important and challenging task in machine learning and data mining, used to avoid the curse of dimensionality and maximize classification accuracy. Moreover, feature selection helps to reduce the computational complexity of the learning algorithm, improve prediction performance, improve data understanding, and reduce data storage space. Swarm intelligence based feature selection makes it possible to search an extremely high-dimensional feature space for a feature subset that yields the most accurate classifier model. This kind of research has not yet been fully explored in data mining. The subject of this paper is the use of swarm intelligence algorithms for feature selection in high-dimensional data, with a focus on medical data classification. The results show that the swarm intelligence algorithms reviewed, based on state-of-the-art literature, have promising capabilities that can be applied in feature selection techniques. The significance of this work is that it presents a comparison of swarm algorithms and various alternatives that can be applied to feature selection for high-dimensional classification.

    Supervised Methods for Biomarker Detection from Microarray Experiments

    Biomarkers are valuable indicators of the state of a biological system. Microarray technology has been extensively used to identify biomarkers and build computational predictive models for disease prognosis, drug sensitivity and toxicity evaluations. Activation biomarkers can be used to understand the underlying signaling cascades, mechanisms of action and biological cross talk. Biomarker detection from microarray data requires several considerations both from the biological and computational points of view. In this chapter, we describe the main methodology used in biomarker discovery and predictive modeling and we address some of the related challenges. Moreover, we discuss biomarker validation and give some insights into multiomics strategies for biomarker detection. Non peer reviewed