
    EFSIS: Ensemble Feature Selection Integrating Stability

    Ensemble learning, which combines the predictions from multiple learners, has been widely applied in pattern recognition and has been reported to be more robust and accurate than the individual learners. This ensemble logic has recently also been applied to feature selection. There are two main strategies for ensemble feature selection: data perturbation and function perturbation. Data perturbation performs feature selection on data subsets sampled from the original dataset and then selects the features consistently ranked highly across those subsets; this has been found to improve both the stability of the selector and the prediction accuracy of a classifier. Function perturbation frees the user from having to decide on the most appropriate selector for any given situation and works by aggregating multiple selectors; this has been found to maintain or improve classification performance. Here we propose a framework, EFSIS, combining these two strategies. Empirical results indicate that EFSIS gives both high prediction accuracy and stability.
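
    A minimal Python sketch of the two perturbation strategies described above, assuming univariate scorers (ANOVA F-score and mutual information) as stand-in base selectors and simple rank-sum aggregation; the names and details are illustrative and not taken from the EFSIS implementation.

    import numpy as np
    from sklearn.feature_selection import f_classif, mutual_info_classif

    def rank_features(scores):
        # Convert scores to ranks: rank 1 = highest-scoring feature.
        order = np.argsort(-scores)
        ranks = np.empty_like(order)
        ranks[order] = np.arange(1, len(scores) + 1)
        return ranks

    def ensemble_feature_ranking(X, y, n_subsamples=20, seed=0):
        # Data perturbation: rank features over bootstrap subsamples.
        # Function perturbation: aggregate two different selectors.
        selectors = [
            lambda X, y: f_classif(X, y)[0],          # ANOVA F-score
            lambda X, y: mutual_info_classif(X, y),   # mutual information
        ]
        rng = np.random.default_rng(seed)
        n, p = X.shape
        rank_sum = np.zeros(p)
        for _ in range(n_subsamples):
            # Plain bootstrap for brevity; a stratified subsample is safer
            # when one class is rare.
            idx = rng.choice(n, size=n, replace=True)
            for score in selectors:
                rank_sum += rank_features(score(X[idx], y[idx]))
        return np.argsort(rank_sum)   # consistently high-ranked features first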

    Forecasting foreign exchange reserves using Bayesian Model Averaging-Naïve Bayes

    Foreign exchange reserves are used by governments to balance international payments and stabilize the exchange rate. Numerous works have developed models to predict foreign exchange reserves; however, the existing models have limitations, and the literature calls for further research given that their accuracy is still poor and they have only been applied to emerging countries. This paper presents a new prediction model of foreign exchange reserves for both emerging and developed countries, applying a Bayesian model averaging-Naïve Bayes method that achieves better precision than the individual classifiers. Our model can have a substantial impact on the adequacy of macroeconomic policy against the risks derived from balance-of-payments crises by providing tools that help achieve financial stability at a global level.
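
    A hedged sketch of how Bayesian model averaging over Naïve Bayes models might be assembled with scikit-learn, assuming Gaussian Naïve Bayes base models fit on different feature subsets and a validation log-likelihood used as a crude stand-in for model evidence; this illustrates the general BMA-Naïve Bayes idea, not the paper's exact method.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    def bma_naive_bayes(X, y, feature_subsets, seed=0):
        # Hold out a validation split to weight the candidate models.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        models, log_evidence = [], []
        for subset in feature_subsets:
            nb = GaussianNB().fit(X_tr[:, subset], y_tr)
            log_proba = nb.predict_log_proba(X_val[:, subset])
            # Validation log-likelihood as a proxy for model evidence;
            # assumes integer class labels 0..k-1 (matching nb.classes_).
            log_evidence.append(log_proba[np.arange(len(y_val)), y_val].sum())
            models.append((subset, nb))
        w = np.exp(np.array(log_evidence) - max(log_evidence))
        w /= w.sum()                      # posterior-style model weights

        def predict_proba(X_new):
            # Average the subset models' predictions, weighted by evidence.
            return sum(wi * nb.predict_proba(X_new[:, s])
                       for wi, (s, nb) in zip(w, models))
        return predict_proba, w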

    Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification

    Effective management of chronic diseases and cancer can greatly benefit from disease-specific biomarkers that enable informative screening and timely diagnosis. IgG N-glycans found in human plasma have the potential to be minimally invasive, disease-specific biomarkers for all stages of disease development, owing to their plasticity in response to various genetic and environmental stimuli. Data analysis and machine learning (ML) approaches can assist in harnessing the potential of IgG glycomics towards biomarker discovery and the development of reliable predictive tools for disease screening. This study proposes an ML-based N-glycomic analysis framework that can be employed to build, optimise, and evaluate multiple ML pipelines to stratify patients by disease risk in an interpretable manner. To design and test this framework, a published colorectal cancer (CRC) dataset from the Study of Colorectal Cancer in Scotland (SOCCS) cohort (1999-2006) was used. In particular, among the different pipelines tested, an XGBoost-based ML pipeline, which was tuned using multi-objective optimisation, calibrated using an inductive Venn-Abers predictor (IVAP), and evaluated via a nested cross-validation (NCV) scheme, achieved a mean area under the Receiver Operating Characteristic curve (AUC-ROC) of 0.771 when classifying between age- and sex-matched healthy controls and CRC patients. This performance suggests the potential of using the relative abundance of IgG N-glycans to define populations at elevated CRC risk who merit investigation or surveillance. Finally, the IgG N-glycans that most strongly influence CRC classification decisions were identified using a global model-agnostic interpretability technique, namely Accumulated Local Effects (ALE). We envision that open-source computational frameworks, such as the one presented herein, will be useful in supporting the translation of glycan-based biomarkers into clinical applications.
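
    As a rough illustration of the evaluation scheme described above, the following sketch nests hyperparameter tuning inside an outer AUC-ROC cross-validation loop, assuming the xgboost package is available; the paper's multi-objective optimisation and Venn-Abers calibration steps are omitted here for brevity.

    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from xgboost import XGBClassifier   # assumes the xgboost package is installed

    def nested_cv_auc(X, y, seed=0):
        inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
        outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        grid = {"max_depth": [2, 4], "n_estimators": [100, 300]}
        # Inner loop: tune hyperparameters on each outer training split.
        model = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                             grid, scoring="roc_auc", cv=inner)
        # Outer loop: unbiased estimate of the tuned model's AUC-ROC.
        scores = cross_val_score(model, X, y, scoring="roc_auc", cv=outer)
        return scores.mean(), scores.std()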

    Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets

    The COVID-19 outbreak put intense pressure on healthcare systems, with an urgent demand for effective diagnostic, prognostic and therapeutic procedures. Here, we employed Automated Machine Learning (AutoML) to analyze three publicly available high-throughput COVID-19 datasets, including proteomic, metabolomic and transcriptomic measurements. Pathway analysis of the selected features was also performed. Analysis of a combined proteomic and metabolomic dataset led to 10 equivalent signatures of two features each, with AUC 0.840 (CI 0.723–0.941) in discriminating severe from non-severe COVID-19 patients. A transcriptomic dataset led to two equivalent signatures of eight features each, with AUC 0.914 (CI 0.865–0.955) in identifying COVID-19 patients from those with a different acute respiratory illness. Another transcriptomic dataset led to two equivalent signatures of nine features each, with AUC 0.967 (CI 0.899–0.996) in identifying COVID-19 patients from virus-free individuals. Signature predictive performance remained high upon validation. Multiple new features emerged, and pathway analysis revealed biological relevance through implication in the Viral mRNA Translation, Interferon gamma signaling and Innate Immune System pathways. In conclusion, AutoML analysis led to multiple biosignatures of high predictive performance, with few features and a large choice of alternative predictors. These favorable characteristics make them well suited for the development of cost-effective assays that can contribute to better disease management.
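
    The following sketch illustrates, under stated assumptions, how a small feature signature's AUC and confidence interval can be estimated: a plain logistic regression stands in for the AutoML-selected model, and the CI comes from bootstrapping the held-out set. This is not the AutoML tool used in the study, only the general evaluation pattern.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def signature_auc_ci(X, y, signature, n_boot=1000, alpha=0.05, seed=0):
        Xs = X[:, signature]                    # e.g. a two-feature signature
        X_tr, X_te, y_tr, y_te = train_test_split(
            Xs, y, test_size=0.3, stratify=y, random_state=seed)
        proba = LogisticRegression(max_iter=1000).fit(
            X_tr, y_tr).predict_proba(X_te)[:, 1]
        rng = np.random.default_rng(seed)
        aucs = []
        for _ in range(n_boot):                 # bootstrap the evaluation set
            idx = rng.choice(len(y_te), size=len(y_te), replace=True)
            if len(np.unique(y_te[idx])) < 2:   # AUC needs both classes
                continue
            aucs.append(roc_auc_score(y_te[idx], proba[idx]))
        lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return roc_auc_score(y_te, proba), (lo, hi)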

    kCV-B: Bootstrap with Cross-Validation for Deep Learning Model Development, Assessment and Selection

    This study investigates the inability of two popular data-splitting techniques, train/test split and k-fold cross-validation, to create training and validation data sets that achieve sufficient generality for supervised deep learning (DL) methods. This failure is mainly caused by their limited ability to create new data. The bootstrap, by contrast, is a computer-based statistical resampling method that has been used efficiently to estimate the distribution of a sample estimator and to assess a model without knowledge of the population. This paper couples cross-validation and the bootstrap to combine their respective advantages as data-generation strategies and to achieve better generalization of a DL model. The paper contributes by: (i) developing an algorithm for better selection of training and validation data sets, (ii) exploring the potential of the bootstrap for drawing statistical inference on the necessary performance metrics (e.g., mean squared error), and (iii) introducing a method that can assess and improve the efficiency of a DL model. The proposed method is applied to semantic segmentation and is demonstrated via a DL-based classification algorithm, PointNet, on aerial laser scanning point cloud data.
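
    A simplified reading of the coupling described above, sketched in Python: within each cross-validation fold, the training split is bootstrap-resampled several times, giving a per-fold distribution of the chosen metric. Here fit_fn and metric_fn are hypothetical user-supplied callables, and this is not the authors' implementation.

    import numpy as np
    from sklearn.model_selection import KFold

    def kcv_bootstrap(X, y, fit_fn, metric_fn, k=5, n_boot=20, seed=0):
        rng = np.random.default_rng(seed)
        fold_metrics = []
        cv = KFold(n_splits=k, shuffle=True, random_state=seed)
        for train_idx, val_idx in cv.split(X):
            boot_metrics = []
            for _ in range(n_boot):
                # Bootstrap the training split to create a "new" training set.
                b = rng.choice(train_idx, size=len(train_idx), replace=True)
                model = fit_fn(X[b], y[b])
                boot_metrics.append(metric_fn(y[val_idx],
                                              model.predict(X[val_idx])))
            fold_metrics.append(boot_metrics)
        # Per-fold metric distributions support statistical inference
        # (e.g. a bootstrap interval on mean squared error).
        return np.asarray(fold_metrics)         # shape: (k, n_boot)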

    Global patterns and extreme events in sovereign risk premia: a fuzzy vs deep learning comparison

    Investment in foreign countries has become more common nowadays, and such investments carry inherent risks, with the sovereign risk premium measuring that risk. Many studies have examined the behaviour of the sovereign risk premium; nevertheless, current models have limitations, and the literature calls for further investigation, as behavioural factors are needed to analyse investors' risk perception. In addition, the methodology widely used in previous research is regression modelling, and alternative approaches remain scarce in the literature. This study provides a new model of the drivers of government risk premia in developing and developed countries, comparing fuzzy methods, such as Fuzzy Decision Trees, Fuzzy Rough Nearest Neighbour and a Neuro-Fuzzy Approach, with deep learning procedures, such as Deep Recurrent Convolutional Neural Networks, Deep Neural Decision Trees and Deep Learning Linear Support Vector Machines. Our models can strongly affect the suitability of macroeconomic policy in the face of foreign investment risks by delivering instruments that contribute to financial stability at the global level. This research received funding from the University of Málaga and from the Cátedra de Economía y Finanzas Sostenibles (University of Málaga). We also appreciate the financial support from the University of Barcelona (under grant UB-AE-AS017634).
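
    Since the fuzzy and deep models named above are not standard library components, the sketch below shows only the shared evaluation protocol: several scikit-learn-compatible regressors scored on identical cross-validation splits, with two placeholder estimators standing in for the fuzzy and deep model families.

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.neural_network import MLPRegressor

    def compare_models(X, y, models, k=5, seed=0):
        # Score every candidate on identical folds so results are comparable.
        cv = KFold(n_splits=k, shuffle=True, random_state=seed)
        return {name: cross_val_score(m, X, y, cv=cv,
                                      scoring="neg_mean_squared_error")
                for name, m in models.items()}

    # Placeholder estimators; swap in the fuzzy / deep models under study.
    candidates = {
        "neighbour_baseline": KNeighborsRegressor(),
        "deep_baseline": MLPRegressor(max_iter=2000),
    }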