31 research outputs found
EFSIS: Ensemble Feature Selection Integrating Stability
Ensemble learning, which combines the predictions of multiple learners, has been widely applied in pattern recognition and has been reported to be more robust and accurate than the individual learners. This ensemble logic has recently also been applied to feature selection. There are two basic strategies for ensemble feature selection, namely data perturbation and function perturbation. Data perturbation performs feature selection on data subsets sampled from the original dataset and then selects the features consistently ranked highly across those subsets; this has been found to improve both the stability of the selector and the prediction accuracy of a classifier. Function perturbation aggregates multiple selectors, freeing the user from having to decide on the most appropriate selector for any given situation; this has been found to maintain or improve classification performance. Here we propose EFSIS, a framework combining these two strategies. Empirical results indicate that EFSIS gives both high prediction accuracy and stability.

Comment: 20 pages, 3 figures
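The two perturbation strategies can be sketched in a few lines of NumPy. This is a minimal illustration of the general idea, not the authors' EFSIS implementation: the two filter selectors (`corr_score`, `ttestish_score`), the bootstrap subsampling, and the mean-rank aggregation are all illustrative assumptions.

```python
import numpy as np

def rank_features(scores):
    """Return ranks (0 = best) for a vector of feature scores (higher = better)."""
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

def corr_score(X, y):
    """|Pearson correlation| of each feature with the target."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(num / den)

def ttestish_score(X, y):
    """Between-class mean difference scaled by overall std (a simple filter)."""
    m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    return np.abs(m1 - m0) / (X.std(axis=0) + 1e-12)

def efsis_like_selection(X, y, selectors, n_subsets=20, k=10, rng=None):
    """Data perturbation (bootstrap subsets) x function perturbation
    (multiple selectors): aggregate all rankings by mean rank, keep top-k."""
    rng = np.random.default_rng(rng)
    n = len(y)
    all_ranks = []
    for _ in range(n_subsets):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap data subset
        Xs, ys = X[idx], y[idx]
        for score_fn in selectors:                   # function perturbation
            all_ranks.append(rank_features(score_fn(Xs, ys)))
    mean_rank = np.mean(all_ranks, axis=0)
    return np.argsort(mean_rank)[:k]                 # features with best mean rank
```

Features that rank highly both across data subsets and across selectors end up with the lowest mean rank, which is the stability-plus-accuracy intuition behind the combined strategy.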
Forecasting foreign exchange reserves using Bayesian Model Averaging-Naïve Bayes
Foreign exchange reserves are used by governments to balance international payments and stabilise the exchange rate. Numerous works have developed models to predict foreign exchange reserves; however, the existing models have limitations, and the literature demands more research on the subject, given that the accuracy of the models is still poor and they have only been applied to emerging countries. This paper presents a new prediction model of foreign exchange reserves for both emerging and developed countries, applying a Bayesian model averaging-Naïve Bayes method, which shows better precision than the individual classifier. Our model can have a great impact on the adequacy of macroeconomic policy against the risks derived from balance-of-payments crises, providing tools that help achieve financial stability at a global level.
Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
Effective management of chronic diseases and cancer can greatly benefit from disease-specific biomarkers that enable informative screening and timely diagnosis. IgG N-glycans found in human plasma have the potential to be minimally invasive, disease-specific biomarkers for all stages of disease development due to their plasticity in response to various genetic and environmental stimuli. Data analysis and machine learning (ML) approaches can assist in harnessing the potential of IgG glycomics towards biomarker discovery and the development of reliable predictive tools for disease screening. This study proposes an ML-based N-glycomic analysis framework that can be employed to build, optimise, and evaluate multiple ML pipelines to stratify patients by disease risk in an interpretable manner. To design and test this framework, a published colorectal cancer (CRC) dataset from the Study of Colorectal Cancer in Scotland (SOCCS) cohort (1999-2006) was used. In particular, among the different pipelines tested, an XGBoost-based ML pipeline, tuned using multi-objective optimisation, calibrated using an inductive Venn-Abers predictor (IVAP), and evaluated via a nested cross-validation (NCV) scheme, achieved a mean area under the Receiver Operating Characteristic curve (AUC-ROC) of 0.771 when classifying between age- and sex-matched healthy controls and CRC patients. This performance suggests the potential of using the relative abundance of IgG N-glycans to define populations at elevated CRC risk who merit investigation or surveillance. Finally, the IgG N-glycans that most strongly influence CRC classification decisions were identified using a global model-agnostic interpretability technique, namely Accumulated Local Effects (ALE). We envision that open-source computational frameworks, such as the one presented herein, will be useful in supporting the translation of glycan-based biomarkers into clinical applications.
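The nested cross-validation (NCV) scheme mentioned in the abstract can be illustrated with a toy example. This sketch does not reproduce the paper's pipeline (no XGBoost, multi-objective tuning, or IVAP calibration); instead it uses a deliberately simple stand-in "hyperparameter" (which single feature to threshold on) to show the inner-loop selection / outer-loop evaluation structure.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def fit_threshold(x, y):
    """Midpoint threshold between class means on one feature."""
    t = (x[y == 0].mean() + x[y == 1].mean()) / 2
    sign = 1 if x[y == 1].mean() > x[y == 0].mean() else -1
    return t, sign

def predict_threshold(x, t, sign):
    return ((sign * (x - t)) > 0).astype(int)

def nested_cv_accuracy(X, y, outer_k=5, inner_k=3, seed=0):
    """Inner CV selects the model configuration (here: a feature index);
    outer CV evaluates the selection procedure on untouched data."""
    rng = np.random.default_rng(seed)
    accs = []
    for outer_test in kfold_indices(len(y), outer_k, rng):
        train = np.setdiff1d(np.arange(len(y)), outer_test)
        # inner loop: pick the feature with best inner validation accuracy
        best_feat, best_acc = 0, -1.0
        for f in range(X.shape[1]):
            inner_accs = []
            for val in kfold_indices(len(train), inner_k, rng):
                tr = np.setdiff1d(np.arange(len(train)), val)
                t, s = fit_threshold(X[train[tr], f], y[train[tr]])
                pred = predict_threshold(X[train[val], f], t, s)
                inner_accs.append((pred == y[train[val]]).mean())
            if np.mean(inner_accs) > best_acc:
                best_feat, best_acc = f, np.mean(inner_accs)
        # refit on the full outer-train split, score on the outer-test fold
        t, s = fit_threshold(X[train, best_feat], y[train])
        pred = predict_threshold(X[outer_test, best_feat], t, s)
        accs.append((pred == y[outer_test]).mean())
    return float(np.mean(accs))
```

Because the outer-test fold never influences the inner-loop choice, the returned score is an (approximately) unbiased estimate of the whole tuned pipeline, not just of one fitted model.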
Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets
The COVID-19 outbreak has put intense pressure on healthcare systems, creating an urgent demand for effective diagnostic, prognostic and therapeutic procedures. Here, we employed Automated Machine Learning (AutoML) to analyze three publicly available high-throughput COVID-19 datasets, including proteomic, metabolomic and transcriptomic measurements. Pathway analysis of the selected features was also performed. Analysis of a combined proteomic and metabolomic dataset led to 10 equivalent signatures of two features each, with AUC 0.840 (CI 0.723–0.941) in discriminating severe from non-severe COVID-19 patients. A transcriptomic dataset led to two equivalent signatures of eight features each, with AUC 0.914 (CI 0.865–0.955) in identifying COVID-19 patients among those with a different acute respiratory illness. Another transcriptomic dataset led to two equivalent signatures of nine features each, with AUC 0.967 (CI 0.899–0.996) in distinguishing COVID-19 patients from virus-free individuals. Signature predictive performance remained high upon validation. Multiple new features emerged, and pathway analysis revealed biological relevance through implication in the Viral mRNA Translation, Interferon gamma signaling and Innate Immune System pathways. In conclusion, AutoML analysis led to multiple biosignatures of high predictive performance, with few features and a large choice of alternative predictors. These favorable characteristics make them well suited to the development of cost-effective assays that contribute to better disease management.
kCV-B: Bootstrap with Cross-Validation for Deep Learning Model Development, Assessment and Selection
This study investigates the inability of two popular data-splitting techniques, train/test split and k-fold cross-validation, to create training and validation data sets that achieve sufficient generality for supervised deep learning (DL) methods. This failure is mainly caused by their limited ability to create new data. The bootstrap, in contrast, is a computer-based statistical resampling method that has been used efficiently to estimate the distribution of a sample estimator and to assess a model without knowledge of the population. This paper couples cross-validation and the bootstrap to combine their respective advantages in data-generation strategy and to achieve better generalisation of a DL model. The paper contributes by: (i) developing an algorithm for better selection of training and validation data sets, (ii) exploring the potential of the bootstrap for drawing statistical inference on the necessary performance metrics (e.g., mean square error), and (iii) introducing a method that can assess and improve the efficiency of a DL model. The proposed method is applied to semantic segmentation and is demonstrated via a DL-based classification algorithm, PointNet, on aerial laser scanning point cloud data.
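The coupling of cross-validation with the bootstrap can be sketched as follows. This is a minimal illustration of the general idea with an ordinary least-squares model standing in for the deep network; the fold/resample counts and the percentile interval are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def kcv_bootstrap_mse(X, y, k=5, n_boot=50, seed=0):
    """Couple k-fold CV with the bootstrap: within each fold, refit a
    least-squares model on bootstrap resamples of the training split and
    score it on the held-out fold, yielding a *distribution* of validation
    MSE values rather than a single number."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mses = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(y)), val)
        for _ in range(n_boot):
            b = rng.choice(train, size=len(train), replace=True)  # bootstrap resample
            Xb = np.c_[np.ones(len(b)), X[b]]                     # add intercept
            w, *_ = np.linalg.lstsq(Xb, y[b], rcond=None)
            Xv = np.c_[np.ones(len(val)), X[val]]
            mses.append(np.mean((Xv @ w - y[val]) ** 2))
    mses = np.array(mses)
    lo, hi = np.percentile(mses, [2.5, 97.5])   # bootstrap 95% interval
    return mses.mean(), (lo, hi)
```

The interval around the mean MSE is what a single train/test split or plain k-fold run cannot provide: a direct picture of how much the validation metric varies under resampling of the training data.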
Global patterns and extreme events in sovereign risk premia: a fuzzy vs deep learning comparison
Investment in foreign countries has become more common nowadays, and it carries inherent risks, of which the sovereign risk premium is the measure. Many studies have examined the behaviour of the sovereign risk premium; nevertheless, the current models have limitations, and the literature calls for further investigation of the issue, as behavioural factors are needed to analyse investors' risk perception. In addition, the methodology widely used in previous research is the regression model, and the literature on alternative methods remains scarce. This study provides a new model of the drivers of government risk premia in developing and developed countries, comparing fuzzy methods such as Fuzzy Decision Trees, Fuzzy Rough Nearest Neighbour and the Neuro-Fuzzy Approach with deep learning procedures such as Deep Recurrent Convolutional Neural Networks, Deep Neural Decision Trees and Deep Learning Linear Support Vector Machines. Our models can have a large effect on the suitability of macroeconomic policy in the face of foreign investment risks by delivering instruments that contribute to bringing about financial stability at the global level.

This research received funding from the University of Málaga and from the Cátedra de Economía y Finanzas Sostenibles (University of Málaga). Additionally, we appreciate the financial support from the University of Barcelona (under grant UB-AE-AS017634).