Regression Trees and Random forest based feature selection for malaria risk exposure prediction

Author: Kouwayè Bienvenue
Publication date: 22/06/2016

This paper deals with predicting the number of Anopheles, the main vector of malaria risk, from environmental and climate variables. Variable selection is based on an automatic machine learning method that uses regression trees and random forests combined with stratified two-level cross-validation. The minimum threshold of variable importance is assessed using the quadratic distance of the variable importances, and the optimal subset of selected variables is used to perform predictions. The results proved qualitatively better, in terms of selection, prediction, and CPU time, than those obtained with the GLM-Lasso method.

arXiv.org e-Print Archive

HAL-Paris1

Recommended from our members

Statistical Workflow for Feature Selection in Human Metabolomics Data.

Author: Antonelli Joseph
Cheng Susan
Claggett Brian L
Demler Olga V
Deng Katherine
Henglin Mir
Hushcha Pavel V
Jain Mohit
Kim Andy
Kim Nicole
Lagerborg Kim A
Mora Samia
Niiranen Teemu J
Ovsak Gavin
Pereira Alexandre C
Rao Kevin
Tyagi Octavia
Watrous Jeramie D
Publication venue: eScholarship, University of California
Publication date: 01/07/2019

High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for more standardization of as well as advances in how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations

eScholarship - University of California

Seasonal prediction of lake inflows and rainfall in a hydro-electricity catchment, Waitaki river, New Zealand

Author: Bardsley W. Earl
Purdie Jennifer Margaret
Publication venue: Wiley
Publication date: 01/01/2010

The Waitaki River is located in the centre of the South Island of New Zealand, and hydro-electricity generated on the river accounts for 35-40% of New Zealand's electricity. Low inflows in 1992 and 2001 resulted in the threat of power blackouts. Improved seasonal rainfall and inflow forecasts will result in the better management of the water used in hydro-generation on a seasonal basis. Researchers have stated that two key directions in the fields of seasonal rainfall and streamflow forecasting are to a) decrease the spatial scale of forecast products, and b) tailor forecast products to end-user needs, so as to provide more relevant and targeted forecasts. Several season-ahead lake inflow and rainfall forecast models were calibrated for the Waitaki river catchment using statistical techniques to quantify relationships between land-ocean-atmosphere state variables and seasonally lagged inflows and rainfall. Techniques included principal components analysis and multiple linear regression, with cross-validation techniques applied to estimate model error and randomization techniques used to establish the significance of the skill of the models. Many of the models calibrated predict rainfall and inflows better than random chance and better than the long-term mean as a predictor. When compared to the range of all probable inflow seasonal totals (based on the 80-year recorded history in the catchment), 95% confidence limits around most model predictions offer significant skill. These models explain up to 19% of the variance in season-ahead rainfall and inflows in this catchment. Seasonal rainfall and inflow forecasting on a single catchment scale and focussed to end-user needs is possible with some skill in the South Island of New Zealand
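The randomization idea mentioned in the abstract above can be illustrated with a small permutation test. This is only a sketch under stated assumptions: the abstract does not say which skill measure was used, so Pearson correlation between predicted and observed values is assumed here, and the function names and the data are hypothetical.

```python
import random
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(predicted: Sequence[float],
                        observed: Sequence[float],
                        n_permutations: int = 2000,
                        seed: int = 0) -> float:
    """One-sided p-value: how often shuffled observations match the
    predictions at least as well as the real ordering does."""
    rng = random.Random(seed)
    actual = pearson_r(predicted, observed)
    shuffled = list(observed)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pearson_r(predicted, shuffled) >= actual:
            at_least_as_good += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_good + 1) / (n_permutations + 1)


# Made-up seasonal totals, for illustration only.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
observed = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2, 9.1, 9.7]
p_value = permutation_p_value(predicted, observed)
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value suggests the skill is unlikely to arise from chance pairing of forecasts and observations; it says nothing about how much variance the model explains.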

Research Commons@Waikato

Lasso based feature selection for malaria risk exposure prediction

Author: Fonton Noël
Kouwayè Bienvenue
Rossi Fabrice
Publication date: 11/07/2015

In the life sciences, experts generally use empirical knowledge to recode variables, choose interactions, and perform selection by classical approaches. The aim of this work is to develop an automatic learning algorithm for variable selection, which can show whether experts can be helped in their decisions or simply replaced by the machine, and improve their knowledge and results. The Lasso method can detect the optimal subset of variables for estimation and prediction under some conditions. In this paper, we propose a novel approach that automatically uses all available variables and all interactions. Using a double cross-validation combined with Lasso, we select the best subset of variables, and with a GLM through a simple cross-validation we perform predictions. The algorithm ensures the stability and the consistency of the estimators. Comment: in Petra Perner. Machine Learning and Data Mining in Pattern Recognition, Jul 2015, Hamburg, Germany. Ibai publishing, 2015, Machine Learning and Data Mining in Pattern Recognition (proceedings of 11th International Conference, MLDM 2015
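The double cross-validation structure described in the abstract above can be sketched as nested index splits. This is a minimal illustration, not the authors' algorithm: it only generates the splits, the Lasso and GLM fits are left out, the folds are contiguous rather than shuffled or stratified, and the names `k_fold` and `double_cv_splits` are hypothetical.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold(indices: List[int], k: int) -> Iterator[Split]:
    """Yield (train, held_out) index lists for k contiguous folds."""
    if not 2 <= k <= len(indices):
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(len(indices), k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


def double_cv_splits(n_samples: int, outer_k: int, inner_k: int
                     ) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """For each outer fold, yield (outer_train, outer_test, inner_splits).

    The inner splits partition outer_train only, so a tuning step (for
    example, choosing a Lasso penalty) never sees the outer test fold.
    """
    for outer_train, outer_test in k_fold(list(range(n_samples)), outer_k):
        inner_splits = list(k_fold(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


splits = list(double_cv_splits(n_samples=12, outer_k=3, inner_k=2))
for outer_train, outer_test, inner_splits in splits:
    print("outer test:", outer_test)
    for inner_train, inner_val in inner_splits:
        print("  inner validation:", inner_val)
```

Keeping the inner loop inside each outer training set is what prevents the selection step from leaking information into the final error estimate.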

arXiv.org e-Print Archive

HAL-Paris1

Computational Models for Transplant Biomarker Discovery.

Author: Sarwal Minnie M
Wang Anyou
Publication venue: eScholarship, University of California
Publication date: 01/01/2015

Translational medicine offers a rich promise for improved diagnostics and drug discovery for biomedical research in the field of transplantation, where unmet diagnostic and therapeutic needs persist. The advent of genomic and proteomic profiling, collectively called "omics", provides new resources for developing novel biomarkers for routine clinical use. Establishing such a marker system depends heavily on the appropriate application of computational algorithms and software, which are based on mathematical theories and models. Understanding these theories helps in applying appropriate algorithms so that biomarker systems succeed. Here, we review the key advances in theories and mathematical models relevant to transplant biomarker development. The advantages and limitations inherent in these models are discussed. The principles of key computational approaches for efficiently selecting the best subset of biomarkers from high-dimensional omics data are highlighted. Prediction models are also introduced, and the integration of multi-microarray data is also discussed. Appreciating these key advances would help to accelerate the development of clinically reliable biomarker systems.

Crossref

Directory of Open Access Journals

Frontiers - Publisher Connector

PubMed Central

eScholarship - University of California

Forecasting of electricity prices in the Spanish electricity market using machine learning tools

Author: Kaleta Joanna
Publication venue: Universitat Politècnica de Catalunya
Publication date: 17/01/2019

The objective of this research assignment was to forecast electricity prices in the Spanish electricity market using three different machine learning techniques: k-nearest neighbours, support vector regression and artificial neural networks. The results achieved were compared and the quality of the developed models was evaluated. The project was implemented in Python 3.

UPCommons. Portal del coneixement obert de la UPC

Open source R for applying machine learning to RPAS remote sensing images

Author: Masiero Andrea
Piragnolo Marco
Pirotti Francesco
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

The increase in the number of remote sensing platforms, ranging from satellites to close-range Remotely Piloted Aircraft System (RPAS), is leading to a growing demand for new image processing and classification tools. This article presents a comparison of the Random Forest (RF) and Support Vector Machine (SVM) machine-learning algorithms for extracting land-use classes in an RPAS-derived orthomosaic using open source R packages. The camera used in this work captures the reflectance of the Red, Blue, Green and Near Infrared channels of a target. The full dataset is therefore a 4-channel raster image. The classification performance of the two methods is tested at varying sizes of training sets. The SVM and RF are evaluated using the Kappa index, classification accuracy and classification error as accuracy metrics. The training sets are randomly obtained as subsets of 2 to 20% of the total number of raster cells, with stratified sampling according to the land-use classes. Ten runs are done for each training set to calculate the variance in results. The control dataset consists of an independent classification obtained by photointerpretation. The validation is carried out (i) using K-fold cross-validation, (ii) using the pixels from the validation test set, and (iii) using the pixels from the full test set. Validation with K-fold and with the validation dataset shows that SVM gives better results, but RF proves to perform better when the training size is larger. Classification error and classification accuracy follow the trend of the Kappa index.
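The Kappa index and classification accuracy named in the abstract above can both be computed from a confusion matrix. The sketch below uses the standard definition of Cohen's kappa; it is written in Python rather than the R packages the article used, and the two-class confusion matrix is made up for illustration.

```python
from typing import List


def classification_accuracy(confusion: List[List[int]]) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def cohen_kappa(confusion: List[List[int]]) -> float:
    """Cohen's kappa: agreement beyond what chance alone would give.

    Rows are reference classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    n_classes = len(confusion)
    observed = classification_accuracy(confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    # Agreement expected if reference and prediction were independent.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class result: 100 validation pixels.
confusion = [[45, 5],
             [10, 40]]
accuracy = classification_accuracy(confusion)
kappa = cohen_kappa(confusion)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```

Classification error is simply one minus the accuracy, which is why the two follow the same trend.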

Crossref

Florence Research

Directory of Open Access Journals

Archivio istituzionale della ricerca - Università di Padova

A Multi-objective Exploratory Procedure for Regression Model Selection

Author: Ankur Sinha
Davidson R.
Deb K.
Freund Y.
Goldberg D.
Jeffreys H.
Leamer E.E.
MacKay D. J.C.
Murata N.
Paterlini S.
Pekka Malo
Redmond M.
Takeuchi K.
Tibshirani R.
Timo Kuosmanen
Zitzler E.
Publication venue: Informa UK Limited
Publication date: 13/07/2016

Variable selection is recognized as one of the most critical steps in statistical modeling. The problems encountered in engineering and social sciences are commonly characterized by an over-abundance of explanatory variables, non-linearities and unknown interdependencies between the regressors. An added difficulty is that analysts may have little or no prior knowledge of the relative importance of the variables. To provide a robust method for model selection, this paper introduces the Multi-objective Genetic Algorithm for Variable Selection (MOGA-VS), which provides the user with an optimal set of regression models for a given data-set. The algorithm treats the regression problem as a two-objective task and explores the Pareto-optimal (best subset) models by preferring models that have fewer regression coefficients and better goodness of fit. The model exploration can be performed based on in-sample or generalization error minimization. The model selection is proposed to be performed in two steps. First, we generate the frontier of Pareto-optimal regression models by eliminating the dominated models without any user intervention. Second, a decision-making process is executed which allows the user to choose the most preferred model using visualisations and simple metrics. The method has been evaluated on a recently published real dataset on Communities and Crime within United States. Comment: in Journal of Computational and Graphical Statistics, Vol. 24, Iss. 1, 201
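The first step described in the abstract above, eliminating dominated models to leave a Pareto-optimal frontier, can be sketched as follows. This is not MOGA-VS itself: the genetic search is omitted, the candidate models and their scores are made up, and the names `Candidate`, `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    n_coefficients: int  # model size, lower is better
    error: float         # in-sample or validation error, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.n_coefficients <= b.n_coefficients and a.error <= b.error
    strictly_better = a.n_coefficients < b.n_coefficients or a.error < b.error
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    front = [c for c in candidates
             if not any(dominates(other, c) for other in candidates)]
    return sorted(front, key=lambda c: c.n_coefficients)


# Made-up regression models, for illustration only.
candidates = [
    Candidate("m1", 1, 0.90),
    Candidate("m2", 2, 0.60),
    Candidate("m3", 3, 0.65),  # dominated by m2: larger and less accurate
    Candidate("m4", 4, 0.40),
    Candidate("m5", 5, 0.40),  # dominated by m4: larger, same error
]
front = pareto_front(candidates)
print([c.name for c in front])
```

The surviving models trade size against fit, which is the set a user would then inspect in the second, decision-making step.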

arXiv.org e-Print Archive

CiteSeerX

Crossref