Regression Trees and Random forest based feature selection for malaria risk exposure prediction
This paper deals with predicting the number of Anopheles, the main vector of
malaria risk, using environmental and climate variables. Variable selection
is based on an automatic machine learning method using regression trees and
random forests combined with stratified two-level cross-validation. The
minimum threshold of variable importance is assessed using the quadratic
distance of variable importances, while the optimal subset of selected
variables is used to perform predictions. Finally, the results proved
qualitatively better in terms of selection, prediction, and CPU time than
those obtained by the GLM-Lasso method
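The two-level cross-validation mentioned in this abstract can be sketched in plain Python as below. This is only an illustration under stated assumptions: the abstract does not say what the strata are or how folds are drawn, so the stratum labels, the round-robin assignment, and the function names `stratified_folds` and `nested_splits` are hypothetical. The inner folds would be used for selecting variables and the outer folds for assessing predictions.

```python
from collections import defaultdict


def stratified_folds(indices, strata, k):
    """Split `indices` into k folds, spreading each stratum evenly.

    Members of each stratum are dealt to the folds in round-robin order.
    No shuffling is done, so the split is deterministic.
    """
    by_stratum = defaultdict(list)
    for i in indices:
        by_stratum[strata[i]].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for members in by_stratum.values():
        for i in members:
            folds[position % k].append(i)
            position += 1
    return folds


def nested_splits(strata, outer_k, inner_k):
    """Yield (outer_test, inner_train, inner_validation) index lists.

    Each outer fold is held out once; the remaining indices are split
    again into inner folds (hypothetical layout, not the paper's).
    """
    all_indices = list(range(len(strata)))
    outer = stratified_folds(all_indices, strata, outer_k)
    for o, test in enumerate(outer):
        train = [i for f, fold in enumerate(outer) if f != o for i in fold]
        for validation in stratified_folds(train, strata, inner_k):
            held_out = set(validation)
            inner_train = [i for i in train if i not in held_out]
            yield test, inner_train, validation
```

With 12 observations in two equally sized strata, `nested_splits(strata, 3, 2)` yields six splits, each of which partitions all 12 indices.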
Statistical Workflow for Feature Selection in Human Metabolomics Data.
High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for more standardization of as well as advances in how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations
Seasonal prediction of lake inflows and rainfall in a hydro-electricity catchment, Waitaki river, New Zealand
The Waitaki River is located in the centre of the South Island of New Zealand, and hydro-electricity generated on the river accounts for 35-40% of New Zealand's electricity. Low inflows in 1992 and 2001 resulted in the threat of power blackouts. Improved seasonal rainfall and inflow forecasts will result in the better management of the water used in hydro-generation on a seasonal basis.
Researchers have stated that two key directions in the fields of seasonal rainfall and streamflow forecasting are to a) decrease the spatial scale of forecast products, and b) tailor forecast products to end-user needs, so as to provide more relevant and targeted forecasts.
Several season-ahead lake inflow and rainfall forecast models were calibrated for the Waitaki river catchment using statistical techniques to quantify relationships between land-ocean-atmosphere state variables and seasonally lagged inflows and rainfall. Techniques included principal components analysis and multiple linear regression, with cross-validation techniques applied to estimate model error and randomization techniques used to establish the significance of the skill of the models.
Many of the models calibrated predict rainfall and inflows better than random chance and better than the long-term mean as a predictor. When compared to the range of all probable inflow seasonal totals (based on the 80-year recorded history in the catchment), 95% confidence limits around most model predictions offer significant skill. These models explain up to 19% of the variance in season-ahead rainfall and inflows in this catchment.
Seasonal rainfall and inflow forecasting on a single catchment scale and focussed to end-user needs is possible with some skill in the South Island of New Zealand
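Comparing a cross-validated regression forecast with the long-term mean as a predictor can be sketched as follows. This is a minimal sketch, not the study's method: it uses a single predictor instead of multiple linear regression with principal components, leave-one-out as the cross-validation scheme, and a mean-squared-error skill score. All of these choices, and the names `fit_line` and `loo_skill`, are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def loo_skill(xs, ys):
    """Leave-one-out skill relative to the long-term mean.

    Returns 1 - SSE_model / SSE_mean. Values above 0 mean the regression
    beats the mean of the remaining seasons used as the forecast.
    """
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three paired observations")
    sse_model = 0.0
    sse_mean = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        sse_model += (ys[i] - (slope * xs[i] + intercept)) ** 2
        sse_mean += (ys[i] - sum(train_y) / len(train_y)) ** 2
    if sse_mean == 0.0:
        raise ValueError("target is constant; skill is undefined")
    return 1.0 - sse_model / sse_mean
```

A score near 1 indicates a near-perfect forecast, 0 matches the long-term mean, and negative values are worse than the mean.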
Lasso based feature selection for malaria risk exposure prediction
In life sciences, experts generally use empirical knowledge to recode
variables, choose interactions and perform selection by a classical approach.
The aim of this work is to build an automatic learning algorithm for variable
selection, to find out whether experts can be helped in their decisions or
simply replaced by the machine, and to improve their knowledge and results.
The Lasso method can detect the optimal subset of variables for estimation and
prediction under some conditions. In this paper, we propose a novel approach
which automatically uses all available variables and all interactions. By a
double cross-validation combined with Lasso, we select a best subset of
variables, and with GLM through a simple cross-validation we perform
predictions. The algorithm ensures the stability and the consistency of
estimators.
Comment: in Petra Perner. Machine Learning and Data Mining in Pattern
Recognition, Jul 2015, Hamburg, Germany. Ibai publishing, 2015, Machine
Learning and Data Mining in Pattern Recognition (proceedings of 11th
International Conference, MLDM 2015
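For orientation, a Lasso-penalised GLM fit of the kind this abstract refers to is commonly written as below. The notation is an assumption of this note and is not taken from the paper: n observations, p candidate variables (including interactions), log-likelihood \ell, and penalty weight \lambda, which an inner cross-validation loop would typically choose.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta_0,\,\beta}
    \left\{
      -\frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i,\; \beta_0 + x_i^{\top}\beta\bigr)
      + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
    \right\}
```

Variables whose coefficients are shrunk exactly to zero drop out of the model, which is what makes the penalty usable for selection.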
Computational Models for Transplant Biomarker Discovery.
Translational medicine offers a rich promise for improved diagnostics and drug discovery for biomedical research in the field of transplantation, where unmet diagnostic and therapeutic needs persist. The recent advent of genomics and proteomics profiling, collectively called "omics", provides new resources for developing novel biomarkers for clinical routine. Establishing such a marker system depends heavily on the appropriate application of computational algorithms and software, which are based on mathematical theories and models. Understanding these theories would help in applying appropriate algorithms to make biomarker systems successful. Here, we review the key advances in theories and mathematical models relevant to transplant biomarker development. The advantages and limitations inherent in these models are discussed. The principles of key computational approaches for efficiently selecting the best subset of biomarkers from high-dimensional omics data are highlighted. Prediction models are also introduced, and the integration of multi-microarray data is also discussed. Appreciating these key advances would help to accelerate the development of clinically reliable biomarker systems
Forecasting of electricity prices in the Spanish electricity market using machine learning tools
The objective of this research assignment was to forecast electricity prices in the Spanish electricity market using three different machine learning techniques: k-nearest neighbours, support vector regression and artificial neural networks. The achieved results were compared and the quality of developed models was evaluated. The project was implemented in Python3.Incomin
Open source R for applying machine learning to RPAS remote sensing images
The increase in the number of remote sensing platforms, ranging from satellites to close-range Remotely Piloted Aircraft Systems (RPAS), is leading to a growing demand for new image processing and classification tools. This article presents a comparison of the Random Forest (RF) and Support Vector Machine (SVM) machine-learning algorithms for extracting land-use classes in an RPAS-derived orthomosaic using open source R packages.
The camera used in this work captures the reflectance of the Red, Blue, Green and Near Infrared channels of a target. The full dataset is therefore a 4-channel raster image. The classification performance of the two methods is tested at varying sizes of training sets. The SVM and RF are evaluated using the Kappa index, classification accuracy and classification error as accuracy metrics. The training sets are randomly obtained as subsets of 2 to 20% of the total number of raster cells, with stratified sampling according to the land-use classes. Ten runs are done for each training set to calculate the variance in results. The control dataset consists of an independent classification obtained by photointerpretation. The validation is carried out (i) using K-fold cross-validation, (ii) using the pixels from the validation test set, and (iii) using the pixels from the full test set.
Validation with K-fold and with the validation dataset shows that SVM gives better results, but RF proves to perform better when the training size is larger. Classification error and classification accuracy follow the trend of the Kappa index
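The three accuracy metrics named in this abstract can all be computed from a confusion matrix. The sketch below is written in Python for illustration, although the article itself uses R packages. It uses the standard definition of Cohen's Kappa; the function name `kappa_from_confusion` and the row and column convention are assumptions of this note.

```python
import math


def kappa_from_confusion(matrix):
    """Return (accuracy, error, kappa) for a square confusion matrix.

    matrix[i][j] counts cells whose reference class is i and whose
    predicted class is j (assumed convention).
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: share of cells on the diagonal.
    accuracy = sum(matrix[i][i] for i in range(n_classes)) / total
    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        kappa = math.nan  # undefined when chance agreement is perfect
    else:
        kappa = (accuracy - expected) / (1.0 - expected)
    return accuracy, 1.0 - accuracy, kappa
```

Because Kappa rescales accuracy by the agreement expected by chance, it moves in the same direction as accuracy for fixed class totals, which is consistent with the trend reported above.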
A Multi-objective Exploratory Procedure for Regression Model Selection
Variable selection is recognized as one of the most critical steps in
statistical modeling. The problems encountered in engineering and social
sciences are commonly characterized by over-abundance of explanatory variables,
non-linearities and unknown interdependencies between the regressors. An added
difficulty is that the analysts may have little or no prior knowledge on the
relative importance of the variables. To provide a robust method for model
selection, this paper introduces the Multi-objective Genetic Algorithm for
Variable Selection (MOGA-VS) that provides the user with an optimal set of
regression models for a given data-set. The algorithm considers the regression
problem as a two-objective task, and explores the Pareto-optimal (best subset)
models by preferring those models that have fewer regression coefficients and
better goodness of fit over the others. The model exploration can
be performed based on in-sample or generalization error minimization. The model
selection is proposed to be performed in two steps. First, we generate the
frontier of Pareto-optimal regression models by eliminating the dominated
models without any user intervention. Second, a decision making process is
executed which allows the user to choose the most preferred model using
visualisations and simple metrics. The method has been evaluated on a recently
published real dataset on Communities and Crime within the United States.
Comment: in Journal of Computational and Graphical Statistics, Vol. 24, Iss.
1, 201
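The first step described in this abstract, eliminating dominated models to obtain the Pareto frontier, can be sketched as below. This is not the MOGA-VS genetic algorithm itself; it only shows the dominance filter applied to a hand-written list of candidates. It assumes goodness of fit is expressed as an error to be minimised, and the function name `pareto_front` and the example numbers are hypothetical.

```python
def pareto_front(models):
    """Return the non-dominated models, sorted by number of coefficients.

    `models` is a list of (name, n_coefficients, error) tuples. A model
    is dominated if another model is no worse on both objectives and
    strictly better on at least one. Both objectives are minimised.
    """
    front = []
    for name, n_coef, error in models:
        dominated = any(
            other_n <= n_coef
            and other_err <= error
            and (other_n < n_coef or other_err < error)
            for _, other_n, other_err in models
        )
        if not dominated:
            front.append((name, n_coef, error))
    return sorted(front, key=lambda model: model[1])


# Made-up candidates: (name, number of coefficients, error).
candidates = [
    ("A", 1, 0.40),
    ("B", 2, 0.25),
    ("C", 2, 0.30),  # dominated by B: same size, larger error
    ("D", 3, 0.26),  # dominated by B: larger and less accurate
    ("E", 4, 0.10),
]
frontier = pareto_front(candidates)
```

Here `frontier` keeps models A, B and E; the second step in the abstract would then leave the choice among them to the user.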