    Regression Trees and Random forest based feature selection for malaria risk exposure prediction

    This paper deals with predicting the number of Anopheles mosquitoes, the main vector of malaria, using environmental and climate variables. Variable selection is based on an automatic machine learning method using regression trees and random forests combined with stratified two-level cross-validation. The minimum threshold of variable importance is assessed using the quadratic distance of the variable importances, while the optimal subset of selected variables is used to perform predictions. Finally, the results proved qualitatively better in terms of selection, prediction, and CPU time than those obtained by the GLM-Lasso method.

    Seasonal prediction of lake inflows and rainfall in a hydro-electricity catchment, Waitaki river, New Zealand

    The Waitaki River is located in the centre of the South Island of New Zealand, and hydro-electricity generated on the river accounts for 35-40% of New Zealand's electricity. Low inflows in 1992 and 2001 resulted in the threat of power blackouts. Improved seasonal rainfall and inflow forecasts will result in better management of the water used in hydro-generation on a seasonal basis. Researchers have stated that two key directions in the fields of seasonal rainfall and streamflow forecasting are to a) decrease the spatial scale of forecast products, and b) tailor forecast products to end-user needs, so as to provide more relevant and targeted forecasts. Several season-ahead lake inflow and rainfall forecast models were calibrated for the Waitaki River catchment using statistical techniques to quantify relationships between land-ocean-atmosphere state variables and seasonally lagged inflows and rainfall. Techniques included principal components analysis and multiple linear regression, with cross-validation techniques applied to estimate model error and randomization techniques used to establish the significance of the skill of the models. Many of the calibrated models predict rainfall and inflows better than random chance and better than the long-term mean as a predictor. When compared to the range of all probable inflow seasonal totals (based on the 80-year recorded history in the catchment), 95% confidence limits around most model predictions offer significant skill. These models explain up to 19% of the variance in season-ahead rainfall and inflows in this catchment. Seasonal rainfall and inflow forecasting on a single-catchment scale and focussed on end-user needs is possible with some skill in the South Island of New Zealand.
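
    The comparison against "the long-term mean as a predictor" can be illustrated with a skill score. The sketch below is only an illustration and is not taken from the paper: the function names are hypothetical, and for brevity the long-term mean is computed from the same observations rather than inside a cross-validation loop as the abstract describes.

```python
def mean_squared_error(predictions, observations):
    """Average squared difference between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predictions, observations)) / len(observations)


def skill_vs_long_term_mean(predictions, observations):
    """Skill score relative to always forecasting the long-term mean.

    1.0 means a perfect forecast, 0.0 means no better than the mean,
    and negative values mean worse than the mean.
    """
    long_term_mean = sum(observations) / len(observations)
    mse_model = mean_squared_error(predictions, observations)
    mse_reference = mean_squared_error([long_term_mean] * len(observations), observations)
    return 1.0 - mse_model / mse_reference


# Made-up seasonal totals, purely for illustration.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
forecast = [2.0, 2.0, 3.0, 4.0, 4.0]
skill = skill_vs_long_term_mean(forecast, observed)
```

    With squared error as the loss, this score is the fraction of variance explained relative to the mean forecast, which is the sense in which a model "explaining up to 19% of the variance" still beats the long-term mean.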

    Lasso based feature selection for malaria risk exposure prediction

    In the life sciences, experts generally use empirical knowledge to recode variables, choose interactions and perform selection by a classical approach. The aim of this work is to build an automatic learning algorithm for variable selection, in order to determine whether experts can be helped in their decisions, or simply replaced by the machine, and whether their knowledge and results can be improved. The Lasso method can detect the optimal subset of variables for estimation and prediction under some conditions. In this paper, we propose a novel approach which automatically uses all available variables and all interactions. Using a double cross-validation combined with the Lasso, we select the best subset of variables, and with a GLM through a simple cross-validation we perform predictions. The algorithm ensures the stability and the consistency of the estimators.
    Comment: in Petra Perner. Machine Learning and Data Mining in Pattern Recognition, Jul 2015, Hamburg, Germany. Ibai publishing, 2015, Machine Learning and Data Mining in Pattern Recognition (proceedings of 11th International Conference, MLDM 2015
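
    To make concrete how the Lasso selects a subset of variables, here is a minimal sketch of Lasso regression by coordinate descent with soft-thresholding. It is not the authors' algorithm: it omits the intercept, the interaction terms, the GLM step and the double cross-validation, and the function names, the penalty value and the toy data are all assumptions made for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within [-penalty, penalty] become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + penalty * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that ignores column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale if scale > 0 else 0.0
    return beta


# Toy data: the response depends only on the first column.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
beta = lasso_coordinate_descent(X, y, penalty=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

    On this toy data the coefficient of the irrelevant second column is set exactly to zero, so only the first variable is selected. In practice the penalty would be chosen by cross-validation, as the abstract indicates.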

    Computational Models for Transplant Biomarker Discovery.

    Translational medicine offers rich promise for improved diagnostics and drug discovery in biomedical research on transplantation, where unmet diagnostic and therapeutic needs persist. The recent advent of genomic and proteomic profiling, collectively called "omics", provides new resources for developing novel biomarkers for routine clinical use. Establishing such a marker system depends heavily on the appropriate application of computational algorithms and software, which are based on mathematical theories and models. Understanding these theories helps in choosing appropriate algorithms so that biomarker systems succeed. Here, we review the key advances in theories and mathematical models relevant to transplant biomarker development. The advantages and limitations inherent in these models are discussed. The principles of key computational approaches for efficiently selecting the best subset of biomarkers from high-dimensional omics data are highlighted. Prediction models are also introduced, and the integration of multi-microarray data is discussed. Appreciating these key advances would help to accelerate the development of clinically reliable biomarker systems.

    Forecasting of electricity prices in the Spanish electricity market using machine learning tools

    The objective of this research assignment was to forecast electricity prices in the Spanish electricity market using three different machine learning techniques: k-nearest neighbours, support vector regression and artificial neural networks. The achieved results were compared and the quality of the developed models was evaluated. The project was implemented in Python 3.

    Open source R for applying machine learning to RPAS remote sensing images

    The increase in the number of remote sensing platforms, ranging from satellites to close-range Remotely Piloted Aircraft Systems (RPAS), is leading to a growing demand for new image processing and classification tools. This article presents a comparison of the Random Forest (RF) and Support Vector Machine (SVM) machine-learning algorithms for extracting land-use classes from an RPAS-derived orthomosaic using open source R packages. The camera used in this work captures the reflectance of the Red, Blue, Green and Near Infrared channels of a target. The full dataset is therefore a 4-channel raster image. The classification performance of the two methods is tested at varying training set sizes. SVM and RF are evaluated using the Kappa index, classification accuracy and classification error as accuracy metrics. The training sets are randomly drawn as subsets of 2 to 20% of the total number of raster cells, with stratified sampling according to the land-use classes. Ten runs are done for each training set to calculate the variance in the results. The control dataset consists of an independent classification obtained by photointerpretation. The validation is carried out (i) using K-fold cross-validation, (ii) using the pixels from the validation test set, and (iii) using the pixels from the full test set. Validation with K-fold and with the validation dataset shows that SVM gives better results, but RF proves to perform better when the training size is larger. Classification error and classification accuracy follow the trend of the Kappa index.
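
    The Kappa index and classification accuracy mentioned above can both be computed from a confusion matrix. The study itself used R packages; the following Python sketch is only an illustration, with hypothetical function names and a made-up two-class confusion matrix.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def cohen_kappa(confusion):
    """Cohen's kappa: agreement between reference (rows) and prediction (columns),
    corrected for the agreement expected by chance."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = overall_accuracy(confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(size)) for j in range(size)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)


# Made-up counts: rows are reference classes, columns are predicted classes.
confusion = [[20, 5],
             [10, 15]]
accuracy = overall_accuracy(confusion)   # 35 of 50 pixels are correct
kappa = cohen_kappa(confusion)
```

    Classification error is simply one minus the accuracy, which is one reason it can be expected to follow the same trend as the other two metrics.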

    A Multi-objective Exploratory Procedure for Regression Model Selection

    Variable selection is recognized as one of the most critical steps in statistical modeling. The problems encountered in engineering and the social sciences are commonly characterized by an over-abundance of explanatory variables, non-linearities and unknown interdependencies between the regressors. An added difficulty is that analysts may have little or no prior knowledge of the relative importance of the variables. To provide a robust method for model selection, this paper introduces the Multi-objective Genetic Algorithm for Variable Selection (MOGA-VS), which provides the user with an optimal set of regression models for a given dataset. The algorithm treats the regression problem as a two-objective task and explores the Pareto-optimal (best subset) models by preferring models that have fewer regression coefficients and better goodness of fit. The model exploration can be performed based on in-sample or generalization error minimization. Model selection is proposed to be performed in two steps. First, we generate the frontier of Pareto-optimal regression models by eliminating the dominated models without any user intervention. Second, a decision-making process is executed which allows the user to choose the most preferred model using visualisations and simple metrics. The method has been evaluated on a recently published real dataset on Communities and Crime within the United States.
    Comment: in Journal of Computational and Graphical Statistics, Vol. 24, Iss. 1, 201
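
    The first step, eliminating dominated models to obtain the Pareto frontier, can be sketched as below. This is not MOGA-VS itself: the genetic search is replaced by a plain filter over an already given list of candidates, and the model names, coefficient counts and error values are invented for illustration.

```python
def pareto_front(models):
    """Keep the models that no other model dominates.

    Each model is a tuple (name, n_coefficients, error); lower is better for both.
    A model is dominated if another one is at least as good on both objectives
    and strictly better on at least one.
    """
    front = []
    for name, n_coef, error in models:
        dominated = any(
            other_coef <= n_coef and other_error <= error
            and (other_coef < n_coef or other_error < error)
            for _, other_coef, other_error in models
        )
        if not dominated:
            front.append((name, n_coef, error))
    # Order the frontier by model size for easier inspection.
    return sorted(front, key=lambda model: model[1])


# Hypothetical candidate regression models.
candidates = [
    ("A", 1, 0.9),
    ("B", 2, 0.5),
    ("C", 2, 0.7),  # same size as B but worse fit
    ("D", 3, 0.5),  # same fit as B but larger
    ("E", 4, 0.3),
]
front = pareto_front(candidates)
front_names = [name for name, _, _ in front]
```

    Here models C and D are discarded because B is at least as good on both objectives, leaving A, B and E for the user to choose among in the second step.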