Boosting Additive Models using Component-wise P-Splines
We consider an efficient approximation of Bühlmann & Yu’s L2Boosting algorithm with component-wise smoothing splines. Smoothing spline base-learners are replaced by P-spline base-learners, which yield similar prediction errors but are computationally more advantageous. In particular, we give a detailed analysis of the effect of various P-spline hyper-parameters on the boosting fit. In addition, we derive a new theoretical result on the relationship between the boosting stopping iteration and the step length factor used for shrinking the boosting estimates.
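The roles of the stopping iteration and the step length factor mentioned in this abstract can be illustrated with a minimal sketch of component-wise L2 boosting. To stay short and dependency-free it uses simple linear base-learners in place of the P-spline base-learners studied in the paper. The function name and the parameter names `nu` (step length) and `m_stop` (stopping iteration) are assumptions made for illustration, not the authors' implementation.

```python
def componentwise_l2_boost(X, y, nu=0.1, m_stop=100):
    """Component-wise L2 boosting with linear base-learners (illustrative sketch).

    X is a list of rows, y a list of responses. Returns (offset, coef, means),
    where coef refers to the mean-centred covariates.
    """
    n, p = len(X), len(X[0])
    # Centre each covariate so every base-learner is a slope through the origin.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - means[j] for row in X] for j in range(p)]
    offset = sum(y) / n
    fit = [offset] * n
    coef = [0.0] * p
    for _ in range(m_stop):
        # For squared-error loss the negative gradient is the residual vector.
        resid = [yi - fi for yi, fi in zip(y, fit)]
        best_j, best_b, best_rss = None, 0.0, float("inf")
        for j, x in enumerate(cols):
            sxx = sum(v * v for v in x)
            if sxx == 0.0:
                continue  # constant covariate, nothing to fit
            b = sum(v * r for v, r in zip(x, resid)) / sxx
            rss = sum((r - b * v) ** 2 for v, r in zip(x, resid))
            if rss < best_rss:
                best_j, best_b, best_rss = j, b, rss
        if best_j is None:
            break
        # Update only the best-fitting component, shrunk by the step length nu.
        coef[best_j] += nu * best_b
        fit = [f + nu * best_b * v for f, v in zip(fit, cols[best_j])]
    return offset, coef, means


# Toy data: the response depends on the first covariate only.
X = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y = [2 * row[0] + 1 for row in X]
offset, coef, means = componentwise_l2_boost(X, y, nu=0.1, m_stop=200)
early_offset, early_coef, _ = componentwise_l2_boost(X, y, nu=0.1, m_stop=10)
```

With many iterations the coefficient of the informative covariate approaches its least-squares value, while stopping early leaves it shrunken towards zero. A smaller `nu` needs more iterations to reach a comparable fit.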
A Unified Framework of Constrained Regression
Generalized additive models (GAMs) play an important role in modeling and
understanding complex relationships in modern applied statistics. They allow
for flexible, data-driven estimation of covariate effects. Yet researchers
often have a priori knowledge of certain effects, which might be monotonic or
periodic (cyclic) or should fulfill boundary conditions. We propose a unified
framework to incorporate these constraints for both univariate and bivariate
effect estimates and for varying coefficients. As the framework is based on
component-wise boosting methods, variables can be selected intrinsically, and
effects can be estimated for a wide range of different distributional
assumptions. Bootstrap confidence intervals for the effect estimates are
derived to assess the models. We present three case studies from environmental
sciences to illustrate the proposed seamless modeling framework. All discussed
constrained effect estimates are implemented in the comprehensive R package
mboost for model-based boosting.
Comment: This is a preliminary version of the manuscript. The final
publication is available at
http://link.springer.com/article/10.1007/s11222-014-9520-
Spatial Smoothing Techniques for the Assessment of Habitat Suitability
Precise knowledge about factors influencing the habitat suitability of a certain species forms the basis for the implementation of effective programs to conserve biological diversity. Such knowledge is frequently gathered from studies relating abundance data to a set of influential variables in a regression setup. In particular, generalised linear models are used to analyse binary presence/absence data or counts of a certain species at locations within an observation area. However, one of the key assumptions of generalised linear models, the independence of the observations, is often violated in practice since the points at which the observations are collected are spatially aligned. While several approaches have been developed to analyse and account for spatial correlation in regression models with normally distributed responses, far less work has been done in the context of generalised linear models. In this paper, we describe a general framework for semiparametric spatial generalised linear models that allows for the routine analysis of non-normal spatially aligned regression data. The approach is utilised for the analysis of a data set of synthetic bird species in beech forests, revealing that ignoring spatial dependence may actually lead to false conclusions in a number of situations.
Variable Selection and Model Choice in Geoadditive Regression Models
Model choice and variable selection are issues of major concern in practical regression analyses. We propose a boosting procedure that facilitates both tasks in a class of complex geoadditive regression models comprising spatial effects, nonparametric effects of continuous covariates, interaction surfaces, random effects, and varying coefficient terms. The major modelling components are penalized splines and their bivariate tensor product extensions. All smooth model terms are represented as the sum of a parametric component and a remaining smooth component with one degree of freedom to obtain a fair comparison between all model terms. A generic representation of the geoadditive model allows us to devise a general boosting algorithm that implements automatic model choice and variable selection. We demonstrate the versatility of our approach with two examples: a geoadditive Poisson regression model for species counts in habitat suitability analyses and a geoadditive logit model for the analysis of forest health.
Identifying Risk Factors for Severe Childhood Malnutrition by Boosting Additive Quantile Regression
Ordinary linear and generalized linear regression models relate the mean of a response variable to a linear combination of covariate effects and, as a consequence, focus on average properties of the response. Analyzing childhood malnutrition in developing or transition countries based on such a regression model implies that the estimated effects describe the average nutritional status. However, it is of even greater interest to analyze quantiles of the response distribution, such as the 5% or 10% quantile, that relate to the risk of extreme malnutrition in children. In this paper, we analyze data on childhood malnutrition collected in the 2005/2006 India Demographic and Health Survey based on a semiparametric extension of quantile regression models where nonlinear effects are included in the model equation, leading to additive quantile regression. The variable selection and model choice problems associated with estimating an additive quantile regression model are addressed by a novel boosting approach. Based on this rather general class of statistical learning procedures for empirical risk minimization, we develop, evaluate and apply a boosting algorithm for quantile regression. Our proposal allows for data-driven determination of the amount of smoothness required for the nonlinear effects and combines model selection with an automatic variable selection property. The results of our empirical evaluation suggest that boosting is an appropriate tool for estimation in linear and additive quantile regression models and helps to identify as yet unknown risk factors for childhood malnutrition.
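Quantile regression can be cast as empirical risk minimization through the standard check (pinball) loss. The short sketch below shows that loss, the negative (sub)gradient a gradient boosting algorithm would fit its base-learners to, and a toy check that a sample quantile minimizes the empirical risk. The function names are hypothetical and the code illustrates the general technique, not the algorithm developed in the paper.

```python
def check_loss(y, f, tau):
    """Check (pinball) loss for quantile level tau, with 0 < tau < 1."""
    u = y - f
    return tau * u if u >= 0 else (tau - 1) * u


def negative_gradient(y, f, tau):
    """Negative (sub)gradient of the check loss with respect to f.

    At y == f the loss is not differentiable; any value in [tau - 1, tau]
    is a valid choice, and tau is used here by convention.
    """
    return tau if y - f >= 0 else tau - 1


def empirical_risk(ys, f, tau):
    """Sum of check losses for a constant prediction f."""
    return sum(check_loss(y, f, tau) for y in ys)


# Toy check: among the observed values, the constant prediction with the
# smallest empirical risk at tau = 0.25 is a lower sample quantile.
ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
best_constant = min(ys, key=lambda f: empirical_risk(ys, f, 0.25))
print(best_constant)  # prints 3
```

Choosing a small `tau`, such as 0.05 or 0.1, penalizes over-prediction far more heavily than under-prediction, which is what pulls the fit towards the lower tail of the response distribution.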
Variable Selection and Model Choice in Structured Survival Models
In many situations, medical applications call for flexible survival models that extend the classical Cox model via the
inclusion of time-varying and nonparametric effects. These structured survival models are very flexible, but additional
difficulties arise when model choice and variable selection are desired. In particular, it has to be decided which covariates
should be assigned time-varying effects and whether parametric modeling is sufficient for a given covariate. Component-wise
boosting provides a means of likelihood-based model fitting that enables simultaneous variable selection and model choice. We
introduce a component-wise likelihood-based boosting algorithm for survival data that permits the inclusion of both parametric
and nonparametric time-varying effects as well as nonparametric effects of continuous covariates, utilizing penalized splines as
the main modeling technique. Its properties and performance are investigated in simulation studies.
The new modeling approach is used to build a flexible survival model for
intensive care patients suffering from severe sepsis.
A software implementation is available to the interested reader.
Random Forest variable importance with missing data
Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing data generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is much more capable of reflecting decreased information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations.