A Unified Framework of Constrained Regression
Generalized additive models (GAMs) play an important role in modeling and
understanding complex relationships in modern applied statistics. They allow
for flexible, data-driven estimation of covariate effects. Yet researchers
often have a priori knowledge of certain effects, which might be monotonic or
periodic (cyclic) or should fulfill boundary conditions. We propose a unified
framework to incorporate these constraints for both univariate and bivariate
effect estimates and for varying coefficients. As the framework is based on
component-wise boosting methods, variables can be selected intrinsically, and
effects can be estimated for a wide range of different distributional
assumptions. Bootstrap confidence intervals for the effect estimates are
derived to assess the models. We present three case studies from environmental
sciences to illustrate the proposed seamless modeling framework. All discussed
constrained effect estimates are implemented in the comprehensive R package
mboost for model-based boosting.
Comment: This is a preliminary version of the manuscript. The final publication is available at http://link.springer.com/article/10.1007/s11222-014-9520-
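As a language-agnostic illustration of the kind of constraint involved (not the mboost implementation, which is in R), a monotone univariate effect estimate can be obtained with the classical pool-adjacent-violators algorithm:

```python
# Illustrative sketch only: a monotone (isotonic) least-squares fit via the
# pool-adjacent-violators algorithm, the type of constrained univariate
# effect estimate the proposed framework supports.

def pava(y, weights=None):
    """Least-squares fit to y that is non-decreasing in index order."""
    n = len(y)
    w = list(weights) if weights is not None else [1.0] * n
    # Each block stores (mean, weight, count); merge while order is violated.
    means, wts, counts = [], [], []
    for yi, wi in zip(y, w):
        means.append(yi); wts.append(wi); counts.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, c2 = means.pop(), wts.pop(), counts.pop()
            m1, w1, c1 = means.pop(), wts.pop(), counts.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            wts.append(wt); counts.append(c1 + c2)
    fit = []
    for m, c in zip(means, counts):
        fit.extend([m] * c)
    return fit

# A decreasing bump in otherwise increasing data is pooled away:
print(pava([1.0, 2.0, 4.0, 3.0, 5.0]))  # [1.0, 2.0, 3.5, 3.5, 5.0]
```

In the boosting framework such a constrained fit simply takes the place of an unconstrained least-squares base-learner.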
Variable Selection and Model Choice in Structured Survival Models
In many situations, medical applications call for flexible survival models that extend the classical Cox model via the
inclusion of time-varying and nonparametric effects. These structured survival models are very flexible, but additional
difficulties arise when model choice and variable selection are desired. In particular, it has to be decided which covariates
should be assigned time-varying effects, or whether parametric modeling is sufficient for a given covariate. Component-wise
boosting provides a means of likelihood-based model fitting that enables simultaneous variable selection and model choice. We
introduce a component-wise likelihood-based boosting algorithm for survival data that permits the inclusion of both parametric
and nonparametric time-varying effects as well as nonparametric effects of continuous covariates utilizing penalized splines as
the main modeling technique. Its properties
and performance are investigated in simulation studies.
The new modeling approach is used to build a flexible survival model for
intensive care patients suffering from severe sepsis.
A software implementation is available to the interested reader.
A Framework for Unbiased Model Selection Based on Boosting
Variable selection and model choice are of major concern in many statistical applications, especially in high-dimensional regression models. Boosting is a convenient statistical method that combines model fitting with intrinsic model selection.
We investigate the impact of base-learner specification on the performance of boosting as a model selection procedure.
We show that variable selection may be biased if the covariates are of a different nature.
Important examples are models combining continuous and categorical covariates, especially if the number of categories is large. In this case, least squares base-learners offer increased flexibility for the categorical covariate and lead to a preference for it even if the categorical covariate is non-informative.
Similar difficulties arise when comparing linear and nonlinear base-learners for a continuous covariate. The additional flexibility of the nonlinear base-learner again yields a preference for the more complex modeling alternative.
We investigate these problems from a theoretical perspective and suggest a framework for unbiased model selection based on a general class of penalized least squares base-learners.
Making all base-learners comparable in terms of their degrees of freedom strongly reduces the selection bias observed in naive boosting specifications. The importance of unbiased model selection is demonstrated in simulations and in an application to forest health models.
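The degrees-of-freedom matching device can be sketched numerically. For a dummy-coded categorical base-learner with a ridge penalty, the hat matrix is diagonal in the category counts n_j, so the effective degrees of freedom are sum_j n_j / (n_j + lambda), and the penalty can be tuned by bisection to hit any target df (a standalone sketch; the function names are ours, not mboost's API):

```python
# Sketch of df matching for a ridge-penalized categorical base-learner.
# With dummy coding (no intercept), X'X is diagonal with the category
# counts n_j, so df(lambda) = sum_j n_j / (n_j + lambda).

def df_ridge_categorical(counts, lam):
    return sum(n / (n + lam) for n in counts)

def lambda_for_df(counts, target_df, lo=1e-8, hi=1e8, tol=1e-10):
    """Bisection: df(lambda) is strictly decreasing in lambda."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if df_ridge_categorical(counts, mid) > target_df:
            lo = mid  # too flexible, need more penalty
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

counts = [30, 25, 25, 20]          # four categories, unpenalized df = 4
lam = lambda_for_df(counts, 1.0)   # shrink until df = 1
print(round(df_ridge_categorical(counts, lam), 6))  # ~1.0
```

Giving every candidate base-learner the same df in this way puts a 20-level factor and a single continuous covariate on an equal footing in the selection step.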
Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost
We provide a detailed hands-on tutorial for the R add-on package mboost. The package implements boosting for optimizing general risk functions, utilizing component-wise (penalized) least squares estimates as base-learners for fitting various kinds of generalized linear and generalized additive models to potentially high-dimensional data. We give the theoretical background and demonstrate how mboost can be used to fit interpretable models of different complexity. As a running example throughout the tutorial, we use mboost to predict body fat from anthropometric measurements.
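The core fitting procedure, component-wise least squares boosting, can be sketched in a few lines. The following standalone Python version (mboost itself is an R package; all names here are ours) fits every candidate base-learner to the current residuals in each iteration, then advances only the best-fitting one by a small step length nu:

```python
# Illustrative sketch of component-wise L2 boosting with simple linear
# base-learners, one per covariate. Only the best-fitting component is
# updated per iteration, which yields intrinsic variable selection.
import random

def simple_ols(x, r):
    """Least-squares intercept/slope of r regressed on one covariate x."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r)) / sxx
    return mr - b * mx, b

def componentwise_boost(X, y, n_iter=200, nu=0.1):
    n, p = len(y), len(X)
    coefs = [(0.0, 0.0)] * p            # per-covariate (intercept, slope)
    fit = [0.0] * n
    for _ in range(n_iter):
        r = [yi - fi for yi, fi in zip(y, fit)]
        best = None
        for j in range(p):               # fit each base-learner to residuals
            a, b = simple_ols(X[j], r)
            rss = sum((ri - a - b * xi) ** 2 for ri, xi in zip(r, X[j]))
            if best is None or rss < best[0]:
                best = (rss, j, a, b)
        _, j, a, b = best                # update only the winning component
        coefs[j] = (coefs[j][0] + nu * a, coefs[j][1] + nu * b)
        fit = [fi + nu * (a + b * xi) for fi, xi in zip(fit, X[j])]
    return coefs, fit

random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(100)]
x2 = [random.gauss(0, 1) for _ in range(100)]   # non-informative covariate
y = [2.0 * a + random.gauss(0, 0.1) for a in x1]
coefs, fit = componentwise_boost([x1, x2], y)
print(round(coefs[0][1], 2))  # slope on the informative covariate, near 2
```

In mboost the same scheme runs with penalized spline, tree, or spatial base-learners and arbitrary loss functions in place of squared error.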
Building Cox-Type Structured Hazard Regression Models with Time-Varying Effects
In recent years, flexible hazard regression models based on penalised splines have been developed that allow us to extend the classical Cox model via the inclusion of time-varying and nonparametric effects. Despite their immediate appeal in terms of flexibility, these models introduce additional difficulties when a subset of covariates and the corresponding modelling alternatives have to be chosen. We present an analysis of data from a specific patient population with 90-day survival as the response variable. The aim is to determine a sensible prognostic model where some variables have to be included due to subject-matter knowledge while other variables are subject to model selection. Motivated by this application, we propose a two-stage stepwise model building strategy that simultaneously chooses both the relevant covariates and the corresponding modelling alternatives from the set of possible covariates. For categorical covariates, competing modelling approaches are linear effects and time-varying effects, whereas nonparametric modelling provides a further alternative in the case of continuous covariates. In our data analysis, we identified a prognostic model containing both smooth and time-varying effects.
Boosting for statistical modelling: A non-technical introduction
Boosting algorithms were originally developed for machine learning but were later adapted to estimate statistical models, offering various practical advantages such as automated variable selection and implicit regularization of effect estimates. The interpretation of the resulting models, however, remains the same as if they had been fitted by classical methods. Boosting hence allows an advanced machine learning scheme to be used to estimate various types of statistical models. This tutorial aims to highlight how boosting can be used for semi-parametric modelling, what practical implications follow from the design of the algorithm, and what kinds of drawbacks data analysts have to expect. We illustrate the application of boosting in the analysis of a stunting score of children in India and of a high-dimensional dataset of tumour DNA used to develop a biomarker for the occurrence of metastases in breast cancer patients.
GAMLSS for high-dimensional data – a flexible approach based on boosting
Generalized additive models for location, scale and shape (GAMLSS) are a popular semi-parametric modelling approach that, in contrast to conventional GAMs, regress not only the expected mean but every distribution parameter (e.g. location, scale and shape) on a set of covariates. Current fitting procedures for GAMLSS are infeasible for high-dimensional data settings and require variable selection based on (potentially problematic) information criteria. The present work describes a boosting algorithm for high-dimensional GAMLSS that was developed to overcome these limitations. Specifically, the new algorithm was designed to allow the simultaneous estimation of predictor effects and variable selection. The proposed algorithm was applied to data from the Munich Rental Guide, which is used by landlords and tenants as a reference for the average rent of a flat depending on its characteristics and spatial features. The net-rent predictions that resulted from the high-dimensional GAMLSS were found to be highly competitive, while covariate-specific prediction intervals showed a major improvement over classical GAMs.
gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework
Generalized additive models for location, scale and shape are a flexible class of regression models that allow multiple parameters of a distribution function, such as the mean and the standard deviation, to be modelled simultaneously. With the R package gamboostLSS, we provide a boosting method to fit these models. Variable selection and model choice are naturally available within this regularized regression framework. To introduce and illustrate the R package gamboostLSS and its infrastructure, we use a data set on stunted growth in India. In addition to the specification and application of the model itself, we present a variety of convenience functions, including methods for tuning parameter selection, prediction and visualization of results. The package gamboostLSS is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=gamboostLSS
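The cyclic updating scheme behind this kind of boosting can be sketched for a Gaussian location-scale model: the algorithm alternates gradient steps for the mean and for log(sigma), each driven by the negative gradient of the Gaussian log-likelihood (a standalone Python illustration of the idea, not the package's R code; all names are ours):

```python
# Sketch of cyclic boosting for a Gaussian location-scale model:
# mu(x) = a + b*x is updated by a linear base-learner, log(sigma) by an
# intercept base-learner, alternating every iteration.
import math, random

def ols1(x, u):
    n = len(x); mx = sum(x) / n; mu_ = sum(u) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (ui - mu_) for xi, ui in zip(x, u)) / sxx
    return mu_ - b * mx, b

random.seed(2)
n = 500
x = [random.uniform(-1, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]

a_mu = b_mu = 0.0    # linear predictor for the mean
eta_s = 0.0          # intercept-only predictor for log(sigma)
nu = 0.1             # step length
for _ in range(500):
    sig2 = math.exp(2 * eta_s)
    mu = [a_mu + b_mu * xi for xi in x]
    # negative gradient of the Gaussian NLL w.r.t. mu: (y - mu) / sigma^2
    u = [(yi - mi) / sig2 for yi, mi in zip(y, mu)]
    a, b = ols1(x, u)
    a_mu += nu * a; b_mu += nu * b
    # ... and w.r.t. log(sigma): (y - mu)^2 / sigma^2 - 1
    mu = [a_mu + b_mu * xi for xi in x]
    v = [(yi - mi) ** 2 / sig2 - 1 for yi, mi in zip(y, mu)]
    eta_s += nu * (sum(v) / len(v))   # intercept base-learner = mean

print(round(b_mu, 2), round(math.exp(eta_s), 2))  # near 2.0 and 0.5
```

gamboostLSS applies the same cycling with arbitrary distribution families and the full set of mboost base-learners for every distribution parameter.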