
    An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and they provide descriptive variable importance measures that reflect the impact of each variable through both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low- and high-dimensional data exploration, and also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
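
    As a rough, hedged sketch of the kind of analysis described above (not the paper's own code), a single tree and a random forest with variable importance can be fitted in R as follows; the iris data and all settings are illustrative placeholders:

```r
## Minimal sketch: one recursive-partitioning tree and a random forest in R.
## The iris data set and all tuning settings are illustrative placeholders.
library(partykit)      # conditional inference trees: ctree()
library(randomForest)  # classical random forests

# A single classification tree
tree <- ctree(Species ~ ., data = iris)
plot(tree)

# A random forest with permutation-based variable importance
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
print(rf)                  # out-of-bag error estimate
importance(rf, type = 1)   # mean decrease in accuracy per variable
varImpPlot(rf)
```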

    Bagging Time Series Models

    A common problem in out-of-sample prediction is that there are potentially many relevant predictors that individually have only weak explanatory power. We propose bootstrap aggregation of pre-test predictors (or bagging for short) as a means of constructing forecasts from multiple regression models with local-to-zero regression parameters and errors subject to possible serial correlation or conditional heteroskedasticity. Bagging is designed for situations in which the number of predictors (M) is moderately large relative to the sample size (T). We show how to implement bagging in the dynamic multiple regression model and provide asymptotic justification for the bagging predictor. A simulation study shows that bagging tends to produce large reductions in the out-of-sample prediction mean squared error and provides a useful alternative to forecasting from factor models when M is large, but much smaller than T. We also find that bagging indicators of real economic activity greatly reduces the prediction mean squared error of forecasts of U.S. CPI inflation at horizons of one month and one year.
    Keywords: forecasting; bootstrap; model selection; pre-testing; forecast aggregation; factor models; inflation.
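
    A hedged sketch of the general idea, bagging pre-test predictors, is given below. It uses simulated data, a plain i.i.d. bootstrap rather than a scheme robust to serial correlation, and an ad hoc t-statistic cutoff, so it illustrates the mechanics rather than the authors' exact procedure:

```r
## Sketch of bagging pre-test predictors: on each bootstrap sample, keep only
## the regressors whose t-statistics exceed a threshold, forecast, then average.
## Data, threshold, and the plain (non-block) bootstrap are illustrative choices.
set.seed(1)
T <- 200; M <- 20
X <- matrix(rnorm(T * M), T, M)
y <- X[, 1] * 0.3 + rnorm(T)           # only the first predictor matters, weakly
x_new <- rnorm(M)                      # predictor values for the forecast period

B <- 200; crit <- 1.96
forecasts <- numeric(B)
for (b in 1:B) {
  idx <- sample(T, replace = TRUE)
  fit <- lm(y[idx] ~ X[idx, ])
  tstat <- coef(summary(fit))[-1, "t value"]
  keep <- which(abs(tstat) > crit)     # pre-test: drop insignificant predictors
  if (length(keep) == 0) {
    forecasts[b] <- mean(y[idx])
  } else {
    refit <- lm(y[idx] ~ X[idx, keep, drop = FALSE])
    forecasts[b] <- sum(coef(refit) * c(1, x_new[keep]))
  }
}
bagged_forecast <- mean(forecasts)     # bagging: average over bootstrap forecasts
```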

    Essays on Robust Model Selection and Model Averaging for Linear Models

    Model selection is central to all applied statistical work. Selecting the variables for use in a regression model is one important example of model selection. This thesis is a collection of essays on robust model selection procedures and model averaging for linear regression models. In the first essay, we propose robust Akaike information criteria (AIC) for MM-estimation and an adjusted robust scale based AIC for M- and MM-estimation. Our proposed model selection criteria maintain their robustness in the presence of a high proportion of outliers, including outliers in the covariates. We compare our proposed criteria with other robust model selection criteria discussed in the previous literature. Our simulation studies show that robust AIC based on MM-estimation significantly outperforms the alternatives when there are outliers in the covariates, and a real data example likewise favours it. The second essay focuses on robust versions of the "Least Absolute Shrinkage and Selection Operator" (lasso). The adaptive lasso is a method for performing simultaneous parameter estimation and variable selection; the adaptive weights used in its penalty term allow it to achieve the oracle property. In this essay, we propose an extension of the adaptive lasso named the Tukey-lasso. By using Tukey's biweight criterion instead of squared loss, the Tukey-lasso is resistant to outliers in both the response and the covariates. Importantly, we demonstrate that the Tukey-lasso also enjoys the oracle property. A fast accelerated proximal gradient (APG) algorithm is proposed and implemented for computing the Tukey-lasso. Our extensive simulations show that the Tukey-lasso, implemented with the APG algorithm, achieves very reliable results, including for high-dimensional data where p>n. In the presence of outliers, the Tukey-lasso offers substantial improvements in performance over the adaptive lasso and other robust implementations of the lasso, and real data examples further demonstrate its utility. In many statistical analyses, a single model is used for statistical inference, ignoring the process that led to that model being selected. To account for this model uncertainty, many model averaging procedures have been proposed. In the last essay, we propose an extension of a bootstrap model averaging approach, called bootstrap lasso averaging (BLA). BLA uses the lasso for model selection, in contrast to other forms of bootstrap model averaging that use AIC or the Bayesian information criterion (BIC). The use of the lasso improves computation speed and allows BLA to be applied even when the number of variables p is larger than the sample size n. Extensive simulations confirm that BLA has outstanding finite-sample performance, in terms of both variable selection and prediction accuracy, compared with traditional model selection and model averaging methods. Several real data examples further demonstrate the improved out-of-sample predictive performance of BLA.
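
    As one hedged illustration of the last essay's theme, a bare-bones version of bootstrap model averaging with the lasso might look as follows in R (using glmnet; the simulated data and all settings are placeholders, and this is not the thesis's own implementation):

```r
## Sketch of bootstrap lasso averaging: refit the lasso on bootstrap samples
## and average the resulting predictions. Simulated data; all settings illustrative.
library(glmnet)
set.seed(1)
n <- 100; p <- 150                        # p > n is handled by the lasso
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(n)
X_new <- matrix(rnorm(10 * p), 10, p)     # new observations to predict

B <- 50
preds <- matrix(NA, nrow(X_new), B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)
  fit <- cv.glmnet(X[idx, ], y[idx])      # lasso with cross-validated lambda
  preds[, b] <- predict(fit, newx = X_new, s = "lambda.min")
}
averaged_pred <- rowMeans(preds)          # model-averaged predictions
```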

    Towards Using Model Averaging To Construct Confidence Intervals In Logistic Regression Models

    Regression analyses in epidemiological and medical research typically begin with a model selection process, followed by inference assuming the selected model has generated the data at hand. It is well known that this two-step procedure can yield biased estimates and invalid confidence intervals for model coefficients due to the uncertainty associated with the model selection. To account for this uncertainty, multiple models may be selected as a basis for inference. This method, commonly referred to as model averaging, is increasingly becoming a viable approach in practice. Previous research has demonstrated the advantage of model averaging in reducing the bias of parameter estimates. However, there is a lack of methods for constructing confidence intervals around parameter estimates under model averaging. In the context of multiple logistic regression models, we propose and evaluate new confidence interval estimation approaches for regression coefficients. Specifically, we study the properties of confidence intervals constructed by averaging the tail errors arising from the confidence limits of all models included in the model averaging for parameter estimation. We propose model-averaging confidence intervals based on the score test. For the selection of models to be averaged, we propose the bootstrap inclusion fractions method. We evaluate the performance of our proposed methods in simulation studies, comparing them with model-averaging interval procedures based on likelihood ratio and Wald tests, traditional stepwise procedures, the bootstrap approach, penalized regression, and the Bayesian model-averaging approach. Methods with good performance have been implemented in the 'mataci' R package and illustrated using data from a low birth weight study.
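
    As a hedged sketch of one ingredient mentioned above, the bootstrap inclusion fraction idea can be illustrated as follows; simulated data and stepwise AIC selection stand in for the paper's actual selection procedure, and the score-test-based intervals themselves are not reproduced here:

```r
## Sketch of bootstrap inclusion fractions for logistic regression: resample
## the data, run a selection procedure, record which covariates are retained.
## Simulated data and stepwise AIC selection are illustrative placeholders.
set.seed(1)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.8 * dat$x1 - 0.6 * dat$x2))

B <- 100
vars <- c("x1", "x2", "x3", "x4")
inclusion <- matrix(0, B, length(vars), dimnames = list(NULL, vars))
for (b in 1:B) {
  boot <- dat[sample(n, replace = TRUE), ]
  full <- glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = boot)
  sel  <- step(full, trace = 0)                    # stepwise AIC selection
  inclusion[b, ] <- vars %in% attr(terms(sel), "term.labels")
}
colMeans(inclusion)   # bootstrap inclusion fraction for each covariate
```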

    Methods for Population Adjustment with Limited Access to Individual Patient Data: A Review and Simulation Study

    Population-adjusted indirect comparisons estimate treatment effects when access to individual patient data is limited and there are cross-trial differences in effect modifiers. Popular methods include matching-adjusted indirect comparison (MAIC) and simulated treatment comparison (STC). There has been limited formal evaluation of these methods and of whether they can be used to compare treatments accurately. We therefore undertake a comprehensive simulation study comparing standard unadjusted indirect comparisons, MAIC and STC across 162 scenarios. The simulation study assumes that the trials investigate survival outcomes and measure continuous covariates, with the log hazard ratio as the measure of effect. MAIC yields unbiased treatment effect estimates when its assumptions hold. The typical usage of STC produces bias because it targets a conditional treatment effect when the target estimand should be a marginal treatment effect; the resulting incompatibility of estimates in the indirect comparison leads to bias because the measure of effect is non-collapsible. Standard indirect comparisons are systematically biased, particularly under stronger covariate imbalance and interaction effects. Standard errors and coverage rates are often valid for MAIC, but the robust sandwich variance estimator underestimates variability where effective sample sizes are small. Interval estimates for the standard indirect comparison are too narrow, and STC suffers from bias-induced undercoverage. MAIC provides the most accurate estimates and, with lower degrees of covariate overlap, its bias reduction outweighs the loss in effective sample size and precision when its assumptions hold. An important future objective is the development of an alternative formulation of STC that targets a marginal treatment effect.
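
    A compact, hedged sketch of the weight-estimation step commonly used in MAIC (method-of-moments matching of covariate means) is shown below; the individual patient data and aggregate means are simulated placeholders, and this is not the code used in the simulation study:

```r
## Sketch of MAIC weighting: centre the IPD covariates at the aggregate-level
## means reported for the comparator trial, then estimate weights
## w_i = exp(x_i' alpha) by minimising Q(alpha) = sum_i exp(x_i' alpha).
## All data below are simulated placeholders.
set.seed(1)
n <- 500
ipd_X <- cbind(age = rnorm(n, 60, 8), severity = rnorm(n, 2, 1))
agg_means <- c(age = 63, severity = 2.4)       # reported means in the other trial

X_centred <- sweep(ipd_X, 2, agg_means)        # x_i minus the target means
Q <- function(alpha) sum(exp(X_centred %*% alpha))
opt <- optim(rep(0, ncol(X_centred)), Q, method = "BFGS")
w <- drop(exp(X_centred %*% opt$par))          # MAIC weights

# Check: weighted IPD covariate means should match the aggregate means
colSums(w * ipd_X) / sum(w)

# Approximate effective sample size after weighting
ess <- sum(w)^2 / sum(w^2)
```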

    A nonparametric model-based estimator for the cumulative distribution function of a right censored variable in a finite population

    In survey analysis, estimation of the cumulative distribution function (cdf) is of great interest: it allows one, for instance, to derive quantile estimators or other nonlinear parameters based on the cdf. We consider the case where the response variable is a right-censored duration variable. In this framework, the classical estimator of the cdf is the Kaplan-Meier estimator. As an alternative, we propose a nonparametric model-based estimator of the cdf in a finite population. The new estimator uses auxiliary information brought by a continuous covariate and is based on nonparametric median regression adapted to the censored case. The bias and variance of the prediction error of the estimator are estimated by a bootstrap procedure adapted to censoring. The new estimator is compared by model-based simulations to the Kaplan-Meier estimator computed with the sampled individuals: a significant gain in precision is obtained with the new method, whatever the sample size and the censoring rate. Welfare duration data are used to illustrate the new methodology.
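
    For reference, the Kaplan-Meier benchmark mentioned above can be computed with the survival package as sketched below; the proposed model-based estimator itself (median regression adapted to censoring plus a censoring-adapted bootstrap) is not reproduced here, and the data are simulated placeholders:

```r
## Sketch of the Kaplan-Meier benchmark: the cdf is one minus the estimated
## survival function. Simulated right-censored durations, not the paper's data.
library(survival)
set.seed(1)
n <- 200
true_time <- rexp(n, rate = 0.10)
cens_time <- rexp(n, rate = 0.05)
time  <- pmin(true_time, cens_time)
event <- as.numeric(true_time <= cens_time)   # 1 = observed, 0 = censored

km <- survfit(Surv(time, event) ~ 1)
cdf_hat <- 1 - km$surv                        # estimated cdf F(t) = 1 - S(t)
plot(km$time, cdf_hat, type = "s",
     xlab = "duration", ylab = "estimated cdf")
```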