An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can handle large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and bioinformatics in recent years.
High-dimensional problems are common not only in genetics but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications and provide descriptive variable importance measures reflecting the impact of each variable through both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
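As a minimal illustration of the workflow this abstract describes (a random forest fit to high-dimensional data, followed by variable importance ranking), the sketch below uses scikit-learn in Python as a stand-in for the R implementations the text refers to; the data, sizes, and settings are simulated and purely illustrative.

```python
# Illustrative sketch: random forest classification with variable
# importance on simulated "few subjects, many predictors" data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Simulated high-dimensional data: 100 subjects, 50 predictors,
# of which only 5 are informative.
X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# Impurity-based variable importance for each predictor,
# ranked from most to least influential.
ranked = sorted(enumerate(rf.feature_importances_),
                key=lambda t: t[1], reverse=True)
print(ranked[:5])
```

Note that impurity-based importances can be biased toward variables with many split points; the conditional-inference approach discussed in this literature was developed partly to address that.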
Bagging Time Series Models
A common problem in out-of-sample prediction is that there are potentially many relevant predictors that individually have only weak explanatory power. We propose bootstrap aggregation of pre-test predictors (or bagging for short) as a means of constructing forecasts from multiple regression models with local-to-zero regression parameters and errors subject to possible serial correlation or conditional heteroskedasticity. Bagging is designed for situations in which the number of predictors (M) is moderately large relative to the sample size (T). We show how to implement bagging in the dynamic multiple regression model and provide asymptotic justification for the bagging predictor. A simulation study shows that bagging tends to produce large reductions in the out-of-sample prediction mean squared error and provides a useful alternative to forecasting from factor models when M is large, but much smaller than T. We also find that bagging indicators of real economic activity greatly reduces the prediction mean squared error of forecasts of U.S. CPI inflation at horizons of one month and one year. Keywords: forecasting; bootstrap; model selection; pre-testing; forecast aggregation; factor models; inflation.
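The bagging-of-pre-test-predictors idea described above can be sketched in a few lines: on each bootstrap resample, keep only the predictors whose t-statistics clear a threshold, refit by least squares on the survivors, forecast, and average the forecasts. The data, the 1.96 threshold, and all names below are illustrative, and a simple iid bootstrap is used; with serially correlated errors, as in the paper's setting, a block bootstrap would be appropriate.

```python
# Minimal sketch of bootstrap aggregation of pre-test predictors.
import numpy as np

rng = np.random.default_rng(0)
T, M = 200, 10
X = rng.standard_normal((T, M))
y = X[:, 0] * 0.5 + rng.standard_normal(T)   # one truly relevant predictor
x_new = rng.standard_normal(M)               # predictors for the forecast period

def pretest_forecast(Xb, yb, x_new, crit=1.96):
    """OLS forecast using only predictors whose t-statistics exceed crit."""
    beta, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
    resid = yb - Xb @ beta
    sigma2 = resid @ resid / (len(yb) - Xb.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xb.T @ Xb)))
    keep = np.abs(beta / se) > crit
    if not keep.any():
        return 0.0                            # no predictor survives the pre-test
    beta_sel, *_ = np.linalg.lstsq(Xb[:, keep], yb, rcond=None)
    return x_new[keep] @ beta_sel

# Bagging: average the pre-test forecasts over bootstrap resamples.
forecasts = []
for _ in range(200):
    idx = rng.integers(0, T, T)               # iid bootstrap resample
    forecasts.append(pretest_forecast(X[idx], y[idx], x_new))
bagged = float(np.mean(forecasts))
print(bagged)
```

Averaging smooths the hard in/out decision of the pre-test, which is where the variance reduction comes from.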
Essays on Robust Model Selection and Model Averaging for Linear Models
Model selection is central to all applied statistical work. Selecting the variables for use in a regression model is one important example of model selection. This thesis is a collection of essays on robust model selection procedures and model averaging for linear regression models.
In the first essay, we propose robust Akaike information criteria (AIC) for MM-estimation and an adjusted robust-scale-based AIC for M- and MM-estimation. Our proposed model selection criteria retain their robustness in the presence of a high proportion of outliers, including outliers in the covariates. We compare our proposed criteria with other robust model selection criteria discussed in the previous literature. Our simulation studies show that the robust AIC based on MM-estimation significantly outperforms the alternatives when the covariates contain outliers, and a real data example likewise shows better performance for the robust AIC based on MM-estimation.
The second essay focuses on robust versions of the "Least Absolute Shrinkage and Selection Operator" (lasso). The adaptive lasso is a method for performing simultaneous parameter estimation and variable selection, and the adaptive weights used in its penalty term allow it to achieve the oracle property. In this essay, we propose an extension of the adaptive lasso named the Tukey-lasso. By using Tukey's biweight criterion instead of squared loss, the Tukey-lasso is resistant to outliers in both the response and the covariates. Importantly, we demonstrate that the Tukey-lasso also enjoys the oracle property. A fast accelerated proximal gradient (APG) algorithm is proposed and implemented for computing the Tukey-lasso. Our extensive simulations show that the Tukey-lasso, implemented with the APG algorithm, achieves very reliable results, including for high-dimensional data where p > n. In the presence of outliers, the Tukey-lasso offers substantial improvements in performance over the adaptive lasso and other robust implementations of the lasso. Real data examples further demonstrate the utility of the Tukey-lasso.
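To make the key ingredient concrete, the sketch below implements Tukey's biweight (bisquare) loss, the criterion the Tukey-lasso substitutes for squared error. The tuning constant c = 4.685 is the conventional 95%-efficiency choice for normal errors; this is only the loss function, not the thesis's full penalized estimator.

```python
# Tukey's biweight loss: behaves like squared error near zero but
# plateaus at a constant beyond |r| = c, so gross outliers stop
# influencing the fit entirely.
import numpy as np

def tukey_biweight(r, c=4.685):
    """Elementwise Tukey biweight loss of residuals r."""
    r = np.asarray(r, dtype=float)
    out = np.full(r.shape, c**2 / 6.0)          # loss plateaus at c^2 / 6
    inside = np.abs(r) <= c
    u = r[inside] / c
    out[inside] = (c**2 / 6.0) * (1 - (1 - u**2) ** 3)
    return out

print(tukey_biweight([0.0, 1.0, 100.0]))  # the outlier at 100 is capped
```

The bounded (redescending) shape is exactly what buys resistance to outliers in the response, at the cost of a non-convex objective, hence the need for the APG algorithm described above.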
In many statistical analyses, a single model is used for statistical inference, ignoring the process that led to that model being selected. To account for this model uncertainty, many model averaging procedures have been proposed. In the last essay, we propose an extension of a bootstrap model averaging approach, called bootstrap lasso averaging (BLA). BLA uses the lasso for model selection, in contrast to other forms of bootstrap model averaging that use the AIC or the Bayesian information criterion (BIC). The use of the lasso improves computation speed and allows BLA to be applied even when the number of variables p is larger than the sample size n. Extensive simulations confirm that BLA has outstanding finite-sample performance, in terms of both variable selection and prediction accuracy, compared with traditional model selection and model averaging methods. Several real data examples further demonstrate the improved out-of-sample predictive performance of BLA.
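The core of the bootstrap lasso averaging idea described above can be sketched very compactly: fit the lasso on each bootstrap resample and average the resulting predictions. The data, the fixed penalty, and the number of resamples below are illustrative stand-ins, not the thesis's tuned procedure.

```python
# Minimal sketch of bootstrap lasso averaging (BLA) on simulated data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 80, 120                       # p > n: feasible for the lasso
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.standard_normal(n)
X_test = rng.standard_normal((10, p))

preds = []
for _ in range(50):
    idx = rng.integers(0, n, n)      # bootstrap resample
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    preds.append(model.predict(X_test))
bla_pred = np.mean(preds, axis=0)    # model-averaged prediction
print(bla_pred.shape)
```

Because each resample can select a different subset of variables, the average implicitly weights models by how often the data support them, which is the source of the robustness to selection uncertainty.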
Towards Using Model Averaging To Construct Confidence Intervals In Logistic Regression Models
Regression analyses in epidemiological and medical research typically begin with a model selection process, followed by inference assuming the selected model has generated the data at hand. It is well-known that this two-step procedure can yield biased estimates and invalid confidence intervals for model coefficients due to the uncertainty associated with the model selection. To account for this uncertainty, multiple models may be selected as a basis for inference. This method, commonly referred to as model-averaging, is increasingly becoming a viable approach in practice.
Previous research has demonstrated the advantage of model-averaging in reducing bias of parameter estimates. However, there is a lack of methods for constructing confidence intervals around parameter estimates using model-averaging. In the context of multiple logistic regression models, we propose and evaluate new confidence interval estimation approaches for regression coefficients. Specifically, we study the properties of confidence intervals constructed by averaging tail errors arising from confidence limits obtained from all models included in model-averaging for parameter estimation. We propose model-averaging confidence intervals based on the score test. For selection of models to be averaged, we propose the bootstrap inclusion fractions method.
We evaluate the performance of our proposed methods using simulation studies, in a comparison with model-averaging interval procedures based on likelihood ratio and Wald tests, traditional stepwise procedures, the bootstrap approach, penalized regression, and the Bayesian model-averaging approach.
Methods with good performance have been implemented in the 'mataci' R package, and illustrated using data from a low birth weight study.
Methods for Population Adjustment with Limited Access to Individual Patient Data: A Review and Simulation Study
Population-adjusted indirect comparisons estimate treatment effects when
access to individual patient data is limited and there are cross-trial
differences in effect modifiers. Popular methods include matching-adjusted
indirect comparison (MAIC) and simulated treatment comparison (STC). There is
limited formal evaluation of these methods and whether they can be used to
accurately compare treatments. Thus, we undertake a comprehensive simulation
study to compare standard unadjusted indirect comparisons, MAIC and STC across
162 scenarios. This simulation study assumes that the trials are investigating
survival outcomes and measure continuous covariates, with the log hazard ratio
as the measure of effect. MAIC yields unbiased treatment effect estimates under
no failures of assumptions. The typical usage of STC produces bias because it
targets a conditional treatment effect where the target estimand should be a
marginal treatment effect. The incompatibility of estimates in the indirect
comparison leads to bias as the measure of effect is non-collapsible. Standard
indirect comparisons are systematically biased, particularly under stronger
covariate imbalance and interaction effects. Standard errors and coverage rates
are often valid in MAIC but the robust sandwich variance estimator
underestimates variability where effective sample sizes are small. Interval
estimates for the standard indirect comparison are too narrow and STC suffers
from bias-induced undercoverage. MAIC provides the most accurate estimates and,
with lower degrees of covariate overlap, its bias reduction outweighs the loss
in effective sample size and precision under no failures of assumptions. An
important future objective is the development of an alternative formulation to
STC that targets a marginal treatment effect.
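The MAIC weighting step referenced above has a compact form: individual patient covariates are centered at the aggregate means reported by the comparator trial, and weights w_i = exp(x_i'a) are obtained by minimizing Q(a) = sum_i exp(x_i'a), whose first-order condition forces the weighted covariate means to match the aggregate means. The sketch below uses simulated data and illustrative names.

```python
# Minimal sketch of the MAIC method-of-moments weighting step.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_ipd = rng.normal(loc=0.3, scale=1.0, size=(500, 2))  # IPD covariates
agg_means = np.array([0.0, 0.1])                       # comparator-trial means

Xc = X_ipd - agg_means                                 # centre at target means
res = minimize(lambda a: np.exp(Xc @ a).sum(),
               x0=np.zeros(2), method="BFGS")          # convex objective
w = np.exp(Xc @ res.x)                                 # MAIC weights

# Weighted covariate means now match the aggregate means
# (up to optimisation error).
print((w @ X_ipd) / w.sum())
```

The reweighted effective sample size, (sum w)^2 / sum w^2, is what shrinks when covariate overlap is poor, which is where the abstract notes the sandwich variance estimator can understate variability.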
A nonparametric model-based estimator for the cumulative distribution function of a right censored variable in a finite population
In survey analysis, the estimation of the cumulative distribution function (cdf) is of great interest: it allows one, for instance, to derive quantile estimators or other nonlinear parameters based on the cdf. We consider the
case where the response variable is a right censored duration variable. In this
framework, the classical estimator of the cdf is the Kaplan-Meier estimator. As
an alternative, we propose a nonparametric model-based estimator of the cdf in
a finite population. The new estimator uses auxiliary information brought by a
continuous covariate and is based on nonparametric median regression adapted to
the censored case. The bias and variance of the prediction error of the
estimator are estimated by a bootstrap procedure adapted to censoring. The new
estimator is compared by model-based simulations to the Kaplan-Meier estimator
computed with the sampled individuals: the new method yields a significant gain in precision regardless of the sample size and the censoring rate. Welfare duration data are used to illustrate the new methodology.
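For reference, the Kaplan-Meier estimator that serves as the baseline in this abstract can be written in a few lines; the version below returns the cdf F(t) = 1 - S(t) from right-censored durations, on a small made-up sample.

```python
# Compact Kaplan-Meier estimator of the cdf for right-censored data.
import numpy as np

def km_cdf(time, event):
    """Kaplan-Meier cdf at event times; event is 1 for an observed
    duration, 0 for a right-censored one."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    order = np.argsort(time, kind="stable")
    time, event = time[order], event[order]
    n = len(time)
    surv, out_t, out_F = 1.0, [], []
    for i, (t, d) in enumerate(zip(time, event)):
        at_risk = n - i                       # subjects still at risk
        if d:  # observed event: multiply in the conditional survival
            surv *= (at_risk - 1) / at_risk
            out_t.append(t)
            out_F.append(1.0 - surv)
    return np.array(out_t), np.array(out_F)

t, F = km_cdf([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(t, F)  # F jumps only at observed (uncensored) event times
```

The model-based alternative proposed in the abstract replaces this design-based estimator with one that borrows strength from a continuous auxiliary covariate via censored median regression.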