Essays on Robust Model Selection and Model Averaging for Linear Models
Model selection is central to all applied statistical work.
Selecting the variables for use in a regression model is one
important example of model selection. This thesis is a collection
of essays on robust model selection procedures and model
averaging for linear regression models.
In the first essay, we propose robust Akaike information criteria
(AIC) for MM-estimation and an adjusted robust scale based AIC
for M and MM-estimation. Our proposed model selection criteria
can maintain their robust properties in the presence of a high
proportion of outliers and of outliers in the covariates. We
compare our proposed criteria with other robust model selection
criteria discussed in previous literature. Our simulation studies
show that the robust AIC based on MM-estimation significantly
outperforms the other criteria in the presence of outliers in the
covariates. The real data example also shows better performance of
the robust AIC based on MM-estimation.
The second essay focuses on robust versions of the "Least
Absolute Shrinkage and Selection Operator" (lasso). The adaptive
lasso is a method for performing simultaneous parameter
estimation and variable selection. The adaptive weights used in
its penalty term mean that the adaptive lasso achieves the oracle
property. In this essay, we propose an extension of the adaptive
lasso named the Tukey-lasso. By using Tukey's biweight criterion,
instead of squared loss, the Tukey-lasso is resistant to outliers
in both the response and covariates. Importantly, we demonstrate
that the Tukey-lasso also enjoys the oracle property. A fast
accelerated proximal gradient (APG) algorithm is proposed and
implemented for computing the Tukey-lasso. Our extensive
simulations show that the Tukey-lasso, implemented with the APG
algorithm, achieves very reliable results, including for
high-dimensional data where p>n. In the presence of outliers, the
Tukey-lasso is shown to offer substantial improvements in
performance compared to the adaptive lasso and other robust
implementations of the lasso. Real data examples further
demonstrate the utility of the Tukey-lasso.
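
As a rough illustration of why replacing squared loss with Tukey's biweight criterion gives resistance to outliers, the sketch below evaluates both losses on a residual. It is not the thesis's implementation: the function names are hypothetical, and the tuning constant c = 4.685 is a conventional default assumed here, not a value taken from the essay.

```python
import numpy as np

def tukey_biweight_loss(residuals, c=4.685):
    """Tukey's biweight (bisquare) loss, evaluated elementwise.

    c is the tuning constant. The default 4.685 is an assumption
    (a common textbook choice), not a value from the thesis.
    """
    r = np.asarray(residuals, dtype=float)
    # Scaled absolute residual, capped at 1 so the loss is flat beyond |r| = c.
    u = np.clip(np.abs(r) / c, 0.0, 1.0)
    return (c ** 2 / 6.0) * (1.0 - (1.0 - u ** 2) ** 3)

def squared_loss(residuals):
    """Ordinary squared loss r^2 / 2, for comparison."""
    r = np.asarray(residuals, dtype=float)
    return 0.5 * r ** 2
```

For small residuals the two losses nearly coincide, while the biweight loss is capped at c^2/6 for any residual larger than c in absolute value, so a single gross outlier cannot dominate the fit.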
In many statistical analyses, a single model is used for
statistical inference, ignoring the process that leads to the
model being selected. To account for this model uncertainty, many
model averaging procedures have been proposed. In the last essay,
we propose an extension of a bootstrap model averaging approach,
called bootstrap lasso averaging (BLA). BLA utilizes the lasso
for model selection. This is in contrast to other forms of
bootstrap model averaging that use AIC or Bayesian information
criteria (BIC). The use of the lasso improves the computation
speed and allows BLA to be applied even when the number of
variables p is larger than the sample size n. Extensive
simulations confirm that BLA has outstanding finite sample
performance, in terms of both variable selection and prediction accuracy,
compared with traditional model selection and model averaging
methods. Several real data examples further demonstrate an
improved out-of-sample predictive performance of BLA.
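
The abstract does not spell out the BLA procedure, so the following is only a minimal sketch of the general idea of averaging lasso fits over bootstrap resamples. Everything in it is an assumption made for illustration: the helper names, the fixed penalty `lam` (a real procedure would presumably tune it), the plain coordinate-descent solver, and the absence of an intercept (data are assumed centered).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n))||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (b[j] added back).
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return b

def bootstrap_lasso_average(X, y, lam, n_boot=50, seed=0):
    """Average lasso coefficients over bootstrap resamples (illustrative only).

    Returns the averaged coefficient vector and, per variable, the fraction
    of resamples in which the lasso selected it.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.zeros((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coefs[b] = lasso_cd(X[idx], y[idx], lam)
    return coefs.mean(axis=0), (coefs != 0).mean(axis=0)
```

Because each resample is fitted by the lasso rather than by an exhaustive AIC or BIC search, nothing in this sketch requires the number of variables to be smaller than the sample size.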
Inference in Linear Regression Models with Many Covariates and Heteroskedasticity
The linear regression model is widely used in empirical work in Economics,
Statistics, and many other disciplines. Researchers often include many
covariates in their linear model specification in an attempt to control for
confounders. We give inference methods that allow for many covariates and
heteroskedasticity. Our results are obtained using high-dimensional
approximations, where the number of included covariates is allowed to grow as
fast as the sample size. We find that all of the usual versions of Eicker-White
heteroskedasticity consistent standard error estimators for linear models are
inconsistent under these asymptotics. We then propose a new heteroskedasticity
consistent standard error formula that is fully automatic and robust to both
(conditional) heteroskedasticity of unknown form and the inclusion of possibly
many covariates. We apply our findings to three settings: parametric linear
models with many covariates, linear panel models with many fixed effects, and
semiparametric semi-linear models with many technical regressors. Simulation
evidence consistent with our theoretical results is also provided. The proposed
methods are also illustrated with an empirical application.
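
For reference, the classical Eicker-White estimator that the abstract takes as its starting point can be sketched as below. This is the textbook HC0 sandwich form, i.e. one of the "usual versions" the paper reports to be inconsistent with many covariates. It is not the new formula the authors propose, which the abstract does not state. The function name is hypothetical.

```python
import numpy as np

def ols_hc0(X, y):
    """OLS coefficients with classical Eicker-White (HC0) standard errors.

    Sandwich form: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1},
    where e_i are the OLS residuals.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: each row of X weighted by its squared residual.
    meat = (X * resid[:, None] ** 2).T @ X
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))
```

Here `X` is expected to already contain any intercept column.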
Automated design of robust discriminant analysis classifier for foot pressure lesions using kinematic data
In recent years, the use of motion tracking systems for acquisition of functional biomechanical gait data has received increasing interest due to the richness and accuracy of the measured kinematic information. However, costs frequently restrict the number of subjects employed, and this makes the dimensionality of the collected data far higher than the number of available samples. This paper applies discriminant analysis algorithms to the classification of patients with different types of foot lesions, in order to establish an association between foot motion and lesion formation. With primary attention to small sample size situations, we compare different types of Bayesian classifiers and evaluate their performance with various dimensionality reduction techniques for feature extraction, as well as search methods for selection of raw kinematic variables. Finally, we propose a novel integrated method which fine-tunes the classifier parameters and selects the most relevant kinematic variables simultaneously. Performance comparisons are made using robust resampling techniques such as bootstrap and k-fold cross-validation. Results from experiments with lesion subjects suffering from pathological plantar hyperkeratosis show that the proposed method can lead to correct classification rates with less than 10% of the original features.
Stability
Reproducibility is imperative for any scientific discovery. More often than
not, modern scientific findings rely on statistical analysis of
high-dimensional data. At a minimum, reproducibility manifests itself in
stability of statistical results relative to "reasonable" perturbations to data
and to the model used. Jackknife, bootstrap, and cross-validation are based on
perturbations to data, while robust statistics methods deal with perturbations
to models. In this article, a case is made for the importance of stability in
statistics. Firstly, we motivate the necessity of stability for interpretable
and reliable encoding models from brain fMRI signals. Secondly, we find strong
evidence in the literature to demonstrate the central role of stability in
statistical inference, such as sensitivity analysis and effect detection.
Thirdly, a smoothing parameter selector based on estimation stability (ES),
ES-CV, is proposed for Lasso, in order to bring stability to bear on
cross-validation (CV). ES-CV is then utilized in the encoding models to reduce
the number of predictors by 60% with almost no loss (1.3%) of prediction
performance across over 2,000 voxels. Last, a novel "stability" argument is
seen to drive new results that shed light on the intriguing interactions
between sample to sample variability and heavier tail error distribution (e.g.,
double-exponential) in high-dimensional regression models with $p$ predictors
and $n$ independent samples. In particular, when $p/n \rightarrow \kappa \in (0.3, 1)$
and the error distribution is
double-exponential, the Ordinary Least Squares (OLS) is a better estimator than
the Least Absolute Deviation (LAD) estimator. Comment: Published at http://dx.doi.org/10.3150/13-BEJSP14 in the Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)
Calibration of Distributionally Robust Empirical Optimization Models
We study the out-of-sample properties of robust empirical optimization
problems with smooth $\phi$-divergence penalties and smooth concave objective
functions, and develop a theory for data-driven calibration of the non-negative
"robustness parameter" that controls the size of the deviations from
the nominal model. Building on the intuition that robust optimization reduces
the sensitivity of the expected reward to errors in the model by controlling
the spread of the reward distribution, we show that the first-order benefit of
a "little bit of robustness" (i.e., a small, positive robustness parameter) is a significant
reduction in the variance of the out-of-sample reward while the corresponding
impact on the mean is almost an order of magnitude smaller. One implication is
that substantial variance (sensitivity) reduction is possible at little cost if
the robustness parameter is properly calibrated. To this end, we introduce the
notion of a robust mean-variance frontier to select the robustness parameter
and show that it can be approximated using resampling methods like the
bootstrap. Our examples show that robust solutions resulting from "open loop"
calibration methods (e.g., selecting a confidence level regardless of
the data and objective function) can be very conservative out-of-sample, while
those corresponding to the robustness parameter that optimizes an estimate of
the out-of-sample expected reward (e.g., via the bootstrap) with no regard for
the variance are often insufficiently robust. Comment: 51 pages
tRNA functional signatures classify plastids as late-branching cyanobacteria.
Background: Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Despite recent advances in the phylogenomics of Cyanobacteria, the phylogenetic root of plastids remains controversial. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies are contradictory on whether plastids branch early or late within Cyanobacteria. One underlying cause may be poor fit of evolutionary models to complex phylogenomic data.
Results: Using Posterior Predictive Analysis, we show that recently applied evolutionary models poorly fit three phylogenomic datasets curated from cyanobacteria and plastid genomes because of heterogeneities in both substitution processes across sites and of compositions across lineages. To circumvent these sources of bias, we developed CYANO-MLP, a machine learning algorithm that consistently and accurately phylogenetically classifies ("phyloclassifies") cyanobacterial genomes to their clade of origin based on bioinformatically predicted function-informative features in tRNA gene complements. Classification of cyanobacterial genomes with CYANO-MLP is accurate and robust to deletion of clades, unbalanced sampling, and compositional heterogeneity in input tRNA data. CYANO-MLP consistently classifies plastid genomes into a late-branching cyanobacterial sub-clade containing single-cell, starch-producing, nitrogen-fixing ecotypes, consistent with metabolic and gene transfer data.
Conclusions: Phylogenomic data of cyanobacteria and plastids exhibit both site-process heterogeneities and compositional heterogeneities across lineages. These aspects of the data require careful modeling to avoid bias in phylogenomic estimation. Furthermore, we show that amino acid recoding strategies may be insufficient to mitigate bias from compositional heterogeneities. However, the combination of our novel tRNA-specific strategy with machine learning in CYANO-MLP appears robust to these sources of bias with high accuracy in phyloclassification of cyanobacterial genomes. CYANO-MLP consistently classifies plastids as late-branching Cyanobacteria, consistent with independent evidence from signature-based approaches and some previous phylogenetic studies.