Search CORE

42,295 research outputs found

Essays on Robust Model Selection and Model Averaging for Linear Models

Author: Chang Le
Publication venue
Publication date: 01/01/2017
Field of study

Model selection is central to all applied statistical work. Selecting the variables for use in a regression model is one important example of model selection. This thesis is a collection of essays on robust model selection procedures and model averaging for linear regression models. In the first essay, we propose robust Akaike information criteria (AIC) for MM-estimation and an adjusted robust scale based AIC for M and MM-estimation. Our proposed model selection criteria can maintain their robust properties in the presence of a high proportion of outliers and the outliers in the covariates. We compare our proposed criteria with other robust model selection criteria discussed in previous literature. Our simulation studies demonstrate a significant outperformance of robust AIC based on MM-estimation in the presence of outliers in the covariates. The real data example also shows a better performance of robust AIC based on MM-estimation. The second essay focuses on robust versions of the ``Least Absolute Shrinkage and Selection Operator" (lasso). The adaptive lasso is a method for performing simultaneous parameter estimation and variable selection. The adaptive weights used in its penalty term mean that the adaptive lasso achieves the oracle property. In this essay, we propose an extension of the adaptive lasso named the Tukey-lasso. By using Tukey's biweight criterion, instead of squared loss, the Tukey-lasso is resistant to outliers in both the response and covariates. Importantly, we demonstrate that the Tukey-lasso also enjoys the oracle property. A fast accelerated proximal gradient (APG) algorithm is proposed and implemented for computing the Tukey-lasso. Our extensive simulations show that the Tukey-lasso, implemented with the APG algorithm, achieves very reliable results, including for high-dimensional data where p>n. In the presence of outliers, the Tukey-lasso is shown to offer substantial improvements in performance compared to the adaptive lasso and other robust implementations of the lasso. Real data examples further demonstrate the utility of the Tukey-lasso. In many statistical analyses, a single model is used for statistical inference, ignoring the process that leads to the model being selected. To account for this model uncertainty, many model averaging procedures have been proposed. In the last essay, we propose an extension of a bootstrap model averaging approach, called bootstrap lasso averaging (BLA). BLA utilizes the lasso for model selection. This is in contrast to other forms of bootstrap model averaging that use AIC or Bayesian information criteria (BIC). The use of the lasso improves the computation speed and allows BLA to be applied even when the number of variables p is larger than the sample size n. Extensive simulations confirm that BLA has outstanding finite sample performance, in terms of both variable and prediction accuracies, compared with traditional model selection and model averaging methods. Several real data examples further demonstrate an improved out-of-sample predictive performance of BLA

The Australian National University

Fast robust estimation of prediction error based on resampling

Author: Khan Jafar A
Van Aelst Stefan
Zamar Ruben H
Publication venue: 'Elsevier BV'
Publication date: 01/01/2010
Field of study

Ghent University Academic Bibliography

Archivsystem Ask23

Inference in Linear Regression Models with Many Covariates and Heteroskedasticity

Author: Cattaneo Matias D.
Jansson Michael
Newey Whitney K.
Publication venue
Publication date: 16/01/2017
Field of study

The linear regression model is widely used in empirical work in Economics, Statistics, and many other disciplines. Researchers often include many covariates in their linear model specification in an attempt to control for confounders. We give inference methods that allow for many covariates and heteroskedasticity. Our results are obtained using high-dimensional approximations, where the number of included covariates are allowed to grow as fast as the sample size. We find that all of the usual versions of Eicker-White heteroskedasticity consistent standard error estimators for linear models are inconsistent under this asymptotics. We then propose a new heteroskedasticity consistent standard error formula that is fully automatic and robust to both (conditional)\ heteroskedasticity of unknown form and the inclusion of possibly many covariates. We apply our findings to three settings: parametric linear models with many covariates, linear panel models with many fixed effects, and semiparametric semi-linear models with many technical regressors. Simulation evidence consistent with our theoretical results is also provided. The proposed methods are also illustrated with an empirical application

arXiv.org e-Print Archive

eScholarship - University of California

FigShare

Automated design of robust discriminant analysis classifier for foot pressure lesions using kinematic data

Author: Bowker P
Findlow AH
Goulermas JY
Howard D
Nester CJ
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005
Field of study

In the recent years, the use of motion tracking systems for acquisition of functional biomechanical gait data, has received increasing interest due to the richness and accuracy of the measured kinematic information. However, costs frequently restrict the number of subjects employed, and this makes the dimensionality of the collected data far higher than the available samples. This paper applies discriminant analysis algorithms to the classification of patients with different types of foot lesions, in order to establish an association between foot motion and lesion formation. With primary attention to small sample size situations, we compare different types of Bayesian classifiers and evaluate their performance with various dimensionality reduction techniques for feature extraction, as well as search methods for selection of raw kinematic variables. Finally, we propose a novel integrated method which fine-tunes the classifier parameters and selects the most relevant kinematic variables simultaneously. Performance comparisons are using robust resampling techniques such as Bootstrap

632+

and k-fold cross-validation. Results from experimentations with lesion subjects suffering from pathological plantar hyperkeratosis, show that the proposed method can lead to

sim 96%

correct classification rates with less than 10% of the original features

Keele Research Repository

University of Salford Institutional Repository

Crossref

Stability

Author: Yu Bin
Publication venue: 'Bernoulli Society for Mathematical Statistics and Probability'
Publication date: 01/09/2013
Field of study

Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to "reasonable" perturbations to data and to the model used. Jacknife, bootstrap, and cross-validation are based on perturbations to data, while robust statistics methods deal with perturbations to models. In this article, a case is made for the importance of stability in statistics. Firstly, we motivate the necessity of stability for interpretable and reliable encoding models from brain fMRI signals. Secondly, we find strong evidence in the literature to demonstrate the central role of stability in statistical inference, such as sensitivity analysis and effect detection. Thirdly, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then utilized in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across over 2,000 voxels. Last, a novel "stability" argument is seen to drive new results that shed light on the intriguing interactions between sample to sample variability and heavier tail error distribution (e.g., double-exponential) in high-dimensional regression models with

p

predictors and

n

independent samples. In particular, when

p/n\rightarrow\kappa\in(0.3,1)

and the error distribution is double-exponential, the Ordinary Least Squares (OLS) is a better estimator than the Least Absolute Deviation (LAD) estimator.Comment: Published in at http://dx.doi.org/10.3150/13-BEJSP14 the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm

arXiv.org e-Print Archive

eScholarship - University of California

Calibration of Distributionally Robust Empirical Optimization Models

Author: Gotoh Jun-Ya
Kim Michael Jong
Lim Andrew E. B.
Publication venue
Publication date: 18/05/2020
Field of study

We study the out-of-sample properties of robust empirical optimization problems with smooth

\phi

-divergence penalties and smooth concave objective functions, and develop a theory for data-driven calibration of the non-negative "robustness parameter"

\delta

that controls the size of the deviations from the nominal model. Building on the intuition that robust optimization reduces the sensitivity of the expected reward to errors in the model by controlling the spread of the reward distribution, we show that the first-order benefit of ``little bit of robustness" (i.e.,

\delta

small, positive) is a significant reduction in the variance of the out-of-sample reward while the corresponding impact on the mean is almost an order of magnitude smaller. One implication is that substantial variance (sensitivity) reduction is possible at little cost if the robustness parameter is properly calibrated. To this end, we introduce the notion of a robust mean-variance frontier to select the robustness parameter and show that it can be approximated using resampling methods like the bootstrap. Our examples show that robust solutions resulting from "open loop" calibration methods (e.g., selecting a

90\%

confidence level regardless of the data and objective function) can be very conservative out-of-sample, while those corresponding to the robustness parameter that optimizes an estimate of the out-of-sample expected reward (e.g., via the bootstrap) with no regard for the variance are often insufficiently robust.Comment: 51 page

arXiv.org e-Print Archive

ScholarBank@NUS

tRNA functional signatures classify plastids as late-branching cyanobacteria.

Author: Amrine Katherine Ch
Ardell David H
Lawrence Travis J
Swingley Wesley D
Publication venue: eScholarship, University of California
Publication date: 01/12/2019
Field of study

BackgroundEukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Despite recent advances in the phylogenomics of Cyanobacteria, the phylogenetic root of plastids remains controversial. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies are contradictory on whether plastids branch early or late within Cyanobacteria. One underlying cause may be poor fit of evolutionary models to complex phylogenomic data.ResultsUsing Posterior Predictive Analysis, we show that recently applied evolutionary models poorly fit three phylogenomic datasets curated from cyanobacteria and plastid genomes because of heterogeneities in both substitution processes across sites and of compositions across lineages. To circumvent these sources of bias, we developed CYANO-MLP, a machine learning algorithm that consistently and accurately phylogenetically classifies ("phyloclassifies") cyanobacterial genomes to their clade of origin based on bioinformatically predicted function-informative features in tRNA gene complements. Classification of cyanobacterial genomes with CYANO-MLP is accurate and robust to deletion of clades, unbalanced sampling, and compositional heterogeneity in input tRNA data. CYANO-MLP consistently classifies plastid genomes into a late-branching cyanobacterial sub-clade containing single-cell, starch-producing, nitrogen-fixing ecotypes, consistent with metabolic and gene transfer data.ConclusionsPhylogenomic data of cyanobacteria and plastids exhibit both site-process heterogeneities and compositional heterogeneities across lineages. These aspects of the data require careful modeling to avoid bias in phylogenomic estimation. Furthermore, we show that amino acid recoding strategies may be insufficient to mitigate bias from compositional heterogeneities. However, the combination of our novel tRNA-specific strategy with machine learning in CYANO-MLP appears robust to these sources of bias with high accuracy in phyloclassification of cyanobacterial genomes. CYANO-MLP consistently classifies plastids as late-branching Cyanobacteria, consistent with independent evidence from signature-based approaches and some previous phylogenetic studies

Directory of Open Access Journals

eScholarship - University of California