
    Essays on Robust Model Selection and Model Averaging for Linear Models

    Model selection is central to all applied statistical work. Selecting the variables for use in a regression model is one important example of model selection. This thesis is a collection of essays on robust model selection procedures and model averaging for linear regression models.

    In the first essay, we propose robust Akaike information criteria (AIC) for MM-estimation and an adjusted robust scale-based AIC for M- and MM-estimation. Our proposed model selection criteria maintain their robustness in the presence of a high proportion of outliers, including outliers in the covariates. We compare our proposed criteria with other robust model selection criteria discussed in the previous literature. Our simulation studies demonstrate that robust AIC based on MM-estimation significantly outperforms the alternatives when there are outliers in the covariates, and a real data example confirms this advantage.

    The second essay focuses on robust versions of the "Least Absolute Shrinkage and Selection Operator" (lasso). The adaptive lasso is a method for performing simultaneous parameter estimation and variable selection, and the adaptive weights used in its penalty term mean that it achieves the oracle property. In this essay, we propose an extension of the adaptive lasso named the Tukey-lasso. By using Tukey's biweight criterion instead of squared loss, the Tukey-lasso is resistant to outliers in both the response and the covariates. Importantly, we demonstrate that the Tukey-lasso also enjoys the oracle property. A fast accelerated proximal gradient (APG) algorithm is proposed and implemented for computing the Tukey-lasso. Our extensive simulations show that the Tukey-lasso, implemented with the APG algorithm, achieves very reliable results, including for high-dimensional data where p > n. In the presence of outliers, the Tukey-lasso is shown to offer substantial improvements in performance compared to the adaptive lasso and other robust implementations of the lasso. Real data examples further demonstrate its utility.

    In many statistical analyses, a single model is used for statistical inference, ignoring the process that led to that model being selected. To account for this model uncertainty, many model averaging procedures have been proposed. In the last essay, we propose an extension of a bootstrap model averaging approach, called bootstrap lasso averaging (BLA), which uses the lasso for model selection. This is in contrast to other forms of bootstrap model averaging that use AIC or the Bayesian information criterion (BIC). The lasso improves computation speed and allows BLA to be applied even when the number of variables p is larger than the sample size n. Extensive simulations confirm that BLA has outstanding finite-sample performance, in terms of both variable selection and prediction accuracy, compared with traditional model selection and model averaging methods. Several real data examples further demonstrate the improved out-of-sample predictive performance of BLA.
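    To make the Tukey-lasso idea concrete, here is a minimal Python sketch of a plain (non-accelerated) proximal gradient iteration for Tukey's biweight loss with a weighted L1 penalty. This is not the thesis's implementation: the APG acceleration and the robust scale estimate used to standardize residuals are omitted, and the tuning constant `c`, step size, and adaptive weights are illustrative assumptions.

```python
import numpy as np

def tukey_psi(r, c=4.685):
    """Derivative (psi function) of Tukey's biweight loss w.r.t. residuals r."""
    psi = r * (1.0 - (r / c) ** 2) ** 2
    psi[np.abs(r) > c] = 0.0          # biweight loss is flat beyond c
    return psi

def soft_threshold(z, t):
    """Proximal operator of a (weighted) L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def tukey_lasso(X, y, lam, weights, c=4.685, n_iter=500):
    """Proximal gradient for  sum_i rho_c(y_i - x_i'b) + lam * sum_j w_j |b_j|.

    A plain sketch; the thesis's APG algorithm adds acceleration and
    residual scaling by a robust scale estimate, both omitted here.
    """
    p = X.shape[1]
    # |psi'| <= 1, so 1 / ||X||_2^2 is a safe step size for the smooth part
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ b
        grad = -X.T @ tukey_psi(r, c)            # gradient of the smooth loss
        b = soft_threshold(b - step * grad, step * lam * weights)
    return b
```

    In the adaptive-lasso spirit, the weights would typically come from an initial robust fit (e.g., the reciprocal absolute values of MM-estimates), so that large preliminary coefficients are penalized less.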

    Inference in Linear Regression Models with Many Covariates and Heteroskedasticity

    The linear regression model is widely used in empirical work in Economics, Statistics, and many other disciplines. Researchers often include many covariates in their linear model specification in an attempt to control for confounders. We give inference methods that allow for many covariates and heteroskedasticity. Our results are obtained using high-dimensional approximations, where the number of included covariates is allowed to grow as fast as the sample size. We find that all of the usual versions of Eicker-White heteroskedasticity-consistent standard error estimators for linear models are inconsistent under these asymptotics. We then propose a new heteroskedasticity-consistent standard error formula that is fully automatic and robust to both (conditional) heteroskedasticity of unknown form and the inclusion of possibly many covariates. We apply our findings to three settings: parametric linear models with many covariates, linear panel models with many fixed effects, and semiparametric semi-linear models with many technical regressors. Simulation evidence consistent with our theoretical results is also provided, and the proposed methods are illustrated with an empirical application.
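    For reference, below is a minimal numpy sketch of the classical Eicker-White "sandwich" standard errors (HC0/HC1, plus the leverage-adjusted HC3) that the abstract argues are inconsistent when the number of covariates grows with the sample size. The paper's new, fully automatic estimator is not reproduced here.

```python
import numpy as np

def hc_standard_errors(X, y, variant="HC1"):
    """Classical Eicker-White heteroskedasticity-consistent standard errors.

    These are the 'usual' estimators the abstract refers to; they become
    inconsistent when the covariate count grows proportionally with n.
    """
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                                  # OLS residuals
    w = e ** 2                                        # HC0 weights
    if variant == "HC1":
        w = w * n / (n - k)                           # degrees-of-freedom correction
    elif variant == "HC3":
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages h_i
        w = (e / (1.0 - h)) ** 2                      # jackknife-style correction
    meat = X.T @ (w[:, None] * X)
    V = XtX_inv @ meat @ XtX_inv                      # sandwich variance estimator
    return beta, np.sqrt(np.diag(V))
```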

    Automated design of robust discriminant analysis classifier for foot pressure lesions using kinematic data

    In recent years, the use of motion tracking systems for the acquisition of functional biomechanical gait data has received increasing interest due to the richness and accuracy of the measured kinematic information. However, costs frequently restrict the number of subjects employed, and this makes the dimensionality of the collected data far higher than the number of available samples. This paper applies discriminant analysis algorithms to the classification of patients with different types of foot lesions, in order to establish an association between foot motion and lesion formation. With primary attention to small sample size situations, we compare different types of Bayesian classifiers and evaluate their performance with various dimensionality reduction techniques for feature extraction, as well as search methods for the selection of raw kinematic variables. Finally, we propose a novel integrated method which fine-tunes the classifier parameters and selects the most relevant kinematic variables simultaneously. Performance comparisons are carried out using robust resampling techniques such as the Bootstrap .632+ and k-fold cross-validation. Results from experiments with subjects suffering from pathological plantar hyperkeratosis show that the proposed method can lead to approximately 96% correct classification rates with less than 10% of the original features.
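    As a rough illustration of the kind of pipeline described (dimensionality reduction feeding a discriminant classifier, evaluated by resampling), here is a short sklearn sketch. The data shapes, reduction method, and parameters are placeholders, not the paper's actual setup or its integrated tuning method.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Placeholder data: few subjects, many kinematic variables (d >> n)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))            # 40 subjects, 500 kinematic features
y = rng.integers(0, 2, size=40)           # lesion vs. no-lesion labels

# Reduce dimensionality before the discriminant step
clf = Pipeline([
    ("reduce", PCA(n_components=10)),     # feature-extraction stage
    ("lda", LinearDiscriminantAnalysis()),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"k-fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```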

    Stability

    Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in the stability of statistical results relative to "reasonable" perturbations to the data and to the model used. Jackknife, bootstrap, and cross-validation are based on perturbations to data, while robust statistics methods deal with perturbations to models. In this article, a case is made for the importance of stability in statistics. Firstly, we motivate the necessity of stability for interpretable and reliable encoding models of brain fMRI signals. Secondly, we find strong evidence in the literature to demonstrate the central role of stability in statistical inference, such as sensitivity analysis and effect detection. Thirdly, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for the Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then utilized in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across over 2,000 voxels. Last, a novel "stability" argument is seen to drive new results that shed light on the intriguing interactions between sample-to-sample variability and heavier-tailed error distributions (e.g., double-exponential) in high-dimensional regression models with p predictors and n independent samples. In particular, when p/n → κ ∈ (0.3, 1) and the error distribution is double-exponential, the Ordinary Least Squares (OLS) estimator is better than the Least Absolute Deviation (LAD) estimator. Comment: Published in Bernoulli (http://isi.cbs.nl/bernoulli/) at http://dx.doi.org/10.3150/13-BEJSP14 by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
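    A minimal sketch of the estimation-stability idea behind ES-CV: refit the Lasso on perturbed versions of the data (here, delete-a-fold subsets) and score each penalty by the variability of the fitted mean vector relative to its magnitude. The exact ES statistic and its combination with CV follow the paper; this sketch only approximates the idea, and the data and penalty grid are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def es_statistic(X, y, lam, n_splits=5, seed=0):
    """Estimation-stability score for one Lasso penalty lam.

    Refits on delete-a-fold subsets and measures the spread of the
    fitted means relative to their average size.
    """
    fits = []
    for train_idx, _ in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
        fits.append(model.predict(X))       # fitted mean on the full design
    M = np.array(fits)
    m_bar = M.mean(axis=0)
    return np.mean(np.sum((M - m_bar) ** 2, axis=1)) / np.sum(m_bar ** 2)

# Placeholder data: sparse linear signal with 5 active predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = X[:, :5] @ np.ones(5) + rng.normal(size=100)

# Pick the penalty whose fit is most stable across perturbations
# (the full ES-CV rule also guards against under-penalizing via CV).
lams = np.logspace(-2, -0.5, 15)
es = [es_statistic(X, y, lam) for lam in lams]
print(f"ES-selected penalty: {lams[int(np.argmin(es))]:.3f}")
```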

    Calibration of Distributionally Robust Empirical Optimization Models

    We study the out-of-sample properties of robust empirical optimization problems with smooth φ-divergence penalties and smooth concave objective functions, and develop a theory for data-driven calibration of the non-negative "robustness parameter" δ that controls the size of the deviations from the nominal model. Building on the intuition that robust optimization reduces the sensitivity of the expected reward to errors in the model by controlling the spread of the reward distribution, we show that the first-order benefit of a "little bit of robustness" (i.e., δ small and positive) is a significant reduction in the variance of the out-of-sample reward, while the corresponding impact on the mean is almost an order of magnitude smaller. One implication is that substantial variance (sensitivity) reduction is possible at little cost if the robustness parameter is properly calibrated. To this end, we introduce the notion of a robust mean-variance frontier to select the robustness parameter and show that it can be approximated using resampling methods like the bootstrap. Our examples show that robust solutions resulting from "open loop" calibration methods (e.g., selecting a 90% confidence level regardless of the data and objective function) can be very conservative out-of-sample, while those corresponding to the robustness parameter that optimizes an estimate of the out-of-sample expected reward (e.g., via the bootstrap) with no regard for the variance are often insufficiently robust. Comment: 51 pages.
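    A toy sketch of the calibration idea follows, using the well-known KL-divergence special case in which the penalized worst-case expected reward has the closed entropic form -(1/δ) log E[exp(-δ r)] (with penalty weight 1/δ, so δ = 0 recovers the nominal sample-average problem and larger δ buys more robustness, matching the abstract's convention). The newsvendor problem, data, and δ grid are placeholders; the paper's general φ-divergence treatment and frontier construction are not reproduced.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
demand = rng.gamma(shape=4.0, scale=25.0, size=200)   # placeholder data
price, cost = 5.0, 3.0

def reward(x, d):
    """Newsvendor reward for order quantity x against demand d."""
    return price * np.minimum(x, d) - cost * x

def robust_objective(x, d, delta):
    """KL-penalized worst-case empirical reward (entropic form)."""
    r = reward(x, d)
    if delta == 0:
        return r.mean()                                # nominal problem
    return -(1.0 / delta) * (logsumexp(-delta * r) - np.log(r.size))

def solve(d, delta, grid=np.linspace(0.0, 250.0, 251)):
    vals = [robust_objective(x, d, delta) for x in grid]
    return grid[int(np.argmax(vals))]

# Bootstrap approximation of the out-of-sample mean/variance of the reward
# at each robustness level (a crude stand-in for the mean-variance frontier).
for delta in [0.0, 0.01, 0.05, 0.2]:
    means = []
    for _ in range(100):
        boot = rng.choice(demand, size=demand.size, replace=True)
        x_star = solve(boot, delta)                    # calibrate on the resample
        means.append(reward(x_star, demand).mean())    # evaluate on held data
    print(f"delta={delta:5.2f}  mean={np.mean(means):7.2f}  var={np.var(means):7.2f}")
```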

    tRNA functional signatures classify plastids as late-branching cyanobacteria.

    Background: Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Despite recent advances in the phylogenomics of Cyanobacteria, the phylogenetic root of plastids remains controversial. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies contradict one another on whether plastids branch early or late within Cyanobacteria. One underlying cause may be poor fit of evolutionary models to complex phylogenomic data.

    Results: Using Posterior Predictive Analysis, we show that recently applied evolutionary models poorly fit three phylogenomic datasets curated from cyanobacterial and plastid genomes because of heterogeneities both in substitution processes across sites and in composition across lineages. To circumvent these sources of bias, we developed CYANO-MLP, a machine learning algorithm that consistently and accurately phylogenetically classifies ("phyloclassifies") cyanobacterial genomes to their clade of origin based on bioinformatically predicted, function-informative features in tRNA gene complements. Classification of cyanobacterial genomes with CYANO-MLP is accurate and robust to deletion of clades, unbalanced sampling, and compositional heterogeneity in input tRNA data. CYANO-MLP consistently classifies plastid genomes into a late-branching cyanobacterial sub-clade containing single-cell, starch-producing, nitrogen-fixing ecotypes, consistent with metabolic and gene transfer data.

    Conclusions: Phylogenomic data of cyanobacteria and plastids exhibit both site-process heterogeneities and compositional heterogeneities across lineages. These aspects of the data require careful modeling to avoid bias in phylogenomic estimation. Furthermore, we show that amino acid recoding strategies may be insufficient to mitigate bias from compositional heterogeneities. However, the combination of our novel tRNA-specific strategy with machine learning in CYANO-MLP appears robust to these sources of bias, with high accuracy in phyloclassification of cyanobacterial genomes. CYANO-MLP consistently classifies plastids as late-branching Cyanobacteria, consistent with independent evidence from signature-based approaches and some previous phylogenetic studies.
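    Since CYANO-MLP is, per its name, a multilayer-perceptron classifier over tRNA-derived features, here is a schematic sklearn sketch of that kind of phyloclassifier. The feature encoding, network architecture, sample sizes, and clade labels below are placeholders, not the authors' pipeline.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder features: one row per genome, columns encoding function-informative
# positions of the tRNA gene complement; labels are cyanobacterial clades.
rng = np.random.default_rng(0)
X = rng.random(size=(120, 300))        # 120 genomes, 300 tRNA-derived features
y = rng.integers(0, 6, size=120)       # 6 placeholder clade labels

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
)
print(cross_val_score(clf, X, y, cv=5).mean())   # clade-classification accuracy

# A trained classifier can then "phyloclassify" plastid genomes by predicting
# their clade from the same feature encoding, e.g.:
#   clf.fit(X, y); clf.predict(X_plastid)   # X_plastid is hypothetical
```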