9 research outputs found

    A New Perspective on Robust MM-Estimation: Finite Sample Theory and Applications to Dependence-Adjusted Multiple Testing

    Heavy-tailed errors impair the accuracy of the least squares estimate, which can be spoiled by a single grossly outlying observation. As argued in the seminal work of Peter Huber in 1973 [Ann. Statist. 1 (1973) 799–821], robust alternatives to the method of least squares are sorely needed. To achieve robustness against heavy-tailed sampling distributions, we revisit the Huber estimator from a new perspective by letting the tuning parameter involved diverge with the sample size. In this paper, we develop nonasymptotic concentration results for such an adaptive Huber estimator, namely, the Huber estimator with the tuning parameter adapted to sample size, dimension, and the variance of the noise. Specifically, we obtain a sub-Gaussian-type deviation inequality and a nonasymptotic Bahadur representation when noise variables only have finite second moments. The nonasymptotic results further yield two conventional normal approximation results that are of independent interest, the Berry-Esseen inequality and Cramér-type moderate deviation. As an important application to large-scale simultaneous inference, we apply these robust normal approximation results to analyze a dependence-adjusted multiple testing procedure for moderately heavy-tailed data. It is shown that the robust dependence-adjusted procedure asymptotically controls the overall false discovery proportion at the nominal level under mild moment conditions. Thorough numerical results on both simulated and real datasets are also provided to back up our theory. Comment: Ann. Statist. (in press).
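
    As a concrete illustration of the adaptive tuning idea described above, the following Python sketch fits a Huber regression whose robustification parameter tau grows with the sample size, roughly like sigma * sqrt(n / log n). The constant c, the MAD-based scale estimate, and the BFGS solver are illustrative assumptions, not the paper's exact prescription (which also ties tau to the dimension and the desired confidence level).

    # A minimal sketch of an adaptive Huber regression estimator; the tuning
    # rule below is an assumption for illustration, not the paper's exact rate.
    import numpy as np
    from scipy.optimize import minimize

    def huber_loss(r, tau):
        """Huber loss applied elementwise to residuals r with threshold tau."""
        abs_r = np.abs(r)
        return np.where(abs_r <= tau, 0.5 * r**2, tau * abs_r - 0.5 * tau**2)

    def adaptive_huber_regression(X, y, c=1.0):
        """Minimize the average Huber loss of y - X @ beta with a diverging tau."""
        n, d = X.shape
        beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]           # pilot estimate
        sigma_hat = np.median(np.abs(y - X @ beta_ls)) / 0.6745  # robust scale (MAD)
        tau = c * sigma_hat * np.sqrt(n / np.log(n))             # adaptive threshold
        res = minimize(lambda b: np.mean(huber_loss(y - X @ b, tau)),
                       beta_ls, method="BFGS")
        return res.x, tau

    # Usage: heavy-tailed t(2.5) noise, where least squares is easily spoiled.
    rng = np.random.default_rng(0)
    n, d = 500, 5
    X = rng.normal(size=(n, d))
    y = X @ np.arange(1.0, d + 1.0) + rng.standard_t(df=2.5, size=n)
    beta_hat, tau = adaptive_huber_regression(X, y)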

    Robust Dependence-Adjusted Methods for High Dimensional Data

    The focus of this dissertation is the development, implementation, and verification of robust methods for high dimensional heavy-tailed data, with an emphasis on underlying dependence adjustment through factor models. First, we prove a nonasymptotic version of the Bahadur representation for a Huber loss M-estimator in the presence of heavy-tailed errors. Consequently, we prove a number of important normal approximation results, including the Berry-Esseen bound and Cramér-type moderate deviation. This theory is used to analyze a covariate-adjusted multiple testing procedure under moderately heavy-tailed errors. We prove that the procedure asymptotically controls the overall false discovery proportion at the nominal level. Next, we present the development of an R package that conducts factor-adjusted robust multiple testing of mean effects, even when the factors are unobservable or only partially observable. Experiments on real and simulated datasets demonstrate the superior performance of our package. Applying this testing procedure to RNA-Seq data from autism patients, we find new evidence for the etiology of the disease and novel pathways that may be altered in autism. Many of the candidate genes found are responsible for functions affected by autism, or are implicated in autism comorbidities such as seizures and epilepsy. We also observe differences between the functions of genes implicated in male and female patients, a promising result since autism is a heavily gender-biased disease. Next, we present an R package that performs large-scale model selection for high dimensional sparse regression in the presence of correlated covariates. The software implements a consistent model selection strategy when the covariate dependence can be reduced through factor models. Numerical studies show that it has good finite-sample performance in terms of both model selection and out-of-sample prediction. Finally, we present a novel method for estimating higher moments of multivariate elliptical distributions. Existing estimators typically require a good estimate of the precision matrix, which imposes strict structural assumptions on the covariance or precision matrix when the data are high dimensional. We propose two methods that only involve estimating the covariance matrix. As a by-product, we propose a new index for financial returns. Theoretical results, as well as experiments with financial data, demonstrate the efficacy of our estimators.
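
    The factor-adjustment step underlying the multiple testing work can be sketched in a few lines. The Python sketch below is a minimal, non-robust version of the idea, assuming an approximate factor model X_ij = mu_j + b_j' f_i + eps_ij: loadings come from PCA of the sample covariance, the common component is regressed out of the sample means by ordinary least squares, and the adjusted means are tested against a normal reference with Benjamini-Hochberg. The dissertation's package instead uses Huber-type estimators throughout; the function name and the fixed number of factors here are illustrative assumptions.

    # Minimal, non-robust sketch of factor-adjusted multiple testing of means.
    # Plain PCA/OLS is used in place of the robust estimators in the dissertation.
    import numpy as np
    from scipy import stats

    def factor_adjusted_tests(X, n_factors, fdr_level=0.05):
        """Return BH discoveries for H0_j: mu_j = 0 on an n x p data matrix X."""
        n, p = X.shape
        xbar = X.mean(axis=0)
        S = np.cov(X, rowvar=False)                       # p x p sample covariance

        # Loadings from the top eigenvectors of the covariance matrix.
        eigval, eigvec = np.linalg.eigh(S)
        idx = np.argsort(eigval)[::-1][:n_factors]
        B = eigvec[:, idx] * np.sqrt(np.clip(eigval[idx], 0, None))

        # Regress the sample mean on the loadings to remove the common component.
        fbar, *_ = np.linalg.lstsq(B, xbar, rcond=None)
        mu_adj = xbar - B @ fbar

        # Idiosyncratic variances and approximate z-statistics.
        var_idio = np.clip(np.diag(S) - np.sum(B**2, axis=1), 1e-12, None)
        z = np.sqrt(n) * mu_adj / np.sqrt(var_idio)
        pvals = 2 * stats.norm.sf(np.abs(z))

        # Benjamini-Hochberg step-up procedure.
        order = np.argsort(pvals)
        below = pvals[order] <= fdr_level * np.arange(1, p + 1) / p
        k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
        rejected = np.zeros(p, dtype=bool)
        rejected[order[:k]] = True
        return rejected, pvals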

    Sampling Within k-Means Algorithm to Cluster Large Datasets

    Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a random sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling-based k-means to the standard k-means algorithm by analyzing both the speed and accuracy of the two methods. Results show that our algorithm is significantly more efficient than the existing algorithm, with comparable accuracy. Further work on this project might include a more comprehensive study, both on more varied test datasets and on real weather datasets; this is especially important considering that this preliminary study was performed on rather tame datasets. Future studies should also analyze the performance of the algorithm for varied values of k. Lastly, this paper showed that the algorithm was accurate for relatively low sample sizes; we would like to analyze this further to see how accurate the algorithm is for even lower sample sizes, and to find, by manipulating the interval width and confidence level, the lowest sample sizes for which the algorithm remains acceptably accurate. In order for our algorithm to be a success, it needs to meet two benchmarks: match the accuracy of the standard k-means algorithm and significantly reduce runtime. Both goals are accomplished for all six datasets analyzed. However, on datasets of three and four dimensions, as the data become more difficult to cluster, both algorithms fail to obtain the correct classifications on some trials. Nevertheless, our algorithm consistently matches the performance of the standard algorithm while being substantially faster. Therefore, we conclude that analysts can use our algorithm and expect accurate results in considerably less time.
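
    The core of the sampling approach is simple to sketch: run k-means on a uniform random subsample, then assign every observation in the full dataset to its nearest learned centroid. The Python sketch below illustrates this; the sample_frac argument is a stand-in for the paper's sample-size rule based on an interval width and confidence level, and the use of scikit-learn's KMeans is an implementation assumption.

    # Minimal sketch of sampling within k-means: fit centroids on a subsample,
    # then label the full dataset.  sample_frac is an illustrative parameter.
    import numpy as np
    from sklearn.cluster import KMeans

    def sampled_kmeans(X, k, sample_frac=0.1, random_state=0):
        """Cluster X into k groups using k-means fit on a random subsample."""
        rng = np.random.default_rng(random_state)
        n = X.shape[0]
        m = max(k, int(sample_frac * n))                  # subsample size
        subsample = X[rng.choice(n, size=m, replace=False)]

        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        km.fit(subsample)                                 # centroids from the sample only
        labels = km.predict(X)                            # assign all points
        return labels, km.cluster_centers_

    # Usage: a large synthetic dataset with three well-separated clusters.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100_000, 2)) for c in (0, 5, 10)])
    labels, centers = sampled_kmeans(X, k=3, sample_frac=0.01)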