
    Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering

    Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices. Hence the test is robust with respect to various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights into the way gene clustering patterns differ between the two groups. The proposed methods have been implemented in the R package HDtest, which is available on CRAN. Comment: The original title, dating back to May 2015, is "Bootstrap Tests on High Dimensional Covariance Matrices with Applications to Understanding Gene Clustering".
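    The abstract does not spell out the test statistic, and the sketch below is not the paper's procedure or the HDtest API; it only illustrates the generic idea of an entry-wise standardized, max-type comparison of two sample covariance matrices in Python, with the function name and the standardization chosen here for illustration.

```python
import numpy as np

def max_type_cov_diff(X, Y):
    """Entry-wise standardized max-type comparison of two sample
    covariance matrices (generic illustration only, not the paper's test).

    X: (n1, p) sample from group 1;  Y: (n2, p) sample from group 2.
    Returns the largest standardized squared entry-wise difference.
    """
    n1, p = X.shape
    n2, _ = Y.shape
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    S1 = Xc.T @ Xc / n1                      # sample covariance, group 1
    S2 = Yc.T @ Yc / n2                      # sample covariance, group 2

    # Moment-based estimates of the variance of each covariance entry.
    theta1 = ((Xc[:, :, None] * Xc[:, None, :] - S1) ** 2).mean(axis=0)
    theta2 = ((Yc[:, :, None] * Yc[:, None, :] - S2) ** 2).mean(axis=0)

    # Standardized squared differences; the comparison looks at their maximum.
    M = (S1 - S2) ** 2 / (theta1 / n1 + theta2 / n2)
    return M.max()

# Toy usage with p much larger than the sample sizes.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))
Y = rng.standard_normal((80, 200))
print(max_type_cov_diff(X, Y))
```

    The paper's original title suggests the null distribution of such a maximum is calibrated by a bootstrap; the sketch stops at the statistic itself.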

    MATS: Inference for potentially Singular and Heteroscedastic MANOVA

    In many experiments in the life sciences, several endpoints are recorded per subject. The analysis of such multivariate data is usually based on MANOVA models assuming multivariate normality and covariance homogeneity. These assumptions, however, are often not met in practice. Furthermore, test statistics should be invariant under scale transformations of the data, since the endpoints may be measured on different scales. In the context of high-dimensional data, Srivastava and Kubokawa (2013) proposed such a test statistic for a specific one-way model, which, however, relies on the assumption of a common non-singular covariance matrix. We modify and extend this test statistic to factorial MANOVA designs, incorporating general heteroscedastic models. In particular, our only distributional assumption is the existence of the group-wise covariance matrices, which may even be singular. We base inference on quantiles of resampling distributions, and derive confidence regions and ellipsoids based on these quantiles. In a simulation study, we extensively analyze the behavior of these procedures. Finally, the methods are applied to a data set containing information on the 2016 presidential elections in the USA with unequal and singular empirical covariance matrices.
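    As a rough illustration of the ingredients described above (scaling by the diagonal of the group covariances, so singular covariance matrices and different measurement scales cause no trouble, plus a resampling-based null distribution), here is a hedged two-sample sketch in Python; the MATS statistic, the factorial designs and the resampling schemes studied in the paper differ in detail, and all function and variable names are illustrative.

```python
import numpy as np

def mats_like_two_sample(X, Y, n_boot=2000, seed=0):
    """Compare two mean vectors using only the diagonal of the group
    covariances (invariant to rescaling single endpoints, tolerant of
    singular covariance matrices), with a parametric bootstrap for the
    null distribution.  Sketch only, not the paper's procedure.
    """
    rng = np.random.default_rng(seed)
    n1, p = X.shape
    n2, _ = Y.shape
    S1 = np.cov(X, rowvar=False)
    S2 = np.cov(Y, rowvar=False)

    def statistic(mean_diff, C1, C2):
        d = np.diag(C1) / n1 + np.diag(C2) / n2     # diagonal scaling only
        d = np.where(d > 0, d, 1.0)                 # guard zero variances
        return np.sum(mean_diff ** 2 / d)

    stat = statistic(X.mean(axis=0) - Y.mean(axis=0), S1, S2)

    # Parametric bootstrap under the null of equal means, keeping the
    # (possibly unequal, possibly singular) group covariances.
    boot = np.empty(n_boot)
    for b in range(n_boot):
        Xb = rng.multivariate_normal(np.zeros(p), S1, size=n1)
        Yb = rng.multivariate_normal(np.zeros(p), S2, size=n2)
        boot[b] = statistic(Xb.mean(axis=0) - Yb.mean(axis=0),
                            np.cov(Xb, rowvar=False), np.cov(Yb, rowvar=False))

    return stat, float(np.mean(boot >= stat))       # statistic and p-value

# Toy usage with unequal covariances.
rng = np.random.default_rng(1)
X = rng.standard_normal((25, 8))
Y = 0.5 * rng.standard_normal((40, 8))
stat, pval = mats_like_two_sample(X, Y)
```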

    User-Friendly Covariance Estimation for Heavy-Tailed Distributions

    We offer a survey of recent results on covariance estimation for heavy-tailed distributions. By unifying ideas scattered in the literature, we propose user-friendly methods that facilitate practical implementation. Specifically, we introduce element-wise and spectrum-wise truncation operators, as well as their M-estimator counterparts, to robustify the sample covariance matrix. Different from the classical notion of robustness that is characterized by the breakdown property, we focus on tail robustness, which is evidenced by the connection between nonasymptotic deviation and confidence level. The key observation is that the estimators need to adapt to the sample size, the dimensionality of the data and the noise level to achieve an optimal tradeoff between bias and robustness. Furthermore, to facilitate their practical use, we propose data-driven procedures that automatically calibrate the tuning parameters. We demonstrate their applications to a series of structured models in high dimensions, including bandable and low-rank covariance matrices and sparse precision matrices. Numerical studies lend strong support to the proposed methods. Comment: 56 pages, 2 figures.
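    The element-wise truncation operator is simple enough to sketch. The Python snippet below shows one possible version, assuming median centering and a rule-of-thumb truncation level, neither of which is taken from the paper; the spectrum-wise and M-estimator variants and the data-driven tuning are omitted.

```python
import numpy as np

def truncated_covariance(X, tau):
    """Element-wise truncated (tail-robust) covariance estimator.

    Every centered cross product x_i * x_j is clipped to [-tau, tau]
    before averaging, which bounds the influence of heavy-tailed
    observations.  Median centering and the choice of tau below are
    illustrative assumptions, not the paper's data-driven calibration.
    """
    Xc = X - np.median(X, axis=0)              # robust centering (assumption)
    prods = Xc[:, :, None] * Xc[:, None, :]    # per-sample outer products
    prods = np.clip(prods, -tau, tau)          # element-wise truncation
    return prods.mean(axis=0)

# The truncation level must grow with the sample size; a common rule of
# thumb scales it like sqrt(n / log p).
rng = np.random.default_rng(2)
X = rng.standard_t(df=3, size=(200, 50))       # heavy-tailed data
n, p = X.shape
tau = 2.0 * np.sqrt(n / np.log(p))
Sigma_hat = truncated_covariance(X, tau)
print(Sigma_hat.shape)
```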

    Statistical eigen-inference from large Wishart matrices

    We consider settings where the observations are drawn from a zero-mean multivariate (real or complex) normal distribution with the population covariance matrix having eigenvalues of arbitrary multiplicity. We assume that the eigenvectors of the population covariance matrix are unknown and focus on inferential procedures that are based on the sample eigenvalues alone (i.e., "eigen-inference"). Results found in the literature establish the asymptotic normality of the fluctuation in the trace of powers of the sample covariance matrix. We develop concrete algorithms for analytically computing the limiting quantities and the covariance of the fluctuations. We exploit the asymptotic normality of the trace of powers of the sample covariance matrix to develop eigenvalue-based procedures for testing and estimation. Specifically, we formulate a simple test of hypotheses for the population eigenvalues and a technique for estimating the population eigenvalues in settings where the cumulative distribution function of the (nonrandom) population eigenvalues has a staircase structure. Monte Carlo simulations are used to demonstrate the superiority of the proposed methodologies over classical techniques and the robustness of the proposed techniques in high-dimensional, (relatively) small sample size settings. The improved performance results from the fact that the proposed inference procedures are "global" (in a sense that we describe) and exploit "global" information, thereby overcoming the inherent biases that cripple classical inference procedures, which are "local" and rely on "local" information. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/07-AOS583 by the Institute of Mathematical Statistics (http://www.imstat.org).
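    For orientation, here is a small Python sketch of the kind of "global" spectral statistics this rests on: normalized traces of powers of the sample covariance matrix, computed from the sample eigenvalues alone. The limiting means and covariances of their fluctuations, which the actual tests and estimators require, are not reproduced here.

```python
import numpy as np

def sample_eigen_moments(X, max_power=3):
    """Normalized traces p^{-1} tr(S^k), k = 1..max_power, of the sample
    covariance matrix S, computed from the sample eigenvalues alone.
    These are global spectral statistics of the sort eigen-inference
    builds on; turning them into tests needs their limiting law.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    eig = np.linalg.eigvalsh(S)                          # sample eigenvalues
    return np.array([np.mean(eig ** k) for k in range(1, max_power + 1)])

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 300))                      # p larger than n
print(sample_eigen_moments(X))
```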

    Relaxed 2-D Principal Component Analysis by L_p Norm for Face Recognition

    A relaxed two-dimensional principal component analysis (R2DPCA) approach is proposed for face recognition. Different from 2DPCA, 2DPCA-L_1 and G2DPCA, the R2DPCA utilizes the label information (if known) of training samples to calculate a relaxation vector and assigns a weight to each subset of the training data. A new relaxed scatter matrix is defined, and the computed projection axes are able to increase the accuracy of face recognition. The optimal L_p-norms are selected in a reasonable range. Numerical experiments on practical face databases indicate that the R2DPCA has high generalization ability and can achieve a higher recognition rate than state-of-the-art methods. Comment: 19 pages, 11 figures.
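    For context, the sketch below implements only the unweighted 2DPCA baseline that R2DPCA builds on (projection axes from the image scatter matrix computed directly on 2-D images); the relaxation vector, label-dependent weights and L_p-norm criterion of R2DPCA are not shown, and the array shapes are illustrative assumptions.

```python
import numpy as np

def two_d_pca(images, n_components):
    """Plain (unweighted) 2DPCA: projection axes are the leading
    eigenvectors of the image scatter matrix, computed directly on the
    2-D image matrices without vectorizing them.

    images: array of shape (N, h, w); returns a (w, n_components) matrix.
    """
    centered = images - images.mean(axis=0)
    # Image scatter matrix: average of A^T A over the centered images A.
    G = np.einsum('nhw,nhv->wv', centered, centered) / len(images)
    _, vecs = np.linalg.eigh(G)                  # eigenvalues in ascending order
    return vecs[:, ::-1][:, :n_components]       # keep the leading axes

rng = np.random.default_rng(4)
faces = rng.random((50, 32, 32))                 # stand-in for face images
W = two_d_pca(faces, n_components=5)
features = faces @ W                             # (50, 32, 5) projected features
```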

    Finite Sample Properties of Tests Based on Prewhitened Nonparametric Covariance Estimators

    We analytically investigate size and power properties of a popular family of procedures for testing linear restrictions on the coefficient vector in a linear regression model with temporally dependent errors. The tests considered are autocorrelation-corrected F-type tests based on prewhitened nonparametric covariance estimators that possibly incorporate a data-dependent bandwidth parameter, e.g., estimators as considered in Andrews and Monahan (1992), Newey and West (1994), or Rho and Shao (2013). For design matrices that are generic in a measure-theoretic sense, we prove that these tests either suffer from extreme size distortions or from strong power deficiencies. Despite this negative result, we demonstrate that a simple adjustment procedure based on artificial regressors can often resolve this problem. Comment: Some material added.
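    As background for readers unfamiliar with prewhitened nonparametric covariance estimators, here is a minimal Python sketch in the spirit of the Andrews-Monahan construction: fit a VAR(1) prewhitening filter, apply a Bartlett-kernel (Newey-West) estimator to the filtered residuals, and recolor. The data-dependent bandwidth rules and the adjustment procedure analyzed in the paper are not part of the sketch, and all names are illustrative.

```python
import numpy as np

def prewhitened_nw_lrv(V, bandwidth):
    """Prewhitened Newey-West long-run variance estimator for a (T, k)
    array of approximately stationary score vectors V (e.g., regressor
    times residual).  A VAR(1) filter is fitted, a Bartlett-kernel HAC
    estimator is applied to the filtered residuals, and the result is
    recolored.  Minimal sketch of the prewhitening idea only.
    """
    T, k = V.shape
    # VAR(1) prewhitening: V[1:] ~= V[:-1] @ A  (row-vector convention).
    A, *_ = np.linalg.lstsq(V[:-1], V[1:], rcond=None)
    E = V[1:] - V[:-1] @ A                        # prewhitened residuals

    # Bartlett-kernel HAC estimator on the residuals.
    S = E.T @ E / len(E)
    for j in range(1, int(bandwidth) + 1):
        w = 1.0 - j / (bandwidth + 1.0)
        Gamma = E[j:].T @ E[:-j] / len(E)
        S += w * (Gamma + Gamma.T)

    # Recoloring: undo the VAR(1) filter, (I - A')^{-1} S (I - A)^{-1}.
    D = np.linalg.inv(np.eye(k) - A.T)
    return D @ S @ D.T

# Toy usage with autocorrelated scores.
rng = np.random.default_rng(5)
T, k = 500, 3
V = np.zeros((T, k))
for t in range(1, T):
    V[t] = 0.7 * V[t - 1] + rng.standard_normal(k)
Omega = prewhitened_nw_lrv(V, bandwidth=4)
```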

    Detecting single-trial EEG evoked potential using a wavelet domain linear mixed model: application to error potentials classification

    Objective. The main goal of this work is to develop a model for multi-sensor signals, such as MEG or EEG signals, that accounts for inter-trial variability and is suitable for the corresponding binary classification problems. An important constraint is that the model be simple enough to handle small and unbalanced datasets, as often encountered in BCI-type experiments. Approach. The method combines a linear mixed-effects statistical model, the wavelet transform and spatial filtering, and aims at characterizing localized discriminant features in multi-sensor signals. After a discrete wavelet transform and spatial filtering, a projection onto the relevant wavelet and spatial-channel subspaces is used for dimension reduction. The projected signals are then decomposed as the sum of a signal of interest (i.e., discriminant) and background noise, using a very simple Gaussian linear mixed model. Main results. The simplicity of the model keeps the corresponding parameter estimation problem tractable. Robust estimates of class covariance matrices are obtained from small sample sizes, and an effective Bayes plug-in classifier is derived. The approach is applied to the detection of error potentials in multichannel EEG data, in a very unbalanced situation (detection of rare events). Classification results prove the relevance of the proposed approach in such a context. Significance. The combination of a linear mixed model, wavelet transform and spatial filtering for EEG classification is, to the best of our knowledge, an original approach, which is proven to be effective. This paper improves on earlier results on similar problems, and the three main ingredients all play an important role.
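    The final classification step can be illustrated with a generic Gaussian Bayes plug-in rule. The Python sketch below assumes features that have already been reduced (the wavelet transform, spatial filtering and mixed-model covariance estimation of the paper are not reproduced), and the shrinkage regularizer is an illustrative stand-in for the robust small-sample covariance estimates mentioned above, not the paper's estimator.

```python
import numpy as np

def bayes_plugin_classifier(X_train, y_train, X_test, shrinkage=0.1):
    """Gaussian Bayes plug-in classifier with a pooled, shrunken class
    covariance and class priors, applied to already-reduced features.
    The shrinkage level is an assumption made for this sketch.
    """
    classes = np.unique(y_train)
    n, d = X_train.shape
    means, priors = [], []
    pooled = np.zeros((d, d))
    for c in classes:
        Xc = X_train[y_train == c]
        means.append(Xc.mean(axis=0))
        priors.append(len(Xc) / n)
        pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
    pooled /= n - len(classes)
    pooled = (1 - shrinkage) * pooled + shrinkage * (np.trace(pooled) / d) * np.eye(d)
    P = np.linalg.inv(pooled)

    # Linear discriminant score of each class for each test point;
    # the class priors matter in the unbalanced (rare-event) setting.
    scores = np.stack(
        [X_test @ P @ m - 0.5 * m @ P @ m + np.log(pi)
         for m, pi in zip(means, priors)],
        axis=1,
    )
    return classes[np.argmax(scores, axis=1)]

# Toy usage with a rare positive class, mimicking error-potential detection.
rng = np.random.default_rng(6)
X0 = rng.standard_normal((190, 6))               # frequent class
X1 = rng.standard_normal((10, 6)) + 1.0          # rare class
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 190 + [1] * 10)
print(bayes_plugin_classifier(X_train, y_train, X_train[:5]))
```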