
    High-dimensional discriminant analysis and covariance matrix estimation

    Statistical analysis in high-dimensional settings, where the data dimension p is close to or larger than the sample size n, has been an intriguing area of research. Applications include gene expression data analysis, financial economics, text mining, and many others. Estimating large covariance matrices is an essential part of high-dimensional data analysis because of the ubiquity of covariance matrices in statistical procedures. The estimation is also a challenging part, since the sample covariance matrix is no longer an accurate estimator of the population covariance matrix in high dimensions. In this thesis, a series of matrix structures that facilitate covariance matrix estimation is studied.

    First, we develop a set of innovative quadratic discriminant rules by applying the compound symmetry structure. For each class, we construct an estimator by pooling the diagonal elements and the off-diagonal elements of the sample covariance matrix, and we substitute this estimator for the covariance matrix in the normal quadratic discriminant rule. We further develop a more general rule that handles nonnormal data by incorporating an additional data transformation. Theoretically, as long as the population covariance matrices loosely conform to the compound symmetry structure, our specialized quadratic discriminant rules enjoy low asymptotic classification error. Computationally, they are easy to implement and do not require large-scale mathematical programming.

    Next, we generalize the compound symmetry structure by assuming that the population covariance matrix (or, equivalently, its inverse, the precision matrix) can be decomposed into a diagonal component and a low-rank component. The rank of the low-rank component governs the extent to which the decomposition simplifies the covariance/precision matrix and reduces the number of unknown parameters. In the estimation, this rank can either be pre-selected to be small or controlled by a penalty function. Under mild conditions on the population covariance/precision matrix and on the penalty function, we prove consistency results for our estimator. A blockwise coordinate descent algorithm, which iteratively updates the diagonal component and the low-rank component, is then proposed to compute the estimator in practice.

    Finally, we consider jointly estimating large covariance matrices of multiple categories. In addition to the diagonal and low-rank matrix decomposition above, we further assume that some common matrix structure is shared across the categories: the population precision matrix of category k decomposes into a diagonal matrix D, a shared low-rank matrix L, and a category-specific low-rank matrix Lk. This assumption can be understood under the framework of factor models: some latent factors affect all categories alike, while others are specific to only one category. We propose a method that jointly estimates the precision matrices (and therefore the covariance matrices): D and L are estimated with the entire dataset, whereas Lk is estimated solely with the data of category k. An AIC-type penalty is applied to encourage the decomposition, especially the shared component. Under certain conditions on the population covariance matrices, consistency results are developed for the estimators. Performance in finite dimensions is demonstrated through numerical experiments.
    Using simulated data, we demonstrate certain advantages of our methods over existing ones, in terms of classification error for the discriminant rules and Kullback--Leibler loss for the covariance matrix estimators. The proposed methods are also applied to real-life datasets, including microarray data, stock return data, and text data, to perform tasks such as distinguishing normal from diseased tissues, selecting portfolios, and classifying webpages.
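    To make the compound-symmetry rule concrete, here is a minimal sketch under the parameterization Sigma = a I + b 11' with a = sigma^2 (1 - rho) and b = sigma^2 rho, whose determinant and inverse have closed forms. The function names are illustrative and not taken from the thesis.

```python
import numpy as np

def pooled_cs_estimate(X):
    """Pool the diagonal and off-diagonal entries of the sample covariance
    to fit a compound-symmetry model  Sigma = var * [(1-rho) I + rho 11']."""
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    var = np.trace(S) / p                          # pooled diagonal   -> var
    off = (S.sum() - np.trace(S)) / (p * (p - 1))  # pooled off-diag   -> var * rho
    return var, off

def cs_logdet_and_inverse(var, off, p):
    """Closed forms for  Sigma = a I + b 11'  with a = var - off, b = off:
    |Sigma| = a^(p-1) (a + p b),  Sigma^{-1} = I/a - b/(a (a + p b)) 11'."""
    a, b = var - off, off
    logdet = (p - 1) * np.log(a) + np.log(a + p * b)
    inv = np.eye(p) / a - (b / (a * (a + p * b))) * np.ones((p, p))
    return logdet, inv

def qda_score(x, mu, var, off, log_prior):
    """Normal quadratic discriminant score with the pooled compound-symmetry
    estimator substituted for the class covariance matrix."""
    logdet, inv = cs_logdet_and_inverse(var, off, mu.size)
    d = x - mu
    return -0.5 * logdet - 0.5 * d @ inv @ d + log_prior
```

    An observation is assigned to the class with the largest score. Since each class contributes only two pooled scalars, no large-scale mathematical programming is involved, in line with the computational claim above.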
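    The blockwise coordinate descent idea for the diagonal-plus-low-rank decomposition can likewise be sketched. The squared Frobenius objective below is a simplifying assumption chosen for brevity (the thesis instead controls the rank with a penalty function); it illustrates only the alternating update of the two components.

```python
import numpy as np

def diag_plus_lowrank(S, rank, n_iter=50):
    """Fit  S ~ D + L  (D diagonal, L PSD with rank <= r) by blockwise
    coordinate descent on the Frobenius error, alternating the two blocks."""
    D = np.diag(np.diag(S))
    for _ in range(n_iter):
        # L-step: best rank-r PSD approximation of the residual S - D
        w, V = np.linalg.eigh(S - D)
        top = np.argsort(w)[::-1][:rank]
        keep = top[w[top] > 0]                  # keep only positive eigenvalues
        L = (V[:, keep] * w[keep]) @ V[:, keep].T
        # D-step: diagonal that exactly matches the remaining diagonal residual
        D = np.diag(np.diag(S - L))
    return D, L
```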

    Functional Regression

    Functional data analysis (FDA) involves the analysis of data whose ideal units of observation are functions defined on some continuous domain; the observed data consist of a sample of functions from some population, recorded on a discrete grid. Ramsay and Silverman's 1997 textbook sparked the development of this field, which has accelerated in the past 10 years to become one of the fastest growing areas of statistics, fueled by the growing number of applications yielding this type of data. One unique characteristic of FDA is the need to combine information both across and within functions, which Ramsay and Silverman called replication and regularization, respectively. This article focuses on functional regression, the area of FDA that has received the most attention in applications and methodological development. It begins with an introduction to basis functions, the key building blocks for regularization in functional regression methods, followed by an overview of functional regression methods split into three types: (1) functional predictor regression (scalar-on-function), (2) functional response regression (function-on-scalar), and (3) function-on-function regression. For each, the roles of replication and regularization are discussed, and the methodological development is described in roughly chronological order, at times deviating from the historical timeline to group similar methods together. The primary focus is on modeling and methodology, highlighting the modeling structures that have been developed and the various regularization approaches employed. The article closes with a brief discussion of potential areas of future development in this field.
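    As a concrete illustration of the basis-function idea, the following sketch fits a scalar-on-function model y_i = alpha + int X_i(t) beta(t) dt + eps_i by expanding beta(t) in a small Fourier basis and adding a ridge penalty for regularization. The basis choice, penalty, and function names are illustrative assumptions, not taken from the article.

```python
import numpy as np

def fourier_basis(t, K):
    """Evaluate a K-column Fourier basis (constant, sines, cosines) on a
    grid t assumed to lie in [0, 1]."""
    cols = [np.ones_like(t)]
    for j in range(1, K // 2 + 1):
        cols.append(np.sin(2 * np.pi * j * t))
        cols.append(np.cos(2 * np.pi * j * t))
    return np.column_stack(cols)[:, :K]

def scalar_on_function_fit(X, y, t, K=7, lam=1e-2):
    """Penalized scalar-on-function regression: expand beta(t) in K basis
    functions, approximate the integral on the grid, and solve a ridge
    problem for the basis coefficients."""
    Phi = fourier_basis(t, K)                   # grid x K basis matrix
    dt = t[1] - t[0]
    Z = dt * X @ Phi                            # numerical integral of X_i * phi_k
    Z = np.column_stack([np.ones(len(y)), Z])   # prepend an intercept column
    P = lam * np.eye(K + 1)
    P[0, 0] = 0.0                               # do not penalize the intercept
    coef = np.linalg.solve(Z.T @ Z + P, Z.T @ y)
    alpha, b = coef[0], coef[1:]
    return alpha, Phi @ b                       # beta(t) evaluated on the grid
```

    With X an n-by-len(t) matrix of sampled curves, the function returns the intercept and the estimated coefficient function on the grid; a larger `lam` shrinks the basis coefficients more heavily, which is the regularization half of the replication/regularization pairing described above.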

    A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data

    Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information for a deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current paradigm for multiple-phenotype association analysis lacks breadth (the number of phenotypes and genetic variants jointly analyzed at the same time) and depth (the hierarchical structure of phenotypes and genotypes). A key issue for high-dimensional pleiotropic analysis is to effectively extract informative internal representations and features from high-dimensional genotype and phenotype data. To explore multiple levels of representation of genetic variants, learn the internal patterns involved in disease development, and overcome critical barriers to developing novel statistical methods and computational algorithms for genetic pleiotropic analysis, we propose a new framework, referred to as quadratically regularized functional CCA (QRFCCA), for association analysis. It combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis, and (3) canonical correlation analysis (CCA). Large-scale simulations show that QRFCCA has much higher power than the nine competing statistics while retaining appropriate type I error rates. To further evaluate performance, QRFCCA and the nine other statistics are applied to the whole-genome sequencing dataset from the TwinsUK study. Using QRFCCA, we identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits. The results show that QRFCCA substantially outperforms the nine other statistics.
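    Of the three ingredients the abstract names, the sketch below illustrates only the CCA-with-quadratic-regularization part: ridge-regularized canonical directions obtained from the SVD of the whitened cross-covariance matrix. The functional smoothing and matrix-factorization steps of QRFCCA are not reproduced here, and the function names are illustrative.

```python
import numpy as np

def inv_sqrt(S, lam):
    """(S + lam I)^(-1/2) via eigendecomposition of the symmetric matrix S."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w + lam)) @ V.T

def regularized_cca(X, Y, lam=1e-2, n_pairs=2):
    """Ridge-regularized CCA: whiten both blocks with (S + lam I)^(-1/2),
    then read the canonical directions off the SVD of the whitened
    cross-covariance."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Sxx, Syy = Xc.T @ Xc / n, Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Sxx, lam), inv_sqrt(Syy, lam)
    U, rho, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :n_pairs]        # canonical directions for X
    B = Wy @ Vt[:n_pairs].T        # canonical directions for Y
    return A, B, rho[:n_pairs]     # rho: regularized canonical correlations
```

    The quadratic penalty `lam` keeps both within-block covariance matrices invertible, which is what makes the method usable when the number of variants or phenotypes exceeds the sample size.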

    Some methods for robust inference in econometric factor models and in machine learning

    Traditional multivariate statistical theory and applications are often based on specific parametric assumptions. For example, it is often assumed that the data follow a (nearly) normal distribution. In practice, such assumptions are rarely true, and the underlying data distribution is often unknown. Violations of the normality assumption can be detrimental to inference. Two areas in particular are affected: quadratic discriminant analysis (QDA), used in classification, and principal component analysis (PCA), commonly employed in dimension reduction. Both PCA and QDA involve computing empirical covariance matrices of the data. In econometric and financial data, non-normality is often associated with heavy-tailed distributions, which can create significant problems in computing the sample covariance matrix. Furthermore, in PCA, non-normality may lead to erroneous decisions about the number of components to retain, due to unexpected behavior of the empirical covariance matrix's eigenvalues.

    In the first part of the dissertation, we consider the so-called number-of-factors problem in econometric and financial data, which concerns the number of sources of variation (latent factors) common to a set of variables observed multiple times (as in time series). The approach commonly used in the literature is PCA together with an examination of the pattern of the resulting eigenvalues. We employ an existing technique for robust principal component analysis, which produces properly estimated eigenvalues that are then used in an automatic inferential procedure to identify the number of latent factors. In a series of simulation experiments, we demonstrate the superiority of our approach over other well-established methods.

    In the second part of the dissertation, we discuss a method to normalize the data empirically so that classical QDA for binary classification can be used. In addition, we successfully overcome the usual issue of a large dimension-to-sample-size ratio through regularized estimation of precision matrices. Extensive simulation experiments demonstrate the accuracy advantages of our approach over other classification techniques. We illustrate the efficiency of our methods in both settings by applying them to real datasets from economics and bioinformatics.
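    A minimal sketch of the second part's idea of plugging regularized precision matrices into binary QDA follows, using Ledoit-Wolf shrinkage from scikit-learn as a stand-in regularizer; the dissertation's empirical normalization step and its specific precision estimator are not reproduced.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def fit_class(X):
    """Per-class mean and Ledoit-Wolf shrinkage precision matrix, which stays
    well conditioned even when the dimension is close to the sample size."""
    lw = LedoitWolf().fit(X)
    _, logdet_prec = np.linalg.slogdet(lw.precision_)
    return X.mean(axis=0), lw.precision_, logdet_prec

def qda_classify(x, class0, class1, log_prior_ratio=0.0):
    """Binary QDA decision with regularized precisions; note that
    -0.5 log|Sigma| equals +0.5 log|Sigma^{-1}|."""
    def score(mu, prec, logdet_prec):
        d = x - mu
        return 0.5 * logdet_prec - 0.5 * d @ prec @ d
    return int(score(*class1) - score(*class0) + log_prior_ratio > 0)
```

    The plain sample covariance is singular once the dimension exceeds the class sample size, so the plug-in QDA rule is undefined; any well-conditioned regularized precision estimate, such as the shrinkage estimator used here, restores a usable rule.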