High-dimensional discriminant analysis and covariance matrix estimation
Statistical analysis in high-dimensional settings, where the data dimension p is close to or larger than the sample size n, has been an intriguing area of research. Applications include gene expression data analysis, financial economics, text mining, and many others. Estimating large covariance matrices is an essential part of high-dimensional data analysis because of the ubiquity of covariance matrices in statistical procedures. The estimation is also challenging, since the sample covariance matrix is no longer an accurate estimator of the population covariance matrix in high dimensions. In this thesis, we study a series of matrix structures that facilitate covariance matrix estimation.
First, we develop a set of innovative quadratic discriminant rules by applying the compound symmetry structure. For each class, we construct an estimator by pooling the diagonal elements as well as the off-diagonal elements of the sample covariance matrix, and substitute the estimator for the covariance matrix in the normal quadratic discriminant rule. Furthermore, we develop a more general rule to deal with nonnormal data by incorporating an additional data transformation. Theoretically, as long as the population covariance matrices loosely conform to the compound symmetry structure, our specialized quadratic discriminant rules enjoy low asymptotic classification error. Computationally, they are easy to implement and do not require large-scale mathematical programming.
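The pooling step described above can be sketched in a few lines: average the diagonal entries of the sample covariance matrix into one common variance and the off-diagonal entries into one common covariance. This is a minimal reading of the abstract; the function name and the unweighted averaging are illustrative assumptions, not the thesis's exact estimator.

```python
import numpy as np

def compound_symmetry_estimate(X):
    """Pool the sample covariance toward compound symmetry.

    A compound-symmetry matrix has one common variance on the diagonal
    and one common covariance off the diagonal; here both are estimated
    by simple averaging of the corresponding entries of the sample
    covariance matrix (an illustrative, unweighted version of the
    pooling step).
    """
    S = np.cov(X, rowvar=False)           # p x p sample covariance
    p = S.shape[0]
    var_hat = np.mean(np.diag(S))         # pooled diagonal element
    off_mask = ~np.eye(p, dtype=bool)
    cov_hat = np.mean(S[off_mask])        # pooled off-diagonal element
    return var_hat * np.eye(p) + cov_hat * off_mask.astype(float)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
Sigma_hat = compound_symmetry_estimate(X)
```

The resulting estimator has only two free parameters regardless of p, which is why no large-scale mathematical programming is needed.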
Then, we generalize the compound symmetry structure by considering the assumption that the population covariance matrix (or equivalently its inverse, the precision matrix) can be decomposed into a diagonal component and a low-rank component. The rank of the low-rank component governs to what extent the decomposition can simplify the covariance/precision matrix and reduce the number of unknown parameters. In the estimation, this rank can either be pre-selected to be small or controlled by a penalty function. Under moderate conditions on the population covariance/precision matrix itself and on the penalty function, we prove some consistency results for our estimator. A blockwise coordinate descent algorithm, which iteratively updates the diagonal component and the low-rank component, is then proposed to obtain the estimator in practice.
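A blockwise coordinate descent of the kind described can be sketched by alternating two closed-form updates: fit the low-rank component to the residual S - D by a truncated eigendecomposition, then refit the diagonal to the residual S - L. This sketch minimizes a Frobenius loss with a pre-selected rank rather than the thesis's penalized criterion; the function name and stopping rule are hypothetical.

```python
import numpy as np

def diag_plus_lowrank(S, rank, n_iter=100):
    """Alternate updates of a diagonal D and a rank-`rank` PSD matrix L
    so that D + L approximates S in Frobenius norm (an illustrative
    stand-in for the penalized estimator described above)."""
    D = np.diag(np.diag(S))
    for _ in range(n_iter):
        # L-update: best rank-r PSD approximation of the residual S - D
        w, V = np.linalg.eigh(S - D)
        idx = np.argsort(w)[::-1][:rank]
        w_top = np.clip(w[idx], 0.0, None)
        L = (V[:, idx] * w_top) @ V[:, idx].T
        # D-update: match the diagonal of the residual S - L
        D = np.diag(np.clip(np.diag(S - L), 1e-8, None))
    return D, L

# demo on a matrix that truly has the assumed structure
rng = np.random.default_rng(1)
B = rng.standard_normal((8, 2))
S = B @ B.T + np.diag(rng.uniform(0.5, 1.5, size=8))
D_hat, L_hat = diag_plus_lowrank(S, rank=2)
```

Each update is a coordinate-wise minimizer of the loss with the other block held fixed, which is the defining feature of blockwise coordinate descent.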
Finally, we consider jointly estimating large covariance matrices of multiple categories. In addition to the aforementioned diagonal and low-rank matrix decomposition, it is further assumed that there is some common matrix structure shared across the categories. We assume that the population precision matrix of category k can be decomposed into a diagonal matrix D, a shared low-rank matrix L, and a category-specific low-rank matrix Lk. The assumption can be understood under the framework of factor models --- some latent factors affect all categories alike while others are specific to only one of these categories. We propose a method that jointly estimates the precision matrices (therefore, the covariance matrices) --- D and L are estimated with the entire dataset whereas Lk is estimated solely with the data of category k. An AIC-type penalty is applied to encourage the decomposition, especially the shared component. Under certain conditions on the population covariance matrices, some consistency results are developed for the estimators.
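The assumed structure is easy to write down directly. The small simulation below builds category precision matrices of the form D + L + Lk and inverts them to obtain valid covariance matrices; all dimensions, ranks, and variable names are illustrative choices, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)
p, K, r_shared, r_spec = 12, 3, 2, 1        # illustrative sizes

D = np.diag(rng.uniform(1.0, 2.0, size=p))  # diagonal component, common to all categories
A = rng.standard_normal((p, r_shared))
L = A @ A.T                                 # low-rank component shared across categories

precisions = []
for k in range(K):
    Bk = rng.standard_normal((p, r_spec))
    Lk = Bk @ Bk.T                          # low-rank component specific to category k
    precisions.append(D + L + Lk)           # precision matrix of category k

# each precision matrix is positive definite, so its inverse is a
# valid covariance matrix for that category
covariances = [np.linalg.inv(Om) for Om in precisions]
```

In factor-model terms, the columns of A play the role of loadings on latent factors affecting all categories, while the columns of Bk load on factors specific to category k.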
Finite-sample performance is demonstrated through numerical experiments. Using simulated data, we demonstrate certain advantages of our methods over existing ones, in terms of classification error for the discriminant rules and Kullback--Leibler loss for the covariance matrix estimators. The proposed methods are also applied to real-life datasets, including microarray data, stock return data, and text data, to perform tasks such as distinguishing normal from diseased tissues, selecting portfolios, and classifying webpages.
Gene set analysis using variance component tests
Background: Gene set analyses have become increasingly important in genomic research, as alterations of numerous genes jointly contribute to many complex diseases. Genes often coordinate as a functional repertoire, e.g., a biological pathway/network, and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses. Results: We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). Conclusion: We develop a gene set analysis method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and the global test, in both simulations and a diabetes microarray dataset.
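The permutation component of such a test can be sketched as follows: form a quadratic statistic from the gene-wise scores, weighted by the inverse of a working covariance, and calibrate it by permuting the exposure labels. This is a heavily simplified stand-in for the TEGS variance-component statistic; the function name, the exact form of the statistic, and the simulation settings are all illustrative.

```python
import numpy as np

def gene_set_perm_test(Y, x, W=None, n_perm=999, seed=0):
    """Permutation p-value for the effect of a binary exposure x on the
    m genes of a set (Y is n x m).  The statistic is a quadratic form
    of the gene-wise scores weighted by the inverse working covariance
    W; W = I ignores the gene correlation, mimicking the global test."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    Winv = np.eye(m) if W is None else np.linalg.inv(W)

    def stat(xv):
        u = Y.T @ (xv - xv.mean())      # score contribution of each gene
        return float(u @ Winv @ u)

    q_obs = stat(x)
    q_perm = [stat(rng.permutation(x)) for _ in range(n_perm)]
    return (1 + sum(q >= q_obs for q in q_perm)) / (1 + n_perm)

# demo: the first five genes of the set are shifted by the exposure
rng = np.random.default_rng(10)
n, m = 40, 10
x = np.repeat([0.0, 1.0], n // 2)
Y = rng.standard_normal((n, m))
Y[:, :5] += 2.0 * x[:, None]
pval = gene_set_perm_test(Y, x)
```

Passing a working covariance W closer to the true gene covariance reweights the scores and, as the abstract notes, improves power while the permutation calibration protects the type I error.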
Functional Regression
Functional data analysis (FDA) involves the analysis of data whose ideal
units of observation are functions defined on some continuous domain, and the
observed data consist of a sample of functions taken from some population,
sampled on a discrete grid. Ramsay and Silverman's 1997 textbook sparked the
development of this field, which has accelerated in the past 10 years to become
one of the fastest growing areas of statistics, fueled by the growing number of
applications yielding this type of data. One unique characteristic of FDA is
the need to combine information both across and within functions, which Ramsay
and Silverman called replication and regularization, respectively. This article
will focus on functional regression, the area of FDA that has received the most
attention in applications and methodological development. First will be an
introduction to basis functions, key building blocks for regularization in
functional regression methods, followed by an overview of functional regression
methods, split into three types: [1] functional predictor regression
(scalar-on-function), [2] functional response regression (function-on-scalar)
and [3] function-on-function regression. For each, the role of replication and
regularization will be discussed and the methodological development described
in a roughly chronological manner, at times deviating from the historical
timeline to group together similar methods. The primary focus is on modeling
and methodology, highlighting the modeling structures that have been developed
and the various regularization approaches employed. At the end is a brief
discussion describing potential areas of future development in this field.
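The role of basis functions in scalar-on-function regression can be made concrete with a short sketch: expanding the coefficient function in a small basis turns the functional model into an ordinary least-squares problem on basis scores. The Fourier-type basis, simulated random-walk curves, and all sizes below are illustrative assumptions, not drawn from the article.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, K = 200, 100, 5                   # curves, grid points, basis dimension
t = np.linspace(0.0, 1.0, T)

# a small Fourier-type basis used to regularize the coefficient function
Phi = np.column_stack([np.ones(T)] +
                      [np.sin(2 * np.pi * k * t) for k in range(1, K)])

# simulated functional predictors (random walks on the grid) and
# scalar responses generated from a smooth coefficient function
X = rng.standard_normal((n, T)).cumsum(axis=1) / np.sqrt(T)
beta_true = np.sin(2 * np.pi * t)
y = X @ beta_true / T + 0.01 * rng.standard_normal(n)

# expanding beta(t) in the basis turns the functional model
#   y_i = int X_i(t) beta(t) dt + e_i
# into ordinary least squares on the scores Z = X Phi / T
Z = X @ Phi / T
b_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta_hat = Phi @ b_hat                  # estimated coefficient function
```

The basis dimension K here is the regularization knob: a small K forces the estimated coefficient function to be smooth, which is the "regularization" half of the replication/regularization pairing discussed above.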
A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data
Investigating the pleiotropic effects of genetic variants can increase
statistical power, provide important information to achieve deep understanding
of the complex genetic structures of disease, and offer powerful tools for
designing effective treatments with fewer side effects. However, the current
multiple phenotype association analysis paradigm lacks breadth (number of
phenotypes and genetic variants jointly analyzed at the same time) and depth
(hierarchical structure of phenotype and genotypes). A key issue for high
dimensional pleiotropic analysis is to effectively extract informative internal
representation and features from high dimensional genotype and phenotype data.
To explore multiple levels of representations of genetic variants, learn their
internal patterns involved in the disease development, and overcome critical
barriers in advancing the development of novel statistical methods and
computational algorithms for genetic pleiotropic analysis, we propose a new
framework referred to as a quadratically regularized functional CCA (QRFCCA)
for association analysis which combines three approaches: (1) quadratically
regularized matrix factorization, (2) functional data analysis and (3)
canonical correlation analysis (CCA). Large-scale simulations show that the
QRFCCA has a much higher power than that of the nine competing statistics while
retaining the appropriate type I error. To further evaluate performance, the
QRFCCA and nine other statistics are applied to the whole genome sequencing
dataset from the TwinsUK study. We identify a total of 79 genes with rare
variants and 67 genes with common variants significantly associated with the 46
traits using QRFCCA. The results show that the QRFCCA substantially outperforms
the nine other statistics.
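Of the three ingredients combined in QRFCCA, the CCA building block can be sketched compactly: whiten each data view and take the singular values of the whitened cross-product as the canonical correlations. The functional smoothing and quadratic regularization layers of QRFCCA are omitted here; the function name and the two-view demo are illustrative.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two centered data views via
    whitening + SVD.  This is plain CCA only; QRFCCA additionally
    represents genotypes/phenotypes functionally and applies
    quadratically regularized matrix factorization, which this
    sketch omits."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Ux = np.linalg.svd(Xc, full_matrices=False)[0]   # orthonormal (whitened) X scores
    Uy = np.linalg.svd(Yc, full_matrices=False)[0]   # orthonormal (whitened) Y scores
    # canonical correlations = singular values of the whitened cross-product
    return np.clip(np.linalg.svd(Ux.T @ Uy, compute_uv=False), 0.0, 1.0)

# demo: the two views share a single latent signal z
rng = np.random.default_rng(6)
n = 500
z = rng.standard_normal(n)
X = np.column_stack([z + 0.1 * rng.standard_normal(n), rng.standard_normal(n)])
Y = np.column_stack([z + 0.1 * rng.standard_normal(n), rng.standard_normal(n)])
rho = canonical_correlations(X, Y)
```

A large leading canonical correlation with small trailing ones, as in this demo, is the signature of a shared latent structure between the genotype and phenotype views.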
Some methods for robust inference in econometric factor models and in machine learning
Traditional multivariate statistical theory and applications are often based on specific parametric assumptions. For example, it is often assumed that data follow a (nearly) normal distribution. In practice such an assumption rarely holds, and in fact the underlying data distribution is often unknown. Violations of the normality assumption can be detrimental in inference. In particular, two areas affected by violations of assumptions are quadratic discriminant analysis (QDA), used in classification, and principal component analysis (PCA), commonly employed in dimension reduction. Both PCA and QDA involve the computation of empirical covariance matrices of the data. In econometric and financial data, non-normality is often associated with heavy-tailed distributions, and such distributions can create significant problems in computing the sample covariance matrix. Furthermore, in PCA non-normality may lead to erroneous decisions about the number of components to retain, due to unexpected behavior of the empirical covariance matrix eigenvalues.
In the first part of the dissertation, we consider the so-called number-of-factors problem in econometric and financial data, which concerns the number of sources of variation (latent factors) that are common to a set of variables observed multiple times (as in time series). The approach commonly used in the literature is PCA combined with examination of the pattern of the related eigenvalues. We employ an existing technique for robust principal component analysis, which produces properly estimated eigenvalues that are then used in an automatic inferential procedure for identifying the number of latent factors. In a series of simulation experiments we demonstrate the superiority of our approach over other well-established methods.
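The eigenvalue-based selection step can be sketched with a simple eigenvalue-ratio rule: pick the number of factors where the drop between consecutive eigenvalues is largest. The dissertation's inferential procedure is more refined (and operates on robustly estimated eigenvalues); the criterion, function name, and simulated factor model below are illustrative stand-ins.

```python
import numpy as np

def n_factors_eigengap(S, max_k=None):
    """Estimate the number of common factors as the argmax of the
    consecutive eigenvalue ratio lambda_k / lambda_{k+1} of the
    covariance matrix S (a simple stand-in for the automatic
    inferential procedure described above)."""
    w = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues, descending
    if max_k is None:
        max_k = len(w) // 2
    ratios = w[:max_k] / w[1:max_k + 1]
    return int(np.argmax(ratios)) + 1

# demo: data generated from a model with exactly three common factors
rng = np.random.default_rng(4)
n, p, r = 500, 20, 3
F = rng.standard_normal((n, r))                # latent factors
Lam = rng.standard_normal((p, r))              # loadings
X = F @ Lam.T + 0.3 * rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)
k_hat = n_factors_eigengap(S)
```

With heavy-tailed data, the point made in the abstract is that S itself becomes unreliable; substituting robustly estimated eigenvalues into the same selection logic is what restores the rule's accuracy.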
In the second part of the dissertation, we discuss a method to normalize the data empirically so that classical QDA for binary classification can be used. In addition, we successfully overcome the usual issue of large dimension-to-sample-size ratio through regularized estimation of precision matrices. Extensive simulation experiments demonstrate the advantages of our approach in terms of accuracy over other classification techniques.
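A minimal sketch of regularized QDA for binary classification: estimate a per-class mean and a ridge-regularized precision matrix inv(S_k + lam*I), then classify by the larger quadratic discriminant score. The dissertation's empirical normalization step and its particular precision-matrix regularizer are not reproduced; function names and the shrinkage form are illustrative assumptions.

```python
import numpy as np

def rqda_fit(X, y, lam=0.1):
    """Per-class mean and ridge-regularized precision inv(S_k + lam*I),
    a simple stand-in for the regularized precision estimation
    described above."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        S = np.cov(Xc, rowvar=False)
        Omega = np.linalg.inv(S + lam * np.eye(X.shape[1]))
        _, logdet = np.linalg.slogdet(Omega)
        prior = np.log(len(Xc) / len(X))
        params[c] = (mu, Omega, logdet, prior)
    return params

def rqda_predict(params, X):
    """Assign each row of X to the class with the largest quadratic
    discriminant score (Gaussian log-density up to a constant)."""
    classes = sorted(params)
    scores = []
    for c in classes:
        mu, Omega, logdet, prior = params[c]
        d = X - mu
        scores.append(prior + 0.5 * logdet
                      - 0.5 * np.einsum('ij,jk,ik->i', d, Omega, d))
    return np.array(classes)[np.argmax(scores, axis=0)]

# demo: two well-separated Gaussian classes
rng = np.random.default_rng(5)
X = np.vstack([rng.standard_normal((100, 5)),
               rng.standard_normal((100, 5)) + 2.0])
y = np.array([0] * 100 + [1] * 100)
acc = np.mean(rqda_predict(rqda_fit(X, y), X) == y)
```

The ridge term lam*I is what keeps the precision estimate stable when the dimension-to-sample-size ratio is large, the regime the abstract targets.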
We illustrate the efficiency of our methods in both situations by applying them to real datasets from economics and bioinformatics.