Statistical Learning Methods for High-dimensional Classification and Regression

Abstract

With recent advances in technology, large and heterogeneous data sets containing enormous numbers of variables of mixed types have become increasingly common, posing great computational and theoretical challenges for classical methods in classification and regression. It is of great interest to develop new statistical methods that are computationally efficient and theoretically sound for classification and regression using high-dimensional and heterogeneous data. In this dissertation, we specifically address problems in the computation of high-dimensional linear discriminant analysis, and in high-dimensional linear regression and ordinal classification with mixed covariates. First, we propose an efficient greedy search algorithm that depends solely on closed-form formulae to learn a high-dimensional linear discriminant analysis (LDA) rule. We establish theoretical guarantees for its statistical properties in terms of variable selection and error rate consistency; in addition, we provide an explicit interpretation of the extra information brought by an additional feature in an LDA problem under some mild distributional assumptions. Through extensive simulation studies and real data analysis, we demonstrate that this new algorithm drastically improves computational speed compared with other high-dimensional LDA methods while maintaining comparable or even better classification performance. Second, we propose a semiparametric Latent Mixed Gaussian Copula Regression (LMGCR) model to perform linear regression for high-dimensional mixed data. The model assumes that the observed mixed covariates are generated from latent variables that follow a Gaussian copula. We develop an estimator of the regression coefficients in LMGCR and prove its estimation and variable selection consistency. In addition, we devise a prediction rule given by LMGCR and quantify its prediction error under mild conditions.
Through extensive simulation studies and real data analysis, we demonstrate that the proposed model has superior performance in both coefficient estimation and prediction. Finally, we propose a semiparametric Latent Mixed Gaussian Copula Classification (LMGCC) rule to perform classification of an ordinal response using unnormalized high-dimensional data. Our classification rule learns the Bayes rule derived from joint modeling of the ordinal response and continuous features through a latent Gaussian copula model. We develop an estimator of the regression coefficients in predicting the latent response and prove its estimation and variable selection consistency. In addition, we establish the error rate consistency of the LMGCC rule. Through extensive simulation studies and real data analysis, we demonstrate that the proposed method has superior performance in ordinal classification.

Doctor of Philosophy
