
    Detecting and visualizing differences in brain structures with SPHARM and functional data analysis

    A new procedure for classifying brain structures described by SPHARM is presented. We combine a dimension reduction technique (functional principal component analysis or functional independent component analysis) with stepwise variable selection for linear discriminant classification. This procedure is compared with many well-known methods in a novel classification problem in neuroeducation, where the reversal error (a common error in mathematical problem solving) is analyzed using the left and right putamens of 33 participants. The comparison shows that our proposal not only provides outstanding performance in terms of predictive power, but is also valuable in terms of interpretation, since it yields a linear discriminant function for 3D structures.
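    To make the pipeline concrete, here is a minimal sketch in scikit-learn, with ordinary PCA standing in for the functional PCA/ICA step and synthetic inputs in place of the actual SPHARM coefficients; all shapes and parameter values are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch: dimension reduction + stepwise selection + LDA.
# Synthetic data; PCA approximates the functional PCA/ICA step.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(33, 500))   # 33 participants; flattened SPHARM coefficients (hypothetical)
y = rng.integers(0, 2, size=33)  # reversal error: present / absent

pipe = Pipeline([
    ("fpca", PCA(n_components=10)),              # stands in for functional PCA/ICA
    ("stepwise", SequentialFeatureSelector(      # stepwise selection of components
        LinearDiscriminantAnalysis(),
        n_features_to_select=3, direction="forward", cv=5)),
    ("lda", LinearDiscriminantAnalysis()),       # final linear discriminant rule
])
pipe.fit(X, y)
print("training accuracy:", pipe.score(X, y))
```

    Because the final step is plain LDA on a few selected components, the fitted discriminant weights can be mapped back through the PCA loadings to a function over the 3D surface, which is the interpretability advantage the abstract highlights.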

    Over-optimism in bioinformatics: an illustration

    In statistical bioinformatics research, different optimization mechanisms potentially lead to "over-optimism" in published papers. The present empirical study illustrates these mechanisms through a concrete example from an active research field. The investigated sources of over-optimism include the optimization of the data sets, of the settings, of the competing methods and, most importantly, of the method's characteristics. We consider a "promising" new classification algorithm that turns out to yield disappointing results in terms of error rate, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. We quantitatively demonstrate that this disappointing method can artificially seem superior to existing approaches if we "fish for significance". We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should be validated using "fresh" validation data sets.
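    The closing recommendation can be made concrete with a toy sketch (assumed scikit-learn code, not the authors'): on pure-noise data, picking the best of many tuned variants looks good on the data used for selection, while a fresh validation set reveals chance-level accuracy.

```python
# Toy illustration of "fishing for significance" vs. fresh validation data.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))           # noise features
y = rng.integers(0, 2, size=200)         # labels independent of X
X_sel, X_fresh, y_sel, y_fresh = train_test_split(X, y, random_state=1)

# "Optimize" over many variants using only the selection data.
variants = [KNeighborsClassifier(n_neighbors=k) for k in range(1, 30)]
cv_scores = [cross_val_score(m, X_sel, y_sel, cv=5).mean() for m in variants]
best = variants[int(np.argmax(cv_scores))]

print("best CV accuracy during selection:", max(cv_scores))        # optimistically above 0.5
best.fit(X_sel, y_sel)
print("accuracy on fresh validation data:", best.score(X_fresh, y_fresh))  # near 0.5
```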

    Statistical Learning Methods for High-dimensional Classification and Regression

    With recent advances in technology, large and heterogeneous data sets containing enormous numbers of variables of mixed types have become increasingly common, raising great challenges in computation and theory for classical classification and regression methods. It is of great interest to develop new statistical methods that are computationally efficient and theoretically sound for classification and regression using high-dimensional and heterogeneous data. In this dissertation, we specifically address problems in the computation of high-dimensional linear discriminant analysis, and in high-dimensional linear regression and ordinal classification with mixed covariates. First, we propose an efficient greedy search algorithm that depends solely on closed-form formulae to learn a high-dimensional linear discriminant analysis (LDA) rule. We establish theoretical guarantees of its statistical properties in terms of variable selection and error rate consistency; in addition, we provide an explicit interpretation of the extra information brought by an additional feature in an LDA problem under some mild distributional assumptions. We demonstrate through extensive simulation studies and real data analysis that this new algorithm drastically improves computational speed compared with other high-dimensional LDA methods, while maintaining comparable or even better classification performance. Second, we propose a semiparametric Latent Mixed Gaussian Copula Regression (LMGCR) model to perform linear regression for high-dimensional mixed data. The model assumes that the observed mixed covariates are generated from latent variables that follow the Gaussian copula. We develop an estimator of the regression coefficients in LMGCR and prove its estimation and variable selection consistency. In addition, we devise a prediction rule given by LMGCR and quantify its prediction error under mild conditions. We demonstrate through extensive simulation studies and real data analysis that the proposed model has superior performance in both coefficient estimation and prediction. Finally, we propose a semiparametric Latent Mixed Gaussian Copula Classification (LMGCC) rule to perform classification of an ordinal response using unnormalized high-dimensional data. Our classification rule learns the Bayes rule derived from joint modeling of the ordinal response and continuous features through a latent Gaussian copula model. We develop an estimator of the regression coefficients in predicting the latent response and prove its estimation and variable selection consistency. In addition, we establish that our devised LMGCC has error rate consistency. We demonstrate through extensive simulation studies and real data analysis that the proposed method has superior performance in ordinal classification.
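    As a rough sketch of the first contribution (a simplification, not the dissertation's actual algorithm or stopping rule), a greedy forward search for a binary LDA rule can score each candidate feature by the closed-form increase in the estimated Mahalanobis distance between the class means under the pooled covariance of the currently selected set.

```python
# Simplified greedy forward search for binary LDA using only closed-form
# quantities (class means and pooled covariance of the selected features).
import numpy as np

def greedy_lda_select(X, y, n_select):
    n, p = X.shape
    X0, X1 = X[y == 0], X[y == 1]
    selected, remaining = [], list(range(p))
    for _ in range(n_select):
        best_j, best_dist = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            delta = X1[:, idx].mean(0) - X0[:, idx].mean(0)
            pooled = (np.cov(X0[:, idx], rowvar=False) * (len(X0) - 1)
                      + np.cov(X1[:, idx], rowvar=False) * (len(X1) - 1)) / (n - 2)
            pooled = np.atleast_2d(pooled)
            dist = float(delta @ np.linalg.solve(pooled, delta))  # Mahalanobis distance
            if dist > best_dist:
                best_j, best_dist = j, dist
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = np.repeat([0, 1], 50)
X[y == 1, :3] += 1.5                         # three truly informative features
print(greedy_lda_select(X, y, n_select=3))   # typically recovers features 0, 1, 2
```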

    Optimal classifier selection and negative bias in error rate estimation: An empirical study on high-dimensional prediction

    In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias. In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. We then assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly. We conclude that the strategy to present only the optimal result is not acceptable, and suggest alternative approaches for properly reporting classification accuracy.
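    A small-scale analogue of the permutation experiment (a handful of classifiers rather than the study's 124 variants, and synthetic microarray-like data rather than the real sets) illustrates the bias: with labels carrying no information, the minimal cross-validated error over candidate classifiers falls below the 50% a single prespecified method would achieve, and that gap is the selection bias.

```python
# Permute-the-labels sketch: the minimum CV error over several classifiers
# is optimistically biased even when no signal exists.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2000))             # microarray-like: few samples, many genes
y = rng.permutation(np.repeat([0, 1], 25))  # permuted labels: independent of X

classifiers = {
    "1-NN": KNeighborsClassifier(1),
    "5-NN": KNeighborsClassifier(5),
    "RBF SVM": SVC(),
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=2),
}
errors = {name: 1 - cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
print(errors)
print("minimal error over variants:", min(errors.values()))  # dips below 0.5 by selection alone
```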

    On High Dimensional Sparse Regression and Its Inference

    In the first part of this work, we aim to develop a sparse projection regression modeling (SPReM) framework to perform multivariate regression modeling with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among multivariate responses. Our SPReM is devised to specifically address the low statistical power issue of many standard statistical approaches, such as Hotelling's $T^2$ test statistic or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPReM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for SURP. Furthermore, we extend SURP to the sparse multi-rank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we have systematically investigated the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis have shown that SPReM outperforms other state-of-the-art methods. In the second part of this work, we propose a Hard Thresholded Regression (HTR) framework for simultaneous variable selection and unbiased estimation in high-dimensional linear regression. This new framework is motivated by its close connection with $L_0$ regularization and best subset selection under orthogonal design, while enjoying several key computational and theoretical advantages over many existing penalization methods (e.g., SCAD or MCP). Computationally, HTR is a fast two-stage estimation procedure consisting of a first step that calculates a coarse initial estimator and a second step that solves a linear program. Theoretically, under some mild conditions, the HTR estimator is shown to enjoy the strong oracle property and the thresholded property even when the number of covariates may grow at an exponential rate. We also propose to incorporate the regularized covariance estimator into the estimation procedure in order to better trade off between noise accumulation and correlation modeling. In this scenario with a regularized covariance matrix, HTR includes Sure Independence Screening as a special case. Both simulation and real data results show that HTR outperforms other state-of-the-art methods. In the third part of this work, we focus on multicategory classification and propose sparse multicategory discriminant analysis. Many supervised machine learning tasks can be cast as multicategory classification problems. Linear discriminant analysis has been well studied in two-class classification problems and can be easily extended to multicategory cases. For high-dimensional classification, traditional linear discriminant analysis fails due to diverging spectra and accumulation of noise; researchers have therefore proposed penalized LDA (Fan et al., 2012; Witten and Tibshirani, 2011). However, most available methods for high-dimensional multi-class LDA are based on an iterative algorithm, which is computationally expensive and not theoretically justified. In this work, we present a new framework for sparse multicategory discriminant analysis (SMDA) for high-dimensional multi-class classification that simultaneously extracts the discriminant directions. Our SMDA can be cast as a convex program, which distinguishes it from other state-of-the-art methods. We evaluate the performance of the resulting methods through an extensive simulation study and a real data analysis.
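    The two-stage structure of HTR can be sketched as follows, with the caveat that the linear-programming second step and the regularized covariance estimator are simplified here to a hard-thresholded ridge estimate followed by an ordinary least squares refit; the dimensions and the cutoff are assumptions for illustration.

```python
# Simplified two-stage sketch of hard thresholded regression:
# coarse estimate -> hard threshold -> low-bias refit on the support.
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(3)
n, p, s = 100, 300, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 3.0                                       # five true signals
y = X @ beta + rng.normal(size=n)

coarse = Ridge(alpha=1.0).fit(X, y).coef_            # stage 1: coarse initial estimator
support = np.sort(np.argsort(np.abs(coarse))[-10:])  # hard threshold: keep the 10 largest
refit = LinearRegression().fit(X[:, support], y)     # stage 2: low-bias refit (OLS here)
print("selected:", support)                          # should include indices 0..4
print("refit coefficients:", refit.coef_.round(2))
```

    The refit step is what removes the shrinkage bias of the first stage, which is the sense in which such procedures target "unbiased estimation" on the selected support.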

    Empirical Validation of Objective Functions in Feature Selection Based on Acceleration Motion Segmentation Data

    A recent change in evaluation criteria, from accuracy alone to a trade-off with time delay, has inspired multivariate energy-based approaches to motion segmentation using acceleration. The essence of multivariate approaches lies in the construction of high-dimensional energy, which requires feature subset selection in machine learning. Because they are fast, filter methods are preferred; however, their poorer estimates are a main concern. This paper aims at empirical validation of three objective functions for filter approaches, the Fisher discriminant ratio, multiple correlation (MC), and mutual information (MI), through two subsequent experiments. For the 63 possible subsets of 6 variables for acceleration motion segmentation, the three functions, together with a theoretical measure, are compared with two wrappers, k-nearest neighbor and Bayes classifiers, in terms of general statistics and of strongly relevant variable identification by social network analysis. Then four kinds of newly proposed multivariate energy are compared with a conventional univariate approach in terms of accuracy and time delay. Finally, it appears that MC and MI are acceptable enough to match the estimates of the two wrappers, and the multivariate approaches are justified by our analytic procedures.
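    For concreteness, here is a hedged sketch of the three filter criteria on hypothetical acceleration-energy data, scored per variable for simplicity, whereas the paper evaluates all 63 subsets: the Fisher discriminant ratio, multiple correlation (reducing to squared correlation in the single-variable case), and mutual information.

```python
# Per-variable filter scores on synthetic data standing in for the
# six acceleration-energy variables; variables 0 and 1 carry the signal.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))                 # six candidate energy variables
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

X0, X1 = X[y == 0], X[y == 1]
fisher = (X1.mean(0) - X0.mean(0)) ** 2 / (X1.var(0) + X0.var(0))      # Fisher discriminant ratio
mc = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(6)])  # squared correlation with class
mi = mutual_info_classif(X, y, random_state=4)                         # mutual information

for name, score in [("FDR", fisher), ("MC", mc), ("MI", mi)]:
    print(name, "ranking:", np.argsort(score)[::-1])  # variables 0 and 1 should rank first
```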