Sparse Linear Discriminant Analysis with More Variables than Observations

Abstract

It is well known that classical linear discriminant analysis (LDA) performs classification well when the number of observations is much larger than the number of variables. When the number of variables exceeds the number of observations, however, classical LDA cannot be applied because the within-group covariance matrix is singular. Recently proposed LDA methods that can handle a singular within-group covariance matrix are reviewed. Most of these methods focus on regularizing the within-group covariance matrix, but they give less attention to sparsity (variable selection), interpretability, and computational cost, all of which are important in high-dimensional problems. Since most of the original variables may be irrelevant or redundant, it is natural to seek sparse solutions that involve only a small fraction of the variables. In the present work, new sparse LDA methods suited to high-dimensional data are proposed. The first two methods assume that the groups share a common within-group covariance matrix and approximate this matrix by a diagonal matrix; one is a variant of the other that sacrifices some accuracy for greater computational speed. Both obtain sparsity by minimizing an l1 norm while maximizing discriminatory power under a common loss function with a tuning parameter. The third method assumes that the groups share common eigenvectors in the eigenvector-eigenvalue decompositions of their within-group covariance matrices, while their eigenvalues may differ. The fourth method assumes that the within-group covariance matrices are proportional to one another. The fifth method is derived from the Dantzig selector and uses optimal scoring to construct the discriminant functions. The third and fourth methods achieve sparsity by imposing a cardinality constraint, with the cardinality level determined by cross-validation. All of the new methods reduce computation time by determining the individual discriminant functions sequentially. The methods are applied to six real data sets and perform well in comparison with two existing methods.
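To make the diagonal-approximation idea concrete, the sketch below illustrates one standard way it can be realized, in the spirit of nearest shrunken centroids: the singular within-group covariance matrix is replaced by its diagonal (the pooled per-variable variances), and an l1-type penalty is imposed by soft-thresholding the standardized class-mean contrasts, so that variables whose contrasts fall below the threshold drop out of the classifier entirely. This is a minimal sketch, not the estimator proposed in the work; the function names and the threshold parameter `lam` (standing in for the tuning parameter selected by cross-validation) are illustrative.

```python
# Minimal sketch of diagonal-covariance sparse LDA (nearest-shrunken-centroids
# style). Not the thesis' exact method; names and `lam` are illustrative.
import numpy as np

def sparse_diagonal_lda_fit(X, y, lam):
    """Fit sparse shrunken centroids; X is (n, p), y holds class labels."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                        # overall mean
    # Pooled within-group variances: the diagonal approximation of Sigma_w.
    s2 = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        s2 += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    s2 /= (X.shape[0] - len(classes))
    s = np.sqrt(s2) + 1e-8                     # guard against zero variance
    centroids = {}
    for c in classes:
        d = (X[y == c].mean(axis=0) - mu) / s  # standardized mean contrast
        d = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)  # l1 soft-threshold
        centroids[c] = mu + s * d              # sparse shrunken centroid
    return centroids, s

def sparse_diagonal_lda_predict(x, centroids, s):
    """Assign x to the nearest centroid in the variance-scaled metric."""
    dists = {c: np.sum(((x - m) / s) ** 2) for c, m in centroids.items()}
    return min(dists, key=dists.get)
```

With p much larger than n, increasing `lam` shrinks more standardized contrasts to zero, so every centroid agrees with the overall mean on those variables and they no longer influence classification; this is the sense in which the l1 penalty yields variable selection alongside discrimination.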
