7 research outputs found

    Faster Randomized Interior Point Methods for Tall/Wide Linear Programs

    Linear programming (LP) is an extremely useful tool which has been successfully applied to solve various problems in a wide range of areas, including operations research, engineering, economics, and even more abstract mathematical areas such as combinatorics. It is also used in many machine learning applications, such as $\ell_1$-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc. Interior Point Methods (IPMs) are one of the most popular methods for solving LPs, both in theory and in practice. Their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration. In this paper, we consider both feasible and infeasible IPMs for the special case where the number of variables is much larger than the number of constraints. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with iterative solvers such as Conjugate Gradient or Chebyshev Iteration, provably guarantees that IPM algorithms (suitably modified to account for the error incurred by the approximate solver) converge to a feasible, approximately optimal solution, without increasing their iteration complexity. Our empirical evaluations verify our theoretical results on both real-world and synthetic data.
    Comment: Extended version of the NeurIPS 2020 submission. arXiv admin note: substantial text overlap with arXiv:2003.0807
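    To make the preconditioning idea concrete, the following is a minimal, hedged sketch of the kind of construction the abstract describes: the IPM normal-equation matrix $AD^2A^T$ (with $A$ an $m \times n$ constraint matrix, $n \gg m$, and $D$ a positive diagonal IPM scaling) is preconditioned by sketching the tall matrix $(AD)^T$ with a Gaussian map and reusing the R factor of its QR decomposition inside Conjugate Gradient. The matrix sizes, the Gaussian sketch, and the sketch dimension are illustrative assumptions, not the paper's exact choices.

```python
# Minimal sketch (assumptions: Gaussian sketch, sketch size w = 4m, random data;
# not the paper's reference implementation) of randomized preconditioning for the
# IPM normal equations (A D^2 A^T) dy = rhs with n >> m.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
m, n = 50, 5000                           # few constraints, many variables
A = rng.standard_normal((m, n))           # LP constraint matrix (illustrative)
D = np.abs(rng.standard_normal(n)) + 0.1  # positive diagonal IPM scaling
rhs = rng.standard_normal(m)

AD = A * D                                # A @ diag(D), shape m x n
w = 4 * m                                 # sketch size: a small multiple of m
S = rng.standard_normal((n, w)) / np.sqrt(w)
_, R = np.linalg.qr((AD @ S).T)           # R^T R = (AD S)(AD S)^T ~= A D^2 A^T

def precond(v):
    # Apply M^{-1} = R^{-1} R^{-T}, an approximate inverse of A D^2 A^T.
    return np.linalg.solve(R, np.linalg.solve(R.T, v))

normal_eq = LinearOperator((m, m), matvec=lambda v: AD @ (AD.T @ v))
M = LinearOperator((m, m), matvec=precond)

dy, info = cg(normal_eq, rhs, M=M)        # preconditioned Conjugate Gradient
print("CG converged:", info == 0)
```

    Because $R^TR = (ADS)(ADS)^T \approx AD^2A^T$ whenever the sketch is a good subspace embedding for the rows of $AD$, the preconditioned system is well conditioned and CG needs only a few iterations.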

    Faster Matrix Algorithms Via Randomized Sketching & Preconditioning

    Recently, in statistics and machine learning, the notion of Randomization in Numerical Linear Algebra (RandNLA) has not only evolved into a vital new tool for designing fast and efficient algorithms, but has also promised a sound statistical foundation for modern large-scale data analysis. In this dissertation, we study the application of matrix sketching and sampling algorithms to four problems in statistics and machine learning:
    1. Ridge regression: We consider ridge regression, a variant of regularized least squares regression that is particularly suitable in settings where the number of predictor variables greatly exceeds the number of observations. We present a simple, iterative, sketching-based algorithm for ridge regression that guarantees high-quality approximations to the optimal solution vector (a hedged illustration of this kind of iterative scheme is given after this abstract). Our analysis builds upon two simple structural results that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized linear algebra. An important contribution of our work is the analysis of the behavior of sub-sampled ridge regression problems when ridge leverage scores and random projections are used: we prove that accurate approximations can be achieved by a sample whose size depends on the degrees of freedom of the ridge regression problem rather than the dimensions of the design matrix.
    2. Fisher discriminant analysis: We develop faster algorithms for regularized Fisher discriminant analysis (RFDA), a widely used method for classification and dimensionality reduction. More precisely, we present an iterative algorithm for massively under-constrained RFDA based on randomized matrix sketching. Our algorithm comes with provable accuracy guarantees when compared to the conventional approach. We analyze the behavior of RFDA when leverage scores, ridge leverage scores, and other random projection-based constructions are used for the dimensionality reduction, and prove that accurate approximations can be achieved by a sample whose size depends on the effective degrees of freedom of the RFDA problem. Our results yield significant improvements over existing approaches, and our empirical evaluations support our theoretical analyses.
    3. Linear programming: We further extend the RandNLA framework to analyze and speed up linear programming (LP), an extremely useful tool which has been successfully applied to solve various problems, including many machine learning applications such as $\ell_1$-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc. Interior Point Methods (IPMs) are one of the most popular methods for solving LPs, both in theory and in practice, and their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with iterative solvers such as Conjugate Gradient or Chebyshev Iteration, provably guarantees that IPM algorithms (suitably modified to account for the error incurred by the approximate solver) converge to a feasible, approximately optimal solution, without increasing their iteration complexity.
    4. Cost-preserving projections: In the context of constrained low-rank approximation problems, we study the notion of a cost-preserving projection, which guarantees that the cost of any rank-k projection can be approximately preserved using a smaller sketch of the original data matrix. It has recently emerged as a fundamental principle in the design and analysis of sketching-based algorithms for common matrix operations that are critical in data mining and machine learning.
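    As a hedged illustration of contribution 1 above, the snippet below sketches one simple iterative, sketching-based ridge-regression solver in the regime where predictors far outnumber observations: the dual system $(AA^T + \lambda I)\alpha = b$ is solved by iterative refinement against a sketched surrogate $(AS)(AS)^T + \lambda I$. The Gaussian sketch, the sketch size, and the refinement loop are illustrative assumptions rather than the dissertation's exact algorithm.

```python
# Minimal sketch (assumptions: Gaussian sketch, sketch size s = 8n, random data;
# not the dissertation's exact algorithm) of an iterative, sketching-based solver
# for ridge regression with many more predictors than observations.
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5000, 1.0               # n observations, d >> n predictors, ridge parameter
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Dual ridge regression: x = A^T alpha with (A A^T + lam I) alpha = b.
s = 8 * n                                # sketch size: a small multiple of n
S = rng.standard_normal((d, s)) / np.sqrt(s)
AS = A @ S
M = AS @ AS.T + lam * np.eye(n)          # sketched surrogate of A A^T + lam I

alpha = np.zeros(n)
for _ in range(20):                      # iterative refinement against the true operator
    residual = b - (A @ (A.T @ alpha) + lam * alpha)
    alpha += np.linalg.solve(M, residual)

x_ridge = A.T @ alpha
x_exact = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), b)
print("relative error:", np.linalg.norm(x_ridge - x_exact) / np.linalg.norm(x_exact))
```

    The full d-dimensional design matrix enters only through matrix-vector products; the only system ever factored is the small n x n sketched surrogate, which is the point of the sketch.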

    Structure-informed clustering for population stratification in association studies

    Background: Identifying variants associated with complex traits is a challenging task in genetic association studies, due both to linkage disequilibrium (LD) between genetic variants and to population stratification unrelated to disease risk. Existing methods for population structure correction use principal component analysis or linear mixed models with a random effect when modeling associations between a trait of interest and genetic markers. However, due to stringent significance thresholds and latent interactions between the markers, these methods often fail to detect genuinely associated variants. Results: To overcome this, we propose CluStrat, which corrects for complex, arbitrarily structured populations while leveraging the LD-induced distances between genetic markers. It performs agglomerative hierarchical clustering using the Mahalanobis distance computed from the covariance matrix of the markers. In simulation studies, we show that our method outperforms existing methods in detecting true causal variants. Applying CluStrat to the WTCCC2 and UK Biobank cohorts, we found biologically relevant associations in schizophrenia and myocardial infarction. CluStrat was also able to correct for population structure in the polygenic adaptation of height in Europeans. Conclusions: CluStrat highlights the advantages of biologically relevant distance metrics, such as the Mahalanobis distance, which captures the cryptic interactions within populations in the presence of LD better than the Euclidean distance.
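    The clustering step described above can be pictured with the following minimal sketch, which clusters individuals hierarchically under a Mahalanobis distance computed from a regularized marker covariance matrix, so that LD-correlated markers are not double-counted. The synthetic genotype matrix, the average-linkage criterion, and the number of strata are illustrative assumptions, not CluStrat's actual pipeline.

```python
# Minimal sketch (assumptions: synthetic genotypes, average linkage, 3 strata;
# not CluStrat's actual pipeline) of Mahalanobis-distance agglomerative clustering
# for population stratification.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
n_individuals, n_markers = 150, 100
G = rng.binomial(2, 0.3, size=(n_individuals, n_markers)).astype(float)  # 0/1/2 genotype calls

# Mahalanobis distance between individuals, using a regularized marker covariance
# so that correlated (LD-linked) markers are not double-counted.
cov = np.cov(G, rowvar=False) + 1e-3 * np.eye(n_markers)
VI = np.linalg.inv(cov)
dists = pdist(G, metric="mahalanobis", VI=VI)

Z = linkage(dists, method="average")             # agglomerative hierarchical clustering
strata = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 population strata
print("stratum sizes:", np.bincount(strata)[1:])
```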