Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.
Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis
Database theory and database practice are typically the domain of computer
scientists who adopt what may be termed an algorithmic perspective on their
data. This perspective is very different from the more statistical perspective
adopted by statisticians, scientific computing researchers, machine learners,
and others who work on what may be broadly termed statistical data analysis.
In this article,
I will address fundamental aspects of this algorithmic-statistical disconnect,
with an eye to bridging the gap between these two very different approaches. A
concept that lies at the heart of this disconnect is that of statistical
regularization, a notion that has to do with how robust the output of an
algorithm is to the noise properties of the input data. Although it is nearly
completely absent from computer science, which historically has taken the input
data as given and modeled algorithms discretely, regularization in one form or
another is central to nearly every application domain that applies algorithms
to noisy data. By using several case studies, I will illustrate, both
theoretically and empirically, the nonobvious fact that approximate
computation, in and of itself, can implicitly lead to statistical
regularization. This and other recent work suggests that, by exploiting in a
more principled way the statistical properties implicit in worst-case
algorithms, one can in many cases satisfy the bicriteria of having algorithms
that are scalable to very large-scale databases and that also have good
inferential or predictive properties.
Comment: To appear in the Proceedings of the 2012 ACM Symposium on Principles
of Database Systems (PODS 2012).
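As a simplified illustration of how approximate computation can act as regularization (this is a schematic example, not one of the article's case studies), the following Python/NumPy sketch compares an exact least-squares solution with an early-stopped gradient-descent approximation; the early-stopped iterate is shrunk toward zero, much like a ridge-regularized solution. Problem sizes, the step size, and the iteration count are assumptions chosen for illustration.

# Toy illustration: early-stopped gradient descent on a least-squares
# objective behaves like an explicitly regularized (ridge-type) solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.5 * rng.standard_normal(n)   # noisy observations

def gd_least_squares(A, b, steps, lr=1e-3):
    """Run a fixed number of gradient-descent steps on the least-squares objective."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x -= lr * (A.T @ (A @ x - b))
    return x

x_exact = np.linalg.lstsq(A, b, rcond=None)[0]   # fully "converged" answer
x_early = gd_least_squares(A, b, steps=50)       # approximate answer

# The early-stopped iterate has smaller norm: it is implicitly shrunk toward
# zero, mimicking explicit ridge regularization.
print(np.linalg.norm(x_early), np.linalg.norm(x_exact))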
Revisiting the Nyström Method for Improved Large-Scale Machine Learning
We reconsider randomized algorithms for the low-rank approximation of
symmetric positive semi-definite (SPSD) matrices such as Laplacian and kernel
matrices that arise in data analysis and machine learning applications. Our
main results consist of an empirical evaluation of the performance quality and
running time of sampling and projection methods on a diverse suite of SPSD
matrices. Our results highlight complementary aspects of sampling versus
projection methods; they characterize the effects of common data preprocessing
steps on the performance of these algorithms; and they point to important
differences between uniform sampling and nonuniform sampling methods based on
leverage scores. In addition, our empirical results illustrate that existing
theory is so weak that it does not provide even a qualitative guide to
practice. Thus, we complement our empirical results with a suite of worst-case
theoretical bounds for both random sampling and random projection methods.
These bounds are qualitatively superior to existing bounds---e.g. improved
additive-error bounds for spectral and Frobenius norm error and relative-error
bounds for trace norm error---and they point to future directions to make these
algorithms useful in even larger-scale machine learning applications.
Comment: 60 pages, 15 color figures; updated proof of Frobenius norm bounds,
added comparison to projection-based low-rank approximations, and an analysis
of the power method applied to SPSD sketches.
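For readers who want the basic construction in code, the following is a minimal Python/NumPy sketch of the standard Nyström approximation with uniform column sampling, applied to a toy RBF kernel matrix; the nonuniform leverage-score and projection-based variants evaluated in the paper are not shown, and the kernel and sample size are illustrative.

# Minimal sketch of the standard Nystrom approximation of an SPSD matrix K
# from c uniformly sampled columns: K is approximated by C W^+ C^T.
import numpy as np

def nystrom_approx(K, c, rng=None):
    """Rank-at-most-c Nystrom approximation of an SPSD matrix K."""
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    idx = rng.choice(n, size=c, replace=False)   # uniform column sampling
    C = K[:, idx]                                # n x c sampled columns
    W = K[np.ix_(idx, idx)]                      # c x c intersection block
    return C @ np.linalg.pinv(W) @ C.T

# Example on a small RBF kernel matrix built from random points.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T))
K_hat = nystrom_approx(K, c=50, rng=1)
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))   # relative error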
A Training Set Subsampling Strategy for the Reduced Basis Method
We present a subsampling strategy for the offline stage of the Reduced Basis
Method. The approach is aimed at bringing down the considerable offline costs
associated with using a finely-sampled training set. The proposed algorithm
exploits the potential of the pivoted QR decomposition and the discrete
empirical interpolation method to identify important parameter samples. It
consists of two stages. In the first stage, we construct a low-fidelity
approximation to the solution manifold over a fine training set. Then, for the
available low-fidelity snapshots of the output variable, we apply the pivoted
QR decomposition or the discrete empirical interpolation method to identify a
set of sparse sampling locations in the parameter domain. These points reveal
the structure of the parametric dependence of the output variable. The second
stage proceeds with a subsampled training set containing far fewer parameters
than the initial training set. Different subsampling strategies inspired by
recent variants of the empirical interpolation method are also considered.
Tests on benchmark examples justify the new approach and
show its potential to substantially speed up the offline stage of the Reduced
Basis Method, while generating reliable reduced-order models.
Comment: 31 pages, 10 figures, 6 tables.
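A schematic version of the pivoted-QR selection step is sketched below in Python, assuming SciPy and an already-computed matrix of low-fidelity output snapshots (one column per candidate training parameter); the snapshot construction and the DEIM-based variants are omitted, and all names and sizes are illustrative.

# Schematic sketch: pick important parameter samples by column-pivoted QR on a
# low-fidelity snapshot matrix (n_outputs x n_train).
import numpy as np
from scipy.linalg import qr

def subsample_training_set(snapshots, m):
    """Return m parameter indices selected by column-pivoted QR."""
    # The pivot order ranks columns (parameter samples) by how much new
    # information each one contributes.
    _, _, pivots = qr(snapshots, pivoting=True, mode='economic')
    return np.sort(pivots[:m])

# Toy usage: 40 output values per parameter, 1000 candidate parameters,
# keep 25 of them for the offline stage.
rng = np.random.default_rng(0)
S = rng.standard_normal((40, 1000))
subset = subsample_training_set(S, m=25)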
Topics in Matrix Sampling Algorithms
We study three fundamental problems of Linear Algebra, lying at the heart of
various Machine Learning applications, namely: 1)"Low-rank Column-based Matrix
Approximation". We are given a matrix A and a target rank k. The goal is to
select a subset of columns of A and, by using only these columns, compute a
rank k approximation to A that is as good as the rank k approximation that
would have been obtained by using all the columns; 2) "Coreset Construction in
Least-Squares Regression". We are given a matrix A and a vector b. Consider the
(over-constrained) least-squares problem of minimizing ||Ax-b|| over all
vectors x in D. The domain D represents the constraints on the solution and can
be arbitrary. The goal is to select a subset of the rows of A and b and, by
using only these rows, find a solution vector that is as good as the solution
vector that would have been obtained by using all the rows; 3) "Feature
Selection in K-means Clustering". We are given a set of points described with
respect to a large number of features. The goal is to select a subset of the
features and, by using only this subset, obtain a k-partition of the points
that is as good as the partition that would have been obtained by using all the
features. We present novel algorithms for all three problems mentioned above.
Our results can be viewed as follow-up research to a line of work known as
"Matrix Sampling Algorithms". [Frieze, Kanna, Vempala, 1998] presented the
first such algorithm for the Low-rank Matrix Approximation problem. Since then,
such algorithms have been developed for several other problems, e.g. Graph
Sparsification and Linear Equation Solving. Our contributions to this line of
research are: (i) improved algorithms for Low-rank Matrix Approximation and
Regression, and (ii) algorithms for a new problem domain (K-means Clustering).
Comment: PhD Thesis, 150 pages.
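One standard building block behind coreset constructions for least-squares regression is leverage-score row sampling. The Python/NumPy sketch below shows that ingredient only, with illustrative names and constants; it is not a reproduction of the thesis's algorithms.

# Minimal sketch: sample r rows of (A, b) with probabilities given by the
# leverage scores of A, rescaling each kept row by 1/sqrt(r * p_i).
import numpy as np

def leverage_score_coreset(A, b, r, rng=None):
    """Return a rescaled row sample (A_s, b_s) of the least-squares problem."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(A)              # thin QR; row norms of Q are leverages
    lev = np.sum(Q**2, axis=1)
    p = lev / lev.sum()
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    return A[idx] * scale[:, None], b[idx] * scale

# Solve the sampled problem and compare against the full solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20))
b = A @ rng.standard_normal(20) + rng.standard_normal(5000)
As, bs = leverage_score_coreset(A, b, r=200, rng=1)
x_sampled = np.linalg.lstsq(As, bs, rcond=None)[0]
x_full = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_sampled - x_full))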