
A statistical perspective of sampling scores for linear regression

Authors: Chen Siheng, Kovačević Jelena, Singh Aarti, Varma Rohan
Publication date: 09/02/2016

In this paper, we consider the statistical problem of learning a linear model from noisy samples. Existing work has focused on approximating the least squares solution by using leverage-based scores as an importance sampling distribution. However, no finite sample statistical guarantees and no computationally efficient optimal sampling strategies have been proposed. To evaluate the statistical properties of different sampling strategies, we propose a simple yet effective estimator, which is easy to analyze theoretically and is useful in multitask linear regression. We derive the exact mean square error of the proposed estimator for any given sampling scores. Based on minimizing the mean square error, we propose the optimal sampling scores for both the estimator and the predictor, and show that they are influenced by the noise-to-signal ratio. Numerical simulations match the theoretical analysis well.

arXiv.org e-Print Archive

Crossref

A Statistical Perspective on Algorithmic Leveraging

Authors: Ma Ping, Mahoney Michael W., Yu Bin
Publication date: 22/06/2013

One popular method for dealing with large-scale data sets is sampling. For example, by using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales rows/columns of data matrices to reduce the data size before performing computations on the subproblem. This method has been successful in improving computational efficiency of algorithms for matrix problems such as least-squares approximation, least absolute deviations approximation, and low-rank matrix approximation. Existing work has focused on algorithmic issues such as worst-case running times and numerical issues associated with providing high-quality implementations, but none of it addresses statistical aspects of this method. In this paper, we provide a simple yet effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model with a fixed number of predictors. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other. This result is particularly striking, given the well-known result that, from the algorithmic perspective of worst-case analysis, leverage-based sampling provides uniformly superior worst-case algorithmic results, when compared with uniform sampling. Based on these theoretical results, we propose and analyze two new leveraging algorithms. A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets. The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance. Comment: 44 pages, 17 figures
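The sampling-and-rescaling step described in this abstract can be illustrated with a minimal sketch. It assumes the standard definition of leverage scores (squared row norms of an orthonormal basis for the column space of X) and sampling with replacement; the function names `leverage_scores` and `leveraged_least_squares` are hypothetical, and this is not one of the new leveraging algorithms the paper itself proposes.

```python
import numpy as np


def leverage_scores(X):
    """Leverage score of each row: squared norm of the matching row of Q,
    where X = QR is a reduced QR factorization (assumes full column rank)."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)


def leveraged_least_squares(X, y, r, rng):
    """Sample r rows with probability proportional to leverage, rescale
    them, and solve least squares on the smaller subproblem."""
    n = X.shape[0]
    h = leverage_scores(X)
    probs = h / h.sum()
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # Rescale by 1/sqrt(r * p_i) so the sampled Gram matrix is an
    # unbiased estimate of the full Gram matrix X^T X.
    w = 1.0 / np.sqrt(r * probs[idx])
    Xs = X[idx] * w[:, None]
    ys = y[idx] * w
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    beta_true = np.arange(1.0, 6.0)
    y = X @ beta_true + rng.normal(size=10_000)
    print(leveraged_least_squares(X, y, r=500, rng=rng))
```

Replacing `probs` with a uniform vector gives the uniform-sampling baseline that the abstract compares against.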

arXiv.org e-Print Archive

CiteSeerX

A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares

Authors: Mahoney Michael, Raskutti Garvesh
Publication date: 25/08/2015

We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix, $S \in \mathbb{R}^{r \times n}$ with $r \ll n$. Then, rather than solving the LS problem using the full data $(X, Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X, Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p < r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical model $Y = X \beta + \epsilon$, where $\epsilon$ is a noise process. We provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms. Among other results, we show that the RE can be upper bounded when $p < r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE can not be improved. Comment: 27 pages, 5 figures
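The idea of solving the LS problem on the sketched data (SX, SY) instead of (X, Y) can be sketched as follows. The dense Gaussian sketching matrix with entries of variance 1/r is an assumption made here for illustration (the abstract mentions random projections and leverage-score sampling in general), and the function name `sketched_least_squares` is hypothetical.

```python
import numpy as np


def sketched_least_squares(X, y, r, rng):
    """Solve min_b ||S X b - S y|| with a Gaussian sketch S of shape (r, n).

    Assumes p < r << n, where X has shape (n, p).
    """
    n = X.shape[0]
    # Entries are N(0, 1/r), so E[S^T S] is the identity matrix.
    S = rng.normal(scale=1.0 / np.sqrt(r), size=(r, n))
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, r = 5_000, 5, 200
    X = rng.normal(size=(n, p))
    beta_true = np.ones(p)
    y = X @ beta_true + rng.normal(size=n)  # model Y = X beta + eps

    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_sketch = sketched_least_squares(X, y, r, rng)

    # Residual norm on the full data: the sketched solution can never
    # beat the full LS solution on this measure.
    print(np.linalg.norm(y - X @ beta_full))
    print(np.linalg.norm(y - X @ beta_sketch))
```

Comparing the two residual norms for different values of `r` gives a rough empirical feel for the worst-case style guarantee, while comparing `beta_sketch` with `beta_true` probes the statistical side.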

arXiv.org e-Print Archive

CiteSeerX

Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

Author: Mahoney Michael W.
Publication date: 08/10/2010

In recent years, ideas from statistics and scientific computing have begun to interact in increasingly sophisticated and fruitful ways with ideas from computer science and the theory of algorithms to aid in the development of improved worst-case algorithms that are useful for large-scale scientific and Internet data analysis problems. In this chapter, I will describe two recent examples---one having to do with selecting good columns or features from a (DNA Single Nucleotide Polymorphism) data matrix, and the other having to do with selecting good clusters or communities from a data graph (representing a social or information network)---that drew on ideas from both areas and that may serve as a model for exploiting complementary algorithmic and statistical perspectives in order to solve applied large-scale data analysis problems. Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201

arXiv.org e-Print Archive

CiteSeerX