
    A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares

    We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For an LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix $S \in \mathbb{R}^{r \times n}$ with $r \ll n$. Then, rather than solving the LS problem using the full data $(X, Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X, Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p < r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical model $Y = X \beta + \epsilon$, where $\epsilon$ is a noise process. We provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms. Among other results, we show that the RE can be upper bounded when $p < r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE cannot be improved. Comment: 27 pages, 5 figures
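
    The sketch-and-solve idea described in this abstract can be illustrated with a minimal NumPy sketch. It assumes a Gaussian random projection for $S$ (one of several sketching matrices the abstract mentions) and synthetic data drawn from $Y = X\beta + \epsilon$; the function name `sketched_least_squares`, the problem sizes, and the noise level are illustrative choices, not taken from the paper. The final ratio compares residuals on one draw and is only a rough stand-in for the paper's efficiency measures, which are defined in expectation.

```python
import numpy as np

def sketched_least_squares(X, Y, r, rng):
    """Solve the LS problem on the sketched data (SX, SY).

    Assumption: S is a dense Gaussian random projection; the abstract also
    covers other projections and leverage-score sampling.
    """
    n = X.shape[0]
    # S in R^{r x n}, scaled so that E[S^T S] = I_n.
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    beta_sketch, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)
    return beta_sketch

rng = np.random.default_rng(0)
n, p, r = 2000, 5, 200            # illustrative sizes with p < r << n
X = rng.standard_normal((n, p))
beta_true = np.arange(1.0, p + 1.0)
Y = X @ beta_true + 0.1 * rng.standard_normal(n)   # Y = X beta + eps

beta_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
beta_sketch = sketched_least_squares(X, Y, r, rng)

# Ratio of squared residuals on the full data: sketched vs. full solution.
res_full = np.sum((Y - X @ beta_full) ** 2)
res_sketch = np.sum((Y - X @ beta_sketch) ** 2)
ratio = res_sketch / res_full
print(f"residual ratio (sketched / full): {ratio:.4f}")
```

    The ratio is at least 1 by construction, because the full LS solution minimizes the residual on the full data; how close it stays to 1 depends on $r$ relative to $p$.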

    GIANT: Globally Improved Approximate Newton Method for Distributed Optimization

    For distributed computing environments, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver then averages all the ANT directions received from workers to form a Globally Improved ANT (GIANT) direction. GIANT is highly communication efficient and naturally exploits the trade-offs between local computations and global communications, in that more local computations result in fewer overall rounds of communications. Theoretically, we show that GIANT enjoys an improved convergence rate as compared with first-order methods and existing distributed Newton-type methods. Further, and in sharp contrast with many existing distributed Newton-type methods, as well as popular first-order methods, a highly advantageous practical feature of GIANT is that it only involves one tuning parameter. We conduct large-scale experiments on a computer cluster and, empirically, demonstrate the superior performance of GIANT. Comment: Fixed some typos. Improved writing.
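
    The averaging step described in this abstract can be sketched on a single machine by treating row blocks of the data as workers. This is a hedged illustration, not the authors' implementation: it assumes a ridge-regularized least-squares objective, and it assumes each ANT direction is obtained by solving the worker's local Hessian against the global gradient, a detail the abstract does not spell out. The names `giant_step`, `X_parts`, `y_parts`, and `gamma` are hypothetical.

```python
import numpy as np

def giant_step(w, X_parts, y_parts, gamma):
    """One averaged approximate-Newton update for ridge least squares.

    Assumed objective: (1 / 2n) * ||X w - y||^2 + (gamma / 2) * ||w||^2.
    """
    n = sum(Xi.shape[0] for Xi in X_parts)
    # Assumed first round: workers send local gradient sums, driver combines.
    g = sum(Xi.T @ (Xi @ w - yi) for Xi, yi in zip(X_parts, y_parts)) / n
    g = g + gamma * w
    # Second round: each worker forms a direction from its local Hessian.
    directions = []
    for Xi in X_parts:
        H_i = Xi.T @ Xi / Xi.shape[0] + gamma * np.eye(w.size)
        directions.append(np.linalg.solve(H_i, g))
    # Driver averages the local directions and takes the step.
    return w - np.mean(directions, axis=0)

rng = np.random.default_rng(0)
n, d, workers, gamma = 2000, 5, 4, 1e-2     # illustrative sizes
X = rng.standard_normal((n, d))
y = X @ np.arange(1.0, d + 1.0) + 0.1 * rng.standard_normal(n)
X_parts = np.array_split(X, workers)
y_parts = np.array_split(y, workers)

w = np.zeros(d)
for _ in range(20):
    w = giant_step(w, X_parts, y_parts, gamma)

# Reference: exact ridge solution computed on the full data.
w_star = np.linalg.solve(X.T @ X / n + gamma * np.eye(d), X.T @ y / n)
err = np.linalg.norm(w - w_star)
print(f"distance to exact ridge solution: {err:.2e}")
```

    Only `d`-dimensional vectors cross the worker boundary in this sketch; the local Hessians never leave the worker, which is the source of the communication savings the abstract emphasizes.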

    Learning with SGD and Random Features

    Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large-scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini-batches and random features. The latter can be seen as a form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, iterations, step-size and mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
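
    A minimal sketch of the estimator described in this abstract: mini-batch stochastic gradient on a squared loss over random features, with no explicit penalty, so that the number of features, step size, batch size, and number of passes act as the only regularization knobs. It assumes random Fourier features approximating a Gaussian kernel and a toy one-dimensional regression problem; the function names and every parameter value are illustrative, not taken from the paper.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Map X to M random cosine features (assumed Gaussian-kernel features)."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

def sgd_random_features(X, y, M, step, batch, epochs, rng):
    """Mini-batch SGD on the squared loss, with no explicit penalty term."""
    n, d = X.shape
    W = rng.standard_normal((d, M))          # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, M)     # random phases
    Z = random_fourier_features(X, W, b)
    theta = np.zeros(M)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            # Gradient of the mini-batch mean squared error (up to a factor 2).
            grad = Z[idx].T @ (Z[idx] @ theta - y[idx]) / len(idx)
            theta -= step * grad
    return theta, W, b

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

theta, W, b = sgd_random_features(X, y, M=100, step=0.5, batch=20,
                                  epochs=50, rng=rng)
pred = random_fourier_features(X, W, b) @ theta
mse = np.mean((pred - y) ** 2)
baseline = np.mean(y ** 2)                   # error of the zero predictor
print(f"training MSE: {mse:.4f} (zero predictor: {baseline:.4f})")
```

    Stopping earlier, shrinking `step`, or reducing `M` each yields a smoother fit, which is one way to see the implicit regularization the abstract refers to.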