14 research outputs found

    Relative Errors for Deterministic Low-Rank Matrix Approximations

    We consider processing an n x d matrix A in a stream with row-wise updates according to a recent algorithm called Frequent Directions (Liberty, KDD 2013). This algorithm maintains an l x d matrix Q deterministically, processing each row in O(d l^2) time; the processing time can be decreased to O(d l) with a slight modification in the algorithm and a constant increase in space. We show that if one sets l = k + k/eps and returns Q_k, a k x d matrix that is the best rank k approximation to Q, then we achieve the following properties: ||A - A_k||_F^2 <= ||A||_F^2 - ||Q_k||_F^2 <= (1+eps) ||A - A_k||_F^2 and, where pi_{Q_k}(A) is the projection of A onto the rowspace of Q_k, ||A - pi_{Q_k}(A)||_F^2 <= (1+eps) ||A - A_k||_F^2. We also show that Frequent Directions cannot be adapted to a sparse version in an obvious way that retains the l original rows of the matrix, as opposed to a linear combination or sketch of the rows. Comment: 16 pages, 0 figures
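    The row-wise update and the two bounds stated in this abstract can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration, not the authors' code: it implements the simple O(d l^2)-per-row variant (insert the new row into the last, zeroed row, take an SVD, subtract the smallest squared singular value), it assumes l <= d, and the function names `frequent_directions` and `top_k`, the test matrix, and the parameter values are all hypothetical.

```python
import numpy as np

def frequent_directions(A, ell):
    """Return an ell x d sketch Q of the n x d matrix A (assumes ell <= d)."""
    n, d = A.shape
    Q = np.zeros((ell, d))
    for row in A:
        Q[-1] = row  # the last row is zero after every shrink step
        _, s, Vt = np.linalg.svd(Q, full_matrices=False)
        delta = s[-1] ** 2  # smallest squared singular value
        s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        Q = s_shrunk[:, None] * Vt  # last row becomes zero again
    return Q

def top_k(Q, k):
    """Best rank-k approximation Q_k (as k rows) and its orthonormal row basis."""
    _, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return s[:k, None] * Vt[:k], Vt[:k]

# Hypothetical test data: a noisy rank-k matrix.
rng = np.random.default_rng(0)
n, d, k, eps = 200, 30, 3, 0.5
ell = int(np.ceil(k + k / eps))  # l = k + k/eps
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A += 0.1 * rng.standard_normal((n, d))

Q = frequent_directions(A, ell)
Qk, Vk = top_k(Q, k)

s_A = np.linalg.svd(A, compute_uv=False)
tail = float(np.sum(s_A[k:] ** 2))                # ||A - A_k||_F^2
gap = float(np.sum(A ** 2) - np.sum(Qk ** 2))     # ||A||_F^2 - ||Q_k||_F^2
proj_err = float(np.sum((A - A @ Vk.T @ Vk) ** 2))  # ||A - pi_{Q_k}(A)||_F^2

print("tail:", tail, "gap:", gap, "projection error:", proj_err)
print("bound (1+eps)*tail:", (1 + eps) * tail)
```

    Running the script prints the three quantities next to the (1+eps) bound, so the inequalities from the abstract can be checked by eye for this one input; it does not exercise the faster O(d l) variant mentioned in the abstract.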

    Improved Practical Matrix Sketching with Guarantees

    Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This paper attempts to categorize and compare most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees. For instance, we observe that a simple heuristic, iSVD, with no guarantees, tends to outperform all known approaches in terms of size/error trade-off. We modify FrequentDirections, the best performing method with guarantees under the size/error trade-off, to match the performance of iSVD while retaining its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. In comparing techniques in the time/error trade-off, techniques based on hashing or sampling tend to perform better. In this setting we modify the most studied sampling regime to retain its error guarantee but obtain dramatic improvements in the time/error trade-off. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings. Comment: 27 pages

    Optimal Principal Component Analysis in Distributed and Streaming Models

    We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in R^{m \times n}$, a rank parameter $k < rank(A)$, and an accuracy parameter $0 < \epsilon < 1$, we want to output an $m \times k$ orthonormal matrix $U$ for which $|| A - U U^T A ||_F^2 \le (1 + \epsilon) \cdot || A - A_k ||_F^2$, where $A_k \in R^{m \times n}$ is the best rank-$k$ approximation to $A$. This paper provides improved algorithms for distributed PCA and streaming PCA. Comment: STOC2016 full version
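    To make the objective in this abstract concrete, the following NumPy sketch evaluates the cost ||A - U U^T A||_F^2 for two choices of U. It only illustrates the quantity being approximated and is not one of the paper's distributed or streaming algorithms; the helper name `pca_cost`, the random test matrix, and the sizes are assumptions made for the example.

```python
import numpy as np

def pca_cost(A, U):
    """Squared Frobenius error ||A - U U^T A||_F^2 for an m x k matrix U."""
    return float(np.linalg.norm(A - U @ (U.T @ A), "fro") ** 2)

# Hypothetical test data.
rng = np.random.default_rng(1)
m, n, k = 40, 25, 4
A = rng.standard_normal((m, n))

# Optimal U: the top-k left singular vectors of A.
left, s, _ = np.linalg.svd(A, full_matrices=False)
U_opt = left[:, :k]
best = float(np.sum(s[k:] ** 2))  # ||A - A_k||_F^2

# An arbitrary orthonormal U, obtained from a QR factorization.
U_rand, _ = np.linalg.qr(rng.standard_normal((m, k)))

cost_opt = pca_cost(A, U_opt)
cost_rand = pca_cost(A, U_rand)
print("optimal cost:", cost_opt, "||A - A_k||_F^2:", best)
print("random-U cost relative to optimum:", cost_rand / best)
```

    A U returned by a (1 + epsilon)-approximate algorithm would have a cost ratio between 1 and 1 + epsilon in the last printed line; the random U here is only a baseline for comparison.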

    Continuous Matrix Approximation on Distributed Data

    Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of "tracking approximations to a matrix" in the distributed streaming model. In this model, there are m distributed sites, each observing a distinct stream of data (where each element is a row of a distributed matrix) and each having a communication channel with a coordinator, and the goal is to track an ε-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: | ‖Ax‖^2 − ‖Bx‖^2 | ≤ ε ‖A‖_F^2. Our algorithms work in streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/ε) log(βN)) communication, where N is the size of the stream (at the time of the query) and β is an upper bound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with real large datasets demonstrate the efficiency of these protocols.
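    The guarantee in this abstract is stated over all unit vectors x, and the worst case over x equals the spectral norm of A^T A − B^T B, so it can be checked with a single matrix norm. The sketch below shows that check only. It is not the paper's communication protocol: B here is a stand-in built from a truncated SVD of A, and the name `directional_error`, the test matrix, and the rank r are assumptions for illustration.

```python
import numpy as np

def directional_error(A, B):
    """Max over unit x of | ||Ax||^2 - ||Bx||^2 |, i.e. ||A^T A - B^T B||_2."""
    return float(np.linalg.norm(A.T @ A - B.T @ B, 2))

# Hypothetical test data.
rng = np.random.default_rng(2)
A = rng.standard_normal((300, 20))

# Stand-in sketch: keep the top r singular directions of A (r x d matrix).
r = 5
_, s, Vt = np.linalg.svd(A, full_matrices=False)
B = s[:r, None] * Vt[:r]

err = directional_error(A, B)
eps_achieved = err / float(np.sum(A ** 2))  # smallest eps with err <= eps * ||A||_F^2

# Spot check on one random unit direction.
x = rng.standard_normal(A.shape[1])
x /= np.linalg.norm(x)
spot = abs(np.linalg.norm(A @ x) ** 2 - np.linalg.norm(B @ x) ** 2)

print("worst-case error:", err)
print("achieved epsilon:", eps_achieved)
print("error along one random direction:", spot)
```

    For this particular stand-in the worst-case error is the (r+1)-th squared singular value of A, which gives a quick sanity check on the helper.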