Relative Errors for Deterministic Low-Rank Matrix Approximations
We consider processing an n x d matrix A in a stream with row-wise updates
according to a recent algorithm called Frequent Directions (Liberty, KDD 2013).
This algorithm maintains an l x d matrix Q deterministically, processing each
row in O(d l^2) time; the processing time can be decreased to O(d l) with a
slight modification in the algorithm and a constant increase in space. We show
that if one sets l = k + k/eps and returns Q_k, the k x d matrix that is the
best rank-k approximation to Q, then the following properties hold: ||A -
A_k||_F^2 <= ||A||_F^2 - ||Q_k||_F^2 <= (1+eps) ||A - A_k||_F^2, and, writing
pi_{Q_k}(A) for the projection of A onto the rowspace of Q_k, ||A -
pi_{Q_k}(A)||_F^2 <= (1+eps) ||A - A_k||_F^2.
We also show that Frequent Directions cannot be adapted in any obvious way to
a sparse version that retains l of the original rows of the matrix, as opposed
to a linear combination or sketch of the rows.

Comment: 16 pages, 0 figures
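The maintained sketch can be illustrated with a minimal NumPy implementation of the Frequent Directions idea (the shrink-by-smallest-singular-value variant; variable names are ours, and this is a didactic sketch rather than the paper's exact code):

```python
import numpy as np

def frequent_directions(A, ell):
    """Minimal Frequent Directions sketch: maintain an ell x d matrix B
    whose Gram matrix B^T B approximates A^T A as rows of A stream in."""
    n, d = A.shape
    B = np.zeros((ell, d))
    for row in A:
        zero_rows = np.flatnonzero(~B.any(axis=1))
        if zero_rows.size == 0:
            # Sketch is full: shrink all squared singular values by the
            # smallest one, which zeroes out the last row and frees a slot.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s_shrunk = np.sqrt(np.maximum(s**2 - s[-1]**2, 0.0))
            B = s_shrunk[:, None] * Vt
            zero_rows = np.flatnonzero(~B.any(axis=1))
        B[zero_rows[0]] = row
    return B
```

This variant satisfies ||A^T A - B^T B||_2 <= ||A||_F^2 / l, since the total shrinkage across all updates is at most ||A||_F^2 / l.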
Improved Practical Matrix Sketching with Guarantees
Matrices have become essential data representations for many large-scale
problems in data analytics, and hence matrix sketching is a critical task.
Although much research has focused on improving the error/size tradeoff under
various sketching paradigms, the many forms of error bounds make these
approaches hard to compare in theory and in practice. This paper attempts to
categorize and compare most known methods under row-wise streaming updates with
provable guarantees, and then to tweak some of these methods to gain practical
improvements while retaining guarantees.
For instance, we observe that a simple heuristic, iSVD, with no guarantees,
tends to outperform all known approaches in terms of the size/error trade-off.
We modify FrequentDirections, the best-performing method with guarantees under
the size/error trade-off, to match the performance of iSVD while retaining its
guarantees. We also demonstrate some adversarial datasets where iSVD performs
quite poorly. When comparing methods under the time/error trade-off, techniques
based on hashing or sampling tend to perform better. In this setting we modify
the most studied sampling regime to retain its error guarantee while obtaining
dramatic improvements in the time/error trade-off.
Finally, we provide easy replication of our studies on APT, a new testbed
which makes available not only code and datasets, but also a computing platform
with fixed environmental settings.

Comment: 27 pages
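For concreteness, a standard sampling regime of the kind compared in this line of work is squared-norm (length-squared) row sampling; the sketch below is our own illustration of that baseline, not necessarily the exact variant modified in the paper:

```python
import numpy as np

def length_squared_sample(A, ell, seed=0):
    """Sample ell rows i.i.d. with probability proportional to their
    squared norms and rescale, so that E[B^T B] = A^T A."""
    rng = np.random.default_rng(seed)
    norms2 = np.einsum('ij,ij->i', A, A)   # squared row norms
    p = norms2 / norms2.sum()
    idx = rng.choice(A.shape[0], size=ell, replace=True, p=p)
    # Rescaling by 1/sqrt(ell * p_i) makes the Gram estimate unbiased.
    return A[idx] / np.sqrt(ell * p[idx])[:, None]
```

Each sampled row costs only O(d) to process, which is why sampling-based methods tend to win on the time/error trade-off.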
Optimal Principal Component Analysis in Distributed and Streaming Models
We study the Principal Component Analysis (PCA) problem in the distributed
and streaming models of computation. Given an n x d matrix A, a rank parameter
k < min(n,d), and an accuracy parameter 0 < eps < 1, we want to output an
orthonormal n x k matrix U for which ||A - U U^T A||_F^2 <= (1+eps) ||A -
A_k||_F^2, where A_k is the best rank-k approximation to A.
This paper provides improved algorithms for distributed PCA and streaming
PCA.

Comment: STOC 2016 full version
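The (1+eps) objective above can be evaluated numerically for any candidate subspace; the helper below (our own illustration, assuming the plain-text objective stated in the abstract) compares a candidate's projection cost against the optimal rank-k cost:

```python
import numpy as np

def pca_cost_ratio(A, U, k):
    """Ratio ||A - U U^T A||_F^2 / ||A - A_k||_F^2 for an orthonormal
    n x k candidate U; a ratio of at most 1 + eps certifies the guarantee."""
    s = np.linalg.svd(A, compute_uv=False)
    opt = float((s[k:]**2).sum())                     # ||A - A_k||_F^2
    cost = np.linalg.norm(A - U @ (U.T @ A), 'fro')**2
    return cost / opt
```

The top-k left singular vectors of A achieve ratio exactly 1, and any orthonormal U has ratio at least 1.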
Continuous Matrix Approximation on Distributed Data
Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of “tracking approximations to a matrix A” in the distributed streaming model. In this model, there are m distributed sites, each observing a distinct stream of data (where each element is a row of a distributed matrix) and each having a communication channel with a coordinator, and the goal is to track an ε-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: |‖Ax‖^2 − ‖Bx‖^2| ≤ ε‖A‖_F^2. Our algorithms work in streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/ε) log(βN)) communication, where N is the size of the stream (at the time of the query) and β is an upper bound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with real large datasets demonstrate the efficiency of these protocols.
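Because the guarantee must hold for every unit vector x, it is equivalent to a bound on the spectral norm of the symmetric matrix AᵀA − BᵀB; a small NumPy check (our own illustrative helper, not the paper's protocol) might look like:

```python
import numpy as np

def covariance_error(A, B):
    """Max over unit vectors x of | ||Ax||^2 - ||Bx||^2 |, i.e. the
    spectral norm (largest absolute eigenvalue) of A^T A - B^T B."""
    return float(np.abs(np.linalg.eigvalsh(A.T @ A - B.T @ B)).max())

def satisfies_guarantee(A, B, eps):
    # The abstract's guarantee: covariance error at most eps * ||A||_F^2.
    return covariance_error(A, B) <= eps * np.linalg.norm(A, 'fro')**2
```

A coordinator could run such a check against the concatenation of all site streams to validate a tracked sketch B offline.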