We consider algorithmic problems in the setting in which the input data has
been partitioned arbitrarily on many servers. The goal is to compute a function
of all the data, and the bottleneck is the communication used by the algorithm.
We present algorithms for two illustrative problems on massive data sets: (1)
computing a low-rank approximation of a matrix $A = A^1 + A^2 + \cdots + A^s$,
with matrix $A^t$ stored on server $t$, and (2) computing a function of a vector
$a_1 + a_2 + \cdots + a_s$, where server $t$ has the vector $a_t$; this
includes the well-studied special case of computing frequency moments and
separable functions, as well as higher-order correlations such as the number of
subgraphs of a specified type occurring in a graph. For both problems we give
algorithms with nearly optimal communication, and in particular the only
dependence on n, the size of the data, is in the number of bits needed to
represent indices and words ($O(\log n)$).

Comment: rewritten with focus on two main results (distributed PCA,
higher-order moments and correlations) in the arbitrary partition mode