
Principal Component Analysis and Higher Correlations for Distributed Data

Abstract

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix $A = A^1 + A^2 + \ldots + A^s$, with matrix $A^t$ stored on server $t$, and (2) computing a function of a vector $a_1 + a_2 + \ldots + a_s$, where server $t$ has the vector $a_t$; this includes the well-studied special case of computing frequency moments and separable functions, as well as higher-order correlations such as the number of subgraphs of a specified type occurring in a graph. For both problems we give algorithms with nearly optimal communication; in particular, the only dependence on $n$, the size of the data, is in the number of bits needed to represent indices and words ($O(\log n)$).

Comment: rewritten with focus on the two main results (distributed PCA, higher-order moments and correlations) in the arbitrary partition model.
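The arbitrary-partition model and the linearity that makes low communication possible can be illustrated with a small numerical example. The code below is a minimal, hypothetical NumPy sketch, not the paper's actual protocol: it assumes a single coordinator, a plain Gaussian projection shared by all servers, and toy dimensions (n, d, s, k, m) chosen for readability.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, s, k, m = 1000, 50, 4, 5, 25

    # Build a matrix that is close to rank k, then split it "arbitrarily"
    # across s servers: each local piece looks nothing like A, but the
    # entrywise sum A = A^1 + ... + A^s recovers the full data matrix.
    A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
    A += 0.01 * rng.standard_normal((n, d))
    local = [rng.standard_normal((n, d)) for _ in range(s - 1)]
    local.append(A - sum(local))

    # Each server applies the same random sketch S to its local piece and
    # sends only the small m x d result (a plain Gaussian projection here;
    # the paper uses more communication-efficient machinery).
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    messages = [S @ At for At in local]

    # Coordinator: sketching is linear, so summing the messages gives S @ A.
    SA = sum(messages)

    # Use the top-k right singular directions of S @ A as the output subspace.
    _, _, Vt = np.linalg.svd(SA, full_matrices=False)
    Vk = Vt[:k].T

    # Toy check only: form A explicitly to measure the error (a real
    # protocol would never materialize A at the coordinator).
    err = np.linalg.norm(A - (A @ Vk) @ Vk.T, "fro")
    U, sig, Vt_full = np.linalg.svd(A, full_matrices=False)
    best = np.linalg.norm(A - (U[:, :k] * sig[:k]) @ Vt_full[:k], "fro")
    print(f"sketch-based rank-{k} error: {err:.2f}   best possible: {best:.2f}")

The point of the toy example is that each server's message is only m x d numbers, independent of n except for the bits needed to represent entries and indices, mirroring the communication bounds stated in the abstract.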
