Performance of high-order SVD approximation: reading the data twice is enough

Abstract

This talk considers the problem of computing a low-rank tensor approximation of large dense data. We focus on the tensor train SVD (TT-SVD), but the approach can be transferred to other low-rank tensor formats such as general tree tensor networks. In the TT-SVD algorithm, the dominant building block consists of singular value decompositions of tall-skinny matrices. Therefore, the computational performance is bound by data transfers on current hardware as long as the desired tensor ranks are sufficiently small. Based on a simple roofline performance model, we show that under reasonable assumptions the minimal runtime is of the order of reading the data twice. We present an almost optimal, distributed parallel implementation based on a specialized rank-preserving TSQR step. Moreover, we discuss important algorithmic details and compare our results with common implementations, which are often about 50x slower than optimal.

References:

Oseledets: "Tensor-Train Decomposition", SISC 2011
Grasedyck and Hackbusch: "An Introduction to Hierarchical (H-) Rank and TT-Rank of Tensors with Examples", CMAM 2011
Demmel et al.: "Communication Avoiding Rank Revealing QR Factorization with Column Pivoting", SIMAX 2015
Williams et al.: "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009
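
To illustrate the algorithmic structure described in the abstract, the following is a minimal NumPy sketch of a TT-SVD (Oseledets 2011), not the talk's optimized implementation. It assumes the sweep runs from the last mode toward the first, so that every unfolding is tall-skinny whenever the ranks stay small; the function name tt_svd and the single max_rank truncation parameter are illustrative choices.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Minimal TT-SVD sketch sweeping from the last mode, so each SVD
    acts on a tall-skinny unfolding (many rows, few columns for small
    ranks). Returns TT cores of shape (r_left, n_d, r_right)."""
    dims = tensor.shape
    cores = []
    work = tensor.reshape(-1, dims[-1])
    r_right = 1
    for d in range(len(dims) - 1, 0, -1):
        # Unfold so the columns combine mode d with the current rank.
        work = work.reshape(-1, dims[d] * r_right)
        U, s, Vt = np.linalg.svd(work, full_matrices=False)
        r = min(max_rank, len(s))               # rank truncation
        cores.append(Vt[:r].reshape(r, dims[d], r_right))
        work = U[:, :r] * s[:r]                 # project the data: X @ V_r
        r_right = r
    cores.append(work.reshape(1, dims[0], r_right))
    return cores[::-1]

# Example: compress a random 4-dimensional tensor to TT rank <= 16.
cores = tt_svd(np.random.rand(30, 30, 30, 30), max_rank=16)
```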
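
The "reading the data twice" claim can be made concrete with a back-of-the-envelope roofline estimate. The sketch below assumes the memory-bound regime the abstract describes: the large tensor is streamed from memory once for the first tall-skinny decomposition and once more to project it onto the leading right singular vectors, while all later sweeps touch only the much smaller compressed data. The bandwidth figure is a placeholder for a typical current node, not a measurement from the talk.

```python
def tt_svd_runtime_lower_bound(num_elements, bytes_per_element=8,
                               mem_bandwidth=200e9):
    """Roofline-style lower bound on the TT-SVD runtime, assuming the
    computation is bound by data transfers (small tensor ranks):
    the raw data must cross the memory interface about twice."""
    bytes_moved = 2 * num_elements * bytes_per_element
    return bytes_moved / mem_bandwidth

# Example: a dense 2**30-element tensor (8 GiB in double precision) on a
# node with ~200 GB/s memory bandwidth needs at least ~0.086 s.
print(f"{tt_svd_runtime_lower_bound(2**30):.3f} s")
```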
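
Finally, here is a sketch of the TSQR building block on which the distributed implementation relies (see Demmel et al., SIMAX 2015). This is only the plain two-level reduction, without the rank-preserving specialization mentioned in the abstract; the block partitioning stands in for the distribution of the unfolding's rows across processes.

```python
import numpy as np

def tsqr_r(blocks):
    """Two-level TSQR sketch: factorize each local block of rows, then
    factorize the stacked small R factors. Only the small triangular
    factors are combined, which keeps the tall-skinny step cheap."""
    Rs = [np.linalg.qr(block, mode='r') for block in blocks]
    return np.linalg.qr(np.vstack(Rs), mode='r')

# The truncated SVD of a tall-skinny unfolding X then reduces to the
# small SVD of R: with X = Q R and R = U S V^T, the right singular
# vectors of X and of R coincide (up to signs), so the rank truncation
# only needs the small factor.
```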
