Performance of high-order SVD approximation: reading the data twice is enough

Abstract

This talk considers the problem of computing a low-rank tensor approximation of large dense data. We focus on the tensor train SVD (TT-SVD), but the approach can be transferred to other low-rank tensor formats such as general tree tensor networks. In the TT-SVD algorithm, the dominant building block consists of singular value decompositions of tall-skinny matrices. Therefore, the computational performance is bound by data transfers on current hardware as long as the desired tensor ranks are sufficiently small. Based on a simple roofline performance model, we show that under reasonable assumptions the minimal runtime is of the order of reading the data twice. We present an almost optimal, distributed parallel implementation based on a specialized rank-preserving TSQR step. Moreover, we discuss important algorithmic details and compare our results with common implementations, which are often about 50x slower than optimal.

References:

Oseledets: "Tensor-Train Decomposition", SISC 2011
Grasedyck and Hackbusch: "An Introduction to Hierarchical (H-) Rank and TT-Rank of Tensors with Examples", CMAM 2011
Demmel et al.: "Communication Avoiding Rank Revealing QR Factorization with Column Pivoting", SIMAX 2015
Williams et al.: "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009
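
To illustrate the algorithmic structure described in the abstract, the following is a minimal NumPy sketch of a TT-SVD (Oseledets 2011), not the talk's optimized implementation. It assumes the sweep runs from the last mode toward the first, so that every unfolding is tall-skinny whenever the ranks stay small; the function name tt_svd and the single max_rank truncation parameter are illustrative choices.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Minimal TT-SVD sketch sweeping from the last mode, so each SVD
    acts on a tall-skinny unfolding (many rows, few columns for small
    ranks). Returns TT cores of shape (r_left, n_d, r_right)."""
    dims = tensor.shape
    cores = []
    work = tensor.reshape(-1, dims[-1])
    r_right = 1
    for d in range(len(dims) - 1, 0, -1):
        # Unfold so the columns combine mode d with the current rank.
        work = work.reshape(-1, dims[d] * r_right)
        U, s, Vt = np.linalg.svd(work, full_matrices=False)
        r = min(max_rank, len(s))               # rank truncation
        cores.append(Vt[:r].reshape(r, dims[d], r_right))
        work = U[:, :r] * s[:r]                 # project the data: X @ V_r
        r_right = r
    cores.append(work.reshape(1, dims[0], r_right))
    return cores[::-1]

# Example: compress a random 4-dimensional tensor to TT rank <= 16.
cores = tt_svd(np.random.rand(30, 30, 30, 30), max_rank=16)
```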
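
The "reading the data twice" claim can be made concrete with a back-of-the-envelope roofline estimate. The sketch below assumes the memory-bound regime the abstract describes: the large tensor is streamed from memory once for the first tall-skinny decomposition and once more to project it onto the leading right singular vectors, while all later sweeps touch only the much smaller compressed data. The bandwidth figure is a placeholder for a typical current node, not a measurement from the talk.

```python
def tt_svd_runtime_lower_bound(num_elements, bytes_per_element=8,
                               mem_bandwidth=200e9):
    """Roofline-style lower bound on the TT-SVD runtime, assuming the
    computation is bound by data transfers (small tensor ranks):
    the raw data must cross the memory interface about twice."""
    bytes_moved = 2 * num_elements * bytes_per_element
    return bytes_moved / mem_bandwidth

# Example: a dense 2**30-element tensor (8 GiB in double precision) on a
# node with ~200 GB/s memory bandwidth needs at least ~0.086 s.
print(f"{tt_svd_runtime_lower_bound(2**30):.3f} s")
```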
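
Finally, here is a sketch of the TSQR building block on which the distributed implementation relies (see Demmel et al., SIMAX 2015). This is only the plain two-level reduction, without the rank-preserving specialization mentioned in the abstract; the block partitioning stands in for the distribution of the unfolding's rows across processes.

```python
import numpy as np

def tsqr_r(blocks):
    """Two-level TSQR sketch: factorize each local block of rows, then
    factorize the stacked small R factors. Only the small triangular
    factors are combined, which keeps the tall-skinny step cheap."""
    Rs = [np.linalg.qr(block, mode='r') for block in blocks]
    return np.linalg.qr(np.vstack(Rs), mode='r')

# The truncated SVD of a tall-skinny unfolding X then reduces to the
# small SVD of R: with X = Q R and R = U S V^T, the right singular
# vectors of X and of R coincide (up to signs), so the rank truncation
# only needs the small factor.
```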
