13 research outputs found
suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework
The scope of this paper is to design and implement a scalable QR factorization solver that delivers the fastest performance for both tall-and-skinny and square matrices on modern supercomputers. The new solver, named scalable universal communication-avoiding QR factorization (suCAQR), introduces a simplified and tuning-less way to realize the communication-avoiding QR factorization algorithm for matrices of any shape. The software design includes a mixed usage of physical and logical data layouts, a simplified method of dynamic-root binary-tree reduction, and a dynamic dataflow implementation. Compared with existing communication-avoiding QR factorization implementations, suCAQR is simpler, more general, and more efficient. By balancing the degree of parallelism and the proportion of faster computational kernels, it achieves scalable performance on clusters of multicore nodes. The software essentially combines the strengths of the synchronization-reducing and communication-avoiding approaches to achieve high performance. Based on experimental results using 1,024 CPU cores, suCAQR is faster than DPLASMA by up to 30%, and faster than ScaLAPACK by up to 30 times.
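The binary-tree reduction mentioned in the abstract above can be illustrated with a minimal sequential sketch of the textbook tall-skinny QR idea: factor each row block locally, then merge the R factors pairwise up a tree. This is not suCAQR's dynamic-root scheme or its TBLAS dataflow implementation; the function names `qr_r` and `tsqr_r` are hypothetical, the local factorization uses modified Gram-Schmidt for brevity, and every block is assumed to have full column rank.

```python
import math


def qr_r(A):
    """Return the upper-triangular R factor of A (a list of rows).

    Uses modified Gram-Schmidt; assumes A has full column rank.
    """
    m, n = len(A), len(A[0])
    # Work on a copy of A stored column by column.
    V = [[A[i][j] for i in range(m)] for j in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(v * v for v in V[j]))
        q = [v / R[j][j] for v in V[j]]
        # Remove the component along q from every remaining column.
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(q, V[k]))
            V[k] = [b - R[j][k] * a for a, b in zip(q, V[k])]
    return R


def tsqr_r(blocks):
    """Return the R factor of the row blocks stacked on top of each other.

    Each block is factored on its own (the step that would run in parallel),
    then R factors are merged pairwise in a fixed binary tree.
    """
    Rs = [qr_r(B) for B in blocks]
    while len(Rs) > 1:
        merged = []
        for i in range(0, len(Rs) - 1, 2):
            # Stack two n-by-n R factors into a 2n-by-n matrix and refactor.
            merged.append(qr_r(Rs[i] + Rs[i + 1]))
        if len(Rs) % 2:
            merged.append(Rs[-1])  # an unpaired factor moves up unchanged
        Rs = merged
    return Rs[0]
```

Because modified Gram-Schmidt yields an R factor with a positive diagonal, the result of `tsqr_r` on the row blocks of a matrix should agree with `qr_r` on the whole matrix up to rounding, which makes the sketch easy to check.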
A 3D Parallel Algorithm for QR Decomposition
Interprocessor communication often dominates the runtime of large matrix
computations. We present a parallel algorithm for computing QR decompositions
whose bandwidth cost (communication volume) can be decreased at the cost of
increasing its latency cost (number of messages). By varying a parameter to
navigate the bandwidth/latency tradeoff, we can tune this algorithm for
machines with different communication costs.
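To make the bandwidth/latency tradeoff concrete, the sketch below uses the standard alpha-beta communication model, in which time is (latency per message) x (number of messages) + (time per word) x (number of words). The abstract does not say which cost model or parameter the authors use, so the model choice, the function names, and all numbers here are illustrative assumptions, not results from the paper.

```python
def comm_time(messages, words, alpha, beta):
    """Alpha-beta model: alpha seconds per message plus beta seconds per word."""
    return alpha * messages + beta * words


def best_setting(settings, alpha, beta):
    """Return the name of the setting with the lowest modeled time.

    `settings` maps a name to a (messages, words) pair.
    """
    return min(settings, key=lambda name: comm_time(*settings[name], alpha, beta))


# Two hypothetical parameter settings of one algorithm (made-up counts):
# one sends few large messages, the other sends many smaller ones.
SETTINGS = {
    "fewer_messages": (100, 1_000_000),
    "fewer_words": (1_000, 200_000),
}
```

Under these made-up counts, a machine with expensive messages (for example `alpha=1e-5`, `beta=1e-9`) favors `"fewer_messages"`, while a machine with expensive bandwidth (for example `alpha=1e-7`, `beta=1e-8`) favors `"fewer_words"`.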
QR factorization over tunable processor grids
The increasing complexity of modern computer architectures has greatly influenced algorithm design. Algorithm performance on these architectures is now determined by the movement of data. Therefore, modern algorithms should prioritize minimizing communication. In this work, we present a new parallel QR factorization algorithm solved over a tunable processor grid in a distributed memory environment. The processor grid can be tuned between one and three dimensions, resulting in tradeoffs in the asymptotic costs of synchronization, horizontal bandwidth, flop count, and memory footprint. This parallel algorithm is the first to efficiently extend the Cholesky-QR2 algorithm to matrices with an arbitrary number of rows and columns. Along its critical path of execution on P processors, our tunable algorithm improves upon the horizontal bandwidth cost of the existing Cholesky-QR2 algorithm by up to a factor of c^2 when solved over a c x d x c processor grid subject to P = c^2 d and c ∈ [1, P^(1/3)]. The costs attained by our algorithm are asymptotically equivalent to those of state-of-the-art QR factorization algorithms that have yet to be implemented. We argue that our algorithm is more practical and flexible while still attaining minimal communication.
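As background for the abstract above, here is a minimal sequential sketch of the basic Cholesky-QR2 idea (two passes of Cholesky-QR, with the R factors multiplied together), followed by a helper that lists the c x d x c grids satisfying P = c^2 d with c no larger than P^(1/3). It is not the paper's tunable parallel algorithm: the function names are hypothetical, everything runs on one process, and the input is assumed to be a tall, full-column-rank, reasonably well-conditioned matrix.

```python
import math


def gram(A):
    """Return A^T A for a matrix stored as a list of rows."""
    n = len(A[0])
    return [[sum(r[i] * r[j] for r in A) for j in range(n)] for i in range(n)]


def cholesky_upper(G):
    """Return upper-triangular R with R^T R = G (G symmetric positive definite)."""
    n = len(G)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = math.sqrt(G[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            s = sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = (G[i][j] - s) / R[i][i]
    return R


def solve_right_upper(A, R):
    """Return A R^{-1}, solving q R = row for each row of A."""
    n = len(R)
    Q = []
    for row in A:
        q = [0.0] * n
        for j in range(n):
            q[j] = (row[j] - sum(q[k] * R[k][j] for k in range(j))) / R[j][j]
        Q.append(q)
    return Q


def matmul(X, Y):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x[k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for x in X]


def cholesky_qr(A):
    """One Cholesky-QR pass: R from chol(A^T A), then Q = A R^{-1}."""
    R = cholesky_upper(gram(A))
    return solve_right_upper(A, R), R


def cholesky_qr2(A):
    """Two Cholesky-QR passes; the second improves the orthogonality of Q."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, matmul(R2, R1)


def valid_grids(P):
    """List (c, d, c) grids with P = c^2 * d and c^3 <= P."""
    grids, c = [], 1
    while c ** 3 <= P:
        if P % (c * c) == 0:
            grids.append((c, P // (c * c), c))
        c += 1
    return grids
```

For example, `valid_grids(64)` returns `[(1, 64, 1), (2, 16, 2), (4, 4, 4)]`, spanning the one-dimensional (c = 1) through fully three-dimensional (c = P^(1/3)) ends of the tuning range described in the abstract.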
Reconstructing Householder Vectors from Tall-Skinny QR