7,963 research outputs found

A 3D Parallel Algorithm for QR Decomposition

Author: Ballard Grey
Demmel James
Grigori Laura
Jacquelin Mathias
Knight Nicholas
Publication date: 14/05/2018

Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.
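The bandwidth/latency tradeoff in this abstract is often reasoned about with an alpha-beta communication model, where a run that sends `messages` messages totalling `words` words costs roughly `alpha * messages + beta * words`. The sketch below only illustrates that assumed model. The two candidate tunings and the machine parameters are hypothetical numbers, not values or cost formulas from the paper.

```python
def comm_time(messages, words, alpha, beta):
    """Alpha-beta estimate: per-message latency plus per-word transfer time."""
    return alpha * messages + beta * words


# Hypothetical (messages, words) pairs for two settings of a tuning parameter.
FEW_MESSAGES = (100, 1_000_000)   # low latency cost, high bandwidth cost
FEW_WORDS = (10_000, 100_000)     # more messages, less communication volume


def best_setting(alpha, beta):
    """Return the name of the cheaper hypothetical setting on a given machine."""
    costs = {
        "few_messages": comm_time(*FEW_MESSAGES, alpha, beta),
        "few_words": comm_time(*FEW_WORDS, alpha, beta),
    }
    return min(costs, key=costs.get)
```

On a machine where latency is expensive relative to bandwidth, the setting with fewer messages wins under this model; where bandwidth is the bottleneck, the setting with fewer words wins.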

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Achieving Low-Complexity Maximum-Likelihood Detection for the 3D MIMO Code

Author: Crussière Matthieu
Hélard Jean-François
Hélard Maryline
Liu Ming
Publication date: 09/01/2014

The 3D MIMO code is a robust and efficient space-time block code (STBC) for distributed MIMO broadcasting, but it suffers from high maximum-likelihood (ML) decoding complexity. In this paper, we first analyze some properties of the 3D MIMO code to show that it is fast-decodable. It is proved that ML decoding performance can be achieved with a complexity of O(M^{4.5}) instead of O(M^8) in quasi-static channels with M-ary square QAM modulations. Consequently, we propose a simplified ML decoder that exploits the unique properties of the 3D MIMO code. Simulation results show that the proposed simplified ML decoder achieves a much lower processing latency than the classical sphere decoder with Schnorr-Euchner enumeration.
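To get a feel for the gap between O(M^8) and O(M^{4.5}), the short sketch below compares the two growth rates for a few square QAM orders. It ignores constant factors and lower-order terms, so the ratio M^{3.5} is only an order-of-magnitude indication, not a measured speedup; the function name is illustrative.

```python
def reduction_factor(M):
    """Ratio of the leading terms M^8 / M^4.5 for an M-ary constellation."""
    return M ** 8 / M ** 4.5


if __name__ == "__main__":
    for M in (4, 16, 64):  # 4-QAM, 16-QAM, 64-QAM
        print(f"M = {M:2d}: leading-term ratio = {reduction_factor(M):.0f}")
```

The ratio grows as M^{3.5}, so the saving matters most for the larger constellations.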

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

HAL-Rennes 1

An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

Author: Ghysels Pieter
Li Xiaoye S.
Napov Artem
Rouet Francois-Henry
Williams Samuel
Publication date: 25/02/2015

We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared-memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK (STRUctured Matrices PACKage), which also has a distributed-memory component for dense rank-structured matrices.

arXiv.org e-Print Archive

eScholarship - University of California

DI-fusion

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Author: Canon Shane
Chhugani Jatin
Demmel James
Devarakonda Aditya
Gerhardt Lisa
Gittens Alex
Harrell Jim
Kottalam Jey
Krishnamurthy Venkat
Liu Jialin
Mahoney Michael W.
Maschhoff Kristyn
Prabhat
Racah Evan
Ringenburg Michael
Sharma Pramod
Yang Jiyan
Publication date: 12/05/2016

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.

arXiv.org e-Print Archive

eScholarship - University of California

Reduced-complexity maximum-likelihood decoding for 3D MIMO code

Author: Crussière Matthieu
Hélard Jean-François
Hélard Maryline
Liu Ming
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/04/2013

The 3D MIMO code is a robust and efficient space-time coding scheme for distributed MIMO broadcasting. However, it suffers from high computational complexity when optimal maximum-likelihood (ML) decoding is used. In this paper we first investigate the unique properties of the 3D MIMO code and consequently propose a simplified decoding algorithm that does not sacrifice ML optimality. Analysis shows that the decoding complexity is reduced from O(M^8) to O(M^{4.5}) in quasi-static channels when an M-ary square QAM constellation is used. Moreover, we propose an efficient implementation of the simplified ML decoder which achieves a much lower decoding delay than the classical sphere decoder with Schnorr-Euchner enumeration. Comment: IEEE Wireless Communications and Networking Conference (WCNC 2013), Shanghai, China (2013)

arXiv.org e-Print Archive

Crossref

HAL-Rennes 1

Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices

Author: Baruch E.
Cannon L. E.
Choi J
Choi J.
Choi J.
Solomonik E.
Szabo A.
van de Geijn R. A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/10/2015

A task-based formulation of the Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of the square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework). Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text overlap with arXiv:1504.0504
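For readers unfamiliar with SUMMA, the sketch below simulates the classic synchronous version serially on a small process grid: in iteration k, block A(i, k) is shared along process row i, block B(k, j) along process column j, and every process accumulates a local product into its block of C. This is a plain dense illustration only. It is not the task-based formulation described in the abstract, it does not exploit rank structure, and the function name and `grid` parameter are hypothetical.

```python
def summa_multiply(A, B, grid=2):
    """Serially simulate SUMMA for square n x n lists of lists (n % grid == 0)."""
    n = len(A)
    b = n // grid  # side length of the block owned by each simulated process

    def block(M, i, j):
        """Copy block (i, j) of matrix M."""
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    # C_local[(i, j)] is the block of C owned by simulated process (i, j).
    C_local = {(i, j): [[0.0] * b for _ in range(b)]
               for i in range(grid) for j in range(grid)}

    for k in range(grid):  # one SUMMA iteration per block panel
        # Stand-ins for the row-wise and column-wise broadcasts.
        a_panel = [block(A, i, k) for i in range(grid)]
        b_panel = [block(B, k, j) for j in range(grid)]
        for i in range(grid):
            for j in range(grid):
                Cij, Aik, Bkj = C_local[(i, j)], a_panel[i], b_panel[j]
                # Local update: C(i, j) += A(i, k) * B(k, j)
                for r in range(b):
                    for c in range(b):
                        Cij[r][c] += sum(Aik[r][t] * Bkj[t][c] for t in range(b))

    # Gather the distributed blocks back into a single matrix.
    C = [[0.0] * n for _ in range(n)]
    for (i, j), blk in C_local.items():
        for r in range(b):
            C[i * b + r][j * b:(j + 1) * b] = blk[r]
    return C
```

In this synchronous form every process waits for iteration k to finish before starting iteration k + 1, which is the kind of global synchronization the abstract says its formulation avoids.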

arXiv.org e-Print Archive

Crossref

A parallel butterfly algorithm

Author: Demanet Laurent
Maxwell Nicholas
Poulson Jack
Ying Lexing
Publication date: 25/11/2013

The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform \int K(x,y) g(y) dy at large numbers of target points when the kernel, K(x,y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(N^d) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r^2 N^d log N). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of \alpha and per-process inverse bandwidth of \beta, executes in at most O(r^2 N^d/p log N + (\beta r N^d/p + \alpha) log p) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x,y)=exp(i \Phi(x,y)), where \Phi(x,y) is a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms and an analogue of a 3D generalized Radon transform were observed to strong-scale from 1 node/16 cores up to 1024 nodes/16,384 cores with greater than 90% and 82% efficiency, respectively. Comment: To appear in SIAM Journal on Scientific Computing
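Reading the parallel bound as O(r^2 N^d/p log N + (\beta r N^d/p + \alpha) log p), the sketch below evaluates its leading terms so the compute and communication contributions can be compared as p grows. It is a rough model under stated assumptions: hidden constants are set to 1, logarithms are taken base 2, and `gamma` (time per arithmetic operation) is a hypothetical parameter that does not appear in the abstract.

```python
import math


def butterfly_time_bound(r, N, d, p, alpha, beta, gamma=1.0):
    """Leading terms of the stated bound, with all hidden constants set to 1."""
    points_per_proc = N ** d / p
    # Arithmetic term: r^2 * (N^d / p) * log N
    compute = gamma * r ** 2 * points_per_proc * math.log2(N)
    # Communication term: (beta * r * N^d / p + alpha) * log p
    communicate = (beta * r * points_per_proc + alpha) * math.log2(p)
    return compute + communicate
```

With p = 1 the communication term vanishes because log p = 0, and the estimate reduces to the sequential O(r^2 N^d log N) cost.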

arXiv.org e-Print Archive

DSpace@MIT