    Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

    A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found: one is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors, beyond which the inter-processor communication no longer scales. We obtain memory-independent communication cost lower bounds for classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
    Comment: 4 pages, 1 figure
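    As a worked illustration of the form such bounds take (a sketch; the notation, with $n$ the matrix dimension and $P$ the processor count, is this summary's assumption, and the precise statements are in the paper), the memory-independent lower bounds on words communicated per processor look like
    \[ W_{\text{classical}} = \Omega\!\left(\frac{n^2}{P^{2/3}}\right), \qquad W_{\text{Strassen}} = \Omega\!\left(\frac{n^2}{P^{2/\omega_0}}\right), \quad \omega_0 = \log_2 7, \]
    so once $P$ grows large enough that these terms dominate the memory-dependent bounds, communication time can no longer fall linearly in $1/P$.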

    Parallelizing Strassen's method for matrix multiplication on distributed-memory MIMD architectures

    Abstract: We present a parallel method for matrix multiplication on distributed-memory MIMD architectures based on Strassen's method. Our timing tests, performed on a 56-node Intel Paragon, demonstrate that the potential of Strassen's method, with a complexity of $4.7M^{2.807}$, is realized at the system level rather than at the node level, on which several earlier works have focused. The parallel efficiency is nearly perfect when the number of processors is a power of 7. At the system level, the parallelized Strassen's method is consistently faster than traditional matrix multiplication methods of complexity $2M^3$ coupled with the BMR method and the Ring method. The speed gain depends on the matrix order M: about 20% for M ≈ 1000 and more than 100% for M ≈ 5000.
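    The paper's Paragon code is not reproduced in this listing; below is a minimal sequential sketch (plain NumPy, an assumption of this summary, not the authors' implementation) of the Strassen recursion, showing the seven independent half-size products that a distributed implementation can farm out to 7 (or, recursively, 7^k) processor groups:

        import numpy as np

        def strassen(A, B, cutoff=64):
            """One Strassen recursion level: 7 half-size products instead of 8.

            Assumes square matrices whose order is a power of two; falls back
            to the classical product below the cutoff.
            """
            n = A.shape[0]
            if n <= cutoff:
                return A @ B
            h = n // 2
            A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
            B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
            # The seven products are mutually independent, which is what lets
            # a parallel implementation assign them to separate processor groups.
            M1 = strassen(A11 + A22, B11 + B22, cutoff)
            M2 = strassen(A21 + A22, B11, cutoff)
            M3 = strassen(A11, B12 - B22, cutoff)
            M4 = strassen(A22, B21 - B11, cutoff)
            M5 = strassen(A11 + A12, B22, cutoff)
            M6 = strassen(A21 - A11, B11 + B12, cutoff)
            M7 = strassen(A12 - A22, B21 + B22, cutoff)
            C = np.empty_like(A)
            C[:h, :h] = M1 + M4 - M5 + M7
            C[:h, h:] = M3 + M5
            C[h:, :h] = M2 + M4
            C[h:, h:] = M1 - M2 + M3 + M6
            return C

    Solving the recurrence T(M) = 7 T(M/2) + O(M^2) gives the O(M^{log2 7}) = O(M^{2.807}) arithmetic count quoted in the abstract.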

    Graph Expansion and Communication Costs of Fast Matrix Multiplication

    The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $\Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right)$, where $\omega_0$ is the exponent in the arithmetic count (e.g., $\omega_0 = \lg 7$ for Strassen, and $\omega_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms and hence optimal. These bounds can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).
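    To make the shape of the bound concrete (a direct simplification of the displayed formula, not an additional claim from the paper):
    \[ \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right) = \Omega\!\left(\frac{n^{\omega_0}}{M^{\omega_0/2 - 1}}\right), \]
    so for Strassen ($\omega_0 = \lg 7 \approx 2.81$) the cost is roughly $\Omega(n^{2.81}/M^{0.40})$, while for conventional multiplication ($\omega_0 = 3$) it recovers the familiar $\Omega(n^3/\sqrt{M})$.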

    ATCOM: Automatically tuned collective communication system for SMP clusters.

    Conventional implementations of collective communications are based on point-to-point communications, and their optimizations have focused on the efficiency of those communication algorithms. However, point-to-point communications are not the optimal choice for modern computing clusters of SMPs due to their two-level communication structure. In recent years, a few research efforts have investigated efficient collective communications for SMP clusters. This dissertation is focused on platform-independent algorithms and implementations in this area.
    There are two main approaches to implementing efficient collective communications for clusters of SMPs: using shared memory operations for intra-node communications, and overlapping inter-node/intra-node communications. The former fully utilizes the hardware-based shared memory of an SMP, and the latter takes advantage of the inherent hierarchy of the communications within a cluster of SMPs. Previous studies focused on clusters of SMPs from certain vendors; however, the previously proposed methods are not portable to other systems. Because the performance optimization issue is very complicated and the development process is very time-consuming, self-tuning, platform-independent implementations are highly desirable. As proven in this dissertation, such an implementation can significantly outperform other point-to-point based portable implementations and some platform-specific implementations.
    The dissertation describes in detail the architecture of the platform-independent implementation. There are four system components: shared memory-based collective communications, overlapping mechanisms for inter-node and intra-node communications, a prediction-based tuning module, and a micro-benchmark based tuning module. Each component is carefully designed with the goal of automatic tuning in mind.
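    ATCOM's own implementation is not shown in this listing; the following is a minimal sketch of the first approach (hierarchy-aware collectives), written against mpi4py as an assumed environment. It splits the world communicator into per-node groups, broadcasts between node leaders first, then within each node, where the MPI library can use shared memory:

        import numpy as np
        from mpi4py import MPI

        world = MPI.COMM_WORLD
        # Ranks that share a node (and thus hardware shared memory):
        node = world.Split_type(MPI.COMM_TYPE_SHARED)
        # One leader per node (node-local rank 0) forms the inter-node
        # communicator; all other ranks receive MPI.COMM_NULL here.
        color = 0 if node.rank == 0 else MPI.UNDEFINED
        leaders = world.Split(color, key=world.rank)

        def two_level_bcast(buf):
            """Broadcast a buffer from world rank 0 in two stages."""
            if node.rank == 0:
                leaders.Bcast(buf, root=0)  # inter-node stage, leaders only
            node.Bcast(buf, root=0)         # intra-node stage (shared-memory capable)

        data = np.zeros(4, dtype="d")
        if world.rank == 0:
            data[:] = np.arange(4.0)
        two_level_bcast(data)

    A full system along these lines would additionally pipeline the two stages to overlap inter-node and intra-node traffic (the second approach in the abstract) and choose algorithms and thresholds via the tuning modules.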

    Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster

    Prior implementations of Strassen's matrix multiplication algorithm on GPUs traded additional workspace, in the form of global memory or registers, for time. Although Strassen's algorithm offers a reduction in computational complexity compared to the classical algorithm, the memory overhead associated with the algorithm limits its practical utility. While there have been past attempts at reducing the memory footprint of Strassen's algorithm by compromising parallelism, no prior implementation, to our knowledge, was able to hide the workspace requirement successfully. This thesis presents an implementation of Strassen's matrix multiplication in CUDA, titled Multi-Stage Memory Efficient Strassen (MSMES), that eliminates additional workspace requirements by reusing and recovering input matrices. MSMES organizes the steps involved in Strassen's algorithm into five stages, where multiple steps in the same stage can be executed in parallel. Two additional stages, which allow the recovery of the input matrices, are also discussed in the thesis. Unlike previous works, MSMES has no additional memory requirements irrespective of the level of recursion of Strassen's algorithm. Experiments performed with MSMES (with the recovery stages) on NVIDIA Tesla V100 and NVIDIA GTX 1660 Ti GPUs yielded higher compute performance and lower memory requirements than the NVIDIA library function for double-precision matrix multiplication, cublasDgemm. In the multi-GPU adaptation of matrix multiplication, we explore the performance of a Strassen-based and a tile-based global decomposition scheme, and we evaluate both MSMES and cublasDgemm for the local matrix multiplications under each scheme. The experiments identified the combination of Strassen-Winograd decomposition with MSMES as yielding the highest speedup among all tested combinations.
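    MSMES's staged CUDA kernels are not reproduced here; as a point of reference, the sketch below (plain NumPy, an assumption of this summary) shows one level of the standard Strassen-Winograd decomposition the thesis builds on, with its 7 products and 15 additions grouped so that the operations within each group are mutually independent:

        import numpy as np

        def winograd_one_level(A, B):
            """One level of Strassen-Winograd: 7 products, 15 additions.

            Assumes square matrices of even order. Each group of assignments
            below is internally independent, which is the property a staged
            GPU implementation exploits for parallelism.
            """
            h = A.shape[0] // 2
            A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
            B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
            S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
            T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
            P1 = A11 @ B11; P2 = A12 @ B21; P3 = S4 @ B22; P4 = A22 @ T4
            P5 = S1 @ T1;   P6 = S2 @ T2;   P7 = S3 @ T3
            U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5
            C = np.empty_like(A)
            C[:h, :h] = P1 + P2
            C[:h, h:] = U4 + P3
            C[h:, :h] = U3 - P4
            C[h:, h:] = U3 + P5
            return C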

    Even faster integer multiplication

    We give a new proof of Fürer's bound for the cost of multiplying n-bit integers in the bit complexity model. Unlike Fürer, our method does not require constructing special coefficient rings with "fast" roots of unity. Moreover, we prove the more explicit bound $O(n \log n \cdot K^{\log^* n})$ with $K = 8$. We show that an optimised variant of Fürer's algorithm achieves only $K = 16$, suggesting that the new algorithm is faster than Fürer's by a factor of $2^{\log^* n}$. Assuming standard conjectures about the distribution of Mersenne primes, we give yet another algorithm that achieves $K = 4$.
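    The claimed factor follows directly from the two constants (a restatement of the abstract's arithmetic, with $\log^*$ the iterated logarithm): comparing the bounds term by term,
    \[ \frac{n \log n \cdot 16^{\log^* n}}{n \log n \cdot 8^{\log^* n}} = \left(\frac{16}{8}\right)^{\log^* n} = 2^{\log^* n}. \]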