Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster
Prior implementations of Strassen's matrix multiplication algorithm on GPUs traded additional workspace, in the form of global memory or registers, for time. Although Strassen's algorithm offers a reduction in computational complexity compared with the classical algorithm, its memory overhead limits its practical utility. While there were past attempts at reducing the memory footprint of Strassen's algorithm by compromising parallelism, no prior implementation, to our knowledge, was able to hide the workspace requirement successfully. This thesis presents an implementation of Strassen's matrix multiplication in CUDA, titled Multi-Stage Memory Efficient Strassen (MSMES), that eliminates additional workspace requirements by reusing and recovering input matrices. MSMES organizes the steps involved in Strassen's algorithm into five stages, where multiple steps in the same stage can be executed in parallel. Two additional stages that allow the recovery of the input matrices are also discussed in the thesis. Unlike previous works, MSMES has no additional memory requirements irrespective of the recursion level of Strassen's algorithm. Experiments performed with MSMES (with the recovery stages) on NVIDIA Tesla V100 and NVIDIA GTX 1660ti GPUs yielded higher compute performance and lower memory requirements than the NVIDIA library function for double-precision matrix multiplication, cublasDgemm. In the multi-GPU adaptation of matrix multiplication, we explore the performance of a Strassen-based and a tile-based global decomposition scheme. We also evaluated MSMES and cublasDgemm for the local matrix multiplications under each global decomposition scheme. The experiments identified the combination of Strassen-Winograd decomposition with MSMES as yielding the highest speedup among all tested combinations.
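For readers unfamiliar with the steps MSMES schedules, the classical seven-product Strassen recursion can be sketched as follows. This is a plain NumPy illustration of the textbook algorithm, not the thesis's CUDA code; the `leaf` cutoff and function names are illustrative choices.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's recursion for square matrices of size 2^k.

    Illustrates the seven products that MSMES organizes into GPU
    stages; below the `leaf` cutoff it falls back to classical matmul.
    """
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products (each recursively computed)
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine the products into the four quadrants of C
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Note the intermediate matrices M1 through M7: in a naive GPU port each needs its own workspace, which is exactly the overhead MSMES avoids by reusing and then recovering the input quadrants.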
An Investigation into the Performance Evaluation of Connected Vehicle Applications: From Real-World Experiment to Parallel Simulation Paradigm
A novel system was developed that provides drivers with lane-merge advisories, using vehicle trajectories obtained through Dedicated Short Range Communication (DSRC). It was successfully tested on a freeway using three vehicles and then targeted for further testing via simulation. The failure of contemporary simulators to effectively model large, complex urban transportation networks then motivated further research into distributed and parallel traffic simulation. An architecture for a closed-loop, parallel simulator was devised, using a new algorithm that accounts for boundary nodes, traffic signals, intersections, road lengths, traffic density, and lane counts; it partitions a sample Tennessee road network more efficiently than tools like METIS, which increase interprocess communication (IPC) overhead by partitioning more transportation corridors. The simulator uses logarithmic accumulation to synchronize parallel simulations, further reducing IPC. Analyses suggest this eliminates up to one-third of the IPC overhead incurred by a linear accumulation model.
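The simulator's internals are not given in the abstract, but the general idea behind logarithmic versus linear accumulation can be sketched: instead of a coordinator collecting partial results from each of the p partitions in turn (p - 1 sequential exchanges), partitions combine pairwise, so synchronization completes in ceil(log2 p) rounds. The function below is a hypothetical illustration of that tree-combining pattern, not the paper's algorithm.

```python
def tree_reduce(values):
    """Pairwise-combine partial results (e.g. per-partition state sums)
    in ceil(log2 n) rounds, the pattern behind logarithmic accumulation."""
    rounds = 0
    while len(values) > 1:
        # Each round, adjacent pairs merge; an odd leftover carries forward.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds
```

With 8 partitions this takes 3 rounds rather than the 7 sequential exchanges a linear gather would need, which is the kind of IPC reduction the abstract describes.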
Communication-optimal parallel algorithm for strassen's matrix multiplication
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice.
A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
Comment: 13 pages, 3 figures
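The asymptotic gap the abstract claims can be illustrated with the standard per-processor bandwidth lower bounds (constants, memory-size terms, and latency dropped; the exponents follow from the expansion-based bounds the abstract cites): classical matrix multiplication moves on the order of n^2 / P^(2/3) words per processor, while a communication-optimal Strassen-based algorithm moves on the order of n^2 / P^(2/log2 7). This comparison is a back-of-the-envelope sketch, not the paper's model.

```python
import math

LOG2_7 = math.log2(7)  # Strassen's exponent omega_0, approximately 2.807

def classical_words(n, P):
    # Asymptotic words communicated per processor, classical lower bound.
    return n**2 / P**(2 / 3)

def strassen_words(n, P):
    # Communication-optimal Strassen: larger exponent on P, so fewer words.
    return n**2 / P**(2 / LOG2_7)
```

Since 2/log2(7) ≈ 0.712 exceeds 2/3, the Strassen-based bound shrinks faster as P grows, which is why a Strassen-based algorithm can beat even communication-optimal classical algorithms at scale.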