Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster
Prior implementations of Strassen's matrix multiplication algorithm on GPUs traded additional workspace, in the form of global memory or registers, for time. Although Strassen's algorithm offers a reduction in computational complexity compared with the classical algorithm, its memory overhead limits its practical utility. While there were past attempts at reducing the memory footprint of Strassen's algorithm by compromising parallelism, no prior implementation, to our knowledge, was able to hide the workspace requirement successfully. This thesis presents an implementation of Strassen's matrix multiplication in CUDA, titled Multi-Stage Memory Efficient Strassen (MSMES), that eliminates additional workspace requirements by reusing and recovering input matrices. MSMES organizes the steps involved in Strassen's algorithm into five stages, where multiple steps in the same stage can be executed in parallel. Two additional stages that allow the recovery of the input matrices are also discussed in the thesis. Unlike previous works, MSMES has no additional memory requirements irrespective of the recursion level of Strassen's algorithm. Experiments performed with MSMES (with the recovery stages) on NVIDIA Tesla V100 and NVIDIA GTX 1660ti GPUs yielded higher compute performance and lower memory requirements than the NVIDIA library function for double-precision matrix multiplication, cublasDgemm. In the multi-GPU adaptation of matrix multiplication, we explore the performance of a Strassen-based and a tile-based global decomposition scheme. We also evaluated MSMES and cublasDgemm for the local matrix multiplications under each global decomposition scheme. The experiments identified the combination of Strassen-Winograd decomposition with MSMES as yielding the highest speedup among all tested combinations.
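For readers unfamiliar with the steps MSMES schedules, the classical seven-product Strassen recursion can be sketched as follows. This is a plain NumPy illustration of the textbook algorithm, not the thesis's CUDA code; the `leaf` cutoff and function names are illustrative choices.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's recursion for square matrices of size 2^k.

    Illustrates the seven products that MSMES organizes into GPU
    stages; below the `leaf` cutoff it falls back to classical matmul.
    """
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products (each recursively computed)
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine the products into the four quadrants of C
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Note the intermediate matrices M1 through M7: in a naive GPU port each needs its own workspace, which is exactly the overhead MSMES avoids by reusing and then recovering the input quadrants.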
An Investigation into the Performance Evaluation of Connected Vehicle Applications: From Real-World Experiment to Parallel Simulation Paradigm
A novel system was developed that provides drivers with lane-merge advisories, using vehicle trajectories obtained through Dedicated Short Range Communication (DSRC). It was successfully tested on a freeway using three vehicles and then targeted for further testing via simulation. The failure of contemporary simulators to effectively model large, complex urban transportation networks then motivated further research into distributed and parallel traffic simulation. An architecture for a closed-loop, parallel simulator was devised, using a new algorithm that accounts for boundary nodes, traffic signals, intersections, road lengths, traffic density, and lane counts; it partitions a sample Tennessee road network more efficiently than tools like METIS, which increase interprocess communication (IPC) overhead by partitioning more transportation corridors. The simulator uses logarithmic accumulation to synchronize parallel simulations, further reducing IPC. Analyses suggest this eliminates up to one-third of the IPC overhead incurred by a linear accumulation model.
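The simulator's internals are not given in the abstract, but the general idea behind logarithmic versus linear accumulation can be sketched: instead of a coordinator collecting partial results from each of the p partitions in turn (p - 1 sequential exchanges), partitions combine pairwise, so synchronization completes in ceil(log2 p) rounds. The function below is a hypothetical illustration of that tree-combining pattern, not the paper's algorithm.

```python
def tree_reduce(values):
    """Pairwise-combine partial results (e.g. per-partition state sums)
    in ceil(log2 n) rounds, the pattern behind logarithmic accumulation."""
    rounds = 0
    while len(values) > 1:
        # Each round, adjacent pairs merge; an odd leftover carries forward.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds
```

With 8 partitions this takes 3 rounds rather than the 7 sequential exchanges a linear gather would need, which is the kind of IPC reduction the abstract describes.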
Communication-optimal parallel algorithm for strassen's matrix multiplication
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice.
A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
Comment: 13 pages, 3 figures
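The asymptotic gap the abstract claims can be illustrated with the standard per-processor bandwidth lower bounds (constants, memory-size terms, and latency dropped; the exponents follow from the expansion-based bounds the abstract cites): classical matrix multiplication moves on the order of n^2 / P^(2/3) words per processor, while a communication-optimal Strassen-based algorithm moves on the order of n^2 / P^(2/log2 7). This comparison is a back-of-the-envelope sketch, not the paper's model.

```python
import math

LOG2_7 = math.log2(7)  # Strassen's exponent omega_0, approximately 2.807

def classical_words(n, P):
    # Asymptotic words communicated per processor, classical lower bound.
    return n**2 / P**(2 / 3)

def strassen_words(n, P):
    # Communication-optimal Strassen: larger exponent on P, so fewer words.
    return n**2 / P**(2 / LOG2_7)
```

Since 2/log2(7) ≈ 0.712 exceeds 2/3, the Strassen-based bound shrinks faster as P grows, which is why a Strassen-based algorithm can beat even communication-optimal classical algorithms at scale.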