Implementation Analysis of Strassen-Like Matrix Multiplication Algorithms Based on Block Decomposition
Matrix multiplication is one of the most widely used operations in all computational fields of linear algebra. The naive method for multiplying two n×n matrices requires O(n^3) arithmetic operations over the ring in which the matrix entries lie. In 1969, Strassen proposed the first sub-cubic complexity algorithm for matrix multiplication. Strassen's algorithm (SA) multiplies two 2×2 matrices using 7 multiplications and 18 additions over the ring. Later, Winograd proposed a variant of SA that also requires 7 multiplications, but only 15 additions over the ring. Algorithms that multiply two 2×2 matrices using 7 multiplications over the ring are called Strassen-like algorithms and have a complexity of O(n^2.81). Although asymptotically better algorithms exist, Strassen-like algorithms are considered the most widely used sub-cubic complexity algorithms.
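To make the operation counts concrete, here is a minimal Python sketch (ours, not the thesis's implementation) of one level of the Winograd variant: 7 block products, 8 pre-additions, and 7 post-additions; in the full algorithm the 7 products are computed recursively rather than with `@`.

```python
import numpy as np

def winograd_2x2(A, B):
    """One level of the Winograd variant: 7 block multiplications
    and 15 block additions/subtractions (8 before, 7 after)."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # 8 pre-additions
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

    # 7 multiplications (recursive calls in the full algorithm)
    P1 = A11 @ B11; P2 = A12 @ B21; P3 = S4 @ B22; P4 = A22 @ T4
    P5 = S1 @ T1;   P6 = S2 @ T2;   P7 = S3 @ T3

    # 7 post-additions
    U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5
    C11 = P1 + P2; C12 = U4 + P3; C21 = U3 - P4; C22 = U3 + P5
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(4, 4); B = np.random.rand(4, 4)
assert np.allclose(winograd_2x2(A, B), A @ B)
```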
Recently, Cenk and Hasan proposed techniques to reduce the arithmetic cost of Strassen-like algorithms. The main technique is to decompose Strassen-like algorithms into three blocks, namely, component matrix formation (CMF), component multiplication (CM), and reconstruction (R). Each block is a recursive operation. In this thesis, we study these building blocks and investigate three optimization methods: the linearity property of CMF and R, limited recursion, and block recombination.
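A minimal Python sketch of this three-block structure, using the classical Strassen combinations for illustration (the organisation into `cmf`/`cm`/`r` functions is our reading of the decomposition, not Cenk and Hasan's code):

```python
import numpy as np

def quads(M):
    n = M.shape[0] // 2
    return M[:n, :n], M[:n, n:], M[n:, :n], M[n:, n:]

def cmf_a(A, k):
    """Component matrix formation for the left operand: recursively
    apply Strassen's 7 input combinations, k levels deep."""
    if k == 0:
        return [A]
    A11, A12, A21, A22 = quads(A)
    combos = [A11 + A22, A21 + A22, A11, A22, A11 + A12, A21 - A11, A12 - A22]
    return [c for m in combos for c in cmf_a(m, k - 1)]

def cmf_b(B, k):
    if k == 0:
        return [B]
    B11, B12, B21, B22 = quads(B)
    combos = [B11 + B22, B11, B12 - B22, B21 - B11, B22, B11 + B12, B21 + B22]
    return [c for m in combos for c in cmf_b(m, k - 1)]

def cm(As, Bs):
    """Component multiplication: one plain product per component pair."""
    return [a @ b for a, b in zip(As, Bs)]

def r(Ms, k):
    """Reconstruction: recursively recombine the 7^k products."""
    if k == 0:
        return Ms[0]
    step = len(Ms) // 7
    M1, M2, M3, M4, M5, M6, M7 = (r(Ms[i*step:(i+1)*step], k - 1) for i in range(7))
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

A = np.random.rand(8, 8); B = np.random.rand(8, 8); k = 2
assert np.allclose(r(cm(cmf_a(A, k), cmf_b(B, k)), k), A @ B)
```

Because CMF and R are linear maps, they can be analysed and recombined independently of CM, which is the structure the three optimisation methods operate on.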
In this thesis, software implementation and hardware simulation are also performed to support the theoretical analysis. For the software implementation, experimental results show that the Winograd variant (WV) is approximately 15% faster than SA. Cenk and Hasan's techniques yield an improved WV (IWV) that considerably reduces the matrix multiplication time in software. For the hardware simulation, we conclude from the synthesis results that WV consumes about 7.5% fewer logic elements than SA. IWV is also tested for different matrix sizes and is shown to reduce resource utilization and timing cost.
Fast and Memory Efficient Strassen's Matrix Multiplication on GPU Cluster
Prior implementations of Strassen's matrix multiplication algorithm on GPUs traded additional workspace, in the form of global memory or registers, for time. Although Strassen's algorithm offers a reduction in computational complexity compared to the classical algorithm, the memory overhead associated with the algorithm limits its practical utility. While there have been past attempts at reducing the memory footprint of Strassen's algorithm by compromising parallelism, no prior implementation, to our knowledge, was able to hide the workspace requirement successfully. This thesis presents an implementation of Strassen's matrix multiplication in CUDA, titled Multi-Stage Memory Efficient Strassen (MSMES), that eliminates additional workspace requirements by reusing and recovering input matrices. MSMES organizes the steps involved in Strassen's algorithm into five stages, where multiple steps in the same stage can be executed in parallel. Two additional stages are also discussed in the thesis that allow the recovery of the input matrices. Unlike previous works, MSMES has no additional memory requirements irrespective of the level of recursion of Strassen's algorithm. Experiments performed with MSMES (with the recovery stages) on NVIDIA Tesla V100 and NVIDIA GTX 1660ti GPUs yielded higher compute performance and lower memory requirements compared to the NVIDIA library function for double precision matrix multiplication, cublasDgemm. In the multi-GPU adaptation of matrix multiplication, we explore the performance of a Strassen-based and a tile-based global decomposition scheme. We also evaluate the performance of using MSMES and cublasDgemm for the local matrix multiplications within each global decomposition scheme. From the experiments, we identified that the combination of Strassen-Winograd decomposition with MSMES yielded the highest speedup among all tested combinations.
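The "reuse and recover" idea can be illustrated in a few lines (a schematic host-side Python sketch of the principle, not MSMES's actual CUDA stage schedule): an input block is overwritten by a linear combination, used for a product, and later restored by inverting the combination.

```python
import numpy as np

# Schematic illustration of in-place reuse and recovery (not MSMES itself):
# overwrite an input block with a Strassen operand, use it, then restore it.
n = 2
A21 = np.random.rand(n, n); A22 = np.random.rand(n, n)
B11 = np.random.rand(n, n)
saved = A21.copy()              # kept only to check the recovery below

A21 += A22                      # compute stage: A21 now holds S = A21 + A22
M2 = A21 @ B11                  # use S directly, no workspace allocated for it
A21 -= A22                      # recovery stage: invert the combination

assert np.allclose(A21, saved)  # input restored, up to floating-point rounding
```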
Practical fast matrix multiplication algorithms
Matrix multiplication is a core building block for numerous scientific computing and, more recently, machine learning applications. Strassen's algorithm, the original Fast Matrix Multiplication (FMM) algorithm, has long fascinated computer scientists due to its startling property of reducing the number of computations required for multiplying n x n matrices from O(n^3) to O(n^2.807). Over the last half century, this has fueled many theoretical improvements, such as other variations of Strassen-like FMM algorithms. Previous implementations of these FMM algorithms led to the "street wisdom" that they are only practical for large, relatively square matrices, that they require considerable workspace, and that thread-level parallelism is difficult to achieve with them. The thesis of this work dispels these notions by demonstrating significant benefits for small and non-square matrices, requiring no workspace beyond what is already incorporated in high-performance implementations of matrix multiplication, and achieving performance benefits on multi-core, many-core, and distributed memory architectures.
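The O(n^2.807) exponent follows from the standard recurrence for a 2×2 scheme with 7 recursive products (a routine derivation, not specific to this thesis):

```latex
T(n) = 7\,T(n/2) + \Theta(n^2)
\quad\Longrightarrow\quad
T(n) = \Theta\bigl(n^{\log_2 7}\bigr) = \Theta\bigl(n^{2.807\ldots}\bigr),
```

whereas the classical algorithm's 8 recursive products give T(n) = 8T(n/2) + Θ(n^2) = Θ(n^3).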
Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms
This thesis presents design and implementation approaches for parallel algorithms of computer algebra. We use algorithmic skeletons as well as further approaches, like data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and for a special class of parallel loops that we call "repeated computation with a possibility of premature termination". In this thesis we also introduce a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms; for these algorithms, our arithmetic provides a generic parallelisation approach.
The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to avoid using two different languages, one for the implementation and one for the interface, in our implementation of computer algebra algorithms.
Further, this thesis presents methods for the evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation; we call it the "parallel penalty". The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations while using drastically fewer measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We have also used existing estimation methods.
We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassen's matrix multiplication algorithm, and the fast Fourier transform. The latter was used to implement polynomial convolution, which leads to a further fast multiplication algorithm. Specifically for our implementation of Strassen's algorithm, we designed and implemented a divide and conquer skeleton based on actors. For the parallel fast Fourier transform, we not only used new divide and conquer skeletons but also developed a map-and-transpose skeleton, which enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows very good performance. We analysed the parallel penalty of our programs and compared it to the serial fraction, an approach known from the literature. We also performed execution time estimations of our divide and conquer programs.
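The Eden skeletons themselves are not reproduced here, but the shape of a divide and conquer skeleton and its instantiation with Karatsuba can be sketched in Python (names and the recursion threshold are illustrative; the thesis's skeletons additionally evaluate the subproblems in parallel):

```python
def divide_and_conquer(trivial, solve, split, combine, p):
    """Sequential shape of a divide and conquer skeleton: a parallel
    version would evaluate the subproblems concurrently."""
    if trivial(p):
        return solve(p)
    return combine(p, [divide_and_conquer(trivial, solve, split, combine, s)
                       for s in split(p)])

def karatsuba(x: int, y: int) -> int:
    """Karatsuba multiplication as an instance: 3 half-size products."""
    def trivial(p):  return min(p) < (1 << 32)   # illustrative cutoff
    def solve(p):    return p[0] * p[1]
    def shift(p):    return max(p[0].bit_length(), p[1].bit_length()) // 2
    def split(p):
        m = shift(p)
        xh, xl = p[0] >> m, p[0] & ((1 << m) - 1)
        yh, yl = p[1] >> m, p[1] & ((1 << m) - 1)
        return [(xh, yh), (xl, yl), (xh + xl, yh + yl)]
    def combine(p, rs):
        m = shift(p)
        hi, lo, mid = rs
        # x*y = hi*2^(2m) + (mid - hi - lo)*2^m + lo
        return (hi << 2 * m) + ((mid - hi - lo) << m) + lo
    return divide_and_conquer(trivial, solve, split, combine, (x, y))

assert karatsuba(3**100, 7**80) == 3**100 * 7**80
```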
This thesis also presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, and workpool, with a premature termination property. We use this to implement the so-called "parallel repeated computation", a special form of a speculative parallel loop. We have implemented two probabilistic primality tests, the Rabin-Miller test and the Jacobi sum test, and parallelised both with our approach. For the Jacobi sum test, we analysed the task distribution and stated fitting configurations. We showed formally that the Jacobi sum test can be implemented in parallel; we then parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We also estimated the performance of the tests for further input sizes and numbers of processing elements. The parallelisation of the Jacobi sum test and our generic parallelisation scheme for repeated computation are our original contributions.
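For reference, the sequential core of the Rabin-Miller test is compact; the rounds with independently drawn witnesses are what a speculative parallel loop can distribute and terminate prematurely once a witness is found (a textbook Python sketch, not the thesis's Eden code):

```python
import random

def rabin_miller(n: int, rounds: int = 40) -> bool:
    """Probabilistic primality test: a composite n survives one round
    with probability at most 1/4, so independent witness rounds (the
    part that parallelises trivially) drive the error down."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    d, s = n - 1, 0
    while d % 2 == 0:           # write n - 1 = d * 2^s with d odd
        d //= 2; s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False        # a is a witness: n is composite
    return True                 # probably prime

assert rabin_miller(2**89 - 1) and not rabin_miller(2**89 - 3)
```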
The data parallel arithmetic was defined not only for integers, which was already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner. This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we parallelised the determinant computation using Gaussian elimination. As always, we performed a task distribution analysis and an estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms.
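The basic mapping behind multiple-residue arithmetic for rationals can be sketched as follows (standard residue arithmetic only, in Python; the thesis's novel treatment of common factors between the fraction and the modulus is precisely what this sketch does not handle):

```python
from math import gcd

MODULI = [2**31 - 1, 2**31 - 19, 2**31 - 61]   # pairwise coprime word-size primes

def to_residues(p: int, q: int):
    """Map the rational p/q to one residue per modulus: p * q^{-1} mod m.
    Requires gcd(q, m) == 1 for every modulus -- the restriction the
    thesis's common-factor handling relaxes."""
    assert all(gcd(q, m) == 1 for m in MODULI)
    return [p * pow(q, -1, m) % m for m in MODULI]

def add(xs, ys):
    # componentwise, hence trivially data parallel: one PE per modulus
    return [(x + y) % m for x, y, m in zip(xs, ys, MODULI)]

def mul(xs, ys):
    return [(x * y) % m for x, y, m in zip(xs, ys, MODULI)]

# (1/3) + (1/6) == 1/2 holds in every residue component
assert add(to_residues(1, 3), to_residues(1, 6)) == to_residues(1, 2)
```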
Summarising, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms.
Computational Bayesian Methods Applied to Complex Problems in Bio and Astro Statistics
In this dissertation we apply computational Bayesian methods to three distinct problems. In the first chapter, we address the issue of unrealistic covariance matrices used to estimate collision probabilities. We model covariance matrices with a Bayesian Normal-Inverse-Wishart model, which we fit with Gibbs sampling. In the second chapter, we are interested in determining the sample sizes necessary to achieve a particular interval width and establish non-inferiority in the analysis of prevalences using two fallible tests. To this end, we use a third-order asymptotic approximation. In the third chapter, we synthesize evidence across multiple domains in measurements taken longitudinally across time, featuring a substantial amount of structurally missing data, and fit the model with Hamiltonian Monte Carlo in a simulation study to analyze how estimates of a parameter of interest change across sample sizes.
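A minimal sketch of such a Gibbs sampler for a multivariate normal mean and covariance under a Normal-Inverse-Wishart prior (the textbook conditionals, not the dissertation's model; hyperparameter defaults are illustrative):

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_niw(X, iters=2000, mu0=None, kappa0=1.0, nu0=None, Psi0=None):
    """Gibbs sampler for (mu, Sigma) under the NIW prior
    mu | Sigma ~ N(mu0, Sigma/kappa0), Sigma ~ IW(nu0, Psi0)."""
    n, d = X.shape
    mu0 = np.zeros(d) if mu0 is None else mu0
    nu0 = d + 2 if nu0 is None else nu0
    Psi0 = np.eye(d) if Psi0 is None else Psi0
    xbar = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)          # starting value
    draws = []
    for _ in range(iters):
        # mu | Sigma, X ~ N((kappa0*mu0 + n*xbar)/(kappa0+n), Sigma/(kappa0+n))
        m_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
        mu = np.random.multivariate_normal(m_n, Sigma / (kappa0 + n))
        # Sigma | mu, X ~ IW(nu0 + n + 1,
        #                    Psi0 + sum_i (x_i-mu)(x_i-mu)^T + kappa0*(mu-mu0)(mu-mu0)^T)
        R = X - mu
        scale = Psi0 + R.T @ R + kappa0 * np.outer(mu - mu0, mu - mu0)
        Sigma = invwishart(df=nu0 + n + 1, scale=scale).rvs()
        draws.append((mu, Sigma))
    return draws

# usage: posterior draws of a 3x3 covariance from simulated data
X = np.random.multivariate_normal(np.zeros(3), np.eye(3), size=200)
samples = gibbs_niw(X, iters=500)
```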
MIMO Systems
In recent years, it has become clear that MIMO communication systems are all but inevitable in the accelerated evolution of high data rate applications, due to their potential to dramatically increase spectral efficiency while simultaneously sending individual information to the corresponding users in wireless systems. This book intends to highlight current research topics in the field of MIMO systems and to offer a snapshot of the recent advances and major issues faced today by researchers in MIMO-related areas. The book is written by specialists working in universities and research centers all over the world to cover the fundamental principles and main advanced topics of high data rate wireless communications over MIMO channels. Moreover, the book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.
LIPIcs, Volume 274, ESA 2023, Complete Volume