Parallel computation of entries of A^{-1}
In this paper, we are concerned with computing, in parallel, several entries of the inverse of a large sparse matrix. We assume that the matrix has already been factorized by a direct method and that the factors are distributed. Entries are efficiently computed by exploiting sparsity of the right-hand sides and the solution vectors in the triangular solution phase. We demonstrate that in this setting, parallelism and computational efficiency are two contrasting objectives. We develop an efficient approach and show its efficacy by runs using the MUMPS code that implements a parallel multifrontal method.
Parallel computation of entries in A^{-1}
In this paper, we consider the computation in parallel of several entries of the inverse of a large sparse matrix. We assume that the matrix has already been factorized by a direct method and that the factors are distributed. Entries are efficiently computed by exploiting sparsity of the right-hand sides and the solution vectors in the triangular solution phase. We demonstrate that in this setting, parallelism and computational efficiency are two contrasting objectives. We develop an efficient approach and show its efficiency on a general purpose parallel multifrontal solver.
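The idea of exploiting sparse right-hand sides and solution vectors can be illustrated on a small dense example. The sketch below is not the MUMPS implementation; it assumes a factorization A = LU without pivoting, and the helper names `lu_no_pivot` and `inverse_entry` are hypothetical. Entry (i, j) of A^{-1} is component i of the solution of A x = e_j, so the forward solve can skip the leading zeros of e_j and the backward solve can stop as soon as x[i] is known.

```python
from fractions import Fraction

def lu_no_pivot(a):
    """Doolittle LU factorization without pivoting (assumes nonzero pivots)."""
    n = len(a)
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    u = [[Fraction(x) for x in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u

def inverse_entry(l, u, i, j):
    """Return entry (i, j) of A^{-1} given A = L U (0-based indices)."""
    n = len(l)
    # Forward solve L y = e_j: y[k] = 0 for k < j, so start at row j.
    y = [Fraction(0)] * n
    for r in range(j, n):
        s = Fraction(int(r == j))
        for c in range(j, r):
            s -= l[r][c] * y[c]
        y[r] = s  # L has a unit diagonal, so no division is needed
    # Backward solve U x = y, stopping once x[i] has been computed.
    x = [Fraction(0)] * n
    for r in range(n - 1, i - 1, -1):
        s = y[r]
        for c in range(r + 1, n):
            s -= u[r][c] * x[c]
        x[r] = s / u[r][r]
    return x[i]

# Small illustrative matrix (an assumption, not taken from the paper).
A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]]
L, U = lu_no_pivot(A)
```

In a sparse multifrontal solver the same pruning is applied along the elimination tree rather than over contiguous index ranges, which is where the tension between parallelism and the amount of work arises.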
Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform
We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA) to find the solution of a large sparse system of linear equations over GF(2). One of the important applications of solving such systems arises in most integer factorization algorithms like Number Field Sieve. In this paper, we describe how hybrid parallelization can be adapted to speed up the most time-consuming sequence generation stage of BWA. This stage involves generating a sequence of matrix-matrix products and matrix transpose-matrix products where the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for the computation of these matrix-matrix products using techniques like row-wise parallel distribution of the first matrix over a multi-node, multi-GPU platform using MPI and CUDA and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of matrix transpose-matrix product computation, where we divide both the matrices row-wise into equal-sized blocks using MPI. Then after a GPU-accelerated matrix transpose-matrix product generation, we combine all those blocks using the MPI_BXOR operation in MPI_Reduce to obtain the result. The performance of hybrid parallelization of the sequence generation step on a hybrid cluster using multiple GPUs has been compared with parallelization on only multiple MPI processors. We have used this hybrid parallel sequence generation tool for the benchmarking of an HPC cluster. Detailed timings of the complete solution of number field sieve matrices of RSA-130, RSA-140, and RSA-170 are also compared in this paper using up to 4 NVidia V100 GPUs of a DGX station. We obtained a speedup of 2.8 after parallelization on 4 V100 GPUs compared to that over 1 GPU.
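The two GF(2) kernels described above can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions: rows of the second matrix are packed into Python integers instead of machine words, the function names are hypothetical, and `xor_reduce` merely stands in for what MPI_Reduce with MPI_BXOR would do across ranks. No MPI or CUDA is involved.

```python
from functools import reduce

def spmm_gf2(sparse_rows, packed_rows):
    """C = A * B over GF(2).

    sparse_rows[i] lists the column indices of the 1-entries in row i of A.
    packed_rows[k] is row k of B packed into an int (bit j holds B[k][j]).
    Row i of C is the XOR of the rows of B selected by row i of A.
    """
    return [reduce(lambda acc, k: acc ^ packed_rows[k], cols, 0)
            for cols in sparse_rows]

def transpose_product_gf2(x_rows, y_rows, width):
    """X^T * Y over GF(2) for bit-packed rows; returns `width` packed rows."""
    out = [0] * width
    for x, y in zip(x_rows, y_rows):
        for j in range(width):
            if (x >> j) & 1:
                out[j] ^= y
    return out

def xor_reduce(partials):
    """Element-wise XOR of per-block partial results."""
    return [reduce(lambda a, b: a ^ b, column) for column in zip(*partials)]

# Row-wise blocking of X and Y: each block yields a partial X_b^T * Y_b,
# and the partial results are combined with XOR.
X = [0b01, 0b11, 0b10]
Y = [0b10, 0b01, 0b11]
blocks = [(X[:2], Y[:2]), (X[2:], Y[2:])]
combined = xor_reduce([transpose_product_gf2(xb, yb, 2) for xb, yb in blocks])
```

Because addition over GF(2) is XOR, the blocked result equals the product computed on the undivided matrices, which is why a bitwise-XOR reduction suffices to merge the blocks.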
Crosslinking in parallel
A crosslink is a double link established between the two entries of an edge in an adjacency list representation of a graph. Crosslinks play important roles in several parallel algorithms as they provide constant time access between the two entries of an edge; the existence of crosslinks is usually assumed. We consider the problem of establishing crosslinks in a crosslink-less adjacency list for graphs that belong to a class of graphs called the linearly contractible graphs, and show that crosslinks can be established optimally in O(log n log*n) time using a CREW PRAM and optimally in O(log n) time using a CRCW PRAM for such graphs.
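To make the data structure concrete, the following sequential sketch builds crosslinks with a hash table. It is not the PRAM algorithm of the paper and says nothing about its time bounds; the function name `build_crosslinks` and the list-of-lists layout are assumptions made for illustration.

```python
def build_crosslinks(adj):
    """adj[u] is the list of neighbours of u in an undirected simple graph.

    Returns cross, where cross[u][k] is the position of u inside adj[v]
    for v = adj[u][k], i.e. the location of the twin entry of edge {u, v}.
    """
    # Record where each directed entry (u, v) sits in u's adjacency list.
    position = {(u, v): k
                for u, nbrs in enumerate(adj)
                for k, v in enumerate(nbrs)}
    # The twin of entry (u, v) is entry (v, u).
    return [[position[(v, u)] for v in nbrs] for u, nbrs in enumerate(adj)]

# Triangle graph on vertices 0, 1, 2.
adj = [[1, 2], [2, 0], [0, 1]]
cross = build_crosslinks(adj)
```

Once `cross` exists, moving from one entry of an edge to the other is a single array lookup, which is the constant-time access the abstract refers to.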
The Parallel Persistent Memory Model
We consider a parallel computational model that consists of processors,
each with a fast local ephemeral memory of limited size, and sharing a large
persistent memory. The model allows for each processor to fault with bounded
probability, and possibly restart. On faulting all processor state and local
ephemeral memory are lost, but the persistent memory remains. This model is
motivated by upcoming non-volatile memories that are as fast as existing random
access memory, are accessible at the granularity of cache lines, and have the
capability of surviving power outages. It is further motivated by the
observation that in large parallel systems, failure of processors and their
caches is not unusual.
Within the model we develop a framework for developing locality efficient
parallel algorithms that are resilient to failures. There are several
challenges, including the need to recover from failures, the desire to do this
in an asynchronous setting (i.e., not blocking other processors when one
fails), and the need for synchronization primitives that are robust to
failures. We describe approaches to solve these challenges based on breaking
computations into what we call capsules, which have certain properties, and
developing a work-stealing scheduler that functions properly within the context
of failures. The scheduler guarantees a time bound of
$O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil)$ in expectation, where $W$ and $D$ are the work and
depth of the computation (in the absence of failures), $P_A$ is the average
number of processors available during the computation, and $f \le 1/2$ is the
probability that a capsule fails. Within the model and using the proposed
methods, we develop efficient algorithms for parallel sorting and other
primitives.
Comment: This paper is the full version of a paper at SPAA 2018 with the same
name.
A Novel Methodology for Memory Reduction in Distributed Arithmetic Based Discrete Wavelet Transform
Discrete Wavelet Transform (DWT) is widely used in image compression standards such as JPEG 2000. DWT can be implemented on FPGA using parallel Distributed Arithmetic (DA) architecture, which is suitable for low power implementation. However, the size of the memory in DA increases with the number of wavelet coefficients. In this paper, we propose a novel methodology to reduce the size of the Look-Up Tables (LUTs) used in DA for DWT. The table entries are sorted using Burrows-Wheeler Transform (BWT) and then compressed. The compressed table is stored in memory. During DWT/IDWT computation, without reconstructing the entire table we can recover only the required table entry. A comparative study of this methodology among different wavelets is performed. We demonstrate that the method is very effective for reducing the memory of DA architectures. A compression ratio of around 2.3:1 is achieved for the look-up table which stores the inner product of high-pass filter coefficients of Daubechies-4 (Db4) wavelet which is used in JPEG 2000.
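The reason DA memory grows with the number of coefficients can be seen in a minimal sketch of the uncompressed look-up table. The code below assumes integer (fixed-point) coefficients and unsigned input samples, uses arbitrary example values rather than the Db4 filter, and does not implement the BWT-based compression proposed in the paper; the function names are hypothetical.

```python
def build_da_lut(coeffs):
    """LUT[addr] = sum of coeffs[k] over the bits k set in addr (2**K entries)."""
    lut = [0] * (1 << len(coeffs))
    for addr in range(1, len(lut)):
        low = (addr & -addr).bit_length() - 1  # index of the lowest set bit
        # Reuse the entry for addr with its lowest set bit cleared.
        lut[addr] = lut[addr & (addr - 1)] + coeffs[low]
    return lut

def da_inner_product(lut, samples, bits):
    """Compute sum(coeffs[k] * samples[k]) for unsigned `bits`-bit samples."""
    total = 0
    for t in range(bits):
        # Bit t of every sample forms the table address for this bit plane.
        addr = sum(((x >> t) & 1) << k for k, x in enumerate(samples))
        total += lut[addr] * (1 << t)
    return total

coeffs = [3, -1, 4, 2]    # illustrative integer filter taps (assumption)
samples = [5, 7, 0, 9]    # illustrative 4-bit unsigned inputs (assumption)
lut = build_da_lut(coeffs)
result = da_inner_product(lut, samples, bits=4)
```

A filter with K taps needs a table of 2^K entries, so each additional coefficient doubles the memory; this is the cost that table compression aims to reduce.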