Search CORE

121,230 research outputs found

Parallel computation of entries of A-1

Author: Amestoy Patrick
Duff Iain S.
L'Excellent Jean-Yves
Rouet François-Henry
Publication venue: HAL CCSD
Publication date: 01/12/2012
Field of study

In this paper, we are concerned about computing in parallel several entries of the inverse of a large sparse matrix. We assume that the matrix has already been factorized by a direct method and that the factors are distributed. Entries are efficiently computed by exploiting sparsity of the right-hand sides and the solution vectors in the triangular solution phase. We demonstrate that in this setting, parallelism and computational efficiency are two contrasting objectives. We develop an efficient approach and show its efficacy by runs using the MUMPS code that implements a parallel multifrontal method

HAL-ENS-LYON

Scientific Publications of the University of Toulouse II Le Mirail

INRIA a CCSD electronic archive server

Parallel computation of entries in A-1

Author: Amestoy Patrick
Duff Iain S.
L'Excellent Jean-Yves
Rouet François-Henry
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2015
Field of study

International audienceIn this paper, we consider the computation in parallel of several entries of the inverseof a large sparse matrix. We assume that the matrix has already been factorized by a direct methodand that the factors are distributed. Entries are efficiently computed by exploiting sparsity of theright-hand sides and the solution vectors in the triangular solution phase. We demonstrate that inthis setting, parallelism and computational efficiency are two contrasting objectives. We develop anefficient approach and show its efficiency on a general purpose parallel multifrontal solver

HAL-ENS-LYON

Crossref

INRIA a CCSD electronic archive server

Open Archive Toulouse Archive Ouverte

Hal-Diderot

HAL-Rennes 1

Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform

Author: Gupta Indivar
Rawal Shruti
Publication venue: 'Defence Scientific Information and Documentation Centre'
Publication date: 06/12/2022
Field of study

We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA)to find the solution of a large sparse system of linear equations over GF(2). One of the important applications ofsolving such systems arises in most integer factorization algorithms like Number Field Sieve. In this paper, wedescribe how hybrid parallelization can be adapted to speed up the most time-consuming sequence generation stage of BWA. This stage involves generating a sequence of matrix-matrix products and matrix transpose-matrix products where the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for the computation of these matrix-matrix products using techniques like row-wise parallel distribution of the first matrix over multi-node multi-GPU platform using MPI and CUDA and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of matrix transpose-matrix product computation, where we divide both the matrices row-wise into equal-sized blocks using MPI. Then after a GPU-accelerated matrix transpose-matrix product generation, we combine all those blocks using MPI_BXOR operation in MPI_Reduce to obtain the result. The performance of hybrid parallelization of the sequence generation step on a hybrid cluster using multiple GPUs has been compared with parallelization on only multiple MPI processors. We have used this hybrid parallel sequence generation tool for the benchmarking of an HPC cluster. Detailed timings of the complete solution of number field sieve matrices of RSA-130, RSA-140, and RSA-170 are also compared in this paper using up to 4 NVidia V100 GPUs of a DGX station. We got a speedup of 2.8 after parallelization on 4 V100 GPUs compared to that over 1 GPU

Defence Science Journal

Recommended from our members

Crosslinking in parallel

Author: Asuri Hari S.
Publication venue: eScholarship, University of California
Publication date: 01/01/1992
Field of study

A crosslink is a double link established between the two entries of an edge in an adjacency list representation of a graph. Crosslinks play important roles in several parallel algorithms as they provide constant time access between the two entries of an edge; the existence of crosslinks is usually assumed. We consider the problem of establishing crosslinks in a crosslink-less adjacency list for graphs that belong to a class of graphs called the linearly contractible graphs, and show that cross-links can be established optimally in O(log n log*n) time using a CREW PRAM and optimally in O(log n) time using a CRCW PRAM for such graphs

eScholarship - University of California

The Parallel Persistent Memory Model

Author: Berryhill R.
Blelloch G. E.
Buettner M.
Chauhan H.
Herlihy M.
JaJa J.
Lee S. K.
Meena J. S.
Nawab F.
Pelley S.
Woude J. Van Der
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/06/2018
Field of study

We consider a parallel computational model that consists of

P

processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for developing locality efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of

O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil)

in expectation, where

W

and

D

are the work and depth of the computation (in the absence of failures),

P_A

is the average number of processors available during the computation, and

f \le 1/2

is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same nam

arXiv.org e-Print Archive

Crossref

DSpace@MIT

A Novel Methodology for Memory Reduction in Distributed Arithmetic Based Discrete Wavelet Transform

Author: Nagaraj Nithin
Remya Ajai A.S.
Publication venue: Published by Elsevier Ltd.
Publication date: 31/12/2012
Field of study

AbstractDiscrete Wavelet Transform (DWT) is widely used in image compression standards such as JPEG 2000. DWT can be implemented on FPGA using parallel Distributed Arithmetic (DA) architecture, which is suitable for low power implementation. However, the size of the memory in DA increases with the number of wavelet coefficients. In this paper, we propose a novel methodology to reduce the size of the Look-Up Tables (LUTs) used in DA for DWT. The table entries are sorted using Burrows-Wheeler Transform (BWT) and then compressed. The compressed table is stored in memory. During DWT/IDWT computation, without reconstructing the entire table we can recover only the required table entry. A comparative study of this methodology among different wavelets is performed. We demonstrate that the method is very effective for reducing the memory of DA architectures. A compression ratio of around 2.3:1 is achieved for the look-up table which stores the inner product of high-pass filter coefficients of Daubechies-4 (Db4) wavelet which is used in JPEG2000

Elsevier - Publisher Connector