51,967 research outputs found
Recommended from our members
Solving large scale linear programming
The interior point method (IPM) is now well established as a competitive technique for solving very large scale linear programming problems. The leading variant of the interior point method is the primal dual - predictor corrector algorithm due to Mehrotra. The main computational steps of this algorithm are the repeated calculation and solution of a large sparse positive definite system of equations.
We describe an implementation of the predictor corrector IPM algorithm on MasPar, a massively parallel SIMD computer. At the heart of the implemen-tation is a parallel Cholesky factorization algorithm for sparse matrices. Our implementation uses a new scheme of mapping the matrix onto the processor grid of the MasPar, that results in a more efficient Cholesky factorization than previously suggested schemes.
The IPM implementation uses the parallel unit of MasPar to speed up the factorization and other computationally intensive parts of the IPM. An impor-tant part of this implementation is the judicious division of data and computation between the front-end computer, that runs the main IPM algorithm, and the par-allel unit. Performanc
Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform
We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA)to find the solution of a large sparse system of linear equations over GF(2). One of the important applications ofsolving such systems arises in most integer factorization algorithms like Number Field Sieve. In this paper, wedescribe how hybrid parallelization can be adapted to speed up the most time-consuming sequence generation stage of BWA. This stage involves generating a sequence of matrix-matrix products and matrix transpose-matrix products where the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for the computation of these matrix-matrix products using techniques like row-wise parallel distribution of the first matrix over multi-node multi-GPU platform using MPI and CUDA and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of matrix transpose-matrix product computation, where we divide both the matrices row-wise into equal-sized blocks using MPI. Then after a GPU-accelerated matrix transpose-matrix product generation, we combine all those blocks using MPI_BXOR operation in MPI_Reduce to obtain the result. The performance of hybrid parallelization of the sequence generation step on a hybrid cluster using multiple GPUs has been compared with parallelization on only multiple MPI processors. We have used this hybrid parallel sequence generation tool for the benchmarking of an HPC cluster. Detailed timings of the complete solution of number field sieve matrices of RSA-130, RSA-140, and RSA-170 are also compared in this paper using up to 4 NVidia V100 GPUs of a DGX station. We got a speedup of 2.8 after parallelization on 4 V100 GPUs compared to that over 1 GPU
Recommended from our members
A Parallel Direct Method for Finite Element Electromagnetic Computations Based on Domain Decomposition
High performance parallel computing and direct (factorization-based) solution methods have been the two main trends in electromagnetic computations in recent years. When time-harmonic (frequency-domain) Maxwell\u27s equation are directly discretized with the Finite Element Method (FEM) or other Partial Differential Equation (PDE) methods, the resulting linear system of equations is sparse and indefinite, thus harder to efficiently factorize serially or in parallel than alternative methods e.g. integral equation solutions, that result in dense linear systems. State-of-the-art sparse matrix direct solvers such as MUMPS and PARDISO don\u27t scale favorably, have low parallel efficiency and high memory footprint. This work introduces a new class of sparse direct solvers based on domain decomposition method, termed Direct Domain Decomposition Method (D3M), which is reliable, memory efficient, and offers very good parallel scalability for arbitrary 3D FEM problems.
Unlike recent trends in approximate/low-rank solvers, this method focuses on `numerically exact\u27 solution methods as they are more reliable for complex `real-life\u27 models. The proposed method leverages physical insights at every stage of the development through a new symmetric domain decomposition method (DDM) with one set of Lagrange multipliers. Applying a special regularization scheme at the interfaces, either artificial loss or gain is introduced to each domain to eliminate non-physical internal resonances. A block-wise recursive algorithm based on Takahashi relationship is proposed for the efficient computation of discrete Dirichlet-to-Neumann (DtN) map to reduce the volumetric problem from all domains into an auxiliary surfacial problem defined on the domain interfaces only. Numerical results show up to 50% run-time saving in DtN map computation using the proposed block-wise recursive algorithm compared to alternative approaches. The auxiliary unknowns on the domain interfaces form a considerably (approximately an order of magnitude) smaller block-wise sparse matrix, which is efficiently factorized using a customized block LDL factorization with restricted pivoting to ensure stability.
The parallelization of the proposed D3M is realized based on Directed Acyclic Graph (DAG). Recent advances in parallel dense direct solvers, have shifted toward parallel implementation that rely on DAG scheduling to achieve highly efficient asynchronous parallel execution. However, adaptation of such schemes to sparse matrices is harder and often impractical. In D3M, computation of each domain\u27s discrete DtN map ``embarrassingly parallel\u27\u27, whereas the customized block LDLT is suitable for a block directed acyclic graph (B-DAG) task scheduling, similar to that used in dense matrix parallel direct solvers. In this approach, computations are represented as a sequence of small tasks that operate on domains of DDM or dense matrix blocks of the reduced matrix. These tasks can be statically scheduled for parallel execution using their DAG dependencies and weights that depend on estimates of computation and communication costs.
Comparisons with state-of-the-art exact direct solvers on electrically large problems suggest up to 20% better parallel efficiency, 30% - 3X less memory and slightly faster in runtime, while maintaining the same accuracy
Adapting the interior point method for the solution of LPs on serial, coarse grain parallel and massively parallel computers
In this paper we describe a unified scheme for implementing an interior point algorithm (IPM) over a range of computer architectures. In the inner iteration of the IPM a search direction is computed using Newton's method. Computationally this involves solving a sparse symmetric positive definite (SSPD) system of equations. The choice of direct and indirect methods for the solution of this system, and the design of data structures to take advantage of serial, coarse grain parallel and massively parallel computer architectures, are considered in detail. We put forward arguments as to why integration of the system within a sparse simplex solver is important and outline how the system is designed to achieve this integration
Domain Decomposition Based High Performance Parallel Computing\ud
The study deals with the parallelization of finite element based Navier-Stokes codes using domain decomposition and state-ofart sparse direct solvers. There has been significant improvement in the performance of sparse direct solvers. Parallel sparse direct solvers are not found to exhibit good scalability. Hence, the parallelization of sparse direct solvers is done using domain decomposition techniques. A highly efficient sparse direct solver PARDISO is used in this study. The scalability of both Newton and modified Newton algorithms are tested
Adapting the interior point method for the solution of linear programs on high performance computers
In this paper we describe a unified algorithmic framework for the interior point method (IPM) of solving Linear Programs (LPs) which allows us to adapt it over a range of high performance computer architectures. We set out the reasons as to why IPM makes better use of high performance computer architecture than the sparse simplex method. In the inner iteration of the IPM a search direction is computed using Newton or higher order methods. Computationally this involves solving a sparse symmetric positive definite (SSPD) system of equations. The choice of direct and indirect methods for the solution of this system and the design of data structures to take advantage of coarse grain parallel and massively parallel computer architectures are considered in detail. Finally, we present experimental results of solving NETLIB test problems on examples of these architectures and put forward arguments as to why integration of the system within sparse simplex is beneficial
- …