840 research outputs found
Adapting the interior point method for the solution of LPs on serial, coarse grain parallel and massively parallel computers
In this paper we describe a unified scheme for implementing an interior point algorithm (IPM) over a range of computer architectures. In the inner iteration of the IPM a search direction is computed using Newton's method. Computationally this involves solving a sparse symmetric positive definite (SSPD) system of equations. The choice of direct and indirect methods for the solution of this system, and the design of data structures to take advantage of serial, coarse grain parallel and massively parallel computer architectures, are considered in detail. We put forward arguments as to why integration of the system within a sparse simplex solver is important and outline how the system is designed to achieve this integration
A bibliography on parallel and vector numerical algorithms
This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.
Solving Lattice QCD systems of equations using mixed precision solvers on GPUs
Modern graphics hardware is designed for highly parallel numerical tasks and
promises significant cost and performance benefits for many scientific
applications. One such application is lattice quantum chromodynamics (lattice
QCD), where the main computational challenge is to efficiently solve the
discretized Dirac equation in the presence of an SU(3) gauge field. Using
NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector
product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops for double,
single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have
developed a new mixed precision approach for Krylov solvers using reliable
updates which allows for full double precision accuracy while using only single
or half precision arithmetic for the bulk of the computation. The resulting
BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations
until convergence, perform better than the usual defect-correction approach for
mixed precision. Comment: 30 pages, 7 figures
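For orientation, the "usual defect-correction approach" that the abstract compares against can be sketched as follows. This is a minimal NumPy illustration, not the paper's reliable-updates method and not its CUDA code: the function names `cg` and `defect_correction`, the tolerances, and the use of a dense matrix are all assumptions made for the example. The inner solve runs in single precision, while the residual and the accumulated solution are kept in double precision.

```python
import numpy as np

def cg(A, b, tol, max_iter=1000):
    """Plain conjugate gradient, run in whatever dtype A and b carry."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    b_norm = np.sqrt(b @ b)
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol * b_norm:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def defect_correction(A, b, tol=1e-12, inner_tol=1e-4, max_outer=50):
    """Hypothetical helper: double-precision outer loop, float32 inner CG."""
    A32 = A.astype(np.float32)          # low-precision copy for the bulk work
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(max_outer):
        r = b - A @ x                   # true residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        e = cg(A32, r.astype(np.float32), inner_tol)  # approximate correction
        x += e.astype(np.float64)       # accumulate in double precision
    return x
```

Each outer pass restarts the inner Krylov solver from scratch, which is the cost the abstract's reliable-updates scheme is said to improve on in terms of iterations until convergence.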
Solution of partial differential equations on vector and parallel computers
The present status of numerical methods for partial differential equations on vector and parallel computers is reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.
Parallel Sparse Matrix Solver on the GPU Applied to Simulation of Electrical Machines
Several industrial applications are now being ported to parallel
architectures, because these platforms deliver higher performance for
system modelling and simulation. In the electric machines area, there are many
problems whose solution needs to be sped up. This paper examines the
parallelization of a sparse matrix solver on graphics processors. More
specifically, we implement the conjugate gradient technique with the input matrix
stored in CSR, symmetric CSR and CSC formats. This method is one of the
most efficient iterative methods available for solving the systems that arise
from finite-element discretizations of Maxwell's equations. The GPU (Graphics
Processing Unit) used for the implementation provides mechanisms to parallelize
the algorithm, and thus significantly increases the computation speed relative
to serial code on CPU-based systems.
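The storage formats named in the abstract differ mainly in how the matrix-vector product inside each conjugate gradient iteration walks the data. The sketch below is a serial NumPy illustration under assumptions of our own: the helper names are hypothetical, the conversion starts from a dense array for brevity, and "symmetric CSR" is taken to mean that only the upper triangle (including the diagonal) is stored. It does not reproduce the paper's GPU kernels.

```python
import numpy as np

def dense_to_csr(A, upper_only=False):
    """Build CSR arrays (data, indices, indptr) from a dense 2-D array.
    With upper_only=True, keep only entries with column >= row."""
    data, indices, indptr = [], [], [0]
    for i, row in enumerate(A):
        for j, v in enumerate(row):
            if v != 0 and (not upper_only or j >= i):
                data.append(v)
                indices.append(j)
        indptr.append(len(data))
    return (np.array(data, dtype=float),
            np.array(indices, dtype=int),
            np.array(indptr, dtype=int))

def csr_matvec(data, indices, indptr, x):
    """y = A @ x for full CSR storage: one independent dot product per row."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

def sym_csr_matvec(data, indices, indptr, x):
    """y = A @ x when only the upper triangle of a symmetric A is stored.
    Each off-diagonal entry contributes to two rows of y."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            y[i] += data[k] * x[j]
            if j != i:
                y[j] += data[k] * x[i]
    return y
```

With full CSR every row can be processed independently, which maps naturally onto parallel threads. The symmetric variant roughly halves storage but scatters updates into `y[j]`, so a parallel version would need atomic updates or a reduction step.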
Efficient ICCG on a shared memory multiprocessor
Different approaches are discussed for exploiting parallelism in the ICCG (Incomplete Cholesky Conjugate Gradient) method for solving large sparse symmetric positive definite systems of equations on a shared memory parallel computer. Techniques for efficiently solving triangular systems and computing sparse matrix-vector products are explored. Three methods for scheduling the tasks in solving triangular systems are implemented on the Sequent Balance 21000. Sample problems that are representative of a large class of problems solved using iterative methods are used. We show that a static analysis to determine data dependences in the triangular solve can greatly improve its parallel efficiency. We also show that ignoring symmetry and storing the whole matrix can reduce solution time substantially
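One common form of static dependence analysis for a sparse triangular solve is level scheduling: unknowns are grouped into levels so that every unknown depends only on unknowns in earlier levels, and all unknowns within a level can be computed in parallel. The sketch below illustrates that idea in serial Python. It is an assumption that this resembles the analysis used in the paper, which does not spell out its three scheduling methods in the abstract; the function names are hypothetical.

```python
def triangular_levels(rows):
    """rows[i] lists the columns j < i where L[i][j] is nonzero.
    Returns a list of levels; each level is a list of row indices
    that depend only on rows in earlier levels."""
    level = [0] * len(rows)
    for i, deps in enumerate(rows):
        # A row sits one level after its deepest dependency.
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        levels[lvl].append(i)
    return levels

def solve_by_levels(L, b):
    """Forward substitution for lower-triangular L, visiting rows level by level."""
    n = len(b)
    rows = [[j for j in range(i) if L[i][j] != 0] for i in range(n)]
    x = [0.0] * n
    for group in triangular_levels(rows):
        for i in group:  # rows in one group are mutually independent
            x[i] = (b[i] - sum(L[i][j] * x[j] for j in rows[i])) / L[i][i]
    return x
```

The levels are computed once from the sparsity pattern and can be reused for every triangular solve in the ICCG iteration, since the pattern does not change between iterations.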
Parallel computation of 3-D electromagnetic scattering using finite elements
The finite element method (FEM) with local absorbing boundary conditions has been recently applied to compute electromagnetic scattering from large 3-D geometries. In this paper, we present details pertaining to code implementation and optimization. Various types of sparse matrix storage schemes are discussed and their performance is examined in terms of vectorization and net storage requirements. The system of linear equations is solved using a preconditioned biconjugate gradient (BCG) algorithm and a fairly detailed study of existing point and block preconditioners (diagonal and incomplete LU) is carried out. A modified ILU preconditioning scheme is also introduced which works better than the traditional version for our matrix systems. The parallelization of the iterative sparse solver and the matrix generation/assembly as implemented on the KSR1 multiprocessor is described and the interprocessor communication patterns are analysed in detail. Near-linear speed-up is obtained for both the iterative solver and the matrix generation/assembly phases. Results are presented for a problem having 224,476 unknowns and validated by comparison with measured data. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/50413/1/1660070504_ftp.pd
FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture
Solving a system of linear equations is a key problem in the field of engineering and science. Matrix factorization is a key component of many methods used to solve such equations. However, the factorization process is very time consuming, so these problems have traditionally been targeted for parallel machines rather than sequential ones. Nevertheless, commercially available supercomputers are expensive and only large institutions have the resources to purchase them or use them. Hence, efforts are on to develop more affordable alternatives. This thesis presents one such approach.
The work presented here is an implementation of a parallel version of the Cholesky matrix factorization algorithm on a single-chip multiprocessor built on an APEX20K series FPGA developed by Altera. This multiprocessor system uses an asymmetric, shared-memory MIMD architecture, built using a configurable processor core called Nios, which was also developed by Altera. The whole system was developed on Altera's SOPC Development Kit using the Quartus II development environment.
The Cholesky algorithm is based on an algorithm described in George, et al. [9]. The key features of this algorithm are that it is scalable and uses a queue of tasks approach [9], which ensures dynamic load-balancing among the processing elements. The implementation also assumes dense matrices in the input.
Timing, speedup and efficiency results based on experiments run on uniprocessor and multiprocessor implementations are also presented
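As a reference point for the dense factorization the thesis parallelizes, a serial column-oriented Cholesky factorization can be sketched as below. This is a textbook formulation in plain Python, not the task-queue algorithm of George, et al. [9] and not the Nios implementation; the function name is an assumption made for the example.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T for a dense symmetric
    positive definite matrix A given as a list of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract squares of the entries already in row j.
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L
```

Within a column, the entries below the diagonal are independent of one another once the diagonal entry is known. That independence is the kind of work a task queue could hand out to several processing elements.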
Automated problem scheduling and reduction of synchronization delay effects
It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable performance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because: (1) they provide a useful model problem for use in exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acyclic graphs, and (2) such efficient methods can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs.