305 research outputs found
Parallel implementation of the finite element method on shared memory multiprocessors
PhD Thesis
The work presented in this thesis concerns parallel methods for finite element
analysis. The research has been funded by British Gas and some of the presented
material involves work on their software. Practical problems involving the finite
element method can use a large amount of processing power and the execution
times can be very large. It is consequently important to investigate the possibilities
for the parallel implementation of the method. The research has been carried out
on an Encore Multimax, a shared memory multiprocessor with 14 identical CPUs.
We firstly experimented on autoparallelising a large British Gas finite element
program (GASP4) using Encore's parallelising Fortran compiler (epf). The parallel
program generated by epf proved not to be efficient. The main reasons are
the complexity of the code and the small grain size of the parallelism. Since the program is hard
to analyse for the compiler at high levels, only small grain parallelism has been
inserted automatically into the code. This involves a great deal of low level
synchronisation, which produces large overheads and causes inefficiency. A detailed
analysis of the autoparallelised code has been made with a view to determining
the reasons for the inefficiency. Suggestions have also been made about writing
programs such that they are suitable for efficient autoparallelisation.
The finite element method consists of the assembly of a stiffness matrix and
the solution of a set of simultaneous linear equations. A sparse representation of
the stiffness matrix has been used to allow experimentation on large problems.
Parallel assembly techniques for the sparse representation have been developed.
Some of these methods have proved to be very efficient giving speed ups that are
near ideal.
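One way to make sparse assembly race-free, which the parallel assembly techniques above must address in some form, is to colour the elements so that no two elements of the same colour share a node; elements within one colour can then be assembled concurrently with no write conflicts. A minimal sketch in Python, assuming 1D linear elements and a dense matrix for clarity (the thesis itself uses a sparse representation, and the element type and function names here are illustrative):

```python
import numpy as np

def element_stiffness(h):
    # Stiffness matrix of a 1D linear element of length h
    # (a hypothetical, simplest-possible element type).
    return (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])

def assemble_colored(n_elems, h=1.0):
    # Colour the elements so that no two elements of one colour
    # share a node; in a 1D chain, the even-numbered and
    # odd-numbered elements form two such groups.
    n_nodes = n_elems + 1
    K = np.zeros((n_nodes, n_nodes))
    groups = [range(0, n_elems, 2), range(1, n_elems, 2)]
    for group in groups:
        # All elements in `group` touch disjoint rows and columns
        # of K, so this inner loop has no write conflicts and
        # could run in parallel, one element per processor.
        for e in group:
            ke = element_stiffness(h)
            dofs = (e, e + 1)          # global node numbers
            for a in range(2):
                for b in range(2):
                    K[dofs[a], dofs[b]] += ke[a, b]
    return K
```

With three elements this reproduces the familiar tridiagonal 1D stiffness matrix, identical to a serial assembly.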
For the solution phase, we have used the preconditioned conjugate gradient
method (PCG). An incomplete LU factorisation of the stiffness matrix with no
fill-in (ILU(0)) has been found to be an effective preconditioner. The factors can be
obtained at a low cost. We have parallelised all the steps of the PCG method. The
main bottleneck is the triangular solves (preconditioning operations) at each step.
Two parallel methods of triangular solution have been implemented. One is based
on level scheduling (row-oriented parallelism) and the other is a new approach
called independent columns (column-oriented parallelism). The algorithms have
been tested for row and red-black orderings of the nodal unknowns in the finite
element meshes considered.
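Level scheduling extracts row-oriented parallelism from a sparse triangular solve by grouping rows into levels: a row's level is one more than the highest level among the earlier rows it depends on, and all rows within one level can be solved concurrently. A hedged sketch in Python over a lower-triangular matrix in CSR storage (the data layout and names are illustrative, not the thesis implementation, and the per-level loop is written serially where a parallel machine would fan it out):

```python
def level_schedule(indptr, indices):
    # Compute level sets for a lower-triangular CSR matrix:
    # row i depends on every stored column j < i in row i.
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for j in indices[indptr[i]:indptr[i + 1]]:
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    buckets = {}
    for i, lv in enumerate(level):
        buckets.setdefault(lv, []).append(i)
    return [buckets[lv] for lv in sorted(buckets)]

def tri_solve_by_levels(indptr, indices, data, b):
    # Forward substitution, level by level.  Rows inside one
    # level are mutually independent, so the inner loop is the
    # parallelisable part.
    x = [0.0] * len(b)
    for rows in level_schedule(indptr, indices):
        for i in rows:                 # parallel across processors
            s = b[i]
            diag = 1.0
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    s -= data[k] * x[j]
            x[i] = s / diag
    return x
```

For a bidiagonal factor every row depends on its predecessor, so the levels are singletons and no parallelism is available; orderings such as red-black aim precisely at producing wide levels.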
The best speed ups obtained are 7.29 (on 12 processors) for level scheduling
and 7.11 (on 12 processors) for independent columns. Red-black ordering gives
rise to better parallel performance than row ordering in general. An analysis of
methods for the improvement of the parallel efficiency has been made.
Finite Element Algorithms and Data Structures on Graphical Processing Units
The finite element method (FEM) is one of the most commonly used techniques for the solution of partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM with special attention to the balance of computation and data movement. We present a GPU assembly algorithm that scales to arbitrary degree polynomials used as basis functions, at the expense of redundant computations. We show how the storage of the stiffness matrix affects the performance of both the assembly and the solution. We investigate two approaches: global assembly into the CSR and ELLPACK matrix formats and matrix-free algorithms, and show the trade-off between the amount of indexing data and stiffness data. We discuss the performance of different approaches in light of the implicit caches on Fermi GPUs and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms that are part of a conjugate gradient iteration and show that a matrix-free approach may be up to 2 times faster than global assembly approaches and up to 4 times faster than NVIDIA’s cuSPARSE library, depending on the preconditioner used.
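To illustrate the storage trade-off this abstract refers to: CSR keeps a row-pointer array plus one column index per stored non-zero, while ELLPACK pads every row to a common width so that indices and values form regular rectangular arrays, which suits the GPU's regular memory access at the cost of padded entries. A small serial sketch of both multiplication kernels in plain Python (for clarity only; the paper's kernels are GPU code):

```python
def csr_spmv(indptr, indices, data, x):
    # y[i] = sum of data[k] * x[indices[k]] over the stored
    # entries of row i; rows have varying lengths.
    return [sum(data[k] * x[indices[k]]
                for k in range(indptr[i], indptr[i + 1]))
            for i in range(len(indptr) - 1)]

def ell_spmv(cols, vals, x):
    # ELLPACK: every row padded to the same width with zero
    # values (and an arbitrary valid column index), giving a
    # rectangular, regularly strided layout.
    return [sum(v * x[c] for c, v in zip(crow, vrow))
            for crow, vrow in zip(cols, vals)]
```

Both kernels compute the same product; the difference is purely in layout, which is exactly why the padding overhead versus access regularity trade-off matters on a GPU.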
Strategies for producing fast finite element solutions of the incompressible Navier-Stokes equations on massively parallel architectures
To take advantage of the inherent flexibility of the finite element method in solving for flows within complex geometries, it is necessary to produce efficient implementations of the method. Segregation of the solution scheme and the use of parallel computers are two ways of doing this.
Here, the optimisation of a sequential segregated finite element algorithm is discussed, together with the various strategies by which this is done. Furthermore, the implications of parallelising the code onto a massively parallel computer, the MasPar, are explored.
This machine is of Single Instruction Multiple Data type, so modifications to the computer code have been necessary. A general methodology for the implementation of finite element programs is presented, based on projecting the levels of data within the algorithm into a form which is ideal for parallelisation. Application of this methodology, in a high level language, has resulted in a code which runs at just under 30 MFlops (in double precision). The computations are performed with minimal inter-processor communication, and this represents an efficiency of 20% of the theoretical peak speed. Even though only high level language constructs have been used, this efficiency is comparable with other work using low level constructs on machines of this architecture. In particular, the use of data parallel arrays and the utilisation of the non-unique machine specific features of the computer architecture have produced an efficient, fast program.
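The data-parallel style described here maps per-node or per-element work onto whole-array operations that an SIMD machine executes in lock step, rather than explicit loops. A hypothetical sketch of the idea using NumPy array expressions (not the MasPar code, which used the machine's own high level language; the residual computation shown is an invented example):

```python
import numpy as np

def elementwise_residual(u, f, h):
    # 1D Poisson residual at interior nodes,
    #   r_i = f_i - (u_{i-1} - 2 u_i + u_{i+1}) / h**2,
    # written as one whole-array expression: every interior
    # node is updated "simultaneously", which is exactly the
    # form an SIMD array processor executes in lock step.
    return f[1:-1] - (u[:-2] - 2.0 * u[1:-1] + u[2:]) / h**2
```

The point is that the data layout, not the control flow, carries the parallelism: each processing element owns a slice of the arrays and the single array expression drives them all at once.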
An extensive English language bibliography on graph theory and its applications
Bibliography on graph theory and its application