305 research outputs found
Parallel implementation of the finite element method on shared memory multiprocessors
PhD Thesis
The work presented in this thesis concerns parallel methods for finite element
analysis. The research has been funded by British Gas and some of the presented
material involves work on their software. Practical problems involving the finite
element method can use a large amount of processing power and the execution
times can be very large. It is consequently important to investigate the possibilities
for the parallel implementation of the method. The research has been carried out
on an Encore Multimax, a shared memory multiprocessor with 14 identical CPUs.
We firstly experimented on autoparallelising a large British Gas finite element
program (GASP4) using Encore's parallelising Fortran compiler (epf). The parallel
program generated by epf proved not to be efficient. The main reasons are
the complexity of the code and the small grain size of the parallelism. Since the program is hard
to analyse for the compiler at high levels, only small grain parallelism has been
inserted automatically into the code. This involves a great deal of low level
synchronisation, which produces large overheads and causes inefficiency. A detailed
analysis of the autoparallelised code has been made with a view to determining
the reasons for the inefficiency. Suggestions have also been made about writing
programs such that they are suitable for efficient autoparallelisation.
The finite element method consists of the assembly of a stiffness matrix and
the solution of a set of simultaneous linear equations. A sparse representation of
the stiffness matrix has been used to allow experimentation on large problems.
Parallel assembly techniques for the sparse representation have been developed.
Some of these methods have proved to be very efficient giving speed ups that are
near ideal.
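One way to make sparse assembly race-free, which the parallel assembly techniques above must address in some form, is to colour the elements so that no two elements of the same colour share a node; elements within one colour can then be assembled concurrently with no write conflicts. A minimal sketch in Python, assuming 1D linear elements and a dense matrix for clarity (the thesis itself uses a sparse representation, and the element type and function names here are illustrative):

```python
import numpy as np

def element_stiffness(h):
    # Stiffness matrix of a 1D linear element of length h
    # (a hypothetical, simplest-possible element type).
    return (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])

def assemble_colored(n_elems, h=1.0):
    # Colour the elements so that no two elements of one colour
    # share a node; in a 1D chain, the even-numbered and
    # odd-numbered elements form two such groups.
    n_nodes = n_elems + 1
    K = np.zeros((n_nodes, n_nodes))
    groups = [range(0, n_elems, 2), range(1, n_elems, 2)]
    for group in groups:
        # All elements in `group` touch disjoint rows and columns
        # of K, so this inner loop has no write conflicts and
        # could run in parallel, one element per processor.
        for e in group:
            ke = element_stiffness(h)
            dofs = (e, e + 1)          # global node numbers
            for a in range(2):
                for b in range(2):
                    K[dofs[a], dofs[b]] += ke[a, b]
    return K
```

With three elements this reproduces the familiar tridiagonal 1D stiffness matrix, identical to a serial assembly.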
For the solution phase, we have used the preconditioned conjugate gradient
method (PCG). An incomplete LU factorisation of the stiffness matrix with no
fill-in (ILU(0)) has been found to be an effective preconditioner. The factors can be
obtained at a low cost. We have parallelised all the steps of the PCG method. The
main bottleneck is the triangular solves (preconditioning operations) at each step.
Two parallel methods of triangular solution have been implemented. One is based
on level scheduling (row-oriented parallelism) and the other is a new approach
called independent columns (column-oriented parallelism). The algorithms have
been tested for row and red-black orderings of the nodal unknowns in the finite
element meshes considered.
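Level scheduling extracts row-oriented parallelism from a sparse triangular solve by grouping rows into levels: a row's level is one more than the highest level among the earlier rows it depends on, and all rows within one level can be solved concurrently. A hedged sketch in Python over a lower-triangular matrix in CSR storage (the data layout and names are illustrative, not the thesis implementation, and the per-level loop is written serially where a parallel machine would fan it out):

```python
def level_schedule(indptr, indices):
    # Compute level sets for a lower-triangular CSR matrix:
    # row i depends on every stored column j < i in row i.
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for j in indices[indptr[i]:indptr[i + 1]]:
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    buckets = {}
    for i, lv in enumerate(level):
        buckets.setdefault(lv, []).append(i)
    return [buckets[lv] for lv in sorted(buckets)]

def tri_solve_by_levels(indptr, indices, data, b):
    # Forward substitution, level by level.  Rows inside one
    # level are mutually independent, so the inner loop is the
    # parallelisable part.
    x = [0.0] * len(b)
    for rows in level_schedule(indptr, indices):
        for i in rows:                 # parallel across processors
            s = b[i]
            diag = 1.0
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    s -= data[k] * x[j]
            x[i] = s / diag
    return x
```

For a bidiagonal factor every row depends on its predecessor, so the levels are singletons and no parallelism is available; orderings such as red-black aim precisely at producing wide levels.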
The best speed ups obtained are 7.29 (on 12 processors) for level scheduling
and 7.11 (on 12 processors) for independent columns. Red-black ordering gives
rise to better parallel performance than row ordering in general. An analysis of
methods for the improvement of the parallel efficiency has been made.
Finite Element Algorithms and Data Structures on Graphical Processing Units
The finite element method (FEM) is one of the most commonly used techniques for the solution of partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM with special attention to the balance of computation and data movement. We present a GPU assembly algorithm that scales to arbitrary degree polynomials used as basis functions, at the expense of redundant computations. We show how the storage of the stiffness matrix affects the performance of both the assembly and the solution. We investigate two approaches: global assembly into the CSR and ELLPACK matrix formats and matrix-free algorithms, and show the trade-off between the amount of indexing data and stiffness data. We discuss the performance of different approaches in light of the implicit caches on Fermi GPUs and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms that are part of a conjugate gradient iteration and show that a matrix-free approach may be up to 2 times faster than global assembly approaches and up to 4 times faster than NVIDIA’s cuSPARSE library, depending on the preconditioner used.
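To illustrate the storage trade-off this abstract refers to: CSR keeps a row-pointer array plus one column index per stored non-zero, while ELLPACK pads every row to a common width so that indices and values form regular rectangular arrays, which suits the GPU's regular memory access at the cost of padded entries. A small serial sketch of both multiplication kernels in plain Python (for clarity only; the paper's kernels are GPU code):

```python
def csr_spmv(indptr, indices, data, x):
    # y[i] = sum of data[k] * x[indices[k]] over the stored
    # entries of row i; rows have varying lengths.
    return [sum(data[k] * x[indices[k]]
                for k in range(indptr[i], indptr[i + 1]))
            for i in range(len(indptr) - 1)]

def ell_spmv(cols, vals, x):
    # ELLPACK: every row padded to the same width with zero
    # values (and an arbitrary valid column index), giving a
    # rectangular, regularly strided layout.
    return [sum(v * x[c] for c, v in zip(crow, vrow))
            for crow, vrow in zip(cols, vals)]
```

Both kernels compute the same product; the difference is purely in layout, which is exactly why the padding overhead versus access regularity trade-off matters on a GPU.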
Strategies for producing fast finite element solutions of the incompressible Navier-Stokes equations on massively parallel architectures
To take advantage of the inherent flexibility of the finite element method in solving for flows within complex geometries, it is necessary to produce efficient implementations of the method. Segregation of the solution scheme and the use of parallel computers are two ways of doing this.
Here, the optimisation of a sequential segregated finite element algorithm is discussed, together with the various strategies by which this is done. Furthermore, the implications of parallelising the code onto a massively parallel computer, the MasPar, are explored.
This machine is of Single Instruction Multiple Data type, so modifications to the computer code have been necessary. A general methodology for the implementation of finite element programs is presented, based on projecting the levels of data within the algorithm into a form which is ideal for parallelisation. Application of this methodology, in a high level language, has resulted in a code which runs at just under 30 MFlops (in double precision). The computations are performed with minimal inter-processor communication, and this represents an efficiency of 20% of the theoretical peak speed. Even though only high level language constructs have been used, this efficiency is comparable with other work using low level constructs on machines of this architecture. In particular, the use of data parallel arrays and the utilisation of the non-unique machine specific features of the computer architecture have produced an efficient, fast program.
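The data-parallel style described here maps per-node or per-element work onto whole-array operations that an SIMD machine executes in lock step, rather than explicit loops. A hypothetical sketch of the idea using NumPy array expressions (not the MasPar code, which used the machine's own high level language; the residual computation shown is an invented example):

```python
import numpy as np

def elementwise_residual(u, f, h):
    # 1D Poisson residual at interior nodes,
    #   r_i = f_i - (u_{i-1} - 2 u_i + u_{i+1}) / h**2,
    # written as one whole-array expression: every interior
    # node is updated "simultaneously", which is exactly the
    # form an SIMD array processor executes in lock step.
    return f[1:-1] - (u[:-2] - 2.0 * u[1:-1] + u[2:]) / h**2
```

The point is that the data layout, not the control flow, carries the parallelism: each processing element owns a slice of the arrays and the single array expression drives them all at once.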
An extensive English language bibliography on graph theory and its applications
Bibliography on graph theory and its application