Enhancements to the STAGS computer code
The power of the STAGS family of programs was greatly enhanced. Members of the family include STAGS-C1 and RRSYS. As a result of the improvements implemented, it is now possible to trace the full collapse of a structural system, up to and beyond critical points where its resistance to the applied loads vanishes or suddenly changes. This also includes the important class of problems where a multiplicity of solutions exists at a given point (bifurcation), and where until now no solution could be obtained along any alternate (secondary) load path with any standard production finite element code.
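The difficulty the abstract describes — continuing past a critical point where load control breaks down — can be illustrated with a toy arc-length (Riks-type) continuation on a one-degree-of-freedom snap-through problem. This is a generic textbook scheme under assumed parameters, not STAGS's actual solution procedure:

```python
# Toy 1-DOF snap-through: internal force f(u) = u**3 - 3*u under load lam,
# equilibrium r(u, lam) = f(u) - lam = 0.  Load control (stepping lam) fails
# at the limit point u = -1, lam = 2; arc-length control parameterizes the
# path by arc length and traces straight through it.
import math

def f(u):  return u**3 - 3*u      # internal force (hypothetical example)
def df(u): return 3*u**2 - 3      # its tangent stiffness

def arc_length_trace(u0=-2.0, ds=0.1, steps=60, tol=1e-10):
    """Riks-type (normal-plane) arc-length continuation, predictor-corrector."""
    u, lam = u0, f(u0)
    path = [(u, lam)]
    t = (1.0, df(u0))                         # initial tangent direction
    n = math.hypot(*t); t = (t[0]/n, t[1]/n)
    for _ in range(steps):
        up, lp = u + ds*t[0], lam + ds*t[1]   # predictor step along tangent
        for _ in range(50):                   # Newton corrector
            r = f(up) - lp                                     # equilibrium
            c = (up - u)*t[0] + (lp - lam)*t[1] - ds           # arc constraint
            # solve [[df, -1], [t0, t1]] @ [du, dl] = [-r, -c] by Cramer's rule
            det = df(up)*t[1] + t[0]
            du = (-r*t[1] - c) / det
            dl = (-df(up)*c + r*t[0]) / det
            up, lp = up + du, lp + dl
            if abs(r) < tol and abs(c) < tol:
                break
        t = ((up - u)/ds, (lp - lam)/ds)      # new tangent from the secant
        n = math.hypot(*t); t = (t[0]/n, t[1]/n)
        u, lam = up, lp
        path.append((u, lam))
    return path
```

Every returned point satisfies equilibrium, and the trace passes the limit point at u = -1 where a pure load-stepping scheme would diverge.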
Parallel unstructured solvers for linear partial differential equations
This thesis presents the development of a parallel algorithm to solve symmetric
systems of linear equations and the computational implementation of a parallel
partial differential equation solver for unstructured meshes. The proposed
method, called the distributive conjugate gradient (DCG), is based on a single-level
domain decomposition method and the conjugate gradient method to obtain a
highly scalable parallel algorithm.
An overview of methods for the discretization of domains and partial differential
equations is given. The partitioning and refinement of meshes are discussed, and
the formulation of the weighted residual method in two and three dimensions is
presented. Some methods for solving systems of linear equations are introduced,
highlighting the conjugate gradient method and domain decomposition
methods. A parallel unstructured PDE solver is proposed and its implementation
presented. Emphasis is given to the data partitioning adopted, and the
scheme used for communication among adjacent subdomains is explained. A series
of experiments on processor scalability is also reported.
The derivation and parallelization of DCG are presented and the method is validated
through numerical experiments. The method's capabilities and limitations
were investigated by solving the Poisson equation with various source
terms. The experimental results obtained using the parallel solver developed as
part of this work show that the algorithm presented is accurate and highly scalable,
achieving roughly linear parallel speed-up in many of the cases tested.
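For reference, the sequential conjugate gradient iteration that a domain-decomposition solver like DCG builds on can be sketched as follows; this is the standard textbook method for symmetric positive definite systems, not the distributive variant developed in the thesis:

```python
# Plain conjugate gradient for SPD systems A x = b -- the sequential kernel
# that single-level domain-decomposition solvers distribute across subdomains.
# Pure-Python sketch; A is a dense list of rows.
def cg(A, b, tol=1e-10, maxiter=1000):
    n = len(b)
    x = [0.0] * n
    r = b[:]                              # residual r = b - A x  (x = 0)
    p = r[:]                              # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(maxiter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:            # converged on residual norm
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

On the 1D Poisson matrix (tridiagonal with 2 on the diagonal and -1 off it), `cg` recovers the exact solution in at most n iterations; the distributed version replaces the global matrix-vector product and inner products with per-subdomain work plus communication among adjacent subdomains.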
Algorithms in Lattice QCD
The enormous computing resources that large-scale simulations in Lattice QCD
require will continue to test the limits of even the largest supercomputers into
the foreseeable future. The efficiency of such simulations will therefore concern
practitioners of lattice QCD for some time to come.
I begin with an introduction to those aspects of lattice QCD essential to the
remainder of the thesis, and follow with a description of the Wilson fermion
matrix M, an object which is central to my theme.
The principal bottleneck in Lattice QCD simulations is the solution of linear
systems involving M, and this topic is treated in depth. I compare some of the
more popular iterative methods, including Minimal Residual, Conjugate Gradient
on the Normal Equation, Bi-Conjugate Gradient, QMR, BiCGSTAB and
BiCGSTAB2, and then turn to a study of block algorithms, a special class of iterative
solvers for systems with multiple right-hand sides. Included in this study
are two block algorithms which had not previously been applied to lattice QCD.
The next chapters are concerned with a generalised Hybrid Monte Carlo algorithm
(GHMC) for QCD simulations involving dynamical quarks. I focus squarely
on the efficient and robust implementation of GHMC, and describe some tricks
to improve its performance. A limited set of results from HMC simulations at
various parameter values is presented.
A treatment of the non-hermitian Lanczos method and its application to the
eigenvalue problem for M rounds off the theme of large-scale matrix computations.
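Of the solvers compared above, BiCGSTAB is the easiest to sketch compactly. The following is a textbook pure-Python version for a small real nonsymmetric system, purely illustrative: lattice QCD would apply it to the large complex Wilson matrix M, and the thesis's block algorithms extend this idea to multiple right-hand sides at once:

```python
# BiCGSTAB for general (non-Hermitian) systems A x = b.  Textbook sketch in
# real arithmetic for brevity; the QCD application uses complex arithmetic.
def dot(u, v):     return sum(a * b for a, b in zip(u, v))
def axpy(a, u, v): return [a * ui + vi for ui, vi in zip(u, v)]  # a*u + v
def matvec(A, u):  return [dot(row, u) for row in A]

def bicgstab(A, b, tol=1e-12, maxiter=200):
    n = len(b)
    x = [0.0] * n
    r = b[:]                        # r = b - A x  with x = 0
    rhat = r[:]                     # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(maxiter):
        if dot(r, r) < tol * tol:
            return x
        rho_new = dot(rhat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = axpy(beta, [pi - omega * vi for pi, vi in zip(p, v)], r)
        v = matvec(A, p)
        alpha = rho_new / dot(rhat, v)
        s = axpy(-alpha, v, r)                 # intermediate residual
        if dot(s, s) < tol * tol:
            return axpy(alpha, p, x)
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)          # stabilizing step length
        x = axpy(alpha, p, axpy(omega, s, x))
        r = axpy(-omega, t, s)
        rho = rho_new
    return x
```

Each iteration costs two matrix-vector products, which is why block variants that amortize those products over several right-hand sides are attractive when, as in lattice QCD, many systems share the same matrix.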
How to square floats accurately and efficiently on the ST231 integer processor
We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how the specific properties of squaring can be exploited in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithm descriptions are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from STMicroelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.
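The structure being exploited can be sketched in portable code: a binary32 square computed entirely with integer operations, rounded to nearest-even. This simplified sketch handles only normal inputs whose square is a finite normal number (no subnormals, overflow, infinities, or NaNs) and one rounding mode, whereas the paper's algorithms cover the full format and all rounding modes with ILP-oriented C:

```python
# Correctly rounded binary32 squaring via integer arithmetic (round to
# nearest even).  Simplified sketch: normal inputs, normal results only.
import struct

def f32_bits(x): return struct.unpack('<I', struct.pack('<f', x))[0]
def bits_f32(u): return struct.unpack('<f', struct.pack('<I', u))[0]

def square_f32(x):
    u = f32_bits(x)
    E = (u >> 23) & 0xFF
    assert 0 < E < 0xFF, "normal inputs only in this sketch"
    m = (1 << 23) | (u & 0x7FFFFF)     # significand in [2**23, 2**24)
    e = E - 127                        # unbiased exponent; sign is irrelevant
    sq = m * m                         # exact 46..48-bit integer product
    # normalize so the leading bit of the result lands at position 23
    top = sq >> 47                     # 1 iff product is in [2**47, 2**48)
    shift = 24 if top else 23
    res_e = 2 * e + top
    sig = sq >> shift
    round_bit = (sq >> (shift - 1)) & 1
    sticky = (sq & ((1 << (shift - 1)) - 1)) != 0
    if round_bit and (sticky or (sig & 1)):   # round to nearest, ties to even
        sig += 1
        if sig >> 24:                         # rounding carried out of 24 bits
            sig >>= 1
            res_e += 1
    resE = res_e + 127
    assert 0 < resE < 0xFF, "result must be normal in this sketch"
    return bits_f32((resE << 23) | (sig & 0x7FFFFF))
```

The key property is visible here: the product `m * m` is known to lie in a two-octave range, so normalization is a single one-bit test rather than a full leading-zero count, and no sign logic is needed; the paper shows how such facts shrink latency on an integer VLIW target like the ST231.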