479 research outputs found

    Comparing Julia to Performance Portable Parallel Programming Models for HPC

    Enhancements to the STAGS computer code

    The power of the STAGS family of programs was greatly enhanced. Members of the family include STAGS-C1 and RRSYS. As a result of the improvements, it is now possible to address the full collapse of a structural system, up to and beyond critical points where its resistance to the applied loads vanishes or suddenly changes. This includes the important class of problems where multiple solutions exist at a given point (bifurcation), and where until now no solution could be obtained along any alternate (secondary) load path with any standard production finite element code.

    Parallel unstructured solvers for linear partial differential equations

    This thesis presents the development of a parallel algorithm to solve symmetric systems of linear equations and the computational implementation of a parallel partial differential equations solver for unstructured meshes. The proposed method, called distributive conjugate gradient (DCG), is based on a single-level domain decomposition method and the conjugate gradient method to obtain a highly scalable parallel algorithm. An overview of methods for the discretization of domains and partial differential equations is given. The partition and refinement of meshes is discussed, and the formulation of the weighted residual method in two and three dimensions is presented. Some of the methods to solve systems of linear equations are introduced, highlighting the conjugate gradient method and domain decomposition methods. A parallel unstructured PDE solver is proposed and its actual implementation presented. Emphasis is given to the data partition adopted, and the scheme used for communication among adjacent subdomains is explained. A series of experiments in processor scalability is also reported. The derivation and parallelization of DCG are presented and the method is validated through numerical experiments. The method's capabilities and limitations were investigated by solving the Poisson equation with various source terms. The experimental results obtained using the parallel solver developed as part of this work show that the algorithm presented is accurate and highly scalable, achieving roughly linear parallel speed-up in many of the cases tested.

    Algorithms in Lattice QCD

    The enormous computing resources that large-scale simulations in Lattice QCD require will continue to test the limits of even the largest supercomputers into the foreseeable future. The efficiency of such simulations will therefore concern practitioners of lattice QCD for some time to come. I begin with an introduction to those aspects of lattice QCD essential to the remainder of the thesis, and follow with a description of the Wilson fermion matrix M, an object which is central to my theme. The principal bottleneck in Lattice QCD simulations is the solution of linear systems involving M, and this topic is treated in depth. I compare some of the more popular iterative methods, including Minimal Residual, Conjugate Gradient on the Normal Equation, Bi-Conjugate Gradient, QMR, BiCGSTAB and BiCGSTAB2, and then turn to a study of block algorithms, a special class of iterative solvers for systems with multiple right-hand sides. Included in this study are two block algorithms which had not previously been applied to lattice QCD. The next chapters are concerned with a generalised Hybrid Monte Carlo algorithm (GHMC) for QCD simulations involving dynamical quarks. I focus squarely on the efficient and robust implementation of GHMC, and describe some tricks to improve its performance. A limited set of results from HMC simulations at various parameter values is presented. A treatment of the non-hermitian Lanczos method and its application to the eigenvalue problem for M rounds off the theme of large-scale matrix computations.

    How to square floats accurately and efficiently on the ST231 integer processor

    We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how the specific properties of squaring can be exploited in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithm descriptions are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient code for targets like the ST231 VLIW integer processor from STMicroelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.