57 research outputs found

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

    Exploring On-Node Parallelism with Neutral, a Monte Carlo Neutral Particle Transport Mini-App

    Get PDF

    Task-Based Parallelism for General Purpose Graphics Processing Units and Hybrid Shared-Distributed Memory Systems.

    Get PDF
    Modern computers can no longer rely on increasing CPU speed to improve their performance as further increasing the clock speed of single CPU machines will make them too difficult to cool, or the cooling require too much power. Hardware manufacturers must now use parallelism to drive performance to the levels expected by Moore's Law. More recently, High Performance Computers (HPCs) have adopted heterogeneous architectures, i.e.having multiple types of computing hardware (such as CPU & GPU) on a single node. These architectures allow the opportunity to extract performance from non-CPU architectures, while still providing a general purpose platform for less modern codes. In this thesis we investigate Task-Based Parallelism, a shared-memory paradigm for parallel computing. Task-Based Parallelism requires the programmer to divide the work into chunks (known as tasks) and describe the data dependencies between tasks. The tasks are then scheduled amongst the threads automatically by the task-based scheduler. In this thesis we examine how Task-Based Parallelism can be used with GPUs and hybrid shared-distributed memory, in particular we examine how data transfer can be incorporated into a task-based framework, either to the GPU from the host, or between separate nodes. We also examine how we can use the task graph to load balance the computation between multiple nodes or GPUs. We test our task-based methods with Molecular Dynamics, a tiled QR decomposition, and a new task-based Barnes-Hut algorithm. These are problems with different dependency structures which tests the ability of the scheduler to handle a variety of different types of computation. The results with these testcases show improved performance when we use asynchronous data transfer to and from the GPU, and show reasonable parallel efficiency over a small number of MPI ranks

    Space Decomposition Based Parallelisation Solutions for the Combined Finite-Discrete Element Method in 2D.

    Get PDF
    PhDThe Combined Finite-Discrete Element Method (FDEM), originally invented by Munjiza, has become a tool of choice for problems of discontinua, where particles are deformable and can fracture or fragment. The downside of FDEM is that it is CPU intensive and, as a consequence, it is difficult to analyse large scale problems on sequential CPU hardware and parallelisation becomes necessary. In this work a novel approach for parallelisation of the combined finite-discrete element method (FDEM) in 2D aimed at clusters and desktop computers is developed. Dynamic domain decomposition-based parallelisation solvers covering all aspects of FDEM have been developed. These have been implemented into the open source Y2D software package by using a Message-Passing Interface (MPI) and have been tested on a PC cluster. The overall performance and scalability of the parallel code has been studied using numerical examples. The state of the art, the proposed solvers and the test results are described in the thesis in detail.

    Flexible high performance agent based modelling on graphics card hardware

    Get PDF
    Agent Based Modelling is a technique for computational simulation of complex interacting systems, through the specification of the behaviour of a number of autonomous individuals acting simultaneously. This is a bottom up approach, in contrast with the top down one of modelling the behaviour of the whole system through dynamic mathematical equations. The focus on individuals is considerably more computationally demanding, but provides a natural and flexible environment for studying systems demonstrating emergent behaviour. Despite the obvious parallelism, traditionally frameworks for Agent Based Modelling fail to exploit this and are often based on highly serialised mobile discrete agents. Such an approach has serious implications, placing stringent limitations on both the scale of models and the speed at which they may be simulated. Serial simulation frameworks are also unable to exploit multiple processor architectures which have become essential in improving overall processing speed. This thesis demonstrates that it is possible to use the parallelism of graphics card hardware as a mechanism for high performance Agent Based Modelling. Such an approach is in contrast with alternative high performance architectures, such as distributed grids and specialist computing clusters, and is considerably more cost effective. The use of consumer hardware makes the techniques described available to a wide range of users, and the use of automatically generated simulation code abstracts the process of mapping algorithms to the specialist hardware. This approach avoids the steep learning curve associated with the graphics card hardware's data parallel architecture, which has previously limited the uptake of this emerging technology. The performance and flexibility of this approach are considered through the use of benchmarking and case studies. The resulting speedup and locality of agent data within the graphics processor also allow real time visualisation of computationally and demanding high population models

    Development and application of real-time and interactive software for complex system

    Get PDF
    Soft materials have attracted considerable interest in recent years for predicting the characteristics of phase separation and self-assembly in nanoscale structures. A popular method for demonstrating and simulating the dynamic behaviour of particles (e.g. particle tracking) and to consider effects of simulation parameters is cell dynamic simulation (CDS). This is a cellular computerisation technique that can be used to investigate different aspects of morphological topographies of soft material systems. The acquisition of quantitative data from particles is a critical requirement in order to obtain a better understanding and of characterising their dynamic behaviour. To achieve this objective particle tracking methods considering quantitative data and focusing on different properties and components of particles is essential. Despite the availability of various types of particle tracking used in experimental work, there is no method available to consider uniform computational data. In order to achieve accurate and efficient computational results for cell dynamic simulation method and particle tracking, two factors are essential: computing/calculating time-scale and simulation system size. Consequently, finding available computing algorithms and resources such as sequential algorithm for implementing a complex technique and achieving precise results is critical and rather expensive. Therefore, it is highly desirable to consider a parallel algorithm and programming model to solve time-consuming and massive computational processing issues. Hence, the gaps between the experimental and computational works and solving time consuming for expensive computational calculations need to be filled in order to investigate a uniform computational technique for particle tracking and significant enhancements in speed and execution times. The work presented in this thesis details a new particle tracking method for integrating diblock copolymers in the form of spheres with a shear flow and a novel designed GPU-based parallel acceleration approach to cell dynamic simulation (CDS). In addition, the evaluation of parallel models and architectures (CPUs and GPUs) utilising the mixtures of application program interface, OpenMP and programming model, CUDA were developed. Finally, this study presents the performance enhancements achieved with GPU-CUDA of approximately ~2 times faster than multi-threading implementation and 13~14 times quicker than optimised sequential processing for the CDS computations/workloads respectively

    Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

    Get PDF
    Pipelined wavefront computations are an ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that during procurement platforms are chosen best suited to these codes, there has been considerable research in analysing and evaluating their operational performance. Wavefront codes exhibit complex computation, communication, synchronisation patterns, and as a result there exist a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and reevaluated. In this thesis, we address the performance modelling and optimisation of this class of application, as a whole. This differs from previous studies in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems. The performance model is based on the LogGP parameterisation, and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among different wavefront application codes, providing a succinct summary of the operations for each application and insights into alternative wavefront application design. The models are applied to three industry-strength wavefront codes and are validated on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster. Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations and excellent qualitative accuracy. The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and resulting performance; (2) evaluating hardware platform issues including platform sizing, configuration; (3) exploring hardware platform design alternatives and system procurement and, (4) considering possible code and algorithmic optimisations
    corecore