193 research outputs found

    OpenFPM: A scalable environment for particle and particle-mesh codes on parallel computers

    Scalable and efficient numerical simulations continue to gain importance, as computation is a firmly established tool of discovery, together with theory and experiment. Meanwhile, the performance of computing hardware continues to grow as it becomes increasingly heterogeneous, enabling simulations of ever more complex models. However, efficiently implementing scalable codes on heterogeneous, distributed hardware systems becomes the bottleneck. This bottleneck can be alleviated by intermediate software layers that provide higher-level abstractions closer to the problem domain, allowing the computational scientist to focus on the simulation. Here, we present OpenFPM, an open and scalable framework that provides an abstraction layer for numerical simulations using particles and/or meshes. OpenFPM provides a transparent and scalable infrastructure for shared-memory and distributed-memory implementations of particles-only and hybrid particle-mesh simulations of both discrete and continuous models, as well as non-simulation codes. This infrastructure is complemented with frequently used numerical routines, as well as interfaces to third-party libraries. This thesis presents the architecture and design of OpenFPM, details the underlying abstractions, and benchmarks the framework in applications ranging from Smoothed-Particle Hydrodynamics (SPH) to Molecular Dynamics (MD), Discrete Element Methods (DEM), Vortex Methods, stencil codes, high-dimensional Monte Carlo sampling (CMA-ES), and Reaction-Diffusion solvers, comparing it to the current state of the art and to existing software frameworks.
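    The hybrid particle-mesh pattern described in the abstract can be illustrated with a minimal sketch: particle quantities are deposited onto a mesh, a mesh-side operation is applied, and the result is interpolated back to the particles. The sketch below is plain C++ with hypothetical names; it is not the OpenFPM API.

```cpp
// Minimal particle-mesh sketch (hypothetical names, not the OpenFPM API):
// particles deposit a quantity onto a 1D mesh with nearest-grid-point
// weights, a mesh-side stencil runs, and particles read the result back.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Particle { double x; double q; double field; };

int main() {
    const int    n_cells = 64;
    const double h       = 1.0 / n_cells;            // mesh spacing on [0, 1)

    std::vector<Particle> particles;
    for (int i = 0; i < 256; ++i)
        particles.push_back({ (i + 0.5) / 256.0, 1.0 / 256.0, 0.0 });

    // Particle -> mesh: deposit each particle's quantity into its cell.
    std::vector<double> mesh(n_cells, 0.0);
    for (const auto& p : particles) {
        int c = std::min(n_cells - 1, static_cast<int>(p.x / h));
        mesh[c] += p.q / h;
    }

    // Mesh-side operation: a 3-point smoothing stencil (stand-in for a solver).
    std::vector<double> smoothed(n_cells);
    for (int c = 0; c < n_cells; ++c) {
        int l = (c + n_cells - 1) % n_cells, r = (c + 1) % n_cells;
        smoothed[c] = 0.25 * mesh[l] + 0.5 * mesh[c] + 0.25 * mesh[r];
    }

    // Mesh -> particle: interpolate the field back to each particle.
    for (auto& p : particles) {
        int c = std::min(n_cells - 1, static_cast<int>(p.x / h));
        p.field = smoothed[c];
    }
    std::printf("field at first particle: %f\n", particles.front().field);
}
```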

    Verified lifting of stencil computations

    This paper demonstrates a novel combination of program synthesis and verification to lift stencil computations from low-level Fortran code to a high-level summary expressed using a predicate language. The technique is sound and mostly automated, and leverages counter-example guided inductive synthesis (CEGIS) to find provably correct translations. Lifting existing code to a high-performance description language has a number of benefits, including maintainability and performance portability. For example, our experiments show that the lifted summaries can enable domain-specific compilers to do a better job of parallelization than an off-the-shelf compiler working on the original code, and can even support fully automatic migration to hardware accelerators such as GPUs. We have implemented verified lifting in a system called STNG and have evaluated it using microbenchmarks, mini-apps, and real-world applications. We demonstrate the benefits of verified lifting by first automatically summarizing Fortran source code into a high-level predicate language, and subsequently translating the lifted summaries into Halide, with the translated code achieving median performance speedups of 4.1X and up to 24X for non-trivial stencils as compared to the original implementation. (Funding: United States Department of Energy, Office of Science, Awards DE-SC0008923 and DE-SC0005288.)
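    The central idea, lifting an imperative loop nest to an equivalent per-point summary that a domain-specific compiler can then re-schedule, can be sketched as follows. This is a schematic C++ illustration with hypothetical names; it is neither STNG output nor the Halide API.

```cpp
// Schematic illustration of "lifting": an imperative loop nest and an
// equivalent per-point summary of the same 1D 3-point stencil.
// Hypothetical names; not STNG output and not the Halide API.
#include <cassert>
#include <cstddef>
#include <vector>

// Low-level form: explicit loop, explicit indexing, mutable output buffer.
void stencil_loop(const std::vector<double>& in, std::vector<double>& out) {
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

// Lifted form: a pure function of the input array and one output index.
// "out[i] == summary(in, i) for every interior i" is the kind of
// postcondition a synthesis/verification tool would establish.
double summary(const std::vector<double>& in, std::size_t i) {
    return (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

int main() {
    std::vector<double> in{1, 2, 3, 4, 5}, out(5, 0.0);
    stencil_loop(in, out);
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        assert(out[i] == summary(in, i));   // the lifted summary matches
}
```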

    Doctor of Philosophy

    The goal of this dissertation is to improve flood risk management by enhancing the computational capability of two-dimensional models and incorporating data and parameter uncertainty to represent flood risk more accurately. Improvement of computational performance is accomplished by using the Graphics Processing Unit (GPU) approach, programmed in NVIDIA's Compute Unified Device Architecture (CUDA), to create a new two-dimensional hydrodynamic model, Flood2D-GPU. The model, based on the shallow water equations, is designed to execute simulations faster than the same code programmed using a serial approach (i.e., using a Central Processing Unit (CPU)). Testing the code against an identical CPU-based version demonstrated the improved computational efficiency of the GPU-based version (an approximate speedup of more than 80 times). Given the substantial computational efficiency of Flood2D-GPU, a new Monte Carlo based flood risk modeling framework was created. The framework operates by performing many Flood2D-GPU simulations using randomly sampled model parameters and input variables. The Monte Carlo flood risk modeling framework is demonstrated in this dissertation by simulating the flood risk associated with a 1% annual probability flood event on the Swannanoa River in Buncombe County near Asheville, North Carolina. The Monte Carlo approach is able to represent a wide range of possible scenarios, thus leading to the identification of areas outside a single simulation's inundation extent that are susceptible to flood hazards. Further, the single-simulation results underestimated the degree of flood hazard for the case study region when compared to the flood hazard map produced by the Monte Carlo approach. The Monte Carlo flood risk modeling framework is also used to determine the relative benefits of flood management alternatives for flood risk reduction. The objective of the analysis is to investigate the possibility of identifying specific annual exceedance probability flood events that will have greater benefits in terms of annualized flood risk reduction compared to an arbitrarily selected discrete annual probability event. To test the hypothesis, a study was conducted on the Swannanoa River to determine the distribution of annualized risk as a function of average annual probability. Simulations of flow rates sampled from a continuous flow distribution provided the necessary range of annual probability events. The results showed a variation in annualized risk as a function of annual probability, and, as hypothesized, a maximum annualized risk reduction could be identified for a specified annual probability. For the Swannanoa case study, the continuous flow distribution suggested targeting flood proofing to control the 12% exceedance probability event to maximize the reduction of annualized risk. This suggests that the arbitrary use of a specified risk of 1% exceedance may not, in some cases, be the most efficient allocation of resources to reduce annualized risk.
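    The Monte Carlo framework described above repeatedly runs the hydrodynamic model with randomly sampled inputs and aggregates the resulting inundation maps into per-cell exceedance probabilities. A minimal sketch of that outer loop is shown below; the toy depth function and the parameter distributions are hypothetical stand-ins, not Flood2D-GPU.

```cpp
// Sketch of a Monte Carlo flood-risk outer loop: sample an inflow and a
// roughness parameter, run a (toy) flood model, and accumulate per-cell
// inundation probabilities. The model is a placeholder, not Flood2D-GPU.
#include <cstdio>
#include <random>
#include <vector>

// Toy stand-in for the hydrodynamic model: depth decays with distance.
std::vector<double> toy_flood_model(double inflow, double roughness, int n_cells) {
    std::vector<double> depth(n_cells);
    for (int c = 0; c < n_cells; ++c)
        depth[c] = inflow * roughness / (1.0 + c);
    return depth;
}

int main() {
    const int    n_cells = 10, n_samples = 10000;
    const double flood_threshold = 0.1;   // depth (m) counted as "inundated"

    std::mt19937 rng(42);
    std::lognormal_distribution<double>    inflow_dist(0.0, 0.5);      // sampled inflow
    std::uniform_real_distribution<double> roughness_dist(0.02, 0.06); // sampled roughness

    std::vector<double> inundation_count(n_cells, 0.0);
    for (int s = 0; s < n_samples; ++s) {
        auto depth = toy_flood_model(inflow_dist(rng), roughness_dist(rng), n_cells);
        for (int c = 0; c < n_cells; ++c)
            if (depth[c] > flood_threshold) inundation_count[c] += 1.0;
    }
    for (int c = 0; c < n_cells; ++c)
        std::printf("cell %d: inundation probability %.3f\n",
                    c, inundation_count[c] / n_samples);
}
```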

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high-performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and its performance), and achieving high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain-specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation concisely, independently of hardware architecture-specific details. It thus increases programmer productivity by relieving him or her of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain-specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning, which essentially amounts to solving an integer programming problem in which the objective function is the code's performance as a function of the parameter configuration, the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment, and a seismic application. These examples demonstrate the framework's flexibility and its ability to produce high-performance code.
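    Both ingredients named in the abstract, the stencil update itself and auto-tuning as a search over implementation parameters, fit into a short sketch. The code below is illustrative C++ with a single tuning parameter (the tile width); it is not the Patus DSL or its generated code.

```cpp
// A 2D 5-point Jacobi stencil plus a toy auto-tuning loop that searches one
// implementation parameter (the tile width) for the fastest variant.
// Illustrative only; not the Patus DSL or its generated code.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

void jacobi_tiled(const std::vector<double>& in, std::vector<double>& out,
                  int n, int tile) {
    for (int jj = 1; jj < n - 1; jj += tile)                 // tile over columns
        for (int i = 1; i < n - 1; ++i)
            for (int j = jj; j < std::min(jj + tile, n - 1); ++j)
                out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                                         in[i * n + j - 1]   + in[i * n + j + 1]);
}

int main() {
    const int n = 1024;
    std::vector<double> in(n * n, 1.0), out(n * n, 0.0);

    // Auto-tuning as search: time each candidate tile width and keep the best
    // (a brute-force stand-in for Patus's search strategies).
    int best_tile = 0;
    double best_time = 1e30;
    for (int tile : {16, 32, 64, 128, 256}) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 5; ++rep) jacobi_tiled(in, out, n, tile);
        double t = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        if (t < best_time) { best_time = t; best_tile = tile; }
    }
    std::printf("best tile width: %d (%.4f s for 5 sweeps)\n", best_tile, best_time);
}
```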

    Exploiting BSP Abstractions for Compiler Based Optimizations of GPU Applications on multi-GPU Systems

    Graphics Processing Units (GPUs) are accelerators for computers and provide massive amounts of computational power and bandwidth for amenable applications. While effectively utilizing an individual GPU already requires a high level of skill, effectively utilizing multiple GPUs introduces completely new types of challenges. This work sets out to investigate how the hierarchical execution model of GPUs can be exploited to simplify the utilization of such multi-GPU systems. The investigation starts with an analysis of the memory access patterns exhibited by applications from common GPU benchmark suites. Memory access patterns are collected using custom instrumentation, and a simple simulation then analyzes the patterns and identifies implicit communication across the different levels of the execution hierarchy. The analysis reveals that for most GPU applications memory accesses are highly localized, and that there exists a way to partition the workload so that the communication volume grows more slowly than the aggregate bandwidth as the number of GPUs grows. Next, an application model based on Z-polyhedra is derived that formalizes the distribution of work across multiple GPUs and allows the identification of data dependencies. The model is then used to implement a prototype compiler that consumes single-GPU programs and produces executables that distribute GPU workloads across all available GPUs in a system. It uses static analysis to identify memory access patterns, and polyhedral code generation in combination with a dynamic tracking system to efficiently resolve data dependencies. The prototype is implemented as an extension to the LLVM/Clang compiler and is published in full source. The prototype compiler is then evaluated using a set of benchmark applications. While the prototype is limited in its applicability by technical issues, it provides impressive speedups of up to 12.4x on 16 GPUs for amenable applications. An in-depth analysis of the application runtime reveals that dependency resolution takes up less than 10% of the runtime, often significantly less. A discussion puts the work into context by presenting and differentiating related work, reflecting critically on the work itself, and giving an outlook on aspects that could be explored in future research. The work concludes with a summary and a closing opinion.
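    The key observation, that highly localized memory accesses allow a workload partitioning whose communication volume grows more slowly than the aggregate bandwidth, can be illustrated with a small halo-volume calculation. The sketch below assumes a 2D block partitioning of a square grid; it is not the prototype compiler's analysis.

```cpp
// Why localized access patterns distribute well: for a 2D block partitioning
// of an N x N grid with stencil radius r, the total halo (communication)
// volume grows roughly with sqrt(P), while the aggregate GPU bandwidth grows
// with P, so communication per unit of bandwidth shrinks as GPUs are added.
// Illustrative only; not the prototype compiler's analysis.
#include <cmath>
#include <cstdio>

int main() {
    const double n = 8192;   // grid edge length (cells)
    const double r = 2;      // stencil radius (halo width in cells)

    for (int p : {1, 4, 16, 64}) {
        double q          = std::sqrt(static_cast<double>(p));   // p assumed square
        double halo_cells = 2.0 * 2.0 * r * n * (q - 1.0);       // cuts in both dimensions
        std::printf("%2d GPUs: halo %8.0f cells total, %8.0f cells per unit of "
                    "aggregate bandwidth\n", p, halo_cells, halo_cells / p);
    }
}
```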

    Improving Data Locality in Distributed Processing of Multi-Channel Remote Sensing Data with Potentially Large Stencils

    Distributing the processing of multi-channel remote sensing data with potentially large stencils is a difficult challenge. The goal of this master's thesis was to evaluate and investigate the performance impact of such processing on a distributed system, and to determine whether the total execution time can be improved by exploiting data locality or memory alignment. The thesis also gives a brief overview of the current state of the art in distributed remote sensing data processing and points out why distributed computing will become more important for it in the future. For the experimental part of this thesis, an application to process huge arrays on a distributed system was implemented with DASH, a C++ Template Library for Distributed Data Structures with Support for Hierarchical Locality for High Performance Computing and Data-Driven Science. Based on the first results, an optimization model was developed with the goal of reducing network traffic when initializing a distributed data structure and executing computations on it with potentially large stencils. Furthermore, software was implemented to estimate the memory layouts with the lowest network communication cost for a given multi-channel remote sensing processing workflow. The layouts chosen by this optimization were then executed and evaluated. The results show that it is possible to improve the initialization speed of a large image by 25% by taking brick locality into account. The optimization model also generates valid decisions for the initialization of the PGAS memory layouts. However, for a real implementation the optimization model has to be modified to reflect implementation-dependent sources of overhead. This thesis presents approaches to challenges of distributed computing that can be used for real-world remote sensing imaging applications, and contributes towards solving the challenges of the modern Big Data world for future scientific data exploitation.
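    The layout decision made by such an optimization model can be illustrated by comparing, for a fixed number of processes, how much halo data two candidate distributions of an image would move for a given stencil radius. The sketch below is a toy cost comparison (row-wise slabs versus square blocks); it is neither the DASH API nor the thesis's optimization model.

```cpp
// Toy comparison of two distributions of an N x N image over P processes:
// row-wise slabs versus (approximately) square blocks. The "cost" counts the
// cells exchanged across partition boundaries for a stencil of radius r, a
// rough proxy for remote traffic; it is not the thesis's optimization model.
#include <cmath>
#include <cstdio>

int main() {
    const double n = 16384;   // image edge length (pixels)
    const double r = 8;       // stencil radius (pixels)

    for (int p : {4, 16, 64}) {
        double slab_cost  = 2.0 * r * n * (p - 1);                 // p-1 horizontal cuts
        double q          = std::sqrt(static_cast<double>(p));     // p assumed square
        double block_cost = 2.0 * 2.0 * r * n * (q - 1.0);         // cuts in both dimensions
        std::printf("P=%2d: slab halo %.0f cells, block halo %.0f cells\n",
                    p, slab_cost, block_cost);
    }
}
```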

    New prospects for computational hydraulics by leveraging high-performance heterogeneous computing techniques

    In the last two decades, computational hydraulics has undergone rapid development following the advancement of data acquisition and computing technologies. Using a finite-volume Godunov-type hydrodynamic model, this work demonstrates the promise of modern high-performance computing technology for achieving real-time flood modeling at a regional scale. The software is implemented for high-performance heterogeneous computing using the OpenCL programming framework, and is developed to support simulations across multiple GPUs using a domain decomposition technique and across multiple systems through an efficient implementation of the Message Passing Interface (MPI) standard. The software is applied to a flood event induced by a convective storm in Newcastle upon Tyne, demonstrating high computational performance across a GPU cluster and good agreement with crowd-sourced observations. Issues relating to data availability, complex urban topography, and differences in drainage capacity affect results for a small number of areas.
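    The multi-GPU, multi-node setup described above rests on a standard domain-decomposition pattern: each rank owns a slab of the grid and exchanges halo rows with its neighbours every time step. Below is a minimal MPI halo-exchange skeleton in C++; the per-cell update is a trivial placeholder for the finite-volume Godunov solver, not the paper's software.

```cpp
// Minimal halo-exchange skeleton for a 1D slab decomposition: each rank owns
// `local_rows` rows of the grid plus one ghost row on each side, exchanges
// the ghost rows with its neighbours, and applies a trivial per-cell update
// that stands in for the finite-volume flux computation. Illustrative only.
// Build/run (example): mpicxx halo.cpp -o halo && mpirun -np 4 ./halo
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int cols = 256, local_rows = 64;
    // Rows 0 and local_rows + 1 are ghost rows filled from the neighbours.
    std::vector<double> h((local_rows + 2) * cols, 1.0);   // e.g. water depth

    const int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    const int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 10; ++step) {
        // Send the first owned row up, receive the lower ghost row from below.
        MPI_Sendrecv(&h[1 * cols], cols, MPI_DOUBLE, up, 0,
                     &h[(local_rows + 1) * cols], cols, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Send the last owned row down, receive the upper ghost row from above.
        MPI_Sendrecv(&h[local_rows * cols], cols, MPI_DOUBLE, down, 1,
                     &h[0], cols, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Placeholder update using the freshly exchanged neighbour rows.
        std::vector<double> h_new(h);
        for (int i = 1; i <= local_rows; ++i)
            for (int j = 1; j < cols - 1; ++j)
                h_new[i * cols + j] = 0.25 * (h[(i - 1) * cols + j] + h[(i + 1) * cols + j]
                                            + h[i * cols + j - 1]   + h[i * cols + j + 1]);
        h.swap(h_new);
    }
    MPI_Finalize();
}
```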