    CUDA Implementation of a Navier-Stokes Solver on Multi-GPU Desktop Platforms for Incompressible Flows

    Graphics processing units (GPUs), traditionally designed for graphics rendering, have emerged as massively parallel co-processors to the central processing unit (CPU). Small-footprint desktop supercomputers with hundreds of cores that deliver teraflops of peak performance at the price of conventional workstations have been realized. A computational fluid dynamics (CFD) simulation capability with rapid computational turnaround time has the potential to transform engineering analysis and design optimization procedures. We describe the implementation of a Navier-Stokes solver for incompressible fluid flow on desktop platforms equipped with multiple GPUs. Specifically, NVIDIA's Compute Unified Device Architecture (CUDA) programming model is used to implement the discretized form of the governing equations. The projection algorithm for solving the incompressible flow equations is divided into distinct CUDA kernels, and a unique implementation that exploits the memory hierarchy of the CUDA programming model is suggested. Using a quad-GPU platform, we observe a two-orders-of-magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU desktops can serve as a cost-effective, small-footprint parallel computing platform to substantially accelerate CFD simulations.
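
    As a concrete illustration of how such a projection step decomposes into kernels, the sketch below shows a single Jacobi sweep of the pressure Poisson equation on a uniform grid. This is a minimal sketch under assumed names and layout (p, pNew, div, a row-major nx-by-ny grid), not the authors' actual implementation.

    // Illustrative sketch only: one Jacobi sweep of the pressure Poisson
    // equation, the kind of step a projection algorithm maps to a CUDA
    // kernel. Array names and the uniform-grid layout are assumptions.
    __global__ void pressureJacobi(const float* __restrict__ p,
                                   float* __restrict__ pNew,
                                   const float* __restrict__ div,
                                   int nx, int ny, float dx2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;

        int idx = j * nx + i;
        // Average of the four neighbors minus the scaled velocity divergence.
        pNew[idx] = 0.25f * (p[idx - 1] + p[idx + 1] +
                             p[idx - nx] + p[idx + nx] - dx2 * div[idx]);
    }

    In a full projection step, a sweep like this would be iterated to convergence each time step, after which a separate kernel would subtract the pressure gradient from the intermediate velocity field.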

    Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

    An emerging trend in processor architecture is the doubling of the number of cores per chip every two years at the same or lower clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The Fast Fourier Transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFTs. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we have achieved up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. We further extend our methodology to CPU-GPU heterogeneous platforms for the case when the input is too large to fit in the GPU's global memory. We develop optimization techniques for memory-bound and computation-bound applications. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers with kernel execution as much as possible. For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU software pipeline to a computation-bound application, DGEMM, and achieve the illusion of a memory of the CPU's size with a computation throughput similar to that of a pure GPU implementation.
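
    The overlap of PCIe transfers with kernel execution described above typically relies on CUDA streams with pinned host memory and double-buffered device arrays. The following is a minimal, self-contained sketch of that pipeline pattern; the chunk kernel (processChunk) and all sizes are illustrative placeholders, not the dissertation's FFT code.

    #include <cuda_runtime.h>

    // Hypothetical stand-in for one FFT/Poisson stage on a chunk.
    __global__ void processChunk(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];  // placeholder computation
    }

    int main()
    {
        const int N = 1 << 20;    // elements per chunk
        const int CHUNKS = 8;     // total input = CHUNKS * N elements
        float *hIn, *hOut, *dIn[2], *dOut[2];
        cudaStream_t s[2];

        // Pinned host memory is required for truly asynchronous copies
        // (input left uninitialized for brevity).
        cudaMallocHost(&hIn,  (size_t)CHUNKS * N * sizeof(float));
        cudaMallocHost(&hOut, (size_t)CHUNKS * N * sizeof(float));
        for (int b = 0; b < 2; ++b) {
            cudaMalloc(&dIn[b],  N * sizeof(float));
            cudaMalloc(&dOut[b], N * sizeof(float));
            cudaStreamCreate(&s[b]);
        }

        // Double-buffered pipeline: while chunk k computes in one stream,
        // chunk k+1 uploads in the other, hiding PCIe transfer time.
        for (int k = 0; k < CHUNKS; ++k) {
            int b = k % 2;
            cudaMemcpyAsync(dIn[b], hIn + (size_t)k * N, N * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            processChunk<<<(N + 255) / 256, 256, 0, s[b]>>>(dIn[b], dOut[b], N);
            cudaMemcpyAsync(hOut + (size_t)k * N, dOut[b], N * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize();
        return 0;
    }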

    Multi-Level Parallelism for Incompressible Flow Computations on GPU Clusters

    We investigate multi-level parallelism on GPU clusters with MPI-CUDA and hybrid MPI-OpenMP-CUDA parallel implementations, in which all computations are done on the GPU using CUDA. We explore the efficiency and scalability of incompressible flow computations using up to 256 GPUs on a problem with approximately 17.2 billion cells. Our work addresses some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that uses either MPI or MPI-OpenMP for communications. We present three different strategies to overlap computations with communications, and systematically assess their impact on parallel performance on two different GPU clusters. Our results for strong and weak scaling analysis of incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and that a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial performance benefits. We also find that our tri-level MPI-OpenMP-CUDA parallel implementation does not offer a significant performance advantage over the dual-level implementation on GPU clusters with two GPUs per node; however, on clusters with higher GPU counts per node or with different domain decomposition strategies, a tri-level implementation may exhibit higher efficiency than a dual-level implementation and needs to be investigated further.
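
    A common skeleton for the overlap strategies mentioned above is to launch an interior-update kernel while non-blocking MPI calls exchange halo data, then patch in the received ghost values. The sketch below illustrates this pattern for a 1-D decomposition with host-staged buffers; the kernel body, names, and single-value halo are simplifying assumptions rather than the paper's implementation (which may also use OpenMP threads and CUDA streams).

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Placeholder interior stencil: touches no ghost cells, so it can run
    // while the halo exchange is still in flight.
    __global__ void updateInterior(float* u, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 1 && i < n - 2) u[i] += 0.0f;  // stand-in for a real stencil
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;
        float *d_u, *h_send, *h_recv;
        cudaMalloc(&d_u, n * sizeof(float));
        cudaMemset(d_u, 0, n * sizeof(float));
        cudaMallocHost(&h_send, sizeof(float));
        cudaMallocHost(&h_recv, sizeof(float));

        int up = (rank + 1) % size, down = (rank - 1 + size) % size;
        MPI_Request req[2];

        // 1) Stage the boundary slice on the host and start the exchange.
        cudaMemcpy(h_send, d_u + n - 2, sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Isend(h_send, 1, MPI_FLOAT, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(h_recv, 1, MPI_FLOAT, down, 0, MPI_COMM_WORLD, &req[1]);

        // 2) The interior update overlaps with the in-flight messages.
        updateInterior<<<(n + 255) / 256, 256>>>(d_u, n);

        // 3) Finish the exchange and install the ghost value; a second,
        //    much smaller kernel would then update the boundary cells.
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        cudaMemcpy(d_u, h_recv, sizeof(float), cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        MPI_Finalize();
        return 0;
    }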

    Fast and Efficient Formulations for Electroencephalography-Based Neuroimaging Strategies

    The abstract is in the attachment.

    Doctor of Philosophy

    Partial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, solving PDEs requires tessellating computational domains into unstructured meshes and entails computationally expensive and time-consuming processes. Efficient and fast PDE solving techniques on unstructured meshes are therefore important in these applications. Relative to CPUs, the faster growth in speed and greater power efficiency of SIMD streaming processors, such as GPUs, have given them an increasingly important role in high-performance computing. By combining suitable parallel algorithms with these streaming processors, we can develop very efficient numerical PDE solvers. The contributions of this dissertation are twofold: two general strategies for designing efficient PDE solvers on GPUs, and specific applications of these strategies to different types of PDEs. The dissertation consists of four parts. First, we describe the general strategies: the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for efficiently solving the eikonal equation on fully unstructured meshes. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the level-set equation on fully unstructured 2D or 3D meshes or manifolds; this algorithm combines a narrow-band scheme with domain decomposition for efficient level-set equation solving.
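
    To make the eikonal part concrete: fast iterative methods repeatedly apply a local upwind solve at each active node. The sketch below shows that local solve for |grad u| = 1 as one Godunov sweep, but on a structured grid for brevity; the dissertation's algorithm operates on fully unstructured meshes, so take this only as an illustration of the per-node update being parallelized, with all names assumed.

    // One Godunov upwind update of the eikonal equation |grad u| = 1 on a
    // structured nx-by-ny grid with spacing h. A fast iterative method
    // repeats updates like this on an active list until convergence.
    __global__ void eikonalSweep(const float* __restrict__ u,
                                 float* __restrict__ uNew,
                                 int nx, int ny, float h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;

        int idx = j * nx + i;
        float a = fminf(u[idx - 1],  u[idx + 1]);   // upwind x-neighbor
        float b = fminf(u[idx - nx], u[idx + nx]);  // upwind y-neighbor
        float cand;
        if (fabsf(a - b) >= h)        // front arrives from one axis only
            cand = fminf(a, b) + h;
        else                          // two-sided quadratic (Godunov) solve
            cand = 0.5f * (a + b + sqrtf(2.0f * h * h - (a - b) * (a - b)));
        uNew[idx] = fminf(u[idx], cand);  // never increase arrival times
    }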

    Fast, Realistic Terrain Synthesis

    The authoring of realistic terrain models is necessary to generate immersive virtual environments for computer games and film visual effects. However, creating these landscapes is difficult: it usually involves an artist spending many hours sculpting a model in a 3D design program. Specialised terrain generation programs exist to rapidly create artificial terrains, such as Bryce (2013) and Terragen (2013). These make use of complex algorithms to pseudo-randomly generate terrains, which can then be exported into a 3D editing program for fine-tuning. Height-maps are 2D data structures that store elevation values and can be used to represent terrain data; they are also a common format in terrain generation and editing systems. Height-maps share the same storage design as image files, so they can be viewed like any picture and image transformation algorithms can be applied to them. Early techniques for generating terrains include fractal generation and physical simulation. These methods proved difficult to use because the algorithms are controlled by a set of parameters whose effect on the output is hard to predict, so users must adjust values over several iterations to produce their desired terrain. An improved technique, known as texture-based terrain synthesis, brings a higher degree of user control as well as improved realism. It borrows from texture synthesis, the process of algorithmically generating a larger image from a smaller sample image. Texture-based terrain synthesis makes use of real-world terrain data to produce highly realistic landscapes, improving upon previous techniques. Recent work in texture-based synthesis has focused on improving both realism and user control through the use of sketching interfaces. We present a patch-based terrain synthesis system that utilises a user sketch to control the location of desired terrain features, such as ridges and valleys. Digital Elevation Models (DEMs) of real landscapes are used as exemplars, from which candidate patches of data are extracted and matched against the user's sketch. The best candidates are merged seamlessly into the final terrain. Because real landscapes are used, the resulting terrain appears highly realistic. Our research contributes a new version of this approach that employs multiple input terrains and acceleration on a modern Graphics Processing Unit (GPU). The use of multiple inputs increases the candidate pool of patches, so the system can produce more varied terrains. This addresses the limitation where supplying the wrong type of input terrain would fail to synthesise anything useful, for example supplying the system with a mountainous DEM and expecting deep valleys in the output. We developed a hybrid multithreaded CPU and GPU implementation that achieves a 45-times speedup.
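
    The matching step is the naturally data-parallel core of such a system: every candidate patch can be scored against the sketched region independently. A minimal sketch of that idea follows, using a plain sum-of-squared-differences cost; the thesis may use a different feature-based cost and a more elaborate parallel reduction, so all names here are illustrative.

    // Each thread scores one candidate DEM patch against the user's sketch
    // region by sum of squared differences. Patch extraction and seamless
    // merging are omitted; patchSize is the flattened patch area (P*P).
    __global__ void scorePatches(const float* __restrict__ patches, // nPatches x patchSize
                                 const float* __restrict__ sketch,  // patchSize target
                                 float* __restrict__ score,
                                 int nPatches, int patchSize)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nPatches) return;

        const float* cand = patches + (size_t)p * patchSize;
        float ssd = 0.0f;
        for (int k = 0; k < patchSize; ++k) {
            float d = cand[k] - sketch[k];
            ssd += d * d;
        }
        score[p] = ssd;  // best candidate = argmin over scores on the host
    }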

    The 1998 Center for Simulation of Dynamic Response in Materials Annual Technical Report

    Introduction: This annual report describes research accomplishments for FY 98 of the Center for Simulation of Dynamic Response of Materials. The Center is constructing a virtual shock physics facility in which the full three-dimensional response of a variety of target materials can be computed for a wide range of compressive, tensile, and shear loadings, including those produced by detonation of energetic materials. The goals are to facilitate computation of a variety of experiments in which strong shock and detonation waves are made to impinge on targets consisting of various combinations of materials, to compute the subsequent dynamic response of the target materials, and to validate these computations against experimental data.

    Heterogeneous multicore systems for signal processing

    This thesis explores the capabilities of heterogeneous multi-core systems based on multiple Graphics Processing Units (GPUs) in a standard desktop framework. Multi-GPU accelerated deskside computers are an appealing alternative to other high-performance computing (HPC) systems: being composed of commodity hardware components fabricated in large quantities, their price-performance ratio is unparalleled in the world of high-performance computing. Essentially bringing "supercomputing to the masses", this opens up new possibilities for application fields where investing in HPC resources had previously been considered unfeasible. One of these is the field of bioelectrical imaging, a class of medical imaging technologies that occupy a low-cost niche next to million-dollar systems like functional Magnetic Resonance Imaging (fMRI). In the scope of this work, several computational challenges encountered in bioelectrical imaging are tackled with this new kind of computing resource, striving to help these methods approach their true potential. Specifically, the following main contributions were made. Firstly, a novel dual-GPU implementation of parallel triangular matrix inversion (TMI) is presented, addressing a crucial kernel in the computation of multi-mesh head models for electroencephalographic (EEG) source localization. This includes not only a highly efficient implementation of the routine itself, achieving excellent speedups over an optimized CPU implementation, but also a novel GPU-friendly compressed storage scheme for triangular matrices. Secondly, a scalable multi-GPU solver for non-Hermitian linear systems was implemented. It is integrated into a simulation environment for electrical impedance tomography (EIT) that requires frequent solution of complex systems with millions of unknowns, a task that this solution can perform within seconds. In terms of computational throughput, it outperforms not only a highly optimized multi-CPU reference but related GPU-based work as well. Finally, a GPU-accelerated graphical EEG real-time source localization software was implemented. Thanks to this acceleration, it can meet real-time requirements at unprecedented anatomical detail while running more complex localization algorithms. Additionally, a novel implementation to extract anatomical priors from static Magnetic Resonance (MR) scans has been included.
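
    As an illustration of what a compressed triangular storage scheme involves, the sketch below shows the classic packed row-major layout for a lower-triangular matrix and a trivial kernel that indexes into it. The thesis's GPU-friendly scheme is likely more elaborate (e.g., padded for coalesced access), so this is an assumed baseline, not its actual layout.

    // Element (i, j) of a lower-triangular matrix, j <= i, stored packed
    // row-major: rows 0..i-1 contribute 1 + 2 + ... + i = i*(i+1)/2 slots,
    // so the packed array needs only n*(n+1)/2 elements instead of n*n.
    __host__ __device__ inline size_t triIndex(size_t i, size_t j)
    {
        return i * (i + 1) / 2 + j;   // valid only for j <= i
    }

    // Example access pattern: scale the diagonal of the packed matrix.
    __global__ void scaleDiagonal(float* packed, int n, float alpha)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) packed[triIndex(i, i)] *= alpha;
    }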

    GPGPU application in fusion science

    GPGPUs have firmly earned their reputation in HPC (High Performance Computing) as hardware for massively parallel computation. However, their application in fusion science is quite marginal and not considered a mainstream approach to numerical problems. Computational power has increased immensely over the last decade and continues to grow. GPGPU boards were long an alternative, exotic approach to problem solving and scientific programming, cultivated only by enthusiasts and specialized programmers. It has now been about 10 years since the first fully programmable GPUs appeared on the market, and due to the exponential growth in processing power over the years, GPGPUs are no longer the alternative choice but have become a main choice for solving large problems. Originally developed for, and dominant in, fields such as image and media processing, image rendering, video encoding/decoding, image scaling, stereo vision, and pattern recognition, GPGPUs are also becoming mainstream computation platforms in scientific fields such as signal processing, physics, finance, and biology. This PhD thesis contains solutions and approaches to two problems relevant to fusion and plasma science using GPGPU processing. The first problem belongs to the realms of plasma and accelerator physics: I present a number of plasma simulations built on the PIC (Particle In Cell) method, including a plasma sheath simulation, an electron beam simulation, a negative ion beam simulation, and a space charge compensation simulation. The second problem belongs to the realms of tomography and real-time control: I present a number of simulated tomographic plasma reconstructions of Fourier-Bessel type and their analysis, all in a real-time-oriented approach, i.e. the GPGPU-based implementations are integrated into the MARTe environment. MARTe is a framework for real-time applications developed at JET (Joint European Torus) and used in several European fusion labs. These two sets of problems span the full spectrum of GPGPU operation capabilities: PIC-based problems are large, complex simulations operated as batch processes, which have no time constraint and operate on huge amounts of memory, while tomographic plasma reconstructions are online (real-time) processes with strict latency constraints set by the time scales of real-time control, operating on relatively small amounts of memory. Such a variety of problems covers a very broad range of disciplines and fields of science: plasma physics, NBI (Neutral Beam Injector) physics, tokamak physics, parallel computing, iterative/direct matrix solvers, the PIC method, tomography, and so on. The thesis also includes an extended performance analysis of NVIDIA GPU cards considering their applicability to real-time control and real-time performance. To approach the aforementioned problems, as a PhD candidate I had to gain knowledge in the relevant fields and build a broad range of practical skills, including parallel/sequential CPU programming, GPU programming, MARTe programming, MATLAB programming, IDL programming, and Python programming.
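
    For a flavor of the batch-style PIC workload described above, the sketch below shows the particle-push phase as a simple leapfrog update. Field gather/scatter, boundary handling, and magnetic forces are omitted, and all names are illustrative rather than taken from the thesis's simulations.

    // Minimal particle push: kick-drift update of velocity and position
    // from an electric field already gathered at each particle. One thread
    // per particle; qm is the charge-to-mass ratio, dt the time step.
    __global__ void pushParticles(float* __restrict__ x,
                                  float* __restrict__ v,
                                  const float* __restrict__ E,
                                  int nParticles, float qm, float dt)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nParticles) return;

        v[p] += qm * E[p] * dt;  // kick: accelerate by (q/m) * E
        x[p] += v[p] * dt;       // drift: advance position
    }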