64 research outputs found

    GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

    Full text link
    High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.Comment: 50 pages, 14 figures, 14 table

    Local time stepping on high performance computing architectures: mitigating CFL bottlenecks for large-scale wave propagation

    Get PDF
    Modeling problems that require the simulation of hyperbolic PDEs (wave equations) on large heterogeneous domains have potentially many bottlenecks. We attack this problem through two techniques: the massively parallel capabilities of graphics processors (GPUs) and local time stepping (LTS) to mitigate any CFL bottlenecks on a multiscale mesh. Many modern supercomputing centers are installing GPUs due to their high performance, and extending existing seismic wave-propagation software to use GPUs is vitally important to give application scientists the highest possible performance. In addition to this architectural optimization, LTS schemes avoid performance losses in meshes with localized areas of refinement. Coupled with the GPU performance optimizations, the derivation and implementation of an Newmark LTS scheme enables next-generation performance for real-world applications. Included in this implementation is work addressing the load-balancing problem inherent to multi-level LTS schemes, enabling scalability to hundreds and thousands of CPUs and GPUs. These GPU, LTS, and scaling optimizations accelerate the performance of existing applications by a factor of 30 or more, and enable future modeling scenarios previously made unfeasible by the cost of standard explicit time-stepping schemes

    An Evaluation and Comparison of GPU Hardware and Solver Libraries for Accelerating the OPM Flow Reservoir Simulator

    Full text link
    Realistic reservoir simulation is known to be prohibitively expensive in terms of computation time when increasing the accuracy of the simulation or by enlarging the model grid size. One method to address this issue is to parallelize the computation by dividing the model in several partitions and using multiple CPUs to compute the result using techniques such as MPI and multi-threading. Alternatively, GPUs are also a good candidate to accelerate the computation due to their massively parallel architecture that allows many floating point operations per second to be performed. The numerical iterative solver takes thus the most computational time and is challenging to solve efficiently due to the dependencies that exist in the model between cells. In this work, we evaluate the OPM Flow simulator and compare several state-of-the-art GPU solver libraries as well as custom developed solutions for a BiCGStab solver using an ILU0 preconditioner and benchmark their performance against the default DUNE library implementation running on multiple CPU processors using MPI. The evaluated GPU software libraries include a manual linear solver in OpenCL and the integration of several third party sparse linear algebra libraries, such as cuSparse, rocSparse, and amgcl. To perform our bench-marking, we use small, medium, and large use cases, starting with the public test case NORNE that includes approximately 50k active cells and ending with a large model that includes approximately 1 million active cells. We find that a GPU can accelerate a single dual-threaded MPI process up to 5.6 times, and that it can compare with around 8 dual-threaded MPI processes

    GPU Accelerated Three Dimensional Unstructured Geometric Multi-Grid Solver

    Get PDF
    Consider a set of points P in three dimensional euclidean space. For each point p, the neighbourhood N(p), is defined as the set of points in P, which are voronoi neighbours. Each point in P represents a variable and its value is dependent on the value of its neighbourhood. Its value is given by the sum of the values of points in its neighbourhood scaled by predefined constants. The constants depend on the spacing between the points. The problem is to solve all the variables. Such representations arise naturally in solving flow equations in Computational Fluid Dynamics with domains represented using unstructured meshes. The problem reduces to solve a system of linear equations. In this work geometric multigrid method is implemented for solving the problem faster. Solving this problem on very large input is a time consuming process. The inputs considered here are having size of the order of millions. Graphics Processing Units(GPU) are dedicated parallel processors which serves both as a programmable graphics processor and a scalable parallel computing platform. The parallelization of this problem for GPUs is not straight forward because of the irregularity. The CFD problem used for experiment is the steady and unsteady heat transfer problem in 3D unstructured meshes.The combination of multigrid algorithm and GPU implementation for the steady problem on a 1.6 million mesh gives 1630 times speed up compared to non-multigrid CPU implementation

    Numerical modeling of extrusion forming tools: improving its efficiency on heterogeneous parallel computers

    Get PDF
    Dissertação de mestrado em Engenharia InformáticaPolymer processing usually requires several experimentation and calibration attempts to lead to a final result with the desired quality. As this results in large costs, software applications have been developed aiming to replace laboratory experimentation by computer based simulations and hence lower these costs. The focus of this dissertation was on one of these applications, the FlowCode, an application which helps the design of extrusion forming tools, applied to plastics processing or in the processing of other fluids. The original application had two versions of the code, one to run in a single-core CPU and the other for NVIDIA GPU devices. With the increasing use of heterogeneous platforms, many applications can now benefit and leverage the computational power of these platforms. As this requires some expertise, mostly to schedule tasks/functions and transfer the necessary data to the devices, several frameworks were developed to aid the development - with StarPU being the one with more international relevance, although other ones are emerging such as Dynamic Irregular Computing Environment (DICE). The main objectives of this dissertation were to improve the FlowCode, and to assess the use of one framework to develop an efficient heterogeneous version. Only the CPU version of the code was improved, by first applying techniques to the sequential version and parallelizing it afterwards using OpenMP on both multi-core CPU devices (Intel Xeon 12-core) and on many-core devices (Intel Xeon Phi 61-core). For the heterogeneous version, StarPU was chosen after studying both StarPU and DICE frameworks. Results show the parallel CPU version to be faster than the GPU one, for all input datasets. The GPU code is far from being efficient, requiring several improvements, so comparing the devices with each other would not be fair. The Xeon Phi version proves to be the faster one when no framework is used. For the StarPU version, several schedulers were tested to evaluate the faster one, leading to the most efficient to solve our problem. Executing the code on two GPU devices is 1.7 times faster than when executing the GPU version without the framework. Adding the CPU to the GPUs of the testing environment do not improve execution time with most schedulers due to the lack of available parallelism in the application. Globally, the StarPU version is the faster one followed by the Xeon Phi, CPU and GPU versions.O processamento de polímeros requer normalmente várias tentativas de experimentação e calibração de modo a que o resultado final tenha a qualidade pretendida. Como isto resulta em custos elevados, diversas aplicações foram desenvolvidas para substituir a parte de experimentação laboratorial por simulações por computador e consequentemente, reduzir esses custos. Este dissertação foca-se numa dessas aplicações, o FlowCode, uma aplicação de ajuda à conceção de ferramentas de extrusão aplicada no processamento de plásticos ou no processamento de outros tipos de fluidos. Esta aplicação inicial era composta por duas versões, uma executada sequencialmente num processador e outra executada em aceleradores computacionais NVIDIA GPU. Com o aumento da utilização de plataformas heterogéneas, muitas aplicações podem beneficiar do poder computacional destas plataformas. Como isto requer alguma experiência, principalmente para escalonar tarefas/funções e transferir os dados necessários para os aceleradores, várias frameworks foram desenvolvidas para ajudar ao desenvolvimento - sendo StarPU a framework com mais relevância internacional, embora outras estejam a surgir como a framework DICE. Os principais objetivos desta dissertação eram melhorar o FlowCode assim como avaliar a utilização de uma framework para desenvolver uma versão heterogénea eficiente. Apenas a versão CPU foi melhorada, primeiro aplicando técnicas na versão sequencial, e depois procedendo à paralelização usando OpenMP em CPUs multi-core (Intel Xeon 12-core) e aceleradores many-core (Intel Xeon Phi 61-core). Para a versão heterogénea, foi escolhido a framework StarPU depois de se ter feito um estudo das frameworks StarPU e DICE. Os resultados mostram que a versão CPU paralela é mais rápida que a GPU em todos os casos testados. O código GPU está longe de ser eficiente, necessitando diversas melhorias. Portanto, uma comparação entre CPUs, GPUs e Xeon Phi’s não seria justa. A versão Xeon Phi revela-se ser a mais rápida quando não é usada nenhuma framework. Para a versão StarPU, vários escalonadores foram testados para avaliar o mais rápido, levando ao mais eficiente para resolver o nosso problema. Executar o código em dois GPUs é 1.7 vezes mais rápido do que executar para um GPU sem framework em um dos casos testados. Adicionar o CPU aos GPUs do ambiente de teste não melhora o tempo de execução para a maioria dos escalonadores devido à falta de paralelismo disponível. Globalmente, a versão StarPU é a mais rápida seguida das versões Xeon Phi, CPU, e GPU

    Parallel AMG Solver for Three Dimensional Unstructured Grids Using Gpus

    Get PDF
    Consider a set of points P in three dimensional euclidean space. Each point in P represents a variable and its value is dependent on the value of its neighborhood scaled by predefined constants. The problem is to solve all the variables which reduces to solving a large set of sparse linear equations. This kind of representation arises naturally while solving flow equations in Computational Fluid Dynamics (CFD). Graphics Processing Units (GPUs), over the years have evolved from being graphics accelerator to scalable co-processor. We implement an algebraic multigrid solver for three dimensional unstructured grids using GPUs. Such a solver has extensive applications in Computational Fluid Dynamics. Using a combination of vertex coloring, optimized memory representations, multi-grid and improved coarsening techniques, we obtain considerable speedup in our parallel implementation. For our implementation, we used Nvidia’s CUDA programming model. Our solver is used to accelerate solutions to various problems like heat transfer, Navier-Stokes etc. Our solver achieves 2157 and 29 times speed up for steady state and unsteady state head transfer problem respectively on a grid of size 2.3 million, compared to serial non-multigrid implementation. Our solver provides significant acceleration for solving pressure Poisson equations, which is the most time consuming part while solving Navier-Stokes equations. In our experimental study, we solve pressure Poisson equations for flow over lid driven cavity, laminar flow past square cylinder and plain jet problems. Our implementation achieves 915 times speed up for the lid driven cavity problem on a grid of size 2.6 million and a speed up of 1020 times for the laminar flow past square cylinder problem on a grid of size 1.7 million, compared to serial non-multigrid implementations. For plain jet problem, our solver achieves a speed up of 47 times, compared to serial non-multigrid implementation on a grid of size 2.7 million. We also implement multi GPU AMG solver which achieves a speed up of 1.5 times, compared to single GPU solver for heat transfer problem

    Accelerated computational micromechanics

    Get PDF
    We present an approach to solving problems in micromechanics that is amenable to massively parallel calculations through the use of graphical processing units and other accelerators. The problems lead to nonlinear differential equations that are typically second order in space and first order in time. This combination of nonlinearity and nonlocality makes such problems difficult to solve in parallel. However, this combination is a result of collapsing nonlocal, but linear and universal physical laws (kinematic compatibility, balance laws), and nonlinear but local constitutive relations. We propose an operator-splitting scheme inspired by this structure. The governing equations are formulated as (incremental) variational problems, the differential constraints like compatibility are introduced using an augmented Lagrangian, and the resulting incremental variational principle is solved by the alternating direction method of multipliers. The resulting algorithm has a natural connection to physical principles, and also enables massively parallel implementation on structured grids. We present this method and use it to study two examples. The first concerns the long wavelength instability of finite elasticity, and allows us to verify the approach against previous numerical simulations. We also use this example to study convergence and parallel performance. The second example concerns microstructure evolution in liquid crystal elastomers and provides new insights into some counter-intuitive properties of these materials. We use this example to validate the model and the approach against experimental observations

    Accelerated volumetric reconstruction from uncalibrated camera views

    Get PDF
    While both work with images, computer graphics and computer vision are inverse problems. Computer graphics starts traditionally with input geometric models and produces image sequences. Computer vision starts with input image sequences and produces geometric models. In the last few years, there has been a convergence of research to bridge the gap between the two fields. This convergence has produced a new field called Image-based Rendering and Modeling (IBMR). IBMR represents the effort of using the geometric information recovered from real images to generate new images with the hope that the synthesized ones appear photorealistic, as well as reducing the time spent on model creation. In this dissertation, the capturing, geometric and photometric aspects of an IBMR system are studied. A versatile framework was developed that enables the reconstruction of scenes from images acquired with a handheld digital camera. The proposed system targets applications in areas such as Computer Gaming and Virtual Reality, from a lowcost perspective. In the spirit of IBMR, the human operator is allowed to provide the high-level information, while underlying algorithms are used to perform low-level computational work. Conforming to the latest architecture trends, we propose a streaming voxel carving method, allowing a fast GPU-based processing on commodity hardware
    corecore