The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform
The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads, and its popularity and acceptance are therefore rising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Given the increasing importance of the Arm architecture in the HPC landscape, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to support tuning of the HPCG benchmark within the Arm community.
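As a rough illustration of the kind of shared-memory parallelization the report targets, here is a minimal OpenMP sketch of a CSR sparse matrix-vector product, the bandwidth-bound kernel at the heart of HPCG; the function name and CSR layout are illustrative assumptions, not the report's actual code.

```cpp
// Hedged sketch: one way to add OpenMP shared-memory parallelism to a
// CSR sparse matrix-vector product, the kind of kernel that dominates
// HPCG. Names and layout are illustrative, not the report's code.
#include <vector>
#include <cstdio>

// y = A*x with A stored in CSR format (rowPtr, colIdx, val).
void spmv_csr(int nRows,
              const std::vector<int>& rowPtr,
              const std::vector<int>& colIdx,
              const std::vector<double>& val,
              const std::vector<double>& x,
              std::vector<double>& y) {
    // Rows are independent, so a simple parallel-for is safe; a static
    // schedule suits the roughly uniform rows of HPCG's 27-point stencil.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nRows; ++i) {
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            sum += val[k] * x[colIdx[k]];
        y[i] = sum;
    }
}

int main() {
    // 2x2 identity as a minimal smoke test.
    std::vector<int> rowPtr{0, 1, 2}, colIdx{0, 1};
    std::vector<double> val{1.0, 1.0}, x{3.0, 4.0}, y(2);
    spmv_csr(2, rowPtr, colIdx, val, x, y);
    std::printf("%g %g\n", y[0], y[1]); // expected: 3 4
}
```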
AlSub: Fully Parallel and Modular Subdivision
In recent years, mesh subdivision---the process of forging smooth free-form
surfaces from coarse polygonal meshes---has become an indispensable production
instrument. Although subdivision performance is crucial during simulation,
animation and rendering, state-of-the-art approaches still rely on serial
implementations for complex parts of the subdivision process. Therefore, they
often fail to harness the power of modern parallel devices, like the graphics
processing unit (GPU), for large parts of the algorithm and must resort to
time-consuming serial preprocessing. In this paper, we show that a complete
parallelization of the subdivision process for modern architectures is
possible. Building on sparse matrix linear algebra, we show how to structure
the complete subdivision process into a sequence of algebra operations. By
restructuring and grouping these operations, we adapt the process for different
use cases, such as regular subdivision of dynamic meshes, uniform subdivision
for immutable topology, and feature-adaptive subdivision for efficient
rendering of animated models. As the same machinery is used for all use cases,
identical subdivision results are achieved in all parts of the production
pipeline. As a second contribution, we show how these linear algebra
formulations can effectively be translated into efficient GPU kernels. Applying
our strategies to √3, Loop, and Catmull-Clark subdivision shows
significant speedups of our approach compared to state-of-the-art solutions,
while we completely avoid serial preprocessing.
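The core formulation is easy to picture. Below is a hedged sketch, assuming an illustrative CSR container and weight layout rather than the authors' actual data structures: one subdivision step becomes a sparse matrix product V' = S V, where each row of S holds the stencil weights that blend coarse vertex positions into one refined vertex (an edge midpoint, for instance, is a row with two weights of 0.5).

```cpp
// Hedged sketch of the paper's core idea: one subdivision step as a
// sparse matrix product V' = S * V, where S encodes stencil weights and
// V holds vertex positions. Csr and subdivide are illustrative stand-ins.
#include <array>
#include <vector>

struct Csr {                          // minimal CSR container
    std::vector<int> rowPtr, colIdx;
    std::vector<double> w;            // stencil weights
};

using Vec3 = std::array<double, 3>;

// One subdivision level: every refined vertex is a weighted average
// (one CSR row) of coarse vertices, so the whole step is data-parallel
// and maps directly onto a GPU SpMV-style kernel.
std::vector<Vec3> subdivide(const Csr& S, const std::vector<Vec3>& V) {
    std::vector<Vec3> out(S.rowPtr.size() - 1, Vec3{0, 0, 0});
    for (size_t i = 0; i + 1 < S.rowPtr.size(); ++i)
        for (int k = S.rowPtr[i]; k < S.rowPtr[i + 1]; ++k)
            for (int c = 0; c < 3; ++c)
                out[i][c] += S.w[k] * V[S.colIdx[k]][c];
    return out;
}

int main() {
    // Edge midpoint: one output row averaging vertices 0 and 1.
    Csr S{{0, 2}, {0, 1}, {0.5, 0.5}};
    std::vector<Vec3> V{{0, 0, 0}, {2, 0, 0}};
    auto out = subdivide(S, V);       // out[0] == {1, 0, 0}
    (void)out;
}
```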
Sparse Volumetric Deformation
Volume rendering is becoming increasingly popular as applications require realistic solid shape representations with seamless texture mapping and accurate filtering. However, rendering sparse volumetric data is difficult because of the limited memory and processing capabilities of current hardware. To address these limitations, the volumetric information can be stored at progressive resolutions in the hierarchical branches of a tree structure and sampled according to the region of interest. This means that only a partial region of the full dataset is processed, so massive volumetric scenes can be rendered efficiently.
The problem with this approach is that it currently supports only static scenes, because it is difficult to deform massive numbers of volume elements accurately and to reconstruct the scene hierarchy in real time. Another problem is that deformation operations distort the shape where more than one volume element tries to occupy the same location, and, conversely, gaps open where deformation stretches neighbouring elements more than one discrete location apart. It is also challenging to efficiently support sophisticated deformations at hierarchical resolutions, such as character skinning or physically based animation. These types of deformation are expensive and require a control structure (for example a cage or skeleton) that maps to a set of features to accelerate the deformation process. The problems with this technique are that the varying volume hierarchy reflects different feature sizes, and manipulating the features at the original resolution is too expensive; the control structure must therefore also capture features hierarchically, according to the varying volumetric resolution.
This thesis investigates the deformation and rendering of massive amounts of dynamic volumetric content. The proposed approach efficiently deforms hierarchical volume elements without introducing artifacts and supports both ray-casting and rasterization renderers. This enables light transport to be modeled both accurately and efficiently, with applications in the fields of real-time rendering and computer animation. Sophisticated volumetric deformation, including character animation, is also supported in real time. This is achieved by automatically generating a control skeleton which is mapped to the varying feature resolution of the volume hierarchy. The output deformations are demonstrated in massive dynamic volumetric scenes.
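To make the hierarchical storage concrete, here is a minimal sketch, under assumed types and with none of the thesis's actual code, of an octree-style sparse volume: children exist only where finer detail is present, and sampling stops at the requested level of detail, so distant regions are served from coarse nodes while only the region of interest touches fine data.

```cpp
// Hedged sketch (not the thesis code): a sparse hierarchical volume where
// each node stores a coarse density sample and only allocates children
// for regions that actually contain detail.
#include <array>
#include <memory>

struct VolumeNode {
    float density = 0.0f;                            // coarse sample for this cell
    std::array<std::unique_ptr<VolumeNode>, 8> kids; // sparse: null = homogeneous

    // Sample at a target depth; stop early where no finer data is stored.
    float sample(float x, float y, float z, int depth) const {
        if (depth == 0) return density;
        int i = (x >= 0.5f) | ((y >= 0.5f) << 1) | ((z >= 0.5f) << 2);
        if (!kids[i]) return density;                // no finer detail stored
        auto f = [](float v) { return v >= 0.5f ? 2 * v - 1 : 2 * v; };
        return kids[i]->sample(f(x), f(y), f(z), depth - 1);
    }
};

int main() {
    VolumeNode root;
    root.density = 0.25f;
    root.kids[0] = std::make_unique<VolumeNode>();
    root.kids[0]->density = 0.9f;                     // detail near the origin
    float coarse = root.sample(0.9f, 0.9f, 0.9f, 3);  // falls back to 0.25
    float fine   = root.sample(0.1f, 0.1f, 0.1f, 1);  // reaches child: 0.9
    (void)coarse; (void)fine;
}
```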
Architecture-Performance Interrelationship Analysis in Single/Multiple CPU/GPU Computing Systems: Application to Composite Process Flow Modeling
Current developments in computing have shown the advantage of using one or more Graphics Processing Units (GPUs) to boost the performance of many computationally intensive applications, but there are still limits to these GPU-enhanced systems. The major factors that contribute to the limitations of GPUs for High Performance Computing (HPC) can be categorized as hardware- or software-oriented in nature. Understanding how these factors affect performance is essential for developing efficient and robust application codes that employ one or more GPU devices as powerful co-processors for HPC computational modeling. The present work analyzes the intrinsic interrelationship of both hardware and software factors on computational performance for single and multiple GPU-enhanced systems, using a computationally intensive application that is representative of a large portion of the challenges confronting modern HPC. The representative application uses unstructured finite element computations for transient composite resin infusion process flow modeling as its computational core; its characteristics and results reflect many other HPC applications via the sparse matrix system used for the solution of the linear system of equations. This work describes these various software and hardware factors and how they interact to affect the performance of computationally intensive applications, enabling more efficient development and porting of HPC applications, including current, legacy, and future large-scale computational modeling applications in various engineering and scientific disciplines.
TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators
Over the past few years, the explosion in sparse tensor algebra workloads has
led to a corresponding rise in domain-specific accelerators to service them.
Due to the irregularity present in sparse tensors, these accelerators employ a
wide variety of novel solutions to achieve good performance. At the same time,
prior work on design-flexible sparse accelerator modeling does not express this
full range of design features, making it difficult to understand the impact of
each design choice and compare or extend the state-of-the-art.
To address this, we propose TeAAL: a language and compiler for the concise
and precise specification and evaluation of sparse tensor algebra
architectures. We use TeAAL to represent and evaluate four disparate
state-of-the-art accelerators--ExTensor, Gamma, OuterSPACE, and SIGMA--and
verify that it reproduces their performance with high accuracy. Finally, we
demonstrate the potential of TeAAL as a tool for designing new accelerators by
showing how it can be used to speed up Graphicionado on both BFS and SSSP.
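TeAAL's specifications are grounded in Einsum-style descriptions of sparse tensor computations over fibertrees. As a loose, hypothetical illustration in ordinary code (TeAAL itself is a declarative specification language, and none of the names below come from it), the Einsum Z[m] = Σ_k A[m,k]·B[k] over compressed fibers might be traversed like this:

```cpp
// Hedged illustration, not TeAAL itself: in the fibertree abstraction a
// sparse tensor is a hierarchy of fibers mapping coordinates to payloads.
// Below, rank-2 tensor A[m][k] is a fiber of fibers, and the Einsum
// Z[m] = sum_k A[m,k] * B[k] visits only stored (nonzero) coordinates,
// much as a sparse accelerator's datapath would.
#include <map>
#include <cstdio>

using Fiber  = std::map<int, double>;   // coordinate -> payload
using Tensor = std::map<int, Fiber>;    // m -> fiber over k

Fiber einsum_zm(const Tensor& A, const Fiber& B) {
    Fiber Z;
    for (const auto& [m, rowA] : A)         // traversal of stored m fibers
        for (const auto& [k, a] : rowA) {   // only stored k coordinates
            auto b = B.find(k);             // intersection with B's fiber
            if (b != B.end()) Z[m] += a * b->second;
        }
    return Z;
}

int main() {
    Tensor A{{0, {{1, 2.0}}}, {2, {{0, 3.0}, {1, 4.0}}}};
    Fiber  B{{0, 10.0}, {1, 1.0}};
    for (auto& [m, z] : einsum_zm(A, B)) std::printf("Z[%d]=%g\n", m, z);
    // expected: Z[0]=2, Z[2]=34
}
```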
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Many problems in geophysical and atmospheric modelling require the fast
solution of elliptic partial differential equations (PDEs) in "flat" three
dimensional geometries. In particular, an anisotropic elliptic PDE for the
pressure correction has to be solved at every time step in the dynamical core
of many numerical weather prediction (NWP) models, and equations of a very similar
structure arise in global ocean models, subsurface flow simulations and gas and
oil reservoir modelling. The elliptic solve is often the bottleneck of the
forecast, and an algorithmically optimal method has to be used and implemented
efficiently. Graphics Processing Units have been shown to be highly efficient
for a wide range of applications in scientific computing, and recently
iterative solvers have been parallelised on these architectures. We describe
the GPU implementation and optimisation of a Preconditioned Conjugate Gradient
(PCG) algorithm for the solution of a three dimensional anisotropic elliptic
PDE for the pressure correction in NWP. Our implementation exploits the strong
vertical anisotropy of the elliptic operator in the construction of a suitable
preconditioner. As the algorithm is memory bound, performance can be improved
significantly by reducing the amount of global memory access. We achieve this
by using a matrix-free implementation which does not require explicit storage
of the matrix and instead recalculates the local stencil. Global memory access
can also be reduced by rewriting the algorithm using loop fusion and we show
that this further reduces the runtime on the GPU. We demonstrate the
performance of our matrix-free GPU code by comparing it to a sequential CPU
implementation and to a matrix-explicit GPU code which uses existing libraries.
The absolute performance of the algorithm for different problem sizes is
quantified in terms of floating point throughput and global memory bandwidth.
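A minimal sketch of the matrix-free idea follows, assuming a generic constant-coefficient anisotropic 7-point stencil rather than the paper's actual pressure-correction operator: the stencil entries are recomputed from grid parameters at every application instead of being loaded from memory, trading cheap flops for expensive memory traffic.

```cpp
// Hedged sketch (not the paper's code): matrix-free application of an
// anisotropic 7-point operator with vertical coupling lambda. The local
// stencil is recomputed on the fly, so no matrix is ever stored.
#include <vector>

struct Grid { int nx, ny, nz; double lambda; }; // lambda: vertical anisotropy

inline int idx(const Grid& g, int i, int j, int k) {
    return (i * g.ny + j) * g.nz + k;
}

// y = A*x for interior points; boundary values are simply left untouched.
void apply_matrix_free(const Grid& g, const std::vector<double>& x,
                       std::vector<double>& y) {
    for (int i = 1; i < g.nx - 1; ++i)
        for (int j = 1; j < g.ny - 1; ++j)
            for (int k = 1; k < g.nz - 1; ++k) {
                double diag = 4.0 + 2.0 * g.lambda;  // recomputed, not loaded
                y[idx(g, i, j, k)] =
                    diag * x[idx(g, i, j, k)]
                    - x[idx(g, i - 1, j, k)] - x[idx(g, i + 1, j, k)]
                    - x[idx(g, i, j - 1, k)] - x[idx(g, i, j + 1, k)]
                    - g.lambda * (x[idx(g, i, j, k - 1)] + x[idx(g, i, j, k + 1)]);
            }
}

int main() {
    Grid g{8, 8, 8, 0.01};        // strong vertical anisotropy: lambda << 1
    std::vector<double> x(8 * 8 * 8, 1.0), y(x.size(), 0.0);
    apply_matrix_free(g, x, y);   // constant input: interior rows sum to 0
}
```

Loop fusion applies the same logic across kernels: merging, say, this operator application with the subsequent dot products of a CG iteration lets each vector be streamed through global memory once instead of several times.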
Best practices for HPM-assisted performance engineering on modern multicore processors
Many tools and libraries employ hardware performance monitoring (HPM) on
modern processors, and using this data for performance assessment and as a
starting point for code optimizations is very popular. However, such data is
only useful if it is interpreted with care, and if the right metrics are chosen
for the right purpose. We demonstrate the sensible use of hardware performance
counters in the context of a structured performance engineering approach for
applications in computational science. Typical performance patterns and their
respective metric signatures are defined, and some of them are illustrated
using case studies. Although these generic concepts do not depend on specific
tools or environments, we restrict ourselves to modern x86-based multicore
processors and use the likwid-perfctr tool under the Linux OS.
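As a hypothetical illustration of a metric signature (the event counts, the 64-byte line size, and the machine balance below are assumptions, not values from the paper), raw counter readings can be reduced to bandwidth and arithmetic intensity to flag the memory-bound pattern:

```cpp
// Hedged illustration of the "metric signature" idea: derive the metrics
// used to recognize a performance pattern such as "memory bound" from raw
// HPM event counts. All numbers are made-up inputs for a generic x86 part.
#include <cstdio>

int main() {
    // Example raw counts, as a tool like likwid-perfctr might report them.
    double flops       = 4.0e9;   // retired floating-point operations
    double cache_lines = 2.0e9;   // memory controller line transfers
    double runtime_s   = 1.0;

    double bytes     = cache_lines * 64.0;      // 64 B per cache line
    double bandwidth = bytes / runtime_s / 1e9; // GB/s
    double intensity = flops / bytes;           // flop/byte

    // Signature check: intensity below the chip's machine balance
    // (assumed here to be 0.5 flop/byte) points at the memory-bound pattern.
    std::printf("BW = %.1f GB/s, AI = %.3f flop/byte -> %s\n",
                bandwidth, intensity,
                intensity < 0.5 ? "memory bound" : "core bound");
}
```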