Manycore processing of repeated range queries over massive moving objects observations
The ability to process significant amounts of continuously updated
spatial data in a timely manner is mandatory for an increasing number of
applications. Parallelism enables such applications to face this
data-intensive challenge and allows the devised systems to feature low
latency and high scalability. In this paper we focus on a specific
data-intensive problem, concerning the repeated processing of huge amounts of
range queries over massive sets of moving objects, where the spatial extents
of queries and objects are continuously modified over time. To tackle this
problem and significantly accelerate query processing, we devise a hybrid
CPU/GPU pipeline that compresses data output and saves query processing
work. The devised system relies on an ad-hoc spatial index leading to a
problem decomposition that results in a set of independent data-parallel
tasks. The index is based on a point-region quadtree space decomposition and
effectively handles a broad range of spatial object distributions, even very
skewed ones. Also, to deal with the architectural peculiarities and
limitations of GPUs, we adopt non-trivial GPU data structures that avoid the
need for locked memory accesses and favour coalesced memory accesses, thus
enhancing the overall memory throughput. To the best of our knowledge, this
is the first work that exploits GPUs to efficiently solve repeated range
queries over massive sets of continuously moving objects characterized by
highly skewed spatial distributions. In comparison with state-of-the-art
CPU-based implementations, our method achieves significant speedups in the
order of 14x-20x, depending on the dataset, even on very cheap GPUs.
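The index above is based on a point-region (PR) quadtree space decomposition. As a rough illustration of the underlying data structure, here is a sequential pure-Python sketch with a hypothetical `Quadtree` class; the paper's GPU-resident index and its lock-free, coalescing-friendly memory layout are not reproduced here:

```python
# Minimal point-region (PR) quadtree sketch: points live in leaves, and a
# leaf splits into four quadrants once it exceeds `capacity`.
# Hypothetical class and method names, for illustration only.

class Quadtree:
    def __init__(self, x, y, w, h, capacity=4):
        self.x, self.y, self.w, self.h = x, y, w, h   # node's region
        self.capacity = capacity
        self.points = []        # points held while this node is a leaf
        self.children = None    # four sub-quadrants after a split

    def _contains(self, px, py):
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    def insert(self, px, py):
        if not self._contains(px, py):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._split()
        return any(c.insert(px, py) for c in self.children)

    def _split(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [
            Quadtree(self.x,      self.y,      hw, hh, self.capacity),
            Quadtree(self.x + hw, self.y,      hw, hh, self.capacity),
            Quadtree(self.x,      self.y + hh, hw, hh, self.capacity),
            Quadtree(self.x + hw, self.y + hh, hw, hh, self.capacity),
        ]
        for p in self.points:           # push points down into the quadrants
            any(c.insert(*p) for c in self.children)
        self.points = []

    def range_query(self, qx, qy, qw, qh):
        # Prune subtrees whose region does not intersect the query rectangle.
        if qx >= self.x + self.w or qx + qw <= self.x or \
           qy >= self.y + self.h or qy + qh <= self.y:
            return []
        hits = [p for p in self.points
                if qx <= p[0] < qx + qw and qy <= p[1] < qy + qh]
        if self.children is not None:
            for c in self.children:
                hits.extend(c.range_query(qx, qy, qw, qh))
        return hits
```

The pruning step in `range_query` is what makes the decomposition effective for skewed distributions: densely populated regions split into deeper subtrees, while empty regions are discarded with a single intersection test.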
Optimising Convolutional Neural Networks Inference on Low-Powered GPUs
No abstract available
Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms
The layout of multi-dimensional data can have a significant impact on the
efficacy of hardware caches and, by extension, the performance of applications.
Common multi-dimensional layouts include the canonical row-major and
column-major layouts as well as the Morton curve layout. In this paper, we
describe how the Morton layout can be generalized to a very large family of
multi-dimensional data layouts with widely varying performance characteristics.
We posit that this design space can be efficiently explored using a
combinatorial evolutionary methodology based on genetic algorithms. To this
end, we propose a chromosomal representation for such layouts as well as a
methodology for estimating the fitness of array layouts using cache simulation.
We show that our fitness function correlates to kernel running time in real
hardware, and that our evolutionary strategy allows us to find candidates with
favorable simulated cache properties in four out of the eight real-world
applications under consideration in a small number of generations. Finally, we
demonstrate that the array layouts found using our evolutionary method perform
well not only in simulated environments but can also effect significant
performance gains -- up to a factor of ten in extreme cases -- in real hardware.
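One way to picture such a family of Morton-like layouts: encode, for each bit of the flat index, which dimension that bit is drawn from. The sketch below assumes this bit-interleaving encoding purely for illustration; the paper's actual chromosomal representation may differ:

```python
# A layout "chromosome" is read from the least-significant bit upward:
# entry k names the dimension that contributes bit k of the flat index.
# The classic 2-D Morton order is the alternating chromosome [0, 1, 0, 1, ...];
# grouping all of one dimension's bits together recovers row-/column-major.
# Hypothetical encoding, assumed for illustration.

def flat_index(coords, chromosome):
    """Interleave the bits of `coords` according to `chromosome`."""
    bit_pos = [0] * len(coords)     # next bit to consume from each dimension
    index = 0
    for out_bit, dim in enumerate(chromosome):
        bit = (coords[dim] >> bit_pos[dim]) & 1
        index |= bit << out_bit
        bit_pos[dim] += 1
    return index
```

Under this encoding, a genetic algorithm can mutate and recombine chromosomes (permutations of dimension labels) and score each candidate layout with the cache-simulation fitness function described above.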
Scalable communication for high-order stencil computations using CUDA-aware MPI
Modern compute nodes in high-performance computing provide a tremendous level
of parallelism and processing power. However, as arithmetic performance has
been observed to increase at a faster rate relative to memory and network
bandwidths, optimizing data movement has become critical for achieving strong
scaling in many communication-heavy applications. This performance gap has been
further accentuated with the introduction of graphics processing units, which
can provide several times higher throughput in data-parallel tasks than
central processing units. In this work, we explore the computational aspects of
iterative stencil loops and implement a generic communication scheme using
CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations
based on high-order finite differences and third-order Runge-Kutta integration.
We put particular focus on improving intra-node locality of workloads. In
comparison to a theoretical performance model, our implementation exhibits
strong scaling from one to devices at -- efficiency in
sixth-order stencil computations when the problem domain consists of
-- cells.

Comment: 17 pages, 15 figures
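The communication scheme described in the abstract amounts to refreshing ghost zones whose width equals the stencil radius (three cells for a sixth-order central difference) before every sweep. A minimal sketch of that halo-exchange pattern, emulating two ranks with plain lists on a periodic 1-D domain and omitting all MPI and GPU specifics:

```python
# Halo-exchange sketch for a sixth-order stencil: each subdomain keeps
# R = 3 ghost cells per side and refreshes them from its neighbour's
# interior before each sweep. Two "ranks" are emulated with plain lists;
# in the real code these copies would be CUDA-aware MPI sends/receives.

R = 3  # halo radius for a sixth-order central difference

def exchange_halos(left, right):
    """Copy the outermost R interior cells into the neighbour's ghost zone
    (periodic boundary: the two subdomains are also each other's far
    neighbours). Only interior cells are ever read, so order is safe."""
    left[-R:]  = right[R:2 * R]        # left's right ghosts <- right's left edge
    right[:R]  = left[-2 * R:-R]       # right's left ghosts <- left's right edge
    left[:R]   = right[-2 * R:-R]      # periodic wrap-around
    right[-R:] = left[R:2 * R]

def sweep(u):
    """Apply a sixth-derivative stencil to the interior cells only;
    ghost cells are passed through unchanged."""
    w = [1, -6, 15, -20, 15, -6, 1]    # central sixth-derivative weights
    return u[:R] + [
        sum(w[k] * u[i - R + k] for k in range(2 * R + 1))
        for i in range(R, len(u) - R)
    ] + u[-R:]
```

Partitioning the domain so that each rank sweeps its interior right after its halos arrive is the kind of intra-node locality the paper focuses on; overlapping the exchange with computation on inner cells is the usual next optimization step.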
Systematically Exploring High-Performance Representations of Vector Fields Through Compile-Time Composition
We present a novel benchmark suite for implementations of vector fields in high-performance computing environments to aid developers in quantifying and ranking their performance. We decompose the design space of such benchmarks into access patterns and storage backends, the latter of which can be further decomposed into components with different functional and non-functional properties. Through compile-time meta-programming, we generate a large number of benchmarks with minimal effort and ensure the extensibility of our suite. Our empirical analysis, based on real-world applications in high-energy physics, demonstrates the feasibility of our approach on CPU and GPU platforms, and highlights that our suite is able to evaluate performance-critical design choices. Finally, we propose that our work towards composing vector fields from elementary components is not only useful for the purposes of benchmarking, but that it naturally gives rise to a novel library for implementing such fields in domain applications.
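The decomposition into access patterns and interchangeable storage backends can be pictured with a small runtime analogue (the suite itself composes these at compile time via meta-programming; the class and function names below are hypothetical):

```python
# Two storage backends exposing the same get/set interface for a 2-D
# vector field, so any access pattern can be benchmarked against either.
# Hypothetical names; a toy analogue of the paper's compile-time composition.

class SoABackend:
    """Structure-of-arrays: one contiguous array per vector component."""
    def __init__(self, n, dims=2):
        self.data = [[0.0] * n for _ in range(dims)]
    def get(self, i):
        return tuple(comp[i] for comp in self.data)
    def set(self, i, vec):
        for comp, v in zip(self.data, vec):
            comp[i] = v

class AoSBackend:
    """Array-of-structures: each vector's components stored together."""
    def __init__(self, n, dims=2):
        self.data = [(0.0,) * dims for _ in range(n)]
    def get(self, i):
        return self.data[i]
    def set(self, i, vec):
        self.data[i] = tuple(vec)

def sum_field(backend, n):
    """An access pattern: a linear sweep reducing over every vector."""
    total = [0.0, 0.0]
    for i in range(n):
        v = backend.get(i)
        total[0] += v[0]
        total[1] += v[1]
    return tuple(total)
```

Because both backends satisfy the same interface, the sweep above (or any other access pattern) runs unchanged against either, which is the property that lets a benchmark suite enumerate the cross product of patterns and backends.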
Accelerating iterative CT reconstruction algorithms using Tensor Cores
Tensor Cores are specialized hardware units added to recent NVIDIA GPUs to speed up matrix multiplication-related tasks, such as convolutions and densely connected layers in neural networks. Due to their specific hardware implementation and programming model, Tensor Cores cannot be straightforwardly applied to other applications outside machine learning. In this paper, we demonstrate the feasibility of using NVIDIA Tensor Cores for the acceleration of a non-machine learning application: iterative Computed Tomography (CT) reconstruction. For large CT images and real-time CT scanning, the reconstruction time for many existing iterative reconstruction methods is relatively high, ranging from seconds to minutes, depending on the size of the image. Therefore, CT reconstruction is an application area that could potentially benefit from Tensor Core hardware acceleration. We first studied the reconstruction algorithm's performance as a function of the hardware-related parameters and proposed an approach to accelerate reconstruction on Tensor Cores. The results show that the proposed method provides about a 5x increase in speed and energy savings using the NVIDIA RTX 2080 Ti GPU for the parallel projection of 32 images of size 512 x 512. The relative reconstruction error due to the mixed-precision computations was almost equal to the error of single-precision (32-bit) floating-point computations. We then presented an approach for real-time and memory-limited applications by exploiting the symmetry of the system (i.e., the acquisition geometry). As the proposed approach is based on the conjugate gradient method, it can be generalized to extend its application to many research and industrial fields.
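Since the approach is based on the conjugate gradient method, its core loop can be sketched in a dependency-free form. The sketch below applies CG to the normal equations A^T A x = A^T b of a small dense toy system; it stands in for, but does not reproduce, the paper's mixed-precision Tensor Core kernels:

```python
# Conjugate gradient on the normal equations A^T A x = A^T b, the basic
# iteration behind many iterative CT reconstruction schemes, where A is
# the projection operator and b the measured sinogram. Toy dense version;
# names and structure are illustrative only.

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def cg_normal(A, b, iters=50, tol=1e-12):
    At = transpose(A)
    x = [0.0] * len(A[0])
    r = matvec(At, b)                  # residual of the normal equations
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(iters):
        Ap = matvec(At, matvec(A, p))  # apply A^T A (forward + back projection)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

The two matrix-vector products per iteration (forward and back projection) dominate the cost, which is why casting them as mixed-precision matrix multiplications makes the method a natural fit for Tensor Core acceleration.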