Manycore processing of repeated range queries over massive moving objects observations
The ability to process significant amounts of continuously updated
spatial data in a timely manner is mandatory for an increasing number of
applications. Parallelism enables such applications to face this
data-intensive challenge and allows the devised systems to feature low
latency and high scalability. In this paper we focus on a specific
data-intensive problem, concerning the repeated processing of huge amounts of
range queries over massive sets of moving objects, where the spatial extents
of queries and objects are continuously modified over time. To tackle this
problem and significantly accelerate query processing, we devise a hybrid
CPU/GPU pipeline that compresses data output and saves query processing
work. The devised system relies on an ad-hoc spatial index leading to a
problem decomposition that results in a set of independent data-parallel
tasks. The index is based on a point-region quadtree space decomposition and
effectively handles a broad range of spatial object distributions, even very
skewed ones. Also, to deal with the architectural peculiarities and
limitations of GPUs, we adopt non-trivial GPU data structures that avoid the
need for locked memory accesses and favour coalesced memory accesses, thus
enhancing the overall memory throughput. To the best of our knowledge, this
is the first work that exploits GPUs to efficiently solve repeated range
queries over massive sets of continuously moving objects characterized by
highly skewed spatial distributions. In comparison with state-of-the-art
CPU-based implementations, our method achieves significant speedups in the
order of 14x-20x, depending on the dataset, even on very cheap GPUs.
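The index above is based on a point-region (PR) quadtree space decomposition. As a rough illustration of the underlying data structure, here is a sequential pure-Python sketch with a hypothetical `Quadtree` class; the paper's GPU-resident index and its lock-free, coalescing-friendly memory layout are not reproduced here:

```python
# Minimal point-region (PR) quadtree sketch: points live in leaves, and a
# leaf splits into four quadrants once it exceeds `capacity`.
# Hypothetical class and method names, for illustration only.

class Quadtree:
    def __init__(self, x, y, w, h, capacity=4):
        self.x, self.y, self.w, self.h = x, y, w, h   # node's region
        self.capacity = capacity
        self.points = []        # points held while this node is a leaf
        self.children = None    # four sub-quadrants after a split

    def _contains(self, px, py):
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    def insert(self, px, py):
        if not self._contains(px, py):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._split()
        return any(c.insert(px, py) for c in self.children)

    def _split(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [
            Quadtree(self.x,      self.y,      hw, hh, self.capacity),
            Quadtree(self.x + hw, self.y,      hw, hh, self.capacity),
            Quadtree(self.x,      self.y + hh, hw, hh, self.capacity),
            Quadtree(self.x + hw, self.y + hh, hw, hh, self.capacity),
        ]
        for p in self.points:           # push points down into the quadrants
            any(c.insert(*p) for c in self.children)
        self.points = []

    def range_query(self, qx, qy, qw, qh):
        # Prune subtrees whose region does not intersect the query rectangle.
        if qx >= self.x + self.w or qx + qw <= self.x or \
           qy >= self.y + self.h or qy + qh <= self.y:
            return []
        hits = [p for p in self.points
                if qx <= p[0] < qx + qw and qy <= p[1] < qy + qh]
        if self.children is not None:
            for c in self.children:
                hits.extend(c.range_query(qx, qy, qw, qh))
        return hits
```

The pruning step in `range_query` is what makes the decomposition effective for skewed distributions: densely populated regions split into deeper subtrees, while empty regions are discarded with a single intersection test.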
Optimising Convolutional Neural Networks Inference on Low-Powered GPUs
No abstract available
Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms
The layout of multi-dimensional data can have a significant impact on the
efficacy of hardware caches and, by extension, the performance of applications.
Common multi-dimensional layouts include the canonical row-major and
column-major layouts as well as the Morton curve layout. In this paper, we
describe how the Morton layout can be generalized to a very large family of
multi-dimensional data layouts with widely varying performance characteristics.
We posit that this design space can be efficiently explored using a
combinatorial evolutionary methodology based on genetic algorithms. To this
end, we propose a chromosomal representation for such layouts as well as a
methodology for estimating the fitness of array layouts using cache simulation.
We show that our fitness function correlates to kernel running time in real
hardware, and that our evolutionary strategy allows us to find candidates with
favorable simulated cache properties in four out of the eight real-world
applications under consideration in a small number of generations. Finally, we
demonstrate that the array layouts found using our evolutionary method perform
well not only in simulated environments but can also effect significant
performance gains -- up to a factor of ten in extreme cases -- in real hardware.
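One way to picture such a family of Morton-like layouts: encode, for each bit of the flat index, which dimension that bit is drawn from. The sketch below assumes this bit-interleaving encoding purely for illustration; the paper's actual chromosomal representation may differ:

```python
# A layout "chromosome" is read from the least-significant bit upward:
# entry k names the dimension that contributes bit k of the flat index.
# The classic 2-D Morton order is the alternating chromosome [0, 1, 0, 1, ...];
# grouping all of one dimension's bits together recovers row-/column-major.
# Hypothetical encoding, assumed for illustration.

def flat_index(coords, chromosome):
    """Interleave the bits of `coords` according to `chromosome`."""
    bit_pos = [0] * len(coords)     # next bit to consume from each dimension
    index = 0
    for out_bit, dim in enumerate(chromosome):
        bit = (coords[dim] >> bit_pos[dim]) & 1
        index |= bit << out_bit
        bit_pos[dim] += 1
    return index
```

Under this encoding, a genetic algorithm can mutate and recombine chromosomes (permutations of dimension labels) and score each candidate layout with the cache-simulation fitness function described above.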
Scalable communication for high-order stencil computations using CUDA-aware MPI
Modern compute nodes in high-performance computing provide a tremendous level
of parallelism and processing power. However, as arithmetic performance has
been observed to increase at a faster rate relative to memory and network
bandwidths, optimizing data movement has become critical for achieving strong
scaling in many communication-heavy applications. This performance gap has been
further accentuated with the introduction of graphics processing units, which
can provide several times higher throughput in data-parallel tasks than
central processing units. In this work, we explore the computational aspects of
iterative stencil loops and implement a generic communication scheme using
CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations
based on high-order finite differences and third-order Runge-Kutta integration.
We put particular focus on improving intra-node locality of workloads. In
comparison to a theoretical performance model, our implementation exhibits
strong scaling from one to devices at -- efficiency in
sixth-order stencil computations when the problem domain consists of
-- cells.

Comment: 17 pages, 15 figures
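The communication scheme described in the abstract amounts to refreshing ghost zones whose width equals the stencil radius (three cells for a sixth-order central difference) before every sweep. A minimal sketch of that halo-exchange pattern, emulating two ranks with plain lists on a periodic 1-D domain and omitting all MPI and GPU specifics:

```python
# Halo-exchange sketch for a sixth-order stencil: each subdomain keeps
# R = 3 ghost cells per side and refreshes them from its neighbour's
# interior before each sweep. Two "ranks" are emulated with plain lists;
# in the real code these copies would be CUDA-aware MPI sends/receives.

R = 3  # halo radius for a sixth-order central difference

def exchange_halos(left, right):
    """Copy the outermost R interior cells into the neighbour's ghost zone
    (periodic boundary: the two subdomains are also each other's far
    neighbours). Only interior cells are ever read, so order is safe."""
    left[-R:]  = right[R:2 * R]        # left's right ghosts <- right's left edge
    right[:R]  = left[-2 * R:-R]       # right's left ghosts <- left's right edge
    left[:R]   = right[-2 * R:-R]      # periodic wrap-around
    right[-R:] = left[R:2 * R]

def sweep(u):
    """Apply a sixth-derivative stencil to the interior cells only;
    ghost cells are passed through unchanged."""
    w = [1, -6, 15, -20, 15, -6, 1]    # central sixth-derivative weights
    return u[:R] + [
        sum(w[k] * u[i - R + k] for k in range(2 * R + 1))
        for i in range(R, len(u) - R)
    ] + u[-R:]
```

Partitioning the domain so that each rank sweeps its interior right after its halos arrive is the kind of intra-node locality the paper focuses on; overlapping the exchange with computation on inner cells is the usual next optimization step.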
Systematically Exploring High-Performance Representations of Vector Fields Through Compile-Time Composition
We present a novel benchmark suite for implementations of vector fields in high-performance computing environments to aid developers in quantifying and ranking their performance. We decompose the design space of such benchmarks into access patterns and storage backends, the latter of which can be further decomposed into components with different functional and non-functional properties. Through compile-time meta-programming, we generate a large number of benchmarks with minimal effort and ensure the extensibility of our suite. Our empirical analysis, based on real-world applications in high-energy physics, demonstrates the feasibility of our approach on CPU and GPU platforms, and highlights that our suite is able to evaluate performance-critical design choices. Finally, we propose that our work towards composing vector fields from elementary components is not only useful for the purposes of benchmarking, but that it naturally gives rise to a novel library for implementing such fields in domain applications.
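The decomposition into access patterns and interchangeable storage backends can be pictured with a small runtime analogue (the suite itself composes these at compile time via meta-programming; the class and function names below are hypothetical):

```python
# Two storage backends exposing the same get/set interface for a 2-D
# vector field, so any access pattern can be benchmarked against either.
# Hypothetical names; a toy analogue of the paper's compile-time composition.

class SoABackend:
    """Structure-of-arrays: one contiguous array per vector component."""
    def __init__(self, n, dims=2):
        self.data = [[0.0] * n for _ in range(dims)]
    def get(self, i):
        return tuple(comp[i] for comp in self.data)
    def set(self, i, vec):
        for comp, v in zip(self.data, vec):
            comp[i] = v

class AoSBackend:
    """Array-of-structures: each vector's components stored together."""
    def __init__(self, n, dims=2):
        self.data = [(0.0,) * dims for _ in range(n)]
    def get(self, i):
        return self.data[i]
    def set(self, i, vec):
        self.data[i] = tuple(vec)

def sum_field(backend, n):
    """An access pattern: a linear sweep reducing over every vector."""
    total = [0.0, 0.0]
    for i in range(n):
        v = backend.get(i)
        total[0] += v[0]
        total[1] += v[1]
    return tuple(total)
```

Because both backends satisfy the same interface, the sweep above (or any other access pattern) runs unchanged against either, which is the property that lets a benchmark suite enumerate the cross product of patterns and backends.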
Accelerating iterative CT reconstruction algorithms using Tensor Cores
Tensor Cores are specialized hardware units added to recent NVIDIA GPUs to speed up matrix multiplication-related tasks, such as convolutions and densely connected layers in neural networks. Due to their specific hardware implementation and programming model, Tensor Cores cannot be straightforwardly applied to other applications outside machine learning. In this paper, we demonstrate the feasibility of using NVIDIA Tensor Cores for the acceleration of a non-machine learning application: iterative Computed Tomography (CT) reconstruction. For large CT images and real-time CT scanning, the reconstruction time for many existing iterative reconstruction methods is relatively high, ranging from seconds to minutes, depending on the size of the image. Therefore, CT reconstruction is an application area that could potentially benefit from Tensor Core hardware acceleration. We first studied the reconstruction algorithm's performance as a function of the hardware-related parameters and proposed an approach to accelerate reconstruction on Tensor Cores. The results show that the proposed method provides about a 5x increase in speed and energy savings using the NVIDIA RTX 2080 Ti GPU for the parallel projection of 32 images of size 512 x 512. The relative reconstruction error due to the mixed-precision computations was almost equal to the error of single-precision (32-bit) floating-point computations. We then presented an approach for real-time and memory-limited applications by exploiting the symmetry of the system (i.e., the acquisition geometry). As the proposed approach is based on the conjugate gradient method, it can be generalized to extend its application to many research and industrial fields.
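Since the approach is based on the conjugate gradient method, its core loop can be sketched in a dependency-free form. The sketch below applies CG to the normal equations A^T A x = A^T b of a small dense toy system; it stands in for, but does not reproduce, the paper's mixed-precision Tensor Core kernels:

```python
# Conjugate gradient on the normal equations A^T A x = A^T b, the basic
# iteration behind many iterative CT reconstruction schemes, where A is
# the projection operator and b the measured sinogram. Toy dense version;
# names and structure are illustrative only.

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def cg_normal(A, b, iters=50, tol=1e-12):
    At = transpose(A)
    x = [0.0] * len(A[0])
    r = matvec(At, b)                  # residual of the normal equations
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(iters):
        Ap = matvec(At, matvec(A, p))  # apply A^T A (forward + back projection)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

The two matrix-vector products per iteration (forward and back projection) dominate the cost, which is why casting them as mixed-precision matrix multiplications makes the method a natural fit for Tensor Core acceleration.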