182 research outputs found

    Manycore processing of repeated range queries over massive moving objects observations

    The ability to process large amounts of continuously updated spatial data in a timely manner is mandatory for an increasing number of applications. Parallelism enables such applications to face this data-intensive challenge and allows the devised systems to feature low latency and high scalability. In this paper we focus on a specific data-intensive problem: the repeated processing of huge numbers of range queries over massive sets of moving objects, where the spatial extents of queries and objects are continuously modified over time. To tackle this problem and significantly accelerate query processing, we devise a hybrid CPU/GPU pipeline that compresses data output and saves query processing work. The devised system relies on an ad-hoc spatial index leading to a problem decomposition that results in a set of independent data-parallel tasks. The index is based on a point-region quadtree space decomposition and effectively handles a broad range of spatial object distributions, even highly skewed ones. Also, to deal with the architectural peculiarities and limitations of GPUs, we adopt non-trivial GPU data structures that avoid the need for locked memory accesses and favour coalesced memory accesses, thus enhancing the overall memory throughput. To the best of our knowledge, this is the first work that exploits GPUs to efficiently solve repeated range queries over massive sets of continuously moving objects characterized by highly skewed spatial distributions. In comparison with state-of-the-art CPU-based implementations, our method achieves significant speedups on the order of 14x-20x, depending on the dataset, even when considering very cheap GPUs.
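
    The point-region quadtree decomposition mentioned in this abstract can be sketched on the CPU as follows. This is a minimal textbook illustration, not the paper's GPU index: the class name `PRQuadtree`, the `capacity` and `max_depth` parameters, and the tuple representation of points are all assumptions made for this example.

```python
class PRQuadtree:
    """Point-region quadtree over the square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size, capacity=4, max_depth=16):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity      # hypothetical leaf capacity
        self.max_depth = max_depth    # stops endless splitting on duplicates
        self.points = []              # points held while this node is a leaf
        self.children = None          # four sub-quadrants after a split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.max_depth > 0:
            self._split()

    def _split(self):
        half = self.size / 2
        # Child order matches the index 2 * dy + dx used in _child_for.
        self.children = [
            PRQuadtree(self.x0 + dx * half, self.y0 + dy * half, half,
                       self.capacity, self.max_depth - 1)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pending, self.points = self.points, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        half = self.size / 2
        dx = 1 if p[0] >= self.x0 + half else 0
        dy = 1 if p[1] >= self.y0 + half else 0
        return self.children[2 * dy + dx]

    def range_query(self, xmin, ymin, xmax, ymax):
        """Return all points inside the closed rectangle."""
        # Prune quadrants that cannot overlap the query rectangle.
        if (xmax < self.x0 or xmin >= self.x0 + self.size or
                ymax < self.y0 or ymin >= self.y0 + self.size):
            return []
        if self.children is None:
            return [p for p in self.points
                    if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]
        found = []
        for child in self.children:
            found.extend(child.range_query(xmin, ymin, xmax, ymax))
        return found


tree = PRQuadtree(0, 0, 100)
for i in range(200):
    tree.insert((i * 7 % 100, i * 13 % 100))
hits = tree.range_query(10, 10, 30, 30)
```

    Because dense regions are split more deeply than sparse ones, each leaf holds a bounded number of points regardless of skew, which is what makes the leaves usable as independent units of work.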

    Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms

    The layout of multi-dimensional data can have a significant impact on the efficacy of hardware caches and, by extension, the performance of applications. Common multi-dimensional layouts include the canonical row-major and column-major layouts as well as the Morton curve layout. In this paper, we describe how the Morton layout can be generalized to a very large family of multi-dimensional data layouts with widely varying performance characteristics. We posit that this design space can be efficiently explored using a combinatorial evolutionary methodology based on genetic algorithms. To this end, we propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts using cache simulation. We show that our fitness function correlates with kernel running time on real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties in four out of the eight real-world applications under consideration in a small number of generations. Finally, we demonstrate that the array layouts found using our evolutionary method perform well not only in simulated environments but can also yield significant performance gains (up to a factor of ten in extreme cases) on real hardware.
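
    To make the layouts named in this abstract concrete, the sketch below computes a 2D Morton index by bit interleaving, then shows one plausible way to generalize it: a list that says which dimension supplies each output bit. The function names and the `pattern` encoding are assumptions for illustration and are not necessarily the chromosomal representation used in the paper.

```python
def morton_index(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x takes the even positions)."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


def layout_index(coords, pattern) -> int:
    """Generalized bit-interleaved layout (hypothetical encoding).

    `pattern` lists, from the least to the most significant output bit,
    which dimension supplies its next unused input bit.
    """
    taken = [0] * len(coords)  # bits already consumed per dimension
    index = 0
    for out_bit, dim in enumerate(pattern):
        index |= ((coords[dim] >> taken[dim]) & 1) << out_bit
        taken[dim] += 1
    return index


# For an 8 x 8 array, three patterns recover three familiar layouts.
morton = layout_index((3, 5), [0, 1, 0, 1, 0, 1])      # 39, same as morton_index(3, 5)
x_fastest = layout_index((3, 5), [0, 0, 0, 1, 1, 1])   # 43 = 3 + 8 * 5
y_fastest = layout_index((3, 5), [1, 1, 1, 0, 0, 0])   # 29 = 5 + 8 * 3
```

    Under this encoding, every ordering of the pattern entries is a distinct layout, which gives a sense of why the design space grows quickly enough to call for a search method.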

    Scalable communication for high-order stencil computations using CUDA-aware MPI

    Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can provide several times the throughput of central processing units in data-parallel tasks. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving intra-node locality of workloads. In comparison to a theoretical performance model, our implementation exhibits strong scaling from one to 64 devices at 50% to 87% efficiency in sixth-order stencil computations when the problem domain consists of $256^3$ to $1024^3$ cells. Comment: 17 pages, 15 figures
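
    The link between stencil order and communication can be illustrated in one dimension. A sixth-order central difference reaches three cells to each side, so every subdomain needs three halo cells from each neighbour before it can compute. The sketch below emulates that exchange with plain Python lists on a periodic domain. It is an assumption-laden toy, not the paper's CUDA-aware MPI scheme, and the names `derivative_with_halo` and `add_halos` are hypothetical.

```python
import math

RADIUS = 3  # a sixth-order central difference reaches 3 cells each way
# Standard sixth-order coefficients for the first derivative.
COEFFS = (-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60)


def derivative_with_halo(padded, h):
    """Differentiate the interior of `padded`, which carries RADIUS halo
    cells on each side, using grid spacing h."""
    n = len(padded) - 2 * RADIUS
    return [
        sum(c * padded[i + k] for k, c in enumerate(COEFFS)) / h
        for i in range(n)
    ]


def add_halos(chunks):
    """Emulate a periodic halo exchange: each chunk receives RADIUS cells
    from its left and right neighbours (chunks must hold >= RADIUS cells)."""
    m = len(chunks)
    return [
        chunks[(r - 1) % m][-RADIUS:] + chunks[r] + chunks[(r + 1) % m][:RADIUS]
        for r in range(m)
    ]


# Differentiate sin(x) on a periodic grid split across 4 emulated devices.
n, parts = 64, 4
h = 2 * math.pi / n
field = [math.sin(i * h) for i in range(n)]
size = n // parts
chunks = [field[r * size:(r + 1) * size] for r in range(parts)]
result = [v for padded in add_halos(chunks)
          for v in derivative_with_halo(padded, h)]
max_error = max(abs(v - math.cos(i * h)) for i, v in enumerate(result))
```

    In this toy, the halo volume per subdomain stays fixed at 2 * RADIUS cells while the interior shrinks as more devices are added, which hints at why communication becomes the limiting factor in strong scaling.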

    Systematically Exploring High-Performance Representations of Vector Fields Through Compile-Time Composition

    We present a novel benchmark suite for implementations of vector fields in high-performance computing environments to aid developers in quantifying and ranking their performance. We decompose the design space of such benchmarks into access patterns and storage backends, the latter of which can be further decomposed into components with different functional and non-functional properties. Through compile-time meta-programming, we generate a large number of benchmarks with minimal effort and ensure the extensibility of our suite. Our empirical analysis, based on real-world applications in high-energy physics, demonstrates the feasibility of our approach on CPU and GPU platforms, and highlights that our suite is able to evaluate performance-critical design choices. Finally, we propose that our work towards composing vector fields from elementary components is not only useful for the purposes of benchmarking, but that it naturally gives rise to a novel library for implementing such fields in domain applications.

    Accelerating iterative CT reconstruction algorithms using Tensor Cores

    Tensor Cores are specialized hardware units added to recent NVIDIA GPUs to speed up matrix multiplication-related tasks, such as convolutions and densely connected layers in neural networks. Due to their specific hardware implementation and programming model, Tensor Cores cannot be straightforwardly applied to applications outside machine learning. In this paper, we demonstrate the feasibility of using NVIDIA Tensor Cores to accelerate a non-machine-learning application: iterative Computed Tomography (CT) reconstruction. For large CT images and real-time CT scanning, the reconstruction time for many existing iterative reconstruction methods is relatively high, ranging from seconds to minutes, depending on the size of the image. Therefore, CT reconstruction is an application area that could potentially benefit from Tensor Core hardware acceleration. We first studied the reconstruction algorithm's performance as a function of the hardware-related parameters and proposed an approach to accelerate reconstruction on Tensor Cores. The results show that the proposed method provides about a 5x increase in speed and energy savings using the NVIDIA RTX 2080 Ti GPU for the parallel projection of 32 images of size 512 x 512. The relative reconstruction error due to the mixed-precision computations was almost equal to the error of single-precision (32-bit) floating-point computations. We then presented an approach for real-time and memory-limited applications by exploiting the symmetry of the system (i.e., the acquisition geometry). As the proposed approach is based on the conjugate gradient method, it can be generalized to extend its application to many research and industrial fields.
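
    For readers unfamiliar with the conjugate gradient method that this abstract builds on, the following is a generic textbook sketch for a small symmetric positive-definite system. It uses plain Python lists in double precision and says nothing about the paper's Tensor Core or mixed-precision implementation. The function name, tolerance, and test matrix are assumptions chosen for illustration.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A
    (given as a list of rows) with the textbook conjugate gradient method."""

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = [0.0] * len(b)   # initial guess
    r = list(b)          # residual b - A x for x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)  # the matrix product dominates the cost per step
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small example: the exact solution is x = [1/11, 7/11].
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

    Each iteration is dominated by one matrix-vector product, which is the kind of operation that matrix-multiplication hardware is built to speed up.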