260 research outputs found
A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs
Sorting is at the core of many database operations, such as index creation,
sort-merge joins, and user-requested output sorting. As GPUs are emerging as a
promising platform to accelerate various operations, sorting on GPUs becomes a
viable endeavour. Over the past few years, several improvements have been
proposed for sorting on GPUs, leading to the first radix sort implementations
that achieve a sorting rate of over one billion 32-bit keys per second. Yet,
state-of-the-art approaches are heavily memory bandwidth-bound, as they require
substantially more memory transfers than their CPU-based counterparts.
Our work proposes a novel approach that almost halves the amount of memory
transfers and, therefore, considerably lifts the memory bandwidth limitation.
Being able to sort two gigabytes of eight-byte records in as little as 50
milliseconds, our approach achieves a 2.32-fold improvement over the
state-of-the-art GPU-based radix sort for uniform distributions, sustaining a
minimum speed-up of no less than a factor of 1.66 for skewed distributions.
To address inputs that either do not reside on the GPU or exceed the
available device memory, we build on our efficient GPU sorting approach with a
pipelined heterogeneous sorting algorithm that mitigates the overhead
associated with PCIe data transfers. Comparing the end-to-end sorting
performance to the state-of-the-art CPU-based radix sort running 16 threads,
our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for
sorting 64 GB key-value pairs with a skewed and a uniform distribution,
respectively.Comment: 16 pages, accepted at SIGMOD 201
Efficiently Processing Large Relational Joins on GPUs
With the growing interest in Machine Learning (ML), Graphic Processing Units
(GPUs) have become key elements of any computing infrastructure. Their
widespread deployment in data centers and the cloud raises the question of how
to use them beyond ML use cases, with growing interest in employing them in a
database context. In this paper, we explore and analyze the implementation of
relational joins on GPUs from an end-to-end perspective, meaning that we take
result materialization into account. We conduct a comprehensive performance
study of state-of-the-art GPU-based join algorithms over diverse synthetic
workloads and TPC-H/TPC-DS benchmarks. Without being restricted to the
conventional setting where each input relation has only one key and one non-key
with all attributes being 4-bytes long, we investigate the effect of various
factors (e.g., input sizes, number of non-key columns, skewness, data types,
match ratios, and number of joins) on the end-to-end throughput. Furthermore,
we propose a technique called "Gather-from-Transformed-Relations" (GFTR) to
reduce the long-ignored yet high materialization cost in GPU-based joins. The
experimental evaluation shows significant performance improvements from GFTR,
with throughput gains of up to 2.3 times over previous work. The insights
gained from the performance study not only advance the understanding of
GPU-based joins but also introduce a structured approach to selecting the most
efficient GPU join algorithm based on the input relation characteristics
Optimizing sparse matrix-vector multiplication in NEC SX-Aurora vector engine
Sparse Matrix-Vector multiplication (SpMV) is an essential piece of code used in many High Performance Computing (HPC) applications. As previous literature shows, achieving efficient vectorization and performance in modern multi-core systems is nothing straightforward. It is important then to revisit the current stateof-the-art matrix formats and optimizations to be able to deliver deliver high performance in long vector architectures. In this tech-report, we describe how to develop an efficient implementation that achieves high throughput in the NEC Vector Engine: a 256 element-long vector architecture. Combining several pre-processing and kernel optimizations we obtain an average 12% improvement over a base SELLC-s implementation on a heterogeneous set of 24 matrices.Preprin
The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey
Graph Neural Networks (GNNs) are an emerging research field. This specialized
Deep Neural Network (DNN) architecture is capable of processing graph
structured data and bridges the gap between graph processing and Deep Learning
(DL). As graphs are everywhere, GNNs can be applied to various domains
including recommendation systems, computer vision, natural language processing,
biology and chemistry. With the rapid growing size of real world graphs, the
need for efficient and scalable GNN training solutions has come. Consequently,
many works proposing GNN systems have emerged throughout the past few years.
However, there is an acute lack of overview, categorization and comparison of
such systems. We aim to fill this gap by summarizing and categorizing important
methods and techniques for large-scale GNN solutions. In addition, we establish
connections between GNN systems, graph processing systems and DL systems.Comment: Accepted at ACM Computing Survey
Accelerated coplanar facet radio synthesis imaging
Imaging in radio astronomy entails the Fourier inversion of the relation between the sampled spatial coherence of an electromagnetic field and the intensity of its emitting source. This inversion is normally computed by performing a convolutional resampling step and applying the Inverse Fast Fourier Transform, because this leads to computational savings. Unfortunately, the resulting planar approximation of the sky is only valid over small regions. When imaging over wider fields of view, and in particular using telescope arrays with long non-East-West components, significant distortions are introduced in the computed image. We propose a coplanar faceting algorithm, where the sky is split up into many smaller images. Each of these narrow-field images are further corrected using a phase-correcting tech- nique known as w-projection. This eliminates the projection error along the edges of the facets and ensures approximate coplanarity. The combination of faceting and w-projection approaches alleviates the memory constraints of previous w-projection implementations. We compared the scaling performance of both single and double precision resampled images in both an optimized multi-threaded CPU implementation and a GPU implementation that uses a memory-access- limiting work distribution strategy. We found that such a w-faceting approach scales slightly better than a traditional w-projection approach on GPUs. We also found that double precision resampling on GPUs is about 71% slower than its single precision counterpart, making double precision resampling on GPUs less power efficient than CPU-based double precision resampling. Lastly, we have seen that employing only single precision in the resampling summations produces significant error in continuum images for a MeerKAT-sized array over long observations, especially when employing the large convolution filters necessary to create large images
AceleraciĂłn de algoritmos de procesamiento de imágenes para el análisis de partĂculas individuales con microscopia electrĂłnica
Tesis Doctoral inĂ©dita cotutelada por la Masaryk University (RepĂşblica Checa) y la Universidad AutĂłnoma de Madrid, Escuela PolitĂ©cnica Superior, Departamento de IngenierĂa Informática. Fecha de Lectura: 24-10-2022Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray
crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and
other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM
is the computational complexity. Modern electron microscopes can produce terabytes of data per single
session, from which hundreds of thousands of particles must be extracted and processed to obtain a
near-atomic resolution of the original sample. Many existing software solutions use high-Performance
Computing (HPC) techniques to bring these computations to the realm of practical usability. The
common approach to acceleration is parallelization of the processing, but in praxis, we face many
complications, such as problem decomposition, data distribution, load scheduling, balancing, and
synchronization. Utilization of various accelerators further complicates the situation, as heterogeneous
hardware brings additional caveats, for example, limited portability, under-utilization due to synchronization,
and sub-optimal code performance due to missing specialization.
This dissertation, structured as a compendium of articles, aims to improve the algorithms used
in Cryo-EM, esp. the SPA (Single Particle Analysis). We focus on the single-node performance
optimizations, using the techniques either available or developed in the HPC field, such as heterogeneous
computing or autotuning, which potentially needs the formulation of novel algorithms. The
secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since
the Cryo-EM pipeline consists of multiple distinct steps targetting different types of data, there is no
single bottleneck to be solved. As such, the presented articles show a holistic approach to performance
optimization.
First, we give details on the GPU acceleration of the specific programs. The achieved speedup is
due to the higher performance of the GPU, adjustments of the original algorithm to it, and application
of the novel algorithms. More specifically, we provide implementation details of programs for movie
alignment, 2D classification, and 3D reconstruction that have been sped up by order of magnitude
compared to their original multi-CPU implementation or sufficiently the be used on-the-fly. In addition
to these three programs, multiple other programs from an actively used, open-source software package
XMIPP have been accelerated and improved.
Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of
software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we
present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the
cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set
of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA,
together with the introduction of complex dynamic autotuning to the KTT tool.
Third, we propose an image processing framework Umpalumpa, which combines a task-based
runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for
writing complex workflows which automatically use available HW resources and adjust to different HW
and data but at the same time are easy to maintainThe project that gave rise to these results received the support of a fellowship from the “la Caixa”
Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021.
This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie grant agreement No. 71367
- …