6 research outputs found
Efficiently Processing Large Relational Joins on GPUs
With the growing interest in Machine Learning (ML), Graphic Processing Units
(GPUs) have become key elements of any computing infrastructure. Their
widespread deployment in data centers and the cloud raises the question of how
to use them beyond ML use cases, with growing interest in employing them in a
database context. In this paper, we explore and analyze the implementation of
relational joins on GPUs from an end-to-end perspective, meaning that we take
result materialization into account. We conduct a comprehensive performance
study of state-of-the-art GPU-based join algorithms over diverse synthetic
workloads and TPC-H/TPC-DS benchmarks. Without being restricted to the
conventional setting where each input relation has only one key and one non-key
with all attributes being 4-bytes long, we investigate the effect of various
factors (e.g., input sizes, number of non-key columns, skewness, data types,
match ratios, and number of joins) on the end-to-end throughput. Furthermore,
we propose a technique called "Gather-from-Transformed-Relations" (GFTR) to
reduce the long-ignored yet high materialization cost in GPU-based joins. The
experimental evaluation shows significant performance improvements from GFTR,
with throughput gains of up to 2.3 times over previous work. The insights
gained from the performance study not only advance the understanding of
GPU-based joins but also introduce a structured approach to selecting the most
efficient GPU join algorithm based on the input relation characteristics
A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing
ABSTRACT Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: () data-distribution, () load factor, () dataset size, () read/write-ratio, and () un/successfulratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: () five representative hashing schemes (which includes an improved variant of Robin Hood hashing), () four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse about the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications, such as query processing. Finally, we also provide a strong guideline about when to use which hashing method
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
against improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.Comment: Our open source software is available at
https://github.com/CMU-SAFARI/DAMO
Accelerators for Data Processing
The explosive growth in digital data and its growing role in real-time analytics motivate the design of high-performance database management systems (DBMSs). Meanwhile, slowdown in supply voltage scaling has stymied improvements in core performance and ushered an era of power-limited chips. These developments motivate the design of software and hardware DBMS accelerators that (1) maximize utility by accelerating the dominant operations, and (2) provide flexibility in the choice of DBMS, data layout, and data types. In this thesis, we identify pointer-intensive data structure operations as a key performance and efficiency bottleneck in data analytics workloads. We observe that data analytics tasks include a large number of independent data structure lookups, each of which is characterized by dependent long-latency memory accesses due to pointer chasing. Unfortunately, exploiting such inter-lookup parallelism to overlap memory accesses from different lookups is not possible within the limited instruction window of modern out-of-order cores. Similarly, software prefetching techniques attempt to exploit inter-lookup parallelism by statically staging independent lookups, and hence break down in the face of irregularity across lookup stages. Based on these observations, we provide a dynamic software acceleration scheme for exploiting inter-lookup parallelism to hide the memory access latency despite the irregularities across lookups. Furthermore, we propose a programmable hardware accelerator to maximize the efficiency of the data structure lookups. As a result, through flexible hardware and software techniques we eliminate a key efficiency and performance bottleneck in data analytics operations