CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs
Data compression and decompression have become vital components of big-data
applications to manage the exponential growth in the amount of data collected
and stored. Furthermore, big-data applications have increasingly adopted GPUs
due to their high compute throughput and memory bandwidth. Prior works presume
that decompression is memory-bound: they dedicate most of the GPU's threads to
data movement and adopt complex software techniques to hide the memory latency
of reading compressed data and writing uncompressed data. This paper shows
that these techniques lead to poor GPU resource utilization, as most threads end
up waiting for the few decoding threads, exposing compute and synchronization
latencies.
Based on this observation, we propose CODAG, a novel and simple kernel
architecture for high throughput decompression on GPUs. CODAG eliminates the
use of specialized groups of threads, frees up compute resources to increase
the number of parallel decompression streams, and leverages the ample compute
activities and the GPU's hardware scheduler to tolerate synchronization,
compute, and memory latencies. Furthermore, CODAG provides a framework for
users to easily incorporate new decompression algorithms without being burdened
with implementing complex optimizations to hide memory latency. We validate our
proposed architecture with three different encoding techniques, RLE v1, RLE v2,
and Deflate, and a wide range of large datasets from different domains. We show
that CODAG provides 13.46x, 5.69x, and 1.18x speedups for RLE v1, RLE v2, and
Deflate, respectively, when compared to the state-of-the-art decompressors from
NVIDIA RAPIDS.
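The run-length encoding that CODAG is validated on can be illustrated with a minimal decoder. This is a CPU-side sketch of the encoding's semantics only, not CODAG's GPU kernel, and the (value, count) pair layout is an assumption for illustration rather than the RAPIDS RLE v1 on-disk format:

```python
def rle_decode(pairs):
    """Decode a run-length-encoded stream given as (value, count) pairs.

    Illustrative sketch: CODAG's actual design decodes many independent
    compressed streams in parallel on the GPU, relying on the hardware
    scheduler rather than specialized thread groups to hide latency.
    """
    out = []
    for value, count in pairs:
        out.extend([value] * count)  # expand each run back to full length
    return out

# A run of repeated symbols compresses to a few pairs:
print(rle_decode([("a", 3), ("b", 1), ("c", 2)]))
# ['a', 'a', 'a', 'b', 'c', 'c']
```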
Predictive runtime code scheduling for heterogeneous architectures
Heterogeneous architectures are currently widespread. With
the advent of easy-to-program general purpose GPUs, virtually every
recent desktop computer is a heterogeneous system. Combining the CPU
and the GPU brings great amounts of processing power. However, such
architectures are often used in a restricted way for domain-specific
applications like scientific applications and games, and they tend to be used
by a single application at a time. We envision future heterogeneous
computing systems where all their heterogeneous resources are continuously
utilized by different applications with versioned critical parts to be able
to better adapt their behavior and improve execution time, power
consumption, response time and other constraints at runtime. Under such a
model, adaptive scheduling becomes a critical component.
In this paper, we propose a novel predictive user-level scheduler based on
past performance history for heterogeneous systems. We developed several
scheduling policies and present a study of their impact on system
performance. We demonstrate that such a scheduler allows multiple
applications to fully utilize all available processing resources in CPU/GPU-
like systems and consistently achieve speedups ranging from 30% to 40%
compared to just using the GPU in single-application mode.
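The core idea of history-based predictive scheduling can be sketched as follows. The class, the policy (pick the device minimizing queued work plus the mean of past runtimes), and all names here are illustrative assumptions, not the paper's exact scheduler:

```python
from collections import defaultdict

class HistoryScheduler:
    """Sketch of a predictive user-level scheduler: it keeps per-(kernel,
    device) runtime history and sends each task to the device with the
    lowest predicted finish time (current backlog + predicted runtime)."""

    def __init__(self, devices):
        self.history = defaultdict(list)          # (kernel, device) -> runtimes
        self.backlog = {d: 0.0 for d in devices}  # queued work per device

    def predict(self, kernel, device, default=1.0):
        runs = self.history[(kernel, device)]
        return sum(runs) / len(runs) if runs else default

    def submit(self, kernel):
        device = min(self.backlog,
                     key=lambda d: self.backlog[d] + self.predict(kernel, d))
        self.backlog[device] += self.predict(kernel, device)
        return device

    def record(self, kernel, device, runtime):
        # Observed completion: update history, drain the device's backlog.
        self.history[(kernel, device)].append(runtime)
        self.backlog[device] = max(0.0, self.backlog[device] - runtime)

sched = HistoryScheduler(["cpu", "gpu"])
sched.record("matmul", "gpu", 0.2)  # past observation: fast on the GPU
sched.record("matmul", "cpu", 1.5)  # past observation: slow on the CPU
print(sched.submit("matmul"))       # gpu
```

Because the CPU remains available for kernels whose history favors it, several applications can keep all devices busy at once, which is the source of the reported multi-application speedups.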
CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-Core Clusters
Rapid advancements in multi-core processor architectures along with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster is still a major challenge. Programmers have to manually deal with low-level details that should ideally be the responsibility of an intelligent compiler or a runtime layer. Various programming languages have been proposed as a solution to this problem, but are yet to be adopted widely for performance-critical code, mainly due to relatively immature software frameworks and the effort involved in rewriting existing code in a new language. In this paper, we motivate and describe our initial study exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing number of CUDA developers and benchmarks, and the stability of the CUDA software stack collectively make CUDA a good candidate programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work distribution runtime, and a light-weight software distributed shared memory to manage parallel executions. Initial results on several standard CUDA benchmark programs achieve speedups of up to 7.5X on a cluster with 8 nodes, opening up an interesting direction for further investigation.
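The work-distribution step CFC relies on can be sketched simply: because CUDA thread blocks are independent by construction, a kernel's grid can be partitioned into contiguous block ranges and each range executed on a different node. The even split below is an illustrative assumption, not CFC's actual runtime policy:

```python
def partition_grid(num_blocks, num_nodes):
    """Split a kernel's thread-block grid into contiguous per-node ranges.

    Toy sketch of cluster-level work distribution: each node executes the
    half-open block range [start, end) assigned to it, with remainder
    blocks spread over the first nodes.
    """
    base, extra = divmod(num_blocks, num_nodes)
    ranges, start = [], 0
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

print(partition_grid(10, 4))
# [(0, 3), (3, 6), (6, 8), (8, 10)]
```

In a real system the runtime must also keep globally shared data coherent across nodes, which is what CFC's light-weight software distributed shared memory handles.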
Assessment of two complementary influenza surveillance systems: Sentinel primary care influenza-like illness versus severe hospitalized laboratory-confirmed influenza using the moving epidemic method
Monitoring seasonal influenza epidemics is the cornerstone of epidemiological surveillance of acute respiratory virus infections worldwide. This work aims to compare two sentinel surveillance systems within the Daily Acute Respiratory Infection Information System of Catalonia (PIDIRAC), the primary care ILI and influenza-confirmed samples from primary care (PIDIRAC-ILI and PIDIRAC-FLU) and the severe hospitalized laboratory-confirmed influenza system (SHLCI), with regard to how they perform in forecasting epidemic onset and severity, allowing for healthcare preparedness. An epidemiological study was carried out over seven influenza seasons (2010-2017) in Catalonia, with data from influenza sentinel surveillance by primary care physicians reporting ILI, along with laboratory confirmation of influenza from systematic sampling of ILI cases, and 12 hospitals that provided data on severe hospitalized cases with laboratory-confirmed influenza (SHLCI-FLU). Epidemic thresholds for ILI and SHLCI-FLU (overall) as well as influenza A (SHLCI-FLUA) and influenza B (SHLCI-FLUB) incidence rates were assessed by the Moving Epidemic Method. Epidemic thresholds for primary care sentinel surveillance influenza-like illness (PIDIRAC-ILI) incidence rates ranged from 83.65 to 503.92 per 100,000 inhabitants. Paired incidence rate curves for SHLCI-FLU/PIDIRAC-ILI and SHLCI-FLUA/PIDIRAC-FLUA showed the best correlation indices (0.805 and 0.724, respectively). Assessing the delay in reaching the epidemic level, the PIDIRAC-ILI source forecast an average of 1.6 weeks ahead of the other paired sources. Differences were larger when SHLCI cases were paired to PIDIRAC-ILI and PIDIRAC-FLUB, although statistical significance was observed only for SHLCI-FLU/PIDIRAC-ILI (p-value, Wilcoxon test = 0.039).
The combined ILI and confirmed influenza data from primary care, along with the severe hospitalized laboratory-confirmed influenza data, from the PIDIRAC sentinel surveillance system provide timely and accurate syndromic and virological surveillance of influenza from the community level to hospitalization of severe cases.
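The threshold idea behind the Moving Epidemic Method can be sketched in a heavily simplified form. The real MEM first delimits each season's epidemic period and models the highest pre-epidemic values; the version below only pools the largest pre-epidemic weekly rates across seasons and returns the upper one-sided 95% confidence limit of their mean under a normal approximation. All parameters, names, and the approximation itself are illustrative assumptions, not the published method:

```python
import statistics

def mem_style_threshold(pre_epidemic_rates, n_top=30, z=1.645):
    """Simplified MEM-style epidemic threshold.

    Takes the n_top largest pre-epidemic weekly rates pooled across past
    seasons and returns the upper one-sided 95% confidence limit of their
    mean (normal approximation). Weekly rates above the returned value
    would flag epidemic onset.
    """
    top = sorted(pre_epidemic_rates, reverse=True)[:n_top]
    mean = statistics.mean(top)
    sem = statistics.stdev(top) / len(top) ** 0.5  # standard error of the mean
    return mean + z * sem

# Hypothetical pooled pre-epidemic ILI rates (per 100,000) from past seasons:
rates = [12.0, 15.5, 18.2, 22.7, 30.1, 41.9, 55.3, 60.8, 71.4, 83.6]
threshold = mem_style_threshold(rates, n_top=5)
print(round(threshold, 1))
```

A surveillance source whose weekly curve crosses such a threshold earlier, as PIDIRAC-ILI does here by about 1.6 weeks, gives hospitals correspondingly more lead time to prepare.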