14 research outputs found

    CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs

    Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior works presume that decompression is memory-bound and have dedicated most of the GPU's threads to data movement and adopted complex software techniques to hide memory latency for reading compressed data and writing uncompressed data. This paper shows that these techniques lead to poor GPU resource utilization as most threads end up waiting for the few decoding threads, exposing compute and synchronization latencies. Based on this observation, we propose CODAG, a novel and simple kernel architecture for high-throughput decompression on GPUs. CODAG eliminates the use of specialized groups of threads, frees up compute resources to increase the number of parallel decompression streams, and leverages the ample compute activities and the GPU's hardware scheduler to tolerate synchronization, compute, and memory latencies. Furthermore, CODAG provides a framework for users to easily incorporate new decompression algorithms without being burdened with implementing complex optimizations to hide memory latency. We validate our proposed architecture with three different encoding techniques, RLE v1, RLE v2, and Deflate, and a wide range of large datasets from different domains. We show that CODAG provides 13.46x, 5.69x, and 1.18x speedups for RLE v1, RLE v2, and Deflate, respectively, compared to the state-of-the-art decompressors from NVIDIA RAPIDS.
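
    A minimal CUDA sketch of the per-stream idea (not CODAG's actual code): each warp owns one independently compressed chunk, so every resident warp is a decoding stream and the hardware scheduler overlaps the latencies. The toy (run_length, value) RLE encoding and the precomputed per-chunk offsets are illustrative assumptions.

        #include <cstdint>

        // Hedged sketch of one-warp-per-stream decompression (not CODAG's source).
        // Assumes a toy RLE v1-style stream of (run_length, value) byte pairs and
        // precomputed offsets; comp_off has num_chunks + 1 entries.
        __global__ void rle_decompress_per_warp(const uint8_t*  __restrict__ comp,
                                                const uint64_t* __restrict__ comp_off,
                                                const uint64_t* __restrict__ out_off,
                                                uint8_t*        __restrict__ out,
                                                int num_chunks)
        {
            const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
            const int lane    = threadIdx.x & 31;
            if (warp_id >= num_chunks) return;

            const uint8_t* src   = comp + comp_off[warp_id];
            uint8_t*       dst   = out + out_off[warp_id];
            const uint64_t pairs = (comp_off[warp_id + 1] - comp_off[warp_id]) / 2;
            uint64_t written = 0;

            for (uint64_t p = 0; p < pairs; ++p) {
                const uint8_t run = src[2 * p];      // run length
                const uint8_t val = src[2 * p + 1];  // repeated byte
                for (uint32_t i = lane; i < run; i += 32)  // all 32 lanes fill the run
                    dst[written + i] = val;
                written += run;
            }
        }

    Because every warp runs the same plain loop, swapping in a different encoding only means replacing the per-pair decode step, which matches the framework-style extensibility the abstract describes.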

    Predictive runtime code scheduling for heterogeneous architectures

    Heterogeneous architectures are currently widespread. With the advent of easy-to-program general-purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all their heterogeneous resources are continuously utilized by different applications with versioned critical parts, able to adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component. In this paper, we propose a novel predictive user-level scheduler based on past performance history for heterogeneous systems. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in single-application mode.
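
    A minimal host-side CUDA C++ sketch of the idea (not the paper's scheduler): keep a running estimate of past execution times per task and device, and dispatch each new invocation to the device predicted to finish soonest. The two-device enum, the task-name keys, and the exponential smoothing factor are illustrative assumptions.

        #include <map>
        #include <string>
        #include <utility>

        enum class Device { CPU, GPU };

        // Hedged sketch of a history-based predictive scheduler.
        class HistoryScheduler {
            // (task name, device) -> exponentially weighted mean runtime in ms
            std::map<std::pair<std::string, Device>, double> history_;
            static constexpr double kAlpha = 0.3;  // weight of the newest sample

        public:
            Device pick(const std::string& task) const {
                auto cpu = history_.find({task, Device::CPU});
                auto gpu = history_.find({task, Device::GPU});
                if (cpu == history_.end()) return Device::CPU;  // no history yet: explore
                if (gpu == history_.end()) return Device::GPU;
                return (gpu->second < cpu->second) ? Device::GPU : Device::CPU;
            }

            void record(const std::string& task, Device d, double elapsed_ms) {
                auto it = history_.find({task, d});
                if (it == history_.end()) history_[{task, d}] = elapsed_ms;
                else it->second = kAlpha * elapsed_ms + (1.0 - kAlpha) * it->second;
            }
        };

    A runtime built this way needs no offline profiling: the first run of each task on each device seeds the history, and later dispatch decisions follow measured past performance, in the spirit of the policies the paper studies.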

    CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-Core Clusters

    Rapid advancements in multi-core processor architectures along with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs to efficiently utilize all the resources in such a cluster is still a major challenge. Programmers have to manually deal with low-level details that should ideally be the responsibility of an intelligent compiler or a runtime layer. Various programming languages have been proposed as a solution to this problem, but they have yet to be widely adopted for performance-critical code, mainly due to relatively immature software frameworks and the effort involved in rewriting existing code in a new language. In this paper, we motivate and describe our initial study in exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing number of CUDA developers and benchmarks, and the stability of the CUDA software stack collectively make CUDA a good candidate programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work-distribution runtime, and a lightweight software distributed shared memory to manage parallel execution. Initial results on several standard CUDA benchmark programs show speedups of up to 7.5X on a cluster with 8 nodes, opening up an interesting direction for further research.
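
    A hedged sketch of the kind of rewrite CFC performs (not its generated code): CUDA thread blocks are independent, so a kernel's grid can become nested loops, and each cluster node can be handed a contiguous slice of block indices. The node_rank and num_nodes parameters stand in for whatever the CFC work-distribution runtime actually provides, and the arrays are assumed node-local here; in CFC the software distributed shared memory would make them visible across nodes.

        #include <algorithm>

        // Original CUDA kernel: one element per thread.
        __global__ void vec_add(const float* a, const float* b, float* c, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) c[i] = a[i] + b[i];
        }

        // CPU-side equivalent run on one node: the grid is split across nodes,
        // and each former thread block becomes one iteration of the outer loop.
        void vec_add_on_node(const float* a, const float* b, float* c, int n,
                             int grid_dim, int block_dim,
                             int node_rank, int num_nodes) {
            int blocks_per_node = (grid_dim + num_nodes - 1) / num_nodes;
            int first_block = node_rank * blocks_per_node;
            int last_block  = std::min(grid_dim, first_block + blocks_per_node);
            for (int bx = first_block; bx < last_block; ++bx)    // former blockIdx.x
                for (int tx = 0; tx < block_dim; ++tx) {         // former threadIdx.x
                    int i = bx * block_dim + tx;
                    if (i < n) c[i] = a[i] + b[i];
                }
        }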

    Assessment of two complementary influenza surveillance systems: Sentinel primary care influenza-like illness versus severe hospitalized laboratory-confirmed influenza using the moving epidemic method

    Monitoring seasonal influenza epidemics is the cornerstone of epidemiological surveillance of acute respiratory virus infections worldwide. This work aims to compare two sentinel surveillance systems within the Daily Acute Respiratory Infection Information System of Catalonia (PIDIRAC), the primary care ILI and influenza-confirmed samples from primary care (PIDIRAC-ILI and PIDIRAC-FLU) and the severe hospitalized laboratory-confirmed influenza system (SHLCI), with regard to how they forecast epidemic onset and severity, allowing for healthcare preparedness. The epidemiological study was carried out during seven influenza seasons (2010-2017) in Catalonia, with data from influenza sentinel surveillance of primary care physicians reporting ILI, laboratory confirmation of influenza from systematic sampling of ILI cases, and 12 hospitals that provided data on severe hospitalized cases with laboratory-confirmed influenza (SHLCI-FLU). Epidemic thresholds for ILI and SHLCI-FLU (overall) as well as influenza A (SHLCI-FLUA) and influenza B (SHLCI-FLUB) incidence rates were assessed by the moving epidemic method. Epidemic thresholds for primary care sentinel surveillance influenza-like illness (PIDIRAC-ILI) incidence rates ranged from 83.65 to 503.92 per 100,000 inhabitants. Paired incidence rate curves for SHLCI-FLU/PIDIRAC-ILI and SHLCI-FLUA/PIDIRAC-FLUA showed the best correlation indices (0.805 and 0.724, respectively). In terms of the delay in reaching the epidemic level, the PIDIRAC-ILI source forecasts on average 1.6 weeks earlier than the other paired sources. Differences are larger when SHLCI cases are paired with PIDIRAC-ILI and PIDIRAC-FLUB, although statistical significance was observed only for SHLCI-FLU/PIDIRAC-ILI (Wilcoxon test p-value = 0.039). The combined ILI and confirmed influenza data from primary care, along with the severe hospitalized laboratory-confirmed influenza data from the PIDIRAC sentinel surveillance system, provide timely and accurate syndromic and virological surveillance of influenza from the community level to hospitalization of severe cases.