107 research outputs found

    Persistent Kernels for Iterative Memory-bound GPU Applications

    Full text link
    Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as much as time/algorithm steps there are. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose a scheme for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this scheme the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output in each time step in registers and shared memory to be used as input for the following time step. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geometric mean speedup of 2.292.29x in small domains and 1.531.53x in large domains), and a Krylov subspace solver (geometric mean speedup of 4.674.67x in smaller SpMV datasets from SuiteSparse and 1.391.39x in larger SpMV datasets, for conjugate gradient)

    Scrooge Attack: Undervolting ARM Processors for Profit

    Full text link
    Latest ARM processors are approaching the computational power of x86 architectures while consuming much less energy. Consequently, supply follows demand with Amazon EC2, Equinix Metal and Microsoft Azure offering ARM-based instances, while Oracle Cloud Infrastructure is about to add such support. We expect this trend to continue, with an increasing number of cloud providers offering ARM-based cloud instances. ARM processors are more energy-efficient leading to substantial electricity savings for cloud providers. However, a malicious cloud provider could intentionally reduce the CPU voltage to further lower its costs. Running applications malfunction when the undervolting goes below critical thresholds. By avoiding critical voltage regions, a cloud provider can run undervolted instances in a stealthy manner. This practical experience report describes a novel attack scenario: an attack launched by the cloud provider against its users to aggressively reduce the processor voltage for saving energy to the last penny. We call it the Scrooge Attack and show how it could be executed using ARM-based computing instances. We mimic ARM-based cloud instances by deploying our own ARM-based devices using different generations of Raspberry Pi. Using realistic and synthetic workloads, we demonstrate to which degree of aggressiveness the attack is relevant. The attack is unnoticeable by our detection method up to an offset of -50mV. We show that the attack may even remain completely stealthy for certain workloads. Finally, we propose a set of client-based detection methods that can identify undervolted instances. We support experimental reproducibility and provide instructions to reproduce our results.Comment: European Commission Project: LEGaTO - Low Energy Toolset for Heterogeneous Computing (EC-H2020-780681

    An architectural journey into RISC architectures for HPC workloads

    Get PDF
    The thesis evaluates the current state-of-the-art of RISC architectures in HPC. Studying the performance, power, and energy to solution in heterogeneous SoCs. For the evaluation 2 arm platforms (CPU+GPU, CPU+FPGA), 1 RISC-V platform and 1 Open Source RISC-V core running in an FPGA have been tested

    On designing hardware accelerator-based systems: interfaces, taxes and benefits

    Full text link
    Complementary Metal Oxide Semiconductor (CMOS) Technology scaling has slowed down. One promising approach to sustain the historic performance improvement of computing systems is to utilize hardware accelerators. Today, many commercial computing systems integrate one or more accelerators, with each accelerator optimized to efficiently execute specific tasks. Over the years, there has been a substantial amount of research on designing hardware accelerators for machine learning (ML) training and inference tasks. Hardware accelerators are also widely employed to accelerate data privacy and security algorithms. In particular, there is currently a growing interest in the use of hardware accelerators for accelerating homomorphic encryption (HE) based privacy-preserving computing. While the use of hardware accelerators is promising, a realistic end-to-end evaluation of an accelerator when integrated into the full system often reveals that the benefits of an accelerator are not always as expected. Simply assessing the performance of the accelerated portion of an application, such as the inference kernel in ML applications, during performance analysis can be misleading. When designing an accelerator-based system, it is critical to evaluate the system as a whole and account for all the accelerator taxes. In the first part of our research, we highlight the need for a holistic, end-to-end analysis of workloads using ML and HE applications. Our evaluation of an ML application for a database management system (DBMS) shows that the benefits of offloading ML inference to accelerators depend on several factors, including backend hardware, model complexity, data size, and the level of integration between the ML inference pipeline and the DBMS. We also found that the end-to-end performance improvement is bottlenecked by data retrieval and pre-processing, as well as inference. Additionally, our evaluation of an HE video encryption application shows that while HE client-side operations, i.e., message-to- ciphertext and ciphertext-to-message conversion operations, are bottlenecked by number theoretic transform (NTT) operations, accelerating NTT in hardware alone is not sufficient to get enough application throughput (frame rate per second) improvement. We need to address all bottlenecks such as error sampling, encryption, and decryption in message-to-ciphertext and ciphertext-to-message conversion pipeline. In the second part of our research, we address the lack of a scalable evaluation infrastructure for building and evaluating accelerator-based systems. To solve this problem, we propose a robust and scalable software-hardware framework for accelerator evaluation, which uses an open-source RISC-V based System-on-Chip (SoC) design called BlackParrot. This framework can be utilized by accelerator designers and system architects to perform an end-to-end performance analysis of coherent and non-coherent accelerators while carefully accounting for the interaction between the accelerator and the rest of the system. In the third part of our research, we present RISE, which is a full RISC-V SoC designed to efficiently perform message-to-ciphertext and ciphertext-to-message conversion operations. RISE comprises of a BlackParrot core and an efficient custom-designed accelerator tailored to accelerate end-to-end message-to-ciphertext and ciphertext-to-message conversion operations. Our RTL-based evaluation demonstrates that RISE improves the throughput of the video encryption application by 10x-27x for different frame resolutions
    corecore