785 research outputs found
Glider: A GPU Library Driver for Improved System Security
Legacy device drivers implement both device resource management and
isolation. This results in a large code base with a wide high-level interface
making the driver vulnerable to security attacks. This is particularly
problematic for increasingly popular accelerators like GPUs that have large,
complex drivers. We solve this problem with library drivers, a new driver
architecture. A library driver implements resource management as an untrusted
library in the application process address space, and implements isolation as a
kernel module that is smaller and has a narrower lower-level interface (i.e.,
closer to hardware) than a legacy driver. We articulate a set of device and
platform hardware properties that are required to retrofit a legacy driver into
a library driver. To demonstrate the feasibility and superiority of library
drivers, we present Glider, a library driver implementation for two GPUs of
popular brands, Radeon and Intel. Glider reduces the TCB size and attack
surface by about 35% and 84% respectively for a Radeon HD 6450 GPU and by about
38% and 90% respectively for an Intel Ivy Bridge GPU. Moreover, it incurs no
performance cost. Indeed, Glider outperforms a legacy driver for applications
requiring intensive interactions with the device driver, such as applications
using the OpenGL immediate mode API
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
Quantifying the latency benefits of near-edge and in-network FPGA acceleration
Transmitting data to cloud datacenters in distributed IoT applications introduces significant communication latency, but is often the only feasible solution when source nodes are computationally limited. To address latency concerns, cloudlets, in-network computing, and more capable edge nodes are all being explored as a way of moving processing capability towards the edge of the network. Hardware acceleration using Field Programmable Gate Arrays (FPGAs) is also seeing increased interest due to reduced computation latency and improved efficiency. This paper evaluates the the implications of these offloading approaches using a case study neural network based image classification application, quantifying both the computation and communication latency resulting from different platform choices. We consider communication latency including the ingestion of packets for processing on the target platform, showing that this varies significantly with the choice of platform. We demonstrate that emerging in-network accelerator approaches offer much improved and predictable performance as well as better scaling to support multiple data sources
Acceleration of stereo-matching on multi-core CPU and GPU
This paper presents an accelerated version of a
dense stereo-correspondence algorithm for two different parallelism
enabled architectures, multi-core CPU and GPU. The
algorithm is part of the vision system developed for a binocular
robot-head in the context of the CloPeMa 1 research project.
This research project focuses on the conception of a new clothes
folding robot with real-time and high resolution requirements
for the vision system. The performance analysis shows that
the parallelised stereo-matching algorithm has been significantly
accelerated, maintaining 12x and 176x speed-up respectively
for multi-core CPU and GPU, compared with non-SIMD singlethread
CPU. To analyse the origin of the speed-up and gain
deeper understanding about the choice of the optimal hardware,
the algorithm was broken into key sub-tasks and the performance
was tested for four different hardware architectures
A Survey of Techniques for Improving Security of GPUs
Graphics processing unit (GPU), although a powerful performance-booster, also
has many security vulnerabilities. Due to these, the GPU can act as a
safe-haven for stealthy malware and the weakest `link' in the security `chain'.
In this paper, we present a survey of techniques for analyzing and improving
GPU security. We classify the works on key attributes to highlight their
similarities and differences. More than informing users and researchers about
GPU security techniques, this survey aims to increase their awareness about GPU
security vulnerabilities and potential countermeasures
Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos
RESUMEN Los sistemas heterogéneos son cada vez más relevantes, debido a sus capacidades de rendimiento y eficiencia energética, estando presentes en todo tipo de plataformas de cómputo, desde dispositivos embebidos y servidores, hasta nodos HPC de grandes centros de datos. Su complejidad hace que sean habitualmente usados bajo el paradigma de tareas y el modelo de programación host-device. Esto penaliza fuertemente el aprovechamiento de los aceleradores y el consumo energético del sistema, además de dificultar la adaptación de las aplicaciones.
La co-ejecución permite que todos los dispositivos cooperen para computar el mismo problema, consumiendo menos tiempo y energía. No obstante, los programadores deben encargarse de toda la gestión de los dispositivos, la distribución de la carga y la portabilidad del código entre sistemas, complicando notablemente su programación.
Esta tesis ofrece contribuciones para mejorar el rendimiento y la eficiencia energética en estos sistemas masivamente paralelos. Se realizan propuestas que abordan objetivos generalmente contrapuestos: se mejora la usabilidad y la programabilidad, a la vez que se garantiza una mayor abstracción y extensibilidad del sistema, y al mismo tiempo se aumenta el rendimiento, la escalabilidad y la eficiencia energética. Para ello, se proponen dos motores de ejecución con enfoques completamente distintos.
EngineCL, centrado en OpenCL y con una API de alto nivel, favorece la máxima compatibilidad entre todo tipo de dispositivos y proporciona un sistema modular extensible. Su versatilidad permite adaptarlo a entornos para los que no fue concebido, como aplicaciones con ejecuciones restringidas por tiempo o simuladores HPC de dinámica molecular, como el utilizado en un centro de investigación internacional.
Considerando las tendencias industriales y enfatizando la aplicabilidad profesional, CoexecutorRuntime proporciona un sistema flexible centrado en C++/SYCL que dota de soporte a la co-ejecución a la tecnología oneAPI. Este runtime acerca a los programadores al dominio del problema, posibilitando la explotación de estrategias dinámicas adaptativas que mejoran la eficiencia en todo tipo de aplicaciones.ABSTRACT Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications.
Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming.
This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed.
EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center.
Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant),
the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R
and PID2019-105660RB-C22.
This work has also been partially supported by the Mont-Blanc 3: European Scalable and
Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No.
671697) from the European Union’s Horizon 2020 Research and Innovation Programme
(H2020 Programme). Some activities have also been funded by the Spanish Science and Technology
Commission under contract TIN2016-81840-REDT (CAPAP-H6 network).
The Integration II: Hybrid programming models of Chapter 4 has been partially performed
under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC
Research Innovation Action under the H2020 Programme. In particular, the author gratefully
acknowledges the support of the SPMT Department of the High Performance Computing
Center Stuttgart (HLRS)
Distributed Computing in a Pandemic: A Review of Technologies Available for Tackling COVID-19
The current COVID-19 global pandemic caused by the SARS-CoV-2 betacoronavirus
has resulted in over a million deaths and is having a grave socio-economic
impact, hence there is an urgency to find solutions to key research challenges.
Much of this COVID-19 research depends on distributed computing. In this
article, I review distributed architectures -- various types of clusters, grids
and clouds -- that can be leveraged to perform these tasks at scale, at
high-throughput, with a high degree of parallelism, and which can also be used
to work collaboratively. High-performance computing (HPC) clusters will be used
to carry out much of this work. Several bigdata processing tasks used in
reducing the spread of SARS-CoV-2 require high-throughput approaches, and a
variety of tools, which Hadoop and Spark offer, even using commodity hardware.
Extremely large-scale COVID-19 research has also utilised some of the world's
fastest supercomputers, such as IBM's SUMMIT -- for ensemble docking
high-throughput screening against SARS-CoV-2 targets for drug-repurposing, and
high-throughput gene analysis -- and Sentinel, an XPE-Cray based system used to
explore natural products. Grid computing has facilitated the formation of the
world's first Exascale grid computer. This has accelerated COVID-19 research in
molecular dynamics simulations of SARS-CoV-2 spike protein interactions through
massively-parallel computation and was performed with over 1 million volunteer
computing devices using the Folding@home platform. Grids and clouds both can
also be used for international collaboration by enabling access to important
datasets and providing services that allow researchers to focus on research
rather than on time-consuming data-management tasks.Comment: 21 pages (15 excl. refs), 2 figures, 3 table
SAGE: Software-based Attestation for GPU Execution
With the application of machine learning to security-critical and sensitive
domains, there is a growing need for integrity and privacy in computation using
accelerators, such as GPUs. Unfortunately, the support for trusted execution on
GPUs is currently very limited - trusted execution on accelerators is
particularly challenging since the attestation mechanism should not reduce
performance. Although hardware support for trusted execution on GPUs is
emerging, we study purely software-based approaches for trusted GPU execution.
A software-only approach offers distinct advantages: (1) complement
hardware-based approaches, enhancing security especially when vulnerabilities
in the hardware implementation degrade security, (2) operate on GPUs without
hardware support for trusted execution, and (3) achieve security without
reliance on secrets embedded in the hardware, which can be extracted as history
has shown. In this work, we present SAGE, a software-based attestation
mechanism for GPU execution. SAGE enables secure code execution on NVIDIA GPUs
of the Ampere architecture (A100), providing properties of code integrity and
secrecy, computation integrity, as well as data integrity and secrecy - all in
the presence of malicious code running on the GPU and CPU. Our evaluation
demonstrates that SAGE is already practical today for executing code in a
trustworthy way on GPUs without specific hardware support.Comment: 14 pages, 2 reference pages, 6 figure
- …