Search CORE

5 research outputs found

Power models, energy models and libraries for energy-efficient concurrent data structures and algorithms

Author: Atalar Aras
Gidenstam Anders
Ha Hoai Phuong
Renaud-Goud Paul
Tran Vi Ngoc-Nha
Tsigas Philippas
Umar Ibrahim
Walulya Ivan
Publication venue: The EXCESS Consortium
Publication date: 01/01/2016
Field of study

EXCESS deliverable D2.3. More information at http://www.excess-project.eu/This deliverable reports the results of the power models, energy models and librariesfor energy-efficient concurrent data structures and algorithms as available by projectmonth 30 of Work Package 2 (WP2). It reports i) the latest results of Task 2.2-2.4 onproviding programming abstractions and libraries for developing energy-efficient datastructures and algorithms and ii) the improved results of Task 2.1 on investigating andmodeling the trade-off between energy and performance of concurrent data structuresand algorithms. The work has been conducted on two main EXCESS platforms: Intelplatforms with recent Intel multicore CPUs and Movidius Myriad platforms

Munin - Open Research Archive

Proceedings of the 8th Python in Science conference

Author: Millman Jarrod
Van Der Walt Stefan
Varoquaux Gaël
Publication venue: HAL CCSD
Publication date: 18/08/2009
Field of study

International audienceThe SciPy conference provides a unique opportunity to learn and affect what is happening in the realm of scientific computing with Python. Attendees have the opportunity to review the available tools and how they apply to specific problems. By providing a forum for developers to share their Python expertise with the wider commercial, academic, and research communities, this conference fosters collaboration and facilitates the sharing of software components, techniques and a vision for high level language use in scientific computing

INRIA a CCSD electronic archive server

HAL-CEA

Communication Architectures for Scalable GPU-centric Computing Systems

Author: Klenk Benjamin
Publication venue
Publication date: 01/01/2018
Field of study

In recent years, power consumption has become the main concern in High Performance Computing (HPC). This has lead to heterogeneous computing systems in which Central Processing Units (CPUs) are supported by accelerators, such as Graphics Processing Units (GPUs). While GPUs used to be seen as slave devices to which the main processor offloads computation, today’s systems tend to deploy more GPUs than CPUs. Eventually, the GPU will become a first-class processor, bearing increasing responsibilities. Promoting the GPU to a first-class processor comes with many challenges, such as progress guarantees, dynamic memory management, and scheduling. However, one of the main challenges is the GPU’s inability to orchestrate communication, which is currently entirely handled by the CPU. This work addresses that issue and presents solutions to allow GPUs to source and sink network traffic independently. Many important aspects are addressed, ranging from the application level to how networking hardware is accessed. First, important and large scale exascale applications are studied to further understand their communication behavior and applications’ requirements. Several metrics are presented, including time spent for communication, message sizes, and the length of queues that are required to match messages with receive requests. One aspect the analysis revealed is that messages are becoming smaller at scale, which renders the matching of messages and receive requests an important problem to address. The next part analyzes how the GPU can directly access the network with various communication models being presented and benchmarked. It is shown that a flat address space of distributed GPU memories shows superior bandwidth than put/get communication or CPU-controlled message passing, but less communication can be overlapped with computation. Overall, GPU-controlled communication is always superior, both in terms of time-to-solution and energy spending. The final part addresses communication management on GPUs, which is required to provide high-level communication abstractions. Besides other fundamental building blocks, an algorithm for the message matching is presented that yields similar performance as CPUs. However, it is also shown that the messaging protocol can be relaxed to improve performance significantly, leveraging the massive amount of parallelism provided by the GPU’s architecture

Heidelberger Dokumentenserver

Systems Support for Trusted Execution Environments

Author: Trach Bohdan
Publication venue
Publication date: 09/02/2022
Field of study

Cloud computing has become a default choice for data processing by both large corporations and individuals due to its economy of scale and ease of system management. However, the question of trust and trustoworthy computing inside the Cloud environments has been long neglected in practice and further exacerbated by the proliferation of AI and its use for processing of sensitive user data. Attempts to implement the mechanisms for trustworthy computing in the cloud have previously remained theoretical due to lack of hardware primitives in the commodity CPUs, while a combination of Secure Boot, TPMs, and virtualization has seen only limited adoption. The situation has changed in 2016, when Intel introduced the Software Guard Extensions (SGX) and its enclaves to the x86 ISA CPUs: for the first time, it became possible to build trustworthy applications relying on a commonly available technology. However, Intel SGX posed challenges to the practitioners who discovered the limitations of this technology, from the limited support of legacy applications and integration of SGX enclaves into the existing system, to the performance bottlenecks on communication, startup, and memory utilization. In this thesis, our goal is enable trustworthy computing in the cloud by relying on the imperfect SGX promitives. To this end, we develop and evaluate solutions to issues stemming from limited systems support of Intel SGX: we investigate the mechanisms for runtime support of POSIX applications with SCONE, an efficient SGX runtime library developed with performance limitations of SGX in mind. We further develop this topic with FFQ, which is a concurrent queue for SCONE's asynchronous system call interface. ShieldBox is our study of interplay of kernel bypass and trusted execution technologies for NFV, which also tackles the problem of low-latency clocks inside enclave. The two last systems, Clemmys and T-Lease are built on a more recent SGXv2 ISA extension. In Clemmys, SGXv2 allows us to significantly reduce the startup time of SGX-enabled functions inside a Function-as-a-Service platform. Finally, in T-Lease we solve the problem of trusted time by introducing a trusted lease primitive for distributed systems. We perform evaluation of all of these systems and prove that they can be practically utilized in existing systems with minimal overhead, and can be combined with both legacy systems and other SGX-based solutions. In the course of the thesis, we enable trusted computing for individual applications, high-performance network functions, and distributed computing framework, making a <vision of trusted cloud computing a reality

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Technische Universität Dresden: Qucosa

An improvement of OpenMP pipeline parallelism with the BatchQueue algorithm

Author: Folliot Bertil
Preud'homme Thomas
Sopena Julien
Thomas Gaël
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/12/2012
Field of study

International audienceIn the context of multicore programming, pipeline parallelism is a solution to easily transform a sequential program into a parallel one without requiring a whole rewriting of the code. The OpenMP stream-computing extension presented by Pop and Cohen proposes an extension of OpenMP to handle pipeline parallelism. However, their communication algorithm relies on Multiple-producer-Multiple-Consumer queues, while pipelined applications mostly deal with linear chains of communication, i.e., with only a single producer and a single consumer. To improve the performance of the OpenMP stream-extension, we propose to add a more specialized Single-Producer-Single-Consumer communication algorithm called Batch Queue and to select it for one-to-one communication. Our evaluation shows that Batch Queue is then able to improve the throughput up to a factor 2 on an 8-core machine both for example application and real applications. Our study shows therefore that using specialized and efficient communication algorithms can have a significant impact on the overall performance of pipelined applications

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot