Search CORE

1,023 research outputs found

CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

Author: de Supinski Bronis R.
Feng Wu-chun
Rountree Barry
Scogland Thomas R. W.
Publication venue
Publication date: 01/01/2012
Field of study

Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP’s best features, its flexibility: typical OpenMP applications can run on any number of CPUs. GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over only using CPUs or one GPU while increasing the overall flexibility of Accelerated OpenMP

Computer Science Technical Reports @Virginia Tech

Fast Integration of Hardware Accelerators for Dynamically Reconfigurable Architecture

Author: Foucher Clément
Giulieri Alain
Muller Fabrice
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 09/07/2012
Field of study

International audienceDynamic reconfiguration of hardware resources is increasingly used in applications as a way to increase performances, resources integration or energy efficiency. As this evolution induces a change of the application execution paradigm, various tools have been set up to develop and manage these applications. But most do not allow direct re-use of legacy code, needing adaptation to match the provided environment. Moreover, partial reconfiguration is only at its early stages, and lacks easy ways of handling. We propose a design methodology and a runtime environment bringing fast integration of legacy hardware accelerators for partial and dynamic reconfigurable hardware architectures. Thanks to it, applications making use of dynamic hardware can be run directly on an Embedded Linux without noticing the reconfiguration flow. Moreover, our design methodology allows providing various implementations of a computation kernel, including both hardware and software ones. The implementation can then be chosen at execution time depending on available resources. In this article, we introduce the generic IP interface description making the re-use process possible. Furthermore, we present the results of a sample application running on our platform using software and hardware implementations. For hardware implementations, we obtain reconfiguration overhead as low as 0.16\% of the total kernel execution time

Crossref

HAL-UNICE

Interfacing and scheduling legacy code within the Canals framework

Author: Dahlin Andreas
Gorin Jérôme
Jokhio Fareed
Lilius Johan
Raulet Mickaël
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011
Field of study

International audienceThe need for understanding how to distribute computations across multiple cores, have obviously increased in the multi-core era. Scheduling the functional blocks of an application for concurrent execution requires not only a good understanding of data dependencies, but also a structured way to describe the intended scheduling. In this paper we describe how the Canals language and its scheduling framework can be used for the purpose of scheduling and executing legacy code. Additionally a set of translation guidelines for translating RVC-CAL applications into Canals are presented. The proposed approaches are applied to an existing MPEG-4 Simple Profile decoder for evaluation purposes. The inverse discrete cosine transform (IDCT) is accelerated by the means of OpenCL

HAL-CentraleSupelec

HAL-Rennes 1

Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

Author: Carretero Pérez Jesús
García Blas Javier
Petcu Dana
Publication venue
Publication date: 01/01/2016
Field of study

Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big data management, training, contributing to glue disparate researchers working across different areas and provide a meeting ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience.European Cooperation in Science and Technology. COS

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

Author: Thomadakis Polykarpos
Publication venue: ODU Digital Commons
Publication date: 01/08/2023
Field of study

Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires lots of effort and expertise in both the application and systems domains. This is more relevant for unstructured applications whose workflow is not statically predictable due to their heavily data-dependent nature. One possible solution for this problem is the introduction of an intelligent Domain-Specific Language (iDSL) that transparently helps to maintain correctness, hides the idiosyncrasies of lowlevel hardware, and scales applications. An iDSL includes domain-specific language constructs, a compilation toolchain, and a runtime providing task scheduling, data placement, and workload balancing across and within heterogeneous nodes. In this work, we focus on the runtime framework. We introduce a novel design and extension of a runtime framework, the Parallel Runtime Environment for Multicore Applications. In response to the ever-increasing intra/inter-node concurrency, the runtime system supports efficient task scheduling and workload balancing at both levels while allowing the development of custom policies. Moreover, the new framework provides abstractions supporting the utilization of heterogeneous distributed nodes consisting of CPUs and GPUs and is extensible to other devices. We demonstrate that by utilizing this work, an application (or the iDSL) can scale its performance on heterogeneous exascale-era supercomputers with minimal effort. A future goal for this framework (out of the scope of this thesis) is to be integrated with machine learning to improve its decision-making and performance further. As a bridge to this goal, since the framework is under development, we experiment with data from Nuclear Physics Particle Accelerators and demonstrate the significant improvements achieved by utilizing machine learning in the hit-based track reconstruction process

Old Dominion University