Search CORE

3,757 research outputs found

Accelerating sequential programs using FastFlow and self-offloading

Author: Aldinucci Marco
Danelutto Marco
Kilpatrick Peter
Meneghin Massimiliano
Torquati Massimo
Publication venue
Publication date: 12/02/2010
Field of study

FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow enabling programmers to offload part of their workload on a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove

arXiv.org e-Print Archive

UnipiEprints

Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

Author: Hechtman Blake A.
Sorin Daniel J.
Publication venue
Publication date: 01/01/2013
Field of study

The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM

arXiv.org e-Print Archive

CiteSeerX

Crossref

CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

Author: de Supinski Bronis R.
Feng Wu-chun
Rountree Barry
Scogland Thomas R. W.
Publication venue
Publication date: 01/01/2012
Field of study

Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP’s best features, its flexibility: typical OpenMP applications can run on any number of CPUs. GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over only using CPUs or one GPU while increasing the overall flexibility of Accelerated OpenMP

Computer Science Technical Reports @Virginia Tech

Performance analysis of a hardware accelerator of dependence management for taskbased dataflow programming models

Author: Ayguadé Parra Eduard
Bosch Pons Jaume
Jiménez González Daniel
Tan Xubin
Valero Cortés Mateo
Álvarez Martínez Carlos
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

Along with the popularity of multicore and manycore, task-based dataflow programming models obtain great attention for being able to extract high parallelism from applications without exposing the complexity to programmers. One of these pioneers is the OpenMP Superscalar (OmpSs). By implementing dynamic task dependence analysis, dataflow scheduling and out-of-order execution in runtime, OmpSs achieves high performance using coarse and medium granularity tasks. In theory, for the same application, the more parallel tasks can be exposed, the higher possible speedup can be achieved. Yet this factor is limited by task granularity, up to a point where the runtime overhead outweighs the performance increase and slows down the application. To overcome this handicap, Picos was proposed to support task-based dataflow programming models like OmpSs as a fast hardware accelerator for fine-grained task and dependence management, and a simulator was developed to perform design space exploration. This paper presents the very first functional hardware prototype inspired by Picos. An embedded system based on a Zynq 7000 All-Programmable SoC is developed to study its capabilities and possible bottlenecks. Initial scalability and hardware consumption studies of different Picos designs are performed to find the one with the highest performance and lowest hardware cost. A further thorough performance study is employed on both the prototype with the most balanced configuration and the OmpSs software-only alternative. Results show that our OmpSs runtime hardware support significantly outperforms the software-only implementation currently available in the runtime system for finegrained tasks.This work is supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project, by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Research Council RoMoL Grant Agreement number 321253. We also thank the Xilinx University Program for its hardware and software donations.Peer ReviewedPostprint (published version

Crossref

UPCommons. Portal del coneixement obert de la UPC

AMA: asynchronous management of accelerators for task-based programming models

Author: Ayguadé Parra Eduard
Badia Sala Rosa Maria
Labarta Mancho Jesús José
Planas Judit
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015
Field of study

Computational science has benefited in the last years from emerging accelerators that increase the performance of scientific simulations, but using these devices hinders the programming task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-accelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way. Then, we can use such spare time to do other work while waiting for device operations. Implemented on top of a task-based framework, the experimental evaluation of AMA on a quad-GPU node shows that we reach the performance of a hand-tuned native CUDA code, with the advantage of fully hiding the device management. In addition, we obtain up to more than 2x performance speed-up with respect to the original framework implementation.Peer ReviewedPostprint (published version

Elsevier - Publisher Connector

UPCommons. Portal del coneixement obert de la UPC