
    A Comparison of some recent Task-based Parallel Programming Models

    The need for parallel programming models that are simple to use and at the same time efficient on current and future parallel platforms has drawn recent attention to task-based models such as Cilk++, Intel TBB and the task concept in OpenMP version 3.0. The choice of model and implementation can have a major impact on final performance, and in order to understand some of the trade-offs we have made a quantitative study comparing four implementations of OpenMP (gcc, Intel icc, Sun Studio and the research compiler Mercurium/nanos mcc), Cilk++ and Wool, a high-performance task-based library developed at SICS. We use microbenchmarks to characterize the costs of task creation and stealing, and the Barcelona OpenMP Tasks Suite to characterize application performance. By far, Wool and Cilk++ have the lowest overhead in both spawning and stealing tasks. This is reflected in application performance when many tasks with small granularity are spawned, where Cilk++ and, in particular, Wool have the highest performance. For coarse-granularity applications, the OpenMP implementations have performance quite similar to that of the more light-weight Cilk++ and Wool, except for one application where mcc is superior thanks to a better task scheduler. The OpenMP implementations are generally not yet ready for use when the task granularity becomes very small. There is no inherent reason for this, so we expect future implementations of OpenMP to focus on this issue.
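
    The per-task spawn and steal costs that such microbenchmarks isolate are easiest to see in the classic recursive Fibonacci kernel, where almost all work is task management. The sketch below is a minimal OpenMP 3.0 task version; the function name, the input size and the absence of a granularity cutoff are illustrative choices, not details taken from the paper's benchmark suite.

```c
/* A minimal sketch of fine-grained task spawning with OpenMP 3.0
   tasks. Compile with e.g. gcc -fopenmp. */
#include <stdio.h>

static long fib(int n)
{
    long a, b;
    if (n < 2)
        return n;
    /* Each recursive call becomes a task; at small n the per-task
       spawn/steal overhead dominates the actual work, which is what
       separates the runtimes in a study like the one above. */
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait          /* wait for both child tasks */
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single            /* one thread seeds the task tree */
    r = fib(30);
    printf("fib(30) = %ld\n", r);
    return 0;
}
```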

    Unbalanced tree search on a manycore system using the GPI programming model

    The recent developments in computer architectures progress towards systems with large core count (Manycore) which expose more parallelism to applications. Some applications named irregular and unbalanced applications demand a dynamic and asynchronous load balance implementation to utilize the full performance a Manycore system. For example, the recently established Graph500 benchmark aims at such applications. The UTS benchmark characterizes the performance of such irregular and unbalanced computations with a tree-structured search space that requires continuous dynamic load balancing. GPI is a PGAS API that delivers the full performance of RDMA-enabled networks directly to the application. Its programming model focuses the use of one-sided asynchronous communication, overlapping computation and communication. In this paper we address the dynamic load balancing requirements of unbalanced applications using the GPI programming model. Using the UTS benchmark, we detail the implementation of a work stealing algorithm using GPI and present the performance results. Our performance evaluation shows significant improvements when compared with the optimized MPI version with a maximum performance of 9.5 billion nodes per second on 3072 cores
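
    For illustration, the following is a minimal sketch of the claim-then-copy structure a steal operation can take over one-sided communication, in the spirit of the approach described above. The primitives onesided_read and onesided_cas are hypothetical stand-ins for a PGAS API's one-sided read and remote atomic compare-and-swap (real GPI calls have different names and signatures), and the "remote" segments are simulated in local memory so the sketch runs; a real implementation would also handle ring-buffer wrap-around and distributed termination detection.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define NRANKS   2
#define SEG_SIZE 4096

/* One memory segment per rank; in a real PGAS run each segment lives
   on a different node and is accessed via RDMA. Simulated locally here. */
static _Alignas(uint64_t) uint8_t segment[NRANKS][SEG_SIZE];

/* Hypothetical one-sided primitives: read remote memory / atomically
   compare-and-swap a remote 64-bit word, without involving the
   remote CPU. These are NOT real GPI calls. */
static void onesided_read(int rank, uint64_t off, void *dst, size_t len)
{
    memcpy(dst, segment[rank] + off, len);
}

static bool onesided_cas(int rank, uint64_t off, uint64_t expect, uint64_t desired)
{
    uint64_t *p = (uint64_t *)(segment[rank] + off);
    if (*p != expect) return false;
    *p = desired;
    return true;
}

/* Public work queue in each rank's segment: tasks live in buf[head..tail). */
typedef struct { uint64_t head, tail; } queue_hdr;
#define HDR_OFF 0
#define BUF_OFF sizeof(queue_hdr)
#define CHUNK   16   /* steal granularity */

/* Try to steal up to CHUNK task descriptors from a victim rank;
   returns how many were copied into out[]. */
static size_t try_steal(int victim, uint64_t *out)
{
    queue_hdr h;
    onesided_read(victim, HDR_OFF, &h, sizeof h);
    if (h.head >= h.tail) return 0;          /* victim has no work */

    uint64_t n = h.tail - h.head;
    if (n > CHUNK) n = CHUNK;

    /* Claim the range by atomically advancing the victim's head;
       a failed CAS means another thief (or the victim) won the race. */
    if (!onesided_cas(victim, HDR_OFF, h.head, h.head + n)) return 0;

    onesided_read(victim, BUF_OFF + h.head * sizeof(uint64_t),
                  out, n * sizeof(uint64_t));
    return n;
}

int main(void)
{
    /* Seed rank 1 with four tasks, then steal them from rank 0's side. */
    queue_hdr *h  = (queue_hdr *)segment[1];
    uint64_t *buf = (uint64_t *)(segment[1] + BUF_OFF);
    for (int i = 0; i < 4; i++) buf[i] = 100 + i;
    h->head = 0; h->tail = 4;

    uint64_t got[CHUNK];
    size_t n = try_steal(1, got);
    printf("stole %zu tasks, first = %llu\n", n, (unsigned long long)got[0]);
    return 0;
}
```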

    Redundant dataflow applications on clustered manycore architectures

    Increasing performance requirements in the embedded systems domain have encouraged a drift from single-core to multicore processors. Cars are an example of complex embedded systems in which the use of multicores continues to grow. The requirements of software components running in modern cars are diverse. On the one hand, there are safety-critical tasks such as airbag control; on the other hand, there are tasks which do not have any safety-related requirements at all, for example those controlling the infotainment system. Trends like autonomous driving lead to tasks which are simultaneously safety-critical and computationally complex. To satisfy the requirements of modern embedded applications, we developed a dataflow-based runtime environment (RTE) for clustered manycore architectures. The RTE is able to execute dataflow graphs in various redundancy configurations and with different schedulers. We implemented our RTE design on the Kalray Bostan Massively Parallel Processor Array and evaluated all possible configurations for three common computation tasks. To classify the performance of our RTE, we compared the non-redundant graph executions with OpenCL versions of the three applications. The results show that our RTE can come close to or even surpass Kalray's OpenCL framework, although maximum performance was not the primary goal of our design.
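
    As an illustration of one redundancy configuration such an RTE could offer, the sketch below fires a dataflow actor three times and takes a 2-out-of-3 majority vote on the produced token (triple modular redundancy). The names fire_redundant and square are hypothetical, and the replicas run sequentially here for simplicity; the paper's RTE dispatches replicas across clusters and also supports other configurations and schedulers.

```c
/* A minimal sketch of triple modular redundancy over a dataflow
   actor: fire three replicas, then majority-vote the output token. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t token;
typedef token (*actor_fn)(token in);

/* Fire the actor three times and vote. Returns 0 and writes the voted
   token on success, -1 if no two replicas agree (unrecoverable fault).
   A real RTE would run the replicas on disjoint compute clusters. */
static int fire_redundant(actor_fn f, token in, token *out)
{
    token r[3];
    for (int i = 0; i < 3; i++)
        r[i] = f(in);

    /* 2-out-of-3 majority vote. */
    if (r[0] == r[1] || r[0] == r[2]) { *out = r[0]; return 0; }
    if (r[1] == r[2])                 { *out = r[1]; return 0; }
    return -1;
}

/* Example actor: squares its input token. */
static token square(token x) { return x * x; }

int main(void)
{
    token y;
    if (fire_redundant(square, 7, &y) == 0)
        printf("voted result: %u\n", y);
    return 0;
}
```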