
    A Comparison of some recent Task-based Parallel Programming Models

    The need for parallel programming models that are simple to use and at the same time efficient on current and future parallel platforms has drawn recent attention to task-based models such as Cilk++, Intel TBB and the task concept in OpenMP version 3.0. The choice of model and implementation can have a major impact on final performance, and in order to understand some of the trade-offs we have made a quantitative study comparing four implementations of OpenMP (gcc, Intel icc, Sun Studio and the research compiler Mercurium/nanos mcc), Cilk++ and Wool, a high-performance task-based library developed at SICS. We use microbenchmarks to characterize the costs of task creation and stealing, and the Barcelona OpenMP Tasks Suite to characterize application performance. By far, Wool and Cilk++ have the lowest overhead in both spawning and stealing tasks. This is reflected in application performance when many tasks with small granularity are spawned, where Cilk++ and, in particular, Wool have the highest performance. For coarse-granularity applications, the OpenMP implementations have performance quite similar to that of the more light-weight Cilk++ and Wool, except for one application where mcc is superior thanks to a better task scheduler. The OpenMP implementations are generally not yet ready for use when the task granularity becomes very small. There is no inherent reason for this, so we expect future implementations of OpenMP to focus on this issue.
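
    The per-task spawn and steal costs that such microbenchmarks isolate are easiest to see in the classic recursive Fibonacci kernel, where almost all work is task management. The sketch below is a minimal OpenMP 3.0 task version; the function name, the input size and the absence of a granularity cutoff are illustrative choices, not details taken from the paper's benchmark suite.

```c
/* A minimal sketch of fine-grained task spawning with OpenMP 3.0
   tasks. Compile with e.g. gcc -fopenmp. */
#include <stdio.h>

static long fib(int n)
{
    long a, b;
    if (n < 2)
        return n;
    /* Each recursive call becomes a task; at small n the per-task
       spawn/steal overhead dominates the actual work, which is what
       separates the runtimes in a study like the one above. */
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait          /* wait for both child tasks */
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single            /* one thread seeds the task tree */
    r = fib(30);
    printf("fib(30) = %ld\n", r);
    return 0;
}
```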

    Unbalanced tree search on a manycore system using the GPI programming model

    The recent developments in computer architectures progress towards systems with large core count (Manycore) which expose more parallelism to applications. Some applications named irregular and unbalanced applications demand a dynamic and asynchronous load balance implementation to utilize the full performance a Manycore system. For example, the recently established Graph500 benchmark aims at such applications. The UTS benchmark characterizes the performance of such irregular and unbalanced computations with a tree-structured search space that requires continuous dynamic load balancing. GPI is a PGAS API that delivers the full performance of RDMA-enabled networks directly to the application. Its programming model focuses the use of one-sided asynchronous communication, overlapping computation and communication. In this paper we address the dynamic load balancing requirements of unbalanced applications using the GPI programming model. Using the UTS benchmark, we detail the implementation of a work stealing algorithm using GPI and present the performance results. Our performance evaluation shows significant improvements when compared with the optimized MPI version with a maximum performance of 9.5 billion nodes per second on 3072 cores
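
    For illustration, the following is a minimal sketch of the claim-then-copy structure a steal operation can take over one-sided communication, in the spirit of the approach described above. The primitives onesided_read and onesided_cas are hypothetical stand-ins for a PGAS API's one-sided read and remote atomic compare-and-swap (real GPI calls have different names and signatures), and the "remote" segments are simulated in local memory so the sketch runs; a real implementation would also handle ring-buffer wrap-around and distributed termination detection.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define NRANKS   2
#define SEG_SIZE 4096

/* One memory segment per rank; in a real PGAS run each segment lives
   on a different node and is accessed via RDMA. Simulated locally here. */
static _Alignas(uint64_t) uint8_t segment[NRANKS][SEG_SIZE];

/* Hypothetical one-sided primitives: read remote memory / atomically
   compare-and-swap a remote 64-bit word, without involving the
   remote CPU. These are NOT real GPI calls. */
static void onesided_read(int rank, uint64_t off, void *dst, size_t len)
{
    memcpy(dst, segment[rank] + off, len);
}

static bool onesided_cas(int rank, uint64_t off, uint64_t expect, uint64_t desired)
{
    uint64_t *p = (uint64_t *)(segment[rank] + off);
    if (*p != expect) return false;
    *p = desired;
    return true;
}

/* Public work queue in each rank's segment: tasks live in buf[head..tail). */
typedef struct { uint64_t head, tail; } queue_hdr;
#define HDR_OFF 0
#define BUF_OFF sizeof(queue_hdr)
#define CHUNK   16   /* steal granularity */

/* Try to steal up to CHUNK task descriptors from a victim rank;
   returns how many were copied into out[]. */
static size_t try_steal(int victim, uint64_t *out)
{
    queue_hdr h;
    onesided_read(victim, HDR_OFF, &h, sizeof h);
    if (h.head >= h.tail) return 0;          /* victim has no work */

    uint64_t n = h.tail - h.head;
    if (n > CHUNK) n = CHUNK;

    /* Claim the range by atomically advancing the victim's head;
       a failed CAS means another thief (or the victim) won the race. */
    if (!onesided_cas(victim, HDR_OFF, h.head, h.head + n)) return 0;

    onesided_read(victim, BUF_OFF + h.head * sizeof(uint64_t),
                  out, n * sizeof(uint64_t));
    return n;
}

int main(void)
{
    /* Seed rank 1 with four tasks, then steal them from rank 0's side. */
    queue_hdr *h  = (queue_hdr *)segment[1];
    uint64_t *buf = (uint64_t *)(segment[1] + BUF_OFF);
    for (int i = 0; i < 4; i++) buf[i] = 100 + i;
    h->head = 0; h->tail = 4;

    uint64_t got[CHUNK];
    size_t n = try_steal(1, got);
    printf("stole %zu tasks, first = %llu\n", n, (unsigned long long)got[0]);
    return 0;
}
```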

    Redundant dataflow applications on clustered manycore architectures

    Increasing performance requirements in the embedded systems domain have encouraged a drift from single-core to multicore processors. Cars are an example of complex embedded systems in which the use of multicores continues to grow. The requirements of software components running in modern cars are diverse. On the one hand, there are safety-critical tasks such as airbag control; on the other hand, there are tasks which do not have any safety-related requirements at all, for example those controlling the infotainment system. Trends like autonomous driving lead to tasks which are simultaneously safety-critical and computationally complex. To satisfy the requirements of modern embedded applications, we developed a dataflow-based runtime environment (RTE) for clustered manycore architectures. The RTE is able to execute dataflow graphs in various redundancy configurations and with different schedulers. We implemented our RTE design on the Kalray Bostan Massively Parallel Processor Array and evaluated all possible configurations for three common computation tasks. To classify the performance of our RTE, we compared the non-redundant graph executions with OpenCL versions of the three applications. The results show that our RTE can come close to or even surpass Kalray's OpenCL framework, although maximum performance was not the primary goal of our design.
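
    As an illustration of one redundancy configuration such an RTE could offer, the sketch below fires a dataflow actor three times and takes a 2-out-of-3 majority vote on the produced token (triple modular redundancy). The names fire_redundant and square are hypothetical, and the replicas run sequentially here for simplicity; the paper's RTE dispatches replicas across clusters and also supports other configurations and schedulers.

```c
/* A minimal sketch of triple modular redundancy over a dataflow
   actor: fire three replicas, then majority-vote the output token. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t token;
typedef token (*actor_fn)(token in);

/* Fire the actor three times and vote. Returns 0 and writes the voted
   token on success, -1 if no two replicas agree (unrecoverable fault).
   A real RTE would run the replicas on disjoint compute clusters. */
static int fire_redundant(actor_fn f, token in, token *out)
{
    token r[3];
    for (int i = 0; i < 3; i++)
        r[i] = f(in);

    /* 2-out-of-3 majority vote. */
    if (r[0] == r[1] || r[0] == r[2]) { *out = r[0]; return 0; }
    if (r[1] == r[2])                 { *out = r[1]; return 0; }
    return -1;
}

/* Example actor: squares its input token. */
static token square(token x) { return x * x; }

int main(void)
{
    token y;
    if (fire_redundant(square, 7, &y) == 0)
        printf("voted result: %u\n", y);
    return 0;
}
```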