
    On the roles of the programmer, the compiler and the runtime system when programming accelerators in OpenMP

    OpenMP includes the accelerator model in its latest 4.0 specification. In this paper we present a partial implementation of this specification in the OmpSs programming model developed at the Barcelona Supercomputing Center, with the aim of identifying what the roles of the programmer, the compiler and the runtime system should be in order to facilitate the asynchronous execution of tasks on architectures with multiple accelerator devices and processors. The design of OmpSs is highly biased towards delegating most decisions to the runtime system, which, based on the task graph built at runtime (depend clauses), is able to schedule tasks in a data-flow manner to the available processors and accelerator devices and to orchestrate data transfers and reuse among multiple address spaces. For this reason our implementation is partial, considering only those 4.0 directives that enable the compiler to generate the so-called “kernels” to be executed on the target device. Several extensions to the current specification are also presented, such as the specification of tasks in “native” CUDA and OpenCL, or how to specify the device and data privatization in the target construct. Finally, the paper also discusses some challenges found in code generation and presents a preliminary performance evaluation with some kernel applications.
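
    The abstract above refers to OpenMP 4.0 target offloading combined with task dependences. As an illustration only (standard OpenMP 4.0 constructs, not the OmpSs extensions proposed in the paper), the following C++ sketch shows two tasks whose depend clauses let the runtime order them as a data-flow graph while each task offloads its loop to an accelerator:

        #include <cstdio>

        int main() {
            const int N = 1024;
            float a[N], b[N], c[N];
            for (int i = 0; i < N; ++i) { a[i] = float(i); b[i] = c[i] = 0.0f; }

            #pragma omp parallel
            #pragma omp single
            {
                // Task 1: produce b from a on the device.
                #pragma omp task depend(in: a) depend(out: b)
                {
                    #pragma omp target map(to: a) map(from: b)
                    #pragma omp teams distribute parallel for
                    for (int i = 0; i < N; ++i) b[i] = 2.0f * a[i];
                }

                // Task 2: consume b; the depend clauses force it to run after Task 1.
                #pragma omp task depend(in: b) depend(out: c)
                {
                    #pragma omp target map(to: b) map(from: c)
                    #pragma omp teams distribute parallel for
                    for (int i = 0; i < N; ++i) c[i] = b[i] + 1.0f;
                }

                #pragma omp taskwait
            }
            std::printf("c[10] = %f\n", c[10]);
            return 0;
        }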

    Self-adaptive OmpSs tasks in heterogeneous environments

    As new heterogeneous systems and hardware accelerators appear, high performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the system can choose among these versions at runtime to obtain the best performance achievable for the given application. From the results obtained on a multi-GPU system, we show that our proposal gives flexibility to the application's source code and can potentially increase the application's performance. A sketch of this task-versioning idea follows below. This work has been supported by the European Commission through the ENCORE project (FP7-248647), the TERAFLUX project (FP7-249013), the TEXT project (FP7-261580), the HiPEAC-3 Network of Excellence (FP7-ICT 287759), the Intel-BSC Exascale Lab collaboration project, the Spanish Ministry of Education (CSD2007-00050 and the FPU program), the Computación de Altas Prestaciones V and VI projects (TIN2007-60625, TIN2012-34557) and the Generalitat de Catalunya (2009-SGR-980).
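
    Conceptually, exposing specialized task versions can look like the following sketch. The syntax is assumed from the OmpSs model (device and implements clauses, array-section dependences) and only illustrates the idea described in the abstract; it is not the exact set of extensions evaluated in the paper:

        // Generic SMP implementation of the task.
        #pragma omp target device(smp)
        #pragma omp task in([n] a) out([n] b)
        void vec_scale(const float *a, float *b, int n, float s);

        // CUDA-specialized alternative; the implements clause (assumed
        // OmpSs-style syntax) declares it computes the same result, so the
        // runtime may pick either version for each task instance.
        #pragma omp target device(cuda) copy_deps implements(vec_scale)
        #pragma omp task in([n] a) out([n] b)
        void vec_scale_cuda(const float *a, float *b, int n, float s);

        void run(const float *a, float *b, int n) {
            vec_scale(a, b, n, 2.0f);   // the runtime selects an implementation
            #pragma omp taskwait
        }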

    Locality-Aware Dynamic Task Graph Scheduling

    Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NabbitC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NabbitC allows users to assign a color to each task representing the location (e.g., a processor core) that has the most efficient access to the data needed during that node's execution. NabbitC then automatically adjusts the scheduling so as to preferentially execute each node at the location that matches its color, leading to better locality because the node is likely to make local rather than remote accesses. At the same time, NabbitC tries to optimize load balance and not add too much overhead compared to the vanilla Nabbit scheduler, which does not consider locality. We provide a theoretical analysis showing that NabbitC does not asymptotically impact the scalability of Nabbit. We evaluated the performance of NabbitC on a suite of memory-intensive benchmarks. Our experiments indicate that adding locality awareness provides a considerable performance advantage over the vanilla Nabbit scheduler. In addition, we compared NabbitC to OpenMP programs for both regular and irregular applications. For regular applications, OpenMP achieves perfect locality and perfect load balance statically. For these benchmarks, NabbitC has a small performance penalty compared to OpenMP due to its dynamic scheduling strategy. For irregular applications, where OpenMP cannot achieve locality and load balance simultaneously, we find that NabbitC performs better. NabbitC therefore combines the benefits of locality-aware scheduling for regular applications (the forte of static schedulers such as those in OpenMP) with dynamic adaptation to load imbalance (the forte of dynamic schedulers such as Cilk Plus, TBB, and Nabbit).
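
    The color-guided placement described above can be pictured with a small, generic C++ sketch (an illustration of the idea only, not NabbitC's actual interface): tasks carry a color naming their preferred location, ready tasks go to per-location queues, and a worker drains the queue matching its location before stealing from the others:

        #include <deque>
        #include <functional>
        #include <mutex>
        #include <optional>
        #include <vector>

        struct Task {
            int color;                   // preferred location (e.g., NUMA node)
            std::function<void()> body;  // work to run once dependences are met
        };

        class ColoredScheduler {
        public:
            explicit ColoredScheduler(int locations) : queues_(locations), locks_(locations) {}

            // Called when a task's dependences are resolved: enqueue it at its color.
            void push_ready(Task t) {
                std::lock_guard<std::mutex> g(locks_[t.color]);
                queues_[t.color].push_back(std::move(t));
            }

            // A worker at `location` asks for work: prefer the matching color for
            // locality, then fall back to stealing from other colors for balance.
            std::optional<Task> pop_for(int location) {
                const int n = static_cast<int>(queues_.size());
                for (int i = 0; i < n; ++i) {
                    const int c = (location + i) % n;
                    std::lock_guard<std::mutex> g(locks_[c]);
                    if (!queues_[c].empty()) {
                        Task t = std::move(queues_[c].front());
                        queues_[c].pop_front();
                        return t;
                    }
                }
                return std::nullopt;
            }

        private:
            std::vector<std::deque<Task>> queues_;
            std::vector<std::mutex> locks_;
        };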

    A framework for argument-based task synchronization with automatic detection of dependencies

    Synchronization in parallel applications can be achieved either implicitly or explicitly. Implicit synchronization is typical of programming environments that provide predefined, and often simple, patterns of parallelism, such as data-parallel libraries and languages and skeletal operations. Nevertheless, more flexible approaches that allow arbitrary task-level parallel computations to be expressed without a predefined structure require in turn that the user explicitly specify the synchronization needed among the parallel tasks. In this paper we present a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user. Our proposal is, to our knowledge, the first generic approach to expressing parallelism that requires neither explicit synchronizations nor a detailed specification of the dependencies of the parallel tasks. Our strategy relies on expressing the parallel tasks as functions that convey their dependencies implicitly by means of their arguments. These function arguments are analyzed by our library, called DepSpawn, when a parallel task is spawned in order to enforce its dependencies. Our experiments indicate that DepSpawn is very competitive, both in terms of performance and programmability, with respect to a widespread high-level approach like OpenMP. Xunta de Galicia (INCITE08PXIB105161PR); Ministerio de Ciencia e Innovación (TIN2010-16735); Ministerio de Educación de España (AP2009-475).
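
    A rough sketch of the argument-driven model described above is shown below; the spawn / wait_for_all names and the header follow DepSpawn's public interface as we understand it and should be treated as assumptions rather than details taken from the paper:

        #include <depspawn/depspawn.h>

        // Parameters taken by non-const reference are treated as written (and read),
        // while const references and values are treated as read-only; the library
        // derives the task dependences from these argument types at spawn time.
        void axpy(float& y, float a, const float& x) { y += a * x; }

        int main() {
            float x = 1.0f, y = 2.0f, z = 0.0f;

            depspawn::spawn(axpy, y, 2.0f, x);  // writes y, reads x
            depspawn::spawn(axpy, z, 3.0f, y);  // reads y, so it waits for the first task
            depspawn::wait_for_all();           // barrier until all spawned tasks finish
            return 0;
        }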

    Impact study of data locality on task-based applications through the Heteroprio scheduler

    The task-based approach has emerged as a viable way to effectively use modern heterogeneous computing nodes. It allows the development of parallel applications with an abstraction of the hardware by delegating task distribution and load balancing to a dynamic scheduler. In this organization, the scheduler is the most critical component: it solves the DAG scheduling problem in order to select the right processing unit for the computation of each task. In this work, we extend our Heteroprio scheduler, which was originally created to execute the fast multipole method on multi-GPU nodes. We improve Heteroprio by taking data locality into account during task distribution. The main principle is to use different task lists for the different memory nodes and to investigate how locality affinity between the tasks and the different memory nodes can be evaluated without looking at the tasks' dependencies. We evaluate the benefit of our method on two linear algebra applications and a stencil code. We show that simple heuristics can provide significant performance improvement and cut the total memory transfers of an execution by more than half.
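
    One simple locality heuristic of the kind evaluated above could be sketched as follows (an illustration only, not Heteroprio's actual implementation): a ready task is pushed to the list of the memory node that already holds most of its data, without inspecting the task graph:

        #include <algorithm>
        #include <cstddef>
        #include <vector>

        struct DataHandle {
            std::size_t bytes;   // size of this piece of data
            int resident_node;   // memory node currently holding a valid copy
        };

        // Returns the memory node whose ready list should receive the task: the
        // node that already holds the largest share of the task's data, so that
        // executing the task there minimizes the expected memory transfers.
        int preferred_node(const std::vector<DataHandle>& data, int num_nodes) {
            std::vector<std::size_t> local_bytes(num_nodes, 0);
            for (const DataHandle& d : data)
                if (d.resident_node >= 0 && d.resident_node < num_nodes)
                    local_bytes[d.resident_node] += d.bytes;
            return static_cast<int>(std::max_element(local_bytes.begin(), local_bytes.end())
                                    - local_bytes.begin());
        }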