
    Extending OmpSs to support dynamic programming

    After a brief overview of dynamic programming and OmpSs, this document shows how OmpSs supports memoization and presents the results obtained. The extensions made to the programming model, the Mercurium compiler and the Nanos++ runtime system to support memoization are detailed.

    High-level effect handlers in C++


    Worksharing tasks: An efficient way to exploit irregular and fine-grained loop parallelism

    Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism, while the latter relies on fine-grained synchronization among tasks and a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. On applications that exhibit both structured and unstructured parallelism, the two constructs can be combined. However, it is difficult to mix both execution models without penalizing the data-flow execution model. Hence, on many applications structured parallelism is also exploited using tasks, to leverage the full benefits of a pure data-flow execution model. However, task creation and management may introduce non-negligible overhead that prevents the efficient exploitation of fine-grained structured parallelism, especially on many-core processors. In this work, we propose worksharing tasks: tasks that internally leverage worksharing techniques to exploit fine-grained, structured, loop-based parallelism. The evaluation shows promising results on several benchmarks and platforms.

    This work is supported by the Spanish Ministerio de Ciencia, Innovación y Universidades (TIN2015-65316-P), by the Generalitat de Catalunya (2014-SGR-1051), and by the European Union's Seventh Framework Programme (FP7/2007-2013) and the H2020 funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST). Peer Reviewed. Postprint (author's final draft).

    Evaluating worksharing tasks on distributed environments

    Hybrid programming is a promising approach to exploit clusters of multicore systems. Our focus is on the combination of MPI and tasking. This hybrid approach combines the low latency and high throughput of MPI with the flexibility of tasking models and their inherent ability to handle load imbalance. However, combining tasking with standard MPI implementations can be a challenge. The Task-Aware MPI library (TAMPI) eases the development of applications combining tasking with MPI. TAMPI enables developers to overlap computation and communication phases by relying on the tasking data-flow execution model. Using this approach, the original computation that was distributed across many different MPI ranks is grouped together in fewer MPI ranks and split into several tasks per rank. Nevertheless, programmers must be careful with task granularity. Too fine-grained tasks introduce too much overhead, while too coarse-grained tasks lead to a lack of parallelism. An adequate granularity may not always exist, especially in distributed environments where the same amount of work is distributed among many more cores. Worksharing tasks are a recently proposed special kind of task that internally leverages worksharing techniques. By doing so, a single worksharing task may run on several cores concurrently, while the task management costs remain the same as for a regular task. In this work, we study the combination of worksharing tasks and TAMPI on distributed environments using two well-known mini-apps: HPCCG and LULESH. Our results show significant improvements using worksharing tasks compared to regular tasks and to other state-of-the-art alternatives such as OpenMP worksharing.

    This project is supported by the European Union's Horizon 2020 Research and Innovation programme under grant agreements No. 754304 (DEEP-EST) and No. 823767 (PRACE), by the Ministry of Economy of Spain through the Severo Ochoa Center of Excellence Program (SEV-2015-0493), by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB) and by the Generalitat de Catalunya (2017-SGR1481). The work has been performed under the project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme; in particular, the author gratefully acknowledges the support of Dr Mark Bull (EPCC) and the computer resources and technical support provided by EPCC.

    Advanced synchronization techniques for task-based runtime systems

    Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small-granularity tasks remains a challenge, and bottlenecks can manifest in several runtime components. In this paper, we analyze the limiting factors in the scalability of a task-based runtime system and propose individual solutions to each of the challenges, including a wait-free dependency system and a novel scalable scheduler design based on delegation. We evaluate how the optimizations impact the overall performance of the runtime, both individually and in combination. We also compare the resulting runtime against state-of-the-art OpenMP implementations, showing equivalent or better performance, especially for fine-grained tasks.

    This project is supported by the European Union's Horizon 2020 Research and Innovation programme under grant agreement No. 754304 (DEEP-EST), by the Spanish Ministry of Science and Innovation (contracts PID2019-107255GB and TIN2015-65316-P) and by the Generalitat de Catalunya (2017-SGR-1414).

    On the design and development of programming models for exascale systems

    High Performance Computing (HPC) systems have been evolving over time to adapt to the requirements of the scientific community. We are currently approaching the Exascale era. Exascale systems will incorporate a large number of nodes, each of them containing many computing resources. In addition, not only the computing resources but also the memory hierarchies are becoming deeper and more complex. Overall, Exascale systems will present several challenges in terms of performance, programmability and fault tolerance. Regarding programmability, the more complex a system architecture is, the harder it is to exploit the system properly. Programmability is closely related to performance, because the performance a system can deliver is useless if users are not able to write programs that obtain it. This stresses the importance of programming models as a tool to easily write programs that can reach the peak performance of the system. Finally, it is well known that more components lead to more errors. The combination of long executions with a low Mean Time To Failure (MTTF) may jeopardize application progress. Thus, all the efforts made to improve performance become pointless if applications hardly ever finish. To prevent that, we must apply fault tolerance.

    The main goal of this thesis is to enable non-expert users to exploit complex Exascale systems. To that end, we have enhanced state-of-the-art parallel programming models to cope with three key Exascale challenges: programmability, performance and fault tolerance. The first set of contributions focuses on the efficient management of modern multicore/manycore processors. We propose a new kind of task that combines the key advantages of tasks with the key advantages of worksharing techniques. The use of this new task type alleviates granularity issues, thereby enhancing performance in several scenarios. We also propose the introduction of dependences in the taskloop construct, so that programmers can easily apply blocking techniques. Finally, we extend the taskloop construct to support the creation of the new kind of tasks instead of regular tasks.

    The second set of contributions focuses on the efficient management of modern memory hierarchies, centered on NUMA domains. Using the information that users provide in the dependence annotations, we build a system that tracks data location. Later, we use this information to take scheduling decisions that maximize data locality.

    Our last set of contributions focuses on fault tolerance. We propose a programming model that provides application-level checkpoint/restart in an easy and portable way. Our programming model offers a set of compiler directives to abstract users from system-level nuances. It then leverages state-of-the-art libraries to deliver high performance and includes several redundancy schemes. Postprint (published version).

    Ompss persistent checkpoint/restart: a directive-based approach

    Exascale platforms require programming models that incorporate support for resilience, since the huge number of components they are expected to have will increase the number of errors. Checkpoint/restart is a widely used resilience technique due to its robustness and low overhead compared to other techniques. Several solutions implementing this technique already exist, such as FTI or SCR, which focus mainly on providing advanced I/O capabilities to minimize checkpoint/restart time. However, application developers are still in charge of: (1) manually serializing and deserializing the application state using a low-level API; (2) modifying the natural flow of the application depending on whether the current execution is a restart or not; and (3) reimplementing their checkpoint/restart code whenever they have to change the backend library. We present a new directive-based approach to performing application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP ones, that allows users to easily specify the application state that has to be saved and restored, leaving the tedious and error-prone serialization and deserialization activities to our intermediate library, which relies on a backend library (FTI/SCR) to perform scalable and efficient I/O operations. Our results, including several benchmarks and two large applications, reveal no extra overhead compared to the direct use of the FTI/SCR checkpoint/restart libraries, while significantly reducing the effort required from application developers.
