791 research outputs found

    Reducing the burden of parallel loop schedulers for many-core processors

    Funder: FP7 People: Marie‐Curie Actions; Id: http://dx.doi.org/10.13039/100011264; Grant(s): 327744. Summary: As core counts in processors increase, it becomes harder to schedule and distribute work in a timely and scalable manner. This article enhances the scalability of parallel loop schedulers by specializing schedulers for fine‐grain loops. We propose a low‐overhead work distribution mechanism for a static scheduler that uses no atomic operations. We integrate our static scheduler with the Intel OpenMP and Cilkplus parallel task schedulers to build hybrid schedulers. Compiler support enables efficient reductions for Cilk, without changing the programming interface of Cilk reducers. Detailed, quantitative measurements demonstrate that our techniques achieve scalable performance on a 48‐core machine and that the scheduling overhead is 43% lower than Intel OpenMP and 12.1× lower than Cilk. We demonstrate consistent performance improvements on a range of HPC and data analytics codes. Performance gains are more important as loops become finer‐grain and thread counts increase. We consistently observe 16%–30% speedup on 48 threads, with a peak speedup of 2.8×.
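    The abstract above does not describe the work distribution mechanism itself. As a rough illustration of why a static schedule can avoid atomic operations, the sketch below uses ordinary block partitioning: each thread derives its own iteration range purely from its thread id, so no shared counter is touched. The function name `static_chunk` and its parameters are hypothetical and are not taken from the article.

```python
def static_chunk(tid, nthreads, n):
    """Return the half-open range [lo, hi) of loop iterations owned by thread `tid`.

    Generic block partitioning (an assumption, not the article's mechanism):
    the first `n % nthreads` threads receive one extra iteration.
    """
    base, extra = divmod(n, nthreads)
    lo = tid * base + min(tid, extra)
    hi = lo + base + (1 if tid < extra else 0)
    return lo, hi


# Each thread would call static_chunk with its own id; nothing is shared.
chunks = [static_chunk(t, 4, 10) for t in range(4)]
```

    Because every range is a function of `tid`, `nthreads` and `n` alone, threads need no synchronization to agree on who runs which iterations.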

    Analysis, classification and comparison of scheduling techniques for software transactional memories

    Transactional Memory (TM) is a practical programming paradigm for developing concurrent applications. Performance is a critical factor for TM implementations, and various studies have demonstrated that specialised transaction/thread scheduling support is essential for implementing performance-effective TM systems. After one decade of research, this article reviews the wide variety of scheduling techniques proposed for Software Transactional Memories. Based on the peculiarities and differences of the adopted scheduling strategies, we propose a classification of the existing techniques, and we discuss the specific characteristics of each technique. We also analyse the results of previous evaluation and comparison studies, and we present the results of a new experimental study encompassing techniques based on different scheduling strategies. Finally, we identify potential strengths and weaknesses of the different techniques, as well as the issues that require further investigation.

    Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming

    This work focuses on compiler and run-time techniques for improving the productivity and the performance portability of general-purpose parallel programming. More specifically, we focus on shared-memory task-parallel languages, where the programmer explicitly exposes parallelism in the form of short tasks that may outnumber the cores by orders of magnitude. The compiler, the run-time, and the platform (henceforth the system) are responsible for harnessing this unpredictable amount of parallelism, which can vary from none to excessive, towards efficient execution. The challenge arises from the aspiration to support fine-grained irregular computations and nested parallelism. This work is even more ambitious by also aspiring to lay the foundations to efficiently support declarative code, where the programmer exposes all available parallelism, using high-level language constructs such as parallel loops, reducers or futures. The appeal of declarative code is twofold for general-purpose programming: it is often easier for the programmer who does not have to worry about the granularity of the exposed parallelism, and it achieves better performance portability by avoiding overfitting to a small range of platforms and inputs for which the programmer is coarsening. Furthermore, PRAM algorithms, an important class of parallel algorithms, naturally lend themselves to declarative programming, so supporting it is a necessary condition for capitalizing on the wealth of the PRAM theory. Unfortunately, declarative codes often expose such an overwhelming number of fine-grained tasks that existing systems fail to deliver performance. Our contributions can be partitioned into three components. First, we tackle the issue of coarsening, which declarative code leaves to the system. We identify two goals of coarsening and advocate tackling them separately, using static compiler transformations for one and dynamic run-time approaches for the other. 
Additionally, we present evidence that the current practice of burdening the programmer with coarsening either leads to codes with poor performance-portability, or to a significantly increased programming effort. This is a "show-stopper" for general-purpose programming. To compare the performance portability among approaches, we define an experimental framework and two metrics, and we demonstrate that our approaches are preferable. We close the chapter on coarsening by presenting compiler transformations that automatically coarsen some types of very fine-grained codes. Second, we propose Lazy Scheduling, an innovative run-time scheduling technique that infers the platform load at run-time, using information already maintained. Based on the inferred load, Lazy Scheduling adapts the amount of available parallelism it exposes for parallel execution and, thus, saves parallelism overheads that existing approaches pay. We implement Lazy Scheduling and present experimental results on four different platforms. The results show that Lazy Scheduling is vastly superior for declarative codes and competitive, if not better, for coarsened codes. Moreover, Lazy Scheduling is also superior in terms of performance-portability, supporting our thesis that it is possible to achieve reasonable efficiency and performance portability with declarative codes. Finally, we also implement Lazy Scheduling on XMT, an experimental manycore platform developed at the University of Maryland, which was designed to support codes derived from PRAM algorithms. On XMT, we manage to harness the existing hardware support for scheduling flat parallelism to compose it with Lazy Scheduling, which supports nested parallelism. In the resulting hybrid scheduler, the hardware and software work in synergy to overcome each other's weaknesses. 
We show the performance composability of the hardware and software schedulers, both in an abstract cost model and experimentally, as the hybrid always performs better than the software scheduler alone. Furthermore, the cost model is validated by using it to predict if it is preferable to execute a code sequentially, with outer parallelism, or with nested parallelism, depending on the input, the available hardware parallelism and the calling context of the parallel code.
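    The abstract says only that Lazy Scheduling infers load from information the run-time already maintains and exposes parallelism accordingly. One plausible reading, sketched below as a single-worker simulation, treats an empty local task deque as the load signal: the worker splits its loop range only when its deque is empty and otherwise runs iterations sequentially. The deque-emptiness test, the function name `lazy_parallel_for` and the absence of thieves are all assumptions made for illustration, not details from the thesis.

```python
from collections import deque


def lazy_parallel_for(lo, hi, body):
    """Run body(i) for i in [lo, hi) and return how many tasks were exposed.

    Single-worker simulation. Assumed load signal: an empty local deque means
    other workers might be starving, so half of the remaining range is pushed.
    """
    local = deque([(lo, hi)])  # this worker's task deque
    pushed = 0                 # tasks exposed for potential stealing
    while local:
        lo, hi = local.pop()
        while lo < hi:
            if not local and hi - lo > 1:
                mid = (lo + hi) // 2
                local.append((mid, hi))  # expose the upper half
                pushed += 1
                hi = mid
            else:
                body(lo)                 # deque not empty: just do the work
                lo += 1
    return pushed


seen = []
tasks_exposed = lazy_parallel_for(0, 8, seen.append)
```

    With no other worker draining the deque, 8 iterations expose only 3 tasks, whereas eager binary splitting down to single iterations would create 7.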

    Exploiting asymmetric multi-core systems with flexible system software

    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, there are significant challenges when using AMCs, such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically, we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite, which includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system, as it improves the user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves the user-level scheduling by 10%. Following this outcome, this thesis focuses on increasing the performance of AMC systems by improving scheduling at the runtime system level. Scheduling at the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programming flexibility as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time by detecting either the longest path or the critical path of the dynamic task dependency graph of the application. 
They use dynamic scheduling and information discoverable during execution, which makes them implementable and functional without the need for off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of-the-art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another enhancement we provide in task-based programming models is the adaptability to fine-grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine-grained workloads, which is an important challenge for task-based programming models. Our study makes the observation that task creation becomes a bottleneck when executing fine-grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks becomes more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities. It then transfers this part of the runtime to specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements of up to 15x, averaging 3.1x over the baseline. Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal of increasing the frames per second (FPS) of gameplay on mobile devices with AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. 
The RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% while keeping the temperature stable.
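    The schedulers in the abstract above rely on finding the longest or critical path of a task dependency graph. As a minimal offline sketch of that idea (the thesis does this dynamically while the graph is being built, which is not reproduced here), the code below computes the critical path of a small DAG with known task costs by processing tasks in topological order. The names `critical_path`, `costs` and `preds` are hypothetical.

```python
from collections import deque


def critical_path(costs, preds):
    """Return (length, path) of the costliest dependency chain in a task DAG.

    costs: task -> execution cost; preds: task -> list of prerequisite tasks.
    """
    succs = {t: [] for t in costs}
    indeg = {t: 0 for t in costs}
    for task, ps in preds.items():
        for p in ps:
            succs[p].append(task)
            indeg[task] += 1

    start = {t: 0 for t in costs}      # earliest possible start time
    parent = {t: None for t in costs}  # predecessor on the costliest chain
    ready = deque(t for t in costs if indeg[t] == 0)
    visited = 0
    while ready:                       # Kahn's topological traversal
        t = ready.popleft()
        visited += 1
        end = start[t] + costs[t]
        for s in succs[t]:
            if end > start[s]:
                start[s], parent[s] = end, t
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if visited != len(costs):
        raise ValueError("dependency graph contains a cycle")

    last = max(costs, key=lambda t: start[t] + costs[t])
    length = start[last] + costs[last]
    path = []
    while last is not None:
        path.append(last)
        last = parent[last]
    return length, path[::-1]


length, path = critical_path(
    {"a": 2, "b": 3, "c": 1, "d": 4},
    {"b": ["a"], "c": ["a"], "d": ["b", "c"]},
)
```

    A scheduler for asymmetric cores could, for instance, prefer to place tasks on `path` on the faster cores; how the thesis schedulers actually use the path is not specified in the abstract.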

    Parallel For Loops on Heterogeneous Resources

    In recent years, Graphics Processing Units (GPUs) have piqued the interest of researchers in scientific computing. Their immense floating point throughput and massive parallelism make them ideal for not just graphical applications, but many general algorithms as well. Load balancing applications and taking advantage of all computational resources in a machine is a difficult challenge, especially when the resources are heterogeneous. This dissertation presents the clUtil library, which vastly simplifies developing OpenCL applications for heterogeneous systems. The core focus of this dissertation lies in clUtil's ParallelFor construct and our novel PINA scheduler, which can efficiently load balance work onto multiple GPUs and CPUs simultaneously.
