92 research outputs found

    NUMA Optimizations for Algorithmic Skeletons

    Get PDF

    Automatic skeleton-driven performance optimizations for transactional memory

    Get PDF
    The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more coarse -grain parallelism with every new generation of processors, or the performance of their applications will remain roughly the same or even degrade. Unfortunately, parallel programming is still hard and error prone. This has driven the development of many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the problem at hand. Thus, this programming approach simplifies parallel programming by eliminating some of the major challenges of parallel programming, namely thread communication, scheduling and orchestration. However, the application programmer has still to correctly synchronize threads on data races. This commonly requires the use of locks to guarantee atomic access to shared data. In particular, lock programming is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly. This model allows programmers to write parallel code as transactions, which are then guaranteed by the runtime system to execute atomically and in isolation regardless of eventual data races. TM programming thus frees the application from deadlocks and enables the exploitation of coarse grain parallelism when transactions do not conflict very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework. This fact makes the combination of skeleton -based and transactional programming a natural step to improve programmability since these models complement each other. In fact, this combination releases the application programmer from dealing with thread management and data races, and also inherits the performance improvements of both models. In addition to it, a skeleton framework is also amenable to skeleton - driven iii performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the knowledge about the worklist pattern and the TM nature of the applications to avoid transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly. Experimental results on a subset of five applications from the STAMP benchmark suite show that the proposed autotuning mechanism can achieve performance improvements within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance optimizations such as thread mapping and memory page allocation. In order to do it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five applications from the STAMP benchmark show that the OpenSkel framework with the extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %, over a baseline version for a 16 -core UMA platform and up to 162 %, with an average of 91 %, for a 32 -core NUMA platform

    Automatic skeleton-driven performance optimizations for transactional memory

    Get PDF
    The recent shift toward multi-core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more coarse-grain parallelism with every new generation of processors, or the performance of their applications will remain roughly the same or even degrade. Unfortunately, parallel programming is still hard and error prone. This has driven the development of many new parallel programming models that aim to make this process efficient. This thesis first combines the skeleton-based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the problem at hand. Thus, this programming approach simplifies parallel programming by eliminating some of the major challenges of parallel programming, namely thread communication, scheduling and orchestration. However, the application programmer has still to correctly synchronize threads on data races. This commonly requires the use of locks to guarantee atomic access to shared data. In particular, lock programming is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads that could be potentially executed in parallel. Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly. This model allows programmers to write parallel code as transactions, which are then guaranteed by the runtime system to execute atomically and in isolation regardless of eventual data races. TM programming thus frees the application from deadlocks and enables the exploitation of coarse grain parallelism when transactions do not conflict very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework. This fact makes the combination of skeleton-based and transactional programming a natural step to improve programmability since these models complement each other. In fact, this combination releases the application programmer from dealing with thread management and data races, and also inherits the performance improvements of both models. In addition to it, a skeleton framework is also amenable to skeleton-driven performance optimizations that exploits the application pattern and system information. This thesis thus also presents a set of pattern-oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the knowledge about the worklist pattern and the TM nature of the applications to avoid transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these pattern-oriented performance optimizations for each application and adjusts them accordingly. Experimental results on a subset of five applications from the STAMP benchmark suite show that the proposed autotuning mechanism can achieve performance improvements within 2%, on average, of a static oracle for a 16-core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32-core NUMA (Non-Uniform Memory Access) platform. Finally, this thesis also investigates skeleton-driven system-oriented performance optimizations such as thread mapping and memory page allocation. In order to do it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five applications from the STAMP benchmark show that the OpenSkel framework with the extended autotuning mechanism driving both pattern and system-oriented optimizations can achieve performance improvements of up to 88%, with an average of 46%, over a baseline version for a 16-core UMA platform and up to 162%, with an average of 91%, for a 32-core NUMA platform

    Heterogeneous computing with an algorithmic skeleton framework

    Get PDF
    The Graphics Processing Unit (GPU) is present in almost every modern day personal computer. Despite its specific purpose design, they have been increasingly used for general computations with very good results. Hence, there is a growing effort from the community to seamlessly integrate this kind of devices in everyday computing. However, to fully exploit the potential of a system comprising GPUs and CPUs, these devices should be presented to the programmer as a single platform. The efficient combination of the power of CPU and GPU devices is highly dependent on each device’s characteristics, resulting in platform specific applications that cannot be ported to different systems. Also, the most efficient work balance among devices is highly dependable on the computations to be performed and respective data sizes. In this work, we propose a solution for heterogeneous environments based on the abstraction level provided by algorithmic skeletons. Our goal is to take full advantage of the power of all CPU and GPU devices present in a system, without the need for different kernel implementations nor explicit work-distribution.To that end, we extended Marrow, an algorithmic skeleton framework for multi-GPUs, to support CPU computations and efficiently balance the work-load between devices. Our approach is based on an offline training execution that identifies the ideal work balance and platform configurations for a given application and input data size. The evaluation of this work shows that the combination of CPU and GPU devices can significantly boost the performance of our benchmarks in the tested environments, when compared to GPU-only executions
    corecore