623 research outputs found

    Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures

    Get PDF
    A new solver featuring time-space adaptation and error control has been recently introduced to tackle the numerical solution of stiff reaction-diffusion systems. Based on operator splitting, finite volume adaptive multiresolution and high order time integrators with specific stability properties for each operator, this strategy yields high computational efficiency for large multidimensional computations on standard architectures such as powerful workstations. However, the data structure of the original implementation, based on trees of pointers, provides limited opportunities for efficiency enhancements, while posing serious challenges in terms of parallel programming and load balancing. The present contribution proposes a new implementation of the whole set of numerical methods including Radau5 and ROCK4, relying on a fully different data structure together with the use of a specific library, TBB, for shared-memory, task-based parallelism with work-stealing. The performance of our implementation is assessed in a series of test-cases of increasing difficulty in two and three dimensions on multi-core and many-core architectures, demonstrating high scalability

    Towards an Adaptive Skeleton Framework for Performance Portability

    Get PDF
    The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregularly-parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregularly parallel programs. The approach combines declarative parallelism with JIT technology, dynamic scheduling, and dynamic transformation. We present the design of an adaptive skeleton library, with a task graph implementation, JIT trace costing, and adaptive transformations. We outline the architecture of the protoype adaptive skeleton execution framework in Pycket, describing tasks, serialisation, and the current scheduler.We report a preliminary evaluation of the prototype framework using 4 micro-benchmarks and a small case study on two NUMA servers (24 and 96 cores) and a small cluster (17 hosts, 272 cores). Key results include Pycket delivering good sequential performance e.g. almost as fast as C for some benchmarks; good absolute speedups on all architectures (up to 120 on 128 cores for sumEuler); and that the adaptive transformations do improve performance

    Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages

    Get PDF
    Processor design has turned toward parallelism and heterogeneity cores to achieve performance and energy efficiency. Developers find high-level languages attractive because they use abstraction to offer productivity and portability over hardware complexities. To achieve performance, some modern implementations of high-level languages use work-stealing scheduling for load balancing of dynamically created tasks. Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. These overheads arise as a necessary side effect of the implementation and hamper parallel performance. In addition to runtime-imposed overheads, there is a substantial cognitive load associated with ensuring that parallel code is data-race free. This dissertation explores the overheads associated with achieving high performance parallelism in modern high-level languages. My thesis is that, by exploiting existing underlying mechanisms of managed runtimes; and by extending existing language design, high-level languages will be able to deliver productivity and parallel performance at the levels necessary for widespread uptake. The key contributions of my thesis are: 1) a detailed analysis of the key sources of overhead associated with a work-stealing runtime, namely sequential and dynamic overheads; 2) novel techniques to reduce these overheads that use rich features of managed runtimes such as the yieldpoint mechanism, on-stack replacement, dynamic code-patching, exception handling support, and return barriers; 3) comprehensive analysis of the resulting benefits, which demonstrate that work-stealing overheads can be significantly reduced, leading to substantial performance improvements; and 4) a small set of language extensions that achieve both high performance and high productivity with minimal programmer effort. A managed runtime forms the backbone of any modern implementation of a high-level language. Managed runtimes enjoy the benefits of a long history of research and their implementations are highly optimized. My thesis demonstrates that converging these highly optimized features together with the expressiveness of high-level languages, gives further hope for achieving high performance and high productivity on modern parallel hardwar

    A taxonomy of task-based parallel programming technologies for high-performance computing

    Get PDF
    Task-based programming models for shared memory -- such as Cilk Plus and OpenMP 3 -- are well established and documented. However, with the increase in parallel, many-core and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today

    Configurable Strategies for Work-stealing

    Full text link
    Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. For instance, they do not know or take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, the actual task execution order is typically determined by the underlying task storage data structure, and cannot be changed. There are thus possibilities for optimizing task parallel executions by providing information on specific tasks and their preferred execution order to the scheduling system. We introduce scheduling strategies to enable applications to dynamically provide hints to the task-scheduling system on the nature of specific tasks. Scheduling strategies can be used to independently control both local task execution order as well as steal order. In contrast to conventional scheduling policies that are normally global in scope, strategies allow the scheduler to apply optimizations on individual tasks. This flexibility greatly improves composability as it allows the scheduler to apply different, specific scheduling choices for different parts of applications simultaneously. We present a number of benchmarks that highlight diverse, beneficial effects that can be achieved with scheduling strategies. Some benchmarks (branch-and-bound, single-source shortest path) show that prioritization of tasks can reduce the total amount of work compared to standard work-stealing execution order. For other benchmarks (triangle strip generation) qualitatively better results can be achieved in shorter time. Other optimizations, such as dynamic merging of tasks or stealing of half the work, instead of half the tasks, are also shown to improve performance. Composability is demonstrated by examples that combine different strategies, both within the same kernel (prefix sum) as well as when scheduling multiple kernels (prefix sum and unbalanced tree search)

    Locality-Aware Dynamic Task Graph Scheduling

    Get PDF
    Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NabbitC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NabbitC allows users to assign a color to each task representing the location (e.g., a processor core) that has the most efficient access to data needed during that node’s execution. NabbitC then automatically adjusts the scheduling so as to preferentially execute each node at the location that matches its color—leading to better locality because the node is likely to make local rather than remote accesses. At the same time, NabbitC tries to optimize load balance and not add too much overhead compared to the vanilla Nabbit scheduler that does not consider locality. We provide a theoretical analysis that shows that NabbitC does not asymptotically impact the scalability of Nabbit . We evaluated the performance of NabbitC on a suite of memory intensive benchmarks. Our experiments indicates that adding locality awareness has a considerable performance advantage compared to the vanilla Nabbit scheduler. In addition, we also compared NabbitC to OpenMP programs for both regular and irregular applications. For regular applications, OpenMP achieves perfect locality and perfect load balance statically. For these benchmarks, NabbitC has a small performance penalty compared to OpenMP due to its dynamic scheduling strategy. For irregular applications, where OpenMP can not achieve locality and load balance simultaneously, we find that NabbitC performs better. Therefore, NabbitC combines the benefits of locality- aware scheduling for regular applications (the forte of static schedulers such as those in OpenMP) and dynamically adapting to load imbalance (the forte of dynamic schedulers such as Cilk Plus, TBB, and Nabbit)

    Adaptive query parallelization in multi-core column stores

    Get PDF
    With the rise of multi-core CPU platforms, their optimal utilization for in-memory OLAP workloads using column store databases has become one of the biggest challenges. Some of the inherent limi- tations in the achievable query parallelism are due to the degree of parallelism dependency on the data skew, the overheads incurred by thread coordination, and the hardware resource limits. Finding the right balance between the degree of parallelism and the multi-core utilizati
    • …
    corecore