7,844 research outputs found

    Locality-Aware Dynamic Task Graph Scheduling

    Get PDF
    Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NabbitC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NabbitC allows users to assign a color to each task representing the location (e.g., a processor core) that has the most efficient access to data needed during that node’s execution. NabbitC then automatically adjusts the scheduling so as to preferentially execute each node at the location that matches its color—leading to better locality because the node is likely to make local rather than remote accesses. At the same time, NabbitC tries to optimize load balance and not add too much overhead compared to the vanilla Nabbit scheduler that does not consider locality. We provide a theoretical analysis that shows that NabbitC does not asymptotically impact the scalability of Nabbit . We evaluated the performance of NabbitC on a suite of memory intensive benchmarks. Our experiments indicates that adding locality awareness has a considerable performance advantage compared to the vanilla Nabbit scheduler. In addition, we also compared NabbitC to OpenMP programs for both regular and irregular applications. For regular applications, OpenMP achieves perfect locality and perfect load balance statically. For these benchmarks, NabbitC has a small performance penalty compared to OpenMP due to its dynamic scheduling strategy. For irregular applications, where OpenMP can not achieve locality and load balance simultaneously, we find that NabbitC performs better. Therefore, NabbitC combines the benefits of locality- aware scheduling for regular applications (the forte of static schedulers such as those in OpenMP) and dynamically adapting to load imbalance (the forte of dynamic schedulers such as Cilk Plus, TBB, and Nabbit)

    Graph partitioning applied to DAG scheduling to reduce NUMA effects

    Get PDF
    The complexity of shared memory systems is becoming more relevant as the number of memory domains increases, with different access latencies and bandwidth rates depending on the proximity between the cores and the devices containing the data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are typically applied by the system software. We propose techniques at the runtime system level to reduce NUMA effects on parallel applications. We leverage runtime system metadata in terms of a task dependency graph. Our approach, based on graph partitioning methods, is able to provide parallel performance improvements of 1.12X on average with respect to the state-of-the-art.This work has been partially supported by the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Spanish Government (contract TIN2015-65316-P). I. Sánchez Barrera has been supported by the Spanish Government under Formación del Profesorado Universitario fellowship number FPU15/03612.Peer ReviewedPostprint (published version

    TaskInsight: Understanding Task Schedules Effects on Memory and Performance

    Get PDF
    Recent scheduling heuristics for task-based applications have managed to improve their by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insights into why, and where, different schedulers improve memory behavior, and how this is related to the applications' performance. To address this, we present TaskInsight, a technique to characterize the memory behavior of different task schedulers through the analysis of data reuse between tasks. TaskInsight provides high-level, quantitative information that can be correlated with tasks' performance variation over time to understand data reuse through the caches due to scheduling choices. TaskInsight is useful to diagnose and identify which scheduling decisions affected performance, when were they taken, and why the performance changed, both in single and multi-threaded executions. We demonstrate how TaskInsight can diagnose examples where poor scheduling caused over 10% difference in performance for tasks of the same type, due to changes in the tasks' data reuse through the private and shared caches, in single and multi-threaded executions of the same application. This flexible insight is key for optimization in many contexts, including data locality, throughput, memory footprint or even energy efficiency.We thank the reviewers for their feedback. This work was supported by the Swedish Research Council, the Swedish Foundation for Strategic Research project FFL12-0051 and carried out within the Linnaeus Centre of Excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center. This paper was also published with the support of the HiPEAC network that received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 687698.Peer ReviewedPostprint (published version

    Configurable Strategies for Work-stealing

    Full text link
    Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. For instance, they do not know or take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, the actual task execution order is typically determined by the underlying task storage data structure, and cannot be changed. There are thus possibilities for optimizing task parallel executions by providing information on specific tasks and their preferred execution order to the scheduling system. We introduce scheduling strategies to enable applications to dynamically provide hints to the task-scheduling system on the nature of specific tasks. Scheduling strategies can be used to independently control both local task execution order as well as steal order. In contrast to conventional scheduling policies that are normally global in scope, strategies allow the scheduler to apply optimizations on individual tasks. This flexibility greatly improves composability as it allows the scheduler to apply different, specific scheduling choices for different parts of applications simultaneously. We present a number of benchmarks that highlight diverse, beneficial effects that can be achieved with scheduling strategies. Some benchmarks (branch-and-bound, single-source shortest path) show that prioritization of tasks can reduce the total amount of work compared to standard work-stealing execution order. For other benchmarks (triangle strip generation) qualitatively better results can be achieved in shorter time. Other optimizations, such as dynamic merging of tasks or stealing of half the work, instead of half the tasks, are also shown to improve performance. Composability is demonstrated by examples that combine different strategies, both within the same kernel (prefix sum) as well as when scheduling multiple kernels (prefix sum and unbalanced tree search)

    Hybrid static/dynamic scheduling for already optimized dense matrix factorization

    Get PDF
    We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the usage of this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48 core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version of CALU that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the version of CALU that uses a fully static scheduling or fully dynamic scheduling. Our algorithm leads to speedups over the corresponding routines for computing LU factorization in well known libraries. On the 48 core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both of these systems
    • …
    corecore