    Locality-aware Scheduling and Characterization of Task-based Programs

    Modern computer architectures expose an increasing number of parallel features supported by complex memory access and communication structures. Current task scheduling techniques perform poorly on such architectures because they focus solely on balancing computational load across parallel features and remain oblivious to the locality properties of the supporting structures. We contribute locality-aware task scheduling mechanisms that improve execution time on average by 44% and 11% respectively on two locality-sensitive architectures: the Tilera TILEPro64 manycore processor and a four-socket SMP machine based on AMD Opteron 6172 processors. Programmers also need task performance metrics, such as the amount of task parallelism and task memory hierarchy utilization, to analyze the performance of task-based programs. However, existing tools report performance mainly through thread-centric metrics, so programmers resort to low-level and tedious thread-centric analysis to infer task performance. We contribute tools and methods to characterize task-based OpenMP programs at the level of tasks, with which programmers can quickly understand important properties of the task graph, such as the critical path and parallelism, as well as properties of individual tasks, such as instruction count and memory behavior.
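
    To make the scheduling idea concrete, here is a minimal C sketch of locality-aware task placement, not the thesis runtime: the node_of page-to-node mapping is a round-robin stand-in (a real runtime would query the OS or hardware), and the node and worker layout is invented for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define NODES       4    /* assumed: 4 NUMA nodes / tile groups */
    #define WORKERS_PER 2    /* assumed: 2 worker threads per node  */
    #define PAGE_SHIFT  12   /* 4 KiB pages                         */

    /* Stand-in for a real page-to-node lookup: pages are assumed to
       be striped round-robin across nodes. */
    static int node_of(const void *p) {
        return (int)(((uintptr_t)p >> PAGE_SHIFT) % NODES);
    }

    /* Locality-aware placement: count the task's bytes homed on each
       node and enqueue on a worker of the node homing the most. */
    static int pick_worker(const void *data, size_t bytes) {
        size_t per_node[NODES] = {0};
        const char *p = (const char *)data;
        for (size_t off = 0; off < bytes; off += (size_t)1 << PAGE_SHIFT)
            per_node[node_of(p + off)] += (size_t)1 << PAGE_SHIFT;
        int best = 0;
        for (int n = 1; n < NODES; n++)
            if (per_node[n] > per_node[best]) best = n;
        return best * WORKERS_PER;  /* first worker on winning node */
    }

    int main(void) {
        static char buf[4 << PAGE_SHIFT];   /* spans four pages */
        printf("task -> worker %d\n", pick_worker(buf, sizeof buf));
        return 0;
    }

    A load-balancing scheduler would ignore per_node entirely; the thesis's contribution is, roughly, deciding when such locality preferences should override load balance.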

    Exploiting locality in OpenMP task scheduling

    Future multi- and many-core processors are likely to have tens of cores arranged in a tiled architecture where each tile houses a processing core and a bank of the shared last-level cache. The physical distribution of tiles on the processor die gives rise to a Distributed Shared Cache (DSC) architecture in which cache access latencies are non-uniform and depend on the physical distance between core and cache bank. To maximize cache capacity and favor design simplicity, the address space on a tiled processor is likely to be divided and mapped, either statically or dynamically, onto the distributed last-level cache such that each cache bank homes certain cache blocks. Given this architecture, an efficient OpenMP 3.0 task scheduler can minimize miss latencies by scheduling tasks on tiles which are physically closer to the cache banks that home task-relevant data. This master thesis work deals with the design and implementation of a locality-aware, user-level runtime OpenMP 3.0 task scheduler for a simulated tiled multicore architecture. Guided by programmer hints, the scheduler extracts locality information about the data referenced by a task and schedules the task on the core closest to the L2 slice homing the largest amount of that data. Initial performance comparisons against a work-first randomized work-stealing Cilk-like scheduler and a breadth-first randomized work-stealing scheduler have revealed problems with the locality-aware scheduler and motivate deeper exploration of programmer-driven locality characterization and feedback-based extraction of locality information.
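
    The placement rule lends itself to a small sketch. The following C fragment assumes an 8x8 mesh and a line-interleaved home mapping, both invented for illustration (the TILEPro64's actual home hash differs); among idle cores it picks the one with minimum Manhattan distance to the slice homing a task's dominant data block.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MESH 8              /* assumed 8x8 grid of tiles */
    #define LINE 64             /* 64-byte cache lines       */

    typedef struct { int x, y; } tile_t;

    /* Assumed home mapping: cache lines interleaved over all tiles. */
    static tile_t home_slice(const void *p) {
        unsigned idx = (unsigned)(((uintptr_t)p / LINE) % (MESH * MESH));
        return (tile_t){ idx % MESH, idx / MESH };
    }

    static int manhattan(tile_t a, tile_t b) {
        return abs(a.x - b.x) + abs(a.y - b.y);
    }

    /* Among idle cores, choose the one closest to the slice homing
       the task's dominant data block (one block, for brevity). */
    static tile_t place_task(const void *data, const tile_t *idle, int n) {
        tile_t home = home_slice(data);
        tile_t best = idle[0];
        for (int i = 1; i < n; i++)
            if (manhattan(idle[i], home) < manhattan(best, home))
                best = idle[i];
        return best;
    }

    int main(void) {
        static int block[1024];
        tile_t idle[] = { {0, 0}, {3, 4}, {7, 7} };
        tile_t t = place_task(block, idle, 3);
        printf("run task on tile (%d,%d)\n", t.x, t.y);
        return 0;
    }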

    Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis

    The combination of high-performance parallel programming and multi-core processors is the dominant approach to meet the ever-increasing demand for computing performance today. The thesis is centered around OpenMP, a popular parallel programming API standard that enables programmers to quickly get started with writing parallel programs. In contrast to this quick start, however, writing high-performance OpenMP programs requires high effort and saps productivity. Part of the reason for impeded productivity is OpenMP's lack of abstractions and guidance for exploiting the strong architectural locality exhibited by NUMA systems and manycore processors. The thesis contributes data distribution abstractions that enable programmers to distribute data portably on NUMA systems and manycore processors without being aware of low-level system topology details. The data distribution abstractions are supported by the runtime system and leveraged by the second contribution of the thesis: an architecture-specific, locality-aware scheduling policy that reduces the data access latencies incurred by tasks, allowing programmers to obtain, with minimal effort, up to 69% better performance for scientific programs compared to state-of-the-art work-stealing scheduling. Another reason for reduced programmer productivity is the poor support in OpenMP performance analysis tools for visualizing, understanding, and resolving problems at the level of grains, i.e., task and parallel for-loop chunk instances. The thesis contributes a cost-effective and automatic method to extensively profile and visualize grains. Grain properties and hardware performance are profiled at event notifications from the runtime system with less than 2.5% overhead and visualized using a new method called the grain graph. The grain graph shows the program structure that unfolded during execution and highlights problems such as low parallelism, work inflation, and poor parallelization benefit directly at the grain level, with precise links to problem areas in source code. The thesis demonstrates that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing tools, in standard programs from SPEC OMP 2012, Parsec 3.0, and the Barcelona OpenMP Tasks Suite (BOTS). Grain profiles are also applied to study the input sensitivity and similarity of BOTS programs. All thesis contributions are assembled into an iterative performance analysis and optimization workflow that enables programmers to achieve desired performance systematically and more quickly than is possible with existing tools. This reduces pressure on experts and removes the need for tedious trial-and-error tuning, simplifying OpenMP performance analysis.
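
    The distribution abstractions hide mechanisms like the one sketched below. First-touch page placement is a standard NUMA mechanism, shown here for illustration rather than as the thesis's specific abstraction: pages land on the node of the thread that first writes them, so a parallel initialization with the same static schedule as later compute loops yields mostly node-local accesses.

    #include <omp.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 24;
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;

        /* Parallel first touch: each thread faults in its own chunk,
           distributing pages across the threads' NUMA nodes. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        /* Later parallel work with the same static schedule then
           finds its data mostly node-local. */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += a[i];

        free(a);
        return (int)sum;
    }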

    Data for paper "Characterizing Task-based OpenMP Programs"

    This is the encrypted archive of data for the paper "Characterizing Task-based OpenMP Programs". Use the password J%mcrJZRWV to decrypt the archive. See the README file inside the archive for more details.

    Diagnosing Highly-Parallel OpenMP Programs with Aggregated Grain Graphs

    Grain graphs simplify OpenMP performance analysis by visualizing performance problems from a fork-join perspective that is familiar to programmers. However, when programmers decide to expose a high amount of parallelism by creating thousands of task and parallel for-loop chunk instances, the resulting grain graph becomes large and tedious to understand. We present an aggregation method that hierarchically groups related nodes together, reducing grain graphs of any size to a single node. The aggregated graph is then navigated by progressively uncovering groups and following visual clues that guide programmers towards problems while hiding non-problematic regions. Our approach enhances productivity by enabling programmers to understand problems in highly-parallel OpenMP programs with less effort than before.
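
    One way to picture hierarchical aggregation is the bottom-up fold sketched below; the grain representation and the summarized metrics are assumptions for illustration, not the paper's actual algorithm. Fork-join nesting gives a tree, and a group node summarizes its subtree so a viewer can show one node and expand it on demand.

    #include <stdio.h>

    typedef struct grain {
        const char   *name;
        double        work_us;   /* time spent in this grain        */
        int           problem;   /* e.g. flagged for work inflation */
        struct grain *child, *sibling;
    } grain_t;

    typedef struct { double work_us; int problems, grains; } summary_t;

    /* Bottom-up aggregation: a parent's summary folds in all its
       children, so a collapsed group still shows totals and the
       number of flagged grains hidden inside it. */
    static summary_t aggregate(const grain_t *g) {
        summary_t s = { g->work_us, g->problem != 0, 1 };
        for (const grain_t *c = g->child; c; c = c->sibling) {
            summary_t cs = aggregate(c);
            s.work_us  += cs.work_us;
            s.problems += cs.problems;
            s.grains   += cs.grains;
        }
        return s;
    }

    int main(void) {
        grain_t leaf1 = { "t1", 120.0, 0, NULL, NULL };
        grain_t leaf2 = { "t2", 950.0, 1, NULL, &leaf1 };
        grain_t root  = { "main", 40.0, 0, &leaf2, NULL };
        summary_t s = aggregate(&root);
        printf("%d grains, %.0f us, %d flagged\n",
               s.grains, s.work_us, s.problems);
        return 0;
    }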

    Extending OMPT to Support Grain Graphs

    The upcoming profiling API standard OMPT can describe almost all profiling events required to construct grain graphs, a recent visualization that simplifies OpenMP performance analysis. We propose OMPT extensions that provide the missing descriptions of task creation and parallel for-loop chunk scheduling events, making OMPT a sufficient, standard source for grain graphs. Our extensions adhere to OMPT design objectives and incur low overhead for BOTS (up to 2%) and SPEC OMP2012 (1%) programs. Although motivated by grain graphs, the events described by the extensions are general and can enable cost-effective, precise measurements in other profiling tools as well.
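
    For orientation, here is a minimal OMPT tool skeleton using the interface as later standardized in OpenMP 5.0 (which postdates this paper); it registers the task-creation callback, the kind of event the proposed extensions enrich. It assumes a 5.0-conforming runtime providing omp-tools.h; the chunk-scheduling extension itself is not shown.

    #include <omp-tools.h>
    #include <stdio.h>

    /* Tool callback: fires once per task created; the return address
       (codeptr_ra) locates the task construct in the source. */
    static void on_task_create(ompt_data_t *encountering_task_data,
                               const ompt_frame_t *encountering_task_frame,
                               ompt_data_t *new_task_data,
                               int flags, int has_dependences,
                               const void *codeptr_ra) {
        fprintf(stderr, "task created at %p (flags %d)\n",
                codeptr_ra, flags);
    }

    static int tool_init(ompt_function_lookup_t lookup,
                         int initial_device_num, ompt_data_t *tool_data) {
        ompt_set_callback_t set_cb =
            (ompt_set_callback_t)lookup("ompt_set_callback");
        set_cb(ompt_callback_task_create, (ompt_callback_t)on_task_create);
        return 1;   /* keep the tool active */
    }

    static void tool_fini(ompt_data_t *tool_data) {}

    /* The runtime looks this symbol up at startup to activate the tool. */
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version) {
        static ompt_start_tool_result_t result = { tool_init, tool_fini, {0} };
        return &result;
    }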

    A Locality Approach to Architecture-aware Task-scheduling in OpenMP

    Multicore and other parallel computer systems increasingly expose architectural aspects such as memory access latencies that differ depending on the physical memory address and location. To achieve high performance, programmers need to take these non-uniformities into consideration, which not only complicates the programming process but also leads to code that is not performance-portable between different architectures. Task-centric programming models, such as OpenMP tasks, relieve the programmer from explicitly mapping computation onto threads while still enabling effective resource management. We propose a task scheduling approach that uses programmer annotations and architecture awareness to identify the location of the data regions operated upon by an OpenMP task. We have made an initial implementation of such a locality-aware OpenMP task scheduler for the Tilera TILEPro64 architecture and provide initial results showing its effectiveness in minimizing non-uniform access latencies to data and resources.
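
    The annotation side might look like the following sketch; TASK_AFFINITY and note_affinity are hypothetical names, not the paper's syntax, and the runtime half is reduced to logging the hinted region.

    #include <omp.h>
    #include <stdio.h>

    /* Stand-in for the runtime side: record (base, len) so the
       scheduler can later place the task near that region. */
    static void note_affinity(void *base, size_t len) {
        fprintf(stderr, "hint: task touches %p + %zu bytes\n", base, len);
    }

    #define TASK_AFFINITY(ptr, len) note_affinity((ptr), (len))

    static void work(double *row, int n) {
        for (int i = 0; i < n; i++) row[i] *= 2.0;
    }

    int main(void) {
        enum { N = 512 };
        static double m[N][N];
        #pragma omp parallel
        #pragma omp single
        for (int r = 0; r < N; r++) {
            TASK_AFFINITY(m[r], sizeof m[r]);  /* hint precedes task */
            #pragma omp task firstprivate(r)
            work(m[r], N);
        }
        return 0;
    }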

    Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

    Performance degradation due to non-uniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of non-uniform latencies. However, techniques for distributing data are error-prone, fragile, and require low-level architectural knowledge. Existing task scheduling policies favor quick load balancing at the expense of locality and ignore NUMA node and manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from thinking about NUMA system and manycore processor architecture details by delegating data distribution to the runtime system, and it uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and we find that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies, while remaining architecture-oblivious from the programmer's perspective.
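
    The dependence information that guides scheduling is expressible with the standard OpenMP 4.0 depend clause, as in the block-wise sketch below; the block size and the scaling kernel are arbitrary, and how the paper's runtime consumes the ranges is described in the paper, not reproduced here.

    #include <omp.h>

    #define N  (1 << 20)
    #define BS (1 << 16)

    static double a[N], b[N];

    int main(void) {
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < N; i += BS) {
            /* The in/out array sections name the exact ranges this
               task reads and writes; a locality-aware runtime can
               place the task near the node homing a[i:BS]. */
            #pragma omp task depend(in: a[i:BS]) depend(out: b[i:BS]) firstprivate(i)
            for (int j = i; j < i + BS; j++)
                b[j] = 2.0 * a[j];
        }
        return 0;
    }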

    Characterizing task-based OpenMP programs.

    Programmers struggle to understand the performance of task-based OpenMP programs because profiling tools only report thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing the exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. Using our method, programmers can tune performance faster and understand performance trade-offs more effectively than with existing tools.
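
    The shift from thread-centric to task-centric measurement can be illustrated with a toy sketch (this is not the paper's tooling): timestamps wrap each task body so time is attributed to task instances rather than to threads.

    #include <omp.h>
    #include <stdio.h>

    #define NTASKS 16

    static double task_time[NTASKS];   /* one slot per task instance */

    /* Dummy workload whose cost grows with the task id. */
    static double body(int id) {
        volatile double x = 0.0;
        for (int i = 0; i < 100000 * (id + 1); i++) x += i;
        return x;
    }

    int main(void) {
        #pragma omp parallel
        #pragma omp single
        for (int t = 0; t < NTASKS; t++) {
            #pragma omp task firstprivate(t)
            {
                double t0 = omp_get_wtime();
                body(t);
                task_time[t] = omp_get_wtime() - t0;  /* per-task metric */
            }
        }   /* implicit barrier: all tasks complete here */
        for (int t = 0; t < NTASKS; t++)
            printf("task %2d: %.3f ms\n", t, task_time[t] * 1e3);
        return 0;
    }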