    Locality-aware Scheduling and Characterization of Task-based Programs

    Modern computer architectures expose an increasing number of parallel features supported by complex memory access and communication structures. Current task scheduling techniques perform poorly on such architectures because they focus solely on balancing computational load across parallel features and remain oblivious to the locality properties of the supporting structures. We contribute locality-aware task scheduling mechanisms that improve execution time on average by 44% and 11% respectively on two locality-sensitive architectures: the Tilera TILEPro64 manycore processor and a four-socket SMP machine based on AMD Opteron 6172 processors. Programmers also need task performance metrics, such as the amount of task parallelism and task memory hierarchy utilization, to analyze the performance of task-based programs. However, existing tools report performance mainly through thread-centric metrics, so programmers resort to low-level and tedious thread-centric analysis to infer task performance. We contribute tools and methods to characterize task-based OpenMP programs at the level of tasks, with which programmers can quickly understand important properties of the task graph, such as the critical path and parallelism, as well as properties of individual tasks, such as instruction count and memory behavior.
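
    To make the scheduling idea concrete, here is a minimal C sketch of locality-aware task placement, not the thesis runtime: the node_of page-to-node mapping is a round-robin stand-in (a real runtime would query the OS or hardware), and the node and worker layout is invented for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define NODES       4    /* assumed: 4 NUMA nodes / tile groups */
    #define WORKERS_PER 2    /* assumed: 2 worker threads per node  */
    #define PAGE_SHIFT  12   /* 4 KiB pages                         */

    /* Stand-in for a real page-to-node lookup: pages are assumed to
       be striped round-robin across nodes. */
    static int node_of(const void *p) {
        return (int)(((uintptr_t)p >> PAGE_SHIFT) % NODES);
    }

    /* Locality-aware placement: count the task's bytes homed on each
       node and enqueue on a worker of the node homing the most. */
    static int pick_worker(const void *data, size_t bytes) {
        size_t per_node[NODES] = {0};
        const char *p = (const char *)data;
        for (size_t off = 0; off < bytes; off += (size_t)1 << PAGE_SHIFT)
            per_node[node_of(p + off)] += (size_t)1 << PAGE_SHIFT;
        int best = 0;
        for (int n = 1; n < NODES; n++)
            if (per_node[n] > per_node[best]) best = n;
        return best * WORKERS_PER;  /* first worker on winning node */
    }

    int main(void) {
        static char buf[4 << PAGE_SHIFT];   /* spans four pages */
        printf("task -> worker %d\n", pick_worker(buf, sizeof buf));
        return 0;
    }

    A load-balancing scheduler would ignore per_node entirely; the thesis's contribution is, roughly, deciding when such locality preferences should override load balance.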

    Exploiting locality in OpenMP task scheduling

    Future multi- and many-core processors are likely to have tens of cores arranged in a tiled architecture where each tile houses a processing core and a bank of the shared last-level cache. The physical distribution of tiles on the processor die gives rise to a Distributed Shared Cache (DSC) architecture in which cache access latencies are non-uniform and depend on the physical distance between core and cache bank. To maximize cache capacity and favor design simplicity, the address space on a tiled processor is likely to be divided and mapped, either statically or dynamically, onto the distributed last-level cache such that each cache bank homes certain cache blocks. Given this architecture, an efficient OpenMP 3.0 task scheduler can minimize miss latencies by scheduling tasks on tiles which are physically closer to the cache banks that home task-relevant data. This master thesis work deals with the design and implementation of a locality-aware, user-level runtime OpenMP 3.0 task scheduler for a simulated tiled multicore architecture. Guided by programmer hints, the scheduler extracts locality information about the data referenced by a task and schedules the task on the core closest to the L2 slice homing the largest amount of that data. Initial performance comparisons against a work-first randomized work-stealing Cilk-like scheduler and a breadth-first randomized work-stealing scheduler have revealed problems with the locality-aware scheduler and motivate deeper exploration of programmer-driven locality characterization and feedback-based extraction of locality information.
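
    The placement rule lends itself to a small sketch. The following C fragment assumes an 8x8 mesh and a line-interleaved home mapping, both invented for illustration (the TILEPro64's actual home hash differs); among idle cores it picks the one with minimum Manhattan distance to the slice homing a task's dominant data block.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MESH 8              /* assumed 8x8 grid of tiles */
    #define LINE 64             /* 64-byte cache lines       */

    typedef struct { int x, y; } tile_t;

    /* Assumed home mapping: cache lines interleaved over all tiles. */
    static tile_t home_slice(const void *p) {
        unsigned idx = (unsigned)(((uintptr_t)p / LINE) % (MESH * MESH));
        return (tile_t){ idx % MESH, idx / MESH };
    }

    static int manhattan(tile_t a, tile_t b) {
        return abs(a.x - b.x) + abs(a.y - b.y);
    }

    /* Among idle cores, choose the one closest to the slice homing
       the task's dominant data block (one block, for brevity). */
    static tile_t place_task(const void *data, const tile_t *idle, int n) {
        tile_t home = home_slice(data);
        tile_t best = idle[0];
        for (int i = 1; i < n; i++)
            if (manhattan(idle[i], home) < manhattan(best, home))
                best = idle[i];
        return best;
    }

    int main(void) {
        static int block[1024];
        tile_t idle[] = { {0, 0}, {3, 4}, {7, 7} };
        tile_t t = place_task(block, idle, 3);
        printf("run task on tile (%d,%d)\n", t.x, t.y);
        return 0;
    }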

    Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis

    The combination of high-performance parallel programming and multi-core processors is the dominant approach to meet the ever-increasing demand for computing performance today. The thesis is centered around OpenMP, a popular parallel programming API standard that enables programmers to quickly get started with writing parallel programs. In contrast to this quick start, however, writing high-performance OpenMP programs requires high effort and saps productivity. Part of the reason for impeded productivity is OpenMP's lack of abstractions and guidance for exploiting the strong architectural locality exhibited by NUMA systems and manycore processors. The thesis contributes data distribution abstractions that enable programmers to distribute data portably on NUMA systems and manycore processors without being aware of low-level system topology details. The data distribution abstractions are supported by the runtime system and leveraged by the second contribution of the thesis: an architecture-specific, locality-aware scheduling policy that reduces the data access latencies incurred by tasks, allowing programmers to obtain, with minimal effort, up to 69% better performance for scientific programs compared to state-of-the-art work-stealing scheduling. Another reason for reduced programmer productivity is the poor support in OpenMP performance analysis tools for visualizing, understanding, and resolving problems at the level of grains, i.e., task and parallel for-loop chunk instances. The thesis contributes a cost-effective and automatic method to extensively profile and visualize grains. Grain properties and hardware performance are profiled at event notifications from the runtime system with less than 2.5% overhead and visualized using a new method called the grain graph. The grain graph shows the program structure that unfolded during execution and highlights problems such as low parallelism, work inflation, and poor parallelization benefit directly at the grain level, with precise links to problem areas in source code. The thesis demonstrates that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing tools, in standard programs from SPEC OMP 2012, Parsec 3.0, and the Barcelona OpenMP Tasks Suite (BOTS). Grain profiles are also applied to study the input sensitivity and similarity of BOTS programs. All thesis contributions are assembled into an iterative performance analysis and optimization workflow that enables programmers to achieve desired performance systematically and more quickly than is possible with existing tools. This reduces pressure on experts and removes the need for tedious trial-and-error tuning, simplifying OpenMP performance analysis.
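
    The distribution abstractions hide mechanisms like the one sketched below. First-touch page placement is a standard NUMA mechanism, shown here for illustration rather than as the thesis's specific abstraction: pages land on the node of the thread that first writes them, so a parallel initialization with the same static schedule as later compute loops yields mostly node-local accesses.

    #include <omp.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 24;
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;

        /* Parallel first touch: each thread faults in its own chunk,
           distributing pages across the threads' NUMA nodes. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        /* Later parallel work with the same static schedule then
           finds its data mostly node-local. */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += a[i];

        free(a);
        return (int)sum;
    }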

    Data for paper "Characterizing Task-based OpenMP Programs"

    This is the encrypted archive of data for the paper "Characterizing Task-based OpenMP Programs". Use the password J%mcrJZRWV to decrypt the archive. See the README file inside the archive for more details.

    Diagnosing Highly-Parallel OpenMP Programs with Aggregated Grain Graphs

    Grain graphs simplify OpenMP performance analysis by visualizing performance problems from a fork-join perspective that is familiar to programmers. However, when programmers decide to expose a high amount of parallelism by creating thousands of task and parallel for-loop chunk instances, the resulting grain graph becomes large and tedious to understand. We present an aggregation method that hierarchically groups related nodes together, reducing grain graphs of any size to a single node. The aggregated graph is then navigated by progressively uncovering groups and following visual clues that guide programmers towards problems while hiding non-problematic regions. Our approach enhances productivity by enabling programmers to understand problems in highly-parallel OpenMP programs with less effort than before.
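
    One way to picture hierarchical aggregation is the bottom-up fold sketched below; the grain representation and the summarized metrics are assumptions for illustration, not the paper's actual algorithm. Fork-join nesting gives a tree, and a group node summarizes its subtree so a viewer can show one node and expand it on demand.

    #include <stdio.h>

    typedef struct grain {
        const char   *name;
        double        work_us;   /* time spent in this grain        */
        int           problem;   /* e.g. flagged for work inflation */
        struct grain *child, *sibling;
    } grain_t;

    typedef struct { double work_us; int problems, grains; } summary_t;

    /* Bottom-up aggregation: a parent's summary folds in all its
       children, so a collapsed group still shows totals and the
       number of flagged grains hidden inside it. */
    static summary_t aggregate(const grain_t *g) {
        summary_t s = { g->work_us, g->problem != 0, 1 };
        for (const grain_t *c = g->child; c; c = c->sibling) {
            summary_t cs = aggregate(c);
            s.work_us  += cs.work_us;
            s.problems += cs.problems;
            s.grains   += cs.grains;
        }
        return s;
    }

    int main(void) {
        grain_t leaf1 = { "t1", 120.0, 0, NULL, NULL };
        grain_t leaf2 = { "t2", 950.0, 1, NULL, &leaf1 };
        grain_t root  = { "main", 40.0, 0, &leaf2, NULL };
        summary_t s = aggregate(&root);
        printf("%d grains, %.0f us, %d flagged\n",
               s.grains, s.work_us, s.problems);
        return 0;
    }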

    Extending OMPT to Support Grain Graphs

    The upcoming profiling API standard OMPT can describe almost all profiling events required to construct grain graphs, a recent visualization that simplifies OpenMP performance analysis. We propose OMPT extensions that provide the missing descriptions of task creation and parallel for-loop chunk scheduling events, making OMPT a sufficient, standard source for grain graphs. Our extensions adhere to OMPT design objectives and incur low overhead for BOTS (up to 2%) and SPEC OMP2012 (1%) programs. Although motivated by grain graphs, the events described by the extensions are general and can enable cost-effective, precise measurements in other profiling tools as well.
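
    For orientation, here is a minimal OMPT tool skeleton using the interface as later standardized in OpenMP 5.0 (which postdates this paper); it registers the task-creation callback, the kind of event the proposed extensions enrich. It assumes a 5.0-conforming runtime providing omp-tools.h; the chunk-scheduling extension itself is not shown.

    #include <omp-tools.h>
    #include <stdio.h>

    /* Tool callback: fires once per task created; the return address
       (codeptr_ra) locates the task construct in the source. */
    static void on_task_create(ompt_data_t *encountering_task_data,
                               const ompt_frame_t *encountering_task_frame,
                               ompt_data_t *new_task_data,
                               int flags, int has_dependences,
                               const void *codeptr_ra) {
        fprintf(stderr, "task created at %p (flags %d)\n",
                codeptr_ra, flags);
    }

    static int tool_init(ompt_function_lookup_t lookup,
                         int initial_device_num, ompt_data_t *tool_data) {
        ompt_set_callback_t set_cb =
            (ompt_set_callback_t)lookup("ompt_set_callback");
        set_cb(ompt_callback_task_create, (ompt_callback_t)on_task_create);
        return 1;   /* keep the tool active */
    }

    static void tool_fini(ompt_data_t *tool_data) {}

    /* The runtime looks this symbol up at startup to activate the tool. */
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version) {
        static ompt_start_tool_result_t result = { tool_init, tool_fini, {0} };
        return &result;
    }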

    A Locality Approach to Architecture-aware Task-scheduling in OpenMP

    Multicore and other parallel computer systems increasingly expose architectural aspects such as memory access latencies that differ depending on the physical memory address and location. To achieve high performance, programmers need to take these non-uniformities into consideration, which not only complicates the programming process but also leads to code that is not performance-portable between different architectures. Task-centric programming models, such as OpenMP tasks, relieve the programmer from explicitly mapping computation onto threads while still enabling effective resource management. We propose a task scheduling approach that uses programmer annotations and architecture awareness to identify the location of the data regions operated upon by an OpenMP task. We have made an initial implementation of such a locality-aware OpenMP task scheduler for the Tilera TILEPro64 architecture and provide initial results showing its effectiveness in minimizing non-uniform access latencies to data and resources.
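
    The annotation side might look like the following sketch; TASK_AFFINITY and note_affinity are hypothetical names, not the paper's syntax, and the runtime half is reduced to logging the hinted region.

    #include <omp.h>
    #include <stdio.h>

    /* Stand-in for the runtime side: record (base, len) so the
       scheduler can later place the task near that region. */
    static void note_affinity(void *base, size_t len) {
        fprintf(stderr, "hint: task touches %p + %zu bytes\n", base, len);
    }

    #define TASK_AFFINITY(ptr, len) note_affinity((ptr), (len))

    static void work(double *row, int n) {
        for (int i = 0; i < n; i++) row[i] *= 2.0;
    }

    int main(void) {
        enum { N = 512 };
        static double m[N][N];
        #pragma omp parallel
        #pragma omp single
        for (int r = 0; r < N; r++) {
            TASK_AFFINITY(m[r], sizeof m[r]);  /* hint precedes task */
            #pragma omp task firstprivate(r)
            work(m[r], N);
        }
        return 0;
    }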

    Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

    Performance degradation due to non-uniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of non-uniform latencies. However, techniques for distributing data are error-prone, fragile, and require low-level architectural knowledge. Existing task scheduling policies favor quick load balancing at the expense of locality and ignore NUMA node and manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from thinking about NUMA system and manycore processor architecture details by delegating data distribution to the runtime system, and it uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and we find that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies, while remaining architecture-oblivious from the programmer's perspective.
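
    The dependence information that guides scheduling is expressible with the standard OpenMP 4.0 depend clause, as in the block-wise sketch below; the block size and the scaling kernel are arbitrary, and how the paper's runtime consumes the ranges is described in the paper, not reproduced here.

    #include <omp.h>

    #define N  (1 << 20)
    #define BS (1 << 16)

    static double a[N], b[N];

    int main(void) {
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < N; i += BS) {
            /* The in/out array sections name the exact ranges this
               task reads and writes; a locality-aware runtime can
               place the task near the node homing a[i:BS]. */
            #pragma omp task depend(in: a[i:BS]) depend(out: b[i:BS]) firstprivate(i)
            for (int j = i; j < i + BS; j++)
                b[j] = 2.0 * a[j];
        }
        return 0;
    }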

    Characterizing task-based OpenMP programs.

    Programmers struggle to understand the performance of task-based OpenMP programs because profiling tools only report thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing the exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. Using our method, programmers can tune performance faster and understand performance trade-offs more effectively than with existing tools.
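
    The shift from thread-centric to task-centric measurement can be illustrated with a toy sketch (this is not the paper's tooling): timestamps wrap each task body so time is attributed to task instances rather than to threads.

    #include <omp.h>
    #include <stdio.h>

    #define NTASKS 16

    static double task_time[NTASKS];   /* one slot per task instance */

    /* Dummy workload whose cost grows with the task id. */
    static double body(int id) {
        volatile double x = 0.0;
        for (int i = 0; i < 100000 * (id + 1); i++) x += i;
        return x;
    }

    int main(void) {
        #pragma omp parallel
        #pragma omp single
        for (int t = 0; t < NTASKS; t++) {
            #pragma omp task firstprivate(t)
            {
                double t0 = omp_get_wtime();
                body(t);
                task_time[t] = omp_get_wtime() - t0;  /* per-task metric */
            }
        }   /* implicit barrier: all tasks complete here */
        for (int t = 0; t < NTASKS; t++)
            printf("task %2d: %.3f ms\n", t, task_time[t] * 1e3);
        return 0;
    }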