5 research outputs found
A Control-Theoretic Methodology for Adaptive Structured Parallel Computations
Adaptivity for distributed parallel applications is an essential feature whose impor- tance has been assessed in many research fields (e.g. scientific computations, large- scale real-time simulation systems and emergency management applications). Especially for high-performance computing, this feature is of special interest in order to properly and promptly respond to time-varying QoS requirements, to react to uncontrollable environ- mental effects influencing the underlying execution platform and to efficiently deal with highly irregular parallel problems. In this scenario the Structured Parallel Programming paradigm is a cornerstone for expressing adaptive parallel programs: the high-degree of composability of parallelization schemes, their QoS predictability formally expressed by performance models, are basic tools in order to introduce dynamic reconfiguration processes of adaptive applications. These reconfigurations are not only limited to imple- mentation aspects (e.g. parallelism degree modifications), but also parallel versions with different structures can be expressed for the same computation, featuring different levels of performance, memory utilization, energy consumption, and exploitation of the memory hierarchies.
Over the last decade several programming models and research frameworks have been developed aimed at the definition of tools and strategies for expressing adaptive parallel applications. Notwithstanding this notable research effort, properties like the optimal- ity of the application execution and the stability of control decisions are not sufficiently studied in the existing work. For this reason this thesis exploits a pioneer research in the context of providing formal theoretical tools founded on Control Theory and Game Theory techniques. Based on these approaches, we introduce a formal model for control- ling distributed parallel applications represented by computational graphs of structured parallelism schemes (also called skeleton-based parallelism).
Starting out from the performance predictability of structured parallelism schemes, in this thesis we provide a formalization of the concept of adaptive parallel module per- forming structured parallel computations. The module behavior is described in terms of a Hybrid System abstraction and reconfigurations are driven by a Predictive Control ap- proach. Experimental results show the effectiveness of this work, in terms of execution cost reduction as well as the stability degree of a system reconfiguration: i.e. how long a
reconfiguration choice is useful for targeting the required QoS levels.
This thesis also faces with the issue of controlling large-scale distributed applications composed of several interacting adaptive components. After a panoramic view of the existing control-theoretic approaches (e.g. based on decentralized, distributed or hierar- chical structures of controllers), we introduce a methodology for the distributed predictive control. For controlling computational graphs, the overall control problem consists in a set of coupled control sub-problems for each application module. The decomposition is- sue has a twofold nature: first of all we need to model the coupling relationships between control sub-problems, furthermore we need to introduce proper notions of negotiation and convergence in the control decisions collectively taken by the parallel modules of the application graph. This thesis provides a formalization through basic concepts of Non-cooperative Games and Cooperative Optimization. In the notable context of the dis- tributed control of performance and resource utilization, we exploit a formal description of the control problem providing results for equilibrium point existence and the compari- son of the control optimality with different adaptation strategies and interaction protocols. Discussions and a first validation of the proposed techniques are exploited through exper-
iments performed in a simulation environment
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse -grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton -based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton - driven
iii
performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform
Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %,
over a baseline version for a 16 -core UMA platform and up to 162 %, with an average
of 91 %, for a 32 -core NUMA platform
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi-core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse-grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.
This thesis first combines the skeleton-based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.
Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton-based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton-driven
performance optimizations that exploits the application pattern and system information.
This thesis thus also presents a set of pattern-oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these pattern-oriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2%, on average, of a static oracle for a 16-core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32-core NUMA (Non-Uniform
Memory Access) platform.
Finally, this thesis also investigates skeleton-driven system-oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system-oriented optimizations can achieve performance improvements of up to 88%, with an average of 46%,
over a baseline version for a 16-core UMA platform and up to 162%, with an average
of 91%, for a 32-core NUMA platform
Computer Game Innovation
Faculty of Technical Physics, Information Technology and Applied Mathematics. Institute of Information TechnologyWydział Fizyki Technicznej, Informatyki i Matematyki Stosowanej. Instytut InformatykiThe "Computer Game Innovations" series is an international forum
designed to enable the exchange of knowledge and expertise in the
field of video game development. Comprising both academic research
and industrial needs, the series aims at advancing innovative
industry-academia collaboration. The monograph provides a unique set
of articles presenting original research conducted in the leading
academic centres which specialise in video games education. The goal
of the publication is, among others, to enhance networking
opportunities for industry and university representatives seeking to
form R&D partnerships. This publication covers the key focus areas
specified in the GAMEINN sectoral programme supported by the
National Centre for Research and Development
Parallelization of an Iterative Placement Algorithm using ParMod-C
. This paper describes the parallelization of a deterministic iterative method for the placement of standard cell circuits in VLSI design. The programming environment is the implementation of the parallel programming language ParMod-C for workstation clusters. The essential steps for transforming the existing sequential program to a parallel program are shown. An emphasis is laid on the discussion of the usefulness of individual language concepts. The efficiency of the application is investigated by analyzing the speedup and the quality of computed solutions of the parallel program. 1 Introduction In this paper we describe the parallelization of DOMINO for workstation clusters. DOMINO [DJS92, DJA94] is a deterministic iterative method for the placement of standard cell circuits in VLSI design. The goal of the parallelization is to transform the existing sequential program into an efficient parallel program with the objective of reusing as much sequential code as possible. In contrast ..