Computational partitioning for heterogeneous systems by Spacey, Simon A.
Computational Partitioning for
Heterogeneous Systems
Imperial College, London
Simon A. Spacey
Submitted in part fulfilment of the requirements for the degree of Doctor of
Philosophy in Computer Science and the Diploma of Imperial College at the
Department of Computing.

Abstract
The high-level goal of this research is to identify a method to accelerate general
purpose software using heterogeneous systems. Heterogeneous systems offer the
potential of improving performance through the use of hardware components
specialised for a particular task. The issue considered in this research is how best
to map parts of an existing piece of software to components in a heterogeneous
architecture to deliver the optimal program execution time.
The heterogeneous partitioning problem is an NP-hard quadratic optimisation
problem. However, before the problem could be addressed using optimisation
techniques, it had to be quantified and formalised using software
characterisation and computational abstraction models.
The main contributions delivered by this research are:
• 3S: a novel software characterisation framework that combines static
instrumentation with dynamic characterisation enabling simple tools to measure
real fine-grained execution timing, control and data flow information for
any compilable program.
• MAP: a new execution model and multi-level parallel assignment approach
that delivers up to 64 times the heterogeneous acceleration potential of
previous work for the benchmarks considered.
• MIP: the first formal work on multiple instantiation under activation
sequence uncertainty, providing more than twice the acceleration potential
of previous optimal assignment methods for the software considered and
capable of being integrated with MAP to deliver significant gains for
heterogeneous partitioning.
These contributions have been published in a collection of research papers and
are the focus of Chapters 3, 4 and 5 of this thesis.
Acknowledgements
All the concepts, implementations and results in this thesis are my own work.
However, this work would not have been possible without the help of others.
Professor W. Luk was my main supervisor and I can honestly say that I am
very fortunate to have been given the chance to work with him. Professor Luk’s
guidance was at a high enough level to allow me the flexibility I needed to make
this research project a success and detailed enough when required to provide
specific references so that I could classify my own ideas in the wider picture.
It was through Professor Luk’s professional guidance that I created the WOA
execution model and the MAP parallel partitioning paradigm which tie together
my work in its wider context.
My second and third supervisors were Professor P.H.J. Kelly and Dr D. Kuhn.
I had the pleasure of working closely with Professor Kelly at the beginning of my
Ph.D. when I created the 3S Software Characterisation Framework and Professor
Kelly’s help in providing references to allow me to formalise the contributions
provided by 3S was invaluable. 3S is the foundation on which everything else is
built.
I worked closely with Dr Kuhn in the last stage of my Ph.D. research. At the
point I started working with Dr Kuhn, I had a heuristic approach to assignment.
At the point I finished working with Dr Kuhn I understood LPs, ILPs, MILPs and
B&B, had written a paper on CPLEX and had turned my heuristic into a pivotal piece
of work on optimisation under uncertainty that delivered twice the performance
of previous approaches which had been the best available for half a century.
During my Ph.D. I benefited from the co-operative working environment at
Imperial and I extend my thanks to the following list of people: W. Wiesemann,
Dr O. Mencer, Dr D. Thomas, Dr T. Todman, Dr K.H. Tsoi and Dr Y.M. Lam
as well as others in the Imperial Computer Science Department. Imperial is a
great university full of great people.
Finally, nothing happens without funding and some would say that those who
fund are those with the highest vision of all. Consequently it is with great thanks
that I acknowledge the financial support of the UK EPSRC without whose vision
none of this work would have been possible.
Contents
1 Introduction
1.1 Project Goal
1.2 Project Scope
1.3 Contributions
2 Background and Related Work
2.1 Introduction
2.2 Program Characterisation
2.3 Heterogeneous Computing
2.4 Mathematical Assignment Abstractions
2.5 Summary
3 Program Characterisation
3.1 Introduction
3.2 Methodology Overview
3.3 Solutions to Instrumentation Issues
3.4 Comprehensive Characterisation Tools
3.5 Summary
4 Partitioning with Certainty
4.1 Introduction
4.2 The Write-Only Architecture
4.3 WOA Timing Equations
4.4 Sequential Partitioning
4.5 Parallel Partitioning
4.6 Summary
5 Partitioning with Uncertainty
5.1 Introduction
5.2 Multiple Instantiation Partitioning
5.3 Formal Model
5.4 Results
5.5 Discussion
5.6 Summary
6 Conclusions and Future Work
6.1 Introduction
6.2 Contributions
6.3 Future Work
Glossary
Bibliography
A 3S Technical Introduction
B 3S Technical Paper
C Concise CPLEX Technical Paper
D Heuristic Assignment Workshop Paper
E Multi-level Assignment Journal Paper
F Robust Optimisation Journal Paper
List of Figures
1.1 Example heterogeneous system
2.1 Example heterogeneous system
2.2 Control flow graph equivalent to a single task
3.1 The 3S instrumentation process
3.2 Components of a CPU
3.3 The ELF binary structure
3.4 A C switch statement with an indirection table
3.5 Initialised statics in a C function
3.6 3S framework instruction level parameters
3.7 3S parallelism tool’s internal operation
3.8 3S parallelism tool’s address slots hashed list structure
3.9 3S parallelism tool’s instruction to slot mapping
3.10 3S parallelism tool report for the Fibonacci benchmark
4.1 The WOA used for client-server communications
4.2 The WOA used for SoC communications
4.3 The WOA activation packet
4.4 Implementation of an if else statement using WOA controllers
4.5 Heterogeneous architecture examined in the SAP results
4.6 Tightly coupled SAP accelerations
4.7 Loosely coupled SAP accelerations
4.8 Graphical SAP assignment report
4.9 SAP accelerations for different hardware sizes
4.10 SAP accelerations for different communication bandwidths
4.11 SAP accelerations for different communication latencies
4.12 The Multi-level Assignment Partitioning (MAP) approach
5.1 Illustration of MIP issues
5.2 Example split control flow graph, UCM and closures
5.3 Control flow graph to be partitioned
5.4 Single instance SAP partition
5.5 Robust multiple instance MXM partition
5.6 Robust closed multiple instance MXM∗ partition
5.7 Optimistic closed multiple instance MMM∗ partition
5.8 Optimistic multiple instance MMM partition
A.1 Default x86 3S assembly instrumentation stub
A.2 Reduced x86 3S assembly instrumentation stub
A.3 Simple 3S tool to measure fine-grained Instruction Requests
A.4 Example usage of 3S to characterise a SPEC benchmark
A.5 3S loopgraph d graphical reports
A.6 3S data flow graphical contribution report
A.7 3S framework instrumentation and characterisation process
A.8 3S trace simulation tool tester
Notation
The following notation is used consistently throughout this document.
p, q        Software code sections
s, e        The first (s) and last (e) software code sections executed in a computation
l, m        Computational hardware locations
r           The reference computational component running the operating system
τ           A step number in a computational trace
ˆ           Indicates a cost for a single step τ in the trace as opposed to the composite costs
‖           Indicates a hardware component that does not need serialisation
⊥           Indicates a hardware component that needs serialisation
µ_{pl}      Time for software section p to compute if executed at location l
c_{pqlm}    Time for communications from p on l to q on m
Q_{pqlm}    Per activation computation and communication times (quadratic objective weights)
t           Total computation and communication times for an assignment (objective function)
δ_{pl}      Size of p at location l
∆_l         Space available at location l
x_{pl}      Indicator variable, 1 if p is assigned to l and 0 otherwise
y_{pqlm}    An auxiliary variable introduced to linearise the x_{pl} x_{qm} quadratic terms
E           The set of code sections that must be instantiated only at the reference location r
L           The set of possible execution locations
M           The set of possible communication channels
X           The set of feasible assignments
Ψ(x)        The set of feasible traces for an assignment x ∈ X
Y(x)        The set of feasible greedy calling convention traces Y(x) ⊂ Ψ(x)
Ξ(x)        The set of feasible split control flow graphs (Y(x) consolidated over τ)
χ_{pq}      An arc in the standard control flow graph for the program
χ_{pqlm}    An arc in a location aware split control flow graph χ ∈ Ξ(x)
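As a minimal sketch of how this notation fits together, the following Python fragment evaluates the total time t for every assignment of a toy three section program to two locations and selects the fastest feasible one by brute force. The instance data, the flat cross-location communication cost and all helper names are illustrative assumptions rather than measurements or formulations taken from this thesis.

```python
from itertools import product

# Hypothetical toy instance: every number below is an illustrative assumption.
sections = ["a", "b", "c"]                 # code sections p, q
locations = ["cpu", "fpga"]                # hardware locations l, m
mu = {"a": {"cpu": 4, "fpga": 1},          # mu[p][l]: compute time of p at l
      "b": {"cpu": 2, "fpga": 1},
      "c": {"cpu": 6, "fpga": 1}}
delta = {"a": {"cpu": 1, "fpga": 3},       # delta[p][l]: size of p at l
         "b": {"cpu": 1, "fpga": 1},
         "c": {"cpu": 1, "fpga": 3}}
capacity = {"cpu": 3, "fpga": 4}           # space available at each location
arcs = [("a", "b"), ("b", "c")]            # control flow arcs between sections


def comm(p, q, l, m):
    """Assumed communication time from p on l to q on m."""
    return 0 if l == m else 2


def feasible(assign):
    """Check the size constraint at every location."""
    return all(
        sum(delta[p][l] for p in sections if assign[p] == l) <= capacity[l]
        for l in locations
    )


def total_time(assign):
    """Objective t: computation plus communication for one assignment."""
    compute = sum(mu[p][assign[p]] for p in sections)
    communicate = sum(comm(p, q, assign[p], assign[q]) for p, q in arcs)
    return compute + communicate


# Enumerate every mapping of sections to locations (exponential, toy use only).
candidates = (dict(zip(sections, choice))
              for choice in product(locations, repeat=len(sections)))
best = min((a for a in candidates if feasible(a)), key=total_time)
best_t = total_time(best)
```

The communication term depends on the locations of two sections at once, which is what makes the objective quadratic in the indicator variables. Exhaustive enumeration is only practical for very small instances and is used here purely to make the objective and the size constraint concrete.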
Chapter 1
Introduction
This thesis presents some of the contributions made during my Ph.D. research at
Imperial College London. The thesis is divided into three main parts: introduction
and background, contributions, and a brief summary. The contributions are
further divided into three chapters:
• Chapter 3: Program Characterisation
• Chapter 4: Partitioning with Certainty
• Chapter 5: Partitioning with Uncertainty
corresponding respectively to the computer science disciplines of Software
Engineering, High-Performance Computing and Mathematical Optimisation where
contributions have been delivered.
This chapter sets the tone for the thesis by introducing the topic and research
area in Section 1.1, the scope of this particular project in Section 1.2 and the
high-level contributions in Section 1.3. Chapter 2 provides background
information and references to the previous state of the art so that the
contributions detailed in Chapters 3, 4 and 5 can be appreciated. Chapter 6 concludes
the main body of this thesis with a brief summary and a discussion of future
work.
Readers with a specialist focus may find that reading this chapter first for context
and then moving on to selected papers [1,2,3,4,5] from the appendices is more
appropriate than a cover-to-cover read in the first instance. To assist readers who
may be new to this subject area, a glossary is provided after Chapter 6 and common
acronyms are expanded on first use throughout this document.
1.1 Project Goal
The goal of this research is to investigate the potential of heterogeneous systems
to automatically accelerate general purpose programs. The research is focused
on automatic program characterisation, efficient heterogeneous execution models
and optimal mathematical assignment methods.
Figure 1.1: An example two component heterogeneous system. (Figure labels: Memory, CPU, Reconfigurable Coprocessor; legend: Execution Flows, Memory Access.)
Heterogeneous systems consist of a set of connected computational components
specialised for different tasks. Figure 1.1 illustrates a heterogeneous system
consisting of a Central Processing Unit (CPU) [6] specialised for sequential
computations and a Reconfigurable Coprocessor [19] specialised for parallel
computations. Alternative examples of two component heterogeneous architectures
include desktop machines with a CPU and a Graphical Processing Unit (GPU) [110] and
specialist System on Chips (SoCs) [25] used in mobile devices. However, this
research is not limited to two component architectures and is equally applicable to
distributed heterogeneous clusters with any number of computational nodes [16].
By partitioning a program’s code to execute on the different components of
a heterogeneous system it is possible to deliver orders of magnitude performance
increase over homogeneous approaches [32, 79, 97, 101]. However, despite half a
century of active research in this area [18], previous heterogeneous partitioning
work has focused on specialist programs that exhibit trivial parallelism or
pipelinability [32], heuristics [79] or coarse-grained tasks [97], and has often
required manual intervention [101]. General automatic approaches to optimal
fine-grained heterogeneous partitioning were not available.
1.2 Project Scope
This project is a research project, not an implementation project. Research
can be seen as the first stage in a project where requirements are gathered and
clarified through communication, analysis and prototypes to reduce the risks for
later implementation stages.
Research is about removing unknowns and so the bounds of this project are
where knowledge already exists. With this in mind, the following areas are out
of scope for this project: hardware implementations, source code translation and
binary analysis, because previous work [46, 47, 64, 65, 66, 89, 101] clearly
demonstrates how to deal with these implementation-focused topics.
This project concentrates on answering the following questions through
prototypes, models and abstractions that constitute the contributions of this report:
1. How can software characterisation be used to assess heterogeneous
execution performance potential?
2. How can the efficiency of previous heterogeneous execution models be
improved?
3. How can software assignment problems be formalised and are the formal
abstractions tractable for real problem instances?
To answer these questions it is necessary to prove that it is possible to measure
timing, control flow and data flow characteristics for programs at a fine level of
granularity on commercial architectures and that this information can be used to
estimate heterogeneous execution potential; to ensure that any new execution models
are realistic; and to demonstrate that the formal abstractions are tractable for
a comprehensive collection of software benchmarks and hardware architectures.
These proofs, assurances and demonstrations are provided through the accompanying
papers [1,2,3,4,5] and in Chapters 3, 4 and 5 of this document, which include
acceleration results for real general software benchmarks partitioned using actual
characterisation measurements.
1.3 Contributions
The tangible contributions delivered by this project include:
• 3S [1]: a novel software characterisation framework that combines static
instrumentation with dynamic characterisation enabling simple tools to measure
real fine-grained execution timing, control and data flow information
for any compilable program.
• MAP [2,3]: a new execution model and multi-level parallel assignment
approach that delivers up to 64 times the heterogeneous acceleration potential
of previous work for the benchmarks considered.
• MIP [4,5]: the first formal work on multiple instantiation under activation
sequence uncertainty, providing more than twice the acceleration potential
of previous optimal assignment methods for the software considered
and capable of being integrated with MAP to deliver significant gains for
heterogeneous partitioning.
These contributions are discussed in Chapters 3, 4 and 5 of this document
respectively and potentially represent order of magnitude increases in heterogeneous
execution performance potential over previous work [81,84,89,97,101].
The intangible contributions of this work include: identifying characterisation,
modelling and assignment as the critical activities in optimal automated
heterogeneous partitioning; demonstrating through MAP that fine-grained sequential
partitioning is the core issue even in parallel heterogeneous partitioning;
connecting the heterogeneous partitioning problem to formal mathematical abstractions;
and demonstrating the tractability of optimal approaches for real software
assignment problems. These benefits will allow future research to focus on well defined
sub-problems and support the commercialisation of this work.
Chapter 2
Background and Related Work
2.1 Introduction
This chapter presents previous work that represented the state of the art before
this project. The chapter is divided into three sections which correspond roughly
to the three main contributions of this work:
• Section 2.2: Program Characterisation.
• Section 2.3: Heterogeneous Computing.
• Section 2.4: Mathematical Assignment Abstractions.
and a summary is provided in Section 2.5 which could be used in itself as
background for later chapters by readers who are short of time.
2.2 Program Characterisation
In order to identify the parts of a program suitable for partitioning it is necessary
to measure the computation and communication characteristics of the program
using a characterisation framework. The features of several previous software
characterisation frameworks are discussed in this section and summarised in
Table 2.1 below.
Aigner et al. in [63] present the SUIF2 compiler infrastructure which is a
source code analysis framework that parses code to an internal Abstract Syntax
Tree (AST) that can then be analysed by SUIF tools. SUIF can characterise a
program’s data and control flow dependencies statically at compile time without
the need for the program to execute and was originally used to identify parallel
partitioning opportunities [61, 62]. By working at the source code level, SUIF
has the advantage of being architecture neutral but has the disadvantages of
only working with a subset of source code languages and of not having full
access to run-time information such as data dependent loop counts and exact data
dependencies.

Feature             SUIF [63]   GILK [64]   Valgrind [65]   Pin [66]
Method              Static      Dynamic     Dynamic         Dynamic
CPU/OS              Any/Any     x86/Linux   x86/Unix        x86/Any
Distribution Size   4.4MB       808KB       20MB            44.8MB
Target Languages    C/C++       Any         Any             Any
License             Stanford    GNU         GNU             Intel
Table 2.1: Features of well established program characterisation frameworks.
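The kind of static, AST based characterisation described for SUIF can be illustrated with a minimal sketch. The example below uses Python's standard ast module as a stand-in and is not SUIF itself; the analysed source and the function names kernel and scale are hypothetical. It extracts a static call graph and counts loops without executing the program, and it also shows the limitation noted above: the loop trip count depends on the run-time value n and cannot be read from the tree.

```python
import ast

# Hypothetical program to characterise; it is parsed but never executed.
SOURCE = """
def kernel(data, n):
    total = 0
    for i in range(n):          # trip count depends on the run-time value n
        total += scale(data[i])
    return total

def scale(x):
    return 2 * x
"""

tree = ast.parse(SOURCE)

# Static call graph: the names each function calls directly.
calls = {}
for func in ast.walk(tree):
    if isinstance(func, ast.FunctionDef):
        callees = {
            node.func.id
            for node in ast.walk(func)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        calls[func.name] = sorted(callees)

# Loops can be located statically, but their iteration counts cannot.
loops = sum(isinstance(node, (ast.For, ast.While)) for node in ast.walk(tree))
```

Here calls records that kernel calls range and scale, and loops counts the single for loop. How many times the loop body runs is only known once a value for n is supplied at run time.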
Pearce et al. in [64] present the GILK characterisation framework. GILK
analyses x86 binaries by dynamically adding instrumentation stubs to the loaded
executable image in memory which call a tool as the program executes. Unlike
SUIF, GILK works at the binary level and so is able to obtain detailed run-time
information for programs written in any language. However, GILK would
require framework modifications to perform the advanced analysis required by
this project, such as data flow graph creation, and there would be considerable
work to port the framework to an architecture other than x86/Linux for which
it was designed.
Nethercote et al. present the Valgrind characterisation framework in [65].
Valgrind is similar to GILK in that it instruments x86 programs in memory. However,
unlike GILK, Valgrind uses dynamic compilation to insert instrumentation
routine calls into program traces before they are dispatched for execution.
Valgrind has the disadvantage of code expansion as it separates out load and store
operations from compact x86 instructions to form equivalent sets of instructions
with Reduced Instruction Set Computer (RISC) like characteristics for memory
access instrumentation.
In [66], Luk et al. present the Pin program analysis framework. Like Valgrind,
Pin dynamically instruments x86 program code through a virtual executive.
However, Pin does not expand x86 instructions unnecessarily and inlines tool code
in execution traces, resulting in a 3.3 times efficiency increase over Valgrind [66].
Like GILK and Valgrind, however, Pin suffers from the overhead of repeatedly
instrumenting a program each time it is run.
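The general idea of dynamic characterisation, where a tool routine is called as the program executes, can be sketched in a few lines. The example below is only an analogy: it uses the Python interpreter's sys.settrace hook rather than the binary instrumentation performed by GILK, Valgrind or Pin, and the tool function and the fib workload are hypothetical names chosen for illustration.

```python
import sys
from collections import Counter


def fib(n):
    """Small recursive workload to be characterised."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


activations = Counter()   # function name -> number of activations observed


def tool(frame, event, arg):
    """Analysis tool invoked by the interpreter as the program runs."""
    if event == "call":
        activations[frame.f_code.co_name] += 1
    return None            # do not trace individual lines inside the call


sys.settrace(tool)         # attach the tool before running the workload
try:
    result = fib(10)
finally:
    sys.settrace(None)     # always detach the tool afterwards
```

After the run, activations["fib"] holds the number of times fib was entered, a count that is only available because the program actually executed. The hook must be attached again on every run, which mirrors the repeated instrumentation overhead noted above.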
Analysis tools such as gprof, oprofile, vTune and others [71, 73] and early
frameworks such as ATOM [67], PROTEUS [68], TANGO [69] and EEL [70]
are also available, but are generally focused on a single analysis task and
provide less flexibility, modularity and architectural independence than the modern
frameworks presented in Table 2.1 above.
The major characterisation issue that had to be addressed in this research was
balancing the conflicting demands of framework simplicity and flexibility:
previous flexible characterisation frameworks require from 12,000 to
well over 100,000 lines of source code [63,64,65,66]. To solve the issue, Chapter 3
presents the simple yet flexible instrumentation framework called 3S [1]. 3S
differs from previous work in that it instruments assembly instead of source code
or binaries to provide complete characterisation flexibility with only 288 lines
of code. 3S combines static instrumentation with dynamic characterisation and
contributes a unique set of program analysis tools as discussed in Chapter 3.
2.3 Heterogeneous Computing
Heterogeneous Systems
Heterogeneous systems consist of a set of computational components specialised
for certain computational tasks connected through communication channels. It
is common for heterogeneous architectures to include a Central Processing Unit
(CPU) component and a reconfigurable device as illustrated in Figure 2.1,
although any number of components can be connected as in [16]. The components
of a heterogeneous system are often connected through a tightly-coupled System
on Chip (SoC) bus [79,81,84,89]; however, loosely coupled connections via
HyperTransport or PCI-Express [105] and distributed clusters [16] are also possible.
Figure 2.1: An example two component heterogeneous system. (Figure labels: Memory, CPU, Reconfigurable Coprocessor; legend: Execution Flows, Memory Access.)
CPU components are explained in great detail in Hennessy and Patterson’s
Computer Architecture: A Quantitative Approach [6] and are designed to execute
a wide range of programs efficiently. To achieve high performance, modern CPUs
provide a highly optimised implementation of the most commonly used functions
in multiple Arithmetic Logic Units (ALUs) fed by a long out-of-order pipeline
and deep cache hierarchies. The complexity of modern CPUs presents significant
issues for timing analysis in software characterisation frameworks like 3S and will
be discussed further in Chapter 3.
The general focus of CPUs means that they can provide relatively low
performance for special programs, and adding custom instructions to custom CPU
components has been shown to provide orders of magnitude performance increase for
some programs [79,83,89]. Examples of special purpose CPU based components
in mass production today include Graphical Processing Units (GPUs) [50, 110]
which are specialised for Single Instruction Multiple Data (SIMD) [8] floating
point tasks and Massively Parallel Processing Arrays (MPPAs) [17] which are
specialised for Multiple Instruction Multiple Data (MIMD) [8] tasks.
However, creating custom CPUs specialised for specific programs can rarely be
justified as Application Specific Integrated Circuits (ASICs) require a large outlay
on costs such as mask production which can only be economically amortised
over large volumes. An alternative route to obtaining the performance gains
of hardware component specialisation is to use reconfigurable circuits instead of
custom ASICs as discussed next.
In many ways the dual of CPU components are Reconfigurable Computing
Devices (RCDs). RCDs allow their circuitry to be reconfigured repeatedly after
manufacture and are discussed in some detail in Gokhale and Graham’s
Reconfigurable Computing [19]. RCD components can be used in heterogeneous
architectures to provide the benefits of hardware specialisation on a program
by program basis and offer low amortised manufacturing costs because of their
general applicability.
The stereotypical example of a reconfigurable device is the Field
Programmable Gate Array (FPGA) [19,27,28] although other examples exist such as
the Berkeley GARP coprocessor [31]. Classic FPGA components consist of a large
number of 4 input Look-Up Tables (LUTs) and arithmetic elements configured
to provide logic functions and inter-connected through a configurable routing
matrix.
Unfortunately, FPGA LUTs and routing matrices have a much larger feature
size than the static CMOS circuits and wires used in general CPUs, which
detrimentally affects FPGA cycle times. For example, the Altera Stratix IV,
produced with a 40nm technology, is clocked at a design dependent maximum of
only 600MHz [28] because of its feature size, which is much slower than the latest
45nm Intel Core 2 Duo E8600 clocked at 3.3GHz [111]. However, despite the lower
clock speeds, the programmability of FPGA computational components can still
deliver considerable heterogeneous performance benefits for application segments
that can be pipelined or have more data parallelism than a CPU component’s
ALUs can exploit, and speed increases of up to 2000 times have been reported
for some programs on heterogeneous CPU/FPGA systems [32].
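The trade-off between clock speed and parallelism can be made concrete with a short back-of-the-envelope calculation using the two clock rates quoted above. The sketch below is a deliberately simplified model: it assumes throughput is just operations per cycle multiplied by clock rate and ignores communication, memory and pipeline fill costs. The function name fpga_speedup and its parameters are hypothetical.

```python
FPGA_HZ = 600e6   # Stratix IV design dependent maximum quoted above
CPU_HZ = 3.3e9    # Core 2 Duo E8600 clock rate quoted above

# How many times faster the CPU clock ticks than the FPGA clock.
clock_ratio = CPU_HZ / FPGA_HZ


def fpga_speedup(ops_per_fpga_cycle, ops_per_cpu_cycle=1.0):
    """Idealised speedup of the FPGA over the CPU for one code section.

    Assumes throughput = operations per cycle * clock rate and nothing else.
    """
    return (ops_per_fpga_cycle * FPGA_HZ) / (ops_per_cpu_cycle * CPU_HZ)
```

Under these assumptions the FPGA must complete more than 5.5 operations per cycle for every operation the CPU completes per cycle just to break even, and a section sustaining 11 operations per FPGA cycle would run about twice as fast as a CPU retiring one operation per cycle.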
This research is not concerned with new device creation. However, practical
device characteristics such as cycle times and maximum parallelism are taken
into account when modelling heterogeneous execution times in Chapters 4 and 5.
Cross Compilation
In order to execute software partitioned for a heterogeneous system and, indeed,
to identify exact size and other information for execution models it is necessary
to be able to compile partitioned code for the different components in an
architecture. Where suitable source code is available, compilation for a device can
easily be performed using standard compilers such as gcc for CPUs,
manufacturer provided compilers for GPUs [110] or MPPAs [17] and tools like C2H [46],
ASC [47] or Handel-C [48] for FPGAs. Where a program is only available in
binary form, decompilation to a higher-level language and recompilation using a
targeted compiler may be necessary to cross compile the binary for a different
device [52,84].
The static decompilation of binaries can be complicated by the issue of
distinguishing data from instructions. The issue can be seen in viruses and other
obfuscated code, but is usually not seen in commercial or scientific programs
compiled using a standard compiler such as gcc or g++. Additionally, dynamic
decompilation can avoid the issue of instruction obfuscation altogether by
using traces or instruction pointer monitoring [64, 65, 66]. Previous work has
proved that it is possible to cross compile general binaries and move code
sections to other components in a heterogeneous architecture, albeit using heuristic
approaches to identify the code sections to migrate [79,81,84,85], and so the topic
of cross compilation will not be considered explicitly in this work.
Heterogeneous Execution Models
Previous heterogeneous partitioning work has focused on RISC/FPGA System
on Chips (SoCs) [79, 81, 84, 89, 100], programs written in a limited set of
languages [89,95,101] and has often required manual intervention with results being
presented for only small program kernels in practice [89,95,100,101]. The limited
hardware and software focus of previous work meant that specialised execution
models incorporating a small set of hardware and software characterisation
information were sufficient as shown in Tables 2.2 and 2.3. Further, previous
execution models have been based on the Remote Procedure Call (RPC) [107] protocol
which does not leverage the data predictability characteristics seen in a number
of computationally intense programs of practical interest.
Hardware Characteristic    Stitt [79]   Lysecky [81]   Stitt [84]   Atasu [89]
Hardware size              ✓            ✓              ✓            ✓
Data bandwidth             ✗            ✗              ✗            ✓
Communication latency      ✗            ✗              ✗            ✗
Relative cycle times       ✗            ✗              ✗            ✗
Parallel execution units   ✗            ✗              ✗            ✗
Execution efficiency       ✗            ✗              ✗            ✗
Table 2.2: Comparison of hardware characterisation metrics used in the execution models of
previous heterogeneous partitioning work.
Software Characteristic        Stitt [79]   Lysecky [81]   Stitt [84]   Atasu [89]
Size of code at each location  ✓            ✓              ✓            ✓
Program code unit iterations   ✓            ✓              ✓            ✓
Data flow sizes                ✗            ✗              ✓            ✓
Parallel execution slots       ✗            ✗              ✗            ✓
Control flow counts            ✗            ✗              ✗            ✗
Execution cycle measurements   ✗            ✗              ✗            ✗
Table 2.3: Comparison of software characterisation metrics used in the execution models of
previous heterogeneous partitioning work.
To remove the issue of limited architectural applicability, Chapter 4 presents
the general Write-Only Architecture (WOA) execution and timing model which
integrates all the hardware and software characterisation information of Tables 2.2
and 2.3 into a single objective function that can be used to identify optimal
partitions for a comprehensive range of heterogeneous architectures. The WOA
also addresses the issue of execution model efficiency and is
up to five times more efficient than previous models as explained in Section 4.2.
Heterogeneous Partitioning Paradigms
Software partitioning work can be classified as either parallel or sequential. As
Table 2.4 shows, parallel methods typically deliver program acceleration by
exploiting coarse-grained data independence and sequential methods use fine-grained
heterogeneous computational characteristics to deliver their performance
improvements.
Characteristic          Parallel   Sequential
Graph Type              DAG        CDFG
Granularity             Coarse     Fine
Exploits Parallelism    ✓          ✗
Optimal Heterogeneity   ✗          ✓
Table 2.4: Parallel and sequential partitioning characteristics. Parallel partitioning is some-
times further divided into data and task parallelism depending on whether or not
the same code segments run in parallel [51].
Parallel assignment techniques [97,100,101,102,103] require detailed sequence
and data dependency information which is often represented through the levels
of a Directed Acyclic Graph (DAG). A DAG grows with the length of the run
trace, so DAG level scheduling is made tractable through the coarse-grained
consolidation of vertices into tasks at the process [97], function [50] or loop
[101] level. For
example in Wiangtong et al. [97], program control and data flow information is
analysed to group functions with cyclic dependencies into process tasks connected
via a DAG. The DAG tasks are then scheduled for parallel execution on the
UltraSONIC heterogeneous CPU/FPGA system using a Tabu search algorithm
and dynamically triggered in parallel when dependency data becomes available.
While parallel partitioning has the advantage of being able to exploit data
parallelism in a program, the consolidation of fine-grained program parts into
coarse-grained tasks means that traditional parallel assignment approaches can-
not take full advantage of fine-grained heterogeneous component characteristics.
For example, with parallel partitioning, all five code sections of the control flow
graph of Figure 2.2 would be grouped into a single task to remove the cyclic con-
trol dependencies and partitioned or scheduled as a single unit which limits the
assignment possibilities to hardware components with relatively large capacity
and removes the potential for fine-grained component exploitation.
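The consolidation step described above can be sketched as a strongly connected component computation: code sections that lie on a common control cycle collapse into one task, and the resulting tasks form a DAG. The sketch below is illustrative only. The five-node graph is hypothetical (it is not the edge set of Figure 2.2), and Tarjan's algorithm is a standard choice here rather than the method used in the cited work.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps a node to a list of successor nodes."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop the component off the stack
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(frozenset(component))

    for v in graph:
        if v not in index:
            visit(v)
    return components


def condense(graph):
    """Group cyclically dependent sections into tasks and return the task DAG."""
    tasks = strongly_connected_components(graph)
    task_of = {v: task for task in tasks for v in task}
    dag_edges = {
        (task_of[u], task_of[v])
        for u in graph
        for v in graph[u]
        if task_of[u] != task_of[v]
    }
    return tasks, dag_edges


# Hypothetical control flow graph with five code sections on shared cycles.
cfg = {1: [2], 2: [3, 4], 3: [2], 4: [5], 5: [1]}
tasks, dag_edges = condense(cfg)
```

For this hypothetical graph every section lies on a cycle with every other, so the condensation yields one task containing all five sections and no DAG edges, which is the loss of fine-grained assignment freedom the paragraph describes.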
Figure 2.2: A control flow graph representing either a single coarse-grained task or five sep-
arate fine-grained program segments. Nodes correspond to program segments
and edges to control flows between program segments. Control flow graphs are
directed multi-graphs with edges labelled with the number of execution control
transfers between the nodes.
Sequential assignment techniques [79, 83, 89, 90, 91] work with Control/Data
Flow Graphs (CDFGs) instead of DAGs. CDFGs compress out sequence
information from the program trace rather than consolidating program sections
as in DAGs and this allows sequential partitioning to be performed at a far
finer-grained level than parallel partitioning. Sequential partitioning is often
performed at the program kernel (tight loop) [79, 81], basic block [84] or at the raw
assembly instruction [89] level. For example in Stitt et al. [79], program kernels
are ranked by iteration count and the highest iteration kernels moved from a
CPU to a synchronous reconfigurable coprocessor to act as custom instructions.
Referring again to Figure 2.2, sequential partitioning would assign each of the
five code sections independently to hardware components in an architecture to
achieve optimal heterogeneous execution performance.
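The ranking approach attributed to Stitt et al. above can be sketched as a simple greedy pass. This is a minimal sketch under stated assumptions, not the cited implementation: the kernel names, iteration counts, sizes and the single coprocessor capacity are all hypothetical, and the rule of skipping a kernel that does not fit is an assumption made for illustration.

```python
def rank_and_move(kernels, capacity):
    """Move the highest iteration kernels to the coprocessor while they fit.

    kernels  : list of (name, iterations, size) tuples (hypothetical data)
    capacity : total coprocessor space available (hypothetical units)
    """
    moved = []
    free = capacity
    # Rank kernels by iteration count, highest first.
    for name, iterations, size in sorted(kernels, key=lambda k: k[1], reverse=True):
        if size <= free:  # assumption: a kernel that does not fit is skipped
            moved.append(name)
            free -= size
    return moved


kernels = [("k1", 1000, 40), ("k2", 50000, 70), ("k3", 200, 10), ("k4", 9000, 50)]
selected = rank_and_move(kernels, capacity=100)
```

A ranking of this kind considers each kernel in isolation, which is why the formal models discussed later add communication costs between code sections.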
Sequential partitioning has the advantage of being applicable to any type of
program including programs with side-effects and programs with a running data
dependence and can generate accelerations with even small hardware component
capacities. However by not exploiting parallelism, sequential speed-up potentials
are bounded by computational specialism.
The issue of previous coarse-grained parallelism techniques ignoring the poten-
tial benefits of fine-grained heterogeneous assignment is addressed in Chapter 4
by a novel formal partitioning technique called the Multi-level Assignment Prob-
lem (MAP) [3]. MAP combines the benefits of coarse-grained data parallelism
with fine-grained heterogeneous assignment and delivers up to 64 times higher
program acceleration than previous methods for the benchmarks considered.
2.4 Mathematical Assignment Abstractions
The assignment of code sections to locations at either a coarse or fine-grained
level is essentially a knapsack problem. Table 2.5 presents the characteristics of
several different classes of knapsack problems with formal representations.
Characteristic        Knapsack [128]   GAP [139]   QAP [141]   GQAP [148]
Multiple locations    ✗                ✓           ✓           ✓
Variable node costs   ✗                ✓           ✗           ✓
Connection costs      ✗                ✗           ✓           ✓
Size constraints      ✓                ✓           ✗           ✓
Multiple instances    ✗                ✗           ✗           ✗
Strongly NP-hard      ✗                ✗           ✓           ✓
Table 2.5: Formal knapsack partitioning paradigms and their characteristics. GAP stands for
the Generalized Assignment Problem [122,139], QAP for the Quadratic Assignment
Problem [141] and GQAP for the Generalized Quadratic Assignment Problem [148].
The single location Knapsack and the multiple location Generalized Assign-
ment Problem (GAP) can be used to represent problems with isolated node met-
rics and size constraints and are most applicable to SoCs with negligible intra-
location communication costs [79, 81]. Quadratic Assignment Problem (QAP)
formulations can be used to represent homogeneous architectures such as MP-
PAs [17] and the Generalized Quadratic Assignment Problem (GQAP) can rep-
resent problems with size constrained locations where inter-node communication
costs are significant [84, 89]. Our initial problem form in this work is GQAP
which includes size, execution and communication cost information as shown in
the Mixed Integer Quadratic Programming (MIQP) problem definition below.
Definition 2.4.1 GQAP Formalisation for Software Assignment
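\begin{align}
\min \quad & \sum_{pl} \mu_{pl} x_{pl} + \sum_{pqlm} c_{pqlm} x_{pl} x_{qm} \tag{2.1} \\
\text{s.t.:} \quad & \sum_{p} \delta_{pl} x_{pl} \le \Delta_{l} \quad \forall l \tag{2.2} \\
& \sum_{l} x_{pl} = 1 \quad \forall p \tag{2.3} \\
& x_{pl} = \begin{cases} 1 & \text{if } p \text{ is assigned to } l \\ 0 & \text{otherwise.} \end{cases} \tag{2.4}
\end{align}
with $p$ and $q$ assignable items, $l$ and $m$ knapsack locations, $\mu_{pl}$ the cost of executing
$p$ on location $l$, $c_{pqlm}$ the cost of communications between $p$ on $l$ and $q$ on $m$, $\Delta_{l}$
the total space available at location $l$, $\delta_{pl}$ the space required to assign item $p$ to
location $l$ and $x_{pl}$ the 0-1 assignment decision variables.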
Constraint (2.2) is the size constraint, (2.3) ensures all code sections are assigned
to one of the execution locations and constraint (2.4) governs the 0-1 assignment
indicators.
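To make Definition 2.4.1 concrete, the sketch below evaluates objective (2.1) and checks constraint (2.2) for every assignment of a tiny instance by exhaustive enumeration. Constraints (2.3) and (2.4) hold by construction because each item is given exactly one location. All costs, sizes and capacities are hypothetical values chosen for illustration, and enumeration is only viable for very small instances; it is not the Branch and Bound approach used later in this work.

```python
from itertools import product


def gqap_cost(assign, mu, c):
    """Objective (2.1): execution costs plus pairwise communication costs.

    assign[p] is the location of item p; c maps (p, q, l, m) to a cost and
    missing entries are treated as zero.
    """
    n = len(assign)
    cost = sum(mu[p][assign[p]] for p in range(n))
    cost += sum(
        c.get((p, q, assign[p], assign[q]), 0)
        for p in range(n)
        for q in range(n)
    )
    return cost


def feasible(assign, delta, capacity):
    """Constraint (2.2): space used at each location must not exceed its capacity."""
    used = [0] * len(capacity)
    for p, l in enumerate(assign):
        used[l] += delta[p][l]
    return all(u <= cap for u, cap in zip(used, capacity))


def solve_gqap(mu, c, delta, capacity):
    """Enumerate every assignment; each item gets exactly one location (2.3, 2.4)."""
    best = None
    for assign in product(range(len(capacity)), repeat=len(mu)):
        if not feasible(assign, delta, capacity):
            continue
        cost = gqap_cost(assign, mu, c)
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


# Hypothetical instance: three code sections, location 0 (CPU) and location 1.
mu = [[10, 2], [8, 3], [6, 5]]           # mu[p][l]: execution cost
delta = [[0, 4], [0, 4], [0, 4]]         # delta[p][l]: space required
capacity = [100, 8]                      # Delta_l: space available
c = {(0, 1, 0, 1): 3, (0, 1, 1, 0): 3,   # c[p, q, l, m]: communication cost
     (1, 2, 0, 1): 1, (1, 2, 1, 0): 1}
best = solve_gqap(mu, c, delta, capacity)
```

In this instance the cheapest execution costs would place all three sections on location 1, but the size constraint forbids it, so the optimum keeps the third section on location 0 and pays one unit of communication cost.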
GQAP is the most difficult of the well known assignment problems to which all
others can be reduced [148] and in his seminal work, Sahni [141] proved QAP (and
by extension GQAP) is strongly NP-hard meaning that even the identification
of an -approximation scheme [128,139] would require P = NP. Consequently, in
the words of Sahni, any polynomial time heuristic GQAP algorithm:
“. . .must produce arbitrarily bad approximations on some inputs.”
which is why the focus of this work is on optimal solutions.
Optimal GQAP solutions are theoretically of combinatorial complexity and
problems with only thirty nodes [142] proposed over forty years ago have only
recently been solved to optimality with the help of massive computational clus-
ters [143, 144]. However Chapter 4 will show that real GQAP software parti-
tioning problem instances exhibit sufficient sensitivity to allow the combinatorial
solution space to be pruned using Branch and Bound (B&B) [123] and can be
solved optimally in just a few seconds for large problems with over a thousand
nodes, demonstrating the practical tractability of optimal solutions for software
partitioning.
The major issue with previous formal models is that they are of a general
focus and do not take advantage of the special features available to software
partitioning. Chapter 5 presents a new optimisation model called the Multiple
Instantiation Problem (MIP) that offers heterogeneous software assignment po-
tentials more than twice as good as GQAP for the benchmark considered by
taking advantage of the multiple instantiation feature available to software par-
titioning. The new problem form adds uncertainty to the single instance GQAP
which is addressed through a robust minimax formulation [149,150,151], the inner
maximisation of which is partially relaxed using Lagrangian dualisation [126, 127]
to obtain results using standard Linear Program (LP) solvers.
MIP is more complex than GQAP and has less B&B sensitivity [4] so the
combinatorial complexity issues predicted by Sahni and others soon become ap-
parent with even small MIP problem instances. As discussed in Section 5.5, the
practical MIP complexity is only an issue for some forms of MIP so all is not lost
by any means. In any case, a great deal of work has been performed on heuristic
algorithms [2, 145, 147, 148] and tighter B&B relaxations [140, 146] which could
be applied to the most challenging forms of MIP in future work as discussed in
Chapter 6.
2.5 Summary
This chapter presented some background information and previous work related
to this project. The chapter was divided into three sections. Section 2.2 pre-
sented previous software characterisation work including: SUIF [63], GILK [64],
Valgrind [65] and Pin [66]. The previous program characterisation approaches
were compared in Table 2.1 and the issue of achieving framework simplicity with
measurement flexibility raised for solution in Chapter 3.
Section 2.3 discussed different heterogeneous computational devices, execution
models and practical partitioning paradigms. The essential features of Central
Processing Units (CPUs), Graphical Processing Units (GPUs), Massively Paral-
lel Processing Arrays (MPPAs) and Field Programmable Gate Arrays (FPGAs)
were all discussed before the restricted applicability and efficiency of previous
heterogeneous execution models were highlighted. The issues of heterogeneous
execution model applicability and efficiency will be addressed by the Write-Only
Architecture (WOA) which is discussed in Chapter 4 of this work.
Section 2.3 then went on to divide previous partitioning approaches into parallel
and sequential classes which correspond to Directed Acyclic Graph (DAG) and
Control/Data Flow Graph (CDFG) partitioning respectively. An example was
provided that demonstrated that the major issue with parallel coarse-grained
task assignment is the loss of fine-grained heterogeneous acceleration potential
which will be solved in Chapter 4 with a new hybrid partitioning approach called
the Multi-level Assignment Problem (MAP) which provides up to 64 times the
acceleration potential of the previous approaches.
The chapter concluded with a review of some relevant theoretical partitioning
work. Characteristics of the formal Knapsack [128], GAP [139], QAP [141] and
GQAP [148] problems were compared in Table 2.5 and the complexity results of
Sahni [141] were used to explain that the problem this project attempts to solve
is strongly NP-hard. However the practical complexity of a particular problem
instance depends on the B&B granularity and in Chapter 4 results will be pre-
sented that demonstrate the software assignment problem is tractable for the
comprehensive range of benchmarks and architectures considered in this work.
The comparison of previous formal partitioning methods in Section 2.4 high-
lighted the issue that none of the previous work took advantage of the special
features available in the software partitioning domain. Chapter 5 develops a new
partitioning approach called the Multiple Instantiation Problem (MIP) that takes
advantage of software section multiple instantiation and paves the way for future
research into domain specific feature exploitation in formal software partitioning.
Chapter 3
Program Characterisation
3.1 Introduction
This chapter introduces the Spacey Stream Splitter (3S). 3S is a software charac-
terisation framework and set of tools that gathers the metric information required
by this project to partition software at a fine-grained level. 3S combines static
instrumentation with dynamic program characterisation and contributes:
• Section 3.2: A Novel Method to Instrument Software.
• Section 3.3: Solutions to Instrumentation Issues.
• Section 3.4: Unique Program Characterisation Tools.
These contributions are discussed in Sections 3.2, 3.3 and 3.4 of this chapter and
additional information about 3S can be found in the technical introduction of
Appendix A and in the 3S paper [1].
3.2 Methodology Overview
Figure 3.1 illustrates the 3S methodology which differs from previous work [63,
64, 65, 66] in that a combination of static compile-time instrumentation and dy-
namic run-time program characterisation is used. In the static instrumentation
stage, 3S adds instrumentation stubs into the assembly of a program at code sec-
tion boundaries and then compiles the instrumented program with a 3S run-time
analysis tool. The 3S instrumentation code section boundaries are user config-
urable and are typically the program basic block level for fine-grained control
flow analysis tools and the instruction level for data flow analysis tools.
[Figure 3.1(a), 3S static instrumentation process: Program Code -> Compile to
Assembly (gcc) -> Program Assembly -> Instrument with 3S Stub Calls (3S Stub
Code) -> Instrumented Assembly -> Link with 3S Tool (ld, 3S Tool Code) ->
Instrumented Executable.]
[Figure 3.1(b), 3S dynamic characterisation process: inside the Instrumented
Executable, the Instrumented Program passes a Characterisation Stream to the
Compiled 3S Tool, which produces the 3S Characterisation Reports.]
Figure 3.1: Illustration of the 3S instrumentation and characterisation processes.
When an instrumented executable is run, the 3S stubs call the 3S tool with
a dynamic stream of control and data flow information. The 3S tool then uses
parts of the information stream to create its reports. The process of extracting
information from the characterisation stream is called “splitting” the stream in
3S parlance and this is the reason for the framework’s name: the Spacey Stream
Splitter.
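The idea of "splitting" the stream can be sketched with two small tools that read the same stream but use different parts of it. This is a conceptual sketch only: the (section, ticks) event layout and the function names are hypothetical, and real 3S tools are linked into the instrumented executable and called by the stubs rather than being handed a Python list.

```python
from collections import Counter


def control_flow_tool(stream):
    """Use only the section identifiers: count control transfers between sections."""
    edges = Counter()
    previous = None
    for section, _ticks in stream:
        if previous is not None:
            edges[(previous, section)] += 1
        previous = section
    return edges


def timing_tool(stream):
    """Use only the timing field: total the ticks spent in each section."""
    totals = Counter()
    for section, ticks in stream:
        totals[section] += ticks
    return totals


# Hypothetical characterisation stream of (code section, measured ticks) events.
stream = [("B1", 12), ("B2", 30), ("B2", 28), ("B3", 7), ("B2", 31)]
edges = control_flow_tool(stream)
totals = timing_tool(stream)
```

Each tool ignores the fields it does not need, so adding a new report means writing a new consumer of the stream rather than changing the instrumentation.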
Table 3.1 illustrates how the 3S methodology differs from previous work and
Table 3.2 summarises the contributions these differences generate. 3S combines
static instrumentation with dynamic characterisation. By using static instru-
mentation 3S is more efficient than GILK [64], Valgrind [65] and Pin [66] which
repeat their dynamic instrumentation stage each time a program is run. The 3S
efficiency has been measured to be over 15.5 times higher than Valgrind for the
same reports as shown in Table 3.3 [1]§A which, by extension of [66], corresponds
to a 4.6 times efficiency improvement over the highly optimised Pin framework.
Feature            SUIF [63]   GILK [64]   Valgrind [65]   Pin [66]   3S [1]
Instrumentation    Static      Dynamic     Dynamic         Dynamic    Static
Characterisation   Static      Dynamic     Dynamic         Dynamic    Dynamic
Target             Source      Binary      Binary          Binary     Assembly
Table 3.1: Differences between the 3S methodology and previous work.
Contribution   SUIF [63]   GILK [64]   Valgrind [65]   Pin [66]   3S [1]
Efficient      ✓           ✗           ✗               ✗          ✓
Accurate       ✗           ✓           ✓               ✓          ✓
Simple         ✗           ✗           ✗               ✗          ✓
Table 3.2: Contributions provided by the 3S methodology compared with previous work. Ef-
ficiency indicates no wasted instrumentation time on repeated program executions
and Accuracy indicates access to exact loop counts and memory accesses which are
not available to static characterisation methods. Simplicity is a relative measure
related to code base size with 3S the only framework less than 1000 lines of code.
Program    GZIP     AES256   AES256
Compiler   GCC      GCC      G++
SHELL1     3.96x    3.29x    10.66x
OPT1       3.05x    5.53x    15.52x
Table 3.3: 3S performance improvements over Valgrind 3.2.0 for the callgrind tool (with the
3S callgrind tool also providing tick information). SHELL1 is an Intel Pentium
and OPT1 an AMD Opteron machine at Imperial. The compiler optimisation level
was -O2.
Static characterisation frameworks like SUIF [63] often require conservative
approximations. For example, there is no way to know how many times data
dependent loops will be run or exact data access patterns given complex pointer
indirections at compile time without simulating the run-time inputs. The use of
dynamic characterisation tools in 3S makes exact loop counts and memory access
patterns easy to determine for any program input and the program can always
be run with multiple inputs to deliver more accurate abstract characterisations
than would be possible with static approaches if required. Further, heterogeneous
partitioning relies on total computation time and data flow ratios which are often
independent of the exact input data, meaning 3S measurements need only be
performed on representative program input data rather than over a (possibly
infinite) data set in order to obtain practically optimal partitions.
3S instruments programs at the assembly level rather than the source or binary
level. Assembly instrumentation allowed 3S to benefit from:
1. the automatic assembly generation capabilities of existing compiler tools
such as gcc available for a range of platforms and CPUs.
2. assembly being a common representation for any compilable language in-
cluding C, C++, Java and FORTRAN.
3. assembly’s simple context free text structure that is easy to analyse, does
not suffer from offset re-writing issues [64] and allows human verification
that inserted instrumentation stubs do not alter program correctness.
The benefits of assembly instrumentation had been recognised by previous au-
thors in software simulation work [69], but had not been recognised or used
by any general program characterisation framework before 3S and are the main
reason the 3S code base is so simple and compact at only 288 lines of Python
code. By using assembly, 3S removes the issue of framework complexity seen in
SUIF [63], GILK [64], Valgrind [65] and Pin [66] while remaining flexible through
its modular approach of linking in run-time characterisation tools.
3.3 Solutions to Instrumentation Issues
This section discusses some of the issues the latest multi-core Complex Instruc-
tion Set Computer (CISC) architectures present for program characterisation
frameworks and how the current implementation of 3S addresses these issues.
The issues facing modern program characterisation frameworks are listed in Ta-
ble 3.4 and can be grouped into three classes: obtaining fine-grained execution
cycle counts (the first five issues), distinguishing instructions from data in
binaries (the next two issues) and issues affecting data flow analysis (the last two
issues). Each of these three issue classes is discussed separately below.
In addition to the main issue classes discussed in this section, the current im-
plementation of 3S also solves other issues including: how to map binary sections
back to source code, how to ensure tool execution is not reordered when linking,
the isolation of program, stub and tool registers and instrumenting multiple file
projects. For more information the interested reader is referred to the technical
introduction of Appendix A, the 3S paper [1] and the documentation provided
with the 3S(ex) source code distribution.
Issue                           SUIF [63]   GILK [64]   Valgrind [65]   Pin [66]   3S [1]
Out-of-order execution          ✗           ✗           ✗               ✗          ✓
Superscalar cores               ✗           ✗           ✗               ✗          ✓
Multi-cycle ALUs                ✗           ✗           ✗               ✗          ✓
Multi-core CPUs                 ✗           ✗           ✗               ✗          ✓
Cache hierarchies               ✗           ✗           ✗               ✗          ✓
Different length instructions   ✓           ✓           ✓               ✓          ✓
Data interleaved with code      ✓           ✓           ✓               ✓          ✓
Complex addressing modes        ✓           ✓           ✓               ✓          ✓
FPU stack data dependencies     ✓           ✗           ✗               ✗          ✗
Table 3.4: Characteristics of modern architectures that present issues for characterisation
frameworks and whether they are addressed by 3S (version 2.9) and other
frameworks. The first five issues affect real timing measurements and are solved in the 3S
hotspot tool with other frameworks effectively ignoring the issues by only offering
Instruction Request counts. The next two issues concern distinguishing instruc-
tions from data in binaries and the last two issues concern data flow dependence
analysis for integer and floating point programs.
Fine-grained Execution Cycle Counts
3S offers fine-grained execution timing measurements for programs running on
modern proprietary CPUs. Modern CPUs like Intel’s x86 range are out-of-order,
superscalar, multi-issue, multi-core and have deep cache hierarchies as illustrated
in Figure 3.2. These characteristics result in different execution times for different
instructions and instruction combinations making the simple Instruction Request
(IR) counts of previous work [63, 64, 65, 66] an unacceptable approximation for
real execution times.
[Figure 3.2 diagram, two panels: "Out-of-Order Superscalar CPU with Shared
Instruction/Data Cache" and "Out-of-Order Superscalar CPU with Harvard Style
Caches". Each panel shows a register file (%eax, %ebx, %ecx, ...), an instruction
window holding instructions B1.1 to B2.3, ALU 1 and ALU 2, and the cache or
caches, with instruction assignments/executions and data accesses marked. The
figure is annotated "c.f. Core2Duo - except ?4? ALUs/ Core".]
Figure 3.2: Illustration of the parts of a CPU that can make simple Instruction Request (IR)
counts an unacceptable measure of fine-grained execution time.
On the x86 implementation of 3S, fine-grained timing measurements are ob-
tained using the RDTSC (ReaD Time Stamp Counter) machine code instruction
which reads the CPU core’s cycle counter. 3S places the RDTSC call in the 3S stubs
around each fine-grained code segment and streams the timing information to the
3S tool for analysis. The main benefit of using software based execution timing
is that real timing measurements can be made on hidden proprietary commercial
architectures for which a simulator is not available, but there are also several
issues with the approach which are discussed next.
The first issue with using software based execution timing measurements is
that, like quantum mechanics, measurement interferes with the object being mea-
sured [72]. In our case, the instrumentation stubs and tool code take time to
execute which we need to account for in the timing measurements. To counteract
this issue, 3S stubs place the RDTSC instructions as close as possible to the pro-
gram code being timed resulting in a measured bias of only 9 clock cycles on an
AMD Opteron machine which can be subtracted from final timing measurements
if required.
Even with low timing overhead however, the stub and tool code need to run on
an ALU and this interferes with the CPU’s state and instruction issue window.
To solve this issue would require hardware instrumentation like that used in
bespoke RISC SoCs [81] and is not possible for complex commercial architectures
where die space is at a premium. To counteract the pipeline interference issue
3S offers the 3S INSTRUMENT LOOP BOUNDARIES option which performs lightweight
timing measurements around tight loops without the need to call a tool and the
3S INSTRUMENT FLUSH PIPELINE option which uses the x86 CPUID instruction to
serialise code and tool executions.
The second issue in performing software timing measurements on real CPUs
is that the CPU may interrupt a program at any point and transfer control to
the operating system. To make matters worse, there is no guarantee that the
program will be started again on the same core it was on when it was inter-
rupted on a multi-core machine — and cores can have different RDTSC counters.
To solve the first part of this issue (interruption on the same core) 3S includes
scripts that perform and consolidate multiple independent timing measurements
for a program. The consolidation is on a per code segment minimum total time
basis with the understanding that interrupts will slow a segment down. To solve
the second part of the issue (making sure the code always restarts on the same
core), 3S includes an executable wrapping program for Linux called affinity. A
user can use affinity to lock a program to a specified core to remove the core
uncertainty issues.
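The per code segment consolidation described above can be sketched as follows. This is a minimal sketch, not the 3S scripts themselves: the function name, the dictionary layout and the cycle totals are hypothetical. It keeps, for each segment, the minimum total time seen over several independent runs, on the understanding that an interrupt can only make a segment appear slower.

```python
def consolidate(runs):
    """Keep the minimum total time observed for each code segment.

    runs is a list of dicts, one per independent timing run, each mapping a
    code segment name to its total measured cycles (hypothetical layout).
    """
    segments = set().union(*runs)
    return {
        segment: min(run[segment] for run in runs if segment in run)
        for segment in sorted(segments)
    }


# Hypothetical totals from three runs; the second run of B2 was interrupted.
runs = [
    {"B1": 120, "B2": 900},
    {"B1": 118, "B2": 1450},
    {"B1": 131, "B2": 905},
]
consolidated = consolidate(runs)
```

Taking the minimum rather than the mean means a single disturbed run does not inflate the reported time of a segment.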
Distinguishing Instructions from Data
When working with binaries there is a potential issue in distinguishing between
data and instructions particularly where data is included in the .text section of
an executable. However these issues are not present in assembly generated by
compilers such as gcc from standard source code.
Consider for example Figures 3.4 and 3.5 overleaf corresponding to the
statics.c benchmark distributed with 3S. Figure 3.4 shows the assembly pro-
duced by the large continuous switch statement (small or disjoint switch state-
ments are implemented with if else structures) and Figure 3.5 shows the assem-
bly produced by the function level statics of the benchmark when compiled with
gcc -O0. The assembly examples show how code and data can be intermingled
in an assembly file but that the code/data boundary is clearly marked by the
compiler so that code and data can be placed in the correct segment of the final
binary file as illustrated in Figure 3.3.
ELF Binary Schematic
  Header
  .text     << program code >>
  .rodata   << initialised read only data >>
  .data     << initialised read/write data >>
  ...       << other sections >>
  Footer
Figure 3.3: Schematic of the segments in an Executable and Linking Format (ELF) binary.
To distinguish instructions from data in standard assembly, the 3S framework
simply watches for section changes in its top down parse of the assembly files,
adding instrumentation stubs when in the .text section and not otherwise. For
example, in the C switch statement of Figure 3.4, the movl .L16(,%edx,4), %eax
instruction identifies the correct case body to jump to from the address array
starting at label .L16 and the jmp *%eax instruction performs the jump. As 3S
examines the assembly for the switch statement 3S will instrument the code up
to the jmp before swapping out of the .text section at the compiler inserted
assembler directive .section .rodata and 3S will start to instrument the code
...
movl .L16(,%edx,4), %eax
jmp *%eax
.section .rodata # <= 3S swaps out of .text
.align 4
.align 4
.L16: # <= 3S would have inserted a stub if in .text
.long .L6
.long .L7
.long .L8
.long .L9
.long .L10
.long .L5
.long .L11
.long .L12
.long .L13
.long .L14
.text # <= 3S swaps back into .text
.L6: # <= 3S inserts an instrumentation stub here
movl $.LC2, -12(%ebp)
jmp .L17
.L7: # <= 3S inserts an instrumentation stub here
movl $.LC3, -12(%ebp)
jmp .L17
...
Figure 3.4: A C switch statement with an indirection table.
again after the compiler generated .text directive which signifies the continuation
of the executable program code section. A similar instrumentation process will
be seen for the compiler generated initialised statics example of Figure 3.5 where
3S will swap out of the .text section at the .data directive.
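The section tracking described above can be sketched in a few lines. This is an illustrative sketch and not the 3S source: the function name and the regular expression are hypothetical, only the .text, .data, .bss and .section directives are handled, the parse is assumed to start in .text (the assembler default), and stub points are reported only at labels, as in the annotations of Figures 3.4 and 3.5.

```python
import re

# A label is a symbol at the start of a line followed by a colon, e.g. ".L6:".
LABEL = re.compile(r"^\s*([.\w$]+):")


def stub_points(lines):
    """Return the indices of label lines that lie in the .text section."""
    in_text = True  # assumption: parsing starts in .text
    points = []
    for number, raw in enumerate(lines):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        tokens = line.split()
        directive = tokens[0]
        if directive == ".text":
            in_text = True
        elif directive in (".data", ".bss"):
            in_text = False
        elif directive == ".section" and len(tokens) > 1:
            in_text = tokens[1].startswith(".text")
        elif in_text and LABEL.match(line):
            points.append(number)
    return points


# Abbreviated form of the switch statement assembly of Figure 3.4.
lines = [
    "movl .L16(,%edx,4), %eax",
    "jmp *%eax",
    ".section .rodata",
    ".align 4",
    ".L16:",
    ".long .L6",
    ".long .L7",
    ".text",
    ".L6:",
    "movl $.LC2, -12(%ebp)",
    "jmp .L17",
    ".L7:",
    "movl $.LC3, -12(%ebp)",
    "jmp .L17",
]
points = stub_points(lines)
```

The jump table label .L16 is skipped because it appears after the .section .rodata directive, while .L6 and .L7 are reported because they follow the .text directive.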
While it is true that some viruses hide code in the data segment and this
complicates binary analysis frameworks that aim to follow arbitrary execution
paths, this is not a problem one expects to see in commercial or scientific programs
compiled with standard compilers like gcc at which 3S is aimed. Further, the
3S assembly instrumentation approach is still applicable to programs which use
inline assembly provided they do not purposefully try to obfuscate the code/data
boundary.
...
call exit # <= end of the main() function
.size main, .-main
.data # <= 3S swaps out of .text
.align 8
.type f_y.2336, @object
.size f_y.2336, 8
f_y.2336: # <= 3S would have inserted a stub if in .text
.long 1
.long 0
.local f_x.2335
.comm f_x.2335,8,8
.local f.2334
.comm f.2334,8,8
.text # <= 3S swaps back into .text
.globl fib
.type fib, @function
fib: # <= 3S inserts an instrumentation stub here
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $40, %esp
cmpl $0, 8(%ebp)
je .L20
...
Figure 3.5: Initialised statics in a C function.
Data Flow Analysis
3S has several instruction level tools that perform run-time data flow analysis
including the data flow tool for classic data flow reporting and cache simulation
and the parallelism tool for identifying Instruction Level Parallelism (ILP) in
programs using real run-time data dependency information. To provide the 3S
instruction level tools with their characterisation stream, the 3S framework inserts
instrumentation stubs around instructions in the program assembly to capture
the values being passed to the instructions at run-time.
To capture the values an instruction will be executed on at run-time, 3S has
to evaluate all non-immediate instruction parameters — effectively performing a
Address Mode   Example [44]          CISC [111]   RISC [9]
Immediate      pushl $1              ✓            ✓
Direct         pushl %eax            ✓            ✓
Indirect       pushl (%eax)          ✓            ✓
Memory         pushl foo             ✓            ✓
Offset         pushl -4(%eax)        ✓            ✓
Indexed        pushl foo(,%eax,4)    ✓            ✗
Table 3.5: Common addressing modes used in CISC and RISC machines.
CPU decode and address de-referencing in software depending on the instruction
addressing mode. There are several addressing modes 3S has to deal with as
shown in Table 3.5 and 3S identifies the addressing mode using static pattern
matching on assembly at instrumentation time and inserts an instruction level
stub which evaluates and fills-in the 3S parameter structure at run-time so that
it can be sent to the 3S characterisation tool.
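Static pattern matching of this kind can be sketched with one regular expression per addressing mode of Table 3.5. The patterns and the function name below are hypothetical and cover only the AT&T operand forms shown in the table; they are not the patterns used inside 3S.

```python
import re

# One pattern per addressing mode of Table 3.5, tried in order.
ADDRESS_MODES = [
    ("Immediate", re.compile(r"^\$.+$")),                                 # $1
    ("Direct",    re.compile(r"^%\w+$")),                                 # %eax
    ("Indirect",  re.compile(r"^\(%\w+\)$")),                             # (%eax)
    ("Offset",    re.compile(r"^-?(0x[0-9a-fA-F]+|\d+)\(%\w+\)$")),       # -4(%eax)
    ("Indexed",   re.compile(r"^[-.\w$]*\((%\w+)?,%\w+(,[1248])?\)$")),   # foo(,%eax,4)
    ("Memory",    re.compile(r"^[.\w$]+$")),                              # foo
]


def address_mode(operand):
    """Classify a single AT&T syntax operand by its addressing mode."""
    for name, pattern in ADDRESS_MODES:
        if pattern.match(operand):
            return name
    return "Unknown"


# The operands of the pushl examples in Table 3.5.
examples = ["$1", "%eax", "(%eax)", "foo", "-4(%eax)", "foo(,%eax,4)"]
modes = [address_mode(operand) for operand in examples]
```

Because the classification is done once at instrumentation time, the stub inserted for each instruction only has to evaluate the operand for the mode already identified.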
The 3S parameter structure is shown in Figure 3.6 and includes information
about the access type (read, write or branch), the addressing mode as well as the
dynamic value of each parameter accessed by an instruction including implied
parameters. By including all the parameter information for an instruction in a
single array passed to the tool in one call, 3S is more efficient than for example
Valgrind which expands instructions into RISC type equivalents with implied
parameters constituting separate expanded instructions with separate tool call
overheads [65]. With the Valgrind method the pushl instructions of Table 3.5
which each have implied stack pointer reads and stack pointer updates would be
expanded into separate data flow tool calls in addition to the main stack memory
contents update.
One complication when tracking data flows in a 3S tool is monitoring depen-
dencies for data initialised statically or by external libraries. The writes that
initialise the external data cannot be tracked by 3S unless the external function
is included in the code that is instrumented by the framework which is why exter-
nally and statically initialised data is identified by the special virtual label B00 in
the 3S instruction level tools as we will see when we examine assignment reports
in Chapter 4.
[Figure 3.6 diagram: a Parameter Array (parameter[1], parameter[2],
parameter[3], ...) whose entries are instruction_parameter_t structures with the
fields "type: access_type | value_type" and "value: void*". The access_type and
value_type enumerations list the constants _3S_INSTRUCTION_READ,
_3S_INSTRUCTION_WRITE, _3S_INSTRUCTION_BRANCH, _3S_INSTRUCTION_IMPLIED,
_3S_INSTRUCTION_IMMEDIATE, _3S_INSTRUCTION_REGISTER, _3S_INSTRUCTION_LEA and
_3S_INSTRUCTION_MEMORY.]
Figure 3.6: Run-time instruction parameter array (left) and data structure (right) provided
by the 3S framework to tools when operating at the instruction level.
Dynamic instrumentation frameworks [64, 65, 66] can follow a program’s
execution path into external code and are thus capable of identifying exact data
sources. However the focus of 3S through this project is on heterogeneous
partitioning and 3S assumes any external libraries or operating system routines not
specifically instrumented by the user are external to the code that can be
partitioned and must remain on the CPU. This is a fair assumption for this project
as functions such as printf(), fopen() and exit() clearly need appropriate re-
sources (terminal, disk access, external context) to execute and naturally require
synchronisation through an operating system running on a CPU even in a het-
erogeneous environment.
3S can work with integer and floating point programs for timing, control
flow identification and indeed memory access measurement. However the Intel
x86 Floating Point Unit (FPU) has a unique implied register stack structure
dating back to when it was an external 8087 maths coprocessor which complicates
floating point data flow analysis.
Binary instrumentation frameworks such as GILK [64], Valgrind [65] and
Pin [66] are focused on memory access checking rather than data dependency
analysis, and no previous dynamic framework attempts to address the issue of
data dependency analysis through Intel’s FPU stack. The current implementation
of 3S has full data flow analysis for integer instructions but, like previous
work, does not follow data dependencies through the Intel FPU stack. This is
because floating point data flow analysis was not required for the integer
benchmarks considered in Chapters 4 and 5 of this work. However, due to the
simplicity of the 3S framework, it would be a straightforward matter to add
code to 3S to shadow the FPU stack to support full data flow analysis for
floating point programs on x86 CPUs, and this is left as an exercise for the
interested reader (§6.3).
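As a rough illustration of what shadowing the FPU stack might involve, the sketch below keeps, for each of the eight x87 stack registers, the identifier of the instruction that last wrote it. It is a minimal sketch under stated assumptions and is not code from the 3S distribution: the class name, the method names and the choice of a string as the writer identifier are all hypothetical, and only fld-style pushes, fstp-style pops and an faddp-style operation are modelled.

```python
class ShadowFPUStack:
    """Hypothetical shadow of the x87 register stack for dependency tracking."""

    SIZE = 8  # the x87 FPU exposes eight stack registers, ST(0)..ST(7)

    def __init__(self):
        self.top = 0                       # index of the physical register that is ST(0)
        self.writer = [None] * self.SIZE   # physical register -> last writer id

    def _phys(self, i):
        # Map the stack-relative name ST(i) to a physical register index.
        return (self.top + i) % self.SIZE

    def push(self, writer_id):
        # fld-style load: the top pointer is decremented and the new ST(0) is written.
        self.top = (self.top - 1) % self.SIZE
        self.writer[self.top] = writer_id

    def pop(self):
        # fstp-style store: ST(0) is read, freed and the top pointer incremented.
        source = self.writer[self.top]
        self.writer[self.top] = None
        self.top = (self.top + 1) % self.SIZE
        return source

    def read(self, i=0):
        return self.writer[self._phys(i)]

    def write(self, i, writer_id):
        self.writer[self._phys(i)] = writer_id

    def faddp(self, writer_id):
        # faddp: ST(1) <- ST(1) + ST(0), then pop. Returns the two data sources.
        deps = (self.read(0), self.read(1))
        self.write(1, writer_id)
        self.pop()
        return deps


fpu = ShadowFPUStack()
fpu.push("fld a")
fpu.push("fld b")
deps = fpu.faddp("faddp")     # the writers of the two values being added
result_writer = fpu.read(0)   # the sum now sits in ST(0)
```

With a shadow of this kind, the implied stack operands of a floating point instruction could be reported to a tool in the same way as explicit register parameters.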
3.4 Comprehensive Characterisation Tools
While the 3S framework is relatively simple at only 288 lines of code [1], the
framework itself does not perform any program characterisation and relies on tools
linked with the instrumented program at compile time. The 3S tools developed for
this project are listed in Table 3.6 along with their size and whether similar tools
are available for alternative frameworks like SUIF [63], GILK [64], Valgrind [65]
and Pin [66]. All the tools were written in standard ANSI C to provide immediate
portability for 3S implementations on alternative operating systems and CPUs
and all the tools have a high comment to code ratio to assist new users.
3S Tool      Description                                   Unique  Code  Comments
trace        Saves the raw 3S characterisation stream      yes       54       110
hotspot      Measures real CPU clock cycles and IRs        yes       72       109
callgrind    Generates a control flow graph                no       172       132
memory       Run-time memory access information            no        78       106
profile      Measures real CPU cache timing effects        yes       43        67
parallelism  Identifies instruction level data parallelism yes      239       256
regex        Regular expressions from control flows        yes      175       209
loopgraph d  Adds hotspot and IR information to regex      yes      347       304
data flow    Data dependencies between code sections       yes      366       411
all          A meta-tool showing how to join 3S tools      yes       19        70
Table 3.6: 3S program characterisation tools developed for this project and whether they were
unique to 3S at the time of creation or not. The Code column lists the number of
code lines in the tool and the Comments column the number of comment lines, both
excluding blank lines. Smaller tools have a high comment ratio due to boilerplate
comments such as author, creation date and file history. All the 3S tools have an
expected time complexity of $\bar{\Theta}(n)$ in the execution trace length.
As Table 3.6 shows, most of the 3S tools are unique. For example while Val-
grind [65], which is the leading alternative characterisation framework, has mem-
ory access monitoring and cache miss tools (memcheck and cachegrind), Valgrind
lacks the memory data dependency analysis and cache timing tools available for
3S. Further, the simplicity and orthogonality of the 3S methodology is reflected
in the size of its tools which average at less than 200 lines of C including internal
data structures and report generation code.
The following discusses the 3S parallelism tool in more detail. The 3S
parallelism tool provides evidentiary proof that it is possible to identify data
flow dependent heterogeneous computation characteristics for programs compiled
for a homogeneous CPU based system using the 3S approach.
Identifying Parallelism in Sequential Code
Previous dynamic instrumentation frameworks [64,65,66] were focused on memory
access checking and could not perform memory data flow analysis. The focus of
previous work meant that researchers were faced with the issue of having to
write bespoke Instruction Level Parallelism (ILP) measurement frameworks [76],
repeating much of the work involved in creating a general characterisation
framework in the process. 3S provides run-time data dependence information to
its tools and so removes the issue of repeated work for researchers wanting to
leverage the framework’s instrumentation capabilities.
The 3S parallelism tool demonstrates the use of the 3S data flow analysis
capabilities for Instruction Level Parallelism (ILP) measurement. The tool iden-
tifies parallelism capabilities in sequential programs compiled for CPUs and has
a range of configuration options which allow the tool to identify either the max-
imum theoretical parallelism potential or a parallelism potential bounded by the
practical hardware ALUs and window sizes of heterogeneous components.
[Figure 3.7 diagram: flowchart of the 3S parallelism tool's basic operation.
For each instruction in the characterisation stream the tool places the
instruction in the maximum dependency slot for all its reads and then updates
the dependency slot of all addresses written by the instruction. At a code
section boundary the tool stores the width and height of the current section's
slots in the block_metrics_t structure held in general[1], which has the
uint64_t fields entries, ir, slots, parallelism_min and parallelism_max.]
Figure 3.7: The internal operation of the 3S parallelism tool.
Figure 3.7 illustrates the internal operation of the parallelism tool. Like all
3S tools, the 3S parallelism tool analyses the stream of program control and
data flow information generated by 3S instrumentation stubs at run time.
For each instrumented instruction the parallelism tool receives instruction
parameter value and type information as the program runs in the parameter
array structure illustrated earlier in Figure 3.6. The tool uses the instruction
parameter information to identify inter-instruction data dependencies so that
each instruction can be placed in an appropriate parallel execution slot.
Instructions are placed in the earliest slot where all their dependent data
becomes available (i.e. As Soon As Possible scheduled [7]) by looking up the
slots where dependencies were last written in the tool’s internal
address_slots hashed list data structure illustrated in Figure 3.8. Once the
instruction has been placed in a parallel slot, the address_slots hash is
updated for any addresses written by the instruction. The address_slots
structure is similar in concept to the reservation stations in Tomasulo’s
approach [13]; however, rather than having fixed stations holding Common Data
Bus (CDB) tags, address_slots has stations dynamically allocated for accessed
memory and register locations and holds predicted availability information.
[Figure 3.8 diagram: the address_slots hashed list structure, indexed by
address&0xff (0, 1, 2, ...), with a list chain holding the entries (%eax, 1)
and (%ebx, 2).]
Figure 3.8: The last slot where each address or register was written to is held in a hashed
linked list structure called address_slots in the 3S parallelism tool. The hash
index is an address consolidation with the list chains holding address details and
the slot where the address was last written. The above illustration corresponds
to the situation after the three instructions in the example of this section have
been parallelised.
As an example consider the assembly code below which simply adds the con-
tents of two memory locations (at fixed addresses a and b) using the intermediary
registers %eax and %ebx:
movl a, %eax
movl b, %ebx
addl %eax, %ebx
The first instruction has a read dependency on memory address a. Assuming
the instructions are at the start of a code section, the 3S parallelism tool
will look up the address a in its address hash and see that the address has not
been written to by this code section, and so the first instruction will be
placed in parallel slot 1 as illustrated in Figure 3.9. The tool will then
update the %eax entry in the address_slots hash depicted in Figure 3.8 to be
slot 1 so that any future instruction that needs to read %eax will be placed
after slot 1, which is where the register update is performed.
[Figure 3.9 diagram: slot map with slot 1 holding movl a, %eax and
movl b, %ebx and slot 2 holding addl %eax, %ebx. The instructions axis is
bounded by SLOTS_WIDTH and the slot axis by SLOTS_HEIGHT.]
Figure 3.9: Illustration of the instruction to slot mapping performed by the 3S parallelism
tool for the example assembly of this section. SLOTS_WIDTH and SLOTS_HEIGHT
are configuration parameters that allow the tool to simulate the number of
parallel execution units (ALUs) and the instruction window depth of a hardware
component.
A similar process will occur for the second register loading instruction. As
memory address b has not been written to by any instruction in the current code
section, the second assembly instruction will also be placed in slot 1, to run
in parallel with the first instruction, and the %ebx entry of the address_slots
hashed list will be set to slot 1.
The third instruction of the example is an assembly add instruction. The
instruction reads both %eax and %ebx and writes the sum to %ebx; consequently
there will be two entries in the parameter array sent by the 3S framework to
the tool: (%eax, r) and (%ebx, rw). As the 3S parallelism tool parses the
parameter array for the third instruction it will find the read dependencies on
%eax and %ebx, look up the last writer slots in the address_slots hash and
identify that the instruction’s dependencies are written in slot 1, meaning the
third instruction cannot execute until slot 2. The tool will then place the add
instruction in slot 2 and update the last writer slot for the output register
%ebx in the address_slots hash to slot 2.
The final address_slots hash and slot mappings produced by the 3S
parallelism tool are illustrated in Figures 3.8 and 3.9. The mappings show how
the tool splits the three sequential instructions of the example into two
parallel slots, the first slot having a width of two instructions (the two movl
instructions) and the second slot a width of one instruction (the addl
instruction).
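The walk-through above can be reproduced with a few lines of code. The following is a minimal sketch of As Soon As Possible slot placement, assuming each instruction arrives as a list of locations read and a list of locations written; the function name and the tuple layout are illustrative choices, and a plain dictionary stands in for the hashed linked list used by the tool.

```python
from collections import defaultdict


def asap_schedule(instructions):
    """Place each instruction in the first slot after all of its read dependencies.

    instructions: list of (text, reads, writes) tuples, where reads and writes
    are lists of register names or memory addresses.
    Returns (slot_map, address_slots).
    """
    address_slots = {}            # location -> slot where it was last written
    slot_map = defaultdict(list)  # slot number -> instructions placed there
    for text, reads, writes in instructions:
        # Locations not yet written in this code section behave as slot 0,
        # so an instruction with no tracked dependencies lands in slot 1.
        last_write = max((address_slots.get(r, 0) for r in reads), default=0)
        slot = last_write + 1
        slot_map[slot].append(text)
        for w in writes:
            address_slots[w] = slot
    return dict(slot_map), address_slots


example = [
    ("movl a, %eax",    ["a"],            ["%eax"]),
    ("movl b, %ebx",    ["b"],            ["%ebx"]),
    ("addl %eax, %ebx", ["%eax", "%ebx"], ["%ebx"]),
]
slot_map, address_slots = asap_schedule(example)
# slot_map holds both movl instructions in slot 1 and the addl in slot 2;
# address_slots ends with %eax last written in slot 1 and %ebx in slot 2.
```

Only read-after-write dependencies are tracked here, matching the description in this section.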
As mentioned earlier, the 3S parallelism tool has several configuration
options that allow a user to tune the tool’s operation. In Figure 3.9 two
important options called SLOTS_WIDTH and SLOTS_HEIGHT are illustrated.
The SLOTS_WIDTH option sets the maximum parallelism the tool will allow for
any slot and is similar in concept to the cycle width option of Wall’s ILP
quantification framework [76]. At the parallelism limit, any instruction that
could otherwise have been placed in a full slot is slid forward by the tool to
the next slot with space available, and the address_slots hash is set
accordingly for any addresses or registers written to by the instruction. The
SLOTS_WIDTH option can be used to make the 3S parallelism tool simulate the
parallelism available on real architectures limited in ALUs or read-port
widths. For example, for an Intel Core2Duo CPU [111] with four ALUs per core
SLOTS_WIDTH could be set to four, for an nVidia GeForce GTX 295 GPU [50] with
sixty parallel Streaming Multiprocessors SLOTS_WIDTH could be set to sixty, and
for a Xilinx Virtex-5 FPGA [27] with 207,360 five input two output LUTs the
SLOTS_WIDTH value could be set to 6,480 (207,360 / 32), assuming 32 LUTs per
CPU Instruction Equivalent (IE) and that data transfer requirements can be met.
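The effect of the width limit can be sketched as follows. This is an illustrative model only: the function name, the slots_width parameter name and the tuple layout are assumptions made for the example, and the slot map is held as a simple occupancy count rather than the tool's internal structures.

```python
def asap_schedule_width(instructions, slots_width):
    """ASAP placement with at most slots_width instructions per slot."""
    address_slots = {}  # location -> slot where it was last written
    slot_fill = {}      # slot number -> instructions already placed there
    placement = []
    for text, reads, writes in instructions:
        slot = max((address_slots.get(r, 0) for r in reads), default=0) + 1
        # Slide forward to the next slot with space available.
        while slot_fill.get(slot, 0) >= slots_width:
            slot += 1
        slot_fill[slot] = slot_fill.get(slot, 0) + 1
        placement.append((text, slot))
        # Later readers must wait for the slot actually used.
        for w in writes:
            address_slots[w] = slot
    return placement


example = [
    ("movl a, %eax",    ["a"],            ["%eax"]),
    ("movl b, %ebx",    ["b"],            ["%ebx"]),
    ("addl %eax, %ebx", ["%eax", "%ebx"], ["%ebx"]),
]
wide = asap_schedule_width(example, slots_width=4)    # slots 1, 1, 2
narrow = asap_schedule_width(example, slots_width=1)  # slots 1, 2, 3

# The FPGA setting quoted above: 207,360 LUTs at 32 LUTs per instruction equivalent.
fpga_slots_width = 207_360 // 32
```

With a width of one the second movl is slid to slot 2, which in turn pushes the dependent addl to slot 3, so the example degenerates to sequential execution.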
The SLOTS_HEIGHT configuration option sets the parallelism map height.
Conceptually, at the point the 3S parallelism tool attempts to insert an
instruction above the SLOTS_HEIGHT limit in the slot map, the tool slides the
slot map down and clears all memory dependencies attached to the slots slid off
the top of the slot map. In reality this is a complicated process which
requires link liveness tracking, the deferment of single window slides into
more efficient batches and other advanced programming techniques which
complicate the 3S parallelism tool implementation. The SLOTS_HEIGHT
configuration option is necessary to limit the time and memory complexity of
the tool at run time (particularly when profiling at the entire program level
of granularity) and to allow correct parallelism identification for CPUs and
other architectures with fixed size reorder buffers.
At program code section boundaries, and as slots are slid off the slot map, the
tool writes the parallelism information to the 3S symbol table. When the
program terminates, the tool summarises the symbol table entries to generate
parallelism reports like the report shown in Figure 3.10.
#### SPACEY STREAM SPLITTER (3S) PARALLELISM REPORT
block                     order  entries    ir  slots  min  max  average
main.0.0.0                    1        1    27      6    1   14     4.50
main.L15.0.0                  2       64  8192    704    1   23    11.64
main.L19.0.0                  3       63   126     63    2    2     2.00
main.L15.128.0                4        1     6      2    1    5     3.00
main.L15.133.call printf      5        1     1      1    1    1     1.00
main.L15.134.0                6        1     1      1    1    1     1.00
main.L15.134.call exit        7        1     1      1    1    1     1.00
Global block parallelism (min, avg, max): (1, 10.74, 23)
Slot widths (min, avg, max): (1, 10.74, 23)
Totals (slots, ir): (778, 8354)
Figure 3.10: 3S parallelism tool report for the Fibonacci benchmark discussed in Chapter 5
compiled with gcc -O3 and analysed at the basic block level. The order column
is the order in which each block was first entered in the control flow, the entries
column the number of times a block was executed and the ir column gives the
Instruction Request (IR) figure. The slots, min, max and average columns give
the total height, minimum, maximum and average widths for the parallelism
map for each code section over all iterations.
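The figures in the report can be cross-checked with a short calculation. The sketch below copies the ir and slots columns from Figure 3.10 and recomputes the averages and totals; reading the average column as ir divided by slots is an interpretation that the printed values are consistent with, not a statement of how the tool computes it.

```python
# (ir, slots) pairs copied from the rows of the report in Figure 3.10.
rows = {
    "main.0.0.0": (27, 6),
    "main.L15.0.0": (8192, 704),
    "main.L19.0.0": (126, 63),
    "main.L15.128.0": (6, 2),
    "main.L15.133.call printf": (1, 1),
    "main.L15.134.0": (1, 1),
    "main.L15.134.call exit": (1, 1),
}

# Average slot width per block: instruction requests divided by slots used.
averages = {block: round(ir / slots, 2) for block, (ir, slots) in rows.items()}

total_ir = sum(ir for ir, _ in rows.values())
total_slots = sum(slots for _, slots in rows.values())
global_average = round(total_ir / total_slots, 2)
```

The recomputed totals and global average agree with the last three lines of the report.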
3.5 Summary
This chapter presented an overview of the novel 3S instrumentation methodology,
the methods used to address modern characterisation issues and 3S tools currently
available. The 3S methodology combines static instrumentation with dynamic
characterisation as discussed in Section 3.2 and contributes increased:
• Efficiency: the 3S static instrumentation approach means there is no re-
peated instrumentation overhead on multiple program runs [64,65,66].
• Accuracy: the 3S dynamic characterisation means exact loop counts [63]
and exact data dependencies can be identified [63,64,65,66].
• Simplicity: by using assembly 3S leverages existing tools like gcc, requires
only context free parsing and allows human verification of its transforma-
tions [63,64,65,66].
The remainder of this thesis shows how 3S program characterisation measure-
ments can be used in heterogeneous partitioning and more information on 3S can
be found in Appendix A, the 3S paper [1] and in the 3S software distribution.
Chapter 4
Partitioning with Certainty
4.1 Introduction
This chapter breaks the heterogeneous partitioning problem down into three
smaller problems: selecting a heterogeneous execution model, calculating
computation and communication time estimates for the model, and optimally
assigning program segments to locations given the timing estimates. It presents
solutions to those problems through the contributions of:
• Section 4.2: A New Efficient Heterogeneous Execution Model.
• Section 4.3: General Execution Timing Equations.
• Section 4.4: Optimal Sequential Assignment Formalisation.
• Section 4.5: New Parallel Partitioning Approach.
Section 4.2 presents a new heterogeneous execution model [2, 3] that halves the
communication latencies of previous work [47, 81, 84, 89, 93]. Section 4.3 then
shows how the 3S characterisation information discussed in Chapter 3 can be used
with the new execution model to estimate computation and communication times
for heterogeneous architectures [1, 2, 3] and Section 4.4 introduces an optimal
formalisation for sequential assignment [2,3] and demonstrates the tractability of
the optimisation problem for practical benchmarks over an entire design-space.
In Section 4.5, the sequential assignment approach of Section 4.4 is used to
extend previous parallel partitioning work [97,100,101,102,103] to produce a new
multi-level partitioning approach with up to 64 times the acceleration potential
for a reconfigurable processor modelled on published data [3]. The multi-level
approach will lead us nicely to the question of partitioning under activation se-
quence uncertainty which is the topic of Chapter 5.
4.2 The Write-Only Architecture
Previous communication paradigms for heterogeneous systems can be thought of
as variations on the Remote Procedure Call (RPC) paradigm [107] for distributed
computing. In RPC a client program passes inputs to a remote procedure and
waits for the computational response. The RPC paradigm is simple to under-
stand, simple to implement through proxy libraries, portable and relatively effi-
cient if used for its intended purpose — so it is easy to see why RPC was adopted
as the basis for previous heterogeneous computing execution models.
However, the RPC paradigm was not designed as a general execution model.
RPC was designed specifically for two component distributed architectures where
waiting for a response does not incur extra communication latencies and where
service responses are going to be used by the caller and not forwarded on by
the caller to a third computational component. The moment we move to tightly
coupled busses or distributed architectures with more than 2 components, the
efficiency of the RPC paradigm breaks down and this is where the new Write-
Only Architecture (WOA) execution model comes in.
The Write-Only Architecture (WOA) is a heterogeneous execution model
based on control flows for sequential program partitions. In the WOA, an execut-
ing code segment passes the result of its calculation directly along the program
control flow path to the next execution code segment target rather than back
through an execution scheduler or message broker as in traditional RPC based
communication paradigms. The distinguishing feature of the WOA is that there
are no reads of data or control signals, only data writes along the control path.
The lack of reads makes WOA efficient for tightly coupled System on Chip (SoC)
busses and the removal of the need for a central communications controller makes
WOA efficient for architectures with more than two computational components.
Communication Paradigm        Response Latencies  Response Target
Client-Server [93,106]        4/2 (TCP/UDP)       caller
Custom Instructions [89,47]   3                   caller
Shared Memory [79,81,84]      5                   caller
Write-Only Architecture [3]   1                   control path
Table 4.1: Comparison of latencies and call response targets for communication paradigms.
Table 4.1 compares characteristics of the WOA against alternative
communication paradigms which are all based, to some extent, on RPC.
Client-Server communication paradigms like those of [93] often use TCP or UDP
as an underlying protocol. While UDP is relatively efficient, even UDP
Client-Server models suffer from up to twice the latency overhead of WOA per
control flow because of the need for a central message broker or nested call
stack as shown in Figure 4.1.
[Figure 4.1 diagrams: message sequence charts between components A, B and C
built from Call, Return, Activate and ACK messages, with the latency saving of
the WOA marked. Panel (a): Client-Server protocol. Panel (b): Communications
for the WOA.]
Figure 4.1: Communications for a traditional Client-Server protocol and the WOA on a three
component distributed architecture.
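The "up to twice" figure can be illustrated with a simple hop count. The sketch below is a deliberately simplified model, not a reproduction of Table 4.1: it counts one latency per message on the control flow path, ignores acknowledgements and protocol handshakes, and the function names and the component labels A, B and C are chosen only for the example.

```python
def woa_latencies(path):
    """WOA: one write per control flow step, sent directly to the next component."""
    return len(path) - 1


def hub_latencies(path, hub):
    """Hub model: every transfer is routed source -> hub -> target."""
    total = 0
    for src, dst in zip(path, path[1:]):
        # A step that starts or ends at the hub needs one message, otherwise two.
        total += 1 if hub in (src, dst) else 2
    return total


# Caller A invokes B and then C before control returns to A.
call_chain = ["A", "B", "C", "A"]
woa_chain = woa_latencies(call_chain)           # A->B, B->C, C->A
hub_chain = hub_latencies(call_chain, hub="A")  # A->B, B->A, A->C, C->A

# Control bouncing between two non-hub components shows the worst case.
ping_pong = ["B", "C", "B", "C"]
woa_ping_pong = woa_latencies(ping_pong)
hub_ping_pong = hub_latencies(ping_pong, hub="A")
```

In this model the hub never does better than the direct path, and the ratio reaches two when no step of the control flow touches the hub.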
In Custom Instruction architectures like those of [47, 89], a CPU passes data
to a custom instruction fabric using a write and then issues a read to block
the CPU and wait for the custom instruction’s response. This process requires
three latencies per call whereas the WOA implemented on the same architecture
requires only one latency per control flow saving up to one latency per call as
illustrated in Figure 4.2.
Shared Memory models like those of [79, 81, 84] are similar to Custom
Instruction models except that components read the data they need from shared
memory after activation instead of having the data pushed to them as part of
the activation signal. Shared Memory models are useful where the exact data
required by a custom instruction cannot be easily predicted ahead of time by
the caller; however, the models suffer from additional latencies as shown in
Table 4.1.
[Figure 4.2 diagrams: timelines of latency, data transfer and computation.
Panel (a), Communications for a Custom Instruction call: the client writes the
input data and sends a read request, the service computes and then replies to
the read. Panel (b), Communications for the WOA: write and suspend, compute,
write activate, with the latency saving marked.]
Figure 4.2: Communications for a traditional Custom Instruction protocol and the WOA on
a two component SoC architecture.
The WOA control flow activation mechanism can be implemented on tightly
coupled architectures using writes to memory mapped hardware, Symmetric
Multi-Processing (SMP) style cache snooping [6] and process spin-locking or
interrupts. In a distributed environment the WOA can be implemented using UDP.
Because the caller is not necessarily the response target, acknowledgement
messages can be sent off the main control flow path, providing communication
reliability at low cost as already illustrated in Figure 4.1(b).
One complication for a WOA implementation is coping with data dependent
control flows which would have previously been dealt with by the RPC caller
acting as a control flow hub. In the WOA control flow decisions are delegated
to individual code segments which select their appropriate response target at
service execution time. To deal with data dependent control flow paths in a
heterogeneous environment, all cross component WOA activations must carry
with them fine-grained identifiers for the correct code section at the new target
location to execute. Figure 4.3 shows the basic WOA activation packet which
includes a fine-grained target identifier in the packet header to address the data
dependent path problem.
[Figure 4.3 diagram: WOA Activate Packet Format. A Target ID header followed by
Data[0] ... Data[n], each row one packet width wide.]
Figure 4.3: The basic WOA activation packet sent between code sections. The packet width
is either the physical bus width or a nominal bit grouping.
The WOA activation packet’s data section can contain path context infor-
mation (for example the current loop count) as well as computational inputs
which a WOA implementation needs to be able to dynamically predict [92] at
each control flow step. Where it is not possible for a caller to predict the exact
data requirements of the next code segment, a WOA implementation can send
an entire computational context in cross component activations (intra-component
activations can send a context pointer) and would only need to resort to a Shared
Memory model [79,81,84] for communications where the context transfer time is
larger than the associated latency penalty.
Figure 4.4 illustrates how an if else data dependent branch would be im-
plemented using WOA activation packets in a heterogeneous architecture. The
WOA controller on the CPU (top) forwards each WOA activation packet sent
by the if else branch 1b implemented in hardware (bottom) to either 1a or 1c
depending on the Target ID field of the activation packet.
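The forwarding behaviour described above can be sketched in software. The example below is a toy model under stated assumptions: it uses a single controller object where Figure 4.4 shows separate JMP and MUX controllers, the class and function names are hypothetical, and the branch condition (the sign of the first data word) is invented purely to give section 1b a data dependent choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActivationPacket:
    target_id: str   # fine-grained identifier of the code section to execute
    data: List[int]  # path context and computational inputs


class WOAController:
    """Forwards each activation packet to the section named in its Target ID."""

    def __init__(self):
        self.sections = {}

    def register(self, target_id, section):
        self.sections[target_id] = section

    def run(self, packet):
        trace = []
        # Each section returns the next activation packet, or None to stop.
        while packet is not None:
            trace.append(packet.target_id)
            packet = self.sections[packet.target_id](packet.data)
        return trace


def section_1b(data):
    # The branch selects its own response target at execution time.
    target = "1a" if data[0] < 0 else "1c"
    return ActivationPacket(target, data)


def section_1a(data):
    return None  # end of the illustrated path


def section_1c(data):
    return None  # end of the illustrated path


controller = WOAController()
controller.register("1a", section_1a)
controller.register("1b", section_1b)
controller.register("1c", section_1c)

negative_path = controller.run(ActivationPacket("1b", [-3]))
positive_path = controller.run(ActivationPacket("1b", [5]))
```

The controller never inspects the data section; the routing decision is carried entirely by the Target ID written by the sending code section.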
Aside from directing WOA activation packets, WOA controllers allow for
the fine-grained serialisation of access to sequential components such as CPU
cores shared by multiple parallel tasks. Fine-grained serialisation of sequential
components is required to allow partial utilisation of parallelising components
in heterogeneous architectures for programs that are too large to fit as a single
coarse-grained unit on space constrained devices as will be shown in Section 4.5.
[Figure 4.4 diagram: the hardware block 1b sends {1a, data} or {1c, data}
activation packets, which the JMP (1a | 1c) controller directs to software
block 1a or 1c. The {1b, data} activations sent by 1a and 1c pass through the
MUX (1b) controller to 1b. Legend: latency, data transfer, computation; up to 1
latency saved.]
Figure 4.4: An illustration of how the WOA allows an if else statement implemented in
hardware (bottom) to activate different software blocks (top) using WOA con-
trollers (labelled JMP and MUX).
Hardware Characteristic                    [79]  [81]  [84]  [89]  WOA
$\beta_{lm}$    data bandwidth             no    no    no    yes   yes
$\lambda_{lm}$  communication latency      no    no    no    no    yes
$\tau_l$        cycle time                 no    no    no    no    yes
$\omega_l$      parallel execution units   no    no    no    no    yes
$p_l$           execution efficiency       no    no    no    no    yes
Table 4.2: Comparison of hardware characterisation metrics used in this work (WOA) and in
the computational models of previous heterogeneous partitioning work. The table’s
symbols are used in the equations of Section 4.3; the subscript p refers to a
code section and the subscripts l and m refer to computational locations.
Software Characteristic                        [79]  [81]  [84]  [89]  WOA
$\iota_p$     program code unit iterations     yes   yes   yes   yes   yes
$\eta_{pq}$   data flows                       no    no    yes   yes   yes
$\phi_{pl}$   parallel execution slots         no    no    no    yes   yes
$\chi_{pq}$   control flows                    no    no    no    no    yes
$\mu_{pr}$    execution cycle measurements     no    no    no    no    yes
Table 4.3: Comparison of software characterisation metrics used in this work (WOA) and in
the computational models of previous heterogeneous partitioning work. The table’s
symbols are used in the equations of Section 4.3; the subscripts p and q refer to
code sections and the subscripts l and r refer to computational locations.
4.3 WOA Timing Equations
The WOA execution model associates an activation transfer with each computa-
tion. The time required for each computation is thus a combination of computa-
tion time and communication time for the activation transfer:
$$\hat{t}_{\tau} = \hat{\mu}_{pl|\tau} + \hat{c}_{pqlm|\tau} \tag{4.1}$$
where $\hat{\mu}_{pl|\tau}$ is the time to execute code section p at location l
on activation $\tau$ and $\hat{c}_{pqlm|\tau}$ is the communication time for
activation $\tau$ between the code sections p and q, which can be located
either on the same computational component ($l = m$) or on different
computational components ($l \neq m$), in correspondence with the notation of
page 7.
When we sum over all activations, the total execution time of a program is
given by:
$$t = \sum_{pl} \mu_{pl} + \sum_{pqlm} c_{pqlm} \tag{4.2}$$
where $\mu_{pl}$ is the total computation time for each code section p assigned
to location l and $c_{pqlm}$ is the total communication time for WOA
activations sent between the program code sections assigned to their respective
locations.
Equation (4.2) divides the execution time for an assignment into separate fine-
grained computation and communication times composed over an entire execution
and can be applied to any architecture where fine-grained execution timing and
communication cost estimates are available. The 3S hotspot tool discussed in
Chapter 3 demonstrates the ability to measure fine-grained computation times
for proprietary complex components such as the Intel x86 CISC CPU and the
ability to measure and model communication costs is well established [105].
This research used the 3S hotspot fine-grained timing measurements com-
posed over an entire execution for the reference CPU µpr and modelled the com-
putation times of code segments on other hardware components with:
µpl 6=r =
ιpφplτl
l
(4.3)
which is simply the total number of parallel execution slots a program section
would take to execute at a location (the iteration count ιp multiplied by the
number of parallel slots per iteration φpl), obtained from the 3S callgrind and
parallelism tools, multiplied by the time required for each parallel slot to
execute (the cycle time of the component τl divided by the issue rate at
location l), obtained from hardware data sheets. Specialised loop pipelining and
SIMD gains are not accounted for. Despite this simplification, equation (4.3)
integrates over twice the characterisation information of previous work, as
shown in Tables 4.2 and 4.3, and allows the identification of great
heterogeneous acceleration potential for the benchmarks considered in this
work, as will be shown in Section 4.4.
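As an illustration only, the computation time model of equation (4.3) could be sketched in Python as follows. The function name, the argument names and the example values are assumptions made for this sketch and are not measurements from this work; the 2.90ns cycle time is the 345MHz figure used in the design points of this chapter.

```python
def modelled_computation_time(iterations, slots_per_iteration, cycle_time_s, issue_rate):
    """Sketch of equation (4.3): total parallel slots times the time per slot."""
    total_slots = iterations * slots_per_iteration  # iteration count times slots per iteration
    time_per_slot = cycle_time_s / issue_rate       # cycle time divided by the issue rate
    return total_slots * time_per_slot


# Hypothetical code section: 10,000 iterations of 3 parallel slots each on a
# component with a 2.90 ns cycle time and an assumed issue rate of 1.0.
mu_pl = modelled_computation_time(10_000, 3, 2.90e-9, 1.0)
print(f"{mu_pl * 1e6:.1f} microseconds")
```

The same function applies to any non-reference location once its cycle time and issue rate are known; the reference CPU times are measured rather than modelled.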
Like the computation times, the fine-grained communication times cpqlm re-
quired by equation (4.2) can be real measurements where available, but in this
work a general communication time modelling equation was used:
$$c_{pqlm} = \chi_{pq} \lambda_{lm} + \frac{\eta_{pq}}{\beta_{lm}} \tag{4.4}$$
which sums one hardware latency λlm for each of the χpq calls from node p on l
to node q on m, to cover the WOA Target ID header transfer, with the total
inter-node data transfers ηpq scaled by the communication bandwidth βlm. The
number of calls χpq and bytes transferred ηpq between each program section are
obtained from the 3S callgrind and data flow tools, and the hardware
characteristics from data sheets, as summarised in Table 4.4 below.
Equation (4.4) is general enough to cope with implementations that wrap even
intra-location control flows in a WOA packet (λll and βll can be non-singular)
and is valid for both direct and indirect data transfers, with the insight that
indirect data requirements can be encapsulated in the WOA packets sent by nodes
on the direct control flow path. The equation assumes that any WOA controller
overhead is included in the latency figure and does not account for shared bus
congestion.
Hardware Characteristic           Software Characteristic
Symbol   Source                   Source            Symbol
βlm      model [105]              3S callgrind      ιp
λlm      model [105]              3S data flow      ηpq
τl       data sheets [25,27]      3S parallelism    φpl
ωl       data sheets [25,27]      3S callgrind      χpq
pl       nominally 100%           3S hotspot        µpr
Table 4.4: Data sources for the WOA timing equations in this work. The
characteristic symbols are described in Tables 4.2 and 4.3.
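The communication model of equation (4.4) can be sketched in the same way. The snippet below is a minimal illustration: the latency and bandwidth values are those of the tightly and loosely coupled design points used in this chapter, 1GB/s is assumed to mean 10^9 bytes per second, and the call and byte counts are hypothetical.

```python
def communication_time(calls, bytes_transferred, latency_s, bandwidth_bytes_per_s):
    """Sketch of equation (4.4): one latency per call plus data volume over bandwidth."""
    return calls * latency_s + bytes_transferred / bandwidth_bytes_per_s


# Design point characteristics (assuming 1 GB/s = 1e9 bytes per second).
TIGHT = {"latency_s": 2.90e-9, "bandwidth_bytes_per_s": 1.38e9}
LOOSE = {"latency_s": 165e-9, "bandwidth_bytes_per_s": 3.12e9}

# Hypothetical flow between two code sections: 1,000 calls carrying 4,000 bytes.
calls, data_bytes = 1_000, 4_000
c_tight = communication_time(calls, data_bytes, **TIGHT)
c_loose = communication_time(calls, data_bytes, **LOOSE)
print(f"tight: {c_tight * 1e6:.2f} us, loose: {c_loose * 1e6:.2f} us")
```

For a flow like this, with many calls and little data, the latency term dominates the loosely coupled cost even though that design point has the higher bandwidth.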
It is perhaps worth noting before continuing that any change in the timing
equations is fully independent of the underlying 3S framework and tools [1]§3, the
WOA execution model [3]§4.2 and, in particular, all the partitioning formulations
that follow in this chapter and the next§5 which rely solely on the presence of
fine-grained computation and communication times and do not depend on how
the timing information was obtained. Further, it would be a simple matter to
replace the execution timing estimation equations (4.3) and (4.4) with equations
specialised for a particular architecture and a simpler matter still to use real
timing measurements to replace equations (4.3) and (4.4) altogether for software
where actual fine-grained measurements and WOA implementations are available
for all the components in a heterogeneous system [95].
4.4 Sequential Partitioning
Methodology
The execution time for a particular assignment of code segments to locations in
a heterogeneous architecture under the WOA execution model is:
$$t = \sum_{pl} \mu_{pl} + \sum_{pqlm} c_{pqlm}$$
from equation (4.2) as discussed in Section 4.3. For our purposes, the optimal
assignment is the assignment with the minimum execution time that satisfies the
hardware space constraints and is formally defined in Definition 4.4.1 below.
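Definition 4.4.1 Sequential Assignment Problem (SAP).
$$
\begin{aligned}
\min_{x \in X} \quad & \sum_{pl} \mu_{pl} x_{pl} + \sum_{pqlm} c_{pqlm} x_{pl} x_{qm} && (4.5) \\
\text{s.t.} \quad & x_{pr} = 1 \quad \forall p \in E && (4.6) \\
& \sum_{l} x_{pl} = 1 \quad \forall p && (4.7) \\
& \sum_{p} \delta_{pl} x_{pl} \le \Delta_l \quad \forall l && (4.8) \\
& x_{pl} = \begin{cases} 1 & \text{if } p \text{ is assigned to } l \\ 0 & \text{otherwise} \end{cases} && (4.9)
\end{aligned}
$$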
with p and q assignable code sections, l and m computational locations, r the
reference partition, E the set of nodes that must be run on r, µpl the execution
time of p if run at location l, cpqlm the cost of communications between p and q
on l and m, δpl the space required to assign code section p to location l, ∆l the
total space available at location l and xpl the assignment indicator variables to be
optimised as summarised in the notation of page 7.
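To make Definition 4.4.1 concrete, the sketch below enumerates every assignment of a tiny hypothetical instance and evaluates objective (4.5) subject to constraints (4.6) and (4.8). It is an exhaustive search for illustration only and is not the MILP approach used in this work; the data layout, the function name and all the numbers are assumptions.

```python
from itertools import product


def sap_brute_force(mu, c, delta, capacity, fixed_on_ref, ref=0):
    """Exhaustively solve a tiny SAP instance.

    mu[p][l] is the computation time of section p at location l,
    c[(p, q)][l][m] the communication time between p on l and q on m,
    delta[p][l] the space p needs at l and capacity[l] the space available at l.
    Constraint (4.7) holds by construction: each section gets exactly one location.
    """
    n_sections, n_locations = len(mu), len(capacity)
    best_time, best_assignment = float("inf"), None
    for assignment in product(range(n_locations), repeat=n_sections):
        # Constraint (4.6): some sections must stay on the reference location.
        if any(assignment[p] != ref for p in fixed_on_ref):
            continue
        # Constraint (4.8): space used at each location must fit its capacity.
        used = [0.0] * n_locations
        for p, l in enumerate(assignment):
            used[l] += delta[p][l]
        if any(used[l] > capacity[l] for l in range(n_locations)):
            continue
        # Objective (4.5): computation plus pairwise communication times.
        time = sum(mu[p][l] for p, l in enumerate(assignment))
        time += sum(cost[assignment[p]][assignment[q]] for (p, q), cost in c.items())
        if time < best_time:
            best_time, best_assignment = time, assignment
    return best_time, best_assignment


# Hypothetical instance: 3 code sections, location 0 = CPU (reference), 1 = FPGA.
mu = [[1.0, 1.0], [10.0, 1.0], [8.0, 2.0]]
c = {(0, 1): [[0.0, 0.5], [0.5, 0.0]], (1, 2): [[0.0, 0.5], [0.5, 0.0]]}
delta = [[0, 1], [0, 2], [0, 2]]
capacity = [float("inf"), 3]
best_time, best_assignment = sap_brute_force(mu, c, delta, capacity, fixed_on_ref={0})
print(best_time, best_assignment)
```

In this instance placing both of the last two sections on the FPGA would be fastest but violates the size constraint, so the search settles for moving only the middle section. The enumeration grows exponentially with the number of sections and is only usable for toy instances.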
The quadratic xplxqm terms in the objective (4.5) can be replaced with new
ypqlm ∈ {0, 1} variables constrained using Lemma 4.4.2 [4] below to convert the
Sequential Assignment Problem (SAP) into a Mixed Integer Linear Program
(MILP) solvable with standard solvers such as CPLEX [112].
Lemma 4.4.2 Boolean multiplication is equivalent to logical AND
($y_{pqlm} = x_{pl} \wedge x_{qm}$ with $x_{pl}, x_{qm}, y_{pqlm} \in \mathbb{B}$) and can be
expressed as the linear programming constraints:
$$x_{pl} + x_{qm} - 1 \le 2 y_{pqlm} \le x_{pl} + x_{qm} \tag{4.10}$$
Proof is through the logic table below.
xpl   xqm   xpl + xqm − 1   xpl + xqm   2ypqlm   ypqlm
0     0     −1              0           0        0
0     1     0               1           0        0
1     0     0               1           0        0
1     1     1               2           2        1
Corollary 4.4.3 With the minimisation objective (4.5), the constraints can be
relaxed to:
$$x_{pl} + x_{qm} - 1 \le y_{pqlm} \tag{4.11}$$
where 0 ≤ ypqlm ≤ 1, because the SAP constants cpqlm are all non-negative.
However, in this work the relaxation was not necessary, as computational
complexity issues were not seen, and the general quadratic removal form of
equation (4.10) was used.
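The lemma and its corollary can be checked mechanically. The short Python sketch below enumerates the four Boolean input combinations and confirms that constraint (4.10) admits only y = x_pl · x_qm, and that the smallest y allowed by the relaxed constraint (4.11), which a minimiser with non-negative costs would select, equals the same product. The function names are illustrative.

```python
from itertools import product


def feasible_y_binary(x_pl, x_qm):
    """Binary y values satisfying x_pl + x_qm - 1 <= 2y <= x_pl + x_qm (4.10)."""
    return [y for y in (0, 1) if x_pl + x_qm - 1 <= 2 * y <= x_pl + x_qm]


def relaxed_min_y(x_pl, x_qm):
    """Smallest y in [0, 1] satisfying the relaxed constraint (4.11)."""
    return max(0, x_pl + x_qm - 1)


for x_pl, x_qm in product((0, 1), repeat=2):
    # Exactly one feasible y, and it is the Boolean product.
    assert feasible_y_binary(x_pl, x_qm) == [x_pl * x_qm]
    # The relaxation is tight when y is pushed to its lower bound.
    assert relaxed_min_y(x_pl, x_qm) == x_pl * x_qm
print("linearisation verified for all four input combinations")
```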
Although this research arrived at Definition 4.4.1 independently through the
WOA, referring to Section 2.4 of Chapter 2 it is clear that SAP is a Generalized
Quadratic Assignment Problem (GQAP) with multiple locations, location depen-
dent node costs µpl, quadratic communication costs cpqlm and size constraints
∆l. Thus as GQAP is known to be strongly NP-hard [148] so is SAP and, de-
pending on the problem instance characteristics, SAP may not be tractable for
problems of even moderate size [128, 136, 141, 142, 145]§2.4. Fortunately, for the
benchmarks and architectures considered in this work the SAP problem instance
characteristics lend themselves to solution space pruning with Branch and Bound
(B&B) [123] and can be solved optimally in just a few seconds.
However, in appreciating that there may be practical benchmarks and archi-
tectures that produce SAP problem instances that are harder to solve optimally
than the problems considered in this work, an O(n) time complexity regret min-
imisation [136, 139] heuristic called the Attractiveness Partitioning Algorithm
(APA) was created as part of this research. The heuristic algorithm is an order
of magnitude faster than CPLEX with only a 15% average optimality penalty
for the problems considered in this work. But, with the optimal SAP results
taking less than 14 seconds for the complete design-space analysis of multiple
benchmarks and compilation options presented in this chapter, the sub-optimal
APA heuristic will not be discussed further in this document and the reader is
referred to the paper [2]§E for more information.
Results
This section provides optimal SAP results for MiBench 1.0 [56] software bench-
marks partitioned for the two component heterogeneous architecture of Figure 4.5
over a wide range of component and communication characteristics. Fine-grained
assignments are performed at the program basic block level with crc32 being the
smallest of the benchmarks with 22 active basic blocks and jpeg the largest with
1792 active basic blocks. Results are provided as accelerations defined as the
ratio of the execution time for the benchmark running on the reference CPU
alone to the optimal SAP partitioned execution time rather than abstract timing
or cycle figures. All the 3S software characterisation information was gathered
on a reference Intel Pentium 4 x86 machine in the Imperial computer labs and
architectural information is from data sheets [25,27] and published papers [105].
[Block diagram labels: x86 CPU, FPGA Coprocessor, WOA Controller (two
instances), Memory, WOA Packets, Memory Access.]
Figure 4.5: The heterogeneous architecture examined in this section.
To demonstrate the practical utility and tractability of SAP, three types of
reports are provided. The first shows high-level acceleration opportunities for
different binary forms and fixed points in a design-space, the second illustrates
the detailed SAP assignments graphically and the third provides sensitivity infor-
mation for benchmarks over a range of architectural characteristics. The results
presented in this section were generated in just 13.69 seconds using a
quad-thread 64-bit version of CPLEX 11 on the orion server at Imperial (2.2GHz
Dual Core Opteron machine), supporting the tractability assertion for optimal
SAP on practical software assignment problem instances.
High-Level Acceleration Opportunities
The SAP high-level acceleration reports quantify acceleration opportunities for
available program forms and fixed points in the architectural design-space. The
high-level reports can be used to support investment in a particular architecture
or to justify a more detailed investigation of the design-space characteristics of a
program set if performance requirements cannot be met with existing hardware.
Figures 4.6 and 4.7 compare the SAP program accelerations for six
MiBench [56] benchmarks compiled with gcc [45] using three different optimi-
sation levels when partitioned for the two component architecture of Figure 4.5.
Figure 4.6 is for the tightly coupled architectural characteristics of Table 4.5 and
Figure 4.7 for the loosely coupled characteristics.
                                   Tightly Coupled       Loosely Coupled
Characteristic                     CPU       FPGA        CPU       FPGA
βlm  data bandwidth                     1.38GB/s              3.12GB/s
λlm  bus latency                        2.90ns                165ns
τl   cycle time                    2.90ns    2.90ns      0.5ns     2.90ns
ωl   parallel execution units      4         256         4         256
pl   execution efficiency          100%      100%        100%      100%
∆l   hardware size capacity        ∞         256         ∞         256
Table 4.5: Architectural characteristics for the tightly and loosely coupled
design points discussed in this section. The tightly coupled characteristics
correspond to a reconfigurable SoC operating at 345MHz [25] with a 32-bit
internal bus and the loosely coupled characteristics to a 2GHz Pentium 4
coupled with a 345MHz FPGA [27] through a HyperTransport connection with
latencies described in [105].
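The cycle times and the tightly coupled bandwidth of Table 4.5 follow from the clock rates and bus width given in its caption. The sketch below reproduces that arithmetic under the assumption that the internal bus completes one 32-bit transfer per clock cycle; the function names are illustrative.

```python
def cycle_time_ns(clock_hz):
    """Cycle time in nanoseconds for a given clock frequency."""
    return 1e9 / clock_hz


def bus_bandwidth_gb_per_s(width_bits, clock_hz):
    """Peak bandwidth in GB/s assuming one full-width transfer per cycle."""
    return (width_bits / 8) * clock_hz / 1e9


soc_cycle = cycle_time_ns(345e6)                # 345 MHz SoC and FPGA
cpu_cycle = cycle_time_ns(2e9)                  # 2 GHz Pentium 4
soc_bus = bus_bandwidth_gb_per_s(32, 345e6)     # 32-bit internal bus at 345 MHz
print(f"{soc_cycle:.2f} ns, {cpu_cycle:.1f} ns, {soc_bus:.2f} GB/s")
```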
The considerable difference between the acceleration potentials of the bench-
marks is a result of two factors: the parallelism available in the program and the
proportion of the program’s computation that can be relocated away from the
CPU in correspondence with Amdahl’s law [6]. For example, the crc32 benchmark
spends 29.3% of its time in file related system calls which cannot be
relocated to the FPGA, which limits the maximum possible speed-up of crc32 to:
$$\frac{\text{Reference CPU Cycles}}{\text{Fixed CPU Cycles}} = 3.412 \text{ times}$$
This value is in fact quite close to the actual speedup shown in Figure 4.6 of 3.261
times acceleration for crc32. The only benchmarks in Figure 4.6 which do not
come close to their Amdahl’s limits are jpeg and sha which are both restricted
from reaching their theoretical bounds because of the FPGA size constraints of
Table 4.5.
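The Amdahl bound quoted above is simply the reciprocal of the fraction of execution that must remain on the CPU. A minimal Python sketch of the calculation is given below; it uses the rounded 29.3% figure, so the result agrees with the 3.412 quoted from exact cycle counts only to about two decimal places. The function names are illustrative.

```python
def amdahl_limit(fixed_fraction):
    """Upper bound on speed-up when only the non-fixed part can be accelerated."""
    return 1.0 / fixed_fraction


def amdahl_speedup(fixed_fraction, relocated_acceleration):
    """Speed-up when the relocatable part runs relocated_acceleration times faster."""
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / relocated_acceleration)


crc32_limit = amdahl_limit(0.293)
print(f"crc32 limit: {crc32_limit:.2f} times")
# The bound is approached as the relocated code becomes arbitrarily fast.
print(f"with 1000x relocated acceleration: {amdahl_speedup(0.293, 1000):.2f} times")
```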
The impact of the different architectural configurations and compiler options
is clearly visible in the figures with the tightly coupled design point providing
superior acceleration potentials in all cases and the -O1 compiler option hav-
ing a lower acceleration potential than either the -O2 or -O3 options for some
benchmarks supporting the results of [85]. The effect of compiler optimisation on
acceleration performance is governed by the interaction between loop unrolling
and Instruction Level Parallelism (ILP). The loop unrolling performed at higher
compiler optimisation levels increases the average block size, which reduces
the opportunity for fine-grained acceleration improvements. However, the
increased node size can be counteracted by increased node parallelism where the
unrolled loops of a benchmark have limited internal data dependence, which
increases the acceleration potential for the susan and dijkstra benchmarks.
                Blocks               TC Solution Times (s)   LC Solution Times (s)
Benchmark       O1     O2     O3     O1     O2     O3       O1     O2     O3
crc32           22     23     23     0.00   0.00   0.00     0.00   0.00   0.00
jpeg            1611   1640   1792   0.73   0.64   0.50     2.59   1.46   3.16
stringsearch    43     45     38     0.01   0.01   0.01     0.01   0.01   0.01
sha             72     70     67     0.02   0.04   0.03     0.02   0.04   0.02
susan           153    153    148    0.03   0.03   0.03     0.04   0.04   0.03
dijkstra        69     68     116    0.01   0.01   0.01     0.01   0.01   0.02
Table 4.6: The number of fine-grained basic blocks (Blocks) in each benchmark
program when compiled with the specified gcc optimisation level and the
corresponding solution times in seconds from the CPLEX log. All timing
measurements were made with a quad-thread 64-bit version of CPLEX 11 using the
default settings on the Imperial orion server which has two Dual Core AMD
Opteron 275 CPUs operating at 2.2GHz and 4GB of memory. The TC columns provide
the timings for the tightly coupled partitions and the LC columns are for the
loosely coupled partitions.
Table 4.6 presents the CPLEX 11 solve times for the tightly coupled (TC)
and loosely coupled (LC) results of Figures 4.6 and 4.7. The tightly coupled
partitions have lower cross-partition communication costs and are closer to the
pseudo-polynomial time Knapsack problem than the loosely coupled assignments
for which quadratic communication costs are more significant and this distinction
is reflected in the CPLEX solution timings. However, despite the more significant
communication costs in the loosely coupled problem, the only solution where
CPLEX had to resort to Branch and Bound was the jpeg partition compiled
with gcc -O3 where the 1792 basic block problem was solved in 3.16 seconds
using 12 Branch and Bound tree nodes.
[Bar chart: SPEED-UP (x), axis 0 to 40, for the crc32, jpeg, stringsearch, sha,
susan and dijkstra benchmarks at the -O1, -O2 and -O3 optimisation levels.]
Figure 4.6: SAP accelerations for MiBench benchmarks compiled with different
gcc compiler optimisation levels and partitioned for the tightly coupled
architectural design point of Table 4.5.
[Bar chart: SPEED-UP (x), axis 0 to 7, for the crc32, jpeg, stringsearch, sha,
susan and dijkstra benchmarks at the -O1, -O2 and -O3 optimisation levels.]
Figure 4.7: SAP accelerations for MiBench benchmarks compiled with different
gcc compiler optimisation levels and partitioned for the loosely coupled
architectural design point of Table 4.5.
Assignment Reports
The SAP assignment reports show the fine-grained assignments for a benchmark
on the components of a particular architecture. Assignment information is pro-
vided in a graphical report and source code partition mapping files by the current
SAP implementation. Designers can use the assignment reports with either au-
tomated cross compilation tools [46, 47] or manual processes to implement the
SAP assignments.
Figure 4.8 shows the graphical assignment reports for the MiBench
stringsearch benchmark compiled with gcc -O3 and partitioned for the tightly
(top) and loosely (bottom) coupled architectural characteristics of Table 4.5.
Double circles are used for the start node and the special node B00 which rep-
resents static or externally initialised data. Dashed lines are control flows and
solid lines data flows. Calls to external code are constrained to be from the CPU
partition and may return data through B00. The control flow starts at B01 and
ends at B39 both of which must be on the CPU partition.
All nodes and lines are shaded in proportion to their computation or commu-
nication time requirements and control flows may be composed with data flows
to make WOA packets. The greatest data flow is 13904 bytes amassed over 3476
iterations between nodes B18 and B31 which are in a computationally intensive
loop and the greatest control flow is from B10 to itself through a tight loop body
consisting of 339660 iterations.
The number of cross-partition control flows is 14800 for the tightly coupled
architecture shown (top) and drops to 5418 for the loosely coupled, higher latency,
architectural design point (bottom). As the communication costs become more
significant, more nodes are placed on the CPU around the constrained external
call nodes increasing from 12 on the tightly coupled partition to 25 for the loosely
coupled partition.
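The cross-partition control flow counts quoted above can be derived from an assignment and a list of control flow edges. The following sketch shows the calculation on a small hypothetical graph; the node names, call counts and assignments are invented for illustration and do not correspond to the stringsearch report of Figure 4.8.

```python
def cross_partition_control_flows(control_flows, assignment):
    """Sum the call counts of control flows whose endpoints are on different components."""
    return sum(
        count
        for (source, target), count in control_flows.items()
        if assignment[source] != assignment[target]
    )


# Hypothetical control flow graph: (source, target) -> number of calls.
control_flows = {
    ("A", "B"): 1,
    ("B", "C"): 400,
    ("C", "C"): 10_000,  # a tight self loop never crosses a partition boundary
    ("C", "D"): 20,
    ("D", "E"): 300,
    ("E", "F"): 1,
}
# Two hypothetical assignments: the second keeps node E on the CPU.
tight = {"A": "CPU", "B": "CPU", "C": "FPGA", "D": "CPU", "E": "FPGA", "F": "CPU"}
loose = {"A": "CPU", "B": "CPU", "C": "FPGA", "D": "CPU", "E": "CPU", "F": "CPU"}

tight_crossings = cross_partition_control_flows(control_flows, tight)
loose_crossings = cross_partition_control_flows(control_flows, loose)
print(tight_crossings, loose_crossings)
```

Keeping one more node on the CPU removes two boundary crossings in this toy graph, which mirrors the trend described above as latency becomes more significant.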
[Graphical assignment reports: CPU and FPGA partitions containing the basic
block nodes B00 to B39 for each of the two design points.]
Figure 4.8: SAP assignments for the MiBench stringsearch program compiled with
gcc -O3 and partitioned for the tightly (top) and loosely (bottom) coupled
design points of Table 4.5.
Sensitivity Reports
The SAP sensitivity reports plot the acceleration potential of programs for dif-
ferent architectural configurations. The reports allow a designer to sweep design-
space parameters over a range to assess the sensitivity of a program to run-time
factors such as shared bus congestion or to identify the hardware characteristics
that would be required to deliver performance expectations.
[Line plot: SPEED-UP (x), axis 0 to 40, against PARTITION SIZE (Instructions)
for the dijkstra, susan and sha benchmarks.]
Figure 4.9: SAP accelerations for MiBench benchmarks compiled with gcc -O3 and
partitioned for heterogeneous architectures with different hardware size
capacities.
Figures 4.9, 4.10 and 4.11 show the effect of reconfigurable fabric size, inter-
component bandwidth and latency on partition performance for the dijkstra,
susan and sha benchmarks using the tightly coupled characteristics of Table 4.5
as a basis. The bandwidth and latency figures cover the full range of architecture
characteristics from single chip SoCs to distributed architectures, demonstrating
the general tractability of SAP with optimal solutions to each problem instance
found in only a few seconds with CPLEX 11.
From Figure 4.9 a designer could conclude that the sha performance is re-
stricted for FPGAs below 512 Instruction Equivalents in size. From Figure 4.10
a designer can see that sha and susan are both bandwidth sensitive and from
Figure 4.11 a designer could justify a focus on relative heterogeneous compo-
nent placement to reduce latency when attempting to accelerate either the sha
or dijkstra benchmarks.
[Line plot: SPEED-UP (x), axis 0 to 40, against BANDWIDTH (MB/s), axis 0.1 to
10000, for the dijkstra, susan and sha benchmarks.]
Figure 4.10: SAP accelerations for MiBench benchmarks compiled with gcc -O3 and
partitioned for heterogeneous architectures with different bandwidths.
[Line plot: SPEED-UP (x), axis 0 to 40, against LATENCY (ns), axis 1 to
1000000, for the dijkstra, susan and sha benchmarks.]
Figure 4.11: SAP accelerations for MiBench benchmarks compiled with gcc -O3 and
partitioned for heterogeneous architectures with different latencies.
4.5 Parallel Partitioning
Methodology
In [3] a new form of partitioning called the Multi-level Assignment Prob-
lem (MAP) was introduced. MAP integrates fine-grained sequential assign-
ment approaches like SAP with coarse-grained parallel partitioning approaches
like [77,97,100,101,103] to deliver better performance than either approach alone.
Table 4.7 summarises the differences between MAP and previous approaches.
Feature                  Parallel   Sequential   MAP
Graph Type               DAG        CDFG         DAG/CDFG
Granularity              Coarse     Fine         Multi-level
Exploits Parallelism     ✓          ✗            ✓
Optimal Heterogeneity    ✗          ✓            ✓
Table 4.7: Feature comparison for previous parallel and sequential partitioning
methods and the new multi-level approach.
As discussed in Chapter 2, traditional parallel approaches identify parallel
tasks and assign them to execution locations as coarse-grained units. Coarse-
grained assignment limits the potential for heterogeneous acceleration because
if a coarse-grained program section is too large to fit on a hardware component
then it cannot be placed there.
The new multi-level approach removes the issues associated with coarse-
grained parallel assignment. The approach is summarised in Figure 4.12 and, like
traditional parallel partitioning methods, starts with a set of coarse-grained tasks
that can run in parallel. However unlike traditional parallel partitioning, MAP
uses fine-grained characterisation information to split the coarse-grained tasks
into smaller code sections for optimal fine-grained assignment to shared hardware
components. Thus, as each task in the parallel task set is sequential by definition
(the parallelism has been extracted into the task grouping), the sequential SAP
algorithm presented in Section 4.4 or the algorithms of other authors in sequential
heterogeneous assignment [79,83,89,90,91] can replace the coarse-grained assign-
ment used by previous authors in parallel partitioning [97,100,101,102,103].
[Four panels, each showing a Task DAG, an Intra-task CFG, the Fine-grained
Assignment and the Optimal Timing:
(a) Coarse-grained parallel tasks are identified.
(b) Fine-grained characterisation information is generated for each task.
(c) Tasks are assigned to shared locations on a fine-grained basis.
(d) The fine-grained assignments with the best timings are found.]
Figure 4.12: The multi-level approach of combining coarse-grained parallel task
partitioning with fine-grained sequential assignment.
For any set of parallel tasks the multi-level approach is required to find the
fine-grained assignment of all task code segments to shared execution locations
that minimises the heterogeneous execution time for the slowest task in the
parallel set. With only parallel execution components, the best parallel
partition for a set of tasks is the solution to the OMAP problem defined below,
which uses the standard WOA timing equation from Section 4.3 and can be thought
of simply as a wrapper for multiple SAP sub-problems subject to shared space
constraints.
Definition 4.5.1 Optimistic Multi-level Assignment Problem (OMAP).
$$
\begin{aligned}
\min \quad & t_{\parallel} && (4.12) \\
\text{s.t.} \quad & t_{\parallel} \ge t^{(i)} \quad \forall i \in T && (4.13)
\end{aligned}
$$
where T is the parallel task set, i a task identifier, $t^{(i)}$ the WOA execution
time for a task as defined in Section 4.3 and $t_{\parallel}$ no less than the longest
execution time of the individual tasks.
However, with both sequential and parallel computational components, the best
execution time that can be guaranteed independently of sequential scheduling
conflicts is the solution to the RMAP problem of Definition 4.5.2 below, which
requires execution times divided into sequential $t^{(i)}_{\perp}$ and parallel
$t^{(i)}_{\parallel}$ parts.
Definition 4.5.2 Robust Multi-level Assignment Problem (RMAP).
$$
\begin{aligned}
\min \quad & t_{\parallel} + t_{\perp} && (4.14) \\
\text{s.t.} \quad & t_{\parallel} \ge t^{(i)}_{\parallel} \quad \forall i \in T && (4.15) \\
& t_{\perp} = \sum_{i \in T} t^{(i)}_{\perp} && (4.16)
\end{aligned}
$$
where T is the parallel task set, i a task identifier, $t^{(i)}_{\perp}$ and
$t^{(i)}_{\parallel}$ the times task i spends executing on sequential and parallel
components respectively, $t_{\parallel}$ no less than the longest parallel execution
of the individual tasks and $t_{\perp}$ the sum of the serial execution times of the
tasks.
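For fixed assignments, the OMAP and RMAP objectives reduce to simple aggregations over the task times. The sketch below evaluates both for a set of hypothetical per-task parallel and sequential times; the task names and values are assumptions chosen only to show how the two objectives differ.

```python
def omap_objective(task_times):
    """Objective (4.12): the longest execution time of any task in the set."""
    return max(task_times.values())


def rmap_objective(parallel_times, sequential_times):
    """Objective (4.14): longest parallel part plus the sum of sequential parts."""
    return max(parallel_times.values()) + sum(sequential_times.values())


# Hypothetical per-task times in seconds, split by component type.
parallel_times = {"task1": 0.30, "task2": 0.25, "task3": 0.10}
sequential_times = {"task1": 0.05, "task2": 0.08, "task3": 0.07}
total_times = {i: parallel_times[i] + sequential_times[i] for i in parallel_times}

optimistic = omap_objective(total_times)                     # assumes no conflicts
robust = rmap_objective(parallel_times, sequential_times)    # serialises shared parts
print(f"optimistic {optimistic:.2f} s, robust {robust:.2f} s")
```

The robust figure is never smaller than the optimistic one for the same assignment, because every task's sequential time is charged in full rather than overlapped.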
Dividing the WOA execution time into parallel and sequential parts is
straightforward once the computational components and communication channels of
the hardware have been separated into sequential and parallel sets. Examples of
components and communication channels in the sequential set include CPU cores
and shared buses. Examples of parallel components and communication channels
include FPGAs and task dedicated hardware inter-connects.
Letting L⊥ denote the sequential subset of the computational components
and M⊥ the shared communication channels, with reference to the notation of
page 7 the WOA execution time for an isolated task i can be respecified as:
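$$t^{(i)}_{\parallel} = \sum_{pl \notin L_{\perp}} \mu^{(i)}_{pl} + \sum_{pqlm \notin M_{\perp}} c^{(i)}_{pqlm} \tag{4.17}$$
$$t^{(i)}_{\perp} = \sum_{pl \in L_{\perp}} \mu^{(i)}_{pl} + \sum_{pqlm \in M_{\perp}} c^{(i)}_{pqlm} \tag{4.18}$$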
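Equations (4.17) and (4.18) can be read as a classification of each computation and communication term by the kind of resource it uses. A minimal Python sketch for a single task with a fixed assignment follows. It assumes that membership of a communication term in the shared set is determined by its (l, m) location pair; the dictionary layout and all the values are hypothetical.

```python
def split_task_time(mu, c, assignment, sequential_locations, shared_channels):
    """Split one task's WOA time into parallel and sequential parts.

    mu[(p, l)] and c[(p, q, l, m)] hold the task's computation and communication
    times and assignment[p] gives the location chosen for code section p.
    """
    t_parallel = t_sequential = 0.0
    for p, l in assignment.items():
        if l in sequential_locations:
            t_sequential += mu[(p, l)]
        else:
            t_parallel += mu[(p, l)]
    for (p, q, l, m), cost in c.items():
        # Only the communication term matching the chosen locations is incurred.
        if assignment[p] == l and assignment[q] == m:
            if (l, m) in shared_channels:
                t_sequential += cost
            else:
                t_parallel += cost
    return t_parallel, t_sequential


# Hypothetical task with two code sections on a CPU plus FPGA system.
mu = {("a", "cpu"): 0.02, ("a", "fpga"): 0.01, ("b", "cpu"): 0.30, ("b", "fpga"): 0.04}
c = {
    ("a", "b", "cpu", "cpu"): 0.0,
    ("a", "b", "cpu", "fpga"): 0.005,
    ("a", "b", "fpga", "cpu"): 0.005,
    ("a", "b", "fpga", "fpga"): 0.001,
}
assignment = {"a": "cpu", "b": "fpga"}
t_parallel, t_sequential = split_task_time(
    mu, c, assignment,
    sequential_locations={"cpu"},
    shared_channels={("cpu", "fpga"), ("fpga", "cpu")},  # assumed shared bus
)
print(t_parallel, t_sequential)
```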
With the fine-grained sequential execution time for each task now separated de-
pending on location characteristics, equations (4.17) and (4.18) can be used to
expand Definition 4.5.2 and the size constraints of the hardware locations added
in a similar fashion to the SAP problem of Section 4.4 to produce the expanded
Robust Multi-level Assignment Problem (eRMAP) below.
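Definition 4.5.3 Expanded Robust Multi-level Assignment Problem (eRMAP).
$$
\begin{aligned}
\min_{x \in X} \quad & t_{\parallel} + t_{\perp} && (4.19) \\
\text{s.t.} \quad & t_{\parallel} \ge t^{(i)}_{\parallel} \quad \forall i \in T && (4.20) \\
& t_{\perp} = \sum_{i \in T} t^{(i)}_{\perp} && (4.21) \\
& t^{(i)}_{\parallel} = \sum_{pl \notin L_{\perp}} \mu^{(i)}_{pl} x^{(i)}_{pl} + \sum_{pqlm \notin M_{\perp}} c^{(i)}_{pqlm} x^{(i)}_{pl} x^{(i)}_{qm} && (4.22) \\
& t^{(i)}_{\perp} = \sum_{pl \in L_{\perp}} \mu^{(i)}_{pl} x^{(i)}_{pl} + \sum_{pqlm \in M_{\perp}} c^{(i)}_{pqlm} x^{(i)}_{pl} x^{(i)}_{qm} && (4.23)
\end{aligned}
$$
where x ∈ X is a feasible assignment of fine-grained code sections to locations
for all coarse-grained tasks T satisfying the constraints:
$$
\begin{aligned}
& x^{(i)}_{pr} = 1 \quad \forall p \in E && (4.24) \\
& \sum_{l} x^{(i)}_{pl} = 1 \quad \forall p, i && (4.25) \\
& \sum_{pi} \delta^{(i)}_{pl} x^{(i)}_{pl} \le \Delta_l \quad \forall l && (4.26) \\
& x^{(i)}_{pl} = \begin{cases} 1 & \text{if } p \text{ from task } i \text{ is assigned to } l \\ 0 & \text{otherwise} \end{cases} && (4.27)
\end{aligned}
$$
with i ∈ T a task identifier, p and q assignable code sections in a task, l and
m computational locations, r the reference partition, E the set of nodes that
must be run on r, $L_{\perp}$ and $M_{\perp}$ the subsets of sequential computational
components and shared communication channels, $x^{(i)}_{pl}$ the assignment
indicator variables to be optimised and other variables the task indexed
versions of their respective equivalents in the sequential SAP problem of
Section 4.4 summarised in the notation of page 7.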
Like the SAP problem, the quadratic terms in the eRMAP can be replaced
by new variables constrained using Lemma 4.4.2 [4]:
$$x^{(i)}_{pl} + x^{(i)}_{qm} - 1 \le 2 y^{(i)}_{pqlm} \le x^{(i)}_{pl} + x^{(i)}_{qm} \tag{4.28}$$
to convert the expanded Multi-level Assignment Problem (MAP) forms into
equivalent Mixed Integer Linear Programs (MILPs) solvable with standard solvers
such as CPLEX [112]. As MAP can be viewed as a combination of SAP problems,
the tractability of MAP has already been dealt with§4.4 and the remainder of
this chapter presents solutions to MAP that prove the benefit of the combined
multi-level approach over the sequential and traditional parallel alternatives.
Results
This section presents results for the assignment of three tasks to share the hard-
ware resources of Figure 4.5 for parallel execution. The three tasks are the
MiBench dijkstra, susan and sha benchmarks compiled with gcc -O3 consid-
ered for sensitivity analysis in the sequential results of Section 4.4. Referring to
Figure 4.6 it can be seen that these three benchmarks have the highest of the O3
accelerations and thus the most to lose by sharing the FPGA resources. The
hardware to be shared is the tightly coupled architecture of Table 4.5 with an
FPGA size of 256 Instruction Equivalents.
Task         Traditional   SAP      eOMAP    eRMAP
dijkstra     15.049        0.4295   0.4306   0.4349
susan        9.1977        0.3385   0.4075   0.3406
sha          3.4849        0.1892   0.2691   0.2067
Best Time    27.732        0.9572   0.4306   0.5442
Table 4.8: Performance times for three MiBench benchmarks assigned using
traditional coarse-grained parallel methods (Traditional), optimal sequential
assignment as isolated tasks (SAP) and optimistic (eOMAP) and robust (eRMAP)
parallel multi-level assignment. The Best Time row is the serial sum of the
task times for the Traditional and SAP approaches, the maximum of the three
shared task times for the eOMAP approach and a combination of 0.2193 seconds
serial and 0.3248 seconds maximum parallel execution for eRMAP.
Table 4.8 gives the assignment objectives for the shared architecture using
traditional coarse-grained task assignment (Traditional), optimal sequential as-
signment (SAP) and the optimistic and robust expanded Multi-level Assignment
Problems (eOMAP and eRMAP respectively). The SAP partition objectives
correspond to Figure 4.6 where each task is run in isolation on the entire archi-
tecture. The combined MAP problem size was 331 basic blocks representing a
total of 1761 Instruction Equivalents. The solution times for the SAP, eOMAP
and eRMAP problems using a quad-thread 64 bit version of CPLEX 11 were
0.07, 0.58 and 0.11 seconds respectively on the Dual Core Opteron 2.2GHz orion
server at Imperial. eOMAP was the only problem form for which CPLEX needed
to use Branch and Bound and the tree size was 17 nodes.
Dijkstra was the smallest of the three tasks at only 294 Instruction Equiva-
lents. However even dijkstra could not fit on the FPGA using coarse-grained tra-
ditional assignment meaning that the best achievable traditional coarse-grained
parallel assignment time was limited by the serial executions of the tasks on the
CPU and could not be less than 27.73 seconds (the Traditional time would have
been 12.68 seconds if dijkstra could have been placed on the FPGA).
Using the new multi-level assignment approach on the other hand allowed for
the tasks to be split at a fine-grained (in this case the basic block) level and as-
signed to share the hardware resources. The optimal eRMAP robust assignments
of Table 4.8 shared the reconfigurable hardware 26% to dijkstra, 28% to susan
and 46% to sha. The MAP approach delivered an execution time of between
0.4306 (optimistic) and 0.5442 (robust) seconds representing between 51 and 64
times better performance than the traditional parallel approach depending on
the amount of conflict serialisation required.
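The Best Time row of Table 4.8 and the improvement factors quoted in this section can be reproduced from the per-task figures. The sketch below takes its values from Table 4.8 and its caption; only the variable names are introduced here.

```python
# Per-task times in seconds for dijkstra, susan and sha from Table 4.8.
table_4_8 = {
    "Traditional": [15.049, 9.1977, 3.4849],
    "SAP": [0.4295, 0.3385, 0.1892],
    "eOMAP": [0.4306, 0.4075, 0.2691],
}

traditional = sum(table_4_8["Traditional"])  # tasks run serially on the CPU
sap = sum(table_4_8["SAP"])                  # isolated tasks run one after another
eomap = max(table_4_8["eOMAP"])              # slowest of the three shared tasks
ermap = 0.2193 + 0.3248                      # serial part plus maximum parallel part

# Improvement of the multi-level approach over the alternatives.
print(round(traditional / eomap), round(traditional / ermap), round(sap / eomap, 1))
# prints: 64 51 2.2
```

The robust sum evaluates to 0.5441 rather than the tabulated 0.5442 because the two components in the caption are themselves rounded.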
4.6 Summary
This chapter presented the Write-Only Architecture (WOA), the WOA execu-
tion timing equations and the formal sequential and parallel partitioning meth-
ods developed for this project. The WOA introduced in Section 4.2 is a new
model for heterogeneous execution which reduces communication costs by up to
a factor of five by delegating responsibility for control path management to dis-
tributed program sections instead of requiring an RPC [107] model like previous
work [47, 79, 81, 84, 89, 93]. The WOA model allowed simple computation and
communication timing equations to be developed in Section 4.3 which can be
supported by 3S [1]§3 program characterisation information.
The division of timing information into fine-grained computation and commu-
nication parts led directly to the optimisation objective of the Sequential Assign-
ment Problem (SAP) in Section 4.4 and the insight that the software assignment
problem we are trying to solve is a Generalized Quadratic Assignment Problem
(GQAP) which is known to be strongly NP-hard [148]. With the connection of
our problem to mathematical theory firmly established, the tractability of prac-
tical SAP problem instances was demonstrated using CPLEX 11’s B&B [112]
implementation to prune the solution space for problem instances corresponding
to real software benchmarks of up to 1792 nodes over a comprehensive range
of computational and communication characteristics [3]. The solutions demon-
strated that the characteristics of practical SAP problem instances allow the
optimal identification of heterogeneous partitions without the need to resort to
heuristics [2, 84,96,98] which are necessarily sub-optimal in some cases [141].
In Section 4.5 the Multi-level Assignment Problem (MAP) was presented.
MAP is a new assignment approach that integrates fine-grained sequential
assignment methods like SAP with previous parallel partitioning approaches
like [97, 100, 101, 102, 103] to deliver better results than either approach could
achieve alone. Results for two versions of MAP were provided: optimistic, in
which no scheduling conflicts were assumed, and robust, which minimised the
possibility of scheduling issues in the worst case cross-task control flow sequence.
The results demonstrated the integrated MAP approach delivers between 51 and
64 times better performance than traditional coarse-grained parallel approaches
and improves the optimal sequential SAP results by a factor of up to 2.2 times
for the benchmarks considered.
Whether or not scheduling conflicts will be seen in a physical implemen-
tation cannot be determined without detailed scheduling information which is
compressed out of the Control/Data Flow Graphs (CDFGs) used in fine-grained
partitioning as discussed in Chapter 2. Consequently, the decision of whether to
implement an optimistic or robust MAP partition (or another version in between
the two extremes) cannot be made with the formal models presented so far in
this document. The next chapter presents novel work on partitioning under
uncertainty which effectively takes a CDFG and optimises for the range of possible
sequence paths. The work of the next chapter is applicable to the scheduling
uncertainty problem highlighted in this chapter, but is presented in the context of
multiple instantiation where it represents the first formal work to break the single
instance constraint of previous formal models [128, 139, 141, 148]§2.4 and more
than doubles their optimality potential for the benchmark considered [5]§5.4.
68 CHAPTER 4. PARTITIONING WITH CERTAINTY
Chapter 5
Partitioning with Uncertainty
5.1 Introduction
This chapter presents and solves a new formal partitioning problem called the
Multiple Instantiation Problem (MIP). MIP differs from previous formal par-
titioning problems [128, 139, 141, 148]§2 in that it is focused on process return
optimisation rather than asset placement cost minimisation. The focus of MIP is
more appropriate for the goals of software assignment, process plant optimisation
and project management than previous work and this chapter contributes:
• Section 5.2: A New Partitioning Paradigm.
• Section 5.3: The First Formal Model for the Paradigm.
• Section 5.4: Results Quantifying the Paradigm’s Benefit.
The return optimisation focus of MIP has the potential to deliver far greater
accelerations than SAP and MAP§4 for the same constraints, but the cost of
these greater returns is additional problem complexity caused by path sequence
uncertainty which is the core issue addressed in this chapter.
The chapter begins by presenting the motivation behind the Multiple Instan-
tiation Problem (MIP) and explaining the path uncertainty issue that has to
be solved in Section 5.2. Section 5.3 then abstracts MIP into a formal robust
NP-hard optimisation problem and reformulates the problem for solution under
uncertainty using standard solvers such as CPLEX [112]. In Section 5.4, results
are presented for a real software benchmark partitioned using MIP for a hetero-
geneous architecture. The results show that the new MIP approach more than
doubles the partition optimality of previous paradigms. The chapter concludes
with a discussion of future work and a brief summary in Sections 5.5 and 5.6.
70 CHAPTER 5. PARTITIONING WITH UNCERTAINTY
5.2 Multiple Instantiation Partitioning
Chapter 4 introduced the software assignment problem through the SAP and
MAP assignment formulations. As with all previous formal assignment prob-
lems [128,139,141,148]§2, SAP and MAP assume a single instance of each node
is assigned across the set of possible assignment locations. This chapter presents
MIP which is the first formal work to challenge the single instance assumption
as shown in Table 5.1 and the first partitioning paradigm targeted specifically at
process optimisation problems in the presence of uncertainty.
Characteristic        Knapsack [128]  GAP [139]  QAP [141]  GQAP [148]  MIP [5]
Multiple locations    ✗               ✓          ✓          ✓           ✓
Variable node costs   ✗               ✓          ✗          ✓           ✓
Connection costs      ✗               ✗          ✓          ✓           ✓
Size constraints      ✓               ✓          ✗          ✓           ✓
Multiple instances    ✗               ✗          ✗          ✗           ✓
Flow properties       ✗               ✗          ✗          ✗           ✓
Strongly NP-hard      ✗               ✗          ✓          ✓           ✓
Table 5.1: Characteristics of previous partitioning paradigms compared with MIP. GAP
stands for the Generalized Assignment Problem [122,139], QAP for the Quadratic
Assignment Problem [141] and GQAP for the Generalized QAP [148].
The single instance assumption of previous work comes from its focus on
physical asset placement. In the classic Knapsack [128] problem for example, an
asset is a unique physical object or limited resource (such as money in a portfolio
optimisation problem) which cannot be replicated arbitrarily and the problem is
to maximise the utility benefit given the fixed resources.
Software assignment differs from physical partitioning in that software units
are not intrinsically unique and can be replicated freely with only an associated
cost in space. Replicability is actually the norm for most process optimisation
problems. Take for example the different machines in a manufacturing process
that needs to be spread over several locations: manufacturers today will naturally
balance the cost of buying duplicate machines for two or more locations against
the benefits in efficiency and reduced transport costs. As another example, consider
project management, where employing different experts with similar skills
as local contacts in a multinational project can deliver increased efficiency and
reduced total cost.
Like process engineering and project management, multiple instantiation has
been used for many years in software optimisation. Examples include: func-
tion in-lining in compilers [45], loop unrolling in hardware design [101] and the
separation of each iteration of a code section into separate assignable tasks in
parallel partitioning [133]. However like process engineering and project man-
agement, previous multiple instantiation work in the software domain has either
used heuristics [45], manual methods [101] or effectively reduced the problem to
a single instance assignment problem using full sequence information [133].
In this work, a new form of automatic partitioning that uses multiple in-
stantiation to reduce execution latency in the presence of sequence uncertainty
is introduced. To illustrate the difference between the Multiple Instantiation
Problem (MIP) and previous work, consider Figure 5.1.
Figure 5.1(a) represents the optimal single instance SAP§4 partition for a
program on a two component heterogeneous architecture. Looking at the SAP
partition a designer could be tempted to hand optimise the assignments by placing
an instance of node 2 on each location as in Figure 5.1(b). However, Figure 5.1(b)
assumes that 2 always returns to its immediate caller and this does not have to
be the case to satisfy the original control flows of Figure 5.1(a). In fact, in the
absence of sequence information, Figure 5.1(c) is just as valid a run-time control
flow path as Figure 5.1(b) for the hand optimised multiple assignments and has
3.5 times as many cross-partition calls as the optimal single instance assignment.
This brings us to the primary issue we need to solve with MIP: path un-
certainty. To cope with path uncertainty we will need to evaluate the possible
directed walks for a multiple instance assignment and ensure our decision to
multiply instantiate a node is robust [149, 150, 151] given the possible directed
walks that could satisfy the (unambiguous) original control flow graph. The
directed walks we consider clearly need to be feasible and we will discuss the flow
balancing, connectivity and other constraints that feasibility implies in Section 5.3.
Figure 5.1(c) also illustrates the second issue we need to be aware of with
MIP: the calling convention. Clearly Figure 5.1(c) is pessimistic rather than
robust because, in the absence of sequence information, node 1 would surely call
the instance of 2 closest to it rather than the one furthest away, and likewise
for node 3, as illustrated in Figure 5.1(d).
The calling convention implemented in Figure 5.1(d) is a greedy one, i.e. each
node chooses the instance of the next target in the control flow sequence which
is closest in terms of communication and execution times. We will stick to the
greedy calling convention in this document leaving the exploration of other causal
calling options for future work. With the greedy calling convention, Figure 5.1(d)
has the same number of cross-partition control flows as the SAP partition but
could deliver better execution times depending on whether node 2 is a data com-
pressor or expander and on the different heterogeneous execution characteristics
of computational locations A and B that we aim to exploit in this work.
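The greedy calling convention can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `greedy_target`, the dictionary layout and all cost values are assumptions made for the example.

```python
def greedy_target(caller_loc, instances, step_cost):
    """Return the location of the callee instance that is cheapest to activate.

    caller_loc -- location of the calling node
    instances  -- locations that hold an instance of the callee
    step_cost  -- step_cost[(caller_loc, callee_loc)] is the single step
                  communication plus execution time (hypothetical values)
    """
    return min(instances, key=lambda loc: step_cost[(caller_loc, loc)])


# Hypothetical costs: staying on the same location avoids communication.
step_cost = {("A", "A"): 1.0, ("A", "B"): 4.0,
             ("B", "A"): 4.0, ("B", "B"): 2.0}

# Node 2 is instantiated on both locations, as in Figure 5.1(d).
node2_instances = ["A", "B"]
print(greedy_target("A", node2_instances, step_cost))  # choice made by node 1 on A
print(greedy_target("B", node2_instances, step_cost))  # choice made by node 3 on B
```

With a single instance the function has no choice to make, which is the SAP case of Figure 5.1(a).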
[Figure 5.1 graphic: four panels over computational locations A and B, titled
Single Instance SAP, Optimistic MIP, Pessimistic MIP and Robust Calling Convention.]
(a) Single instance SAP assignments with node 2 shared by node 1 and node 3 across
the partition boundary.
(b) Multiple instance partition with node 2 replicated (pentagons) at each location
illustrating the split control flow path corresponding to an optimistic directed walk.
(c) The same multiple instance partition as Figure 5.1(b) but illustrating the split
control flow path corresponding to a pessimistic directed walk.
(d) The same multiple instance partition as Figure 5.1(b) but illustrating the split
control flow path corresponding to a pessimistic walk using the greedy calling
convention.
Figure 5.1: Single and Multiple Instance Partitions illustrating the uncertain path and calling
convention problems that need to be solved in MIP. Darker node instance shadings
indicate greater computation times and arc labels give the number of directed calls
between program nodes.
5.3. FORMAL MODEL 73
5.3 Formal Model
Objective Form
As discussed in Section 5.2, the MIP problem differs from the SAP problem of
Chapter 4 in that we need to take into account directed path sequence uncer-
tainty when evaluating potential assignment combinations. To account for path
uncertainty, the MIP objective needs to consider all valid directed walks for each
assignment and the robust MIP objective is thus of the form:
\[ \min_{x \in X} \; \max_{y \in Y(x)} \; t_{xy} \tag{5.1} \]
where x represents an assignment from the valid set of assignments X and y a
directed walk from the set of directed walks valid for a given assignment Y (x).
The objective (5.1) is robust in that it tries to find the assignment with the
minimum execution time for its worst possible path similar to Figure 5.1(d). The
optimistic MIP objective form illustrated by Figure 5.1(b) is:
\[ \min_{x \in X} \; \min_{y \in Y(x)} \; t_{xy} \tag{5.2} \]
This section will concentrate on the robust MIP objective form because the op-
timistic form reduces easily to a standard minimisation problem and the compli-
cations lie in the robust formulation.
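The difference between the robust and optimistic objective forms can be seen on a toy example. The sketch below is illustrative only: the assignment names and the walk times are invented, and the set of walks for each assignment is simply listed rather than derived from a control flow graph.

```python
# Hypothetical execution times t_xy for each assignment x and each
# directed walk y that is valid for that assignment.
walk_times = {
    "single instance": [30.0],
    "node 2 replicated": [22.0, 28.0, 41.0],
}


def robust_choice(times):
    """Objective form (5.1): minimise the worst-case walk time."""
    return min(times, key=lambda x: max(times[x]))


def optimistic_choice(times):
    """Objective form (5.2): minimise the best-case walk time."""
    return min(times, key=lambda x: min(times[x]))


print(robust_choice(walk_times))
print(optimistic_choice(walk_times))
```

In this invented data set the two objectives disagree: replication wins only if its worst walk is ignored.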
Comparing the robust MIP objective (5.1) with the SAP objective of Sec-
tion 4.4:
\[ \min_{x \in X} \; t_{x} \]
it is clear that in order to continue with the formalisation and solution of MIP
problems there are three issues that need to be addressed:
1. the directed walk dependent timing equation txy needs to be specified.
2. the minimax representation needs to be formulated with constraints.
3. the minimax representation needs to be reformulated into a solvable form.
The specification of the timing equation in the presence of uncertain paths is
the subject of the next section and will lead us to our first formal definition for
robust MIP which we will later reformulate into a standard form using Lagrangian
dualisation [126,127].
Objective Compression
Using the WOA activation timing equation of Chapter 4, the total time for an
assignment x ∈ X to execute given a directed walk y ∈ Y (x) will be:
\[ t_{xy} = \sum_{pqlm|\tau} \left( \hat{\mu}_{pl|\tau} + \hat{c}_{pqlm|\tau} \right) y_{pqlm|\tau} \tag{5.3} \]
where p and q are software segments, l and m computational locations,
$\hat{\mu}_{pl|\tau}$ and $\hat{c}_{pqlm|\tau}$ are the computation and communication
times for a single step τ in the program's control flow and $y_{pqlm|\tau}$ is the
directed walk.
The range of possible txy values for an assignment x ∈ X will be governed
by the set of valid walks Ψ(x) for the assignment which is constrained by the
characteristics of valid walks and the original control flow graph χpq :
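\[
\Psi(x) := \Big\{\, y \in \mathbb{B}^{V^2 L^2 T} :
\begin{aligned}[t]
& (x_{pl} \vee x_{qm}) = 0 \implies y_{pqlm|\tau} = 0 \quad \forall\, \tau ; && (5.4) \\
& \sum_{pl} y_{pqlm|\tau} = 0 \implies y_{q\hat{q}m\hat{m}|\tau+1} = 0 \quad \forall\, \hat{q}, \hat{m}, \tau ; && (5.5) \\
& \sum_{pqlm} y_{pqlm|\tau} = 1 \quad \forall\, \tau ; && (5.6) \\
& y_{\emptyset s \emptyset r|0} = 1, \quad \sum_{pl} y_{pelr|T} = 1 ; && (5.7) \\
& \sum_{lm|\tau} y_{pqlm|\tau} = \chi_{pq} \quad \forall\, p \notin \{\emptyset\},\, q \,\Big\} && (5.8)
\end{aligned}
\]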
where V is the number of software code segments (vertices in the graph), L
the number of computational locations (knapsacks) and $T = \sum_{pq} \chi_{pq}$ the
total number of steps in the directed walk. The constraints on the Ψ(x) set ensure
the directed walk: only calls assigned instances (5.4), is continuous (5.5), is
sequential (5.6), starts at node s and ends at node e (5.7) and meets the original
single instance control flow requirements χpq (5.8). The constraints assume s and e
are constrained by x ∈ X to have only a single instance which is at location r.
The set of valid directed walks Ψ(x) does not however include any calling
convention and, as we saw in Section 5.2, can include unrealistically pessimistic
directed walks. Thus, adding the greedy calling convention to Ψ(x) we have:
\[ Y(x) := \Big\{\, y \in \Psi(x) : y_{pqlm|\tau} = 1 \implies Q_{pqlm|\tau} = \min_{\hat{m}} \big\{ Q_{pq\hat{m}l|\tau} : x_{q\hat{m}} = 1 \big\} \Big\} \tag{5.9} \]
which is the set of possible greedy calling convention walks in which walks follow
the path of minimum single step execution times. Referring to equation (5.3) our
directed walk single step execution time can be specified as:
\[ Q_{pqlm|\tau} = \hat{\mu}_{pl|\tau} + \hat{c}_{pqlm|\tau} \tag{5.10} \]
Now, if sequence compression were not required, we could simply measure the
exact trace and calculate the exact objective time t∗xy for an assignment x ∈ X
without any sequence uncertainty [133]. However, as the simple Monte Carlo
Simulator below shows, even a single line of code can produce traces that are too
large to deal with in any practical optimisation problem:
for(unsigned int i=0; i<-1U; i++) (rand()<RAND_MAX/2.?A():B());
For this single line of C++ alone, there would be at least 4,294,967,295 steps
in the directed walk which would make the trace based optimisation problem
impractical.
However, looking back at equation (5.3) and our uncertainty set Y (x), we see
that both txy and Y (x) are defined in terms of a sequence of activations and so we
would actually be worse off in terms of problem size if we implemented a Linear
Program (LP) based on txy and Y (x) than we would be if we just used the exact
trace with the greedy calling convention $y^{*}_{pqlm|\tau}$. So it should be clear at this
point that the request from the previous section for txy and Y (x) to calculate:
\[ \min_{x \in X} \; \max_{y \in Y(x)} \; t_{xy} \tag{5.11} \]
was ill-founded and what we actually need is:
\[ \min_{x \in X} \; \max_{\chi \in \Xi(x)} \; t_{x\chi} \tag{5.12} \]
that is, we need to work at a consolidated level in both timing txχ and its associ-
ated uncertainty set Ξ(x). In our process of consolidation however, we would like
to ensure that our timing equations are equivalent and the uncertainty range of
txχ is equal to that of txy so that the solution to optimisation problem 5.11 is
equal to the solution to problem 5.12.
Timing Compression
Subject to Assumption 5.3.1, our timing equation txy can be sequence compressed
to:
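\[ t_{x\chi} = \sum_{pqlm} \left( \hat{\mu}_{pl} + \hat{c}_{pqlm} \right) \chi_{pqlm} \tag{5.13} \]
where $\hat{\mu}_{pl}$ and $\hat{c}_{pqlm}$ are the location dependent per activation
computation and communication times respectively and $\chi_{pqlm}$ is the split
control flow graph for a directed walk $y_{pqlm|\tau}$ defined by:
\[ \chi_{pqlm} = \sum_{\tau \in \{1..T\}} y_{pqlm|\tau} \quad \forall\, p, q, l, m \tag{5.14} \]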
Assumption 5.3.1 To simplify our problem we will assume our programs have
sequence independent activation costs such that:
\[ \hat{\mu}_{pl} = \mu_{pl|\tau} \quad \forall\, \tau \tag{5.15} \]
\[ \hat{c}_{pqlm} = c_{pqlm|\tau} \quad \forall\, \tau \tag{5.16} \]
This assumption is true for a large number of computationally intense programs
including the warm-started benchmark considered in Section 5.4 of this chapter
but can be invalid if, for example, computation times are dependent on inputs or
internal program state.
Lemma 5.3.2 The txχ time calculated by equation (5.13) is equal to the txy time
of equation (5.3) for a given y ∈ Y (x).
Proof. The proof is through simple arithmetic manipulation of equation (5.3)
given Assumption 5.3.1. 
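Lemma 5.3.2 can be checked numerically on a small example. The following Python sketch is only an illustration under Assumption 5.3.1: the walk, the cost tables `mu` and `c` and the function names are all invented for the example and are not taken from the benchmark of Section 5.4.

```python
from collections import Counter


def walk_time(walk, mu, c):
    """Equation (5.3) with step independent costs: sum over every step."""
    return sum(mu[(p, l)] + c[(p, q, l, m)] for (p, q, l, m) in walk)


def compress(walk):
    """Equation (5.14): count how often each split edge (p, q, l, m) is used."""
    return Counter(walk)


def compressed_time(chi, mu, c):
    """Equation (5.13): sum over split edges weighted by their counts."""
    return sum((mu[(p, l)] + c[(p, q, l, m)]) * n
               for (p, q, l, m), n in chi.items())


# Hypothetical per activation computation and communication times.
mu = {(1, "A"): 2.0, (2, "A"): 5.0, (2, "B"): 3.0}
c = {(1, 2, "A", "A"): 0.0, (1, 2, "A", "B"): 1.5,
     (2, 1, "A", "A"): 0.0, (2, 1, "B", "A"): 1.5}

# A six step walk written as a list of (p, q, l, m) activations.
walk = [(1, 2, "A", "A"), (2, 1, "A", "A"),
        (1, 2, "A", "B"), (2, 1, "B", "A"),
        (1, 2, "A", "A"), (2, 1, "A", "A")]

chi = compress(walk)
print(walk_time(walk, mu, c), compressed_time(chi, mu, c))
```

The compressed form stores four edge counts instead of six steps; for long traces the saving grows with the trace length while the computed time stays the same.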
Constraint Compression
In compressing sequence information out of the Y (x) constraints we require:
\[ \forall\, y \in Y(x) \implies \exists\, \chi \in \Xi(x) \tag{5.17} \]
\[ \exists\, y \in Y(x) \impliedby \forall\, \chi \in \Xi(x) \tag{5.18} \]
to ensure that the range of uncertainty on the timing equations that constitute the
original and the compressed optimisation problems is equivalent. The following
introduces and proves the equivalence of constraints on the compressed problem
through the series of lemmas summarised in Table 5.2.
Constraint            Ψ(x)   Y(x)   Ξ(x)
Assignment            5.4    ⇒      5.3.3
Continuous            5.5    ⇒      5.3.4
Sequential            5.6    ⇒      5.3.4
Start and End         5.7    ⇒      5.3.5
Meets Original χpq    5.8    ⇒      5.3.5
Calling Convention    ✗      5.9    5.3.6
Table 5.2: Direct and implied constraints in the valid Ψ(x) and greedy calling convention
walks Y (x) and their equivalents in the compressed split control graph Ξ(x) set.
Lemma 5.3.3 The assignment constraint (5.4) is equivalent to the consolidated
constraint:
\[ (x_{pl} \vee x_{qm}) = 0 \implies \chi_{pqlm} = 0 \tag{5.19} \]
Proof. The forward implication (5.17) comes from the definition of χpqlm in
equation (5.14). The reverse implication (5.18) comes from the fact that
∃χpqlm ≠ 0 ⟹ (xpl ∧ xqm) = 1 from first-order logic.
Lemma 5.3.4 The continuous and sequential constraints (5.5) and (5.6) are
equivalent to the consolidated constraints that the split control flow graph is: con-
nected and balanced. The consolidated constraints that ensure connectedness are
relatively complex and are discussed in a separate section but the consolidated
balancing constraint set is straightforward and is shown below:
\[ \sum_{\hat{q}\hat{m}} \chi_{p\hat{q}l\hat{m}} = \sum_{\hat{p}\hat{l}} \chi_{\hat{p}p\hat{l}l} \quad \forall\, p \notin \{s, e\},\, l \tag{5.20} \]
Proof. This proof requires an understanding of transitive closure. Readers who
are not familiar with the concept of transitive closure may wish to come back to
this proof after the section entitled Connectivity on page 81.
Constraints (5.5) and (5.6) together impose that walks are both continuous
and sequential. For the forward implication we know any directed walk that is
continuous has a path from the start node s to every other node that is executed
and thus the corresponding split call graph will be complete under transitive clo-
sure which should be taken as the definition of the term connected in the lemma.
Additionally, any walk that is both continuous and sequential has the property
that for each step into a node $y_{pqlm|\tau}$ there is another step out of the node
$y_{q\hat{q}m\hat{m}|\tau+1}$ (which may well be back to the same node) and so the
consolidated split control flow graph is balanced which gives us constraint (5.20).
For the reverse implication, every split control flow graph that is both balanced
and connected (complete under transitive closure) will have at least one walk that
is both continuous and sequential. We can generate this walk by: removing all
cycles from χpqlm to produce a continuous sequential directed walk from s to
e into which we insert appropriate continuous sequential directed walk cycles
corresponding to the original cycles in the split control flow graph.
To prove the generation procedure is always valid, note first that cycles in a
split control flow graph are always balanced by definition. Consequently, remov-
ing all cycles from the split control flow graph must still leave a balanced split
control flow graph (except for s and e) and, as all nodes were reachable through
the connectivity constraints on the original split control flow graph, all the re-
maining nodes must be reachable from s. The split control flow graph pruned of
all cycles must then be a directed walk between s and e (as there is no input to s
or return from e they cannot have been removed as part of a cycle). Additionally,
it is clear that at least one of the nodes in the removed cycles will be present in
the pruned split control flow graph (the cycle was reachable from s) and we can
use such nodes as keys for inserting cyclic walks into the final directed walk. 
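The reverse implication can be illustrated with a short Python sketch. Note the swap: rather than the cycle removal and reinsertion procedure of the proof, the sketch uses Hierholzer's algorithm, a standard construction for a walk that uses every edge of a balanced connected multigraph exactly once. The node labels, edge counts and function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def is_balanced(edges, s, e):
    """Constraint (5.20): in-flow equals out-flow at every node except s and e."""
    flow = defaultdict(int)
    for (u, v), n in edges.items():
        flow[u] -= n
        flow[v] += n
    return all(f == 0 for node, f in flow.items() if node not in (s, e))


def build_walk(edges, s):
    """Hierholzer's algorithm: use every multi-edge exactly once, starting at s."""
    remaining = defaultdict(list)
    for (u, v), n in edges.items():
        remaining[u].extend([v] * n)
    stack, walk = [s], []
    while stack:
        u = stack[-1]
        if remaining[u]:
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            walk.append(stack.pop())          # dead end: emit node
    return walk[::-1]


# Hypothetical split control flow graph: (source, target) -> number of calls.
edges = {("s", "1a"): 1, ("1a", "2a"): 2, ("2a", "1a"): 1, ("2a", "3b"): 1,
         ("3b", "2b"): 1, ("2b", "1a"): 1, ("1a", "e"): 1}

print(is_balanced(edges, "s", "e"))
walk = build_walk(edges, "s")
print(walk)
```

The returned list visits nodes in call order, so consecutive pairs of the list reproduce the multi-edge counts of the input graph.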
Lemma 5.3.5 The start and end constraint (5.7) and the meets original χpq
constraint (5.8) are equivalent to the consolidated constraint:
\[ \sum_{lm} \chi_{pqlm} = \chi_{pq} \quad \forall\, p, q \tag{5.21} \]
Proof. Constraint (5.7) implies that s and e have one extra out and in flow respec-
tively which is enforced by definition in χpq and through the definition of χpq into
the consolidated constraint (5.21) (the instance implications of constraint (5.7)
are part of the x ∈ X constraints). Additionally, referring to the definition of
the split control flow graph in equation (5.14), consolidated constraint (5.21) is
directly equivalent to the Ψ(x) constraint (5.8). 
Lemma 5.3.6 The calling convention constraint (5.9) is equivalent to the con-
solidated constraint:
\[ \chi_{pqlm} \neq 0 \implies Q_{pqlm} = \min_{qm} \big\{ Q_{pqml} : x_{qm} = 1 \big\} \tag{5.22} \]
Proof. Constraint (5.9) enforces that each move in the directed walk is to the in-
stance of a target with the lowest single activation costs. The forward implication
of the constraint equivalence is clear through simple mathematical consolidation
with Qpqlm defined as:
\[ Q_{pqlm} = \hat{\mu}_{pl} + \hat{c}_{pqlm} \tag{5.23} \]
with reference to Lemma 5.3.2. The reverse implication is also clear as we can
trivially construct a directed walk satisfying (5.9) for any χpqlm satisfying con-
straint (5.22). 
Optimisation Problem
Definition 5.3.7 below presents the formal definition of the Multiple Instantiation
Problem (MIP). The problem joins the compressed timing equation (5.13) with
the robust objective (5.12) and includes integer x ∈ X and the consolidated χ ∈
Ξ(x) constraints from Lemmas 5.3.3 through to 5.3.6 expressed in relational form.
Definition 5.3.7 Multiple Instantiation Problem (MIP).
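\[ \min_{x \in X} \; \max_{\chi \in \Xi(x)} \; \sum_{pqlm} Q_{pqlm}\, \chi_{pqlm} \tag{5.24} \]
where x ∈ X is a feasible assignment of program sections to locations satisfying:
\[
\begin{aligned}
\sum_{l} x_{pl} = x_{pr} &= 1 && \forall\, p \in E && (5.25) \\
\sum_{l} x_{pl} &\geq 1 && \forall\, p \notin E && (5.26) \\
\sum_{p} \delta_{pl}\, x_{pl} &\leq \Delta_{l} && \forall\, l && (5.27) \\
x_{pl} &= \begin{cases} 1 & \text{if } p \text{ is assigned to } l \\ 0 & \text{otherwise.} \end{cases} && && (5.28)
\end{aligned}
\]
and χ ∈ Ξ(x) is a valid split control flow for the assignment x satisfying:
\[
\begin{aligned}
\chi_{pqlm} &\leq M x_{pl} && \forall\, p, q, l, m && (5.29) \\
\chi_{pqlm} &\leq M x_{qm} && \forall\, p, q, l, m && (5.30) \\
\sum_{\hat{q}\hat{m}} \chi_{p\hat{q}l\hat{m}} &= \sum_{\hat{p}\hat{l}} \chi_{\hat{p}p\hat{l}l} && \forall\, p \notin \{s, e\},\, l && (5.31) \\
\sum_{lm} \chi_{pqlm} &= \chi_{pq} && \forall\, p, q && (5.32) \\
\chi_{pqlm} &\leq M (1 - x_{q\hat{m}}) && \forall\, p, q, l, m \neq \hat{m} \mid Q_{pqlm} > Q_{pql\hat{m}} && (5.33)
\end{aligned}
\]
with:
p, p̂, q, q̂      assignable software code sections
E               the set of externally dependent code sections
l, l̂, m, m̂, r   computational locations with the reference r holding p ∈ E
Qpqlm           the cost of activating q on m from p on l
χpqlm           a multi-edge of the split control flow graph χ ∈ Ξ(x)
χpq             multi-edges in the original single instance control flow graph
∆l              the total space available at location l
δpl             the space required to assign code section p to location l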
Equation (5.24) is the robust optimisation objective of MIP which deals with
sequence uncertainty through its minimax form to minimise the worst execu-
tion time across all split control flow graphs valid for an assignment. Con-
straints (5.25), (5.26), (5.27) and (5.28) control the instantiation of vertices. Con-
straints (5.27) and (5.28) are the same as in the SAP problem of Section 4.4 and
control the size limits and indicator variable range.
Constraint (5.25) ensures that the class of externally dependent nodes p ∈ E
which includes the special start s and end e nodes are instantiated only on the
reference partition r. Constraint (5.26) is the new multiple instantiation con-
straint that allows internally dependent program code sections to be instantiated
at multiple locations but requires all code sections be instantiated at least once
somewhere to ensure feasibility.
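The instantiation constraints are simple enough to check directly for a candidate assignment. The Python sketch below is an assumption-laden illustration: the nested dictionary layout, the function name `feasible_assignment` and the example sizes and capacities are invented and do not correspond to any benchmark in this work.

```python
def feasible_assignment(x, external, ref, size, capacity):
    """Check constraints (5.25) to (5.27) for a binary assignment x[p][l]."""
    for p, row in x.items():
        count = sum(row.values())
        if p in external:
            # (5.25): exactly one instance, and it sits on the reference location.
            if count != 1 or row[ref] != 1:
                return False
        elif count < 1:
            # (5.26): every other code section needs at least one instance.
            return False
    # (5.27): the space used at each location must not exceed its capacity.
    for l, cap in capacity.items():
        if sum(size[p][l] * x[p][l] for p in x) > cap:
            return False
    return True


# Hypothetical problem: start s and end e are external, node n2 is replicated.
x_ok = {"s": {"A": 1, "B": 0}, "n2": {"A": 1, "B": 1}, "e": {"A": 1, "B": 0}}
size = {p: {"A": 1, "B": 1} for p in x_ok}
capacity = {"A": 3, "B": 1}

print(feasible_assignment(x_ok, {"s", "e"}, "A", size, capacity))
```

Constraint (5.28) is implicit in the sketch: the entries of `x` are assumed to be 0 or 1.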
The χ ∈ Ξ(x) constraints govern the bounds of the decision dependent un-
certainty in the problem through the split control flow graph feasibility region.
Constraints (5.29) and (5.30) tie χpqlm and x together by ensuring every split
control flow corresponds to instantiated nodes. Constraint (5.31) is the balanc-
ing (flow conservation) constraint, (5.32) ensures the split control flow meets the
original control flow χpq requirements and constraint (5.33) is the greedy calling
convention constraint in LP form.
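For binary x and a sufficiently large M, the big-M constraints reduce to logical conditions that can be tested on a candidate split control flow. The following Python sketch checks (5.29), (5.30), (5.32) and (5.33) only; the balancing constraint (5.31) is left out for brevity. The data layout, function name and numbers are assumptions made for the illustration.

```python
def valid_split_flow(chi, x, chi_orig, Q):
    """Check (5.29), (5.30), (5.32) and (5.33) for a split control flow chi.

    chi[(p, q, l, m)] -- number of calls from p on l to q on m
    x[(p, l)]         -- 1 if p is instantiated on l, else 0
    chi_orig[(p, q)]  -- calls from p to q in the original graph
    Q[(p, q, l, m)]   -- cost of activating q on m from p on l
    """
    totals = {}
    for (p, q, l, m), n in chi.items():
        if n == 0:
            continue
        # (5.29), (5.30): flow is only allowed between instantiated nodes.
        if not (x.get((p, l), 0) and x.get((q, m), 0)):
            return False
        # (5.33): no flow to m if a strictly cheaper instance of q exists.
        for (q2, m2), on in x.items():
            if on and q2 == q and m2 != m and Q[(p, q, l, m)] > Q[(p, q, l, m2)]:
                return False
        totals[(p, q)] = totals.get((p, q), 0) + n
    # (5.32): the split flows must add up to the original control flow.
    return totals == {k: v for k, v in chi_orig.items() if v}


# Hypothetical instance: node 2 is replicated on A and B.
x = {(1, "A"): 1, (2, "A"): 1, (2, "B"): 1, (3, "B"): 1}
chi_orig = {(1, 2): 10, (3, 2): 3}
Q = {(1, 2, "A", "A"): 1.0, (1, 2, "A", "B"): 4.0,
     (3, 2, "B", "A"): 4.0, (3, 2, "B", "B"): 2.0}

greedy = {(1, 2, "A", "A"): 10, (3, 2, "B", "B"): 3}
pessimistic = {(1, 2, "A", "B"): 10, (3, 2, "B", "A"): 3}
print(valid_split_flow(greedy, x, chi_orig, Q))
print(valid_split_flow(pessimistic, x, chi_orig, Q))
```

The second flow meets the original call counts but sends every call to the more expensive instance, so it is excluded by the calling convention check.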
The correspondence between the MIP split control flow graph constraints and
the lemmas of the previous section is shown in Table 5.3 and demonstrates that
the MIP constraints of Definition 5.3.7 are partially relaxed in that they include
discontinuous graphs in the Ξ(x) uncertainty set. The next section presents the
final constraints for the formal model which address the discontinuity issues by
enforcing graph connectivity. However before we consider the connectivity con-
straints, and the pressing issue of how to solve the robust MIP formulation, we
will complete the current definition of MIP with a proof that the problem is
strongly NP-hard.
Theorem 5.3.8 MIP is strongly NP-hard.
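Proof. Consider a Quadratic Assignment Problem (QAP) $\langle \hat{V}, \hat{L}, \hat{Q} \rangle$ with a set of
assignable items $\hat{V}$, a set of locations $\hat{L}$ and quadratic costs $\hat{Q}$ defined as 0 for
unconnected vertices. The QAP problem reduces to an equivalent MIP problem
$\langle V, L, E, Q, \chi, \Delta, \delta \rangle$ with
\[ \langle\, V = \hat{V},\; L = \hat{L},\; E = \emptyset,\; Q = \hat{Q},\; \chi = 1_{\hat{V}\hat{V}},\; \Delta = 1_{\hat{L}},\; \delta = 1_{\hat{V}\hat{L}} \,\rangle \]
where $1_{\circ\circ}$ is a suitably sized matrix of 1's.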
Thus as MIP can represent any QAP problem through the above reduction,
MIP is at least as complex as QAP which is strongly NP-hard [141]. 
Connectivity
To guarantee the formal MIP problem definition represents the underlying un-
certainty set related to the valid greedy walks Y (x), we need to ensure all the
consolidated Ξ(x) constraints from Lemmas 5.3.3 to 5.3.6 are included in our
Linear Programming (LP) model. Table 5.3 shows that the formal MIP Defini-
tion 5.3.7 does not include the full continuous sequential path constraints and
so represents an expanded uncertainty set. As demonstrated in Lemma 5.3.4 on
page 77, the continuous sequential path constraints are equivalent to requiring
the split control flow graph be both connected and balanced. The balancing con-
straints are already included in the MIP formulation as constraint (5.31) and this
section explains how to add the connectivity constraints to the formal model.
Constraint            Ξ(x)    MIP
Assignment            5.3.3   5.29, 5.30
Continuous            5.3.4   5.31, ✗
Sequential            5.3.4   5.31, ✗
Start and End         5.3.5   5.25
Meets Original χpq    5.3.5   5.32
Calling Convention    5.3.6   5.33
Table 5.3: Correspondence between the Ξ(x) constraint lemmas of pages 77–78 and the formal
MIP constraints of Definition 5.3.7.
As discussed in Lemma 5.3.4, our graph connectivity constraint is that the
transitive closure [125] of the start node s be the set containing every node in-
stance in the graph. However, translating this requirement into LP constraints is
non-trivial and before disclosing the general method to perform the translation
it is necessary first to supply some essential definitions and theorems.
Definition 5.3.9 Unit Connectivity Matrix (UCM). The Unit Connectivity Ma-
trix corresponding to a control flow graph χij is a square binary matrix composed
of elements:
\[ \alpha_{ij} = \begin{cases} 1 & \text{if } \chi_{ij} \geq 1 \\ 0 & \text{otherwise.} \end{cases} \tag{5.34} \]
Thus the UCM for the split control flow graph χpqlm is simply a square matrix
with indices i = (p, l) and j = (q,m) with elements 1 where (p, l) activates (q,m) and
0 elsewhere. Figure 5.2(a) shows the split control flow graph for Figure 5.1(d) on
page 72 in matrix form and Figure 5.2(b) shows the corresponding UCM. The 1
in the last row of the second column of the UCM indicates that node 2 on location
A has a direct connection to 3 on location B in the split control flow graph.
Definition 5.3.10 Closure of a Unit Connectivity Matrix. The Nth closure of a
UCM is the binary union of the N powers (compositions) of the UCM matrix:
\[ M^{(N)} = \overset{B}{\bigcup_{n \leq N}} M^{n} \tag{5.35} \]
The first closure of a UCM is the UCM itself. Figures 5.2(c) and 5.2(d) show
the 2nd power and 2nd closure of the UCM of Figure 5.2(b).
Definition 5.3.11 Transitive Closure. The transitive closure M∗ of a control
flow graph χ is the closure of the graph’s UCM as N → ∞. The 2nd closure of the
split control flow graph shown in Figure 5.2(d) is equal to the transitive closure
of Figure 5.2(a) because the 2nd, 3rd, . . . Nth UCM closures are all the same.
With these standard definitions, it is a simple matter to derive the follow-
ing lemmas, theorem and corollary which characterise connected graphs for the
purpose of this work.
        1a  2a  3a  1b  2b  3b
   1a    0   7   0   0   3   1
   2a   10   0   0   0   0   0
   3a    0   0   0   0   0   0
   1b    0   0   0   0   0   0
   2b    0   0   0   0   0   1
   3b    1   3   0   0   0   0
(a) Split control flow graph for Figure 5.1(d).

        1a  2a  3a  1b  2b  3b
   1a    0   1   0   0   1   1
   2a    1   0   0   0   0   0
   3a    0   0   0   0   0   0
   1b    0   0   0   0   0   0
   2b    0   0   0   0   0   1
   3b    1   1   0   0   0   0
(b) UCM for the split control flow graph.

        1a  2a  3a  1b  2b  3b
   1a    2   1   0   0   0   1
   2a    0   1   0   0   1   1
   3a    0   0   0   0   0   0
   1b    0   0   0   0   0   0
   2b    1   1   0   0   0   0
   3b    1   1   0   0   1   1
(c) The 2nd power of the UCM.

        1a  2a  3a  1b  2b  3b
   1a    1   1   0   0   1   1
   2a    1   1   0   0   1   1
   3a    0   0   0   0   0   0
   1b    0   0   0   0   0   0
   2b    1   1   0   0   0   1
   3b    1   1   0   0   1   1
(d) The 2nd closure of the UCM.
Figure 5.2: Example split control flow graph, its Unit Connectivity Matrix (UCM) and the
UCM’s second power and closure.
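The matrices of Figure 5.2 can be reproduced with a few lines of Python. The sketch below uses plain nested lists and ordinary integer matrix multiplication followed by a binary union; the function names are assumptions made for the example, while the input matrix is the split control flow graph of Figure 5.2(a).

```python
def ucm(chi):
    """Definition 5.3.9: 1 where the control flow count is at least 1."""
    return [[1 if v >= 1 else 0 for v in row] for row in chi]


def matmul(a, b):
    """Ordinary integer matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def closure(m, order):
    """Definition 5.3.10: binary union of the first `order` powers of m."""
    power = m
    result = [[1 if v else 0 for v in row] for row in m]
    for _ in range(order - 1):
        power = matmul(power, m)
        result = [[1 if (r or p) else 0 for r, p in zip(r_row, p_row)]
                  for r_row, p_row in zip(result, power)]
    return result


# Split control flow graph of Figure 5.2(a); rows and columns are
# ordered 1a, 2a, 3a, 1b, 2b, 3b.
CHI = [[0, 7, 0, 0, 3, 1],
       [10, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [1, 3, 0, 0, 0, 0]]

M = ucm(CHI)
for row in closure(M, 2):
    print(row)
```

Calling `matmul(M, M)` gives the second power shown in Figure 5.2(c), and `closure(M, 2)` gives the second closure of Figure 5.2(d).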
Lemma 5.3.12 The Nth closure of a split control flow graph maps the starting
vector S to the set of nodes that can be reached in a walk of N or less steps.
Proof. By definition, the UCM is the single step connectivity matrix which we
will denote M for convenience. Multiplying the UCM by any index vector S(n)
returns S(n+1) which is non-zero for nodes directly connected to the index vector
through the split control flow graph:
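\[ S^{(n+1)} = M . S^{(n)} \]
and so the set of nodes reachable from S in N or less steps is:
\[
\begin{aligned}
S^{\cup(N)} &= \overset{B}{\bigcup_{n \leq N}} M^{n} . S \\
            &= \Big( \overset{B}{\bigcup_{n \leq N}} M^{n} \Big) . S \\
            &= M^{(N)} S
\end{aligned}
\]
where $M^{(N)}$ is the Nth closure of M.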
Lemma 5.3.13 The lowest order closure that is equal to the transitive closure
M∗ is never more than the number of instances N in an assignment.
Proof. From Lemma 5.3.12, the nth closure of a UCM can be used to determine
the nodes that can be reached in n or less steps from the start node by any valid
path. If at any point the UCM closures M(n) and M(n+1) are the same, then no
new nodes can be reached and we are at the transitive closure. In the worst case
then one unique node will be added on each of the M(1), . . .M(n) closures and so
the number of unique closures cannot be more than the number of unique node
instances N . 
Theorem 5.3.14 A graph is connected if and only if the transitive closure of the
starting vector S is every active instance in the assignment x ∈ X, that is:
\[ M^{*} S = \big\{\, x_{qm} \in x : \exists\, p, l \ \text{s.t.}\ \chi_{pqlm} > 0 \,\big\} \tag{5.36} \]
Proof. From Lemma 5.3.12, M∗S is the set of nodes reachable from S by any
number of steps through the split control flow graph. If the set of reachable nodes
were not equal to the set of active instances, then there would be unreachable
nodes which proves the theorem by contradiction. 
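The connectivity test of Theorem 5.3.14 can be sketched as a reachability computation. The Python below is an illustration under stated assumptions: the matrix follows the layout of Figure 5.2(b), where entry `m[j][i]` is 1 when instance i calls instance j, the start instance is counted as reached, and the function names are invented for the example.

```python
def reachable(m, s):
    """Indices reachable from start index s, with m[j][i] = 1 when i calls j."""
    n = len(m)
    reached = {s}
    for _ in range(n):  # Lemma 5.3.13: at most n distinct closures
        new = {j for i in reached for j in range(n) if m[j][i]}
        if new <= reached:
            break
        reached |= new
    return reached


def is_connected(m, s):
    """Every instance that appears on some edge must be reachable from s."""
    n = len(m)
    active = {i for i in range(n)
              if any(m[i]) or any(m[j][i] for j in range(n))}
    return active <= reachable(m, s)


# UCM of Figure 5.2(b); indices 0..5 stand for 1a, 2a, 3a, 1b, 2b, 3b.
M = [[0, 1, 0, 0, 1, 1],
     [1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1],
     [1, 1, 0, 0, 0, 0]]

# Hypothetical variant: an isolated two node cycle between 3a and 1b.
M_split = [row[:] for row in M]
M_split[3][2] = 1
M_split[2][3] = 1

print(sorted(reachable(M, 0)))
print(is_connected(M, 0), is_connected(M_split, 0))
```

The isolated cycle in the second matrix is balanced but cannot be reached from the start instance, which is why the balancing constraint alone does not guarantee a valid walk.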
Corollary 5.3.15 Theorem 5.3.14 allows for nodes to be instantiated and not
called. Nodes that are not activated do not contribute to the optimality of the
problem (they merely take up space). Thus if we prune uncalled nodes from our
problem solution we obtain the same objective and so adding a constraint that
every instance must be called has no effect on the optimality of the problem.
Consequently, equation (5.36) will give the same effective solutions as requiring:
$$M^{*} S \ \overset{B}{\cup}\ S = x \tag{5.37}$$
Now with all the definitions and theorems in hand we are ready to face the
problem of translating equation (5.37) from a mathematical statement into a set
of static Linear Programming (LP) constraints. Let us first denote the elements
of the nth closure of the split control flow graph $M^{(n)}$ by $\alpha^{(n)}_{ij}$ with i = (p, l) and
j = (q, m) in a similar fashion to Definition 5.3.9. As S is a binary vector with a
solitary 1 at index s corresponding to the start node on the reference partition,
from Lemma 5.3.13 and Corollary 5.3.15 we have:
$$\alpha^{(*)}_{sj} = \alpha^{(N)}_{sj} = x_j \quad \forall\, j \ne s \tag{5.38}$$
Arbitrarily setting the index s = 1, equation (5.38) says that the first column of
the transitive closure matrix needs to match the final (pruned) instance vector x
for the graph to be connected. This is thus the connectivity constraint we have
been looking for. However, we still need to specify $\alpha^{(N)}_{sj}$ in LP form.
Referring to Definitions 5.3.10 and 5.3.11, we know that the transitive closure
is defined in terms of lower order closures. Thus we can define $\alpha^{(N)}_{sj}$ in terms of
the LP recursion:
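$$\alpha^{(n)}_{ij} \le \begin{cases} \alpha_{ij} & \text{if } n = 1 \\ \sum_{k} \alpha^{(n-1)}_{kj}\, \alpha_{ik} + \alpha^{(n-1)}_{ij} & \text{otherwise} \end{cases} \tag{5.39}$$
noting that $\alpha^{(n)}_{ij} \in \{0, 1\}$ and $\alpha_{ij}$ is defined for the UCM in equation (5.34).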
To remove the quadratics from equation (5.39) we need to introduce a set of
new variables $\beta^{(n)}_{ijk} \in \{0, 1\}$ defined by:
$$2 \beta^{(n)}_{ijk} \le \alpha^{(n)}_{kj} + \alpha_{ik} \tag{5.40}$$
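which can only be non-zero if both $\alpha^{(n)}_{kj}$ and $\alpha_{ik}$ are 1, in a similar fashion to
Lemma 4.4.2 of page 52 [4], and adjust equation (5.39) to:
$$\alpha^{(n)}_{ij} \le \begin{cases} \alpha_{ij} & \text{if } n = 1 \\ \sum_{k} \beta^{(n-1)}_{ijk} + \alpha^{(n-1)}_{ij} & \text{otherwise} \end{cases} \tag{5.41}$$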
Equations (5.38), (5.34), (5.40) and (5.41) can now be brought together to form
our connectivity constraints:
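$$\alpha^{(N)}_{sj} = x_j \quad \forall\, j \ne s \tag{5.42}$$
$$\alpha^{(1)}_{ij} \le \chi_{ij} \tag{5.43}$$
$$2 \beta^{(n)}_{ijk} \le \alpha^{(n)}_{kj} + \alpha_{ik} \tag{5.44}$$
$$\alpha^{(n)}_{ij} \le \sum_{k} \beta^{(n-1)}_{ijk} + \alpha^{(n-1)}_{ij} \tag{5.45}$$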
with α, β ∈ {0, 1}. It should be noted that the x ∈ X constraints on the outer problem
enforce xj feasibility and constraint (5.42) then pulls the left hand side of the
inequalities to 1 where the right hand side permits and this is the reason why only
a single side of the quadratic removal construct was needed in equation (5.40).
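The one-sided construct of equation (5.40) can be checked by enumeration. The short Python sketch below is an illustration only (the function name is hypothetical): it confirms that, for binary inputs, the largest β permitted by 2β ≤ a + b equals the product a·b, so a constraint that pulls β upwards recovers the quadratic term without a second bounding inequality.

```python
from itertools import product


def max_feasible_beta(a, b):
    """Largest binary beta satisfying the one-sided constraint 2*beta <= a + b."""
    return max(beta for beta in (0, 1) if 2 * beta <= a + b)


for a, b in product((0, 1), repeat=2):
    beta = max_feasible_beta(a, b)
    # When beta is pulled upwards it settles on the product a*b.
    print(a, b, beta, a * b)
```

The last two columns printed agree on every row, which is the property the single-sided formulation relies on.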
In the worst case, the number of connectivity constraints would be $O(N^3)$ in
the number of possible node instances N for a problem. However, any practical
implementation can reduce the number of constraints by keeping an adjacency
list of each feasible connection at each step in the $\alpha^{(n)}_{ij}$ growth process based on
the original input control flow graph $\chi_{pq}$ and can identify the minimum transitive
closure order when all possible connections (i.e. where $\alpha_{ij} = 1$) have been included
through the first column of $\alpha^{(N)}_{sj}$. Further, any self calling nodes can be omitted
from the practical constraints as they do not contribute to the transitive closure.
In the results of Section 5.4, these practical enhancements reduced the number
of connectivity constraints to less than $N^{1.3}$.
At this point, we have a Mixed Integer Linear Program (MILP) which rep-
resents the MIP objective function and the directed walk uncertainty through
consolidated split control flow graph constraints. The MILP can be solved directly
for the optimistic $\min_{x \in X} \min_{\chi \in \Xi(x)}$ objective form using B&B with the
connectivity and χ constraint variables kept in the integer domain. However,
for the robust objective form of Definition 5.3.7 the inner maximisation must be
dualised to combine it with the outer minimisation and this procedure relaxes
the strength of the χ ∈ Ξ(x) split control flow graph constraints including the
connectivity constraints as discussed in the next section.
Dual Relaxation
This section presents the method used to reformulate the robust minimax MIP
optimisation problem into a standard minimisation form. The method is to du-
alise the inner maximisation and associated constraints to form a minimisation
which is then combined with the external minimisation of the original robust
optimisation problem using new variables.
Function        Description
dualLP()        Dualises the LP in the current LPM working environment
expandM()       Used to replace symbolic M's with defined values
negateLP()      Performs the transform {min c^T x : Ax ≥ b} ↔ {max −c^T x : −Ax ≤ −b}
normaliseLP()   Moves constants and external parameters to the RHS of relations
replaceLP()     Replaces all occurrences of a symbol (e.g. a quadratic) by another symbol
writeAMPL()     Saves the LP to a file in the AMPL modelling language format
writeGMS()      Saves the LP to a file in the GAMS modelling language format
writeLP()       Saves the LP to a file in the ILOG LP model format
Table 5.4: Some of the functionality included in the current Linear Program Manipulation
(LPM) library.
To implement the dualisation method, a new general mathematical modelling
library called the Linear Program Manipulation (LPM) library was created. LPM
can be used with any problem and includes functionality to manipulate mathe-
matical forms not present in previous mathematical modelling languages such as
AMPL [113], GAMS [114] or OPL [115] as shown in Table 5.4.
To dualise a minimax problem with LPM, the inner problem is first entered
with parameterised placeholders for the linkage variables to the outer problem.
Next the LPM command dualLP() is called to dualise the inner problem and the
outer problem’s constraints are then added to specify the feasible range of the
linkage variables. Finally the dualised problem is exported for solution with a
standard solver using one of the LPM’s write*() commands.
The point of the preceding LPM methodology discussion is that dualisation
is a general abstract process with few specific issues. Consequently this section
will describe the dualisation of MIP using the general abstract terms of Papadim-
itriou [126] so that the specific issues introduced by the dualisation can be better
highlighted.
The robust inner maximisation of MIP can be represented in canonical
form [126] as:
$$\max_{\chi \in \Xi(x)} \ c^{T} \chi \tag{5.46}$$
$$\text{s.t.} \quad A \chi \le b(x) \tag{5.47}$$
with Aχ ≤ b(x) constructed from the χ ∈ Ξ(x) constraints of Definition 5.3.7 together
with the connectivity constraints and all $x_{pl}$ terms dependent on the outer
x ∈ X constraints linearised and isolated as external variables in b(x) on the right
hand side of the constraints:
$$b(x) = \mathrm{col}(b_1, b_2, \ldots, b_n \mid b_{n+1} x_1, \ldots, b_{n+m} x_m) \tag{5.48}$$
The corresponding Lagrangian dual [126,127] then becomes:
$$\min_{\pi \in \Pi(x)} \ b(x)^{T} \pi \tag{5.49}$$
$$\text{s.t.} \quad A^{T} \pi \ge c \tag{5.50}$$
with a new variable $\pi_i$ for each of the previous constraints in (5.47).
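To make the primal and dual pair concrete, the following Python sketch checks a two-variable instance by hand rather than with a solver. All numbers, names and the non-negativity of χ and π are assumptions chosen for illustration; they are not the MIP constraint data. The second right hand side entry is a linkage term of the form b·x, so the dual objective contains the product of a dual variable with an outer binary variable.

```python
C = [3, 2]                 # objective coefficients c (hypothetical)
A = [[1, 1],               # constraint matrix A (hypothetical)
     [1, 0]]


def b_of(x1):
    """Right hand side b(x): the second entry is a linkage term b_2 * x_1."""
    return [4, 2 * x1]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_feasible(chi, b):
    """A.chi <= b with chi >= 0 (non-negativity is an assumption of this sketch)."""
    return all(c >= 0 for c in chi) and all(dot(row, chi) <= bi for row, bi in zip(A, b))


def dual_feasible(pi):
    """A^T.pi >= c with pi >= 0."""
    columns = list(zip(*A))
    return all(p >= 0 for p in pi) and all(dot(col, pi) >= cj for col, cj in zip(columns, C))


PI = [2, 1]  # one dual point, feasible for every x because A and c do not depend on x

for x1 in (0, 1):
    b = b_of(x1)
    chi = [2 * x1, 4 - 2 * x1]          # a primal feasible point for this x
    assert primal_feasible(chi, b) and dual_feasible(PI)
    # Equal objectives certify that both points are optimal (weak duality).
    print(x1, dot(C, chi), dot(b, PI))
```

In this instance the dual objective is 4π₁ + 2x₁π₂, and the primal and dual values coincide for both settings of x₁. The x₁π₂ product is the kind of quadratic term that has to be linearised before the dual can be merged with an outer minimisation.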
The dualisation procedure introduces two specific issues for the MIP formulation.
The first issue is that transforming from the inner primal variable space to
the dual variable space π ∈ Π(x) sacrifices the integrality of the χ ∈ Ξ(x) and
connectivity constraints. The loss of integrality in the χ related constraints increases
the size of the feasible solution space to include non-integer split control flow
graphs. However the relaxed solution set still includes the integer split control
flow solutions and so the dualised minimax objective represents an upper bound on
the corresponding integer split control flow problem solution for the assignments
identified.
The second issue is the introduction of quadratic $\pi_i x_j$ terms in the dualised
objective resulting from the $b_i x_j$ linkage variables in equation (5.48). As $x_j \in \{0, 1\}$
for MIP, these quadratics can be replaced by new linear terms $z_{ij} \ge 0$ constrained
by:
$$z_{ij} \begin{cases} \ge \pi_i - M(1 - x_j) & \text{if } b_i \ge 0 \\ \le M x_j, \ \le \pi_i & \text{otherwise} \end{cases} \tag{5.51}$$
This substitution takes advantage of the fact that the dualised quadratics are part
of a minimisation objective and was implemented in this work through calls to
the LPM library function replaceLP().
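The reasoning behind the one-sided bounds of equation (5.51) can be checked numerically. The sketch below is illustrative only: it assumes 0 ≤ π ≤ M with a hypothetical M = 10, and models the effect of the minimisation by taking the lowest admissible z when the cost coefficient is non-negative and the highest admissible z when it is negative. In both cases the selected z equals the product π·x.

```python
BIG_M = 10.0  # hypothetical upper bound on every dual variable pi_i


def tightest_z(pi, x, b_nonneg):
    """Value of z that a minimisation selects under the one-sided bounds of (5.51)."""
    if b_nonneg:
        # z has a non-negative cost, so it falls to its lower bounds:
        # z >= 0 and z >= pi - M(1 - x).
        return max(0.0, pi - BIG_M * (1 - x))
    # z has a negative cost, so it rises to its upper bounds: z <= M x and z <= pi.
    return min(BIG_M * x, pi)


for x in (0, 1):
    for pi in (0.0, 2.5, BIG_M):
        for b_nonneg in (True, False):
            assert tightest_z(pi, x, b_nonneg) == pi * x
```

The check relies on M being at least as large as any value π can take; a smaller M would cut off valid dual solutions.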
5.4 Results
This section provides execution time results for a Java Fibonacci benchmark par-
titioned for a three component heterogeneous architecture using four forms of
MIP. The four MIP forms are: the robust optimisation minimax of Definition 5.3.7,
referred to as MXM for Min Max Multiple; the optimistic form of Definition 5.3.7
where the inner $\max_{\chi \in \Xi(x)}$ is replaced by a $\min_{\chi \in \Xi(x)}$, referred to as MMM for
Min Min Multiple; and the two forms with the connectivity constraints of
equations (5.42) to (5.45) added, referred to as MXM∗ and MMM∗ respectively.
The formal relation between the objective bounds of the four MIP forms
is:
$$\mathrm{MMM} \le \mathrm{MMM}^{*} \le \mathrm{MIP}^{*} \le \mathrm{MXM}^{*} \le \mathrm{MXM} \tag{5.52}$$
where MIP∗ represents the best partition that could be identified if the actual
practical trace were known. Relation (5.52) can be used as a basis for B&B [123]
thresholding if required, with all the objective values for the MIP forms upper
bounded by the single instance sequential assignment SAP§4.4 problem solution
which is referred to as SAP in this section.
Table 5.5 presents the results for the different problem forms. The Simula-
tion results were generated with a bespoke 3S tool (see Chapter 3) created for
this project. The tool uses the real run-time trace to simulate greedy calling
convention activations on a defined assignment while keeping track of the total
communication and execution times. The MXM∗ and MMM∗ results quantify the
potential benefit of MIP for this benchmark to be between 2.16 and 3.23 times
performance improvement over SAP. Further the practical value of the robust
MXM∗ MIP form is clearly evident as the optimistic MMM∗ partition delivers
considerably reduced performance when the actual program trace is simulated.
Problem Form   Formal   Simulation   Instances
SAP            1037     1037         23
MXM            479      479          30
MXM∗           479      479          32
MMM∗           321      853          33
MMM            167      871          35
Table 5.5: Partition results for the Fibonacci benchmark. The SAP row gives the single
instance SAP results, MXM and MMM the open robust and optimistic MIP results
and MXM∗ and MMM∗ the connected robust and optimistic MIP results. The
Formal column provides the optimal results from the respective formal models,
the Simulation column gives the 3S trace simulation results for the formal
assignments and the Instances column provides the number of instances for each optimal
assignment. All results are total execution times (scaled by the same factor).
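The improvement factors quoted for this benchmark follow directly from the entries of Table 5.5. The short Python sketch below recomputes them; the dictionary layout and function name are illustrative choices, while the numbers are copied from the table.

```python
# (formal objective, simulated time) for each problem form, copied from Table 5.5
RESULTS = {
    "SAP": (1037, 1037),
    "MXM": (479, 479),
    "MXM*": (479, 479),
    "MMM*": (321, 853),
    "MMM": (167, 871),
}

sap_time = RESULTS["SAP"][1]


def speedup(time):
    """Improvement factor over SAP (both times share the same scale)."""
    return sap_time / time


print(round(speedup(RESULTS["MXM*"][0]), 2))  # robust connected form, formal
print(round(speedup(RESULTS["MMM*"][0]), 2))  # optimistic connected form, formal
print(round(speedup(RESULTS["MMM*"][1]), 2))  # optimistic connected form, simulated
print(round(speedup(RESULTS["MMM"][1]), 2))   # optimistic open form, simulated

# Ordering of the formal bounds in relation (5.52), without the unknown MIP* term
formal = [RESULTS[k][0] for k in ("MMM", "MMM*", "MXM*", "MXM")]
print(formal == sorted(formal))
```

The first two ratios reproduce the 2.16 and 3.23 figures, and the simulated optimistic partitions come out at roughly 1.2 times SAP, which is why the robust form is the safer choice when the trace is unknown.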
The sizeable MIP improvements for this benchmark are because, like all Java
programs, the benchmark’s functions share stack arithmetic code which MIP
correctly identifies and multiply instantiates optimally. While the duplication
of shared stack arithmetic code may well be an obvious manual step now it
has been explained, MIP identifies the code sections to duplicate automatically
without human context information using only the abstract control and data flow
information measured by 3S [1]§3.
Figure 5.3: Control flow graph for the Fibonacci benchmark. Nodes represent code segments
and are shaded in proportion to their total communication and computation costs
at the reference location (location A). Lines represent control flow graph
multi-edges and are labelled with their $\chi_{pq}$ values.
Figure 5.3 shows the 3S control flow graph for the benchmark and Figures 5.4–
5.8 show the corresponding SAP and MIP partitions. In the partitions: circles
indicate singly instantiated nodes, pentagons multiply instantiated nodes and
lines show the split control flow graph edges $\chi_{pqlm}$. Nodes are shaded in proportion
to their inbound communication and computation time requirements at each
location determined by the number of split control flow graph activations. The
effect of the connectivity constraints is clearly visible in the difference between
Figure 5.7 and Figure 5.8.
Figure 5.4: Single instance SAP partition for the benchmark. The heterogeneous
computational locations are labelled A, B and C of which only A and B were used. Only
a single instance of each code section is present by definition in the SAP
partition. Nodes are shaded in proportion to their total inbound communication and
computation time requirements at each location.
Figure 5.5: Robust minimax multiple instance MXM partition without transitive closure
constraints for the benchmark. The heterogeneous computational locations are
labelled A, B and C. Multiply instantiated nodes are shown as pentagons and
singly instantiated nodes as circles. Node instances are shaded in proportion to
their total inbound communication and computation time requirements.
Figure 5.6: Robust minimax multiple instance MXM∗ partition with transitive closure
constraints for the benchmark. The heterogeneous computational locations are
labelled A, B and C. Multiply instantiated nodes are shown as pentagons and
singly instantiated nodes as circles. Node instances are shaded in proportion to
their total inbound communication and computation time requirements.
Figure 5.7: Optimistic minmin multiple instance MMM∗ partition with transitive closure
constraints for the benchmark. The heterogeneous computational locations are
labelled A, B and C. Multiply instantiated nodes are shown as pentagons and
singly instantiated nodes as circles. Node instances are shaded in proportion to
their total inbound communication and computation time requirements.
Figure 5.8: Optimistic minmin multiple instance MMM partition without transitive closure
constraints for the benchmark. The heterogeneous computational locations are
labelled A, B and C. Multiply instantiated nodes are shown as pentagons and
singly instantiated nodes as circles. Node instances are shaded in proportion to
their total inbound communication and computation time requirements.
5.5 Discussion
The results of Section 5.4 demonstrate that MIP produces partitions with be-
tween 2.16 and 3.23 times better performance than single instance SAP for the
benchmark considered. The lowest guaranteed performance improvement was
2.16 times for the robust MXM∗ formulation corresponding to Definition 5.3.7
together with the connectivity constraints of equations (5.42) to (5.45).
Form   Objective   B&B Nodes   Solution Time (s)
SAP    1037        63          0.23
MMS    1037        105         0.59
MXS    1037        2,002,330   5,240.71
Table 5.6: Solution information from the quad-thread 64 bit version of CPLEX 11
on the Imperial orion server. SAP is the standard single instance solution using
the SAP method of Chapter 4. MMS and MXS are the optimistic and robust MIP
results corresponding to Definition 5.3.7 with added single instance constraints.
Table 5.6 shows CPLEX solution information for the problem of Section 5.4
using the standard single instance SAP form from Chapter 4 and two single
instance forms of MIP. The single instance MIP results correspond to the
formulation of Definition 5.3.7 with constraint (5.26) changed to
$$\sum_{l} x_{pl} = 1 \quad \forall\, p \notin E$$
using the optimistic minimin (MMS) and robust minimax (MXS) objectives.
The three different formulations produce the same objective value as expected
and despite solution degeneracy both the SAP and MMS problem forms remain
practically tractable. However, the dualisation required to remove the inner
maximisation from the MXS form has a dramatic effect on B&B solution space
prunability. Thus, while the robust problem form provides optimality guaran-
tees, theoretical work will be required on a new robust formulation or bounding
function [123, 146] to make the robust MIP form practically tractable for the
benchmarks considered in Chapter 4.
Although the optimistic MIP assignment results were not stable for the real
program walk when simulated, the optimistic results were still 19% better than
the single instance forms and further work could evaluate alternative optimistic
solutions against the actual trace to identify better trace optimalities given the
practical tractability of the minimin objective form. A final possible enhancement
would be to include the effect of path uncertainty on the consolidated timing
equation (5.13). As discussed in Assumption 5.3.1, Lemma 5.3.2 is only strictly
true for programs with step independent computations and data flows and the
inclusion of timing variance in MIP is left for future work.
5.6 Summary
This chapter introduced a new partitioning problem called the Multiple Instanti-
ation Problem (MIP) with applications in software performance, manufacturing
process and project management optimisation. As Section 5.2 explained, MIP
differs from previous formal assignment problems [128,139,141,148]§2 in that MIP
allows a processing element to be instantiated at multiple locations to minimise
communication costs within size constraints.
While the idea of multiple instantiation is not new, previous work has used
heuristics [45], manual methods [101] or required full sequence information [133]
and MIP represents the first formal work to consider process optimisation un-
der control flow sequence uncertainty. The chapter began in Section 5.2 by
demonstrating that multiple instantiation introduces uncertainty into control flow
graphs which compress out sequence information. Section 5.3 then showed how
the directed walk uncertainty can be dealt with through an optimal robust for-
mulation based on consolidated feasible walks called split control flow graphs and
proved that the MIP problem is inherently strongly NP-hard through a reduction
from QAP [141].
In Section 5.4, results were presented that showed the robust MIP partition is
2.16 times better than the optimal single instance SAP partition for the software
benchmark considered. The results highlighted several opportunities for future
research which were discussed in Section 5.5. The opportunities included: iden-
tifying practically optimal partitions from alternative LP solutions, addressing
the complexity of the robust formalisation using new problem formulations or
bounds and including timing uncertainty in the MIP model.
The robust MIP assignments are guaranteed to never produce a worse parti-
tion than single instance SAP no matter what the actual program directed walk
is. Consequently, the MIP formulation can be used to directly replace SAP in
the parallel Multi-level Assignment Problem (MAP) introduced in Chapter 4 to
deliver even greater program acceleration potential while providing a means to
optimise over parallel task sequence interference uncertainties.
Chapter 6
Conclusions and Future Work
6.1 Introduction
The goal of this project was to investigate the potential of heterogeneous archi-
tectures to automatically accelerate general purpose programs. In the course of
this investigation, several problems in software characterisation, computational
modelling and mathematical optimisation were solved.
The main body of the thesis began in Chapter 3 where the 3S characterisa-
tion framework was presented. The chapter showed how 3S solves the problem of
fine-grained execution timing on closed architectures and how the 3S parallelism
tool derives fine-grained heterogeneous Instruction Level Parallelism (ILP) infor-
mation from real data flow measurements on software compiled for a CPU.
Chapter 4 presented the Write-Only Architecture (WOA) which minimises
heterogeneous communication latencies and from there the Sequential As-
signment Problem (SAP) was derived to abstract the software partitioning
problem to the strongly NP-hard Generalized Quadratic Assignment Problem
(GQAP) [141, 146, 148]. The chapter continued to show how SAP could be used
to remove the coarse-grained assignment issues seen in previous parallel parti-
tioning work [97, 101] through the new Multi-level Assignment Problem (MAP).
Like all partitioning work where fine-grained information is used [79, 81, 84, 89],
MAP partitions Control/Data Flow Graphs (CDFGs) which compress out se-
quence information. This left MAP with activation sequence uncertainty which
MAP addressed through optimistic and pessimistic formulations.
In Chapter 5 the issue of sequence uncertainty in CDFGs was examined in
more detail through the new Multiple Instantiation Problem (MIP) which is be-
lieved to be the first formal work on multiple instantiation under uncertainty.
The approach taken to address the CDFG path uncertainty was to consolidate
directed walks into split control flow graphs and solve a robust minimax formu-
lation using Lagrangian dualisation [127] with transitive closures [125] expressed
as Linear Programming (LP) constraints. The results of Section 5.4 showed that
MIP produces between 2.16 and 3.23 times the optimality of SAP for the bench-
mark considered and the uncertainty range brought us to several possible areas
for future work which will be reviewed in Section 6.3 after a brief summary of
the main contributions delivered by this research in Section 6.2.
6.2 Contributions
The tangible contributions delivered by this project include:
• 3S [1]: a novel software characterisation framework that combines static in-
strumentation with dynamic characterisation enabling simple tools to mea-
sure real fine-grained execution timing, control and data flow information
for any compilable program.
• MAP [2,3]: a new execution model and multi-level parallel assignment ap-
proach that delivers up to 64 times the heterogeneous acceleration potential
of previous work for the benchmarks considered.
• MIP [4,5]: the first formal work on multiple instantiation under activation
sequence uncertainty, providing more than twice the acceleration poten-
tial of previous optimal assignment methods for the software considered
and capable of being integrated with MAP to deliver significant gains for
heterogeneous partitioning.
These contributions were discussed in Chapters 3, 4 and 5 of this thesis respec-
tively and represent orders of magnitude increases in heterogeneous execution
performance potential over previous work [81,84,89,97,101].
The intangible contributions of this work include: identifying characterisation,
modelling and assignment as the critical activities in optimal automated hetero-
geneous partitioning, demonstrating that fine-grained sequential partitioning is
the core issue in even parallel heterogeneous partitioning through MAP, connect-
ing the heterogeneous partitioning problem to formal mathematical abstractions
and demonstrating the tangibility of optimal approaches for real software assign-
ment problems. These benefits will allow future research to focus on well defined
sub-problems and support the commercialisation of this work.
6.3 Future Work
The contributions of this research are implementable today. The unknowns asso-
ciated with characterisation information availability, execution model efficiency
and the tractability of optimal heterogeneous partitioning have been removed.
However, there is still practical and theoretical work that can be done as out-
lined in this section.
3S Enhancements
The 3S framework [1] is currently only implemented for x86 CPUs. The fine-
grained 3S timing measurements were used in Chapter 4 for assignments to the
reference CPU with computation times for assignments to other hardware com-
ponents estimated using 3S parallelism and iteration measurements together with
hardware data sheet information.
The x86 CPU is the most complex generally available CPU and it would be
relatively simple to port 3S to other CPUs. This would allow the computation
timing equation of Section 4.3 to be replaced with actual timing measurements
for current System on Chip (SoC) architectures which tend to use a RISC CPU
core [25,27,28].
In contrast to previous binary characterisation frameworks, 3S can identify
true run-time data dependencies for programs. However, as discussed in Sec-
tion 3.3, the current implementation of 3S does not identify floating point data
flow dependencies for the x86 CPU because of Intel’s unique internal 8087 float-
ing point register stack. Floating point data flow analysis was not required in
this work because the benchmarks considered were all integer but it would be a
simple matter to enhance 3S to shadow the internal 8087 register stack in mem-
ory at run-time and such an enhancement could make a reasonable project for a
Masters student.
Additionally, there is considerable scope for 3S characterisation tool enhance-
ment and new tool development. As an example of a possible tool enhancement,
the 3S parallelism tool could be augmented with extra slot map dimensions to
facilitate the quantification of loop pipelining and SIMD potential. Examples
of possible new tools include: a dynamic memory access violation checker based
on the 3S memory tool, a test coverage tool based on hotspot and a full system
simulator based on the 3S MIP execution simulator discussed in Section 5.4.
These enhancements would extend the contribution of 3S beyond the scope of
this project.
Partitioning with Certainty
Chapter 4 identified heterogeneous acceleration potentials of up to 36 times for
single instance sequential assignment. The acceleration figures were dependent
on the WOA which is in turn dependent on the ability of a caller to predict the
data requirements of future targets in the control flow path.
For input independent transformation programs such as SHA [59] the data
requirements of each stage in the computation are fully predictable however for
data dependent transformation programs such as LZ77 [60], data access predic-
tion could require simulation making the assumption of a priori run-time access
predictability unrealistic. To address this issue Section 4.2 proposed sending data
contexts [92] and backing off to a Shared Memory model [79,81,84] where the con-
text transmission time exceeds the memory pull latency overhead. Future work
could investigate the sensitivity of heterogeneous acceleration potentials to these
communication alternatives and research new tools and models to determine data
predictability, context size and shared memory overhead for programs.
The optimal assignment of parallel tasks to shared sequential components
would require sequence information to identify scheduling conflicts and sequence
information is not usually available at the fine-grained characterisation level re-
quired for optimal heterogeneous assignment because of trace length issues. The
MAP parallel assignment approach introduced in Section 4.5 addressed the lack
of sequence information by providing both an optimistic and a pessimistic prob-
lem form. The optimistic form assumed that the execution of tasks on a shared
sequential computational component like a CPU core would be free from schedul-
ing conflicts and the pessimistic form assumed that all executions on sequential
components would have to be serialised.
Future work could extend MAP with the robust control flow sequence un-
certainty techniques developed in Chapter 5 to obtain a tighter solution upper
bound than the pessimistic scheduling assumption. Additionally, it would be a
straightforward task to evaluate the sensitivity of selected assignments against
different scheduling conflicts and future work could quantify practical serialisation
conflicts using trace simulations and real hardware implementations.
The problems of Chapter 4 proved practically tractable for the representative
software benchmarks and wide range of architectural characteristics considered.
However, there may be software benchmarks or architectural configurations for
which SAP and MAP become intractable and future work could abstract the
characteristics of general programs to allow automatic problem generation to
z-test the tractability of the single instance software assignment problems.
Partitioning with Uncertainty
Chapter 4 demonstrated that the single instance SAP and, by extension, the
MAP problems were computationally tractable for software assignment problem
instances by virtue of Branch and Bound (B&B) [123] solution space pruning.
However, Section 5.5 disclosed that the robust formulation of multiple instance
MIP has tractability issues.
Theoretical research to address the robust MIP tractability issues could at-
tempt to find alternative model formulations and problem specific bounding re-
laxations [123, 146] and practical research could begin with the evaluation of
alternative solutions to the optimistic minmin MIP formulation against program
walks using 3S trace simulation. As discussed in Section 5.5, the minmin MIP
objective formulation does not suffer from tractability issues for the benchmark
considered and has the potential to identify assignments with better objectives
than the robust solution for practical traces but the size of the multiple solution
space coupled with the time for trace simulation are issues that may need to be
addressed.
The WOA timing equation of Chapter 4 encapsulated the fact that different
activations to the same software section at the same location can have different
computation and communication times. Time variability is not an issue for the
single instance SAP and MAP problems as formulated, but adds an extra level of
uncertainty to MIP as highlighted in Assumption 5.3.1 on page 76. The variability
of execution and communication costs was excluded from the MIP formulation
in this work and could be an interesting topic for future research.
An area of computational partitioning research that has been excluded from
this work is dynamic reconfiguration [19,27,41,42]. Dynamic reconfiguration ex-
ploits computation locality within programs to alleviate space constraints through
the temporal sharing of reconfigurable hardware components. Like the Multiple
Instantiation Problem, the exact identification of optimal dynamic reconfigura-
tion assignments requires full sequence information and the robust optimisation
techniques of Chapter 5 could be applied to identify dynamic reconfiguration
opportunities for programs in the absence of sequence information.
Finally, while it has been assumed throughout this work that sequence infor-
mation is not available in fine-grained characterisations due to trace compression
issues, many programs exhibit partial execution sequence patterns that can be
readily extracted using the 3S loopgraph d tool presented in Appendix A [1].
Future work could extend MIP to include such additional information and use
MIP to address a wider range of problems [153,154,155,156]§4.
Glossary
3S Spacey Stream Splitter (characterisation framework)
ACK Acknowledge (communications signal)
ALU Arithmetic Logic Unit
APA Attractiveness Partitioning Algorithm
API Application Programming Interface
ASIC Application Specific Integrated Circuit
AST Abstract Syntax Tree
B&B Branch and Bound
CDB Common Data Bus
CDFG Control Data Flow Graph (combined CFG and DFG)
CFG Control Flow Graph
CISC Complex Instruction Set Computer (CPU type)
CMOS Complementary Metal-Oxide Semiconductor (HW technology)
CPLEX The leading commercial MILP solver
CPU Central Processing Unit (computational component)
DAG Directed Acyclic Graph
DFG Data Flow Graph
ELF Executable and Linking Format
eOMAP expanded OMAP
eRMAP expanded RMAP
FPGA Field Programmable Gate Array (computational component)
FPU Floating Point Unit (CPU coprocessor)
GAP Generalized Assignment Problem
GLPK An open source MILP solver library
GLPSOL A command line interface to GLPK
GPU Graphics Processing Unit (computational component)
GQAP Generalized Quadratic Assignment Problem
GQMIP Generalized Quadratic Multiple Instantiation Problem (synonym for MIP)
HW Hardware
IE Instruction Equivalents (abstract HW size unit)
ILP Instruction Level Parallelism (SW characteristic)
IR Instruction Requests (SW characteristic)
JMP Jump (CPU instruction)
LP Linear Programming
LPM Linear Program Manipulation library (§5.3)
LUT Look-Up Table (FPGA HW primitive)
MAP Multi-level Assignment Problem (§4.5)
MILP Mixed Integer Linear Programming
MIMD Multiple Instruction Multiple Data (CPU type)
MIP Multiple Instantiation Partitioning (§5)
MIP Mixed Integer Programming
MIQP Mixed Integer Quadratic Programming
MMM Optimistic minimin MIP formulation without closure constraints
MMM* MMM formulation with closure constraints
MMS MMM formulation with single instance constraints
MMX Intel’s multimedia instruction set extensions
MPPA Massively Parallel Processing Array (HW Component)
MUX Multiplexer (Hardware building block)
MXM Robust minimax MIP formulation without closure constraints
MXM* MXM formulation with closure constraints
MXS MXM formulation with single instance constraints
OFB Output Feedback (cypher mode)
OMAP Optimistic Multi-level Assignment Problem
OR Operations Research
OS Operating System
QAP Quadratic Assignment Problem
QP Quadratic Programming
RCD Reconfigurable Computing Device
RDTSC ReaD Time Stamp Counter
RISC Reduced Instruction Set Computer (CPU type)
RMAP Robust Multi-level Assignment Problem
RPC Remote Procedure Call
RTL Register Transfer Level
SAP Sequential Assignment Problem (§4.4)
SHA Secure Hash Algorithm
SIMD Single Instruction Multiple Data (CPU type)
SIMT Single Instruction Multiple Thread (GPU type)
SMP Symmetric Multiprocessor
SW Software
TCP Transmission Control Protocol
TCP/IP TCP Internet Protocol Suite
TCP/UDP TCP User Datagram Protocol
UCM Unit Connectivity Matrix
UML Unified Modelling Language
VLSI Very Large Scale Integration
WOA Write-Only Architecture (§4.2)
Bibliography
[1] Spacey, S.A. 3S: Program Instrumentation and Characterisation Framework.
Imperial Technical Paper (http://www.doc.ic.ac.uk/research/technicalreports/2008/DTR08-1.pdf, 2006).
[2] Spacey, S.A., Luk, W., Kelly, P.H.J., Kuhn, D. Rapid Design-
Space Visualisation through Hardware/Software Partitioning. IEEE South-
ern Programmable Logic Conference (2009).
[3] Spacey, S.A., Luk, W., Kelly, P.H.J., Kuhn, D. Coarse-grained
Parallel Partitioning Through Fine-grained Sequential Assignment for Het-
erogeneous Computational Systems. Awaiting Publication (2009).
[4] Spacey, S.A. Concise CPLEX. Imperial Technical Paper
(http://www.doc.ic.ac.uk/research/technicalreports/2009/DTR09-7.pdf, 2009).
[5] Spacey, S.A., Wiesemann, W., Kuhn, D., Luk, W. Robust
Software Partitioning with Multiple Instantiation. Awaiting Publication
(2009).
[6] Hennessy, J.L., Patterson, D.A. Computer Architecture: A Quanti-
tative Approach. Morgan Kaufmann (2002).
[7] De Micheli, G. Synthesis and Optimization of Digital Circuits. McGraw-
Hill (1994).
[8] Flynn, M.J. Some Computer Organizations and their Effectiveness. IEEE
Transactions on Computers, pp. 948–960 (1972).
[9] Patterson, D.A., Sequin, C.H. RISC I: A Reduced Instruction Set
VLSI Computer. International Symposium on Computer Architecture, pp.
443–457 (1981).
[10] Estrin, G. et al. SARA (System ARchitects Apprentice): Modelling,
Analysis, and Simulation Support for Design of Concurrent Systems. IEEE
Transactions on Software Engineering, Vol. 12, No. 2, pp. 293–311 (1986).
[11] Reshadi, M., Dutt, N. Generic Pipelined Processor Modelling and
High Performance Cycle-Accurate Simulator Generation. Proceedings of
the Conference on Design, Automation and Test in Europe (DATE), Vol.
2, pp. 786–791 (2005).
[12] Burger, D., Austin, T.M. The SimpleScalar Tool Set, Version 2.0.
ACM SIGARCH Computer Architecture News, Vol. 25, pp. 13–25 (1997).
[13] Tomasulo, R.M. An Efficient Algorithm for Exploiting Multiple Arith-
metic Units. IBM Journal of Research and Development, Vol. 11, No. 1,
pp. 25–33 (1967).
[14] Bic, L., Nagel, M.D., Roy, J.M.A. Automatic Data/Program Parti-
tioning Using the Single Assignment Principle. In Proceedings of the 1989
ACM/IEEE Conference on Supercomputing, Reno, Nevada, United States,
pp. 551–556 (1989).
[15] Wolf, M.E., Lam, M.S. A Data Locality Optimizing Algorithm. Pro-
ceedings of the ACM SIGPLAN 1991 Conference on Programming Lan-
guage Design and Implementation, pp. 30–44 (1991).
[16] Showerman, M., Enos, J., Pant, A., Kindratenko, V., Steffen,
C., Pennington, R., Hwu, W. QP: A Heterogeneous Multi-Accelerator
Cluster. Proceedings of the 10th LCI International Conference on High-
Performance Clustered Computing (2009).
[17] Butts, M., Jones, A.M., Wasson, P. A Structural Object Program-
ming Model, Architecture, Chip and Tools for Reconfigurable Computing.
Proceedings of the IEEE Symposium on Field-Configurable Custom Com-
puting Machines (FCCM), pp. 55–64 (2007).
[18] Estrin, G. Reconfigurable Computer Origins: The UCLA Fixed-Plus-
Variable (F+V) Structure Computer. IEEE Annals of the History of Com-
puting, Vol. 24, No. 4, pp. 3–9 (2002).
[19] Gokhale, M., Graham, P.S. Reconfigurable Computing: Accelerat-
ing Computation with Field-Programmable Gate Arrays. Springer, ISBN:
0387261052 (2005).
[20] Bjurus, P., Millberg, M., Jantsch, A. FPGA Resource and Tim-
ing Estimation from Matlab Execution Traces. Proceedings of the Tenth
International Symposium on Hardware/Software Codesign (2002).
[21] Link, G.M., Vijaykrishnan, N. Hotspot Prevention Through Run-
time Reconfiguration in Network-On-Chip. Proceedings of the Conference
on Design, Automation and Test in Europe (DATE), Vol. 1, pp. 648–649
(2005).
[22] Panainte, E.M., Bertels, K., Vassiliadis, S. Instruction Schedul-
ing for Dynamic Hardware Configurations. Proceedings of the Conference
on Design, Automation and Test in Europe (DATE), Vol. 1, pp. 100–105
(2005).
[23] Hung, A., Bishop, W., Kennings, A. Symmetric Multiprocessing on
Programmable Chips made Easy. Proceedings of the Conference on Design,
Automation and Test in Europe (DATE), Vol. 1, pp. 240–245 (2005).
[24] Arora, D., Ravi, S., Raghunathan, A., Jha, N.K. Secure Embedded
Processing through Hardware-Assisted Run-Time Monitoring. Proceedings
of the Conference on Design, Automation and Test in Europe (DATE), Vol.
1, pp. 178–183 (2005).
[25] Stretch Inc. The S6000 Family of Processors: Architecture White Paper
(2007).
[26] Stretch Inc. S6000 Family (2007).
[27] Xilinx Inc. Virtex-5 Family Overview (2008).
[28] Altera Corp. Stratix IV Device Family Overview (2008).
[29] Leong, P.H.W. et al. Pilchard – A Reconfigurable Computing Plat-
form with Memory Slot Interface. 9th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM), pp. 170–179 (2001).
[30] Miyashiro, T. et al. DIMMnet-2: A Reconfigurable Board Connected into
a Memory Slot. International Conference on Field Programmable Logic and
Applications (FPL), pp. 1–4 (2006).
[31] Hauser, J., Wawrzynek, J. Garp: A MIPS Processor with a Re-
configurable Coprocessor. Proceedings of the IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM), pp. 12–21 (1997).
[32] Cheung, R.C., Luk, W., Cheung, P.Y. Reconfigurable Elliptic Curve
Cryptosystems on a Chip. Proceedings of the Conference on Design, Au-
tomation and Test in Europe (DATE), Vol. 1, pp. 24–29 (2005).
[33] Skillicorn, D.B., Talia, D. Models and Languages for Parallel
Computation. ACM Computing Surveys, Vol. 30(2), pp. 123–169 (1998).
[34] Klingauf, W. Systematic Transaction Level Modelling of Embedded Sys-
tems with SystemC. Proceedings of the Conference on Design, Automation
and Test in Europe (DATE), Vol. 1, pp. 566–567 (2005).
[35] Rissa, T., Donlin, A., Luk, W. Evaluation of SystemC Modelling
of Reconfigurable Embedded Systems. Proceedings of the Conference
on Design, Automation and Test in Europe (DATE), Vol. 3, pp. 253–258
(2005).
[36] Edwards, S.A. The Challenges of Hardware Synthesis from C-Like Lan-
guages. Proceedings of the Conference on Design, Automation and Test in
Europe (DATE), Vol. 1, pp. 66–67 (2005).
[37] Zhao, S., Gajski, D.D. Defining an Enhanced RTL Semantics. Proceed-
ings of the Conference on Design, Automation and Test in Europe (DATE),
Vol. 1, pp. 548–553 (2005).
[38] Guo, Z., Buyukkurt, B., Najjar, W., Vissers, K. Optimized Gener-
ation of Data-Path from C Codes for FPGAs. Proceedings of the Conference
on Design, Automation and Test in Europe (DATE), Vol. 1, pp. 112–117
(2005).
[39] Mancoridis, S., Mitchell, B.S., Rorres, C., Chen, Y., Gansner,
E.R. Using Automatic Clustering to Produce High-Level System Orga-
nizations of Source Code. Proceedings of 6th International Workshop on
Program Comprehension, pp. 45–52 (1998).
[40] Oak Ridge National Laboratory. PVM: Parallel Virtual Machine.
[41] Mackinlay, P.I., Cheung, P.Y.K., Luk, W., Sandiford, R.D.
Riley-2: A Flexible Platform for Codesign and Dynamic Reconfigurable
Computing Research. Field-Programmable Logic and Applications, LNCS
1304, pp. 91–100 (1997).
[42] Sedcole, P., Blodget, B., Becker, T., Anderson, J., Lysaght,
P. Modular Dynamic Reconfiguration in Virtex FPGAs. IEE Computers
and Digital Techniques, Vol. 153, pp. 157–168 (2006).
[43] Hex-Rays SA. IDA Pro – At the Cornerstone of IT Security
(http://www.hex-rays.com/idapro/ida-executive.pdf).
[44] GAS Manual (http://sourceware.org/binutils/docs-2.19/as).
[45] Using the GNU Compiler Collection
(http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc.pdf).
[46] Altera Corp. Automated Generation of Hardware Accelerators with
Direct Memory Access from ANSI/ISO Standard C Functions. Corporate
White Paper (2006).
[47] Mencer, O., Pearce, D.J., Howes, L.W., Luk, W. Design Space
Exploration with A Stream Compiler. IEEE International Conference on
Field Programmable Technology (FPT) (2003).
[48] Celoxica Ltd. Handel-C Language Reference Manual (2004).
[49] Coutinho, J.G. de F., Jiang, J., Luk, W. Interleaving Behavioural
and Cycle-Accurate Descriptions for Reconfigurable Hardware Compila-
tion. Proceedings of IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM), pp. 245–254 (2005).
[50] nVidia Corp. nVidia CUDA Programming Guide (2009).
[51] KHRONOS Group. OpenCL (2009).
[52] Chandra, S. Hardware/Software Partitioning from Application Binaries.
Xcell Journal, Third Quarter, pp. 26–28 (2006).
[53] Cardoso, J.M.P., Neto, H.C. Macro-Based Hardware Compilation of
Java Bytecodes into a Dynamic Reconfigurable Computing System. IEEE
Symposium on Field-Programmable Custom Computing Machines (FCCM)
(1999).
[54] Sass, R., Beeraka, P., Agron, J., Young, J., Andrews, D.,
Greskamp, B., Beeravolu, S., Trefftz, C. Run-Time Reconfigurable
Java Virtual Machine on a Platform FPGA. Submitted to the Fourteenth
Annual IEEE Symposium on Field-Programmable Custom Computing Ma-
chines (FCCM) (2006).
[55] Ha, Y., Hipik, R., Vernalde, S., Verkest, D., Engels, M., Lauw-
ereins, R., Man, H.D. Adding Hardware Support to the HotSpot Vir-
tual Machine for Domain Specific Applications. Lecture Notes in Computer
Science, Vol. 2438, pp. 1135 (2002).
[56] Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M.,
Mudge, T., Brown, R.B. MiBench: A Free, Commercially Representa-
tive Embedded Benchmark Suite. IEEE 4th Annual Workshop on Workload
Characterization, Austin, TX (2001).
[57] Henning, J.L. SPEC CPU2006 Benchmark Descriptions. Computer Ar-
chitecture News, 34(4) (2006).
[58] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A. The
SPLASH-2 Programs: Characterization and Methodological Considera-
tions. Proceedings of the 22nd International Symposium on Computer Ar-
chitecture, Santa Margherita Ligure, Italy, pp. 24–36 (1995).
[59] FIPS 180-2: Secure Hash Standard. National Institute of Standards and
Technology (NIST) (2002).
[60] Ziv, J., Lempel, A. A Universal Algorithm for Sequential Data Com-
pression. IEEE Transactions on Information Theory, Vol. 23, No. 3, pp.
337–343 (1977).
[61] Wilson, R. et al. SUIF: An Infrastructure for Research on Parallelizing
and Optimizing Compilers. ACM SIGPLAN Notices, Vol. 29, No. 12 (1994).
[62] Hall, M.W., Anderson, J.M., Amarasinghe, S.P., Murphy, B.R.,
Liao, S., Bugnion E., Lam, M.S. Maximizing Multiprocessor Perfor-
mance with the SUIF Compiler. IEEE Computer (A special issue on mul-
tiprocessors) (1996).
[63] Aigner, G., Diwan, A., Heine, D., Lam, M., Moore, D., Murphy,
B., Sapuntzakis, C. An Overview of the SUIF2 Compiler Infrastructure.
Stanford University Computer Systems Laboratory Technical Report
(http://suif.stanford.edu/suif/suif2/doc-2.2.0-4).
[64] Pearce, D.J., Kelly, P.H.J., Field, T., Harder, U. GILK: A Dy-
namic Instrumentation Tool for the Linux Kernel. Proceedings of the 12th
International Conference on Computer Performance Evaluation, Modelling
Techniques and Tools, Vol. 37, pp. 220–226 (2002).
[65] Nethercote, N., Seward, J. Valgrind: A Program Supervision Frame-
work. Proceedings of the 3rd Workshop on Runtime Verification (2003).
[66] Luk, C. et al. Pin: Building Customized Program Analysis Tools with
Dynamic Instrumentation. Proceedings of the 2005 ACM SIGPLAN Con-
ference on Programming Language Design and Implementation (2005).
[67] Srivastava, A., Eustace, A. ATOM: A System for Building Customized
Program Analysis Tools. Proceedings of the ACM SIGPLAN 1994 Confer-
ence on Programming Language Design and Implementation, pp. 196–205
(1994).
[68] Brewer, E.A., Dellarocas, C.N., Colbrook, A., Weihl, W.E.
PROTEUS: A High-Performance Parallel-Architecture Simulator. Proceed-
ings of the 1992 SIGMETRICS Joint International Conference on Mea-
surement and Modeling of Computer Systems, New York, USA, pp. 247–248
(1992).
[69] Herrod, S. Tango Lite: A Multiprocessor Simulation Environment. Stan-
ford University Computer Systems Laboratory Technical Report (1993).
[70] Larus, J.R., Schnarr, E. EEL: Machine-Independent Executable Edit-
ing. Proceedings of the ACM SIGPLAN 1995 Conference on Programming
Language Design and Implementation, pp. 291–300 (1995).
[71] Meeuws, R., Yankova, Y., Bertels, K., Gaydadjiev, G., Vassil-
iadis, S. A Quantitative Prediction Model for Hardware/Software Parti-
tioning. International Conference on Field Programmable Logic and Appli-
cations (2007).
[72] Mytkowicz, T., Diwan, A., Hauswirth, M., Sweeney, P.F.
Producing Wrong Data Without Doing Anything Obviously
Wrong! Architectural Support for Programming Languages and Operating
Systems (ASPLOS), Washington, DC, USA, pp. 265–276 (2009).
[73] Larus, J.R. Whole Program Paths. Proceedings of the ACM SIGPLAN
1999 Conference on Programming Language Design and Implementation,
pp. 259–269 (1999).
[74] Brazma, A. Learning of Regular Expressions by Pattern Matching. Pro-
ceedings of the Second European Conference on Computational Learning
Theory, Vol. 904, pp. 392–403 (1995).
[75] Wegman, M. Summarizing Graphs by Regular Expressions. Proceedings
of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Pro-
gramming Languages, pp. 203–216 (1983).
[76] Wall, D.W. Limits of Instruction-Level Parallelism. WRL Research Re-
port (1993).
[77] Girkar, M., Polychronopoulos, C.D. Automatic Extraction of Func-
tional Parallelism from Ordinary Programs. IEEE Transactions on Parallel
and Distributed Systems, Vol. 3, No. 2, pp. 166–178 (1992).
[78] Kalavade, A., Lee, E.A. A Global Criticality/Local Phase Driven
Algorithm for the Constrained Hardware/Software Partitioning Problem.
International Workshop on Hardware/Software Codesign, pp. 42–48 (1994).
[79] Stitt, G., Vahid, F. Hardware/Software Partitioning of Software Bina-
ries. IEEE/ACM International Conference on Computer Aided Design, pp.
164–170 (2002).
[80] Stitt, G., Lysecky, R., Vahid, F. Dynamic Hardware/Software Par-
titioning: A First Approach. 40th Design Automation Conference, pp.
250–255 (2003).
[81] Lysecky, R., Vahid, F. A Configurable Logic Architecture for Dynamic
Hardware/Software Partitioning. Design Automation and Test in Europe
Conference (DATE), pp. 480–485 (2004).
[82] Lysecky, R., Vahid, F. A Study of the Speedups and Competitiveness
of FPGA Soft Processor Cores using Dynamic Hardware/Software Parti-
tioning. Design Automation and Test in Europe (DATE), pp. 18–23 (2005).
[83] Lysecky, R., Stitt, G., Vahid, F. Warp Processors. ACM Transac-
tions on Design Automation of Electronic Systems (TODAES), pp. 659–681
(2006).
[84] Stitt, G., Vahid, F. A Decompilation Approach to Partitioning
Software for Microprocessor/FPGA Platforms. Design Automation and Test in
Europe (DATE), pp. 396–397 (2005).
[85] Stitt, G., Vahid, F. Hardware/Software Partitioning of Software
Binaries: A Case Study of H.264 Decode. International Conference on
Hardware/Software Codesign and System Synthesis (CODES/ISSS),
pp. 285–290 (2005).
[86] Stitt, G., Vahid, F. New Decompilation Techniques for Binary-level
Coprocessor Generation. IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), pp. 547–554 (2005).
[87] Atasu, K., Dündar, G., Özturan, C. An Integer Linear Programming
Approach for Identifying Instruction-Set Extensions. Proceedings of the
International Conference on Hardware – Software Codesign and System
Synthesis (CODES+ISSS), Jersey City, New Jersey (2005).
[88] Atasu, K., Dimond, R.G., Mencer, O., Luk, W., Özturan, C.,
Dündar, G. Optimizing Instruction-set Extensible Processors under Data
Bandwidth Constraints. Proceedings of Design, Automation and Test in
Europe Conference and Exhibition (DATE), Nice, France (2007).
[89] Atasu, K., Özturan, C., Dündar, G., Mencer, O., Luk, W.
CHIPS: Custom Hardware Instruction Processor Synthesis. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems, Vol.
27, No. 3, pp. 528–541 (2008).
[90] Sirowy, S., Wu, Y., Lonardi, S., Vahid, F. Two Level
Microprocessor-Accelerator Partitioning. IEEE/ACM Design Automation
and Test in Europe (DATE), pp. 313–318 (2007).
[91] Henkel, J. A Low Power Hardware/Software Partitioning Approach for
Core-Based Embedded Systems. 36th ACM/IEEE Conference on Design
Automation, pp. 122–127 (1999).
[92] Oh, S., Kim, T.G., Bozorgzadeh, E. Speculative Loop-Pipelining
in Binary Translation for Hardware Acceleration. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, pp. 409–422
(2008).
[93] Hunt, G.C., Scott, M.L. The Coign Automatic Distributed Partition-
ing System. Proceedings of the 3rd Symposium on Operating Systems Design
and Implementation, New Orleans, Louisiana, pp. 45–52 (1999).
[94] Hogstedt, K., Kimelman, D., Rajan, V. T., Roth, T., Wegman,
M. Graph Cutting Algorithms for Distributed Applications Partitioning.
SIGMETRICS Performance Evaluation Review, Vol. 28, pp. 27–29 (2001).
[95] Eles, P., Peng, Z., Kuchcinski, A., Doboli, A. System Level
Hardware/Software Partitioning Based on Simulated Annealing and Tabu
Search. Journal on Design Automation for Embedded Systems, Vol. 2, pp.
5–32 (1997).
[96] Wiangtong, T., Cheung, P.Y.K., Luk, W. Comparing Three Heuris-
tic Search Methods for Functional Partitioning in Hardware-Software Code-
sign. Journal on Design Automation for Embedded Systems, Vol. 6, No. 4,
pp. 425–449 (2002).
[97] Wiangtong, T., Cheung, P.Y.K., Luk, W. Hardware/Software Code-
sign: A Systematic Approach Targeting Data-Intensive Applications. IEEE
Signal Processing, Vol. 22, No. 3, pp. 14–22 (2005).
[98] Lam, Y.M., Coutinho, J.G.F., Luk, W., Leong, P.H.W. Integrated
Hardware/Software Codesign for Heterogeneous Computing Systems. Pro-
ceedings of the IEEE Southern Conference on Programmable Logic, pp.
217–220 (2008).
[99] Urfianto, M.Z., Isshiki, T., Kahn, A.U., Li, D., Kunieda, H. De-
composition of Task-Level Concurrency on C Programs Applied to the De-
sign of Multiprocessor SoC. IEICE Transactions on Fundamentals of Elec-
tronics, Communications and Computer Sciences, pp. 1748–1756 (2008).
[100] Liu, Q., Constantinides, G.A., Cheung, P.Y.K. Combining Data
Reuse With Data-Level Parallelization for FPGA-Targeted Hardware Com-
pilation: A Geometric Programming Framework. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, Vol. 28, No. 3,
pp. 305–315 (2009).
[101] Lam, Y.M., Coutinho, J.G.F., Luk, W., Leong, P.H.W. Optimising
Multi-loop Programs for Heterogeneous Computing Systems. Proceedings of
the IEEE Southern Programmable Logic Conference (2009).
[102] Ulm, D.R., Baker, J.W., Scherger, M.C. Solving a 2D Knapsack
Problem Using a Hybrid Data-Parallel/Control Style of Computing. Pro-
ceedings of the 18th International Parallel and Distributed Processing Sym-
posium (IPDPS04), pp. 260 (2004).
[103] Ferrandi, F., Fossati, L., Lattuada, M., Palermo, G, Sciuto,
D., Tumeo, A. Partitioning and Mapping for the hArtes European
Project. Proceedings of the Workshop on Directions in FPGAs and Re-
configurable Systems: Design, Programming and Technologies for Adaptive
Heterogeneous Systems-on-Chip and their European Dimensions (2007).
[104] Lee, J., Shragowitz, E., Sahni, S. A Hypercube Algorithm for the
0/1 Knapsack Problem. Journal of Parallel Distributed Computing, Vol. 5,
No. 4, pp. 438–456 (1988).
[105] Holden, B. Latency Comparison Between HyperTransport and PCI-
Express In Communications Systems. HyperTransport Consortium (2006).
[106] Stevens, W.R. TCP/IP Illustrated, Volume 1: The Protocols.
Addison-Wesley Professional Computing Series (1994).
[107] Sun Microsystems Inc. RPC: Remote Procedure Call Protocol
Specification. IETF RFC 1057 (1988).
[108] Gansner, E.R., North, S.C. An Open Graph Visualization System and
its Applications to Software Engineering. Software Practice and Experience,
Vol. 30, pp. 1203–1233 (1999).
[109] PRIV8 Ltd. (http://www.priv8.com).
[110] nVidia Corp. (http://www.nvidia.com).
[111] Intel Corp. (http://www.intel.com).
[112] ILOG Inc. (http://www.ilog.com).
[113] AMPL Modelling Language (http://www.ampl.com).
[114] General Algebraic Modeling System (GAMS) (http://www.gams.com).
[115] OPL Studio (http://www.ilog.com/products/oplstudio).
[116] Hoare, C.A.R. Communicating Sequential Processes. Communications
of the ACM, Vol. 21, No. 8, pp. 666–677 (1978).
[117] Aho, A.V., Hopcroft, J.E., Ullman, J.D. The Design and Analysis
of Computer Algorithms. Addison Wesley, 2nd Edition (1974).
[118] Garey, M.R., Johnson, D.S. Computers and Intractability: A Guide to
the Theory of NP-Completeness. Freeman, New York, 2nd Edition (1979).
[119] Hopcroft, J.E., Motwani, R., Ullman, J.D. Introduction to Au-
tomata Theory, Languages, and Computation. Addison Wesley, 2nd Edi-
tion (2000).
[120] Sedgewick, R. Algorithms in C Part 5: Graph Algorithms. Addison-
Wesley Professional, 3rd Edition (2001).
[121] Sipser, M. Introduction to the Theory of Computation. Course Technol-
ogy, 2nd Edition (2005).
[122] Martello, S., Toth, P. Knapsack Problems: Algorithms and Computer
Implementations. John Wiley & Sons Ltd. (1990).
[123] Wolsey, L.A. Integer Programming. John Wiley & Sons Inc. (1998).
[124] Bronson, R., Naadimuthu, G. Operations Research. Schaum’s
Outlines, McGraw Hill, New York, 2nd Edition (1997).
[125] Papadimitriou, C.H. Computational Complexity. Addison Wesley
(1993).
[126] Papadimitriou, C.H., Steiglitz, K. Combinatorial Optimization: Al-
gorithms and Complexity. Dover Publications Inc., New York (1998).
[127] Rockafellar, R.T. Convex Analysis. Princeton University Press (1970).
[128] Dantzig, G.B. Discrete-Variable Extremum Problems. Operations Re-
search, Vol. 5, No. 2, pp. 266–277 (1957).
[129] Cook, S. The Complexity of Theorem Proving Procedures. Proceedings of
the Third Annual ACM Symposium on Theory of Computing, pp. 151–158
(1971).
[130] Karp, R.M. Reducibility Among Combinatorial Problems. Complexity of
Computer Computations, pp. 85–103 (1972).
[131] Zuckerman, D. On Unapproximable Versions of NP-Complete Problems.
SIAM Journal on Computing, Vol. 25, No. 6, pp. 1293–1304 (1996).
[132] Kuhn, D. Aggregation and Discretization in Multistage Stochastic Pro-
gramming. Mathematical Programming, Series A (2006).
[133] Niemann, R., Marwedel, P. Hardware/Software Partitioning using
Integer Programming. Proceedings of the 1996 European Conference on
Design and Test, pp. 473–479 (1996).
[134] Diubin, G.N., Korbut, A.A. Greedy Algorithms for the Minimization
Knapsack Problem: Average Behaviour. Journal of Computer and Systems
Sciences International, Vol. 47, No. 1, pp. 14–24 (2008).
[135] Marchetti-Spaccamela, A., Vercellis, C. Efficient On-line
Algorithms for the Knapsack Problem. 14th International Colloquium on
Automata, Languages and Programming, pp. 445–456 (1987).
[136] Martello, S., Toth, P. An Algorithm for the Generalized Assignment
Problem. Operational Research, pp. 589–603 (1981).
[137] Yagiura, M., Ibaraki, T. Recent Metaheuristic Algorithms for the
Generalized Assignment Problem. Proceedings of the 12th International
Conference on Informatics Research for Development of Knowledge Society
Infrastructure (ICKS’04), pp. 229–237 (2004).
[138] Nauss, R.M. Solving the Generalized Assignment Problem: An Opti-
mizing and Heuristic Approach. INFORMS Journal on Computing, pp.
249–266 (2003).
[139] Cohen, R., Katzir, L., Raz, D. An Efficient Approximation for the
Generalized Assignment Problem. Information Processing Letters, pp. 162–
166 (2006).
[140] Hahn, P.M., Grant, T.L. Lower Bounds for the Quadratic Assignment
Problem Based upon a Dual Formulation. Operations Research, Vol. 46,
pp. 912–922 (1998).
[141] Sahni, S., Gonzalez, T. P-complete Approximation Problems. Journal
of the ACM (JACM), Vol. 23, pp. 555–565 (1976).
[142] Nugent, C., Vollmann, T., Ruml, J. An Experimental Comparison
of Techniques for the Assignment of Facilities to Locations. Operations
Research, Vol. 16, pp. 150–173 (1968).
[143] Burkard, R.E., Karisch, S.E., Rendl, F. QAPLIB — A Quadratic
Assignment Problem Library. Journal of Global Optimization, Vol. 10, pp.
391–403 (1997).
[144] Green, K. Nug30 – Unsolved for More than Three Decades – Cracked
on the Grid. National Computational Science Alliance
(http://access.ncsa.illinois.edu/Releases/00Releases/000718.nug30.html, 2000).
[145] Cordeau, J-F., Gaudioso, M., Laporte, G., Moccia, L. A Memetic
Heuristic for the Generalized Quadratic Assignment Problem. INFORMS
Journal on Computing, Vol. 18, No. 4, pp. 433–443 (2006).
[146] Hahn, P.M., Kim, B., Guignard, M., Smith, J.M., Zhu, Y. An
Algorithm for the Generalized Quadratic Assignment Problem. Computa-
tional Optimization and Applications, pp. 351–372 (2008).
[147] Skutella, M. Convex Quadratic and Semidefinite Programming Relax-
ations in Scheduling. Journal of the ACM (JACM), pp. 206–242 (2001).
[148] Lee, C.G., Ma, Z. The Generalized Quadratic Assignment Problem.
University of Toronto Department of Mechanical and Industrial Engineer-
ing Research Report, Canada (2004).
[149] Ben-Tal, A., Nemirovski, A. Robust Convex Optimization.
Mathematics of Operations Research, Vol. 23, No. 4 (1998).
[150] Ben-Tal, A., Nemirovski, A. Robust Solutions of Uncertain Linear
Programs. Operations Research Letters, Vol. 25, pp. 1–13 (1999).
[151] Ben-Tal, A., Ghaoui, G.E., Nemirovski, A. Robust Semidefinite
Programming. Handbook on Semidefinite Programming, Kluwer Academic
Publishers, pp. 139–162 (2000).
[152] De, P., Dunne, E.J., Ghosh, J.B., Wells, C.E. The Discrete Time-
Cost Tradeoff Problem Revisited. European Journal of Operational Re-
search, Vol. 81, pp. 225–238 (1995).
[153] Brucker, P., Drexl, A., Möhring, R., Neumann, K., Pesch, E.
Resource-Constrained Project Scheduling: Notation, Classification, Mod-
els, and Methods. European Journal of Operational Research, Vol. 112, pp.
3–41 (1999).
[154] Xu, G., Papageorgiou, L.G. A Construction-Based Approach to Pro-
cess Plant Layout Using Mixed-Integer Optimization. Industrial & Engi-
neering Chemistry Research, Vol. 46, pp. 351–358 (2007).
[155] Verderame, P.M., Floudas, C.A. Operational Planning of Large Scale
Industrial Batch Plants Under Demand Due Date and Amount Uncertainty:
I. Robust Optimization Framework. Industrial & Engineering Chemistry
Research, Vol. 48, pp. 7214–7231 (2009).
[156] Zhu, S., Fukushima, M. Worst-Case Conditional Value-at-Risk with
Application to Robust Portfolio Management. Operations Research, Vol.
57, No. 5, pp. 1155–1168 (2009).
Appendix A
3S Technical Introduction
Introduction
This appendix provides a self-contained overview and usage tutorial for the 3S
framework. The overview starts with a description of the 3S framework methodology,
including samples of actual 3S instrumentation stub code and a detailed
example of how to create and use a simple 3S Instruction Request tool. The
overview then provides information about some of the more advanced 3S tools
that are available and presents output from the loopgraph_d and data flow tools,
which readers may find useful as starting points for their own work. The appendix
concludes with 3S quality assurance and performance information along
with a brief summary.
The 3S Methodology
The 3S methodology divides the program characterisation problem into two
parts: static, architecture-specific program instrumentation and dynamic,
architecture-independent program characterisation. The methodology can be seen as a
hybrid of the static SUIF [63] approach and the dynamic GILK [64], Valgrind [65]
and Pin [66] approaches.
The instrumentation stage of the 3S methodology statically inserts instrumentation
stubs into a program’s assembly code. As the program runs, the static
instrumentation stubs call a 3S analysis tool with a stream of program control and
data flow information. The 3S tool dynamically analyses this characterisation
stream and creates reports that characterise the program.
The idea of instrumenting program assembly rather than source code or binaries
dates back to the Stanford Tango Lite [69] multi-processor simulator environment
of the early 1990s. However, unlike previous assembly parsing work,
3S offers a generalised instrumentation framework that is not tied to a specific
program analysis or simulation task.
By using standard compile-chain tools such as gcc and g++, which are available
across a range of architectures, to obtain program assembly, an implementation of
the 3S methodology does not need a dedicated source code compiler like SUIF or
an architecture-specific internal disassembler like GILK, Valgrind or Pin.
Additionally, by instrumenting compiler generated assembly, the 3S methodology can
take advantage of compiler generated assembly cues, such as the function and basic
block labels exploited by static frameworks, and the same approach can
be applied to instrument any program written in any compilable language. The
only task an implementation of the 3S methodology has to perform is a
straightforward top-down parsing of an assembly text file, and this is the primary
reason for the simplicity of 3S implementations.
The run-time 3S tools that analyse the characterisation stream can be written
in any high-level language to provide cross-platform generality. By using
dynamic run-time analysis tools instead of static analysis tools, 3S has access to
exact data-dependent loop counts, control flow and data flow information, which
makes program analysis easy as you will soon see.
Instrumentation Stubs
The static instrumentation stage of 3S instruments a program at either the basic
block or the instruction level. The default instrumentation level is the block level
where the 3S framework places instrumentation stubs after jump and function
labels and around calls. To instrument a program at the instruction level, all you
have to do is define _3S_INSTRUMENT_INSTRUCTIONS in your tool code and your tool
will be sent information about each instruction before it is executed.
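The block-level placement rule can be sketched as a single top-down pass over the assembly text. The sketch below is only an illustration under stated assumptions, not the 3S source: the two regular expressions, the instrument_blocks() and make_stub() names and the .3S.string.demo.N symbol names are hypothetical, the stub text is modelled on the reduced stub of Figure A.2, and "around calls" is read here as one stub before and one stub after each call.

```python
import re

# Hypothetical patterns for gcc-style AT&T assembly (assumptions, not the 3S source).
LABEL = re.compile(r"^\s*([.\w$]+):")   # function or jump target label
CALL = re.compile(r"^\s*call\b")        # call instruction


def make_stub(n):
    """Return a reduced stub modelled on Figure A.2 with a made-up symbol name."""
    return [
        "# >>> 3S INSTRUMENTATION >>>",
        f"movl $.3S.string.demo.{n}, _3S_current_symbol",
        "call _3S__ir",
        "# <<< 3S INSTRUMENTATION <<<",
    ]


def instrument_blocks(asm_lines):
    """One top-down pass: add stubs after labels and around calls."""
    out, n = [], 0
    for line in asm_lines:
        if CALL.match(line):
            out += make_stub(n) + [line] + make_stub(n + 1)
            n += 2
        elif LABEL.match(line):
            out += [line] + make_stub(n)
            n += 1
        else:
            out.append(line)
    return out


program = ["main:", "  pushl %ebp", "  call helper", ".L2:", "  ret"]
instrumented = instrument_blocks(program)
print("\n".join(instrumented))
```

Only input lines are scanned, so the call inside each inserted stub is never instrumented again. A real instrumenter would also have to skip labels in data sections, which this sketch does not attempt.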
The default block level 3S instrumentation stub for the x86 implementation is
shown in Figure A.1. Lines 1 and 18 are assembly comments, line 7 sets a symbol
for the 3S characterisation tool to recognise the current block or instruction and
line 10 performs the call to the actual 3S characterisation tool.
    1: # >>> 3S INSTRUMENTATION >>>
    2: movl %edx, _3S_register_edx
    3: movl %eax, _3S_register_eax
    4: .byte 0x0f, 0x31
    5: movl %edx, _3S_previous_block_end_ticks+4
    6: movl %eax, _3S_previous_block_end_ticks
    7: movl $.3S.string.function.0.0.0, _3S_current_symbol
    8: pushfl
    9: pushal
    10: call _3S__ir
    11: popal
    12: popfl
    13: .byte 0x0f, 0x31
    14: movl %eax, _3S_previous_block_start_ticks
    15: movl %edx, _3S_previous_block_start_ticks+4
    16: movl _3S_register_edx, %edx
    17: movl _3S_register_eax, %eax
    18: # <<< 3S INSTRUMENTATION <<<
Figure A.1: Default x86 3S assembly instrumentation stub.
By defining _3S_INSTRUMENT_NO_TICKS or _3S_INSTRUMENT_FAST, a tool can
modify the 3S instrumentation stub to reduce instrumentation overhead.
_3S_INSTRUMENT_NO_TICKS removes the x86 RDTSC clock tick collection code from
the 3S stubs (lines 2–6 and 13–17) and _3S_INSTRUMENT_FAST removes the register
backup code (lines 8–9 and 11–12) for tools that perform their own register
management. With these changes, the basic x86 3S instrumentation stub is reduced
to only 2 assembly instructions as shown in Figure A.2.
    1: # >>> 3S INSTRUMENTATION >>>
    2: movl $.3S.string.function.0.0.0, _3S_current_symbol
    3: call _3S__ir
    4: # <<< 3S INSTRUMENTATION <<<
Figure A.2: Reduced x86 3S assembly instrumentation stub.
It should be noted that, as an optimisation, the default 3S stub only backs up
the integer and flags registers before calling the tool (lines 8–9 and
11–12). This means that your main tool function should not use MMX or floating
point operations that could affect the state returned to the program. If you need
floating point operations, you can use the fxsave, fxrstor and finit assembly
instructions either in-lined in your tool code or through stub modifications.
Characterisation Tools
The 3S characterisation tools are architecture independent dynamic analysis tools
that act on 3S instrumentation stub calls. The 3S instrumentation stubs provide
the 3S tools with the program characterisation information shown in Table A.1
which the 3S tools access through mnemonics included in the tool.h file.
Mnemonic                              Description
_3S_time_start                        Time the tool was initialised
_3S_symbol_chain_head                 First block entered in the program
_3S_symbol_chain_tail                 Last new block entered in the program
_3S_current_symbol                    Block or instruction about to be executed
_3S_previous_symbol                   Block or instruction just completed
_3S_previous_block_start_ticks (T)    CPU time-stamp for the previous block start
_3S_previous_block_end_ticks (T)      CPU time-stamp for the previous block completion
_3S_instruction_parameters (I)        Number of parameters for the next x86 instruction
_3S_instruction_parameter[] (I)       Parameter information for the next x86 instruction
Table A.1: 3S instrumentation information made available by 3S stubs to 3S tools through
the tool.h include file. Variables marked (T) are only available if
_3S_INSTRUMENT_NO_TICKS is not defined and variables marked (I) are only available to
instruction-level tools that define _3S_INSTRUMENT_INSTRUCTIONS.
Using dynamic 3S characterisation stream information in a tool is a straight-
forward task. An example 3S tool to collect block level Instruction Request (IR)
counts for a program is shown in Figure A.3.
The main tool characterisation function is identified by 3S as _3S__ prepended
to the tool name. For this tool the main tool function is _3S__ir(), which keeps
the instrumentation overhead low by having only one line of active C code. That
line counts the number of times each program block has been executed in an
integer field of the block's 3S symbol table entry.
The 3S framework macro _3S() is used to keep track of the symbol table
entries on each tool call and can be implemented in only three x86 assembly
instructions after initialisation. The main reporting logic for the tool is contained
in the _3S__ir_report() function, which is registered in the process's atexit()
chain on framework initialisation by _3S() and called by the C library when the
instrumented program terminates.
#include <stdio.h>
#include <time.h>
#include "tool.h"

#define _3S_INSTRUMENT_NO_TICKS             // no RDTSC in stubs

void _3S__ir_report(void);                  // forward declaration

//*** Main Tool Function Called by 3S Stubs
void _3S__ir(void) {
    _3S(_3S__ir_report);                    // 3S framework code
    _3S_current_symbol->entries += 1;       // active tool code
}

//*** Report Generation Code Called on Process Termination
void _3S__ir_report(void) {
    symbol_data_t* current_symbol = _3S_symbol_chain_head;
    FILE* log_file = fopen("./ir.3s", "w");
    if (log_file == NULL) return;           // report file could not be created
    fprintf(log_file, "#### 3S IR " _3S_VERSION ", %s", ctime(&_3S_time_start));
    fprintf(log_file, "%-30s%15s\n", "block", "ir");
    while (current_symbol != NULL) {        // walk the symbol chain
        // IR for a block = times entered * instructions in the block
        fprintf(log_file, "%-30s%15llu\n", &(current_symbol->label),
                (unsigned long long)current_symbol->entries * current_symbol->instructions);
        current_symbol = current_symbol->next_symb;
    }
    fclose(log_file);
}
Figure A.3: Simple 3S tool to measure fine-grained Instruction Requests (IRs).
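The arithmetic performed by the tool in Figure A.3 can be modelled in a few lines. The following is a hedged sketch rather than part of 3S: the Symbol class and the on_block_entry() and ir_report() names are hypothetical stand-ins for the symbol table entry, the single active line of _3S__ir() and the report function, and the block trace is made up.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    """Hypothetical stand-in for one 3S symbol table entry."""
    label: str
    instructions: int   # static instruction count of the block
    entries: int = 0    # times the block was entered at run time


def on_block_entry(symbol):
    """Counterpart of the single active line in _3S__ir()."""
    symbol.entries += 1


def ir_report(symbols):
    """Format one row per block: IR = entries * instructions."""
    rows = [f"{'block':<30}{'ir':>15}"]
    for s in symbols:
        rows.append(f"{s.label:<30}{s.entries * s.instructions:>15}")
    return "\n".join(rows)


# Made-up program: a 4-instruction loop body entered 3 times.
table = {"main": Symbol("main", 6), "loop": Symbol("loop", 4)}
for label in ["main", "loop", "loop", "loop"]:
    on_block_entry(table[label])
print(ir_report(table.values()))
```

The per-call work is a single increment, which is why the instrumentation overhead of this tool stays low; the multiplication is deferred to report time.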
Example Usage
To obtain characterisation information for a program using 3S, you simply have
to run a 3S script to instrument the program and then execute the instrumented
binary. The source program to be instrumented needs to be placed in a subdirectory
of the ./source directory of the 3S distribution and the tool code in the ./tools
directory. The 3SInstrument.sh script is then started with the source program
directory and 3S tool name as parameters.
3S(ex)> mkdir ./source/gzip
3S(ex)> cp CINT2000/164.gzip/src/* ./source/gzip
3S(ex)> ./3SInstrument.sh gzip ir
3S INSTRUMENTING WITH ir...............FINISHED
3S(ex)> cp CINT2000/164.gzip/data/train/input/input.combined .
3S(ex)> ./build/gzip
Figure A.4: Example of how to use 3S to characterise a SPEC benchmark.
Figure A.4 above shows how to use 3S to instrument the SPEC2000 GZIP
benchmark with the ir tool from the previous page. The 3SInstrument.sh script
compiles each source file in the program to assembly, instruments the assembly
files with 3S stubs that call the 3S tool and links the instrumented assembly
and 3S tool together into an executable which it saves in the ./build directory.
Running the instrumented executable ./build/gzip will produce the ir.3s tool
report in the current working directory.
For more information about using 3SInstrument.sh to instrument your code
just start the script without parameters or refer to the comments in the script
headers accompanying the current 3S distribution [1].
3S Tools
3S includes several innovative characterisation tools that are not available for
other frameworks as presented in Table A.2. This section provides details of the
loopgraph_d and data_flow tools; more information about the other 3S tools
can be found in the documentation accompanying the 3S distribution and in the
3S paper [1].
3S Tool        Description                                     Unique   Code   Comments
trace          Saves the raw 3S characterisation stream        Yes        54        110
hotspot        Measures real CPU clock cycles and IRs          Yes        72        109
callgrind      Generates a control flow graph                  No        172        132
memory         Run-time memory access information              No         78        106
profile        Measures real CPU cache timing effects          Yes        43         67
parallelism    Identifies instruction level data parallelism   Yes       239        256
regex          Regular expressions from control flows          Yes       175        209
loopgraph_d    Adds hotspot and IR information to regex        Yes       347        304
data_flow      Data dependencies between code sections         Yes       366        411
all            A meta-tool showing how to join 3S tools        Yes        19         70
Table A.2: A selection of available 3S program characterisation tools and whether or not
they were unique to 3S at the time of their creation. The Code column lists
the number of code lines in the tool and the Comments column the number of
comment lines, both excluding blank lines. Smaller tools have a high comment
ratio due to boilerplate comments such as author, creation date and file history.
3S Tool: loopgraph_d
The 3S loopgraph_d tool works at the block level and creates a deterministic
regular expression describing a whole program’s execution trace. The regular
expression is annotated with hotspot information and saved in the loopgraph_d.3s
report file. The tool also creates a dotty file loopgraph_d.dot with groups of
acyclic paths connected by repetition markers. The dotty file is compiled
automatically into a postscript picture using neato [108] if the number of acyclic block
groups is less than 100.
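To give a feel for what a repetition-annotated trace expression looks like, the sketch below folds immediately repeated runs of blocks into a unit{count} form. It is a deliberately simple greedy illustration and is not the loopgraph_d algorithm, which is not described here; the compress() name and the example trace are assumptions, and nested loops are not folded.

```python
def compress(trace):
    """Greedily fold immediately repeated runs of blocks into unit{count} form."""
    out, i, n = [], 0, len(trace)
    while i < n:
        folded = False
        for period in range(1, (n - i) // 2 + 1):   # shortest repeating unit first
            unit = trace[i:i + period]
            count = 1
            while trace[i + count * period:i + (count + 1) * period] == unit:
                count += 1
            if count > 1:
                body = " ".join(unit)
                out.append(f"({body}){{{count}}}" if period > 1 else f"{body}{{{count}}}")
                i += count * period
                folded = True
                break
        if not folded:
            out.append(trace[i])
            i += 1
    return " ".join(out)


# Made-up block trace: a two-block loop run 3 times, then a one-block loop run twice.
trace = ["G01"] + ["G02", "G03"] * 3 + ["G04"] * 2 + ["G05"]
print(compress(trace))
```

The braces play the role of the repetition markers in the tool's graph output: a unit followed by {k} was executed k times in succession.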
The loopgraph_d tool can be used to visualise the difference between compilers
and compiler options. For example, Figure A.5 shows the loopgraph_d postscript
output for the PRIV8® AES256 8.2 [109] implementation compiled using two
different compilers. The top right figure is the program compiled with g++ -O0
and the bottom left the program compiled with gcc -O0. Nodes correspond to
acyclic program block groups in the program and edges to control flows (with
dashed edges for loops). Nodes are shaded in proportion to the 3S execution
time measurements for their corresponding block groups.
It is clear from Figure A.5 that g++ performs considerable optimisation
even when compiling with its “no optimisation” flag. The g++ optimisation
produces only four active loops in the compiled program compared to twenty-five in
the gcc -O0 version, which corresponds more closely to the original C code.
[Figure A.5 graphs: the gcc -O0 graph contains acyclic block groups G01 to G51 and the g++ -O0 graph contains groups G01 to G09, joined by loop edges labelled with repetition counts.]
Figure A.5: 3S loopgraph_d tool reports for 10,000 OFB iterations of the PRIV8® AES256
benchmark compiled with g++ -O0 (top right) and gcc -O0 (bottom left).
3S Tool: data_flow
The 3S data_flow tool works at the instruction level to create an inter-block data
flow graph under user-configurable cache simulations. The cache simulation mode
is set in the 3S.conf file before instrumentation and the modes available at the
time of writing are listed in Table A.3.
Mode            Description
CACHE_NONE      No cache: every data read is a new read from the last writer block
CACHE_LOCAL     Classic data flow: each block has an infinite local cache
CACHE_GLOBAL    Shared global cache: direct mapped fixed size cache
Table A.3: 3S data_flow tool cache simulation modes.
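The difference between the first two modes in Table A.3 can be illustrated with a small model that counts reads of data last written by another block. This is a sketch under stated assumptions and not the data_flow tool itself: the data_flows() name and the event format are hypothetical, a write is assumed to invalidate copies held by other blocks, and the direct-mapped CACHE_GLOBAL mode is left out.

```python
from collections import defaultdict


def data_flows(events, mode):
    """Count inter-block reads as flows[(writer, reader)].

    events: iterable of (block, op, address), op being "r" or "w".
    mode:   "CACHE_NONE" or "CACHE_LOCAL".
    """
    last_writer = {}             # address -> block that last wrote it
    cached = defaultdict(set)    # block -> addresses it already holds
    flows = defaultdict(int)
    for block, op, addr in events:
        if op == "w":
            last_writer[addr] = block
            for other, held in cached.items():   # assumed: a write invalidates stale copies
                if other != block:
                    held.discard(addr)
        else:
            writer = last_writer.get(addr)
            if writer is None or writer == block:
                continue         # no producer yet, or an intra-block read
            if mode == "CACHE_LOCAL":
                if addr in cached[block]:
                    continue     # already fetched since the last write
                cached[block].add(addr)
            flows[(writer, block)] += 1
    return dict(flows)


# Made-up access stream: B2 reads a value twice, B1 rewrites it, B2 reads it again.
events = [("B1", "w", 0x10), ("B2", "r", 0x10), ("B2", "r", 0x10),
          ("B1", "w", 0x10), ("B2", "r", 0x10)]
print(data_flows(events, "CACHE_NONE"))
print(data_flows(events, "CACHE_LOCAL"))
```

With no cache every one of the three reads counts as a transfer from B1 to B2, whereas with an infinite local cache the repeated read between the two writes is absorbed.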
In addition to cache simulation modes, the 3S data_flow tool has a report
mode configuration option that controls the report format produced by the tool
on process completion. The REPORT_FLOWS report mode produces a postscript file
with a block-level data flow graph whose nodes are shades of red, blue and green
depending on whether the corresponding basic blocks are net data consumers,
data producers or data transfer nodes. The REPORT_TRANSFER report mode
produces a postscript file with a transfer graph for the program with nodes shaded
green in proportion to the total external reads and writes performed by each
corresponding basic block, with the highest transfer block coloured pure green. The
REPORT_CONTRIBUTION report mode produces a postscript file with a contribution
graph for the program with nodes shaded blue in proportion to the total amount
of data read from the corresponding basic block by other blocks in the program
under the simulated cache conditions.
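The proportional shading used by the REPORT_TRANSFER mode can be sketched as a mapping from a block's transfer total to a colour. The sketch is an assumption-laden illustration: the transfer_shade() name, the block names and the byte counts are made up, and fading towards white for blocks with little transfer is my choice, since the text only fixes the highest transfer block at pure green.

```python
def transfer_shade(total, highest):
    """Map a block's external reads + writes to an RGB hex colour."""
    fraction = total / highest if highest else 0.0
    fade = round(255 * (1.0 - fraction))   # red and blue fade out as transfer grows
    return f"#{fade:02x}ff{fade:02x}"


transfers = {"block_a": 120, "block_b": 480, "block_c": 240}   # made-up totals
highest = max(transfers.values())
shades = {block: transfer_shade(t, highest) for block, t in transfers.items()}
print(shades)
```

The same scheme with the blue channel held at full intensity would give the REPORT_CONTRIBUTION shading.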
Figure A.6 shows the transfer report for 10,000 OFB iterations of the
PRIV8® AES256 8.2 implementation compiled with gcc -O0. Nodes correspond
to program basic blocks and only nodes with data transfers are shown for clarity.
Darker nodes and lines represent larger inter-node data flows with B34, B42 and
B48 being net data consumers corresponding to data dependent branch reads and
B35, B43 and B49 net data producers corresponding to the AES256 addRoundKey,
subBytes and shiftRows functions respectively.
3S data_flow tool measurements are combined with the 3S callgrind tool
measurements to calculate cross partition communication costs in the WOA
execution model [3] used to determine heterogeneous partitioning in the APA [2],
MAP [3] and MIP [5] approaches.
[Figure A.6 graph: transfer graph over the basic blocks with data transfers, with node labels between B00 and B81.]
Figure A.6: 3S data_flow tool’s REPORT_TRANSFER report for the PRIV8® AES256
benchmark compiled with gcc -O0 using CACHE_LOCAL simulation and executed for
10,000 OFB iterations. Only the data flows are shown with darker nodes and
lines indicating larger data transfers.
Testing and Performance
The 3S instrumentation and characterisation process is shown in Figure A.7.
Working with assembly files allowed the visual verification of 3S stub correctness
in early framework testing. Test kernels including matrix multiply, memory
access pattern and complex loop simulation benchmarks, which are distributed
with 3S, were used to manually verify early end-to-end instrumentation and
characterisation results.
[Figure A.7 diagram: Program Code -> Compile to Assembly (gcc) -> Program Assembly -> Instrument with 3S Stub Calls -> Instrumented Assembly -> Link with 3S Tool (ld) -> Instrumented Executable -> 3S Tool Characterisation Report. The steps up to and including linking are performed by 3SInstrument.sh.]
Figure A.7: 3S framework instrumentation and characterisation process.
For volume tests, MiBench [56], SPEC [57] and PRIV8 [109] benchmarks were
used. The instrumented program outputs were verified against those of the
uninstrumented code and tool measurements were verified against comparable tools
in other frameworks including Valgrind [65], gprof and OProfile where possible.
To test 3S tools independently of the 3S instrumentation framework the 3S
trace simulation program depicted in Figure A.8 can be used.
[Figure A.8 diagram: 3S Trace -> Trace Simulator Executable (the trace simulator compiled with a 3S tool) -> 3S Tool Characterisation Report.]
Figure A.8: 3S trace simulation tool tester.
The trace simulation program is compiled with a 3S tool and reads in a 3S
trace file. The simulator constructs 3S framework data structures in memory
for each line of the trace file and sends the data structures to the 3S tool as if
they had come directly from 3S stubs in an instrumented executable. The trace
simulator allows memory access test kernels to be created without the need for
actual assembly benchmarks and supports the independent verification of the 3S
tools and the 3S framework.
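The replay idea behind the trace simulator can be sketched as follows. The sketch makes several assumptions that are not taken from 3S: the trace format (one "label instruction-count" record per line, with # comments), the replay() and ir_tool() names and the dictionary used for each symbol are all hypothetical.

```python
def replay(trace_lines, tool):
    """Feed each trace record to `tool` as if a stub had just called it."""
    symbols = {}       # label -> per-block record, built on first sight
    previous = None
    for line in trace_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue   # skip blanks and comments
        label, instructions = line.split()
        symbol = symbols.setdefault(
            label, {"label": label, "instructions": int(instructions), "entries": 0})
        tool(current=symbol, previous=previous)
        previous = symbol
    return symbols


def ir_tool(current, previous):
    """Same counting as the ir tool: one increment per block entry."""
    current["entries"] += 1


# Made-up trace: no assembly benchmark is needed to exercise the tool.
trace = ["# label instructions", "main 6", "loop 4", "loop 4", "exit 2"]
table = replay(trace, ir_tool)
print({label: s["entries"] for label, s in table.items()})
```

Because the tool only sees the records it is handed, the same tool function can be checked against hand-written traces before it is ever linked into an instrumented executable.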
3S was designed with simplicity and flexibility as the primary goals, and the
exhaustive measurement and proof of efficiency was low on the list of priorities.
Despite that, as Table A.4 shows, the 3S approach is very efficient. The table
compares the execution time of benchmarks for the 3S and Valgrind callgrind
tools. 3S is up to 15.5 times faster than Valgrind, which corresponds
to a speed increase of 4.6 times over Pin by extension of the results of Luk et al.
in [66].
Program     GZIP      AES256    AES256
Compiler    GCC       GCC       G++
SHELL1      3.96x     3.29x     10.66x
OPT1        3.05x     5.53x     15.52x
Table A.4: 3S performance improvements over Valgrind 3.2.0 for the callgrind tool (with the
3S callgrind tool also providing tick information). SHELL1 is an Intel Pentium
and OPT1 an AMD Opteron machine at Imperial. The compiler optimisation
level was -O2.
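The headline figures can be re-derived from Table A.4 with a few lines of arithmetic. The sketch below copies the six table values and, as an inference of my own rather than a figure quoted from [66], divides the best Valgrind speed-up by the quoted 4.6 to show the Pin to Valgrind ratio that the extension implicitly assumes.

```python
# 3S speed-up over Valgrind 3.2.0, copied from Table A.4: (machine, program, compiler).
speedup = {
    ("SHELL1", "GZIP", "gcc"): 3.96,
    ("SHELL1", "AES256", "gcc"): 3.29,
    ("SHELL1", "AES256", "g++"): 10.66,
    ("OPT1", "GZIP", "gcc"): 3.05,
    ("OPT1", "AES256", "gcc"): 5.53,
    ("OPT1", "AES256", "g++"): 15.52,
}

best = max(speedup.values())
worst = min(speedup.values())
implied_pin_over_valgrind = best / 4.6   # 4.6 is the quoted 3S-over-Pin figure

print(f"3S over Valgrind: {worst}x to {best}x")
print(f"implied Pin over Valgrind: {implied_pin_over_valgrind:.2f}x")
```

The smallest measured improvement is just over 3 times and the implied ratio is a little under 3.4, so the Pin comparison should be read as an estimate that inherits the conditions of both sets of measurements.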
The results are compiler and machine dependent. The 3S improvements are
smaller for programs with larger numbers of blocks because the 3S stubs are
unoptimised and, unlike in Valgrind [65], are not in-lined with tool code. The
3S performance improvements, achieved despite the framework’s simplistic
approach, are attributed to the fact that:
1. 3S does not expand x86 CISC instructions into RISC-like equivalents [65].
2. 3S does not have a dynamic instrumentation executive overhead [65,66].
3. 3S statically assigns block instrumentation metric memory and instruction
parameter analysis code [64,65,66].
The 3S performance improvements can be verified using the 3SPerformance.sh
script which comes with the 3S(ex) distribution.
Summary
This appendix presented a technical introduction to the 3S instrumentation
framework and some of the 3S characterisation tools. The appendix started by
highlighting the unique 3S methodology of static instrumentation combined with
dynamic characterisation and then showed the basic 3S instrumentation stub and
how an Instruction Request (IR) tool can be created with only 1 line of active
tool code in 3S. Existing 3S tools were introduced in Table A.2 and example
loopgraph_d and data_flow tool reports presented.
The main advantages of 3S over previous general program characterisation
frameworks [63,64,65,66] are:
• 3S is the simplest general framework available at only 288 lines of code.
• 3S is up to 15.5 times faster than other frameworks.
• 3S comes with unique characterisation tools like loopgraph_d and data_flow
not available for other frameworks.
• 3S provides full visibility of its instrumentation changes through annotated
assembly allowing the easy confirmation of program correctness.
More information about 3S is available in the 3S paper [1] and in the docu-
mentation accompanying the free 3S distribution.
Appendix B
3S Technical Paper
3S: Program Instrumentation and Characterisation
Framework
Simon A. Spacey
ABSTRACT
3S is an efficient program instrumentation and profiling
framework. 3S is only 288 lines of framework code, yet it
can produce the same reports as Valgrind [1] and is up to an
order of magnitude faster.
I. INTRODUCTION
3S stands for “Spacey Stream Splitter”. 3S is a framework
used to instrument an x86 program. You use the framework
together with 3S analysis tools to analyse a program’s control
flow. The 3S framework provides the 3S tools with a stream of
control and data flow information as the instrumented target
program runs. The 3S tools split the control and data flow
information stream to create their reports.
This document introduces the 3S framework and some of
the 3S tools. I provide a summary of the 3S framework
methodology in Section II, examine two of the 3S tools in
Section III, evaluate the performance of 3S in Section IV and
consider some possible enhancements in Section V.
This document is not a survey of different instrumentation
frameworks and it is not a proposal for future research. If
you are interested in information on any of those topics you
should start with a review of the references at the end of this
document.
II. 3S FRAMEWORK METHODOLOGY
The 3S framework is only 288 lines of code, yet it can
produce the same reports as Valgrind [1] and is up to an order
of magnitude faster. The 3S framework works by inserting
instrumentation stubs into the assembly of a program and
then linking the modified assembly with a 3S analysis tool
specified at instrumentation time. The instrumentation stubs
call the 3S analysis tool with a stream of program control
and data flow information as the program runs. The 3S tool
analyses the stream and creates reports from it.
By working at the assembly level, 3S does not have to
be concerned with its stubs over-writing program instructions
or jump targets [2] and also benefits from the basic block
identification algorithms already implemented in the source
code to assembler stage of the compiler. These simplifications
make instrumenting at the assembly level a straightforward
process of text file parsing. 3S simply takes a compiled
program’s assembly and adds 3S stubs after each compiler
generated function or jump target label. The one complication
is call instructions which are easy to spot in the assembly
using regular expressions.
In stark contrast to object code instrumentation frameworks
like Valgrind, the 3S instrumented output is fully readable
assembly text. This drastically reduced the development and
debugging time of the 3S framework itself and also benefits
users who can see exactly what 3S does to their programs to
assure themselves that the 3S instrumentation cannot interfere
with their program’s validity.
A. Instruction and Block Level Instrumentation
The first release of 3S was designed to instrument a pro-
gram at the block level only. By version 2, instruction level
instrumentation features had also been added. Working at the
block level, 3S can create control flow and hotspot reports.
At the instruction level, it is possible to create data flow and
memory access reports like the Valgrind cachegrind report.
The difference between the block and instruction modes of
operation in 3S is seen in the type and placement of 3S stubs
inserted into the target program’s assembly file. In the block
mode, stubs are only inserted at the start of functions, basic
blocks and around call instructions. In instruction mode, stubs
are inserted before each instruction as well as at the start of
functions and blocks.
Obviously the overhead of 3S instruction level instrumenta-
tion is much greater than that of 3S block level instrumen-
tation, however it can still be less than that of alternative
frameworks like Valgrind. The reason for this lies in the way
3S and Valgrind analyse Intel’s complex instructions.
Valgrind expands the x86 CISC instructions to an internal
RISC form and then instruments the RISC instructions before
mapping them back (one-to-one) to x86 instructions to be
run on the processor [1]. This results in not only a tool call
overhead per instruction as with 3S, but also an instruction
expansion overhead as a single x86 CISC instruction can be
translated into multiple x86 RISC equivalents.
With 3S the x86 instructions are analysed statically at
instrumentation time. The static analysis creates a description
that contains all the aspects of the CISC instruction for
consideration by the 3S tool at run-time. There is then a single
3S tool call per instrumented x86 instruction and the 3S tool
is presented with a full description of the CISC instruction
and its parameters at run-time.
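The kind of static description that could be prepared at instrumentation time can be sketched as below. This is my own illustration and not the 3S description format: the split_operands() and describe() names and the three operand classes are assumptions, and the classification is deliberately coarse (for example, segment-prefixed operands are not handled).

```python
def split_operands(text):
    """Split AT&T operands on commas that are not inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        parts.append(current.strip())
    return parts


def describe(instruction):
    """Build a static description: mnemonic plus classified parameters."""
    fields = instruction.split(None, 1)
    mnemonic = fields[0]
    rest = fields[1] if len(fields) > 1 else ""
    params = []
    for op in split_operands(rest):
        if op.startswith("$"):
            kind = "immediate"
        elif op.startswith("%"):
            kind = "register"
        else:
            kind = "memory"      # displacement and/or (base,index,scale) form
        params.append((kind, op))
    return {"mnemonic": mnemonic, "parameters": params}


info = describe("movl 8(%ebp,%ecx,4), %eax")
print(info)
```

Because the description is computed once per instruction in the assembly text, the run-time cost per executed instruction is limited to the single tool call.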
B. Clock Ticks vs Instruction Requests
The 3S framework stubs by default pass CPU clock tick
information on to the 3S tool. The clock tick value is measured
around the instrumented assembly code using the x86
RDTSC instruction. The tick information is passed to the 3S
tool in the 64 bit globals _3S_previous_block_start_ticks and
_3S_previous_block_end_ticks. The tool can use these figures
to calculate the time a block took to run.
It should be noted that by using tick deltas, we remove most
of the tool instrumentation overhead from tick measurements.
However, the tick delta figures can still be inaccurate for small
blocks because of a residual caused by 6 instructions that are
outside the RDTSC instructions in the 3S stub and because of
instruction scheduling in the processor. The stub residual has
been measured to be around 9 ticks on an Opteron processor.
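The tick delta calculation can be sketched as follows. RDTSC returns the 64 bit counter split across EDX (high half) and EAX (low half), which is why the stub stores two 32 bit words per time-stamp. The combine() and block_ticks() names and the counter readings are made up, and subtracting a residual is an optional correction that I am assuming here, not something the framework is stated to do.

```python
def combine(edx, eax):
    """Join the EDX (high) and EAX (low) halves of an RDTSC reading."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


def block_ticks(start_ticks, end_ticks, residual=0):
    """Ticks spent in a block, optionally less a measured stub residual."""
    return max(end_ticks - start_ticks - residual, 0)


start = combine(0x00000001, 0xFFFFFF00)   # made-up counter readings
end = combine(0x00000002, 0x00000040)

raw = block_ticks(start, end)
corrected = block_ticks(start, end, residual=9)   # residual measured on an Opteron
print(raw, corrected)
```

The example readings straddle a carry from the low word into the high word, which is the case a tool must get right when it reassembles the two halves.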
Because of the inaccuracy in tick measurements for small
blocks, Valgrind [1] does not use tick figures. Instead it uses
Instruction Requests (IR). 3S tools also have IR information
available to them by default. IR can be readily calculated in
the tool as the instructions per block multiplied by the entries
per block.
As measuring clock tick information adds overhead
to the instrumented program, it is possible to flag a
3S tool as not requiring ticks by defining the variable
_3S_INSTRUMENT_NO_TICKS in the tool’s C code. This flag
is defined in memory, regex, callgrind and some other 3S tools.
III. 3S TOOLS
The 3S framework makes creating program analysis tools
easy. For example, the 3S hotspot tool generates an execution
profile for a target program with only 2 lines of active C code.
Several example 3S tools can be found in the /tools directory
of the 3S(ex) distribution. All tools use the header tool.h which
describes the global variables that the 3S framework provides
for the 3S tools.
To use the 3S tools you need to place your source files in
a subdirectory of /source and run the 3SInstrument.sh script.
This script is a wrapper that calls the 3S framework parser
program (3SInstrument.py) to compile each of your source
files to assembly and instrument them in turn. When all your
files have been instrumented, the 3SInstrument.sh script links
the instrumented assembly files together with the 3S tool you
specified. The final 3S instrumented executable is placed in
the /build directory.
Most tools create an output called <toolname>.3s in the
working directory after the instrumented program has been
executed. Some tools also create a pictorial report as a
postscript file. The following sub-sections describe two of the
3S tools in more detail.
A. 3S Tool: loopgraph_d
The 3S loopgraph_d tool works at the block level and
creates a deterministic regular expression describing a whole
program’s execution trace. The regular expression is annotated
with hotspot information and saved in the loopgraph_d.3s
report file. The tool also creates a dotty file loopgraph_d.dot
with groups of acyclic paths connected by repetition markers.
The dotty file is compiled automatically into a postscript
picture using neato [7] if the number of group nodes is less
than 100.
[Fig. 1 graph: acyclic block groups G01 to G51 joined by loop edges labelled with repetition counts.]
Fig. 1. 3S loopgraph_d tool for PRIV8® AES256 with gcc -O0
An example of the loopgraph_d postscript output is shown
in Figure 1 for the PRIV8® AES256 8.2 [8] implementation.
This picture was generated for 10,000 OFB AES256 iterations
using source compiled to assembly with gcc -O0. Figure 2
shows the same program compiled with g++ -O0.
B. 3S Tool: memory
The 3S memory tool works at the instruction level. The 3S
tool creates a report of all memory addresses read and written
by every assembly instruction. The report is saved to the file
memory.3s and can be easily plotted using Excel or a similar
application.
Figure 3 shows the stack memory accesses by original
source assembly line for 10,000 OFB iterations of the
PRIV8® AES256 8.2 implementation compiled with g++ -O0.
[Fig. 2 graph: acyclic block groups G01 to G09 joined by loop edges labelled with repetition counts.]
Fig. 2. 3S loopgraph_d tool for PRIV8® AES256 with g++ -O0
[Fig. 3 plot: "3S Memory Access Tool Output for AES256 (STACK)"; x-axis: Assembly Line; y-axis: Memory Address Accessed; series: Read Access and Write Access.]
Fig. 3. 3S memory tool for PRIV8® AES256 with g++ -O0
IV. PERFORMANCE
3S has been measured to be between 2 and 15 times as
fast as Valgrind for a comparative tool implementation. The
performance improvement is dependent on the x86 instructions
used and the size of the basic blocks which are in turn
governed by the source code compiler.
Instrumenting the PRIV8® AES256 8.2 implementation
with a 3S callgrind tool using assembly generated by gcc -O2
produced a 5.5x performance improvement over Valgrind.
With the g++ -O2 compiler, the performance improvement
was 15.5x. The SPEC2000 GZIP benchmark could only be
compiled with gcc. Despite this, the performance increase was
consistently over 3x when compared with the current Valgrind
distribution (3.2.0).
The 3S performance improvement can be verified using the
3SPerformance.sh script that comes with the 3S(ex) distribu-
tion.
V. POSSIBLE ENHANCEMENTS
A. Optimisations
There are several optimisations that could be added to the
3S framework. Some obvious possibilities are:
1) only instrumenting a sub-set of blocks or instructions
2) in-lining the tool assembly
3) using register re-mapping to make the stub code more
efficient
However, perhaps the single most useful characteristic of
3S is the ease with which a new user can pick up the 288-line
framework and start writing new tools. By adding performance
optimisations, I believe this characteristic would be lost. I
therefore strongly recommend that changes and additional
features be kept to a minimum in 3S. If you must have a
feature, it should be implemented in a specialist branch of the
3S code so that the current simple framework is not lost.
B. New 3S Tools
There are several new tools that would be of benefit to the
3S community. They include:
1) a non-deterministic (statistical) loopgraph variant
2) a block level static memory prediction function evalu-
ated at run-time
3) a Valgrind-style cachegrind tool
Creating 3S tools as separate modules that integrate with the
framework does not complicate the 3S code and I recommend
that someone set about creating these new 3S tools. Creating
these new 3S tools would make a good Masters project. Please
feel free to e-mail me if you would like to help.
VI. CONCLUSION
This document presented a brief overview of the 3S in-
strumentation framework and 3S analysis tools. One of the
main advantages of 3S over other instrumentation frameworks
is that it is extremely simple to understand being only 288
lines of code. This simplicity makes creating new 3S analysis
tools easy and brings previously unimagined program analysis
possibilities within the researcher’s grasp.
Because of the 3S framework’s simplicity, several unique
analysis tools have already been created in record time. These
include the loopgraph_d tool which creates a regular expression
from a whole program execution trace and the memory
tool which displays instruction level memory accesses. With
loopgraph_d we have a way to automatically identify loops
and control dependencies and with memory we can identify
data dependencies.
The existing 3S tools are already casting new light on
important commercial programs [8]. With the proposed new
3S tools, the rapidly growing 3S community will have a
unique ability to shape the future of hardware and software
engineering for years to come.
REFERENCES
[1] NETHERCOTE, N., SEWARD, J. Valgrind: A program supervision
framework. Proceedings of the 3rd Workshop on Runtime Verification
(http://valgrind.kde.org/, 2003).
[2] PEARCE, D.J., KELLY, P.H.J., FIELD, T., HARDER, U. GILK: A
dynamic instrumentation tool for the linux kernel. Proceedings of the
12th International Conference on Computer Performance Evaluation,
Modelling Techniques and Tools 37, pp 220–226, (2002).
[3] LARUS, J.R., SCHNARR, E. EEL: machine-independent executable
editing. Proceedings of the ACM SIGPLAN 1995 conference on
Programming language design and implementation, pp 291–300, (1995).
[4] SRIVASTAVA, A., EUSTACE, A. ATOM: a system for building cus-
tomized program analysis tools. Proceedings of the ACM SIGPLAN
1994 conference on Programming language design and implementation,
pp 196–205, (1994).
[5] LARUS, J.R. Whole program paths. Proceedings of the ACM SIGPLAN
1999 conference on Programming language design and implementation,
pp 259–269, (1999).
[6] LUK, C. ET AL Pin: building customized program analysis tools with
dynamic instrumentation. Proceedings of the 2005 ACM SIGPLAN con-
ference on Programming language design and implementation (2005).
[7] GANSNER, E.R., NORTH, S.C. An Open Graph Visualization System
and its Applications to Software Engineering. Software Practice And Ex-
perience, 1-5, (http://www.graphviz.org/Documentation/
GN99.pdf, 1999).
[8] PRIV8 LTD. http://www.priv8.com/.
Appendix C
Concise CPLEX Technical Paper
CONCISE CPLEX
Simon A. Spacey
Department of Computing
Imperial College
London, UK
ABSTRACT
This paper is a concise guide to CPLEX, the leading solver
for linear and convex quadratic optimisation problems. The
paper is self contained and includes information for first
time CPLEX users as well as code snippets and lemmas that
may be of referential value to experienced users.
The paper starts with a brief explanation of how to run
CPLEX on departmental servers at Imperial and on stand-
alone machines in section 1, how to create and solve simple
Linear Programs in section 2 and how to obtain detailed so-
lution results in section 3. The paper then moves on to dis-
cuss several CPLEX issues and quirks that may confuse first
time users including: anomalous objective values caused
by big-M scaling, the implications of long MILP solution
times and removing memory limitations for problems with
large MILP solution trees. The paper concludes with logical
equivalence proofs in section 9 that can be used as a start-
ing point for complex problem translation and references are
provided for additional reading.
1. STARTING CPLEX
CPLEX [1] is installed in a single directory and can be moved
from one machine to another with simple file copying. How-
ever, to execute CPLEX you need a license file, a defined
hostname consistent with the license file and a license server
to validate the license file. You can run CPLEX from its in-
stall directory on an Imperial server with the commands:
./ilm/ilmd &
./bin/x86-64_debian4.0_4.1/cplex
The first command starts the CPLEX license server and may
not be necessary if the license server is already running as a
shared process.
To execute a local copy of CPLEX from outside Imperial
you will need to obtain a license file, set your local hostname
to be consistent with the license file and use a license tunnel
to Imperial so that CPLEX can validate the license and keep
track of the current licenses in use. This can be automated
using a script such as:
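#!/bin/bash
# host name expected by the license file
HOST="vm-qads-ilm"
ILMD="$HOST.doc.ic.ac.uk"
LOGIN="saspacey@shell1.doc.ic.ac.uk"
export ILOG_LICENSE_FILE=./access.ilm
# set the local hostname to match the license (needs root)
hostname "$HOST"
# tunnel local port 3000 to the license server
ssh -f -L 3000:"$ILMD":3000 "$LOGIN" sleep 10
./bin/x86_debian4.0_4.1/cplex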
Both of the code snippets above start CPLEX running
interactively. If you close your shell, your CPLEX process
will be killed and any solution currently being calculated
will not complete. You can avoid CPLEX terminating when
you close your shell by wrapping your CPLEX process with:
screen ./bin/x86_debian4.0_4.1/cplex
then press CTRL+A CTRL+D to leave the process running in
the background when you want to log out, and type
screen -r when you log in again to return to CPLEX. Another
way to run CPLEX when you are not logged in is as a perpetual
background process, which is discussed further at the end of
the next section.
2. SOLVING PROBLEMS WITH CPLEX
You can create problems for CPLEX to solve using simple
text files in Linear Programming (LP) format. Here is a sim-
ple optimisation problem in LP format:
MINIMIZE
obj: 5.8 x_1 + 3 x_2
SUBJECT TO
r1: x_1 + 2.1 x_2 = 6
r2: 3 x_2 < 4.2
BOUNDS
x_1 >= 0
x_2 >= 0
INTEGER
x_1
END
In the above program obj, r1 and r2 are optional labels
used in detailed solutions; the constraints and BOUNDS can
be of any relational form (i.e. <=, >=, >, < or =); the
variable x_1 must be a non-negative INTEGER and x_2 can be
any non-negative real. Note that there is no multiplier
symbol between numbers and variable names, there has to
be an END statement and all relations must have only
numbers on the right hand side.
Two methods can be used to include binary variables in
CPLEX LPs:
1. declare the variables as INTEGER with BOUNDS be-
tween 0 and 1.
2. remove the variables from both the INTEGER and BOUNDS
sections and add a separate BINARY section.
The choice of binary representation method does not affect
performance.
Assuming the LP file above is saved as problem.lp,
you can solve it by typing:
read problem.lp
opt
at the interactive CPLEX command prompt. You should
add the line set parallel 1 before the opt command
if you require result path reproducibility on multi-threaded
CPLEX installs [2]. The optimal objective value returned by
CPLEX for the problem above is 26.057.
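The quoted objective can be checked by hand because r1 fixes x_2 once the integer x_1 is chosen. The following is a minimal Python sketch of that check, not a CPLEX interface: the function name solve_example and the enumeration limit max_x1 are illustrative assumptions, and the < in r2 is read as a non-strict bound.

```python
from fractions import Fraction

def solve_example(max_x1=10):
    """Enumerate integer x_1 and keep the cheapest feasible point."""
    best = None
    for x1 in range(max_x1 + 1):
        # r1: x_1 + 2.1 x_2 = 6 fixes x_2 once x_1 is chosen
        x2 = (Fraction(6) - x1) / Fraction(21, 10)
        # BOUNDS x_2 >= 0 and r2: 3 x_2 <= 4.2 (assumed non-strict)
        if x2 < 0 or 3 * x2 > Fraction(42, 10):
            continue
        # obj: 5.8 x_1 + 3 x_2
        obj = Fraction(58, 10) * x1 + 3 * x2
        if best is None or obj < best[0]:
            best = (obj, x1, x2)
    return best

obj, x1, x2 = solve_example()
print(f"x_1 = {x1}, x_2 = {float(x2):.4f}, objective = {float(obj):.3f}")
```

The enumeration selects x_1 = 4 with x_2 = 20/21, which gives the objective of 26.057 quoted above.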
If you wanted to automate CPLEX to read and solve LPs,
you could start with a script like cplex.sh below:
#!/bin/bash
CPLEX="/opt/cplex11/cplex"
rm -f cplex.log 2> /dev/null
rm -f results.s 2> /dev/null
"$CPLEX" 2> _cplex.err <<CPLEX_CMD
read problem.lp
opt
write results.s sol all
quit
CPLEX_CMD
and run the script as a perpetual background process with:
nohup ./cplex.sh > /dev/null &
Aside from LP forms, optimisation problems can also be
represented in the OPL [1], AMPL [3] and GAMS [4] mod-
elling languages or passed from C, C++ or Java to CPLEX
using one of the ILOG Concert Technology APIs [2]. The
modelling languages all allow relational constraints like LPs
but can abstract the general problem logic away from the
specific problem instance data. The AMPL and GAMS
modelling languages have the added advantages of being
portable across a range of different solvers and of being able
to represent both concave and convex problems [5].
3. OBTAINING DETAILED SOLUTION RESULTS
To obtain the variable values that correspond to a solution
you can save the detailed CPLEX results to a file with:
write results.s sol all
this saves solution variable values, slacks and objectives in
an XML format which is easy to analyse.
In the XML file, your optimal result will be tagged with:
solutionName="incumbent"
Note that CPLEX 11 can report solutions in the results.s
file with a lower objective than the final “incumbent”. If
you see this it is a bug, but you should extract the variable
settings from the solution, set them as constraints in your
model and check the objective value to confirm.
You should save the CPLEX log file cplex.log with
your results, original problem formulation and CPLEX ver-
sion and settings information for future reference when quot-
ing results.
4. BIG-MS AND OBJECTIVE SENSITIVITY
When creating a Linear Program with logic relations like
those of section 9, it is often necessary to use large con-
stant multipliers generally called big-M’s which when cou-
pled with your variable and objective values contribute to
the range of numbers in your problem. If you have too large
a range of numbers in your problem not only can your so-
lution times suffer, but your solution values can actually be-
come invalid as demonstrated in table 1.
M0 Multiple    Objective Error
1              0%
10             0%
100            1%
1000           60%
10000          100%
Table 1: Errors in the reported best solution objectives
found by CPLEX 11 when compared to the objective of the
true optimal solution for a GQMIP [6] problem using different
multiples of a nominal big-M constant M0.
Often you can reduce your number range by scaling the
problem objective and variables and setting big-M’s on a
constraint by constraint basis rather than, say, using the
standard C programmer’s default of UINT_MAX or the
theoretical limit of [7]. Alternatively you could consider converting
your big-M constraints to CPLEX indicators [2] or using the
GNU GLPK [8] solver GLPSOL instead of CPLEX with the
exact arithmetic option:
glpsol --exact --cpxlp ./problem.lp
however both of these alternatives are likely to have a detri-
mental impact on your overall solution times [9].
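One commonly cited mechanism behind invalid big-M solutions is the solver's integrality tolerance: a binary variable within the tolerance of 0 is accepted as 0, yet multiplied by a large M it still lets the constrained variable through. The Python sketch below illustrates this arithmetic for a constraint of the form x <= M*beta. It is an illustration under assumptions, not a diagnosis of the errors in table 1: the tolerance of 1e-5 is assumed to be the CPLEX default, and the function name leak is hypothetical.

```python
INT_TOL = 1e-5  # assumed integrality tolerance (taken to be the CPLEX default)

def leak(big_m, int_tol=INT_TOL):
    """Largest x allowed by x <= M*beta when beta = int_tol is accepted as 0."""
    return big_m * int_tol

for big_m in (1e2, 1e4, 1e6, 1e8):
    print(f"M = {big_m:>13,.0f}: x may reach {leak(big_m):g} while beta counts as 0")
```

Under these assumptions the leak grows linearly with M, which is one reason to set each big-M as tightly as its constraint allows.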
5. SOLUTION TIMES
All Linear Programs with only real variables can be solved
in polynomial time [10]. However, the same is not true
for Mixed Integer Linear Programs (MILPs) with integer or
boolean variables which are often combinatorial in nature.
CPLEX uses Branch and Bound [11] to reduce the MILP
search space and can solve MILPs quickly provided the
relaxed problem has sufficient sensitivity. If you find your
laxed problem has sufficient sensitivity. If you find your
MILPs are taking too long to solve, you will need to refor-
mulate your model or create your own bounding relaxations
and, for example, integrate them into CPLEX’s Branch and
Bound algorithm using the ILOG Java Concert Technology
API [2].
The effect of model formulation can be seen in table 2
where three different MILP forms are used to solve the same
optimisation problem and produce markedly different Branch
and Bound tree sizes as a result of their different lower bound
relaxation granularities.
Method B&B Nodes
GQAP 39
mms 64
mxs 1,635,667
Table 2: Branch and Bound search tree nodes for three
equivalent versions of a GQAP problem solved with default
options in CPLEX 11. GQAP is standard GQAP [12] and
mms and mxs are optimistic and robust GQMIP [6] forms
respectively. All three forms produce the same objective.
6. QUICK SOLUTIONS
CPLEX includes a local neighbourhood heuristic search al-
gorithm and other options that can quickly approach an op-
timal solution in cases where combinatorial complexity can
not be avoided. The CPLEX heuristics slow down the over-
all solution process but can often produce better results than
bespoke heuristic algorithms after only a few seconds of ex-
ecution.
The CPLEX neighbourhood heuristic and solution focus
can be set with the options:
set mip strategy rinsheur 100
set mip strategy probe 3
set mip cuts all 2
set emphasis mip 3
before the opt command in either the interactive solver or
the cplex.sh script from section 2.
You can suspend the CPLEX solution process at any
time by pressing CTRL+C in the interactive shell. This allows
you to save an intermediary solution or change the CPLEX
options and continue the optimisation process by typing opt
again. For example, to save an intermediary solution after
the CPLEX upper bound stabilises and then reset the RINS
heuristic for faster completion press CTRL+C and type:
write results_quick.s sol all
set mip strategy rinsheur 100000
opt
In a script you can prematurely terminate CPLEX (without
the option to continue) with the setting:
set timelimit <seconds>
In the interest of completeness, you should also be aware
of the CPLEX MIP tolerance options uppercutoff and
lowercutoff which, while apparently attractive for speed-
ing up solutions in the presence of known bounds, do not
actually speed-up the Branch and Bound process when re-
laxed bounds are loose.
7. START FILES
You can export the current CPLEX results for use as a start-
ing point for future runs with:
write results.s sol all
To read in the file as a CPLEX solution starting point use:
read results.s mst
It should be noted that CPLEX strictly only needs an MST
file to restart from a previous solution [2], however the full
results file produced by the first statement above is more
generally useful than the compact minimal MST start files
which do not, for example, contain objective information.
8. MEMORY LIMITS
As CPLEX works through a combinatorial search tree it
keeps track of branches taken and bounding result informa-
tion at each tree node. If your problem can not be easily cut
by Branch and Bound, the search tree can be combinatorial
in size and can cause CPLEX to crash or at least stop with
an error if your physical memory limit is exceeded.
To avoid physical memory limits, use the following
options to allow CPLEX to use the disk to store large trees:
set workmem 256
set mip strategy file 3
As a rule of thumb, the workmem value should not be more
than half your physical memory (in megabytes) to ensure
CPLEX has enough memory for code and intermediaries
and that the OS does not need to resort to paging given other
resident applications.
As CPLEX runs, you may see your free memory drop
well below the 50% mark recommended above. This may
be because of OS buffers on the CPLEX tree files which the
OS will free automatically as memory is needed.
9. LINEAR PROGRAMMING LOGICAL
EQUIVALENCIES
I conclude this paper with a set of lemmas I developed to
demonstrate how logical functions that Computer Science
students will already be aware of can be expressed as In-
teger Linear Programming (ILP) minimisation constraints.
These lemmas and relations can be used as a starting point
for mathematically modelling a logical problem.
Lemma 9.1. Boolean logical NOT (γ = ¬α : α, γ ∈ B)
can be expressed as linear programming constraints through:
(1− α) ≤ γ ≤ (1− α) (1)
Proof. Trivial.
Lemma 9.2. Boolean logical AND (γ = α∧ β : α, β, γ ∈
B) can be expressed as linear programming constraints through:
α+ β − 1 ≤ 2γ ≤ α+ β (2)
Proof. Proof is through the logic table below.
α    β    α+β−1    α+β    γ
0    0     −1       0     0
0    1      0       1     0
1    0      0       1     0
1    1      1       2     1
Corollary 9.3. Boolean logical AND is equivalent to quadratic
multiplication in B and so lemma 9.2 represents a basis for
the linearisation of quadratic constraints.
Lemma 9.4. Boolean logical OR (γ = α ∨ β : α, β, γ ∈
B) can be expressed as Linear Programming constraints
through:
α+ β ≤ 2γ ≤ 2(α+ β) (3)
Proof. Proof is through the logic table below.
α    β    α+β    2(α+β)    γ
0    0     0       0       0
0    1     1       2       1
1    0     1       2       1
1    1     2       4       1
Lemma 9.5. The 0-1 threshold of an integer parameter (β =
min(x, 1) : x ∈ N∗, β ∈ B) can be expressed as linear
programming constraints, for a big-M constant M no smaller
than the largest value x can take, through:
x ≤ Mβ ≤ Mx (4)
Proof. Proof is through the logic table below.
x    Mx    β
0    0     0
1    M     1
2    2M    1
n    nM    1
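The logic tables above can also be checked mechanically. The Python sketch below enumerates every binary input and confirms that each pair of inequalities admits exactly one value of the output variable. The helper names feasible_values and check_lemmas and the choice M = 10 are illustrative assumptions, not part of the original lemmas.

```python
from itertools import product

def feasible_values(lower, upper, scale=2):
    """Binary values g that satisfy lower <= scale*g <= upper."""
    return [g for g in (0, 1) if lower <= scale * g <= upper]

def check_lemmas(big_m=10):
    for a in (0, 1):
        # Lemma 9.1: (1 - a) <= gamma <= (1 - a) forces gamma = NOT a
        assert feasible_values(1 - a, 1 - a, scale=1) == [1 - a]
    for a, b in product((0, 1), repeat=2):
        # Lemma 9.2: a + b - 1 <= 2*gamma <= a + b forces gamma = a AND b
        assert feasible_values(a + b - 1, a + b) == [a & b]
        # Lemma 9.4: a + b <= 2*gamma <= 2(a + b) forces gamma = a OR b
        assert feasible_values(a + b, 2 * (a + b)) == [a | b]
    for x in range(big_m + 1):
        # Lemma 9.5: x <= M*beta <= M*x forces beta = min(x, 1) while x <= M
        assert feasible_values(x, big_m * x, scale=big_m) == [min(x, 1)]
    return True

print(check_lemmas())
```

In this sketch the constraints of lemma 9.5 become infeasible once x exceeds M (for example x = 11 with M = 10), which is why M has to be at least as large as the largest value x can take.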
ACKNOWLEDGEMENTS
Dr D. Kuhn and W. Wiesemann provided valuable contribu-
tions and advice for the construction of this paper.
REFERENCES
[1] ILOG Inc. http://www.ilog.com
[2] CPLEX 11.2 Manuals ILOG Inc., (2008).
[3] AMPL Optimization LLC http://www.ampl.com
[4] GAMS Development Corp. http://www.gams.com
[5] ROCKAFELLAR, R.T. Convex Analysis. Princeton
University Press, (1970).
[6] SPACEY, S.A. Computational Partitioning for Hetero-
geneous Architectures Imperial Ph.D. Thesis, (2009).
[7] PAPADIMITRIOU, C.H., STEIGLITZ, K. Combinato-
rial Optimization: Algorithms and Complexity. Dover
Publications Inc., New York, (1998).
[8] GLPK (GNU Linear Programming Kit) Free Software
Foundation, http://gnu.org/software/glpk
[9] SCIP: Solving Constraint Integer Programs Zuse Insti-
tute Berlin (ZIB), http://scip.zib.de
[10] KARMARKAR, N. A New Polynomial Time Algo-
rithm for Linear Programming Combinatorica, Vol.
4, No. 4, pp 373-395, (1984).
[11] WOLSEY, L.A. Integer Programming. John Wiley &
Sons Inc., (1998).
[12] LEE, C., MA Z. The Generalized Quadratic Assign-
ment Problem. University of Toronto, (2004).
Appendix D
Heuristic Assignment Workshop
Paper
RAPID DESIGN SPACE VISUALISATION THROUGH HARDWARE/SOFTWARE
PARTITIONING
Simon A. Spacey, Wayne Luk, Paul H.J. Kelly and Daniel Kuhn
Department of Computing
Imperial College
London, UK
ABSTRACT
This paper introduces the 3SP Design Space Exploration
System. 3SP automatically quantifies acceleration oppor-
tunities for programs across a wide range of heterogeneous
architectures to allow designers to identify promising im-
plementation platforms before investing in a particular hard-
ware/software codesign. 3SP uses a novel program execu-
tion model to integrate comprehensive hardware characteris-
tics including clock speed, number of execution units, issue
rates, bandwidths and latencies with software program exe-
cution, parallelism, control and data flow measurements to
estimate codesign performance for evaluating opportunities
for hardware acceleration.
1. INTRODUCTION
The 3S Partitioner (3SP) uses 3S [1] program characteri-
sation measurements to automatically partition software for
execution on a range of heterogeneous computational archi-
tectures to allow rapid design space visualisation and op-
portunity exploration before committing to a particular im-
plementation platform. 3SP partitions software at the bi-
nary level and can be used to generate design curves for any
program written in any compilable language and produces a
list of code assignments to assist the designer in their initial
hardware/software partitioning decisions.
The main contribution delivered by 3SP is a novel high-
quality heuristic to quantify partition opportunities for a wide
range of architectures. Unlike previous work, the 3SP heuris-
tic is generally applicable and can be seamlessly applied to
architectures with superscalar out-of-order CISC processors
and architectures with tightly and loosely coupled reconfig-
urable components.
This paper begins with a brief overview of related work
in section 2. In section 3 the 3SP methodology is disclosed
and in section 4 results are presented that demonstrate the
use of the 3SP system to identify acceleration opportunities
for several benchmarks for a range of potential heteroge-
neous platforms. The paper continues with a discussion of
future work and conclusions in sections 5 and 6.
2. RELATED WORK
Table 1 compares the features of the architecture neutral 3SP
timing estimation heuristic against the heuristics of previous
automatic hardware/software partitioning research.
Characteristic                  [2]  [3]  [4]  [5]  3SP
block size                       ✓    ✓    ✓    ✓    ✓
block iterations                 ✓    ✓    ✓    ✓    ✓
data flow                        ✗    ✗    ✓    ✓    ✓
parallel execution slots         ✗    ✗    ✗    ✓    ✓
communication bandwidth          ✗    ✗    ✗    ✓    ✓
communication latency            ✗    ✗    ✗    ✗    ✓
control flow                     ✗    ✗    ✗    ✗    ✓
execution cycle measurements     ✗    ✗    ✗    ✗    ✓
Table 1. Software characterisation metrics used in previous
heterogeneous partitioning heuristics.
In Stitt et al. [2], RISC binaries are decompiled and par-
titioned using loop iteration count profiling. By analysing
program binaries, the Stitt method has the advantage of be-
ing applicable to any compilable language, however the use
of only block size and iteration counts in the partitioning
heuristic means the method has limited applicability and is
most appropriate for simple single cycle ALUs and tightly
coupled architectures.
In Lysecky et al. [3], single loop kernels from RISC bi-
naries are partitioned to reconfigurable logic using hardware
loop profiling, on-chip CDFG analysis and warp processor
technology mapping. The Lysecky approach has the bene-
fit of providing low overhead profiling results through ded-
icated hardware however, like Stitt [2], the approach uses
loop iterations rather than actual timing measurements to
identify partition targets which limits the applicability of the
approach for architectures with variable-cycle superscalar
CPUs.
In Stitt et al. [4], RISC binaries are decompiled and par-
titioned with a greedy algorithm using execution loop pro-
filing and statically determined advanced CDFG informa-
tion. The Stitt approach has the benefit of reducing data flow
costs through shared data analysis but is focused on tightly
coupled architectures and does not take into account archi-
tecture communication latencies and bandwidths which are
required to partition for distributed hardware components.
In Atasu et al. [5], source code is partitioned using a
knapsack algorithm based on execution time estimation and
data flow requirements. Using source code has the disadvan-
tage of potentially excluding commercial applications where
source code is not available, however by including band-
width information, the Atasu approach has the advantage of
being applicable to distributed architectures with significant
communication bandwidth constraints.
Other relevant research includes automatic software par-
titioning [6–14], manual partitioning utilities [15, 16] and
program analysis systems [1, 17–19].
3. METHODOLOGY
3SP uses a unique combination of hardware and software
characteristics to quantify the acceleration potential of a pro-
gram for a range of architectures. The information 3SP uses
in its heuristic is summarised in table 2 below.
Hardware Characteristic              Software Characteristic
τl     cycle time                    execution time                   µpr
ωl     parallel execution units      parallel execution slots         φpl
εl     execution efficiency          program code unit iterations     ιp
λlm    bus latency                   control flows                    χpq
βlm    bus bandwidth                 data flows                       ηpq
Zl     hardware size capacity        size of code at each location    zpl
Table 2. Hardware and software characteristics used by the
3SP execution model.
3SP obtains the hardware information it requires from a
user initialised base configuration file which 3SP automati-
cally sweeps over a range of values while calculating the ac-
celeration opportunities for a design-space. The current 3SP
implementation operates at the program basic block level us-
ing software characterisation information obtained from 3S
tools [1]; however the 3SP approach can be used to partition
at any level of program granularity including the functional
level provided the above software characteristic information
is available. The 3S tools currently used are: 3S hotspot for
CPU block-level execution timing, iteration counts and size
measurements, 3S parallelism for block level parallelism in-
formation, 3S callgrind for control flow information and 3S
data flow for inter-block data flow information.
3SP unifies the hardware and software characteristic in-
formation into a single execution time estimate for a set of
potential assignments of code sections p to locations l in an
architecture. The heuristic is then used to select the best
assignment of a program’s code sections to particular archi-
tectures using the partitioning algorithm described later in
this section.
The general 3SP timing estimate heuristic is intuitive
and represents the sum of execution times µpl plus the sum
of all communication times cpqlm for a program’s code sec-
tions assigned to an architecture’s locations:
t = \sum_{p} \mu_{pl} + \sum_{pq} c_{pqlm}    (1)
where p, q are code section indices and l,m are location in-
dices and the execution µpl and communication cpqlm times
are defined by equations (2) and (3) below with reference to
the architecture characteristics of table 2.
\mu_{pl} = \frac{\iota_p \phi_{pl} \tau_l}{\epsilon_l}    (2)
c_{pqlm} = \chi_{pq} \lambda_{lm} + \frac{\eta_{pq}}{\beta_{lm}}    (3)
The εl parameter of equation (2) is a hardware implementation
efficiency factor that can be considered the maximum
sustainable issue rate per execution unit at hardware
location l and could be quoted at the program code section
level of granularity if required. Equation (2) can be used as a
substitute for the hotspot CPU cycle time measurements µpr
if the reference component r where the 3S measurements
are made is not part of the heterogeneous architecture being
modelled.
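The timing estimate of equations (1) to (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the 3SP implementation: the dictionary layout, the function names, the location names "cpu" and "fpga" and all numbers are hypothetical, and communication between code sections on the same location is assumed to cost nothing.

```python
def exec_time(p, l, hw, sw):
    """Equation (2): iterations * parallel slots * cycle time / efficiency."""
    return sw["iters"][p] * sw["slots"][(p, l)] * hw["cycle"][l] / hw["eff"][l]

def comm_time(p, q, l, m, hw, sw):
    """Equation (3): control flows * latency + data flows / bandwidth."""
    if l == m:  # assumption: communication within one location is free
        return 0.0
    return (sw["ctrl"].get((p, q), 0) * hw["lat"][(l, m)]
            + sw["data"].get((p, q), 0) / hw["bw"][(l, m)])

def total_time(assign, hw, sw):
    """Equation (1): sum of execution times plus all communication times."""
    t = sum(exec_time(p, l, hw, sw) for p, l in assign.items())
    t += sum(comm_time(p, q, assign[p], assign[q], hw, sw)
             for p in assign for q in assign if p != q)
    return t

# Hypothetical two location architecture and two code sections.
hw = {"cycle": {"cpu": 1.0, "fpga": 2.0},
      "eff": {"cpu": 1.0, "fpga": 1.0},
      "lat": {("cpu", "fpga"): 5.0, ("fpga", "cpu"): 5.0},
      "bw": {("cpu", "fpga"): 8.0, ("fpga", "cpu"): 8.0}}
sw = {"iters": {"a": 100, "b": 10},
      "slots": {("a", "cpu"): 4, ("a", "fpga"): 1,
                ("b", "cpu"): 2, ("b", "fpga"): 2},
      "ctrl": {("a", "b"): 10},
      "data": {("a", "b"): 40}}

print(total_time({"a": "cpu", "b": "cpu"}, hw, sw))   # everything on the CPU
print(total_time({"a": "fpga", "b": "cpu"}, hw, sw))  # section a moved
```

With these made-up numbers, moving section a saves 200 time units of execution but adds 55 units of communication, so the estimate falls from 420 to 275.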
[Figure 1: flowchart with the boxes "Initialise Code/Location Set", "Remove Code/Location Combinations too large to fit on the Hardware", "Adjust Attractiveness of Remaining Code/Location Combinations", "Move the Highest Attractiveness Code to the Corresponding Partition", "Store the Current Partition if it has the Fastest Execution Time seen so far", the decision "Partitionable Code Remaining?" (Yes/No) and the labels "Metrics", "Best Partition" and "Stop".]
Fig. 1. The Attractiveness Partitioning Algorithm (APA).
3SP uses the Attractiveness Partitioning Algorithm (APA)
depicted in figure 1 to rapidly generate high-quality
partitioning solutions that minimise the 3SP execution time
estimates for code assignments to hardware. The APA
algorithm is a heuristic regret minimisation algorithm that
partitions considerably faster than a theoretically optimal
approach while retaining partition quality with high average
partition speedups as discussed in section 5.
The APA algorithm begins with all code sections
assigned to an initial reference partition, calculates the 3SP
times for each code section if moved to each of the alternate
hardware locations in isolation, generates the attractiveness
metrics and moves the reference code section with the
highest attractiveness (corresponding to the highest potential
regret) to its optimal hardware partition. APA then continues
varying the 3SP times to account for the modified cross
partition communication costs, recalculating the attractiveness
measures and moving the most attractive reference code
section to its optimal hardware location until no reference
code sections that can fit on an alternate hardware location
remain. At each iteration APA keeps a record of the 3SP
execution time estimate for the current partition assignments,
and on completion APA returns the partition assignments
with the minimum 3SP execution time observed.
The attractiveness measure used by APA is the maxi-
mum potential single step 3SP speed-up opportunity lost
if the program code section p currently located at r is not
moved to location l divided by the hardware space require-
ment zpl of block p on l defined as:
\alpha_{pl \neq r} = \frac{\min(\{t_{px} \mid x \in \mathrm{locations} \setminus \{l\}\}) - t_{pl}}{z_{pl}}    (4)
for a two component architecture equation (4) simplifies to:
\alpha_{pl \neq r} = \frac{t_{pr} - t_{pl}}{z_{pl}}    (5)
where tpr is the 3SP execution time estimate for the parti-
tion with p assigned to an initial reference partition r and tpl
the 3SP execution time estimate for the partition with p as-
signed to the alternate hardware location l. The simplified
two component αpl≠r attractiveness ratio is reminiscent of
the regret minimisation Sharpe ratio used in financial port-
folio optimisation where a profit difference is divided by a
risk [22].
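The APA loop for a two component architecture can be sketched in Python as follows. This is a minimal illustration of the description above using the attractiveness ratio of equation (5), not the authors' script: the function names, the toy timing model in toy_estimate and all numbers are hypothetical, and the alternate location's capacity is assumed to be consumed cumulatively as code sections are moved.

```python
def apa(blocks, ref, alt, estimate, size, capacity):
    """Greedy attractiveness partitioning between `ref` and `alt`.

    estimate maps an assignment dict to an execution time estimate,
    size[p] is the hardware size of block p on `alt` and capacity is
    the hardware size capacity of `alt`.
    """
    assign = {p: ref for p in blocks}
    best_time, best_assign = estimate(assign), dict(assign)
    free = capacity  # assumption: moved blocks consume capacity cumulatively
    while True:
        # Blocks still on the reference location that fit on the hardware
        cands = [p for p in assign if assign[p] == ref and size[p] <= free]
        if not cands:
            break
        t_ref = estimate(assign)

        def attractiveness(p):
            # Equation (5): (t_pr - t_pl) / z_pl for moving p on its own
            return (t_ref - estimate({**assign, p: alt})) / size[p]

        p = max(cands, key=attractiveness)
        assign[p] = alt
        free -= size[p]
        t = estimate(assign)
        if t < best_time:  # record the fastest partition seen so far
            best_time, best_assign = t, dict(assign)
    return best_assign, best_time

# Toy timing model: per-location execution times plus a fixed cost for
# every pair of communicating blocks placed on different locations.
EXEC = {"a": {"cpu": 400, "hw": 200},
        "b": {"cpu": 20, "hw": 40},
        "c": {"cpu": 300, "hw": 100}}
CROSS = {("a", "b"): 55, ("a", "c"): 10}

def toy_estimate(assign):
    t = sum(EXEC[p][l] for p, l in assign.items())
    return t + sum(c for (p, q), c in CROSS.items() if assign[p] != assign[q])

sizes = {"a": 100, "b": 50, "c": 120}
print(apa(["a", "b", "c"], "cpu", "hw", toy_estimate, sizes, capacity=256))
```

In this toy run block c is moved first and block a second, after which block b no longer fits, so the sketch returns the partition with a and c on the hardware at an estimated time of 375. The loop keeps moving blocks even when attractiveness turns negative and returns the best partition recorded, mirroring the description above.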
4. RESULTS
This section provides 3SP acceleration results for the syn-
chronous two component heterogeneous architecture pre-
sented in figure 2. The design space considered ranges from
an SoC architecture tightly coupled through a 32-bit bus op-
erating at the CPU/coprocessor speed, to a loosely coupled
architecture with the CPU and reconfigurable coprocessor
communicating through a 24x HyperTransport 3.0 bus with
the short packet store and forward (S&F) latency character-
istics [20], along with a variety of hardware sizes and imple-
mentation efficiencies.
[Figure 2: block diagram with the labels "CPU", "Memory", "Reconfigurable Coprocessor", "Reference Computational Component", "Alternate Computational Component", "Data Flow" and "Control Flow".]
Fig. 2. The heterogeneous computational architecture
explored in the results section of this paper.
Results are provided for six MiBench 1.0 [21] bench-
marks with actual 3S software execution measurements made
on a reference Intel Pentium 4 x86 machine for the MiBench
large data sets. The benchmarks are selected to cover the
six MiBench categories: crc32 from the Telecommunica-
tions class, jpeg (compression) from the Consumer Devices
class, stringsearch from Office Automation, sha from Se-
curity, susan (smoothing) from Automotive and Industrial
Control and dijkstra from the Network class. The small-
est benchmark is crc32 with 22 active basic blocks and the
largest is jpeg with 1792 active basic blocks. Unless oth-
erwise stated, all results assume a conservative maximum
hardware partition size of 256 x86 integer instructions.
4.1. High-Level Acceleration Opportunities
Figures 3 and 4 compare the 3SP program acceleration re-
sults for the six MiBench benchmark binaries compiled with
three different gcc optimisation levels. Figure 3 provides
results for a tightly coupled architecture and figure 4 for a
loosely coupled architecture.
The figures show that 3SP identifies accelerating
partitions with up to 33 times speed-up potential for the
benchmarks considered. The impact of the different compiler
options is clearly visible, with -O1 having a lower
acceleration potential than either -O2 or -O3 for some benchmarks,
supporting the results of [6]. A designer can use these
reports to obtain a high level opportunity assessment for a
program before deciding whether the acceleration potential
justifies further analysis and, if so, which binary and archi-
tecture promises to deliver the best return on investment.
[Figure 3: bar chart of SPEED-UP (x), axis 0 to 35, for crc32, jpeg, stringsearch, sha, susan and dijkstra with series -O1, -O2 and -O3.]
Fig. 3. 3SP program accelerations for MiBench benchmarks
compiled with different gcc compiler optimisation levels on
a 32-bit bus SoC architecture operating at 345MHz.
[Figure 4: bar chart of SPEED-UP (x), axis 0 to 7, for crc32, jpeg, stringsearch, sha, susan and dijkstra with series -O1, -O2 and -O3.]
Fig. 4. 3SP program accelerations for MiBench benchmarks
compiled with different gcc compiler optimisation levels on
a loosely coupled distributed architecture with a 2GHz CPU
and a coprocessor operating at 345MHz.
4.2. Design Space Exploration
After obtaining a high-level acceleration opportunity report,
a designer can use 3SP to delve in detail into the design
space and explore the impact of individual or combinations
of architecture parameters on program acceleration.
Figures 5, 6 and 7 show the effect of partition size, band-
width and latency on partition performance for the dijkstra,
susan and sha MiBench benchmarks compiled with -O3 us-
ing the tightly coupled architecture report of figure 3 as a
starting point. The bandwidth and latency figures cover the
full range of architecture characteristics from single chips to
distributed architectures, demonstrating the general applica-
bility of the 3SP execution time model.
From figure 5 a designer could conclude that hardware
sizes greater than 64 instruction equivalents will not bring
significant benefit when accelerating dijkstra in the size range
considered. From figure 6 a designer can see that sha is
bandwidth sensitive, and from figure 7 the step transitions in
3SP acceleration potential could justify a designer expend-
ing extra effort on relative heterogeneous component place-
ment and buffering when attempting to accelerate either the
sha or susan benchmarks using 3SP based partitions.
[Figure 5: SPEED-UP (x), axis 0 to 40, against PARTITION SIZE (Instructions), logarithmic axis 1 to 1000, with curves for dijkstra, susan and sha.]
Fig. 5. 3SP program accelerations for MiBench benchmarks
compiled with gcc -O3 on a 345MHz 32-bit bus SoC
architecture with hardware of different maximum sizes.
[Figure 6: SPEED-UP (x), axis 0 to 35, against BANDWIDTH (MB/s), logarithmic axis 0.1 to 10000, with curves for dijkstra, susan and sha.]
Fig. 6. 3SP program accelerations for MiBench benchmarks
compiled with gcc -O3 on a 345MHz SoC architecture
with different CPU to coprocessor bandwidths.
Figures 8 and 9 show the coprocessor clock speed and
implementation efficiencies that must be achieved to deliver
a required acceleration for the MiBench 1.0 dijkstra bench-
mark compiled with gcc -O3 and partitioned using 3SP
for a known CPU speed. The curves allow a designer to as-
sess the implementation characteristics required to deliver a
desired program acceleration, and can form a basis for hard-
ware cost/benefit analysis.
[Figure 7: SPEED-UP (x), axis 0 to 35, against LATENCY (ns), logarithmic axis 1 to 1000000, with curves for dijkstra, susan and sha.]
Fig. 7. 3SP program accelerations for MiBench benchmarks
compiled with gcc -O3 on a 345MHz 32-bit bus SoC
architecture with different CPU to coprocessor latencies.
[Figure 8: REFERENCE CPU SPEED (MHz), axis 0 to 4500, against COPROCESSOR SPEED (MHz), axis 200 to 600, with curves for 3x, 5x, 10x and 20x speed-ups.]
Fig. 8. Maximum CPU speed that can be accelerated to
achieve a required 3SP partition speed-up for the dijkstra
benchmark compiled with gcc -O3 using a given
coprocessor assuming a 2.89ns latency and 1.28GB/s bandwidth
bus.
5. EVALUATION AND FUTURE WORK
The 3SP methodology can be applied to any program par-
titioning granularity and a range of architecture configura-
tions and components including superscalar CPUs. How-
ever, 3SP is currently only implemented for block based par-
titioning of x86 integer programs because of limitations in
the current 3S [1] release (version 2.8) used to gather soft-
ware characterisation information.
The 3SP hardware/software partitions are generated quickly by
the APA heuristic algorithm to allow rapid design space
visualisation. They are therefore not theoretically optimal for
the problem of partitioning with communication costs, which is
NP-hard [24]. For the design spaces presented in this paper, an
unoptimised APA script implementation is up to 16.7 times faster
than the optimal CPLEX solver and produces partitions with
speedups that are, on average, less than 15% below the optimal
values. The speed of APA can be improved to 31.9 times that of
CPLEX, at an additional 15% loss in optimality, by stopping APA
on the first negative $\alpha_{pl \neq r}$ value.
[Figure 9 plot: reference CPU speed (MHz) against hardware
implementation efficiency (50% to 200%) with one curve for each
required speed-up of 3x, 5x, 10x and 20x.]
Fig. 9. Maximum CPU speed that can be accelerated to
achieve a required 3SP partition speed-up for the dijkstra
benchmark compiled with gcc -O3 using a given hardware
implementation efficiency assuming a 345MHz coprocessor
with a 2.89ns call latency and 1.28GB/s bandwidth data bus.
Apart from the sub-optimality of the APA partitioning
algorithm, the performance opportunities and transition points
estimated by 3SP could differ from a physical implementa-
tion because of:
1. simplifications in the 3SP execution time model,
2. inaccuracies in the 3S measurements,
3. differences between the actual physical implementa-
tion and the 3SP model.
However despite these issues, the 3SP design space curves
can still be of use to designers for quickly identifying trends
and the sensitivity of programs to architectural parameters.
Future work may address the above issues through more
complete simulation (for example using the 3S cache flow
tool to account for cache interactions on code section
migration), the use of special hardware for actual timing
measurements [3] and the provision of alternative partitioning
methods and levels of granularity. Additionally, the 3SP model
could be extended with program code dependent technology
mappings and efficiency factors to allow detailed space
estimation through calibration [13] and program code section
dependent data issue rates [15].
3SP could be further enhanced to implement the parti-
tions it identifies on real hardware. To do this, the 3SP
partitions could be readily re-compiled for hardware using
existing source code compilers [15, 16] with a pull-based
memory architecture [3, 4] and standard execution synchro-
nisation techniques [8, 9].
6. CONCLUSION
This paper presents an overview of the 3SP Software Parti-
tioning System. 3SP partitions software for execution on a
heterogeneous architecture and allows designers to explore
the hardware/software codesign before committing to a par-
ticular hardware platform. 3SP uses a novel heuristic that
combines actual software size, run-time, parallelism, control
and data flow measurements with hardware characteristics to
allow seamless design space exploration across tightly and
loosely coupled heterogeneous computational architectures.
In section 4 results are presented for MiBench bench-
marks demonstrating the ability of 3SP to identify signif-
icant program acceleration potentials through its automatic
binary partitioning algorithm for heterogeneous architectures.
Further, the results demonstrate the applicability of 3SP as a
design-tool for rapid investigation and visualisation of pro-
gram acceleration opportunities, sensitivities and transition
points through the comprehensive exploration of the design
space for possible hardware and software architectures.
7. ACKNOWLEDGEMENT
The support of the UK EPSRC is gratefully acknowledged.
Dr O. Mencer, Dr D. Thomas and Dr T. Todman provided
valuable contributions.
8. REFERENCES
[1] S.A. Spacey. 3S: Program Instrumentation and Char-
acterisation Framework. Imperial Technical Paper
2008/1 http://www.doc.ic.ac.uk/research/
technicalreports/2008/DTR08-1.pdf, 2006.
[2] G. Stitt, F. Vahid. Hardware/Software Partitioning of Soft-
ware Binaries. IEEE/ACM International Conference on
Computer Aided Design, pp 164–170, 2002.
[3] R. Lysecky, F. Vahid. A Configurable Logic Architecture for
Dynamic Hardware/Software Partitioning. Design Automa-
tion and Test in Europe Conference (DATE), pp 480–485,
2004.
[4] G. Stitt, F. Vahid. A Decompilation Approach to Partitioning
Software for Microprocessor/FPGA Platforms. Design
Automation and Test in Europe (DATE), pp 396–397, 2005.
[5] K. Atasu, C. Özturan, G. Dündar, O. Mencer, W. Luk.
CHIPS: Custom Hardware Instruction Processor Synthesis.
IEEE Trans. on Computer-Aided Design of Integrated Cir-
cuits and Systems, vol. 27, no. 3, pp. 528–541, 2008.
[6] G. Stitt, F. Vahid. New Decompilation Techniques for Binary-
level Co-processor Generation. IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), pp 547–
554, 2005.
[7] R. Lysecky, G. Stitt, F. Vahid. Warp Processors. ACM Trans-
actions on Design Automation of Electronic Systems (TO-
DAES), pp 659–681, 2006.
[8] T. Wiangtong, P.Y.K. Cheung, W. Luk. Hardware/Software
Codesign: A Systematic Approach Targeting Data-Intensive
Applications. IEEE Signal Processing, Vol. 22, No. 3, pp
14–22, 2005.
[9] Y.M. Lam, J.G.F. Coutinho, W. Luk, P.H.W. Leong. In-
tegrated Hardware/Software Codesign for Heterogeneous
Computing Systems. In Proceedings of the Southern Con-
ference on Programmable Logic, pp 217–220, 2008.
[10] S. Sirowy, Y. Wu, S. Lonardi, F. Vahid. Two Level
Microprocessor-Accelerator Partitioning. Design Automa-
tion and Test in Europe (DATE), pp 313–318, 2007.
[11] G.C. Hunt, M.L. Scott. The Coign Automatic Distributed
Partitioning System. In Proceedings of the 3rd Symposium
on Operating Systems Design and Implementation, New Or-
leans, Louisiana, pp 45–52, 1999.
[12] A. Kalavade, E.A. Lee. A Global Criticality/Local Phase
Driven Algorithm for the Constrained Hardware/Software
Partitioning Problem. International Workshop on Hard-
ware/Software Codesign, pp 42–48, 1994.
[13] P. Eles, Z. Peng, A. Kuchcinski, A. Doboli. System Level
Hardware/Software Partitioning Based on Simulated An-
nealing and Tabu Search. Journal on Design Automation for
Embedded Systems, vol. 2, pp 5–32, 1997.
[14] M.W. Hall, J.M. Anderson, S.P. Amarasinghe, B.R. Murphy,
S. Liao, E. Bugnion, M.S. Lam. Maximizing Multiprocessor
Performance with the SUIF Compiler. IEEE Computer (A
special issue on multiprocessors), 1996.
[15] O. Mencer, D.J. Pearce, L.W. Howes, W. Luk. Design Space
Exploration with A Stream Compiler. IEEE International
Conference on Field Prog. Tech., 2003.
[16] Celoxica Ltd. Handel-C Language Reference Manual. 2004.
[17] N. Nethercote, J. Seward. Valgrind: A Program Supervision
Framework. Proceedings of the 3rd Workshop on Runtime
Verification 2003.
[18] C. Luk et al. Pin: Building Customized Program Analysis
Tools with Dynamic Instrumentation. Proceedings of the
2005 ACM SIGPLAN conference on Programming language
design and implementation, 2005.
[19] D.J. Pearce, P.H.J. Kelly, T. Field, U. Harder. GILK: A Dy-
namic Instrumentation Tool for the Linux Kernel. Proceed-
ings of the 12th International Conference on Computer Per-
formance Evaluation, Modelling Techniques and Tools 37,
pp 220–226, 2002.
[20] B. Holden. Latency Comparison Between HyperTransport
and PCI-Express In Communications Systems. HyperTrans-
port Consortium, 2006.
[21] M.R. Guthaus et al. MiBench: A Free, Commercially
Representative Embedded Benchmark Suite. IEEE 4th Annual
Workshop on Workload Characterization, 2001.
[22] Sharpe, W.F. Mutual Fund Performance. Journal of Busi-
ness, pp 119–138, 1966.
[23] W. Wiesemann, D. Kuhn, B. Rustem. Multi-Resource Al-
location in Stochastic Project Scheduling. Annals of Opera-
tions Research, 2008.
[24] S. Sahni, T. Gonzalez. P-complete Approximation Problems.
Journal of the ACM (JACM), Vol. 23, pp. 555–565, 1976.
Appendix E
Multi-level Assignment Journal
Paper
COARSE-GRAINED PARALLEL PARTITIONING THROUGH FINE-GRAINED
SEQUENTIAL ASSIGNMENT FOR HETEROGENEOUS COMPUTATIONAL SYSTEMS
Simon A. Spacey, Wayne Luk, Paul H.J. Kelly and Daniel Kuhn
Department of Computing
Imperial College London
London, UK
{saspacey, wl, p.kelly, dkuhn}@doc.ic.ac.uk
ABSTRACT
This paper introduces a general optimisation method that
takes advantage of both task parallelism and fine-grained
code-sign specialisation on heterogeneous computational ar-
chitectures to deliver up to 64 times better acceleration re-
sults than previous assignment methods for the benchmarks
considered. The method includes a novel execution model
called the Write-Only Architecture that minimises commu-
nication latencies, general execution timing equations that
integrate a comprehensive set of characterisation informa-
tion and a robust optimal Linear Programming assignment
formalisation that can be used with any number of tasks
and hardware components. The paper presents results that
demonstrate the use and tractability of the optimal approach
over a design-space ranging from a tightly coupled SoC to a
heterogeneous computational cluster communicating over a
wide area network.
1. INTRODUCTION
This paper presents a method to combine coarse-grained
parallel task partitioning with sequential fine-grained assign-
ment to identify optimal partitions of software to heteroge-
neous execution components. The paper contributes:
1. a formal mathematical linear program that can be solved
to robustly assign multiple parallel tasks to shared het-
erogeneous components at a fine-grained level using
standard tools such as CPLEX.
2. a novel execution model called the Write-Only Archi-
tecture (WOA) that can reduce run-time communica-
tion latencies by a factor of up to five.
3. solution timing information that demonstrates the prac-
tical tractability of the optimal model for a set of bench-
marks over a wide range of architectural configura-
tions.
The paper begins with an overview of previous work
in Section 2. In Section 3 our robust linear programming
multi-level abstraction is presented along with our novel ex-
ecution model. In Section 4 design-space exploration re-
sults are provided that demonstrate the use of the multi-level
approach as an opportunity quantification tool for parallel
hardware/software partitioning and the practical tractability
of the approach. The paper continues with future work and
conclusions in Sections 5 and 6.
2. BACKGROUND AND RELATED WORK
Previous software partitioning work can be classified as ei-
ther parallel or sequential in focus. As Table 1 shows, par-
allel assignment methods typically deliver program acceler-
ation by exploiting coarse-grained data independence and
sequential assignment methods use fine-grained heteroge-
neous computational characteristics to deliver their perfor-
mance improvements.
                        Parallel   Sequential   MAP
Graph Type              DAG        CDFG         DAG/CDFG
Granularity             Coarse     Fine         Multi-level
Exploit Parallelism     ✓          ✗            ✓
Optimal Heterogeneity   ✗          ✓            ✓
Table 1: Summary of typical parallel and sequential par-
titioning characteristics compared with those of our Multi-
level Assignment Partitioning (MAP) approach.
Parallel assignment techniques require detailed sequence
and data dependency information which is often represented
through the levels of a Directed Acyclic Graph (DAG). Tra-
ditional DAGs are run trace length in size and DAG level
scheduling is made tractable through the coarse-grained
consolidation of vertices into tasks at the process [1],
function [2] or loop [3] level. The consolidation of fine-grained
program parts into coarse-grained tasks means that traditional
parallel assignment approaches cannot take full advantage of
fine-grained heterogeneous computational specialisation,
although some task level heterogeneous assignment benefits are
often still possible [1, 4].
Sequential assignment techniques work with Control/
Data Flow Graphs (CDFGs) instead of DAGs. CDFGs com-
press out sequence information from traces rather than sac-
rificing graph granularity like DAGs which allows sequen-
tial partitioning to be performed at a far finer-grained level
than parallel partitioning. Sequential partitioning is often
performed at the program kernel (tight loop) [5,6], the basic
block [7] or even at the raw assembly instruction [8] level.
In this paper we present a hybrid partitioning technique
called Multi-level Assignment Partitioning (MAP) which
assigns sets of coarse-grained parallel tasks to shared
heterogeneous hardware components on a fine-grained basis. In so
doing, MAP is able to exploit the benefits of data parallelism
as well as the benefits of fine-grained heterogeneous assign-
ment while remaining computationally tractable through its
multi-level paradigm.
The assignment methodology presented in this paper further
differs from previous work in that it is applicable to
programs of any compilable language (unlike [1, 4, 8, 9]), is
hardware architecture independent (unlike [5–8]) and produces
mathematically optimal results (unlike [1, 2, 4–7, 9, 10]). To
work with programs written in any language we use the 3S
framework [17] to automatically obtain the fine-grained software
characterisation information shown in Table 2. The software
information is then combined with the publicly available
hardware characteristics [32–34] shown in Table 3 using our
efficient architecture independent Write-Only Architecture (WOA)
execution model to produce a Linear Programming formulation
that we partition optimally using CPLEX 11 [37].
Characteristic                   [5]   [6]   [7]   [8]   MAP
size of code at each location    ✓     ✓     ✓     ✓     ✓
program code unit iterations     ✓     ✓     ✓     ✓     ✓
data flows                       ✗     ✗     ✓     ✓     ✓
parallel execution slots         ✗     ✗     ✗     ✓     ✓
control flows                    ✗     ✗     ✗     ✗     ✓
execution cycle measurements     ✗     ✗     ✗     ✗     ✓
Table 2: Comparison of software characterisation metrics
used in this and previous heterogeneous partitioning work.
Characteristic                   [5]   [6]   [7]   [8]   MAP
hardware size                    ✓     ✓     ✓     ✓     ✓
data bandwidth                   ✗     ✗     ✗     ✓     ✓
communication latency            ✗     ✗     ✗     ✗     ✓
relative cycle times             ✗     ✗     ✗     ✗     ✓
parallel execution units         ✗     ✗     ✗     ✗     ✓
execution efficiency             ✗     ✗     ✗     ✗     ✓
Table 3: Comparison of hardware characterisation metrics
used in this and previous heterogeneous partitioning work.
Execution efficiency is the maximum sustainable issue rate
per execution unit.
3. METHODOLOGY
3.1. The Multi-level Assignment Problem
Our multi-level assignment approach to software partition-
ing is summarised in Figure 1. Like traditional parallel par-
titioning we start with a set of coarse-grained tasks that can
run in parallel. However unlike traditional parallel parti-
tioning we use fine-grained characterisation information to
split the coarse-grained tasks into smaller code sections to
achieve optimal heterogeneous assignment to shared hard-
ware components. The problem of identifying parallel task
sets has been dealt with elsewhere [4,26] as has the problem
of heterogeneous software characterisation [17] and in this
paper we focus on the last steps of the MAP process: the
fine-grained assignment of a set of parallel coarse-grained
tasks to shared components to obtain latency optimality.
For any set of coarse-grained parallel tasks our optimal
assignment approach is required to find the fine-grained as-
signment of all task code sections to execution locations that
optimally minimises the heterogeneous execution time for
the slowest task in the parallel set. With both shared se-
quential and parallel components, the best execution time
that can be guaranteed independently of sequential schedul-
ing conflicts is given by:
Definition 3.1. Robust Multi-level Assignment Problem.
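\begin{align}
\min\quad & t_\parallel + t_\perp \tag{1}\\
\text{s.t.}\quad & t_\parallel \ge t^i_\parallel \quad \forall i \in T \tag{2}\\
& t_\perp = \sum_{i \in T} t^i_\perp \tag{3}
\end{align}
where $T$ is the parallel task set, $i$ a task identifier,
$t^i_\perp$ and $t^i_\parallel$ the times task $i$ spends executing
on sequential and parallel components respectively, $t_\parallel$
no less than the longest parallel execution time across all tasks
and $t_\perp$ the sum of the serial execution times for all tasks.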
The optimistic version of Definition 3.1 assumes no cross-task
schedule conflicts on shared components, with all task
execution times being parallelisable. This is equivalent to
replacing equations (2) and (3) in Definition 3.1 with:
\begin{align}
t_\parallel &\ge t^i_\parallel + t^i_\perp \quad \forall i \in T \tag{4}\\
t_\perp &= 0 \tag{5}
\end{align}
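The difference between the robust and optimistic objectives can be illustrated with a minimal Python sketch. The function names and the task timings below are hypothetical and are not taken from the paper; the sketch only evaluates equations (1) to (5) for per-task times that are assumed to be already known.

```python
from typing import Dict, Tuple

# Maps a task id to (parallel time, serial time) in any consistent unit.
TaskTimes = Dict[str, Tuple[float, float]]


def robust_time(task_times: TaskTimes) -> float:
    """Equations (1)-(3): slowest parallel part plus the sum of all serial parts."""
    t_parallel = max(par for par, _ in task_times.values())
    t_serial = sum(seq for _, seq in task_times.values())
    return t_parallel + t_serial


def optimistic_time(task_times: TaskTimes) -> float:
    """Equations (4)-(5): no cross-task conflicts, so whole tasks overlap."""
    return max(par + seq for par, seq in task_times.values())


# Hypothetical timings (ms) for a set of three parallel tasks.
example = {"task1": (4.0, 1.0), "task2": (6.0, 0.5), "task3": (3.0, 2.0)}
print(robust_time(example))      # 9.5 = 6.0 + (1.0 + 0.5 + 2.0)
print(optimistic_time(example))  # 6.5 = max(5.0, 6.5, 5.0)
```

The robust value is never smaller than the optimistic one, which is why it can be guaranteed independently of how the shared sequential components are scheduled.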
The robust form of the Multi-level Assignment Problem
(MAP) shown in Definition 3.1 requires execution times
divided into sequential $t^i_\perp$ and parallel $t^i_\parallel$
parts, which can be performed with hardware classified as
sequential or parallel as will be discussed in Section 3.4.
However, before we extend Definition 3.1 with our classified
timing equations we first need to introduce the general
execution model upon which our timing equations are based.
[Figure 1 panels, each showing the task DAG (tasks 1, 2 and 3
between start S and end E), the intra-task CFG (sections 1a, 1b
and 1c), the fine-grained assignment of sections 1a to 3c and the
resulting timings t1, t2 and t3 against told and tnew:]
(a) Coarse-grained parallel tasks are identified.
(b) Fine-grained characterisation information is generated for
each coarse-grained task.
(c) Tasks are assigned to shared locations on a fine-grained
basis.
(d) The fine-grained assignments with the best timings are found.
Fig. 1: The Multi-level Assignment Partitioning (MAP) approach of combining coarse-grained parallel task partitioning with
fine-grained sequential assignment.
3.2. The Write-Only Architecture
In multi-level assignment, the parallel software partitioning
problem becomes a set of sequential code assignment prob-
lems inter-related through the equations of Definition 3.1 for
each task in a parallel task set. As individual tasks within
a task set are independent and sequential by definition, we
only need to specify an execution model for a single in-
dependent sequential task like that of Figure 1(b) in order
to identify execution times for an assignment such as
Figure 1(c). Our sequential execution model is called the
Write-Only Architecture (WOA) and is described in this section.
Communication Paradigm         Latencies        Response Target
Custom Instructions [8, 22]    3                caller
Shared Memory [5–7]            5                caller
Client-Server [19, 27]         4/2 (TCP/UDP)    caller
Write-Only Architecture        1                control path
Table 4: Comparison of latencies and call response targets
for common communication paradigms.
The WOA is an execution model based on control-flows.
The distinguishing feature of the WOA is that there are no
reads of data or control signals, only data writes along the
control path. In the WOA, an executing code section passes
the result of its calculation directly along the task’s control
flow path to the next execution code section target rather
than passing computation results back through an execution
scheduler or message broker as in traditional paradigms.
Table 4 compares the WOA against previous commu-
nication paradigms which are all derived from the Remote
Procedure Call (RPC) [28] paradigm to some extent. RPC
was designed for two component distributed architectures
where waiting for a response does not incur extra commu-
nication latencies and where service responses are going to
be used by the caller and not forwarded on by the caller to
a third computational component. The moment we move to
tightly coupled busses or distributed architectures with more
than 2 components, the efficiency of the RPC paradigm breaks
down and this is where the new WOA execution model comes
in.
[Figure 2 panels: timelines of latency, data transfer and
computation phases between the two components.]
(a) Communications for a Custom Instruction call: the client
writes input data and sends a read request, the service computes
and then replies to the read.
(b) Communications for the WOA: each component writes an
activate packet and suspends, giving a latency saving.
Fig. 2: Communications for a Custom Instruction protocol
versus the WOA on a two component SoC architecture.
In Custom Instruction paradigms like those of [8, 22], a
CPU passes data to a custom instruction fabric using a write
and then issues a read to block the CPU and wait for the cus-
tom instruction’s response. This process requires three la-
tencies per call whereas the WOA requires only one latency
per control flow because it does not alter the homogeneous
program control flow path as illustrated in Figure 2.
Shared Memory models like those of [5–7] are similar to
Custom Instruction models except that components read the
data they need from shared memory after activation instead
of having the data pushed to them as part of the activation
signal. Shared Memory models are useful where the data
that will be required by a custom instruction cannot be eas-
ily predicted by the caller, however the model suffers from
additional latencies as shown in Table 4.
Client-Server communication paradigms like those of [19]
often use TCP or UDP as an underlying protocol. While
UDP is relatively efficient, even UDP Client-Server models
suffer from up to twice the latency overhead of WOA per
control flow because of the need for a central message bro-
ker or nested call stack as shown in Figure 3.
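The latency counts of Table 4 can be turned into a rough overhead comparison with a short Python sketch. The dictionary keys, the function name and the call count in the usage lines are illustrative assumptions; only the per-call latency counts come from Table 4, and the 165ns link latency is the loosely coupled bus latency quoted in Table 6.

```python
# Latencies incurred per cross-component call, copied from Table 4.
LATENCIES_PER_CALL = {
    "custom_instruction": 3,
    "shared_memory": 5,
    "client_server_tcp": 4,
    "client_server_udp": 2,
    "woa": 1,
}


def latency_overhead_ns(paradigm: str, calls: int, link_latency_ns: float) -> float:
    """Total latency overhead for a number of calls over a single link."""
    return LATENCIES_PER_CALL[paradigm] * calls * link_latency_ns


# Hypothetical workload: 1000 cross-component control flows over a 165ns link.
for name in LATENCIES_PER_CALL:
    print(name, latency_overhead_ns(name, 1000, 165.0))
```

Under these assumptions the Shared Memory paradigm pays five times the latency overhead of the WOA, which matches the "factor of up to five" reduction claimed in the introduction.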
The WOA mechanism can be implemented on tightly
coupled architectures using writes to memory mapped hard-
ware, SMP style cache snooping [29] and process spin-locking
or interrupts. In a distributed environment WOA can be
implemented using UDP. Because the caller is not necessarily
the response target, acknowledgement messages can be sent off
the main control flow path, providing communication reliability
at low cost as shown in Figure 3(b).
[Figure 3 panels: message sequences between components A, B and C.]
(a) Client-Server protocol: Call, ACK, Return and ACK messages.
(b) Communications for the WOA: Activate and ACK messages, giving
a latency saving.
Fig. 3: Communications for a Client-Server protocol versus
the WOA on a three component distributed architecture.
Dealing with data dependent control flow paths in a
heterogeneous environment requires not only that cross
component calls be directed to the correct hardware component,
but that they carry with them fine-grained identifiers for the
correct program block at the target location to execute next.
Figure 4 shows the basic WOA activation packet which ad-
dresses the data dependent path problem by including a fine-
grained program target identifier in the packet header. The
WOA activation packet data section can be used to pass path
context information (for example the current loop iteration)
as well as data for computations which can be identified with
fine-grained data dependence analysis [17].
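The activation packet idea can be sketched in Python as follows. The wire layout used here (32-bit little-endian words, with the Target ID as the first word followed by the data words) is purely an assumption for illustration; the paper only states that the packet carries a Target ID in its header together with a data section, and that the packet width is the bus width or a nominal bit grouping.

```python
import struct
from typing import List, Tuple

WORD_FORMAT = "<I"  # assumed packet width: one 32-bit little-endian word


def pack_activation(target_id: int, data: List[int]) -> bytes:
    """Serialise a hypothetical activation packet: Target ID, then data words."""
    words = [target_id] + list(data)
    return b"".join(struct.pack(WORD_FORMAT, word) for word in words)


def unpack_activation(packet: bytes) -> Tuple[int, List[int]]:
    """Recover the Target ID and the data words from a packed packet."""
    words = [word for (word,) in struct.iter_unpack(WORD_FORMAT, packet)]
    return words[0], words[1:]


# Activate a code section with the made-up identifier 0x1B, passing a loop
# iteration count and one operand as path context and computation data.
packet = pack_activation(0x1B, [7, 42])
print(len(packet))                 # 12 bytes: three 32-bit words
print(unpack_activation(packet))   # (27, [7, 42])
```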
Figure 5 illustrates how task 1 from Figure 1 would be
implemented using WOA activation packets if the block 1b
were an if else data dependent branch assigned to a hardware
component. A WOA Controller on the CPU (top) examines the
Target ID of each WOA activation packet sent by 1b and
forwards the activation to either 1a or 1c, implementing the
heterogeneous cross component data dependent path.
[Figure 4 diagram: WOA Activate Packet Format, with a Target ID
field and data words Data[0] to Data[n], each one packet width
wide.]
Fig. 4: The basic WOA activation packet sent between code
sections. The packet width is either the physical bus width
or a nominal bit grouping for a serial link.
[Figure 5 diagram: hardware block 1b (MUX) sends {1a, data} or
{1c, data} activation packets to the CPU side controller
(JMP (1a | 1c)), which activates software block 1a or 1c; blocks
1a and 1c send {1b, data} packets to activate 1b.]
Fig. 5: An illustration of how the WOA allows an if else
statement implemented in hardware (bottom) to activate
different software blocks (top) using WOA Controllers
(labelled JMP and MUX).
Aside from directing WOA activation packets, WOA Controllers
allow for the fine-grained serialisation of access to
sequential components such as CPU cores shared by mul-
tiple parallel tasks. Fine-grained serialisation of sequential
components is required to allow partial utilisation of par-
allelising components for programs that are too large to fit
as a single coarse-grained unit on space constrained parallel
devices like FPGAs as will be demonstrated in Section 4.4.
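The two roles of a WOA Controller described above (directing packets by Target ID and serialising access to a sequential component) can be sketched in Python. The class, its FIFO queue and the even/odd branch condition are hypothetical simplifications and not the authors' implementation; the block names follow Figure 5.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple

Packet = Tuple[str, List[int]]                  # (target id, data words)
Block = Callable[[List[int]], Optional[Packet]]


class WoaController:
    """Queues activations for one component and runs them one at a time."""

    def __init__(self, blocks: Dict[str, Block]) -> None:
        self.blocks = blocks
        self.pending: Deque[Packet] = deque()

    def receive(self, packet: Packet) -> None:
        # Activations from any task wait in arrival order (serialisation).
        self.pending.append(packet)

    def step(self) -> Optional[Packet]:
        # Run the oldest activation; return the packet it emits, if any.
        if not self.pending:
            return None
        target, data = self.pending.popleft()
        return self.blocks[target](data)


executed: List[str] = []


def software_block(name: str) -> Block:
    def run(data: List[int]) -> Optional[Packet]:
        executed.append(name)   # stand-in for the block's computation
        return None
    return run


def block_1b(data: List[int]) -> Optional[Packet]:
    # Data dependent branch: the emitted Target ID selects 1a or 1c.
    return ("1a" if data[0] % 2 == 0 else "1c", data)


cpu = WoaController({"1a": software_block("1a"), "1c": software_block("1c")})
hardware = WoaController({"1b": block_1b})

for value in (4, 7):
    hardware.receive(("1b", [value]))
    cpu.receive(hardware.step())
    cpu.step()

print(executed)  # ['1a', '1c']
```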
3.3. Timing Objective
The WOA execution model associates an activation transfer
with each computation. Thus the total time required for an
isolated coarse-grained task which has been fine-grain as-
signed to multiple heterogeneous components to complete
its execution is given by:
\[ t = \sum_{pl} \mu_{pl} + \sum_{pqlm} c_{pqlm} \tag{6} \]
where $\mu_{pl}$ is the total computation time of code section $p$
on location $l$ and $c_{pqlm}$ is the total WOA communication time
for activations sent from code section $p$ on $l$ to code section
$q$ on $m$.
The fine-grained computation times $\mu_{pl}$ were obtained
using real timing measurements with the 3S hotspot tool [17]
for the complex superscalar out-of-order x86 CPU in this
work. For alternate computational components they were
estimated through equation (7) using published hardware data
[32–34] and other 3S tool measurements for the symbols defined
in Table 5:
\[ \mu_{pl} = \frac{\iota_p \, \phi_{pl} \, \tau_l}{\epsilon_l} \tag{7} \]
Equation (7) is simply the total number of parallel execution
slots a program section would take to execute at a location
(iteration count $\iota_p$ multiplied by the number of parallel
slots per iteration $\phi_{pl}$) multiplied by the time required
for each parallel slot to execute (the cycle time of the component
$\tau_l$ divided by the issue rate $\epsilon_l$).

Hardware Characteristic            Software Characteristic
τl   cycle time                    µpl  execution time
ωl   parallel execution units      φpl  parallel execution slots
εl   execution efficiency          ιp   program code unit iterations
λlm  communication latency         χpq  control flows from p to q
βlm  data bandwidth                ηpq  data flow from p to q
Table 5: Hardware and software characteristic symbols used
in the fine-grained WOA timing equations. The hardware
characteristics were obtained from data sheets [32–34] and
the software characteristics from 3S [17] measurements.
Referring to Figure 4 and Table 5, the WOA communi-
cation times cpqlm are given by the number of control flows
between code sections χpq multiplied by the communication
latency λlm for sending the WOA Target IDs plus the total
number of data bytes ηpq divided by the bandwidth βlm:
\[ c_{pqlm} = \chi_{pq} \lambda_{lm} + \frac{\eta_{pq}}{\beta_{lm}} \tag{8} \]
Equation (8) is general enough to cope with implementations
that wrap even intra-location control flows in a WOA
packet ($l$ and $m$ can be the same) and is still valid for indirect
data transfers with the insight that indirect data requirements
can be encapsulated in WOA packets sent by nodes on the
direct control flow path if required.
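Equations (6) to (8) can be evaluated directly, as in the following Python sketch. The function and parameter names are our own, and the iteration, slot, control flow and data counts in the usage lines are hypothetical; the cycle time, latency and bandwidth are the tightly coupled values of Table 6 expressed in nanoseconds and bytes per nanosecond.

```python
import math
from typing import Iterable


def computation_time(iterations: float, slots_per_iteration: float,
                     cycle_time: float, efficiency: float) -> float:
    """Equation (7): execution slots multiplied by the time per slot."""
    return iterations * slots_per_iteration * cycle_time / efficiency


def communication_time(control_flows: float, data_bytes: float,
                       latency: float, bandwidth: float) -> float:
    """Equation (8): one latency per control flow plus the data transfer time."""
    return control_flows * latency + data_bytes / bandwidth


def task_time(computations: Iterable[float],
              communications: Iterable[float]) -> float:
    """Equation (6): total computation time plus total communication time."""
    return sum(computations) + sum(communications)


# Hypothetical code section: 1000 iterations of 2 parallel slots each on a
# component with a 2.90ns cycle time and 100% execution efficiency.
mu = computation_time(1000, 2, 2.90, 1.0)
# Hypothetical traffic: 10 control flows carrying 138 bytes in total over a
# link with 2.90ns latency and 1.38 bytes/ns (1.38GB/s) bandwidth.
c = communication_time(10, 138, 2.90, 1.38)
print(math.isclose(task_time([mu], [c]), 5800.0 + 129.0))  # True
```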
3.4. Expanding MAP with the Objective
If we denote the set of sequential computational locations
in an architecture with L⊥ and the set of shared communi-
cation channels with M⊥ then equation (6) can be directly
divided into its serial and parallel components and used in
Definition 3.1 to form the Expanded Multi-level Assignment
Problem 3.2 below. Examples of sequential components
and shared communication channels include CPU cores and
shared busses. Examples of parallel components and dedi-
cated communication channels include FPGAs and task ded-
icated internal hardware inter-connects.
In Definition 3.2, equations (12) and (13) are the equation (6)
expansions of $t^i_\parallel$ and $t^i_\perp$ for the parallel and
sequential components; constraint (15) ensures every code segment
is assigned somewhere; constraint (14) forces calls to external
libraries to be instantiated on the reference CPU's partition
and constraint (16) ensures that component space limits
are not violated.
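Definition 3.2. Expanded Robust MAP.
\begin{align}
\min_{x \in X}\quad & t_\parallel + t_\perp \tag{9}\\
\text{s.t.}\quad & t_\parallel \ge t^i_\parallel \quad \forall i \in T \tag{10}\\
& t_\perp = \sum_{i \in T} t^i_\perp \tag{11}\\
& t^i_\parallel = \sum_{p,\; l \notin L_\perp} \mu^i_{pl}\, x^i_{pl}
  + \sum_{p,q,\; lm \notin M_\perp} c^i_{pqlm}\, x^i_{pl}\, x^i_{qm} \tag{12}\\
& t^i_\perp = \sum_{p,\; l \in L_\perp} \mu^i_{pl}\, x^i_{pl}
  + \sum_{p,q,\; lm \in M_\perp} c^i_{pqlm}\, x^i_{pl}\, x^i_{qm} \tag{13}
\end{align}
where $x \in X$ is a feasible assignment of fine-grained code
sections to locations for all coarse-grained tasks $T$ satisfying
the space constraints:
\begin{align}
x^i_{pr} &= 1 \quad \forall p \in E \tag{14}\\
\sum_{l} x^i_{pl} &= 1 \quad \forall p, i \tag{15}\\
\sum_{p,i} \delta^i_{pl}\, x^i_{pl} &\le \Delta_l \quad \forall l \tag{16}
\end{align}
with:
$i$               the task identifier
$p, q$            assignable code sections within a task
$E$               the set of sections calling external code
$l, m$            computational locations
$r$               the reference CPU location (running the OS)
$x^i_{pl}$        1 if $p$ from task $i$ is assigned to $l$, 0 otherwise
$\delta^i_{pl}$   the size of code $p$ from task $i$ on location $l$
$\Delta_l$        the size capacity of location $l$
$L_\perp$         the set of sequential execution locations
$M_\perp$         the set of shared communication links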
The $x^i_{pl} x^i_{qm}$ quadratics can be removed from equations
(12) and (13) in Definition 3.2 using the substitution
$y^i_{pqlm} \in \{0, 1\}$ defined by [30]:
\[ x^i_{pl} + x^i_{qm} - 1 \le 2\, y^i_{pqlm} \le x^i_{pl} + x^i_{qm} \tag{17} \]
to produce a Mixed Integer Linear Program (MILP) solvable
with standard LP solvers such as CPLEX.
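That the substitution of equation (17) reproduces the product of the two binary assignment variables can be checked exhaustively. The short Python sketch below, with a helper name of our own choosing, enumerates every binary combination and confirms that the only feasible value of the substitution variable is the product.

```python
from itertools import product
from typing import List


def feasible_y(x_pl: int, x_qm: int) -> List[int]:
    """Binary y values satisfying x_pl + x_qm - 1 <= 2y <= x_pl + x_qm."""
    return [y for y in (0, 1) if x_pl + x_qm - 1 <= 2 * y <= x_pl + x_qm]


for x_pl, x_qm in product((0, 1), repeat=2):
    # Equation (17) leaves exactly one feasible y: the product x_pl * x_qm.
    assert feasible_y(x_pl, x_qm) == [x_pl * x_qm]
    print(x_pl, x_qm, feasible_y(x_pl, x_qm))
```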
Problem 3.2 is easily seen to be strongly NP-hard with
quadratic communication costs and binary assignment vari-
ables [38–40] and this is perhaps why previous authors have
resorted to assignment heuristics rather than optimal meth-
ods [1, 2, 4–7, 9, 10]. However, the software partitioning
problems considered in this paper indicate that the implicit
assumption of practical intractability embedded in previous
work is not always valid and fine-grained software partition-
ing problems including thousands of nodes can be solved to
optimality using modern solvers such as CPLEX [37] in just
a few seconds.
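For very small instances the structure of Definition 3.2 can be explored by exhaustive enumeration rather than with a MILP solver. The Python sketch below is only an illustration under stated assumptions: the two-location architecture, the section sizes, the computation times and the fixed cost per cross-location control flow are all made up, constraint (14) is omitted because no section calls external code, and brute force is not the solution method used in the paper.

```python
from itertools import product

LOCATIONS = ("cpu", "fpga")
SEQUENTIAL = {"cpu"}                                # L_perp
SHARED_LINKS = {("cpu", "fpga"), ("fpga", "cpu")}   # M_perp
CAPACITY = {"cpu": float("inf"), "fpga": 2}         # Delta_l
LINK_COST = {("cpu", "cpu"): 0.0, ("fpga", "fpga"): 0.0,
             ("cpu", "fpga"): 1.0, ("fpga", "cpu"): 1.0}  # cost per control flow

# task -> section -> (size, {location: computation time}); hypothetical values.
TASKS = {
    "t1": {"a": (1, {"cpu": 4.0, "fpga": 1.0}), "b": (1, {"cpu": 2.0, "fpga": 3.0})},
    "t2": {"a": (1, {"cpu": 5.0, "fpga": 1.0}), "b": (1, {"cpu": 1.0, "fpga": 1.0})},
}
FLOWS = {"t1": {("a", "b"): 1}, "t2": {("a", "b"): 1}}    # control flow counts


def objective(assign):
    """Equations (9)-(13): max of parallel parts plus sum of serial parts."""
    t_par_max, t_seq_sum = 0.0, 0.0
    for task, sections in TASKS.items():
        t_par = t_seq = 0.0
        for p, (_, mu) in sections.items():
            loc = assign[(task, p)]
            if loc in SEQUENTIAL:
                t_seq += mu[loc]
            else:
                t_par += mu[loc]
        for (p, q), flows in FLOWS[task].items():
            link = (assign[(task, p)], assign[(task, q)])
            cost = flows * LINK_COST[link]
            if link in SHARED_LINKS:
                t_seq += cost
            else:
                t_par += cost
        t_par_max = max(t_par_max, t_par)
        t_seq_sum += t_seq
    return t_par_max + t_seq_sum


def fits(assign):
    """Constraint (16): the code placed on each location must fit."""
    used = {loc: 0 for loc in LOCATIONS}
    for (task, p), loc in assign.items():
        used[loc] += TASKS[task][p][0]
    return all(used[loc] <= CAPACITY[loc] for loc in LOCATIONS)


def solve():
    """Enumerate every assignment; constraint (15) holds by construction."""
    keys = [(task, p) for task, sections in TASKS.items() for p in sections]
    best = None
    for choice in product(LOCATIONS, repeat=len(keys)):
        assign = dict(zip(keys, choice))
        if fits(assign):
            value = objective(assign)
            if best is None or value < best[0]:
                best = (value, assign)
    return best


best_value, best_assign = solve()
print(best_value, best_assign)
```

With these made-up numbers the best robust time is 6.0, obtained by moving section a of both tasks to the FPGA. Enumeration grows exponentially with the number of code sections, so it is only useful for checking a solver formulation on toy inputs.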
4. RESULTS
This section provides acceleration results for tasks assigned
to the two component heterogeneous architecture of Fig-
ure 6 over a wide range of relative clock speeds, hardware
size limits and communication characteristics. The tasks as-
signed are processes from the MiBench 1.0 [31] benchmark
suite with software characterisation measurements generated
using the 3S [17] hotspot (µpr and ιp), parallelism
(φpl), callgrind (χpq) and cache flow (ηpq) tools on an
Intel Pentium 4 x86 machine. We start with optimal design-
space reports for isolated tasks in Sections 4.1, 4.2 and 4.3
before presenting results for the parallel assignment of a set
of three of the tasks to shared hardware in Section 4.4 and
examining the practical complexity in Section 4.5.
[Figure 6 diagram. Labels: x86 CPU with WOA Controller; FPGA Coprocessor with WOA Controller; Memory; WOA Packets; Memory Access.]
Fig. 6: The heterogeneous architecture explored.
4.1. High-Level Design-Space Opportunities
The MAP high-level design-space reports quantify acceler-
ation opportunities for available program forms and fixed
points in the architectural design-space. The high-level re-
ports can be used to support investment in a particular ar-
chitecture or to justify a more detailed investigation of the
design-space characteristics of a task set if performance re-
quirements cannot be met with existing hardware.
Figures 7 and 8 compare our optimal program accelera-
tion results for six MiBench benchmarks compiled with gcc
using three different optimisation levels and partitioned for
the two component architecture of Figure 6. Figure 7 is for
the tightly coupled architectural characteristics of Table 6
and Figure 8 for the loosely coupled characteristics.
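
As a rough consistency check on the two design points, the cycle times and the tightly coupled bus bandwidth of Table 6 can be reproduced from the clock rates and bus width quoted in its caption. The sketch below is illustrative only: the function names are hypothetical, and the bandwidth calculation assumes one full-width bus transfer per clock cycle and decimal gigabytes, neither of which is stated in the paper.

```python
def cycle_time_ns(clock_hz: float) -> float:
    """Cycle time in nanoseconds for a given clock frequency in Hz."""
    return 1e9 / clock_hz


def bus_bandwidth_gb_per_s(clock_hz: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s, assuming one full-width transfer per cycle."""
    return clock_hz * (bus_width_bits / 8) / 1e9


SOC_CLOCK_HZ = 345e6      # reconfigurable SoC and FPGA clock from the caption
PENTIUM4_CLOCK_HZ = 2e9   # loosely coupled CPU clock from the caption

print(round(cycle_time_ns(SOC_CLOCK_HZ), 2))               # about 2.90 ns
print(round(cycle_time_ns(PENTIUM4_CLOCK_HZ), 2))          # 0.5 ns
print(round(bus_bandwidth_gb_per_s(SOC_CLOCK_HZ, 32), 2))  # about 1.38 GB/s
```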
                                   Tightly Coupled       Loosely Coupled
Characteristic                     CPU       FPGA        CPU       FPGA
τl   cycle time                    2.90ns    2.90ns      0.5ns     2.90ns
ωl   parallel execution units      4         256         4         256
pl   execution efficiency          100%      100%        100%      100%
∆l   hardware size capacity        ∞         256         ∞         256
λlm  bus latency                   2.90ns                165ns
βlm  data bandwidth                1.38GB/s              3.12GB/s

Table 6: Architectural characteristics for the tightly and
loosely coupled design points discussed in this paper. The
tightly coupled characteristics correspond to a reconfigurable
SoC operating at 345MHz [32] with a 32-bit internal bus and
the loosely coupled characteristics to a 2GHz Pentium 4 coupled
with a 345MHz FPGA through a HyperTransport connection with
the latencies described in [34].

[Figure 7 plot: speed-up (x), axis from 0 to 40, for crc32, jpeg, stringsearch, sha, susan and dijkstra at -O1, -O2 and -O3.]
Fig. 7: Accelerations for MiBench benchmarks compiled
with different gcc compiler optimisation levels and partitioned
for the tightly coupled design point of Table 6.

[Figure 8 plot: speed-up (x), axis from 0 to 7, for crc32, jpeg, stringsearch, sha, susan and dijkstra at -O1, -O2 and -O3.]
Fig. 8: Accelerations for MiBench benchmarks compiled
with different gcc compiler optimisation levels and partitioned
for the loosely coupled design point of Table 6.

The considerable difference between the acceleration potentials
of the benchmarks is a result of two factors: the parallelism
available in the program and the proportion of the program's
computation that can be relocated away from the CPU in
correspondence with Amdahl's law [29]. For example, the crc32
benchmark spends 29.3% of its time in file related system calls
which cannot be relocated to the FPGA, and this limits the
maximum possible speed-up of crc32 to:

    \frac{\text{Reference CPU Cycles}}{\text{Fixed CPU Cycles}} = 3.412 \text{ times}

This value is in fact very close to the actual speed-up of 3.261
times shown for crc32 in Figure 7. The only benchmarks in
Figure 7 which do not come close to their Amdahl's limits are
jpeg and sha, which are both restricted from reaching their
theoretical bounds by the FPGA size constraints of Table 6.
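
The Amdahl bound quoted for crc32 can be approximated from the rounded 29.3% figure alone. The short Python sketch below is a worked illustration with a hypothetical function name; it uses the rounded percentage rather than the exact cycle counts behind the 3.412 value in the text, so the two results agree only to about two decimal places.

```python
def amdahl_limit(fixed_fraction: float) -> float:
    """Upper bound on speed-up when fixed_fraction of the run time cannot move."""
    if not 0.0 < fixed_fraction <= 1.0:
        raise ValueError("fixed_fraction must be in (0, 1]")
    return 1.0 / fixed_fraction


CRC32_FIXED_FRACTION = 0.293   # time spent in non-relocatable system calls
CRC32_ACHIEVED_SPEEDUP = 3.261  # value reported for crc32 in Figure 7

limit = amdahl_limit(CRC32_FIXED_FRACTION)
print(f"Amdahl limit: {limit:.2f}x")
print(f"Fraction of limit achieved: {CRC32_ACHIEVED_SPEEDUP / limit:.1%}")
```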
The impact of the different architectural configurations
and compiler options is clearly visible in the figures. The
tightly coupled design point provides superior acceleration
potential in all cases, and the -O1 compiler option has a
lower acceleration potential than either the -O2 or -O3
options for some benchmarks, supporting the results of [11].
The effect of compiler optimisation on acceleration
performance is governed by the interaction between loop
unrolling and Instruction Level Parallelism (ILP). The loop
unrolling performed by higher compiler optimisation levels
increases the average block size, which reduces the
opportunity for fine-grained acceleration improvements.
However, the issue of increased node size can be counteracted
by increased node parallelism where the unrolled loops of a
benchmark have limited internal data dependence, and this
increases the acceleration potential for the susan and
dijkstra benchmarks.
4.2. Assignment Reports
Assignment reports show the fine-grained assignments for a
set of tasks on a particular architecture. Designers can use
the assignment reports with either automated cross-compilation
tools [21] or manual processes to implement the assignments.
Figure 9 shows the graphical assignment reports for the
MiBench stringsearch benchmark compiled with gcc -O3
for the tightly and loosely coupled architectural characteris-
tics of Table 6. Cross partition control flows are illustrated
as weighted dashed lines crossing the partition boundaries
and number 14800 for the tightly coupled architecture (top)
and 5418 for the loosely coupled architecture (bottom). The
higher-latency (bottom) architecture has 13 extra blocks on
the CPU around the external call code sections.
4.3. Sensitivity Reports
Sensitivity reports plot the acceleration potential of tasks
over a range of architectural configurations and can be used
to identify the sensitivity of a partition to modelling assump-
tions such as hardware size and bus congestion.
Figures 10, 11 and 12 show the effect of reconfigurable
fabric size, inter-component bandwidth and latency on par-
tition performance for the dijkstra, susan and sha tasks
using the tightly coupled characteristics of Table 6 as a basis.
160 APPENDIX E. MULTI-LEVEL ASSIGNMENT JOURNAL PAPER
[Figure 9 diagram: assignment graphs of basic blocks B00 to B39 partitioned between the CPU and the FPGA for the tightly coupled (top) and loosely coupled (bottom) architectures.]
Fig. 9: Assignments for the MiBench stringsearch task compiled with gcc -O3 and partitioned for the tightly (top) and
loosely (bottom) coupled architectural characteristics of Table 6. Double circles are used for the start and end nodes along
with the special node B00 which represents static or externally initialised data. Dashed lines are control flows which may
be composed with data flows (full lines) to make WOA packets. Calls to external code are constrained to be from the CPU
partition and may return data through B00. Nodes and lines are shaded in proportion to their computation and communication
time requirements respectively.
[Figure 10 plot: speed-up (x) against partition size (instructions) for dijkstra, susan and sha.]
Fig. 10: Accelerations for MiBench benchmarks compiled
with gcc -O3 and partitioned for heterogeneous architec-
tures with different hardware sizes.
[Figure 11 plot: speed-up (x), axis from 0 to 40, against bandwidth (MB/s), axis from 0.1 to 10000, for dijkstra, susan and sha.]
Fig. 11: Accelerations for MiBench benchmarks compiled
with gcc -O3 and partitioned for heterogeneous architec-
tures with different bandwidths.
[Figure 12 plot: speed-up (x), axis from 0 to 40, against latency (ns), axis from 1 to 1000000, for dijkstra, susan and sha.]
Fig. 12: Accelerations for MiBench benchmarks compiled
with gcc -O3 and partitioned for heterogeneous architec-
tures with different latencies.
From Figure 10 a designer could conclude that hardware
sizes greater than 256 instruction equivalents are required to
exploit the full potential of sha. From Figure 11 a designer
can see that sha and susan are both bandwidth sensitive
and from Figure 12 a designer could justify a focus on rel-
ative heterogeneous component placement and buffering to
reduce latency when attempting to accelerate either the sha
or dijkstra tasks.
4.4. Parallel Acceleration Reports
This section presents results for the Multi-level Assignment
Problem (MAP) applied to three tasks wishing to share the
hardware resources of Figure 6 in parallel. The three tasks
are the MiBench dijkstra, susan and sha benchmarks
compiled with gcc -O3 considered for sensitivity analysis
in Section 4.3 and the shared hardware configuration is the
tightly coupled architecture of Table 6 with an FPGA size of
256 instruction equivalents.
Table 7 gives the assignment timings. The Traditional
column lists the execution times for traditional coarse-grained
task assignments on the architecture. The Isolated column
lists the isolated optimal fine-grained assignment results where
each task is allowed to use the entire FPGA and corresponds
to the results presented in Figure 7. The Optimistic and Ro-
bust columns list the execution timing results for the three
tasks partitioned to share the limited FPGA resources using
the optimistic (see Section 3.1) and robust versions of the
MAP optimisation problem presented in Definition 3.2.
Task         Traditional   Isolated   Optimistic   Robust
dijkstra     15.049        0.4295     0.4306       0.4349
susan        9.1977        0.3385     0.4075       0.3406
sha          3.4849        0.1892     0.2691       0.2067
Best Time    27.732        0.9572     0.4306       0.5442
Table 7: Execution times for three MiBench benchmarks
assigned using traditional coarse-grained parallel methods
(Traditional), optimally assigned as isolated tasks (Isolated)
and MAP assigned in parallel to share a heterogeneous
tightly coupled SoC architecture optimistically (Optimistic)
and robustly with sequential cross-component communica-
tions (Robust). The Best Time row is the serial sum of the
task times for the Traditional and Isolated approaches, the
maximum of the three shared task times for the Optimistic
approach and a combination of 0.2193 seconds serial and
0.3248 seconds maximum parallel execution for the Robust.
All of the tasks are too large to fit on the FPGA as an
entire unit using coarse-grained assignment with dijkstra
being the smallest of the three tasks at 294 instruction equiv-
alents. Consequently the best achievable traditional coarse-
grained parallel assignment time is limited by the serial ex-
ecutions of the tasks on the CPU and cannot be less than
27.73 seconds.
Using the multi-level method, on the other hand, allows
the tasks to be split at the basic block level and assigned
to share the resources. The Robust assignments of Table 7
share the reconfigurable hardware 26% to dijkstra, 28%
to susan and 46% to sha and deliver an execution time
of between 0.4349 and 0.5442 seconds, depending on how
much conflict serialisation is actually required by the CPU's
WOA Controller at run-time. Thus, compared to the traditional
parallel approach, the MAP parallel assignments offer
between 51 and 64 times better performance.
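
The speed-up range follows directly from the figures in Table 7. The Python sketch below is a worked illustration with hypothetical variable names: it takes the Traditional time as the serial sum of the three task times, the best Robust time as the maximum of the Robust column, and the worst Robust time as the 0.5442 second Best Time entry of the table.

```python
# Execution times in seconds, copied from Table 7.
traditional = {"dijkstra": 15.049, "susan": 9.1977, "sha": 3.4849}
robust = {"dijkstra": 0.4349, "susan": 0.3406, "sha": 0.2067}
ROBUST_WORST_CASE = 0.5442  # Best Time entry for the Robust column

# Coarse-grained tasks do not fit on the FPGA, so they run serially on the CPU.
traditional_total = sum(traditional.values())

# With no conflict serialisation the tasks overlap and the slowest one dominates.
robust_best_case = max(robust.values())

speedup_low = traditional_total / ROBUST_WORST_CASE
speedup_high = traditional_total / robust_best_case
print(f"Traditional total: {traditional_total:.3f} s")
print(f"Speed-up range: {speedup_low:.1f}x to {speedup_high:.1f}x")
```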
4.5. Practical Complexity
Table 8 presents the CPLEX 11 solve times for the tightly
coupled (TC) and loosely coupled (LC) results of Figures 7
and 8. The tightly coupled partitions have lower
cross-partition communication costs and are closer to the
pseudo-polynomial time Knapsack problem than the loosely
coupled assignments, for which quadratic communication costs
are more significant, and this distinction is reflected in
the CPLEX solution timings. However, despite the more
significant communication costs in the loosely coupled
problems, the only problem where CPLEX had to resort to
Branch and Bound was the jpeg partition compiled with
gcc -O3, where the 1792 basic block problem was solved in
3.16 seconds using 12 Branch and Bound tree nodes.
                TC CPLEX Times (s)      LC CPLEX Times (s)
Benchmark       O1      O2      O3      O1      O2      O3
crc32           0.00    0.00    0.00    0.00    0.00    0.00
jpeg            0.73    0.64    0.50    2.59    1.46    3.16
stringsearch    0.01    0.01    0.01    0.01    0.01    0.01
sha             0.02    0.04    0.03    0.02    0.04    0.02
susan           0.03    0.03    0.03    0.04    0.04    0.03
dijkstra        0.01    0.01    0.01    0.01    0.01    0.02
Table 8: The solution times in seconds from the CPLEX
log for Figures 7 (TC) and 8 (LC). All timing measurements
were made with a quad-thread 64 bit version of CPLEX 11
using the default settings on the Imperial orion server which
has two Dual Core AMD Opteron 275 CPUs operating at
2.2GHz and 4GB of memory. The benchmark sizes were:
22, 23 and 23 blocks for crc32; 1611, 1640 and 1792 for
jpeg; 43, 45 and 38 for stringsearch; 72, 70 and 67 for
sha; 153, 153 and 148 for susan and 69, 68 and 116 blocks
for dijkstra when compiled with gcc -O1, -O2 and -O3.
The Assignment report of Figure 9 and the Sensitivity
reports of Figures 10, 11 and 12 were all created in less than
0.1 seconds per design point and did not require CPLEX
to enter a Branch and Bound solution phase. The solu-
tion times for the three task Isolated, Optimistic and Robust
problems of Section 4.4 were 0.07, 0.58 and 0.11 seconds re-
spectively with the Optimistic MAP problem being the only
version for which CPLEX needed to use Branch and Bound
with a tree size of 17 nodes.
5. EVALUATION AND FUTURE WORK
The MAP assignment results of Section 4.4 were between
51 and 64 times faster than those of traditional
coarse-grained partitioning methods because of the
multi-level assignment approach presented in this paper.
The exact speed-up actually achievable would depend on the
amount of shared component or bus contention seen in the
actual task executions, and this would require exact
fine-grained sequence information to quantify analytically.
Unfortunately, fine-grained sequence information is
routinely compressed out of the Control Data Flow Graphs
used in fine-grained partitioning because of trace length
issues, and so a designer would be forced to perform
sensitivity analysis on the Optimistic and Robust partitions
to decide between them using the MAP method as presented.
Future work could address the issue of sequence compression
using the minimax uncertain path method presented by
the authors in [36] to further improve the optimisation
results and account for feasible cross-task sharing conflicts.
The combined time required to generate all the reports
of this paper was less than 15 seconds using CPLEX 11.
The results explore a design space ranging from tightly cou-
pled SoCs to loosely coupled distributed architectures and
six full program benchmarks with up to 1792 basic blocks
were partitioned. Single design point solution times ranged
from less than 10 ms to a maximum of 3.16 seconds.
While the practical complexity results are promising, the
validity of previous heuristic methods [1, 2, 4–7, 9, 10]
cannot be questioned without further analysis. To assess the
practical tractability of the formal method, solution times
could be generated for either an exhaustive set of real
software benchmarks or benchmark characteristics abstracted
so that problem graphs could be automatically generated and
z-tests performed.
The fine-grained computation and communication information
required by the WOA timing equations can be obtained
using standard software characterisation frameworks
like 3S and hardware data sheets as in this paper. However,
where computational models are not sufficient,
inter-component communication times could be sampled, a set
of equivalent fine-grained implementations constructed [9] or
a standard database of translations [10] used to provide
alternatives to the abstracting equations of Section 3.3.
A critical assumption of the WOA is that the data used
by a target code segment can be predicted by its activator.
This assumption is certainly true for a wide range of com-
putationally intense programs such as sha, but may be in-
valid for data dependent path programs such as lz77. To
address the issue of data predictability, future work could
implement data predictability tools [41] for 3S and modify
the WOA timing equations to include context synchronisa-
tion and Shared Memory [6,7] communication costs for pro-
gram sections where run-time data prediction issues exist.
6. CONCLUSION
This paper presented the Multi-level Assignment Problem
(MAP) which combines coarse-grained task partitioning
with sequential fine-grained assignment. The paper con-
tributed:
1. a formal mathematical model that can be solved to
robustly partition parallel code using standard solvers
such as CPLEX.
2. a new execution model that abstracts away architec-
tural differences and reduces execution communica-
tion costs by up to a factor of five.
3. solution timing information that demonstrates the prac-
tical tractability of the optimal model for a range of
benchmarks and architectures.
In Section 4, the Write-Only Architecture (WOA) ex-
ecution model was used with MAP to explore the optimal
design-space of six benchmarks partitioned for execution
on heterogeneous architectures consisting of Intel x86 Pen-
tium processors and reconfigurable FPGA components with
a range of speed, size and communication characteristics.
The design-space results demonstrated the practical tractabil-
ity of optimal approaches to the software assignment prob-
lem and identified important sensitivity information for the
benchmarks including that the sha benchmark is size sensi-
tive, the dijkstra benchmark is bandwidth insensitive and
that the susan benchmark is relatively latency insensitive.
The single task design-space results were followed by
accelerations for a three task set sharing a CPU and FPGA
accelerator for parallel execution. The robust MAP paral-
lel assignment results were between 51 and 64 times higher
than the accelerations that could be achieved using tradi-
tional coarse-grained parallel task partitioning methods with
the FPGA hardware shared 26% to dijkstra, 28% to susan
and 46% to sha.
Several areas of future work were discussed in Section 5
including the integration of other work to improve the MAP
acceleration results further [36], the use of MAP to evaluate
the practical complexity of optimal approaches applied to
other benchmarks and the creation of automatic WOA program
classification tools in 3S. However, even without this
work, MAP represents a significant set of tools and
abstractions that allows the immediate evaluation of
heterogeneous acceleration potentials for sets of parallel
tasks on a wide range of architectures.
7. ACKNOWLEDGEMENT
The support of the UK EPSRC is gratefully acknowledged.
Dr O. Mencer, Dr D. Thomas, Dr T. Todman, Dr K.H. Tsoi
and Dr Y.M. Lam provided valuable contributions.
8. REFERENCES
[1] T. Wiangtong, P.Y.K. Cheung, W. Luk. Hard-
ware/Software Codesign: A Systematic Approach Tar-
geting Data-Intensive Applications. IEEE Signal Pro-
cessing, pp. 14–22 (2005).
[2] nVidia Corporation. nVidia CUDA Programming Guide
(2009).
[3] Q. Liu, G.A. Constantinides, P.Y.K. Cheung. Com-
bining Data Reuse With Data-Level Parallelization
for FPGA-Targeted Hardware Compilation: A Geo-
metric Programming Framework. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 28, No. 3, pp. 305–315 (2009).
[4] Y.M. Lam, J.G.F. Coutinho, W. Luk, P.H.W. Leong.
Optimising Multi-loop Programs for Heterogeneous
Computing Systems. IEEE Southern Programmable
Logic Conference (2009).
[5] G. Stitt, F. Vahid. Hardware/Software Partitioning of
Software Binaries. IEEE/ACM International Confer-
ence on Computer Aided Design, pp. 164–170 (2002).
[6] R. Lysecky, F. Vahid. A Configurable Logic Archi-
tecture for Dynamic Hardware/Software Partitioning.
Design Automation and Test in Europe Conference
(DATE), pp. 480–485 (2004).
[7] G. Stitt, F. Vahid. A Decompilation Approach to
Partitioning Software for Microprocessor/FPGA Platforms.
Design Automation and Test in Europe (DATE), pp.
396–397 (2005).
[8] K. Atasu, C. Özturan, G. Dündar, O. Mencer, W. Luk.
CHIPS: Custom Hardware Instruction Processor Syn-
thesis. IEEE Trans. on Computer-Aided Design of In-
tegrated Circuits and Systems, Vol. 27, No. 3, pp. 528–
541 (2008).
[9] P. Eles, Z. Peng, A. Kuchcinski, A. Doboli. System
Level Hardware/Software Partitioning Based on Sim-
ulated Annealing and Tabu Search. Journal on Design
Automation for Embedded Systems, Vol. 2, pp. 5–32
(1997).
[10] Y.M. Lam, J.G.F. Coutinho, W. Luk, P.H.W. Leong.
Integrated Hardware/Software Codesign for Heteroge-
neous Computing Systems. IEEE Southern Confer-
ence on Programmable Logic, pp. 217–220 (2008).
[11] G. Stitt, F. Vahid. New Decompilation Techniques
for Binary-level Co-processor Generation. IEEE/ACM
International Conference on Computer-Aided Design
(ICCAD), pp. 547–554 (2005).
[12] R. Lysecky, G. Stitt, F. Vahid. Warp Processors. ACM
Transactions on Design Automation of Electronic Sys-
tems (TODAES), pp. 659–681 (2006).
[13] R. Wilson et al. SUIF: An Infrastructure for Research
on Parallelizing and Optimizing Compilers. ACM SIGPLAN
Notices, Vol. 29, No. 12 (1994).
[14] D.J. Pearce, P.H.J. Kelly, T. Field, U. Harder. GILK:
A Dynamic Instrumentation Tool for the Linux Ker-
nel. 12th Int. Conf. on Computer Performance Evalua-
tion, Modelling Techniques and Tools 37, pp. 220–226
(2002).
[15] N. Nethercote, J. Seward. Valgrind: A Program Super-
vision Framework. 3rd Workshop on Runtime Verifica-
tion (2003).
[16] C. Luk et al. Pin: Building Customized Program
Analysis Tools with Dynamic Instrumentation. 2005
ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation (2005).
[17] S.A. Spacey. 3S: Program Instrumentation and Characterisation
Framework. Imperial Technical Paper
(http://www.doc.ic.ac.uk/research/technicalreports/2008/DTR08-1.pdf, 2008).
[18] S. Sirowy, Y. Wu, S. Lonardi, F. Vahid. Two Level
Microprocessor-Accelerator Partitioning. Design Au-
tomation and Test in Europe (DATE), pp. 313–318
(2007).
[19] G.C. Hunt, M.L. Scott. The Coign Automatic Dis-
tributed Partitioning System. 3rd Symposium on Op-
erating Systems Design and Implementation, New Or-
leans, Louisiana, pp. 45–52 (1999).
[20] M.W. Hall, J.M. Anderson, S.P. Amarasinghe, B.R.
Murphy, S. Liao, E. Bugnion, M.S. Lam. Maximizing
Multiprocessor Performance with the SUIF Compiler.
IEEE Computer (A Special Issue on Multiprocessors)
(1996).
[21] Altera Corp. Automated Generation of Hard-
ware Accelerators with Direct Memory Access from
ANSI/ISO Standard C Functions. Corporate White Pa-
per (2006).
[22] O. Mencer, D.J. Pearce, L.W. Howes, W. Luk. Design-Space
Exploration with A Stream Compiler. IEEE International
Conference on Field Prog. Tech. (2003).
[23] Celoxica Ltd. Handel-C Language Reference Manual
(2004).
[24] P.M. Hahn, B. Kim, M. Guignard, J.M. Smith, Y. Zhu.
An Algorithm for the Generalized Quadratic Assign-
ment Problem. Computational Optimization and Ap-
plications, pp. 351–372, (2008).
[25] S. Martello, P. Toth. Knapsack Problems: Algorithms
and Computer Implementations. John Wiley & Sons
Ltd. (1990).
[26] M. Girkar, C.D. Polychronopoulos. Automatic Ex-
traction of Functional Parallelism from Ordinary Pro-
grams. IEEE Trans. on Parallel and Distributed Sys-
tems, Vol. 3, No. 2, pp. 166–178 (1992).
[27] W.R. Stevens. TCP/IP Illustrated, Volume 1: The Pro-
tocols. Addison-Wesley Professional Computing Series
(1994).
[28] Sun Microsystems Inc. RPC: Remote Procedure Call
Protocol Specification. IETF RFC1057
(http://www.ietf.org/rfc/rfc1057.txt, 1988).
[29] J.L. Hennessy, D.A. Patterson. Computer Architec-
ture: A Quantitative Approach. Morgan Kaufmann
(2002).
[30] S.A. Spacey. Concise CPLEX. Imperial Technical Paper
(http://www.doc.ic.ac.uk/research/technicalreports/2009/DTR09-7.pdf, 2009).
[31] M.R. Guthaus et al. MiBench: A Free, Commercially
Representative Embedded Benchmark Suite. IEEE
4th Annual Workshop on Workload Characterization
(2001).
[32] Stretch Inc. The S6000 Family of Processors: Archi-
tecture White Paper (2007).
[33] Xilinx Inc. Virtex-5 Family Overview: Advanced
Product Specification (2008).
[34] B. Holden. Latency Comparison Between Hyper-
Transport and PCI-Express in Communications Sys-
tems. HyperTransport Consortium (2006).
[35] S.A. Spacey, W. Luk, P.H.J. Kelly, D. Kuhn.
Rapid Design-Space Visualisation through Hard-
ware/Software Partitioning. IEEE Southern Pro-
grammable Logic Conference (2009).
[36] S.A. Spacey, W. Wiesemann, D. Kuhn, W. Luk. Ro-
bust Software Partitioning with Multiple Instantiation.
Awaiting Publication (2009).
[37] ILOG Inc. CPLEX 11.2 Manuals (2008).
[38] S. Sahni, T. Gonzalez. P-complete Approximation
Problems. Journal of the ACM (JACM), Vol. 23, pp.
555–565 (1976).
[39] P.M. Hahn, B. Kim, M. Guignard, J.M. Smith, Y. Zhu.
An Algorithm for the Generalized Quadratic Assignment
Problem. Computational Optimization and Applications,
pp. 351–372 (2008).
[40] S.A. Spacey. Computational Partitioning for Hetero-
geneous Systems. Imperial Ph.D. Thesis (2009).
[41] S. Oh, T.G. Kim, E. Bozorgzadeh. Speculative Loop-
Pipelining in Binary Translation for Hardware Accel-
eration. IEEE Transactions on Computer-Aided De-
sign of Integrated Circuits and Systems, pp. 409–422
(2008).
Appendix F
Robust Optimisation Journal
Paper
Robust Software Partitioning with Multiple
Instantiation
Simon A. Spacey, Wolfram Wiesemann, Daniel Kuhn and Wayne Luk
October 6, 2009
Abstract
The purpose of software partitioning is to assign code segments of a given
computer program to a range of execution locations such as general purpose
processors or specialist hardware components. These execution locations differ
in speed, communication characteristics, and in size. In particular, hardware
components offering high speed tend to accommodate only a few code segments.
The goal of software partitioning is to find an assignment of code segments
to execution locations that minimizes the overall program run time and re-
spects the size constraints. In this paper we demonstrate that an additional
speedup is obtained if we allow code segments to be instantiated on more than
one location. We further show that the program run time not only depends
on the execution frequency of the code segments but also on their execution
order if there are multiply instantiated code segments. Unlike frequency infor-
mation, however, sequence information is not available at the time when the
software partition is selected. This motivates us to formulate the software par-
titioning problem as a robust optimization problem with decision-dependent
uncertainty. We transform this problem to a mixed-integer linear program of
moderate size and report on promising numerical results.
Keywords. Robust optimization, software partitioning, decision-dependent
uncertainty, multiple instance partitioning.
1 Introduction
We consider a computer program that must be executed quickly and frequently over
a long (maybe indefinite) lifetime. Such programs arise in cryptography [1], digital
signal processing [2], computer vision [3], video image processing [4], database pro-
cessing [5], network analysis [6] and on-line commercial services [7]. It is assumed
that the program consists of several indivisible building blocks or code segments.
The overall execution time of the program is the time needed to execute the indi-
vidual code segments and the time needed to exchange information between code
segments which are executed in direct succession. These contributions to the run
time will be referred to as the execution costs and communication costs, respectively,
and may depend on the characteristics of the execution location where each code
segment is run. Examples of execution locations are central processing units, graph-
ical processing units, or specialist hardware components such as field-programmable
gate arrays. The goal of software partitioning is to find an assignment of code seg-
ments to execution locations that results in the fastest total program execution
while respecting the size constraints of specialist hardware components. More pre-
cisely, we seek an assignment which ensures that the underlying program runs fast
on average for a broad range of possible input parameters.
The necessity for software partitioning has been recognised since the early six-
ties, when manual partitioning methods based on the Fixed Plus Variable system
were proposed [8]. By the early nineties, the Cosyma and Vulcan systems [9, 10]
began to apply semi-automated approaches to software partitioning. While Cosyma
relies on a simulated annealing heuristic to assign code segments to execution lo-
cations, Vulcan employs a greedy approach. In the following years, a plethora of
heuristic solution procedures for software partitioning were studied including tabu
search [11], genetic algorithms [12,13], particle swarm algorithms [14] and ant colony
optimization [15]. Recent exact solution approaches for various forms of the soft-
ware partitioning problem rely on dynamic programming [16–19] and mixed-integer
linear programming [20–24]. A survey on software partitioning and related areas is
provided in [25].
For the purpose of software partitioning, a program’s execution is described ex-
haustively by the execution sequence or execution trace, that is, the sequence in
which the program’s code segments are executed. In typical programs, some code
segments are called very often as part of nested loops. This renders the execution
trace information too large to be stored at the fine-grained (program basic block or
subroutine) level required for optimal software partitioning [24]. Moreover, execu-
tion traces often depend on input data implying that there may be infinitely many
possible traces that have to be considered for some programs.
As it is impractical or impossible to manipulate large execution traces, software
engineers tend to deal with control flow graphs (CFGs) instead. CFGs compress out
sequence information from traces and retain only a program’s calling frequencies,
that is, the frequencies with which the code segments call each other. It is relatively
easy to predict average calling frequencies over a large number of program runs with
statistically independent input data. Hence, it is reasonable to assume that this type
of frequency information is available at the time when the software partition has to
be selected.
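
The compression of a trace into calling frequencies can be illustrated with a small Python sketch. The segment names and the two traces below are hypothetical examples, not taken from the benchmarks of this paper; the sketch shows that two different execution orders can yield identical calling frequencies, which is exactly the sequence information a CFG discards.

```python
from collections import Counter


def calling_frequencies(trace):
    """Count how often each code segment is directly followed by another."""
    return Counter(zip(trace, trace[1:]))


# Two hypothetical traces over code segments A, B and C.
trace_1 = ["A", "C", "A", "B", "C", "B"]
trace_2 = ["A", "C", "B", "C", "A", "B"]

# The execution orders differ, but the CFG edge weights are the same.
assert trace_1 != trace_2
assert calling_frequencies(trace_1) == calling_frequencies(trace_2)
print(dict(calling_frequencies(trace_1)))
```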
Frequency information is sufficient to solve the software partitioning problem
optimally if each code segment is assigned to a single location [20,22,24]. However,
in this work we seek to obtain an additional speedup by assigning code segments
to more than one execution location. In this multiple instantiation setting, infor-
mation about the program’s execution sequence is required in order to solve the
partitioning problem optimally. The motivation for considering multiple instantia-
tion is that in software partitioning, as with most distributed process optimization
problems, communications on the same location incur substantially smaller costs
than those between different locations. Therefore, it can be beneficial to instantiate
frequently visited code segments at more than one location, just as a manufacturer
would naturally consider installing the same machine at multiple locations to re-
duce transportation costs. Previous exact approaches to multiple instantiation have
assumed complete knowledge of the program’s execution sequence [17, 21, 23]. By
the above discussion, however, this assumption restricts the applicability of these
approaches to programs with short execution traces.
In this paper we propose a novel approach to software partitioning which does
not require sequence information but still allows for multiple instantiation of code
segments. Since sequence information is absent, we formulate the multiple instance
partitioning problem as a robust optimization problem which minimizes the worst-
case run time over all execution sequences consistent with known CFG calling fre-
quencies. After applying a dimensionality reduction mechanism, we end up with a
robust optimization model with integer decisions and a decision-dependent uncer-
tainty set. We reformulate the resulting model as an equivalent mixed integer linear
program (MILP) which we solve with off-the-shelf optimization software.
Although our formulation only requires information about the calling frequen-
cies, its objective function (the worst-case run time) is determined by the location-
aware execution traces that are consistent with the given calling frequencies. A
location-aware execution trace collects full information about the order of the code
segment/location pairs visited during program execution, and it is crucially influ-
enced by the adopted calling convention. Indeed, whenever a specific code segment
must be executed that has been instantiated on several locations, the calling con-
vention decides which of its twins is called. While it is straightforward to construct
an optimisation model that determines the optimal calling convention when the
whole program trace is known, it becomes a major challenge to derive an optimal
calling convention if only calling frequencies are available. To ensure computational
tractability, we adopt a problem independent greedy calling convention in this work.
Our robust software partitioning problem bears some similarity to the general-
ized quadratic assignment problem (GQAP) [26]. The GQAP considers a number of
entities with given sizes that need to be assigned to locations with size constraints.
Every pair of entities gives rise to communication and assignment costs that depend
on the locations the entities are assigned to, and the goal is to minimize the overall
costs. The software partitioning problem introduced in this paper can be regarded
as a multiple instance generalization of GQAP, which is in itself a generalization of
the NP-hard quadratic assignment problem [27].
We evaluate our approach on a real software benchmark for execution on an
architecture with three execution locations. Although the program is fairly small
with an execution trace of 2,946 entries for 23 assignable code segments, it is already
too large for traditional exact software partitioning approaches. Our numerical
experiments demonstrate that the robust multiple instantiation approach may have
significant advantages over both an optimistic multiple instantiation model and
a single instance GQAP approach in simulated timings for the benchmark’s real
execution trace.
The remainder of this paper proceeds as follows. In Section 2 we study the
software partitioning problem under the assumption that the program’s execution
trace is precisely known. The problem then reduces to an integer quadratic program
that optimizes over all admissible partitions and location-aware execution traces.
In Section 3 we argue that, in reality, only frequency information is available at the
time when the partition is selected, and that actual sequence information is revealed
gradually during program execution. In this more realistic setting, the software par-
titioning problem reduces to a robust multistage optimization problem with integer
recourse and is therefore severely computationally intractable. Section 4 suggests
the use of a greedy calling convention and a problem reformulation in terms of
location-aware control flows to improve computational tractability. These simplifi-
cations lead to a robust single-stage optimization problem with decision-dependent
uncertainty. In Section 5 we demonstrate that this robust problem can be reformu-
lated as an equivalent MILP, and in Section 6 we discuss how the approximations
of Section 4 can be refined. We provide numerical results in Section 7 and conclude
in Section 8.
Notation An arc-weighted directed graph is called (strongly) connected if there
is a directed path between any two vertices in the graph, and each arc on this
connecting path has strictly positive weight. We call a node in an arc-weighted
graph isolated if it is only incident to arcs of weight zero. We define B := {0, 1}.
170 APPENDIX F. ROBUST OPTIMISATION JOURNAL PAPER
2 Software Partitioning under Complete Information
We consider a computer program given by a set of code segments V := {1, . . . , V }
and an execution trace τ := {τv}v∈V , where each τv represents a function from
T := {1, . . . , T} to B. By definition, τv(t) := 1 if code segment v is executed at time
t, and τv(t) := 0 otherwise. In the following, we assume that the code segments are
executed sequentially, that is, we assume that τ satisfies
\[
\sum_{v \in \mathcal{V}} \tau_v(t) = 1 \quad \forall\, t \in \mathbb{T}. \tag{2.1}
\]
For situations where code segments may be executed in parallel, one can apply
the method presented in [24] to obtain a sequential description of the program
which satisfies (2.1). We say that segment v calls segment w at time t if τv(t) =
τw(t+ 1) = 1. Without loss of generality, we assume that the program starts with
code segment 1 and returns to this first code segment at time T +1, that is, we set
τ_v(T + 1) := τ_v(1) := 1 for v = 1; := 0 for v ≠ 1.
A program’s execution trace typically depends on the input data, which itself
differs with every execution. Since we are interested in the long term program
performance over all future inputs, we will not consider a single execution trace.
Instead, from now on we assume that τ represents the concatenation of all future
execution traces. In this section, we thus assume that complete information about
the composite execution trace is available at the time the software partition is
selected. This assumption will be relaxed in later sections.
In the following we assume that there is a finite set of possible execution locations
L := {1, . . . , L}. The size of location l is given by Sl. We assume that code
segment v has size svl on location l. Note that we use an abstract notion of ‘size’
which accounts for heterogeneous resource types such as program and data memory
requirements as well as physical area for logical gates. Nevertheless, all of the
following models extend to multidimensional resource measures in a straightforward
way. A software partition can formally be represented as a matrix of binary variables
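x ∈ B^{V×L} with the interpretation that xvl = 1 if and only if segment v is assigned
to location l. Partition x is feasible if it is an element of the set
\[
X := \Big\{ x \in \mathbb{B}^{V \times L} :
\sum_{l \in \mathcal{L}} x_{vl} \ge 1 \;\; \forall v \in \mathcal{V}, \quad
\sum_{v \in \mathcal{V}} s_{vl}\, x_{vl} \le S_l \;\; \forall l \in \mathcal{L}, \quad
x_{1l} = I_{l=1} \Big\},
\]
where I_{l=1} := 1 if l = 1; := 0 otherwise. The first group of constraints in the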
definition of X requires that each code segment is assigned to at least one location.
This guarantees that the program can be executed without errors. The second
constraint group enforces the size restrictions on the different locations. Note that
we explicitly allow for multiple instantiation of individual code segments. The last
constraint requires that the first code segment (with index 1) is instantiated only
on location 1. This can always be enforced by introducing a virtual dummy code
segment and/or location. For later reference, we also introduce the set X1 ⊂ X
which only allows for single instantiation. Thus, X1 is obtained by replacing the
inequality in the first constraint group of X by an equality.
If an instance of segment v on location l calls an instance of segment w on
location m anytime during program execution, then a calling cost cvwlm is incurred.
This cost represents a latency and accounts for delays due to execution of segment
v as well as communication between segments v and w. If the instances of v and
w reside on different locations, then the communication costs are typically high.
Relatively low communication costs arise if the instances of v and w occupy the
same location. After data transfer, the execution costs depend solely on the location
l where the code segment is executed. In practice it is often impossible to assign all
code segments to the location where they incur their lowest execution costs because
of hardware size constraints.
Given an assignment x ∈ X1 subject to single instantiation of the code segments,
the overall execution time of the program amounts to
\[
c_1(x) := \sum_{t \in \mathbb{T}} \sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}}
c_{vwlm}\, x_{vl}\, \tau_v(t)\, x_{wm}\, \tau_w(t+1).
\]
Recall that the program was assumed to return to the first code segment after
termination, and observe that the costs associated with this call can be set to zero
if necessary. If only single instantiation is allowed, the best software partition is
thus found by solving the integer quadratic program
\[
\min_{x \in X_1} \; c_1(x). \tag{P1}
\]
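As a minimal sketch of how c1(x) and problem P1 can be evaluated on a toy instance, the following Python enumerates all single-instance assignments by brute force. The function names, the 0-based indexing (segment 0 and location 0 play the roles of segment 1 and location 1) and the toy cost data are illustrative assumptions rather than part of the model above; exhaustive enumeration is only viable for very small V and L.

```python
from itertools import product

def single_instance_cost(trace, loc_of, cost):
    """Evaluate c1(x) for a trace given as a list of segment indices.

    loc_of[v] is the single location of segment v and cost[v][w][l][m] is the
    calling cost; the last entry of the trace calls back to trace[0].
    """
    T = len(trace)
    total = 0
    for t in range(T):
        v, w = trace[t], trace[(t + 1) % T]
        total += cost[v][w][loc_of[v]][loc_of[w]]
    return total

def brute_force_p1(trace, n_seg, n_loc, cost, size, capacity):
    """Enumerate X1 and return (best cost, best assignment), or None."""
    best = None
    for rest in product(range(n_loc), repeat=n_seg - 1):
        loc_of = (0,) + rest  # segment 0 is pinned to location 0
        used = [0] * n_loc
        for v, l in enumerate(loc_of):
            used[l] += size[v][l]
        if any(used[l] > capacity[l] for l in range(n_loc)):
            continue  # violates a size constraint
        c = single_instance_cost(trace, loc_of, cost)
        if best is None or c < best[0]:
            best = (c, loc_of)
    return best

# Hypothetical toy data: 3 segments, 2 locations.
# cost = execution time of the caller + a transfer penalty across locations.
exec_time = [[1, 1], [4, 1], [3, 1]]
comm = 1
cost = [[[[exec_time[v][l] + (0 if l == m else comm) for m in range(2)]
          for l in range(2)] for w in range(3)] for v in range(3)]
size = [[1, 1], [1, 1], [1, 1]]
capacity = [3, 1]          # location 1 can host only one segment
trace = [0, 1, 2, 1, 2]    # cyclic: the last entry calls segment 0 again

result = brute_force_p1(trace, 3, 2, cost, size, capacity)
print(result)
```

In this toy instance the size constraint on location 1 prevents both fast segments from being placed there, which is the situation described in the preceding paragraphs.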
Remark 2.1. Note that problem P1 encapsulates the quadratic assignment problem
as a special case. To see this, set V = L = {1, . . . , V }, S_l = 1, s_{vl} = 1, T = V^2 + 1,
and let the trace τ describe an Eulerian cycle in the complete directed graph
(V, V × V) with self-loops for all nodes. Recall that an Eulerian cycle in a graph is a
cycle that traverses each arc exactly once. Every complete directed graph possesses
an Eulerian cycle [28]. We thus have ∑_{t∈T} τ_v(t) τ_w(t + 1) = 1 for all v, w ∈ V, and
problem P1 reduces to
\[
\begin{aligned}
\min_{x \in \mathbb{B}^{V^2}} \quad & \sum_{v,w,l,m=1}^{V} c_{vwlm}\, x_{vl}\, x_{wm} \\
\text{s.t.} \quad & \sum_{l=1}^{V} x_{vl} = 1 \quad \forall\, v = 1, \dots, V, \\
& \sum_{v=1}^{V} x_{vl} \le 1 \quad \forall\, l = 1, \dots, V, \\
& x_{11} = 1.
\end{aligned}
\tag{2.2}
\]
Note that since x is a binary square matrix, all inequalities in this problem can be
replaced by equalities. Thus, (2.2) is readily recognizable as a variant of the quadratic
assignment problem. It is well known that the quadratic assignment problem is
strongly NP-hard. The software partitioning problem P1 and its generalizations
to be developed below are thus also strongly NP-hard, that is, unless P = NP they
allow for no polynomial-time solution or approximation scheme [29].
The situation is further complicated if multiple instantiation of code segments
is admissible. To see this, assume that at time t the program executes an instance
of segment v on location l (note that there may be other instances of v on locations
l′ ≠ l). Moreover, assume that v calls a segment w which is multiply instantiated.
Thus, at time t + 1 the program can jump to one of several locations on which
w is instantiated. Note that the given description of the program in terms of the
execution trace τ provides no guidelines on which instance to choose. Instead, we
are free to adopt any calling convention for choosing among the different instances
of w.
The additional flexibility to choose a calling convention can be exploited to
further reduce the overall execution time of the program. To this end, we introduce
a location-aware execution trace θ which represents a function from T to the family
of binary matrices B^{V×L}. By definition, we set θvl(t) = 1 if and only if code segment
v is executed on location l at time t. The location-aware execution trace θ must
therefore be an element of the set
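\[
\Theta_{\mathrm{PI}}(x; \tau) := \Big\{ \theta \in \mathbb{B}^{V \times L \times T} :
\theta_{vl}(t) \le \tau_v(t)\, x_{vl} \;\; \forall v \in \mathcal{V},\ l \in \mathcal{L},\ t \in \mathbb{T}, \quad
\sum_{l \in \mathcal{L}} \theta_{vl}(t) = \tau_v(t) \;\; \forall v \in \mathcal{V},\ t \in \mathbb{T} \Big\}.
\]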
The subscript ‘PI’ indicates that perfect information about the ordinary execution
trace τ is assumed to be available. This assumption will be reconsidered in Section 3.
The first constraint group in the definition of ΘPI(x; τ) ensures that an instance of
segment v on location l is executed at time t only if v is in fact instantiated on l
and some instance of v must in fact be executed at time t. The second constraint
group makes sure that exactly one instance of segment v at time t is called if v is
supposed to be executed at t. Note that our definition of execution traces implies
that θvl(T + 1) := θvl(1) for all v ∈ V and l ∈ L, indicating that after termination
the program must return to the initial code segment and location. It is easy to
verify that for x ∈ X1 we have ΘPI(x; τ) = {θ◦} where θ◦vl(t) := τv(t)xvl for all v, l,
and t.
Given an assignment x ∈ X and a calling convention θ ∈ ΘPI(x; τ), the overall
execution time of the program amounts to
\[
c(\theta) := \sum_{t \in \mathbb{T}} \sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}}
c_{vwlm}\, \theta_{vl}(t)\, \theta_{wm}(t+1). \tag{2.3}
\]
If multiple instantiation of code segments is allowed, the best software partition is
thus found by solving the integer quadratic program
\[
\min_{x \in X} \; \min_{\theta \in \Theta_{\mathrm{PI}}(x; \tau)} \; c(\theta). \tag{P}
\]
Since X1 is a subset of X, the optimal value of P is never larger than the optimal
value of P1. In other words, the extra flexibility introduced by allowing for multiple
instantiation can never increase the program’s optimal execution time, and it may reduce it.
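For a fixed partition x and a fully known trace τ, the inner minimisation over Θ_PI(x; τ) in problem P can be read as a shortest-path computation over (time, location) states. The sketch below is our own illustration of that reading, not a procedure stated in the text; the function name, the 0-based indices and the toy data are assumptions. `placed[v]` is the set of locations hosting segment v, and segment 0 is hosted on location 0 only.

```python
def best_location_aware_trace(trace, placed, cost):
    """Return (min cost, locations per time step) over all calling conventions.

    Dynamic program over time: dist[l] is the cheapest cost of having
    trace[t] run on location l. The final call wraps back to trace[0],
    which must run on location 0 again.
    """
    T = len(trace)
    dist = {0: 0}
    back = []
    for t in range(T):
        v, w = trace[t], trace[(t + 1) % T]
        targets = sorted(placed[w]) if t + 1 < T else [0]
        new, arg = {}, {}
        for m in targets:
            for l, d in dist.items():
                cand = d + cost[v][w][l][m]
                if m not in new or cand < new[m]:
                    new[m], arg[m] = cand, l
        dist = new
        back.append(arg)
    # Walk the predecessor pointers backwards to recover the locations.
    locs, m = [0] * T, 0
    for t in range(T - 1, -1, -1):
        locs[t] = back[t][m]
        m = locs[t]
    return dist[0], locs

# Hypothetical toy data: segment 2 is a small helper called from both sides.
exec_time = [[1, 9], [5, 1], [1, 1]]
comm = 3
cost = [[[[exec_time[v][l] + (0 if l == m else comm) for m in range(2)]
          for l in range(2)] for w in range(3)] for v in range(3)]
trace = [0, 2, 0, 1, 2, 1]

single = best_location_aware_trace(trace, [{0}, {1}, {0}], cost)
multi = best_location_aware_trace(trace, [{0}, {1}, {0, 1}], cost)
print(single, multi)
```

Instantiating the helper segment on both locations removes two of the four cross-location calls in this example, so the multiply instantiated partition is strictly cheaper here.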
3 Causal Calling Conventions
A crucial assumption underlying the software partitioning problem P is that the
execution trace τ is precisely known. Note that T represents the total number of
executions of all code segments in V during the program’s lifetime. Even for a single
program run, T typically exceeds V since some code segments are called several
times. Since τ represents the concatenation of all future program traces, T can be
expected to be much larger than V . Moreover, τ depends on future input data which
is unknown at the time when P is solved. Since P requires full trace information as
an input, it can therefore not be solved in practice. Instead of collecting, storing and
manipulating τ itself, one needs to compress its essential information in an efficient
way. This is most commonly achieved by removing sequence information from the
trace using a high-level control flow graph [30], that is, only calling frequencies are
gathered. Although the calling frequencies also depend on future input data and
are thus uncertain, they can be estimated by observing the traces of a moderate
number of representative program runs.
Let us consider the directed, arc-weighted control flow graph G associated with
the program. The vertices of this graph correspond to the code segments v ∈ V,
while the arcs (v, w) ∈ V × V represent calls between code segments. An arc
(v, w) with weight χvw indicates that segment w is called χvw times by segment
v during program execution. For notational convenience, we assume that G is
complete with self-loops for all nodes. Arcs that do not correspond to calls between
code segments are assigned weight zero. Observe that the control flow graph G
is uniquely determined by the execution trace τ . Indeed, the arc weights χ are
obtained through the relation
\[
\sum_{t \in \mathbb{T}} \tau_v(t)\, \tau_w(t+1) = \chi_{vw} \quad \forall\, v, w \in \mathcal{V}. \tag{3.4}
\]
In contrast, a given control flow graph G fails to induce a unique execution trace
because it contains no information about the order of the calls. The set of all
execution traces consistent with G is representable as
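\[
\mathcal{T} := \Big\{ \tau \in \mathbb{B}^{V \times T} :
\sum_{t \in \mathbb{T}} \tau_v(t)\, \tau_w(t+1) = \chi_{vw} \;\; \forall v, w \in \mathcal{V}, \quad
\sum_{v \in \mathcal{V}} \tau_v(t) = 1 \;\; \forall t \in \mathbb{T}, \quad
\tau_1(1) = 1 \Big\}.
\]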
The first constraint group in the definition of T enforces consistency with the call-
ing frequencies stipulated in the control flow graph, while the second constraint
group ensures that exactly one code segment is executed at any time under trace
τ . The last constraint, finally, requires the program to start and terminate at code
segment 1. From now on we assume that only T (or, equivalently, G) is known at
the time when the software partition is selected. Thus, there is uncertainty about
which trace τ ∈ T will materialize.
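The loss of sequence information can be illustrated with a short sketch. The helper below (a hypothetical name, using 0-based segment indices) computes the arc weights χ of relation (3.4) from a cyclic trace, and the two toy traces show that different execution orders can induce the same control flow graph.

```python
from collections import Counter

def call_frequencies(trace):
    """chi[(v, w)] = number of times v calls w in the cyclic trace.

    The last entry of the trace is taken to call trace[0], mirroring the
    convention that the program returns to its first code segment.
    """
    T = len(trace)
    return Counter((trace[t], trace[(t + 1) % T]) for t in range(T))

trace_a = [0, 1, 0, 2]
trace_b = [0, 2, 0, 1]

chi_a = call_frequencies(trace_a)
chi_b = call_frequencies(trace_b)
print(chi_a == chi_b, trace_a == trace_b)
```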
Before we formulate the software partitioning problem under trace uncertainty,
we should investigate under what conditions on G the set T is nonempty (which is a
prerequisite for a meaningful optimization model). The following lemma describes
the set of control flow graphs that guarantee non-emptiness of T .
Lemma 3.1. The set T of execution traces which are compatible with the control
flow graph G is nonempty if and only if
(i) χvw ∈ Z+ for all v, w ∈ V;
(ii) G is the union of isolated vertices and a connected subgraph containing ver-
tex 1;
(iii) $T = \sum_{v,w \in \mathcal{V}} \chi_{vw}$;
(iv) $\sum_{w \in \mathcal{V}} (\chi_{vw} - \chi_{wv}) = 0$ for all v ∈ V.
Proof. Assume that T is nonempty, and select some τ ∈ T . Then, (i) holds since
τ is integer-valued, while (ii)–(iv) hold since τ induces a cycle in G that starts at
vertex 1 and visits arc (v, w) exactly χvw times for all v, w ∈ V. Assume now that
the assertions (i)–(iv) are satisfied, and consider the directed multigraph G(χ) with
vertices V that has χvw parallel arcs from v to w for all v, w ∈ V. Condition (iv)
guarantees that each vertex in G(χ) has equally many incoming as outgoing arcs.
Moreover, G(χ) is the union of some isolated vertices and a connected subgraph
containing vertex 1; this property is inherited from G. Thus, the multigraph G(χ)
possesses an Eulerian cycle {vt}t∈T of length T that starts at vertex 1 and visits
each arc exactly once. Any such Eulerian cycle can be used to construct a trace
τ ∈ T by setting τv(t) := 1 if v = vt; := 0 otherwise.
In the remainder of this paper we will always assume that the conditions (i)–(iv)
of Lemma 3.1 are satisfied, implying that T is in fact nonempty.
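The constructive half of the proof of Lemma 3.1 can be sketched in code. The function below checks conditions (i), (iii) and (iv) directly and then attempts to build an Eulerian cycle with Hierholzer's algorithm; condition (ii) is checked indirectly by verifying that the cycle uses all T arcs. Hierholzer's algorithm is our choice for the illustration (the proof only asserts existence), and the function names and 0-based indices are assumptions.

```python
def frequencies(trace):
    """Call counts of a cyclic trace, as in relation (3.4)."""
    T = len(trace)
    chi = {}
    for t in range(T):
        key = (trace[t], trace[(t + 1) % T])
        chi[key] = chi.get(key, 0) + 1
    return chi

def trace_from_frequencies(chi, T, start=0):
    """Return one trace consistent with chi, or None if none exists."""
    # (i) nonnegative integer weights and (iii) total weight equal to T
    if any(not isinstance(c, int) or c < 0 for c in chi.values()):
        return None
    if sum(chi.values()) != T:
        return None
    out, balance = {}, {}
    for (v, w), c in chi.items():
        out.setdefault(v, []).extend([w] * c)
        balance[v] = balance.get(v, 0) + c
        balance[w] = balance.get(w, 0) - c
    # (iv) every vertex has as many outgoing as incoming calls
    if any(b != 0 for b in balance.values()):
        return None
    # Hierholzer's algorithm on the multigraph G(chi), starting at `start`
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if out.get(v):
            stack.append(out[v].pop())
        else:
            cycle.append(stack.pop())
    cycle.reverse()
    # (ii) all arcs must be reachable from the start vertex
    if len(cycle) != T + 1:
        return None
    return cycle[:-1]

chi = {(0, 1): 1, (1, 0): 1, (0, 2): 1, (2, 0): 1}
trace = trace_from_frequencies(chi, 4)
unbalanced = trace_from_frequencies({(0, 1): 2, (1, 0): 1}, 3)
disconnected = trace_from_frequencies({(0, 0): 1, (1, 2): 1, (2, 1): 1}, 3)
print(trace, unbalanced, disconnected)
```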
Generic calling conventions generating location-aware traces θ ∈ ΘPI(x; τ) are
not implementable under incomplete information about τ . Indeed, according to the
above discussion, only the control flow graph is known at the time when the software
partition is selected. Even though the trace τ is initially unknown, it is revealed
during program execution, and thus the amount of available information gradually
increases: at any time t, the history of the trace up to time t, that is, the sequence
of binary vectors {τ(s)}_{s=1}^{t}, is available. Causal (or non-anticipative) calling conventions
that exploit this online information remain implementable and exhibit
considerable flexibility. For more information on the role of non-anticipativity in
decision making under uncertainty see e.g. [31].
Set ΘPI(x) := ×τ∈T ΘPI(x; τ). Thus, every θ ∈ ΘPI(x) constitutes a collection
of location-aware traces θ = (θτ )τ∈T where θτ ∈ ΘPI(x; τ) for each τ ∈ T . Any
θ ∈ ΘPI(x) should be interpreted as a decision rule of the following type: if trace
τ materializes, then apply the calling convention that generates θτ . Note that this
decision rule is (usually) only implementable if perfect trace information is available
before the first call. We can now introduce the set of all location-aware traces that
are generated by causal calling conventions.
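\[
\Theta_{\mathrm{C}}(x) := \Big\{ \theta \in \Theta_{\mathrm{PI}}(x) :
\theta_\tau(t) = \theta_{\tau'}(t) \;\; \forall t \in \mathbb{T},\ \tau, \tau' \in \mathcal{T}
\text{ with } \tau(s) = \tau'(s) \;\; \forall s = 1, \dots, t \Big\}
\]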
By definition, ΘC(x) is a subset of ΘPI(x). Thus, any given θ ∈ ΘC(x) still consti-
tutes a decision rule of the kind described above. This θ is implementable despite
the fact that the full trace τ is only known after program termination. Because
of the non-anticipativity constraints in the definition of ΘC(x), knowledge of the
trace history τ(1), . . . , τ(t) up to time t is sufficient to implement the time t calling
convention yielding θτ (t). In fact, all τ ∈ T which are indistinguishable up to time
t result in the same call at time t.
The above discussion suggests that we should employ causal calling conventions
corresponding to location-aware traces θ ∈ ΘC(x) if τ is uncertain. As no probabil-
ities can be assigned to the different traces in T , it is reasonable to select a software
partition x ∈ X and location-aware trace θ ∈ ΘC(x) which are optimal in view of
the worst-case realization of τ . Ideally, we thus would like to solve the following
robust counterpart of problem P.
\[
\min_{x \in X} \; \min_{\theta \in \Theta_{\mathrm{C}}(x)} \; \max_{\tau \in \mathcal{T}} \; c(\theta_\tau) \tag{RP}
\]
4 Complexity Reduction
The software partitioning problem RP represents a multi-stage robust optimization
problem with integer recourse and is therefore severely computationally intractable.
Moreover, accumulating trace information during program execution is impractical
due to excessive storage requirements; see also the discussion at the beginning of
Section 3. To reduce the computational complexity of RP, we now apply several
approximations.
4.1 Greedy Calling Convention
First, we shrink the set of admissible location-aware traces to a singleton, that is,
we stipulate that a specific greedy calling convention generating the location-aware
trace θ∗ must be used. The software partitioning problem RP thus reduces to
\[
\min_{x \in X} \; \max_{\tau \in \mathcal{T}} \; c(\theta^*_\tau). \tag{GRP}
\]
If the location-aware trace θ∗ is an element of ΘC(x), then GRP is more restrictive
than the original problem RP and thus represents a conservative approximation.
In order to specify θ∗, we explicitly define the greedy calling convention µ.
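\[
\mu : X \times \mathcal{V} \times \mathcal{V} \times \mathcal{L} \to \mathcal{L}, \qquad
\mu(x; v, w, l) := \min \Big\{ \operatorname*{arg\,min}_{m \in \mathcal{L},\; x_{wm} = 1} \{ c_{vwlm} \} \Big\}
\]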
Note that the argmin-mapping constitutes a set-valued function. For µ to be well-
defined, we must prescribe a rule for selecting a unique minimizer if the argmin
mapping returns several values. Without loss of generality, we always select the
minimizer with the lowest index. For a given software partition x, the greedy calling
convention µ has the following property. If code segment v on location l needs to
call code segment w, then calling w’s instance on location µ(x; v, w, l) incurs the
smallest instantaneous costs. The location-aware trace θ∗ generated by the greedy
calling convention µ can be constructed recursively. For all t ∈ T we set
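\[
\theta^*_{\tau,wm}(t+1) :=
\begin{cases}
1 & \text{if } \tau_w(t+1) = 1 \text{ and } m = \mu(x; v, w, l) \text{ for } v \text{ and } l \text{ with } \theta^*_{\tau,vl}(t) = 1, \\
0 & \text{otherwise.}
\end{cases}
\]
As usual, we use the convention that θ*_{τ,vl}(T + 1) := θ*_{τ,vl}(1) for all v ∈ V and l ∈ L.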
Note that the recursive construction of θ∗ is well-defined since—by definition of
X—the first code segment (with index 1) is instantiated only on location 1, while
each other code segment is instantiated on at least one location. The location-
aware trace θ∗ depends on the selected software partition x. To avoid proliferation
of subscripts, however, we notationally suppress this dependency.
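The recursive construction above can be sketched as follows. The function names, the 0-based indices and the toy data are assumptions made for the illustration; `placed[w]` is the set of locations m with x_wm = 1, and the segment at the start of the trace is hosted on location 0 only.

```python
def mu(placed, cost, v, w, l):
    """Greedy calling convention: cheapest hosted instance of w when called
    from v on location l, breaking ties in favour of the lowest index."""
    return min(sorted(placed[w]), key=lambda m: cost[v][w][l][m])

def greedy_trace(trace, placed, cost):
    """Locations visited under the greedy convention, one per trace entry."""
    locs = [0]  # the first segment runs on location 0
    for t in range(len(trace) - 1):
        locs.append(mu(placed, cost, trace[t], trace[t + 1], locs[-1]))
    return locs

def trace_cost(trace, locs, cost):
    """c(theta) for the cyclic trace; the last call returns to (trace[0], 0)."""
    T = len(trace)
    return sum(
        cost[trace[t]][trace[(t + 1) % T]][locs[t]][locs[(t + 1) % T]]
        for t in range(T)
    )

# Hypothetical toy data: caller execution time plus a cross-location penalty.
exec_time = [[1, 9], [5, 1], [1, 1]]
comm = 3
cost = [[[[exec_time[v][l] + (0 if l == m else comm) for m in range(2)]
          for l in range(2)] for w in range(3)] for v in range(3)]
placed = [{0}, {1}, {0, 1}]   # segment 2 is instantiated twice
trace = [0, 2, 0, 1, 2, 1]

locs = greedy_trace(trace, placed, cost)
total = trace_cost(trace, locs, cost)
print(locs, total)
```

Each choice depends only on the current (segment, location) pair and the next segment in the trace, which is why the resulting trace can be generated online.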
Lemma 4.1. The location-aware trace θ∗ is an element of ΘC(x).
Proof. We first show that θ∗τ ∈ ΘPI(x; τ) for any τ ∈ T . By construction, θ∗τ,vl(t) is
binary and vanishes if xvl = 0 or τv(t) = 0. This implies
θ∗τ,vl(t) ≤ τv(t)xvl ∀ v ∈ V, l ∈ L, t ∈ T .
By induction on time one can show that for any fixed t ∈ T there is exactly one
code segment vt and location lt such that θ∗τ,vl(t) = 1 if v = vt and l = lt; := 0
otherwise. This essentially follows from the fact that µ is a single-valued mapping
on its entire domain. In particular, notice that θ∗τ,vl(1) = 1 if and only if v = l = 1.
This holds because each execution trace in T starts with code segment 1, which is
instantiated only on location 1. Thus, we find
\[
\sum_{l \in \mathcal{L}} \theta^*_{\tau,vl}(t) = \tau_v(t) \quad \forall\, v \in \mathcal{V},\ t \in \mathbb{T},
\]
implying that θ*_τ is indeed an element of ΘPI(x; τ). It remains to be shown that θ*
is causal. To this end, notice that θ*_τ(1) is independent of τ, while θ*_τ(t + 1) depends
only on τ(t + 1) and θ*_τ(t) for all t ∈ T. Causality thus follows by induction on t.
4.2 Location-Aware Control Flows
Problem GRP is still not suitable for numerical solution. To improve its compu-
tational tractability, we should eliminate its explicit dependence on time. To this
end, we introduce a set Ξc(x) of location-aware control flows.
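\[
\Xi_c(x) := \Big\{ \xi \in \mathbb{Z}_+^{V^2 \times L^2} : \exists\, \tau \in \mathcal{T} \text{ with }
\xi_{vwlm} = \sum_{t \in \mathbb{T}} \theta^*_{\tau,vl}(t)\, \theta^*_{\tau,wm}(t+1)
\;\; \forall v, w \in \mathcal{V},\ l, m \in \mathcal{L} \Big\}
\]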
The component ξvwlm of any ξ ∈ Ξc(x) indicates how often code segment v on
location l calls code segment w on location m during program execution when the
greedy calling convention is employed. If some of the code segments are multiply
instantiated, this number may vary with the execution trace τ . The set Ξc(x)
collects all location-aware control flows ξ associated with the possible traces τ ∈ T .
In other words, Ξc(x) is the set of all location-aware control flows that are consistent
with the (location-unaware) control flow graph G. Recalling the definition of the
cost function (2.3), problem GRP can now be reformulated as
\[
\min_{x \in X} \; \max_{\xi \in \Xi_c(x)} \; \sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}} c_{vwlm}\, \xi_{vwlm}. \tag{4.5}
\]
Note that (4.5) can be interpreted as a robust optimization problem with decision-
dependent uncertainty. The goal is to find a robust software partition x that min-
imizes the worst-case program execution time. The worst case is taken over all
location-aware control flows ξ that are consistent with the control flow graph G and
the software partition x. In robust optimization terminology, x is the decision vari-
able, ξ is the uncertain parameter, and Ξc(x) represents the underlying uncertainty
set. The uncertainty set explicitly depends on the decision x. While stochastic pro-
grams with decision-dependent uncertainty have been considered recently [32], it
seems that robust optimization problems of this type have received little attention
until now. See [33] for a textbook introduction to robust optimization.
Problem (4.5) constitutes an exact reformulation of GRP. While its objective
function is linear in ξ and thus lends itself to computational treatment, the un-
certainty set Ξc(x) looks cumbersome and still exhibits an explicit dependence on
time. We now construct a more tractable approximation for Ξc(x). To this end, we
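let M be any constant which is larger than max_{v,w} χ_{vw}, and we define Ξ(x) as the
set of all ξ ∈ R_+^{V^2×L^2} satisfying the constraints
\begin{align}
\sum_{w \in \mathcal{V}} \sum_{m \in \mathcal{L}} \big( \xi_{vwlm} - \xi_{wvml} \big) &= 0 \tag{4.6a} \\
\sum_{l,m \in \mathcal{L}} \xi_{vwlm} &= \chi_{vw} \tag{4.6b} \\
\xi_{vwlm} &\le M \min \{ x_{vl}, x_{wm} \} \tag{4.6c} \\
\xi_{vwlm} &\le M \min_{m' \in \mathcal{L}_{vwlm}} (1 - x_{wm'}) \tag{4.6d}
\end{align}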
for all v, w ∈ V and l,m ∈ L. The index set Lvwlm is defined as the collection of
all m′ ∈ L that satisfy cvwlm′ < cvwlm. Notice that Ξ(x) is indeed independent of
the choice of M as long as M is larger than all χvw. We emphasize that Ξc(x) is a
discrete set, whereas Ξ(x) constitutes a convex polyhedron.
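To make the constraints (4.6) concrete, the sketch below aggregates a location-aware trace into a flow ξ and checks (4.6a) to (4.6d) for a binary partition. For binary x and flows already bounded by χ through (4.6b), the big-M constraints (4.6c) and (4.6d) amount to forcing ξ_vwlm to zero whenever an endpoint is not instantiated or a strictly cheaper instance of w is available, and that is how they are tested here. Function names, 0-based indices and the toy data are assumptions for the illustration.

```python
from collections import Counter

def location_aware_flow(trace, locs):
    """xi[(v, w, l, m)]: how often v on l calls w on m in the cyclic trace."""
    T = len(trace)
    return Counter(
        (trace[t], trace[(t + 1) % T], locs[t], locs[(t + 1) % T])
        for t in range(T)
    )

def satisfies_4_6(xi, chi, placed, cost, n_seg, n_loc):
    segs, locs = range(n_seg), range(n_loc)
    for v in segs:
        for l in locs:  # (4.6a): flow conservation at node (v, l)
            out_flow = sum(xi.get((v, w, l, m), 0) for w in segs for m in locs)
            in_flow = sum(xi.get((w, v, m, l), 0) for w in segs for m in locs)
            if out_flow != in_flow:
                return False
        for w in segs:  # (4.6b): consistency with the calling frequencies
            total = sum(xi.get((v, w, l, m), 0) for l in locs for m in locs)
            if total != chi.get((v, w), 0):
                return False
    for (v, w, l, m), flow in xi.items():
        if flow == 0:
            continue
        if l not in placed[v] or m not in placed[w]:  # (4.6c)
            return False
        if any(cost[v][w][l][k] < cost[v][w][l][m] for k in placed[w]):  # (4.6d)
            return False
    return True

exec_time = [[1, 9], [5, 1], [1, 1]]
comm = 3
cost = [[[[exec_time[v][l] + (0 if l == m else comm) for m in range(2)]
          for l in range(2)] for w in range(3)] for v in range(3)]
placed = [{0}, {1}, {0, 1}]
trace = [0, 2, 0, 1, 2, 1]
chi = Counter((trace[t], trace[(t + 1) % 6]) for t in range(6))

greedy_flow = location_aware_flow(trace, [0, 0, 0, 1, 1, 1])
other_flow = location_aware_flow(trace, [0, 1, 0, 1, 1, 1])  # not greedy
ok_greedy = satisfies_4_6(greedy_flow, chi, placed, cost, 3, 2)
ok_other = satisfies_4_6(other_flow, chi, placed, cost, 3, 2)
print(ok_greedy, ok_other)
```

The second flow is conserved and matches χ, but it routes one call to a more expensive instance and is therefore excluded by (4.6d).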
Proposition 4.2. Ξc(x) is a subset of Ξ(x).
Proof. Choose an arbitrary ξ ∈ Ξc(x) and let τ be an element of T satisfying
\[
\xi_{vwlm} = \sum_{t \in \mathbb{T}} \theta^*_{\tau,vl}(t)\, \theta^*_{\tau,wm}(t+1) \quad \forall\, v, w \in \mathcal{V},\ l, m \in \mathcal{L}.
\]
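The existence of such a τ is guaranteed by the definition of Ξc(x). Thus, we have
\[
\begin{aligned}
\sum_{w \in \mathcal{V}} \sum_{m \in \mathcal{L}} \big( \xi_{vwlm} - \xi_{wvml} \big)
&= \sum_{w \in \mathcal{V}} \sum_{m \in \mathcal{L}} \sum_{t \in \mathbb{T}}
\big( \theta^*_{\tau,vl}(t)\, \theta^*_{\tau,wm}(t+1) - \theta^*_{\tau,wm}(t)\, \theta^*_{\tau,vl}(t+1) \big) \\
&= \sum_{t \in \mathbb{T}} \big( \theta^*_{\tau,vl}(t) - \theta^*_{\tau,vl}(t+1) \big)
= \theta^*_{\tau,vl}(1) - \theta^*_{\tau,vl}(T+1) = 0,
\end{aligned}
\]
where the second equality holds since the program executes exactly one code segment
on exactly one location at each time. Thus, (4.6a) holds. Next, we find
\[
\sum_{l,m \in \mathcal{L}} \xi_{vwlm}
= \sum_{l,m \in \mathcal{L}} \sum_{t \in \mathbb{T}} \theta^*_{\tau,vl}(t)\, \theta^*_{\tau,wm}(t+1)
= \sum_{t \in \mathbb{T}} \tau_v(t)\, \tau_w(t+1) = \chi_{vw},
\]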
where the second equality holds because θ∗τ ∈ ΘPI(x; τ), while the third equality
follows from the properties of traces τ ∈ T . Thus, (4.6b) is established. The fact
that θ∗τ is contained in ΘPI(x; τ) further implies
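\[
\xi_{vwlm} = \sum_{t \in \mathbb{T}} \theta^*_{\tau,vl}(t)\, \theta^*_{\tau,wm}(t+1)
\le \sum_{t \in \mathbb{T}} x_{vl}\, \tau_v(t)\, x_{wm}\, \tau_w(t+1) = x_{vl}\, x_{wm}\, \chi_{vw}.
\]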
Thus, we have ξvwlm ≤ Mxvl and ξvwlm ≤ Mxwm, which implies (4.6c). In order
to establish (4.6d), we notice that
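\[
\begin{aligned}
\exists\, m' \in \mathcal{L}_{vwlm} : x_{wm'} = 1
\;&\Longrightarrow\; \mu(x; v, w, l) \ne m \\
&\Longrightarrow\; \theta^*_{\tau,vl}(t)\, \theta^*_{\tau,wm}(t+1) = 0 \quad \forall\, t \in \mathbb{T} \\
&\Longrightarrow\; \xi_{vwlm} = \sum_{t \in \mathbb{T}} \theta^*_{\tau,vl}(t)\, \theta^*_{\tau,wm}(t+1) = 0.
\end{aligned}
\]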
Here, the first and second implications follow from the definitions of µ(x; v, w, l)
and the location-aware trace θ∗, respectively. The above reasoning implies that
ξvwlm ≤ M(1 − xwm′) for all m′ ∈ Lvwlm. In summary, all constraints (4.6) are
satisfied, and thus ξ is an element of Ξ(x).
In the following, we argue that Ξc(x) is a strict subset of Ξ(x), and we establish
an explicit criterion to decide whether ξ ∈ Ξ(x) is contained in Ξc(x). To this end,
we assign to each V^2 × L^2-dimensional vector ξ with nonnegative integer entries
a directed multigraph G(ξ) with vertices V × L and with ξvwlm parallel arcs from
(v, l) to (w,m), where (v, l) and (w,m) range over V × L.
Proposition 4.3. If ξ ∈ Ξ(x) has only integer entries, while G(ξ) is the union of
a connected subgraph and some isolated vertices, then ξ ∈ Ξc(x).
Proof. Select a ξ satisfying the conditions in the proposition statement. Since ξ is
an element of Ξ(x), the number of incoming arcs equals the number of outgoing arcs
in each vertex of the multigraph G(ξ), see (4.6a). Since G(ξ) can be decomposed
into a connected subgraph and some isolated vertices, there exists an Eulerian cycle
{v_t, l_t}_{t∈T} of length
\[
\sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}} \xi_{vwlm}
\overset{(4.6b)}{=} \sum_{v,w \in \mathcal{V}} \chi_{vw} = T
\]
which visits each arc exactly once. The component sequence {vt}t∈T can be used to
construct an execution trace τ ∈ T by setting τv(t) := 1 if v = vt; := 0 otherwise.
The constraints (4.6c)–(4.6d) ensure that if vt = v, vt+1 = w, lt = l, and
lt+1 = m for some t ∈ T, then v must be instantiated on l, w must be instantiated
on m, and the cost of calling w on m from v on l is minimal over all locations m′ on
which w is instantiated. Therefore, we have l_t = ∑_{l∈L} l θ*_{τ,vl}(t) for all t ∈ T, v ∈ V,
and l ∈ L, that is, the Eulerian cycle {vt, lt}t∈T is induced by the execution trace τ
under the location-aware trace θ∗. This implies that ξ ∈ Ξc(x).
If we replace Ξc(x) by its superset Ξ(x) in (4.5), we obtain a conservative ap-
proximation for GRP.
\[
\min_{x \in X} \; \max_{\xi \in \Xi(x)} \; \sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}} c_{vwlm}\, \xi_{vwlm} \tag{AGRP}
\]
Note that the inner maximization problem is now a linear program. In the next
section we will show that AGRP has an equivalent reformulation as a MILP and is
thus a promising candidate for numerical solution.
5 MILP Formulation
We convert AGRP to an equivalent minimization problem by dualizing the linear
program over the uncertain parameters ξ ≥ 0. To do so, we introduce Lagrange
multipliers αvl and βvw corresponding to the flow conservation and consistency con-
straints (4.6a) and (4.6b), respectively. Moreover, we assign nonnegative multipliers
γvwlm and δvwlm to the constraints (4.6c) and (4.6d), respectively, which ensure that
the location-aware control flow is consistent with the selected software partition x
and obeys the greedy calling convention. The dual of the inner maximization prob-
lem adopts the following form.
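\[
\begin{aligned}
\min_{\alpha,\beta,\gamma,\delta,\varepsilon} \quad
& \sum_{v,w \in \mathcal{V}} \chi_{vw}\, \beta_{vw}
+ M \sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}} \min \{ x_{vl}, x_{wm} \}\, \gamma_{vwlm}
+ M \sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}} \min_{m' \in \mathcal{L}_{vwlm}} (1 - x_{wm'})\, \delta_{vwlm} \\
\text{s.t.} \quad
& \alpha_{vl} - \alpha_{wm} + \beta_{vw} + \gamma_{vwlm} + \delta_{vwlm} \ge c_{vwlm}
\quad \forall\, v, w \in \mathcal{V},\ l, m \in \mathcal{L} \\
& \gamma, \delta \ge 0
\end{aligned}
\tag{5.7}
\]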
Notice that strong linear programming duality holds since the inner maximization
problem in AGRP is feasible, that is, because Ξ(x) is nonempty for all x ∈ X.
This is a consequence of the fact that Ξ(x) is a superset of Ξc(x), which in turn is
nonempty because of Lemma 3.1 and our assumptions about the given control flow
graph G. Thus, the dual linear program (5.7) has the same optimal value as the
inner maximization problem in AGRP.
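One way to see where the constraint and objective of (5.7) come from is to write out the Lagrangian of the inner maximization problem. The sign convention chosen below for the free multipliers α and β is our assumption (the text does not state it); since both are unrestricted in sign, other conventions lead to an equivalent dual.
\[
\begin{aligned}
\mathcal{L}(\xi; \alpha, \beta, \gamma, \delta)
={}& \sum_{v,w} \sum_{l,m} c_{vwlm}\, \xi_{vwlm}
- \sum_{v,l} \alpha_{vl} \sum_{w,m} \big( \xi_{vwlm} - \xi_{wvml} \big)
+ \sum_{v,w} \beta_{vw} \Big( \chi_{vw} - \sum_{l,m} \xi_{vwlm} \Big) \\
&+ \sum_{v,w} \sum_{l,m} \gamma_{vwlm} \big( M \min \{ x_{vl}, x_{wm} \} - \xi_{vwlm} \big)
+ \sum_{v,w} \sum_{l,m} \delta_{vwlm} \Big( M \min_{m' \in \mathcal{L}_{vwlm}} (1 - x_{wm'}) - \xi_{vwlm} \Big).
\end{aligned}
\]
For any ξ satisfying (4.6) and any γ, δ ≥ 0, the two equality terms vanish and the two inequality terms are nonnegative, so the Lagrangian is an upper bound on the primal objective. Collecting the coefficient of each ξ_{vwlm} gives
\[
c_{vwlm} - \alpha_{vl} + \alpha_{wm} - \beta_{vw} - \gamma_{vwlm} - \delta_{vwlm},
\]
and the supremum of the Lagrangian over ξ ≥ 0 is finite only if this coefficient is nonpositive for all v, w, l, m, which is exactly the constraint of (5.7). The terms that do not involve ξ then form the dual objective.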
From a computational point of view, the frequent occurrence of the large constant
M in the dual objective function is undesirable as it deteriorates the problem’s
scaling properties. Moreover, the bilinear terms in the assignment variable x and
the dual variables γ and δ lead to a mixed-integer nonlinear program when (5.7)
is substituted into AGRP. It turns out that the outlined deficiencies can be over-
come by exploiting the following observation. The primal feasible set Ξ(x) of the
inner problem in AGRP is independent of the choice of M as long as M is larger
than M0 := maxv,w χvw. Thus, by strong duality, the optimal value of (5.7) is also
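independent of M as long as M ≥ M0. This implies that
\[
\sum_{v,w \in \mathcal{V}} \sum_{l,m \in \mathcal{L}}
\Big( \min \{ x_{vl}, x_{wm} \}\, \gamma_{vwlm}
+ \min_{m' \in \mathcal{L}_{vwlm}} (1 - x_{wm'})\, \delta_{vwlm} \Big) = 0 \tag{5.8}
\]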
at optimality. Since γ and δ are nonnegative, while x is a vector of binary variables,
(5.8) can be interpreted as a complementarity condition which is equivalent to
\[
\gamma_{vwlm} \le M_d \max \{ 1 - x_{vl},\, 1 - x_{wm} \}, \qquad
\delta_{vwlm} \le M_d \max_{m' \in \mathcal{L}_{vwlm}} x_{wm'} \tag{5.9}
\]
for all v, w ∈ V and l,m ∈ L. The new constant Md > 0 represents a uniform a
priori bound on the optimal dual variables γ and δ with respect to the maximum
norm. Note that Md can be chosen independently of x ∈ X and M ≥ M0. This
reasoning shows that we can remove all terms proportional to M in the objective of
(5.7) at the cost of appending the constraints (5.9). By construction, the optimal
value of the resulting streamlined optimization problem is independent of Md as
long as this constant is chosen sufficiently large. Standard arguments similar to
those outlined above can be used to show that the constraint
αvl − αwm + βvw + γvwlm + δvwlm ≥ cvwlm (5.10)
is redundant (that is, not binding at optimality) if xvl = 0 or xwm = 0 or xwm′ = 1
for at least one m′ ∈ Lvwlm. This observation allows us to eliminate δ from the
problem and to replace (5.9) and (5.10) by
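$$
\begin{aligned}
\alpha_{vl} - \alpha_{wm} + \beta_{vw} + \gamma_{vwlm} &\ge c_{vwlm}, \\
\gamma_{vwlm} &\le M_d \Big( 1 - x_{vl} + 1 - x_{wm} + \sum_{m' \in L_{vwlm}} x_{wm'} \Big).
\end{aligned}
$$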
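The gating effect of the bound on γ can be illustrated with a minimal Python sketch. It assumes the assignment x is stored as a dictionary keyed by (segment, location) pairs with 0/1 values and that the set L_vwlm is supplied by the caller; the function name and the data layout are hypothetical and serve only as an illustration.

```python
def gamma_upper_bound(x, v, l, w, m, l_vwlm, m_d):
    """Right-hand side M_d * (2 - x_vl - x_wm + sum of x_wm' over m' in L_vwlm).

    x      : dict mapping (segment, location) -> 0 or 1 (hypothetical layout)
    l_vwlm : iterable of locations m' forming the set L_vwlm (taken as given)
    m_d    : the a priori bound M_d on the dual variables
    """
    slack = 2 - x[(v, l)] - x[(w, m)] + sum(x[(w, mp)] for mp in l_vwlm)
    return m_d * slack


# Illustrative data: v sits on A, w sits on B only.
x = {("v", "A"): 1, ("w", "A"): 0, ("w", "B"): 1}
print(gamma_upper_bound(x, "v", "A", "w", "B", ["A"], 100))

# Instantiating w on A as well releases the bound on gamma.
x[("w", "A")] = 1
print(gamma_upper_bound(x, "v", "A", "w", "B", ["A"], 100))
```

The bound is zero, forcing γ_vwlm = 0, exactly when x_vl = x_wm = 1 and no x_wm′ with m′ ∈ L_vwlm equals one. In every other case γ_vwlm may grow up to at least M_d, which makes the first constraint non-binding.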
In summary, we have thus demonstrated that AGRP can be equivalently expressed
as the following MILP.
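$$
\begin{aligned}
\min_{x,\alpha,\beta,\gamma} \quad & \sum_{v,w \in V} \chi_{vw} \beta_{vw} \\
\text{s.t.} \quad & \alpha_{vl} - \alpha_{wm} + \beta_{vw} + \gamma_{vwlm} \ge c_{vwlm} && \forall v,w \in V,\ l,m \in L \\
& \gamma_{vwlm} \le M_d \Big( 2 - x_{vl} - x_{wm} + \sum_{m' \in L_{vwlm}} x_{wm'} \Big) && \forall v,w \in V,\ l,m \in L \\
& x \in X,\ \gamma \ge 0
\end{aligned}
\tag{5.11}
$$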
The current state of the art in MILP techniques enables us to solve large-scale MILP
problems to optimality with commercial solvers.
6 Connectivity Cuts
Problem AGRP constitutes a conservative approximation for the robust software
partitioning problem with greedy calling convention since Ξ(x) ⊃ Ξc(x); see Propo-
sition 4.2. For a given software partition x ∈ X, we call a location-aware control
flow ξ ∈ Ξ(x) with integral components connected if the associated multigraph G(ξ)
is the union of a connected subgraph and some isolated vertices. According to
Proposition 4.3, every control flow ξ ∈ Ξ(x) \Ξc(x) is physically impossible for one
of the following reasons: either the components of ξ are non-integral, or ξ is dis-
connected. In the following, we elaborate a tighter convex approximation for Ξc(x)
which cuts off some of the spurious control flows in Ξ(x) \ Ξc(x).
For the remainder of this section, let N be an upper bound on the length of
any vertex-disjoint path in G(ξ), where ξ can be any location-aware control flow in
$\bigcup_{x \in X} \Xi_c(x)$. A simple bound is given by N := V L, while tighter bounds can be
obtained for specific problem instances. The following proposition gives an alternative
characterization of connected control flows.
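Proposition 6.1. For a given software partition x ∈ X, an integral control flow
ξ ∈ Ξ(x) is connected, that is, ξ ∈ Ξc(x), if and only if there is $\alpha \in \mathbb{B}^{N \times V \times L}$ such
that
\begin{align}
\alpha^1_{vl} &\le \xi_{1v1l} \tag{6.12a} \\
\alpha^n_{wm} &\le \alpha^{n-1}_{wm} + \sum_{v \in V} \sum_{l \in L} \alpha^{n-1}_{vl}\, \xi_{vwlm} \tag{6.12b} \\
\xi_{vwlm} &\le \chi_{vw}\, \alpha^N_{vl} \tag{6.12c}
\end{align}
for all v, w ∈ V, l, m ∈ L and n ∈ {2, . . . , N}.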
180 APPENDIX F. ROBUST OPTIMISATION JOURNAL PAPER
Proof. Select an arbitrary ξ ∈ Ξ(x) with integral components. By construction,
vertex (1, 1) in the multigraph G(ξ) is not isolated. We say that vertex (v, l) ∈ V × L
is connected to vertex (1, 1) in G(ξ) if there is a directed path from (1, 1) to (v, l).
Without loss of generality, we can focus on paths of length N or smaller.
If $\alpha \in \mathbb{B}^{N \times V \times L}$ satisfies (6.12a) and (6.12b), then $\alpha^n_{vl} = 1$ implies that (v, l) is
connected to (1, 1) by a path of length k ≤ n. This can be shown by induction on
n: the base step for n = 1 follows from (6.12a), while the inductive step for n > 1
follows from (6.12b). By definition, each vertex in G(ξ) has equally many incoming
and outgoing arcs. Hence, any non-isolated vertex in G(ξ) has both incoming and
outgoing arcs, and constraint (6.12c) enforces that $\alpha^N_{vl} = 1$ for every non-isolated
vertex (v, l) in G(ξ). If there exists $\alpha \in \mathbb{B}^{N \times V \times L}$ subject to (6.12), then the above
arguments imply that there is a directed path from vertex (1, 1) to every non-isolated
vertex in G(ξ). Consider now the subgraph of G(ξ) consisting of all vertices that
can be reached from (1, 1) and all arcs of G(ξ). Each vertex in this subgraph has
equally many incoming as outgoing arcs. Hence, there exists an Eulerian cycle that
visits each arc once, that is, the subgraph is connected. If there exists $\alpha \in \mathbb{B}^{N \times V \times L}$
subject to (6.12), the subgraph contains all non-isolated vertices, which implies that
ξ is connected.
Now assume that ξ ∈ Ξ(x) is integral and connected. Define $\hat{\alpha} \in \mathbb{B}^{N \times V \times L}$ by
setting $\hat{\alpha}^n_{vl} := 1$ if (v, l) is connected to (1, 1) by a path of length k ≤ n; := 0
otherwise. By construction, $\hat{\alpha}$ satisfies (6.12a) and (6.12b). Since ξ is connected,
$\hat{\alpha}^N_{vl} = 1$ for every non-isolated vertex (v, l) in G(ξ). Thus, $\hat{\alpha}$ satisfies (6.12c), too.
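The construction used in the proof can be sketched in Python for small integral flows. The sketch assumes that a flow is stored as a dictionary mapping arcs ((v, l), (w, m)) to nonnegative integers and that N is at least the number of vertices; both the representation and the function names are hypothetical. Layer n collects the vertices whose label would be one, and the final check corresponds to (6.12c) for arcs carrying positive flow.

```python
def reachability_labels(xi, source, n_max):
    """Return a list of sets; entry n-1 holds the vertices reachable from
    `source` by a directed path of length at most n (n = 1, ..., n_max)."""
    # Base step, cf. (6.12a): heads of arcs that leave the source.
    layers = [{head for (tail, head), flow in xi.items()
               if tail == source and flow > 0}]
    # Inductive step, cf. (6.12b): keep old labels, extend along used arcs.
    for _ in range(2, n_max + 1):
        prev = layers[-1]
        reached = {head for (tail, head), flow in xi.items()
                   if flow > 0 and tail in prev}
        layers.append(prev | reached)
    return layers


def is_connected(xi, source, n_max):
    """Check, cf. (6.12c), that every arc with positive flow leaves a
    vertex that carries a label in the last layer."""
    final = reachability_labels(xi, source, n_max)[-1]
    return all(tail in final for (tail, _), flow in xi.items() if flow > 0)


# Illustrative flows: one cycle through the source, plus a detached cycle.
cycle = {((1, "A"), (2, "A")): 1, ((2, "A"), (1, "A")): 1}
detached = dict(cycle)
detached.update({((3, "B"), (4, "B")): 2, ((4, "B"), (3, "B")): 2})

print(is_connected(cycle, (1, "A"), 4))
print(is_connected(detached, (1, "A"), 4))
```

The second flow balances inflow and outflow at every vertex, yet its cycle on location B cannot be reached from the source, so the labels cannot cover it and the check fails.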
To mitigate the over-conservativeness of AGRP, we want to disregard control
flows ξ ∈ Ξ(x) \ Ξc(x), and hence we would like to replace Ξ(x) with the following,
tighter approximation of Ξc(x):
$$
\Xi'(x) := \left\{ \xi \in \Xi(x) : \exists\, \alpha \in \mathbb{B}^{N \times V \times L} \text{ that satisfies (6.12)} \right\}
$$
Note that by Proposition 6.1, each ξ ∈ Ξ′(x) with integral components is contained
in Ξc(x). Since Ξ′(x) is non-convex, however, we cannot use it to replace Ξ(x)
in AGRP as the dualization in Section 5 would result in a duality gap that may
destroy the conservativeness of the approximation. Instead, we can replace Ξ(x)
with the following polyhedral outer approximation of Ξ′(x):
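$$
\begin{aligned}
\Xi^*(x) := \Big\{ \xi \in \Xi(x) :\ & \exists\, \alpha \in \mathbb{R}_+^{N \times V \times L},\ \beta \in \mathbb{R}_+^{N \times V^2 \times L^2}\,. \\
& \alpha^n_{vl} \le 1 && \forall v \in V,\ l \in L,\ n \in \{2, \dots, N\}, \\
& \beta^n_{vwlm} \le 1 && \forall v, w \in V,\ l, m \in L,\ n \in \{2, \dots, N\}, \\
& \alpha^1_{vl} \le \xi_{1v1l} && \forall v \in V,\ l \in L, \\
& \alpha^n_{wm} \le \alpha^{n-1}_{wm} + \sum_{v \in V} \sum_{l \in L} \beta^n_{vwlm} && \forall w \in V,\ m \in L,\ n \in \{2, \dots, N\}, \\
& \beta^n_{vwlm} \le \min\{\alpha^{n-1}_{vl},\, \xi_{vwlm}\} && \forall v, w \in V,\ l, m \in L,\ n \in \{2, \dots, N\}, \\
& \xi_{vwlm} \le \chi_{vw}\, \alpha^N_{vl} && \forall v, w \in V,\ l, m \in L \ \Big\}
\end{aligned}
$$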
Note that the auxiliary variable β has been introduced to relax the bilinear terms
in (6.12b). We can now solve the following variant of AGRP:
$$
\min_{x \in X} \; \max_{\xi \in \Xi^*(x)} \; \sum_{v,w \in V} \sum_{l,m \in L} c_{vwlm}\, \xi_{vwlm}, \tag{AGRP$^*$}
$$
where the uncertainty set Ξ(x) has been replaced with Ξ∗(x). Similar derivations
as in Section 5 allow us to convert AGRP∗ into an equivalent MILP. The inclusions
Ξc(x) ⊆ Ξ′(x) ⊆ Ξ∗(x) ⊆ Ξ(x) imply that the optimal value of AGRP∗ is bracketed
by the optimal values of GRP and AGRP.
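The role of the auxiliary variable β in relaxing the bilinear terms of (6.12b) can be checked numerically with a small Python sketch. It compares the product of a label and a flow with the largest value of β that the constraints β ≤ 1 and β ≤ min{α, ξ} permit; the function names are hypothetical, and the enumeration covers only a small illustrative range of flow values.

```python
from itertools import product


def bilinear_term(alpha, xi):
    """Product of a label and a flow, as it appears in (6.12b)."""
    return alpha * xi


def largest_beta(alpha, xi):
    """Largest beta permitted by beta <= 1 and beta <= min{alpha, xi}."""
    return min(1, alpha, xi)


def agree_on_integral_points(max_flow=4):
    """True if both terms reach one for exactly the same integral inputs."""
    return all(
        (bilinear_term(a, f) >= 1) == (largest_beta(a, f) >= 1)
        for a, f in product((0, 1), range(max_flow + 1))
    )


print(agree_on_integral_points())
# At a fractional point the relaxed term is larger than the product.
print(bilinear_term(0.5, 0.5), largest_beta(0.5, 0.5))
```

For binary labels and integral flows, the relaxed term permits a label of one exactly when the bilinear term does. At fractional points it can exceed the product, which is consistent with Ξ∗(x) being an outer approximation of Ξ′(x).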
7 Numerical Results
We compare our robust software partitioning approach with single instantiation
partitioning and optimistic multiple instantiation partitioning. More precisely, we
evaluate the following approaches to software partitioning:
1. Robust Multiple Instantiation Software Partitioning. We consider
models AGRP (without connectivity cuts) and AGRP∗ (with connectivity
cuts).
2. Optimistic Multiple Instantiation Software Partitioning. We replace
the inner maximization in AGRP by a minimization. The resulting problem
OP optimizes in view of the least time-consuming control flow consistent with
the given frequency information. We also consider a variant OP∗ that enforces
connectivity.¹
3. Single Instantiation Software Partitioning. We consider model P1 from
Section 2, where every code segment must be assigned to exactly one location.
Since the program’s execution time is uniquely determined by the calling
frequencies and does not depend on the execution sequence, P1 reduces to a
generalized quadratic assignment problem [26].
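The difference between the robust and the optimistic objective can be illustrated with a toy Python sketch. The run times below are hypothetical numbers chosen for illustration and are not taken from the benchmark; each list stands for the run times of one candidate partition under three control flows that are consistent with the same frequency information.

```python
# Hypothetical run times (seconds) per candidate partition and control flow.
run_times = {
    "partition_1": [50, 55, 60],
    "partition_2": [20, 40, 90],
}


def robust_choice(times):
    """Partition with the smallest worst-case run time (min-max)."""
    return min(times, key=lambda p: max(times[p]))


def optimistic_choice(times):
    """Partition with the smallest best-case run time (min-min)."""
    return min(times, key=lambda p: min(times[p]))


print(robust_choice(run_times))
print(optimistic_choice(run_times))
```

The robust rule selects the partition whose slowest control flow is cheapest, whereas the optimistic rule is attracted by a single benign control flow and accepts a far larger worst case.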
The optimal values of the aforementioned partitioning approaches satisfy
$$
\mathrm{OP} \preceq \mathrm{OP}^* \preceq \mathrm{P} \preceq \mathrm{AGRP}^* \preceq \mathrm{AGRP} \preceq \mathrm{P1},
$$
where P denotes the software partitioning problem under complete information (see
Section 2) and ‘⪯’ refers to the ordering of optimal objective values. The first and
fourth inequalities follow from the fact that connectivity cuts reduce the feasible
region of the inner optimization problem in OP and AGRP, respectively. The
last inequality holds since AGRP minimizes over the set of multiple instantiation
partitions, while P1 optimizes over the smaller set of single instantiation partitions.
We apply all software partitioning approaches to a Java simulation program.
Figure 1 illustrates the control flow graph for a representative program run to be
optimized. The graph is obtained with the 3S characterization framework discussed
in [34]. For the sake of clarity, we omit the auxiliary arc that connects the sink
node with the source node throughout this section. We consider three heterogeneous
execution locations A, B and C. The execution and communication costs of the code
segments on the different locations were obtained with the Write-Only Architecture
computation model [24,30]. We use the commercial solver CPLEX 11.2 to solve the
arising MILP models.²
Table 1 compares the solutions of the considered partitioning approaches in
terms of their forecasted and simulated execution times, as well as the number of
duplicate code segments. The forecasted execution times correspond to the optimal
objective values of the corresponding optimization problems, while simulated exe-
cution times are obtained with 3S simulations of the real program execution traces
τ using the optimal assignments and the greedy calling convention. The number of
duplicate code segments represents the difference between the overall number of
instantiations and |V|, the number of instantiations in P1. As expected, the optimistic
model underestimates the factual execution time. Due to its inherent optimism,
the model determines a partition with many duplicate code segments that performs
well for some benign execution flows but performs poorly on average. The single
instantiation model, on the other hand, correctly predicts the factual execution
times but determines a poor partition due to the single instantiation restriction.
The robust partitioning approach, finally, outperforms both methods in the simula-
tions because it predicts the factual execution times more accurately and allows for
multiple instantiation. Figures 2–6 illustrate the solutions obtained by the different
partitioning approaches. Circles indicate code segments that are instantiated once,
while pentagons refer to multiply instantiated code segments. The arcs display the
location-aware control flows that determine the corresponding objective values.

¹ As the inner minimization problem does not have to be dualized, connectivity can be strictly
enforced by using the constraints (6.12) involving binary variables $\alpha \in \mathbb{B}^{N \times V \times L}$.
² CPLEX is a registered trademark of IBM ILOG.
[Figure 1 graphic omitted: control flow graph with nodes G01 to G23 and numeric arc labels.]
Figure 1: Control flow graph of the software benchmark. Nodes are shaded in proportion
to their total inbound communication and computation time requirements on a reference
hardware location with darker shades indicating code segments with greater time require-
ments. G01 and G23 are the unique source and sink of the control flow.
              execution times (secs)
model      forecasted    simulated    # of duplicates
AGRP           479          479              7
AGRP∗          479          479              9
OP             167          871             12
OP∗            321          853             10
P1            1037         1037              0
Table 1: Summary of the solutions for the benchmark instance from Figure 1.
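The figures in Table 1 can be cross-checked with a few lines of Python. The sketch copies the table values into a dictionary, checks that the forecasted values respect the stated ordering of the models (P is not part of the table and is therefore left out), and derives the forecast gap and the simulated speed-up of AGRP over P1. The variable names are illustrative.

```python
# Values copied from Table 1: (forecasted secs, simulated secs, duplicates).
table1 = {
    "OP":    (167, 871, 12),
    "OP*":   (321, 853, 10),
    "AGRP*": (479, 479, 9),
    "AGRP":  (479, 479, 7),
    "P1":    (1037, 1037, 0),
}

# Forecasted values in the order OP, OP*, AGRP*, AGRP, P1.
forecasts = [table1[m][0] for m in ("OP", "OP*", "AGRP*", "AGRP", "P1")]
ordering_holds = forecasts == sorted(forecasts)

# Gap between simulated and forecasted execution time per model.
forecast_gap = {m: sim - fc for m, (fc, sim, _) in table1.items()}

# Simulated speed-up of the robust model over single instantiation.
speedup = table1["P1"][1] / table1["AGRP"][1]

print(ordering_holds)
print(forecast_gap)
print(round(speedup, 3))
```

The robust and single instantiation models show no forecast gap, while the optimistic models underestimate the simulated time by several hundred seconds. The simulated speed-up of AGRP over P1 is slightly above a factor of two.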
8 Conclusion
Previous exact approaches to software partitioning either assume knowledge of the
complete execution sequence or they only support the single instantiation of code
segments. In practice, sequence information is available only at coarse granularity
levels which reduces assignment optimization potentials [24], while the restriction
to single instantiation partitions can severely reduce the achievable program per-
formance, see e.g. Table 1. In this paper we present a novel approach to software
partitioning with multiple instantiation that only requires knowledge of the control
flow graph, which is stripped of all sequence information.
As soon as code segments can be instantiated multiple times, the execution time
of an assignment depends not only on the calling frequencies but also on the execu-
tion sequence. In the absence of such sequence information, we propose to formulate
the multiple instantiation software partitioning problem as a robust optimization
problem that minimizes the worst-case run time over all execution traces consistent
with the known control flow graph. We show that the resulting problem can be
approximated by a MILP amenable to optimization with off-the-shelf commercial
solvers. We also provide results for a benchmark application demonstrating that
our approach compares favourably with alternative software partitioning methods
when evaluated on real execution traces.
We identify three promising areas for future research. Firstly, even though
our method does not require sequence information, it becomes computationally
challenging for large applications. In order to improve the scalability of our method,
we propose the investigation of new formulations and bounding techniques along
the lines of [26]. Secondly, the robust solution may be too conservative in some
instances and the inclusion of additional software characterisation information such
as compressed partial trace sequences generated by the 3S loopgraph d tool [34]
may be investigated as future work. Thirdly, research into the effect of different
non-anticipative calling conventions and calling cost variation over the uncertainty
set Ξc(x) should be performed [35].
Finally, as our robust partitioning approach can be regarded as an extension
of the generalized quadratic assignment problem to multiple instances, and as the
latter problem has manifold practical applications [26], it seems promising to inves-
tigate the applicability of our model to domains outside the software partitioning
arena.
References
[1] R. C. C. Cheung, N. J. Telle, W. Luk, and P. Y. K. Cheung. Customizable elliptic
curve cryptosystems. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 13(9):1048–1059, 2005.
[2] C. A. Constantinides, P. Y. K. Cheung, and W. Luk. Wordlength optimization for
linear digital signal processing. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 22(10):1432–1442, 2003.
[3] S. A. Fahmy, C.-S. Bouganis, P. Y. K. Cheung, and W. Luk. Real-time hardware
acceleration of the trace transform. Journal of Real-Time Image Processing, 2(4):235–
248, 2007.
[4] S. D. Haynes, J. Stone, P. Y. K. Cheung, and W. Luk. Video image processing with
the sonic architecture. IEEE Computer, 33(4):50–57, 2000.
[5] N. Shirazi, D. Benyamin, W. Luk, P.Y.K. Cheung, and S. Guo. Quantitative anal-
ysis of FPGA-based database searching. The Journal of VLSI Signal Processing,
28(1/2):85–96, 2001.
[6] S. Yusuf, W. Luk, M. Sloman, N. Dulay, E. C. Lupu, and G. Brown. Reconfigurable
architecture for network flow analysis. IEEE Transactions on Very Large Scale Inte-
gration (VLSI) Systems, 16(1):57–65, 2008.
[7] Google Inc. http://www.google.com.
[8] G. Estrin. Reconfigurable computer origins: The UCLA fixed-plus-variable (F+V)
structure computer. IEEE Computer, 24(4):3–9, 2002.
[9] R. Ernst, J. Henkel, and T. Benner. Hardware-software cosynthesis for microcon-
trollers. IEEE Design & Test of Computers, 10(4):64–75, 1993.
[10] R. Gupta and G. De Micheli. Hardware-software co-synthesis for digital systems.
IEEE Design & Test of Computers, 10(3):29–41, 1993.
[11] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli. System level hardware/software
partitioning based on simulated annealing and tabu search. Design Automation for
Embedded Systems, 2(1):5–32, 1997.
[12] R. P. Dick and N. K. Jha. Mogac: A multiobjective genetic algorithm for the
co-synthesis of hardware-software embedded systems. In Proceedings of the 1997
IEEE/ACM International Conference on Computer-Aided Design, pages 522–529,
1997.
[13] M. Purnaprajna, M. Reformat, and W. Pedrycz. Genetic algorithms for hardware-
software partitioning and optimal resource allocation. Journal of Systems Architec-
ture: the EUROMICRO Journal, 53(7):339–354, 2007.
[14] M. B. Abdelhalim, A. E. Salama, and S. E. D. Habib. Hardware software partitioning
using particle swarm optimization technique. In Proceedings of the 6th International
Workshop on System-on-Chip for Real-Time Applications, pages 189–194, 2006.
[15] M. Koudil, K. Benatchba, S. Gharout, and N. Hamani. Solving partitioning problem
in codesign with ant colonies. In Artificial Intelligence and Knowledge Engineering
Applications: A Bioinspired Approach. Springer, 2005.
[16] P. V. Knudsen and J. Madsen. PACE: A dynamic programming algorithm for hard-
ware/software partitioning. In Proceedings of the 4th International Workshop on
Hardware/Software Co-Design, pages 85–92, 1996.
[17] S.-R. Kuang, C.-Y. Chen, and R.-Z. Liao. Partitioning and pipelined scheduling
of embedded system using integer linear programming. In Proceedings of the 11th
International Conference on Parallel and Distributed Systems, pages 37–41, 2005.
[18] A. Shrivastava, H. Kumar, S. Kapoor, S. Kumar, and M. Balakrishnan. Optimal
hardware/software partitioning for concurrent specification using dynamic program-
ming. In Proceedings of the 13th International Conference on VLSI Design, pages
110–113, 2000.
[19] J. Wu and T. Srikanthan. Low-complex dynamic programming algorithm for hard-
ware/software partitioning. Information Processing Letters, 98(2):41–46, 2006.
[20] P. Arató, S. Juhász, Z. A. Mann, A. Orbán, and D. Papp. Hardware-software parti-
tioning in embedded system design. In Proceedings of the 2003 IEEE International
Symposium on Intelligent Signal Processing, pages 197–202, 2003.
[21] S. Banerjee, E. Bozorgzadeh, and N. D. Dutt. Integrating physical constraints in HW-
SW partitioning for architectures with partial dynamic reconfiguration. IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, 14(11):1189–1202, 2006.
[22] S. A. Khayam, S. A. Khan, and S. Sadiq. A generic integer programming approach
to hardware/software codesign. In Proceedings of the IEEE International Multi Topic
Conference, pages 6–9, 2001.
[23] R. Niemann and P. Marwedel. Hardware/software partitioning using integer pro-
gramming. In Proceedings of the 1996 European Conference on Design and Test,
pages 473–479, 1996.
[24] S. A. Spacey, W. Luk, P. H. J. Kelly, and D. Kuhn. Coarse-grained parallel partition-
ing through fine-grained sequential assignment for heterogeneous systems. Working
Paper, 2009.
[25] W. Wolf. A decade of hardware/software codesign. IEEE Computer, 36(4):38–43,
2003.
[26] P. M. Hahn, B.-J. Kim, M. Guignard, J. M. Smith, and Y.-R. Zhu. An algorithm
for the generalized quadratic assignment problem. Computational Optimization and
Applications, 40(3):351–372, 2008.
[27] S. Sahni and T. Gonzalez. P-complete approximation problems. Journal of the ACM
(JACM), 23:555–565, 1976.
[28] R. Diestel. Graph Theory. Springer, 3rd edition, 2005.
[29] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory
of NP-Completeness. Freeman, 1979.
[30] S. A. Spacey, W. Luk, P. H. J. Kelly, and D. Kuhn. Rapid design space visualisation
through hardware/software partitioning. In Proceedings of the Fifth IEEE Southern
Programmable Logic Conference, pages 159–164, 2009.
[31] P. Kall and S. W. Wallace. Stochastic Programming. John Wiley & Sons, 1994.
[32] V. Goel and I.E. Grossmann. A class of stochastic programs with decision dependent
uncertainty. Mathematical Programming, 108(2–3):355–394, 2006.
[33] A. Ben-Tal, A. Nemirovski, and L. El Ghaoui. Robust Optimization. Princeton
University Press, 2009.
[34] S. A. Spacey. 3S: Program instrumentation and characterisation framework. Technical
report, Imperial College London, 2006.
[35] S. A. Spacey. Computational Partitioning for Heterogeneous Systems. PhD thesis,
Imperial College London, 2009.
[Figure 2 graphic omitted: code segments G01 to G23 placed on execution locations A, B and C, with numeric arc labels.]
Figure 2: Software partition determined by the robust multiple instantiation model AGRP.
Nodes are shaded in proportion to their total inbound communication and computation
time requirements at each location and arcs are labelled with the number of location aware
control flows. Multiply instantiated nodes are shown as pentagons and singly instantiated
nodes as circles. In contrast to Figure 1, the singly instantiated node G07 is the most time
consuming node because of its high cross partition inbound communication costs.
[Figure 3 graphic omitted: code segments G01 to G23 placed on execution locations A, B and C, with numeric arc labels.]
Figure 3: Software partition determined by AGRP∗, the robust multiple instantiation
model with connectivity cuts. The shapes and shades of the nodes as well as the labels of
the arcs have the same meaning as in Figure 2.
[Figure 4 graphic omitted: code segments G01 to G23 placed on execution locations A, B and C, with numeric arc labels.]
Figure 4: Software partition determined by the optimistic multiple instantiation model
OP. Note that the location-aware control flow is not connected. The shapes and shades
of the nodes as well as the labels of the arcs have the same meaning as in Figure 2.
[Figure 5 graphic omitted: code segments G01 to G23 placed on execution locations A, B and C, with numeric arc labels.]
Figure 5: Software partition determined by OP∗, the optimistic multiple instantiation
model which enforces connectivity. Note in contrast to Figure 4, the location-aware control
flow is now connected. The shapes and shades of the nodes as well as the labels of the
arcs have the same meaning as in Figure 2.
[Figure 6 graphic omitted: code segments G01 to G23 placed on execution locations A and B, with numeric arc labels.]
Figure 6: Software partition determined by the single instantiation model P1. Only two
of the three execution locations were used. The shades of the nodes as well as the labels
of the arcs have the same meaning as in Figure 2. All nodes are singly instantiated and
are shown as circles.
