Dynamisches Scheduling in der Hochsprachen-Compilierung für adaptive Rechner by Gädke-Lütjens, Hagen
Dynamic Scheduling
in High-Level Compilation
for Adaptive Computers
Von der Carl-Friedrich-Gauß-Fakultät
Technische Universität Carolo-Wilhelmina zu Braunschweig
zur Erlangung des Grades
Doktor-Ingenieur (Dr.-Ing.)
genehmigte Dissertation
von Dipl.-Inform. Hagen Gädke-Lütjens
geboren am 15.02.1979
in Berlin
Eingereicht am: 11.03.2011
Mündliche Prüfung am: 27.04.2011
Referent: Prof. Dr. Ulrich Golze
Korreferent: Prof. Dr. Andreas Koch
(2011)
2
Abstract
The single-thread performance of conventional CPUs has not improved sig-
nificantly due to the stagnation of the CPU frequencies since 2003. Adaptive
computers, which combine a CPU with a reconfigurable hardware unit (e.g., an
FPGA) used as hardware accelerator, represent a promising, alternative com-
pute platform. During the past 10 years, much research has been done to de-
velop tools that enhance the usability of adaptive computers. An important goal
here is the development of an adaptive compiler, which compiles hardware de-
scriptions from common high-level languages such as C in a fully automated
way.
Most of the compilers developed until today use static scheduling for the gen-
erated hardware. However, for complex programs containing nested loops,
irregular control flow, and arbitrary pointers, dynamic scheduling is more ap-
propriate. Unexpected operator lantencies which occur, e.g., when accessing
cached memories, force a statically scheduled system to stall all operations in
the design. Using dynamic scheduling, only the directly affected operations
stall, while independent operators can continue their computation. This work
examines the feasibility of compiling to dynamically scheduled hardware, an
approach that has been the subject of only limited research efforts so far.
Based on previous work we have developed the adaptive compiler COM-
RADE 2.0, which generates synthesizable hardware descriptions (using dy-
namic scheduling) from ANSI C. The compiler front-end transforms the input
program into a structured control flow graph, which is partitioned into hard-
ware and software components. For each hardware component, COMRADE
2.0 creates a data flow-based COMRADE Controller Micro-Architecture (CO-
COMA) instance. COCOMA models even complex control and memory de-
pendences and is thus especially suitable as intermediate representation in a
compile flow that supports complex C programs. COCOMA allows pipelin-
ing, speculative execution with early evaluation, and even cancelation of mis-
speculated operations with cancel tokens (CTs). In this context, COCOMA
supports two different token flow models: static and dynamic CTs. From the
COCOMAmodel, the compiler back-end generates synthesizable hardware de-
scriptions (hardware kernels, each consisting of a data path and a sequencer)
3
in the hardware description language Verilog. The data path instantiates hard-
ware operators provided by a platform-independent hardware operator library,
which has also been developed in the context of this work.
We examine the effects of parameter variations and low-level optimizations on
the simulation and synthesis results. Our analysis shows that static CTs outper-
form dynamic CTs, because they save area resources without affecting the run-
time. Operation chaining can help to eliminate unnecessary registers and thus
save runtime as well as area resources. Further runtime savings are possible
by memory access reordering and parallelization. The most promising opti-
mization technique considering the runtime is memory localization which can
significantly increase the memory bandwidth available to the compiled hard-
ware kernels. Using memory localization we have obtained hardware kernel
speed-ups of up to 37x over an embedded CPU. Beyond single kernels, we
have also examined the application-level speed-ups. For a multi-phase image
compression application, we have managed a speed-up of 5x over a 400 MHz
superscalar embedded CPU.
4
Kurzfassung
Bedingt durch die Stagnation der CPU-Frequenzen stagniert seit 2003 auch
die Single-Thread-Rechenleistung herkömmlicher CPUs. Adaptive Rechner
bieten eine vielversprechende, alternative Rechenarchitektur, indem sie die
CPU um eine rekonfigurierbare Einheit (z. B. einen FPGA), die als Hardware-
Beschleuniger verwendet wird, erweitern. In den vergangenen zehn Jahren
wurde viel Forschung in Entwurfswerkzeuge investiert, die eine einfachere
und praktikablere Verwendung adaptiver Rechner ermöglichen sollen. Ein
wesentliches Ziel ist dabei die Entwicklung eines adaptiven Compilers, der
aus einer allgemein verwendeten Hochsprache wie C vollautomatisch Hard-
warebeschreibungen erzeugen kann.
Die meisten der bisher entwickelten Compiler setzen in der erzeugten Hard-
ware statisches Scheduling ein. Für komplexere Programme mit verschachtel-
ten Schleifen, irregulärem Kontrollfluss und beliebigen Zeigerzugriffen ist je-
doch dynamisches Scheduling besser geeignet. Unerwartete Operatorlatenzen,
wie z. B. ein Cache-Miss bei Speicherzugriffen, erzwingen bei statischem
Scheduling das Anhalten aller Operationen in der Hardware-Recheneinheit,
während bei dynamischem Scheduling nur direkt betroffene Operationen ange-
halten werden müssen, unabhängige Berechnungen aber weiterlaufen können.
Diese Arbeit untersucht die praktische Machbarkeit des dynamischen Schedul-
ings, zu dem es bislang kaum Untersuchungen im Kontext adaptiver Rechner
gibt.
Auf der Grundlage bestehender Vorarbeiten haben wir den Compiler COM-
RADE 2.0 entwickelt, der aus ANSI C vollautomatisch synthetisierbare Hard-
warebeschreibungen generieren kann, die dynamisches Scheduling verwenden.
Das Compiler-Front-End erzeugt aus dem Eingabeprogramm einen strukturier-
ten Kontrollflussgraphen, der in Hardware- und Software-Komponenten parti-
tioniert und optimiert wird. Für jede Hardware-Komponente wird dann eine
datenflussbasierte COMRADE Controller Micro-Architecture- (COCOMA-)
Darstellung erzeugt, die auch komplexere Kontroll- und Speicherabhängigkei-
ten korrekt modelliert und daher besonders für die Compilierung komplexerer
C-Programme geeignet ist. COCOMA ermöglicht Pipelining, spekulative Aus-
führung mit frühzeitiger Evaluation und sogar das Abbrechen unnötiger speku-
5
lativer Operationen mit Cancel-Tokens (CTs). Hierbei unterstützt COCOMA
zwei verschiedene Tokenflussmodelle: statische und dynamische CTs. Aus
der COCOMA-Darstellung erzeugt das Back-End dann synthetisierbare Hard-
warebeschreibungen (HW-Kernels, bestehend aus Datenpfad und Steuerwerk)
in der Hardware-Beschreibungssprache Verilog. Der Datenpfad instanziert
Hardware-Operatoren aus einer plattformunabhängigen Operator-Bibliothek,
die in weiten Teilen auch im Rahmen der hier vorgestellten Forschungen ent-
standen ist.
Wir untersuchen die Auswirkungen von Parameteränderungen und Low-Level-
Optimierungen auf die Simulations- und Syntheseergebnisse. Unsere Unter-
suchungen zeigen, dass statische CTs dynamischen CTs überlegen sind, weil
sie bei gleicher Laufzeit Ressourcen einsparen. Mit Operation-Chaining kann
durch geschickte Entfernung von Pufferregistern gleichzeitig die Laufzeit ver-
ringert und die Fläche verkleinert werden. Weitere Laufzeitersparnisse ergeben
sich durch die Umsortierung und Parallelisierung von Speicherzugriffen. Die
für die Laufzeit vielversprechendste Optimierungstechnik ist die Speicherloka-
lisierung, die eine wesentlich höhere Speicherbandbreite bei geringer Latenz
bietet. Wir messen hier Beschleunigungen eines Hardware-Kernels gegenüber
einer eingebetteten CPU von bis zu 37-fach. Neben Betrachtungen einzelner
Hardware-Kernels untersuchen wir auch die Beschleunigung auf Programm-
ebene. Für eine mehrstufige Bildkomprimierung erreichen wir dabei eine
Beschleunigung um den Faktor 5 gegenüber einer superskalaren, eingebetteten
CPU mit 400 MHz.
6
Acknowledgments
I would like to thank Professor Andreas Koch. He gave direction to my re-
search and guided me throughout the development phase of this work. Fur-
thermore, the readability of the text was significantly improved by considering
his comments. Andreas has acquainted me with adaptive computers during my
computer science studies.
I would also like to thank Professor Ulrich Golze for his steady support. The
comprehensibility of the formal descriptions in this thesis improved notably
by taking his suggestions into account. He furthermore provided the working
environment, including hardware and software infrastructure.
Several colleagues have provided me with essential building blocks. The pre-
liminary work of Nico Kasprzyk on COMRADE 1.0 is the basis of the COM-
RADE 2.0 compiler developed in this work. Benjamin Thielmann provided
an initial implementation of Modlib and LMEM. Holger Lange and Thorsten
Wink provided and adjusted MARC as well as the ACE-M5 design and oper-
ating system.
The ML507 development board was donated by Xilinx.
Spelling and grammar in this work were improved by considering Nadia and
Michael Acland’s comments.
The moral support from my wife Hendrikje Lütjens was invaluable throughout
this work.
I thank my parents Edith Gädke-Döblitz and Jürgen Gädke for their moral sup-
port as well as the financial support provided during the completion of my
computer science degree, laying the groundwork for this thesis.
7
8
Contents
1. Introduction 13
2. Basics 21
2.1. Control Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2. Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3. Hardware Operation Scheduling . . . . . . . . . . . . . . . . 31
3. COMRADE 1.0 33
3.1. Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2. Compile Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3. Deficiencies . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1. Hardware Generation . . . . . . . . . . . . . . . . . . 37
3.3.2. Front-end . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3. Memory Accesses . . . . . . . . . . . . . . . . . . . 42
3.3.4. Toolchain . . . . . . . . . . . . . . . . . . . . . . . . 43
4. Related Work 45
4.1. Existing Adaptive Compilers and HLS Frameworks . . . . . . 45
4.1.1. PRISC . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.2. ROCCC . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.3. SPARK . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.4. CASH . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.5. CHiMPS . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2. Early Evaluation and Cancel Tokens . . . . . . . . . . . . . . 53
5. COMRADE 2.0 Execution Model 55
5.1. Generic Target Architecture . . . . . . . . . . . . . . . . . . . 55
5.2. Hardware/Software Co-Execution Model . . . . . . . . . . . 56
5.3. Hardware Execution Model . . . . . . . . . . . . . . . . . . . 59
5.3.1. Modelling Dynamic Scheduling with Early Evaluation 60
5.3.2. COCOMA Equivalents for Software Constructs . . . . 65
9
6. COCOMA 77
6.1. Formal CMDFG Definition . . . . . . . . . . . . . . . . . . . 78
6.2. Formal Sequencer Definition . . . . . . . . . . . . . . . . . . 88
6.3. Formal COCOMA Definition . . . . . . . . . . . . . . . . . . 92
7. Hardware Operator Library 95
7.1. Modlib Parameters . . . . . . . . . . . . . . . . . . . . . . . 96
7.2. Meta Data Fetcher . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2.1. Measurements . . . . . . . . . . . . . . . . . . . . . 101
8. COMRADE 2.0 Compile Flow 107
8.1. Building the CMDFG . . . . . . . . . . . . . . . . . . . . . . 110
8.1.1. Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.1.2. Data Edges . . . . . . . . . . . . . . . . . . . . . . . 111
8.1.3. Control Edges . . . . . . . . . . . . . . . . . . . . . 112
8.1.4. Memory Edges . . . . . . . . . . . . . . . . . . . . . 120
8.2. Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.2.1. Static CTs . . . . . . . . . . . . . . . . . . . . . . . . 126
8.2.2. Dynamic CTs . . . . . . . . . . . . . . . . . . . . . . 128
8.3. Redundant Control Dependences . . . . . . . . . . . . . . . . 129
9. Low-Level Optimizations 131
9.1. Test Platform ACE-M5 . . . . . . . . . . . . . . . . . . . . . 131
9.2. Dynamic CTs vs. Static CTs . . . . . . . . . . . . . . . . . . 133
9.3. Operation Chaining . . . . . . . . . . . . . . . . . . . . . . . 137
9.4. Memory Access Reordering . . . . . . . . . . . . . . . . . . 141
9.5. Memory Access Parallelization . . . . . . . . . . . . . . . . . 143
9.6. Memory Localization . . . . . . . . . . . . . . . . . . . . . . 146
10. Application-Level Study 155
11. Summary and Future Work 167
Bibliography 171
Weblinks 177
Index 179
Acronyms 183
Appendices
10
A. Sequencer Conditions 185
B. CMDFG Node Types for C Operations and Statements 213
C. Regression Tests 215
C.1. Source Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 215
C.2. Simulation Results . . . . . . . . . . . . . . . . . . . . . . . 224
D. C Sources 225
D.1. Fcdf22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
D.2. GfMultiply . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
D.3. MD5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
D.4. Memcopy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
D.5. Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
D.6. SHA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
D.7. Susan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
D.8. Fcdf22_local . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Curriculum Vitae 241
11
12
1. Introduction
Computers advance into an expanding set of application domains. They are
used in commerce and research, in industry as well as in education, entertain-
ment, and infotainment. Many applications require increasingly more compu-
tational performance for more accurate simulations, faster data transformation,
higher throughput, more realistic games, and ever more features. Mobile ap-
plications have to achieve these objectives with a limited power budget. Most
of these applications are written in software which is executed on a central
processing unit (CPU). Despite numerous sophisticated enhancements, CPUs
currently suffer from the 4 GHz performance barrier which seems to be insur-
mountable without fundamental architectural modifications. A very promising
option is the usage of flexible hardware accelerators which can be adjusted to
the currently executed application. Such accelerator devices do exist, however,
they are not used commonly today due to the lack of adequate compilers. This
work develops such a compiler.
During the past 40 years, the CPU complexity (i.e., the number of transistors)
has increased exponentially (Fig. 1.1) according to Moore’s Law (these num-
bers relate to Intel only, however, similar results hold for other CPU brands).
Similarly, the CPU frequency has increased exponentially, continuously boost-
ing the CPU performance. Unfortunately, this has caused an exponential
growth in CPU power consumption, exceeding 100 W in 2003. Due to the
resulting heat generation, frequencies higher than 4 GHz are hard to handle, so
that we have been experiencing a frequency stagnation between 3 and 4 GHz
since then. Instead of increasing the frequency, the CPU manufacturers now
integrate multiple CPU cores in a single chip and use ever larger caches, so
that the number of transistors per chip still grows exponentially. While mul-
tiple cores are a good choice to improve the performance of multi-threaded
applications, they do not help to increase the performance of single-thread pro-
grams (making up a good portion of the existing SW today). As both the CPU
frequency and the number of instructions processed per clock cycle have stag-
nated, the single-thread performance has also been stagnating since 2003.
Thus it makes sense to investigate alternative compute models. A possible
option is adaptive computers. These combine a conventional CPU with a re-
13
1. Introduction
Instructions / Cycle
Power [W]
Frequency [MHz]
Transistors
Figure 1.1.: Intel CPU trends [Josh11].
14
configurable device used as hardware (HW) accelerator. Such a device is able
to emulate arbitrary hardware circuits1, and it can be reprogrammed as often as
required to fit the needs of a given application. The CPU executes those parts
of an application which cannot be accelerated by the reconfigurable device.
This computational concept is called adaptive computing. The most popular
reconfigurable devices used for adaptive computing are field programmable
gate arrays (FPGAs), offering a fine-grain and regular structure of reconfigur-
able units (e.g., look-up tables and flip-flops) as well as hard-wired functional
blocks such as memory, multipliers, and even embedded CPU cores.
Until today, adaptive computing has successfully been used to accelerate ap-
plications from a wide spectrum of fields. The classic example is the gene se-
quence comparison. In 1993, the SPLASH-2 ACS [Hoan93] achieved a speed-
up of 1,300 over one of the fastest commercial computers available at that
time. Currently, the largest speed-up reported in this field is 3,000, achieved by
a system consisting of an Intel Xeon CPU @ 2.8 GHz and two Xilinx Virtex-4
FPGAs, over the Xeon alone [BuNa08]. Other worthwhile speed-up exam-
ples (yet not that spectacular) can be found in cryptography (10x) [ShVu93],
wavelet image compression (7x) [GäKo04], molecular dynamics simulations
(10x) [GuHe07], and exhaustive search (27x) [Schu07].
In addition to the speed-up, adaptive computing is promising due to its low
energy consumption. Despite speeding up the computation over a general-
purpose CPU, an FPGA often reduces power consumption by an order of mag-
nitude [GäKo04] [LSKH09].
Even with these advantages, adaptive computing has not yet become a main-
stream methodology. Probably the most prominent reason for this is the pro-
gramming complexity. Before a program can run on an adaptive computer, it
has to be partitioned into software (SW) and hardware (HW) parts, and both
parts must be designed according to the target technology, as shown in Fig. 1.2.
Traditionally, HW/SW partitioning is done manually, and the FPGA is pro-
grammed using a hardware description language (HDL) such as Verilog or
VHDL. Both steps, especially the latter, require special knowledge about hard-
ware design and the target architecture, forming a barrier for the large majority
of programmers who are short on HDL experience. Furthermore, designing an
application in an HDL is much more costly and time consuming compared to
software development. Thus, an important issue for the acceptance of adaptive
computing is the development of high-level tools that simplify programming
1The complexity of the HW circuit that can be emulated is limited by the area resources provided
by the device.
15
1. Introduction
HW/SW Partitioning
SW Implementation HW Implementation
Behavioral Description
Combined HW/SW
Executable
Figure 1.2.: Design flow for adaptive computer applications.
for adaptive computers. At this point, adaptive computing of course over-
laps with general high-level synthesis (HLS), which is the “automated gener-
ation of the hardware circuit of a digital system from a behavioral description”
[GGDN04].
Many of today’s high-level tools are built on software compiler frameworks
for two reasons. First, the input language (e.g., C) is already widespread and
known to many users. Second, many existing code optimizations from the
software world can be re-used for hardware development. Example SW frame-
works used in HW generation are SUIF2 [ADHL00] (used by COMRADE
[KoKa05], ROCCC [GuNB08], and Molen [Pana07]) and LLVM [Latt02]
(used by CHiMPS [PBDM08]).
By now, a generic compile flow (or synthesis flow) for creating hardware from
high-level languages (HLL) has emerged (Fig. 1.3). From the HLL represen-
tation, a control flow graph (CFG) is built and optimized by several compiler
passes. Scalar replacement of array accesses is an important method to reduce
memory accesses, which is often critical for efficient hardware solutions. Con-
stant propagation, common subexpression elimination, and dead code elim-
ination reduce the hardware area required. Loop unrolling can increase the
instruction level parallelism (the number of operations executable in parallel),
at the expense of bigger designs. Furthermore, the CFG conversion to static
single assignment (SSA) form has proven to be very useful for the transforma-
tion to low-level, hardware-centric representations such as (control) data flow
graphs ((C)DFGs). These are used as intermediate steps before creating the
actual hardware representation (typically in an HDL).
16
Scalar Replacement
Constant Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Unrolling
High-Level Language
Hardware Representation
CFG
SSA Form
(C)DFG
Figure 1.3.: Generic compile flow for hardware generation from high-level lan-
guages.
17
1. Introduction
The generation of the hardware representation (as well as several intermediary
compile passes) heavily depend on the scheduling technique used. Scheduling
determines which parts of a hardware computation are executed at a certain
time. Static scheduling (Fig. 1.4(a)) defines fixed time slots (1, 2, 3 in the
Figure) which are assigned HW operations at compile time. A global state
machine controls (at runtime) which time slots (and thus HW operations) are
currently active.
Dynamic scheduling (Fig. 1.4(b)) does not use such time slots. Instead, op-
erations execute independently as soon as their input data is available. The
availability of input data is denoted by the presence of tokens (the circles con-
taining a + in the Figure) similar to Petri-nets. Such self-organizing systems
are more flexible and better suited for variable operator latencies which can,
for example, occur when a cached memory is accessed. While reading a data
word from the cache could take just 2 cycles, the same read access might need
much longer (e.g., 100 cycles) if that data word has to be loaded from the
main memory. If such a memory stall occurs in a statically scheduled system,
the complete state machine stalls. In dynamic scheduling only the read opera-
tion stalls, while other independent operations can continue their computation.
Thus dynamic scheduling can save execution time versus static scheduling.
add
sub
mul
 
 
  
 
 
  
read
1
2
3
(a) (b)
add
sub
mul
 
 
 
  
  
 
read variablelatency
+
+
+ +
+
+
Figure 1.4.: (a) static scheduling, (b) dynamic scheduling.
While static scheduling is used by most current synthesis flows, only limited
research has been done on dynamic scheduling in the context of adaptive com-
puting to date. This work examines the applicability of dynamic scheduling to
a compile flow for adaptive computers.
18
Our major goals are:
• The design of a hardware execution model using dynamic scheduling,
suited for compilation from high-level languages to hardware descrip-
tions for adaptive computers.
• Simple programmability, i.e., restrictions and extensions of the input lan-
guage should be kept minimal.
• The evaluation of the execution model using an appropriate compiler
framework.
This thesis is structured as follows: Chapter 2 defines basic terms used through-
out this work. Chapter 3 introduces the initial version of COMRADE, a high-
level compiler for adaptive computers used as a basis for this work. Chapter 4
reviews other existing high-level compilers. Chapter 5 illustrates the HW/SW
execution model, which is then formally defined in Chapter 6. Chapter 7 de-
scribes the hardware operator library used by our approach. The compile flow
which integrates our model is presented in Chapter 8. Chapter 9 gives simula-
tion and synthesis results obtained for different optimization techniques, while
Chapter 10 examines the application-level speed-up. The work is summarized
in Chapter 11.
19
1. Introduction
20
2. Basics
This Chapter defines fundamental terms used throughout this work.
An adaptive compiler is a compiler which transforms a program written in a
high-level language (e.g., C, C++, Java) into a combined HW/SW executable
for an adaptive computer. An adaptive compiler can use several intermediate
representations (IR) of the input program to simplify compiler optimizations
or to successively transform the input program into the target representation.
The two most prominent IRs used in adaptive compilers and high-level syn-
thesis are control flow graphs and data flow graphs; these are discussed in the
following two Sections. Subsequently, Section 2.3 introduces important terms
related to the scheduling of hardware operations.
2.1. Control Flow
Definition 2.1.1. A control flow frame CF = (N, E, ann, start, end) consists of
• a finite directed graph (N, E) with nodes N and directed edges E ⊆ N×N,
• an annotation ann : E → Z ∪ {} assigning an integer or  to each edge,
• and two distinct nodes start, end ∈ N.
Furthermore, every node is reachable from start, and end is reachable from every
node. When a control flow frame is used to model the control flow of a pro-
gram, ann annotes the outgoing edges of conditional branches (e.g., if, switch in
C) to define a branch target depending on the outcome of the condition. The
 value is used for default successors in the switch case as well as for ordinary
(non-branching) edges.
Fig. 2.1(a) shows a sample CF: Edges (a, b) and (a, c) are annotated with 0
and 1 respectively, while the remaining edges are annotated with , which is
usually omitted in graphical representations.
21
2. Basics
start
a
b c
d
end
0 1
(a) (b)
start
a
end
0 1
b
c d
e
0 1
Figure 2.1.: Graphical representations of control flow frame samples.
Definition 2.1.2. A basic block is a (possibly empty) sequence of program
statements without intermediary branches, i.e., at most the last statement of the
sequence may be a branch statement.
We now combine the control flow frame with basic blocks to define the control
flow graph.
Definition 2.1.3. A control flow graph CFG = (CF,BB, nb) consists of
• a control flow frame CF,
• a set of basic blocks BB, and
• an assignment nb : N → BB of nodes to basic blocks.
A CFG represents a program by mapping each node of a control flow frame CF
to a basic block, the edges of CF defining the program order. As anticipated in
the CF definition, the edge annotations define which of the succeeding basic
blocks is executed after a conditional branch. In a graphical representation, the
statements contained in a basic block b are typically printed inside CFG node
n, if n is mapped to b, i.e., nb(n) = b.
Definition 2.1.4. Given a CFG with nodes N, edges E, start and end node, we
define several useful terms:
• n ∈ N is a branch node, if n has more than 1 successor.
22
2.1. Control Flow
• n ∈ N is a join node, if n has more than 1 predecessor.
• A region is a set of nodes R ⊆ N.
• A path p is a list of nodes, such that for each two successive nodes n, m
in p, there is an edge (n,m) ∈ E. If the first node of p is n and the last
node is m, then p is said to be a path from n to m.
• d ∈ N dominates n ∈ N, if every path from start to n contains d.
• z ∈ N post-dominates n ∈ N, if all paths from n to end contain z.
• The post-dominance frontier (PDF) of z ∈ N is a subset {n1, . . . , nk} of
N, such that for ni ∈ PDF(z): ni is not post-dominated by z, and ni has a
successor which is post-dominated by z.
Referring to Fig. 2.1(b), we give some examples for the latter definitions: a
and b are branch nodes, e is a join node, {a, b, c, e} is a region, and (a, b, d)
is a path. a dominates b, c, d, e, and end. e post-dominates b, c, and d. The
post-dominance frontier of e is {a}.
The post-dominance frontier is essential for finding the CFG nodes which de-
termine if a given CFG node is entered during the program execution. We point
this out with the notion of control dependence.
Definition 2.1.5. Given a CF with node set N and nodes c, n ∈ N, n is control
dependent on c if c is contained in the post-dominance frontier of n.
If n is control dependent on c, we call c a controller of n, or in short: c controls
n. Note that in contrast to Ferrante’s original definition of control dependence
[FeOW87], nodes never control themselves, which is a consequence of the
post-dominance frontier definition above.
To give an example, in the CF in Fig. 2.1(b), a controls b and e, while b
controls c and d. Note that a does not control c or d, because neither of the two
post-dominates b.
In high-level synthesis (HLS), it is often desirable to simplify the control flow
of a program. This simplifies optimization passes on the one hand, and can be
a requirement for certain hardware implementations on the other. In this work,
structuredness as a central property of programs and CFGs is a requirement for
hardware generation. A structured program uses only concatenation, selec-
tion, and repetition of statements as defined by Dijkstra1 [Dijk72]. In particu-
1Dijkstra disapproves of goto statements explicitly and focuses on concatenation, selection, and
repetition. However, the principle of using only these three control constructs is also known as
imperative programming and has been used long before Dijkstra’s definition.
23
2. Basics
lar, a program which does not contain any goto statements, neither explicit nor
implicit (e.g., break and continue statements in C loops) is structured. The struc-
tured programming theorem [BöJa66] states that every computable function
can be implemented in a programming language that combines subprograms
using only these three control structures. Thus, we can restrict HLS to struc-
tured programs without loss of generality: unstructured programs can be made
structured before the actual HLS is applied.
Beyond structuring, we can further normalize programs prior to the HLS.
Bottom-testing loops (e.g., do-while in C) can be replaced by top-testing loops
(while, for) by inserting an additional flag into the loop condition as shown in
Fig. 2.2. We call a structured program without bottom-testing loops a top-
structured program, or short t-structured program.
(a) (b)
do {
    ...
} while (c);
int start = 1;
while (start || c) {
    start = 0;
    ...
}
Figure 2.2.: (a) Bottom-testing loop; (b) replaced by top-testing loop.
According to t-structured programs, we define t-structured CFs in a bottom-up
manner, using Dijkstra’s three building blocks of structuredness.
Definition 2.1.6. A t-structured control flow frame is a control flow frame
which can be built up starting with the initial CF illustrated in Fig. 2.3(a) and
successively applying one of the three t-structured transformations (b), (c),
(d) shown in Fig. 2.3.
The example in Fig. 2.4(a) shows a t-structured CF. It can be generated by
applying the substitutions in the order (b), (c), (d) to the initial CF. In con-
trast, the CF shown in Fig. 2.4(b) is not t-structured, which can be proven by
trying all combinations of at most four substitution steps (every substitution
adds at least one node). As an indication for the non-t-structuredness of this
example, observe that after generating the alternative branches (a, b), (a, c) us-
ing substitution (c), there is no substitution which can insert an edge from c to
b.
The t-structuredness can of course be propagated to CFGs.
Definition 2.1.7. A t-structured control flow graph is a control flow graph
24
2.1. Control Flow
a
(b)
start
a
end
a
b
a
(c)
a
c
x
y
x
ε
y
b
1
b
m
...
(a)
x
y
x
y
z
1
z
m
ε ε
z
i
ε
a
(d)
a
b
x
ε
x
0
1
ε
Figure 2.3.: (a) Initial CF; (b) concatenation; (c) selection; (d) repetition.
(a)
a
b
0
(b)start
end
c
de
1
10
a
b
0
start
end
c
d
1
1
0
Figure 2.4.: (a) T-structured CF; (b) non-t-structured CF.
25
2. Basics
CFG = (CF,BB, nb), such that its control flow frame CF is t-structured.
Lemma 2.1.1. Every t-structured program can be represented by a t-structured
CFG.
Proof. This can be shown by induction over the types of statements allowed in
the source language, which we omit here for brevity. 
In Fig. 2.3, transformations (c) and (d) introduce new branch and join nodes.
Note that this induces a one-to-one mapping ρ from the inserted branch nodes
to the inserted join nodes: ρ(b) = j maps a branch node b to the join node
j which has been introduced during the same transformation in which b has
been introduced. Note that a branch node introduced by (d) is mapped to itself,
being a join node at the same time.
Proposition 2.1.1. In a t-structured CFG, every node has at most one con-
troller.
Proof. We show this by induction over the t-structured transformations. In the
initial control flow frame (Fig. 2.3(a)), obviously none of the three nodes has a
controller. We now assume that none of the nodes in a given t-structured CFG
has more than one controller. We pick an arbitrary node a and apply one of
the t-structured transformations shown in Fig. 2.3(b)-(d). First, we assume that
concatenation (b) is applied. If a had a controller before applying the trans-
formation, the inserted node b has the same controller after the transformation,
because b post-dominates a. Otherwise, neither a nor b have a controller. Sec-
ond, we assume that selection (c) is applied. If a had a controller before the
transformation, then a and c obviously have the same controller after the trans-
formation (and no controller otherwise). The newly inserted alternative nodes
b1, . . . , bm are controlled by a (and only a), because they do not post-dominate
a. Third, we assume that repetition (d) is applied. Because b does not control a
and a does not control itself, the transformation leaves the control dependence
of a invariant. The newly inserted node b is control dependent on a (and only
on a), because it does not post-dominate a. 
The next definitions give a notion of loops, consisting of loop header and loop
body.
Definition 2.1.8. Given a t-structured CFG with node set N and edge set E.
• e = (n, d) ∈ E is a back-edge, if d dominates n.
26
2.1. Control Flow
• l ∈ N is a loop header, if it has an incoming back-edge.
Due to the structuredness, a loop header has exactly one incoming back-edge.
Given a loop header l with incoming back-edge (n, l),
• the loop body LB(l) ⊂ N of l contains all nodes fromwhich l is reachable
only by passing through n, i.e., LB(l) = {k ∈ N : ∀ paths p from k to l :
n ∈ p}.
Examples for the latter definitions can be found in Fig. 2.1(b): (e, b) is a back-
edge, b is a loop header, and {c, d, e} is the loop body of loop header b.
We conclude this Section by introducing several terms needed for hardware/-
software partitioning CFGs.
Definition 2.1.9. A partitioned control flow graph consists of
• a control flow graph with node set N, and
• an additional mapping function m : N → {hw, sw, swsub}, which
maps each node to either hardware (hw), software (sw), or sub-software
(swsub). Swsub nodes will be used to model SW services (see below).
Definition 2.1.10. Given a partitioned CFG (N, E, ann, start, end,m).
• A hardware region (HW region) is a region in which all nodes are
mapped to hardware or sub-software. Furthermore, a HW region H is
maximal in the sense that no node in H has a successor or predecessor
node which is mapped to hardware, but which is not contained in H.
• A software region (SW region) is a region in which all nodes are
mapped to software. A SW region is maximal in analogy to HW regions.
• A software subregion (SW subregion) of a hardware region H is a
weakly connected subset S of H, such that all nodes of S are mapped
to sub-software and such that S is maximal in analogy to HW regions.
• A hardware-to-software transition (HW/SW transition) is an edge
(h1, s2) ∈ E such that m(h1) = hw, and m(s2) ∈ {sw, swsub}.
• A software-to-hardware transition (SW/HW transition) is an edge
(s1, h2) ∈ E such that m(s1) ∈ {sw, swsub}, and m(h2) = hw.
27
2. Basics
2.2. Data Flow
While CFGs, based on sequential statements in basic blocks, form a software-
related intermediate representation (IR), data flow-based representations are
more hardware-centric and thus belong to the low-level IRs of adaptive com-
pilers. To give a definition of the popular data flow graph, we adhere to
[GGDN04].
Definition 2.2.1. A data flow graph (DFG) consists of
• a finite, directed, acyclic graph (N, E),
• a set of operations OP, and
• an annotation op : N → OP assigning each node an operation.
Each instance of an operation assigned to a node can produce and consume
data. The edges E ⊆ N × N represent data flow dependences, i.e., a directed
edge (a, b) exists in E if data is produced by op(a) and consumed by op(b).
For simplicity, we will write a instead of op(a) when addressing the operation
assigned to a.
Fig. 2.5(b) shows the data flow graph for the code shown in Fig. 2.5(a). Note
that the variable identifiers shown in Fig. (b) are not part of the DFG. The
same applies for the edges lacking a source or target node; they are shown just
to clarify the correlation to the code in Fig. (a).
(a) (b)
c = a + b;
e = c / d;
f = e - c;
/
+
c
e
-
c
d
a b
f
Figure 2.5.: (a) Sample C code; (b) DFG representing the C code.
DFGs can be used to model synchronous hardware on the register transfer
level. In this case, the behavior of each operation is aligned to a dedicated
28
2.2. Data Flow
clock signal. The time between two rising clock edges is one cycle. In this
context, we define several properties of operations.
Definition 2.2.2. Given an operation p. The latency of p is a non-negative
integer denoting the number of cycles needed to compute a result from the
operation input values.
Definition 2.2.3. Given an operation p with latency lat. The initiation inter-
val ii of p, 0 ≤ ii ≤ lat denotes the mininum number of cycles we have to wait
before we can feed the next set of input values into p. Note that the latency can
be 0, which allows us to model the execution of several consecutive operations
during the same cycle.
Definition 2.2.4. Given an operation p with latency lat > 1 and initiation
interval ii. If ii < lat, then p is pipelinable. If ii = 1, p is fully pipelinable. If
ii = lat, p is non-pipelinable.
Definition 2.2.5. A timed data flow graph consists of
• a data flow graph,
• a latency function lat : OP → N0, assigning each operation a latency,
and
• an initiation interval function ii : OP → N0, assigning each operation
an initiation interval.
DFGs and timed DFGs are limited to data dependences. For example, they
do not account for control dependences such as if/else structures in high-level
languages (Fig. 2.6(a)). To a certain extent, it can be reasonable to replace
control dependences with pure data flow (Fig. 2.6(b)). A disadvantage of this
is that every operation is executed, even if their results are not needed at run-
time, e.g., when they originate from a branch which is not taken. Depending
on the scheduling strategy, this can have a negative impact on the execution
time and the energy consumption. Furthermore, eliminating the control depen-
dences of operations with side effects (e.g., write accesses to the memory as
shown in Fig. 2.6(c)) is impragmatic in most cases, because this would either
require side effect fixes or a redesign of the source code in order to replace side
effect operations by operations without side effects (e.g., completely removing
pointers and arrays).
Control dependences can be integrated into the DFG concept by extending the
DFGwith control (dependence) edges. The resulting structure is called control
29
2. Basics
(a) (b)
if (i > 2) {
    a = b;
} else {
    a = c;
}
a = b * (i > 2) +
    c * (1 - (i > 2));
(c)
if (i > 2) {
    *p = b;
} else {
    b = *p;
}
Figure 2.6.: (a) Sample C code with control dependences; (b) control depen-
dences replaced by data dependences; (c) Sample C code contain-
ing statements with side effects (memory accesses).
data flow graph (CDFG)2. Control edges are annotated with conditions, their
result depending on the result of the source node operation. Typically, control
edges represent the predicated execution semantics, i.e., the target node is
executed only if the control edge condition is true. Fig. 2.7 shows the CDFG
for the C code in Fig. 2.6(c). The two control edges represent the if/else control
structure: If i > 2 is true (==1), the store is executed, otherwise the result is
false (==0) so that the load is executed.
>
i 2
store load
p
b
==0==1
p b
Figure 2.7.: CDFG for the C code in Fig. 2.6(c).
To save runtime, it can be reasonable to execute control dependent operations
(i.e., targets of control edges) already before it is known if the result is actually
needed. This technique is called speculative execution. Speculative execution
of an operation is of course only allowed if the operation has no side effects or
if there are mechanisms to care for the impacts of side effects (e.g., a memory
system allowing speculative write accesses). Practically, speculative execution
of an operation n can be implemented in two ways. The first way is to re-
2This is only a first glance at CDFGs; we will later define a model in detail which additionally
contains memory dependences.
30
2.3. Hardware Operation Scheduling
move the control edge(s) that target(s) n. The second way is to change the
control edge semantics so that n is executed as soon as its data dependences
are fulfilled, but the result is not propagated to successive operations before
the control dependences are fulfilled. We term the latter technique speculative
predicated execution.
2.3. Hardware Operation Scheduling
When we transform a data flow-oriented representation to an actual hardware
representation (e.g., register transfer logic in a hardware description language)
we have to define a schedule, i.e., we define when each of the operations is ex-
ecuted. A well-known and widely used technique is static scheduling, which
schedules the operations into fixed time slots. This approach keeps the con-
trol logic simple (ordinary state machines), and it allows resource sharing,
i.e., using the same pieces of hardware to compute different operations during
different time slots. However, static scheduling requires the compiler to know
the latencies of each operation. If an operation has a variable latency (e.g.,
a cached memory access), the compiler schedules for an expected latency. If
the actual latency is smaller than expected, successive operations execute later
than they could. If the actual latency is greater than expected, the complete
state machine has to be stalled. Both cases result in idle time. Another dis-
advantage is that the number of registers or pipeline steps in a computational
path of several operations is fixed. It is not possible to locally insert or remove
a register without adjusting the whole state machine. The insertion of registers
can be useful to enable higher frequencies; register removal in turn reduces
latencies, but increases combinatorial logic delays.
Dynamic scheduling is a more flexible, yet less common technique. Instead of
mapping operations to fixed time frames, it defines conditions which determine
the time frames at runtime. Thus, even operations with variable latencies can
be handled adequately. For this flexibility, dynamic scheduling accepts an area
increase of the resulting hardware. First, the sequencer, i.e., the logic deter-
mining when to enable which operation, is larger than a state machine which
would have been used for static scheduling. Second, there are fewer resource
sharing opportunities because it is not known at compile time when an opera-
tion will actually execute. In dynamic scheduling valid data items are marked
by tokens, usually implemented as a single bit of information at data inputs and
outputs. More sophisticated models (such as this work) use different kinds of
tokens (cf. Section 4.2).
31
2. Basics
Dynamically scheduled HW can be viewed as a latency-insensitive system
(LIS) [CaMS01]. In an LIS, composed of modules and communication chan-
nels between the modules, the channel latency does not affect the correctness
of the system. Similarly, the latency of an operator in dynamically scheduled
HW does not affect the correctness of the HW.
32
3. COMRADE 1.0
The roots this work has emerged from reach back to the NIMBLE compiler
[MacM01]. The aim of NIMBLE was to synthesize hardware from ANSI C in-
put code. The drawbacks of NIMBLE included the restriction of hardware cre-
ation from innermost loops only, and its technology dependence. Only Xilinx
XC4000 FPGAs as well as the GARP architecture [HaWa97] were supported
as target platforms.
To overcome these restrictions, Nico Kasprzyk tried to generalize the hard-
ware creation techniques used by NIMBLE. He developed a new compiler for
adaptive computers from scratch between 2001 and 2005 [Kasp05]. The re-
sult of this work was the first version of the COMRADE compiler, which we
therefore call COMRADE 1.0. This Chapter gives an overview of the compile
flow and implementation. We will also point out limitations (both conceptual
and concerning the implementation) to better highlight the progress made in
COMRADE 2.0, one of the main achievements of this current work. Note that
despite its weakness, COMRADE 1.0 was a far more ambitious project than
NIMBLE and provided numerous insights useful for guiding our own research.
3.1. Features
The basic idea of COMRADE is to support the complete ANSI C language
without additional annotations, including arbitrarily nested loops and control
conditions as well as arbitrary array references and pointers. Furthermore,
hardware creation should not be bound to a specific target FPGA to support
virtually any reconfigurable compute platform.
When creating hardware in such a general way, using an advanced compu-
tational model is essential for the hardware efficiency (e.g., concerning the
runtime) compared to competitive computing architectures, such as execution
on an embedded CPU. Therefore, COMRADE has been designed to generate
dynamically scheduled, speculative hardware from the beginning. Dynamic
scheduling allows efficient execution even in the presence of variable oper-
33
3. COMRADE 1.0
ator latencies. Speculative execution is another technique used to decrease the
runtime: Before a branch occurs, the computations in all branch targets are
already precomputed (if this is not limited by other dependences). This tech-
nique is very amenable to hardware, because it can be done without side effects
such as pipeline refills after a branch misprediction in CPUs.
3.2. Compile Flow
The COMRADE 1.0 Compile Flow is depicted in Fig. 3.1. COMRADE is
based on the Stanford SUIF2 framework [ADHL00] and therefore consists of
several SUIF2 passes. Inputs to COMRADE are a C program in source code,
user constraints, and the GLACE module generators [NeKo01]. After an initial
preparation, which includes a C preprocessor (CPP) run on the input C code
and the transformation to the abstract syntax tree (AST) based SUIF2 format,
COMRADE performs its own analysis and transformations. As a first step, the
dynamic profiling pass compiles the program using a standard software com-
piler, executes it on a sample input data set (specified in the user constraints),
and attaches the gathered profiling data to the AST statements. The next pass
inlines the procedures which are called most frequently; the inlining threshold
is obtained from the user constraints. This elimination of procedure calls sim-
plifies the later hardware generation. After that, a control flow graph (CFG) is
built for each procedure. This includes the execution frequency annotation of
each CFG node.
In the HW/SW partitioning pass, the CFG nodes exceeding a user defined ex-
ecution frequency are marked for hardware execution, while the other nodes
remain software-associated. Connected hardware CFG nodes form a hardware
region in the CFG, called a hardware kernel after the ensuing transformation
to hardware. By construction, a hardware region is a (possibly nested) loop.
For the case of a nested loop, Kasprzyk argues that there might be cases in
which it cannot be determined at compile time whether execution of the whole
loop nest or only of the deeper nesting levels is more efficient. He has therefore
integrated the loop duplication pass, which copies a hardware-associated loop
nest multiple times and thus generates all reasonable HW/SW associations of
the nesting levels. This allows for switching between hardware and software
execution on runtime conditions.
The next pass performs an array iteration space analysis using the Omega
framework [Pugh91]. Results of this test can be used for later optimization
passes.
34
3.2. Compile Flow
C Source
SUIF2
Passes
Control Flow Graph (CFG)
HW/SW Partitioning
SSA Conversion
High-Level Optimizations
Constant Propagation
Bitwidth Reduction
Tree Height Reduction
Scalar Replacement
HW Kernel Compaction
SW Part (C) Data Paths
.v
Configuration Scheduling
FPGA
Configurations
Dynamic Profiling
Profile-based Inlining
Loop Duplication
Omega Dependence Analysis
Meta Data Annotation
Control Data Flow Graph (CDFG)
Sequencer
Verilog / SLIF / EDIF
Sequencers
.slif
Operators
.v
Sequencers
.v
ULK Tool
GLACE
Generators
Operators
.edf
Vendor
Tools
GCC
Adaptive
Computer SimulationSW Executable
Query
Meta Data
EDIF
CPP
Transform to SUIF2 Format
Self-Assignment Statements
User Constraints
Figure 3.1.: The COMRADE 1.0 Compile Flow.
35
3. COMRADE 1.0
Before the conversion to the SSA form [CFRW91], COMRADE iterates over
all loop bodies in HW regions and inserts a self-assignment statement (such as
n = n;) for each variable which is read by an expression, but not assigned any
value in the current loop body. Thus, after SSA conversion, for every variable
that is written-to or read-from, there is a phi statement in the loop header having
two inputs: one from outside the loop, and a feed-back input from the loop
body itself. This feed-back ensures that each variable is reactivated for each
loop iteration.
A basic concept in COMRADE 1.0 is the correspondence between C opera-
tions as well as statements in the CFG and the HW operators modeled later
in the CDFG. Therefore, COMRADE inserts dedicated statements at HW/SW
and SW/HW transitions to model the variable transfer between software and
hardware. At HW/SW transitions, a statement for the sending of an IRQ signal
is inserted. These statement insertions are performed after the SSA conversion,
but integrated into the SSA conversion pass.
Next, several high-level optimizations improve the hardware eligibility, among
them constant propagation, bitwidth reduction, height reduction of abstract
syntax trees, and scalar replacement. The latter uses dependence information
gained from the previous Omega analysis to forward data loaded from memory
to successive loop iterations as well as to relocate array accesses, reducing the
total number of memory accesses.
The C expressions and statements in the CFG are then annotated with meta
data describing the area requirements and the critical path of their hardware
operator counterparts. This meta data is obtained from the GLACE module
generator library.
During HW kernel compaction, hardware region CFG blocks are reverted to
software if they do not fit onto the target FPGA or contain statements which
cannot be transformed to hardware. These are floating point computations
and (non-inlined) procedure calls, e.g., to the C library. HW/SW interfaces
(memory-mapped transfer of live variables, interrupt handling) are inserted ac-
cordingly.
Subsequently, a control data flow graph (CDFG) is created for each hardware
region. This involves creating hardware operator nodes for each C expres-
sion and statement, connecting these nodes by data edges according to the data
dependences, and adding further control edges needed for predicated and spec-
ulative execution. COMRADE here even inserts memory edges resembling
memory dependences, allowing the integration of memory access nodes into
the CDFG. The mechanisms used here are still limited and error-prone, but
36
3.3. Deficiencies
they form the cornerstone of the more sophisticated COCOMA representation
developed in this work (cf. Chapter 6).
Each CDFG is then equipped with sequencer logic used to control the token
flow at runtime. Then, COMRADE outputs for each CDFG a data path Verilog
module which instantiates and connects the hardware operators obtained from
GLACE in the EDIF format [Kahn95] (FPGA vendor tools are used to convert
them to Verilog simulation files). COMRADE outputs the sequencers in SLIF
format [Drey94], which is converted to Verilog using the conversion tool ULK
[Drey94]. SLIF was chosen as the intermediary format as it was able to apply
(at that time) advanced logic optimization techniques using SIS [SSLM92].
The final reconfiguration scheduling step [KaVK05] aims to merge hardware
kernels to configurations for the adaptive computer, and to schedule the dy-
namic reconfigurations to be performed during the runtime of the application.
The C code including HW/SW interfaces as augmented by the HW kernel com-
paction pass is compiled with a standard GCC cross compiler and can then be
executed on the adaptive computer.
While the configuration scheduling pass determines which kernels should be
merged and generates a configuration schedule, this information is not yet used
in COMRADE 1.0 to actually create the FPGA configurations and to augment
the software with reconfiguration commands. The associated arrows in Fig.
3.1 are therefore printed in dashed style. However, a simulation environment
exists which is able to simulate single hardware kernels consisting of a data
path, a sequencer and operator module instances.
3.3. Deficiencies
This Section describes the major deficiencies and problems of COMRADE
1.0 concerning hardware generation, the front-end, memory accesses, and the
overall toolchain.
3.3.1. Hardware Generation
For the insertion of control edges into the CDFG, Kasprzyk describes a method
using dominator relations on the line graph of the CFG, which he terms ECFG.
He describes that method only for the insertion of control edges to multiplexer
predecessors, lacking support for control hierarchies and memory accesses.
However, the actual COMRADE 1.0 source code already contains extensions
37
3. COMRADE 1.0
of Kasprzyk’s initial approach; an example is illustrated in Fig. 3.2.
We assume that CFG node 5 in Fig. 3.2(a) contains a storing memory access,
i.e., a write access to an array or pointer. Such a memory access appears as
store node in the CDFG. Each store node needs a control edge pointing to it
to prevent mis-speculated writes to the memory. To find the origin of that
edge in the CDFG, we need to locate the corresponding branch node in the
CFG, called the CFG controller node of a given CFG node. In Fig. 3.2(a)
the CFG controller node of CFG node 5 clearly is 1: If the transition (1, 2)
occurs, all statements in node 5 are executed; if (1, 6) occurs instead, they are
not executed.
To find the CFG controller node of node 5, Kasprzyk first picks a CFG edge
entering the node. In Fig. 3.2(a) this can be (3, 5) or (4, 5). Without loss of
generality, we assume that (3, 5) is picked, corresponding to node 3/5 in Fig.
3.2(b). Next, the immediate dominator of 3/5 is picked, which is 2/3 as shown
by the dominator graph in Fig. 3.2(c). For the ECFG node 2/3, Kasprzyk’s
method then locates the corresponding CFG edge (2, 3) and declares the source
node 2 as CFG controller node. If we assume that (4, 5) is picked instead
of (3, 5), the algorithm would compute the same result, which is obviously
wrong, because we have already seen that node 1 is the node we are looking
for. This shows that Kasprzyk’s method does not completely solve the control
edge insertion problem.
(a)       CFG
0
1
2
3 4
5
6
(b)      ECFG
0/1
1/2
2/3
3/5 4/5
5/1
1/6
2/4
(c)  DOM(ECFG)
0/1
1/2
2/3
3/5 4/5
5/1
1/6
2/4
Figure 3.2.: (a) Sample CFG, (b) associated ECFG, (c) ECFG dominators.
Moreover, the token flow in hardware generated from complex control struc-
tures has not been sufficiently covered by Kasprzyk’s work. Although he men-
tions cancel token forwarding for hierarchical conditions, he does not address
the problem of redundant control dependences [GäKo08]. In general, nested
loops produce incorrect hardware, partially due to excess activate tokens in
38
3.3. Deficiencies
nodes representing constants. These problems can result in deadlocks or even
wrong results when simulating the generated hardware.
Kasprzyk proposes an extension of speculative execution which uses cancel
tokens to decrease the runtime of hardware kernels containing imbalanced al-
ternative data paths. Lacking quantitative experimental results, he fortifies his
arguments with the example shown in Fig. 3.3. The Figure compares three
speculative execution models for alternative branches. In the late evaluation
model (a), the select node waits until all inputs are available before firing. Us-
ing early evaluation (b), it fires as soon as the chosen input (the one on the
right in the example) is available. Counters at each input of the select node
are used to detect mis-speculated results, which can then be discarded. The
moving cancel token method (c) proposed by Kasprzyk produces a cancel to-
ken instead of increasing a counter. The conceptual difference is that the can-
cel token moves backwards along data edges and actively cancels a running
mis-speculated computation rather than waiting for the mis-speculated result
to reach the select node. In Fig. 3.3(c) the second result obviously reaches the
select node one cycle earlier than in Fig. 3.3(b). Kasprzyk concludes that mov-
ing cancel tokens can save execution time when compared to early execution
with input counters. Here, he implicitly assumes non-pipelined operators. If
the division operator in the left data path was pipelined, the second division
could be started one step earlier (i.e, in step 3) and the second result would
then arrive at the select node in step 4, resulting in the same runtime achieved
by moving cancel tokens.
The effects of goto, break, and continue statements on the token flow have not
been considered. Tests have shown that the computation model used by COM-
RADE 1.0 is not suitable to directly support the unstructured control flow that
these statements can evoke.
In the implementation, the canceling of speculatively executed operations is
incorrect. A running operation does correctly propagate a cancel token to its
predecessors, but the operation itself is not canceled, i.e., as soon as the op-
eration finishes, it emits activate tokens, although it should already have been
canceled, resulting in wrong results or deadlocks.
Target nodes of control edges are never executed speculatively. This can have
a negative impact on the runtime, when a high-latency operator (such as a di-
vider) is a control edge target; the operation will not start until the control
condition has been evaluated.
Furthermore, neither the GLACE operators nor the sequencer support operator-
internal pipelining. In the presence of high-latency operators, this results in
39
3. COMRADE 1.0
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
(a) late evaluation
step 1 step 2 step 3 step 4 step 5
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
/3
|56 +23
select
(b) early evaluation with counters for discarded inputs
10 0 0 1 0 0 0 1 0
1
1
1
1
1 1
1
2
1
1
1
1↓
2 2
1
1↓
1 2
2
2
2↓ 3
3 3
2
(c) early evaluation with moving down tokens (COMRADE 1.0)
+25 1 activated operator,first computation
/3 2↓ inactive operator,down token for second 
computation
|56 5 down token of the fifth computation
1 2
1 1
2
1
1
2
1
step 1 step 2 step 3 step 4 step 5
step 1 step 2 step 3 step 4 step 5
Figure 3.3.: Speculative execution models: Fig. 4.11 from [Kasp05]. Each of
the multiplexers selects the right-hand data input.
40
3.3. Deficiencies
increased runtimes due to the larger initiation intervals (reduced throughput)
of the operators. But even in the absence of such operators in a hardware
kernel, the runtime increases due to bubbles in the data paths. An operator
holding a result cannot be reactivated before that result has been consumed by
all successors. Thus, even a single-cycle operator has an initiation interval of
two.
Despite their wide applicability, the GLACE module generators do not support
multiplexers with more than two inputs. Therefore, in order to implement big-
ger multiplexers (e.g., needed for hardware generated from switch statements),
Kasprzyk has arranged 2-input multiplexers in a tree structure. The implemen-
tation raises two issues. First, the connection of the global select signal to the
selects for the 2-input multiplexers is wrong. Second, the input data traverses
all multiplexer instances in the tree in one cycle, which limits the achievable
clock frequency.
The GLACE library has not undergone a major update since 2004, thus the tar-
get technologies it supports (Xilinx XC4000 and Virtex) are obsolete in com-
parison to current standards. The incompatibility with newer FPGAs is caused
by the technology dependent preplacement applied in GLACE. This problem
can be bypassed by eliminating such placement directives (generally RLOC
constraints) from the generated EDIF netlists. This would, however, result in
wrong meta data about area and critical paths, confounding the decisions made
during the configuration scheduling pass.
3.3.2. Front-end
In addition to the hardware generation issues, other problems are related to
the compiler front-end. The loop duplication pass generates all reasonable
HW/SW partitions for nested loops on the one hand, but does not provide the
software with the intelligence to determine whether it would actually be better
to choose the hardware or the software version of a certain loop level. In-
stead, the hardware version is preferentially used. The remaining alternative
hardware regions dedicated to mixed HW/SW execution of nested loops are
processed by all successive compiler passes, although they are never really
used.
41
3. COMRADE 1.0
3.3.3. Memory Accesses
A hardware kernel generated by COMRADE 1.0 is destined to be connected
to an instance of the configurable memory access system MARC [LaKo00].
MARC offers the kernel access to the main memory via a cache port (ac-
cesses are cached inside MARC), while accesses to the kernel from the CPU
are routed via a slave port (implementing a simple slave interface). Fig. 3.4
illustrates this for the ACE-V [Koch04], which implements an adaptive com-
puter on a circuit board linking a microSPARC-IIep CPU with a Xilinx Virtex
1000 FPGA, the CPU containing a memory controller that allows both the CPU
and the FPGA to access the DRAM main memory.
ACE-V Card
CPU
microSPARC-IIep
DRAM
MARC
FPGA
Xilinx Virtex 1000
Cache
HW Kernel
PCI
Slave
Port
Cache
Port
MEMC
Figure 3.4.: The adaptive computer ACE-V (simplified), used as COMRADE
1.0 target platform.
The COMRADE 1.0 execution model for generated hardware uses a single
cache port to a centralized cache. Completely ignoring scenarios with local
scratch pad memories as well as multiple access ports to the cache, this restricts
the memory bandwidth to 32 bits per cycle. As a result, memory-intensive
applications execute more sequentially and less in parallel, with a negative
impact on the kernel performance.
The hardware implementation of the memory access operators assume single-
cycle latency accesses to the cache, which is often not realistic when a high
target clock frequency is required on the target platform.
The insertion of memory edges into the CDFG in order to sequentialize multi-
ple array and pointer accesses for one cache port is rather rudimentary, causing
problems in several cases. In an if/else structure containing a memory access
only in the then part (not in the else part), a deadlock occurs as soon as the else
42
3.3. Deficiencies
part is entered, because consecutive memory accesses (after the if/else) wait for
a non-existing activate token. Furthermore, data independent loops and nested
loops execute in parallel without obeying memory dependences and without
sequentializing their accesses, which can produce erroneous results and dead-
locks.
3.3.4. Toolchain
In COMRADE 1.0, the compile flow is not completely implemented yet. The
compiler outputs several generated hardware kernels, each consisting of three
Verilog files (top-level, data path, and sequencer) plus a directory containing
netlists and simulation models for each hardware operator instance, and a C
software output file containing the original C code augmented with HW/SW
interfacing. The reconfiguration scheduling pass does create a data path load
graph, but this graph is not used in any way to create FPGA bitstreams or to
actually schedule reconfigurations.
Kasprzyk presents no runtime measurements of the generated hardware.
A simulation environment is available, but this is rather rudimentary: Only
one hardware kernel can be simulated at a time, and quite a lot of manual
intervention is necessary to run a simulation. For example, the I/O register
numbers have to be manually found in the Verilog code and inserted into the
top-level simulation file.
The implementation also lacks robustness; even simple input programs can
cause it to crash.
The COMRADE 1.0 source code is to some extent (mainly in the SSA and the
back-end passes) non-transparent and partly redundant, making it difficult to
maintain.
Finally, the toolchain is restricted to integer computations, i.e., there is no sup-
port for floating point operators and data types.
43
3. COMRADE 1.0
44
4. Related Work
In this Chapter we will first categorize existing adaptive compilers and high-
level synthesis frameworks for the C language and review five representative
approaches (Section 4.1). Section 4.2 then discusses related work on early
evaluation with cancel tokens, which is a central feature of COMRADE.
4.1. Existing Adaptive Compilers and HLS
Frameworks
Today there are many frameworks which can create hardware descriptions from
C or C-like representations. Comparisons are difficult to some extent, because
commercial compilers understandably do not publish internal matters and the
developers do not use consistent benchmarks. In the field of adaptive comput-
ing, another issue is that different compilers often use different target architec-
tures that lack a commonly accepted test basis.
As our own compiler COMRADE focuses on the generation of dynamically
scheduled hardware from unrestricted, non-extended ANSI C, we make a com-
parison to the existing approaches by concentrating on two central aspects: the
kind of scheduling (static vs. dynamic) and the support of arbitrary pointers.
We choose the latter because arbitrary pointer support is a feature which is typi-
cally lost when the supported C is restricted to some subset of the language. Ta-
ble 4.1 accordingly categorizes 21 commercial and academic compilers which
have emerged during the past 15 years. It is notable that most compilers use
static scheduling. Dynamic scheduling has hitherto been employed by aca-
demic compilers, except for the Xilinx CHiMPS approach.
The next Sections highlight five representative compilers. Despite interesting
approaches and features, none of the compilers known to us (beyond COM-
RADE) have ever examined early evaluation with cancel tokens (Section 4.2),
which is a primary objective of this work.
45
4. Related Work
Static scheduling:
Compiler arbitrary Compiler arbitrary
pointers pointers
Bach-C [STKY99] no Handel-C [Bowe09] no
C2H [LaPM06] yes Impulse-C [Impu11] no
Catapult-C∗ [Ment09] no Mitrion-C∗ [Mitr11] no
CHC [Alti08] yes NIMBLE [MacM01] yes
Cyber [Waka99] no PRISC [RaSm94] yes
Dime-C∗ [Stef08] no ROCCC [GuNB08] no
FPGA C [Bass11] no SPARK [GDGN03] no
GarpCC [CaHW00] yes xPilot [CFHJ06] no
GAUT [GCHB05] no
Dynamic scheduling:
Compiler arbitrary Compiler arbitrary
pointers pointers
CASH [Budi03] yes COMRADE [KoKa05] yes
CHiMPS [PBDM08] yes Molen [Pana07] no
Table 4.1.: Existing C to hardware compilers (without claim of complete-
ness). ∗The fact that Catapult-C, Dime-C, and Mitrion-C use static
scheduling is very likely, but not 100% sure from the documentation
and publications.
46
4.1. Existing Adaptive Compilers and HLS Frameworks
4.1.1. PRISC
One of the earlier approaches (1994) is PRISC (PRogrammable Instruction
Set Computers), developed by researchers of Harvard University. It differs
significantly from the other compilers, because rather than being used as a
dedicated coprocessor, the generated accelerator hardware is programmed into
the data path of the CPU. A hardware-programmable functional unit (PFU)
such as this executes computations in parallel with the other CPU functional
units (such as ALUs), as illustrated in Fig. 4.1. For simplicity, the logic in
the PFU is limited to a latency of one CPU clock cycle. To use the PFU, the
CPU instruction set is extended appropriately. The drawbacks of this approach
are the limited memory access of the PFU (only possible using the existing
CPU memory interface) and the limitation to combinatorial logic in the PFU.
Simulations assuming a PFU in a 200 MHz CPU have shown relatively low
speed-ups between 1.06 and 1.91 for SPEC’92 benchmarks.
Register
File
and
Bypass
Logic
FU1 FU2 PFU
Source Operand Buses
Result Operand Bus
Paddr
Pdata
Figure 4.1.: The PRISC computation model. Source of image: [RaSm94], page
173, Fig. 1.
4.1.2. ROCCC
A prime example of statically scheduled hardware accelerators used as a co-
processor is ROCCC (Riverside Optimizing Compiler For Configurable Com-
puting) [BuNa08, GuNB08], developed at the University of California, River-
side. This is a hardware-only compiler, generating VHDL from designated C
source code passages. It is built on the SUIF2 framework [ADHL00] and uses
Machine-SUIF virtual machine [SmHo02] as low-level IR. Machine-SUIF is
an assembler-like, data flow-based representation. ROCCC extends Machine-
SUIF by foi instructions (feedback or initialization), which feed data paths in
loops either with initial values (when the loop starts) or with feedback values
from the previous loop iteration. Fig. 4.2(a)) shows a sample Machine-SUIF
47
4. Related Work
code, representing a loop header in the upper section (Node:2) and parts of the
loop body in the lower one (Node:3). The data path generated from this code
is shown in Fig. 4.2(b). The fois are implemented as multiplexers controlled
by loop controllers. Like many statically scheduled compilers, ROCCC sep-
arates computation processes and memory accesses from each other, i.e., the
data path contains no elements accessing memory. Thus, the C source code is
very restricted in relation to memory accesses. Pointers are not allowed at all,
all array indices must be computed from the loop index plus a constant stride,
arrays represent exclusive read-only or write-only accesses, and arrays may not
have more than two dimensions. In the presence of these restrictions, ROCCC
uses smart buffers which reduce the number of memory reads by re-using data
from previous loop iterations. ROCCC creates efficient systolic array hardware
accelerators. This results in very high speed-ups up to 3000x on a Virtex-4 @
174 MHz when compared to a Xeon @ 2.8 GHz. On the other hand it im-
poses a lot of restrictions on the input C code, decreasing the language support
significantly.
(a) (b)
Figure 4.2.: (a) Machine-SUIF representation and (b) data path generated by
ROCCC. Source of image: [GuNB08], page 6:12, Fig. 6.
48
4.1. Existing Adaptive Compilers and HLS Frameworks
4.1.3. SPARK
The SPARK compiler [GGDN04] developed at the University of California,
Irvine, compiles C to RT-level VHDL code. Similar to ROCCC, SPARK uses
static scheduling and does not allow pointers in the input C code at all. The
main focus of SPARK is the evaluation of different optimization strategies such
as (speculative) code motion and dynamic common subexpression elimination1
[GDGN03]. Therefore, the published results focus on comparisons of different
optimization passes and their combinations rather than measuring hardware
accelerator speed-ups.
SPARK uses a hierarchical IR called hierarchical task graph (HTG). An HTG
is a directed acyclic graph having three types of nodes: single nodes, com-
pound nodes, and loop nodes. A single node contains a sequence of program
statements without intermediary branches similar to the basic blocks of a CFG.
Compound nodes contain sub-nodes and thus account for the hierarchical char-
acter of HTGs. Loop nodes are special compound nodes containing a loop
head, loop body, and loop exit node. Each of these are again either a single
or a compound node. The nodes are connected by directed control flow edges.
Hence, the HTG essentially is a hierarchical CFG. Fig. 4.3 shows two example
HTGs representing an if block (a) and a for loop (b).
Operation/
Variable Binding
FSM Generation
and Optimization
Heuristics Transformations Toolbox
Percolation/Trailblazing
Loop Transformations
Scheduling and Allocation
Dynamic CSE
Loop Transformations
Scheduling
Speculative Code Motions
CSE/IVA/Folding/Inlining
Operation Chaining
Code Generation BackEnd
Logic Synthesis and Functional Verification
Parser Front End
Control Synthesis/Optimization
Constraints
& Resource
Library
C Input
Synthesizable RTL VHDL
Behavioral VHDL and C
Graph
Viewer
SPARK IR
(HTGs +
Data Flow
Graph)
Figure 1. The SPARK High-Level Synthesis System
mixed control-data flow designs and demonstrated their ef-
fects on schedule lengths [5]. CVLS [1] uses condition
vectors to improve resource sharing among mutually exclu-
sive operations. Radivojevic et al [2] present an exact sym-
bolic formulation which generates an ensemble schedule of
valid, scheduled traces. The “Waveschedule” approach [3]
minimizes the expected number of cycles by using specu-
lative execution. Santos et al [4] and Rim et al [9] support
generalized code motions for scheduling in HLS. Similar
code transformation techniques that have been presented in
for software (parallelizing) compilers [10] need to be re-
instrumented for synthesis to use hardware cost models for
operations and resources.
An important limitation of earlier work is the restrictions
on the input description that can be synthesized. Also, sev-
eral systems do not provide a complete design flow from
architectural description to final synthesized netlist and
present only scheduling results for small, synthetic bench-
marks. The SPARK system has been designed to overcome
these limitations as explained over the next few sections.
3 The SPARK High Level Synthesis System
The SPARK synthesis framework is a modular and ex-
tensible high-level synthesis system that provides a num-
ber of code transformation techniques. SPARK has been
designed to aid in experimenting with new transformations
and heuristics that enhance the quality of synthesis results.
Figure 1 provides an overview of the SPARK system. The
input language for design descriptions is ANSI-C, currently
with the restrictions of no pointers and no function recur-
sion. This input description is parsed into a hierarchical
intermediate representation described in Section 4.
The core of the synthesis system has a transforma-
tions toolbox that consists of a set of information gathering
passes, basic code motion techniques and several compiler
transformations. Passes from the toolbox are called by a set
of heuristics that guide how the code refinement takes place.
Since the heuristics and the underlying transformations that
they use are completely independent, heuristics can be eas-
ily tuned by calling different passes in the toolbox.
As shown in Figure 1, the transformations toolbox con-
tains a data dependency extraction pass, parallelizing code
motion techniques [11, 12], dynamic renaming of variables,
the basic operations of loop (or software) pipelining and
some supporting compiler passes such as copy and con-
stant propagation and dead code elimination [13]. The vari-
ous passes and transformations can be controlled by the de-
signer using scripts, hence, allowing experimentation with
different transformations and heuristics.
After scheduling, the system then does control synthesis
and optimization. Control synthesis generates a finite state
machine controller and also does resource binding [5]. The
back-end of the SPARK system then generates synthesiz-
able RTL VHDL and hence, the SPARK system integrates
into the standard synthesis design flow. In the next few sec-
tions, we examine the SPARK system in mo e detail, start-
ing with the internal intermediate representation it uses.
4 HTG: A Model for Control Intensive De-
signs
The SPARK system stores the behavioral description in
an intermediate representation (IR) that retains all the in-
formation given in the input description. This is critical for
enabling source-level transformations, making global deci-
sions about code motion and enabling the visualization of
intermediate results to improve user-interaction.
The intermediate representation used in SPARK consists
of basic blocks encapsulated in Hierarchical Task Graphs
(HTGs) [12, 14]. An HTG is a directed acyclic graph that
has t ree types of nodes: single nodes (non-hierarchical
nodes), compound nodes (nodes that have sub-nodes), a d
loop no es. Operations that execute concurrently are aggre-
gated t gether in single nodes called st tements. Statements
that have no control fl w between them are aggregated to-
gether int basic blocks. Basic blocks are encapsulated into
compound HTG nodes t form hierarc ical structures such
as if-then-else blocks, switch-case blocks, loop nodes r a
series of HTG nodes. Expressions are stored as abstract
syntax trees [13] and each operation expression is initially
encapsulated in a statement node of its own.
(a) (b)
HTG Node
Compound
HTG Node
Compound
i < N
i = 0
i = i + 1
Empty BB
Empty BB
Condition
True False
Compound
HTG Node
(Loop Body)
False
True
For Loop HTG Node
Loop Exit
If HTG Node
Figure 2. The hierarchical task graph (HTG) representa-
tion of (a) an if-block, (b) a For-Loop.
2
Figure 4.3.: Hierarchical task graph (HTG) representation of (a) an if block and
(b) a for loop. Source of image: [GDGN03], Fig. 2.
1Common subexpression elimination carried out during the static scheduling phase.
49
4. Related Work
4.1.4. CASH
As far as we know the first compiler generating dynamically scheduled hard-
ware from C is CASH (Compiler for Application-Specific Hardware) [Budi03]
developed in 2003 by Budiu (Carnegie Mellon University, Pittsburgh). Like
ROCCC it is a pure hardware compiler, however instead of synchronous hard-
ware descriptions for reconfigurable devices it creates asynchronous descrip-
tions destined for ASIC synthesis. The concept of spatial computation de-
scribed by Budiu originates from data flow computers. Operations are exe-
cuted as soon as all of their inputs are available. CASH assumes a monolithic
cached memory architecture, thus the support of arbitrary pointers in C is not a
problem. The focus of Budiu’s work is Pegasus, a CDFG-based intermediate
representation. Fig. 4.4(B) shows an example Pegasus graph which represents
the code of Fig. 4.4(A). Nodes contain operations for data transformation and
token handling. Edges forward data, boolean values (used for predicated ex-
ecution) and tokens. A Pegasus graph is built from a CFG and consists of
connected subgraphs generated from different CFG regions (e.g., loops). Merge
and eta nodes represent data inputs and outputs for each Pegasus subgraph.
Budiu’s work contains concepts of speculative execution of data paths and
even speculative memory accesses using a load-store queue (LSQ). However,
many of these concepts have only been evaluated in high-level simulations (i.e.,
simulating Pegasus models) lacking actual HDL implementations. Low-level
Verilog simulations use a memory system without LSQ and assume a perfect
cache. Furthermore, the Verilog back-end does not allow function calls or
pipelining of operators.
High-level simulations against a CPU exhibiting the same latencies as the gen-
erated hardware have shown speed-ups between 0.5x and 12x. In contrast,
post-layout simulations of kernels mapped to an asynchronous ASIC using a
[180nm/2V] standard cell library from STMicroelectronics against a 600 MHz
CPU showed slowdowns of 0.78x up to 4.5x. According to Budiu the major
reason for this slowdown is the memory access protocol, which does not allow
the issue of a new access before the current access has completed.
4.1.5. CHiMPS
CHiMPS (Compiling High level language to Massively Pipelined System)
[PBDM08], developed by Xilinx, is one of the latest compilers found in lit-
erature. As a C to hardware/software tool, it creates dynamically scheduled
hardware, borrowing the spatial computation model introduced by CASH. The
50
4.1. Existing Adaptive Compilers and HLS Frameworks
int fib(int k)
{
int a = 0;
int b = 1;
while (k) {
int tmp = a;
a = b;
b = b+tmp;
k--;
}
return a;
}
 a
ret
 k
 k
!=
 b a
 a
1
!0
0
0
 b  k a
 a
 a
+ −
 b  k!
0
!=
1
merge node
argument1
2
3
merge nodes
loop again
loop done
eta nodes
eta nodes
to hyper 2
to hyper 3
(A) (B)
Figure 3.13: (A) Iterative C program computing the k-th Fibonacci number and (B) its (slightly simplified)
Pegasus representation comprising three hyperblocks. This figure does not include the program counter, the
token edges, or the current execution point.
value and a predicate—and one output. When the predicate evaluates to “true”, the input value is moved to
the output. When the predicate evaluates to “false”, the input value and the predicate are simply consumed,
generating no output. Each eta is controlled by the edge predicate.
Hyperblocks with multiple predecessors may receive control at run-time from one of several different
points. Such join points are indicated by merge nodes, shown as triangles pointing up.
Putting together all the transformations above, Figure 3.13 illustrates the representation of a program
comprising three hyperblocks, including a loop. The eta nodes in hyperblock 1 will steer data to either
hyperblock 2 or 3, depending on the test k != 0. Note that the etas going to hyperblock 2 are controlled
by this predicate, while the eta going to hyperblock 3 is controlled by the complement. There are merge
nodes in hyperblocks 2 and 3. The ones in hyperblock 2 accept data either from hyperblock 1 or from the
back-edges in hyperblock 2 itself. The back-edges denote the flow of data along the “while” loop. The
48
Figure 4.4.: (A) Sa ple C source code. (B) Pegasus representation (simpli-
fied). Source of image: [Budi03], page 48, Fig. 3.13.
51
4. Related Work
fundamental CHiMPS innovation is the many-cache model. To increase the
number of parallel cache accesses, the cache is split up into multiple caches
that are synchronized through flushing. The mapping of memory access oper-
ations to caches uses a simple memory analysis based on the C99 restrict
keyword. Compared to its first version, which did not carry out any optimiza-
tions during compilation, CHiMPS has offered an enhanced front-end based
on LLVM [Latt02] since 2008. This offers better C language support and more
high-level optimization passes [LeRK08].
Fig. 4.5 illustrates the main compilation steps. The input C code (a) is trans-
formed into the CHiMPS Target Language (CTL) (b). CTL is an assembler-like
IR describing data flow (shown graphically in (c)) as well as memory depen-
dences between operations. From CTL representations CHiMPS then creates
VHDL code.
The target platform most recently used for HW/SW applications generated by
CHiMPS is the Xilinx Accelerated Computing Platform (ACP). The used ACP
configuration connects a Xilinx Virtex-5 LX110T FPGA to an Intel quad-core
Xeon CPU @ 2.66 GHz via the Intel Front-Side Bus (FSB). Cycle-accurate
simulations of this platform have shown application-level speed-ups between
1x and 67x2 against the ACP using the Xeon without FPGA support [PEBD09].
2 Sub
int foo(int w, int x, 
        int y, int z)
{
  int u = x + y;
  int v = w - z;
  int s = z + v;
  int t = u & (y+z);
  int q = t | (s-v);
  return q;
}
Enter foo; w,x,y,z
 reg u,v,s,t,q 
 reg _f1,_f2
!: add  x,   y;   u
2: sub  w,   z;   v
3: add  z,   v;   s
4: add  y,   z; _f1
5: and  u, _f1;   t
6: sub  s,   v; _f2
7: or   t, _f2;   q
Exit foo; q
1 Add 4 Add
3 Add 5 And
6 Sub 7 Or
Enter
Exit
FIFO
C Code CTL  Instruction Blocks VHDL Blocks(a) (b) (c)
z
w
v
x
u
z
_f1
s
_f2
y
t
Fig. 1: CHiMPS toolﬂow. (a) A simple example C function. (b) The C function translated into CHiMPS Target Language (CTL) instruction
blocks. (c) The resulting FPGA accelerator.
tion block, and from there the ﬂow of data drives program
execution. There is no program counter-driven instruction
fetch to order instructions, so data-independent instruction
blocks can execute in parallel. Thus the model can exploit
instruction-level parallelism over a much larger window of
instructions than a superscalar’s out-of-order execution core.
The spatial dataﬂow model also efﬁciently exploits
pipeline parallelism. Unlike a CPU or GPU, which have
ﬁxed pipeline depths, instruction blocks on an FPGA can be
conﬁgured to any pipeline depth. Consequently, FPGAs can
have more sequential instructions in-ﬂight. This is partic-
ularly useful for HPC streaming [5] applications. Further-
more, loop iterations can start on every cycle when there are
no loop carried dependencies, and more loop iterations can
be in-ﬂight simultaneously.
While the spatial dataﬂow model is good for exploiting
parallelism, it is questionable whether an FPGA has enough
resources to support a VHDL block for every instruction in
the program. The results in Section 3 show that some bench-
marks push the limits of currently available FPGAs. When
hardware resources are constrained, CHiMPS uses two sim-
ple methods for reducing resource requirements: (1) time-
sharing instruction blocks, and (2) using a conﬁgurable soft
processor core in place of a set of VHDL blocks.
Time-sharing instruction blocks involves mapping two or
more CTL instructions to one VHDL block. For example,
consider an application with separate paths that both require
a ﬂoating point multiply. The two multiplies can be imple-
mented by a single multiplier enclosed in a mux-demux pair.
The mux selects between the two paths, passing the selec-
tion to the demux block, which then routes the result to the
correct consumer block.
Using a conﬁgurable soft processor core allows the
CHiMPS compiler to replace arbitrary sections of connected
CTL instructions with a small, simple CPU, called MicroB-
laze [10]. MicroBlaze is a small conﬁgurable processor
(2000 LUTs, or 2% of a Virtex-5 LX330T FPGA) that can
be used to execute any CTL instruction.
The CHiMPS compiler currently instantiates a MicroB-
laze processor in the following cases: Short serial code sec-
tions. Code sections that are executed once or that have no
parallelism sometimes appear between parallel sections. If
the overhead of transferring back to the CPU is greater than
the time required to execute the serial code, a MicroBlaze is
instantiated. Infrequent paths. If a code path isn’t likely to be
executed (according to an execution proﬁle), then a MicroB-
laze can be used for the uncommon path. Complex functions
and some syscalls. Some complex functions and syscalls re-
quire signiﬁcant resources when implemented in the dataﬂow
style. In these cases, the compiler instantiates a MicroBlaze
to either execute the function or pass it to the CPU.
2.3. Many-cache Memory Model
A key difference between CHiMPS and previous work is its
memory structure, which we call the many-cache memory
model. This section describes the cache-based model and
the CHiMPS policies to effectively manage the caches.
Caching may initially seem like a bad match for FPGAs,
because arranging the distributed FPGA memory blocks into
one large monolithic cache causes three major problems:
Limited bandwidth: One cache can only provide a few
reads/writes per cycle rather than the hundreds provided by
the small FPGA memories. Increased latency: The greater
the number of read/write VHDL blocks, the more layers of
mux’ing are required to access the cache, each adding 2 cy-
cles to the cache access time. Decreased clock frequency:
Large numbers of read/write instruction blocks going to a
single cache causes routing congestion, which decreases the
maximum clock frequency for the design.
CHiMPS’s solution is to use many small, distributed
caches rather than the one large, monolithic cache used in
CPUs. This addresses the limited bandwidth issue by sup-
porting more simultaneous memory requests, and the la-
tency and frequency problems by reducing the number of
read/write instructions per cache, hence decreasing the mux
layers and routing congestion.
However, the many-cache memory model raises two
other issues: keeping memory coherent and creating mul-
tiple caches from C source code. The coherence problem
arises when the same memory address is cached in multi-
ple caches. If the value is changed in one cache, the others
may not receive the update. CHiMPS guarantees coherency
by sidestepping the issue entirely, stipulating that only one
cache can hold a particular memory address at any given
time. This avoids the cache coherence problem without the
hardware complexity of a cache coherency protocol.
Given this stipulation, the CHiMPS compiler creates a
separate cache for any unique range of memory addresses.
This typically means a separate cache for each array, as long
175
Figure 4.5.: (a) Sample C function; (b) CHiMPS Target Language (CTL) rep-
resentation; (c) graphical DFG representation. Source of image:
[PBDM08], Fig. 1.
2The assumed operation frequencies of t e kernels running on the FPGA have not been published.
52
4.2. Early Evaluation and Cancel Tokens
4.2. Early Evaluation and Cancel Tokens
In dynamic scheduling a hardware operation is executed when all of its inputs
are available. Early evaluation (EE) of the operation can save execution time:
The operation is executed and possibly outputs a result even though not all of
its inputs are available yet. For example, if the selected data input of a multi-
plexer is available, the value is already forwarded to the data output, although
the other inputs are not ready yet. However, to implement such a behavior, we
have to handle the data that finally comes in through the pending inputs. More
precisely, we have to discard those data items to prevent them from being used
as input for consecutive computations. Thus, if an early evaluation takes place,
a token representing a discard information must be attached to each pending
input. We call these tokens cancel tokens (CT), while the tokens representing
valid data are called activate tokens (AT). CTs are either static, i.e., they wait
at the pending inputs until they collide with an incoming AT, or they are dy-
namic, i.e., they move in reverse data flow direction until they collide with an
AT. If an AT and CT collide they erase each other.
The ideas of early evaluation and cancel tokens have been developed indepen-
dently by two authors. In 2003 Brej [BrGa03] described EE for asynchronous
hardware on the gate level. He used the term anti-token instead of cancel
token. In 2005 Kasprzyk [Kasp05, KoKa05] described EE for synchronous
hardware, which is created by the COMRADE compiler from ANSI C code.
He used the term up token instead of activate token and down token instead
of cancel token.
Of course, implementing EE is reasonable only if the inputs of an operation
can actually arrive at different points in time. When static scheduling is used,
data is processed during fixed time slots and inputs always enter an operation
simultaneously. Hence, EE is only used in dynamically scheduled hardware.
It can save runtime if data paths are imbalanced (e.g., different numbers of
pipeline stages in alternative branches) or if operations can have a variable
latency.
53
4. Related Work
54
5. COMRADE 2.0 Execution
Model
In this Chapter we will first illustrate the generic target architecture of COM-
RADE 2.0 (Section 5.1). Then we will present our hardware/software co-
execution model employed in the architecture (Section 5.2) before describing
in Section 5.3 the execution model of the hardware generated by COMRADE
2.0.
5.1. Generic Target Architecture
COMRADE 2.0 relies on platform-independent concepts for hardware and
software generation and is thus not limited to a specific target platform. How-
ever, some basic properties of an adaptive computer target architecture can
be derived directly from the COMRADE 2.0 design goals. The most impor-
tant issue is the broad language support including hardware generation from
C code containing pointers. Pointer support implies that the hardware cannot
be constrained to use just simple variables and arrays for data I/O. Instead, it
must have access to complex, pointer-based data structures, e.g., an array of
pointers each referencing other pointer arrays. In such scenarios copying the
relevant data to dedicated HW-accessible memories before the actual HW ker-
nel computation (and copying the results back afterwards) is difficult and, if at
all possible, often inefficient. A better alternative is to just pass a pointer to
the referred data structure from SW to HW, analogous to the pass-by-reference
principle. Recent work [LaKo09] has shown that direct exchange of pointers
between software and hardware can be implemented at high performance even
in full-scale operating system environments using virtual memory. The result-
ing requirement for the target architecture is a shared memory for SW and HW
which holds all data that is accessible by both parties. One way to implement
this (and this is the method we employ in our subsequent tests) is to use a sub-
section of the main memory as shared memory and grant the RCU access to
that memory section.
55
5. COMRADE 2.0 Execution Model
Ideally both the CPU and the RCU use the same cache for accessing the shared
memory (Fig. 5.1(a)). This allows the RCU to directly load the data pro-
cessed by the CPU from the cache and vice versa. Unfortunately, as far as
we know, there are no real test platforms which offer such a configuration.
Instead, existing platforms use a separate cache for the RCU (Fig. 5.1(b)),
typically configured onto the FPGA. Cache coherency can be ensured either
by cache coherency protocols (MESI, MOESI) implemented in hardware or
through software mechanisms. The caches are connected to the shared mem-
ory via a crossbar switch.
Furthermore, the CPU and the RCU are directly connected to each other by two
channels (Fig. 5.1). The CPU r/w channel allows the CPU to directly address the
RCU, e.g., to exchange live variables between SW and HW. The IRQ channel
is used by the RCU to signal the end of a HW computation to the CPU or to
signal an interruption of the HW computation, initiating a SW service.
C
AC
H
E
Shared
MEM
RCU
CPU
C
PU
 r/
w
IR
Q C
AC
H
E
Shared
MEM
C
AC
H
E
X-bar
switch
RCU
CPU
IR
Q
C
PU
 r/
w
(a) (b)
Figure 5.1.: Target architectures: (a) ideal, (b) real.
5.2. Hardware/Software Co-Execution Model
This Section uses an example to explain the HW/SW co-execution model, i.e.,
the way the CPU and the RCU behave and communicate to each other during
the execution and the switching between SW and HW parts of an application.
Fig. 5.2 shows the partitioned CFG of the sample program used throughout
this Section. Although the CFG is only an intermediate representation, we will
refer to CFG nodes or regions being executed on the adaptive computer; this
actually means that the computations modeled by a CFG node or region are
executed.
For this work we assume that regions are always executed in the original pro-
gram order. As a consequence only one region is active at a time. Paralleliza-
tion of regions would require sophisticated pointer analysis techniques (such
56
5.2. Hardware/Software Co-Execution Model
as [ShHo97]) to prove memory independence and is not in the scope of this
work.
The example program in Fig. 5.2 comprises three SW regions (one of them
being a SW subregion) and one HW region. Execution starts with node (0) on
the CPU. After some initial assignments the HW region is entered. For such a
SW/HW transition the SW performs four steps:
1. The CPU has to program the FPGA with the HW region to be executed
if this has not been done before1. This step is not shown in the Figure.
2. The SW transfers the values of those variables to the HW, which are
needed during the HW kernel execution (SendVars(kernel_no)).
3. If necessary it flushes the data which is stored in the shared memory from
the CPU cache, if the adaptive computer has to manage cache coherency
in software. This third step is also not shown in the Figure.
4. The SW starts the HW kernel (StartHW(kernel_no)) by writing to a
dedicated hardware register of the target kernel. The CPU then waits
until the RCU signals the end (or an interruption) of its computation via
an IRQ.
Program control has now been transferred to the RCU, i.e., the hardware de-
cides on its own when its computations are finished or when to interrupt them.
This execution model is called master mode. However, note that although the
RCU now controls the program flow and is able to directly access the mem-
ory, it never accesses the CPU except by sending an IRQ. Effectively, variable
exchange between SW and HW is always initiated by SW.
After the HW has received the current values for n, in1, in2, and out, it enters
a loop which is decomposed into the loop condition in node (2), the loop body
(3, 4, 5, 6, 7, 8), the increment of i located in node (8), and a CFG edge (8, 2)
from the last body node back to the loop condition.
If b == 0 in node (3), the HW interrupts the CPU and calls a SW service,
i.e., SW subregion (5) is entered. To this end, the HW writes into a dedicated
HW register called irqreg the number of the exit node (4) and then issues an
IRQ (StopHW(4))2. The SW interrupt handler, not shown in the Figure, reads
1Depending on the HW kernel sizes, multiple kernels can be arranged in one FPGA configuration
to minimize the total number of reconfigurations. This is known as reconfiguration scheduling
and has been explored by Kasprzyk in [KaVK05].
2In absence of a cache coherency mechanism, the RCU has to flush its cache before issueing the
IRQ.
57
5. COMRADE 2.0 Execution Model
(0)
int a, b, c;
int* in1 = ...;
int* in2 = ...;
int* out = ...;
char* kernel_name = ...;
int kernel_no = ...;
SendVars(kernel_no);
StartHW(kernel_no);
(1)
ReceiveVars();
i = 0;
(2)
if (i < n)
n, in1, in2, out
(3)
a = *in1;
b = *in2;
if (b == 0)
(4)
ProvideVars();
StopHW(4);
(6)
ReceiveVars();
(5)
RequestVars(kernel_no);
printf("division by 0 for
  i = %d\n", i);
c = 0;
SendVars(kernel_no);
StartHW(kernel_no);
(7)
c = a / b;
(8)
*out = c;
++i;
(9)
StopHW(9);
i
HardwareSoftware
true
false
true
false
c
(10)
printf("loop %s ended\n",
  kernel_name);
(SW subregion)
Figure 5.2.: COMRADE co-execution model with SW service. Numbers in
brackets denote CFG node numbers.
58
5.3. Hardware Execution Model
value 4 from the irqreg and, using an internal map from IRQ reasons to SW
basic blocks, therefore knows that SW execution has to continue with CFG
node (5). The software reads the value of variable i from the HW, handles the
exception, and continues HW execution. Note that the HW remembers that it
has been stopped at CFG node (4) and so knows where to continue on the next
StartHW(kernel_no) call.
Further interruptions may occur until the hardware loop finally ends and node
(9) is reached. After the final HW/SW transition (9, 10), node (10) is executed
in SW and the program finishes.
5.3. Hardware Execution Model
One of the main goals of COMRADE 2.0 is the implementation of a hardware
generation methodology which supports a wide range of the ANSI C language,
including arbitrarily nested conditions, loops, and pointers. As a side condi-
tion, the hardware generated should of course have a high runtime efficiency so
that it actually makes sense to use it as a HW accelerator. For this, COMRADE
2.0 exploits several techniques:
• For better performance, even in the presence of variable latency op-
erators (e.g., cached memory accesses) COMRADE 2.0 uses dynamic
scheduling (Section 2.3).
• Through speculative predicated execution (Section 2.2) alternative
branches (if/else, switch) and loop bodies are precomputed. This tech-
nique, originally applied for pure SW execution on CPUs, is espe-
cially amenable to data flow-based computation in hardware, because
all branch targets can be precomputed simultaneously and without the
risk of a pipeline stall. We refine speculative predicated execution by
early evaluation (Section 4.2).
• Operations with a latency > 1 (even those with variable latency) are
pipelined according to their specific initiation intervals (instead of using
a loop-wide worst case).
After explaining the basic elements of the COMRADE hardware execution
model (Section 5.3.1) we will illustrate that model with several representative
examples, relating input C snippets to compiler-generated hardware models
(Section 5.3.2).
59
5. COMRADE 2.0 Execution Model
5.3.1. Modelling Dynamic Scheduling with Early
Evaluation
To be able to model mutual influences of data, control, and memory depen-
dences and to support pipelined operators we have developed the COMRADE
Controller Micro Architecture (COCOMA). Before the next Chapter gives a
formal definition, we will introduce the essential elements of COCOMA and
explain the model using some illustrative examples.
A COCOMA instance can be pictured as a directed graph; the nodes corre-
spond to C operations (e.g., arithmetic), while three distinct kinds of edges
represent data, control, and memory dependences between them. Nodes can
store and forward data items as well as two different kinds of tokens: activate
tokens (AT) and cancel tokens (CT) (cf. Section 4.2). Edges can forward ATs
and CTs and they are able to buffer one token. Memory edges, however, buffer
or forward only ATs. Data edges forward data in addition to tokens. ATs mov-
ing along data edges always reference a (possibly speculative) piece of data,
which is stored in a node and moves from node to node along with an AT. CTs
move in reverse data flow direction. If AT and CT collide they cancel each
other out. If all ATs referencing a piece of data are eliminated, that piece of
data is deleted. Control edges can have different annotations which affect the
token flow.
Fig. 5.4 shows graphical representations of the most important COCOMA
elements. All operations storing or manipulating data (apart frommultiplexers)
are displayed as ellipses containing the operator name; an optional number
below the name denotes the current output value. To start such an operation all
incoming edges need to have an AT, with one exception: If there are multiple
incoming control edges an AT from one of them suffices.
The same ellipse symbol with a dashed frame is used for token nodes; these
only propagate tokens, not data. Different kinds of token nodes, identified by
different names, perform different operations on tokens.
Data edges are solid; control and memory edges are dashed and have a differ-
ent thickness. Control edges can have different annotations, influencing their
semantics as explained in the examples below.
For multiplexers we use the traditional trapezoid symbol. COCOMA multi-
plexers support early evaluation, i.e., they can forward a value even if not all
inputs have valid data yet, indicated by ATs. We distinguish three different
kinds of multiplexers: muxes, loop muxes, and irq data muxes.
Muxes merge data paths after an if/else or switch branch. Each data input is
60
5.3. Hardware Execution Model
associated a one-hot encoded select value, printed below the incoming data
edge inside the mux. The annotation of the incoming control edge defines how
the select signal is computed from the result of the control edge source node.
An example is shown in Fig. 5.3. Each value in the annotation ==(2,19,4) is
compared to the result of node n. Only for the value 4 the comparison is true,
giving the result vector 0,0,1, which is exactly the current select value. Thus,
the rightmost input is selected.
n
4
bc
==(2,19,4)
100 010
mux
001
a
select = 001
Figure 5.3.: Determination of the mux select signal.
Loop muxes (the word loop is printed inside the trapezoid) represent the data
entry and exit points of loop bodies. A loop mux always has exactly two inputs:
an init input i and a continue input c. The latter receives data from the loop
body, the former from outside the loop. There is also an incoming control
edge, but here only the continue input is controlled by the edge. There is no
select signal; the loop mux reads the init input initially and after the loop ending
and accepts the continue input while the loop is iterating.
Irq data muxes merge exit codes for IRQ signaling at HW/SW transitions.
They neither have a select signal nor an incoming control edge. Here, simply
the active input is propagated. COCOMA makes sure that only one input of an
irq data mux can be active at a time.
ATs are printed as circles containing a +; CTs analogously contain a -.
The runtime behavior of the COCOMA elements regarding the token flow for
the dynamic CT model is shown in Fig. 5.5. Note that the purpose of these
rules will become more apparent in Section 5.3.2, where the greater context is
described.
61
5. COMRADE 2.0 Execution Model
op
1
operation node (arithmetic, shift, comparison, boolean op, 
nop, memory access; optional lower number denotes 
current output value)
loop mux node with init input i and continue input c; data 
consumption of c input is predicated via control edge
data edge: forwards data and ATs; CTs move in reverse 
direction
control edge with control value: forwards ATs if condition 
fulfilled, creates CT from AT if condition not fulfilled, 
forwards CTs
==1
control edge with nCT annotation: forwards only ATs (no 
CTs)
==1
nCT
control edge with nAT annotation: forwards only CTs (no 
ATs)
==1
nAT
control edge with alwAct annotation: creates ATs in target, 
whenever condition is evaluated or canceled (always 
activate)
==1
alwAct
memory edge; forwards only ATs
mux node with data and control inputs
+
–
activate token (AT)
cancel token (CT)
control edge with control values for connection to a mux 
select signal
==(1,0)
irq irq data mux node; propagates active input
op token node (mf, all, any, initial, always)
loop
i c
==(1,0)
mux
==1
10 01mux
control edge with atOnCancel annotation: creates CT in 
target on false condition; creates AT if condition true, or 
source node canceled
==1
atOnCancel
Figure 5.4.: Basic COCOMA elements.
62
5.3. Hardware Execution Model
op
a b(a) (b)
op
a(h) c
1
==1
c
1
(c)
a
op
(k)
false
nCT
op
(l)
nAT
op
(m)
true
alwAct
a  
mf
(e) (f)
all
–
(j)
+ + –
b
==(1,0)
c
1
ab
==(1,0)
+ +
+ b
–
(d)
+
+
mf
+
+ +
(g)
all
+
any
+
any
+
+
+
op
ac
1
==1
+
op
a(i) c
0
==1 + op
ac
0
==1
–
op
a
c
==1 op
a
c
==1–
–
op
a b
+
op
a b
+
op
a b
op
false
nCT
+
+ op
nAT
true true
+
+
op
true
alwAct
a
+
op
(n)
false
alwAct
a  
+
op
false
alwAct
a
+ dummy
loop
i c
n
==1
+
loop
i c
n
==1
10 01
mux
10 01
mux
No CT here!
select=10 enterable=1 enterable=0
op
(o)
false
atOnCancel
a  
+
op
false
atOnCancel
a
op
(p)
atOnCancel
a  
op
atOnCancel
a
+
–
–
+
AT here!
Figure 5.5.: COCOMA elements: token flow (dynamic CTs).
63
5. COMRADE 2.0 Execution Model
(a) represents the activation of an operation node: all inputs are active, op con-
sumes the ATs and starts its computation. When it finishes it assigns an AT to
each outgoing edge.
In Fig. (b) op is canceled and propagates the CTs to its inputs. At input b, the
CT extinguishes an AT.
(c) shows an operator controlling a mux. The control edge annotation ==(1,0)
establishes that if c is active and its current result is 1, input 10 (here b) is
propagated through the mux and input 01 (a) is canceled.
Loop muxes (d) initially accept the init input without consuming a token from
the controller node n, which controls only the continue input. After being ini-
tialized the loop mux sets its internal enterable state to 0, refusing additional ini-
tializations and accepting one value from the continue input per loop iteration.
This corresponds to one AT coming in via the control edge per iteration. An
incoming CT signals the loop mux the end of the loop, provoking the loop mux to
reset its enterable state to 1. Section 5.3.2 (Loops) explains why this mechanism
does not inhibit pipelining.
(e) shows the handover of an AT from a control edge to a memory edge. This
is performed by a mf (memory forwarder) token node.
Note that for a given COCOMA node data dependences are fulfilled if all data
predecessors deliver an AT3, whereas control dependences are fulfilled if one
of the control predecessors delivers an AT. This behavior seems inconsistent at
first glance, but it makes sense when investigating actual COCOMA instances
because this behavior is simply required in most situations. A more consis-
tent notation would require additional COCOMA nodes and possibly lead to
(slightly) more complex hardware. However, if the default behavior shown
in Fig. (g) is not desired, control edges can be combined in an all node (Fig.
(f)) requiring all control inputs to deliver an AT before an AT is sent via the
outgoing control edge.
Fig. (h) explains another application of control edges: the predicated execution
of a (non-mux) operator. If the control annotation matches the output of c and
all incoming edges have an AT, op is executed. Fig. (i) shows the case in which
the c output does not match the annotation. Here, op is canceled instead. Thus,
the AT of the control edge (c, op) turns into a CT because the control condition
is not fulfilled (0 , 1). If operator c is canceled already (regardless of its result
value), that CT is forwarded to op as depicted in Fig. (j). This shows that
CTs not only flow in reverse direction along data edges but also in forward
3Multiplexers are an exception; here, already one active predecessor can activate the multiplexer.
64
5.3. Hardware Execution Model
direction along control edges. Note that the controlled operator decides when
to commence the operation, given that the data inputs have an AT. Operations
without side effects can take place before a token has arrived at the input control
edge; other operations (e.g., memory accesses) have to wait for an AT at the
control edge.
Figs. (k) to (p) show the effects of further control edge annotations on the token
flow. An nCT annotation (Fig. (k)) establishes that the control edge does not
create CTs: The token on the control edge simply disappears. Analogously,
nAT annotations (Fig. (l)) prohibit the propagation of ATs. An alwAct annotation
preserves the standard control edge behavior if the control condition is true
(Fig. (m)), but in the case of a false condition (Fig. (n)) or canceled condi-
tion, an AT is created instead of a CT. Note that this AT is bound to dummy
data which cannot be used for any reasonable computation. However, such an
AT can be used to eliminate a left-over CT4. Edges annotated with atOnCancel
(Figs. (o), (p)) forward an AT to the target node if the source node is canceled.
The difference to the alwAct annotation is, that in the case of a false condition
(with AT in the source node) a CT is sent to the target node (instead, an alwAct
annotation would send an AT to the target node in this case).
5.3.2. COCOMA Equivalents for Software Constructs
Sequential Code without Intermediate Branches
Fig. 5.6(b) shows the COCOMA section created from the code snippet in Fig.
5.6(a). Each operator will execute its computation as soon as all inputs are
available, i.e., all incoming data edges have an AT. At data path branches (fan-
out), such as the branch after the + operator, an AT is assigned to each outgoing
edge so that the successors / and − can start their computations independently.
However, the two ATs created by the + are bound to a single piece of data: the
result of the addition.
If/Else and Switch Branches
The mux in Fig. 5.7 joins the data flow after an if/else or switch branch. At the
same time it represents the end of a speculative data flow section. The mux does
not consume a (speculative) data input before the correct input has been chosen
by the control edge. The mis-speculated inputs are canceled; in the dynamic
4A statement on the use of this technique follows in Section 8.3
65
5. COMRADE 2.0 Execution Model
a = b + c;
d = a - e;
f = a / g;
/
g e
f d
b c
a a
-
+
(a) (b)
Figure 5.6.: Sequential operations: (a) sample code snippet, (b) generated CO-
COMA section.
CT model (cf. Section 4.2) they are assigned a CT which will move in reverse
data flow direction until it extinguishes an AT. In the static CT model the CTs
are stored inside the mux until the mis-speculated data actually arrives.
if (c == 1)
    s = a + b;
else
    s = e - f;
+
a b
–
e f
==
c 1
(a) (b)
==(1,0)
mux
10 01
Figure 5.7.: If/else branch: (a) sample code snippet, (b) generated COCOMA
section.
Memory Accesses
Fig. 5.8 shows an example with two memory writes. Because these opera-
tions have side effects, they are not executed speculatively; therefore they are
predicated with an incoming control edge.
Memory edges are used to enforce the execution of memory accesses in pro-
gram order. An access may not start before the incoming memory edge (if any)
has an AT, i.e., the previous access has completed. However, a direct memory
edge between the two store nodes in Fig. (b) would raise a problem. If c was
66
5.3. Hardware Execution Model
false, the store *p would not be executed and thus not emit an AT to the store
*q. *q would wait for *p to complete, resulting in a deadlock. As a solution
we insert memory forwarder statements (<mf>) in the C code in the front-end
of COMRADE 2.0, from which we generate memory forwarder (mf) nodes in
COCOMA as shown in Fig. 5.8. Of course, mfs are only needed if one target
of a branch actually contains a memory access. Similar to a mux joining data
flows, an any node joins alternative memory dependences and forwards the AT
to successive memory accesses. We annotate control edges to mfs with nCT
because creating CTs would not make sense here; mf nodes do not have data
predecessors or control successors that have to be canceled.
if (c) {
  *p = 1;
} else {
  <mf>;
}
*q = 0;
c
store
p ==1
(a) (b)
1
mf
store
q 0
==0
any
==1
nCT
Figure 5.8.: Memory accesses: (a) sample code snippet, (b) generated CO-
COMA section.
Fig. 5.9(a) shows a sample code containing a load and a store. Figs. (b) and (c)
illustrate how tokens flow when a memory read receives a CT from a control
edge. In (b) both control conditions are unfulfilled, because the output of c
does not equal 1. This results in CTs generated for each data input of each of
the memory accesses. The problem now is the CT moving from the store up
to the load (assuming the dynamic CT model). There is no AT which could be
canceled by the CT. Suppressing the generation of this CT is not an immediate
solution, because there is an AT coming into the division operator from its
second data input (75). Therefore, an AT is generated at the output of the
load although the load is canceled. Note that this AT is bound to dummy data
which will never produce a relevant computational result; its single purpose
is to collide with the superfluous CT. We thus call such ATs pseudo activate
tokens (pAT). This technique also works for the static CT model.
67
5. COMRADE 2.0 Execution Model
if (c) {
  v = *p;
  r = v / 75;
  *q = r;
}
c
0
load
p
==1
(a) (b)
1
store
q
==1
/
75 +
c
0
load
p
==1
(c)
1
store
q
==1
/
75
–
–
–
–
+
+
Figure 5.9.: Token flow for canceled loads: (a) sample code snippet, (b) CO-
COMA section with unfulfilled control conditions, (c) pseudo AT
generation at load output.
Nested Conditions
COCOMA has to guarantee that the store *p in Fig. 5.10 is only executed if
both c1 and c2 are true. To guarantee this, COCOMA inserts a control edge
from c1 to c2, reproducing the control hierarchy in the C code.
If c1 is false and creates a CT for c2, that token is forwarded along the outgoing
control edges of c2 to prevent the store from being executed in the case c1 == 0
and c2 == 1. Thus, a CT is distributed in the whole conditional subtree.
This CT distribution raises a problem concerning the behavior of muxes. If
c1 is false, a CT enters c2 and is then sent to the upper mux in Fig. 5.10(b),
so that none of its data inputs is propagated to its output. But the lower mux
anticipates an AT from each data input (including the upper mux) even in this
case, resulting in an excess CT (or a missing AT) at the left input of the lower
mux. As a solution, we require muxes to output a value in any case, even if all
of its inputs are discarded; i.e., in that latter case, the mux emits a dummy value
(which will be discarded by a subsequent mux). Alternatives to this approach
are detailed in Section 8.3.
Loops
Fig. 5.11 shows how loops are implemented with COCOMA. Here, looping is
based on three general concepts. First, the loop control node (i.e., the node
68
5.3. Hardware Execution Model
s = 0;
if (c1) {
  if (c2) {
    s = a + b;
    *p = s;
  }
}
+
a b
c2
c1
store
p
==1
(a) (b)
0
==(1,0) ==(1,0)
s
==1
mux mux
10 01
10 01
Figure 5.10.: Nested conditions: (a) sample code snippet, (b) generated CO-
COMA section.
which computes the loop condition) is evaluated n + 1 times for n loop iter-
ations, i.e., once for each iteration and once for realizing the loop end. This
allows COCOMA to implement very complex loop conditions beyond simple
constant-stride increments. From the loop control node a control edge leads to
an any token node; another control edge leads back to the loop control node.
This enables the reactivation of the loop control node after each loop iteration.
If the any node was missing, the loop control node would need an AT from the
control edge coming in from the left side in the Figure, which is not provided.
However, if two control edges target a node, COCOMA requires only one of
the control predecessors to have an AT in order to activate the target node (see
details in Appendix A). As the any node is only used to forward ATs, we add
an nCT annotation.
Second, each input of the loop body data paths (i.e., the set of operators orig-
inating from a loop body and connected via data edges) is assigned exactly one
AT per token assignment of the loop control node. Token assignment includes
that the loop control node is assigned an AT (evaluation of the loop condition)
or a CT (forwarded from an outer condition or loop).
Third, speculation is extended to loop bodies. The loop body data paths start
computing already before we know if the loop will conduct an(other) iteration.
Each variable which is assigned a value in the loop body and which is con-
69
5. COMRADE 2.0 Execution Model
s = 0;
n = ...;
k = ...;
for (i = 0; i < n; ++i) {
  s += 7;
}
r = s * k;
n
<
(a) (b)
 
loop
+
0
7
*
nop k
==1
==0 s
any
==1
nCT
r
i c 
loop
+
0
1
i c
==1
i
atOnCancel
Figure 5.11.: Loops: (a) sample code snippet, (b) generated COCOMA section.
sumed during a successive loop iteration or by an expression after the loop is
a loop mux variable. In Fig. 5.11(a) loop mux variables are i and s. Fig. (b)
shows the COCOMA section generated for this code. For s, a loop mux has been
generated. Its init input is connected to 0, while the continue input receives
the updated value from an adder. A similar structure is generated for the loop
index variable i; here, the value 1 (instead of 7) is added for each iteration.
According to its internal enterable state, the loop mux accepts data from its init
input only when a loop starts, propagating the continue input afterwards. The
initialization is independent of the loop control node, which controls only the
continue input of the loop mux. As soon as a CT enters the loop mux via the control
edge, the loop mux switches the enterable state back to 1 after all iterations have
actually been processed.
Note that the data successors of a loop mux can be inside the loop body (e.g., the
two adders) as well as outside (e.g., the nop node, which is a simple data buffer).
While the adder consumes an AT from the loop mux during each loop iteration,
only the final result of s may flow into the multiplier outside the loop. We
handle this through inserting a control edge from the loop control node to each
loop mux successor outside the loop body. The condition annotation is inverted
70
5.3. Hardware Execution Model
as shown in Fig. 5.11(b) (==0 instead of ==1), so that a CT is created for every
valid loop iteration. Only when the loop control node is evaluated to 0, i.e.,
after the last loop iteration, is the loop mux output propagated to successors
outside the loop. The atOnCancel annotation establishes that if the loop control
node itself is canceled (e.g., because the loop is located inside an if branch
which is not chosen), the initial value of the loop mux is forwarded to the data
successors outside the loop instead being canceled. This is an important feature
to justify the correctness arguments in Chapter 8. The nop node is inserted to
make sure that the CT dedicated to the loop mux output does not cancel the k
input of the multiplier. If the control edge led directly to the multiplier, a CT
would flow into the k input for each loop iteration. This would produce excess
CTs because k provides only one value intended to be multiplied with the final
s value exiting the loop.
Fig. 5.12 shows how COMRADE guarantees that the variables which are read
in a loop body, but are not assigned a value inside the loop, are reactivated,
providing an AT with associated data for each loop iteration. The C code in
part (a) contains the variables i and n, i being assigned a new value in each
iteration due to ++i, but n not being assigned a new value. The COMRADE 2.0
front-end alters the code, producing the version shown in part (b). After the
insertion of the n = n; statement (which never changes the program semantics),
all variables used inside the loop body receive a new value in each iteration.
Part (c) shows the generated COCOMA section with loop muxes for i and n.
This approach, which has been borrowed from COMRADE 1.0, features quite
a simple implementation (covered by a simple front-end pass), but introduces
an area overhead: In a more sophisticated implementation, the loop mux for n
could be optimized away. In that case, however, a mechanism for the proper
reactivation of n would be needed. Note that simply keeping n active is not a
solution because this would introduce excess ATs on the one hand and render
updates of n (with a different value) impossible on the other hand.
Note that although the loop mux enterable state guarantees that the loop mux is
not re-initialized before the entire loop has ended, pipelining loop bodies is
not inhibited. On data paths along loop muxes, there is actually no pipelining;
however, as long as loop-carried dependences are limited, it takes just a few
cycles until a loop mux outputs the value for the next iteration. E.g., for the
increment ++i of a loop index variable, two cycles are needed. Pipelineable
loop data paths typically receive their input from a load node and write the
output to a store node; both of which are part of the loop body. Such paths also
need input from loop muxes; but these quickly deliver new data (e.g., every two
cycles), independently of the length of the pipelined data path. Thus, the loop
71
5. COMRADE 2.0 Execution Model
mux mechanisms do not inhibit pipelining.
n = ...;
for (i = 0; i < n; ++i) {
  ;
} <
(a) (c)
 
loop
nop...
==1
n
any
==1
nCT
i c 
loop
+
0
1
i c
==1
n = ...;
for (i = 0; i < n; ++i) {
  n = n;
}
(b)
i
i n
Figure 5.12.: Data reactivation in loops: (a) sample code snippet, (b) code
altered by compiler front-end, (c) COCOMA section generated
from (b).
Nested Loops
Fig. 5.13 shows a nested loop example. Here, we have to guarantee that the
outer loop does not introduce a CT (via CT forwarding along control edges)
into the inner loop control node before the inner loop has finished, i.e., the
inner loop control node has been evaluated n + 1 times for n inner iterations.
Otherwise a running inner loop would be incorrectly terminated early.
This is established by delaying the evaluation of the outer loop control node.
We replace the original any node by a network of token nodes that outputs an AT
as soon as all inner loops have ended. In Fig. 5.13(b) this reduces to a simple
any node which is activated when the inner loop has ended (( j < m) == 0) or
the inner loop has not been started at all ((i == 4) == 0). The control edges
leading into this network are annotated with nCT, because this computation is
only AT-related. If there are multiple inner loops, an all node is used in the
token node network to guarantee that all inner loops have ended before the
next outer loop iteration.
72
5.3. Hardware Execution Model
for (i = 0; i < n; ++i)
  if (i == 4)
    for (j = 0; j < m; ++j)
      ...
==
ni
<
any
mj
<
       
                ==0
               nCT
==0
nCT
4i
...
(a) (b)
==1
==1
Figure 5.13.: Nested loops: (a) sample code snippet, (b) generated COCOMA
section.
Memory Accesses in Nested Loops
Fig. 5.14 shows dependent memory accesses on different loop levels. store3 is
executed if i < 8, but not before the inner loop has finished. Here, the program
order of the memory accesses cannot be established by a memory edge from
store2 to store3, because only the execution of store2 in the last inner iteration is
relevant. Instead, store3 is controlled by two control edges, one from the outer
loop control node being responsible for creating CTs in store3, and one from the
inner loop control node being responsible for AT delivery. We encode this with
a nAT annotation of the edge that creates CTs (no ATs) and use nCT annotations
for the edge delivering the ATs.
Fig. 5.14(b) also shows the general way used by COCOMA to obey memory
dependences between loop iterations. The store2 accesses have to be carried out
in program order so that store2 in the last iteration of the inner loop is executed
after all previous store2 accesses. For this we insert a memory edge from store2
back to the loop control node (via an any node). We call the resulting structure
of sequential memory accesses and control nodes a memory chain.
Hardware/Software Interface
To receive variables transferred from SW to HW, a COCOMA instance has
input register nodes (inreg). An inreg receives an AT as soon as a data word
73
5. COMRADE 2.0 Execution Model
for (i = 0; i < 8; ++i) {
  // load1
  s += *(p + i);
  for (j = k; j < 9; ++j) {
    // store2
    *q = s + j;
  }
  // store3
  r[i] = i;
}
load1
i < 8
store3
r+i i
p+i
j < 9
any
store2
s+j
(a) (b)
==1
==1
==1
nAT
nCT
==0 ==1 q
Figure 5.14.: Memory accesses on different loop levels: (a) sample code snip-
pet, (b) generated COCOMA section.
has been transferred from the SW. For the other direction outreg nodes are
used; from there the SW can read the currently stored value at any time. For
signaling the end of the HW computation (or a break in HW computation in the
case of a SW service call) an irqreg node is used. Receiving an AT, this node
generates an IRQ for the CPU. The data word stored in the irqreg identifies the
point in the program where the SW has to continue execution. This identifier
can be read by the SW similar to outregs.
Fig. 5.15 shows an example containing a SW service which is executed if b
== 0. If this condition is true in Fig. (b), the outreg waits for the c value,
which it stores, and then activates the const node. The number 4 in this case
is the identifier for the program section containing the branch target of b ==
0, i.e., the section where the HW switches back to SW5. The identifier is for-
warded through an irq data mux to the irqreg. All constant nodes identifying a
HW-to-SW transition are connected to this special multiplexer, which simply
forwards the currently active input. The irqreg now sends an IRQ to the SW.
The SW has an interrupt handler (not shown in the Figure) which reads out the
irqreg value and jumps to the corresponding line of code. From there, several
5COMRADE 2.0 uses CFG node numbers as identifiers, cf. Fig. 5.2.
74
5.3. Hardware Execution Model
outreg values may be read out, before the SW service executes (layered gray
in Fig. 5.15(a)). After its completion several variable values may be trans-
ferred to inregs (SendVars(kernel_no)) and the SW switches back to HW
(StartHW(kernel_no)). This last action activates an initial node. We assign
such an initial node to each SW-to-HW transition and embed them in the mem-
ory dependence chain. This is necessary to ensure the program order execution
of memory accesses even at the level of HW/SW regions. In the example, the
store *q = 9, which is implemented in HW, may not execute before the store
*c = 0 in the SW service, which is guaranteed by the memory edge from initial
to any.
if (b == 0) {
  ProvideVars();
  StopHW(4);
  RequestVars(kernel_no);
  printf("set %p = 0\n", c);
  *c = 0;
  SendVars(kernel_no);
  StartHW(kernel_no);
  ReceiveVars();
} else {
  *c = a / b;
}
*q = 9;
outreg
b == 0
store
c
const 4
(a) (b)
==1==0a / b
irq
(other HW exits)
irqreg
initial
any
store
Figure 5.15.: HW/SW interface: (a) sample code snippet (SW service on gray
layer), (b) generated COCOMA section.
75
5. COMRADE 2.0 Execution Model
76
6. COCOMA
The COMRADE Controller Micro-Architecture (COCOMA) is a token-based
computation model and, at the same time, an intermediate representation (IR)
for compilation from structured CFGs to hardware. From a COCOMA instance
descriptions of dynamically scheduled hardware can be created in a straight-
forward way. COCOMA models variable latency operators, operator-internal
pipelining, and early evaluation with cancel tokens (both dynamic and static
CTs are supported). This Chapter gives a formal COCOMA definition, which
serves as a baseline for the actual implementation in COMRADE. Because the
complete model description is quite complex, we have moved some details to
Appendix A.
op1
op2 op3
sequencer
op4
CMDFG
Figure 6.1.: COCOMA structure consisting of CMDFG and sequencer.
A COCOMA instance consists of a control memory data flow graph (CMDFG)
and a sequencer as illustrated in Fig. 6.1. The CMDFG models parameterized
operations and their dependences in a directed graph, while the sequencer de-
fines token flow directives and manages the inter-operator handshaking. The
sequencer can be automatically generated from the CMDFG, thus the CMDFG
alone is an IR already - in the compile flow, it is the input of the compiler
pass which generates the COCOMA. Despite this redundant character of the
77
6. COCOMA
sequencer, its generation is an important step towards hardware implementa-
tion.
In the sections that follow we will define the CMDFG (Section 6.1) and the
sequencer (Section 6.2). As the CMDFG is a complete IR, its definition con-
tains far more details than traditional graph-based descriptions such as DFG or
CDFG. For example, nodes have a type defining their function and parameters
specifying buffer sizes, bitwidths, and other options. I/O ports depend on the
node type. Data edges are connected to I/O ports instead of nodes, which is
essential for non-commutative operations. Section 6.3 then combines CMDFG
and sequencer into a COCOMA instance.
6.1. Formal CMDFG Definition
First, we adopt two practical notations from [Golz78] for binary relations, as
we will often make use of these.
Definition 6.1.1. Given a binary relation E ⊆ A× B and some set C, we define
the set of E-successors of C, E˜(C):
E˜(C) := {b ∈ B : ∃a ∈ C : ((a, b) ∈ E)}.
Dually, the set of E-predecessors of C, E˜(C) is defined to be
E˜(C) := {a ∈ A : ∃b ∈ C : ((a, b) ∈ E)}.
For example, in a directed graph G = (N, E), with E ⊆ N × N, E˜({c}) is the set
of successor nodes of node c, i.e., the set of all nodes b such that there is an
edge (c, b) ∈ E. E˜({c}) is the set of predecessor nodes of node c, i.e., the set ofall nodes a such that there is an edge (a, c) ∈ E.
We now define a CMDFG frame which imposes the CMDFG graph structure.
Definition 6.1.2. ACMDFG frame CF = (N, Edat, Econ, Emem) defines a finite
directed graph (N, E) with edge set E = Edat ∪˙ Econ ∪˙ Emem partitioned into
three disjoint subsets: data edges Edat ⊆ N × N, control edges Econ ⊆ N × N
and memory edges Emem ⊆ N × N.
Each node has either only outgoing control edges or no outgoing control edges:
∀(n,m) ∈ Econ : ∀k ∈ E˜(n) : (n, k) ∈ Econ.
Note that this definition of a CMDFG frame does not allow more than one edge
in the same direction between two nodes, i.e., a CMDFG frame is a graph, not
78
6.1. Formal CMDFG Definition
a multi graph. Let us shortly make sure that this restriction is acceptable for
CMDFGs which represent sections of C code, by examining all the prohibited
cases of edges between two nodes n,m.
If there is a control edge (n,m) the CMDFG uses n as dedicated control node.
There will be no necessity for any data or memory flow from n to some other
node, because C statements can only be control dependent on an if/else or switch
condition, not data or memory dependent. Thus, control/data or control/mem-
ory combinations of edges are not necessary. The only remaining combination
of different edges is data/memory. Such combinations are not necessary either
because a memory dependence would be redundant to an existing data depen-
dence. Thus, we only have to examine multiple data edges, multiple control
edges and multiple memory edges between two nodes n and m.
Multiple data edges can only occur when the same variable is used as primary
and secondary input of a binary operator, e.g., a + a. Nearly all of these cases
can be replaced by other expressions (Table 6.1), thereby bypassing the prob-
lem. However, there are no direct substitutions for a ∗ a and a << a. For these
cases (the latter being rarely used), an intermediate nop (no operation) node as
shown in Fig. 6.2(c) can be inserted (without affecting the hardware latency
when configuring the nop operation as combinatorial logic), avoiding the need
for multiple data edges (Fig. 6.2(b)).
a
nop
*
b = a * a;
(a) (b) (c)
a
*
Figure 6.2.: Intermediate node insertion to prevent multiple data edges with the
same orientation between two nodes.
Due to the structuredness of the C source code multiple control edges can only
be created from fall-through switch statements as shown in Fig. 6.3(a). In-
stead of implementing multiple control edges (Fig. 6.3(b)), the solution is to
aggregate the control values (0, 1) into a single control edge (Fig. 6.3(c)).
Furthermore, there is no need for multiple memory edges, because these would
encode exactly the dependence modeled by a single memory edge.
On a CMDFG frame, we define some auxiliary functions:
79
6. COCOMA
Problematic expression Substitute
a + a a << 1
a − a 0
a ∗ a —
a / a 1
a% a 0
a << a —
a >> a 0
a < a 0
a > a 0
a <= a 1
a >= a 1
a == a 1
a != a 0
a & a a
a && a a != 0
a | a a
a || a a != 0
a ∧ a 0
Table 6.1.: Replacement of C expressions which, when directly translated into
CMDFG syntax, would lead to multiple data edges with identical
orientation between two nodes.
c
9
store
switch (c) {
  case 0:
  case 1: *p = 9;
          break;
} p
==1
c
9
store
p
=={1,0}
(a) (b) (c)
==0
Figure 6.3.: Agglomeration of control values in a single control edge.
80
6.1. Formal CMDFG Definition
Definition 6.1.3. Given a CMDFG frame CF with node set N and edge set
E ⊆ N × N. The outgoing edges function Eout : N → P(E) assigns each node
n ∈ N the set of its outgoing edges Eout(n) := {n}× E˜(n). The outgoing control
edges function Eout,con : N → P(Econ) analogously assigns each node n ∈ N
the set of its outgoing control edges Eout,con(n) := {n} × E˜con(n).
The source projection function pro j1 : E → N projects an edge to its source
node pro j1((n,m)) := n.
Analogously, the target projection function pro j2 : E → N projects an edge
to its target node pro j2((n,m)) = m.
We now define node types that can occur in a CMDFG, as well as input and
output ports necessary for data flow.
Definition 6.1.4. A node library
NL = (T,Din,Dout, din, dout,Cin,Cout,Param, param) consists of
• a set of node types T reflecting all integer operations that can occur
in a C source code, as well as auxiliary and token flow node types:
T := {add, sub, mul, div, mod, negation (!), complement (~), shiftLeft, shiftRight,
lessThan, greaterThan, lessThanOrEqual, greaterThanOrEqual, equal, notEqual, bit-
wiseAnd, logicalAnd, bitwiseOr, logicalOr, bitwiseXor, bitSel, mux, loopMux, irqData-
Mux, nop, inreg, outreg, irqreg, load, store, initial, always, all, any, mf},
• two non-empty, finite, disjoint sets Din and Dout of data input ports
and data output ports: Din := {A, B}, Dout := {R} (more than two data
inputs, needed for multiplexers, are also supported – see below),
• a function din : T −→ P(Din) assigning a set of data input ports to each
node type,
• a function dout : T −→ P(Dout) assigning a set of data output ports to
each node type, obeying the condition |dout(t)| ≤ 1 for all t ∈ T , i.e., each
node type has at most one data output port,
• two sets of input and output token flow ports: Cin := {Start, StartCtrl, StartC-
trlPseudo, Sel, StartSel, ResultReadyAck, Cancel, CancelStateAck, CancelStateCtr-
lAck} and Cout := {StartAck, StartCtrlAck, StartSelAck, ResultReady, CancelAck,
CancelState, CancelStateCtrl, TokensEmpty},
• a set of node type parameters Param := {WA,WB, WR, Sign, Depth, QDepth,
TQDepth, StaticCT, StartCtrlIn, PseudoAT, NoCT, NIn, IOAddr, IOAWidth, PortNum,
81
6. COCOMA
Node Type t din(t) dout(t)
add, sub, shiftLeft, shiftRight, lessThan, greaterThan,
lessThanOrEqual, greaterThanOrEqual, equal, notEqual,
bitwiseAnd, logicalAnd, bitwiseOr, logicalOr, bitwiseXor mul, div,
mod, load
{A, B} {R}
negation (!), complement (~), mux, loopMux, irqDataMux {A} {R}
store {A, B} ∅
bitSel, nop {A} {R}
inreg, const ∅ {R}
outreg, irqreg {A} ∅
initial, always, all, any, mf ∅ ∅
Table 6.2.: Input and output data ports assigned to node types.
AddrWidth, DataWidth, Width, AreaNSpeed}, which define bitwidths, signed-
ness, buffer sizes, the token flow model, and other properties (see Chap-
ter 7 for details), and
• a parameter assignment param : T −→ P(Param), assigning a set of
parameters to each node type.
Mux := {mux, loopMux, irqDataMux} is the set of multiplexer data types.
We use din and dout to assign input and output ports to the node types, so that
there is a correspondence to C operators. For example, a div type has two
distinct inputs (the dividend A and the divisor B) and one output (the quotient
R), corresponding to the C operator /. The distinction of the inputs is neces-
sary because in general C operators are not commutative. Table 6.2 gives the
exact definitions for din and dout. Note that inregs have no data input port in
the CMDFG context. However, in the hardware implementation inregs are con-
nected to a data bus, such that the CPU can feed data into the inreg, producing
the new AT. Similarly, outregs and irqregs don’t model data output ports. Mul-
tiplexer node types support an arbitrary number of inputs although they use
only one data input port. All data inputs will be concatenated to a single bus,
which is then connected to port A, having a greater bitwidth accordingly (cf.
Fig. 6.5).
The following listing describes the semantics of the token flow signals, most of
which implement a simple handshaking protocol. This handshaking is based
on synchronous logic, i.e., in the hardware implementation each operator has
82
6.1. Formal CMDFG Definition
a Clock input which is not explicitly defined in the CMDFG. Further customary
inputs are Reset and CE (clock enable) detailed in Secion 7.
• Start, StartAck: If Start is true, there are valid data items at the data input
ports. If the node consumes the data, it sets StartAck to true. For multi-
plexer nodes, Start and StartAck are multi-bit signals, each bit respectively
indicating or acknowledging a data item at exactly one data input.
• StartCtrl, StartCtrlAck: StartCtrl indicates an AT coming in along a control
edge, acknowledged by StartCtrlAck.
• StartCtrlPseudo: If this signal is true while StartCtrl and StartCtrlAck are set,
the node receives a pseudo activate token (pAT). Unlike ATs, pATs create
a dummy value at the node data output without consuming a data input
value. This can be used for the elimination of excess CTs.
• Sel: For mux nodes this is the (multi-bit) select signal. loopMux and irqData-
Mux nodes do not have an explicit select input.
• StartSel, StartSelAck: For mux nodes these signals are used to notify and
acknowledge a valid value on the select input Sel.
• ResultReady, ResultReadyAck: If ResultReady is true, the node has finished
the current operation. The value currently assigned to the data output
port R (if present) represents the result of the computation. ResultReady
and R are held until ResultReadyAck is true.
• Cancel, CancelAck: A true Cancel signals the node that the next result is
superfluous. If CancelAck is true, the node will discard the next computed
result or (if no computation is running and the dynamic CT model is
used) forward the cancel information by setting CancelState and Cancel-
StateCtrl to true.
• CancelState, CancelStateAck: A true CancelState forwards a CT to the data
predecessor nodes, acknowledged by CancelStateAck.
• CancelStateCtrl, CancelStateCtrlAck: A true CancelStateCtrl forwards a CT to
the control successor nodes, acknowledged by CancelStateCtrlAck.
• TokensEmpty: If true, the node does not currently contain any ATs or CTs.
This information is needed to ensure that a loop has executed all its mem-
ory accesses before ending.
83
6. COCOMA
StartCtrl Start CancelState
CancelStateAck
ResultReady
ResultReadyAck
Cancel
CancelAck
add
StartCtrlAck StartAck
R
A
B
StartCtrlPseudo
TokensEmpty
CancelStateCtrl
CancelStateCtrlAck
Figure 6.4.: Data and token flow ports for the node type add.
Fig. 6.4 shows the data and token flow ports relevant for the add node type.
Next, we assign node types to the nodes of a CMDFG frame, set parameters,
and connect predecessor nodes to input ports.
Definition 6.1.5. Given a CMDFG frame and a node library, a data flow
DatF = (t, ps, pin, pout, bits) consists of:
• a type function t : N → T , assigning each node a node type,
• a parameter setting functional psn : param(t(n)) → N0, which sets the
parameters of each node n ∈ N,
• an input port functional pin, which defines for each n ∈ N a surjective
input port function pinn : E˜dat({n}) → din(t(n)), i.e., each data predeces-sor node of n is assigned an input port of t(n),
• an output port functional pout, which defines for each n ∈ N a surjec-
tive output port function poutn : E˜dat({n}) → dout(t(n)), i.e., each data
successor node is assigned a data output port of t(n), and
• a bitwidth configuration bits : Edat → N, assigning each data edge a
positive integer representing its bitwidth.
Given a CMDFG frame, a node library, and a node type t1, the control edge
set Econ can be partitioned into two subsets Econ,t1 , Econ \ Econ,t1 , with Econ,t1
containing all control edges e ∈ Econ pointing to a node of type t1.
A control flow assigns several attributes to control edges:
84
6.1. Formal CMDFG Definition
Definition 6.1.6. Given a CMDFG frame and a node library, a control flow
ConF = (ann, cw,muxPred, con, nctMux) consists of:
• an annotation ann : Econ → {1, nAT , nCT , alwAct, atOnCancel, mux},
which can assign control edges an annotation nAT (no activate token),
nCT (no cancel token), alwAct (always activate), atOnCancel (AT on
CT), or mux (mux control), 1 representing no annotation,
• a control width cw : Econ → N, assigning mux control edges the num-
ber of the mux data predecessors, while the remaining control edges are
assigned a 1, i.e.
cw((n,m)) =

∣∣∣E˜dat(m)∣∣∣ , if t(m) = mux
1, otherwise,
• a mux predecessor assignment functional muxPredm : E˜dat(m) →{0, . . . , ∣∣∣E˜dat(m)∣∣∣−1}, for each mux nodem assigning each data predeces-sor a distinct number, which is needed below to associate predecessors
and control functions,
• a control configuration con : Econ → ⋃k∈N(Fbool)k, which assigns each
control edge e ∈ Econ a vector of control functions with boolean do-
main, ∀ f ∈ Fbool : f : Z → {true, f alse}, where con(e) ∈ (Fbool)cw(e).
If the target of e is not a mux, cw equals 1, so that the vector com-
prises only one element. The purpose of the control function is to tell
for which outputs z ∈ Z of the control edge source node the successor
node is activated or canceled, respectively. We use the set of integers
here, because the branching of a switch statement can be based on any
positive or negative integer value, in contrast to simple if/else scenar-
ios, for which the set {0, 1} would suffice. In the mux case the control
function vector contains one control function per mux data input. Then,
each control function determines for a given control node data output
z, if the next value received from the associated mux data predecessor
will be forwarded ( f (z) = true) or discarded ( f (z) = false). To associate
control functions with mux data predecessors, muxPred is used; for mux
m, control function fi j in the vector ( fi0 , fi1 , . . . ) is associated to the data
predecessor p of m such that muxPredm(p) = j.
• A mux control nCT annotation nctMux : Econ,mux →⋃
k∈N{true, false}k, with nctMux(e) ∈ {true, false}cw(e), can assign
an nCT property to each mux input using the same association as the
control configuration.
85
6. COCOMA
We can now define the CMDFG.
Definition 6.1.7. A Control Memory Data Flow Graph (CMDFG) C =
(CF,NL,DatF,ConF) consists of a CMDFG frame CF, a node library
NL, a data flow DatF, and a control flow ConF.1
Given a CMDFG, a node configuration assigns values to the input or output
ports of each node at a given abstract point t in time2. Thus, this represents
dynamic behavior in contrast to the static model described so far. Node input
configurations are needed to define the current input values needed to carry out
a computation in the nodes; node output configurations represent the computa-
tional results.
Definition 6.1.8. Given a CMDFG, a node input configuration at time t,
NIt : N × (Din ∪ Cin) → Z, assigns each pair of a node and an input port an
integral value. Token flow ports are always assigned either 0 or 1, with two
exceptions: the Sel port transmitting the select signal for mux nodes, and the
Start and CancelStateAck ports of mux and loopMux nodes. In these cases Start and
CancelStateAck have a dedicated bit for each data input as shown in Fig. 6.5.
However, such multi-bit signals (with each bit either 0 or 1) are represented by
non-negative integers in the node input configuration.
Start[2:0]mux
op2
R[7:0]
op1
R[7:0]
op0
R[7:0]
A[23:0]
Sel[2:0]
1 1 1 8 8 8
4 = (100)2
Figure 6.5.: Connection to multi-bit input ports; mux forwards next value com-
ing in from op2, because associated Sel bit is 1.
By partitioning the domain, we break down a node input configuration at time
1There is no need for a dedicated memory flow component, because the memory edges (already
being defined in the CMDFG frame CF) have no further annotations.
2Practically, time is discretized into clock cycles.
86
6.1. Formal CMDFG Definition
t into a node input data configuration at time t, NIDt : N × Din → Z, and a
node input control configuration at time t, NICt : N ×Cin → Z.
Analogously, a node output configuration at time t NOt : N× (Dout∪Cout) →
Z assigns each pair of a node and an output port an integral value. It is broken
down into node output data configuration at time t, NODt : N × Dout → Z,
and node output control configuration at time t, NOCt : N ×Cout → Z.
Having assigned values to the output ports, we can determine if the control
functions which are assigned to control edges return a true or a false value.
The following notation simplifies such return value queries.
Definition 6.1.9. Given a CMDFG, a node output configuration at time t, NOt,
and a node n ∈ N, we define the evaluation functional evaln : Eout,con(n) →⋃
k∈N{true, false}k, evaln((n,m)) := con((n,m))(NOt(n,R)).
Thus, for a control edge econ = (n,m) ∈ Econ, assuming that nodem ∈ N is not a
mux, evaln(econ) is true if the control function of econ, i.e., con(econ) = f , is true
for the value assigned to the output port R of n according to the current node
output configuration NOt. Note that in general, evaln(econ) ∈ {true, false}cw(econ),
because con((n,m)) is a vector of control functions, each one reading the same
value assigned to the output port R of n. Fig. 6.6(a) shows an example: Here,
the current value assigned to output port R is 0, leading to a true evaluation,
because that value matches the control edge annotation. Formally, econ = (a, b),
and con(econ) = f , with f : Z→ {true, false}, so that
f (R) =
true, if R = 0false, otherwise.
Because R = 0, f (R) = {true}, and therefore evala(econ) = true. The example in
Fig. 6.6(b) shows the (more complicated) mux case. As the mux has 3 data pre-
decessors, cw((c, d)) = 3. For econ = (c, d), we have con(econ) = ( f2, f1, f0) ∈
F3bool, with
f2(R) =
true, if R = 2false, otherwise,
f1(R) =
true, if R = 19false, otherwise, and
f0(R) =
true, if R = 4false, otherwise.
87
6. COCOMA
As R = 19 in the example, evalc(econ) = (false, true, false)3.
To simplify the notation, we may write eval(n,m) instead of evaln((n,m)).
==(2,19,4)
d
mux
a
R=0
==0
c
R=19
b
(a) (b)
Figure 6.6.: (a) Non-mux control edge with control width 1; (b) mux control
edge with control width 3.
Each CMDFG node is able to carry out a computation, i.e., use the values avail-
able at the input ports to compute resulting values which are then assigned to
the output ports. We regard this node operation as a black box here, describ-
ing the internal computation structure of a node in Section 7. However, all
CMDFG nodes together formally compute a node output configuration NOt2
from a node input configuration NIt1 as depicted in Fig. 6.7.
nodes
NOt2
NIt1
Figure 6.7.: Operation of CMDFG nodes.
6.2. Formal Sequencer Definition
Whilst data flow connections between input and output data ports (A, B, R) are
already fully covered by the CMDFG, the sequencer defines logic which con-
3In practice, this will set the Sel input of the mux to 010 (binary).
88
6.2. Formal Sequencer Definition
nects the handshaking ports of adjacent nodes. The exact definition of this
logic is quite complex for several reasons. First, data, control, and memory
dependences influence each other. Second, the control edge annotations af-
fect the token flow. Third, the token flow model is parametrized to support
either dynamic or static CTs. Due to their complexity, refer Appendix A for
sequencer details. In this Section we will concentrate on just the basic facts
and definitions.
The logic defined by the sequencer depends on the direct neighborhood of a
node. Fig. 6.8 shows the neighborhood of an arbitrary node y ∈ N. The
predecessor nodes are contained in the node set X, which is partitioned into
Xdat, Xcon, and Xmem, containing data, control, and memory predecessor nodes
respectively. According to the annotation Xcon can be further broken down
into Xcon,1, Xcon,nCT , Xcon,nAT , Xcon,alwAct, Xcon,atOnCancel, and Xcon,mux. The same
scheme is applied to name the successor nodes Z. Elements of such a node set
are printed in lower case, e.g., x ∈ X. To clarify the associated node set, we
sometimes use subscripts in node identifiers, such as xdat ∈ Xdat.
To simplify the sequencer notation further, we write InputPortNamet(y) instead
of NIt(y, InputPortName) for referencing the value assigned to the input port
InputPortName of node y, given a node input configuration NIt. The analogous
notation is applied for output ports. For example, given a node output configu-
ration, ResultReadyt(x) is the value assigned to the ResultReady output port of
node x at time t.
As a basic feature of COCOMA, successor nodes may consume incoming to-
kens independently of each other as illustrated in Fig. 6.9. y provides a valid
result in Fig. (a), which is consumed by z2, but not yet by z1 due to a missing
token from n. Fig. (b) shows how we prevent z2 from being started a sec-
ond time with the same input token; ReadyConsumedt2 (y, z2) is set to 1. Thus,
while we block ResultReadyt2 (y) from activating z2, we let the signal pass to
z1, which still needs to be activated.
Considering current token positions in Fig. 6.9(a), y places an AT on both data
edges, (y, z1) and (y, z2), by setting ResultReadyt2 (y) to 1. In Fig. (b) the AT
on edge (y, z1) still remains, while the AT on edge (y, z2) has moved into z2.
Hence, an AT is present at an edge (y, z) at time t, if ResultReadyt(y) = 1 and
ReadyConsumedt(y, z) = 0.
Analog arguments hold for CTs; here, CancelConsumedt prevents the CT from
canceling a data predecessor more than once. We now formally define these
two functions.
Definition 6.2.1. Given a CMDFG, a sequencer configuration at time t,
89
6. COCOMA
y
z
i
z
j
z
k ...
x
dat
...
w
0
x
con w
0
x
mem......
Z
dat
Z
con
Z
mem
X
dat Xcon Xmem
w
0
x'
con,nAT...
          nAT
X
con,nAT
...
...
X = X
dat
 ∪ X
con
 ∪ X
mem
X
dat
 = {x
dat
, ..., x'
dat
}
X
con
 = X
con,1
 ∪ X
con,nAT
 ∪ X
con,nCT
 ∪ X
con,alwAct
 ∪ X
con,mux
X
con,1
 = {x
con
, ..., x'
con
, ...}
X
mem
 = {x
mem
, ...}
Z
dat
 = {z
dat
, ..., z'
dat
}
Z
con
 = {z
con
, ...}
Z
con
 = Z
con,1
 ∪ Z
con,nAT
 ∪ Z
con,nCT
 ∪ Z
con,alwAct
 ∪ Z
con,mux
Z
con,1
 = {z
con
, ..., z'
con
, ...}
Z
mem
 = {z
mem
, ...}
Figure 6.8.: Predecessors X and successors Z of a node y ∈ N.
90
6.2. Formal Sequencer Definition
y
z2
(a) n
z1
ResultReadyt1(y) = 1
ResultReadyAckt1(y) = 0
Startt1(z2) = 1
StartAckt1(z2) = 1
Startt1(z1) = 0
ResultReadyt1(n) = 0
y
z2
(b) n
z1
ResultReadyt2(y) = 1
ResultReadyAckt2(y) = 0
Startt2(z2) = 0Startt2(z1) = 0
ResultReadyt2(n) = 0
ReadyConsumedt2(y, z2) = 1
+ +
+
Figure 6.9.: AT flow example: (a) both outgoing edges of y have an AT; (b) AT
of edge (y, z2) has been consumed by z2.
SeqCt, consists of aReadyConsumed function and aCancelConsumed func-
tion:
• ReadyConsumedt : E → {0, 1},
• CancelConsumedt : E → {0, 1}.
The main duty of the sequencer is to compute a node input control configura-
tion NICt2 from a given node output control configuration NOCt1 . It also needs
to update its own sequencer configuration SeqCt1 to SeqCt2 . This duty cycle is
depicted in Fig. 6.10.
Definition 6.2.2. Given a CMDFG, a sequencer Seq reads a node output
control configuration NOCt1 and a sequencer configuration SeqCt1 , and out-
puts a node input control configuration NICt2 and a new sequencer configu-
ration SeqCt2 . A detailed specification of how NOCt2 and SeqCt2 are com-
puted is given in Appendix A. Furthermore, a sequencer is parametrized with a
scheduling type SchedType which is either dynCT for dynamic cancel tokens
or statCT for static cancel tokens.
91
6. COCOMA
sequencer
NOCt1
NICt2
SeqCt1
SeqCt2
Figure 6.10.: The sequencer duty cycle.
6.3. Formal COCOMA Definition
To be able to model the behavior of combinatorial logic (e.g., for operation
chaining), we will from now on use a more detailed notion of time. Instead of
pure points t in time, we write (t, i), where t can be understood as clock cycle
number and i references the delta cycle, which does not consume time in the
model. Thus, NOC0,1 and NOC0,2 can be different, although they are assigned
to the same clock cycle.
Definition 6.3.1. A Comrade Controller Micro-Architecture (COCOMA)
instance (C,NO0,0, S , SeqC0) consists of
• a CMDFG C,
• an initial node output configuration NO0,0,
• a sequencer Seq, and
• an initial sequencer configuration SeqC0.
To get a quick overview the COCOMA operation is shown in Fig. 6.11 in
a simplified version without timing subscripts. The CMDFG nodes generate
NOD and NOC from NID and NIC. The sequencer uses NOC and SeqC to
compute NIC and update SeqC. Node data outputs NOD are directed to node
data inputs NID according to the CMDFG data edges.
For the implementation of COCOMA in synchronous hardware, a central as-
pect is the mapping to actual clock edges. COCOMA primarily performs two
92
6.3. Formal COCOMA Definition
sequencerCMDFG
nodes
NOC
NIC
SeqC
 
 CMDFG
data edges
NOD
NID
Figure 6.11.: COCOMA operation overview.
actions during one cycle. First, node outputs flow along data edges and through
the sequencer logic to produce node inputs which are consumed by nodes to
produce new node outputs. Second, the sequencer configuration is updated.
However, mapping all of these operations to one clock cycle would restrict
COCOMA too much, because at least one cycle would be needed to compute
the node outputs from the node inputs. To support even combinatorial (un-
registered) operations (operation chaining) COCOMA contains a delta cycle
phase as shown in the upper part of Fig. 6.12. Here, new node outputs are
computed until the configuration is stable. Whether a node works combinatori-
ally depends on the node parameter DEPTH. DEPTH = 0 means a combinatorial
computation (unregistered), DEPTH > 0 results in a non-combinatorial, buffered
(registered) computation (cf. Section 7). After that, the lower part of Fig. 6.12
is entered increasing the time value (m := n + 1) and resetting the delta cy-
cle number. Note that sequencer configurations SeqC do not need delta cycle
updates, therefore only one index for clock edge mapping is present.
93
6. COCOMA
ne
xt
 c
lo
ck
 c
yc
le
sequencer
unregistered
CMDFG nodes
NOCn,i
NICn,j
SeqCn
SeqCm
CMDFG
data edges
NODn,i
NIDn,j
NOCn,kNODn,k
sequencer
NICn,k+1
CMDFG
data edges
NIDn,k+1
if NOn,k != NOn,i {
    i := k;
}
m := n + 1
if NOn,k == NOn,i
k := j + 1
n := m
i := 0
registered
CMDFG nodes
NOCm,0NODm,0
n := m
de
lta
 c
yc
le
 b
eh
av
io
r
j := i + 1
Figure 6.12.: Detailed COCOMA operation.
94
7. Hardware Operator Library
The COCOMA nodes motivated in Chapter 5 and defined in Chapter 6 require
a hardware operator library from which they can be instantiated for simulation
and actual synthesis purposes. COMRADE 2.0 sets out the following require-
ments for a library such as this:
• It must support all C language operators as well as the auxiliary and
token flow operators defined in Def. 6.1.4.
• It is platform independent, i.e., it can be used for a variety of target FP-
GAs.
• Each operator is synthesizable, so that current FPGA synthesis, map-
ping and placement tools can automatically generate appropriate FPGA
configurations.
• It is compatible to the COCOMA token flow, i.e., operators must be able
to consume, process, and propagate ATs and CTs.
• It is parametrizable, so that operator features can be adjusted to the com-
piler needs. Parameters include bitwidths, buffer sizes, token flow (dy-
namic or static CTs), predication, and synthesis directives such as opti-
mization strategies (e.g., area vs. time).
• It provides meta data for each operator, such as area consumption and
achievable frequency, depending on the parameter values and on the tar-
get FPGA.
COMRADE 1.0 has used the GLACE module generator library [NeKo01] for
the mapping of C operators to hardware. GLACE creates compact gate-level
designs, furnished with placement directives for specific target platforms. This
low-level approach is no longer required today. Logic synthesis tools have
improved and at the same time the resources available on an FPGA have in-
creased, so that the designing on the higher register transfer level (RTL) has
become a universal standard. In particular the overhead of porting the GLACE
95
7. Hardware Operator Library
generator routines to new FPGA fabrics on the one hand and the additional
features required by COCOMA on the other hand have prompted us to re-
place GLACE with a newer, higher-level and more powerful RTL operator li-
brary. This new library called Modlib has been developed fulfilling the above
requirements in cooperation with the Embedded Systems and Applications
Group (ESA), headed by Professor Andreas Koch at Technische Universität
Darmstadt and Benjamin Thielmann (Abteilung E.I.S., Technische Universität
Braunschweig), who has implemented the major part of the functionality.
Modlib is applicable beyond COMRADE. The library can also be utilized by
compilers which create statically scheduled hardware. An example is a domain
specific compiler for Geometric Algebra [HMSH10], which has recently been
developed by the ESA Group at TU Darmstadt.
Section 7.1 describes theModlib module parameters and highlights some of the
library features. MoreModlib details can be found in [GäTK10] and [ThGä10].
Modlib fulfills all of the above requirements except for the last requirement – it
does not provide meta data. The GLACE generators used by COMRADE 1.0
contain an algorithm for each operator class (such as addition, multiplication
etc.) which analytically computes the meta data. This approach is not feasible
for Modlib due to the higher level of design. First, the actual gate level imple-
mentation of an operator is left to the synthesis tool, so that Modlib itself has
no knowledge of the area required and frequency achieved on the target plat-
form. Second, an analytical computation is not desirable because this would
inhibit the immediate support of new target FPGAs; the analysis algorithms
of each operator class would have to be updated for every new technology.
Instead, meta data for Modlib operators are determined empirically. This is
accomplished by the Meta Data Fetcher (MDF) framework, which has been
developed in context of this work. Details about MDF are given in Section 7.2.
7.1. Modlib Parameters
Modlib consists of a number of Verilog files, with each file containing a mod-
ule definition of one operator. This Section gives an overview of the module
parameters and their meaning, highlighting some central features which are
useful for the application of Modlib in COMRADE 2.0.
96
7.1. Modlib Parameters
WA, WB, WR
These parameters define the bitwidths of the data inputs A and B and the data
output R.
Sign
Sign = 0 instantiates an unsigned operator, Sign = 1 creates a signed operator.
Depth, QDepth, TQDepth
These parameters define buffer sizes.
Depth stands for an optional additional latency, which can be useful for three
purposes:
1. Implementation of a data output register. Most Modlib operators are
combinatorial by default; to prevent long combinatorial paths in a se-
quence of several operators, COMRADE 2.0 adds an output register (i.e.,
Depth = 1) for each data manipulating operator. If two or more consec-
utive operators fit in one clock cycle, Depth can be locally adjusted to 0
(operation chaining).
2. Retiming: by setting Depth to a value greater than 1, the synthesis tool is
given the possibility to span complex operations (such as multiplication
or division) over several pipeline stages.
3. Pipeline balancing: A Depth value greater than 1 can be used to align the
operator latency with the latency of a parallel data path.
QDepth defines the size of an optional, transparent data output queue. Trans-
parent means that instead of inserting a fixed latency, the buffered data is for-
warded immediately. If the queue is empty, data passes the buffer combinato-
rially; otherwise an incoming data word is stored until it exits the buffer. This
can be used to alleviate back pressure. By having an output queue, an oper-
ator can compute further results, even though the current operator data output
has not yet been acknowledged by the successive units. For dynamic schedul-
ing contexts, this implements a dynamic kind of pipeline balancing. Thus,
in dynamic scheduling, sections of fixed latencies can be balanced using both
Depth or QDepth. Fig. 7.4 shows that (at least for the synthesis tool and the
97
7. Hardware Operator Library
target FPGA used here) for buffer sizes greater than one, transparent output
queues (QDepth) occupy fewer area resources than fixed latency shift registers
(Depth). Thus, we will generally use output queues for pipeline balancing issues
in COMRADE 2.0.
TQDepth (used only when the operation is predicated) sets the size of a trans-
parent buffer for tokens coming in via a control edge.
StaticCT
StaticCT = 1 configures the operator for static CT behavior, i.e., CTs are not
propagated to predecessor operators; StaticCT = 0 sets the dynamic CT behav-
ior.
StartCtrlIn
If StartCtrlIn = 1, the operator is predicated. It then executes its computation
speculatively, but does not forward the result until an AT has entered the oper-
ator along an incoming control edge. Exceptions are the memory access opera-
tions memread and memwrite which do not access the memory before the AT has
arrived1. StartCtrlIn = 0 means non-predicated operation, i.e., no control edge
leads to the operator, and a result is output as soon as it is ready.
PseudoAT
If PseudoAT = 1, the operator creates a pseudo AT (pAT) from an incoming CT,
cf. Section 5.3.1, Fig. 5.5(n). PseudoAT = 0 conversely disables this feature.
NIn
For multiplexers, NIn sets the number of data inputs. An exception is the loopMux
operator, which always has two inputs.
1For memread and memwrite, we diverge from the speculative operator concept because of cur-
rent restrictions of the implementation. As soon as our memory back-end supports speculative
accesses, memread and memwrite may shift back in line and also implement the speculative
execution semantics. Memwrite may also be updated to use the general mechanisms shown
here.
98
7.2. Meta Data Fetcher
NoCT
For multiplexers configured for dynamic CTs, this parameter denotes the inputs
to which no CTs may be propagated in the cancel case. COMRADE 2.0 needs
this to avoid excess CTs, cf. Section 8.
IOAddr, IOAWidth
These are parameters used only by the operators inreg, outreg, and irqreg, which
implement I/O registers, i.e., registers which are accessible from an outlying
authority, such as a CPU. IOAddr is the address associated to the register; IOAW-
idth is the bitwidth of the address bus.
PortNum, AddrWidth, DataWidth, Width
These are parameters dedicated to the memory access operators memread and
memwrite. PortNum defines the port number for memory systems with multiple
parallel ports such as the MARC system [LaKo00] used by COMRADE 2.0.
AddrWidth and DataWidth give the address and data bus widths; Width is used as a
byte enable, supporting 8, 16, or 32 bits wide accesses.
AreaNSpeed
If AreaNSpeed = 1, the operator is optimized for area; otherwise it is optimized
for speed (i.e., for a high throughput with a low initiation interval). This does
not change the parameters for the synthesis tool (such as the target frequency) –
instead a preselection of the operator implementation is made in Modlib. This
is currently only used for the division operator, which has by far the biggest
spectrum of different implementations amoung the integer operators (cf. the
comparisons in Section 7.2.1).
7.2. Meta Data Fetcher
Fig. 7.1 shows the MDF tool flow. When COMRADE 2.0 sends a query, MDF
determines the desired meta data empirically. Query parameters include the
operator name (e.g., add), the Modlib parameters described in Section 7.1, the
target technology (e.g., Xilinx Virtex-5, xc5vfx70t, speed grade -1), the target
99
7. Hardware Operator Library
COMRADE 2.0 Compiler
MDF
Modlib
● generate operator wrapper
● execute tool chain (2 passes)
● extract results from reports
add sub mul div ...
query parameters:
● operator
● Modlib parameters
● technology
● target frequency
● synthesis tool
reply values:
● area
− FF
− LUT
− slice
− mult
● frequency
instantiate operator
Figure 7.1.: MDF tool flow.
100
7.2. Meta Data Fetcher
frequency (e.g., 100 MHz), and the synthesis tool to be used (currently, Synop-
sys Synplify or Xilinx XST are supported). On a query, MDF creates a Verilog
module wrapper which instantiates the target operator. Then, MDF executes a
third party tool chain, which synthesizes the wrapper and the instantiated mod-
ule (using the selected synthesis tool) and then maps, places, and routes the
design for the target FPGA, finally analysing the timing. At this point in time
MDF is currently configured for the Xilinx ISE tool chain (synthesis → ngdbuild
→ map → par → trce), but can easily be adapted to other vendors’ tool chains.
There are two tool chain passes, the first pass determining the required area
from the map report (a text file output by the map tool), while the second pass
extracts the maximum frequency from the timing report (the output of the trce
tool). Therefore, the first pass ends with the mapping step.
Two passes are needed because for the determination of timing results, another
operator wrapper must be used in which all I/Os (except CLK and RESET) are
buffered in a register: Logic paths not containing registers are not considered
by the default period analysis of the Xilinx trce tool. Area meta data measure-
ments must be performed without these registers, since they do not belong to
the module internals.
The meta data extracted from the reports are transferred in a reply to the com-
piler. MDF holds queries and replies in a cache file, so that meta data which has
been determined already on a query is re-used as reply for subsequent identical
queries.
7.2.1. Measurements
Figs. 7.2 through 7.8 show area and timing data determined by MDF for a
Xilinx Virtex-5, xc5vfx70t FPGA, speed grade -1, target frequency 100 MHz,
using Synopsys Synplify Premier DP 9.6 for synthesis and the Xilinx ISE 10.1
tools for map, place, and route. However, when interpreting the numbers gath-
ered with MDF, we first have to admit that these numbers do not account for
effects occuring when multiple Modlib modules are used in bigger designs.
On the one hand the occupied area may increase because of logic or register
duplication coping with high fanouts or target frequencies. On the other hand
the area may decrease because of boundary optimization. Regarding the fre-
quency, high fanouts or complex routing can lead to an overall frequency which
is lower than the lowest frequency determined by MDF.
Second, timing measurements for simple, combinatorial modules (such as a 32
bits wide adder) often heavily depend on the placement of input and output
101
7. Hardware Operator Library
buffer registers relative to the operator logic. The routing delay to such a buffer
can be a significant part of the critical path.
Third, MDF measurements are highly dependent on the synthesis tool. For
example, the tool decides if storage elements are implemented as slice flip-
flops or as distributed RAM in look-up tables (LUTs). Similarly, multiplication
operators can be implemented using LUTs or embedded DSP blocks.
All in all, MDF measurements should be regarded as an estimate for the re-
quired area and achieved frequency of Modlib operators in a larger design.
Fig. 7.2 shows measurements for different operators with invariant Modlib pa-
rameters. add (addition), bitand (bitwise and operator), and mul (multiplication)
consume between 22 and 36 FFs and 41 LUTs, while the divider optimized for
high throughput goes far beyond that with 3,184 FFs and 4,228 LUTs (an area
optimized version significantly reduces the required resources, see Fig. 7.8).
The mul operator has been implemented using three embedded DSP blocks of
the Virtex-5 FPGA. Measured frequencies are between 137 and 236 MHz (note
the 10 MHz unit).
0 0
3
0
0
10
20
30
40
50
and bitand mul divint
#FFs
#LUTs
#Mults
Frequency [10 MHz]
3184  4228
Figure 7.2.: MDF: Different operators; WA=WB=WR=32, Depth=1,
QDepth=TDepth=0, StaticCT=0, StartCtrlIn=0, Sign=0; divint: Are-
aNSpeed=0.
Fig. 7.3 compares area requirements and frequencies of adders with differ-
ent bitwidths. While the number of LUTs increases almost linearly with the
102
7.2. Meta Data Fetcher
bitwidth, the frequency drops from about 400 MHz (8 or 16 bit adder) to just
under 300 MHz (32 or 64 bit adder). However, the latter effect is attributable
more to I/O buffer routing delays than to the longer carry chain.
0
10
20
30
40
50
60
70
8 bit 16 bit 32 bit 64 bit
#FFs
#LUTs
Frequency [10 MHz]
Figure 7.3.: MDF: add operator: different bitwidths; Depth=QDepth=TDepth=0,
StaticCT=0, StartCtrlIn=0, Sign=0.
Fig. 7.4 compares the area requirements of fixed-latency buffers (Depth) to
transparent buffers (QDepth). For buffer sizes greater than one, transparent
buffers occupy fewer area resources than fixed-latency buffers (for Virtex-5,
the FF:LUT reation is 1:1), the latter exhibiting a better balance between FFs
and LUTs. The Depth LUT and FF curves indicate that the synthesis tool uses
distributed RAM for buffer sizes greater than two.
Fig. 7.5 shows the area increase of predication due to the additional token
queue and the computation logic for the acknowledge signal StartCtrlAck. The
measured frequencies differ by about 30 MHz.
A comparison of results for static and dynamic CT configurations is shown
in Fig. 7.6. Static CTs achieve a slightly smaller area and a much higher
frequency (307 MHz static vs. 175 MHz dynamic) than dynamic CTs, due to
the token propagation logic which is only needed in the dynamic case.
Fig. 7.7 shows the area increase due to greater transparent data buffers (QDepth).
The greater hop from QDepth=0 to QDepth=16 is due to the buffer administration
logic, which is optimized away for buffer size 0.
103
7. Hardware Operator Library
0
20
40
60
80
100
120
140
160
180
200
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Depth or QDepth value
#FFs Depth
#FFs QDepth
#LUTs Depth
#LUTs QDepth
#FFs + #LUTs Depth
#FFs + #LUTs QDepth
Figure 7.4.: MDF: add operator: comparing area for Depth vs. QDepth;
WA=WB=WR=32, TDepth=0, StaticCT=0, StartCtrlIn=0, Sign=0.
0
10
20
30
40
50
60
predication off predication on
#FFs
#LUTs
Frequency [10 MHz]
Figure 7.5.: MDF: add operator: predication (StartCtrlIn=1 vs. no predication
(StartCtrlIn=0); WA=WB=WR=32, Depth=1, QDepth=0, TDepth=16,
StaticCT=0, Sign=0.
104
7.2. Meta Data Fetcher
0
5
10
15
20
25
30
35
40
45
static CTs dynamic CTs
#FFs
#LUTs
Frequency [10 MHz]
Figure 7.6.: MDF: add operator: static CTs (StaticCT=1 vs. dynamic CTs
(StaticCT=0); WA=WB=WR=32, Depth=1, QDepth=TDepth=0, StartC-
trlIn=0, Sign=0.
0
20
40
60
80
100
120
140
160
180
0 16 32 48 64
QDepth
#FFs
#LUTs
Figure 7.7.: MDF: add operator: variation of the QDepth parameter;
WA=WB=WR=32, Depth=TDepth=0, StartCtrlIn=0, StaticCT=0, Sign=0.
105
7. Hardware Operator Library
Fig. 7.8 shows the influence of the AreaNSpeed parameter for the divider oper-
ator. The area optimized version – here, the initiation interval (II) is 8 – re-
quires 545 FFs and 357 LUTs, while the throughput optimized version (II=1)
occupies 3,184 FFs and 4,228 slices.
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Area Speed
#FFs
#LUTs
Figure 7.8.: MDF: divint operator: area optimized (AreaNSpeed=1) vs.
throughput optimized (AreaNSpeed=0); WA=WB=WR=32, Depth=1,
QDepth=TDepth=0, StartCtrlIn=0, StaticCT=0, Sign=0.
106
8. COMRADE 2.0 Compile Flow
Fig. 8.1 shows the COMRADE 2.0 compile flow which improves and extends
the COMRADE 1.0 flow (Section 3.2). Most of the extensions are related to
the back-end; however some adjustment of the front-end was necessary, too.
The major changes of the SUIF2 passes are denoted by an asterisk.
At the very beginning of the flow a new goto removal pass [Webe08] trans-
forms the input program into a t-structured program (cf. page 24). For this we
have adopted the method proposed by Erosa and Hendren [ErHe94], which re-
places goto statements with loops and if/else structures, and have extended the
approach so that breaks and continues are replaced as well. An exception are the
breaks in switch statements which do not violate the structuredness. An example
of this is shown in Fig. 8.2. Here the breaks are used to prevent a fall-through
to the next switch case. Finally, the pass normalizes do-while loops to while
loops to actually produce a t-structured program which processes loop headers
before the loop body.
The loop duplication pass of COMRADE 1.0 has been removed in 2.0 due to
lack of infrastructure supporting its use in practice. We do not choose between
hardware and software implementations at runtime, and instead determine the
(constraint-based) partitioning at compile time – COMRADE 1.0 did not offer
any method to determine such decisions at runtime anyway.
As Section 5.3.2 points out, memory forwarder nodes in the CMDFG are re-
quired for correct token flow along memory edges. The memory forwarder
statements from which those nodes are created are inserted using the newmem-
ory forwarder insertion pass. This pass also makes sure that for each loop
header the exit successor CFG node (the successor which is executed when the
loop ends) contains a memory access (MA), which is required for the subse-
quent memory edge insertion. If noMA is a priori present, a memory forwarder
statement is inserted here.
Instead of GLACE, COMRADE 2.0 uses the new Modlib library (Chapter 7)
for the implementation of the C operator semantics in hardware; accordingly,
MDF (Section 7.2) is used to extract meta data. MDF simplifies many algo-
rithms and data structures because it is a C++ library, while GLACE is pro-
107
8. COMRADE 2.0 Compile Flow
C Source
SUIF2
Passes
CFG Generation
HW/SW Partitioning
SSA Conversion
High-Level Optimizations
Constant Propagation
Bitwidth Reduction
Tree Height Reduction
Scalar Replacement
HW Kernel Compaction
SW Part (C)
Synthesis
Constraints
Configuration Scheduling
FPGA
Configurations
Dynamic Profiling
Profile-based Inlining
Omega Dependence Analysis
Memory Forwarder Insertion
Meta Data Annotation
CMDFG Generation
Sequencer
Verilog
Datapaths
.v
Support IP
MDF
Sequencers
.v
Vendor
Tools
GCC
Adaptive
Computer
Simulation
SW Executable
Query
Meta Data
CPP
Transform to SUIF2 Format
Goto Removal
Self-Assignment Statements
Low-Level Optimizations
Operation Chaining
Memory Access Reordering
Memory Access Parallelization
Memory Access Localization
A
ST
C
FG
C
M
D
FG
C
FG
Synthesis
User Constraints
Modlib
Modlib
LMEM
MARC
FPGA
Configuration
Figure 8.1.: The COMRADE 2.0 compile flow.
108
switch(c) {
  0:       a = 7;
           break;
  1:       a = 8;
           break;
  default: a = 9;
           break;
}
(a) (b)
switch(c)
a1 = 7; a2 = 8; a3 = 9;
a4 = phi(a1, a2, a3);
0 1 default
Figure 8.2.: Break statements not violating the structuredness; (a) C code, (b)
CFG.
grammed in Java and so introduced a considerable overhead in COMRADE
1.0 due to the required C++/Java interfaces.
The original COMRADE 1.0 CDFG generation pass has been partly re-used
(node and data edge creation) and extended with new functionality (regarding
control edges and memory edges) to form the new CMDFG generation pass
which implements the hardware generation concepts described in Chapter 5.
The implementation of these new concepts are the key for the broad language
support offered by COMRADE 2.0, e.g., arbitrarily nested loops and conditions
(including pointer accesses in loop nests) which were not correctly supported
yet by COMRADE 1.0. Details on how the CMDFG is built from the CFG
follow in Section 8.1.
New features of COMRADE 2.0 include four low-level optimization passes
which we detail in Chapter 9.
The sequencer generation pass implements the token flow concepts defined
with COCOMA and represents one of the most important contributions of this
work.
The Verilog creation pass has been completely reworked in COMRADE 2.0.
The sequencers are written directly in Verilog instead of SLIF, because the
synthesis tools today are sophisticated enough to render additional logic opti-
mization techniques virtually unnecessary. Furthermore, the naming of wires
connecting operator instances and sequencer logic is now considerably more
intuitive and context-based, simplifying debugging processes. The generated
synthesis constraints allow the fully automatic synthesis of the created HW
kernels. The new LMEM framework provides access to local on-chip memory,
increasing the available memory bandwidth considerably. However, the local
109
8. COMRADE 2.0 Compile Flow
memory features (cf. Chapter 9) have been added only during the final develop-
ment phase of this work, and are not completely integrated into the automatic
flow. Thus, some user intervention is required for their use.
A step which is still missing in order to complete the desired framework is
the integration of the existing configuration scheduling technique. For this, an
adequate adjustment of the output C code needs to be implemented, including
code segments for the reconfiguration of the RCU. However, a significant im-
provement over COMRADE 1.0 here is that through setting user constraints,
multiple HW kernels can be selected and automatically synthesized into a sin-
gle FPGA configuration.
8.1. Building the CMDFG
The transformation from the control flow-oriented, high-level CFG to the low-
level, data flow-based CMDFG representation is a central part of the COM-
RADE 2.0 hardware back-end. We will first explain the CMDFG creation
on a step-by-step basis: CMDFG nodes (Section 8.1.1), data edges (Section
8.1.2), control edges (Section 8.1.3), and memory edges (Section 8.1.4). Note
that each of these steps operates on a HW region within the CFG, i.e., a CFG
can contain several HW regions, from each of which we generate a separate
CMDFG. Second, we investigate the correctness of the approach (Section 8.2)
and finally consider an alternative token flow for nested conditions in Section
8.3.
8.1.1. Nodes
To create CMDFG nodes for a CFG HW region, the statements in each HW
region node are traversed. From each of these statements one or more CMDFG
nodes are generated according to the complexity of the contained expressions.
An example is shown in Fig. 8.3, which already includes CMDFG data edges
for easier correlation with the C code; however, the data edges are actually
inserted in a later step. For c = a + b, only a single adder is created. From *p = a -
b + c, the compiler builds a store node with an address input for p, a subtraction
operator, and an adder.
Appendix B shows a complete list of assignments from C operations and state-
ments to CMDFG node types.
110
8.1. Building the CMDFG
c = a + b;
*p = a - b + c;
a b
c
+
(a) (b)
a b
c-
+
store
p
Figure 8.3.: (a) Statement contained in a CFG node; (b) generated CMDFG
nodes and data edges.
8.1.2. Data Edges
The C statements and expressions in the CFG nodes define producers and con-
sumers of variables. A producer of a variable is an operator which computes
a value that is assigned to that variable; a consumer is an operator reading the
variable value. For example, given the two statements c = a + b; d = c - 1, the
addition in the first statement is a producer of c, while the subtraction in the
second statement is a consumer of c. Obviously, consumers are data dependent
on producers; these dependences are modeled by data edges. As the CFG has
already been transformed to the SSA form, each variable has exactly one pro-
ducer. Thus, inserting data edges is done by iterating over the producers and
for each producer adding a data edge to each of its consumers1.
When a data edge is connected to a consumer, it is important to choose the
correct input port, because CMDFG nodes (as well as C operators) are gener-
ally not commutative. For that purpose, the operator input computed first in
the program flow is always connected to data port A of the CMDFG node2.
1Connecting a data edge to a producer or consumer means to connect it to the CMDFG node that
has been generated from the producer or consumer operator.
2For muxes, the least significant bits of the A port connect to the first input according to the
program flow, etc.
111
8. COMRADE 2.0 Compile Flow
Bitwidths are assigned to the data edges according to the C data types.
Self-assignment statements (n = n) produce cyclic data paths of length one,
where the output of a loop mux directly connects to the continue input of the
same loop mux. This is problematic for dynamic scheduling, because certain
conditions can lead to deadlocks. Such a situation is shown in Fig. 8.4(a).
The loop mux, currently having an AT at its output, must not accept another
incoming AT from a data predecessor before the output AT has been consumed
by all data successors. As the loop mux is a successor of itself, it will not accept
any further incoming AT (deadlock). To solve this problem we expand such
data path cycles through insertion of a nop node (Fig. 8.4(b)). An alternative
would be to set the QDEPTH parameter of the loop mux to at least one, allowing
it to consume another AT via the continue input before the current output has
been acknowledged.
(a) (b)
 
loop
 
i c
 
+
 
loop
nop
 
i c
 
+
Figure 8.4.: (a) Before, (b) after nop node insertion for data flow loops.
8.1.3. Control Edges
Control edges are used in the CMDFG for different purposes (cf. Sec-
tion 5.3.2). For example, they implement the predicated execution of non-
speculative operations as well as the selection from a set of alternative data
paths. Furthermore, they are used to forward tokens in control hierarchies,
and to control the data propagation from loop muxes to outer loop levels. Due
to these diverse application domains the control edge generation is subdivided
into several sub-steps as shown in Algorithm 1. In the following we detail each
of these steps.
112
8.1. Building the CMDFG
Algorithm 1 Insertion of control edges.
1: insert control edges to mux inputs
2: insert control edges to loop mux inputs
3: insert control edges to loop mux outputs
4: insert control edges to constants
5: build control hierachies
6: insert control edges to memory accesses
7: loop control nets
8: HW/SW interfaces
9: merge control edges
Mux Inputs
Alternative data paths originating from if/else or switch branches are joined at
muxes. The preliminary objective here is to insert a control edge to each data
predecessor of the mux. Only if the control condition is true, is the result of
that predecessor forwarded to the mux; otherwise that (mis-speculated) piece of
data is discarded. In this preliminary model the mux simply accepts the active
input, where not more than one input may be active at a time. (In Step 9, the
control edges to mux predecessors will be merged, which will drop the latter
requirement.)
Fig. 8.5 shows an example in which control edges are inserted for the pre-
decessors of mux m. To find the CMDFG mux control node for m (i.e., the
node computing the condition that decides which input of m is chosen), we
first determine the CFG node mcfg that contains the phi statement from which
m originates (1.). Obviously, mcfg is a join node. Due to the structuredness
there is exactly one branch node ccfg for mcfg (2.). We find ccfg by inspecting
the predecessors of mcfg. If such a predecessor is a branch node, that node is
already ccfg. Otherwise an arbitrary predecessor pcfg is chosen; the controller of
pcfg is ccfg. Such a controller ccfg exists, because pcfg is an alternative branch in
an if/else or switch body. There is only one such controller because of Proposi-
tion 2.1.1. Having ccfg, the CMDFG control node c is the node which has been
created from the condition in the last statement of ccfg (3.).
Now we insert a control edge (c, pi) for each mux predecessor pi. For this we
have to extract the control annotation (==0 or ==1 in Fig. 8.5(b)) that defines
for which output of c the value computed by pi is forwarded to m. Here, the
first step is to determine if the CFG node pi,cfg from which pi originates is
dominated by ccfg. If so, pi,cfg is obviously located in an alternative branch
113
8. COMRADE 2.0 Compile Flow
+
a b
–
e f
==
d 1
(a) (b)
==1
if (d == 1)
s1 = a + b; s2 = e - f;
s3 = phi(s1, s2);
1 0
m
c
p
1
p
2
m
cfg
c
cfg
==0
s1 s2
1.
2.
3.
p
1,cfg
p
2,cfg
Figure 8.5.: (a) CFG, pi,cfg dominated by ccfg; (b) CMDFG with control edges
inserted for mux predecessors.
between ccfg and mcfg. We then find the successor of ccfg which dominates
pi,cfg (in Fig. 8.5(a) this is pi,cfg itself); the value annotated at the CFG edge
from ccfg to that successor (0 or 1 in the example) is used for the control edge
annotation3.
+
a b
–
e f
==
d 1
(a) (b)
==1
if (d == 1)
s1 = a + b;
s2 = e - f;
s3 = phi(s1, s2);
1
0
m
c
p
1
p
2
m
cfg
c
cfg
==0
s1 s2
p
1,cfg
p
2,cfg
Figure 8.6.: (a) CFG, p1,cfg not dominated by ccfg; (b) CMDFG with control
edges inserted for mux predecessors.
The example in Fig. 8.6 shows a case in which one predecessor of mcfg, p1,cfg
is not dominated by ccfg; instead, p1,cfg is located before ccfg in the program
3An exception is the default annotation of switch statements, in which the actual annotation for
the CMDFG control edge is computed from the remaining outgoing CFG edges of the CFG
controller node.
114
8.1. Building the CMDFG
flow. According to the following Lemma 8.1.1 there can only be one such non-
dominated node pi,cfg. So we first extract the annotations for the dominated
nodes and mark the associated successors of ccfg as used. For if/else there can
be at most one unused successor, while a switch may have several. We combine
the annotations of the edges from ccfg to the unused ccfg successors and assign
the resulting control annotation to the control edge (c, pi).
Lemma 8.1.1. Given a t-structured CFG with a phi statement ps in a non-
header join node mcfg and an associated branch node ccfg. There can only be
one mcfg predecessor pi,cfg which is not dominated by ccfg.
Proof. Assume there are two such predecessors pi,cfg and p j,cfg, i , j. ps
w.l.o.g. computes var = phi(var1, . . . , vari, . . . , var j, . . . , varn), where vari is
defined in pi,cfg, and var j is defined in p j,cfg. pi,cfg and p j,cfg are located in
alternative branches, i.e., every path from one to the other contains a join node,
because otherwise vari and var j would not be inputs of the same phi statement.
As pi,cfg and p j,cfg are both not located on a path between ccfg and mcfg (because
they are not dominated by ccfg), there must be a join node in the program flow
before ccfg, which contains a phi statement merging at least vari and var j to a
new defined variable vark. But this means that vari and var j cannot be inputs
of ps, which must read vark instead of vari and var j. 
Loop Mux Inputs
A loop mux is a data path connector between an outer and an inner loop. It
has two data inputs; the init input is connected to the outer loop data path,
and the continue input receives values from the body of the inner loop. In the
example in Fig. 8.7(b) the loop mux is initialized with a constant four, while the
continue input connects to the adder which increments i once per inner loop
iteration. The control edge for the continue input is created similarly to mux
predecessors, where finding the controller node (the < node in the Figure) is
much easier here, because it originates from the same CFG node (highlighted
in Fig. 8.7(a)) as the loop mux. In the example, the control edge (<,+) is inserted
with annotation ==1.
For the init input from the outer loop, we do not insert any control edge. This
is because an outer loop data path already delivers one data word per outer
loop iteration, which is the desired behavior here. However, note the activation
of constants below (Section 8.1.3); for the example in Fig. 8.7, that step will
insert a control edge to activate the const node 4.
115
8. COMRADE 2.0 Compile Flow
10
<
(a) (b)
==0
 
loop
+4
1
i c
==1
atOnCancelnop
s_2
// outer loop header
i_1 = phi(i_0, i_2);
i < 10 ?
i_0 = 4;
i_2 = i_1 + 1;
1
 
 
1
0
0
s_2 = i_1;
Figure 8.7.: (a) CFG section; (b) CMDFG with control edge inserted for the
loop mux continue input and its output; CMDFG nodes with gray
background originate from the CFG node with gray backgound in
(a).
Loop Mux Outputs
To insert the loop mux output control edges illustrated in Section 5.3.2 (Loops)
we first determine for a given loop mux the successor nodes which originate from
a CFG node of the outer loop. For each such successor s, we insert a control
edge4 from the loop control node with two annotations: ==0 and atOnCancel (as
illustrated in Section 5.3.2). An example is the control edge (<, nop) in Fig.
8.7.
Constants
To activate const nodes, we simply iterate over the CMDFG nodes and insert a
control edge from the always node to each const node.
Control Hierachies
We iterate over the CMDFG nodes. For a node n which computes an if/else,
switch or loop condition, its CMDFG controller node (i.e., the node control-
4If s is already targeted by another control edge, a nop node is inserted between the loop mux and
s, and the control edge is attached to that new nop node.
116
8.1. Building the CMDFG
ling the execution of n) is found by again using the CFG control dependence
relation (cf. Fig. 8.8). First, we locate the originating CFG node ncfg, then de-
termine the controller ccfg in the CFG and finally use its last statement to locate
the sought-after CMDFG control node c.
>
j k
<
i n
(a) (b)
if (i < n)
if (j > k)
0 1
c
n
c
cfg
==1
1.
2.
3.
n
cfg
  
 
10
Figure 8.8.: (a) CFG section; (b) CMDFGwith a control edge inserted between
nested conditions.
An exception is the loop condition of the outermost loop, which has no con-
troller in the current HW region. Here, we use the initial node of the corre-
sponding SW/HW transition as the control edge source. This establishes that
the SW/HW transition actually occurs before a memory access inside the HW
region is started and before an IRQ is signaled by the hardware.
The loop condition reactivating any nodes (cf. Section 5.3.2, Loops) are in-
serted for each loop condition5.
Memory Accesses
This step adds control edges to the non-speculative memory access (MA)
nodes: load, store, and mf. To create the nCT and nAT annotated edges moti-
vated in Section 5.3.2 (Nested Loops), we again first determine the controller
in the CFG, i.e., the CFG controller node of the CFG node from which the
CMDFG MA node has been created. In Fig. 8.9, MA1 and MA2 have been cre-
ated from CFG node mcfg, the controller of which is ccfg. If an MA is the first
MA in its CFG node (basic block) according to the program flow (MA1 in the
Figure) and the CFG node (mcfg) has a loop header as its direct predecessor
(lcfg), the compiler inserts an nCT control edge from the CMDFG node comput-
5This will be refined for nested loops below.
117
8. COMRADE 2.0 Compile Flow
ing the loop condition to the CMDFG MA node, annotated with the loop exit
condition ==0 (edge (>,MA1) in the Figure). Furthermore, a nAT control edge
is inserted from the MA controller CMDFG node (reusing the control depen-
dence relation on the CFG) to the MA node (edge (<,MA1)). For successive
MAs, nCT/nAT splitting is not necessary, because the memory edges which will
be added later already avoid a premature MA execution. Instead a standard
control edge (i.e., without nCT or nAT) is then inserted (as edge (<,MA2)).
>
j k
<
i n
(a) (b)
if (i < n)
if (j > k)
0 1
==1
  
 
10
...MA1;
MA2;
 
MA1
MA2
==1
   nAT
==1
...
==1==0
nCT
c
cfg
l
cfg
c
l
m
cfg
Figure 8.9.: (a) CFG section; (b) CMDFG with control edge inserted for mem-
ory accesses MA1 and MA2.
Loop Control Nets
Here, we furnish the existing control hierarchies with control edge structures
as shown in Fig. 8.10 to make sure that an outer loop condition is not reacti-
vated before all inner loops have finished. To ensure this, loops are processed
sequentially from inner to outer levels. Conditions at the same level are com-
bined with an all node because all these loops have to finish before the outer
loop may execute the next iteration. In the example in Fig. 8.10, j > m and
k == p must both be false (== 0), before i < n may be reactivated. This
step may require that a preliminary any node destined for loop reactivation is
removed.
Conditions on different levels are combined with an any node as illustrated by
the example in Fig. 8.11.
118
8.1. Building the CMDFG
for (...; i < n; ...) {
  for (...; j > m; ...) {
      ...
  }
  for (...; k == p; ...) {
      ...
  }
}
>
ni
<
any
pk
==
       
mj
(a) (b)
==1
==1
all
       
       
any
==1, nCT
==1, nCT
==0
nCT
==0
nCT
Figure 8.10.: (a) C code, (b) generated CMDFG control net.
HW/SW Interfaces
For each HW/SW transition a const node is created, its value representing the
CFG node number of the HW exit node (i.e., the node which is the source of
the HW/SW transition). Fig. 8.12 shows an example comprising two HW/SW
transitions in one HW region. Furthermore, for each variable which is trans-
ferred from HW to SW, a control edge from the associated outreg to the const is
inserted; in the case of more than one outreg, an and node combines the edges
as shown in the Figure. If there is only one outreg, the and node is omitted. If
there is more than one HW/SW transition in the current HW region an irqData-
Mux combines the IRQ exit codes as shown in Fig. 8.12(b); if there is only one
the const node is connected directly to the irqreg.
Control Edge Merging
In this last step, the preliminary control edges to mux predecessors are replaced
by control edges which target the mux directly (cf. Fig. 8.13)6. This repre-
sentation is even closer to the target hardware; the control edge now directly
influences the select signal of the multiplexer.
A similar replacement is made for control edges pointing to loop mux prede-
6Using preliminary control edges allows for a more generic control edge handling while building
the CMDFG, and thus simplifies the implementation.
119
8. COMRADE 2.0 Compile Flow
for (i = 0; i < n; ++i)
  if (i == 4)
    for (j = 0; j < m; ++j)
      ...
==
ni
<
any
mj
<
       
                ==0
               nCT
==0
nCT
4i
...
(a) (b)
==1
==1
Figure 8.11.: Nested loops: (a) sample code snippet, (b) generated COCOMA
section.
cessors. Here, however, only the continue input predecessor is affected by the
edge replacement, because the init input is not controlled by the associated
loop condition node. As a result the semantics of control edges to a loop mux
are slightly different from mux control edges. While an edge pointing to the
latter affects all data predecessors, a control edge to a loop mux affects only the
continue input. That is, a loop mux is initialized without consuming an AT from
the control predecessor; similarly, a false condition in the control predecessor
will create a CT for the continue predecessor of the loop mux, but not for its init
predecessor.
8.1.4. Memory Edges
For the implementation of memory accesses COMRADE 2.0 assumes that a
HW kernel has access to a memory back-end supporting cached accesses to
the main memory, as planned in Section 5.1. The cache is accessible via a
bus which allows fast single word transfers. The HW kernel is provided a
cache port to access that bus. Note that this is the very basic memory system
supported by COMRADE; extensions are presented in Chapter 9.
COMRADE uses memory edges to enforce the program order sequence of
memory accesses (as defined by the source C code). This is important to ad-
here to write-after-read (WAR), read-after-write (RAW), and write-after-write
(WAW) dependences on the one hand, and to prevent simultaneous cache port
120
8.1. Building the CMDFG
(a) (b)
LoadUp(a);
LoadUp(b);
SignalIrq(6);
(SW)
(HW)
(HW)
6 (HW)
LoadUp(k);
LoadUp(n);
SignalIrq(10);
10 (HW)
(SW)
outreg a
const 6
irq
irqreg
and
outreg b outreg k
const 10
and
outreg n
Figure 8.12.: (a) CFG section with two HW/SW transitions, (b) generated
CMDFG subgraph for IRQ signaling.
n
bc
==2
a
n
bc
==(2,1,5)
mux
a
==5==1
(a) (b) 
  
Figure 8.13.: (a) Before, (b) after control edge merging.
121
8. COMRADE 2.0 Compile Flow
Algorithm 2 Construction of memory edges in program order
1: for all CFG nodes ncfg (except SW services) containing a memory access do
2: insert memory edges between the accesses located in ncfg in program order
3: lmscfg := last memory statement in ncfg
4: lma := CMDFG memory access node generated from lmscfg
5: if ncfg is a branch node then
6: c := CMDFG condition node computing the branch condition of ncfg
7: add memory edge from lma to c
8: else
9: {the successor succcfg of ncfg is a branch or a join node}
10: fmacfg := first memory access statement in succcfg
11: fma := CMDFG node created from fmacfg
12: add memory edge from lma to fma
13: end if
14: end for
accesses on the other hand; for the sake of the latter, even memory edges from
load to load are required.
Algorithm 2 shows how COMRADE 2.0 inserts memory edges. Step 2 inserts
memory edges for accesses located in the same CFG node. For branch nodes,
the last memory access lma in the node is connected to the branching condition
c (Steps 6-7) to guarantee the program order for dependent accesses in branch
targets, i.e., c is evaluated after lma is done, hence accesses in the branch tar-
get (controlled by c) are not started prematurely. Steps 10-12 insert memory
edges from the last memory access of a non-branch node to the first access of
the successor node, obeying inter-CFG node memory dependences. This also
closes the memory dependence chain across the loop header, so that all mem-
ory accesses of the current loop iteration are done before the first access of the
next iteration starts.
8.2. Correctness
In this work we examine basic concepts for the creation of combined HW/SW
systems from C. We do not intend to prove the correctness of all suggested
approaches, which would definitely exceed the scope of this work. However,
we do allude some correctness hints, through representative tests on the one
hand, and by analyzing basic principles of the token flow on the other hand.
We have run COMRADE 2.0 on a set of synthetic benchmarks, covering differ-
ent combinations of (nested) conditions and loops – with and without memory
accesses – and testing different parameter settings (e.g., dynamic vs. static CTs
122
8.2. Correctness
and different buffer sizes). The generated HW kernels have been simulated to
observe the correct behavior. To be able to reproduce the tests, their C source
codes are given in Appendix C, together with simluation results.
Before we analyze the token flow principles further, we will define three terms
helpful for the discussion.
Definition 8.2.1. Given a t-structured CFG with node set N, edge set E, and a
distinct loop header l.
The explicit loop body of l is a subset ELB(l) ⊆ LB(l) of the loop body of l, so
that each node in ELB(l) is control dependent on l.
(a) (b)
a2 = a1 + 10;
a1 = phi(a0, a3);
if (a1 > 1)
0 1
a4 = a3 - 5;
a3 = phi(a2, a4);
if (a3 > 2)
0
1
(1) loop header L
(3) loop header S
(2)
(4)
1
==0
 
loop
+
i c==1
atOnCancel nop
 
loop
i c
>
-
a0
a1 10
a2
>
==1 2
5
a4
a3
a3
==1
Figure 8.14.: (a) CFG section, (b) generated CMDFG subgraph.
As an example, in Fig. 8.14 nodes 1 and 3 are loop headers, {2, 3, 4} is the loop
body of loop header 1, {4} is the loop body of 3, {2, 3} is the explicit loop body
of 1, and {4} is the explicit loop body of 3. Obviously, explicit loop bodies do
not overlap, which allows the definition of loop levels based on explicit loop
bodies.
Definition 8.2.2. Given a t-structured CFG containing a HW region H with
node set N and edge set E.
The loop level function assigns each node in H a non-negative integer, so that
the level of an explicit loop body equals the level of its header increased by 1,
123
8. COMRADE 2.0 Compile Flow
i.e., LL : N → N0,
LL(n) :=
LL(h) + 1 if n has a loop header h in H,0 otherwise.
Assuming that all nodes in Fig. 8.14 are contained in a HW region H, and node
1 is the outermost loop header in H, then LL(1) = 0, LL(2) = 1, LL(3) = 1, and
LL(4) = 2.
According to their origin in the CFG, CMDFG nodes are associated to exactly
one explicit loop body and therefore have the same loop level. The association
of CMDFG nodes to CFG loop levels drives the definition of loop data paths.
Definition 8.2.3. Given a t-structured CFG containing a HW region H with
node set N and edge set E, and a CMDFG generated from H.
A loop data path (LDP) is a set of CMDFG nodes weakly connected through
data edges, so that each node in the LDP has at least one data input and one
data output and all nodes in the LDP have the same loop level. Furthermore,
an LDP is maximal, i.e., there is no LDP which is a proper superset of a given
LDP.
As an example, the shaded region on the far right in Fig. 8.14(b) surrounds an
LDP consisting of a nop, an adder, and a loop mux node, having the same loop
level 1. Although the nop node has no equivalent CFG node, it has loop level
1 because it is simply a buffer for a3, which is defined at that loop level. The
gray-colored > node computes the loop condition for the LDP (which itself
is not part of that LDP). In Fig. 8.14(a) the operations from which this LDP
originates have been marked with gray background.
For each variable which is read or written in an explicit loop body, its loop
header contains a phi statement (after SSA conversion) from which a loop mux
is created in the CMDFG. These loop muxes can be understood as connectors
between LDPs of different loop levels.
By analyzing the token flow across LDPs, especially the synchronization of
data, control and memory flow, we will now illustrate (though not prove) that
no deadlocks occur in an LDP. Fig. 8.15 shows a generalized LDP of level
k, where loop_c is the associated loop control node (a). Such a LDP can have
data inputs from const nodes (b), or from loop muxes (c) of loop level k − 1
(originating from the same CFG node as the loop condition loop_c). Possible
nodes contained in the LDP include operation nodes (d) and nops (e), loads (f),
muxes (g), and loop muxes (h) of level k. LDP data outputs can lead to operation
nodes without outgoing data edges (i.e., control nodes (i)) and store nodes (j),
124
8.2. Correctness
==0
==1
atOnCancel
op
 
loop
i c inner LDP,level k+1
nop
 
loop
i c
load
load
ci
const
always
cm
cl
LDP, loop level k,
associated with
loop control node c
store
(simplified
control
hierarchy)
level k-1level k
level klevel klevel k
level k-1
level k
level k-1
level 0
(b)
(c)
(d)
loop_c
loop_c
(a)
(e)
(f)
(f)
(g)
(h)
(i) (i) (j)
(i)
(a)
level k
 
  
 
Figure 8.15.: Generalized LDP (gray background) with incoming and outgoing
CMDFG edges; note that nodes not contained in the LDP can
nevertheless have the same loop level.
125
8. COMRADE 2.0 Compile Flow
where both have loop level k; the third kind of outputs are loop muxes at level
k − 1, originating from the loop header of the LDP, so that data flows into the
continue input of such a loop mux (c).
8.2.1. Static CTs
For a given LDP a basic token flow principle of COCOMA in the static CT
model is that per token (AT or CT) which enters the loop control node loop_c,
an AT is propagated along each incoming and each outgoing data edge of the
LDP, with the exception of data inputs from const nodes. To understand this,
we analyze the I/Os and the interior of the LDP, first for the static CT model.
We start off by reviewing the data inputs.
• A loop mux (Fig. 8.15(c)) receives an initial AT from an outer loop or
through a SW/HW transition. We associate this initial value with the
first token that enters loop_c. If loop_c is true, the value (once passed
through the LDP) re-enters the loop mux via its continue input; we then
associate that new value with the next token which enters loop_c. This
iterates until loop_c is false. If loop_c is actually false, or the first token
entering loop_c is a CT, then a CT is generated at the continue input of
the loop mux, which finally eliminates the result coming in from the LDP.
• const nodes are an exception because they are connected to an always node
and thus deliver a steady stream of ATs. This is however not harmful
because they cannot produce erroneous results. Constant inputs are syn-
chronized with other data inputs in operation nodes or synchronized with
control inputs in muxes or loop muxes.
Thus, all inputs (except consts) deliver exactly one AT per token entry of loop_c.
We now analyze the LDP interior, assuming an AT per incoming data edge.
• Load nodes (f) have an address input and a data output for the loaded
value (both data edges) and an incoming control edge. The control node
is either loop_c or a node ci which is a sub control node of loop_c (due to
an if/else or switch in the C code). When a token enters loop_c, the control
hierarchy establishes that ci will receive a token, too. If ci is active and
the control condition is true, then the memory access is executed, con-
suming the input address via the incoming data edge and outputting the
result. If ci is canceled or active with false control condition, a CT is
sent to the load. Now, although no memory access is performed, the load
126
8.2. Correctness
consumes the incoming address and provides a data output value (sim-
ply dummy data). So, in any case, the load will consume an AT from the
data predecessor and produce an AT for the data successors. The outgo-
ing memory edge of the load forwards an AT only if the memory access
has actually been performed. In that case, the AT is consumed by the
memory successor; otherwise, the memory successors do not anticipate
an incoming AT, because they are canceled, too.
• A mux (g) receives exactly one AT from each of its alternative data inputs.
Accordingly, it receives exactly one AT (and thus select value) from its
controller node cl, which is either identical to loop_c or reachable from
loop_c through the control hierarchy. The mux selects an input according
to the select value and propagates this to its data output. If none of
the inputs is selected, a dummy value is propagated (cf. Sections 5.3.2
(Nested Conditions) and 8.3), so there will be a data output in any case.
• Loop muxes (h) of level k originate from a subloop header. They are initial-
ized once by an AT from the LPD per token that enters loop_c. That AT
flows into the LDP of the subloop and re-enters the loop mux via its con-
tinue input for each subloop iteration. The controller node (not shown)
of that inner loop, being again in the control hierarchy of loop_c, is re-
activated for each iteration by the loop control net mechanism explained
above. As soon as the inner loop ends or in the case that the inner con-
dition is canceled, the nop node (e) is activated, delivering an AT for
successors in the LDP.
• All other nodes in the LDP (having only data I/O edges) perform their
operation as soon as all inputs are available and propagate an AT when
their computation is finished.
In this manner, ATs are propagated from LDP inputs to LDP outputs. We
finally review what happens at the outputs.
• Control nodes ci, cl, and cm are in the control hierarchy of loop_c and
receive exactly one token per token entry of loop_c. If such a control
node receives an AT via its incoming control edge, it consumes the data
inputs from the LDP and creates an AT for the outgoing control edges.
If a CT enters one of the control nodes, the incoming data from the LDP
is consumed and discarded, forwarding a CT along the outgoing control
edges.
• Store nodes (j) behave similarly, except that a memory access is per-
formed instead of activating outgoing control edges.
127
8. COMRADE 2.0 Compile Flow
• The output to the loop mux (c) has already been covered in the discussion
regarding the LDP data inputs.
This concludes the discussion for the static CT model, substantiating that, per
token (AT or CT) which enters a loop control node, an AT enters and finally
exits the associated LDPs. This is an important result regarding the synchro-
nization of control and data flow on the one hand, and strongly suggests the
absence of deadlocks on the other7.
In the next Subsection, we consider the dynamic CT model.
8.2.2. Dynamic CTs
The dynamic CT model is a bit more complex. Instead of ATs moving strictly
top-down through the LDP as described for Fig. 8.15, CTs are allowed to move
bottom-up, i.e., in reverse direction along data edges. The basic principle for
LDPs must therefore be adapted:
For each token (AT or CT) which enters the loop control node loop_c and for
each incoming and each outgoing data edge of the LDP (except data inputs
from const nodes), either an AT moves along the data edge or a CT moves in
the reverse direction.
We will however omit a detailed discussion here, suffice it to say that in the
dynamic CT model, the location where token extinction takes place is variable
instead of being fixed at control edge targets.
However, there is a peculiarity of the COMRADE 2.0 dynamic CT model:
Nodes which are control edge targets (muxes, loop muxes, loads) do not accept
CTs from their data successors. In these cases the CT waits in the successor
node until an AT comes in (possibly carrying dummy data). This is however
not harmful because in each iteration an AT enters the LDP via each data input,
so that an input never waits for a CT which is blocked inside the LDP.
A complete proof for the absence of deadlocks would require to undergo an
analysis of the whole COCOMA sequencer and the behavior of all COCOMA
nodes (and thus Modlib operators), which is outside the scope of this work.
Instead, the above observations show the plausibility of our approach and al-
low a more general and comprehensive view of token flow. We have tested
COCOMA with many diverse examples comprising different control flow sce-
narios to be reasonably certain that all practically relevant cases have been
covered (cf. Appendix C).
7Note the concluding statements of Section 8.2.2, which apply here, too.
128
8.3. Redundant Control Dependences
8.3. Redundant Control Dependences
s0 = 0;
if (c1) {
  if (c2) {
    s1 = ...;
  }
  // m2
  s2 = phi(s0, s1);
  s3 = s2 + 1;
  if (c3) {
    ...
  }
}
// m1
s4 = phi(s0, s3);
c2
c1
(a) (b)
s0==(1,0)
==(1,0)
s4
==1
mux
mux
10 01
10 01
m2
m1
+
s1
s2
s3
c3
==1
s3
...
Figure 8.16.: (a) C code in SSA form; (b) generated CMDFG section.
In nested loops subconditions may be canceled, so that all of the alternative
inputs of a mux are discarded. An example is shown in Fig. 8.16: If c1 is
false, c2 receives a CT via its incoming control edge. Then, s1 and s0 are
both discarded in m2. The mux outputs a pseudo AT (pAT) carrying dummy
data, because the successors of m2 expect an AT. We have implemented an
alternative approach in [GäKo08] which does not create pATs at mux outputs.
Instead, we have to make sure that the successors of such muxes do not wait for
an AT in the cancel case; in other words, they may not create excess CTs. The
creation of CTs in the cancel case can be prevented by nCT annotations8. In
Fig. 8.16(b) a nCT annotation of control edge (c1,m1) dedicated to the s3 input
would do the job. While this works for muxes, a nCT annotation is not possible
for (c1, c3), because the CT is needed by c3 for the propagation to its control
successors. Because c3 expects an incoming AT along its data inputs for each
iteration, a pAT must be inserted for each data input of c3. This can be done
in COCOMA using an alwAct annotated control edge, which delivers an AT
(ideally directly at the c3 data input) although the source of the edge is false or
canceled.
8In [GäKo08], which points control edges to mux predecessors instead of using direct control
edges to muxes, this is equivalent to the removal of control edges.
129
8. COMRADE 2.0 Compile Flow
However, that alternative approach turned out to be much more complex and
error-prone, hence we have decided to directly create pATs at mux outputs.
Another advantage of the current model is the invariance presented in Section
8.2, which is essential when discussing the correctness of our approach.
130
9. Low-Level Optimizations
COMRADE 2.0 can perform several low-level optimizations on the CMDFG
depending on the parameters of the actual target architecture (e.g., target fre-
quency or speed grade). Therefore, Section 9.1 first introduces our evaluation
platform ACE-M5. Section 9.2 then compares dynamic CTs to static CTs re-
garding runtime and resource requirements, before we present three actual low-
level optimization techniques in Sections 9.3-9.5. Using the optimizations pre-
sented so far, Section 9.6 compares performance results, analyzes the current
bottlenecks, and shows how they can be removed using memory localization.
9.1. Test Platform ACE-M5
Fig. 9.1 shows the architecture of the adaptive computer ACE-M5, which
we use for our experiments. The ACE-M5 essentially consists of the Xilinx
ML507 development board (kindly donated by Xilinx) and a support package
provided by the Embedded Systems and Applications Group (ESA), headed
by Professor Andreas Koch at the Technische Universität Darmstadt, includ-
ing IP cores and a Linux operating system (OS). The Virtex-5 FPGA on the
ML507 includes an embedded PowerPC processor which hosts the OS. A man-
agement PC acts as external console and also provides file sharing services to
the ACE-M5 using NFS. On the ACE-M5 side, the required Ethernet IP block
is connected to the Processor Local Bus (PLB). Hardware kernels generated
by COMRADE 2.0 are located in the RCU. Both the PowerPC and the RCU
share access to 256 MB of DDR2-SDRAM main memory via the DDR2 con-
troller, the PowerPC using its own embedded cache (32 KB instruction, 32 KB
data) connected to the memory controller interface (MCI), and the RCU using
the cache provided by the MARC [LaKo00] memory access system (4 KB).
Furthermore, the PowerPC has memory-mapped access to the RCU registers,
bypassing the caches. The RCU can send an IRQ to the PowerPC to signal the
end or interruption of a computation.
The ACE-M5 implements the AISLE architecture [LaKo09], so that the RCU
can use the same virtual address as the CPU for referencing data stored in
131
9. Low-Level Optimizations
Xilinx ML507
Xilinx Virtex-5
PowerPC 440
@400 MHz
Cache
DDR2-
Controller
@200 MHz
Ethernet Serial
DDR2-
SDRAM
File System Console
PLB
MCI
Management PC
MARC
Cache
IRQ RCU
@100 MHz
load / store units
I/O regs
Figure 9.1.: The ACE-M5 architecture.
the main memory. This allows for passing complex data structures between
software and hardware by just transferring a pointer.
The frequencies shown in Fig. 9.1 are derived from the original Xilinx ML507
reference design. Our concepts are not limited to a certain technology or target
frequency; however, here we will use 100 MHz here as target frequency for our
tests.
The connection between RCU and DDR2 is 128 bits wide and runs at 200
MHz1, offering a maximum throughput of 3.2 GByte/s.
The fully associative MARC cache is configured for 32 cache lines (128 bytes
each), and currently supports up to four parallel accesses (reads or writes, each
32 bits wide) to the cache. However, at most, two simultaneous accesses to the
same cache line are allowed. The latency of a read stall2 is 61 cycles (@100
MHz), while a write stall lasts 59 cycles, excluding waiting times due to DDR2
refresh cycles and accesses performed by the Linux OS.
1100 MHz, but transferring data on both the positive and the negative clock edge.
2The time needed to load a cache line from the main memory.
132
9.2. Dynamic CTs vs. Static CTs
9.2. Dynamic CTs vs. Static CTs
Figs. 9.2 and 9.3 show COMRADE synthesis and simulation results of
seven example kernels, most of which were taken from benchmarking suites
such as Honeywell Stressmark [Hone97], MiBench [GREA01], MediaBench
[LePM97] and CHStone [HTHT09]. The source codes actually used are given
in Appendix D. For synthesis, we have employed Synopsys Synplify Premier
DP (version E-2010.09) and Xilinx ISE 10.1.
09_results_hg_diss_dyn_stat_area.pdf
Target Frequency: 100 MHz Daten unten eingeben!
kernel dynamic CTs staticCTs kernel dynamic CTs
dynCT: FFs dynCT: LUTs statCT: FFs statCT: LUTs dynCT: FFs dynCT: LUTs
fcdf22 2,603 2,231 2,331 1,629 md5 18,230 21,187
gfMultiply 2,275 1,898 2,076 1,467
memcopy 2,759 2,497 2,554 1,951
quantization 7,731 6,936 7,000 5,728
sha 2,071 2,061 1,798 1,585
susan 8,115 10,054 6,785 7,012
0
2,000
4,000
6,000
8,000
10,000
12,000
fcdf22 gfMultiply memcopy quantization sha susan
#  
F F
s ,
 L
U
T s
dynCT: FFs dynCT: LUTs statCT: FFs statCT: LUTs
0
5,000
10,000
15,000
20,000
25,000
md5
#  
F F
s ,
 L
U
T s
Figure 9.2.: Area comparison (FFs and LUTs) of dynamic vs. static CTs, syn-
thesized for target frequency 100 MHz.
Regarding the FPGA area (Fig. 9.2), static CTs save about 26 % of the Virtex-5
look-up tables (LUTs) and 13 % of the flip-flops (FFs), because the sequencer
as well as the operators require less token handling logic.
Interestingly, the runtimes of static CTs match those of dynamic CTs in all
kernels examined in Fig. 9.3.
However, it is possible to find a kernel which does show differences in runtime.
An example is the imbalanced_paths kernel shown in Listing 9.1: A loop runs
through six iterations, the first five iterations executing a simple addition, while
the last iteration executes the else part, which contains a high-latency divider
(34 cycles).
Table 9.1 shows the resulting runtimes per loop iteration and in total. The start
of iteration i is defined here as the cycle in which the s value resulting from
the ith loop body data path computation is stored in the s-loop mux (shown in
133
9. Low-Level Optimizations
09_results_hg_diss_dyn_stat_cycles.pdf
Execution Time w/ Data$ miss
kernel dynCT statCT dynCT statCT
fcdf22 1629 1629 quantization 115897 115897
gfMultiply 1457 1457
md5 1974 1974
memcopy 384 384
sha 501 501
susan 919 919
0
500
1,000
1,500
2,000
2,500
fcdf22 gfMultiply md5 memcopy sha susan
#  
C
y c
l e
s
dynCT statCT
0
20,000
40,000
60,000
80,000
100,000
120,000
140,000
quantization
#  
C
y c
l e
s
Figure 9.3.: Runtime comparison (number of simulation cycles) of dynamic vs.
static CTs.
1 i n t main ( ) {
2 unsigned i n t i , n = 6 , s = 0 ;
3 unsigned i n t a , b , c , x = 0 ;
4
5 / / HW s t a r t s
6 f o r ( i = x ; i < n ; i ++) {
7 i f ( i != 5) {
8 s += i ;
9 } e l s e {
10 a = i + 10 ;
11 b = a ∗ 90 + 1 + i ;
12 c = b / a ;
13 s = s + 4 + c ;
14 }
15 }
16 / / HW ends
17 p r i n t f ( " s = %d \ n " , s ) ;
18 re turn 0 ;
19 }
Listing 9.1: imbalanced_paths example.
134
9.2. Dynamic CTs vs. Static CTs
Without Buffers With Buffer
Iteration Dynamic CTs Static CTs Static CTs
init 1 1 1
0 8 8 8
1 13 42 13
2 18 47 18
3 23 52 23
4 28 85 28
5 70 98 70
Total 74 102 74
Table 9.1.: Runtimes (in number of cycles) of the hardware generated from the
code in Listing 9.1 for different parameter settings.
Fig. 9.4). The init row in Table 9.1 gives the cycle in which the loop mux accepts
the initial s value. Columns two and three compare dynamic versus static CTs
for the default case in which the operators are not equipped with output buffers
(i.e., the node parameter QDEPTH is zero, cf. Section 7.1). Here, dynamic
CTs perform noticeably faster than static CTs. A major difference occurs for
iteration i = 1; Fig. 9.4 shows the performance bottleneck of the static CT
version: Because the divider delays the execution of add2, the initial s value
must remain in the output register of add1. Consequently, the resulting s value
of iteration i = 0 stays in the output register of the loop mux, so that the new s
value of iteration i = 1 cannot yet enter the loop mux.
If dynamic CTs are used (second column in Table 9.1), such a token jam does
not occur: CTs actively move towards the stuck s values and extinguish them.
This could be seen as the cancel tokens cleaning a congested drain, so that the
divider latency affects only the last loop iteration.
Instead of cleaning the drain, another possibility would be to elongate the drain
tube, by inserting buffers which can hold such superfluous, speculative s values.
This could be done by extending add1 in Fig. 9.4 with a transparent output data
buffer (parameter QDEPTH), large enough (48 entries are used here) to balance
the latency of the divider path. The results in column four of Table 9.1 indeed
show that the static CT model with buffer now performs as well as the dynamic
CT model without buffers (column two). Inserting buffers in the dynamic CT
model would not make sense, as the runtime would not improve further.
Table 9.2 compares the area resources (columns two and three) required by
imbalanced_paths. The fully pipelined 32 bit divider alone requires about 3,200
135
9. Low-Level Optimizations
 
loop
add2
0
4
s (i=0)
i c
+
add1(highlatency)
add3
s (init)
+
s (i=1)
+
i
(low
latency)
div
a b
no token
yet!
Figure 9.4.: Static CTs: token jam.
FFs and 4,200 LUTs; thus, columns four and five, showing the resource re-
quirements without the divider area, better highlight the impact of parameter
variations. According to this measure, the static CT version without buffers
saves about 9 % of the FFs and 23 % of the LUTs versus static CTs due to the
simplified operator and sequencer. When we add a buffer (as described above)
to the static CT version to achieve the dynamic CT performance, the area re-
quirements are still smaller than that of dynamic CTs (8 % FFs, 16 % LUTs),
cf. rows one and three in Table 9.2.
To summarize our results, static CTs achieve the same performance as dynamic
FFs LUTs FFs LUTs
total total w/o w/o
divider divider
Dynamic CTs, without buffers 4,347 5,264 1,163 1,042
Static CTs, without buffers 4,232 4,999 1,061 806
Static CTs, with buffer 4,246 5,070 1,075 877
Table 9.2.: Area requirements (FFs and LUTs) of the hardware generated from
the code in Listing 9.1, for different parameter settings; divider re-
sources omitted in columns four and five.
136
9.3. Operation Chaining
CTs in most cases (cf. Fig. 9.3), while saving area resources. If dynamic CTs
would actually perform better, buffer insertion could boost static CT perfor-
mance, completely alleviating the dynamic CT advantage, while still saving
area resources compared to dynamic CTs. In the following Sections we will
therefore concentrate on the static CT model.
9.3. Operation Chaining
By default, COMRADE 2.0 parametrizes the CMDFG nodes so that each hard-
ware operator will have a register holding the result of the operation, i.e., the
parameter DEPTH is set to one3. Depending on the operation, its parameters
(e.g., bitwidth, signedness, or CT scheduling type) and the parameters of the
target architecture, two or more successive operations might have critical paths
which are short enough to pack these operations into a single clock cycle. This
operation chaining is a very effective technique, because it decreases both the
latency and saves register resources, while the target frequency is not affected.
A practical requirement for this optimization is knowledge about the critical
path of each operator – this information is available to the COMRADE 2.0
compile flow through the MDF framework (cf. Section 7.2).
To understand the chaining algorithm we require some additional definitions.
The chaining of an operation not only depends on the critical path of the op-
eration itself, but also on combinatorial predecessors or successors that are al-
ready chained. Therefore, before chaining a node n, we consider its incoming
and outgoing chained node paths.
Definition 9.3.1. Given a CMDFG and a distinct node n.
An incoming chained node path of n is a (possibly empty) node path
icnp = (n0, . . . , nk), so that nk is a predecessor of n, and each ni is unregis-
tered. ICNP(n) is the set of all incoming chained node paths of n.
An outgoing chained node path of n is a non-empty node path (n0, . . . , nm),
so that n0 is a successor of n, each ni for i < m is unregistered, and nm is
registered. OCNP(n) is the set of all outgoing chained node paths of n.
If n would be chained, the operations in the incoming and outgoing chained
node paths would take place in the same cycle as n. Note that an incoming
chained node path does not have a register at its end, while an outgoing path
3Constant shifts and token forwarder nodes are not registered, because they collapse to pure
wiring in hardware.
137
9. Low-Level Optimizations
has. This register represents the end of the combinatorial node path – although
being registered, this end node contributes to the critical path delay and is thus
included in the outgoing chained node path definition.
For such node paths, we define the critical path delay.
Definition 9.3.2. Given a CMDFG and a node path np = (n0, . . . , nk).
The critical path delay of np, cpd(np), is the sum of the critical path delays of
the nodes contained in np.
For operation chaining, only the longest combinatorial input and the longest
combinatorial output are relevant; this is addressed by the next definition.
Definition 9.3.3. Given a CMDFG with a distinct node n.
The set of maximal incoming chained node paths of n, MICNP(n), contains
all incoming chained node paths of n which have a maximal critical path delay,
i.e.,
MICNP(n) := {m ∈ ICNP(n)|¬∃k ∈ ICNP(n) : cpd(k) > cpd(m)}.
Analogously, the set of maximal outgoing chained node paths of n,
MOCNP(n), contains all outgoing chained node paths of n which have a maxi-
mal critical path delay, i.e.,
MOCNP(n) := {m ∈ OCNP(n)|¬∃k ∈ OCNP(n) : cpd(k) > cpd(m)}.
Note that MICNP(n) is non-empty: If n does not have an unregistered prede-
cessor, MICNP(n) contains exactly the empty set. MOCNP(n) is non-empty, if
there is a node path (n, . . . ,m) along data edges, with m being registered. In
that case, n is said to be terminated.
Finally, the critical path delays of these maximal chained node paths determine
if n may be chained.
Definition 9.3.4. Given a CMDFG with a terminated node n.
The incoming critical path delay of n, ICritPD(n), is the critical path delay of
a maximal incoming chained node path of n, i.e., for any node m ∈ MICNP(n),
ICritPD(n) := cpd(m).
Analogously, the outgoing critical path delay of n, OCritPD(n), is the critical
path delay of a maximal outgoing chained node path of n, i.e., for any node
m ∈ MOCNP(n), OCritPD(n) := cpd(m).
138
9.3. Operation Chaining
ICritPD is well defined, because all node paths in MICNP have the same criti-
cal path delay. This is analogously true for OCritPD.
We now present the algorithm used to chain the operators in a loop body. COM-
RADE 2.0 first generates a working set of chaining candidate nodes. These are
registered CMDFG nodes having only incident data edges. Then, nodes are
successively picked from the working set in data dependence order. For such
a node n, we consider the incoming critical path delay and the outgoing crit-
ical path delay. The node n is set to unregistered, if the sum of ICritPD(n),
OCritPD(n), and the critical path delay of n is smaller than the target cycle
length. In the example in Fig. 9.5, ICritPD(a) = 0 and OCritPD(a) = 3.60 ns,
so the sum ICritPD(a) + OCritPD(a) + 3.58 ns = 7.18 ns is smaller than the
target cycle length of 10 ns. As a result, the register after node a is removed.
a
(a) (b)
b
c
3.58 ns
3.60 ns
4.15 ns
DEPTH=1
DEPTH=1
DEPTH=1
a
b
c
DEPTH=0
DEPTH=1
DEPTH=1
7.18 ns
Figure 9.5.: (a) Before, (b) after chaining operation a; target cycle length: 10
ns (100 MHz).
Fig. 9.6 shows runtimes using the static CT model, comparing results for op-
eration chaining switched on versus chaining switched off. Most of the exam-
ples show either no or only a marginal speed-up here, because they are dom-
inated by memory accesses (which are not chained). The md5 kernel with its
long pipelined data path achieves a speed-up of 1.34 in the chained version.
This result includes the delay caused by cache stalls (which are not accelerated
through operation chaining). Disregarding the stall delays, md5 would require
1,639 cycles (not chained) vs. 1,127 cycles (chained), making up a pure chain-
ing speed-up of 1.45.
Fig. 9.7 shows the area savings achieved with operation chaining due to omit-
ted registers. On average, 10 % of the FFs and 5 % of the LUTs are saved.
The reasons for the latter are slight simplifications in the internal token logic
of unregistered operations.
139
9. Low-Level Optimizations
09_results_hg_diss_chain_stat_cycles.pdf
staticCTs
Execution Time w/ Data$ miss Daten unten eingeben!
kernel no opt chaining no opt chaining
fcdf22 1629 1616 quantization 115897 115615
gfMultiply 1457 1457
md5 1974 1474
memcopy 384 384
sha 501 501
susan 919 912
0
500
1,000
1,500
2,000
2,500
fcdf22 gfMultiply md5 memcopy sha susan
#  
C
y c
l e
s
no opt chaining
0
20,000
40,000
60,000
80,000
100,000
120,000
140,000
quantization
#  
C
y c
l e
s
Figure 9.6.: Runtime comparison (number of simulation cycles), static CTs,
operation chaining switched on vs. off.
09_results_hg_diss_chain_stat_area.pdf
staticCTs
Target Frequency: 100 MHz Daten unten eingeben!
kernel no opt chain kernel no opt
no opt: FFs no opt: LUTs opt: FFs opt: LUTs no opt: FFs
fcdf22 2,331 1,629 2,103 1,512 md5 13,558
gfMultiply 2,076 1,467 1,994 1,426 quantization 7,000
memcopy 2,554 1,951 2,321 1,913 susan 6,785
sha 1,798 1,585 1,466 1,307
0
500
1,000
1,500
2,000
2,500
3,000
fcdf22 gfMultiply memcopy sha
#  
F F
s ,
 L
U
T s
no opt: FFs no opt: LUTs opt: FFs opt: LUTs
0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
16,000
md5 quantization susan
#  
F F
s ,
 L
U
T s
Figure 9.7.: Area comparison (FFs and LUTs), static CTs, operation chaining
switched on vs. off; synthesized for target frequency 100 MHz.
140
9.4. Memory Access Reordering
9.4. Memory Access Reordering
By default, COMRADE 2.0 creates the CMDFG memory edges in program
order. This can cause delays when the input data for a memory access are
already available, but there is no token yet on the incoming memory edge. To
better match the memory access (MA) order with the order induced by data
dependences (and thus avoid such delays), COMRADE 2.0 reorders MAs by
adjusting memory edges.
Fig. 9.8 shows an example: Assuming that it takes longer to compute A than to
compute B (tA > tB), load2will have to wait until load1 has finished, although the
address for load2 is already available. In this case, it makes more sense to invert
the memory edge and execute load2 before load1, irrespective of the program
order.
// load1
a = *(q + A);
// load2
b = *(q + B);
load1
(a) (b)
load2
q
add add
A B
... ...
t
A t
B
Figure 9.8.: (a) C code, (b) generated CMDFG section.
In general, altering the CMDFG memory edges while preserving the program
semantics is only possible for independent memory accesses. There are nu-
merous techniques for determining such independence information, ranging
from simple rules (e.g., consecutive loads and pointers qualified by the restrict
keyword are always independent) to complex analyses (e.g., pointer analysis
[Stee96] [ShHo97] or loop-iteration space analysis [Pugh91] [Leng93]). How-
ever, this work concentrates on actual hardware generation principles rather
than on such complex analysis techniques, which generally take place in the
machine-independent parts of the compiler middle-end. Thus, we show here
how to use independence information instead of how to compute it in a so-
phisticated manner. Therefore, we simply utilize the fact that loads are always
independent, and we allow the user to manually indicate that stores are in-
141
9. Low-Level Optimizations
dependent. Even with these simple techniques, we can measure noticeable
performance gains (see below).
(a) a
b
c
d
end
...
(b)
a b c d
(c)
a b c
d
end
(CMDFG)
(CMDFG)
(MGraph)
(d)
(CMDFG)
loop
+
*
-
a
5
1
loop
(3)
(2)
(1)
(4)
(e)
node dist
a 4
b 2
c 1
d 2
(f)
1) ready: a, b, c; choose c
2) ready: a, b; choose b
3) ready: a, d; choose d
4) ready: a; choose a
(g) schedule: (h)
...
priority list scheduling:
data
dependence
(0)
memory
dependence
data/memory
dependence
in MGraph
longest distance
from predecessor
(loop) mux
(4)
step1: c
step2: b
step3: d
step4: a
c
b d
a
end
... (i)
c
b d
a
end
...
(CMDFG) (CMDFG)
Figure 9.9.: The memory access reordering algorithm. (e) dist values for b, c,
and d represent sample values (which are not deducible from (a)-
(d)).
(a)
a
b
succ
pred
(b)
a b
succ
pred
Figure 9.10.: (a) CMDFG section; (b) memory edge (a, b) removed, predeces-
sors and successors adjusted.
Fig. 9.9 shows the algorithm used for reordering memory accesses (MAs). The
algorithm is executed for each CFG node, thus reordering is applied on a per-
CFG-node basis. Fig. (a) shows a CMDFG section of four MA nodes (a, b,
c, d), which all refer to the same CFG node. Let these MAs be independent.
Note that there is an (indirect) data dependence from b to d. For each memory
edge connecting two independent MAs, we apply the transformation shown in
Fig. 9.10, resulting in Fig. 9.9(b). We now build an auxiliary graph structure
142
9.5. Memory Access Parallelization
MGraph (Fig. (c)), containing the MAs of the current CFG node as nodes and
the CMDFG memory edges as edges, plus edges derived from CMDFG data
dependences: Due to the indirect data dependence b → d, edge (b, d) is added
to the MGraph. (No edges are inserted from memory dependences here, because
a-d are independent and the end node is not an MA of the current CFG node.)
Our goal is to match the MA order with the order induced by data dependences;
because the input data is located in loop muxes at the beginning of a loop itera-
tion, we have to find the longest data path from a loop mux to the MA. The path
length is the sum of the operator latencies on the path, using expected latencies
for variable latency operators. In Fig. (d) each operator has a latency of one
cycle, resulting in the longest mux-to-MA path length of four for MA a. All
these path lengths are stored in a distance table (e), from which a priority list
of the MA nodes is built. Using this priority list, an MA schedule is generated
through list scheduling [GWDL92]; this is shown in Fig. (f): The ready MAs
are those MAs which are not yet scheduled, but their predecessors are already
scheduled. First, all MAs but d are ready, because b is not yet scheduled. From
the ready MAs, one with a smallest distance value is chosen and inserted into
the schedule. In this manner, all nodes are added, resulting in the schedule in
Fig. (g). According to the schedule, we insert re-serializing memory edges
into the CMDFG (Fig. (h), note that a direct memory edge (b, d) is not inserted
here due to the indirect data dependence between b and d), and finally remove
the unnecessary memory edges. An edge to the end node is only required from
the last MA (Fig. (i)).
MA reordering will of course only affect kernels for which the memory edge
insertion in program order leads to avoidable delays. The results in Fig. 9.11
show that this applies to the susan kernel. Here, reordering saves 47 % of the
runtime. For the other examples, the default program order sequence of the
MAs already represents the optimal solution.
We omit a Figure showing area results here, because the resource requirements
are not affected by MA reordering.
9.5. Memory Access Parallelization
So far, we have applied a very limited memory access model, using a single
cache port to the MARC cache (cf. Fig. 9.1). However, MARC already sup-
ports multiple cache ports: Up to four simultaneous accesses to the cache are
possible. In many cases the actual limit is two simultaneous accesses though,
because the BRAM blocks underlying each cache line have just two access
143
9. Low-Level Optimizations
09_results_hg_diss_reorder_stat_cycles.pdf
staticCTs
Execution Time w/ Data$ miss Daten unten eingeben!
kernel no opt reorder no opt reorder
fcdf22 1629 1616 quantization 115897 115897
gfMultiply 1457 1457
md5 1974 1974
memcopy 384 384
sha 501 501
susan 919 487
0
500
1,000
1,500
2,000
2,500
fcdf22 gfMultiply md5 memcopy sha susan
#  
C
y c
l e
s
no opt reorder
0
20,000
40,000
60,000
80,000
100,000
120,000
140,000
quantization
#  
C
y c
l e
s
Figure 9.11.: Runtime comparison (number of simulation cycles), static CTs,
memory access reordering switched on vs. off.
ports.
To support parallel cache accesses [GäSK08] in COMRADE 2.0, we reorga-
nize the memory edges, and we map memory access (MA) nodes to actual
MARC cache ports. For this we can re-use and slightly extend the MA re-
ordering algorithm presented in the previous Section. To support n parallel
cache ports, we schedule independent MAs into n columns in the Table in Fig.
9.9(g). MAs in column i are then assigned cache port i, i.e., MAs in the same
row may execute in parallel. New memory edges are thus inserted in a per-
column top-down manner, resulting in n parallel MA strings.
Fig. 9.12 shows runtime comparisons of one vs. two parallel cache ports;
reordering is applied in both versions. The mean speed-up achieved is 1.15,
with a minimum of 0.93 for gfmultiply, and a maximum of 1.42 for susan. The
slow-down of gfmultiply is caused by an issue in the MARC memory back-end4
and not related to our own research.
A general limitation of the overall speed-up here is related to cache stalls. The
parallel cache ports, which enable the hardware kernel to access more than one
cached data word in the same cycle, do not increase the bandwidth between
the cache and the main memory. Thus, although multiple cache accesses can
4A fix for this is under way, but could not be integrated into our results anymore.
144
9.5. Memory Access Parallelization
now be performed at the same time, the fraction of the execution time needed
to fill and flush the cache lines is not affected by the MA parallelization. Due
to Amdahl’s Law, if M% of the execution time is improved by parallelization
with N cache ports, then the maximum speed-up is 1
1−M%+ M%N
. For N = 2, the
maximum achievable speed-up thus ranges from 1.08 (gfMultiply) to 1.71 (md5)
with an average of 1.47.
To examine the speed-up disregarding cache stalls, Fig. 9.13 shows runtime
comparisons omitting the stall delays. The maximum speed-up achieved is 1.8
(1.33 on average), limited for two reasons: First, the overhead for starting a
new loop iteration is not parallelized. For example, in the memcopy kernel, the
time to execute the loop body is decreased from 16 cycles (one port) to 8 cycles
(two ports), but two additional cycles are needed in both versions to start a new
loop iteration, giving a speed-up of 1810 = 1.8 instead of 2.0. Second, not all
MAs are independent; e.g., out of five accesses per iteration in gfmultiply, only
two are independent of each other.
09_results_hg_diss_reorder_par_stat_cycles_with_stalls.pdf
staticCTs
Execution Time w/ Data$ miss Daten unten eingeben!
kernel reorder reorder + par reorder reorder + par
fcdf22 1616 1340 quantization 115897 115897
gfMultiply 1457 1571
md5 1974 1895
memcopy 384 325
sha 501 394
susan 487 344
0
500
1,000
1,500
2,000
2,500
fcdf22 gfMultiply md5 memcopy sha susan
#  
C
y c
l e
s
reorder reorder + par
0
20,000
40,000
60,000
80,000
100,000
120,000
140,000
quantization
#  
C
y c
l e
s
Figure 9.12.: Runtime comparison (number of simulation cycles), static CTs,
reordering only vs. reordering with parallelization for two cache
ports.
Fig. 9.14 compares the area requirements. Although the data path and se-
quencer require exactly the same resources for both one and two cache ports,
there is a slight increase in the number of FFs in the two-port versions. This is
due to the additional MARC cache port: The address and data buses from the
145
9. Low-Level Optimizations
kernel reorder reorder + par speedup
fcdf22 1078 826 1.31
gfMultiply 215 198 1.09 Cacheline unglücklich verdrängt
md5 1639 1623 1.01
memcopy 143 79 1.81
quantization 94664 94664 1.00
sha 322 217 1.48
susan 365 223 1.64
0
200
400
600
800
1,000
1,200
1,400
1,600
1,800
fcdf22 gfMultiply md5 memcopy sha susan
#  
C
y c
l e
s
reorder reorder + par
0
10,000
20,000
30,000
40,000
50,000
60,000
70,000
80,000
90,000
100,000
quantization
#  
C
y c
l e
s
Figure 9.13.: Runtime comparison (number of simulation cycles), static CTs,
reordering only vs. reordering with parallelization for two cache
ports; cache stalls omitted.
hardware kernel to MARC are buffered in a hardware kernel wrapper to keep
critical paths short. For two cache ports, twice as many of these buffers are
required. Thus, this slight area increase is not caused by the hardware kernel
itself.
9.6. Memory Localization
Using the optimizations presented so far (operation chaining, memory access
(MA) reordering, and MA parallelization using two cache ports), we compare
the runtimes of COMRADE-generated hardware kernels achieved on the ACE-
M5 versus execution on the embedded PowerPC alone, as well as on an Intel
Core2 CPU of a conventional PC. We derive the hardware kernel runtimes
from the number of simulation cycles, considering the target frequency of 100
MHz. These numbers do not include delays due to CPU/RCU data transfer or
RCU reconfiguration. We measure the CPU runtimes using a high performance
counter on the Intel CPU and a cycle-accurate time base on the embedded
PowerPC. For both CPU measurements the fastest available optimizations are
146
9.6. Memory Localization
09_results_hg_diss_reorder_par_stat_area.pdf
staticCTs
Target Frequency: 100 MHz Daten unten eingeben!
kernel reorder reorder + par kernel
reorder: FFs reorder: LUTs reorder + par: FFs reorder + par: LUTs
fcdf22 2,331 1,599 2,451 1,611 md5
gfMultiply 2,076 1,495 2,124 1,475 quantization
memcopy 2,554 1,928 2,674 1,964 susan
sha 1,798 1,546 1,918 1,554
0
500
1,000
1,500
2,000
2,500
3,000
fcdf22 gfMultiply memcopy sha
#  
F F
s ,
 L
U
T s
reorder: FFs reorder: LUTs
reorder + par: FFs reorder + par: LUTs
0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
16,000
md5 quantization susan
#  
F F
s ,
 L
U
T s
Figure 9.14.: Area comparison (FFs and LUTs): static CTs, reordering only vs.
reordering with parallelization for two cache ports, synthesized
for target frequency 100 MHz.
enabled5.
Fig. 9.15 shows the results of these measurements. Three kernels (md5, mem-
copy, and susan) are actually executed faster on the 100 MHz RCU than on the
embedded 400 MHz PowerPC. However, the mean hardware kernel speed-up
is 0.84 versus the PowerPC, and down to 0.02 compared to the Intel CPU.
This is a negative result especially when we notice that the power consump-
tion of the embedded PowerPC (2.5 mW/MHz) [Ibmm10] lies roughly in the
same range as the power consumption of a Xilinx FPGA-based RCU [GäKo04]
[LSKH09].
An analysis of the hardware kernel bottlenecks reveals that the memory edges
leading across loop conditions (cf. Section 5.3.2, Memory Accesses in Nested
Loops) actually inhibit hardware pipelining of loop bodies, which is one of the
most important techniques for hardware acceleration. To break this memory
chain and actually pipeline loop bodies, two preconditions must be fulfilled.
First, the MAs in the loop body have to be independent across loop iterations.
Second, a considerably higher memory bandwidth is required, so that (at best)
all MAs in the loop body can be executed in parallel. For this, a single cache is
not appropriate, because it would have to arbitrate many accesses, increasing
5For the Intel CPU, we use the Microsoft Visual Studio compiler; for the PowerPC, we use a gcc
cross-compiler.
147
9. Low-Level Optimizations
09_results_hg_diss_hw_cpu_time.pdf
kernel HW Kernel (opt) @ 100 MHz ACE-M5 PPC @ 400 MHz Intel Core2 @ 2.66 GHz HW Kernel (opt) @ 100 MHz
fcdf22 13.66 7.54 0.77 quantization 1156.15
gfMultiply 15.71 6.535 0.39
md5 13.95 18.1 6.37
memcopy 3.25 4.03 0.461
sha 3.94 1.72 0.544
susan 3.37 5.35 0.78
0
2
4
6
8
10
12
14
16
18
20
fcdf22 gfMultiply md5 memcopy sha susan
R
u n
t i m
e  
( µ
s )
HW Kernel (opt) @ 100 MHz
ACE-M5 PPC @ 400 MHz
Intel Core2 @ 2.66 GHz
0
200
400
600
800
1,000
1,200
1,400
quantization
R
u n
t i m
e  
( µ
s )
Figure 9.15.: Runtime comparison; HW kernels use operation chaining, MA
reordering, and two parallel cache ports.
the latency of each single access. The cache currently integrated in MARC has
the practical limit of four simultaneous cache accesses (being organized in two
banks, each supporting two accesses at a time).
To be able to execute more MAs in parallel, we require more independent
memories, which are (ideally) accessible without additional overhead. Pos-
sible approaches are memory localization techniques, offering many separate
memories which can be located close to computational logic. Different local
memory concepts include the many-cache model (used by the CHiMPS com-
piler [PBDM08]), MARC II [LaWK11], as well as local scratch pad memo-
ries. The many-cache model maps non-overlapping main memory sections to
separate caches. If memory sections do overlap, cache coherency must be es-
tablished through software. MARC II also relies on multiple caches, but keeps
the caches coherent using hardware techniques. Furthermore, it supports spec-
ulative memory accesses. Scratch pad memories do not have a binding to the
main memory per se; they are the ideal choice for the storage of temporary,
local data which does not have to be written back to the main memory during
a hardware-to-software transition. However, scratch pad memories are very
broadly applicable and can even be used as caches (loading and flushing data
is then implemented manually).
148
9.6. Memory Localization
As MARC II has only been released very recently, it cannot be considered in
greater detail in this work. Instead, we use local scratch pad memories to pro-
vide a higher memory bandwidth. This new feature, called LMEM, is shown
in Fig. 9.166. A hardware kernel can now access a configurable number of
local memories in addition to the MARC cache ports. Each local memory (im-
plemented in BRAMs on Xilinx Virtex family FPGAs) has a read and a write
port which can be accessed independently. When the hardware kernel is not ac-
tive, the CPU can access the local memories via memory mapping directly or
(for better performance) program a Local Paging Unit (LPU), which can copy
data between the main memory and the local memories using a fast stream port
(provided by MARC). Currently, a basic streaming version with a limited peak
performance of 400 MB/s is used, while up to 3.2 GB/s are actually possible
on the ACE-M5 (such an upgrade is under way). Memory access delays for
the RCU can occur due to DDR2-SDRAM refresh cycles and OS access to the
main memory.
COMRADE 2.0 supports LMEM in a semi-automated way that still requires
some manual intervention. It is the programmer’s choice to assign an array
or pointer variable to either cache or a local memory. To indicate local mem-
ory, the suffix _local has to be appended to the pointer identifier. Each of
these pointers may be used at most twice in the C program (once as a load, and
once as a store), because each local memory has only two access ports. The
programmer further has to take care of copying data to and from the local mem-
ories (e.g., by programming the LPU). The size and number of scratch pads, as
well as possible parallel accesses, are set by compiler definitions. COMRADE
then omits memory dependences completely, because all MAs are independent.
Thus, the bottleneck-inducing memory chains have been broken.
To get an impression of the performance achievable with LMEM, we consider
a localized version of the md5 kernel. This kernel computes the MD5 checksum
(128 bits) of an input message of arbitrary length. md5 exhibits ideal properties
for being accelerated through localization: It has a long loop body (having a
latency of 192 cycles in our implementation), the loop body is pipelineable
(we pipeline over a number of different messages here, using C-slow execution
[LeRS83]), and all occuring memory accesses are independent.
However, an md5 kernel compiled in a straightforward fashion does not yet
work in a fully pipelined manner due to token jams. This is shown in Fig.
9.17(a); the numbers of the tokens denote the loop iteration the token is as-
signed to: After the left AT (1) has passed op1, it takes n cycles (in the extreme
6For future work, the combination of LMEM and MARC II appears to be quite promising.
149
9. Low-Level Optimizations
RCU
CPU
HW Kernel
Main Memory
MARC
LPU
LMEM 1re
ad
w
rit
e
LMEM nre
ad
w
rit
e
Stream Port Cache Ports
m
em
or
y-
m
ap
pe
d
Figure 9.16.: The LMEM memory system.
150
9.6. Memory Localization
case, n = 192 here) until op2 is started; in the meantime, the right AT (1)
blocks the load node. In Fig. 9.17(b), a nop node extended with an output buffer
of size n holds the tokens which accumulate before op2, so that the load node
can continue to feed input data into the long-latency data path, thus actually
enabling pipelining. This technique can be understood as pipeline balancing in
a dynamic context.
op1
(a) (b)
op2
load
...
n cycles
(expected 
latency) QDEPTH = n
op1
op2
load
...
nop
+
+
+
+
+
+
(1)
(1)
+
+
+
(1)
(2)
(3)
(2)
(3)
(2)
(3)
+ (1)
+ (2)
+ (3)
Figure 9.17.: md5 COCOMA section, (a) without output buffer (QDEPTH=0),
(b) with output buffer (QDEPTH=n); tokens numbered according
to loop iterations.
The optimal buffer size depends on the maximum number of loop iterations
(i.e., the number of hash values to be computed, as we pipeline over a set of
input messages here): For n iterations, a buffer size of n suffices. Once n has
reached 192 (the pipeline latency), it does not need to be increased any further
(even for more than 192 hash values), because after that initial latency, tokens
enter and exit a buffer at the same rate.
Because the localized md5 kernel has been studied during the final develop-
ment phase of this work, buffer insertion for pipeline balancing has not been
integrated into the general-purpose compile flow – instead, some heuristics are
used for the md5 example, so that we can at least examine the effects on runtime
and area requirements of this kernel.
Fig. 9.18 shows the measured runtimes using the static CT model (for dynamic
CTs we got virtually the same results) for the buffer sizes 0, 32, and 192. Note
the logarithmic scale: 1024 messages are processed 31x faster (due to pipelin-
ing) with a QDEPTH value of 192 compared to a design without buffers. The
151
9. Low-Level Optimizations
pure HW kernel throughput of 1.95 GByte/s (@ 100 MHz), obtained with a
buffer size of 192, can actually be achieved on the ACE-M5, which offers a
throughput of 3.2 GByte/s between the main memory and the local memories.
100
1000
10000
100000
1 10 100 1000
# Messages
#  
C
y c
l e
s
QDEPTH=0
QDEPTH=32
QDEPTH=192
Figure 9.18.: md5 kernel, static CTs: number of simulation cycles over number
of messages for different buffer sizes.
The resource requirements are shown in Fig. 9.19. For static as well as dy-
namic CTs the number of required LUTs increases with the QDEPTH buffer
size, because the buffers are mapped to distributed memory (using LUTs) in
the Virtex-5 FPGA. Static CTs here save 22 % of the FFs and 29 % of the LUTs
compared to dynamic CTs. Unfortunately, only the static CT and dynamic CT
md5 versions without buffers (QDEPTH=0) fit into the FPGA (XC5VFX70T) of
our ACE-M5 test platform – we have actually run the static CT version on the
board at 100 MHz. For the buffered versions an upgrade of the target FPGA
would be necessary. For example, the XC5VFX200T chip with nearly three
times as many FF and LUT resources would suffice for the md5 versions up
to QDEPTH=128. In addition to the total number of FFs and LUTs available,
we also need to take into account the number of LUTs that are actually us-
able as distributed memory. Although the XC5VFX200T chip offers 122,880
LUTs in total, only 36,480 LUTs are usable as distributed memory. Thus the
QDEPTH=192 versions, requiring more than 44,000 of such LUTs (Fig. 9.20),
do not fit into the XC5VFX200T. However, with newer FPGA generations,
the amount of available logic resources increases (Virtex-6: up to 470k LUTs
[Xili10]; Virtex-7: up to 1.2 million LUTs [Xils10]). FPGA area will no longer
be the main performance-limiting resource. That role is likely to be taken up
by external memory bandwidth in the future.
Assuming the availability of sufficient reconfigurable area, the HW kernel gen-
erated by COMRADE 2.0 with a buffer size of 192 would achieve a speed-up
152
9.6. Memory Localization
09_results_hg_diss_md5_stat_dyn_area.pdf
dies sind echte xc5vfx200t-Werte!
QDEPTH 0: QDEPTH-Werte "2" manuell zu 0 gesetzt
staticCTs dynamicCTs
QDEPTH statCT: FFs statCT: LUTs statCT: LUT distram dynCT: FFs dynCT: LUTs dynCT: LUT distram
0 23,384 16,072 966 28,490 28,172 966
64 29,654 48,107 14,709 38,670 67,277 15,397
128 29,973 61,353 26,581 39,487 81,960 28,629
192 30,316 81,668 44,423 39,369 102,776 48,519
max V5-fx70t 44,800 44,800 13,120 wahrscheinlich wachsen die distrams auf dem V5 in 64er-Schritten,
max V5-fx200t 122,880 122,880 36,480
Mittelwert
0
20,000
40,000
60,000
80,000
100,000
120,000
0 64 128 192
QDEPTH
#  
F F
s ,
 L
U
T s
statCT: FFs statCT: LUTs dynCT: FFs dynCT: LUTs
XC5VFX70T
XC5VFX200T
Figure 9.19.: md5 kernel: Virtex-5 area requirements for static and dynamic
CTs; dotted lines indicate resources available on target FPGA.09_results_hg_diss_md5_stat_dyn_area_lut_distram.pdf
QDEPTH statCT: LUT dist memdynCT: LUT dist mem
0 966 966
64 14,709 15,397
128 26,581 28,629
192 44,423 48,519
0
10,000
20,000
30,000
40,000
50,000
60,000
0 64 128 192
QDEPTH
#  
L U
T s
 u
s e
d  
a s
 d
i s
t  m
e m
statCT: LUT dist mem dynCT: LUT dist mem
XC5VFX70T
XC5VFX200T
Figure 9.20.: md5 kernel: Virtex-5 distributed memory LUT requirements for
static and dynamic CTs; dotted lines indicate resources available
on target FPGA.
153
9. Low-Level Optimizations
of 7x versus an Intel Core2 CPU @ 2.66 GHz, and a speed-up of 37x versus
execution on the embedded superscalar PowerPC 440@ 400MHz of the ACE-
M5 alone, as shown in Fig. 9.21. This illustrates the extreme performance po-
tential of memory localization, improving the HW kernel speed-up by an order
of magnitude.
1,218
237
33
0
200
400
600
800
1,000
1,200
1,400
ACE-M5 PPC
@ 400 MHz
Intel Core2
@ 2.66 GHz
HW Kernel
@ 100 MHz
E x
e c
u t
i o
n  
T i
m
e  
( µ
s )
Figure 9.21.: md5 kernel: runtime comparison for 1024 messages, 512 bits
each; HW kernel: QDEPTH=192.
154
10. Application-Level Study
Having analyzed isolated hardware kernels so far, we will now examine the
speed-up on the application-level, i.e., the speed-up achieved for the execution
of a complete application on an adaptive computer, where only parts of the
application are actually accelerated by the RCU. As example application we
use the wavelet image compression program (Versatility Stressmark) provided
by the Honeywell Benchmark Suite [Hone97]. It compresses a 512x512 pixel
gray scale image (8 bits per pixel) by successively applying the four steps (b)-
(e) shown in Fig. 10.1. After the wavelet transform step (b), only one fourth
of the resulting image (image areas A..G in the upper left corner) is processed
by quantization (c), run-length encoding (d), and entropy encoding (e), finally
resulting in a compressed bitstream (f).
A
C
B
D
E
H
F
G
I
J
A
C
B
D
E
F
G
(b)
Wavelet Transform
011 
10...
A
C
B
D
E
F
G
A
C
B
D
E
F
G
(c)
Quantization
(d)
Run-Length
Encoding
(e)
Entropy
Encoding
(a)
Input Image
(f)
Compressed
Image Data
Figure 10.1.: Wavelet image compression steps.
The wavelet transform step is explained in more detail in Fig. 10.2. In fact
the input image is transformed six times ((b)-(g) in the Figure), each time per-
forming a (2,2)-biorthogonal Cohen-Deaubechies-Feauveau transform imple-
mented in the lifting scheme [UyRB99]. Each transform stage produces a low-
pass and a high-pass-filtered image section1. (b) transforms each row of the
image, creating a low-pass section on the left and a high-pass section on the
right hand side. (c) applies the same procedure to each image column, storing
1While the low-pass section is a scaled version of the original image, the high-pass section con-
tains wavelet coefficients which are required to reproduce the original image from the scaled
version.
155
10. Application-Level Study
the low-pass values at the top and the high-pass coefficients at the bottom of the
image. This scheme of image row and column transformation is then reapplied
to the top left quarter of the image ((d) and (e) in Fig. 10.2) and afterwards to
the top left quarter of that quarter ((f), (g)).
(b)
rows 512
(a)
original
(c)
cols 512
(d)
rows 256
(e)
cols 256
(f)
rows 128
(g)
cols 128
512
51
2
Figure 10.2.: Wavelet transform stages.
We have compiled2 the compression algorithm and have executed it on the
PowerPC 440 (PPC440) of the ACE-M5, using the popular Lena image (con-
tained in the Honeywell Benchmark Suite [Hone97]) as input. Table 10.1
2Pure SW compilation using a GCC cross compiler, version 4.2.4; we have used the highest
optimization effort level -O3.
156
Original Optimized (sw-opt)
Wavelet Transform 204,791 µs 54.7 % 204,809 µs 94.0 %
Quantization 156,751 µs 41.8 % 7,080 µs 3.3 %
Run-Length Encoding 9,293 µs 2.5 % 2,570 µs 1.2 %
Entropy Encoding 3,821 µs 1.0 % 3,362 µs 1.5 %
Total 374,656 µs 100.0 % 217,821 µs 100.0 %
Table 10.1.: Execution time of the wavelet compression program executed on
the ACE-M5 CPU, original (columns two, three) and optimized
version sw-opt (columns four, five).
(columns two and three) shows the runtime of each compression step, omitting
file system delays (i.e., delays due to transfer between hard disk and main mem-
ory). We have discovered that this original implementation contains unneces-
sary computations. The quantization and run-length encoding steps transform
all 10 image areas A..J, even though only areas A..G are actually processed by
the final entropy encoding step and then stored in the output file. Furthermore,
the quantization step iterates over image columns instead of rows, decreasing
the cache efficiency. Just by fixing these two issues we have obtained an op-
timized version sw-opt of the program which saves more than 40 % of the
runtime (columns four and five in Table 10.1).
Our goal is to create a hardware accelerated version of the optimized wavelet
compression program. Obviously the wavelet transform step (94.0 % of the
runtime) is the ideal hardware acceleration candidate. According to the results
presented in Chapter 9, we apply memory localization to obtain a high mem-
ory throughput. An optimal solution would load the complete image into the
local memory, so that the six wavelet transform stages can directly access the
image data without interruption. However, the ACE-M5 LMEM cannot hold
the complete image. As each pixel requires four bytes3, the image requires
1,024 KB, while the Virtex-5 of the ACE-M5 offers a maximum of 666 KB of
BRAM storage. Therefore we have to process the image section by section.
For LMEM we choose 16 BRAM blocks with 4,096 bytes per block (the same
configuration used for the md5 kernel in the previous Section), making up a
total LMEM size of 64 KB, i.e., each image section is 16,384 pixels in size.
We have accordingly adjusted the C algorithm4 to operate on 16 arrays (subse-
quently referred to as local arrays), which model the BRAMs later used by the
3The C program uses the int data type to hold the gray value of a pixel, because the width of the
data (originally eight bits) expands during processing.
4The resulting C code is shown in Appendix D.8.
157
10. Application-Level Study
Adjusted for COMRADE 2.0 (hw-opt)
Wavelet Transform 332,667 µs 96.1 %
Quantization 6,907 µs 2.0 %
Run-Length Encoding 2,541 µs 0.7 %
Entropy Encoding 4,177 µs 1.2 %
Total 346,292 µs 100.0 %
Table 10.2.: Execution time of the wavelet compression program executed on
the ACE-M5 CPU alone, adjusted for compilation with COM-
RADE 2.0.
HW kernel5. These local arrays hold the image pixels initially, while holding
low-pass values and high-pass coefficients during the operation of the algo-
rithm. Before and after each actual wavelet transform, we copy the data from
the original integer array to the local arrays and from the local arrays back to
the integer array respectively. Table 10.2 shows the execution time measured
for this altered program version, which we call hw-opt. Due to the copy opera-
tions, the total runtime has now increased by 59 % versus sw-opt in Table 10.1.
The wavelet transform percentage has grown to 96.1 %.
We have compiled hw-opt (the source code is given in Appendix D.8) with
COMRADE 2.0 and using static CTs. Setting QDEPTH to 16 suffices here, be-
cause the data path lengths are much shorter than for the md5 kernel in the pre-
vious Section. Simulation results show that the compiled HW kernel requires
5,174 cycles to transform an image section of 16,384 pixels. In fact the HW
kernel transforms 16 pixels every five cycles (due to loop-carried dependences
of length five) making up a throughput of 3.2 pixels per cycle. Note that the
HW kernel throughput is the same regardless of whether the pixels have been
copied from image rows or image columns; instead, the column access delay
influences the data transfer time between main memory and LMEM.
When copying data between main memory and LMEM, note that only for the
first wavelet stage (rows 512) the LPU is able to transfer a whole image section
by a single burst transmission. For the remaining row stages (rows 256, rows
128) each row has to be transferred by a separate burst, because the memory
footprints of the rows are not directly contiguous in those cases. Thus the LPU
must be programmed multiple times. For the column-wise transforms, even the
cols 512 stage requires a separate burst per column, again because the columns
are not stored in directly adjacent memory regions. Furthermore, column pro-
5As memory localization has only been integrated in the final development phase of this work,
this transformation is currently done manually; cf. Section 9.6.
158
cessing requires non-unit strided main memory accesses, because 511 pixels
must be skipped in order to access the next pixel of a column. Unfortunately
non-unit strides are not implemented in the stable memory back-end for the
ML507 (MARC). Professor Koch’s group is currently extending MARC II to
provide this capability and also allow wider data transfers of 256 bits per cycle.
For this thesis, we were able to test a preliminary version of this fast stream-
ing implementation which already works correctly in post-synthesis simula-
tions, but not yet in real hardware (most likely due to vendor tool bugs in tim-
ing optimization). Combining this MARC II prototype with the COMRADE-
generated HW kernel and LMEM, we have successfully synthesized a design
targeting the ML507, which works correctly up to the post-synthesis simula-
tion. We call this design wave stream.
In order to show the correct behavior of the HW kernel even after place-and-
route, we have built the streaming-less design wave mmap which uses memory
mapping instead of LPU-based streaming to transfer data between main mem-
ory and LMEM. For this design, even the post-place-and-route simulation has
been successful, showing that the HW kernel works correctly at the designated
ACE-M5 frequency of 100 MHz.
In the following we consider the results obtained for wave stream. Fig. 10.3
shows the area required by the complete FPGA design. About half of the FFs,
LUTs, and BRAM blocks are used.
Table 10.3 and Fig. 10.4 show the FPGA resource distribution for the design
components. The complete hardware design consists of the HW kernel (data
path and sequencer), LMEM (LPU and local memory), buffers and muxes
(which connect HW kernel, LMEM, and MARC II), a MARC II prototype
(containing a techmod6 for the connection to the MCI bus, and a stream port to
buffer streamed data), and additional ML507 board support components (BSC,
including on-chip buses, DDR2 controller, and peripherals such as compact
flash, UART, and ethernet). The biggest fraction of the FFs and LUTs are re-
quired by the data path and the BSC; BRAM is used mainly by LMEM,MARC
II, and the BSC.
Fig. 10.5 shows the (wave stream) runtimes for each wavelet stage obtained
by post-synthesis simulation. Obviously the column stages run much longer
than the row stages. This is due to the strided main memory accesses which
decrease the bandwidth to 4 (instead of 32) bytes per cycle. Altogether we
have measured 2,787,187 simulation cycles, which translates to 27,872 µs on
the ACE-M5. This runtime includes HW kernel computation, data transfers
6A techmod is an adapter which connects a certain bus technology to MARC II.
159
10. Application-Level Study
17,895
23,038
85
26,905
21,762
63
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
FFs LUTs BRAM blocks
(36 kb each)
Unused
Used
Figure 10.3.: Wave stream: area requirements for the ACE-M5 RCU.
FFs LUTs BRAM blocks
(36 kb each)
Board support components 7,458 9,281 39
MARC II stream_port_256 481 401 4
MARC II techmod_mci 1,413 793 25
buffers + muxes 1,721 11 0
local memory 907 2,484 17
LPU 577 725 0
HW kernel: sequencer 267 874 0
HW kernel: data path 5,071 8,469 0
Table 10.3.: Wave stream: area distribution.
160
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
FFs LUTs BRAM blocks
(36 kb each)
Board support
components
MARC II:
stream_port_256
MARC II: techmod_mci
Buffers + muxes
Local memory
LPU
HW kernel: sequencer
HW kernel: data path
Figure 10.4.: Wave stream: area distribution.
161
10. Application-Level Study
between DDR2-SDRAM and LMEM (even accounting for DRAM refresh cy-
cles), as well as memory-mapped accesses7 to the HW kernel and the LPU.
During simulation, the LPU has been used 2,592 times, while the HW ker-
nel has been started 42 times. Compared to the fastest software-only wavelet
transform version run on the PPC440, the HW kernel achieves a speed-up of
7.3x.
1,887
18,879
729
4,848
248
1,280
0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
16,000
18,000
20,000
rows 512 cols 512 rows 256 cols 256 rows 128 cols 128
R
u n
t i m
e  
[ µ
s ]
Figure 10.5.: Wave stream: ACE-M5 runtimes per wavelet stage, obtained by
post-synthesis simulation; total application runtime: 27,872 µs.
Taking also the quantization, run-length encoding and entropy encoding steps
into account (Tab. 10.4), we obtain a total ACE-M5 runtime of 41,497 µs.
Fig. 10.6 compares this result to software-only execution on the ACE-M5 CPU
(PPC440) and on an Intel Core2 CPU. The COMRADE 2.0-based wave stream
application achieves a considerable speed-up of 5.2x versus the fastest PPC440
version (sw-opt), even though the PPC440 runs at 4x the clock frequency of
the RCU. Compared to sw-opt run on the Core2, the ACE-M5 is only 10 times
7To transfer live variables to the HW kernel and to configure the LPU.
162
Processing Unit ACE-M5 Runtime [µs]
Wavelet transform RCU 27,872
Quantization CPU 6,907
Run-length encoding CPU 2,541
Entropy encoding CPU 4,177
Total 41,497
Table 10.4.: Wave stream: total ACE-M5 runtime.
slower, although the Core2 frequency is 26 times higher.
217,821
41,497
3,975
0
50,000
100,000
150,000
200,000
250,000
ACE-M5
CPU-only
400 MHz
sw-opt
ACE-M5
wave stream
100 MHz
Core2
CPU-only
2.66 GHz
sw-opt
R
u n
t i m
e  
[ µ
s ]
Figure 10.6.: Wavelet compression: execution time comparison, different com-
pute platforms.
The comparison between the ACE-M5 and the Core2 is quite unfair, though.
The ACE-M5 has to process the image step by step in an out-of core man-
ner due to its limited internal BRAM memory8. Thus, the image data (and
the wavelet coefficients, respectively) are copied multiple times between main
memory and LMEM. The Core2 instead possesses a 4 MB level-2 cache, so
that it can operate entirely on the cache.
This limitation will fall away with newer FPGA versions, which offer up to
4 MB (Virtex-6) and 8 MB (Virtex-7) of internal BRAM memory. Assuming
864 KB are actually used; 666 KB would be available. However, using more than 64 KB would
not significantly decrease the memory transfer load between main memory and LMEM. For
this, at least 1 MB would be required.
163
10. Application-Level Study
that the image is transferred from main memory to LMEM only once (before
the wavelet transform stages) and the resulting wavelet coefficients are trans-
ferred back only once (after the last stage), Tab. 10.5 (column two) shows an
estimation of the resulting runtime based on our wave stream measurements9.
While the ACE-M5 wavelet transform runtime now nearly draws level with
the Core2 (column four), the so far unaccelerated remaining steps quantiza-
tion, run-length encoding, and entropy encoding (QRE) become a bottleneck.
COMRADE 2.0 could be used to generate adequate HW kernels, so that these
three steps can also be accelerated by the ACE-M5 RCU. Assuming the same
speed-up for these kernels over the PPC440 software versions as estimated for
the wavelet transform step, the ACE-M5 could achieve a slightly better run-
time (column three) than the Core2. This result is very promising in terms of
power consumption. While the ACE-M5 requires ca. 10 W (including CPU
and RCU), the Core2 dissipates ca. 65 W [Inte11], so that the ACE-M5 would
save about 87 % of the Core2 power.
And there is still more yet unexploited potential left. More internal memory to-
gether with more reconfigurable area resources would allow even better speed-
ups with larger HW kernels (e.g., executing the wavelet transform for 32, 64,
or more pixels in parallel instead of currently 16).
To summarize, this application-level study gives a first glance on the perfor-
mance (in terms of speed-up, area requirements, and power) achievable on
the ACE-M5 with COMRADE 2.0. While the currently achieved speed-up
obtained over an embedded CPU is already significant, an FPGA with more
internal BRAM memory (e.g., Virtex-6, Virtex-7) is required to better exploit
the potentials offered by our LMEM architecture. For a more realistic rating of
COMRADE 2.0-generated kernels as well as the ACE-M5, a larger number of
benchmarks needs to be evaluated. Especially for examples with a higher de-
gree of irregularity (e.g., if/else structures), we expect even higher HW kernel
speed-ups due to a greater amount of pipeline stalls in the competing CPU.
9The wavelet transform step alone achieves a speed-up of 63x over the embedded PowerPC.
We estimate the QRE accelerated runtime through dividing the PowerPC runtime by 63, thus
estimating the wavelet transform speed-up for the QRE steps.
164
ACE-M5 ACE-M5 Core2
Runtime [µs] Runtime [µs] Runtime [µs]
(red. traff.) (red. traff.)
(QRE accel.)
SDRAM to LMEM 527
Wavelet: rows 512 828
Wavelet: cols 512 828
Wavelet: rows 256 207
Wavelet: cols 256 207
Wavelet: rows 128 52
Wavelet: cols 128 52
LMEM to SDRAM 532
Total Wavelet 3,232 3,232 2,911
Quantization 6,907 109 513
Run-length encoding 2,541 40 126
Entropy encoding 4,177 66 425
Total 16,857 3,447 3,975
Table 10.5.: Wave stream: ACE-M5 runtime estimated for reduced memory
traffic (column 2) and additional HW kernels for accelerating
quantization, run-length encoding, and entropy encoding (QRE,
column 3); comparison to the Core2 CPU (column 4).
165
10. Application-Level Study
166
11. Summary and Future Work
We have examined new approaches to high-level language compilation for
adaptive computers, especially in the hardware back-end. While most of our
techniques are language-independent, we have chosen to examine the compi-
lation of ANSI C, which is still dominant in the embedded systems domain.
An important premise of our approach is the broad language support includ-
ing pointers and nested loops, instead of using just a subset of C to represent
simple data flow in C syntax. To cope with the complex dependences and
variable-latency operations arising from C constructs, we have investigated the
promising dynamic scheduling approach.
As the hardware generated by our compiler is intended to be used as hardware
accelerator, we predominantly optimize for runtime. To this end, we have em-
ployed a number of techniques: pipelining of data paths, even operator-internal
pipelining, and speculative execution. We refine the latter by operation cancel-
ing, a technique which allows for canceling mis-speculated operations. To be
able to model such techniques in the context of the complex dependences aris-
ing from C, we required a newmicro-architectural model, because simple (con-
trol) data flow graphs did not suffice anymore. Therefore we have developed
COCOMA, a low-level intermediate representation that fills the gap between
the control flow graph (CFG) and the hardware description language (HDL)
representation. This token-based model is sufficiently flexible to accommodate
even dynamic systems with variable latencies.
We have explained in detail how a CFG can be mapped to a COCOMA in-
stance. For this, normalizing the CFG into a structured form with top-testing
loops has proven practical. From such a normalized CFG, a COCOMA in-
stance is created in two steps. First, nodes and dependence edges are created;
this produces a control memory data flow graph (CMDFG). Second, a dy-
namic scheduling sequencer is generated from the dependences encoded by the
CMDFG graph structure. This sequencer acts as glue logic between the opera-
tors. CMDFG and sequencer together represent the COCOMA instance. Once
this construction is complete, the compilation to the HDL is straightforward.
The sequencer mainly produces combinatorial logic, while the CMDFG results
in data paths composed from instantiated, word-wide hardware operators. For
167
11. Summary and Future Work
this it has proven helpful to resort to parametrizable operators, which we pro-
vide in the Modlib operator library developed according to the requirements
of this work. Simple extensibility and platform independence are important
features of the library, significantly contributing to its usability.
We have shown the practicality of our approaches by developing the COM-
RADE 2.0 compiler framework based on COMRADE 1.0 [Kasp05]. Of the
latter, we have re-used the front-end and adjusted it to meet the requirements
of our new back-end. For example, the new front-end removes goto statements,
so that the remaining compile passes work on a structured program represen-
tation. Our new back-end transforms the CFG via COCOMA to the HDL
Verilog. We have tested the compiler with several practical input programs
(synthetic regression tests as well as real-world examples) and obtained cor-
rect simulation results. This is a substantial progress over COMRADE 1.0,
which implemented only a rudimentary back-end with far fewer features, and
which produced correct Verilog output only for very simple C examples.
As far as we know, our work is the first study comparing simulation and syn-
thesis results of dynamic versus static cancel tokens (CTs). Static CTs have
turned out to be the better alternative, because they save area resources while
achieving the same runtime in most cases. In the cases where dynamic CTs
would actually achieve a better runtime, static CTs can be supported by adjust-
ing parameters of the surrounding logic (e.g., increasing data and token buffer
sizes) to again match the performance of dynamic CTs, but requiring less chip
area.
We have developed and examined several optimizations at the COCOMA level.
Operation chaining can save runtime, if the runtime is dominated by data path
latencies. We have measured speed-ups of up to 25 %. Furthermore, chain-
ing removes unnecessary registers and saves 5 to 10 % of the area resources.
Speed-ups have also been measured for memory access reordering (up to 42 %)
and memory access parallelization (up to 47%), without adversely affecting the
area requirements. But even combining these three techniques does (in most
cases) not suffice to outperform an embedded CPU. The main reason for this is
the limited memory bandwidth when using a central cache.
A considerable bandwidth increase is possible with memory localization.
Therefore we have developed the LMEM architecture, which provides inde-
pendent local memories, all of them accessible in parallel by the hardware ker-
nel. A local paging unit allows for fast, stream-based data transfer between the
main memory and LMEM. LMEM can be used very effectively if the memory
allocated by the target application can be separated into several independent
areas. An example hardware kernel running at 100 MHz computes MD5 hash
168
values 37x faster than an embedded CPU at 400 MHz, and still 7 times faster
than a conventional Intel CPU at 2.66 GHz. For a wavelet image compression
algorithm compiled with COMRADE 2.0 and using LMEM, we have mea-
sured an application-level speed-up of 5x over the embedded CPU, including
data transfer delays between main memory and LMEM. The latter example
runs only 10x faster on the conventional Intel CPU than on the adaptive com-
puter, although the Intel CPU has a 26x higher clock frequency.
Being an academic project with limited manpower, we were not able to fix all of
the remaining bugs in the system (primarily in the externally provided C front-
end). However, these known bugs can be bypassed through slight adaptations
of the input C code. An example is the missing support for the address operator
(&). The compiler should provide this by first eliminating unnecessary uses
and replace the remaining occurances by pointers. As a workaround, the user
currently replaces address operators manually.
To find bugs in the COCOMA implementation, the formal definition in Ap-
pendix A has proven to be extremely helpful. The COMRADE 2.0 C++ code
adheres very closely to these formal definitions, so that a bug found during
hardware kernel simulation can be immediately identified and fixed.
A very practical approach considering the operator library was to embed
operator-specific token logic inside the Modlib operators. This made the uni-
fied operator interface possible, which in turn simplified the design of the token
flow and the implementation of the sequencer generation algorithms.
To enhance the design and debugging flow during our work, we have integrated
mechanisms for graphical output of COCOMA instances as graph-like struc-
tures. In this manner, the static COCOMA structure can be visualized after
compilation, while dynamic behavior animations can automatically be gener-
ated from the simulation output.
Despite the efforts and contributions of this work, we have encountered many
venues for future work and research. Even though the original COMRADE 1.0
front-end as well as the underlaying SUIF2 framework contain several high-
level optimization passes, far more known compiler optimizations are not yet
taken advantage of. For this it seems to be reasonable to migrate to another
underlaying framework such as LLVM [Latt02] or Scale [Scal11], offering ex-
isting implementations of useful optimizations. Several pointer analysis tech-
niques available in these frameworks would be very helpful to automate mem-
ory reordering, memory parallelization, and memory localization.
Another suggestion to enhance the COMRADE 2.0 front-end is the integration
of floating point support. This would be possible with little effort, because the
169
11. Summary and Future Work
required hardware operators are already contained in Modlib, and hardly any
change to COCOMA would be necessary.
In the back-end, speculative memory accesses would be an interesting exten-
sion. Here, one has to distinguish between control speculation and data spec-
ulation. The former executes a memory access when data dependences are
fulfilled, without being known if the access actually has to be executed (i.e.,
there is no token yet at the incoming control edge). The latter executes the
memory access even without fulfilled data dependences, e.g., using a predicted
address value. Control speculation would require adaptations (such as an undo
function for speculative writes) mainly in the memory back-end, while data
speculation needs to adjust the COCOMA token flow as well.
The usability of COMRADE 2.0 would be greatly improved by better automa-
tion in the context of the very promising memory localization technique. This
includes finding independent memory sections, re-arranging them, and pro-
gramming the local paging unit accordingly.
Finally, the configuration scheduling technique [KaVK05], which is already
existing, needs to be used by the compiler, so that the compile flow is actually
completed down to the final FPGA configurations.
To conclude, this work has shown that dynamic scheduling is an effective and
appropriate method for high-level language compilation for adaptive comput-
ers. Besides exploring interesting approaches and gathering novel results, we
have discovered many further venues for future research.
170
Bibliography
[ADHL00] A, G. ; D, A. ; H, D. L. ; L, M. S. ; M, D. L. ; M,
B. R. ; S, C.: An Overview of the SUIF2 Compiler Infrastructure
/ Stanford University. 2000. – Technical Report
[BöJa66] B¨, C. ; J, G.: Flow diagrams, turing machines and languages with
only two formation rules. In: Communications of the ACM 9 (1966), no. 5,
pp. 366–371
[BrGa03] B, C. ; G, J.: Early Output Logic using Anti-Tokens. In: Proceed-
ings of the International Workshop on Logic Synthesis (IWLS), 2003
[Budi03] B, M.: Spatial Computation. Carnegie Mellon University, Pittsburgh,
USA, School of Computer Science, Diss., Dec. 2003
[BuNa08] B, B. ; N, W. A.: Compiler Generated Systolic Arrays for
Wavefront Algorithm Acceleration on FPGAs. In: Proceedings of the Inter-
national Conference on Field Programmable Logic and Applications (FPL),
2008, pp. 655–658
[CaHW00] C, T. J. ; H, J. R. ; W, J.: The Garp architecture and
C compiler. In: IEEE Computer 33 (2000), Apr., no. 4, pp. 62–69
[CaMS01] C, L. P. ; MM, K. L. ; S-V, A. L.: Theory
of latency-insensitive design. In: IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 20 (2001), Sep., no. 9, pp. 1059
–1076
[CFHJ06] C, J. ; F, Y. ; H, G. ; J, W. ; Z, Z.: Platform-Based
Behavior-Level and System-Level Synthesis. In: Proceedings IEEE Inter-
national System-on-Chip Conference (SOC), 2006, pp. 199–202
[CFRW91] C, R. ; F, J. ; R, B. K. ; W, M. N. ; Z, F. K.:
Efficiently computing static single assignment form and the control depen-
dence graph. In: ACM Transactions on Programming Languages and Sys-
tems (TOPLAS) 13 (1991), Oct., no. 4, pp. 451–490
[Dijk72] D, E. W.: Chapter I: Notes on structured programming. London, UK
: Academic Press Ltd., 1972. – pp. 1–82.
[Drey94] D, S.: Die Umwandlung hierarchischer Netzlisten vom Logik-
Austauschformat SLIF in die Hardware-Beschreibungssprache Verilog. In-
stitut für Theoretische Informatik, Technische Universität Braunschweig,
Germany, Jun. 1994
171
Bibliography
[ErHe94] E, A. M. ; H, L. J.: Taming Control Flow: A Structured Ap-
proach to Eliminating GOTO Statements. In: Proceedings of the Interna-
tional Conference on Computer Languages (ICCL), 1994, pp. 229–240
[FeOW87] F, J. ; O, K. J. ; W, J. D.: The Program Dependence
Graph and Its Use in Optimization. In: ACM Transactions on Programming
Languages and Systems (TOPLAS) 9 (1987), no. 3, pp. 319–349
[GäKo04] G¨, H. ; K, A.: Wavelet-based Image Compression on the Recon-
figurable Computer ACE-V. In: Proceedings of the International Confer-
ence on Field Programmable Logic and Applications (FPL) vol. 3203/2004,
Springer Berlin / Heidelberg, Aug. 2004, pp. 1006–1010
[GäKo08] G¨, H. ; K, A.: Accelerating Speculative Execution in High-Level
Synthesis with Cancel Tokens. In: Proceedings of the 4th International
Workshop on Reconfigurable Computing (ARC) vol. 4943/2008. Berlin, Hei-
delberg : Springer, Aug. 2008, pp. 185–195
[GäSK08] G¨, H. ; S, F. ; K, A.: Memory Access Parallelisation in High-
Level Language Compilation for Reconfigurable Adaptive Computers. In:
Proceedings of the International Conference on Field Programmable Logic
and Applications (FPL), 2008, pp. 403–408
[GäTK10] G¨-L¨, H. ; T, B. ; K, A.: A Flexible Compute and
Memory Infrastructure for High-Level Language to Hardware Compilation.
In: Proceedings of the International Conference on Field Programmable
Logic and Applications (FPL), 2010, pp. 475–482
[GCHB05] L G, B. ; C, E. ; H, S. ; B, P. ; J, C. ; M, E.:
C-based Rapid Prototyping For Digital Signal Processing. In: Proceedings
of the 13th European Signal Processing Conference (EUSIPCO), 2005
[GDGN03] G, S. ; D, N. ; G, R. ; N, A.: SPARK: AHigh-Level Syn-
thesis Framework for Applying Parallelizing Compiler Transformations. In:
Proceedings of the 16th International Conference on VLSI Design (VLSI),
2003, pp. 461–466
[GGDN04] G, S. ; G, R. ; D, N. ; N, A.: SPARK: A Parallelizing
Approach to the High-Level Synthesis of Digital Circuits. Kluwer Academic
Publishers, 2004
[Golz78] G, U.: Parallele Programmschemata, Technische Universität Hannover,
Habilitation, 1978
[GREA01] G, M. R. ; R, J. S. ; E, D. ; A, T. M. ; M, T.
; B, R. B.: MiBench: A free, commercially representative embedded
benchmark suite. In: Proceedings of the IEEE International Workshop on
Workload Characterization, 2001, pp. 3–14
[GuHe07] G, Y. ; H, M. C.: High Performance Molecular Dynamics Sim-
ulations with FPGA Coprocessors. In: Proceedings of the Reconfigurable
Systems Summer Institute (RSSI), 2007
172
Bibliography
[GuNB08] G, Z. ; N, W. ; B, B.: Efficient Hardware Code Generation
for FPGAs. In: ACM Transactions on Architectures and Code Optimization
(TACO) 5 (2008), May, no. 1, pp. 1–26
[GWDL92] G, D. ; W, A. ; D, N. ; L, S.: High-Level Synthesis – Introduc-
tion to Chip and System Design. Kluwer Academic Publishers, 1992. – pp.
233–235.
[HaWa97] H, J. R. ; W, J.: Garp: a MIPS processor with a reconfigur-
able coprocessor. In: Proceedings of the 5th IEEE Symposium on FPGA-
Based Custom ComputingMachines (FCCM). Washington, DC, USA : IEEE
Computer Society, 1997
[HMSH10] H, J. ; M¨, P. ; S, F. ; H, D. ; K, A.: Accel-
erating High-Level Engineering Computations by Automatic Compilation of
Geometric Algebra to Hardware Accelerators. In: Proceedings of the Inter-
national Conference on Embedded Computer Systems: Architectures, Mod-
elling and Simulation, 2010
[Hoan93] H, D. T.: Searching Genetic Databases on Splash 2. In: Proceedings
of the IEEE Workshop on FPGAs for Custom Computing Machines, IEEE,
Apr. 1993, pp. 185–191
[Hone97] H T C: Versatility Stressmark, Version 0.8 CDRL
A001. Minneapolis, USA : Honeywell Technology Center, 1997. – Techni-
cal Report
[HTHT09] H, Y. ; T, H. ; H, S. ; T, H.: Proposal and Quanti-
tative Analysis of the CHStone Benchmark Program Suite for Practical C-
based High-level Synthesis. In: Journal of Information Processing vol. 17,
2009, pp. 242–254
[Kahn95] K, H. J.: EDIF Version 350/400 and information modelling, 1995, pp.
451
[Kasp05] K, N.: COMRADE - Ein Hochsprachen-Compiler für adaptive Com-
putersysteme, Abteilung Entwurf integrierter Schaltungen, Technische Uni-
versität Braunschweig, Germany, Diss., Jun. 2005
[KaVK05] K, N. ; V  V, J. ; K, A.: Configuration Merging for
Adaptive Computer Applications. In: Proceedings of the International Con-
ference on Field Programmable Logic and Applications (FPL), 2005, pp.
217–222
[Koch04] K, A.: Advances in Adaptive Computer Technology. Habilitation,
Abteilung Entwurf integrierter Schaltungen (E.I.S.), Technische Universität
Braunschweig, Germany, Dec. 2004
[KoKa05] K, A. ; K, N.: High-Level-Language Compilation for Recon-
figurable Computers. In: Proceedings of the International Conference on
Reconfigurable Communication-centric SoCs (ReCoSoC), 2005
173
Bibliography
[LaKo00] L, H. ; K, A.: Memory Access Schemes for Configurable Pro-
cessors. In: Proceedings of the 10th International Workshop on Field-
Programmable Logic and Applications (FPL). London, UK : Springer-
Verlag, Aug. 2000, pp. 615–625
[LaKo09] L, H. ; K, A.: Architectures and Execution Models for Hardware/-
Software Compilation and their System-Level Realization. In: IEEE Trans-
actions on Computers 99, PrePrints (2009)
[LaPM06] L, D. ; P, O. ; M, P.: Automated Generation of Hard-
ware Accelerators with Direct Memory Access from ANSI/ISO Standard C
Functions. In: Proceedings of the 14th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM). Washington, DC,
USA : IEEE Computer Society, 2006, pp. 45–56
[Latt02] L, C.: LLVM: An Infrastructure for Multi-Stage Optimization, Com-
puter Science Dept., University of Illinois at Urbana-Champaign, Diplomar-
beit, Dec. 2002
[LaWK11] L, H. ; W, T. ; K, A.: MARC II: A Parametrized Specula-
tive Multi-Ported Memory Subsystem for Reconfigurable Computers. In:
Proceedings of the Conference on Design, Automation and Test in Europe
(DATE), 2011
[Leng93] L, C.: Loop Parallelization in the Polytope Model. In: Proceedings
of the 4th International Conference on Concurrency Theory (CONCUR).
London, UK : Springer-Verlag, 1993, pp. 398–416
[LePM97] L, C. ; P, M. ; M-S, W. H.: MediaBench: A tool
for Evaluating and Synthesizing Multimedia and Communications Systems.
In: Proceedings of the 30th Annual IEEE/ACM International Symposium on
Microarchitecture, 1997, pp. 330–335
[LeRK08] L, S. ; R, D. ; K, V.: LLVM-CHiMPS: compilation envi-
ronment for FPGAs using LLVM compiler infrastructure and CHiMPS com-
putational model. In: Proceedings of the Reconfigurable Systems Summer
Institute (RSSI), 2008
[LeRS83] L, C. E. ; R, F. M. ; S, J. B.: Optimizing Synchronous Cir-
cuitry by Retiming. In: Proceedings of the Caltech Conference on VLSI,
1983, pp. 87–116
[LSKH09] L, H. ; S, F. ; K, A. ; H, D.: Acceleration and En-
ergy Efficiency of a Geometric Algebra Computation Using Reconfigurable
Computers and GPUs. In: Proceedings of the IEEE Symposium on Field
Programmable Custom Computing Machines (FCCM), 2009, pp. 255–258
[MacM01] MM, D.: Nimble Compiler Environment for Agile Hardware.
Storming Media LLC (USA), 2001
174
Bibliography
[NeKo01] N, T. ; K, A.: A Generic Library for Adaptive Computing En-
vironments. In: Proceedings of the International Conference on Field Pro-
grammable Logic and Applications (FPL) vol. 2147/2001, 2001, pp. 503–
512
[Pana07] P, E. M.: The Molen Compiler for Reconfigurable Architectures,
Technical University Delft, Diss., Jun. 2007
[PBDM08] P, A. ; B, D. ; D, E. ; M, J. ; S,
P. ; E, S.: CHiMPS: A C-level compilation flow for hybrid CPU-
FPGA architectures. In: Proceedings of the International Conference on
Field Programmable Logic and Applications (FPL), 2008, pp. 173–178
[PEBD09] P, A. ; E, S. ; B, D. ; D, E. ; M, J. ; S,
H. ; S, P. ; W, R.: Performance and power of cache-based
reconfigurable computing. In: Proceedings of the ACM/SIGDA International
Symposium on Field Programmable Gate Arrays (FPGA). New York, NY,
USA : ACM, 2009, pp. 281–281
[Pugh91] P, W.: The Omega test: a fast and practical integer programming algo-
rithm for dependence analysis. In: Proceedings of the ACM/IEEE Confer-
ence on Supercomputing. New York, NY, USA : ACM, 1991, pp. 4–13
[RaSm94] R, R. ; S, M. D.: A High-Performance Microarchitecture with
Hardware-Programmable Functional Units. In: Proceedings of the 27th An-
nual International Symposium on Microarchitecture, ACM New York, NY,
USA, Dec. 1994, pp. 172–180
[Schu07] S, T. et al.: Accelerating the Cube Cut Problem with an FPGA-
augmented Compute Cluster. In: Parallel Computing: Architectures, Algo-
rithms and Applications, 2007
[ShHo97] S, M. ; H, S.: Fast and accurate flow-insensitive points-to
analysis. In: Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages (POPL). New York, NY, USA :
ACM, 1997, pp. 1–14
[ShVu93] S, M. ; V, J.: Fast implementations of RSA cryptography. In:
Proceedings of the 11th Symposium on Computer Arithmetic, 1993, pp. 252–
259
[SmHo02] S, M. D. ; H, G.: An Introduction to machine SUIF and its
Portable Libraries for Analysis and Optimization / Division of Engineering
and Applied Sciences, Harvard University. 2002. – Technical Report
[SSLM92] S, E. M. ; S, K. J. ; L, L. ; M, C. ; M, R. ;
S, A. ; S, H. ; S, P. R. ; B, R. K. ; S-
V, A. L.: SIS: A System for Sequential Circuit Synthesis / EECS
Department, University of California, Berkeley. 1992 (UCB/ERL M92/41).
– Technical Report
175
Bibliography
[Stee96] S, B.: Points-to Analysis in Almost Linear Time. In: Proceed-
ings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Pro-
gramming Languages (POPL). New York, NY, USA : ACM, 1996, pp. 32–
41
[Stef08] S, C. P.: Nallatech Software Tools on the Maxwell FPGA Parallel
Computer. In: Proceedings of the Reconfigurable Systems Summer Institute
(RSSI), 2008
[STKY99] S, R. ; T, M. ; K, A. ; Y, A. ; F, T. ; K,
T.: A scheduling method for synchronous communication in the Bach hard-
ware compiler. In: Proceedings of the Asian and South Pacific Design Au-
tomation Conference (ASP-DAC) vol. 1, 1999, pp. 193–196
[ThGä10] T, B. ; G¨-L¨, H.: Modlib: A Flexible Module Library for
High-Level Language Compilation / Abteilung Entwurf integrierter Schal-
tungen (E.I.S.), Technische Universität Braunschweig. 2010. – Technical
Report
[UyRB99] U, G. ; R, D. ; B, A.: Integer Wavelet Transforms
using the Lifting Scheme. In: Proceedings of the International Multiconfer-
ence on Circuits, Systems, Communications and Computers (CSCC) vol. 1,
1999, pp. 2651–2653
[Waka99] W, K.: C-based synthesis experiences with a behavior synthe-
sizer, “cyber”. In: Proceedings of the Conference on Design, Automation
and Test in Europe (DATE). New York, NY, USA : ACM, Mar. 1999, pp.
390–393
[Webe08] W, N.: Goto Elimination for the Adaptive Compiler COMRADE,
Embedded Systems and Applications Group (ESA), Technische Universität
Darmstadt, in cooperation with Abteilung Entwurf integrierter Schaltungen
(E.I.S.), Technische Universität Braunschweig, Germany, Diplomarbeit, Jul.
2008
176
Weblinks
[Alti08] A L.: C-to-Hardware Compiler User Manual. http:
//www.altium.com/files/altiumdesigner/s08/learningguides/
GU0122%20C-to-Hardware%20Compiler%20User%20Manual.pdf, 19th
May 2008
[Bass11] B, J. L.: FPGA C Compiler. http://sourceforge.net/projects/
fpgac, 24th Feb. 2011
[Bowe09] B, M. ; E S L.: Handel-C Language Reference
Manual. http://www.pa.msu.edu/hep/d0/l2/Handel-C/Handel%
20C.PDF, 13th Jan. 2009
[Ibmm10] IBM M D: The PowerPC 440 Core.
https://www-01.ibm.com/chips/techlib/techlib.nsf/
techdocs/852569B20050FF77852569970063431C/\protect\T1\
textdollarfile/440_wp.pdf, 17th Nov. 2010
[Impu11] I A T I.: Impulse-C Website. http://www.
impulseaccelerated.com, Feb. 2011
[Inte11] I, I.: Intel Core2 Processor Desktop Processor E6000 Data
Sheet. http://download.intel.com/design/processor/datashts/
31327807.pdf, 17th Jan. 2011
[Josh11] J, Jaivardhan: Parallel Computing and .Net. http://jai-on-asp.
blogspot.com/2010/05/parallel-computing-and-net.html, 24th
Feb. 2011
[Ment09] M G C.: Catapult-C data sheet. http://www.mentor.
com/products/esl/high_level_synthesis/catapult_synthesis/
upload/Catapult_DS_pdf.pdf, 15th Jan. 2009
[Mitr11] M: Mitrion Users’ Guide. http://forum.mitrionics.com/
uploads/Mitrion_Users_Guide.pdf, 24th Feb. 2011
[Scal11] S C G: Scale Compiler Homepage. http://www.cs.
utexas.edu/users/cart/Scale/, 24th Feb. 2011
[Xili10] X, I.: Virtex-6 Family Overview. http://www.xilinx.com/
support/documentation/data_sheets/ds150.pdf, 30 Nov. 2010
[Xils10] X, I.: Virtex-7 Series FPGAs. http://www.xilinx.com/
publications/prod_mktg/Virtex7-Product-Table.pdf, 30th Nov.
2010
177
Weblinks
178
Index
Activate Token, 53
Adaptive Compiler, 21
Adaptive Computer, 13
Adaptive Computing, 15
Annotation (Control Flow), 85
Anti-Token, 53
Back-edge, 26
Basic Block, 22
Bitwidth Configuration, 84
Branch Node, 22
Cancel Token, 53
CancelConsumed Function, 91
Central Processing Unit, 13
CFG Controller Node, 38
CMDFG Controller Node, 116
CMDFG Frame, 78
CMDFG Mux Control Node, 113
COMRADE Controller Micro-
Architecture, 92
Consumer, 111
Control Configuration, 85
Control Data Flow Graph, 30
Control Dependence, 23
Control Edge, 78
Control Flow, 85
Control Flow Frame, 21
Control Flow Graph, 22
Control Function, 85
Control Width, 85
Controller, 23
Critical Path Delay of a Node Path,
138
Cycle, 29
Data Edge, 78
Data Flow, 84
Data Flow Graph, 28
Data Input Port, 81
Data Output Port, 81
Domination, 23
Down Token, 53
Dynamic Cancel Token, 53
Dynamic Scheduling, 18
Early Evaluation, 39
ECFG, 37
Evaluation Functional (eval), 87
Explicit Loop Body, 123
Field Programmable Gate Array, 15
Fully Pipelinable, 29
Hardware Kernel, 34
Hardware Region, 27
Hardware-to-Software Transition, 27
Hierarchical Task Graph, 49
High-Level Synthesis, 16
Incoming Chained Node Path, 137
Incoming Critical Path Delay, 138
Initial Node, 75
Initiation Interval, 29
Initiation Interval Function, 29
Input Port Functional, 84
Inreg Node, 73
Intermediate Representation, 21
Irq Data Mux, 61
Irqreg Node, 74
Join Node, 23
179
Index
Late Evaluation, 39
Latency, 29
Latency Function, 29
Latency-Insensitive System, 32
LMEM, 149
Local Paging Unit, 149
Loop Body, 27
Loop Body Data Path, 69
Loop Control Node, 68
Loop Data Path, 124
Loop Header, 27
Loop Level, 123
Loop Mux, 61
Loop Mux Variable, 70
Master Mode, 57
Maximal Incoming Chained Node
Paths Set, 138
Maximal Outgoing Chained Node
Paths Set, 138
Memory Chain, 73
Memory Edge, 78
Meta Data Fetcher, 96
Modlib, 96
Multiplexer Data Types, 82
Mux, 60
Mux Control nCT Annotation, 85
Mux Predecessor Assignment Func-
tional, 85
Node Input Configuration, 86
Node Input Control Configuration, 87
Node Input Data Configuration, 87
Node Operation, 88
Node Output Configuration, 87
Node Output Control Configuration,
87
Node Output Data Configuration, 87
Node Termination, 138
Node Type, 81
Node Type Parameters, 81
Non-pipelinable, 29
Operation Chaining, 137
Outgoing Chained Node Path, 137
Outgoing Control Edges Function, 81
Outgoing Critical Path Delay, 138
Outgoing Edges Function, 81
Output Port Functional, 84
Outreg Node, 74
Parameter Assignment, 82
Parameter Setting Functional, 84
Partitioned Control Flow Graph, 27
Path, 23
Pegasus, 50
Pipelinable, 29
Post-Dominance Frontier, 23
Post-Domination, 23
Predicated Execution, 30
Producer, 111
Pseudo Activate Token, 67
ReadyConsumed Function, 91
Reconfigurable, 15
Region, 23
Resource Sharing, 31
Schedule, 31
Scheduling, 18
Scheduling Type, 91
Sequencer, 31
Sequencer Condition, 185
Sequencer Configuration, 89
Set of Relational Predecessors of An-
other Set, 78
Set of Relational Successors of An-
other Set, 78
Software Region, 27
Software Subregion, 27
Software-to-Hardware Transition, 27
Source Projection Function, 81
Speculative Execution, 30
Speculative Predicated Execution, 31
Static Cancel Token, 53
Static Scheduling, 18
Structured Program, 23
T-Structured Control Flow Frame, 24
180
Index
T-Structured Control Flow Graph, 24
T-Structured Transformations, 24
Target Projection Function, 81
Timed Data Flow Graph, 29
Top-Structured (T-Structured) Pro-
gram, 24
Type Function, 84
Up Token, 53
181
Index
182
Acronyms
ALU Arithmetic Logic Unit
AST Abstract Syntax Tree
AT Activate Token
BSC Board Support Components
CDFG Control Data Flow Graph
CF Control Flow Frame
CFG Control Flow Graph
COCOMA COMRADE Controller Micro-Architecture
CPP C Preprocessor
CPU Central Processing Unit
CT Cancel Token
DFG Data Flow Graph
ECFG Line Graph of the CFG
EDIF Electronic Data Interchange Format
EE Early Evaluation
FF Flip-Flop
FPGA Field Programmable Gate Array
GLACE Generic Library for Adaptive Computing Environments
HDL Hardware Description Language
HLS High-Level Synthesis
HTG Hierarchical Task Graph
HW Hardware
II Initiation Interval
IR Intermediate Representation
LDP Loop Data Path
LIS Latency-Insensitive System
LPU Local Paging Unit
LUT Look-Up Table
MA Memory Access
MARC Memory Architecture for Reconfigurable Computers
MDF Meta Data Fetcher
NI Node Input Configuration
183
Acronyms
NIC Node Input Control Configuration
NID Node Input Data Configuration
NO Node Output Configuration
NOC Node Output Control Configuration
NOD Node Output Data Configuration
pAT Pseudo Activate Token
PDF Post-Dominance Frontier
PPC PowerPC
RCU Reconfigurable Compute Unit
Seq Sequencer
SeqC Sequencer Configuration
SIS Sequential Interactive Synthesis
SSA Static Single Assignment
SUIF Stanford University Intermediate Format
SW Software
UART Universal Asynchronous Receiver Transmitter
ULK Ultimativer Logik-Konverter
184
A. Sequencer Conditions
This Chapter exactly defines how the COCOMA sequencer computes a node
input control configuration NICt2 and a new sequencer configuration SeqCt2
from a given node output control configuration NOCt1 and an existing se-
quencer configuration SeqCt1 . The node input ports addressed by a NIC are
listed in the left-hand column of Table A.1. For each input port the right-hand
column specifies where in this Chapter the associated sequencer condition is
located, i.e., a formula defining for each node which value is assigned to the
designated input port. Some of the sequencer conditions are dependent on the
scheduling type. In such cases, two conditions are given, one for the dynamic
CT model and one for static CTs.
To simplify the notation, we omit timing subscripts from now on; the sequencer
conditions represent combinatorial logic anyway, thus all signals are assigned
the same clock cycle.
Although NIC assigns each of the ports an integral value, in practice, most
ports are assigned either 0 or 1. Therefore, we use boolean algebra and assign
false or true; formally, however, a false assignment stands for 0, and a true
assignment for 1.
For multi-bit inputs such as the Start signal of a mux node, we break down the
assignment to single bits by addressing the associated incoming data edges.
For example, for a mux node y which has two data predecessors x1 and x2, we
assign a boolean value to Start(x1, y) and Start(x2, y), which is equivalent to
assigning a number between 0 and 3 to Start(y).
Instead of directly defining the new sequencer configuration SeqCt2 for a
given SeqCt1 , we specify formulae to determine if the ReadyConsumed or
CancelConsumed value of a given edge needs to be set (i.e., a change from
0 to 1) or reset (from 1 to 0). Table A.2 lists the used set and reset signals.
Note that the sequencer conditions use the node neighborhood notation intro-
duced in Section 6.2. Due to the condition complexity, we have moved several
subconditions to extra Tables.
185
A. Sequencer Conditions
NIC Definition
Start Default: Table A.12; for mux and loopMux inputs:
Table A.13
StartCtrl Table A.14
StartCtrlPseudo Table A.15
Sel Table A.17
StartSel Table A.16
ResultReadyAck Table A.26
Cancel Table A.31
CancelStateAck Default, dynCT: Table A.35; for mux and loopMux
inputs, dynCT: A.36; default, statCT: Table A.37;
for mux and loopMux inputs, statCT: A.38
CancelStateCtrlAck Table A.40
Table A.1.: Left: inputs ports addressed by a NIC; right: pointer to the Table
defining the sequencer condition.
SeqC Set/Reset Signal Definition
SetReadyConsumed Table A.44
ResetReadyConsumed Table A.45
SetCancelConsumed Table A.48
ResetCancelConsumed Table A.45
Table A.2.: Left: signals signifying a set or reset of a SeqC function; right:
pointer to the Table defining the sequencer condition.
186
AlwActCreateAT(y) :⇔
(1) y has an incoming alwAct-
annotated control edge, such
that
(2) an AT is created from
a violated condition, i.e.,
(a) there is a valid result
at the controller output, and
(b) the control condition is
not fulfilled, or
(3) an AT is created from
a CT, i.e.,
(a) xcon is canceled, and
(b) y has not already consumed
that CT.
∃(xcon, y) ∈ Econ : [
(ann(xcon, y) = alwAct) ∧ [
(ResultReady(xcon)∧
¬ReadyConsumed(xcon, y)∧
¬eval(xcon, y))∨
(CancelStateControl(xcon)∧
¬CancelConsumed(xcon, y))]
]
Table A.3.: Definition of condition AlwActCreateAT; used by condition Start
(Table A.12).
SchedType = dynCT: DataAT(xdat , y) :⇔
(1) xdat has a valid result,
(2) which has not yet been
consumed by y, and
(3) y is not canceled.
ResultReady(xdat)∧
¬ReadyConsumed(xdat, y)∧
(t(y) < {mux, loopMux} ⇒
¬CancelState(y))∧
(t(y)in{mux, loopMux} ⇒
¬CancelState(xdat, y))
Table A.4.: Definition of condition DataAT (dynCT); used by condition
DataDependencesFulfilled (Table A.6).
187
A. Sequencer Conditions
SchedType = statCT: DataAT(xdat , y) :⇔
(1) xdat has a valid result,
(2) which has not yet been
consumed by y.
ResultReady(xdat)∧
¬ReadyConsumed(xdat, y)
Table A.5.: Definition of condition DataAT (statCT); used by condition
DataDependencesFulfilled (Table A.6).
DataDependencesFulfilled(y) :⇔
(1) If y is a multiplexer,
at least one data predeces-
sor edge is active.
t(y) ∈ Mux ⇒
∃x ∈ Xdat : DataAT(x, y)
∧
(2) If y is not a multiplexer,
all data predecessor edges
are active.
t(y) < Mux ⇒
∀x ∈ Xdat : DataAT(x, y)
Table A.6.: Definition of condition DataDependencesFulfilled; used by condi-
tion Start (Table A.12). This condition is not defined for nodes of
type mux or loopMux.
DataDependencesFulfilled(xdat , y) :⇔
(1) The data predecessor edge
is active.
DataAT(xdat, y)
Table A.7.: Definition of condition DataDependencesFulfilled for mux; used
by condition Start (Table A.12). This condition is only defined for
nodes of type mux or loopMux.
188
ControlAT(xcon, y) :⇔
(1) always nodes steadily emit
ATs.
t(xcon) = always
∨
(2) For other nodes:
(a) xcon has a valid result,
and
(b) the control condition is
fulfilled, and
(c) the result of xcon has not yet
been consumed by y, or
(d) (xcon, y) has an atOnCancel
annotation, and
(e) xcon is canceled, and
(f) that CT has not been con-
sumed yet by y.
[ResultReady(xcon)∧
eval(xcon, y)∧
¬ReadyConsumed(xcon, y)]∨
[ann(xcon, y) = atOnCancel∧
CancelStateCtrl(xcon)∧
¬CancelConsumed(y)]
Table A.8.: Definition of condition ControlAT; used by condition ControlDe-
pendencesFulfilled (Table A.9).
ControlDependencesFulfilled(y) :⇔
(1) If y is an all node,
all non-nAT control prede-
cessor are active.
t(y) = all ⇒
∀xcon ∈ Xcon \ Xcon,nAT :
ControlAT(xcon, y)
∧
(2) Otherwise,
at least one (if existing)
non-nAT control predecessor
edge is active.
t(y) , all ⇒
[Xcon \ Xcon,nAT , ∅ ⇒
∃xcon ∈ Xcon \ Xcon,nAT :
ControlAT(xcon, y)]
Table A.9.: Definition of condition ControlDependencesFulfilled; used by con-
dition Start (Table A.12).
189
A. Sequencer Conditions
MemoryAT(xmem, y) :⇔
(1) xmem has a valid result, and
(2) the result of xmem has not
yet been consumed by y.
ResultReady(xmem)∧
¬ReadyConsumed(xmem, y)
Table A.10.: Definition of condition MemoryAT; used by condition Memory-
DependencesFulfilled (Table A.11).
MemoryDependencesFulfilled(y) :⇔
(1) If y is an all node,
all memory predecessor
edges are active.
t(y) = all ⇒
∀xmem ∈ Xmem :
MemoryAT(xmem, y)
∧
(2) Otherwise,
at least one (if existing)
memory predecessor edge is
active.
t(y) , all ⇒
[Xmem , ∅ ⇒
∃xmem ∈ Xmem :
MemoryAT(xmem, y)]
Table A.11.: Definition of condition MemoryDependencesFulfilled; used by
condition Start (Table A.12).
190
Start(y) :⇔
(1) Initial nodes are never
activated through COCOMA-
logic;
t(y) , initial
∧
(2) data dependences are ful-
filled, and
DataDependencesFulfilled(y)
∧
(3) memory dependences are
fulfilled, or (in the statCT case)
y is canceled.
MemoryDependencesFulfilled(y)∨
(SchedType = statCT∧
CancelState(y))
Table A.12.: Definition of condition Start. (1) Initial nodes are never acti-
vated by COCOMA data/control/memory predecessors, but they
are (re-)activated when the associated software-to-hardware tran-
sition takes place (this is not integrated in the sequencer formal-
ism). This condition is not defined for multiplexer nodes.
Start(xdat , y) :⇔
(1) Data dependences are ful-
filled.
DataDependencesFulfilled(xdat, y)
Table A.13.: Definition of condition Start for incoming data edges of multi-
plexer nodes.
191
A. Sequencer Conditions
StartCtrl(y) :⇔
(1) Control dependences are
fulfilled, or
(2) y is activated by an in-
coming alwAct-annotated
edge.
ControlDependencesFulfilled(y)∨
AlwActCreateAT(y)
Table A.14.: Definition of condition StartCtrl. This condition is not defined for
nodes of type mux.
StartCtrlPseudo(y) :⇔
(1) y is activated by an incom-
ing alwAct edge.
AlwActCreateAT(y)
∨
(2)
(a) y has an incoming CT-
capable control edge,
(b) whose associated controller
is canceled.
∃xcon ∈ Xcon,1 ∪ Xcon,nAT
∪Xcon,atOnCancel :
CancelStateCtrl(xcon)
Table A.15.: Definition of condition StartCtrlPseudo. This condition is not de-
fined for nodes of type mux.
StartSel(y) :⇔
There is a control predecessor
xcon of y, such that
∃xcon ∈ Xcon :
(1) xcon has a valid result, or
(2) xcon is canceled, but
(3) the token (AT or CT)
from xcon has not yet been
consumed.
(ResultReady(xcon)∨
CancelS tateCtrl(xcon))∧
(¬ReadyConsumed(xcon, y)∧
¬CancelConsumed(xcon, y))
Table A.16.: Definition of condition StartSel. This condition is only defined
for nodes of type mux.
192
Sel(xdat , y) :⇔
(1) xdat has a valid result, and ResultReady(xdat)
∧
(2) there is a control predeces-
sor xcon of y, such that
the evaluation bit associated to
xdat is true.
∃xcon ∈ Xcon :
eval(xcon, y)|i, i = muxPredy(xdat)
Table A.17.: Definition of condition Sel. This condition is only defined
for nodes of type mux. (2) Recall that eval(xcon, y) is multi-
dimensional if y is of type mux. The operator |i returns the ith
dimension.
SchedType = dynCT: DataSuccessorAccepted(y, z) :⇔
(1) The ResultReady of y has
already been consumed by z, or
(2) z has been started already,
or
(3) the edge has not yet
propagated a CT from z, and
(4) z is canceled and
(5) its cancel state is being
acknowledged.
ReadyConsumed(y, z)∨
((t(z) < {mux, loopMux} ⇒
StartAck(z))∧
(t(z) ∈ {mux, loopMux} ⇒
StartAck(y, z)))∨
(¬CancelConsumed(y, z)∧
(t(z) < {mux, loopMux} ⇒
(CancelState(z)∧
CancelStateAck(z)))∧
(t(z) ∈ {mux, loopMux} ⇒
(CancelState(y, z)∧
CancelStateAck(y, z))))
Table A.18.: Definition of conditionDataSuccessorAccepted (dynCT); used by
condition DataSuccessorsAccepted (Table A.20).
193
A. Sequencer Conditions
SchedType = statCT: DataSuccessorAccepted(y, z) :⇔
(1) The ResultReady of y has
already been consumed by z, or
(2) z has been started already.
ReadyConsumed(y, z)∨
((t(z) < {mux, loopMux} ⇒
StartAck(z))∧
(t(z) ∈ {mux, loopMux} ⇒
StartAck(y, z)))
Table A.19.: Definition of conditionDataSuccessorAccepted (statCT); used by
condition DataSuccessorsAccepted (Table A.20).
DataSuccessorsAccepted(y) :⇔
All data successors accept the
ResultReady of y.
∀zdat ∈ Zdat :
DataSuccessorAccepted(y, zdat)
Table A.20.: Definition of condition DataSuccessorsAccepted; used by condi-
tion ResultReadyAck (Table A.26).
MemorySuccessorAccepted(y, zmem) :⇔
(1) The ResultReady of y has
been consumed by zmem, or
(2) zmem is started currently.
ReadyConsumed(y, zmem)∨
StartAck(zmem)
Table A.21.: Definition of condition MemorySuccessorAccepted; used by con-
dition MemorySuccessorsAccepted (Table A.22).
MemorySuccessorsAccepted(y) :⇔
All memory successors accept
the ResultReady of y.
∀zmem ∈ Zmem :
MemorySuccessorAccepted(y, zmem)
Table A.22.: Definition of condition MemorySuccessorsAccepted; used by
condition ResultReadyAck (Table A.26).
194
SchedType = dynCT: ControlSuccessorAccepted(y, zcon) :⇔
(1) If the control edge has a 1
or atOnCancel annotation,
(a) the ResultReady of y has
already been consumed by zcon,
or
(b) zcon accepts an AT, or
(c) zcon is canceled currently
and
(d) accepts the cancel.
ann(y, zcon) ∈ {1, atOnCancel} ⇒ [
ReadyConsumed(y, zcon)∨
StartCtrlAck(zcon)∨
(Cancel(zcon)∧
CancelAck(zcon))]
∧
(2) If the control edge has an
nCT annotation,
(a) the ResultReady of y has
already been consumed by zcon,
or
(b) zcon is started now, or
(c) y has a valid result and
(d) the control condition for
(y, zcon) is violated.
ann(y, zcon) = nCT ⇒ [
ReadyConsumed(y, zcon)∨
StartControlAck(zcon)∨
(ResultReady(y)∧
¬eval(y, zcon))]
∧
(3) If the control edge has an
nAT annotation,
(a) zcon is canceled now and
(b) accepts the cancel, or
(c) y has a valid result and
(d) the control condition is
fulfilled.
ann(y, zcon) = nAT ⇒ [
(Cancel(zcon)∧
CancelAck(zcon))∨
(ResultReady(y)∧
eval(y, zcon))]
∧
(4) If the control edge has an
alwAct annotation,
(a) the ResultReady of y has
already been consumed by zcon,
or
(b) zcon is started now, or
(c) zcon is canceled, and
(d) that CT is now extin-
guished.
ann(y, zcon) = alwAct ⇒ [
ReadyConsumed(y, zcon)∨
StartControlAck(zcon)∨
(CancelState(zcon)∧
CancelStateAck(zcon))]
The Table is continued on the next page.
195
A. Sequencer Conditions
∧
(5) If the control edge has a
muxControl annotation,
(a) zcon accepts the token, or
(b) has already consumed the
token.
ann(y, zcon) = muxControl ⇒ [
StartSelAck(zcon)∨
ReadyConsumed(y, zcon)]
Table A.23.: Definition of condition ControlSuccessorAccepted (dynCT); used
by condition ControlSuccessorsAccepted (Table A.25).
SchedType = statCT: ControlSuccessorAccepted(y, zcon) :⇔
(1) If the control edge has a 1
or atOnCancel annotation,
(a) the ResultReady of y has
already been consumed by zcon,
or
(b) zcon is started now, or
(c) zcon is canceled currently
and
(d) accepts the cancel.
ann(y, zcon) ∈ {1, atOnCancel} ⇒ [
ReadyConsumed(y, zcon)∨
StartControlAck(zcon)∨
(Cancel(zcon)∧
CancelAck(zcon))]
∧
(2) If the control edge has an
nCT annotation,
(a) the ResultReady of y has
already been consumed by zcon,
or
(b) zcon is started now, or
(c) y has a valid result and
(d) the control condition for
(y, zcon) is violated.
ann(y, zcon) = nCT ⇒ [
ReadyConsumed(y, zcon)∨
StartControlAck(zcon)∨
(ResultReady(y)∧
¬eval(y, zcon))]
∧
The Table is continued on the next page.
196
(3) If the control edge has an
nAT annotation,
(a) zcon is canceled now and
(b) accepts the cancel, or
(c) y has a valid result and
(d) the control condition for
(y, zcon) is fulfilled.
ann(y, zcon) = nAT ⇒ [
(Cancel(zcon)∧
CancelAck(zcon))∨
(ResultReady(y)∧
eval(y, zcon))]
∧
(4) If the control edge has an
alwAct annotation,
(a) the ResultReady of y has
already been consumed by zcon,
or
(b) zcon is started now.
ann(y, zcon) = alwAct ⇒ [
ReadyConsumed(y, zcon)∨
StartControlAck(zcon)]
∧
(5) If the control edge has a
muxControl annotation,
(a) zcon accepts the token, or
(b) has already consumed the
token.
ann(y, zcon) = muxControl ⇒ [
StartSelAck(zcon)∨
ReadyConsumed(y, zcon)]
Table A.24.: Definition of condition ControlSuccessorAccepted (statCT); used
by condition ControlSuccessorsAccepted (Table A.25).
ControlSuccessorsAccepted(y) :⇔
All control successors ac-
knowledge the ResultReady or
Cancel.
∀zcon ∈ Zcon :
ControlSuccessorAccepted(y, zcon)
Table A.25.: Definition of condition ControlSuccessorsAccepted; used by con-
dition ResultReadyAck (Table A.26).
197
A. Sequencer Conditions
ResultReadyAck(y) :⇔
(1) If y does not have outgoing
control edges,
(a) data successors and
(b) memory successors
acknowledge the ResultReady.
Zcon = ∅ ⇒
DataSuccessorsAccepted(y)∧
MemorySuccessorsAccepted(y)
∧
(2) If y has an outgoing control
edge,
its control successors acknowl-
edge the ResultReady.
Zcon , ∅ ⇒
ControlSuccessorsAccepted(y)
Table A.26.: Definition of condition ResultReadyAck.
SchedType = dynCT: DataCT(y, zdat) :⇔
(1) y does not have a valid
result, and
(2) zdat has a CT,
(3) which has not yet been
consumed by y.
¬ResultReady(y)∧
(t(zdat) < {mux, loopMux} ⇒
CancelState(zdat))∧
(t(zdat) ∈ {mux, loopMux} ⇒
CancelState(y, zdat))∧
¬CancelConsumed(y, zdat)
Table A.27.: Definition of condition DataCT (dynCT), used by condition Can-
cel (Table A.31).
198
CancelThroughIncomingControlEdge(y) :⇔
(1) y has an incoming
atOnCancel control edge,
and
(a) xcon has a valid result,
(b) which has not yet been
consumed, and
(c) its control condition to y is
violated.
x ∈ Xcon,atOnCancel∧
ResultReady(xcon)∧
¬ReadyConsumed(xcon, y)∧
¬eval(xcon, y)
∨
(2) y has another incoming
CT-capable control edge, and
(a) xcon has a valid result,
(b) which has not yet been
consumed, and
(c) its control condition to y is
violated, or
(d) xcon is canceled, and
(e) the cancel has not yet
been consumed by y.
Xcon \ (Xcon,nCT ∪ Xcon,alwAct ∪
Xcon,atOnCancel) , ∅∧
[ResultReady(xcon)∧
¬ReadyConsumed(xcon, y)∧
¬eval(xcon, y)]∨
[CancelStateCtrl(xcon)∧
¬CancelConsumed(xcon, y)]
Table A.28.: Definition of condition CancelThroughIncomingControlEdge;
used by condition Cancel (Table A.31). This condition is not de-
fined for nodes of type mux.
199
A. Sequencer Conditions
SchedType = dynCT: CancelThroughOutgoingDataEdges(y) :⇔
(1) y has a data successor,
and
(2) y has no incoming, CT-
capable control edge, or
(3) y has exactly 1 incoming
control edge which comes from
an initial or always node, and
(4) each data successor has a
CT, and
(5) y is not a multiplexer.
Zdat , ∅∧
[Xcon,1 ∪ Xcon,nAT ∪ Xcon,atOnCancel = ∅∨
(|Xcon| = 1 ∧ xcon ∈ Xcon :
t(xcon) ∈ always, initial)]∧
∀zdat ∈ Zdat : DataCT(zdat)∧
t(y) < Mux
Table A.29.: Definition of condition CancelThroughOutgoingDataEdges
(dynCT); used by condition Cancel (Table A.31).
SchedType = statCT: CancelThroughOutgoingDataEdges(y) :⇔
This condition is always false,
because in static CT mode,
CTs are not propagated along
data edges.
false
Table A.30.: Definition of condition CancelThroughOutgoingDataEdges
(statCT); used by condition Cancel (Table A.31).
200
Cancel(y) :⇔
(1) If y is not a mux and not an
irqDataMux,
(a) y is canceled through an
incoming control edge,
(b) where all ATs of y are
consumed (except for loop-
Muxes), or
(c) through outgoing data
edges.
t(y) < {mux, irqDataMux} ⇒
[CancelThroughIncoming-
ControlEdge(y)∧
(¬ResultReady(y)∨
t(y) = loopMux)]∨
CancelThroughOutgoing-
DataEdges(y)
∧
(2) Otherwise,
y is never canceled.
t(y) ∈ {mux, irqDataMux} ⇒
false
Table A.31.: Definition of condition Cancel.
SchedType = dynCT:
DataPredCancelStateResetLocalStandard(xdat , y) :⇔
(1) xdat is canceled now and
(2) accepts the cancel, or
(3) y has already consumed
xdat’s cancel, or
(4) xdat has a valid result,
and
(5) y has a cancel, and
(6) neither CT
(7) nor AT have already been
consumed by the other party.
(Cancel(xdat)∧
CancelAck(xdat))∨
CancelConsumed(xdat, y)∨
(ResultReady(xdat)∧
(t(y) < {mux, loopMux} ⇒
CancelS tate(y))∧
(t(y) ∈ {mux, loopMux} ⇒
CancelState(xdat, y))∧
¬CancelConsumed(xdat, y)∧
¬ReadyConsumed(xdat, y))
Table A.32.: Definition of conditionDataPredCancelStateResetLocalStandard
(dynCT), used by condition CancelStateAck (Table A.35).
201
A. Sequencer Conditions
SchedType = dynCT:
DataPredCancelStateResetLocalMux(xdat , y) :⇔
(1) xdat has a valid result, and
(2) y has a cancel, and
(3) neither CT
(4) nor AT have already been
consumed by the other party.
(ResultReady(xdat)∧
(t(y) < {mux, loopMux} ⇒
CancelS tate(y))∧
(t(y) ∈ {mux, loopMux} ⇒
CancelState(xdat, y))∧
¬CancelConsumed(xdat, y)∧
¬ReadyConsumed(xdat, y))
Table A.33.: Definition of condition DataPredCancelStateResetLocalMux
(dynCT), used by condition CancelStateAck (Table A.36).
SchedType = dynCT: DataPredCancelStateResetLocal(xdat , y) :⇔
(1) If xdat is a multiplexer,
(2) DataPredCancelState-
ResetLocalMux is fulfilled.
t(xdat) ∈ Mux ⇒
DataPredCancelStateReset-
LocalMux(xdat, y)
∧
(1) Otherwise,
(2) DataPredCancelState-
ResetLocalStandard is ful-
filled.
t(xdat) < Mux ⇒
DataPredCancelStateResetLocal-
Standard(xdat, y)
Table A.34.: Definition of condition DataPredCancelStateResetLocal
(dynCT), used by condition CancelStateAck (Table A.35).
SchedType = dynCT: CancelStateAck(y) :⇔
y’s CancelState is reset, if
(1) all data predecessors accept
the cancel.
(∀xdat ∈ Xdat :
DataPredCancelStateReset-
Local(xdat, y))
Table A.35.: Definition of condition CancelStateAck (dynCT). Not defined for
nodes of type mux or loopMux.
202
SchedType = dynCT: CancelStateAck(xdat , y) :⇔
y’s CancelState is reset, if the
connected predecessor accepts
the CT.
DataPredCancelStateReset-
Local(xdat, y))
Table A.36.: Definition of condition CancelStateAck (dynCT) for nodes of type
mux or loopMux.
SchedType = statCT: CancelStateAck(y) :⇔
y’s CancelState is reset, if
(1) data inputs are consumed
now.
StartAck(y)
Table A.37.: Definition of condition CancelStateAck (statCT). Not defined for
nodes of type mux or loopMux.
SchedType = statCT: CancelStateAck(xdat , y) :⇔
y’s CancelState is never true,
thus the acknowledgment is
always false.
false
Table A.38.: Definition of condition CancelStateAck (statCT) for nodes of type
mux or loopMux.
203
A. Sequencer Conditions
ControlSuccCancelStateResetLocal(y, zcon) :⇔
(1) If (y, zcon) is annotated with
1 or nAT , and zcon is not a mux,
(a) zcon is canceled now
and
(b) accepts the cancel, or
(c) zcon has already con-
sumed y’s cancel.
ann(y, zcon) ∈ {1, nAT}∧
t(zcon) , mux ⇒
(Cancel(zcon)∧
CancelAck(zcon))∨
CancelConsumed(y, zcon)
∧
(2) If (y, zcon) is annotated with
atOnCancel, and zcon is not a
mux,
(a) zcon is canceled now
and
(b) accepts the cancel, or
(c) y is canceled, and
(d) zcon receives an AT, and
(e) accepts it, or
(f) zcon has already con-
sumed y’s cancel.
ann(y, zcon) = atOnCancel∧
t(zcon) , mux ⇒
(Cancel(zcon)∧
CancelAck(zcon))∨
(CancelStateCtrl(y)∧
StartCtrl(zcon)∧
StartCtrlAck(zcon))∨
CancelConsumed(y, zcon)
∧
(3) If (y, zcon) is annotated with
nCT ,
(a) y’s cancel just disappears.
ann(y, zcon) = nCT ⇒
true
∧
The Table is continued on the next page.
204
(4) If (y, zcon) is annotated with
alwAct,
(a) zcon receives a control
AT, or
(b) has already received it, or
(c) zcon has a CT
(d) which is erased now.
ann(y, zcon) = alwAct ⇒
[StartCtrlAck(zcon)∨
ReadyConsumed(y, zcon)]∨
[CancelState(zcon)∧
CancelStateAck(zcon)]
∧
(5) If zcon is a mux,
(a) zcon accepts the select
signal, or
(b) zcon has already con-
sumed the select.
t(zcon) = mux ⇒
StartSelAck(zcon)∨
CancelConsumed(y, zcon)
Table A.39.: Definition of condition ControlSuccCancelStateResetLocal; used
by condition CancelStateCtrlAck (Table A.40).
CancelStateCtrlAck(y) :⇔
y’s CancelStateCtrl is acknowl-
edged, if
(1) all control successors
accept the cancel.
(∀zcon ∈ Zcon :
ControlSuccCancelStateReset-
Local(y, zcon))
Table A.40.: Definition of condition CancelStateCtrlAck.
205
A. Sequencer Conditions
SetReadyConsumedStandard(y, z) :⇔
(1) If (y, z) is a data edge,
(a) y has an AT,
(b) which has not already been
acknowledged, and
(c) either z consumes the AT,
or
(d) the AT is extinguished
along the edge by a CT from z
(only in dynCT mode).
(y, z) ∈ Edat ⇒
ResultReady(y)∧
¬ResultReadyAck(y)∧
(StartAck(z)∨
(SchedType = dynCT∧
CancelState(z)∧
CancelStateAck(z)))
∧
(2) If (y, z) is a control edge,
(a) y has an AT,
(b) which has not already been
acknowledged, and
(c) either z consumes the AT
via StartCtrl, or
(d) the AT is turned into a CT,
consumed by z.
(y, z) ∈ Econ ⇒
ResultReady(y)∧
¬ResultReadyAck(y)∧
[StartCtrlAck(z)∨
(Cancel(z)∧
CancelAck(z))]
∧
(3) If (y, z) is a memory edge,
(a) y has an AT,
(b) which has not already been
acknowledged, and
(c) z consumes the AT.
(y, z) ∈ Emem ⇒
ResultReady(y)∧
¬ResultReadyAck(y)∧
StartAck(z)
Table A.41.: Definition of condition SetReadyConsumedStandard. Not defined
for edges leading to nodes of type mux or loopMux.
206
SetReadyConsumedMux(y, z) :⇔
(1) If (y, z) is a data edge,
(a) y has an AT,
(b) which has not already been
acknowledged, and
(c) either z consumes the AT,
or
(d) the AT is extinguished on
the edge by a CT from z (only
in dynCT mode).
(y, z) ∈ Edat ⇒
ResultReady(y)∧
¬ResultReadyAck(y)∧
(StartAck(y, z)∨
(SchedType = dynCT∧
CancelState(y, z)∧
CancelStateAck(y, z)))
∧
(2) If (y, z) is a control edge,
(a) y has an AT,
(b) which has not already been
acknowledged, and
(c) z consumes the select
signal.
(y, z) ∈ Econ ⇒
ResultReady(y)∧
¬ResultReadyAck(y)∧
StartSelAck(z)
Table A.42.: Definition of condition SetReadyConsumedMux. Only defined for
edges leading to nodes of type mux.
207
A. Sequencer Conditions
SetReadyConsumedLoopMux(y, z) :⇔
(1) If (y, z) is a data edge,
(a) y has an AT,
(b) which has not already been
acknowledged, and
(c) either z consumes the AT,
or
(d) the AT is extinguished on
the edge by a CT from z (only
in dynCT mode).
(y, z) ∈ Edat ⇒
ResultReady(y)∧
¬ResultReadyAck(y)∧
(StartAck(y, z)∨
(SchedType = dynCT∧
CancelState(z)∧
CancelStateAck(z)))
∧
(2) If (y, z) is a control edge,
(a) y has an AT,
(b) which has not already been
acknowledged, and
(c) either z consumes the AT
via StartCtrl, or
(d) the AT is turned into a CT,
consumed by z.
(y, z) ∈ Econ ⇒
ResultReady(y)∧
¬ResultReadyAck(y)∧
[StartCtrlAck(z)∨
(Cancel(z)∧
CancelAck(z))]
Table A.43.: Definition of condition SetReadyConsumedLoopMux. Only de-
fined for edges leading to nodes of type loopMux.
SetReadyConsumed(y, z) :⇔
(1) If z is neither a mux nor a
loopMux,
SetReadyConsumedStandard is
fulfilled.
t(z) < {mux, loopMux} ⇒
SetReadyConsumedStandard(y, z)
∧
(2) If z is a mux,
SetReadyConsumedMux is
fulfilled.
t(z) = mux ⇒
SetReadyConsumedMux(y, z)
∧
(3) If z is a loopMux,
SetReadyConsumedLoopMux
is fulfilled.
t(z) = loopMux ⇒
SetReadyConsumedLoopMux(y, z)
Table A.44.: Definition of condition SetReadyConsumed.
208
ResetReadyConsumed(y, z) :⇔
(1) y is notified a
ResultReadyAck.
ResultReadyAck(y)
Table A.45.: Definition of condition ResetReadyConsumed.
209
A. Sequencer Conditions
SetCancelConsumedStandard(y, z) :⇔
(1) If (y, z) is a data edge,
(a) z emits a CT,
(b) which has not already been
acknowledged, and
(c) either y consumes the CT
(and y is not a multiplexer), or
(d) the CT is extinguished
along the edge by an AT from
y.
(y, z) ∈ Edat ⇒
CancelState(z)∧
¬CancelStateAck(z)∧
[(Cancel(y)∧
CancelAck(y)∧
t(y) < Mux)∨
(ResultReady(y)∧
¬ResultReadyAck(y))]
∧
(2) If (y, z) is a control edge
with alwAct or atOnCancel
annotation,
(a) y has a CT,
(b) which has not yet been
acknowledged, and
(c) z consumes the AT which
has been generated at the edge
from y’s CT.
(y, z) ∈ Econ,alwAct ⇒
CancelStateCtrl(y)∧
¬CancelStateCtrlAck(y)∧
StartCtrl(z)∧
StartCtrlAck(z)
∧
(3) If (y, z) is a control edge
having neither an atOnCancel,
nor an alwAct annotation,
(a) y has a CT,
(b) which has not yet been
acknowledged, and
(c) z consumes the CT.
(y, z) ∈ Econ \ (Econ,alwAct ∪
Econ,atOnCancel) ⇒
CancelStateCtrl(y)∧
¬CancelStateCtrlAck(y)∧
Cancel(z)∧
CancelAck(z)
Table A.46.: Definition of condition SetCancelConsumedStandard. Not de-
fined for edges leading to nodes of type mux or loopMux.
210
SetCancelConsumedLoopMux(y, z) :⇔
(1) If (y, z) is a data edge,
(a) z emits a CT,
(b) which has not already been
acknowledged, and
(c) either y consumes the CT
(and y is not a multiplexer), or
(d) the CT is extinguished
at the edge by an AT from y.
(y, z) ∈ Edat ⇒
CancelState(y, z)∧
¬CancelStateAck(y, z)∧
[(Cancel(y)∧
CancelAck(y)∧
t(y) < Mux)∨
(ResultReady(y)∧
¬ResultReadyAck(y))]
∧
(2) If (y, z) is a control edge,
(a) y has a CT,
(b) which has not yet been
acknowledged, and
(c) z consumes the CT.
(y, z) ∈ Econ ⇒
CancelStateCtrl(y)∧
¬CancelStateCtrlAck(y)∧
Cancel(z)∧
CancelAck(z)
Table A.47.: Definition of condition SetCancelConsumedLoopMux. Only de-
fined for edges leading to nodes of type loopMux.
SetCancelConsumed(y, z) :⇔
(1) If z is neither a mux nor a
loopMux,
SetCancelConsumedStandard
is fulfilled.
t(z) < {mux, loopMux} ⇒
SetCancelConsumedStandard(y, z)
∧
(2) If z is a mux,
SetCancelConsumedMux is
fulfilled.
t(z) = mux ⇒
SetCancelConsumedMux(y, z)
∧
(3) If z is a loopMux,
SetCancelConsumedLoopMux
is fulfilled.
t(z) = loopMux ⇒
SetCancelConsumedLoopMux(y, z)
Table A.48.: Definition of condition SetCancelConsumed.
211
A. Sequencer Conditions
ResetCancelConsumed(y, z) :⇔
(1) If (y, z) is a data edge,
z’s CancelState is reset.
(y, z) ∈ Edat ⇒
[(t(z) < {mux, loopMux} ⇒
CancelStateAck(z))∧
(t(z) ∈ {mux, loopMux} ⇒
CancelStateAck(y, z))]
∧
(2) If (y, z) is a control edge,
y’s CancelStateCtrl is ac-
knowledged.
(y, z) ∈ Econ ⇒
CancelStateCtrlAck(y)
Table A.49.: Definition of condition ResetCancelConsumed.
212
B. CMDFG Node Types for C
Operations and Statements
Table B.1 shows a complete list of assignments from C operations and state-
ments to CMDFG node types created during the CMDFG generation pass (cf.
Chapter 8).
The node created from a phi statement is an any node if the phi is used to merge
memory flow using the virtual memory variable MEM_ACCESS. Otherwise, a
loopMux is created if the phi statement is located in a loop header CFG node. If
the originating CFG node is not a loop header, a mux is created instead.
Apart from the nodes listed in Table B.1, COMRADE creates an always node as
token source for const nodes.
If there are multiple HW/SW transitions, an irqDataMux is created, which for-
wards the IRQ code of the particular transition to the irqreg.
Further nodes (and, any, nop) are inserted during the control edge creation
phase.
213
B. CMDFG Node Types for C Operations and Statements
C statement or expression CMDFG node type
a + b add
a - b sub
a ∗ b mul
a / b div
a % b mod
a << b shiftLeft
a >> b shiftRight
c = (unsigned char)a bitSel (applies to other integer casts
accordingly)
a < b lessThan
a <= b lessThanOrEqual
a > b greaterThan
a >= b greaterThanOrEqual
a == b equal
a ! = b notEqual
a & b bitwiseAnd
a && b logicalAnd
a | b bitwiseOr
a || b logicalOr
aˆb bitwiseXor
a = !b negation
a = ∼b complement
a = *p load
*p = a store (if p , 0)
*0 = 0 mf (memory forwarder)
0 const (analogous for other con-
stants)
c = phi(a, b) mux, loopMux, or any
LoadUp inreg
PushDown outreg
SignalIrq irqreg
InitialMem initial
Table B.1.: Left: originating C operation or statement; right: node type of the
created CMDFG node.
214
C. Regression Tests
C.1. Source Codes
1 i n t main ( ) {
2 i n t i , s ;
3
4 s = 0 ;
5
6 f o r ( i = 0 ; i < 10 ; ++ i ) {
7 s += i ;
8 }
9
10 p r i n t f ( " s = %d \ n " , s ) ;
11 re turn 0 ;
12 }
Listing C.1: hg_regression_01
1 i n t main ( ) {
2 i n t i , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 a [ i ] += 1 ;
9 }
10
11 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
12 re turn 0 ;
13 }
Listing C.2: hg_regression_02
215
C. Regression Tests
1 i n t main ( ) {
2 i n t i , s ;
3
4 s = 0 ;
5
6 f o r ( i = 0 ; i < 10 ; ++ i ) {
7 i f ( i == 5) {
8 s += i ;
9 }
10 }
11
12 p r i n t f ( " s = %d \ n " , s ) ;
13 re turn 0 ;
14 }
Listing C.3: hg_regression_03
1 i n t main ( ) {
2 i n t i , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 a [ i ] += 1 ;
9 i f ( i == 5) {
10 i += 1 ;
11 }
12 }
13
14 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
15 re turn 0 ;
16 }
Listing C.4: hg_regression_04
1 i n t main ( ) {
2 i n t i , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
216
C.1. Source Codes
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 i f ( i == 5) {
9 a [ i ] += 1 ;
10 }
11 }
12
13 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
14 re turn 0 ;
15 }
Listing C.5: hg_regression_05
1 i n t main ( ) {
2 i n t i , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 i f ( i == 5) {
9 i += 1 ;
10 }
11 a [ i ] += 1 ;
12 }
13
14 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
15 re turn 0 ;
16 }
Listing C.6: hg_regression_06
1 i n t main ( ) {
2 i n t i , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 a [ i ] += 1 ;
9 i f ( i == 5) {
217
C. Regression Tests
10 a [ i ] += 2 ;
11 } e l s e {
12 a [ i ] += 1 ;
13 }
14 a [ i ] += 1 ;
15 }
16
17 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
18 re turn 0 ;
19 }
Listing C.7: hg_regression_07
1 i n t main ( ) {
2 i n t i , s ;
3
4 s = 0 ;
5
6 f o r ( i = 0 ; i < 10 ; ++ i ) {
7 i f ( i > 5) {
8 i f ( i == 7) {
9 s += 2 ;
10 } e l s e {
11 s += 1 ;
12 }
13 } e l s e {
14 s += i ;
15 }
16 }
17
18 p r i n t f ( " s = %d \ n " , s ) ;
19 re turn 0 ;
20 }
Listing C.8: hg_regression_08
1 i n t main ( ) {
2 i n t i , a [ 1 1 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
218
C.1. Source Codes
6 a [ 1 0 ] = 0 ;
7
8 f o r ( i = 0 ; i < 10 ; ++ i ) {
9 a [ i ] += 1 ;
10 i f ( i > 5) {
11 a [ i ] += 2 ;
12 i f ( i == 7) {
13 a [ i ] += 2 ;
14 } e l s e {
15 a [ i ] += 1 ;
16 }
17 a [ i ] += 1 ;
18 } e l s e {
19 i += 1 ;
20 }
21 a [ i ] += 1 ;
22 }
23
24 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
25 re turn 0 ;
26 }
Listing C.9: hg_regression_09
1 i n t main ( ) {
2 i n t i , j , s ;
3
4 s = 0 ;
5
6 f o r ( i = 0 ; i < 10 ; ++ i ) {
7 f o r ( j = 0 ; j < 3 ; ++ j ) {
8 s += j ;
9 }
10 }
11
12 p r i n t f ( " s = %d \ n " , s ) ;
13 re turn 0 ;
14 }
Listing C.10: hg_regression_10
1 i n t main ( ) {
219
C. Regression Tests
2 i n t i , j , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 a [ i ] += 1 ;
9 f o r ( j = 0 ; j < 3 ; ++ j ) {
10 a [ i ] += j ;
11 }
12 a [ i ] += 1 ;
13 }
14
15 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
16 re turn 0 ;
17 }
Listing C.11: hg_regression_11
1 i n t main ( ) {
2 i n t i , j , s ;
3
4 s = 0 ;
5
6 f o r ( i = 0 ; i < 10 ; ++ i ) {
7 i f ( i == 5) {
8 f o r ( j = 0 ; j < 3 ; ++ j ) {
9 s += j ;
10 }
11 }
12 }
13
14 p r i n t f ( " s = %d \ n " , s ) ;
15 re turn 0 ;
16 }
Listing C.12: hg_regression_12
1 i n t main ( ) {
2 i n t i , j , a [ 1 0 ] ;
3
220
C.1. Source Codes
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 a [ i ] += 1 ;
9 i f ( i > 5) {
10 a [ i ] += 2 ;
11 f o r ( j = 0 ; j < 3 ; ++ j ) {
12 a [ i ] += j ;
13 }
14 a [ i ] += 3 ;
15 }
16 a [ i ] += 4 ;
17 }
18
19 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
20 re turn 0 ;
21 }
Listing C.13: hg_regression_13
1 i n t main ( ) {
2 i n t i , j , k , s , t ;
3
4 s = 0 ;
5 t = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 f o r ( j = 0 ; j < 3 ; ++ j ) {
9 s += j ;
10 }
11 f o r ( k = 0 ; k < 3 ; ++k ) {
12 t += 2 ∗ k ;
13 }
14 }
15
16 p r i n t f ( " s = %d \ n " , s ) ;
17 p r i n t f ( " t = %d \ n " , t ) ;
18 re turn 0 ;
221
C. Regression Tests
19 }
Listing C.14: hg_regression_14
1 i n t main ( ) {
2 i n t i , j , k , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 a [ i ] += 1 ;
9 f o r ( j = 0 ; j < 3 ; ++ j ) {
10 a [ i ] += j ;
11 }
12 a [ i ] += 2 ;
13 f o r ( k = 0 ; k < 3 ; ++k ) {
14 a [ i ] += k ;
15 }
16 a [ i ] += 3 ;
17 }
18
19 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
20 re turn 0 ;
21 }
Listing C.15: hg_regression_15
1 i n t main ( ) {
2 i n t i , j , k , s , t ;
3
4 s = 0 ;
5 t = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 i f ( i > 5) {
9 f o r ( j = 0 ; j < 3 ; ++ j ) {
10 s += j ;
11 }
12 }
13 f o r ( k = 0 ; k < 3 ; ++k ) {
222
C.1. Source Codes
14 t += 2 ∗ k ;
15 }
16 }
17
18 p r i n t f ( " s = %d \ n " , s ) ;
19 p r i n t f ( " t = %d \ n " , t ) ;
20 re turn 0 ;
21 }
Listing C.16: hg_regression_16
1 i n t main ( ) {
2 i n t i , j , k , a [ 1 0 ] ;
3
4 a [ 0 ] = 0 ; a [ 1 ] = 0 ; a [ 2 ] = 0 ; a [ 3 ] = 0 ; a [ 4 ] = 0 ;
5 a [ 5 ] = 0 ; a [ 6 ] = 0 ; a [ 7 ] = 0 ; a [ 8 ] = 0 ; a [ 9 ] = 0 ;
6
7 f o r ( i = 0 ; i < 10 ; ++ i ) {
8 a [ i ] += 1 ;
9 i f ( i > 5) {
10 a [ i ] += 2 ;
11 f o r ( j = 0 ; j < 3 ; ++ j ) {
12 a [ i ] += j ;
13 }
14 a [ i ] += 3 ;
15 }
16 a [ i ] += 4 ;
17 f o r ( k = 0 ; k < 3 ; ++k ) {
18 a [ i ] += k ;
19 }
20 a [ i ] += 5 ;
21 }
22
23 p r i n t f ( " a [ 9 ] = %d \ n " , a [ 9 ] ) ;
24 re turn 0 ;
25 }
Listing C.17: hg_regression_17
223
C. Regression Tests
Dynamic CTs Static CTs
QDEPTH=0 QDEPTH=16 QDEPTH=0 QDEPTH=16
hg_regression_01 47 47 47 47
hg_regression_02 128 128 128 128
hg_regression_03 57 57 57 57
hg_regression_04 138 138 138 138
hg_regression_05 121 112 121 112
hg_regression_06 158 158 158 158
hg_regression_07 238 238 238 238
hg_regression_08 65 65 65 65
hg_regression_09 225 225 225 225
hg_regression_10 185 185 176 176
hg_regression_11 418 388 418 388
hg_regression_12 70 69 70 69
hg_regression_13 342 330 342 330
hg_regression_14 185 185 176 176
hg_regression_15 708 648 708 648
hg_regression_16 185 185 180 177
hg_regression_17 632 590 632 590
Table C.1.: Simulation runtimes of regression test hardware kernels in #cycles.
C.2. Simulation Results
Table C.1 shows the measured number of simulation cycles for the kernels
given in Section C.1, both for dynamic and static CTs, as well as for trans-
parent output queues (16 entries, i.e., QDEPTH=16) and without output queues
(QDEPTH=0). These measurements give the pure runtime of the hardware ker-
nel including cache stalls on the adaptive computer ACE-M5, but excluding the
time for the FPGA configuration and the transfer of variables between software
and hardware.
224
D. C Sources
D.1. Fcdf22
Fcdf22, taken from the Versatility Honeywell Stressmark [Hone97], implements
the kernel of a wavelet image transformation. We have unrolled the loop by
factor two to increase the size of the loop body. In our measurements, we
transform one pixel row of 256 pixels.
1 # de f i n e ROW 256
2
3 i n t main ( ) {
4 i n t mid = ROW / 2 − 1 ;
5 i n t i , i 1 ;
6 i n t s [ROW / 2 ] , d [ROW / 2 ] ;
7 i n t s_ i , s_ i1 , d_i , d_ i1 ;
8
9 / / i n i t i a l i z e a r r a y s s [ ] , d [ ]
10
11 d [0 ]= d [0 ]+ d [0] − s [0] − s [ 1 ] ;
12 s [0 ]= s [ 0 ]+ ( ( d [0 ]+ d [0 ] ) > >3 ) ;
13
14 / / HW s t a r t s
15 f o r ( i = 1 ; i < mid ; i += 2) {
16 i 1 = i + 1 ;
17 d_ i=d [ i ]+d [ i ]− s [ i ]− s [ i 1 ] ;
18 s _ i=s [ i ]+ ( ( d [ i −1]+ d_ i ) > >3);
19
20 d_ i1=d [ i 1 ]+d [ i 1 ]− s [ i 1 ]− s [ i +2 ] ;
21 s _ i 1=s [ i 1 ]+ ( ( d_ i+d_ i1 ) > >3);
22
23 s [ i ] = s _ i ;
24 d [ i ] = d_ i ;
25 s [ i 1 ]= s _ i 1 ;
225
D. C Sources
26 d [ i 1 ] = d_ i1 ;
27 }
28 / / HW ends
29
30 / / ou t p u t r e s u l t s s [ ] , d [ ]
31
32 re turn 0 ;
33 }
Listing D.1: fcdf22
D.2. GfMultiply
GfMultiply is a kernel of the pegwit program from the MediaBench suite
[LePM97], performing elliptic curve cryptography.
1 / / s u r r ound i ng code om i t t e d
2
3 / / HW s t a r t s
4 f o r ( j = q [ 0 ] ; j ; j −−) {
5 i f ( ( l o g _ q j = l g [ j ] ) != TOGGLE) {
6 x = l o g _ p i + l o g _ q j ;
7 i f ( x >= TOGGLE) {
8 r [ i+ j −1] ^= exp t [ x − TOGGLE] ;
9 } e l s e {
10 r [ i+ j −1] ^= exp t [ x ] ;
11 }
12 }
13 }
14 / / HW ends
Listing D.2: gfmultiply
D.3. MD5
The MD5 kernel computes the MD5 hash value (128 bits) of a given message.
The source code used has been initially implemented by Benjamin Thielmann
and was then adjusted by ourselves for pipelining multiple messages.
226
D.3. MD5
1 / / on l y HW k e r n e l f u n c t i o n p r i n t e d here
2
3 void p roces s_hw_ loop ( unsigned num_messages ) {
4 / / f u n c t i o n b l o c k v a l u e s
5 unsigned i n t a , b , c , d , temp , temp2 , f , g ;
6 / / hashes from p r e v i o u s i t e r a t i o n
7 unsigned i n t h0_h , h1_h , h2_h , h3_h ;
8 / / c oun t e r v a r i a b l e
9 unsigned i n t h ;
10
11 / / _ l o c a l p o i n t e r s : t e l l c omp i l e r t o use l o c a l
12 / / BRAM
13 unsigned i n t ∗ message_b ram0_ loca l
14 = message_bram0 ;
15 / / . . .
16 unsigned i n t ∗ message_b ram15_ loca l
17 = message_bram15 ;
18 unsigned i n t ∗ h 0_ l o c a l = h0 ;
19 unsigned i n t ∗ h 1_ l o c a l = h1 ;
20 unsigned i n t ∗ h 2_ l o c a l = h2 ;
21 unsigned i n t ∗ h 3_ l o c a l = h3 ;
22
23 / / " b u f f e r s " f o r message words
24 unsigned i n t t _ v a r _ 0 ;
25 / / . . .
26 unsigned i n t t _ v a r _15 ;
27
28 / / HW s t a r t s
29 f o r ( h=0; h<num_messages ; h++) {
30 / / i n i t hash
31 a = h 0_ l o c a l [ h ] ;
32 b = h 1_ l o c a l [ h ] ;
33 c = h 2_ l o c a l [ h ] ;
34 d = h 3_ l o c a l [ h ] ;
35
36 / / save p r e v i o u s hashes
37 h0_h = a ;
38 h1_h = b ;
39 h2_h = c ;
40 h3_h = d ;
227
D. C Sources
41
42 / / 16 message_bram_ loca l s
43 / / => t o g e t h e r 16 words o f one message
44 t _ v a r _ 0 = message_b ram0_ loca l [ h ] ;
45 / / . . .
46 t _ v a r _15 = message_b ram15_ loca l [ h ] ;
47
48 / / f u n c t i o n b l o c k 0
49 f = ( d ^ ( b & ( c ^ d ) ) ) ;
50 temp = d ;
51 d = c ;
52 c = b ;
53 temp2 = a + f + (0 xd76aa478 + t _ v a r _ 0 ) ;
54 b = b + ( ( temp2 << 7) | ( temp2 >> ( 2 5 ) ) ) ;
55 a = temp ;
56
57 / / more f u n c t i o n b l o c k s . . .
58
59 / / f u n c t i o n b l o c k 63
60 f = c ^ ( b | (0 x f f f f f f f f − d ) ) ;
61 temp = d ;
62 d = c ;
63 c = b ;
64 temp2 = a + f + (0 xeb86d391 + t _ v a r _ 9 ) ;
65 b = b + ( ( temp2 << 21) | ( temp2 >> 1 1 ) ) ;
66 a = temp ;
67
68 / / add hash and s t o r e
69 h 0_ l o c a l [ h ] = h0_h + a ;
70 h 1_ l o c a l [ h ] = h1_h + b ;
71 h 2_ l o c a l [ h ] = h2_h + c ;
72 h 3_ l o c a l [ h ] = h3_h + d ;
73 }
74 / / HW ends
75 }
Listing D.3: md5
228
D.4. Memcopy
D.4. Memcopy
Memcopy copies data from one array to another. While this is a synthetic kernel,
copying memory is a basic technique that occurs in many applications. To
enlarge the loop body, we copy eight data words per loop iteration.
1 i n t main ( ) {
2 i n t i , a0 , a1 , a2 , a3 , a4 , a5 , a6 , a7 ;
3 i n t o f f s e t 0 , o f f s e t 1 , o f f s e t 2 , o f f s e t 3 ,
4 o f f s e t 4 , o f f s e t 5 , o f f s e t 6 , o f f s e t 7 ;
5 i n t a [ 6 4 ] , b [ 6 4 ] ;
6 a [ 0 ] = 0 ; a [ 1 ] = 1 ; a [ 2 ] = 2 ;
7 / ∗ . . . ∗ / a [ 6 3 ] = 64 ;
8 b [ 0 ] = 100 ; b [ 1 ] = 101 ; b [ 2 ] = 102 ;
9 / ∗ . . . ∗ / b [ 6 3 ] = 164 ;
10 o f f s e t 0 = 0 ; o f f s e t 1 = 4 ; o f f s e t 2 = 8 ;
11 / ∗ . . . ∗ / o f f s e t 7 = 28 ;
12
13 / / HW s t a r t s
14 f o r ( i = 0 ; i < 64 ; i += 8) {
15 a0 = ∗ ( ( i n t ∗ ) ( ( ( i n t ) a ) + o f f s e t 0 ) ) ;
16 / / . . .
17 a7 = ∗ ( ( i n t ∗ ) ( ( ( i n t ) a ) + o f f s e t 7 ) ) ;
18
19 ∗ ( ( i n t ∗ ) ( ( ( i n t ) b ) + o f f s e t 0 ) ) = a0 ;
20 / / . . .
21 ∗ ( ( i n t ∗ ) ( ( ( i n t ) b ) + o f f s e t 7 ) ) = a7 ;
22
23 o f f s e t 0 += 32 ;
24 / / . . .
25 o f f s e t 7 += 32 ;
26 }
27 / / HW ends
28
29 p r i n t f ( " r e s u l t : b [ 0 ] = %d \ n " , b [ 0 ] ) ;
30 / / . . .
31 p r i n t f ( " r e s u l t : b [ 6 3 ] = %d \ n " , b [ 6 3 ] ) ;
32 re turn 0 ;
33 }
Listing D.4: memcopy
229
D. C Sources
D.5. Quantization
The quantization kernel, taken from the Versatility Honeywell Stressmark
[Hone97], quantizes the output of a wavelet image transform (32 bits per pixel)
to four bits per pixel. We have replaced the call to the C function abs (the
source code of which not being known to the compiler) by an if/else construct.
For better efficiency, we iterate over pixel rows instead of columns in the outer
loop. For our measurements, we use an input image area of 64x64 pixels.
1 / / s u r r ound i ng code om i t t e d
2
3 / / HW s t a r t s
4 f o r ( j=y ; j <y+ s i z e ; j ++) {
5 f o r ( i=x ; i <x+ s i z e ; i ++) {
6
7 v a l=my_ in t _da t a [ i +( j < <9)] ;
8 i f ( v a l >= 0) {
9 a b s v a l = v a l ;
10 } e l s e {
11 a b s v a l = 0 − v a l ;
12 }
13 i f ( a b s v a l < my_b lock th r e sh [ num ] ) {
14 r e s u l t _ v a l=ZERO_MARK;
15 } e l s e {
16 / / c l a s s i f y
17 i f ( va l >mythresh8 ) {
18 i f ( va l >mythresh12 ) {
19 i f ( va l >mythresh14 ) {
20 i f ( va l >mythresh15 ) c l a s s i f y _ v a l =15;
21 e l s e c l a s s i f y _ v a l = 14 ;
22 } e l s e {
23 i f ( va l >mythresh13 ) c l a s s i f y _ v a l =13;
24 e l s e c l a s s i f y _ v a l = 12 ;
25 }
26 } e l s e {
27 i f ( va l >mythresh10 ) {
28 i f ( va l >mythresh11 ) c l a s s i f y _ v a l =11;
29 e l s e c l a s s i f y _ v a l = 10 ;
30 } e l s e {
31 i f ( va l >mythresh9 ) c l a s s i f y _ v a l = 9 ;
230
D.6. SHA
32 e l s e c l a s s i f y _ v a l = 8 ;
33 }
34 }
35
36 } e l s e {
37 i f ( va l >mythresh4 ) {
38 i f ( va l >mythresh6 ) {
39 i f ( va l >mythresh7 ) c l a s s i f y _ v a l = 7 ;
40 e l s e c l a s s i f y _ v a l = 6 ;
41 } e l s e {
42 i f ( va l >mythresh5 ) c l a s s i f y _ v a l = 5 ;
43 e l s e c l a s s i f y _ v a l = 4 ;
44 }
45 } e l s e {
46 i f ( va l >mythresh2 ) {
47 i f ( va l >mythresh3 ) c l a s s i f y _ v a l = 3 ;
48 e l s e c l a s s i f y _ v a l = 2 ;
49 } e l s e {
50 i f ( va l >mythresh1 ) c l a s s i f y _ v a l = 1 ;
51 e l s e c l a s s i f y _ v a l = 0 ;
52 }
53 }
54 }
55 r e s u l t _ v a l= c l a s s i f y _ v a l ;
56 }
57 my_quant_buf [ num∗IMG_SIZE+q s i z e ]= r e s u l t _ v a l ;
58 ++q s i z e ;
59
60 }
61 }
62 / / HW ends
Listing D.5: quantization
D.6. SHA
SHA from CHStone [HTHT09] represents a part of the computation of the SHA
hash value of a given message. We have unrolled the loop by factor three to
enlarge the loop body.
231
D. C Sources
1 / / s u r r ound i ng code om i t t e d
2
3 / / HW s t a r t s
4 f o r ( i = 16 ; i < 79 ; i = i + 3) {
5 iM14 = W[ i − 1 4 ] ;
6 tmp1 = W[ i − 3] ^ W[ i − 8] ^ iM14 ^ W[ i − 1 6 ] ;
7 tmp2 = W[ i − 2] ^ W[ i − 7] ^
8 W[ i − 13] ^ W[ i − 1 5 ] ;
9 tmp3 = W[ i − 1] ^ W[ i − 6] ^ W[ i − 12] ^ iM14 ;
10
11 W[ i ] = tmp1 ;
12 W[ i +1] = tmp2 ;
13 W[ i +2] = tmp3 ;
14 }
15 / / HW ends
16 W[79 ] = W[79 − 3] ^ W[79 − 8] ^
17 W[79 − 14] ^ W[79 − 1 6 ] ;
Listing D.6: sha
D.7. Susan
Susan is taken from MiBench [GREA01] and performs an edge detection on a
gray scale image. For our measurements, we use an input image block of 8x8
pixels. In the inner loop, we have replaced the long addition chain – caused
by the original p++ expressions – by a more compact address computation. To
reduce the time for computing n, we have replaced the 36 serial additions by a
tree adder network (using helper variables a1, ..., a36).
1 / / s u r r ound i ng code om i t t e d
2
3 s u s a n _ p r i n c i p l e ( in , r , bp , max_no , x_ s i z e , y _ s i z e )
4 ucha r ∗ in , ∗bp ;
5 i n t ∗ r , max_no , x_ s i z e , y _ s i z e ;
6 {
7 i n t i , j , n ;
8 ucha r ∗p , ∗ cp ;
9 i n t a1 , a2 , a3 , a4 , a5 , a6 , a7 , a8 ,
10 a9 , a10 , a11 , a12 , a13 , a14 , a15 , a16 ,
232
D.7. Susan
11 a17 , a18 , a19 , a20 , a21 , a22 , a23 , a24 ,
12 a25 , a26 , a27 , a28 , a29 , a30 , a31 , a32 ,
13 a33 , a34 , a35 , a36 ;
14
15 memset ( r , 0 , x _ s i z e ∗ y_ s i z e ∗ s i z e o f ( i n t ) ) ;
16
17 / / HW s t a r t s
18 f o r ( i =3; i <y_s i z e −3; i ++) {
19 f o r ( j =3; j <x_s i z e −3; j ++) {
20 n=100;
21 p= i n + ( i −3)∗ x_ s i z e + j − 1 ;
22 cp=bp + i n [ i ∗ x_ s i z e+ j ] ;
23
24 / / r e p l a c e d "p++" f o r more compact
25 / / add r e s s c ompu t a t i o n s
26 a1=∗( cp−∗p ) ;
27 a2=∗( cp −∗( p + 1 ) ) ;
28 a3=∗( cp −∗( p + 2 ) ) ;
29 a4=∗( cp −∗( p+x_s i z e − 1 ) ) ;
30 p+=x_ s i z e ;
31
32 a5=∗( cp−∗p ) ;
33 a6=∗( cp −∗( p + 1 ) ) ;
34 a7=∗( cp −∗( p + 2 ) ) ;
35 a8=∗( cp −∗( p + 3 ) ) ;
36 p+=x_s i z e −2;
37
38 a9=∗( cp−∗p ) ;
39 a10=∗( cp −∗( p + 1 ) ) ;
40 a11=∗( cp −∗( p + 2 ) ) ;
41 a12=∗( cp −∗( p + 3 ) ) ;
42 p+=4;
43
44 a13=∗( cp−∗p ) ;
45 a14=∗( cp −∗( p + 1 ) ) ;
46 a15=∗( cp −∗( p + 2 ) ) ;
47 a16=∗( cp −∗( p+x_s i z e − 4 ) ) ;
48 p+=x_s i z e −3;
49
50 a17=∗( cp−∗p ) ;
233
D. C Sources
51 a18=∗( cp −∗( p + 1 ) ) ;
52 a19=∗( cp −∗( p + 3 ) ) ;
53 a20=∗( cp −∗( p + 4 ) ) ;
54 p+=5;
55
56 a21=∗( cp−∗p ) ;
57 a22=∗( cp −∗( p+x_s i z e − 6 ) ) ;
58 a23=∗( cp −∗( p+x_s i z e − 5 ) ) ;
59 a24=∗( cp −∗( p+x_s i z e − 4 ) ) ;
60 p+=x_s i z e −3;
61
62 a25=∗( cp−∗p ) ;
63 a26=∗( cp −∗( p + 1 ) ) ;
64 a27=∗( cp −∗( p + 2 ) ) ;
65 a28=∗( cp −∗( p + 3 ) ) ;
66 p+=x_s i z e −2;
67
68 a29=∗( cp−∗p ) ;
69 a30=∗( cp −∗( p + 1 ) ) ;
70 a31=∗( cp −∗( p + 2 ) ) ;
71 a32=∗( cp −∗( p + 3 ) ) ;
72 p+=4;
73
74 a33=∗( cp−∗p ) ;
75 a34=∗( cp −∗( p+x_s i z e − 3 ) ) ;
76 a35=∗( cp −∗( p+x_s i z e − 2 ) ) ;
77 a36=∗( cp −∗( p+x_s i z e − 1 ) ) ;
78
79 / / t r e e adder f o r n
80 a1 += a2 ; a3 += a4 ;
81 a5 += a6 ; a7 += a8 ;
82 a9 += a10 ; a11 += a12 ;
83 a13 += a14 ; a15 += a16 ;
84 a17 += a18 ; a19 += a20 ;
85 a21 += a22 ; a23 += a24 ;
86 a25 += a26 ; a27 += a28 ;
87 a29 += a30 ; a31 += a32 ;
88 a33 += a34 ; a35 += a36 ;
89
90 a1 += a3 ; a5 += a7 ;
234
D.8. Fcdf22_local
91 a9 += a11 ; a13 += a15 ;
92 a17 += a19 ; a21 += a23 ;
93 a25 += a27 ; a29 += a31 ;
94 a33 += a35 ;
95
96 a1 += a5 ; a9 += a13 ;
97 a17 += a21 ; a25 += a29 + a33 ;
98
99 a1 += a9 ;
100 a17 += a25 ;
101
102 a1 += a17 ;
103
104 n += a1 ;
105
106 i f ( n<=max_no ) {
107 r [ i ∗ x_ s i z e+ j ] = max_no − n ;
108 }
109 }
110 }
111 / / HW ends
112 }
Listing D.7: susan
D.8. Fcdf22_local
Fcdf22_local is an adjusted version of the wavelet transform function fcdf22 of-
fered by the Versatility Honeywell Stressmark [Hone97]. We use it for the
application-level study in Chapter 10.
1 void f c d f 2 2 _ l o c a l ( i n t l i n e _ s i z e , i n t num_blocks ) {
2 i n t i , h1 , h2 , h3 , h4 , h5 , h6 , f1 , f2 ;
3 i n t p i x e l ;
4
5 f1 = 0 ;
6 p i x e l = 0 ;
7
8 / / HW ke r n e l s t a r t
235
D. C Sources
9 f o r ( i = 0 ; i < num_blocks ; i ++) {
10 / / i t e r a t e over 16 p i x e l s
11
12 f2 = s _ p i x e l _ 0 _b r am0_ l o c a l [ i ] ;
13
14 / / === p i x e l s 0 , 1 ===
15 h1 = d_p i x e l _ 0_b r am1_ l o c a l [ i ] ; h2 = f2 ;
16 h3 = s _ p i x e l _ 1 _b r am2_ l o c a l [ i ] ;
17 h4 = f1 ;
18
19 h5 = ( h1 << 1) − ( h2 + h3 ) ;
20
21 i f ( p i x e l == 0) {
22 / / l i n e s t a r t
23 h6 = h2 + ( ( h5 << 1) >> 3 ) ;
24 } e l s e {
25 / / l i n e midd l e
26 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
27 }
28
29 d_p i x e l _ 0_b r am1_ l o c a l [ i ] = h5 ;
30 s _ p i x e l _ 0 _b r am0_ l o c a l [ i ] = h6 ;
31
32 f1 = h5 ;
33 f2 = h3 ;
34
35 / / === p i x e l s 2 , 3 ===
36 h1 = d_p i x e l _ 1_b r am3_ l o c a l [ i ] ; h2 = f2 ;
37 h3 = s _ p i x e l _ 2 _b r am4_ l o c a l [ i ] ;
38 h4 = f1 ;
39
40 h5 = ( h1 << 1) − ( h2 + h3 ) ;
41 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
42
43 d_p i x e l _ 1_b r am3_ l o c a l [ i ] = h5 ;
44 s _ p i x e l _ 1 _b r am2_ l o c a l [ i ] = h6 ;
45
46 f1 = h5 ;
47 f2 = h3 ;
48
236
D.8. Fcdf22_local
49 / / === p i x e l s 4 , 5 ===
50 h1 = d_p i x e l _ 2_b r am5_ l o c a l [ i ] ; h2 = f2 ;
51 h3 = s _ p i x e l _ 3 _b r am6_ l o c a l [ i ] ;
52 h4 = f1 ;
53
54 h5 = ( h1 << 1) − ( h2 + h3 ) ;
55 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
56
57 d_p i x e l _ 2_b r am5_ l o c a l [ i ] = h5 ;
58 s _ p i x e l _ 2 _b r am4_ l o c a l [ i ] = h6 ;
59
60 f1 = h5 ;
61 f2 = h3 ;
62
63 / / === p i x e l s 6 , 7 ===
64 h1 = d_p i x e l _ 3_b r am7_ l o c a l [ i ] ; h2 = f2 ;
65 h3 = s _ p i x e l _ 4 _b r am8_ l o c a l [ i ] ;
66 h4 = f1 ;
67
68 h5 = ( h1 << 1) − ( h2 + h3 ) ;
69 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
70
71 d_p i x e l _ 3_b r am7_ l o c a l [ i ] = h5 ;
72 s _ p i x e l _ 3 _b r am6_ l o c a l [ i ] = h6 ;
73
74 f1 = h5 ;
75 f2 = h3 ;
76
77 / / === p i x e l s 8 , 9 ===
78 h1 = d_p i x e l _ 4_b r am9_ l o c a l [ i ] ; h2 = f2 ;
79 h3 = s _ p i x e l _ 5_b r am10_ l o c a l [ i ] ;
80 h4 = f1 ;
81
82 h5 = ( h1 << 1) − ( h2 + h3 ) ;
83 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
84
85 d_p i x e l _ 4_b r am9_ l o c a l [ i ] = h5 ;
86 s _ p i x e l _ 4 _b r am8_ l o c a l [ i ] = h6 ;
87
88 f1 = h5 ;
237
D. C Sources
89 f2 = h3 ;
90
91 / / === p i x e l s 10 , 11 ===
92 h1 = d_p i x e l _5_b r am11_ l o c a l [ i ] ; h2 = f2 ;
93 h3 = s _ p i x e l _ 6_b r am12_ l o c a l [ i ] ;
94 h4 = f1 ;
95
96 h5 = ( h1 << 1) − ( h2 + h3 ) ;
97 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
98
99 d_p i x e l _5_b r am11_ l o c a l [ i ] = h5 ;
100 s _ p i x e l _ 5_b r am10_ l o c a l [ i ] = h6 ;
101
102 f1 = h5 ;
103 f2 = h3 ;
104
105 / / === p i x e l s 12 , 13 ===
106 h1 = d_p i x e l _6_b r am13_ l o c a l [ i ] ; h2 = f2 ;
107 h3 = s _ p i x e l _ 7_b r am14_ l o c a l [ i ] ;
108 h4 = f1 ;
109
110 h5 = ( h1 << 1) − ( h2 + h3 ) ;
111 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
112
113 d_p i x e l _6_b r am13_ l o c a l [ i ] = h5 ;
114 s _ p i x e l _ 6_b r am12_ l o c a l [ i ] = h6 ;
115
116 f1 = h5 ;
117 f2 = h3 ;
118
119 / / === p i x e l s 14 , 15 ===
120 h1 = d_p i x e l _7_b r am15_ l o c a l [ i ] ; h2 = f2 ;
121 h3 = s _ p i x e l _ 8_b r am16_ l o c a l [ i ] ;
122 h4 = f1 ;
123
124 i f ( p i x e l < l i n e _ s i z e − 16) {
125 / / l i n e midd l e
126 h5 = ( h1 << 1) − ( h2 + h3 ) ;
127 } e l s e {
128 / / l i n e end
238
129 h5 = ( h1 << 1) − ( h2 << 1 ) ;
130 }
131
132 h6 = h2 + ( ( h4 + h5 ) >> 3 ) ;
133
134 d_p i x e l _7_b r am15_ l o c a l [ i ] = h5 ;
135 s _ p i x e l _ 7_b r am14_ l o c a l [ i ] = h6 ;
136 f1 = h5 ;
137
138 p i x e l += 16 ;
139 i f ( p i x e l >= l i n e _ s i z e ) {
140 p i x e l = 0 ;
141 }
142 }
143 / / HW ke r n e l ends
144 }
Listing D.8: fcdf22_local
239
240
Curriculum Vitae
Personal Data
Name: Hagen Gädke-Lütjens
Born: 15 Febuary 1979 in Berlin, Germany
Marital Status: married
Education
1991 – 1998 Gymnasium Alfeld in Alfeld (Leine)
July 1998 Abitur at Gymnasium Alfeld
1999 – 2004 Study of Computer Science at Technische Universität
Braunschweig, Germany
October 2004 Diplom in Computer Science
Compulsory Community Service
1998 – 1999 DRK Tagespflege-Einrichtung in Gronau (Leine)
Work Experience: Academic
2000 – 2001 Student developer at the Fraunhofer-Institute for Surface
Engineering and Thin Films in Braunschweig
2001 – 2003 Student developer at the Computing Center, Technische
Universität Braunschweig
2003 – 2004 Student developer at the Department for Computer Graph-
ics, Technische Universität Braunschweig
2005 – 2011 Research Assistant at the Department for VLSI Design,
Technische Universität Braunschweig
Work Experience: Commercial
2004 – 2005 Internship at Siemens Signalling Company Ltd. in Xi’an,
P. R. China
241
Lebenslauf
Persönliche Daten
Name: Hagen Gädke-Lütjens
Geboren am 15. Februar 1979 in Berlin
Familienstand: verheiratet
Ausbildung
1991 – 1998 Gymnasium Alfeld in Alfeld (Leine)
Juli 1998 Abitur am Gymnasium Alfeld
1999 – 2004 Studium der Informatik an der Technischen Universität
Braunschweig
Oktober 2004 Diplom in Informatik
Zivildienst
1998 – 1999 DRK Tagespflege-Einrichtung in Gronau (Leine)
Berufstätigkeit im akademischen Bereich
2000 – 2001 Studentische Hilfskraft am Fraunhofer-Institut für
Schicht- und Oberflächentechnik (IST) in Braunschweig
2001 – 2003 Studentische Hilfskraft am Rechenzentrum der Techni-
schen Universität Braunschweig
2003 – 2004 Studentische Hilfskraft am Institut für Computergraphik,
Technische Universität Braunschweig
2005 – 2011 Wissenschaftlicher Mitarbeiter an der Abteilung Entwurf
integrierter Schaltungen (E.I.S.), Technische Universität
Braunschweig
Berufstätigkeit im kommerziellen Bereich
2004 – 2005 Praktikum in der Siemens Signalling Company Ltd. in
Xi’an, V.R. China
242
