Code Generation for Synchronous Control Asynchronous Dataflow Architectures by Bhagyanath, Anoop
Code Generation for
Synchronous Control Asynchronous Dataflow
Architectures
Dissertation
approved by the Department of Computer Science of the Technische Universität
Kaiserslautern for the award of the academic degree Doktor der
Ingenieurwissenschaften (Dr.-Ing.)
by
Anoop Bhagyanath
Date of the scientific defense: 31.01.2020
Dean: Prof. Dr. Jens Schmitt
Reviewer: Prof. Dr. Klaus Schneider




Scaling up conventional processor architectures cannot translate the ever-
increasing number of transistors into comparable application performance.
Although the trend is to shift from single-core to multi-core architectures,
utilizing these multiple cores is not a trivial task for many applications due to
thread synchronization and weak memory consistency issues. This is especially
true for applications in real-time embedded systems since timing analysis be-
comes more complicated due to contention on shared resources. One inherent
reason for the limited use of instruction-level parallelism (ILP) by conventional
processors is the use of registers. Therefore, some recent processors bypass reg-
ister usage by directly communicating values from producer processing units to
consumer processing units. In widely used superscalar processors, this direct
instruction communication is organized by hardware at runtime, adversely af-
fecting its scalability. The exposed datapath architectures provide a scalable
alternative by allowing compilers to move values directly from output ports to
the input ports of processing units. Though exposed datapath architectures
have already been studied in great detail, they still use registers for executing
programs, thus limiting the amount of ILP they can exploit. This limitation
stems from a drawback in their execution paradigm, code generator, or both.
This thesis considers a novel exposed datapath architecture named Syn-
chronous Control Asynchronous Dataflow (SCAD) that follows a hybrid control-
flow dataflow execution paradigm. The SCAD architecture employs first-in-
first-out (FIFO) buffers at the output and input ports of processing units. It
is programmed by move instructions that transport values from the head of
output buffers to the tail of input buffers. Thus, direct instruction commu-
nication is facilitated by the architecture. The processing unit triggers the
execution of an operation when operand values are available at the heads of
its input buffers. We propose a code generation technique for SCAD proces-
sors inspired by classical queue machines that completely eliminates the use
of registers. On this basis, we first generate optimal code by using satisfia-
bility (SAT) solvers after establishing that optimal code generation is hard.
Heuristics based on a novel buffer interference analysis are then developed to
compile larger programs. The experimental results demonstrate the efficacy





I will always be grateful to Prof. Klaus Schneider for his exemplary guidance.
I have thoroughly enjoyed all the fruitful discussions we have had over the
course of my journey as a doctoral student. Under your guidance, I have
learned to effectively approach problems by focusing on the important aspects
and see that the details converge naturally. This also helped me recognize
and appreciate the elegance of different solutions. Moreover, I have gained
important insights while assisting you with teaching, which will be really useful
during my academic career.
I sincerely thank Prof. Andreas Koch, who agreed to take time out of his
hectic schedule to review this work. Prof. Pascal Schweitzer and Prof. Karsten
Berns were kind to ensure the smooth conduct of my PhD defense as examiner
and chair, respectively. I am also thankful for their valuable advice on pursuing
a career in academia.
I would also like to thank my colleagues for a healthy atmosphere in the
group to share and discuss each other’s work. Let me take this opportunity to
also thank students I have worked with for bringing fresh perspectives to my
research efforts.
Most importantly, I express my deepest gratitude to my family and friends







1. Introduction
1.1. Motivation
1.2. Contributions
1.3. Outline
2. Execution Paradigms
2.1. SCAD
2.2. Dynamically Ordered SCAD
2.3. Statically Ordered SCAD
3. Code Generation Techniques for SCAD
3.1. Register Oriented Code Generation
3.2. Queue Oriented Code Generation
4. Optimal Code Generation
4.1. Complexity Analysis
4.2. Mapping to SAT
4.3. Experiments
5. Heuristics for Code Generation
5.1. Preliminaries
5.2. Buffer Assignment
5.3. Balancing Variables
5.4. Move Code Generation
5.5. Remarks on Buffer Size
5.6. Experiments
6. Related Work
7. Conclusions
Bibliography
A. Compilation Example




List of Figures
1.1. (a) An example expression tree, (b) the corresponding assembler code and (c) the same expression tree with spill code
2.1. Execution framework of a SCAD processor that uses a 2D mesh network as data transport network
2.2. A processing unit in a SCAD machine
2.3. Execution framework of a SCAD processor that uses a fat binary tree as data transport network
2.4. The load-store unit in a SCAD machine
2.5. The control unit in a SCAD machine
2.6. The control unit in a SCAD machine with branch prediction
2.7. Dataflow graphs for conditional statement
2.8. Dataflow graph for while loop statement using switch node
2.9. A processing unit in a dynamically ordered SCAD machine
2.10. A processing unit in a statically ordered SCAD machine
3.1. An expression tree with its register program, and corresponding SCAD program
3.2. Architecture of a queue machine
3.3. An expression tree with its queue program and the content of the queue after executing each instruction
3.4. An expression DAG with its levelized version, its planarized version, the final level-planar expression DAG, and the obtained queue program
3.5. An expression tree with its queue program, and corresponding SCAD program
3.6. A given expression DAG with its planarized version, the obtained queue program, and a SCAD program without swap and dup instructions
3.7. An expression DAG with a move code program without dup and swap instructions for a SCAD machine with one load-store unit (lsu) and one processing unit (u)
4.1. Control flow graph of the constructed program P
4.2. Failed attempt to schedule basic block Bi without overhead
4.3. Ordering of variables forming a cycle spanning both left and right input buffers of PU k
4.4. Ordering of variables forming a cycle in the left input buffer and a cycle in the right input buffer of PU k
4.5. Ordering of variables forming a cycle spanning all input buffers
4.6. Ordering of variables in input buffers
4.7. Ordering of variables in input buffers 1, 2 and 3 (from left to right)
4.8. Ordering of variables in input buffers 1, ..., n (from left to right) forming a cycle spanning buffers i, ..., n
4.9. Minimal numbers of PUs required to execute programs of different sizes in SCAD and its variant architectures
4.10. Minimal PUs required to execute programs of different levels
4.11. Average minimal time to execute programs of different sizes
4.12. Average measure of parameters in queue-based and register-based SCAD program executions
4.13. Average data transmissions in program executions
4.14. Compile time for optimal SCAD code generation for programs of different sizes
4.15. Compile time for optimal SCAD code generation for programs of different levels
5.1. Possible control-flow graphs of loops in MiniC
5.2. Examples for intuitive understanding of dominance and post-dominance frontiers
5.3. An example for SSA transformation and elimination
5.4. An example for SSI transformation and elimination
5.5. Write and read instances of variables x and y during program execution
5.6. Content of the output buffer (the tail is on the left and the head is on the right hand side) at time w(yj) (from Figure 5.5)
5.7. Live ranges and use intervals in a program
5.8. Balancing variables by dummy assignments and copy assignments
5.9. Balancing variables by SSI transformation
5.10. Bounding variable use in loop
5.11. Balancing variables in loops by discarding copies
5.12. Execution time using minimal buffer sizes
5.13. Execution time using enough buffer sizes
5.14. Number of PU firings
5.15. Minimal resource usage
5.16. Total number of data transmissions
5.17. Number of data transmissions by PUs
6.1. Execution framework of a superscalar processor
6.2. Execution framework of a VLIW processor
6.3. Execution framework of a TTA processor
6.4. Execution framework of a TRIPS processor
A.1. Command program
A.2. Colored register interference graph
A.3. Balanced command program




3.1. List of queue instructions
3.2. Mapping queue machine instructions to move instructions of a universal SCAD machine
4.1. Minimal time slot assignments for Program 4.19
4.2. Minimal PU assignments for Program 4.25
5.1. Statements in MiniC language
5.2. Instructions of the Command language
5.3. Variable definition tuples generated by each command instruction type
5.4. Variable definition tuples killed by each command instruction type
5.5. Variable use tuples generated by each command instruction type
5.6. Variable use tuples killed by each command instruction type
5.7. Command instructions and corresponding SCAD move instructions




1. Reaching definitions analysis
2. Live use tuples analysis
3. Constructing the buffer interference graph
4. Bound variable x in loops in statement S




AST Abstract syntax tree
CDB Common data bus
CU Control unit
DAG Directed acyclic graph
DO-SCAD Dynamically ordered SCAD
DPDI Dynamic placement dynamic issue




MIB Move instruction bus
NP Non-deterministic polynomial time (complexity class)
PU Processing unit
RISC Reduced instruction set computer
SAT Satisfiability
SCAD Synchronous control asynchronous dataflow
SMT Satisfiability modulo theories
SO-SCAD Statically ordered SCAD
SPDI Static placement dynamic issue
SPSI Static placement static issue
SSA Static single assignment
SSI Static single information
TTA Transport triggered architecture






1. Introduction
1.1. Motivation
1.2. Contributions
1.3. Outline
1.1. Motivation
The miniaturization of microelectronics is resulting in an ever-increasing
number of transistors in processors. However, simply scaling up conventional
processor architectures will not yield a proportional improvement in application
performance [AHKB00]. Due to this processor scaling wall, the past decade
has seen a shift from single-core to multi-core architectures. However, due
to the required thread synchronization [Lee06] and weak memory consistency
issues [Mosb93; AdGh96; StNu04], in many applications, it is not a trivial
task to extract enough thread-level parallelism to take advantage of multiple
cores. The effective use of multiple cores is even more difficult in embedded
systems since, for a majority of embedded system applications, one has to
guarantee strict requirements in timing in addition to the correctness of mul-
tithreaded programs. Timing analysis becomes more complicated for multi-
core architectures due to contention on shared resources by co-running threads
[FGQF12; NoPa12; RGGQ12]. An alternative for improving the application
performance is to still increase the use of instruction-level parallelism (ILP)
[JoWa89; Wall91a; RaFi93] contained in the programs.
The execution paradigms of processor architectures are traditionally viewed
from the perspective of control-flow and dataflow computing models. In the
control-flow model [VonN45], the program is a linear sequence of instructions
addressed by a program counter. Each instruction accesses its operands from
and writes its result to an updateable memory. Typically, the control-flow
model of computation expresses no parallelism. At the opposite extreme is
the dataflow model [KaMi66; Denn74; Kahn74] where programs are dataflow
graphs, and instruction execution is driven by the arrival of data tokens. The
intermediate results are directly communicated from producer instructions
to consumer instructions and are not stored in a shared memory. However,
though offering an elegant concurrent model of computation, dataflow pro-
cessors fell out of favour due to their inefficient implementations and their
inability to effectively manage a shared memory to support imperative pro-
gramming languages (see Chapter 6 for more details).
Hence, commercially successful processors are based on the control-flow
model. Since the model itself does not express any parallelism, the proces-
sors and compilers use various techniques to extract ILP [Toma67; FERN84;
Fish81; MLCH92; Lam88; Rau94; LaHw95]. However, these techniques face
unavoidable limits on their further scalability [AHKB00]. One inherent reason
for these limitations is the use of registers to hold intermediate values: Most
current processor architectures are so-called load-store architectures where
only load and store instructions have access to the main memory while all
other instructions use registers as operands and target addresses. This is be-
cause the execution time of memory accesses did not improve as fast as that
of other instructions (memory wall [WuMc95; McKe04]), so that the number
of memory accesses had to be limited as much as possible. A simple way to
reduce them is to load values into local memories like registers and to work
on local copies as long as possible. While the introduction of registers was a
good idea for sequential processors, it now imposes limits for the use of ILP.
For example, consider the expression tree shown in Figure 1.1a. Using 4
registers and 4 read and write ports in the register file, one can evaluate the
expression in only 3 steps by firing all instructions of a level in one step.
However, if only 2 write ports are available in the register file, at most
2 independent instructions can be fired in one step. It then takes 4 steps to
evaluate the expression as shown in Figure 1.1b, irrespective of the number of
processing units. By the Sethi-Ullman algorithm [SeUl70], at least 3 registers
are required to evaluate the expression tree if no load/store instructions are
to be used. If only 2 registers are available, one has to insert spill code so
that the result of x1+x2 is stored in memory and loaded into a register
after x3 − x4 has been evaluated, as shown in Figure 1.1c. It now takes 5 steps
to evaluate the program (since there are 5 levels in the obtained syntax tree),
irrespective of the number of processing units. More memory accesses not only
reduce ILP but also adversely affect the timing predictability of applications
[WGRS09], which is an important metric in real-time embedded systems.
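The register and step counts quoted above can be reproduced with a small sketch. It assumes, hypothetically, that the tree of Figure 1.1a is a complete binary tree over eight leaves x1, ..., x8; only x1+x2 and x3 − x4 are named in the text, and the remaining operators are placeholders. It further assumes the classic Sethi-Ullman labeling in which a right leaf needs no register of its own, and a greedy list scheduler that fires at most as many ready instructions per step as there are write ports, preferring instructions farthest from the root. The names Node, su_label, and steps_needed are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """Expression-tree node; a node without children is a leaf (variable)."""
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def su_label(n: Node, is_left: bool = True) -> int:
    """Sethi-Ullman number: registers needed to evaluate n without spilling."""
    if n.left is None:
        return 1 if is_left else 0
    l, r = su_label(n.left, True), su_label(n.right, False)
    return max(l, r) if l != r else l + 1


def operators(root: Node):
    """Return (node, depth) for every operator node; the root has depth 0."""
    found, stack = [], [(root, 0)]
    while stack:
        n, d = stack.pop()
        if n.left is not None:
            found.append((n, d))
            stack += [(n.left, d + 1), (n.right, d + 1)]
    return found


def steps_needed(root: Node, write_ports: int) -> int:
    """Greedy list scheduling with at most write_ports results per step."""
    pending, done, steps = operators(root), set(), 0
    while pending:
        # An operator is ready once both operands are leaves or already computed.
        ready = [(n, d) for n, d in pending
                 if all(c.left is None or id(c) in done for c in (n.left, n.right))]
        ready.sort(key=lambda nd: -nd[1])  # deepest operators first
        fired = {id(n) for n, _ in ready[:write_ports]}
        done |= fired
        pending = [(n, d) for n, d in pending if id(n) not in fired]
        steps += 1
    return steps


x = [Node(f"x{i}") for i in range(1, 9)]
# Hypothetical tree shape; only x1+x2 and x3-x4 are taken from the text.
tree = Node("*",
            Node("*", Node("+", x[0], x[1]), Node("-", x[2], x[3])),
            Node("*", Node("+", x[4], x[5]), Node("-", x[6], x[7])))

print(su_label(tree), steps_needed(tree, 4), steps_needed(tree, 2))
```

Under these assumptions the sketch reports 3 registers, 3 steps with four write ports, and 4 steps with two write ports, which agrees with the numbers given in the text. The spill-code case with only 2 registers is not modeled.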
Hence, the limited number of registers and register file ports limits the use of
ILP. Increasing the number of registers is however difficult since this number
is directly encoded in the instruction sets. Changing it requires corresponding
changes in machines, compilers, and even operating systems. Also, increas-
ing the number of processing units and the number of ports in a register file
quickly leads to a bottleneck in wiring these on the chip [ZyKo98; RDKM00].
For the latter reason, clustered architectures have been introduced where
processing units can only access predefined register clusters. Moreover, additional
cycles are incurred for writing/reading a value to/from a register compared to
directly communicating the value from a producer processing unit to consumer
processing units.

Figure 1.1.: (a) An example expression tree, (b) the corresponding assembler code
and (c) the same expression tree with spill code

Therefore, recent processors try to bypass register usage by communicating
values directly from the producer processing units to the consumer
processing units. This is often referred to as direct instruction communication
or direct data routing. Though based on the control-flow computing model,
these processors gradually adopt the dataflow computing model since values
are produced, directly communicated, consumed, and are not overwritten us-
ing a shared namespace (register). Most current processors follow a hybrid
control-flow dataflow model of computation [YAJE14]. In widely used superscalar
processors [John91; SmSo95], instructions in reservation stations are executed
out-of-order according to Tomasulo’s algorithm [Toma67]. Instructions
that occur simultaneously in reservation stations communicate their results
directly with one another, while others communicate via registers. This di-
rect instruction communication is completely controlled by the processor. The
processor tracks data dependencies of instructions to add appropriate entries
in the reservation station for the direct communication of values. Moreover,
allocating instructions to processing units or instruction placement is also de-
termined at runtime. These factors limit the scalability of these machines to
larger numbers of processing units (see Chapter 6 for more details).
The exposed datapath architectures [Corp94; Corp99; WTSS97; BKMD04;
MCCV06; HSMC11; GoHS11; VSGG10; WSCH15] propose an interesting
scalable alternative by allowing the compiler to move values directly from
producer processing units to consumer processing units. This way, direct
instruction communication is simply offered by the processor, and it is the
responsibility of the compiler to control and utilize the same. These archi-
tectures often provide a large number of processing units. They can use a
compiler determined instruction placement to mitigate communication delays,
which are becoming increasingly dominant in many-core processors [AHKB00;
HoYS98]. Though exposed datapath architectures have already been studied
in great detail, we observe that these architectures still use registers to exe-
cute programs, thus limiting the amount of ILP they can make use of. This
limitation stems from either a drawback in their execution paradigm or the
code generator or both (see Chapter 6 for a review of related architectures).
1.2. Contributions
We propose a novel exposed datapath architecture based on a hybrid control-
flow dataflow execution paradigm to facilitate direct instruction communica-
tion. We moreover recommend an associated code generator that completely
avoids the use of registers. The main contributions are therefore:
• SCAD: The Synchronous Control Asynchronous Dataflow architecture
is an exposed datapath hybrid control-flow dataflow architecture that
uses first-in first-out (FIFO) buffers, i.e., queues, at output and input
ports of processing units. SCAD is programmed by move instructions
that transport values from the heads of output buffers to the tails of input
buffers (exposed datapath). We also study two close variants of SCAD,
statically ordered SO-SCAD and dynamically ordered DO-SCAD, to
compare it with the classic register and dataflow architectures.
• Code Generation: We observe that SCAD code generation inspired
by classical queue machines, in contrast to register machines, not only
exploits more ILP but also lends itself naturally to direct instruction
communication, thereby eliminating the need to use registers. However, with
limited processing units, the content of output buffers will need to be
rotated to access values which are not at their heads. This leads to
computational and transportation overheads.
• Optimal Code Generation: We prove that it is a hard problem to
compile a given program to optimal (overhead-free) move code for a
given SCAD machine. Consequently, boolean constraints are formulated
for optimal code generation so that satisfiability (SAT) solvers can be
used to generate optimal code.
• Heuristics for Code Generation: Since optimal code generation
using SAT solvers is only feasible for small programs, we developed a
heuristic for SCAD code generation. To that end, instructions are as-
signed to processing units using a novel buffer interference analysis so
that overhead-free SCAD code is generated by considering instructions
in program order.
1.3. Outline
The rest of the thesis is organized as follows: Chapters 2, 3, 4, and 5 discuss the core
contributions of this thesis in the order listed above. Experiments in Chapter 4
reveal that SCAD finds an essential balance between hardware complexity and
compiler flexibility to implement direct instruction communication effectively.
Further experimental results in Chapters 4 and 5 show the efficacy of our code
generation technique for SCAD compared to that based on traditional code
generation for register architectures. Chapter 6 reviews execution paradigms of
and compilation for other architectures concerning the use of ILP. Conclusions






2. Execution Paradigms
2.1. SCAD
2.1.1. Organization
2.1.2. Execution of a move program
2.1.3. Control flow in SCAD
2.1.4. Remarks on the execution paradigm
2.2. Dynamically Ordered SCAD
2.3. Statically Ordered SCAD
In this chapter, we present a novel processor architecture: the Synchronous
Control Asynchronous Dataflow (SCAD) architecture [BhJS16]. SCAD is an
exposed datapath architecture that uses a blend of control-flow and dataflow
computational models for executing programs. We also discuss two variants
of SCAD [BhSc17a] that differ in hardware and compiler complexity: First,
statically ordered SCAD (SO-SCAD) that tends more towards the control-
flow model in its execution style compared to SCAD. Second, the dynamically




2.1. SCAD
2.1.1. Organization
The organization of processing units in a SCAD architecture is shown in Figure
2.1. Each processing unit (PU) has queues or first-in-first-out (FIFO) buffers
at its input and output ports. Input and output buffers are connected to two
interconnection networks: the move-instruction bus (MIB) (shown in red),
which is used to synchronously send values from the control unit to
the PUs, and the data transport network (DTN) (shown in green), which is
used by the PUs to asynchronously send values to each other whenever these
are available.

Figure 2.1.: Execution framework of a SCAD processor that uses a 2D mesh network
as data transport network

Figure 2.2.: A processing unit in a SCAD machine
A processing unit in a SCAD machine is shown in Figure 2.2. Note that besides
the input buffers l and r, which store the left and right operands of a binary
operation, there is an opcode input buffer op and a copies input buffer cp that
holds the number of copies of the result to be produced in the output buffer o.
The opcode and copies buffers of PUs are not shown in Figure 2.1 to avoid
clutter. The operand input buffers and output buffers hold entries that are
pairs (adr, val). For an input buffer, adr is the address of the output buffer
of the PU that produced or that will produce the value val. An entry (adr, ⊥)
with the special value ⊥ is used to indicate that the required value is not yet
available and will later be sent from the output buffer adr. Similarly, for an
output buffer, adr is the address of the input buffer of the PU that will
consume the value val. An entry (adr, ⊥) with the special value ⊥ is used to
indicate that the required value is not yet available and will later be
produced by the PU and can then be sent to the input buffer adr.
SCAD is programmed by a sequence of move instructions src → tgt whose
semantics is to move a value from the head of output buffer src to the tail of
input buffer tgt. Although a 2D mesh network is used as DTN in Figure 2.1,
any interconnection network ranging from a simple set of buses and sockets to
more complex parallel networks such as Omega, Banyan, Beneš networks, etc.
can be used as DTN. For instance, Figure 2.3 shows PUs interconnected using
a fat binary tree. Similarly, a PU in a SCAD architecture may implement
any function with an arbitrary number of inputs and outputs. These prop-
erties recommend SCAD as an interesting candidate for application-specific
processors. However, we do not explore co-design possibilities with SCAD in
this thesis, and instead, we simply assume PUs capable of executing standard
binary operations.
2.1.2. Execution of a move program
The execution of a move program works as follows: Using the program counter,
the control unit (CU) fetches the next move instruction src → tgt from the
instruction memory and broadcasts it via the MIB to all PUs. The input
buffer with address tgt adds the entry (src, ⊥) to its tail, and the output
buffer with address src adds the entry (tgt, ⊥) to its tail. If one of the two
buffers is full, it signals this to the control unit via a feedback signal
fullBuffer. In that case, the other buffer does not store the entry either, and
the control unit stalls and resends the move instruction src → tgt in the next
cycle. The data transport related to a move instruction
src → tgt is deferred to a later time when the data is available. Therefore,
the addresses in move instructions are registered synchronously in program
order while the flow of data proceeds asynchronously: synchronous control
asynchronous dataflow. Also, note that all move instructions are stored in the
buffers in the order in which they were issued by the control unit, i.e., in
program (control-flow) order. To see in more detail how a move program is
executed, let us consider the behaviors of the PUs and their input and output
buffers.
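The synchronous registration step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the hardware design: buffer addresses such as "u1.o" are hypothetical names, None stands for the marker of a value that is not yet available, and the case in which a result is already waiting in the output buffer is ignored.

```python
from collections import deque

BOTTOM = None  # stands for the "value not yet available" marker


class Buffer:
    """Bounded FIFO of [adr, val] entries; index 0 is the head."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()

    def full(self) -> bool:
        return len(self.entries) >= self.capacity


def broadcast(src: str, tgt: str, outputs: dict, inputs: dict) -> bool:
    """One MIB cycle for the move src -> tgt.

    Returns False when fullBuffer is signalled; in that case neither buffer
    stores anything and the control unit has to resend the move.
    """
    out_buf, in_buf = outputs[src], inputs[tgt]
    if out_buf.full() or in_buf.full():
        return False
    in_buf.entries.append([src, BOTTOM])   # value will later arrive from src
    out_buf.entries.append([tgt, BOTTOM])  # value will later be sent to tgt
    return True


# Hypothetical machine: one output buffer and one single-slot input buffer.
outputs = {"u1.o": Buffer(2)}
inputs = {"u2.l": Buffer(1)}
first = broadcast("u1.o", "u2.l", outputs, inputs)
second = broadcast("u1.o", "u2.l", outputs, inputs)
print(first, second)
```

In this usage the first move is registered in both buffers, while the second one finds the input buffer full, so the function reports a stall and leaves both buffers unchanged.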
If a processing unit finds entries (adr_1, x_1), ..., (adr_m, x_m) with x_i ≠ ⊥
at the heads of its m input buffers and there is free space in its n output
buffers, it can react and consume the entries (adr_1, x_1), ..., (adr_m, x_m) to
produce new result values y_1 := f_1(x_1, ..., x_m), ..., y_n := f_n(x_1, ..., x_m),
where f_1, ..., f_n are the functions associated with that PU (assuming that the
number of copies of each result to produce is 1). Each output value y_i is then
stored in the entry (tgt, ⊥) of output buffer i that is closest to the head of
the output buffer, i.e., that entry is replaced with (tgt, y_i). If there is no
such entry, then a new entry (⊥, y_i) is placed at the tail of output buffer i,
and the next target address for this output buffer will be stored in this entry.
Note that it is possible that the result value has been computed before a move
instruction has been issued by the control unit to move it to another place.
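The firing rule can be illustrated for a PU with a single output buffer. The sketch below is one possible reading of the rule: it assumes that a waiting entry with a target but no value counts as free space, that None stands for the missing-value and missing-address marker, and that entries are mutable [adr, val] lists. The function name try_fire and the addresses in the usage lines are hypothetical.

```python
from collections import deque

BOTTOM = None  # marker for a missing value or a missing target address


def try_fire(inputs, output, capacity, f):
    """Fire a PU with one output buffer once; return True if it fired.

    inputs   : list of deques of [adr, val] entries (head at index 0)
    output   : deque of [adr, val] entries
    capacity : maximum number of entries in the output buffer
    f        : the function computed by this PU
    """
    # Every operand head must exist and must already carry a value.
    if any(not buf or buf[0][1] is BOTTOM for buf in inputs):
        return False
    # Entry (tgt, BOTTOM) closest to the head, if there is one.
    slot = next((e for e in output if e[1] is BOTTOM), None)
    if slot is None and len(output) >= capacity:
        return False  # no room for a new (BOTTOM, y) entry
    operands = [buf.popleft()[1] for buf in inputs]
    y = f(*operands)
    if slot is not None:
        slot[1] = y                  # complete the waiting entry
    else:
        output.append([BOTTOM, y])   # result waits for a target address
    return True


left = deque([["u1.o", 3]])
right = deque([["u2.o", 4]])
out = deque([["u3.l", BOTTOM]])
fired = try_fire([left, right], out, 2, lambda a, b: a + b)
print(fired, list(out))
```

Here both operand heads are complete, so the PU fires and the sum is written into the entry that was already waiting for the target u3.l.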
The output buffers are responsible for the final transport of data by sending
messages between PUs over the DTN. Such a message (src, tgt, val) consists
of the address src of the sending output buffer, the address tgt of the target
input buffer, and the value val that is transported by the message. A message
(src, tgt, val) is created when the output buffer with address src has a
completed entry (tgt, val) at its head. This message is then sent to the input
buffer tgt via the DTN. When it finally reaches the input buffer tgt, the input
buffer replaces the entry (src, ⊥) closest to its head with (src, val), which
may trigger a new firing of its PU. Additionally, the output buffers snoop the
MIB to receive new target addresses for their values. If output buffer src
sees the move instruction src → tgt on the MIB, it checks whether it
contains an entry (⊥, y). If so, it replaces the one closest to its head with
(tgt, y). Otherwise, it creates a new tail entry (tgt, ⊥) if there is still
space available. If there is no space, it signals fullBuffer to the control
unit, which then has to stall and resend the move instruction later. The input
buffers also always snoop the two interconnection networks, i.e., the MIB and
the DTN. As explained above, address entries (src, ⊥) are put in order in the
input buffer tgt whenever a move instruction src → tgt is seen on the MIB, and
an available entry (src, ⊥) is completed with the value val when a message
(src, tgt, val) arrives. Note that values from different output buffers reorder
appropriately in an input buffer based on the order of output buffer addresses
registered in the input buffer. However, the DTN must maintain the ordering of values sent







Figure 2.3.: Execution framework of a SCAD processor that uses a fat binary tree
as data transport network
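The input buffer behaviour described above can be illustrated with a small model. The following is only a minimal sketch under simplifying assumptions: the class name `InputBuffer`, its method names, and the buffer addresses `pu1@o` and `pu2@o` are hypothetical, buffer capacity is ignored, and the MIB and DTN are reduced to direct method calls.

```python
from collections import deque

EMPTY = object()  # stand-in for the "no value yet" slot of an entry (assumption)


class InputBuffer:
    """Toy model of an input buffer with an address lane and a value lane."""

    def __init__(self):
        self.entries = deque()  # each entry is a list [src, value]

    def register_move(self, src):
        """A move src -> this buffer was seen on the MIB: reserve a slot at the tail."""
        self.entries.append([src, EMPTY])

    def deliver(self, src, val):
        """A DTN message from src arrived: fill the open slot for src closest to the head."""
        for entry in self.entries:
            if entry[0] == src and entry[1] is EMPTY:
                entry[1] = val
                return
        raise LookupError(f"no reserved slot for source {src!r}")

    def head_ready(self):
        """True if the head entry already carries its value."""
        return bool(self.entries) and self.entries[0][1] is not EMPTY

    def consume(self):
        """The PU removes the completed head entry when it fires."""
        if not self.head_ready():
            raise RuntimeError("head entry is not complete")
        return self.entries.popleft()[1]


buf = InputBuffer()
buf.register_move("pu1@o")   # program order: first operand comes from pu1
buf.register_move("pu2@o")   # second operand comes from pu2
buf.deliver("pu2@o", 20)     # the value of pu2 happens to arrive first
ready_before = buf.head_ready()
buf.deliver("pu1@o", 10)
consumed = [buf.consume(), buf.consume()]
```

In this run the value from `pu2@o` arrives first but does not make the head ready; the values are consumed in the registered program order once both have arrived.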
Clearly, there is at least one store unit (SU) with two input buffers, one for
the memory addresses and another one for the values to be stored at the
corresponding addresses. There is no output buffer. Instead, the SU stores
the values in the order specified by the input buffers (in the program order)
to the main memory. Clearly, there is also at least one load unit (LU) that
has just one input buffer for the addresses and an output buffer for the values
loaded from memory. They will be sent through the DTN similar to output
values of other PUs, and whether the SU and the LU have to be synchronized
depends on a chosen weak memory model [Mosb93; AdGh96; StNu04]. In
the rest of this thesis, we assume a combined load-store unit (LSU) as shown
in Figure 2.4. The left operand buffer l holds memory addresses, the right
operand buffer r holds values to be stored in case of store operations, the
opcode buffer op holds the load or store opcode, and the copies buffer cp holds
the number of copies of loaded values to be produced in output buffer o in
case of load operations.








Figure 2.4.: The load-store unit in a SCAD machine
Move instructions of the form imm → tgt are also necessary to move any
immediate value imm to any input buffer tgt. That is, an entry (⊺, imm) is
enqueued to the tail of input buffer tgt by the MIB, where ⊺ is simply a placeholder
in the address lane, which indicates that the particular slot in the address lane
is occupied. Any further address entries will have to be enqueued behind this
slot. Of course, multiple independent move instructions can be broadcast
simultaneously by using multiple lanes in the MIB. Every sub-sequence of move
instructions in which no buffer address occurs more than once is a bundle that
can be broadcast at once without affecting the correctness of the move code.
Note that we have not included register files or any local storage other than
buffers in the SCAD architecture, although it is possible to use them just
like any other PU. In Chapter 3, we motivate a code generation technique
for SCAD that does not require any local memory other than the buffers. In
other words, we utilize the direct instruction communication facilitated by the
SCAD architecture to the fullest.
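The bundling condition stated above (no buffer address occurs more than once within a bundle) can be sketched as a greedy pass over the move sequence. This is an illustrative sketch only, not the bundling procedure of the thesis: the function `bundle_moves`, the optional `lanes` limit, and the convention that buffer addresses are strings containing `@` while everything else is an immediate are assumptions.

```python
def is_address(operand):
    """Assumed convention: buffer addresses are strings such as 'u1@l'."""
    return isinstance(operand, str) and "@" in operand


def bundle_moves(moves, lanes=None):
    """Split a move sequence into consecutive bundles without repeated addresses.

    moves is a list of (src, tgt) pairs in program order; lanes optionally
    limits the number of moves per bundle (number of MIB lanes).
    """
    bundles, current, used = [], [], set()
    for src, tgt in moves:
        addrs = {a for a in (src, tgt) if is_address(a)}
        lanes_full = lanes is not None and len(current) >= lanes
        if current and (addrs & used or lanes_full):
            bundles.append(current)      # close the bundle and start a new one
            current, used = [], set()
        current.append((src, tgt))
        used |= addrs
    if current:
        bundles.append(current)
    return bundles


moves = [
    (3, "u1@l"), (4, "u1@r"), ("add", "u1@op"),
    ("u1@o", "u2@l"), ("u1@o", "u2@r"),   # u1@o occurs twice
]
unlimited = bundle_moves(moves)
two_lanes = bundle_moves(moves, lanes=2)
```

Without a lane limit, the first four moves form one bundle and the second use of `u1@o` starts a new one; with two lanes, the sequence is split into three bundles.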
2.1.3. Control flow in SCAD
The control flow is implemented in SCAD using move instructions whose tar-
get is the control unit. The control unit (CU) maintains the program counter
pc that is used to address the instruction memory. As long as a valid address
is available in pc, the corresponding move instruction is fetched and issued.
The CU is responsible for both conditional and unconditional branches. Unconditional
branches are simply encoded as move instructions to the special
address pc. That is, the move instruction adr → pc sets the program counter
in the CU to adr, and the CU fetches the instructions starting from address adr
in subsequent cycles.
Conditional branches are handled by input buffers of the CU. To this end,
the CU is equipped with three input buffers as shown in Figure 2.5: cu@c
holds the branch condition, cu@then holds the ‘then’ address (or
branch target address), and cu@else holds the ‘else’ address. Conditional
branching is implemented as follows: First, the branch target address
is moved to cu@then: when the instruction adr → cu@then is fetched by the CU, it
immediately moves the branch target address adr to cu@then. Next, the
move instructions that compute the branch condition are issued. Finally, a move
instruction from the output buffer where the branch condition will be produced
to cu@c is issued. When this instruction is fetched by the CU, the CU destroys its local pc
and automatically moves pc + 1 to cu@else. Since a valid address is now not
available in pc, the CU stalls or stops fetching further instructions. Once the
branch condition is computed and transported to cu@c, the heads of all input
buffers of the CU will have valid entries. This triggers the CU, and a new local
pc is defined appropriately. Note that both buffers cu@then and cu@else are
populated by the CU itself (see Figure 2.5). Furthermore, since the CU stalls
on each conditional branch until the branch outcome is known, a single entry
is sufficient in all its input buffers. Consequently, the buffer cu@c does not
need an address lane.






Figure 2.5.: The control unit in a SCAD machine
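The stalling behaviour of the CU on a conditional branch can be traced with a toy model. The sketch below is a simplification under stated assumptions: the class `ControlUnit`, its methods, and the addresses `u1@o`, `u2@l`, `u2@o` are hypothetical, PUs are not modelled, and we assume that a true condition selects the address in cu@then and a false condition selects the address in cu@else.

```python
class ControlUnit:
    """Toy control unit that stalls on conditional branches."""

    def __init__(self):
        self.pc = 0            # None models "no valid address in pc" (stalled)
        self.then_addr = None  # single-entry buffer cu@then
        self.else_addr = None  # single-entry buffer cu@else

    def fetch(self, src, tgt):
        """Process the move instruction src -> tgt fetched at the current pc."""
        if tgt == "pc":                 # unconditional branch: adr -> pc
            self.pc = src
        elif tgt == "cu@then":          # branch target: adr -> cu@then
            self.then_addr = src
            self.pc += 1
        elif tgt == "cu@c":             # condition move: stall until it arrives
            self.else_addr = self.pc + 1
            self.pc = None
        else:                           # any other move is simply issued
            self.pc += 1

    def condition_arrived(self, cond):
        """All CU input buffer heads are valid now: define the new pc."""
        self.pc = self.then_addr if cond else self.else_addr
        self.then_addr = self.else_addr = None


def run_branch(cond):
    cu = ControlUnit()
    cu.fetch(7, "cu@then")        # address 0: branch target 7 -> cu@then
    cu.fetch("u1@o", "u2@l")      # address 1: part of computing the condition
    cu.fetch("u2@o", "cu@c")      # address 2: condition move, the CU stalls
    stalled = cu.pc is None
    cu.condition_arrived(cond)
    return stalled, cu.pc


taken = run_branch(True)
not_taken = run_branch(False)
```

In both runs the CU is stalled after fetching the move to cu@c; it resumes at address 7 when the condition is true and at address 3 (the old pc + 1) otherwise.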
For the experimental results in this thesis, we use a SCAD simulator that stalls
on encountering branches. As already mentioned, this thesis’s main focus is
to study and evaluate improvements in performance by executing programs
in SCAD by direct instruction communication (avoiding the use of registers).
Nevertheless, to exploit instruction-level parallelism (ILP) across control flow
boundaries, SCAD may either utilize branch prediction for speculative exe-
cution similar to dynamically scheduled superscalar machines or predicated
execution similar to statically scheduled VLIW processors. In the following





In superscalar processors [John91; SmSo95], instructions are ordered in pro-
gram order in the reorder (FIFO) buffer. This allows the processor to flush all
entries in different data structures (reservation station, forward reference table,
and reorder buffer) when the branch instruction is at the head of the reorder
buffer, and the predicted branch outcome was incorrect. Instead of a sin-
gle reorder buffer, we have multiple input and output FIFO buffers in SCAD
processors that are populated in program order. When a branch condition
outcome is predicted to speculatively execute move instructions along the predicted
control flow path, the CU in SCAD will signal speculativeExecution
to all PUs and the LSU so that they can ‘mark’ the current tail of their input
and output buffers. This indicates for each buffer that all entries that follow
the marked entry are under speculation. Similarly, when the actual branch
condition outcome is computed by a PU and transported to the CU, the CU
will signal predictionStatus to all PUs and the LSU, so that they can either
‘unmark’ the last marked entry in their buffers if the prediction was correct
or flush all entries that follow the last marked entry in their buffers in case
of a wrong prediction. To that end, the CU in SCAD has an additional input
buffer cu@p to hold predicted branch outcomes as shown in Figure 2.6, and
the input buffers have multiple entries to support a speculation depth greater
than one. The maximal speculation depth is then determined by the size of
input buffers of the CU. Also, note that the input buffer cu@c that holds ac-
tual branch outcomes must now have an address lane since the actual branch
condition outcomes for branches currently under speculation are computed in
dataflow order (and therefore not necessarily in program order) before they
are transported to the CU. Either static branch prediction or dynamic branch
prediction may be used (in this case, we assume in Figure 2.6 that the branch







Figure 2.6.: The control unit in a SCAD machine with branch prediction
For the correctness of the branch prediction scheme, the CU must broadcast
the predictionStatus signals in the same order
as the corresponding speculativeExecution signals. This
is naturally so because the CU’s input buffers behave similarly to input buffers
of any PU: the CU will also always consume values from the head of its input
buffers, and these buffers are filled only from the tail. For conditional branch-
ing, the CU will first enqueue corresponding entries to the tail of its input
buffers, then set the program counter appropriately based on predicted branch
outcome, and finally broadcast the speculativeExecution signal to all PUs
and the LSU. If values are available at the heads of all its input buffers, the
CU will consume these values, compare actual and predicted branch outcomes,
and then broadcast the appropriate predictionStatus signal to all PUs and
the LSU. Of course, if the branch prediction was incorrect, all entries in the
CU’s input buffers are flushed, and the program counter is reset appropriately
based on the actual branch outcome. Therefore, the comparison of actual and
predicted branch outcomes and the subsequent predictionStatus signaling is
always performed in the order of prediction of corresponding branch outcomes
and subsequent speculativeExecution signaling.
Note that not only the buffers of PUs and the LSU but also the routers in
the DTN must ‘mark’ speculative execution so that all appropriate values in
the SCAD machine are flushed or discarded in the event of a wrong branch
prediction. However, there is an important difference between the behavior of
PU buffers and LSU buffers in SCAD in the context of speculative execution.
When an entry under speculation reaches the head of input buffers of the LSU,
the LSU has to stall if the corresponding operation is a memory write operation
so that no memory updates are carried out until the corresponding branch is
evaluated. On the other hand, PUs and the DTN can continue operating nor-
mally until a wrong branch prediction is signaled (via the predictionStatus
signal). That is, PUs can consume available values from input buffer heads
(even if these values are under speculation), compute, and enqueue the results
to the tails of their output buffers. Similarly, the DTN can transport available
result values (even if these values are under speculation) from the heads of
output buffers to appropriate input buffers. This is possible since the val-
ues are not stored in a shared namespace (register or memory) that cannot
be rolled back in case of wrong predictions. Instead, they are produced and
directly communicated for consumption by PUs. This, in fact, enables speculative
execution in SCAD processors. Clearly, PUs, DTN routers, and the
LSU must maintain a count of the number of consumed speculation marks, so
that wrong entries are not ‘unmarked’ on receiving subsequent predictionStatus
signals that report correct predictions.
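The marking and flushing of buffer entries under speculation can be sketched for a single FIFO buffer. This is one possible reading rather than the hardware design of the thesis: the class `SpeculativeFifo` and its method names are hypothetical, and we assume that each predictionStatus signal resolves the oldest outstanding mark, which follows from the in-order signalling argued above. Marks whose preceding entries have all been consumed are kept at position 0, which plays the role of counting consumed speculation marks.

```python
class SpeculativeFifo:
    """Toy FIFO buffer with speculation marks."""

    def __init__(self):
        self.entries = []
        self.marks = []  # for each open speculation: number of entries before the mark

    def enqueue(self, value):
        self.entries.append(value)

    def dequeue(self):
        """Consume the head; marks move one position towards the head."""
        value = self.entries.pop(0)
        self.marks = [max(m - 1, 0) for m in self.marks]
        return value

    def speculative_execution(self):
        """speculativeExecution signal: mark the current tail."""
        self.marks.append(len(self.entries))

    def prediction_status(self, correct):
        """predictionStatus signal for the oldest outstanding speculation."""
        mark = self.marks.pop(0)
        if not correct:
            del self.entries[mark:]  # flush everything behind the mark
            self.marks.clear()       # younger speculations are flushed as well


def scenario(first_correct, second_correct):
    buf = SpeculativeFifo()
    buf.enqueue("a"); buf.enqueue("b")
    buf.speculative_execution()            # first predicted branch
    buf.enqueue("c"); buf.enqueue("d")
    buf.speculative_execution()            # second predicted branch
    buf.enqueue("e")
    buf.prediction_status(first_correct)
    if first_correct:
        buf.prediction_status(second_correct)
    return buf.entries


both_correct = scenario(True, True)
second_wrong = scenario(True, False)
first_wrong = scenario(False, False)
```

A wrong first prediction discards everything enqueued after the first mark, including the entries of the younger speculation.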
Predicated execution
For predicated execution, the control flow is translated to dataflow, and the
resulting dataflow graph is executed. Nodes of a dataflow graph are associated
with operations that are applied to operands sent to these nodes along the
edges between the nodes. The control flow of programs is displayed in the
dataflow graphs using special nodes such as switch and select nodes, and it
is not a trivial task to implement these nodes as processing units in SCAD.
Consider the conditional statement:
if ϕ(z) then {x = f(x, y);} else {x = g(x);}
The assignment x = f(x, y) is executed if the boolean expression ϕ(z) evaluates
to true. Otherwise, the assignment x = g(x) is executed. Depending on when
the boolean condition is checked, there are two ways to translate the above
conditional statement. Dataflow graphs shown in Figures 2.7a and 2.7b both


















Figure 2.7.: Dataflow graphs for conditional statement
In the dataflow graph shown in Figure 2.7a, special nodes switch and merge
are used to implement the conditional branching. The switch node has two
inputs: a boolean input c (shown by a dashed edge) and in. It has two outputs:
out0 and out1. If the boolean input c is true (respectively false), the value
in the other input in is forwarded to out1 (respectively out0). To execute the
conditional statement, the switch node directs the input value xinp to one of
the two paths depending on the outcome of the boolean expression ϕ(z). This
triggers the firing of nodes in the selected path while nodes in the other path
are not executed. Finally, the merge node fires when the value is available
in any one of its two inputs and forwards this value to the output. In the
dataflow graph shown in Figure 2.7b, a special node select is used. The select
node has one output and three inputs: a boolean input c (shown by a dashed
edge), in0 and in1. If the boolean input c is true (respectively false), the
value in in1 (respectively in0) is forwarded to the output. To execute the
conditional statement, the select node selects the value computed by one of
the two paths depending on the outcome of boolean expression ϕ(z). Unlike
the switch based dataflow graph which first evaluates the branch condition, the
select based dataflow graph first executes both paths irrespective of the branch
condition outcome and then selects the computed result from the appropriate
path depending on the branch condition outcome.
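The difference between the two translations can be made concrete with a functional sketch of the switch, merge, and select nodes. The bodies of `phi`, `f`, and `g` are arbitrary placeholders chosen for illustration, `None` models an absent token, and only the value of x is routed through the switch node in this simplified version; none of this is taken from Figure 2.7 beyond the node semantics described above.

```python
trace = []  # records which path operations were actually executed


def phi(z):
    return z > 0          # placeholder branch condition


def f(x, y):
    trace.append("f")
    return x + y          # placeholder 'then' computation


def g(x):
    trace.append("g")
    return 2 * x          # placeholder 'else' computation


def switch(c, value):
    """Forward value to out1 if c is true, else to out0; returns (out0, out1)."""
    return (None, value) if c else (value, None)


def merge(a, b):
    """Forward whichever input carries a value."""
    return a if a is not None else b


def select(c, in0, in1):
    """Forward in1 if c is true, else in0; both inputs have been computed."""
    return in1 if c else in0


def conditional_switch(x, y, z):
    out0, out1 = switch(phi(z), x)
    then_result = f(out1, y) if out1 is not None else None
    else_result = g(out0) if out0 is not None else None
    return merge(then_result, else_result)


def conditional_select(x, y, z):
    return select(phi(z), g(x), f(x, y))


switch_result = conditional_switch(3, 4, 1)
switch_trace = list(trace)
trace.clear()
select_result = conditional_select(3, 4, 1)
select_trace = list(trace)
```

Both versions return the same value, but the trace of the switch based version contains only the operation of the selected path, whereas the select based version executes both `f` and `g`.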
Consider the dataflow graph in Figure 2.7a. To implement the same in







Figure 2.8.: Dataflow graph for while loop statement using switch node
SCAD, the SCAD compiler must generate move code for computations in
both ‘if’ and ‘else’ paths. However, at runtime, only move instructions regis-
tered for computation of one path will find values to transport. To ensure the
correct execution, it is necessary that data transports registered for the com-
putation of the other path are flushed or discarded. In other words, the switch
node can be implemented as a PU in the SCAD processor only with additional
provisions. We do not explore this possibility in this thesis. Notice that the im-
plementation of the conditional statement by the switch based dataflow graph
is similar to conditional branching by stalling the CU, where the CU stalls on
encountering a branch move instruction until the branch condition outcome
is computed. This is a sequential computation of the conditional statement.
On the other hand, the select based dataflow graph performs a parallel com-
putation of the conditional statement but requires more hardware resources.
We observe that it is possible to implement the select node as a PU in the
SCAD processor. To this end, the PU will have three inputs, c, in0 and in1,
corresponding to the inputs of the select node and a fourth input to hold the
number of copies to produce. If the boolean input is available, the select node
in the dataflow graph will forward the appropriate input value, if available, to
its output irrespective of the availability of other input value. However, the
select PU in a SCAD processor must wait for values to be available at all its
input buffer heads for consumption, then produce copies of the appropriate
input value in the output buffer and discard the other input value. It is im-
portant that the PUs in a SCAD processor consume all input values registered
to execute an operation to avoid any stray values (values not accounted for).
The dataflow graph for loop statements must be constructed using switch
nodes to ensure the termination of the loop. For example, consider the while
statement:
while ϕ(x) { x = f(x); }
It may be expressed as
if ϕ(x) then {do {x = f(x);} while ϕ(x);} else {}
If the above conditional statement is translated to a dataflow graph using only
select nodes, it will wait forever for the computation in the ‘then’ branch to
terminate. Therefore, the switch node must be used as shown in Figure 2.8 so
that the loop body is executed only after the condition check. Since the switch
node cannot be implemented as a PU in SCAD without additional support,
the SCAD processors must rely on branch prediction to exploit ILP across
loop boundaries. However, various techniques such as loop unrolling and hy-
perblock formation [MLCH92] to cover conditional statements are applicable
in SCAD.
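The rewriting of the while statement into a conditional statement with a do-while body, as used above, can be checked with a small sketch. The function names and the sample functions passed in are hypothetical; Python has no do-while construct, so it is emulated with an explicit break.

```python
def while_loop(x, phi, f):
    """while phi(x) { x = f(x); }"""
    while phi(x):
        x = f(x)
    return x


def guarded_do_while(x, phi, f):
    """if phi(x) then { do { x = f(x); } while phi(x); } else {}"""
    if phi(x):
        while True:          # emulates do { ... } while phi(x)
            x = f(x)
            if not phi(x):
                break
    return x


def below_ten(x):
    return x < 10


def add_three(x):
    return x + 3


results = [(while_loop(x, below_ten, add_three),
            guarded_do_while(x, below_ten, add_three)) for x in range(15)]
```

Every pair in `results` holds the same value twice, including the cases where the loop body is never executed.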
2.1.4. Remarks on the execution paradigm
The program counter of the CU steers the control flow of the program execu-
tion in SCAD. In this respect, SCAD adheres to the control-flow computing
model. However, intermediate results are produced and communicated di-
rectly for consumption instead of using a shared namespace for storing these
values. In this respect, SCAD adheres to the dataflow computing model. This
way, SCAD employs a hybrid control-flow dataflow model of computation. A
code generator for SCAD must compile a source code program to a sequence
of move instructions that are registered in the execution framework of SCAD
by the control unit. Of course, to decide about the communication of values
between PUs, the SCAD compiler must know which values are produced by
the PUs. To this end, it must allocate instructions to PUs (instruction place-
ment). The actual firing of PUs (instruction issue) is determined at runtime
when appropriate input (operand) values are available. From this point of
view, SCAD is categorized as an exposed datapath architecture that presents
a static placement dynamic issue (SPDI) scheduling model. See Chapter 6 for
a review of other architectures based on the above characterizations.
Hardware complexity
The intermediate results from executions of PUs in SCAD are stored in their
output buffers. Both address and value lanes in each output buffer are ‘pure’
FIFO buffers. The MIB and PU always add address and value entries, respec-
tively, to the tail of respective lanes in any output buffer. The DTN always
removes these entries from the head of respective lanes. Since the number of
entries in output FIFO buffers scales better than the number of registers in a register
file, the output buffers in SCAD are capable of holding more intermediate results.
Moreover, with a central register file, multiple instructions will have to
use a limited number of ports to write their execution results. When increasing
the number of ports, the area and access times of registers increase at rates
of n^3 and n^{3/2}, respectively [RDKM00], while power dissipation increases super-linearly at
a rate of n^2 to n^3 [ZyKo98]. In contrast, every PU in SCAD has its own
output buffer.
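To give a feeling for these growth rates, consider doubling the number of ports. The following worked example assumes that n denotes the number of ports, as the sentence above suggests, and uses the symbols A (area), T (access time), and P (power dissipation), which are introduced here only for illustration:

```latex
\[
A(n) \propto n^{3}, \qquad T(n) \propto n^{3/2}, \qquad n^{2} \lesssim P(n) \lesssim n^{3}
\]
\[
\frac{A(2n)}{A(n)} = \frac{(2n)^{3}}{n^{3}} = 8, \qquad
\frac{T(2n)}{T(n)} = \frac{(2n)^{3/2}}{n^{3/2}} = 2\sqrt{2} \approx 2.83, \qquad
4 \le \frac{P(2n)}{P(n)} \le 8 .
\]
```

Under these rates, doubling the ports of a central register file costs eight times the area and almost three times the access time, which is the scaling problem that per-PU output buffers avoid.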
Notice that input buffers in SCAD are more complex. The address
lane in an input buffer is still a FIFO buffer since address entries are always
populated from the tail by the MIB and are removed from the head by the
PU. On the other hand, though entries in the value lane are removed from the
head by PU, the values delivered via the DTN are not simply added to the tail.
Instead, the value val in a DTN message (src, tgt, val) must find its place in the
value lane adjacent to the address entry src closest to the head of the address
lane. This is necessary so that the correct operand values occur at the input
buffer heads of a PU for executing the next operation. In register machines
that execute typical reduced instruction set computer (RISC) instructions,
operand values are encoded by register names. Again, with a central register
file, multiple instructions will have to use a limited number of ports to read
their operand values, while every PU in SCAD has its own input buffers. In
dataflow machines, token-matching hardware [GuKW85] is required to find
the operand values that refer to the same operation. To match n left operand
values with n corresponding right operand values, a total of n! comparisons
are needed in the worst case. This is even assuming that all corresponding
operand values are already available for comparison. In reality, some of these
operand values may not yet be computed. In SCAD input buffers, only n
comparisons are needed in the worst case to determine the appropriate slot
for an incoming operand value. Furthermore, the synchronous registration of
move instructions guarantees that any incoming value will have a slot reserved
in the respective input buffer.
Compiler flexibility
The values can be transported only from the head of output buffers to the tail of
input buffers. Therefore, to move values from output buffers to input buffers,
the SCAD compiler must know the order of values that occur in these buffers.
To this end, the compiler must determine an order of instructions executed
by each PU, which in turn determines the order of result values in the output
buffer and the operand values in the input buffers of each PU. This means
that depending on the ordering of values, some value to be transported to an input
buffer might not currently be found at the head of the output buffer of the PU
that produces this value. In this case, the compiler will have to rotate the
current values at the head of the output buffer (via some input buffers) to access
the relevant value for transportation. The rotation of values is necessary
to avoid storing these values temporarily in a shared namespace (register file),
i.e., to execute the whole program by direct instruction communication,
avoiding the use of registers.
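The cost of accessing a value that is not at the head of an output buffer can be estimated with a simple queue model. This sketch is a strong simplification: it considers a single output buffer, treats one rotation as moving the head value directly back to the tail (in SCAD this goes via input buffers and a copy operation of a PU), and assumes every value is consumed exactly once. The function name `count_rotations` is hypothetical.

```python
from collections import deque


def count_rotations(buffer_order, access_order):
    """Count head-to-tail rotations needed to consume values in access_order."""
    queue = deque(buffer_order)
    rotations = 0
    for wanted in access_order:
        if wanted not in queue:
            raise ValueError(f"value {wanted!r} is not in the buffer")
        while queue[0] != wanted:
            queue.rotate(-1)     # the head value goes around to the tail
            rotations += 1
        queue.popleft()          # the wanted value is transported
    return rotations


in_order = count_rotations("abc", "abc")
last_first = count_rotations("abc", "cab")
swapped = count_rotations("abc", "bac")
```

If the values are consumed in the order in which they were produced, no rotation is needed; any other access order adds overhead moves, which is why the ordering of instructions on PUs matters.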
The rotation of values incurs not only additional moves to input buffers, but
also additional computational overhead since the values then need to be copied by
PUs to output buffers for later transportation (more details are given in Chapter 3).
Clearly, this transportation and computational overhead is not desirable
due to its adverse impact on performance. Therefore, the compiler must
try to minimize the overhead by appropriately ordering instructions on PUs.
With more PUs and thus more buffers, restrictions on accessing values lessen,
consequently requiring less overhead. Nevertheless, the ordering constraint
restricts the compiler in freely allocating instructions to PUs to maximize the
use of ILP. However, we will see in experimental results (in Section 4.3) that it
is not difficult to avoid the overhead in SCAD. In other words, utilizing direct
instruction communication comes at a reasonable cost in the SCAD execution
paradigm. Furthermore, the compromise in the use of ILP due to the ordering
constraint is negligible. By an experimental comparison of SCAD with
its variant architectures (discussed next), we show that the aforementioned
qualities can be attributed to the right blend of control-flow and dataflow
computing characteristics in SCAD. In the following sections, we introduce
subtle variants of SCAD: the dynamically ordered SCAD (DO-SCAD) with
more dataflow computing characteristics, where the compiler is completely
free to allocate instructions to maximize the use of ILP, but at the cost of
considerably more complex hardware compared to SCAD; and the statically
ordered SCAD (SO-SCAD) with more control-flow computing characteristics,
where hardware complexity is lower, but the compiler is even more restricted
in effectively exploiting ILP compared to SCAD.

2.2. Dynamically Ordered SCAD

Figure 2.9.: A processing unit in a dynamically ordered SCAD machine
In SCAD, output buffer addresses registered in program order in the address
lane in input buffers are used at runtime to reorder values that arrive at PU
inputs via the DTN. This way, it is ensured that the correct operands for
executing the next operation occur at the heads of the input buffers of the
PU. Figure 2.9 shows a PU in a dynamically ordered SCAD (DO-SCAD)
machine. Instead of FIFO buffers, there is a pool of entries at PU inputs
and outputs. Unlike the ordering used in SCAD, tag matching is used by
PUs in DO-SCAD architectures to find the correct operands to execute an
operation. For simplicity, we assume that all tag values are determined at
compile time. A PU’s input pool holds pairs (val, t) of entries, where val is
the operand value and t is the tag associated with that operand. In the PU’s
outputs, there are two kinds of tuples: (1) (val, t) where val is the result of
an execution of an operation, and t is the tag associated with its operands,
and (2) (adr, t_r, t_o) where adr is the address of an input of the destination PU
to which the value val associated with operand tag t_o must be transported
and t_r is the tag associated with and transported along with val for operand
matching in the destination PU. Clearly, the move instructions must now be
augmented with operand and result tags as follows: src{t_o} → tgt{t_r}. For
example, consider two instructions i and j whose operands are associated with
tags t_i and t_j, respectively.

    x_{tgt(i)} = x_{srcL(i)} ⊙_i x_{srcR(i)}
    x_{tgt(j)} = x_{tgt(i)} ⊙_j x_{srcR(j)}

Note that the left operand of instruction j is the target of instruction i (x_{tgt(i)}).
Assume that instructions i and j are assigned to PUs m and n, respectively.
Then, the move instruction u_m@o{t_i} → u_n@l{t_j} transports the target of
instruction i from the output of PU m to the left input of PU n to execute
instruction j. The execution of a move program now proceeds as follows: the
move instruction src{t_o} → tgt{t_r} is broadcast via the MIB to all PUs. The
PU output with address src will add the entry (tgt, t_r, t_o) to its pool of entries.
A PU can fire if it finds operand values with matching tags for execution. The
result of an execution along with an operand tag is added to the output pool
of tuples (val, t_o). At the PU’s outputs, the operand tag in the pool of tuples
(adr, t_r, t_o) is matched with that in the pool of tuples (val, t_o). If a match
is found, the value val and result tag t_r are made available to the DTN for
transporting to the PU input with address adr.
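The tag matching at a PU output described above can be sketched as follows. This is a minimal illustration, not the DO-SCAD hardware: the class `OutputPool`, its method names, and the example address and tags are hypothetical, and we assume that a matched value is removed from the pool, i.e., that each result has a single consumer.

```python
class OutputPool:
    """Toy model of the output pool of a PU in a DO-SCAD machine."""

    def __init__(self):
        self.values = []    # tuples (val, t_o) produced by the PU
        self.targets = []   # tuples (adr, t_r, t_o) registered via the MIB

    def add_result(self, val, t_o):
        """The PU produced val for the operation with operand tag t_o."""
        self.values.append((val, t_o))

    def register_move(self, adr, t_r, t_o):
        """The move src{t_o} -> adr{t_r} was seen on the MIB."""
        self.targets.append((adr, t_r, t_o))

    def match(self):
        """Return DTN messages (adr, t_r, val) for all matching tag pairs."""
        messages = []
        for target in list(self.targets):
            adr, t_r, t_o = target
            for value in self.values:
                if value[1] == t_o:
                    messages.append((adr, t_r, value[0]))
                    self.values.remove(value)     # single consumer (assumption)
                    self.targets.remove(target)
                    break
        return messages


pool = OutputPool()
pool.register_move("u2@l", "t_j", "t_i")   # move registered before the result exists
before = pool.match()
pool.add_result(42, "t_i")
after = pool.match()
```

The move can be registered before or after the result is produced; a message is released only when both tuples with the same operand tag are present.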
To avoid overhead (additional rotation of values), the SCAD compiler has
to carefully allocate and decide about the order of instructions on the PUs, so
that the order of values dequeued from the output buffers and enqueued to the
input buffers concur. This reduces the SCAD compiler’s flexibility to allocate
instructions to PUs to utilize all ILP contained in a program. In DO-SCAD,
matching tags at the PU output ensures that correct values are sent from
the output pool to input pools, and matching tags at inputs ensures that the
correct operand values are consumed by the PU for execution. So there is no
notion of any compiler determined ordering of values in DO-SCAD. Therefore,
the DO-SCAD compiler has the complete freedom to allocate instructions to
PUs to utilize maximal ILP. However, this comes at the cost of considerably
more complex hardware that is needed to accommodate and match tags. More-
over, DO-SCAD architectures face the same memory ordering problem that
dataflow computers suffer from.
2.3. Statically Ordered SCAD
In a statically ordered SCAD (SO-SCAD) architecture, the PUs only rely
on the order of arrival of values at the PU inputs to identify the operands
for executing the next operation. Figure 2.10 shows a PU in a SO-SCAD
machine. Note that there are no address lanes in the input buffers. The input
and output values of PUs now reside in ‘pure’ FIFO buffers unlike the more
complex input buffers of the SCAD architecture, thus reducing the hardware
complexity and simplifying the execution of a move program. When a move
instruction src → tgt is broadcast via the MIB to all PUs, only the output
buffer with address src will add the entry (tgt,) to its tail. Like SCAD,
the result of a PU’s execution is added to the value slot in the entry (adr,)
closest to the head of its output buffer. The DTN snoops the head of output
buffers for transporting values to the addressed input buffers. However, the








Figure 2.10.: A processing unit in a statically ordered SCAD machine
In SCAD, a compiler determined order of operand values (implied by order
of instructions on PUs) is enforced locally at each input buffer in each PU
at runtime. This ensures that irrespective of when these values are produced
and transported, correct operand values always occur at input buffer heads for
executing the next operation. In other words, the runtime reordering provision
in input buffers allows the execution in SCAD to tolerate a variable PU and
DTN latency. The ordering of instructions is determined at compile-time and
enforced at runtime in SCAD, making it a hybrid ordered architecture. In DO-
SCAD, ordering is both determined and enforced at runtime, thus dynamically
ordered. In SO-SCAD PUs, the next operation to execute is decided simply by the
order of arrival of values at PU inputs, which in turn depends on when these
values are produced and transported from some output buffers. To ensure
the correct order of arrival of values at PU inputs, it is necessary that the
SO-SCAD compiler not only determines the order of instructions on PUs but
also enforces this order in some way (discussed later). Therefore, ordering is
both determined and enforced at compile time in SO-SCAD, thus statically
ordered. Notice that the SO-SCAD execution paradigm resembles the classic
control-flow (or register) architectures where the compiler encodes operands
and results of an instruction by register addresses. Similarly, the DO-SCAD
execution paradigm resembles classic dataflow architectures.
There are two apparent ways in which the SO-SCAD compiler can enforce
the ordering: (1) The instruction issue (firing of PUs) times can be statically
determined, and the control unit can trigger firings of PUs at these predeter-
mined times to ensure that the produced values are transported and arrive
in the expected order at each input buffer. To that end, the latency of PUs
and DTN must be exposed so that the compiler can derive a correct static
schedule similar to VLIW compilers [FERN84; Gros00]. The main drawback
is that a statically determined instruction issue inhibits execution in dataflow
order (thus restricting the use of ILP) and cannot adapt to a variable latency
of PUs and DTNs. Furthermore, it is considerably more difficult to avoid over-
head (due to the need to rotate values) in a SO-SCAD machine with static
instruction issue (see experimental results in Section 4.3) compared to a SCAD
machine. (2) Alternatively, the compiler may determine that all values arriv-
ing at the same input buffer are produced in the expected order in the same
output buffer (i.e., by the same PU). Since the DTN always transports values
from heads of output buffers, it is guaranteed that the operand values destined
to the same input buffer will arrive in the expected order. Although this way
of enforcing an order seems like a good approach at first, the compiler is so
restricted in allocating and ordering instructions on PUs that it often requires
as many PUs as the number of instructions to avoid an additional rotation of
values or overhead (see experimental results in Section 4.3).
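The restriction imposed by the second approach can be expressed as a simple check on a move program: every input buffer may receive values from one output buffer only. The sketch below tests just this necessary condition and does not verify the order of production; the function name `single_producer_violations` and the buffer addresses are hypothetical.

```python
def single_producer_violations(moves):
    """Return the moves whose target already receives values from another source.

    moves is a list of (src, tgt) pairs of output and input buffer addresses
    in program order.
    """
    producer = {}       # input buffer address -> first output buffer seen
    violations = []
    for src, tgt in moves:
        first = producer.setdefault(tgt, src)
        if first != src:
            violations.append((src, tgt))
    return violations


ok_program = [("u1@o", "u3@l"), ("u2@o", "u3@r"), ("u1@o", "u3@l")]
bad_program = [("u1@o", "u3@l"), ("u2@o", "u3@r"), ("u2@o", "u3@l")]
ok_result = single_producer_violations(ok_program)
bad_result = single_producer_violations(bad_program)
```

In the second program, `u3@l` is fed by both `u1@o` and `u2@o`, so the order of arrival at that input buffer would no longer be guaranteed by the FIFO order of a single output buffer.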
Chapter 3
Code Generation Techniques for SCAD

Contents
3.1. Register Oriented Code Generation
3.2. Queue Oriented Code Generation
3.2.1. Code generation for queue machines
3.2.2. SCAD code from queue code
3.2.3. Overhead
In this chapter, we explain the underlying principles for code generation for
SCAD architectures. We discuss two contrasting ways of generating move
code for SCAD machines: register oriented and queue oriented code gener-
ation, explaining why queue oriented code generation is more adequate for
SCAD architectures [BhJS16]. To simplify explanations, we consider move
code generation for a universal SCAD machine, which is defined as follows:
Definition 3.1 〈 Universal SCAD Machine 〉
A universal SCAD machine is a SCAD machine with a single universal
processing unit capable of executing memory accesses, standard binary and
unary operations. It has one output buffer ‘o’ to store the result of each
operation and four input buffers: ‘l’ to store the first (left) operand, ‘r’ to
store the second (right) operand, ‘op’ to store the operation to be executed,
and ‘cp’ to store the number of copies of the result to be added to the output
buffer.
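To make Definition 3.1 more tangible, the following sketch interprets move programs for a strongly simplified universal SCAD machine. It is not the simulator used in this thesis: the class `UniversalScad` is hypothetical, only binary arithmetic operations with immediate operands are supported (no memory accesses or unary operations), address lanes are omitted, and the PU is assumed to fire immediately, so that a value moved from ‘o’ is always already present.

```python
from collections import deque
import operator

BINARY_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class UniversalScad:
    """Toy universal SCAD machine with input buffers l, r, op, cp and output o."""

    def __init__(self):
        self.inputs = {name: deque() for name in ("l", "r", "op", "cp")}
        self.o = deque()

    def move(self, src, tgt):
        """Execute src -> tgt; src is either the output buffer 'o' or an immediate."""
        value = self.o.popleft() if src == "o" else src
        self.inputs[tgt].append(value)
        self._fire_while_possible()

    def _fire_while_possible(self):
        # The PU fires whenever all four input buffer heads hold a value.
        while all(self.inputs[name] for name in ("l", "r", "op", "cp")):
            left = self.inputs["l"].popleft()
            right = self.inputs["r"].popleft()
            op = self.inputs["op"].popleft()
            copies = self.inputs["cp"].popleft()
            self.o.extend([BINARY_OPS[op](left, right)] * copies)

    def run(self, program):
        for src, tgt in program:
            self.move(src, tgt)
        return list(self.o)


program = [
    (3, "l"), (4, "r"), ("add", "op"), (2, "cp"),       # 3 + 4, two copies in o
    ("o", "l"), ("o", "r"), ("mul", "op"), (1, "cp"),   # (3 + 4) * (3 + 4)
]
result = UniversalScad().run(program)
```

The ‘cp’ entry of 2 places two copies of the sum in the output buffer, so that both operands of the multiplication can be taken from ‘o’ without any register.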
3.1. Register Oriented Code Generation
Recall that it is possible to include register files with any arbitrary number of
registers, input ports (to read and/or write registers), and output ports (to
access read registers) in the SCAD architecture. A register file’s input ports
will receive immediate data (addresses of registers to be read or written) from
the control unit via the MIB. The output ports send data (register content) to
other processing units (PUs) via the DTN. Code generation for register ma-
chines is an extensively researched field, so that already established efficient
methods may be used to compile programs to a sequence of typical reduced
instruction set computer (RISC) instructions where operands and results of
an instruction are register addresses. A register oriented code generator for
SCAD may first translate these RISC instructions to corresponding move in-
structions for a SCAD machine and then analyze the resulting sequence of
move instructions for bypassing as many reads and writes of registers as pos-
sible.
[Figure 3.1 panels: expression tree; register program (surviving instructions: r1 ← r1 + r2, r0 ← r0 − r1); SCAD program (below)]
adr(x3)→ l; load→ op; 1→ cp;
o→ r0;
adr(x1)→ l; load→ op; 1→ cp;
adr(x2)→ l; load→ op; 1→ cp;
o→ l; o→ r; add→ op; 1→ cp;
r0 → l; o→ r; mul→ op; 1→ cp;
o→ r0;
adr(x4)→ l; load→ op; 1→ cp;
adr(x5)→ l; load→ op; 1→ cp;
o→ l; o→ r; div→ op; 1→ cp;
r0 → l; o→ r; sub→ op; 1→ cp;
Figure 3.1.: An expression tree with its register program, and corresponding SCAD
program
For example, consider the expression tree shown in Figure 3.1. According
to the Sethi-Ullman algorithm [SeUl70], at least 3 registers are required to
evaluate the expression tree without storing intermediate results to the main
memory. This is achieved by ordering the nodes by a depth-first traversal
(post-order traversal) of the tree and then assigning registers to nodes. The
resulting RISC program is shown in Figure 3.1. It is straightforward to translate
each RISC instruction to a set of move instructions for a universal SCAD
machine so that the contents of the registers are the same after executing the
RISC instruction on a traditional register machine and the corresponding set
of move instructions on the universal SCAD machine.
Figure 3.1 shows a SCAD program for a universal SCAD machine that
bypasses all accesses to registers r1 and r2. It is understood as follows: After
loading x3 to register r0, x1 + x2 may be evaluated without using registers r1
or r2. This is because after loading x1 and x2 from the main memory, these
values are available in that order in the output buffer o of the universal SCAD
machine for transportation to input buffers l and r, respectively, for computing
x1 + x2. The ∗ operation will now find its left operand x3 in register r0
and its right operand x1 + x2 at the head of output buffer o. After storing the
multiplication result x3 ∗ (x1 + x2) in register r0, the use of r1 and r2 may
similarly be avoided in evaluating x4/x5. Finally, the subtraction operation
will find its left operand x3 ∗ (x1 + x2) in register r0 and its right operand at
the head of output buffer o.
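The register requirement quoted above can be reproduced with a minimal sketch of the Sethi-Ullman labelling. It assumes a load/store machine in which every leaf must first be loaded into a register; the tuple encoding of the tree and the name `registers_needed` are illustrative choices, not taken from the thesis.

```python
# Sketch only: Sethi-Ullman labelling, assuming every leaf is loaded into a
# register before use (load/store machine). Names are hypothetical.

def registers_needed(node):
    """Return the number of registers needed to evaluate `node` without spills.

    A node is either a variable name (leaf) or a tuple (op, left, right).
    """
    if isinstance(node, str):
        return 1  # a loaded leaf occupies one register
    _, left, right = node
    l, r = registers_needed(left), registers_needed(right)
    # Evaluate the more demanding subtree first; a tie costs one extra register.
    return l + 1 if l == r else max(l, r)


# Expression tree of Figure 3.1: x3 * (x1 + x2) - x4 / x5
TREE = ("-", ("*", "x3", ("+", "x1", "x2")), ("/", "x4", "x5"))

if __name__ == "__main__":
    print(registers_needed(TREE))
```

Both subtrees of the root need two registers each, so the tie at the root yields the three registers mentioned in the text.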
Note that it is not possible to bypass all register accesses this way in general.
In the above example, x3 and the result of ∗ must be stored in register r0 (or
rotated incurring overhead), so that + and / can respectively be performed by
direct communication of corresponding operand values from the output buffer
to the input buffers. Clearly, the reason for this restriction is the ordering of
the instructions by the depth-first traversal that was motivated by the reuse
of registers in the first place. Furthermore, the depth-first ordering of instruc-
tions limits the use of ILP offered by programs. All instructions in each level
of an expression tree are independent, and the maximal ILP is used when
instructions are executed level-wise when computing expression trees.
3.2. Queue Oriented Code Generation
A queue machine [Voll70; FeEr81] reads operands for executing an operation
from the head of a queue and adds the results to the tail of that queue (see
Figure 3.2).
Figure 3.2.: Architecture of a queue machine
3.2.1. Code generation for queue machines
A queue program to evaluate an expression tree is generated by a breadth-
first traversal of the tree [FeEr81]. A consistent left to right or right to left
traversal ensures that the operands required to execute operations at one level
are available in the queue in the correct order. The queue program for the
expression tree and the contents of the queue after executing each instruction
of that program are shown in Figure 3.3. The queue instructions are listed
in Table 3.1.
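The breadth-first construction can be sketched as follows. The sketch orders the nodes level by level, deepest level first and left to right, and then runs the result on a single FIFO queue. The tuple encoding of the tree and the function names are assumptions made for illustration; copy counts are omitted because every value of a tree is used once.

```python
from collections import deque
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# Expression tree used in this chapter: x3 * (x1 + x2) - x4 / x5
TREE = ("-", ("*", "x3", ("+", "x1", "x2")), ("/", "x4", "x5"))


def queue_program(tree):
    """Order the nodes level by level, deepest level first, left to right."""
    levels = []                  # levels[d] holds the nodes at depth d
    frontier = [tree]
    while frontier:
        levels.append(frontier)
        frontier = [child for node in frontier if not isinstance(node, str)
                    for child in node[1:]]
    program = []
    for level in reversed(levels):
        for node in level:
            program.append(("load", node) if isinstance(node, str) else (node[0],))
    return program


def run(program, memory):
    """Execute the program on one FIFO queue; return the queue after each step."""
    queue, trace = deque(), []
    for instr in program:
        if instr[0] == "load":
            queue.append(memory[instr[1]])
        else:
            left, right = queue.popleft(), queue.popleft()
            queue.append(OPS[instr[0]](left, right))
        trace.append(list(queue))
    return trace
```

For the tree above the generated order is load x1, load x2, load x3, +, load x4, load x5, ∗, /, −, and the queue after the addition holds x3 followed by x1 + x2.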
queue instruction   description
load adr, n         Load data from memory address adr and add n copies of the
                    loaded value to the tail of the queue.
store adr           Store the value from the head of the queue to the memory
                    address adr.
opcode n            Dequeue the necessary operands from the head of the queue
                    to execute the operation opcode and add n copies of the
                    result to the tail of the queue.
swap                Dequeue two operands from the head of the queue, swap
                    them, and add them to the tail of the queue.
dup n               Dequeue one operand from the head of the queue and add n
                    copies of it to the tail of the queue.
goto PC, L          Unconditional branch: transfer the control from PC to PC+L.
ifGoto PC, L        Conditional branch: transfer the control from PC to PC+L
                    if the head of the queue holds true, else to PC+1.
Table 3.1.: List of queue instructions
[Figure 3.3 panels: expression tree; queue program; queue content (surviving rows below)]
[x3, x1 + x2]
[x3, x1 + x2, x4]
[x3, x1 + x2, x4, x5]
[x4, x5, x3 ∗ (x1 + x2)]
[x3 ∗ (x1 + x2), x4/x5]
[x3 ∗ (x1 + x2) − (x4/x5)]
Figure 3.3.: An expression tree with its queue program and the content of the
queue after executing each instruction
Basic blocks of programs are often represented by directed acyclic graphs
(DAGs). Generating a queue program for an expression tree is easy since an
expression tree is by definition a level-planar graph [ScLY02]. However, gen-
erating queue programs for general expression DAGs involves first converting
the DAG into a level-planar graph and then performing a breadth-first traver-
sal of the graph [ScLY02] as shown in Figure 3.4: The given expression DAG is
first levelized, which means that operations must only refer to operands at the
same level. This can be easily achieved by introducing dup operations, which
take a value from the head of the queue and add some copies of it to the tail
of the queue. Then, the graph is planarized. This means that crossing edges
are removed by inserting swap operations that take two values from the head
of the queue and add them in exchanged order to the tail of the queue. One
can sometimes avoid the introduction of swap operations by suitable ordering
of input or output nodes, but not in general. Finally, another levelization is
usually required since the swap operations may be placed at new levels.






























Figure 3.4.: An expression DAG with its levelized version, its planarized version,
the final level-planar expression DAG, and the obtained queue program
3.2.2. SCAD code from queue code
queue instruction   corresponding SCAD move instructions
load adr, n         adr → l; load → op; n → cp;
store adr           adr → l; o → r; store → op;
opcode n            o → l; o → r; opcode → op; n → cp;
swap                o → l; o → r; swap → op;
dup n               o → l; dup → op; n → cp;
goto PC, L          PC → l; goto → op; L → cp;
ifGoto PC, L        PC → l; o → r; ifGoto → op; L → cp;
Table 3.2.: Mapping queue machine instructions to move instructions of a universal
SCAD machine
To generate SCAD code from a given queue code, we map each queue instruc-
tion to a sequence of move instructions for the universal SCAD machine as
listed in Table 3.2 ([BhJS16]). Thus, the following theorem is stated.
Theorem 3.1 (Queue Simulation) A queue machine can be simulated
by a universal SCAD machine (consequently by any SCAD machine with
multiple processing units).
Proof See Table 3.2. Note that the contents of the queue in the queue ma-
chine and the output buffer in the universal SCAD machine will be identical
after each execution of each queue instruction on the queue machine and the
corresponding move instructions on the universal SCAD machine, respectively.
It is not difficult to adapt the mapping for a SCAD machine with multiple pro-
cessing units, given an assignment of each queue instruction to that processing
unit in the SCAD machine which executes the queue instruction. ∎
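The mapping of Table 3.2 is mechanical, which the following sketch illustrates. Queue instructions are encoded as tuples and moves as plain strings; both encodings and the function name `to_scad` are assumptions made here for illustration. Any instruction name not listed in the table is treated as a binary opcode.

```python
def to_scad(instr):
    """Translate one queue instruction (a tuple) into universal-SCAD moves."""
    name, *args = instr
    if name == "load":                       # load adr, n
        adr, n = args
        return [f"{adr} -> l", "load -> op", f"{n} -> cp"]
    if name == "store":                      # store adr
        return [f"{args[0]} -> l", "o -> r", "store -> op"]
    if name == "swap":
        return ["o -> l", "o -> r", "swap -> op"]
    if name == "dup":                        # dup n
        return ["o -> l", "dup -> op", f"{args[0]} -> cp"]
    if name == "goto":                       # goto PC, L
        pc, offset = args
        return [f"{pc} -> l", "goto -> op", f"{offset} -> cp"]
    if name == "ifGoto":                     # ifGoto PC, L
        pc, offset = args
        return [f"{pc} -> l", "o -> r", "ifGoto -> op", f"{offset} -> cp"]
    # Assumption of this sketch: every other name is a binary opcode "opcode n".
    return ["o -> l", "o -> r", f"{name} -> op", f"{args[0]} -> cp"]


def translate(program):
    """Translate a whole queue program, one list of moves per instruction."""
    return [to_scad(instr) for instr in program]
```

For example, `to_scad(("add", 1))` yields the four moves of the `opcode n` row with `add` as the opcode.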
Figure 3.5 shows the move code program for the universal SCAD machine for
the expression tree, obtained by translation from the queue program according
to the mapping in Table 3.2. Clearly, SCAD programs obtained by translation
of queue programs do not need to use registers. In other words, all intermediate
results are communicated or transported directly from the output buffers to
the input buffers of processing units. This is enabled by the ordering of nodes
in the expression tree by a breadth-first traversal in contrast to the depth-first
ordering used in register oriented code generation. Importantly, the breadth-
first ordering allows full use of ILP since independent instructions are ordered
consecutively. However, as we have seen, overhead in terms of dup and swap
operations might be required to compile basic blocks or expression DAGs. These
dup and swap operations not only incur additional computation but also
lead to additional move instructions (transportation overhead) in SCAD.
[Figure 3.5 panels: expression tree; queue program; SCAD program (below)]
adr(x1)→ l; load→ op; 1→ cp;
adr(x2)→ l; load→ op; 1→ cp;
adr(x3)→ l; load→ op; 1→ cp;
o→ l; o→ r; add→ op; 1→ cp;
adr(x4)→ l; load→ op; 1→ cp;
adr(x5)→ l; load→ op; 1→ cp;
o→ l; o→ r; mul→ op; 1→ cp;
o→ l; o→ r; div→ op; 1→ cp;
o→ l; o→ r; sub→ op; 1→ cp;
Figure 3.5.: An expression tree with its queue program, and corresponding SCAD
program
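To see that the move program of Figure 3.5 indeed needs no registers, it can be run on a small sequential model of the universal SCAD machine of Definition 3.1. The model below is a simplified sketch: it fires the processing unit eagerly after each move and supports only load and four binary operations; the class name, the operation names, and the convention that the string "o" denotes the output buffer are assumptions of this sketch.

```python
from collections import deque
import operator

BINARY = {"add": operator.add, "sub": operator.sub,
          "mul": operator.mul, "div": operator.truediv}


class UniversalScad:
    """Sequential toy model: buffers l, r, op, cp (inputs) and o (output)."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.buf = {name: deque() for name in ("l", "r", "op", "cp", "o")}

    def move(self, src, dst):
        """Move the head of output buffer "o", or an immediate value, to dst."""
        value = self.buf["o"].popleft() if src == "o" else src
        self.buf[dst].append(value)
        self._fire()

    def _fire(self):
        b = self.buf
        while b["op"]:
            op = b["op"][0]
            if op == "load" and b["l"] and b["cp"]:
                result = self.memory[b["l"].popleft()]
            elif op in BINARY and b["l"] and b["r"] and b["cp"]:
                result = BINARY[op](b["l"].popleft(), b["r"].popleft())
            else:
                return                      # operands are not complete yet
            b["op"].popleft()
            b["o"].extend([result] * b["cp"].popleft())


def load(adr):
    return [(adr, "l"), ("load", "op"), (1, "cp")]


def binop(name):
    return [("o", "l"), ("o", "r"), (name, "op"), (1, "cp")]


# The nine lines of the SCAD program in Figure 3.5, as (source, target) moves.
FIGURE_3_5 = (load("x1") + load("x2") + load("x3") + binop("add")
              + load("x4") + load("x5") + binop("mul") + binop("div")
              + binop("sub"))
```

After all moves have been executed, the output buffer holds the single value x3 ∗ (x1 + x2) − x4/x5, and no register was involved.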
3.2.3. Overhead
Clearly, dup and swap overheads adversely affect performance. However,
since a SCAD machine has multiple buffers, it often requires less
overhead than a queue machine, which has only one central queue.
Theorem 3.2 (Overhead) For any queue program, there is a corresponding
SCAD program with the same number of dup and swap instructions. However,
there are SCAD programs without dup and swap instructions where the queue
machine requires dup and swap instructions.
Proof Given any queue program, the corresponding SCAD program gener-
ated by the mapping given in Table 3.2 will have the same number of dup and
swap instructions. The converse is however not true. For the simple basic
block y1 = x1 −x2; y2 = x1/x2, the expression DAG and its level-planar version
are shown in Figure 3.6. As can be seen, the expression DAG is not planar
and therefore, we have to introduce additional dup and swap instructions for
the queue machine. This is required since all operands have to be brought
into a total order in the queue machine since it has only one queue. After
loading, duplicating, and swapping, the content of the queue is [x1, x2, x1, x2]
so that the two binary operations can read their operands in the right order
to compute x1 − x2 and x1/x2. For the universal SCAD machine, we can do
without dup and swap instructions: To this end, we first load two copies of
x1 and then two copies of x2, so that the content of the output buffer o is
[x1, x1, x2, x2] after the first two lines. We then move the two copies of x1 at
the head of output buffer o to input buffer l so that l holds values [x1, x1]. The
remaining two copies of x2, which are now at the head of output buffer o, are
moved to the input buffer r so that r holds values [x2, x2]. Finally, opcodes
sub and div are moved to the op buffer so that the corresponding results are
produced in the output buffer for storing.

















adr(x1)→ l; load→ op; 2→ cp;
adr(x2)→ l; load→ op; 2→ cp;
o→ l; o→ l;
o→ r; o→ r;
sub → op; 1→ cp;
div → op; 1→ cp;
adr(y1)→ l; o→ r; store → op;
adr(y2)→ l; o→ r; store → op;
Figure 3.6.: A given expression DAG with its planarized version, the obtained
queue program, and a SCAD program without swap and dup instructions
∎
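The queue-machine side of the proof can be illustrated with a small interpreter for the instructions of Table 3.1. The queue program used below is a hypothetical one written for this sketch (it is not claimed to be the program of Figure 3.6); it reaches the queue content [x1, x2, x1, x2] mentioned in the proof with one dup and one swap. For brevity, sub and div always produce a single copy.

```python
from collections import deque


def run_queue(program, memory):
    """Interpret a queue program; return the final queue (memory is updated)."""
    queue = deque()
    for name, *args in program:
        if name == "load":                  # load adr, n
            queue.extend([memory[args[0]]] * args[1])
        elif name == "dup":                 # dup n
            queue.extend([queue.popleft()] * args[0])
        elif name == "swap":
            first, second = queue.popleft(), queue.popleft()
            queue.extend([second, first])
        elif name == "sub":
            left, right = queue.popleft(), queue.popleft()
            queue.append(left - right)
        elif name == "div":
            left, right = queue.popleft(), queue.popleft()
            queue.append(left / right)
        elif name == "store":               # store adr
            memory[args[0]] = queue.popleft()
    return queue


# Hypothetical queue program for y1 = x1 - x2; y2 = x1 / x2.
PREPARE = [("load", "x2", 2), ("load", "x1", 2), ("dup", 1), ("swap",)]
COMPUTE = [("sub",), ("div",), ("store", "y1"), ("store", "y2")]
```

Running `PREPARE` alone leaves the operands in the order x1, x2, x1, x2; the SCAD program of Figure 3.6 obtains the same results without the dup and swap steps because the two operand buffers can be filled separately.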
Since there are two operand input buffers and one output buffer in the universal
SCAD machine, there are fewer restrictions in accessing values compared to the
queue machine with only one central queue. If we add more processing units to a
SCAD processor, the larger number of input and output buffers further reduces
the restrictions for accessing values. Consequently, there is less need for
dup and swap instructions. Consider the expression DAG given in Figure 3.7.
For the same expression DAG, both the queue machine and the universal
SCAD machine require dup and/or swap instructions as shown in Figure 3.4.
This is because both x2 and the result of +1 occur in the output buffer of the
single universal processing unit in the universal SCAD machine, necessitating
dup and swap overhead to execute × and +2. On a SCAD machine with one
processing unit (u) and one load-store unit (lsu), x2 will be loaded to the
output buffer of lsu, and the result of +1 will be produced in the output buffer
of u. This avoids the need for any overhead to compute the expression DAG.
The resulting move program is shown in Figure 3.7. In the move program in
Figure 3.7, we use the following notations to denote different buffer addresses:
The addresses of left and right operand buffers of any processing unit u are
denoted by u@l and u@r, respectively. The output buffer address is denoted
by u@o. The addresses of the opcode and the copies buffers are denoted by
u@op and u@cp, respectively. The corresponding buffer addresses of LSU are
denoted by lsu@l, lsu@r, lsu@o, lsu@op and lsu@cp, respectively. We use
the above notation in the rest of this thesis.





adr(x1)→ lsu@l; load→ lsu@op; 1→ lsu@cp;
adr(x2)→ lsu@l; load→ lsu@op; 3→ lsu@cp;
lsu@o→ u@l; lsu@o→ u@r; add → u@op; 2→ u@cp;
u@o→ u@l; lsu@o→ u@r; mul → u@op; 1→ u@cp;
u@o→ u@l; lsu@o→ u@r; add → u@op; 1→ u@cp;
adr(y1)→ lsu@l; u@o→ lsu@r; store → lsu@op;
adr(y2)→ lsu@l; u@o→ lsu@r; store → lsu@op;
Figure 3.7.: An expression DAG with a move code program without dup and swap
instructions for a SCAD machine with one load-store unit (lsu) and
one processing unit (u)




Chapter 4
Optimal Code Generation
Contents
4.1. Complexity Analysis
4.2. Mapping to SAT
4.2.1. Buffer constraints
4.2.2. Move code generation
4.2.3. Optimizing execution time
4.2.4. Adapting constraints for DO-SCAD
4.2.5. Adapting constraints for SO-SCAD
4.3. Experiments
4.3.1. Execution paradigms
4.3.2. Queue and register based SCAD code
4.3.3. Feasibility
Increasing numbers of dup and swap operations not only degrade the per-
formance but also increase the code size and power consumption of SCAD
machines. Therefore, it is desirable to obtain optimal code for a given pro-
gram and a given SCAD machine, where optimal refers to a minimal number
of overhead operations. We have seen that with more processing units, less
overhead is necessary. In Section 4.1, we prove that it is an NP-hard problem
to determine for a given program, the minimal number of processing units
required in a SCAD machine to execute the program without any overhead
[Ande17]. We then map the decision version of the problem to an equiva-
lent satisfiability (SAT) problem in Section 4.2 [BhSc16; BhSc17; BhSc17a].
Besides constraints for the resource-constrained (minimize the number of pro-
cessing units) optimal code generation, we also formulate optional boolean
constraints in Section 4.2.3 to optimize the execution time for time-constrained
optimal code generation.
Chapter 4: Optimal Code Generation
4.1. Complexity Analysis
Recall that dup and swap operations are used to rotate values at output buffer
heads to access other relevant values from these output buffers. The alternative
is to store values at the output buffer heads to memory and load back these
values after accessing other relevant values. However, memory accesses must
also be avoided since they are expensive. Therefore, for complexity analysis,
we limit memory accesses to just one memory read to load the initial value of
variables in the program.
Theorem 4.1 (NP-hardness) Given a program P and a SCAD ma-
chine S with one load-store unit and p processing units, it is an NP-hard
problem to determine if it is possible to schedule the program P on the
SCAD machine S so that:
• memory accesses are minimized (to one read per load variable)
• overhead (dup or swap) operations are not used.
Proof We prove the NP-hardness by a reduction from the graph coloring
problem that is known to be NP-complete [Karp72]. The decision version of
the graph coloring problem is stated as follows: Given p colors {c1, . . . , cp} and
an undirected graph G = (V,E) with n vertices V ∶= {v1, . . . , vn} and m edges
E ∶= {e1, . . . , em}, determine if there exists an assignment of colors to vertices,
Col ∶ V → {c1, . . . , cp}, so that no two adjacent vertices share the same color,
i.e., Col(vj) ≠ Col(vk) for every edge ei ∶= (vj, vk) ∈ E.
Reduction: For the graph G, we construct a program P whose control flow
is shown in Figure 4.1. P contains basic blocks B ∶= {B1, . . . ,Bm}, BL and
BR. Node A is only used to generate control flows to the set of basic blocks B,
each of which branches to BL or BR. Basic blocks B write to vertex variables
v1, . . . , vn, while BL and BR read vertex variables. The basic idea of the
reduction is that the processing unit (PU) q that produces a vertex variable
vi in the schedule of program P on the SCAD machine S corresponds to an
assignment of color cq to vertex vi in the graph G.
Modeling edges: The m edges of the graph G are modeled by m basic
blocks B ∶= {B1, . . . ,Bm}. Each basic block Bi ∈ B corresponds to an edge
ei ∶= (vj , vk). Since vj and vk must be assigned different colors, we must enforce
that the corresponding vertex variables are produced by different PUs. This
is achieved by constructing each basic block Bi as follows:
Bi ∶






where ⊙ denotes some binary operation, and li and l′i are some load variables




B1 B2 . . . . . . . . . . . . Bm
BL BR
Figure 4.1.: Control flow graph of the constructed program P
memory. Since only one memory read is allowed per load variable, li and l′i
must only be read once by the load-store unit (LSU) into its output buffer,
either in order li ≺ l′i or in order l′i ≺ li. Assume the order li ≺ l′i so that the
[…]i] (head of the buffer at left and
tail at right). See Figure 4.2. Now assume that vertex variables vj and vk are
produced by the same PU in that order, i.e., vj ≺ vk. Then, li (respectively,
l′i) must be moved to the left input buffer (respectively, the right input buffer)
of the PU before moving l′i (respectively, li) to the same input buffer. The
copy of li at the head of the LSU output buffer can be moved to the left input
[…]i], with again a copy of li at its head. However, we now need to access
l′i to move it to the right input buffer of the PU before moving the copy of li
to the same buffer. Therefore, we have to move the copy of li from the head
of the LSU output buffer to some other buffer to access l′i. This will incur
an overhead operation. The same is the case for variable orderings l′i ≺ li and
vk ≺ vj. Since it is required that the schedule of the program P on the SCAD
machine S does not use any overhead operation, the vertex variables vj and



















Figure 4.2.: Failed attempt to schedule basic block Bi without overhead
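The blocking argument can be replayed by a small search. The sketch assumes, as the argument above suggests, that basic block Bi has the form vj = li ⊙ l′i; vk = l′i ⊙ li; this form is an assumption of the sketch. The names `a` and `b` stand for li and l′i. The function asks whether a FIFO output buffer can be emptied head-first such that every input buffer receives its required sequence of values in order, which is what an overhead-free schedule needs.

```python
from functools import lru_cache


def can_feed(source, demands):
    """Can FIFO `source` be drained so that each input buffer gets its demand?

    source  : tuple of values in the LSU output buffer, head first
    demands : one required value sequence per input buffer
    """
    demands = tuple(tuple(d) for d in demands)

    @lru_cache(maxsize=None)
    def go(pos, taken):
        if pos == len(source):
            return all(t == len(d) for t, d in zip(taken, demands))
        for k, d in enumerate(demands):
            # The head value may go to any buffer that expects it next.
            if taken[k] < len(d) and d[taken[k]] == source[pos]:
                nxt = taken[:k] + (taken[k] + 1,) + taken[k + 1:]
                if go(pos + 1, nxt):
                    return True
        return False

    return go(0, (0,) * len(demands))


# Assumed block: vj = a (.) b ; vk = b (.) a, with a = li and b = l'i.
ONE_PU_VJ_FIRST = [("a", "b"), ("b", "a")]      # left buffer, right buffer
ONE_PU_VK_FIRST = [("b", "a"), ("a", "b")]
TWO_PUS = [("a",), ("b",), ("b",), ("a",)]      # l and r of two different PUs
```

With a single PU no load order works, whereas with two PUs each input buffer needs only one value and the moves can be made directly.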
Modeling unique color assignment: Note that if multiple edges are incident
on the same vertex vi in the graph G, multiple basic blocks in the set of basic
blocks B in the program P will write to the corresponding vertex variable vi.
Since any vertex vi must be assigned one and only one color, we must enforce
that any vertex variable vi is produced by one and only one PU in all basic
blocks that write to vi. The unique assignment of color to vertices in the graph
G is modeled by the basic blocks BL and BR in the program P. BL reads all










where _ denotes some immediate value when appearing as an operand of an
instruction and some variable (different from vertex variables) when appearing
as the target of an instruction.
Multiple basic blocks in B may write to any vertex variable vi that is then
read in BL and BR. Since basic blocks B branch to BL (or BR), any vi must
reside in a statically determined buffer at the exit of basic blocks B so that
it can be accessed in BL (or BR) from this buffer. Note that vi must be
moved to the left input buffer of some PU in the basic block BL and to the
right input buffer of some PU in BR. This enforces that the variable vi must
reside in a statically determined output buffer (say out) at the exit of basic
blocks B. If vi resides in a left (respectively right) input buffer, then it must
be rotated to the right (respectively left) input buffer of the PU executing
instruction _ = _ ⊙ vi (respectively _ = vi ⊙ _) in basic block BR (respectively
BL). This will incur an overhead operation contradicting our requirement of
an overhead-free schedule of program P on SCAD machine S. Finally, in any
basic block in B, vi must be produced by the same PU whose output buffer
is out. If vi is produced by a different PU in any basic block Bj ∈ B, it must
be moved to the output buffer out incurring an overhead operation. Putting
together the above arguments, the construction of basic blocks BL and BR
enforces that in the overhead-free schedule of program P on SCAD machine
S, any vertex variable vi is produced by one and only one PU, thus modeling
a unique assignment of colors to vertices in the graph G.
Solution: Clearly, every optimal (overhead-free) schedule of program P on
the SCAD machine S corresponds to a valid coloring of the graph G in that
Col(vi) = cq for any vertex vi where PU q is the unique PU that produces
the vertex variable vi in the optimal schedule. Similarly, any valid coloring of
the graph G corresponds to an optimal schedule of program P on the SCAD
machine S in that any vertex variable vi is produced by PU q where Col(vi) =
cq. The resulting schedule will be optimal: A valid coloring will assign different
colors to all adjacent vertices in the graph G. This means that in the resulting
schedule of P on S, in any basic block Bi ∈ B corresponding to edge ei ∶=
(vj , vk), the vertex variables vj and vk will be produced by different PUs. This
allows all basic blocks B in program P to be compiled without any overhead.
Furthermore, since any vertex is assigned a unique color in the graph G, any
vertex variable is produced by a unique PU in program P. This means that
the basic blocks BL and BR can also be compiled without any overhead since
vertex variables can be moved directly from the output buffers of the respective
assigned PUs to the appropriate input buffers in BL and BR. ∎
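The correspondence used in the proof reads a coloring as a PU assignment: vertex vi receiving color cq means that PU q produces the vertex variable vi. The following brute-force sketch of the decision version of graph coloring makes this concrete; it is exponential and only meant for tiny graphs, and the function name and the vertex numbering from 0 are choices of this sketch.

```python
from itertools import product


def coloring(num_vertices, edges, p):
    """Return a map vertex -> colour in 1..p with no monochromatic edge, or None.

    Read as a schedule, the colour q of vertex i names the PU that produces
    the vertex variable v_i.
    """
    for colours in product(range(1, p + 1), repeat=num_vertices):
        if all(colours[j] != colours[k] for j, k in edges):
            return dict(enumerate(colours))
    return None


TRIANGLE = [(0, 1), (1, 2), (0, 2)]
```

A triangle has no coloring with two colors, which corresponds to the constructed program having no overhead-free schedule on two processing units, while three processing units suffice.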
4.2. Mapping to SAT
To formulate the problem to generate code without overhead in propositional
logic, we assume that the following is given:
• a basic block (DAG) in the form of three-address code in static single
assignment (SSA) form [AlWZ88; CFRW91; RoWZ88], i.e.,
xtgt(0) = xsrcL(0) ⊙0 xsrcR(0)
⋮
xtgt(ℓ−1) = xsrcL(ℓ−1) ⊙ℓ−1 xsrcR(ℓ−1)
for some variables V ∶= {x0, . . . , xn−1}, where ⊙i denotes some binary
operation.
• a SCAD machine with one load-store unit and p processing units that
may execute any binary operation.
The problem is to determine if the basic block can be executed on the SCAD
machine without any dup and/or swap overhead. If so, determine the schedule
of the basic block on the PUs of the SCAD machine.
In SSA form, every variable xi occurs at most once as the left-hand side in
the three-address code, but it may occur several times on the right-hand side.
This defines three different kinds of variables:
• target variables Vtgt are those that occur on the left-hand sides.
• source variables Vsrc are those that occur on the right-hand sides.
• load variables Vld are those that only occur on the right-hand sides
Vld ∶= Vsrc ∖ Vtgt
If a variable is in Vsrc ∩Vtgt, then we assume that all its read operations occur
after its unique write operation (no shadowing of variables).
Furthermore, the basic block can also be partitioned into levels.
• level 0 consists of all instructions that only read variables in Vld
↝ their target variables V^0_def are written by this level
• level j + 1 consists of all instructions that only read variables in
Vld ∪ ⋃_{i=0}^{j} V^i_def and where at least one variable of V^j_def is read
If the DAG is an expression tree, level 0 are the leaf nodes, level 1 are the
nodes that are only connected to leaves, and so on. All instructions in one
level are independent and can be executed in parallel. The levels are also
defined by ASAP (as soon as possible) scheduling of the basic block.
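The three kinds of variables and the levels can be computed directly from the three-address code. The sketch below assumes the basic block is given as a list of (target, left source, right source) triples in which every definition precedes its uses; the example block is the one computed by the move program of Figure 3.7, with `t1` as a hypothetical name for the intermediate sum.

```python
def classify(block):
    """Return (targets, sources, loads) for a block of (tgt, srcL, srcR) triples."""
    targets = {t for t, _, _ in block}
    sources = {s for _, left, right in block for s in (left, right)}
    return targets, sources, sources - targets


def asap_levels(block):
    """Map each target variable to its level (0 = reads load variables only)."""
    _, _, loads = classify(block)
    level = {}
    for target, left, right in block:       # definitions precede uses
        deps = [level[s] for s in (left, right) if s not in loads]
        level[target] = 1 + max(deps) if deps else 0
    return level


# t1 = x1 + x2; y1 = t1 * x2; y2 = t1 + x2
BLOCK = [("t1", "x1", "x2"), ("y1", "t1", "x2"), ("y2", "t1", "x2")]
```

Here x1 and x2 are the load variables, t1 is on level 0, and y1 and y2 share level 1, so they could be executed in parallel.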
A schedule of the basic block on the SCAD machine comprises an assignment
of variables to units (PU/LSU) of the SCAD machine and an ordering of
the variables on these units. The assignment determines which instructions
of the basic block are executed by each unit, and the ordering determines
the order in which each unit executes these instructions. The schedule is an
optimal schedule if move code without dup or swap overhead can be generated
with the given assignment and ordering to execute the basic block on the SCAD
machine.
In the following, we set up boolean constraints whose conjunction provides
a constraint formula for the given basic block. Every satisfying assignment of
that formula is a valid schedule for the given SCAD machine and vice versa.
Assuming that PU 0 is the load-store unit in the given SCAD machine with
p processing units {1, . . . , p}, we define the following relation to determine the
assignment of variables to PUs in the SCAD machine:
• PU assignment relation αi,j for xi ∈ V and j ∈ {0, . . . , p}
– αi,j means that xi ∈ V is produced by PU j
– this determines the instructions of the basic block executed by PU
j
Every variable has to be produced by one and only one PU, as expressed in
the unique PU assignment constraint C1 in 4.1. The first part of the
constraint asserts that any variable xi is assigned to at least one PU and the
second part asserts that if xi is assigned to any PU k, then it is not assigned to any other PU.
























Furthermore, as we determine that all xi ∈ Vld are assigned to PU 0,
• we replace for all xi ∈ Vld all αi,0 with true
• we replace for all xi /∈ Vld all αi,0 with false
• we replace for all xi ∈ Vld and j > 0 all αi,j with false
↝ only αi,j remain where xi /∈ Vld and j > 0
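One way to hand the unique PU assignment requirement to a SAT solver is as clauses in conjunctive normal form. The sketch below uses the common pairwise encoding (one at-least-one clause per variable plus one clause per pair of units); this particular encoding and all names are assumptions of the sketch. It counts the satisfying assignments by brute force instead of calling a solver, and it does not apply the simplification for load variables described above.

```python
from itertools import combinations, product


def unique_pu_clauses(variables, p):
    """Clauses over literals (sign, i, j), i.e. alpha_{i,j} or its negation.

    Units are numbered 0..p, where unit 0 is the load-store unit.
    """
    clauses = []
    for i in variables:
        clauses.append([(True, i, j) for j in range(p + 1)])   # at least one unit
        for j, k in combinations(range(p + 1), 2):             # never two units
            clauses.append([(False, i, j), (False, i, k)])
    return clauses


def count_models(variables, p):
    """Count truth assignments of all alpha_{i,j} that satisfy every clause."""
    atoms = [(i, j) for i in variables for j in range(p + 1)]
    clauses = unique_pu_clauses(variables, p)
    count = 0
    for bits in product([False, True], repeat=len(atoms)):
        alpha = dict(zip(atoms, bits))
        if all(any(alpha[(i, j)] == sign for sign, i, j in clause)
               for clause in clauses):
            count += 1
    return count
```

For two variables and units 0 to 2, exactly 3 · 3 = 9 assignments remain, one for each way of giving every variable a single unit.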
4.2.1. Buffer constraints
To determine an ordering of assigned variables on respective PUs (i.e., an
order of instructions executed by each PU), we introduce a strict partial order
relation ≺ called the variable order such that:
• variable order relation xi ≺ xj for xi, xj ∈ V
– the restriction of ≺ to an output buffer establishes a total order among
variables produced in that output buffer.
The constraints of the ordering of variables in output and input buffers are
then formulated using the variable order relation. To better comprehend the
formulation of buffer constraints, we first describe a preliminary approach
referred to as the production order approach before describing the final buffer
constraints in the consumption order approach. The exact meaning of the
variable order ≺ differs subtly in both approaches. Irrespective of that, the
following constraints are imposed on the variable order relation.
First, we demand that the variable order ≺ is a strict partial order. There-
fore, ≺ is both transitive (constraint C2 in 4.2) and irreflexive (constraint
C3 in 4.3). Both together imply that ≺ is acyclic as expressed in 4.4.
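\[ C_2 :\Leftrightarrow \bigwedge_{x_i, x_j, x_k \in V} \bigl( x_i \prec x_j \wedge x_j \prec x_k \Rightarrow x_i \prec x_k \bigr) \tag{4.2} \]
\[ C_3 :\Leftrightarrow \bigwedge_{x_i \in V} \neg (x_i \prec x_i) \tag{4.3} \]
\[ C_2 \wedge C_3 \Rightarrow \bigwedge_{x_i, x_j \in V} \bigl( x_i \prec x_j \Rightarrow \neg (x_j \prec x_i) \bigr) \tag{4.4} \]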
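The implication in 4.4 can be checked exhaustively on a small carrier set. The sketch below represents a relation as a set of pairs and enumerates all 512 relations on three elements; the function names are illustrative.

```python
from itertools import product


def is_transitive(rel):
    return all((a, d) in rel
               for (a, b), (c, d) in product(rel, rel) if b == c)


def is_irreflexive(rel):
    return all(a != b for a, b in rel)


def is_asymmetric(rel):
    return all((b, a) not in rel for a, b in rel)


def check_4_4(elements):
    """True if every transitive, irreflexive relation on `elements` is asymmetric."""
    pairs = list(product(elements, repeat=2))
    for bits in product([False, True], repeat=len(pairs)):
        rel = {pair for pair, keep in zip(pairs, bits) if keep}
        if is_transitive(rel) and is_irreflexive(rel) and not is_asymmetric(rel):
            return False
    return True
```

The reason is visible in the smallest case: a two-element cycle xi ≺ xj ≺ xi would force xi ≺ xi by transitivity, which irreflexivity forbids.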
Second, we demand that the total order among variables produced in the same
output buffer respects data dependencies in the basic block. To express that
two variables xi and xj stem from the same output buffer (i.e., are produced
by the same PU), we use the relation βi,j:
\[ \beta_{i,j} :\Leftrightarrow \bigvee_{k=0}^{p} \alpha_{i,k} \wedge \alpha_{j,k} \tag{4.5} \]
The data dependency constraint C4 is now formulated as shown in 4.6,
where xi ≺d xj means that xj is data dependent (read-after-write dependent)
on xi. If xtgt(i) and xtgt(j) are produced by the same PU in that order (i.e.,
xtgt(i) ≺ xtgt(j)), then xtgt(i) must not be data dependent on xtgt(j). Similarly,









xtgt(i) ≺ xtgt(j) ⇒ ¬xtgt(j) ≺d xtgt(i)
∧





The data dependency relation ≺d may be either computed offline, or com-
puted by the SAT solver by formulating the relation given by 4.7 as a strict





xtgt(i) ≺d xsrcL(i) ∧ xtgt(i) ≺d xsrcR(i) (4.7)
\[ \bigwedge_{x_i, x_j, x_k \in V} \bigl( x_i \prec_d x_j \wedge x_j \prec_d x_k \Rightarrow x_i \prec_d x_k \bigr) \tag{4.8} \]
\[ \bigwedge_{x_i \in V} \neg (x_i \prec_d x_i) \tag{4.9} \]
Production order approach
In this preliminary approach, the variable order ≺ is the production order.
That is, xi ≺ xj means that the same PU produces xi and xj in that order.
Alternatively, ≺ is the consumption order enforced by the output buffers. In
other words, all copies of xi must be consumed before consuming any copy of
xj . Clearly, a total variable ordering must exist for those variables that are at
some time in the same output buffer.
Consider now any pair of instructions xtgt(i) = xsrcL(i) ⊙i xsrcR(i) and xtgt(j) =
xsrcL(j) ⊙j xsrcR(j) of the basic block. If the instructions are executed on different
PUs, it is possible to move their operands to the corresponding input buffers
irrespective of ordering of operands in some output buffers. Assume that
the instructions are executed on the same PU k. Then either xtgt(i) must
precede xtgt(j) in the production order (i.e., xtgt(i) ≺ xtgt(j)) or vice versa.
If xtgt(i) (respectively xtgt(j)) is produced before xtgt(j) (respectively xtgt(i)),
then we should be able to move the operand xsrcL(i) (respectively xsrcL(j))
before the operand xsrcL(j) (respectively xsrcL(i)), to the left input buffer of
PU k. Therefore, if both left operands xsrcL(i) and xsrcL(j) are produced by
the same PU, we must enforce the ordering xsrcL(i) ⪯ xsrcL(j) (respectively
xsrcL(j) ⪯ xsrcL(i)). Otherwise, one will need to use an overhead instruction to
access the appropriate operand value. Similar arguments apply for the right
operands. This is expressed in constraint 4.10. Clearly, if xtgt(i) precedes
xtgt(j) in the production order, then xtgt(j) must not precede xtgt(i) in the
production order. Recall that cycles in the production order ≺ are avoided by













βsrcL(i),srcL(j) ⇒ xsrcL(i) ⪯ xsrcL(j)
∧










βsrcL(i),srcL(j) ⇒ xsrcL(j) ⪯ xsrcL(i)
∧








Clearly, the buffer constraint 4.10 is necessary. That is, all optimal schedules
must satisfy this constraint. To visualize variable orderings easily, we use
drawings of input buffers displaying relative positions of variables that pass
through each input buffer at some time. See for example Figures 4.3, 4.4 and
4.5. Any directed line from xi in input buffer k to xj in input buffer l indicates
that a copy of xi must be moved to input buffer k before moving a copy of
xj to the input buffer l. We now distinguish two types of directed lines. The
directed solid (black) line from xi to xj means that variables xi and xj are
produced in that order at some time in some output buffer. So, all copies of
xi must be moved before moving any copy of xj . In further explanations, we
call these output edges. The meaning of output edges is already encapsulated
by the variable ordering relation ≺, so that we may denote an output edge by
xi ≺ xj . Note that there exists an output edge from each copy of xi to each
copy of xj in the input buffers. The directed dashed (blue) line from xi to xj
means that a copy of variables xi and xj appears in that order at some time
in that particular input buffer. So, some copy of xi must be moved to that
input buffer before moving some copy of xj to the same input buffer. Note
that it is always a unidirectional (vertical) line since it is specific to variable
copies in each input buffer. We call these input edges denoted by xi ↝ xj .
Now consider the input buffers of PU k shown in Figure 4.3. The variable
ordering satisfies constraint 4.10 since the left operands are produced in dif-
ferent output buffers and the right operands are also produced in different
output buffers. However, optimal move code still cannot be generated. Before
moving xsrcL(i) to the left input buffer of PU k, xsrcR(j) must be moved. Before
moving xsrcR(j) to the right input buffer of PU k, xsrcR(i) must be moved to the
same input buffer. However, xsrcR(i) can be moved only after moving xsrcL(j).
Finally, xsrcL(j) can be moved to the left input buffer of PU k only after mov-
ing xsrcL(i) to the same input buffer. Thus, optimal move code generation is
blocked by the cycle (xsrcL(i) ↝ xsrcL(j) ≺ xsrcR(i) ↝ xsrcR(j) ≺ xsrcL(i)). This
proves that the buffer constraint 4.10 is not sufficient. That is, there exist
non-optimal schedules that satisfy the constraint. It is not difficult to see that
the buffer constraint 4.10 only filters out those non-optimal schedules that
contain a cycle within a single input buffer, as in Figure 4.4.
Figure 4.3.: Ordering of variables forming a cycle spanning both left and right
input buffers of PU k
We observe that cycles in Figures 4.4 and 4.3 are special cases of a general
cycle spanning all input buffers shown in Figure 4.5. Clearly, input edges ↝
alone cannot form a blocking cycle. Also output edges ≺ alone cannot form
a cycle since we have already established that ≺ is acyclic by imposing transitivity
(constraint 4.2) and irreflexivity (constraint 4.3). It is not difficult to further
see that any cycle may be decomposed into a corresponding cycle composed
of alternating input and output edges.
Figure 4.4.: Ordering of variables forming a cycle in the left input buffer and a
cycle in the right input buffer of PU k
Figure 4.5.: Ordering of variables forming a cycle spanning all input buffers
(PUs 1 . . . k)
An input edge xi ↝ xj means that some copy of xi must be moved to some
input buffer before moving some copy of xj to the same input buffer. An
output edge xj ≺ xk means that all copies of xj must be moved before moving
any copy of xk. Concatenating both, xi ↝ xj ≺ xk means that some copy of xi
must be moved before moving any copy of xk even if xi and xk are produced
in different output buffers. This establishes an order of consumption enforced
by the input buffers besides the consumption order enforced by the output
buffers (i.e., the production order). Notice that similar to the consumption
order enforced by the output buffers, the consumption order enforced by the
input buffers must also be transitive. That is, if some copy of xi must be
consumed before consuming any copy of xj and if some copy of xj must be
consumed before consuming any copy of xk, then some copy of xi must be
consumed before consuming any copy of xk. Also, the consumption order
enforced by the input buffers must be acyclic in an optimal schedule (i.e.,
there must not exist any blocking cycles). Otherwise, if it is the case that
some copy of xi must be consumed before consuming any copy of xj and some
copy of xj must be consumed before consuming any copy of xi, the optimal
move code generation will fail.
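The blocking-cycle condition described above can be checked mechanically. The following is a minimal sketch, not the encoding used in this thesis: it assumes that the contents of the input buffers and the production order of each output buffer are given as plain lists, and the function name has_blocking_cycle and the dictionary layout are hypothetical. Every copy of a variable in an input buffer becomes a node, consecutive copies in one buffer are linked by an input edge, and every copy of an earlier-produced variable is linked to every copy of a later-produced variable by an output edge. A depth-first search then looks for a cycle.

```python
from itertools import combinations


def has_blocking_cycle(input_buffers, output_buffers):
    """Return True if input and output edges form a cycle.

    input_buffers:  {buffer name: [variables in the order they must arrive]}
    output_buffers: {buffer name: [variables in production order]}
    Both layouts are illustrative assumptions.
    """
    succ = {}    # node -> list of successor nodes
    copies = {}  # variable -> nodes holding a copy of it
    for name, seq in input_buffers.items():
        for pos, var in enumerate(seq):
            node = (name, pos)
            succ[node] = []
            copies.setdefault(var, []).append(node)

    # Input edges: consecutive copies within the same input buffer.
    for name, seq in input_buffers.items():
        for pos in range(len(seq) - 1):
            succ[(name, pos)].append((name, pos + 1))

    # Output edges: all copies of an earlier variable precede all copies
    # of any variable produced later by the same output buffer.
    for seq in output_buffers.values():
        for earlier, later in combinations(seq, 2):
            for u in copies.get(earlier, []):
                for v in copies.get(later, []):
                    succ[u].append(v)

    # Depth-first search with three colours to detect a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in succ}

    def visit(node):
        colour[node] = GREY
        for nxt in succ[node]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in succ)


# The situation of Figure 4.3 with a = xsrcL(i), b = xsrcL(j),
# c = xsrcR(i), d = xsrcR(j): a ↝ b ≺ c ↝ d ≺ a.
blocked = has_blocking_cycle(
    {"left": ["a", "b"], "right": ["c", "d"]},
    {"o1": ["b", "c"], "o2": ["d", "a"]},
)
```

Modelling copies rather than variables as nodes reflects the observation that input edges alone cannot form a blocking cycle, since the copies within one input buffer are linearly ordered.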
Consumption order approach
In this approach, the variable order relation ≺ is augmented to represent both
the consumption order enforced by the output buffers (or the production order)
and the consumption order enforced by the input buffers. In other words, if
variables xi and xj are produced by the same PU, xi ≺ xj means that all copies
of xi must be consumed before consuming any copy of xj . If they are produced
by different PUs, xi ≺ xj means that some copy of xi must be consumed before
consuming any copy of xj .
Assuming that the variables xi and xj appear in any input buffer in that
order (i.e., there exists an input edge xi ↝ xj), we introduce the following
notation to express that variable xi precedes variable xk in consumption order
enforced by the input buffer, when xj and xk are produced by the same PU
in that order (i.e., there exists an output edge xj ≺ xk).
γi,j,k ∶⇔ βj,k ∧ xj ≺ xk ⇒ xi ≺ xk (4.11)
Now, the final buffer constraint C5 that encapsulates both consumption
orders is given in 4.12.
Finally, a conjunction of the relevant boolean constraints gives the final con-
straint formula Cs (4.13) for optimal scheduling of the given basic block on the
given SCAD machine.
Since the unique PU assignment and the data dependency constraints are
straightforward, we focus on the buffer (ordering) constraint. We have seen
that if any schedule of a basic block on a SCAD machine results in cycles
in the consumption order (i.e., does not satisfy the buffer constraint), optimal
move code generation will fail. This yields the following lemma:
Lemma 4.1 (Necessary) All optimal schedules of a basic block on a
SCAD machine must satisfy constraint Cs. In other words, Cs is a neces-
sary constraint for optimal basic block scheduling on SCAD.
In the following, we prove the sufficiency of constraint Cs.
Lemma 4.2 (Sufficient) Any non-optimal schedule of a basic block on
a SCAD machine must not satisfy constraint Cs. In other words, Cs is a
sufficient constraint for optimal basic block scheduling on SCAD.
Proof Assume that ≺ is the variable ordering in a non-optimal schedule that
satisfies the constraint Cs. Since the schedule is non-optimal, it is not possible
to generate move code with the ordering ≺ without any overhead. Therefore,
all possible ways of moving values from output buffers to input buffers must
reach a deadlock in that no more values can be moved from the head of any
output buffer to the tail of any input buffer, thus necessitating the use of
overhead instructions. We now prove that if such a deadlock is reached, the
variable order relation ≺ must contain at least one cycle, thus violating
the constraint Cs and contradicting our initial assumption. At deadlock, assume
that {x1, . . . , xm} are the next operands expected by input buffers 1, . . . , m,
respectively, where xi ∈ V for all i. Since no more moves are possible, none of
the xi is at the head of any output buffer. Then, there exist {y1, . . . , yn} where
yi ≠ xi is an operand for input buffer i (whose next operand is xi) and every yi
is at the head of some output buffer. Furthermore, n < m holds since more than
one of the next operands in {x1, . . . , xm} may be behind any yi in some output
buffer. This results in the following input edges: {x1 ↝ y1, . . . , xn ↝ yn}.
Since no xi is at the head of any output buffer, each xi must succeed one
of {y1, . . . , yn} in some output buffer. Consider any xi. If xi succeeds yi, then
yi ≺ xi holds, yielding an output edge as shown in Figure 4.6a. The input and
output edges now form a cycle in input buffer i, which violates constraint
Cs and contradicts our assumption that the ordering ≺ satisfies the constraint
Cs. Therefore, any xi must succeed some yj with j ≠ i.
(b) Cycle spanning buffers 1 and 2
Figure 4.6.: Ordering of variables in input buffers
Consider x1. Without loss of generality, assume that x1 succeeds y2 in some
output buffer. Now consider x2. If x2 succeeds y1, then y1 ≺ x2 holds, yielding
a cycle in input buffers 1 and 2 as shown in Figure 4.6b. Thus, it does not
satisfy the constraint Cs, which contradicts our assumption. Therefore, x2
must succeed some yj with j > 2. Again, without loss of generality, assume that
x2 succeeds y3 in some output buffer. Now, consider x3. If x3 succeeds y2 or
y1, this results in the cycles shown in Figures 4.7a and 4.7b, respectively, again
violating constraint Cs and contradicting our assumption. Therefore, x3
must succeed some yj with j > 3. Continuing the argument, consider xn. Since
xn is not at the head of any output buffer, it must succeed one of {y1, . . . , yn}
in some output buffer. However, if xn succeeds any yj with j ≤ n, this results
in a cycle spanning input buffers j, . . . , n as shown in Figure 4.8. Thus,
inevitably the buffer constraint Cs is not satisfied, contradicting our initial
assumption that the variable ordering ≺ satisfies constraint Cs. It is therefore
the case that any non-optimal schedule violates constraint Cs.
∎
(b) Cycle spanning buffers 1, 2 and 3
Figure 4.7.: Ordering of variables in input buffers 1, 2 and 3 (from left to right)
Figure 4.8.: Ordering of variables in input buffers 1, . . . , n (from left to right)
forming a cycle spanning buffers i, . . . , n
Theorem 4.2 (Necessary and Sufficient) Constraint Cs is both necessary
and sufficient for optimal basic block scheduling on SCAD.
Proof This follows directly from Lemmas 4.1 and 4.2. ∎
4.2.2. Move code generation
After proving the sufficiency of constraint Cs, it is easy to generate a move
program for the given basic block for execution on the given SCAD machine
with a variable ordering ≺ and a PU assignment α that satisfies the constraint
Cs. First, load variables (or leaf nodes of the DAG representing the given basic
block) are produced in the output buffer of PU 0, i.e., the LSU. For this, simply
extract the variables assigned to the LSU, sort them according to ≺ and generate
moves to load these variables in the sorted order to the output buffer of the
LSU. Next, extract variables assigned to PUs {1, . . . , p}. This determines the
entire sequence of operand values that must pass through each input buffer i
and the entire sequence of result values that must pass through each output
buffer j. Let tail[i] denote the next operand value to move to input buffer i
and head[j] denote the current result value at the head of the output buffer j.
Now, the move code for the computation of the basic block can be generated
as follows:
• determine input and output buffer pairs (i, j) such that head[j] and
tail[i] are the same. Lemma 4.2 guarantees that such a pair of input
and output buffers will exist as long as there are still values to move
from output to input buffers.
• generate moves to transport the value from the output buffer j to the
input buffer i for each determined pair (i, j) and update the respective
tail[i] and head[j] values.
• generate moves to store variables that appear as root nodes of the DAG,
if found at any head[j].
• repeat the above steps until all result values are either moved to respec-
tive input buffers or stored in the main memory.
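The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code generator of this thesis: the function name generate_moves, the dictionary layout of the PUs, and the tuple format of a move are hypothetical. The sketch additionally assumes that a result reaches the head of an output buffer only after both operands of its instruction have been moved, and that a value stays at the head until all of its copies have been moved or stored.

```python
from collections import Counter


def generate_moves(pus, stores):
    """Generate moves for a given ordering and PU assignment.

    pus:    {name: {"left": [...], "right": [...], "out": [...]}} where
            "left"/"right" list the operand sequences of the input buffers
            and "out" lists the result sequence of the output buffer
            (a load unit has empty "left" and "right" lists).
    stores: variables that must be stored to main memory (DAG roots).
    Returns (moves, complete); complete is False if a deadlock occurred.
    """
    tail = {(p, side): 0 for p in pus for side in ("left", "right")}
    head = {p: 0 for p in pus}
    # Number of copies of each value that still have to leave its output buffer.
    remaining = Counter(
        v for unit in pus.values() for side in ("left", "right") for v in unit[side]
    )
    remaining.update(stores)
    pending_stores = set(stores)
    moves = []

    progress = True
    while progress:
        progress = False
        for src, unit in pus.items():
            h = head[src]
            if h == len(unit["out"]):
                continue  # this output buffer is exhausted
            has_inputs = bool(unit["left"] or unit["right"])
            if has_inputs and min(tail[(src, "left")], tail[(src, "right")]) <= h:
                continue  # operands of the h-th instruction not yet moved
            var = unit["out"][h]
            # Move head[src] to every input buffer whose next operand it is.
            for dst, side in tail:
                expected = pus[dst][side]
                t = tail[(dst, side)]
                if t < len(expected) and expected[t] == var:
                    moves.append((src, f"{dst}.{side}", var))
                    tail[(dst, side)] += 1
                    remaining[var] -= 1
                    progress = True
            if var in pending_stores:
                moves.append((src, "mem", var))
                pending_stores.discard(var)
                remaining[var] -= 1
                progress = True
            if remaining[var] == 0:
                head[src] += 1  # all copies delivered, expose the next value
                progress = True

    complete = all(head[p] == len(unit["out"]) for p, unit in pus.items())
    return moves, complete


# x2 = x0 + x1; x3 = x0 * x1 on an LSU and a single PU.
example = {
    "lsu": {"left": [], "right": [], "out": ["x0", "x1"]},
    "pu1": {"left": ["x0", "x0"], "right": ["x1", "x1"], "out": ["x2", "x3"]},
}
moves, complete = generate_moves(example, ["x2", "x3"])
```

For an ordering that violates the buffer constraint, the loop stops making progress before all values are moved and complete is False, which corresponds to the deadlock discussed in the proof of Lemma 4.2.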
4.2.3. Optimizing execution time
So far, we have considered resource-constrained optimal scheduling since our
effort was to derive an optimal schedule of programs on SCAD machines with
a given number of PUs so that any reduction in performance due to overhead
operations is avoided. Consider the simple program x2 = x0 + x1; x3 = x0 ∗ x1.
Clearly, for any order of loading of x0 and x1 by the LSU, these operand values
can be moved so that x2 and x3 can be produced by the same PU. Even when
a SCAD machine with 2 PUs is provided, a satisfying assignment of the SAT
constraints developed so far may only use 1 PU to execute the above program.
However, an optimal schedule that minimizes the execution time will assign
x2 and x3 to different PUs for concurrent execution. Therefore, the next step
is to optimize the execution time for the maximal use of ILP contained in
programs or equivalently to minimize the execution time of programs. For
this purpose, we introduce the desired execution time t as an additional input
parameter that modifies our problem statement as follows: The problem is to
determine if the basic block can be executed on the SCAD machine in time t
without any dup and/or swap overhead. If so, determine the schedule of the
basic block for the SCAD machine.
To include further constraints regarding the execution time, we consider the
assignment of variables to time slots, which is defined as follows:
• time slot assignment relation θi,j for xi ∈ V and j ∈ {0, . . . , t − 1}
– θi,j means that xi ∈ V is produced in time slot j
Every variable has to be produced in one and only one time slot, as expressed
in the unique time slot assignment constraint C6 in 4.14. Similar to
the unique PU assignment constraint C1, the first part of C6 asserts that any
variable xi is assigned to at least one time slot, and the second part enforces
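that if xi is assigned to any time slot k, then it is not assigned to any other
time slot.
\[
C_6 :\Leftrightarrow \bigwedge_{x_i \in V} \left(
  \bigvee_{k=0}^{t-1} \theta_{i,k}
  \;\wedge\;
  \bigwedge_{k=0}^{t-1} \Bigl( \theta_{i,k} \Rightarrow
    \bigwedge_{m \in \{0,\ldots,t-1\} \setminus \{k\}} \neg \theta_{i,m} \Bigr)
\right)
\tag{4.14}
\]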
The assignment of variables to time slots must respect both the data depen-
dency of the variables and the production order of the variables on the same
PU. To formulate these, the following notation is used to express that a vari-
able xi is assigned to an earlier time slot than variable xj (i.e., xi is produced
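before xj):
\[
\tau_i < \tau_j :\Leftrightarrow
  \bigvee_{k=0}^{t-2} \; \bigvee_{m=k+1}^{t-1} \bigl( \theta_{i,k} \wedge \theta_{j,m} \bigr)
\tag{4.15}
\]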
τi < τj holds if xi is assigned to any time slot k and xj is assigned to a time
slot m > k.
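To make the two notions concrete, the following sketch evaluates them on a given assignment. It is only an illustration: the row of values θi,0, . . . , θi,t−1 of a variable xi is assumed to be a Python list of booleans, and the function names exactly_one_slot and earlier are hypothetical.

```python
def exactly_one_slot(theta_i):
    """Unique time slot assignment for one variable: at least one slot is
    set, and no two different slots are set at the same time."""
    t = len(theta_i)
    at_least_one = any(theta_i)
    at_most_one = all(
        not (theta_i[k] and theta_i[m])
        for k in range(t)
        for m in range(t)
        if m != k
    )
    return at_least_one and at_most_one


def earlier(theta_i, theta_j):
    """tau_i < tau_j: x_i is in some slot k and x_j in some slot m > k."""
    t = len(theta_i)
    return any(
        theta_i[k] and theta_j[m] for k in range(t) for m in range(k + 1, t)
    )


# x0 produced in slot 0 and x2 produced in slot 2, with t = 3.
theta_x0 = [True, False, False]
theta_x2 = [False, False, True]
x0_before_x2 = earlier(theta_x0, theta_x2)
```

A SAT encoding states the same conditions as clauses over the boolean variables θi,j instead of evaluating them on a fixed assignment.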
The ‘time slot respect data dependency’ constraint C7 in 4.16 restricts
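time slot assignments in that the operand variables must be assigned to earlier
time slots than the target variable of each instruction i:
\[
C_7 :\Leftrightarrow \bigwedge_{i}
  \bigl( \tau_{srcL(i)} < \tau_{tgt(i)} \wedge \tau_{srcR(i)} < \tau_{tgt(i)} \bigr)
\tag{4.16}
\]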
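The ‘time slot respect ordering’ constraint C8 in 4.17 restricts time slot
assignments in that the time slots assigned to variables produced by the same
PU must follow their production order:
\[
C_8 :\Leftrightarrow \bigwedge_{i,j} \beta_{tgt(i),tgt(j)} \Rightarrow
\left(
\begin{array}{l}
\bigl( x_{tgt(i)} \prec x_{tgt(j)} \Rightarrow \tau_{tgt(i)} < \tau_{tgt(j)} \bigr) \\
\wedge \\
\bigl( x_{tgt(j)} \prec x_{tgt(i)} \Rightarrow \tau_{tgt(j)} < \tau_{tgt(i)} \bigr)
\end{array}
\right)
\tag{4.17}
\]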
The above constraints are added to Cs (in 4.13) to obtain a new constraint
formula (4.18) for optimal scheduling of the given basic block on the given SCAD
machine, where optimal refers to both the absence of any overhead and the
guarantee of execution within the desired time t.
Finally, it is worth mentioning that instead of introducing a boolean time slot
assignment variable θi,j for any xi and then deriving τi to build constraints,
it is also possible to encode τi directly as an integer variable. However,
the resulting boolean and linear integer constraints will necessitate the use of
satisfiability modulo theories (SMT) solvers [BhSc17].
4.2.4. Adapting constraints for DO-SCAD
As we have seen, the assignment of instructions to PUs is restricted by the
variable ordering. On the one hand, the variable ordering constraints avoid
the use of overhead, which will otherwise adversely affect the performance. On
the other hand, these constraints reduce the flexibility of the SCAD compiler
to perform the instruction assignment for a maximal use of ILP. For example,
consider the following program:
x3 = x0 ⊙0 x1
x4 = x2 ⊙1 x3
x5 = x1 ⊙2 x4
(4.19)
The program can be scheduled without overhead on a SCAD machine with
only one PU and one LSU. Clearly, variables {x0, x1, x2} are load variables
assigned to LSU, while variables {x3, x4, x5} are assigned to the PU. The
resulting optimal assignment of variables to time slots (to minimize execution
time) is given in Table 4.1b. Note that the SCAD machine takes 6 cycles to
execute the program. However, an ideal schedule that maximizes the use of
ILP takes fewer cycles (see Table 4.1a).
(b) by SCAD scheduler
Table 4.1.: Minimal time slot assignments for Program 4.19
The variables x2 and x3 are assigned to the same time slot 3 in the ideal
schedule. Note that x1 is the right operand of target variable x3 (x3 = x0⊙0x1).
Therefore, x1 should be produced before x3 and thus also before variable x2
in the ideal schedule where both x2 and x3 are assigned to the same time slot
(see Table 4.1a). However, this is not possible due to the following reason: x4
and x5 are mapped to the single available PU. Since x5 = x1 ⊙2 x4 (x4 is the
right operand of the target variable x5), variable x4 must be produced before
producing variable x5, i.e., x4 ≺ x5 (see Table 4.1b). The left operands of x4
and x5 are load variables x2 and x1, respectively, which are produced in the
output buffer of the LSU. Since x4 ≺ x5 holds and both are mapped to the single
PU, x2 must be moved to the left input buffer of the PU before moving x1 to
the same buffer. This requires that x2 is loaded from the main memory by the
LSU before loading x1, i.e., x2 should be produced in the output buffer of the
LSU before producing x1. Otherwise, overhead operations will be required to
rotate copies of x1 to access the copy of x2. Therefore, the ordering constraint
x2 ≺ x1 in the LSU output buffer forbids the assignment of x2 and x3 to the
same time slot in the SCAD schedule.
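The effect of the load order on the execution time of Program 4.19 can be reproduced with a small calculation. The sketch below is an illustration under assumptions that are not taken from the SAT encoding: every PU produces at most one variable per time slot, every instruction has unit latency, time slots are numbered from 1 (so that x2 and x3 share slot 3 in the ideal case), and the ideal schedule loads the variables in the order x0, x1, x2. The function name time_slots is hypothetical.

```python
def time_slots(order_on_pu, operands):
    """Earliest time slot of every variable for a fixed production order.

    order_on_pu: {pu: [variables in production order]}
    operands:    {variable: (left operand, right operand)}; load variables
                 have no entry.
    """
    slot = {}
    pending = {pu: list(seq) for pu, seq in order_on_pu.items()}
    last = {pu: 0 for pu in order_on_pu}  # last slot used on each PU
    while any(pending.values()):
        progressed = False
        for pu, seq in pending.items():
            if not seq:
                continue
            var = seq[0]
            deps = operands.get(var, ())
            if all(d in slot for d in deps):
                # One slot after the latest operand and after the previous
                # variable produced by the same PU.
                slot[var] = 1 + max([last[pu]] + [slot[d] for d in deps])
                last[pu] = slot[var]
                seq.pop(0)
                progressed = True
        if not progressed:
            raise ValueError("production order contradicts data dependencies")
    return slot


program = {"x3": ("x0", "x1"), "x4": ("x2", "x3"), "x5": ("x1", "x4")}
ideal = time_slots({"lsu": ["x0", "x1", "x2"], "pu1": ["x3", "x4", "x5"]}, program)
scad = time_slots({"lsu": ["x0", "x2", "x1"], "pu1": ["x3", "x4", "x5"]}, program)
```

With the load order x0, x2, x1 required by the ordering constraint, x3 cannot be produced before slot 4, and the last variable x5 ends up in slot 6. With the order x0, x1, x2, both x2 and x3 fall into slot 3 and x5 into slot 5.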
Unlike buffers in SCAD, there is a pool of entries at inputs and outputs
of PUs in the dynamically ordered SCAD (DO-SCAD) equipped with tag-
matching hardware. This avoids any need for a compiler-determined ordering
in DO-SCAD, thus providing full flexibility to the DO-SCAD compiler to
assign instructions to PUs to utilize maximal ILP. Due to the absence of
any ordering, no overhead operations are required. Therefore, the optimal
schedule in DO-SCAD simply refers to an assignment of instructions to PUs
that maximizes the use of ILP or minimizes the execution time. Consequently,
the conjunction of the unique PU assignment constraint C1, the unique time
slot assignment constraint C6 and the ‘time slot respect data dependency’
constraint C7 is enough to derive the final constraint formula Cdo for optimal
scheduling for DO-SCAD:
Cdo ∶⇔ C1 ∧ C6 ∧ C7 (4.20)
The ideal schedule in Table 4.1a for program 4.19 is therefore possible in DO-
SCAD since there are no ordering constraints. In the experimental results
provided in Section 4.3, we show that with only a few PUs, a SCAD compiler
can determine an appropriate variable ordering that avoids the use of overhead.
Moreover, we also show that compromises in the exploited ILP due to the
buffer constraint (or the ordering constraint) in SCAD are an exception and
not the expectation.
4.2.5. Adapting constraints for SO-SCAD
There is a provision in input buffers in SCAD to enforce a statically determined
order of operand values at runtime locally in each PU, irrespective of the order
of arrival of these values at respective input buffers. However, in a statically or-
dered SCAD (SO-SCAD), the PUs rely only on the order of arrival of operand
values to determine the next operation to be executed. The order of arrival
of values in turn depends on when these values are produced in some output
buffer(s). Recall that two ways of enforcing a compiler-determined order of
instructions in SO-SCAD PUs were discussed in Section 2.3: (1) The control
unit must administer the firing of PUs at statically determined instruction
issue times to ensure that values are produced and transported in the same
order that they must be consumed. (2) The instructions that produce operand
values targeted to the same input buffer are assigned by the compiler to the
same PU so that these values originate from and are transported from the same
output buffer in the expected order. We refer to the former as SO-SCADI and
the latter as SO-SCADP machines where subscripts ‘I’ and ‘P’ indicate that
the ordering is enforced by static instruction issue and static instruction place-
ment, respectively. Consequently, an optimal schedule in SO-SCADI refers to
an assignment and ordering of instructions on PUs and statically determined
instruction issue (firing of PUs) times, such that the use of ILP is maximized.
An optimal schedule in SO-SCADP refers to an assignment and ordering of
instructions on PUs such that the use of ILP is maximized.
To establish how instruction issue times and instruction placement are de-
termined for these machines, let us consider two instructions xtgt(i) = xsrcL(i)⊙i
xsrcR(i) and xtgt(j) = xsrcL(j)⊙j xsrcR(j) of a basic block. In both SCAD and SO-
SCAD, if the instructions are executed on different PUs, it is possible to move
their operands to the corresponding input buffers irrespective of the ordering
of operands in some output buffers. Assume that the instructions are executed
on the same PU k. If xtgt(i) (respectively xtgt(j)) is produced before xtgt(j)
(respectively xtgt(i)), then we should be able to move the operand xsrcL(i) (re-
spectively xsrcL(j)) before the operand xsrcL(j) (respectively xsrcL(i)), to the left
input buffer of PU k. Therefore, if both left operands xsrcL(i) and xsrcL(j) are
produced by the same PU, we must enforce the ordering xsrcL(i) ⪯ xsrcL(j) (re-
spectively xsrcL(j) ⪯ xsrcL(i)), in both SCAD and SO-SCAD. Assume now that
both left operands are produced by different PUs. As we have seen, the buffer
constraint C5 (4.12) ensures in SCAD that the appropriate transportation of
operands xsrcL(i) and xsrcL(j) is not cyclically blocked by any other variables in
the basic block. However, the buffer constraint works under the premise that
input buffers in PUs in SCAD tolerate any arrival order of operand
values by reordering operand values at runtime if required, and in this way
respect the compiler-determined order of instructions on each PU. On the other
hand, input buffers in SO-SCAD are ‘pure’ FIFO buffers. In SO-SCADI, the
firing times of PUs are statically determined and dynamically enforced by the
control unit to ensure that the compiler determined order of instructions is
respected. Therefore, in SO-SCADI, if both left operands xsrcL(i) and xsrcL(j)
are produced by different PUs, the control unit must trigger the firing of the
PU that will produce xsrcL(i) (respectively xsrcL(j)) before triggering the firing
of the PU that will produce xsrcL(j) (respectively xsrcL(i)). In other words, the
production times of variables or variable schedule time slots τ are statically
determined such that τsrcL(i) < τsrcL(j) (respectively τsrcL(j) < τsrcL(i)). A similar
argument applies for the right operands. Accordingly, the buffer constraint for
SO-SCADI orders the time slots of operands that are produced by different PUs.
In SO-SCADP, the compiler simply allocates to the same PU those instructions
that produce operand values targeted to the same input buffer. Therefore,
if target variables xtgt(i) and xtgt(j) are produced by the same PU k, both
left operands xsrcL(i) and xsrcL(j) are assigned to the same PU enforcing the
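ordering xsrcL(i) ⪯ xsrcL(j) if xtgt(i) ≺ xtgt(j) or the ordering xsrcL(j) ⪯ xsrcL(i) if
xtgt(j) ≺ xtgt(i).
\[
\bigwedge_{i,j} \beta_{tgt(i),tgt(j)} \Rightarrow
\left(
\begin{array}{l}
\Bigl( x_{tgt(i)} \prec x_{tgt(j)} \Rightarrow
  \bigl( \beta_{srcL(i),srcL(j)} \wedge x_{srcL(i)} \prec x_{srcL(j)} \bigr)
  \wedge
  \bigl( \beta_{srcR(i),srcR(j)} \wedge x_{srcR(i)} \prec x_{srcR(j)} \bigr) \Bigr) \\
\wedge \\
\Bigl( x_{tgt(j)} \prec x_{tgt(i)} \Rightarrow
  \bigl( \beta_{srcL(i),srcL(j)} \wedge x_{srcL(j)} \prec x_{srcL(i)} \bigr)
  \wedge
  \bigl( \beta_{srcR(i),srcR(j)} \wedge x_{srcR(j)} \prec x_{srcR(i)} \bigr) \Bigr)
\end{array}
\right)
\]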
To obtain the final constraint formula Csoi (respectively Csop) for SO-SCADI
(respectively SO-SCADP), simply replace buffer constraint C5 (constructed
for SCAD) with the corresponding buffer constraint for SO-SCADI (respectively
SO-SCADP).
In SCAD, the values consumed by the same input buffer may be produced by
different PUs without coordinating their production times. In SO-SCADI, the
compiler must ensure a total order of production times of values consumed by
each input buffer, but they may still be produced by different PUs since stati-
cally determined instruction issue times are enforced at runtime by controlling
the firing times of PUs. In SO-SCADP, these values must be produced by the
same PU in the order in which they must be consumed by each input buffer.
Therefore, clearly the buffer constraint for SO-SCADP is stricter compared to
SO-SCADI, which again is stricter compared to SCAD. This means that as we
consider SCAD < SO-SCADI < SO-SCADP machines, it becomes increasingly
difficult to avoid overhead operations. For example, consider the following
program:
x2 = x1 ⊙0 x0
x3 = x2 ⊙1 x2
x4 = x0 ⊙2 x3
x5 = x3 ⊙3 x1
(4.25)
Load variables x0 and x1 are assigned to the LSU, while variables {x2, x3, x4, x5}
can be assigned to any available PUs. Assignments of variables (equivalently
instructions) to a minimal number of PUs in SCAD, SO-SCADI and SO-
SCADP are shown in Table 4.2. Note that 2 PUs (excluding LSU or PU 0)
are required in SCAD to execute the program without any overhead. This
is because of the following reason: Computing x4 is data dependent on x2.
Therefore, if x2 and x4 are assigned to the same PU 1, x2 ≺ x4 must hold.
Consequently, x1 ≺ x0 must hold since x1 and x0 are left operands of x2 and
x4, respectively. Similarly, computing x5 is data dependent on x2. If x2 and
x5 are assigned to the same PU 1, x2 ≺ x5 and consequently x0 ≺ x1 must
hold since x0 and x1 are right operands of x2 and x5, respectively. Due to the
contradicting constraints x1 ≺ x0 and x0 ≺ x1, an extra PU 2 is used in SCAD
to compute x4.
PU instr order
0 x0 ≺ x1
Table 4.2.: Minimal PU assignments for Program 4.25
At least 3 PUs (excluding LSU or PU 0) are needed in a SO-SCADI archi-
tecture to execute the same program without any overhead. This is because
in a SO-SCADI machine, we can no longer schedule x3 and x5 in the same
PU 1. Since computing x5 is data-dependent on x3, x3 ≺ x5 must hold if they
are computed by the same PU 1. Note that x2 and x1 are right operands
of x3 and x5, respectively. Therefore, x2 must be moved to the right input
buffer of PU 1 before moving x1 to the same buffer. This is however not pos-
sible in SO-SCADI since x2 can be produced only after producing x1 (x2 is
data-dependent on x1). This was not a concern in SCAD since x2 and x1 are
produced by different PUs, and thus have no need to be ordered.
Finally, 4 PUs (excluding LSU or PU 0) are needed in a SO-SCADP archi-
tecture to execute the same program without any overhead. This is because, in
a SO-SCADP machine, x2 and x3 can no longer be scheduled in the same PU
1. Note that x1 and x2, which are left operands of x2 and x3, respectively, will
have to be assigned to the same PU in SO-SCADP if x2 and x3 are produced
by the same PU. This is not possible since x1 is a load variable assigned to
LSU (or PU 0), and x2 must be assigned to another PU. A similar argument
applies for right operands x0 and x2. Experimental results in the following
section clearly show that while the use of overhead is easily avoided in SCAD,
it is considerably more difficult to avoid the use of overhead in SO-SCADI and
SO-SCADP.
4.3. Experiments
To generate input programs for experiments, a random basic block generator
was implemented that accepts the number of nodes n (program size) and the
number of levels l (program level) as inputs. The basic block is generated by
randomly choosing two predecessors of every node, ensuring that the DAG has l
levels. Clearly, for an n-node basic block, l ∈ {2, . . . , n − 1} levels are possible.
A hundred basic blocks were generated for every pair (n, l) in each described
experiment so that as many different patterns of DAGs as possible are covered. We use
Microsoft’s Z3 solver [MoBj08a] to find satisfying assignments for boolean
constraints for optimal basic block scheduling in SCAD and its variant ar-
chitectures. First, we compare in Section 4.3.1 different SCAD variants in
terms of difficulty in avoiding the use of overhead instructions and subsequent
use of ILP. Section 4.3.2 compares the execution of queue oriented move code
with register oriented move code by a cycle-accurate SCAD simulator. The
feasibility of optimal SCAD code generation is discussed in Section 4.3.3.
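One possible way to build such a random basic block generator is sketched below. It is not the generator used for the experiments; the function name random_basic_block, the choice of two leaf nodes on the lowest level, and the rule that one predecessor is taken from the level directly below are assumptions made for illustration. The rule fixes the level of every node, so the resulting DAG has exactly l levels. Both predecessors may coincide, as in x3 = x2 ⊙1 x2.

```python
import random


def random_basic_block(n, l, rng=random):
    """Random DAG with n nodes on l levels; inner nodes have two predecessors.

    Returns (block, levels): block maps each inner node to its
    (left, right) predecessors, levels lists the nodes of each level.
    """
    if not 2 <= l <= n - 1:
        raise ValueError("need 2 <= l <= n - 1")
    # Two leaves on level 0, one node on every higher level,
    # and the remaining nodes spread randomly over all levels.
    sizes = [2] + [1] * (l - 1)
    for _ in range(n - sum(sizes)):
        sizes[rng.randrange(l)] += 1

    block, levels, next_id = {}, [], 0
    for depth, size in enumerate(sizes):
        nodes = list(range(next_id, next_id + size))
        next_id += size
        if depth > 0:
            below = levels[-1]                          # level directly below
            lower = [v for lvl in levels for v in lvl]  # all lower levels
            for v in nodes:
                # One predecessor from the level directly below pins the
                # level of v; the other may come from any lower level.
                preds = [rng.choice(below), rng.choice(lower)]
                rng.shuffle(preds)                      # random left/right role
                block[v] = tuple(preds)
        levels.append(nodes)
    return block, levels


block, levels = random_basic_block(12, 5, random.Random(1))
```

Passing a seeded random.Random instance makes a generated block reproducible, which helps when the same input has to be scheduled for several architectures.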
4.3.1. Execution paradigms
For each randomly generated program, we determine the minimal number of
PUs required in SO-SCADP, SO-SCADI and SCAD architectures to execute
the program optimally without any overhead. Recall that the decision version
of the optimal code generation was mapped to SAT where a given program
was, if possible, compiled to optimal move code for execution in a SCAD (or its
variant) machine with a given number of PUs. To derive the minimal number
of PUs, we perform a binary search between the theoretical minimal number
(set as 1) and the maximal number (set to program size). The average-case
and worst-case minimum number of PUs required in SCAD and its variant
architectures to optimally execute programs of different sizes are shown in
Figures 4.9a and 4.9b, respectively. With a timeout of 60 seconds, programs
of sizes (nodes in DAG) 26, 17 and 12 were successfully processed for SCAD,
SO-SCADI and SO-SCADP architectures, respectively. The reason for the
lesser feasibility of optimal compilation as we consider SCAD ≺ SO-SCADI ≺
SO-SCADP is that the minimal number of PUs required in these machines for
overhead-free execution of programs increases considerably in that order. The
SAT solver has to handle more boolean constraints with more PUs.
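The binary search over the number of PUs can be sketched as follows. The sketch assumes that feasibility is monotone, i.e., that a program schedulable without overhead on p PUs is also schedulable on more than p PUs. The callback is_feasible is a hypothetical stand-in for one run of the SAT solver on the constraint formula for a given number of PUs, and timeouts are not modelled.

```python
def minimal_feasible(is_feasible, low, high):
    """Smallest value p in [low, high] with is_feasible(p) true, or None.

    Assumes monotonicity: if is_feasible(p) holds, it holds for all
    larger values of p as well.
    """
    if not is_feasible(high):
        return None
    while low < high:
        mid = (low + high) // 2
        if is_feasible(mid):
            high = mid      # mid works, the minimum is mid or smaller
        else:
            low = mid + 1   # mid fails, the minimum is larger
    return low


# Stand-in oracle: pretend that a size 12 program needs at least 3 PUs.
min_pus = minimal_feasible(lambda p: p >= 3, 1, 12)
```

Each call of the oracle corresponds to one satisfiability check, so the number of solver runs grows only logarithmically with the program size.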
The DO-SCAD architectures are not bound by any ordering constraint (and
thus do not require any overhead). As expected, for other architectures, more
PUs are required to avoid overhead with the increase in program size, as
seen in Figure 4.9. However, the rate of increase in numbers of PUs varies
considerably among these architectures. It is most difficult to avoid overhead
in SO-SCADP, often requiring a number of PUs proportional to the program
size. Note that size 12 programs require on average 8 PUs (and 10 PUs
in the worst case). This means that for a SO-SCADP machine with a given
number of PUs, the compiler will often have to introduce a lot of overhead
to execute real programs degrading the overall performance. Therefore, while
SO-SCADP offers simple hardware, the compiler will often find it prohibitively
difficult (requiring a lot of PUs) or even impossible (with a limited number of
PUs) to utilize ILP contained in programs. The situation is more optimistic
for SO-SCADI architectures with a lower rate of increase in the number of
PUs. The size 17 programs require on average 5 PUs (but 10 PUs in the
worst case) for overhead-free (optimal) execution. Although SO-SCADI offers
simple hardware (same as SO-SCADP), the control unit in SO-SCADI triggers
the firing of PUs at statically determined instruction issue times, in contrast to
SO-SCADP where the arrival of input values drives the firing of PUs. It is this
flexibility to determine the instruction issue times that allows the SO-SCADI
compiler to avoid overhead more effectively than the SO-SCADP
compiler. However, as already mentioned, the static instruction issue cannot
adapt to a variable latency of PUs, memory accesses, and DTN, which often
restricts the use of ILP.
Figure 4.9.: Minimal numbers of PUs required to execute programs of different
sizes in SCAD and its variant architectures
Compared to SO-SCAD architectures, only a few PUs are required in a
SCAD architecture for the compiler to avoid the use of any overhead. Even for
size 26 programs, fewer than 3 PUs on average (only 4 PUs in the worst case) are
enough (see Figure 4.9). Therefore, the organization of input buffers in SCAD
(more complex compared to pure FIFO buffers in SO-SCAD and simpler com-
pared to tag matching hardware in DO-SCAD) imparts more flexibility to the
compiler in determining an instruction schedule so that the use of overhead
is comfortably avoided compared to compilers for SO-SCAD variants. Real
programs are easily executed without any overhead in SCAD machines with
a limited number of PUs as shown by heuristic experimental results in Sec-
tion 5.6. Moreover, the instruction issue in the dataflow order tolerates a
variable component latency and further ensures the effective use of ILP in
programs.
The minimal numbers of PUs required in different architectures with vary-
ing levels of DAGs (representing programs or basic blocks) are shown in Fig-
ure 4.10. The average number of PUs (in all architectures) increases with the
number of levels. With more levels, there are more data dependencies that
constrain the SCAD compiler in ordering dependent instructions in the same
PU. In SO-SCADI, in addition to the ordering of dependent instructions in
the same PU, data dependencies also constrain the compiler in determining
(a) size 12 programs on SO-SCAD
(b) size 26 programs on SCAD
Figure 4.10.: Minimal PUs required to execute programs of different levels
the issue times of all dependent instructions. In SO-SCADP, data dependen-
cies additionally restrict the compiler in allocating dependent instructions to
PUs. Therefore, the impact of increase in program levels is more apparent in
SO-SCADP compared to SO-SCADI (see Figure 4.10a) and least apparent in





































(a) on SCAD variants with minimal
(b) on SCAD and DO-SCAD with minimal PUs required in SCAD
Figure 4.11.: Average minimal time to execute programs of different sizes
Finally, we want to determine for each architecture the degree of exploited
ILP when enough PUs are available for an overhead-free execution. To derive
the minimal execution time (for maximizing the use of ILP), we again perform
a binary search between a theoretical minimum (set to 1) and the maximum
(set to program size). Note that the derived minimal execution time is ab-
stract (in number of steps) since only a unit instruction latency is taken into
account by the SAT encoding and other architectural timings such as fetch
time, decode (registration of move instruction) time, time taken for values to
ripple through buffers, etc. are abstracted out. The abstract steps should
serve well to measure the relative use of ILP by different SCAD variants. In a
first experiment, we determine the minimal number of PUs required in the SO-
SCADP architecture for an overhead-free execution. With this number of PUs,
we then observed that the minimal execution times derived for the SO-SCADP,
SO-SCADI, SCAD, and DO-SCAD machines are equal. This is expected since
the minimal number of PUs required in SO-SCADP for an optimal execution
is nearly the same as the input program size. In the next experiment, the min-
imal number of PUs required in the SO-SCADI architecture was determined,
and the minimum execution times with this many PUs in SO-SCADI, SCAD,
and DO-SCAD machines are derived (see Figure 4.11a). Similarly, minimum
execution time in SCAD and DO-SCAD with the minimal number of PUs
required for SCAD is shown in Figure 4.11b. It is clear from both graphs that
once enough PUs are available in the SO-SCAD or the SCAD architecture to
avoid the use of any overhead, they can exploit nearly maximal ILP (exploited
by DO-SCAD) contained in programs. This means that the overhead is the
main limiting factor in exploiting ILP.
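The binary search described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: `minimal_feasible` and `toy_oracle` are hypothetical names, and the feasibility predicate is a toy stand-in for the SAT query (it only checks two necessary conditions, a critical path length and a PU capacity bound). The sketch assumes that feasibility is monotone in the number of steps and that the upper bound (the program size) is always feasible.

```python
from typing import Callable


def minimal_feasible(lo: int, hi: int, feasible: Callable[[int], bool]) -> int:
    """Smallest t in [lo, hi] with feasible(t) true.

    Assumes monotonicity: if t steps suffice, so do t + 1 steps,
    and that feasible(hi) holds.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid          # mid steps suffice, try fewer
        else:
            lo = mid + 1      # mid steps are too few
    return lo


def toy_oracle(critical_path: int, size: int, pus: int) -> Callable[[int], bool]:
    """Toy stand-in for the SAT query (an assumption for illustration only)."""
    def feasible(steps: int) -> bool:
        # The longest dependency chain must fit, and the PUs must be able
        # to issue all `size` unit-latency instructions within `steps` steps.
        return steps >= critical_path and steps * pus >= size
    return feasible


# Search between 1 and the program size, as in the text.
best = minimal_feasible(1, 12, toy_oracle(critical_path=4, size=12, pus=2))
print(best)
```

In the actual experiments, each call to the predicate would correspond to one SAT solver run with the time bound fixed to the candidate number of steps.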
4.3.2. Queue and register based SCAD code
In this section, we generate optimal queue-based and optimal register-based
move code, and compare the execution of both using a cycle-accurate SCAD
simulator1. The optimal queue-based move code is generated as follows: De-
rive the minimal number of PUs required for an overhead-free compilation of
the input program. With this number of PUs, derive a schedule (PU assign-
ment and variable ordering) for a minimum execution time (maximum use of
ILP) of the input program. Generate move code from this PU assignment
and variable ordering. The optimal register-based move code is generated as
follows: Allocate variables in the program to a minimal number of registers
using the well known Chaitin-Briggs heuristic [Chai04] (that yields nearly op-
timal results). The resulting dataflow graph is used to build relevant SAT
constraints (similar to constraints for DO-SCAD, i.e., without buffer or order-
ing constraints) so that instructions are assigned to PUs to maximize the use
of ILP, taking care of any false dependencies. The same number of PUs is
used in both queue-based and register-based compilation. With the PU assignment
and allocation of variables to registers, it is straightforward to generate
a move program: For each instruction, move operand values from registers to
the corresponding input buffers and move result values from output buffers




register-based move programs use storage (buffers and registers) to accommodate
all intermediate program values. To also study the effect of the number
of register file ports, we execute register-based move code on (1) a SCAD
machine with a single-ported register file, referred to as REG-MIN, and (2) a
SCAD machine with a multi-ported register file with as many ports as registers,
referred to as REG-MAX. Recall that the organization of components in
a SCAD architecture allows the use of register files just like any PU.
The buffer size can be critical for executing a program on a SCAD machine.
With larger buffers, there will be fewer control unit stalls yielding better per-
formance. The program will deadlock if a minimal buffer size (specific to a
program) is not guaranteed (the control unit stalls forever due to lack of space
in a buffer). However, since randomly generated basic blocks are small pro-
grams, we observe that the impact of buffer size can be neglected. Clearly,
one entry in each input and output buffer is sufficient to successfully execute
register-based move programs since all intermediate values are stored in regis-
ters. We measure the minimal number of entries in input and output buffers
required for executing queue-based move programs. Free space in input
buffers allows values to be transported from output buffers, thus freeing
up space in the output buffers. Also, free space in output buffers allows PUs to
consume values from input buffers, thus freeing up space in the input buffers.
Therefore, the minimal input buffer size and the output buffer size required to
execute a move program often depend on each other. To derive the minimal
buffer sizes, we first fix the output buffer size to a large value (equal to the
program size) and then determine the minimal input buffer size required for
the successful execution of the move program. In a second step, with the in-
put buffer size set to this derived minimum value, we determine the minimal
output buffer size required for the successful execution of the move program.
Input and output buffers of the same sizes are assumed in all PUs. Finally,
we configure a unit latency for all components (PU, LSU, and DTN) in the
SCAD machine.
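The two-step derivation of the minimal buffer sizes can be sketched as follows. This is only an illustration under stated assumptions: `executes(in_size, out_size)` is a hypothetical predicate standing in for a simulator run that reports whether the move program finishes without deadlock, and the search range up to the program size is an assumption of this sketch.

```python
def minimal_buffer_sizes(program_size, executes):
    """Derive (input buffer size, output buffer size) in two steps.

    `executes(in_size, out_size)` is a hypothetical stand-in for a
    simulator run that returns True if the program completes.
    """
    large = program_size  # step 1: output buffers fixed to a large size
    in_min = next(s for s in range(1, program_size + 1) if executes(s, large))
    # step 2: with the input buffer size fixed, shrink the output buffers
    out_min = next(s for s in range(1, program_size + 1) if executes(in_min, s))
    return in_min, out_min


# Toy predicate (an assumption): the program needs at least 2 input entries
# and at least 5 entries of input and output space combined.
def toy(in_size, out_size):
    return in_size >= 2 and in_size + out_size >= 5


print(minimal_buffer_sizes(8, toy))
```

Because the two sizes depend on each other, the result is minimal with respect to this particular order of the two steps; fixing the input buffer size first and minimizing it afterwards may give a different pair.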
For randomly generated input programs of different sizes, the average execution
time (in cycles) of the corresponding queue-compiled move programs
in SCAD and of the register-compiled move programs in REG-MIN and REG-MAX is
shown in Figure 4.12a. As expected, the register-based move programs take
more cycles to execute compared to the queue-based move programs due to
additional register write moves to transport the computed result from the out-
put of PUs to the registers. In queue-based move programs, values are directly
communicated from output buffers of producer PUs to input buffers of con-
sumer PUs. Besides additional register writes, the contention of simultaneous
register accesses on the single read/write port leads to further execution cycles
in REG-MIN. The gap in execution time between REG-MIN, REG-MAX, and
SCAD understandably widens with the increase in program size due to more
intermediate values in larger programs. The number of resources (PUs, regis-
ters, input and output buffer sizes) used is shown in Figure 4.12b. Together,
the size of input and output buffers per PU is more than the overall number
of registers. However, note that only up to 2 PUs are used on average.
Figure 4.12.: Average measure of parameters in queue-based and register-based
SCAD program executions
With more PUs, more buffers will be available for storing intermediate values,
thus requiring less size per buffer. Moreover, buffers scale better compared to
a register file. For an exact comparison, we will need to measure hardware re-
source usage (in terms of area, power, and access times), which is not covered
in this thesis.



































(a) Queue-based SCAD program
(b) Register-based SCAD program
Figure 4.13.: Average data transmissions in program executions
The next experiment compares the data transmission patterns in the execution
of queue-based and register-based move programs. Results are shown in
Figures 4.13a and 4.13b, respectively. Note that both graphs use the same scale
for the x and y axes. The overall data transmission (or transportation of val-
ues) for programs of the same size is higher in execution using registers due
to the additional register writes, compared to the execution by direct com-
munication of intermediate results. Importantly, note that the bulk of this
data traffic is concentrated on a limited number of ports in the register file.
This clearly reveals the register file as the bottleneck in utilizing ILP and the
hotspot in power consumption. Meanwhile, the data traffic in the execution
of queue-based move program is distributed among processing units where
each processing unit has dedicated input and output ports. Thus, we have
a distributed communication pattern, which is important for the continued
scaling of performance with the increase in ILP contained in programs. Note
that the number of data transmissions by the LSU is higher in the execution
of queue-based move programs. This is because we directly instantiate the
relevant number of copies of a loaded variable in the output buffer of the LSU.
With more load variables, we may instead choose to load only one copy, move
this copy to the input buffer of a PU and instantiate the relevant number of
copies in the output buffer of that PU.
4.3.3. Feasibility


























































































Figure 4.14.: Compile time for optimal SCAD code generation for programs of
different sizes
We also study the feasibility of our optimal code generation by a SAT solver.
The average and maximal times taken by the SAT solver to derive a queue-based
schedule (both resource-constrained and time-constrained) for programs
of different sizes are shown in Figure 4.14. As expected, the time taken by the SAT
solver to derive a feasible schedule increases with the program size. A larger
number of program variables means a larger number of PU assignment relations
($\alpha_{i,j}$) to consider in the resource-constrained SAT problem. In addition,
the time-constrained SAT problem must consider a larger number of time slot
assignment relations ($\theta_{i,j}$) with increasing program size. With a timeout of 60
seconds, programs containing up to 26 variables were successfully processed
for resource-constrained SCAD code generation (see Figure 4.14a). But for
the time-constrained move code generation, programs containing only up to

























































(a) Resource constrained code genera-
(b) Time constrained code generation for programs of size 17
Figure 4.15.: Compile time for optimal SCAD code generation for programs of
different levels
Figure 4.15 shows the SAT solver’s runtime to derive a queue-based schedule
for programs of different levels. The time taken increases with the number of
levels since the SAT solver has to handle more data dependency constraints
with increasing program levels. On the other hand, more data dependencies
restrict the number of choices for variable orderings leaving the SAT solver
with fewer variable orders to check to derive a feasible schedule. Therefore, in
general, for two-level or three-level DAGs, the SAT solver can find a feasible
schedule in less time due to many choices of variable orderings. However, for
certain exceptions (difficult DAG structures), the SAT solver has to explore all
possible variable orderings to determine a feasible one. This is why we often
see spikes in the maximal compile time in Figures 4.15a and 4.15b.
Chapter 5
Heuristics for Code Generation
Contents
5.1. Preliminaries
  5.1.1. The MiniC Language and Compiler
  5.1.2. Basic Definitions and Notations
  5.1.3. Static Single Information (SSI) Representation
5.2. Buffer Assignment
  5.2.1. Variable Interference
  5.2.2. Computing the Interference Graph
  5.2.3. Remarks
5.3. Balancing Variables
  5.3.1. Balancing by Discarding Copies
  5.3.2. Balancing by SSI Transformation
5.4. Move Code Generation
5.5. Remarks on Buffer Size
5.6. Experiments
Since optimal SCAD code generation by SAT solvers is only feasible for small
programs, heuristics are required to compile larger programs to move code
for the execution on SCAD machines. In Section 5.2, we present a heuristic
for assigning variables in a program to a minimal number of output buffers
(or PUs) in a SCAD machine so that resource-constrained move code can be
generated in program order without any dup or swap overhead. Recall that
the number of copies of values to produce must also be determined at compile
time. Depending on the control flow path, the number of uses of a value or
the number of copies of a value to produce, may differ. Therefore, we present
in Section 5.3, program transformations to equalize the number of uses of a
value along each control flow path so that the number of copies to produce is
uniquely determined. Finally, Section 5.4 describes move code generation.
5.1. Preliminaries
This section introduces some preliminary information necessary to understand
the rest of this chapter. The MiniC1 language and its compiler intermedi-
ate representation, used to implement the heuristic, are introduced in Sec-
tion 5.1.1. Some basic definitions from compiler literature and notations used
in the rest of this chapter are given in Section 5.1.2. Static single information
(SSI) form [Anan99; Sing06], a compiler intermediate representation, is briefly
described and its properties are outlined in Section 5.1.3.
5.1.1. The MiniC Language and Compiler
MiniC is a simple imperative programming language developed by the Em-
bedded Systems Group2 at the University of Kaiserslautern. It has a minimal
set of data types, namely boolean (bool), unsigned/signed integers of machine
width (nat and int, respectively) and arrays. For these data types, there are
the usual boolean and arithmetic operations needed to implement reasonable
benchmarks. The language is strictly typed and allows programmers to convert
types using cast expressions like (nat) τ, which force the type checker to give
the expression τ the type nat. Functions are not allowed to be recursive and
are inlined by the MiniC compiler. MiniC provides the standard statements
listed in Table 5.1, where $S$ and $S_i$ are statements, ϕ is a boolean expression, λ is
a left-hand side expression, and τ and $\tau_i$ are right-hand side expressions.

Statement type    Syntax
assignment        λ = τ;
sequence          $S_1$; $S_2$
conditional       if(ϕ) $S_1$ [else $S_2$]
while loop        while(ϕ) $S_l$
for loop          for (i = $\tau_l$ ... $\tau_u$) $S_l$
do-while loop     do $S_l$ while(ϕ)
function call     f($\tau_1$, ..., $\tau_n$);
return            return τ
Table 5.1.: Statements in MiniC language

A simple compiler framework is available for MiniC that offers syntax analysis,
type-checking, and a translation to an intermediate format referred to as the command
language (similar to three-address code). A command program of length $\ell$
is a sequence of command instructions $(I_0, I_1, \ldots, I_{\ell-1})$. Available instruction
types are listed in Table 5.2, where x, xi and y are variables, ⊙ is a primi-
tive binary arithmetic or logic operator and i is an immediate value used to






copy variable           y = x
assignment              y = $x_1$ ⊙ $x_2$
array access            y = $x_1$[$x_2$]
array assignment        y[$x_1$] = $x_2$
unconditional branch    goto i
conditional branch      if x goto i
Table 5.2.: Instructions of the Command language
5.1.2. Basic Definitions and Notations
The MiniC compiler also offers further functions that split command programs
into control-flow graphs, which are the basis for many code optimization
techniques.
Definition 5.1 〈 Control-Flow Graph 〉
A control-flow graph (CFG) is a program representation as a directed graph
$G_C = (V_C, E_C, I_s, I_e)$ with command instructions as the set of nodes $V_C$, a set of
edges $E_C$, and two special nodes $I_s$ and $I_e$, where $I_s$ is the start node with
no incoming edge and $I_e$ is the end node with no outgoing edge.
In the CFG representation of a command program $(I_0, I_1, \ldots, I_{\ell-1})$, $(I_{i-1}, I_i) \in E_C$
for all $i \in \{1, \ldots, \ell-1\}$ and $(I_i, I_j) \in E_C$ if $I_i$ is a branch instruction whose
target is $I_j$. $I_s$ denotes the start node $I_0$ and $I_e$ denotes the end node $I_{\ell-1}$. A
CFG is also often defined at the granularity of basic blocks, each of which is a piece of
straight-line code, i.e., there are no jumps into or out of the middle of a block.
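As an illustration, the construction of the edge set of a CFG from a command program can be sketched as follows. The `Cmd` class and the `build_cfg` function are hypothetical names introduced only for this sketch. Note one deliberate assumption: the sketch adds no fall-through edge after an unconditional `goto`, which is the usual convention but departs from a literal reading of the edge rule above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cmd:
    """A command instruction, reduced to what the CFG construction needs."""
    kind: str                      # e.g. "copy", "assign", "goto", "if_goto"
    target: Optional[int] = None   # index of the branch target, if any


def build_cfg(program):
    """Return the set of edges (i, j) between instruction indices."""
    edges = set()
    for i, ins in enumerate(program):
        if ins.kind in ("goto", "if_goto"):
            edges.add((i, ins.target))            # branch edge
        if ins.kind != "goto" and i + 1 < len(program):
            edges.add((i, i + 1))                 # fall-through edge (assumed
                                                  # absent after a goto)
    return edges


# A small loop: instruction 1 tests the exit condition, 3 jumps back to 1.
program = [
    Cmd("copy"),                 # 0
    Cmd("if_goto", target=4),    # 1
    Cmd("assign"),               # 2
    Cmd("goto", target=1),       # 3
    Cmd("copy"),                 # 4
]
print(sorted(build_cfg(program)))
```

The start node is instruction 0 and the end node is the last instruction, matching $I_s = I_0$ and $I_e = I_{\ell-1}$.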
Definition 5.2 〈 Path 〉
A path of length $\ell \geq 0$ from a node $I_u$ to a node $I_v$ is a sequence of nodes
$(I_{k_0}, I_{k_1}, \ldots, I_{k_\ell})$ such that $I_{k_0} = I_u$, $I_{k_\ell} = I_v$ and $(I_{k_{i-1}}, I_{k_i}) \in E_C$ for
$i \in \{1, \ldots, \ell\}$.
A node $I_v$ is reachable from $I_u$ if and only if there is a path from $I_u$ to $I_v$
in the CFG. Every node is reachable from the start node $I_s$. The end node $I_e$
is reachable from every other node.
In a CFG representation of a command program $(I_0, I_1, \ldots, I_{\ell-1})$, an edge
$(I_i, I_j)$ is called a forward-edge if $i < j$ and a back-edge otherwise. Notice that
MiniC is a structured programming language in that only structured control
flow constructs [BoJa66] are available for constructing MiniC programs. This
structured control flow is further retained by the MiniC compiler in that the
CFGs of command programs are reducible flow graphs where possible loop
structures are shown in Figure 5.1 and are defined as follows:
Definition 5.3 〈 Loop 〉
A sequence of command instructions $(I_i, \ldots, I_{i+m})$ where $m \geq 0$, in a command
program $(I_0, I_1, \ldots, I_{\ell-1})$, is a loop if $I_i$ is the single entry point, $I_i$










(b) Entry at $I_i$ and exit at $I_{i+m}$
Figure 5.1.: Possible control-flow graphs of loops in MiniC
Finally, we define the concepts of dominance and post-dominance that are
necessary to understand the construction and properties of SSI programs ex-
plained in the next section.
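Definition 5.4 〈 Dominance 〉
A node $I_u$ dominates a node $I_v$, denoted by $I_u$ dom $I_v$, if every path from
the start node $I_s$ to $I_v$ contains $I_u$. Every node dominates itself. If $I_u$
dom $I_v$ and $I_u \neq I_v$, then $I_u$ strictly dominates $I_v$, denoted by $I_u$ sdom $I_v$.
The node $I_u$ is the unique immediate dominator of node $I_v$, denoted by $I_u$
idom $I_v$, if $I_u$ sdom $I_v$, but $I_u$ does not strictly dominate any other strict
dominator of $I_v$.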




Definition 5.5 〈 Dominance Frontier 〉
A node $I_v$ is in the dominance frontier set of node $I_u$, denoted by $\mathrm{DF}(I_u)$,
if $I_u$ dominates an immediate predecessor of $I_v$, but does not strictly dominate
$I_v$, i.e.,
$\mathrm{DF}(I_u) = \{ I_v \mid (I_w \leadsto I_v) \wedge (I_u \;\mathrm{dom}\; I_w) \wedge \neg (I_u \;\mathrm{sdom}\; I_v) \}$
Intuitively, the dominance of $I_u$ reaches until, but not including, $I_v$ so that $I_v$
is a control flow join point as shown in Figure 5.2a. Post-dominance and the










(b) $I_u \in \mathrm{PDF}(I_v)$
Figure 5.2.: Examples for intuitive understanding of dominance and post-
dominance frontiers
Definition 5.6 〈 Post-Dominance 〉
A node $I_v$ post-dominates a node $I_u$, denoted by $I_v$ pdom $I_u$, if every path
from $I_u$ to the end node $I_e$ contains $I_v$. Every node post-dominates itself.
If $I_v$ pdom $I_u$ and $I_v \neq I_u$, then $I_v$ strictly post-dominates $I_u$, denoted by $I_v$
spdom $I_u$. The node $I_v$ is the unique immediate post-dominator of node
$I_u$, denoted by $I_v$ ipdom $I_u$, if $I_v$ spdom $I_u$, but $I_v$ does not strictly
post-dominate any other strict post-dominator of $I_u$.
Definition 5.7 〈 Post-Dominance Frontier 〉
A node $I_u$ is in the post-dominance frontier set of node $I_v$, denoted
by $\mathrm{PDF}(I_v)$, if $I_v$ post-dominates an immediate successor of $I_u$, but
does not strictly post-dominate $I_u$, i.e.,
$\mathrm{PDF}(I_v) = \{ I_u \mid (I_u \leadsto I_w) \wedge (I_v \;\mathrm{pdom}\; I_w) \wedge \neg (I_v \;\mathrm{spdom}\; I_u) \}$
Intuitively, the post-dominance of $I_v$ reaches until, but not including, $I_u$ so
that $I_u$ is a control flow split point as shown in Figure 5.2b.
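The definitions of dominance, post-dominance and their frontiers can be made concrete with a small sketch. It uses a simple iterative set-based computation, which is a textbook technique chosen here for brevity and not necessarily the algorithm used in the MiniC compiler. All function names are hypothetical, and the sketch assumes that every node is reachable from the start node and that the end node is reachable from every node.

```python
def dominators(nodes, edges, start):
    """Return dom, where dom[n] is the set of nodes that dominate n."""
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)
    dom = {n: set(nodes) for n in nodes}   # optimistic initial value
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == start:
                continue
            common = set(nodes)
            for p in preds[n]:
                common &= dom[p]           # dominators shared by all preds
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def frontier(nodes, edges, dom):
    """Frontier sets, computed directly from the set definition."""
    df = {u: set() for u in nodes}
    for w, v in edges:                     # w is an immediate predecessor of v
        for u in dom[w]:                   # u dom w
            strictly_dominates = u in dom[v] and u != v
            if not strictly_dominates:
                df[u].add(v)
    return df


def post_dominance(nodes, edges, end):
    """Post-dominators and post-dominance frontiers via the reversed CFG."""
    reversed_edges = [(v, u) for u, v in edges]
    pdom = dominators(nodes, reversed_edges, end)
    return pdom, frontier(nodes, reversed_edges, pdom)


# Diamond-shaped CFG: 0 splits into 1 and 2, which join at 3.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
dom = dominators(nodes, edges, 0)
df = frontier(nodes, edges, dom)
pdom, pdf = post_dominance(nodes, edges, 3)
print(df[1], pdf[1])
```

In the diamond, the join node 3 lies in the dominance frontier of both branch nodes, and the split node 0 lies in their post-dominance frontier, which matches the intuition given for join and split points.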
5.1.3. Static Single Information (SSI) Representation
Static single information (SSI) [Anan99; Sing06] is an extension of the well es-
tablished static single assignment (SSA) [AlWZ88; CFRW91; RoWZ88] com-
piler intermediate representation. In the following, the SSA representation is
first discussed, followed by its extension to SSI. A program is in SSA form if
each variable is defined (statically or textually) only once. A non-SSA pro-
gram is transformed into an SSA program by replacing each definition of a
variable with a unique definition and renaming each use of the original variable
using the new variable whose unique definition reaches it. Clearly, multiple
definitions may merge from different paths to a control flow join and reach a
given use. In this case, a φ-function is placed at the join point that selects the
appropriate variable depending on which path was executed. Consequently,
the following properties hold for SSA (and SSI) programs.
Property 5.1 〈 Unique Definition 〉
Each variable in a program in SSA (and SSI) form has a unique definition.
Property 5.2 〈 Dominance 〉
The unique definition of any variable x in a program in SSA (and SSI)
form, dominates all uses of x.
For example, consider the CFGs shown in Figure 5.3a. The variable definitions
are made unique, and variable uses are appropriately renamed to construct the
SSA program in Figure 5.3b. Note that a φ-function is used to assign x4, with
x2 if the left control flow path was taken or with x3 if the right control flow
path was taken. This is a pseudo assignment in that the execution of a φ-
function requires knowledge of the past control flow. Therefore, to execute
SSA programs, φ-function assignments will have to be replaced by actual as-
signments that respect the semantics of the φ-function. A naive elimination
of φ-functions places copy assignments at the end of each predecessor basic
block. For example, x4 ← x2 and x4 ← x3 are placed at the end of the basic
blocks in the left path and the right path, respectively (see Figure 5.3c).
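The naive elimination of φ-functions can be sketched as follows. The block and instruction encoding is hypothetical and chosen only for this illustration: instructions are tuples, and a φ-function records which source variable arrives from which predecessor block. The sketch assumes that blocks do not end with an explicit branch instruction; otherwise the copies would have to be placed before that branch.

```python
def eliminate_phis(blocks):
    """Replace each phi by copy assignments at the end of the predecessors.

    blocks: dict mapping a block name to a list of instruction tuples.
    A phi is encoded as ("phi", dest, {pred_block: source_variable}).
    """
    kept = {name: [] for name in blocks}
    copies = {name: [] for name in blocks}
    for name, instructions in blocks.items():
        for ins in instructions:
            if ins[0] == "phi":
                _, dest, sources = ins
                for pred, src in sources.items():
                    copies[pred].append(("copy", dest, src))
            else:
                kept[name].append(ins)
    # Copies go to the end of each predecessor block.
    return {name: kept[name] + copies[name] for name in blocks}


# The SSA program of Figure 5.3b in this encoding.
ssa = {
    "entry": [("def", "x1")],
    "left": [("use", "x1"), ("def", "x2")],
    "right": [("use", "x1"), ("use", "x1"), ("def", "x3")],
    "join": [("phi", "x4", {"left": "x2", "right": "x3"}), ("use", "x4")],
}
result = eliminate_phis(ssa)
print(result["left"][-1], result["right"][-1])
```

For this input, the left block ends with the copy x4 ← x2 and the right block with x4 ← x3, as in Figure 5.3c.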
The SSA form is constructed in two phases: First, φ-functions are inserted
for each variable so that only a single definition of the variable reaches its
uses. Second, variables are uniquely named when defined, and their uses
are appropriately renamed. If a variable is defined in instruction I, then
φ-functions are inserted for that variable at dominance frontiers (more precisely
at iterated dominance frontiers) of I (DF(I)) to construct a minimal SSA
form [CFRW91; ApPa02]. The pruned SSA form reduces the number of
φ-functions by not inserting the φ-function for a variable at program points
where the variable is not live anymore.

(a) Non-SSA program
entry:  x ← ...
left:   ... ← x
        x ← ...
right:  ... ← x
        ... ← x
        x ← ...
join:   ... ← x
(b) SSA program
entry:  x1 ← ...
left:   ... ← x1
        x2 ← ...
right:  ... ← x1
        ... ← x1
        x3 ← ...
join:   x4 ← φ(x2, x3)
        ... ← x4
(c) After elimination
entry:  x1 ← ...
left:   ... ← x1
        x2 ← ...
        x4 ← x2
right:  ... ← x1
        ... ← x1
        x3 ← ...
        x4 ← x3
join:   ... ← x4
Figure 5.3.: An example for SSA transformation and elimination
SSI form, introduced in [Anan99] and more concisely revisited in [Sing06],
extends the SSA form by additionally treating uses of a variable similar to
how SSA form treats variable definitions. More precisely, while the SSA form
establishes that each use of a variable is dominated by its unique definition
(Property 5.2), the SSI form further establishes that the unique definition of
a variable is post-dominated by each of its uses.
Property 5.3 〈 Post-Dominance 〉
Each use of any variable x in a program in SSI form post-dominates the
unique definition of x.
Clearly, the definition of a variable may separate to different paths at a control
flow split and reach different uses of the variable along these different paths.
Therefore, in the same way that φ-functions are inserted at join points to
construct SSA programs, σ-functions are further inserted at split points to
construct SSI programs. The σ-function assigns a variable whose definition
reaches a split point to an appropriate variable depending on which path will
be executed in the future. For example, consider CFG in Figure 5.4a (same
as Figure 5.3a). Note that in addition to the φ-function at the join point, a σ-
function is inserted at the split point to derive the SSI program in Figure 5.4b.
The σ-function will assign x1 to variable x5 if the left control flow path will
be taken in the future or to variable x6 if the right control flow path will be
taken. The assignments to σ-functions are also pseudo assignments since the
execution of a σ-function requires knowledge of the future control flow. Both
φ and σ-function assignments will have to be replaced by appropriate actual
assignments to execute SSI programs. Analogous to eliminating φ-functions,
a naive elimination of σ-functions places copy assignments at the beginning of
each successor basic block. For example, x5 ← x1 and x6 ← x1 are placed at
the beginning of basic blocks in the left path and the right path, respectively
(see Figure 5.4c).
(a) Non-SSI program
entry:  x ← ...
left:   ... ← x
        x ← ...
right:  ... ← x
        ... ← x
        x ← ...
join:   ... ← x
(b) SSI program
entry:  x1 ← ...
        σ(x5, x6) ← x1
left:   ... ← x5
        x2 ← ...
right:  ... ← x6
        ... ← x6
        x3 ← ...
join:   x4 ← φ(x2, x3)
        ... ← x4
(c) After elimination
entry:  x1 ← ...
left:   x5 ← x1
        ... ← x5
        x2 ← ...
        x4 ← x2
right:  x6 ← x1
        ... ← x6
        ... ← x6
        x3 ← ...
        x4 ← x3
join:   ... ← x4
Figure 5.4.: An example for SSI transformation and elimination
While SSI is introduced in [Anan99], an alternative definition and algorithm
are provided in [Sing06]. A more recent work [BBDR12] discovered that both
definitions are not equivalent. To distinguish between these definitions, the au-
thors introduce the notion of upward-exposed use, where the use of a variable
x is upward-exposed at a program point p if there is a path from p to the use
that does not go through any other use of x. In the SSI definition in [Anan99],
at most one use of each variable is upward-exposed at each program point p.
A pseudo-use of each variable is assumed at the end node Ie of the CFG so
that this property holds even at program points where the variable is not live.
Clearly, these pseudo-uses are only considered for the SSI construction and
not for liveness analysis. However, in the SSI definition in [Sing06], at most
one use of each variable is upward-exposed only at the definition points and
not at all program points. Hence, the definition of SSI in [Anan99] is more
constrained compared to that in [Sing06]. For more details, see [BBDR12]. In-
terestingly, the construction algorithm in [Sing06] produces an SSI form that
corresponds to the definition in [Anan99]. We use this algorithm for the SSI
construction so that the following property can be stated:
Property 5.4 〈 Unique Upwards-Exposed Use 〉
For each point in a program in SSI form, there is a unique upward-exposed
use of each variable.
To construct an SSI form, φ and σ-functions are inserted in the first phase.
The φ-functions are inserted for each variable so that the dominance prop-
erty 5.2 is satisfied. The σ-functions are inserted for each variable so that
both the post-dominance property 5.3 and the unique upwards-exposed use
property 5.4 are satisfied. Notice that the insertion of φ-functions introduces
new definitions that may trigger the insertion of corresponding σ-functions and
vice versa. Therefore, it is necessary to perform a fixpoint iteration that alter-
nately introduces φ and σ-functions. In the second phase, variables are given
unique names when defined, and their uses are appropriately renamed. Anal-
ogous to the insertion of φ-functions in a minimal SSA form, a minimal SSI
form is constructed by inserting σ-functions for a variable at post-dominance
(more precisely at iterated post-dominance) frontiers of I (PDF(I)) where the
variable is used in instruction I. Clearly, the elimination of φ and σ-functions
results in extra overhead in the form of copy assignments. To minimize these,
we use the pruned SSI form where the number of φ and σ-functions are reduced
by not inserting these for a variable at program points where the variable is
not live.
5.2. Buffer Assignment
We first explore in Section 5.2.1 execution scenarios for any pair of variables,
such that if assigned to the same output buffer (or same PU), dup and/or
swap overhead operations will be required to read (or dequeue) their values.
Restrictions are then introduced for the move code generation so that the
dataflow analysis framework [Kild73] can be used to compute whether it is
safe to map a given pair of variables to the same output buffer as explained
in Section 5.2.2. Eventually, a buffer interference graph (similar to the tradi-
tional register interference graph) is computed whose coloring yields a valid
assignment of variables in the program to output buffers in the SCAD machine
so that overhead-free move code can be generated.
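To illustrate how a coloring of the buffer interference graph yields a buffer assignment, the following sketch applies a generic greedy coloring. This is a standard technique used here as a stand-in; the text above does not state which coloring algorithm is actually used, and greedy coloring does not guarantee a minimal number of buffers. The function name and the input encoding are assumptions of this sketch.

```python
def assign_buffers(variables, interference_pairs):
    """Greedy coloring: interfering variables get different output buffers.

    variables: list of variable names, visited in the given order.
    interference_pairs: iterable of (a, b) pairs of interfering variables.
    Returns a dict mapping each variable to a buffer index.
    """
    neighbours = {v: set() for v in variables}
    for a, b in interference_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    buffer_of = {}
    for v in variables:
        taken = {buffer_of[n] for n in neighbours[v] if n in buffer_of}
        colour = 0
        while colour in taken:     # smallest buffer index not used by a
            colour += 1            # neighbour that is already assigned
        buffer_of[v] = colour
    return buffer_of


pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
assignment = assign_buffers(["a", "b", "c", "d"], pairs)
print(assignment)
```

Variables that do not interfere, such as b and d in this example, may share an output buffer, so fewer buffers (or PUs) are needed than there are variables.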
5.2.1. Variable Interference
Let $x_i$ denote the $i$th value (i.e., the value from the $i$th write) of variable
$x$. Clearly, $x_i$ has a single write instance and may have any number of read
instances. Figure 5.5 shows the write instance and read instances of $x_i$, where
$w(x_i)$ denotes the time at which $x_i$ is produced, $r_m(x_i)$ denotes the time
at which the $m$th read of $x_i$ occurs, and $c$ is the total number of reads of $x_i$
(before $w(x_{i+1})$, if any). Assume that the variables $x$ and $y$ are assigned to the
same output buffer (i.e., scheduled on the same PU) in a SCAD machine. If
there exists an instance $j$ of variable $y$ such that $w(y_j) > w(x_i)$ as shown in
Figure 5.5, then any read of $y_j$ can only occur after all remaining reads of $x_i$.
Figure 5.5.: Write and read instances of variables x and y during program execution
At time $w(y_j)$, all copies of $y_j$ are ‘blocked’ by the remaining unread copies of
$x_i$ in the output buffer as shown in Figure 5.6. The dashed line (in red color)
shows the timeline during which it is not possible to read any copy of $y_j$ unless
all unread copies of $x_i$ are rotated (to the same or a different output buffer) to
bring $y_j$ to the head of the output buffer. Therefore, if it is required to read
(dequeue) $y_j$ during this timeline, the variable y is said to be blocked by the
variable x, denoted by y ⊏ x. If either y is blocked by x or vice versa, then
x and y are said to ‘interfere’ with each other, denoted by x ◻ y. In this case,
x and y must be assigned to different output buffers in the SCAD machine
to avoid overhead (rotation of values). Thus, the following definitions are in
order:
→ $y_j$ ... $y_j$ $x_i$ ... $x_i$ →
Figure 5.6.: Content of the output buffer (the tail is on the left and the head is on
the right hand side) at time $w(y_j)$ (from Figure 5.5)
Definition 5.8 〈 Blocking: y ⊏ x 〉
A variable y is ‘blocked’ by another variable x, denoted by y ⊏ x, if and
only if the following holds:











Definition 5.9 〈 Interfering: x ◻ y 〉
A pair of variables, x and y ‘interfere’ with each other, denoted by x ◻ y,
if and only if the following holds:
(y ⊏ x) ∨ (x ⊏ y) (5.2)
Clearly, the ‘interfere’ relation ◻ is symmetric while the ‘blocked’ relation
⊏ is not.
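The blocking effect on a shared output buffer can be illustrated with a small simulation. This is an abstract model under stated assumptions, not the SCAD simulator: the buffer is a plain FIFO, a write enqueues all copies of a value at the tail, and a read that does not find its value at the head rotates head entries to the tail, counting each rotation as one overhead operation. The function name and the event encoding are hypothetical.

```python
from collections import deque


def count_rotations(events):
    """Count overhead rotations needed to serve reads from one FIFO buffer.

    events: list of ("w", value, copies) and ("r", value) in time order.
    """
    buffer = deque()           # buffer[0] is the head, the right end the tail
    rotations = 0
    for event in events:
        if event[0] == "w":
            _, value, copies = event
            buffer.extend([value] * copies)
        else:
            _, value = event
            if value not in buffer:
                raise ValueError(f"no copy of {value} in the buffer")
            while buffer[0] != value:
                buffer.rotate(-1)      # move the head entry to the tail
                rotations += 1
            buffer.popleft()           # dequeue the requested copy
    return rotations


# y is read while one copy of x is still unread: y is blocked by x.
blocked = [("w", "x", 2), ("r", "x"), ("w", "y", 1), ("r", "y"), ("r", "x")]
# All reads of x precede the read of y: no blocking.
free = [("w", "x", 2), ("r", "x"), ("w", "y", 1), ("r", "x"), ("r", "y")]
print(count_rotations(blocked), count_rotations(free))
```

In the first event sequence one rotation is required, so x and y would have to be placed in different output buffers to stay overhead-free; in the second sequence they can share a buffer.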
5.2.2. Computing the Interference Graph
The order in which variables are written and read determines if they interfere.
Therefore, to compute the interference of variables, we enforce that variables
are written and read in the order that they are defined and used, respectively,
in the command program. This order is enforced by restricting the move code
generation in that command instructions are considered in program order for
move code generation. That is, operand moves of any command instruction Ii
are scheduled before scheduling any operand moves of command instruction
Ii+1. Thus, variable writes and reads occur in program order at runtime,
facilitating the use of the well-known dataflow analysis framework [Kild73] to
compute whether y ⊏ x holds for any pair of variables x and y.
First, we define the domain of values on which the dataflow analysis frame-
work is employed.
Definition 5.10 〈 var-def (v, I) tuple 〉
A variable definition tuple (v, I) contains a variable v and an instruction
I that defines (writes) variable v.
Definition 5.11 〈 var-use (v, I) tuple 〉
A variable use tuple (v, I) contains a variable v and an instruction I that
uses (reads) variable v.
Let W and R denote the set of all possible var-def and var-use tuples respec-
tively in a given command program. Furthermore, let W(x) and R(x) denote
the subsets of W and R, respectively, consisting of tuples that correspond to
variable x, i.e.,
W(x) = {(x, I) ∣ (x, I) ∈W}
R(x) = {(x, I) ∣ (x, I) ∈ R} (5.3)
Next, ‘may’ approximations of reaching variable definition tuples and live
variable use tuples are computed at each program point. The results of these
analyses are then used to compute the interference of variables on buffers in
a SCAD machine. The two analyses are described first, followed by the
computation of the buffer interference graph.
Reaching Definition Tuples
A var-def tuple (v, I) reaches a program point p if there exists at least one
path from the point immediately following instruction I to p such that the
variable v is not redefined (overwritten) along that path. The goal of reaching
definitions analysis is to compute for every program point the set of var-def
tuples that reach this program point. Clearly, reaching definitions analysis is a
forward analysis performed on the set of var-def tuples. The following transfer
function and dataflow equations are used:
in(I) = ⋃_{I′↝I} out(I′)
out(I) = fI(in(I)) = (in(I) ∖ kill(I)) ∪ gen(I)
(5.4)
where in(I) and out(I) are the sets of var-def tuples that reach the entry and
the exit of instruction I, respectively. Instruction I ′ in I ′ ↝ I is a predeces-
sor of I. The set of var-def tuples generated by instruction I is denoted by
gen(I) and kill(I) denotes the set of var-def tuples killed by instruction I. See
Table 5.3 for gen(I) and Table 5.4 for kill(I) for each command instruction
type. Note that the union operator ∪ is used as the meet or join operator to
compute the ‘may’ approximation.
Command Instr I    gen(I) : var-def tuples
y = x              {(y, I)}
y = x1 ⊙ x2        {(y, I)}
y = x1[x2]         {(y, I)}
y[x1] = x2         {}
goto i             {}
if x goto i        {}
Table 5.3.: Variable definition tuples generated by each command instruction type
Command Instr I    kill(I) : var-def tuples
y = x              W(y)
y = x1 ⊙ x2        W(y)
y = x1[x2]         W(y)
y[x1] = x2         {}
goto i             {}
if x goto i        {}
Table 5.4.: Variable definition tuples killed by each command instruction type
For any given program, starting with empty initial values, the analysis
evaluates the equation system 5.4 until a fixpoint is reached, i.e., as long as
there are changes (see Algorithm 1). This derives, for every program point p,
the set of var-def tuples that reach p, denoted by Dp.
Algorithm 1: Reaching definitions analysis
Function ReachingDefs(CFG):
    // initialize
    forall node I in CFG do
        out[I] ← ∅
    end
    // iterate until a fixpoint is reached
    repeat
        foreach node I in CFG do
            in[I] ← ⋃_{I′↝I} out[I′]
            out[I] ← (in[I] ∖ kill(I)) ∪ gen(I)
        end
    until out[I] unchanged for all nodes I
    // return
    return in[I] for each node I
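As an illustration of Algorithm 1, the following Python sketch runs the same fixpoint iteration on a small control-flow graph. The CFG encoding (node id mapped to the defined variable and the successor list) and the name reaching_defs are assumptions of this sketch and not the data structures of the SCAD compiler.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical CFG encoding: node id -> (variable defined by the node or
# None, list of successor node ids).
CFG = Dict[int, Tuple[Optional[str], List[int]]]
DefTuple = Tuple[str, int]  # var-def tuple (v, I)


def reaching_defs(cfg: CFG) -> Dict[int, Set[DefTuple]]:
    """Compute the var-def tuples that may reach the entry of each node."""
    preds: Dict[int, List[int]] = {n: [] for n in cfg}
    for n, (_, succs) in cfg.items():
        for s in succs:
            preds[s].append(n)
    all_defs = {(v, n) for n, (v, _) in cfg.items() if v is not None}

    in_sets: Dict[int, Set[DefTuple]] = {n: set() for n in cfg}
    out_sets: Dict[int, Set[DefTuple]] = {n: set() for n in cfg}
    changed = True
    while changed:  # iterate until no out set changes any more
        changed = False
        for n, (v, _) in cfg.items():
            # union over all predecessors ('may' approximation)
            in_sets[n] = set().union(*(out_sets[p] for p in preds[n]))
            kill = {d for d in all_defs if d[0] == v}      # W(v)
            gen = {(v, n)} if v is not None else set()
            new_out = (in_sets[n] - kill) | gen
            if new_out != out_sets[n]:
                out_sets[n] = new_out
                changed = True
    return in_sets


example_cfg: CFG = {
    0: ("x", [1]),      # x = ...
    1: ("y", [2]),      # y = ...
    2: (None, [3, 4]),  # conditional branch
    3: ("x", [5]),      # x = ... (redefinition on one path only)
    4: (None, [5]),     # no definition on the other path
    5: (None, []),      # join point
}
D = reaching_defs(example_cfg)
```

At the join point (node 5), both definitions of x reach, so D[5] contains ("x", 0) and ("x", 3) in addition to ("y", 1).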
Live Use Tuples
A var-use tuple (v, I) may be reached from a program point p if there exists
at least one path from p to the instruction I such that the variable v is not
defined (written) along that path. The goal of live uses analysis is to compute
for every program point, the set of var-use tuples that may be reached from this
program point. Unlike reaching definitions analysis, live uses is a backward
analysis performed on the set of var-use tuples. The following transfer function
and dataflow equations are used:
out(I) = ⋃_{I↝I′} in(I′)
in(I) = fI(out(I)) = (out(I) ∖ kill(I)) ∪ gen(I)
(5.5)
where in(I) and out(I) are sets of var-use tuples that may be reached from
the entry and the exit of instruction I, respectively. Instruction I ′ in I ↝ I ′ is
a successor of I. Furthermore, gen(I) is the set of var-use tuples generated by
instruction I and kill(I) denotes the set of var-use tuples killed by instruction I.
See Table 5.5 for gen(I) and Table 5.6 for kill(I) for each command instruction
type. Again, note that the union operator ∪ is used as the meet or join
operator to compute the ‘may’ approximation.
For any given program, starting with empty initial values, the analysis
evaluates the equation system 5.5 until a fixpoint is reached, i.e., as long as
there are changes (see Algorithm 2). This derives, for every program point p,
the set of var-use tuples Up that may be reached from p.
Command Instr I    gen(I) : var-use tuples
y = x              {(x, I)}
y = x1 ⊙ x2        {(x1, I), (x2, I)}
y = x1[x2]         {(x2, I)}
y[x1] = x2         {(x1, I), (x2, I)}
goto i             {}
if x goto i        {(x, I)}
Table 5.5.: Variable use tuples generated by each command instruction type
Command Instr I    kill(I) : var-use tuples
y = x              R(y)
y = x1 ⊙ x2        R(y)
y = x1[x2]         R(y)
y[x1] = x2         {}
goto i             {}
if x goto i        {}
Table 5.6.: Variable use tuples killed by each command instruction type
Algorithm 2: Live use tuples analysis
Function LiveUses(CFG):
    // initialize
    forall node I in CFG do
        in[I] ← ∅
    end
    // iterate until a fixpoint is reached
    repeat
        foreach node I in CFG do
            out[I] ← ⋃_{I↝I′} in[I′]
            in[I] ← (out[I] ∖ kill(I)) ∪ gen(I)
        end
    until in[I] unchanged for all nodes I
    // return
    return in[I] for each node I
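The backward analysis of Algorithm 2 can be sketched in the same way. The CFG encoding below (defined variable, list of used variables, successors) and the name live_uses are assumptions made for this example only.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical CFG encoding: node id -> (variable defined by the node or
# None, variables used by the node, list of successor node ids).
CFG = Dict[int, Tuple[Optional[str], List[str], List[int]]]
UseTuple = Tuple[str, int]  # var-use tuple (v, I)


def live_uses(cfg: CFG) -> Dict[int, Set[UseTuple]]:
    """Compute the var-use tuples that may be reached from each node entry."""
    all_uses = {(v, n) for n, (_, uses, _) in cfg.items() for v in uses}
    in_sets: Dict[int, Set[UseTuple]] = {n: set() for n in cfg}
    changed = True
    while changed:  # iterate until no in set changes any more
        changed = False
        for n in reversed(list(cfg)):  # visiting order only affects speed
            defined, uses, succs = cfg[n]
            # union over all successors ('may' approximation)
            out_set = set().union(*(in_sets[s] for s in succs))
            kill = {u for u in all_uses if u[0] == defined}  # R(defined)
            gen = {(v, n) for v in uses}
            new_in = (out_set - kill) | gen
            if new_in != in_sets[n]:
                in_sets[n] = new_in
                changed = True
    return in_sets


example_cfg: CFG = {
    0: ("x", [], [1]),       # x = ...
    1: ("y", ["x"], [2]),    # y = x
    2: (None, ["y"], [3, 4]),  # if y goto 4
    3: ("x", [], [4]),       # x = ...
    4: (None, ["x"], []),    # ... = x
}
U = live_uses(example_cfg)
```

Here the use of x in node 4 is live at the entry of node 1 (via the path that skips node 3), but not at the entry of node 3, which overwrites x.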
Computation
Using results from the aforementioned analyses, we now determine for any
given pair of variables x and y, if y is blocked by x (i.e., y ⊏ x). In other
words, we determine if there exists any program point p whose execution lies
on the dashed line (in red color) in Figure 5.5. In the following, we use the
notations:
Dp(x) = {I ∣ (x, I) ∈ Dp}
Up(x) = {I ∣ (x, I) ∈ Up}
First, we identify any program point p such that there exists a path from a
definition of x (say instruction I^x_d) to p via a definition of y (say
instruction I^y_d) such that x is not overwritten along that path and y is not
overwritten along the path from I^y_d to p. Such a path exists only if the
following holds:
∃ I^y_d ∈ Dp(y) ∣ (∃ I^x_d ∈ D_{I^y_d}(x) ∣ I^x_d ∈ Dp(x)) (5.6)
Note that this is equivalent to identifying
∃i, j ∣ t(p) > w(yj) > w(xi)
where t(p) is the time at which the program point p is executed.
Next, we identify for the same program point p whether there exists a path
from p to a use of x (say instruction I^x_u) via a use of y (say instruction
I^y_u) such that x is not defined (written) along that path and y is not
defined (written) along the path from p to I^y_u. Such a path exists only if
the following holds:
∃ I^y_u ∈ Up(y) ∣ (∃ I^x_u ∈ U_{I^y_u}(x) ∣ I^x_u ∈ Up(x)) (5.7)
Note that this is equivalent to identifying
∃m,n ∈ N ∣ t(p) < rn(yj) < rm(xi)
for the same i, j and t(p).
If both such paths exist, we conclude from Definition 5.8 that variable y is
blocked by x, i.e., y ⊏ x holds. Then, clearly, the execution time t(p) of
program point p lies on the dashed line (in red color) in Figure 5.5.
Therefore, for any given pair of variables x and y, if there exists a point p
in the program such that the following holds:
(∃ I^y_d ∈ Dp(y) ∣ (∃ I^x_d ∈ D_{I^y_d}(x) ∣ I^x_d ∈ Dp(x)))
∧ (∃ I^y_u ∈ Up(y) ∣ (∃ I^x_u ∈ U_{I^y_u}(x) ∣ I^x_u ∈ Up(x)))
then y is blocked by x, i.e., y ⊏ x holds. Clearly, the above condition for
identifying relevant program points is necessary but not sufficient, since the
paths it considers may be infeasible. That is, if y is blocked by x at runtime,
then the above condition must be satisfied. However, if the above condition is
satisfied, it may also be the case that y is not blocked by x.
Algorithm 3: Constructing the buffer interference graph
Function ConstructBufferInterferenceGraph(CFG):
    // initialize
    V ← set of all variables in the command program
    E ← ∅ // empty set of edges
    P ← set of all variable pairs
    // dataflow analyses
    D ← ReachingDefs(CFG)
    U ← LiveUses(CFG)
    // return true if x is blocking y, otherwise false
    Function Blocking(x, y):
        /* return true if any definition of x reaches any definition of
           y that in turn reaches node I */
        Function DefOrder(I):
            yDefs ← {Iy ∣ (y, Iy) ∈ D[I]}
            foreach node Iy in yDefs do
                xDefs ← {Ix ∣ (x, Ix) ∈ D[Iy]}
                if xDefs ≠ ∅ then return true
            end
            return false
        /* return true if any use of x is live at any use of y that in
           turn is live at node I */
        Function UseOrder(I):
            yUses ← {Iy ∣ (y, Iy) ∈ U[I]}
            foreach node Iy in yUses do
                xUses ← {Ix ∣ (x, Ix) ∈ U[Iy]}
                if xUses ≠ ∅ then return true
            end
            return false
        foreach node I in CFG do
            if DefOrder(I) ∧ UseOrder(I) then return true
        end
        return false
    // construct buffer interference graph
    foreach variable pair (x, y) in P do
        // add edge (x, y) to E if x and y interfere
        if Blocking(x, y) ∨ Blocking(y, x) then
            E ← E ∪ {(x, y)}
        end
    end
    // return
    return G = (V, E)
From Definition 5.9, variables x and y interfere (i.e., x ◻ y holds) if either
y ⊏ x or x ⊏ y is true. By computing this for all pairs of variables in the
command program, a buffer interference graph G = (V,E) is constructed with
variables as vertices V, and there exists an edge between any two vertices if
the corresponding variables interfere (see Algorithm 3). Any coloring heuristic
may now be used to assign colors (output buffers or PUs) to vertices (vari-
ables) such that interfering variables are always assigned to different output
buffers or PUs. This guarantees that when command instructions are consid-
ered in program order for move code generation, relevant operand values will
always be found at the head of respective output buffers for transportation to
destination input buffers. Furthermore, a coloring of the buffer interference
graph will guarantee that multiple operands of the same command instruc-
tion are mapped to different output buffers. Note that live use tuples analysis
(Algorithm 2) returns a set of live use tuples at the entry of each command
instruction I (in[I]). Now consider for example, the assignment instruction I
:= y = x1 ⊙ x2. Clearly, both I ∈ UI(x1) and I ∈ UI(x2). Therefore, from 5.7,
the entry of instruction I will serve as a program point that reaches a use of
x1 (I) via a use of x2 (again I) and vice versa. Thus, it will be determined
that x1 and x2 interfere irrespective of the order of the definitions of x1 and
x2.
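Since any coloring heuristic may be used, the following Python sketch shows one simple possibility: a greedy coloring that visits variables in order of decreasing degree. The choice of heuristic, the function name greedy_buffer_assignment and the example graph are assumptions of this sketch; the text does not prescribe a particular heuristic.

```python
from typing import Dict, Iterable, List, Set, Tuple


def greedy_buffer_assignment(
    variables: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> Dict[str, int]:
    """Assign a buffer index (color) to each variable such that interfering
    variables never share a buffer. Greedy, hence not necessarily minimal."""
    neighbours: Dict[str, Set[str]] = {v: set() for v in variables}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    colour: Dict[str, int] = {}
    # Visit high-degree variables first (an assumed, common heuristic);
    # ties are broken by name to keep the result deterministic.
    order: List[str] = sorted(neighbours, key=lambda v: (-len(neighbours[v]), v))
    for v in order:
        taken = {colour[n] for n in neighbours[v] if n in colour}
        c = 0
        while c in taken:  # smallest buffer index not used by a neighbour
            c += 1
        colour[v] = c
    return colour


# Hypothetical interference graph: the two operands x1 and x2 of one
# instruction interfere, and so do y and z; t interferes with nothing.
example_edges = [("x1", "x2"), ("y", "z")]
assignment = greedy_buffer_assignment(["x1", "x2", "y", "z", "t"], example_edges)
```

For this graph two buffers suffice: the operands x1 and x2 end up in different buffers, whereas t can share a buffer with any other variable.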
[Program listing with definitions and uses of variables x, y and z, annotated
with their live ranges and use intervals]
Figure 5.7.: Live ranges and use intervals in a program
5.2.3. Remarks
The buffer assignment is based on more liberal rules than the traditional
assignment of variables to registers, where variables that are live at the same
program point are assigned to different registers. In contrast, two live
variables can be mapped to the same buffer. Only if, in addition to their live
ranges, the use intervals of the variables also overlap must they be assigned
to different buffers to avoid overhead in the generated move code (see
Figure 5.7 for an example). For the given program, since the live ranges of
variables x, y and z overlap, they must be assigned to different registers to
execute on classic register architectures. However, only the use intervals of
y and z overlap, so that only y and z need to be assigned to different buffers
for an overhead-free execution on SCAD architectures. This is not surprising
since buffers can accommodate different live values, albeit restricting access
to only the one at the head of the buffer.
5.3. Balancing Variables
Any PU in a SCAD machine must produce as many copies of the computed
result in its output buffer as there are uses of that value in the program.
However, the number of uses of a value may vary depending on which control
flow path is taken. Since the number of copies of values to produce must
be determined at compile-time and the program control flow is only known at
runtime, the SCAD compiler must transform the program in some way so that
the number of uses of a value along each control flow path is equalized. In
other words, all variables assigned to buffers must be balanced in the program
where balancing is defined as follows:
Definition 5.12 〈 Balanced Variable 〉
A variable x is balanced in a program if for every program point p, the
number of uses of x is the same on each path from p to the end node Ie.
Equivalently, a variable is said to be ‘balanced’ in a program if the number of
uses of that variable is uniquely determined at all program points. For SCAD
compilation, it is necessary to balance variables so that the number of copies
to produce is uniquely determined for each variable at compile time. Consider
for example, the conditional execution shown in the control-flow graph in
Figure 5.8a. If the left control flow path is taken, the value of variable x from
the definition above the control flow split point is not used since x is redefined
in the left path. If the right control flow path is taken instead, the value of x
is used 3 times (2 uses in the right path and 1 use after the control flow join
point). However, a predetermined number of copies must be produced in the
output buffer of a PU when x is defined above the split point, and all copies
must be consumed irrespective of the future control flow so that there are no
stray values (values not accounted for).
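The example can be checked mechanically. The following Python sketch computes, for every node of an acyclic control-flow graph, the set of possible use counts of the current value of x on the paths to the end node; the variable is balanced if every such set has exactly one element. The CFG encoding, the node names and the convention that counting stops at a redefinition of x are assumptions of this sketch.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding for a single variable x:
# node -> (number of uses of x, whether the node redefines x, successors).
CFG = Dict[str, Tuple[int, bool, List[str]]]


def use_count_sets(cfg: CFG) -> Dict[str, Set[int]]:
    """Possible numbers of uses of the current value of x from the entry of
    each node onwards. Assumes an acyclic CFG (loops are not handled)."""
    memo: Dict[str, Set[int]] = {}

    def visit(n: str) -> Set[int]:
        if n not in memo:
            uses, defines, succs = cfg[n]
            if defines or not succs:
                # a redefinition ends the lifetime of the current value
                memo[n] = {uses}
            else:
                memo[n] = {uses + c for s in succs for c in visit(s)}
        return memo[n]

    for node in cfg:
        visit(node)
    return memo


def is_balanced(cfg: CFG) -> bool:
    return all(len(counts) == 1 for counts in use_count_sets(cfg).values())


# Figure 5.8a as described in the text: x is redefined on the left path,
# used twice on the right path and once after the join point.
unbalanced: CFG = {
    "def": (0, True, ["split"]),
    "split": (0, False, ["left", "right1"]),
    "left": (0, True, ["join"]),
    "right1": (1, False, ["right2"]),
    "right2": (1, False, ["join"]),
    "join": (1, False, []),
}

# Same program with three dummy uses of x placed before the redefinition.
balanced: CFG = dict(unbalanced)
balanced["split"] = (0, False, ["d1", "right1"])
balanced["d1"] = (1, False, ["d2"])
balanced["d2"] = (1, False, ["d3"])
balanced["d3"] = (1, False, ["left"])
```

At the split point the first graph yields the use counts {0, 3}, whereas the second yields {3} only, so that three copies can be produced unconditionally.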
We propose two approaches to balance any variable x in a program: (1)
Dummy uses of variable x in the form of assignments dx ← x to a dummy
variable dx are introduced as shown in Figure 5.8b. Note that x is now used
3 times along both the left and the right paths so that the first definition of
x can now safely produce 3 copies. The dummy assignments are translated to
move instructions to the special address null so that these values are simply
discarded by the SCAD machine at runtime (see Section 5.4). The algorithm to
balance variables in programs by introducing dummy assignments is presented
and discussed in Section 5.3.1. (2) Alternatively, we introduce copy
assignments of variable x to itself (x ← x) in the left and the right control
flow paths as shown in Figure 5.8c. Note that x is now used once along each
path so that the first definition of x needs to produce only 1 copy. It is easy
to see that the elimination of σ-functions in SSI programs will provide exactly
this transformation. The program in Figure 5.8a after SSI transformation is
shown in Figure 5.9a. A naive elimination of σ(x2, x3) ← x1 places copy
assignments x2 ← x1 and x3 ← x1 in the left and the right paths, respectively,
as shown in Figure 5.9b, so that x1 is used once along each path. We prove in
Section 5.3.2 that all variables are inherently balanced in SSI programs and
therefore directly use the SSI transformation for balancing programs [Schn18].
Figure 5.8.: Balancing variables by dummy assignments and copy assignments:
(a) not balanced, (b) dummy assignments, (c) copy assignments
(a) SSI program
x1 ← . . .
σ(x2, x3) ← x1
left path:  x4 ← . . .
right path: . . . ← x3; . . . ← x3
x5 ← φ(x4, x3)
. . . ← x5
(b) After elimination
x1 ← . . .
left path:  x2 ← x1; x4 ← . . . ; x5 ← x4
right path: x3 ← x1; . . . ← x3; . . . ← x3; x5 ← x3
. . . ← x5
Figure 5.9.: Balancing variables by SSI transformation
The dependence of the number of variable uses on the number of loop
iterations poses an additional challenge in balancing variables. If there
exists a path in a loop body where a variable is only used, the number of
copies of this variable to produce will depend on the number of loop
iterations. This is not desirable since the number of loop iterations is often
unknown at compile time. Moreover, even if statically known, it is often a
large number, so that instantiating this many copies of a variable would
require huge buffer sizes in SCAD. For this reason, all variable uses in loops
must be bounded, where bounding is defined as follows:
Definition 5.13 〈 Bounded Variable 〉
A variable x is bounded in a loop body (Ils, . . . , Ile) if there does not exist
any path from loop start Ils to loop end Ile where x is only used.
We prove in Section 5.3.2 that the SSI transformation bounds all unbounded
variables in a given program so that there is no need to explicitly bound vari-
ables when balancing a program by SSI transformation. However, when a
program is balanced by discarding copies (i.e., by introducing dummy assign-
ments), all unbounded uses of any variable x must be explicitly bounded. To
this end, we introduce an additional variable x′ to bound the use of x in a
loop. Consider for example, the loop in Figure 5.10a with an unbounded use
of x. The new variable x′ is introduced as shown in Figure 5.10b. Note that
both x and x′ are now bounded since they are defined in the loop body, and
thus the number of copies to produce for each definition of x and x′ is inde-
pendent of the number of loop iterations. Each definition in this example may
produce one copy. The algorithm to bound variables in programs is given in
Section 5.3.1.
5.3.1. Balancing by Discarding Copies
Recall that MiniC is a structured programming language that allows program-
mers to only use structured control flow constructs. Therefore, modifications
of the abstract syntax tree (AST) representation of a program are guaranteed
to retain the block structure of the program, while modifications of the simpler
control-flow graph (CFG) representation (i.e., at the intermediate command
language level) may lose the block structure of the program if not carefully
done. Due to this reason, we decide to implement bounding and balancing of
variables at the MiniC language level and not at the intermediate command
language level.
Algorithm 4: Bound variable x in loops in statement S
Function BoundLoops(S, x):
    switch S do // statement type
        case seq(S1,S2) do // sequence: S1;S2
            Sb1 ← BoundLoops(S1, x)
            Sb2 ← BoundLoops(S2, x)
            return seq(Sb1, Sb2)
        end
        case cond(ϕ,S1,S2) do // conditional: if(ϕ) S1 [else S2]
            Sb1 ← BoundLoops(S1, x)
            Sb2 ← BoundLoops(S2, x)
            return cond(ϕ, Sb1, Sb2)
        end
        case while(ϕ,Sl) do // while loop: while(ϕ) Sl
            Sbl ← BoundLoops(Sl, x) // bound inner loops
            if x is unbounded in Sbl then
                Sbl ← seq(asg(x, x′), seq(Sbl, asg(x′, x)))
                return seq(asg(x′, x), while(ϕ, Sbl))
            end
            return while(ϕ, Sbl)
        end
        otherwise do // atomic statements contain no loops
            return S
        end
    end
x ← . . .
⋮
. . . ← x
⋮
(a) Unbounded variable use
Figure 5.10.: Bounding variable use in loop
Statements in MiniC are defined recursively (see Table 5.1). Algorithm 4
implements the bounding of any variable x in any MiniC statement S
recursively from innermost to outermost loops. Note that for each statement
type S, the use of x is first bounded in the contained statements, if any,
followed by the bounding of x in statement S if S is a loop statement. For
example, for the while statement S := while(ϕ,Sl), the use of x is first
bounded in the loop body statement Sl to obtain Sbl. The variable x is then
bounded in this while loop by introducing new assignments asg(x,x′) and
asg(x′, x) using variable x′ as explained in Figure 5.10. The do-while and
for-loop statements are handled similarly and are not shown in Algorithm 4.
Moreover, for loops and do-while loops are easily expressed in terms of while
loops as follows:
Loop                       Corresponding while loop
for (i = τl . . . τu) Sl   i = τl; while(i < τu) {Sl; i = i + 1}
do Sl while(ϕ)             Sl; while(ϕ) Sl
To ensure that all variables are bounded in all loops in the program, variable
uses are bounded in function statements in the MiniC program followed by
the main program statement.
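The loop rewriting in the table above can be sketched as a small recursive function on a tuple-based AST. The AST encoding (tagged tuples such as ("for", i, lower, upper, body)) and the name desugar are assumptions made for this illustration and do not reflect the actual MiniC front end.

```python
from typing import Any, Tuple

Stmt = Tuple[Any, ...]  # hypothetical tagged-tuple AST node


def desugar(stmt: Stmt) -> Stmt:
    """Express for loops and do-while loops in terms of while loops."""
    kind = stmt[0]
    if kind == "for":  # ("for", i, lower, upper, body)
        _, i, lower, upper, body = stmt
        increment = ("asg", i, ("+", i, 1))  # i = i + 1
        loop = ("while", ("<", i, upper), ("seq", desugar(body), increment))
        return ("seq", ("asg", i, lower), loop)
    if kind == "dowhile":  # ("dowhile", body, condition)
        _, body, condition = stmt
        new_body = desugar(body)
        # the body is executed once before the while loop
        return ("seq", new_body, ("while", condition, new_body))
    if kind == "seq":
        return ("seq", desugar(stmt[1]), desugar(stmt[2]))
    if kind == "cond":  # ("cond", condition, then_stmt, else_stmt)
        return ("cond", stmt[1], desugar(stmt[2]), desugar(stmt[3]))
    if kind == "while":
        return ("while", stmt[1], desugar(stmt[2]))
    return stmt  # atomic statements are left unchanged


body = ("asg", "a", ("+", "a", "i"))
for_loop = ("for", "i", 0, 10, body)
do_loop = ("dowhile", body, ("<", "a", 100))
```

After this rewriting, only the while case of the bounding and balancing algorithms is needed for loops.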
After bounding variable uses in loops, balancing of variables is performed
by a bottom-up traversal of the AST representation of the program. The
bottom-up traversal is necessary since balancing any variable x in any
statement S requires knowledge of the number of uses of x at the exit of S.
This is clear from the example in Figure 5.8b, where three dummy assignments
are placed in the left path since x is used twice in the right path and once
after the exit of the conditional statement (i.e., after the control flow join
point). The recursive function that implements the bottom-up traversal of the
AST to balance variable x in statement S, given the number of uses of x at the
exit of S (uexit), is given in Algorithm 5. Note that for each statement type
S, x is first balanced in the contained statements, if any, before balancing x
in the statement S itself. For a sequence statement seq(S1, S2), statement S2
is balanced first, followed by S1. This implements a bottom-up traversal.
Clearly, dummy assignments (i.e., discarding of copies) are needed only for
those statements that contain control flow split points. Finally, the balanced
statement and the use count of x at the entry of the balanced statement
(uentry) are returned, so that uentry serves as the number of uses of x at the
exit of the next statement considered in the bottom-up traversal. Note that
uentry must be evaluated even for atomic statements that do not require
balancing. For the assignment statement asg(λ, τ), uentry is either set to 0
if x is defined by the assignment or evaluated by adding the use counts of x
in the left-hand side expression λ and the right-hand side expression τ to
uexit. This is done similarly for the function call statement
fun(τ1, . . . , τn) and the return statement return(τ). For the sequence
statement S := seq(S1, S2), the number of uses evaluated at the entry of S2 is
provided as the use count at the exit of S1 when balancing the statement S1.
The number of uses of x subsequently evaluated at the entry of S1 is returned.
Consider the balancing of variable x in a conditional statement S := cond(ϕ,S1,S2)
in Algorithm 5 in the context of the example in Figure 5.8b. The number of
uses at the exit of S is uexit = 1. Assume that the left path in Figure 5.8b
corresponds to the if statement S1 and the right path corresponds to the else
statement S2. The statements S1 and S2 are balanced and respective entry
use counts uS1 = 0 and uS2 = 3 are evaluated. Next, the use count at the
control flow split point is fixed as the maximum of the ‘if’ path and the ‘else’
path, and the appropriate number of copies are discarded in the other path.
In this case, uS2 − uS1 = 3 copies of the variable x are discarded in the left
(if) path. Finally, the number of uses of x in the condition expression ϕ and
the use count of x at the split point are added to determine the use count at
the entry of the conditional statement S. The main idea behind balancing a
variable use this way is to have as many copies of the variable as its maximal
use count at the control flow split point and then discard unnecessary copies
depending on the path taken.
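The handling of assignments, sequences and conditionals can be traced with the following Python sketch, which reproduces the numbers of the example (uS1 = 0, uS2 = 3, three discarded copies). The tuple-based AST, the dummy variable name "d_x" and the function names are assumptions of this sketch; loops are deliberately left out. For non-defining assignments the sketch adds the exit use count to the local use count, which is what the example requires.

```python
from typing import Any, List, Tuple

Stmt = Tuple[Any, ...]  # hypothetical tagged-tuple AST node


def discard(stmt: Stmt, x: str, n: int) -> Stmt:
    """Prepend n dummy assignments d_x = x to stmt."""
    for _ in range(n):
        stmt = ("seq", ("asg", "d_" + x, [x]), stmt)
    return stmt


def balance(stmt: Stmt, x: str, u_exit: int) -> Tuple[Stmt, int]:
    """Return the balanced statement and the use count of x at its entry."""
    kind = stmt[0]
    if kind == "asg":  # ("asg", target, [operand variables])
        _, target, operands = stmt
        if target == x:
            return stmt, 0  # x is redefined: earlier copies are not used
        return stmt, u_exit + operands.count(x)
    if kind == "seq":  # bottom-up: balance S2 first, then S1
        _, s1, s2 = stmt
        b2, u2 = balance(s2, x, u_exit)
        b1, u1 = balance(s1, x, u2)
        return ("seq", b1, b2), u1
    if kind == "cond":  # ("cond", [condition variables], S1, S2)
        _, cond_vars, s1, s2 = stmt
        b1, u1 = balance(s1, x, u_exit)
        b2, u2 = balance(s2, x, u_exit)
        if u1 > u2:
            b2 = discard(b2, x, u1 - u2)
        else:
            b1 = discard(b1, x, u2 - u1)
        return ("cond", cond_vars, b1, b2), max(u1, u2) + cond_vars.count(x)
    raise ValueError("statement type not covered by this sketch: %r" % (kind,))


def flatten(stmt: Stmt) -> List[Stmt]:
    """List the atomic statements of a (nested) sequence in program order."""
    if stmt[0] == "seq":
        return flatten(stmt[1]) + flatten(stmt[2])
    return [stmt]


# Conditional of Figure 5.8a followed by one use of x after the join point.
left = ("asg", "x", [])                                    # x = ...
right = ("seq", ("asg", "a", ["x"]), ("asg", "b", ["x"]))  # two uses of x
program = ("seq", ("cond", [], left, right), ("asg", "c", ["x"]))
balanced_program, u_entry = balance(program, "x", 0)
left_branch = balanced_program[1][2]
```

The left branch of the balanced conditional now starts with three dummy assignments, and the use count at the entry of the program is 3, matching the number of copies the first definition of x has to produce.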
Observe that the use count of a variable at a control flow split point in a
conditional statement S := cond(ϕ,S1,S2) does not affect the balancing of that
variable in its contained statements S1 and S2. However, due to back edges
in loops, the use count of a variable at the split point will affect the
balancing of that variable in its contained loop body statement. For this
reason, for loop statements, we first determine the use count at the split
point and then balance variables in the loop body statement in a second phase.
In more detail, consider the balancing of variable x in the while statement
S := while(ϕ,Sl) in the context of the example in Figure 5.11a. The variable x
is not used after the while loop, thus uexit = 0. To balance x in the contained
loop body statement Sl, we have to derive the number of uses of x at the exit
of Sl, which depends on the use count at the split point. At the control flow
split, the program control flows either to the exit of the loop statement S or
to the entry of the loop body statement Sl. Therefore, to determine the
maximal use count of x at the split point, besides uexit, we have to evaluate
the use count at the entry of the loop body statement, denoted by uSl in
Algorithm 5. Note that the number of uses of x at the exit of Sl will not
impact the use count at the entry of Sl because variable uses in loops are
already bounded, which guarantees that x is defined in all paths in Sl where
it is used. Now invoking the balance statement function with Sl and x as
arguments will return the use count of x at the entry of Sl, and a balanced
Sl which is ignored. In this example, uSl = 1 since x is used once in the loop
body before it is defined. The use count u at the control flow split point is
finally fixed to the maximum of uSl and uexit. The number of uses of x at the
exit of Sl is now easily obtained by adding the use count at the control flow
split and the use count of x in the condition expression ϕ. This value can now
be used to balance the loop body statement Sl. In this example, balancing x
in the loop body will yield three dummy assignments in the left path of the
conditional statement inside the loop body. Finally, variable x is balanced in
the loop statement S := while(ϕ,Sl) by discarding the appropriate number of
copies after the loop exit or at the entry to the loop body. One copy of x is
discarded after the loop exit in the given example.
Algorithm 5: Balance variable x in loop bounded statement S
Function BalanceStmt(S, x, uexit):
    Function Uses(e): // count number of uses of x in expression e
        return number of uses of x in e
    Function Discard(n): // assignment statements to discard x
        return sequence of n discard assignments asg(dx, x)
    // balance statement S
    switch S do // statement type
        case asg(λ,τ) do // assignment: λ = τ;
            if x is defined by S then return (S, 0)
            else return (S, uexit + Uses(λ) + Uses(τ))
        end
        case seq(S1,S2) do // sequence: S1;S2
            Sb2, uS2 ← BalanceStmt(S2, x, uexit)
            Sb1, uS1 ← BalanceStmt(S1, x, uS2)
            return (seq(Sb1, Sb2), uS1)
        end
        case cond(ϕ,S1,S2) do // conditional: if(ϕ) S1 [else S2]
            Sb1, uS1 ← BalanceStmt(S1, x, uexit)
            Sb2, uS2 ← BalanceStmt(S2, x, uexit)
            if uS1 > uS2 then Sb2 ← seq(Discard(uS1 − uS2), Sb2)
            else Sb1 ← seq(Discard(uS2 − uS1), Sb1)
            u ← ((uS1 > uS2) ? uS1 ∶ uS2) // control flow split
            return (cond(ϕ, Sb1, Sb2), u + Uses(ϕ))
        end
        case while(ϕ,Sl) do // while loop: while(ϕ) Sl
            _, uSl ← BalanceStmt(Sl, x, 0)
            u ← ((uexit > uSl) ? uexit ∶ uSl) // control flow split
            Sbl, _ ← BalanceStmt(Sl, x, u + Uses(ϕ))
            uentry ← u + Uses(ϕ)
            if uexit > uSl then
                return (while(ϕ, seq(Discard(uexit − uSl), Sbl)), uentry)
            else
                return (seq(while(ϕ, Sbl), Discard(uSl − uexit)), uentry)
            end
        end
        case fun(τ1, . . . , τn) do // function call: f(τ1, . . . , τn);
            return (fun(τ1, . . . , τn), uexit + Uses(τ1) + . . . + Uses(τn))
        end
        case return(τ) do // return: return τ
            return (return(τ), uexit + Uses(τ))
        end
    end
Figure 5.11.: Balancing variables in loops by discarding copies: (a) not
balanced, (b) balanced
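The use-count arithmetic of the while case can be made explicit with a small Python sketch. The function name while_use_counts and the returned dictionary keys are assumptions of this sketch; the number of uses of x in the loop condition is not stated for the example and is assumed to be 0 below.

```python
from typing import Dict


def while_use_counts(u_exit: int, u_body_entry: int, uses_in_cond: int) -> Dict[str, int]:
    """Use counts of x around a while loop.

    u_exit       -- uses of x after the loop
    u_body_entry -- uses of x at the entry of the loop body (first phase)
    uses_in_cond -- uses of x in the loop condition
    """
    u_split = max(u_exit, u_body_entry)  # copies needed at the split point
    return {
        "split": u_split,
        # count expected at the exit of the loop body (second phase)
        "body_exit": u_split + uses_in_cond,
        # count at the entry of the whole loop statement
        "entry": u_split + uses_in_cond,
        # surplus copies are discarded on the path that needs fewer copies
        "discard_at_body_entry": max(u_exit - u_body_entry, 0),
        "discard_after_exit": max(u_body_entry - u_exit, 0),
    }


# Example from the text: x is not used after the loop (u_exit = 0) and is
# used once in the loop body before it is redefined (u_body_entry = 1).
example = while_use_counts(u_exit=0, u_body_entry=1, uses_in_cond=0)
```

For the example this yields one copy at the split point and one copy to discard after the loop exit, in line with the description above.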
5.3.2. Balancing by SSI Transformation
In this section, we present the SSI transformation as an alternative means to
balance variables in programs by proving that all variables in an SSI program
are balanced by construction. To this end, we first prove that variables are
bounded in loops in an SSI program and then prove that they are balanced.
Lemma 5.1 (Bounded SSI) All variables in an SSI program are bounded
in all loops in the program.
Proof Assume to the contrary that a variable x is not bounded in a loop
(Ils, . . . , Ile). Then, from Definition 5.13, there must exist a path P
on which x is not defined and on which a sequence of n ≥ 1 uses of variable x,
denoted by (Iu1 , . . . , Iun), appears. Let p denote the program point at the
exit of the last use Iun . At program point p, the use Iu1 is upward-exposed
via the back edge (Ile, Ils) of the loop. Moreover, any use of x after the loop
(or the pseudo-use at the program end node Ie in case x is not used after the
loop) is also upward-exposed at the program point p. This violates the unique
upward-exposed use property 5.4 of programs in SSI form, thus contradicting
our assumption that the program is in SSI form. ∎
Theorem 5.1 (Balanced SSI) All variables in an SSI program are balanced.
Proof Lemma 5.1 states that all variables in a program in SSI form are
bounded. Therefore, it is enough to consider a single iteration to count the
number of uses of variables in any loop. In other words, back edges can be
ignored. Assume now that a variable x is not balanced at some point p in the
given program. Let Id be the unique definition of x (see the unique definition
property 5.1 of SSI programs). Clearly, program point p must appear after Id
in the command program since the use count of x at all program points before
Id is zero. Since x is not balanced at p, there exist at least two paths Pi and
Pj from the program point p to the end node Ie with a different number of
uses of x. Without loss of generality, assume that some use Iu is a part of
path Pi, but not a part of path Pj . Since there exists a path from p to Iu and
Id dom Iu (see the dominance property 5.2 of SSI programs), there must exist a
path from Id to p. Therefore, there exists a path from the unique definition Id
to p to the end node Ie (along the path Pj), where the use Iu does not appear.
This violates the post-dominance property 5.3 of SSI programs, which states
that each use of x post-dominates its unique definition, thus contradicting our
assumption that the program is in SSI form. ∎
5.4. Move Code Generation
Once the buffer assignment (assignment of instructions to PUs) and the
number of copies to produce for each variable definition are determined, it is
easy to generate the final move code by considering the command instructions
in program order. Table 5.7 lists command instructions and the corresponding
move instructions, where u(x) denotes the PU that produces x, n denotes the
number of copies to produce, and the memory address of any array variable
x is denoted by adr(x). The operand moves corresponding to a command
instruction may be ordered arbitrarily since the buffer assignment guarantees
that these operands are assigned to different output buffers or PUs. For
example, x1 and x2 in the assignment command instruction y = x1 ⊙ x2 may be
moved to the left and right input buffers, respectively, of u(y) in any order.
Since our SCAD simulator currently only supports the primitive binary
arithmetic and logic operations of the command language (denoted by ⊙), dup
and swap opcodes are not yet supported. Therefore, a copy command y = x is
currently implemented by adding x to the immediate value 0 in u(y), incurring
an additional transportation overhead for moving the immediate value 0 to
the right input buffer of u(y).
Command Instr    Corresponding SCAD Move Instructions
y = x            u(x)@o → u(y)@l; 0 → u(y)@r; + → u(y)@op; n → u(y)@cp;
y = x1 ⊙ x2      u(x1)@o → u(y)@l; u(x2)@o → u(y)@r; op → u(y)@op; n → u(y)@cp;
y = x1[x2]       adr(x1) → u0@l; u(x2)@o → u0@r; + → u0@op; 1 → u0@cp;
                 u0@o → lsu@l; ld → lsu@op; 1 → lsu@cp;
                 lsu@o → u(y)@l; 0 → u(y)@r; + → u(y)@op; n → u(y)@cp;
y[x1] = x2       adr(y) → u0@l; u(x1)@o → u0@r; + → u0@op; 1 → u0@cp;
                 u0@o → lsu@l; u(x2)@o → lsu@r; st → lsu@op;
goto i           i → pc;
if x goto i      u(x)@o → cu@c; i → cu@then; pc → cu@else;
Table 5.7.: Command instructions and corresponding SCAD move instructions
Currently, the SCAD compiler does not perform any array index analysis, so
all array accesses and assignments are directed to the main memory.
Furthermore, we reserve PU 0 (u0 in Table 5.7) for evaluating the memory
address of array elements, which is then transported to the load-store unit
(lsu) for loading or storing array elements. See the array access
(y = x1[x2]) and the array assignment (y[x1] = x2) command instructions in
Table 5.7. Unconditional branching (goto i) is implemented by moving the
branch target i to the special address pc that denotes the program counter.
For conditional branching (if x goto i), the branch target i is moved to the
‘then’ input lane of the control unit, followed by moving the branch
condition produced in u(x)@o to the condition input lane of the control unit.
Though we list the move instruction pc→ cu@else for conditional branching,
this move is performed automatically by the control unit. For more details on
how the control flow is implemented in SCAD, see Section 2.1.3. Finally,
dummy assignments of the form dx = x are translated to the move instruction
u(x)@o→ null, which simply discards a copy of x from the head of the output
buffer u(x)@o. Appendix A lists all intermediate
results from the compilation (input MiniC program, program after bounding
variables in loops, program after balancing variable uses, command program,
buffer interference graph and final move program) of a simple benchmark.
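The translation scheme of Table 5.7 can be sketched as a small function. This
is only an illustrative sketch, not the SCAD compiler's code: the tuple
encoding of command instructions, the function name moves_for, and the
textual "src -> tgt" move format are assumptions made for this example, while
the generated move sequences follow the rows of Table 5.7.

```python
from typing import Callable, List, Tuple


def moves_for(cmd: Tuple, u: Callable[[str], str],
              adr: Callable[[str], int], n: int = 1) -> List[str]:
    """Return the move instructions of Table 5.7 for one command instruction.

    cmd is a hypothetical tuple encoding; u(x) names the PU that produces x,
    adr(x) is the address of array x, n is the number of copies to produce.
    """
    kind = cmd[0]
    if kind == "copy":                      # y = x, realised as x + 0
        _, y, x = cmd
        return [f"{u(x)}@o -> {u(y)}@l", f"0 -> {u(y)}@r",
                f"+ -> {u(y)}@op", f"{n} -> {u(y)}@cp"]
    if kind == "binop":                     # y = x1 op x2
        _, y, op, x1, x2 = cmd
        return [f"{u(x1)}@o -> {u(y)}@l", f"{u(x2)}@o -> {u(y)}@r",
                f"{op} -> {u(y)}@op", f"{n} -> {u(y)}@cp"]
    if kind == "load":                      # y = x1[x2]
        _, y, x1, x2 = cmd
        return [f"{adr(x1)} -> u0@l", f"{u(x2)}@o -> u0@r",
                "+ -> u0@op", "1 -> u0@cp",
                "u0@o -> lsu@l", "ld -> lsu@op", "1 -> lsu@cp",
                f"lsu@o -> {u(y)}@l", f"0 -> {u(y)}@r",
                f"+ -> {u(y)}@op", f"{n} -> {u(y)}@cp"]
    if kind == "store":                     # y[x1] = x2
        _, y, x1, x2 = cmd
        return [f"{adr(y)} -> u0@l", f"{u(x1)}@o -> u0@r",
                "+ -> u0@op", "1 -> u0@cp",
                "u0@o -> lsu@l", f"{u(x2)}@o -> lsu@r", "st -> lsu@op"]
    if kind == "goto":                      # goto i
        return [f"{cmd[1]} -> pc"]
    if kind == "if":                        # if x goto i
        _, x, i = cmd
        return [f"{u(x)}@o -> cu@c", f"{i} -> cu@then", "pc -> cu@else"]
    raise ValueError(f"unknown command kind: {kind}")
```

For instance, with a buffer assignment that maps a, b, and c to the PUs u1,
u2, and u3, the call moves_for(("binop", "c", "*", "a", "b"), u, adr, n=2)
yields the four moves of the second row of the table with n = 2.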
Chapter 5: Heuristics for Code Generation
5.5. Remarks on Buffer Size
Given enough space in the output buffers, a single slot in each input buffer
of a SCAD processor is sufficient to guarantee that the move program
generated by the heuristic discussed in the previous sections executes
successfully without deadlocking (i.e., without the control unit stalling
forever due to full buffers). This is attributed to the following
characteristics of the heuristic: (1) The operand moves of each command
instruction are scheduled consecutively. That is, to execute an operation on
a PU, the corresponding data transports to the different input buffers of the
PU are registered consecutively before the data transports for the next
operation are registered. (2) Command instructions are considered in program
order for move code generation. That is, when data transports are registered
to execute an operation on a PU, all data transports necessary to produce the
relevant operand values for this operation have already been registered.
Therefore, even if the control unit stalls at this point in time, it is
guaranteed that the operand values will eventually be transported to the
input buffers and consumed by the PU, freeing up space in its input buffers.
5.6. Experiments
We experimentally compare the execution of benchmarks by direct instruc-
tion communication with the execution by using registers to hold intermedi-
ate results. For the former, queue-based move programs are generated using
the heuristic discussed in the previous sections. The Chaitin-Briggs heuris-
tic [Chai04] is used to color the buffer interference graph so that variables
are assigned to a minimal number of PUs. Recall that the queue-based code
generation heuristic produces resource constrained move programs that use
a minimal number of PUs to execute benchmarks. The register-based move
program is generated as follows: Allocate variables to a minimal number of
registers by coloring the register interference graph, again using the Chaitin-
Briggs heuristic [Chai04]. Assign instructions to PUs by list-based scheduling
that attempts to minimize the execution time using the given resources. The
same number of PUs is then used in both queue-based and register-based
move code generation. With the PU and register assignments, the move
program is easily obtained by generating the following moves for each command
instruction: move the operands from the registers to the corresponding input
buffers of the PU that executes this instruction, and move the result from
the output buffer to the respective mapped register. In both register-based
and queue-based code generation, the array variables are mapped to the main
memory and PU 0 is reserved for computing array element addresses.
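The register-based move generation for a single binary command instruction
can be sketched as follows. This is a minimal sketch under assumptions: the
function name, the register names, and the textual move format are
hypothetical, and the opcode move is included by analogy with Table 5.7
rather than taken from the description above.

```python
from typing import Dict, List


def register_based_moves(y: str, op: str, x1: str, x2: str,
                         pu: str, reg: Dict[str, str]) -> List[str]:
    """Moves for y = x1 op x2 when all values live in registers.

    pu is the PU chosen by the scheduler; reg maps variables to registers.
    """
    return [
        f"{reg[x1]} -> {pu}@l",   # left operand: register to input buffer
        f"{reg[x2]} -> {pu}@r",   # right operand: register to input buffer
        f"{op} -> {pu}@op",       # opcode move (assumed, as in Table 5.7)
        f"{pu}@o -> {reg[y]}",    # result: output buffer back to a register
    ]
```

In contrast to the queue-based moves of Table 5.7, every operand and every
result passes through the register file, which is why the number of register
file ports matters in the experiments.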
We use a cycle-accurate SCAD simulator to execute both queue-based and
register-based move programs. To study the impact of the number of ports
in the register file, register-based move programs are executed on both (1) a
SCAD machine with a single-ported register file, denoted by REG-MIN, and
(2) a SCAD machine with a multi-ported register file with as many ports as
registers, denoted by REG-MAX. Once all intermediate results are
accommodated in the respective local storage, the sizes of input and output
buffers have
a similar impact on the execution time of both register-based and queue-based
move programs, in that the control unit will stall if not enough space is avail-
able in buffers to register the move instruction. Therefore, we use the same
buffer sizes to execute both move programs. To this end, execution times are
measured for two configurations of buffer sizes: (1) First, we use a minimal
number of entries in the input and output buffers required for the queue-based
move program (i.e., the move program generated by the heuristic where vari-
ables are balanced by discarding copies). Currently, the queue-based code
generation heuristic does not consider buffer sizes. Therefore, we check if a
given buffer size is sufficient or not for the successful execution of the pro-
gram in the SCAD simulator. It is already known from the heuristic that
the required minimal input buffer size is 1, given large enough output buffers.
The sizes of input buffers are fixed to 1 to determine the required minimal
output buffer size by simulation. (2) Next, we provide enough space (size 20)
in input and output buffers so that the execution times of both queue-based
and register-based move programs are compared, avoiding a differing impact
of minimal buffer sizes, if any.
Program              Label     Input
factorial            fact      {1, ..., 12}
fibonacci            fib       {1, ..., 20}
sumup                sumup     add numbers {1, ..., 500} in a loop
euclid               euclid    compute gcd of pairs {101,1001}, ..., {150,1050}
heron                heron     compute square root of first 20 perfect squares
daxpy                daxpy     vector length 100
eratosthenes sieve   sieve     determine prime numbers in {1, ..., 100}
insertion sort       insort    reverse sorted array [15, ..., 1]
bubble sort          bbsort    reverse sorted array [15, ..., 1]
matrix multiply      matmul    4 × 6 and 6 × 8 matrices
image convolution    imgconv   6 × 6 image and 3 × 3 kernel
Table 5.8.: List of benchmarks
Table 5.8 lists the benchmarks (with the corresponding inputs used for the
experiments), which comprise programs with regular parallelism (daxpy, matrix
multiply, image convolution) and programs with irregular parallelism (the
other benchmarks). The simulated SCAD machine has one load-store unit (LSU)
for memory accesses, and the other PUs are capable of executing any standard
binary operation. Arithmetic, comparison, and bitwise operations are
configured to have a unit delay, except for multiplication and division
operations, which take 8 and 20 cycles, respectively, to execute. Memory
accesses (read and write) take 50 cycles. The DTN is configured to have a
latency of log₂ p, where p is the number of processing units. This is usually
the number of stages required in a multi-stage interconnection network to
communicate values from p senders to p receivers.
Furthermore, all components (PUs, LSU, and DTN) are pipelined.
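The latency configuration described above can be written down compactly. The
following is a minimal sketch, not the simulator's actual code: the names
op_latency and dtn_latency and the opcode strings are hypothetical, and
rounding log₂ p up when p is not a power of two is an assumption.

```python
# Hypothetical opcode names; the cycle counts are the ones stated in the text.
LATENCY_CYCLES = {"mul": 8, "div": 20, "ld": 50, "st": 50}


def op_latency(opcode: str) -> int:
    """Cycles for one operation; all other binary operations take one cycle."""
    return LATENCY_CYCLES.get(opcode, 1)


def dtn_latency(p: int) -> int:
    """Number of network stages for p processing units: log2(p), rounded up."""
    if p < 1:
        raise ValueError("p must be positive")
    return (p - 1).bit_length()   # integer form of ceil(log2(p))
```

With this configuration, a machine with 8 PUs has a DTN latency of 3 cycles,
and a machine with 16 PUs has a DTN latency of 4 cycles.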
Recall that two program transformations were discussed to balance vari-
ables in a program for move code generation: balancing by discarding copies
and by SSI transformation. In the rest of this section, we refer to the former
as NORMAL balanced move program and the latter as SSI balanced move
program. We also abbreviate the execution of the NORMAL balanced move
program on the SCAD machine as NORMAL execution in the following
explanations; SSI, REG-MIN, and REG-MAX executions are defined analogously.
The number of cycles taken by the different execution types using minimal and
enough buffer sizes is shown in Figures 5.12 and 5.13, respectively. Note
that the cycles for
different execution types relative to one another appear similar in both figures,
with only a few minor differences discussed later. This affirms our postulate
that sizes of input and output buffers have a similar impact on the execution
of both queue-based and register-based move programs. The execution of a
NORMAL balanced move program on SCAD is the most efficient in terms of
the number of cycles since all register accesses are bypassed in comparison to
the execution of the corresponding register-based move programs. The REG-
MIN execution is considerably slower due to the contention of simultaneous
register accesses on the single port of the register file. Clearly, register accesses
and the number of register file ports have a huge impact on the execution time,
particularly for programs exhibiting regular parallelism (daxpy, matrix mul-
tiply, image convolution) since there are more parallel accesses (writes and
reads) of intermediate values.
Figure 5.12.: Execution time using minimal buffer sizes
Notice that, unlike in the optimal code generation experiments, the REG-MAX
execution cycles (using an ideal register file) of many benchmarks are
comparable to those of the corresponding NORMAL balanced move programs in
SCAD. This is because the experiments using optimal code only considered
basic blocks as programs. The control flow in the above benchmarks introduces
additional overhead in the execution by direct instruction communication.
Figure 5.13.: Execution time using enough buffer sizes
This overhead arises, first, in the form of new copy assignments for bounding
variables and, second, as additional cycles to discard unnecessary copies
introduced by the balancing transformation. Furthermore, copy assignments of
the form y ← x incur extra overhead for direct instruction communication
since x needs to be rotated via the input buffer to the output buffer of the
PU to which y is assigned. When using registers to hold values, this simply
means copying one register’s content (or an immediate value) to another
register. The impact of these overheads is apparent from Figure 5.14, which
shows the total number of firings of PUs. The number of PU firings for most
benchmarks is considerably higher in the execution of NORMAL balanced move
programs compared to the corresponding register-based move programs. It is
not difficult to see
that with programs offering more ILP (possibly by various techniques such as
loop unrolling, superblock and hyperblock formations) and with larger num-
bers of PUs in SCAD, the negative effect of the overhead will quickly amortize
since the overall overhead will then be distributed between more PUs. This is
seen to a slight extent for the matrix multiplication and the image convolution
benchmarks in Figures 5.12 and 5.13. Unfortunately, we do not yet have a
heuristic that maximizes the use of ILP in SCAD with any given number of
PUs. This will be addressed in future work.
As expected, SSI balanced move programs of all benchmarks take longer to
execute compared to NORMAL balanced move programs (Figures 5.12 and
5.13). Balancing by the SSI transformation introduces extra overhead in the
form of copy assignments used at the control flow split points. Again, this is
clearly seen in Figure 5.14 in that PUs in the SCAD machine fire more often
when executing SSI balanced move programs. However, NORMAL balancing
will require larger buffers to successfully execute generated move programs
since maximal copies are produced when values are computed. This space-time
trade-off in computation is reflected in Figure 5.15, which shows the minimal
number of resources required for different execution types.
Figure 5.14.: Number of PU firings
Clearly, SSI balanced move programs require smaller buffers for all
benchmarks. Minimal
output buffer sizes are shown with input buffer sizes fixed to 1. Although we
configured the same buffer sizes for all PUs in SCAD, each PU will not use all
slots in its buffers. In fact, only a few (one or two) PUs will utilize all buffer
slots because our heuristic tries to assign as many instructions as possible to
the same buffer to use a minimal number of PUs (i.e., resource-constrained
code generation). In addition to minimal PUs (required by the NORMAL bal-
anced benchmark programs), minimal registers required by the register-based
compilation are also given in Figure 5.15. However, we will need to measure
the hardware resource usage for an exact comparison of the register file size
and the buffer size, which is not covered in this thesis.
Figure 5.15.: Minimal resource usage
In Figure 5.12, the execution of the SSI balanced programs in SCAD has a
lower runtime compared to REG-MIN executions for most benchmarks except
fibonacci and euclid. This is again more apparent for parallel benchmarks
like daxpy, matrix multiply, and image convolution. For these benchmarks,
it can be observed that the SSI execution even outperforms the execution in
REG-MAX (using ideal register files). Now compare the SSI execution with
the execution in REG-MIN and REG-MAX in Figure 5.13. It is interesting
to see that provided enough buffer size, the execution by direct instruction
communication using the SSI transformation is faster than the execution by
using registers for all benchmarks (except for the fibonacci benchmark where
the SSI execution lags behind REG-MAX execution). This is because SSI
balanced move programs have more move instructions due to the additional copy
assignments introduced by the SSI elimination. With more move instructions
and smaller buffers, the control unit in SCAD stalls frequently. Some of
these control unit stalls are avoided in the experiments with enough buffer
size.
Figure 5.16.: Total number of data transmissions
The total number of data transmissions by all hardware units for each
execution type is shown in Figure 5.16. The NORMAL execution has the least
number of data transmissions. Executions in REG (REG-MIN or REG-MAX)
involve comparatively more communication of values due to additional writes
to registers. Even more values are communicated by the interconnect in the
SSI execution due to additional copying (or duplicating) of values by the SSI
elimination. However, for more parallel benchmarks (daxpy, matrix multiply,
and image convolution), the overall data communication in the SSI execution
is less than that in the REG execution. This is reflected as a better perfor-
mance of these benchmarks by the SSI execution compared to the REG-MIN
and the REG-MAX executions (see again Figures 5.12 and 5.13). It is again
interesting to observe that if programs offer more ILP, the negative impact
of register accesses on the performance aggravates while that of the overhead
amortizes quickly.
Figure 5.17.: Number of data transmissions by PUs
We expect to confirm this observation in the future by
developing heuristics to generate SCAD code that will maximize the use of
ILP. The total number of data transmissions by PUs for each execution type
is shown in Figure 5.17. As expected, the overall data communication in the
execution by direct instruction communication is distributed among PUs. In
contrast, all data traffic is routed through a limited number of register file
ports in the execution where registers are used to hold intermediate values.
Notice that the execution of register-based move code resembles the execu-
tion in VLIW processors, where each instruction reads operand values from
registers and writes its result to a register. Similar to the bundling of instruc-
tions in VLIW, independent move instructions can be bundled in SCAD. Also,
the execution in SCAD can benefit from various VLIW compiler techniques to
find enough independent instructions for concurrent execution. Therefore, it is
not unfair to consider the above experiments as a comparison of the execution
by direct instruction communication in SCAD and the execution in a VLIW
processor.




In this chapter, we review processor architectures concerning the use of ILP
with an emphasis on their capability to bypass register usage or implement
direct instruction communication. We also mention aspects of SCAD that
distinguish it from the reviewed architectures.
Superscalar processors [John91; SmSo95] execute typical reduced instruc-
tion set computer (RISC) instructions where operands and results of instruc-
tions are encoded by register addresses. Dynamically scheduled superscalar
processors perform out-of-order execution of RISC programs by Tomasulo’s
algorithm [Toma67]. They fetch and decode multiple instructions at once,
filling a reservation station and a reorder buffer. Instructions in the reservation
station whose operand values are available are dispatched to processing units
(PUs) for execution. So, allocation of instructions to PUs (instruction place-
ment) and firing times of PUs (instruction issue) are determined at runtime
(see Figure 6.1). During the execution of an instruction, other instructions
that consume the result of this instruction might still be decoded and put
into the reservation station. Since any entry in the reservation station is a
probable consumer instruction, each PU must broadcast its result to all reser-
vation station entries. To that end, n PUs arbitrate on a single common data
bus (CDB) that is snooped by all m reservation station entries. If any con-
sumer instruction is not yet decoded when the result is broadcast, the result
is communicated to this consumer instruction via the register. Therefore, the
instruction window for direct instruction communication is determined by the
size of the reservation station. However, scaling the reservation station size
and the number of PUs in a superscalar processor is limited by the n ∶ 1 ar-
bitration of PUs on the CDB and the 1 ∶ m broadcast from the CDB to the
reservation station entries. In contrast, the input and output buffers of SCAD
are scalable, and the data transport network (DTN) in SCAD establishes a
unicast n ∶ n network where n is the number of PUs.
In superscalar processors, instructions still have to read their operands from
registers in the program order in the decode stage. Each instruction has to
write its result to a register in the write-back stage, again in the program
order (from the reorder buffer).
Figure 6.1.: Execution framework of a superscalar processor
The number of registers that can be read or written at any point in time is
limited by the number of ports of the register
file. Furthermore, compilers have to generate spill code to temporarily store
intermediate results in memory if the number of programmer-accessible regis-
ters is not sufficient, thus reducing the use of ILP (see Figure 1.1). Increasing
the number of registers is difficult since this number is directly encoded in the
RISC instruction format. Other major drawbacks are the enormous power
consumption and the limited scalability: In addition to executing programs,
the processor also takes care of instruction scheduling in that the data de-
pendencies of instructions are tracked, and instructions are allocated to PUs
(instruction placement) at runtime. Superscalar processors rely on branch
prediction to keep the PUs busy for control-flow intensive programs.
The very long instruction word (VLIW) processors [FERN84], contrary to
superscalar processors, are quite simple machines. It is the responsibility
of the VLIW compiler to track data dependencies of instructions, allocate
instructions to PUs, and also determine PU firing times. So, both instruction
placement and instruction issue are determined statically by the compiler.
The number and latency of PUs are exposed to the compiler so that it can
bundle independent instructions in a very long instruction word. The hardware
simply executes each instruction in a bundle simultaneously. The general
structure of VLIW architectures is shown in Figure 6.2. Many sophisticated
techniques like trace scheduling [Fish81] and hyperblock scheduling [MLCH92]
for acyclic regions, modulo scheduling (or software pipelining) [Lam88; Rau94]
and loop unrolling [LaHw95] for cyclic regions (or loops), etc. are used by the
VLIW compilers to find sufficient independent instructions from across basic
blocks to utilize the available PUs. Statically scheduled VLIW processors have
significantly less power consumption than dynamically scheduled superscalar
processors. However, while the same binary programs can run on very different
superscalar processors that share the same instruction set, the programs have
to be compiled for each particular VLIW processor, which restricted these
processor architectures to the field of embedded computing [FiFY05].
Chapter 6: Related Work
Figure 6.2.: Execution framework of a VLIW processor (figure labels: Register File, Instruction Register)
Since most VLIW architectures use typical RISC instructions, each instruction
in a very long instruction must read its operands from registers and write
its result to a register. If the number of registers (again, this number is en-
coded in the RISC instruction format) is not sufficient, the VLIW compilers
have to generate spill code to temporarily store intermediate results in the
main memory, which reduces the use of ILP (see again Figure 1.1).
Furthermore, a processor with n issue slots usually requires a register file
with 2n read ports and n write ports, so that each of the n instructions
bundled together can simultaneously read two operands and write one result.
Clearly, as the number of PUs increases, the number of ports of the register
file must also be increased to improve the use of ILP. However, the wiring
required to allow all PUs to access all read and write ports of the register
file in a cycle quickly reaches the limits of its scalability. It is observed
in [ZyKo98] that with the increase in the number of ports, the power
dissipation in a register file increases at the rate of n^2 to n^3, while the
area and the access time have been shown in [RDKM00] to increase at the rates
of n^3 and n^(3/2), respectively. This led to the development of clustered
VLIW architectures, where PUs can only access predefined subsets of
registers, which is a further hurdle for the compilers to generate suitable
code.
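The port requirement and the growth rates quoted above can be made concrete
with a few lines of arithmetic. This is only an illustrative sketch: the
function names are hypothetical, and growth_factor merely evaluates a power
law of the given exponent without modeling any particular register file.

```python
from typing import Tuple


def vliw_register_ports(issue_slots: int) -> Tuple[int, int]:
    """(read ports, write ports) for a VLIW with the given number of issue
    slots: two operand reads and one result write per bundled instruction."""
    return 2 * issue_slots, issue_slots


def growth_factor(n_old: int, n_new: int, exponent: float) -> float:
    """Relative growth of a cost that scales as n ** exponent."""
    return (n_new / n_old) ** exponent
```

For example, vliw_register_ports(4) gives eight read ports and four write
ports. Under the quoted rates, doubling n multiplies the area by
growth_factor(4, 8, 3), i.e., by 8, and the access time by
growth_factor(4, 8, 1.5), i.e., by roughly 2.8.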
The effect of exposing datapaths of the processor to the compiler to possibly
reduce the register file pressure in VLIW processors was studied in [HoCo94]
and yielded quite impressive results. It was found that it is possible to bun-
dle and execute 2 instructions per cycle with only a two-ported register file,
which requires four read ports and two write ports in the original VLIW. Also,
3.6 instructions per cycle is supported with a six-ported register file, which
requires eight read ports and four write ports in the original VLIW
architecture. Transport-triggered architectures (TTA) [Corp94; HoCo94a;
Corp99], whose execution framework is shown in Figure 6.3, were used to
conduct the aforementioned study. The PUs and the register file are connected
via a move
instruction bus in TTA so that the PUs can communicate values with each
other without having to write and read a central register file. The PUs have
uniquely addressed single registers at their input and output ports. One of the
input ports of each PU is designated as the ‘trigger’ port (shown in dark green
color in Figure 6.3) so that the data transport to this port will trigger the
execution of the respective PU, thus the name ‘transport-triggered’. Similar
to SCAD, TTAs are also programmed by move instructions. When the control
unit issues a move instruction via the move instruction bus, the transport of
a value from register src to register tgt is performed immediately again via
the move instruction bus. Clearly, several move instructions may be grouped
in a bundle and issued via a bus with many lanes. Again like in SCAD, the
PUs in TTA may implement any arbitrary application-specific function with-
out affecting the instruction set. Extensive research is done on compilers and
design space exploration for TTAs 1.
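The transport-triggered execution principle described above can be
illustrated with a toy model of a single PU. This is a minimal sketch under
assumptions: the class and method names are hypothetical, the PU has exactly
one ordinary operand port and one trigger port, and timing as well as the
move instruction bus are not modeled.

```python
from typing import Callable, Optional


class TransportTriggeredPU:
    """Toy PU with one operand register, one trigger port, one result register."""

    def __init__(self, operation: Callable[[int, int], int]) -> None:
        self.operation = operation
        self.operand: Optional[int] = None
        self.result: Optional[int] = None

    def move_to_operand(self, value: int) -> None:
        # A move to the ordinary input port only stores the value.
        self.operand = value

    def move_to_trigger(self, value: int) -> None:
        # A move to the trigger port starts the execution of the PU.
        if self.operand is None:
            raise RuntimeError("operand port must be written before the trigger port")
        self.result = self.operation(self.operand, value)
```

Moving 3 to the operand port of an adder leaves its result register empty;
the subsequent move of 4 to the trigger port fires the PU and places 7 in the
result register.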
Figure 6.3.: Execution framework of a TTA processor (figure labels: Move Instruction Bus, Register File)
TTAs are, similar to VLIW, statically scheduled architectures where the
compiler determines instruction placement and instruction issue. In fact, they
are an extreme case of static scheduling since not just the firing times of opera-
tions in PUs but also the times of data transports corresponding to operations
are statically determined. In SCAD, data transports are only registered on
the issue of the respective move instructions, while the actual transport of the
data is only carried out when the respective values are available. Therefore,
SCAD processors adapt to arbitrary latencies of PUs and memory accesses
similar to superscalar processors. In contrast, compilers for TTA and VLIW
processors must often consider a worst-case latency of hardware units to de-
rive a correct static schedule. Static instruction issue inhibits the execution
in dataflow order since instructions with uncertain latency such as memory
loads (in the context of cache hit/miss) cannot be optimally accommodated
in the static schedule and will stall the entire execution engine. Thus, static
instruction issue is a limiting factor for the use of ILP. Moreover, TTAs will
still need to use a central register file since, unlike the FIFO buffers in SCAD,
they use single registers at PU inputs and outputs. Accordingly, the compil-
ers will have to assign some intermediate results (that are not bypassed) to
general purpose registers. Although register allocation is less critical than in
VLIW, the use of general purpose registers still limits the amount of ILP the
architecture can make use of.
Execution in some architectures bears a strong resemblance to the execution
paradigm of TTA in the sense that they are statically scheduled architectures
with exposed datapaths. The MOVE-Pro architecture [HSMC11] has a buffer
with multiple entries at each PU output, instead of a single register in TTA.
This increases the chances of bypassing more register file accesses.
1 http://openasip.org/
Since the data transport to the trigger input port of a PU in TTA triggers
the firing of
that PU, this data transport must always be scheduled after scheduling the
move instructions to the other input ports. The PUs in MOVE-Pro do not
have a designated trigger port. Instead, the data transport to any input port
can trigger the PU’s firing. To this end, the triggering information is encoded
by setting or clearing an opcode field in any move instruction. This increases
the flexibility to schedule move instructions in MOVE-Pro compared to TTA.
Furthermore, MOVE-Pro offers improved code density by proposing new in-
struction formats. The synchronous transfer architecture (STA) [CRSM04]
is more similar to TTA. However, instead of a limited bus, it uses a mul-
tiplexer array to allow any PU to receive operands for execution from other
PUs directly. Both the operation to be executed on a PU and the interconnect
configuration for its operands are encoded in an instruction, several of which
may be bundled together for concurrent execution. There are other similar ar-
chitectures that not only allow the compiler to move values between PUs, but
also expose finer microarchitectural details. In FlexCore [TSBS08], PUs
are connected by a full crossbar. Statically scheduled [ScSL09] instructions of
a native ISA encode both the configuration of the crossbar network and the
operations to be executed on individual PUs. Also, the use of an instruction
decompression unit that unfolds instructions in an application-specific ISA to
the more detailed native ISA, is proposed to improve the code density. No-
tice that the execution in the statically ordered variant of SCAD where the
compiler determines the instruction issue resembles the execution paradigm of
TTA-like architectures.
Numerous other architectures expose communication details in hardware to
the compiler. To mention a few: The RAW/Tilera architecture [WTSS97;
LBFS98] consists of simple pipelined RISC processor cores arranged in a grid
and connected by mesh networks. Each core is equipped with a local instruc-
tion and data memory. A few designated registers of each core are connected
to the network. A software-controlled network switch in each core is config-
ured by the compiler to read from registers of another core connected to the
network, thus supporting direct communication from a producer instruction in
one core to the consumer instruction in another core. In the explicit datapath
wide single instruction multiple data (SIMD) architecture [WSCH15], a set of
PUs are arranged in a circular layout where each unit is connected to its left
and right neighbors. There is a control unit that can talk to all the PUs. It is
programmed using very long instructions that encode for each PU, the source
and destination of its operands, and the role of the control unit. The statically
scheduled Mill2 architecture uses a fixed-length FIFO buffer (called Belt) to
store operands and results of the execution, and this way eliminates the general
purpose register file. Unlike buffers in SCAD, the PUs in the Mill processor
may directly read operands from any location in the Belt. AMIDAR (Adap-
tive Microinstruction Driven Architecture) [GaHo05a] is a general model for
building processors that features a set of functional units (meaning any
hardware component) connected to each other via a communication structure.
2 https://millcomputing.com/technology/docs/belt/
A token generator distributes the so-called tokens to these functional units
over
a dedicated token generator network. Each token encodes, in addition to an
operation code, the destination functional unit to where the operation’s result
is to be sent. The token also contains a tag value that is either attached after
incrementing or attached as is to the sent result, depending on yet another in-
formation encoded by the token. Programs consist of instructions where each
instruction may contain an arbitrary number of tokens, allowing any function
to be computed by direct communication of values between the functional
units. The AMIDAR model has often been used to execute Java bytecode.
For a recent review of exposed datapath architectures, see [JKVT15].
With the growing number of PUs in processors, the execution of programs is
becoming increasingly communication dominated due to longer wires and the
subsequent decrease in wire transmission speeds [AHKB00; HoYS98]. There-
fore, the instruction schedulers must consider on-chip wire delays when allocat-
ing PUs to producing and consuming instructions. Clearly, dynamic instruc-
tion placement (as in superscalar processors) is limited in handling growing
wire delays, whereas compiler-driven static instruction placement can better
optimize communication delays. On the other hand, the dynamic instruction
issue is preferred over the static instruction issue to achieve a higher use of
ILP. This evoked interest in static placement dynamic issue (SPDI) [NKBM04]
architectures, which combined a compiler-determined allocation of PUs to in-
structions and a hardware-determined issue of instructions. According to this
categorization, superscalar processors are classified as DPDI (dynamic place-
ment dynamic issue) architectures, whereas VLIW, TTA, and other statically
scheduled exposed datapath processors are classified as SPSI (static placement
static issue) architectures. The SCAD architecture follows the SPDI execution
model, thus retaining the benefits of static instruction placement and dynamic
instruction issue.
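The categorization above amounts to a two-dimensional classification by
placement and issue. A minimal sketch, in which the enum, the table, and the
function name are hypothetical and only the architectures named in the
preceding paragraph are listed:

```python
from enum import Enum


class Timing(Enum):
    STATIC = "S"    # decided by the compiler
    DYNAMIC = "D"   # decided by the hardware at runtime


# (instruction placement, instruction issue) per architecture
SCHEDULING = {
    "superscalar": (Timing.DYNAMIC, Timing.DYNAMIC),
    "VLIW": (Timing.STATIC, Timing.STATIC),
    "TTA": (Timing.STATIC, Timing.STATIC),
    "SCAD": (Timing.STATIC, Timing.DYNAMIC),
}


def category(architecture: str) -> str:
    """Return the four-letter category, e.g. 'SPDI'."""
    placement, issue = SCHEDULING[architecture]
    return f"{placement.value}P{issue.value}I"
```

With this table, category("SCAD") evaluates to "SPDI" and
category("superscalar") to "DPDI".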
The TRIPS [BKMD04] (Tera-op, Reliable, Intelligently adaptive Process-
ing System) architecture, an Explicit Dataflow Graph Execution (EDGE) ar-
chitecture, presents the SPDI scheduling model. Recall that in superscalar
processors, all PUs have to arbitrate on a single common data bus, which is
snooped by each reservation station entry to possibly fetch its operands. This
restricted direct instruction communication is necessary since, when the
hardware allocates an instruction to a PU, all consumer instructions might not yet be
fetched and decoded. This limits the scaling of PUs and the reservation station
in dynamically scheduled superscalar processors. TRIPS avoids this bottle-
neck by employing an SPDI block-atomic execution mode: Figure 6.4 shows
the execution framework of TRIPS where each PU has its own set of reserva-
tion station entries (instruction buffers), each of which is connected to output
ports of PUs via a 2D mesh network. The TRIPS compiler [SBGM06] trans-
forms segments of the program’s control-flow graph to large basic blocks called
hyperblocks by various techniques such as loop unrolling, if-conversion, and
predication, and maps these hyperblocks to the computational engine. Each
instruction in a hyperblock is accommodated in a statically determined reser-
vation station slot (static placement) and encodes physical locations (again
reservation station slots) of its consumer instructions. The availability of
operand values in any reservation station slot triggers the firing of the cor-
responding PU executing the respective instruction. The computed result
is then directly communicated to its consumer instructions via the network.
This way, all instructions within a hyperblock communicate directly with one
another. Registers and memory are used for the communication between hy-
perblocks. In fact, a hyperblock execution is considered complete when it
performs a specific number of register writes and memory stores. Clearly,
there is a program counter in TRIPS that steers the control flow in programs
at the level of hyperblocks. Notice that the hyperblock execution in TRIPS
is analogous to the RISC instruction execution in superscalar processors so
that various techniques to improve the use of ILP like superscalarity (multiple
hyperblocks in flight simultaneously), out-of-order execution, and branch

Figure 6.4: Execution framework of a TRIPS processor
Although the use of hyperblocks considerably reduces register file accesses in
TRIPS, the use of registers for communicating between hyperblocks will still
adversely affect the potential use of ILP by the architecture. The instruction
buffers (reservation stations) local to each PU in TRIPS are not easy to scale,
whereas the FIFO buffers in SCAD scale better, so that more values can
be accommodated for direct communication at any point in time. Moreover,
since each slot in a reservation station is addressed by the interconnection
network, the interconnection network in TRIPS must scale proportionally to
the scaling of reservation stations and PUs. Meanwhile, in SCAD, the in-
terconnection network needs to scale only with the number of PUs because
only the head of output buffers and the tail of input buffers are connected
to the DTN in SCAD. On the downside, the operand values in SCAD must
ripple their way to the right slot in the input buffers. Each PU in TRIPS
can execute instructions in its reservation station and transport the results in
the order that respective operands become available. Out-of-order execution
of instructions and transport of results is also possible in SCAD PUs with
appropriate hardware support [JaSW17], with the constraint that multiple
values sent from an output buffer to the same input buffer are transported in
order. However, when allocating instructions to PUs, dependent instructions
are usually allocated to the same PU and independent instructions to different
PUs to effectively use the ILP in programs. This argues against
spending additional resources on out-of-order execution and transport locally
within each PU.
There exist other block-oriented architectures whose instructions consist of
large blocks of operations where operations in a block communicate their re-
sults with one another directly without using a separate local storage (i.e.,
operations within a block are executed in dataflow order). DySER (Dynami-
cally Specializing Execution Resources) [GoHS11] is similar to TRIPS in that
the compiler-determined dataflow graphs are atomically mapped and executed.
To this end, an array of PUs is used in the execution stage of the usual pro-
cessor pipeline, which can communicate intermediate results to each other via
a 2D mesh network. The PU array is configured at runtime to implement
specialized functions. Once configured, the specialized function may be used
many times to accelerate appropriate program segments. The compiler uses
profiling to determine parts of the code to accelerate this way. By allowing run-
time reconfiguration, multiple different program segments may be accelerated
in multiple phases of the execution. Note that unlike TRIPS, where the entire
program is transformed to hyperblocks, blocks are generated only for parts of
the program in DySER. Tartan [MCCV06] and Conservation cores [VSGG10]
also use dataflow execution to accelerate parts of programs, but they do not
support hardware reconfiguration at runtime. Intra-block communication in
these block-oriented architectures avoids the use of registers, while registers or
memory are still required for inter-block communication. In SCAD, the use
of registers is completely avoided using code generation inspired by classical
queue machines.
So far, we have considered processors whose execution paradigm is funda-
mentally based on the control-flow computing model. The model does not
express any parallelism due to the following defining characteristics: First, a
program counter is used to steer the control flow of programs. Second, the
intermediate results are stored in an updateable memory (or shared names-
pace). The adverse impact of the former is alleviated in the above processors
by techniques such as branch prediction, speculative execution, and the formation
of superblocks and hyperblocks. The SCAD processor can also benefit from
these techniques, but we do not experimentally study the impact of these in
SCAD in this thesis. The adverse impact of the latter is mitigated by di-
rect communication of values between instructions. As has been seen, the
architectures discussed above vary in the degree to which direct instruction
communication is supported or equivalently in the extent to which dataflow
execution is used to improve the exploited ILP. In SCAD, all intermediate
results are communicated directly between the processing units, though some-
times values have to be rotated (overhead) before transporting them to the
consumer processing units. With more processing units, there is less over-
head. Therefore, the execution paradigm in SCAD finds an equilibrium be-
tween control-flow and dataflow execution styles. See [YAJE14] for a recent
survey and classification of architectures based on the blend of control-flow
and dataflow execution styles.
Dataflow processors [DeMi75; Davi78; VeFi78; KiYK83; GuKW85; ArCu84;
PaCu90a; SSMP07] directly instantiate the dataflow computing model [KaMi66;
Denn74; Kahn74], which has the following contrasting characteristics compared
to the control-flow model of computation: The next instruction to fetch and
execute is determined by the availability of operands and not addressed by a
program counter. The intermediate results of computations are directly com-
municated from producer instructions to consumer instructions and are not
overwritten using a shared namespace. For the execution in dataflow proces-
sors, the entire program is represented as a directed graph called a dataflow
graph that is directly executed in hardware. The nodes and edges in a dataflow
graph represent the instructions and data dependencies between instructions,
respectively. That is, the operands for an instruction arrive via incoming
edges of the corresponding node. The node can fire as soon as all of its operands
have arrived, and the computed result is forwarded to the consumer instructions via
outgoing edges. Therefore, dataflow programs directly expose all available
ILP that can be used by dataflow processors. Though special languages were
developed for programming dataflow computers, most dataflow based systems
convert usual imperative programs to dataflow graphs. It is straightforward
to translate basic blocks by the conversion to static single assignment (SSA)
[AlWZ88; CFRW91; RoWZ88] form where each intermediate result is given a
unique name (no shared names). The control flow in usual programs is often
translated by introducing additional data-steering instructions like switch and
select nodes (see Section 2.1.3).
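The firing rule described above can be illustrated by a minimal interpreter for acyclic dataflow graphs. The following Python sketch rests on simplifying assumptions: every edge carries at most one token, there are no switch or select nodes, and the names run_dataflow, nodes, and inputs are hypothetical and do not refer to any particular dataflow processor.

```python
import operator
from collections import deque


def run_dataflow(nodes, inputs):
    """nodes: name -> (function, [operand names]); inputs: name -> value."""
    # Invert the graph: which (node, operand slot) pairs consume each value.
    consumers = {}
    for name, (_, operands) in nodes.items():
        for slot, src in enumerate(operands):
            consumers.setdefault(src, []).append((name, slot))

    arrived = {name: {} for name in nodes}   # operand slots filled so far
    results = {}
    fired = []                                # order in which nodes fired
    tokens = deque(inputs.items())            # tokens in flight: (producer, value)
    while tokens:
        src, value = tokens.popleft()
        results[src] = value
        for name, slot in consumers.get(src, []):
            arrived[name][slot] = value
            func, operands = nodes[name]
            # Firing rule: a node fires once all of its operands have arrived.
            if len(arrived[name]) == len(operands):
                args = [arrived[name][i] for i in range(len(operands))]
                tokens.append((name, func(*args)))
                fired.append(name)
    return results, fired


# Basic block in single-assignment form: t1 = a + b; t2 = c - d; t3 = t1 * t2
nodes = {
    "t1": (operator.add, ["a", "b"]),
    "t2": (operator.sub, ["c", "d"]),
    "t3": (operator.mul, ["t1", "t2"]),
}
results, fired = run_dataflow(nodes, {"a": 1, "b": 2, "c": 7, "d": 3})
```

No program counter appears in this sketch: t1 and t2 are independent and could fire in either order, whereas t3 can only fire after both of its operands have arrived.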
Clearly, direct instruction communication is inherent in dataflow programs
where each instruction encodes the addresses of the consumer instructions that
wait for its computed result. The first dataflow processors were based on so-
called static dataflow architectures [DeMi75; Davi78; VeFi78] where at most
one value (often referred to as a token) is allowed to propagate along each
edge in the dataflow graph. This avoids race conditions between different dynamic
token instances (from different loop iterations) at an edge, which would
otherwise make it ambiguous which tokens match when executing
instructions. However, the main drawback of static dataflow processors
is that the restriction of only one token per edge prevents multiple iterations
of a loop from executing simultaneously. The dynamic dataflow architectures
[KiYK83; GuKW85; ArCu84; PaCu90a; SSMP07] overcome this limitation
by assigning tags to data tokens and allowing multiple tokens at edges, sim-
ilar to the dynamically ordered variant of SCAD. The tags associated with
data tokens are used to disambiguate dynamic instances of respective target
instructions. To this end, the dynamic dataflow processors must employ tag
management instructions to manage and assign tags to data tokens.
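The role of tags can be illustrated with a small waiting-matching store. The following Python sketch is a simplification under stated assumptions: the class MatchingStore and its method receive are hypothetical names, the tag is a plain iteration number, and the dictionary lookup merely stands in for the associative matching that a dynamic dataflow processor performs in hardware.

```python
from collections import defaultdict


class MatchingStore:
    """Collects operand tokens per (instruction, tag) until an instance can fire."""

    def __init__(self, arity):
        self.arity = arity                   # instruction -> number of operands
        self.waiting = defaultdict(dict)     # (instruction, tag) -> {slot: value}

    def receive(self, instr, tag, slot, value):
        """Store one token; return the operand list if this instance is complete."""
        key = (instr, tag)
        self.waiting[key][slot] = value
        if len(self.waiting[key]) == self.arity[instr]:
            operands = self.waiting.pop(key)
            return [operands[i] for i in range(self.arity[instr])]
        return None


# Tokens of two loop iterations (tags 0 and 1) arrive interleaved at one instruction.
store = MatchingStore({"add": 2})
first = store.receive("add", 0, 0, 10)    # iteration 0, left operand: must wait
second = store.receive("add", 1, 0, 20)   # iteration 1, left operand: must wait
third = store.receive("add", 1, 1, 2)     # completes iteration 1
fourth = store.receive("add", 0, 1, 1)    # completes iteration 0
```

Without the tag in the key, the left operand of iteration 0 could be paired with the right operand of iteration 1, which is the ambiguity that static dataflow architectures avoid by allowing only one token per edge.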
The main disadvantage of dynamic dataflow processors is the additional
overhead of matching tags to identify the correct operands for executing in-
structions, often requiring expensive associative memory implementations [GuKW85].
Even though notable attempts were made to solve the token matching problem
(see the Explicit Token Store (ETS) architecture [PaCu90a]), it remained a difficulty
in realizing an efficient implementation of dataflow processors. Another
significant problem of dataflow processors is the difficulty in using conventional
memory. Since some tokens will inevitably have to share the same memory
slot due to the limited size of the data memory, the use of conventional mem-
ory hardware demands an ordering of memory write and read operations. It is
impossible to ensure an ordering of load and store instructions in the dataflow
execution model since the instructions fire when operands are available (and
not in any predefined order). This necessitates additional support in dataflow
processors for ensuring the desired ordering of load and store operations on
a centralized data memory. This is also the reason why dataflow systems
could not efficiently enforce sequential memory semantics that the imperative
languages require.
WaveScalar [SSMP07] is a recent dataflow architecture that executes pro-
grams in waves, which are dataflow graphs corresponding to the maximal
acyclic code regions of the program’s control-flow graph. They are constructed
similarly to hyperblocks by loop unrolling and if-conversion. Waves cannot
contain loops. To allow the simultaneous execution of different instances of
a loop body, the data tokens travel with a wave-number that serves as a tag
to identify the correct operands to execute instructions at runtime. To this end,
a wave-advance instruction is used for tag management in that it increments
the wave-number of data tokens as they travel from one wave to the next.
In contrast, operand matching in SCAD does not require complicated token matching
hardware. Clearly, the instructions are issued (PUs are fired) dynamically in
WaveScalar, like in other dataflow processors. The instruction placement has
both a static and dynamic component [MSPP06a]. The waves are clustered
into smaller segments by the compiler, and when any instruction in a seg-
ment needs to execute, that segment is mapped at runtime into a single PU.
Therefore, instructions are fetched and replaced at runtime in the granularity
of segments. The compiler groups instructions into segments by performing
a depth-first traversal of the dataflow graphs, thus possibly grouping depen-
dent instructions in a segment. The latency for communicating operands is
optimized by loading instruction segments hierarchically to the tiled computa-
tional engine that employs a hierarchical on-chip interconnect. The compilers
for SPDI architectures (including SCAD) can better optimize communication
delays by statically assigning instructions to the processing units. It is also
worth mentioning that the synchronous registration of move instructions in
SCAD will ensure that when a token arrives at a PU, there will be space re-
served for that token in the input buffer of the PU. This obviates any back
pressure in the data transport network. The PUs in WaveScalar may reject
tokens if these cannot be accommodated due to lack of space, so that the
senders must then retry later. Conventional programming languages are sup-
ported in WaveScalar by a wave-ordered memory mode. To this end, the
compiler annotates load and store instructions in each wave with an
ordering constraint (using sequence and ripple numbers) on these memory accesses,
which is then respected by the memory system. The memory accesses






The use of instruction-level parallelism (ILP) by commercial processor architectures
faces unavoidable limits on its further scalability. These architectures
are fundamentally based on a sequential control-flow computing model. The
more concurrent dataflow computing model has not yet been fully exploited
for commercial systems due to its inefficient implementations with conventional
hardware. Since the control-flow model itself does not express any
parallelism, these architectures employ various techniques to effectively use
ILP contained in programs and this way adopt a hybrid control-flow dataflow
model of computation. The use of registers is an innate reason for the limited
use of ILP in these architectures. Therefore, more recent architectures
expose their datapath to the compiler, so that the use of registers is bypassed by
directly communicating intermediate results from producer processing units
to consumer processing units. It was observed that although these architectures
have been studied in great detail, they still use registers to execute programs.
We proposed a novel exposed datapath architecture named Synchronous Con-
trol Asynchronous Dataflow (SCAD) that uses a hybrid control-flow dataflow
model of computation, where the use of registers is completely avoided by
the code generation for SCAD based on classical queue machines. SCAD
uses FIFO buffers at output and input ports of processing units, and output
buffers are connected to the input buffers using any general interconnection
network. The datapaths between output and input buffers are exposed to
the compiler. SCAD is programmed by a sequence of move instructions of
the form src → tgt, each of which instructs a data transport from the output buffer
src to the input buffer tgt. Execution in SCAD adheres to the control-flow
computing model in that it uses a program counter to steer the control flow
of programs. Similar to other control-flow processors, well-known compiler
techniques (generation of superblocks and hyperblocks) and hardware tech-
niques (branch prediction and speculative execution) can be used to exploit
ILP across the control flow boundaries.
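The execution of move code summarized above can be sketched with a toy simulator. The following Python code is a strongly simplified model under stated assumptions, not the SCAD hardware or the cycle-accurate simulator used in this thesis: buffers are unbounded, every PU is a binary operator with one output buffer, operands enter through a hypothetical source unit "in", the result is collected at a hypothetical sink "res", and the names Scad, register, step, and run are chosen for this example only.

```python
import operator
from collections import deque

EMPTY = object()   # marks a reserved input-buffer entry whose value has not arrived


class Scad:
    def __init__(self, ops, sources):
        self.ops = ops                                    # PU name -> binary function
        self.out = {pu: deque() for pu in ops}            # output buffers (values)
        self.out.update({s: deque(v) for s, v in sources.items()})
        self.targets = {o: deque() for o in self.out}     # target addresses per output buffer
        self.inp = {}                                     # (unit, port) -> deque of [source, value]

    def register(self, src, tgt):
        """Synchronous part: record the move src -> tgt in both buffers."""
        self.targets[src].append(tgt)
        self.inp.setdefault(tgt, deque()).append([src, EMPTY])

    def step(self):
        """One round: each output buffer transports at most one value, then ready PUs fire."""
        progress = False
        for src, values in self.out.items():
            if values and self.targets[src]:
                value, tgt = values.popleft(), self.targets[src].popleft()
                # Fill the oldest empty entry that was reserved for this source.
                entry = next(e for e in self.inp[tgt] if e[0] == src and e[1] is EMPTY)
                entry[1] = value
                progress = True
        for pu, func in self.ops.items():
            left, right = self.inp.get((pu, 0)), self.inp.get((pu, 1))
            if left and right and left[0][1] is not EMPTY and right[0][1] is not EMPTY:
                self.out[pu].append(func(left.popleft()[1], right.popleft()[1]))
                progress = True
        return progress


def run(machine, program):
    for src, tgt in program:          # the control unit registers moves in program order
        machine.register(src, tgt)
    while machine.step():             # values then flow until nothing changes
        pass


# Move code for (a + b) * (c - d) with a, b, c, d = 1, 2, 7, 3
machine = Scad({"add": operator.add, "sub": operator.sub, "mul": operator.mul},
               {"in": [1, 2, 7, 3]})
program = [("in", ("add", 0)), ("in", ("add", 1)),
           ("in", ("sub", 0)), ("in", ("sub", 1)),
           ("add", ("mul", 0)), ("sub", ("mul", 1)),
           ("mul", ("res", 0))]
run(machine, program)
result = machine.inp[("res", 0)][0][1]
```

In this model, all intermediate results travel from an output buffer directly to an input buffer, and no register is read or written. Registering a move reserves an entry in the input buffer before the value exists, which is why a value that arrives later always finds a free slot.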
This thesis focused on the dataflow aspect of execution in SCAD, where in-
termediate results are directly communicated from producer processing units
to consumer processing units and are never overwritten using a shared namespace
(registers). In other words, the use of registers, which is an inherent
bottleneck in utilizing ILP contained in programs, is completely avoided in
SCAD. This is achieved by a synergy of the execution paradigm and the code
generator. Move code generation for SCAD, inspired by classical queue machines,
utilizes the exposed datapaths in SCAD to the fullest, eliminating any
need for other local storage (registers) and also tends to maximize the degree of
exploited ILP. Queue-oriented code generation suggests a breadth-first traversal
over expression trees in contrast to the traditional depth-first traversal in
classic register-based architectures. The depth-first traversal was motivated
by the reuse of registers, while the breadth-first version is motivated by exploiting
maximal ILP and eliminating the use of registers. On the one hand, queue-oriented
SCAD code can execute programs by direct instruction communication
(without registers). On the other hand, SCAD code obtained
by direct translation from queue code contains overhead in terms of rotation
steps to access relevant values that are not at the head of output FIFO buffers.
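The breadth-first traversal for queue machines can be illustrated as follows. The Python sketch assumes binary expression trees given as nested tuples and a single operand queue; it visits the levels of the tree from the deepest level up to the root, left to right within each level. The names queue_code and run_queue are hypothetical, and the sketch does not cover the move code generation for SCAD itself.

```python
import operator
from collections import deque

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def queue_code(tree):
    """tree: a leaf value or a tuple (op, left, right). Returns queue machine code."""
    levels = []                      # levels[d] holds the nodes at depth d, left to right
    frontier = [tree]
    while frontier:
        levels.append(frontier)
        frontier = [child for node in frontier if isinstance(node, tuple)
                    for child in node[1:]]
    code = []
    for level in reversed(levels):   # deepest level first, root last
        for node in level:
            code.append(node[0] if isinstance(node, tuple) else ("push", node))
    return code


def run_queue(code):
    """Execute queue code: operands leave at the head, results enter at the tail."""
    queue = deque()
    for instr in code:
        if isinstance(instr, tuple):             # ("push", value)
            queue.append(instr[1])
        else:                                    # binary operator
            left, right = queue.popleft(), queue.popleft()
            queue.append(OPS[instr](left, right))
    return queue.popleft()


balanced = ("*", ("+", 1, 2), ("-", 7, 3))       # (1 + 2) * (7 - 3)
code = queue_code(balanced)
value = run_queue(code)
```

For the balanced tree, the additions and subtractions on the same level appear next to each other in the generated code, which exposes their independence. Every operator finds its operands at the head of the queue, so no register names are needed.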
We observed that it is not a good idea to directly translate queue code to
SCAD code since SCAD machines often require less overhead than the queue
machine. With more processing units (thus more input and output buffers)
in SCAD, there are fewer restrictions in accessing values. It is important
to avoid the overhead since these additional operations degrade the use of
ILP, enabled in the first place by direct instruction communication in SCAD.
We established that it is, in general, an NP-hard problem to compile a pro-
gram for overhead-free execution on a given SCAD machine. Consequently,
Boolean constraints were formulated for optimal (overhead-free) compilation so
that SAT solvers may be used to determine the minimal number of processing
units required in SCAD to execute a given program without the use of any
overhead operations. The minimal number of PUs is an indirect measure of
the SCAD compiler’s effectiveness in avoiding overhead or equivalently the
flexibility of the SCAD compiler in maximizing the use of ILP by direct in-
struction communication. To this end, we also studied variants of the SCAD
machine, namely the statically ordered SCAD (SO-SCAD) and the dynami-
cally ordered SCAD (DO-SCAD), to compare it with classic control-flow and
dataflow architectures, respectively.
In processing units in SCAD, the operands for executing the next operation
(or ordering of operations in processing units) are determined by the compiler
and enforced by the hardware. In simpler processing units in SO-SCAD, the
operands for executing the next operation are both determined and enforced
by the compiler. This resembles the classic control-flow (or register) architec-
tures where the compiler encodes operands and the result of an instruction
by register addresses. In processing units in DO-SCAD, the operands for the
next operation are both determined and enforced at runtime by complex to-
ken matching hardware resembling classic dataflow architectures. We found
by experiments that the SO-SCAD architecture often requires many process-
ing units to avoid the overhead when executing programs by direct instruction
communication. In other words, the naive hardware support for direct in-
struction communication in SO-SCAD takes away a lot of the flexibility from
the compiler in utilizing ILP contained in the programs. At the same time,
programs can be compiled with only a few processing units for overhead-free
execution in SCAD. Given these minimal numbers of processing units, SCAD
can even exploit nearly the maximal ILP similar to DO-SCAD but without the
expensive token matching hardware. To summarize, the experimental study
revealed that the execution paradigm in SCAD strikes an essential balance
between hardware complexity and compiler flexibility.
Since optimal compilation for SCAD by SAT solvers could only process
small programs, we developed heuristics to compile real benchmarks. A novel
buffer interference analysis assigned instructions to the processing units so
that overhead-free move code is generated by considering instructions in pro-
gram order. We executed programs both by direct instruction communication
and by using registers on a cycle-accurate SCAD simulator. To this end,
a representative set of benchmarks was used that featured both regular and
irregular parallelism. It was observed that the execution by direct instruction
communication was faster even when we used ideal multi-ported register files
(with as many ports as registers) to execute register-based programs. More
importantly, the data transmission pattern showed that the communication
(transmission and reception) of values was distributed among processing units
in execution by direct instruction communication. On the contrary, as ex-
pected, the bulk of data traffic in the execution of register-based programs
was routed through a limited number of ports of the register file, revealing




We already suggested that SCAD architectures can benefit from well-established
hardware techniques (branch prediction) and compiler techniques (superblock
and hyperblock formation) to exploit ILP across control flow boundaries.
However, no experiments were carried out in this regard since we focused
mainly on direct instruction communication in SCAD in this thesis. It is im-
portant to experimentally compare the impact of these techniques in SCAD
with that in conventional superscalar and VLIW processors. For this pur-
pose, it is necessary to develop a heuristic for SCAD code generation that will
maximize the use of ILP rather than the resource-constrained code generation
heuristic discussed in this thesis. On the hardware front, the immediate next
step is to prototype a SCAD processor in hardware and to compare it with tra-
ditional superscalar and VLIW implementations in terms of area, cycle time,
and power consumption.
Execution paradigm
Asynchronous control unit: In SCAD, the source and target addresses of
a move instruction are registered synchronously in the corresponding input
and output buffer, respectively. If one of these buffers is full, the control unit
has to stall. This necessitates using a bus (MIB) that must be snooped by
all input and output buffers. Though multiple lanes may be used to register
independent moves simultaneously, the performance of applications with a
huge amount of ILP is expected to saturate due to the use of the MIB. The
MIB is also a hurdle in scaling SCAD to larger numbers of processing units.
Therefore, it is desirable to allow an asynchronous registration of addresses in
buffers. To this end, the SCAD program will consist of a sequence of addresses
per buffer. These independent sequences may be registered on the respective
buffers using any parallel interconnection network (address transport network)
that scales better than buses. We envision an architecture where both program
(sequences of addresses) registration and data flow proceed asynchronously.
However, with an asynchronous control unit, different aspects of the execution
paradigm and code generation must be looked at carefully for correctness.
Furthermore, since a program counter currently steers control flow in SCAD,
this asynchronous operation of the control unit (or the parallel registration
of addresses) must inevitably synchronize at control flow boundaries in the
program.
Control flow: The control flow is currently implemented in SCAD by branch
move instructions whose target is the control unit. Predication may be used
with loop unrolling to construct hyperblocks that cover conditional statements.
However, branch prediction is still required to execute multiple loop iterations
simultaneously. Branch prediction and ensuing speculative execution require
a lot of broadcast (one to many) communication from the control unit to all
processing units, which will eventually limit the scalability of SCAD to larger
numbers of processing units. Moreover, while branch prediction is complemented
by a synchronous control unit, it does not fit well with the nature
of an asynchronous control unit discussed above. It is therefore important to
consider other approaches in handling loops. For example, we may explore
possibilities of implementing the switch node in SCAD (see Section 2.1.3) to
execute loops similar to dataflow processors.
Timing predictability: Though not explored in this thesis, predictable ex-
ecution time was one of the factors considered when the execution paradigm
of SCAD was designed [BhJS15]. Memory accesses (especially caches) and
out-of-order execution in processing units adversely affect timing predictabil-
ity of the underlying hardware [HLTW03]. In SCAD, the number of memory
accesses is reduced by the use of FIFO buffers coupled with direct instruction
communication. Moreover, execution local to each processing unit proceeds in
program order. These factors speak in favour of SCAD as a time-predictable
processor architecture. We will investigate in greater detail the timing pre-
dictability of SCAD architectures and worst case execution time analysis of
move programs.
Model of computation: Move programs are inherently difficult to read and
understand. Consequently, directly verifying the properties of a move program
is a daunting task. Therefore, it is desirable to derive a denotational semantics
of the SCAD machine by formally studying the SCAD computational model,
which shall make it easier to reason about move programs and subsequently
aid in verifying them.
Code generation
Heuristics: The optimal code generation for SCAD by SAT solvers can gener-
ate move code given both resource and time constraints. However, only small
programs can be processed by the SAT solver. The heuristics discussed for
SCAD code generation can only generate overhead-free move code that uses
a minimal number of processing units (resource-constrained) for execution. It
is necessary to determine heuristics with constraints on the execution time to
maximize the use of ILP. It is further important to consider input and output
buffer sizes in SCAD processing units for generating move code that will work
with given buffer sizes.
Compilation phases: In classic compilers, register assignment and instruc-
tion scheduling are conflicting phases in that an early register assignment may
constrain instruction scheduling and vice versa. Analogously, the conflict-
ing compilation phases in SCAD are overhead optimization and instruction
scheduling. In the discussed heuristic, we obtain a constrained instruction
schedule to generate overhead-free SCAD code. It is also necessary to consider
other phase orderings, i.e., to optimize overhead for an instruction schedule
that maximizes the use of ILP. Finally, combined approaches should also be
explored.
Dataflow and functional programs: In imperative programs, the same
variable is often used to hold different values during different stages of the
program, and sequential execution is required to guarantee program correct-
ness. This is in harmony with the execution paradigm followed by register
machines where registers are used to hold different values during different
stages of the program execution. However, in the SCAD execution paradigm,
intermediate program values are preferably communicated directly from the
producer PUs to the consumer PUs instead of storing these in some central
register file. In other words, SCAD deals with values rather than variables.
This execution paradigm is closer to dataflow and functional programs, where
variables obey the single-assignment rule: each variable is assigned at most
once, and it is never overwritten. Therefore, it is worthwhile to study the
characteristics of dataflow and functional programs to possibly learn further




[ArCu84] Arvind and D.E. Culler. The Tagged Token Dataflow Architec-
ture. Technical Report FLA Memo 229. Cambridge, Massachusetts,
USA: MIT Lab for Computer Science, 1984.
[AdGh96] S.V. Adve and K. Gharachorloo. “Shared Memory Consistency
Models: A Tutorial”. In: IEEE Computer 29.12 (Dec. 1996),
pp. 66–76.
[AHKB00] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger. “Clock
rate versus IPC: the end of the road for conventional microarchi-
tectures”. In: International Symposium on Computer Architec-
ture. Vancouver, British Columbia, Canada: ACM, 2000, pp. 248–
259.
[Anan99] C.S. Ananian. The Static Single Information Form. Technical
Report MIT-LCS-TR-801. Cambridge, Massachusetts, USA: Mas-
sachusetts Institute of Technology, 1999.
[Ande17] M. Anders. “Complexity Analysis of Code Generation for the
SCAD Machine”. Bachelor's thesis. Department of Computer
Science, University of Kaiserslautern, Germany, Oct. 2017.
[ApPa02] A.W. Appel and J. Palsberg. Modern Compiler Implementation
in Java. 2nd ed. Cambridge University Press, 2002.
[AlWZ88] B. Alpern, M.N. Wegman, and F.K. Zadeck. “Detecting Equality
of Variables in Programs”. In: Principles of Programming Lan-
guages (POPL). San Diego, California, USA: ACM, 1988, pp. 1–
11.
[BoJa66] C. Böhm and G. Jacopini. “Flow Diagrams, Turing
Machines and Languages with Only Two Formation Rules”. In:
Communications of the ACM 9.5 (May 1966), pp. 366–371.
[BhJS15] A. Bhagyanath, T. Jain, and K. Schneider. “Poster Abstract: A
Time-Predictable Model of Computation”. In: Real-Time Sys-
tems Symposium. San Antonio, Texas, USA: IEEE Computer
Society, 2015, p. 376.
Bibliography
[BhJS16] A. Bhagyanath, T. Jain, and K. Schneider. “Towards Code Gen-
eration for the Synchronous Control Asynchronous Dataflow (SCAD)
Architectures”. In: Methoden und Beschreibungssprachen zur Mod-
ellierung und Verifikation von Schaltungen und Systemen. Freiburg,
Germany: University of Freiburg, 2016, pp. 77–88.
[BBDR12] B. Boissinot, P. Brisk, A. Darte, and F. Rastello. “SSI Proper-
ties Revisited”. In: ACM Transactions on Embedded Computing
Systems (TECS) 11S.1 (June 2012), 21:1–21:23.
[BhSc16] A. Bhagyanath and K. Schneider. “Optimal Compilation for Ex-
posed Datapath Architectures with Buffered Processing Units
by SAT Solvers”. In: Formal Methods and Models for Codesign.
Kanpur, India: IEEE Computer Society, 2016.
[BhSc17] A. Bhagyanath and K. Schneider. “Exploring the Instruction-
Level Parallelism Potential of Exposed Datapath Architectures
with Buffered Processing Units”. In: Application of Concurrency
to System Design. Ed. by A. Legay and K. Schneider. Zaragoza,
Spain: IEEE Computer Society, 2017.
[BhSc17a] A. Bhagyanath and K. Schneider. “Exploring Different Execu-
tion Paradigms in Exposed Datapath Architectures with Buffered
Processing Units”. In: International Conference on Embedded
Computer Systems: Architectures, Modeling, and Simulation (SAMOS).
Ed. by Y. Patt and S.K. Nandy. Samos, Greece: IEEE Computer
Society, 2017, pp. 1–10.
[BKMD04] D. Burger, S.W. Keckler, K.S. McKinley, M. Dahlin, L.K. John,
C. Lin, C.R. Moore, J. Burrill, R.G. McDonald, and W. Yoder.
“Scaling to the End of Silicon with EDGE Architectures”. In:
IEEE Computer 37.7 (July 2004), pp. 44–55.
[Chai04] G. Chaitin. “Register allocation and spilling via graph coloring”.
In: ACM SIGPLAN Notices 39.4 (Apr. 2004), pp. 66–74.
[CRSM04] G. Cichon, P. Robelly, H. Seidel, E. Matúš, M. Bronzel, and G.
Fettweis. “Synchronous Transfer Architecture (STA)”. In: Inter-
national Workshop on Embedded Computer Systems: Architec-
tures, Modeling, and Simulation. Samos, Greece: Springer Berlin
Heidelberg, 2004, pp. 343–352.
[Corp94] H. Corporaal. “Design of Transport Triggered Architectures”. In:
Great Lakes Symposium on VLSI (GLSVLSI). Notre Dame, IN,
USA: IEEE Computer Society, 1994, pp. 130–135.
[Corp99] H. Corporaal. “TTAs: Missing the ILP complexity wall”. In:




[CFRW91] R. Cytron, J. Ferrante, B.K. Rosen, M.N. Wegman, and F.K.
Zadeck. “Efficiently computing static single assignment form and
the control dependence graph”. In: ACM Transactions on Pro-
gramming Languages and Systems (TOPLAS) 13.4 (Oct. 1991),
pp. 451–490.
[Davi78] A.L. Davis. “The architecture and system method of DDM1: A
recursively structured Data Driven Machine”. In: International
Symposium on Computer Architecture (ISCA). Palo Alto, CA,
USA: ACM, 1978, pp. 210–215.
[Denn74] J.B. Dennis. “First Version of a Data-Flow Procedure Language”.
In: Programming Symposium. Ed. by B. Robinet. Vol. 19. LNCS.
Paris, France: Springer, 1974, pp. 362–376.
[DeMi75] J.B. Dennis and D.P. Misunas. “A Preliminary Architecture for
a Basic Dataflow Processor”. In: International Symposium on
Computer Architecture (ISCA). Ed. by W.K. King and O. Gar-
cia. Houston, TX, USA: ACM, 1975, pp. 126–132.
[FeEr81] M. Feller and M.D. Ercegovac. “Queue machines: An organiza-
tion for parallel computation”. In: Conpar 81. Vol. 111. LNCS.
Nürnberg, Germany: Springer, 1981, pp. 37–47.
[FGQF12] M. Fernández, R. Gioiosa, E. Quiñones, L. Fossati, M. Zulianello,
and F. J. Cazorla. “Assessing the suitability of the NGMP multi-
core processor in the space domain”. In: International Confer-
ence on Embedded Software. New York, NY, USA: ACM, 2012,
pp. 175–184.
[FiFY05] J.A. Fisher, P. Faraboschi, and C. Young. Embedded Comput-
ing: A VLIW Approach to Architecture, Compilers and Tools.
Morgan Kaufmann, 2005.
[FERN84] J.A. Fisher, J.R. Ellis, J.C. Ruttenberg, and A. Nicolau. “Parallel
processing: a smart compiler and a dumb machine”. In: ACM
SIGPLAN Notices 19.6 (1984), pp. 37–47.
[Fish81] J.A. Fisher. “Trace Scheduling: A Technique for Global Microcode
Compaction”. In: IEEE Transactions on Computers C-30.7 (July
1981), pp. 478–490.
[GaHo05a] S. Gatzka and C. Hochberger. “The AMIDAR Class of Recon-
figurable Processors”. In: The Journal of Supercomputing 32.2
(2005), pp. 163–181.
[GoHS11] V. Govindaraju, C.-H. Ho, and K. Sankaralingam. “Dynamically
Specialized Datapaths for Energy Efficient Computing”. In: In-
ternational Symposium on High Performance Computer Archi-




[GuKW85] J.R. Gurd, C.C. Kirkham, and I. Watson. “The Manchester pro-
totype dataflow computer”. In: Communications of the ACM
28.1 (Jan. 1985), pp. 34–52.
[Gros00] J.P. Grossman. Compiler and Architectural Techniques for Im-
proving the Effectiveness of VLIW Compilation. unpublished manuscript.
2000.
[HoCo94] J. Hoogerbrugge and H. Corporaal. “Register file port require-
ments of transport triggered architectures”. In: Microarchitecture
(MICRO). San Jose, California, USA: IEEE Computer Society,
1994, pp. 191–195.
[HoCo94a] J. Hoogerbrugge and H. Corporaal. “Transport-Triggering vs.
Operation-Triggering”. In: Compiler Construction (CC). Ed. by
P. Fritzson. Vol. 786. LNCS. Edinburgh, UK: Springer, 1994,
pp. 435–449.
[HSMC11] Y. He, D. She, B. Mesman, and H. Corporaal. “MOVE-Pro: A
low power and high code density TTA architecture”. In: Interna-
tional Workshop on Embedded Computer Systems: Architectures,
Modeling, and Simulation. Samos, Greece: IEEE, 2011, pp. 294–
301.
[HLTW03] R. Heckmann, M. Langenbach, S. Thesing, and R. Wilhelm.
“The influence of processor architecture on the design and the
results of WCET tools”. In: Proceedings of the IEEE 91.7 (July
2003), pp. 1038–1054.
[HoYS98] M. Horowitz, Chih-Kong Ken Yang, and S. Sidiropoulos. “High-
speed electrical signaling: overview and limitations”. In: IEEE
Micro 18.1 (Jan. 1998), pp. 12–24.
[JKVT15] P. Jääskeläinen, H. Kultala, T. Viitanen, and J. Takala. “Code
Density and Energy Efficiency of Exposed Datapath Architec-
tures”. In: Journal of Signal Processing Systems 80.1 (July 2015),
pp. 49–64.
[John91] W. Johnson. Superscalar Microprocessor Design. Englewood Cliffs,
New Jersey, USA: Prentice Hall, 1991.
[JaSW17] T. Jain, K. Schneider, and F. Walk. “Out-of-Order Execution of
Buffered Function Units in Exposed Data Path Architectures”.
In: Reconfigurable Architectures Workshop (RAW). Ed. by D.
Göhringer and D. Sciuto. Orlando, Florida, USA: IEEE Com-
puter Society, 2017, pp. 229–234.
[JoWa89] N.P. Jouppi and D.W. Wall. “Available instruction-level paral-
lelism for superscalar and superpipelined machines”. In: Archi-
tectural Support for Programming Languages and Operating Sys-
tems. Boston, Massachusetts, USA: ACM, 1989, pp. 272–282.
114
Bibliography
[Kahn74] G. Kahn. “The Semantics of a Simple Language for Parallel Pro-
gramming”. In: Information Processing. Ed. by J.L. Rosenfeld.
Stockholm, Sweden: North-Holland, 1974, pp. 471–475.
[Karp72] R.M. Karp. “Reducibility among Combinatorial Problems”. In:
Complexity of Computer Computations. Ed. by R.E. Miller and
J.W. Thatcher. Yorktown Heights, New York, USA: Plenum Press,
New York, 1972, pp. 85–103.
[Kild73] G.A. Kildall. “A unified approach to global program optimiza-
tion”. In: Principles of Programming Languages (POPL). Boston,
Massachusetts, USA: ACM, 1973, pp. 194–206.
[KaMi66] R.M. Karp and R.E. Miller. “Properties of a Model for Par-
allel Computations: Determinacy, Termination, Queueing”. In:
SIAM Journal on Applied Mathematics (SIAP) 14.6 (Nov. 1966),
pp. 1390–1411.
[KiYK83] M. Kishi, H. Yasuhara, and Y. Kawamura. “DDDP-a Distributed
Data Driven Processor”. In: International Symposium on Com-
puter Architecture (ISCA). New York, NY, USA: ACM, 1983,
pp. 236–242.
[Lam88] M. Lam. “Software pipelining: an effective scheduling technique
for VLIW machines”. In: Programming Language Design and Im-
plementation. Atlanta, Georgia, USA: ACM, 1988, pp. 318–328.
[LBFS98] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar,
and S. Amarasinghe. “Space-time scheduling of instruction-level
parallelism on a RAW machine”. In: Architectural Support for
Programming Languages and Operating Systems. San Jose, Cal-
ifornia, USA: ACM, 1998, pp. 46–57.
[Lee06] E.A. Lee. “The problem with threads”. In: IEEE Computer 39.5
(2006), pp. 33–42.
[LaHw95] D.M. Lavery and W.W. Hwu. “Unrolling-based optimizations
for modulo scheduling”. In: Microarchitecture (MICRO). Ann
Arbor, Michigan, USA: IEEE Computer Society, 1995, pp. 327–
337.
[MLCH92] S.A. Mahlke, D.C. Lin, W.Y. Chen, R.E. Hank, and R.A. Bring-
mann. “Effective compiler support for predicated execution us-
ing the hyperblock”. In: Microarchitecture (MICRO). Portland,
Oregon, USA: IEEE Computer Society, 1992, pp. 45–54.
[MoBj08a] L. Mendonça de Moura and N. Bjørner. “Z3: An Efficient SMT
Solver”. In: Tools and Algorithms for the Construction and Anal-
ysis of Systems. Vol. 4963. LNCS. Budapest, Hungary: Springer,
2008, pp. 337–340.
[McKe04] S.A. McKee. “Reflections on the memory wall”. In: Computing
Frontiers. Ischia, Italy: ACM, 2004, pp. 162–167.
115
Bibliography
[MSPP06a] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schw-
erin, M. Oskin, and S.J. Eggers. “Instruction scheduling for a
tiled dataflow architecture”. In: Architectural Support for Pro-
gramming Languages and Operating Systems (ASPLOS). Ed. by
J.P. Shen and M.R. Martonosi. San Jose, California, USA: ACM,
2006, pp. 141–150.
[MCCV06] M. Mishra, T.J. Callahan, T. Chelcea, G. Venkataramani, M.
Budiu, and S.C. Goldstein. “Tartan: evaluating spatial compu-
tation for whole program execution”. In: Architectural Support
for Programming Languages and Operating Systems (ASPLOS).
Ed. by J.P. Shen and M.R. Martonosi. San Jose, California, USA:
ACM, 2006, pp. 163–174.
[Mosb93] D. Mosberger. “Memory consistency models”. In: ACM SIGOPS:
Operating Systems Review 27.1 (Jan. 1993), pp. 18–26.
[NKBM04] R. Nagarajan, S.K. Kushwaha, D. Burger, K.S. McKinley, C.
Lin, and S.W. Keckler. “Static Placement, Dynamic Issue (SPDI)
Scheduling for EDGE Architectures”. In: Parallel Architectures
and Compilation Techniques. Antibes Juan-les-Pins, France: IEEE
Computer Society, 2004, pp. 74–84.
[VonN45] John von Neumann. First draft of a report on the EDVAC. Tech.
rep. Moore School of Electrical Engineering, University of Penn-
sylvania, June 1945.
[NoPa12] J. Nowotsch and M. Paulitsch. “Leveraging multi-core comput-
ing architectures in Avionics”. In: European Dependable Comput-
ing Conference. Washington, DC, USA: IEEE Computer Society,
2012, pp. 132–143.
[PaCu90a] G. Papadopoulos and D. Culler. “Monsoon: An Explicit Token-
Store Architecture”. In: International Symposium on Computer
Architecture. Ed. by J.-L. Baer and L. Snyder. Seattle, Washing-
ton, USA: IEEE Computer Society, 1990, pp. 82–91.
[RGGQ12] P. Radojković, S. Girbal, A. Grasset, E. Quiñones, S. Yehia, and
F. J. Cazorla. “On the evaluation of the impact of shared re-
sources in multithreaded COTS processors in time-critical en-
vironments”. In: ACM Transactions on Architecture and Code
Optimization 8.4 (Jan. 2012), 34:1–34:25.
[Rau94] B. Ramakrishna Rau. “Iterative modulo scheduling: an algo-
rithm for software pipelining loops”. In: Microarchitecture. San
Jose, California, USA: IEEE Computer Society, 1994, pp. 63–74.
[RaFi93] B.R. Rau and J.A. Fisher. “Instruction-level Parallel Processing:
History, Overview, and Perspective”. In: Journal of Supercom-
puting 7.1-2 (1993), pp. 9–50.
116
Bibliography
[RDKM00] S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Ka-
pasi, and J. D. Owens. “Register organization for media pro-
cessing”. In: High-Performance Computer Architecture (HPCA).
2000, pp. 375–386.
[RoWZ88] B.K. Rosen, M.N. Wegman, and F.K. Zadeck. “Global Value
Numbers and Redundant Computations”. In: Principles of Pro-
gramming Languages (POPL). San Diego, California, USA: ACM,
1988, pp. 12–27.
[Schn18] A. Schneiders. “Using Static-Single-Information-Form for SCAD
Code Generation”. Bachelor. MA thesis. Department of Com-
puter Science, University of Kaiserslautern, Germany, June 2018.
[Sing06] J. Singer. Static program analysis based on virtual register re-
naming. Technical Report UCAM-CL-TR-660. Computer Labo-
ratory, University of Cambridge, Feb. 2006.
[ScLY02] H. Schmit, B. Levine, and B. Ylvisaker. “Queue machines: hard-
ware compilation in hardware”. In: Field-Programmable Custom
Computing Machines. Napa, California, USA: IEEE Computer
Society, 2002, pp. 152–160.
[SBGM06] A. Smith, J. Burrill, J. Gibson, B. Maher, N. Nethercote, B.
Yoder, D. Burger, and K. S. McKinley. “Compiling for EDGE
architectures”. In: International Symposium on Code Generation
and Optimization. New York, NY, USA: IEEE, 2006.
[StNu04] R.C. Steinke and G.J. Nutt. “A unified theory of shared memory
consistency”. In: Journal of the ACM 51.5 (Sept. 2004), pp. 800–
849.
[SmSo95] J. Smith and G. Sohi. “The Microarchitecture of Superscalar
Processors”. In: Proceedings of the IEEE 83.12 (1995), pp. 1609–
1624.
[ScSL09] T. Schilling, M. Själander, and P. Larsson-Edefors. “Schedul-
ing for an Embedded Architecture with a Flexible Datapath”.
In: Annual Symposium on VLSI. Tampa, Florida, USA: IEEE
Computer Society, 2009, pp. 151–156.
[SeUl70] R. Sethi and J.D. Ullman. “The Generation of Optimal Code
for Arithmetic Expressions”. In: Journal of the ACM 17.4 (Oct.
1970), pp. 715–728.
[SSMP07] S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam,
K. Michelson, M. Oskin, and S.J. Eggers. “The WaveScalar Ar-
chitecture”. In: ACM Transactions on Computer Systems 25.2
(May 2007), pp. 1–54.
[TSBS08] M. Thuresson, M. Själander, M. Björk, L. Svensson, P. Larsson-
Edefors, and P. Stenstrom. “FlexCore: Utilizing Exposed Dat-
apath Control for Efficient Computing”. In: Journal of Signal
Processing Systems 57.1 (Apr. 2008), pp. 5–19.
117
Bibliography
[Toma67] R.M. Tomasulo. “An Efficient Algorithm for Exploiting Multiple
Arithmetic Units”. In: IBM Journal of Research and Develop-
ment 11.1 (1967), pp. 25–33.
[VSGG10] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin,
J. Lugo-Martinez, S. Swanson, and M.B. Taylor. “Conservation
cores: reducing the energy of mature computations”. In: Archi-
tectural Support for Programming Languages and Operating Sys-
tems (ASPLOS). Ed. by J.C. Hoe and V.S. Adve. Pittsburgh,
Pennsylvania, USA: ACM, 2010, pp. 205–218.
[VeFi78] R. Vedder and D. Finn. “The Hughes Data Flow Multiprocessor:
Architecture for Efficient Signal and Data Processing”. In: In-
ternational Symposium on Computer Architecture (ISCA). Los
Alamitos, CA, USA: IEEE Computer Society, 1985, pp. 324–332.
[Voll70] R. Vollmar. “Über einen Automaten mit Pufferspeicherung”. In:
Computing 5.1 (1970), pp. 57–70.
[WSCH15] L. Waeijen, D. She, H. Corporaal, and Y. He. “A Low-Energy
Wide SIMD Architecture with Explicit Datapath”. In: Journal
of Signal Processing Systems 80.1 (2015), pp. 65–86.
[WTSS97] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, et al. “Baring
it all to Software: RAW Machines”. In: IEEE Computer 30.9
(Sept. 1997), pp. 86–93.
[Wall91a] D.W. Wall. “Limits of instruction-level parallelism”. In: Archi-
tectural Support for Programming Languages and Operating Sys-
tems. Santa Clara, California, USA: ACM, 1991, pp. 176–188.
[WGRS09] R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and
C. Ferdinand. “Memory hierarchies, pipelines, and buses for fu-
ture architectures in time-critical embedded systems”. In: IEEE
Transactions on Computer-Aided Design of Integrated Circuits
and Systems 28.7 (Jan. 2009), pp. 966–978.
[WuMc95] W. A. Wulf and S. A. McKee. “Hitting the Memory Wall: Im-
plications of the Obvious”. In: ACM SIGARCH Computer Ar-
chitecture News 23.1 (Mar. 1995), pp. 20–24.
[YAJE14] F. Yazdanpanah, C. Alvarez-Martinez, D. Jimenez-Gonzalez, and
Y. Etsion. “Hybrid Dataflow/von-Neumann Architectures”. In:
IEEE Transactions on Parallel and Distributed Systems 25.6
(June 2014), pp. 1489–1509.
[ZyKo98] V. Zyuban and P. Kogge. “The energy complexity of register
files”. In: International Symposium on Low Power Electronics






// The function below computes an integer approximation to the
// square root of a given natural number a. It is known as
// Heron's algorithm, but is also derived by Newton-Raphson
// iteration of f(x) = x^2-a to compute roots.
// --------------------------------------------------------

nat res;

thread heron {
    // variables
    nat a, xold, xnew;

    // given natural number
    a = 100;

    // compute square root
    xnew = a;
    xold = a;
    do {
        xold = xnew;
        xnew = (xold + a/xold)/2;
    } while (xnew < xold);

    // store result
    res = xold;
}
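To trace the example by hand, it can help to run the same loop in a general-purpose language. The following is a minimal Python sketch of the do-while loop above, assuming that division on `nat` values truncates like Python's `//`. The function name `heron_isqrt` is hypothetical and not part of the thesis.

```python
def heron_isqrt(a: int) -> int:
    """Integer square root of a positive natural number a (Heron's iteration).

    Mirrors the do-while loop of the example program; a must be >= 1,
    because the first iteration divides by xold = a.
    """
    xnew = a
    xold = a
    while True:
        xold = xnew
        xnew = (xold + a // xold) // 2   # assumed truncating division
        if not (xnew < xold):            # loop condition of the do-while
            break
    return xold                          # the value stored to res


print(heron_isqrt(100))  # prints 10
```

For a = 100 the iterates of xold are 100, 50, 26, 14, 10, so the loop body runs five times before the comparison fails.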








0000 : a := 100
0001 : xnew := a
0002 : xold := a
read{a,xnew}
write{_t0,_t1,_t2,xnew,xold}
0003 : xold := xnew
0004 : _t0 := a / xold
0005 : _t1 := xold + _t0
0006 : xnew := _t1 / 2
0007 : _t2 := xnew < xold
0008 : if _t2 goto 3
read{xold}
write{res}
0009 : res := xold










Figure A.2.: Colored register interference graph
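As a rough illustration of how an interference graph like the one in Figure A.2 can be obtained from the three-address code 0000 to 0009 above, here is a minimal Python sketch. It assumes a textbook backward liveness analysis and adds an edge between a defined variable and every other variable live after the definition; the thesis may construct the graph differently. The tuple encoding and all function names are hypothetical. The register assignment is read off the comments of Listing A.2.

```python
# Each entry: (defined variable or None, used variables, successor indices).
# Illustrative encoding of the three-address code 0000..0009.
CODE = [
    ("a",    [],               [1]),     # 0000: a := 100
    ("xnew", ["a"],            [2]),     # 0001: xnew := a
    ("xold", ["a"],            [3]),     # 0002: xold := a
    ("xold", ["xnew"],         [4]),     # 0003: xold := xnew
    ("_t0",  ["a", "xold"],    [5]),     # 0004: _t0 := a / xold
    ("_t1",  ["xold", "_t0"],  [6]),     # 0005: _t1 := xold + _t0
    ("xnew", ["_t1"],          [7]),     # 0006: xnew := _t1 / 2
    ("_t2",  ["xnew", "xold"], [8]),     # 0007: _t2 := xnew < xold
    (None,   ["_t2"],          [3, 9]),  # 0008: if _t2 goto 3
    ("res",  ["xold"],         []),      # 0009: res := xold
]

# Register assignment as it appears in Listing A.2 (res is stored via the lsu).
REGISTERS = {"a": "reg3", "xnew": "reg2", "xold": "reg1",
             "_t0": "reg2", "_t1": "reg2", "_t2": "reg4"}


def liveness(code):
    """Iterate the backward dataflow equations until a fixed point is reached."""
    live_in = [set() for _ in code]
    live_out = [set() for _ in code]
    changed = True
    while changed:
        changed = False
        for i in reversed(range(len(code))):
            dst, uses, succs = code[i]
            out = set().union(*(live_in[s] for s in succs))
            inn = set(uses) | (out - {dst})
            if out != live_out[i] or inn != live_in[i]:
                live_out[i], live_in[i] = out, inn
                changed = True
    return live_in, live_out


def interference_edges(code, live_out):
    """A definition interferes with every other variable live after it."""
    edges = set()
    for (dst, _uses, _succs), out in zip(code, live_out):
        if dst is None:
            continue
        for other in out - {dst}:
            edges.add(frozenset((dst, other)))
    return edges


def is_valid_coloring(edges, coloring):
    """True if no two interfering variables share a register."""
    return all(coloring[u] != coloring[v] for u, v in map(tuple, edges))


live_in, live_out = liveness(CODE)
edges = interference_edges(CODE, live_out)
print(sorted(live_in[3]))                  # prints ['a', 'xnew']
print(is_valid_coloring(edges, REGISTERS))  # prints True
```

Under these assumptions the live-in sets at 0003 and 0009 agree with the read{a,xnew} and read{xold} annotations of the corresponding blocks. The variables a, xold, xnew and _t2 interfere pairwise, which is consistent with Listing A.2 using four registers.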
Appendix A: Compilation Example
$100  -> reg3        // a := 100
reg3  -> reg2        // xnew := a
reg3  -> reg1        // xold := a
reg2  -> reg1        // xold := xnew
reg3  -> pu0@l       // _t0 := a / xold
reg1  -> pu0@r
(/,1) -> pu0@opcp
pu0@o -> reg2
reg1  -> pu0@l       // _t1 := xold + _t0
reg2  -> pu0@r
(+,1) -> pu0@opcp
pu0@o -> reg2
reg2  -> pu0@l       // xnew := _t1 / 2
$2    -> pu0@r
(/,1) -> pu0@opcp
pu0@o -> reg2
reg2  -> pu0@l       // _t2 := xnew < xold
reg1  -> pu0@r
(<,1) -> pu0@opcp
pu0@o -> reg4
$3    -> cu@r        // if _t2 goto 3
reg4  -> cu@l
$0    -> lsu@l       // res := xold
reg1  -> lsu@r
st    -> lsu@opcp
Listing A.2: Register based move program
Queue oriented
nat res;

thread heron {
    nat a1;
    nat a, xold, xnew;

    a = 100;

    xnew = a;
    xold = a;
    a1 = a;
    do {
        a = a1;        // bound use of a
        xold = xnew;
        xnew = (xold + a/xold)/2;
        a1 = a;        // bound use of a1
    } while (xnew < xold);

    res = xold;
}
Listing A.3: Bounded program




nat res;

thread heron {
    nat _dN;
    nat a1;
    nat a, xold, xnew;

    a = 100;

    xnew = a;
    xold = a;
    a1 = a;
    do {
        _dN = xold;    // discard xold copy
        a = a1;
        xold = xnew;
        xnew = ((xold + (a / xold)) / 2);
        a1 = a;
    } while ((xnew < xold));
    _dN = xnew;        // discard xnew copy
    _dN = a1;          // discard a1 copy

    res = xold;
}
Listing A.4: Balanced program
read{}
write{a,a1,xnew,xold}
0000 : a := 100
0001 : xnew := a
0002 : xold := a
0003 : a1 := a
read{a1,xnew,xold}
write{_dN,_t0,_t1,_t2,a,a1,xnew,xold}
0004 : _dN := xold
0005 : a := a1
0006 : xold := xnew
0007 : _t0 := a / xold
0008 : _t1 := xold + _t0
0009 : xnew := _t1 / 2
0010 : a1 := a
0011 : _t2 := xnew < xold
0012 : if _t2 goto 4
read{a1,xnew,xold}
write{_dN,res}
0013 : _dN := xnew
0014 : _dN := a1
0015 : res := xold











Figure A.4.: Colored buffer interference graph
$100  -> pu3@l       // a := 100
$0    -> pu3@r
(+,3) -> pu3@opcp
pu3@o -> pu2@l       // xnew := a
$0    -> pu2@r
(+,1) -> pu2@opcp
pu3@o -> pu1@l       // xold := a
$0    -> pu1@r
(+,1) -> pu1@opcp
pu3@o -> pu3@l       // a1 := a
$0    -> pu3@r
(+,1) -> pu3@opcp
pu1@o -> null        // _dN := xold
pu3@o -> pu3@l       // a := a1
$0    -> pu3@r
(+,2) -> pu3@opcp
pu2@o -> pu1@l       // xold := xnew
$0    -> pu1@r
(+,4) -> pu1@opcp
pu3@o -> pu2@l       // _t0 := a / xold
pu1@o -> pu2@r
(/,1) -> pu2@opcp
pu1@o -> pu2@l       // _t1 := xold + _t0
pu2@o -> pu2@r
(+,1) -> pu2@opcp
pu2@o -> pu2@l       // xnew := _t1 / 2
$2    -> pu2@r
(/,2) -> pu2@opcp
pu3@o -> pu3@l       // a1 := a
$0    -> pu3@r
(+,1) -> pu3@opcp
pu2@o -> pu4@l       // _t2 := xnew < xold
pu1@o -> pu4@r
(<,1) -> pu4@opcp
$12   -> cu@r        // if _t2 goto 4
pu4@o -> cu@l
pu2@o -> null        // _dN := xnew
pu3@o -> null        // _dN := a1
$0    -> lsu@l       // res := xold
pu1@o -> lsu@r
st    -> lsu@opcp
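The queue-based move program above can be traced with a small simulator. The following Python sketch is illustrative only and rests on assumptions inferred from the comments in the listing, not on a SCAD specification: a processing unit fires as soon as its opcode arrives, `(op,n)` enqueues n copies of the result in the unit's output buffer, reading `puK@o` dequeues one copy, `cu@r` holds a zero-based move address and `cu@l` the branch condition, and `lsu@l`/`lsu@r` are address and value of a store. The names `PROGRAM`, `OPS` and `run` are hypothetical.

```python
from collections import deque

# The move program above as "source -> target" pairs (comments dropped).
PROGRAM = """
$100 -> pu3@l
$0 -> pu3@r
(+,3) -> pu3@opcp
pu3@o -> pu2@l
$0 -> pu2@r
(+,1) -> pu2@opcp
pu3@o -> pu1@l
$0 -> pu1@r
(+,1) -> pu1@opcp
pu3@o -> pu3@l
$0 -> pu3@r
(+,1) -> pu3@opcp
pu1@o -> null
pu3@o -> pu3@l
$0 -> pu3@r
(+,2) -> pu3@opcp
pu2@o -> pu1@l
$0 -> pu1@r
(+,4) -> pu1@opcp
pu3@o -> pu2@l
pu1@o -> pu2@r
(/,1) -> pu2@opcp
pu1@o -> pu2@l
pu2@o -> pu2@r
(+,1) -> pu2@opcp
pu2@o -> pu2@l
$2 -> pu2@r
(/,2) -> pu2@opcp
pu3@o -> pu3@l
$0 -> pu3@r
(+,1) -> pu3@opcp
pu2@o -> pu4@l
pu1@o -> pu4@r
(<,1) -> pu4@opcp
$12 -> cu@r
pu4@o -> cu@l
pu2@o -> null
pu3@o -> null
$0 -> lsu@l
pu1@o -> lsu@r
st -> lsu@opcp
"""

OPS = {"+": lambda x, y: x + y,
       "/": lambda x, y: x // y,          # assumed truncating division
       "<": lambda x, y: int(x < y)}


def run(program_text):
    """Execute the moves in order; return (memory, output buffers)."""
    moves = [tuple(part.strip() for part in line.split("->"))
             for line in program_text.strip().splitlines()]
    out = {}      # unit name -> FIFO output buffer
    inp = {}      # (unit, port) -> value waiting at an input port
    memory = {}
    pc = 0
    while pc < len(moves):
        src, dst = moves[pc]
        pc += 1
        if src.startswith("$"):            # immediate constant
            value = int(src[1:])
        elif src.endswith("@o"):           # dequeue one copy from a buffer
            value = out[src[:-2]].popleft()
        else:                              # opcode such as "(+,3)" or "st"
            value = src
        if dst == "null":                  # discard the dequeued copy
            continue
        unit, port = dst.split("@")
        inp[(unit, port)] = value
        if unit == "cu" and port == "l":   # condition arrives after target
            target, cond = inp.pop(("cu", "r")), inp.pop(("cu", "l"))
            if cond:
                pc = target
        elif unit == "lsu" and port == "opcp":
            inp.pop(("lsu", "opcp"))
            memory[inp.pop(("lsu", "l"))] = inp.pop(("lsu", "r"))
        elif port == "opcp":               # operands are already present
            op, copies = inp.pop((unit, "opcp")).strip("()").split(",")
            result = OPS[op](inp.pop((unit, "l")), inp.pop((unit, "r")))
            out.setdefault(unit, deque()).extend([result] * int(copies))
    return memory, out


memory, buffers = run(PROGRAM)
print(memory)                                     # prints {0: 10}
print(all(not q for q in buffers.values()))       # prints True
```

Under these assumptions the program stores 10 at address 0 and leaves every output buffer empty, which is what the discard moves to `null` are there to achieve: each produced copy is consumed exactly once on every path.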





Appendix B: Curriculum Vitae
2008–2009 Senior Software Engineer
Infineon Technologies
XC2000 Microcontroller Family Device Driver team
2007–2008 Software Engineer
Cypress Semiconductors
West Bridge Peripheral Controller Device Driver team
2005–2007 Software Engineer
Delphi Automotive Systems
Engine Management Systems team
Academic Education
2010–2012 MSc in Electrical and Computer Engineering, grade 1.3
Specialized in Embedded Systems
Technische Universität Kaiserslautern (TUK)
Master's thesis: Real-time scheduling of variable duration task graphs on multiple resources
2001–2005 BTech in Electrical and Electronics Engineering, 83%
National Institute of Technology Calicut (NITC), India
DSP based active noise control
School Education
2001 Intermediate diploma (Vordiplom), 92%
Kendriya Vidyalaya Kottayam, India
1999 University entrance qualification (Abitur), 91%
Kendriya Vidyalaya Kottayam, India
