Integer linear programming vs. graph-based methods in code generation by Kästner, Daniel & Langenbach, Marc
Integer Linear Programming vs GraphBased Methods in
Code Generation
Daniel Kastner
kaestnercsunisbde
Marc Langenbach
mlangencsunisbde
February  	
Abstract
A common characteristic of many embedded applications is that they are aimed at the
high-volume consumer market, which is extremely cost-sensitive. However, many of them
impose stringent performance demands on the underlying system. Therefore, code generation
must take into account the restrictions and features of the target architecture
while satisfying these performance demands. High-level language compilers are often unable
to generate code meeting these requirements. One reason is the phase coupling problem
between instruction scheduling and register allocation. Many compilers perform these tasks
separately, with each phase ignorant of the requirements of the other. Commonly, each task
is accomplished by heuristic methods. As the goals of the two phases often conflict,
whichever phase is performed first imposes constraints on the other, sometimes producing
inefficient code. Integer linear programming (ILP) provides an integrated approach to the
combined instruction scheduling and register allocation problem. This way, optimal solutions
can be found, albeit at the cost of high compilation times. In our experiments, we considered
the 32-bit DSP ADSP-2106x as the target processor. We have examined two different
ILP formulations and compared them with conventional approaches, including list scheduling
and the critical path method. Moreover, we have investigated approximations based on the
ILP formulations; this way, compilation time can be reduced considerably while still producing
near-optimal results. From the results of our implementation, we have concluded that
integrating ILP formulations into conventional global algorithms is a promising method for
generating high-quality code.
1 Introduction
In the last decade, digital signal processors (DSPs) have emerged as the processors of choice
for implementing embedded systems for the high-volume consumer market. The placement on
the high-volume market leads to a constraint of low prices; on the other hand, many embedded
applications impose stringent performance demands on the underlying system. High-level language
compilers are often unable to generate code meeting these requirements [SCL]. Much of the
research on optimizing compilers has concentrated on general-purpose processors or
machine-independent optimizations, so that special hardware features of typical DSPs, e.g. irregular
register sets and dual-banked memory, are not used efficiently. Another reason is the complexity
of code generation itself; here the phase coupling problem between instruction scheduling and
register allocation plays an important role.
To provide an insight into the underlying problem, the basics of code generation are sketched in
this section, following the notions in [WM] and [Bas]. A compiler takes an input program and,
performing syntactic and semantic analysis, transforms it into an intermediate representation.
Subsequently, code generation is performed, producing a semantically equivalent program in the
target language, the target machine's instruction set. The task of code generation is composed
of three subtasks: code selection, instruction scheduling, and register allocation. Since all these
subtasks represent complex problems (in fact, instruction scheduling and register allocation are
NP-hard), many compilers perform code generation in three largely independent phases, with each
subproblem solved by heuristic methods.
- The task of code selection is to generate a semantically equivalent target machine program
  for an intermediate-language program. In the worst case, code selection can be NP-hard
  too, but at least for RISC processors the problem is easier because of the simpler instruction
  set architecture.
- The goal of register allocation is to map values of variables and registers of the intermediate
  representation to physical registers in order to minimize the number of memory references
  during program execution. Register allocation itself consists of two subtasks:
  - In general, the number of simultaneously live variables exceeds the number of physical
    registers. In order to minimize data transfer to and from memory, the allocator has to
    decide which values are to be held in registers. In the context of distributed register
    sets, it is a task of increasing importance to map values to certain register sets.
  - After the allocation, the physical registers in which the values are to reside have to
    be determined. This subtask is called register assignment.
- Instruction scheduling: the instruction sequence selected by the code selector is to be
  reordered in order to efficiently exploit the parallel-processing capabilities and the instruction
  pipeline of the target machine.
Since code selection is easier to perform for RISC processors than for CISCs, the importance
of the interaction between register allocation and code selection decreases. Instead, the interaction
between register allocation and instruction scheduling becomes increasingly important. In the
context of instruction scheduling, registers are used to exploit instruction-level parallelism; when
register allocation is performed, they are required to reduce memory accesses. As the goals of
these two phases often conflict, whichever phase is executed first imposes constraints on the other,
sometimes resulting in inefficient code. This is called the phase ordering problem. When register
allocation is performed before instruction scheduling, it can limit the reordering capabilities
of the scheduler by assigning the same physical register to independent intermediate values. This
prevents the corresponding operations from being overlapped by the scheduler. When instruction
scheduling precedes register allocation, the number of simultaneously live values can be increased
so much that many of these values have to be spilled to main memory.
Over the years, several heuristic methods have been developed for the problem of instruction
scheduling, e.g. list scheduling [LDSM], percolation scheduling [Nic], or region scheduling
[GS], which use a graph-based representation of the program. While being very fast, these
classical heuristic methods have the disadvantage of only being able to find approximate, suboptimal
solutions without any information about the quality of the solution.
Formulations based on integer linear programming (ILP) offer the possibility of integrating
instruction scheduling and aspects of register allocation in a homogeneous problem description and
of solving them together. Moreover, it is possible to obtain an optimal solution of the combined
instruction scheduling and register allocation problem, albeit at the cost of high calculation times.
Another feature is the ability to provide lower bounds on the optimal schedule length. Since
graph-based heuristic approaches always calculate an upper bound, the quality of an approximate
solution can be estimated even if no exact solution could be obtained due to the complexity of the
problem. We have investigated two such ILP formulations, OASIC [GE], [GE] and SILP [Zha],
developed in the area of architectural synthesis.
The paper is organized as follows. In the second section, we describe our target architecture, the
digital signal processor ADSP-2106x. After a short presentation of the most important intermediate
representations in Section 3, an overview of conventional graph-based algorithms for instruction
scheduling and register allocation is given in Sections 4 and 5. The basics of integer linear
programming are sketched in Section 6, followed by an overview of the investigated ILP formulations
SILP and OASIC. Then extensions to these models, required to cope with global analyses and with
the architectural features of the ADSP-2106x, are outlined. The rest of Section 6 deals with some
approximation algorithms based on ILP formulations. These allow nearly optimal solutions
to be obtained in considerably shorter computation times. Our implementation is described and
experimental results are given in Section 7. Section 8 concludes and provides an outlook.
2 Architecture
In the scope of our paper, we consider as target architecture a 32-bit digital signal processor
with a load/store architecture, the ADSP-2106x Super Harvard Architecture Computer
[Ana], [Ana c], [Ana], [Ana a], [Ana b]. Its core processor consists of the register file, three functional
units, a control unit, two address generators (DAG1 and DAG2), a timer, and the instruction cache
(see Figure 1). Data can be transported via three buses (PM, DM, and I/O bus), which provide
connection to the program memory, the data memory, and the I/O processor.
The register le consists of two sets of sixteen bit registers which are used to store both xed
and oating point data Furthermore each set is divided into four groups of four consecutive
registers
The three functional unitsan arithmeticlogical unit ALU a shifter and a multipliercan
operate in parallel with some restrictions listed below if they use certain register groups as source
operands see gure  Multifunctional instructions use the multiplier and the ALU concurrently
or perform two simple instructions in the ALU eg a combined additionsubtraction
Providing two address generators the ADSP
x is capable of fetching a bit data from PM
and a bit data from DM simultanously This can be done in parallel with the arithmetic
operations
Instructions are executed in three clock cycles:
- In the fetch cycle, the ADSP-2106x reads the instruction from either the on-chip instruction
  cache or from program memory.
- In the decode cycle, the instruction is decoded.
- In the execute cycle, the instruction is executed, i.e. the operations it specifies are
  completed.
These cycles are pipelined. In sequential program flow, when one instruction is being fetched, the
instruction fetched in the previous cycle is being decoded, and the instruction fetched two cycles
before is being executed. Thus, the throughput is one instruction per cycle.
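The overlap of the three stages can be illustrated with a small sketch. It models only strictly sequential program flow (no branches, cache misses, or stalls); the function name `pipeline_trace` and the stage labels are illustrative assumptions, not part of the processor documentation.

```python
STAGES = ("fetch", "decode", "execute")

def pipeline_trace(n_instructions):
    """Return, per clock cycle, which instruction occupies which stage.

    Assumes strictly sequential flow: no branches, no stalls.
    """
    total_cycles = n_instructions + len(STAGES) - 1
    trace = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            instr = cycle - depth  # index of the instruction in this stage
            if 0 <= instr < n_instructions:
                row[stage] = instr
        trace.append(row)
    return trace

# Five instructions need 5 + 2 cycles; once the pipeline is full,
# one instruction completes in every cycle.
trace = pipeline_trace(5)
```

In this model, `trace[2]` shows instruction 2 being fetched while instruction 1 is decoded and instruction 0 is executed.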

Footnote 1 (program memory): The program memory is used to store both data and instructions.

Figure 1: Overview of the architecture of the ADSP-2106x.

[Figure 2 diagram: the register file with registers R0/F0 to R15/F15 in four groups of four,
feeding the multiplier and the ALU; remaining operands are labelled "Any Register".]
Figure 2: The use of the register file groups for a combined multiply/add instruction.
3 Intermediate Representation
A compiler internally represents the source program in several ways. This section briefly describes
the intermediate representations needed by the different scheduling algorithms.
Definition 1: A basic block in a given control flow graph is a path of maximal length such that
at most the first node of this path has more than one incoming edge and at most the last node has
more than one outgoing edge.
For every basic block the following holds: if the first instruction is executed, then all remaining
instructions will be executed too (assuming no runtime errors, exceptions, etc.).
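Definition 1 can be turned into a small partitioning routine. The following is a minimal sketch, assuming the control flow graph is given as a successor dictionary and that the entry node always starts a block; the names `basic_blocks`, `succ`, and `entry` are hypothetical.

```python
def basic_blocks(succ, entry):
    """Partition CFG nodes into basic blocks (maximal single-entry paths).

    succ maps every node to the list of its successors; entry is the
    node at which execution starts (assumed to begin a block).
    """
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)

    def starts_block(n):
        # A node continues its predecessor's block only if it has exactly
        # one predecessor and that predecessor has exactly one successor.
        if n == entry or len(pred[n]) != 1:
            return True
        return len(succ[pred[n][0]]) != 1

    blocks = []
    for n in succ:
        if not starts_block(n):
            continue
        block = [n]
        while len(succ[block[-1]]) == 1:
            nxt = succ[block[-1]][0]
            if starts_block(nxt):
                break
            block.append(nxt)
        blocks.append(block)
    return blocks

# A diamond-shaped control flow graph: b branches to c and d, which join in e.
cfg = {"a": ["b"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": ["f"], "f": []}
blocks = basic_blocks(cfg, "a")
```

For this example the blocks are `a, b`, then `c`, then `d`, then `e, f`: only the first node of a block has several incoming edges, and only the last has several outgoing ones.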
Denition  The basic block graph G
B
of a given control ow graph G
cf
is derived from G
cf
by replacing basic blocks by nodes Edges in G
cf
that lead to the rst node of a basic block are
connected to the node representing the basic block Edges that leave the last node of a basic block
in G
cf
 become outgoing edges of the basic block node
The sequence of instructions within a basic block may be rearranged subject to certain
restrictions, which are determined by the data dependencies between instructions. These
dependencies are pairs of read or write accesses to the same register or to other components that
influence the overall state of the machine. Write accesses are called definitions, read accesses
uses. Data dependencies can be categorized as
- true dependencies (def-use),
- output dependencies (def-def),
- anti dependencies (use-def).
In none of these cases may the positions of the two instructions involved be interchanged. The
data dependencies for a given basic block are given by its data dependence graph.
Denition  Given a basic block B Its data dependence graph is a labelled acyclic directed
graph G
D
 V
D
 E
D
 whose nodes are labelled with the instructions of B An edge exists between
nodes x and y x y  V
D
 if there is a sequence of instructions x  n

 n

     n
k
 y such that
 x is a denition y is a use of the same resource and there is no other use on a path from x
to y true dependence or
 x uses a resource which will be written by y and there is no other write access to that resource
on a path from x to y anti dependence or
 x and y are denitions of the same resource and there is neiter another denition nor a use
on any path from x to y output dependence
If E
true
D
denotes the true dependencies E
anti
D
the anti dependencies and E
output
D
the output depen
dencies then the set of edges can be rewritten as
E
D
 E
true
D
E
anti
D
 E
output
D
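The three kinds of dependence edges can be computed for straight-line code with a single pass. The sketch below is an illustration under stated assumptions, not the authors' implementation: each instruction is modelled as a pair of sets (resources written, resources read), and a dependence is taken to end at the next definition of the resource, which is the common convention and slightly simplifies the wording of Definition 3. The name `data_dependences` is hypothetical.

```python
def data_dependences(instrs):
    """Return dependence edges (i, j, kind) with i < j for straight-line code.

    instrs: list of (defs, uses) pairs, each a set of resource names,
    given in program order.
    """
    edges = set()
    last_def = {}    # resource -> index of its most recent definition
    uses_since = {}  # resource -> indices of uses since that definition
    for j, (defs, uses) in enumerate(instrs):
        for r in uses:
            if r in last_def:
                edges.add((last_def[r], j, "true"))       # def-use
        for r in defs:
            readers = uses_since.get(r, [])
            for i in readers:
                edges.add((i, j, "anti"))                 # use-def
            if r in last_def and not readers:
                edges.add((last_def[r], j, "output"))     # def-def
        # Update the bookkeeping only after all edges for j are recorded.
        for r in uses:
            uses_since.setdefault(r, []).append(j)
        for r in defs:
            last_def[r] = j
            uses_since[r] = []
    return edges

# 0: r1 = load    1: r2 = r1 + r1    2: r1 = r2 * r2    3: r2 = 0
block = [({"r1"}, set()), ({"r2"}, {"r1"}), ({"r1"}, {"r2"}), ({"r2"}, set())]
ddg = data_dependences(block)
```

Here instruction 2 is both truly dependent on instruction 1 (it reads `r2`) and anti-dependent on it (it overwrites `r1`, which instruction 1 reads).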

4 Conventional Approaches
Current microprocessors usually provide several hardware resources that can be used in parallel.
These so-called horizontal processors force the compiler to reorder the generated microcode in order
to improve its efficiency. This task is known as compaction, since two or more instructions that are
to be executed in parallel are packed into the same instruction word.
Compaction methods can be classified as local or global. Local techniques consider only basic blocks
as their compaction scope, whereas global techniques try to overcome the basic block boundaries by
different means. A well-known example of a global technique is trace scheduling, which combines
basic blocks that are often executed successively to form a single trace. These traces are then
scheduled.
4.1 Critical Path Method
Ramamoorthy and Tsuchiya [LDSM] introduced critical path (CP) algorithms for microcode
compaction. Their work has some similarities to the critical path approach to processor
scheduling.
The minimum number of cycles needed to schedule a basic block corresponds to the maximum
depth of its data dependence graph. This longest path is called the critical path. The ordering of
instructions in the critical path is implied by their data dependencies. The remaining instructions
are then placed into the critical path, which may lead to additional cycles.
The calculation of the critical path takes three steps. First, an early and a late partition are
created. In the early partition, each instruction is scheduled into the earliest cycle in which no
data dependencies are violated. Resource conflicts are not taken into account here. The late
partition is computed analogously, with the direction of the edges in the data dependence graph
reversed. In this way, instructions are scheduled into the last possible cycle. The number of cycles
needed to schedule the instructions of the basic block in the early partition is equal to the number
needed for the late partition.
The critical path consists of those instructions that are mapped to the same cycle in the early as
well as in the late partition. These instructions form the critical partition and serve as a frame
for adding the remaining operations. In a following step, resource conflicts in the critical partition
are resolved; this may lengthen the schedule.
In the last step, the remaining instructions are inserted into the revised critical partition. This is
done for each instruction by testing the cycles between its early and late position with respect
to data dependencies and resource conflicts. This may also lengthen the schedule. A scheme of
the implementation is given in Algorithm 1.
The critical path method is not always able to create an optimal schedule. This is due to the fact
that subsequent subframes (these are newly created cycles) cannot be merged, although
this would be permitted by the data dependencies and the use of resources.

Footnote 2 (subframes): For each cycle in the revised critical partition a frame is created.
Inserting new cycles due to resource conflicts would demand a recalculation of the early and the
late partition. To avoid this computation, subframes are inserted instead of new cycles.

criticalpathBASICBLOCK bb

LIST framesddgdepth
LIST noncritical
MICROOP m
int c
createpartitionsbb
forallbb m
if m	
ep  m	
lp
insertmopframesm	
ep m
else
appendnoncritical m
forallnoncritical m
for cm	
ep cm	
lp c
if insertmopframesc m
break

Algorithm  Scheme of the implementation of critical path method
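The first step of the method, the computation of the early and late partitions, can be sketched as follows. This is a simplified illustration assuming unit latency for every instruction, no resource limits, and dependence edges `(i, j)` with `i < j` in program order; the function name `partitions` is hypothetical, and the frame and subframe handling of Algorithm 1 is not modelled.

```python
def partitions(n, edges):
    """Early and late cycle of each of n instructions, plus the critical set.

    edges: iterable of (i, j) with i < j, meaning j depends on i.
    Unit latency is assumed; resource conflicts are ignored.
    """
    preds = [[] for _ in range(n)]
    succs = [[] for _ in range(n)]
    for i, j in edges:
        preds[j].append(i)
        succs[i].append(j)

    # Early partition: first cycle in which all predecessors are finished.
    early = [0] * n
    for j in range(n):
        early[j] = max((early[i] + 1 for i in preds[j]), default=0)

    # Late partition: same computation on the reversed graph.
    last = max(early, default=0)
    late = [last] * n
    for i in reversed(range(n)):
        late[i] = min((late[j] - 1 for j in succs[i]), default=last)

    critical = [m for m in range(n) if early[m] == late[m]]
    return early, late, critical

early, late, critical = partitions(5, [(0, 1), (1, 3), (2, 3), (0, 4)])
```

In the example, instructions 0, 1, and 3 have identical early and late cycles and therefore form the critical partition, while instructions 2 and 4 each have one cycle of slack.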
4.2 List Scheduling
List scheduling is a local scheduling method. It evolved out of a heuristic branch-and-bound
algorithm [LDSM] and is also used by global methods such as trace scheduling [Fis].
In addition to a list of the instructions in the basic block, the data dependence graph is needed.
During execution, list scheduling maintains a set called data ready, which contains all instructions
that have no predecessor in the data dependence graph or whose predecessors have already been
scheduled.
The algorithm can now easily be explained: starting with the first cycle of the basic block, an
instruction is taken out of the data ready set and put into the current cycle, as long as there are
instructions available and no resource conflict is encountered. Then the cycle is incremented and
the data ready set is updated. This is repeated until all instructions are scheduled. Pseudocode
for list scheduling is presented in Algorithm 2.
In order to select an instruction from the data ready set, a heuristic is used that assigns priority
values to the instructions. Instructions with higher priority are preferred. Possible heuristics are:
- first fit: The first instruction found in the data ready set is taken.
- longest remain: After every update, a counter for each microinstruction is incremented. The
  instruction with the greatest counter value is selected.
- max depend: The priority of an instruction is determined by the number of its successors in the
  data dependence graph.
- highest level: The priority equals the length of the longest path in the data dependence graph
  starting from that instruction.
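The scheme described above, combined with the highest level heuristic, can be sketched in a few lines. The resource model used here (one resource type per instruction and a per-cycle capacity per type, each at least 1) is a simplifying assumption for illustration and does not reflect the register group restrictions of the target processor; all names are hypothetical. Edges `(i, j)` are assumed to satisfy `i < j`.

```python
def list_schedule(n, edges, unit, capacity):
    """Assign a cycle (starting at 1) to each of n unit-latency instructions.

    edges:    (i, j) pairs with i < j, meaning j depends on i
    unit:     unit[m] is the resource type instruction m needs (assumed model)
    capacity: capacity[u] >= 1 instructions of type u fit into one cycle
    """
    preds = [set() for _ in range(n)]
    succs = [[] for _ in range(n)]
    for i, j in edges:
        preds[j].add(i)
        succs[i].append(j)

    # "highest level": length of the longest path starting at the instruction
    level = [0] * n
    for m in reversed(range(n)):
        level[m] = 1 + max((level[s] for s in succs[m]), default=0)

    cycle_of = [None] * n
    scheduled = set()
    cycle = 0
    while len(scheduled) < n:
        cycle += 1
        # data ready set: unscheduled, all predecessors in earlier cycles
        ready = [m for m in range(n)
                 if m not in scheduled and preds[m] <= scheduled]
        ready.sort(key=lambda m: -level[m])   # stable: ties keep program order
        used = {}
        for m in ready:
            u = unit[m]
            if used.get(u, 0) < capacity[u]:  # no resource conflict
                used[u] = used.get(u, 0) + 1
                cycle_of[m] = cycle
        scheduled.update(m for m in ready if cycle_of[m] == cycle)
    return cycle_of

schedule = list_schedule(
    5, [(0, 1), (1, 3), (2, 3), (0, 4)],
    unit=["mem", "alu", "mem", "alu", "alu"],
    capacity={"mem": 1, "alu": 1},
)
```

With one memory port and one ALU per cycle the example needs four cycles, one more than the depth of its dependence graph, because instructions 1, 3, and 4 compete for the ALU.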
As an extension to list scheduling, we have integrated register allocation into the task of scheduling.

    list_scheduling(BASICBLOCK bb)
    {
        LIST mops;
        int cycle = 0;
        MICROOP m;

        update_data_ready();
        while (entries(data_ready) > 0) {
            cycle++;
            /* fill the current cycle with compatible ready operations */
            while ((m = choose_microop()) != NULL)
                if (are_parallel(m, mops))
                    append(mops, m);
            forall(mops, m)
                m->cycle = cycle;
            remove_from_data_ready(mops);
            update_data_ready();
            empty(mops);
        }
    }

Algorithm 2: Scheme of the implementation of list scheduling.
First a prerun scheduler is invoked to gather information that can be used by the register allocator
which assigns machine registers to symbolic registers Then the scheduler performs the nal
reordering of the instructions To be more precise the scheduler identies operations that cannot
be compacted due to restrictions on register usage

For dierent instruction dierent register
groups are required therefore colliding instructions can be marked with the groups they need
to belong to in order to resolve the conict The register allocator tries to satisfy these group
assignments as long as they do not inhibit the graph coloring
4.3 Global Methods
In our implementation, we did not consider any global graph-based algorithm, since these are suited
for large input programs. Our test programs contain only a few basic blocks, so the global
methods would have found little additional parallelism, if any at all. Since the aim of our work is a
comparison between conventional and ILP-based methods, we had to choose programs suitable
for both approaches. Larger programs would have prevented the ILP scheduler from computing
complete solutions within an acceptable amount of time. For the complete conv program, for example,
the optimal solution using ILP methods could not be calculated within twenty-four hours, so we
stopped the calculation. The program conv examined in this paper is only the first basic
block of the original convolution filter, to make a useful comparison possible.

Footnote 3 (restrictions on register usage): Recall that for multifunctional instructions certain
register groups are required (see Figure 2).

5 Register Allocation
The input for the register allocator is an intermediate representation of the source program in
which a symbolic register is assigned to every operation and every modified variable. This unbounded
number of symbolic registers is to be mapped onto the limited number of machine registers. A
machine register cannot be assigned to two different symbolic registers if their live ranges overlap.
Definition 4: A symbolic register $r$ is live at a program point $p$ if $r$ is defined on a program
path from the entry node of the procedure to $p$ and there exists a path from $p$ to a use of $r$ on
which $r$ is not defined. The live range of a symbolic register $r$ is the set of program points at
which $r$ is live.
Definition 5: Two live ranges of symbolic registers interfere with one another if one of them is
defined during the live range of the other. The register interference graph is an undirected graph.
Its nodes are live ranges of symbolic registers, and there is an edge between every two interfering
live ranges.
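For straight-line code, the interference test of Definition 5 reduces to comparing intervals. The sketch below assumes each symbolic register is defined exactly once and models its live range as a pair `(d, u)` of the defining instruction index and the index of its last use; this interval model and the name `interference_graph` are assumptions made for illustration.

```python
from itertools import combinations

def interference_graph(ranges):
    """Build the register interference graph for straight-line code.

    ranges: dict name -> (d, u), the index of the defining instruction
    and of the last use. Two ranges interfere if one register is defined
    strictly inside the live range of the other.
    """
    graph = {r: set() for r in ranges}
    for a, b in combinations(ranges, 2):
        (da, ua), (db, ub) = ranges[a], ranges[b]
        if db < da < ub or da < db < ua:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# s3 is defined by the instruction that performs the last use of s1,
# so the two do not interfere and may share a machine register.
live = {"s1": (0, 2), "s2": (1, 3), "s3": (2, 4), "s4": (3, 5)}
graph = interference_graph(live)
```

The resulting graph for this example is a simple chain s1, s2, s3, s4 in which only neighbouring live ranges overlap.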
5.1 Graph Coloring
Graph coloring is a common method to solve the problem of register assignment [WM]. The
register interference graph is the graph to be colored, and the number of available machine registers
is the number of colors to be used. The graph coloring problem is to assign one of $k$ possible
colors to every node such that every two nodes that share an edge have different colors. For
$k \ge 3$ the problem is NP-complete, but in the context of register allocation there exist a number
of heuristic methods which have been well tried and tested in practice.
One heuristic method is the following: if the graph contains a node $n$ of degree less than $k$, then
$n$ can always be assigned a color that differs from the colors of its neighbours. $n$ is removed from
the graph, which results in a new graph with one node and several edges fewer. The problem has thus
been reduced recursively to a smaller one.
If a $k$-coloring cannot be obtained by this method (which does not mean that no such coloring
exists), some live ranges have to be spilled, i.e. stored to memory.
 ILPbased Methods
 General Introduction
The application area of integer linear programming covers a rich variety of problems In integer
programming problems an objective function is maximized or minimized subject to inequality and
equality constraints and integrality restrictions on some or all of the variables ILP methods are
used to solve scientic or economic problems NW NKT	 Typical applications are concerned
with the management and ecient use of scarce resources to increase productivity These applica
tions include production scheduling machine sequencing Bru	 VLSIdesign portfolio analysis
DK	
 as well as problems in molecular biology high energy physics and xray crystallography
The calculation of an optimal solution of an integer linear program is NPhard yet many large
instances of such problems can be solved This however requires the selection of a structured
formulation and no adhoc approach CWM	
	
[Figure 3 diagram: axes $x_1$ and $x_2$; regions $P_F$ and $P_I$; the objective function;
the integer points.]
Figure 3: Feasible areas.
In this paper we will just sketch the basics of integer linear programming which are essential
for the understanding of the presented ILPapproaches For further information see eg NW
NKT	 PSa or CWM	
Integer linear programming (ILP) is the following optimization problem:

\[ \min z_{IP} = c^T x, \qquad x \in P_F \cap \mathbb{Z}^n \]

where

\[ P_F = \{ x \mid Ax \ge b,\ x \in \mathbb{R}^n_+ \}, \qquad
   c \in \mathbb{R}^n,\ b \in \mathbb{R}^m,\ A \in \mathbb{R}^{m \times n}. \]

The set $P_F$ is called the feasible region. If some of the variables have to be integral while the
others may also take real values, the problem is called a mixed integer linear problem (MILP). We
will assume that $A \in \mathbb{Z}^{m \times n}$ and $b \in \mathbb{Z}^m$ holds. Then the optimal
solution of an integer linear program can be calculated by solving the following problem [NW]:

\[ \min z_{IP} = c^T x, \qquad x \in P_I \]

where

\[ P_I = \operatorname{conv}(\{ x \mid x \in P_F \cap \mathbb{Z}^n \}). \]
Here conv denotes the convex hull In the following we need another denition
Denition  	Relaxation
 Let Q be an optimization problem with a feasible region XQ
An optimization problem Q
R
is called a relaxation of Q if for the feasible region XQ
R
 the
following holds
XQ  XQ
R

For the twodimensional case a representation of P
F
und P
I
equations  and  is given in gure 
The integral points within P
F
denote the feasible solutions to the integer linear problem depending
on the objective function at least one of them represents an optimal solution The feasible region
of  consists only of the integer points whereas the feasible region of  P
I
 consists of the
convex hull of these points

Since $P_F$ is described only by equality and inequality constraints (no integrality constraints are
required), any linear objective function over $P_F$ can be optimized in polynomial time using linear
programming algorithms. Unfortunately, in most cases no representation of $P_I$ as a system of
linear inequalities is known; furthermore, the number of inequality constraints required to describe
the convex hull is usually extremely large [NKT]. Therefore, one can try to solve a related
problem, called the LP-relaxation of the integer linear problem, which reads as follows:

\[ \min z_R = c^T x, \qquad x \in P_F \]

with

\[ P_F = \{ x \mid Ax \ge b,\ x \in \mathbb{R}^n_+ \}, \qquad
   c \in \mathbb{R}^n,\ b \in \mathbb{Z}^m,\ A \in \mathbb{Z}^{m \times n}. \]
Since $P_I \subseteq P_F$, comparing the integer linear program over $P_I$ with its LP-relaxation
shows that $z_R \le z_{IP}$. If $P_F = P_I$, the polyhedron $P_F$ is called integral, and in this
case the equation $z_R = z_{IP}$ holds. Thus, the optimal solution can
be calculated in polynomial time by solving the LP-relaxation. Therefore, when formulating an
integer linear program, one should attempt to find equality and inequality constraints such that
$P_F$ will be integral. It has been shown that for every bounded system of rational inequalities
there is an integer polyhedron [GE], [PSb]. Unfortunately, for most problems it is not known
how to formulate these additional inequalities, and there could be an exponential number of them
[NKT].
In general P
I
 P
F
 and the LPrelaxation provides a lower bound on the objective function The
eciency of many integer programming algorithms depends on the tightness of this bound The
better P
F
approximates the feasible region P
I
 the sharper is the bound so that for an ecient
solution of an ILPformulation it is extremely important that P
F
is close to P
I
 This can be
achieved by developing tight descriptions of P
F
that closely approximate P
I
 Moreover formal
analysis can be used to determine new valid inequalities the inequalities that arise due to the
integrality of the variables so that the formulations can be further tightened
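The lower bound provided by the LP-relaxation can be checked on a tiny two-dimensional instance. The following sketch uses only the Python standard library: it enumerates the vertices of the constraint region (assuming the LP minimum is attained at a vertex, which holds here because the variables are nonnegative and the objective is bounded below) and compares the result with a brute-force search over integer points. The instance, the `>=` form of the constraints, and all names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations, product

def feasible(x, cons):
    """True if a1*x1 + a2*x2 >= b holds for every constraint (a1, a2, b)."""
    return all(a1 * x[0] + a2 * x[1] >= b for a1, a2, b in cons)

def lp_min(c, cons):
    """Minimum of c.x over the constraint region, found by vertex enumeration."""
    best = None
    for (a1, a2, b), (d1, d2, e) in combinations(cons, 2):
        det = a1 * d2 - a2 * d1
        if det == 0:
            continue  # parallel constraint lines have no intersection point
        # Cramer's rule for the intersection of the two constraint lines
        x = (Fraction(b * d2 - a2 * e, det), Fraction(a1 * e - b * d1, det))
        if feasible(x, cons):
            value = c[0] * x[0] + c[1] * x[1]
            best = value if best is None else min(best, value)
    return best

def ilp_min(c, cons, bound):
    """Minimum of c.x over the integer points in [0, bound]^2 (brute force)."""
    return min(c[0] * x[0] + c[1] * x[1]
               for x in product(range(bound + 1), repeat=2)
               if feasible(x, cons))

# min x1 + x2  subject to  2*x1 + x2 >= 4,  x1 + 2*x2 >= 4,  x1, x2 >= 0
cons = [(2, 1, 4), (1, 2, 4), (1, 0, 0), (0, 1, 0)]
z_r = lp_min((1, 1), cons)        # optimum of the LP-relaxation
z_ip = ilp_min((1, 1), cons, 5)   # optimum over the integer points
```

For this instance the relaxation attains 8/3 at the fractional vertex (4/3, 4/3), while the best integer point, for example (1, 2), has value 3, so the relaxation yields a strict lower bound.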
By using such techniques, the ILP formulations examined in this paper aim at an efficient
solution of the integrated instruction scheduling and register allocation problem. They are based on
formulations which were developed in the area of architectural synthesis. The goal of architectural
synthesis consists of finding either the fastest architecture for a given code sequence or the cheapest
architecture for a given performance requirement [GE]. This problem formulation is closely
related to the integrated instruction scheduling and register allocation problem. For a correct
synthesis, the problem of instruction scheduling has to be taken into account; as part of a more
general resource allocation, register allocation is also part of the problem. The difference is
mainly that, when solving the problem of instruction scheduling in a compiler, the hardware is
fixed. Thus, the input code has to be transformed to allow an efficient use of the given hardware
resources while maintaining the program's semantics. Resource allocation and minimization of
hardware costs are less important in this scope.
When no resource constraints have to be considered, local instruction scheduling can be performed
by applying the critical path method (CPM) to the data dependence graph (see Section 4.1).
This algorithm calculates for each node of the dependence graph, i.e. for each instruction of the
input program, the earliest possible control step (asap, as soon as possible) and the latest possible
control step (alap, as late as possible). Instructions scheduled to the same control
step are executed in the same clock cycle of the target machine. The asap and alap values are
important for both ILP formulations considered in this paper. The reason is that they allow
each instruction to be assigned an interval of valid control steps in which its execution may
take place. Thus, the size of the feasible region of the generated ILP is reduced.

Each operation of the input program can be executed by a certain resource type. In order to
describe the mapping of instructions to hardware resources, a resource graph is used, which is
defined following [Zha].
Definition 7 (resource graph): The resource graph $G_R = (V_R, E_R)$ is a bipartite directed
graph. The set of nodes $V_R = V_D \cup V_K$ is composed of the nodes of the data dependence graph
$V_D$ and the available resource types, represented by $V_K$. Its edge set
$E_R \subseteq V_D \times V_K$ describes a possible assignment, where $(j, k) \in E_R$ means that
instruction $j \in V_D$ can be executed by the resources of type $k$.
6.2 SILP
The ILP formulation described in this section was presented in [Zha] under the name SILP
(Scheduling and Allocation with Integer Linear Programming). First, we give an overview of
the terminology used.
- The variable $t_i$ indicates the relative position of a microoperation within the instructions of
  the optimized code sequence; the $t_i$ values have to be integral. For linear program flow, the
  mapping of a microoperation to the $c$-th instruction of the machine program is equivalent to
  assigning the starting time for the execution of this microoperation to the $c$-th clock cycle
  (control step). For nonlinear program flow, this correspondence need not hold.
- $w_j$ describes the execution time of instruction $j \in V_D$.
- The busy time of the hardware component executing operation $j$ is denoted by $z_j$, i.e. the
  minimal time interval between successive data inputs to this functional unit.
- The number of available resources of type $k \in V_K$ is $R_k$.
- $\varrho_j$ describes the live range of a variable created by operation $j$.
The ILP is generated from a resource flow graph $G_F$. This graph describes the execution of a
program as a flow of the available hardware resources through the program's instructions; for each
resource type this leads to a separate flow network. Each resource type $k \in V_K$ is represented
by two nodes $k_Q, k_S \in V_F$; the nodes $k_Q$ are the sources, the nodes $k_S$ are the sinks in
the flow network to be defined. The first instruction to be executed on resource type $k$ gets an
instance $k_r$ of this type from the source node $k_Q$; after completed execution, it passes $k_r$
to the next instruction using the same resource type. The last instruction using a certain instance
of a resource type returns it to $k_S$. The number of simultaneously used instances of a certain
resource type must never exceed the number of available instances of this type. An example resource
flow graph for two different resource types and two instructions each is given in Figure 4.
Denition  	resource ow graph
 The resource ow graph G
F
is a directed graph G
F

V
F
 E
F
 with
V
F


kV
K
V
k
F
und E
F


kV
K
E
k
F
where
V
k
F
 V
k
D
 fk
Q
 k
S
g  fu  V
D
ju k  E
R
g  fk
Q
 k
S
g
and
E
k
F
 fi j ji j  V
k
D

 j not dependent on i 
 i  jg
 fk
Q
 j jk j  E
R
g
 fj k
S
 jk j  E
R
g

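[Figure: Resource flow graph for two instructions executed on an ALU and the data memory, respectively. The data memory network runs from the source DM_Q to the sink DM_S and contains the instructions r1 = dm(i0, m0) and r4 = dm(i1, m1); the ALU network runs from the source S_Q to the sink S_S and contains r6 = r4 + r5 and r7 = min(r4, r5).]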
Each edge $(i,j) \in E_F^k$ is mapped to a flow variable $x_{ij}^k \in \{0,1\}$. A hardware resource of type $k$ is moved through the edge $(i,j)$ from node $i$ to node $j$ if and only if $x_{ij}^k = 1$.
$V_D^k$ is the set of all nodes of the data dependence graph belonging to instructions which can be executed by resource type $k$. Each edge $(i,j) \in E_F^k$ describes a possible flow of resources of type $k \in V_K$ from $i$ to $j$. The flow entering a node $j \in V_D$ is represented by the variable $\Phi_j^k$, and the flow leaving node $j$ is denoted by $\Psi_j^k$. The exact definitions are given below:
$$\Phi_j^k = \sum_{(i,j) \in E_F^k} x_{ij}^k, \qquad \Psi_j^k = \sum_{(j,i) \in E_F^k} x_{ji}^k$$
The goal of this ILP formulation is to transform a given set of machine instructions in order to minimize the number of clock cycles required for execution. The basic ILP formulation for the problem of instruction scheduling with respect to resource constraints can then be given as follows:
- objective function:
$$\min M_{steps}$$
- constraints:
  - time constraints: For no instruction may the start time exceed the maximal number of control steps $M_{steps}$, which is to be calculated.
$$t_j \le M_{steps} \quad \forall j \in V_D$$
  - precedence constraints: When instruction $j$ depends on instruction $i$, then $j$ may be executed only after the execution of $i$ is finished.
$$t_j - t_i \ge w_i \quad \forall (i,j) \in E_D^{output} \cup E_D^{true}$$
$$t_j - t_i \ge 0 \quad \forall (i,j) \in E_D^{anti}$$
  - flow conservation: The value of the flow entering a node must equal the flow leaving that node.
$$\Phi_j^k - \Psi_j^k = 0 \quad \forall j \in V_D,\ \forall k \in V_K : (j,k) \in E_R$$
  - assignment constraints: Each operation must be executed exactly once by one hardware component.
$$\sum_{k \in V_K : (j,k) \in E_R} \Phi_j^k = 1 \quad \forall j \in V_D$$
  - resource constraints: The number of available resources of all resource types must not be exceeded.
$$\sum_{(k_Q, j) \in E_F^k} x_{k_Q j}^k \le R_k \quad \forall k \in V_K$$
  - serial constraints: When operations $i$ and $j$ are both assigned to the same resource type $k$, then $j$ must await the execution of $i$ when a component of resource type $k$ is actually moved along the edge $(i,j) \in E_F^k$, i.e. if $x_{ij}^k = 1$.
$$t_j - t_i \ge z_i + \Big( \sum_{k \in V_K : (i,j) \in E_F^k} x_{ij}^k - 1 \Big) \cdot \alpha_{ij} \quad \forall (i,j) \in E_F^k$$
The better the feasible region of the relaxation $P_F$ approximates the feasible region of the integral problem $P_I$, the more efficiently the integer linear program can be solved. In [Zha] it is shown that the tightest polyhedron is described by using the value
$$\alpha_{ij} = z_i - asap(j) + alap(i).$$
 Integration of Register Allocation
Up to now the presented ILPformulation covers only the problem of instruction scheduling To
take into account the problem of register assignment this formulation has to be modied Again
following the concept of ow graphs the register assignment problem is formulated as register
distribution problem
Denition  	register ow graph
 The register ow graph G
g
F
 V
g
F
 E
g
F
 is a directed
graph with a set of nodes V
g
F
 V
g
 G and a set of directed arcs E
g
F
 The set G contains a
resource node g representing the available register set G  fgg A node j  V
g
represents an op
eration performing a write access to a register this way creating a variable with lifetime 
j
 Each
arc i j  E
g
F
provides a possible ow of a register from i to j and is assigned a ow variable
x
g
ij
 f g Then the same register is used to save the variables created by nodes i and j if
x
g
ij
 
Lifetimes of variables are reected by true dependences When an instruction i writes to a register
then the life span of the value created by i has to reach all uses of that value To model this
variables b
ij
  are introduced measuring the distance between a dening instruction i and
a corresponding use j The formulation of the precedence relation is replaced by the following
equation
t
j
 t
i
 b
ij
 w
i

Then for the lifetime of the register dened by instruction i must hold

i
 b
ij
 w
i
 i j  E
true
D


An instruction $j$ may only write to the same register as a preceding instruction $i$ if $j$ is executed at a time when the life span $\tau_i$ of $i$ is already finished. In other words: if the variable produced by instruction $i$ has lifetime $\tau_i$ and the output of instruction $j$ is to be written into the same register, i.e. when $x_{ij}^g = 1$ holds, then $t_j - t_i \ge \tau_i$ must hold. This fact is captured by the following register serial constraint:
$$t_j - t_i - w_i + w_j \ge \tau_i + (x_{ij}^g - 1) \cdot 2T$$
Here $T$ represents the number of machine operations of the input program, which surely provides an upper bound for the maximal possible lifetime.
In order to correctly model the register flow graph, flow conservation constraints as well as resource constraints and assignment constraints have to be added to the integer linear program. This leads to the following equalities and inequalities:
$$\Phi_g^g \le R_g$$
$$\Phi_j^g = 1 \quad \forall j \in V^g$$
$$\Phi_j^g - \Psi_j^g = 0 \quad \forall j \in V_F^g$$
$$t_j - t_i - w_i + w_j \ge \tau_i + (x_{ij}^g - 1) \cdot 2T \quad \forall (i,j) \in E_F^g$$
Moreover the ILPformuation can be tightened by an identication of redundant serial constraints
and the insertion of valid inequalities for further information see Zha	
 Kas	
Following CWM	 we will measure the complexity of an ILPformulations in terms of the number
of constraints and binary variables The number of constraints is On

 where n is the number
of operations in the input program The number of binary variables can be bound by On


however its only the ow variables used in the serial constraints that have to be specied as
integers Zha	
 Kas	
 OASIC
In this section a dierent modelling approach called OASIC Optimal Architectural Synthesis
with Interface Constraints GE	 GE	 is presented Again results of polyhedral theory are
used to formulate constraints which reduce the size of the feasible region thus increasing the
solution eciency
First we will give an overview of the used terminology which diers in some points from the
SILPterminology
 The main decision variables are called x
k
jn
 where x
k
jn
  means that microoperation j is
scheduled in instruction n   and is executed by an instance of resource type k Again in
the case of sequential control ow the relative position of an instruction can be considered
as synonymous to the clock cycle of the execution of this instruction see section 

 t
j
describes the relative position of a microoperation j within the instructions of the opti
mized code sequence This variable is introduced just for the sake of clarity and is not used
in the linear programs instead the following equation is used
t
j

X
kjkE
R
X
nNj
n  x
k
jn
 Nj  fasapj asapj       alapjg is the set of possible control steps in which an
execution of j can take place asapj describes the earliest possible execution time alapj
the latest possible one
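The relation between the binary decision variables and the start step $t_j$ can be made concrete with a minimal sketch. The dictionary layout `(j, n, k) -> 0/1` and the function names are assumptions chosen for illustration; they are not part of the OASIC formulation itself.

```python
def window(asap_j, alap_j):
    """N(j): all control steps from asap(j) to alap(j), inclusive."""
    return list(range(asap_j, alap_j + 1))


def start_step(x, j):
    """t_j = sum over k and n of n * x[j, n, k]."""
    return sum(n * value for (op, n, k), value in x.items() if op == j)


# Hypothetical assignment: "add" runs in step 2 on resource S,
# "mul" runs in step 1 on resource MU.
x = {
    ("add", 1, "S"): 0,
    ("add", 2, "S"): 1,
    ("add", 3, "S"): 0,
    ("mul", 1, "MU"): 1,
}
```

Because the assignment constraints allow exactly one variable per operation to be 1, the weighted sum picks out the single control step in which the operation starts.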

As already mentioned above, for every bounded system of linear inequalities there exists an integral polytope that contains the same integer points. When this integral polytope is used as the feasible region of an optimization problem, an integer optimal solution can be calculated by using linear programming algorithms [NW], [PSa]. The constraints necessary to define the integer vertices of the polytope are called integral facets.
The goal of the OASIC approach is to formulate the integer linear program in a way that permits its transformation to a node packing graph which has been partially characterized by its facets. The resulting polytope is in general not identical to the integral polytope, but by taking into account the additional facets, a better approximation of the integral polytope is gained. This is covered in detail in [GE, GE]; an overview is given in [Kas].
In the following, the ILP formulation given in [GE, GE] is presented in a slightly modified form:
- objective function:
$$\min M_{steps}$$
- constraints:
  - time constraints: No instruction may exceed the maximum number of control steps $M_{steps}$, which is to be calculated.
$$t_j \le M_{steps} \quad \forall j \in V_D$$
  - precedence constraints: When instruction $j$ depends on instruction $i$, then $j$ has to be executed after completion of $i$:
X
kV
K

jkE
R
X
n
j
n
n
j
Nj
x
k
jn
j

X
kV
K

jkE
R
X
n
i
nQ
k
i

n
i
Ni
x
k
in
i
   i j  E
true
D
 E
output
D

n  fnQ
k
i
  j n  Nig Nj
X
kV
K

jkE
R
X
n
j
n
n
j
Nj
x
k
jn
j

X
kV
K

jkE
R
X
n
i
nQ
k
i

n
i
Ni
x
k
in
i
   i j  E
anti
D

n  fnQ
k
i
  j n  Nig Nj 
  - assignment constraints: The execution of an operation must start in exactly one control step and is performed by exactly one resource type.
$$\sum_{k \in V_K : (j,k) \in E_R} \ \sum_{n \in N(j)} x_{jn}^k = 1 \quad \forall j \in V_D$$
  - resource constraints: The number of available instances of a resource type must not be exceeded, so that in no control step more than $R_k$ operations may be executed by resource type $k$.
$$\sum_{j \in V_D : (j,k) \in E_R} \ \sum_{n \in N'(j,k,n')} x_{jn}^k \le R_k \quad \forall k \in V_K,\ \forall n' \le M_{steps}$$
    with $N'(j,k,n') = \{ n \in N(j) \mid n = n' - p,\ 0 \le p \le Q_j^k - 1 \}$.
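The resource constraint can be read as counting, for each control step $n'$, the operations that still occupy an instance of resource type $k$. The following sketch makes that reading explicit. It assumes that an operation started in step $n$ with duration $Q_j^k$ occupies its resource in steps $n, \ldots, n + Q_j^k - 1$; the data layout and function names are hypothetical.

```python
def busy_starts(window, n_prime, q):
    """N'(j, k, n'): start steps n in N(j) with n' - q + 1 <= n <= n'.

    An operation started in such a step is still executing in step n'.
    """
    return [n for n in window if n_prime - q + 1 <= n <= n_prime]


def resource_usage(x, windows, q, k, n_prime):
    """Left-hand side of the resource constraint for type k and step n'."""
    total = 0
    for (j, n, res), value in x.items():
        if res == k and n in busy_starts(windows[j], n_prime, q[(j, k)]):
            total += value
    return total


# Hypothetical data: "a" takes two steps on MU, "b" takes one step.
windows = {"a": [1, 2, 3], "b": [1, 2, 3]}
q = {("a", "MU"): 2, ("b", "MU"): 1}
x = {("a", 1, "MU"): 1, ("b", 2, "MU"): 1}
```

In this example both operations occupy the multiplier in step 2, so the schedule is only admissible if at least two instances of that resource type exist.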
In the ideal case, the resulting polytope is integral; then the relaxation provides an optimal integer solution. However, in most cases some variables will have nonintegral values. Then these variables are specified as binary and the resulting MILP is solved. This is repeated until all variables of the solution have integral values.
 Integration of Register Allocation
In order to take into account the problem of register assignment with respect to a homogeneous register set, the formulation presented above just has to be extended by some additional constraints. It must be assured that in no control step more than $R$ registers are used, so that there are at most $R$ overlapping lifetimes.
A variable $i$ is defined at a program point when it is assigned a value; a variable is used when it is referenced at a program point. For a given instruction sequence, the lifetime of a variable can be represented by a lifetime-defining edge $i \to j$ between the operation $i$ that produced the variable and the operation $j$ that last used the variable. However, each variable can be used within more than one operation. In consequence, lifetime-defining edges are possibly not unique, since the order of the operations is not fixed when simultaneous instruction scheduling and register allocation is performed. Thus, in a naive approach, a lifetime-defining edge is inserted between a definition and each use. By means of transitivity analysis and of asap/alap analysis, the number of edges can be reduced. For more information see [GE, GE], [Kas].
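For a fixed schedule, the requirement of at most $R$ overlapping lifetimes can be checked directly. The sketch below is only an illustration for an already known instruction order; it assumes that a lifetime edge with definition step `d` and last-use step `u` crosses control step `n` when `d < n <= u`. The ILP constraints have to express the same bound without knowing the order in advance.

```python
def live_values(lifetimes, n):
    """Number of lifetime edges (d, u) crossing control step n (d < n <= u)."""
    return sum(1 for d, u in lifetimes if d < n <= u)


def fits_register_file(lifetimes, num_registers, last_step):
    """True if no control step needs more than num_registers registers."""
    return all(
        live_values(lifetimes, n) <= num_registers
        for n in range(1, last_step + 1)
    )


# Hypothetical lifetimes as (definition step, last use step).
lifetimes = [(1, 3), (2, 4), (3, 5)]
```

For these lifetimes two values are live in steps 3 and 4, so two registers suffice while one does not.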
In the constraints generated to take into account the problem of register allocation the following
terminology is used An edge i  j crosses control step n if and only if Nif      n w
i

g   and Nj  fn n     Tg   The value e
n
i indicates the number of edges with
head i crossing control step n the set Mn represents the set of all maximal sets of edges M

n
which cross control step n and have unique heads
X
j
a
j
b
M

n

X
kV
K
X
n

n
n

Nj
a

x
k
j
a
n


X
kV
K
X
n

n
n

Nj
b

x
k
j
b
n


X
kV
K
X
n

n
n

Nj
b

x
k
j
b
n


X
kV
K
X
n

n
n

Nj
a

x
k
j
a
n

   R 
 n 
  M

n Mn
When $e$ edges are crossing control step $n$, and among these $e_n(i)$ have head $i$ while the rest of the edges have unique heads, the inequality above has to be generated exactly $e_n(i)$ times for control step $n$. In the general case, the number of constraints to be generated for control step $n$ is given by $\prod_i e_n(i)$. The register allocation constraint calculates two times the number of crossing edges for each control step. The relevant variables are partitioned into four groups, and depending on their group they are used in the inequality either with positive or with negative sign.
As for the complexity of this ILPformulation the number of constraints is bound by On


when no register allocation is considered The number of variables is On

 Considering the

problem of integrated instruction scheduling and register allocation, the worst-case number of binary variables doesn't change; however, the number of constraints can grow exponentially due to the register crossing constraints; see [GE, GE], [Kas].
 Global Analysis
In the context of global analysis, the consideration of data dependences is not sufficient to maintain the program's semantics. It might be possible for some instructions to be moved from one basic block to another which is executed under different control conditions, without violating any data dependences. A common approach to this problem is the prevention of code motion across basic block boundaries, e.g. by inserting dummy nodes at the beginning and end of each basic block [Ell], [GE]. In this section, an entirely different approach is presented. Let a sequence of machine operations be given which is to be optimized. For each instruction, the semantics of the input program define the valid control conditions; provided there are no data dependences, it can be executed in every basic block that is subject to exactly the same control conditions. Thus, for each instruction, the set of basic blocks into which it can be inserted is determined, and the resulting disjunction is integrated into the integer linear program.
Let a basic block $B_k$ be given. This basic block is assigned a start time $t_k^A$ and an end time $t_k^E$. An instruction $j$ is contained in basic block $B_k$ if and only if $t_k^A \le t_j \le t_k^E$. Thus the set $X_k$ of all instructions that can belong to basic block $B_k$ can be defined as follows:
$$X_k = \{ j \in V_D \mid t_k^A - t_j \le 0 \ \wedge\ t_j - t_k^E \le 0 \}$$
Now we want to represent the fact that an instruction $j$ may be scheduled in exactly one of $l$ possible basic blocks $B_1, \ldots, B_l$ by a system of inequalities including binary variables $y_j^1, \ldots, y_j^l$. First, an upper bound $T$ is needed so that $t_j \le T \ \forall j \in V_D$. When a constraint of the form $t_j - t \le 0$ is replaced by $t_j - t \le T$, it becomes redundant and doesn't restrict the feasible region any more. So we define $y_j^k = 0$ if and only if $j \in B_k$, and $y_j^k = 1$ if and only if $j \notin B_k$. The value of the upper bound is chosen to be $T = I$, where $I = |V_D|$ indicates the number of operations of the input program.
The following constraints guarantee that an instruction $j$ is contained in exactly one of the sets $X_1, \ldots, X_l$:
$$t_k^A - t_j \le T y_j^k \quad \forall k \in \{1, \ldots, l\}$$
$$t_j - t_k^E \le T y_j^k \quad \forall k \in \{1, \ldots, l\}$$
$$\sum_{k=1}^{l} y_j^k = l - 1$$
$$y_j^k \in \{0,1\} \quad \forall k \in \{1, \ldots, l\}$$
Example. When an instruction $j$ can be assigned to three basic blocks $B_1$, $B_2$ and $B_3$, the following constraints are generated:
$$t_1^A - t_j \le T y_j^1, \qquad t_j - t_1^E \le T y_j^1$$
$$t_2^A - t_j \le T y_j^2, \qquad t_j - t_2^E \le T y_j^2$$
$$t_3^A - t_j \le T y_j^3, \qquad t_j - t_3^E \le T y_j^3$$
$$y_j^1 + y_j^2 + y_j^3 = 2$$
When a feasible solution provides $y_j^k = 0$ for one $k \in \{1,2,3\}$, then instruction $j$ must be scheduled in basic block $B_k$, as can easily be verified.
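The effect of the block selection constraints can be checked by enumeration on a small instance. The sketch below assumes that the block boundaries are already fixed numbers and that membership is meant inclusively ($t_k^A \le t_j \le t_k^E$); in the ILP the boundaries are variables, and the function names are illustrative only.

```python
from itertools import product


def feasible(t_j, blocks, y, T):
    """Check the big-M block constraints for one choice of the y variables.

    blocks -- list of (t_A, t_E) pairs, one per candidate basic block
    y      -- tuple of 0/1 values, y[k] == 0 meaning "j is placed in block k"
    """
    if sum(y) != len(blocks) - 1:       # exactly one y variable is 0
        return False
    return all(
        t_a - t_j <= T * y_k and t_j - t_e <= T * y_k
        for (t_a, t_e), y_k in zip(blocks, y)
    )


def admissible_blocks(t_j, blocks, T):
    """Indices of the blocks that instruction j may be assigned to at t_j."""
    result = set()
    for y in product((0, 1), repeat=len(blocks)):
        if feasible(t_j, blocks, y, T):
            result.add(y.index(0))
    return result


# Three hypothetical blocks covering control steps 1-3, 4-6 and 7-9.
blocks = [(1, 3), (4, 6), (7, 9)]
T = 9
```

For a start time inside the second block only the choice with the second variable equal to 0 survives, because the big-M term relaxes the constraints of the other two blocks.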

The set of basic blocks which an instruction can be assigned to can be calculated using the control
dependence graph Bas	 When instructions of dierent basic blocks have the same predecessor
in the control dependence graph and the types of these edges to these predecessors are identical
all these instructions can be assigned to each of these basic blocksprovided this is not prevented
by data dependences
In order for the assignment of instructions to basic blocks to be well dened neither branches nor
loopinstructions may be removed from their actual basic blocks since these instructions dene
the basic block structure When a basic block B
k
is introduced by a loopinstruction i

and is
nished by a branch i

 the assignment of an operation j to B
k
means that j is scheduled between
i

and i

 Any reordering of basic blocks is excluded ie the order of the basicblocks in the
output program must be the same as in the input program This way just the ordering of control
instructions is xed any other instructions can be moved between the basic blocks which underly
the same control conditions The order of the basic blocks is xed in the ILP by creating the
following inequalities for each pair of subsequent basic blocks
t
A
k
 t
E
k
  

t
A
k
 t
E
k
 
t
E
k
 t
A
k
Thus not all reorderings of instructions are allowed However the quality of the generated code
depends of the degree of parallelism this is not severely aected by the order of basic blocks So
in most cases this restriction will be of no importance
These constraints are sucient when only the problem of instruction scheduling is considered
When integrated instruction scheduling and register allocation is performed additional constraints
are necessary for a correct representation of lifetimes in the presence of branches and loop
instructions Kas	 Unfortunately these constraints require additional binary variables so that
the complexity of the generated ILPs is increased Therefore in our implemenation only the op
timal register set assignment is calculated After solving the ILPs a conventional register allocator
has to be invoked This allocator uses the previously calculated register set assignment to get a
feasible register assignment In the scope of this paper we cannot give more details however
they can be found in Kas	
	
 Extensions Required by the Target Architecture
 Extensions to SILP
As can be seen in the previous sections, the assignment of instructions to hardware resources plays a significant role in the ILP modelling. In order to correctly model the processing of the ADSPx, we need the following resource nodes:
- DM models accesses to data memory.
- PM models accesses to program memory.
- S (standard) models ALU and shifter operations as well as several miscellaneous instructions. These instructions can never be executed in parallel. Since the motivation for introducing resource nodes is the capability of parallel execution, assigning these instructions to different nodes would not be useful.
- MU models multiplier instructions.
- C models control flow instructions.
Prevention of Incorrect Parallelism
Parallel execution of instructions assigned to the same resource type is excluded by the serial constraints. Instructions assigned to different resource nodes can always be executed in parallel. However, the considered architecture implements only limited parallelism: only certain ALU and multiplier accesses can be executed in parallel, and parallel accesses to memory are not possible for all load/store operations. Moreover, the degree of potential parallelism also depends on the register assignment (see the earlier section and figure). Therefore, additional constraints are required which explicitly prohibit the parallel execution of a certain pair of operations.
For two operations $i$ and $j$ which must not be executed in parallel, i.e. for which $t_i \neq t_j$ must hold, constraints are formulated which represent the disjunction $t_i < t_j \ \vee\ t_i > t_j$. The following inequalities are required:
$$t_i < t_j + v_{ij} T$$
$$t_i > t_j - (1 - v_{ij}) T$$
$$v_{ij} \in \{0,1\}, \qquad T = I$$
A correctness proof is provided in [Kas].
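How a single binary variable can encode the disjunction $t_i < t_j \vee t_i > t_j$ can be checked by enumeration. The inequalities in the sketch are one common big-M encoding and are an assumption about the exact form; the function names and the choice $T = I$ with start times in $1, \ldots, I$ are likewise illustrative.

```python
from itertools import product


def allowed(t_i, t_j, v, T):
    """Big-M encoding: v == 0 forces t_i < t_j, v == 1 forces t_i > t_j."""
    return t_i < t_j + v * T and t_i > t_j - (1 - v) * T


def encodes_disjunction(num_ops):
    """True if some v is feasible exactly when t_i != t_j."""
    T = num_ops
    for t_i, t_j in product(range(1, num_ops + 1), repeat=2):
        has_v = any(allowed(t_i, t_j, v, T) for v in (0, 1))
        if has_v != (t_i != t_j):
            return False
    return True
```

For each value of `v` one inequality is switched off by the big-M term while the other enforces a strict order, so equal start times leave no feasible `v`.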
Irregular Register Sets
The operands of multifunction instructions using ALU and multiplier are restricted to a set of four registers within the register file (as described earlier). Thus there are four different register groups to be considered rather than one homogeneous register set. For each such group, a separate register node is inserted into the register flow graph, so that the set $G$ of available register sets has to be extended to $G = \{g_1, g_2, g_3, g_4\}$. The definition of the register flow graph has to be modified, and the register flow constraints given for the homogeneous register set are replaced by the following constraints:
$$\Phi_g^g \le R_g \quad \forall g \in G$$
$$\sum_{g \in G} \Phi_j^g = 1 \quad \forall j \in V^g$$
$$\Phi_j^g - \Psi_j^g = 0 \quad \forall j \in V_F^g,\ \forall g \in G$$
t
j
 t
i
 w
i
 w
j
 
i
 
X
gG
x
g
ij
   
ij
 i j  E
g
F

The first inequality assures that at most four instances of each of the four register sets are used; the second constraint guarantees that each variable is assigned to exactly one register. Flow conservation is maintained by the third constraint, and the new formulation of the serial constraints is given by the fourth.
When instructions $i$ and $j$ are combined to form a multifunction instruction, so that for the reaching definition $m$ the target register set is restricted to exactly one $g \in G$, it must be guaranteed that $m$ in fact uses a register of register set $g$. Then a constraint of the form $\Phi_m^g = 1$ must hold. Since $\sum_g \Phi_j^g = 1$, this automatically excludes the use of other register sets. The formulation presented below uses two binary variables $p_{ij}$ and $q_{ij}$, which are defined by the following constraints:
$$t_i \le t_j + p_{ij} T$$
$$t_i \ge t_j - q_{ij} T$$
$$p_{ij} + q_{ij} \le 1$$
where $T = I$. Using these values, the register constraints can be formulated as follows:
$$\Phi_m^g \ge 1 + t_i - t_j - p_{ij} T$$
$$\Phi_m^g \ge 1 - t_i + t_j - q_{ij} T$$
The correctness proofs are omitted in this paper; they are given explicitly in [Kas].
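The intended behaviour of the two auxiliary binaries can be illustrated by enumeration: the register set restriction should be forced exactly when $i$ and $j$ share a control step. The inequalities used in the sketch are an assumed big-M form chosen to be consistent with the description above; `phi` stands for the flow value of the reaching definition $m$ into register set $g$, and all names are hypothetical.

```python
from itertools import product


def phi_options(t_i, t_j, T):
    """Values of phi in {0, 1} that are feasible for some choice of p, q."""
    options = set()
    for p, q, phi in product((0, 1), repeat=3):
        ok = (
            t_i <= t_j + p * T                   # p must be 1 if t_i > t_j
            and t_i >= t_j - q * T               # q must be 1 if t_i < t_j
            and p + q <= 1
            and phi >= 1 + t_i - t_j - p * T
            and phi >= 1 - t_i + t_j - q * T
        )
        if ok:
            options.add(phi)
    return options


def restriction_only_when_parallel(num_ops):
    """phi is forced to 1 exactly for t_i == t_j, otherwise left free."""
    T = num_ops
    for t_i, t_j in product(range(1, num_ops + 1), repeat=2):
        expected = {1} if t_i == t_j else {0, 1}
        if phi_options(t_i, t_j, T) != expected:
            return False
    return True
```

When the two operations are scheduled in different control steps, one of the big-M terms makes both lower bounds on `phi` non-positive, so the register set of $m$ remains unrestricted.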
 Extensions to OASIC
Prevention of Incorrect Parallelism
Let two instructions $i$ and $j$ be given which cannot be executed in parallel, so that $t_i \neq t_j$ must hold. When $i$ and $j$ can never be assigned to the same basic block, or when parallel execution is prevented by data dependences, no additional constraints are required. Otherwise, the following constraints have to be added to the formulation:
$$\sum_{k \in V_K : (i,k) \in E_R} x_{in}^k + \sum_{k \in V_K : (j,k) \in E_R} x_{jn}^k \le 1 \quad \forall n \in N(i) \cap N(j)$$
Irregular Register Sets
The register allocation constraints given above for OASIC have to be modified, since the register file is divided into subsets to be considered separately. So new variables have to be introduced. When instruction $i$ performs a write access to a register, the binary variable $x_{in}^k$ has to be replaced according to the following equation:
$$x_{in}^k = \sum_{m} x_{in}^{km}$$
When instruction $i$ is executed in control step $n$ by resource type $k$ and a write access is performed to a register of register set $m$, then $x_{in}^{km} = 1$ must hold. The following additional constraints are required:
- In any register, only one variable may be stored at a time; thus, for each of the four register groups $m$ and each control step $n$, the following bound by $R_m$ must hold:
$$\sum_{j : n \in N(j)} \ \sum_{k : (j,k) \in E_R} x_{jn}^{km} \le R_m \quad \forall m,\ \forall n$$
- For each register group, the number of crossing edges of each control step must not exceed the number of available registers:
X
j
a
j
b

X
kV
K
X
n

n
n

Nj
a

x
km
j
a
n


X
kV
K
X
n

n
n

Nj
b

x
km
j
b
n


X
kV
K
X
n

n
n

Nj
b

x
km
j
b
n


X
kV
K
X
n

n
n

Nj
a

x
km
j
a
n

   R
m
n m 
- When operation $l$ is restricted to register set $m_l$ because of the combination of operations $i$ and $j$ into a multifunction instruction, the following constraints have to be added:
$$t_i \le t_j + p_{ij} T$$
$$t_i \ge t_j - q_{ij} T$$
$$p_{ij} + q_{ij} \le 1$$
$$\sum_{n \in N(l)} x_{ln}^{k m_l} \ge 1 + t_i - t_j - p_{ij} T$$
$$\sum_{n \in N(l)} x_{ln}^{k m_l} \ge 1 - t_i + t_j - q_{ij} T$$
  where $T = |V_D|$ and $p_{ij}, q_{ij} \in \{0,1\}$. This formulation is analogous to the one presented for SILP.
Using this formulation the optimal assignment of register sets can be calculated Moreover the
structure of these constraints ascertains that enough registers are available to execute the resulting
code without inserting spillcode However no concrete register assignment is performed by the
ILP formulation The concrete assignment has to be left to a conventional graphbased register
assigner that takes into account the previously calculated register set assignment
The reason for not integrating register assignment completely in the ILPformulation is the com
plexity of the ILP to be generated All decision variables must have integral values Using a
branchandbound algorithm to solve the ILP 
n
nodes have to be generated in the worst case
where n is the number of variables specied as binary While for a homogeneous register set
Nj binary variables are introduced per instruction this number increases to Nj when the
four dierent register sets are taken into account If an own binary variable was dened for each
register 
Nj binary variables would be required Since O
Nj
  O

Nj
  O
Nj




the increase in complexity would be intolerably high

 Approximations
The computation time required to solve the generated ILPs is high. It is therefore an interesting question whether heuristics can be applied which cannot guarantee an optimal solution but can deal with larger input programs. In this paper we give an overview of the investigated approximation algorithms; they are treated in detail in [Kas].
 Approximation by Rounding
The basic idea of this approach is to solve only partially relaxed problems. Relaxed binary variables are fixed one by one to the value in $\{0,1\}$ which they would presumably take in an optimal solution. In the basic formulation, the SILP approach requires only the flow variables appearing in the serial constraints to be specified as binary; these form the set $M_S$. Since these variables are multiplied by a large constant, one can assume that a relaxed value close to 0 or 1 indicates that the optimal value of that variable is that same integer; see [Zha]. As for the other binary variables, introduced to handle global analyses, interdictions of parallelism, register set assignment etc., a relaxation of the integrality constraint would affect the structure of the ILP too much (see also [PSa]). Thus the presented rounding approach is applied exclusively to the variables $x \in M_S$.
First, the approximation algorithm replaces the integrality constraint $x \in \{0,1\}$ for all $x \in M_S$ by the inequality $0 \le x \le 1$ and solves the resulting mixed integer linear program. After that, a nonintegral variable $x \in M_S$ with the smallest distance to an integer value is rounded to that value by adding an appropriate equation to the ILP formulation. Then the mixed integer linear program is solved again and the rounding step is repeated. It is possible that the rounding leads to an infeasible ILP; then the most recently fixed variable is fixed to its complement. When the MILP is still unsolvable, an earlier decision was wrong. Then, in order to prevent the exponential cost of complete backtracking, integrality constraints are reintroduced. This is done by grouping the fixed binary variables by the distance they had to the nearest integral value before rounding and redeclaring them as binary, beginning with those with the largest distance. Clearly, in the worst case the original problem has to be solved again.
Since only the variables $x \in M_S$ are relaxed, the calculation of the relaxations can take a long time; moreover, due to backtracking and false rounding decisions, the computation time can be higher than for the original problem. The quality of the solution is worse than for the other approximations, so this approach cannot be considered promising.
 Stepwise Approximation
Again the variables x  M
S
are relaxed and the resulting MILP is solved Then the following
approach is repeated for all control steps beginning with the rst one The algorithm checks
whether any operations were scheduled to the actual control step in spite of a serial constraint
formulated between them Let M
c
S
be the set of all variables corresponding to ow edges between
such colliding operations with respect to the actual control step c All x  M
c
S
are declared
binary and the resulting MILP is solved This enforces a sequentialisation of all microoperations
which cannot be simultaneously executed in control step c and so cannot be combined to a valid
instruction Then for all x  M
c
S
which have solution value x   this equation is added to the
constraints of the MILP so that these values are xed The integrality constraints for the x M
c
S
with value x   are not needed any more and are removed Then the algorithm considers the
next control step

After considering each control step it is still possible for some variables x  M
S
to have non
integral values Then the set of all x M
S
with nonintegral value is determined iteratively these
variables are redeclared binary and the MILP is solved again This is repeated until all variables
have integral values
This way a feasible solution can always be obtained Since for each control step optimal solutions
with respect to arosen collisions is calculated it can be expected that the resulting xations also
lead to a good global solution This is conrmed by the test results
 Isolated Flow Analysis
In this approach only the ow variables x  M
S
corresponding to a certain resource type r  R
are declared as binary The ow variables related to other resources are relaxed ie
  x    x M
S
mit resx  r
x  f g  x M
S
mit resx  r
Then an optimal solution of this MILP is calculated and the x  M
S
executed by r are xed to
their actual solution value by additional equality constraints This approach is repeated for all
resource types so a feasible solution is obtained in the end
This way in each step an optimal solution with respect to each individual resource ow is calcu
lated Since the overall solution consists of individually optimal solutions of the dierent resource
types in most cases it will be equal to an optimal solution of the entire problem This optimality
however cannot be guaranteed as when analysing an individual resource ow the others are only
considered in their relaxed form However the computation time is reduced since only the binary
variables associated to one resource type are considered at a time
 Stepwise Approximation of Isolated Flow Analysis
The last approximation developed for the SILP formulation is a mixture of the two previously presented approaches. At each step, the flow variables of all resources except the currently considered resource type $r$ are relaxed; for the variables $x \in M_S$ with $res(x) = r$, the stepwise approximation is performed until all these variables are fixed to an integral value. Then the next resource type is considered. Clearly, this approximation is the fastest one, and in our experimental results the solutions provided by this approximation are as good as the results of the two previously presented approximations. In the following, we denote this approximation by SF.
 Rounding Approximation for OASIC
For the OASIC formulation, only one approximation has been found: the rounding approach. As described above for SILP, a nonintegral variable with the smallest distance to an integer is rounded to that value by adding an appropriate equation to the constraints. This is repeated until all variables have integral values. Again, false fixations are changed by a backtracking approach; when this is not sufficient to obtain a feasible solution, the variables have to be redeclared as integral. Unfortunately, this approximation suffers from high computation times and doesn't produce satisfactory results.

[Figure: Overview of the Implementation. Components shown: Input Program, Intermediate Representations, Graph-based Algorithms, MILP-Generation, MILP-Solver (CPLEX), Output Program.]
 Implementation and Experiments
The source language is assembler based on the ADSP-2106x's instruction set. A source program
can be generated by the gcc-based compiler g21k shipped with the ADSP-2106x, or can be written
by hand. The implementation comprises several phases:
1. The source program is analysed syntactically and semantically with flex and bison. These
   use a context-free grammar written for the ADSP-2106x's instruction set.
2. The required intermediate representations, e.g. control flow graph, data dependence graph,
   control dependence graph, the resource flow graph when using SILP, etc., are generated.
3. Integrated instruction scheduling and register allocation is performed by two alternative
   approaches:
   - Graph-based approaches: Each basic block is scheduled using the specified algorithm,
     and the compacted instruction sequence is written to the result file.
   - ILP-based approaches: First, the ILP of the formulation type specified by the user
     (SILP or OASIC) is generated. This ILP is solved using CPLEX, a callable library
     for optimizing linear and integer linear problems [CPL]. Finally, the optimization result
     is interpreted and the optimized instruction sequence is written to the result file.
This process can be visualized as depicted in the overview figure. The input programs are typical
applications of digital signal processing, e.g. a Fourier transformation, a digital filter, as well
as imaging algorithms. All these programs are strictly sequential, i.e. they consist of a sequence
of machine instructions, each containing only one microoperation. If only instruction scheduling is
performed, an optimal register assignment is provided in order to allow a high degree of achievable
parallelism.
Apart from the optimal and approximate schedules, lower bounds on the optimal value of
the objective function are calculated. These bounds can be obtained simply by relaxing the
integrality constraint of the variables $x \in M_S$ in the SILP formulation. As for OASIC, the
calculation is terminated after the first iteration, i.e. all decision variables $x^k_{in}$ are
relaxed and the current objective value is returned. The calculation time can be further reduced
when the v- and p/q-variables are also relaxed; however, this affects the quality of the bounds.
In the following some of our experimental results are described Table  shows the most important
characteristics of our example programs the number of instructions basic blocks loops and data

name description instructions basis blocks loops dependences
r nite impulse response lter    	
cascade innite impulse response lter    

dft discrete Fouriertransformation 
   

whetp function p from Whetstone
benchmark

   
histo histogramm    

conv convolution lter 	   	
Table  Characteristics of the source programs
name mode constr bin expl size KB
r isra  
 
 	
cascade is 
   
cascade isra 

 	  	
dft is    
dft isra   	
 		
whetp is   	 

histo is 
  
 	


conv is 	
  
 	
Table  Characteristics of the ILPs generated by the SILPbased formulation
name mode constr bin size KB
r isra  		 
	
cascade is 	
  
cascade isra 	  
	
dft is 	 
 

dft isra  	
 
whetp is 
 
 
histo is 	 	 
conv is 
  

Table  Characteristics of the ILPs generated by the OASICbased formulation


list scheduling critical
Programm 
a
lr
b
md
c
hl
d
path optimal
r      
cascade 	 	  	  
dft 
 
 
 
 
 
whetp      
histo      
conv      
a
rst t
b
longest remain
c
max depend
d
highest level rst
Table  Number of instructions in the result of the dierent scheduling algorithms
dependences The properties of the ILPs of the SILP resp OASICformulations are shown in
tables  and 
Column  mode! indicates if instruction scheduling was considered in an isolated way or integrated
with register allocation The number of generated constraints is given in column  constr! the
number of binary variables in column  bin! In the SILP approach some ow variables need not
be specied as binary since they always take integral values due to the structure of the ILP So
the numbers in brackets show how many variables are explicitly specied as binary As for OASIC
the ILPs are solved iteratively by specifying the nonintegral variables as binary and resolving
the resulting MILP Therefore only the overall number of variables is given in the table In the
last column the sizes of the generated ILPs are given in KBytes It is obvious that in most cases
the size of the OASICgenerated ILPs is higher than for the SILPformulation This is especially
true when considering integrated register allocation and instruction scheduling
The calculation time of the graphbased algorithms is signicantly lower than the time needed
to solve the ILPs all graphbased algorithms take less than one second to execute The exact
calculation times can be found in Lan	 in this paper they are omitted Results of list scheduling
with integrated register allocation are not presented here either We want to oppose the best
possible results of the conventional algorithms to the ILPapproaches So we assume that in the
input programs an optimal register assignment is given
As we can see in table  using the rst t and longest remain heuristics we get the same results
because of two reasons On the one hand the data ready set is implemented as a list instructions
that reside in data ready for the longest time are at the beginning of the list The list is checked for
a suitable instruction from the beginning On the other hand the programs examined areexcept
for whetphandwritten Therefore the rsttting operation equals the one with the longest
remainvalue This also explains why the rst t and longest remain heuristics yield the same or a
better result than max depend or highest level rst only considering the handwritten programs
A better result for whetp which was compiled via the gk compiler is gained from the max
depend heuristic Preferring such instructions that lie on longer paths in the data dependence
graph enables parallel execution which cannot be exploited by the other methods cascade and
whetp demonstrate the heuristic nature of max depend The deviation from the optimal result is
" resp " The programs contain a large amount of data dependences between consecutive
instructions This imposes great restrictions on compaction which results in a worse schedule An
optimal schedule was found in only 
 of  cases Table  shows the deviations from the optimal
result The highest level rst heuristic which is slightly more costly yields better schedules than
the others heuristics First t and longest remain still produce better results than the critical path
method
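
To make the list scheduling discussion concrete, the following is a minimal sketch of list
scheduling with the highest level first priority. It is not the implementation evaluated here:
unit latencies, the resource names, and the per-instruction capacities are illustrative
assumptions. The ready list is kept in arrival order, so ties fall back to the operation that has
been ready for the longest time, as described above for the data ready set.

```python
from collections import defaultdict


def list_schedule(ops, deps, unit_of, capacity):
    """Pack micro-operations into instructions, highest level first.

    ops      -- operation names in program order
    deps     -- iterable of (pred, succ) data dependences (must form a DAG)
    unit_of  -- hypothetical map: operation -> resource type it occupies
    capacity -- hypothetical map: resource type -> units per instruction (>= 1)
    """
    succs = defaultdict(list)
    preds_left = {op: 0 for op in ops}
    for pred, succ in deps:
        succs[pred].append(succ)
        preds_left[succ] += 1

    level_cache = {}

    def level(op):
        # Number of operations on the longest dependence path starting at op.
        if op not in level_cache:
            level_cache[op] = 1 + max((level(s) for s in succs[op]), default=0)
        return level_cache[op]

    ready = [op for op in ops if preds_left[op] == 0]
    schedule = []
    while ready:
        # Stable sort: operations of equal level keep their arrival order.
        ready.sort(key=lambda op: -level(op))
        used = defaultdict(int)
        word = []  # micro-operations packed into the current instruction
        for op in ready:
            unit = unit_of[op]
            if used[unit] < capacity[unit]:
                used[unit] += 1
                word.append(op)
        ready = [op for op in ready if op not in word]
        # Successors become ready only for the following instructions.
        for op in word:
            for succ in succs[op]:
                preds_left[succ] -= 1
                if preds_left[succ] == 0:
                    ready.append(succ)
        schedule.append(word)
    return schedule


# Hypothetical basic block: three loads, a multiply, an add and a store.
ops = ["l1", "l2", "l3", "m", "a", "s"]
deps = [("l1", "m"), ("l2", "m"), ("m", "a"), ("l3", "a"), ("a", "s")]
unit_of = {"l1": "mem", "l2": "mem", "l3": "mem", "m": "mul", "a": "alu", "s": "mem"}
capacity = {"mem": 1, "mul": 1, "alu": 1}
schedule = list_schedule(ops, deps, unit_of, capacity)
```

With these assumed resources, the six operations are compacted into five instructions, because the
third load can share an instruction with the multiply. Other heuristics differ only in the sort
key.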

                    list scheduling           critical
program   ff (a)  lr (b)  md (c)  hlf (d)     path
fir
cascade
dft
whetp
histo
conv
(a) first fit  (b) longest remain  (c) max depend  (d) highest level first
Table 5: Deviations of the results obtained from list scheduling with several heuristics from the
optimal solution, written in percentage of the optimum
name mode method instr CPUtime
r isra def  
	 sec
r isra app  	 sec
cascade is def  	 sec
cascade is app   sec
cascade isra def    h
cascade isra app  
 sec
dft is def   sec
dft is app   sec
dft isra def    h
dft isra app  	 min  sec
whetp is def  h  min
whetp is app  	
 sec
histo is def    h
histo is app   h  min
conv is def   h  min
conv is app  

 sec
Table 
 Runtime characteristics of the SILPbased ILPs

name mode method instr CPUtime
r isra def   min  sec
r isra app   sec
cascade is def   sec
cascade is app   sec
cascade isra def   min  sec
cascade isra app  
	 sec
dft is def   h  min
dft is app   h 	 min
dft isra def    h
dft isra app 
  h  min
whetp is def   min  sec
whetp is app   sec
histo is def    h
histo is app    h  min
conv is def   h  min
conv is app   min 
 sec
Table  Runtime characteristics of the OASICbased ILPs
The runtime characteristics for the solution process of the ILPs are described in tables 6 and 7.
Again, the column "mode" indicates whether register allocation is considered together with
instruction scheduling. The column "method" in table 6 shows whether an exact solution is computed
(default, def) or an approximation is used (app). We describe only the approximation SF, which
takes the least computation time. The quality of the optimized code equals that of the code
produced by SF ; only for whetp were more instructions needed. For details see again [Kas]. The
columns "instr" in tables 6 and 7 give the number of instructions of the optimized program, and the
CPU times needed to compute these results are shown in the last column. For the OASIC approach,
only the rounding approach was viable; in table 7 it is denoted by app.
As we can see in table 6, the use of approximations can significantly reduce the solution time of
the SILP-based approach. While the calculation of an optimal solution for the
program cascade with integrated instruction scheduling and register allocation had to be aborted
after more than twenty-four hours, the stepwise approximation of the isolated flow analysis
(SF) could find a solution in  sec. Moreover, this solution was even optimal. In fact, only for
the program whetp did the approximation give a suboptimal result; all other input programs were
solved optimally. At first sight, it seems surprising that for some programs the approximation
takes more time than the exact solution. However, the approximations require several mixed integer
linear programs to be generated and solved. When the original problem is small, the creation
and solution of the approximate MILPs consumes more time than is saved during the solution
processes. The solutions found by the rounding approach for OASIC (see table 7) are worse
than those found with the SILP approximations. Moreover, the calculation time can still grow
very high, so this approach is less satisfactory.
The results of the calculation of lower bounds are given in tables 8 and 9. We only present the
results for the problem of integrated instruction scheduling and register allocation. Method B1
calculates the lower bounds by relaxing only the flow variables $x \in M_S$ of the SILP-based
approach and, for the OASIC-based approach, the variables $x^k_{ij}$. In method B2, the v- and
p/q-variables are relaxed too. Using the OASIC-based approach, no lower bounds could be calculated
for the three largest programs, since the size of the ILPs grew too much. We can see that for the
SILP approach the deviation from the optimal solution is  % for method B1 and  %
for method B2. However, with B2 a lower bound could be found within seconds for all input
programs. For the OASIC approach, the quality of methods B1 and B2 did not differ (the deviation
is  %); however, the calculation times were higher than with the SILP-based approach.

program  method  lower bound  CPU-time  deviation [%]
fir      B1                   sec
fir      B2                   sec
cascade  B1                   sec
cascade  B2                   sec
dft      B1                   sec
dft      B2                   sec
whetp    B1                   sec
whetp    B2                   sec
histo    B1                   sec
histo    B2                   sec
conv     B1                   min sec
conv     B2                   sec
Table 8: Calculation of lower bounds in the SILP-based formulation

program  method  lower bound  CPU-time
fir      B1                   sec
fir      B2                   sec
cascade  B1                   sec
cascade  B2                   sec
dft      B1                   min sec
dft      B2                   min sec
Table 9: Calculation of lower bounds in the OASIC-based formulation
An overview of the solution quality of the different methods is given in figure 6. The results of
the SILP approximation were in fact optimal for the programs shown. The output programs of the
graph-based algorithms contain more instructions, i.e. they are not optimally compacted. This is
the case although an optimal register assignment was already given in the input programs, whereas
in the ILP formulation the register assignment is treated together with instruction scheduling.
Comparing the results obtained with the SILP and the OASIC formulation, it becomes obvious
that the SILP approach is better suited to the problem of integrated instruction scheduling and
register allocation within a compiler. OASIC is more efficient if an optimal solution of
instruction scheduling for larger input programs is calculated. When register allocation is taken
into account, the disk space consumed by the ILPs can explode; this is especially true for our
target architecture because of its irregular register set. For the program dft, the size of the
OASIC-based ILP with integrated instruction scheduling and register allocation was more than  MB.
However, it is mainly the calculation of approximations that favours the SILP formulation. The
OASIC approach is less suited to approximative calculations, while based on the SILP formulation
very good approximative solutions (in fact optimal solutions, with one exception) can be obtained.
This is also true for problems which do not allow an exact solution due to their size. In the
calculation of lower bounds, the SILP modelling outperforms OASIC too (see table 9). For all tested
programs, a good lower bound could be calculated within seconds. With the OASIC-based formulation,
in three cases lower bounds could only be calculated for the pure instruction scheduling problem.
When register allocation was considered too, the calculation was aborted since the size of the ILPs
grew too much.

[Figure 6: Comparison of Solution Quality for Different Methods. Chart of the number of
instructions (axis from 0 to 50) per program (histo, conv, fir, cascade, dft) for the methods:
input, list scheduling/hlf, optimal, lower bounds (SILP).]
For the examined graphbased algorithms the average distance from the optimal solution in our
testsuite is " Thus the conventional algorithms exceed the optimal instruction number by
at least " on average On calculating lower bounds in the ILPapproach an objective value is
obtained which is on average " below the optimal instruction number So the quality of the
lower bounds when using an ILPapproach is comparable to the quality of the solutions oered by
convential graphbased algorithms
9 Conclusions
We have shown that the problem of instruction scheduling for the underlying irregular target
architecture can be modeled completely and correctly as an integer linear program. This result
was obtained by extending two structured ILP formulations, SILP and OASIC. Since the ILPs
are created for a fixed set of microoperations, it is not possible for the considered approaches to
take into account the insertion or removal of instructions within a single ILP formulation. Live
range splitting and insertion of spill code cannot be considered, so that a complete integration of
instruction scheduling and register allocation is not possible. Two subtasks of register
allocation, register set assignment and concrete register assignment, can however be integrated.
For reasons of complexity, it is advisable to forgo a complete solution to the register assignment
problem whenever using the OASIC approach, and when using the SILP formulation for input
programs with non-linear program flow. Then only an optimal register set assignment is calculated.
The analysis of our experimental results has shown that for use in a compiler the SILP modelling is
superior to the OASIC approach. Based on the SILP formulation, several approximations can be
calculated, leading to good results in relatively low calculation times. The optimality of the
result is not guaranteed by such heuristics, yet better results can be obtained than with the
conventional graph-based algorithms examined in [Lan].
Another important application of the SILP approach consists in calculating lower bounds on the
optimal solution. With conventional graph-based algorithms alone, it is not possible to estimate
the quality of a solution. By solving partial relaxations of the ILP, lower bounds on the optimal
solution can be calculated. For the tested programs, the quality of these lower bounds corresponds
to the quality of solutions which are calculated by conventional graph-based algorithms. Thus it is
possible to give an interval which safely contains the optimal solution and to obtain an estimate
for the quality of an approximate solution. This holds even when the optimal solution cannot be
calculated for reasons of complexity.
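
The interval argument can be illustrated with a small calculation. The sketch below is not part of
the described implementation; the function name and the sample numbers are hypothetical. It assumes
that the objective is an instruction count, so a fractional lower bound from a relaxation may be
rounded up, and that any heuristic schedule gives an upper bound on the optimum.

```python
import math


def solution_interval(lower_bound, heuristic_instructions):
    """Return (lo, hi, max_excess) enclosing the optimal instruction count.

    lower_bound            -- objective value of a (partial) relaxation
    heuristic_instructions -- instruction count of any feasible schedule
    max_excess             -- upper limit on (heuristic - optimum) / optimum
    """
    # Instruction counts are integral, so the relaxation bound can be rounded up.
    # The small tolerance guards against floating-point noise in the solver output.
    lo = math.ceil(lower_bound - 1e-9)
    hi = heuristic_instructions
    if lo <= 0 or lo > hi:
        raise ValueError("expected 0 < lower bound <= heuristic result")
    # optimum >= lo implies (hi - optimum) / optimum <= (hi - lo) / lo.
    return lo, hi, (hi - lo) / lo


# Hypothetical values for illustration only, not measured data.
lo, hi, max_excess = solution_interval(17.3, 20)
```

If the two ends of the interval coincide, the heuristic schedule is proven optimal without solving
the exact ILP.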
The optimal schedule computed by the ILP methods comes at a high price. The space and
time complexity explodes with increasing program sizes and therefore inhibits the scheduling of
complete applications with the ILP approach. It seems more promising to compact suitable code
sequences, e.g. innermost loops. Graph-based methods are a real alternative to the ILP scheduling
techniques. At the cost of losing optimal results, an improved schedule can be found within a short
time. This makes the heuristics attractive for use within optimizing compilers.
At the moment, graph-based methods are the only way to schedule large programs within a
bearable amount of time. The quality of the schedule could be improved by integrating ILP methods
into heuristics that identify certain code fragments and schedule them optimally using ILP.
References
[Ana] Analog Devices. ADSP User's Manual.
[Ana a] Analog Devices. ADSP Family Assembler Tools and Simulator Manual.
[Ana b] Analog Devices. ADSP Family C Tools Manual.
[Ana c] Analog Devices. ADSP-2106x SHARC User's Manual.
[Ana] Analog Devices. ADSP-2106x SHARC DSP Microcomputer Family.
[Bas] S. Bashford. Code Generation Techniques for Irregular Architectures. Technical Report,
Universität Dortmund, November.
[Bru] W. Brüggemann. Ausgewählte Probleme der Produktionsplanung. Physica Verlag, Heidelberg.
[CPL] CPLEX Optimization. Using the CPLEX Callable Library.
[CWM] S. Chaudhuri, R. A. Walker, and J. E. Mitchell. Analyzing and Exploiting the Structure
of the Constraints in the ILP-Approach to the Scheduling Problem. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, December.
[DK] W. Dinkelbach and A. Kleine. Elemente einer betriebswirtschaftlichen Entscheidungslehre.
Springer.
[Ell] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press.
[Fis] J. A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE
Transactions on Computers, July.
[GE] C. H. Gebotys and M. I. Elmasry. Optimal VLSI Architectural Synthesis. Kluwer Academic.
[GE] C. H. Gebotys and M. I. Elmasry. Global Optimization Approach for Architectural
Synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, September.
[GS] Rajiv Gupta and Mary Lou Soffa. Region scheduling: An approach for detecting and
redistributing parallelism. IEEE Transactions on Software Engineering.
[Kas] Daniel Kästner. Instruktionsanordnung und Registerallokation auf der Basis ganzzahliger
linearer Programmierung für den digitalen Signalprozessor ADSP-2106x. Master's thesis,
Universität des Saarlandes.
[Lan] Marc Langenbach. Instruktionsanordnung unter Verwendung graphbasierter Algorithmen
für den digitalen Signalprozessor ADSP-2106x. Master's thesis, Universität des Saarlandes.
[LDSM] David Landskov, Scott Davidson, Bruce Shriver, and Patrick W. Mallet. Local
Microcode Compaction Techniques. ACM Computing Surveys.
[Nic] Alexandru Nicolau. Uniform parallelism exploitation in ordinary programs. In International
Conference on Parallel Processing. IEEE Computer Society Press, August.
[NKT] G. L. Nemhauser, A. H. G. Rinnooy Kan, and M. J. Todd, editors. Handbooks in Operations
Research and Management Science. North-Holland, Amsterdam, New York, Oxford.
[NW] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. John Wiley
and Sons, New York.
[PSa] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and
Complexity. Prentice-Hall, Englewood Cliffs.
[PSb] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and
Complexity, chapter , pages . Prentice-Hall, Englewood Cliffs.
[SCL] M. A. R. Saghir, P. Chow, and C. G. Lee. Exploiting Dual Data-Memory Banks in Digital
Signal Processors. http://www.eecg.toronto.edu/saghir/papers/asplos.ps
[WM] R. Wilhelm and D. Maurer. Übersetzerbau: Theorie, Konstruktion, Generierung. Zweite,
überarbeitete und erweiterte Auflage. Springer, Berlin, Heidelberg, New York.
[Zha] L. Zhang. SILP: Scheduling and Allocating with Integer Linear Programming. PhD
thesis, Technische Fakultät der Universität des Saarlandes.


