A Compiler Algorithm for Managing Asynchronous Memory Read Completion

Daniel Maskit
Scalable Concurrent Programming Laboratory
California Institute of Technology
January    
Abstract
Computers with conventional memory systems have a predictable latency between
initiation and completion of a memory read. On such machines it is relatively
easy for either the compiler or the processor to guarantee that a load has completed
before further references to the loaded register are made. In a machine with a
logically shared but physically distributed memory, these latencies are not statically
predictable. Some existing systems, such as the Cray T3D, deal with this problem
by using a hardware mechanism to enforce synchronization on a register which is
the target of a remote memory access. The M-Machine, currently being designed
by the Concurrent VLSI Architecture Group at MIT, performs remote memory
accesses asynchronously and allows program execution to continue while the access is
outstanding, but does not enforce synchronization in hardware. This architectural
simplification, and the resulting relaxation of memory completion semantics, poses a
challenge to the compiler: how can this simpler memory system be efficiently
supported while maintaining program correctness? In particular, what is required to
guarantee that there are no conflicts between completion of a memory operation by
placing a value into a register and other uses of the register being written? This
paper describes a general solution to this problem, develops an algorithm to implement
it, and shows that the algorithm is correct.
 
The research described in this report is sponsored primarily by the Advanced Research Projects
Agency under contract number DABTC. The information contained herein does not
necessarily reflect the position or policy of the government of the United States, and no
official endorsement should be inferred.
 Overview
Computers with conventional memory systems have a predictable latency between
initiation and completion of a memory read. On such machines it is relatively easy for either
the compiler or the processor to guarantee that a load has completed before further
references to the loaded register are made. In a machine with a logically shared but physically
distributed memory, these latencies are not statically predictable. The standard method
for dealing with this issue in hardware is to have operations check the synchronization
state of their destination register, and block until the register is in a stable state. This
technique is used on machines such as the Cray T3D.
The M-Machine, currently being designed by the Concurrent VLSI Architecture Group
at MIT, will perform remote memory accesses asynchronously, but does not check the
synchronization state of registers prior to overwriting them. This allows processing to
continue while awaiting completion of a remote memory access, but does not require the
hardware complexity needed to read the synchronization state prior to writing over the
register. This architectural decision is necessitated by the use of multiple loosely-coupled
processor clusters on a single chip: the necessary hardware interlocks to manage
memory synchronization could not be implemented without significant complications. This
architectural simplification, and the resultant relaxation of memory completion semantics,
poses a challenge to the compiler: how can this memory system be efficiently supported
while maintaining program correctness? In particular, what is required to guarantee that
there are no conflicts between completion of a memory operation and other uses of the
destination for that operation?
This paper describes a general solution to this problem, develops an algorithm to
implement it, and shows that the algorithm is correct.
 Problem description
This section describes the problem of write-after-write hazards. As part of this description,
some terminology is introduced. This terminology is used for formulating the solution to
the problem.
The following definitions will facilitate discussion of this issue. The set of possible
machine operations can be divided into three categories that differ in their handling of
the synchronization state of registers. Within this paper, READ operations are defined to be
the only operations that perform synchronization.
  • LOAD places a value into a destination register. The completion of a LOAD can occur
    significantly after its initiation. When a LOAD is initiated, it sets a synchronization
    flag associated with the destination register. This flag is cleared when the LOAD
    completes.
  • READ takes a value from a source register and uses it to perform some operation, such
    as add, subtract, compare, etc. READS will not begin until the synchronization flag
    on the source register is cleared.
  • WRITE places a value into a destination register. WRITES will complete even if the
    synchronization flag on the destination register was set when they started. After a
    WRITE has completed, the synchronization flag on the destination register is always
    cleared.
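The three operation classes can be modelled in a few lines of Python. This is only an illustrative sketch and is not part of the original report: the `Register` class, the `pending` queue that stands in for memory accesses still in flight, and the `Stall` exception (used in place of a hardware stall) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


class Stall(Exception):
    """Stands in for a READ blocking on a set synchronization flag."""


@dataclass
class Register:
    value: object = None
    sync: bool = False                           # synchronization flag
    pending: list = field(default_factory=list)  # LOADs still in flight


def load(reg, memory_value):
    """Initiate a LOAD: set the flag; the value arrives later."""
    reg.sync = True
    reg.pending.append(memory_value)


def complete_load(reg):
    """The memory system delivers the oldest outstanding LOAD."""
    reg.value = reg.pending.pop(0)
    reg.sync = False


def read(reg):
    """A READ synchronizes: it cannot begin while the flag is set."""
    if reg.sync:
        raise Stall("source register has its synchronization flag set")
    return reg.value


def write(reg, value):
    """A WRITE completes regardless of the flag and always clears it."""
    reg.value = value
    reg.sync = False


r1 = Register()
load(r1, "*p")      # flag set, value not yet delivered
write(r1, 1)        # WRITE does not wait; flag is now clear
complete_load(r1)   # the late LOAD completion overwrites the 1
result = read(r1)   # no stall, because the flag is clear
```

In this model the final READ does not stall, yet it observes the late LOAD value rather than the value that was written. That is the kind of hazard the rest of the section describes.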
Problem 1: Delayed Loads
Figure 1 shows a code fragment which illustrates a problem with this system: if a WRITE
has as its destination a register which has its synchronization flag set, this register could
end up in an undesired state. The value *p is LOADED into r1. If q = 0, the program will
stall until the LOAD completes, and then READ r1 and add r2 to it. If q ≠ 0, r1 could be
WRITTEN before the LOAD completes. Figure 2 shows a worst-case timing sequence. In
this case, in between the literal 1 being WRITTEN to r1 and r1 being returned, the LOAD
completes, overwriting the value 1 in r1. The result is that even though q ≠ 0, the
return value is *p rather than the expected 1.
    a = *p;            LOAD p, r1         ; LOAD from location p into r1
    if (q == 0)        EQUAL q, #0, cc    ; Check q == 0
                       BF cc, L1          ; If false, branch to L1
        a = a + b;     ADD r1, r2, r1     ; Add r2 to r1
                       JMP L2
    else           L1:
        a = 1;         MOVE #1, r1        ; WRITE r1
                   L2:
    return a;          RETURN r1
Figure 1: Delayed Load
Problem 2: Multiple Loads
Figure 3 shows a second problem: if a LOAD to a register is followed by a second LOAD
to the same register, the contents of the register will be unknown. Here the value *p
is LOADED into r1. If q = 0, the value *b could be LOADED into r1 before the load of *p
completes. In this case, in between *b being LOADED into r1 and r1 being READ, the LOAD
of *p completes, overwriting the value in r1. The result is that even though q = 0, the
return value is r2 + *p rather than the expected r2 + *b.

    Step                 Effect
    LOAD p, r1           LOAD initiated: r1 = UNKNOWN, SYNC(r1) = TRUE
    if (q == 0)          Take FALSE branch
    WRITE #1, r1         r1 = #1, SYNC(r1) = FALSE
    (LOAD completes)     r1 = *p, SYNC(r1) = FALSE
    RETURN r1
Figure 2: Worst-case Timing for Delayed Load
    a = *p;            LOAD p, r1         ; LOAD from location p into r1
    if (q == 0)        EQUAL q, #0, cc    ; Check q == 0
                       BF cc, L1          ; If false, branch to L1
        a = *b;        LOAD b, r1
                   L1:
    a = a + c;         ADD r1, r2, r1     ; READ r1, r2; WRITE r1
    return a;          RETURN r1
Figure 3: Multiple Loads

The condition that needs to be met can be stated as:

    Guarantee that no WRITE or LOAD has as its destination a register whose
    synchronization flag is set.

This can be achieved within a basic block using a single forward pass over the block,
inserting correction code when a problematic instruction is encountered. To achieve
correctness, READS can be inserted to force synchronization. However, complications arise
when dealing with transitions between basic blocks, as this requires transmitting state
information across block boundaries.
  
[Figure 4: Adjacent Block Scheduling. Blocks are marked as Scheduled, Unscheduled, or
Selected. Operations shown in the blocks: LOAD R1, LOAD R4; WRITE R4, READ R1, READ R3.]
In addition transitions between blocks must also be managed To provide for a more
general treatment the scheduling of basic blocks is assumed to have no 	xed ordering
Therefore any or all of these other basic blocks might already be scheduled This situation
is illustrated in Figure  This 	gure shows four basic blocks Two of these basic blocks
have already been scheduled two of the blocks are as yet unscheduled One of the
unscheduled blocks has been selected as the next block to be scheduled As can be seen
there are operations in both the preceding and succeeding scheduled block which might
require action within the current block For example if the 	rst instruction in the current
block is a WRITE R this is one of the situations which requires an inserted READ Similarly
if the last instruction is a LOAD R the state of the predecessor guarantees that this is a
safe operation to perform
The problem thus becomes


Ensure that register states from scheduled predecessors are respected Gen
erate a schedule for the current block inserting synchronizing READS where re
quired Examine scheduled successors insert any required READS to guarantee
proper transition from current block into scheduled blocks Ensure that the
proper information is made available to all predecessors and successors that
have not yet been scheduled
 Overview of solution
The solution to this problem can be broken into three tasks of definition: define how to
schedule the code for a basic block, define what information needs to be transmitted from
one block to another, and define how to handle information from a neighboring block
which is already scheduled. The key to resolving the problematic state transitions is to
insert code to perform a READ so as to guarantee synchronization.
There are three possible states for a given register. Each of these states can be
determined by being inherited from a predecessor, established in the current block, or
upwardly-exposed from a successor.
  • HOT: The synchronization bit for the register is set, or the upwardly-exposed
    reference to this register is a LOAD.
  • COLD: The synchronization bit for the register is cleared. There either is no
    upwardly-exposed reference, or the exposed reference is a WRITE.
  • GROUNDED: The exposed reference to this register is a READ.
In general, the final state of registers for a basic block is determined by the initial
state of the registers (inherited from any predecessors that have already been scheduled)
and the set of operations that is performed on each register within the block. Once this
state has been determined, there is one possible complication that can arise: if one or
more successors have been scheduled, a check must be made to ensure that there are no
conflicts between the final state for the current block and the initial state of the scheduled
successors. This is done by examining the upwardly-exposed state of each register in the
successors. This state is determined by the first use of each register within the successor.
While within a block, it is necessary to have available information about the block's
successors and predecessors. It is possible to structure a solution so that only immediately
adjoining blocks which have been scheduled are of interest. If an adjoining block has been
scheduled, the exposed state within the scheduled block of all registers is relevant.
There are two types of information that need to be communicated between adjoining
blocks. The first type of information is a list of registers which are HOT on exit from the
block. This information is called $\mathrm{HOT}_b$, and can flow from a predecessor into the
current block, or from the current block into a successor. The second type of information is
what type of operation is performed as the first access to each register within a given block.
This information is called $\mathrm{STATE}_b$, and can flow from the current block to a
predecessor, or from a successor to the current block.
If a predecessor has not been scheduled, there is no initial flow of information between
the predecessor and the current block. When scheduling is completed, $\mathrm{STATE}_b$ is made
available to the predecessor.
If a successor has not been scheduled, there is no initial flow of information between
the successor and the current block. When scheduling is completed, $\mathrm{HOT}_b$ is made
available to the successor.
If a predecessor has been scheduled, $\mathrm{HOT}_p$ flows from the predecessor to the current
block. The initial value for $\mathrm{HOT}_b$ for the current block is the union of $\mathrm{HOT}_p$
from all scheduled predecessors.
If a successor has been scheduled, $\mathrm{STATE}_s$ flows from the successor to the current block.
[Figure 5: State Transition Diagram. States: COLD, HOT, GROUNDED. Transition labels:
READ, WRITE, LOAD, INSERT READ.]
If a successor has already been scheduled, a determination needs to be made as to
whether the ending state from the current block is consistent with the first usage in the
next block. Leaving the current block, each register is identified as either HOT or COLD.
If a register is COLD on exit from the current block, then any initial operation using that
register in the next block is correct. If a register is HOT, however, the only initial operation
that is legal is a READ. If the upwardly-exposed reference to a register is a READ, the register
state is GROUNDED. If the STATE for a register in all scheduled successors is GROUNDED, then
it is acceptable for the register to be HOT on exit from the current block. Otherwise, it is
necessary to GROUND the register by performing a READ on the register prior to exiting the
current block. The full set of state transitions is shown in Figure 5.

 Algorithm
Recall that the problem being solved is:

    Ensure that register states from scheduled predecessors are respected.
    Generate a schedule for the current block, inserting synchronizing READS where
    required. Examine scheduled successors; insert any required READS to guarantee
    proper transition from the current block into scheduled blocks. Ensure that the
    proper information is made available to all predecessors and successors that
    have not yet been scheduled.

To facilitate formulation of an algorithm to solve this problem, registers are defined
as being in one of three states: HOT registers have their synchronization bit set; GROUNDED
registers had their synchronization bit cleared by a READ; COLD registers had their
synchronization bit cleared by a WRITE. The state GROUNDED is only relevant for managing
transitions from a block to a scheduled successor, and only needs to appear in the STATE
passed from a scheduled successor to the current block. Within a block it is sufficient to
know if a register is HOT or COLD.
[Figure 6: State Transitions for Single-Instruction Scheduling. States: HOT, COLD.
Transition labels: READ, WRITE, LOAD, INSERT READ.]
At any point in the scheduling of a block there will be some set of instructions which
have already been scheduled, a set $\mathrm{HOT}_b^{\mathrm{current}}$ of registers with their
synchronization bit set, and a next instruction to be scheduled. The possible state transitions
for each register referenced in the next instruction are shown in Figure 6. For any given
register, if the register is not in $\mathrm{HOT}_b^{\mathrm{current}}$ initially, any operation can be
performed without special handling. If the performed operation is a LOAD, the register is
placed in $\mathrm{HOT}_b^{\mathrm{current}}$.
If the register is initially in $\mathrm{HOT}_b^{\mathrm{current}}$, the only operation which can be
performed without special handling is a synchronizing READ. Any other operation will destroy the
synchronization state of the register. To compensate for this, a READ must be inserted
prior to issuing the instruction, so that the register synchronization state is cleared before
the register is overwritten. If either a READ or a WRITE is performed, the register is removed
from the set $\mathrm{HOT}_b^{\mathrm{current}}$.
The general outline of the algorithm is shown in Figure 7. Iterating until all basic
blocks have been scheduled, select a basic block. The routine SelectBlock chooses the
next block to be scheduled. The criteria used to make this selection need not be specified,
as they do not affect the workings of the algorithm. It is guaranteed that the return value
from SelectBlock is a block which has not yet been scheduled. Calculate
$\mathrm{HOT}_b^{\mathrm{initial}}$ for the selected block $b$. Schedule the instructions within this
basic block. Calculate $\mathrm{STATE}_b^{\mathrm{incoming}}$ for this block. For any register which was
not referenced in this block, set $\mathrm{STATE}_b(r)$ for that register to
$\mathrm{STATE}_b^{\mathrm{incoming}}(r)$. Generate any compensation code needed to reconcile
$\mathrm{HOT}_b^{\mathrm{current}}$ with $\mathrm{STATE}_b^{\mathrm{incoming}}$.
while not all blocks scheduled
b  SelectBlock
HOT initial
b
  
for all predecessors p of b
if predecessor p scheduled
HOT initial
b
 HOT initial
b
 HOT final
p

HOT current
b
 HOT initial
b
Scheduleb
STATE incoming
b
 JoinSuccessorStatesb
for all registers r
if r  HOT current
b
AND STATE incoming
b
r  GROUNDED
issue read of r
remove r from HOT current
b
if STATE
b
r  NULL
STATE
b
r  GROUNDED
if STATE
b
r  NULL
STATE
b
r  STATE incoming
b
r
HOT final
b
 HOT
Mark b as scheduled
Figure 
 General Outline of Algorithm
The algorithm for scheduling the instructions within a basic block is shown in Figure 8.
This algorithm iterates over all of the instructions within a basic block, ensuring that all
operations are scheduled and all necessary READS are inserted. In addition, this algorithm
determines the value of $\mathrm{STATE}_b(r)$ for all registers referenced in the block, and
maintains the membership information for the set $\mathrm{HOT}_b^{\mathrm{current}}$.

    for each operation O {
        if O is a READ {
            if STATE_b(source(O)) = NULL
                STATE_b(source(O)) ← GROUNDED
            if STATE_b(target(O)) = NULL
                STATE_b(target(O)) ← COLD
            if source(O) ∈ HOT_b^current
                remove source(O) from HOT_b^current
        }
        if O is a WRITE {
            if target(O) ∈ HOT_b^current {
                issue READ of target(O)
                remove target(O) from HOT_b^current
            }
            if STATE_b(target(O)) = NULL
                STATE_b(target(O)) ← COLD
        }
        if O is a LOAD {
            if target(O) ∈ HOT_b^current {
                issue READ of target(O)
                if STATE_b(target(O)) = NULL
                    STATE_b(target(O)) ← GROUNDED
            }
            else {
                add target(O) to HOT_b^current
                if STATE_b(target(O)) = NULL
                    STATE_b(target(O)) ← HOT
            }
        }
        issue O
    }
Figure 8: Scheduling Algorithm for a Basic Block
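The per-block scheduling pass can be transcribed fairly directly into Python. The following is a sketch under stated assumptions rather than the report's implementation: operations are represented as hypothetical `(kind, source, target)` tuples with at most one source register, `None` marks an absent operand, and the function returns its results instead of updating shared state.

```python
HOT, COLD, GROUNDED = "HOT", "COLD", "GROUNDED"


def schedule_block(ops, hot_initial):
    """Forward pass over one basic block, inserting READs where needed.

    ops:         list of (kind, source, target), kind in READ/WRITE/LOAD.
    hot_initial: registers inherited as HOT from scheduled predecessors.
    Returns (issued operations, upwardly-exposed STATE, HOT set at the end).
    """
    hot = set(hot_initial)
    state = {}      # a missing key plays the role of NULL
    issued = []
    for kind, source, target in ops:
        if kind == "READ":
            if source is not None:
                state.setdefault(source, GROUNDED)
            if target is not None:
                state.setdefault(target, COLD)
            hot.discard(source)          # the READ synchronizes its source
        elif kind == "WRITE":
            if target in hot:
                issued.append(("READ", target, None))  # inserted READ
                hot.discard(target)
            state.setdefault(target, COLD)
        elif kind == "LOAD":
            if target in hot:
                issued.append(("READ", target, None))  # inserted READ
                state.setdefault(target, GROUNDED)
                # target stays HOT: the new LOAD is now outstanding
            else:
                hot.add(target)
                state.setdefault(target, HOT)
        issued.append((kind, source, target))
    return issued, state, hot


# The false path of the delayed-load example: LOAD, then WRITE, then use.
ops = [("LOAD", None, "r1"), ("WRITE", None, "r1"), ("READ", "r1", None)]
issued, state, hot = schedule_block(ops, set())
```

For this input the pass inserts a READ of r1 between the LOAD and the WRITE, which forces the LOAD to complete before r1 is overwritten.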

The routine JoinSuccessorStates is shown in Figure  This routine examines
STATE for all scheduled successors and merges them into one STATE
for all registers r
STATE
b
r  GROUNDED
for all scheduled successors s of b
for all registers r
if STATE
b
r  HOT  STATE
s
r  HOT
STATE
b
r  HOT
else if STATE
b
r  COLD  STATE
s
r  COLD
STATE
b
r  COLD
return STATE
b

Figure 
 Algorithm for Merging STATE from Scheduled Successors
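The merge can be written compactly in Python. This sketch assumes that each scheduled successor's STATE is supplied as a dictionary and that a register missing from a dictionary counts as GROUNDED; the function and parameter names are illustrative.

```python
HOT, COLD, GROUNDED = "HOT", "COLD", "GROUNDED"


def join_successor_states(registers, successor_states):
    """Merge the exposed STATE of all scheduled successors into one STATE.

    HOT takes precedence over COLD, and COLD over GROUNDED.
    """
    joined = {r: GROUNDED for r in registers}    # default for every register
    for state_s in successor_states:
        for r in registers:
            s = state_s.get(r, GROUNDED)
            if joined[r] == HOT or s == HOT:
                joined[r] = HOT
            elif joined[r] == COLD or s == COLD:
                joined[r] = COLD
    return joined


merged = join_successor_states(
    ["r1", "r2", "r3"],
    [{"r1": GROUNDED, "r2": COLD}, {"r1": HOT, "r2": GROUNDED}],
)
```

Equivalently, the result for each register is the maximum over the successors under the ordering GROUNDED < COLD < HOT, so the order in which successors are visited does not matter.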
 Correctness of the Algorithm
This section demonstrates that the algorithm presented in the previous section will
guarantee that the synchronization state of a register is always cleared before an operation
using that register as its destination is issued. This is shown by providing more formal
definitions for the concepts discussed in previous sections, describing invariants for the
algorithm, and then providing a demonstration of correctness.
Basic Definitions
The following provides more formal definitions for many of the concepts introduced in
the previous several sections.
$B$ is the complete set of basic blocks in a program.
$R$ is the complete set of available machine registers.
$\mathrm{SCHEDULED}(b) = \mathrm{TRUE}$ if code has been generated for block $b$.
$\mathrm{FIRST}(b, r)$, $b \in B$, $r \in R$, is the first operation in a block $b$ that references $r$
as source and/or destination. $\mathrm{FIRST}(b, r)$ is one of $\{\mathrm{READ}, \mathrm{WRITE}, \mathrm{LOAD}\}$.
$\mathrm{LAST}(b, r)$, $b \in B$, $r \in R$, is the last operation in a block $b$ that references $r$
as source and/or destination. $\mathrm{LAST}(b, r)$ is one of $\{\mathrm{READ}, \mathrm{WRITE}, \mathrm{LOAD}\}$.
$\mathrm{SOURCE}(\mathrm{OP}(b, r)) = \mathrm{TRUE}$ if $r$ is a source for $\mathrm{OP}(r)$ in $b$;
$\mathrm{OP} \in \{\mathrm{FIRST}, \mathrm{LAST}\}$, $r \in R$, $b \in B$.
$\mathrm{DESTINATION}(\mathrm{OP}(b, r)) = \mathrm{TRUE}$ if $r$ is a destination for $\mathrm{OP}(r)$ in $b$;
$\mathrm{OP} \in \{\mathrm{FIRST}, \mathrm{LAST}\}$, $r \in R$, $b \in B$.
$\mathrm{SUCCESSOR}(b, s) = \mathrm{TRUE}$ if $b, s \in B$ and $s$ is a successor of $b$.
$\mathrm{PREDECESSOR}(b, p) = \mathrm{TRUE}$ if $b, p \in B$ and $p$ is a predecessor of $b$.
If an instruction has a register as both source and destination, the state of the register
after the instruction has been performed is determined as if the register were only the
destination. The use of the register as source ensures that correction code will not be
necessary prior to executing the instruction.
Initially, $\mathrm{STATE}_b$ is empty: $\mathrm{STATE}_b(r) = \mathrm{NULL}$ for every register $r$.
Definitions for Transitional Information
Given these definitions, it is possible to define rules for determining the state visible from
outside a scheduled block. First, the definition of $\mathrm{STATE}_b$, which is the upwardly-exposed
register information for the block $b$. If the first reference to a register performs a READ on
the register, then $\mathrm{STATE}_b(r)$ for that register is GROUNDED:

$$\mathrm{SOURCE}(\mathrm{FIRST}(b, r)) = \mathrm{TRUE} \Rightarrow \mathrm{STATE}_b(r) = \mathrm{GROUNDED}$$

If the first instruction that accesses a register uses that register as both source and
destination, $\mathrm{STATE}_b(r)$ is GROUNDED, as the subsequent use of the register as destination is
not the first reference to the register. If the first reference to a register is as destination, the
operation could be either a WRITE or a LOAD. If it is a WRITE, then $\mathrm{STATE}_b(r)$ is COLD; if it
is a LOAD, then $\mathrm{STATE}_b(r)$ is HOT:

$$\mathrm{SOURCE}(\mathrm{FIRST}(b, r)) = \mathrm{FALSE} \wedge \mathrm{FIRST}(b, r) = \mathrm{WRITE} \Rightarrow \mathrm{STATE}_b(r) = \mathrm{COLD}$$
$$\mathrm{SOURCE}(\mathrm{FIRST}(b, r)) = \mathrm{FALSE} \wedge \mathrm{FIRST}(b, r) = \mathrm{LOAD} \Rightarrow \mathrm{STATE}_b(r) = \mathrm{HOT}$$
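These rules for the upwardly-exposed state can be illustrated with a short Python sketch. The `(kind, sources, dest)` tuple representation is an assumption made for the example; in particular, an arithmetic instruction such as an ADD is modelled here as a WRITE to its destination that also names its source registers.

```python
HOT, COLD, GROUNDED = "HOT", "COLD", "GROUNDED"


def exposed_states(ops):
    """Compute the upwardly-exposed state of every register a block touches.

    ops: list of (kind, sources, dest); kind is "READ", "WRITE" or "LOAD",
         sources is a tuple of register names, dest is a register or None.
    """
    state = {}
    for kind, sources, dest in ops:
        for r in sources:
            # First reference as a source: the register is GROUNDED,
            # even if the same instruction also uses it as destination.
            state.setdefault(r, GROUNDED)
        if dest is not None:
            # First reference as destination only: HOT for a LOAD,
            # COLD for any other operation that writes the register.
            state.setdefault(dest, HOT if kind == "LOAD" else COLD)
    return state


block = [
    ("WRITE", ("r1", "r2"), "r1"),   # r1 is both source and destination
    ("LOAD", (), "r3"),
    ("WRITE", (), "r4"),
    ("READ", ("r3",), None),         # not the first reference to r3
]
states = exposed_states(block)
```

Only the first reference to each register matters, so the later READ of r3 does not change its exposed state.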
Once it is possible to derive the definition of $\mathrm{STATE}_b$ for one block, it is possible to
combine the $\mathrm{STATE}_s$ for all scheduled successors of that block, once some rules of
precedence are defined. If $\mathrm{STATE}_s(r) = \mathrm{HOT}$ for any successor,
$\mathrm{STATE}_b^{\mathrm{incoming}}(r) = \mathrm{HOT}$. If $\mathrm{STATE}_s(r) \neq \mathrm{HOT}$ for all successors,
but $\mathrm{STATE}_s(r) = \mathrm{COLD}$ for any successor, $\mathrm{STATE}_b^{\mathrm{incoming}}(r) = \mathrm{COLD}$.
Otherwise, $\mathrm{STATE}_s(r) = \mathrm{GROUNDED}$ for all successors, and
$\mathrm{STATE}_b^{\mathrm{incoming}}(r) = \mathrm{GROUNDED}$. Given these rules, it is possible to define:

$$\mathrm{STATE}_b^{\mathrm{incoming}} = \Bigl\{ \bigcup_i \mathrm{STATE}_{b_i} \Bigm| b_i \in B \wedge \mathrm{SCHEDULED}(b_i) = \mathrm{TRUE} \wedge \mathrm{SUCCESSOR}(b, b_i) = \mathrm{TRUE} \Bigr\}$$
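There are two rules needed for determining $\mathrm{HOT}_b^{\mathrm{final}}$. If the last reference to a
register uses the register as destination and the operation is a LOAD, then
$\mathrm{HOT}_b^{\mathrm{final}}(r)$ for that register is TRUE. In all other cases,
$\mathrm{HOT}_b^{\mathrm{final}}(r)$ for that register is FALSE:

$$\mathrm{DESTINATION}(\mathrm{LAST}(b, r)) = \mathrm{TRUE} \wedge \mathrm{LAST}(b, r) = \mathrm{LOAD} \Rightarrow \mathrm{HOT}_b^{\mathrm{final}}(r) = \mathrm{TRUE}$$
$$\mathrm{DESTINATION}(\mathrm{LAST}(b, r)) = \mathrm{FALSE} \vee \mathrm{LAST}(b, r) \neq \mathrm{LOAD} \Rightarrow \mathrm{HOT}_b^{\mathrm{final}}(r) = \mathrm{FALSE}$$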
For a given block HOT initial
b
is the union of the sets HOT final
p
from all of the
previously scheduled successors For this union operation a register is considered to be
in HOT final
p
 iff HOT final
p
r  TRUE This leads to the de	nition

HOT initial
b
 f

i
HOT final
b
i
j b
i
 B  SCHEDULEDb
i
  TRUE
 PREDECESSORb b
i
  TRUEg
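As a concrete illustration of this definition, the union over scheduled predecessors can be sketched in Python. The dictionary-based control-flow representation and all names below are assumptions made for the example.

```python
def hot_initial(block, predecessors, scheduled, hot_final):
    """Union of HOT-final sets over the scheduled predecessors of a block.

    predecessors: dict mapping a block to a list of its predecessors.
    scheduled:    set of blocks that have already been scheduled.
    hot_final:    dict mapping each scheduled block to its HOT-final set.
    """
    result = set()
    for p in predecessors.get(block, []):
        if p in scheduled:           # unscheduled predecessors contribute nothing
            result |= hot_final[p]
    return result


predecessors = {"b": ["p1", "p2", "p3"]}
scheduled = {"p1", "p3"}
hot_final = {"p1": {"r1"}, "p3": {"r1", "r4"}}
initial = hot_initial("b", predecessors, scheduled, hot_final)
```

Here the unscheduled predecessor p2 is skipped, and the block starts with r1 and r4 treated as HOT.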
 Invariants Preconditions and Postcondition
The preconditions for the algorithm are

  B contains all of the basic blocks for a procedure
  The graph containing B is properly organized to represent the relationships between
all blocks and their predecessorssuccessors
  Each b  B contains a set of operations fOjO  fREAD WRITELOADgg
  All operations are legal to issue
  b  B  SCHEDULEDb  FALSEg
The invariants for this algorithm are

  b  B SCHEDULEDb  TRUE  All original operations in b have been issued
  b  B SCHEDULEDb  TRUE  All required READS have been inserted in b
  b  B SCHEDULEDb  TRUE  HOT final
b
is correct
  b  B SCHEDULEDb  TRUE  STATE
b
is correct
As long as these invariants are preserved there is only one necessary postcondition
for the algorithm

  b  B  SCHEDULEDb  TRUE
 
Demonstration of Correctness
Initially no blocks have been scheduled and all invariants are trivially preserved If B 
 then there are no basic blocks within the current procedure and the algorithm trivially
terminates Otherwise there is at least one block which needs to be scheduled and the
outer loop denoted by

while not all blocks scheduled
will be entered This loop condition can be rewritten as

while fbjb  B SCHEDULEDb  FALSE g
Since SelectBlock is guaranteed to return a value fbjb  B SCHEDULEDB 
FALSEg  and such a b exists after execution of this line it can be asserted that
 b  B 
SCHEDULEDb  FALSE
The next step in the algorithm is to ensure that the set HOT initial
b
contains only
registers that are in HOT final
p
of scheduled predecessors p and contains all such regis
ters The initialization

HOT initial
b
 
allows us to assert
 HOT initial
b
contains no registers not present in HOT final
p
of a
scheduled predecessor of b If there are no scheduled predecessors of b then this initial
value is correct If there are scheduled predecessors the following code will ensure that
HOT initial
b
contains the proper set of registers

for all predecessors p of b
if predecessor scheduled
HOT initial
b
 HOT initial
b

 HOT final
p

This can be expressed as

f p j p  B PREDECESSORb p  TRUE g
if SCHEDULEDp  TRUE
HOT initial
b
 HOT initial
b

 HOT final
p

This loop will terminate as it iterates only once for each predecessor of b and the set of
predecessors is 	nite After execution of this loop the following can be asserted

r  R  r  HOT initial
b
 r  HOT final
p
 SCHEDULEDp  TRUE
 PREDECESSORb p  TRUE
 
At this point the computation of HOT initial
b
is complete After the assignment

HOT current
b
 HOT initial
b
everything that has been asserted about HOT initial
b
will hold for HOT current
b
until
the latter is modi	ed
It will be demonstrated in Section   that after the call

Scheduleb
the following can be asserted
 all original operations in b have been issued all required
READS have been inserted in b HOT current
b
is correct up to this point and STATE
b
is
correct up to this point
It will be shown in Section  that following the call

STATE incoming
b
 JoinSuccessorStatesb
the following can be asserted
 STATE incoming
b
is correct
Next an iteration over all registers is performed to identify any registers which are in
HOT current
b
 but are not grounded in STATE incoming
b
 In addition any registers for
which STATE
b
r  NULL are set to STATE incoming
b
r

for all registers r
if r  HOT current
b
 STATE incoming
b
r  HOT 	 STATE incoming
b
r  COLD
issue READ of r
remove r from HOT current
b
if STATE
b
r  NULL
STATE
b
r  GROUNDED
if STATE
b
r  NULL
STATE
b
r  STATE incoming
b
r
This loop will terminate as each iteration examines one register and the number of registers
is 	nite After this code has executed it can be asserted that
 STATE
b
is correct all
necessary READs to compensate for transitions between this block and scheduled successors
have been issued HOT current
b
is correct As there are no more instructions that can
alter HOT current
b
 the following assignment is made

HOT final
b
 HOT current
b
Once this assignment is complete it can be asserted that HOT final
b
is correct Finally
SCHEDULEDb is updated using

SCHEDULEDb  TRUE
And all of the invariants hold for the current block
As this loop will be executed exactly once for each block b  B and there are a
	nite number of blocks in B the loop will terminate After each iteration of the loop
 
SCHEDULEDb holds for one more block than it had on the previous iteration There
fore when the loop terminates the postcondition

b  B  SCHEDULEDb  TRUE
has been satis	ed
Correctness of the routine Schedule
Recall that the required postconditions of this routine are:
  • All original operations in $b$ have been issued.
  • All required READS have been inserted in $b$.
  • $\mathrm{HOT}_b^{\mathrm{current}}$ is correct up to this point.
  • $\mathrm{STATE}_b$ is correct up to this point.
Recall also that the following preconditions have been shown to hold on entry to this
routine:
  • $b \in B \wedge \mathrm{SCHEDULED}(b) = \mathrm{FALSE}$
  • $\mathrm{HOT}_b^{\mathrm{current}}$ contains only registers that are in $\mathrm{HOT}_p^{\mathrm{final}}$ of
    scheduled predecessors of $b$, and contains all such registers.
This entire routine is in the form of a single loop, denoted by:

    for each operation O

As a block must have a finite number of operations, and each operation is dealt with
in a single loop iteration, this loop will terminate. As the bottom of this loop is the
instruction:

    issue O

it can be trivially asserted that all original operations in $b$ are issued.
Depending on the type of operation currently being handled, one of three clauses is
executed. By showing that each clause properly handles one type of operation, it will be
demonstrated that all operations will be properly handled.
The only item of concern for processing a READ is to ensure that, following the READ,
its source register is not in $\mathrm{HOT}_b^{\mathrm{current}}$. In terms of $\mathrm{STATE}_b$, if this is the
first reference to the source register, its state is set to GROUNDED, and if this is the first
reference to the destination register, its state is set to COLD. This is all accomplished with the
clause:

    if STATE_b(source(O)) = NULL
        STATE_b(source(O)) ← GROUNDED
    if STATE_b(target(O)) = NULL
        STATE_b(target(O)) ← COLD
    if source(O) ∈ HOT_b^current
        remove source(O) from HOT_b^current

Following this clause and the issuing of the instruction at the bottom of the loop, it can
be asserted that if the first operation is a READ, all postconditions hold at the bottom of
the loop after the first iteration.
When dealing with a WRITE, if the target register is in $\mathrm{HOT}_b^{\mathrm{current}}$, it is necessary
to insert a READ of this register. Following this insertion, the register is removed from
$\mathrm{HOT}_b^{\mathrm{current}}$. In either case, if this is the first reference to the target variable, its
state is set to COLD. These manipulations are performed by:

    if target(O) ∈ HOT_b^current {
        issue READ of target(O)
        remove target(O) from HOT_b^current
    }
    if STATE_b(target(O)) = NULL
        STATE_b(target(O)) ← COLD

Following this clause and the issuing of the instruction at the bottom of the loop, it can
be asserted that if the first operation is a WRITE, all postconditions hold at the bottom of
the loop after the first iteration.
When dealing with a LOAD, if the target of the LOAD is in $\mathrm{HOT}_b^{\mathrm{current}}$, it is necessary
to insert a READ of this register. The register is not removed from $\mathrm{HOT}_b^{\mathrm{current}}$,
however, as the target of the LOAD currently being processed must be in
$\mathrm{HOT}_b^{\mathrm{current}}$ once the current instruction has been issued. If we are to issue this
READ of the target register, and this is the first reference to this register, its state is set to
GROUNDED. If the target is not in $\mathrm{HOT}_b^{\mathrm{current}}$, it is added to
$\mathrm{HOT}_b^{\mathrm{current}}$. If this is the first reference to this register, its state is set to HOT.
This logic is encoded as:

    if target(O) ∈ HOT_b^current {
        issue READ of target(O)
        if STATE_b(target(O)) = NULL
            STATE_b(target(O)) ← GROUNDED
    }
    else {
        add target(O) to HOT_b^current
        if STATE_b(target(O)) = NULL
            STATE_b(target(O)) ← HOT
    }

Following this clause and the issuing of the instruction at the bottom of the loop, it can
be asserted that if the first operation is a LOAD, all postconditions hold at the bottom of
the loop after the first iteration.
It can now be asserted that if the instruction is any of {READ, WRITE, LOAD}, following
the issuing of the instruction at the bottom of the loop, all postconditions hold after the
first iteration. This can be generalized to the statement that all postconditions will hold
at the bottom of any loop iteration, in particular the final loop iteration. Therefore all
postconditions hold at the end of this routine.
 Correctness of the routine JoinSuccessorStates
Recall that the requirement of this routine is that it has as its postcondition

STATE incoming
b
is correct
To be able to discuss this, it is necessary to describe what it means for this condition
to hold. If $STATE_s[r]$ for a register is HOT for any scheduled successor $s$, then it is
irrelevant what $STATE_s[r]$ for that register is for any other scheduled successor $s$.
The register must be treated as if it is HOT for all successors. If $STATE_s[r] \neq HOT$
for all scheduled successors, but it is COLD for one or more successors, then it is treated
as being COLD for all successors. If neither of these conditions is met, then the register
is treated as being GROUNDED for all successors.
The preconditions for this routine are:

    $b \in B$
    The information about successors to $b$ is valid
The above description implies that the default setting for a register, unless it is
overridden by a value from a scheduled successor, is GROUNDED. Therefore, the first operation
in the routine is:

    for all registers r
        $STATE_b[r] := GROUNDED$
This loop will terminate, as it iterates once per register and there are a finite number of
registers. If there are no scheduled successors of $b$, then the routine terminates here, and
the returned $STATE_b$ is GROUNDED for all registers. This is the correct return value for
this circumstance. If there are scheduled successors, the loop denoted by

    for all scheduled successors of b

will be entered. This can be rewritten as

    $\forall s \mid s \in B \wedge SCHEDULED(s) = TRUE \wedge SUCCESSOR(b, s) = TRUE$
This loop will terminate, as there must be a finite number of successors to a block. Within
this loop there is another loop, which is denoted by

    for all registers r

This loop will also terminate, as the number of registers is finite. The body of this loop is

    if $STATE_b[r] = HOT \vee STATE_s[r] = HOT$
        $STATE_b[r] := HOT$
    else if $STATE_b[r] = COLD \vee STATE_s[r] = COLD$
        $STATE_b[r] := COLD$
The first check shown here ensures that if any scheduled successor has $STATE_s[r] = HOT$,
then the set returned from this routine will have $STATE_b[r] = HOT$. Similarly, if none of
the scheduled successors has $STATE_s[r] = HOT$, but one or more of them have
$STATE_s[r] = COLD$, then the set returned from this routine will have $STATE_b[r] = COLD$.
If none of the scheduled successors have either $STATE_s[r] = HOT$ or $STATE_s[r] = COLD$,
then the set returned from this routine will have the default value of
$STATE_b[r] = GROUNDED$. Upon assignment of the return value from this routine to
$STATE^{incoming}_b$, it can be asserted that $STATE^{incoming}_b$ is correct.
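The join rule described in this section (HOT dominates COLD, which dominates the GROUNDED default) can be sketched in Python as below. This is a minimal sketch under stated assumptions: the name `join_successor_states`, the dictionary representation of STATE, and treating a register with no entry in a successor as GROUNDED are choices made for the example, not details taken from the paper.

```python
from enum import Enum


class State(Enum):
    HOT = "HOT"
    COLD = "COLD"
    GROUNDED = "GROUNDED"


def join_successor_states(registers, successor_states):
    """Combine the STATE tables of the scheduled successors of a block.

    successor_states: list of dicts register -> State, one per scheduled
    successor (unscheduled successors are assumed to be filtered out).
    """
    # Default: every register is GROUNDED unless a successor overrides it.
    joined = {r: State.GROUNDED for r in registers}
    for succ in successor_states:
        for r in registers:
            s = succ.get(r, State.GROUNDED)  # assumption: absent means GROUNDED
            if joined[r] is State.HOT or s is State.HOT:
                joined[r] = State.HOT
            elif joined[r] is State.COLD or s is State.COLD:
                joined[r] = State.COLD
    return joined
```

Because HOT is never downgraded once set, the result does not depend on the order in which successors are visited.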
 Managing Transfer of Control
The preceding algorithm and proof do not address the issue of managing register state
when a transfer of control, such as a function call or interrupt, occurs. It is assumed
that interrupts are not an issue, as they will either use only special hardware registers
set aside for their use, or will first read any registers that they are going to write, to
preserve their initial values. The interrupt handler must take action to ensure that no
registers which were not HOT on entry to the handler are HOT when control is returned to
the user program.
The algorithm relies on all registers being in a known state upon entry to a function.
In order to support separate compilation of source files and the use of libraries, it is
necessary to ensure that no registers are HOT prior to issuing a function call. In a system
where all source was required to be in a single input file, it would be possible to extend
the inter-block exchange of information to also handle interprocedural exchange. It is also
possible to design a system that uses this interprocedural exchange of information by
importing information from previously compiled images, but that is outside the scope of
this paper.
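One way to meet the requirement that no registers are HOT before a call is to ground every HOT register immediately ahead of the call. The Python sketch below illustrates this under the assumption that a synchronizing READ is the grounding mechanism; the function name `ground_before_call`, the `issued` list, and the tuple encoding of operations are hypothetical.

```python
def ground_before_call(hot_current, issued, callee):
    """Emit a READ for every HOT register, then emit the call.

    hot_current: set of registers with an outstanding load.
    issued: list collecting emitted operations (hypothetical helper).
    callee: name of the function being called (illustrative only).
    """
    # Sorted only to make the emitted order deterministic in this sketch.
    for reg in sorted(hot_current):
        issued.append(("READ", reg))
    # After the READs no register is HOT, so the callee starts from a
    # known state regardless of how it was compiled.
    hot_current.clear()
    issued.append(("CALL", callee))
```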
 
 Trace Scheduling
The previous sections have all dealt with a traditional compiler which works with basic
block scheduling. The Multiflow compiler being used for the M-Machine work uses a trace
scheduling algorithm. In terms of the algorithm presented in this paper, the major
difference between basic blocks and traces is that, unlike a basic block, a trace can have
multiple entry and exit points. This section explains the changes that are necessary to
the basic-block algorithm presented earlier to accommodate trace scheduling, and discusses
the impact of these changes on the demonstration of correctness.

 The Trace Scheduling Algorithm
The motivation for trace scheduling is found in the problem of compiling for
instruction-level parallelism (ILP). In general, it is difficult for a compiler to locate
enough parallelism within a basic block to sustain utilization of multiple functional units.
A compiler using the trace-scheduling algorithm seeks to overcome this obstacle by performing
scheduling on a larger unit than a basic block. This larger unit is called a trace. A trace
is allowed to span multiple basic blocks and may contain conditional branches within it.
Traces are not allowed to contain loop back edges. One important difference between a basic
block and a trace is that a trace is allowed to have multiple exit and entry points.
An in-depth description of trace scheduling can be found in the trace-scheduling works
listed in the References.

  Algorithm Modications for Trace Scheduling
The first change that needs to be made to the algorithm is the structure with which
state information is associated. For the basic block algorithm, it makes sense to associate
this information with each block. When dealing with trace scheduling, a more natural
association is to be found with the edges in the control flow graph for the program. This
provides a natural way of handling the multiple entry and exit points which can exist
within a trace. One of the nice properties of these edges is that an edge will have only
one entry point and one exit point.
At the start of processing for a trace, the algorithm gathers information from all edges
which enter the trace at the top. After each instruction, it now becomes necessary to
determine if there are any edges that either enter or leave the trace between the instruction
that was just issued and the next instruction to be issued. For all such edges, the handling
differs depending on whether or not the edge has been scheduled. If there are scheduled
edges joining the trace, then the HOT information from those edges is added into
$HOT^{current}$. If the edge has not been scheduled, STATE is attached to the edge, to be
used when it is scheduled.
If there are unscheduled edges leaving the trace, $HOT^{current}$ is attached to the edge,
to be used when it is scheduled. If there are scheduled edges leaving the trace, a check
is made to determine if any synchronizing operations need to be inserted before the next
instruction is emitted.
Based on these changes, the necessary changes to the algorithm can be formulated.
As can be seen in the general outline figure below, the outline of the algorithm is
essentially the same. JoinSuccessorStates remains unchanged, but the algorithm for
scheduling a single trace is different from that for a basic block.
while not all traces scheduled
t  SelectTrace
HOT initial
t
  
for all predecessors p of t
if predecessor p scheduled
HOT initial
t
 HOT initial
t
 HOT final
p

HOT current
t
 HOT initial
t
Schedulet
STATE incoming
t
 JoinSuccessorStatest
for all registers r
if r  HOT current
t
AND STATE incoming
t
r  GROUNDED
issue read of r
remove r from HOT current
t
if STATE
t
r  NULL
STATE
t
r  GROUNDED
if STATE
t
r  NULL
STATE
t
r  STATE incoming
t
r
HOT final
t
 HOT
Mark t as scheduled
Figure  
 General Outline of Trace Algorithm
The algorithm for scheduling the instructions within a trace is shown in the figure below.
This algorithm iterates over all of the instructions within a trace, ensuring that all
operations are scheduled and all necessary READs are inserted. In addition, this algorithm
determines the value of $STATE_t[r]$ for all registers referenced in the trace, and maintains
the membership information for the set $HOT^{current}_t$. Prior to issuing each instruction,
this algorithm also determines if there are any splits from or joins to this trace entering
at the current cycle. If there are such edges, the necessary manipulations of $STATE_t$
and $HOT_t$ are performed, and any READs necessitated by these edges are inserted.

    for each cycle C
        for each edge e joining t at C
            if edge e scheduled {
                $HOT^{current}_t := HOT^{current}_t \cup HOT^{final}_e$
            }
            else {
                Attach $STATE_t$ to e
            }
        for each operation O {
            ...
            issue O
        }
        for each edge e splitting from t at C
            if edge e scheduled {
                for all registers r
                    if $r \in HOT^{current}_t$ AND $STATE^{incoming}_e[r] \neq GROUNDED$
                        issue read of r
            }
            else {
                Attach $HOT^{current}_t$ to e
            }
Figure: Scheduling Algorithm for a Trace
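The per-cycle edge handling in the trace scheduling figure can be sketched in Python as follows. This is an illustrative sketch only: the `Edge` class, its field names, the string encoding of states, and the `issued` list are assumptions introduced for the example. A register with no entry in an edge's incoming state is assumed to be GROUNDED.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Edge:
    scheduled: bool
    hot_final: set = field(default_factory=set)         # used when a scheduled edge joins
    state_incoming: dict = field(default_factory=dict)  # used when a scheduled edge splits
    attached_state: Optional[dict] = None               # filled in for unscheduled joins
    attached_hot: Optional[set] = None                  # filled in for unscheduled splits


def handle_joins(edges, hot_current):
    """Process edges joining the trace at the current cycle."""
    for e in edges:
        if e.scheduled:
            hot_current |= e.hot_final  # in-place union into HOT current
        else:
            e.attached_state = {}       # fresh STATE for the edge


def handle_splits(edges, hot_current, issued):
    """Process edges splitting from the trace at the current cycle."""
    for e in edges:
        if e.scheduled:
            for r in sorted(hot_current):
                if e.state_incoming.get(r, "GROUNDED") != "GROUNDED":
                    issued.append(("READ", r))
        else:
            e.attached_hot = set(hot_current)  # copy, not an alias
```

As in the figure, a READ inserted for a splitting edge does not remove the register from the HOT set here; whether it should depends on whether the READ can be made conditional on the branch.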
 

 Correctness Modications for Trace Scheduling
Most of the changes to accommodate trace scheduling clearly have little impact on the
previously demonstrated correctness of the algorithm. The one area which bears some
examination is the code that occurs at the beginning and end of each cycle of scheduling.
These pieces of code mirror, in many respects, the code at the beginning and end of each
basic block or trace.
At the beginning of each cycle, a check is made to determine if any edges join the current
trace at this cycle. If edges do join at this cycle, it is necessary either to incorporate
HOT information from the edge, if it has been scheduled, or to attach STATE information to
the edge, if it has not been scheduled. In the first case, this is directly analogous to
incorporating $HOT^{final}$ from all predecessors to a given trace into $HOT^{initial}$
prior to beginning scheduling of the trace. For any given instruction, it is necessary to
have incorporated information about all preceding instructions which have already been
scheduled. For the latter case, some clarification is needed. When STATE is attached to an
edge, this does not mean that the current values of STATE that have already been computed
for this trace are copied into the edge. It means that a new $STATE_e$ is initialized, and
all further computation of STATE that occurs in the process of scheduling the current trace
must update not only $STATE_t$ for the trace, but all $STATE_e$ for unscheduled edges that
have already joined the trace. This process will ensure that when the edge is scheduled, it
is presented with a clear picture of what will happen to registers following the transition
from the edge into the adjoining trace.
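The bookkeeping described above, where each unscheduled joining edge gets its own fresh STATE that is updated alongside the trace's STATE, can be sketched in Python as below. The class `TraceState` and its method names are hypothetical, states are encoded as plain strings, and a missing dictionary key stands for NULL.

```python
class TraceState:
    """Tracks STATE for a trace and for unscheduled edges that joined it."""

    def __init__(self):
        self.state_t = {}      # STATE for the trace itself
        self.edge_states = []  # one fresh STATE per unscheduled joined edge

    def attach_to_unscheduled_edge(self):
        # A new, empty STATE is created for the edge; nothing already
        # computed for the trace is copied into it.
        fresh = {}
        self.edge_states.append(fresh)
        return fresh

    def record_first_reference(self, reg, value):
        # Each table keeps only the first reference it has seen, so an
        # edge records the first reference made after its join point.
        for table in [self.state_t, *self.edge_states]:
            table.setdefault(reg, value)
```

For example, if a register is loaded before an edge joins and written afterwards, the trace records HOT for it while the edge records COLD, which is what a path entering through that edge would actually encounter first.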
At the end of each cycle, edges that split from the current trace are handled. If these
edges have already been scheduled, a check is made to determine if any registers need to
be GROUNDED prior to transitioning to the edge. Ideally, any necessary compensation code
can be conditionalized so that it is only executed if the branch to the edge is to be
taken, and hidden in a branch shadow. If the adjoining edge has not been scheduled, then
a copy of $HOT^{current}$ is placed on the edge. Relative to this edge, this is equivalent
to placing $HOT^{final}$ into an adjoining trace. If there are multiple edges and it is not
possible to conditionalize any inserted GROUNDing operations, all of the scheduled edges
should be handled first, to minimize the number of registers in the set HOT copied into the
unscheduled edges.
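The ordering advice for the unconditional case can be sketched in Python as follows. This sketch assumes that an unconditional READ grounds the register on every path and therefore removes it from the HOT set; that removal, the function name `handle_splits_unconditional`, and the dictionary encoding of edges are assumptions made for the example.

```python
def handle_splits_unconditional(edges, hot_current, issued):
    """Handle splitting edges when grounding READs cannot be conditionalized.

    edges: list of dicts with key 'scheduled' and, for scheduled edges,
    'state_incoming' (register -> state string). Hypothetical encoding.
    """
    # Scheduled edges first (False sorts before True); sorted() is stable.
    ordered = sorted(edges, key=lambda e: not e["scheduled"])
    for e in ordered:
        if e["scheduled"]:
            # sorted() builds a list, so hot_current can be modified safely.
            for r in sorted(hot_current):
                if e["state_incoming"].get(r, "GROUNDED") != "GROUNDED":
                    issued.append(("READ", r))
                    hot_current.discard(r)  # assumption: READ grounds r on all paths
        else:
            # Unscheduled edges receive the already-reduced HOT set.
            e["attached_hot"] = set(hot_current)
```

Processing the scheduled edge first means the unscheduled edge inherits fewer HOT registers, which is the effect the text aims for.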
 Summary
This paper describes a compiler algorithm for preventing write-after-write hazards in the
absence of hardware interlocks. While the algorithm was motivated by the design of the
MIT M-Machine, it is applicable to any architecture that does not prevent this hazard.
The algorithm has been shown to be correct. Implementation of the algorithm within
the Multiflow compiler targeted for the M-Machine is currently under way. Future work
will detail the implementation of the algorithm and evaluate the cost of managing
this issue in software rather than hardware.

References
  Dally W J et al MMachine Architecture v  Massachusetts Institute of Tech
nology Arti	cial Intelligence Laboratory Concurrent VLSI Architecture Memo 
February  
 Ellis John R Bulldog
 A Compiler for VLIW Architectures The MIT Press
Cambridge MA  
  Lowney PG Freudenberger Stefan M et al The Multiow Trace Scheduling
Compiler J of Supercomputing v   

