Data dependence, Instruction scheduling, Shared memory multiprocessor, Superscalar, Synchronisation, Synchronisation marker
Introduction
A parallel compiler exploits loop-level parallelism, and instruction scheduling exploits instruction-level parallelism. In the past, the two levels were studied independently, but new problems arise when loop-level and instruction-level parallelism are considered simultaneously. For example, in a shared memory multiprocessor in which each processing element is a superscalar processor, both levels must be exploited at once. In this paper, we discuss a synchronisation problem in which synchronisation operations are inserted at loop level but the order dependence they enforce may be broken during instruction scheduling.
Parallel loops in a program, whose iterations can be executed concurrently on different processors, provide the greatest potential parallelism to be exploited by multiprocessor systems. If there are data dependences across iterations of a DO-loop (loop-carried dependences), its iterations can still be executed concurrently on different processors, provided that the data dependences are enforced by synchronisation across the processors during execution. This kind of parallel loop is called a DOACROSS loop [2, 14]. Several compiler techniques have been proposed for multiprocessor data synchronisation of DOACROSS loops. Midkiff [10] and Wolfe [16] suggested inserting statement-level synchronisation instructions such as set/wait and send/wait in the loop body to enforce data dependences. These schemes can only handle constant distance data dependences and single-nested DOACROSS loops. Su and Yew proposed a process-oriented data synchronisation scheme for constant distance data dependences [12]. Li provides two operations, POST and WAIT, on a logical event variable to execute a DOACROSS loop in parallel [8]. The operation POST(u) sets the event variable u to TRUE; the operation WAIT(u) busy-waits until u becomes TRUE. The initial value of an event variable is FALSE.
The scheme of event variable synchronisation is suitable for some variable distances and nested loops. It can also handle single loop synchronisation. Tang, Yew and Zhu present an algorithm based on special counters [15]. They use two data-oriented operations, synch read and synch write, to replace the regular read and write in the original sequential program, respectively. The ordering number for each data access is decided at compilation time. This scheme handles a subset of the loop bounds and subscripts covered by the event variable synchronisation method.
From the discussion above, Li's approach is the most powerful of these schemes. Fig. 1 is an example of event variable synchronisation [8]. A nested loop is shown in Fig. 1a, and its synchronisation operation insertion is shown in Fig. 1b. In Fig. 1b, EV is a bit array that records whether the corresponding iteration has finished. Because the dependence sink need not wait in every loop iteration, a condition must be checked in each iteration to see whether a WAIT must really be executed. This condition is called the mask predicate, and it is tested by the IF statement in Fig. 1b. On the other hand, because the distance of the dependence relation is variable, the distance between dependence source and sink must be determined.
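The POST/WAIT mechanism with an event array and a mask predicate can be illustrated with a minimal Python sketch. It is not the loop of Fig. 1: the single loop A[i] = A[i - distance] + 1, the function name run_doacross, and the cyclic assignment of iterations to threads are all assumptions made for illustration. Note also that threading.Event.wait blocks rather than busy-waits, unlike the WAIT operation described above.

```python
import threading

def run_doacross(n, distance=2, workers=4):
    """Run A[i] = A[i - distance] + 1 as a DOACROSS loop on `workers` threads."""
    a = [0] * n
    # EV: one event variable per iteration, initially FALSE (not set).
    ev = [threading.Event() for _ in range(n)]

    def iteration(i):
        if i - distance >= 0:            # mask predicate: is there a source to wait for?
            ev[i - distance].wait()      # WAIT(EV[i - distance]); the index is the contact
            a[i] = a[i - distance] + 1   # dependence sink reads a[i - distance]
        else:
            a[i] = 1                     # no loop-carried source for the first iterations
        ev[i].set()                      # POST(EV[i]) only after a[i] has been written

    def worker(w):
        # Hypothetical cyclic distribution: thread w runs iterations w, w + workers, ...
        for i in range(w, n, workers):
            iteration(i)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

The sketch is correct only because the write to a[i] precedes ev[i].set() and the read of a[i - distance] follows the wait. If either pair were reordered, a waiting iteration could observe the event before the data is written.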
This work is supported in part by the National Science Council under grant NSC 82-0408-E-002-093.
The correspondence between two dependent references is called contact, which should be maintained by indexing the event array correctly in POST and WAIT as in Fig. 1b. ... processor has not written data into array A yet, but the corresponding bit array has been set. In the meantime, if a processor is waiting for the bit to be set, an error is incurred and we will get stale data. In this paper, we propose a technique to resolve this problem.
In this Section, we discuss the synchronisation conditions which prevent the instruction scheduling from error. First, some notations are defined.
Src = dependence source
Snk = dependence sink
Sig = a synchronisation instruction 'POST'
Wat = a synchronisation instruction 'WAIT'
P_src = the processor which executes the iteration of Src
P_snk = the processor which executes the iteration of Snk
Now, synchronisation conditions are described as ... Wat. The dependence relation still holds because it is a loop-carried dependence. In such a case, the LBD is converted into the LFD.
In this Section, we propose an approach to resolving the problem. The simplest solution for this problem is to constrain instructions from moving across any synchronisation operation. However, this will seriously affect the function of instruction scheduling. According to Reference 5, the instruction parallelism in a basic block is scarce. If this problem is resolved in this way, the performance of the superscalar processor will degrade seriously. This problem has two features: (i) the dependence information is constructed during synchronisation operation insertion, which is done at statement level, (ii) the order dependence maintained by the synchronisation operations may be broken during instruction scheduling. It is clear that the solution is very troublesome if we try to resolve this problem at instruction level without any information provided by the statement level. Therefore, we convert the dependence information, which is constructed at statement level, into synchronisation markers which are always attached to the dependence events, and then error prevention algorithms are developed to guide the instruction scheduler towards correct scheduling. Basically, if we can satisfy the synchronisation conditions above, the problem is resolved. A marker is a pseudoinstruction, the format of which is either ∧_t or ∨_t. ∧_t, the upmarker, represents the situation in which the synchronisation condition may be violated when the dependence event (Src, Snk, Sig, or Wat) immediately following ∧_t is moved up. Conversely, ∨_t, the downmarker, means that the synchronisation condition may be violated when the dependence event immediately preceding ∨_t is moved down. The variable t identifies whether dependence events belong to the same dependence relation. The problem is overcome if we deposit different synchronisation marker pairs between Src and its corresponding Sig, and between Snk and its corresponding Wat.
When the dependence event is moved up (down), the marker nearest to the event must be moved up (down) together with it as one unit. After instruction scheduling is finished, all inserted markers are deleted, so the original program size does not increase. Now, we consider how to insert the synchronisation markers into the dependence relation. First, we illustrate the example shown in ... With a dependence event, we can classify the event as one of five types of dependence relation: (i) single dependence source, (ii) single dependence sink, (iii) multiple dependence source, (iv) multiple dependence sink and (v) one or more dependence sources and sinks. Let A and B be Src, Snk, Sig, Wat, or Wat_n (Wat_n represents n Wats). An A-B marker is a synchronisation marker pair in which the downmarker ∨_i is inserted immediately following A and the upmarker ∧_i is inserted immediately preceding B. If A is Wat_n, n identical downmarkers are inserted immediately following the n Wats. The synchronisation marker insertion for the five types is shown in Fig. 3. With the single dependence source Src in Fig. 3a, we need only insert a Src-Sig marker to prevent violation of the synchronisation conditions. Similarly, with the single dependence sink Snk in Fig. 3b, the multiple dependence source Src, for which there are n corresponding dependence sinks, in Fig. 3c, and the multiple dependence sink Snk, for which there are n corresponding dependence sources, in Fig. 3d, we need only insert a Wat-Snk marker, a Src-Sig marker, and a Wat_n-Snk marker, respectively. With a data dependence event Src/Snk, for which there are m corresponding dependence sinks and n corresponding dependence sources, in Fig. 3e, ... From the discussion above, we insert, at most, two different marker pairs to resolve all kinds of dependence cases. Now, we show how to insert synchronisation markers into the program. This is done in two steps:
(i) append synchronisation markers into the dependence event of the source program;
(ii) generate the synchronisation markers during the intermediate code generation.
Step 1 is done by a parallel compiler during the insertion of synchronisation operations [8]. A parallel compiler does dependence analysis to decide whether a dependence relation exists. If any dependence relation is found, the synchronisation operations POST and WAIT are inserted into the dependence relation [9]. At the same time, we append an appropriate marker pair between Src (Snk) and its corresponding Sig (Wat). The synchronisation marker insertion is shown in Algorithm 1. An array element A that is a dependence source is replaced with the string A&i, whose corresponding synchronisation marker is ∨_i, and the string i&POST(EV[i1, i2, ..., in]) is inserted after the statement that issues A.
Similarly, an array element A that is a dependence sink is replaced with the string j&A, and the statement IF p WAIT(E)&j is inserted before the statement which issues the sink reference A (p is the mask predicate and E is the contact). The terminal symbol & is a special symbol which combines a synchronisation marker and an array element into one unit. The synchronisation markers are generated during intermediate code generation. For example, the synchronisation operation insertion shown in Fig. 1b can have some markers appended as shown in ... (EV[I1, I2]). Similarly, the string i& is inserted before the array element A[I1, I2 - 2], which is a dependence sink, and the string &i is appended after the corresponding WAIT(EV[I1 - I2 + 2, I2 - 2]). The name represented by id is given by attribute id.name. Elist.ndim records the number of dimensions in the Elist.
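The source and sink rewriting rules described above can be sketched as plain string rewriting. This is only an illustration of the textual form of the annotations, not Algorithm 1 itself: the function names mark_source and mark_sink, the statement syntax, and the example references are assumptions.

```python
def mark_source(stmt, ref, marker, event):
    """Annotate a dependence source: `ref` becomes `ref&marker`,
    and `marker&POST(event)` is placed after the statement."""
    marked = stmt.replace(ref, f"{ref}&{marker}", 1)  # rewrites the first occurrence only
    return [marked, f"{marker}&POST({event})"]

def mark_sink(stmt, ref, marker, predicate, contact):
    """Annotate a dependence sink: `IF predicate WAIT(contact)&marker`
    is placed before the statement, and `ref` becomes `marker&ref`."""
    marked = stmt.replace(ref, f"{marker}&{ref}", 1)
    return [f"IF {predicate} WAIT({contact})&{marker}", marked]

# Hypothetical statements, chosen only to show the shape of the output.
source_lines = mark_source("A[I] := B[I] + 1", "A[I]", "i", "EV[I]")
sink_lines = mark_sink("C[I] := A[I-2] * 2", "A[I-2]", "j", "I > 2", "EV[I-2]")
```

In both cases the marker identifier sits on the side of & that faces the other member of the pair, so the later code generation step can tell whether to emit a downmarker after the event or an upmarker before it.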
1. Stmt → S | PO | WT
2. S → (F | L | F&L | L&F | F1&L&F2) := E
The function limit(array, j) returns n_j, the number of elements along the jth dimension of the array whose symbol-table entry is pointed to by array. The grammar in Fig. 5 generates the three-address code of an array element and inserts synchronisation markers immediately after or before the dependence events of the three-address code. The semantic action of each rule is listed at the right-hand side of the rule. Production rule 1 shows that a statement can be an assignment, PO, or WT statement. Rule 2 describes an assignment statement with a simple identifier (F := E), an array element (L := E), or a dependence event (F&L := E, L&F := E, or F1&L&F2 := E) on the left-hand side of ':=', respectively. Rules 3, 4, 5, and 6 show the arithmetic operations in which the nonterminal symbol P is a simple identifier (P → F), an array element (P → L) or a dependence event (P → F&L, P → L&F, or P → F1&L&F2). Rules 7, 8, and 9 calculate the address of an array element. Rules 10 and 11 generate the synchronisation instruction and its corresponding marker. HWAIT and HPOST are two instructions which are supported by hardware. For the rule P → F&L, F.place records the variable of the synchronisation marker, and F, a synchronisation marker identifier, is placed before L, which is an array element. Therefore, the upmarker ∧_F.place is inserted before L. The corresponding semantic action generates ∧_F.place and ...
The grammar shown in Fig. 5 is LALR(1) [1], so we know that this grammar is realisable. It is left recursive; therefore, bottom-up parsing is employed, and it is implemented with a shift-reduce parser. A handle of a right-sentential form γ is a production A → β and a position of γ where the string β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ. That is, if S ⇒* αAw ⇒ αβw, then A → β in the position following α is a handle of αβw. Basically, shift-reduce parsing is used to find a handle.
The action 'shift' is applied if a handle cannot be found; otherwise the action 'reduce' is applied, which reduces by the production in which a handle is found. The three-address code of statements S1 and PO in Fig. 4 is shown in Fig. 6. In this Figure, we do not allow the synchronisation pair ∧_j and ∨_j to cross each other. In the next Section, we propose three efficient error prevention algorithms to prevent the instruction scheduling from error.
4 Error prevention algorithm
Simple and efficient prevention algorithm
In this Section, we propose algorithms to ensure correct instruction scheduling. The scheduling fails if the marker pair ∧_i and ∨_i meet during the scheduling. To identify dependent events, the synchronisation markers are always attached to the dependent events. There is a directed edge from block B1 to block B2 if B2 can immediately follow B1 in some execution sequence. We say that B1 is a predecessor of B2, and B2 is a successor of B1; B1 precedes B2 if there are directed edges leading from B1 to B2. Algorithm 2 describes how to maintain correct scheduling with out-of-order execution. In this algorithm, assume that the instruction I_x in BB(A), basic block A with offset x, is moved to position y of BB(B) and is called I_y thereafter. This is shown in Fig. 7. For example, if the instruction at position l in Fig. 6 is to be moved across instruction m, this is an allowable movement; however, if it is to be moved across instruction n, the movement fails because the two markers, ∧_j and ∨_j, meet.
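The rule that a move fails when a marker pair meets can be sketched for the simplest case of a single basic block. This is a simplified illustration and not Algorithm 2 itself, which also handles movement between basic blocks: the Instr class, the can_move function, and the example sequence are assumptions, and ordinary data dependence checks are left to the scheduler.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Instr:
    name: str
    up: Optional[int] = None    # id t of an upmarker attached just before this instruction
    down: Optional[int] = None  # id t of a downmarker attached just after this instruction

def can_move(seq: List[Instr], src: int, dst: int) -> bool:
    """Return True if seq[src] may be moved so that it ends up at index dst
    without its marker meeting the matching marker of the same pair."""
    instr = seq[src]
    if dst < src:
        # Moving up: the attached upmarker must not cross its downmarker.
        crossed = seq[dst:src]
        return instr.up is None or all(c.down != instr.up for c in crossed)
    # Moving down: the attached downmarker must not cross its upmarker.
    crossed = seq[src + 1:dst + 1]
    return instr.down is None or all(c.up != instr.down for c in crossed)

# Hypothetical block: a dependence source, an ordinary instruction, its POST, another instruction.
block = [
    Instr("store A (Src)", down=1),
    Instr("add"),
    Instr("HPOST", up=1),
    Instr("mul"),
]
```

With this block, moving the source down past "add" is allowed, while moving it past HPOST is rejected because the downmarker and upmarker with the same identifier would meet. Instructions that carry no marker are never rejected by this check.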
Two modified prevention algorithms
In this Section, we slightly modify Algorithm 2 by separating the dependence event from the basic block to improve efficiency. Therefore, the lower bound is b + (m - n) + 2n = b + m + n. On the other hand, if a synchronisation block is moved to the middle of an original basic block, each block will be divided into two subblocks. Therefore, the upper bound of the total number of blocks is b + (m - n) + (m - n) + 2n + 2n = b + 2(m + n). From the discussion above, the total number of blocks including HPOST and HWAIT synchronisation blocks lies between b + m + n and b + 2(m + n). Therefore, the total number of blocks in Algorithm 3 is greater than the original number of blocks and depends on the number of dependence events and the position of the synchronisation blocks. For example, the flow graph of Fig. 6 is shown in Fig. 8a. In this Figure, the HPOST synchronisation block is isolated from the original block. The instruction in position l can be moved to position m because blocks A and B are in the same block. In another case, assuming the instruction in position l is to be moved to position n, we inspect the HPOST synchronisation block and find that a synchronisation marker pair meets. Therefore, the scheduling fails. We only inspect one synchronisation block, and this is more efficient than Algorithm 2. However, if instruction HPOST(t10) is to be moved to position m, the scheduling succeeds but block splitting and merging are executed. Now the inspection should be done statement by statement instead of block by block. After block splitting and merging, the changed flow graph is shown in Fig. 8b.
The error checking is done in a synchronisation block because the dependence events have been isolated in synchronisation blocks. Therefore, we inspect synchronisation conditions only when I_x is in a synchronisation block. The time complexity for I_x being a dependence event is O(c), where c is the number of blocks between blocks A and B. If every dependence event is formed as a synchronisation block, what is the range of the total number of blocks including synchronisation blocks? For (m - n) dependence events, there are at least (m - n) synchronisation blocks and (m - n) blocks which consist of a dependence event and its marker. Similarly, for n dependence events which are multiple dependence sources/sinks, we need at least 2n synchronisation blocks and n blocks. Therefore, the lower bound of the total number of blocks is b + (m - n) + (m - n) + 2n + n = b + 2m + n. Similarly, if a synchronisation block is moved into the middle of an original basic block, each block will be separated into two subblocks. The upper bound of the total number of blocks is b + 2[(m - n) + (m - n)] + 2(n + n + n) = b + 4m + 2n. Therefore, if every dependence event is formed as a synchronisation block, the total number of blocks including synchronisation blocks lies between b + 2m + n and b + 4m + 2n. From the discussion above, the number of blocks in Algorithm 4 is greater than in Algorithm 3. This will increase the amount of block splitting and merging. However, the time complexity is O(c) if I_x is a dependence event, where c is the number of blocks. For example, the flow graph of the program segment in Fig. 6 is shown in Fig. 8c. In this Figure, the movement from position l to position m or n is similar to that in Algorithm 3, and block splitting and merging would be executed. The flow graph after moving the instruction in position l to position m is shown in Fig. 8d.
To move instruction HPOST(t10) to position l', we need only check two blocks to find out that the scheduling is not allowable. For Algorithms 2 and 3, this must be inspected statement by statement. In general, Algorithm 2 is a simple but efficient error prevention algorithm. However, if I_x is a dependence event and the number of instructions between I_x and I_y is large, the performance of Algorithms 3 and 4 is better than that of Algorithm 2.
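The block-count ranges derived in this Section can be checked numerically with a short sketch. The meanings of the symbols are inferred from the surrounding text and should be read as assumptions: b is taken as the number of original basic blocks, m as the number of dependence events, and n as the number of those events that are multiple dependence sources/sinks. The function names are hypothetical.

```python
def block_range_alg3(b, m, n):
    """Range of the total block count when only HPOST/HWAIT form synchronisation blocks."""
    lower = b + (m - n) + 2 * n            # simplifies to b + m + n
    upper = b + 2 * (m - n) + 2 * (2 * n)  # simplifies to b + 2(m + n)
    return lower, upper

def block_range_alg4(b, m, n):
    """Range of the total block count when every dependence event forms a synchronisation block."""
    lower = b + 2 * (m - n) + 3 * n              # simplifies to b + 2m + n
    upper = b + 2 * (2 * (m - n)) + 2 * (3 * n)  # simplifies to b + 4m + 2n
    return lower, upper
```

For instance, with b = 10, m = 4 and n = 1 the two functions give (15, 20) and (19, 28), which agrees with the closed forms b + m + n, b + 2(m + n), b + 2m + n and b + 4m + 2n.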
Conclusion
The major goal of the superscalar-based multiprocessor is to exploit loop and instruction parallelism. This requires reconsideration of several interesting problems, such as error prevention (discussed in this paper), scheduling approaches, and other related compiler techniques. We have proposed an approach to resolving the problem of instruction scheduling with out-of-order execution. The most important contribution of this paper is that it shows how to prevent the instruction scheduling from error by providing synchronisation markers to guide the instruction scheduler towards correct scheduling. In most cases, all algorithms proposed are very efficient (O(1)).
The synchronisation marker method is suitable for post scheduling and prescheduling.
scheduling decision from the attributes of each block.
