This paper examines the effectiveness of decoupling as an optimization technique for high-performance computer architectures. Decoupled access/execute architectures are described, and the concept of control decoupling is introduced and justified. A description of a highly-decoupled architecture is given, and a metric for the effectiveness of decoupling on particular programs, the Loss of Decoupling frequency, is introduced. Finally, a number of benchmark programs are examined and the applicability of decoupling to them is analyzed.
1 Introduction

A number of papers have discussed the architectural optimization of decoupling over the last decade (see [2], [6], [7] and [8]). This paper introduces control decoupling, a further technique for increasing performance, and attempts to identify the class of programs over which decoupling is an effective technique for achieving high performance.
The instruction set of modern computers (see, for example, reference [5]) is partitioned into three classes of instruction: control, memory accessing and data operations. The decoupled architectures that have been described to date, for example the ZS-1 [2], PIPE [6] and WM [8], decouple between the latter two classes, using "Decoupled Access/Execute," in which the addresses for memory references are generated in advance of the execution of data-related instructions. This means that memory read operations can be initiated many cycles before the read data is required for execution, and the latency of main memory read operations (or cache operations) can be hidden. Decoupled Access/Execute is described in detail below, in section 2.
Less conventional is the use of control decoupling. This architectural feature is introduced in order to maximize the use of main memory bandwidth in an implementation, by exploring the control-flow graph of a program ahead of the time at which computation is required: this enables requests for packets of computation to be queued ahead of the time at which they are required, so that when one computation has finished, another is ready to take its place on the relevant unit. Control decoupling is described in detail below, in section 3.
In order to determine how valuable these optimizations are, in section 5 we introduce a conceptual framework for program events that cause decoupling to break down: these Loss of Decoupling events are of central importance to understanding the performance of decoupled architectures. In section 6 we briefly discuss the performance impact of Loss of Decoupling events. In section 7, we examine a range of popular benchmark programs in order to understand the prevalence of these events.
2 Access Decoupling
In the architecture described here, Access Decoupling is implemented by partitioning user instructions into two classes: memory accessing and user arithmetic.
The memory accessing instructions are run on a special-purpose processor, the Address Processor (AP), whose function is optimized for producing regular patterns of addresses, such as are found in numeric programs. The most widely used instruction in the AP is an addition operation, which adds register plus register or register plus immediate, writing the result to another register and initiating either a memory read or a memory write operation. Apart from simple additions, the AP has no other data-handling capability: it has no general multiplication, division or logical operations.
The user arithmetic instructions are also run on a special-purpose processor, the Data Processor (DP). This has a full set of integer and floating-point arithmetic and logical operations, but no memory addressing instructions at all.
The AP and DP are connected by two types of queues: the Load Data Queue (LDQ) and the Store Address Queue (SAQ). Entries in these queues are made when the AP generates memory addresses, but the two types of queues are used substantially differently. In the architecture described in this paper there are two independent LDQs connected to the two memory read ports, and one SAQ to drive the single memory write port. The relationships of these processors and their queues can be seen in Figure 1 below. When the AP generates a memory request for a Load operation, the next free element in the relevant Load Data Queue is allocated and marked "pending".
A request for a memory read at that address is sent to the memory interface. When the read data is returned from memory, it is placed into the queue element reserved at the time of the request, and the element is marked as "valid".
The Load Data Queues are available as source operands to instructions running in the DP, and an instruction that reads a queue suspends until the top entry in the queue is marked "valid", and then pops this element. (In fact the queues are mapped to two general-purpose registers.) In this manner, every read memory access is carried out by two instructions, one on the address processor to initiate the request and one on the data processor to access the data.
Store operations are handled differently.
When an address for a store operation is generated by the AP, the address itself is written to the next free element of the Store Address Queue and marked "valid". Store operations are initiated in the DP, where every arithmetic instruction contains a "store" bit. When this bit is set, the result of the instruction is sent as data to main memory, in addition to being written to a general-purpose register. When an instruction with its "store" bit set completes, the oldest entry in the Store Address Queue is popped, and used as the address for a memory write operation with the data generated by the instruction.
In this way, the AP may proceed through a program, keeping ahead of the place where the DP is executing. This has the beneficial effect of initiating memory read operations early, reducing the impact of memory latency on execution time. In a fully decoupled program, once the decoupling between processors has been established, the execution time is insensitive to latency, provided that the main memory offers sufficient bandwidth to support the request rate generated by the AP.
The purpose of generating store addresses earlier than performance arguments would require is to ensure correct functionality: the instruction-set specification of the architecture defines that memory operations have their semantics determined by the order in which read and write operations are initiated by the AP. If the address of a write operation, for example, is used as the address of a subsequent read operation before the data for the write has been generated in the DP, a comparator detects that the read cannot proceed, and the AP is stalled until the condition can be resolved.
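As a hypothetical illustration of why this ordering matters (the names below are invented for this sketch, not taken from the paper's figures), consider a fragment of the following shape:

C     Hypothetical fragment: the AP queues the write address of
C     A(K) in the SAQ before the DP has produced X*Y; if the
C     subsequent read address A(L) matches the pending store
C     (L = K), the comparator stalls the read until the store
C     data arrives from the DP.
      A(K) = X * Y
      Z = A(L)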
3 Control Decoupling
Control Decoupling is a further optimization that permits a Control Processor (CP) to execute yet further in advance of the AP. The first step is to give the Address and Data Processors the capability of running program inner loops, that is, of running a body of instructions a number of times, determined either by a loop count or terminated in a data-dependent manner. Apart from simple loops, the Address and Data Processors have no other control capability: they have no conditional jumps or subroutine call instructions.
All major control functionality, that is, non-inner-loop control, subroutine call and return, and dispatching inner loops to the Address and Data Processors, is concentrated in the CP. Since user computation is carried out in the DP, the CP does not need floating-point capabilities, but it does provide a full set of logical operations, integer addition and subtraction, integer multiplication (for array index calculations; inner loops have their index arithmetic strength-reduced), integer division (for loop normalization), and a full set of comparison and conditional and unconditional branch operations that would be familiar to the programmer of any conventional RISC instruction set. Memory accessing (both load and store operations) is provided conventionally in the CP instruction set.
In order to reduce the interaction between Control and Data Processors, the DP supports an elaborate conditional-execution scheme, with a full set of comparison operations on integer and floating-point operands, conditional execution of any of its instruction set, and a comprehensive set of condition combination instructions, which permit the compilation of nested IF...THEN...ELSE statements into guarded execution.
The CP invokes operations of the Address and Data Processors using a special instruction, which dispatches a unit of work called an Instruction Fetch Block (IFB) to one of them. An IFB contains a pointer to the first instruction, a length field specifying the number of instructions to issue, and a loop count identifying how many times to issue these instructions. Instruction Fetch Blocks are enqueued by the CP for both the Address and Data Processors: when a processor has finished issuing instructions from one IFB, it can proceed to issuing instructions from the next block, if there is one in the queue, without delay. A set of parameter queues is provided to allow the CP to pass data items to the Address and Data Processors.
In this manner, the normal operation of the system is that the CP is executing instructions from the later part of a program. At the same time, the AP is executing instructions from an earlier part of the program, and the DP is executing from a still earlier part. In this way, all three processors are fully decoupled.
The benefit of Control Decoupling is that while one inner loop is running on the AP, preparatory work for subsequent loops, such as loop count calculation and array subscript arithmetic, may be carried out in parallel, ensuring that no time is lost in the AP between inner loop bodies.
A description of events during program execution that cause this decoupling to break down is found below, in section 5.
3.1 CP Memory Ordering
All addresses produced by the CP are compared against all pending DP write operations, whose addresses are held in the Store Address Queue, and the CP is stalled if any conflict arises. Nevertheless, a logical inconsistency may arise if the CP tries to read data that will be written by the DP in an earlier part of the program, if the address has not yet been generated because the AP has not yet reached that part of the program.
In this system, it is the function of the compiler to eliminate such inconsistencies by preventing the CP from decoupling from the AP when an access of this type is possible.
4 Decoupled Execution: an Example
The diagram in Figure 1 shows an overall picture of the control- and address-decoupled machine. The way in which the two forms of decoupling occur can be explained by reference to an example. Consider the code fragment below:
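(The original fragment is lost in this extraction; the code below is a plausible reconstruction, assuming the loop-nest shape implied by the discussion that follows.)

      DO 10 I = 1, N
        DO 20 J = 1, M
          A(I,J) = S * B(J) + C(I,J)
   20   CONTINUE
   10 CONTINUE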
The inner loop, in J, is executed autonomously in the Address and Data Processors: in each iteration, the AP initiates memory reads for B(J) and C(I,J) and queues the write address for A(I,J), and the DP multiplies one Load Data Queue element by the register holding S, adds the other Load Data Queue element, and stores the result. The AP relies on the availability of stride values for the A(I,J) and C(I,J) accesses to avoid the need to do full multiplications within the loop.
The CP implements the loop in I, and prepares the Instruction Fetch Blocks and stride values for the AP and the DP. These are passed to these processors via the queues, enabling the CP to iterate around the loop in I without re-synchronizing with the Address or Data Processor on each iteration.
5 When does Decoupling Break Down?

Figure 2 gives a framework for discussing the influence of program events on both control and access decoupling. The broad arrows show how the decoupling optimization is successful when the CP is transferring information to the AP, and when the AP is transferring information to the DP. The numbered arrows, in the reverse direction, represent inter-processor dependencies that can interfere with decoupling by requiring that the CP or AP wait for a later unit to "catch up". We call each of these events a "Loss of Decoupling" (LOD); each numbered arc in Figure 2 corresponds to exactly one type of LOD event.

Figure 2. Events that May Destroy Decoupling
5.1 Computed Index Operations
Arc 1 in Figure 2 represents the case of the AP needing to wait for the DP before initiating further operand fetches: this case arises when a value computed in the DP must be conveyed back to the AP to take part in address formation. A program fragment causing such an event is illustrated below:
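(The original fragment is lost in this extraction; the sketch below is a hypothetical stand-in, in the spirit of the Particle-In-Cell codes mentioned next, with invented names: a real-valued position is quantized into a grid index.)

C     Hypothetical sketch: the grid index K is computed from
C     floating-point data in the DP, then used by the AP to form
C     the address of RHO(K), forcing the AP to wait for the DP.
      DO 10 I = 1, NP
        K = INT(X(I) / H) + 1
        RHO(K) = RHO(K) + Q(I)
   10 CONTINUE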
This case is rare across the full range of numeric codes: it arises in "Particle-In-Cell" codes, where a real particle position is calculated and then quantized into a polygonal grid, and in Monte Carlo codes, in which a discrete item is selected using a real-valued random number generator.
5.2 Conditional Control Flow Operations
Arc 2 in Figure 2 shows the CP needing to wait for a condition to be evaluated in the DP before it can issue further instructions: at first glance, this event might seem to occur every time a conditional statement is executed. However, since the DP can implement conditionals internally, through the medium of conditional execution, this event only occurs when a condition causes a major change of sequence in the Control Flow behaviour of the program, for example, when a program conditionally calls a subroutine, or conditionally executes an entire loop body. A program fragment illustrating this type of event is shown below:
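(Again the original fragment is not reproduced here; a hypothetical example of a major, conditionally executed change of sequence, with invented names, might look like this.)

C     Hypothetical sketch: the condition is evaluated in the DP,
C     and the CP must wait for it before deciding whether to
C     dispatch the work in UPDATE, causing an LOD on arc 2.
      DO 10 I = 1, N
        IF (T(I) .GT. TMAX) CALL UPDATE(I)
   10 CONTINUE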
Other types of program fragments which cause this type of LOD event are loops containing conditional exits, and conditional subroutine returns. Again, this type of event is rare in most numeric codes, and most occurrences prove to be highly predictable using a branch prediction scheme, giving the compilation system a good method for minimizing the performance impact of these events.
5.3 Control/Data Aliases
Arc 3 in Figure 2 shows the CP needing to wait for the AP to "catch up". This event needs to be generated when the compiler detects a possible read-after-write hazard between a CP read and a write by either the AP or DP. A fragment of code illustrating this event is shown below. The loop bound for the second loop is required before it can be dispatched: however, since the bound may be computed by the previous loop, the CP must be prevented from reading IV(100) before its new value has been computed. In the ACRI system, it is safe for the CP read to proceed as soon as it is known that all write addresses for the first loop have been generated, since a run-time "alias with outstanding write" check is carried out on all entries in the Store Address Queue, and a read that does conflict is stalled until the data is generated by the DP.
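(A reconstruction of the lost fragment, inferred from the reference to IV(100) above; the loop bodies are hypothetical.)

C     The bound IV(100) of the second loop may be written by the
C     first loop, so the CP must not read it until the AP has
C     generated all of the first loop's write addresses.
      DO 10 I = 1, 100
        IV(I) = IV(I) + NV(I)
   10 CONTINUE
      DO 20 J = 1, IV(100)
        W(J) = 0.0
   20 CONTINUE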
5.4 Sparse and Pointer Operations
Arc 4 in Figure 2 shows an AP-AP dependency, which is significant when the AP needs to wait for an AP memory read to finish before it may initiate a further memory access. A program fragment illustrating this event is shown below:
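(The fragment is again lost; the sketch below assumes the typical sparse gather/scatter shape described next, with hypothetical names.)

C     Hypothetical sketch: the AP must read the index IX(I) from
C     memory before it can form the address of A(IX(I)).
      DO 10 I = 1, N
        A(IX(I)) = A(IX(I)) + V(I)
   10 CONTINUE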
In this fragment, the AP needs to fetch the value of the IX element before it can fetch (or generate a write address for) the A array element. This is typical code for sparse-vector or sparse-matrix operations, and while it is clearly necessary to optimize this for those applications, the occurrence of these events in non-sparse numeric codes is again rare.
Other types of inter-unit dependencies are much less significant than the four identified above. A DP-DP arc, for example, would indicate that a subsequent operation depended on a previous operation, both in the DP: this is a case that occurs in all pipelined machines, and is resolved by a combination of pipeline forwarding, code generation and stalling. A CP-CP arc represents a control-control dependency, of the type that occurs in conventional RISC microprocessors.
Again, performance is maximized in these circumstances using conventional architectural techniques: caching minimizes the impact of memory latencies and compiler code scheduling maximizes the overlap of operations.
6 The Cost of an LOD

When the AP is ahead of the DP by an amount of time which is greater than the memory latency, each LDQ pop performed by the DP adds nothing to the program execution time. In effect, the latency associated with reading that piece of data is zero. It is therefore useful to talk about the perceived latency: the mean number of cycles the DP must stall each time it tries to pop an item off an LDQ. In the absence of LODs, the perceived latency (after an initial start-up period) is always zero. This is, of course, an ideal situation, and in practice the dependencies outlined in section 5 lead to the occasional re-coupling of the AP and DP (as well as the CP and the AP and/or DP). To assess the impact of LOD events on the performance of the system we can use a simplistic LOD-penalty model.
Let t_min be the idealized execution time of a program when memory latency is zero, let P_lod be the mean penalty incurred in the DP whenever an LOD occurs, and let N_lod be the number of LODs that occur in the program. Naturally, the total execution time is

    t_total = t_min + N_lod * P_lod

We can think of the mean LOD penalty as somewhat equivalent to the start-up time of a vector operation, although there are good reasons to believe that LODs are strictly less frequent than vector start-ups. Following on from our definition of t_total, we can say that an efficient decoupled system will have a low LOD penalty and requires a compiler which optimizes for minimal LOD frequency.
7 The Frequency of LODs
To examine the frequencies of the various types of Loss of Decoupling events, we turned to the Perfect Club benchmark suite, which contains 12 programs chosen from a range of different supercomputer application areas, running on problem sets which are small enough to enable investigation on workstation-sized computing environments.
The methodology we adopted was to profile these benchmarks using conventional Unix tools (prof, Sun's tcov, and MIPS's pixie programs). In common with many other applications, they show significant instruction locality, in that a small number of routines in each program contributes a large fraction of the execution time. We identified the areas of the programs that dominate the computation, and examined those routines for the syntactic causes of Loss of Decoupling events. Where these events were identified, we describe the impact of the Loss of Decoupling, and suggest ways that the compiler or applications programmer might reduce this impact.
The analyses for six benchmarks are presented here: they are SPICE, a circuit simulation package; OCEAN, an oceanographic modelling program; BDNA, a molecular dynamics program which computes interactions between DNA molecules and an ionic solution; DYFESM, a finite-element package; MDG, a program which simulates the dynamics of water molecules; and QCD, which performs Monte Carlo simulations of quantum chromodynamics. We do not attempt a systematic study of Loss of Decoupling frequency in this paper; this is currently on-going work and will be reported in a subsequent publication.
7.1 Analysis of SPICE
The statements most frequently executed in the SPICE benchmark occur within subroutine DCDCMP. This routine, whose purpose is to perform an LU factorization of the matrix giving the coefficients of the circuit, is called by the circuit solver. The functioning of the routine, part of which is illustrated in Figure 3, is significantly obscured by the data representation used and by the fact that SPICE uses an internal memory management package. The code fragment shown in the figure searches for an element located at (i,j) in the coefficient matrix, and adjusts its value when it is found. It is clear that a large number of addressing computations are necessary to support each data operation; these are inevitable with the data representation chosen, which allocates elements of the matrix in a vector with no direct mapping of the matrix row and column number to the position in the vector. This sparse allocation is, in turn, inevitable given the constraints of the problem: the matrix represents information between different "nodes" in the circuit, and is necessarily largely full of zeros since most nodes are not connected to most other nodes. In this fragment, however, nearly 3.3 million index array references take place in order to perform 1.4 million data references.
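(Figure 3 is not reproduced in this extraction; the sketch below is a hypothetical rendering of the kind of pointer-chasing search described, with invented names, not SPICE's actual code.)

C     Hypothetical sketch: follow the row's element chain through
C     the index vectors until the column matches, then update the
C     value. Every index reference causes a control transfer
C     before any data reference is made.
      K = IROW(I)
   10 IF (JCOL(K) .NE. J) THEN
        K = NEXT(K)
        GO TO 10
      ENDIF
      VALUE(K) = VALUE(K) + DELTA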
This routine is essentially non-decouplable in any reasonable computer structure, since every index array reference causes a control transfer before the routine commits to making a data reference. It is very hard to see any compiler-implemented transformation that would improve this, and the only alternative for users wishing to get significant speedups would be to re-code the algorithm using a more sympathetic data structure, perhaps taking advantage of the much larger amounts of physical memory that are available on machines more recent than those for which SPICE was originally written.

The situation with the next most frequent group of statements is much healthier. There are no address recurrences, and no control recurrences in this routine: the control flow is independent of all addressing and all data arithmetic. This routine therefore decouples fully, in spite of the fact that the loop counts are small. It would not benefit greatly from vectorization on a different architecture, but can exploit decoupling; it is responsible for some 2.4 million floating-point operations in this benchmark.

The third group of most frequently executed instructions is within function MEMPTR, whose purpose is to validate a 'pointer' (actually an array subscript) within the SPICE internal memory management package. The core of this routine is shown in Figure 4; it performs straightforward computation on array elements indexed by simple strided subscripts, with no recurrences, ensuring that no loss of decoupling occurs.

7.2 Analysis of OCEAN

The OCEAN benchmark is interesting in that the two assignments that are most frequently executed (166 million times) are straightforward array copying operations. It is possible that this arises because the benchmark has been 'scaled down' to a reasonable size: the full-size production code may have a different ratio of computation to copying. Two further groups of frequent statements are concerned with copying array elements (in COPY4) and zeroing data (in ZERO8). These are trivially decouplable.

The most intensive computation occurs inside a complex FFT routine, a fragment of which is shown in Figure 5. Each inner-most loop has a high loop count, and no address recurrences to prevent full exploitation of decoupling. The computation of the array index JS can be strength-reduced to a single addition within the loop body, and even the major control transfers (the two arithmetic IF statements on JL) may be evaluated while previous loops are continuing to execute, achieving full control decoupling in addition to the access/execute decoupling.
7.3 Analysis of BDNA
BDNA calculates dynamic interactions between organic and nonorganic molecules in a complex polarized environment.
The vast majority of computation time is spent in subroutine ACTFOR, which calculates the interaction between each possible pair of atoms in the environment. The most frequent statements are shown in Figure 6. These calculate the distances between every pair: the array IND is set up to point to every atom that is within 8 Angstroms of the atom I, and a huge body of code (332 lines containing 265 additions and subtractions, 137 multiplications, 23 divisions, 14 square roots and 13 exponentials) is run over that set of atoms. Although this second loop executes with a mean loop count of less than 27, the fact that the loop body is so large means that accesses to IND can be successfully pre-queued by the CP, and hence a potential loss of decoupling point is avoided.
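(Figure 6 is not reproduced; a hypothetical sketch of the gather structure described, with invented names, is:)

C     Hypothetical sketch: IND holds the neighbours of atom I;
C     the large force computation is run over that gathered set.
      DO 10 K = 1, NNB
        J = IND(K)
        DX = X(I) - X(J)
        DY = Y(I) - Y(J)
        DZ = Z(I) - Z(J)
        R2(K) = DX*DX + DY*DY + DZ*DZ
   10 CONTINUE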
7.4 Analysis of DYFESM
The DYFESM program performs two-dimensional finite element structural analysis using the Explicit Leap Frog method. A large proportion of the execution time is spent in a small number of subroutines. When profiled on a Sun SPARC system, using prof, the time spent in the top four routines accounts for over 85% of the execution time, and on an Alliant FX/80 these same routines account for over 93% of the execution time [3].
7.4.1 Subroutine MATMUL
The MATMUL subroutine, shown in Figure 7, accounts for around 60% of the execution time of DYFESM when executed on a scalar processor such as that found in a SPARCstation. This is an inherently vectorizable routine, and, for example, accounts for less than 37% of the execution time on an Alliant FX/80.
The routine contains a triple-nested set of DO loops, which perform a matrix multiplication as a linear combination of columns. The only statement which could possibly interfere with the decoupling of the inner loop is the statement:
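(The statement itself is lost in this extraction; from the discussion that follows, it is presumably a zero-test guard of roughly this form:)

C     Hypothetical reconstruction: skip the column update when
C     the multiplier is zero.
      IF (TEMP .EQ. 0.0) GO TO 10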
The intent of this statement is to prevent unnecessary computations from taking place when the multiplier (TEMP) is zero. In fact TEMP is rarely zero. However, even with this statement in, no loss of decoupling need occur. If all of the non-leaf loops, and all scalar statements outside of non-leaf loops, are executed on the CP, then we can be sure of avoiding any dependency that might cause a loss of decoupling. Here we are assuming that the compiler can detect that there is no overlap between the B and C arrays. This is one example of the case where, in a multiply-nested loop structure, there is no loss of decoupling on a branch provided that there is no loop-carried dependence from a leaf-loop computation to an outer (non-leaf) scalar computation.

7.4.2 The Forward Solve

The entire forward solve phase decouples perfectly. However, the value of B(I) defined in iteration I of the outer loop is then used in iterations I+1 through N. The normal process of pre-loading values for A(,) and B() leads to Read-After-Write hazards in the Load Address and Store Address queues of the AP, particularly during the early iterations when I is small compared with the decoupling distance and the dynamic flow distance is short. In the architecture model assumed in this paper, such memory-RAW hazards are detected by the associative match circuitry in the SAQ and tagged. When the corresponding store data is produced it is automatically forwarded to the appropriate LDQ at the correct position in the queue. This bypass mechanism prevents the compiler having to insert an algorithmic LOD after the completion of each inner loop, which is what would effectively happen in a vector machine.
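(A hypothetical sketch of the forward-solve recurrence described above, with invented names:)

C     Hypothetical sketch: B(I) computed in iteration I is used
C     by all later iterations, so pre-loaded B() values may hit
C     RAW hazards against stores still pending in the SAQ.
      DO 10 I = 2, N
        DO 20 J = 1, I - 1
          B(I) = B(I) - A(I,J) * B(J)
   20   CONTINUE
   10 CONTINUE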
7.5 Analysis of MDG

7.5.1 Subroutine INTERF

The INTERF subroutine calculates inter-molecular interaction forces in three dimensions. For the most part it decouples very well, but there are two places where loss of decoupling appears to be unavoidable.

The first involves the references to Ml(K) and M(I,M1(K)+J,N); the indirection through Ml allows the matrix M, which is symmetric upon interchange of I and J, to be stored in a compressed form. The effect this has on decoupling depends on how the compiler decides to treat these references. If the AP reads Ml(K), waits until the value arrives from memory, and then computes the address for M(I,M1(K)+J,N) before reading the correct location, then decoupling will be lost. However, there are three ways around this problem:

1. Let the CP prefetch the values of the Ml vector.
2. Let the AP issue non-blocking loads to the Ml vector within the AP's inner loop.
3. Implement an address cache in the AP so that the average latency for accessing subscripted indices is reduced to a tolerable level.
In the calculation of inter-molecular forces, a test is made to find out if the distance over which an integration occurs is greater than some threshold. If the test is true for all possible interactions on a molecule, then the code which computes forces is skipped. We can see this occurring in the statement:
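(The statement is lost in this extraction; given the discussion below of the conditional jump to label 1100 guarded by KC, it is presumably of roughly this form:)

      IF (KC .EQ. 0) GO TO 1100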
This is executed 5,923,953 times. The value of KC is computed in the Execution Unit within the immediately preceding loop. There is no loss of decoupling within the loop which computes KC, since we can use if-conversion to turn the statement IF (RS(K) .GT. CUT2) KC=KC+1 into a guard computation followed by a guarded increment. However, converting the conditional jump to label 1100 into guarded execution would be difficult and possibly counter-productive, since the guarded region is large, and not executed in approximately 37% of the cases. This is a situation where run-time information can be extremely useful to a compiler: the decision about whether to do if-conversion is a pragmatic one, and depends on dynamic program behaviour. A similar structure occurs later on in the program, and a further 586,530 loss of decoupling events accrue.
This program is perhaps unusual for a scientific application, in that the most frequently executed subroutine contains a loss of decoupling. However, even when that happens, the relative frequency of loss of decoupling is still low. According to the definition of MFLOPS for this program, there are over 3.4 billion floating point operations. Any processor capable of issuing one floating point add and one floating point multiply per cycle will therefore have an execution time greater than 1.7 billion cycles, and in practice a number of effects will conspire to extend the minimum execution time somewhat beyond that. We can immediately state that the smallest average interval between loss of decoupling events in this program cannot be less than 1.7E9/6.5E6 = 262 cycles.
When LODs are close together in time, the associated penalty is likely to be close to the mean memory access time (plus epsilon), but if LODs are widely spaced out in time, then the associated penalty will be closer to the maximum memory access time. Thus, a program with clustered LODs will fare better than a program with evenly-spaced LODs. In MDG the LODs are well spaced out, and will probably experience a comparatively high penalty.
7.6 Analysis of QCD
The QCD program performs a Monte Carlo simulation for quantum chromodynamics using the Pseudo Heat-bath algorithm. On a CRAY Y-MP this program has been measured at a little over 4% vectorizable [1] , but a hand-tuned Y-MP/832 version has been benchmarked at 270.9 MFLOPS compared with the baseline compiler version (same machine) which runs at just 13.0 MFLOPS [4].
There are nominally 2.59 billion floating point operations in the benchmarked run for this program.
7.6.1 Subroutine MULT
The MULT subroutine contains 18 complex scalar expressions, and this is one of the main reasons that this program vectorizes poorly. However, there are no algorithmic structures which could lead to loss of decoupling events, and so we must conclude that this routine will decouple completely; no branch instructions need to be executed within this subroutine. Any problems with LODs during execution of this routine must occur in the calling context just prior to the call to MULT.
The DAG for this subroutine contains no common sub-expressions, but many multiple uses of input values.

7.6.2 Subroutine SYSLOP

The computations other than the calls to mult, cpymat and udag ought to be executed on the CP, including the assignments to SETFLG at the leaf-level within the IF tree. Dependence analysis indicates that there are no dependencies from the calls of mult, cpymat and udag to any of the subsequent CP computations. If code is partitioned in this way, then all potential loss of decoupling points are removed. This does, however, place a significant load on the CP, which then requires a floating-point arithmetic capability. If the CP computations within the WHILE loop take longer to execute than the calls to mult, cpymat and udag, then the CP will be the bottleneck. Otherwise the computation will proceed at the rate determined by mult, cpymat and udag, and we have seen that in the case of mult the rate is close to peak. Here is a situation in which control decoupling provides a very significant advantage.

Figure 12. Extract from the SYSLOP routine (QCD)

7.6.3 Subroutine PRNSE2
One subroutine which appears to cause problems for a decoupled architecture is PRNSE2. This contains a very deeply nested loop structure (six loops deep), with an IF statement at the inner-most level. This can spell trouble for a decoupled machine, but in this case the body of the THEN part is substantial enough that the loop trip time for the CP computation for the inner loop ought to be shorter than the AP and DP parts. This means that the CP uses its control decoupling at the inner-most loop level to pre-compute the IF conditions and dispatch the inner-most blocks. Observing the execution profile information, we see that the IF evaluates TRUE in only 88 out of 398 cases (approximately 22% of the time). So, on average, the CP must go around the inner loop 4.5 times for each dispatch of the inner loop to the AP and DP. It will help greatly if the EPSILO array can be cached "close" to the AP and DP, and accessed also by the CP.

Figure 13. Extract from PRNSE2 (QCD)
An alternative way to remove LODs is to re-structure the loop (a typical hand optimization). This could be done by splitting the loop structure into two: the first would compute a vector of boolean conditions, and the second would read those conditions and decide whether to compute the inner-loop body. Note that guarded execution does not help in this case, since the code body is large and rarely executed, but branch prediction coupled with speculative dispatch operations is a potentially useful optimization.
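(A hypothetical sketch of this splitting transformation, with invented names standing in for the PRNSE2 loop nest:)

C     Hypothetical sketch of the two-pass re-structuring: pass 1
C     evaluates the conditions into a logical vector; pass 2 lets
C     the CP read them and dispatch the large body only when
C     needed.
      DO 10 K = 1, N
        COND(K) = EPSILO(K) .GT. 0.0
   10 CONTINUE
      DO 20 K = 1, N
        IF (COND(K)) CALL BODY(K)
   20 CONTINUE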
7.6.4 Loss of Decoupling in QCD

Under the assumptions that the CP has floating point capability and that the potential LODs in PRNSE2 are overcome, there will be very few LODs in QCD. It is worth noting that even the optimized (single processor) version of QCD only attains a performance of 44 MFLOPS on the CRAY Y-MP. This is mostly due to scalar register pressure, and the consequent register spill operations (accounting for approximately 27% of all operations).
8 Conclusions
We have presented control decoupling, a technique for extending the benefits of decoupling to a higher level of abstraction than in previously described decoupled architectures. The principal attraction of control decoupling is that the control flow graph of a program can be searched by the CP in advance of the AP and DP, so that events which would otherwise cause an LOD in a purely Access/Execute decoupled architecture do not necessarily disrupt the flow through the AP-memory-DP pipeline. In many cases speculative traversal of the control flow graph of a program by the CP will further improve performance: many control decisions are highly predictable, and so the speculative dispatch of work to the AP and DP is likely to be rewarded.
We have described how particular features of source programs cause loss of decoupling in a three-way decoupled system, and how they negatively impact processor performance, and we have examined a range of benchmark programs for the dynamic incidence of these events.
We conclude from this evidence that decoupling is a very powerful technique for minimizing the impact of memory latency, and that it is applicable to a wider range of programs than other architectural optimizations.
In particular, we have shown that syntactic LOD events do not always occur at points in a program where one expects to find a vector start-up penalty in a vector machine. As a loss of decoupling event has a penalty somewhat similar in magnitude to a vector start-up, we suggest that control-decoupled architectures offer potentially much higher efficiencies than existing vector machines.