This paper discusses software pipelining for a new class of architectures that we call transport-triggered.
They also exhibit code scheduling possibilities which are not available in operation-triggered architectures. In transport-triggered architectures, instructions specify the transport of data only and, therefore, require the transport to trigger the operations.
All recent RISC, VLIW, and superpipelined architectures can be classified as operation-triggered. The class of MOVE architectures presented in [5] and certain micro architectures (e.g., [6]) can be classified as transport-triggered.
In the next subsections we first go into some more details of the MOVE concept, explaining how to program this class of architectures, and indicate some of its advantages (for a complete qualitative analysis see [5]).

Figure 2: The MOVE code for an addition followed by a store.
In order to make the MOVE concept more clear, an example is given in figure 2. The example shows the addition of GPRs r_a and r_b and a store of the result at memory address r_c. The code starts with moving r_a and r_b to the operand and trigger registers of the integer unit (registers io and add). The trigger register is accessed at the location that causes an addition on the integer unit. The operand move(s) must always be done before or in the same cycle as the trigger move. When the result of the addition leaves the pipeline it is available in the result register of the integer unit; in our example the latency of the integer unit is two, so the result move can take place two cycles after the trigger move. The result move passes the result directly to the trigger register of the memory unit. At the same time the address of the store is placed in the operand register of the memory unit. Thus a store is triggered with the data to be stored, and the address of the store is supplied via the operand register. Since a store has no result, no result move is needed. On average between 1.5 and 2 moves are necessary per operation.

Footnote: [...] even before the result has been calculated. MOVE architectures implement a locking mechanism which avoids the insertion of no-op moves.
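The timing rules of this example can be written down as a small check. The following is only a sketch under stated assumptions: the registers io and add are taken from the text, while the result register name "ir", the memory-unit registers "mo" and "st", the GPR names, and the assignment of GPRs to io and add are hypothetical labels chosen for illustration.

```python
INT_LATENCY = 2  # integer-unit latency stated in the text

# Each move is (cycle, source, destination). "io" and "add" come from the
# text; "ir", "mo", "st" and the GPR names are hypothetical labels.
MOVES = [
    (0, "ra", "io"),   # operand move to the integer unit
    (0, "rb", "add"),  # trigger move: starts the addition
    (2, "ir", "st"),   # result move, which also triggers the store
    (2, "rc", "mo"),   # store address to the memory operand register
]

def timing_ok(moves, latency):
    """Check the three timing rules described for the add-then-store example."""
    cycle_to = {dst: cycle for cycle, _src, dst in moves}
    cycle_from = {src: cycle for cycle, src, _dst in moves}
    return (
        cycle_to["io"] <= cycle_to["add"]                  # operand not after trigger
        and cycle_from["ir"] >= cycle_to["add"] + latency  # result after the latency
        and cycle_to["mo"] <= cycle_to["st"]               # address not after store trigger
    )

def moves_per_operation(moves, operations):
    """Average number of moves needed per operation."""
    return len(moves) / operations
```

In this sketch four moves implement two operations (the addition and the store), which sits at the upper end of the 1.5 to 2 moves per operation quoted above.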
The algorithm for finding a schedule for a given initiation interval [...] These provisions are described next.
Problem Formulation and Basic Algorithm
The instructions from the loop that needs to be software pipelined are represented as nodes in a directed graph G = (V, E). The nodes represent instructions, or in our case moves. It is also possible that a node represents a collection of scheduled nodes.
In that case we speak of hierarchical reduction (see [2]). Every node is supplied with a description of its resource needs. Between the nodes are arcs that describe precedence relations. Each arc (u, v) has two labels, d(u, v) and p(u, v). The meaning of these two labels is that v needs to be scheduled at least d(u, v) cycles after u of the p(u, v)th preceding iteration, that is,

$$\sigma(v) - \sigma(u) \geq d(u, v) - s \cdot p(u, v),$$

where $\sigma(v)$ is the cycle in which v is triggered, and s is the iteration initiation interval, that is, the time between the initiation of two successive iterations.
The problem is finding a schedule $\sigma$ with the lowest possible s that satisfies the following two constraints:
1. Precedence constraints:

$$\sigma(v) - \sigma(u) \geq d(u, v) - s \cdot p(u, v) \quad \text{for all } (u, v) \in E.$$

2. Resource constraints:
At every moment no more resources are used than there are available.
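As an illustration, the precedence constraint, read as sigma(v) - sigma(u) >= d(u, v) - s * p(u, v) for every arc, can be checked mechanically for a candidate schedule. This is a minimal sketch; the tuple encoding of arcs and the function name are assumptions made for the example.

```python
def violated_arcs(arcs, sigma, s):
    """Return the arcs (u, v, d, p) whose precedence constraint is violated.

    arcs  : iterable of (u, v, d, p) tuples, d = delay, p = iteration distance
    sigma : dict mapping each node to the cycle in which it is scheduled
    s     : initiation interval
    """
    return [
        (u, v, d, p)
        for (u, v, d, p) in arcs
        if sigma[v] - sigma[u] < d - s * p
    ]

# Hypothetical two-node loop: b depends on a within an iteration, and a of
# the next iteration depends on b.
ARCS = [("a", "b", 2, 0), ("b", "a", 1, 1)]
SIGMA = {"a": 0, "b": 2}
```

With these hypothetical labels the schedule is legal for s = 3, while s = 2 violates the loop-carried arc from b to a.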
These two constraints give two lower bounds on the initiation interval. The basic algorithm for finding a schedule is similar to list scheduling.
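One way to compute the two lower bounds is sketched here. The resource bound divides the number of uses of each resource per iteration by the number of available units; the precedence bound is the smallest s for which no cycle in the graph has positive total weight when each arc is weighted d - s * p. The Bellman-Ford style cycle test, the function names, and the search limit s_max are choices made for this example, not necessarily the method used in the paper.

```python
from math import ceil

def resource_bound(uses, available):
    """Smallest s allowed by resources: max over resources of ceil(uses / units)."""
    return max(ceil(uses[r] / available[r]) for r in uses)

def has_positive_cycle(nodes, arcs, s):
    """True if some cycle has positive total weight under arc weight d - s*p."""
    dist = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, d, p in arcs:
            if dist[u] + d - s * p > dist[v]:
                dist[v] = dist[u] + d - s * p
                changed = True
        if not changed:
            return False  # longest paths converged, so no positive cycle
    return True  # still relaxing after |V| passes

def precedence_bound(nodes, arcs, s_max=64):
    """Smallest s in 1..s_max that satisfies all cyclic precedence constraints."""
    for s in range(1, s_max + 1):
        if not has_positive_cycle(nodes, arcs, s):
            return s
    raise ValueError("no feasible initiation interval up to s_max")

# Hypothetical loop: one cycle with total delay 3 and iteration distance 1.
NODES = ["a", "b"]
ARCS = [("a", "b", 2, 0), ("b", "a", 1, 1)]
```

For example, seven moves per iteration on two busses give a resource bound of 4, and the hypothetical cycle above gives a precedence bound of 3, so s must be at least 4.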
The differences with list scheduling are:
If a resource is used at cycle t, it is also used at t + ks, where k is an integer number.
In list scheduling a node may have a lower bound on $\sigma$ (or an upper bound in case of bottom-up scheduling). However, in software pipelining a node may also have an upper bound since G is cyclic. Each node has to be scheduled between its lower and upper bound.
It is not always possible to schedule each node between its bounds due to resource constraints.
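The first difference, that a resource used at cycle t is also used at t + ks, is commonly handled by indexing a reservation table with the cycle modulo s. The class below is a minimal sketch of that idea; the class and method names are hypothetical.

```python
class ModuloReservationTable:
    """Tracks resource usage per cycle modulo the initiation interval s."""

    def __init__(self, s, capacity):
        self.s = s
        self.capacity = capacity                 # resource name -> available units
        self.used = [dict() for _ in range(s)]   # one usage row per modulo slot

    def fits(self, cycle, needs):
        """True if all needed units are still free in slot cycle mod s."""
        row = self.used[cycle % self.s]
        return all(row.get(r, 0) + n <= self.capacity[r] for r, n in needs.items())

    def reserve(self, cycle, needs):
        """Claim the needed units in slot cycle mod s."""
        row = self.used[cycle % self.s]
        for r, n in needs.items():
            row[r] = row.get(r, 0) + n
```

With s = 3 and a single bus, a move reserved at cycle 1 blocks cycles 4, 7, and so on, because they map to the same slot.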
The algorithm starts with scheduling a node at cycle zero. Next a second node is taken and it is scheduled at a place between its lower and upper bounds where enough resources are available.
The remaining nodes are scheduled in a similar fashion.
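The two bounds of a node v are determined by the following two equations:

$$l(v) = \max \{ \sigma(u) + d(\ell) - s \cdot p(\ell) \mid \ell \text{ is a path from } u \in S \text{ to } v \},$$

$$u(v) = \min \{ \sigma(u) - d(\ell) + s \cdot p(\ell) \mid \ell \text{ is a path from } v \text{ to } u \in S \},$$

where S is the set of scheduled nodes, and $d(\ell)$ and $p(\ell)$ denote the sums of the d and p labels of the arcs on the path $\ell$.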
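The bounds can be computed from all-pairs longest path weights, where each arc (u, v) carries the weight d(u, v) - s * p(u, v). The sketch below assumes that s is feasible, so that no cycle has positive weight; the Floyd-Warshall style closure and the function names are choices made for this example.

```python
NEG_INF = float("-inf")
POS_INF = float("inf")

def longest_paths(nodes, arcs, s):
    """All-pairs longest path weights under arc weight d - s*p."""
    dist = {u: {v: NEG_INF for v in nodes} for u in nodes}
    for u, v, d, p in arcs:
        dist[u][v] = max(dist[u][v], d - s * p)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] > dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def bounds(v, scheduled, dist):
    """Lower and upper bound on the cycle of v, given the scheduled nodes."""
    lower = max((t + dist[u][v] for u, t in scheduled.items()), default=NEG_INF)
    upper = min((t - dist[v][u] for u, t in scheduled.items()), default=POS_INF)
    return lower, upper

# Hypothetical three-node loop with one loop-carried arc from c back to a.
NODES = ["a", "b", "c"]
ARCS = [("a", "b", 2, 0), ("b", "c", 1, 0), ("c", "a", 1, 1)]
DIST = longest_paths(NODES, ARCS, 5)
```

With a scheduled at cycle 0 and s = 5, node b may be placed in cycles 2 to 3 and node c in cycles 3 to 4. A node with no path to or from a scheduled node gets an infinite bound on that side.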
In contrast to Lam we backtrack (to a certain level) if a node could not be placed, trying one of the earlier placed nodes at a later cycle.
We found out that without backtracking, schedules for transport-triggered architectures are too often non-optimal. Current research aims at improving the heuristics in order to avoid backtracking.
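The placement loop with backtracking can be sketched as follows. This is a simplified illustration and not the authors' implementation: it backtracks without a depth limit, places nodes in the order given rather than by a priority heuristic, models a single resource (a number of move busses), and assumes that s already satisfies the precedence lower bound.

```python
NEG_INF = float("-inf")
POS_INF = float("inf")

def schedule(nodes, arcs, s, buses):
    """Return a dict node -> cycle, or None if no placement is found."""
    # All-pairs longest path weights under arc weight d - s*p.
    dist = {u: {v: NEG_INF for v in nodes} for u in nodes}
    for u, v, d, p in arcs:
        dist[u][v] = max(dist[u][v], d - s * p)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                dist[i][j] = max(dist[i][j], dist[i][k] + dist[k][j])

    sigma = {}
    load = [0] * s  # moves already placed in each modulo slot

    def place(i):
        if i == len(nodes):
            return True
        v = nodes[i]
        lo = max([t + dist[u][v] for u, t in sigma.items()] + [NEG_INF])
        hi = min([t - dist[v][u] for u, t in sigma.items()] + [POS_INF])
        if lo == NEG_INF:  # no scheduled predecessor constrains v
            lo = 0 if hi == POS_INF else hi - s + 1
        # Trying s consecutive cycles covers every modulo slot once.
        for t in range(lo, min(hi, lo + s - 1) + 1):
            if load[t % s] < buses:
                sigma[v] = t
                load[t % s] += 1
                if place(i + 1):
                    return True
                # Backtrack: undo the placement and try a later cycle.
                del sigma[v]
                load[t % s] -= 1
        return False

    return dict(sigma) if place(0) else None

# Hypothetical loop: c is pinned to cycle 1 relative to a, b is unconstrained.
ARCS = [("a", "c", 1, 0), ("c", "a", 2, 1)]
```

In this hypothetical loop with s = 3 and one bus, b is first placed at cycle 1, which leaves no room for c; backtracking moves b to cycle 2 and c then fits at cycle 1.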
where r(v) is an indication of the scarceness of the resources that v uses, u(v) and l(v) are the bounds of v, and $\alpha$ and $\beta$ are weight factors that should be determined with some experimentation.
With this heuristic we also consider resource constraints.
Since, as s becomes larger, the chance of a resource conflict be[...]

r_b -> io is a move from GPR b to the integer operand register io.
Besides using one move slot in the instruction stream, and using a transport bus during one cycle, this move also claims the io register for one or more cycles. The io register is released one cycle after the corresponding move to the trigger register.

Footnote: A branch delay is one cycle longer than a jump delay.
VLIW
Every instruction is projected on a fixed move pattern.
In this pattern, operand and trigger moves are in the same cycle, and a result move has to take place exactly a fixed number of cycles later (depending on the latency of the FU).
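The fixed move pattern of the VLIW discipline can be illustrated with a small helper that derives all move cycles of an operation from its issue cycle. The function name and the dictionary layout are hypothetical.

```python
def vliw_move_pattern(issue_cycle, latency):
    """Move cycles of one operation under the fixed VLIW pattern.

    Operand and trigger moves share the issue cycle; the result move is
    pinned to exactly `latency` cycles later.
    """
    return {
        "operand": issue_cycle,
        "trigger": issue_cycle,
        "result": issue_cycle + latency,
    }
```

Under this discipline the scheduler chooses only the issue cycle; the three moves cannot be shifted relative to one another.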
VLIW-bp

Benchmarks
We use six loops for studying the effect of different scheduling disciplines. Four of them are kernels from the Livermore loops; the fifth loop is the SAXPY loop; the sixth loop is a loop that increments each element of a floating point vector by one (see figure 6). Because of its size the latter loop is included for analysis purposes presented in section 5. We shall call this loop 'vector-increment'.

As shown in figure 7, depending on the transport capacity, the MFLOP rates may increase more than 50% for a sample of Livermore benchmarks. There are several independent factors contributing to this increase. In order to analyze these factors we look in detail at the vector-increment loop as shown in figure 6. In going from the VLIW discipline to the FREE discipline the initiation interval s for 2 move busses reduces from 12 to 5 cycles. The corresponding code schedules for VLIW and FREE are displayed in figure 9. We distinguish four important factors contributing to this drastic cycle reduction:
(1) software bypassing, (2) dead code removal, (3) common subexpression elimination (CSE), and (4) code motion. They are discussed next.
Software Bypassing
The difference between VLIW and VLIW-bp is the result of software bypassing.
If there exists a RaW [...] The effect of software bypassing is a reduction of the 'length' ($\sum d_i / \sum p_i$) of the cycles in the graph.
This means a lower precedence-imposed bound on the initiation interval. However, this introduces extra register and transport requirements (extra moves). This may increase the lower bound on s that follows from the resource constraints.
As will be clear from looking at the absolute MFLOP values in figure 7, the function units are underutilized. This means that we have to apply some combination of loop unrolling and register renaming in order to further reduce precedence constraints and to keep the FUs more busy. This is a topic of further research.
Anyhow, the amount of required renaming is lower for transport-triggered architectures.
