This paper focuses on the interaction between software prefetching (both binding and nonbinding prefetch) 
Introduction
Software pipelining is a well-known loop scheduling technique that tries to exploit instruction level parallelism by overlapping several consecutive iterations of the loop and executing them in parallel ( [19] [15] ).
Different algorithms can be found in the literature for generating software pipelined schedules, but the most popular scheme is called modulo scheduling. The main idea of this scheme is to find a fixed pattern of operations (called kernel or steady state) that consists of operations from distinct iterations. Finding the optimal schedule for a resource-constrained scenario is an NP-complete problem, so practical proposals are based on different heuristic strategies. The key goal of these schemes has been to achieve a high throughput (e.g., [15] [12] [23] [20]), to minimize register pressure (e.g., [10] [7]) or both (e.g., [11] [16] [8] [17]), but none of them has evaluated the effect of memory. These schemes assume a fixed latency for all memory operations, which usually corresponds to the cache-hit latency.
Lockup-free caches allow the processor not to stall on a cache miss. However, in a statically scheduled architecture the processor often stalls afterwards due to true dependences with previous memory operations. The alternative of scheduling all loads using the cache-miss latency requires considerable instruction level parallelism and increases register pressure ([1]).
Software prefetching is an effective technique to tolerate memory latency ( [4] ). Software prefetching can be performed through two alternative schemes: binding and nonbinding prefetching. The first alternative, also known as early scheduling of memory operations, moves memory instructions away from those instructions that depend on them. The second alternative introduces in the code special instructions, which are called prefetch instructions. These are nonfaulting instructions that perform a cache lookup but do not modify any register. Another important difference between the two schemes is that nonbinding prefetch moves the prefetched data into cache, whereas the binding scheme puts the data into both the cache and an architected register. These alternative prefetching schemes have different drawbacks:
• The binding scheme increases the register pressure because the lifetime of the value produced by the memory operation is stretched. It may also increase the initiation interval due to memory operations that belong to recurrences.
• The nonbinding scheme increases the memory pressure since it increases the number of memory requests, which may produce an increase in the initiation interval. Besides it may produce an increase in the register pressure since the lifetime of the value used to compute the effective address is stretched. A higher register pressure may require additional spill code, which results in additional memory pressure.
In this paper we investigate the interaction between software prefetching and software pipelining in a VLIW machine. First we show that previous schemes that do not consider the effect of memory penalties produce schedules that are far from optimal when they are evaluated taking into account a realistic cache memory. We evaluate several heuristics to schedule memory operations and to insert prefetch instructions in a software pipelined schedule. The contributions of stalls and spill code are quantified for each case, showing that stall penalties have a much higher impact on performance than spill code. We then propose a heuristic that tries to trade off both initiation interval and stall time in order to minimize the execution time of a software pipelined loop. Finally, we show that schemes based on binding prefetch are more effective than those based on nonbinding prefetch for software pipelined schedules.
The use of binding and nonbinding prefetching has been previously studied in [13] [1] and [4] [9] [14] [18] [3], respectively, among others. However, there are very few works analyzing the interactions of these prefetching schemes with software pipelining techniques. Selective scheduling ([1]) schedules some operations with cache-hit latency and others with cache-miss latency, like the scheme proposed in this paper. However, selective scheduling is based on profiling information whereas our method is based on a static analysis performed at compile time. In addition, selective scheduling does not consider the interactions with software pipelining. In [5] the authors analyze the effect of scheduling memory operations with cache-hit latency when they exhibit some type of reuse, or with cache-miss latency otherwise. Their results show that, on average, this scheme is better than the scheme that always uses cache-hit latency but worse than the scheme that always uses cache-miss latency. Our results corroborate this fact. However, the scheme proposed in this paper outperforms both cache-hit and cache-miss based approaches.
The rest of this paper is organized as follows. Section 2 motivates the impact that memory latency may have in a software pipelined loop. Section 3 explains the experimental methodology. Section 4 evaluates the performance of some simple schemes for scheduling load and store instructions. Section 5 describes the new algorithm proposed in this paper and presents some performance results. Finally, the main conclusions are summarized in section 6.

Motivation
Modulo scheduling is the most popular technique to perform software pipelining. The basic idea behind this technique is to find a fixed schedule for all the iterations of the loop such that no structural hazard occurs and all dependences are enforced when iterations are overlapped following a fixed initiation interval. The transformed code that achieves these objectives is another loop whose body consists of operations from distinct iterations of the initial loop. A software pipelined loop obtained via modulo scheduling is characterized basically by two terms: the initiation interval (II) and the stage counter (SC). The former indicates the number of cycles between the initiation of successive iterations. The latter shows how many iterations are overlapped.
Most heuristics for modulo scheduling have tried to minimize the initiation interval although this is not the only term that determines the execution time, as shown below. The minimum initiation interval is bounded by resources and recurrences:

$$MII = \max(II_{res}, II_{rec})$$

The $II_{res}$ is the lower bound due to resource constraints of the architecture. Assuming that all functional units are pipelined (see note 1), it is calculated as:

$$II_{res} = \max_{op \in ARCH} \left\lceil \frac{NOPS(op)}{NFUS(op)} \right\rceil$$

where NOPS(op) indicates the number of operations of type op in the loop body, and NFUS(op) indicates the number of functional units of type op in the architecture.
The $II_{rec}$ is the lower bound due to recurrences in the graph and it is computed as:

$$II_{rec} = \max_{rec \in GRAPH} \left\lceil \frac{LAT(rec)}{DIST(rec)} \right\rceil$$

where LAT(rec) represents the sum of all node latencies in the recurrence rec, and DIST(rec) represents the sum of all edge distances in the recurrence rec.
Note 1. For the sake of simplicity, in this discussion we assume fully-pipelined units with a repeat rate of one operation per cycle.
For a particular data dependence graph and a given architecture, the resulting II is dependent on the latency that the scheduler assumes for each operation. The latency of operations is usually known by the compiler except for memory operations, which have a variable latency. The II also depends on the NOPS, which is affected by the spill code introduced by the scheduler. The other parameters, NFUS and DIST, are fixed.
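The two lower bounds can be illustrated with a short sketch. It assumes the usual definitions (the resource bound is the largest ratio of operations to functional units, the recurrence bound is the largest ratio of total latency to total distance, both rounded up); the function names and the sample loop are hypothetical and are not taken from Figure 1.

```python
from math import ceil

def ii_res(nops, nfus):
    """Resource bound: max over operation types of ceil(NOPS / NFUS)."""
    return max(ceil(nops[op] / nfus[op]) for op in nops)

def ii_rec(recurrences):
    """Recurrence bound: max over recurrences of ceil(LAT / DIST).

    Each recurrence is a (total_latency, total_distance) pair.
    """
    return max((ceil(lat / dist) for lat, dist in recurrences), default=0)

def min_ii(nops, nfus, recurrences):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(ii_res(nops, nfus), ii_rec(recurrences))

# Hypothetical loop: 3 memory operations on 1 memory unit, 4 ALU
# operations on 2 ALUs, and one recurrence of distance 1 containing a load.
nops = {"mem": 3, "alu": 4}
nfus = {"mem": 1, "alu": 2}
print(min_ii(nops, nfus, [(3, 1)]))   # load assumed at cache-hit latency: 3
print(min_ii(nops, nfus, [(12, 1)]))  # same load at cache-miss latency: 12
```

The second call shows why the latency assumed for a load inside a recurrence matters: the resource bound is unchanged, but the recurrence bound grows with the latency of the load.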
Figure 1 depicts a simple example of how modulo scheduling works. The first step is to build the data dependence graph of the loop. In this graph, nodes represent simple operations (such as loads, adds, etc.), whereas edges represent data dependences between operations. These dependences can be between operations in the same iteration (simple dependences), or between operations from distinct iterations (loop-carried dependences). Edges of this last type are labeled with a number that denotes the distance, in number of iterations, of the dependence. Any circuit in the dependence graph is known as a recurrence. Note that any recurrence must contain at least one loop-carried dependence. Starting from this graph, and knowing the latency of each operation and 
Once the minimum initiation interval has been computed, the next step is to schedule the operations. Following some particular order, each operation is scheduled with three main constraints: (i) it cannot be scheduled before any of its producers (predecessors in the dependence graph) plus their respective latency; (ii) it cannot be scheduled after any of its consumers (successors in the dependence graph) minus its own latency; and (iii) two operations cannot be scheduled in a given functional unit in time slots that differ in a multiple of II.
If these constraints cannot be met for a given operation, there are different alternatives, such as re-scheduling some already scheduled operations (backtracking) or increasing the II and re-scheduling all operations.
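The three constraints can be checked mechanically for a candidate schedule. The following is a minimal sketch, not the scheduler used in this paper: it assumes fully pipelined units, represents each dependence as a hypothetical (producer, consumer, distance) triple, and models the resource constraint with a modulo reservation table that counts operations per unit type and per time slot modulo II.

```python
from collections import Counter

def is_valid_modulo_schedule(times, latency, fu_type, nfus, edges, ii):
    """Return True if `times` (op -> issue cycle) respects dependences and resources."""
    # Constraints (i) and (ii): a consumer, possibly `dist` iterations later,
    # must not issue before its producer has delivered the value.
    for src, dst, dist in edges:
        if times[dst] + dist * ii < times[src] + latency[src]:
            return False
    # Constraint (iii): operations whose issue cycles are equal modulo II
    # compete for the same slot of the same kind of functional unit.
    usage = Counter((fu_type[op], t % ii) for op, t in times.items())
    return all(count <= nfus[fu] for (fu, _slot), count in usage.items())

# Hypothetical three-operation loop body: load -> add -> store.
latency = {"ld": 2, "add": 1, "st": 1}
fu_type = {"ld": "mem", "add": "alu", "st": "mem"}
nfus = {"mem": 1, "alu": 1}
edges = [("ld", "add", 0), ("add", "st", 0)]

good = {"ld": 0, "add": 2, "st": 3}
clash = {"ld": 0, "add": 2, "st": 4}   # st and ld share slot 0 of the memory unit
print(is_valid_modulo_schedule(good, latency, fu_type, nfus, edges, ii=2))
print(is_valid_modulo_schedule(clash, latency, fu_type, nfus, edges, ii=2))
```

A real modulo scheduler would use such a check incrementally while placing operations, and would backtrack or increase the II when no legal slot exists.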
The fixed pattern that represents the transformed code is derived by overlapping the schedule of consecutive iterations with an offset of II cycles. The maximum number of overlapped iterations is referred to as the stage counter (SC).
The execution of a modulo scheduled loop can be divided into three stages: prolog, kernel and epilog (see Figure 2). The prolog is a ramp-up phase of SC-1 iterations that fills the software pipeline. During the kernel, NITER-SC+1 iterations are executed and the software pipeline achieves maximum overlap. In this phase the same pattern of operations is executed in each iteration. Lastly, in the epilog phase the software pipeline is drained. Like the prolog, this phase has SC-1 iterations.
Figure 2. Execution of a modulo scheduled loop
In this way, the execution time of the loop can be calculated as:

$$t_{exec} = (NITER + SC - 1) \times II + t_{stall}$$
For a given architecture and a given scheduler, the first term of the sum (called compute time or optimistic execution time in the rest of the paper) is fixed and it is determined at compile time.
The stall time is mainly due to dependences with previous memory instructions and it depends on the run-time behavior of the program (e.g., miss ratio, outstanding misses, etc.).
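As a small illustration, the sketch below adds up the three phases described above (SC-1 prolog iterations, NITER-SC+1 kernel iterations and SC-1 epilog iterations, each lasting II cycles) and then adds the stall cycles. The function names and the sample numbers are assumptions made for this example.

```python
def compute_time(niter, sc, ii):
    """Optimistic execution time: prolog + kernel + epilog, II cycles per step."""
    prolog = sc - 1
    kernel = niter - sc + 1
    epilog = sc - 1
    return (prolog + kernel + epilog) * ii   # equals (niter + sc - 1) * ii

def execution_time(niter, sc, ii, stall_cycles):
    """Compute time known at compile time plus run-time stall cycles."""
    return compute_time(niter, sc, ii) + stall_cycles

# Hypothetical loop: 100 iterations, 3 overlapped stages, II of 4 cycles.
print(compute_time(100, 3, 4))            # 408
print(execution_time(100, 3, 4, 225))     # 633
```

For large NITER the prolog and epilog cost becomes negligible and the compute time approaches NITER × II, whereas for short loops the SC term can dominate.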
Conventional modulo scheduling proposals use a fixed latency (usually the cache-hit time) to schedule memory instructions. Scheduling instructions with their minimum latency minimizes the register pressure and thus reduces the spill code. On the other hand, this minimum-latency scheduling can increase the stall time because of data dependences. In particular, if an operation needs data that has been loaded by a previous instruction but the memory access has not finished yet, the processor stalls until the data is available. Figure 1 shows a sample schedule for a data dependence graph and a given architecture. In this case, memory instructions are scheduled with cache-hit latency. If the stall time is ignored, as is usual in studies dealing with software pipelining techniques, the expected optimistic execution time will be (suppose NITER is huge):
Obviously this is an optimistic estimation of the actual execution time, which can be rather inaccurate. For instance, suppose that the miss ratio of the N1 load operation is 0.25 (e.g., it has stride 1 and there are 4 elements per cache line). For every cache miss, the processor stalls some cycles (called penalty). The penalty for a particular memory instruction depends on the hit latency, the miss latency and the distance in the schedule between the memory operation and the first instruction that uses the data produced by the memory instruction. For the dependence between N1 and N2 the penalty is 9 cycles, so the stall time, assuming that the remaining dependences do not produce any penalty, is:

$$t_{stall} = NITER \times 0.25 \times 9 = 2.25 \times NITER$$

If all memory references were considered, the effect of the stall time could be greater, and the discrepancy between the optimistic estimation usually utilized to evaluate the performance of software pipelined schedulers and the actual performance could be much higher. We can also conclude that scheduling schemes that try to minimize the stall time may provide a significant advantage.
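The stall estimate in this example can be reproduced with a few lines of code. The penalty model below is a simplifying assumption, not a formula given in the text: the consumer is taken to issue max(hit latency, scheduled distance) cycles after the load, and on a miss it waits for the remaining cycles of the miss latency. The latencies used (1-cycle hit, 10-cycle miss) are hypothetical values chosen so that the penalty is 9 cycles.

```python
def penalty(miss_latency, hit_latency, sched_distance):
    """Cycles stalled on a miss (assumed model, see the lead-in)."""
    covered = max(hit_latency, sched_distance)
    return max(0, miss_latency - covered)

def stall_time(niter, miss_ratio, penalty_cycles):
    """Expected stall cycles caused by one load over the whole loop."""
    return niter * miss_ratio * penalty_cycles

p = penalty(miss_latency=10, hit_latency=1, sched_distance=1)
print(p)                           # 9
print(stall_time(1000, 0.25, p))   # 2250.0, i.e. 2.25 cycles per iteration
```

Under this model, scheduling the consumer at the cache-miss distance drives the penalty to zero, which is the effect that early scheduling exploits.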
In this paper, the proposed scheduler is evaluated and compared with others using the $t_{exec}$ metric. This requires considering the run-time behavior of individual memory references, which in turn requires simulating the memory system.
Experimental framework
Tools and benchmarks
The locality analysis and scheduling tasks have been performed using the ICTINEO toolset [2]. ICTINEO is a source-to-source translator that produces code in which each sentence has semantics similar to that of current machine instructions. After translating the code to such a low-level representation and applying classical optimizations, the dependence graph of each innermost loop is constructed according to the particular prefetching approach. Then, instructions are scheduled using any software pipelining algorithm. The particular software pipelining algorithm used in the experiments is the one proposed in [16], which has been shown to be very effective at minimizing both the II and the register pressure.
The resulting code is instrumented to generate a trace that feeds a simulator of the architecture.
The compute time (optimistic execution time) can be derived from the static schedule. Therefore, the function of the simulator is to add the number of cycles that the processor is stalled due to the cache. In the modeled architectures there are two reasons for the processor to stall: (a) when an instruction requires an operand that is not available yet (e.g., it is being read from the second level 
Schemes to schedule memory operations
In this section we evaluate the performance of basic schemes to schedule memory operations and point out their drawbacks, which motivate the new approach proposed in the next section.
The evaluated schemes are based on early scheduling or inserting prefetch instructions (see Figure 3). Note that in both cases each prefetch implies a modification of the data dependence graph, as described below. Then, any software pipelining algorithm can be used to generate the schedule from the modified dependence graph.
In Figure 3 each memory operation (or node) is tagged with the latency used to schedule it. The early scheduling technique (Figure 3a) is based on using the cache-miss latency (10 cycles in the figure) to schedule memory instructions. In this case, all operations that have a dependence with the LOAD will be scheduled at least 10 cycles later and, thus, the lifetime of the destination register used by the LOAD instruction is increased (i.e., output dependence edges are stretched). On the other hand, the inserting prefetch technique (Figure 3b) adds nonbinding memory instructions (prefetch instructions) that are scheduled using the cache-miss latency. The prefetch instruction has no destination register, but the lifetime of the registers used to compute the effective address is increased (i.e., input dependence edges are stretched). Figure 3c and Figure 3d show how the data dependence graph of Figure 1 is transformed in order to perform early scheduling and to insert a prefetch instruction, respectively, for the N1 load operation.
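The two graph transformations can be sketched as follows. The graph representation (a dictionary of nodes plus a list of (source, destination, distance) edges), the node names and the exact edges added for the prefetch are assumptions made for illustration; the precise shape of the transformed graphs is the one given in Figure 3.

```python
import copy

def early_schedule(nodes, edges, load, miss_latency):
    """Binding prefetch: tag the load itself with the cache-miss latency."""
    new_nodes = copy.deepcopy(nodes)
    new_nodes[load]["latency"] = miss_latency
    return new_nodes, list(edges)

def insert_prefetch(nodes, edges, load, miss_latency):
    """Nonbinding prefetch: add a prefetch node ahead of the load."""
    new_nodes = copy.deepcopy(nodes)
    pf = "PF_" + load
    new_nodes[pf] = {"kind": "prefetch", "latency": miss_latency}
    new_edges = list(edges)
    # The prefetch consumes the same address operands as the load,
    # so the lifetime of those values is stretched.
    new_edges += [(src, pf, dist) for src, dst, dist in edges if dst == load]
    # Assumed edge: the load issues once the prefetched line has arrived.
    new_edges.append((pf, load, 0))
    return new_nodes, new_edges

# Hypothetical fragment: address computation -> load -> use.
nodes = {"N0": {"kind": "add", "latency": 1},
         "N1": {"kind": "load", "latency": 2},
         "N2": {"kind": "add", "latency": 1}}
edges = [("N0", "N1", 0), ("N1", "N2", 0)]

es_nodes, es_edges = early_schedule(nodes, edges, "N1", miss_latency=10)
ip_nodes, ip_edges = insert_prefetch(nodes, edges, "N1", miss_latency=10)
print(es_nodes["N1"]["latency"], len(es_edges))   # 10 2
print(sorted(ip_nodes), len(ip_edges))
```

Early scheduling leaves the number of nodes unchanged and only stretches the edges leaving the load, whereas inserting a prefetch adds one memory operation per prefetched reference, which is what raises the memory pressure.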
Early scheduling
In this section we evaluate some schemes based on early scheduling. These schemes prefetch data without requiring additional instructions but they may result in an increase in the II when memory instructions are in recurrences. Besides, they may also require additional spill code. We have evaluated two different schemes: (i) schedule all memory operations using the cache-miss latency, or, in other words, early scheduling always (ESA), and (ii) schedule instructions that have some type of locality using the cache-hit latency and schedule the remaining ones using the cache-miss latency. This latter scheme will be called early scheduling according to locality (ESL), and makes use of a static locality analysis in addition to other issues in order to determine the latency to be considered when scheduling each individual instruction.
The locality analysis is based on the analysis presented in [21] [22] . It is divided into three steps:
• Reuse analysis: computes the intrinsic reuse property of each memory instruction as proposed in [24]. The goal is to determine the kind of reuse that is exploited by a reference in each loop. Five types of reuse can be determined: none, self-temporal, self-spatial, group-temporal and group-spatial. References without any reuse are those that always cause compulsory misses.
• Interference analysis: using the initial address of each reference and the previous reuse analysis, it determines whether two static instructions always conflict in the cache. Besides, self-interferences are also taken into account by considering the stride exhibited by each static instruction. References that interfere with themselves or with other references are considered not to have any type of locality even if they exhibit some type of reuse.
• Volume analysis: determines which references cannot exploit their reuse because they have been displaced from the cache due to the lack of enough storage. That is, this step identifies capacity misses. It is based on computing the amount of data that is used by each reference in each loop. For those loops whose number of iterations is unknown, we use the same estimation as used by Bernstein et al. in [3].
We refer the reader to the previously mentioned papers ([21] [22]) for more details about the locality analysis.
The analysis concludes that a reference is expected to exhibit locality if it has reuse, it does not interfere with any other reference (including itself) and the volume of data between two consecutive reuses is lower than the cache size (see details in the above-mentioned papers).
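The final decision of the locality analysis, and the way the ESL scheme uses it, can be summarized in a short sketch. It only combines the three results stated above (reuse, interference, volume); the record type, its field names and the sample values are hypothetical, and the analyses that would fill those fields are not modeled.

```python
from dataclasses import dataclass

@dataclass
class MemRef:
    name: str
    reuse: str          # "none", "self-temporal", "self-spatial",
                        # "group-temporal" or "group-spatial"
    interferes: bool    # conflicts with itself or with another reference
    reuse_volume: int   # bytes touched between two consecutive reuses

def has_locality(ref, cache_size):
    """A reference has locality if it has reuse, no interference, and fits."""
    return (ref.reuse != "none"
            and not ref.interferes
            and ref.reuse_volume < cache_size)

def esl_latency(ref, cache_size, hit_latency, miss_latency):
    """ESL: cache-hit latency for references with locality, miss otherwise."""
    return hit_latency if has_locality(ref, cache_size) else miss_latency

# Hypothetical references and an 8 KB cache.
a = MemRef("A(i)", "self-spatial", False, 1024)
b = MemRef("B(i,j)", "self-temporal", True, 1024)
c = MemRef("C(j)", "group-temporal", False, 65536)
for ref in (a, b, c):
    print(ref.name, esl_latency(ref, 8192, hit_latency=2, miss_latency=10))
```

Only the first reference is scheduled with the cache-hit latency: the second one interferes in the cache and the third one touches more data between reuses than the cache can hold.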
The scheme that schedules all memory operations using the cache-hit latency (called CHL) will be used as a baseline for comparisons. As previously mentioned, modulo scheduling schemes usually schedule memory operations using the CHL approach. This scheme is expected to produce a significant amount of processor stalls as suggested in section 2.
The different algorithms have been evaluated in terms of execution time. Figure 4 compares the execution time of these schemes. The main conclusion that can be drawn from Figure 4 is that the performance of both the default and the early scheduling schemes is, in general, far from the lower bound. The CHL scheme results in a significant percentage of stall time (for the aggressive architecture the stall time represents more than 50% of the execution time for most programs). The ESA scheme eliminates practically all the stall time. The remaining stall time is basically due to the lack of entries in the outstanding miss table that is used to implement a lockup-free cache. However, this scheme significantly increases the compute time for some programs such as turb3d (by a factor of 3 in the aggressive architecture), mgrid and hydro2d. This is due to the memory references in recurrences that limit the II. The performance of the ESL scheme is in general worse than that of the ESA scheme, except for the tomcatv benchmark in the simple architecture and turb3d in both architectures. Notice that for some programs such as su2cor, hydro2d and mgrid the stall time of the ESL scheme is hardly reduced compared with the CHL scheme.
Inserting prefetch instructions
In order to reduce the penalties caused by memory operations, an alternative to early scheduling of memory instructions is inserting prefetch instructions, which are provided by many current instruction set architectures (e.g., the touch instruction of the PowerPC [6] ). Such instructions are scheduled at a distance of the cache-miss latency cycles from the actual memory references. This new scheme can introduce additional spill code since it increases the register pressure. In particular, the lifetimes of values that are used to compute the effective address are increased since they are used by both the prefetch and ordinary memory instructions. It can also increase the initiation interval due to additional memory instructions.
We have evaluated three alternative schemes to introduce prefetch instructions: (i) insert prefetch always (IPA), (ii) insert prefetch for those references without temporal locality even if they exhibit spatial locality, according to the static locality analysis (IPT), and (iii) insert prefetch for those instructions without any type of locality (IPL). Note that this last scheme excludes prefetching for references with spatial locality. The first scheme is expected to result in very few stalls but it requires many additional instructions, which may increase the II. The IPT scheme is more selective when adding prefetch instructions. However, it adds unnecessary prefetch instructions for some references with just spatial locality. Instructions with only spatial locality will cause a cache miss only when a new cache line is accessed. The IPL scheme is the most conservative in the sense that it adds the lowest number of prefetch instructions.
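The difference between the three insertion policies reduces to a predicate over the locality that the static analysis expects each reference to exploit. The sketch below is an illustration under that reading; the set-based encoding of locality kinds and the function name are assumptions.

```python
TEMPORAL = {"self-temporal", "group-temporal"}
SPATIAL = {"self-spatial", "group-spatial"}

def needs_prefetch(scheme, locality_kinds):
    """Decide whether a reference gets a prefetch under IPA, IPT or IPL.

    `locality_kinds` is the set of locality kinds the analysis expects the
    reference to exploit (empty when it has no locality at all).
    """
    if scheme == "IPA":                     # prefetch always
        return True
    if scheme == "IPT":                     # prefetch unless temporal locality
        return not (locality_kinds & TEMPORAL)
    if scheme == "IPL":                     # prefetch only without any locality
        return not (locality_kinds & (TEMPORAL | SPATIAL))
    raise ValueError("unknown scheme: " + scheme)

# A reference with only spatial locality is prefetched by IPA and IPT
# but not by IPL.
for scheme in ("IPA", "IPT", "IPL"):
    print(scheme, needs_prefetch(scheme, {"self-spatial"}))
```

This makes the ordering of the schemes explicit: every reference prefetched by IPL is also prefetched by IPT, and every reference prefetched by IPT is also prefetched by IPA.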
Figure 5 compares the total execution time of the above-mentioned prefetching schemes. All figures are normalized to the CHL scheduling execution time. Among the schemes that insert prefetch instructions, none outperforms the others in general: the best one depends on the particular program and architecture. The prefetch schemes outperform the CHL scheme in general (i.e., the normalized execution times in Figure 5 are in general lower than 1) but in some cases they may be even worse than the CHL scheme.
Binding versus nonbinding schemes
Comparing binding (Figure 4) with nonbinding (Figure 5) schemes, it can be observed that binding prefetch is always better for the first three benchmarks. Both schemes have similar performance for the next two benchmarks, and only for the last one does nonbinding prefetch outperform the binding schemes. Table 2 compares the different schemes using the CHL algorithm as a baseline. For each scheme, it shows the increase in compute time and the decrease in stall time. As we have seen before, scheduling memory operations using the cache-miss latency can affect the initiation interval. The stall time due to dependences can be eliminated by scheduling memory instructions using the cache-miss latency. By default, spill code is scheduled using the cache-hit latency and therefore it may cause some stalls, although this is unlikely because the spill code usually is a store followed by a load to the same address. Since they are not usually close (otherwise the spill code hardly reduces the register pressure), the load will cause a stall only if it interferes with a memory reference in between the store and itself. The column denoted as ∇Stall represents the percentage of the stall time caused by the CHL algorithm that is avoided. For any scheme s, it is calculated as:

$$\nabla Stall(s) = \frac{t_{stall}^{CHL} - t_{stall}^{s}}{t_{stall}^{CHL}} \times 100$$

The ESA scheme is the best one to reduce the stall time but at the expense of a large increase in compute time, mainly when the architecture becomes more aggressive. The IPL scheme causes the lowest increase in compute time, but it is also the worst approach for decreasing the stalls due to memory. From this table we can see that none of the schemes achieves a good trade-off between compute time and stall time. None of them can significantly reduce the latter without an important increment in the former.
Table 3 shows the miss ratio of the different prefetching schemes compared with the miss ratio of a nonprefetching scheme (CHL). We can see that, in general, the more prefetch instructions are inserted, the higher the reduction in miss ratio is. However, inserting prefetch instructions does not remove all cache misses, even for the scheme that inserts a prefetch for every memory instruction (IPA). This is due to cache interferences among prefetch instructions before the prefetched data are used. This is quite common in the programs tomcatv and swim. For instance, if two memory references that interfere in the cache are very close in the code, it is likely that the two prefetches corresponding to them are scheduled before both memory references. In this case, at least one of the two memory references will miss in spite of the prefetch. Moreover, if the prefetches and memory instructions are scheduled in reverse order (i.e., instruction A is scheduled before B but the prefetch of B is scheduled before the prefetch of A), both memory instructions will miss.
Table 2. Increment of compute time and decrement of stall time in relation to the CHL
A cache sensitive algorithm
In this section we propose a new algorithm, which is called cache sensitive modulo scheduling (CSMS), that tries to minimize both the compute time and the stall time. These terms are not independent and reducing one of them may result in an increase in the other, as we have just shown in the previous section. The proposed algorithm tries to find the best trade-off between the two terms.
The CSMS algorithm
The CSMS algorithm is based on early scheduling of some selectively chosen memory operations. Scheduling a memory operation using the cache-miss latency can hide almost all memory latency, as we have shown in the previous section, without increasing much the number of instructions (as opposed to the use of prefetch instructions). However, it can increase the execution time in three ways:
• It may increase the register pressure since the lifetime of some values is increased. Therefore, it may increase the II due to spill code if the performance of the loop is bounded by memory operations.
• It may increase $II_{rec}$ because the latency of memory operations is augmented. In other words, if memory instructions inside recurrences are scheduled using the cache-miss latency, the total latency of the whole cycle is increased, which increases the $II_{rec}$ (see definition in Section 2). Therefore, it may increase the II if the performance of the loop is bounded by recurrences.
• It may increase the SC because the length of individual loop iterations may be increased. This augments the cost of the prolog and the epilog.
Table 3. Miss ratio for the CHL and the different prefetching schemes
Two of the main issues of the CSMS algorithm are the reduction of the impact of recurrences on the II and the minimization of the stall time. The problem of the cost of the prolog and epilog is handled by computing two alternative schedules. Both focus on minimizing the stall time and the II. However, one of them reduces the impact of the prolog and the epilog at the expense of an increase in the stall time whereas the other does not care about the prolog and epilog cost. Then, depending on the number of iterations of the loop, the most effective one is chosen.
The core of the CSMS algorithm is shown in Figure 6. The algorithm makes use of the static locality analysis previously mentioned. Initially, two data dependence graphs with the same nodes and edges are generated. The difference is just the latency assigned to each node. In grph1, each memory node is tagged according to the locality analysis: it is tagged with the cache-hit latency if it exhibits any type of locality or with the cache-miss latency otherwise. In grph2, all memory nodes are tagged with the cache-miss latency.
Figure 6. CSMS algorithm
Figure 7. Scheduling a loop with recurrences
Then, a schedule that minimizes the impact of recurrences on the II is computed for each graph using the function ComputeSchedMinRecEffect that is shown in Figure 7. The first step of this function is to order the recurrences by restriction order, that is, according to the $II_{rec}$ in decreasing order. After that, the latency of those memory operations inside recurrences that limit the II is changed from cache-miss to cache-hit until the II is limited by resources or by a more constraining recurrence (function MinimizeRecurrenceEffect). Nodes to be modified are chosen according to a locality order, starting from the ones that exhibit most locality (the priority order is the following: self-temporal-spatial, self-temporal, group-trailing, self-spatial, unknown and without locality). Then, the second step is to compute the actual schedule using the modified graph. This step can be performed through any of the software pipelining schedulers proposed in the literature.
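The relaxation step for a single recurrence can be sketched as follows. This is an interpretation of the description above, not the code of Figure 7: it assumes a recurrence is given as a list of nodes with a total distance, that `bound` is the II imposed by resources or by a more constraining recurrence, and that only memory nodes appear in the `locality` dictionary. All names and sample latencies are hypothetical.

```python
from math import ceil

# Priority order taken from the text: most locality first.
LOCALITY_ORDER = ["self-temporal-spatial", "self-temporal", "group-trailing",
                  "self-spatial", "unknown", "without-locality"]

def recurrence_ii(nodes, latency, distance):
    """Bound imposed by one recurrence: ceil(total latency / total distance)."""
    return ceil(sum(latency[n] for n in nodes) / distance)

def minimize_recurrence_effect(nodes, latency, distance, locality,
                               hit_latency, bound):
    """Return a copy of `latency` where loads of the recurrence are moved to
    the cache-hit latency, in locality order, until the recurrence no longer
    exceeds `bound` or every load has been changed."""
    lat = dict(latency)
    loads = sorted((n for n in nodes if n in locality),
                   key=lambda n: LOCALITY_ORDER.index(locality[n]))
    for load in loads:
        if recurrence_ii(nodes, lat, distance) <= bound:
            break
        lat[load] = hit_latency
    return lat

# Hypothetical recurrence of distance 1: a load (N5) and an add (N6).
nodes = ["N5", "N6"]
latency = {"N5": 6, "N6": 1}            # N5 tagged with the cache-miss latency
locality = {"N5": "without-locality"}
relaxed = minimize_recurrence_effect(nodes, latency, 1, locality,
                                     hit_latency=2, bound=4)
print(recurrence_ii(nodes, latency, 1), recurrence_ii(nodes, relaxed, 1))  # 7 3
```

Choosing the loads with most locality first means that the operations moved back to the cache-hit latency are those least likely to miss, so the stall time added by the relaxation is kept small.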
Finally, the minimum number of iterations (UpperBound) that ensures that sch2 is better than sch1 is computed. A main difference between these two schedules is the cost of the prolog and epilog parts, which is lower for the sch1. This bound depends on the computed schedules and the results of the locality analysis and it is calculated through an estimation of the execution time of each schedule. The sch1 is chosen if:
The execution time of a given schedule is estimated as:
The stall time is estimated as:
where penalty is calculated as explained in section 2.
We use a schedule based on the locality analysis instead of the CHL schedule (which achieves the minimum SC) in order to take into account the possible poor locality of some loops.
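The choice between the two schedules can be illustrated with a sketch that assumes the same cost model as in section 2: the estimated time of a schedule is (NITER + SC - 1) × II plus NITER times the expected stall cycles per iteration, where the latter is the sum of miss ratio × penalty over its memory operations. The dictionary layout, the function names and the sample numbers are hypothetical and are not those of Figure 8.

```python
from math import floor

def estimated_time(niter, sched):
    """Estimated execution time of a schedule for `niter` iterations."""
    return (niter + sched["sc"] - 1) * sched["ii"] + niter * sched["stall"]

def break_even_iterations(sch1, sch2):
    """Smallest NITER for which sch2 is strictly faster than sch1.

    Assumes sch2 is cheaper per iteration; returns None otherwise.
    """
    cost1 = sch1["ii"] + sch1["stall"]        # cycles per extra iteration
    cost2 = sch2["ii"] + sch2["stall"]
    if cost2 >= cost1:
        return None
    fixed1 = (sch1["sc"] - 1) * sch1["ii"]    # prolog/epilog overhead
    fixed2 = (sch2["sc"] - 1) * sch2["ii"]
    return max(1, floor((fixed2 - fixed1) / (cost1 - cost2)) + 1)

def choose_schedule(niter, sch1, sch2):
    """Pick the schedule with the lower estimated time (ties go to sch1)."""
    t1, t2 = estimated_time(niter, sch1), estimated_time(niter, sch2)
    return "sch1" if t1 <= t2 else "sch2"

# sch1: short pipeline, one load stalls 9 cycles on 25% of the iterations.
# sch2: all loads at cache-miss latency, deeper pipeline, no expected stalls.
sch1 = {"ii": 4, "sc": 2, "stall": 0.25 * 9}
sch2 = {"ii": 4, "sc": 5, "stall": 0.0}
print(break_even_iterations(sch1, sch2))      # 6
print(choose_schedule(5, sch1, sch2), choose_schedule(6, sch1, sch2))
```

When the trip count is known at compile time the comparison can be resolved statically; otherwise a test of NITER against the break-even value can select between the two versions of the loop at run time.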
Figure 8 shows a simple example that explains how CSMS works. Following the algorithm presented in Figure 6, the first step is to build two data dependence graphs: a) using the latency according to the locality analysis (in this case, we assume that the N2 load exhibits locality, whereas the N5 load does not); and b) using the cache-miss latency for all memory operations. The next step is to reduce the effect that memory instructions inside recurrences have on the initiation interval. In this case, only the N5 load can affect the initiation interval due to recurrences. We can see that using the cache-miss latency to schedule this operation increases the $II_{rec}$ from 3 to 7. Thus, following a locality order, we have to choose memory instructions inside the recurrence and change their latency to the cache-hit latency until the effect of this recurrence is minimized (i.e., either the II is no longer limited just by this recurrence or all the loads have been tagged with the cache-hit latency). The result of this step is a data dependence graph where the initiation interval is bounded by either resources or the $II_{rec}$ after minimizing its effect (see steps c) and d)). At this point there are two graphs that can lead to two different schedules of the loop. The goal is to choose the most convenient one depending on the number of iterations of the loop in order to take into account both the prolog and the epilog. Following the formula previously presented to compute the UpperBound on the number of iterations, and supposing that the miss ratio for the N2 and N5 instructions is 0.25 and 1.0 respectively, the UpperBound is equal to 4. If the two graphs are identical, or the number of iterations of the loop is known statically, the choice of the schedule can be made at compile time. However, if the number of iterations of the loop is unknown at compile time, both schedules are inserted in the code and the decision is made at execution time.
Performance evaluation
In this section we compare the CSMS algorithm with the best scheme for each program based on either early scheduling of memory operations or inserting prefetch instructions. These schemes are called BES (the best early scheduling scheme) and BIP (the best inserting prefetch scheme). Notice that BES and BIP may refer to a different scheme for different programs.
The different algorithms have been evaluated in terms of execution time, which is split into compute and stall time. The stall time is due to dependences and to the lack of entries in the outstanding miss table. In Figure 9 we can see the results for both the simple and the aggressive architectures. For each benchmark all columns are normalized to the CHL execution time. It can be seen that the CSMS algorithm achieves a compute time very close to the CHL scheme whereas it has a stall time very close to the BES scheme. That is, it results in the best trade-off between compute and stall time. In programs where recurrences limit the initiation interval, and therefore the ESA scheme increases the compute time (for instance in hydro2d and turb3d benchmarks), the CSMS method minimizes this effect at the expense of a slight increase in the stall time. The CSMS scheme always performs better than the schemes based on inserting prefetch instructions except for the mgrid benchmark in the aggressive architecture. In this latter case, the BIP scheme is the best one but the performance of the CSMS is very close to it.
The CSMS scheme increases the register pressure when compared with the CHL method. This results in an increase in spill code of 0.1% and 20% for the simple and aggressive architectures respectively. However, the penalty of this additional spill code is much lower than the reduction in stall time. Table 4 shows the relative speed-up of the different schedulers with respect to the CHL scheme.
On average all alternative schedulers outperform the CHL scheme (which is usually the one used by software pipelining schedulers). However, for some programs (mainly for turb3d) the ESA and ESL schedulers perform worse than the CHL due to the increase in the II caused by recurrences. The CSMS algorithm achieves the best performance for all benchmarks. For the simple architecture the average speed-up is 1.61, and for the aggressive architecture it is 2.47. Table 5 compares the CSMS execution time with the optimistic execution time (LBND) defined in section 2, which is used as a lower bound of the execution time. It also shows the percentage of the execution time that the processor is stalled. It can be seen that for the simple architecture the CSMS algorithm is close to the lower bound and causes almost no stalls. For the aggressive architecture, the performance of the CSMS is about 25% worse than that of LBND and the stall time represents about 10% of the total execution time. Notice, however, that the lower bound could be quite far below the actual minimum execution time.
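The three metrics used in this comparison (speed-up over CHL, distance from the lower bound, and fraction of time stalled) can be made concrete with a short Python sketch. The helper names and the cycle counts below are hypothetical and are not taken from Tables 4 and 5; they only illustrate how the reported figures relate to compute and stall time.

```python
def total_time(compute, stall):
    """Execution time is split into compute time and stall time."""
    return compute + stall


def speedup(baseline_time, scheme_time):
    """Relative speed-up of a scheme with respect to a baseline (e.g. CHL)."""
    return baseline_time / scheme_time


def stall_fraction(compute, stall):
    """Fraction of the execution time that the processor is stalled."""
    return stall / (compute + stall)


def overhead_over_bound(scheme_time, lower_bound):
    """How much slower a scheme is than the optimistic lower bound (LBND)."""
    return scheme_time / lower_bound - 1.0


# Hypothetical cycle counts, chosen only for illustration.
chl_time = total_time(compute=100, stall=150)
csms_time = total_time(compute=105, stall=20)
lbnd_time = 100

csms_speedup = speedup(chl_time, csms_time)
csms_overhead = overhead_over_bound(csms_time, lbnd_time)
csms_stalled = stall_fraction(compute=105, stall=20)
```

In this made-up case a 5% increase in compute time is outweighed by the reduction in stall time, which is the kind of trade-off the tables quantify.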
Table 6 shows the increase in compute time and the decrease in stall time of the proposed CSMS scheme. Comparing this table with Table 4, it can be seen that the CSMS scheme achieves the best trade-off between compute time and stall time, which is the reason it outperforms the others.
Table 4. Relative speed-up.
Conclusions
The interaction between software prefetching and software pipelining techniques for statically scheduled architectures has been studied. We have shown that modulo scheduling schemes using cache-hit latency produce many stalls due to dependences with memory instructions. The stall time represents about 32% of the execution time for a simple architecture and 63% for an aggressive architecture. Thus, ignoring memory effects when evaluating a software pipelined scheduler may be rather inaccurate.
Table 6. Increment/decrement of compute/stall time of the CSMS in relation to the CHL.
We have compared the performance of different prefetching approaches based on either early scheduling of memory instructions (binding prefetch) or inserting prefetch instructions (nonbinding prefetch). We have seen that both provide a significant improvement on average, but they may cause significant penalties for particular programs. In general, methods based on early scheduling outperform those based on inserting prefetches. The main reasons for the worse performance of the latter methods are the increase in memory pressure due to prefetch instructions and additional spill code, and their inability to remove short-distance conflict misses.
We have proposed a heuristic scheduling algorithm (CSMS), based on early scheduling, that tries to minimize both the compute and the stall time. The algorithm makes use of a static locality analysis and information about the dependence graph. We have shown that it outperforms the other strategies. For instance, when compared with the approach based on scheduling memory instructions using the cache-hit latency, the produced code is 1.6 times faster for a simple architecture and 2.5 times faster for an aggressive architecture. For the simple architecture, we have also shown that the execution time is very close to a lower bound.
