This paper examines two alternative approaches to supporting code scheduling for multiple-instruction-issue processors. One is to provide a set of non-trapping instructions so that the compiler can perform aggressive static code scheduling.
The application of this approach to existing commercial architectures typically requires extending the instruction set. The other approach is to support out-of-order execution in the microarchitecture so that the hardware can perform aggressive dynamic code scheduling. This approach usually does not require modifying the instruction set but requires complex hardware support. In this paper, we analyze the performance of the two alternative approaches using a set of important nonnumerical C benchmark programs.
A distinguishing feature of the experiment is that the code for the dynamic approach has been optimized and scheduled as much as allowed by the architecture.
The hardware is only responsible for the additional reordering that cannot be performed by the compiler. The overall result is that the dynamic and static approaches are comparable in performance.
When applied to a four-instruction-issue processor, both methods achieve a speedup of more than two over a high-performance single-instruction-issue processor.
However, the performance of each scheme varies among the benchmark programs.
To explain this variation, we have identified the conditions in these programs that make one approach perform better than the other.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
Restricted out-of-order execution tolerates cache misses better than the in-order execution models. The effect is most visible for compress (compare Tables 4 and 5).
From more detailed measurements, we found that compress has a large number of cache misses whose delay can be hidden by the dynamic code scheduler. Figure 3 and Table 6 present speedup results for a 16KB data cache.
The performance of the in-order execution models in Figure 3 is slightly better than in Figure 2. On the other hand, the performance of restricted out-of-order execution was virtually identical in both cases. This shows that the performance of restricted out-of-order execution is less sensitive to cache size than that of the in-order execution models.
programs, four-instruction-issue processors supporting both methods have achieved more than two times speedup over a high-performance single-instruction-issue processor. Both of them perform substantially better than restricted in-order execution.
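The speedup and cache-sensitivity comparisons above can be illustrated with a minimal sketch. It assumes the conventional definition of speedup as the ratio of baseline cycles to measured cycles, which this excerpt does not state explicitly. The function names and the cycle counts in the usage lines are hypothetical placeholders, not values from the paper's tables.

```python
def speedup(base_cycles: int, cycles: int) -> float:
    """Speedup over the single-instruction-issue baseline.

    Assumes speedup = baseline cycles / measured cycles (an assumption,
    not a definition taken from the paper).
    """
    if base_cycles <= 0 or cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return base_cycles / cycles


def cache_sensitivity(speedup_small: float, speedup_large: float) -> float:
    """Relative change in speedup when moving to the larger data cache.

    A value near zero means the execution model is insensitive to cache size.
    """
    if speedup_small <= 0:
        raise ValueError("speedup must be positive")
    return (speedup_large - speedup_small) / speedup_small


# Placeholder cycle counts, for illustration only.
s = speedup(base_cycles=1_000_000, cycles=400_000)
print(f"speedup: {s:.2f}")
print(f"sensitivity: {cache_sensitivity(2.0, 2.2):.2%}")
```

Under this reading, an execution model whose sensitivity stays close to zero across the two cache sizes matches the behavior reported for restricted out-of-order execution.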
General in-order

