Many contemporary multiple issue processors employ out-of-order scheduling hardware in the processor pipeline. Such scheduling hardware can yield good performance without relying on compile-time scheduling. The hardware can also schedule around unexpected run-time occurrences such as cache misses. As issue widths increase, however, the complexity of such scheduling hardware increases considerably and can have an impact on the cycle time of the processor.
I. Introduction
Current multiple issue processors such as the Hewlett Packard PA-8000 and the Intel Pentium Pro employ out-of-order instruction issue [1], [2]. One advantage of this approach is that compiler scheduling is not required to attain high levels of instruction level parallelism (ILP). Additionally, dynamic scheduling hardware can deal effectively with unanticipated run-time events (e.g., cache misses). There are also disadvantages. Scheduling is performed on a subset of instructions, the size of which is limited by the size of a central hardware window; the hardware does not have global knowledge of the instruction stream. The complexity of the hardware required for out-of-order issue can lead to an increase in the processor cycle time as issue widths increase. What is desirable is the adaptability of an out-of-order issue design and the aggressive cycle time of a less complex design without the compatibility limitations of a strong dependence on compile-time instruction scheduling.
An alternative to dynamic scheduling within the processor core is miss path scheduling (MPS). The traditional superscalar has an instruction cache that is accessed on every cycle for instructions. These instructions are then examined by scheduling hardware, which determines which instructions are safe to issue in the current cycle and performs dynamic scheduling. In the MPS design, instructions are fetched from a cache that holds wide instruction words. The wide words are composed of instructions that are guaranteed to be free of any dependencies or restrictions that can prevent parallel issue. Dependency checking or dynamic scheduling hardware is not required in the path between the cache and the execution pipeline, although simpler in-order dependency enforcement might be useful for cache miss handling.
The basic idea of MPS is depicted in Figure 1. Scheduling hardware is inserted on the path between the instruction cache and the next level of memory. Instructions are scheduled for execution at instruction cache (i-cache) miss time. A schedule is composed of a set of wide instruction words, as in a VLIW machine [3]. As a schedule of instructions is formed, it is placed into a specially designed instruction cache. After a schedule is formed, its constituent instructions are issued to a simple but aggressively-clocked execution core. Earlier schemes proposed by Melvin, Shebanow and Patt [4] and extended by Franklin and Smotherman [5] were among the first to use the paradigm of forming a schedule of instructions outside of the processor core. Their approaches scheduled between branch instructions only, limiting the amount of ILP the designs could extract. This paper describes a miss path scheduling design that speculates across multiple branches. An algorithm for MPS is outlined, the details of the hardware implementation are discussed, and performance results are presented.
This paper is organized as follows. Section II discusses previous work related to miss path scheduling. Section III introduces MPS through an example and discusses how speculation is performed. Section IV describes the implementation of an MPS scheduler, and Section V presents detailed experimental results on the performance of MPS-scheduled code. Section VI introduces a special instruction cache called a schedule cache (SC) that holds the schedules constructed by MPS. Results for a proposed SC design are presented. Section VII concludes the paper.
II. Background work
The foundation of the concept of miss path scheduling in a microprocessor is the fill unit [4]. The original fill unit proposal was for use in a dynamically scheduled multiple issue processor. The goal was to form large groups of instructions, called multinode words, from which the scheduling hardware could more easily extract parallelism. A fill unit operates as follows. As instructions are received from memory, they are merged into multinode words and placed into a buffer. Filling completes when either the unit is full (a hardware limit) or a branch instruction is encountered. When filling stops, the contents of the fill unit are copied into a decoded instruction cache. Instruction fetch is performed by accessing lines in the decoded cache. Dynamic scheduling hardware is used between the decoded cache and the execution pipeline. A more recent study examined the use of a fill unit for constructing VLIW-like instructions to be placed into a shadow cache [5]. The idea was to build wide instruction words that contained no inter-instruction constraints on multiple issue, similar to a VLIW. Speculation across branches was not performed, but the design was able to fetch both paths of a branch, similar to a tree instruction [6]. The design assumed that the fill unit executed in parallel with the execution pipeline (the original fill unit proposal does not explicitly comment on this aspect). The fill unit accepts instructions from a backup instruction cache. A followup to the shadow cache idea was a proposal to use the fill unit for decoding a CISC instruction set [7].
Another miss path scheduling design was the expanded parallel instruction cache (EPIC) [8]. EPIC performs limited dynamic scheduling at cache miss to ease decode requirements at run-time and form VLIW-like instructions from a RISC instruction stream. Limited speculation is performed: instructions that follow a branch are allowed to issue in the same cycle as the branch but not any earlier. A novel feature of an EPIC is that a cache line can hold multiple wide-words that are to be issued in different cycles.
The trace cache takes a different approach to multiple-instruction issue [9]. The design performs alignment and merging of instruction runs across multiple branches and places the resulting instruction sequence (the trace) as a line into a trace cache. Traces are formed based on the current fetch address and branch predictions returned from a multiple branch predictor. A trace consists of a sequence of basic blocks. The trace cache is accessed on each clock cycle using the current fetch address and the multiple predictions given by a hardware branch predictor. A back-up instruction cache is accessed in parallel with the trace cache for the situation in which the trace cache misses. Dynamic scheduling hardware is used between the caches and the execution pipeline.
Nair and Hopkins recently described a technique called dynamic instruction formatting (DIF) [10]. A DIF-based machine consists of two execution engines, a primary sequential engine and a parallel engine. Instructions are initially executed on the primary engine. As an instruction sequence is executed, the dependencies between the instructions are detected and used to construct a schedule for the instruction sequence that can subsequently execute on the parallel engine. The schedule is placed into a DIF cache that feeds the parallel engine.
A DIF machine has a learning mode and a parallel execution mode, and two execution engines (logically distinct execution pipelines), one for sequential execution and the other for parallel execution. The first time an instruction sequence is fetched, it is scheduled and cached in the DIF cache for execution on the parallel engine as it executes on the sequential engine. Subsequent accesses to the instruction sequence access the schedule from the DIF cache. DIF and MPS have much in common, but there are also significant differences. MPS uses one i-cache, a variant on a conventional design; DIF uses a conventional i-cache and a separate DIF cache to hold scheduled instructions. MPS uses one execution pipeline to execute all instructions; DIF uses two pipelines, the sequential and parallel engines. For these two reasons, a DIF design could require more die space than an MPS design. DIF does have the advantage of masking schedule formation time by scheduling while executing sequential code on the sequential engine. Table I summarizes the previous work related to MPS and characterizes the previous approaches based on several factors. The table provides a simple comparison between the various approaches and also lists the characteristics of MPS.
III. A simple scheduling algorithm
A variation of the operation scheduling algorithm as presented by Ellis [11] is well-suited for MPS. The algorithm is practical for hardware implementation because scheduling is performed by processing an instruction stream sequentially and does not require prior construction of a dependency graph. The algorithm is referred to as the MPS algorithm for the remainder of this paper. The following section introduces the technique through an example and comments on the hardware structures required (an informal outline is presented in the appendix).
A. An example of miss path scheduling
A portion of assembly code and a machine for which it is to be scheduled are shown in Figure 2. The machine is a three issue processor that has two fully pipelined integer ALUs and a fully pipelined load/store unit. The ALUs have a latency of three cycles for multiply and divide operations and one cycle for all other operations. The load/store unit has a two cycle latency for loads and a one cycle latency for stores. A reservation Figure 3. The state after scheduling all of the instructions is shown in Figure 4. The preceding example illustrates that the steps involved in miss path scheduling an instruction are simply: 1) checking the availability of source operands in the def-use The size of the entries depends on the maximum length allowed for a schedule, but in general this will be much smaller than an integer or floating point value. The reservation table is a two-dimensional bit matrix that is indexed with the value of the first cycle in which an instruction can be scheduled. A priority encoder can be used for a quick search of the reservation table, returning the first available issue cycle when given an initial potential issue cycle. Section IV discusses the additional requirements for miss path scheduling.
The example in Figures 3-4 considers only flow dependencies. Anti- and output dependencies must also be honored. To honor an anti-dependency while scheduling instruction X, the def-use table is checked to get the last use time of X's destination operands (call this LastDstUse). The scheduling time for X is set so that X does not retire its results before LastDstUse. To honor an output dependency, the def-use table is checked for the def times of X's destination operands (call this LastDstDef). X's scheduling time is set so that it does not retire its results before LastDstDef.
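The scheduling steps described above can be illustrated with a minimal software sketch. It is not the hardware design of this paper: the class and field names are hypothetical, the reservation table is modeled as a dictionary rather than a bit matrix searched by a priority encoder, and the exact timing conventions are assumptions (a result issued in cycle t with latency L is taken to become readable in cycle t + L, and anti- and output dependencies are enforced by requiring the new result to land strictly after LastDstUse and LastDstDef).

```python
from collections import defaultdict

# Hypothetical machine model loosely following the example machine:
# two integer ALUs and one load/store unit (three-issue).
UNITS = {"alu": 2, "mem": 1}
LATENCY = {"mul": 3, "div": 3, "load": 2, "store": 1}  # all other ops: 1 cycle


class MissPathScheduler:
    def __init__(self):
        # Def-use table: per register, the def time and the last-use time.
        self.avail = defaultdict(int)             # first cycle the newest value is readable
        self.last_use = defaultdict(lambda: -1)   # latest issue cycle of any reader
        # Reservation table: (unit class, cycle) -> number of slots taken.
        self.reserved = defaultdict(int)

    def schedule(self, op, dst=(), src=()):
        """Return the issue cycle chosen for one instruction."""
        unit = "mem" if op in ("load", "store") else "alu"
        lat = LATENCY.get(op, 1)
        # Flow dependencies: wait until every source value is available.
        t = max([self.avail[r] for r in src], default=0)
        for r in dst:
            # Anti-dependency: the result must land after LastDstUse.
            t = max(t, self.last_use[r] - lat + 1)
            # Output dependency: the result must land after LastDstDef.
            t = max(t, self.avail[r] - lat + 1)
        # Reservation table search: first cycle >= t with a free unit.
        while self.reserved[(unit, t)] >= UNITS[unit]:
            t += 1
        self.reserved[(unit, t)] += 1
        # Update the def-use table with this instruction's uses and defs.
        for r in src:
            self.last_use[r] = max(self.last_use[r], t)
        for r in dst:
            self.avail[r] = t + lat
        return t


s = MissPathScheduler()
cycles = [
    s.schedule("load", dst=("r1",), src=("r9",)),
    s.schedule("add", dst=("r2",), src=("r1",)),      # waits for the load result
    s.schedule("load", dst=("r3",), src=("r9",)),     # load/store unit busy in cycle 0
    s.schedule("mul", dst=("r1",), src=("r3", "r2")),
    s.schedule("add", dst=("r9",), src=("r4",)),      # anti-dependency on r9
]
print(cycles)
```

Note that the instruction stream is processed strictly in order and no dependency graph is built; each instruction only consults and updates the two tables.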
B. Speculative scheduling at miss time
The previous example illustrated how miss path scheduling works on code without branches. Scheduling in the presence of branches is as follows: as a sequence of nonbranch instructions is scheduled, a LatestScheduleTime is maintained to track the latest cycle in which an instruction is scheduled. When a branch is encountered, its scheduling time is constrained to be no earlier than LatestScheduleTime. This prevents the branch instruction from executing prior to instructions that precede it.
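The branch constraint can be sketched in a few lines. The class below is a hypothetical illustration of the LatestScheduleTime bookkeeping only; the name `BranchOrderTracker` and its methods are not part of the design described here, and the operand-based earliest cycle of the branch is assumed to be computed elsewhere.

```python
class BranchOrderTracker:
    """Tracks LatestScheduleTime and applies it to branch instructions."""

    def __init__(self):
        self.latest_schedule_time = 0

    def note_scheduled(self, cycle):
        # Called for every instruction once its issue cycle is chosen.
        self.latest_schedule_time = max(self.latest_schedule_time, cycle)

    def constrain_branch(self, earliest_cycle):
        # A branch may issue no earlier than the latest preceding instruction.
        cycle = max(earliest_cycle, self.latest_schedule_time)
        self.note_scheduled(cycle)
        return cycle


tracker = BranchOrderTracker()
for c in (0, 3, 1):              # issue cycles of three non-branch instructions
    tracker.note_scheduled(c)
branch_cycle = tracker.constrain_branch(2)   # operands ready in cycle 2
```

In this example the branch is pushed from cycle 2 to cycle 3, the latest cycle used by the instructions that precede it.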
Miss path scheduling can speculate an instruction across multiple branches. When a conditional branch is encountered, the scheduler predicts the direction of the branch and begins scheduling instructions from the predicted path. Instructions from the predicted path may be scheduled higher (earlier) than the preceding branch. They may be placed ahead of several preceding branches. When speculation is performed, speculated instructions must be prevented from retiring their results when their control-dependent branches are mispredicted. Hardware support identical to that used by speculative out-of-order issue designs can be used to accomplish this [13], [14], [15].
IV. Details of a miss path scheduler
The previous section introduced some of the hardware structures required for an MPS implementation. The data stored in the def-use table and the reservation table are used to make scheduling decisions. This section details the requirements on this scheduling logic.
A. Scheduling logic
Section III-A informally outlined that an MPS design can schedule an instruction I in three steps. To review briefly and add some details, these steps are:
1. Read I's source and destination operands' def-use times from the def-use table and the reservation table. In parallel, place I into an appropriate position in the instruction cache. I-cache issues are purposefully left vague at this point; they are discussed in detail in Section VI.
These three steps can be performed in as few as four clock cycles: two cycles for Step 1 (one cycle to read values and an additional cycle to compute ScheduleTime), one cycle for Step 2, and one cycle for Step 3. The lookup of the reservation table could require more than one cycle if the number of cycles that are searched is large. However, we will show in Section V that the number of cycles to search is usually moderate, and so this table can in fact be small enough to search in one cycle. Therefore, a miss path scheduler requires a minimum of four cycles to schedule an instruction.
The number of ports required on the def-use table is moderate. Accessing an entry in the def-use table accesses all of the information regarding the corresponding register: last-use time and def-time. Step 1 performs reads and Step 3 performs writes for all of an instruction's register operands. For most current ISAs, this is a maximum of three operands. Therefore, Step 1 requires three read ports and Step 3 requires three write ports, so the maximal overall port requirements on the table are three read ports and three write ports.
Scheduling of individual instructions has been outlined but overall program execution has not. The following outline enumerates the steps involved in program execution on a miss path design. Figure 5 shows a high-level hardware organization of a miss-path design. Program Execution Flow:
1. Use the PC to access the instruction cache, which holds scheduled blocks of instructions. This is identical to instruction fetch in a conventional processor.
4. After schedule formation (cache miss processing) completes, use the PC to re-access the cache for instruction fetch. Wide-words from the newly-formed schedule are sent to the processor core for execution.
An arbitrarily long instruction stream can be scheduled, but it is more practical to use a halt condition to terminate miss processing (miss path scheduling). For example, scheduling could terminate when a conditional branch is encountered (this implements basic-block-only scheduling). Because scheduling is not performed across the entire program by one invocation of the miss path scheduler (one cache miss), dependencies between schedules are unknown to the scheduler. Such dependencies can be handled within the execution core by using a scoreboard [16], [17].
There is a sequential nature to the three steps in miss path scheduling, which allows a natural mapping of the process to a pipelined implementation. Pipelining can reduce the average scheduling time per instruction with little hardware overhead. The scheduling pipeline mimics an execution pipeline in its ability to interlock and stall by using busy bits, as in a conventional scoreboard. Figure 6 depicts an example of this. Instructions A and B from the sample code in Figure 2 are shown in stages 1 and 2, respectively. A RAW hazard on r2 exists between the two instructions. Because A has yet to clear the busy bit for r2 when B is in stage 1, B cannot obtain an initial value for ScheduleTime. B stalls until A clears r2's busy bit in stage 4. After A exits stage 4, B can proceed to stage 2. The appendix contains a detailed diagram of a pipeline for miss path scheduling along with a detailed explanation.
B. Speculative miss path scheduling
Extracting larger amounts of ILP from most general purpose programs requires scheduling beyond basic blocks. Miss path scheduling can perform speculative scheduling, as explained in Section III-B. If speculation is used, a mechanism is required to prevent incorrectly speculated instructions from retiring their results. A method that is well-suited for a speculative miss-path scheduler is a reorder buffer with a future file to supplement the architectural register file [13]. Slots are allocated in the reorder buffer in original program order (this is preserved by the scheduler and stored with the individual instructions in the cache).
The central issue in speculating instructions is choosing which instructions to speculate, a decision that relies on predicting which path a branch will take. For architectures that support a prediction bit in the instruction encoding [18], [19], the scheduler can use this bit to make a prediction. The bit can be set based on profile information (termed profiled prediction) or a simple heuristic such as backward taken, forward not taken (BTFNT). In the absence of an ISA-level prediction bit, a simple heuristic such as forward-taken (FT) can be implemented in hardware. Dynamic prediction can also be used.
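The static prediction options named above can be summarized in a short sketch. The function and argument names are hypothetical. How FT treats a backward branch is not specified in the text; the sketch assumes not-taken, which is immaterial for a scheduler that halts on backward branches.

```python
def predict_taken(pc, target, heuristic, hint_bit=None):
    """Return True if the branch at `pc` with target `target` is predicted taken."""
    if heuristic == "hint":
        # ISA-level prediction bit, e.g. set from profile information.
        if hint_bit is None:
            raise ValueError("the 'hint' heuristic needs a prediction bit")
        return bool(hint_bit)
    forward = target > pc
    if heuristic == "BTFNT":
        # Backward taken, forward not taken.
        return not forward
    if heuristic == "FT":
        # Forward taken; backward branches assumed not taken (an assumption).
        return forward
    raise ValueError(f"unknown heuristic: {heuristic}")


forward_ft = predict_taken(0x100, 0x140, "FT")
forward_btfnt = predict_taken(0x100, 0x140, "BTFNT")
```

For the same forward branch, FT and BTFNT give opposite predictions, which is why the choice of heuristic matters for code in which the compiler has already positioned likely targets on the fall-through path.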
When a branch is predicted taken by the MPS scheduler, the scheduler issues memory requests using the branch target address to change the path along which instructions are fetched. The cost of the memory latency in clock cycles is re-incurred. To change the fetch path dynamically, the scheduler must first determine the branch target address. The target of a PC-relative branch is determined in stage 1 of the scheduler pipeline, after the immediate field has been decoded. Indirect branches use a register source operand to determine the target address and are more difficult to process. The branch target address is not known, so a direction-based prediction cannot be made. If a predictor that can predict indirect branches is used, the branch target address is unknown, so for a predicted-taken branch, the scheduler is unable to fetch and schedule instructions from the predicted target. For this reason, the current design of MPS stops scheduling when it encounters an indirect branch (this is a halt condition). MPS also stops scheduling when it encounters a backward branch to prevent loop unrolling, as we have yet to develop robust heuristics for schedule-time estimation of loop trip counts.
When a schedule S is executed the entire way through (no internal branches are mispredicted), the machine needs to know the address of the next schedule to execute. If S does not contain any branch instructions, the address calculation is simply next address = starting address of S + number of instructions in S. If S contains branch instructions, the calculation is not so straightforward and in fact cannot be performed when the execution of S completes. The MPS scheduler tracks the path along which S is formed and computes a next address field after the scheduling of S completes. When S is executed, its associated next address is passed to the machine and is used to locate the next schedule.
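The path tracking and the halt conditions can be sketched as follows. This is an illustrative model only: the `Instr` record and `form_path` function are hypothetical, addresses are counted in instructions, and the sketch assumes that the branch that triggers a halt is left out of the schedule, so that the next address points at it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    kind: str                      # "op", "cond_branch", or "indirect_branch"
    target: Optional[int] = None   # target address (PC-relative branches only)
    predict_taken: bool = False    # direction chosen by the scheduler's predictor


def form_path(program, start, max_instrs=64):
    """Follow the predicted path from `start`; return (addresses, next_address)."""
    path, pc = [], start
    while pc in program and len(path) < max_instrs:
        ins = program[pc]
        if ins.kind == "indirect_branch":
            break                  # halt condition: target unknown at schedule time
        if ins.kind == "cond_branch" and ins.target <= pc:
            break                  # halt condition: backward branch
        path.append(pc)
        if ins.kind == "cond_branch" and ins.predict_taken:
            pc = ins.target        # change the fetch path to the predicted target
        else:
            pc += 1
    return path, pc


program = {
    0: Instr("op"), 1: Instr("op"),
    2: Instr("cond_branch", target=5, predict_taken=True),
    3: Instr("op"), 4: Instr("op"), 5: Instr("op"),
    6: Instr("indirect_branch"),
}
path, next_address = form_path(program, 0)

straight = {10: Instr("op"), 11: Instr("op"), 12: Instr("op")}
s_path, s_next = form_path(straight, 10)
```

For the branch-free stream the result agrees with the simple rule (next address = starting address + number of instructions); for the stream with a predicted-taken branch the next address can only be obtained by following the path, as the scheduler does.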
V. Experimental evaluation of a miss path scheduler
The performance of an MPS scheduler was evaluated using trace-driven simulations. All eight of the integer programs from the SPEC95 suite were used as the benchmark set (reference inputs were used). The programs were compiled with the IMPACT compiler from the University of Illinois [20] using a target machine model of the Hewlett Packard PA-7100 processor. The compiled code was passed through a cycle-by-cycle simulator that modeled the behavior of the MPS scheduler and the execution pipeline, including the effects of mispredicted branches and mis-speculation. The execution pipeline has four stages: fetch, decode, execute, and writeback. The processor core has eight fully pipelined functional units (the configuration and latencies are listed in Table II). The machine supports execution of multiple branches in parallel on any of the four I-ALUs.
The branch misprediction penalty is two cycles. A 32 bits/cycle bandwidth with a five cycle latency is assumed between the scheduler and the next level of memory. Each benchmark was run for 20,500,000 instructions to bypass initialization code. Statistics were gathered by running the program for up to an additional 200,000,000 instructions or until program completion. An infinite instruction cache and a perfect data cache were assumed.
There are two aspects of an MPS scheduler that are interesting: the performance of the code produced by the scheduler and the performance of the scheduler itself. Since the goal is to produce parallel code for multiple-issue processors, speculative scheduling is the focus for the remainder of the paper. Scheduler performance is briefly examined first. The time required to construct a schedule is time that the processor core is not executing instructions and therefore is an overhead of the MPS paradigm (see footnote 1). The performance of pipelined and unpipelined miss path scheduling was measured for speculative scheduling across an unlimited number of branches. Profiled branch prediction was used. Training inputs were used for the profiling runs. The results are presented in Table III. The pipelined implementation required 12 fewer clock cycles to form a schedule. A pipelined design was assumed for the remainder of the results in this paper due to its superior performance with little hardware overhead.
Table III also presents the number of stall cycles when scheduling, and instruction counts and schedule lengths for the schedules. The time to build a schedule is on average 137% greater than the time to fetch the instructions from the L2 cache memory. The additional time for scheduling is caused by the time to drain the scheduling pipeline (a fixed quantity dependent on the pipeline depth) and by stalls in the scheduling pipeline. Recall that scheduling stalls are caused by two things: a data hazard, as shown in Figure 6, or changing the memory fetch path to fetch instructions from a non-sequential branch target address. Since the IMPACT compiler performs code positioning, i.e., the likely target of a conditional branch is positioned as the fall-through, the scheduler rarely had to change the fetch path. As Table III shows, the time to build a schedule roughly fits the equation
Time to schedule = memory latency + number of instructions + scheduling stall cycles + depth of scheduling pipeline.
Although constructing an individual schedule requires considerably more time than simple cache miss repair, the extra cost is amortized over time if the schedule performs well and is executed frequently.
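As a small worked illustration of the equation above, the following sketch evaluates it for one hypothetical schedule. Only the five cycle memory latency comes from the simulated configuration; the instruction count, stall count, and pipeline depth are made-up inputs chosen for illustration and are not values from Table III. The fetch-only time assumes that one 32-bit instruction arrives per cycle after the initial latency.

```python
def time_to_schedule(memory_latency, num_instructions, stall_cycles, pipeline_depth):
    """Approximate cycles needed to form one schedule (all inputs in cycles/counts)."""
    return memory_latency + num_instructions + stall_cycles + pipeline_depth


def fetch_only_time(memory_latency, num_instructions):
    """Cycles for a plain miss repair, assuming one instruction arrives per cycle."""
    return memory_latency + num_instructions


# Hypothetical schedule: 12 instructions, 3 stall cycles, 4-deep scheduling pipeline.
build = time_to_schedule(memory_latency=5, num_instructions=12,
                         stall_cycles=3, pipeline_depth=4)
fetch = fetch_only_time(memory_latency=5, num_instructions=12)
extra_fraction = (build - fetch) / fetch   # relative overhead of scheduling
```

The overhead term consists only of the stall cycles and the pipeline drain, so it shrinks relative to the total as schedules get longer or stalls become rarer.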
The instruction counts for schedules varied widely across programs, ranging from six to eighteen. The schedule lengths were invariably shorter than the instruction counts, indicating that every program contains an intrinsic amount of ILP. Additionally, the schedule lengths were moderate, ranging from 4.69 to 10.22. Recall from Section IV-A that the MPS scheduler has to search the reservation table for an open execution slot using a priority encoder. If the range of cycles to search is excessive, the search could take multiple clock cycles or impact the cycle time of the entire machine. The moderate schedule lengths imply that the priority encoder usually has to search a very small range and therefore will not take multiple cycles or stretch the cycle time.
Footnote 1: Note that scheduling time is independent of the instruction and data caches. An instruction cache miss triggers schedule formation but does not affect scheduling time. The data cache is not used during scheduling.
We now focus on the characteristics of the code produced by the scheduler. The quality of an MPS-produced code schedule is heavily influenced by two factors: the instruction count and the accuracy of the branch predictor. Table IV presents statistics on the behavior of conditional branches in the benchmark programs. Two branch prediction strategies were simulated: a static forward-taken (FT) heuristic and profiled prediction. As expected, profiled prediction yielded a much lower misprediction rate than the simple FT heuristic. In fact, profiled prediction provided an accuracy rate of just under 90%, a figure competitive with simple hardware schemes [21]. The number of branches per schedule and the point where control-flow exited a schedule are also presented. The former is a static (schedule-time) measure of the number of basic blocks in a schedule, and the latter is a dynamic (processor core execution-time) measure of how many basic blocks in a schedule are executed and retired.
For example, a typical schedule in perl contains 4-5 basic blocks and executes 3-4 of these blocks. The average number of basic blocks per schedule ranges from 1.58 to 4.51 and is significantly greater than the code available from basic-block scheduling. The dynamic behavior of exits from the schedules shows that 1.45-3.35 basic blocks are executed before a branch is mispredicted and control exits the schedule, off of the predicted path (off-path).
The benefit of scheduling across multiple basic blocks is measured by the number of speculated instructions that retire their results. Table V presents data on code speculation performed by the MPS scheduler. The number of speculated instructions per schedule ranges from under two to over eleven. The percentage of speculated instructions that retired their results (which we term useful yield) ranges from 60% to over 90%. For all benchmarks, the useful yield is over 60%. This indicates that the MPS scheduler does an effective job of speculation.
The data in Tables III-V can help intuit which programs will perform well using MPS, but the final measure is the execution times of the programs. A metric that captures overall performance is Useful Operations Completed per Cycle (OPC). The MPS simulator keeps precise counts of the number of instructions issued and the number that retire their results. Since scheduling does not introduce any additional instructions, i.e., no code expansion occurs, OPC can be used to compare between different configurations of MPS schedulers as well as with other machine models, such as a superscalar processor. The performance of a speculative MPS scheduler was measured. The scheduler speculated across an arbitrary number of branches, with halt conditions of a backward branch or a branch through a register. Due to our experimental framework, the scheduler was not able to schedule across library calls, which an actual hardware implementation would be able to do.
Therefore, procedure calls were also used as a halt condition. The branch prediction strategies used were an FT heuristic and profiled prediction. The results are presented in Figure 7. Basic-block results are included to determine whether speculation proved beneficial to overall performance. In general, speculation yielded better performance than basic block scheduling across all programs. The exception is ijpeg, where the FT predictor produced a lower OPC than basic block scheduling due to its high misprediction rate (53%). Profiled prediction consistently outperformed both basic block scheduling and speculation with the FT predictor. For m88ksim and vortex, the differences between profiled prediction and FT are small. The results reinforce the intuition that with good branch prediction, speculation can improve performance over basic block scheduling.
The results also demonstrate that for some programs, speculation is beneficial even with a poor predictor. Although go, compress, and li suffer from branch misprediction rates in excess of 20% when using an FT predictor, their OPCs are still higher than for basic-block scheduling. The data in Table VI illustrate why. The useful yield for all three programs decreases only marginally even though the misprediction rate increases by 75-156%. Contrast this with ijpeg, for which a 189% increase in the misprediction rate reduces the useful yield by 19%.
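The two metrics used throughout this section, useful yield and OPC, are simple ratios of simulator counters. A minimal sketch follows; the function names and the counter values in the example are hypothetical and are not measurements from the experiments.

```python
def useful_yield(speculated_retired, speculated_issued):
    """Fraction of speculated instructions that retired their results."""
    if speculated_issued == 0:
        return 0.0                      # nothing was speculated
    return speculated_retired / speculated_issued


def opc(retired_ops, cycles):
    """Useful Operations Completed per Cycle: retired (not merely issued) ops."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return retired_ops / cycles


# Hypothetical counters for one simulated run.
y = useful_yield(speculated_retired=45, speculated_issued=60)
rate = opc(retired_ops=200, cycles=80)
```

Because OPC counts only operations that retire, mis-speculated work lowers neither the numerator nor the instruction count of the program, which is what allows the metric to be compared across scheduler configurations.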
VI. Instruction cache support: the Schedule Cache
An i-cache for an MPS-based machine must hold schedules of instructions and therefore is called a schedule cache (SC). The requirements for an SC differ from those for a traditional i-cache and in some respects exceed them. For the remainder of this paper, a traditional i-cache is referred to as an i-cache and a schedule cache as an SC. This section describes these requirements and suggests two potential organizations for an SC.
A. Requirements for a Schedule Cache
One of the differences between an i-cache and an SC is the unit of placement and replacement. In an i-cache, the unit is a cache line or, if sub-blocking is used, a sub-block. In an SC, the unit is a schedule. MPS reorders instructions so that the instructions in a schedule are not necessarily in program order. If any part of a schedule is displaced, the entire schedule is invalid. Therefore, one of the requirements of an SC is invalidation of a complete schedule S when any portion of S is displaced.
The location of a schedule in an SC (the location of the first MultiOp in the schedule) is determined using bit selection on the address of the first instruction in the schedule (this is identical to normal cache addressing). However, the high-order bits of the address are not sufficient for tagging a schedule. If multiple schedules are formed starting from addresses that have identical high-order bits and map to the same SC location, the tag match cannot determine which of the schedules is cache-resident. However, the offset bits of the address differentiate between instructions. Therefore a schedule is tagged with a concatenation of the high-order bits and the offset bits of the first instruction address (see footnote 2). This tagging scheme differs only slightly from that in a traditional i-cache and permits direct-mapped or set-associative organizations.
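The tagging scheme can be sketched with a few bit operations. The function name, the field widths, and the use of instruction-granular addresses are assumptions made for illustration; the paper does not fix them.

```python
def sc_index_and_tag(addr, offset_bits, index_bits):
    """Split an instruction address into an SC index and a schedule tag.

    The tag is the concatenation of the high-order bits and the offset bits,
    so schedules that start at different instructions of the same line differ.
    """
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)   # bit selection
    high = addr >> (offset_bits + index_bits)
    tag = (high << offset_bits) | offset
    return index, tag


# Two start addresses with identical high-order and index bits, different offsets.
a = sc_index_and_tag(0b1011_0101_10, offset_bits=2, index_bits=4)
b = sc_index_and_tag(0b1011_0101_01, offset_bits=2, index_bits=4)
```

Both addresses select the same SC location, but their tags differ, so a tag match identifies which of the two schedules is resident.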
Each branch instruction in a schedule has associated with it the prediction made for it when it was scheduled. The prediction has to be supplied to the processor core so that it can be checked against the execution-time outcome of the branch, to see if the execution remains on-path within the current schedule or will go off-path. An SC must store each branch's prediction (only 1 bit per branch is required).
A cache must also store the next address value associated with each schedule.
B. The uncompressed schedule cache design
One potential organization for an SC is a traditional i-cache with additional hardware. The design is similar to that for the uncompressed cache described in Conte et al. [22] and is called an uncompressed SC. The design is shown in Figure 8. Each cache line holds one MultiOp. The cache index of the first MultiOp in a schedule is determined using bit selection, as described above. The remaining MultiOps for the schedule are located in consecutive cache lines. Single-word sized sub-blocks are used to allow the placement of instructions into arbitrary positions within a cache line. NOPs are used as placeholders for issue slots that do not issue an instruction during a particular cycle. In this respect, the uncompressed SC is a mirror of the reservation table. NOPs used as placeholders within a MultiOp are termed horizontal NOPs. In Figure 8, MultiOp B consists of three instructions. The remaining five issue slots are not scheduled to issue an instruction and are horizontal NOPs. An empty cycle in a schedule in which no instructions are issued, i.e., the MultiOp consists entirely of NOPs, is termed a vertical NOP and is represented as a cache line consisting entirely of NOPs. In Figure 8, the cache line encircled by the dotted line is a vertical NOP.
Footnote 2: This assumes that multiple schedules starting from the same instruction cannot concurrently reside in the SC. Such path associativity [9] can be implemented within an SC with minor additional hardware but is not explored here to simplify the discussion.
Each cache line has several status fields associated with it that aid in cache access:
Line valid bit: When set, this bit indicates that the cache line is valid.
Op valid bits: One bit for each issue slot in a cache line. When set, the corresponding Op is valid and is passed on to the functional unit. When the bit is clear, the functional unit is issued a NOP.
Distance field: This field is set to the number of cache lines between the current cache line and the cache line that holds the first MultiOp in the schedule. Theoretically, a schedule can span the entire cache, so for a cache with n lines, log2 n bits are required for the field. In an implementation, however, there will be a restriction on the maximum length of the schedule to m cycles (m < n), for which the field width is log2 m bits.
Last line bit: When set, the cache line is the last line in a schedule. A vertical NOP is identified by a cleared line valid bit and a cleared last line bit (this indicates that the line was not written to during schedule formation but is not the last line in the schedule).
Next address field: This is the address to use for the next cache access if the schedule executes the entire way through (stays on-path). The field is set for only the last cache line in a schedule.
Cache access in an uncompressed SC with n lines proceeds as follows (assume a direct-mapped organization). Bit selection on the PC is used to generate an index value. The tag at index is matched with the PC to check if the requested schedule is cache-resident. If a tag match occurs and the valid bit is set, this is an SC hit, and the MultiOp is fetched and passed to the execution core. Since subsequent MultiOps are located in consecutive cache lines, the index for the next cache access, next index, is generated using an increment of the form next index = (index + 1) mod n. A PC value from the processor core is not required for cache access as long as execution remains on-path.
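The access sequence above can be illustrated with a minimal Python sketch. It is not the hardware design itself: the Line class, the 4-slot WIDTH, the use of the full PC as the tag, and the function name fetch_schedule are all assumptions made for illustration, and the number of lines is assumed to be a power of two so that bit selection reduces to a mask.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NOP = None   # hypothetical encoding of an empty issue slot
WIDTH = 4    # hypothetical issue width (slots per MultiOp)

@dataclass
class Line:
    tag: Optional[int] = None            # assumption: the full PC is kept as the tag
    line_valid: bool = False
    ops: List[Optional[str]] = field(default_factory=lambda: [NOP] * WIDTH)
    op_valid: List[bool] = field(default_factory=lambda: [False] * WIDTH)
    last_line: bool = False
    next_address: Optional[int] = None   # meaningful only on the last line

def fetch_schedule(cache, pc):
    """Return (multiops, next_address) for an on-path schedule, or None on a miss."""
    n = len(cache)                       # assumed to be a power of two
    index = pc & (n - 1)                 # bit selection on the PC
    first = cache[index]
    if not (first.line_valid and first.tag == pc):
        return None                      # SC miss: the scheduler would be invoked
    multiops = []
    for _ in range(n):                   # a schedule can span at most n lines
        line = cache[index]
        # Slots with a cleared Op valid bit, and every slot of a line whose
        # line valid bit is clear (a vertical NOP), are issued as NOPs.
        multiops.append([op if (line.line_valid and ok) else NOP
                         for op, ok in zip(line.ops, line.op_valid)])
        if line.last_line:
            return multiops, line.next_address
        index = (index + 1) % n          # next index = (index + 1) mod n
    return None                          # no last line found: treat as a miss
```

Note that the PC is consulted only for the first line; every later line is reached by the increment alone, which matches the observation that no PC value is needed while execution stays on-path.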
When a cache miss occurs, the scheduler is invoked with the PC and begins constructing a new schedule S starting at cache line index. When an instruction for S is placed in a cache line for the first time, the line's distance field is set (the distance value is available from the scheduler's reservation table) and its valid bit is set; when an instruction is placed into an issue slot, the issue slot's Op valid bit is set also. If another schedule S_old is resident at index, at least part of it will be replaced. Any portion of S_old that is not replaced is marked invalid (invalidation is discussed shortly). The new schedule S is constructed and placed into the cache. When scheduling halts, the last line in S has its last line bit and next address fields set, and execution resumes as the MultiOps that comprise S are sent to the processor core.
Schedule invalidation is achieved using the distance field. When a MultiOp M for schedule S_invalid is displaced from the cache, the distance field for the cache line indicates the cache location of the first MultiOp, M_first, for S_invalid. The line valid bit for M_first's cache line is cleared. The next cache access for S_invalid will result in a cache miss.
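As a rough illustration of invalidation through the distance field, the sketch below keeps the per-line fields in two parallel Python lists. The list-based layout and the function name invalidate_owner are assumptions for illustration; the wrap-around subtraction assumes that schedules wrap modulo the number of cache lines.

```python
def invalidate_owner(line_valid, distance, index):
    """Invalidate the schedule that owns the line being displaced at `index`.

    line_valid[i] and distance[i] model the per-line status fields.
    Returns the index of the first MultiOp's line (M_first).
    """
    n = len(line_valid)
    first = (index - distance[index]) % n   # walk back to M_first, wrapping mod n
    line_valid[first] = False               # next access to the schedule will miss
    return first
```

For example, with 8 lines and a schedule occupying lines 6, 7, 0, and 1 (distances 0, 1, 2, 3), displacing line 1 clears the line valid bit of line 6.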
C. The compressed schedule cache design
Another organization for an SC is a design similar to the compressed cache described in Conte et al. [22] and is called a compressed SC. The design is shown in Figure 9. Unlike the uncompressed SC, the compressed SC does not hold horizontal or vertical NOPs, and multiple MultiOps can occupy a cache line. The compressed SC is organized as two separate banks. When a MultiOp spans two cache lines, banking allows access to all of the instructions within the MultiOp in a single cache access [23]. The cache index of the first MultiOp in a schedule is determined using bit selection. The first MultiOp in a schedule always begins at offset 0 within its cache line. Cache access for successive MultiOps is performed by using the length of the current MultiOp to determine the bank and the offset of the next MultiOp. The generation of next index is similar to a scheme detailed in [24], with minor differences. Since a MultiOp can start anywhere in a cache line, a length field is maintained for every offset within a cache line. The length of each MultiOp is computed during schedule formation and is stored in the length field. The length field is added to index to generate next index and the offset for cache access. A bank bit is associated with every offset in a cache line to direct cache access to the proper bank. As with the uncompressed SC, a PC value from the processor core is needed only when a branch misprediction occurs. Each cache line in the compressed SC has the same fields as a line in the uncompressed cache, except for the line valid and Op valid bits, which are no longer needed; the remaining fields are used in the same manner.
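The next-access computation described above can be sketched as follows. This is an illustrative model only: the function names, the words_per_line parameter, and in particular the mapping of even lines to bank 0 and odd lines to bank 1 are assumptions, chosen so that a MultiOp spanning two consecutive lines touches both banks.

```python
def next_location(line, offset, length, words_per_line, n_lines):
    """Locate the MultiOp that follows one starting at (line, offset)."""
    pos = offset + length                        # add the length field to the position
    next_line = (line + pos // words_per_line) % n_lines
    next_offset = pos % words_per_line
    bank = next_line % 2                         # assumption: lines interleaved over 2 banks
    return next_line, next_offset, bank

def spans_two_lines(offset, length, words_per_line):
    """True if a MultiOp starting at `offset` spills into the following line."""
    return offset + length > words_per_line
```

With 8 words per line, a 4-instruction MultiOp at line 3, offset 5 spans lines 3 and 4, and its successor starts at line 4, offset 1.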
Schedule formation with the compressed SC differs from that with the uncompressed SC. The compressed SC does not mirror the reservation table, as the cache does not hold NOPs (except possibly in the last cache line in a schedule, and these are just padding NOPs). This implies that instructions cannot be placed into the cache during scheduling. Rather, instructions are written into a schedule buffer. The schedule buffer has the same dimensions as the reservation table and is organized as a small uncompressed SC. When a halt condition is encountered, MultiOps are fetched directly from the schedule buffer and passed to the processor core. As MultiOps are sent to the core, they are placed into the compressed SC. As a MultiOp is placed into the cache, it is processed by compression logic that removes horizontal NOPs. Vertical NOPs are represented by a hardware pause field associated with each MultiOp (in hardware, a pause field is implemented for every instruction in a cache line). When an empty MultiOp is encountered during a copy from the schedule buffer, the pause field for the previous MultiOp, M_previous, is set to one. For every consecutive empty MultiOp encountered, M_previous's pause field is incremented. A pause field of value m instructs the issue logic to halt MultiOp issue for m cycles after issuing the current MultiOp. If a schedule S goes off-path while it is being fetched from the schedule buffer, S is marked invalid in the cache.
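The copy from the schedule buffer into the compressed SC can be sketched in Python as below. The dictionary representation, the function names, and the use of None for a NOP are assumptions for illustration; the sketch also assumes that the first MultiOp of a schedule is never empty.

```python
def compress(schedule_buffer):
    """Drop horizontal NOPs and fold vertical NOPs into pause fields.

    schedule_buffer is a list of MultiOps, each a list of issue slots in which
    None stands for a NOP. Returns one record per non-empty MultiOp.
    """
    compressed = []
    for multiop in schedule_buffer:
        ops = [op for op in multiop if op is not None]    # remove horizontal NOPs
        if ops:
            compressed.append({"ops": ops, "length": len(ops), "pause": 0})
        elif compressed:
            compressed[-1]["pause"] += 1                  # vertical NOP: bump M_previous
        else:
            raise ValueError("assumed: a schedule starts with a non-empty MultiOp")
    return compressed

def issue_cycles(compressed):
    """Cycle (relative to schedule start) in which each compressed MultiOp issues."""
    cycles, cycle = [], 0
    for entry in compressed:
        cycles.append(cycle)
        cycle += 1 + entry["pause"]    # halt issue for `pause` cycles afterwards
    return cycles
```

The issue_cycles helper shows that the pause fields preserve the original timing: a MultiOp that sat in cycle 3 of the schedule buffer still issues in cycle 3 after two empty cycles are folded into its predecessor.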
Since a MultiOp is stored in a compressed fashion in the cache, it must be uncompressed so that each individual instruction is aligned with the correct functional unit before the MultiOp is passed to the execution pipeline. The logic to accomplish this is called an expander and resides between the cache and the execution pipeline [22]. The functionality of the expander is directly opposite to that of the compression logic used when placing a MultiOp into the SC. The expander adds an extra stage to the pipeline and increases the branch misprediction penalty by one cycle.
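The inverse relationship between the compression logic and the expander can be shown with a small round-trip sketch. The text does not state how the expander determines the target functional unit for each instruction; here each compressed instruction is assumed to carry an explicit slot number, which is a hypothetical encoding chosen only to make the example self-contained.

```python
def compress_multiop(multiop):
    """Keep only real instructions, tagged with their issue slot (assumed encoding)."""
    return [(slot, op) for slot, op in enumerate(multiop) if op is not None]

def expand_multiop(tagged_ops, width):
    """Rebuild a full-width MultiOp, reinserting NOPs (None) in unused slots."""
    slots = [None] * width
    for slot, op in tagged_ops:
        slots[slot] = op          # align each instruction with its functional unit
    return slots
```

Expanding the compressed form of a MultiOp returns the original slot layout, which is the property the expander must provide before instructions reach the execution pipeline.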
D. Experimental evaluation
Simulations were conducted to gauge the performance of the compressed SC design with the MPS scheduler. The uncompressed SC was not evaluated because a prior study [22] demonstrated that it does not achieve high performance on a multiple-issue machine similar to the one used for this study. As with the experiments in Section V, each benchmark was run for 20,500,000 instructions to bypass initialization code. The last 500,000 of these instructions were used for d-cache and SC warmup. Measurements were taken on the next 200,000,000 instructions or until the program completed. A 5-cycle latency with a 32 bits/cycle bandwidth to a perfect L2 cache was assumed.
Since the MPS design uses a statically-scheduled processor core, a potential source of performance degradation is data cache (d-cache) misses, because execution of all instructions in a MultiOp stalls on a d-cache miss. The primary concern is d-cache read accesses that can cause a dependent instruction or a chain of instructions to stall. To include the effects of read misses, a d-cache was also modelled. The d-cache was 32KB in size with 32-byte lines. LRU replacement was used with 4-way set associativity. Results are presented in Table VII. For most of the benchmarks, a 32KB direct-mapped SC yielded performance several factors lower than the parallelism the MPS scheduler is able to extract. The two exceptions are compress and ijpeg. compress achieved performance equivalent to using an infinite SC, since it has a very small footprint of schedules which a 32KB cache is able to capture. ijpeg also has a relatively small footprint and has the most intrinsic parallelism of all of the programs, as evidenced by its OPC of 2.45 with an infinite SC.
The performance of all programs improved as the associativity of the 32KB SC was increased from 1 to 4, but all programs except for compress and ijpeg still attained OPCs of less than 1. A 64KB, 4-way set-associative SC was simulated in an attempt to determine if capacity misses were the major source of the poor behavior. Performance for all programs increased dramatically with the 64KB SC, indicating that capacity is indeed a major issue in schedule cache design. perl and li remained in the <1 OPC range but showed significant improvement over the 32KB design.
Since an SC holds schedules that occupy multiple cache lines, both conflict and capacity misses occur more frequently than in a traditional i-cache. A schedule can reside at indices i through i+n and thus will have a conflict with any other schedule that maps to any cache line in that range. This trait can be mitigated with associativity. Schedule size can also exacerbate capacity issues. Multiple schedules can contain the same basic blocks (they will not begin with the same basic block). The resulting code expansion causes capacity contention in the cache and requires a large schedule cache for acceptable performance.
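The range-based conflict condition can be made concrete with a short sketch for a direct-mapped SC. The function names and the representation of a schedule as a (start index, length in lines) pair are assumptions for illustration.

```python
def lines_occupied(start, length, n_lines):
    """Set of cache line indices used by a schedule of `length` lines."""
    return {(start + k) % n_lines for k in range(length)}

def conflicts(a_start, a_len, b_start, b_len, n_lines):
    """True if two schedules would need at least one common cache line."""
    a = lines_occupied(a_start, a_len, n_lines)
    b = lines_occupied(b_start, b_len, n_lines)
    return bool(a & b)
```

In a 16-line cache, a 4-line schedule starting at index 14 occupies lines 14, 15, 0, and 1, so it conflicts with a schedule starting at index 1 even though the two start addresses map to different lines. A conventional i-cache would see no conflict between those two addresses.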
E. A comparison with a superscalar design
As stated in the introduction, MPS is a possible alternative design paradigm to out-of-order superscalar processors. The fundamental difference is that the MPS scheduling hardware is not used in the processor core and does not have to resolve dependencies between a large number of instructions within a central hardware window. MPS does not use register renaming, so it does not require potentially complex hardware to allocate and free renaming registers.
We performed a set of experiments to compare the performance of an MPS machine with that of a superscalar machine. The characteristics of the superscalar machine are listed in Table VIII. The superscalar simulator is not within the exact same framework as the MPS simulator, so we were not able to model identical execution cores. For example, the superscalar machine can execute only 1 branch per cycle, whereas the MPS machine can execute up to 4 branches in parallel. However, the superscalar core has a richer set of functional units, a large central window, and the ability to speculate instructions past up to sixteen unresolved branches. The results are presented in Table IX. The programs go and vortex are not included, as we were unable to simulate them in the superscalar framework. Data is presented for simulations with infinite caches and with a 64KB i-cache/SC (a data cache was simulated in all of the experiments). The results are mixed. When using an infinite cache, MPS yields higher OPCs for 3 of the 6 programs. When a 64KB cache is modelled, MPS outperforms the superscalar design on the same 3 programs. Those 3 programs (m88ksim, compress, and ijpeg) have relatively small SC footprints, as confirmed by the data in Table VII. On the other 3 programs, the superscalar outperforms MPS using either an infinite cache or a 64KB i-cache/SC. Those 3 programs (gcc, li, and perl) have large SC footprints, and the performance gap between MPS and superscalar widens significantly when a 64KB cache is simulated. This reinforces the earlier observation that the large unit of replacement in an SC (a schedule) can cause MPS to be SC-size sensitive.
However, these cycle-count-based numbers do not reflect the clock speed advantage that an MPS design could have. Taking that advantage into account can only increase the relative performance of an MPS machine, so it is possible that in the final analysis an MPS machine could outperform a superscalar even while producing lower OPCs.
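The trade-off can be stated as a simple relation, assuming that delivered performance is operations per cycle divided by cycle time. The symbols Perf, T_MPS, and T_SS (the cycle times of the MPS and superscalar machines) are introduced here for illustration only; no cycle times are reported in this study.

```latex
\[
\text{Perf} = \frac{\text{OPC}}{T_{\text{cycle}}},
\qquad
\frac{\text{Perf}_{\text{MPS}}}{\text{Perf}_{\text{SS}}}
  = \frac{\text{OPC}_{\text{MPS}}}{\text{OPC}_{\text{SS}}}
    \cdot \frac{T_{\text{SS}}}{T_{\text{MPS}}} > 1
\iff
\frac{T_{\text{MPS}}}{T_{\text{SS}}} < \frac{\text{OPC}_{\text{MPS}}}{\text{OPC}_{\text{SS}}}
\]
```

Under this assumption, an MPS machine with a lower OPC still comes out ahead whenever its cycle time is shorter by a larger factor than its OPC deficit.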
VII. Conclusion
This paper has presented and developed an alternative to out-of-order issue that relegates the complexity of speculative scheduling to the instruction cache miss path. The technique is called miss-path scheduling and has the advantage of a potentially aggressive cycle time for a simplified processor core. The central idea is to schedule sequences of instructions for execution at cache miss time rather than use dynamic scheduling hardware in the processor core. The details of a hardware design for a miss-path scheduler called MPS were presented. The hardware was shown to be practical for implementation, with no more complexity than well-known blocks of logic such as a register file or a priority encoder.
An implementation of an MPS design was outlined, including a method for pipelining the design. The design was extended to schedule instructions speculatively across basic-block boundaries. The performance of MPS was evaluated for basic-block and speculative scheduling using simulations with a realistic machine model. Two branch predictors were modeled for the speculative MPS designs: a forward-taken heuristic and a profile bit. The requirements of instruction cache design for an MPS-based machine were discussed. An instruction cache for such a machine is labeled a schedule cache (SC), since entire schedules of instructions are stored in the cache. Two designs for schedule caches were described, the uncompressed SC and the compressed SC. The performance of a compressed SC was measured. It was found that performance varies greatly between 32KB and 64KB SCs, indicating that MPS places additional burdens on the cache that are not present in traditional i-caches. Comparisons between an MPS design and a superscalar design were also made. Neither design proved to be clearly superior in performance. However, the results to date suggest that an MPS-based design provides a viable alternative to traditional out-of-order superscalars, especially when the likely reduction in cycle time is taken into account.
while ( (Op = GetNextSequentialOp()) != NULL ) {
  1) Check Op's source operands to determine when all of them will be ready: EarliestStartTime = MaxReadyTime(Op's Src Operands).
  2) Determine ResourceSet, which is the set of resources that Op needs to execute: functional units, register ports, result buses, etc.
  3) Search the reservation
Fig. 10. Pipeline stages for miss path scheduling (stages shown: Step 3a, Step 3b, Step 5, Steps 6-9).
Miss Path Scheduling:
1. Check the availability of the instruction's source operands by reading their def-times and busy bits from the def-use table. This is used to compute an initial value for the earliest cycle in which the instruction can execute. Call this value ScheduleTime. The register ids required for this and all following steps can be determined using logic identical to that used for instruction decode.
