The object-code compatibility problem in VLIW architectures stems from their statically scheduled nature. Dynamic rescheduling (DR) [1] is a technique to solve the compatibility problem in VLIWs. DR reschedules program code pages at first-time page faults, i.e., when the code pages are accessed for the first time during execution. Treating a page of code as the unit of rescheduling makes the technique susceptible to the hazard of changes in the page size during rescheduling. This paper proves that changes in the page size are due only to the insertion and/or deletion of NOPs in the code. Further, it presents an ISA encoding called list encoding, which does not require explicit encoding of the NOPs in the code. A property of the encoding called rescheduling-size invariance (RSI) is presented, and it is proved that the list encoding satisfies this property.
Introduction
The object-code compatibility problem in VLIW architectures stems from their statically scheduled nature. The compiler for a VLIW machine schedules code for a specific machine model (a machine generation), for precise, cycle-by-cycle execution at run-time. The machine model assumptions for a given code schedule are unique, and so are its semantics. Thus, code scheduled for one VLIW is not guaranteed to execute correctly on a different VLIW model. This characteristic is often cited as an impediment to VLIWs becoming a general-purpose computing paradigm [2]. An example is shown in Figures 1, 2, and 3. Figure 1 shows an example VLIW schedule for a generation X machine model which has two IALUs, one Load unit, one Multiply unit, and one Store unit. In the later-generation machine of Figure 2, two of the execution latencies have changed, to 4 and 3 cycles respectively. The generation X schedule will not execute correctly on this machine due to the flow dependences between operations B and C, between D and H, and between E and F. Figure 3 shows the schedule for a generation X + n machine which includes an additional multiplier. The latencies of all FUs remain as shown in Figure 1. Code scheduled for this new machine will not execute correctly on the older machines because the code has been moved in order to take advantage of the additional multiplier. In particular, E and F have been moved. There is no trivial way to adapt this schedule to the older machines. This is the case of downward incompatibility between generations. In this situation, if different generations of machines share binaries (e.g., via a file server), compatibility requires either a mechanism to adjust the schedule or a different set of binaries for each generation.
One way to avoid the compatibility problem would be to maintain binary executables customized to run on each new VLIW generation. This would not only violate copy-protection rules but would also increase disk-space usage.
Alternatively, program executables may be translated or rescheduled for the target machine model to achieve compatibility. This can be done in hardware or in software. The hardware approach adds superscalar-style run-time scheduling hardware to a VLIW [3], [4], [5], [6], [7]. The principal disadvantage of this approach is that it adds to the complexity of the hardware and may stretch the cycle time of the machine if the rescheduling hardware falls in the critical path. The software approach is to perform off-line compilation and scheduling of the program from the source code or from decorated object modules (.o files). Code rescheduled in this manner yields better relative speedups, but the technique is cumbersome to use due to its off-line nature. It could also imply violation of copy protection.
Dynamic Rescheduling (DR) [1] is a third alternative to solve the compatibility problem. Under dynamic rescheduling, a program binary compiled for a given VLIW generation machine model is allowed to run on a different VLIW generation. At each first-time page fault (a page fault that occurs when a code page is accessed for the first time during the program's execution), the page fault handler invokes a module called the rescheduler to reschedule the page for that host. Rescheduled code pages are cached in a special area of the file system for future use, to avoid repeated translations.
Since the dynamic rescheduling technique translates the code on a per-page basis, it is susceptible to the hazard of changes in the page size due to the process of rescheduling. If the changes in the machine model across generations warrant the addition and/or deletion of NOPs to/from the page, the result is a page overflow or underflow. This paper discusses a technique called list encoding for the ISA and proves the property of rescheduling-size invariance (RSI), which guarantees that there is no code-size change due to dynamic rescheduling. The organization of this paper is as follows. Section 2 presents the terminology used in this paper. Section 3 briefly describes dynamic rescheduling and demonstrates the problem of code-size change with an example. Section 4 introduces the concept of rescheduling-size invariance (RSI), presents the list encoding, and then proves the RSI properties of list encoding. Section 5 presents concluding remarks and directions for future research.
Terminology
The terminology used in this paper is originally from Rau [3], [8], and is introduced here for the discussion that follows. Each wide instruction word, or MultiOp, in a VLIW schedule consists of several operations, or Ops. All Ops in a MultiOp are issued in the same cycle. VLIW programs are latency-cognizant, meaning that they are scheduled with knowledge of the functional unit latencies. An architecture which runs latency-cognizant programs is termed a Non-Unit Assumed Latency (NUAL) architecture. A Unit Assumed Latency (UAL) architecture assumes unit latencies for all functional units. Most superscalar architectures are UAL, whereas most VLIWs are NUAL. The machine models discussed in this paper are NUAL.
There are two scheduling models for latency-cognizant programs: the Equals model and the Less-Than-or-Equals (LTE) model [9]. Under the Equals model, schedules are built such that each operation takes exactly as long as its specified execution latency. In contrast, under the LTE model an operation may take less than or equal to its specified latency. In general, the Equals model produces slightly shorter schedules than the LTE model, mainly because of the register re-use it makes possible. However, the LTE model simplifies the implementation of precise interrupts and provides binary compatibility when latencies are reduced. Both the scheduler in the back-end of the compiler and the dynamic rescheduler in the page-fault handler presented in this paper follow the LTE scheduling model.
For the purposes of this paper, it is assumed that all program code can be classified into two broad categories: acyclic code and cyclic code. Cyclic code consists of short inner loops in the program which typically are amenable to software pipelining [10]. Acyclic code, on the other hand, contains a relatively large number of conditional branches and typically has large loop bodies. This makes acyclic code unamenable to software pipelining. Instead, the body of the loop is treated as a piece of acyclic code, surrounded by the loop control Ops. Examples of cyclic code are inner loops such as the counted DO-loops found in scientific code. Examples of acyclic code are non-numeric programs and interactive programs. This distinction between the types of code is made because the scheduling and rescheduling algorithms for cyclic and acyclic code differ considerably, so the dynamic rescheduling technique treats each separately.
It is also assumed that the program code is structured in the form of superblocks [11] or hyperblocks [12]. Hyperblocks are constructed by if-conversion of code using predication [13], [14]. Support for predicated execution of Ops is also assumed. Both superblocks and hyperblocks have a single entry point into the block, at the beginning of the block, and may have multiple side exits. This property is useful in bypassing the problems introduced by speculative code motion in DR, a discussion of which can be found elsewhere (see [15]).
Overview
The technique of dynamic rescheduling performs translation of code pages at first-time page faults and stores the translated pages for subsequent use. Figure 4 shows the sequence of events that take place in dynamic rescheduling. Event 1 indicates a first-time page fault. On a page fault, the OS switches context and fetches the requested page from the next level of the memory hierarchy; these are shown as Events 2 and 3, respectively. Events 1, 2, and 3 are standard for every page fault encountered by the OS. What is different in the case of DR is the invocation of a module called the rescheduler at each first-time page fault. The rescheduler operates on the newly fetched page to reschedule it to execute correctly on the host machine. This is shown as Event 4. Event 5 shows that the rescheduled page is written to an area of the file system for future use, and in Event 6 the execution resumes.
To facilitate the detection of a VLIW generation mismatch at a first-time page fault, each program binary holds a generation-id in its header. The machine model for which the binary was originally scheduled and the boundaries that identify the pieces of cyclic code in the program are also stored in the program binary. This information is made available to the rescheduler while it performs rescheduling. A page of rescheduled code remains in main memory until it is displaced like any other page in memory, at which time it is written to a special area of the file system called the text swap. All subsequent accesses to the page during the lifetime of the program are fulfilled from the text swap. The text swap may be allocated on a per-executable basis at compile time, or be allocated by the OS as a system-wide global area shared by all the active processes. The overhead of rescheduling can be expressed quantitatively in terms of the following factors: (1) the time spent at first-time page faults to reschedule the page, (2) the time spent writing the rescheduled pages to the text swap area, and (3) the amount of disk space used to store the translated pages. Further discussion of the overhead introduced by DR, and an investigation of the tradeoffs involved in the design of the text swap used to reduce the overhead, are beyond the scope of this paper (see [16] for more details).
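The sequence of events described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of [1]: the class name `DynamicReschedulingPager`, the modeling of a page as a tuple of Op names, and the in-memory dictionary standing in for the text swap are all hypothetical, and the sketch writes a rescheduled page to the text swap immediately rather than when the page is displaced from memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Page = Tuple[str, ...]  # hypothetical model: a code page is a tuple of Op names


@dataclass
class DynamicReschedulingPager:
    host_generation: int
    # Hypothetical rescheduler signature: (page, from_generation, to_generation) -> page
    reschedule: Callable[[Page, int, int], Page]
    # Stand-in for the text swap area of the file system.
    text_swap: Dict[Tuple[str, int], Page] = field(default_factory=dict)
    reschedule_count: int = 0

    def handle_page_fault(self, binary: str, generation_id: int,
                          page_no: int, fetch: Callable[[int], Page]) -> Page:
        key = (binary, page_no)
        if key in self.text_swap:
            # Not a first-time fault: serve the already rescheduled page.
            return self.text_swap[key]
        page = fetch(page_no)                      # Events 2 and 3
        if generation_id == self.host_generation:
            return page                            # no generation mismatch
        new_page = self.reschedule(page, generation_id, self.host_generation)  # Event 4
        self.reschedule_count += 1
        self.text_swap[key] = new_page             # Event 5
        return new_page                            # Event 6: execution resumes


# Usage with a dummy rescheduler that merely reverses the Op order.
pager = DynamicReschedulingPager(
    host_generation=2,
    reschedule=lambda page, src, dst: tuple(reversed(page)),
)
first = pager.handle_page_fault("prog", 1, 0, lambda n: ("A", "B", "C"))
second = pager.handle_page_fault("prog", 1, 0, lambda n: ("A", "B", "C"))
```

The second fault on the same page is served from the text swap, so the rescheduling cost (factor 1 above) is paid only once per page.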
Insertion and deletion of NOPs
When the compiler schedules code for a VLIW, independent Ops which can start execution in the same machine cycle are grouped together to form a single MultiOp; each Op in a MultiOp is bound to execute on a specific functional unit. Often, however, the compiler cannot find enough Ops to keep all the FUs busy in a given cycle. These empty slots in a MultiOp are filled with NOPs. In some machine cycles the compiler cannot schedule even a single Op for execution; NOPs are scheduled for all FUs in such a cycle, and the instruction is called an empty MultiOp.
Logically, the rescheduler in DR can be thought of as performing the following steps to generate the new code [1], regardless of the type of code. First, it breaks down each MultiOp into individual Ops, to create an ordered set of Ops. Second, it discards the NOPs from this set. The ordered set of Ops thus obtained is a UAL schedule. Third, depending upon the resource constraints and the data dependence constraints, it re-arranges the Ops in the UAL schedule to create the new NUAL schedule. Fourth and last, new NOPs and empty MultiOps are inserted as required to preserve the semantics of the computation. Note that the number of NOPs and empty MultiOps that are newly inserted may not be the same as that in the old code, so the size of the code may change due to rescheduling. It is important to note that the change in the code size, if any, is due only to the NOPs and the empty MultiOps.
An example of a change in code size is illustrated in Figure 5. The old code is shown in the left portion of the figure. Assume that the execution latency of Ops A, D, E, F, G, and H is 1 cycle each, that of Op B is 3 cycles, and that of the LOAD Op C is 2 cycles. Further, Ops E and F are dependent on the result of Op C, and hence should not begin execution before Op C finishes execution. In a newer generation of the architecture, shown on the right, one IALU is removed from the machine and the latency of the LOADs is increased by 1, to 3 cycles. When the old code is executed on the newer generation, DR invokes the rescheduler, which generates the new code as shown. To account for the new, longer latency of the LOAD unit, it inserts an empty MultiOp in the third cycle. Also, the old MultiOp consisting of operations E and F is broken into two consecutive MultiOps due to the reduction in the number of IALUs. Observe that the new code is bigger than the old code. Assuming all the Ops are 64 bits each, the net increase in the size of the code is 80 bytes, corresponding to the 10 extra NOPs inserted during rescheduling (10 NOPs at 8 bytes each).
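The four logical steps above can be sketched in Python as follows. This is a minimal sketch, not the rescheduler of [1] and not a reproduction of Figure 5: the four-Op fragment, the two machine models, and the names `Op`, `Machine`, `schedule`, `reschedule`, and `explicit_size` are hypothetical, the scheduler is a simple in-order greedy list scheduler, and each unit is assumed to accept one Op per cycle.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

OP_BYTES = 8   # 64-bit Ops, as assumed in the text
NOP = "NOP"


@dataclass
class Op:
    name: str
    fu: str                       # functional-unit type the Op executes on
    deps: Tuple[str, ...] = ()    # names of Ops whose results this Op needs


@dataclass
class Machine:
    units: Dict[str, int]         # FU type -> number of instances
    latency: Dict[str, int]       # FU type -> execution latency in cycles


def schedule(ops: List[Op], m: Machine) -> List[List[str]]:
    """Steps 3 and 4: place Ops in order, then pad with explicit NOPs."""
    fu_of = {op.name: op.fu for op in ops}
    start: Dict[str, int] = {}
    rows: List[Dict[str, List[str]]] = []
    for op in ops:
        # Earliest cycle at which all producers have finished.
        c = max((start[d] + m.latency[fu_of[d]] for d in op.deps), default=0)
        while True:
            while len(rows) <= c:
                rows.append({fu: [] for fu in m.units})
            if len(rows[c][op.fu]) < m.units[op.fu]:
                break
            c += 1                # unit type fully used in this cycle
        rows[c][op.fu].append(op.name)
        start[op.name] = c
    # Step 4: unused slots (and wholly empty MultiOps) become explicit NOPs.
    return [[n for fu in m.units
             for n in row[fu] + [NOP] * (m.units[fu] - len(row[fu]))]
            for row in rows]


def reschedule(multiops: List[List[str]], specs: Dict[str, Op],
               new_machine: Machine) -> List[List[str]]:
    flat = [n for mo in multiops for n in mo]       # step 1: break up MultiOps
    real = [specs[n] for n in flat if n != NOP]     # step 2: discard NOPs
    return schedule(real, new_machine)              # steps 3 and 4


def explicit_size(multiops: List[List[str]]) -> int:
    """Code size in bytes when every NOP is encoded explicitly."""
    return OP_BYTES * sum(len(mo) for mo in multiops)


specs = {op.name: op for op in [
    Op("A", "IALU"), Op("C", "LOAD"),
    Op("E", "IALU", ("C",)), Op("F", "IALU", ("C",)),
]}
old_machine = Machine(units={"IALU": 2, "LOAD": 1}, latency={"IALU": 1, "LOAD": 2})
new_machine = Machine(units={"IALU": 1, "LOAD": 1}, latency={"IALU": 1, "LOAD": 3})

old_code = schedule(list(specs.values()), old_machine)
new_code = reschedule(old_code, specs, new_machine)
```

In this toy case the old code occupies 3 MultiOps of 3 slots (72 bytes) and the new code 5 MultiOps of 2 slots (80 bytes), while both contain the same four real Ops; the entire size difference comes from NOP slots.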
The page size with which a computer system operates is usually dictated by the hardware, the OS, or both. It is non-trivial for the OS to handle any changes in page size at run-time. Previous work in this area by Talluri and Hill attempts to support multiple page sizes, where each page size is an integral multiple of a base page size [17], [18], [19]. Enhanced VM hardware (the Translation-Lookaside Buffer, TLB) and an enhanced VM management policy must be available to support the proposed technique. It is possible that, with the help of this extra hardware, multiple code page sizes could be used to handle variations in page size due to DR, but this would lead to a multitude of problems. The first problem is inefficient memory usage: if a new page is created to accommodate the "spill-over" generated by the rescheduler, the remainder of the new page remains unused. On the other hand, if the code in a page shrinks due to DR, a hole is left in memory. The second problem arises from control restructuring: when a new page is inserted, it must be placed at the end of the code address space. The last MultiOp in the original page must then be modified to jump to the new page, and the last MultiOp on the new page must be modified to jump to the page which lies after the original page. If a code positioning optimization was performed on the old code in order to optimize I-Cache accesses, this process could violate the ordering, potentially leading to performance degradation. Perhaps the most serious problem is that code movement within the old page or into the new page could alter branch target addresses (merge points) in the old code, leading to incorrect code. It may not even be possible to repair this code, because the code which jumps to the altered branch targets may not be visible to the rescheduler at rescheduling time.
One solution to the problem of code-size change is to use a specialized ISA encoding which "hides" the NOPs and the empty MultiOps in the code. Since all code-size changes in DR are due to the addition/deletion of NOPs, such an encoding circumvents the problem. An encoding called list encoding, which has this ability, is discussed in detail in Section 4, along with the rescheduling algorithms for cyclic and acyclic code.
Rescheduling Size-Invariance
List encoding is an ISA encoding which does not require explicit representation of NOPs and empty MultiOps in the object code; hence it is a zero-NOP encoding. This property of list encoding is used to support DR. This section presents a formal definition of list encoding, followed by an introduction to the concept of Rescheduling Size-Invariance (RSI). It will also be shown that any list-encoded schedule of code is rescheduling size-invariant.
Definition 5 (VLIW Generation): A VLIW generation $G$ is defined by the set $\{R, L\}$, where $R$ is the set of hardware resources in $G$, and $L$, with elements in $\{1, 2, \ldots\}$, is the set of execution latencies of all the Ops in the operation set of $G$. $R$ itself is a set consisting of pairs $\{r, n_r\}$, where $r$ is a resource type and $n_r$ is the number of instances of $r$.
List encoding and RSI
This definition of a VLIW generation does not model complex resource usage patterns for each Op, as used in [20], [21], and [22]. Instead, each member of the set of machine resources $R$ presents a higher-level abstraction of the "functional units" found in modern processors. Under this abstraction, the low-level machine resources, such as the register-file ports and operand/result buses required for the execution of an Op on each functional unit, are bundled with the resource itself. All the resources indicated in this manner are assumed to be busy for a period of time equal to the latency of the executing Op, indicated by the appropriate member of set $L$.
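As a concrete reading of this definition, a generation can be held in a small record. The sketch below is illustrative only: the class name `Generation`, the keying of latencies by resource type, and the latency values are assumptions; only the resource counts (two IALUs, one Load, one Multiply, and one Store unit) are taken from the machine model described in the Introduction.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Generation:
    """A VLIW generation G = {R, L} under the functional-unit abstraction."""
    resources: Dict[str, int]   # R: resource type r -> number of instances n_r
    latencies: Dict[str, int]   # L: latency in cycles, keyed here by resource type

    def __post_init__(self) -> None:
        if any(n < 1 for n in self.resources.values()):
            raise ValueError("each resource type needs at least one instance")
        if any(lat < 1 for lat in self.latencies.values()):
            raise ValueError("latencies are drawn from {1, 2, ...}")

    def issue_width(self) -> int:
        """Number of Op slots in one MultiOp (one per resource instance)."""
        return sum(self.resources.values())


# Resource counts follow the example machine model; latencies are hypothetical.
gen_x = Generation(
    resources={"IALU": 2, "Load": 1, "Multiply": 1, "Store": 1},
    latencies={"IALU": 1, "Load": 2, "Multiply": 3, "Store": 1},
)
```

Two generations differ whenever either field differs, which matches the two kinds of change (resource counts and latencies) used in the examples of this paper.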
Definition 6 (Rescheduling Size-Invariance, RSI): A VLIW schedule $S$ is said to satisfy the RSI property iff $\mathrm{sizeof}(S_{G_n}) = \mathrm{sizeof}(S_{G_m})$, where $S_{G_n}$ and $S_{G_m}$ are the versions of the original schedule $S$ prepared for execution on arbitrary machine generations $G_n$ and $G_m$, respectively. Further, schedule $S$ is said to be rescheduling size-invariant iff it satisfies the RSI property.
The proof that list encoding is RSI will be presented in two parts. First, it will be shown that acyclic code in the program is RSI when list encoded, followed by the proof that cyclic code is RSI when list encoded. Since all code is assumed to be either acyclic or cyclic, the result that list encoding makes it RSI will follow. In the remainder of this section, algorithms to reschedule each of these types of code are presented, followed by the proofs themselves.
Rescheduling Size-Invariant Acyclic Code
The algorithm to reschedule acyclic code from VLIW generation $G_{old}$ to generation $G_{new}$ is shown in Algorithm Reschedule Acyclic Code. It is assumed that both the old and new schedules are LTE schedules (see Section 2), and that both have the same register file architecture and compiler register usage convention. The RSI property for a list encoded acyclic code schedule will now be proved.
Theorem 1 An arbitrary list encoded schedule of acyclic code is RSI.
Proof: The proof will be presented using induction over the number of Ops in an arbitrary list encoded schedule. Let $L_i$ be an arbitrary, ordered sequence of $i$ Ops ($i \geq 1$) that occur in a piece of acyclic code. Let $F_i$ denote a directed dependence graph for the Ops in $L_i$, i.e., each Op in $L_i$ is a node in $F_i$, and the data and control dependences between the Ops are indicated by directed arcs in $F_i$. Let $S_{G_n}$ be the list encoded schedule for $L_i$ generated using the dependence graph $F_i$ and designed to execute on a certain VLIW generation $G_n$. Also, let $G_m$ denote another VLIW generation which is the target of rescheduling under DR.
Induction Basis. $L_1$ is an Op sequence of length 1. In this case, $\mathrm{sizeof}(S_{G_n}) = 1$, and the dependence graph has a single node. It is trivial in this case that $S_{G_n}$ is RSI, because after rescheduling to generation $G_m$ the number of Ops in the schedule will remain 1, or:
$$\mathrm{sizeof}(S_{G_n}) = \mathrm{sizeof}(S_{G_m}) = 1 \tag{1}$$
Induction Step. $L_p$ is an Op sequence of length $p$, where $p \geq 1$. Assume that $S_{G_n}$ is RSI. In other words,
$$\mathrm{sizeof}(S_{G_n}) = \mathrm{sizeof}(S_{G_m}) = p \tag{2}$$
Now consider the Op sequence $L_{p+1}$, which is of length $p + 1$, such that $L_{p+1}$ was obtained from $L_p$ by adding one Op from the original program fragment. Let this additional Op be denoted by $z$. Op $z$ can be thought of as borrowed from the original program, such that the correctness of the computation is not compromised. $L_p$ is an ordered sequence of Ops, and Op $z$ must then be either a prefix of $L_p$ or a suffix to it. Also, let $T_{G_n}$ denote the list encoded schedule for sequence $L_{p+1}$, which means $\mathrm{sizeof}(T_{G_n}) = p + 1$. In order to prove the current theorem, it must now be proved that $T_{G_n}$ is RSI if $S_{G_n}$ is RSI.
The addition of Op $z$ to $L_p$ may affect the structure of the dependence graph $F_p$ in one of two ways: (1) Op $z$ adds one or more dependence arcs to $F_p$, or (2) Op $z$ does not add any dependence arcs to $F_p$.
Op z adds dependences:
This case arises when Op $z$ is control- and/or data-dependent on one or more of the Ops in $L_p$, or vice versa. There are two sub-cases in which a schedule that includes Op $z$ is constructed: (1) construction of $T_{G_n}$ using the dependence graph, and (2) rescheduling of $T_{G_n}$ to $T_{G_m}$. In both sub-cases, all the dependences introduced by Op $z$ must be honored. Further, any resource constraints must be satisfied as well. This is done using the well-known list scheduling algorithm in the first sub-case, and the Reschedule Acyclic Code algorithm in the second sub-case. Appropriate NOPs and empty MultiOps will be inserted in the schedule by both these algorithms. However, when the schedules $T_{G_n}$ and $T_{G_m}$ are list encoded, the empty MultiOps will be made implicit using the pause field in the Header Op of the previous MultiOp, and the NOPs in a MultiOp will be made implicit via the FUtype field in the Ops. Thus, the only source of size increase in schedules $T_{G_n}$ and $T_{G_m}$ is the newly added Op $z$.
Op z does not add any dependences:
In this case, only the resource constraints, if any, would warrant the insertion of empty MultiOps. By an argument similar to that in the previous case, the only source of size increase in schedules $T_{G_n}$ and $T_{G_m}$ is the newly added Op $z$. Thus, in both cases, $\mathrm{sizeof}(T_{G_n}) = \mathrm{sizeof}(S_{G_n}) + 1$, from which and from Equation 2 it follows that:
$$\mathrm{sizeof}(T_{G_n}) = p + 1 \tag{3}$$
Similarly, in both cases, $\mathrm{sizeof}(T_{G_m}) = \mathrm{sizeof}(S_{G_m}) + 1$, which leads to:
$$\mathrm{sizeof}(T_{G_m}) = p + 1 \tag{4}$$
From Equations 3 and 4, and by induction, it is proved that an arbitrary list encoded schedule of acyclic code is RSI. An example of the transformation of the code previously shown in Figure 5, by application of algorithm Reschedule Acyclic Code, is shown in Figure 6, assuming that the original schedule belonged to the acyclic category. It can be observed that the size of the original code on the left is the same as that of the rescheduled code on the right. The NOPs and the empty MultiOps have been eliminated in the list encoded schedules; the rescheduling algorithm merely re-arranged the Ops and adjusted the values of the H and the $p_n$ (pause) fields within the Ops to ensure the correctness of execution on $G_{new}$.
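The role of the header, pause, and FUtype fields can be illustrated with a small Python sketch. It is not the bit-level encoding: the record `ListOp`, the functions `list_encode` and `issue_cycles`, and the two toy schedules are hypothetical, and the sketch assumes that a schedule begins with a non-empty MultiOp.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Slot = Optional[Tuple[str, str]]   # (op_name, fu_type), or None for a NOP slot


@dataclass
class ListOp:
    name: str
    fu_type: str       # routes the Op to a unit, so in-MultiOp NOPs are implicit
    header: bool       # True for the first Op of a MultiOp
    pause: int = 0     # on a Header Op: number of empty MultiOps that follow


def list_encode(schedule: List[List[Slot]]) -> List[ListOp]:
    encoded: List[ListOp] = []
    last_header: Optional[ListOp] = None
    for multiop in schedule:
        real = [s for s in multiop if s is not None]
        if not real:                       # empty MultiOp: fold into pause
            if last_header is None:
                raise ValueError("sketch assumes a non-empty first MultiOp")
            last_header.pause += 1
            continue
        for i, (name, fu) in enumerate(real):
            op = ListOp(name, fu, header=(i == 0))
            if i == 0:
                last_header = op
            encoded.append(op)
    return encoded


def issue_cycles(encoded: List[ListOp]) -> Dict[str, int]:
    """Recover each Op's issue cycle from the header and pause fields."""
    cycles: Dict[str, int] = {}
    cycle, pending_pause = -1, 0
    for op in encoded:
        if op.header:
            cycle += 1 + pending_pause
            pending_pause = op.pause
        cycles[op.name] = cycle
    return cycles


A, C, E, F = ("A", "IALU"), ("C", "LOAD"), ("E", "IALU"), ("F", "IALU")
# Hypothetical schedules of the same four Ops for two generations.
old_schedule = [[A, None, C], [None, None, None], [E, F, None]]
new_schedule = [[A, C], [None, None], [None, None], [E, None], [F, None]]

old_encoded = list_encode(old_schedule)
new_encoded = list_encode(new_schedule)
```

Both encodings contain exactly four Ops, although the explicit schedules occupy 9 and 10 slots respectively; only the header and pause values differ between them.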
Rescheduling Size-Invariant Cyclic Code
Most programs spend a great deal of time executing inner loops, and hence the study of scheduling strategies for inner loops has attracted great attention in the literature [23], [24], [25], [8], [26], [27], [28], [29]. Inner loops typically have small bodies (relatively few Ops), which makes it hard to find ILP within these loop bodies. Software pipelining is a well-understood scheduling strategy used to expose the ILP across multiple iterations of the loop [30], [25]. There are two ways to perform software pipelining. The first uses loop unrolling, in which the loop body is unrolled a fixed number of times before scheduling. Loop bodies scheduled via unrolling can be subjected to rescheduling via the Reschedule Acyclic Code algorithm described in Section 4.2. The code expansion introduced by unrolling, however, is often unacceptable, and hence the second technique, Modulo Scheduling [30], is employed.
Modulo-scheduled loops incur very little code expansion (limited to the prologue and epilogue of the loop), which makes the technique very attractive. In this paper, only modulo-scheduled loops are examined for the RSI property; unrolled-and-scheduled loops are covered by the acyclic RSI techniques presented previously. First, the structure of modulo-scheduled loops is discussed, followed by an algorithm to reschedule modulo-scheduled code. The section ends with a formal treatment showing that list-encoded modulo-scheduled cyclic code is RSI. Concepts from Rau [29] are used as a vehicle for the discussion in this section.
Assumptions about the hardware support for execution of modulo-scheduled loops are as follows. In some loops, a datum generated in one iteration of the loop is consumed in one of the successive iterations (an inter-iteration data dependence). Also, if there is any conditional code in the loop body, multiple, data-dependent paths of execution exist. Modulo-scheduling such loops is non-trivial. This paper assumes three forms of hardware support to circumvent these problems. First, register renaming via rotating registers [29] is assumed in order to handle the inter-iteration data dependences in loops. Second, support for predicated execution [14] is assumed in order to convert the control dependences within a loop body to data dependences. Third, support for sentinel scheduling [33] is assumed to ensure correct handling of exceptions in speculative execution. Also, the pre-conditioning [29] of counted DO-loops is presumed to have been performed by the modulo scheduler when necessary.
A modulo-scheduled loop consists of three parts: a prologue, a kernel, and an epilogue, each scheduled for a particular machine generation $G_n$. The prologue initiates a new iteration every II cycles, where II is known as the initiation interval. Each slice of II cycles during the execution of the loop is called a stage. In the last stage of the first iteration, execution of the kernel begins. More iterations are in various stages of their execution at this point in time. Once inside the kernel, the loop executes in a steady state (so called because the kernel code branches back to itself). In the kernel, multiple iterations are simultaneously in progress, each in a different stage of execution. A single iteration completes at the end of each stage. The branch Ops used to support the modulo scheduling of loops have special semantics, by which the branch updates the loop counts and enables/disables the execution of further iterations. When the loop condition becomes false, the kernel falls through to the epilogue, which allows for the completion of the stages of the unfinished iterations.
Figure 7 shows an example modulo schedule for a loop and identifies the prologue, the kernel, and the epilogue. Each row in the schedule describes a cycle of execution. Each box represents a set of Ops that execute on the same resource (e.g., a functional unit) in one stage. The height of the box is the II of the loop. All stages belonging to a given iteration are marked with a unique letter from $\{A, B, C, D, E, F\}$.
In a kernel-only (KO) loop, the prologue and the epilogue of the loop "collapse" into the kernel, without changing the semantics of execution of the loop. This is achieved by predicating the execution of each distinct stage in a modulo-scheduled loop on a distinct predicate called a stage predicate. A new stage predicate is asserted by the loop-back branch. Execution of the stage predicated on the newly asserted predicate is enabled in future executions of the kernel. When the loop execution begins, stages are incrementally enabled, accounting for the loop prologue. When all the stages are enabled, the loop kernel is in execution and the loop is in the steady state. When the loop condition becomes false, the predicates for the stages are reset, thus disabling the stages one by one. This accounts for the epilogue of the loop. A modulo-scheduled loop can be represented in the KO form if adequate hardware support (predicated execution) and software support (a modulo scheduler to predicate the stages of the loop) are assumed. Further discussion of KO loop schedules can be found in [29]. All modulo-scheduled loops can be represented in the KO form. The KO form thus has the potential to encode modulo schedules for all classes of loops, a property which is useful in the study of dynamic rescheduling of loops, as will be shown shortly.
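The way stage predicates reproduce the prologue, steady state, and epilogue can be seen in a short Python sketch. It models only which stages are enabled on each pass through the kernel; the function names, the numbering of stages from 0, and the example of 3 stages and 5 iterations are assumptions made for illustration.

```python
from typing import List


def active_stages(kernel_pass: int, num_stages: int, num_iterations: int) -> List[int]:
    """Stages whose stage predicate is true on the given pass through the kernel.

    On pass t, stage s works on iteration t - s, so the stage is enabled
    only when that iteration actually exists.
    """
    return [s for s in range(num_stages)
            if 0 <= kernel_pass - s < num_iterations]


def kernel_passes(num_stages: int, num_iterations: int) -> int:
    """Total passes needed for every iteration to go through every stage."""
    return num_iterations + num_stages - 1


# Hypothetical loop: 3 stages, 5 iterations.
trace = [active_stages(t, 3, 5) for t in range(kernel_passes(3, 5))]
for t, stages in enumerate(trace):
    print(t, stages)
```

The first passes enable stages incrementally (prologue), the middle passes run all stages (steady state), and the last passes disable the earliest stages one by one (epilogue), all with a single copy of each stage in the code.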
The size of a modulo-scheduled loop is larger than the original size of the loop if the modulo schedule has an explicit prologue, kernel, and epilogue. In contrast, a KO loop schedule has exactly one copy of each stage in the original loop body, and hence has the same size as the original loop body, provided the original loop was completely if-converted. This property of KO loops is useful in performing dynamic rescheduling of modulo-scheduled loops.
It is possible that, due to Op $z$ in $L_{p+1}$, the nature of the graph $F_p$ could be different from that of the graph $F_{p+1}$ in two ways: (1) Op $z$ is data dependent on one or more Ops in $L_{p+1}$ or vice versa, or (2) Op $z$ is independent of all the Ops in $L_{p+1}$. In both of these cases, the data dependences and the resource constraints are honored by the modulo scheduling algorithm via appropriate use of NOPs and/or empty MultiOps within the schedule. When this schedule is list encoded, the NOPs and the empty MultiOps are made implicit via the use of the pause and the FUtype fields within the Ops. Hence, the list encoded KO schedules of $L_{p+1}$ and $L_p$ for $G_n$ differ in size by exactly Op $z$:
$$\mathrm{sizeof}(\text{schedule of } L_{p+1} \text{ for } G_n) - \mathrm{sizeof}(\text{schedule of } L_p \text{ for } G_n) = \mathrm{sizeof}(z) = 1 \tag{7}$$
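In other words,
$$\mathrm{sizeof}(\text{schedule of } L_{p+1} \text{ for } G_n) = \mathrm{sizeof}(\text{schedule of } L_p \text{ for } G_n) + 1 \tag{8}$$
From this result and from Equation 6, it follows that:
$$\mathrm{sizeof}(\text{schedule of } L_{p+1} \text{ for } G_n) = p + 1 \tag{9}$$
Similarly, in both cases, $\mathrm{sizeof}(\text{schedule of } L_{p+1} \text{ for } G_m) = \mathrm{sizeof}(\text{schedule of } L_p \text{ for } G_m) + 1$, which leads to:
$$\mathrm{sizeof}(\text{schedule of } L_{p+1} \text{ for } G_m) = p + 1 \tag{10}$$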
From Equations 9 and 10, and by induction, it is proved that an arbitrary list encoded KO modulo schedule is RSI.
Corollary 1 A List encoded schedule is RSI.
Proof: All program code can be divided into two categories, acyclic code and cyclic code, as defined in Section 2. Hence, it follows from Theorem 1 and Theorem 2 that a list encoded schedule is RSI.
Conclusions
This paper has presented the highlights of a solution to the cross-generation compatibility problem in VLIW architectures. The solution, called Dynamic Rescheduling, performs rescheduling of program code pages at first-time page faults. Assistance from the compiler, the ISA, and the OS is required for dynamic rescheduling. During the process of rescheduling, NOPs must be added to/deleted from the page to ensure the correctness of the schedule. Such additions/deletions could lead to changes in the page size. These code-size changes are hard to handle at run-time and would require extra support in hardware (TLB extensions) and in software (VM management extensions).
An ISA encoding called List Encoding, which encodes the NOPs in the program implicitly, was presented. The list encoded ISA has fixed-width Ops. The Header Op (the first Op in a MultiOp) indicates the number of empty MultiOps, if any, following it. This information eliminates the need to explicitly encode the empty MultiOps in the schedule. The OpType field encoded in each Op eliminates the need to explicitly encode the NOPs within the MultiOp, because the decode hardware can use this information to expand and route the Op to the appropriate execution resource. A property of the list encoding called Rescheduling-Size Invariance (RSI) was proved for acyclic code and for cyclic (kernel-only modulo-scheduled) code. A schedule of code is RSI iff the code size remains constant across the dynamic rescheduling transformation.
The instruction fetch hardware and I-Cache organizations required to support the list encoding have been studied previously [34]. The work presented in this paper can be extended with a study of other encoding techniques which may not be rescheduling-size invariant (non-RSI encodings). Also, a study of rescheduling algorithms which operate on non-RSI encodings can be conducted. These topics are currently being investigated by the authors.
