Combined instruction scheduling and register allocation by KHAING KHAING KYI WIN
  
COMBINED INSTRUCTION SCHEDULING AND 
REGISTER ALLOCATION  
 
 





A THESIS SUBMITTED  
FOR THE DEGREE OF MASTER OF SCIENCE 
DEPARTMENT OF COMPUTER SCIENCE 
NATIONAL UNIVERSITY OF SINGAPORE 
 
APRIL 2004 
Acknowledgement 
 Special thanks and sincere gratitude go to the National University of Singapore 
for providing me with the research scholarship, and to all the faculty members in the 
School of Computing. 
 I would like to express my heartfelt gratitude and appreciation to Dr. Wong Weng 
Fai, Associate Professor, Computer Science Department, School of Computing 
(SOC), National University of Singapore (NUS), for valuable discussions, theoretical 
input and motivating support throughout the process of this study. 
 I am greatly indebted to my parents and my sister for their loving understanding 
and encouragement throughout my study at NUS. 
 Last but not the least, sincere thanks go to the friends from NUS, who have made 

















List of  Figures……………..………………………………………………………viii 
 
List of Tables…………… …………………………………………………….…...ix 
                
Chapter 1 ............................................................................................................................... 1 
Introduction ......................................................................................................................... 1 
 
1.1. What is Instruction Scheduling?.................................................................................................. 2 
1.2. What is Register Allocation? ........................................................................................................ 2 
1.3. What is phase ordering problem?................................................................................................ 3 
1.4. Integer Programming .................................................................................................................... 4 
1.5. Objective of the Study .................................................................................................................. 5 
1.6. Methodology employed in this study.......................................................................................... 5 
1.6.1. Literature review............................................................................................................... 5 
1.6.2. Experiment ....................................................................................................................... 6 
1.7. Outline of this thesis ..................................................................................................................... 8 
 
Chapter 2 .............................................................................................................................11 
Related work in combined Instruction Scheduling and Register Allocation.....................11 
 
2.1. Introduction .................................................................................................................................11 
2.2. Previous Integrated Techniques for Register Allocation and Instruction Scheduling ......12 
2.3. Other optimization techniques using integer linear programming ......................................19 
2.4. Summary .......................................................................................................................................21 
 
Chapter 3 ............................................................................................................................ 22 
Theoretical Background to this study ............................................................................... 22 
 
3.1. Position of Register Allocation and Instruction Scheduling in compiler back end ...........22 
3.2. Instruction Scheduling ................................................................................................................23 
3.2.1. Different Types of Instruction Schedulers.................................................................25 
3.2.2. Instruction Scheduling Problems ................................................................................26 
3.2.3. Local Instruction Scheduling........................................................................................28 
3.2.4. Global Instruction Scheduling .....................................................................................29 
3.3. Register Allocation ......................................................................................................................32 
3.3.1. Local Register Allocators ..............................................................................................33 
3.3.2. Global Register Allocators............................................................................................33 
 
Chapter 4 ............................................................................................................................ 39 
Preliminary Results of combined instruction scheduling and register allocation using 
integer programming ......................................................................................................... 39 
 
4.1. Optimal Instruction Scheduling ................................................................................................39 
4.2. Optimal Register Allocation.......................................................................................................42 
4.2.1. Overview of ORA..........................................................................................................42 
4.2.2. The implementation of ORA.......................................................................................44 
 
Chapter 5 ............................................................................................................................ 50 
Cooperative local instruction scheduling with impact or region-based register 
allocator.............................................................................................................................. 50 
 
5.1. Introduction .................................................................................................................................50 
5.2. Different types of Elcor schedulers ..........................................................................................54 
5.2.1. List scheduler..................................................................................................................54 
5.2.2. List scheduling with backtracking scheduling............................................................56 
5.3. Our proposed scheduler .............................................................................................................58 
5.3.1. Common Data Structure...............................................................................................58 
5.3.2. Heuristic ..........................................................................................................................61 
5.4. Experimental evaluation .............................................................................................................62 
5.4.1. Methodology...................................................................................................................62 
5.4.2. Results and discussion...................................................................................................62 
5.5. Summary .......................................................................................................................................65 
 
Chapter 6 ............................................................................................................................ 66 
Cooperative instruction scheduling with linear scan register allocation .......................... 66 
 
6.1. Global register allocators in Trimaran......................................................................................67 
6.1.1. Impact register allocator................................................................................................68 
6.1.2. Region based register allocator ....................................................................................68 
6.1.3. Linear scan register allocator ........................................................................................69 
6.2. Experimental evaluation .............................................................................................................75 
6.2.1. Result and discussion.....................................................................................................76 
6.3. Summary .......................................................................................................................................82 
 
Chapter 7 ............................................................................................................................ 83 
Conclusions and recommendation for further work.......................................................... 83 
 
7.1. Conclusion....................................................................................................................................83 
7.2. Contributions ...............................................................................................................................84 




Appendix A......................................................................................................................... 98 
 




Appendix D .......................................................................................................................105 
 















Summary 
In compilers for machines with instruction-level parallelism (ILP), the phases of 
instruction scheduling and register allocation can be antagonistic: performance suffers 
whichever phase is executed first. To take full advantage of ILP, a compiler needs to 
minimize both the delays due to memory latency and the register usage when instruction 
scheduling and register allocation are performed. Unfortunately, the two phases work 
against each other. When register allocation is done before instruction scheduling, 
unnecessary dependences are added. Although spill code is minimized, the execution 
time of the program may increase because the resulting schedule takes more cycles. 
When instruction scheduling is executed first, an efficient schedule is generated. 
However, the code motion performed by instruction scheduling generally increases 
spill code, so additional memory delays may occur. This study explores several 
approaches to solving this phase ordering problem. 
First, an experimental study of combined instruction scheduling and register 
allocation is carried out using an integer linear programming approach. The basic 
formulation is built with ILOG OPL Studio version 3.5 and the Trimaran software. 
The preliminary results show that, even for a small code segment, the number of 
variables and expressions needed to formulate the phase ordering problem is very 
large. Hence, formulating the combined instruction scheduling and register allocation 
problem takes too much time. These results suggest that the approach has very limited 
practicality. 
Then, because of the excessive number of variables and expressions in the integer 
linear programming formulation, a much more promising approach is proposed and 
implemented in Trimaran to solve the phase ordering problem: a pre-pass local 
instruction scheduler adapted from the convergent scheduler. The proposed scheduler is 
inserted into the pre-pass scheduling stage of Trimaran, and the impact or region-based 
register allocator is used cooperatively with it. The convergent scheduler operates in 
different phases, each of which implements a heuristic that addresses a particular problem 
such as ILP or register pressure. In contrast, our proposed scheduler handles both ILP 
and register pressure at the same time, which is more efficient because it does not need 
separate phases. Once the code has been scheduled for ILP, our proposed scheduler can 
automatically reduce register pressure by limiting the number of simultaneously live 
ranges. The main advantages of this approach are the ability to reduce total dynamic 
cycles and spill code insertion. 
Finally, the linear scan register allocator proposed by Massimiliano Poletto and 
Vivek Sarkar is implemented in Trimaran and combined with the proposed pre-pass local 
instruction scheduler. The experimental results show that combining the proposed 
pre-pass local instruction scheduler with the linear scan register allocator reduces the 
maximum number of active live intervals, total dynamic cycles and dynamic register 
allocation overhead compared to combining Trimaran’s list scheduler with the impact or 
region-based register allocator. 




List of Figures 
Page No 
Figure-1.1:  Example phase ordering problem. ...............................................................................3 
Figure-3.1:  Position of Register Allocation and Instruction Scheduling in compiler back end
......................................................................................................................................................23 
Figure-4.7: Instruction graph for optimal register allocation of the sample program..............46 
Figure-4.8: Memory Graph for the Sample Program of Optimal Spill Code Placement using 
Two Real Registers....................................................................................................................48 
Figure-5.1: Example data dependent graph(DDG) of basic block(BB) 38 of rawcaudio 
benchmark from Trimaran. .....................................................................................................52 
Figure-5.2: Result of Pre-pass Scheduling for BB 38 of rawcaudio benchmark from 
Trimaran. ....................................................................................................................................52 
Figure-5.3: Result of Post-pass Scheduling for BB 38 of rawcaudio benchmark from 
Trimaran. The dotted boxes represent the spill codes inserted from impact register 
allocator. .....................................................................................................................................53 
Figure-5.4: List scheduling algorithm..............................................................................................56 
Figure-5.5: ListBT scheduling algorithm ........................................................................................57 
Figure-5.6: Example weight matrix calculation..............................................................................60 
Figure-5.7: Our proposed scheduling algorithm............................................................................61 
Figure-6.1: Control Flow Graph (CFG) with long instructions within each basic block from 
Trimaran .....................................................................................................................................71 
Figure-6.2: Control Flow Graph (CFG) with long instructions after instructions reordering 
within each basic block.............................................................................................................72 
Figure-6.3: Earliest completion time (etime) and Latest completion time (ltime) for each 
basic block in Figure-6.1 ..........................................................................................................72 
Figure-6.4 : A number of live intervals for data dependent graph in Figure-6.1 and Figure 6.2
......................................................................................................................................................73 
Figure-6.5: position of the proposed pre-pass scheduler and linear scan register allocator in 
Trimaran infrastructure ............................................................................................................76 
 
List of Tables 
Page No 
Table-4.1: The result from ILOG OPL studio for the sample program of spill code 
placement ...................................................................................................................................47 
Table-4.2: The result from ILOG OPL studio for the sample program of optimal spill code 
placement ...................................................................................................................................49 
Table-5.1: Execution time comparison of our proposed scheduler with List scheduler and 
ListBT scheduler. ......................................................................................................................64 
Table-5.2: Total dynamic cycles and register allocation overhead comparison on different 
schedulers using region based register allocator in Trimaran. ............................................64 
Table-5.3: Total dynamic cycles and register allocation overhead comparison on different 
schedulers using impact register allocator in Trimaran. ......................................................65 
Table-6.1: Total dynamic cycle and total register allocation overhead comparison between 
linear scan register allocator and region-based register allocator for 16 registers............77 
Table-6.2: Total dynamic cycle and total register allocation overhead comparison between 
linear scan register allocator and region-based register allocator for 32 registers............78 
Table-6.3: Total dynamic cycle and total register allocation overhead comparison between 
linear scan register allocator and impact register allocator for 16 registers ......................78 
Table-6.4: Total dynamic cycle and total register allocation overhead comparison between 
linear scan register allocator and impact register allocator for 32 registers ......................79 
Table-6.5: The maximum active live intervals of each procedure, which have long and 
narrow data dependent graph, of several benchmarks in Trimaran. .................................80 
Table-6.6: Average speedups of combining the proposed pre-pass scheduler with linear scan 
register allocator over combining Trimaran’s default scheduler with impact or region-
based register allocator. ............................................................................................................81 
Chapter 1 
Introduction 
 Register allocation and instruction scheduling have received widespread attention 
in academic and industrial research and are considered among the most important 
phases of a modern optimizing compiler. The goal of an optimizing compiler is to use all 
of the resources of the target computer efficiently. In compilers for machines with 
instruction-level parallelism (ILP), the phases of instruction scheduling and register 
allocation can be antagonistic: performance suffers whichever phase is executed first. To 
take full advantage of ILP, a compiler needs to minimize both the delays due to memory 
latency and the register usage when instruction scheduling and register allocation are 
performed. Unfortunately, the two phases work against each other. When register 
allocation is done before instruction scheduling, unnecessary dependences are added. 
Although spill code is minimized, the execution time of the program may increase 
because the resulting schedule takes more cycles. When instruction scheduling is 
executed first, an efficient schedule is generated. However, the code motion performed 
by instruction scheduling generally increases spill code, so additional memory delays 
may occur. This research explores several approaches to solving this phase ordering 
problem. First, it studies optimal and near-optimal instruction scheduling and register 
allocation separately. Then, using several approaches, instruction scheduling and register 
allocation are combined to obtain both less spill code and optimal instruction scheduling. 
1.1. What is Instruction Scheduling? 
 Instruction scheduling is the process by which a compiler reorders the instructions 
of a program in an attempt to decrease its running time, to reduce its code size, to 
improve other aspects of the program or to hide latencies present in modern day 
microprocessors such that a more time-efficient schedule is produced. Scheduling is often 
critical in achieving peak performance from these processors. 
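The reordering described above can be illustrated with a minimal list-scheduling sketch. It is not the scheduler used in this thesis. It assumes a single-issue machine, a two-cycle latency for loads and one cycle for other operations, true dependences only, and a critical-path priority. The function names are hypothetical, and the six-operation fragment is the one that appears in Figure-1.1.

```python
# Minimal list-scheduling sketch (illustrative only; not the Trimaran/Elcor scheduler).
# Assumptions: single-issue machine, loads take 2 cycles, other operations 1 cycle,
# and only true (read-after-write) dependences between virtual registers.

LATENCY = {"Load": 2, "Add": 1}

# (opcode, destination, sources); memory operands are omitted for brevity.
CODE = [
    ("Load", "V0", []),
    ("Load", "V1", []),
    ("Add", "V2", ["V0", "V1"]),
    ("Load", "V3", []),
    ("Load", "V4", []),
    ("Add", "V5", ["V3", "V4"]),
]


def predecessors(code):
    """Map each instruction index to the indices that produce its sources."""
    producer = {}
    preds = []
    for i, (_, dest, srcs) in enumerate(code):
        preds.append([producer[s] for s in srcs if s in producer])
        producer[dest] = i
    return preds


def priorities(code, preds):
    """Critical-path priority: own latency plus the longest path through users."""
    prio = [LATENCY[op] for op, _, _ in code]
    for i in reversed(range(len(code))):
        for p in preds[i]:
            prio[p] = max(prio[p], LATENCY[code[p][0]] + prio[i])
    return prio


def list_schedule(code, reorder=True):
    """Return one entry per cycle: an instruction index, or None for a stall (NOP)."""
    preds = predecessors(code)
    prio = priorities(code, preds)
    ready_at = {}  # instruction index -> first cycle in which its result is available
    pending = list(range(len(code)))
    schedule = []
    cycle = 1
    while pending:
        # Without reordering, only the next instruction in program order may issue.
        candidates = pending if reorder else pending[:1]
        ready = [i for i in candidates
                 if all(p in ready_at and ready_at[p] <= cycle for p in preds[i])]
        if ready:
            # Highest priority first; ties are broken by original program order.
            chosen = max(ready, key=lambda i: (prio[i], -i))
            ready_at[chosen] = cycle + LATENCY[code[chosen][0]]
            pending.remove(chosen)
            schedule.append(chosen)
        else:
            schedule.append(None)
        cycle += 1
    return schedule


in_order = list_schedule(CODE, reorder=False)
reordered = list_schedule(CODE, reorder=True)
print(len(in_order), in_order)
print(len(reordered), reordered)
```

Under these assumptions the in-order sequence occupies eight issue slots, two of which are stalls, while the reordered sequence occupies six slots with no stalls.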
1.2. What is Register Allocation? 
 Register allocation determines which of the values (variables, temporaries, and 
large constants) should reside in machine registers at each point in the execution of a 
program. The job of the register allocator is to assign those values to a limited number of 
machine registers. Register allocation is important because registers are almost always a 
scarce resource. Sometimes there are not enough registers to hold every value. In that 
case, a value (i.e. a variable) is selected to be spilled: it is kept in memory instead of 
being assigned to a register, and is stored to and reloaded from memory as needed. This 
is called register spilling. The goal of register allocation is to keep frequently used values 
in registers. In this study, optimal register allocation means minimizing spilling as much 
as possible. 
1.3. What is phase ordering problem? 
 Instruction scheduling applied to a program's intermediate language before 
register allocation is called pre-pass scheduling, and scheduling applied after register 
allocation is called post-pass scheduling. In pre-pass scheduling, the full parallelism of 
the program is exploited to generate an efficient schedule. However, it can cause 
excessive register spilling due to overuse of registers. In post-pass scheduling, spill code 
is decreased, but unnecessary dependences can be added, causing many stalls. There is no 
natural order for performing instruction scheduling and register allocation. The ordering 
problem between instruction scheduling and register allocation is called the phase 
ordering problem, which is well known to compiler researchers. An example of the phase 
ordering problem is given in Figure-1.1. 
 
 
      ( a )              ( b )              ( c )              ( d )
 1  Load V0 10       Load V0 10       Load R1 10       Load R1 10
 2  Load V1 20       Load V1 20       Load R2 20       Load R2 20
 3  NOP              Load V3 30       NOP              Load R3 30
 4  Add V2 V0 V1     Load V4 40       Add R2 R1 R2     Load R4 40
 5  Load V3 30       Add V2 V0 V1     Load R1 30       Add R2 R1 R2
 6  Load V4 40       Add V5 V3 V4     Load R2 40       Add R4 R3 R4
 7  NOP                               NOP
 8  Add V5 V3 V4                      Add R2 R1 R2

Figure-1.1:  Example phase ordering problem. (a) Example intermediate language 
with live ranges. (b) After instruction scheduling with live ranges. (c) Register 
allocation first. (d) Instruction scheduling first. 
 
Assume that memory access operations take two cycles and other operations take 
one cycle. Figure-1.1 (a) and Figure-1.1 (b) show that the number of overlapping live 
intervals increases after instruction scheduling. If register allocation is executed first, the 
code requires 8 cycles, although only 2 registers are needed. However, if instruction 
scheduling is done first, the code requires only 4 cycles, but 4 registers are needed to 
avoid spilling. Which of these two orders is better depends upon the number of available 
registers and functional units. 
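The register counts quoted for Figure-1.1 can be checked with a small sketch that measures register pressure as the largest number of values live between two consecutive instructions. This is an illustration only. The function name `max_live` is hypothetical, and the sketch assumes that V2 and V5 are dead after the fragment and that a destination may reuse the register of a source whose last use is the same instruction.

```python
# Register-pressure sketch for the two orderings of Figure-1.1 (illustrative only).
# Each entry is (destination, sources); None models a NOP.

UNSCHEDULED = [
    ("V0", []), ("V1", []), None, ("V2", ["V0", "V1"]),
    ("V3", []), ("V4", []), None, ("V5", ["V3", "V4"]),
]

SCHEDULED = [
    ("V0", []), ("V1", []), ("V3", []), ("V4", []),
    ("V2", ["V0", "V1"]), ("V5", ["V3", "V4"]),
]


def max_live(code):
    """Largest number of values live between two consecutive instructions.

    A value is live from the instruction that defines it up to its last use.
    A value that is never used is assumed dead immediately after its definition.
    """
    defined_at = {}
    last_use = {}
    for i, instr in enumerate(code):
        if instr is None:
            continue
        dest, srcs = instr
        for s in srcs:
            last_use[s] = i
        defined_at[dest] = i

    pressure = 0
    for point in range(len(code)):
        # Values that are live just after instruction `point` has issued.
        live = [v for v, d in defined_at.items()
                if d <= point < last_use.get(v, d)]
        pressure = max(pressure, len(live))
    return pressure


print(max_live(UNSCHEDULED))  # 2
print(max_live(SCHEDULED))    # 4
```

Hoisting the two later loads above the first addition doubles the number of simultaneously live values, which is why the scheduled order needs four registers to avoid spilling.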
1.4. Integer Programming 
 An integer programming problem (IP) [Win93] is a linear programming problem 
(LP) in which some or all of the variables are required to be nonnegative integers. LP is a 
tool for solving optimization problems. In 1947, George Dantzig developed an efficient 
method, the simplex algorithm, for solving LP problems. Since then, LP has been used to 
solve optimization problems in industries as diverse as banking, education, forestry, 
petroleum, and trucking. In a survey of Fortune 500 firms, 85% of the respondents said 
they had used LP. 
 There are three types of IP problem: 
( 1 ) Pure IP Problem : An IP in which all variables are required to be integers is called 
a pure integer programming problem. For example : 
   Max z = 3x1 + 3x2 
   s.t. x1 +x2 <= 6 
   x1, x2 >= 0, x1, x2 integer 
 
is a pure integer programming problem. 
( 2 ) Mixed IP Problem : An IP in which only some of the variables are required to be 
integers is called a mixed integer programming problem. For example : 
   Max z = 3x1 + 3x2 
   s.t. x1 +x2 <= 6 
   x1, x2 >= 0, x1 integer 
 
is a mixed integer programming problem (x2 is not required to be an integer). 
( 3 ) 0-1 IP : An IP problem in which all the variables must equal 0 or 1 is called a 0-1 
integer programming problem. For example : 
   Max z = x1 - x2 
   s.t. x1 + 2x2 <= 2 
   2x1 - x2 <= 1   
   x1, x2 = 0 or 1 
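
Because a 0-1 IP has finitely many candidate points, a tiny instance such as the one above can be solved by exhaustive enumeration. The sketch below does this for illustration only. The function name `solve_01_ip` is hypothetical, and practical solvers use far more sophisticated methods such as branch and bound, since the number of points grows as 2^n.

```python
from itertools import product


def solve_01_ip(objective, constraints, num_vars):
    """Maximize sum(c_j * x_j) subject to sum(a_ij * x_j) <= b_i with x_j in {0, 1}.

    objective:   list of coefficients c_j
    constraints: list of (coefficients a_i, right-hand side b_i) pairs
    Returns (best objective value, best point), or (None, None) if infeasible.
    """
    best_value, best_point = None, None
    for point in product((0, 1), repeat=num_vars):
        feasible = all(
            sum(a * x for a, x in zip(coeffs, point)) <= rhs
            for coeffs, rhs in constraints
        )
        if not feasible:
            continue
        value = sum(c * x for c, x in zip(objective, point))
        if best_value is None or value > best_value:
            best_value, best_point = value, point
    return best_value, best_point


# Max z = x1 - x2   s.t.   x1 + 2*x2 <= 2,   2*x1 - x2 <= 1,   x1, x2 = 0 or 1
EXAMPLE = solve_01_ip([1, -1], [([1, 2], 2), ([2, -1], 1)], 2)
print(EXAMPLE)  # (0, (0, 0))
```

For this instance only the points (0, 0) and (0, 1) are feasible, so the optimum is z = 0 at x1 = x2 = 0.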
 
1.5. Objective of the Study 
This research attempts 
( 1 ) to study approaches to instruction scheduling and register allocation separately  
( 2 ) to combine instruction scheduling and register allocation 
 These objectives are set in order to obtain optimal or near-optimal solutions that 
maximize Instruction Level Parallelism (ILP) and minimize spill code insertion in 
modern optimizing compilers. 
1.6. Methodology employed in this study 
The research methodology comprises two parts: literature review and 
experiment. 
1.6.1. Literature review 
The literature review covered theoretical references and techniques from previous 
work on register allocation and instruction scheduling, both in combination and in 
isolation, in order to gain the in-depth theoretical background needed to combine 
instruction scheduling and register allocation using different approaches. The literature 
review is discussed in Chapter 2 and Chapter 3. 
1.6.2. Experiment 
Several approaches to combining instruction scheduling and register allocation 
were investigated and implemented in this study. First, an experimental study of 
combined instruction scheduling and register allocation is carried out using an integer 
linear programming approach. The basic formulation is built with ILOG OPL Studio 
version 3.5 and the Trimaran software. The preliminary results show that, even for a 
small code segment, the number of variables and expressions needed to formulate the 
phase ordering problem is very large. Hence, formulating the combined instruction 
scheduling and register allocation problem takes too much time. These results suggest 
that the approach has very limited practicality. Combined instruction scheduling and 
register allocation using integer programming is discussed in Chapter 4. 
Then, because of the excessive number of variables and expressions in the integer 
linear programming formulation, a much more promising approach, a pre-pass local 
instruction scheduler, is proposed and implemented in Trimaran to solve the phase 
ordering problem. This scheduler, a modification of part of convergent scheduling, is 
presented together with impact and region-based register allocation in Chapter 5. The 
main advantages of this approach are the ability to reduce total dynamic cycles and spill 
code insertion. 
Finally, the linear scan register allocator proposed by Massimiliano Poletto and 
Vivek Sarkar is implemented in Trimaran to evaluate the performance of combining the 
proposed pre-pass local instruction scheduler with linear scan register allocation. This 
combination reduces the maximum number of active live intervals, total dynamic cycles, 
and dynamic register allocation overhead compared with combining the list scheduler 
with the region-based register allocator or the impact register allocator. The combination 
of the proposed pre-pass scheduler with linear scan register allocation is discussed in 
Chapter 6. 
1.6.2.1 Why is OPL studio used in this study? 
ILOG OPL Studio [Ilog04] is an integrated development environment for 
mathematical programming and combinatorial optimization applications. In [Hen99], the 
author mentions that linear programming, integer programming, and combinatorial 
optimization problems are ubiquitous in many application areas such as planning, 
scheduling, sequencing, resource allocation, design, and configuration. Robust solvers 
that solve large-scale linear programs and various classes of integer programs are now 
available. However, many integer programming and combinatorial optimization problems 
are challenging from a computational standpoint: they are NP-complete or worse, and it 
is widely believed that no general and efficient algorithm exists for solving them. As 
a consequence, solving these problems requires considerable time and expertise both 
in the application domain (modeling) and in algorithm design (solving). In addition, the 
resulting algorithmic solutions often involve substantial development effort, since the 
distance between the problem model and the computer algorithm may be large. OPL 
(Optimization Programming Language) is a modeling language for combinatorial 
optimization that may simplify these optimization problems substantially. OPL was 
motivated by modeling languages such as AMPL and GAMS that provide computer 
equivalents to traditional algebraic notation. It provides similar support for modeling 
linear and integer programs and access to state-of-the-art linear programming algorithms. 
But OPL adds several new dimensions to modeling languages beyond the traditional 
support for linear and integer programming. One of the significant dimensions added by 
OPL is its high-level support for scheduling and resource allocation applications, which 
are ubiquitous in industry. OPL provides novel modeling concepts such as activities and 
resources, and provides access to special-purpose algorithms such as the edge-finder 
procedure and many other new modeling tools.  
1.6.2.2 Why is Trimaran used in this study? 
Trimaran [tri03] is an integrated compilation and performance monitoring 
infrastructure. The architecture space that Trimaran covers is characterized by HPL-PD, a 
parameterized processor architecture supporting novel features such as predication, 
control and data speculation and compiler controlled management of the memory 
hierarchy. Trimaran also consists of a full suite of analysis and optimization modules, as 
well as a graph-based intermediate language. Optimizations and analysis modules can 
easily be added, deleted or bypassed, thus facilitating compiler optimization research. 
Similarly, computer architecture research can be conducted by varying the HPL-PD 
machine via the machine description language HMDES. Trimaran also provides a 
detailed simulation environment and a flexible performance monitoring environment that 
automatically track the machine as it is varied. 
1.7. Outline of this thesis 
This thesis is organized in seven chapters: 
1. Introduction 
2. Related work in combined instruction scheduling and register allocation 
3. Theoretical background to this thesis 
4. Preliminary results of combined instruction scheduling and register allocator 
using integer programming 
5. Cooperative local instruction scheduling with impact or region-based register 
allocator  
6. Cooperative local instruction scheduling with linear scan register allocator 
7. Conclusion and recommendations for further work 
 Chapter one introduces the problem of combined instruction scheduling and register
allocation.
 Chapter two presents related work in combined instruction scheduling and register
allocation and discusses it in detail. A discussion of other optimization techniques using
integer programming is also provided.
 Chapter three describes background material. In addition, the chapter contains a
discussion of instruction scheduling and register allocation techniques.
 Chapter four presents the integer linear programming methodologies employed in
combining instruction scheduling and register allocation. An overview of Optimal
Register Allocation (ORA) is provided, and the experimental results of the ILOG OPL Studio
version 3.5 scheduling model and of ORA, developed by Goodwin and Wilken [GW95], are
presented.
 Chapter five describes a new integrated local instruction scheduler which can
reduce both the total dynamic cycle count and the register allocation overhead of long and
narrow data dependence graphs. A discussion of the detailed implementation and experimental
evaluation is also given.
 Chapter six describes and evaluates the combination of the proposed scheduler with
the linear scan register allocator proposed by Massimiliano Poletto and Vivek Sarkar. The
combination of the proposed scheduler and the linear scan register allocator is
compared with the combination of a list scheduler and the impact or region-based register
allocator.












Chapter 2 
Related work in combined Instruction Scheduling and 
Register Allocation 
2.1. Introduction 
 Register allocation and instruction scheduling are two important phases in an
optimizing compiler for exploiting instruction-level parallelism (ILP). To produce
good code for modern machines such as VLIW and superscalar machines, the compiler
must expose enough ILP to let the scheduler keep the various functional units busy. The
scheduler must order the operations in a way that lets them execute in parallel. Moreover,
the compiler must keep as many values in registers as possible, since the memory interface
is rarely wide enough or versatile enough to supply all operands. The goal of instruction
scheduling is to exploit the available instruction-level parallelism. The goal of register
allocation is to minimize the number of memory accesses. Unfortunately, instruction
scheduling and register allocation are interdependent, and their conflicting objectives give
rise to the phase ordering problem, which is well known among compiler researchers. If the
compiler reorders the instructions to decrease execution time, it tends to increase the
number of physical registers needed to hold values. On the other hand, if temporaries are
allocated to physical registers before instruction scheduling, the amount of instruction
reordering that remains possible is limited. In current optimizing compilers, the two
modules are usually processed separately, with code scheduling performed either after
register allocation (post-pass scheduling) or before register allocation (pre-pass
scheduling). More recently, a growing number of studies have focused on integrating code
scheduling and register allocation to generate more efficient object code. The literature
shows that a number of integrated and cooperative techniques have been proposed in an
attempt to introduce some communication between the two phases.
2.2. Previous Integrated Techniques for Register Allocation 
and Instruction Scheduling 
Goodman and Hsu [GH88] developed two separate integrated strategies. In their first
approach, DAG-driven register allocation, which targets post-pass scheduling, the local
register allocator runs first and uses information typically used by the scheduler, namely
the data dependence graph (a DAG), to guide its allocation decisions, while the scheduler
is constrained by the dependences added by the register allocator. Their test results show
that DAG-driven local register allocation significantly improves the performance of
post-pass code scheduling.
In their second approach, called integrated pre-pass scheduling (IPS), the instruction
scheduler is executed first but is constrained by a limit on the number of available
registers. It switches between two scheduling heuristics depending on whether the current
number of live values has reached the register limit. If too few registers are available,
the scheduler tries to schedule an instruction that decreases the number of live values.
Otherwise, it schedules to increase fine-grain parallelism. Goodman and Hsu [GH88] found
that on highly pipelined machine models, integrated pre-pass scheduling slightly
outperforms DAG-driven register allocation. Both approaches were shown to be effective in
dealing with the interdependency between code scheduling and register allocation.
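The switching behaviour of integrated pre-pass scheduling can be illustrated with a small
sketch. The Python code below is a simplified illustration and not Goodman and Hsu's exact
algorithm: it only produces an instruction order (no cycle timing), it assumes each virtual
register is defined once in the block, it ignores values that are live on entry, and the
names pressure_aware_order, num_regs and threshold are hypothetical.

```python
def pressure_aware_order(instrs, num_regs, threshold=1):
    """Order a basic block, switching heuristics on register pressure.

    instrs: list of (dest, srcs) in program order; each dest is assumed to be
    defined exactly once, and sources not defined in the block are ignored.
    Returns (order as a list of instruction indices, maximum live values).
    """
    n = len(instrs)
    defs = {dest: i for i, (dest, _) in enumerate(instrs)}
    preds = [sorted({defs[s] for s in srcs if s in defs}) for _, srcs in instrs]
    succs = [[] for _ in range(n)]
    for i, ps in enumerate(preds):
        for p in ps:
            succs[p].append(i)

    # Height = longest dependence chain starting at the instruction.
    height = [0] * n
    for i in reversed(range(n)):
        height[i] = 1 + max((height[s] for s in succs[i]), default=0)

    users_left = [len(s) for s in succs]  # unscheduled users of each value

    def delta(i):
        # Net change in live values: one new value, minus operands whose last use this is.
        return 1 - sum(1 for p in preds[i] if users_left[p] == 1)

    scheduled, order, live, max_live = set(), [], 0, 0
    while len(order) < n:
        ready = [i for i in range(n)
                 if i not in scheduled and all(p in scheduled for p in preds[i])]
        if num_regs - live > threshold:
            # Enough free registers: favour parallelism (longest chain first).
            pick = max(ready, key=lambda i: (height[i], -i))
        else:
            # Registers are scarce: favour the instruction that frees the most values.
            pick = min(ready, key=lambda i: (delta(i), -height[i], i))
        live += delta(pick)
        for p in preds[pick]:
            users_left[p] -= 1
        scheduled.add(pick)
        order.append(pick)
        max_live = max(max_live, live)
    return order, max_live


# Hypothetical block: four loads feeding a small addition tree.
block = [("t1", ["a"]), ("t2", ["b"]), ("t3", ["c"]), ("t4", ["d"]),
         ("t5", ["t1", "t2"]), ("t6", ["t3", "t4"]), ("t7", ["t5", "t6"])]
print(pressure_aware_order(block, num_regs=8))
print(pressure_aware_order(block, num_regs=3))
```

With eight registers the sketch issues all four loads first, which exposes the most
parallelism but keeps four values live. With three registers it interleaves the first
addition between the loads, so no more than three values are live at once.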
Bradlee, Eggers and Henry [BEH91] examined three strategies: post-pass scheduling, IPS,
and their own integrated technique called RASE. In their post-pass scheduling, register
allocation is performed first and the two phases are completely separate. Their IPS is a
slightly modified version of Goodman and Hsu's integrated pre-pass scheduling in which the
register limit is calculated in a different way. In this approach, the local allocator is
replaced with a global allocator, and the scheduler is invoked again after allocation in
order to better schedule spill code. Their version of integrated pre-pass scheduling showed
greater speedups than those reported by Goodman and Hsu [GH88].
RASE (Register Allocation with Schedule Estimates) is their more tightly integrated
approach. First, a pre-scheduler is invoked before global register allocation (GRA) to
gather schedule cost information that is then used by GRA. Second, GRA allocates global
registers according to schedule cost estimates and spill cost estimates. GRA then determines
the limit on local registers for each basic block. Lastly, the final scheduler performs
local register allocation while scheduling; it strictly adheres to the local register limit
and spills if necessary. However, RASE performs slightly better than IPS only if the basic
blocks are large. Because their compiler did not perform global, parallelism-enhancing
optimizations, their experiments say little about deeply pipelined machines targeted at
floating-point computing, such as the Trace. Nevertheless, both RASE and IPS are
considerably better than post-pass scheduling, supporting the hypothesis that integrating
the two phases is necessary.
Moon and Ebcioglu [ME92] presented a resource-constrained global code scheduling
technique for VLIW and superscalar machines. The scheduling algorithm pre-computes the set
of available operations that are schedulable and can reach the root VLIW instruction. Based
on a global view of the program, selective scheduling is used to separate heuristics from
transformations and to apply the heuristics. When an operation has been identified as
movable, a search is made for a suitable destination register; if none can be found among
the available registers, the next best movable operation is examined. Their technique
obtains a significant speedup on AIX utilities and SPEC integer benchmarks without resorting
to branch probabilities, which are usually an unreliable source of information when
exploiting irregular parallelism.
Pinter [Pin93] developed a framework for an integrated approach based on building a
parallel interference graph, which represents both register allocation conflicts and
scheduling constraints. If the parallel interference graph is colorable, it yields a
register allocation that does not introduce false dependences. Heuristics are provided for
trading off between scheduling and register spilling. The approach is time consuming when a
procedure is called, because only a small number of registers can be used at any point
during the execution of the program. Further heuristics for removing edges and selecting
values to spill were therefore left for future investigation.
 Cindy Norris and Lori L. Pollock [NP93] developed a cooperative global register
allocation and instruction scheduling environment that minimizes the amount of
modification to each phase while providing speedups over uncooperative approaches that are
comparable to or better than those of previous, more complex cooperative approaches. They
experimentally identified which of 18 different methods of scheduler-sensitive register
allocation gives the best overall speedup over uncooperative post-pass scheduling. Compared
with previous cooperative approaches, their scheduler-sensitive register allocation
strategy yielded comparable speedups in general and better speedups with larger numbers of
registers.
Berson, Gupta and Soffa [BGS93] developed a technique called URSA, a unified resource
allocator for the functional units and registers of VLIW machines. Operating on a
dependence DAG representation of the program, their technique employs three phases:
measurement of the resource requirements to identify regions with excess requirements;
application of transformations that reduce the requirements to levels supported by the
target machine; and assignment of resources. URSA relies on a program trace to build the
DAG, which makes it mainly applicable to scientific applications. At the time, the authors
were investigating methods to enable URSA to handle superscalar architectures.
This framework was later applied by Berson, Gupta and Soffa [BGS94] to the problem of
integrating register allocation with local instruction scheduling, and then with global
instruction scheduling. Global instruction scheduling was found to be much more complicated
than local instruction scheduling under this framework, because the benefits of moving
instructions must be determined, whereas in local scheduling there is no choice but to
reduce excess resource demands. They also mention that the code motion algorithms must
consider the effects of code duplication on the critical paths, and must determine whether
the critical path length of the source block is decreased when loads of moved values are
inserted into it. However, that paper gives no detail on how these problems are handled when
resource spackling is combined with global instruction scheduling. As a result, the
effectiveness of the URSA approach had not, at that point, been judged by any experimental
results.
Integrating the optimization phases can improve the quality of the generated code, yet
integration has been only partially achieved because the various phases use several
different representations of a program. Recognizing this, Berson, Gupta and Soffa
[BGS95] proposed a program representation called the Global Unified Resource
Requirements Representation (GURRR). In their approach, resource requirements and
availability information are combined with control and data dependence information.
This integration is based on the simultaneous allocation of different types of resources.
GURRR extends the instruction-level program dependence graph (PDG) with resource hole
nodes and reuse edges, which connect nodes that can temporally share an instance of a
resource. The representation explicitly incorporates maximum resource requirements,
excessive sets and resource holes. To achieve better integration, several optimization
phases are formulated to use the representation.
 Instruction scheduling and register allocation are interdependent, and a lack of
cooperation between the scheduler and the register allocator can result in code that
contains excess register spills or a lower degree of parallelism than is actually
achievable. In view of this, Cindy Norris and Lori L. Pollock [NP95a] proposed a strategy
for providing cooperation between register allocation and both global and local instruction
scheduling. By experimentally comparing their strategy with other cooperative and
uncooperative methods, they show that the greatest speedups are achieved by performing
either cooperative or uncooperative global instruction scheduling together with cooperative
register allocation and local instruction scheduling.
 Chang, Chen and King [CCK97] proposed integer programming formulations that combine
local instruction scheduling with local register allocation. Their formulations are based
on several assumptions. Two optimization problems are considered: the NRS problem and the
RS problem. The former combines instruction scheduling and register allocation in a basic
block with register spilling not allowed. The latter combines instruction scheduling and
register allocation in a basic block with register spilling allowed. Integer linear
programming is used to solve both problems. One major problem with integer linear
programming is that the number of variables and expressions in the formulations can be very
large even for a small code segment, which results in a very long solution time.
Experimental results are given for one simple 10-instruction example, which takes 20
minutes to solve optimally. These results suggest that the approach has very limited
practicality. Their formulations therefore need to be refined further to eliminate
redundant variables and inequalities. With such refinements, it should be possible to
develop a more general technique for combining instruction scheduling and register
allocation in multi-issue processors.
 In [BGS98], Berson, Gupta and Soffa present several algorithms that integrate register
allocation and instruction scheduling. They evaluate a newly developed integrated algorithm
alongside extended versions of existing algorithms.
In their first approach, Goodman and Hsu's IPS technique, which uses live range
spilling, is extended to incorporate live range splitting. By comparing the extended
algorithm with the original IPS approach, they show that live range splitting is far
superior to live range spilling when developing an integrated resource allocator.
Their second approach modifies the approach of Norris and Pollock to use the register
reuse DAG in place of the interference graph for detecting excessive register demands
(RRD). They found that register reuse DAGs are superior to interference graphs.
Their third approach, the Unified Resource Allocator (URSA), is based on the
measure-and-reduce paradigm for both registers and functional units [BGS93]. Using the
reuse DAGs, this approach identifies excessive sets, which represent groups of instructions
whose parallel scheduling requires more resources than are available. The excessive
demands for resources are reduced by transformations driven by the excessive sets. Live
range splitting is also used to reduce register demands. They found that this algorithm
performs better than the algorithms based on Goodman and Hsu [GH88] and Norris and Pollock
[NP95a], and also has the lowest compilation times.
Gang Chen and Michael D. Smith [CS99] developed a new approach to the phase ordering
problem associated with instruction scheduling and register allocation. Instead of trying
to perform instruction scheduling and register allocation together or trying to backtrack
during scheduling, they proposed a global code reorganization phase that runs after a
greedy pre-pass scheduler and before the register allocator. Their reorganizer controls
register pressure while maintaining the effectiveness of the pre-pass scheduler. This
two-phase approach is implemented as a top-down (TD) traversal followed by a bottom-up (BU)
traversal of the scheduling region. The key aspect of their approach is the separation of
the determination of schedule length from the minimization of register pressure. Their
results show that the two-phase approach to pre-pass scheduling performs significantly
better than the common approach of using a single register-pressure-sensitive scheduler.
They also mention that their approach is general and can be used in a global, DAG-based
scheduler in addition to a superblock scheduler.
 Vivek Sarkar, Mauricio J. Serrano and Barbara B. Simons [SSS01] developed a new
framework for selecting, duplicating and sequencing instructions in order to decrease
register pressure; its components are register-sensitive instruction selection,
register-sensitive instruction duplication and register-sensitive instruction sequencing.
In this approach, the instruction selection and duplication transformations can be
performed on intermediate-language instructions in a general dependence graph that contains
both true and non-true dependences. Their approach differs from earlier approaches that
restricted attention to a single expression tree or a single expression DAG. They also
proposed a new instruction scheduling algorithm for reducing register pressure that is
based on backwards scheduling. They found that register-sensitive instruction duplication
can deliver significant speedups (up to 1.22x) for the SPECint95 benchmarks on an IA-32
processor, and that register-sensitive sequencing delivers smaller speedups (up to 1.12x)
for the SPECjvm and Java Grande benchmarks on a PowerPC processor.
2.3. Other optimization techniques using integer linear 
programming 
 Although prior work using integer programming has had limited success in integrating
instruction scheduling and register allocation, integer programming has been used
successfully to solve various other compiler optimization problems optimally.
In [Pug91], William Pugh developed the Omega test, an integer programming algorithm
that can determine whether a dependence exists between two array references and, if so,
under what conditions. The Omega test is based on an extension of Fourier-Motzkin variable
elimination (a linear programming method) to integer programming, and has worst-case
exponential time complexity. Experiments suggest that, for almost all programs, the average
time required by the Omega test to determine the direction vectors for an array pair is
less than 500 µsecs on a 12 MIPS workstation. His study shows that the Omega test is a fast
and practical method for performing data dependence analysis that is adequate not only for
the problems encountered in vectorizing FORTRAN code, but also for the demands of more
sophisticated program transformation tools.
Robert Bixby, Ken Kennedy and Ulrich Kremer [BKK94] proposed an approach to automatic
data layout in the context of a programming tool that produces High Performance Fortran or
a similar language as output. Their approach makes it possible to explore exact solutions
to the automatic data layout problem, even though the problem as they formulate it is
NP-complete. Their experiments show that, even with a general-purpose integer programming
tool, there exists a formulation that can be solved very efficiently. Comparisons with
similar 0-1 problems and their special-purpose solvers indicate that these results could be
improved significantly if a special-purpose solver were used.
2.4. Summary 
This chapter surveyed various techniques that have been developed for integration and
cooperation between instruction scheduling and register allocation in order to solve the
phase ordering problem. All the previous techniques surveyed here combine instruction
scheduling with register allocation schemes based on graph coloring. Linear scan register
allocation [PS99], proposed by Massimiliano Poletto and Vivek Sarkar, is very simple and is
faster than algorithms based on graph coloring. Although linear scan is therefore an
attractive register allocation algorithm, to our knowledge nobody has yet attempted to
combine instruction scheduling with it. Thus, the performance of combining our proposed
pre-pass local instruction scheduler, which is discussed in chapter five, with the linear
scan register allocator is evaluated in chapter six.







Chapter 3 
Theoretical Background to this study 
3.1. Position of Register Allocation and Instruction Scheduling 
in compiler back end 
The main components of a compiler are the front end and the back end. The front end
consists of lexical analysis, parsing and semantic analysis. It builds an abstract syntax
tree and a symbol table. Research on compiler front ends has reached a mature stage, so
researchers now focus on the back end, especially on register allocation and instruction
scheduling, in order to produce more efficient object code that exploits instruction-level
parallelism in modern optimizing compilers.
The compiler back end is divided into individual components called phases, as
shown in Figure-3.1. After the front end has built the abstract syntax tree, the initial
optimization phase builds the flow graph, or intermediate representation. While building
the flow graph, some initial optimizations can be performed on the instructions within each
block. As illustrated in Figure-3.1, register allocation is generally preceded by flow graph
building, dominator optimization, interprocedural optimization, dependence optimization,
global optimization, limiting resources and instruction scheduling, and is followed by a
second instruction scheduling phase that schedules spill code [Mor98]. Although instruction
scheduling and register allocation can be performed simultaneously for expressions, in the
more general case of local or global instruction scheduling along with global register
allocation the two are generally performed as separate optimizations [GW95], as shown in
Figure-3.1.
 
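[Figure: compiler phases in order: Front End; Flow Graph Building; Dominator Optimization;
Interprocedural Optimization; Dependence Optimization; Global Optimization; Limiting
Resources; Instruction Scheduling; Register Allocation; second Instruction Scheduling pass;
Form Object Module]

Figure-3.1:  Position of Register Allocation and Instruction Scheduling in compiler
back end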
 
3.2. Instruction Scheduling 
Instruction scheduling is one of the most important compiler optimizations because of
its role in increasing pipeline utilization. Local instruction scheduling is the process of
finding a minimum-length instruction schedule for a basic block subject to precedence,
latency and resource constraints. A basic block is a straight-line sequence of code with a
single entry point and a single exit point. Consider the code fragment in Figure-3.2. Assume
that the processor has only one functional unit, that memory access operations take two
cycles, and that all other operations take one cycle.
 
Before Scheduling          After Scheduling
1. LOAD R1, a              1. LOAD R1, a
2. NOP                     2. LOAD R2, b
3. ADD  R3, R1, R1         3. ADD  R3, R1, R1
4. LOAD R2, b              4. ADD  R4, R2, R3
5. NOP
6. ADD  R4, R2, R3

Figure-3.2: Instruction scheduling example
 
 
The original code on the left takes 6 cycles, while the optimally scheduled code on
the right takes only 4 cycles. The NOP operations denote the cycles in which the machine
has to wait for the results of previous operations. The scheduled code effectively hides the
latency of the memory access operations, but it keeps more values live at the same time and
therefore needs more registers than the original order requires.
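The cycle counts quoted for Figure-3.2 can be checked with a small simulation. The Python
sketch below is illustrative only and is not part of the tool chain used in this thesis. It
assumes a single-issue, in-order, pipelined machine in which a result becomes available a
fixed number of cycles after its instruction issues; the function name count_cycles and the
tuple encoding of instructions are hypothetical.

```python
LATENCY = {"LOAD": 2, "ADD": 1}  # cycles until the result can be read


def count_cycles(instrs, latency):
    """Return the cycle in which the last instruction issues.

    instrs: list of (opcode, dest, srcs) in issue order. Stall cycles
    correspond to the NOPs shown in Figure-3.2.
    """
    ready = {}   # register -> first cycle in which its value can be read
    cycle = 0    # issue cycle of the previous instruction
    for op, dest, srcs in instrs:
        # Issue no earlier than the next cycle and not before all operands are ready.
        issue = max([cycle + 1] + [ready.get(r, 0) for r in srcs])
        ready[dest] = issue + latency[op]
        cycle = issue
    return cycle


original = [("LOAD", "R1", []), ("ADD", "R3", ["R1", "R1"]),
            ("LOAD", "R2", []), ("ADD", "R4", ["R2", "R3"])]
scheduled = [("LOAD", "R1", []), ("LOAD", "R2", []),
             ("ADD", "R3", ["R1", "R1"]), ("ADD", "R4", ["R2", "R3"])]

print(count_cycles(original, LATENCY), count_cycles(scheduled, LATENCY))
```

Under these assumptions the original order needs 6 cycles and the reordered code needs 4,
in agreement with the figure.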
VLIW and superscalar machines use multiple functional units to increase their
peak performance. To keep these functional units busy, the compiler must expose enough
instruction-level parallelism (ILP) in order to produce good code for these machines.
Consider the code fragment in Figure-3.3(a). Assume that the machine has two identical
functional units, that a multiply takes two cycles, and that addition and subtraction take
one cycle.
                        
ADD R1, R1, R2
MUL R3, R2, R3
SUB R1, R1, R5
SUB R3, R3, R5
ADD R4, R1, R3

(a) Before instruction scheduling

ADD R1, R1, R2        MUL R3, R2, R3
SUB R1, R1, R5        NOP
NOP                   SUB R3, R3, R5
NOP                   ADD R4, R1, R3

(b) After instruction scheduling

Figure-3.3: Instruction Level Parallelism
 
Instruction level parallelism is exploited by the scheduled code in Figure-3.3(b). 
When one functional unit is busy with a multiply, an addition and a subtraction are issued 
to the other functional unit. The amount of ILP available is subject to data dependence. 
For example, in Figure-3.3(b) we cannot move the last add into an earlier cycle because it 
has to wait for the results of previous operations.  
3.2.1. Different Types of Instruction Schedulers 
 There are different types of schedulers, based on the size of the pieces of the
procedure that they attempt to reorder. These are basic block schedulers, branch schedulers,
cross-block schedulers, software pipeliners, trace schedulers and percolation schedulers
[Muc97].
 Basic block schedulers reorder the instructions within individual blocks. The form 
of the program flow graph is not changed. The reordering of each block is independent of 
the reordering of other blocks, with the possible exception of some knowledge about 
values computed at the end of a block (or used at the beginning of a block) [Muc97, 
Mor98]. 
 Cross-block scheduling improves basic block scheduling by considering a tree of 
blocks at once and may move instructions from one block to another [Muc97].  
 Software pipeliners reorder and replicate instructions in loops to eliminate stalls. 
The result of software pipelining is a new loop in which values are being simultaneously 
computed for multiple iterations of the original loop [Muc97, Jai91, Lam88]. 
 Trace scheduling is an instruction scheduling method developed by Fisher 
[Muc97, Fis81, Ell86]. A trace is a sequence of instructions, including branches without 
including loops, that is executed for some input data. Trace scheduling uses a basic-block 
scheduling method to schedule the instructions in each entire trace, beginning with the 
trace with the highest execution frequency. Trace schedulers reorder the instructions in a 
simple path of blocks. The paths that are reordered are chosen to be the most frequently 
executed paths in the program. Instructions may be moved to places where the value 
computed is not guaranteed to be used (speculative execution). By reordering these larger 
sequences of instructions, more opportunities can be found for eliminating stalls. 
 Percolation scheduling is another aggressive cross-block scheduling method that 
was developed by Nicolau [Muc97, AN88]. 
3.2.2. Instruction Scheduling Problems 
The local instruction scheduling problem is to find a minimum-length instruction
schedule for a basic block subject to precedence, latency and resource constraints. This
problem becomes complicated (and interesting) for pipelined processors because of data
hazards and structural hazards [WH00]. A data hazard occurs when an instruction i produces
a result that is used by a following instruction j, so that j's execution must be delayed
until i's result is available. Data hazards arise from data dependences, of which there are
four kinds:
True dependence: If an instruction modifies some resource that is later used by a
following instruction, then there is a true dependence. In the following example, I2 is true
dependent on I1 because I1 defines variable x, which is then used by I2.
  I1 :  x := y + z
  I2 :  w := x + z

Anti-dependence: If an instruction uses a resource that is later modified by a
following instruction, there is an anti-dependence. In the example that follows, I2 is
anti-dependent on I1 because I1 uses the value of y before I2 redefines y.
  I1 :  x := y + z
  I2 :  y := w + z

Output dependence: If both instructions modify the same resource, then the
initial order must be preserved so that later instructions see the value written by the
second instruction. In the following example, I2 is output dependent on I1 because both I1
and I2 modify the same data item x.
  I1 :  x := y + z
  I2 :  x := w + z

Input dependence: If both instructions use the same resource without modifying
it, then there is no restriction on their order. In the following example, I1 and I2 are
input dependent on each other because both use the same data item y without redefining it.
  I1 :  x := y + z
  I2 :  x := y + z

Among the four kinds of data dependences mentioned above, true dependence,
anti-dependence and output dependence can cause data hazards.
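The four definitions above can be written down directly as set tests. The following Python
sketch is a minimal illustration, assuming each instruction defines exactly one variable;
the function name classify and the (defined, used) encoding are hypothetical. Note that a
pair of instructions can carry several dependences at once: in the examples above, both
instructions also read z, so every pair is reported as input dependent as well.

```python
def classify(first, second):
    """Return the kinds of data dependence of `second` on `first`.

    Each instruction is a pair (defined_variable, set_of_used_variables),
    and `first` precedes `second` in program order.
    """
    def1, uses1 = first
    def2, uses2 = second
    kinds = set()
    if def1 in uses2:
        kinds.add("true")     # second reads what first wrote
    if def2 in uses1:
        kinds.add("anti")     # second overwrites what first read
    if def1 == def2:
        kinds.add("output")   # both write the same variable
    if uses1 & uses2:
        kinds.add("input")    # both read a common variable
    return kinds


i1 = ("x", {"y", "z"})                    # I1: x := y + z
print(classify(i1, ("w", {"x", "z"})))    # I2: w := x + z
print(classify(i1, ("y", {"w", "z"})))    # I2: y := w + z
print(classify(i1, ("x", {"w", "z"})))    # I2: x := w + z
```

Only the "true", "anti" and "output" results constrain the scheduler; an "input" result
alone leaves the two instructions free to be reordered.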
A structural hazard occurs when a resource limitation causes an instruction's
execution to be delayed. Since the general instruction scheduling problem is NP-complete, a
number of heuristic methods that give approximate solutions have been developed.
Among them, list scheduling is the dominant method. More advanced techniques, such as
trace scheduling and software pipelining, typically use list scheduling to perform the
actual assignment of operations to specific cycles.
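A basic form of list scheduling can be sketched as follows. This Python code is a
simplified illustration rather than the scheduler used in this thesis: it assumes identical,
fully pipelined functional units, takes the dependence DAG as an explicit predecessor map,
and uses the latency-weighted height of each operation as its priority. The function name
list_schedule and the operation labels are hypothetical. The example encodes the code of
Figure-3.3(a).

```python
def list_schedule(latency, preds, units):
    """Greedy cycle-by-cycle list scheduling.

    latency: dict op -> cycles until its result is available
    preds:   dict op -> list of ops whose results it needs
    units:   number of identical functional units (issue width)
    Returns a dict op -> issue cycle (cycles start at 1).
    """
    succs = {op: [] for op in latency}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)

    # Priority = longest latency-weighted path from the op to the end of the DAG.
    prio = {}

    def height(op):
        if op not in prio:
            prio[op] = latency[op] + max((height(s) for s in succs[op]), default=0)
        return prio[op]

    for op in latency:
        height(op)

    issue, cycle = {}, 1
    while len(issue) < len(latency):
        # An op is ready once every predecessor has issued and its result is available.
        ready = [op for op in latency
                 if op not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in preds.get(op, []))]
        ready.sort(key=lambda op: -prio[op])  # stable sort: ties keep program order
        for op in ready[:units]:
            issue[op] = cycle
        cycle += 1
    return issue


# Figure-3.3(a): MUL takes two cycles, ADD and SUB take one.
latency = {"add1": 1, "mul": 2, "sub1": 1, "sub2": 1, "add2": 1}
preds = {"sub1": ["add1"], "sub2": ["mul"], "add2": ["sub1", "sub2"]}
print(list_schedule(latency, preds, units=2))
```

With two functional units the sketch issues the operations in the same cycles as
Figure-3.3(b), finishing in cycle 4; with a single unit the last operation issues in
cycle 5.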
3.2.3. Local Instruction Scheduling  
In local instruction scheduling [HG83], instructions are reordered within a basic block,
a straight-line sequence of code with a single entry point and a single exit point. The
data dependences in a basic block can be described by a directed acyclic graph (DAG). The
leaves of the DAG are the variables occurring as operands in the basic block; the inner
nodes represent intermediate results. Basic blocks are typically rather small, with up to
20 instructions. Nevertheless, scientific programs often contain larger basic blocks, due
to, e.g., complex arithmetic expressions and array indexing. Larger basic blocks can also be
produced by compiler techniques such as loop unrolling and trace scheduling. A minimum
execution time schedule contains the smallest possible number of no-ops or idle cycles,
thereby utilizing the processor cycles as effectively as possible. Instruction scheduling
for single-issue and multi-issue processors is NP-complete if there is no fixed bound on
the maximum latency. Such negative results have led to the belief that production compilers
must take a heuristic or approximation approach, rather than an exact approach, to basic
block scheduling [Mou97]. More recently, however, exact approaches have proved practical.
Wilken et al. [WH00] showed that, through various modeling and algorithmic techniques,
integer linear programming can be used to produce optimal instruction schedules for large
basic blocks in a reasonable amount of time, and Peter van Beek and Kent Wilken [BW01]
presented a relatively simple constraint programming approach to instruction scheduling
that is fast and optimal.
3.2.4. Global Instruction Scheduling  
Recent studies make an effort to focus on scheduling across basic block 
boundaries [GS90a, Lam88, Ell86, AN88, SB92]. For example, global instruction 
scheduling increases the number of instructions available for parallelism by considering 
instructions beyond individual basic blocks. Literature shows that various techniques 
have been used to perform global instruction scheduling. 
Trace scheduling [Ell86] and percolation scheduling [AN88] employ the program 
Control Flow Graph (CFG) to globally rearrange code in order to increase instruction 
level parallelism. The CFG consists of a set of nodes representing basic blocks within the 
procedure and edges that represent the transfer of control between basic blocks. The 
flowgraph contains a start node, which is the entry node of the graph and from which there is a 
path to every other node in the flowgraph. 
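The flowgraph property described above can be checked mechanically. The following is a small Python sketch, assuming a hypothetical CFG for an if-then-else stored as an adjacency dictionary; the block names and the helper reachable_from are illustrative only.

```python
from collections import deque


def reachable_from(cfg, start):
    """Return the set of basic blocks reachable from `start` along control-flow edges."""
    seen, work = {start}, deque([start])
    while work:
        block = work.popleft()
        for succ in cfg[block]:
            if succ not in seen:
                seen.add(succ)
                work.append(succ)
    return seen


# Hypothetical CFG: nodes are basic blocks, edges are transfers of control.
cfg = {
    "entry": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": [],
}

# The start node must have a path to every other node in the flowgraph.
is_valid_flowgraph = reachable_from(cfg, "entry") == set(cfg)
```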
Trace scheduling [Fis81, Ell86] attempts to increase the number of available 
statements from which code is scheduled by using the CFG to trace out paths of potential 
execution in the program. The statements in a single trace are reordered for parallel 
execution. The performance of trace scheduling relies on the selection of the correct 
trace, so the compiler must predict which paths are most likely to be executed. Because 
program profiling is used for path prediction, constructing a trace schedule is time consuming, 
and the technique is mainly useful for scientific applications. 
Percolation scheduling [AN88], on the other hand, performs transformations to 
convert the CFG into a CFG with more parallelism. Core transformations are low-level 
transformations that are applied to adjacent nodes in the CFG to move statements from 
one node to the other. Data dependence information is required to make sure that a 
transformation will not change the semantics of the program. The core transformations 
are applied by a higher level set of transformations called scheduling transformations. 
These transformations move operations further than to an adjacent block. Another level 
of transformations, enabling transformations, relies on global information to rearrange the 
program graph to expose parallelism. The current system that incorporates percolation 
scheduling relies on interaction with the user to guide the scheduling process.  
Later, the CFG was replaced by the program dependence graph (PDG) [FOW87] for 
global scheduling. The PDG reflects both control and data dependence 
information within a single graph. Representing both types of relationships in a single 
graph allows control and data dependences to be treated uniformly. 
A more aggressive technique, called region scheduling, designed by Gupta and 
Soffa [GS90a], also uses the PDG to do global instruction scheduling. The region scheduling 
approach attempts to balance the parallelism in each region of the code so that all regions 
contain sufficient parallelism to utilize the resources of the architecture. Excess fine-grain 
parallelism in one region can be transferred into another region with insufficient 
parallelism, and coarse-grain parallelism can be converted to fine-grain parallelism.  
A scheduling technique known as dominator-path scheduling (DPS) was 
introduced by Sweany and Beaty [SB92]. A group of basic blocks on a path in the 
dominator tree is scheduled as a single basic block by combining the data dependence 
DAGs of the blocks into a single DAG for the entire path. When the basic blocks are 
combined, data dependence edges are inserted to prevent illegal code motion from one 
block to another block. Dominator analysis is used to compute the definition and use sets 
necessary for global scheduling. 
Software pipelining is focused on increasing instruction level parallelism within a 
loop [Lam88, Jai91]. The iterations of a software pipelined loop are initiated at constant 
intervals before preceding iterations complete. Thus, multiple iterations of the loop in 
different stages of their computation are in progress simultaneously. 
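The overlap of iterations can be illustrated with a little arithmetic. The following Python sketch assumes that iteration i of the loop occupies the cycles [i*ii, i*ii + body_len), where ii is the constant initiation interval; the function names and the numbers in the example are hypothetical.

```python
def in_flight(cycle, n_iters, body_len, ii):
    """Iterations active at `cycle` when iteration i occupies [i*ii, i*ii + body_len)."""
    return [i for i in range(n_iters) if i * ii <= cycle < i * ii + body_len]


def total_cycles(n_iters, body_len, ii):
    """Completion time of the pipelined loop: the last iteration starts at (n_iters - 1) * ii."""
    return (n_iters - 1) * ii + body_len


# Hypothetical loop: 4 iterations, a 6-cycle body, a new iteration every 2 cycles.
pipelined = total_cycles(n_iters=4, body_len=6, ii=2)
sequential = 4 * 6
overlapping = in_flight(cycle=4, n_iters=4, body_len=6, ii=2)
```

With these assumed numbers the pipelined loop needs 12 cycles instead of 24, and at cycle 4 the first three iterations are in progress simultaneously.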
Another technique, reverse if-conversion [WMHR93], eliminates the need to do explicit 
global instruction scheduling. The authors of [WMHR93] discuss a set of transformations that 
convert a CFG into a predicated intermediate representation, on which the compiler can 
perform global scheduling by applying local scheduling techniques. Reverse if-conversion 
is then used to convert the scheduled predicated operations into a scheduled 
control flow graph. 
Jack Liu and Fred Chow [LC02] propose the implementation of an instruction 
scheduler that produces near-optimal results by efficiently enumerating all possible 
schedules. They show that an enumeration-based approach to instruction scheduling is 
highly effective in producing efficient code for a processor with tight and irregular 
constraints. 
 Recently, in [DP02], Walter Lee, Diego Puppin, Shane Swanson and Saman 
Amarasinghe proposed a general instruction scheduling framework, convergent 
scheduling, which simplifies the implementation of a set of different passes that address 
scheduling constraints such as partitioning, load balancing, communication bandwidth, 
and register pressure for modern complex processors. By applying these heuristics, 
their scheduler is reported to obtain a significant speedup over Desoli's PCC algorithm 
and UAS on a 4-cluster clustered VLIW architecture, and over 
the existing space-time scheduler of the RAW processor. 
3.3. Register Allocation 
The register allocation problem is to assign variables to a limited number of 
hardware registers during program execution. Local register allocation assigns registers 
to variables within a basic block, which is a maximal branch-free sequence of instructions. 
Global register allocation assigns registers to variables throughout the program. Variables 
in registers can be accessed much more quickly than those that are not in registers. Typically, 
however, there are far more variables than registers, so it is necessary to assign multiple 
variables to the same register. Variables conflict with each other if one is used both before and 
after the other within a short period of time (for instance, within a subroutine). The goal 
is to assign non-conflicting variables to the same register so as to minimize the use of non-register 
memory. An optimal allocation can be discovered by solving an integer programming 
problem; however, this technique is too expensive for a production compiler [Mor98]. 
3.3.1. Local Register Allocators  
Local register allocation (or block-level register allocation) is the task of 
assigning data items to registers over an entire block of straight-line code so that the 
traffic between registers and memory is minimized. Local register allocation was first 
considered formally in 1966 by Horwitz, Karp, Miller, and Winograd [HKMW66]. In 
that paper, an algorithm was presented to produce an optimal allocation through dynamic 
programming. The algorithm accurately captures the index register architecture used at 
the time. Unfortunately, after more than thirty years, the index register model does not 
reflect the costs of a modern architecture with general purpose registers. Later 
techniques for local register allocation include [HFG89, Fre74, FL98]. 
3.3.2. Global Register Allocators  
The commonly used approach of global register allocation using graph coloring 
was originally developed by Chaitin et al. [Cha82]. They color a graph that 
abstracts the interference among live ranges using K colors, where K is the number of 
physical registers. Their method consists mainly of two tasks: simplification of the graph 
and assignment of colors. Simplification repeatedly removes nodes with fewer than K 
neighbors, which gives an order of coloring that guarantees K-colorability. Assignment is 
then made by giving colors to live ranges one by one, in the reverse of the order in which 
they were removed. This approach packs live ranges into registers so as to minimize the spill cost. 
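The two tasks can be sketched in a few lines of Python. This is a simplified illustration rather than Chaitin's allocator: the interference graph is hypothetical, and the spill candidate is chosen by highest degree, whereas the original method uses a spill-cost estimate.

```python
def chaitin_color(graph, k):
    """Simplify/select coloring of an interference graph with k colors.

    graph: dict live_range -> set of interfering live ranges (symmetric)
    Returns (coloring, spilled): coloring maps live range -> color in range(k);
    spilled lists live ranges removed because no node of degree < k remained.
    """
    work = {n: set(adj) for n, adj in graph.items()}
    stack, spilled = [], []
    while work:
        # Simplify: a node with fewer than k neighbors can always be colored later.
        node = next((n for n in work if len(work[n]) < k), None)
        if node is None:
            # No such node: pick a spill candidate (assumption: highest degree).
            node = max(work, key=lambda n: len(work[n]))
            spilled.append(node)
        else:
            stack.append(node)
        for nb in work.pop(node):
            work[nb].discard(node)

    coloring = {}
    while stack:  # Select: color in reverse order of removal.
        node = stack.pop()
        used = {coloring[nb] for nb in graph[node] if nb in coloring}
        coloring[node] = next(c for c in range(k) if c not in used)
    return coloring, spilled


# Hypothetical interference graph: a, b, c interfere pairwise; d interferes with c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coloring3, spilled3 = chaitin_color(graph, 3)
coloring2, spilled2 = chaitin_color(graph, 2)
```

With three colors every live range receives a register; with two colors the triangle a, b, c cannot be colored and one of its members is marked for spilling.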
Chow and Hennessy [CH84] designed an alternative form of global register 
allocation via graph coloring. In their attempt to reduce the amount of spill code inserted, 
live range splitting is done instead of spilling the entire live range as in Chaitin’s method. 
In addition, the algorithm assumes that all references are initially from memory and 
attempts to map memory references to physical registers as opposed to mapping virtual 
registers to physical registers. These approaches may be more amenable to memory-
memory and memory-register architectures where one or more operands of an instruction 
can be referenced from memory. Their algorithm is also flexible enough to be adapted to 
different architectures, language environments, and program characteristics. 
The method used by Gupta, Soffa, and Steele [GSS89] partitions a program into 
code segments via clique separators and performs register allocation on these segments 
independently. A clique separator is a completely connected subgraph whose removal 
disconnects the graph into at least two subgraphs. By combining the allocations of the 
subgraphs, the allocation for the entire program is obtained. The technique is space 
efficient because the interference graph for only one code segment needs to be 
constructed at any point in the allocation. Partitioning also increases the likelihood of 
obtaining a coloring without spilling. 
Callahan and Koblenz [CK91b] designed a hierarchical approach to register 
allocation. Based on the premise that the register interference graph suffers from the loss 
of program flow structure, the approach covers the program control flow graph with a tree 
of tiles reflecting the program’s hierarchical control structure, and then runs a two-phase 
algorithm that performs standard graph coloring on each tile, allocating registers in a way 
that is sensitive to local usage patterns while retaining a global perspective, and inserts 
spill code into less frequently executed parts of the program. Their approach results in fewer 
dynamic memory references than the Chow and Hennessy approach.   
A technique by Proebsting and Fischer [PF91] obtains a global register 
allocation by following the local allocation with a phase that eliminates loads and stores at 
basic block boundaries. The technique is a probabilistic approach in which local 
allocation is followed by probabilistic global allocation performed iteratively from inner 
to outer loops. Thus, the local allocation is combined to obtain a global allocation. When 
only a global register allocation phase is executed, local and global values compete 
equally for registers. The pitfall of this approach is a slow compilation speed. 
Hendren et al. [HGAM92] used a hierarchical cyclic interval graph to perform 
register allocation instead of using an interference graph. In that study, a live range is 
represented by a cyclic interval. The allocation separates the spilling phase from the 
allocation phase by first transforming the hierarchical interval graph into an equivalent 
graph which is guaranteed to be colorable. The allocation is performed next by a newly 
introduced algorithm called the fat cover algorithm. 
Kolte and Harrold [KH93] employed a load/store range analysis technique for live 
range splitting that is based on reaching definition and live variable analysis. They 
suggest building an interference graph in which the nodes represent load/store ranges 
instead of the traditional live ranges. Their experiments indicate that a graph coloring 
register allocator operating on the load/store ranges often finds a better allocation than 
the same allocator operating on live ranges. 
David W. Goodwin and Kent D. Wilken [GW96] at the University of California 
at Davis developed a fundamentally new register allocation approach based on integer 
programming. The Optimal Register Allocation (ORA) approach optimally allocates 
registers and optimally places spill code, significantly decreasing spill code overhead 
compared to the traditional graph-coloring approach. Their approach builds a 0-1 integer 
program (IP) representation of the allocation problem using integer variables to represent 
possible register allocation actions at each point in the computer program. Each action 
has an associated cost, and constraints limit the solver to choose only actions that lead to 
a valid allocation. CPLEX is used to solve the IP, empirically producing optimal 
allocations in O(n^3) time.  
A. Koseki, H. Komatsu and Y. Fukazawa [KFK97] proposed a register existence 
graph that can express the interference among symbolic registers and the parallelism 
among instructions in a program. They have also shown that leveling a register existence 
graph realizes the generation of anti-dependence and spill code taking account of the 
parallelism in a program, which existing methods rarely do. They are now considering 
ways of improving the leveling algorithm and allowing cooperation between a spill code 
generator and a code scheduler. 
In [KW98], Timothy Kong and Kent D. Wilken proposed a precise approach to 
register allocation for irregular-register architectures which is based on 0-1 integer 
programming (IP). ORA has been extended to irregular register architectures - 
architectures that place restrictions on register usage based on instruction type. Many 
real-world processors - such as the Intel x86 family - have irregular registers. While 
efficient register allocation for irregular architectures is difficult, better register allocation 
for the x86 architecture has the potential to benefit a vast number of users. An IP register 
allocator was built for the x86 architecture within the GNU C Compiler (GCC), and was 
compared experimentally with GCC’s graph-coloring register allocator. Experimental 
results show that the IP allocator reduces register allocation overhead by 61% compared 
with the graph coloring allocator. The results also show that the x86 IP allocator is 
dramatically faster than the prior RISC IP allocator, because of the smaller number of 
registers in the x86 architecture and because of the register irregularities. The paper 
discusses several irregular architecture features, but others remain to be 
modeled. For example, instruction selection can be integrated into register allocation to 
further reduce spill code. The authors plan to report these features in a future paper.  
Poletto and Sarkar [PS99] proposed a well-behaved global register 
allocator, the linear scan register allocator, which is considerably faster than algorithms based 
on graph coloring because it allocates registers to variables in a single linear-time scan of the 
variables' live ranges. It is quite simple and results in code that is almost as efficient as 
that obtained using more complex and time-consuming register allocators based on graph 
coloring. Experimental evaluations of, and improvements to, linear scan 
register allocation can be found in [OGM98, OT98, EK02, SS03]. 
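The single scan over live ranges can be sketched as follows. This is a simplified Python illustration in the spirit of linear scan, assuming each variable has one contiguous live interval with inclusive bounds; the interval data are hypothetical and the sketch omits the refinements discussed in the cited follow-up work.

```python
def linear_scan(intervals, num_regs):
    """Linear scan allocation over live intervals.

    intervals: dict name -> (start, end), inclusive bounds
    Returns (assignment, spilled): assignment maps name -> register number.
    """
    free = list(range(num_regs))
    active = []  # (end, name) pairs, kept sorted by increasing end point
    assignment, spilled = {}, []
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(assignment[old])
        if free:
            assignment[name] = free.pop(0)
            active.append((end, name))
        else:
            last_end, last = active[-1]
            if last_end > end:
                # Spill the interval that ends furthest away and reuse its register.
                assignment[name] = assignment.pop(last)
                spilled.append(last)
                active[-1] = (end, name)
            else:
                spilled.append(name)
        active.sort()
    return assignment, spilled


# Hypothetical live intervals and two available registers.
intervals = {"a": (0, 10), "b": (1, 3), "c": (2, 8), "d": (4, 6)}
assignment, spilled = linear_scan(intervals, num_regs=2)
```

Here the long-lived variable a is spilled when c needs a register, and d later reuses the register freed by b.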
Andrew W. Appel and Lal George [AL01] have formulated the register allocation 
problem for CISC architectures with few registers into one involving optimal placement 
of spill code, followed by optimal register coalescing. They have given some empirical 
evidence that dividing the problem into these two phases does not significantly worsen 
the overall quality of the solution, but a full demonstration of this fact would require 
optimal solutions to the overall problem that no one has been able to calculate. They have 
demonstrated an efficient algorithm using integer linear programming for optimal spill-
code placement. Programs compiled with optimal spilling followed by optimistic 
coalescing run about 9.7% faster than those compiled with SSA based splitting followed 
by iterated register coalescing. 
Akira Koseki, Hideaki Komatsu and Toshio Nakatani [KKN02] described a new 
register allocation framework that uses a Coloring Precedence Graph (CPG) and a Register 
Preference Graph (RPG) and is based on Chaitin-style coloring. Their focus is on maximizing 
the chances for live ranges to be allocated to the most preferred registers while not 
destroying the colorability obtained by graph simplification. Experimental results show 
that their coloring algorithm is powerful to simultaneously handle spill decisions, register 












Chapter 4 
Preliminary Results of combined instruction scheduling 
and register allocation using integer programming   
This chapter reports the experimental results of instruction scheduling and register 
allocation using integer programming. With the goal of combining instruction scheduling 
and register allocation using integer programming, optimal instruction scheduling was 
implemented as an ILOG OPL Studio scheduling problem, and the method of David W. 
Goodwin and Kent D. Wilken [GW95] was implemented for optimal register allocation. 
4.1. Optimal Instruction Scheduling 
 Once the instruction scheduling problem has been formulated as an ILOG OPL Studio 
scheduling problem (scheduling model), an optimal (or at 
least near-optimal) solution can be found. For example, the following listing, 
Figure-4.1, is output of the Trimaran software: a portion of the wc.O_el file, which 
is a Region Based Elcor Language (REBEL) textual listing. The Trimaran back 
end (Elcor) uses the Elcor Intermediate Representation (the Elcor IR) to represent a 
program unit. The Elcor IR has a textual representation, known as Rebel. 
     
      hb 35 (
      weight(105196)
      entry_ops(344) exit_ops(142 143 144)
      entry_edges(ctrl ^87 ctrl ^88) exit_edges(ctrl ^236 ctrl ^233 ctrl ^237 ctrl ^101)
      flags(sched)
      attr(lc ^553)
      subregions(
      op 344 (C_MERGE [] [] s_time(0) s_opcode(C_MERGE.0) attr(bb_id(55)) in_edges(op-155(105194) op-343(2)) flags(real_merge sched))
      op 141 (L_B_C1_C1 [br<50:i gpr 2>] [br<1:i gpr 4>] p<t> s_time(0) s_opcode(L_B_C1_C1.0) attr(lc ^555) flags(sched))
      op 242 (PBRR [br<82:b btr 2>] [b<37> i<0>] p<t> s_time(0) s_opcode(PBRR.0) attr(lc ^556) flags(sched))
      op 244 (PBRR [br<84:b btr 3>] [b<36> i<0>] p<t> s_time(0) s_opcode(PBRR.1) attr(lc ^557) flags(spec sched))
      op 246 (PBRR [br<86:b btr 4>] [b<37> i<1>] p<t> s_time(0) s_opcode(PBRR.2) attr(lc ^558) flags(spec sched))
      op 243 (CMPP_W_EQ_UN_UN [br<83:p pr 2> u<>] [br<50:i gpr 2> i<32>] p<t> s_time(2) s_opcode(CMPP_W_EQ_UN_UN.0) flags(sched))
      op 245 (CMPP_W_EQ_UN_UN [br<85:p pr 3> u<>] [br<50:i gpr 2> i<10>] p<t> s_time(2) s_opcode(CMPP_W_EQ_UN_UN.1) flags(spec sched))
      op 247 (CMPP_W_EQ_UN_UN [br<87:p pr 4> u<>] [br<50:i gpr 2> i<9>] p<t> s_time(2) s_opcode(CMPP_W_EQ_UN_UN.2) flags(spec sched))
      op 142 (BRCT [] [br<82:b btr 2> br<83:p pr 2>] p<t> s_time(3) s_opcode(BRCT.0) attr(lc ^562) out(op-345(82954) op-364(22242)) flags(sched))
      op 345 (C_MERGE [] [] s_time(3) s_opcode(C_MERGE.0) attr(bb_id(56)) in_edges(op-142(82954)) flags(sched))
      op 143 (BRCT [] [br<84:b btr 3> br<85:p pr 3>] p<t> s_time(4) s_opcode(BRCT.0) attr(lc ^564) out(op-346(78918) op-362(4036)) flags(sched))
      op 346 (C_MERGE [] [] s_time(4) s_opcode(C_MERGE.0) attr(bb_id(57)) in_edges(op-143(78918)) flags(sched))
      op 144 (BRCT [] [br<86:b btr 4> br<87:p pr 4>] p<t> s_time(5) s_opcode(BRCT.0) attr(lc ^566) out(op-347(77615) op-364(1303)) flags(sched))
)))
Figure-4.1: Portion of wc.O_el file  
 
In Figure-4.1, the intermediate representation of a program is in Rebel format. It 
resembles the assembly language of a processor in its form except for certain additional 
fields. For example,  
 
op 246 (PBRR [br<86:b btr 4>] [b<37> i<1>] p<t> s_time(0) s_opcode(PBRR.2) attr(lc ^558) flags(spec sched)) 
 
where op refers to a region type, 246 refers to a region number, PBRR refers to an 
operation name, [br<86:b btr 4>] lists the operation destinations, [b<37> i<1>] lists the operation 
sources, p<t> refers to an operation predicate, s_time(0) refers to an operation scheduling 
time, s_opcode(PBRR.2) refers to an operation opcode (link to HMDES), attr(lc ^558) 
refers to operation attributes and flags(spec sched) refers to operation flags. In an operand 
representation in Rebel such as [br<86:b btr 4>], br is the register status, 86 is the virtual register 
index, b is the data type, btr is the register file and 4 is the physical register. Each operation is 
formulated in the OPL scheduling language, as can be seen in Appendix A. The result 
from ILOG OPL Studio is depicted in Figure-4.2. 
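The field layout just described is regular enough to be extracted with a few regular expressions. The following Python sketch is not part of Trimaran; the function names parse_op and parse_registers and the dictionary keys are hypothetical, and only the fields named above are handled.

```python
import re

# op <number> (<name> [<destinations>] [<sources>] ...
OP_RE = re.compile(r"op\s+(\d+)\s+\((\w+)\s+\[([^\]]*)\]\s+\[([^\]]*)\]")
# br<virtual register:data type  register file  physical register>
REG_RE = re.compile(r"br<(\d+):(\w+)\s+(\w+)\s+(\d+)>")


def parse_registers(operands):
    """Extract the bound-register operands from one operand list."""
    return [{"vreg": int(v), "dtype": t, "file": f, "preg": int(p)}
            for v, t, f, p in REG_RE.findall(operands)]


def parse_op(line):
    """Parse one Rebel operation line into a dictionary of its main fields."""
    head = OP_RE.search(line)
    if head is None:
        raise ValueError("not a Rebel operation line")
    s_time = re.search(r"s_time\((\d+)\)", line)
    flags = re.search(r"flags\(([^)]*)\)", line)
    return {
        "id": int(head.group(1)),
        "name": head.group(2),
        "dests": parse_registers(head.group(3)),
        "srcs": parse_registers(head.group(4)),
        "s_time": int(s_time.group(1)) if s_time else None,
        "flags": flags.group(1).split() if flags else [],
    }


line = ("op 246 (PBRR [br<86:b btr 4>] [b<37> i<1>] p<t> s_time(0) "
        "s_opcode(PBRR.2) attr(lc ^558) flags(spec sched))")
op = parse_op(line)
```

A parser of this kind is one possible starting point for generating the OPL model from Rebel text; literal operands such as b<37> and i<1> are deliberately ignored here.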
Optimal Solution with Objective Value: 4 
task[op141] = [0 -- 2 --> 2] 
task[op242] = [0 -- 1 --> 1] 
task[op244] = [0 -- 1 --> 1] 
task[op246] = [0 -- 1 --> 1] 
task[op243] = [2 -- 1 --> 3] 
task[op245] = [2 -- 1 --> 3] 
task[op247] = [2 -- 1 --> 3] 
task[op142] = [4 -- 1 --> 5] 
task[op143] = [5 -- 1 --> 6] 
task[op144] = [3 -- 1 --> 4] 
inte = Discrete Resource 
  required by task[op247] over [2,3]  in capacity 1 
  required by task[op245] over [2,3]  in capacity 1 
  required by task[op243] over [2,3]  in capacity 1 
  required by task[op246] over [0,1]  in capacity 1 
  required by task[op244] over [0,1]  in capacity 1 
  required by task[op242] over [0,1]  in capacity 1 
flo = Discrete Resource 
mem = Discrete Resource 
  required by task[op141] over [0,2]  in capacity 1 
br = Discrete Resource 
  required by task[op144] over [3,4]  in capacity 1 
  required by task[op143] over [5,6]  in capacity 1 
  required by task[op142] over [4,5]  in capacity 1 
lomem = Discrete Resource 
 
Figure-4.2: The result from ILOG OPL studio  
 
 The result can be checked against the Rebel output. For example, task[op141] =     
[0 -- 2 --> 2] means that op141 is scheduled at schedule time 0, takes two cycles to 
execute and finishes at schedule time 2. In the Rebel output, the schedule time for op 141 is also 
0, that is, s_time(0). Thus, the schedule time is correct. Likewise, the scheduling times for 
op 242, op 244, op 246, op 243, op 245 and op 247 are also correct. For op 142, op 143 
and op 144, however, the scheduling times from the Rebel output are 3, 4 and 5 respectively, 
whereas ILOG OPL Studio gives 4, 5 and 3 respectively. These operations execute on the branch 
unit (br), which has only one resource. Thus these operations must be scheduled serially. All of 
those operations can start after scheduling time 2, so the scheduling times for op 142, op 143 and op 
144 can be in any order, for example 3, 4, 5 or 4, 3, 5 or 4, 5, 3 respectively. The scheduling 
times for op 142, op 143 and op 144 are therefore also correct. In order to generate ILOG OPL 
source code automatically from the Trimaran software, a plan has been made to update 
function.cpp in the Trimaran software to include that feature.  
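The manual comparison above can also be done mechanically. The following Python sketch checks a schedule against dependence and resource constraints. The dependences are assumptions inferred from the register operands in Figure-4.1 (for example, op 243 reads gpr 2, which op 141 writes), the capacity of 3 for the integer unit is assumed because Figure-4.2 shows three integer tasks overlapping, and the function name check_schedule is hypothetical; the actual model is the one in Appendix A.

```python
def check_schedule(start, duration, unit, capacity, deps):
    """Return a list of violated constraints (empty if the schedule is legal)."""
    errors = []
    # Dependence constraints: an operation may not start before its producers finish.
    for op, preds in deps.items():
        for p in preds:
            if start[op] < start[p] + duration[p]:
                errors.append(f"{op} starts before {p} finishes")
    # Resource constraints: per cycle, no unit is used beyond its capacity.
    horizon = max(start[op] + duration[op] for op in start)
    for t in range(horizon):
        load = {}
        for op in start:
            if start[op] <= t < start[op] + duration[op]:
                load[unit[op]] = load.get(unit[op], 0) + 1
        for u, n in load.items():
            if n > capacity[u]:
                errors.append(f"unit {u} oversubscribed at cycle {t}")
    return errors


duration = {141: 2, 242: 1, 244: 1, 246: 1, 243: 1, 245: 1, 247: 1, 142: 1, 143: 1, 144: 1}
unit = {141: "mem", 242: "inte", 244: "inte", 246: "inte",
        243: "inte", 245: "inte", 247: "inte", 142: "br", 143: "br", 144: "br"}
capacity = {"mem": 1, "inte": 3, "br": 1}          # assumed capacities
deps = {243: [141], 245: [141], 247: [141],        # assumed from the operands
        142: [242, 243], 143: [244, 245], 144: [246, 247]}

common = {141: 0, 242: 0, 244: 0, 246: 0, 243: 2, 245: 2, 247: 2}
rebel = {**common, 142: 3, 143: 4, 144: 5}   # s_time values from Figure-4.1
opl = {**common, 142: 4, 143: 5, 144: 3}     # start times from Figure-4.2

rebel_errors = check_schedule(rebel, duration, unit, capacity, deps)
opl_errors = check_schedule(opl, duration, unit, capacity, deps)
```

Under these assumptions both schedules pass the check, whereas a schedule that places two branch operations in the same cycle is rejected.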
4.2. Optimal Register Allocation 
The method of David W. Goodwin and Kent D. Wilken [GW95] for optimal 
register allocation was implemented in ILOG OPL Studio. In their paper, a new approach to 
global register allocation named Optimal Register Allocation (ORA) is introduced with 
the goal of optimally allocating registers and optimally placing spill code, significantly 
decreasing spill code overhead compared to the traditional graph-coloring approach.  
4.2.1. Overview of ORA 
This section gives a brief overview of the Optimal Register Allocation approach 
to global register allocation introduced by David W. Goodwin and Kent D. Wilken 
[GW95]. The ORA register allocator consists of three top level modules: the analysis, 
solver, and rewrite modules, as illustrated in Figure-4.3. The ORA analysis module 
analyzes a function to determine the points in the function where decisions must be made 
about various register allocation actions. Each register-allocation decision is a binary 
decision: at a specific point in the function a certain register allocation action is either 
performed (1) or is not performed (0). Register-allocation decisions include whether a 
symbolic register should be defined into a specific real register, whether the assignment 
of a symbolic register to a specific real register should continue, whether a symbolic 
register should be stored to or loaded from memory, etc. The ORA analysis module 
produces a binary decision variable for each register-allocation decision that must be 
made, and records the decision variable and the corresponding register allocation action 
in the decision-variable table, as illustrated in Figure-4.3. 

















             
 
Figure-4.3: Overview of ORA 
 
The ORA solver module uses the information about decision variables, register 
allocation overheads, and conditions to construct a 0-1 integer program, a linear program 
with the added requirement that each variable must be assigned an integer solution value 
that is either 0 or 1. After constructing the 0-1 integer program, an optimal solution is 
found using a commercial integer program solver such as CPLEX or the IBM 
Optimization Subroutine Library. The solver determines a value of either 0 or 1 for each 
decision variable so the conditions are satisfied and the total register allocation overhead 
is minimized. The ORA solver module then records the solution value for each decision 
variable in the decision-variable table.  
Finally, the ORA rewrite module examines the decision variable table to 
determine each decision variable that was set to 1 by the solver, and to determine the 
corresponding register allocation action. The intermediate instructions are then rewritten 
based on the register allocation actions determined by the solver, with each symbolic 
register replaced by the assigned real register, and with spill code inserted at the 
prescribed locations. 
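The idea of choosing 0-1 values under constraints at minimum cost can be illustrated on a toy problem. The following Python sketch is a deliberate simplification and not the ORA formulation: it uses one decision variable per symbolic register and real register for the whole live range (ORA uses variables at individual program points), it replaces the commercial solver by exhaustive enumeration, and the spill costs are hypothetical.

```python
from itertools import product


def solve_allocation(sym_regs, real_regs, interferes, spill_cost):
    """Exhaustively solve a tiny 0-1 program: x[s, r] = 1 iff symbolic s is kept in real r."""
    variables = [(s, r) for s in sym_regs for r in real_regs]
    best = None
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        # Constraint 1: each symbolic register occupies at most one real register.
        if any(sum(x[s, r] for r in real_regs) > 1 for s in sym_regs):
            continue
        # Constraint 2: interfering symbolic registers never share a real register.
        if any(x[a, r] + x[b, r] > 1 for a, b in interferes for r in real_regs):
            continue
        # Objective: pay the spill cost of every symbolic register left in memory.
        cost = sum(spill_cost[s] * (1 - sum(x[s, r] for r in real_regs))
                   for s in sym_regs)
        if best is None or cost < best[0]:
            best = (cost, x)
    return best


# Symbolic registers A and B are live at the same time, so they interfere.
spill_cost = {"A": 3, "B": 2}  # hypothetical overheads
cost_two, x_two = solve_allocation(["A", "B"], [1, 2], [("A", "B")], spill_cost)
cost_one, x_one = solve_allocation(["A", "B"], [1], [("A", "B")], spill_cost)
```

With two real registers no overhead is incurred; with a single real register the cheaper symbolic register is left in memory. Enumeration is only feasible for toy sizes, which is why a dedicated integer programming solver is used in practice.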
4.2.2. The implementation of ORA 
 This section describes the outcomes of the implementation of the method of David W. Goodwin 
and Kent D. Wilken [GW95]. The control flow graph of the sample program 
from their paper is shown in Figure-4.4, and the instruction graph for symbolic 
registers A and B is given in Figure-4.5. 
 
A = _
 _ = A + 1
B = A + 1
_ = B + 2
_ = B / 2
 _ = A + 2
1 :




Figure-4.4: Control flow graph for the sample program 
 
 
Figure-4.5: Instruction graph for symbolic register A and B 
 
Goodwin and Wilken's paper shows that an optimal solution to the 0-1 integer programming 
problem maps directly to an optimal register allocation and optimal spill code placement. 
First, the optimal register allocation example in their paper was implemented using 
ILOG OPL Studio version 3.5. For the source code, please see Appendix B. The result is 
shown in Figure-4.6. 
  
Optimal Solution with Objective Value: 5.0000 
x1_defA = 1.0000 
x2_defA = 0.0000 
x1_use_end1A = 0.0000 
x1_use_cont1A = 1.0000 
x1_use_end2A = 0.0000 
x1_use_cont2A = 1.0000 
x2_use_end1A = 0.0000 
x2_use_cont1A = 0.0000 
x2_use_end2A = 0.0000 
x2_use_cont2A = 0.0000 
x1_defB = 0.0000 
x2_defB = 1.0000 
x1_use_endB = 0.0000 
x1_use_contB = 0.0000 
x2_use_endB = 0.0000 
x2_use_contB = 1.0000 
 
Figure-4.6: The result from ILOG OPL studio for optimal register allocation example 
program 
 
 Each variable name begins with x1 or x2, followed by a suffix such as _defA or 
_use_end1A, so that two real registers are available for register allocation. 
Variable x1_defA equal to 1 means that symbolic register A is allocated to real register 1. In 
the same way, x2_defB equal to 1 means that symbolic register B is allocated to real register 
2. The instruction graph of the sample program [GW95] is depicted in Figure-4.7.  
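Reading the allocation off the solver output can be automated. The following Python sketch assumes the naming convention visible in Figure-4.6, where x<r>_def<S> refers to real register r and symbolic register S; the helper name defined_registers is hypothetical.

```python
import re

# Matches lines such as "x1_defA = 1.0000": real register, symbolic register, 0/1 value.
DEF_RE = re.compile(r"x(\d+)_def(\w+)\s*=\s*([01])(?:\.0+)?")


def defined_registers(solver_output):
    """Map each symbolic register to the real register whose def variable is 1."""
    placement = {}
    for reg, sym, value in DEF_RE.findall(solver_output):
        if value == "1":
            placement[sym] = int(reg)
    return placement


# The four def variables copied from Figure-4.6.
output = """
x1_defA = 1.0000
x2_defA = 0.0000
x1_defB = 0.0000
x2_defB = 1.0000
"""
placement = defined_registers(output)
```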
 
 
Figure-4.7: Instruction graph for optimal register allocation of the sample program 
 
 Since there are two symbolic registers, A and B, and also two real registers, 
optimal register allocation can be performed. Goodwin and Wilken also mention that if there are 
too few registers or too many registers, optimal register allocation cannot be performed. 
In that case, it is necessary to use the spill code placement method. The sample program for 
spill code placement was also implemented in ILOG OPL Studio, and the results are shown in 
Table-4.1. The source code is presented in Appendix C. 
 
Optimal Solution with Objective Value: 9.0000  
x1_defA = 1.0000          x1_cont3A = 1.0000     x2_load2A = 0.0000 
x2_defA = 0.0000          x1_contB = 0.0000      x2_load3A = 0.0000 
x1_use_end1A = 0.0000     x2_cont1A = 0.0000     x2_load4A = 0.0000 
x1_use_cont1A = 1.0000    x2_cont2A = 0.0000     x2_load5A = 0.0000 
x1_use_end2A = 0.0000     x2_cont3A = 0.0000     x_memory_cont1A = 0.0000 
x1_use_cont2A = 1.0000    x2_contB = 1.0000      x_memory_cont2A = 0.0000 
x2_use_end1A = 0.0000     x1_store1A = 0.0000    x_memory_cont3A = 0.0000 
x2_use_cont1A = 0.0000    x2_store1A = 0.0000    x_memory_cont4A = 0.0000 
x2_use_end2A = 0.0000     x1_store2A = 0.0000    x_memory_cont5A = 0.0000 
x2_use_cont2A = 0.0000    x2_store2A = 0.0000    x1_storeB = 0.0000 
x1_defB = 0.0000          x1_store3A = 0.0000    x2_storeB = 0.0000 
x2_defB = 1.0000          x2_store3A = 0.0000    x1_load1B = 0.0000 
x1_use_endB = 0.0000      x1_load1A = 0.0000     x1_load2B = 0.0000 
x1_use_contB = 0.0000     x1_load2A = 0.0000     x2_load1B = 0.0000 
x2_use_endB = 0.0000      x1_load3A = 0.0000     x2_load2B = 0.0000 
x2_use_contB = 1.0000     x1_load4A = 0.0000     x_memory_cont1B = 0.0000 
x1_cont1A = 1.0000        x1_load5A = 0.0000     x_memory_cont2B = 0.0000 
x1_cont2A = 1.0000        x2_load1A = 0.0000   
 
Table-4.1: The result from ILOG OPL studio for the sample program of spill code 
placement  
 
 Again there are two symbolic registers, A and B, and two real registers, so no 
symbolic register needs to be stored into memory. The memory graph for spill code 
placement modeled by Goodwin and Wilken is shown in Figure-4.8. 
 
 
 
Figure-4.8: Memory Graph for the Sample Program of Optimal Spill Code Placement 
using Two Real Registers 
 
 However, the spill code placement example given by Goodwin and Wilken does not 
demonstrate the effectiveness of the method, because two real registers are used both for the 
sample program of optimal register allocation and for that of optimal spill code placement. In my 
view, in order to show the actual performance of their method, the sample program of spill code 
placement should use one register only. My study attempts to meet this requirement, and the result 
is shown in Table-4.2. 





Optimal Solution with Objective Value: 12.0000  
x1_defA = 1.0000          x1_cont3A = 1.0000     x_memory_cont1A = 0.0000 
x1_use_end1A = 0.0000     x1_contB = 1.0000      x_memory_cont2A = 1.0000 
x1_use_cont1A = 1.0000    x1_store1A = 1.0000    x_memory_cont3A = 0.0000 
x1_use_end2A = 1.0000     x1_store2A = 0.0000    x_memory_cont4A = 0.0000 
x1_use_cont2A = 0.0000    x1_store3A = 0.0000    x_memory_cont5A = 0.0000 
x1_defB = 1.0000          x1_load1A = 0.0000     x1_storeB = 0.0000 
x1_use_endB = 0.0000      x1_load2A = 0.0000     x1_load1B = 0.0000 
x1_use_contB = 1.0000     x1_load3A = 0.0000     x1_load2B = 0.0000 
x1_cont1A = 1.0000        x1_load4A = 1.0000     x_memory_cont1B = 0.0000 
x1_cont2A = 1.0000        x1_load5A = 0.0000     x_memory_cont2B = 0.0000 
 
Table-4.2: The result from ILOG OPL studio for the sample program of optimal spill 
code placement  
 
 As mentioned in Sections 4.1 and 4.2, experiments on optimal instruction 
scheduling and register allocation were performed separately. The plan was to 
combine these two phases in order to increase instruction level parallelism (ILP) and to 
reduce the number of memory accesses. However, while implementing 
optimal instruction scheduling and register allocation in isolation, it became clear that the 
number of variables and expressions in the formulations is very large even for a small 
code segment. This limits the practicality of the approach for solving the phase ordering problem. 
Therefore, in the later phases of the study, a much more promising approach, 
a modification of part of convergent scheduling [DP02, WDSS02], is used 




Chapter 5 
Cooperative local instruction scheduling with impact or 
region-based register allocator 
This chapter reports the experimental evaluation of an efficient local instruction 
scheduler compared with the list scheduler and the list backtracking scheduler from Trimaran. 
Although the list backtracking scheduler can generate an efficient schedule, it requires more 
compile time than a conventional list scheduler. In contrast, the list scheduler is very efficient in 
compile time because it never revisits or undoes a scheduling decision, but the scheduled 
code is not as efficient as that of the list backtracking scheduler [AMB00]. We propose a single-pass 
local instruction scheduler [KW04b] that schedules instructions efficiently 
while maintaining the optimal instruction schedule length. The proposed 
scheduler performs the scheduling within the optimal instruction schedule length in a 
single pass. Our experimental results show that the proposed scheduler can generate 
more efficient code than the list backtracking scheduler and significantly reduces compilation 
time compared with the list scheduler.  
5.1. Introduction 
 The demand for faster and more powerful computer systems is ever increasing.
One way to enhance computer performance is to increase the number of processors in a
system; another is to take advantage of as much instruction level parallelism (ILP) as
possible. Instruction scheduling is one of the most important phases in compiler
optimization. Its goal is to exploit the available instruction level parallelism
efficiently and effectively. To meet this goal, a single-pass local instruction scheduler
[KW04] is proposed. The proposed approach is based on part of convergent scheduling, a
general instruction scheduling framework that simplifies and facilitates the application
of a multitude of arbitrary constraints and scheduling heuristics required to schedule
instructions for modern complex processors. The proposed scheduler is implemented within
the pre-pass scheduler of Trimaran so that it can be compared with the Elcor pre-pass
scheduler of Trimaran. In Trimaran, instruction scheduling performed before register
allocation (by the impact or region-based register allocator) is called pre-pass
scheduling, and instruction scheduling performed after it is called post-pass scheduling.
Four instruction schedulers have already been implemented in Trimaran: list scheduler,
cycle scheduler, list scheduling with backtracking scheduler and operation scheduler.
Among them, the list scheduler is very efficient in compilation time, while the list
scheduling with backtracking scheduler [AMB00] can generate efficient scheduled code.
However, both of these schedulers pursue an optimal schedule by placing each instruction
at the earliest possible cycle. Our proposed scheduler instead tries to place each
instruction in the latest possible cycle within its basic block while maintaining the
optimal instruction schedule length, in order to reduce the number of simultaneously live
variables and unnecessary dependencies. This increases the probability that the register
allocator inserts less spill code, which can in turn shorten the schedule length after
register allocation. A comparison between a typical earlier schedule (e.g. the Elcor
schedule) and our proposed schedule is given in Figure-5.1, Figure-5.2 and Figure-5.3.































( a ) ( b ) ( c )  
Figure-5.1: Example data dependence graph (DDG) of basic block (BB) 38 of the rawcaudio
benchmark from Trimaran. (a) The original data dependence graph. Latency 1 on (op
140, op 239) means that operation 239 cannot start until one cycle after operation
140 completes. (b) The Elcor schedule, which places the instructions at the earliest
cycle, so that the live range of operation 137 is lengthened and more live ranges are
simultaneously live. As a result, register lifetimes may be lengthened, and the amount
of spill code and unnecessary dependencies may increase. (c) The proposed schedule,
which places the instructions in the latest cycle while maintaining the optimal
instruction schedule length.
 
 
[Figure: schedules of BB < 38 > over cycles < 0 > to < 2 >, panels ( a ) and ( b )]
Figure-5.2: Result of pre-pass scheduling for BB 38 of the rawcaudio benchmark from
Trimaran. (a) The schedule of the Elcor scheduler. (b) The schedule of the proposed
scheduler. The machine is set up with 4 integer units, 2 float units, 2 memory units,
1 branch unit and 16 general purpose registers. Operations 140, 239 and 137 use integer
units, operation 141 uses the branch unit and operation 278 uses the null resource. Our
proposed scheduler schedules operation 137 at cycle 2, as it is not on the critical
path, while the Elcor scheduler schedules it at cycle 0.
 



















































( a ) ( b )
 
 
Figure-5.3: Result of post-pass scheduling for BB 38 of the rawcaudio benchmark from
Trimaran. The dotted boxes represent the spill code inserted by the impact register
allocator. (a) The schedule of the Elcor scheduler. (b) The schedule of the proposed
scheduler. Our proposed scheduler significantly reduces total dynamic cycles and spill
code insertion because its pre-pass schedule has fewer simultaneously live ranges, which
removes unnecessary data dependencies and spill code.
 
The rest of this chapter is organized as follows. In the next section, the instruction
schedulers available in Elcor are discussed. An overview of the proposed scheduler is
given in Section 5.3. The performance evaluation of the proposed scheduler is presented
in Section 5.4. The summary is given in Section 5.5.
5.2. Different types of Elcor schedulers  
Trimaran [tri03] is a compiler infrastructure for supporting state of the art
research in compiling for Instruction Level Parallel (ILP) architectures. The two main
components of Trimaran are IMPACT (a compiler frontend) and Elcor (a compiler backend).
Instruction scheduling is done in the Elcor backend, which provides four schedulers:
list scheduler, cycle scheduler, list scheduling with backtracking scheduler and
operation scheduler. The list scheduler considers an operation ready once all of its
predecessors are scheduled, while the cycle scheduler proceeds cycle by cycle, scheduling
all operations for cycle i before scheduling any operation for cycle i+1. These two
schedulers never unschedule a scheduled operation, and an operation can be scheduled only
after all of its predecessors have been scheduled. In contrast, the list scheduling with
backtracking scheduler can forcibly schedule an operation even if not all of its
predecessors are scheduled. This can force scheduled operations to be unscheduled and
rescheduled later (called backtracking). The operation scheduler places each operation
between its earliest completion time (etime) and latest completion time (ltime). If all
slots are full, some operations are unscheduled to open slots. To evaluate the proposed
scheduler, we focus on two of these four schedulers, the list scheduler and the list
scheduling with backtracking scheduler, both of which are already implemented in
Trimaran.
5.2.1. List scheduler  
The list scheduler [KPD98, P98] is a traditional scheduler for a basic block, a
straight line sequence of code with a single entry point and a single exit point. A data
precedence graph (DPG) is built to describe the data dependences in a basic block. The
nodes of the graph are the operations in the basic block, and the edges represent
dependences between operations. The list scheduler is limited by data dependences and
resource constraints. Data dependences arise when instructions share data and memory
locations. There are three kinds of data dependence: true dependence, anti-dependence
and output dependence. The resource constraint is a finite set of functional units. The
first step in list scheduling is to compute, for each operation, its earliest completion
time (the longest path from a root node to the current node), its latest completion time
(the critical path length minus the longest path from the current node to a leaf node)
in the data dependence graph, and its priority. The scheduler maintains a ready-list, a
list of operations that are ready to execute; an operation enters the ready-list only
when all of its predecessors have been scheduled. Among the ready operations, the one
with the highest static priority is scheduled first, in the earliest possible cycle, and
the ready-list is then updated for the next cycle. Although the list scheduler does not
always find the optimal schedule, it is fast. The list scheduling algorithm is given in
Figure-5.4.
 
cycle = 0
ready-list = root nodes in Data Precedence Graph ( DPG )
inflight-list = empty list
while (ready-list is not empty or inflight-list is not empty)
    for op = (all nodes in ready-list in descending priority order)
        if (an issue slot is available and a Functional Unit exists for op to start at cycle)
            remove op from ready-list and add to inflight-list
            add op to schedule at time cycle
        endif
    endfor
    cycle = cycle + 1
    for op = (all nodes in inflight-list)
        if (op finishes at time cycle)
            remove op from inflight-list
            check nodes waiting for op in DPG and add to ready-list if all operands available
        endif
    endfor
endwhile
 
Figure-5.4: List scheduling algorithm 
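
As an illustration of Figure-5.4, the following is a minimal Python sketch of forward
list scheduling; it is not Elcor's implementation. It assumes a single pool of `units`
identical issue slots per cycle, uses the longest path to a leaf as the static priority,
and treats a dependence latency as the minimum distance between the start cycles of two
operations. The function names are hypothetical. Only the latency 1 edge from op 140 to
op 239 is taken from Figure-5.1; the other two edges are assumptions chosen so that op
140, op 239, op 141 form the critical path and op 137 lies off it.

```python
def list_schedule(ops, edges, units):
    """Forward list scheduling of one basic block (illustrative sketch).

    ops   : list of operation ids
    edges : dict mapping (pred, succ) -> latency in cycles
    units : issue slots per cycle (assumption: one pool of identical units)
    Returns a dict mapping each operation to its start cycle.
    """
    assert units >= 1
    succs = {op: [] for op in ops}
    preds = {op: [] for op in ops}
    for (p, s), lat in edges.items():
        succs[p].append((s, lat))
        preds[s].append((p, lat))

    # Static priority: longest path from the operation to a leaf.
    height = {}
    def longest_to_leaf(op):
        if op not in height:
            height[op] = max((lat + longest_to_leaf(s) for s, lat in succs[op]),
                             default=0)
        return height[op]

    start, cycle = {}, 0
    while len(start) < len(ops):
        # Ready: every predecessor is scheduled and its latency has elapsed.
        ready = [op for op in ops if op not in start and
                 all(p in start and start[p] + lat <= cycle
                     for p, lat in preds[op])]
        ready.sort(key=lambda op: -longest_to_leaf(op))
        for op in ready[:units]:          # earliest possible cycle
            start[op] = cycle
        cycle += 1
    return start


ops = [140, 239, 141, 137]
# (140, 239) is from Figure-5.1; the other edges are hypothetical.
edges = {(140, 239): 1, (239, 141): 1, (137, 141): 0}
early = list_schedule(ops, edges, units=4)
print(early)
```

Under these assumptions the sketch places op 137 at cycle 0, the same cycle that the
Elcor schedule in Figure-5.2 assigns to it.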
 
5.2.2. List scheduling with backtracking scheduler
There are two backtracking schedulers [AMB00]: the operBT scheduler and the listBT
scheduler. The listBT scheduler can generate more efficient code than the operBT
scheduler because it can either do limited backtracking, only to support scheduling of
branch operations with branch delay slots, or do unlimited backtracking, whereas the
operBT scheduler sometimes unschedules operations unnecessarily. Like the list scheduler,
the listBT scheduler starts by computing the earliest completion time, latest completion
time and priority of each operation. It generally selects the next operation to schedule,
by priority, from the ready operations, that is, the operations whose predecessors are
all scheduled. However, the listBT scheduler does not always schedule operations in
dependence order, so both predecessors and successors of an operation may already be
scheduled when it is considered. The highest priority operation is removed from the ready
list and becomes the current operation, which is scheduled if resources are available in
some cycle between its earliest completion time and latest completion time. If resources
are not available in a cycle, the listBT scheduler may unschedule the conflicting
operations that have lower priority and forcibly schedule the current operation in that
cycle. The ready list is then updated. If the operation cannot be scheduled anywhere in
the range between its earliest completion time and latest completion time, the listBT
scheduler forcibly schedules it at the force cycle and sets its attempted cycle to the
force cycle. The algorithm of the listBT scheduler is given in Figure-5.5.
 
Initialize EarlyCycle, LateCycle and compute priorities of operation
ReadyList = Start operation
while (CurrentOperation = ReadyList.pop())
    Compute EarlyCycle and LateCycle for CurrentOperation
    ForceCycle = max(AttemptedCycle+1, EarlyCycle)
    success = FALSE
    for (CurrentCycle ranging from EarlyCycle through LateCycle)
        if (resources required by CurrentOperation available)
            Schedule CurrentOperation in CurrentCycle
            Update ReadyList with ready successors of CurrentOperation
            success = TRUE
            break
        elseif ((unscheduling enabled for CurrentOperation) AND
                (CurrentCycle >= ForceCycle) AND
                (HasHigherPriority (CurrentOperation, CurrentCycle)))
            Unschedule conflicting operations and update ReadyList
            Enable unscheduling for conflicting operations
            Schedule CurrentOperation in CurrentCycle and update ReadyList
            success = TRUE
            break
        endif
    endfor
    if (success = FALSE)
        Unschedule conflicting operations at ForceCycle and update ReadyList
        Enable unscheduling for conflicting operations
        Schedule CurrentOperation in ForceCycle and update ReadyList
        Set AttemptedCycle to ForceCycle for CurrentOperation
    endif
endwhile
 
Figure-5.5: ListBT scheduling algorithm 
5.3. Our proposed scheduler 
 In this section, the proposed scheduler for the local instruction scheduling problem,
based on part of convergent scheduling [DP02, WDSS02], is presented. As in the
convergent scheduler, the proposed scheduler uses a weight matrix to compute the schedule
time. However, the convergent scheduler schedules the instructions at the earliest cycle,
whereas our proposed scheduler schedules the instructions as late as possible while
maintaining the optimal instruction schedule length, which we assume to be the critical
path length of a basic block. This works best for long and narrow graphs, which have few
critical paths, so that every instruction can be scheduled within the critical path
length. If an instruction cannot be scheduled within the critical path length because of
insufficient functional units, we increase the critical path length by one, repeatedly,
until every instruction can be scheduled. At the end of our heuristic, every instruction
is scheduled in the schedule slot with the highest weight.
 A convergent scheduler [DP02] is composed of independent phases. Each phase
implements a heuristic that addresses a particular problem such as ILP or register
pressure. In contrast, the proposed scheduler handles both the ILP and the register
pressure problems at the same time, which is more efficient because it does not need
separate phases. Once the schedule has been chosen for ILP, the proposed scheduler
automatically reduces register pressure by minimizing the number of simultaneously live
ranges.
5.3.1. Common Data Structure 
 To implement our proposed scheduler, we use a common data structure based on that of
the convergent scheduler: a weight matrix W, from which the schedule is derived. If i is
an instruction and t is a cycle time, every weight lies between 0 and 1.
 $\forall i, t : 0 \le W_{i,t} \le 1$        ( 1 )
The heuristic is based on the earliest completion time (the longest path from a root
node to the current node) and the latest completion time (the critical path length minus
the longest path from the current node to a leaf node) of the dependence graph, as
depicted in Figure-5.6(a). An instruction in the middle of the dependence graph can be
scheduled only after its predecessors and before its successors. If $l_e$ is the earliest
completion time and $l_l$ is the latest completion time, the instruction can be scheduled
only in the time slots between $l_e$ and $l_l$. If the instruction cannot be scheduled
between $l_e$ and $l_l$ due to insufficient functional units, $l_l$ is increased by one
and the scheduling process is restarted. The weight matrix is first initialized by
assigning zero to the weights of each operation outside the range from $l_e$ to $l_l$,
and the average weight $1/n$, where $n$ is the total number of operations, to the weights
inside the range.
 for each i and each t with $t < l_e \lor t > l_l$: $W_{i,t} \leftarrow 0$     ( 2 )
 for each i and each t with $l_e \le t \le l_l$: $W_{i,t} \leftarrow 1 / n$      ( 3 )
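
The window computation and the initialization in (2) and (3) can be sketched in Python
as follows. This is a minimal illustration, not the thesis implementation: the function
names are hypothetical, and the edge list is an assumption (only the latency 1 edge from
op 140 to op 239 is stated in Figure-5.1) chosen so that the resulting windows match the
weight pattern shown for BB 38 in Figure-5.6.

```python
def windows(ops, edges):
    """Return {op: (le, ll)} using longest paths in the dependence graph.

    le = longest path from a root to op
    ll = critical path length - longest path from op to a leaf
    """
    succs = {op: [] for op in ops}
    preds = {op: [] for op in ops}
    for (p, s), lat in edges.items():
        succs[p].append((s, lat))
        preds[s].append((p, lat))

    depth, height = {}, {}

    def down(op):  # longest path from any root to op
        if op not in depth:
            depth[op] = max((down(p) + lat for p, lat in preds[op]), default=0)
        return depth[op]

    def up(op):  # longest path from op to any leaf
        if op not in height:
            height[op] = max((up(s) + lat for s, lat in succs[op]), default=0)
        return height[op]

    cpl = max(down(op) for op in ops)  # critical path length
    return {op: (down(op), cpl - up(op)) for op in ops}


def init_weights(win, n_cycles):
    """Equations (2) and (3): 1/n inside [le, ll], 0 outside."""
    n = len(win)
    return {op: [1.0 / n if le <= t <= ll else 0.0 for t in range(n_cycles)]
            for op, (le, ll) in win.items()}


ops = [140, 239, 141, 137]
# (140, 239) is from Figure-5.1; the other edges are hypothetical.
edges = {(140, 239): 1, (239, 141): 1, (137, 141): 0}
win = windows(ops, edges)
W = init_weights(win, n_cycles=3)
for op in ops:
    print(op, win[op], W[op])
```

With these assumed edges, op 137 is the only operation whose window spans all three
cycles, which is what leaves it free to move to the latest cycle.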
We give more weight to a specific instruction being scheduled in a given cycle by
multiplying the corresponding weight by a constant. Then we normalize the weights to
restore the invariants.
 for each i, t: $W_{i,t} \leftarrow W_{i,t} / \sum_{t'} W_{i,t'}$      ( 4 )
After normalization, for every instruction i the weights $W_{i,t}$ over all cycle times
t between $l_e$ and $l_l$ sum to one.
 $\forall i : \sum_{t} W_{i,t} = 1$        ( 5 )
Then the schedule time of each operation is selected by choosing the cycle time with the
maximum weight.






( b )
          cycle 0   cycle 1   cycle 2
op 140    0.25      0         0
op 239    0         0.25      0
op 141    0         0         0.25
op 137    0.25      0.25      0.25

( c )
          cycle 0   cycle 1   cycle 2
op 140    0.3       0         0
op 239    0         0.3       0
op 141    0         0         0.3
op 137    0.25      0.25      0.3

( d )
          cycle 0   cycle 1   cycle 2   Total
op 140    0.3       0         0         0.3
op 239    0         0.3       0         0.3
op 141    0         0         0.3       0.3
op 137    0.25      0.25      0.3       0.8

( e )
          cycle 0   cycle 1   cycle 2   Total   Schedule
op 140    1         0         0         1       cycle 0
op 239    0         1         0         1       cycle 1
op 141    0         0         1         1       cycle 2
















Figure-5.6: Example weight matrix calculation. (a) Data dependence graph of BB 38 of the
rawcaudio benchmark from Trimaran, annotated with the earliest and latest completion
time, (le, ll), of each operation. (b) Initialization of the weight matrix to one divided
by the total number of operations; op 278 uses the null resource, so only 4 operations
need to be considered: op 140, op 239, op 141 and op 137. (c) More weight is given to a
specific cycle for each functional unit, starting from the latest completion time: 0.25
multiplied by 1.2. (d) The total weight over all cycles is calculated for each
operation. (e) Normalization of the weight matrix: each weight is divided by the total
weight of its operation. After normalization, the total weight over all cycles for each
operation is 1. The cycle time with the maximum weight for each operation is selected as
its schedule time.
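
The steps of Figure-5.6 from (b) to (e) can be replayed with a short Python sketch. The
initial weights and windows are copied from panel (b); the function names are
hypothetical, and resource limits are ignored here, which is an assumption that holds
for this example because the machine has 4 integer units.

```python
BOOST = 1.2  # constant used in Figure-5.6 (c)

def boost_latest(weights, windows):
    """Multiply each operation's weight at its latest completion time."""
    out = {op: row[:] for op, row in weights.items()}
    for op, (_, ll) in windows.items():
        out[op][ll] *= BOOST
    return out

def normalize(weights):
    """Equation (4): divide each weight by the row total.

    Every row has at least one non-zero weight, so the total is never zero.
    """
    return {op: [w / sum(row) for w in row] for op, row in weights.items()}

def pick_cycles(weights):
    """Choose, for each operation, the cycle with the maximum weight."""
    return {op: max(range(len(row)), key=row.__getitem__)
            for op, row in weights.items()}

# Panel (b) of Figure-5.6: windows (le, ll) and initial weights.
windows = {140: (0, 0), 239: (1, 1), 141: (2, 2), 137: (0, 2)}
weights = {140: [0.25, 0.0, 0.0],
           239: [0.0, 0.25, 0.0],
           141: [0.0, 0.0, 0.25],
           137: [0.25, 0.25, 0.25]}

normalized = normalize(boost_latest(weights, windows))
schedule = pick_cycles(normalized)
print(schedule)
```

Only op 137 has a choice of cycles; boosting its weight at cycle 2 makes that cycle the
maximum after normalization, so it is placed at cycle 2 as in Figure-5.2 (b).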
 
5.3.2. Heuristic 
 The pseudo code of the proposed scheduler's heuristic is roughly described in 
Figure-5.7. 
 
Compute earliest completion time and latest completion time
Initialize the weight matrix
while scheduling is not finished
    while the latest completion time is greater than zero
        for (each operation within a basic block)
            if (resource is available and the operation's weight is not zero)
                Multiply the weight by 1.2 and put it back into the weight matrix
                Mark the operation as scheduled
                Increase the number of current resources by one
                if (the current resources reach the maximum resource limit)
                    Decrease latest completion time by one
                    Reset the number of current resources to zero
                endif
            endif
        endfor
    endwhile
    for (each operation within a basic block)
        if (the operation is not scheduled)
            Mark the scheduling as not finished
        endif
    endfor
endwhile
Normalize the weight matrix by dividing each weight by the total weight of its operation
Choose the cycle time which has the maximum weight for each operation as its schedule time
Figure-5.7: Our proposed scheduling algorithm 
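
The effect that the heuristic aims for can be illustrated with a simplified Python
sketch. It is not the weight-matrix code of Figure-5.7: it is a backward
(as-late-as-possible) placement that respects a per-cycle unit limit and, as described
in Section 5.3, lengthens the schedule by one cycle and restarts when some operation
does not fit. The single pool of `units`, the function name and all edges except (op
140, op 239) are assumptions made for the example.

```python
def schedule_late(ops, edges, units):
    """Place each operation as late as possible (illustrative sketch).

    ops   : list of operation ids
    edges : dict mapping (pred, succ) -> latency in cycles
    units : operations that may start per cycle (assumed single pool)
    """
    assert units >= 1
    succs = {op: [] for op in ops}
    preds = {op: [] for op in ops}
    for (p, s), lat in edges.items():
        succs[p].append((s, lat))
        preds[s].append((p, lat))

    depth = {}
    def earliest(op):  # longest path from a root to op
        if op not in depth:
            depth[op] = max((earliest(p) + lat for p, lat in preds[op]),
                            default=0)
        return depth[op]

    # Order in which every operation comes after all of its successors.
    order, seen = [], set()
    def visit(op):
        if op not in seen:
            seen.add(op)
            for s, _ in succs[op]:
                visit(s)
            order.append(op)
    for op in ops:
        visit(op)

    length = max(earliest(op) for op in ops)  # critical path length
    while True:
        used = [0] * (length + 1)
        start = {}
        for op in order:
            # Latest cycle allowed by the already placed successors.
            t = min((start[s] - lat for s, lat in succs[op]), default=length)
            while t >= earliest(op) and used[t] >= units:
                t -= 1
            if t < earliest(op):
                break                  # does not fit: lengthen and restart
            start[op] = t
            used[t] += 1
        else:
            return start
        length += 1


ops = [140, 239, 141, 137]
# (140, 239) is from Figure-5.1; the other edges are hypothetical.
edges = {(140, 239): 1, (239, 141): 1, (137, 141): 0}
late = schedule_late(ops, edges, units=4)
tight = schedule_late(ops, edges, units=1)
print(late, tight)
```

With 4 units the sketch keeps the critical path length of 2 and moves op 137 to cycle 2.
With a single unit the operations no longer fit, so the length grows by one cycle before
a placement is found.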
5.4. Experimental evaluation 
 This section presents an empirical comparison of the proposed scheduler with the
Trimaran schedulers, list scheduler and list BT scheduler.
5.4.1. Methodology  
 Trimaran's simulator is used to compare the performance of the proposed scheduler
with that of both the list scheduler and the list BT scheduler. The framework used for
machine-dependent ILP optimizations in Trimaran provides advanced capabilities and
support for experimenting with innovative, forward-looking ILP architectures and the
compiler modules needed to generate high-performance code for these architectures.
Trimaran has a cycle level simulator which generates various statistics such as compute
cycles, total number of operations and register allocation overhead. The simulator
converts the Rebel code into executable code and emulates its execution on a virtual
HPL-PD processor. The proposed scheduler is implemented and inserted into the pre-pass
scheduling stage of the Trimaran infrastructure. The machine is set up with 4 integer
units, 2 float units, 2 memory units, 1 branch unit and 16 general purpose registers.
For each simulation, we first compile with our scheduler, then apply either the region
based register allocator or the impact register allocator, and then the Elcor scheduler.
A number of integer benchmarks are used in the experiments. In all the experiments, only
one pass is used to obtain the instruction schedule.
5.4.2. Results and discussion 
 Table-5.2 and Table-5.3 compare the proposed scheduler with both the list scheduler
and the list BT scheduler. The results suggest that the proposed scheduler can
significantly reduce the total dynamic cycles and the dynamic register allocation
overhead of benchmarks whose basic blocks have long and narrow data dependence graphs.
The reason is that the list scheduler and the list BT scheduler schedule each
instruction at the earliest cycle, whereas the proposed scheduler schedules each
instruction as late as possible while maintaining the optimal instruction schedule
length. This decreases the number of simultaneously live ranges, so that spill code
insertion and unnecessary dependencies are avoided. In addition, inserting less spill
code shortens the schedule length after register allocation.
Table-5.1 shows the execution time comparison of the proposed scheduler with both
the list scheduler and the list BT scheduler. The results suggest that the proposed
scheduler can reduce compile time. The list scheduler is very efficient in compile time
because it never unschedules already scheduled operations. In contrast, the list BT
scheduler sometimes unschedules certain types of operations when necessary. Thus, the
list BT scheduler can generate better scheduled code, although it needs more compile
time. The results suggest that our proposed scheduler is efficient in both scheduled
code and compile time. Our proposed scheduler may also have to reschedule when there are
not enough functional units to schedule all the operations within the critical path
length. However, this is a rare case because modern ILP machines typically provide
plenty of functional units.










Benchmarks       Clock Ticks ( in millions )                           Reduction vs.
                 List scheduler   ListBT scheduler   Our scheduler     list scheduler
bmm              1896.58          3874.21            262.32            86%
dag              143.61           243.04             29.43             80%
eight            148.37           150.63             17.95             88%
example_bench    44.67            122.47             21.68             51%
fact2            41.89            149.22             12.22             71%
fib              62.87            191.86             23.26             63%
fib_mem          105.15           113.35             16.33             84%
fir              672.99           2802.66            311.69            54%
hyper            23.06            46.11              11.29             51%
ifthen           74.31            187.83             41.2              45%
mismatch_test    30.6             72.77              16.38             46%
mm_double        264.56           1149.48            77.49             71%
mm_dyn           1743.88          1990.13            186.05            89%
mm_int           353.95           1153.28            97.3              73%
mm               210.03           1193.52            142.13            32%
nested           78.72            236.26             30.53             61%
wc               611.78           2104.35            513.83            16%
Table-5.1: Execution time comparison of our proposed scheduler with List scheduler 
and ListBT scheduler. 
 
Benchmarks         List scheduler         List BT scheduler      Our scheduler
(procedure)        Dyn Cyc    Reg alloc   Dyn Cyc    Reg alloc   Dyn Cyc    Reg alloc
wc(cnt)            1149169    964321      1145112    791827      1145109    422171
fact2              56         18          51         18          48         14
eight              4223       16          4015       16          4013       16
ifthen             4088       834         3925       834         3924       834
hyper              3294       8           3131       8           3130       8
fib                54         16          50         16          49         16
dag                3628       20          3626       20          3568       20
mm                 60555      4830        59681      4830        58879      3228
nested             1353       428         1389       428         459        16
example_bench      8499479    7753241     7599414    5954269     7599344    5103324
fir(main)          102837     56178       96429      49821       96429      49814
mm_double(main)    4343       4129        4295       3769        4294       3754
 
Table-5.2: Total dynamic cycles and register allocation overhead comparison on 


















fact2              56         18          51         18          48         14
Eight              4223       16          4015       16          4013       16
fib_mem            4088       834         3925       834         3924       834
Mm                 3294       8           3131       8           3130       8
Sqrt               54         16          50         16          49         16
test_install       3628       20          3626       20          3568       20
mm_double          60555      4830        59681      4830        58879      3228
fir(main)          102837     56178       96429      49821       96429      49814
Wave               4343       4129        4295       3769        4294       3754
Table-5.3: Total dynamic cycles and register allocation overhead comparison on 
different schedulers using impact register allocator in Trimaran. 
 
5.5. Summary
This chapter has reported the experimental evaluation of our proposed scheduler
against both the list scheduler and the list BT scheduler from Trimaran. The results
show that the proposed scheduler is very efficient and reasonably effective for a machine
which has enough functional units. Compared with the list scheduler and the list BT
scheduler, the proposed scheduler can significantly reduce the total dynamic cycles and
total register allocation overhead of benchmarks whose basic blocks have long and narrow
data dependence graphs. Moreover, the proposed scheduler is also efficient in compile
time: it is faster than the list scheduler, which is itself regarded as efficient in
compile time. As future work, to improve the effectiveness of our scheduler, we can look
for an efficient heuristic for fat and parallel data dependence graphs. Furthermore, we
can try to extend the approach to global instruction scheduling. A combination of
instruction scheduling and register allocation can also be considered.
  
Chapter 6 
Cooperative instruction scheduling with linear scan 
register allocation 
This chapter describes an experimental evaluation of cooperative instruction
scheduling with linear scan register allocation. The linear scan register allocator,
proposed by Massimiliano Poletto and Vivek Sarkar [PS99], is very simple and faster than
algorithms based on graph coloring. Previous experimental evaluations of and
improvements to linear scan register allocation can be found in [EK02, SS03]. The
quality of linear scan allocation is affected by the maximum number of active live
intervals. If this maximum can be reduced, the linear scan register allocator will
generate more efficient code because less spill code is inserted. In Chapter 5, we
proposed a pre-pass local instruction scheduler which reduces the number of
simultaneously live ranges, thereby decreasing the maximum number of active live
intervals. This chapter presents the performance gain from combining our proposed
scheduler with a linear scan register allocator. Previous studies [PS99, EK02, SS03]
investigated the linear scan register allocator by itself, and none considered the impact
of combining it with a cooperative pre-pass instruction scheduler. On the other hand,
previous work on the phase ordering problem [GH88, BEH91, Pin93, NP93, NP95a, CCK97,
BGS94, BGS98, SC99] focused on combining the instruction scheduling phase with graph
coloring register allocators. Therefore, in this thesis, we focus on a cooperative
approach that addresses the phase ordering problem between instruction scheduling and
linear scan register allocation. The linear scan register allocator has been implemented
in Trimaran so that it can be compared with the two global register allocators already
implemented there: the impact register allocator and the region based register
allocator.
The results show that using the proposed pre-pass scheduler reduces the maximum
number of active live intervals, which decreases the amount of spill code inserted by
the linear scan register allocator. Moreover, they show that combining the proposed
pre-pass scheduler with a linear scan register allocator is significantly faster than
combining Trimaran's default scheduler with the impact or region-based register
allocator, which supports the hypothesis that combining the proposed pre-pass scheduler
with the linear scan register allocator can result in better performance.
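
To make the notion of the maximum number of active live intervals concrete, the
following Python sketch counts it with a sweep over interval end points. The function
name and both interval sets are hypothetical; intervals are assumed to be inclusive
pairs (first cycle, last cycle). The second set differs from the first only in that one
definition is delayed from cycle 0 to cycle 2, as a late schedule would do.

```python
def max_live(intervals):
    """Return the largest number of intervals live at the same time.

    intervals: iterable of (start, end) pairs, both ends inclusive.
    """
    events = []
    for start, end in intervals:
        events.append((start, +1))     # interval becomes live
        events.append((end + 1, -1))   # interval is dead after `end`
    live = peak = 0
    # At equal times the -1 events sort first, so a register freed at
    # cycle t can be reused by an interval starting at cycle t.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical live intervals of three values in one basic block.
early_schedule = [(0, 1), (0, 3), (1, 2)]   # second value defined at cycle 0
late_schedule = [(0, 1), (2, 3), (1, 2)]    # same value defined at cycle 2

print(max_live(early_schedule), max_live(late_schedule))
```

In this made-up example, delaying one definition lowers the peak from three live values
to two, so one fewer register is needed at the busiest point.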
6.1. Global register allocators in Trimaran 
The first global register allocator based on graph coloring was built by Chaitin et
al. [Cha81] at IBM in Yorktown Heights. Graph coloring has since been used as the basis
of register allocation in modern compilers. A graph-coloring register allocator
iteratively builds a register interference graph, an undirected graph that summarizes
the liveness information relevant to the register allocation problem. A node in an
interference graph is a variable or temporary that is a candidate for register
allocation, and an edge connects two nodes whose corresponding variables interfere. The
standard graph coloring method heuristically attempts to find a k-coloring of the
interference graph, where a graph is k-colorable if each node can be assigned one of k
colors such that no two adjacent nodes have the same color. If the heuristic finds a
k-coloring, register assignment is complete. Otherwise, some register candidates are
chosen to be spilled, spill code is inserted, the interference graph is rebuilt, and a
k-coloring is attempted again. This whole process repeats until a k-coloring is finally
obtained. In practice, the graph coloring approach can be expensive because the register
interference graph is constructed repeatedly until the heuristic succeeds. Nevertheless,
graph coloring register allocators have been used in many commercial compilers, where
they obtain significant improvements over simple register allocation heuristics.
Trimaran has two global register allocators adapted from the graph coloring framework:
the impact register allocator and the region-based register allocator.
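
The coloring heuristic described above can be sketched in Python as a single
simplify-and-select pass. This is a simplified illustration, not the Trimaran or Chaitin
implementation: it reports spill candidates instead of inserting spill code and
rebuilding the graph, and the spill choice (highest remaining degree) and the example
graph are assumptions.

```python
def color_graph(nodes, edges, k):
    """Try to k-color an interference graph (illustrative sketch).

    nodes : list of register candidates
    edges : list of (a, b) pairs that interfere
    k     : number of available registers (colors)
    Returns (coloring, spilled).
    """
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    work = {n: set(neigh) for n, neigh in adj.items()}
    stack, spilled = [], []
    while work:
        # Simplify: a node with fewer than k neighbours can always be colored.
        node = next((n for n in work if len(work[n]) < k), None)
        if node is None:
            # No such node: pick a spill candidate (assumed: highest degree).
            node = max(work, key=lambda n: len(work[n]))
            spilled.append(node)
        else:
            stack.append(node)
        for m in work.pop(node):
            work[m].discard(node)

    # Select: color nodes in reverse order of removal.
    coloring = {}
    while stack:
        n = stack.pop()
        taken = {coloring[m] for m in adj[n] if m in coloring}
        coloring[n] = next(c for c in range(k) if c not in taken)
    return coloring, spilled


# Hypothetical graph: a, b, c interfere pairwise, d interferes with c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
coloring, spilled = color_graph(nodes, edges, k=2)
print(coloring, spilled)
```

With two registers the triangle a, b, c cannot be colored, so one of its nodes is
reported as a spill candidate; a full allocator would insert spill code for it and
repeat the process on the rebuilt graph.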
6.1.1. Impact register allocator 
The impact register allocator [REH90] was implemented by Richard Eugene Hank. It
constructs live ranges by performing live variable analysis and reaching definition
analysis, as in Chow’s algorithm, but represents each live range by a set of
instructions, as in Chaitin’s allocator. To reduce the amount of spill code, the impact
register allocator uses loop level live range splitting, which is very similar to the
method proposed in P. Briggs’ thesis [Bri92]. The graph coloring based impact register
allocator strives to minimize the number of spilled nodes and the register allocation
overhead.
6.1.2. Region based register allocator 
The region based register allocator [HK01], implemented by Hansoo Kim, is based on
the notion of the priority based global register allocator originally introduced by Chow
and Hennessy. Basic blocks are used as the unit of coloring, but the construction of the
interference graph is based on Chaitin's definition of liveness, where two live ranges
interfere if one of them is live at the definition point of the other. The region based
register allocator introduces new techniques in the various phases of region-based
register allocation which improve compilation time while keeping execution time
comparable to that of other global register allocators. It thereby addresses the
trade-off between compilation time and execution performance in region based compilation
within the context of register allocation, a key optimization.
6.1.3. Linear scan register allocator 
Register allocation is an important part of a compiler and affects the performance of
the code generated by modern optimizing compilers. Unfortunately, register allocation in
current optimizing compilers is computationally expensive because it uses the graph
coloring framework. Massimiliano Poletto and Vivek Sarkar [PS99] proposed a fast
register allocation algorithm, linear scan register allocation, which is very useful when
both compile time and the run time performance of the generated code are important. The
linear scan register allocator computes live ranges over a topological ordering of the
instructions rather than using the expensive graph coloring framework. A live interval
is calculated for each temporary variable. Temporaries with overlapping live intervals
must be assigned to different registers, while temporaries with non-overlapping
intervals can be assigned to the same register. Erik Johansson and Konstantinos Sagonas
[EK02] suggest that the linear scan register allocator can be broken down into the
following four steps: ( 1 ) sort all the instructions in topological order; ( 2 )
calculate the set of live intervals; ( 3 ) assign each temporary variable to a physical
register for each interval ( or spill it to memory ); and finally ( 4 ) rewrite the code
with the obtained allocation.
(1) Sort all the instructions in topological order  
The instructions can be ordered in different ways depending on the ordering method used,
which includes (1) depth-first ordering, (2) preorder, (3) postorder, (4) breadth-first
ordering, (5) prediction and (6) random. Experimental results and a good discussion of
these orderings can be found in [EK02]. Among the different orderings, depth-first
ordering is the best at reducing false interference between live intervals [PS99, EK02].
However, there has been no discussion of reordering instructions within a basic block. We
note that reordering instructions within a basic block [KW04b] might affect the allocation
and the amount of spill code inserted. Chapter 5 discusses the proposed heuristic for
instruction reordering within a basic block. Figure-6.1 and Figure-6.2 show the original
ordering and the ordering of instructions after applying the proposed heuristic
respectively, and Figure-6.3 gives the corresponding earliest completion time (etime) and
latest completion time (ltime).







 
       ID = 2
       weight(1)
       attr(lc ^36)
10.  op 43 (C_MERGE [ ] [ ] s_time (0))
11.  op 36 (ADD_W [r < 2 : 6 0 > ] [m<9>  6<4> ] p<t> s_time (0) )
12.  op 37 (PBRR [r<3 : 17 3> ] [1:21<_$fn_atoi> 6<1>] p<t> s_time (0) )
13.  op 38 (PBRR [r<4 : 17 3>] [1:21 <_$fn_printf> 6<1>] p<t> s_time (0))
14.  op 39 (PBRR [r <5: 17 3>] [1:21<_$fn_exit> 6<1>] p<t> s_time (0) )
15.  op 12 (L_W_C1_C1 [m <8>] [r<2:6 0> ] p<t> s_time (1) )
16.  op 14 (BRL [m <35>] [r<3:17 3>] p<t> s_time (3) )
17.  op 15 (MOVE [m <9>] [m<7> ] p<t> s_time (4) )
18.  op 25 (MOVE [m <8>] [fib %d = %d ] p<t> s_time (4) )
19.  op 27 (MOVE [m<10>]  [6<0> ] p<t> s_time (4) )
20.  op 28 (BRL [m<35> ] [r<4:17 3>] p<t> s_time (5) )
21.  op 30 (MOVE [m<8>] [6<0> ] p<t> s_time (6))
22.  op 31 (BRL [m <35>] [r<5:17 3> ] p<t> s_time (7) )
23.  op 44 (DUMMY_BR [ ] [ ] s_time (8) )
      ID = 1
      weight(1)
      attr(lc ^26)
1.   op 41 (C_MERGE [ ] [ ] s_time (0))
2.   op 1 (DEFINE [m <8>] [u<> u<> ] s_time (0) )
3.   op 2 (DEFINE [m <9>] [u<> u<> ] s_time (0) )
4.   op 3 (DEFINE [m <4>] [u<> u<> ] s_time (0) )
5.   op 4 (DEFINE [m <1>] [6<0> u<> ] s_time (0) )
6.   op 5 (DEFINE [m <2>] [6<16> u<> ] s_time (0) )
7.   op 34 (DEFINE [m <35>] [u<> u<> ] s_time (0) )
8.   op 35 (MOVE [r <1:17 3>] [m<35>] p<t> s_time (0) )
9.   op 42 (DUMMY_BR [ ] [ ] s_time (1) )
       ID = 3
       weight(1)
       attr(lc ^51)
24.  op 45 (C_MERGE [ ] [ ] s_time (0))
25.  op 40 (PBRA [m <35>] [r<1:17 3>  6<1> ] p<t> s_time (0) )

Figure-6.1: Control Flow Graph (CFG) with long instructions in the original order, before
instruction reordering within each basic block















       ID = 2
       weight(1)
       attr(lc ^36)
10.  op 43 (C_MERGE [ ] [ ] s_time (0))
11.  op 36 (ADD_W [r < 2 : 6 0 > ] [m<9>  6<4> ] p<t> s_time (0) )
12.  op 12 (L_W_C1_C1 [m <8>] [r<2:6 0> ] p<t> s_time (1) )
13.  op 37 (PBRR [r<3 : 17 3> ] [1:21<_$fn_atoi> 6<1>] p<t> s_time (2) )
14.  op 14 (BRL [m <35>] [r<3:17 3>] p<t> s_time (3) )
15.  op 38 (PBRR [r<4:17 3>] [1:21 <_$fn_printf> 6<1> ] p<t> s_time (4))
16.  op 15 (MOVE [m <9>] [m<7> ] p<t> s_time (4) )
17.  op 25 (MOVE [m <8>] [fib %d = %d ] p<t> s_time (4) )
18.  op 27 (MOVE [m<10>]  [6<0> ] p<t> s_time (4) )
19.  op 28 (BRL [m<35> ] [r<4:17 3>] p<t> s_time (5) )
20.  op 39 (PBRR [r <5:17 3>] [1:21<_$fn_exit> 6<1>] p<t> s_time (6) )
21.  op 30 (MOVE [m<8>] [6<0> ] p<t> s_time (6))
22.  op 31 (BRL [m <35>] [r<5:17 3> ] p<t> s_time (7) )
23.  op 44 (DUMMY_BR [ ] [ ] s_time (8) )
      ID = 1
      weight(1)
      attr(lc ^26)
1.   op 41 (C_MERGE [ ] [ ] s_time (0))
2.   op 1 (DEFINE [m <8>] [u<> u<> ] s_time (0) )
3.   op 2 (DEFINE [m <9>] [u<> u<> ] s_time (0) )
4.   op 3 (DEFINE [m <4>] [u<> u<> ] s_time (0) )
5.   op 4 (DEFINE [m <1>] [6<0> u<> ] s_time (0) )
6.   op 5 (DEFINE [m <2>] [6<16> u<> ] s_time (0) )
7.   op 34 (DEFINE [m <35>] [u<> u<> ] s_time (0) )
8.   op 35 (MOVE [r <1:17 3>] [m<35>] p<t> s_time (0) )
9.   op 42 (DUMMY_BR [ ] [ ] s_time (2) )
       ID = 3
       weight(1)
       attr(lc ^51)
24.  op 45 (C_MERGE [ ] [ ] s_time (0))
25.  op 40 (PBRA [m <35>] [r<1:17 3>  6<1> ] p<t> s_time (0) )







Figure-6.2: Control Flow Graph (CFG) with long instructions after instruction
reordering within each basic block
 
Basic Block 1
        etime  ltime
op 41   0      0
op 1    0      0
op 2    0      0
op 3    0      0
op 4    0      0
op 5    0      0
op 34   0      0
op 35   0      0
op 42   0      0

Basic Block 2
        etime  ltime
op 43   0      0
op 36   0      0
op 12   1      1
op 37   0      2
op 14   3      3
op 38   0      4
op 15   4      4
op 25   4      4
op 27   4      4
op 28   5      5
op 39   0      6
op 30   6      6
op 31   7      7
op 44   8      8

Basic Block 3
        etime  ltime
op 45   0      0
op 40   0      0
op 33   1      1
 
Figure-6.3: Earliest completion time (etime) and Latest completion time (ltime) for 
each basic block in Figure-6.1 
(2) Calculation of live interval 
Live ranges are determined from the set of instructions within each basic block. Each
live range has a start position at the first definition of the temporary and an end
position at the last use of the temporary. All live intervals are then sorted in order of
increasing start point so as to make the allocation more efficient. The live intervals,
with their start and end positions, for Figure-6.1 and Figure-6.2 are:
 












( a ) without instruction reordering      ( b ) with instruction reordering 
 
Figure-6.4: Live intervals for the data dependent graphs in Figure-6.1 and
Figure-6.2
 
As depicted in Figure-6.4, without instruction reordering within each basic block,
BTR1, BTR3, BTR4 and BTR5 are all live at the same time. With instruction reordering,
however, BTR1 is live at the same time as only one of BTR3, BTR4 or BTR5 at any given
point.
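
The calculation of live intervals described above can be sketched in C as follows. This is
a minimal illustration, not the Trimaran implementation: the Instr and Interval records, the
NO_TEMP marker and the function build_intervals are hypothetical, and each instruction is
assumed to define at most one temporary and to read at most two.

```c
#include <stdlib.h>

#define MAX_SRC 2
#define NO_TEMP (-1)

/* Hypothetical instruction record: temporaries are small integer ids. */
typedef struct {
    int dest;           /* temporary defined here, or NO_TEMP */
    int src[MAX_SRC];   /* temporaries read here, or NO_TEMP */
} Instr;

typedef struct {
    int temp;           /* temporary id */
    int s_pos;          /* position of the first definition */
    int e_pos;          /* position of the last use */
} Interval;

static int by_start(const void *a, const void *b)
{
    const Interval *x = a, *y = b;
    return (x->s_pos > y->s_pos) - (x->s_pos < y->s_pos);
}

/* Scans code[0..n-1] once, fills out[] (capacity num_temps) with one
   interval per defined temporary, sorts the intervals by increasing
   start point and returns how many were produced. */
int build_intervals(const Instr *code, int n, int num_temps, Interval *out)
{
    for (int t = 0; t < num_temps; t++) {
        out[t].temp = t;
        out[t].s_pos = -1;
        out[t].e_pos = -1;
    }
    for (int i = 0; i < n; i++) {
        /* Uses are read before the destination is written. */
        for (int k = 0; k < MAX_SRC; k++) {
            int s = code[i].src[k];
            if (s != NO_TEMP && out[s].s_pos >= 0)
                out[s].e_pos = i;
        }
        int d = code[i].dest;
        if (d != NO_TEMP && out[d].s_pos < 0) {
            out[d].s_pos = i;
            out[d].e_pos = i;
        }
    }
    int count = 0;
    for (int t = 0; t < num_temps; t++)
        if (out[t].s_pos >= 0)
            out[count++] = out[t];
    qsort(out, (size_t)count, sizeof out[0], by_start);
    return count;
}
```

With this sketch, moving a definition next to its only use shortens the interval of that
temporary, which is the effect that instruction reordering has on the BTR intervals.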
(3) Assigning temporary variables to registers 
After all intervals have been sorted in order of increasing start point, registers can
be allocated to the intervals. In Trimaran, there are four register types: general
purpose registers (GPRs), floating point registers (FPRs), branch target registers (BTRs)
and predicate registers (PRs). The physical register information is kept in the following
data structure.
struct Phy_Reg_Info
{
 int file_type; // register file type
 int vr_no;     // virtual register number
 int status;    // 0 = free, 1 = not free
};
 













struct Active_interval
{
 ...
 int status; // 0 = not active, 1 = active
};
 
Allocation is done by assigning temporaries to free registers of the corresponding
register type and setting the status of both Active_interval and Phy_Reg_Info to one
(i.e. marking the interval as currently active and the physical register as in use). For
each interval (s_pos, e_pos), in order of increasing start point, do:
• Set the status of Phy_Reg_Info and Active_interval to zero for every temporary whose
interval ends before the current interval starts (i.e. s_pos of the current interval >
e_pos of the active temporary).
• If there is a free register of the required register type, set the status of both
Phy_Reg_Info and Active_interval to one, i.e. the virtual register of the current live
interval is allocated to that physical register. Otherwise, spill the live interval which
ends furthest away from the current point.
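
The two steps above can be sketched in C as follows. The sketch is an illustration under
stated assumptions rather than the code implemented in Trimaran: it handles a single
register type, it scans the registers directly instead of keeping a sorted active list, and
the names Interval, linear_scan, MAX_REGS and SPILLED are hypothetical.

```c
#define MAX_REGS 64
#define SPILLED (-1)

/* Hypothetical interval record; reg is filled in by linear_scan(). */
typedef struct {
    int s_pos;   /* start position (first definition) */
    int e_pos;   /* end position (last use) */
    int reg;     /* physical register index, or SPILLED */
} Interval;

/* iv[] must be sorted by increasing s_pos.
   Returns the number of intervals that had to be spilled. */
int linear_scan(Interval *iv, int n, int num_regs)
{
    int owner[MAX_REGS];  /* owner[r]: index of the active interval in r, -1 if free */
    int spills = 0;

    if (num_regs > MAX_REGS)
        num_regs = MAX_REGS;
    for (int r = 0; r < num_regs; r++)
        owner[r] = -1;

    for (int i = 0; i < n; i++) {
        int free_reg = -1;  /* first free register, if any */
        int furthest = -1;  /* register whose interval ends last */

        for (int r = 0; r < num_regs; r++) {
            /* Expire intervals that end before the current one starts. */
            if (owner[r] >= 0 && iv[owner[r]].e_pos < iv[i].s_pos)
                owner[r] = -1;
            if (owner[r] < 0) {
                if (free_reg < 0)
                    free_reg = r;
            } else if (furthest < 0 ||
                       iv[owner[r]].e_pos > iv[owner[furthest]].e_pos) {
                furthest = r;
            }
        }

        if (free_reg >= 0) {
            iv[i].reg = free_reg;
            owner[free_reg] = i;
        } else if (furthest >= 0 && iv[owner[furthest]].e_pos > iv[i].e_pos) {
            /* An active interval ends furthest away: spill it and reuse its register. */
            iv[owner[furthest]].reg = SPILLED;
            iv[i].reg = furthest;
            owner[furthest] = i;
            spills++;
        } else {
            /* The current interval itself ends furthest away. */
            iv[i].reg = SPILLED;
            spills++;
        }
    }
    return spills;
}
```

For example, the intervals (0,10), (1,3), (2,5) and (4,6) need one spill with two registers
(the interval (0,10) is spilled when (2,5) starts) and none with three registers.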
(4) Rewrite the code with the obtained allocation 
After assigning temporary variables to registers, the code is rewritten to bind the 
temporary variables with the physical registers.   
6.2. Experimental evaluation 
This section describes the experimental evaluation of the linear scan register
allocator, combined with the proposed scheduler, in terms of speed, maximum active live
intervals, total dynamic cycles and register allocation overhead. The Trimaran
infrastructure is used to compare the performance of the linear scan register allocator
with the Impact register allocator and the region based register allocator. The linear
scan register allocator is implemented in Trimaran and inserted in place of the register
allocation phase. The proposed pre-pass scheduler is also implemented in Trimaran. A
diagram of the Trimaran infrastructure with the proposed pre-pass scheduler and linear
scan register allocator is given in Figure-6.5.
 A good register allocator should reduce the amount of spill code inserted into the
generated code. When there are not enough registers, the register allocator sometimes
has to spill certain values to memory. In this case, the number of memory accesses (loads
and stores) introduced during register allocation should be minimized. The proposed
scheduler can reduce the number of active live ranges that the linear scan allocator has
to deal with. As a result, less spill code will be inserted.
 
[Diagram: C program → K & R/ANSI-C Parsing → Renaming & Flattening → Control-Flow Profiling → ...]

































Figure-6.5: Position of the proposed pre-pass scheduler and linear scan register
allocator in the Trimaran infrastructure
 
6.2.1. Result and discussion 
  Table-6.1 and Table-6.2 show the total dynamic cycles and register allocation
overhead of the linear scan register allocator and the region based register allocator.
Table-6.3 and Table-6.4 present the same measurements for the linear scan register
allocator and the Impact register allocator. The results suggest that the linear scan
register allocator reduces both total dynamic cycles and register allocation overhead
relative to the region based and Impact register allocators, with a few exceptions (for
example, mm_double in Table-6.1 and fact2 in Table-6.3).
 The reason is that the liveness analysis of the linear scan register allocator differs
from that of the region based register allocator. The linear scan algorithm takes as
input a list of live intervals and tracks the active live intervals, i.e. the intervals
overlapping the current point; this set changes only at the start and end points of live
intervals. Thus, the active live intervals can be computed easily in a single pass over
the list of live intervals. In contrast, the region based register allocator constructs
an interference graph and tries to find a k-coloring of it. Because graph coloring is
NP-complete, practical allocators rely on heuristics, which are not guaranteed to find a
k-coloring for every k-colorable graph. Thus, unnecessary spill code may occur in the
region based and Impact register allocators, since both use the graph coloring framework.
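
The single-pass computation of the maximum number of active live intervals can be sketched
in C as follows. This is a sketch under assumptions: interval end points are treated as
inclusive, the start and end positions are passed as two separate arrays, and the function
name max_active is hypothetical.

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* starts[i] and ends[i] are the (inclusive) end points of interval i.
   Both arrays are sorted in place. Returns the maximum number of
   intervals that are live at the same time. */
int max_active(int *starts, int *ends, int n)
{
    int active = 0, best = 0, e = 0;

    qsort(starts, (size_t)n, sizeof starts[0], cmp_int);
    qsort(ends, (size_t)n, sizeof ends[0], cmp_int);

    for (int s = 0; s < n; s++) {
        /* Retire every interval that ended before this start point. */
        while (e < n && ends[e] < starts[s]) {
            active--;
            e++;
        }
        active++;
        if (active > best)
            best = active;
    }
    return best;
}
```

The count changes only when an interval starts or ends, so after sorting one sweep over the
end points is enough; no interference graph is built.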
 
benchmarks total dynamic cycle total register allocation overhead
( procedure ) Linear Scan Region based Linear Scan Region based
bmm ( mm_inner ) 52555 84582 no spill 57671
dag 3611 3628 no spill 20
eight 4209 4223 no spill 16
example_bench 4149307 8499479 no spill 5153261
fact2 45 56 no spill 18
fib 45 54 no spill 16
fib_mem 115 127 no spill 20
fir ( main ) 110555 126253 no spill 42524
hyper 3289 3294 no spill 8
ifthen 3411 4088 no spill 834
mm_double ( matmult ) 53812 52843 no spill 2033
mm_dyn 58486 61323 no spill 4070
mm_int 65710 71557 no spill 5360
mm 59332 60555 no spill 4830
nested 1362 1389 no spill 428
rawcaudio 7426154 7426154 no spill 337516
wave 14472 14487 no spill 22
wc ( cnt ) 1149125 1149169 no spill 964321
 
 
Table-6.1: Total dynamic cycle and total register allocation overhead comparison 
between linear scan register allocator and region-based register allocator for 16 
registers  
benchmarks total dynamic cycle total register allocation overhead
( procedure ) Linear Scan Region based Linear Scan Region based
bmm ( mm_inner ) 52555 52558 no spill 8033
dag 3611 3618 no spill 10
eight 4209 4217 no spill 10
example_bench 4149307 4149413 no spill 90177
fact2 45 58 no spill 20
fib 45 56 no spill 18
fib_mem 115 129 no spill 22
fir ( main ) 110555 126255 no spill 42526
hyper 3289 3296 no spill 10
ifthen 3411 3422 no spill 14
mm_double ( matmult ) 53812 53847 no spill 1636
mm_dyn 58486 59320 no spill 1640
mm_int 65710 66540 no spill 1636
mm 59332 59759 no spill 1634
nested 1362 1393 no spill 532
rawcaudio 7426154 7426322 no spill 282133
wave 14472 14485 no spill 20
wc ( cnt ) 1149125 1149173 no spill 964326
 
Table-6.2: Total dynamic cycle and total register allocation overhead comparison 
between linear scan register allocator and region-based register allocator for 32 
registers  
 
benchmarks  total dynamic cycle total register allocation overhead 
( procedure ) Linear Scan Impact Linear Scan  Impact 
bmm ( mm_inner ) 52555 177439 no spill 123668 
dag 3460 5592 no spill 2439 
eight 4076 5432 no spill 1363 
example_bench  4119280 7705361 no spill 5994098 
fact2 45 43 no spill 11 
fib 44 49 no spill 5 
fib_mem 115 239 no spill 167 
fir ( main ) 110553 232770 no spill 115693 
ifthen 3160 5629 no spill 2839 
mm_double ( matmult ) 53412 172272 no spill 120462 
mm_dyn 58486 196061 no spill 154347 
mm_int 4180416 15594527 no spill 12457476 
mm 59333 195353 no spill 144871 
nested 1362 2010 no spill 760 
wave 14473 36238 no spill 22591 
wc ( cnt ) 1105371 1186242 no spill 80884 
 
Table-6.3: Total dynamic cycle and total register allocation overhead comparison 
between linear scan register allocator and impact register allocator for 16 registers 
 
benchmarks  total dynamic cycle total register allocation overhead 
( procedure ) Linear Scan Impact Linear Scan  Impact 
bmm ( mm_inner ) 52555 52574 no spill 18 
dag 3460 3466 no spill 9 
eight 4076 4079 no spill 7 
example_bench  4119280 4119343 no spill 82 
fact2 45 51 no spill 11 
fib 44 49 no spill 7 
fib_mem 115 125 no spill 15 
fir ( main ) 110553 110566 no spill 13 
mm_double ( matmult ) 53412 53443 no spill 30 
mm_dyn 58486 58518 no spill 43 
mm_int 4180416 4180432 no spill 18 
mm 59333 59356 no spill 29 
nested 1362 1389 no spill 26 
wave 14473 14479 no spill 11 
wc ( cnt ) 1105371 1105398 no spill 28 
Table-6.4: Total dynamic cycle and total register allocation overhead comparison 
between linear scan register allocator and impact register allocator for 32 registers  
 
The results in Table-6.5 suggest that combining the proposed pre-pass scheduler with
the linear scan register allocator can significantly reduce the maximum number of active
live intervals in basic blocks which have long and narrow data dependent graphs. Spill
code insertion and unnecessary dependencies can be avoided because the proposed scheduler
decreases the number of simultaneously live ranges and hence the maximum number of
active live intervals.
Table-6.6 shows the average speedups of combining the proposed pre-pass scheduler
with the linear scan register allocator over combining the list scheduler with the Impact,
region-based or linear scan register allocator.
Benchmarks ( procedure ) Act1 Act2 Reduce% Act1 Act2 Reduce% Act1 Act2 Reduce%
181.mcf(_insert_new_arc) 16 11 31.25% 0 0 0% 1 1 0%
181.mcf(_replace_weaker_arc) 17 12 29.41% 0 0 0% 1 1 0%
181.mcf(_price_out_impl) 29 24 17.24% 0 0 0% 2 2 0%
181.mcf(_suspend_impl) 19 13 31.58% 0 0 0% 1 1 0%
181.mcf(_global_opt) 3 3 0% 0 0 0% 7 2 71.43%
101.tomcatv 69 65 5.80% 33 32 3.03% 7 2 71.43%
wc(_main) 7 7 0% 0 0 0% 3 2 33.33%
bmm(_sumup) 6 6 0% 1 1 0% 3 1 66.67%
dag 11 11 0% 0 0 0% 1 1 0%
eight 8 8 0% 0 0 0% 2 1 50.00%
example_bench(_convert_to_int) 2 2 0% 0 0 0% 3 2 33.33%
fact2 3 3 0% 0 0 0% 3 2 33.33%
fib 4 4 0% 0 0 0% 3 2 33.33%
fib_mem 6 6 0% 0 0 0% 3 2 33.33%
fir 11 11 0% 3 3 0% 3 2 33.33%
hyper 5 5 0% 0 0 0% 1 0 100.00%
ifthen 13 13 0% 0 0 0% 2 1 50.00%
mm_double(_matmult) 11 11 0% 3 3 0% 2 1 50.00%
mm_int 14 13 7.14% 0 0 0% 2 1 50.00%
mm 11 11 0% 3 3 0% 3 2 33.33%




Act1 – the maximum number of active live intervals after the Elcor pre-pass scheduler
Act2 – the maximum number of active live intervals after our pre-pass scheduler
Table-6.5: The maximum active live intervals of each procedure which has long
and narrow data dependent graphs, for several benchmarks in Trimaran.
 
Generally, a linear scan register allocator determines the number of live intervals
that are active at a given program point by visiting each live interval in turn. The
number of active live intervals represents the number of registers needed at that point
in the program. If the number of free registers is insufficient, some active live
intervals are chosen to be spilled and the scan proceeds. Since a linear scan register
allocator scans the intervals once, linearly, rather than repeating the allocation after
inserting spill code, it can operate faster than the graph coloring based register
allocators in Trimaran, namely the Impact and region-based register allocators.
Benchmarks C1 C2 C3 C4 C5 C6 spdup% over C1 spdup% over C2 spdup% over C4
181.mcf 52046.79 1563.36 1232.86 1248.36 52031.26 1547.84 98% 21% 1%
101.tomcatv 159306.63 6608.26 2106.87 3721.74 157691.77 4993.39 99% 68% 43%
bmm 2451.21 1999.55 284.10 1918.37 816.94 365.29 88% 86% 85%
dag 16907.72 266.07 53.55 167.72 16793.55 151.90 100% 80% 68%
eight 3531.47 218.34 32.83 163.25 3401.05 87.92 99% 85% 80%
example_bench 2451.53 540.81 280.15 303.15 2428.53 517.82 89% 48% 8%
fact2 1900.10 107.20 39.79 69.47 1870.43 77.52 98% 63% 43%
fib 2272.47 127.00 32.22 71.83 2232.87 87.40 99% 75% 55%
fib_mem 5412.59 166.69 26.94 115.76 5323.77 77.88 100% 84% 77%
fir 5487.11 792.27 336.26 697.56 5125.81 430.97 94% 58% 53%
hyper 5402.24 83.60 20.85 32.62 5390.47 71.83 100% 75% 36%
ifthen 1487.90 180.19 57.29 90.40 1454.79 147.08 96% 68% 37%
mm_double 1825.95 351.48 98.81 285.88 1638.88 164.41 95% 72% 65%
mm_dyn 3224.45 2071.10 321.61 1879.43 1666.62 513.27 90% 84% 83%
mm_int 1631.96 519.42 134.48 391.13 1375.31 262.76 92% 74% 66%
mm 1784.74 375.10 170.01 237.90 1716.84 307.20 90% 55% 29%
nested 4249.32 142.89 46.32 94.51 4201.13 94.70 99% 68% 51%
wc 34725.10 1202.09 989.44 1087.40 34627.14 1104.14 97% 18% 9%
Clock Ticks ( in millions )
 
C1 – Combining List scheduler with Impact register allocator 
C2 – Combining List scheduler with Region-based register allocator 
C3 – Combining the proposed pre-pass scheduler with Linear scan register allocator 
C4 – Combining List scheduler with Linear scan register allocator 
C5 – Combining the proposed pre-pass scheduler with Impact register allocator 
C6 – Combining the proposed pre-pass scheduler with Region-based register allocator 
 
Table-6.6: Average speedups of combining the proposed pre-pass scheduler with 
linear scan register allocator over combining Trimaran’s default scheduler with 
impact or region-based register allocator.  
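
The percentage columns of Table-6.5 and Table-6.6 appear to be relative reductions with
respect to the baseline. The sketch below shows that calculation in C; the function name
percent_reduction is hypothetical, the formula is inferred from the tabulated values rather
than stated in the text, and the tables report the result rounded.

```c
/* Relative reduction of `ours` with respect to `base`, in percent.
   A zero baseline is reported as 0%, as in the rows where both
   counts are 0. */
double percent_reduction(double base, double ours)
{
    if (base == 0.0)
        return 0.0;
    return 100.0 * (base - ours) / base;
}
```

For 181.mcf(_insert_new_arc), Act1 = 16 and Act2 = 11 give 100 × (16 − 11) / 16 = 31.25%.
For 181.mcf in Table-6.6, C1 = 52046.79 and C3 = 1232.86 give about 97.6%, which is
reported as 98%.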
 
Moreover, the list scheduler is very efficient in compilation time because it never
unschedules already scheduled operations. The proposed pre-pass scheduler unschedules
operations only when not all of the operations can be scheduled within the critical path
length due to insufficient functional units. This should be a rare case on next-generation
target processors, which are expected to provide more functional units. Thus, the
compilation time of the proposed pre-pass scheduler is comparable to that of the list
scheduler. As a result, combining the proposed pre-pass scheduler with the linear scan
register allocator is significantly faster than combining Trimaran's list scheduler with
the Impact or region-based register allocator, which supports the hypothesis that
combining the proposed pre-pass scheduler with linear scan register allocation can result
in better performance.
6.3. Summary 
In this chapter, a cooperative approach to instruction scheduling and linear scan
register allocation has been presented. Although the linear scan register allocator has
recently become an attractive register allocator, to the best of our knowledge nobody has
yet evaluated the performance of combining instruction scheduling with linear scan
register allocation; this is the first study to do so. The results show that combining
the proposed pre-pass scheduler with the linear scan register allocator can reduce the
maximum number of active live intervals, total dynamic cycles and register allocation
overhead for those basic blocks which have long and narrow data dependent graphs. This
can increase the probability that the linear scan register allocator reduces register
usage and spill code insertion. Moreover, compared to the default scheduling and graph
coloring allocator schemes found in the Impact and Elcor components of Trimaran, the
implementation with the proposed pre-pass scheduler and linear scan register allocator
significantly reduces compilation time. Future work can consider a cooperative approach
to global instruction scheduling and linear scan register allocation. In addition, an
integrated approach to instruction scheduling and linear scan register allocation can
also be explored.
 
 
Chapter 7 
Conclusions and recommendation for further work 
7.1. Conclusion 
 There has been a call for exact and satisfactory solutions to important compiler
optimizations such as instruction scheduling, register allocation and the integration of
these two phases. However, until recently, this problem has not been dealt with
satisfactorily. Research has shown that such optimizations can be solved in reasonable
time using various methods that combine instruction scheduling and register allocation
or make the two phases cooperate. Thus, this study surveys and examines various
techniques that have been developed for instruction scheduling, register allocation and
the integration of these two phases. Furthermore, after analyzing the strengths and
weaknesses of various existing methods for instruction scheduling and register
allocation, different approaches are implemented so as to address the phase ordering
problem. First of all, an integer linear programming approach is used to combine
instruction scheduling and register allocation. The study finds that, with the integer
programming approach, the number of variables and expressions required to formulate the
instruction scheduling and register allocation problem can be very large, even for a
small code segment. As a result, both formulating the problem and solving it take a very
long time. Thus, it is necessary to further refine the formulations to reduce redundant
variables and inequalities. Then, a new cooperative instruction scheduler, which can
reduce the number of simultaneously live ranges and is based on part of convergent
scheduling, is proposed and combined with different register allocators: the Impact
register allocator, the region-based register allocator and the linear scan register
allocator.
7.2. Contributions 
The main contributions of this thesis are as follows: 
• A study of various techniques for register allocation and instruction scheduling.
• A discussion of the phase ordering problem of register allocation and instruction
scheduling for Instruction Level Parallelism (ILP) and of existing combined
instruction scheduling and register allocation strategies.
• An experimental study using the scheduling model of ILOG OPL Studio version 3.3 to
formulate the instruction scheduling problem and obtain optimal instruction
schedules using integer linear programming.
• An experimental study of ORA (Optimal Register Allocator) developed by David
W. Goodwin and Kent D. Wilken [GW96].
• The development of a new local instruction scheduler, based on part of convergent
scheduling, which can reduce total dynamic cycles and register allocation
overhead.
• The implementation of the linear scan register allocator, proposed by Massimiliano
Poletto and Vivek Sarkar, within Trimaran so as to compare it with the Impact
register allocator and the region based register allocator.
• Implementations of the instruction scheduling problem and the register allocation
problem using integer linear programming, a portion of convergent scheduling, a
new instruction scheduler adapted from convergent scheduling, and the linear scan
register allocator.
7.3. Recommendations for future work 
 In this thesis, several approaches to combining instruction scheduling and register
allocation have been investigated and implemented. There is still room for improvement
in each of them.
First of all, an experimental study of combined instruction scheduling and register
allocation was carried out using the integer linear programming approach. The preliminary
results show that, even for a small code segment, the number of variables and expressions
needed to formulate the phase ordering problem is very large. Thus, it is necessary to
further refine the formulations to reduce redundant variables and inequalities; future
work can include improving the formulations for combined instruction scheduling and
register allocation.
 Then, a much more promising approach, a pre-pass local instruction scheduler
based on convergent scheduling, was proposed and implemented in Trimaran so as to address
the phase ordering problem. The proposed scheduler can significantly reduce the total
dynamic cycles and total register allocation overhead of benchmarks which have basic
blocks with long and narrow data dependent graphs. As future work on the proposed
scheduler, an efficient heuristic for fat and parallel data dependent graphs can be
developed. Furthermore, an efficient scheme for global instruction scheduling can also be
developed.
Finally, the linear scan register allocator was implemented in Trimaran and combined
with our pre-pass scheduler. The experimental results show that combining the proposed
pre-pass scheduler with the linear scan register allocator reduces total dynamic cycles
and dynamic register allocation overhead compared with combining the list scheduler with
the region based and Impact register allocators that have been implemented in Trimaran.
Current work is focused on the cooperative approach to combined instruction scheduling
and linear scan register allocation. Moreover, the linear scan register allocator does
not spill for the benchmarks which have so far been tested in Trimaran. Future work can
include an integrated approach to combined instruction scheduling and register
allocation. We can find big benchmarks which cause spill code
















BIBLIOGRAPHY 
[AMB00] Santosh G. Abraham, Waleed Meleis, Ivan D. Baev: Efficient Backtracking 
Instruction Schedulers. IEEE PACT 2000: 301-308. 
[AL01] A.W. Appel and L. George. Optimal spilling for CISC machines with few 
registers. In Proceedings of the ACM SIGPLAN Conference on Programming Language 
Design and Implementation, pages 243-253, 2001. 
[AN88]  A. Aiken and A. Nicolau, A development environment for horizontal microcode, 
IEEE Transactions on Software Engineering, 14(5):584-594, May 1988. 
[BB89] W. Baxter and H. R. Bauer, III. The program dependence graph and 
vectorization. In Proceedings of the Sixteenth Annual ACM SIGACT/SIGPLAN 
Symposium on Principles of Programming Language, Austin, TX, 1989. 
[BCKT89] Preston Briggs, Keith D. Cooper, Ken Kennedy, and Linda Torczon. Coloring 
heuristics for register allocation. In Proceedings of the ACM SIGPLAN ’89 Conference 
on Programming Language Design and Implementation, July 1989. 
[BEH91] David G. Bradlee, Susan J. Eggers, and Robert R. Henry. Integrating register 
allocation and instruction scheduling for RISCs. In Fourth International Conference on 
Architectural Support for Programming Languages and Operating Systems, pages 122-
131, Santa Clara, CA, April 1991. 
[BGS93] D. Berson, R. Gupta, and M.L. Soffa, URSA: A unified resource allocator for 
registers and functional units in VLIW architectures, Conference on Architectures and 
Compilation Techniques for Fine and Medium Grain Parallelism, IFIP Transactions A-
23, pages 243-254, Orlando, Florida, January 1993. 
[BGS94] D. Berson, R. Gupta and M.L. Soffa, Resource Spackling: A framework for 
integrating register allocation in local and global scheduler, International Conference on 
Parallel Architectures and Compilation Techniques, IFIP Transactions A-50, pages 135-
146, Montreal, Canada, August 1994. 
[BGS95] D. Berson, R. Gupta, and M.L. Soffa, GURRR: A global unified resource 
requirements representation, ACM SIGPLAN Workshop on Intermediate 
Representations, San Francisco, California, January 1995. 
[BGS98] D. Berson, R. Gupta, and M.L. Soffa, Integrated instruction scheduling and 
register allocation techniques, Eleventh International Workshop on Languages and 
Compilers for Parallel Computing, LNCS, Springer Verlag, North Carolina, Chapel Hill, 
August 1998. 
[BKK94] R. Bixby, K. Kennedy, and U. Kremer. Automatic data layout using 0-1 integer 
programming. In Proceedings of Conference on Parallel Architectures and Compilation 
Techniques, August 1994. 
[BR91] David Bernstein and Michael Rodeh, Global instruction scheduling for 
superscalar machines. In Proceedings of the SIGPLAN ’91 Conference on Programming 
Language Design and Implementation, Toronto, CANADA, June 1991. 
[Bri92] Preston Briggs. Register allocation via graph coloring. PhD thesis, Rice 
University, April 1992. 
[BW01] Peter van Beek, Kent Wilken, Fast optimal instruction scheduling for single-
issue processors with arbitrary latencies, 7th International Conference on Principles and 
Practice of Constraint Programming (CP2001), Paphos, Cyprus, 26 November – 1 
December 2001. 
[CCK90] David Callahan, Steve Carr, and Ken Kennedy. Improving register allocation 
for subscripted variables. In Proceedings of the ACM SIGPLAN ’90 Conference on 
Programming Language Design and Implementation, pages 53-65, White Plains, NY, 
June 1990. 
[CCK97] C-M Chang, C-M Chen, and C-T King. Using integer linear programming for 
instruction scheduling and register allocation in multi-issue processors. Computers and 
Mathematics with Applications, 34(9):1-14, November 1997. 
[CH84] Fred C. Chow, John L. Hennessy, Register allocation by priority-based coloring, 
ACM SIGPLAN Notices, v.19 n.6, p.222-232, Jun 1984. 
[CH90] Fred C. Chow, John L. Hennessy, The priority-based coloring approach to 
register allocation, ACM Transactions on Programming Languages and Systems 
(TOPLAS), v.12 n.4, p.501-536, Oct. 1990. 
[Cha81] G.J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. 
W. Markstein. Register allocation via coloring. Computer Languages, 6:47-57, Jan. 1981. 
[Cha82] G.J. Chaitin. Register allocation and spilling via graph coloring. In SIGPLAN 
Symposium on Compiler Construction, Boston, June 1982. 
[KPD98] Keith D. Cooper, Philip J. Schielke, Devika Subramanian. An Experimental 
Evaluation of List Scheduling. Rice University, Department of Computer Science 
Technical Report 98-326, September 1998. 
[Cin95] Cindy Norris. Cooperative register allocation and instruction scheduling. PhD 
thesis, Delaware, May 1995. 
[CK91a] David Callahan and Brian Koblenz. Register allocation via hierarchical graph 
coloring. In Proceedings of the SIGPLAN ’91 Conference on Programming Language 
Design and Implementation, pages 192-203, Toronto, CANADA, June 1991. 
[CK91b] David Callahan and Brian Koblenz. Register allocation via hierarchical graph 
coloring. In Proceedings of the ACM SIGPLAN ’91 Conference on Programming 
Language Design and Implementation, pages 192-203, Toronto, CANADA, June 1991. 
[DP02] Diego Puppin. Convergent Scheduling: A Flexible and Extensible Scheduling 
Framework for Clustered VLIW Architectures, MIT Press, Cambridge, SM thesis, 2002. 
[Ell86] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures, MIT Press, Cambridge,  
MA, 1986. 
[EK02] Johansson E, Sagonas K. Linear scan register allocation in a high performance 
Erlang compiler. Practical Applications of Declarative Languages: Proceedings of the 
PADL’2002 Symposium (Lecture Notes in Computer Science, vol. 2257). Springer: 
Berlin, 2002; 299–317. 
[Fis81] J. A. Fisher. Trace scheduling: A technique for global microcode compaction, 
IEEE Transactions on Computers, 30(7):478-490, July 1981. 
[FL98] M.Farach and V. Liberatore, On local register allocation, in ACM-SIAM 
Symposium on Discrete Algorithms, 1998. 
[FOW87] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program 
dependence graph and its use in optimization. ACM Transactions on Programming 
Languages and Systems, 9(3):319-349, 1987. 
[FR92] S.M. Freudenberger and J.C. Ruttenberg. Phase ordering of register allocation 
and instruction scheduling. In Code Generation-Concepts, Tools, Techniques: 
Proceedings of the International Workshop on Code Generation, May 1992. 
[Fre74] R.A Freiburghouse. Register allocation via usage counts. Communications of the 
ACM, 17(11):638-642, November 1974. 
[GC21] Gang Chen. Effective instruction scheduling with limited registers. Ph.D. thesis, 
Harvard University, Division of Engineering and Applied Sciences, March 2001. 
[GH86] J. R. Goodman, W. C. Hsu, On the use of registers vs. cache to minimize 
memory traffic, ACM SIGARCH Computer Architecture News, v.14 n.2, p.375-383, 
June 1986. 
[GH88] James R. Goodman and Wei-Chung Hsu. Code scheduling and register allocation 
in large basic blocks. In 1988 International Conference on Supercomputing, pages 442-
452, Orlando, Florida, November 1988. 
 [GS90a] Rajiv Gupta and Mary Lou Soffa, Region scheduling: An approach for 
detecting and redistributing parallelism, IEEE Transactions on Software Engineering, 
16(4):421-431, April 1990. 
[GSS89] Rajiv Gupta, Mary Lou Soffa, and Tim Steele. Register allocation via clique 
separators. In Proceedings of the SIGPLAN ’89 Conference on Programming Language 
Design and Implementation, Portland, Oregon, June 1989. 
[GS99] Gang Chen, Michael D. Smith, Reorganizing global schedules for register 
allocation, Proceedings of the 13th international conference on Supercomputing, May 
1999. 
[GW96] D. Goodwin and K. Wilken. Optimal and near-optimal global register allocation 
Using 0-1 Integer Programming. Software-Practice and Experience, 26(8):929-965, 
August 1996. 
[Hen99] Pascal Van Hentenryck, The OPL optimization programming language; with 
contributions by Irvin Lustig, Laurent Michel, and Jean-François Puget. Cambridge, 
Mass.: MIT Press, 1999. 
[HFG89] Wei Chung Hsu, Charles N. Fischer, and James R. Goodman. On the 
minimization of loads/stores in local register allocation. IEEE Transactions on Software 
Engineering, 15 (10):1252-1260, 1989. 
[HG83] J. L. Hennessy and Thomas Gross. Postpass code optimization of pipeline 
constraints. ACM Transactions on Programming Languages and Systems, 5(3):422-448, 
July 1983. 
[HKMW66] L.P. Horwitz, R.M. Karp, R.E. Miller, S. Winograd, Index Register 
Allocation, Journal of the ACM(JACM), v.13 n.1, p.43-61, Jan. 1966. 
[HK01] Hansoo Kim. Region-based register allocation for EPIC architecture. Ph.D. 
Thesis, New York University, January 2001.  
[HPR88] S. Horwitz, J. Prins, and T. Reps. Integrating non-interfering versions of 
programs. In Proceedings of the Fifteenth Annual ACM SIGACT/SIGPLAN Symposium 
on Principles of Programming Languages, pages 133-145, San Diego, CA, 1988. 
[Jai91] S. Jain. Circular scheduling: A new technique to perform software pipelining. In 
Proceedings of the ACM SIGPLAN ’91 Conference on Programming Language Design 
and Implementation, pages 219-228, June 1991. 
[KFK97] Akira Koseki, Yoshiaki Fukazawa and Hideaki Komatsu, A register allocation 
technique using register existence graph, Proceedings of the 1997 International 
Conference on Parallel Processing (ICPP’97), August 1997. 
[KH93] P. Kolte and Mary Jean Harrold. Load/store range analysis for global register 
allocation. In Proceedings of the SIGPLAN ’93 Conference on Programming Language 
Design and Implementation, June 1993.  
[KKLPW81] D. J. Kuck, R. H. Kuhn, B. Leasure, D. A. Padua, and M. Wolfe. 
Dependence graphs and compiler optimizations. In Proceedings of the Eighth Annual 
ACM Symposium on Principles of Programming Languages, pages 207-218, 1981. 
[KKN02] Akira Koseki, Hideaki Komatsu and Toshio Nakatani, Preference-directed 
graph coloring, ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 2002 
Conference on Programming Language Design and Implementation, May 2002. 
[KW98] Timothy Kong and Kent D. Wilken. Precise register allocation for irregular 
architectures. In Proceedings of the 31st annual ACM/IEEE International Symposium on 
Microarchitecture, Dallas, Texas, United States. 1998. 
[Lam88] Monica Lam. Software pipelining: An effective scheduling technique for VLIW 
machines. In Proceedings of the SIGPLAN ‘88 Conference on Programming Language 
Design and Implementation. Atlanta, Georgia, June 1988. 
[LVAG98] Josep Llosa, Mateo Valero, Eduard Ayguadé and Antonio González, Modulo 
scheduling with reduced register pressure, IEEE Transactions on Computers, June 1998. 
[ME92] S-M Moon and K. Ebcioglu. An efficient resource-constrained global scheduling 
technique for superscalar and VLIW processors. In Proceedings of the Twenty-fifth 
International Symposium on Microarchitecture, Portland, OR, 1992. 
[Mor98] Robert Morgan, Building an optimizing compiler, Boston : Butterworth-
Heinemann, 1998. 
[Muc97] Steve Muchnick, Advanced compiler design and implementation, San 
Francisco, Calif. : Morgan Kaufmann Publishers, 1997. 
[NP93] Cindy Norris and Lori L. Pollock. A scheduler-sensitive global register 
allocator. In Supercomputing ’93 Proceedings, Portland, OR, November 1993. 
[NP94] Cindy Norris and Lori L. Pollock. Register allocation over the program 
dependence graph. In Proceedings of the SIGPLAN ’94 Conference on Programming 
Language Design and Implementation, June 1994. 
[NP95a] Cindy Norris, Lori L. Pollock, An experimental study of several cooperative 
register allocation and instruction scheduling strategies, Proceedings of the 28th annual 
international symposium on Microarchitecture, p.169-179, November 29-December 01, 
1995, Ann Arbor, Michigan, United States. 
[NP95b] Cindy Norris and Lori L. Pollock, Register allocation sensitive region 
scheduling, Working Conference on Parallel Architectures and Compilation Techniques, 
June 1995, Limassol, Cyprus. 
[P98] Philip J. Schielke. Issues in Instruction Scheduling. Rice University, Department of 
Computer Science Technical Report 98-323, September 1998. 
[Ilog04] http://www.ilog.com 
[KW04a] Khaing Khaing K. W., Weng-Fai W. A survey of combined instruction 
scheduling and register allocation. In Proceedings of the second international conference on 
computer applications (ICCA’04), pages 146-154, Myanmar, 2004. 
[KW04b] Khaing Khaing K. W., Weng-Fai W. An efficient local instruction scheduling 
algorithm. In Proceedings of the second international conference on computer applications 
(ICCA’04), pages 168-175, Myanmar, 2004. 
[OGM98] Traub O, Holloway G, Smith MD. Quality and speed in linear-scan register 
allocation. Proceedings of ACM SIGPLAN Conference on Programming Language 
Design and Implementation. ACM Press: New York, 1998; 142–151. 
[OT98] Omri Traub. Quality and Speed in Linear-Scan Register Allocation. 
B.A(Honours) Thesis, Harvard College, Cambridge, Massachusetts, 1998. 
[PF91] Todd A. Proebsting and Charles N. Fischer. Linear-time, optimal code scheduling 
for delayed-load architectures. In Proceedings of the ACM SIGPLAN ’91 Conference on 
Programming Language Design and Implementation, Toronto, June 1991. 
[PF92] Todd A. Proebsting, Charles N. Fischer, Probabilistic register allocation, ACM 
SIGPLAN Notices, v.27 n.7, p.300-310, July 1992. 
[PF96] Todd A. Proebsting, Charles N. Fischer, Demand-driven register allocation, ACM 
Transactions on Programming Languages and Systems (TOPLAS), v.18 n.6, p.683-710, 
Nov. 1996. 
[Pin93] S.S.Pinter. Register allocation with instruction scheduling: a new approach. In 
Proceedings of the SIGPLAN ’93 Conference on Programming Language Design and 
Implementation, June 1993. 
[PS99] Massimiliano Poletto, Vivek Sarkar, Linear scan register allocation, ACM 
Transactions  on Programming Languages and Systems (TOPLAS), v.21 n.5, p.895-913, 
Sept. 1999. 
[Pug91] W. Pugh. The Omega Test: A fast and practical integer programming algorithm 
for dependence analysis. In Proceedings Supercomputing '91, pages 18-22, Nov 1991. 
[RT81] J.H. Reif and R.E. Tarjan. Symbolic program analysis in almost linear time. 
SIAM Journal of Computing, 11(1):81-93, February 1981. 
[REH90]  Richard Eugene Hank. Machine Independent Register Allocation for the 
Impact-I C Compiler. BS Thesis, University of Illinois at Urbana-Champaign, 1990. 
 
[SB92] Philip H. Sweany and Steven J. Beaty. Dominator-path scheduling-a global 
scheduling method. In Proceedings of the Twenty fifth International Symposium on 
Microarchitecture, pages 260-263, Portland, OR, 1992. 
[SC99] Gang Chen and Michael D. Smith, Evaluating register allocation and instruction 
scheduling techniques in Out-Of-Order issue Processors, 1999 International Conference 
on Parallel Architectures and Compilation Techniques, October 12 – 16, 1999. 
[Sit79] Richard L. Sites, Machine-independent register allocation, Proceedings of the 
SIGPLAN symposium on Compiler construction, p.221-225, August 06-10, 1979, 
Denver, Colorado, United States. 
[SSS01] Vivek Sarkar, Mauricio J. Serrano, Barbara B. Simons, Register-sensitive 
selection, duplication, and sequencing of instructions, Proceedings of the 15th 
international conference on Supercomputing, p.277-288, June 2001, Sorrento, Italy. 
[SS03] Konstantinos F. Sagonas, Erik Stenman: Experimental evaluation and 
improvements to linear scan register allocation. Softw., Pract. Exper. 33(11): 1003-1034 
(2003). 
[tri03] http://www.Trimaran.org 
[War84] J. Warren. A hierarchical basis for reordering transformations. In Proceedings of 
the Eleventh Annual ACM Symposium on Principles of Programming Languages, pages 
272-282, 1984. 
[WDSS02] Walter Lee, Diego Puppin, Shane Swanson, Saman Amarasinghe. Convergent 
Scheduling. In Proceedings of the 35th Annual International Symposium on 
Microarchitecture (MICRO), Istanbul, Turkey, November 2002. 
[Win93] Wayne L. Winston, Operations research: applications and algorithms, third 
edition, 1993 
[WLH00] K. Wilken, J. Liu, and M. Heffernan. Optimal instruction scheduling using 
integer programming. In Programming Language Design and Implementation, pages 
121-133. ACM SIGPLAN, June 2000. 
[WMHR93] Nancy J. Warter, Scott A. Mahlke, Wen-mei W. Hwu, and B. Ramakrishna 
Rau. Reverse if-conversion. In Proceedings of the SIGPLAN ’93 Conference on 
















Appendix A 
Source code in ILOG OPL studio for portion of Trimaran Rebel output (hb 35 ) 
enum Tasks { op141, op242, op244, op246, op243, op245, op247, op142, op143, 
op144}; 
int duration[Tasks] = [2,1,1,1,1,1,1,1,1,1]; 
int totalDuration = sum(t in Tasks) duration[t]; 
scheduleOrigin = 0; 
scheduleHorizon = totalDuration; 








    task[op144].end 
subject to { 
   task[op141] precedes task[op243]; 
   task[op141] precedes task[op245]; 
   task[op141] precedes task[op247]; 
   task[op243] precedes task[op142];    
   task[op245] precedes task[op143]; 
   task[op247] precedes task[op144]; 
    // Resources 
   task[op141] requires mem; 
   task[op242] requires inte; 
   task[op244] requires inte; 
   task[op246] requires inte; 
   task[op243] requires inte; 
   task[op245] requires inte; 
   task[op247] requires inte; 
   task[op142] requires br; 
   task[op143] requires br; 





Appendix B 
Optimal register allocation for two real registers and two symbolic registers 
 
var float+ x1_defA; 
var float+ x2_defA; 
var float+ x1_use_end1A; 
var float+ x1_use_cont1A; 
var float+ x1_use_end2A; 
var float+ x1_use_cont2A; 
var float+ x2_use_end1A; 
var float+ x2_use_cont1A; 
var float+ x2_use_end2A; 
var float+ x2_use_cont2A; 
var float+ x1_defB; 
var float+ x2_defB; 
var float+ x1_use_endB; 
var float+ x1_use_contB; 
var float+ x2_use_endB; 







         
subject to { 
 
x1_defA = x1_use_end1A + x1_use_cont1A; 
x1_defA = x1_use_end2A + x1_use_cont2A; 
x1_use_cont1A = x1_use_cont2A; 
x1_defB = x1_use_endB + x1_use_contB; 
x2_defA = x2_use_end1A + x2_use_cont1A; 
x2_defA = x2_use_end2A + x2_use_cont2A; 
x2_use_cont1A = x2_use_cont2A; 
x2_defB = x2_use_endB + x2_use_contB; 
x1_defA <= 1; 
x2_defA <= 1; 
x1_defB + x1_use_cont2A <= 1; 
x2_defB + x2_use_cont2A <= 1; 




x1_defB+x2_defB = 1; 












































Appendix C 
Optimal spill code placement for two real registers and two symbolic registers 
 
var float+ x1_defA; 
var float+ x2_defA; 
var float+ x1_use_end1A; 
var float+ x1_use_cont1A; 
var float+ x1_use_end2A; 
var float+ x1_use_cont2A; 
var float+ x2_use_end1A; 
var float+ x2_use_cont1A; 
var float+ x2_use_end2A; 
var float+ x2_use_cont2A; 
var float+ x1_defB; 
var float+ x2_defB; 
var float+ x1_use_endB; 
var float+ x1_use_contB; 
var float+ x2_use_endB; 
var float+ x2_use_contB; 
var float+ x1_cont1A; 
var float+ x1_cont2A;  
var float+ x1_cont3A;  
var float+ x1_contB;  
var float+ x2_cont1A; 
var float+ x2_cont2A;  
var float+ x2_cont3A;  
var float+ x2_contB;  
var float+ x1_store1A; 
var float+ x2_store1A; 
var float+ x1_store2A; 
var float+ x2_store2A; 
var float+ x1_store3A; 
var float+ x2_store3A; 
var float+ x1_load1A; 
var float+ x1_load2A; 
var float+ x1_load3A; 
var float+ x1_load4A; 
var float+ x1_load5A; 
var float+ x2_load1A; 
var float+ x2_load2A; 
var float+ x2_load3A; 
var float+ x2_load4A; 
var float+ x2_load5A; 
var float+ x_memory_cont1A; 
var float+ x_memory_cont2A; 
var float+ x_memory_cont3A; 
var float+ x_memory_cont4A; 
var float+ x_memory_cont5A; 
var float+ x1_storeB; 
var float+ x2_storeB; 
var float+ x1_load1B; 
var float+ x1_load2B; 
var float+ x2_load1B; 
var float+ x2_load2B; 
var float+ x_memory_cont1B; 




x1_cont1A + x1_cont2A + x1_cont3A + x1_contB + x2_cont1A + x2_cont2A + 
x2_cont3A + x2_contB + x1_use_endB + x1_use_contB + x2_use_endB + x2_use_contB 
+ x1_defA + x2_defA + x1_use_end1A + x1_use_cont1A + x2_use_end1A + 
x2_use_cont1A + x1_use_end2A + x1_use_cont2A + x2_use_end2A + x2_use_cont2A + 
x1_defB + x2_defB + x1_load1A + x1_load2A + x1_load3A + x1_load4A + x1_load5A 
+ x2_load1A + x2_load2A + x2_load3A + x2_load4A + x2_load5A + x1_store1A + 
x2_store1A + x1_store2A + x2_store2A + x1_store3A + x2_store3A + 
x_memory_cont1A + x_memory_cont2A + x_memory_cont3A + x_memory_cont4A + 
x_memory_cont5A + x1_load1B + x1_load2B + x2_load1B + x2_load2B + x1_storeB + 
x2_storeB + x_memory_cont1B + x_memory_cont2B 
    
subject to { 
 
x1_store1A <= x1_defA;   
x2_store1A <= x2_defA;   
x1_cont1A  <= x1_defA;   
x2_cont1A  <= x2_defA;   
x1_defA = x1_store1A + x1_cont1A; 
x2_defA = x2_store1A + x2_cont1A; 
x1_store2A <= x1_cont1A; 
x2_store3A <= x2_cont1A; 
x1_store3A + x1_cont3A + x2_store3A + x2_cont3A >= 1; 
x1_store2A + x1_cont2A + x2_store2A + x2_cont3A >= 1; 
x1_cont3A + x1_load2A + x2_cont3A + x2_load2A >=1; 
x1_cont2A + x1_load1A + x2_cont2A + x2_load1A >=1; 
x1_use_end2A + x1_use_cont2A + x2_use_end2A + x2_use_cont2A >= 1; 
x1_use_end1A + x1_use_cont1A + x2_use_end1A + x2_use_cont1A >= 1; 
x1_load1A <= x1_use_end1A + x1_use_cont1A; 
x2_load1A <= x2_use_end1A + x2_use_cont1A; 
x1_load2A <= x1_use_end2A + x1_use_cont2A; 
x2_load2A <= x2_use_end2A + x2_use_cont2A; 
x1_cont2A <= x1_use_end1A + x1_use_cont1A; 
x2_cont2A <= x2_use_end1A + x2_use_cont1A; 
x1_cont3A <= x1_use_end2A + x1_use_cont2A; 
x2_cont3A <= x2_use_end2A + x2_use_cont2A; 
x1_cont2A + x1_load1A + x2_cont2A + x2_load1A >=1; 
x1_cont3A + x1_load2A + x2_cont3A + x2_load2A >=1; 
x1_use_cont1A + x1_load3A + x2_use_cont1A + x2_load3A >= 1; 
x1_use_cont1A + x1_load3A + x1_load5A + x2_use_cont1A + x2_load3A + x2_load5A 
>= 1; 
x1_storeB <= x1_defB; 
x2_storeB <= x2_defB; 
x1_contB <= x1_defB;  
x2_contB <= x2_defB;  
x1_defB = x1_storeB + x1_contB; 
x2_defB = x2_storeB + x2_contB; 
x1_load1B <= x1_use_endB + x1_use_contB; 
x2_load1B <= x2_use_endB + x2_use_contB; 
x1_contB <= x1_use_endB + x1_use_contB; 
x2_contB <= x2_use_endB + x2_use_contB; 
x1_use_endB + x1_use_contB + x2_use_endB + x2_use_contB >= 1; 
x1_contB + x1_load1B + x2_contB + x2_load1B >= 1; 
x1_use_contB + x1_load2B + x2_use_contB + x2_load2B >= 1; 
x1_defA <= 1; 
x2_defA <= 1; 
x1_defB + x1_use_cont2A <= 1; 
x2_defB + x2_use_cont2A <= 1; 
x1_defB + x1_use_cont1A <= 1; 
x2_defB + x2_use_cont1A <= 1; 
x1_defB <= 1; 
x2_defB <= 1; 
x1_defA + x1_use_contB <=1; 
x2_defA + x2_use_contB <=1; 
x1_defA+x2_defA = 1; 
x1_defB+x2_defB = 1; 
x1_use_cont1A = x1_use_cont2A; 
x2_use_cont1A = x2_use_cont2A; 
x1_load1A + x2_load1A <= x1_store1A + x2_store1A + x1_store2A + x2_store2A; 
x_memory_cont1A <= x1_store1A + x2_store1A + x1_store2A + x2_store2A; 
x1_load2A + x2_load2A <= x1_store1A + x2_store1A + x1_store2A + x2_store2A; 
x_memory_cont2A <= x1_store1A + x2_store1A + x1_store2A + x2_store2A; 
x1_load3A + x2_load3A <= x_memory_cont1A; 
x_memory_cont3A <= x_memory_cont1A; 
x1_load4A + x2_load4A <= x_memory_cont2A; 
x_memory_cont4A <= x_memory_cont2A; 
x_memory_cont3A = x_memory_cont4A; 
x1_load5A + x2_load5A <= x_memory_cont3A; 
x_memory_cont5A <= x_memory_cont3A; 
x1_load1B + x2_load1B <= x1_storeB + x2_storeB; 
x_memory_cont1B <= x1_storeB + x2_storeB; 
x1_load2B + x2_load2B <= x_memory_cont1B; 










































Appendix D 
Optimal spill code placement for one real register and two symbolic registers 
 
var float+ x1_defA; 
var float+ x1_use_end1A; 
var float+ x1_use_cont1A; 
var float+ x1_use_end2A; 
var float+ x1_use_cont2A; 
var float+ x1_defB; 
var float+ x1_use_endB; 
var float+ x1_use_contB; 
var float+ x1_cont1A; 
var float+ x1_cont2A;  
var float+ x1_cont3A;  
var float+ x1_contB;  
var float+ x1_store1A; 
var float+ x1_store2A; 
var float+ x1_store3A; 
var float+ x1_load1A; 
var float+ x1_load2A; 
var float+ x1_load3A; 
var float+ x1_load4A; 
var float+ x1_load5A; 
var float+ x_memory_cont1A; 
var float+ x_memory_cont2A; 
var float+ x_memory_cont3A; 
var float+ x_memory_cont4A; 
var float+ x_memory_cont5A; 
var float+ x1_storeB; 
var float+ x1_load1B; 
var float+ x1_load2B; 
var float+ x_memory_cont1B; 




x1_cont1A + x1_cont2A + x1_cont3A + x1_contB + x1_use_endB + x1_use_contB + 
x1_defA + x1_use_end1A + x1_use_cont1A + x1_use_end2A + x1_use_cont2A + 
x1_defB + x1_load1A + x1_load2A + x1_load3A + x1_load4A + x1_load5A + 
x1_store1A + x1_store2A + x1_store3A + x_memory_cont1A + x_memory_cont2A + 
x_memory_cont3A + x_memory_cont4A + x_memory_cont5A + x1_load1B + 
x1_load2B + x1_storeB + x_memory_cont1B + x_memory_cont2B 
    
subject to { 
x1_store1A <= x1_defA;   
x1_cont1A  <= x1_defA;   
x1_store2A <= x1_cont1A;  
x1_store3A <= x1_cont1A; 
x1_defA = x1_use_end1A + x1_use_cont1A; 
x1_defA = x1_use_end2A + x1_use_cont2A; 
x1_cont1A >= x1_store2A + x1_cont2A; 
x1_cont1A >= x1_store3A + x1_cont3A; 
x1_cont2A = x1_use_end1A + x1_use_cont1A; 
x1_cont3A = x1_use_end2A + x1_use_cont2A; 
x1_use_end2A + x1_use_cont2A  >= 1; 
x1_use_end1A + x1_use_cont1A  >= 1; 
x1_use_cont2A + x1_load4A >= 1; 
x1_use_cont2A + x1_load4A + x1_load5A >= 1; 
x1_storeB <= x1_defB;  
x1_contB <= x1_defB;  
x1_defB = x1_use_endB + x1_use_contB; 
x1_contB = x1_use_endB + x1_use_contB; 
x1_use_endB + x1_use_contB  >= 1; 
x1_use_contB + x1_load2B >=1; 
x1_defA = 1; 
x1_defB + x1_use_cont2A <=1; 
x1_defB = 1; 
x1_defA + x1_use_endB <=1; 
x1_load1A <= x1_store1A + x1_store2A; 
x_memory_cont1A <= x1_store1A + x1_store2A; 
x1_load2A <= x1_store1A + x1_store2A; 
x_memory_cont2A <= x1_store1A + x1_store2A; 
x1_load3A <= x_memory_cont1A; 
x_memory_cont3A <= x_memory_cont1A; 
x1_load4A <= x_memory_cont2A; 
x_memory_cont4A <= x_memory_cont2A; 
x_memory_cont3A = x_memory_cont4A;  
x1_load5A <= x_memory_cont3A; 
x_memory_cont5A <= x_memory_cont3A; 
x1_load1B <= x1_storeB; 
x_memory_cont1B <= x1_storeB; 
x1_load2B <= x_memory_cont1B; 









Appendix E 




We present a new local instruction scheduler that places instructions efficiently 
while maintaining the optimal instruction schedule length. For this approach we provide a 
heuristic for parallel instruction scheduling. The proposed scheduler produces its 
schedule within the optimal instruction schedule length in a single pass, which is more 
effective than first obtaining the optimal schedule length and then reordering 
instructions. The proposed scheduler is based on the convergent scheduler, is built 
within Trimaran, and is evaluated experimentally using several integer benchmarks. Our 
experimental results show that the proposed scheduler can significantly reduce total 
dynamic cycles and register allocation overhead. 
1. Introduction   
Modern optimizing compilers contain several optimization phases, including instruction 
scheduling, which has received widespread attention in academic and industrial 
research. Instruction scheduling is one of the most important phases in compiler 
optimization, since the goal of an optimizing compiler is to use all of the resources of 
the target computer efficiently. The Explicitly Parallel Instruction Computing (EPIC) 
architecture, exemplified by the Itanium Processor Family (IPF), requires compilers to 
schedule instructions statically to fully utilize its greater instruction level parallelism 
(ILP) [8]. Moreover, to produce good code for modern machines such as VLIW and 
superscalar machines, the compiler must expose enough ILP to let the scheduler keep the 
various functional units busy. The scheduler must order the operations in a way that lets 
them execute in parallel. Furthermore, the compiler must keep as many values in registers 
as possible, since the memory interface is rarely wide enough or versatile enough to meet 
the demand for operands. The goal of instruction scheduling is to exploit available 
instruction level parallelism efficiently and effectively. To meet this goal, we propose 
a new instruction scheduler that can reduce total dynamic cycles and dynamic register 
allocation overhead. Our approach is based on part of convergent scheduling, a general 
instruction scheduling framework that simplifies and facilitates the application of a 
multitude of arbitrary constraints and scheduling heuristics required to schedule 
instructions for modern complex processors. We implement our proposed scheduler within 
the pre-pass scheduler of Trimaran and compare it with the Elcor pre-pass scheduler of 
Trimaran. In Trimaran, instruction scheduling performed before register allocation (by 
the Impact register allocator) is called pre-pass scheduling, and instruction scheduling 
performed after it is called post-pass scheduling. 
Trimaran [15] is a compiler infrastructure for supporting state of the art research 
in compiling for Instruction Level Parallelism (ILP) architectures. The system is oriented 
towards EPIC (Explicitly Parallel Instruction Computing) architectures, and supports 
compiler research in what are typically considered "back end" techniques such as 
instruction scheduling, register allocation, and machine-dependent optimizations. The 
Trimaran system is based on the HPL-PlayDoh (HPL-PD) architecture, a parametric 
processor architecture conceived for research in instruction-level parallelism. The 
HPL-PD opcode repertoire, at its core, is similar to that of a RISC-like load/store 
architecture, with standard integer, floating point (including fused multiply-add type of 
operations) and memory operations. 
Most previous work has investigated scheduling heuristics that try to obtain an 
optimal schedule by placing each instruction at its earliest possible cycle. Our 
proposed scheduler instead tries to place instructions at the latest possible cycle within 
each basic block while maintaining the optimal instruction schedule length, in order to 
reduce the number of simultaneously live variables and unnecessary dependencies. This can 
reduce the amount of spill code that the register allocator inserts, which can in turn 
shorten the schedule length after register allocation. A comparison between a typical 
earlier scheduler (e.g. Elcor) and our proposed scheduler is given 
in Figure-1, Figure-2 and Figure-3.  
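The difference between earliest-cycle and latest-cycle placement can be illustrated 
with a minimal Python sketch. It is not the scheduler implemented in Trimaran: the 
graph, the operation names a to d and the helper names are hypothetical, resource 
constraints are ignored, and each operation is assumed to issue in a single cycle. 

```python
from collections import defaultdict


def asap_alap(ops, edges):
    """Return (asap, alap, length) for a dependence DAG.

    ops   -- operation names in topological order
    edges -- dict mapping (pred, succ) to the latency in cycles
    Resource limits are ignored (an assumption of this sketch).
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for (p, s), lat in edges.items():
        preds[s].append((p, lat))
        succs[p].append((s, lat))

    # Earliest issue cycle: longest latency path from any source.
    asap = {}
    for op in ops:
        asap[op] = max((asap[p] + lat for p, lat in preds[op]), default=0)
    last = max(asap.values())  # last issue cycle of the shortest schedule

    # Latest issue cycle that keeps every successor at or before `last`.
    alap = {}
    for op in reversed(ops):
        alap[op] = min((alap[s] - lat for s, lat in succs[op]), default=last)
    return asap, alap, last + 1


def live_cycles(schedule, edges):
    """Sum, over all producers, of the cycles until their last consumer issues."""
    last_use = {}
    for (p, s) in edges:
        last_use[p] = max(last_use.get(p, schedule[p]), schedule[s])
    return sum(last_use[p] - schedule[p] for p in last_use)


# Hypothetical graph: a -> b -> d is the critical path, c -> d has slack.
OPS = ["a", "b", "c", "d"]
EDGES = {("a", "b"): 1, ("b", "d"): 1, ("c", "d"): 1}

asap, alap, length = asap_alap(OPS, EDGES)
print(length, asap["c"], alap["c"])
print(live_cycles(asap, EDGES), live_cycles(alap, EDGES))
```

In this hypothetical graph both placements have the same length, but the operation 
with slack (c) is issued one cycle later in the latest-cycle placement, so the value 
it produces stays live for fewer cycles. 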
The rest of the paper is organized as follows. In the next section we start with 
some preliminary definitions. Section 3 covers related work in this area. Section 4 
presents the proposed instruction scheduler, followed by the experimental 
results and discussion (Section 5). Section 6 concludes. 
 
 






























( a ) ( b ) ( c )  
Figure-1: Example data dependence graph (DDG) of basic block (BB) 38 of the rawcaudio 
benchmark from Trimaran. (a) The original data dependence graph. Latency 1 on (op 140, 
op 239) means that operation 239 cannot start until one cycle after operation 140 
completes. (b) The Elcor schedule, which places instructions at the earliest cycle, so that 
the live range of operation 137 is lengthened and the number of simultaneously live ranges 
may increase. As a result, register lifetimes may be lengthened, and the amount of spill 
code and unnecessary dependencies may increase. (c) The proposed schedule, which places 
instructions at the latest cycle while maintaining the optimal instruction schedule length. 
 
[Figure-2 graphic not reproduced: schedules of BB 38 over cycles 0 to 2, panels (a) and (b)] 
Figure-2: Result of pre-pass scheduling for BB 38 of the rawcaudio benchmark from 
Trimaran. (a) The schedule of the Elcor scheduler. (b) The schedule of the proposed 
scheduler. We set up a machine with 4 integer units, 2 float units, 2 memory units, 1 
branch unit and 16 general purpose registers. Operations 140, 239, and 137 use integer 
units, operation 141 uses the branch unit, and operation 278 uses the null resource unit. 
Our proposed scheduler schedules operation 137 at cycle 2, as it is not on the critical 
path, while the Elcor scheduler schedules it at cycle 0. 



















































( a ) ( b )
 
Figure-3: Result of post-pass scheduling for BB 38 of the rawcaudio benchmark from 
Trimaran. The dotted boxes represent the spill code inserted by the Impact register 
allocator. (a) The schedule of the Elcor scheduler. (b) The schedule of the proposed 
scheduler. Our proposed scheduler significantly reduces total dynamic cycles and spill 
code insertion because it reduces the number of simultaneously live ranges, which removes 
unnecessary data dependencies and spill code after pre-pass scheduling. 
 
2. Problem Definition 
This section introduces the basic concepts of instruction scheduling and defines the 
instruction scheduling problem. 
2.1. What is Instruction Scheduling? 
  Instruction scheduling is the process by which a compiler reorders the instructions 
of a program in an attempt to decrease its running time, reduce its code size, 
improve other aspects of the program, or hide the latencies present in modern 
microprocessors, so that a more time-efficient schedule is produced. Scheduling is often 
critical in achieving peak performance from these processors. 
2.2. Different Types of Instruction Schedulers 
There are different types of schedulers, based on the size of the pieces of the 
procedure that they attempt to reorder: basic block schedulers, branch schedulers, 
cross-block schedulers, software pipeliners, trace schedulers and percolation schedulers [12].  
Basic block schedulers reorder the instructions within individual blocks. The form 
of the program flow graph is not changed. The reordering of each block is independent of 
the reordering of other blocks, with the possible exception of some knowledge about 
values computed at the end of a block (or used at the beginning of a block) [2, 13]. 
Cross-block scheduling improves basic block scheduling by considering a tree of 
blocks at once and may move instructions from one block to another [12]. 
Software pipeliners reorder and replicate instructions in loops to eliminate stalls. 
The result of software pipelining is a new loop in which values are being simultaneously 
computed for multiple iterations of the original loop [12, 9, 10]. 
Trace scheduling is an instruction scheduling method developed by Fisher [12, 6, 
2]. A trace is a sequence of instructions, including branches but not loops, that 
is executed for some input data. Trace scheduling uses a basic-block scheduling method 
to schedule the instructions in each entire trace, beginning with the trace with the highest 
execution frequency. Trace schedulers reorder the instructions in a simple path of blocks. 
The paths that are reordered are chosen to be the most frequently executed paths in the 
program. Instructions may be moved to places where the value computed is not 
guaranteed to be used (speculative execution). By reordering these larger sequences of 
instructions, more opportunities can be found for eliminating stalls. 
Percolation scheduling is another aggressive cross-block scheduling method that 
was developed by Nicolau [12, 1]. 
2.3. Instruction scheduling problem 
In local instruction scheduling, instructions are reordered within a basic block, a 
straight-line sequence of code with a single entry point and a single exit point [7]. The 
data dependencies in a basic block can be described by a directed acyclic graph (DAG). 
The leaves of the DAG are the variables occurring as operands in the basic block; the 
inner nodes represent intermediate results. Basic blocks are typically rather small, with 
up to 20 instructions. Nevertheless, scientific programs often contain larger basic 
blocks, due to, e.g., complex arithmetic expressions and array indexing. Larger basic 
blocks can also be produced by compiler techniques such as loop unrolling and trace 
scheduling. A minimum execution time schedule contains the smallest possible number of 
no-ops or idle cycles, thereby utilizing the processor cycles effectively. Therefore, the 
local instruction scheduling problem is to find a minimum length instruction schedule for 
a basic block subject to precedence, latency, and resource constraints [3]. This problem 
becomes more complicated (and more interesting) 
for pipelined processors because of data hazards and structural hazards [17]. 
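The three constraint classes in this problem statement can be made concrete with a 
small checker. The following Python sketch is only an illustration under stated 
assumptions: every operation occupies one unit of its class for exactly its issue 
cycle (fully pipelined units), and the function names, operation names and machine 
description are hypothetical. 

```python
from collections import Counter


def is_valid_schedule(schedule, edges, unit_of, capacity):
    """Check precedence, latency and resource constraints.

    schedule -- dict: operation -> issue cycle
    edges    -- dict: (pred, succ) -> latency in cycles
    unit_of  -- dict: operation -> functional unit class
    capacity -- dict: unit class -> units available per cycle
    """
    # Precedence and latency: succ issues no earlier than pred + latency.
    for (p, s), lat in edges.items():
        if schedule[s] < schedule[p] + lat:
            return False
    # Resources: in every cycle, no unit class is oversubscribed.
    load = Counter((cycle, unit_of[op]) for op, cycle in schedule.items())
    return all(n <= capacity[unit] for (_, unit), n in load.items())


def schedule_length(schedule):
    """Number of cycles from cycle 0 up to and including the last issue cycle."""
    return max(schedule.values()) + 1


# Hypothetical block: three integer operations feeding a branch.
EDGES = {("a", "b"): 1, ("b", "d"): 1, ("c", "d"): 1}
UNIT_OF = {"a": "int", "b": "int", "c": "int", "d": "br"}

good = {"a": 0, "b": 1, "c": 0, "d": 2}
print(is_valid_schedule(good, EDGES, UNIT_OF, {"int": 2, "br": 1}))
print(is_valid_schedule(good, EDGES, UNIT_OF, {"int": 1, "br": 1}))
print(schedule_length(good))
```

The same placement is accepted on a machine with two integer units and rejected on a 
machine with one, because a and c would then compete for the single unit in cycle 0. 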
A data hazard [16] occurs when an instruction i produces a result that is used by a 
following instruction j, so that j's execution must be delayed until i's result is 
available. Whether such a delay is needed depends on the data dependences between 
instructions. There are four cases of data dependence: 
True dependence : If an instruction modifies some resource that is later used by a 
following instruction, then there is a true dependence. In the following example, I2 is true 
dependent on I1 because I1 defines variable x which is then used by I2. 
I1 :  x: = y + z 
I2 :  w: = x + z 
Anti-dependence : If an instruction uses a resource that is later modified by a following 
instruction, then there is an anti-dependence. In the example that follows, I2 is 
anti-dependent on I1 because I1 uses the value of y before I2 redefines y. 
I1 :  x : = y + z 
I2 :  y : = w + z 
Output dependence : If both instructions modify the same resource, then the initial order 
must be preserved so that later instructions see the value written by the second 
instruction. For example, I2 is output dependent on I1 because both I1 and I2 modify the 
same data item x. 
I1 :  x : = y + z 
I2 :  x : = w + z 
Input dependence : If both instructions use the same resource without modifying it, then 
there is no restriction on their order. For example, I1 and I2 are input dependent on 
each other because both use the same data item y without 
redefining it. 
I1 :  x : = y + z 
I2 :  w : = y + v 
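
The four cases can be checked mechanically from the sets of names each instruction defines and uses. The following is a minimal sketch, not part of the original work; the `Instr` record and the helper names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical three-address instruction: names it defines and names it uses."""
    name: str
    defs: frozenset
    uses: frozenset


def instr(name, target, *sources):
    """Build an instruction of the form `target := source op source`."""
    return Instr(name, frozenset({target}), frozenset(sources))


def dependences(first, second):
    """Return the kinds of dependence of `second` on the earlier `first`."""
    kinds = set()
    if first.defs & second.uses:
        kinds.add("true")      # second reads what first wrote
    if first.uses & second.defs:
        kinds.add("anti")      # second overwrites what first read
    if first.defs & second.defs:
        kinds.add("output")    # both write the same name
    if first.uses & second.uses:
        kinds.add("input")     # both only read the same name
    return kinds


def must_stay_ordered(first, second):
    """True, anti and output dependences constrain the order; input does not."""
    return bool(dependences(first, second) - {"input"})


# The true-dependence example from the text: I1: x := y + z, I2: w := x + z
i1 = instr("I1", "x", "y", "z")
i2 = instr("I2", "w", "x", "z")
print(dependences(i1, i2), must_stay_ordered(i1, i2))
```

Because both instructions in that example also read z, the sketch reports an input dependence alongside the true dependence; only the latter forces the order.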
Among the four kinds of data dependence mentioned above, true dependence, 
anti-dependence and output dependence can cause a data hazard. A structural hazard occurs 
when a resource limitation causes an instruction's execution to be delayed. Since the 
general instruction scheduling problem is NP-complete, a number of heuristic methods that 
give approximate solutions have been developed. Among them, list scheduling is the 
dominant method. More advanced techniques, such as trace scheduling and software 
pipelining, typically use list scheduling to perform the actual assignment of operations 
to specific cycles. 
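
For orientation, a generic textbook form of list scheduling can be sketched as follows. This is not the scheduler proposed in this paper. It assumes identical functional units, a latency per operation, and a priority equal to the longest latency-weighted path to a leaf; the function and parameter names are illustrative only.

```python
def list_schedule(succs, latency, units):
    """Greedy cycle-by-cycle list scheduling on `units` identical functional units.

    succs:   dict op -> list of successor ops (the dependence DAG)
    latency: dict op -> cycles until the result of op is available
    Returns a dict op -> issue cycle.
    """
    preds = {op: [] for op in succs}
    for op, outs in succs.items():
        for s in outs:
            preds[s].append(op)

    # Priority: longest latency-weighted path from the op down to a leaf.
    priority = {}

    def height(op):
        if op not in priority:
            priority[op] = latency[op] + max(
                (height(s) for s in succs[op]), default=0)
        return priority[op]

    for op in succs:
        height(op)

    issue = {}
    cycle = 0
    while len(issue) < len(succs):
        # An op is ready once all predecessors have issued and completed.
        ready = [op for op in succs
                 if op not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in preds[op])]
        ready.sort(key=lambda op: -priority[op])
        for op in ready[:units]:      # fill at most `units` slots this cycle
            issue[op] = cycle
        cycle += 1
    return issue


dag = {"a": ["c"], "b": ["c"], "c": []}
lat = {"a": 2, "b": 1, "c": 1}
print(list_schedule(dag, lat, units=1))
```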
3. Related Works  
As mentioned above, the general instruction scheduling problem is NP-complete, so a 
number of approximation methods that give nearly optimal solutions have been developed. 
Instruction scheduling for single-issue and multi-issue processors is NP-complete if 
there is no fixed bound on the maximum latency [7]. Such negative results have led to the 
belief that, in production compilers, one must take a heuristic or approximation approach 
rather than an exact approach to basic block scheduling [12]. In [3], Peter van Beek and 
Kent Wilken present a relatively simple constraint programming approach to instruction 
scheduling which is fast and optimal. Gang Chen and Michael D. Smith [4] propose a 
two-phase approach to pre-pass global instruction scheduling that maintains the 
effectiveness of pre-pass scheduling in exploiting instruction-level parallelism: first 
schedule to obtain an optimal instruction schedule, and then reorganize the schedule to 
reduce register pressure. Wilken et al. [17] show that, through various modeling and 
algorithmic techniques, integer linear programming can be used to produce optimal 
instruction schedules for large basic blocks in a reasonable amount of time. Recently, in 
[14, 11], Walter Lee, Diego Puppin, Shane Swanson and Saman Amarasinghe propose a general 
instruction scheduling framework, convergent scheduling, which simplifies and facilitates 
scheduling heuristics for modern complex processors by offering a set of innovative 
features. 
4. The detailed implementation 
In this section, we present our proposed scheduler for the local instruction 
scheduling problem, which is based on part of convergent scheduling [14, 11]. In the 
convergent scheduling framework, different heuristics and passes are used to improve the 
schedule in different ways. A pass works by manipulating the weight for a specific 
instruction to be scheduled at a specific cycle, in a specific cluster. At the end of the 
algorithm, every instruction is scheduled in the space-time slot with the highest weight, 
which they call the preferred slot. 
Our proposed scheduler also works by manipulating weights: it favors scheduling each 
instruction in as late a cycle as possible while maintaining the optimal schedule length, 
which we assume to be the critical path length of the basic block. This can be more 
efficient for long and narrow graphs, which have few critical paths, so that every 
instruction can be scheduled within the critical path length. If an instruction cannot be 
scheduled within the critical path length because of insufficient functional units, we 
increase the schedule length by one repeatedly until every instruction can be scheduled. 
At the end of our heuristic, every instruction is scheduled in the schedule slot with the 
highest weight. 
A convergent scheduler [14] is composed of independent phases. Each phase 
implements a heuristic that addresses a particular problem such as ILP or register 
pressure. Compared with the convergent scheduler, our proposed scheduler handles both ILP 
and register pressure at the same time, which is more efficient because it does not need 
different phases. Once we have scheduled for ILP, our proposed scheduler automatically 
reduces register pressure by reducing the number of simultaneously live ranges. 
4.1. Common Data Structure 
To implement the proposed scheduler, we use some data structures in common with 
the convergent scheduler as well as new ones. A weight matrix W is used to derive the 
proposed schedule. If i is an instruction and t is a cycle time, every weight $W_{i,t}$ 
lies between 0 and 1. 
 $\forall i, t : 0 \le W_{i,t} \le 1$        ( 1 ) 
The heuristic is based on the earliest completion time (the longest path from the 
root node to the current node) and the latest completion time (the critical path length 
minus the longest path from the current node to a leaf node) of the dependence graph, as 
depicted in Figure-4. An instruction in the middle of the dependence graph can be 
scheduled after its predecessors or before its successors. If le is the earliest 
completion time and ll is the latest completion time, the instruction can be scheduled 
only in the time slots between le and ll. If the instruction cannot be scheduled between 
le and ll due to insufficient functional units, ll is increased by one and the scheduling 
process is restarted. The weight matrix is first initialized by assigning zero to the 
weights of each operation outside the range from le to ll, and the average weight, which 
is one divided by the total number of operations N, to the weights inside that range. 
 for each i, t with $t < \mathit{le} \lor t > \mathit{ll}$ :  $W_{i,t} \leftarrow 0$     ( 2 ) 
 for each i, t with $\mathit{le} \le t \le \mathit{ll}$ :  $W_{i,t} \leftarrow 1 / N$      ( 3 ) 
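
The window computation and the initialization in equations (2) and (3) can be sketched as follows. This is an illustration under stated assumptions: every dependence edge has unit latency, and the small graph at the end is hypothetical, chosen only so that its windows match the weight matrix of the Figure-4 example. The function names are not from the original implementation.

```python
def schedule_windows(succs):
    """Return {op: (le, ll)} for a dependence DAG given as op -> successors.

    le is the longest path from a root to op; ll is the critical path
    length minus the longest path from op to a leaf (unit latencies assumed).
    """
    preds = {op: [] for op in succs}
    for op, outs in succs.items():
        for s in outs:
            preds[s].append(op)

    def longest(op, edges, memo):
        if op not in memo:
            memo[op] = max(
                (1 + longest(n, edges, memo) for n in edges[op]), default=0)
        return memo[op]

    depth, height = {}, {}
    for op in succs:
        longest(op, preds, depth)     # longest path from a root to op
        longest(op, succs, height)    # longest path from op to a leaf

    critical = max(depth.values())
    return {op: (depth[op], critical - height[op]) for op in succs}


def initial_weights(windows):
    """Equations (2) and (3): 1/N inside [le, ll], zero outside."""
    n_ops = len(windows)
    n_cycles = max(ll for _, ll in windows.values()) + 1
    return {op: [1.0 / n_ops if le <= t <= ll else 0.0
                 for t in range(n_cycles)]
            for op, (le, ll) in windows.items()}


# Hypothetical graph: a chain of three operations plus one independent operation.
graph = {"op140": ["op239"], "op239": ["op141"], "op141": [], "op137": []}
win = schedule_windows(graph)
print(win)
print(initial_weights(win))
```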
We give more weight to a specific instruction being scheduled in a given cycle by 
multiplying that weight by a constant value. Then, we normalize the weights to restore 
the invariant. 
 for each i, t :  $W_{i,t} \leftarrow W_{i,t} / \sum_{t'} W_{i,t'}$      ( 4 ) 
After normalization, the sum of the weights $W_{i,t}$ over all cycle times t between le 
and ll is one for every instruction i. 
 $\forall i : \sum_{t} W_{i,t} = 1$        ( 5 ) 
Then, the schedule time for each operation is selected by choosing the cycle time with 
the maximum weight. 
 for each i :  $\mathit{schedule\_time}(i) = \arg\max_{t} W_{i,t}$    ( 6 ) 
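
Equations (4) to (6) can be traced on the small weight matrix of the Figure-4 example. The sketch below is illustrative: the helper names are not from the original implementation, and the starting weights are the already boosted values (0.25 multiplied by 1.2) listed in that example.

```python
def normalize(weights):
    """Scale each operation's row so its weights sum to one (equations 4 and 5)."""
    for row in weights.values():
        total = sum(row)
        if total > 0:
            row[:] = [w / total for w in row]


def preferred_cycles(weights):
    """Pick, for each operation, the cycle with the largest weight (equation 6)."""
    return {op: max(range(len(row)), key=row.__getitem__)
            for op, row in weights.items()}


# Weights after boosting, one row per operation, one column per cycle.
weights = {
    "op140": [0.3, 0.0, 0.0],
    "op239": [0.0, 0.3, 0.0],
    "op141": [0.0, 0.0, 0.3],
    "op137": [0.25, 0.25, 0.3],
}
normalize(weights)
schedule = preferred_cycles(weights)
print(schedule)
```

For op 137 the row total is 0.8, so the normalized weights are 0.3125, 0.3125 and 0.375, and cycle 2 is selected.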






(b)       cycle 0   cycle 1   cycle 2
op 140    0.25      0         0
op 239    0         0.25      0
op 141    0         0         0.25
op 137    0.25      0.25      0.25

(c)       cycle 0   cycle 1   cycle 2
op 140    0.3       0         0
op 239    0         0.3       0
op 141    0         0         0.3
op 137    0.25      0.25      0.3

(d)       cycle 0   cycle 1   cycle 2   Total
op 140    0.3       0         0         0.3
op 239    0         0.3       0         0.3
op 141    0         0         0.3       0.3
op 137    0.25      0.25      0.3       0.8

(e)       cycle 0   cycle 1   cycle 2   Total   Schedule
op 140    1         0         0         1       cycle 0
op 239    0         1         0         1       cycle 1
op 141    0         0         1         1       cycle 2

Figure-4: Example weight matrix calculation. (a) Data dependence graph with earliest 
completion time and latest completion time, (le, ll), of BB 38 of the rawcaudio benchmark 
from Trimaran. (b) Initialization of the weight matrix by dividing one by the total number 
of operations; op 278 uses no resource (null), so only 4 operations need to be considered: 
op 140, op 239, op 141 and op 137. (c) Giving more weight to a specific cycle for each 
functional unit, starting from the latest completion time: 0.25 multiplied by 1.2. 
(d) Calculating the total weight over all cycles for each operation. (e) Normalization of 
the weight matrix: each weight is divided by the total weight. After normalization, the 
total weight over all cycles for each operation is one. 

The pseudo code of our proposed scheduler's heuristic is roughly as follows.  
finish_schedule = 0  /* flag: set to 1 when all operations are scheduled */ 
/* loop until all operations have been scheduled */  
While finish_schedule is equal to 0   
    Let N be the total number of operations within a basic block  
    For i ← 0 to N-1 do  /* loop over all operations */  
        For j ← 0 to N-1 do  /* loop over all cycle slots */  
            Wi,j = 0  /* initialize with zero */  
        End for 
    End for 
    For i ← 0 to N-1 do  /* loop over all operations */  
        For j ← le to ll do  /* loop between le and ll */ 
            Wi,j = 1/N  /* divide by total operations */  
        End for  
    End for 
 
    /* Give more weight to a specific cycle */  
    Let R be the resource list array 
    Let Nr be the number of current resources 
    Let Ms be the maximum number of resources  
    Let S be the schedule flag array - 0 for unscheduled, 1 for scheduled  
    Initialize schedule flag array S with 0  
    While (ll ≥ 0) do  /* start scheduling from the latest completion time */  
        /* loop over all operations within a basic block */  
        For i ← 0 to N-1 do  
            If (R[i] is in resource list and Nr < Ms and S[i] is zero 
                    and Wi,ll is not empty) 
                /* give more weight to specific cycle */  
                Wi,ll = 1.2 * Wi,ll 
                S[i] = 1  /* set schedule flag to 'already scheduled' */  
                Increase Nr by 1  /* increase number of current resources */  
                /* check whether the number of current resources has reached 
                   the maximum number of resources */  
                If Nr = Ms 
                    Decrease ll by 1  /* move to previous cycle of ll */  
                    Nr = 0  /* set number of current resources to zero */  
                End if  
            End if 
        End for 
        Decrease ll by 1  /* move to previous cycle of ll */  
        Nr = 0  /* set number of current resources to zero */  
    End while 
 
    /* Check whether all operations can be scheduled between le and ll */  
    finish_schedule = 1 
    For i ← 0 to N-1 do  /* loop over all operations */  
        If S[i] is not equal to 1  /* schedule flag is 'not scheduled' */  
            finish_schedule = 0  /* set finish schedule to 0 */ 
        End if 
    End for 
    If finish_schedule is equal to 0 
        For i ← 0 to N-1 do  /* loop over all operations */  
            Increase ll by 1  /* increase latest completion time by 1 */  
            S[i] = 0  /* set schedule flag to 0 */  
        End for 
    End if 
End while 
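
One possible reading of this loop is sketched below. It is an interpretation, not the original implementation. It assumes one (le, ll) window per operation, `units` identical functional units per cycle (Ms in the pseudo code), and that the windows already encode the dependence order, so only window membership and the resource limit are checked. All names are illustrative.

```python
def boost_latest_slots(windows, units, factor=1.2):
    """Boost one cycle per operation, filling cycles from the latest backwards.

    windows: {op: (le, ll)}; units: operations that fit in one cycle.
    Returns (weights, windows); windows may have been widened.
    """
    if units < 1:
        raise ValueError("need at least one functional unit")
    windows = dict(windows)
    n_ops = len(windows)
    while True:
        last = max(ll for _, ll in windows.values())
        # Initialize: 1/N inside each window, zero outside.
        weights = {op: [1.0 / n_ops if le <= t <= ll else 0.0
                        for t in range(last + 1)]
                   for op, (le, ll) in windows.items()}
        scheduled = set()
        for cycle in range(last, -1, -1):      # start from the latest cycle
            used = 0
            for op in windows:
                if used == units:
                    break                      # this cycle is full
                if op not in scheduled and weights[op][cycle] > 0:
                    weights[op][cycle] *= factor
                    scheduled.add(op)
                    used += 1
        if len(scheduled) == n_ops:
            return weights, windows
        # Some operation did not fit: extend every window by one cycle, retry.
        windows = {op: (le, ll + 1) for op, (le, ll) in windows.items()}


win = {"op140": (0, 0), "op239": (1, 1), "op141": (2, 2), "op137": (0, 2)}
boosted, final_windows = boost_latest_slots(win, units=2)
print(boosted["op137"], final_windows)
```

With two units the windows of the Figure-4 example are sufficient and op 137 is boosted in cycle 2. With a single unit the sketch widens every window by one cycle before all four operations fit.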
 
Then, we normalize as follows:  
For i ← 0 to N-1 do  /* loop over all operations within a basic block */  
    For j ← le to ll do  /* loop over all cycles between le and ll */  
        Wi,ll+1 = Wi,ll+1 + Wi,j  /* sum all weights for each instruction */  
    End for 
End for 
For i ← 0 to N-1 do  /* loop over all operations within a basic block */  
    For j ← le to ll do  /* loop over all cycles between le and ll */  
        Wi,j = Wi,j / Wi,ll+1  /* each weight divided by total weights */  
    End for 
End for 
 
The cycle time with maximum weight is selected as the schedule time for each operation. 
We find the schedule cycle for each operation as follows:  
 
Let St be the schedule cycle time array for each operation  
Let Mw be the maximum weight for the current operation  
For i ← 0 to N-1 do  /* loop over all operations within a basic block */  
    Mw = 0  /* reset maximum weight to zero for each operation */  
    For j ← le to ll do  /* loop over all cycles between le and ll */  
        If ( Wi,j > Mw ) 
            Mw = Wi,j  /* maximum weight will be stored */  
            St[i] = j  /* the cycle with the maximum weight will be stored */  
        End if 
    End for 
End for 
 
5. Experimental evaluation 
This section presents an empirical comparison of our proposed scheduler and the Trimaran 
scheduler. 
5.1. Methodology  
 We use Trimaran's simulator to measure the performance of our scheduler. The 
Trimaran framework includes a simulator that generates various statistics such as compute 
cycles and the total number of operations. The simulator converts the Rebel code into 
executable code and emulates its execution on a virtual HPL-PD processor. We implement 
our scheduler and insert it into the prepass scheduling stage of the Trimaran 
infrastructure. To perform the simulation, we first compile a version with Trimaran's 
scheduler and then a version with our scheduler. We use a number of integer benchmarks in 
our experiments. In all our experiments, we use only one pass to obtain the optimal 
instruction schedule. 
5.2. Results and discussion 
 Table-1 and Table-2 compare Trimaran's scheduler (labelled Elcor in the tables) 
with our proposed scheduler. Figure-5 and Figure-6 show the average reduction of total 
dynamic cycles and of register allocation overhead relative to Elcor. The results suggest 
that, when there are enough functional units, our proposed scheduler can reduce the total 
dynamic cycles and the dynamic register allocation overhead of benchmarks whose basic 
blocks have long and narrow data dependence graphs. The reduction in total dynamic cycles 
ranges from 0.0004% (test-install) to 39.7% (fib_mem), and the reduction in dynamic 
register allocation overhead ranges from 0% (Dag) to 45.5% (fib_mem). Spill code 
insertion and unnecessary dependencies are avoided because the proposed scheduler 
decreases the number of simultaneously live ranges. In addition, avoiding spill code 
shortens the schedule length after register allocation. 
 
Total Dynamic Cycles Comparison Summary 
Benchmarks      Elcor       New Scheduler   Reduction (%)
Eight           5432        4893            9.9227
fib_mem         239         144             39.7490
Mm              195353      168051          13.9757
Mm_int          364015      291556          19.9055
Dag             5592        4892            12.5179
Sqrt            3616        3608            0.2212
Rawcaudio       14740574    13707490        7.0084
test-install    750064      750061          0.0004
Table-1: Total dynamic cycles comparison on several integer benchmarks which have basic 
blocks with long and narrow data dependence graphs. 
 
Total Dynamic Register Allocation Overhead 
Benchmarks      Elcor       New Scheduler   Reduction (%)
Eight           1363        823             39.6185
fib_mem         167         91              45.5090
Mm              144871      143229          1.1334
Mm_int          291557      243557          16.4633
Dag             2439        2439            0.0000
Sqrt            1621        1619            0.1234
Rawcaudio       8049749     8049155         0.0074
test-install    42          38              9.5238
Table-2: Total dynamic register allocation overhead on several integer benchmarks which 
have basic blocks with long and narrow data dependence graphs. 
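
The reduction column in both tables is the relative difference between the Elcor figure and the new scheduler's figure. A short sketch of that arithmetic, using the Table-1 rows as input, is given below; the function name is illustrative.

```python
def reduction_percent(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# (Elcor, new scheduler) total dynamic cycles, copied from Table-1.
total_cycles = {
    "Eight": (5432, 4893),
    "fib_mem": (239, 144),
    "Mm": (195353, 168051),
    "Mm_int": (364015, 291556),
    "Dag": (5592, 4892),
    "Sqrt": (3616, 3608),
    "Rawcaudio": (14740574, 13707490),
    "test-install": (750064, 750061),
}

for name, (elcor, new) in total_cycles.items():
    print(f"{name:<14}{reduction_percent(elcor, new):8.4f}")
```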
 
Figure-5: Average reduction of total dynamic cycles. 

6. Conclusion 
In this paper, we have proposed a new instruction scheduler that reduces the 
number of simultaneously live ranges. The results show that our scheduler is efficient 
and reasonably effective for a machine which has enough functional units. Compared with 
the Elcor scheduler, our scheduler reduces total dynamic cycles by up to 39.7% and total 
register allocation overhead by up to 45.5% on benchmarks which have basic blocks with 
long and narrow data dependence graphs. There is still room for improvement in terms of 
both efficiency and effectiveness. As future work, to improve the effectiveness of our 
scheduler, we can try to find an efficient heuristic for fat and parallel data dependence 
graphs. Furthermore, we can also try to develop an efficient scheduler for global 
instruction scheduling. A combination of instruction scheduling and register allocation 
can also be included. 
Acknowledgement 
We thank the authors of Convergent Scheduling [14, 11] for kindly providing us with the 
software, thesis paper and helpful explanations on how it works. 
References 
[1] A. Aiken and A. Nicolau. A development environment for horizontal microcode. 
IEEE Transactions on Software Engineering, 14(5):584-594, May 1988. 
[2] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge,  
MA, 1986. 
[3] Peter van Beek, Kent Wilken. Fast optimal instruction scheduling for single-issue 
processors with arbitrary latencies. 7th International Conference on Principles and 
Practice of Constraint Programming (CP2001), Paphos, Cyprus, 26 November - 1 
December 2001. 
[4] Gang Chen, Michael D. Smith. Reorganizing global schedules for register allocation. 
Proceedings of the 13th international conference on Supercomputing, May 1999. 
[5] Gang Chen. Effective Instruction Scheduling with Limited Registers. Ph.D. thesis, 
Harvard University, Division of Engineering and Applied Sciences, March 2001. 
[6] J. A. Fisher. Trace scheduling: A technique for global microcode compaction, IEEE 
Transactions on Computers, 30(7):478-490, July 1981. 
[7] J. L. Hennessy and Thomas Gross. Postpass code optimization of pipeline constraints. 
ACM Transactions on Programming Languages and Systems, 5(3):422-448, July 1983. 
[8] Intel, Intel Itanium Architecture Software Developer's Manual, Vol. 1, October 2002. 
[9] S. Jain. Circular scheduling: A new technique to perform software pipelining. In 
Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design 
and Implementation, pages 219-228, June 1991. 
[10] Monica Lam. Software pipelining: An effective scheduling technique for VLIW 
machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language 
Design and Implementation. Atlanta, Georgia, June 1988. 
[11] Walter Lee, Diego Puppin, Shane Swanson, Saman Amarasinghe. Convergent 
Scheduling. In Proceedings of the 35th Annual International Symposium on 
Microarchitecture (MICRO), Istanbul, Turkey, November 2002. 
[12] Steve Muchnick. Advanced compiler design and implementation. San Francisco, 
Calif. : Morgan Kaufmann Publishers, 1997. 
[13] Robert Morgan. Building an optimizing compiler. Boston : Butterworth-Heinemann, 
1998. 
[14] Diego Puppin. Convergent Scheduling: A Flexible and Extensible Scheduling 
Framework for Clustered VLIW Architectures, MIT Press, Cambridge, SM thesis, 2002. 
[15] http://www.trimaran.org 
[16] J. Warren. A hierarchical basis for reordering transformations. In Proceedings of the 
Eleventh Annual ACM Symposium on Principles of Programming Languages, pages 
272-282, 1984. 
[17] K. Wilken, J. Liu, and M. Heffernan. Optimal instruction scheduling using integer 
programming. In Programming Language Design and Implementation, pages 121-133. 
ACM SIGPLAN, June 2000. 
 
 
