Abstract. A graph-coloring register allocator that optimally allocates registers for structured programs in polynomial time is presented. It can handle register aliasing. The assignment of registers is optimal with respect to spill and rematerialization costs, register preferences and coalescing. The register allocator is not restricted to programs in SSA form or chordal interference graphs. It requires the input program to be structured, which is automatically true for many programming languages and for others, such as C, is equivalent to a bound on the number of goto labels per function. The number of registers is assumed to be fixed. Non-structured programs can be handled at the cost of either a loss of optimality or an increase in runtime. A prototype implementation is already the default register allocator in some backends of a major cross-compiler for embedded systems.
Introduction
Compilers map variables to physical storage space in a computer. The problem of deciding which variables to store into which registers or into memory is called register allocation.
Register allocation is one of the most important stages in a compiler. Due to the ever-widening gap in speed between registers and memory, the minimization of spill costs is of utmost importance. For CISC architectures, such as the ubiquitous x86, register aliasing (i.e. multiple register names mapping to the same physical hardware and thus not being usable at the same time) and register preferences (e.g. due to certain instructions taking a different amount of time depending on which registers the operands reside in) have to be handled to generate good code. Coalescing (eliminating moves by assigning variables to the same register if they do not interfere but are related by a copy instruction) is another aspect in which register allocation can have a significant impact on code size and speed.
Our approach is based on graph coloring and assumes the number of registers to be fixed. It can handle arbitrarily complex register layouts, including all kinds of register aliasing. Register preferences, coalescing and spilling are handled using a cost function. Different optimization goals, such as code size, speed, energy consumption, or some aggregate of them, can be handled by choice of the cost function. Virtually all programs are structured, and for these the register allocator has polynomial runtime. This is the first optimal approach that has polynomial runtime and works for such a large class of programs.
Chaitin's classic approach to register allocation [6] uses graph coloring. The approach assumes k identical registers, identical spill cost for all variables, and does not handle register preferences or coalescing. Solving this problem optimally is equivalent to finding a maximal k-colorable subgraph in the interference graph of the variables and coloring it. In general this is NP-hard [7]. Even when it is known that a graph is k-colorable, it is NP-hard to find a k-coloring compatible with a fraction of $1 - \frac{1}{33k}$ of the edges [17]. Thus Chaitin's approach uses heuristics instead of optimally solving the problem. It has been generalized to more complex architectures [26].
The maximum k-colorable subgraph problem for fixed k can be solved optimally in polynomial time for chordal interference graphs [24, 30] , which can be obtained when the input programs are in static single assignment (SSA) form [20] .
Recent approaches have modelled register allocation as an integer linear programming (ILP) problem, resulting in optimal register allocation for all programs [16, 14] . However ILP is NP-hard, and the ILP-based approaches tend to have far worse runtime compared to graph coloring.
Linear scan register allocation [25] has become popular for just in time compilation [12] ; it is typically faster than approaches based on graph coloring, but the assignment is further away from optimality.
Thorup [27] uses the bounded tree-width of structured programs to approximate an optimal coloring of the intersection graph by a constant factor. Bodlaender et alii [3] present an algorithm that decides in linear time if it is possible to allocate registers for a structured program without spilling.
Section 2 introduces the basic concepts, including structured programs. Section 3 presents the register allocator in its generality and shows its polynomial runtime. Section 4 discusses further aspects of the allocator, including ways to reduce the practical runtime and how to handle non-structured programs. Section 5 discusses the complexity of register allocation and why certain NP-hardness results do not apply in our setting. Section 6 presents the prototype implementation and experimental results. Section 8 concludes and mentions possible directions for future work.
Structured Programs
Let r be the number of registers. Let [r] := {0, ..., r − 1} be the set of registers.
Definition 1. Let V be a set of variables. An assignment of variables V to registers [r] is a function f : U → [r], U ⊆ V. The assignment is valid if it is possible to generate correct code for it, which implies that no conflicting variables are assigned to the same register.
Variables in V \ U are to be placed in memory (spilt) or removed and their value recalculated as needed (rematerialized).
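Definition 1 can be illustrated with a small sketch. It assumes the simplest setting: no register aliasing, and an explicitly given list of conflicting variable pairs. The names `is_conflict_free` and `spilled_or_rematerialized` are hypothetical, and the check covers only the necessary condition named in the definition, not full validity.

```python
def is_conflict_free(assignment, conflicts):
    """Necessary condition for validity: no two conflicting
    variables are assigned to the same register.

    assignment: dict mapping the variables in U to register numbers
    conflicts:  iterable of pairs (u, v) of conflicting variables
    """
    for u, v in conflicts:
        if u in assignment and v in assignment and assignment[u] == assignment[v]:
            return False
    return True


def spilled_or_rematerialized(variables, assignment):
    """The variables in V \\ U, which get no register."""
    return set(variables) - set(assignment)


# Illustrative data (assumed, not taken from the paper).
V = {"a", "b", "c"}
E = [("a", "b"), ("b", "c")]
f = {"a": 0, "b": 1}  # U = {a, b}; c is left out of U
```

Here `f` passes the check, and `c` is the only variable that has to be spilt or rematerialized.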
Definition 2 (Register allocation). Let the number of available registers be fixed. Given an input program containing variables and their live ranges, and a cost function that gives costs for register assignments, the problem of register allocation is to find an assignment of variables to the registers that minimizes the total cost.
Definition 3 (Tree decomposition). Given a graph G = (Π, K), a tree decomposition of G is a pair (T, X) of a tree T and a family X = {X_i | i node of T} of subsets of Π with the following properties:
- The union of all X_i, i node of T, is Π.
- For each edge {x, y} ∈ K there is a node i of T with {x, y} ⊆ X_i.
- For all nodes i, j, l of T: if j lies on the path from i to l in T, then X_i ∩ X_l ⊆ X_j.
The width of a tree decomposition (T, X ) is max{|X i | | i node of T } − 1. The tree-width tw(G) of a graph G is the minimum width of all tree decompositions of G.
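The following sketch checks a candidate tree decomposition and computes its width. It assumes the three standard conditions (every vertex is in some bag, every edge is inside some bag, and the bags containing a given vertex form a connected subtree) and assumes that `tree_edges` really describes a tree, which is not verified. All function and variable names are hypothetical.

```python
from collections import defaultdict


def is_tree_decomposition(vertices, edges, tree_edges, bags):
    """bags maps each tree node i to its set X_i of graph vertices."""
    vertices = set(vertices)
    # Condition 1: the bags cover exactly the vertex set.
    if set().union(*bags.values()) != vertices:
        return False
    # Condition 2: each graph edge lies inside at least one bag.
    for x, y in edges:
        if not any(x in bag and y in bag for bag in bags.values()):
            return False
    # Condition 3: the tree nodes whose bags hold v are connected in T.
    adjacent = defaultdict(set)
    for a, b in tree_edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    for v in vertices:
        holders = {i for i, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in adjacent[i] & holders:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != holders:
            return False
    return True


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# Control-flow graph of an if-then-else diamond (illustrative example).
CFG_VERTICES = [0, 1, 2, 3]
CFG_EDGES = [(0, 1), (0, 2), (1, 3), (2, 3)]
BAGS = {"A": {0, 1, 2}, "B": {1, 2, 3}}
TREE_EDGES = [("A", "B")]
```

For the diamond, the two bags form a decomposition of width 2. Splitting the vertices into `{0, 1}` and `{2, 3}` fails the edge condition.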
Definition 4 (Structured program).
Let k ∈ N be fixed. A program is called k-structured, iff its control-flow graph has tree-width at most k.
Programs written in Algol or Pascal are 2-structured, Modula-2 programs are 5-structured, programs written in C are (6 + g)-structured if the number of labels targeted by gotos per function does not exceed g [27] . Similarly, Java programs are (6+g)-structured if the number of loops targeted by labeled breaks and labeled continues per function does not exceed g [18] . Ada programs are (6 + g)-structured if the number of labels targeted by gotos and labeled loops per function does not exceed g [5] .
A survey looking at 12522 Java methods from applications and the standard library found tree-width above 3 to be very rare. With one exception (of tree-width 5) all methods had tree-width 4 or lower; the average tree-width was about 2.7 [19].
Definition 5 (Nice Tree Decomposition). A tree decomposition (T, X) of a graph G is called nice, iff
- T is oriented, with root t, X_t = ∅.
- Each node i of T is of one of the following types:
  • Leaf node, no children
  • Introduce node, has one child j, X_j ⊊ X_i
  • Forget node, has one child j, X_j ⊋ X_i
  • Join node, has two children j_1, j_2, X_i = X_{j_1} = X_{j_2}
From now on let G = (Π, K) be the control-flow graph of the program, and let I = (V, E) be the corresponding conflict graph of the variables of the program (i.e. the intersection graph of the variables' live ranges). Let (T, X) be a rooted tree decomposition of minimum width of G with root t. There are intelligent approaches to root choice [23], but for the following an arbitrary choice is sufficient. For π ∈ Π let V_π be the set of all variables v ∈ V that are alive at π.
The goal in register allocation is to minimize costs, including spill and rematerialization costs, costs from not respecting register preferences, costs from not coalescing, etc.
These costs are modelled by a cost function that gives costs for an instruction π under register assignment f :
Different optimization goals, such as speed or code size can be implemented by choosing c. E. g. when optimizing for code size c could give the code size for π under assignment f , or when optimizing for speed c could give the number of cycles π needs to execute multiplied by an execution probability obtained from a profiler. We assume that c can be evaluated in constant time.
The goal is thus finding an f for which $\sum_{\pi \in \Pi} c(\pi, f)$ is minimal. We define a function s that gives the minimum possible cost for the instructions in the subtree rooted at i ∈ T, excluding the instructions in X_i, when assigning the variables alive in the subtree rooted at i and choosing f : U → [r], U ⊆ V, as the assignment of the variables alive at the instructions X_i ⊆ Π to registers.
We define s inductively, and depending on the type of i:
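The displayed definition is not reproduced here. One reconstruction that is consistent with the four cases of the correctness proof and with the runtime analysis is sketched below. It should be read as an assumption rather than as the original formulation. The notation $V_i := \bigcup_{\pi \in X_i} V_\pi$ for the variables alive at the instructions of $X_i$, and the restriction $g|_{W}$ of an assignment $g$ to a variable set $W$, are introduced for this sketch only.

```latex
\begin{align*}
\text{Leaf:} \quad
  & s(i, f) := 0 \\
\text{Introduce, child } j\text{:} \quad
  & s(i, f) := s\bigl(j, f|_{V_j}\bigr) \\
\text{Forget, child } j\text{:} \quad
  & s(i, f) := \min\Bigl\{\, s(j, g) + \sum_{\pi \in X_j \setminus X_i} c\bigl(\pi, g|_{V_\pi}\bigr)
      \;\Bigm|\; g|_{V_i} = f \,\Bigr\} \\
\text{Join, children } j_1, j_2\text{:} \quad
  & s(i, f) := s(j_1, f) + s(j_2, f)
\end{align*}
```

In this form, the cost of an instruction is charged exactly once, at the forget node where the instruction leaves the bag, and the value at the root t with $X_t = \emptyset$ is the minimum total cost.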
By calculating all the s(i, f ) and recording which g gave the minimum we can obtain an optimal assignment. We will show that s correctly gives the minimum possible cost and that it can be calculated in polynomial time.
Lemma 1. s(i, f) is the minimum possible cost for the instructions in the subtree rooted at i ∈ T, excluding the instructions in X_i, when assigning the variables alive in the subtree rooted at i and choosing f as the assignment of the variables alive at the instructions X_i ⊆ Π to registers. Using standard bookkeeping techniques we obtain the corresponding assignments for the subtree.
Proof. By induction we can assume that the lemma is true for all children of i. Let T i be the set of instructions in the subtree rooted at i ∈ T , excluding instructions in X i .
Case 1: i is a leaf. There are no instructions in T_i = X_i \ X_i = ∅, thus the cost is 0.
Case 2: i is an introduce node with child j. T_i = T_j, since X_j ⊆ X_i, thus the cost remains the same.
Case 3: i is a forget node with child j. T_i = T_j ∪ (X_j \ X_i), and the union is disjoint. Thus we get the correct result by adding the costs for the instructions in X_j \ X_i.
Case 4: i is a join node with children j_1 and j_2. T_i = T_{j_1} ∪ T_{j_2}, and the union is disjoint. Thus we get the correct result by adding the costs from both subtrees.
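The four cases of the proof translate directly into a dynamic program over the nice tree decomposition. The sketch below is a brute-force illustration under simplifying assumptions, not the prototype's implementation: there is no register aliasing, two variables conflict exactly when they are alive at the same instruction, assignments are enumerated exhaustively, and only the optimal cost (not the assignment itself) is returned. The node encoding, the helper names and the unit spill cost are all hypothetical.

```python
from itertools import product


def restrict(f, variables):
    """Restrict assignment f (frozenset of (variable, register) pairs)."""
    return frozenset((v, reg) for v, reg in f if v in variables)


def alive(bag, live):
    """Variables alive at some instruction of the bag."""
    return set().union(*(live[p] for p in bag))


def assignments(bag, live, r):
    """All conflict-free partial assignments of the variables alive in bag."""
    variables = sorted(alive(bag, live))
    for regs in product([None, *range(r)], repeat=len(variables)):
        f = {v: reg for v, reg in zip(variables, regs) if reg is not None}
        ok = True
        for p in bag:
            used = [f[v] for v in live[p] if v in f]
            if len(set(used)) != len(used):  # two live variables share a register
                ok = False
        if ok:
            yield frozenset(f.items())


def solve(nodes, root, live, r, cost):
    """Minimum total cost, computed as s(root, empty assignment)."""
    memo = {}

    def s(i, f):
        if (i, f) in memo:
            return memo[i, f]
        kind, bag, children = nodes[i]
        if kind == "leaf":
            value = 0
        elif kind == "introduce":
            (j,) = children
            value = s(j, restrict(f, alive(nodes[j][1], live)))
        elif kind == "forget":
            (j,) = children
            child_bag = nodes[j][1]
            here = alive(bag, live)
            value = min(
                (s(j, g) + sum(cost(p, restrict(g, live[p])) for p in child_bag - bag)
                 for g in assignments(child_bag, live, r)
                 if restrict(g, here) == f),
                default=float("inf"),  # f cannot be extended to the child
            )
        else:  # join node: both children share the bag of i
            j1, j2 = children
            value = s(j1, f) + s(j2, f)
        memo[i, f] = value
        return value

    return s(root, frozenset())


# Straight-line program with four instructions (illustrative data).
LIVE = {0: {"a"}, 1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"c"}}
NODES = {
    "n0": ("leaf", frozenset({0}), ()),
    "n1": ("introduce", frozenset({0, 1}), ("n0",)),
    "n2": ("forget", frozenset({1}), ("n1",)),
    "n3": ("introduce", frozenset({1, 2}), ("n2",)),
    "n4": ("forget", frozenset({2}), ("n3",)),
    "n5": ("introduce", frozenset({2, 3}), ("n4",)),
    "n6": ("forget", frozenset({3}), ("n5",)),
    "root": ("forget", frozenset(), ("n6",)),
}


def spill_cost(p, g):
    """Hypothetical cost: one unit per variable alive at p without a register."""
    return len(LIVE[p]) - len(g)
```

With two registers, one of the three variables alive at instruction 2 has to be spilt, and `solve(NODES, "root", LIVE, 2, spill_cost)` evaluates to 2. With three registers the cost drops to 0. The example tree contains no join node, so that branch is shown for completeness only.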
Lemma 2. Using the tree-decomposition of minimum width, s can be calculated in polynomial time.
Proof. Each V_π, π ∈ Π, is the union of two cliques: the variables alive at the start of the instruction form a clique, and so do the variables alive at the end of the instruction. Since |X_i| ≤ tw(G) + 1, the set of variables alive at the instructions in X_i is thus the union of at most 2(tw(G) + 1) cliques for each node i of T. From each clique at most r variables can be placed in registers.
At each node i of the tree decomposition, time $O(|V|^{2(\mathrm{tw}(G)+1)r})$ is sufficient:
Case 1: i is a leaf. There are at most $O(|V|^{2|X_i|r}) \subseteq O(|V|^{2(\mathrm{tw}(G)+1)r})$ possible f, and for each one we do a constant number of calculations.
Case 2: i is an introduce node with child j. The reasoning from case 1 holds.
Case 3: i is a forget node with child j. There are at most $O(|V|^{2|X_i|r})$ possible f. For each one we need to consider at most $O(|V|^{2(|X_j|-|X_i|)r})$ possible g.
Case 4: i is a join node with children j_1 and j_2. The reasoning from case 1 holds.
The tree decomposition has at most O(|Π|) nodes, thus the total time is in
$O(|\Pi| \cdot |V|^{2(\mathrm{tw}(G)+1)r}) = O(|\Pi| \cdot |V|^{c})$
for a constant c and thus polynomial.
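To give a feel for the size of the exponent in the per-node bound $O(|V|^{2(\mathrm{tw}(G)+1)r})$, the following worked example plugs in illustrative values. The choice tw(G) = 2 and r = 9 is an assumption made here for illustration only: 2 is the bound stated above for Algol or Pascal programs, and 9 is the number of 8-bit registers assigned in the Z80 prototype described later. It is not a figure reported by the experiments.

```latex
\[
  2\,(\mathrm{tw}(G)+1)\,r \;=\; 2 \cdot (2+1) \cdot 9 \;=\; 54,
  \qquad\text{so each node needs time } O\bigl(|V|^{54}\bigr).
\]
```

The bound is polynomial for fixed tw(G) and r, but it is a worst-case figure whose degree grows linearly in both parameters.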
Theorem 1. The register allocation problem can be solved in polynomial time for structured programs.
Proof. Since the control-flow graph of the input program has bounded tree-width, a tree decomposition of minimum width can be calculated in linear time [2]. Using this tree decomposition, s is calculated in polynomial time. The total runtime is thus in $O(|\Pi| \cdot |V|^{2(\mathrm{tw}(G)+1)r})$.
Remarks
Remark 1. Bodlaender's algorithm [2] used in the proof above is not a practical option. However, there are other, more practical alternatives, including a linear-time algorithm that is not guaranteed to give decompositions of minimal width, but will do so for many programming languages [27, 9].
Remark 2. Implementations of the algorithm can be massively parallel, resulting in linear runtime.
Proof. At each i ∈ T the individual s(i, f) do not depend on each other. They can be calculated in parallel. By requiring that |X_j| = |X_i| + 1 at forget nodes, we can assume that the number of different g to consider is at most $|V|^{2r}$, resulting in time O(r) for calculating the minimum over the s(j, g). Thus, given enough processing elements, the runtime of the algorithm can be reduced to O(|Π|r).
Remark 3. Non-structured programs can be handled at the cost of either a loss of optimality or an increase in runtime.
Programs of high tree-width are extremely uncommon (none have been found so far, with the exception of artificially constructed examples). Nevertheless they should be handled correctly by compilers.
One approach would be to handle all these programs like all others. Since tw(G) is no longer constant, the algorithm is no longer guaranteed to have polynomial runtime.
Where polynomial runtime is essential, a preprocessing step can be used. This preprocessing stage would spill some variables (or allocate them using one of the existing heuristic approaches). Edges of G, at which no variables are alive, can be removed. Once enough edges have been removed tw(G) ≤ k and our approach can be applied to allocate the remaining variables.
Remark 4. The runtime of the polynomial time algorithm can be reduced by a factor of over (2(tw(G) + 1)r)!, if there is no register aliasing or registers are interchangeable within each class. Furthermore r can then be chosen as the maximum number of registers that can be used at the same time instead of the total number of registers, which gives a further runtime reduction in case of register aliasing.
Proof. Instead of using f : U → [r] we can directly use U.
Remark 5. Using a suitable cost function and r = 1 we get a polynomial time algorithm for maximum independent set on intersection graphs of connected subgraphs of graphs of limited tree-width.
Remark 6. The allocator is easy to re-target, since the cost function is the only architecture-specific part.
Complexity of Register allocation
The complexity of register allocation in different variations has been studied for a long time and there are many NP-hardness results.
Publication | Difference to our setting
Register allocation via coloring [7] | tw(G) unbounded
On the Complexity of Register Coalescing [4] | tw(G) unbounded
The complexity of coloring circular arcs and chords [15] | r is part of input
Aliased register allocation for straight line programs is NP-complete [22] | r is part of input
On Local Register Allocation [13] | r is part of input
Given a graph I, a program can be written such that the program has conflict graph I [7]. This proves the NP-hardness of register allocation as a decision problem for r = 3. However, the result does not hold for structured programs. Coalescing is NP-hard even for programs in SSA form [4]. Again, this result does not hold for structured programs. Register allocation, as a decision problem, is NP-hard even for series-parallel control-flow graphs, i.e. tw(G) ≤ 2 and thus structured programs, when the number of registers is part of the input [15]. Register allocation, as a decision problem, is NP-hard when register aliasing is possible, even for straight-line programs, i.e. tw(G) = 1 and thus structured programs, when the number of registers is part of the input [22]. Minimizing spill costs is NP-hard even for straight-line programs, i.e. tw(G) = 1 and thus structured programs, when the number of registers is part of the input [13].
It is thus fundamental to our polynomial time optimal approach, which handles register aliasing, register preferences, coalescing and spilling, that the input program is structured and the number of registers is fixed.
The runtime bound of our approach proven above is exponential in the number of registers r. However, as shown in [21], even a substantially simplified version of the register allocation problem is W[SAT]- and co-W[SAT]-hard when parametrized by the number of registers, even for tw(G) = 2. Getting rid of the r in the exponent would thus imply a collapse of an infinite number of parametrized complexity hierarchies. Even a single such collapse is considered highly unlikely in parametrized complexity theory.
Prototype implementation
A prototype of the allocator has been implemented in C++ for the Z80, Z180 and Rabbit 2000/3000 ports of sdcc [11] , a C compiler for embedded systems.
The Z80 architecture was chosen, since it is small enough to easily understand, yet has many of the typical features of complex CISC architectures. Nine 8-bit registers are assigned by the allocator (A, B, C, D, E, H, L, IYL, IYH). IYL and IYH can only be used together as the 16-bit register IY; there are instructions that treat BC, DE or HL as 16-bit registers; many 8-bit instructions can only use A as the left operand, while many 16-bit instructions can only use HL as the left operand. There are some complex instructions, like djnz, a decrement-and-jump-if-not-zero instruction that always uses B as its operand, or ldir, which essentially implements memcpy() with the pointer to the destination in DE, the source pointer in HL and the number of bytes to copy in BC. All these architectural quirks are captured by the cost function.
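The aliasing between the 16-bit register names and the 8-bit registers listed above can be made explicit with a small table. This is only an illustrative sketch of what register aliasing means on this architecture. The prototype assigns the 8-bit registers and expresses such constraints through its cost function, so the table and the helper names `physical` and `aliases` are assumptions of this example.

```python
# 16-bit register names and the 8-bit registers they occupy (high, low).
HALVES = {
    "BC": ("B", "C"),
    "DE": ("D", "E"),
    "HL": ("H", "L"),
    "IY": ("IYH", "IYL"),
}


def physical(reg):
    """Set of 8-bit registers occupied by the register name `reg`."""
    return frozenset(HALVES.get(reg, (reg,)))


def aliases(reg1, reg2):
    """True if the two names share physical storage and thus
    cannot hold different values at the same time."""
    return bool(physical(reg1) & physical(reg2))
```

For example, `aliases("BC", "C")` holds, while `aliases("BC", "DE")` does not.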
The prototype does not yet handle rematerialization, and limitations of current code generation do not allow the A or IY registers to hold parts of a variable.
Code size was used as the cost function, due to its importance in embedded systems and relative ease of implementation (optimal speed or energy optimization would require profiler-guided optimization).
We obtain the tree decomposition using Thorup's method [27] , and then transform it into a nice tree decomposition. The implementation of the allocator essentially follows Section 3, and is neither very optimized for speed nor parallelized. However a configurable limit on the number of assignments considered at each node of the tree decomposition has been introduced. When this limit is reached, some assignments are discarded heuristically. The heuristic mostly relies on the s(i, f ) to discard those assignments that have the highest cost so far first, but takes other aspects into account to increase the chance that compatible assignments will exist at join nodes. When the limit is reached, and the heuristic applied, the assignment is no longer provably optimal. This limit essentially provides a trade-off between runtime and quality of the assignment.
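The effect of the limit can be sketched as follows. The sketch covers only the cost-based part of the heuristic, i.e. keeping the assignments with the lowest s(i, f) so far. The additional criteria that improve compatibility at join nodes are not modelled, and the function name `prune` and its interface are hypothetical.

```python
import heapq


def prune(costs, limit):
    """Keep at most `limit` assignments with the lowest cost so far.

    costs: dict mapping an assignment f to its partial cost s(i, f)
    Returns (kept, pruned); pruned is True if anything was discarded,
    in which case the final result is no longer provably optimal.
    """
    if len(costs) <= limit:
        return dict(costs), False
    kept = heapq.nsmallest(limit, costs.items(), key=lambda item: item[1])
    return dict(kept), True


# Illustrative partial costs for three assignments at one node.
EXAMPLE = {"f1": 5, "f2": 1, "f3": 3}
```

With a limit of 2, `prune(EXAMPLE, 2)` keeps `f2` and `f3` and reports that pruning took place. A function stays provably optimally allocated only if the flag is never set at any node.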
The prototype was compared to the old sdcc register allocator, which has been improved over years of use in sdcc. The old allocator is basically an improved linear scan [25, 12] algorithm extended to take the architecture into account, e.g. preferring to use registers HL and A, since accesses to them typically are faster than those to other registers and taking coalescing, register aliasing and some other preferences into account.
Experimental results
Six benchmarks considered representative of typical applications for embedded systems have been used to evaluate the register allocator:
- The Dhrystone benchmark [29], version 2 [28]. An ANSI-C version was used, since sdcc does not support K&R C.
- A set of source files taken from real-world applications and used by the sdcc project to track code size changes over sdcc revisions and to compare sdcc to other compilers.
shows the compilation time, and Figure 2 shows the fraction of provably optimally allocated functions (i.e. those functions for which the heuristic never was applied); the former is little affected by enabling the peephole optimizer and the latter not at all.
The Dhrystone benchmark is the simplest one: at $7 \times 10^{6}$ assignments per node it becomes provably optimal, and the chosen assignments do not change from 450000 onwards. This also results in a moderate reduction in code size of 4.2% before and 3.4% after the peephole optimizer when compared to the old allocator. We attribute this to the relatively simple functions with few control constructs and few local variables present in this benchmark.
The sdcc benchmark contains considerably more complex functions; even at $10^{8}$ assignments per node about one fifth of the functions are not provably optimally allocated. However, code size seems to be stable from $4 \times 10^{7}$ onwards. We get a code size reduction of 17.1% before and 16.3% after the peephole optimizer.
For Coremark, about 8% of the functions are not provably optimally allocated at $10^{8}$ assignments per node; code size seems to be stable from $4.5 \times 10^{7}$ onwards. At that value the code size is reduced by 8.4% before and 7.2% after the peephole optimizer.
FatFS is the benchmark which is the most problematic for our allocator; it contains large functions with complex control flow, some containing nearly a kilobyte of local variables. Even at $2 \times 10^{7}$ assignments per node (we did not run compilations at higher values due to lack of time) only 45% of the functions are provably optimally allocated. We get a reduction in code size of 9.9% before and 8.1% after the peephole optimizer. Due to the low fraction of provably optimally allocated functions, the code size reduction and compilation time are likely to be much higher for a higher number of assignments per node.
In the games benchmark, about 4% of the functions are not provably optimally allocated at $10^{8}$ assignments per node; at that value the code size is reduced by 11.2% before and 8.7% after the peephole optimizer. This result is consistent with the previous two: the source code contains both complex and simple functions (and some data, since only source files consisting entirely of data were excluded, while those containing both code and data were included).
For Contiki, about 9% of the functions are not provably optimally allocated at $2 \times 10^{7}$ assignments per node (again we did not run compilations at higher values due to lack of time); at that value the code size is reduced by 7.7% before and 5.6% after the peephole optimizer. Contiki contains some complex control flow, but it tends to use global instead of local variables; where there are local variables they are often 32-bit variables, of which neither the optimal nor the old allocator can place more than one in registers at a given time (due to the restriction in code generation that allows the use of IY for 16-bit variables only).
For all benchmarks, the reduction in code size was more substantial before peephole optimization. We attribute this to two aspects. On the one hand, our allocator generates better code than the old one, so there is less problematic code left for the peephole optimizer to improve. On the other hand, the peephole optimizer and its rules co-evolved with the old allocator, so they are more tuned to its quirks than to the code emitted when using the new allocator.
Conclusion
An optimal register allocator that has polynomial runtime has been presented. Register allocation is one of the most important stages in a compiler, so the allocator is a major step towards improving compilers. The allocator can handle a variety of spill and rematerialization costs, register preferences and coalescing.
A prototype implementation shows the feasibility of the approach, and is already in use in a major cross-compiler targeting architectures found in embedded systems.
