We present a randomized algorithm that generates contiguous evaluations for expression DAGs representing basic blocks of straight-line code with nearly minimal register need. This heuristic may be used to reorder the statements in a basic block before applying a global register allocation scheme such as graph coloring. Experiments have shown that the new heuristic produces results which are about 30% better on average than without reordering.
Introduction
Register allocation is one of the most important problems in compiler optimization. Among the numerous register allocation schemes proposed, register allocation and spilling via graph coloring is generally accepted to give good results. But register allocation via graph coloring has the disadvantage of using a fixed evaluation order within a given basic block, namely the evaluation order given by the source program. Often there exists an evaluation order for the basic block that uses fewer registers. By using this order, the global register allocation generated via graph coloring could be improved. The aim of this article is to achieve such an improvement by improving the evaluation order within the basic blocks.
We can represent a basic block by a directed acyclic graph (DAG); see Fig. 1 for an example. An algorithm that constructs a DAG for a given basic block is given in [1]. For the evaluation of DAGs the following results are known:
(1) If the DAG is a tree, the well-known algorithm of Sethi and Ullman (see [8]) generates an optimal evaluation in linear time (optimal means: uses as few registers as possible).
(2) The problem of generating an optimal evaluation for a given DAG is NP-complete (see [9]).
To generate a good evaluation order for a DAG that is not a tree, we therefore have to find a heuristic. We present such a heuristic in the following sections. The new heuristic uses a mix of several simple evaluation strategies that also include a randomized evaluation selection. These simple evaluation strategies are applied concurrently and the best evaluation generated is selected. The idea behind this approach is that there exists no uniform heuristic that generates good evaluations for every possible DAG. But most of the DAGs encountered in real programs belong to one of a few simple classes. For each of these classes there exists a simple algorithm that generates good, often optimal evaluations.
By running these simple algorithms "in parallel" and choosing the best result we obtain a heuristic that copes with most of the DAGs encountered in real programs. In section 2 we introduce the basic formalism and two simple depth-first search (dfs) variants as evaluation strategies. In section 3 we present the randomized evaluation strategy, in section 4 a variant of the Labeling Algorithm (see [8]). In section 5 we put the pieces together and discuss the performance improvement achieved by the reordering with respect to the original basic block.
Evaluating DAGs
We assume that we are generating code for a single-processor machine with general-purpose registers $R = \{R_0, R_1, R_2, \ldots\}$ and a countable sequence of memory locations.
The arithmetic machine operations are three-address instructions of the following types:
$R_k \leftarrow R_i \;\mathrm{op}\; R_j$ (binary operation, $\mathrm{op} \in \{+, -, \times, \ldots\}$),
$R_k \leftarrow \mathrm{op}\; R_i$ (unary operation),
$R_k \leftarrow \mathrm{Load}(a)$ (load register $k$ from the memory location $a$), or
$\mathrm{Store}(a) \leftarrow R_k$ (store the contents of register $k$ into the memory location $a$),
where $i \neq j \neq k \neq i$ and $R_k, R_i, R_j \in R$.
The following considerations are also applicable to the case $k = i$ or $k = j$. Our requirement that the registers used by an operation be mutually different simplifies the handling somewhat, but does not affect the validity of our results.
Definition: A basic block is a sequence of three-address instructions that can only be entered via the first statement and only be left via the last statement.
A directed graph is a pair $G = (V, E)$ where $V$ is a finite set of nodes and $E \subseteq V \times V$ is a set of edges. In the following let $n = |V|$ denote the number of nodes of the graph. Each edge is directed from the son to the father. A node which has no sons is a leaf, otherwise it is an inner node; in particular we call a node with two sons binary and a node with only one son unary. A node with no father is called a root of $G$.
The outdegree outdeg(w) of a node $w \in V$ is the number of edges leaving $w$, i.e. the number of its fathers. The data dependencies in a basic block can be described by a directed acyclic graph (DAG). The leaves of the DAG are the variables and constants occurring as operands in the basic block; the inner nodes represent intermediate results. An example is given in Figure 1. A numbering of the nodes in which every son precedes all of its fathers is called a topological order of the nodes of $G$. It is well known that a topological order exists for a directed graph if and only if the graph is acyclic (see e.g. [5]).
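The existence claim for topological orders can be illustrated with a short sketch. It assumes, purely for illustration, that a DAG is stored as a Python dictionary mapping each node to the tuple of its sons; the function name `topological_order` is hypothetical and not taken from the paper. The routine repeatedly emits nodes whose sons have all been emitted already and reports a cycle if it gets stuck.

```python
from collections import deque


def topological_order(sons):
    """Return the nodes so that every son precedes each of its fathers.

    `sons` maps each node to the tuple of its sons (an assumed layout).
    Raises ValueError if the graph contains a cycle.
    """
    # Number of distinct sons of each node that have not been emitted yet.
    pending = {v: len(set(s)) for v, s in sons.items()}
    # Reverse adjacency: the fathers of each node.
    fathers = {v: [] for v in sons}
    for v, s in sons.items():
        for w in set(s):
            fathers[w].append(v)

    ready = deque(v for v in sons if pending[v] == 0)  # the leaves
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for f in fathers[v]:
            pending[f] -= 1
            if pending[f] == 0:
                ready.append(f)

    if len(order) != len(sons):
        raise ValueError("graph has a cycle, so no topological order exists")
    return order


if __name__ == "__main__":
    # t1 = a op b;  t2 = t1 op a
    dag = {"a": (), "b": (), "t1": ("a", "b"), "t2": ("t1", "a")}
    print(topological_order(dag))
```

Any order returned this way is a valid evaluation order of the basic block, although not necessarily a contiguous one.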
The advantage of a contiguous evaluation is that it can be generated by simple algorithms (variations of depth-first search, dfs). In this paper we restrict our attention to contiguous evaluations. This is already a heuristic, because there are some DAGs for which a noncontiguous evaluation exists that uses fewer registers than every contiguous evaluation. However, in practice these cases seem to be rare. The smallest DAG of this kind we have found so far has 14 nodes.
In general there will exist several optimal evaluations for a given DAG. (In this paper we always use the word "optimal" with respect to the register need.) Sethi proved in 1975 [9] that the problem of computing an optimal evaluation for a given DAG is NP-complete. Assuming $P \neq NP$, we expect any exact algorithm to have nonpolynomial run time. Unfortunately, this problem occurs often in compiler construction and should be solved fast. We will present a heuristic that produces fairly good evaluations in linear time.
We get the information whether a node will be used later from a reference counter ref(v) for each node $v \in V$, which is initialized with outdeg(v) at the beginning of lfs and decremented whenever v is used as an operand. If ref(v) = 0, v will not be needed any more, and the register containing the result of v can be marked free (lines 8 and 9). We observe that each node v is held in a register from its evaluation point until the last reference to v is reached. Thus, the register need m results from the highest marked register number plus one, as in the definition above.
lfs or rfs might return a very bad evaluation when the DAG has a certain structure (see Fig. 4 and Tab. 1). For this reason we try to modify these algorithms in the next section.
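The reference-counting scheme described above can be sketched as follows. This is a minimal illustration, not the paper's listing: the dictionary layout (`sons` maps a node to the tuple of its sons), the function name `dfs_evaluation` and the `right_first` parameter are assumptions. The sketch also assumes that the target register is chosen before the operand registers are released, in line with the requirement that the registers of one instruction be mutually different.

```python
def dfs_evaluation(sons, roots, right_first=frozenset()):
    """Contiguous evaluation by depth-first search with reference counting.

    sons        -- dict: node -> tuple of its sons (assumed layout)
    roots       -- nodes from which the search is started, in this order
    right_first -- nodes whose right son is visited before the left one;
                   the empty set gives lfs, the set of all nodes gives rfs
    Returns (order, need): the evaluation order and its register need.
    """
    # ref(v) starts as outdeg(v), the number of uses of v as an operand.
    ref = {v: 0 for v in sons}
    for operands in sons.values():
        for w in operands:
            ref[w] += 1

    reg = {}      # node -> register number; doubles as the "visited" mark
    free = []     # register numbers that have been released
    order = []
    need = 0      # highest register number used so far, plus one

    def visit(v):
        nonlocal need
        if v in reg:
            return
        operands = sons[v][::-1] if v in right_first else sons[v]
        for w in operands:
            visit(w)
        # Choose the lowest free register for the result of v.
        if free:
            r = min(free)
            free.remove(r)
        else:
            r = need
            need += 1
        reg[v] = r
        order.append(v)
        # Operands whose last reference this was give their register back.
        for w in sons[v]:
            ref[w] -= 1
            if ref[w] == 0:
                free.append(reg[w])

    for root in roots:
        visit(root)
    return order, need


if __name__ == "__main__":
    # r = a op t, where t = b op c
    dag = {"a": (), "b": (), "c": (), "t": ("b", "c"), "r": ("a", "t")}
    print(dfs_evaluation(dag, ["r"]))                     # left son first
    print(dfs_evaluation(dag, ["r"], right_first={"r"}))  # right son first at r
```

On the small example in the `__main__` part, visiting the left son of `r` first keeps the leaf `a` in a register while `t` is computed and needs four registers, whereas visiting the right son first needs three. This is the kind of order dependence that the next section exploits. The recursion is used for brevity; a production version would likely use an explicit stack.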
3 Random first search
Lemma 1: A unary node has no influence on a contiguous evaluation generated by a dfs variation.
That is evident since for a unary node u with son v, dfs(u) has no other choice than to call dfs(v).
Definition: A decision node is a binary node which is not a tree node.
By doing this we obtain all (up to $2^d$) possible contiguous evaluations for G, provided that we use a fixed contiguous evaluation for the tree nodes of G. Unfortunately, the algorithm induced by that still might have exponential run time, since a DAG with n nodes can have up to $d = n - 2$ decision nodes (e.g. consider a binary tree with $n - 2$ nodes; by adding two new nodes and $n - 1$ edges as given in Fig. 5, we get a DAG with $n - 2$ decision nodes).
It is clear that in a tree
Of course we do not want to invest exponential run time if we have a lot of decision nodes. Often a heuristic solution suffices. This suggests throwing coins in order to generate several random bitvectors, hoping that at least one of the evaluations computed by this procedure has a register need close to the optimum.
Algorithm randomfs: We generate a fixed number zv of random bitvectors, in which each bit is 1 with probability 1/2, and apply dfs to each of them. Among the computed evaluations we select one with the least register need. The run time of randomfs is $O(zv \cdot n)$ according to the discussion of lfs. Of course, if $zv \geq 2^d$ we have enough time to enumerate all $2^d$ possible bitvectors, i.e. we simulate a binary counter on the bitvector: 0...0000, 0...0001, ..., 1...1111. This procedure is guaranteed to give an optimal contiguous evaluation for G.
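A compact sketch of randomfs under simplifying assumptions may make the procedure concrete. The DAG layout (a dictionary from each node to the tuple of its sons), the helper `evaluate` and the parameter names are hypothetical. For brevity, the sketch flips a coin at every binary node rather than only at decision nodes, so its bitvectors are longer than strictly necessary.

```python
import itertools
import random


def evaluate(sons, roots, right_first):
    """dfs with reference counting; returns (order, register need)."""
    ref = {v: 0 for v in sons}
    for operands in sons.values():
        for w in operands:
            ref[w] += 1
    reg, free, order = {}, [], []
    need = 0

    def visit(v):
        nonlocal need
        if v in reg:
            return
        for w in (sons[v][::-1] if v in right_first else sons[v]):
            visit(w)
        if free:                      # reuse the lowest released register
            r = min(free)
            free.remove(r)
        else:                         # otherwise open a new one
            r = need
            need += 1
        reg[v] = r
        order.append(v)
        for w in sons[v]:             # release operands after their last use
            ref[w] -= 1
            if ref[w] == 0:
                free.append(reg[w])

    for root in roots:
        visit(root)
    return order, need


def randomfs(sons, roots, zv, rng=None):
    """Try zv bitvectors and keep an evaluation with the least register need.

    One bit per binary node says whether its right son is visited first.
    If zv >= 2**d, all 2**d bitvectors are enumerated instead of sampled.
    """
    if zv < 1:
        raise ValueError("zv must be at least 1")
    if rng is None:
        rng = random.Random()
    binary = [v for v, s in sons.items() if len(s) == 2]
    d = len(binary)
    if zv >= 2 ** d:
        vectors = itertools.product((0, 1), repeat=d)   # binary counter
    else:
        vectors = ([rng.getrandbits(1) for _ in binary] for _ in range(zv))

    best = None
    for bits in vectors:
        right_first = {v for v, bit in zip(binary, bits) if bit}
        candidate = evaluate(sons, roots, right_first)
        if best is None or candidate[1] < best[1]:
            best = candidate
    return best


if __name__ == "__main__":
    dag = {"a": (), "b": (), "c": (), "t": ("b", "c"), "r": ("a", "t")}
    print(randomfs(dag, ["r"], zv=4))
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which is convenient when comparing different values of zv.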
The advantage of this method lies in the fact that the quality of the generated solution can be controlled by zv (zv may be passed as a parameter to the compiler). That is why we are interested in the questions of how good the computed evaluation is on average and how large zv should be in order to get sufficiently good results. We want to illustrate this problem for a special example: Consider the DAG of Fig. 4. It is easy to see that randomfs can generate an optimal contiguous evaluation with a register need of 4 only if at the decision nodes 0, 3, 6, 9 and 12 the right son is always visited first. The probability for the subDAG with the root 15 being evaluated first is $p = (1/2)^5 = 1/32$, about 3%. The probability of finding at least one optimal evaluation among zv possibilities is $1 - (31/32)^{zv}$ in this example. If we want that probability to exceed 90%, we conclude
$zv \geq \frac{\log 0.1}{\log 31 - \log 32} \approx 72.5$;
for a probability of 50% we need $zv \geq 22$, and so on. Of course we might be satisfied if the generated evaluation required five instead of four registers. For the average register need with given zv we have found the following results for our example by experiments:
zv    0    1    3    5    6    7    8    10   12   15   18   20   30   50
reg   9    8.4  7.7  6.7  6.3  5.6  5.6  5.6  5.6  5.6  5.0  5.0  4.9  4.4
We can see that already for a relatively small zv, e.g. 10, a fairly good average register need is achieved. Of course the improvement of the evaluation quality decreases for increasing zv, and the probability of the same bitvector being chosen twice increases for increasing zv; e.g. for our example DAG with d = 13 decision nodes (thus 8192 possible bitvectors) the probability of at least one bitvector occurring several times is over 50% already for zv = 107. Certainly these computations are limited to our example DAG above; a more general discussion of zv may be a subject of further research. For the present we will choose zv with respect to the run time of randomfs.
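The two estimates in this example are easy to recompute. The following sketch uses hypothetical helper names and assumes that the bitvectors are drawn independently and uniformly. It evaluates the bound on zv for a given success probability and the probability that some bitvector is drawn more than once.

```python
import math


def samples_needed(p_single, target):
    """Smallest integer zv with 1 - (1 - p_single)**zv >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_single))


def repeat_probability(num_vectors, zv):
    """Probability that zv uniform draws from num_vectors values repeat one."""
    p_all_distinct = 1.0
    for k in range(zv):
        p_all_distinct *= (num_vectors - k) / num_vectors
    return 1 - p_all_distinct


if __name__ == "__main__":
    p = (1 / 2) ** 5                       # five decision nodes must go right
    print(samples_needed(p, 0.9))          # bound for a 90% success chance
    print(samples_needed(p, 0.5))          # bound for a 50% success chance
    print(repeat_probability(2 ** 13, 107))
```

Rounding the bound of about 72.5 up to the next integer gives zv = 73 for the 90% case, and the repeat probability first exceeds one half at zv = 107 for 8192 bitvectors, in agreement with the figures quoted above.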
labelfs: another heuristic
It is possible to compute labels for all nodes of the DAG according to the formula of Sethi and Ullman for trees given above. In general a label controlled evaluation of a DAG
For the DAG of Fig. 4 labelfs gives an optimal evaluation (4 registers). But we also give a counterexample where labelfs does not give the best result (Fig. 6, Tab. 2). So it seems sensible to unify all heuristics considered so far in a combination called V4, which applies all algorithms one after another and chooses the best evaluation generated.
Table 3: A series of tests with 20 randomly constructed DAGs: V4 always improved the register need of the original evaluation for the basic block (GC stands for "Graph Coloring without reordering by V4"); here we obtained the average ratio GC/V4 ≈ 1.38.
Graph coloring works on the register interference graph (RIG), which must be constructed from a fixed evaluation A (this is the evaluation given in the original basic block). Two nodes (symbolic registers, here identical with the DAG nodes) interfere, and are thus connected in the RIG by an edge, if they are active simultaneously in A, i.e. if they cannot be assigned to the same physical register (same color). The coloring can be computed by a linear time heuristic (applied here) or via backtracking, where exponential run time is possible. The number of different colors (chromatic number) needed for A corresponds to the register need m. In order to show the advantages of the new heuristic we apply V4 with zv = 10 to randomly constructed DAGs with 30 to 150 nodes (average ca. 80), see Tab. 3. The result is surprisingly clear: for the original evaluation of the basic block about 1/3 more registers are needed on average than for the evaluation returned by V4. The improvement achieved by V4 might be increased even further by choosing a greater zv.
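How a register need follows from a fixed evaluation via the RIG can be sketched as below. The names and the dictionary layout (`sons` maps a node to the tuple of its sons) are assumptions. The sketch treats a node as active from its own evaluation up to and including its last use, keeps roots active to the end, and uses a simple greedy coloring in evaluation order as a stand-in for the linear time heuristic mentioned above. The pairwise construction of the RIG is quadratic and meant only for illustration.

```python
def interference_graph(sons, order):
    """Build the RIG for the evaluation `order` (every node exactly once,
    each son before its fathers)."""
    pos = {v: i for i, v in enumerate(order)}
    # end[v]: position of the last instruction that reads v;
    # nodes that are never read (roots) stay active until the end.
    end = {v: len(order) for v in order}
    last_use = {}
    for v in order:
        for w in sons[v]:
            last_use[w] = max(last_use.get(w, -1), pos[v])
    end.update(last_use)

    rig = {v: set() for v in order}
    for i, u in enumerate(order):
        for v in order[i + 1:]:
            if pos[v] <= end[u]:      # v is evaluated while u is still active
                rig[u].add(v)
                rig[v].add(u)
    return rig


def greedy_coloring(rig, order):
    """Give each node the lowest color not used by an already colored neighbor."""
    color = {}
    for v in order:
        used = {color[u] for u in rig[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def register_need(sons, order):
    """Number of colors (physical registers) used for the given evaluation."""
    coloring = greedy_coloring(interference_graph(sons, order), order)
    return max(coloring.values()) + 1


if __name__ == "__main__":
    dag = {"a": (), "b": (), "c": (), "t": ("b", "c"), "r": ("a", "t")}
    print(register_need(dag, ["a", "b", "c", "t", "r"]))
    print(register_need(dag, ["b", "c", "t", "a", "r"]))
```

Because the active ranges are intervals along the evaluation, coloring in evaluation order never uses more colors than the largest number of simultaneously active nodes. The two orders in the usage example differ by one register, which illustrates why reordering before coloring can pay off.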
This observation can only be explained by the fact that before the reordering we have one fixed evaluation $A_0$ (just the one which is given by the random construction of the test DAG), and in general this $A_0$ is noncontiguous. The probability of exactly this evaluation having a very low register need is rather small. On the other hand, V4 examines here zv + 3 = 13 mostly different evaluations, and only one of them must have a lower register need than $A_0$ to improve the result.
Final remarks
We are rather pleased with the results returned by V4, so we use it for the code optimizer in the implementation of a compiler for vector PASCAL which is being developed at our institute. For more details see [7]. The next step in that optimizer, the adaptation of a computed evaluation of a vector DAG to a special vector processor, is described in [3] and will be presented in a later paper.
