Abstract
Fixed-point DSPs are a class of embedded processors with highly irregular architectures. This irregularity makes it difficult to generate high-quality machine code from programming languages such as C. In this paper we present a novel constraint-driven approach to code selection for irregular processor architectures, which provides a twofold improvement over earlier work. First, it handles complete data flow graphs instead of trees and thereby generates better code in the presence of common subexpressions. Second, the presented technique is not restricted to the computation of a single solution, but generates alternative solutions. This feature enables the tight coupling of different code generation phases, resulting in better exploitation of instruction-level parallelism. Experimental results indicate that our technique is capable of generating machine code that competes well with hand-written assembly code.
1. Introduction
Today, many embedded systems employ programmable processors as their core components. The machine code running on embedded processors frequently must meet tight speed and size constraints, due to the presence of real-time requirements and limited silicon area for program memories. These requirements often prevent the use of high-level language compilers for embedded software development, since compilers usually incur an overhead in code quality compared to hand-written assembly code. In this paper, we consider fixed-point DSPs as a specific class of embedded processors. In particular for this type of processor, compilers tend to produce an intolerably large overhead in code size and performance [16]. As a consequence, the largest part of fixed-point DSP software is still written manually in assembly languages. This implies a bottleneck in system development and also forfeits some of the advantages of embedded software (as compared to ASIC hardware), such as reusability and portability. Our overall goal is to eliminate this bottleneck by providing better code generation techniques for high-level language compilers, which take the peculiarities of fixed-point DSPs into account.*
The poor quality of compiler-generated code for fixed-point DSPs is primarily caused by the irregular architecture of such processors, which in turn is a consequence of the demand for very efficient processors in the DSP area. By "irregularity" of the processor architecture we denote the following features: Special-purpose registers may not be orthogonally accessible by all functional units, but may be connected to the inputs and outputs of specific functional units. There may be chained operations, of which the most important example in DSPs is the MAC (multiply-accumulate) instruction. Furthermore, fixed-point DSPs typically show restricted instruction-level parallelism, i.e., they do permit the parallel execution of several instructions per instruction cycle, but unlike in VLIW machines, the permissible combinations of parallel instructions are quite restricted.
In this contribution we focus on the task of code selection for fixed-point DSPs or, more generally, for processor architectures with irregular data paths. Code selection is concerned with mapping an intermediate representation (IR) of the source program to machine instructions of the target processor. This task can be viewed as "covering" the IR by machine instruction patterns. Most current code selection techniques are based on tree covering and operate on data flow tree (DFT) based IRs of basic blocks. However, tree covering in general produces suboptimal covers for basic blocks: since basic blocks generally appear in the form of data flow graphs (DFGs), DFGs have to be split into DFTs (fig. 1). This is performed by cutting DFGs at nodes representing multiple uses of values (common subexpressions, CSEs).
Figure 1: Splitting a DFG into DFTs
As we discuss later, the tree-based approach has drawbacks particularly for irregular architectures and leads to inferior code quality. In this contribution we therefore propose a novel code selection technique, which generalizes code selection for irregular processor architectures from DFTs to DFGs, thereby producing more efficient machine code. This technique is capable of covering complete DFGs by machine instructions and is based on the paradigm of constraint logic programming (CLP).
The paper is structured as follows: The next section discusses the limitations of DFT-covering-based code selection for fixed-point DSPs and mentions related work. Section 3 briefly sketches concepts of constraint logic programming. In section 4, a model for representing alternative DFG covers and the DFG covering approach itself are presented. Section 5 describes several applications of the DFG covering technique in a compiler and provides experimental results indicating the code quality improvements achieved.
*The authors acknowledge the support by the DFG and HP EESof.
2. Motivation and related work
Throughout this paper we will represent processor operations by register transfers (RTs), which reflect the operations performed and the storage resource (SR) locations of the operands and the result. Fig. 2 shows a partial data path of an Analog Devices ADSP-210x together with a subset of its RTs. For instance, in the RT ar := ax + ay the first and second operand must reside in SRs ax and ay, respectively. The result is stored in SR ar. In the following, we call ar the definition of the RT and ax, ay its uses. We write $use_i$ when referring to operand number $i$.
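For concreteness, such RTs could be written down as simple facts. The following is a hypothetical sketch in ECLiPSe Prolog; the fact format rt(Definition, Operation, Uses) and the particular RT subset are our illustration, not the paper's machine model:

:- module(rts).

% Hypothetical RT facts for the partial ADSP-210x data path of fig. 2.
% The chained MAC appears as one RT with three uses.
rt(ar, add, [ax, ay]).      % ar := ax + ay
rt(ar, sub, [ax, ay]).      % ar := ax - ay
rt(mr, mul, [mx, my]).      % mr := mx * my
rt(mr, mac, [mr, mx, my]).  % mr := mr + mx * my (chained operation)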
2.1 Drawbacks of DFT covering
Most of today's code selection approaches are based on tree covering or tree parsing. Covering consists of mapping DFT nodes to available RTs. DFT edges may be mapped to RTs denoting pure transfer (move) operations, required for routing data between the functional units (FUs). An example of a cover is shown in fig. 3 a). Tree covering is typically based on tree pattern matching combined with dynamic programming: tree pattern matchers determine the set of alternative covers for a DFT, and dynamic programming selects the optimal cover from the alternatives. Such code generators can be generated automatically by tools like iburg [3], which require a formal description of the target instruction set given as a tree grammar. For applying tree covering, DFGs have to be split into DFTs (fig. 1). As an example consider fig. 3 a), where variables are mapped to memory. Thus, the CSE labeled t is moved from SR mr to memory, and each use loads t back into SR ay. If code selection were done for the complete DFG, the overall costs of the cover (i.e., the number of instructions required to implement the DFG) could be reduced from 9 to 6 (fig. 3 b). Note that not only the location of the definition and the uses of the CSE change, but also the complete covering of the second DFT (exploiting commutativity of operators). Fig. 3 c) shows the additional optimization effect achieved by allowing CSEs to occur as sub-operations in chained operations, which again decreases the costs by one instruction.
A further drawback of approaches based on tree covering is that instruction-level parallelism (ILP) cannot be taken into account during covering. Fig. 4 shows two covers with the same costs, but only the second cover can be mapped to parallel code for the ADSP-210x: parallel transfer operations are only feasible if one operand is moved from memory d to mr and the second one from p to my. It is therefore favorable to keep alternative covers of the same costs for the scheduling phase, during which the most appropriate alternative may be selected.
Figure 4: Impact of covers on ILP
In the following, we discuss related work, which is based on either tree covering or a phase-coupled approach to DFG code generation.
2.2 Approaches based on tree covering
In the CBC and RECORD compilers [2, 7], processor models are mapped to iburg [3] specifications, from which code generators are automatically generated. In [1], DFGs are transformed into DFTs by pruning edges of CSEs based on the "RTG criterion", leading to larger DFTs. Covering is then performed with the help of the code generator generator olive (an extension of iburg). However, the RTG criterion only holds for a very specific class of fixed-point DSPs. In contrast to these approaches, our technique enables optimal covering of DFGs (instead of DFTs), simultaneously taking into account routing costs of CSEs and allowing CSEs to be mapped to chained operations.
2.3 Approaches based on phase coupling
Several approaches are based on the paradigm of coupling different code generation phases (including code selection), so as to maximize code quality. In [15, 4, 5, 9] optimal code is generated for DFGs, taking into account register allocation and instruction scheduling. Code generation phases are described in the form of constraints (generally linear equations and inequations). The complete solution space is explored while all constraints are considered simultaneously. Generally, only basic blocks of very limited size and a restricted set of architectures can be handled by this approach. Other phase coupling approaches are based on heuristics and postpone the final selection of a certain RT to register allocation and/or scheduling [12, 13, 6, 8]. In each step during register allocation or instruction scheduling, resources are strictly bound. In contrast to these approaches, the presented technique allows choosing between optimal and heuristic (for large DFGs) DFG covering. More importantly, our technique does not generate only a single DFG cover, but retains alternative DFG covers for later code generation phases. Alternatives are kept as long as possible, leaving much freedom for decisions during the scheduling phase.
3. Constraint logic programming
In constraint logic programming (CLP), a problem is modeled by variables ranging over finite domains together with constraints between them. Solutions are found by constraining the variables to certain members of their domains, which is denoted as labeling. Given a set of variables V, a labeling strategy defines an order for traversing the variables and a strategy for selecting members from the domains. The constraints guide the labeling of the variables. Constraint propagation serves to prune the search space in each labeling step by reducing the domains of unlabeled variables, and leads to an early detection of the infeasibility of a partial labeling. Any labeling which does not meet the given set of constraints is rejected and leads to backtracking. ECLiPSe provides several predefined labeling strategies, but also enables user-defined labeling strategies. For optimization problems, there are predefined generic optimization procedures based on branch-and-bound strategies. Given a set of variables* V, the optimization procedures expect a labeling strategy l(V) and an objective function cost(V) defining the costs C. Solving the optimization problem is performed by calling one of the predefined optimization procedures in the form optimize(l(V), cost(V)). Each time a new minimal solution C' is found, a new constraint C < C' is added and the search is continued, triggered by the backtracking mechanism of ECLiPSe. Given an appropriate design of the model, the efficiency of finding solutions basically depends on the labeling strategy, which may be specified by the user.
*This may also be any data structure V containing domain variables.
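As a minimal illustration, the following ECLiPSe sketch shows this scheme on a toy model. The model itself is our example, not the paper's; in current ECLiPSe the generic branch-and-bound procedure is available, e.g., as minimize/2 from the branch_and_bound library:

:- lib(ic).
:- lib(branch_and_bound).

% Toy model: three domain variables, one constraint, a linear objective.
% minimize/2 restarts the labeling with the added bound C < C' whenever
% a better solution is found.
solve(Vars, Cost) :-
    Vars = [X, Y, Z],
    Vars #:: 0..9,                      % domain variables
    X + Y #= Z,                         % constraints guide the labeling
    Cost #= 3 * X + Y + Z,              % objective function cost(V)
    minimize(labeling(Vars), Cost).     % generic branch-and-bound search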
4. Generation of DFG covers
An essential point in our code selection methodology is a new representation of DFG covers and the computation of the set of all alternative covers of a DFG by means of a set of domain variables and constraints. We will first introduce our model of alternative DFG covers, then describe the generation of all alternative covers, and conclude this section with runtime results.
4.1 Model of alternative DFG covers
To represent alternative DFG covers, the resources of all RTs matching a node in the DFG are combined into a factorized register transfer (FRT). A complete definition of an FRT is given by the tuple
$(Op, D, [U_1, \ldots, U_n], F, C, T, CS)$
$Op$ denotes the operation.
The domain variables $D$ and $U_1, \ldots, U_n$ represent the alternative SR locations for the definition and the uses. $F$, $T$, and $C$ denote the extended resource information (ERI), specifying the used FU $F$, the costs $C$ (given as the number of instruction cycles required to execute the RT), and a machine instruction type $T$. Machine instruction types specify how RTs can be combined into machine instructions in order to be executed in parallel. A subset of the machine instruction types of the ADSP-210x is shown in fig. 6. Types and FUs are used to model potential parallelism between RTs. $CS$ is the constraint set defining the mutual dependencies between the SRs. It may also specify dependencies between SRs and ERIs. Thus, effects like selecting certain FUs and machine instruction types (e.g., during scheduling) are also tightly coupled with the covering process. As an example, consider the set of RTs {c := a + b, a := c + b}.
The FRT is described by $(+, D, [U_1, U_2], F, C, T, \{D \in \{c, a\}, U_1 \in \{a, c\}, U_2 = b\})$. Note that the specification of the domains is also given in the form of constraints. Additionally, we need constraints to describe the dependency between $D$ and $U_1$. This can be expressed by the constraint $D = c \leftrightarrow U_1 = a$. If we now set $D$ to $c$, the constraint causes the reduction of $U_1$ to $a$. As a further example, consider the FRT specification of a chained operation such as the MAC.
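A minimal ECLiPSe sketch of this FRT might look as follows. The encoding is ours: SR names are mapped to integers (a = 1, b = 2, c = 3), and the equivalence is expressed by equating two reified constraints:

:- lib(ic).

% FRT for the RT set {c := a+b, a := c+b}:
% D in {c,a}, U1 in {a,c}, U2 = b, and D = c <-> U1 = a.
frt_add(D, U1, U2) :-
    D  #:: [1, 3],        % definition located in a or c
    U1 #:: [1, 3],        % use 1 located in a or c
    U2 #=  2,             % use 2 fixed to b
    B  #= (D #= 3),       % reified: B = 1 iff D = c
    B  #= (U1 #= 1).      % ... iff U1 = a

Posting frt_add(D, U1, U2), D #= 3 immediately propagates U1 = 1 (i.e., a), mirroring the domain reduction described above.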
4.2 The FRT covering process
Our code selection approach covers each DFG node with an FRT. In the following, $n_i$ denotes the $i$-th node of a DFG. The FRT associated with $n_i$ is denoted as $frt_i$; $D_i$ denotes the definition of $frt_i$, and $U_{i,j}$ denotes use $j$ of $frt_i$. The function $vd(i, j)$ returns $k$, where $n_k$ defines the value used by $U_{i,j}$. Each $n_i$ is also associated with variables $TC_{i,j}$, denoting the transfer costs for each $U_{i,j}$. There is a predicate $cse(i)$, which holds if $n_i$ is a CSE. For a given DFG, FRT covering is performed by applying the constraint match to each $frt_i$, which yields a new instance of the matching FRT in the internal machine model. The complexity of the covering process is $O(N \cdot D^2)$, where $N$ is the number of DFG nodes and $D$ is the maximum number of either FUs, SRs, or instruction types. An important feature of our approach is that a generally exponential number of alternative covers is stored in a representation of linear size (w.r.t. the number of DFG nodes $N$) by means of FRT covers.
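The text does not spell out the match constraint itself, but the coupling of a use to its defining node can be sketched as follows (our reading, with a hypothetical unit move cost):

:- lib(ic).

% Link use location U_ij of node n_i to the definition location D_k of
% node n_k, k = vd(i,j): if both agree, no transfer is needed (TC = 0),
% otherwise a move instruction is paid for (TC = 1; real costs would
% depend on the SR pair).
link_use(U, D, TC) :-
    B  #= (U #= D),       % reified: same SR location?
    TC #= 1 - B.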
5. Applications of FRT covering
This section describes several applications of the presented FRT covering technique in code selection and code generation for DFGs. First, we consider optimal code selection for DFGs with respect to a sequential instruction execution model, i.e., neglecting ILP. We show the improvements in code quality achieved as compared to a tree-based code selection approach. Since optimal DFG covering may be too computation-time intensive for large DFGs, we also present a heuristic variant, which achieves close-to-optimum results within reasonable computation times. After that, we briefly describe the integration of the code selection phase into a phase-coupled code generation technique, where alternative DFG covers are exploited to maximize ILP, resulting in very compact machine code.
In the following we make use of these notations: for $n_i$, the set $CV_i = \{C_i, TC_i\} \cup \{TC_{i,j} \mid U_{i,j} \in frt_i\}$ denotes the cost variables (where $TC_i$ denotes the transfer costs of the definition), $Cost_i = \sum_{cv \in CV_i} cv$ denotes the cover costs, and $SR_i = \{D_i\} \cup \{U_{i,j} \mid U_{i,j} \in frt_i\}$ is the set of SR locations of the definition and the uses. We denote the root node of $DFT_i$ by $r_i$.
5.1 Optimal DFG covering
Intuitively, optimal DFG covering can be specified as finding a labeling of the variables in $\bigcup_i (CV_i \cup SR_i)$ that minimizes the costs $Cost(DFG) = \sum_i Cost_i$. It can be shown that it is sufficient to label the set of variables $V(DFG) = (\bigcup_i CV_i) \cup \{D_j \mid cse(j)\}$, which drastically reduces the search space. Optimal covering can now be specified using ECLiPSe's optimization procedures: minimize(l(V(DFG)), Cost(DFG)). The labeling strategy l(V) selects the most constrained variable in each labeling step. We compare three code selection methods for DFGs, ranging from pure DFT covering (CS1) to optimal DFG covering (CS3).
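In ECLiPSe, this strategy can be expressed with the generic search routine. The sketch below is ours and assumes hypothetical helpers cover_vars/2 and cover_cost/2 that collect V(DFG) and post the Cost(DFG) constraints for a given FRT cover:

:- lib(ic), lib(branch_and_bound).

% Label V(DFG) most-constrained-first while minimizing Cost(DFG).
optimal_cover(Cover, Cost) :-
    cover_vars(Cover, V),          % V(DFG): cost variables + CSE defs (assumed)
    cover_cost(Cover, Cost),       % Cost(DFG) = sum of the Cost_i (assumed)
    minimize(search(V, 0, most_constrained, indomain, complete, []),
             Cost).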
5.2 Heuristic DFG covering
Since optimal DFG covering is an exponential problem, optimal code selection for DFGs may be too computation-time intensive for large problems. We have therefore designed an additional code selection method (CS4), which splits a DFG into a set of smaller, manageable DFGs. This method leads to much better runtimes than optimal DFG covering, while coming close to the optimal results. In CS4, code selection is based on the optimization strategies described in the last subsection. The DFG is partitioned by splitting it at its CSEs (as in CS1), leading to a set of DFTs.
We assume a certain ordering $[DFT_1, \ldots, DFT_k]$ of the DFTs. Now, for each $DFT_i$, the variables $D_{r_i}$ and $TC_{r_i}$ of the root $r_i$ are excluded from labeling, while the variables $D_j$ and $TC_j$ of all CSEs $n_j$ used in $DFT_i$ are included. An example is shown in fig. 8, where the labeling of $D_{r_1}$ at the root of tree $DFT_1$ is postponed until $DFT_2$ is labeled (note that adding $r_1$ to $DFT_2$ yields a DFG). The results of method CS4 are shown in table 3, compared to the costs of optimal DFG covering (CS3) and of pure DFT covering (CS1). The costs achieved by CS4 are much better than those of CS1 and come very close to the optimal cost values.
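A sketch of the resulting labeling order is given below; it is our reconstruction, building on lib(ic) as in the earlier sketches, with the two selector predicates assumed:

% CS4: label the DFTs one after another; the root variables of a CSE
% definition are postponed until a DFT that uses the CSE is labeled.
cs4([]).
cs4([DFT | Rest]) :-
    dft_cost_vars(DFT, CVs),        % CV variables of DFT, root's D/TC excluded (assumed)
    used_cse_defs(DFT, Defs),       % D_j of all CSEs n_j used in DFT (assumed)
    append(CVs, Defs, Vars),
    labeling(Vars),
    cs4(Rest).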
5.3 Phase-coupled code generation
As mentioned in section 2 (fig. 4), selecting only a single optimal DFT or DFG cover from multiple alternative optimal covers may hurt the exploitation of ILP. Better exploitation of ILP is possible if the final binding of operations and values to FUs and SRs is postponed until instruction scheduling. This phase coupling can be realized in our approach, since the FRT covering technique introduced in section 4.2 does not commit to a single DFG cover but implicitly retains a set of alternative optimal covers. We have implemented an extended list scheduling algorithm that integrates code selection, register allocation, and instruction scheduling. The scheduling algorithm takes an FRT cover and transforms it into a sequence of machine instructions, while adding new constraints (e.g., on ILP) and thus reducing the resource sets. The amount of alternative resources represented by an FRT cover provides high flexibility for making good decisions in each scheduling step. We have generated parallel code for the examples; the results are shown in table 4. Each entry gives the number of generated parallel machine instructions (including additional code for address computations). Column 2 shows results obtained with a GNU-based ADSP-210x C compiler. Column 3 (hw) gives the length of the hand-written reference code for the DSPStone benchmarks, while column 4 (pc) provides the results achieved by the phase-coupled code generation technique. Column 5 (tpc) shows the runtimes of phase-coupled scheduling. For the tested benchmarks our technique was able to produce the same code quality as the hand-written code.
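The paper gives no pseudocode for the scheduler; at a high level, its loop can be pictured as follows (our sketch, all helper predicates assumed):

% Extended list scheduling over an FRT cover: each step picks ready nodes,
% labels their FU/type/SR variables under the ILP constraints of the
% current instruction, and propagation prunes the remaining alternatives.
schedule(Cover, []) :- all_scheduled(Cover), !.
schedule(Cover, [Instr | Instrs]) :-
    ready_nodes(Cover, Ready),          % data dependences satisfied (assumed)
    pack_instruction(Ready, Instr),     % bind FUs, types, SRs; add ILP constraints (assumed)
    commit(Cover, Instr, Cover1),       % reduce resource sets of the cover (assumed)
    schedule(Cover1, Instrs).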
6. Conclusions
The irregular architecture of fixed-point DSPs often prevents compilation of efficient machine code, due to the many constraints imposed by special-purpose registers and restricted ILP. To overcome this problem, we have proposed a novel constraint-driven approach to code selection which takes the irregularities of a DSP architecture into account. We have shown that the proposed DFG covering technique produces better code (for a sequential execution model) than the traditional tree-based method, due to more efficient code selection for CSEs. Furthermore, our approach enables phase coupling by exploiting alternative DFG covers during the scheduling phase. Experimental results demonstrate that by exploiting these two features the quality of compiler-generated code can be significantly improved compared to existing techniques and may come close to the quality of hand-written assembly code.
