Abstract. We describe a novel technique for code selection based on data-flow graphs, which arise naturally in the domain of digital signal processing. Code selection is the optimized mapping of abstract operations to partial machine instructions. The presented method performs an important task within the retargetable microcode generator CBC, which was designed to cope with the requirements arising in the context of custom digital signal processor (DSP) programming. The algorithm exploits a graph representation in which control-flow is modeled by scopes.
Introduction
In the domain of medium-throughput digital signal processing, micro-programmable processor cores are frequently chosen for system realization. By adding dedicated hardware (accelerator paths), these cores are tailored to the needs of new applications. Optimized processor modules can be reused, which is a major benefit compared to high-level synthesis [28], where a completely new design is developed for each application. Because of the application-specific add-ons and the rather short lifetimes of a specific design, there is a need for retargetable software development tools, especially code generators.
Overview
In the next section we briefly discuss several related approaches to code generation and point out some differences from our system. Section 3 introduces the overall architecture and functionality of the CBC code generator. Section 4 explains the code selection task and the basic techniques used. In Section 5 our algorithm is presented. We conclude the paper with experimental results.
Related Work
Landskov et al. [27] present a machine model and methods for microcode generation. A subtask of code selection called bundling and a subset of scheduling called compaction are described. Both methods have a local view of the subject program. The YC system [6] deals with code selection but does not provide detailed scheduling. A phase called combiner only tries to concatenate adjacent operations. The work of Rimey [31, 32] describes a compiler for application-specific DSPs. The main attention, however, is paid to scheduling and data-routing (i.e. mainly register assignment and spilling). Code selection and scheduling are only performed on straight-line code. Optimizations across branch boundaries are not performed. The Marion system [2, 3] performs code generation for RISC architectures. Here, a simple approach for code selection is chosen. A recursive-descent brute-force tree pattern matcher considers neither the graph structure of the intermediate code nor global subexpressions. Our implementation is based on the work of Fraser et al. [13, 14].
The major differences between our code selection approach and similar tasks in "classic" code generation (CG) are:
- Complexity of datapaths. CBC has to deal with highly specialized and optimized datapaths. The hardware units make the efficient execution of frequently used operation sequences possible. Operation patterns for the functional units of these datapaths are much more complex than for standard microprocessors.
- Type-handling. DSP algorithms may employ a large variety of different word lengths and numerical types. The hardware operators are restricted to fixed word lengths. A correct mapping must always be found. In most CG work this topic is neglected because language definitions (and hence the compilers) are restricted to "implementation-dependent" types.
- Evaluation order. Approaches like [6, 7] dealing with code selection assume a fixed evaluation order, which is usually derived from the imperative source code. There is no explicit scheduling phase included in the back-ends. Commonly, register allocation is performed during code selection. Most of the time this is done by graph coloring [4] or "on-the-fly".
- Parallelism of functional blocks. Most DSP architectures contain several functional units that work in parallel. Therefore, the final code cannot be emitted during or immediately after the code selection phase, because partial instructions must be "compacted" into complete instructions at a later stage of compilation, exploiting the possible parallelism. Consequently, code selection must not specify the complete behavior of the machine for each cycle. It must only select code for each of the individual units.
- Machine description. In the compiler writer community, machine descriptions are mainly intended to be used by the code generator only. Some detailed knowledge of the compiler is necessary to write good descriptions. By giving the semantics for each instruction as a transformation of the machine state, we describe the instruction set in a behavioral way. Out of this machine description, various machine models can be generated depending on the application (e.g. code generator, assembler or simulator).
- Intermediate representation. Our intermediate representation is based on a data- and control-flow graph description that differs from the representations used in many compilers.
Anatomy of the Compiler
In CBC, code generation is split into different tasks. Each of these is performed by a specific tool. The intermediate results are passed on in human-readable text files. Figure 1 shows the general layout of the code generator, the underlying data- and rule-base, as well as the retargeting mechanism. The primary goal is to generate highly optimized code from the description of the algorithm, which is specified graphically or textually in a signal flow graph. In principle, it is also possible to write the algorithm in other languages that are capable of modeling parallel behavior in an adequate way, e.g. the synchronous subset of the applicative real-time DSP language ALDiSP [16]. The intermediate representation can be easily obtained from a signal flow graph and will be described in section 3.3.
Retargeting
In our approach, the language nML [15] is used to describe the target architecture (see Fig. 2). Originally designed as a simple means for expressing programming models as found in the usual programmer's manuals, it has turned out to be powerful enough to describe current and future DSP cores; it may even serve as the basis for high-level hardware synthesis. When retargeting the compiler, the nML analyzer examines the instruction set and the memory description of the target processor and builds a machine model, i.e. a representation of the capabilities and constraints of the machine. The process of building this model is detailed in [10, 12]. The machine model, along with the datapath constraints and machine-independent transformation rules, is given as input to the generic (parameterized) code generator. The transformation rules specify, for example, how to perform a 32-bit addition on a 16-bit machine. The phases of code generation and the construction of the generic compiler are outlined in [9, 11].
Code Generation Script
The main tasks of code generation are:
- Signal flow graph translation. This is the algorithmic design entry to the code generator. The specification of the application program is constructed using a schematic editor and a simulation tool. The resultant signal flow graph is translated by this front-end into the code generator's internal data format.
- Control-flow transformations. Transformations concerning the mutually exclusive execution of operations depending on certain conditions are performed to reduce the overall execution time. A pure data-driven representation is translated into a hybrid data/control-driven representation reflecting the requirements of branch controllers and conditional transfers used in programmable DSP systems [11, 26] (see footnote 3).
- Code selection. Subsets of the algorithm are mapped to datapaths. First, high-level operations of the algorithmic input are expanded into machine-executable operations. Then, chains of expanded operations are merged to form more complex operations that are provided by the machine. This clustering reduces the complexity of the scheduling task and allows optimized code generation in reasonable time.
- Scheduling and data-routing. The operations in the graph are ordered in time. To produce high-quality code, efficient scheduling is a necessity. The goal of scheduling is minimum execution time for a given algorithm on an architecture which is fixed at compile-time. Therefore, the assignment of registers to intermediate values, the generation of data-routes (including spill-code) and scheduling are performed in parallel [23, 32].
Intermediate Representation
The intermediate representation is a control/data-flow graph (CDFG). A CDFG is a program description based on a directed graph $(N, E)$ consisting of two finite sets: the nodes $n_i \in N$ represent the operations of the program, and the directed edges $e_i \in E$, which are ordered pairs of nodes $e_i = (n_j, n_k)$, display dependencies between the operations. An edge can model either a data-flow dependency (i.e. a data flow path) or an additional control-flow constraint. The CDFG describes the body of the main execution loop of an application. Cycles in the graph result only from algorithmic delay operations, which are used to refer to values from earlier incarnations of loops.
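As an illustration of this graph model, the following Python sketch stores a CDFG as a node table plus a list of typed edges. The class name `CDFG`, its methods and the node names in the example are assumptions made for this sketch; they do not reflect CBC's internal data format.

```python
from dataclasses import dataclass, field


@dataclass
class CDFG:
    """Directed graph (N, E); every edge is tagged 'data' or 'control'."""
    nodes: dict = field(default_factory=dict)   # node id -> operation name
    edges: list = field(default_factory=list)   # (source, target, kind)

    def add_node(self, node_id, op):
        self.nodes[node_id] = op

    def add_edge(self, src, dst, kind="data"):
        if kind not in ("data", "control"):
            raise ValueError("edge kind must be 'data' or 'control'")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be nodes of the graph")
        self.edges.append((src, dst, kind))

    def uses(self, node_id):
        """Nodes that consume the value defined by node_id."""
        return [d for s, d, k in self.edges if s == node_id and k == "data"]


def build_example():
    """A shift result used by an add and a sub, plus one precedence edge."""
    g = CDFG()
    for node_id, op in [("x", "source"), ("sh", "shift"),
                        ("add", "add"), ("sub", "sub")]:
        g.add_node(node_id, op)
    g.add_edge("x", "sh")
    g.add_edge("sh", "add")
    g.add_edge("sh", "sub")
    g.add_edge("add", "sub", kind="control")    # ordering only, no data
    return g
```

Keeping the edge kind on the edge itself means that data uses can be queried without being confused by precedence constraints, which matters later when multiple uses of a value have to be detected.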
The data-flow graph models all data dependencies and operations. An operation node can be executed whenever input data is available. Inputs and outputs of the program are represented as data sources and data sinks. Data is represented as signals. A signal represents an infinite stream of values. For synchronous data-flow, the amount of data produced and consumed for each node is specified a priori. Our data-flow model limits the amount of data produced and consumed in a single cycle to exactly one. The execution of an operation therefore consists mainly of the use of one signal at each incoming edge and the definition of one signal at each outgoing edge.
The control-flow graph is basically a hierarchical structure of macro nodes. A macro node is a cluster of operations and other macro nodes. Macro nodes are used to model loops and conditional scopes. All operations inside a specific conditional scope are related to a certain condition. Additionally, control-flow edges display precedence relations between operations. At the beginning of code generation there are few control-flow edges; later phases insert additional control-flow information modeling in-place storage of signals and the programming of the branch controller. The scheduler must find an explicit execution order for all operations, resulting in a sequentially executable microprogram.
Footnote 3. This task is actually split in two: one phase before and one phase after code selection. The first phase rewrites scope structures and the second inserts jump operations.
For the different stages of code generation, three distinct sets of arithmetic and logic operations exist in a common library:
- Abstract operations (AOs). This is the set of high-level operations that is available in the initial input-level graph.
- Machine-executable operations (MEOs). This set consists of operations which correspond to primitives of the nML description. All initial CDFG operations must be mapped to members of this set.
- Datapath operations (DOs). The third set comprises operations which occupy a full datapath. They are the basic entities for the scheduling process. These operations are formed out of the MEOs during chaining and represent the valid combinations of MEOs.
Besides these operations, some canonical operations identifying the action on dedicated hardware (such as accelerator paths) can also be included in the algorithm at each stage of the translation. Since they represent both abstract and datapath operations, they are included in the CDFG upon entry to the script and need not be transformed during code selection. Two more groups exist:
- Transfer operations. These are used to describe assignments of data to memory locations and moves on buses. They are inserted into the description to route data between different storage locations and correspond to addressing modes and move operations.
- Control-flow operations. All conditional and unconditional jumps belong to this set.
The Problem of Code Selection
Prior to code selection, the algorithm consists of operations that are machine-independent and well-typed. After code selection, the algorithm must consist of operations that are equivalents for clusters of MEOs. These clusters are associated with datapaths and must not violate encoding restrictions. The first stage of code selection consists of two interleaved phases: machine-parameterized macro expansion and mapping to machine-executable operations. The second stage maps parts of the algorithm to datapaths.
The General Approach: Macro Expansion and Chaining
During macro expansion, operations in the CDFG are expanded into operations available on the machine. For example, multiplications are broken down to combinations of additions and shifts or into Booth-multiplication steps [24]. This process is controlled by rules, which are parameterized by the set of specific hardware operators offered by the target machine. Therefore, the rules are machine-independent, but the choice between them is driven by the structure of the target machine.
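To make the idea of such an expansion concrete, the following Python sketch breaks a multiplication by a non-negative constant into shifts and additions. It uses the plain binary decomposition of the constant, not Booth recoding, and the function names are assumptions for this illustration; CBC's actual rule format is not shown in this paper.

```python
def expand_const_multiply(constant):
    """Return the shift amounts whose shifted operands sum to x * constant."""
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    shifts = []
    bit = 0
    while constant:
        if constant & 1:            # a set bit contributes x << bit
            shifts.append(bit)
        constant >>= 1
        bit += 1
    return shifts


def evaluate(x, shifts):
    """Execute the expanded form: one shift per entry, summed by additions."""
    return sum(x << s for s in shifts)
```

For the constant 10 (binary 1010) the expansion consists of a shift by one, a shift by three and one addition. Whether this form or a multiplier step is preferable depends on the operators the target machine offers, which is exactly the choice the rules leave open.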
When mapping to MEOs, limited word lengths are taken into account, i.e. the expanded execution of an operation on a smaller word length datapath is constructed. For example, an addition of two 32-bit values could be performed on a 16-bit datapath with two additions (assuming an addition with carry is possible). This task employs the Cathedral-2nd tool for expansion [28]. However, it relies heavily on our own operation library [29], which is two-fold: a machine-independent part describes constant folding and other peephole optimizations; a machine-dependent part describes all MEOs as well as the corresponding expansion rules. The machine-dependent entries are either generated or instantiated from templates during the retargeting process. Implementation alternatives are given from which the appropriate expansion can be chosen.
To allow the generation of optimized code within reasonable time, it is important to reduce the complexity of the scheduling task. Therefore, the second part of the code selection task maps subsets of the algorithm onto datapaths prior to scheduling. Once all high-level operations are refined to MEOs, clusters of directly data-dependent operations which can be performed on a datapath within a single cycle are identified. These chains of operations are merged and replaced by a single operation each, thus forming more complex operations that are provided by the machine. These datapath operations occupy a complete datapath. In Fig. 4 a CDFG is clustered to be executed on the depicted datapath. The shift operations (>>) are executed on the shifter and the arithmetic operations (+ and -) are executed on the ALU core.
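The 32-bit addition mentioned above can be worked through in a few lines of Python. The sketch assumes a 16-bit adder with carry input and carry output; the helper names `add16` and `add32_on_16bit` are hypothetical and only serve the example.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """Model of a 16-bit adder: returns (16-bit sum, carry out)."""
    total = a + b + carry_in
    return total & MASK16, total >> 16


def add32_on_16bit(x, y):
    """Add two 32-bit values using two 16-bit additions."""
    lo, carry = add16(x & MASK16, y & MASK16)                       # plain add
    hi, carry_out = add16((x >> 16) & MASK16, (y >> 16) & MASK16,
                          carry)                                    # add with carry
    return (hi << 16) | lo, carry_out
```

The second addition depends on the carry of the first, so the two expanded operations are ordered by a data dependency and cannot be executed in the same cycle on one datapath.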
Global Chaining
As outlined earlier, the goal of chaining is a "good" assignment of machine operations to datapaths. This implies that chaining assists the scheduler; it could indeed be integrated into the scheduling phase at the expense of increased complexity and run time. On the other hand, when chaining is done outside the context of scheduling, little information about resource usage is available. Especially in the presence of multiple similar datapaths, it is hard to estimate the impact of a particular chaining decision on the quality of the resulting code: operator assignment performed during chaining may result in schedules that do not fully exploit the potential parallelism of the machine. To decouple the two tasks, the chaining tool must annotate chains with implementation alternatives. In this paper we can thus neglect the problem of similar datapaths.
Since the architectures under consideration feature complex datapaths, we emphasize that whole expressions are assigned to a single datapath whenever possible. A chaining decision can affect the choices for distant operations, i.e., it has global effects. Therefore, large pieces of the CDFG must be considered when making a specific decision.
Encoding Restrictions
In general, the set of operation tuples executable on a datapath is not equal to all possible combinations of the hardware operators' functionalities; the designer may (and usually will) have imposed restrictions on operation chains. This is quite natural: the number of possible combinations affects the length of the instruction word. It might be necessary to omit some (rarely needed) combinations to reduce the instruction word length. Furthermore, there may be conflicts in the datapath hardware that prohibit certain combinations. As a result, code selection has to comply with encoding restrictions. Since the datapath structure alone is clearly not sufficient to hold this information, we decided to represent legal chains as a set of rewrite rules. Pattern matching is employed to find legal chains in the CDFG.
Matching on Trees
Pattern matching is an established technique for instruction selection from expression trees in compilers for imperative languages [17, 21]. Code selection for stock microprocessors focuses mostly on a good exploitation of complex addressing modes. In the context of CBC, however, the emphasis is on good utilization of the complex datapaths. Nevertheless, similar tools can be used at the technical level. In the CBC environment, all legal patterns are generated by the nML front-end [9] and stored as a set of match-replace pairs (see Fig. 5 for an example). The match-replace database is intentionally kept human-readable to allow an experienced user to modify some rules or add new rules by hand (e.g. for special optimizations). The depicted rule does not take commutativity of the add operation into account. This is not a serious problem; the nML front-end simply generates multiple patterns (in this case, s = add(t,i2) is replaced by s = add(i2,t)).
In the context of our compiler, the term rewrite system is not one monolithic unit; pattern matching and rewriting are separate phases. The tree parser generator we use, Iburg [14], is only concerned with the matching phase; the connection to the rewrite phase is made by match rule numbers. The tree grammar (from which the tree parser is generated) and the rewrite procedure are both generated by our chaining preprocessor, which takes the rewrite rules as its input. The incorporated match algorithm is an extension to the BURS (Bottom Up Rewrite System) [30] theory and allows the computation of an optimal rewrite sequence for a tree (by matching the rewrite rules to subtrees), given a fixed set of rewrite rules with fixed costs. This computation takes time linearly proportional to the size of the tree. For the selection of the optimum match, tree parsing with dynamic programming is used [1]. The tree parser generator Burg [13] performs the dynamic programming at parser generation time and thus generates highly efficient pattern matchers. Iburg, a heavily simplified Burg version, still generates very efficient parsers, but their running time is no longer independent of the number, size, and structure of the patterns (see footnote 9). Because of its simplicity, Iburg can be modified quite easily. We extended it to accept certain match conditions in the rules; this way we can conveniently express type constraints or other operand constraints which are imposed by the hardware operators.
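The principle of cost-driven tree matching with dynamic programming can be sketched in Python as follows. This is not Iburg and not the rule base generated by the nML front-end: the rule names, the uniform costs and the tuple encoding of trees are assumptions chosen to mirror a shifter followed by an ALU. Like Iburg, the sketch does the dynamic programming at match time.

```python
from functools import lru_cache

# Hypothetical rules: (name, pattern, cost). The leaf "reg" matches any
# subtree that has itself been reduced to a register value.
RULES = (
    ("shift",     ("shr", "reg", "reg"), 1),
    ("add",       ("add", "reg", "reg"), 1),
    ("sub",       ("sub", "reg", "reg"), 1),
    ("shift_add", ("add", ("shr", "reg", "reg"), "reg"), 1),
    ("shift_sub", ("sub", ("shr", "reg", "reg"), "reg"), 1),
)


def match(pattern, tree):
    """Return the subtrees bound to 'reg' leaves, or None if no match."""
    if pattern == "reg":
        return [tree]
    if isinstance(tree, str) or tree[0] != pattern[0] or len(tree) != len(pattern):
        return None
    bound = []
    for sub_pattern, sub_tree in zip(pattern[1:], tree[1:]):
        result = match(sub_pattern, sub_tree)
        if result is None:
            return None
        bound.extend(result)
    return bound


@lru_cache(maxsize=None)
def best_cover(tree):
    """Cheapest cover of a tree: (total cost, names of the rules used)."""
    if isinstance(tree, str):               # leaf: value already in a register
        return 0, ()
    best = None
    for name, pattern, cost in RULES:
        bound = match(pattern, tree)
        if bound is None:
            continue
        total, used = cost, (name,)
        for sub in bound:                   # cover the remaining subtrees
            sub_cost, sub_used = best_cover(sub)
            total += sub_cost
            used += sub_used
        if best is None or total < best[0]:
            best = (total, used)
    if best is None:
        raise ValueError("no rule covers operator %r" % (tree[0],))
    return best
```

For the tree `("add", ("shr", "x", "n"), "y")` the chained rule `shift_add` covers both operations at cost 1, whereas `("sub", "a", ("shr", "b", "c"))` needs two rules because the shift feeds the second operand and no pattern for that operand order is present in this rule set.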
Code Selection on Graphs
Commonly, code selection is performed on expression trees. These are (partial) statements usually directly reflecting source language statements. The programs being compiled in our environment contain a large amount of decision making and common subexpressions. As mentioned above, cycles in the graph only result from values produced by delay operations. These are not considered during code selection. Hence expressions in our CDFG model are DAGs. This means that intermediate results can have more than one use (Fig. 6a), and these uses can also reside in different conditional scopes (Fig. 6b). Figure 6c shows a signal that has multiple definitions in different conditional scopes. In traditional compilers, conditions are at "borders" of basic blocks. The if-then-else statements themselves are also subject to code selection. In our approach, operation nodes of the CDFG have a conditional context. For each condition a flag is computed and connected to a macro node, i.e. a scope. Then, global data-flow is specified, i.e. signals "enter" and "leave" scopes. This representation facilitates code selection that transcends basic blocks.
Consider the architecture depicted in Fig. 4. A value produced by the shifter is not immediately available for more than one operation. For that purpose an addition with zero must be performed to pass the value unchanged through the ALU core, i.e. the value is "spilled" to a register. We assume that the datapaths do not fork (and thus do not allow multiple uses within themselves). This implies that an operation defining a multiply used value can never be chained. The CDFG in Fig. 6a would be mapped to three operations (a shift, an add and a sub) instead of two in the optimal case (a shift-add and a shift-sub). The best results possible for the datapath of Fig. 4 are shown in Fig. 7.
Footnote 9. However, informally speaking, for our purposes the generated parsers have "nearly" linear behavior and are still fast enough.
The Simple Approach: "Undagging"
Earlier sections were concerned with tree parsing and pattern matching in trees. On the other hand, it was shown that CDFGs are definitely not tree-like in the general case. The resulting problems have to be solved. Taking the previous section into account, it can be seen that some subgraphs indeed have a tree structure, namely those that lie between points of multiple uses and multiple definitions. Incidentally, the values which are defined or used more than once must be held in registers: multiple definitions require a proper modeling of control-flow, which cannot (generally) be mapped onto the datapath; multiple uses map to different instructions (see footnote 11). This leads to a very simple chaining method: cutting the DAG whenever a value is defined or used in multiple places yields a set of (usually small) trees. These trees are then individually processed by the rewrite system and reconnected afterwards to compose a chained version of the original DAG.
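The cutting step can be sketched in Python for the case of multiple uses; multiple definitions in conditional scopes are left out to keep the example short. The dictionary encoding of the DAG and the function name `undag` are assumptions of this sketch.

```python
from collections import Counter


def undag(operands):
    """Split a DAG into trees by cutting at every multiply used value.

    operands maps each operation to the list of values it reads. Values
    that are not keys of the mapping are treated as primary inputs.
    Returns {root: tree}; a tree is a nested tuple (op, subtree, ...),
    and a cut point or an input appears as a plain string leaf.
    """
    use_count = Counter(v for ops in operands.values() for v in ops)
    # A tree root is an operation whose result is unused or used repeatedly.
    roots = [n for n in operands if use_count[n] != 1]

    def grow(node, is_root):
        if node not in operands or (use_count[node] > 1 and not is_root):
            return node                      # input or cut point: leaf
        return (node,) + tuple(grow(v, False) for v in operands[node])

    return {root: grow(root, True) for root in roots}
```

Applied to a shift whose result feeds both an add and a sub, the function yields three single-operation trees, so no chain can be formed. If the shift has only one use, it stays inside the tree of its consumer and remains available for chaining.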
A More Sophisticated Approach: Heuristic Node Duplication
The advantage of the previous method is its simplicity. However, in a significant number of cases chaining possibilities are lost due to cuts at multiple uses or definitions. We seek a way to improve this situation. The key insight is that the CDFG must be modified in order to create more chaining possibilities (see footnote 12). Consider the cases where chaining possibilities may be missed. There are essentially four of them:
- A signal has one definition and multiple uses in the same scope. This implies that this signal must be made available to different DOs (since multiple uses in the datapath are not possible). By duplicating the definition once for each use, the multiple use is resolved (while introducing a multiple use at each operand) and the desired chaining possibilities are created.
- A signal has one definition and multiple uses, at least one in another scope. To generate a chaining possibility, the uses must be within the same scope as the definition. A closer look reveals that this case resembles the previous one; it is solved in basically the same manner.
- A signal has multiple definitions (in mutually exclusive scopes) and one use outside the scopes of the definitions. A chaining possibility can be created by duplicating the use and nesting the copies into the scopes of the definitions. However, we must bear in mind that if the use has yet another operand multiply defined in different scopes, a particular evaluation order for the mentioned defining scopes would be enforced. This could be undesirable (see Fig. 8a).
- A signal has multiple definitions (in mutually exclusive scopes) and multiple uses. This case is not considered further, since duplication of operations rarely leads to shorter code here. Consider a signal that has $n$ uses and $m$ definitions. If each of these operations is (trivially) chained to a single DO, we get $n + m$ operations. If all necessary copies of operations are generated to get more chaining possibilities and each of these is actually chained, the result is $n \cdot m$ DOs (see Fig. 8b).
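The count for the last case can be made explicit. Assuming that full duplication yields one DO per pair of use and definition, i.e. the product of n and m, the difference to the n + m DOs of the trivial chaining is:

```latex
\[
  n \cdot m - (n + m) \;=\; (n - 1)(m - 1) - 1 \;\ge\; 0
  \qquad \text{for } n, m \ge 2 ,
\]
```

with equality only for n = m = 2. Under this assumption, duplication never reduces the number of DOs once both the uses and the definitions are multiple, which supports leaving this case out.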
Footnote 11. This is a consequence of the postulate that no multiple uses exist in the datapaths.
Footnote 12. More exactly, we do not want to create chaining possibilities per se but only in those places where this will lead to an improvement of the generated code.
It can be seen that the creation of chaining possibilities is associated with duplication of nodes. When duplicating excessively, the graph might grow too large. This is overcome by first partitioning the graph, which yields (usually) small partitions, and then processing each partition in turn. The partitions are chosen so that no chains across partition boundaries are possible.
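The duplication step for the simplest case, one definition with multiple uses in the same scope, might look as follows in Python. The dictionary encoding, the `#k` naming of copies and the restriction to one level of sharing are assumptions of this sketch, not a description of the CBC implementation.

```python
from collections import Counter


def duplicate_multiple_uses(operands):
    """Give every use of a multiply used operation its own copy.

    operands maps each operation to the list of values it reads; values
    that are not keys are primary inputs and are never duplicated.
    """
    use_count = Counter(v for ops in operands.values() for v in ops)
    shared = {n for n in operands if use_count[n] > 1}
    for n in shared:
        if any(v in shared for v in operands[n]):
            raise ValueError("this sketch handles one level of sharing only")

    result = {}
    serial = Counter()
    for node, ops in operands.items():
        if node in shared:
            continue                        # replaced by per-use copies
        new_ops = []
        for v in ops:
            if v in shared:
                serial[v] += 1
                copy = "%s#%d" % (v, serial[v])
                result[copy] = list(operands[v])   # same operands as original
                v = copy
            new_ops.append(v)
        result[node] = new_ops
    return result
```

After duplication each copy has exactly one use, so it can be chained with its consumer. The operands of the duplicated operation now have several uses instead, which is the growth effect that partitioning is meant to keep under control.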
One problem not yet mentioned is the identification of chaining possibilities. A simple heuristic is employed: for all pairs of MEOs, the pattern base is searched, counting the occurrences of an edge between both operations. (This is quite informal, but should be intuitively justified.) This information is then used for partitioning; two operations are put into the same partition if a chaining possibility exists. Partitions including only one operation are trivial cases.
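A possible reading of this partitioning step is a union-find pass over the data edges, sketched below in Python. The occurrence counts are reduced to a set of MEO pairs with a non-zero count; this simplification, the argument names and the function name `partition` are assumptions of the sketch.

```python
def partition(nodes, edges, chainable):
    """Group operations so that no chain can cross a partition boundary.

    nodes: {node id: MEO name}; edges: (producer, consumer) data edges;
    chainable: set of (producer MEO, consumer MEO) pairs that occur as an
    edge in at least one pattern of the (hypothetical) pattern base.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]    # path halving
            n = parent[n]
        return n

    for src, dst in edges:
        if (nodes[src], nodes[dst]) in chainable:
            parent[find(src)] = find(dst)    # merge the two partitions

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

With a pattern base that only allows a shift to feed an add, a chain shift, add, multiply, add falls into three partitions: the shift together with the first add, and the two remaining operations as trivial single-node partitions.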
When duplicating into scopes (either at a multiple definition or at a multiple use), the code size might be increased but (usually) not the execution time, because only one of the exclusive scopes is executed. This argument, however, does not apply to the case shown in Fig. 6a. Therefore, a common subexpression elimination (CSE) phase, which follows the pattern matcher, removes most of the unchained operation copies from the graph. This also works for duplications at scope boundaries. Since with duplication there is a danger that the number of operations (and thus program size) increases unduly, a cost function is used to resort to the simple undagging method in cases where the node duplication heuristic fails. The cost is computed for both the undagging and the node duplication method as the weighted sum of the number of resulting DOs (a rough estimate of the code size) and the expected number of executed DOs on each execution path (a rough estimate of execution time). The better alternative is kept.
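The comparison can be written down as a small Python sketch. The paper does not give the weights or the way the expectation over execution paths is formed, so the path probabilities, the default weights and both function names are assumptions made here for illustration.

```python
def chaining_cost(num_dos, path_lengths, path_probs, w_size=1.0, w_time=1.0):
    """Weighted sum of code size and expected number of executed DOs.

    num_dos: total number of DOs after chaining (code size estimate).
    path_lengths: number of DOs executed on each execution path.
    path_probs: probability of each path; assumed to sum to 1.
    w_size, w_time: hypothetical tuning weights.
    """
    if len(path_lengths) != len(path_probs):
        raise ValueError("one probability per execution path is required")
    expected = sum(n * p for n, p in zip(path_lengths, path_probs))
    return w_size * num_dos + w_time * expected


def choose(undag_result, dup_result, **weights):
    """Keep the cheaper alternative; ties fall back to undagging."""
    undag_cost = chaining_cost(*undag_result, **weights)
    dup_cost = chaining_cost(*dup_result, **weights)
    return "duplication" if dup_cost < undag_cost else "undagging"
```

With equal weights, a duplicated graph of 6 DOs whose two equally likely paths execute 2 and 3 DOs (cost 8.5) is preferred over an undagged graph of 5 DOs whose paths execute 4 DOs each (cost 9). Setting `w_time=0` makes code size the only criterion and reverses the decision.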
Results
The experimental results shown in Table 1 are taken from a "real-life" ADPCM algorithm, which is incorporated into speech compression applications. The (exemplary) datapath from Fig. 4 served as target architecture. The tool was implemented and tested on a SPARCstation IPX using C++. All CPU times, including parsing and computation of statistics, are less than one minute. Therefore, they are not explicitly given.
We have presented an algorithm for code selection on control/data-flow graphs. The approach is based on a global view of the subject programs. The points of interest are multiple uses of values resulting from common subexpressions and multiple definitions of values resulting from conditional scopes. An implementation of the algorithm is incorporated into the CBC compiler and was successfully tested with the Siemens DECT (Digital European Cordless Telephone) design. Future research includes the coupling of code selection and scheduling as well as the adaptation of our technique to loops.
