In this paper, we present the application of dynamic trellis diagrams (DTDs) to the automatic translation of data flow graphs (DFGs) into highly optimized programs for digital signal processors (DSPs). In contrast to static trellis diagrams (STDs), which may be precalculated, DTDs are built at runtime and adapted exactly to the local requirements. Therefore, DTDs are more flexible and need less program memory. Due to the significant reduction in memory size, the increase in compilation time is only moderate. At present, the concept of DTDs has been successfully applied to DFG compiler implementations for a variety of general purpose DSP families, including Motorola's DSP56000 and Analog Devices' ADSP-2100.
INTRODUCTION
Programming of general purpose DSPs is either very inefficient or very time-consuming. The reason is the lack of well-performing compilers in an environment where real-time algorithms have to be implemented. Typically, hand-coded DSP programs are about an order of magnitude faster than code generated by a standard DSP C compiler.
To produce better results, a DFG compiler has been developed which generates highly optimized assembly code from a DFG specification. An overview of the compiler is given in Figure 1.
It consists of a DFG decomposition unit which splits the DFG into one or more expression trees (ETs) [1, 2]. This step is necessary because the generation of optimal code directly from DFGs is known to be NP-hard [3]. However, it has been shown in [4] and [5] that expression trees can be translated into optimal assembly code by an O(n) algorithm. During the code generation step, a trellis tree is generated for each expression tree by concatenating the appropriate trellis diagrams and inserting transfer diagrams in between. Code generation is done by looking for minimum weighted paths in the trellis tree, using data transfer instructions only if necessary. As the generated code is optimal for each expression tree, the trees should have the maximum size possible. In the final step, the code segments are concatenated and compacted by combining two or more instructions into one whenever this is possible, yielding highly optimized code for the given DFG.

The DFG compiler has been made machine independent by using a behavioural target architecture description. This description is specified in TDL [6], which was designed to match the requirements of the compiler. Retargeting can be performed by simply replacing the target architecture description. At present, descriptions exist for a subset of the functionality of Motorola's DSP56000, Analog Devices' ADSP-2100, Texas Instruments' TMS320C5x, and NEC's 7701.

This work was supported by the Fonds zur Förderung der wissenschaftlichen Forschung (FWF) under research grant P10701-OTE.
RELATED WORK
Previous versions of the DFG compiler used static trellis diagrams [1, 4]. Trellis diagrams are the building blocks for trellis trees, which serve as the underlying data structure in the code generation process. As shown in Figure 3, they consist of nodes and edges. The nodes, which are also called states, correspond to a particular storage resource type, which can be either a specific register or an arbitrary memory location. Each state specifies not only the operand location but also a set of data registers which is assumed to be available for the execution of the instructions. Edges correspond to instructions of a specified arithmetic or logic operation.
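To make the role of states and edges more concrete, the following minimal sketch computes minimum path weights over a very small trellis tree. It is an illustration under simplifying assumptions, not the compiler's implementation: states are reduced to the result location (`A`, `B`, or `Mem`), register locking is ignored, and the names `PATTERNS`, `TRANSFER_COST`, `Node`, and `best_costs` as well as all cost values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Result locations used as (simplified) states; register locking is ignored.
STATES = ("A", "B", "Mem")

# Hypothetical instruction patterns: op -> [(dest, left_src, right_src, cost)].
PATTERNS = {"add": [("A", "A", "B", 1)]}

# Hypothetical cost of one data transfer between two different locations.
TRANSFER_COST = 1


@dataclass
class Node:
    op: Optional[str] = None          # None marks a leaf (operand in memory)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def best_costs(node: Node) -> dict:
    """Return the minimum cost of having the node's value in each state."""
    if node.op is None:
        direct = {"Mem": 0}           # a leaf operand already resides in memory
    else:
        left = best_costs(node.left)
        right = best_costs(node.right)
        direct = {}
        for dest, lsrc, rsrc, cost in PATTERNS[node.op]:
            total = cost + left[lsrc] + right[rsrc]
            direct[dest] = min(direct.get(dest, float("inf")), total)
    # Transfer step: reach every state directly or via one move instruction.
    return {
        s: min(c + (0 if d == s else TRANSFER_COST) for d, c in direct.items())
        for s in STATES
    }


a_plus_b = Node("add", Node(), Node())
print(best_costs(a_plus_b))
```

The sketch returns only the path weights; recording which pattern and transfer achieved each minimum would be needed to recover the instruction sequence itself.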
The trellis diagram presented in Figure 3 shows the relevant parts of the add instruction Ai = Ai + Bj of a hypothetical four-register machine with the registers A0, A1, B0, and B1. Let A = {A0, A1} and B = {B0, B1} be homogeneous register sets. At the top of the figure, the destination states are listed. In this diagram, we only consider destination state 0A0B, which indicates that neither a register of set A nor one of set B is locked and that the result is stored in a register of set A. The register assignment will be done later on. In our notation, writing a register set first means that one of its registers contains the current result. The two sets of source states at the bottom of Figure 3 represent the register states for the operands. If the left operand is evaluated first, again all registers may be used for computing the operand with the result in a register of set A (left source state 0A0B). The right operand, which is evaluated afterwards, must not modify the register containing the left operand calculated before. Thus right source state 0B1A is used, indicating that one register of set A is locked and the result is stored in register set B. This is represented by the solid lines in Figure 3, while the dotted lines indicate that the right operand is calculated first. By exploiting commutativity, the two source operands may be swapped (Ai = Bj + Ai), which would lead to two more states for both source operands. Since the principle remains the same, we do not show these states in Figure 3. When constructing STDs, this procedure has to be repeated for all possible destination and source states defined by the target architecture.

Figure 3: STD of a simple add instruction
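The derivation of the source states for Ai = Ai + Bj can be sketched in a few lines. This is a minimal illustration assuming that a state is fully described by the register set holding the result plus the number of locked registers per set; the helper names `label` and `source_states` are hypothetical, and the labels produced for the right-operand-first case are inferred from the notation rather than quoted from the text.

```python
def label(result_set: str, locked: dict) -> str:
    """Format a state as in the text, e.g. '0B1A': the set holding the
    result comes first, each set is prefixed by its locked-register count."""
    other = "B" if result_set == "A" else "A"
    return f"{locked[result_set]}{result_set}{locked[other]}{other}"


def source_states(dest_locked: dict, left_first: bool) -> tuple:
    """Return (left source state, right source state) for Ai = Ai + Bj.

    The left operand must end up in set A, the right operand in set B.
    The operand evaluated second must not overwrite the register that
    holds the first operand, so one more register of that set is locked.
    """
    first_set, second_set = ("A", "B") if left_first else ("B", "A")
    first = label(first_set, dest_locked)
    locked_after = dict(dest_locked)
    locked_after[first_set] += 1      # register holding the first operand
    second = label(second_set, locked_after)
    return (first, second) if left_first else (second, first)


dest = {"A": 0, "B": 0}               # destination state 0A0B
print(source_states(dest, left_first=True))
print(source_states(dest, left_first=False))
```

For the destination state 0A0B, the left-first call reproduces the pair 0A0B and 0B1A described above. The sketch does not check that the locked count stays within the size of a register set.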
The total number of different states in a trellis diagram is bounded by an expression given in [7], in which N_E denotes the number of data registers. Essentially, the number of states increases exponentially with the number of registers. Although the number of states can be reduced if homogeneous register subsets exist, it can remain too large for practical implementations. As an example, the static trellis diagrams for Motorola's DSP56000 consist of 103 different states, while STDs for the ADSP-2100, which has more heterogeneous registers, contain more than half a million states, making it practically impossible to use STDs for this target architecture.

Trellis trees are built from expression trees by replacing each ET node by a trellis diagram. The instructions represented by a trellis diagram realize the arithmetic or logical operation of the corresponding ET node. The trellis diagrams are augmented with transfer diagrams which ensure that the instruction operands of adjacent arithmetic or logical operations may be located in different registers or that intermediate results may be stored in memory. As an example, Figure 2 shows the expression tree and the schematic representation of the resulting trellis tree for an addition augmented by move diagrams.

If the diagrams are not prebuilt, on the other hand, every trellis diagram has to be created from scratch, resulting in a run-time overhead when compared to the prebuilt diagrams of the STD algorithm. To reduce this overhead, every state encountered is saved in an AVL tree. If a specific state is needed again, which is likely due to the symmetric nature of most signal processing algorithms, the existing state objects can be copied instead of being rebuilt. This compensates for some of the run-time overhead.
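The reuse of previously built states can be illustrated with a small cache. Note that this sketch swaps in a Python dictionary (a hash table) for the AVL tree mentioned above, since only the lookup-or-build behaviour matters here; the class `StateCache`, its `build` callback, and the shape of the state objects are all hypothetical.

```python
import copy


class StateCache:
    """Build each distinct state once and hand out copies afterwards."""

    def __init__(self, build):
        self._build = build           # hypothetical constructor: key -> state
        self._store = {}              # stand-in for the AVL tree
        self.builds = 0               # number of states built from scratch

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._build(key)
            self.builds += 1
        # Return a copy so that callers can modify their state independently.
        return copy.deepcopy(self._store[key])


cache = StateCache(lambda key: {"name": key, "edges": []})
first = cache.get("0A0B")
second = cache.get("0A0B")            # served from the cache, not rebuilt
print(cache.builds)
```

A balanced tree such as an AVL tree gives ordered storage with logarithmic lookup, whereas the dictionary used here requires hashable keys; either serves the purpose of avoiding repeated construction.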
DYNAMIC TRELLIS DIAGRAMS
When using the algorithm described above, a large number of states often remains unused. Typically, an arithmetic or logical instruction writes its result back into a specific register or register bank. In these cases, it is unnecessary to create all possible states which are permutations of all registers currently available. We therefore restrict the algorithm to states which may lie on an optimum path. To exploit these potential savings, it is no longer possible to use prebuilt trellis diagrams, as the actual number of states depends on the context. Instead, the trellis diagrams are built dynamically whenever a new node of the expression tree is encountered. The generation starts with building the destination states subject to the following constraints: the state must represent a valid
• destination for the current instruction,
• source for the previously processed instruction.
Once the valid destination states are determined, the source states can be extracted from the target architecture description.
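The two constraints on destination states amount to a set intersection, which the following sketch illustrates. It is a simplified picture under stated assumptions: the `TARGET` table is a hypothetical excerpt standing in for the target architecture description, `build_dtd` is an illustrative name, and transfer diagrams between adjacent trellis diagrams are ignored.

```python
# Hypothetical excerpt of a target description for Ai = Ai + Bj:
# destination state -> list of (left source state, right source state).
TARGET = {
    "add": {
        "0A0B": [("0A0B", "0B1A"), ("0A1B", "0B0A")],
        "0A1B": [("0A1B", "1B1A")],
    }
}


def build_dtd(op, parent_sources, target=TARGET):
    """Build a dynamic trellis diagram for one expression tree node.

    parent_sources holds the source states that the previously processed
    instruction accepts for this operand (None for the root node).
    """
    valid_dest = set(target[op])              # valid for the current instruction
    if parent_sources is not None:
        valid_dest &= set(parent_sources)     # valid for the previous instruction
    # Source states are looked up only for the surviving destination states.
    return {state: target[op][state] for state in sorted(valid_dest)}


root = build_dtd("add", None)
child = build_dtd("add", {"0A0B"})
print(len(root), len(child))
```

In this toy table, the child diagram keeps only one of the two destination states, which is the effect the dynamic construction relies on.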
This dynamic construction saves program memory, as only a small fraction of all possible states will be used. For the example in Figure 3, all the nodes without an edge leading to them would be left out. An example of a DTD is given in Figure 4.

EXPERIMENTAL RESULTS

DTDs have been integrated into a code generator which automatically transforms DFGs into generic DSP assembly code. The program has been tested with several different graphs and the results have been examined. It is not possible to quantify the savings of the DTD concept exactly, because they strongly depend on the specific example. However, the reduction in memory usage is significant, as can be seen in Figures 5 and 6, which have been generated for a second order lattice filter. The DSP56000 diagrams are significantly smaller due to the symmetric nature of the processor. The savings compared to STDs are about 75%. The average number of states is approximately 21, as opposed to a maximum number of 103. The ADSP-2100 diagrams are one order of magnitude larger. However, they are more than 99% smaller than the corresponding STDs. This is due to the highly heterogeneous architecture of the ADSP-2100, which makes the use of STDs practically impossible. The average number of states with DTDs is 275.
A moderate run-time increase has been observed, resulting from the dynamic generation of every single trellis diagram as opposed to the use of precalculated diagrams in the STD algorithm. This increase is partly compensated by a speed gain during the evaluation of the trellis tree, resulting from the much smaller number of states. The total run-time strongly depends on the architecture, ranging from a few seconds for Motorola's DSP56000 family to several minutes for Analog Devices' ADSP-2100. It has to be noted that, due to the huge number of states for the Analog Devices processors, it was impossible to compile code for this architecture using STDs.
CONCLUSIONS
The proposed method of using dynamic trellis diagrams to generate highly optimized DSP code makes DFG based code generation applicable to most available general purpose DSP architectures. By omitting unnecessary states in the trellis tree, the amount of memory needed can be significantly reduced. This is done by creating new trellis diagrams for every node in the tree instead of using precalculated information. When creating the diagrams, only states which are actually used are added to the trellis diagram. The savings are especially high for non-orthogonal DSP architectures. The code generator is architecture independent; the target architecture is specified in a special hardware description language. This information is used for the dynamic generation of the trellis diagrams. Besides the much more efficient memory usage, DTDs allow other improvements of the code generation process which have yet to be exploited.
For the scope of this paper, DTDs have been used to reduce memory requirements when building and evaluating the trellis tree. However, it seems possible to gain additional benefits. With STDs, the implementation of combined instructions such as the MAC (multiply-accumulate) instruction, which is typical of standard DSP architectures, has proven problematic. As a program may contain either a multiplication followed by an addition or a single MAC instruction, one would need two different kinds of STDs to cover the expression tree. As this is not possible, workarounds have to be introduced, which are awkward and cannot be generalized. With DTDs, however, it is possible to make a dynamic connection directly from the leaves of the MAC instruction to the register containing the result. This may be used in combination with separate trellis diagrams for multiplication and addition. Effectively, both alternatives are present in the trellis tree, allowing the compiler to choose the more efficient one.

Another improvement is the possibility of mapping different types of instructions to one trellis diagram. For example, a multiplication by two may be expressed as a multiplication (with a factor of two), an addition (with equal operands), or a shift (by one to the left). With STDs, these additional opportunities are not exploited, since the trellis diagrams for the individual instructions have been prebuilt and are not adapted to the local requirements. DTDs have this capability, as they are created at a point in time where more detailed information about the input operands is available.
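The idea of keeping both the multiply-then-add cover and the single MAC cover available, and letting the cost decide, can be sketched as follows. This is a deliberately reduced illustration that ignores register states altogether; the tuple-based tree encoding, the function `cheapest_cover`, and the unit costs in `COST` are assumptions made for the example, and only a multiplication in the left operand of an addition is considered.

```python
# Hypothetical instruction costs (for example, cycles per instruction).
COST = {"add": 1, "mul": 1, "mac": 1}


def cheapest_cover(tree):
    """Return (cost, instruction list) of the cheapest cover of a tree.

    A tree is either a leaf name (str) or a tuple (op, left, right).
    """
    if isinstance(tree, str):
        return 0, []                      # leaf operands cost nothing here
    op, left, right = tree
    left_cost, left_code = cheapest_cover(left)
    right_cost, right_code = cheapest_cover(right)
    best = (left_cost + right_cost + COST[op], left_code + right_code + [op])
    # Alternative cover: an addition fed by a multiplication becomes one MAC.
    if op == "add" and not isinstance(left, str) and left[0] == "mul":
        _, mul_left, mul_right = left
        cost_a, code_a = cheapest_cover(mul_left)
        cost_b, code_b = cheapest_cover(mul_right)
        alt = (cost_a + cost_b + right_cost + COST["mac"],
               code_a + code_b + right_code + ["mac"])
        best = min(best, alt, key=lambda cover: cover[0])
    return best


print(cheapest_cover(("add", ("mul", "a", "b"), "c")))
```

With the assumed unit costs, the MAC cover wins over the separate multiplication and addition; with a more expensive `mac` entry, the same code would fall back to the two-instruction cover.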
