Abstract
Introduction
The very first step in the design flow of digital systems is the formulation of the behavioral specification in a hardware description language such as VHDL [9]. This behavioral description forms the basis for all subsequent design steps, starting with behavioral (or high-level) synthesis at the algorithmic level. High-level synthesis transforms the behavioral description into a netlist of RT-level components and is generally understood as a mapping of the operations of the data-flow graph (DFG) to control steps and to suitable components of an existing library [16]. Since high-level synthesis systems operate directly on the internal representation of the behavioral description, the chosen formulation style has a lasting effect on later design steps and therefore on the final result.
Although the use of high-level synthesis systems has gained acceptance in recent years, the question of how to suitably formulate a behavioral description has often been underrated. This question arises particularly in the domain of digital signal applications, which are characterized by complex arithmetical computations resulting in complex data-flow graphs. Figure 1a presents the original expression, requiring one adder and two multipliers with a critical path of 3 control steps†. After applying the associativity law (figure 1b), the order of operations has been changed such that component sharing is improved. Therefore, only one multiplier is required after the transformation. The additional exploitation of commutativity shown in figure 1c leads neither to a further improvement of resource sharing nor to a shorter critical path, but may increase the applicability of other transformation rules. Figure 1d shows the expression after exploiting distributivity. Although implementation costs and critical path have not changed further, the application of a special component-driven transformation $(x * y) + z \Rightarrow \mathrm{MAC}(x, y, z)$ is now possible. This rule implies a mapping of the two operations to a single multiplier-adder-accumulator (MAC), in contrast to a separate implementation by one multiplier and one adder. The use of this component-directed transformation has a major impact on the synthesis result: in the latter case, only one MAC is required because this component can also be employed as a simple multiplier or adder by applying the particular identity element to the corresponding input port. Since components like MACs are capable of performing two or more operations in one execution step we call such functional units complex components or BIC
Even this simple example has shown that synthesis results are strongly dependent on the choice of a certain formulation style of the behavioral description. Obviously, it might be difficult to recognize in advance how to formulate a behavioral description that leads to the best synthesis result. This becomes almost impossible for complex data-flow dominated circuits. The goal of applying algebraic transformations can thus be stated as making synthesis results mostly independent of the formulation style or to transform the input description such that synthesis yields better results than for the original description.
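To make the component-driven rule from the example concrete, the following minimal Python sketch rewrites (x * y) + z into MAC(x, y, z) on a small expression tree. The class `Op` and the helpers `to_mac` and `count_units` are hypothetical names chosen for illustration and are not part of the presented system; the sketch counts one unit per operation and ignores scheduling and resource sharing.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Op:
    """One DFG operation: an operator name and its operand subtrees."""
    name: str
    args: Tuple["Expr", ...]


Expr = Union[Op, str, int]  # operation, variable name, or constant


def to_mac(e: Expr) -> Expr:
    """Rewrite (x * y) + z, or z + (x * y) via commutativity, into MAC(x, y, z)."""
    if not isinstance(e, Op):
        return e
    args = tuple(to_mac(a) for a in e.args)  # rewrite operands first (bottom-up)
    if e.name == "+" and len(args) == 2:
        left, right = args
        if isinstance(left, Op) and left.name == "*":
            return Op("MAC", (*left.args, right))
        if isinstance(right, Op) and right.name == "*":
            return Op("MAC", (*right.args, left))
    return Op(e.name, args)


def count_units(e: Expr) -> dict:
    """Count operations per operator type (one unit per operation, no sharing)."""
    counts: dict = {}

    def walk(node: Expr) -> None:
        if isinstance(node, Op):
            counts[node.name] = counts.get(node.name, 0) + 1
            for a in node.args:
                walk(a)

    walk(e)
    return counts


expr = Op("+", (Op("*", ("x", "y")), "z"))
mac = to_mac(expr)
```

Here `count_units(expr)` reports one multiplier and one adder, while `count_units(mac)` reports a single MAC, which mirrors the saving described for the introductory example.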
The remainder of this paper is organized as follows: Section 2 gives an overview of the research on algebraic optimization in different areas. After introducing some hardware-related transformations in section 3 we describe the genetic algorithm including the chromosomal representation and the genetic operators in section 4. Section 5 presents experimental results for several standard benchmarks, and section 6 concludes the paper.
Related work
The use of algebraic transformations has been established in different domains: In classical computer-algebra systems such as MAPLE [5] or MATHEMATICA [21] they are indispensable for the transformation and simplification of algebraic expressions. In the domain of high-level-language compilers (see [2] for an overview) algebraic transformations are mainly used for tree height reduction, common subexpression elimination, constant folding, constant propagation and rather simple optimizations based on strength reduction.
† In this example all operations are assumed to be single-cycled.
In the area of high-level synthesis, algebraic transformations have been used in particular for improving resource utilization [18] [17], tree-height minimization [6] [7], the maximization of data throughput [8] [10] and the minimization of power consumption [3]. The use of algebraic transformations in combination with complex components (e.g. MACs) was first proposed in [14]. However, to the author's knowledge there is no optimization technique that exploits component-directed transformations to the same extent as the approach presented in this paper.
An approach based upon evolutionary programming for an area efficient design of Application Specific Programmable Processors (ASPP) has been published in [22] . ASPPs are programmable architectures which are designed for a set of different algorithms. The underlying approach is based on a genetic algorithm for transforming the particular data-flow graphs such that a given behavioral kernel (defined by a set of RT-level components) is optimally exploited by the algorithms.
Concerning the chromosomal representation of dataflow graphs and genetic operators (mutation and crossover), our method appears similar to [22] . However, it combines the concept of evolutionary programming with the algebraic optimization techniques for critical path minimization and improved resource exploitation on a finer level of granularity.
Overview of algebraic optimizations
The introductory example has already shown that, apart from the exploitation of algebraic laws (e.g. commutativity, associativity, distributivity), hardware-related transformations also have the potential to considerably reduce hardware costs and the critical path. In this section we demonstrate how hardware-related transformations can be employed specifically to reduce resource requirements.
General hardware-related transformations
First let us consider the expression $x * c$, where $c$ is a constant with $c = 2^n$. The expression can be implemented in several ways: a) using a multiplier, b) using a shifter, or c) by an offset in the wiring pattern.
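The equivalence of the three implementation alternatives for a multiplication by a power of two can be checked with a short Python sketch. The function names and the fixed operand width are assumptions made for illustration; the "wiring" variant models the offset by appending n zero bits to the bit pattern of a non-negative operand.

```python
def mul_by_multiplier(x: int, n: int) -> int:
    """a) General multiplier: x * 2^n."""
    return x * (2 ** n)


def mul_by_shifter(x: int, n: int) -> int:
    """b) Shifter: shift x left by n positions."""
    return x << n


def mul_by_wiring(x: int, n: int, width: int = 8) -> int:
    """c) Wiring offset: append n constant zero bits to the bit pattern of x.

    Assumes 0 <= x < 2**width; no arithmetic unit is involved, only rewiring.
    """
    bits = format(x, f"0{width}b") + "0" * n
    return int(bits, 2)
```

All three functions return the same value for every non-negative operand that fits into `width` bits, so a cost-driven choice among them does not change the semantics of the data-flow graph.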
Hence, additions with a power of two can be replaced by increment operations and thus lead to a cheaper hardware realization. The same transformation also holds for $x - 2^n$, using a decrement operation instead. The reduction of an addition to an increment operation can also be exploited in order to find a cost-efficient implementation for $1 - x$ with only one incrementer and one in-
All transformations presented above are accompanied by an immediate reduction of the realization costs. However, we still have to take transformations into account which temporarily have the opposite effect:
Constant unfolding is a technique that promises a further improvement in terms of cost and speed by splitting constants into a power of two and a remainder. Consider a product $c * x$ in which the constant $c$ is split as $c = 2^n + r$. For example, we can split the expression $9 * x$ into $(2^3 + 1) * x$, which is equivalent to $2^3 * x + x$. Since $2^3 * x = x \mathbin{\&} 000$ can be implemented by an offset in the wiring pattern, only one adder instead of a multiplier is required to realize the expression.
Another transformation rule is concerned with introducing identity elements, which may be necessary to increase the applicability of further transformations:
$a + a \Leftrightarrow (a * 1) + (a * 1) \Leftrightarrow a * (1 + 1) \Leftrightarrow a * 2$.
Obviously, the use of identity elements is necessary for the formal proof of equivalence without knowledge of the rule $a + a \Leftrightarrow a * 2$. Although the introduction of identity elements helps to find new formulations of an initial expression, it also temporarily increases the implementation costs. Concerning the length of the critical path, it was shown in [13] that the creation of additional operations in the DFG may have a positive effect on synthesis results. Therefore it is essential that the underlying optimization method does not reject transformations which temporarily lead to suboptimal results.
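Constant unfolding as described above can be illustrated with a small Python sketch. The helpers `unfold_constant` and `times_constant` are hypothetical names; the sketch assumes a positive integer constant and always splits off the largest power of two, which is only one of several possible splitting strategies.

```python
def unfold_constant(c: int) -> tuple:
    """Split c >= 1 into (n, r) with c == 2**n + r and 0 <= r < 2**n."""
    n = c.bit_length() - 1      # exponent of the largest power of two <= c
    return n, c - (1 << n)


def times_constant(x: int, c: int) -> int:
    """Compute c * x as (x shifted by n) + r * x after unfolding c."""
    n, r = unfold_constant(c)
    # For r == 1 (e.g. c == 9) the remaining product r * x is just x,
    # so a single adder replaces the multiplier.
    return (x << n) + r * x
```

For the constant 9 the split yields n = 3 and r = 1, so `times_constant(x, 9)` evaluates `(x << 3) + x`, matching the rewriting of 9 * x given in the text.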
Component-driven transformations
As we saw in the introduction, the use of a special component-driven transformation for employing multiplier-adders has the potential to further reduce both hardware costs and the critical path. In the following example we demonstrate how such transformations can be applied in order to exploit existing library components more efficiently: Consider the transformation rule $x + y + z \Leftrightarrow \mathrm{CADD}(x, y, z)$, with $z \in \{0, 1\}$. This rule implies that the expression $x + y + z$ is mapped to one carry-adder. In combination with the transformations presented above, the following rule allows implementing $x + y - z$ with $y = 2^n$ by only one carry-adder and one inverter:
This section has shown the applicability and the positive impact of hardware-related transformation rules concerning hardware costs and speed. We also have recognized that some transformations temporarily may increase the costs but allow the application of further rules which globally may decrease the costs. Since cost-driven heuristics do not work appropriately in this case, we formulated the problem using a probabilistic approach based on a genetic algorithm which will be presented in the following section.
Algebraic optimization by genetic algorithm
The general principle in natural evolution as well as in evolutionary algorithms is the optimization of a population's fitness in the course of generations, driven by the randomized processes of selection, recombination and mutation. Genetic algorithms, as one representative of evolutionary algorithms (see [1] for an overview), have proven to be very powerful for searching vast solution spaces. Solutions found by genetic algorithms are generally close to the global optimum.
Chromosomal representation
Each chromosome of the population represents one semantically equivalent formulation of an initial data-flow graph. The genes located on the chromosome (or the gene positions, to be precise) represent the operations of the DFG together with references to the predecessor operations. [...] + b) shl 1). It should be mentioned that chromosomes may also contain some redundant genes (in this example: $\gamma + \delta \rightarrow \varepsilon$) which have no direct influence on the created DFG. However, redundant genes can be reactivated instantly by small mutations of the chromosome.
As we have seen, the chromosomal representation introduced above guarantees that all subexpressions referenced by different alleles at the same gene position are semantically equivalent. This means that every possible allele substitution at any gene position will subsequently lead to a new semantically equivalent data-flow graph. This property is crucial for preserving the correctness of the used genetic operators, namely mutation and crossover.
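The property that every allele choice yields an equivalent data-flow graph can be illustrated with a minimal Python sketch. The gene pool below is a hypothetical toy example (it is not the example of figure 3), and the data layout, a dictionary of gene positions with lists of alleles, is an assumption made for illustration only.

```python
from itertools import product

# For each gene position: a list of semantically equivalent alleles.
# An allele is (operator, operand, operand); an operand is an input name,
# an integer constant, or the name of another gene position (a reference
# to a predecessor operation).
GENE_POOL = {
    "delta": [("*", "a", 2), ("+", "a", "a"), ("shl", "a", 1)],
    "eps": [("+", "delta", "b"), ("+", "b", "delta")],
}

OPS = {
    "*": lambda u, v: u * v,
    "+": lambda u, v: u + v,
    "shl": lambda u, v: u << v,
}


def evaluate(gene, chromosome, inputs):
    """Evaluate the DFG rooted at `gene`; `chromosome` maps gene -> allele index."""
    op, *operands = GENE_POOL[gene][chromosome[gene]]
    values = []
    for operand in operands:
        if isinstance(operand, int):
            values.append(operand)                              # constant
        elif operand in GENE_POOL:
            values.append(evaluate(operand, chromosome, inputs))  # predecessor gene
        else:
            values.append(inputs[operand])                      # primary input
    return OPS[op](*values)


def all_chromosomes():
    """Enumerate every combination of allele choices over all gene positions."""
    genes = list(GENE_POOL)
    for picks in product(*(range(len(GENE_POOL[g])) for g in genes)):
        yield dict(zip(genes, picks))
```

Evaluating the root gene `eps` for every chromosome produced by `all_chromosomes()` gives the same value for fixed inputs, which is the invariant that mutation and crossover rely on.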
Genetic operators

Mutation
The principle of mutation was implicitly shown in figure 3: Without changing the overall semantics, we can transform a data-flow graph by a simple gene mutation that substitutes the selected allele by another one at the same gene position. Obviously, the mutation operator can be implemented in time O(1).
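A constant-time mutation of this kind might look as follows in Python. The chromosome is assumed to be a list of allele indices and `allele_counts` a list giving the number of available alleles per gene position; both names and the in-place update are assumptions of this sketch rather than details taken from the paper.

```python
import random


def mutate(chromosome, allele_counts, rng):
    """Replace the allele at one random gene position by a different one.

    Works in place and in constant time: one random position, one random
    offset, one assignment. Returns the mutated position.
    """
    pos = rng.randrange(len(chromosome))
    k = allele_counts[pos]
    if k > 1:
        # A non-zero offset modulo k always selects a different allele.
        chromosome[pos] = (chromosome[pos] + rng.randrange(1, k)) % k
    return pos
```

A call such as `mutate(chromosome, allele_counts, random.Random(0))` changes at most one entry of the chromosome; because all alleles at a gene position are semantically equivalent, the resulting data-flow graph remains equivalent to the original one.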
Crossover
The goal of crossover is to recombine parental properties and transmit them to the offspring. In terms of transforming algebraic expressions, crossover recombines subexpressions of the parental data-flow graphs and can be sketched as follows:
1. Create two new chromosomes representing the children.
2. Select an arbitrary gene position.
Crossover also benefits from the underlying representation and creates only those DFGs which are semantically equivalent to the initial specification. Obviously, crossover can be implemented in linear time O(n), where n represents the chromosome length.
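Only the first two steps of the crossover procedure are listed above. The following Python sketch completes them under the assumption of a standard single-point crossover, in which the children exchange all alleles behind the selected gene position; this completion is an assumption of the sketch and may differ from the remaining steps of the original procedure.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover on two chromosomes of equal length.

    Chromosomes are lists of allele indices. Runs in O(n) for chromosome
    length n because each child is assembled by copying n alleles.
    """
    cut = rng.randrange(len(parent_a))          # step 2: arbitrary gene position
    child_1 = parent_a[:cut] + parent_b[cut:]   # step 1: two new chromosomes
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2
```

Since each child takes, at every gene position, an allele that one of its parents already carried at that position, the children again describe data-flow graphs that are equivalent to the initial specification.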
Selection
Selection is a crucial process in (simulated) evolution that favors individuals of higher fitness to survive ("survival of the fittest") and thus become the co-founders of the next population. Generally, we presume the probability of an individual to be selected is proportional to its fitness. This enables even individuals with a lower fitness to be selected and thus to transmit their gene information to the offspring.
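Fitness-proportional selection as described above is commonly realized as roulette-wheel selection. The following minimal Python sketch uses `random.Random.choices` from the standard library and assumes that fitness values are positive and that larger values are better; a cost value that is to be minimized would first have to be converted accordingly. The function name `select` is a hypothetical choice.

```python
import random


def select(population, fitness, k, rng):
    """Draw k individuals with probability proportional to their fitness.

    Sampling is done with replacement, so a fit individual may be chosen
    several times, while weak individuals keep a non-zero chance.
    """
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=k)
```

With two individuals of fitness 1 and 9, for instance, the weaker one is still expected to be drawn in roughly one out of ten selections.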
In terms of the final hardware realization, we define the fitness as a weighted sum of the required functional units and the length of the critical path. In contrast to the critical-path computation, which can be done in linear time by an ASAP (or ALAP) scheduling, the exact computation of resource costs is NP-complete. Therefore, we have to employ resource estimation techniques (e.g. [19]) in order to assess the effect of the performed transformations. Surprisingly, experimental results have shown that even simple fitness functions are sufficient for producing excellent optimization results (see figure 5-6 and table 2). We used a combination of the critical-path length and hardware costs computed by direct compilation, i.e. each operation of the data-flow graph is associated with certain hardware costs. An advantage of direct compilation is that it can be implemented in linear time, which is crucial for the fast execution of the optimization routine.
Figure 5 presents the genetic algorithm that serves as a basis for algebraic optimization; it is modeled on the standard algorithm in [4]. The outer loop is required in order to extend the current gene pool in two directions: On the one hand, we introduced the new alleles $\delta_{II}$, $\delta_{III}$ and $\delta_{IV}$ in our example to represent the transformations $a * 2 \rightarrow a + a \rightarrow a \ \mathrm{shl}\ 1 \rightarrow a \mathbin{\&} 0$. On the other hand, the chromosome is extended by new genes along with their alleles. In our example, we introduced gene $\varepsilon$ with its alleles $\varepsilon_{I}$ and $\varepsilon_{II}$ and the new allele $\alpha_{II}$ in order to exploit associativity.
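The simple evaluation by direct compilation and ASAP levels can be sketched in Python as follows. The DFG encoding, the unit costs and the weights are hypothetical values chosen for illustration; all operations are assumed to be single-cycled, and the result is a cost to be minimized rather than a fitness to be maximized.

```python
def critical_path(dfg):
    """Longest path in control steps, computed from memoized ASAP levels.

    `dfg` maps node -> (operator, list of predecessors); names that are not
    keys of `dfg` are primary inputs at level 0. Each node is visited once.
    """
    level = {}

    def asap(node):
        if node not in dfg:
            return 0
        if node not in level:
            level[node] = 1 + max((asap(p) for p in dfg[node][1]), default=0)
        return level[node]

    return max((asap(n) for n in dfg), default=0)


def direct_cost(dfg, unit_cost):
    """Direct compilation: every operation contributes the cost of its own unit."""
    return sum(unit_cost[op] for op, _ in dfg.values())


def weighted_cost(dfg, unit_cost, w_area=1.0, w_time=1.0):
    """Weighted sum of hardware cost and critical-path length (lower is better)."""
    return w_area * direct_cost(dfg, unit_cost) + w_time * critical_path(dfg)


UNIT_COST = {"*": 10, "+": 2, "MAC": 11}           # hypothetical unit costs
separate = {"m": ("*", ["x", "y"]), "s": ("+", ["m", "z"])}
merged = {"s": ("MAC", ["x", "y", "z"])}
```

For these example values the separate multiplier and adder give a weighted cost of 14.0, whereas the single MAC gives 12.0, since it both saves area and shortens the critical path from two control steps to one.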
Skeleton of the genetic algorithm
We distinguish generations, in which new DFGs are composed by recombination, mutation and selection, from epochs, in which the current gene pool is continuously extended by further transformations. The termination of the entire algorithm is controlled by the criterion $T_E$, which can be realized in the same way as the loop-exit criterion $T_G$.
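The two nested loops of epochs and generations can be outlined in Python as follows. This is only a sketch under several assumptions and not the algorithm of figure 5: chromosomes are lists of allele indices, the gene pool is reduced to a list of allele counts that the hypothetical callback `extend_pool` enlarges once per epoch, new genes are not modeled, and both termination criteria are fixed iteration counts. The default parameters are the ones reported in the experimental section; the number of epochs is an arbitrary choice.

```python
import random


def genetic_optimize(pool_sizes, cost, extend_pool, epochs=3, generations=40,
                     pop_size=80, n_replace=60, mutation_rate=0.1, seed=0):
    """Return the cheapest chromosome found; `cost` must be non-negative."""
    rng = random.Random(seed)
    # Every individual starts as the initial DFG (allele 0 at each position).
    population = [[0] * len(pool_sizes) for _ in range(pop_size)]
    for _ in range(epochs):                      # outer loop: until T_E holds
        extend_pool(pool_sizes)                  # add alleles from transformations
        for _ in range(generations):             # inner loop: until T_G holds
            weights = [1.0 / (1.0 + cost(c)) for c in population]
            offspring = []
            while len(offspring) < n_replace:
                a, b = rng.choices(population, weights=weights, k=2)
                cut = rng.randrange(len(a))
                child = a[:cut] + b[cut:]        # single-point crossover
                if rng.random() < mutation_rate:
                    pos = rng.randrange(len(child))
                    child[pos] = rng.randrange(pool_sizes[pos])
                offspring.append(child)
            population.sort(key=cost)            # keep the best, replace the rest
            population = population[:pop_size - n_replace] + offspring
    return min(population, key=cost)


ALLELE_COST = [5, 3, 2, 1]                       # hypothetical cost per allele index


def toy_cost(chromosome):
    return sum(ALLELE_COST[a] for a in chromosome)


def toy_extend(pool_sizes):
    for i in range(len(pool_sizes)):
        pool_sizes[i] = min(pool_sizes[i] + 1, len(ALLELE_COST))


best = genetic_optimize([1, 1, 1, 1], toy_cost, toy_extend)
```

Because the best individuals of each generation are retained, the cost of `best` can never exceed that of the initial chromosome; transformations that temporarily increase the cost still survive in the population with a lower selection probability.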
Experimental results
We applied the presented algorithm to several standard benchmarks. Across the examples we achieved gains of up to 30 % in critical-path length and of up to 26 % in area. On the basis of empirical tests we determined the following genetic parameters: population size: 80 individuals; number of individuals in the population to be replaced by the offspring: 60; number of generations: 40; mutation rate: 0.1.
All presented results have been computed on a SparcStation 20. For the chosen parameters, the execution time of the optimization routine was approximately one minute for all examples. Figures 6 and 7 present the initial and the optimized data-flow graph of the 5th-order elliptical wave filter [11], respectively. All operations are assumed to be single-cycled. The macro-nodes in fig. 7 represent multiply-add subexpressions which have been bound to MACs. These complex components can also be used for computing additions by simply applying the corresponding identity elements to the appropriate inports (control steps 3, 8, 9, 11 in fig. 8).
Figure 7: Optimized DFG of the EW-benchmark
It seems quite obvious that manually formulating a behavioral description with the same quality as the presented optimized DFG is virtually impossible. In this example, the critical path has been shortened from 14 to 10 control steps, requiring only two MACs and two adders instead of three adders and two multipliers. Table 2 shows the synthesis results for several benchmarks before and after algebraic optimization. We used the high-level synthesis system OSCAR [12] for synthesizing the original and the optimized design. Due to its underlying integer programming formulation, all presented results are optimal concerning the overall costs of functional units. Areas and delays of functional units have been adopted from the underlying 1.0µ VLSI component library [20]. The execution times of high-level synthesis were always less than one second.
Conclusion
We presented a genetic-algorithm-based approach for the algebraic optimization of data-flow graphs. Due to the underlying chromosomal representation, all genetic operators are correctness-preserving and can be implemented very efficiently. Apart from standard transformation rules such as commutativity, associativity and distributivity, we also support hardware-related transformations. It has been shown that these transformations, too, have a positive effect on the quality of the achieved synthesis results. Since all rules are stored in an external library, they can be modified or extended by the designer.
The system has been implemented as a front end to an ILP-based synthesis system [12] and benefits from its capability of supporting complex component libraries. However, the approach can also be easily realized as a source-to-source (e.g. VHDL-to-VHDL) optimizer that 
