several points within the design space are examined by the advice tool. Thus, predictions of the final result can be computed without completely synthesizing the whole design. This leads to a reasonable reduction of design time because only the design with the best estimated characteristics is given to logic synthesis for implementation. Future work will concentrate on bringing the advice to higher levels of abstraction, i.e. profiling of design characteristics in the behavioral domain.
because fidelity corresponds to the percentage of correctly predicted comparisons between design implementations. The fidelity in our experiments is 78%, which indicates that the estimates computed by the advice tool reflect the real situation quite well.
Additionally, we consider the time savings that can be achieved inside a design space exploration loop by using estimation instead of synthesis. The computation time for estimation has been found to be only about one third of the logic synthesis time (see figure 9).
Taking one logic synthesis run for one design as one time unit, design space exploration by logic synthesis alone takes (9 • 1 =) 9 time units for the nine design versions of our experiment. Using our methodology, the estimation phase takes (9 • 1/3 =) 3 time units plus one unit for the final implementation. Thus, in our experiment design time is reduced from 9 to 4 time units. With the presented approach, far more points in the design space can therefore be explored within acceptable time limits.
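The time accounting above can be reproduced with a small sketch. It is only an illustration under the stated assumptions: one synthesis run costs one time unit, one estimation run costs one third of that, and exactly one design is synthesized at the end. The function names `exploration_cost` and `explore` are hypothetical and not part of PMOSS or the HDL Advisor.

```python
from fractions import Fraction


def exploration_cost(n_versions, synth_cost=Fraction(1), est_ratio=Fraction(1, 3)):
    """Return (synthesis-only cost, estimation-based cost) in time units.

    Assumption: estimating one version costs est_ratio * synth_cost, and the
    estimation-based flow synthesizes exactly one version at the end.
    """
    synthesis_only = n_versions * synth_cost
    with_estimation = n_versions * synth_cost * est_ratio + synth_cost
    return synthesis_only, with_estimation


def explore(versions, estimate, synthesize):
    """Estimate every version, then synthesize only the best-rated one.

    `estimate` and `synthesize` are hypothetical callables supplied by the
    caller; a lower estimate is assumed to be better (e.g. area).
    """
    best = min(versions, key=estimate)
    return best, synthesize(best)


baseline, proposed = exploration_cost(9)
print(baseline, proposed)  # 9 4
```

Exact fractions are used so that 9 • 1/3 evaluates to exactly 3 rather than a rounded floating-point value.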
Conclusion
This paper shows the integration of RT-level advice into our system-level synthesis methodology. A standard C file is analyzed, and the HW part is extracted and synthesized by our high-level synthesis tool PMOSS. To support a fast exploration of the design space, the final task of logic synthesis is done after a separate estimation phase. In this estimation phase
As shown by way of example in figure 7, there is a strong correlation between the area consumption of the mapped (synthesized) design and the component count of the unmapped (estimated) design. This correlation has been found for most of our design variants. The system delay characteristics can be explored in the same manner.
Results
All design versions have been fed into the RT-level advice tool. The selected solution has been passed to logic synthesis for optimization of all components. Additionally, optimization across component boundaries is done. This leads to a highly optimized design implementation. As figure 7 indicates, the synthesis results for area follow the estimated prediction (component count) quite well in all cases.
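The fidelity of estimates has been defined in [7] as
$$F = 100 \cdot \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mu_{ij}$$
whereby holds
$$\mu_{ij} = \begin{cases} 1 & \text{if } (E_i > E_j \wedge S_i > S_j) \vee (E_i < E_j \wedge S_i < S_j) \vee (E_i = E_j \wedge S_i = S_j) \\ 0 & \text{otherwise} \end{cases}$$
The estimation result of design number $i$ is indicated by $E_i$, the synthesis result by $S_i$, and the number of different design versions by $n$. The maximum fidelity is 100%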
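As an illustration, fidelity can be computed with a few lines of Python. This is a minimal sketch that assumes the pairwise reading of fidelity, i.e. the percentage of design pairs whose ordering by estimate matches their ordering by synthesis result. The function name `fidelity` and the sample numbers are hypothetical and are not the measurements of our experiment.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def fidelity(estimates, measured):
    """Percentage of design pairs ranked identically by estimate and synthesis."""
    if len(estimates) != len(measured):
        raise ValueError("one estimate and one synthesis result per design required")
    pairs = list(combinations(range(len(estimates)), 2))
    if not pairs:
        raise ValueError("at least two designs are required")
    agreeing = sum(
        sign(estimates[i] - estimates[j]) == sign(measured[i] - measured[j])
        for i, j in pairs
    )
    return 100.0 * agreeing / len(pairs)


# Hypothetical data: component counts (estimates) and synthesized areas.
component_counts = [10, 20, 30, 25]
synthesized_areas = [1.0, 2.5, 2.0, 3.0]
print(fidelity(component_counts, synthesized_areas))
```

In this made-up example four of the six pairwise comparisons agree, so the estimate would rank roughly two thirds of the design pairs correctly.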
DCT-Algorithm
The example considered in our experiments is the algorithm for the computation of the Discrete Cosine Transformation (DCT), which is used in lossy data compression for transforming a spatial (color) distribution into an energy spectrum over spatial frequencies. This data compression is based on the fact that most of the signal energy is concentrated in the first few coefficients of the resulting sequence. Therefore, handling only a small number of leading coefficients causes a controlled loss of information. Examples of applications using the DCT are JPEG and MPEG [21] in the domain of image processing. For our experiments the DCT algorithm proposed by Chen [2] is considered for HW implementation.
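The energy compaction property can be made concrete with a short sketch. Note that this is a naive orthonormal DCT-II written for illustration only; it is not the fast algorithm of Chen [2] used in our experiments, and the helper `energy_share` as well as the sample values are assumptions.

```python
import math


def dct_ii(samples):
    """Naive O(n^2) orthonormal DCT-II of a list of numbers."""
    n = len(samples)
    coefficients = []
    for k in range(n):
        # Orthonormal scaling keeps the total signal energy unchanged.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(samples)
        )
        coefficients.append(scale * acc)
    return coefficients


def energy_share(coefficients, leading):
    """Fraction of the total energy held by the first `leading` coefficients."""
    total = sum(c * c for c in coefficients)
    return sum(c * c for c in coefficients[:leading]) / total


# Hypothetical smooth input row, e.g. slowly varying pixel intensities.
row = [10, 11, 12, 13, 14, 15, 16, 17]
coeffs = dct_ii(row)
print(energy_share(coeffs, 2))
```

For such a smooth input, nearly all of the energy ends up in the first two coefficients, which is why discarding the trailing coefficients loses little information.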
For our further investigations we regard nine different descriptions of the DCT algorithm. The original DCT algorithm (V1) is transformed by high-level transformations before behavioral synthesis. The first variant of our algorithm (V2) is generated via strength reduction and scalarization. This means that expensive integer multiplication and division operators are replaced by cheap shift operators. In addition, scalarization of an array leads to the utilization of constants instead of array components. Constant propagation is applied to the V2 version, which results in the third variant (V3). Regarding the control flow of the DCT algorithm, three different loops can be identified. Each loop of the V3 description is given separately to the loop unrolling transformation. This leads to the versions V4, V5 and V6. Furthermore, V3 can be transformed into V7 by applying loop unrolling to all three loops simultaneously. The last transformation, elimination of temporary variables, is applied to the versions V4 and V6, resulting in V8 and V9. In this way, we get nine different DCT design versions of reasonable complexity for design space exploration and evaluation of our system design methodology. Table 1 lists the transformations and the corresponding design versions. The size of the design on the behavioral level is given by the number of C code lines.
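The effect of strength reduction and scalarization can be sketched as follows. The functions and constants are hypothetical and are not taken from the DCT source; they only show the kind of rewrite meant here, using power-of-two factors so that multiplication and division map to shifts.

```python
# Hypothetical coefficient table; all entries are powers of two.
SCALE = [8, 4]


def scale_original(x):
    """Reference version: array lookups, multiplication and division."""
    return (x * SCALE[0]) // SCALE[1]


def scale_reduced(x):
    """After scalarization (constants instead of array elements) and
    strength reduction (shifts instead of multiply and divide)."""
    return (x << 3) >> 2


for value in (-7, -1, 0, 5, 1023):
    assert scale_original(value) == scale_reduced(value)
print(scale_reduced(5))  # 10
```

In Python both `//` and `>>` round toward negative infinity, so the two versions agree for negative operands as well. In C, signed division truncates toward zero, so a transformation tool has to restrict this rewrite to non-negative operands or insert a correction.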
The design characteristics of each version can be examined by the HDL Advisor. For our further investigations, the design criterion area is regarded. For demonstration purposes we present some profile and histogram charts for the V1 design version. The chart in figure 7 (a) presents the estimated values. The distribution of the consumed area per component is also
estimation-loop to generate a prediction of the final result instead of the conventional (slower) outer synthesis-loop, in which the design process is carried out completely in every iteration step.
The main problem of the estimation step is the fact that structural VHDL introduces redundancy [9], which is especially critical for the controller (e.g., additional code could be added to force a special kind of realization of the symbolic state table in VHDL). For this reason, even the conversion of VHDL into an internal format leads to problems (e.g., coding of enumeration types, realization of case constructs using multiplexors, mapping of undefined outputs to tristate ports). This shows that a good estimation result requires the advisor to be matched to the logic synthesis system actually used to generate the final implementation.
Using the number of logic levels as a prediction for the delay of the circuit and the component-count as an advice for the area-consumption gives, as can be seen in the following section, a good estimate of the final result. In addition to the estimates described above, the HDL-Advisor is able to analyze the mapped design-data available after logic synthesis. Based on those mapped data, information about delay, area-consumption and power dissipation of (particular parts of) the circuit can be extracted and directly tracked back to the VHDL source code.
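The two measures can be illustrated on a toy netlist. The sketch below is not the algorithm of the HDL-Advisor, whose internals are not described here; it merely assumes that the unmapped design is available as an acyclic graph of components, counts the components as an area indicator, and takes the longest input-to-output path as the number of logic levels. The netlist format and all names are hypothetical.

```python
def component_count(netlist):
    """Area indicator: number of components in the unmapped netlist."""
    return len(netlist)


def logic_levels(netlist):
    """Delay indicator: longest chain of components between inputs and outputs.

    `netlist` maps each component name to the list of its fan-in signals.
    Signals that are not keys of the dict are treated as primary inputs.
    The netlist is assumed to be purely combinational (acyclic).
    """
    depth = {}

    def level(node):
        if node not in netlist:
            return 0  # primary input
        if node not in depth:
            depth[node] = 1 + max((level(src) for src in netlist[node]), default=0)
        return depth[node]

    return max((level(gate) for gate in netlist), default=0)


# Hypothetical four-component netlist with primary inputs a, b and c.
example = {
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "g3": ["g1", "g2"],
    "g4": ["g3", "a"],
}
print(component_count(example), logic_levels(example))  # 4 3
```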
The HDL-Advisor enables the user to comfortably browse through the VHDL source code and display several characteristics of the (sub-)circuit under consideration. For this purpose, there is a set of windows, e.g. for graphical visualization of the timing delay inside a particular (sub-)circuit as a histogram or for graphical analysis of the expected critical path of a (sub-)circuit.
In our approach, the task of logic synthesis inside the optimization loop is replaced by an estimation computed by the advice tool. As will be shown in the following section, there is a strong correlation between this estimation and the final result created by logic synthesis. Using this methodology, design space exploration can be done without actually generating the final result; logic synthesis has to be invoked only if the optimization process delivers a result whose characteristics satisfy the designer's demands. In this way, a large amount of design time can be saved.
analogy to the topic of high-level transformations - the existence of methods for the rapid estimation of the final result of the design process, especially the size of the particular blocks. Then the partitioning algorithm may efficiently take into account similarities of basic building blocks (e.g. adders and subtractors) and the presence of hierarchical operators (e.g. an n-bit adder built out of a set of 1-bit adders), resulting in a set of blocks which are balanced with respect to size [5].
From RT-Level to Hardware
The output of the previous synthesis task is a structural description of the system which consists of a controller and a multiplexor-based datapath:
• The controller, which is represented as a state table or state-transition graph, is irregular and is normally implemented as random logic using sequential logic synthesis.
• On the other hand, the datapath consists of very regular modules. These modules can be implemented either using macro cells or as unoptimized units which have to be fed into logic synthesis in order to obtain an optimized implementation. The advantage of the second approach to datapath synthesis is that optimization potential arising from constants at the inputs of the modules can be exploited. The disadvantage is that the design cycle is lengthened and the effect of optimization at the Boolean level is difficult to predict.
Since savings in cost and area are significant for our approach, we follow the second one. This necessitates the use of sophisticated advice tools working at the logic level and providing information about the expected design characteristics to the behavioral level. However, for the accurate prediction of the expected characteristics some modifications have to be applied to the structural description:
• Since we aim at a multi-level logic implementation of the system, the size/delay of the final controller implementation is difficult to predict purely out of a symbolic state table.
In [16] a model for estimating the area of a controller for a 2-level logic implementation is presented. However, for large controllers with a few hundred states arising in a transformational synthesis environment (with e.g. unrolled loops, inlined function calls) the accuracy is very limited since the controller cannot be implemented in 2-level logic without using additional options (e.g. PLA-folding). For multi-level logic implementations the result depends on the structuring of the design.
• To increase the accuracy of the estimation we perform some preprocessing steps which can be applied rather quickly in order to predict the complexity of the controller: (a) state minimization is applied to obtain a non-redundant machine; (b) since the encoding of the symbolic states of the machine heavily influences the size of the transition and output logic, we apply state encoding to prune the search of the subsequent estimation phase.
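The first preprocessing step can be sketched as follows. The text does not name the minimization algorithm that is used, so this example swaps in classical partition refinement for a completely specified Moore-type machine as an assumption. The second function only reports the register width of a minimum-length binary encoding; it does not perform an optimizing state assignment. All names and the sample machine are hypothetical.

```python
def minimize_states(states, inputs, next_state, output):
    """Group equivalent states by partition refinement.

    Returns a dict mapping every state to the id of its equivalence block.
    Assumes a completely specified Moore machine: `output[s]` is the output
    in state s and `next_state[s][a]` the successor of s under input a.
    """
    block = {s: output[s] for s in states}  # initial split by output value
    while True:
        signature = {
            s: (block[s], tuple(block[next_state[s][a]] for a in inputs))
            for s in states
        }
        ids = {}
        refined = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        # Each round can only split blocks, so an unchanged count means stable.
        if len(set(refined.values())) == len(set(block.values())):
            return refined
        block = refined


def binary_code_bits(n_states):
    """Number of state register bits for a minimum-length binary encoding."""
    return max(1, (n_states - 1).bit_length())


# Hypothetical controller: s1 and s2 behave identically.
states = ["s0", "s1", "s2", "s3"]
inputs = [0, 1]
output = {"s0": 0, "s1": 1, "s2": 1, "s3": 0}
next_state = {
    "s0": {0: "s1", 1: "s2"},
    "s1": {0: "s3", 1: "s0"},
    "s2": {0: "s3", 1: "s0"},
    "s3": {0: "s3", 1: "s0"},
}
blocks = minimize_states(states, inputs, next_state, output)
n_min = len(set(blocks.values()))
print(n_min, binary_code_bits(n_min))  # 3 2
```

Removing the redundant state before estimation avoids counting logic that a synthesis tool would remove anyway.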
RT-Level Estimation
Our approach for a rapid estimation of the final result of the design process is based on the use of an advice tool which covers the entire part below RT-level (see figure 1). The tool used in our experiments, the HDL-Advisor [19], requires an input specification given in structural VHDL and calculates an estimate of the expected number of logic levels as well as the expected number of components of the design implementation resulting from logic synthesis.
As indicated in figure 6 we use structural-VHDL to combine design-activities above and below RT-level. From this level of abstraction, the structural-VHDL can be estimated, simulated or synthesized. Regarding figure 6, our approach tends to use the (rapid) innermost
For the internal representation of design data two levels of abstraction are taken into account: the behavioral level (realized by a control-data-flow graph, CDFG) and the structural level (given by a controller and a datapath, CDP) [15]. During high-level synthesis, all synthesis tasks gradually annotate the CDFG with information (e.g. control step of an operation, allocated module type, instance of the module type). The applicability of a particular synthesis step is restricted only by the availability of the required information in the CDFG. If all synthesis steps have terminated successfully, the annotation of the CDFG is complete and a conversion into a CDP can be done in a one-to-one fashion.
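The idea of gradual annotation can be sketched with a small data structure. This is not the PMOSS data model; the class `Operation`, the table `REQUIRES` and in particular the dependencies between the synthesis steps are hypothetical and serve only to illustrate how the applicability of a step can be derived from the annotations already present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operation:
    """One CDFG operation node with annotations filled in step by step."""
    name: str
    kind: str
    control_step: Optional[int] = None   # set by scheduling
    module_type: Optional[str] = None    # set by allocation
    instance: Optional[str] = None       # set by binding


# Hypothetical table: annotations a step needs before it can run.
REQUIRES = {
    "scheduling": (),
    "allocation": ("control_step",),
    "binding": ("control_step", "module_type"),
}


def applicable(step, operations):
    """A step is applicable if every required annotation is available."""
    return all(
        getattr(op, field) is not None
        for op in operations
        for field in REQUIRES[step]
    )


def annotation_complete(operations):
    """True once the CDFG can be converted into a controller/datapath."""
    return all(
        None not in (op.control_step, op.module_type, op.instance)
        for op in operations
    )


ops = [Operation("add1", "+"), Operation("mul1", "*")]
print(applicable("scheduling", ops), applicable("binding", ops))  # True False
```

Because each step checks only for the information it needs, steps and algorithms can be exchanged or reordered as long as their prerequisites are met.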
For manual intervention in the synthesis process there exists a graphical user-interface, which for instance allows the designer to choose a certain algorithm for a particular synthesis step which fits her/his needs. Furthermore, there exist graphical editors to visualize the CDFG and CDP.
Optimization of Behavioral- and RT-Level Descriptions
For design space exploration, a specification can be transformed into several functionally equivalent descriptions. This results in a set of different specifications describing the same design. This degree of freedom can be utilized by transforming a specification into an equivalent one in order to improve the characteristics of the design. Such transformations can be applied before high-level synthesis (on the behavioral level) as well as after high-level synthesis (on RT-level). Transformations of the first kind are called high-level transformations; those of the second kind are referred to as repartitioning/resynthesis.
High-Level Transformations
The term high-level transformations is used for optimizations of the behavioral specification which often correspond to equivalent techniques in the area of compiler theory (e.g., loop unrolling, constant propagation, common sub-expression elimination) [1]. In PMOSS, high-level transformations operate directly on the CDFG. The output generated by the high-level transformation task serves as the input of the high-level synthesis task. The impact of high-level transformations can be twofold: first, they can be used to accelerate the subsequent tasks in high-level synthesis [18], and second, they can improve the final result with respect to user-specific cost measures (e.g., area consumption, performance, power dissipation) [20].
In most of the existing systems high-level transformations are used in an off-line fashion by the designer. An essential precondition for automatic high-level transformations is the availability of methods which allow the rapid estimation of the changes in the resulting design caused by a transformation. Only the availability of such methods allows the exhaustive exploration of the design space within an acceptable amount of time.
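Loop unrolling, one of the transformations named above, can be illustrated at source level. PMOSS applies its transformations to the CDFG, so the following is only an analogy; the function names and the unroll factor of four are assumptions.

```python
def weighted_sum(values, weights):
    """Reference loop: one multiply-accumulate per iteration."""
    total = 0
    for i in range(len(values)):
        total += values[i] * weights[i]
    return total


def weighted_sum_unrolled(values, weights):
    """Same computation with the loop body unrolled by a factor of four."""
    n = len(values)
    total = 0
    i = 0
    while i + 4 <= n:
        # Four independent products per iteration expose parallelism
        # that a scheduler can map onto several functional units.
        total += (
            values[i] * weights[i]
            + values[i + 1] * weights[i + 1]
            + values[i + 2] * weights[i + 2]
            + values[i + 3] * weights[i + 3]
        )
        i += 4
    while i < n:  # remaining iterations when n is not a multiple of four
        total += values[i] * weights[i]
        i += 1
    return total


data = list(range(10))
coefficients = [3] * 10
print(weighted_sum(data, coefficients), weighted_sum_unrolled(data, coefficients))  # 135 135
```

The unrolled version needs fewer loop-control steps but more operators per control step, which is exactly the kind of area and delay trade-off that has to be estimated quickly.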
Repartitioning / Resynthesis
The output of the high-level synthesis task is a structural description on RT-level which describes a deterministic, synchronous system that needs to be transformed into a gate-level representation. For reasons of complexity, the processing of an RT-level description within a logic synthesis system normally requires a partitioning of the datapath into a number of blocks. This partitioning leads to additional optimization potential at the block interfaces caused by additional don't care conditions. The objective of the repartitioning/resynthesis task is to make use of such partitioning-induced optimization potential with regard to the following task of logic synthesis. An essential precondition for automatic repartitioning/resynthesis is - in
In figure 5 the detailed description of the controller and datapath of the design in figure 4 is shown.
Modular Synthesis
A fundamental characteristic of our high-level synthesis system is its modular concept. This is realized by separating the different synthesis tasks, which allows a flexible combination and exchange of different synthesis steps and algorithms (e.g. scheduling, allocation, binding of functional units, registers and interconnections). In addition, there exist several front-ends (e.g. for behavioral VHDL, C or C++) and back-ends (e.g. for structural VHDL or BLIF). This allows an individual configuration of the design process with regard to user-specific goals and restrictions as well as a flexible integration of our tool into an entire design flow.
estimation determines the impact of the partitioning on the target system before implementation [14]. This codesign concept is displayed in figure 3 (b). The HW part is the input of the next synthesis stage, high-level HW synthesis. The HW part communicates with the SW part via a pipelined interface [13].
High-Level Synthesis
High-level synthesis transforms the behavioral specification identified for HW realization in the previous step into an implementation on RT-level. In the following subsections, some essential characteristics and features of our high-level synthesis methodology are presented.
Target Architecture
The interconnections between the RT components can be either realized by busses or by multiplexors [3] . The choice of the target architecture may have a significant impact on the subsequent optimization steps. The difference stems from two sources:
• In a bus architecture the RT components are formed by macro cells, i.e. a set of preoptimized components which are built by layout generators. The advantage here is that geometrical aspects (layout) can be involved at early stages of the synthesis process [6] and tight area estimates can be gained quickly. However, it turned out that this type of architecture is somewhat difficult to test and that components separated by the bus cannot be optimized together.
• In contrast to this, multiplexor-based implementations are easy to test, and test concepts can be integrated either on the behavioral or on the logic level. Since abstract measures are used for cost and performance estimates during structural synthesis, one has to use module generators to build the actual instances in the netlist which is then passed to logic synthesis. The need for a subsequent optimization step can be seen as a disadvantage (since the synthesis time increases) as well as an advantage (since we have the powerful methodology of logic synthesis at our disposal).
For the reasons pointed out above, PMOSS supports the multiplexor-style architecture. The result of the synthesis procedure consists of a controller and a datapath. As pointed out in the introduction, we focus on synthesis from C source code. However, this requires a methodology to handle function calls and memory accesses. Therefore, the interface of the top-level function does not directly correspond to the entity which has to be synthesized, because additional control and status lines have to be added to the HW entity to implement (a) the handshake mechanism for function calls and (b) external memory access for pointer parameters. Figure 4 and figure 5 show the target architecture and the handshake mechanism (realized by the signals START, RESET, and READY) used to implement C functions.
From System-Level to RT-Level
This section considers design activities above RT-level. On those levels, design automation is done by using our synthesis environment PMOSS. Figure 2 shows an overview of the design activities and formats covered by our system. In the following subsections, the individual steps are discussed in detail.
HW/SW-Codesign
Considering the first stage of the synthesis flow, system synthesis, HW/SW codesign comes into view. Complex systems are typically implemented in hardware and software. They are described by abstract (technology-independent) system specifications. In our approach, we start from a C-code specification. Classical approaches define a fixed HW/SW interface, e.g. the processor instruction set, and develop the HW part and the SW part separately. In codesign, both parts are considered together without a predefined interface. However, every design process will focus on a target architecture. On the one hand, there are state-of-the-art general-purpose processors which are very powerful. On the other hand, there are special-function designs which are much more efficient for special applications. In our codesign approach we intend to join both aspects. Therefore we define a target architecture based on a general-purpose processor extended by an additional special function unit (SFU), as shown in figure 3 (a).
The codesign task maps one part of a given specification to the general-purpose processor and the other to the SFU. The actual partitioning aims at a maximum speed-up of the overall system. The codesign task itself may be divided into several subtasks. Specification analysis should provide all information needed to partition the specification into a HW and a SW part heuristically [12]. The partitioning subtask results in one specification part which has to be implemented in HW and a second part which has to be implemented in SW. Speedup
HW cost metrics depend on the implementation technology. In many cases system cost is related to chip size. In a full-custom design view a limited number of square millimeters of silicon is available. For semi-custom designs, e.g. based on FPGA components, the number of available gates is restricted. Here communication also comes into view because the wiring of computation components consumes additional area.
SW cost metrics abstract from the underlying microprocessor, but the obtained data is specific to the microprocessor actually used. Frequently used SW cost metrics are the size of the executable program and the expected data memory size. More detailed cost metrics take into account the number of instructions and the time spent on the execution of subtasks or instructions [12].
Other metrics also applied to system design are power dissipation, testability of a design, design time and manufacturing costs. Design time is defined as the time required to obtain an implementation from the functional specification. Manufacturing costs depend on a variety of external factors like market prices for raw material etc. Based on these estimation criteria, system models can be developed. An overview is given in [7].
Estimation Tools
Several estimation tools are available. Aparty [4] is a tool for partitioning a behavioral system into multiple partitions. Each partition is passed to the estimator. The design area, understood as the area of all sub-components, is estimated and the interconnection between the clusters is approximated by the number of required wires. Performance is estimated by the number of control steps [17]. In Vulcan [10] the system is represented by a graph in which information about the exclusiveness of operations is stored. Each node represents an operation; performance and size information is associated with it. Estimation is done by graph traversal. The design framework SpecSyn of Gajski et al. provides a design tool for system specification partitioning. HW parameters, e.g. area, behavioral execution time and clock cycle, are obtained from a library. SW parameters are stored in technology files. Besides area and performance, the number of pins and the number of abstract variable accesses, procedures and communication structures are estimated. The HDL Advisor tool [19] focuses on HW and performance metrics. In a first step the design specification is transformed into an internal format on which the entire estimation is done. The main objective of this tool is to find system bottlenecks according to the metrics. If the designer is provided with this information, system redesign becomes much easier.
The Caddy [11] high-level synthesis tool also uses estimation at a low level of abstraction. A rating function determines the number of control steps required to execute a given set of operations. In a first step a linear time scheduling algorithm is used. In the second step the preliminary result is checked against the resource constraints. The HDL Advisor tool performs all estimates based on the technology independent intermediate format. For this, no additional library of design data is needed. Because of this advantage, we decided to use this tool in our optimization approach. It has been found that estimation time is rather short compared to synthesis time.
further treatment. In the ideal case, the time-consuming step of logic synthesis has to be carried out only once. Figure 1 shows an overview of this procedure. The main advantage is the early feedback inside the optimization loop, including the rapid estimation of the final result of the design.
The paper is organized as follows: after surveying the state of the art, section 3 gives a short overview of the design activities covering the steps from system-level to RT-level. Section 4 takes a look at the RT-level and logic-level synthesis step. Our proposed methodology achieves a fast evaluation of the design space by replacing RT-level and logic-level synthesis with the estimation step presented in more detail in section 5. Section 6 explains the use of RT-level estimation for the evaluation of different versions of RT-level descriptions of an example behavioral specification and evaluates the quality of the estimates with respect to the actual results generated by logic synthesis. To demonstrate this, the Discrete Cosine Transformation (DCT) algorithm [2] is considered. As our experiments show, the evaluation time for an iteration of the optimization loop can be reduced to approx. 30% of the synthesis time by using the advice tool. A comparison of the actual synthesis result with the estimates shows that the advice tool approximates the actual design created by logic synthesis very accurately.
State of the Art
In system design a variety of estimation techniques and criteria have been developed. Each technique consists of a system model and an estimation algorithm. In general, the accuracy rises with the complexity of the system model and the computation time spent on estimation. The problem is to use a system model which is complex enough to guarantee estimation results of good quality, yet simple enough to keep the estimation as fast as possible. The system model must contain all details which are examined by the estimation algorithms.
Estimation Criteria
Design space exploration is based on three main objectives: performance, HW costs and SW costs. Metrics for performance estimation are commonly based on the system clock cycle 