Rapid evaluation and design space exploration at the algorithmic level are important issues in the design cycle. In this paper we propose an original area vs delay estimation methodology that targets reconfigurable architectures. The estimation flow comprises two main steps: i) structural estimation, which is technology independent and performs an automatic design space exploration, and ii) physical estimation, which performs a technology mapping to the target reconfigurable architecture. Experiments conducted on Xilinx (XC4000, Virtex) and Altera (Flex10K, Apex) components for a 2D DWT and a speech coder lead to an average error of about 10 % for temporal values and 18 % for area estimations.
INTRODUCTION
The evolution of telecommunication and multimedia applications towards new standards requires innovative architectures that can meet ever tighter constraints. Recent advances in reconfigurable architectures, in terms of capacity and performance, efficient resource integration (such as DSP operators and memories), and flexibility through run time reconfiguration, make reconfigurable systems on chip a very promising option. As a result, choosing a suitable target component that satisfies both physical (area, performance, ...) and marketing (final product cost, time to market, ...) constraints is a complex issue often left to the designer's experience. Dealing with problems such as application parallelism exploration and FPGA architecture matching requires new design methodologies that find, more quickly and more reliably, an integration solution satisfying all the design constraints. Until now, typical hardware design methodologies start from an algorithmic description of the application and apply an architectural synthesis step to obtain a description at the RTL level. Logic synthesis and place & route steps are then performed to obtain the final description of the circuit and precise values of area (FPGA occupation) and performance (execution time). These steps are very time consuming: the architectural synthesis step alone can take from several hours (with a High Level Synthesis tool, HLS in the following) to several months (hand coding). Furthermore, design space exploration may need several iterations if constraints are not met, which can lead to prohibitive design times.
OBJECTIVES & CONTRIBUTION
The purpose of the work presented in this paper is to define an efficient exploration methodology, starting from system level specifications, that allows: i) the definition of several architectural solutions and ii) the computation of the corresponding estimated area and execution time values. The second point allows the designer to choose a solution, while the first point provides information for the design of the selected solution. To address this problem, the following requirements have been considered: i) Define a method operating at a high level of abstraction, from a system level specification including control structures, multidimensional data and hierarchy, in order to deal with complex modern applications. ii) Give a realistic cost characterization: the estimation takes into account all the different units of the architecture (datapath, control unit, memory unit). iii) Explore the application parallelism: several architectural solutions are defined for a given specification. iv) Be applicable to several FPGA families. v) Define feasible solutions and give sufficient information for the post exploration steps (design of the selected architecture). vi) Keep complexity low to enable large design space exploration. The methodology developed can be seen as a global exploration / estimation technique based on the numerous existing works in the field of estimation and HLS (memory size estimation, scheduling techniques, data flow modelling, ...). Compared to other estimation approaches, the definition of effective architectures has been emphasized: each solution is implementable and corresponds to a given resource allocation, clock period value and scheduling. Their definition relies on a precise architectural model (not only the datapath, but also memories and control units) and takes the specificities of modern FPGA architectures into account. Due to the paper size restriction, the complete description of the related work can be found in [1].
0-7803-7761-3/03/$17.00 ©2003 IEEE
Compared to a typical design exploration flow, we do not need to make a complete and precise description of the circuit. For example, we do not need to go as far as a precise description of the connections between resources, or to build a floorplan.
Those steps only need to be computed once in the design cycle and are left to the steps following the exploration process (synthesis / refinement / optimization). The reduced complexity then makes it possible to quickly explore the effect of different implementation possibilities (intra loop parallelism exploration, resource allocation, clock period, evaluation of several target FPGAs). Obviously, the solutions defined may be sub-optimal in some cases, but they always correspond to implementable solutions. The estimation values computed (area and execution time) are therefore more representative of the system's feasibility. Moreover, those metrics give the designer useful information that makes the choice of an implementation easier (satisfying both area and execution time constraints). Once a solution is selected, application synthesis and solution refinement / optimization can be performed in a classical way, for example with an HLS tool, thanks to the rich set of information given by the architecture definition step. This fast system level exploration makes it possible to evaluate many design possibilities very early in the design cycle, where choices have a great impact on the final system performance. Evaluating several design possibilities also helps to converge more quickly and more reliably towards an optimal implementation solution.
EXPLORATION / ESTIMATION METHODOLOGY
First, the system level specification is given in a high level language (C language), and is then translated into an intermediate representation, the HCDFG model [1]. This model is a hierarchical control and data flow graph allowing efficient algorithm characterization and exploration of complex modern applications including control flow and multidimensional data. As illustrated in figure 1, a C program is decomposed into control structures called CFGs and into linear sequences of operations called DFGs. For example, the If-Then-Else construct labeled 2 is composed of three DFGs: one for the evaluation of the condition and two for the True and False sequences of code. Hence, using the HCDFG model, the C program is converted into a hierarchical graph. For further information about the HCDFG model please refer to [1]. Starting from this specification and given a target component, the architectural exploration methodology (figure 2) consists in defining several implementation solutions and estimating FPGA resource occupation and algorithm execution time. To perform this estimation, we need to know the target FPGA characteristics, which are described in a technology file [1]. Moreover, to give realistic estimation values, we use a specific architectural model and take memory requirements into account (the total memory size needed is estimated). The Explo-
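The decomposition into CFG and DFG nodes described above can be pictured with a small data structure. The following is only a minimal sketch under our own assumptions: the class names `CFG` and `DFG` mirror the terms used in the text, but the fields, the operation names and the helper functions `count_operations` and `depth` are hypothetical and do not reproduce the actual HCDFG format defined in [1].

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DFG:
    """Leaf node: a linear sequence of operations with no control flow."""
    name: str
    operations: List[str] = field(default_factory=list)


@dataclass
class CFG:
    """Control node (loop, if-then-else, ...) holding DFGs or nested CFGs."""
    name: str
    kind: str
    children: List[Union["CFG", DFG]] = field(default_factory=list)


def count_operations(node: Union[CFG, DFG]) -> int:
    """Recursively count the operations stored in all DFG leaves."""
    if isinstance(node, DFG):
        return len(node.operations)
    return sum(count_operations(child) for child in node.children)


def depth(node: Union[CFG, DFG]) -> int:
    """Number of hierarchy levels below and including this node."""
    if isinstance(node, DFG):
        return 1
    return 1 + max((depth(child) for child in node.children), default=0)


# An If-Then-Else node made of three DFGs: condition, True branch, False branch.
# Operation names are illustrative only.
if_node = CFG("if_2", "if-then-else", [
    DFG("cond", ["cmp"]),
    DFG("then", ["mul", "add"]),
    DFG("else", ["sub"]),
])

# Hypothetical top-level program: an initialization DFG followed by the If node.
program = CFG("main", "sequence", [DFG("init", ["load", "load"]), if_node])

print(count_operations(program), depth(program))
```

A recursive walk of this kind is one plausible way to characterize an algorithm level by level; the real tool may store considerably more information per node (data dependencies, memory accesses, loop bounds).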
EXPERIMENTS & RESULTS

From specification to synthesis
In this section, the design cycle described above is applied to the example of a half Discrete Wavelet Transform (DWT). The specification is written in the C language for test and simulation, and is then translated into the intermediate representation model (HCDFG) on which the exploration / estimation tool works. The DWT application is composed of 4 filtering / lifting schemes followed by a scaling process and an image rearrangement, described by doubly nested loops. Figure 3 shows the exploration results for the Xilinx Virtex V400EPQ240-7. We have only represented the logic cell occupation (where the maximum number is 4000 slices for Virtex) vs execution time (ns) curves, as they represent the most significant FPGA resource occupation for this example. As the figure shows, exploration provides 65 architectural solutions, each one corresponding to a different degree of parallelism. Consider, for example, the solution highlighted in figure 3, which corresponds to a good area/speed trade-off. Based on that solution, the designer may want to refine the exploration. For example, in this experiment the default clock period value corresponds to the delay of the slowest functional unit used in the architecture. Hence, the designer can refine the exploration results obtained previously by analyzing the effect of different clock periods and resource allocations for the selected solution (table 1). Those partial results fully characterize each architectural solution and give the designer all the information needed for the system design. In the case of our example, the selected solution is composed of 4 multipliers and 8 adders for a 223-cycle execution, which corresponds to a resource occupation of 1941 (/4000) slices, 12 (/40) BRAMs (dedicated resources for memory implementation) and 256 (/4960) tristate buffers (used in case of resource sharing or conditional branches) for a 4.50s execution time.
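The occupation figures quoted for the selected solution can be turned into utilization percentages with a few lines of arithmetic. The sketch below is illustrative only: the function names and the dictionary layout are our own, and the 10 ns clock period used in the last lines is a hypothetical value chosen to show the cycles-to-time conversion, not a figure taken from the experiment.

```python
def utilization(used: int, available: int) -> float:
    """Percentage of an FPGA resource occupied by a solution."""
    return 100.0 * used / available


def execution_time_ns(cycles: int, clock_period_ns: float) -> float:
    """Execution time obtained from a cycle count and a clock period."""
    return cycles * clock_period_ns


# (used, available) pairs quoted in the text for the selected solution.
selected = {
    "slices": (1941, 4000),
    "brams": (12, 40),
    "tbufs": (256, 4960),
}

usage = {name: utilization(used, avail) for name, (used, avail) in selected.items()}

# The solution fits the device only if no resource exceeds what is available.
fits = all(used <= avail for used, avail in selected.values())

for name, percent in usage.items():
    print(f"{name}: {percent:.1f} %")
print("fits on device:", fits)

# Hypothetical clock period (assumption, not from the paper): 10 ns.
print(execution_time_ns(223, 10.0), "ns")
```

With the quoted figures, slices are the most heavily used resource, which is consistent with the choice of plotting slice occupation against execution time.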
As exhibited in the figure, the exploration / estimation approach strongly reduces the design cycle. Hence, the designer can focus on a subset of architectural solutions that present the best delay vs area trade-offs.
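One common way to isolate the solutions with the best delay vs area trade-offs is to keep only the non-dominated points of the area/time plane (a Pareto filter). The paper does not state that the tool uses this exact procedure, so the sketch below should be read as an illustration of the idea; the `Solution` type, the function names and the six candidate points are all hypothetical.

```python
from typing import List, NamedTuple


class Solution(NamedTuple):
    name: str
    area_slices: int
    time_ns: float


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.area_slices <= b.area_slices and a.time_ns <= b.time_ns
    better = a.area_slices < b.area_slices or a.time_ns < b.time_ns
    return no_worse and better


def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep the solutions that no other solution dominates, sorted by area."""
    front = [s for s in solutions
             if not any(dominates(other, s) for other in solutions)]
    return sorted(front, key=lambda s: s.area_slices)


# Made-up exploration results (not data from the experiments).
candidates = [
    Solution("A", 500, 9000.0),
    Solution("B", 1000, 6000.0),
    Solution("C", 1200, 6500.0),   # dominated by B
    Solution("D", 2000, 4000.0),
    Solution("E", 3500, 3900.0),
    Solution("F", 3600, 4200.0),   # dominated by D and E
]

for s in pareto_front(candidates):
    print(s.name, s.area_slices, s.time_ns)
```

The quadratic comparison is adequate for a few dozen solutions such as the 65 points reported for the DWT; a sort-based sweep would be preferable for much larger design spaces.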
CONCLUSION & PERSPECTIVES
In this paper we present an automatic exploration / estimation methodology at the algorithmic level. This approach, which has been integrated in the codesign environment Design Trotter [2], makes it possible to explore a large design space at an early stage of the design cycle and to characterize each solution in terms of area vs delay. In order to provide the designer with useful bounds, the control, datapath and memory units are considered, and several FPGA technologies can be targeted. The time saving resulting from this approach is significant: it helps to shorten the time to market and to converge towards a better application / component matching. Some extensions of this work are currently being studied: considering a separate address generation unit, taking some synthesis optimizations into account to reduce local errors, and including power consumption estimation.
