Abstract: A methodology for deriving image processing ASICs from the results of their real-time emulation on the Data-Flow Functional Computer is presented. The aim of the method is to reduce the time and effort required for synthesizing and validating ASICs after emulation. This is achieved by optimizing the architecture validated on the emulator and integrating the optimized resources. The results of the derivation of a defect detector are presented.
I. Introduction
The automation of Application Specific Integrated Circuit (ASIC) design has been an active field of research for many years. On one hand, a whole range of tools aimed at helping the traditional ASIC designer has been developed (e.g. combinational logic optimization, automatic placement/routing...); on the other hand, high-level synthesis tools expected to generate a Register-Transfer and/or gate level netlist from a behavioral description of the design have been studied. Despite all these efforts, designing a correct ASIC (one which will function as expected in its target environment) remains a far from trivial task. One effective way of ensuring that the ASIC will behave correctly once plugged into its environment is to emulate it. Emulated designs are also of higher quality, since the number of cycles that can be run is several orders of magnitude greater than with simulation (even accelerated). Once successfully emulated, the design has to be retargeted and integrated into one or more ASICs.
This paper presents a methodology for deriving image processing ASICs from the results of their real-time emulation on the Data-Flow Functional Computer (DFFC) emulator. The design is first emulated in real time on the DFFC emulator dedicated to image processing [2, 3]. The DFFC is an 8 × 8 × 16 array of Data-Flow Processors (DFPs) and it processes digital video streams on the fly at a pixel rate of up to 25 MHz. The algorithm is expressed in a functional programming language which is translated into a DFP graph using operators from a database (200 operators). The function of each DFP is specified by a finite-state machine definition, and a DFP implements a low-level image processing operator (e.g. adder, line/pixel delay, histogrammer...).
During emulation, an architecture implementing the algorithm in real time on real-life scenes is exhibited.
Email: ik@etca.fr
† G.M. Quénot is currently at LIMSI-CNRS, BP133, 91403 Orsay Cedex, France. Email: quenot@limsi.fr
‡ B. Zavidovique is also with the Institut d'Électronique Fondamentale, Université de Paris XI, 91405 Orsay Cedex, France. Email: zavido@etca.fr
II. Derivation from emulation results

A. Basic concept
Emulation on the Data-Flow Functional Computer yields a multi-level validated description of the design: at the highest abstraction level the design is represented by a data-flow graph (DFG), then it is defined as a network of finite-state machines (state transition graphs), and finally it is specified by a Register-Transfer/gate level netlist. Each description can be used independently for generating the ASICs. Synthesis from the data-flow graph or the state transition graphs yields optimized ASICs at the expense of a costly validation. Retargeting the RT/gate level netlist yields sub-optimal ASICs but minimizes the validation effort.
The approach to ASIC generation after emulation presented in this paper is an attempt to exploit all the results of the emulation: each of the three descriptions of the design is used in the derivation process to its best advantage. The derivation process consists of optimizing the validated Register-Transfer level description of the design in order to obtain a reasonably cheap (i.e. small silicon area) integrated design in the least amount of time.
B. Emulation for derivation
This section discusses how to improve the derivation process by carefully specifying the data-flow graph of the design. A design can be specified and emulated using only database operators; however, the resulting DFP graph is not necessarily optimal as far as the amount of resources required is concerned (the basic database operators are not necessarily the cheapest implementations of a given operator). Although this has no impact on the emulation (as long as the design fits in the 1024 DFPs), the derivation process benefits from a careful specification of the data-flow graph and of the DFP operators involved.
When specifying the data-flow graph of the design, a simple rule is to use as few nodes (i.e. operators) as possible. Figure 1 depicts some simple yet effective resource-saving graph transformations.
[Figure 1: resource-saving graph transformations. Node labels in the figure: 1-LINE FIFO, 1-LINE DELAY, 2-LINE DELAY.]
    DEF C = OP.[LD(1).A, LD(1).B] ;          -> 5 DFPs
  becomes
    DEF C = LD(1).OP.[A, B] ;                -> 3 DFPs

    DEF B = LD(1).A ; DEF C = LD(2).A ;      -> 5 DFPs
  becomes
    DEF B = LD(1).A ; DEF C = LD(1).B ;      -> 4 DFPs
This resource-conscious approach to emulation can be very efficient: in the defect detector application, 38 out of 143 DFPs (26%) were saved relative to an initial specification based only on database operators.
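The DFP counts attached to the Figure 1 transformations can be reproduced with a small cost function. The following is only a sketch: the classes `Src`, `LD` and `Op` are hypothetical stand-ins for the functional notation, and the cost model (a delay of n lines costs n + 1 DFPs, any other operator costs 1 DFP, shared sub-expressions are counted once) is an assumption chosen because it matches the four counts in the figure, not something stated in the paper.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Src:
    """An input stream such as A or B (costs no DFP)."""
    name: str


@dataclass(frozen=True)
class LD:
    """Line delay LD(lines) applied to a stream."""
    lines: int
    arg: "Node"


@dataclass(frozen=True)
class Op:
    """A generic operator OP applied to a list of streams."""
    args: Tuple["Node", ...]


Node = Union[Src, LD, Op]


def dfp_cost(roots):
    """Count DFPs for a list of definitions, sharing common sub-expressions."""
    seen = set()

    def visit(node):
        if isinstance(node, Src) or node in seen:
            return 0
        seen.add(node)
        if isinstance(node, LD):
            # Assumed cost model: an n-line delay uses n + 1 DFPs.
            return node.lines + 1 + visit(node.arg)
        # Assumed cost model: any other operator uses 1 DFP.
        return 1 + sum(visit(a) for a in node.args)

    return sum(visit(r) for r in roots)


A, B_in = Src("A"), Src("B")

# DEF C = OP.[LD(1).A, LD(1).B]  versus  DEF C = LD(1).OP.[A, B]
op_of_delays = dfp_cost([Op((LD(1, A), LD(1, B_in)))])
delay_of_op = dfp_cost([LD(1, Op((A, B_in)))])

# DEF B = LD(1).A ; DEF C = LD(2).A  versus  DEF B = LD(1).A ; DEF C = LD(1).B
b = LD(1, A)
independent_delays = dfp_cost([b, LD(2, A)])
chained_delays = dfp_cost([b, LD(1, b)])

print(op_of_delays, delay_of_op, independent_delays, chained_delays)
```

Under these assumptions the saving comes from two effects: moving a delay after an operator so that it is applied once instead of once per input, and reusing an already delayed stream instead of delaying the source again.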
III. Optimizing the DFP graph
The input to the derivation process is the file describing the DFP graph. This file contains 1) the finite-state machine definition of each DFP involved and 2) the connections between the DFPs. The derivation software internally builds a Register-Transfer level netlist of the DFP graph using a generic model of the Data-Flow Processor (also at the RT level). This netlist is not fully flattened: the intrinsic DFP-level granularity of the netlist is preserved. Optimizations are thus defined at the DFP level and are performed on this Register-Transfer level specification of the design.
The architecture of the Data-Flow Processor is shown in figure 3 and a schematic of its datapath appears in figure 4. In figure 3 the 12 input and output ports are drawn separately for clarity, but there are actually 6 input/output ports (corresponding to the 6-connectivity in 3D). Each input/output port is a bidirectional 10-bit bus interface and can be configured as a sending (output), a receiving (input) or a feedback port. In the following, we call a resource of a Data-Flow Processor any DFP datapath element (ALU, multiplier, FIFO...), the DFP controller and its I/O ports. The derivation flow is shown in figure 5.
DFP datapath. Unused resources are removed, the bitwidths of the remaining resources are reduced, and the depths of the FIFOs and of the datapath pipeline are adjusted as required. Unused datapath resources are identified by analyzing the finite-state machine description of the DFP operator: the state transition graph defines an active sub-path in the (configurable) datapath, and the data will flow only through this sub-path. The resources present on the active sub-path are the only ones that are kept and reduced/adjusted.
DFP controller. The specification of the DFP controller is a state transition graph and its implementation is done with a RAM. Thus the most immediate way of implementing the controller in derived DFPs is to keep the initial implementation of the controller and optimize it where possible: the "useful" content of the RAM is dumped into a ROM and the multiplexors/buses are adjusted accordingly. This approach obviously lacks flexibility, hence the choice [...]
At this point in the derivation process the design is represented by a graph of "data-flow specific processors", e.g. data-flow adders, data-flow line delays, data-flow histogrammers... They still include costly flow-management resources (I/O FIFOs and ports). These are dealt with by the high level transformations.
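The identification of the active sub-path described above can be illustrated as a reachability walk over the state transition graph. This is a minimal sketch under stated assumptions: the dictionary encoding of the finite-state machine, the resource names and the `adder_fsm` example are all hypothetical and do not reflect the actual DFP file format or the derivation software.

```python
def active_resources(transitions, start):
    """Collect the datapath resources used by transitions reachable from start.

    transitions maps a state name to a list of (next_state, resources) pairs,
    where resources is the set of datapath elements the transition activates.
    """
    used = set()
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for next_state, resources in transitions.get(state, []):
            used.update(resources)
            if next_state not in seen:
                seen.add(next_state)
                stack.append(next_state)
    return used


# Hypothetical generic datapath and a hypothetical two-state adder operator.
ALL_RESOURCES = {"alu", "multiplier", "fifo_in", "fifo_out", "ram"}
adder_fsm = {
    "wait": [("add", {"fifo_in"})],
    "add": [("wait", {"alu", "fifo_out"})],
    # A state that is never reached: its resources must not be kept.
    "unused": [("wait", {"multiplier"})],
}

kept = active_resources(adder_fsm, "wait")
removed = ALL_RESOURCES - kept
print(sorted(kept), sorted(removed))
```

Only the kept resources would then be candidates for bitwidth reduction and depth adjustment; everything in the removed set disappears from the derived DFP.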
[Figure 5: derivation flow. Stages shown: data-flow functional programming; data-flow graph; data-flow processor graph (operator database); emulation on the Data-Flow Functional Computer; low level optimizations (identification of the used resources, adjusting of the datapath resources, reduction of the programmable resources); high level transformations (collapsing of neighboring DFPs, replacement of non-optimal macro-DFPs, operator library); RTL VHDL netlist; layout compiler; VLSI chipset.]
The flow-management resources are eliminated by collapsing neighboring DFPs into macro-DFPs according to the function they implement or to the user's specifications. For instance, the atan (arc tangent) operator of the defect detector required a 1024 × 8 Look-Up Table (LUT) and was emulated using 9 DFPs (4 LUTs, 5 selectors) because of the DFP's limitations (a DFP contains a 256 × 9 RAM). This operator was derived into a macro-DFP containing a 1024 × 8 ROM.
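The atan example can be illustrated by checking that a table split into four 256-entry banks and a single 1024-entry ROM return the same values. The sketch below makes several assumptions that are not in the paper: the table contents (`atan_entry`), the use of the two upper address bits to select a bank, and the function names. It models only the bank selection, not the five selector DFPs or the data-flow protocol.

```python
import math


def atan_entry(index):
    """Hypothetical 8-bit table entry for a 10-bit index (illustration only)."""
    return round(math.atan(index / 1023.0) / (math.pi / 4) * 255)


# Derived form: one 1024 x 8 ROM inside a single macro-DFP.
rom = [atan_entry(i) for i in range(1024)]

# Emulated form: four 256-entry LUTs, one per DFP.
banks = [rom[b * 256:(b + 1) * 256] for b in range(4)]


def emulated_lookup(address):
    """Select a bank with the two upper bits, then index with the lower eight."""
    bank, offset = address >> 8, address & 0xFF
    return banks[bank][offset]


def derived_lookup(address):
    """Single ROM access, as in the collapsed macro-DFP."""
    return rom[address]


same = all(emulated_lookup(a) == derived_lookup(a) for a in range(1024))
print(same, max(rom))
```

The two forms are functionally identical, which is why the collapse only removes flow-management resources (ports and FIFOs between the nine DFPs) without changing the operator's behavior.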
The derived ASICs have a finite-state machine with datapath architecture and a data-flow execution model. The ASICs are stand-alone autonomous circuits but they remain compatible with the DFFC.
C. Output of the derivation process
The output of the derivation process is a VHDL RT/gate level netlist describing the optimized design. The hierarchy of the netlist is shown in figure 6. A VHDL package provides the netlist generation with all necessary RT level operators. The "DFP CORE" VHDL entity can be a single DFP core or a collapsed macro-DFP core. Note that the derived design can be described either as a re-usable macro (without pads) or as an actual chip (with pads). The VHDL netlist contains both Register-Transfer level (datapath) and gate level (controller and I/O control) primitives.
The layout compilation is performed with COMPASS' back-end tools. The netlist is technology-independent; theoretically the corresponding design can be implemented in any technology, e.g. gate array, standard cells, datapath and memory compiler cells, or even field-programmable gate array. Two implementation options have been explored in particular: full standard cells and mixed standard/datapath cells (plus RAM/ROM/multiplier compiler cells for both options) in ES2's 1.0 µm CMOS technology. The main differences between the two options are summarized in table I.
In practice the only possible way of manually placing mixed implementations is first to place and route each DFP or macro-DFP individually, then to assemble (place and route) the resulting blocks. This is an incremental procedure: the floorplan is refined until the final layout is satisfactory, and the designer has to modify the placement of the full design as well as the placement of each (macro-)DFP. A gain of 10% between a poorly planned placement and a carefully studied one is easily achieved. One interesting feature of this method is the high re-usability of the (macro-)DFPs: each placed and routed (macro-)DFP can be used in any design where it is needed.
The layout generation is the most time-consuming step in the derivation process. Whether the designer chooses a full standard cells implementation or a mixed one, the generation of a satisfactory layout can take up to a few days. This is to be compared to the few minutes necessary to derive the netlist of the design after emulation. In order to cope with this problem, we plan to automatically generate placement constraints for cells and connectors and guidance for buses and signals. Furthermore, the trade-off between the quality of the layout and the generation delay shall be studied.
V. Results
Typical results for derived Data-Flow Processors (featuring low level optimizations and mixed standard/datapath cells) are shown in table II. The application of high level transformations yields a further area reduction of about 60% for derived circuits. With the 1.0 µm technology used, a maximum of 100 DFPs can be safely integrated on a single ASIC. In the future a 0.7 µm technology will raise the density to 200 DFPs per ASIC. Derived circuits are validated by simulation, first at the (macro-)DFP level, then at the circuit level. Since this is a validation of optimized versions of already validated circuits, the amount of simulation needed is restricted to a few hundred input vectors.
The defect detector algorithm consists of identifying and locating defective regions in strongly patterned images (e.g. wafers) by considering edge directions. It is based on a model of human vision presented in [1]. Basically the detector identifies the regions where the edges have directions that are weakly represented in the whole image (a pattern defect is defined as a difference between the local mean edge direction and a global measure of these directions). The data-flow graph of the defect detector is shown in figure 7. The algorithm involves about 70 instructions/pixel, thus at a rate of 25 images/second (considering 572 × 768 images) the computing power required is about 800 MIPS. The defect detector has been emulated on the Data-Flow Functional Computer in real time using 108 Data-Flow Processors.
The defect detector has been derived by respecting the hierarchy of the data-flow graph: each macro-operator has been derived into a single ASIC. The technology used was ES2's 1.0 µm CMOS. The extraction macro has been implemented in both standard and datapath cells. The area of the chip is 36.24 mm² (core area 23.59 mm²). The direction macro has been implemented in standard cells. The chip area is 47.79 mm² (core area 34.09 mm²). The layouts of the chips are displayed in figure 8.
This is obviously not the least silicon-greedy way of deriving the defect detector. It should instead be derived as a single DFP graph so that the optimizations are applied to the whole graph rather than independently to each macro. Both datapath and standard cells should be used. We estimate that such a derived defect detector could be integrated onto a single chip whose core area would be about 100 mm².
Future work consists of studying the automatic generation of placement constraints for cells and connectors and of guidance for buses and signals in order to improve the layout generation.
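The computing-power estimate quoted for the defect detector (about 70 instructions/pixel, 572 × 768 images, 25 images/second) can be checked with a few lines of arithmetic. The variable names are illustrative, and the 25 MHz figure is the maximum DFFC pixel rate given in the introduction.

```python
instructions_per_pixel = 70
image_lines, pixels_per_line = 572, 768
images_per_second = 25

# Pixel throughput required by the video stream.
pixels_per_second = image_lines * pixels_per_line * images_per_second

# Instruction throughput, in millions of instructions per second.
mips = instructions_per_pixel * pixels_per_second / 1e6

# Fraction of the 25 MHz maximum pixel rate of the DFFC that is needed.
pixel_rate_share = pixels_per_second / 25e6

print(pixels_per_second, mips, pixel_rate_share)
```

The result is roughly 770 MIPS, which the text rounds to about 800 MIPS, and the required pixel rate stays well under the 25 MHz limit of the emulator.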
