This paper studies the use of a reconfigurable architecture platform for embedded control applications, aimed at improving real-time performance. The hw/sw codesign methodology from POLIS is used. It starts from high-level specifications, optimizes an intermediate model of computation (Extended Finite State Machines) and derives both hardware and software, based on performance constraints. We study a particular architecture platform, which consists of a general purpose processor core, augmented with a reconfigurable function unit and data-path to improve run time performance. A new mapping flow and algorithms to partition hardware and software are proposed to generate implementations that best utilize this architecture. Encouraging preliminary results are shown for automotive electronic control examples.
INTRODUCTION
Configurable system-on-a-chip (CSoC) architectures are emerging as a promising alternative to both ASICs and general purpose processors (GPPs), as witnessed by the number of currently available commercial platforms [25]. ASICs suffer from long design cycles, sky-rocketing NRE costs (SEMATECH estimates a cost of $1M for a 0.1 µm mask set) and poor flexibility, while GPPs do not meet performance requirements for demanding applications. The embedding of reconfigurable hardware (eFPGA) expands the range of problems for which post-fabrication solutions are viable. This eliminates the time and money spent in silicon design, fabrication and manufacturing verification.
The wide adoption of these reconfigurable architectures relies on the availability of appropriate methodologies as well as CAD tools to map designs to future multi-million gate devices quickly and to efficiently exploit their flexibility. In particular, the hardware/software co-design process must determine which portions of the overall specification should be mapped into the reconfigurable logic and which retained on the processor. We discuss the automation of the entire design flow, from high-level specification to hardware and software implementation, for control-oriented applications.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CODES'02, May 6-8, 2002, Estes Park, Colorado, USA. Copyright 2002 ACM 1-58113-542-4/02/0005 ...$5.00.
While Estrin's "fixed plus variable structure computer" proposed in the early 60s is likely the first reconfigurable computer, the introduction of FPGAs spurred a wealth of research in reconfigurable FPGA-based systems [13]. In the last decade, FPGA-based platforms have achieved significant speedups for a range of applications including data encryption, DNA sequence matching, automatic target recognition, genetic algorithms, image filtering and network processors. However, with increasing system requirements, embedded control applications, such as automotive control, avionics, robotics and industrial plant control processes, are also experiencing performance bottlenecks on traditional microcontroller platforms. Thus reconfigurable hardware opens up new implementation opportunities in this domain.
We use a homogeneous representation for both hardware and software, based on a network of Extended Finite State Machines (EFSMs). This can be captured using several high-level languages with underlying EFSM semantics, such as ESTEREL [9] or State Transition Diagrams. From this common representation we perform hardware/software partitioning and code generation based on performance profiling.
For system implementation, we selected the reconfigurable VLIW RISC core proposed in [7], featuring a 32-bit MIPS core augmented with an FPGA reconfigurable control unit and data-path, which can be customized to issue special instructions reading from and writing to the same MIPS integer registers. A GCC-based software tool-chain for compilation and performance simulation for this architecture was developed, as presented in [23]. It supports arbitrary user-defined instructions to be mapped onto the reconfigurable matrix. The user manually identifies and tags certain computation kernels in the source code to be extracted as single FPGA instructions. The tool-set then provides automated support for compilation, assembly, simulation and profiling of the resulting code on this platform.
Starting from the EFSM representation, we provide methods to explore different hw/sw trade-offs based on the profiling of the source code with the GCC-based tool-chain. Thus, we automatically derive the C code with critical kernels tagged for FPGA implementation. Our methodology provides a totally automated flow from high-level specifications such as ESTEREL, down to performance-optimized machine code for the reconfigurable target architecture.
The paper is organized as follows. Section 2 reviews some related work in reconfigurable computing and code generation. Section 3 gives our design flow, with details on the model of computation, architecture selection and performance evaluation methods. Section 4 details our method of hw/sw partitioning and code generation. Section 5 gives experimental results, followed by conclusions and future work in Section 6.
RELATED WORK
A number of research efforts have studied development strategies and CAD tools for reconfigurable platforms. A popular approach is to produce a unified development environment and provide a single language that can be effectively mapped to either hardware or software. PRISM [1] accepts generic C code and generates FPGA configurations in a semi-automatic fashion. PRISM analyzes C code to identify C functions that can be implemented with combinational logic. A similar approach to instruction-set augmentation for general purpose computing was proposed for PRISC [22], which considers a finer granularity, i.e. any grouping of instructions instead of entire C functions, for hardware synthesis onto hardware programmable functional units.

Our approach starts from a high-level abstraction using EFSMs as the formal model to capture system specifications. Any textual or graphical language with underlying EFSM semantics (at present ESTEREL is used as a specification language) can be employed. In the POLIS framework [2] each EFSM is represented by a single state transition table for the control path and a lookup table for the data-path. This does not scale well to designs with large state spaces. Binary Decision Diagrams (BDDs) are used to optimize the state transitions and synthesize the code. This single BDD-based representation cannot be used to perform partitioning between hardware and software at a fine logic computation level.
Other works on Esterel compilation provide efficient ways to translate Esterel programs directly into sequential C code [5, 12, 28]. These techniques do not easily support hw/sw partitioning and cannot be utilized directly for mapping the applications onto a reconfigurable architecture.
For functional evaluation using reconfigurable logic, Sasao et al. [24] proposed using a platform that combines FPGAs with sequencing logic to perform logic simulation and showed significant speed-up versus a GPP approach. However, only combinational logic functions are considered in their approach.
Figure 1: Control-data network
DESIGN FLOW
Model of Computation
We use a network of EFSMs as the fundamental computation model. This is derived from a high-level specification, currently ESTEREL.
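To make the computation model concrete, the following is a minimal sketch of a single EFSM: a finite control state extended with data variables, where each transition carries a guard over the variables and an action that updates them. The class names, the `react` method and the counter example are hypothetical illustrations, not the representation used inside the tool flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical minimal EFSM: control state plus integer data variables.
Vars = Dict[str, int]


@dataclass
class Transition:
    source: str                      # control state the transition leaves
    event: str                       # input event that triggers it
    guard: Callable[[Vars], bool]    # control condition over the data variables
    action: Callable[[Vars], None]   # data-path update
    target: str                      # control state the transition enters


@dataclass
class EFSM:
    state: str
    variables: Vars
    transitions: List[Transition] = field(default_factory=list)

    def react(self, event: str) -> bool:
        """Fire the first enabled transition for this event; return True if one fired."""
        for t in self.transitions:
            if t.source == self.state and t.event == event and t.guard(self.variables):
                t.action(self.variables)
                self.state = t.target
                return True
        return False


def make_counter(limit: int) -> EFSM:
    """Illustrative machine: counts 'tick' events after 'start' until 'limit' is reached."""
    def incr(v: Vars) -> None:
        v["count"] += 1

    machine = EFSM(state="IDLE", variables={"count": 0})
    machine.transitions = [
        Transition("IDLE", "start", lambda v: True, lambda v: None, "COUNTING"),
        Transition("COUNTING", "tick", lambda v: v["count"] + 1 < limit, incr, "COUNTING"),
        Transition("COUNTING", "tick", lambda v: v["count"] + 1 >= limit, incr, "DONE"),
    ]
    return machine
```

In this sketch the guards play the role of control nodes and the actions play the role of data nodes; a network of EFSMs would connect several such machines through shared events.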
Technology independent optimization is performed using multivalued (MV) logic and data-path manipulation. In the architecture mapping phase, implementation code is automatically generated from this control-data network for the reconfigurable platform (see Section 4).
Architectures
Reconfigurable platforms, coupling programmable logic with a processor core, come in different varieties, differing in processor integration scheme, computing model, and the granularity of the reprogrammable logic. For the computing model, the reconfigurable array may be deployed as an autonomous co/stream processor, dedicated I/O processor, interface glue logic, or as instruction set augmentation. Several computing models may be supported by the same platform, but effectiveness depends on the level of integration between the reconfigurable logic and CPU core.
[21, 14]
The integration scheme determines the granularity of the application segments executed on the reconfigurable fabric. Due to the fine granularity of the finite state machine code, we adopted the instruction augmentation computing model and the reconfigurable platform of [7]. Special FPGA operations can be reconfigured and viewed as special instructions in the system ISA. This can be utilized at the C programming level. Since the reconfigurable array is part of the processor's data path, there is minimal communication overhead compared to the coprocessor implementation. A coprocessor platform requires additional cycles to explicitly transfer data to and from the reconfigurable array, thus undermining the performance gain obtained from FPGA instructions.
There are disadvantages to this approach as well. The number of ports on the register file limits the input/output bandwidth of the FPGA array. The control flow of the data path also requires that the FPGA array be executed synchronously with the pipeline design. These requirements dictate that only small blocks with few inputs and outputs can be implemented on the FPGA array. Fortunately, this works well in the chosen application space, in that EFSMs contain nodes that have only a few inputs and outputs and perform simple calculations.
Performance Evaluation
We use the GCC-based performance evaluation tool-chain developed for the target architecture [23]. Given the C code of the target application, the user can tag blocks of C code to be implemented on the reconfigurable array, using a pragma directive:

    #pragma instrname opcode delay nout nin outs ins

where instrname is the mnemonic name of the FPGA instruction; opcode is the instruction code used in hardware simulation, which is not relevant in our approach; delay is the latency in clock cycles of this instruction; nout and nin are the numbers of outputs and inputs; outs and ins are the lists of output and input variables. The code that follows is interpreted as a C simulation model, which is then ended with another pragma directive.
After the tags are added to the C code, the simulator evaluates the code using the cycle counts specified by the user for the tagged blocks. The profiler returns the number of cycles used to execute each line of code.
Our goal is to automatically partition hardware and software and generate the tags for FPGA instructions, which implement the hardware partition. There are two limitations for an FPGA instruction: (a) it must have no more than 3 inputs and 2 outputs, due to instruction encoding and register file limitations; (b) there is a limited pool of LUTs for FPGA instructions. These are taken into account in the partition algorithm in the next section. There is also the challenge of accurately estimating the cycle count of the FPGA instructions. The most precise method compiles the prospective FPGA instructions into LUTs and compares the longest path delay with the clock cycle time of the processor pipeline. In the experiments we use heuristics for the estimation in order to have a quick performance evaluation.
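The two limitations and the delay-based cycle estimate can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the LUT count and longest path delay are assumed to come from a prior synthesis run, and the latency is taken as the longest path delay divided by the clock period, rounded up. The experiments in this paper use quicker heuristics instead.

```python
import math

MAX_INPUTS = 3    # from the text: at most 3 inputs per FPGA instruction
MAX_OUTPUTS = 2   # from the text: at most 2 outputs per FPGA instruction


def fits_instruction(n_inputs: int, n_outputs: int, luts: int, luts_free: int) -> bool:
    """Check limitation (a), the I/O bound, and limitation (b), the remaining LUT pool."""
    return n_inputs <= MAX_INPUTS and n_outputs <= MAX_OUTPUTS and luts <= luts_free


def latency_cycles(longest_path_ns: float, clock_period_ns: float) -> int:
    """Assumed estimate: pipeline cycles needed to cover the longest combinational path."""
    if clock_period_ns <= 0:
        raise ValueError("clock period must be positive")
    return max(1, math.ceil(longest_path_ns / clock_period_ns))
```

For example, a block whose longest path is 12.5 ns on a 10 ns clock would be tagged with a delay of 2 cycles under this assumption.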
CODE GENERATION
The code generation problem is, given a multi-level control-data network, to generate efficient code that consists of two portions: software blocks to be executed on the processor and tagged hardware blocks to be implemented as FPGA instructions. The objective is to maximize the overall performance while satisfying appropriate resource constraints, such as RAM and ROM usage and FPGA size.
We solve this problem in two steps: (a) construct maximal regions 
Clubbing
A club is a candidate block of functionality for potential hardware implementation. It is defined as a cluster of nodes, which satisfies the following constraints:
1. It does not contain primary inputs or latches;
2. It consists of either pure control nodes or pure data nodes;
3. Its number of inputs (outputs) does not exceed a predetermined maximum number;
4. It does not introduce combinational cycles among clubs.
We currently do not allow primary inputs to be implemented on the reconfigurable array, because the architecture dictates that all functional inputs be supplied through the register files. However, in some applications it may be worthwhile to consider different architectures that allow this. Latches can be implemented in FPGAs as well, thus the FPGA operations may have states that are kept from instruction to instruction. This involves sequential EFSM partitioning, which is beyond the scope of this paper.
Condition (2) is present because control and data nodes require different sets of logic and data-paths for implementation. Condition (3) is due to instruction encoding and register file limitations. Condition (4) guarantees deterministic and correct functionality.
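The four club conditions can be checked as in the sketch below. The network encoding (a dictionary from node name to a kind and a fanin list), the function names and the default bounds of 3 inputs and 2 outputs are assumptions made for illustration. Primary outputs are not modeled, and condition (4) is tested by collapsing each club into one node and running a topological sort, which is one possible way to detect a combinational cycle rather than the method used in the actual flow.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: node name -> (kind, fanin names),
# with kind in {"input", "latch", "control", "data"}.
Network = Dict[str, Tuple[str, List[str]]]


def club_io(net: Network, club: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Inputs: outside nodes feeding the club. Outputs: club nodes feeding outside nodes."""
    inputs = {f for n in club for f in net[n][1] if f not in club}
    outputs = {n for n in club if any(n in net[m][1] for m in net if m not in club)}
    return inputs, outputs


def satisfies_local_constraints(net: Network, club: Set[str],
                                max_in: int = 3, max_out: int = 2) -> bool:
    kinds = {net[n][0] for n in club}
    if kinds & {"input", "latch"}:      # condition (1)
        return False
    if len(kinds) != 1:                 # condition (2): pure control or pure data
        return False
    ins, outs = club_io(net, club)      # condition (3)
    return len(ins) <= max_in and len(outs) <= max_out


def clubs_are_acyclic(net: Network, clubs: List[Set[str]]) -> bool:
    """Condition (4): the network with each club collapsed must stay acyclic."""
    group = {n: ("club", i) for i, c in enumerate(clubs) for n in c}
    for n in net:
        group.setdefault(n, ("node", n))
    succ = {g: set() for g in group.values()}
    for n, (kind, fanins) in net.items():
        if kind == "latch":
            continue                    # a latch breaks the combinational path
        for f in fanins:
            if group[f] != group[n]:
                succ[group[f]].add(group[n])
    indeg = {g: 0 for g in succ}
    for g in succ:
        for h in succ[g]:
            indeg[h] += 1
    ready = [g for g in indeg if indeg[g] == 0]
    visited = 0
    while ready:                        # Kahn's topological sort
        g = ready.pop()
        visited += 1
        for h in succ[g]:
            indeg[h] -= 1
            if indeg[h] == 0:
                ready.append(h)
    return visited == len(succ)
```

Note that grouping two nodes whose only connection runs through a node outside the club creates a cycle in the collapsed network, which is the situation condition (4) rules out.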
In [18] Khatri introduced a clustering algorithm, with a similar definition of clubs, for mapping from a logic network to a network of PLAs. Although it does not satisfy our clubbing constraints, our clubbing algorithm, outlined in Figure 2, is based on this.
The network is first optimized and decomposed into small nodes.
Routine Build_levelized_array (step 2) then levelizes and sorts the nodes in a depth-first order. In the levelization, the nodes are traversed from inputs to outputs: when a node is added to 
Bit-packing
The synthesis flow supports multi-valued variables in the EFSM model, but in many designs a majority of the nodes only require a few bits to represent the largest value. Therefore, in order to fully utilize the FPGA instructions with the 3-input and 2-output limitation, we need to pack several such variables into a single register operand.

To incorporate bit-packing, the code generation flow described above is modified in two aspects:
1. The clubbing algorithm is modified so that the input/output constraint reflects the number of bits rather than variables. For the selected architecture [7], the input constraint is 32 * 3 bits and the output constraint is 32 * 2 bits.
2. The code of each club is extended with additional temporary variables for the constrained club inputs and club outputs. Code is added to align the original input bits into the temporary club input variables before the FPGA instruction and to extract output bits from the temporary club output variables after the FPGA instruction.

Bit-packing allows each club to have up to 96 binary variables as input, and 64 binary variables as output. The IO constraint of the FPGA instructions for control is then only limited by the data dependencies between control nodes and data nodes. However, bit packing and unpacking are very expensive operations on conventional processors and create overhead on the software partition. Yet they are extremely cheap on the FPGA processor. This again illustrates the trade-off between computation and communication.
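The alignment and extraction code can be pictured with the following sketch, which packs several narrow variables into one 32-bit word and unpacks them again. The function names and the least-significant-first layout are assumptions made for illustration; the generated C code may lay the bits out differently.

```python
from typing import List

WORD_BITS = 32  # register width of the selected architecture


def pack(values: List[int], widths: List[int]) -> int:
    """Pack narrow variables into one word, first variable in the lowest bits (assumed layout)."""
    if len(values) != len(widths):
        raise ValueError("one width per value is required")
    if sum(widths) > WORD_BITS:
        raise ValueError("variables do not fit in one register")
    word = 0
    shift = 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its declared width")
        word |= value << shift
        shift += width
    return word


def unpack(word: int, widths: List[int]) -> List[int]:
    """Extract the variables again, using the same layout as pack()."""
    values = []
    shift = 0
    for width in widths:
        values.append((word >> shift) & ((1 << width) - 1))
        shift += width
    return values
```

With three registers available as inputs, a club can receive up to 3 * 32 = 96 one-bit variables in this way. Each `pack` or `unpack` step costs shifts and masks in software, which is the overhead on the software partition mentioned above.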
Partitioning
We explore different hardware/software trade-offs based on performance profiles of each potential club. This is done in two steps:
1. Obtain the performance profile and FPGA implementation cost (reconfigurable array LUT count) for each club.
2. Find the best hw/sw partition that maximizes performance gain and satisfies the FPGA size constraints.
For the first step, we generate C code for all clubs with no FPGA instructions; we run it through the GCC-based tool-chain and obtain an average cycle count for each line of the C code. The cycle count for a club is then the summation of the cycle counts for all its source lines.
A simple algorithm based on the number of input and output bits is used to estimate the number of LUTs required if the club is implemented in hardware. Node patterns that commonly appeared were synthesized using Xilinx logic synthesis tools to determine the LUT counts. Any unidentifiable node types such as function calls were estimated manually. In the actual experiments done so far, due to our primitive estimation of cycle count and LUT size, we chose a greedy assignment approach: the clubs are sorted in descending order of their profiled cycle counts, and each is assigned, in order, as an FPGA instruction until the FPGA size limit is reached.
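A minimal sketch of this greedy assignment is given below. The `Club` record and function names are hypothetical, the per-line cycle counts and LUT estimates are assumed to be supplied by the profiling and estimation steps, and the loop stops at the first club that no longer fits, which is one reading of "until the FPGA size limit is reached" (skipping that club and trying smaller ones would be another).

```python
from typing import List, NamedTuple


class Club(NamedTuple):
    name: str
    line_cycles: List[int]  # profiled cycle count of each source line of the club
    luts: int               # estimated LUT cost if implemented in hardware


def club_cycles(club: Club) -> int:
    """Cycle count of a club: the sum of the cycle counts of its source lines."""
    return sum(club.line_cycles)


def greedy_partition(clubs: List[Club], lut_budget: int) -> List[str]:
    """Assign clubs to hardware in descending order of profiled cycles."""
    chosen = []
    used = 0
    for club in sorted(clubs, key=club_cycles, reverse=True):
        if used + club.luts > lut_budget:
            break  # assumed stopping rule: size limit reached
        chosen.append(club.name)
        used += club.luts
    return chosen
```

Clubs not returned by `greedy_partition` remain in the software partition.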
After hw/sw partitioning, each chosen FPGA instruction needs an estimated cycle count for performance simulation. The heuristics used assume that small nodes will take 1 cycle. 
EXPERIMENTS
The benchmark set includes a multi-injection driver for engine control systems.

Generally, the results show that performance benefits the most from implementing data nodes on the reconfigurable array. However, data nodes require a significantly higher number of LUTs. Further, implementation of complex data computation like multiplication in the FPGA may not be as efficient as the arithmetic unit in the GPP core. Further study is needed on the more intricate trade-off between FPGA hardware and the traditional arithmetic data-path in GPPs.
Due to the relatively small size of our benchmark suite, the FPGA size of the target architecture is large enough to implement all potential clubs in these examples. We experimented with smaller FPGA limitations, and the mapping algorithm succeeded in producing significant performance improvement.
The results of adding the bit-packing feature are shown in Table 3.
Since only control clubs take advantage of bit-packing, we present results for implementing only control clubs. As a result of the relaxed IO constraints, the clubs become larger and more consolidated (from 122 to 50 for the injection driver example), which leads to a more efficient FPGA implementation. Since a large amount of the cycle count lies in data computation, the improvement of the total cycle count due to bit-packing is limited. Yet, it does increase the performance of the control portion, from 20% to 40%.
CONCLUSION AND FUTURE WORK
We introduced an automated hw/sw partitioning and code generation 
