For several years, embedded systems have been used in an expanding number of domains. Furthermore, their complexity has significantly increased and most embedded systems must respect several design constraints. Hence, it becomes necessary to introduce major improvements in CAD methodologies in order to help designers. Hardware-software codesign attempts to automatically derive heterogeneous system solutions from a high-level specification. The interface between heterogeneous resources can become critical in time and/or area for telecommunication applications; hence it is important to define techniques that minimize the communication overhead cost. In this paper we present an original method to synthesize efficient interfaces.
Introduction
For several years, embedded systems have been used in an expanding number of domains, for example in medical devices, automobile cruise control, videoconferencing or wireless applications. Their complexity has also significantly increased due to improvements in integration density and the sophistication of digital applications. Furthermore, most embedded systems must respect several design constraints (i.e., hardware cost, performance, power consumption). Hence, in order to handle these constraints, designers use heterogeneous resources, for example a DSP and an ASIC. Hand-crafted techniques for designing hardware-software systems are no longer sufficient due to the high complexity and the time-to-market pressure. Thus, it becomes necessary to introduce major improvements in CAD methodologies in order to help designers. For the considered applications the design cycle is generally composed of four main steps:
specification modelling, hardware-software partitioning, architecture refinement and technology implementation. Starting from the model of the specification, the hardware-software partitioning step specifies which parts will be handled by the software (DSP or RISC) and which functions will be supported by the hardware. Several partitioning methods and heuristics have already been published [4], [9], [14], [11], [17]. The architecture refinement follows the partitioning step and manages three interdependent tasks: the interface synthesis, the hardware synthesis and the code generation for the software part. These tasks lead to an RTL description of the mixed hardware-software architecture which includes the code for the software entity. Finally, the technology implementation step generates the layout of the application. Each step of this design flow involves complex and sometimes open problems. In the case
of the architecture refinement, the interface synthesis problem can become critical in time and/or area for applications that deal with a large amount of data, as in signal processing with spectral methods or in image processing applications. Hence, it is important to define techniques that minimize the communication overhead cost. In this paper we present an original method to perform the interface synthesis.
The rest of this paper is organized as follows. First, we present the most significant related work and detail our approach to the interface problem. Then, we describe the methodology that is used to minimize the hardware communication overhead cost. In Section 4, we introduce the example of an acoustic echo canceller to illustrate the considered approach and, finally, in Section 5, we conclude.
Related works
Communications between different entities can be considered at several levels of granularity. At the lowest level, communication primitives can be described with chronograms. These chronograms represent the behavior of each control signal that supports communications. At a higher level, protocols can be described with communicating processes using send and receive primitives. In this case the fine-grained behavior is not considered.
The method proposed by D.E. Thomas and J.A. Nestor [13] generates hardware structures that support fine-grain protocols. Relations between signals are described with a textual formalism in which designers define behavior and timing constraints. The synthesized hardware architecture is based on finite state machines. Similarly, the studies of G. Borriello and R.H. Katz [2] deal with interface synthesis at a fine-grain level. The proposed method can manage hardware area minimization as well as bus bandwidth optimization. The methodology is based on formalized timing diagrams to specify the interface. The synthesis algorithms transform event graphs derived from the timing diagrams into a logic specification for the interface. The interface is based on elementary logic cells. In the work of A.C.
Parker, S.A. Hayati and J.J. Granacki [10], the description of the interface is also based on graphs. The main feature of the method is the wide spectrum of protocols that can be described. For example, both synchronous and asynchronous protocols can be supported. The method proposed by S. Narayan and D. Gajski [12] supports the interface synthesis of incompatible protocols. The interface process generated between the communicating entities can manage different data widths. The method is based on duals of atomic protocol operations and can minimize the interface hardware area. All these methods deal with descriptions of protocols at a fine-grain level. Designers need to have precise knowledge of the required protocols.
Such knowledge becomes difficult to obtain if we consider complex systems using heterogeneous resources.
In this case it is important to consider a higher level of abstraction, so that designers can concentrate on system design instead of the interface structure. In the work of R.W. Brodersen and J.S. Sun [16], the considered applications are composed of a wide range of components (DSP, ASIC, FPGA or MCM subsystems). Generally, all these components have their own communication mechanisms. Thus, in order to allow data transfers between them, it becomes necessary to introduce a hardware interface. The method is based on a specific formalism that describes the communication protocols with high-level primitives.
These primitives are then decomposed into event graphs before generating the interface. In the work of J.M. Daveau, T.B. Ismail and A.A. Jerraya [3], the problem of interface synthesis is handled with another approach. Designers specify protocols at a high level of description: for example, a communication primitive is defined with a rendezvous mechanism, and maximum and average bandwidth can also be specified. Furthermore, several protocols with their main features are defined in a library. The proposed method is based on an allocation algorithm: the problem is to find the best interface to support the protocols specified by designers. The solution minimizes the hardware interface cost and respects all the communication constraints.
The method proposed by J. Gong, D. Gajski and S. Bakshi [8] makes it possible to specify an interface between processors starting from the specification of the application. The application is decomposed into local and global variables and functions that perform computations. Four interface models are proposed that enable designers to explore different communication schemes. The choice of the interface is made according to design constraints and also depends on the number of local and global variables. Once the specification is refined with interface features, behavioral synthesis and software synthesis generate the hardware-software design. One interesting point of this method is that it relieves the designer of the protocol determination task. Similarly, the work of D. Filo, D. Ku, C. Coelho and G. De Micheli [6] makes it possible to define high-level interfaces automatically. In their method, application specifications are described with graphs where nodes represent computations and edges represent functional and temporal dependencies. The considered target architecture is fully synchronous, i.e., the same clock signal is propagated through the whole design. To minimize hardware communication costs, the method schedules graph nodes in order to increase the use of synchronous protocols. In a synchronous architecture, the cost of these protocols is lower than that of asynchronous protocols. However, synchronous target architectures are generally not the best solutions for implementing complex applications using heterogeneous resources, due to synchronization problems. Furthermore, processor clock periods are mostly shorter than those of hardware implementations using FPGA or gate array structures [15]. Hence, an asynchronous template architecture with concurrent resources timed by local clocks may yield more efficient mixed hardware-software implementations.
Optimizing the interconnection between units in such heterogeneous architectures suggests the following properties:
• (i) performing a global optimization of all communications involved by the application and not only between two entities,
• (ii) a high level of abstraction to model communication features.
Furthermore, relieving designers of this time-consuming task is one aim of the codesign paradigm. The methods introduced above are not well suited to this purpose; we therefore propose a new approach to interface synthesis.
Interface synthesis
The interface synthesis process takes place after hardware-software partitioning (Fig. 1). We consider that the result of the partitioning step is a directed acyclic graph where nodes V_i represent computations and edges e_{i,j} represent functional and temporal dependencies between nodes V_i and V_j. Functional dependency edges represent data transfers. Temporal dependency edges link all nodes that perform their computations on the same resource, even if they are functionally independent. This relation expresses the sequential execution of these nodes. Granularity levels of nodes depend on the application, and several levels can be mixed in the graph. For example, modelling the acoustic echo canceller application requires nodes representing addition or FFT operations. Furthermore, every node is characterized by a parameter which indicates whether the operation it represents will be handled by a software unit or performed on a hardware resource. Interface synthesis takes into account the underlying target architecture. In our case this architecture is composed of asynchronous functional units ( 
Communication model
Our communication model is based on (i) buffered structures when communications are asynchronous, and (ii) non-buffered structures when sender and receiver can be synchronized using a rendezvous protocol (Fig. 3). A buffered structure allows sender and receiver to emit and receive data independently since the communication is performed through a FIFO. This FIFO represents an additional hardware resource cost dedicated to communications. The shared memory communication model is not considered because the execution scheme of our data-synchronized architecture is based on a FIFO communication model. Using shared memories would require local control logic, which leads to the same complexity as a FIFO structure. The complexity of control mechanisms depends on the protocols associated with asynchronous communications, i.e., blocking or non-blocking. The protocol is blocking if, before reading from or writing into a FIFO, the corresponding unit respectively checks that data are available or verifies that the FIFO is not full. With a non-blocking protocol, no verification needs to be done before writing or reading data. For communications using non-buffered structures, the hardware resource is a simple bus. The cost of this transfer type is lower; however, synchronization mechanisms must be introduced. As this solution represents a lower cost the interface synthesis attempts 
Communication synthesis method
In the sequel, each step of the interface synthesis flow is described.
• Merging redundant links
Merging redundant links performs a preliminary hardware resource minimization. Communication edges representing data transfers between nodes can only be merged under two conditions: (i) there is a common source node, and (ii) there is a single functional unit implementing the destination nodes (Fig. 5). These edges represent the diffusion of the same data from one functional unit (associated with the source node) to another functional unit (associated with the destination nodes). However, the destination nodes are scheduled at different times since their computations are performed on the same functional unit. If these edges are merged, the transferred data must be stored so that they can be read again at the right times. This temporal dilation of node 3 is called local rescheduling, as illustrated in Fig. 6c. Note that the use of synchronous communications may cause timing constraints to be exceeded (Fig. 7). In such a case, the transfer of data is supported through an asynchronous communication.
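The two merging conditions can be sketched as a simple grouping step. This is only an illustration under stated assumptions, not the implementation used in this work: the `Edge` record, the `unit_of` mapping, the unit names and the function name `merge_redundant_links` are hypothetical, and the data name carried by each edge is used to express that merged edges transfer the same data.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    src: int   # source node
    dst: int   # destination node
    data: str  # name of the transferred data (hypothetical label)


def merge_redundant_links(edges, unit_of):
    """Group edges that (i) share a source node and carry the same data and
    (ii) have destination nodes implemented on the same functional unit."""
    groups = defaultdict(list)
    for e in edges:
        groups[(e.src, e.data, unit_of[e.dst])].append(e.dst)
    # One physical link per group; destinations sorted for a stable result.
    return {key: sorted(dsts) for key, dsts in groups.items()}


# Hypothetical mapping of nodes to functional units.
unit_of = {1: "DSP", 2: "ASIC", 3: "ASIC", 4: "FPGA"}
edges = [Edge(1, 2, "x"), Edge(1, 3, "x"), Edge(1, 4, "x")]
links = merge_redundant_links(edges, unit_of)
```

In this toy graph the edges towards nodes 2 and 3 collapse into a single link, whereas the edge towards node 4 stays separate because its destination runs on another unit. The sketch does not model the storage and local rescheduling that merging implies.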
In order to determine the transfer type of each communication edge, two main functions are considered in the proposed algorithm: Node_characterization and Edge_characterization. For each node V_i of the application graph, Node_characterization computes the mobility interval ∆M_Vi. The example presented in Fig. 8 illustrates the notion of potential synchronous communication. In Fig. 8a all communications are potential synchronous; this is due to the mobility intervals associated with communication edges. Fig. 8b and c describe two cases in which potential synchronous communications become asynchronous. For example, in Fig. 8b the communication associated with edge e_{6,7} becomes asynchronous due to the temporal dependence between nodes 5 and 7 (these two nodes share the same functional unit).
Consequently, the fact that all communications can initially be implemented with a synchronous protocol does not imply that all communications will actually be synchronous in the final solution.
If potential synchronous edges remain in L after this step, each of them involves at least one asynchronous communication.
Let ζ_{i,j} be the cost function associated with the edge e_{i,j} of L, defined as the ratio of the total volume of data associated with non-labelled edges of L that become asynchronous to the total volume of data associated with non-labelled edges of L that remain synchronous: ζ_{i,j} = ∑ data of asynchronous edges / ∑ data of synchronous edges. The edge e_{i,j} of L with minimum ζ_{i,j} is labelled with a synchronous transfer since the objective is to minimize the area dedicated to FIFO structures. Edges with asynchronous communications resulting from this choice are removed from the list L. The list is reordered according to the cost function ζ_{i,j} and the whole process is iterated until all communication edges are labelled.
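The selection step based on ζ can be illustrated with a short sketch. The function name `zeta`, the candidate edges and their data volumes are hypothetical, and the handling of an empty denominator is an assumption that the text does not specify.

```python
import math


def zeta(async_volumes, sync_volumes):
    """Ratio of the data volume that becomes asynchronous to the data volume
    that remains synchronous when a candidate edge is labelled synchronous."""
    total_async = sum(async_volumes)
    total_sync = sum(sync_volumes)
    if total_sync == 0:
        # Assumption: with no remaining synchronous data, treat the candidate
        # as the worst case unless nothing becomes asynchronous either.
        return math.inf if total_async > 0 else 0.0
    return total_async / total_sync


# Hypothetical candidates:
# edge -> (volumes forced asynchronous, volumes kept synchronous)
candidates = {
    (5, 7): ([128], [64, 64]),
    (6, 7): ([64], [128, 64]),
}
costs = {edge: zeta(a, s) for edge, (a, s) in candidates.items()}
best = min(costs, key=costs.get)  # edge to label with a synchronous transfer
```

Here edge (6, 7) is selected because making it synchronous forces the smallest share of data through FIFOs. A full implementation would then remove the edges made asynchronous, recompute the costs and iterate.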
• Communication support and protocol determination
The FIFO size associated with a communication edge is chosen so that it is sufficient to store all the data transferred through this edge. Protocols associated with communications are determined in order to avoid resource access conflicts. Since synchronous communications are supported through a rendezvous mechanism, protocols associated with senders and receivers are blocking. For asynchronous communications, since transfers have associated FIFOs, protocols associated with senders are non-blocking. Furthermore, the above algorithm ensures that when the receiver reads from the FIFO, all the data written by the sender are available. Hence, a non-blocking protocol is associated with each receiver.
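The rules of this step can be summarized in a small lookup function. This is a sketch only: the function name `communication_support`, the string labels and the convention of a FIFO size of 0 for a simple bus are assumptions introduced for illustration.

```python
def communication_support(transfer, volume):
    """Return (fifo_size, sender_protocol, receiver_protocol) for one edge.

    transfer: "synchronous" or "asynchronous"
    volume:   number of data items transferred through the edge
    """
    if transfer == "synchronous":
        # Rendezvous over a simple bus: no FIFO, both sides block.
        return 0, "blocking", "blocking"
    if transfer == "asynchronous":
        # FIFO sized to hold all the data of the edge, so neither the sender
        # nor the receiver needs to check the FIFO state.
        return volume, "non-blocking", "non-blocking"
    raise ValueError(f"unknown transfer type: {transfer}")


sync_support = communication_support("synchronous", 128)
async_support = communication_support("asynchronous", 128)
```

The asynchronous case relies on the property stated above, namely that all data written by the sender are available when the receiver starts reading.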
• Transfer mode determination
Generally, a transfer of an array of data is performed either by direct memory access (DMA) or by memory-mapped I/O move operations. The selection of the transfer mode depends on the capabilities integrated in the target processor. For example, the Motorola DSP56002 does not integrate a real DMA protocol, i.e., the DMA controller is located outside the DSP and each value is transferred through an interrupt mechanism. Thus, on the DSP56002, CPU operations and DMA transfers are mutually exclusive. On the TMS320C40 they may overlap, but modelling the behavior of this DSP is more complex since numerous conflicts may occur when movements of instructions and data compete for the buses. Table 1 gives approximate models of the total execution times on these DSPs for a transfer of N values and C cycles of CPU operations (C ≥ 0). Note that for DMA transfers these two tasks 
Experiment: an acoustic echo canceller
Our experiment is the implementation of the GMDFα (Generalized Multi-Delay Frequency-Domain Filter) algorithm, which is used to perform echo cancellation in order to improve the quality of hands-free telephones. The acoustic coupling between the loudspeaker and the microphone of each terminal is a significant operating difficulty. The GMDFα algorithm is a frequency-domain block adaptive algorithm which meets the application constraints and reduces arithmetic complexity. A detailed description is given in [5].
The data flow graph of the GMDFα algorithm [7] is given in Fig. 9 (R=64, N=128, K=8, α=2).
On each arc, the name and the volume of data are given. A dotted line means that this data will be used 
• First partitioning
In the first partitioning, the mobilities of the nodes make it possible to use only synchronous communications.
Hence, no storage resource is required. The hardware cost is due to the control associated with synchronization mechanisms. The execution scheme (Fig. 10) relies on both DMA and memory-mapped I/O techniques in order to respect the total execution time. Note that software-to-hardware data transfers after nodes 4 and 7.i (i<8) use different modes even though some parallelism exists between HW and SW units: the volume of computations of node 5 is too limited to exploit the DMA transfer mode of the DSP efficiently (see Table 1). The DMA transfer in node 7.i has a duration of 25µs and overlaps for 74µs with the computations of node 7.i+1. Each node 7.1 to 7.7 has an execution time of 574µs including the DMA transfer.
At the end of node 7.8 the transfer is performed with explicit move instructions and requires 54µs. Consequently, the total elapsed time due to hardware/software communications is 411µs and the total execution time of the GMDFα algorithm is 6.25ms, which is less than the limit of 8ms imposed by the sampling.
HW/SW communications represent less than 7% of the total execution time. Using the DMA mode improves the execution time by 2% compared to an execution using only the memory-mapped I/O technique.
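The figures quoted for the first partitioning can be checked with a few lines of arithmetic. The variable names are illustrative; the three input values are those reported above.

```python
comm_time_us = 411.0    # HW/SW communication time reported in the text
total_time_us = 6250.0  # total execution time of the algorithm (6.25 ms)
deadline_us = 8000.0    # limit imposed by the sampling (8 ms)

comm_ratio = comm_time_us / total_time_us  # share of time spent communicating
slack_us = deadline_us - total_time_us     # margin left before the deadline

print(f"communication share: {comm_ratio:.1%}, slack: {slack_us:.0f} us")
# prints "communication share: 6.6%, slack: 1750 us"
```

The communication share of about 6.6% is consistent with the stated bound of 7%, and the implementation leaves a margin of 1.75 ms with respect to the 8 ms limit.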
• Second partitioning
This second partitioning requires extra hardware area compared to the first one. This additional cost is mainly due to communication resources. The solution that will be considered for the following steps in the design flow (code generation and hardware synthesis) is the first one, since it yields a lower hardware area and meets the timing constraints. 
Conclusion
Interface synthesis represents one of the main steps in the hardware-software system synthesis flow. Its aim is to define an interface supporting all the communications between heterogeneous resources. The interface determination takes place after hardware-software partitioning in order to be indepen-
