In this paper, we propose the target board architecture of a rapid-prototyping embedded system based on hardware/software co-design.
have been presented. The architectural templates usually consist of a microprocessor for the software section, plus programmable logic devices or co-processors for the hardware section. Communication between the HW and SW sections is achieved through shared memory, FIFOs, or handshaking protocols. The templates support either a master-master computing paradigm (where both sections can work concurrently) or a master-slave paradigm (where only one of the two sections is active at a time).
The target board architecture
As shown in Figure 1, our target board architecture contains five basic modules:
- a TI TMS320C30 DSP processor, which implements as much as possible of the signal processing and control functions of the system;
- an FPGA array, a programmable hardware accelerator consisting of four Xilinx XC4025Es, which implements the time-critical functions of the system;
- a peripheral block for communicating with the data acquisition and playback devices;
- a 16M shared memory module, which holds the program and data accessible to the DSP processor, the FPGA array, and the host machine;
- a host interface through which the target board can be initialized and can exchange data with the host machine.

To make the target board adaptable to a wide range of applications, a universal shared bus architecture is adopted. The 32-bit primary bus provides a basic shared memory communication model between the hardware and software sections. On this bus, the C30 processor is assigned higher priority than the FPGAs. The 16-bit expansion bus provides streamlined port-to-port communication between the C30 and the FPGA array. With an addressing mechanism, multiple FIFO communication channels can be implemented over it.

Inside the FPGA array, the four XC4025E FPGAs are connected in a ring structure, as shown in Figure 2. This ring bus is used for direct point-to-point connections between FPGAs. Each FPGA also has a 2K local memory module, which can also be accessed by the C30's DMA controller.
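The idea of running several FIFO channels over one expansion bus by means of an addressing mechanism can be illustrated with a small software model. The C sketch below is not the board's interface code. It only models, under assumed parameters, how an address might select one of several independent queues of 16-bit words. The channel count, the queue depth, and all identifiers (`expansion_bus_t`, `bus_send`, `bus_recv`) are hypothetical and do not come from the design described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CHANNELS 4  /* assumed number of logical channels (hypothetical) */
#define FIFO_DEPTH   8  /* assumed words per channel queue (hypothetical) */

/* One FIFO channel, kept as a ring buffer of 16-bit words. */
typedef struct {
    uint16_t data[FIFO_DEPTH];
    size_t head;   /* index of the oldest queued word */
    size_t count;  /* number of words currently queued */
} fifo_t;

/* Software model of the bus: one FIFO per channel address. */
typedef struct {
    fifo_t chan[NUM_CHANNELS];
} expansion_bus_t;

void bus_init(expansion_bus_t *bus) {
    assert(bus != NULL);
    for (size_t i = 0; i < NUM_CHANNELS; i++) {
        bus->chan[i].head = 0;
        bus->chan[i].count = 0;
    }
}

/* Queue one word on the channel selected by addr.
   Returns false if the address is invalid or the channel is full. */
bool bus_send(expansion_bus_t *bus, unsigned addr, uint16_t word) {
    assert(bus != NULL);
    if (addr >= NUM_CHANNELS) return false;
    fifo_t *f = &bus->chan[addr];
    if (f->count == FIFO_DEPTH) return false;
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return true;
}

/* Dequeue the oldest word from the channel selected by addr.
   Returns false if the address is invalid or the channel is empty. */
bool bus_recv(expansion_bus_t *bus, unsigned addr, uint16_t *word) {
    assert(bus != NULL && word != NULL);
    if (addr >= NUM_CHANNELS) return false;
    fifo_t *f = &bus->chan[addr];
    if (f->count == 0) return false;
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In this model, traffic on one channel never disturbs the order of words on another, which is the property that lets a single physical bus carry several logical channels.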
Communication protocols and interfaces
In our target board design, five types of communication are supported. Software protocols are coded as C routines, while hardware protocols are described in synthesizable VHDL that contains both the control FSM and the interface circuitry. Table 2 outlines the features of these protocols. Readers are referred to [9] for the details. Figure 3 shows the FPGA implementation model, which consists of a data path, communication interfaces (e.g. queues, a shared memory access port, hardware send/receive ports), and their associated FSM controllers. The state diagram of the data path FSM controller is shown in Figure 4.

Table 3 lists the estimated communication delays of the proposed communication protocols. The numbers are derived from the FPGA implementation and from counting C30 execution cycles, where each cycle is 60 ns. They do not, however, include the wait delay incurred by a mismatch between the send and receive operations. Of all the protocols, FPGA local memory access has the lowest communication overhead, while the handshaking scheme takes the longest time. Table 4 compiles the FPGA interface circuitry overheads. The interface circuitry occupies only about 7% of the CLB resources.
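The cycle-counting estimate described above amounts to multiplying a cycle count by the 60 ns cycle time. The following C sketch shows that arithmetic only. The function names are hypothetical, and any cycle counts passed in are illustrative inputs rather than values from Table 3. As in the text, the result excludes wait delays caused by mismatched send and receive operations.

```c
#include <assert.h>
#include <stdint.h>

#define C30_CYCLE_NS 60u  /* C30 cycle time stated in the text */

/* Convert a count of C30 execution cycles to nanoseconds. */
uint32_t c30_cycles_to_ns(uint32_t cycles) {
    return cycles * C30_CYCLE_NS;
}

/* Estimated software-side delay for transferring `words` words when each
   word costs `cycles_per_word` C30 cycles (both inputs are assumptions
   supplied by the caller, not figures taken from the paper). */
uint32_t transfer_delay_ns(uint32_t cycles_per_word, uint32_t words) {
    return c30_cycles_to_ns(cycles_per_word * words);
}
```

For instance, `c30_cycles_to_ns(3)` evaluates to 180, that is, three cycles correspond to 180 ns.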
[Table residue: batch write; batch read; hardware send; hardware ...]
LD-CELP decoder example and summary
partitioning results and the rules given in Table 5.

Table 5. Assignment of communications
The total communication time overheads are compiled in Table 4. For both processes, the communication overheads consume less than 1% of the computing iteration time. The average delays for transferring one word of data are 180 ns and 127 ns, respectively. These figures are roughly equal to the delay of a single off-chip memory access by the C30 processor. The proposed communication protocols are therefore considered very efficient.

Both the target board architecture and the LD-CELP speech decoder are still under development, and further refinements are expected before the real implementation. We have written the interface code in both VHDL and C. For the LD-CELP decoder example, we have completed the entire system simulation and successfully implemented the Levinson-Durbin recursion module on three XC4025E FPGAs.

In summary, we have presented a novel embedded prototyping system based on hardware/software co-design. The target board is carefully designed so that various communication protocols can be implemented efficiently, with very little time and circuitry overhead. The communication interfaces are described in VHDL code and C communication routines and can easily be added to the HW and SW section designs, respectively. Our experiment with an LD-CELP speech decoder system demonstrates the efficiency of the proposed communication protocols and interfaces.
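The two efficiency measures quoted for the decoder example, the average delay per transferred word and the share of the iteration time spent on communication, are simple ratios. The C sketch below states them explicitly. The function names are hypothetical, and the inputs in the usage note are made-up values chosen only to exercise the formulas; they are not measurements from this work.

```c
#include <assert.h>

/* Average communication delay per word: total overhead divided by the
   number of words transferred. */
double avg_delay_per_word_ns(double total_overhead_ns, unsigned words) {
    assert(words > 0);
    return total_overhead_ns / (double)words;
}

/* Communication overhead as a percentage of one computing iteration. */
double overhead_percent(double total_overhead_ns, double iteration_ns) {
    assert(iteration_ns > 0.0);
    return 100.0 * total_overhead_ns / iteration_ns;
}
```

With an illustrative total overhead of 1800 ns spread over 10 words, `avg_delay_per_word_ns(1800.0, 10)` gives 180 ns per word. Against a hypothetical iteration time of 1,000,000 ns, `overhead_percent` stays below 1%.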
