Abstract-This paper presents a bit-level parallel communication interface for inter-processor communication between devices on separate printed circuit boards. A high-performance board-to-board communication interface is important in modern supercomputers and in portable computers or gadgets with multiple display screens. We propose recalibratable transmitter and receiver soft IP cores that support an asynchronous handshake communication interface. The valid signal can be delayed for a few cycles to guarantee that the data signals are stable, i.e. free of metastability, before they are read. The delay can be tuned, recalibrated, and tested during the pre-implementation step. The flexibility to tune the valid delay, which is set as small as possible while data integrity is still guaranteed, enables the communicating devices to operate at their maximum performance. The proposed technique has been simulated at the HDL level and has shown the expected performance in four test scenarios.
The development of board-to-board communication has its own challenges, namely metastability, crosstalk, and clock domain crossing. Metastability concerns the stability of the data signals on a data path: the data cannot be read before all data signals have reached a steady state. Accessing the data before they are stable can cause a loss of data integrity and validity. To solve this problem, an open-loop or closed-loop method with a synchronizer can be used. In this work, we use a valid signalling method to ensure that the data have been steadily loaded before the recipient node reads them from the physical link.
The data processing capability of a system depends not only on its data processing devices but also on its data communication interfaces. The communication interfaces are connected directly to the physical links between two communicating devices, each on a separate board. A bottleneck in the physical link lowers the system performance regardless of how high the working frequency of the devices or processing elements is. Hence, board-to-board data communication, the subject of this paper, is an important aspect of a high-performance computing system. Supercomputers or high-performance computers implemented in a large number of rack boxes, and smart gadgets with multiple display screens on different boards, are good examples of board-to-board inter-processor data communication.
II. RELATED WORKS AND THE KEY FEATURE
Board-to-board data communication is usually classified as an asynchronous data communication scheme, because the communicating devices on the two boards naturally have different clock sources. The clock frequency and clock phase of the devices can accordingly differ.
The main problem in asynchronous communication is, as mentioned before, metastability. Many techniques can be used to overcome this problem. A transceiver is usually interfaced to its back-end processing element through a FIFO buffer. A FIFO buffer is implemented on both the transmitting side and the receiving side. Therefore, we can design and implement a dual-clock FIFO buffer, which can be clocked with two different clock frequencies [6].
To overcome the metastability problem, we can also implement a source-synchronous interface [7], where the sending node sends its clock signal through a clock path to the receiving node. The sent clock signal is then used to synchronize the transmitted data.
Another work by Jenning et al. [8] also presented a transceiver for a board-to-board communication interface. In particular, the proposed technique replaced the wired data link with a wireless link whose carrier frequency is above 100 GHz. The wireless approach is efficient in that it avoids complex cabling. However, the wireless communication can be made only with single-bit serial data communication, which has a lower data rate than its bit-parallel counterpart. Nevertheless, wireless bit-parallel data communication can be implemented with a specific reliable and robust communication protocol, such as multiple carrier frequencies or carrier codes.
This paper also proposes another technique to overcome the metastability problem. The proposed technique is derived from our previous work [9]. However, we have improved the reconfigurability of the valid delay setting, which is the key feature of the interface. The time that the N-bit data signals need to reach a steady state on the N-bit data paths can differ depending on the length and characteristics of the physical links, including the pattern of the metal wire paths on the board. Thus, during pre-implementation testing, we can configure and calibrate the valid delay to be as small as possible while data integrity is still guaranteed. Therefore, we can operate the communicating devices at their maximum performance. Our technique has been simulated at the HDL level and has shown its performance in four test scenarios.
Fig. 1 presents the on-chip architecture of our communication interface. The figure presents two interfaces, each implemented on a Field Programmable Gate Array (FPGA) device. Each interface consists of four components, i.e. two first-in first-out (FIFO) buffers, a transmitter core (TX), and a receiver core (RX).
The interfaces reside on different FPGAs mounted on different printed circuit boards (PCBs), and each has its own clock source. Consequently, they can also work with different clock frequencies. Thus, both the frequency and the phase of the clocks of the two FPGAs are probably not equal.
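The calibration idea described in this section, choosing the smallest valid delay that still covers the slowest data line, can be illustrated with a minimal Python sketch. It assumes that a per-line settling time has already been measured or estimated during pre-implementation testing; the function name `min_valid_delay_cycles`, the `margin_cycles` parameter, and the example settling times are hypothetical and are not taken from the proposed IP cores.

```python
import math

def min_valid_delay_cycles(settle_times_ns, clock_period_ns, margin_cycles=0):
    """Smallest whole number of clock cycles that covers the slowest data line.

    settle_times_ns: assumed per-bit settling times of the N data lines (hypothetical input).
    margin_cycles:   optional extra guard cycles (hypothetical parameter).
    """
    worst_case_ns = max(settle_times_ns)  # the slowest line dictates the delay
    return math.ceil(worst_case_ns / clock_period_ns) + margin_cycles

# Example with made-up settling times for a 4-bit link and a 1 ns clock period.
delay = min_valid_delay_cycles([1.2, 3.7, 2.9, 0.8], clock_period_ns=1.0)
```

Any delay below this value would risk sampling a line that is still settling, while any larger value only adds latency, which is why the delay is tuned to the minimum that preserves data integrity.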
III. OVERVIEWS OF THE PROTOCOL AND ON-CHIP ARCHITECTURE
The data communication interface is full-duplex. Each interface faces the other through an N-bit data link, a single-bit valid link, and a single-bit acknowledge (ack) link. The valid signal flows via the single-bit valid path and can be delayed for several cycles. The acknowledge signal flows back through the single-bit acknowledge path.
At the sender side, a FIFO buffer interfaces the TX module with a processor system bus, which also connects a memory module. Another FIFO likewise interfaces the RX module with another processor system bus, which also connects a memory module. The sender processor produces data and moves them into the FIFO buffer. The TX module receives the data and sends them to the physical link. The valid signal is held back for a few cycles and is then set to inform the RX module on the other board. The RX module sends an acknowledge signal (flag) soon after it receives the data and moves them into its FIFO buffer.
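The handshake sequence above can be sketched as a simple cycle-counting model in Python. This is only an illustration under simplifying assumptions: both sides share one clock, valid assertion takes one cycle, and the acknowledge takes `ack_latency` cycles to return. The function name and the `ack_latency` parameter are hypothetical, and the model does not reproduce the timing of the actual TX and RX cores.

```python
from collections import deque

def simulate_handshake(words, valid_delay, ack_latency=1):
    """Cycle-level model of a valid/ack transfer on a single shared clock.

    Returns a list of (word, cycle_received) pairs.
    """
    tx_fifo = deque(words)  # data produced by the sender processor
    rx_fifo = []            # data delivered to the receiver side
    cycle = 0
    while tx_fifo:
        word = tx_fifo.popleft()       # TX drives the word onto the data link
        cycle += valid_delay           # valid is held back while the data lines settle
        cycle += 1                     # TX asserts valid; RX samples the data
        rx_fifo.append((word, cycle))  # RX moves the word into its FIFO
        cycle += ack_latency           # ack returns; TX may send the next word
    return rx_fifo

received = simulate_handshake([10, 20, 30], valid_delay=2)
```

In this simplified model each word occupies `valid_delay + 1 + ack_latency` cycles, so a longer valid delay directly lowers the number of words transferred per cycle.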
IV. SIMULATION RESULT
The simulation results presented in this paper are categorized into four parts. The first category is the simulation where the transmitting and receiving nodes have the same working clock frequency, as presented in Section IV-A. The second category is the simulation where the transmitting and receiving nodes have the same working clock frequency but different clock phases, as presented in Section IV-B. The third category is the simulation where the transmitting node has a slower clock frequency than the receiving node, as presented in Section IV-C. The fourth category is the simulation where the transmitting node has a faster clock frequency than the receiving node, as presented in Section IV-D.
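The four categories can be written down as a small configuration table, sketched here in Python. The clock periods and the phase offset are hypothetical placeholder values chosen only to make the four cases distinct; they are not the settings used in the reported simulations. The helper `next_rx_edge` is likewise an illustrative assumption that shows when the receiver clock would first be able to observe an event launched by the transmitter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClockScenario:
    name: str
    tx_period_ps: int
    rx_period_ps: int
    rx_phase_ps: int = 0  # offset of the RX clock relative to the TX clock

# Placeholder values only; the paper does not state the simulated periods.
SCENARIOS = [
    ClockScenario("IV-A same frequency, same phase", 10_000, 10_000),
    ClockScenario("IV-B same frequency, different phase", 10_000, 10_000, rx_phase_ps=2_500),
    ClockScenario("IV-C transmitter slower", 20_000, 10_000),
    ClockScenario("IV-D transmitter faster", 10_000, 20_000),
]

def next_rx_edge(scenario, t_ps):
    """Time of the first RX rising edge at or after time t_ps."""
    if t_ps <= scenario.rx_phase_ps:
        return scenario.rx_phase_ps
    elapsed = t_ps - scenario.rx_phase_ps
    cycles = -(-elapsed // scenario.rx_period_ps)  # ceiling division on integers
    return scenario.rx_phase_ps + cycles * scenario.rx_period_ps
```

For example, an event launched at 10 000 ps is seen immediately in the first scenario but only at 12 500 ps in the second, which illustrates how a phase offset can add waiting time at the receiver.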
In all simulation cases, we measured the number of clock cycles required for each datum to reach the transmitter input terminal and the receiver output terminal, for each setting of the valid flag delay measured in clock cycles. Hence, the simulation results show the data sequence number against the number of clock cycles. The communication performance, or data rate, can be formally modelled as follows:
$$R = \frac{D}{T_C}$$
where $D$ is the number of transmitted data words over the total cycle time $T_C$. The unit of $R$ is thus the number of data words per cycle period.
If $t_k$ is the cycle time at which the $k$-th transmitted data word is detected and $t_{k-1}$ is the cycle time at which the previous, $(k-1)$-th, data word is detected, then the time-dependent data rate $R(t_k)$ at the instant $t_k$ can be expressed as follows:
$$R(t_k) = \frac{1}{t_k - t_{k-1}}$$
A. Simulation with The Same Clock Frequency
Fig. 3 presents the simulation result of the cycle time required for each datum to reach the receiver side when the transmitter and the receiver have the same clock frequency. The figure also shows that the required cycle time increases as the valid flag is delayed by more cycles.
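The two data rate definitions can be made concrete with a short Python sketch. It assumes that the average rate is the number of words divided by the total elapsed cycles, that the instantaneous rate is the reciprocal of the gap between consecutive detections, and that counting starts at cycle 0. The function names and the example detection cycles are hypothetical and are not measurements from the figures.

```python
def average_rate(detect_cycles):
    """Average rate: words detected divided by the total elapsed cycles."""
    return len(detect_cycles) / detect_cycles[-1]

def instantaneous_rates(detect_cycles):
    """Instantaneous rate at each t_k: one word per (t_k - t_{k-1}) cycles."""
    return [
        1.0 / (t_k - t_prev)
        for t_prev, t_k in zip(detect_cycles, detect_cycles[1:])
    ]

# Made-up detection cycles: one word every 16 cycles.
cycles = [16, 32, 48]
rates = instantaneous_rates(cycles)
```

With one word detected every 16 cycles, both definitions give 0.0625 data words per cycle; they differ only when the spacing between words changes over time.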
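Based on the simulation result presented in Fig. 3, for a valid delay of 14 cycles the data rate at the last measured cycle time is 0.0625 data words per cycle. With a clock period of 0.1 ns and a 32-bit data word, the estimated data rate is $\frac{0.0625 \times 32}{0.1\,\text{ns}} = 20$ Gbps. By using a 64-bit data word width, the data rate can approach 40 Gbps. Thus, a high-performance communication interface can be realized by using CMOS technology with a smaller minimum transistor feature/gate size to achieve a higher device clock frequency.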
From Fig. 2 and Fig. 3, we can see that the time to move a datum from the transmitter to the receiver depends heavily not only on the valid delay setting but also on the delay due to valid and ack delivery. The slope of each curve in the figures represents the data rate of the communication.
B. Simulation with Different Clock Phase
When we compare the results from Fig. 4 and Fig. 5 with the previous results from Fig. 2 and Fig. 3, we can see that the time delay of the data transmission is not significantly affected by the phase difference of the clocks. A significant effect occurs only when the phase difference is between 0.85π and 0.95π radians, where the delay time can increase by one clock cycle.
C. Simulation with Transmitter's Slower Clock Frequency
Fig. 6 presents a simulation result of the cycle time required for data to reach the transmitter input terminal when the transmitter's clock frequency is slower than the receiver's clock frequency. Fig. 7 presents a simulation result of the cycle time required for data to reach the receiver output terminal under the same condition. Both figures also show that the required cycle time increases as the valid flag is delayed by more cycles.
D. Simulation with Transmitter's Faster Clock Frequency
Fig. 8 presents a simulation result of the cycle time required for data to reach the transmitter input terminal when the transmitter's clock frequency is faster than the receiver's clock frequency. Fig. 9 presents a simulation result of the cycle time required for data to reach the receiver output terminal under the same condition. Both figures also show that the required cycle time increases as the valid flag is delayed by more cycles.
If we compare the simulation results of Fig. 6 and Fig. 8, we can see that the data latency with the slower transmitter clock frequency is higher than that with the faster transmitter clock frequency.
V. SYNTHESIS RESULT
The logic synthesis of the parallel communication interface presented in this paper targets an FPGA, because FPGAs are easy to configure and accordingly have a low prototyping cost. FPGAs (Field Programmable Gate Arrays) are characterized by reconfigurability, speed, design flexibility, and a high density of logic resources [10], and can meet increasingly complex logic demands [11].
We have synthesized our transmitter and receiver IP (Intellectual Property) cores for a Cyclone III device, part number EP3C16F484C6, an FPGA from Altera. With a 16-bit parallel data interface, the transmitter core requires about 31 logic elements and the receiver core requires 38 logic elements. The synthesis data show that our cores occupy a relatively small logic area. Unfortunately, we cannot compare these synthesis data with other bit-parallel communication interface techniques, because FPGA-based synthesis data are not available for the other parallel interfaces.
VI. CONCLUSIONS AND OUTLOOKS
This paper has presented reconfigurable transmitter and receiver IP cores with a bit-level parallel interface for board-to-board inter-processor communication. The proposed transceiver IP cores allow us to implement bit-level parallel communication through physical links between devices mounted on different boards under the following operating clock conditions.
1) The sender and the receiver device have the same working clock frequency and clock phase.
2) The sender and the receiver device have the same clock frequency but have different clock phase.
3) The sender's clock frequency is lower than the receiver's clock frequency.
4) The sender's clock frequency is higher than the receiver's clock frequency.
Under the above working frequency conditions, the data transmitted by the sender are received correctly by the receiver node. In our HDL-level simulations, we use FIFO buffers with 16 data slots. Therefore, the measurements, particularly on the receiver side, are made only up to the sixteenth datum. In the future, we will analyse the impact of the FIFO buffer depth on the data communication performance, as well as measure the performance for higher volumes of test data.
We have also estimated the performance of the proposed communication interface. For example, for a valid delay of 14 cycles, as shown in the simulation result with the same clock frequency, the data rate at the last measured cycle time is 0.0625 data words per cycle. When the transmitter and receiver nodes work with a 10 GHz clock frequency, i.e. a 0.1 ns clock period, and the data width is 32 bits, the estimated data rate for that condition is $\frac{0.0625 \times 32}{0.1\,\text{ns}} = 20$ Gbps. By using a 64-bit data word width, the data rate can approach 40 Gbps. Therefore, we can potentially achieve a high-performance communication interface by using CMOS technology with a smaller minimum transistor feature/gate size.
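The estimate above is a unit conversion from data words per cycle to bits per second, which the following Python sketch reproduces. The function name `estimated_bit_rate_bps` is hypothetical; the inputs are the 0.0625 words per cycle, the 32-bit and 64-bit word widths, and the 0.1 ns clock period quoted in the text.

```python
def estimated_bit_rate_bps(words_per_cycle, word_width_bits, clock_period_s):
    """Convert a per-cycle word rate into bits per second."""
    bits_per_cycle = words_per_cycle * word_width_bits
    return bits_per_cycle / clock_period_s

rate_32 = estimated_bit_rate_bps(0.0625, 32, 0.1e-9)  # about 20e9 bit/s
rate_64 = estimated_bit_rate_bps(0.0625, 64, 0.1e-9)  # about 40e9 bit/s
```

Because the result scales inversely with the clock period, the estimate is very sensitive to the assumed clock: a period 100 times longer gives a rate 100 times lower for the same words-per-cycle figure.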
We have not yet implemented our design in a CMOS standard-cell technology. However, the soft IP cores of the transmitter and receiver have been synthesized for a Cyclone III FPGA device from Altera. The total number of logic elements used on the FPGA device for both cores is about 69, i.e. 31 for the transmitter core and 38 for the receiver core. This number is quite small, and the design will potentially occupy a small logic area when we implement it using CMOS standard cells in the future.
