System-on-Chip (SoC) design is facing increasing difficulties in its integration, global wiring delay and power dissipation. Interconnection network technology has the advantage over conventional bus technology in its scalability; on the other hand, asynchronous circuit design technology may offer power saving and tackle the clock-skew problem. The combination of these two technologies could therefore be an optimal solution for the interconnection of SoC. In this paper we focus on the implementation of a packet-switch with asynchronous technology. The results of experiments run to evaluate several aspects of the packet-switch implementation are presented.
Introduction
As technology scales, IC developers face a variety of challenges. System integration is one of them. Although buses are still the dominant approach, interconnection networks have received more and more attention as an alternative integration solution [2, 7, 8]. One advantage of interconnection networks over buses is scalability in the throughput, latency, cost and integration of the system.
Wire delay is another challenge. The increase in the delay of a global wire, almost doubling every year, affects the signalling, timing, and architecture of digital systems. This makes it extremely difficult to distribute a global clock with low skew [2]. One solution is to devote a large quantity of interconnect metal to building a low-impedance clock grid, or to wire using new materials. However, this solution is anticipated to be effective for only one or two more generations [1]. A more radical approach is to eliminate global synchrony. One can either divide the chip into separate clock domains (known as Globally Asynchronous Locally Synchronous) or, more aggressively, fully employ asynchronous circuit design technology, such as [6].
Power dissipation has become another critical metric. The growing market of mobile, battery-powered electronic systems fuels the demand for ICs with low power dissipation. Unfortunately, power dissipation in real-life ICs does not follow the descending trend in semiconductor technology [3]. Including asynchronous circuits in a complex VLSI design can help reduce power dissipation [9]: unlike a synchronous system, charging and discharging (consuming power) in asynchronous circuits takes place only when a circuit is in operation.
Having seen the challenges of SoC design and the advantages of interconnection-network technology and asynchronous circuit design technology, one question may arise: could the combination of these two technologies provide a solution to the interconnection of SoC? This paper aims to address this question and to consider the feasibility of building interconnection network components using asynchronous methods. In particular, the asynchronous design of a packet-switch is examined. The rest of the paper is organized as follows: in Section 2, the architecture of a packet-switch is presented; the implementation detail of the packet-switch in asynchronous technology is presented in Section 3; the simulation results are presented in Section 4; finally, the conclusions are drawn in Section 5.
Architecture of packet-switch
An output-buffering packet-switch [11] is proposed for this study. For ease of our implementation, the packet-
Figure 1 A 2by2 packet-switch
Both input and output blocks comprise a control block and a data path. The input data path can buffer one flit (32 bits wide) at a time; the output data path, as buffer memory, is the main storage element of the packet-switch.
Data transfers between two circuits are based on a point-to-point flow control protocol involving requests, which initialise each transfer, and acknowledgements, which signal the completion of a transfer. Each packet consists of two parts: header and payload. The header is one flit long, is located at the beginning of each packet, and contains all output ports the packet will pass through. The rest of the packet is payload, containing only data.
Packets at each packet-switch are processed in a pipelined fashion (three stages), as shown in the corresponding figure. Incoming flits from previous packet-switches or the source are buffered at input blocks as they arrive. If a flit is the
In a 2by2 packet-switch, routing decisions for a packet can be made using just one bit of information. Here, we assume that the routing information bit (RIB) for each packet-switch is always the Least Significant Bit (LSB) of each header. This determines the output block with which to communicate.
Making routing decisions is achieved by means of shifting and buffering operations at input data paths. As shown in the figure, the input data path is a row of DFF's (bit0 to bit32), and an incoming flit is written either to bit0 through bit31 or to bit1 through bit32. The former takes place when the incoming flit is the header and the latter when it is part of the payload. Multiplexers are deployed in front of the DFF's to direct the incoming flit. As a header is buffered at the input data path, the Most Significant Bit (MSB) is filled with "0" and bit1 becomes the new LSB, which, together with the rest of the header, is transmitted to the next stage. Both header and payload are read from bit1 to bit32 of the input data path.
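The header handling described above can be illustrated with a short behavioural sketch. It is not the circuit itself: the function names make_header and route_header are hypothetical, and the assumption that the routing bit for the first hop sits in the LSB follows the RIB convention stated earlier. The sketch models the net effect of the shift: the LSB is consumed as the routing decision, bit1 becomes the new LSB, and the MSB is filled with 0.

```python
FLIT_BITS = 32                      # flit width given in the text
FLIT_MASK = (1 << FLIT_BITS) - 1


def make_header(ports):
    """Pack one routing bit per hop into a header flit.

    Assumption: the bit for the first packet-switch is placed in the LSB,
    the bit for the second in bit1, and so on.
    """
    if len(ports) > FLIT_BITS:
        raise ValueError("too many hops for a one-flit header")
    header = 0
    for hop, port in enumerate(ports):
        if port not in (0, 1):
            raise ValueError("a 2by2 switch has output ports 0 and 1 only")
        header |= port << hop
    return header


def route_header(header):
    """Return (output_port, forwarded_header) for one 2by2 packet-switch.

    The LSB selects the output block; the remaining bits move down by one
    position and the vacated MSB is filled with 0.
    """
    output_port = header & 1
    forwarded = (header & FLIT_MASK) >> 1
    return output_port, forwarded


if __name__ == "__main__":
    header = make_header([1, 0, 1])   # hypothetical three-hop route
    for stage in range(3):
        port, header = route_header(header)
        print(f"stage {stage}: output block {port}")
```

Payload flits carry no routing information, so in this model they would pass through unchanged.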
Memory Arrangement
Memory arrangement is mainly conducted by output blocks. The overview of memory arrangement is illustrated in Fig.4.
Processing (each flit of) payload in an input block is conducted in two steps, i.e., buffering it and then transmitting it to the same output block as its header. The input control block interacts with the sender using the handshaking pair P_r/P_a. P_r rises as one flit of payload is sent. The input control logic buffers the flit at the input data path by raising Bufp_r as soon as P_r goes high. P_a is driven high as the flit is latched at the input data path, indicated by Bufp_a going high.
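The order of events in this buffering step can be sketched as a small behavioural model. This is only an illustration under stated assumptions: the function name buffer_payload_flit is hypothetical, the signal names are written as P_r, P_a, Bufp_r and Bufp_a, and only the rising transitions mentioned in the text are modelled (the return-to-zero phase is not covered).

```python
def buffer_payload_flit(flit):
    """Replay the rising-edge sequence for buffering one payload flit.

    Returns (latched_flit, trace), where trace lists the signal
    transitions in the order described in the text.
    """
    signals = {"P_r": 0, "Bufp_r": 0, "Bufp_a": 0, "P_a": 0}
    trace = []

    def rise(name):
        # A signal may only rise once in this half of the handshake.
        assert signals[name] == 0, f"{name} is already high"
        signals[name] = 1
        trace.append(name + "+")

    rise("P_r")        # sender requests: one flit of payload is sent
    rise("Bufp_r")     # input control asks the data path to buffer it
    latched = flit     # the input data path latches the flit
    rise("Bufp_a")     # the data path reports that the flit is latched
    rise("P_a")        # input control acknowledges the sender
    return latched, trace


if __name__ == "__main__":
    flit, trace = buffer_payload_flit(0xCAFE)
    print(hex(flit), " -> ".join(trace))
```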
Output control logic
Memory arrangement in output control blocks is described by an STG presented in Fig.7.
Experimental results
Simulation environment
To evaluate the asynchronous implementation, a synchronous packet-switch was also implemented based on the same architecture presented in Section 2.
Asynchronous control circuits in this paper were synthesized by Petrify with 0.5 µm CMOS technology, and synchronous control circuits were synthesized by SIS. Both implementations were evaluated in MicroSim Design Centre. The minimum clock period, 6 ns, was determined by the critical path and obtained from the PSPICE simulation.
The base system used for the simulation was a k-stage butterfly network [12]. The packet size in the evaluation was fixed. For convenience, the interface between a host and a network was viewed as contributing the same routing delay as a packet-switch [5]. Fig.9 shows the latency of routing an 8-flit packet through an empty 2-stage network, as well as the contributions of its header and payload to the overall latency.
Simulation results
The impact of network scale on performance is illustrated in Fig.11. The result shows that the performance of the asynchronous networks caught up with that of the synchronous networks after scaling up to 3 stages. This is because, in an unloaded network, where the delay of a header at a packet-switch is always greater than the delay of payload, the latency of routing a packet is the time for the header to establish the route from its source to its destination plus the time to load the payload from the final-stage packet-switch to the destination host. The latter is determined by the packet size, the processing delay at the final-stage packet-switch, and the pipeline efficiency; the former is determined by the distance between the source and destination and the delay of the header at each packet-switch. When the packet size was fixed, as the network scale (distance) increased, the latency of the header began to dominate, and the routing latency of a packet in the asynchronous implementation improved on that of the synchronous implementation.
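The decomposition of latency in an unloaded network can be written down as a rough first-order model. This is a simplification for illustration, not the simulation used in the paper: the function names are hypothetical, the per-hop header delay and per-flit payload delay are treated as constants, pipeline efficiency is folded into the per-flit payload delay, and a k-stage network is assumed to contribute k + 1 header delays (k packet-switches plus one host interface counted as a switch).

```python
def packet_latency(stages, flits, t_header, t_payload):
    """First-order latency of one packet in an unloaded network.

    stages    -- number of packet-switch stages (k)
    flits     -- packet length in flits, including the header
    t_header  -- header delay per hop (any time unit)
    t_payload -- effective delay per payload flit at the final stage
    """
    if stages < 1 or flits < 1:
        raise ValueError("stages and flits must be at least 1")
    hops = stages + 1                       # assumption: one host interface
    route_setup = hops * t_header           # header establishes the route
    payload_load = (flits - 1) * t_payload  # payload drains to the host
    return route_setup + payload_load


def header_share(stages, flits, t_header, t_payload):
    """Fraction of the total latency contributed by the header."""
    total = packet_latency(stages, flits, t_header, t_payload)
    return (stages + 1) * t_header / total


if __name__ == "__main__":
    # Placeholder delays chosen only to exercise the model.
    for k in (1, 2, 3, 4):
        print(k, packet_latency(k, 8, 10, 5), round(header_share(k, 8, 10, 5), 2))
```

Under this model the header share grows with the number of stages and shrinks with the packet size, which matches the qualitative argument above: an implementation with a shorter header delay gains as the network grows, and one with a shorter payload delay gains as packets grow.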
The impact of packet size on the performance of networks is presented in Fig.12, where the packet size was varied from 8 flits to 32 flits.
Gate-counts consideration
The gate counts of synchronous and asynchronous control logic are compared in Table 1, which shows that the asynchronous packet-switch is similar in size to the synchronous one in output control logic; however, it costs 100 more (equivalent) gates than the synchronous one in input control logic. This is because, in the asynchronous input blocks, the processing of header and payload had to be described using two separate STG's due to a limitation of the asynchronous synthesis tool.
Summaries and conclusions
In this paper, we explored the feasibility of an on-chip network using asynchronous circuit design technology as a solution to system integration. A packet-switch was proposed. The asynchronous implementation was presented and compared with its synchronous counterpart. The simulation results suggest that asynchronous networks could outperform synchronous networks as the network scale increases, but underperform as the packet size increases. The associated reasons were also explained.
Acknowledgement
The first author would like to thank the UK Overseas Research Students Awards Scheme (ORS) and London South Bank University for their financial support, and especially Professor Mark Josephs for his inspiring supervision.
