Asynchronous packet-switch for SoC by Xu, J. & Sotudeh, R.
Asynchronous Packet-Switch for SoC 
Jun Xu, Reza Sotudeh 
University of Hertfordshire 
x.jun@herts.ac.uk; r.sotudeh@herts.ac.uk
Abstract: 
System-on-Chip (SoC) design is facing increasing
difficulties in its integration, global wiring delay and
power dissipation. Interconnection network technology
has the advantage over the conventional bus technology in
its scalability; on the other hand, asynchronous circuit
design technology may offer power saving and tackle the
clock-skew problem. The combination of these two
technologies therefore could be an optimal solution for the
interconnection of SoC. In this paper we focus on the
implementation of a packet-switch with asynchronous
technology. The results of experiments run to evaluate
several aspects of the packet-switch implementation are
presented.
1 Introduction 
As technology scales, a variety of challenges have been
presented to IC developers. System integration is one
among them. Although buses are still the dominant
approach so far, interconnection networks have
received more and more attention as an alternative
integration solution [2, 7, 8]. One advantage of
interconnection networks over buses is the scalability in
throughput, latency, cost and integration of the system.
Wire delay is another challenge. The increase in the
delay of a global wire, almost doubling every year, affects
the signalling, timing, and architecture of digital systems.
This makes it extremely difficult to distribute a global
clock with low skew [2]. One solution is to devote a large
quantity of interconnect metal to building a low-impedance
clock grid or wire using new materials.
However, this solution is anticipated to only be effective
for one or two more generations [1]. A more radical
approach is to eliminate global synchrony. One can either
divide the chip into separate clock domains (known as
Globally Asynchronous Locally Synchronous), or more
aggressively, fully employ asynchronous circuit design
technology, such as [6].
Power dissipation has become another critical metric.
The growing market of mobile, battery-powered electronic
systems fuels the demands for ICs with low power
dissipation. Unfortunately, power dissipation in real-life
ICs does not follow the descending trend in semiconductor
technology [3]. Including asynchronous circuits into a
complex VLSI design can help reduce power dissipation
[9]: unlike a synchronous system, charging and
discharging (consuming power) in asynchronous circuits
takes place only when a circuit is in operation.
Having seen the challenges of SoC design and the
advantages of interconnection-network technology and
asynchronous circuit design technology, one question may
arise: could the combination of these two technologies
provide a solution to the interconnection of SoC? This
paper aims to address this question and to consider the
feasibility of interconnection network components using
asynchronous methods. In particular, the asynchronous
design of a packet-switch is examined. The rest of the paper
is organized as follows: in Section 2, the architecture of a
packet-switch is presented; the implementation detail of
the packet-switch by asynchronous technology is
presented in Section 3; the simulation results are presented
in Section 4; finally the conclusions are drawn in Section 5.
2 Architecture of packet-switch 
An output-buffering packet-switch [11] is proposed for
this study. For ease of implementation, the packet-switch
consists of only two input/output ports, as shown in
Fig.1. The packet-switch consists of four blocks: Output
Block1, Output Block0, Input Block1 and Input Block0.
Both input and output blocks comprise a control
block and a data path. The input data path can buffer one flit
(32-bit wide) at a time; the output data path, as buffer
memory, is the main storage element of the packet-switch.
Figure 1 A 2by2 packet-switch
Data transfers between two circuits are based on a
point-to-point flow control protocol involving requests,
which initiate each transfer, and acknowledgements,
which signal the completion of a transfer.
Each packet consists of two parts: header and payload.
The header is one flit long, located at the beginning of each
packet, and contains all output ports a packet will pass
through. The rest of a packet is payload, containing only
data.
Packets at each packet-switch are processed in a
pipelined fashion (three stages) as shown in Fig.2.
Incoming flits from previous packet-switches/source are
buffered at input blocks as they arrive. If a flit is the
header of a packet, its destination is identified, and
then a request is sent to the corresponding output block to
arrange memory for the packet. If the flit belongs to
payload, it is then transmitted to the same memory as its
header. In this packet-switch model, input blocks are
involved in stages 1(a) and 1(b), and output blocks are
involved in stages 1(c), 2 and 3.
Figure 2 Pipelined processing model
0-7803-8656-6/04/$20.00 ©2004 IEEE
2.1 Input data path 
[Text displaced here from Section 3.2 by the figure layout: "... respectively. H_r and H_a are the handshaking pair interacting with the sender for header-transmission. The ..."]
In a 2by2 packet-switch, routing decisions for a packet 
can be made using just one bit of information. Here, we 
assume that the routing information bit (RIB) for each
packet-switch is always the Least Significant Bit (LSB) of 
each header. This determines the output block with which 
to communicate. 
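As an illustration only, the per-hop use of the RIB can be sketched in Python. The function names `encode_route` and `route_one_hop` are hypothetical, and the sketch assumes that the header carries one routing bit per hop, packed with the first hop in the LSB; this is one reading of the text, not a detail the paper states.

```python
def encode_route(ports):
    """Pack output-port choices (0 or 1) into a header, first hop in the LSB.

    Assumption: one routing bit per hop and nothing else in the header.
    """
    header = 0
    for hop, port in enumerate(ports):
        if port not in (0, 1):
            raise ValueError("a 2by2 switch has only output ports 0 and 1")
        header |= port << hop
    return header


def route_one_hop(header):
    """Return (output_block, header_for_next_switch) for one packet-switch."""
    rib = header & 1           # routing information bit = LSB of the header
    next_header = header >> 1  # used bit stripped; the vacated MSB becomes 0
    return rib, next_header
```

For a three-hop route through output blocks 1, 0, 1, `encode_route([1, 0, 1])` packs the choices so that each switch in turn sees its own bit in the LSB.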
Making routing decisions is achieved by means of
shifting and buffering operations at input data paths. As
shown in Fig.3, the input data path includes 33-bit D-type
flip-flops (DFF's). The extra bit is used to support the
shift operation which removes the LSB of each header,
thus stripping off the used bit of routing information. An
incoming flit is transmitted through 32 DFF's, outputting
either from bit0 to bit31 or from bit1 to bit32. The former
event takes place when the incoming flit is the header and
the latter event takes place when it is part of payload.
Multiplexers are deployed in front of the DFF's to direct
the incoming flit. As a header is buffered at input data
paths, the Most Significant Bit (MSB) is filled with "0",
bit1 turning into the new LSB which together with the rest
of header will be transmitted to its next stage. Both
header and payload are read from bit1 to bit32 of input
data paths.
Figure 3 Input data path
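The shift-by-placement idea can be modelled behaviourally as follows. This is a sketch under stated assumptions, not the circuit itself: the 33 DFFs are represented by one Python integer, and the names `latch_flit` and `read_flit` are hypothetical.

```python
FLIT_BITS = 32
FLIT_MASK = (1 << FLIT_BITS) - 1


def latch_flit(flit, is_header):
    """Return the 33-bit register contents after latching a 32-bit flit."""
    flit &= FLIT_MASK
    if is_header:
        # Header lands in bit0..bit31; bit32 stays 0 and becomes the new MSB.
        return flit
    # Payload lands in bit1..bit32, so it is read back unchanged.
    return flit << 1


def read_flit(register):
    """Read bit1..bit32, which is what the next stage receives."""
    return (register >> 1) & FLIT_MASK
```

Reading a latched header therefore yields the header shifted right by one bit, while reading a latched payload flit yields the flit unchanged.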
2.2 Memory Arrangement 
Memory arrangement is mainly conducted by output
blocks. The overview of memory arrangement is
illustrated in Fig.4, where we assume that the memory
consists of two memory-banks.
Each memory-bank has two 2-to-1 multiplexers in front
of it, one for data, and the other for control signals. Since
each output block communicates with input blocks through
the multiplexers, signals from either of the input blocks
can be directed to any memory-bank. The output control
block forms connections between input blocks and
memory-banks by setting up the associated 2-to-1
multiplexers. A counter, implemented in the output control
block, provides memory-bank addresses and determines
which pair of 2-to-1 multiplexers is supplying data.
Figure 4 Memory-arrangement at Output Block1
To avoid collision, setting up 2-to-1 multiplexers for
different packets must be mutually exclusive: only one
action is allowed to progress at a time; therefore, an arbiter
must be employed. The arbiter allows one request to pass
through at a time; the one that arrives first is selected.
When two requests arrive simultaneously, it arbitrarily
selects one to go through.
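The interplay of arbiter, counter and multiplexers described above can be sketched as a small behavioural model. Everything here is an assumption made for illustration: the class name `OutputBlock`, the representation of requests as arrival times, the random tie-break, and the wrap-around of the counter are not taken from the paper.

```python
import random


class OutputBlock:
    """Toy model of memory arrangement in one output block (two banks assumed)."""

    def __init__(self, num_banks=2, seed=None):
        self.num_banks = num_banks
        self.counter = 0                      # next memory-bank address
        self.mux_select = [None] * num_banks  # input block feeding each bank
        self._rng = random.Random(seed)

    def arbitrate(self, requests):
        """requests maps input block -> arrival time; exactly one is granted."""
        earliest = min(requests.values())
        tied = sorted(b for b, t in requests.items() if t == earliest)
        return self._rng.choice(tied)         # arbitrary pick on a tie

    def arrange(self, requests):
        """Grant one request and point the counter's bank at that input block."""
        winner = self.arbitrate(requests)
        bank = self.counter
        self.mux_select[bank] = winner
        # Hypothetical policy: the counter simply wraps around the banks.
        self.counter = (self.counter + 1) % self.num_banks
        return winner, bank
```

Because `arrange` handles one request per call, two multiplexer set-ups never proceed at the same time, which is the mutual exclusion the arbiter is there to guarantee.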
3 Asynchronous implementation 
3.1 Asynchronous design methodologies 
The asynchronous circuits in this paper are based on a
Speed-Independent model, where delays on wires are
regarded as zero or negligible while delays on gates are
unbounded [10]. Data encoding is based on a bundled-data
protocol. When the data value is n bits wide, n+2
wires, i.e., n bits for data, 1 bit for request and 1 bit for
acknowledgement, are required to transfer each data item.
Encoding for handshaking signals is based on a 4-phase
level signalling protocol (return-to-zero). After each
transfer, the channel signalling system returns to its
initial state before the next transfer can start.
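To make the signalling concrete, the following sketch steps through one 4-phase bundled-data transfer and computes the wire count. The function names are hypothetical, and the point at which the data stops being valid (here, when the request falls) is an assumption; the paper does not say which data-validity scheme is used.

```python
def wires_needed(n_bits):
    """Bundled data: n data wires plus one request and one acknowledge wire."""
    return n_bits + 2


def four_phase_transfer(data):
    """Yield (req, ack, data_on_wires) after each of the four phases."""
    req, ack = 0, 0
    req = 1                  # phase 1: sender raises request, data is stable
    yield (req, ack, data)
    ack = 1                  # phase 2: receiver latches data, raises acknowledge
    yield (req, ack, data)
    req = 0                  # phase 3: sender withdraws request
    yield (req, ack, None)   # assumption: data no longer guaranteed valid
    ack = 0                  # phase 4: receiver withdraws acknowledge
    yield (req, ack, None)   # channel is back in its initial state
```

A 32-bit flit channel therefore needs `wires_needed(32)` wires, and every flit costs four signal transitions, two of which only restore the idle state.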
3.2 Input control logic
Shift_r is the shifting request signal, activated when an
incoming flit is the header, and is released after the
header is latched at the input data path. The shifted header
is latched as Bufh_r goes [...] corresponding to [...] and
[...] respectively. The falling transition of Shift_r
indicates that the routing information has stabilized in the
input block.
Figure 5 STG for header-processing at input blocks
Figure 6 STG for payload-processing at input blocks
Processing (each flit of) payload in an input block is
conducted in two steps, i.e., buffering and then
transmitting it to the same output block as its header. The
input control block interacts with the sender using the
handshaking pair P_r/P_a. P_r rises as one flit of payload
is sent. The input control logic buffers the flit at the input
data path by raising Bufp_r as soon as P_r goes high. P_a
is driven high as the flit is latched at the input data path,
indicated by Bufp_a going high.
3.3 Output control logic 
Memory-arrangement in output control blocks is
described by an STG presented in Fig.7. Ba_r00 and
Ba_r10 are the request signals asserted by Input Block0
and Input Block1 for memory-arrangement, and Ba_a00
and Ba_a10 are the corresponding acknowledge signals,
respectively. The associated 2-to-1 multiplexers are set up
using the handshaking pair St_r0 and St_a0 for Input
Block0, and St_r1 and St_a1 for Input Block1,
respectively. The counter, which provides memory-bank
addresses, is driven by the handshaking pair Counter_r and
Counter_a. The output control block communicates with
the arbiter [13] using the handshaking pair Arbiter_r0 and
Arbiter_a0 for packets from Input Block0, and Arbiter_r1
and Arbiter_a1 for packets from Input Block1,
respectively.
Figure 7 STG for memory-arrangement at Output Block0
Transmitting packets, stored in memory, to their next
switches or destination hosts is described in Fig.8. Packets
are read out of memory flit by flit using the handshaking
pair, Datao_r and Datao_a, and are forwarded to their
next packet-switches or destination hosts using the
handshaking signals, Inout_r and Inout_a. Note that
Inout_r and Inout_a are the handshaking pair employed
between packet-switches or between a host and a
packet-switch. Signals on Inout_r are passed onto H_r
(refer to Section 3.2) when the incoming flit is a header,
and are passed onto P_r when the incoming flit is part of
the payload. Correspondingly, signals on H_a and P_a are
multiplexed onto Inout_a.
Figure 8 STG for packet-output
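The steering of Inout_r onto H_r or P_r, and of the acknowledges back onto Inout_a, can be written as two small selection functions. This is only a sketch: the `is_header` flag is a hypothetical input, since the text does not say how a switch distinguishes a header flit from a payload flit.

```python
def steer_request(inout_r, is_header):
    """Map the level on Inout_r onto the pair (H_r, P_r)."""
    return (inout_r, 0) if is_header else (0, inout_r)


def merge_acknowledge(h_a, p_a, is_header):
    """Select the acknowledge that is returned on Inout_a."""
    return h_a if is_header else p_a
```

Only one of H_r and P_r can be active for a given flit, so the sender sees a single request/acknowledge pair regardless of the flit type.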
4 Experimental results 
4.1 Simulation environment 
To evaluate the asynchronous implementation, a
synchronous packet-switch was also implemented based
on the same architecture presented in Section 2.
Asynchronous control circuits in this paper were
synthesized by Petrify with 0.5µm CMOS technology, and
synchronous control circuits were synthesized by SIS.
Both implementations were evaluated in MicroSim Design
Centre. The minimum clock period, 6ns, was determined
by the critical path and obtained from the PSPICE
simulation.
The base system used for the simulation was a k-stage
butterfly network [12]. The packet size in the evaluation
was fixed. For convenience, the interface between a
host and a network was viewed as contributing the same
routing delay as a packet-switch [5].
4.2 Simulation results 
Fig.9 shows the latency of routing an 8-flit long
packet through an empty 2-stage network as well as the
contributions of its header and payload to the overall
latency. Fig.10 further shows the performance of
each packet-switch in the network in processing
each individual flit. The simulation results indicate that
although the asynchronous packet-switches outperformed
the synchronous packet-switches in processing each
individual flit, the latency of routing the whole packet in
the asynchronous network was greater than in the
synchronous network.
Figure 9 Latency of transmitting an 8-flit packet through a network
Figure 10 Delay at each [...]
The synchronous packet-switches lost to the
asynchronous ones in processing each individual flit
mainly because of the redundant time in each clock cycle.
The 6ns clock period was dictated by the slowest path as
described in Section 4.1. The optimal clock period for these
two operations based on our experiment was approximately
4ns. By contrast, the asynchronous circuits progressed
as soon as their environment responded.
When flits are transmitted consecutively in a pipeline
style, however, the routing time of each flit can be
overlapped by its neighbouring flits. The more they
overlap each other, the less routing latency a packet has.
For an asynchronous circuit, ruled by a 4-phase level
signalling protocol, recovery time was required to return
the asynchronous circuit to its original state before another
transfer could start. Our simulation result shows that
the recovery operations caused the asynchronous
pipeline to be less interleaved: only 65.6% of
payload-routing time was overlapped, compared to the
synchronous network, where an overlap rate of 87.5% was
achieved.
Figure 11 Latency as network scale increases
The impact of network scale on its performance is
illustrated in Fig.11. The result shows that the performance
of the asynchronous networks caught up with the synchronous
networks after scaling up to 3 stages. This is because in an
unloaded network, where the delay of a header at
packet-switches is always greater than the delay of payload,
the latency of routing a packet is the sum of the time for a
header to establish the route from its source to its
destination and the time to load its payload from its
final stage packet-switch to its destination host. The latter
is determined by the packet-size, the processing delay at
its final stage packet-switch, and the pipeline efficiency;
the former is determined by the distance between the
source and destination and the delay of header at each
packet-switch. When the packet size was fixed, as the
network scale (distance) increased, the latency of header
began to dominate and the routing latency of a packet in
an asynchronous implementation improved on that of a
synchronous implementation.
The impact of packet-size on the performance of
networks is presented in Fig.12, where the packet size was
varied from 8 flits to 32 flits. The simulation result
indicates that increasing packet size can cause longer
latency for a packet routing in an asynchronous network
than in a synchronous one.
Figure 12 Latency as packet-size increases
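The two trends reported in this section can be reproduced qualitatively with a very simple latency model. The formula below is an assumed reading of the explanation in the text (route set-up by the header plus pipelined payload loading), not a model given in the paper. The overlap rates are the ones reported above; the per-flit delays are hypothetical placeholders chosen only so that the trends are visible, and the crossover point they produce is not meant to match Fig.11.

```python
def packet_latency(hops, flits, header_delay, payload_delay, overlap):
    """Header sets up the route hop by hop; payload then streams from the last hop."""
    route_setup = hops * header_delay
    # Each payload flit adds only the part of its routing time that is
    # not overlapped by its neighbours.
    payload_load = (flits - 1) * payload_delay * (1.0 - overlap)
    return route_setup + payload_load


# Overlap rates come from the text; delays (ns) are hypothetical placeholders,
# NOT measurements from the paper.
ASYNC = dict(header_delay=10.0, payload_delay=8.0, overlap=0.656)
SYNC = dict(header_delay=12.0, payload_delay=12.0, overlap=0.875)
```

With these placeholder values, `packet_latency(2, 8, **SYNC)` is lower than the asynchronous figure for a short route, the ordering reverses once `hops` is large enough for the header term to dominate, and raising `flits` widens the gap in favour of the synchronous network.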
4.3 Gate-counts consideration 
The gate-counts of synchronous and asynchronous
control logic are compared in Table 1. The data paths in
both implementations share similar structures, and
therefore they are not considered in this paper.
Table 1 Gate-counts of asynchronous and synchronous
circuits (columns: block name; gates and equivalent gates
for the asynchronous implementation; gates and equivalent
gates for the synchronous implementation)
Table 1 shows that the asynchronous packet-switch has
a similar size to the synchronous one in output control
logic; however, it costs 100 more (equivalent) gates than the
synchronous one in input control logic. This is because in
the asynchronous input blocks, processing header and
payload had to be described using two separate STG's due to
the limitation of the asynchronous synthesis tool.
5 Summaries and conclusions 
In this paper, we explored the feasibility of an on-chip
network using asynchronous circuit design technology as a
solution for system integration. A packet-switch was
proposed. The asynchronous implementation was
presented and compared with its synchronous counterpart.
The simulation results suggest that asynchronous networks
could outperform synchronous networks as the network
scale increases, while underperforming as the packet size
increases. The associated reasons were also explained.
6 Acknowledgement 
The first author would like to thank the UK Overseas Research
Students Awards Scheme (ORS) and London South Bank
University for their financial support and especially Professor
Mark Josephs for his inspiring supervision.
7 References 
[1] M.T. Bohr, Interconnect scaling-the real limiter to high
performance ULSI, Proc. Int. Electron Devices Meeting, Dec.
1995, pp. 241-244.
[2] L. Benini and G. De Micheli, Networks on chips: a new SoC
paradigm, Computer, Volume: 35, Issue: 1, Page(s): 70-78, Jan
2002.
[3] L. Benini, G. De Micheli and E. Macii, Designing low-power
circuits: practical recipes, IEEE Circuits and Systems Magazine,
Vol: 1, Issue: 1, Page(s): 6-25, 2001.
[4] T.-A. Chu, Synthesis of Self-timed VLSI Circuits from
Graph-theoretic Specifications, PhD Thesis, MIT, June 1987.
[5] D.E. Culler and J.P. Singh, Parallel Computer Architecture: a
hardware/software approach, Morgan Kaufmann Publishers, Inc.,
1999, USA.
[6] J.D. Garside, W.J. Bainbridge, A. Bardsley, D.M. Clark, D.A.
Edwards, S.B. Furber, J. Liu, D.W. Lloyd, S. Mohammadi, J.S.
Pepper, O. Petlin, S. Temple and J.V. Woods, "AMULET3i-an
Asynchronous System-on-Chip",
http://www.cs.man.ac.uk/amulet/, 2001.
[7] P. Guerrier and A. Greiner, A generic architecture for on-chip
packet-switched interconnections, Design, Automation and
Test in Europe Conference and Exhibition 2000 Proceedings,
Page(s): 250-256, 2000.
[8] K. Goossens, E. Rijpkema, P. Wielage, A. Peeters and J. van
Meerbergen, Philips Research, NL, Networks on Silicon: The
Next Design Paradigm for System on Silicon, Design
Automation & Test in Europe (DATE) 2002.
[9] Scott Hauck, Asynchronous Design Methodologies: An
Overview, Proceedings of the IEEE, Vol.83, No.1, pp. 69-93,
January 1995.
[10] M.B. Josephs, S.M. Nowick and C.H. van Berkel,
Modelling and Design of Asynchronous circuits, Proceedings of
the IEEE on Asynchronous circuits and systems, v. 87:2, Feb.
1999.
[11] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, Input versus
output queuing on a space division packet switch, IEEE
Transactions on Communications, COM-35 (12): 1347-1354,
December 1987.
[12] F. Thomson Leighton, Introduction to Parallel algorithms
and architectures: arrays, trees, hypercubes, Morgan Kaufmann
Publisher, San Mateo, California, 1992.
[13] G. Moore, VLSI: Some fundamental challenges, IEEE
Spectrum, Vol. 16, p. 30, 1979.
[13] C. L. Seitz, System Timing. In C.A. Mead and L.A.
Conway, editors, Introduction to VLSI Systems, chapter 7.
Addison-Wesley, 1980.
