Memory management in output-buffering packet-switch design by Xu, J. & Sotudeh, R.
Memory Management in Output-Buffering Packet-Switch Design
Jun Xu, Reza Sotudeh
School of Electrical, Communication and Electronic Engineering 
University of Hertfordshire, UK 
x.jun@herts.ac.uk, r.sotudeh@herts.ac.uk
Abstract-The most pressing problem in the design of a
synchronous buffer-memory system in high-speed packet switches
is memory bandwidth. If there are multiple packets heading for
the same buffer while the buffer cannot consume them
simultaneously, some of the packets will have to be dropped. Two
approaches are explored to resolve this problem in this paper.
One is via improving the buffer-memory architecture, and the
other is via replacing clock-based synchronous technology with
handshaking-based asynchronous technology. Both approaches
are implemented and the results of experiments run to evaluate
several aspects of the implementations are compared.
I. INTRODUCTION
Input buffering, centralised (shared) buffering and output
buffering are the three best known buffering strategies in
packet-switch design. Under input buffering, packets are stored
in an independent buffer associated with the input port at
which they arrive. Under centralised buffering, packets are
stored in a centralised memory shared by all input ports and
output ports. Under output buffering, packets are stored in an
independent output buffer dedicated to the output port that is
their destination.
In a conventional input-buffering packet-switch design,
each buffer is implemented as a single FIFO queue and only
the packet at the head of such a queue can be transmitted. If
the packet at the head of the queue is blocked, all packets
behind it are blocked too, whatever their destinations are,
which is known as the Head Of Line (HOL) problem. Input
buffering packet switches with the HOL problem can only
achieve a maximum throughput of around 60% [1].
Output buffering and centralised buffering packet-switches
can achieve throughput of around 80%. However,
blockage/data loss can still occur when packets from different
input ports head for the same buffer. If the buffer cannot
consume all the incoming packets at the same time, some of
the packets will have to be dropped. The conventional solution
to this problem is to increase the memory access speed or
to widen the data path. However, it can be observed that in recent
years, the bandwidth of links used for interconnection has
continued to increase [2]. Further extending the bandwidth of
the memory has therefore become increasingly impractical.
In this paper, two approaches are presented to eliminate the
bandwidth problem for output buffering packet-switches. One
is via adding pipelines prior to each output buffer, with each
pipeline dedicated to one input port. When multiple
packets head to the same output buffer simultaneously, the
contention is resolved while they are rippling through the
pipelines. To avoid data loss, the minimum depth of each
pipeline must be N+1, where N is the number of input ports.
The buffer in such a system is made up of multiple memory
banks. Memory addresses are assigned to packets in sequence
and on demand. Once memory addresses are assigned, packets
from different input ports can be uploaded to their associated
memory banks concurrently and independently. Unlike Prizma
[3], in which the clock speed in the control logic has to be N times
faster than the clock speed in the data path, the newly proposed
system is synchronised by a single clock signal.
0-7803-9029-6/05/$20.00 ©2005 IEEE
Having seen that clock-based buffer-memory systems are
struggling to meet the demands of modern packet-switch design, a
question arises: can a non-clocked system, namely an asynchronous
buffer-memory system, avoid the bandwidth problem? The
second contribution of this paper is therefore to explore the
possibility of implementing an asynchronous buffer-memory
system. Under asynchronous operation, synchronisation is
devolved to local control signals (handshaking signals) instead
of clocks. By the nature of asynchronous technology, data
transfers between two circuits are based on a Point-to-Point
flow control protocol, which guarantees that no data loss will
occur. The asynchronous approach in this paper shares most of
its buffer-memory architecture with the synchronous pipeline
approach. However, unlike the synchronous approach,
contention is resolved by an arbiter [4] rather than in pipelines.
The arbiter only allows one request to pass through at a time;
the one that arrives first is selected. When two requests arrive
simultaneously, it arbitrarily selects one to go through.
The rest of this paper is organised as follows: in Section 2,
the architecture of a buffer-memory system that is shared by
both approaches is presented; the asynchronous and the
synchronous implementations are presented in Sections 3.A and
3.B respectively; simulation results are presented in
Section 4; finally conclusions are drawn in Section 5.
II. ARCHITECTURE OF MEMORY MANAGEMENT IN
OUTPUT-BUFFERING PACKET-SWITCHES
Fig.1 illustrates the structure of the proposed buffer-memory
system for a 2by2 output-buffering packet switch. The
memory in such packet-switches is made up of multiple
memory banks (only two memory banks are illustrated in the
figure though more banks could be accommodated if required).
Each memory bank is logically independent from the others and is
designed to store one packet. Each memory bank has two 2-to-1
multiplexers in front of it: one for data, and the other for
control signals.
The control unit interacts with both input ports and assigns
memory banks to incoming packets on demand. It builds up
links between input ports and memory banks by setting up the
associated 2-to-1 multiplexers. Once a link is built up, the
control signals and data from the associated input port can
directly communicate with the assigned memory bank. A
circular counter, implemented in the control unit, records
memory bank addresses and determines which pair of 2-to-1
multiplexers should be configured. Although the control unit
processes memory-arrangement requests sequentially, once
memory banks are assigned to packets, data transfers between
the memory banks and their associated input ports can be
conducted concurrently and independently.
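
The allocation scheme described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the circuit used in the paper: the class name ControlUnit, the method assign and the dictionary standing in for the multiplexer select lines are hypothetical, and the model does not check whether a bank is still occupied.

```python
class ControlUnit:
    """Toy model of the control unit: a circular counter hands out banks."""

    def __init__(self, n_banks=2):
        self.n_banks = n_banks
        self.counter = 0        # circular counter holding the next bank address
        self.mux_select = {}    # bank address -> input port linked to that bank

    def assign(self, port):
        """Assign the next bank to `port` and configure its multiplexer pair."""
        bank = self.counter
        # The data mux and the control mux of a bank are set together.
        self.mux_select[bank] = port
        # Circular increment: wrap around after the last bank.
        self.counter = (self.counter + 1) % self.n_banks
        return bank


cu = ControlUnit(n_banks=2)
bank_for_port0 = cu.assign(port=0)   # requests are processed one at a time
bank_for_port1 = cu.assign(port=1)
```

After the two sequential assignments, each port owns its own bank, so in this model the two uploads no longer share any resource and could proceed in parallel.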
Figure 1 Overview of the Proposed Memory-management System
III. IMPLEMENTATION
A. Asynchronous implementation 
The asynchronous circuits in this paper are based on a
Speed-Independent model, where delays on wires are regarded
as zero or negligible while delays on gates are unbounded [5].
Data encoding is based on the bundled-data protocol. When
the data value is n bits wide, n+2 wires, i.e., n bits for data, 1
bit for request and 1 bit for acknowledgement, are required to
transfer each data item. Encoding for handshaking signals is
based on a 4-phase level signalling protocol (return-to-zero).
After each transfer, the channel signalling system must return to
the state it was in before the transfer; only then can the next
transfer start.
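
To make the 4-phase bundled-data protocol concrete, the following minimal sketch lists the signal events of one transfer. It is a behavioural illustration only: the function names four_phase_transfer and wires_needed and the event tuples are assumptions made for this example, and gate delays are not modelled.

```python
def four_phase_transfer(value, trace):
    """Append the signal events of one 4-phase bundled-data transfer to `trace`."""
    trace.append(("data", value))  # sender drives the n data wires first
    trace.append(("req", 1))       # request rises: the data is valid
    trace.append(("ack", 1))       # acknowledge rises: receiver has taken the data
    trace.append(("req", 0))       # return-to-zero phase begins
    trace.append(("ack", 0))       # channel is back in its idle state
    return value


def wires_needed(n_bits):
    """n data wires plus one request wire and one acknowledge wire."""
    return n_bits + 2


trace = []
received = [four_phase_transfer(v, trace) for v in (0xA5, 0x3C)]
```

The last two events of each transfer carry no data; they only restore the idle state, so the second transfer cannot begin before they have completed.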
The asynchronous memory-management is described in the
form of an STG (Signal Transition Graph) [6], presented in Fig.2.
R-ba0 and R-ba1 are the request signals from Input Port0 and
Input Port1 respectively. They are asserted to ask the
control unit in the buffer-memory to arrange memory before
packets can be uploaded. Unlike the synchronous pipeline
approach, the asynchronous system relies on an arbiter to
resolve contention. The arbiter only allows one request to pass
through at a time; the one that arrives first is selected. When
two requests arrive simultaneously, it arbitrarily selects one to
go through. The arbiter is accessed through the
handshaking pair R-arbiter0 and A-arbiter0 for packets from
Input Port0, and R-arbiter1 and A-arbiter1 for packets from
Input Port1. Once a request is granted by the arbiter, the
associated 2-to-1 multiplexers are set up using the
handshaking pair R-st0 and A-st0 for packets from Input
Port0, and R-st1 and A-st1 for packets from Input Port1.
The counter, which provides memory bank
addresses, is incremented as soon as a memory bank has been
assigned. To ensure that packets from different input ports are
not uploaded to the same memory bank, the arbiter must not be
released until after the counter has been incremented. The
counter is driven by the handshaking pair R-counter and
A-counter. A-ba0 and A-ba1 are the acknowledge signals
corresponding to R-ba0 and R-ba1 respectively, notifying the
input ports once the associated memory-arrangement jobs have
been done.
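
The ordering constraints in the paragraph above can be illustrated with a small event-list model. This is a sketch under assumptions, not a transcription of the STG in Fig.2: the function name arrange_memory is hypothetical, only a subset of transitions is listed (the initial R-ba rising edges and most falling edges are omitted), and a stable sort stands in for the arbiter's arbitrary choice between simultaneous requests.

```python
def arrange_memory(requests, n_banks=2):
    """requests: list of (arrival_time, port). Returns (events, assignment)."""
    events, assignment = [], {}
    counter = 0
    # The earliest request is granted first; ties keep their list order here,
    # which stands in for the arbitrary choice made by a real arbiter.
    for _, port in sorted(requests, key=lambda r: r[0]):
        events.append(f"R-arbiter{port}+")   # ask the arbiter
        events.append(f"A-arbiter{port}+")   # granted: this port owns the control unit
        events.append(f"R-st{port}+")        # set up the 2-to-1 multiplexers
        assignment[port] = counter           # bank address comes from the counter
        events.append(f"A-st{port}+")
        events.append("R-counter+")          # increment before the arbiter is released
        counter = (counter + 1) % n_banks
        events.append("A-counter+")
        events.append(f"R-arbiter{port}-")   # release the arbiter
        events.append(f"A-arbiter{port}-")
        events.append(f"A-ba{port}+")        # notify the input port
    return events, assignment


events, assignment = arrange_memory([(0, 0), (0, 1)])
```

Because the counter handshake completes before the arbiter is released, the second port can never read the same bank address as the first.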
Figure 2 STG for Asynchronous Memory-management
Figure 3 Timing of Asynchronous Memory-management
Fig.3 further illustrates how the asynchronous system
manages its memory when two packets head for the same
buffer simultaneously. The arbiter
arbitrarily grants the request (R-ba0) from Input Port0 first,
and as a result, a memory bank is assigned to the packet from
Input Port0 prior to the one from Input Port1.
B. Synchronous implementation
In the synchronous approach, the memory bandwidth
problem is resolved by using N pipelines before each buffer (N
is the number of input ports) as shown in Fig.4 (Si in the figure
represents the ith stage of a pipeline). Each pipeline is dedicated
to one input port. When multiple packets head to the
same output buffer simultaneously, the contention is resolved
while they are rippling through the pipelines. In order to avoid
data loss, the minimum depth of each pipeline must be N+1. A
token cycling from "0" to "N-1" implements a fair
policy for sequencing requests when there is contention: when
the token is "i", the packet from Input Port i proceeds first,
the packet from Input Port i+1 proceeds second, and so on,
until the packet from Input Port i-1 proceeds last (port indices
are taken modulo N).
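
The token policy amounts to a rotation of the port indices, as the following sketch shows. The function names service_order and next_token are hypothetical, and advancing the token by one after each use is an assumption that generalises the 2by2 case, in which the token simply alternates between "0" and "1".

```python
def service_order(token, n_ports):
    """Ports in the order in which they proceed when all of them contend."""
    return [(token + k) % n_ports for k in range(n_ports)]


def next_token(token, n_ports):
    """Assumed update rule: advance by one after the token has been used."""
    return (token + 1) % n_ports


order_2by2 = service_order(token=0, n_ports=2)
order_4by4 = service_order(token=2, n_ports=4)
```

Rotating the starting point means that no port is permanently favoured, which is what makes the policy fair.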
The period from t35 to t38 in Fig.5 shows how the system manages
memory and resolves contention when two packets from
different input ports head for the same output port
simultaneously. Both packets are detected by the control unit
in the first stage of their pipelines between t35 and t36 when
R-ba0 and R-ba1 are both high. Since the token is "0", the
control unit builds up the link (by setting up the associated
2-to-1 multiplexers) for the packet from Input Port0 between t36
and t37, and builds up another link for the packet from Input
Port1 between t37 and t38. The token is switched to "1" once the
contention is resolved at t36. Uploading the packet from Input
Port0 into its assigned memory bank starts from the third clock
cycle (t37) and the packet from Input Port1 from the fourth
clock cycle (t38) respectively.
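
The timing just described follows a simple pattern: one link is built per clock cycle, in the order fixed by the token, and uploading starts when the link is complete. The sketch below reproduces the t35 to t38 numbers under that reading; the function name contention_timeline and the tuple layout are assumptions made for this example.

```python
def contention_timeline(detect_cycle, order):
    """Map each port to (link_start, link_end, upload_start) as clock-edge indices."""
    timeline = {}
    for rank, port in enumerate(order):
        # Requests are detected in [detect_cycle, detect_cycle + 1];
        # afterwards one link is built per clock cycle.
        link_start = detect_cycle + 1 + rank
        link_end = link_start + 1
        timeline[port] = (link_start, link_end, link_end)  # upload starts at link_end
    return timeline


# Both requests detected between t35 and t36, token "0" gives the order [0, 1].
fig5 = contention_timeline(detect_cycle=35, order=[0, 1])
```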
Figure 5 Timing of Synchronous Memory-management
Figure 4 Synchronous Pipelines
IV. EXPERIMENTAL RESULTS 
To reduce the latency that the pipelines introduce, bypass
paths are implemented in each pipeline. Packets only have to
ripple through as many pipeline stages as the control unit needs
to arrange their memory banks. Once
their memory banks are arranged, packets can be directly
uploaded from any stage of the pipelines. The first packet that
wins contention is propagated to its assigned memory bank
after rippling through the second stage of its pipeline; the second
packet is propagated to memory after the third stage of its
pipeline, and the Nth packet after the (N+1)th stage of its
pipeline.
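
The relation between a packet's position in the service order and the stage at which it leaves its pipeline also explains the N+1 depth requirement. The following sketch states it directly; the function names exit_stage and min_pipeline_depth are hypothetical.

```python
def exit_stage(rank):
    """Stage after which the rank-th winner (1-based) bypasses to memory."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    return rank + 1


def min_pipeline_depth(n_ports):
    """Depth needed so that the last of n_ports contenders is still held."""
    return max(exit_stage(rank) for rank in range(1, n_ports + 1))


depth_2by2 = min_pipeline_depth(2)
```

In the worst case all N ports contend, the last packet leaves after stage N+1, and a shallower pipeline would have nowhere to hold it.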
For a 2by2 packet-switch, three stages are involved in each
pipeline and it takes one clock cycle for packets to ripple
through each stage. The token alternates between "0" and "1".
Once the token has been used to resolve a contention, its
value is changed. The counter recording the memory-bank address
is incremented in the same clock cycle as a link is built up.
An example showing how the synchronous system
manages memory is presented in Fig.5. The period from t11 to
t14 illustrates the scenario in which packets reach their associated
pipelines at different clock cycles. As in the asynchronous
approach, R-ba0 and R-ba1 are asserted by Input Port0 and
Input Port1 respectively to ask the control unit to arrange
memory before packets can be uploaded. The associated links
and memory banks are assigned between t12 and t13 by using
R-st0 and between t13 and t14 by using R-st1 respectively.
A-ba0 and A-ba1 are flagged high after the memory
arrangement for the associated packets has been done.
A. Simulation environment 
A synchronous and an asynchronous 2by2 output-buffering
switch are implemented to evaluate each proposed
buffer-memory system. Each switch consists of four blocks,
i.e., two input blocks and two output blocks. The input blocks
identify the destination of packets, and the buffer-memory
systems are located at the output blocks. The asynchronous
control circuits and synchronous circuits are synthesized using
Petrify¹ and SIS respectively with 0.5µm CMOS technology.
The minimum clock cycle for the synchronous circuit based on
PSPICE simulation is 6ns. Typically, transmitting one flit
through a synchronous switch takes six or seven clock cycles:
one clock cycle for receiving the flit at an input block; two or
three clock cycles for the flit rippling through its associated
pipeline, depending on whether there is contention; and three
clock cycles for the flit to be loaded into memory and then
forwarded to the next switch/destination. A packet in this paper
consists of two parts: header and payload. Headers contain
routing information while payloads only contain data.
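
The cycle budget quoted above for a single flit can be written out as a short calculation. The function name flit_cycles is hypothetical; the three contributions are the ones listed in the text.

```python
def flit_cycles(contention):
    """Clock cycles for one flit to cross the synchronous switch."""
    receive = 1                          # received at an input block
    pipeline = 3 if contention else 2    # rippling through the pipeline
    memory_and_forward = 3               # loaded into memory, then forwarded
    return receive + pipeline + memory_and_forward


best_case = flit_cycles(contention=False)
worst_case = flit_cycles(contention=True)
```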
B. Simulation results 
The simulation waveform generated by MicroSim Design
Centre for the asynchronous approach is presented in Fig.6.
Two 8-flit packets from different input ports head to the same
buffer-memory simultaneously. As shown in the figure, once
memory banks were sequentially assigned, the two packets
were concurrently and independently propagated into their
¹ http://www.lsi.upc.es/~jordic/petrify/petrify.html
associated addresses². R-data0/A-data0 and R-data1/A-data1
are the associated handshaking signals for uploading packets
into assigned memory banks (refer to Fig.1).
Figure 6 Simulation waveform of propagating two packets into the same
buffer-memory
Figure 7 Processing time of the proposed buffer-memory systems and their
associated switches (series: Async, Sync; categories: an 8-flit packet,
MA-1, MA-2)
Fig.7 presents another two results. The first result shows
that the asynchronous buffer-memory system consumes only
half of the time (equivalent to one clock cycle) that the
synchronous pipeline approach needs to manage memory for one
packet (MA-1), and two thirds of the time (equivalent to two
clock cycles) that it needs to manage memory and resolve
contention (MA-2).
The second result shows that the synchronous switch
outperformed the asynchronous switch and the synchronous
switch only spent thirteen clock cycles (six clock cycles for
header transmission, and one clock cycle for each flit of
payload) in routing through a packet. The result suggests that
although the synchronous pipelines introduce extra latency, the
impact on the overall performance of a switch is limited. If
flits in a packet are transmitted consecutively in the
synchronous approach, the extra latency consumed in the
pipelines only applies to the first flit of a packet (header) while
the latency of the subsequent flits can be overlapped with that
of their preceding flits. In the case that packets are transmitted
non-consecutively, the latency on pipelines can be reduced by
bypassing packets from the pipelines.
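
The thirteen-cycle figure follows from the overlap argument: only the header pays the full switch latency, and each later flit adds one cycle. A minimal sketch, assuming the header occupies exactly one flit and that flits are sent back to back (the function name packet_cycles is hypothetical):

```python
def packet_cycles(n_flits, header_cycles=6):
    """Cycles to route a packet whose flits are transmitted consecutively."""
    payload_flits = n_flits - 1            # assumption: the header is one flit
    return header_cycles + payload_flits   # one extra cycle per payload flit


eight_flit_packet = packet_cycles(8)
```

Under this model a longer packet spreads the fixed header latency over more flits, which is why the pipelines have a limited effect on overall switch performance.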
² It works in the same way as in the synchronous approach, in which once
memory banks are sequentially assigned, packets can be concurrently and
independently propagated into their associated addresses.
Figure 8 Contribution of the recovery time in the asynchronous transmission
cycle for a header and one flit of payload respectively (series: Recovery
time, Transmission time; categories: header, payload (1-flit))
Fig.8 explains why the asynchronous switch failed to win
over the synchronous switch: the asynchronous circuits wasted
a lot of time in recovering handshaking signals. As mentioned
in the asynchronous implementation section, for a circuit ruled
by a return-to-zero signalling protocol, after each transfer the
channel signalling system must return to the state it was in
before the transfer; only then can the next transfer start. Fig.8
shows that although transmitting a header and one flit of payload
through a switch only took about 21ns and 12ns respectively,
the asynchronous approach spent another 9ns in each case
returning to its initial state. The performance of the
asynchronous approach can be improved by replacing the
4-phase signalling protocol with a 2-phase signalling protocol.
Traditionally, 2-phase signalling protocols do not have the
recovery-time problem.
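
The weight of the recovery phase can be estimated from the approximate figures quoted for Fig.8. The sketch below only restates that arithmetic; the names FIG8_NS and recovery_share are hypothetical, and the values are the rounded ones given in the text.

```python
# (transmission time, recovery time) in ns, as quoted for Fig.8
FIG8_NS = {"header": (21, 9), "payload_flit": (12, 9)}


def recovery_share(transmission_ns, recovery_ns):
    """Fraction of a full 4-phase cycle that is spent on return-to-zero."""
    return recovery_ns / (transmission_ns + recovery_ns)


shares = {name: recovery_share(t, r) for name, (t, r) in FIG8_NS.items()}
```

With these numbers recovery takes 30% of a header cycle and about 43% of a payload-flit cycle, which is the overhead that a 2-phase protocol would be expected to remove.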
V. SUMMARIES AND CONCLUSIONS 
In this paper, two approaches were explored to resolve
the memory bandwidth problem for output buffering
packet-switches. One is via improving the buffer-memory
architecture, and the second approach is via replacing clock
signals with handshaking signals. In the former case,
contention is resolved while packets are rippling through their
associated pipelines. In the latter case, contention is resolved
by an arbiter. Both approaches are implemented and compared.
The experimental results suggest that both buffer-memory
management systems can resolve the bandwidth problem. The
experimental results also showed that the asynchronous
buffer-memory management system outperforms its synchronous
counterpart, but the asynchronous switch lost to the
synchronous switch due to its recovery time.
REFERENCES 
[1] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, Input versus output queuing
on a space division packet switch, IEEE Transactions on Communications,
COM-35(12): 1347-1356, December 1987.
[2] C. B. Stunkel, "Challenges in the design of contemporary routers",
Proceedings of the 2nd Parallel Computer Routing and Communication
Workshop, pp. 139-152, June 1997.
[3] C. Minkenberg, T. Engbersen, "A combined input and output queued
packet switched system based on PRIZMA switch on a chip technology",
IEEE Communications Magazine, Vol. 38, Issue 12, Dec. 2000, pp. 70-77.
[4] C. L. Seitz, System Timing. In C. A. Mead and L. A. Conway, editors,
Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
[5] Scott Hauck, Asynchronous Design Methodologies: An Overview,
Proceedings of the IEEE, Vol. 83, No. 1, pp. 69-93, January 1995.
[6] T.-A. Chu, Synthesis of Self-timed VLSI Circuits from Graph-theoretic
Specifications, PhD Thesis, MIT, June 1987.
