A note on implementing combining networks by Keller, Jörg & Walle, Thomas
A Note on Implementing Combining Networks

Jorg Keller Thomas Walle
FB  Informatik Universitat des Saarlandes
Postfach     Saarbrucken Germany
Abstract
In sharedmemory multiprocessors combining networks serve to eliminate hot spots due to
concurrent access to the same memory location Examples are the NYU Ultracomputer
the IBM RP and the Fluent Machine We present a problem that occurs when one tries to
implement the Fluent Machines network nodes with network chips that do not know their
position within the network We formulate the problem mathematically and present two
solutions The rst solution requires some additional hardware around nodes that can be
put outside network chips The second solution requires a minor modication of the routing
algorithm but one can prove that there is no performance loss
Keywords Computer Architecture combining networks buttery networks
1 Introduction
In machines with emulated shared memory, combining networks serve two purposes. First,
they route memory access requests between processors and memory modules. Second, they
merge concurrent accesses of several processors to one memory cell into one request and thus
reduce hot spots. This kind of access cannot be neglected because it will occur in system
parts like synchronization and resource management. Also, concurrent access is often used
in parallel algorithms for the PRAM model. Hence combining is crucial for implementing
shared memory. Combining networks have been used in several architectures, e.g. the NYU
Ultracomputer [3], the IBM RP3 [5], and the Fluent Machine [6].
The Fluent Machine diers from previous approaches in that it is guaranteed that requests
for the same cell are merged into one request However it is not obvious how to implement

This work was supported by the German Science Foundation DFG in SFB 

the Fluent Machines network nodes with universal network nodes ie nodes that do not
know their position within the network The use of universal network nodes is advantageous
because only one type of node is necessary which can even be used for designs of dierent
sizes
We will formulate the problem mathematically and present two solutions. The first does
not change the routing algorithm used in the Fluent Machine but requires additional
hardware around network nodes. The second solution requires a minor change of the routing
algorithm. We prove that the algorithm is still correct and that performance is not affected
by this modification.
The remainder of the article is organized as follows. In Section 2 we review Ranade's
routing algorithm for the Fluent Machine. In Section 3 we work out the problem that occurs
when implementing this algorithm in universal network chips. In Section 4 we present two
solutions to that problem. Section 5 contains a discussion.
 Ranades Routing Algorithm
Ranades routing algorithm uses six phases ie six traversals of Buttery networks to route
and combine requests from processors to memory modules and to route and reduplicate
answers back to processors However routing only occurs in phases  and 
 the other
phases can be implemented by dedicated hardware 	 In Ranades scheme each buttery
node contains a processor and a memory module however this can be changed such that
processors together with dedicated hardware for phases  and  are only placed at the
inputs of phase  Memory modules with multiple banks are only placed at the outputs of
phase  One physical processor simulates a number of Ranades processors We call the
execution of one instruction of each simulated processor a processor round For details of
the processor architecture see  	
We will focus on phase  because combining happens here Phase  is implemented on a
buttery network as given by Def 
Denition  A buttery network with N  
n
inputs and outputs is a graph G
n
that
consists of n   stages numbered from  to n with N nodes per stage numbered

from
 to N   G

consists of a single node G
n
can be constructed by taking two copies of
G
n
and N additional nodes that form the last stage of G
n
 Node i where   i  N 
in stage n of the smaller butteries is connected to nodes i and iN in stage n   The
construction is shown in Fig  The left output of a network node is denoted by  the right
one by 

In the sequel we will use binary representations instead of the numbers itself

Figure 1: Construction of $G_n$
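The recursive construction in the definition can be sketched in a few lines of Python. This is only an illustration: the function name `butterfly` and the edge representation are hypothetical, and the sketch assumes that output 0 of node i leads to node i and output 1 to node i + N/2, which the definition suggests but does not state explicitly.

```python
def butterfly(n):
    """Edges of G_n as {(stage, node): (target via output 0, target via output 1)}."""
    if n == 0:
        return {}  # G_0 is a single node without links
    half = 1 << (n - 1)  # N/2, the number of nodes per stage in each copy of G_{n-1}
    smaller = butterfly(n - 1)
    edges = {}
    for (stage, node), (t0, t1) in smaller.items():
        edges[(stage, node)] = (t0, t1)                       # first copy
        edges[(stage, node + half)] = (t0 + half, t1 + half)  # second copy
    for i in range(half):
        # Node i in stage n-1 of either copy reaches nodes i and i + N/2 in stage n
        # (assumption: output 0 -> i, output 1 -> i + N/2).
        edges[(n - 1, i)] = (i, i + half)
        edges[(n - 1, i + half)] = (i, i + half)
    return edges


g3 = butterfly(3)
print(len(g3))     # 24: three stages with outgoing links, 8 nodes each
print(g3[(2, 5)])  # (1, 5)
```

Under this numbering, following output b from a node in stage j simply sets bit j of the node index to b.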
Requests by processors are put into packets injected at level 0 and delivered to memory
modules in level $n$. Packets consist of a mode (READ or WRITE), an address, and one
data word. Although a data word is not needed for READ packets, a dummy value is sent
to get a uniform packet length. Packets of one processor round are injected sorted by their
addresses; at the end of the round a packet with special mode End of Round (EOR) and
address $\infty$ is injected.
Each network node selects from the two input buffers the packet with the smaller address
and thus maintains the sorted order of packets, which can be easily proven by induction. If
two packets with identical addresses and modes meet, one is selected, the other is deleted, and
information is stored to guarantee reduplication on the way back. The sorting guarantees
that all packets of one round with identical addresses meet and get combined.
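The selection rule amounts to merging two sorted streams while combining equal packets. The following Python sketch illustrates this under simplifying assumptions: packets are modelled as hypothetical `(address, mode)` tuples, the data word and the buffering are ignored, and packets with equal address but different mode are ordered by mode, which the text does not specify.

```python
def merge_and_combine(left, right):
    """Merge two address-sorted packet lists; combine packets equal in address and mode."""
    out, combined = [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])       # one packet is selected ...
            combined.append(left[i])  # ... the other is deleted and remembered for the way back
            i += 1
            j += 1
        elif left[i] < right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out, combined


out, combined = merge_and_combine(
    [(1, "READ"), (4, "READ")],
    [(2, "WRITE"), (4, "READ"), (7, "READ")],
)
print(out)       # [(1, 'READ'), (2, 'WRITE'), (4, 'READ'), (7, 'READ')]
print(combined)  # [(4, 'READ')]
```

Because both inputs are sorted, the two requests for address 4 necessarily reach the heads of their buffers at the same time, which is why they meet and can be combined.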
The packet selected by a node is transmitted to the next level of the network via the
appropriate output link of the node (for path selection see Section 3). Only EOR packets
are transmitted via both outputs to ensure separation of rounds; address $\infty$ ensures that
an EOR is only selected if both input buffers contain EOR packets.
An empty input buer prevents a node from sending a packet waiting at the other input
buer If it would be sent the sorting could be destroyed by a packet with smaller address
arriving later at the empty input buer To avoid unnecessary waiting GHOST packets are
introduced If a selected packet is transmitted via an output link a where a  f g then
a GHOST packet carrying the same address is sent via output link  a Hence GHOSTs
serve as lower bounds of future packet addresses along this link GHOSTs that must wait
because they are not selected or blocked by full buers are destroyed because a new GHOST
or a packet will follow the next cycle so no information is lost
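One selection step of a node, including GHOST generation, might look as follows in Python. This is a sketch under assumptions, not the authors' design: the names `Packet` and `node_step` are hypothetical, addresses consist only of the $n$ module bits, combining and buffer capacities are left out, and a selected GHOST is assumed to be forwarded on both outputs, which the text does not spell out.

```python
from typing import NamedTuple, Optional

INF = float("inf")  # address carried by EOR packets


class Packet(NamedTuple):
    address: float
    mode: str  # "READ", "WRITE", "GHOST" or "EOR"


def node_step(left: Optional[Packet], right: Optional[Packet], level: int, n: int):
    """Return {output link: packet} for one step, or None if the node must wait."""
    if left is None or right is None:
        return None  # an empty input buffer blocks the node
    winner = min(left, right, key=lambda p: p.address)
    if winner.mode in ("EOR", "GHOST"):
        # EOR goes to both outputs (from the text); doing the same for a
        # selected GHOST is an assumption of this sketch.
        return {0: winner, 1: winner}
    link = (int(winner.address) >> (n - level - 1)) & 1  # routing bit, most significant first
    ghost = Packet(winner.address, "GHOST")              # lower bound for the other link
    return {link: winner, 1 - link: ghost}


print(node_step(Packet(0b101, "READ"), None, level=0, n=3))
# None
print(node_step(Packet(0b101, "READ"), Packet(0b110, "WRITE"), level=0, n=3))
# {1: Packet(address=5, mode='READ'), 0: Packet(address=5, mode='GHOST')}
```

Since EOR carries the largest possible address, `min` picks it only when both inputs hold an EOR, matching the separation of rounds described above.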

3 Implementation
The $n$ most significant bits of a packet's address specify the destination module of this
packet. Path selection is given by the following Lemma 1.
Lemma 1 A packet with destination module $(i_{n-1}, \dots, i_0)$ that is injected at level 0 of a
butterfly network $G_n$ must be transmitted in level $j$, where $0 \le j \le n - 1$, along output
$i_{n-j-1}$.
Proof by Induction on n The case n   is obvious To prove the claim for a
buttery network G
n
 where n   we consider the recursive construction from networks
G
n
as given in Fig  The packet will be routed to node x  i
n
   i

in level n  in
one of the networks G
n
 By the denition of G
n
 it will reach node i
n
   i

in level n
from both positions by taking output i


Note that normally destination bits are not taken in reverse order. However, in our case
this reversal only leads to a permutation of memory modules, which does not affect the
correctness but simplifies the implementation, as we will see later.
In a direct implementation of the path selection scheme from Lemma 1, each network node
must know its level number. To implement the algorithm with universal network nodes,
this must be avoided. A solution would be to have the desired routing bit always at the
same position in every level. This is possible by the following Lemma 2.
Lemma 2 If two packets meet in a network node in level $i$, where $i \le n$, then the $i$
most significant bits of both addresses are identical. We will call these bits the address prefix.
Proof. Consider the subgraph of $G_n$ that contains the two nodes where the packets were
injected and the node where they meet. The subgraph is a butterfly network $G_i$. We apply
Lemma 1 with $n = i$; then the two packets are destined for the same output node of a
butterfly network $G_i$, and hence their $i$ most significant address bits are identical.
Lemma 2 seems to suggest the following implementation. Because the prefixes of two meeting
packets are identical, only the remaining addresses, which consist of address bits $n - i - 1$
to 0, are needed to compare addresses in level $i$. If the address is shifted left by one position
after each level, then the desired routing bit is always bit $n - 1$, which allows the use of universal
network nodes. The address part of a packet contains a shifted version of the remaining
address, which guarantees correct packet selection within network nodes.
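The shifting scheme can be illustrated with a short Python sketch. It assumes, for simplicity, that an address consists only of the $n$ module bits; `universal_node` and `links_taken` are hypothetical names.

```python
def universal_node(address: int, n: int):
    """Level-independent node: route by bit n-1, then shift the address left by one."""
    link = (address >> (n - 1)) & 1
    shifted = (address << 1) & ((1 << n) - 1)  # the used routing bit is shifted out
    return link, shifted


def links_taken(dest: int, n: int):
    """Output links chosen in levels 0 .. n-1 when every node is a universal node."""
    links, address = [], dest
    for _ in range(n):
        link, address = universal_node(address, n)
        links.append(link)
    return links


print(links_taken(0b110, 3))  # [1, 1, 0]
```

The node never needs its level number: the links chosen are exactly the destination bits from most to least significant, as required by Lemma 1.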
However this implementation leads to errors as the following Lemma  will show

Lemma  If a GHOST and a packet meet in a node j
n
   j

in level i then their address
prexes are dierent ie comparison of the remaining addresses is not su	cient
Proof For a packet the prex is the sequence of routing decisions so far However when
a GHOST is generated it is not transmitted via the output that the address would force
see end of Section  Hence the GHOSTs address prex diers in that position from the
sequence of routing decisions If a GHOST and a packet meet their sequences of routing
decisions are identical and hence their address prexes must be dierent In this case it
can happen that the packet is selected before the GHOST because the packets remaining
address is smaller although the packets address is larger than the GHOSTs address
4 Two Solutions
4.1 Minor Hardware Modification
One can avoid the error by providing complete addresses to comparator units (see Fig. 2).
This, however, has to be done in such a way that the desired routing bit still has position
$n - 1$. Both demands together can be fulfilled by inserting the following circuit before the
routing and address shifting unit in level $i$: the address is shifted right by one position,
then bit $n - i - 1$, i.e. the desired routing bit, is copied to position $n - 1$.
Now the desired routing bit always is in position $n - 1$, and after the regular left shift we
have the complete address. We only have to ensure that we have one spare bit in the
address part of the packets so that no address information is lost during the right shift.
This should normally be possible, as address parts typically have fixed sizes ( or  bit)
and real address spaces are smaller.
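The interplay of the level-specific circuit and the universal node can be sketched in Python. The field layout is an assumption of this sketch, not taken from the paper: the address part is modelled as a field of $n + 1$ bits that holds the complete $n$-bit address above one spare low bit, so the fixed position read by the node is the top bit of that field. The names `board_circuit` and `universal_chip` are hypothetical.

```python
def board_circuit(field: int, level: int, n: int) -> int:
    """Level-specific part outside the chip: right shift, then copy the routing bit to the top."""
    field >>= 1                                   # the spare low bit absorbs the shift
    routing_bit = (field >> (n - level - 1)) & 1  # address bit n-level-1
    return field | (routing_bit << n)             # fixed position read by the chip


def universal_chip(field: int, n: int):
    """Level-independent part: route by the top bit, then do the regular left shift."""
    link = (field >> n) & 1
    return link, (field << 1) & ((1 << (n + 1)) - 1)


n, address = 3, 0b110
field = address << 1  # complete address, spare bit empty
for level in range(n):
    link, field = universal_chip(board_circuit(field, level, n), n)
    print(level, link, field == address << 1)
# 0 1 True
# 1 1 True
# 2 0 True
```

After every level the field again holds the complete address, so comparators always see full addresses, while the chip itself never uses the level number.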
The copying of an address bit is an implicit encoding of the level number. Hence we have
to take care that this copying is done outside network chips. Then we can use universal
network chips; only the network boards differ for different levels. This is considered a minor
problem because there are fewer boards than chips and because boards are less expensive
than ASICs.
To see how the copying unit can be placed outside a chip, we consider the design of a network
node as shown in Fig. 2. An obvious mapping of one network node to a chip would result in
the copying within the chip. However, if we consider the dashed line in Fig. 2, the number of
wires crossing it is not more than the number of wires in one output link. Hence we can use
a mapping from [2] as shown in Fig. 3. The resulting chips do not use more pins than chips
that implement one network node, and the copying can be put between two chips. One can
prove that the network of chips obtained by this mapping still is a butterfly network [2].
Note that this mapping doubles the gate utilization in network chips. However, as network
chips are pin-limited, this does not pose a problem and even reduces the number of chips
by a factor of two [2].
Figure 2: Schematic Design of a Network Node (input links 0 and 1, input buffers,
selection/comparator, right shift + copying, routing + left shift, output links 0 and 1)
Figure 3: Mapping of Network Nodes to Chips
Figure 4: Generation of GHOSTs (nodes U, V, W, Z; packets A, B; GHOSTs G, G')
4.2 Minor Algorithm Modification
Consider the situation shown in Fig. 4. Node U sends a packet A to node Z along output
link 0. The packet's address then must have the form a0b, where a is the prefix and b is the
remaining address. The packet generates a GHOST G with address a0b that is sent along
output link 1 to node W. A packet B that meets this GHOST in node W must have entered
W along the other input and hence must have address a1c. It follows that GHOST G must
be selected in node W. Since GHOSTs serve to avoid unnecessary waiting, GHOST G is of
no use in node W, because the packet in W must wait no matter whether the GHOST was
there or whether the buffer was empty.
Now consider the situation for packet B, which is routed along output link 1 from V to W.
Packet B generates a GHOST G' with address a1c that is transmitted along output link 0
to node Z, where it meets packet A with address a0b. It follows that in node Z packet A
must always be selected without address comparison.
It follows that if one sends GHOSTs only along output link 0, then the comparison between
a packet and a GHOST is independent of the GHOST's address: the packet will always win.
It is obvious that the modified algorithm is correct as long as an empty input buffer prevents
nodes from sending a packet that is waiting in the other input buffer. It is also easy to see
that performance will not change, as GHOSTs that were generated along an output link 1
have a smaller address than the packets that they meet, even if these GHOSTs are further
forwarded, so those packets have to wait with or without them.
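The claim can be checked exhaustively for small address widths with a Python sketch. It assumes that the generating packet and the packet met share the prefix of the current level and differ in the routing bit, as in the a0b / a1c argument; `ghost_meetings` is a hypothetical helper.

```python
from itertools import product


def ghost_meetings(width: int, level: int):
    """Yield (ghost link, GHOST address, packet address) for all possible meetings."""
    shift = width - level - 1  # position of the routing bit in this level
    for x, y in product(range(1 << width), repeat=2):
        same_prefix = (x >> (shift + 1)) == (y >> (shift + 1))
        x_bit, y_bit = (x >> shift) & 1, (y >> shift) & 1
        if same_prefix and x_bit != y_bit:
            # Packet x leaves on link x_bit; its GHOST leaves on link 1 - x_bit == y_bit
            # and meets packet y, which took that link from the neighbouring node.
            yield y_bit, x, y


meetings = list(ghost_meetings(width=4, level=1))
print(all(g > p for link, g, p in meetings if link == 0))  # True: the packet always wins
print(all(g < p for link, g, p in meetings if link == 1))  # True: the packet would wait anyway
```

GHOSTs sent along link 0 always lose against the packet they meet, so no address comparison is needed for them, while GHOSTs sent along link 1 are always smaller and therefore never release a waiting packet.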

5 Discussion
We presented two solutions to an implementation problem of Ranade's routing algorithm.
The first solution has the minor disadvantage that it only allows usage of universal
network chips but requires different layouts of boards for different levels. Also, if packets are
transmitted between chips in several pieces called flits, the flits must be treated differently
depending on whether they carry an address part or a data part of a packet. This requires
additional hardware on boards and enlarges propagation delay between two network chips.
The second solution allows the use of universal network nodes and boards and requires no
additional hardware between network chips.
The second solution suers from the fact that it can only be applied if routing bits are
taken in reversed order as remarked in Section  The rst solution allows any order of
routing bits Although this does not aect correctness simulations hint that performance
is better if routing bits are taken in normal order The reason for this is the sorted order
of addresses If bits are taken in reversed order than in each processor round each network
node will rst send packets along output link  then along output link  If routing bits are
taken in normal order then routing decision and sorting are decoupled ie in one processor
round a network node will rst send one or several packets along one of the output links
then some packets along the other output link then some packets along the rst output
link and so on This better distribution leads to a better utilization of buers and hence to
a performance improvement Improvements in simulations have been between 
 and  
Thus one has a kind of tradeo between performance and universality of design
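The effect of the bit order on the sequence of output links can be illustrated for a node in level 0 with a short Python sketch. It assumes that "reversed order" means the most significant module bit is used first (as in Lemma 1) and "normal order" the least significant bit first; `link_sequence` and `switches` are hypothetical helpers, and the sketch says nothing about the measured performance.

```python
def link_sequence(addresses, n, level, reversed_order=True):
    """Output links used by a node for packets arriving in sorted address order."""
    pos = n - level - 1 if reversed_order else level
    return [(a >> pos) & 1 for a in sorted(addresses)]


def switches(seq):
    """Number of times the node changes the output link."""
    return sum(a != b for a, b in zip(seq, seq[1:]))


rev = link_sequence(range(8), n=3, level=0, reversed_order=True)
nor = link_sequence(range(8), n=3, level=0, reversed_order=False)
print(rev, switches(rev))  # [0, 0, 0, 0, 1, 1, 1, 1] 1
print(nor, switches(nor))  # [0, 1, 0, 1, 0, 1, 0, 1] 7
```

In reversed order one output link stays idle during the first half of the round, whereas in normal order both links are used alternately.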
Acknowledgements
The authors would like to thank Andreas Paul for bringing up the problem of disturbed
sorted order in the network.
References
[1] F. Abolhassan, J. Keller, and W. J. Paul, On the cost-effectiveness of PRAMs, in: Proc.
3rd IEEE Symp. on Parallel and Distributed Processing.
[2] D. Cross, R. Drefenstedt, and J. Keller, Reduction of network cost and wiring in Ranade's
butterfly routing, Inform. Process. Lett.
[3] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, The
NYU Ultracomputer - designing an MIMD shared memory parallel computer, IEEE
Trans. Comput.
[4] J. Keller, W. J. Paul, and D. Scheerer, Realization of PRAMs: Processor design, in:
Proc. WDAG, Internat. Workshop on Distributed Algorithms.
[5] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder,
K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss, The IBM research parallel
processor prototype (RP3): Introduction and architecture, in: Proc. Internat.
Conf. on Parallel Processing.
[6] A. G. Ranade, How to emulate shared memory, J. Comput. System Sci.
