The Macrame 1024 node switching network by Haas, S. et al.

1 CERN, 1211 Geneva 23, Switzerland
2 University of Liverpool, Liverpool, UK
3 University of Kent, Canterbury, UK
4 RHBNC, University of London, London, UK
Abstract. The work reported involves the construction of a large modular
testbed using IEEE 1355 DS link technology. A thousand nodes will
be interconnected by a switching fabric based on the STC104 packet
switch. The system has been designed and constructed in a modular way
in order to allow a variety of different network topologies to be investigated.
Network throughput and latency have been studied for different
network topologies under various traffic conditions.
1 Introduction
To date, practical experience in constructing switching networks using IEEE
1355 technology [1] has been confined to relatively small systems, and there are
no experimental results on how the performance of such systems will scale up to
several hundred or even several thousand nodes. Theoretical studies [2,3] have
been carried out for large networks of up to one thousand nodes for different
topologies.
We present results obtained on a large modular testbed using 100 Mbits/s
point-to-point DS links and switching technology, as defined in the IEEE 1355
standard. One thousand nodes will be interconnected by a switching fabric based
on the 32 way STC104 packet switch [4]. The system has been designed and
constructed in a modular way to allow a variety of different network topologies
to be investigated (Clos, grid, torus, etc.).
Network throughput and latency are being studied for various traffic conditions
as a function of the topology and network size. Results obtained with the
current 656 node setup are presented.
The work presented has been carried out within the framework of the European
Union's Esprit (footnote 1) program as part of the Macrame [5] project (Esprit project
8603).
Footnote 1: European Strategic Program for Research and development in Information Technology
2 The IEEE 1355 standard
The Esprit OMI/HIC (footnote 2) project has developed two bi-directional link protocols
which form the basis of the IEEE 1355 standard:
- a 100 Mbits/s Data-Strobe (DS) Link,
- a 1 Gbit/s High Speed (HS) Link.
The work reported here is based on the DS link protocol as shown in
Figure 1. The data line carries the binary data values and the strobe line only
changes state when the next data bit has the same value as the previous one.
The links are asynchronous, as the data/strobe signal pair carries an encoded
clock. Studies on the reliability of DS links [6], up to distances of 20 meters, have
Fig. 1. DS link protocol (example data bits: 1 0 0 1 0 1 1)
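The data/strobe rule described above can be illustrated with a minimal sketch. It assumes that both lines start at level 0 (an assumption, not stated in the text) and uses the hypothetical function names ds_encode and recovered_clock. Because exactly one of the two lines changes in every bit period, the exclusive OR of data and strobe toggles once per bit, which is how the pair carries an encoded clock.

```python
def ds_encode(bits, d0=0, s0=0):
    """Return (data, strobe) line levels for a bit sequence.

    d0 and s0 are the assumed initial line levels (hypothetical).
    The strobe toggles only when the next data bit equals the previous one.
    """
    data, strobe = [], []
    prev_d, prev_s = d0, s0
    for b in bits:
        if b == prev_d:
            prev_s ^= 1          # data line does not change, so strobe does
        prev_d = b
        data.append(prev_d)
        strobe.append(prev_s)
    return data, strobe


def recovered_clock(data, strobe):
    """XOR of the two lines; it changes level in every bit period."""
    return [d ^ s for d, s in zip(data, strobe)]


data, strobe = ds_encode([1, 0, 0, 1, 0, 1, 1])
clock = recovered_clock(data, strobe)
```

For the bit sequence of Figure 1 the recovered clock alternates on every bit, whatever the data values are.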
On top of the bit level there are a further three levels of protocol: the character,
exchange and packet levels. Characters are groups of consecutive bits
used to represent data or control information. The exchange layer describes the
exchange of characters to ensure the proper function of a link. DS links use a
credit based flow control scheme which operates on a per link basis. This ensures
that the switching fabric is lossless: no characters are lost internally due to buffer
overflow.
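The idea behind credit based flow control can be sketched as follows. This is a simplified model, not the IEEE 1355 exchange layer itself: the class name CreditLink, the buffer size and the one-credit-per-character granularity are all illustrative assumptions. The sender holds one credit per free slot in the receiver's buffer and stalls when it has none, so the buffer can never overflow.

```python
from collections import deque


class CreditLink:
    """Toy model of one link direction with credit based flow control."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.credits = buffer_size      # sender's view of free receiver slots
        self.rx_buffer = deque()

    def try_send(self, char):
        """Send one character if a credit is available; otherwise stall."""
        if self.credits == 0:
            return False                # sender waits, nothing is dropped
        self.credits -= 1
        self.rx_buffer.append(char)
        assert len(self.rx_buffer) <= self.buffer_size  # lossless invariant
        return True

    def consume(self):
        """Receiver removes one character and returns a credit to the sender."""
        if not self.rx_buffer:
            return None
        char = self.rx_buffer.popleft()
        self.credits += 1
        return char


link = CreditLink(buffer_size=2)
sent = [link.try_send(c) for c in "abc"]   # the third character must wait
```

A stalled sender retries after the receiver has consumed a character, which is the back-pressure that keeps the fabric lossless.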
Information is transferred in the form of packets. A packet consists of a
header, which contains routing information, a payload containing zero or more
data bytes and an end of packet marker. The protocol allows arbitrary length
packets to be exchanged.
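The packet structure can be sketched as a small data type. The names Packet, END_OF_PACKET, to_characters and split_stream are hypothetical, and the string "EOP" merely stands in for the end of packet control character. The sketch shows why arbitrary length packets are possible: the receiver finds packet boundaries from the marker rather than from a length field.

```python
from dataclasses import dataclass

END_OF_PACKET = "EOP"  # hypothetical stand-in for the end of packet marker


@dataclass
class Packet:
    header: bytes          # routing information
    payload: bytes = b""   # zero or more data bytes

    def to_characters(self):
        """Header bytes, then payload bytes, then the end of packet marker."""
        return list(self.header) + list(self.payload) + [END_OF_PACKET]


def split_stream(chars):
    """Cut a character stream into packets at each end of packet marker."""
    packets, current = [], []
    for c in chars:
        if c == END_OF_PACKET:
            packets.append(current)
            current = []
        else:
            current.append(c)
    return packets


stream = Packet(b"\x05", b"hi").to_characters() + Packet(b"\x07").to_characters()
packets = split_stream(stream)
```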
3 The Macrame network testbed
3.1 Hardware
The requirement to study dierent topologies, plus the need to do this for hun-
dreds of nodes, imposes a system design and implementation which is highly
2
Open Microprocessor Systems Initiative / Heterogeneous Inter-Connect Project
modular and 
exible, for quick and reliable system re-conguration, as well as
having a very low cost per node. The system is built up from three elements,
each one housed in VME 6U mechanics:
- Traffic node modules
- STC104 packet switch units
- Timing and spy nodes
A traffic node can simultaneously send and receive data at the full link speed
of 100 Mbits/s. A series of packet descriptors defines the traffic pattern. The
packet destination address, the packet length and the time delay to wait before
dispatching the next packet are programmable. Each traffic node has memory for
up to 8k such packet descriptors. The nodes are all synchronised with the same
clock.
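A packet descriptor list of this kind might be prepared as in the sketch below. The field names, the delay units, the interpretation of "8k" as 8192 and the exclusion of the source node from its own destinations are assumptions made for illustration; the text does not specify them.

```python
import random
from dataclasses import dataclass

MAX_DESCRIPTORS = 8 * 1024  # "8k" descriptors per traffic node, taken here as 8192


@dataclass(frozen=True)
class PacketDescriptor:
    destination: int  # destination node address
    length: int       # packet length in bytes
    delay: int        # wait before dispatching the next packet (units hypothetical)


def random_traffic(source, num_nodes, length, delay, count, seed=0):
    """Descriptors for one node sending to uniformly chosen other nodes."""
    if count > MAX_DESCRIPTORS:
        raise ValueError("a traffic node holds at most 8k descriptors")
    rng = random.Random(seed)
    others = [n for n in range(num_nodes) if n != source]  # assumption: no self-traffic
    return [PacketDescriptor(rng.choice(others), length, delay)
            for _ in range(count)]


descriptors = random_traffic(source=0, num_nodes=64, length=64, delay=10, count=100)
```

Varying the delay field is one way such a list could set the offered network load.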
The dispatch algorithm is implemented in an FPGA (footnote 3) which can be reconfigured
under host control. A control processor is used to supervise the operation
of a group of 4 traffic nodes and all these processors are connected via a control
network.
To reduce cabling, sixteen traffic nodes are hard-wired to an on-board STC104
packet switch. The remaining 16 ports of the switch are brought out to the front
panel for inter-module connection. Boards can be interconnected either directly,
or via packet switch units which contain one switch with all 32 ports brought
out to the front panel.
To measure latency, the timing nodes transmit and analyse time stamped
packets which cross the network between chosen points. The same modules, in
spy mode, can be inserted into any cabled connection to provide a snapshot of
the traffic passing through that point. This provides debugging information and
additional data on congestion "hot spots".
A VME crate contains 128 traffic nodes and the entire 1024 node system can
be housed within eight crates. All crates have an Ethernet port which drives an
OS (footnote 4) link daisy chain connection to the control processors. The STC104 packet
switches have their own separate DS control network which is independent from
the main data path.
Figure 2 shows how a two dimensional grid network topology can be constructed.
Each packet switch has 16 on-board connections to traffic nodes and
four external cabled connections to each of its four nearest neighbours.
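The port budget of this grid can be checked with a short sketch. The function names are hypothetical, and the grid is assumed to have no wrap-around links, so edge switches leave some external ports unused. An interior switch uses 16 on-board ports plus 4 x 4 cabled ports, which is exactly the 32 ports of the STC104.

```python
NODES_PER_SWITCH = 16     # traffic nodes hard-wired to each on-board switch
LINKS_PER_NEIGHBOUR = 4   # cabled connections to each nearest neighbour


def grid_topology(n):
    """Return (node_count, links) for an n x n grid of switches.

    links maps a pair of neighbouring switch coordinates to the number
    of cables between them (no wrap-around, an assumption).
    """
    links = {}
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                links[((r, c), (r, c + 1))] = LINKS_PER_NEIGHBOUR
            if r + 1 < n:
                links[((r, c), (r + 1, c))] = LINKS_PER_NEIGHBOUR
    return NODES_PER_SWITCH * n * n, links


def ports_used(n, switch):
    """Switch ports in use: on-board node links plus external cables."""
    _, links = grid_topology(n)
    external = sum(k for pair, k in links.items() if switch in pair)
    return NODES_PER_SWITCH + external


sizes = {n: grid_topology(n)[0] for n in (2, 3, 4, 5)}
```

The node counts obtained for 2x2 to 5x5 grids (64, 144, 256 and 400) are the grid sizes for which results are reported.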
So far 656 nodes have been built and tested. They have been assembled as a
range of 2D grid and multistage Clos networks [7]. An example of a 256 node
Clos is shown in Figure 3. Results are presented for these configurations. Further
details on the design of the testbed are presented in [8].
Footnote 3: Field Programmable Gate Array
Footnote 4: 20 Mbits/s Over Sampled Transputer links
Fig. 2. Architecture of the Macrame testbed
Fig. 3. 256 node Clos network
3.2 Software
A set of files is prepared off line containing: the packet descriptors, the configuration
information for every traffic node and the routing tables for the packet
switches. Prior to loading this data, the control networks for the traffic nodes
and packet switches are used to verify that the expected devices are present and
connected in the required order.
Each control processor has 4 Kbytes of on-chip memory. It is loaded at initialisation
time with a kernel which handles the control link traffic and the dynamic
loading of the application programs. Application programs for self-test, hardware
configuration, storing of traffic descriptors and run time supervision are
loaded in turn by the host which also controls their synchronisation.
Once the system is running each control processor maintains local histograms
of results. These are returned to the host on request for on-line display, data
logging and subsequent analysis.
4 Results
4.1 Network latency for Clos networks
Figure 4 shows the latency of three different size Clos networks as a function of
the aggregate network throughput. The traffic pattern is random, i.e. transmitting
nodes choose a destination from a uniform distribution. The packet length
is 64 bytes. The results are produced by varying the network load and measuring
the corresponding throughput and latency. It can be seen that the average latency
increases exponentially as the network throughput approaches saturation.
Therefore, to achieve low average latencies the network load must be below the
saturation throughput.
Figure 5 shows the probability that a packet will have a latency greater than
a given value for various network loads. The traffic pattern is random, with a
packet length of 64 bytes. For 10% load the latency distribution is very narrow
compared to higher loads. Near the saturation throughput (about 60% load) a
significant percentage of the packets experience a latency many times the average
value, which is 18 µs. To reduce the probability of very large latency values the
network load must be far below the saturation throughput.
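The quantity plotted in Figure 5, the probability that a packet exceeds a given latency, can be computed from recorded latencies as sketched below. The function name and the sample values are hypothetical and are not measurements from the testbed.

```python
def exceedance_probability(latencies, thresholds):
    """Fraction of packets whose latency is strictly greater than each threshold."""
    if not latencies:
        raise ValueError("no latency samples")
    n = len(latencies)
    return [sum(1 for x in latencies if x > t) / n for t in thresholds]


# Hypothetical latency samples in microseconds (illustration only).
samples = [12, 12, 13, 18, 40]
tail = exceedance_probability(samples, thresholds=[10, 12, 18, 40])
```

Plotted on a logarithmic vertical axis, such a curve makes a long latency tail visible even when the average stays low.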
Fig. 4. Latency versus throughput for
Legend: 64 node Clos, 128 node Clos, 256 node Clos
Fig. 5. Probability that a packet will have greater than a given latency value
for a 64 node Clos network
4.2 Comparison of network topologies
Figure 6 shows the per node saturation throughput for different size 2D grids
and Clos networks under random traffic as a function of the packet length. The
Clos shows better per node performance because of its higher cross-sectional
bandwidth. The effect of packet length on throughput can also be
seen: for small packets the throughput is reduced due to fixed packet overheads.
Medium sized packets give the best performance because of the buffering present
in the STC104: each switch can buffer 32 bytes in both the link input and output
ports. Long packets fill the entire path from source to destination, and therefore
throughput is reduced by head-of-line blocking.
4.3 Scalability of Clos and grid networks
Figure 6 also shows that the throughput of Clos and 2D grid networks does
not scale linearly with network size under random traffic: the per node throughput
is reduced as the network size increases. Figure 7 shows saturation network
throughput for different sizes of Clos and 2D grid networks under random and
systematic traffic. The packet length is 64 bytes. Systematic traffic involves fixed
pairs of nodes sending to each other. For the grid this traffic pattern involves
communication between nodes attached to nearest neighbour switches. The performance
of the Clos under systematic traffic is independent of the choice of
pairs. For random traffic, contention at the destinations and internally to the
network reduces the network throughput compared to that obtained for systematic
traffic, where there is no destination contention. The fall off in performance
from systematic to random traffic is more pronounced for the grid than the Clos.
The throughput of the grid network increases logarithmically with the network
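The difference between systematic and random traffic in terms of destination contention can be illustrated with a small sketch. The pairing rule (node 2k with node 2k+1), the function names and the choice to allow a node to pick itself are assumptions made for simplicity. The sketch only counts how many senders target each destination at one instant; it does not model the STC104 or predict the measured throughput.

```python
import random
from collections import Counter


def systematic_pairs(num_nodes):
    """Fixed pairs sending to each other: node 2k <-> node 2k+1 (hypothetical pairing)."""
    if num_nodes % 2:
        raise ValueError("pairing needs an even number of nodes")
    return {n: n ^ 1 for n in range(num_nodes)}


def random_destinations(num_nodes, seed=0):
    """Each node picks a destination uniformly (self allowed, for simplicity)."""
    rng = random.Random(seed)
    return {n: rng.randrange(num_nodes) for n in range(num_nodes)}


def max_senders_per_destination(dest_of):
    """Largest number of sources targeting the same destination."""
    return max(Counter(dest_of.values()).values())


def expected_fraction_of_busy_destinations(num_nodes):
    """Expected share of destinations chosen by at least one sender."""
    return 1 - (1 - 1 / num_nodes) ** num_nodes


worst_systematic = max_senders_per_destination(systematic_pairs(64))
busy_random = expected_fraction_of_busy_destinations(64)
```

With fixed pairs every destination has exactly one sender, whereas with uniform random choice only about 63.5% of 64 destinations are targeted at any instant and the rest of the senders collide.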
























Fig. 6. Per node throughput for 2D grid and Clos networks under random traffic
Legend: 64 node Clos (4:2), 128 node Clos (8:4), 256 node Clos (16:8),
2x2 2D grid (64 nodes), 3x3 2D grid (144 nodes), 4x4 2D grid (256 nodes),
5x5 2D grid (400 nodes)
Fig. 7. Throughput versus network size for Clos and grid networks
Legend: Clos networks, random traffic; Clos networks, systematic traffic;
2D grid networks, random traffic; 2D grid networks, systematic traffic
4.4 Packet transmission overhead
The overhead in dispatching packets in the traffic nodes is determined by hardware
and is small, approximately 650 ns. This will not in general be the case
when interfacing links to a microprocessor. To demonstrate the effect of the
packet overhead the dispatching delay has been artificially increased. Figure 8
shows the dependence of network throughput on packet overhead for a 128 node
Clos under random traffic. The fall off in performance is particularly marked for
short packets; the throughput drops by nearly an order of magnitude when the
overhead is increased from 10 to 100 µs. This underlines the importance of an
efficient processor to link interface.
4.5 Comparison of simulation and measurement
A 64 node Clos network has been simulated using the commercial OPNET simulation
package [9]. A model of the DS link and the STC104 switch has been
developed within the Macrame project for this simulator. Results from simulation
and measurement have been compared and are shown in Figure 9, which
shows the latency distribution for 64 byte packets and random traffic at 50%
load. The majority of packets pass through the network without being queued,
corresponding to the peak at 12 µs. It can be seen that the agreement between
simulation and measurement is very good.
Fig. 8. Network throughput versus
Legend: 16 byte (calc.), 64 byte (calc.), 512 byte (calc.),
16 byte (meas.), 64 byte (meas.), 512 byte (meas.)
Fig. 9. A comparison of the simulated and measured latency distributions for
a 64 node Clos network
5 Conclusions
We have demonstrated a large packet switching system, based on DS Link technology,
that is performing reliably, and has provided quantitative measurements
of the performance of 2D grid and Clos topologies. Data from this system has
been used to calibrate the simulation models which now closely agree with our
measurements. This work will be extended to cover other topologies and a systematic
study of performance, working up to the design target of 1024 nodes.
Acknowledgements
We are very grateful for the support of the European Union through the Macrame
project (Esprit project 8603). We would also like to thank PACT (Partnership
in Advanced Computing Technologies, UK) for providing the simulation results
presented within this paper.
References
1. IEEE Std. 1355, Standard for Heterogeneous Inter-Connect (HIC). Low Cost Low
Latency Scalable Serial Interconnect for Parallel System Construction. IEEE Inc.,
1995.
2. A. Klein, Interconnection Networks for Universal Message-Passing Systems, Proc.
ESPRIT Conference '91, pp. 336-351, Commission for the European
Communities, Nov. 1991, ISBN 92-826-2905-8.
3. Networks, Routers and Transputers, edited by M.D. May, P.W. Thompson, P.H.
Welch, ISBN 90 5199 129 0, http://www.hensa.ac.uk/parallel/www/nrat.html
4. The STC104 Asynchronous Packet Switch, Data sheet, April 1995.
SGS-THOMSON Microelectronics.
5. The Esprit Project Macrame, http://www.pact.srf.ac.uk/macrame/welcome.html
6. S. Haas, X. Liu and B. Martin, Long Distance Differential Transmission of DS
Links over Copper Cable (CERN),
http://www.hensa.ac.uk/parallel/vendors/inmos/ieeehic/copper.ps.gz
7. C. Clos, A Study of Non-blocking Switching Networks, Bell Systems Technical
Journal 32, 1953.
8. R.W. Dobinson, B. Martin, S. Haas, R. Heeley, M. Zhu, J. Renner Hansen,
Realization of a 1000-node high-speed packet switching network, ICS-NET '95 St
Petersburg, Russia.
9. The OPNET Modeler, http://www.mil3.com/.
