The C104 is an asynchronous 32-way dynamic packet routing chip. It has a 264 Mbytes/s bi-directional bandwidth and a 1 µs switching latency. It offers high-density, cost-effective commodity communications, which allows large switching networks to be constructed. Results are presented on the performance of this switching technology within the context of future high energy physics experiments.
Introduction
The bandwidth and connectivity requirements of trigger and data acquisition systems for future HEP experiments at the CERN LHC and HERA-B at DESY highlight the limitations in scaling embedded multi-processor bus-based systems. The alternative featured most prominently in the proposed systems is large switching networks.
The performance of large switching fabrics has been studied within the context of message passing multi-processing computers [1]. However, these studies were based on telecommunications traffic patterns, which differ from those expected in the trigger systems of HEP experiments. In the latter, switching fabrics will be required to connect N sources to M destinations, where M and N are of order 250. The traffic pattern will be `bursty' and have different characteristics for level II and III triggers. The modeling of large switching fabrics under such traffic patterns is being performed by several groups [2,3,4]. We report here on the actual performance measurements of a switching fabric built from C104 packet routing chips [5].
The serial technology used and the general communications performance of the network are briefly discussed, followed by the performance of the network under expected HEP data traffic patterns.
... project has developed two bi-directional link protocols which form the basis of the IEEE P1355 standard [6]: a 100 Mbit/s Data-Strobe (DS) link and a 1-3 Gbit/s High Speed (HS) link. The work reported here is based on the DS link protocol depicted in figure 1. The data line carries the data, and the strobe line changes state only when the data remains constant. In this protocol the clock is encoded, enabling autobauding at the receiver and asynchronous links. Studies on the reliability of DS links [7], up to distances of 20 m, have measured a bit error rate of less than 10^-14.
On top of the bit level there are a further three levels of protocol: the character, exchange and packet levels. Characters are groups of consecutive bits used to represent data or control information. The exchange layer describes the exchange of characters to ensure the proper function of a link. In particular, flow control characters are used to enable traffic flow from the link sender. This ensures that the switching fabric is lossless: no characters are lost internally due to buffer overflow. A packet is a sequence of characters with a specific order and format: a header, which contains routing information, a payload containing zero or more data bytes, and an end of packet marker. The protocol does not specify a packet size. Messages are sent through a network as a sequence of packets.
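The Data-Strobe rule described above can be illustrated with a minimal sketch. It assumes that both lines start at the low level and that the receiver recovers its clock as the exclusive-or of the two lines; the function names are hypothetical and the character and exchange levels are not modelled.

```python
def ds_encode(bits, data0=0, strobe0=0):
    """Return the (data, strobe) line levels for a sequence of bits.

    Assumption: both lines start low (data0 = strobe0 = 0).
    """
    data, strobe = [], []
    prev_d, prev_s = data0, strobe0
    for b in bits:
        # The strobe line toggles only when the data line keeps its level.
        s = prev_s ^ 1 if b == prev_d else prev_s
        data.append(b)
        strobe.append(s)
        prev_d, prev_s = b, s
    return data, strobe


def recovered_clock(data, strobe):
    """Exactly one line changes per bit, so data XOR strobe toggles every bit."""
    return [d ^ s for d, s in zip(data, strobe)]


bits = [0, 1, 1, 0, 0, 0, 1, 0]
data, strobe = ds_encode(bits)
clock = recovered_clock(data, strobe)
```

Because exactly one of the two lines changes in every bit period, the recovered clock alternates regardless of the data content, which is why the receiver needs no separate clock line.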
A family of communication devices has been developed to support the DS link protocol; these include a parallel-DS link converter and an asynchronous packet routing chip. The DS link protocol is also used by the T9000 Transputer. These components have been discussed in detail elsewhere [8,9] and only the packet routing chip is described here.
Worm-hole Routing
The implementation of worm-hole routing minimises the latency and buffering requirements of the C104 compared to switches using message buffering with store and forward techniques. The routing decision is made as soon as the packet header arrives; the header is subsequently sent to the chosen output link and the rest of the packet follows without being internally buffered. This implies that packets may be passing through several layers of C104s at the same time. The header, when passing through each device, creates a temporary circuit (the worm hole) through which the rest of the packet flows. As the end of the packet passes through each device the circuit closes. The packet latency across a C104 has been measured to be 1 µs.
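The latency advantage of worm-hole routing over store and forward can be sketched with a simple first-order model. This is an illustration under stated assumptions, not a measurement: the 10 Mbytes/s link rate is a hypothetical round number, contention and flow control are ignored, and only the 1 µs switch latency and the 32 byte packet size are taken from the text.

```python
def transfer_time_us(packet_bytes, link_mbytes_per_s):
    """Time to clock a packet onto one link: bytes / (Mbytes/s) gives microseconds."""
    return packet_bytes / link_mbytes_per_s


def store_and_forward_latency_us(packet_bytes, n_switches, link_mbytes_per_s,
                                 switch_latency_us=1.0):
    """Each switch buffers the whole packet, so it is retransmitted on every link."""
    t = transfer_time_us(packet_bytes, link_mbytes_per_s)
    return (n_switches + 1) * t + n_switches * switch_latency_us


def wormhole_latency_us(packet_bytes, n_switches, link_mbytes_per_s,
                        switch_latency_us=1.0):
    """Only the header waits for each routing decision; the body streams behind it."""
    t = transfer_time_us(packet_bytes, link_mbytes_per_s)
    return t + n_switches * switch_latency_us


# A 32 byte packet crossing three switches at an assumed 10 Mbytes/s per link.
sf = store_and_forward_latency_us(32, 3, 10.0)
wh = wormhole_latency_us(32, 3, 10.0)
```

In this model the store and forward latency grows with the product of packet length and path length, whereas the worm-hole latency grows with their sum.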
Grouped Adaptive Routing
In any multi-stage network, efficient load balancing is a primary concern. The support of grouped adaptive routing by the C104 minimises the effects of load imbalance [1]. If consecutive links of a C104 are used to access a common destination, they may be logically `bundled' together, allowing a higher bandwidth to the common destination. Any packet destined for the final common destination is routed through the first available link in the `bundle'.
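The selection rule for a bundle can be sketched as follows. This is a minimal illustration, not the C104 logic itself: the link numbers and the representation of busy links as a set are assumptions made for the example.

```python
def route_grouped(bundle, busy_links):
    """Return the first idle link of a bundle, or None if every link is busy.

    bundle     -- consecutive output link numbers leading to the same destination
    busy_links -- set of links currently carrying a packet
    """
    for link in bundle:
        if link not in busy_links:
            return link
    return None  # the packet waits until a link in the bundle becomes free


bundle = [4, 5, 6]                              # hypothetical link numbers
first_choice = route_grouped(bundle, set())
second_choice = route_grouped(bundle, {4})
blocked = route_grouped(bundle, {4, 5, 6})
```

A packet is delayed only when all links of the bundle are occupied, which is why the available bandwidth to the common destination grows with the size of the bundle.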
Universal Routing
It is not practical, or in some cases possible, to increase the number of links in a `bundle' to match the bandwidth requirements to a destination. As a complementary method of load balancing, the C104 has been designed to support universal routing [10]. In this technique, packets are first sent by a C104 to a random device (another C104) in a multi-stage network. At this device, the packets are then routed to their final destination. The initial randomisation balances the load across the network, reducing link contention and therefore minimising the formation of hot spots [c].
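The two-phase idea behind universal routing can be sketched in a few lines. This is an illustration only: the switch and destination labels are hypothetical, and the header handling performed by the C104 is not modelled.

```python
import random
from collections import Counter


def universal_route(destination, middle_switches, rng):
    """Two-phase route: a randomly chosen intermediate switch, then the destination."""
    intermediate = rng.choice(middle_switches)
    return [intermediate, destination]


def middle_stage_load(destinations, middle_switches, rng):
    """Count how many packets cross each intermediate switch."""
    load = Counter()
    for dest in destinations:
        load[universal_route(dest, middle_switches, rng)[0]] += 1
    return load


rng = random.Random(1)                  # fixed seed, for repeatability only
middle = ["M0", "M1", "M2", "M3"]       # hypothetical switch labels
# 1000 packets that all target the same destination.
load = middle_stage_load(["D7"] * 1000, middle, rng)
```

Even when every packet has the same destination, the first phase spreads the packets over the intermediate switches; contention can then only arise on the final link to the destination.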
The GP-MIMD Machine
The GP-MIMD machine has been designed as a switching fabric of 1000 DS links interconnecting up to 256 T9000 processors. In its present configuration 38 C104s provide full inter-connectivity between 48 processors. Six motherboards (see figure 2a) each carry 8 T9000s and 5 C104s. Two switch cards, each carrying 4 C104s, provide the inter-motherboard connectivity, corresponding to 25% of the full switching capability. Four independent Clos [11] networks have been implemented to efficiently use the four DS links associated with each T9000 processor.
The basic Clos network of the machine is shown in figure 2b. Each C104 on the left and right represents a single C104 on a motherboard and the switch card
[c] A hot spot is an initial localised bottleneck in a network, which then propagates across the network.
General Communications Performance
The T9000s were divided into 21 sources and 21 destinations and the average uni-directional cross-sectional bandwidth of the machine was measured for two types of traffic pattern. These two traffic patterns give information on the general communications performance. Measurements using HEP trigger traffic patterns are presented in the next section. The T9000s were only used as data producers and consumers; they performed no processing.
In the first of these traffic patterns, sources and destinations are formed into fixed pairs; no two sources send data to the same destination. In this situation, the maximum uni-directional bandwidth allowed by the technology, network and routing algorithm is measured. The achieved uni-directional bandwidth as a function of message size and the extent of grouped adaptive routing (3, 6, 9 or 12 links) for a single network is shown in figure 3a. The measured asymptotic uni-directional bandwidth agrees with the theoretically expected values to within 3%. In addition, the achieved uni-directional bandwidth scales linearly with the extent of grouped adaptive routing between the first and middle stage of the network, showing no overheads due to its implementation. This shows that with no contention on the destination links and the maximum amount of grouped adaptive routing, each network has a uni-directional bandwidth of 108 Mbytes/s.
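The linear scaling with the number of grouped links can be checked with a short calculation. It assumes that the theoretical value for 12 grouped links is the 111 Mbytes/s quoted later in the text as the maximum theoretical network bandwidth, and that the expected bandwidth is strictly proportional to the number of links; the per-link figure is derived from these assumptions rather than quoted.

```python
THEORETICAL_12_LINKS = 111.0   # Mbytes/s, maximum theoretical network bandwidth
MEASURED_12_LINKS = 108.0      # Mbytes/s, measured with 12 grouped links
PER_LINK = THEORETICAL_12_LINKS / 12   # 9.25 Mbytes/s per link (derived)


def expected_bandwidth(n_links, per_link=PER_LINK):
    """Expected uni-directional bandwidth, assuming strictly linear scaling."""
    return n_links * per_link


expected = {n: expected_bandwidth(n) for n in (3, 6, 9, 12)}
agreement = MEASURED_12_LINKS / THEORETICAL_12_LINKS
```

Under these assumptions the measured 108 Mbytes/s corresponds to about 97% of the expected value, consistent with the stated agreement to within 3%.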
In the second traffic pattern, messages are sent from each of the 21 sources to the 21 destinations randomly. The achieved uni-directional bandwidth as a function of message size and grouped adaptive routing is shown in figure 3b. A significant degradation in the uni-directional bandwidth compared to the paired traffic pattern can be seen. For 6 and 12 grouped links only 61% and 38% of the uni-directional theoretical bandwidth is achieved. Although the bandwidth achieved with 12 links is greater than with 6 links, there is lower utilisation of the available bandwidth. This indicates that under this traffic pattern, contention on the final destination links limits the performance.
HEP Traffic Patterns
In our terminology, an event consists of data blocks, which are defined as the traffic going to a single destination processor. The level II data block corresponds to a subset of a sub-detector's event data and a level III data block corresponds to the total event data of a sub-detector. A data block consists of a number of fragments. Each fragment emanates from a different source and is sent as an individual message. An example of the level II trigger traffic pattern for a sub-detector is shown in figure 4a for two events. The first event consists of 2 data blocks: each data block is distributed over two sources and routed to a separate destination. The second event consists of 3 data blocks: one data block is distributed over 3 sources and 2 data blocks originate from a single source.
The level III trigger traffic pattern is shown in figure 4b. Event 1(2) consists of a single data block which is distributed over 4(3) sources and routed to a single destination. The currently expected level II and III traffic patterns of a sub-detector in the ATLAS experiment [3] at CERN are summarised in table 1. In our model, T9000 processors act as sources and destinations of event data. No computation is performed by the destinations. Measurements have been carried out using pre-determined event sequences, stored by the sources as look-up tables.
The performance, in terms of achievable event rates and bandwidths, of a single network under these types of traffic patterns has been measured. The total performance of the machine is four times the single network result. The number of source and destination processors used is typically 18 and the maximum amount of grouped adaptive routing (12 links) is applied between the first and second C104 layers. Data blocks are distributed equally over the number of participating sources. No studies have been performed with varying packet size, as the T9000 supports only packets of up to 32 bytes.
Level II Triggering Performance
The achieved event rate and bandwidth, for a configuration of 18 sources and 18 destinations, as a function of the attempted event rate are shown in figures 5a and 5b for different data block sizes. In these measurements the number of data blocks per event and sources per data block have been fixed to 3 and 1 respectively. To model the level 1 trigger, a random selection is made as to which three sources participate per event. The allocation of data blocks to destination processors per event is based on a round robin selection. At the maximum achievable event frequencies, 70 Mbytes/s out of the maximum theoretical network bandwidth of 111 Mbytes/s is achieved.
The loss of bandwidth is due to the random selection of which sources participate per event. Sources may have to participate in a number of consecutive events, causing some events to be delayed as the temporarily required bandwidth at a source exceeds that available. This queuing of events in the sources results in network under-utilisation: messages are queued on a busy source while other sources may be idle. This effect was demonstrated by selecting sources based on a round robin criterion, thus removing the queuing of consecutive events in the sources. The results in figures 5c and 5d for this situation show a 97% bandwidth utilisation and a corresponding increase in event rate, confirming event queuing in the sources.
In figure 6 the achieved event frequency and bandwidth as a function of the number of data blocks per event are shown for different numbers of sources per data block and a fixed data block length of 1.0 Kbytes. The achieved bandwidth across the switch is independent of these parameters and the achieved event frequency decreases as the number of data blocks per event increases.
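The effect of the source selection on utilisation can be illustrated with a toy load model. It is a sketch under strong assumptions and does not reproduce the measured figures: every data block costs a source one time unit, all sources work in parallel, and the busiest source determines when the whole event sequence is finished. The function names and the event count are hypothetical.

```python
import random


def source_loads(n_events, n_sources, pick):
    """Number of data blocks each source has to send over a whole event sequence."""
    loads = [0] * n_sources
    for event in range(n_events):
        for src in pick(event):
            loads[src] += 1
    return loads


def utilisation(loads):
    """Fraction of source time used when the busiest source sets the finish time."""
    return sum(loads) / (len(loads) * max(loads))


def random_picker(n_sources, k, rng):
    """k distinct sources chosen at random for every event."""
    return lambda event: rng.sample(range(n_sources), k)


def round_robin_picker(n_sources, k):
    """k consecutive sources per event, advancing cyclically."""
    return lambda event: [(event * k + i) % n_sources for i in range(k)]


rng = random.Random(0)  # fixed seed, for repeatability only
u_random = utilisation(source_loads(600, 18, random_picker(18, 3, rng)))
u_round_robin = utilisation(source_loads(600, 18, round_robin_picker(18, 3)))
```

With round robin selection every source receives the same number of data blocks, so no source time is wasted in this model; with random selection some sources receive more than their share and the others idle while the backlog clears.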
The performance of the network, as a function of the number of active ... respectively. The full capacity of the machine with these parameters is therefore 400 and 80 kHz respectively.
The results presented have used the ability of the T9000 to efficiently multiplex virtual links [g] onto a single physical link. In figure 8, the bandwidth and achieved event frequencies for different numbers of virtual links are shown. It can be seen that the achieved event frequency depends on the number of virtual links and is constant for more than 5 virtual links per physical link. In the case of a single virtual link the end-to-end packet flow control limits the achievable bandwidth, as a source must wait for a packet acknowledgment from a destination. This effect is reduced by using multiple virtual links; while waiting for a packet acknowledgment on one virtual link, a source may send a packet on another virtual link to another destination. In these measurements 5 virtual links have been used.
[g] A virtual link is a logical communication connection between two processes.
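The saturation with the number of virtual links can be illustrated with a simple stop-and-wait model. This is a sketch under assumptions, not the measured behaviour: each virtual link is taken to have one 32 byte packet outstanding per acknowledgment round trip, the physical link rate is assumed to be about 9.25 Mbytes/s (111 Mbytes/s over 12 links), and the 15 µs round-trip time is a hypothetical value chosen only for illustration.

```python
def throughput_mbytes_per_s(n_virtual_links, packet_bytes=32,
                            round_trip_us=15.0, link_mbytes_per_s=9.25):
    """Achievable rate on one physical link shared by several virtual links.

    Each virtual link sends one packet per acknowledgment round trip, so the
    offered rate grows with the number of virtual links until the physical
    link itself becomes the limit. bytes / microseconds gives Mbytes/s.
    """
    offered = n_virtual_links * packet_bytes / round_trip_us
    return min(offered, link_mbytes_per_s)


rates = [throughput_mbytes_per_s(k) for k in range(1, 9)]
```

In this model the rate rises in proportion to the number of virtual links and then stays constant once the physical link is saturated, which is the qualitative behaviour described above.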
Level III Trigger Performance
The achieved event frequency as a function of the number of sources per data block, in a network of 21 × 14, is shown in figure 9a for 3 different data block sizes. The achieved event rate varies linearly with the number of sources per data block, due to a better utilisation of the network. At the maximum input utilisation (21 sources per data block) for the 3 data block sizes considered, 50 Mbytes/s out of a theoretical maximum bandwidth of 111 Mbytes/s is achieved.
In figure 9b the communication latency per source as a function of event number is shown for three of the sources in the above configuration. For the first event, all sources are competing for the same destination link, resulting in a high latency for each source. The latency then decreases with event number, settling down to a steady state, as different sources are sending different events. The fact that sources are working on different events implies that different destinations are being used in parallel and therefore less contention occurs on the destination links. This effect is also shown by the dotted curves in figure 9a. In this figure the sources have been selected, where possible, by a round robin criterion. For example, in the case of 5 sources per data block, the first five sources send data block 0 and the second five sources send data block 1. This continues up to data block 3 and then the first five sources send data block number 4. This artificially spreads the load across the number of sources on an event basis and results in an improved event rate for 5 and 10 sources per data block.
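The round robin assignment of sources to data blocks described above can be sketched as follows. The zero-based source numbering and the function name are assumptions; with 21 sources and 5 sources per data block, the sketch leaves the remaining source unused, which the text does not specify.

```python
def sources_for_block(block, n_sources=21, per_block=5):
    """Sources that send a given data block under round robin selection."""
    n_groups = n_sources // per_block   # number of complete groups of sources
    group = block % n_groups            # wrap around after the last group
    start = group * per_block
    return list(range(start, start + per_block))


assignment = {block: sources_for_block(block) for block in range(5)}
```

Data blocks 0 to 3 are sent by four disjoint groups of five sources, and data block 4 returns to the first group, so consecutive events are handled by different sources.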
In order to investigate the effect of communications latency on performance, the message latency in the sources has been artificially increased. The results are shown in figure 10a, for 21 sources per data block and 14 destinations. There is a very strong dependence of performance on message latency. An efficient node to network interface, together with a lightweight communication protocol, is vital to efficiently utilise the available network bandwidth.
The network performance scales linearly with the number of active sources (= destinations); see figure 10. This is consistent with the design specification of the GP-MIMD machine's communications network, which can be scaled up to 1000 links.
Summary and Outlook
Measurements have been presented on the performance of a 40 node Clos switching fabric. This fabric is based on the C104 packet routing chip and the DS link protocol, which is part of the IEEE 1355 standard. The DS link and the associated packet routing chip, the C104, have been shown to meet their design specifications. In the present configuration of the GP-MIMD machine, only 25% of the full switching capacity is used and only 20% of the maximum number of processing nodes are installed. By the end of '95 the switching fabric will be upgraded to 4 switch cards and the number of supported processing nodes increased to 64. The results presented here will be used to calibrate models of C104 networks. These models will then be used to perform detailed simulations of the networks required for future large HEP experiments.
Under the currently expected traffic patterns of sub-detectors in the level II ...
These results have been obtained under the following conditions: a single network (out of the available four) in use; using grouped adaptive routing on the C104s; redundant links on the C104s; using the T9000's ability to multiplex virtual links onto a single physical link; software written in OCCAM [12], the native language of Transputers.
It appears that currently available technology can meet many of the switching and bandwidth requirements of future HEP trigger systems. This would be achieved by scaling or replicating the Clos networks used in these measurements. By adding layers of C104s, the Clos network architecture may be arbitrarily scaled. However, networks that allow the bandwidth to be scaled with the number of nodes require a more than proportional increase in the number of switches. Replication of the network is potentially a more efficient use of resources, but may require that data sources be able to simultaneously drive several networks.
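The more than proportional growth in the number of switches can be illustrated with the standard count for a folded Clos network with full bisection bandwidth, built from 32-way switches. This is a generic textbook topology used as an assumption; it is not the GP-MIMD network, which groups links differently.

```python
def folded_clos(levels, radix=32):
    """Nodes and switches of a full-bandwidth folded Clos network.

    Every switch below the top level uses half of its links towards the nodes
    and half towards the next level; top-level switches use all links downwards.
    """
    half = radix // 2
    nodes = 2 * half ** levels
    switches = (2 * levels - 1) * half ** (levels - 1)
    return nodes, switches


sizes = {levels: folded_clos(levels) for levels in (1, 2, 3)}
switches_per_node = {levels: s / n for levels, (n, s) in sizes.items()}
```

In this topology the number of switches per node rises with every added level, from 1/32 for a single switch to 3/32 and 5/32 for two and three levels, so the switch count grows faster than the number of nodes.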
Other questions remain to be addressed: What is the dependence of network utilisation, scalability and latency on network topology? What are the cost and practical advantages of many smaller or lower speed (replicated) networks versus fewer large or higher speed networks? What is the effect of different packet sizes on network performance? What are the issues involved in network management, monitoring, error handling and fault tolerance? How do the currently available serial interconnects compare? To address these questions a large 1000 node variable topology switching fabric [8] is being built at CERN as part of the MACRAMÉ project and should start to provide results in the summer of 1996.
