Abstract
Introduction
Computers in a heterogeneous computing environment have to communicate with each other to exchange information. Since each computer may have its own data format, a suitable transfer syntax is required to enable the exchange of data. The presentation layer of the OSI protocol standard includes a transfer syntax conversion step to handle the data transformation. In a gigabit network, the processing power required can be as much as 2500 mips, assuming that 20 instructions [4] are needed to process a byte of data. If we would like to make use of all the communication bandwidth available, this computational power cannot easily be supplied by a single processor, nor by the application specific architectures in a heterogeneous system, which may not be suitable for performing protocol processing functions. In our study, we have chosen a stand alone multiprocessor system based on moderate speed processors to perform the necessary protocol processing functions from the datalink layer up to the presentation layer in the OSI model.
This work was partially supported by a scholarship and a grant (58-0980) from the Natural Science and Engineering Research Council of Canada.
A processor per packet design is adopted [1]. With this approach, a packet is processed by a single processor, but multiple processors can work concurrently to increase the system throughput. Besides freeing up the host computer, an added advantage of a multiprocessor system is that the system can be made more fault tolerant.
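The 2500 mips estimate follows directly from the line rate and the per-byte instruction count. The short Python sketch below reproduces that arithmetic; the function names are illustrative only, and the 40 mips processor speed is the value assumed for the designs in the Summary. The processor count it prints is a lower bound that ignores bus contention and sequencing overhead.

```python
import math

def required_mips(line_rate_bps: float, instr_per_byte: float) -> float:
    """Processing power (in mips) needed to keep up with a given line rate."""
    bytes_per_second = line_rate_bps / 8
    return bytes_per_second * instr_per_byte / 1e6

def min_processors(total_mips: float, mips_per_processor: float) -> int:
    """Lower bound on processor count; ignores contention and idle time."""
    return math.ceil(total_mips / mips_per_processor)

need = required_mips(1e9, 20)      # 1 Gbit/s, 20 instructions per byte [4]
print(need)                        # 2500.0
print(min_processors(need, 40))    # 63
```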
Previous work
Different approaches have been taken by researchers to tackle the high processing power requirement of OSI protocol processing. One approach is to speed up the transfer syntax conversion operations [2] [3], as these are the most CPU cycle consuming operations needed for protocol processing. Another approach is to increase the overall throughput by parallelizing the processing [5]. Takeuchi [4] proposed a promising design for which a processing throughput of up to 100MBits/s has been simulated, but the design suffers from a low utilization of the processors and thus does not achieve a linear speedup with the number of processors. The designs proposed in this paper solve some of the problems observed in Takeuchi's [4] design and handle the needs of higher speed networks operating at 500MBits/s or more.
Protocol processing overview
Packets normally have to be processed in sequence, especially in the upper protocol layers. Based on classical queueing theory, a single queue, multiple server design is the most suitable for a multiple processor protocol processing system in terms of delay time characteristics. A bus based architecture keeps the cost low and makes it justifiable to use a stand alone protocol processing system. The main problem with a bus based design is bus contention. In our study, we have attempted to deal with the bus contention problem by exploiting the principle of spatial locality of reference and by prefetching, both of which reduce the amount of false sharing of data.
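The single queue, multiple server arrangement with in-sequence delivery can be sketched as a small deterministic simulation. This is a minimal illustration under simplifying assumptions (no bus contention, known service times), not the simulation model used in the study; the function name `simulate` and its arguments are hypothetical.

```python
import heapq

def simulate(arrivals, service_times, n_processors):
    """Serve packets FIFO from one queue on n_processors identical processors.

    Returns (finish, release):
      finish[i]  - time packet i completes processing on its processor
      release[i] - time packet i may leave, given in-sequence delivery
    """
    free_at = [0.0] * n_processors      # min-heap of processor free times
    heapq.heapify(free_at)
    finish, release = [], []
    last_release = 0.0
    for arrive, service in zip(arrivals, service_times):
        # The head-of-queue packet goes to the processor that frees up first.
        start = max(arrive, heapq.heappop(free_at))
        done = start + service
        heapq.heappush(free_at, done)
        finish.append(done)
        # A packet cannot be released before all earlier packets are released.
        last_release = max(last_release, done)
        release.append(last_release)
    return finish, release

# One long packet followed by three short ones, two processors.
finish, release = simulate([0, 0, 0, 0], [4, 1, 1, 1], 2)
print(finish)   # [4.0, 1.0, 2.0, 3.0]
print(release)  # [4.0, 4.0, 4.0, 4.0]
```

In this example the short packets finish early but are held until the long packet ahead of them completes, which illustrates how the sequencing requirement can limit the benefit of parallel processing.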
The shared memory design
A simple shared memory (SSM) architecture has been analyzed to identify the possible bottleneck of this design in a protocol processing application. The shared memory architecture that we are proposing consists of dual ported main memory and two buses. The random access ports of the memory devices are connected by the random access bus while the burst access ports of the memory devices are connected by the burst access bus. The random access port is used for accessing instructions and packet data, while the burst access port is used for moving packets between the global shared memory and the external interface to the host computer and the network.
Modifications are made to this basic design to maximize the system throughput. In this paper, three different designs are discussed.
The different designs are evaluated using analytic models and simulations written in SIMSCRIPT II.5. To estimate the maximum system throughput, the following notation is used:
and since the data bus can only support a certain amount of traffic. There are other factors which will affect the throughput, but we have restricted ourselves to these three, which our study found to be the major factors affecting the system throughput.
The local instruction memory (LIM) design
In the LIM design, we studied the effect of keeping instructions in the local memory instead of sharing them in the global shared memory. With this approach, the shared memory is only used for storing packets, connection and system state information. The architecture of the LIM design is shown in figure 1. The only modification from a simple shared memory design is the addition of local memory to each of the processors. The local memory is connected to its processor using the random access bus and is only accessible by that processor. From our results, we found that although instructions are kept in the local memory, a fair number of global shared memory accesses are still needed for the packet data, and these accesses result in an imbalanced utilization of the two buses. The resulting bus contention makes it impossible to obtain a high system throughput, as shown in figure 2: the random access bus saturates and processors have to wait for their turn to access data.
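The saturation effect can be illustrated with a rough bound in which throughput is limited either by the aggregate processing power or by the random access bus, whichever is smaller. This is only a sketch and not the analytic model used in the study; in particular, `bus_accesses_per_byte` is a hypothetical parameter, and the value 0.5 below is chosen purely for illustration.

```python
def max_throughput_bps(n_proc, mips, instr_per_byte,
                       bus_accesses_per_byte, bus_cycle_s):
    """Upper bound on throughput (bits/s): min of CPU limit and bus limit."""
    cpu_limit = n_proc * mips * 1e6 / instr_per_byte           # bytes/s
    bus_limit = 1.0 / (bus_accesses_per_byte * bus_cycle_s)    # bytes/s
    return 8 * min(cpu_limit, bus_limit)

# 40 mips processors, 20 instructions per byte, 200ns random access cycle,
# and an assumed 0.5 shared memory accesses per byte of packet data.
for n in (1, 2, 5, 10, 20):
    rate = max_throughput_bps(n, 40, 20, 0.5, 200e-9)
    print(n, round(rate / 1e6), "Mbit/s")
```

With these assumed values the bound grows linearly up to five processors and then stays flat at about 80 Mbit/s, because the bus rather than the processors becomes the limit.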
The packet relocation designs

The simple packet relocation (SPR) design
The packet relocation design reduces the random access bus traffic by dealing with the false sharing problem of packets in the global shared memory. Since the transfer syntax conversion of a packet is performed by a single processor, the packet need not be shared at that time.
The architecture of this design is shown in figure 3 .
Dual ported local memory devices are used for both the global and local memory. The burst access port in the local memory allows packets to be copied into the local memory at a high speed.
In the presentation layer where there is potentially a high number of accesses to the packet under processing, the packet is copied from the shared memory to the local memory using block transfer on the burst access bus. By fetching the data to the local memory, we can reduce the amount of random access traffic which is the bottleneck in the previous design. In this design, a tradeoff is made to use the burst access bus instead of the random access bus for processing the packet. Since burst access is faster than random access in normal memory devices, the performance of the system can be significantly improved as derived from our analytical model.
Our simulation results show that the extra data movements needed for the packet relocation design do not affect the processing latency. With a 50ns burst access cycle and a 32 bit bus, the extra time needed for copying a packet with a length of 10000 bytes is found to be about two orders of magnitude smaller than the average protocol processing time for the packet.
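The copy time itself can be estimated from the bus width and the burst cycle time. The sketch below assumes one bus-wide transfer per burst cycle and ignores arbitration and burst setup overhead; the function name is illustrative.

```python
import math

def burst_copy_time_s(packet_bytes, bus_width_bits, burst_cycle_s):
    """Time to copy a packet over the burst access bus, one word per cycle."""
    transfers = math.ceil(packet_bytes * 8 / bus_width_bits)
    return transfers * burst_cycle_s

t = burst_copy_time_s(10000, 32, 50e-9)
print(f"{t * 1e6:.1f} microseconds")   # 2500 transfers, about 125 microseconds
```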
The burst access bus traffic consists of packets that are copied from the global shared memory to the local memory and also packets that are copied between the global shared memory and the interface to the host and the network. From figure 4, it can be seen that the bottleneck of this design is the burst access bus, due to the extra data movement needed. Therefore, if burst access is slow because of slow memory devices or frequent crossings of memory device boundaries, the possible improvement in system throughput could decrease. In addition, shared resources like the shared memory and the burst access bus are consumed even before processing begins, when the packets are moved from the interface to the global shared memory. Further improvements can therefore be made to increase the system throughput and fault tolerance.
The improved packet relocation (IPR) design
The IPR design reduces the amount of resource usage before actual processing starts. The improved packet relocation design is basically the same as the SPR design, with the addition of a large interface buffer so that packets are kept in the interface as long as possible before processing starts. In the previous designs, a packet is moved from the interface to the shared memory as soon as the complete packet arrives.
The IPR design keeps the use of shared resources to a minimum before a packet is actually processed. By keeping the packets in the interface buffer, we can also reduce the number of times that a packet needs to be copied and further reduce the false sharing problem. Packets waiting to be processed can be moved directly from the interface buffer to the local memory of a processor without going through the shared memory. Our analysis also indicates that we can achieve a higher fault tolerance and a higher throughput without introducing any undesirable effects when compared to the simple packet relocation design. Since we are interested in gigabit networks, an IPR' design has been simulated. This design is the same as the IPR design except that it uses faster memory devices, with a 190ns random access cycle and a 30ns burst access cycle, and 64 bit buses. Figure 5 shows the resource utilizations of this design.
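To see what the faster memory and wider buses of the IPR' design buy, the peak transfer rate of each bus can be computed from its width and cycle time. The sketch below assumes one bus-wide transfer per cycle, so the figures are upper bounds rather than simulated throughputs; the 32 bit, 200ns and 50ns values are the baseline parameters stated in the Summary.

```python
def peak_bandwidth_mbit_s(bus_width_bits, cycle_ns):
    """Peak bus bandwidth in Mbit/s, assuming one transfer per cycle."""
    return bus_width_bits / cycle_ns * 1000   # bits per ns -> Mbit/s

configs = {
    "IPR  burst  (32 bit, 50ns)":  (32, 50),
    "IPR  random (32 bit, 200ns)": (32, 200),
    "IPR' burst  (64 bit, 30ns)":  (64, 30),
    "IPR' random (64 bit, 190ns)": (64, 190),
}
for name, (width, cycle) in configs.items():
    print(name, round(peak_bandwidth_mbit_s(width, cycle)), "Mbit/s")
```

Under this assumption the peak burst bus bandwidth rises from about 640 Mbit/s to about 2133 Mbit/s, while the random access bus improves only from about 160 Mbit/s to about 337 Mbit/s.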
Summary
In all our designs, the processor utilizations are kept at 0.8 or higher even when the system saturates. The use of local memory and a prefetching scheme makes it possible to increase the throughput of the system with minimal cost. Since the price of memory devices in terms of cost per bit is expected to drop with the advent of VLSI technology, the system proposed should be realizable with an affordable price tag. Table 1 summarizes the maximum throughput and major characteristics for the various designs. All systems are assumed to have 40mips processors, 32 bit buses and memory devices with a 200ns random access cycle and a 50ns burst access cycle, except for the IPR' design, where faster memory and wider buses as compared to the IPR design are used.
Table 1: Summary of system characteristics
Figure 6 shows the maximum system throughput of the various designs versus the number of processors. It can be observed that good speedup behaviour is obtained. In particular, the IPR design gives the highest system throughput. It is expected that with faster memory devices, an even higher throughput can be obtained. Since the system speedup behaviour is not sensitive to the number of processors present, graceful degradation in performance can be achieved when one or more processors fail.
Conclusions
In this paper, we have described three multiprocessor based OSI protocol processing systems. These systems are designed to minimize the utilization of shared resources using techniques similar to those used for cache prefetching. Our simulation results indicate that the OSI protocol stack is promising in a high performance heterogeneous computing environment using a gigabit network, despite its computation intensive operations. The bus based architectures that we have proposed are found to be sufficient for handling a network rate of over 560MBits/s without incurring a high hardware cost.
We have found that using fast processors alone, without matching main memory, is not enough to provide the processing power needed for protocol processing in a high speed networking environment. The constant flow of packets through the system makes data caches ineffective in bridging the growing gap between processor and memory speed. A suitable memory access architecture is needed to ensure that the sharing of information can be accomplished without slowing down any processors.
The need to process packets in sequence reduces the amount of parallelism that can be incorporated into the system. The sequential processing requirement is found to be one of the major bottlenecks, which also include bus speed, memory speed, number of active connections, packet lengths and transfer syntax complexities. Operations like reassembly of packets are found to slow down the processing of received packets, and operations of this type should be revised to enable a better use of the resources. Instead of using finite state machines, a parallel protocol specification could be used to enable a higher degree of parallelism in gigabit network protocols.
The architectures presented in this paper are suitable for applications where a large amount of processing is needed for a block of data which stays in the system for a short time. Therefore, besides being efficient for carrying out transfer syntax conversions, the proposed architectures can also be adapted for data packet encryption or decryption, which also requires a lot of computation per byte of data.
