The performance enhancement of communication protocols is an essential step toward building high performance communication subsystems. Integrated Layer Processing (ILP) has been proposed as an engineering principle for the optimization of communication protocol implementations. However, the applicability of this principle to complete operational communication subsystems has not yet been addressed. This is in part due to the lack of high performance implementations of presentation and transport functions. In this paper, we show to what extent "ILP based" optimizations improve the performance of some data manipulation functions (e.g. presentation encoding or checksumming) independently. We present how some of these implementation optimization techniques were applied to the ASN.1 encoding rules. We also analyze the impact of such optimization techniques according to the supporting hardware. A prototype implementation of the XTP protocol is also described. ILP based optimizations applied to the checksum calculation algorithm show that an important performance enhancement can be achieved. These results should facilitate the support of ILP within an operational communication subsystem.
1 Introduction
The emergence of high speed and low error rate networks motivated much research work on the design of more efficient communication protocols. This includes the tuning of existing transport protocols (e.g. TCP extensions for high speed paths [Jac90], TP4 1992 revision [CCI92], and other related work [Wat87], [Col85]) and the design of new transport protocols (e.g. XTP [PEI92]). At the presentation level, this concerns the performance optimization of the presentation coding and decoding routines, either by defining new "light weight transfer syntaxes" [Hui90] or by implementation enhancements of the standard ASN.1 Basic Encoding Rules (BER) routines [Dab92], [Hui92], [Lin93].
Good implementation techniques represent one of the most important factors in determining the performance of a given protocol [Cla89]. These techniques depend on the environment more than on the protocol itself. The proposed solutions focused on the enhancement of the protocol implementation performance in a given software or hardware environment: outboard protocol processors (e.g. [Kan88], [Coo90]), hardware protocol implementations (early work on XTP/Protocol Engine [Che89]) or parallel implementations of transport protocols [Bra92], [Rüt92], [Lap92], [Bjö93]. A detailed survey of protocol implementation optimization techniques can be found in [Dab91] and [Fel93b].
The performance of workstations has increased with the advent of modern RISC architectures, but not at the same pace as the network bandwidth during the past years. Furthermore, access to primary memory is relatively costly compared to cache and registers, and the discrepancy between processor and memory performance is expected to get worse. Memory access is expected to represent a bottleneck [Dru93].
Protocol processing can be divided into two parts: control functions and data manipulation functions. Examples of data manipulation functions are presentation encoding, checksumming, encryption and compression. The control part contains functions for header and connection state processing. Jacobson et al. have demonstrated that, with appropriate implementations, the control part processing can match gigabit network performance for the most common PDU sizes [Cla89].
However, data manipulation functions present a bottleneck [Cla90], [Gun91]. They consist of two or three phases: first a read phase where data is loaded from memory to cache or registers, then a manipulation or "processing" phase, followed by a write phase for some functions, e.g. presentation encoding. For very simple functions, e.g. checksumming or byte swap, the time to read from and write to memory dominates the processing time. For other processing oriented functions, like encryption and some presentation encodings, the manipulation time dominates with current processor speeds. However, the situation is expected to change with the increase of processor performance: memory access, rather than data processing, will be the major bottleneck.
The data manipulation functions are spread over different layers. In a naive protocol suite implementation, the layers are mapped into distinct software or hardware entities which can be seen as atomic entities. The functions of each layer are carried out completely before the protocol data unit is passed to the next layer. This means that the optimization of each layer has to be done separately. Such ordering constraints are in conflict with the efficient implementation of data manipulation functions [Wak92], [Cla90].
One could blame the layered model (TCP/IP or OSI) for causing this conflict. In fact, the operations of multiplexing and segmentation both hide vital information that lower layers need to optimize their performance. Although it is important to distinguish between the architecture of a protocol suite and the implementation of a specific end system or relay node, a certain degree of flexibility is needed in the way the functions are organized within the layers. One solution is to integrate all layers in a single block, in essence discarding the modularity of layers in exchange for performance. An alternative is to perform multiplexing/demultiplexing and segmentation/reassembly exactly once in the protocol stack. Another approach is to perform multiplexing/demultiplexing and reassembly exactly once in the protocol stack and allow multiple segmentations to occur (see [Fel93a]).
Integrated Layer Processing (ILP) is an engineering principle that has been suggested for addressing the cited problem [Cla90]. The main concept behind ILP is to minimize costly memory read/write operations by combining data manipulation oriented functions within one or two processing loops instead of performing them serially, as is most often done today. It is expected that the cost reduction due to this optimization will result in better overall performance, as it will reduce time consuming memory accesses. This optimization may be applied within a single data manipulation function (intra-function optimization) or across several functions (inter-function optimization). The interest of the ILP principle (and similar software pipelining principles such as lazy message evaluation and delayed evaluation) has been discussed in [Cla90]. However, the examples cited in these papers correspond to the expected performance gain when some specific data manipulation functions were integrated. They do not address the problem of the design and implementation of a complete operational communication subsystem according to the ILP principle (some issues concerning the design of an entire stack according to ILP are presented in [Abb93a]). As the performance benefits of ILP apply when the memory read and write operations are the bottleneck, previous work reported examples with simple manipulation functions (e.g. checksumming or byte swap). However, the development of complete operational communication subsystems requires the integration of other "processing oriented" data manipulation functions (e.g. presentation encoding) in order to provide the desired service. The optimization of these processing oriented data manipulation functions will result in a double benefit: in addition to their execution at higher speeds, their integration with other manipulation functions will be facilitated, leading to an increased performance gain.
This optimization is now both feasible and attractive due to the increase of processor speeds and of the discrepancy between processor and memory performance.
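The difference between serial and integrated processing can be illustrated with a minimal sketch. It is not taken from the implementation described in this paper: the function names and the choice of a byte copy combined with a simple XOR checksum are assumptions made for illustration only. The serial version traverses the data twice, while the integrated version touches each byte once.

```python
def serial_copy_then_checksum(src):
    """Layered style: each function makes its own pass over the data."""
    dst = bytearray(len(src))
    for i, b in enumerate(src):      # pass 1: copy
        dst[i] = b
    checksum = 0
    for b in dst:                    # pass 2: checksum, re-reads every byte
        checksum ^= b
    return bytes(dst), checksum


def integrated_copy_and_checksum(src):
    """ILP style: copy and checksum share a single loop."""
    dst = bytearray(len(src))
    checksum = 0
    for i, b in enumerate(src):      # one pass: each byte is read once
        dst[i] = b
        checksum ^= b
    return bytes(dst), checksum
```

Both functions return the same result; only the number of traversals differs. In an interpreted language the timing is not representative of the memory effects discussed here, so the sketch shows the loop structure only.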
We worked on the optimization of specific processing oriented data manipulation functions: the ASN.1 presentation encoding and decoding routines. In addition, in order to experiment with integration techniques involving transport level functions, we produced a prototype implementation of the XTP protocol at the user level. The integration of presentation and transport level functions is more easily achievable if the transport is implemented at the user level. We also optimized the XTP checksum implementation, which resulted in high speed checksum calculation. We present the results of our work on the optimization of both presentation and transport level functions, i.e. the presentation encoding and decoding routines and the XTP checksum calculation, which are two examples of intra-function ILP optimizations. The rest of the paper is organized as follows: section 2 presents the work on the enhancement of the presentation routines. In section 3, we describe the XTP implementation and present some performance test results. Section 4 concludes the paper.
2 High speed presentation
Our work was centered around the optimization of the ASN.1 Basic Encoding Rules. The cost of the coding and decoding routines is attributed to the heavy Type-Length-Value oriented coding of ASN.1 BER. This motivated the work on "light weight" or XDR-like transfer syntaxes (LWS) [Hui89], based on three design principles: avoid unnecessary information in the encoding, use a fixed representation when possible, and simplify the mapping of the elements by using fixed length structures. The first principle leads to the abandonment of the systematic "Type-Length-Value" encoding, which is replaced by new encoding rules for the ASN.1 constructs; the second and third principles lead to a set of easily decoded, word oriented encoding rules. The word size is a parameter that characterizes the syntax (16, 32, or 64 bits). The results of performance tests carried out on a Sun 3/60 are given in table 2. The LWS coding routines were 1.42 to 5.85 times faster than those of the BER, depending on the data types. For basic data types, the improvement is considerable (especially for the real type) because of the relative efficiency of the word oriented coding. However, the results for the strings and MPDU types are very disappointing, as they showed only a modest gain in performance for the light weight transfer syntax, while resulting in an encoding more than twice as long.
After this first test series we conducted several optimizations, analyzing the coding routines for a tree structure type in order to facilitate the analysis of the "execution profile". We tested a number of modifications to the MAVROS ASN.1 BER coding and decoding algorithms, among which we cite:
- Inline the encodings for some types ("OCTET STRINGS").
- Apply the "type reduction" technique in order to remove all "repeated indirections", replacing them as much as possible by direct references to the data type.
- Use "header prediction" techniques to speed up the decoding of tags and length fields.
- Speed up the encoding of tags and length fields by systematically using the "indefinite length" encoding form.
- Use static declarations and better memory management (reduce the calls to malloc).
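To make the Type-Length-Value structure and the idea of header prediction concrete, the following sketch encodes and decodes a primitive element with a single-octet tag and a definite length, as in standard BER. The function names are hypothetical, and the prediction test shown here (compare the first two octets with the header of the previously decoded element) is an assumed simplification; it is not the MAVROS implementation, and the indefinite length form is not handled.

```python
def encode_length(n):
    """Definite length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_tlv(tag, value):
    """Tag octet, length octets, then the value octets."""
    return bytes([tag]) + encode_length(len(value)) + value


def decode_tlv(buf):
    """General path: parse tag and length; return (tag, value, remainder)."""
    tag, first = buf[0], buf[1]
    if first < 0x80:
        length, offset = first, 2
    else:
        count = first & 0x7F
        length = int.from_bytes(buf[2:2 + count], "big")
        offset = 2 + count
    return tag, buf[offset:offset + length], buf[offset + length:]


def decode_predicted(buf, predicted_header):
    """Fast path: if the 2-octet short-form header is the one expected,
    skip the general parsing and slice the value directly."""
    if (len(predicted_header) == 2 and predicted_header[1] < 0x80
            and buf.startswith(predicted_header)):
        end = 2 + predicted_header[1]
        return predicted_header[0], buf[2:end], buf[end:]
    return decode_tlv(buf)   # prediction failed: fall back to the general path
```

For example, `encode_tlv(0x04, b"abc")` yields the octets 04 03 61 62 63 (0x04 being the universal tag of OCTET STRING), and `decode_predicted` returns the same triple as `decode_tlv` whether or not the prediction holds.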
Some of these optimizations are "ILP based": they minimize costly memory access operations. Others concern the reduction of the processing operations. These optimizations were incorporated in the code generation programs of the ASN.1 compiler, as well as in the "run-time" library. A new series of tests was conducted, again with the MPDU type, on the Sun3 but also on other workstation types; the performance figures are shown in table 3.
Table 3: Speed of the presentation routines.
The throughput is defined as the ratio of the code size in bits to the decoding time in seconds. In fact, this gives an indication of the speed limit in processing the incoming transport packets. The comparison of BER and LWS will be based on the decoding time instead of the decoding throughput.
Before we analyze these figures, recall that the coding procedure consists in copying the perf-object element structure from memory to the cache, looping on processing instructions and then storing the result (BER or LWS streamlined data) in memory. This cycle is reversed for decoding, but the processing part is longer due to decoding controls and to the need for memory allocation. Therefore, if the overall decoding time is much higher than the coding time, we can deduce that the processing speed is the bottleneck. Otherwise, if the decoding and coding times are comparable, then we may say that the memory access overhead (which is the same for both routines) dominates the processing time. In addition, BER encoding requires more processing than the LWS, which, however, generates a larger code size.
In the case of initial tests on Sun3, it is clear that the processing speed is the limiting factor for both BER and LWS (decoding slower than coding and BER slower than LWS). In this case, it is advantageous to use the LWS.
The optimizations led to a drastic reduction of the BER cost, and the optimized Sun3 (new) figures show that the BER coding time is lower than that of LWS: what is gained in processing for LWS is lost due to the larger size of the data which transits on the bus. However, for the far more CPU consuming routine (decoding), BER is still slower than LWS.
The results are more interesting with high performance RISC workstations. On the SparcStation 10, BER coding is much more efficient than LWS coding, which implies that the memory access dominates the processing time. However, BER decoding is still limited by the CPU performance. Both coding and decoding routines for LWS are limited by the memory access (due to the large code size stored and loaded).
Similar results are obtained on both Dec 3000 and HP workstations. Note, however, that the coding time is the same for both BER and LWS which means that both memory access and CPU bottlenecks are balanced. For decoding, the processor speed limitation is more pronounced.
The "best" results are obtained on Dec-alpha, where the 64 bit bus enhances the memory access performance. The figures are "classical": BER is more costly than LWS, and decoding (BER or LWS) is more costly than coding. This is typically due to CPU limitation.
The maximum decoding throughput is about 24.54 Mbps for BER and 88.9 Mbps for LWS on the Dec-Alpha.
From the above results we can learn the following:
- A drastic improvement of the speed of the coding and decoding routines, for both BER and LWS, can be obtained by adequate implementation optimization on high performance workstations.
- The LWS is interesting if the processor speed is the bottleneck; however, a more compact syntax is desirable in order to reduce the size of the data to be transmitted on the network and the memory bus.
- Memory access is a limiting factor in most cases on RISC workstations. This confirms that integration techniques should result in increased performance on such workstations. ILP intra-function optimizations were applied and resulted in enhanced performance for BER encoding.
The optimized version of the encoding and decoding functions should facilitate the implementation of the presentation layer as a filter with "streamlined" encoding and transmission of application data units.
3 High speed transport mechanisms
We now present our work on the transport level functions. This work has two goals:
- To experiment with the effect of intra-function integration mechanisms on the performance of transport functions.
- To have a substrate suitable for building communication subsystems that support the ILP principle.
We produced a prototype implementation of the XTP protocol. XTP was initially designed to be implemented on specialized hardware, with execution efficiency as an inherent part of the design process. Therefore, many syntactical and algorithmic choices of the protocol are oriented to facilitate high speed operation over low error rate networks.
XTP provides a set of mechanisms; the application can select the desired type of service according to its own requirements. We implemented XTP at the Unix user level as a library of transport functionalities. This library is linked to the application code. The interest of a user level implementation of XTP is to provide support for testing the integration mechanisms. In fact, on the same processor and in the absence of scheduling overhead, one could expect the performance of kernel and user level implementations of the same transport protocol to be comparable. This is confirmed by the results reported in section 3.2.2.
In the rest of this section we will describe the XTP implementation. We will also compare the performance of user level XTP implementation with kernel supported TCP.
3.1 XTP implementation
The implementation runs on an extension of the 4.3 BSD socket interface. The extension of the kernel protocols was done in order to add the support of XTP on top of IP. The implementation architecture is depicted in figure 1.
Two functions, xtp_input and xtp_output, on top of IP serve as a demultiplexing layer. They process the key field in the XTP header. This minimal kernel support provides a secure allocation of network ports. The provision of "datagram" XTP sockets enables the application to declare "network entry points". Transmission control procedures are performed at the user level and thus can be easily configured according to the application needs. User applications have access to the XTP socket interface through the [...] The XTP connection set up, the reliable data transfer and the connection tear down functionalities were implemented. Other XTP functionalities like rate control, route management and multicast procedures have not been implemented. A detailed description of this implementation is given in [Dab93]. We present here the main implementation issues:
Context initialization. An application using XTP should initialize a context before sending or receiving data. After this initialization, a field within the context structure points to a control block with default values, corresponding to the default type and quality of service. The application may overwrite some or all of the fields of the control block. This reflects the programmability of the XTP protocol.
Context scheduling. An application may have several active contexts at the same time (several point to point associations with different application entities). The application examines the list of "active" contexts to determine the next context to be processed and the corresponding wait timer value, as described in [Var87].
Buffer management. The allocation of memory buffers is performed by the application. The sizes of the maximum receive or send buffers are specified in the corresponding fields in the control block structure. These buffers are directly used by the application to compute/process data. There are no intermediate transport level buffers. When an XTP packet is received, only the packet header is consulted. After the necessary control functions have been performed, the data is copied into the corresponding place within the user level buffer. The system calls readv and writev are used to accomplish the scatter/gather I/O. According to its own requirements (e.g. when a complete frame is received for a video application), the application is invited to process the message.
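The scatter/gather pattern can be sketched with the readv and writev system calls, which Python exposes as os.readv and os.writev on Unix systems. This is a simplified illustration rather than the implementation itself: the 8-octet header length, the function names and the use of a pipe in place of an XTP socket are assumptions, and the destination offset is taken as known in advance, whereas the text above describes it as determined from the packet header.

```python
import os

HEADER_LEN = 8  # hypothetical header size, for illustration only


def send_packet(fd, header, payload):
    """Gather: header and payload leave in one call, without first
    being copied into a contiguous intermediate buffer."""
    return os.writev(fd, [header, payload])


def receive_into(fd, user_buffer, offset, payload_len):
    """Scatter: the header lands in a small scratch area, the payload
    directly in its final place inside the application buffer."""
    header = bytearray(HEADER_LEN)
    target = memoryview(user_buffer)[offset:offset + payload_len]
    received = os.readv(fd, [header, target])
    return bytes(header), received - HEADER_LEN


if __name__ == "__main__":
    r, w = os.pipe()
    send_packet(w, b"HDR00001", b"payload!")
    app_buffer = bytearray(32)
    hdr, n = receive_into(r, app_buffer, 4, 8)
    print(hdr, n, bytes(app_buffer[4:12]))
    os.close(r)
    os.close(w)
```

The point of the design is that no transport level buffer sits between the kernel and the application buffer: the payload is written once, at its final location.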
Retransmission strategies. The selective ACK-upon-request mechanism used by XTP facilitates the implementation of the two well known retransmission strategies, Go-Back-N and selective repeat. Both strategies have been implemented. During the performance tests presented hereafter, the receiver adopted the selective repeat strategy.
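The difference between the two strategies can be sketched as follows. The sketch assumes, for illustration only, that packets are identified by small integer sequence numbers and that the acknowledgement reports the set of numbers received; the function names are hypothetical and the actual XTP acknowledgement format is not reproduced here.

```python
def go_back_n(sent, acked):
    """Retransmit everything from the first missing packet onward."""
    for position, seq in enumerate(sent):
        if seq not in acked:
            return sent[position:]
    return []


def selective_repeat(sent, acked):
    """Retransmit only the packets reported missing."""
    return [seq for seq in sent if seq not in acked]


if __name__ == "__main__":
    sent = [0, 1, 2, 3, 4, 5]
    acked = {0, 1, 3, 5}
    print(go_back_n(sent, acked))         # [2, 3, 4, 5]
    print(selective_repeat(sent, acked))  # [2, 4]
```

With the same acknowledgement information, selective repeat resends two packets where Go-Back-N resends four, at the cost of the receiver having to hold out-of-order data.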
Checksum calculation. The 32-bit XTP check function, called CXOR, is defined as the concatenation of two 16-bit functions: XOR, which is a straight "vertical" exclusive-or of each 16-bit short word in a block of information, and RXOR, a rotated "spiral" exclusive-or of each 16-bit short word. We first implemented a "raw" version of the XTP checksum and then experimented with the effect of ILP intra-function optimizations on the performance of the checksum algorithm. These optimizations increased performance by a factor of 12 over the initial "raw" implementation, as will be shown in section 3.2.1.
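A "raw" word-by-word computation of this kind can be sketched as below. Several details are assumptions made for illustration and are not confirmed by the text above: the rotation is taken to be one bit to the left per word, words are read in big-endian order, odd-length data is padded with a zero octet, and RXOR is placed in the upper half of the 32-bit result. The XTP specification should be consulted for the exact definition.

```python
MASK16 = 0xFFFF


def rotl16(x, n=1):
    """Rotate a 16-bit value left by n bits."""
    n %= 16
    return ((x << n) | (x >> (16 - n))) & MASK16


def words16(data):
    """Split data into big-endian 16-bit words (zero-padded if odd)."""
    if len(data) % 2:
        data = data + b"\x00"
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def cxor_raw(data):
    """Straight loop: one XOR and one rotate-and-XOR per 16-bit word."""
    xor = 0
    rxor = 0
    for w in words16(data):
        xor ^= w                   # "vertical" exclusive-or
        rxor = rotl16(rxor) ^ w    # "spiral" exclusive-or (assumed form)
    return (rxor << 16) | xor      # concatenation order is an assumption
```

Every word costs a rotate in addition to two XORs, which is the cost that the optimizations discussed in section 3.2.1 aim to reduce.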
3.2 Performance tests
In this section we present performance test results of our prototype XTP. The performance of the standard BSD kernel TCP implementation is presented for reference. However, the comparison should be made very carefully: the performance of a protocol depends on the environment, implementation tunings and enhancement techniques, as demonstrated in [Dab92]. Many differences exist between the choices we adopted and the TCP kernel implementation, and the XTP implementation is not as well "tuned" as the TCP kernel implementation. Therefore, it is not meaningful to make a precise comparison of both protocols based on the performance figures we have. The figures give only an idea of the possible performance of XTP when implemented at the user level. Other detailed performance tests can be found in [Cab88] and [Nic91] for TCP and in [Fan93] for XTP. A user level implementation of TCP is described in [The93]. Another work on an optimized TCP checksum is reported in [Kay93].
3.2.1 Performance of the checksum algorithm
The speed of checksum calculation is known to be one of the important factors that determine the performance of a transport protocol [Cla90], [Ros90]. The XTP designers proposed to perform this task in hardware [Str92]. However, the reduced flexibility of the hardware approach is a critical constraint because of the diversity of applications. The following results show that it is possible to optimize the software implementation of this routine on modern workstations, thus removing one of the bottlenecks for protocol performance.
The performance of the TCP and ISO TP4 checksum algorithms is given in the first two rows of table 4 for four different machine types. These are byte or short word oriented calculations. The "raw" implementation of the XTP checksum as a loop of short word oriented exclusive-OR and rotate operations was too slow (maximum throughput of 13.88 Mbps on an SS-10).
We analyzed the cost of the basic operations needed for the checksum routine in order to determine the most costly functions. The results are presented in table 5. We performed some hand optimizations on the checksum calculation algorithm. The results are given in table 6. A first optimization (op1) consisted in doing the XORs on words with the same RXOR rotation and saving the rotate operations until the end. This "enhancement" increased the speed slightly on the Sun3, but resulted in slower operation on the Sparc and Dec workstations. In fact, to implement this modification, we used an array of 16 short words to store the XOR values. As the manipulation operations on the CISC Sun3 were the bottleneck, the saving in the number of rotate operations resulted in increased performance. On the contrary, memory access is the bottleneck on both the Dec and Sparc workstations. Therefore, the load operations on the array elements decreased the performance. A second optimization (op2) consisted in declaring 16 short words instead of the array in order to save indirections (unrolling the loop). This resulted in increased performance on all machine types, at the cost of a slightly larger checksum calculation program. The performance increase is, however, more pronounced for RISC workstations. In fact, these 16 words were defined as registers, thus avoiding costly memory access in the checksum computation loop. In a third modification (op3), the number of XOR operations is divided by 2 by XORing the 16 words at the end to determine the CXOR value instead of doing it on the fly. This also increased the overall performance. In (op4), the same optimizations as in (op3) are applied with long word calculations, the results then being folded down to 16 bits. The results are interesting: the speed-up factor with regard to the third enhancement varies between 2.6 for the DecStation and 5.6 for the SS10. The higher the value of this factor, the more costly memory access is, relatively.
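The idea behind (op1) and (op3) can be checked with a short sketch. It relies on an assumed definition of the raw RXOR (rotate the running value left by one bit, then XOR in the next 16-bit word); under that assumption, a word at position i among n words ends up rotated by (n - 1 - i) bits, so words whose positions are equal modulo 16 share the same final rotation and can be accumulated with plain XORs. All names are hypothetical, and the array used here corresponds to (op1); the register allocation of (op2) and the long word processing of (op4) cannot be expressed in this form and are not shown.

```python
MASK16 = 0xFFFF


def rotl16(x, n=1):
    """Rotate a 16-bit value left by n bits."""
    n %= 16
    return ((x << n) | (x >> (16 - n))) & MASK16


def words16(data):
    """Split data into big-endian 16-bit words (zero-padded if odd)."""
    if len(data) % 2:
        data = data + b"\x00"
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def cxor_raw(data):
    """Reference: one rotate and two XORs per word (assumed definition)."""
    xor = rxor = 0
    for w in words16(data):
        xor ^= w
        rxor = rotl16(rxor) ^ w
    return (rxor << 16) | xor


def cxor_deferred(data):
    """One XOR per word in the main loop; rotations and the plain XOR
    are derived from the 16 accumulators once, at the end."""
    words = words16(data)
    acc = [0] * 16
    for i, w in enumerate(words):
        acc[i % 16] ^= w                 # same slot means same final rotation
    n = len(words)
    xor = rxor = 0
    for j, a in enumerate(acc):
        xor ^= a                         # plain XOR recovered from the slots
        rxor ^= rotl16(a, (n - 1 - j) % 16)
    return (rxor << 16) | xor
```

The main loop of `cxor_deferred` performs a single XOR per word, against two XORs and a rotate in `cxor_raw`, and both functions return the same value for any input.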
3.2.2 Throughput measures
Table 6: Effect of optimizations of the XTP checksum algorithm
Tests have been performed over both Ethernet and FDDI networks. We implemented a simple client/server application where the client sends raw data to the server using the UDP, TCP or XTP protocols. For UDP, the client sends numbered packets; the test is considered terminated when the last packet is received at the server. If this packet is lost during one of the tests, the corresponding results are ignored. For both TCP and XTP, the test is terminated when all the data has been received correctly at the server. The XTP library is linked with both the client and server code. The throughput is defined in the three cases as the number of received bits divided by the time interval between the reception of the first and the last packet at the server. Tables 7, 8 and 9 give the results of the tests over both Ethernet and FDDI. The machines running the client and the server are both DECstations 5000/200.
Table 7: Maximum throughput for UDP
One may be tempted to say that the UDP figures give the raw performance that may be obtained on either Ethernet or FDDI. However, this assertion should be taken with care: packet losses at the ipqueue level, due to the high number of packets transmitted during a test (100 to 500), reduce the throughput as defined above. The best performance for UDP is obtained for a small number of transmitted packets (15 or 20). Over Ethernet, there is no bottleneck at the UDP level: the throughput is limited by the speed of the Ethernet interface. Over FDDI, the maximum raw UDP throughput is limited either by the performance of the FDDI interface or by operating system overhead. Note that the MTU is 1460 and 4312 octets over Ethernet and FDDI respectively. Therefore, the 9000 octet packets are transmitted as 7 Ethernet packets and 3 FDDI frames and reassembled by the kernel at the UDP/IP layer.
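The throughput definition and the fragment counts quoted above can be reproduced with a few lines of arithmetic. The function names are hypothetical, and the fragment count is a simplification that treats the MTU as the usable payload per fragment, ignoring header overhead and fragment alignment.

```python
def fragment_count(packet_octets, mtu_octets):
    """Number of link level units needed for one packet (ceiling division)."""
    return (packet_octets + mtu_octets - 1) // mtu_octets


def throughput_mbps(received_octets, t_first, t_last):
    """Received bits divided by the interval between the reception of
    the first and the last packet, in Mbps (times in seconds)."""
    return received_octets * 8 / (t_last - t_first) / 1e6


if __name__ == "__main__":
    print(fragment_count(9000, 1460))   # 7 over Ethernet
    print(fragment_count(9000, 4312))   # 3 over FDDI
```

With these MTU values a 9000 octet packet needs 7 units over Ethernet and 3 over FDDI, matching the figures given in the text.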
The TCP figures are more interesting: there is no bottleneck over Ethernet, and TCP can easily saturate the 10 Mbps CSMA/CD LAN. However, the performance over FDDI is limited by the control mechanisms of TCP. The highest throughput (20.74 Mbps) is lower than the best UDP performance. The socket receive buffers have been set to 65535 octets in order to increase the window size. TCP packets shorter than 1024 octets show poor performance even over FDDI. The limitation comes from the buffering mechanism within the kernel: small buffers of 128 octets are used, thus limiting the overall performance of the transmission. For packets longer than 1024 octets, the performance is quite similar, with a tendency to decrease.
For the XTP tests, one should distinguish between two notions, ADU and NDU, as they are defined in [Dav92]. The ADU, or Application Data Unit, is the basic unit of data exchange between application entities. The NDU, or Network Data Unit, is the unit of data processing and switching within the network. For the purpose of these tests, NDUs over Ethernet are 1400 bytes long and NDUs over FDDI are 4000 bytes long. In table 9, we report the ADU sizes. Note that the results do not vary a lot with the ADU size, as the same NDU size is used. The most important result shown in this table is that XTP can saturate Ethernet and operate at a speed of up to 21 Mbps over FDDI. This was made possible by the following factors:
- minimal data copying, by the use of writev and readv,
- optimized checksum calculation, by ILP intra-function optimizations,
- minimal overhead, by return key based context look up in the kernel.
These results show that it is possible to have good performance with a user level implementation of transport protocols. The throughput figures should not be taken as an indication of the performance of XTP with all control mechanisms implemented. However, the other control mechanisms (rate control, route management) are not data manipulation oriented and may be considered as having minimal additional cost. The performance of our XTP implementation with the complete error and flow control procedures is an encouraging step in the direction of an integrated implementation of protocols. Such an enhanced performance transport implementation provides building blocks that will facilitate the configuration of application specific communication subsystems. Two examples of a user-space implementation of TCP can be found in [Cas94] and [The93]. However, some problems remain to be solved before user level implementations provide optimal results: the execution environment should be better adapted to the support of network protocols by providing better memory management and process scheduling than Unix.
We implemented XTP as a step towards the development of new high performance communication subsystems. However, it is not clear how the application should configure the elementary transport and presentation "building blocks" in order to generate the complete communication subsystem. XTP can serve as a substrate for the implementation of this communication subsystem by providing efficient "building blocks". Other design studies should determine how the elementary building blocks can be grouped together in order to provide the desired service with the best performance.
4 Conclusion
In this paper, we presented our work on the high performance implementation of presentation and transport functions. The experiments show that it is possible to increase performance by proper implementation techniques. We showed specifically that an important speed-up factor can be obtained by applying ILP intra-function optimizations to costly data manipulation functions. The performance enhancement of both the ASN.1 BER presentation routines and the XTP checksum algorithm shows that it is possible to use such building blocks in the design and implementation of high speed integrated communication subsystems. The next step is to study how to combine elementary building blocks according to the application needs, in order to generate a complete communication subsystem that follows the ILP principle.
