Transport Protocols for High Speed Broadband Networks Workshop, Globecom'96, Westminster, London, UK, November 22, 1996

domination (unfair bandwidth utilization between the two flows) or synchronization (poor performance for both flows), just as in Figure 5. Figure 5 displayed these undesirable performance effects immediately after reaching its maximum throughput, because the small per port buffers were exhausted rapidly. Had an additional delay of 20 msec been used, the high throughput section of the graph in Figure 8 would have shifted right, to about 400 Kilobyte windows. Delay was intentionally left out of this test so that the exact per port buffering in the ATM switch could be determined.
By comparing the results in Figure 5 and Figure 8, one can see that sufficient cell buffers provide good throughput performance for a wide range of TCP window sizes. A switch buffer of tens of thousands of cells per port can comfortably accommodate RTDs of up to 100 msec over a DS3 network. Thus, network delays that may change during a TCP session are unlikely to degrade the performance characteristics of the session. With limited cell buffers, however, a change in RTD during a session may have quite a negative impact on the performance of the session. With limited buffering in the ATM network, the window size parameters must be fine tuned to the Bandwidth*Delay product of the network. Such problems do not exist for larger buffer switches, as long as the window size is set larger than the Bandwidth*Delay product of the connection. Larger buffers have already been considered and implemented by many ATM switch vendors. The reduced cost of memory compared to a few years ago makes this an even more viable and attractive solution.
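As a rough check on the buffer sizes discussed above, the number of cells in flight over one RTD can be compared with the per port cell buffer. The sketch below is only illustrative: it assumes the DS3 PLCP cell rate of 96,000 cells per second (one cell about every 10.42 microseconds), and the buffer sizes of 300 and 30,000 cells are hypothetical stand-ins for "a few hundred" and "tens of thousands" of cells, not measured values from the testbed.

```python
# Assumption: DS3 PLCP carries 12 cells per 125 microsecond frame,
# i.e. 96,000 cells per second (about one cell every 10.42 microseconds).
DS3_CELLS_PER_SECOND = 96_000


def cells_in_flight(rtd_seconds, cells_per_second=DS3_CELLS_PER_SECOND):
    """Cells that fill the pipe during one round trip (Bandwidth*Delay in cells)."""
    return round(rtd_seconds * cells_per_second)


def rtd_covered_ms(buffer_cells, cells_per_second=DS3_CELLS_PER_SECOND):
    """Round trip delay (msec) whose Bandwidth*Delay product equals the buffer."""
    return buffer_cells / cells_per_second * 1000.0


if __name__ == "__main__":
    # Hypothetical buffer sizes chosen only to illustrate the two switch classes.
    for label, buffer_cells in (("small", 300), ("large", 30_000)):
        print(label, buffer_cells, "cells cover about",
              rtd_covered_ms(buffer_cells), "msec of RTD")
    print("cells in flight at 100 msec RTD:", cells_in_flight(0.100))
```

Under these assumptions, a 100 msec RTD corresponds to 9,600 cells in flight, which fits within a buffer of tens of thousands of cells but is far beyond a buffer of a few hundred cells.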
Conclusion
This paper discusses two methods as solutions to guarantee acceptable TCP/IP performance over an ATM network. State of the art equipment is available in the market to implement either of the solutions. The paper presents experimental results collected over a TCP/IP over ATM research testbed to demonstrate the improved performance using these methods. One method is to provide rate shaping (CBR or VBR or other viable algorithms) at the entry to the ATM network, in proportion to the bandwidth of the bottleneck resources. The other method, which has significant advantages, is to provide ample buffering in the ATM switches. The advantage of the latter over the former is that it does not require provisioning only a portion of the bandwidth to a PVC; thus the whole bandwidth can be statistically multiplexed among active TCP flows.

In peak rates that yield smaller peak rate to average rate ratios should bring forth the advantages of leaky buckets for small-buffer ATM switches, and can then provide statistical multiplexing and improved performance with the use of larger burst sizes. Unfortunately, the current inflexible implementations of the algorithm in the commercial equipment used do not help, and are potentially harmful under full load, for small-buffer ATM switches. Figure 6 and Figure 7 display the cases when the bandwidth assigned to each flow is equal.
When the ratio of the bandwidths is not equal to one, the same behavior is observed except that the ratio of each flow's throughput is in proportion to the ratio of the bandwidths assigned.
6.2 Sufficient Buffering
Although rate shaping the streams at the entry to the ATM network is an effective solution, it requires provisioning parts of the total bandwidth for specific connections ahead of time (using Permanent Virtual Circuits (PVCs)). Since the connections for which a portion of the bandwidth is reserved may not all be active at all times, the active connections cannot utilize the unused bandwidth reserved for the inactive ones with CBR provisioning. Furthermore, VBR provisioning implemented in the IP routers is ineffective for small-buffer ATM switches. In addition, provisioning connections requires manual intervention. Switched Virtual Circuits (SVCs) will address these problems; however, they were not available for the tests presented here.
A second solution is to provide sufficient buffering at the bottleneck resources (in relation to the Bandwidth*Delay product of the network). Sufficient buffering is a relative term, and there is no single optimum number for best performance under all circumstances. However, it is possible to estimate a buffer range that covers the possible RTDs for a given bandwidth. This section attempts to do this by comparing the results of the two tests displayed in Figure 5 and Figure 8.
Both of these tests are carried out at full DS3 rates with no rate control mechanism being exercised at the source IP routers (Router_1 and Router_2). The major difference is that the performance curve in Figure 5 is collected through the small buffer ATM switch, and the curve in Figure 8 is collected through the large buffer ATM switch. The ratio of the buffer sizes is about 1:100. The small buffer switch has only a few hundred cell buffers per port, whereas the large buffer switch has a few tens of thousands of cell buffers per port. There was also no added network delay in the test whose results are presented in Figure 8.

Mbps for each flow, thus an aggregate of 34.5-35 Mbps which confirms the calculations in Section 3. Note that the buffering within the routers is the same (400 Kilobyte).
As Figure 6 and Figure 7 demonstrate, the performance gain due to VBR shaping is not significant compared to CBR shaping with the use of small burst sizes. Unfortunately, large burst sizes are found to yield significant throughput degradation when multiple flows congest the small-buffer ATM switch by sending large bursts of data simultaneously to the DS3 ATM bottleneck. This is expected since the peak rates are 40 Mbps each. Unfortunately, the smallest peak rate to average rate ratio that can be set is 2 (a high oversubscription rate) due to the limitations of the commercial equipment available to this study at the time. A capability to choose

a random sequence of loss, timeout and retransmission events as for the glitch in Figure 3.

Figure 7 presents the results of another similar performance test using the small buffer ATM switch. The difference between Figure 6 and Figure 7 is that Figure 7 depicts VBR shaping at each of the source routers (Router_1 and Router_2), and no additional delay is provided. The average bandwidth assigned to each flow is still 20 Mbps; however, the peak rate is set at 40 Mbps. A small burst size equal to the FDDI MTU is used in this test. Since the 20 msec additional delay is no longer used, maximum throughput is reached at very small window sizes (as in Figure 2 and Figure 3). Due to the VBR shaping, the throughput increases slightly as larger window sizes are used, unlike the flat throughput observed in Figure 6. The maximum throughput reaches 17-17.5

while during other times, the S_2 originated flow dominates.
Possible Solutions
The congestion-induced degradation in performance (due to synchronization) or unfair bandwidth utilization (due to domination) can be rectified in a number of ways. This section examines two possible solutions, source rate control and increased switch buffering, based on experimental results. These solutions are by no means the only methods that exist to overcome these problems. For example, there are ATM switches available in the market that implement Packet Discard algorithms (Early Packet Discard (EPD) or Partial Packet Discard (PPD)) discussed in various literature [20], [21]. There are ongoing experimental studies of these methods within Bellcore which are not available for publication at the present time. ABR switches will be able to address these problems as well; however, they are just becoming available in the market.
6.1 Rate Control
One way to assure acceptable TCP/IP performance for ATM networks with limited buffer switches is to control the amount of data into the ATM network, thus into the bottleneck resource.
This can be provided by exercising rate control at the ingress to the ATM network, either at the data sources or at the IP routers interfacing to the ATM network. As long as either provides adequate rate control and buffering, either should be effective. Experiments here are carried out using rate shaping at the IP routers. Rate shaping corrects the undesirable behaviors explained in Section 5.

displays the results of such a performance test in which two TCP sources simultaneously send to the same sink (i.e. from S_1 and S_2 to S_3) through the small buffer ATM switch, when there is no rate control exercised at either of the source routers (Router_1 and Router_2). In this test, two DS3 sources at full rate congest the outgoing DS3 ATM port (towards Router_3) over the switch, and since the switch has limited cell buffering, it starts to drop cells from either one of the flows.
Which TCP session loses its cells is random, and the performance results display various throughput degradation effects, as will be explained in detail.
The performance test displayed in Figure 5 is performed using an additional 20 msec delay introduced by the Delay Generator to depict more clearly the initial and acceptable performance characteristics, i.e. the initial slope where the throughput increases as the window sizes increase.
During this time, both flows exhibit equal throughput values, as the share of bandwidth assigned to each is also equal. They both reach a throughput of approximately 17.25 Mbps (aggregate throughput is 34.5 Mbps) at a window size of approximately half the Bandwidth*Delay product (the two window sizes together are equal to the Bandwidth*Delay product). Once the window sizes of both flows become large enough to fill the outgoing DS3 pipe and the small buffering in the ATM switch runs out, both flows start to lose cells at a high rate. Both flows' congestion control mechanisms [8], [13] then commence, and both switch into the slow start state quickly. Many cycles of slow start/congestion avoidance repeat until both flows finally finish the data transfer process. This is the synchronization artifact of TCP. However, as the TCP window sizes used become larger than 200 Kilobytes, a different artifact of TCP becomes apparent: domination of one flow over the other. At such large window sizes, the burst of data arriving from either of the flows is much larger, and in most cases, the flow ahead in the queue manages to get through while the other one loses a lot of cells. Once that occurs, the leading flow dominates the other, which has backed off and continues to back off until the dominating one finishes its session.
Only then can the backed-off flow proceed faster; however, due to the difficulty in raising its congestion window much higher, it displays dismal throughput, much smaller than half the throughput of the other flow. This unfair behavior can work against either flow; it is completely random, as exhibited in Figure 5; at certain window sizes, the S_1 originated flow dominates,

directly indicates the amount of bottleneck buffering (i.e. buffering in the ingress router) over the data path. Consequently, additional delay also shifts the throughput collapse point, i.e. the window size at which the bottleneck resource overflows its buffers and degrades the throughput.
In summary, with a higher RTD, the window sizes at which maximum throughput is reached, and throughput collapses, are larger than those with a lower RTD.
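The effect of RTD on these two window sizes can be illustrated with a simple idealized model. The sketch below is not the method used in the paper: it assumes a single flow, a cell payload rate of 36.864 Mbps, the 400 Kilobyte bottleneck buffer taken as an example, and RTDs of 1 msec and 21 msec (the baseline plus the added 20 msec). It ignores protocol overheads, losses and slow start, so it only shows the direction of the shift, not the measured values.

```python
def window_thresholds(rate_bps, rtd_s, buffer_bytes):
    """Return (saturation_window, collapse_window) in bytes for an idealized flow.

    saturation_window: the Bandwidth*Delay product, where the pipe is just full.
    collapse_window:   the window beyond which the bottleneck buffer overflows.
    """
    bdp_bytes = rate_bps * rtd_s / 8.0
    return bdp_bytes, bdp_bytes + buffer_bytes


def ideal_throughput_bps(window_bytes, rate_bps, rtd_s):
    """Window-limited throughput (one window per RTD), capped at the link rate."""
    return min(window_bytes * 8.0 / rtd_s, rate_bps)


if __name__ == "__main__":
    RATE_BPS = 36.864e6      # assumed DS3 cell payload rate
    BUFFER_BYTES = 400_000   # example bottleneck buffer (assumption for this sketch)
    for rtd_s in (0.001, 0.021):
        saturation, collapse = window_thresholds(RATE_BPS, rtd_s, BUFFER_BYTES)
        print(f"RTD {rtd_s * 1000:.0f} msec: saturation at {saturation:.0f} Bytes, "
              f"collapse at {collapse:.0f} Bytes")
```

In this model both thresholds move to larger windows as the RTD grows, while their difference stays equal to the bottleneck buffer size.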
Impact of Congestion
Unlike the results presented in Figures 2,3

Since the maximum attainable throughput is reached at a higher window size for a larger Bandwidth*Delay product path, the effect of additional delay is a smaller slope on the throughput versus window size graph in Figure 4 compared to that in Figure 2. However, the amount of buffering within the physical equipment over the data path is independent of the network delay, and the width of the high throughput section is the same on both of the graphs. In fact, this width

The objective of adding network delay is to emulate high latency networks, thus to test long

• the TCP/IP header being actually larger than 40 Bytes due to the TCP timestamp option that RFC1323 implementations use with data packets for extended TCP windows (larger SYN headers due to options used with the SYN packets can be neglected since the data transfer phase is very long),
• data packets being slightly smaller than the actual TCP maximum segment size negotiated during path MTU discovery.

Figure 2 and Figure 3 present two baseline performance tests for single-flow TCPs (from a single source workstation to a sink workstation, i.e. from S_1 to S_3): first, through the small buffer ATM switch, and second, through the large buffer ATM switch, in the testbed configuration in Figure 1. Both of these tests are performed under exactly the same conditions: at full DS3 rates, with no rate shaping exercised by the source IP router (Router_1) at the ingress to the ATM network and with no additional network delay introduced by the delay generator.
Baseline TCP/IP over ATM Performance Tests
These one-to-one or single-flow tests are important since they form a baseline for the following two-to-one tests, which exhibit congestion. Full-load two-to-one tests cannot perform any better than these single-flow tests. The one-to-one tests characterize the amount of buffering at the ingress IP router, since there is no slower link ahead. The two-to-one tests are performed to assess the ATM switch behavior under congestion and to characterize the buffering in the switch, which becomes the bottleneck resource for more than one flow.
The following observations hold for both of these tests. RTD is around 1 msec and the packet processing delays range from 100 µsec to 5 msec, which contribute to the overall delay. Thus, the Bandwidth*Delay Product is small compared to larger RTD networks and full utilization is Bytes.
Effects of Protocol Overheads
ATM has been shown to be inefficient for IP datagram services [17] due to various overheads:
• PLCP formatting (on average, the DS3 PLCP transmits a cell every 10.42 microseconds),
• 5-Byte ATM cell header for each cell payload formed by segmenting the AAL5 packet,
• 8-Byte LLC/SNAP header [18] encapsulating IP in AAL5 [19] and 8-Byte AAL5 trailer, and
• Any AAL5 padding, necessary to divide the AAL5 packet evenly into 48-Byte cell payloads.
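The overheads listed above can be combined in a short calculation. The sketch below is an idealized estimate, not the paper's own computation: it assumes the DS3 PLCP rate of 96,000 cells per second (one cell about every 10.42 microseconds), an IP packet equal to the FDDI MTU of 4352 Bytes used in the testbed, and a 40-Byte TCP/IP header with no options. The function names are illustrative only.

```python
import math

CELL_SIZE = 53                 # Bytes per ATM cell
CELL_PAYLOAD = 48              # Bytes of payload per ATM cell
LLC_SNAP_HEADER = 8            # Bytes, encapsulating IP in AAL5
AAL5_TRAILER = 8               # Bytes
DS3_CELLS_PER_SECOND = 96_000  # assumption: 12 cells per 125 microsecond PLCP frame


def aal5_cells(ip_packet_bytes):
    """Return (cells, padding_bytes) needed to carry one IP packet over AAL5."""
    pdu = ip_packet_bytes + LLC_SNAP_HEADER + AAL5_TRAILER
    cells = math.ceil(pdu / CELL_PAYLOAD)
    padding = cells * CELL_PAYLOAD - pdu
    return cells, padding


def cell_payload_rate_bps():
    """Maximum cell payload throughput over DS3 ATM (headers and PLCP removed)."""
    return DS3_CELLS_PER_SECOND * CELL_PAYLOAD * 8


def ideal_tcp_data_rate_bps(ip_packet_bytes, tcpip_header_bytes=40):
    """Upper bound on TCP data throughput for back-to-back packets of one size."""
    cells, _ = aal5_cells(ip_packet_bytes)
    packets_per_second = DS3_CELLS_PER_SECOND / cells
    return packets_per_second * (ip_packet_bytes - tcpip_header_bytes) * 8


if __name__ == "__main__":
    print("cells, padding for a 4352-Byte packet:", aal5_cells(4352))
    print("cell payload rate (Mbps):", cell_payload_rate_bps() / 1e6)
    print("ideal TCP data rate (Mbps):", ideal_tcp_data_rate_bps(4352) / 1e6)
```

Under these assumptions a 4352-Byte IP packet occupies 91 cells with no padding, and the cell payload rate evaluates to 36.864 Mbps. The TCP data rate returned is only an upper bound, since it ignores TCP options, acknowledgement traffic and host processing delays.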
Theoretically, the maximum cell payload throughput over a DS3 ATM network is 36.86 Mbps ((53*8/10.42)*48/53). This is a total of 18% overhead (relative to the 44.736 Mbps DS3 line rate) just due to the ATM cell headers and PLCP framing. Since the maximum IP packet that can be used in the testbed in Figure 1 overhead. The actual maximum TCP data throughput measured over the DS3 ATM testbed in Figure 1 is slightly less than this calculated value. This minor shortfall has several causes, such as:
• varying packet processing delays at the workstations (100 µsec to 5 msec),
• the amount of AAL5 padding inserted in the TCP packets to align with cell boundaries,

switch and Router_3 in Figure 1, since Router_3 is used as the sink router (thus S_3 as the sink station) in most of the test studies. In this way, the same network delay can be introduced for both sources, S_1 and S_2, when two-to-one tests (two sources sending to the same sink) are conducted.
2.1 Performance Tests and Test Tools - A Summary
The performance tests carried out over the ATM testbed displayed in Figure 1

The throughput versus TCP window size measurements presented here were performed using a Bellcore shell script (written by Grenville Armitage) called ttcp-multi, which in turn uses a public domain performance tool called ttcp [15]. ttcp-multi automates multiple consecutive ttcp runs by incrementing the window size offered to each run, checks the network statistics for TCP timeouts and retransmissions for each session, and post-processes the results for data presentation. The same set of tests was later repeated while tcpdump [16] was running in parallel with ttcp-multi in order to find out the exact details of data transmission for these TCP flows.
During these tests, the Nagle algorithm [6] was disabled by setting the TCP_NODELAY [15] option to push data out to the link as fast as possible, since these are bulk data transmission measurements. The Don't Fragment (DF) flag in the IP header was set on all the packets and Push discovery [7] and sets the TCP default maximum segment size (mss) to the minimum Maximum Transmission Unit (MTU) in the network (minus the TCP/IP header). In Figure 1, the FDDI MTU of 4352 Bytes (minus the 40 Bytes TCP/IP header) is used as the maximum TCP segment size. It is a well-known networking practice to use the largest packet size possible while avoiding packet fragmentation, in order to reduce the per packet overhead at the hosts and to attain the maximum throughput through the network [8], [9], [10], [11]. The TCP version implemented in the Irix 5.3 kernel is 4.3BSD/Reno, which provides the Fast Retransmit and Fast Recovery algorithms [12] that are modifications to the original TCP congestion avoidance algorithm [13].
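The socket settings described above can be sketched with the standard sockets API. This is a minimal illustration using Python's socket module, not the ttcp or ttcp-multi code used in the study; the function name make_bulk_socket and the use of SO_SNDBUF/SO_RCVBUF to request a given window size are assumptions of this sketch, and the operating system may round or limit the requested buffer sizes.

```python
import socket

FDDI_MTU = 4352            # Bytes, from the testbed description
TCPIP_HEADER = 40          # Bytes, TCP/IP header without options
MAX_SEGMENT_SIZE = FDDI_MTU - TCPIP_HEADER  # 4312 Bytes


def make_bulk_socket(window_bytes):
    """Create an unconnected TCP socket configured for a bulk transfer test.

    Hypothetical helper: disables the Nagle algorithm and requests send and
    receive buffers of window_bytes (the kernel may adjust these values).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable the Nagle algorithm so data is pushed out without coalescing.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Request socket buffers matching the window size under test.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, window_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, window_bytes)
    return sock


if __name__ == "__main__":
    s = make_bulk_socket(64 * 1024)
    print("TCP_NODELAY:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    print("SO_SNDBUF granted:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    print("maximum segment size (Bytes):", MAX_SEGMENT_SIZE)
    s.close()
```

Reading the buffer sizes back with getsockopt, as in the usage lines, shows what the kernel actually granted, which is worth checking before interpreting a throughput versus window size curve.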
In the TCP/IP over DS3 ATM testbed displayed in Figure 1, two different ATM switches are used: one switch with a small number of cell buffers per port (on the order of a few hundred) and the other with considerably larger cell buffers (on the order of a few tens of thousands). The ratio between the per port cell buffers of the two ATM switches is on the order of 100/1. Under full load, the two switches provide the TCP flows with significantly different performance characteristics, as will be explained in Section 5. The traffic service tested on the ATM switches is the Unspecified Bit Rate (UBR); i.e. no traffic shaping or policing has been provided at either one.
The three IP routers used in the experiments have ATM interfaces that provide direct DS3 ATM UNI access to the ATM switches. They have the capability to rate shape the data into the ATM network, using either Constant Bit Rate (CBR) or Variable Bit Rate (VBR) as defined by the ATM Forum [14]. In the absence of commercial Available Bit Rate (ABR) implementations, rate shaping at the ingress to the ATM network provides performance gains for UBR switches with limited buffers, as will be demonstrated in detail in Section 6.1. Rate shaping can alternatively be exercised at the source workstations by adjusting the inter-packet spacing. Network interface cards that perform rate shaping by varying the inter-packet gaps are available in the market for various workstations.
The testbed used in this study provides rate shaping at the entry to the ATM network because of the type and availability of the equipment employed at the time; it was not a matter of preference.
An Adtech SX/13 Data Channel Simulator with DS3 interface was used to introduce additional network delay in certain tests. It has the capability of introducing RTDs up to 100 msec. The Data Channel Simulator, or Delay Generator, for testing purposes, is connected between the ATM

communications industries to support current as well as future broadband services.
To allow ATM and other networking technologies to be combined into overall end-to-end services, it is necessary that the Internet Protocol (IP) [2]
