Title of Dissertation: Supporting Distributed Multimedia Applications on 
ATM Networks by Saha, Debanjan
Abstract

Debanjan Saha, Doctor of Philosophy, 1995

Dissertation directed by: Professor Satish K. Tripathi, Department of Computer Science

ATM offers a number of features, such as high bandwidth and provision for per-connection quality of service guarantees, that make it particularly attractive to multimedia applications. Unfortunately, the bandwidth available at ATM's data-link layer is not visible to the applications due to operating system (OS) bottlenecks at the host-network interface. Similarly, the promise of per-connection service guarantees is still elusive due to the lack of appropriate traffic control mechanisms. In this dissertation, we investigate both of these problems, taking multimedia applications as examples.

The OS bottlenecks are not limited to the network interfaces, but affect the performance of the entire I/O subsystem. We propose to alleviate the OS's I/O bottleneck by according more autonomy to I/O devices and by using a connection oriented framework for I/O transfers. We present experimental results on a video conferencing testbed demonstrating the substantial performance impact of the proposed I/O architecture on networked multimedia applications.

To address the problem of quality of service support in ATM networks, we propose a simple cell scheduling mechanism, named carry-over round robin (CORR). Using analytical techniques, we analyze the delay performance of CORR scheduling. Besides providing guarantees on delay, CORR is also fair in distributing the excess bandwidth. We show that, despite its simplicity, CORR is very competitive with other more complex schemes in terms of both delay performance and fairness.
Supporting Distributed Multimedia Applications on ATM Networks
by
Debanjan Saha

Dissertation submitted to the Faculty of the Graduate School of The University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 1995

Advisory Committee:
Professor Satish K. Tripathi, Chairman/Advisor
Professor Ashok K. Agrawala
Assistant Professor Michael Franklin
Assistant Professor Richard Gerber
Professor Joseph JaJa
Professor Armand Makowski
© Copyright by Debanjan Saha, 1995
Dedication

To my parents.
Acknowledgements

I would like to express my earnest gratitude to my advisor, Professor Satish Tripathi, for motivating me to do a Ph.D., arranging for my funding, and supervising this thesis with care and interest. He has been the ideal advisor, one who allowed me to explore the space while providing me with experience and insight which I often lacked. I will always be grateful to him for believing in my abilities and for being on my side at every step of the way. Many thanks to Professor Ashok Agrawala for his support and encouragement. A special thanks to Professor Richard Gerber for his interest in my work and for the many exciting and enlightening discussions we had during my stay in Maryland. I thank the members of my thesis committee, Professors Armand Makowski, Joseph JaJa, and Michael Franklin, for carefully reading my thesis and making suggestions for improvement.

I would like to thank Drs. Dilip Kandlur at IBM Research, Dipak Ghosal at Bellcore, and T.V. Lakshman and Hemant Kanakia at AT&T Bell Labs for their help during different stages of my dissertation research. A very special thanks goes to Sarit Mukherjee, Pravin Bhagwat, and Manas Saksena for their constructive criticism of my work and for many late night debates which influenced my brain waves. I am particularly grateful to Jennifer Yuin, Ibrahim Matta, and Krishnan Kailas for reading different parts of the thesis and for suggesting numerous improvements. I would like to thank the members of the Systems Design and Analysis Group, particularly Cengiz Alaettinoglu, Sedat Akyurek, Sanjeev Setia, Dheeraj Sanghi, Bao Trinh, Chia-Mei Chen, Partho Mishra, and Frank Miller, for creating a wonderful atmosphere for research. Thanks to Nancy Lindley for making my life in the department a lot less complex than it would have been.

I would like to thank all my friends, particularly Kateel Vijayananda, Shailendra Verma, Shuvanker Ghosh, Stayandra Gupta, Gagan Agrawal, Kaustabh Duorah, Greg Baratoff, Sibel Adali, Claudia Rodriguez, and Koyeli Dey, for making my life inside and outside school a lot more fun.

Although my parents are on the other side of the Atlantic, they have been a constant source of encouragement. Without their support, this thesis would not have been completed. Finally, I would like to thank a sweet muffin for her love and affection, which will never fade in my memory.
Table of Contents

List of Figures
1 Introduction
  1.1 Problem Description and Solution Approach
    1.1.1 Application Throughput
    1.1.2 Quality of Service
  1.2 Summary of Contributions
  1.3 Organization
2 Autonomous Device Architecture
  2.1 Current State of Art
  2.2 Guiding Principles
    2.2.1 Device Autonomy
    2.2.2 Connection Oriented I/O
  2.3 System Components
  2.4 Data Flow
  2.5 Flow Control
  2.6 Data Processing Modules
  2.7 Summary
3 Experimental Evaluation
  3.1 System Platform
  3.2 Base System: Implementation and Profiling
    3.2.1 Implementation
    3.2.2 Data Path Profiling
  3.3 Optimizations with Autonomous I/O
    3.3.1 Driver Extensions
    3.3.2 Connection Specific Protocol Processing
    3.3.3 Application Structure
    3.3.4 Data Flow
    3.3.5 Limitations
  3.4 Performance Results
  3.5 Summary
4 Traffic Shaping
  4.1 Traffic Shapers
  4.2 Simple Shapers
  4.3 Composite Shapers
    4.3.1 Composite Leaky Bucket
    4.3.2 Composite Moving Window
    4.3.3 Composite Jumping Window
  4.4 Summary
5 Carry-Over Round Robin Scheduling
  5.1 Current State of Art
    5.1.1 Delay Guarantee
    5.1.2 Throughput Guarantee
  5.2 Scheduling Algorithm
    5.2.1 Discussion
  5.3 Implementation in a Switch
  5.4 Basic Properties
  5.5 Summary
6 Quality of Service Envelope
  6.1 Delay Analysis
    6.1.1 Single-node Case
    6.1.2 Multiple-node Case
    6.1.3 Comparison with Other Schemes
  6.2 Fairness Analysis
    6.2.1 Fairness of CORR
    6.2.2 Comparison with Other Schemes
  6.3 Summary
7 Conclusions
  7.1 Contributions
  7.2 Future Directions
List of Figures

1.1 A video capture application.
2.1 Separation of control and data flows.
2.2 Alternative data paths.
2.3 DMA data path.
3.1 RS/6000 architecture.
3.2 MMT adapter.
3.3 Transmit latency in the base system (in microseconds).
3.4 Receive latency in the base system (in microseconds).
3.5 Transmit and receive data path in the optimized system.
3.6 Comparison of transmit latencies in the base and optimized system (in microseconds).
3.7 Comparison of receive latencies in the base and optimized system (in microseconds).
3.8 Transmit and receive throughputs.
4.1 The network model.
4.2 An example of an MPEG coded stream.
4.3 Composite leaky bucket.
4.4 Shaping with multiple leaky buckets.
4.5 Traffic envelope after adding (m + 1)th bucket.
4.6 Traffic envelope of a composite moving window.
4.7 Traffic envelope after adding the (l + 1)th moving window.
4.8 Traffic envelope of a composite jumping window shaper.
5.1 Carry-Over Round Robin Scheduling.
5.2 An Example Allocation.
5.3 Architecture of the buffer manager.
6.1 Computing delay and backlog from the arrival and departure functions.
6.2 Nodes in tandem.
Chapter 1

Introduction

The introduction of digital audio and video to the desktop computing environment has opened the door to an array of exciting applications, such as multimedia conferencing, video-on-demand, tele-medicine, and virtual reality, to name just a few. Along with the excitement and opportunity, the emergence of these multimedia applications has brought new challenges to all aspects of system design. Audio and video are fundamentally different from the classical media, such as text and graphics, in terms of their storage, processing, distribution, and display requirements. Consequently, the computing and communication subsystems are going through an era of reevaluation and redesign to rise to the challenges posed by the emerging multimedia applications. In this dissertation our focus is on the communication subsystem.

From the network's perspective, multimedia applications are different from the traditional text and graphics applications in two fundamental ways: 1) they generate orders of magnitude more data, and 2) they often have to satisfy stringent timing requirements on the presentation of the data. Unfortunately, the existing networks, designed mainly for low-bandwidth, non-time-critical applications, provide neither the bandwidth nor the timing guarantees demanded by these applications. This lack of adequate network support is the main driving force behind the ambitious endeavor to develop a new generation of networking infrastructure. Among the set of alternative technologies proposed as the basic switching fabric of this infrastructure, Asynchronous Transfer Mode [21], popularly known as ATM, is a clear front-runner.

Architecturally, ATM networks consist of switches connected via point-to-point fiber links. ATM switches can scale from small multiplexers to very large switches in both aggregate capacity and number of access ports. A switch can accommodate access ports from low speeds (1.5 Mb/s and 6 Mb/s) to very high speeds (2.4 Gb/s).
The number of access ports can vary from four to eight in local area switches to hundreds or thousands in wide area switches. Clearly, ATM has the potential to solve the bandwidth crisis in both local and wide area networks. Operationally, an ATM network is a connection oriented cell switching network. That is, a connection is established
between the source and the destination before data transfer can begin. Once the connection is established, data is exchanged in fixed size cells, all of which follow the same path from the source to the destination. The connection oriented architecture of ATM makes it possible to reserve resources for each connection, and hence it provides a framework for developing an end-to-end service architecture that can provide performance guarantees on a per connection basis.

ATM has the potential to provide both high bandwidth and per connection service guarantees, making it particularly attractive to multimedia applications. Unfortunately, these benefits are still beyond the reach of the applications. ATM provides megabits of bandwidth at the data-link layer. However, due to protocol processing and operating system overheads, only a small fraction of this bandwidth is actually available to the applications. Similarly, the connection oriented architecture of ATM provides the necessary infrastructure for a quality of service architecture, but ATM does not define the specific mechanisms required to guarantee service quality. Consequently, end-to-end service guarantees are still out of reach for the applications.

1.1 Problem Description and Solution Approach

The objective of this dissertation is to bring the potential benefits of ATM within the reach of the applications. More specifically, we address the problems of 1) making the bandwidth available at the data-link layer of ATM visible to the applications, and 2) developing traffic control policies and mechanisms for establishing an end-to-end quality of service architecture. In the following we briefly discuss both problems and outline our solution approach.

1.1.1 Application Throughput

The job of the networking subsystem at the end-host is to move data between the network interfaces, the host operating system, and the networked applications.
The throughput of this data path depends primarily on the cost of network protocol processing, and on the cost of moving data from the network interface to the applications through the operating system and vice versa. While the overhead of protocol processing depends largely on the processor bandwidth, the cost of data movement is determined by the memory bandwidth of the system. With the increasing gap between CPU and memory speeds, the cost of data movement has emerged as the most significant contributor to end-to-end latency. In many multimedia applications, the end-point of communication is a device, such as a disk controller or a CODEC, rather than a process running on the CPU. The end-to-end data path in such applications crosses the application and the operating system domains multiple times, increasing the cost of data movement even further, and consequently reducing achievable application throughput to an unacceptable level. To get a better understanding of the problem, let us consider the following example.

Figure 1.1: A video capture application.

Figure 1.1 shows an example of a video capture application. Video images captured by a camera, and digitized and compressed by a CODEC on one host, are transported over an ATM network and stored on the disk in another host. The important thing to observe in this example is that the end-points of a network connection are not the network interfaces at the end-hosts or an application process, but devices like the CODEC and the disk controllers that actually generate and consume data. If we take a close look at the transmit-side data path, we observe that in order to move data from the CODEC to the ATM network interface, data is first copied from the CODEC's buffers to the kernel (operating system) buffers in the main memory. From there the data is copied into the buffers in the application address space. To move data across the network, one would then have to copy data from the application space back into the kernel space, and from there onto the buffer on the ATM network interface. Thus, the data path crosses domain boundaries twice (from kernel to application and back), and the data is copied four times before being sent out on the network. Similar steps are involved in moving data from the network interface to the disk controller at the destination. Besides the overhead of data copying, crossing of domain boundaries inflicts an additional cost of context switching between the application domain and the kernel domain. The cost of network protocol processing must also be counted in the end-to-end overhead. Hence, it is not very surprising that end-to-end throughput is only a fraction of the network bandwidth available at the data-link layer.

The problem of poor application throughput is not limited to networked applications.
All I/O intensive applications suffer from the chronic I/O bottleneck at the end-host (we consider the network interface to be an I/O device). The root cause of this bottleneck is the inefficient movement of data between I/O devices, such as the network interface, the disk controller, the CODEC, and the display controller. In almost all existing systems, the I/O data path crosses multiple address spaces, causing multiple data copies and context switches. Several research groups [33, 36, 12] identify data copying as the primary problem with respect to system performance. Frequent context switching is the second leading cause of poor performance [32, 15, 18].

In most systems, crossing of domain boundaries (e.g. user to kernel and kernel to user) causes physical copying of data from the source domain to the destination domain. Several recent works have addressed the problem of data copying due to crossing of domain boundaries. There are two basic approaches to solving the problem: virtual copying and use of shared memory. In the techniques using virtual copying [3, 37, 10, 44], data is not physically copied when it moves from one domain to another; rather, an association is maintained between the physical copy of the data and the domain (or domains) that currently owns it. A physical copy is made only at the application's request [37] or when one of the domains attempts to modify the data item. The problem with virtual copying is that the overhead of maintaining the association between the data items and the domains is high and often comparable with that of physical copying. In the shared memory approach [14, 15], two or more domains statically share a part of their address space and use this memory to transfer data. The cost of maintaining serializability of data accesses and the problem of providing protection against unauthorized accesses often cast doubt on the performance benefits of this approach.

In the above example, every copy of a data buffer also requires a context switch between the user and the kernel domains. These context switches cannot be avoided because protection boundaries are crossed in copying data from the user to the kernel space and vice versa. The trend in microprocessor technology towards a larger set of registers, deeper pipelines, and multiple instruction issue is going to further increase the cost of a context switch. The simplest way to avoid context switching (and data copying) is to avoid crossing domain boundaries. In [18], Fall et al. propose a peer-to-peer I/O model where most of the data flow occurs through the kernel and does not require extra copies or context switches between the user and the kernel domains. Although peer-to-peer I/O transfers would lead to higher performance for I/O intensive applications, they also increase the complexity of the kernel and thus may adversely affect the performance of the kernel and other applications. An alternative approach to achieving the same goal is to contain the data path within the application domain. In [17], Engler et al. propose to eliminate all operating system abstractions and provide the applications almost direct access to the hardware. By giving applications more control over the hardware, it is possible to avoid transfer of control to the operating system for I/O services. An immediate consequence of this is a reduction in the frequency of domain boundary
crossing. The authors also claim that by lowering the interface to the hardware, the management and abstraction of these resources can be customized by applications for simplicity, efficiency, and appropriateness. It is too early to comment on the potential of this approach. The problem of application portability and the overhead of providing low level protection can act against its success.

Although our main objective is to optimize the data path between the network interface and the other devices that consume and generate network bound data, the same principles and mechanisms apply to data transfer between any other I/O devices as well. We propose an I/O architecture that not only helps preserve the network throughput as seen by the applications, but also facilitates device-to-device data transfer in general. In our model, like the network connections in ATM, device-to-device I/O transfer is also connection oriented. That is, a connection is established between the I/O devices before data transfer can take place. Like a network connection, an I/O connection can be multihop. For example, an I/O connection carrying video data from a disk controller can pass through a decompressor on its way to the display. Also, an I/O connection can extend beyond a host, to the devices in other hosts. For example, in the video capture application described above, the I/O connection from the CODEC of the source can pass through the network interfaces of the source and the destination and the intermediate switches, if any, to the disk controller at the destination. Once the connection is established, devices can exchange data with minimal or no intervention from the operating system and the application. We call these devices, capable of transmitting and receiving data on their own, active or autonomous devices. The advantages of device autonomy, coupled with a connection oriented I/O architecture, are manifold. Most importantly, it helps optimize the I/O data path.
Since the application and the operating system no longer need to control I/O transfers, the I/O data path can bypass the application and kernel address spaces. A direct consequence of limited operating system and application intervention is improved I/O throughput. In the context of network I/O, this translates to higher available bandwidth to the applications. A connection oriented I/O architecture also helps optimize network protocol processing using connection specific customizations. We have experimentally analyzed the performance impact of the proposed architecture on networked multimedia applications. On a video conferencing system built around IBM RS/6000s equipped with high-performance video CODECs and connected via 100 Mb/s ATM links, we have shown that a connection oriented autonomous I/O architecture can improve end-to-end network throughput to as much as three times that achievable using the existing architecture.

1.1.2 Quality of Service

Bandwidth alone does not solve the problem of guaranteeing service quality. The heart of a service architecture providing guarantees on end-to-end performance is the scheduling mechanism used for
multiplexing trac and the switching nodes. The manner in which multiplexing is performed has aprofound eect on the end-to-end performance of the system. Since each network connection mighthave dierent trac characteristics and service requirements, it is important that the multiplexingdiscipline treats them dierently, in accordance with their negotiated quality of service. However,this exibility should not compromise the integrity of the scheme, that is, a few connections shouldnot be able to degrade service to other connections to the extent that the performance guaranteesare violated. To protect the system from malicious and ill-behaved sources, it is also important thatthe access to the network is regulated. That is, each network connection should be associated witha trac envelope describing the characteristics of the trac it is carrying, and trac generatedby a source should be passed through a shaper or a regulator to prevent any violation of thistrac envelope. Besides policing, the shaper also smooths the trac to a form that is easilycharacterizable.In the last several years a number of multiplexing disciplines have been proposed [5]. Based on theperformance guarantees they provide, these schemes can be broadly categorized into two classes:ones that provide guarantees on maximum delay and ones that guarantee a minimum through-put. The multiplexing disciplines providing delay guarantees [26, 22] typically use priority basedscheduling to bound the worst case delay. Depending on the nature of the priority assignment, theycan be further sub-divided into static priority schemes and dynamic priority schemes. In a staticpriority scheme [26], each connection is statically assigned a priority at the time of connection setup. When a cell arrives at the multiplexing node, it is stamped with the priority label associatedwith its connection and is added to a common queue. The cells are served according to their pri-ority order. 
There are other approaches to implementing a static priority scheduler. In a dynamic priority scheduler, the priorities assigned to the cells belonging to a particular connection can differ, depending on the state of the server and that of the connections. Here again, cells are put in a common queue and served in priority order. Knowing the exact arrival patterns of the cells from different connections, it is possible to bound the worst-case delay suffered by cells from a particular connection in both static and dynamic priority scheduling. One of the serious problems with the schemes described above is that they require traffic reshaping at each node. Priority scheduling completely destroys the original shape of the traffic envelope. Since these schemes require that the exact form of the traffic envelope be known at each node in order to guarantee worst-case delay bounds, traffic has to be reshaped into its original form as it exits a switching node.

The schemes offering throughput guarantees use weighted fair queueing [35, 48] and frame based scheduling [7, 23] to guarantee a minimum rate of service at each node. Knowing the traffic envelope, this rate guarantee can be translated into guarantees on other performance metrics, such as delay, delay jitter, and worst-case backlog at a switch. Based on implementation strategies, rate-based schemes can be further classified into two categories: 1) priority queue implementation,
and 2) frame-based implementation. The most popular schemes using priority queue implementations include virtual clock [48], packet-by-packet generalized processor sharing (PGPS) [13, 35], and self clocked fair queueing (SFQ) [25]. In all of these schemes, cells are stamped on arrival with a priority label reflecting the service rate allocated to the connections they belong to. They are then put in a common queue and served in priority order. While these schemes are extremely flexible in terms of allocating bandwidth at very fine granularity and distributing bandwidth fairly among active connections, they are costly in terms of implementation. Maintaining a priority queue in the switches is expensive. In some cases [35], the overhead of the stamping algorithm can also be quite high. In contrast, frame-based mechanisms are much simpler to implement. The most popular frame-based schemes are Hierarchical-Round-Robin (HRR) [7] and Stop-and-Go (SG) [23, 24]. HRR is equivalent to a non-work-conserving round robin service discipline. In HRR, each connection is assigned a fraction of the total available bandwidth and receives that bandwidth in each frame, if it has sufficient cells available for service. The server ensures that no connection gets more bandwidth than what is allocated to it, even if the server has spare capacity and the connection is backlogged. Like HRR, SG is also a non-work-conserving service discipline. It tries to emulate circuit switching in a packet switched network. Due to their non-work-conserving service policy, both SG and HRR fail to exploit the multiplexing gains of ATM. Another important drawback of SG and HRR, and of all framing strategies for that matter, is that they couple the service delay with bandwidth allocation granularity.
That is, the finer the granularity of bandwidth allocation, the higher the delay suffered at the server.

In almost all the schemes discussed above, the shaper is assumed to enforce a specific rate constraint on the source, which is typically a declared peak or mean rate. In other words, the scheduler assumes a peak or mean rate approximation of the original source. Although a single rate characterization simplifies the task of the scheduler, it has a detrimental impact on system performance. Most multimedia applications generate inherently bursty traffic. Hence, enforcement of a mean rate results in a higher delay, while peak rate enforcement leads to lower network utilization. To alleviate this problem we propose to use multirate shapers. A multirate shaper enforces different rate constraints over time windows of different lengths. For example, a dual-rate shaper can enforce a long term average rate and a short term peak rate. Multirate shaping allows a more precise characterization of bursty traffic, which can potentially be used by the scheduler to exploit the multiplexing gains of ATM.

Although multi-rate shapers better characterize bursty traffic, it is the scheduler which determines how that information is used to improve system performance. As mentioned above, an ideal scheduler should also provide flexibility and protection, and should be simple enough to be implemented in high-speed switches. We propose a simple work-conserving scheduling mechanism designed to integrate the flexibility and fairness of the fair queueing strategies with the simplicity of frame-based mechanisms. The scheduling mechanism, which we call Carry-Over Round Robin (CORR), is
an extension of simple round robin scheduling. Much as in round robin scheduling, CORR divides the time-line into allocation cycles, and each connection is allocated a fraction of the available bandwidth in each cycle. However, unlike slotted implementations of round robin schemes where bandwidth is allocated as a multiple of a fixed quantum, the bandwidth allocation granularity can be arbitrarily small in our scheme. Also, unlike framing strategies such as SG and HRR, ours is a work-conserving discipline, and hence unused bandwidth is not wasted but is fairly shared among the active connections. We have shown that when used in conjunction with multi-rate shaping, CORR is very competitive with much more complex mechanisms, such as PGPS and SFQ.

1.2 Summary of Contributions

In this dissertation we have addressed two important problems hindering the ubiquitous deployment of distributed multimedia applications: 1) a lack of operating system support for network intensive applications, and 2) a lack of network support for quality of service guarantees. In the following, we briefly summarize our contributions in both of these areas.

To enhance operating system support for network intensive applications, we have proposed an I/O architecture. The proposed architecture limits the operating system involvement in I/O transfers by migrating some of the operating system's I/O functions to the I/O devices. By allowing autonomy to I/O devices and by introducing the notion of a connection oriented I/O architecture, we manage to limit the operating system and application involvement in I/O transfers. Consequently, it not only improves network throughput, but also addresses the more general problem of the chronic I/O bottleneck of the current generation of operating systems. 
We have experimentally demonstrated the performance impact of the proposed I/O architecture on networked multimedia applications. We have developed a video conferencing system using the principles of device autonomy and connection oriented I/O transfers, and we have achieved a three-fold performance improvement over the system using the conventional I/O model.

To address the problem of quality of service support in ATM networks, we have proposed a simple cell scheduling mechanism, named carry-over round robin (CORR). Using analytical techniques, we have analyzed the delay performance of CORR scheduling assuming multi-rate sources. To the best of our knowledge, our source model is the most general among all the related studies published in the literature. We have derived closed form bounds for end-to-end delay when CORR is used in conjunction with multi-rate shapers. Besides providing guarantees on delay, CORR is also fair in distributing the excess bandwidth. We show that despite its simplicity, the CORR scheduling discipline is very competitive with more complex disciplines.
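
As an illustration of the dual-rate shaping idea mentioned in this chapter (a long term average rate combined with a short term peak rate), the following is a minimal sketch that assumes a token-bucket formulation. The class names, the parameters (`peak_rate`, `mean_rate`, `burst`), and the choice of token buckets are assumptions made for illustration only; the shapers actually analyzed in this dissertation are defined in chapter 4.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """One rate constraint: tokens accrue at `rate` per time unit, up to `depth`."""
    rate: float          # refill rate, in cells per time unit (hypothetical unit)
    depth: float         # maximum number of stored tokens, i.e. largest burst
    tokens: float = 0.0
    last: float = 0.0    # time of the most recent refill

    def __post_init__(self) -> None:
        self.tokens = self.depth  # assumption: the bucket starts full

    def refill(self, now: float) -> None:
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now


class DualRateShaper:
    """A cell conforms only if it satisfies both the peak and the mean constraint."""

    def __init__(self, peak_rate: float, mean_rate: float, burst: float) -> None:
        # Depth 1 on the peak bucket forces a minimum spacing between cells.
        self.peak = TokenBucket(rate=peak_rate, depth=1.0)
        # The mean bucket tolerates bursts of up to `burst` cells.
        self.mean = TokenBucket(rate=mean_rate, depth=burst)

    def conforms(self, now: float) -> bool:
        self.peak.refill(now)
        self.mean.refill(now)
        if self.peak.tokens >= 1.0 and self.mean.tokens >= 1.0:
            self.peak.tokens -= 1.0
            self.mean.tokens -= 1.0
            return True
        return False  # non-conforming cell: the caller would delay or drop it
```

With a peak rate of 1 cell per time unit, a mean rate of 0.25, and a burst of 3, a source may send three closely spaced cells but is then held to the mean rate, which is the behaviour a single-rate shaper cannot express.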
1.3 Organization

The rest of the dissertation is organized as follows. In chapter 2 we introduce the concept of device autonomy. The autonomous devices are at the heart of the connection oriented I/O architecture proposed in this chapter.

Chapter 3 is devoted to the experimental validation of the I/O architecture proposed in chapter 2. In this chapter we describe in detail the architecture of a high-performance video conferencing system developed using the principles of device autonomy and connection oriented I/O. We present detailed performance results to demonstrate the impact of the proposed I/O architecture on the performance of the network subsystem in particular, and that of the I/O subsystem in general.

In chapters 4, 5, and 6 we discuss different aspects of quality of service management in ATM networks. In chapter 4, different shaping mechanisms used to regulate traffic at the edge of the network are presented. Here, we introduce the concept of multi-rate shaping and characterize the traffic envelopes defined by composite moving window, jumping window, and leaky bucket shapers. In chapter 5, we present a detailed algorithmic description of the Carry-Over Round Robin scheduling discipline and analyze some of its basic properties.

Chapter 6 is devoted to the evaluation of the shaping and scheduling mechanisms. We derive closed form bounds on the worst-case end-to-end delay when CORR is used in conjunction with multi-rate shapers. We also analyze the fairness properties of CORR scheduling.

We conclude in chapter 7 by noting the contributions of the work and laying a roadmap for future extensions.
Chapter 2

Autonomous Device Architecture

Multimedia applications typically involve moving large volumes of time-sensitive data between devices and peripherals attached to the same host or to different hosts. For example, in a video recording application, data captured from a camera is compressed and coded by a CODEC and stored on a disk, all of which can be done on the same host. Alternatively, images can be captured by a camera attached to one host, compressed by a CODEC in another, and stored on a disk in yet another host, all connected via a network. In either case, providing adequate system support requires an infrastructure that is capable of moving hundreds or thousands of megabytes of data from one end to the other in a timely and orderly manner. Although we have witnessed great gains in hardware performance in recent years, software performance has not improved commensurately. The inadequacy of the current generation of operating systems (OSs) in supporting I/O intensive applications is a major deterrent to the widespread deployment of multimedia applications. In this chapter, we propose a new architecture designed to alleviate the I/O bottleneck in the current generation of OSs.

The rest of the chapter is organized as follows. In section 2.1 we review related work. The guiding principles behind the proposed architecture are discussed in section 2.2. Section 2.3 is devoted to a description of the basic components of the architecture. Alternative data paths for device-to-device data transfer are discussed in section 2.4. Flow control and data processing issues are addressed in sections 2.5 and 2.6, respectively. 
We summarize the contributions of this chapter in section 2.7.

2.1 Current State of Art

Most of the conventional I/O system architectures stem from the Multics system of the late 1960s. Given the enormous changes in the application profile, it is not surprising that the OSs fail to provide the sheer performance required or the predictability of performance desired by multimedia applications. Although it is not immediately obvious, both of these limitations arise from the processor-centric viewpoint adopted by the current OSs. The processor-centric viewpoint is embodied in the notion of a process. A process, by definition, is an instantiation of the current state of a computation and a place to keep a record of resources reserved for the computation. The most important omission from this notion is the communications and the I/O operations that take place. Nothing is indicated about the resources to be used for communications and I/O and their expected usage pattern. Implicit resource demands of communications and I/O make it hard to design OSs which would provide predictable performance. Another aspect of the processor-centric viewpoint of current OSs is manifested in the unwarranted involvement of the processor in I/O operations. In the current systems, the processor is involved in initiating I/O operations and moving data, even when the applications perform no processing on the data.

In order to understand why I/O intensive multimedia applications suffer from the existing OS architecture, let us consider the typical activities that take place in the OS while supporting a basic multimedia application. Consider the video recording application described in the last chapter. Here, in order to move a data packet from the CODEC to the network interface, the application has to first make a system call to read data from the CODEC. As a result of the read call, the control is transferred to the kernel. In the kernel, the system call is translated into a sequence of device specific operations, ultimately resulting in copying of data from the device buffer to buffers in the application space. Once data is received in the application space, the application requests a network send. Once again the control is transferred to the kernel. 
The kernel translates the system call into device specific actions, and eventually data is moved from the application buffer to the buffer on the network interface. Hence, to move a data packet between two I/O devices, the control is transferred back and forth between the application and the kernel domains twice, resulting in four context switches and multiple data copies. This results in significant degradation of system performance by keeping the host bus busy, and consuming processor cycles and memory bandwidth.

Several recent works have addressed the problem of data copying due to crossing of domain boundaries. The simplest way of avoiding data copy is to use shared memory [14, 15]. In [37] inter-domain transfers are optimized by encapsulating data in a sequence of pallets (contiguous virtual memory addresses). Data is mapped into a receiving domain only when requested by the application. Mach [3] and its predecessor Accent [20] use a scheme known as copy-on-write to avoid unnecessary data copying. A number of techniques rely on the virtual memory system to provide copy-free cross domain transfers. Virtual page remapping [10, 44] unmaps the pages containing data units from the sending domain and maps them into the receiving domain. Shared virtual memory [41] employs buffers that are statically shared among two or more domains to avoid data transfers. The problem with using shared memory is that the unit of sharing is typically a page and data must be aligned to the page boundary. There are also tricky protection issues which often cast doubt on its viability.
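
The read-then-send path described earlier in this section can be made concrete with a small sketch. The function below is a hypothetical user-level relay (the name `relay` and its parameters are assumptions, and ordinary file descriptors stand in for the CODEC and the network interface). Every pass through the loop issues at least one read and one write system call, so each data unit enters and leaves the kernel twice and is copied across the user/kernel boundary in both directions.

```python
import os


def relay(src_fd: int, dst_fd: int, chunk_size: int = 4096):
    """Move data from src_fd to dst_fd through a user-space buffer.

    Returns (bytes_moved, system_calls). Each system call implies one entry
    into and one return from the kernel domain.
    """
    moved = 0
    syscalls = 0
    while True:
        data = os.read(src_fd, chunk_size)   # copy: kernel/device -> user buffer
        syscalls += 1
        if not data:                         # end of stream
            break
        view = memoryview(data)
        while view:                          # handle short writes
            written = os.write(dst_fd, view) # copy: user buffer -> kernel/device
            syscalls += 1
            view = view[written:]
            moved += written
    return moved, syscalls
```

Counting system calls in this way gives a rough lower bound on the number of domain crossings per transfer, which is the overhead that device-to-device transfers aim to remove from the data path.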
Frequent context switching is another leading cause of poor performance [32, 15, 18]. In the above example, every copy of a data buffer also requires a context switch between the user and the kernel domains. These context switches cannot be avoided because protection boundaries are crossed in copying data from the user to the kernel space and vice versa. The trend in microprocessor technology towards a larger set of registers, deeper pipelines, and multiple instruction issue is going to further increase the cost of a context switch. In order to reduce the frequency of context switches, Bershad et al. [6, 31] propose building extensible kernels that would include some part of an application to run in the kernel. In [18], Fall et al. propose a peer-to-peer I/O model where most of the data flow occurs through the kernel and does not require extra copies or context switches between the user and the kernel domains. Although peer-to-peer I/O transfers lead to higher performance for I/O intensive applications, they also increase the complexity of the kernel and, thus, may adversely affect the performance of the kernel and other applications.

Moving I/O data through the CPU and the memory subsystem also harms the cache performance. The processor's primary and secondary caches are filled with data that is used only once, flushing other data from the caches. This results in more processor stalls later for other programs. On the other hand, if the data transferred is allowed to bypass the cache, the networked application will experience an increased latency due to cache misses when processing message headers and control messages. The increasing gap between processor speed and memory bandwidth means that the cost of delivering data to the wrong place in the memory hierarchy would also rise proportionately. Carter et al. 
[8] propose to address this problem by integrating the memory controller with the network interface.

One solution to poor I/O throughput in the system is to move the I/O data path away from the application and the kernel. With the availability of cheaper and faster microprocessors, most devices nowadays are equipped with powerful on-board processors. Taking advantage of this trend, it is possible to make devices exchange data independently, instead of relying on application driven push-pull control. Data can be transferred from one device to another without the processor or main memory getting in the way. Such devices would not only move data, but also do rate-matching, perform limited data processing, and recover from occasional data losses. Once the connection is established between two devices by an application program, there would be no further need to involve the main processor in the data transfer. In the context of the example shown above, a connection would be established between the camera controller and the network controller, and then frames would be captured at a rate at which the network agrees to send them out. Once the connection is established, the controllers would never need to interrupt the kernel until one desires to terminate the capture mode or wishes to exchange control information, such as controls for camera positioning, exception handling, etc.

Smart devices capable of direct data transfer address the problem only partially. A disk controller


















[Figure 2.1: Separation of control and data flows. Recoverable labels: Application, I/O Bus, Disk System.]

and sometimes in the application space as well. We refer to this type of device as a passive device. All devices are currently treated as passive devices.

An active device is defined as one that is capable of handling data transfers, buffering data, and matching data rates with flow control mechanisms. An autonomous device is defined as an active device that is capable of de-multiplexing traffic according to application-specific contexts provided to it by a processor. An autonomous device is an active device with simple data manipulation capabilities such as byte swapping and checksum operations, and more complex actions such as the BITBLT operations used in bit-mapped display devices.

In an autonomous device architecture, the OS is viewed as the one that establishes the necessary contexts at devices at the beginning of a data transfer. The control may be transferred to the OS only for exception handling and not during the data flow.

2.2.2 Connection Oriented I/O

In our model, all I/O transfers are connection oriented. A connection is established between the source and the sink before the beginning of data transfer. The application and the OS participate in the connection setup phase only by setting up the context at the devices and initializing the connection states. Once the data transfer begins, it proceeds autonomously between the source and the sink, transparent to the kernel and the application. Besides reducing the involvement of the OS and the application in data movement, the connection abstraction also helps build a quality of
service architecture in the I/O subsystem. In order to support guaranteed quality of service, each connection has to be associated with a traffic envelope, and the connection setup procedure has to include an admission control test to determine if the new connection can be supported.

In many ways the connection oriented I/O architecture resembles the ATM network architecture. In some sense it is the generalization of connection oriented network I/O to other forms of I/O activity inside the end-host.

One of the major advantages of connection oriented architecture is that it allows separation of control and data flows. Consequently, data transfer mechanisms can be fast and dumb, while the control mechanisms can be as complicated as necessary. For a better understanding of its implications, consider the video capturing example. The problem with the application architecture shown in figure 1.1 is that both control and data flows between devices are through an application process. In this example, only the control flow needs to pass through the application. Since data and control cannot be separated in the existing architecture, both control and data flows pass through the application, severely limiting the throughput of the data path. In a connection oriented architecture, we can alleviate this problem by using separate connections for control and data flows. In the architecture shown in figure 2.1, data paths directly connect the devices that generate and consume data, while the control flow still passes through the application. That is, the application still remains in control of the data flow without being directly involved in moving data.

2.3 System Components

The proposed architecture is general enough to be applicable to a large class of systems. In the most general form we assume that a host system consists of one or more CPUs, a memory subsystem, and I/O devices capable of performing autonomous operations. We do not make any assumption regarding how the devices are interconnected. 
They can be connected through a bus or a switch. We also do not make any assumption on the degree of autonomy exercised by the devices. The architecture is general enough to support autonomous, active, as well as passive devices.

In our system all forms of device to device and application to device data transfers are preceded by a connection setup procedure between the source and the sink. After a connection is established, the involvement of the OS in data transfer depends on the degree of autonomy exercised by the devices. If the source and the sink are passive, data still flows through the kernel. For active and autonomous devices, the responsibility of the OS can range from exception handling to interrupt processing, depending on the sophistication of the devices in terms of the functions they support.

To support the notion of connections, devices and device interfaces need to be altered. We introduce the abstraction of I/O channels to support I/O connections at the device end. An I/O channel is




















[Figure 2.2: Alternative data paths. Recoverable labels: Device 1, Device 2, buffer.]

can be substituted by buffering in the I/O controller itself. This improves the sustainable throughput since the DMA transfer does not have to compete with the concurrent CPU activity for main memory accesses. Two I/O bus trips are required for DMA streaming, and consequently throughput is bounded by half of the I/O bus bandwidth.

DMA streaming suffers from the same shortcomings as hardware streaming. Since the data path is transparent to the CPU, data processing is limited to functions performed by the adapters.

Kernel Streaming: Both hardware streaming and DMA streaming suffer from a lack of flexibility in data manipulation. In order to provide applications with complete control over the data, we need to move data through the kernel and one or more application domains. Clearly, data must pass through the CPU/cache at least once. In most systems, crossing of domain boundaries requires data copying, leading to further reduction in achievable throughput. If all data manipulation can be performed in the kernel mode, overhead due to kernel-to-user domain boundary crossing can be avoided. We refer to in-kernel but application transparent data transfer between devices as kernel streaming. All devices are capable of using kernel streaming.

Although kernel streaming offers full programmability, data manipulation has to be performed in the kernel. Consequently, applications are limited by the functionalities provided by the kernel for data manipulation. Some of the newer OSs are trying to address this problem
by providing interfaces to execute application code in the kernel domain. The problem of protection against faulty and malicious user code casts doubt on this approach.

Application Streaming: We refer to data transfer between devices through the kernel and one or more application domains as application streaming. Most of the existing applications follow this model of data streaming. Clearly, application streaming provides the most flexible interface for data manipulation, but only at the cost of data transfer throughput.

2.5 Flow Control

Flow control refers to the task of speed-matching between the data source and the data sink. In traditional application streaming, the application controls the flow by exercising a push-pull control on the source and the sink. That is, a source is blocked until the sink has consumed at least a part of the outstanding data. Data is buffered in the application and in the kernel to absorb temporary disparity in the rates of data generation and consumption. Unfortunately, application driven control is not an available option in application transparent data streaming. Also, push-pull control is not the ideal form of flow control for high-speed data transfers.

We propose to use source rate control for controlling autonomous data flows. In our model, each I/O connection is associated with a traffic envelope which describes the characteristics of the data flow between the source and the sink of the connection. For example, a flow envelope may specify the peak and the mean rates of the flow. As a part of connection setup, an admission control test is performed to check if the resources available at the sink are sufficient to consume the data generated by the source. If sufficient resources are available, the connection is admitted, and the traffic envelope is communicated to the source. It is the responsibility of the source to conform to this traffic envelope. 
If sufficient resources are not available at the sink, the connection is aborted.

The advantage of source rate control is that the overhead of flow control during data transfer is minimal. Hence, it is extremely suitable for high-speed data streaming. The flip side of this approach is that once the resources are committed to a connection, they cannot be used for other connections, even when they are under-utilized.

2.6 Data Processing Modules

The proposed architecture is an exact fit for applications that move large volumes of data but perform very little processing on it. Although most multimedia applications fall into this model, there are many which do not. Hence, it is important to add support for data manipulation
[Figure 2.3: DMA data path. Recoverable labels: flushing the entire cache; partial cache invalidation; lazy cache invalidation.]

capability to enhance the generality of the model. We use channel handlers for this purpose. The handlers associated with each channel can be used to process data both at the source and at the destination in a channel specific way. In order to offer full flexibility in data processing, we need to provide an interface for applications to specify the handlers to be used with a particular channel. Many modern operating systems, such as IBM's AIX and SUN's Solaris, provide a dynamic loading facility to load code modules into a protected address space. This facility is widely used to load device drivers and other kernel modules selectively depending on hardware availability. It can be easily extended to load user code to an attached peripheral device rather than the kernel address space. Hence, user specified data processing modules can be attached to an I/O connection. However, the problem of protection against erroneous and malicious code modules still remains to be addressed.

Even when application code cannot be used as processing modules, connection handlers could still be quite useful for connection specific data processing. For example, consider the scenario sketched in figure 2.3. Video data received over the network is DMAed into main memory for subsequent display. In many systems, such as IBM's RS/6000, the DMA data path bypasses the cache. Hence, to maintain cache consistency, after a DMA operation the network device driver flushes the data cache. However, the chances that a stale cache line is accessed by the CPU if the cache is not flushed are quite small. While it is important to flush the cache where high data fidelity is mandatory, avoiding cache flushing not only saves the overhead of the flush operation, but also improves cache performance. If we identify network connections where data corruption does not lead to catastrophic consequences, we can improve system performance by avoiding cache flushing [16]. 
The video application described above is a perfect candidate for such optimizations.
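
The idea of channel handlers with a per-connection cache policy can be sketched as follows. This is a toy model under stated assumptions, not the implementation used in this work: the `Channel` class, its `flush_cache_after_dma` flag, the `flushes` counter, and the `swap16` handler are all hypothetical names, and the "flush" is only counted, not performed.

```python
from typing import Callable, List

Handler = Callable[[bytes], bytes]


def swap16(data: bytes) -> bytes:
    """Example handler: swap adjacent bytes; an odd trailing byte is left as is."""
    out = bytearray(data)
    even = len(out) - (len(out) % 2)
    out[0:even:2], out[1:even:2] = out[1:even:2], out[0:even:2]
    return bytes(out)


class Channel:
    """Hypothetical I/O channel carrying its own handlers and cache policy."""

    def __init__(self, name: str, flush_cache_after_dma: bool = True) -> None:
        self.name = name
        self.flush_cache_after_dma = flush_cache_after_dma
        self.handlers: List[Handler] = []
        self.flushes = 0  # stands in for real cache flush operations

    def attach(self, handler: Handler) -> None:
        self.handlers.append(handler)

    def deliver(self, data: bytes) -> bytes:
        """Run the channel specific handlers on one incoming data unit."""
        if self.flush_cache_after_dma:
            self.flushes += 1  # a real driver would flush the data cache here
        for handler in self.handlers:
            data = handler(data)
        return data
```

A loss-tolerant video channel would be created with `flush_cache_after_dma=False`, while a control channel keeps the default and pays for a flush on every delivery.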
2.7 Summary

In this chapter we have proposed a connection oriented autonomous I/O architecture. Our approach of delegating more responsibility to the devices is fundamentally different from most of the solutions proposed in the literature. Device-to-device autonomous data transfers with minimal OS and application intervention have the potential to eliminate the I/O bottleneck in the operating system. The notion of I/O channels not only provides a uniform abstraction for all I/O activity in the system, but also enables channel specific customization of I/O services. It also establishes an infrastructure for performance guarantees on I/O operations.
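
The connection setup with admission control described in section 2.5 can be illustrated with a minimal sketch. The admission rule used here (the peak rate must not exceed the sink capacity, and the sum of reserved mean rates must not exceed it either) is an assumption chosen for simplicity; the names `Envelope`, `Sink`, `connect`, and `close` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Envelope:
    """Traffic envelope of one I/O connection (rates in arbitrary units)."""
    peak_rate: float
    mean_rate: float


@dataclass
class Sink:
    capacity: float                       # rate the sink can consume
    reserved: float = 0.0                 # sum of admitted mean rates
    connections: Dict[str, Envelope] = field(default_factory=dict)

    def connect(self, conn_id: str, env: Envelope) -> Optional[Envelope]:
        """Admission test at setup time; returns the envelope or None if aborted."""
        if env.peak_rate > self.capacity:
            return None
        if self.reserved + env.mean_rate > self.capacity:
            return None
        self.reserved += env.mean_rate    # resources stay committed until close
        self.connections[conn_id] = env
        return env                        # communicated to the source

    def close(self, conn_id: str) -> None:
        env = self.connections.pop(conn_id)
        self.reserved -= env.mean_rate
```

The sketch also shows the drawback noted in section 2.5: capacity reserved for an admitted connection is unavailable to others until the connection is closed, even if the source sends below its envelope.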
















[Figure 3.2: MMT adapter. Recoverable labels: DP-RAM, Stat. Reg., Com. Reg., Micro Channel Bus.]

output after decompression. It supports full-duplex real-time video compression and decompression at frame rates up to 30 frames/second. The whole system is controlled by a dedicated DSP (Digital Signal Processor). The DSP has access to a 256 Kbyte SRAM and a 16 Kbyte dual port memory (DP-RAM). The SRAM is used for smoothing and mixing of video and audio streams. The dual port memory is used to communicate with the system.

The adapter currently supports the ISO Motion-JPEG [46] standard. The video processing unit consists of two motion JPEG engines, one for compression and the other for decompression. They can support different frame rates and resolutions. The video input is fed through the video frame rate control logic to the compression engine. The data rate of the compressed data stream can be controlled by programming the quantizer to as low as 128 Kbits/sec and as high as 10 Mbits/sec [1]. The CODEC is capable of mixing up to 32 video streams [42] in the compressed domain and presenting them in multiple video windows.

ATM Adapter. The IBM ATM adapter [2] is responsible for performing the AAL5 functionalities. It features a dedicated i960 processor and a specialized chipset to handle AAL5 segmentation and reassembly in hardware. The adapter is equipped with a DMA master and 2 MB of on-board memory. It can support up to 1024 network connections with an aggregate throughput of 100 Mb/s.
3.2 Base System: Implementation and Profiling

In this section, we describe the architecture and the performance of the first prototype implementation of the conferencing system. Our objective behind this effort is to understand the limitations of the conventional I/O architecture. In the following description we limit our attention to the Video Audio Support Unit (VASU), the subsystem responsible for moving audio and video data. For a detailed description of the entire system refer to [39].

3.2.1 Implementation

The first prototype, referred to as the base system, uses the classical UNIX model of peer-to-peer communication. Peer VASUs running on different hosts open communication channels to each other using datagram sockets running over UDP/IP on ATM AAL5. In the base system, no modification is made to the devices or the device interfaces. We have written AIX drivers for the MMT and ATM adapters following the standard UNIX paradigm for character and network devices, respectively. A simple optimization is incorporated in the MMT driver to reduce data copying. In the following we describe some of the implementation details.

MMT Device Driver. The MMT device driver follows the well-known config-open-close-read-write-ioctl UNIX paradigm. The config call initializes the MMT device by loading appropriate micro-code and initializing the device state. The open and close calls are standard. To optimize data movement between device buffers and applications we map MMT buffers into kernel address space. This allows us to move data directly from the device buffer to the application space, and vice versa, without copying into intermediate kernel buffers in the main memory. This optimization saves one data copy for each read and write operation. We have implemented several device specific ioctl (I/O control) calls. These include registering user processes with the device so that asynchronous call-backs can be made to the appropriate process when data is ready in the device, or when the device is ready to accept data. 
There are ioctl interfaces to register asynchronous exception handlers and to change device configurations, such as quantization, frame rate, etc. We have also added ioctl calls for turning device profiling on and off and generating device statistics.

ATM Interface Driver. The ATM device driver consists of several sub-layers in which the lowest layer interfaces with the ATM adapter and the highest layer interfaces to the AIX network subsystem. The standard interface for the ATM device is the IP network interface and the device supports the classical IP over ATM model [29]. In addition to this IP interface, the ATM device driver also provides a low-level UNIX device interface. Hence, the VASU can access the ATM network using either one of these two mechanisms. While using the low-level
interface, the device is opened via the open system call. Before data can be sent or received (write and read respectively), virtual channels (VCs) have to be established. This is done via ioctl calls. There are ioctl interfaces for opening and closing connections. In opening a VC its traffic characteristics (peak cell rate, sustainable cell rate, burst length) and other connection parameters (simplex/duplex, service priority, AAL type) are specified. In addition, there are ioctl calls for resetting the sender and receiver entities for a certain VC.

The VASUs use the file I/O interface to open, close, read, and write data from the MMT, and use the socket API to open and close network connections and the send and receive system calls to exchange video and audio data over the network. On the transmitting side, audio and video data is captured, digitized and compressed by the MMT. Compressed data is packetized by the DSP and an interrupt is sent to the driver indicating that data is ready to be read. The driver, in turn, sends a signal to the VASU. The VASU, upon receipt of the signal, reads the data and sends it over the UDP/IP socket to its peer. On the receive side, the VASU receives data on the UDP socket. Once data is received from the network interface, the VASU writes it into the MMT buffer using the write system call provided by the MMT driver.

3.2.2 Data Path Profiling

The quality of video and audio in the base system was far below our expectations. Clearly, neither the network nor the CODEC was the bottleneck. In order to identify the system bottlenecks we instrumented the transmit and receive data paths and performed a thorough profiling of the system.

Figures 3.3 and 3.4 show the transmit and receive latencies in the base system. These measurements have been taken on an RS/6000 Model 530H with 32 Mbytes of memory and a 50 MHz processor. The measurements were taken using the system's real-time clock, which is an integral part of the RS/6000 architecture. 
This clock can be accessed by any process by reading two 32-bit clock registers with microsecond granularity. We used a two-instruction assembly language function to read the clock registers with negligible overhead. To store the statistics gathered, we allocated temporary buffers in the kernel. We also added ioctl calls so that the applications can access the statistics.

As shown in figure 3.3, the latency on the transmit side comprises two major components: MMT read and network send, each of which is a system call. The MMT read overhead can be further broken down into the cost of context switching and the overhead due to data copying across domain boundaries. Note that the data copying in this particular case is a copy from the MMT adapter to the main memory across the I/O bus. The network send overhead includes the cost of






















context switching, two data copies, and network protocol processing. First, data is copied from the user buffer to kernel mbufs (see footnote 5), a main memory to main memory copy. The second data copy is from the kernel mbufs to buffers on the adapter across the I/O bus. Hence, the entire transmit operation involves two context switches and three data copies: one main memory to main memory, and two between the main memory and the memory on the I/O adapters. Similarly, the receive data path consists of two system calls (see figure 3.4), one each for network receive and MMT write.

[Figure 3.3: Transmit latency in the base system (in microseconds). Labeled components include: write to network (data copy); socket, UDP/IP processing.]

[Figure 3.4: Receive latency in the base system (in microseconds). Labeled components include: read from network (data copy); write to MMT.]

Footnote 5: mbufs are kernel-managed buffers [45].

As expected, the overheads incurred by read and write calls to the MMT are approximately the same. They are primarily due to context switching and data copies across domain boundaries. However, the socket sends and receives are significantly more expensive than MMT read and write calls. Besides the cost of data copying across domain boundaries and between the adapter buffer and main memory, they also include protocol processing overheads. In MMT read and write operations we save one data copy by copying data directly from the adapter buffer to the application buffer and vice versa. When data is moved between the application buffer and the ATM interface buffer, data is first copied into a kernel buffer in the main memory, and then copied into the application buffer or the adapter buffer depending on the direction of data flow. This extra copy into the kernel buffer cannot be avoided since the CPU performs network protocol processing on the data.

If we convert these latency figures into equivalent throughput, we observe that only a very small fraction of the network bandwidth is actually available to the application. A large portion of the overhead can be attributed to the circuitous data path between the MMT and ATM adapters. There is no reason for video and audio data generated by the MMT to pass through the buffers in the kernel and the application address space on its way to the ATM network interface. Similarly, moving network data from the ATM adapter through kernel and application buffers in the main memory to the MMT adapter is far from an optimal data path. The unified flow of control and data, together with the lack of autonomous device operation in the current generation of OSs, is the only reason why data has to pass through the application. By taking advantage of I/O connections we have separated control and data flows in the optimized system.
This allows us to set up a direct MMT-to-ATM data path that bypasses the VASU.

From figures 3.3 and 3.4 we also observe that one third of the overhead in the transmit and receive latencies is contributed by the protocol and socket processing overheads. A large component of this overhead stems from the protocol redundancy in the existing layered protocol architecture, such as the Internet protocol family. The ATM adaptation layer, implemented in hardware, provides a rich set of functionality, such as connection management, flow control, segmentation, and reassembly. However, to maintain the same interface to all the link layers, the Internet protocol suite (TCP/IP and UDP/IP) ignores the special features provided by ATM, and many of the functionalities are replicated in higher layers. In order to exploit the rich set of functionality provided by the ATM network, in the optimized system we use a native ATM stack running over AAL5 instead of UDP/IP. The connection oriented I/O architecture provides the necessary infrastructure for an elegant implementation of the native ATM stack.
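The conversion from per-packet latency to equivalent throughput used in this section is simple arithmetic, and a minimal sketch may make it concrete. The function name and the sample numbers below are illustrative assumptions, not measurements from the testbed.

```c
/* Equivalent throughput in Mb/s for moving `bytes` bytes in `latency_us`
 * microseconds. One bit per microsecond equals one megabit per second.
 * The name and signature are hypothetical, chosen for illustration. */
double equivalent_throughput_mbps(double bytes, double latency_us)
{
    if (latency_us <= 0.0)
        return 0.0;             /* no meaningful rate without a latency */
    return (bytes * 8.0) / latency_us;
}
```

For example, a hypothetical 4096-byte segment that takes 2048 microseconds end to end corresponds to 16 Mb/s, whatever the raw link rate may be.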
3.3 Optimizations with Autonomous I/O

Based on the lessons learned from the implementation and evaluation of the first prototype, we have designed and implemented the second prototype with two primary objectives:

• Optimize the data path between MMT and ATM adapters by using DMA-streaming. Although hardware streaming would have been the best option in terms of performance, we opted for DMA streaming mainly due to logistic limitations (see footnote 6).

• Optimize protocol processing overhead by using a lightweight native ATM protocol stack instead of UDP/IP.

To facilitate autonomous data streaming and connection oriented I/O, we introduced the notion of I/O channels. A channel is a resource sub-unit on a particular device. An application opens I/O channels to devices in order to access their services. Channels can be spliced to establish application-transparent direct data paths between autonomous devices. The I/O architecture implemented in the optimized system is structured around the notion of I/O channels. Below we describe the architectural and operational details of the system.

3.3.1 Driver Extensions

In order to incorporate the notion of device autonomy and I/O channels we had to rewrite the device drivers for MMT and ATM adapters. The extended drivers support data streaming and channel specific data processing along with the services provided by standard UNIX devices. Below we briefly describe some of the major changes made to the MMT and ATM device drivers.

Data Structure: To incorporate the notion of I/O channels we extended the device driver data structure to include channel state information for each open connection. The channel state for a connection consists of a pointer to a buffer pool reserved for the channel, pointers to asynchronous connection handlers, and a pointer to a data structure consisting of device dependent channel state information.

Entry Points: In order to maintain backward compatibility we use the standard UNIX interface for the drivers.
The new functionality is accessed using the extension parameters of the standard calls and several new ioctl calls. Below we briefly explain some of the important features of the extended driver.

Footnote 6: This would have required hardware modifications and changes to device microcode, which were beyond the scope of our project.
• open: The open entry point is used to open an I/O channel to the device. We use the extension parameter to specify the channel state information and channel handlers. In response to the open call, the kernel opens a new channel with the specified characteristics. If sufficient resources are not available, the request to open a channel is refused. The following example explains the actions in more detail.

    int rc;
    int devfd;                          /* Device file pointer */
    struct CHANEXT {
        int chhandle;                   /* Channel handle */
        int (*ch_rcv)();                /* Pointers to the channel handlers */
        int (*ch_snd)();
        int (*ch_ctl)();
        void *ch_state;                 /* Channel state and parameters */
    } ext;                              /* Channel handlers */

    ext.ch_rcv = rxhandler;             /* Receive handler */
    ext.ch_snd = txhandler;             /* Transmit handler */
    ext.ch_ctl = sxhandler;             /* Status handler */
    rc = open(devfd, devflag, &ext);    /* Registration */

  In the pseudo code above, rxhandler, txhandler, and sxhandler are the asynchronous handlers for receive, transmit, and status handling, respectively. These handlers are specified by the application and are used for connection specific data processing. The structure ch_state is used to initialize the channel state variables. It is also used for specifying channel parameters and for admission control. For example, for the ATM interface controller, we specify the channel characteristics in terms of the bandwidth requirement and traffic envelope (e.g., the peak arrival rate, burst size, etc.). Depending on the current state of the device, the request to open a new connection may or may not be granted.

• close: The close call is used to close a previously opened I/O channel. It de-allocates the resources reserved for the channel and unloads the channel handlers.

• read: The applications use the read entry point to move data from the device to application space. The application can use the receive handlers to perform channel specific processing on the data as a part of the read call.
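The open and close behavior described above (reserve resources and register handlers, or refuse the request) can be sketched in user-space C as follows. This is a minimal illustration under stated assumptions, not the driver's actual code: the names struct device, channel_open, channel_close, the MAX_CHANNELS limit, and the use of a single bandwidth figure for admission control are all hypothetical.

```c
#include <stddef.h>

#define MAX_CHANNELS 8                  /* hypothetical per-device limit */

typedef int (*ch_handler_t)(void *buf, size_t len);

struct channel {
    int          in_use;
    unsigned     peak_kbps;             /* bandwidth reserved for the channel */
    ch_handler_t rcv, snd, ctl;         /* asynchronous channel handlers */
};

struct device {
    unsigned       capacity_kbps;       /* total bandwidth that can be reserved */
    unsigned       reserved_kbps;       /* bandwidth already committed */
    struct channel ch[MAX_CHANNELS];
};

/* Returns a channel handle (>= 0), or -1 if the request is refused. */
int channel_open(struct device *dev, unsigned peak_kbps,
                 ch_handler_t rcv, ch_handler_t snd, ch_handler_t ctl)
{
    if (peak_kbps > dev->capacity_kbps - dev->reserved_kbps)
        return -1;                      /* admission control fails */
    for (int i = 0; i < MAX_CHANNELS; i++) {
        if (!dev->ch[i].in_use) {
            dev->ch[i] = (struct channel){1, peak_kbps, rcv, snd, ctl};
            dev->reserved_kbps += peak_kbps;
            return i;
        }
    }
    return -1;                          /* no free channel slot */
}

/* Releases the channel's reservation and clears its handlers. */
int channel_close(struct device *dev, int handle)
{
    if (handle < 0 || handle >= MAX_CHANNELS || !dev->ch[handle].in_use)
        return -1;
    dev->reserved_kbps -= dev->ch[handle].peak_kbps;
    dev->ch[handle] = (struct channel){0};
    return 0;
}
```

With a device of capacity 1000 kb/s, a first request for 600 kb/s is admitted, a second identical request is refused, and it succeeds again once the first channel has been closed.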











[Figure 3.5: Transmit and receive data path in the optimized system. Panels: Receive Path, Transmit Path.]

video and audio data generated by MMT. VASU also opens a native ATM connection to the peer VASU at the destination. By appropriately programming the transmit handler associated with the MMT data channel, the data channels to MMT and the ATM interface are spliced for the purpose of data streaming from the MMT to the ATM adapter.

Similarly, on the receive side, VASU opens two channels to MMT. One of them is used for control flow from MMT to VASU; the other is used for moving data to MMT. The data channel to MMT and the ATM network connection set up earlier by the peer VASU on the transmit side are spliced to form a direct data path from the ATM adapter to the MMT adapter. Data transfer starts when the VASUs on the transmit and receive ends trigger data streaming through the ioctl interfaces to the MMT and ATM devices, respectively.

3.3.4 Data Flow

On the transmit side, when the MMT card has data to send, it interrupts the system. The interrupt is trapped by the interrupt handler, which calls the appropriate transmit handler to process the packet. The send handler copies data directly into the buffer on the ATM adapter. It is the responsibility of the transmit handler of ATM to queue the datagram for transmission. In the transfer of data from MMT to the network interface the data path never crosses the kernel boundary and, hence, saves the cost of data copying and context switching. This mode of data transfer still cannot be called completely autonomous because the devices still interrupt the system.
However, the task of the interrupt handler is limited to calling the appropriate handler. Once the handler initiates the DMA, data transfer occurs autonomously between the devices, transparent to the kernel and the application.
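The transmit-side flow just described (interrupt, handler lookup, copy into the peer adapter's buffer) can be sketched as follows. This is a simplified user-space model under stated assumptions: struct chan, splice, on_device_interrupt, and the buffer size are hypothetical names and values, and memcpy stands in for the DMA transfer between adapters.

```c
#include <stddef.h>
#include <string.h>

#define ADAPTER_BUF 4096                /* hypothetical adapter buffer size */

struct chan;
typedef int (*handler_t)(struct chan *self, const void *data, size_t len);

struct chan {
    handler_t     snd;                  /* transmit handler for this channel */
    struct chan  *peer;                 /* channel this one is spliced to */
    unsigned char buf[ADAPTER_BUF];     /* stands in for adapter memory */
    size_t        queued;               /* bytes queued for transmission */
};

/* Transmit handler of the source channel: move the data straight into the
 * peer's adapter buffer, without passing through the application. */
static int spliced_send(struct chan *self, const void *data, size_t len)
{
    struct chan *dst = self->peer;
    if (dst == NULL || len > ADAPTER_BUF - dst->queued)
        return -1;                      /* not spliced, or no room */
    memcpy(dst->buf + dst->queued, data, len);   /* models the DMA */
    dst->queued += len;
    return 0;
}

/* Splice two channels so that data from src is streamed to dst. */
void splice(struct chan *src, struct chan *dst)
{
    src->peer = dst;
    src->snd = spliced_send;
}

/* Interrupt handler: its only job is to call the channel's handler. */
int on_device_interrupt(struct chan *src, const void *data, size_t len)
{
    return src->snd ? src->snd(src, data, len) : -1;
}
```

The design point the sketch tries to capture is that the application appears only at setup time, when it calls splice; after that, each interrupt is resolved entirely by the registered handler.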





















[Figure 3.6: Comparison of transmit latencies in the base and optimized systems (in microseconds).]



















[Figure 3.7: Comparison of receive latencies in the base and optimized systems (in microseconds).]

On the receive side of the optimized system, whenever an (AAL5) packet is completely received, the network interface interrupts the system. In response, the interrupt handler calls the receive handler associated with the connection on which the packet is received. If the final destination of the data is the MMT adapter, the receive handler DMAs data into an mbuf chain in the main memory and invokes the appropriate receive handler on the MMT. It is now the responsibility of the MMT receive handler to copy the mbuf chain to the dual-port buffer on the MMT adapter and submit the data for processing. Note that the transmit and receive data paths are not symmetric. On the transmit side, data generated by MMT is DMAed directly into the buffer on the ATM adapter.






















































[Figure 3.8: Transmit and receive throughputs. Surviving row of the table: 4096, 15.03, 17.42, 315%, 141%.]

However, on the receiving side, data received at the ATM interface is first DMAed into the main memory and then moved into the buffers on the MMT adapter. This asymmetry is due to the lack of adequate buffering on the MMT adapter. As mentioned earlier, MMT is a prototype device and has only 8KB of dual-port memory available for data buffering. It is divided into 4KB of transmit buffer and 4KB of receive buffer. The ATM card, on the other hand, has 2MB of on-board dual-port buffer. Hence, when data is generated by MMT, it can be transferred directly onto the ATM adapter. However, while transferring data from ATM to MMT it has to be buffered through the main memory, since the single receive buffer on MMT is not always free. Figure 3.5 shows the transmit and receive data paths in the optimized system.

3.3.5 Limitations

There are a few limitations in the implementation of the autonomous device architecture described above. The devices still interrupt the system. To eliminate the interrupts and to make the data transfer completely autonomous, we need to modify the device microcode. Making these changes is beyond the scope of our project. We refer to this form of device-to-device transfer as "semi-autonomous" operation. In the next section, we will see that despite this shortcoming, the improvement in end-to-end throughput is more than three-fold. We believe that with fully autonomous data paths, improvements in throughput will be even more dramatic.
3.4 Performance Results

In the optimized system, data paths never cross the domain boundaries. The in-kernel data transfer eliminates the system calls and consequently both (i) the overhead due to data copying across domain boundaries and (ii) the cost of context switching. The latencies in the optimized data path reflect the cost of moving data from the MMT adapter to the ATM interface and vice versa. On the transmit side, data is DMAed directly from the MMT adapter buffer to the buffer on the ATM adapter. On the receiving side, data received at the ATM network interface is first DMAed into a main memory buffer pool and then copied into buffer memory on the MMT card. Besides data path optimizations, we save some of the protocol processing overheads.

We profiled the transmit and receive data paths in the optimized system following the same procedures used in profiling the base system. Figures 3.6 and 3.7 compare the transmit and receive side latencies in the base and the optimized systems. Figure 3.8 summarizes all the measurements and shows the percentage improvement in throughput in the optimized system. The results speak for themselves. On the transmitting side, we observe throughput improvements ranging from 5200% for data segments of size 4 bytes to 315% for data segments of size 4 Kbytes. The dramatic improvement in throughput for smaller packet sizes reflects the fact that the fixed overhead associated with data transfer is negligible in the optimized system. Since the most significant part of the fixed overhead is contributed by the context switches, these results are as expected. For larger data segments, the improvement in throughput can be attributed to the reduction in data copying cost. In the optimized system, data is copied only once, compared to three times in the base system. If we account for the savings due to the elimination of two context switches and two data copies, a three-fold improvement in throughput may appear a little anomalous.
In fact, a three-fold improvement would be expected from the elimination of two data copies alone. The reason behind this anomaly is that one of the two copies eliminated in the optimized data path is a main memory to main memory copy. A main memory to main memory copy is less expensive than a copy between main memory and I/O memory in either direction. Hence, eliminating two copies out of three did not translate into a saving of two thirds of the overhead. We could not experiment with packet sizes larger than 4 Kbytes since MMT's transmit buffer is 4 Kbytes in size. Extrapolating from the measured data, we predict that the end-to-end throughput would saturate at around 55-60 Mb/s.

The improvements on the receiving side, although substantial, are not as good as those on the transmitting side. We observe throughput improvements ranging from 540% for 4 byte packets to 141% for 4 Kbyte packets, compared to 5200% and 315% for the corresponding packet sizes on the transmitting side. This difference can be attributed to the asymmetry in the send and receive data paths in the optimized system. We believe that with more buffering on MMT, it is possible to optimize the receive data path to yield throughputs comparable to those of the transmit data path.

3.5 Summary

In this chapter we discussed in detail the design, implementation, and evaluation of a video conferencing system developed using the principles of autonomous device operations. Our initial attempts to develop a desktop conferencing system supporting full-motion, high-resolution video and high-quality audio failed to meet our expectations despite adequate hardware support. After a thorough profiling of the system, we identified the operating system interfaces to the I/O devices and network interfaces as the bottlenecks. To alleviate the problem, we have implemented a prototype system where MMT and ATM adapters communicate with each other in a "semi-autonomous" fashion with minimum application and kernel intervention. The experimental performance results presented above clearly demonstrate the effectiveness of the proposed architecture. We believe that the connection oriented autonomous device architecture is a powerful model with the potential to provide the sheer performance demanded by many multimedia applications.
Chapter 4

Traffic Shaping

In chapter 2 we proposed a connection oriented I/O architecture with the objectives to (1) improve network throughput as seen by the applications, and (2) establish an infrastructure for per-connection service guarantees. In chapter 3 we demonstrated the performance impact of the proposed architecture on network throughput. The goal of this chapter, and that of chapters 5 and 6, is to make use of the proposed connection model to develop a service architecture that provides deterministic guarantees on end-to-end performance. In our model (see figure 4.1), autonomous devices are the sources and sinks of data that flows over the connection, called a virtual channel, joining them. Each virtual channel is associated with a traffic envelope, describing the characteristics (e.g., peak and average rates, burst lengths and periods) of the traffic it is carrying. Also associated with each channel is a service envelope specifying its service requirements, such as maximum tolerable delay, desired delay jitter, etc. Our objective is to design an end-to-end service architecture that admits a connection only when it can guarantee to satisfy its service requirements.

In order to provide guaranteed services, resources such as bandwidth and buffers have to be reserved a priori, so that the promised quality of service is not violated due to system overloading. The resource reservation algorithm at a multiplexing node takes into account the traffic envelopes and the service envelopes of all the connections passing through the node in order to determine how many resources to reserve for a particular connection, or whether to reject a connection when sufficient resources are not available. Besides traffic and service envelopes, resource reservation also depends on the multiplexing schemes used at the switching nodes.
Hence, in order to design and develop an effective quality of service (QoS) architecture, we need to understand the interaction between the traffic characteristics and service requirements of different virtual channels and the multiplexing policy used at the switching nodes. In this chapter we focus on traffic characteristics. Multiplexing mechanisms are discussed in chapter 5. Chapter 6 is devoted to the evaluation of the service architecture.

The rest of the chapter is organized as follows. In section 4.1, we introduce the notion of traffic shaping. We discuss simple shaping mechanisms in section 4.2. Composite shapers and the associated traffic envelopes are discussed in section 4.3. We summarize in section 4.4.

[Figure 4.1: The network model. Components: Traffic Source, Shaper, Switch, Data Sink.]

4.1 Traffic Shapers

One of the most important components in a QoS architecture is the characterization of the traffic source associated with each connection. The resource reservation algorithm at a multiplexing node needs to analyze the traffic envelopes associated with all the connections passing through the node in order to determine the service quality that can be offered to individual connections. Since the number of connections passing through a node may run into hundreds or even thousands, it is important that the traffic envelopes are succinct and simple. Unfortunately, traffic generated by most multimedia applications is very bursty and often difficult to model [28] and specify. To alleviate this problem, traffic generated by a source is passed through a traffic shaper, which shapes the traffic into a form that is simple to specify and easy to analyze. Besides shaping, a shaper (or regulator) also polices traffic so that a source cannot violate the traffic envelope negotiated at the time of connection setup and degrade the service quality of other users. If the traffic generated by a source does not conform to the traffic envelope enforced by the shaper, the shaper can either drop the violating cells, tag them as lower priority traffic, or hold them in a reshaping buffer. In the rest of the discussion we assume that the shaper exercises the third option. In our model, traffic generated by a source (autonomous device) is first fed into a shaper buffer. From the shaper buffer the data is passed through a regulator and is released into the system/network in the form of fixed size cells satisfying the traffic envelope. The shape of the traffic envelope is determined by the shaping mechanism and the shaper parameters.
By introducing a shaper at the edge of the network we get better control over the traffic entering the network. From the perspective of the network, the traffic generated by the shaper is really the traffic entering the network. In the rest of the chapter our objective is to study different shaping mechanisms and characterize the traffic envelope they enforce on a source. These traffic envelopes are used later to model the traffic arrival into the network.

4.2 Simple Shapers

Several shaping mechanisms enforcing different classes of traffic envelopes have been proposed in the literature. The most popular among them are leaky bucket [43], jumping window, and moving window [38]. In the following we briefly describe their working principles and the traffic envelopes they enforce on a connection.

• Leaky Bucket Shapers: A leaky bucket shaper consists of a token counter and a timer. The counter is incremented by one every t time units and can reach a maximum value b. A cell is admitted into the system/network if and only if the counter is positive. Each time a cell is admitted, the counter is decremented by one. The traffic generated by a leaky bucket regulator consists of a burst of up to b cells followed by a steady stream of cells with a minimum inter-cell time of t. The major attraction of the leaky bucket is its simplicity. A leaky bucket regulator can be implemented with two counters, one to implement the token counter and the other to implement the timer.

• Jumping Window Shapers: A jumping window regulator divides the time line into fixed size windows of length w and limits the number of cells accepted from a source within any window to a maximum number m. The traffic generated by a jumping window can have a worst-case burst length of 2m. This happens when two bursts of size m each are released next to each other, the first one at the end of its window and the second one at the beginning of the next window. Like a leaky bucket, a jumping window regulator can also be implemented with two counters.

• Moving Window Shapers: Similar to a jumping window, in a moving window the number of arrivals in a time window w is limited to a maximum number m. The difference is that each cell is remembered for exactly one window width. That is, if we slide a window of size w on the time axis, the number of cells admitted within a window period can never exceed m, irrespective of the position of the window. Hence, the worst-case burst size in a moving window regulator never exceeds m. Compared to a jumping window shaper, traffic generated by a moving window shaper is smoother. This smoothness, however, comes at the cost of added complexity in implementation. Since the departure time of each cell is remembered
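The two-counter implementations mentioned for the leaky bucket and the jumping window can be sketched as below. This is a minimal discrete-time model, not code from the dissertation: the struct and function names are hypothetical, time advances one unit per tick call, and the bucket is assumed to start full (the worst case). The moving window is left out because it needs per-cell memory rather than two counters.

```c
#include <assert.h>

/* Leaky bucket: one token every t time units, at most b tokens. */
struct leaky_bucket {
    unsigned b, t;       /* bucket size and token period */
    unsigned tokens;     /* token counter */
    unsigned timer;      /* time units since the last token */
};

void lb_init(struct leaky_bucket *lb, unsigned b, unsigned t)
{
    assert(b > 0 && t > 0);
    lb->b = b;
    lb->t = t;
    lb->tokens = b;      /* assumption: start with a full bucket */
    lb->timer = 0;
}

/* Advance the clock by one time unit. */
void lb_tick(struct leaky_bucket *lb)
{
    if (++lb->timer == lb->t) {
        lb->timer = 0;
        if (lb->tokens < lb->b)
            lb->tokens++;
    }
}

/* Try to admit one cell; returns 1 if admitted, 0 if it must wait. */
int lb_admit(struct leaky_bucket *lb)
{
    if (lb->tokens == 0)
        return 0;
    lb->tokens--;
    return 1;
}

/* Jumping window: at most m cells in each fixed window of length w. */
struct jumping_window {
    unsigned w, m;
    unsigned count;      /* cells admitted in the current window */
    unsigned timer;      /* time units elapsed in the current window */
};

void jw_init(struct jumping_window *jw, unsigned w, unsigned m)
{
    assert(w > 0 && m > 0);
    jw->w = w;
    jw->m = m;
    jw->count = 0;
    jw->timer = 0;
}

/* Advance the clock by one time unit; reset at a window boundary. */
void jw_tick(struct jumping_window *jw)
{
    if (++jw->timer == jw->w) {
        jw->timer = 0;
        jw->count = 0;
    }
}

/* Try to admit one cell; returns 1 if admitted, 0 if it must wait. */
int jw_admit(struct jumping_window *jw)
{
    if (jw->count == jw->m)
        return 0;
    jw->count++;
    return 1;
}
```

Driving the jumping window with a source that sends m cells in the last time unit of one window and m more in the first time unit of the next reproduces the worst-case burst of 2m described above.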
























[Figure 4.4: Shaping with multiple leaky buckets.]

4.3.1 Composite Leaky Bucket

In a composite leaky bucket shaper, multiple simple leaky buckets are arranged in cascade (see figure 4.3). The data generated by the source is first buffered in the shaper buffer. Fixed length cells are dispatched into the system if and only if there is at least one token in each of the leaky buckets. A single leaky bucket generates the worst-case bursty traffic when the system starts with a full bucket of tokens, and dispatches a cell whenever there is a token.

This worst-case behavior of a leaky bucket is characterized by a traffic envelope that starts with a burst equal to the bucket size, followed by a straight line (see footnote 2) of slope equal to the rate of token generation. The traffic envelope enforced by a composite leaky bucket is the intersection of the traffic envelopes of the constituent leaky buckets.

Example: In figure 4.4 a composite leaky bucket consisting of leaky buckets LB1, LB2, LB3, and LB4 is shown. The composite traffic envelope is marked by the dark line. The exact shape of the envelope depends on the number of components and the associated parameters. An inappropriate choice of shaper parameters may give rise to redundant components, which play no role in defining the traffic envelope. For example, LB4 is a redundant component in the composite shaper shown in figure 4.4. We call a set of leaky buckets an essential set if none of the buckets is redundant.

Footnote 2: Strictly speaking, the traffic envelope due to each leaky bucket is a burst followed by a staircase function. For ease of exposition we have approximated the staircase function by a straight line with the same slope. This simplification is only for the purpose of explanation. The results derived later take the staircase function into consideration.
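The intersection property stated above can be made concrete with a short sketch. It uses the staircase form of each envelope (a burst of b cells plus one cell per token period t) and takes the pointwise minimum over the buckets. The departure-time function is an assumption derived from that property (the latest of the per-bucket departure times), not the closed form of the theorem; all names and the sample parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct lb_param { unsigned long b, t; };  /* bucket size, token period */

/* Worst-case number of cells one bucket lets through in [0, time]:
 * an initial burst of b cells plus one cell per token period t. */
unsigned long lb_envelope(struct lb_param p, unsigned long time)
{
    assert(p.t > 0);
    return p.b + time / p.t;
}

/* Composite envelope: the pointwise minimum over the n buckets. */
unsigned long composite_envelope(const struct lb_param *lb, size_t n,
                                 unsigned long time)
{
    assert(n > 0);
    unsigned long cells = lb_envelope(lb[0], time);
    for (size_t k = 1; k < n; k++) {
        unsigned long c = lb_envelope(lb[k], time);
        if (c < cells)
            cells = c;
    }
    return cells;
}

/* Earliest departure time of cell i (i = 0, 1, ...) under the composite
 * envelope: the latest of the per-bucket departure times, where a single
 * bucket releases cell i at (i - b + 1) * t once i >= b. */
unsigned long composite_departure(const struct lb_param *lb, size_t n,
                                  unsigned long i)
{
    unsigned long latest = 0;
    for (size_t k = 0; k < n; k++) {
        unsigned long d = (i >= lb[k].b) ? (i - lb[k].b + 1) * lb[k].t : 0;
        if (d > latest)
            latest = d;
    }
    return latest;
}
```

With two hypothetical buckets (b = 8, t = 4) and (b = 2, t = 1), the second bucket limits the initial burst to 2 cells, while the first one governs the long-term rate: cell 10 cannot leave before time 12.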




[Figure 4.5: Traffic envelope after adding the (m+1)th bucket. Labels: shaping envelope with m+1 leaky buckets; portion of the shaping envelope excluded.]

Then the departure time of the ith cell from the composite shaper, denoted by $a(i)$, can be expressed as

$$a(i) = \sum_{k=1}^{n+1} (i - b_k + 1)\, t_k \left[ U(i - B_k) - U(i - B_{k-1}) \right], \quad i = 0, 1, \ldots, \infty,$$

where $U(x)$ is the unit step function defined as

$$U(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases}$$

Proof: We will prove this theorem by induction.

Base Case: For $n = 1$, we have $B_0 = \infty$, $B_1 = b_1$, and $B_2 = 0$. Therefore,

$$a(i) = (i - b_1 + 1)\, t_1\, U(i - b_1).$$

This allows a burst of size $b_1$ to depart at time 0 and one cell every $t_1$ thereafter. Clearly, the traffic envelope captures the characteristics of the traffic generated by a leaky bucket with parameters $b_1$ and $t_1$. Hence the hypothesis holds in the base case.

Inductive Hypothesis: Assume that the theorem holds for all $n \le m$. To prove that it holds for all $n$, we need to show that it holds for $n = m + 1$.







































[Figure 4.7: Traffic envelope after adding the (l+1)th moving window.]

$$a(i) = \sum_{k=1}^{n} \left( \left\lfloor \frac{i}{m_k} \right\rfloor - \left\lfloor \frac{i}{m_{k-1}} \right\rfloor \frac{m_{k-1}}{m_k} \right) w_k, \quad i = 0, 1, \ldots, \infty$$

Proof: We will prove this by induction.

Base Case: For $n = 1$, we have $a(i) = \lfloor i/m_1 \rfloor\, w_1$. This means that bursts of size $m_1$ appear at times $k w_1$, $k = 0, 1, \ldots, \infty$. Clearly, this represents the traffic envelope due to a single moving window with parameters $(w_1, m_1)$. Hence, the premise holds in the base case.

Inductive Hypothesis: Assume that the premise holds for all $n \le l$. To prove that it holds for all $n$, we need to show that it holds for $n = l + 1$.

Consider the effect of adding the $(l+1)$th moving window. In the worst case, the burst always comes at the beginning of a window. Therefore, as shown in figure 4.7, bursts of size $m_l$ cells appear at the beginning of each window of length $w_l$, for $m_{l-1}/m_l$ windows. Now, from the hypothesis, the arrival time of the ith cell is given by

$$a(i) = \sum_{k=1}^{l} \left( \left\lfloor \frac{i}{m_k} \right\rfloor - \left\lfloor \frac{i}{m_{k-1}} \right\rfloor \frac{m_{k-1}}{m_k} \right) w_k.$$

If a new shaper $(w_{l+1}, m_{l+1})$ is added, the burst appearing at the beginning of each $w_l$ window will spread out into $m_l/m_{l+1}$ bursts of size $m_{l+1}$ each, separated by $w_{l+1}$, as shown in figure 4.7. Due to this spreading out of the bursts, the arrival time of the ith cell will be postponed by

$$\left( \left\lfloor \frac{i}{m_{l+1}} \right\rfloor - \left\lfloor \frac{i}{m_l} \right\rfloor \frac{m_l}{m_{l+1}} \right) w_{l+1}$$
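The closed form for the composite moving window can be evaluated directly, which helps in checking it against small cases. The sketch below is illustrative only: the struct and function names are hypothetical, the windows are assumed to be ordered from the outermost (largest) to the innermost, each m is assumed to divide the m of the enclosing window, and the term involving the pseudo window (k = 0) is taken to be zero, consistent with the base case.

```c
#include <assert.h>
#include <stddef.h>

struct window { unsigned long w, m; };  /* window length, cells per window */

/* Worst-case departure time of cell i from an n-component composite
 * moving window, win[0] being the outermost (largest) window. */
unsigned long mw_departure(const struct window *win, size_t n,
                           unsigned long i)
{
    unsigned long a = 0;
    for (size_t k = 0; k < n; k++) {
        unsigned long bursts = i / win[k].m;          /* floor(i / m_k) */
        if (k > 0) {
            /* Subtract the bursts already accounted for by the
             * enclosing window; assumes m_k divides m_{k-1}. */
            assert(win[k - 1].m % win[k].m == 0);
            bursts -= (i / win[k - 1].m) * (win[k - 1].m / win[k].m);
        }
        a += bursts * win[k].w;
    }
    return a;
}
```

For two hypothetical windows (w = 10, m = 4) and (w = 3, m = 2), cells 0 and 1 leave at time 0, cells 2 and 3 at time 3, cells 4 and 5 at time 10, and cells 6 and 7 at time 13: each outer burst of 4 is spread into two inner bursts of 2, separated by 3 time units.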





























[Figure 4.8: Traffic envelope of a composite jumping window shaper.]

Definition 4.3 An n-component composite jumping window shaper consists of $n$ simple jumping windows $(w_k, m_k)$, $k = 1, \ldots, n$, where $w_i \ge w_j$, $m_i \ge m_j$, and $m_i/w_i \le m_j/w_j$, for $1 \le i < j \le n$.

For the sake of mathematical convenience we assume that an n-component composite shaper also includes another pseudo jumping window $(w_0, m_0)$ such that $m_0/m_1 = 0$. We also assume for simplicity of exposition that $m_{i+1}$ divides $m_i$, and $w_{i+1}$ divides $w_i$, for $i = 1, 2, \ldots, n-1$.

Theorem 4.3 The worst-case departure time $a(i)$ of the ith cell, $i = 0, 1, \ldots, \infty$, from an n-component composite jumping window shaper is given by

$$a(i) = \begin{cases} \displaystyle\sum_{k=1}^{n} \left( \left( \left\lfloor \frac{i}{m_k} \right\rfloor + 1 \right) - \left( \left\lfloor \frac{i}{m_{k-1}} \right\rfloor + 1 \right) \frac{m_{k-1}}{m_k} \right) w_k, & 0 \le i < m_1, \\[2ex] \displaystyle\sum_{k=1}^{n} \left( \left\lfloor \frac{i}{m_k} \right\rfloor - \left\lfloor \frac{i}{m_{k-1}} \right\rfloor \frac{m_{k-1}}{m_k} \right) w_k, & m_1 \le i < \infty. \end{cases}$$

Proof: The proof is similar to the proof of the last theorem. The only difference is that in the case of the jumping window, the first batch of bursts is released at the end of the first outer window. The rest of the batches are dispatched starting at the beginning of the subsequent outer windows.

4.4 Summary

In this chapter, we discussed shaping traffic for the purpose of policing and characterization. We discussed simple shaping mechanisms such as leaky buckets and moving and jumping windows, and characterized the traffic envelopes defined by the composition of multiple shapers. We will use these traffic envelopes to model input traffic to the network in the following chapters.

Shaping mechanisms have been proposed and analyzed by several authors (e.g. [38, 9]). However, most of these studies investigate different aspects of simple shapers, such as a single leaky bucket or a single jumping or moving window. The contribution of this chapter is in the characterization of the traffic envelopes defined by composite shapers. We believe that the characterization of traffic generated by composite shapers will be extremely useful in developing an end-to-end quality of service architecture, particularly with the recent standardization of multi-rate shaping of virtual connections by the ATM Forum.
Chapter 5

Carry-Over Round Robin Scheduling

The heart of a quality of service architecture providing deterministic guarantees on performance is the multiplexing policy used at the switching nodes. Multiplexing is the allocation of link capacity to competing connections. The manner in which multiplexing is performed has a profound effect on the end-to-end performance of the system. Since each connection might have different traffic characteristics and service requirements, it is important that the multiplexing discipline treats them differently, in accordance with their negotiated quality of service. However, this flexibility should not compromise the integrity of the scheme; that is, a few connections should not be able to degrade service to other connections to the extent that the performance guarantees are violated. Also, the scheme should be analyzable, since performance guarantees are to be given. Finally, it should be simple enough for implementation in high-speed switches. In this chapter we propose a multiplexing mechanism that attempts to achieve these daunting goals.

The rest of the chapter is organized as follows. In section 5.1 we review the current state of the art. In section 5.2 we present a detailed description of the scheduling algorithm. An outline for a hardware implementation of the multiplexing mechanism is discussed in section 5.3. Some of the basic properties of the multiplexing discipline are analyzed in section 5.4. We summarize in section 5.5.

5.1 Current State of Art

In the last several years a number of multiplexing disciplines have been proposed [5]. Based on the performance guarantees they provide, these schemes can be broadly categorized into two classes: ones that provide guarantees on maximum delay at the switching nodes and ones that guarantee a minimum throughput. In the following we briefly explain the working principles of representative schemes from each class and examine their merits and shortcomings.
5.1.1 Delay Guarantee

The multiplexing disciplines providing delay guarantees [26, 22] typically use priority based scheduling to bound the worst-case delay encountered by a cell belonging to a connection at a particular switch. Depending on the nature of priority assignment, they can be further sub-divided into static priority schemes and dynamic priority schemes. In a static priority scheme [26], each connection is statically assigned a priority at the time of connection setup. When a cell from a certain connection arrives at the multiplexing node, it is stamped with the priority label associated with its connection, and is added to a common queue. The cells are served according to their priority order. There are alternative approaches to implementing a static priority scheduler. In a dynamic priority scheduler, the priority assigned to cells belonging to a particular connection can potentially differ from cell to cell, depending on the state of the server and that of the connections. For example, an Earliest Deadline First scheduler [22] uses a real or a virtual deadline as the priority label for a cell. Here again, cells are put in a common queue and served in priority order. Knowing the exact arrival patterns of cells from different connections, it is possible to bound the worst-case delay suffered by cells from a particular connection in both static and dynamic priority scheduling.

The end-to-end queueing delay suffered by a cell passing through multiple multiplexing nodes, each employing deadline or priority scheduling, is the sum of the worst-case delays encountered at each node. One of the serious problems with the schemes described above is that they require traffic reshaping at each node. Priority scheduling completely destroys the original shape of the traffic envelope.
Since these schemes require that the exact form of the traffic envelope be known at each node in order to guarantee worst-case delay bounds, traffic has to be reshaped into its original form as it exits a multiplexing node.

5.1.2 Throughput Guarantee

The multiplexing disciplines providing throughput guarantees [7, 23, 35, 48, 40] use weighted fair queueing and frame based scheduling to guarantee a minimum rate of service at each node. Knowing the traffic envelope, this rate guarantee can be translated into guarantees on other performance metrics, such as delay, delay jitter, and worst-case backlog. Unlike the disciplines providing delay guarantees, in rate based schemes the worst-case end-to-end queueing delay equals the delay suffered at the bottleneck node only, not the sum of the worst-case delays at each intermediate node. Rate-based schemes are also fairer in distributing excess bandwidth. In a delay-based scheme employing priority scheduling, excess bandwidth is consumed by the connections with the highest priorities, whereas in rate-based schemes it can be distributed more evenly and predictably.

Based on implementation strategies, rate-based schemes can be further classified into two categories: 1) priority queue implementation, and 2) frame-based implementation.
The most popular examples of priority queue implementations are virtual clock [48], packet-by-packet generalized processor sharing (PGPS) [13, 35], and self clocked fair queueing (SFQ) [25]. In virtual clock, every connection has a clock associated with it that ticks at a potentially different rate. When a cell from a certain connection arrives at the system, it is stamped according to an algorithm that is independent of the arrivals from other connections and depends only on the history of arrivals in the connection concerned and the rate of service allocated to the connection. The stamped cells enter a queue common to all connections, and are served in the order of stamped value. Both PGPS and SFQ are similar to virtual clock in the sense that they all stamp the cells at their arrival, put all cells in a common queue, and serve them in the order of stamped value. However, they differ in the stamping algorithms they use. While these schemes are extremely flexible in allocating bandwidth at very fine granularity and in distributing bandwidth fairly among active connections, they are costly to implement. Maintaining a priority queue in the switches is expensive. In some cases [35], the overhead of the stamping algorithm can also be quite high.

Frame-based mechanisms are much simpler to implement. The most popular frame-based schemes are Stop-and-Go (SG) [23, 24] and Hierarchical-Round-Robin (HRR) [7]. Both SG and HRR use a multi-level framing strategy. For simplicity, we describe only one-level framing. In a framing strategy, the time axis is divided into periods of some constant length, called frames. Bandwidth is allocated to each connection as a certain fraction of frame time. In SG, at each multiplexing node, the arriving frames of each incoming link are mapped onto the departing frames on the outgoing links.
All the cells from one arriving frame of an incoming link and going to a certain outgoing link are put into the corresponding departing frame on the outgoing link. In some sense, SG emulates circuit switching on a packet switched network. One-level HRR is equivalent to a non-work-conserving round robin service discipline. Each connection is assigned a fraction of the total available bandwidth and receives that bandwidth in each frame, if it has sufficient cells available for service. The server ensures that no connection gets more bandwidth than what is allocated to it, even if it has spare capacity and the connection is backlogged. Both SG and HRR are non-work-conserving service disciplines and, hence, fail to exploit the multiplexing gains of ATM. Another important drawback of SG and HRR, and all framing strategies for that matter, is that they couple the service delay with bandwidth allocation granularity. The delay encountered by a cell in SG and HRR is bounded by the frame size multiplied by a constant factor (in SG the constant is between 2 and 3; in HRR it is 2). Hence, the smaller the frame size, the lower the delay. However, granularity of bandwidth allocation is inversely proportional to the frame size, resulting in an undesirable coupling between delay and bandwidth allocation granularity.

In the following, we present a multiplexing mechanism designed to integrate the flexibility and fairness of the fair queueing strategies with the simplicity of frame-based mechanisms. The starting point of our algorithm, which we call carry-over round robin (CORR), is a simple variation of round robin scheduling. Like round robin, CORR divides the time-line into allocation cycles, and each connection is allocated a fraction of the available bandwidth in each cycle. However, unlike slotted implementations of round robin schemes, where bandwidth is allocated as a multiple of a fixed quantum, in our scheme bandwidth allocation granularity can be arbitrarily small. Also, unlike framing strategies such as SG and HRR, ours is a work-conserving discipline, and hence unused bandwidth is not wasted but is fairly shared among the active connections. The following is an algorithmic description of the CORR scheduling discipline.

5.2 Scheduling Algorithm

Like simple round robin scheduling, CORR divides the time line into allocation cycles. The maximum length of an allocation cycle is $T$. Let us assume that the cell transmission time is the basic unit of time. Hence, the maximum number of cells (or slots) transmitted during one cycle is $T$. At the time of admission, each connection $C_i$ is allocated a rate $R_i$ expressed in cells per cycle. Unlike simple round robin schemes, where the $R_i$s have to be integers, CORR allows the $R_i$s to be real. Since the $R_i$s can take real values, the granularity of bandwidth allocation can be arbitrarily small, irrespective of the length of the allocation cycle. The goal of the scheduling algorithm is to allocate each connection $C_i$ close to $R_i$ slots in each cycle and exactly $R_i$ slots per cycle over a longer time frame. It also distributes the excess bandwidth among the active connections in proportion to their respective $R_i$s.

The CORR scheduler (see figure 5.1) consists of three asynchronous events: Initialize, Enqueue, and Dispatch. The event Initialize is invoked when a new connection is admitted. If a connection is admissible (footnote 1), it simply adds the connection to the connection-list $\{C\}$. The connection-list is ordered in decreasing order of $R_i - \lfloor R_i \rfloor$, that is, the fractional part of $R_i$. The event Enqueue is activated at the arrival of a packet.
It puts the packet in the appropriate connection queue and updates the cell count of the connection. The most important event in the scheduler is Dispatch. The event Dispatch is invoked at the beginning of a busy period. Before explaining the task performed by Dispatch, let us introduce the variables and constants used in the algorithm and the basic intuition behind it.

The scheduler maintains a separate queue for each connection. For each connection $C_i$, $n_i$ keeps the count of the waiting cells, and $r_i$ holds the number of slots currently credited to it. Note that the $r_i$s can be real-valued and can be negative. A negative value of $r_i$ signifies that the connection has been allocated more slots than it deserves. A positive value of $r_i$ reflects the current legitimate requirements of the connection.

Footnote 1: We discuss admission control later.

Constants
    T: Cycle length.
    R_i: Slots allocated to C_i.
Variables
    {C}: Set of all connections.
    t: Slots left in current cycle.
    n_i: Number of cells in C_i.
    r_i: Current slot allocation of C_i.
Events
    Initialize(C_i)    /* Invoked at connection setup time. */
        add C_i to {C};    /* {C} is ordered in decreasing order of R_i − ⌊R_i⌋. */
        n_i ← 0; r_i ← 0;
    Enqueue()    /* Invoked at cell arrival time. */
        n_i ← n_i + 1;
        add cell to connection queue;
    Dispatch()    /* Invoked at the beginning of a busy period. */
        ∀ C_i: r_i ← 0;
        while not end-of-busy-period do
            t ← T;
            1. Major Cycle:
            for all C_i ∈ {C} do    /* From head to tail. */
                r_i ← min(n_i, r_i + R_i); x_i ← min(t, ⌊r_i⌋);
                t ← t − x_i; r_i ← r_i − x_i; n_i ← n_i − x_i;
                dispatch x_i cells from connection queue C_i;
            end for
            2. Minor Cycle:
            for all C_i ∈ {C} do    /* From head to tail. */
                x_i ← min(t, ⌈r_i⌉);
                t ← t − x_i; r_i ← r_i − x_i; n_i ← n_i − x_i;
                dispatch x_i cells from connection queue C_i;
            end for
        end while

Figure 5.1: Carry-Over Round Robin Scheduling.

In order to allocate slots to meet the requirements of the connection as closely as possible, CORR divides each allocation cycle into two sub-cycles: a major cycle and a minor cycle. In the major cycle, the integral requirement of each connection is satisfied first. Slots left over from the major cycle are allocated in the minor cycle to connections with still unfulfilled fractional





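[Figure graphic not recovered (caption missing): example of CORR slot allocation for Connection 1, Connection 2, and Connection 3 over Cycle 1 and Cycle 2; surviving labels include R_1 = 2.0 and R_2 = 1.5.]

[Figure 5.3 graphic not recovered: connection table with fields VCI, Head, Tail, n_i, r_i, R_i, MF_i, mf_i.]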
Figure 5.3: Architecture of the buffer manager.

pointers. The cells stored in the cell buffer are tagged with the pointer to the next cell from the same connection, if any. When a cell arrives, it is stored in the cell pool at the address given by the idle-address-FIFO. While the cell is being written into the cell pool, its connection identifier is extracted by the processor and the corresponding entry in the connection table is accessed to retrieve connection specific state information. The connection state is updated to reflect the new arrival, and the pointers are suitably adjusted to add the arriving cell at the end of the connection queue.

The connection table stores the state information of all connections. The state of a connection consists of Head and Tail pointers to the corresponding connection queue and a few connection specific constants and variables. The constants and variables important to the scheduling algorithm are $R_i$, $r_i$, and $n_i$. A cell from a connection is considered eligible for dispatching in a major cycle if $n_i$ is greater than zero and $r_i$ is greater than one. Similarly, a connection is considered eligible for dispatching in a minor cycle if $n_i$ is greater than zero and $r_i$ is greater than zero. We keep two one-bit flags, MF and mf, to indicate whether a connection is eligible for dispatching in the major and minor cycles, respectively. These flags are set and reset by local logic to save processor cycles. They are used for fast indexing to the connection states of the eligible connections during the major and minor cycles.

Once an eligible connection is selected, the processor extracts the buffer address of the cell at the head of the connection queue from the connection table and dispatches the cell on the output link. The buffer address of the next cell in the queue, if any, is written into the Head field of the corresponding entry in the connection table. The processor also makes the necessary changes to the state information stored in the connection table.

5.4 Basic Properties

In this section, we discuss some of the basic properties of the scheduling algorithm. Lemma 5.1 defines an upper bound on the aggregate requirements of all streams inherited from the last cycle. This result is used in lemma 5.2 to determine the upper and lower bounds on the individual requirements carried over from the last cycle by each connection.

Definition 5.1 A connection is said to be in the busy period if the connection queue is non-empty. The system is said to be in the busy period if at least one of the connections is in its busy period.

Note that a particular connection can switch between busy and idle periods even when the system is in the same busy period. The following theorem determines the departure time of a specific cell belonging to a particular connection.

Lemma 5.1 If $\sum_{\forall C_i \in \{C\}} R_i \le T$, then at the beginning of each cycle $\sum_{\forall C_i \in \{C\}} r_i \le 0$.

Proof: We prove this by induction. We first show that it holds at the beginning of a busy period. Then we show that if it holds in the $k$th cycle, it also holds in the $(k+1)$th cycle.

Base Case: From the allocation algorithm, we observe that $r_i = 0$ for each connection $C_i$ at the beginning of a busy period. Hence,

$$\sum_{\forall C_i \in \{C\}} r_i = 0.$$

Thus, the assertion holds in the base case.

Inductive Hypothesis: Assume that the premise holds in the $k$th cycle. We use superscripts for cycles in the following derivation.

$$\sum_{\forall C_i \in \{C\}} r_i^{k+1} = \sum_{\forall C_i \in \{C\}} r_i^{k} + \sum_{\forall C_i \in \{C\}} R_i - T \le 0 + T - T \le 0.$$

This completes the proof.

Remark: Henceforth we assume that the admission control mechanism makes sure that $\sum_{\forall C_i \in \{C\}} R_i \le T$ at all nodes. This simple admission control test is one of the attractions of CORR scheduling.
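The Dispatch event of figure 5.1 and the invariant of lemma 5.1 can be illustrated with a small simulation. This is a minimal sketch, not the dissertation's implementation: the function name `corr_simulate`, the use of exact `Fraction` arithmetic, the assumption that all cells are queued at time 0 with no later arrivals, and the `max(0, ...)` guard that keeps a dispatch count from going negative are all assumptions made for this example.

```python
import math
from fractions import Fraction


def corr_simulate(rates, backlog, T, cycles):
    """Run `cycles` CORR allocation cycles inside one busy period.

    rates   -- R_i in cells per cycle, as Fractions so credits stay exact
    backlog -- n_i, cells queued per connection at time 0 (no later arrivals)
    T       -- maximum cycle length in slots
    Returns (served, credits): per-cycle dispatch counts and end-of-cycle r_i.
    """
    n = list(backlog)
    r = [Fraction(0)] * len(rates)
    # Connection list {C}: decreasing fractional part of R_i (ties keep input order).
    order = sorted(range(len(rates)),
                   key=lambda i: rates[i] - math.floor(rates[i]),
                   reverse=True)
    served, credits = [], []
    for _ in range(cycles):
        t = T
        sent = [0] * len(rates)
        # Major cycle: serve the integral part of each requirement.
        for i in order:
            r[i] = min(Fraction(n[i]), r[i] + rates[i])
            x = max(0, min(t, math.floor(r[i])))  # guard is an assumption of this sketch
            t -= x
            r[i] -= x
            n[i] -= x
            sent[i] += x
        # Minor cycle: leftover slots go to unfulfilled fractional requirements.
        for i in order:
            x = max(0, min(t, math.ceil(r[i])))
            t -= x
            r[i] -= x
            n[i] -= x
            sent[i] += x
        served.append(sent)
        credits.append(list(r))
    return served, credits


# Three backlogged connections with R = 2, 3/2, 1/2 and T = 4 (sum of R_i equals T).
rates = [Fraction(2), Fraction(3, 2), Fraction(1, 2)]
served, credits = corr_simulate(rates, backlog=[100, 100, 100], T=4, cycles=2)
```

In this run the second connection borrows a slot in the first cycle (its credit drops to -1/2) and the third connection is repaid in the second cycle, so that over two cycles each connection receives exactly $2R_i$ slots and the credits never sum to more than zero.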
Lemma 5.2 If $\sum_{\forall C_i \in \{C\}} R_i \le T$, then at the beginning of each cycle

$$-1 < -\Delta_i \le r_i \le \Delta_i < 1, \quad \text{where } \Delta_i = \max_k \{kR_i - \lfloor kR_i \rfloor\}, \; k = 1, 2, \ldots$$

Proof: To derive the lower bound on $r_i$, observe that in each cycle no more than $\lceil r_i \rceil$ slots are allocated to connection $C_i$. Also note that $r_i$ is incremented in steps of $R_i$. Hence, the lowest value $r_i$ can have is

$$-\Delta_i = \min_k \{kR_i - \lceil kR_i \rceil\} = -\max_k \{kR_i - \lfloor kR_i \rfloor\}, \quad k = 1, 2, \ldots$$

Derivation of the upper bound is a little more complex. Let us assume that there are $n$ connections $C_k$, $k = 1, 2, \ldots, n$. Without loss of generality we renumber them such that $R_i \ge R_j$ when $i < j$. For the sake of simplicity, let us also assume that all the $R_i$'s are fractional. We show later that this assumption is not restrictive. To prove the upper bound, we first prove that $r_i$ never exceeds 1 for all connections $C_i$. Now, since $R_n$ is the lowest of all $R_i$'s, $C_n$ is the last connection in the connection list. Consequently, $C_n$ is the last connection considered for a possible cell dispatch in both major and minor cycles. Hence, if we can prove that $r_n$ never exceeds 1, then this is true for all other $r_i$s. We prove this by contradiction.

Let us assume that $C_n$ enters a busy period in allocation cycle 1. Observe that $C_n$ experiences the worst-case allocation when all other connections also enter their busy periods in the same cycle. Let us assume that $r_n > 1$. This would happen in the allocation cycle $\lceil 1/R_n \rceil$. Since $r_n > 1$, $C_n$ is considered for a possible dispatch in the major cycle. Now, $C_n$ is not scheduled during the major cycle of the allocation cycle $\lceil 1/R_n \rceil$ if and only if the following is true at the beginning of the allocation cycle:

$$\sum_{i=1}^{n-1} \lfloor r_i + R_i \rfloor \ge T.$$

From lemma 5.1, we know that $\sum_{i=1}^{n} r_i \le 0$ at the beginning of each cycle. Since $r_n > 0$, $\sum_{i=1}^{n-1} r_i < 0$ at the beginning of the allocation cycle $\lceil 1/R_n \rceil$. However,

$$\sum_{i=1}^{n-1} \lfloor r_i + R_i \rfloor \le \sum_{i=1}^{n-1} (r_i + R_i) < 0 + \sum_{i=1}^{n-1} R_i < T.$$

This contradicts the assumption that $C_n$ could not be scheduled in the major cycle. Hence, $r_n$ cannot exceed 1. Since $r_n$ is incremented in steps of $R_n$, the maximum value of $r_n$ at the beginning of a cycle is the largest fractional part of $kR_n$, for any integer $k$. The same is true for the other $r_i$'s. Hence, the bound follows.

We have proved the bounds under the assumption that all $R_i$'s are fractional. If we relax this assumption, the result still holds. This is due to the fact that the integral part of $R_i$ is guaranteed to be allocated in each allocation cycle. Hence, even when the $R_i$'s are not all fractional, we can reduce the problem to an equivalent one with fractional $R_i$'s using the transformation

$$\hat{R}_i = R_i - \lfloor R_i \rfloor \quad \text{and} \quad \hat{T} = T - \sum_{i=1}^{n} \lfloor R_i \rfloor.$$

This completes the proof.

The following theorem characterizes the departure function associated with a connection. For simplicity, we drop the subscript identifying a connection in the following description.

Theorem 5.1 Assume that a connection enters a busy period at time 0. Let $d(i)$ be the latest time by which the $i$th cell, counting from the beginning of the current busy period, departs the system. Then $d(i)$ can be expressed as

$$d(i) = \left\lceil \frac{i + 1 + \Delta}{R} \right\rceil T, \quad i = 0, 1, \ldots, \infty,$$

where $R$ is the rate allocated to the connection, $T$ is the maximum length of the allocation cycle, and $\Delta = \max_k \{kR - \lfloor kR \rfloor\}$, $k = 1, 2, \ldots$

Proof: Since a cell may leave the system at any time during an allocation cycle, we capture the worst-case situation by assuming that all the cells served during an allocation cycle leave the system at the end of the cycle. Now, when a connection enters a busy period, the lowest value of $r$ is $-\Delta$. If cell $i$ departs at the end of the $L$th cycle from the beginning of the connection busy period, the number of slots allocated by the scheduler is $LR - \Delta$, and the number of slots consumed is $i + 1$ (assuming packet numbers start from 0). In the worst case,

$$1 > LR - \Delta - (i + 1) \ge 0.$$

This implies that

$$\frac{i + 1 + \Delta + 1}{R} > L \ge \frac{i + 1 + \Delta}{R}.$$

From the above inequality, and noting that $L$ is an integer and $d(i) = L\,T$, we get

$$d(i) = \left\lceil \frac{i + 1 + \Delta}{R} \right\rceil T.$$
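The departure bound of theorem 5.1 is easy to evaluate numerically. The sketch below assumes a rational rate $R$ passed as a `Fraction`, for which the fractional parts of $kR$ repeat with period equal to the denominator of $R$, so the maximum can be taken over finitely many $k$. The function names are hypothetical and chosen only for this example.

```python
import math
from fractions import Fraction


def carry_over_bound(R):
    """Delta = max_k {kR - floor(kR)}, k = 1, 2, ..., for a rational rate R."""
    R = Fraction(R)
    # For rational R the fractional parts of kR repeat with period R.denominator.
    return max((k * R) % 1 for k in range(1, R.denominator + 1))


def worst_case_departure(i, R, T):
    """Theorem 5.1: latest departure time of cell i (cells numbered from 0)."""
    R = Fraction(R)
    return math.ceil((i + 1 + carry_over_bound(R)) / R) * T


# R = 3/2 cells per cycle, T = 4 slots: Delta = 1/2.
bounds = [worst_case_departure(i, Fraction(3, 2), 4) for i in range(4)]
```

With these values the first four cells depart no later than the end of cycles 1, 2, 3, and 3, respectively, which matches an average of 3/2 cells per cycle after the initial carry-over deficit is absorbed.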
5.5 Summary

In this chapter, we have presented a cell scheduling discipline designed to provide deterministic performance guarantees on a per-connection basis. We have presented a detailed description of the algorithm and discussed some of its basic properties. These results are used in the next chapter to derive delay bounds and to analyze the fairness properties of the algorithm.

It is clear from the discussion above that CORR is a simple extension of the round robin discipline. It can be implemented using a two-phase round robin scheduler and is much simpler than the priority queueing mechanisms. In terms of complexity, CORR is comparable to frame based mechanisms, such as SG and HRR. However, CORR does not suffer from the typical shortcomings of frame based scheduling. It is a work-conserving discipline and hence can exploit the multiplexing gains of ATM. Also, unlike most frame based schemes, CORR does not suffer from the undesirable coupling between delay and bandwidth allocation granularity. In the next chapter, we show that despite its simplicity, CORR is quite competitive with other more complex schemes in terms of performance.





[Figure 6.1 graphic not recovered; surviving label: backlog at time t.]







Figure 6.1: Computing delay and backlog from the arrival and departure functions.

6.1.1 Single-node Case

Let us consider a switching node employing CORR scheduling to multiplex traffic from different connections. Since we are interested in the worst-case delay behavior, and since the scheduler guarantees a minimum rate of service to all connections, we can consider each connection in isolation. The delay encountered by any cell belonging to a connection is the difference between its arrival and departure times. The arrival time of a cell can be obtained from the traffic envelope associated with a connection (see chapter 4). Theorem 5.1 expresses the worst-case departure time of a cell in terms of the service rate allocated to the connection and the length of the allocation cycle. Knowing both the arrival and the departure functions, we can compute the worst-case delay bound. In the rest of the section we derive delay bounds for arrival functions characterized by composite leaky bucket, moving window, and jumping window shapers.

The delay encountered by cell $i$ is the horizontal distance between the arrival and departure functions at $i$ (see figure 6.1). Hence, the maximum delay encountered by any cell is the maximum horizontal distance between the arrival and the departure functions. Similarly, the vertical distance between these functions represents the backlog in the system. Unfortunately, finding the maximum delay directly, that is, the maximum horizontal distance between the arrival and the departure functions, is a difficult task. Hence, instead of measuring the horizontal distance between these functions, we first determine the point at which the maximum backlog occurs and the index $i$ of the cell that is at the end of the queue at that point. The worst-case delay is then computed by evaluating $d(i) - a(i)$, where $a(i)$ and $d(i)$ are the arrival and departure times of the $i$th cell, respectively. In the following, we carry out this procedure for arrival functions defined by composite leaky bucket, moving window, and jumping window shapers.
Lemma 6.1 Consider a connection shaped using an $n$-component moving window shaper and passing through a single multiplexing node employing CORR scheduling with an allocation cycle of length $T$ (footnote 1). If the connection is allocated a service rate of $R$, then the worst-case delay encountered by any cell belonging to the connection is upper bounded by

$$D^{CORR/MW} \le \begin{cases} \left\lceil \dfrac{m_j + \Delta}{R} \right\rceil T - \displaystyle\sum_{l=j+1}^{n} \left( \dfrac{m_{l-1}}{m_l} - 1 \right) w_l, & \left\lceil \dfrac{R + \Delta}{w_j (R/T - m_j/w_j)} \right\rceil = 1 \\[2ex] \left\lceil \dfrac{2m_j + \Delta}{R} \right\rceil T - w_j - \displaystyle\sum_{l=j+1}^{n} \left( \dfrac{m_{l-1}}{m_l} - 1 \right) w_l, & \left\lceil \dfrac{R + \Delta}{w_j (R/T - m_j/w_j)} \right\rceil > 1 \end{cases}$$

when $\dfrac{m_j}{w_j} < \dfrac{R}{T} < \dfrac{m_{j+1}}{w_{j+1}}$, $j = 1, 2, \ldots, n-1$. Here $\Delta = \max_k \{kR - \lfloor kR \rfloor\}$, as in theorem 5.1.

Proof: First we show that under the conditions stated above the system is stable, that is, the length of the busy period is finite. To prove that, it is sufficient to show that there exists a positive integer $k$ such that the number of cells serviced in $k w_j$ time is greater than or equal to $k m_j$. In other words, we have to show that there exists a $k$ such that

$$\begin{aligned} k w_j &\ge d(k m_j - 1) \\ k w_j &\ge \left\lceil \frac{(k m_j - 1) + 1 + \Delta}{R} \right\rceil T \\ k w_j &\ge \left( \frac{(k m_j - 1) + 1 + \Delta}{R} + 1 \right) T \\ k &\ge \frac{R + \Delta}{w_j (R/T - m_j/w_j)} \\ k &\ge \left\lceil \frac{R + \Delta}{w_j (R/T - m_j/w_j)} \right\rceil . \end{aligned}$$

Clearly, for there to exist a positive integer $k$ satisfying the above inequality, the following condition needs to hold:

$$R/T - m_j/w_j > 0 \quad \text{or} \quad R/T > m_j/w_j.$$

By our assumption, $R/T > m_j/w_j$. Hence, the system is stable. Now, we need to determine the point at which the maximum backlog occurs. Depending on the value of $k$, the maximum backlog can occur at one of two places.

Footnote 1: We assume that the cycle length is smaller than the smallest window period. Since $R$ can take any real value, we can choose an allocation with an arbitrarily small cycle length. Hence, this assumption is not restrictive.
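Case $k = 1$: If $k = 1$, that is, when the traffic coming in during a time window of length $w_j$ departs the system in the same window, the maximum backlog occurs at the arrival instant of the $(m_j - 1)$th cell (footnote 2). Clearly, the index of the cell at the end of the queue at that instant is $m_j - 1$. Hence, the maximum delay encountered by any cell under this scenario is the same as the delay suffered by the $(m_j - 1)$th cell and can be obtained by computing $d(m_j - 1) - a(m_j - 1)$. We can evaluate $a(m_j - 1)$ as follows:

$$\begin{aligned} a(m_j - 1) &= \sum_{l=1}^{n} \left( \left\lfloor \frac{m_j - 1}{m_l} \right\rfloor - \left\lfloor \frac{m_j - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l \\ &= \sum_{l=1}^{j} \left( \left\lfloor \frac{m_j - 1}{m_l} \right\rfloor - \left\lfloor \frac{m_j - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l + \sum_{l=j+1}^{n} \left( \left\lfloor \frac{m_j - 1}{m_l} \right\rfloor - \left\lfloor \frac{m_j - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l \\ &= 0 + \sum_{l=j+1}^{n} \left( \left( \frac{m_j}{m_l} - 1 \right) - \left( \frac{m_j}{m_{l-1}} - 1 \right) \frac{m_{l-1}}{m_l} \right) w_l \quad (\text{since } m_{l-1} > m_l \text{ and } m_l \text{ divides } m_{l-1}) \\ &= \sum_{l=j+1}^{n} \left( \frac{m_{l-1}}{m_l} - 1 \right) w_l. \end{aligned}$$

Therefore, the worst-case delay is bounded by

$$D^{CORR/MW} \le d(m_j - 1) - a(m_j - 1) \le \left\lceil \frac{m_j + \Delta}{R} \right\rceil T - \sum_{l=j+1}^{n} \left( \frac{m_{l-1}}{m_l} - 1 \right) w_l.$$

Footnote 2: Note that cells are numbered from 0.

Case $k > 1$: When $k$ is greater than 1, the connection busy period continues beyond the first window of length $w_j$. Since $R/T > m_j/w_j$, the rate of traffic arrival is lower than the rate of departure. Still, in this case not all cells that arrive during the first window of length $w_j$ are served during that period, and the leftover cells are carried over into the next window. This is due to the fact that, unlike the arrival function, the departure function starts at time $\lceil (1 + \Delta)/R \rceil$ instead of time 0. This is the case when $k = 1$ as well. However, in that case the rate of service is high enough to serve all the cells before the end of the first window. When $k > 1$, the backlog carried over from the first window is cleared in portions over the next $k - 1$ windows. Clearly, the backlogs carried over into subsequent windows diminish in size and are cleared completely by the end of the $k$th window. Hence, the second window is the one where the backlog inherited from the previous window is the maximum. Consequently, the absolute backlog in the system reaches its maximum at the arrival of the $(2m_j - 1)$th cell. Hence, the maximum delay encountered by any cell under this scenario is the same as the delay suffered by the $(2m_j - 1)$th cell and can be obtained by computing $d(2m_j - 1) - a(2m_j - 1)$. We can evaluate $a(2m_j - 1)$ as follows:

$$\begin{aligned} a(2m_j - 1) &= \sum_{l=1}^{n} \left( \left\lfloor \frac{2m_j - 1}{m_l} \right\rfloor - \left\lfloor \frac{2m_j - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l \\ &= \sum_{l=1}^{j-1} \left( \left\lfloor \frac{2m_j - 1}{m_l} \right\rfloor - \left\lfloor \frac{2m_j - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l + \left( \left\lfloor \frac{2m_j - 1}{m_j} \right\rfloor - \left\lfloor \frac{2m_j - 1}{m_{j-1}} \right\rfloor \frac{m_{j-1}}{m_j} \right) w_j \\ &\quad + \sum_{l=j+1}^{n} \left( \left\lfloor \frac{2m_j - 1}{m_l} \right\rfloor - \left\lfloor \frac{2m_j - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l \\ &= 0 + w_j + \sum_{l=j+1}^{n} \left( \left( \frac{2m_j}{m_l} - 1 \right) - \left( \frac{2m_j}{m_{l-1}} - 1 \right) \frac{m_{l-1}}{m_l} \right) w_l \quad (\text{since } m_{l-1} \ge 2m_l) \\ &= w_j + \sum_{l=j+1}^{n} \left( \frac{m_{l-1}}{m_l} - 1 \right) w_l. \end{aligned}$$

Therefore, the worst-case delay is bounded by

$$D^{CORR/MW} \le d(2m_j - 1) - a(2m_j - 1) \le \left\lceil \frac{2m_j + \Delta}{R} \right\rceil T - w_j - \sum_{l=j+1}^{n} \left( \frac{m_{l-1}}{m_l} - 1 \right) w_l.$$

Lemma 6.2 Consider a connection that is shaped using an $n$-component jumping window shaper and passing through a single multiplexing node employing CORR scheduling with an allocation cycle of maximum length $T$. If the connection is allocated a service rate of $R$, then the worst-case delay suffered by any cell belonging to the connection is upper bounded by

$$D^{CORR/JW} \le \begin{cases} \left\lceil \dfrac{2m_1 + \Delta}{R} \right\rceil T - 2 \displaystyle\sum_{l=2}^{n} \left( \dfrac{m_{l-1}}{m_l} - 1 \right) w_l, & \dfrac{m_1}{w_1} < \dfrac{R}{T} < \dfrac{m_2}{w_2} \\[2ex] \left\lceil \dfrac{2m_j + \Delta}{R} \right\rceil T - w_j - \displaystyle\sum_{l=1}^{j} \left( 1 - \dfrac{m_{l-1}}{m_l} \right) w_l + w_1 - \displaystyle\sum_{l=2}^{n} \left( \dfrac{m_{l-1}}{m_l} - 1 \right) w_l, & \dfrac{m_j}{w_j} < \dfrac{R}{T} < \dfrac{m_{j+1}}{w_{j+1}}, \; j = 2, 3, \ldots, n-1 \end{cases}$$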
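Proof: The proof is very similar to the last proof. The condition $m_j/w_j < R/T < m_{j+1}/w_{j+1}$, where $j = 1, 2, \ldots, n-1$, guarantees that the system is stable. The difference is that in the case of a jumping window, the maximum backlog can occur at only one point, which is the arrival instant of cell $2m_j - 1$. This is because, unlike in a moving window, in a jumping window two bursts can come almost next to each other (see figure 4.8 for explanation). Hence, the worst-case delay encountered by any cell is at most as large as the delay encountered by cell $2m_j - 1$ and can be computed as $d(2m_j - 1) + a(0) - a(2m_j - 1)$. Note that we have an additional $a(0)$ term in the expression for delay, since the arrival curve does not start at time 0 but at time $a(0)$. To compensate for that we have also shifted the departure curve by $a(0)$. The final expression is in two parts since, depending on the value of $j$, $a(2m_j - 1)$ can have one of two forms (refer to theorem 4.3).

When $j = 1$, $a(2m_j - 1)$ is computed as follows:

$$\begin{aligned} a(2m_1 - 1) &= \sum_{l=1}^{n} \left( \left\lfloor \frac{2m_1 - 1}{m_l} \right\rfloor - \left\lfloor \frac{2m_1 - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l \\ &= \left( \left\lfloor \frac{2m_1 - 1}{m_1} \right\rfloor - \left\lfloor \frac{2m_1 - 1}{m_0} \right\rfloor \frac{m_0}{m_1} \right) w_1 + \sum_{l=2}^{n} \left( \left\lfloor \frac{2m_1 - 1}{m_l} \right\rfloor - \left\lfloor \frac{2m_1 - 1}{m_{l-1}} \right\rfloor \frac{m_{l-1}}{m_l} \right) w_l \\ &= w_1 + \sum_{l=2}^{n} \left( \left( \frac{2m_1}{m_l} - 1 \right) - \left( \frac{2m_1}{m_{l-1}} - 1 \right) \frac{m_{l-1}}{m_l} \right) w_l \\ &= w_1 + \sum_{l=2}^{n} \left( \frac{m_{l-1}}{m_l} - 1 \right) w_l. \end{aligned}$$

When $j = 2, 3, \ldots, n-1$, $a(2m_j - 1)$ is computed as follows:

$$\begin{aligned} a(2m_j - 1) &= \sum_{l=1}^{n} \left( \left( \left\lfloor \frac{2m_j - 1}{m_l} \right\rfloor + 1 \right) - \left( \left\lfloor \frac{2m_j - 1}{m_{l-1}} \right\rfloor + 1 \right) \frac{m_{l-1}}{m_l} \right) w_l \\ &= \sum_{l=1}^{j-1} \left( \left( \left\lfloor \frac{2m_j - 1}{m_l} \right\rfloor + 1 \right) - \left( \left\lfloor \frac{2m_j - 1}{m_{l-1}} \right\rfloor + 1 \right) \frac{m_{l-1}}{m_l} \right) w_l \\ &\quad + \left\{ \left( \left\lfloor \frac{2m_j - 1}{m_j} \right\rfloor + 1 \right) - \left( \left\lfloor \frac{2m_j - 1}{m_{j-1}} \right\rfloor + 1 \right) \frac{m_{j-1}}{m_j} \right\} w_j \\ &\quad + \sum_{l=j+1}^{n} \left( \left( \left\lfloor \frac{2m_j - 1}{m_l} \right\rfloor + 1 \right) - \left( \left\lfloor \frac{2m_j - 1}{m_{l-1}} \right\rfloor + 1 \right) \frac{m_{l-1}}{m_l} \right) w_l \\ &= \sum_{l=1}^{j-1} \left( 1 - \frac{m_{l-1}}{m_l} \right) w_l + \left( 2 - \frac{m_{j-1}}{m_j} \right) w_j + \sum_{l=j+1}^{n} \left( \left( \frac{2m_j}{m_l} - 1 + 1 \right) - \left( \frac{2m_j}{m_{l-1}} - 1 + 1 \right) \frac{m_{l-1}}{m_l} \right) w_l \\ &= w_j + \sum_{l=1}^{j} \left( 1 - \frac{m_{l-1}}{m_l} \right) w_l. \end{aligned}$$

We can compute $a(0)$ in a similar fashion. Now $d(2m_j - 1) + a(0) - a(2m_j - 1)$ yields the result.

Lemma 6.3 Consider a connection shaped by an $n$-component leaky bucket shaper and passing through a single multiplexing node employing CORR scheduling with an allocation cycle of maximum length $T$. If the connection is allocated a service rate of $R$, then the worst-case delay suffered by any cell belonging to the connection is upper bounded by

$$D^{CORR/LB}_{max} \le \left\lceil \frac{B_j + 1 + \Delta}{R} \right\rceil T - (B_j - b_j + 1)\, t_j,$$

when $\dfrac{1}{t_j} < \dfrac{R}{T} < \dfrac{1}{t_{j+1}}$, $j = 1, 2, \ldots, n$.

Proof: In order to identify the point where the maximum backlog occurs, observe that the rate of arrivals is higher than the rate of service until the slope of the traffic envelope changes from $1/t_{j+1}$ to $1/t_j$. This change in slope occurs at the arrival of the $B_j$th cell in the worst case. Hence, the maximum delay encountered by any cell is at most as large as the delay suffered by cell $B_j$. We can compute $a(B_j)$ as follows:

$$\begin{aligned} a(B_j) &= \sum_{l=1}^{n+1} (B_j - b_l + 1)\, t_l \left[ U(B_j - B_l) - U(B_j - B_{l-1}) \right] \\ &= \sum_{l=1}^{j-1} (B_j - b_l + 1)\, t_l \left[ U(B_j - B_l) - U(B_j - B_{l-1}) \right] + (B_j - b_j + 1)\, t_j \left[ U(B_j - B_j) - U(B_j - B_{j-1}) \right] \\ &\quad + \sum_{l=j+1}^{n+1} (B_j - b_l + 1)\, t_l \left[ U(B_j - B_l) - U(B_j - B_{l-1}) \right] \\ &= (B_j - b_j + 1)\, t_j. \end{aligned}$$

Now $d(B_j) - a(B_j)$ yields the result.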
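The closed-form bounds above are straightforward to evaluate. As one illustration, the following sketch computes the moving window bound of lemma 6.1 for integer window parameters and a rational rate. The function name, the 0-based storage of the shaper components, and the sample parameters are assumptions made for this example; the code raises `StopIteration` if no component $j$ satisfies the rate condition of the lemma.

```python
import math
from fractions import Fraction


def corr_mw_delay_bound(m, w, R, T):
    """Lemma 6.1 bound for a moving window shaper with components (m[l], w[l]).

    m, w -- integer lists for l = 1..n (stored 0-based), with m[l-1] > m[l]
    R, T -- allocated rate (cells per cycle) and cycle length (slots)
    """
    R, T = Fraction(R), Fraction(T)
    # Delta = max_k {kR - floor(kR)}; finite search is enough for rational R.
    delta = max((k * R) % 1 for k in range(1, R.denominator + 1))
    # Locate j with m_j/w_j < R/T < m_{j+1}/w_{j+1} (0-based index j).
    j = next(l for l in range(len(m) - 1)
             if Fraction(m[l], w[l]) < R / T < Fraction(m[l + 1], w[l + 1]))
    # Number of windows of length w_j needed to clear the busy period.
    k = math.ceil((R + delta) / (w[j] * (R / T - Fraction(m[j], w[j]))))
    tail = sum((Fraction(m[l - 1], m[l]) - 1) * w[l] for l in range(j + 1, len(m)))
    if k == 1:
        return math.ceil((m[j] + delta) / R) * T - tail
    return math.ceil((2 * m[j] + delta) / R) * T - w[j] - tail


# Two-component shaper: 8 cells per 40 slots and 2 cells per 5 slots.
bound_fast = corr_mw_delay_bound(m=[8, 2], w=[40, 5], R=1, T=4)
# Service rate barely above the long-term arrival rate, so k > 1.
bound_slow = corr_mw_delay_bound(m=[8, 2], w=[80, 10], R=Fraction(51, 100), T=5)
```

In the first call the busy period ends within one window ($k = 1$) and the bound is $d(7) - a(7)$. In the second call the service rate is only slightly above $m_1/w_1$, so the second branch of the lemma applies.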
[Figure 6.2 graphic not recovered: node 1, node 2, ..., node n in tandem, with cell arrival times a_1(i), a_2(i), ..., a_n(i), a_{n+1}(i).]
Figure 6.2: Nodes in tandem.

The results derived in this section define upper bounds for the delay encountered in a CORR server under different traffic envelopes. The compact closed form expressions make the task of computing the numerical bounds for a specific set of parameters simple. We would also like to mention that, compared to other published works, we consider a much larger and more general set of traffic envelopes in our analysis. Although simple closed form bounds under very general arrival patterns are one of the major attractions of CORR scheduling and an important contribution of this study, bounds for a single-node system are not very useful in a real-life scenario. In most systems, a connection spans multiple nodes, and the end-to-end delay bound is of real interest. In the next section, we derive bounds on end-to-end delay.

6.1.2 Multiple-node Case

In the last section, we derived worst-case bounds on delay for different traffic envelopes for a single-node system. In this section, we derive similar bounds for a multi-node system. We assume that there are $n$ multiplexing nodes between the source and the destination, and that at each node a minimum available rate of service is guaranteed.

We denote the arrival time of cell $i$ at node $k$ by $a_k(i)$. The service time at node $k$ for cell $i$ is denoted by $s_k(i)$. Without loss of generality, we assume that the propagation delay between nodes is zero (footnote 3). Hence, the departure time of cell $i$ from node $k$ is $a_{k+1}(i)$. Note that $a_1(i)$ is the arrival time of the $i$th cell in the system and $a_{n+1}(i)$ is the departure time of the $i$th cell from the system (see figure 6.2).

Let us denote $\sum_{i=p}^{q} s_k(i)$ by $S_k(p, q)$. This is the aggregate service time of cells $p$ through $q$ at node $k$. In other words, $S_k(p, q)$ is the service time of the burst of cells $p$ through $q$ at node $k$.

The following theorem expresses the arrival time of a particular cell at a specific node in terms of the arrival times of the preceding cells at the system and their service times at different nodes.

Footnote 3: This assumption does not affect the generality of the results, since the propagation delay at each stage is constant and can be included in $s_k(i)$.
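This is a very general result and is independent of the particular scheduling discipline used at the multiplexing node and of the traffic envelope associated with the connection. We will use this result to derive the worst-case bound on end-to-end delay.

Theorem 6.1 For any node $k$ and for any cell $i$, the following holds:

$$a_k(i) = \max_{1 \le j \le i} \left\{ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = i} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right\}. \qquad (6.1)$$

Proof: We prove this theorem by induction on $k$ and $i$.

Induction on $k$:

Base Case: When $k = 1$,

$$a_1(i) = \max_{1 \le j \le i} \left\{ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = i} \left( \sum_{h=1}^{0} S_h(l_h, l_{h+1}) \right) \right\} = a_1(i).$$

Clearly, the assertion holds.

Inductive Hypothesis: Let us assume that the premise holds for all $m \le k$. In order to prove that the hypothesis is correct, we need to show that it holds for $m = k + 1$.

$$\begin{aligned} a_{k+1}(i) &= \max \left\{ a_{k+1}(i-1) + s_k(i),\; a_k(i) + s_k(i) \right\} \\ &= \max \Bigg\{ \max_{1 \le j \le i-1} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k+1} = i-1} \left( \sum_{h=1}^{k} S_h(l_h, l_{h+1}) \right) \right] + s_k(i), \\ &\qquad\quad \max_{1 \le j \le i} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = i} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right] + s_k(i) \Bigg\} \\ &= \max \Bigg\{ \max_{1 \le j \le i-1} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k+1} = i-1} \left( \sum_{h=1}^{k} S_h(l_h, l_{h+1}) \right) + s_k(i) \right], \\ &\qquad\quad \max_{1 \le j \le i-1} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = i} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) + S_k(i, i) \right], \; a_1(i) + \sum_{h=1}^{k} S_h(i, i) \Bigg\} \\ &= \max \Bigg\{ \max_{1 \le j \le i-1} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k < l_{k+1} = i} \left( \sum_{h=1}^{k} S_h(l_h, l_{h+1}) \right) \right], \\ &\qquad\quad \max_{1 \le j \le i-1} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = l_{k+1} = i} \left( \sum_{h=1}^{k} S_h(l_h, l_{h+1}) \right) \right], \; a_1(i) + \sum_{h=1}^{k} S_h(i, i) \Bigg\} \\ &= \max \Bigg\{ \max_{1 \le j \le i-1} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k \le l_{k+1} = i} \left( \sum_{h=1}^{k} S_h(l_h, l_{h+1}) \right) \right], \; a_1(i) + \sum_{h=1}^{k} S_h(i, i) \Bigg\} \\ &= \max_{1 \le j \le i} \left\{ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k \le l_{k+1} = i} \left( \sum_{h=1}^{k} S_h(l_h, l_{h+1}) \right) \right\}. \end{aligned}$$

Induction on $i$:

Base Case: When $i = 1$,

$$\begin{aligned} a_k(1) &= \max_{1 \le j \le 1} \left\{ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = 1} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right\} \\ &= a_1(1) + \sum_{h=1}^{k-1} S_h(1, 1) \\ &= a_1(1) + \sum_{h=1}^{k-1} s_h(1). \end{aligned}$$

Hence, the assertion holds in the base case.

Inductive Hypothesis: Let us assume that the premise holds for all $n \le i$. In order to prove that the hypothesis is correct, we need to show that it holds for $n = i + 1$.

$$\begin{aligned} a_k(i+1) &= \max \left\{ a_k(i) + s_{k-1}(i+1),\; a_{k-1}(i+1) + s_{k-1}(i+1) \right\} \\ &= \max \Bigg\{ \max_{1 \le j \le i} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = i} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right] + s_{k-1}(i+1), \\ &\qquad\quad \max_{1 \le j \le i+1} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k-1} = i+1} \left( \sum_{h=1}^{k-2} S_h(l_h, l_{h+1}) \right) \right] + s_{k-1}(i+1) \Bigg\} \\ &= \max \Bigg\{ \max_{1 \le j \le i} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_k = i} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) + s_{k-1}(i+1) \right], \\ &\qquad\quad \max_{1 \le j \le i} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k-1} = i+1} \left( \sum_{h=1}^{k-2} S_h(l_h, l_{h+1}) \right) + S_{k-1}(i+1, i+1) \right], \\ &\qquad\quad a_1(i+1) + \sum_{h=1}^{k-1} S_h(i+1, i+1) \Bigg\} \\ &= \max \Bigg\{ \max_{1 \le j \le i} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k-1} < l_k = i+1} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right], \\ &\qquad\quad \max_{1 \le j \le i} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k-1} = l_k = i+1} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right], \; a_1(i+1) + \sum_{h=1}^{k-1} S_h(i+1, i+1) \Bigg\} \\ &= \max \Bigg\{ \max_{1 \le j \le i} \left[ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k-1} \le l_k = i+1} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right], \; a_1(i+1) + \sum_{h=1}^{k-1} S_h(i+1, i+1) \Bigg\} \\ &= \max_{1 \le j \le i+1} \left\{ a_1(j) + \max_{j = l_1 \le l_2 \le \cdots \le l_{k-1} \le l_k = i+1} \left( \sum_{h=1}^{k-1} S_h(l_h, l_{h+1}) \right) \right\}. \end{aligned}$$

The result stated in the above theorem determines the departure time of any cell from any node in the system in terms of the arrival times of the preceding cells and the service times of the cells at different nodes. This is the most general result known to us on end-to-end delay in terms of service times of cells at intermediate nodes. We believe that this result is a powerful tool for computing end-to-end delay for any rate based scheduling discipline and is an effective alternative to the ad hoc techniques commonly used for end-to-end analysis.

Although the result stated in theorem 6.1 is very general, it is difficult to make use of it in its current form. In order to find the exact departure time of any cell from any node, we need to know both the arrival times of the cells and their service times at different nodes. Arrival times of different cells can be obtained from the arrival function, but computing service times for different cells at each node is a daunting task. Hence, computing the exact departure time of a cell from any node in the system is often quite difficult. However, the exact departure time of a specific cell is rarely of critical interest. More often we are interested in other metrics, such as the worst-case delay encountered by a cell. Fortunately, computing the worst-case bound on the departure time, and then the worst-case delay, is not as difficult. The following corollary expresses the worst-case delay suffered by a cell in terms of the worst-case service times at each node.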
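Theorem 6.1 can be checked numerically on a small trace by comparing the one-step recursion used in the proof, $a_{k+1}(i) = \max\{a_{k+1}(i-1), a_k(i)\} + s_k(i)$, with a brute-force evaluation of equation (6.1). The sketch below is illustrative only: the function names, the 0-based indexing of nodes and cells, and the sample arrival and service times are assumptions made for this example.

```python
from itertools import combinations_with_replacement


def arrival_times_by_recursion(a1, s):
    """a1[i]: arrival of cell i at node 1; s[h][i]: service time of cell i at node h+1.

    Returns a list a where a[k][i] is the arrival time of cell i at node k+1.
    """
    a = [list(a1)]
    for stage in s:
        prev_departure = float("-inf")
        nxt = []
        for i, arrival in enumerate(a[-1]):
            # a_{k+1}(i) = max{a_{k+1}(i-1), a_k(i)} + s_k(i)
            prev_departure = max(prev_departure, arrival) + stage[i]
            nxt.append(prev_departure)
        a.append(nxt)
    return a


def arrival_time_closed_form(a1, s, k, i):
    """Equation (6.1) by brute force; k and i are 0-based (node k+1, cell i)."""
    if k == 0:
        return a1[i]

    def burst(h, p, q):
        # S_h(p, q): aggregate service time of cells p..q at node h+1.
        return sum(s[h][p:q + 1])

    best = float("-inf")
    for j in range(i + 1):
        # Enumerate nondecreasing sequences l_1 = j <= l_2 <= ... <= i.
        for mid in combinations_with_replacement(range(j, i + 1), k - 1):
            l = (j, *mid, i)
            best = max(best, a1[j] + sum(burst(h, l[h], l[h + 1]) for h in range(k)))
    return best


a1 = [0, 1, 2]                  # arrivals at node 1
s = [[3, 1, 1], [1, 4, 1]]      # service times at nodes 1 and 2
a = arrival_times_by_recursion(a1, s)
```

The brute-force evaluation is exponential in the number of nodes and is useful only as a sanity check on short traces; the recursion is what one would use in practice.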
Corollary 6.1 Consider a connection passing through n multiplexing nodes. Assume that there exists a $S_w(\cdot)$ such that $S_w(p,q) \ge S_h(p,q)$ for all $q \ge p$ and $h = 1, 2, \ldots, n$. Then, in the worst case, delay D(i) suffered by the cell i belonging to the connection can be upper bounded by

$$D(i)\le\max_{1\le j\le i}\left\{a_1(j)+\max_{j=l_1\le l_2\le\cdots\le l_{n+1}=i}\left(\sum_{h=1}^{n}S_w(l_h,l_{h+1})\right)\right\}-a_1(i)$$

Proof: The proof follows trivially from theorem 6.1 by substituting $S_h$, $h = 1, 2, \ldots, n$, by $S_w$.

Corollary 6.1 expresses the worst case delay encountered by any cell under the assumption that for any p and q there exists a function $S_w$ such that $S_w(p,q) \ge S_h(p,q)$, for $h = 1, 2, \ldots, n$. The closer $S_w$ is to $S_h$, the tighter is the bound. The choice of $S_w$ depends on the particular scheduling discipline used at the multiplexing nodes. In the case of carry-over round robin, it is simply the service time at the minimum guaranteed rate of service. The following corollary instantiates the delay bound for CORR service discipline.

Corollary 6.2 Consider a connection traversing n nodes, each of which employs carry-over round robin scheduling discipline. Let $R_w$ be the minimum rate of service offered to a connection at the bottleneck node, and let T be the maximum length of the allocation cycle. Then the worst case delay suffered by the ith cell belonging to the connection is bounded by

$$D_{CORR}(i)\le\left(n+(n-1)\frac{2+\delta_w}{R_w}\right)T+\max_{1\le j\le i}\left\{a_1(j)+S_w(j,i)\right\}-a_1(i)$$

Proof: This follows from corollary 6.1 by replacing

$$\max_{j=l_1\le l_2\le\cdots\le l_{n+1}=i}\left(\sum_{h=1}^{n}S_w(l_h,l_{h+1})\right)\quad\text{with}\quad\left(n+(n-1)\frac{2+\delta_w}{R_w}\right)T+S_w(j,i).$$

The following steps explain the details.

$$\begin{aligned}
\sum_{h=1}^{n}S_w(l_h,l_{h+1}) &= \sum_{h=1}^{n}\left\lceil\frac{(l_{h+1}-l_h+1)+1+\delta_w}{R_w}\right\rceil T \quad\text{(from theorem 5.1)}\\
&\le \sum_{h=1}^{n}\left(\frac{l_{h+1}-l_h+2+\delta_w}{R_w}+1\right)T\\
&\le \left(n+(n-1)\frac{2+\delta_w}{R_w}\right)T+\frac{(l_{n+1}-l_1+1)+1+\delta_w}{R_w}\,T\\
&\le \left(n+(n-1)\frac{2+\delta_w}{R_w}\right)T+S_w(l_1,l_{n+1}) \quad\text{(from theorem 5.1)}\\
&\le \left(n+(n-1)\frac{2+\delta_w}{R_w}\right)T+S_w(j,i) \quad\text{(putting } l_1=j \text{ and } l_{n+1}=i\text{)}
\end{aligned}$$
The final result follows immediately.

The expression for $D_{CORR}$ derived above consists of two main terms. The first term is a constant independent of the cell index. If we observe the second term carefully, we realize that it is nothing other than the delay encountered by the ith cell at a CORR server with a cycle time T and a minimum rate of service $R_w$. Hence, the end-to-end delay reduces to the sum of the delay encountered in a single-node system and a constant. By substituting the delay bounds for the single-node system derived in the last section, we can enumerate the end-to-end delay in a multi-node system for different traffic envelopes.

6.1.3 Comparison with Other Schemes

As discussed in chapter 5, a plethora of multiplexing disciplines providing rate guarantees have been proposed in the last few years. Among these, the scheduling mechanisms closest to ours are Stop-and-Go (SG) [23, 24] and Packet-by-Packet Generalized Processor Sharing (PGPS) [35]. Both SG and PGPS assume a fluid model of data flow. In order to compare them with CORR discipline, we transformed the results derived for SG and PGPS into equivalent results in the discrete time model used in our study.

In SG, each link divides the timeline into frames of size F. The admission policy under which delay guarantees can be made is that no more than m cells are admitted into the system in any frame time, where r = m/F is the transmission rate assigned to a particular connection. A traffic stream that obeys this restriction is said to be (r, F) smooth. The concept of adjacent frames is central to defining the stop-and-go queueing. For each frame on an incoming link, there is an adjacent frame on the outgoing link. The cells coming in during a frame time on the incoming link are dispatched on the corresponding adjacent frame on the outgoing link. In some sense, SG tries to emulate circuit switching on top of packet switching. 
It has been shown that when the sum of the transmission rates assigned to all active connections at each node is less than the link capacity, the end-to-end delay for a connection spanning n nodes is bounded by

$$nF \le D_{SG} \le 2nF.$$

Clearly, the traffic model used in SG is far more restricted than ours and requires allocation of bandwidth at the peak rate. Also, the granularity of bandwidth allocation depends on the size of the frame. The larger the frame size, the finer the granularity of bandwidth allocation. However, a larger frame size leads to higher delays. To alleviate this problem, SG supports different frame sizes to satisfy different rate and delay requirements. Another problem with SG is that it requires network-wide standardization of frame sizes.
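The trade-off between allocation granularity and delay in SG can be made concrete with a small numerical sketch. The function names below are illustrative and not part of the original work; the frame size F is taken to be an integer number of cell slots, and rates are expressed in cells per slot.

```python
import math
from fractions import Fraction

def sg_delay_bounds(n_nodes, frame_slots):
    """Return (lower, upper) = (n*F, 2*n*F), the SG end-to-end delay bounds in slots."""
    return n_nodes * frame_slots, 2 * n_nodes * frame_slots

def sg_allocation(rate, frame_slots):
    """Smallest whole number of cells per frame m with m/F >= rate.

    Returns (m, m/F); m/F is the bandwidth actually reserved, which can
    exceed the requested rate when the frame is short.
    """
    m = math.ceil(Fraction(rate) * frame_slots)
    return m, Fraction(m, frame_slots)

# A connection asking for 3 cells per 100 slots over 5 nodes (assumed numbers).
requested = Fraction(3, 100)
for F in (10, 100):
    m, granted = sg_allocation(requested, F)
    print(F, m, granted, sg_delay_bounds(5, F))
```

With a frame of 10 slots the connection must reserve 1/10 of the link although it asked for 3/100, while a frame of 100 slots grants exactly 3/100 at the cost of delay bounds that are ten times larger.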
Comparing the delay bounds achieved by SG and CORR is difficult because of the difference in the traffic model. While SG assumes arrivals and service at the peak rate, we can exploit the variation in traffic arrival rate using a composite shaping envelope and choosing a service rate just above the average rate of arrivals. Nevertheless, to get a rough idea of how the end-to-end delay bounds for these schemes compare, let us consider a connection shaped using a 2-component moving window shaper $(m_1, w_1)$ and $(m_2, w_2)$, where $m_1 > m_2$, $w_1 > w_2$ and $m_1/w_1 < m_2/w_2$. Let T be the cycle length for CORR, and let R be the number of slots allocated to this connection in each cycle. Since SG admits traffic at the peak rate, $r = m_2/w_2$. If the connection traverses n nodes, we get the following bounds on the end-to-end delay (after some simplifications)

$$D_{SG}\le 2nF.$$

$$D_{CORR}\le n\left(1+\frac{2+\delta}{R}\right)T+\frac{2m_1}{R}T-w_1-\left(\frac{m_1}{m_2}-1\right)w_2.$$

If we consider n to be very large, the first term in the bound on $D_{CORR}$ is the dominant term, and we can neglect the second term. Thus, $D_{CORR} < D_{SG}$ when $R > (2+\delta)T/(2F-T)$. When n is not large and the second term in the bound on $D_{CORR}$ is not negligible, the comparison depends on the exact values of the parameters. Clearly, the delay bounds are competitive. However, because of the flexibility in choosing the appropriate service rate exploiting the variation in traffic arrivals, CORR can potentially admit more connections than SG. SG assumes the peak rate arrival of traffic. In this particular example, bandwidth allocated to the connection is $m_2/w_2$. In contrast, in CORR the bandwidth allocated to the connection can vary from $m_1/w_1$ to $m_2/w_2$, depending on the delay requirements. This gives us a lot of room in terms of allocating the right amount of bandwidth satisfying the end-to-end delay constraints, thereby increasing the number of admissible connections.

PGPS is a packet-by-packet implementation of the fair queueing mechanism [13, 35]. 
In PGPS, incoming cells from different connections are buffered in a sorted priority queue and are served in the order in which they would leave the server in an ideal fair queueing system. The departure times of cells in the fair queueing system are enumerated by simulating the reference ideal system. Simulation of the reference system and maintenance of the priority queue are both quite expensive operations. Hence, the implementation of the PGPS scheme in a real system is difficult, to say the least. Nevertheless, we compare CORR with PGPS to show how the delay performance of CORR compares with that of a near ideal scheme.

It has been shown in [34] that with a (b,t) leaky bucket controlled source and a guaranteed rate of service R > 1/t at each node, the worst case end-to-end delay of a connection traversing n nodes is bounded by
$$D_{PGPS}\le bt+(n-1)t+\sum_{k=1}^{n}\tau_k,$$

where $\tau_k$ is the service time of a cell at the kth server operating at the maximum rate. For small cell sizes and high transmission rates, the third term is negligible and is not considered henceforth. The assumption that the rate of service R > 1/t at each node means that there is no queueing in the system. Under that assumption, the delay bound derived above does not come as a big surprise. As in the case of SG, the traffic model used in PGPS is also very restricted. In fact, the author in [34] suggests a generalization of the traffic model as an important extension of his work. Non-availability of delay bounds of PGPS for more general traffic models makes the task of comparing it with CORR difficult. In any case, to get a rough idea of how these bounds compare, let us consider a connection shaped using two leaky buckets $(b_1, t_1)$ and $(b_2, t_2)$, where $b_1 > b_2$ and $t_1 > t_2$. Let T be the length of the allocation cycle in CORR, and R be the slots allocated to the connection in each cycle. Since PGPS uses only a single leaky bucket characterization of the source, the traffic envelope for PGPS is determined by $(b_2, t_2)$. If we consider a connection spanning n nodes, we get the following bounds on end-to-end delay

$$D_{PGPS}\le (n-1)t_2+b_2t_2.$$

$$D_{CORR}\le (n-1)\left(1+\frac{2+\delta}{R}\right)T+\left(1+\frac{b_1+2}{R}\right)T-\left(\frac{b_1t_1-b_2t_2}{t_1-t_2}-b_1+1\right)t_1.$$

If we assume n to be large, the first terms in the bounds on $D_{PGPS}$ and $D_{CORR}$ are the dominant terms. If we also assume that $t_2 = T/R$, then $D_{CORR}$ is close to $(2+\delta)\times D_{PGPS}$. Hence, as expected, PGPS performs better than CORR in terms of worst case delay bound. However, the delay bounds are quite competitive. When n is not large, the bounds depend on the exact values of the parameters. Although for large n $D_{PGPS}$ is better than $D_{CORR}$ in terms of worst case end-to-end delay performance, the number of connections admitted under CORR may still be more than that under PGPS. 
This is because in PGPS the bandwidth allocated to a connection depends on the parameters of the leaky bucket used to regulate its traffic flow, not so much on the delay requirement. For example, in this particular case, bandwidth allocated to the connection is $1/t_2$. In contrast, CORR provides us with the flexibility to choose the right amount of bandwidth, just enough to satisfy the end-to-end delay requirement. In this case, under the CORR scheme, we can allocate any amount of bandwidth between $1/t_1$ and $1/t_2$ satisfying the end-to-end delay constraints. Consequently, more connections may be admitted.
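For the comparison with SG made earlier in this section, the large-n condition under which the dominant term of the CORR bound falls below the SG upper bound 2nF can be checked numerically. The following is only a sketch under stated assumptions: the function names are hypothetical, `delta` stands for the per-connection constant that is added to 2 in the CORR bound, T and F are measured in the same time unit (cell slots), and the second, n-independent term of the CORR bound is ignored.

```python
from fractions import Fraction

def corr_dominant_term(n, R, T, delta):
    """First (n-dependent) term of the CORR bound: n * (1 + (2 + delta) / R) * T."""
    return n * (1 + Fraction(2 + delta) / R) * T

def sg_upper_bound(n, F):
    """Upper bound on the SG end-to-end delay: 2 * n * F."""
    return 2 * n * F

def corr_beats_sg_threshold(T, F, delta):
    """Slots per cycle R above which the CORR dominant term is below 2nF.

    Implements R > (2 + delta) * T / (2F - T); only meaningful when 2F > T.
    """
    if 2 * F <= T:
        raise ValueError("threshold is defined only for 2F > T")
    return Fraction(2 + delta) * T / (2 * F - T)

# Assumed example: cycle and frame of 100 slots each, delta = 1, 10 nodes.
threshold = corr_beats_sg_threshold(T=100, F=100, delta=1)
print(threshold)                                   # slots per cycle
print(corr_dominant_term(10, 4, 100, 1), sg_upper_bound(10, 100))
```

In this assumed setting any allocation above the printed threshold makes the CORR dominant term smaller than the SG bound, while an allocation below it does not.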
6.2 Fairness Analysis

In the last section we analyzed the worst-case behavior of the system. In the worst-case analysis it is assumed that the system is fully loaded, and each connection is served at the minimum guaranteed rate. However, that is often not the case. In a work conserving server, when the system is not fully loaded the spare capacity can be used by the busy connections to achieve better performance. One of the important performance metrics of a work conserving server is its fairness, that is, how fair the server is in distributing the excess capacity among the active connections. In the rest of the section we define a measure of fairness, analyze the fairness properties of CORR scheduling, and compare CORR with other fair queueing mechanisms.

Let us denote the number of cells of connection $C_p$ transmitted during [0,t) by $N_p(t)$. We define the normalized work received by a connection p as $w_p(t) = N_p(t)/R_p$, where $R_p$ is the service rate of connection $C_p$. Accordingly, $w_p(t_1, t_2) = w_p(t_2) - w_p(t_1)$, where $t_1 \le t_2$, is the normalized service received by connection $C_p$ during $[t_1, t_2)$.

In an ideally fair system, the normalized service received by different connections in their busy state increases at the same rate. That is, when both connections $C_p$ and $C_q$ are in their busy period,

$$\frac{d}{dt}w_p(t)=\frac{d}{dt}w_q(t).$$

For connections that are not busy at t, normalized service stays constant, that is,

$$\frac{d}{dt}w_p(t)=0.$$

If two connections $C_p$ and $C_q$ are both in their busy period during $[t_1, t_2)$, we can easily show that $w_p(t_1, t_2) = w_q(t_1, t_2)$.

Unfortunately, the notion of ideal fairness is only applicable to hypothetical fluid flow models. In a real packet network, a complete packet from one connection has to be transmitted before service is shifted to another connection. Therefore, it is not possible to satisfy the equality of normalized rate of services for all busy connections at all times. However, it is possible to keep the normalized services received by different connections close to each other. 
The Packet-by-Packet Generalized Processor Sharing (PGPS) and the Self-Clocked Fair Queueing (SFQ) are close approximations to ideal fair queueing in the sense that they try to keep the normalized services received by busy sessions close to that of an ideal system. Unfortunately, the realization of PGPS and SFQ is quite complex. In the following, we derive the fairness properties of CORR scheduling and compare them with those of PGPS and SFQ.
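The definition of normalized service can be illustrated with a short sketch that measures, from a log of transmission times, how far two busy connections drift apart over an interval. The helper names and the sample numbers are assumptions made for illustration only.

```python
from bisect import bisect_left
from fractions import Fraction

def cells_in_interval(departure_times, t1, t2):
    """Number of cells transmitted in [t1, t2); departure_times must be sorted."""
    return bisect_left(departure_times, t2) - bisect_left(departure_times, t1)

def normalized_service(departure_times, rate, t1, t2):
    """w(t1, t2) = N(t1, t2) / R for one connection."""
    return Fraction(cells_in_interval(departure_times, t1, t2)) / rate

def service_disparity(dep_p, rate_p, dep_q, rate_q, t1, t2):
    """|w_p(t1, t2) - w_q(t1, t2)|; zero in an ideally fair fluid system."""
    return abs(normalized_service(dep_p, rate_p, t1, t2)
               - normalized_service(dep_q, rate_q, t1, t2))

# Assumed logs: p has rate 2 and sends 10 cells, q has rate 3 and sends 12 cells.
dep_p = list(range(0, 20, 2))            # 10 departures
dep_q = [t for t in range(1, 25, 2)]     # 12 departures
print(service_disparity(dep_p, 2, dep_q, 3, 0, 30))
```

Here the two connections receive normalized service of 5 and 4 respectively, so the disparity over the interval is 1.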
6.2.1 Fairness of CORR

For the sake of simplicity, we assume that our sampling points coincide with the beginning of the allocation cycles only. If frame sizes are small, this approximation is quite reasonable.

Lemma 6.4 If a connection p is in a busy period during cycles $k_1$ through $k_2$, where $k_2 \ge k_1$, the amount of service received by the connection during $[k_1, k_2]$ is bounded by

$$\max\left\{0,\;\lfloor(k_2-k_1)R_p-\delta_p\rfloor\right\}\le N_p(k_1,k_2)\le\lceil(k_2-k_1)R_p+\delta_p\rceil.$$

Proof: The proof follows directly from lemma 5.2.

Corollary 6.3 If a connection $C_p$ is in a busy period during the cycles $k_1$ through $k_2$, where $k_2 \ge k_1$, the amount of normalized service received by the connection during $[k_1, k_2]$ is bounded by

$$\max\left\{0,\;\frac{\lfloor(k_2-k_1)R_p-\delta_p\rfloor}{R_p}\right\}\le w_p(k_1,k_2)\le\frac{\lceil(k_2-k_1)R_p+\delta_p\rceil}{R_p}.$$

Proof: The proof follows directly from lemma 6.4 and the definition of normalized service.

Theorem 6.2 If two connections p and q are in their busy periods during cycles $k_1$ through $k_2$, where $k_2 \ge k_1$, then

$$\Delta(p,q)=\left|w_p(k_1,k_2)-w_q(k_1,k_2)\right|\le\frac{1+\delta_p}{R_p}+\frac{1+\delta_q}{R_q}$$

Proof: From corollary 6.3 we get

$$\begin{aligned}
\Delta(p,q) &= \left|w_p(k_1,k_2)-w_q(k_1,k_2)\right|\\
&\le \max\left\{\frac{\lceil(k_2-k_1)R_p+\delta_p\rceil}{R_p}-\frac{\lfloor(k_2-k_1)R_q-\delta_q\rfloor}{R_q},\;\frac{\lceil(k_2-k_1)R_q+\delta_q\rceil}{R_q}-\frac{\lfloor(k_2-k_1)R_p-\delta_p\rfloor}{R_p}\right\}\\
&\le \max\left\{\frac{\left\lceil\left([k_2-k_1]+\frac{\delta_p}{R_p}\right)R_p\right\rceil}{R_p}-\frac{\left\lfloor\left([k_2-k_1]-\frac{\delta_q}{R_q}\right)R_q\right\rfloor}{R_q},\;\frac{\left\lceil\left([k_2-k_1]+\frac{\delta_q}{R_q}\right)R_q\right\rceil}{R_q}-\frac{\left\lfloor\left([k_2-k_1]-\frac{\delta_p}{R_p}\right)R_p\right\rfloor}{R_p}\right\}\\
&\le \max\left\{\left([k_2-k_1]+\frac{1+\delta_p}{R_p}\right)-\left([k_2-k_1]-\frac{1+\delta_q}{R_q}\right),\;\left([k_2-k_1]+\frac{1+\delta_q}{R_q}\right)-\left([k_2-k_1]-\frac{1+\delta_p}{R_p}\right)\right\}\\
&\le \frac{1+\delta_p}{R_p}+\frac{1+\delta_q}{R_q}.
\end{aligned}$$

This completes the proof.

6.2.2 Comparison with Other Schemes

The notion of fairness was first applied to network systems in [13] using an idealized fluid flow model. In a fluid model of traffic, multiple connections can receive service in parallel, and hence it is possible to divide service capacity among active connections exactly in proportion to their service share at all times. As mentioned earlier, this ideal form of fair queueing cannot be implemented in practice. In [13, 35] an extension to the ideal fair queueing mechanism is proposed for packet switched networks. In the scheme (PGPS) proposed in [35] the service order of packets is determined using the ideal fair queueing system as the reference, and by simulating the corresponding fluid flow model. The simulation of the hypothetical fluid model is computationally expensive and may be prohibitive, particularly at high transmission speed. The complexity of the PGPS scheme is somewhat alleviated in the SFQ mechanism proposed in [25]. In the SFQ scheme the hypothetical fluid flow simulation is eliminated by using an internal virtual time reference as an index of work. Although SFQ is far less complex than PGPS, it is still not simple enough for implementation at gigabit speed. The SFQ mechanism requires the waiting cells to be arranged in a priority queue sorted in an increasing order of their virtual departure time. Insertion in a hardware priority queue is an O(n) operation [9], where n is the length of the queue. Hence, for a reasonably large value of n and at a high transmission speed, a priority queue of cells may be difficult to maintain. Compared to the schemes described here, CORR scheduling is much simpler. Of course, simplicity comes at the cost of loss in fairness. In the following we show that although CORR is not as fair as PGPS and SFQ, it is quite close to them. 
But along with fairness, if we consider the complexity of implementation as one of the criteria for evaluation, we believe that CORR is an attractive choice.

To compare SFQ, PGPS, and CORR in terms of fairness, we use $\Delta(p,q)$ as the performance metric. As discussed earlier, $\Delta(p,q)$ is the absolute difference in normalized work received by two sessions over a time period where both of them were busy. We proved earlier that if our sample points are at the beginning of the allocation cycles, then

$$\Delta_{CORR}(p,q)\le\frac{1+\delta_p}{R_p}+\frac{1+\delta_q}{R_q}.$$
Under the same scenario, it can be proved that in the SFQ scheme the following holds at all times,

$$\Delta_{SFQ}(p,q)\le\frac{1}{R_p}+\frac{1}{R_q}$$

Due to the difference in the definition of busy periods in PGPS, a similar result is difficult to derive. However, Golestani [25] has shown that the maximum permissible service disparity between a pair of busy connections in the SFQ scheme is never more than two times the corresponding figure for any real queueing scheme. This proves that

$$\Delta_{PGPS}(p,q)\ge\frac{1}{2}\Delta_{SFQ}(p,q).$$

Note that $0 \le \delta_i \le 1$ for every connection i. Hence, the fairness index of CORR is within two times that of SFQ and at most four times that of any other queueing discipline, including PGPS. Clearly, it is a very competitive bound.

6.3 Summary

In this chapter, we have analyzed delay performance and fairness properties of CORR scheduling. We have derived closed form bounds on the worst-case delay when CORR is used in conjunction with composite leaky bucket, moving window, and jumping window regulators. We have shown that when the scheduling discipline guarantees a minimum rate of service at all nodes, the worst case end-to-end delay in a multi-node system can be reduced to the delay in an equivalent single-node system. We have used this result to derive corresponding end-to-end bounds on delay. We have shown that despite its simplicity, CORR is very competitive with other more complex scheduling disciplines, such as PGPS and SFQ, in terms of both delay performance and fairness.
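The fairness bounds compared in section 6.2.2 can be evaluated side by side. The sketch below uses hypothetical function names; `delta_p` and `delta_q` denote the per-connection constants in the CORR bound, assumed to lie between 0 and 1, and the rates are expressed in slots per allocation cycle.

```python
from fractions import Fraction

def corr_fairness_bound(R_p, R_q, delta_p, delta_q):
    """(1 + delta_p) / R_p + (1 + delta_q) / R_q, the CORR disparity bound."""
    return Fraction(1 + delta_p) / R_p + Fraction(1 + delta_q) / R_q

def sfq_fairness_bound(R_p, R_q):
    """1 / R_p + 1 / R_q, the SFQ disparity bound."""
    return Fraction(1) / R_p + Fraction(1) / R_q

# Assumed example: fractional allocation for p, integral allocation for q.
R_p, R_q = Fraction(5, 2), 4
corr = corr_fairness_bound(R_p, R_q, Fraction(1, 2), 0)
sfq = sfq_fairness_bound(R_p, R_q)
print(corr, sfq, corr / sfq)

# With both constants at their maximum of 1 the CORR bound is exactly twice SFQ's.
worst = corr_fairness_bound(R_p, R_q, 1, 1)
print(worst == 2 * sfq)
```

Because each constant is at most 1, the ratio of the two bounds never exceeds 2, which is the factor quoted in the comparison.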
Chapter 7

Conclusions

In this dissertation, we have addressed two important problems hindering the ubiquitous deployment of distributed multimedia applications: 1) preserving network throughput at the end-hosts, and 2) developing traffic control mechanisms for providing service guarantees in ATM networks.

7.1 Contributions

We have addressed the first problem in chapters 2 and 3. In chapter 2, we proposed an I/O architecture which goes beyond preserving network throughput in the end-system, and addresses the chronic I/O bottleneck of the current generation of operating systems. The proposed architecture limits the operating system involvement in I/O transfers by migrating some of the operating system's I/O functions to the I/O devices. The presence of on-board micro-processors on most of the I/O adapters makes it a feasible task without a major overhaul of the device hardware. We believe that autonomy of devices, coupled with a connection oriented I/O architecture, is a fundamentally different and very promising approach to solving the I/O bottleneck in the host systems.

In chapter 3, we have experimentally demonstrated the performance impact of the proposed I/O architecture on networked multimedia applications. On a video conferencing system built around IBM RS/6000s equipped with high-performance video CODECs and connected via 100 Mb/s ATM links, we have shown that a connection oriented autonomous I/O architecture can improve end-to-end network throughput by as much as three times that achievable using the existing architecture.

Traffic control mechanisms for providing end-to-end service guarantees have been addressed in chapters 4, 5 and 6. The contribution of chapter 4 is in the precise characterization of traffic envelopes defined by composite shapers. These shaping envelopes are used to model input traffic in the delay analysis of the scheduling policies. With the recent standardization of multiple leaky-bucket shaping of virtual connections by the ATM Forum [21], our work on characterization of traffic generated by composite shapers is extremely important in developing an end-to-end quality of service architecture.

In chapter 5, we have presented the algorithmic description of Carry-Over Round Robin Scheduling. The main attraction of CORR is its simplicity. In terms of complexity, CORR is comparable to round robin and frame based mechanisms. However, CORR does not suffer from the shortcomings of round robin and frame based schedulers. By allowing the number of slots allocated to a connection in an allocation cycle to be a real number instead of an integer, we break the coupling between the service delay and bandwidth allocation granularity. Also, unlike frame based mechanisms, such as Stop-and-Go and Hierarchical-Round-Robin, CORR is a work conserving discipline capable of exploiting the multiplexing gains of ATM.

The performance of the traffic control mechanism and the service policy has been analyzed in chapter 6. We have shown that, when used in conjunction with composite shapers, the CORR scheduling discipline is very effective in providing end-to-end delay guarantees. Besides providing guarantees on delay, CORR is also fair in distributing the excess bandwidth. Our results show that despite its simplicity, CORR is very competitive with much more complex scheduling disciplines, such as Packet-by-Packet Generalized Processor Sharing and Self-Clocked Fair Queueing, both in terms of delay performance and fairness.

7.2 Future Directions

The work presented in this dissertation can be extended in several ways. In chapter 3, we demonstrated the effectiveness of the autonomous I/O architecture on a video conferencing application. Applying similar principles and mechanisms to other multi-media applications would be an interesting endeavor. Although video conferencing is representative of typical multi-media applications, a high-performance video server is much more demanding, both in terms of I/O throughput and network performance. 
We believe that the I/O architecture proposed in this dissertation is the only way to meet the I/O and network demands of a large video server. An experimental verification of this conjecture would be a worthwhile undertaking.

We have proposed multi-rate shaping as an effective mechanism to capture variability in traffic flows. We have analyzed end-to-end delay performance when CORR scheduling is used in conjunction with multi-rate shapers. It would be interesting to perform similar analyses for other scheduling disciplines proposed in the literature. Most analyses of packet scheduling disciplines assume very simple traffic models, such as peak rate or simple leaky bucket controlled sources. Since these models are inadequate in capturing the variability in traffic arrivals, the performance of the scheduling algorithms in exploiting the multiplexing gains of ATM has not been properly tested in these analyses. Re-evaluating them under the generalized traffic model will expose their strengths and weaknesses in a true packet switched environment.
Bibliography

[1] MMTplus Functional Description. IBM Internal Document, 1993.
[2] ATM Turboways 100 Adapter. IBM Internal Document, 1994.
[3] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young. Mach: A New Kernel Foundation for UNIX Development. In Proceedings of the USENIX, 1986.
[4] J. Adam, H. Houh, M. Ismert, and D. Tennenhouse. A Network Architecture for Distributed Multimedia Systems. In Proceedings of the Multimedia Computing and Systems, 1994.
[5] C. M. Aras, J. F. Kurose, D. S. Reeves, and H. Schulzrinne. Real-Time Communication in Packet-Switched Networks. IEEE Transactions on Information Theory, 82(1), 1994.
[6] B. Bershad, C. Chambers, S. Eggers, C. Maeda, D. McNamee, P. Pyzemyslaw, S. Savage, and S. Emin Gon. SPIN - An Extensible Microkernel for Application-Specific Operating System Services. In Department of Computer Science and Engineering FR-35, TR-94-03-03, University of Washington, 1994.
[7] S. Keshav, C. R. Kalmanek, and H. Kanakia. Rate Controlled Servers for Very High Speed Networks. In Proceedings, GLOBECOM, 1990.
[8] J.B. Carter, A. Davis, R. Kuramkote, C.-C. Kuo, L.B. Stoller, and M. Swanson. Avalanche: A Communication and Memory Architecture for Scalable Parallel Computing. Draft Report, Computer Systems Laboratory, University of Utah, 1995.
[9] H. J. Chao. Architecture Design for Regulating and Scheduling User's Traffic in ATM Networks. In Proceedings, SIGCOMM, 1992.
[10] David Cheriton. The V Distributed System. Communications of the ACM, 33(3), March 1988.
[11] D. Clark, B. Davie, and I. Gopal et al. The AURORA Gigabit Testbed. Computer Networks and ISDN Systems, 25(6), 1993.
[12] D. Clark, and D. Tennenhouse. Architectural Considerations for a New Generation of Protocols. In Proceedings, ACM SIGCOMM, 1990.
[13] A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. In Proceedings, SIGCOMM, 1989.
[14] P. Druschel, M. Abbott, M. Pagels, and L. Peterson. Analysis of I/O Subsystem Design for Multimedia Workstations. In Proceedings of the Workshop on Network and Operating System Support for Digital Audio and Video, 1992.
[15] P. Druschel, and L. Peterson. Fbufs: A High-Bandwidth Cross-Domain Transfer Facility. In Proceedings of the SOSP, 1993.
[16] P. Druschel, L. Peterson, and B.S. Davie. Experiences with a High-Speed Network Adapter: A Software Perspective. In Proceedings, ACM SIGCOMM, 1994.
[17] D.R. Engler, M.F. Kaashoek, and J. O'Toole. The Operating System Kernel as a Secure Programmable Machine. In Proceedings of the sixth SIGOPS European Workshop, 1994.
[18] K. Fall, and J. Pasquale. Exploiting In-Kernel Data Paths to Improve I/O throughput and CPU Availability. In Proceedings of the USENIX Winter Conference, 1993.
[19] G. Finn. An integration of network communication with workstation architecture. Computer Communication Review, 21(5), October 1991.
[20] R. Fitzgerald, and R. Rashid. The Integration of Virtual Memory Management and Interprocess Communication in Accent. volume 4, 1986.
[21] ATM Forum. ATM user-network interface specification, version 3.1, 1994.
[22] L. Georgiadis, R. Guerin, and A. Parekh. Optimal Multiplexing on Single Link: Delay and Buffer Requirements. In Proceedings, INFOCOM, 1994.
[23] S. J. Golestani. A Framing Strategy for Congestion Management. IEEE Journal on Selected Areas in Communications, 9(7), 1991.
[24] S. J. Golestani. Congestion Free Communication in High-Speed Packet Networks. IEEE Transactions on Communications, 32(12), 1991.
[25] S.J. Golestani. A Self-Clocked Fair Queuing Scheme for Broadband Applications. In Proceedings, INFOCOM, 1993.
[26] D. Ferrari, H. Zhang. Rate Controlled Static Priority Queuing. In Proceedings, INFOCOM, 1993.
[27] M. Hayter, and M. Derek. The Desk Area Network. Operating Systems Review, 25(4), October 1991.
[28] D. Heyman, A. Tabatabai, and T.V. Lakshman. Statistical Analysis and Simulation Study of Video Teleconferencing Traffic in ATM Networks. IEEE Journal on Selected Areas in Communications, 2(1), 1992.
[29] M. Laubach. Classical IP and ARP over ATM. Internet RFC 1577, January 1994.
[30] D. Legall. MPEG - A Video Compression Standard for Multimedia Applications. Communications of the ACM, 34(4), 1991.
[31] H. Massalin. Synthesis: An Efficient Implementation of Fundamental Operating System Services. PhD thesis, Columbia University, 1992.
[32] J. Mogul, and A. Borg. The Effect of Context Switches on Cache Performance. In Proceedings of the ASPLOS-IV, 1991.
[33] John Ousterhout. Why Aren't Operating Systems Getting Faster As Fast As Hardware? In Proceedings of the USENIX Summer Conference, 1990.
[34] A. Parekh. A Generalized Processor Sharing Approach to Flow Control In Integrated Services Network. Ph.D. Thesis, Massachusetts Institute of Technology, 1992.
[35] A. K. Parekh, and R. G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Network: The Single Node Case. IEEE/ACM Transactions on Networking, 1(3), 1993.
[36] M. Pasieka, P. Crumley, A. Marks, and A. Infortuna. Distributed Multimedia: How Can the Necessary Data Rates be Supported. In Proceedings of the USENIX Summer Conference, 1991.
[37] J. Pasquale, E. Anderson, and P.K. Muller. Container-Shipping: Operating System Support for I/O Intensive Applications. IEEE Computer, 27(3), 1994.
[38] E. Rathgeb. Modeling and Performance Comparison of Policing Mechanisms for ATM Networks. IEEE Journal on Selected Areas in Communications, 9(3), 1991.
[39] D. Saha, D. Kandlur, T. Barzilai, Z. Shae, and M. Willebeek-LeMair. A videoconferencing testbed on ATM: Design, implementation, and optimizations. Submitted for publication, November 1994.
[40] D. Saha, S. Mukherjee, and S. K. Tripathi. Multi-rate Traffic Shaping and End-to-end Performance Guarantees in ATM Networks. In Proceedings, International Conference on Network Protocols, 1994.
[41] M. D. Schroeder, and M. Burrows. Performance of Firefly RPC. ACM Transactions on Computer Systems, 8(1), 1989.
[42] Z-Y. Shae, and M-S. Chen. Mixing and playback of JPEG compressed packet videos. In Proceedings GLOBECOM 92. IEEE, December 1992.
[43] J. S. Turner. New Directions in Communications. IEEE Communications, 24(10), 1986.
[44] S-Y Tzou, and D. P. Anderson. The Performance of Message Passing Using Restricted Virtual Memory Remapping. Software-Practice and Experience, 21, 1991.
[45] The USENIX Association and O'Reilly & Associates, Inc., 103 Morris Street, Suite A, Sebastopol, CA 94572. 4.4BSD Programmer's Reference Manual, April 1994.
[46] G. K. Wallace. The JPEG Still Picture Compression Standard. Communications of the ACM, 34, 1991.
[47] S. Wray, T. Glauert, and A. Hopper. Networked Multimedia: The Medusa Environment. IEEE Multimedia, 1(4), 1994.
[48] L. Zhang. Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks. In Proceedings, SIGCOMM, 1990.
