Introduction
The video server is emerging as a key technology in the trend toward integrated multimedia services such as video-on-demand, teleshopping, broadcasting, and distance learning. One of the important requirements in designing a video server is the ability to support a large number of simultaneous video clients watching the movies stored in the server. To achieve this goal, parallel retrieval from video storage and switch-based delivery networks have been introduced, for example RAID (Redundant Array of Inexpensive Disks) systems and ATM (Asynchronous Transfer Mode) networks. In [1] and [2], a PC-based video server was designed to support 40 video streams. In [3] and [4], a high-performance workstation with large I/O bandwidth was designed as a video server that supports up to 300 video streams. The number of simultaneous video clients is limited by the centralized design of the video server. Given that thousands of video streams are required, it is evident that a distributed approach should be used to enhance the system.
This work is currently supported by the R.O.C. Ministry of Economic Affairs under project No. 37H3100 conducted by ITRI. This paper was presented in part at the IEEE International Conference on Communications, Dallas, TX, USA, 1996 (SUPERCOMM/ICC '96).
In this paper, a distributed approach that combines space-division and time-division switching technology is proposed for the design of a scalable video server. The proposed architecture is called the ATM-based Scalable Video (ASV) Server System, as shown in Figure 1. As its name suggests, the system is distributed and uses ATM technology. The major part of the system consists of a group of video pumping units (VPUs) and an internal ATM-based video stream routing network, which together make the proposed system scalable. The distributed storage devices are controlled by a set of VPUs. Each VPU reads the video files from its connected hard disks into a local buffer, transforms the files into ATM cells, and pumps the cells into the stream router. The stream router accepts ATM cells on a large number of inputs and switches them into their designated channels. A Server Headend Control (SHC) unit connects the system to the ATM networks. For interactive control and maintenance purposes, the SHC unit uses a reverse control channel that is connected to each VPU.
This paper describes the architectural design of the proposed ASV server system and the operation and performance of the VPUs, and then focuses on the design of the ATM-based video stream router. Each VPU is designed so that its operations are controlled by a dedicated microprocessor. The theoretical limit on the number of simultaneous clients is derived from the proposed design. The stream router is required to guarantee the quality of service (QoS) of each multiplexed CBR or VBR stream. Therefore, the system architecture and the corresponding hardware design are proposed and described. Experimental results show that the proposed stream router can produce guaranteed throughput and bounded delay variance for video applications.
This paper is organized as follows. The system architecture, describing the functionality of each component, is given in Section 2.
The detailed description and analysis of the proposed video pumping unit are given in Section 3. A further detailed design for the stream router is given in Section 4, followed by the performance evaluation in Section 5. A short conclusion is given in Section 6.
System Architecture
There are four subsystems in the proposed ASV server system, namely the Distributed Storage, the Video Pumping Unit (VPU), the Stream Router, and the Server Headend Control (SHC) unit. The functionality of each subsystem is described in the following subsections.
Distributed Storage
There are many storage devices for storing video files, and each device can be a RAID system or simply a hard disk. Each RAID or hard disk is connected to one VPU, which pumps the files into the networks. Two or more storage devices can be connected to one VPU to obtain higher performance. The storage devices are organized in a distributed way so that the total number of simultaneous accesses can be made very large. In addition, the total storage capacity can be scaled up without introducing additional seek delay.
Video Pumping Unit
The function of each VPU is to pump the video files in the connected storage device into the stream router. The VPU first reads a block of video data from its connected hard disk, segments the data into ATM cells, and then puts the cells into the stream router at the required service rate. Each VPU can serve only a limited number of video streams because of the mechanical characteristics of the hard disk. A detailed architectural description and analysis of the VPU is given in Section 3.
Stream Router
The responsibility of the stream router is to multiplex the video streams into the server headend control unit. The major challenge is that a video stream may be variable bit rate, like MPEG-1 streams, or constant bit rate, like MPEG-2 streams. The performance requirement for each stream is a guaranteed throughput and a bounded delay variance. To guarantee the performance of each stream, intelligent scheduling among the streams is needed in addition to switching and multiplexing. The proposed architecture of the stream router is based on ATM technology. A detailed description is given in Section 4.
Server Headend Control
The server headend control unit primarily handles the following functions.
1. The signalling interface with networks.
2. Admission control for the requests from clients.
3. Session control for video operation via the reverse channel.
4. Video server management.
The server headend control unit provides an interface for the video server to be connected to the networks. Since the proposed architecture is ATM-based, the interface is the ATM UNI (user-to-network interface). Operations such as video channel establishment, control message transfer, and video content searching can be done in this unit. The admission control function is also necessary for guaranteeing performance. The major components of the server headend control unit are shown in Figure 2. The procedure for playing a movie from the server includes the following steps:
1. The client sends a request to the server headend control unit.
2. Movie searching, content navigation, ...
3. Admission control session.
4. Initiate a video stream from the storage to the server headend control unit via the stream router (downstream).
5. Set up a VC (virtual channel) to the client via ATM networks (downstream).
6. Play the movie.
The Video Pumping Unit
Since the function of each VPU is to pump the video files in the connected storage device into the stream router, a temporary buffer is required for caching the video file before it is segmented into ATM cells. Thus, we summarize the video pumping mechanism as the following two operations:
• Operation (1): Move the file from the hard disk or the RAID to the temporary memory, block by block.
• Operation (2): Move a block of video data from the memory to the ATM segmentation unit.
The proposed VPU shown in Figure 3 is designed according to the above repeating operations. A microprocessor is used for controlling the movement of the data to and from the memory. Both the ATM segmentation-and-reassembly (SAR) unit and the SCSI controller for the storage devices are attached to the system bus. The data movement over the system bus may be performed by direct memory access (DMA) if the microprocessor provides this function.
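The two pumping operations can be illustrated with a short software sketch. This is only a behavioural illustration under stated assumptions, not the hardware design of Figure 3: the function names are hypothetical, cells are represented by their 48-byte payloads only, and the cell header and adaptation-layer framing handled by the SAR unit are omitted.

```python
# Sketch of the two VPU operations: read fixed-size blocks, then cut each
# block into 48-byte ATM cell payloads. Headers and AAL framing are omitted.
import io

ATM_PAYLOAD_BYTES = 48  # payload size of a standard 53-byte ATM cell


def read_blocks(stream, block_size):
    """Operation (1): yield the file block by block (the last may be short)."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block


def segment_block(block):
    """Operation (2): split one block into zero-padded 48-byte payloads."""
    cells = []
    for offset in range(0, len(block), ATM_PAYLOAD_BYTES):
        payload = block[offset:offset + ATM_PAYLOAD_BYTES]
        cells.append(payload.ljust(ATM_PAYLOAD_BYTES, b"\x00"))
    return cells


def pump(stream, block_size):
    """Run both operations in sequence and return every cell payload."""
    cells = []
    for block in read_blocks(stream, block_size):
        cells.extend(segment_block(block))
    return cells


# Stand-in for a stored video file: 1024 bytes held in memory (assumption).
video = io.BytesIO(bytes(range(256)) * 4)
cells = pump(video, block_size=500)
```

In a real VPU the two operations would overlap in time (for example through DMA), whereas this sketch runs them strictly one after the other for clarity.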
To evaluate the performance of the VPU, the following assumptions are made:
1. The block size in Operation (1) is assumed to be $B$ bytes, that is, each time $B$ bytes are moved from the hard disk into the memory.
2. The hard disk or the RAID requires a mechanical delay of $d_m$ seconds before it begins to transfer its data to the system bus. The mechanical delay includes the track seeking delay and the rotation delay. We also assume that this delay is the same for both an ordinary hard disk and a RAID system, since the RAID improves the performance mainly through parallel data transfer.
3. The SCSI controller transfers the data to the system bus at $r_s$ bytes per second.
4. Each video stream requires a minimum throughput of $a$ bits per second to guarantee the quality of the video.
Assume that there are $n$ video streams over the VPU. The relation among the above parameters can be derived as follows. The latency for the VPU to move a block of video data for one video stream includes the hard disk mechanical delay and the data transfer delay, which equals

$$d_m + \frac{B}{r_s}.$$

Since there are $n$ video streams, the time interval between moving two blocks of video data for a specific video stream is

$$n \left( d_m + \frac{B}{r_s} \right).$$

During this time interval, $B$ bytes (that is, $8B$ bits) of video data are transferred for that stream. Thus, we have the following equation.

$$a = \frac{8B}{n \left( d_m + B / r_s \right)} \tag{1}$$

Solving Equation (1) for the block size gives

$$B = \frac{n \, a \, d_m \, r_s}{8 r_s - n a}. \tag{2}$$

We plot the curve of block size $B$ with respect to $n$ by assuming that $a = 1.5$ Mbps (for MPEG-1 streams), $d_m = 10$ ms for commercially available hard disks, and $r_s = 10$ MBytes/s or 20 MBytes/s. The result is shown in Figure 4, where the maximum number of video streams for SCSI transfer rates of 10 MBytes/s and 20 MBytes/s is about 20 and 26, respectively, when $B = 64$ KBytes. For MPEG-2 streams, which require $a = 3$ Mbps, we show the block size $B$ versus the number of streams $n$ in Figure 5. The result differs from Figure 4 only in that the number of video streams is halved. Only a very large block size, e.g., $B = 1.5$ MBytes, could produce a larger number of video streams. However, that number is still limited to 45 for $r_s = 10$ MBytes/s and 95 for $r_s = 20$ MBytes/s (MPEG-1 streams). Therefore, we conclude that the bandwidth utilization per VPU is not high because of the hard disk mechanical latency, where the bandwidth utilization is defined as the ratio of the total bandwidth used by all the video streams to the maximum possible bandwidth of the SCSI controller. For example, when there are 20 MPEG-1 users under $B = 64$ KBytes and $r_s = 10$ MBytes/s, the utilization is $(20 \times 1.5) / (8 \times 10)$, or about 37%. Denoting the bandwidth utilization per VPU by $\rho$, we can calculate $\rho$ as follows.

$$\rho = \frac{n a}{8 r_s} \tag{3}$$

By Equations (1) and (3), we have

$$\rho = \frac{B}{B + d_m r_s}. \tag{4}$$
The result of Equation (4) is shown in Figure 6. The bandwidth utilization for a higher SCSI transfer rate is lower than that for a lower SCSI transfer rate. This is because the mechanical latency becomes the dominant factor affecting the bandwidth utilization. An advanced RAID system using file striping only improves the data transfer rate, and therefore the improvement in the number of video streams is limited. To increase this number, a distributed approach based on ATM switching technology is very helpful in improving the scalability. Several VPUs can simultaneously pump their streams into a stream router, which is actually a special-purpose ATM switch. As the dimension of the switch increases, the number of video streams increases. In the next section, the detailed architecture of the stream router is described.
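The capacity figures quoted in this section can be checked numerically. The sketch below assumes only the relation stated in the text, namely that each stream receives one block of B bytes every n(d + B/r) seconds and must sustain a bits per second. The function names are hypothetical, and the unit conventions (1 KByte = 1024 bytes, 1 MByte/s = 10^6 bytes/s) are assumptions.

```python
def max_streams(block_bytes, scsi_rate, mech_delay, stream_bps):
    """Largest (fractional) n satisfying 8B / (n (d + B/r)) >= a."""
    per_block_latency = mech_delay + block_bytes / scsi_rate
    return 8 * block_bytes / (stream_bps * per_block_latency)


def utilization(n, stream_bps, scsi_rate):
    """Share of the SCSI bandwidth consumed by n streams."""
    return n * stream_bps / (8 * scsi_rate)


def utilization_from_block(block_bytes, scsi_rate, mech_delay):
    """Utilization at the maximum n, written in terms of the block size."""
    return block_bytes / (block_bytes + mech_delay * scsi_rate)


B = 64 * 1024   # block size in bytes (assumed 1 KByte = 1024 bytes)
D = 0.010       # mechanical delay in seconds
A = 1.5e6       # MPEG-1 stream rate in bits per second

n_10 = max_streams(B, 10e6, D, A)   # SCSI at 10 MBytes/s
n_20 = max_streams(B, 20e6, D, A)   # SCSI at 20 MBytes/s
rho_20_users = utilization(20, A, 10e6)
```

With these assumptions the computed limits fall close to the values of about 20 and 26 streams read from Figure 4, and 20 MPEG-1 users on a 10 MBytes/s controller give a utilization of 0.375.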
The ATM-based Stream Router

The Architecture
The stream router should be designed so that no unexpected jitter or delay is introduced to a video stream being switched through the router. Since the stream router is ATM-based, the digitized video is carried by ATM cells. Let $BP_i$ and $BM_i$ denote the number of cells that video source $i$ may send within a cycle at its peak rate and at its mean rate, respectively. For CBR sources, we let $BP_i = BM_i$. This assumption is used in the architecture described as follows. The stream router is an $M \times N$ switch, where $M$ may be greater than $N$ in order to make the storage space scalable. The stream router consists of two stages, namely the stream demultiplexer and the stream multiplexer, as shown in Figure 7. The stream demultiplexer is a tree demultiplexer which forwards the cells to the desired output port. Each output port is associated with a stream multiplexer for multiplexing a set of CBR and VBR sources into this output port. Assuming that each video source is carried by a virtual channel (VC) established on the switch, the multiplexer can be referred to as a VC multiplexer. The design goal of the multiplexer is to guarantee the throughput of each VC within a bounded delay variance. Traditional hardware multiplexing schemes such as round robin or weighted round robin do not work very well since the traffic may be variable bit rate. Hence, we propose a stream multiplexer using the well-known leaky bucket mechanism. The multiplexer is made up of a basic building block called the Cell Input Unit (CIU). Each video source $i$ is associated with a CIU for scheduling purposes, and the incoming cells are fed into input $I_i$ of the CIU, as shown in Figure 8. In each CIU, cells of input $I_i$ select one of the two outputs, namely $P_i$ and $M_i$, according to the state that the cell conforms to. There are three states that an incoming cell may fall into, namely the peak rate state P, the mean rate state M, and the non-conformance state N. A cell is in the M state if fewer than $BM_i$ cells arrive within a cycle.
A cell is in the P state if fewer than $BP_i$ and more than $BM_i$ cells arrive within a cycle. Otherwise, the cell is in the N state. P-state cells are routed to output $P_i$ of the CIU, M-state cells are routed to output $M_i$ of the CIU, and N-state cells are discarded or delayed. In Figure 8, the output selection in a CIU is realized by a two-stage leaky bucket. The first leaky bucket is used as a P-state arbitrator and is therefore controlled by a token generator which holds $BP_i$ tokens within a cycle. If a cell arrives and there is a token in the token pool, then the circuit in Figure 8 switches the cell to the next leaky bucket. If there is no token, then the cell is discarded or kept in the temporary buffer, depending upon the real-time requirement. The second leaky bucket is used as an M-state arbitrator which switches the cell to either output $P_i$ or $M_i$ and is controlled by a token generator which holds $BM_i$ tokens within a cycle. The selection circuit used for the leaky bucket in Figure 8 is realized by an operation on two bits, namely the token bit represented by $T$ and the head-of-line occupation bit represented by $H$. Let $T = 1$ represent a token and $H = 1$ represent that a cell is waiting at the head of the line. For the first leaky bucket, a cell is switched to the second leaky bucket or is dropped according to the AND operation of $T$ and $H$, as shown in Table I. Another AND gate is also used to feed the token back to the token pool when there is no cell arrival. The same method is used for the second leaky bucket to switch between the outputs $M_i$ and $P_i$. A clock signal Clk is used to control the operation of the circuit. A reset signal is also used at the input to reset the token counters when a cycle has come to an end. Assuming that there are at most $J$ video sources, the overall implementation of the stream multiplexer is shown in Figure 9.
A total of $J$ CIUs are used, and each CIU is associated with a delay circuit denoted by $D$ so that the CIUs operate serially. The outputs of each CIU are directed to two output buffers, namely the P buffer for P-state cells and the M buffer for M-state cells. The two output buffers correspond to two priority levels. Cells in the M buffer have higher priority for transmission on the outgoing link than cells in the P buffer. By this mechanism, the average throughput of each video source is guaranteed. The peak throughput depends upon the admission control algorithm and can be calculated according to the number of multiplexed sources and the allowable maximum cell delay.
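The behaviour of the CIU and of the two-priority output stage can be sketched in software. This is a behavioural model under stated assumptions, not the gate-level circuit of Figures 8 and 9: the class and method names are hypothetical, non-conforming cells are always dropped rather than buffered, and the serial delay circuits are not modelled.

```python
from collections import deque


class CellInputUnit:
    """Two-stage leaky bucket for one video source (behavioural model only)."""

    def __init__(self, bp, bm):
        if bm > bp:
            raise ValueError("mean-rate budget cannot exceed peak-rate budget")
        self.bp, self.bm = bp, bm
        self.reset()

    def reset(self):
        """Refill both token pools at the start of a cycle."""
        self.p_tokens = self.bp
        self.m_tokens = self.bm

    def classify(self):
        """Return 'M', 'P', or 'N' for one arriving cell."""
        if self.p_tokens == 0:      # first bucket empty: non-conforming cell
            return "N"
        self.p_tokens -= 1
        if self.m_tokens > 0:       # second bucket has a token: mean-rate cell
            self.m_tokens -= 1
            return "M"
        return "P"                  # within peak rate but above mean rate


class StreamMultiplexer:
    """J CIUs feeding an M buffer (high priority) and a P buffer (low priority)."""

    def __init__(self, budgets):
        self.cius = [CellInputUnit(bp, bm) for bp, bm in budgets]
        self.m_buffer = deque()
        self.p_buffer = deque()

    def arrive(self, source, cell):
        state = self.cius[source].classify()
        if state == "M":
            self.m_buffer.append(cell)
        elif state == "P":
            self.p_buffer.append(cell)
        return state                # 'N' cells are dropped in this sketch

    def transmit(self):
        """Send one cell, always serving the M buffer before the P buffer."""
        if self.m_buffer:
            return self.m_buffer.popleft()
        if self.p_buffer:
            return self.p_buffer.popleft()
        return None


mux = StreamMultiplexer([(4, 2), (4, 2)])   # (BP_i, BM_i) per source
states = [mux.arrive(0, f"a{k}") for k in range(6)]
states.append(mux.arrive(1, "b0"))
sent = [mux.transmit() for _ in range(6)]
```

In this run the first two cells of source 0 are classified as M, the next two as P, and the remaining two as N; the single cell of source 1 is classified as M and is therefore transmitted ahead of the P-state cells of source 0.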
Hardware Complexity
The complexity of the stream multiplexer is acceptable since each CIU requires only a few AND gates and registers. Given that the output of the multiplexer uses the 155 Mbps OC-3 physical standard, we need at most 100 CIUs for serving MPEG-1 video streams. Thus, the total gate count is quite reasonable.
Admission Control
Admission control is necessary for the stream multiplexer to produce a guaranteed quality of service. Too many users in the system would result in severe delay and jitter, which make the video quality unacceptable. Hence, the maximum allowable number of users must be controlled. We derive the control method by using a very simplified queuing model. Assume that there is a temporary queue at the input $I_k$ of the CIU; the queue length is then used as an indication of the video quality. This is an approximation as well as an assumption; however, a larger queue length generally produces a longer delay and a larger delay variance. The queue length is formulated as follows. Let $\rho$ denote the input utilization factor of a video stream and $a_o$ denote the output transmission rate of the stream multiplexer; then we have

$$\rho = \frac{a}{a_o}.$$

When there are $n$ video streams, the server in the M/D/1 queuing system is assumed to be fairly shared by these $n$ streams. Hence, the queuing system seen by each stream becomes a single-server queuing system with a deterministic service rate of $1/n$. The mean queue length $\bar{q}_n$ for a specific queue when there are $n$ video streams in the system is approximated by the queue length of an M/D/1 system with a service rate of 1 and an input utilization factor of $n\rho$.
Given an acceptable queue length, we can calculate the maximum allowable number of streams in the multiplexer. Assuming that $a = 3$ Mbps and $a_o = 155$ Mbps, we plot the curve of $\bar{q}_n$ with respect to $n$, as shown in Figure 10.
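The admission rule can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the expression used to produce Figure 10: it applies the textbook M/D/1 mean number of waiting customers, $L^2 / (2(1 - L))$, at load $L = n\rho$, and the source's own queue-length expression may differ (for example by also counting the cell in service). The function names and the queue-length limit of 10 cells are hypothetical.

```python
def stream_utilization(stream_bps, output_bps):
    """Input utilization factor of one stream on the output link."""
    return stream_bps / output_bps


def md1_mean_waiting(load):
    """Textbook M/D/1 mean number waiting in queue for a load in [0, 1)."""
    if not 0 <= load < 1:
        raise ValueError("load must satisfy 0 <= load < 1")
    return load * load / (2 * (1 - load))


def mean_queue_length(n, rho):
    """Approximate per-stream queue length with n streams sharing the link."""
    return md1_mean_waiting(n * rho)


def max_admissible_streams(rho, queue_limit):
    """Largest n whose approximate mean queue length stays within the limit."""
    n = 0
    while (n + 1) * rho < 1 and mean_queue_length(n + 1, rho) <= queue_limit:
        n += 1
    return n


rho = stream_utilization(3e6, 155e6)            # a = 3 Mbps, a_o = 155 Mbps
n_limit = max_admissible_streams(rho, queue_limit=10)   # hypothetical limit
curve = [mean_queue_length(n, rho) for n in range(1, 52)]
```

The queue length grows slowly for small n and rises sharply as the load approaches 1, which is why a modest reduction in the admitted number of streams buys a large reduction in delay.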
Performance Evaluation
Traffic Modeling
The approximation in the admission control method is optimistic since the assumed video source is not bursty. To evaluate the efficiency of the proposed stream router accurately, a source traffic model is given for our simulation. The VBR source is modeled by an ON-OFF source model which consists of two states, namely the ON state and the OFF state [6]. The source generates cells in the ON state and keeps silent in the OFF state. The VBR source is thus characterized by alternating independent ON (burst) and OFF (silent) periods. The duration of the ON or OFF period can be characterized by any general distribution function. We assume that the duration of the ON period between two OFF periods is determined by an integer-valued random variable $t_{on}$ (in number of time slots) which is exponentially distributed.
Similarly, the duration of the OFF period between two ON periods is determined by an integer-valued random variable $t_{off}$ (in number of time slots) which is also exponentially distributed. Therefore, the expectation of the period duration is denoted as $E[t_{on}]$ for the ON period and $E[t_{off}]$ for the OFF period. Assume that the cell arrival probability in the ON state is determined by a Poisson random variable $X$ with parameter $\lambda$, where $E[X] = \lambda$.
For CBR sources, the cell arrival probability in each time slot is determined by a Poisson random variable $X_c$ with parameter $\lambda_c$, where $E[X_c] = \lambda_c$. Then we have
Simulation Analysis
The performance of the proposed multiplexer is measured by the maximum queue length that occurs in the temporary buffer of each CIU input $I_i$. A larger queue length indicates that the traffic source is skewed by a larger delay variance and that each cell experiences a longer average delay before being multiplexed into the outgoing link. A larger buffer space is also required for the corresponding VC because of the large queue length. We compare the proposed scheme with the traditional round-robin method using a simulation program implemented on a Sun SPARC-20 workstation in the C programming language. The round-robin method simply reserves a chance for each source to transmit its cell regardless of whether it has cells to send. Assume that the video sources are variable bit rate sources with a peak rate of 6 Mbps and a mean rate of 3 Mbps. Assume that the output link is a 155 Mbps OC-3 interface and the cycle interval is 250 µs. Then there are about 93 cells within 250 µs. We also obtain $BP_i = 4$ and $BM_i = 2$ for each video source under the above assumptions.
To model a bursty traffic source, we let $E[t_{on}] = 15$, $E[t_{off}] = 15$, and $\lambda = 0.043$. The maximum queue length is observed for different numbers of video sources to be multiplexed. The simulation result for the maximum queue length versus the number of video sources is shown in Figure 11. The result for the round-robin method is also shown. The result indicates that the proposed method is able to multiplex more video sources under the same maximum queue length. This implies that the proposed method obtains a lower delay variance for each video source under the guaranteed throughput.
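The ON-OFF source with the parameters above can be sketched as a small generator of per-slot cell arrivals. This is not the authors' C simulation program, only a minimal illustration under stated assumptions: period lengths are drawn from a geometric distribution (used here as the integer-valued analogue of the exponential distribution), at most one cell arrives per slot with probability 0.043 during ON periods, the source starts in the ON state, and the function names and random seed are hypothetical.

```python
import math
import random


def geometric(rng, mean):
    """Integer duration >= 1 with the given mean (geometric distribution)."""
    p = 1.0 / mean
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()                      # uniform on (0, 1]
    return max(1, math.ceil(math.log(u) / math.log(1.0 - p)))


def on_off_source(rng, mean_on, mean_off, arrival_prob, slots):
    """Return a list of 0/1 cell arrivals, one entry per time slot."""
    arrivals = []
    on = True                                   # assumption: start in ON state
    while len(arrivals) < slots:
        duration = geometric(rng, mean_on if on else mean_off)
        for _ in range(duration):
            if len(arrivals) == slots:
                break
            cell = 1 if on and rng.random() < arrival_prob else 0
            arrivals.append(cell)
        on = not on                             # alternate ON and OFF periods
    return arrivals


rng = random.Random(1)                          # arbitrary seed for repeatability
trace = on_off_source(rng, mean_on=15, mean_off=15,
                      arrival_prob=0.043, slots=200_000)
mean_rate = sum(trace) / len(trace)             # cells per slot, long-run average
```

With equal mean ON and OFF durations the source is active about half of the time, so the long-run arrival rate should be close to half of the ON-state arrival probability, which matches a mean rate equal to half of the peak rate.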
Conclusion
In this paper, an ATM-based video server with a scalable architecture is proposed. The pumping capability of a single storage device is analyzed and found to be limited to at most 40 MPEG-1 streams. Hence, ATM switching technology is applied and a stream router is designed for the scalability of the server. Previously, efforts on designing video servers were mostly focused on a centralized approach which used high-performance computers, real-time operating systems, or
