Abstract
Introduction
The next generation of workstations will be capable of processing continuous media (real-time digital audio and video) as effectively as conventional discrete media (text, still image and numerical data). Continuous media (CM) processing has special requirements, among which is a huge demand for bandwidth, processing power, and storage. It has become clear that conventional architectures cannot meet these requirements, mainly because of their centralized orga-linked by a high speed multimedia interconnect. A broadband network (e.g. ATM Network) can be accessed via one module functioning as a network interface or, if necessary, via a hardware packet multiplexor/demultiplexor which provides direct connectivity of modules to the network. This paper focuses on the multimedia interconnect and its bandwidth management. Ideally, a multimedia interconnect should satisfy several requirements:
1. It should provide the bandwidth necessary for CM and random traffic which coexist in the system.
2. It must provide adequately fast response to random transactions normally associated with virtual memory page faults or cache misses.
3. It must provide a constant throughput, which is unaffected by the time variations of random traffic, for each continuous stream. Desirably this should be in the form of backlog avoidance in the sense that the transfer of each block of CM data must be completed before the next block arrives (e.g. a video frame should be transferred within a frame time).
4. The bandwidth management and access arbitration scheme should be implementable in fast hardware.
5. The interconnect must be simple enough to be reliable and economical.
0-8186-5530-5/94 $3.00 © 1994 IEEE
Designing an interconnect that meets the above requirements is a nontrivial task. Currently there is no interconnect that fulfills these specifications. Most backplane buses currently in use (NuBus, SBus, etc.) fail the first requirement. They cannot provide bandwidth of more than 100 MB/s [1, 5], whereas a single stream of full-motion video with image resolution of 1024 x 1280 pixels, 24-bit color and 30 frames/sec requires about 120 MB/s. Image compression schemes (e.g. JPEG, MPEG) can reduce the rate by one or two orders of magnitude to alleviate remote transfer and storage difficulties. However, the transfer of uncompressed video within the computer is often necessary. Using high speed interconnects, such as Apple's QuickRing and Sun's XDbus with 350 MB/s and 320 MB/s respectively, is not a complete solution. Fulfilling the rest of the requirements outlined above demands innovative bandwidth management schemes which are absent from today's interconnects.
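The figure of about 120 MB/s quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name `raw_video_rate` is not from the paper, and one megabyte is taken as 10^6 bytes.

```python
def raw_video_rate(width, height, bits_per_pixel, frames_per_sec):
    """Return the data rate of uncompressed video in bytes per second."""
    bytes_per_pixel = bits_per_pixel // 8
    return width * height * bytes_per_pixel * frames_per_sec


# Parameters quoted in the text: 1024 x 1280 pixels, 24-bit color, 30 frames/sec.
rate = raw_video_rate(1280, 1024, 24, 30)
print(f"{rate / 1e6:.1f} MB/s")  # a little under 120 MB/s, above the 100 MB/s bus limit
```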
There are different candidate topologies for multimedia interconnect. This paper focuses on the bus topology. While the bus topology has certain drawbacks that make it inappropriate for large-scale applications, it offers advantages that make it a viable solution for small-scale purposes such as a workstation. A bus cannot support a large number of bus masters, both due to capacitive loading effects that limit its maximum operating frequency and due to the bus being a shared medium whose throughput as seen by each module drops with the number of bus masters. For small-scale applications such as multimedia workstations, however, a bus offers acceptable throughput and speed at the lowest cost.
The rest of this paper is organized as follows: Section 2 describes the underlying assumptions of this work. In Section 3 the proposed scheme is presented in detail, an implementation method is discussed, and an example of a multimedia bus and traffic is described. Section 4 is devoted to the presentation of simulation results which compare this scheme with alternative schemes.
Assumptions
In this paper a packet-switched bus which can support split transactions [8] is considered. In a split transaction bus, a read operation is not atomic and is divided into a request and a subsequent reply operation. Between a request and its corresponding reply, other bus transactions are allowed to occur, resulting in improved bus utilization. It is assumed that all packets, hereafter called cells, are of identical size. A cell has a header containing the destination address and other identification information, as well as a payload part carrying data. Normally in a read operation, the request is sent in a single cell, while the reply may consist of many cells. Transmission of a cell is atomic and takes one time slot of the bus. During each time slot, the bus master for the next time slot is determined by a hardware bus arbitration unit, or simply bus arbiter. Thus arbitration may fully overlap transmission, making back-to-back cell transmission possible.
Figure 2: Bus Interface Unit
Each hardware module may generate two types of cell traffic: random and periodic. Periodic traffic occurs because of CM transactions (real-time video processing, playback, etc.) between different modules and is relatively prolonged in time. Periods are normally much longer than the bus time slots. Every module is connected to the bus via a bus interface unit (BIU). Each BIU has some buffer space to hold a number of cells. (In fact each BIU has two buffer areas for incoming and outgoing cells, but as far as contention for the bus is concerned, only the outgoing buffer is considered). Once a BIU gains access to the bus, it transmits one cell, then releases the bus and in the meantime participates in contention with other BIUs for another time slot of the bus (if its buffer is nonempty). The arbitration unit determines which BIU is the bus master in any time slot. In the next section a more detailed description of the arbitration unit is provided.
Proposed scheme
As far as the bus bandwidth management is concerned, each continuous stream generated by a continuous media application is parameterized by two numbers (B, T), where B is the maximum number of cells of data requested to be sent over the interconnect in a period T. The CM application may choose to make B equal to or slightly less than the maximum amount of data actually generated by the compression engine. The latter choice implies the sacrifice of some unessential information. In the proposed scheme, prior to the start of a real-time application, (B, T) are given by the application software to a bus manager. The bus manager is a software daemon, running on one of the processors, whose job is to admit or reject a new real-time session based on the currently admitted load and to set up the hardware registers in the bus arbiter accordingly. Details are discussed in the following sections.
Service policy
The bus arbiter periodically repeats service cycles that are N time slots in duration and, during every service cycle, executes a service algorithm, which gives priority to CM cells or random cells in a dynamic fashion. The key idea of the service algorithm, hereafter called the cyclic dynamic priority policy, is to serve CM traffic in the background (low priority) as long as a backlog of CM cells does not occur. This results in backlog avoidance, minimal delay of random cells and efficient utilization of the bus bandwidth.
The service cycle N is a system parameter, typically in the range of 10 to 100, which is selected to optimize performance. Another key idea is to transmit the cells of each CM stream "almost" evenly spread in time so that the flow of cells of different streams can be mixed in an almost uniform but random way. In order for each continuous stream j to achieve the goal of transmitting Bj cells every Tj time slots, a goal is set to transmit Mj (< Bj) cells every N (< Tj) time slots (Mj is defined later). If a backlog avoidance condition, which will be discussed later, is satisfied, all CM streams will fulfil backlog avoidance. In the meantime, in any time slot, random cells are served with a high priority unless a condition which implies the onset of a backlog occurs. Also, in any time slot unused by a random cell, a CM cell is served, resulting in increased utilization of the bus bandwidth and further avoidance of a backlog. This is revealed by a careful consideration of the service algorithm presented below.
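Suppose that there are J modules connected to the bus and that there are Ki active continuous streams in module i. With Mij denoting the number of cells to be sent in each service cycle for stream j of module i, let

$Q = \sum_{i=1}^{J} \sum_{j=1}^{K_i} M_{ij}$ (2)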
where Q is the number of time slots in a service cycle reserved for all CM streams and is updated when a CM session is admitted or terminated. Admission or rejection is done by the bus manager using this number and an admission rule that is presented later. Every time Q is updated, it is written into a special register in the bus arbiter. The bus arbiter, mainly composed of two registers N and Q (which hold N and Q respectively) and two counters n and q, continually performs the following service algorithm:
procedure service(N, Q)
forever do
n ← N;

This algorithm leaves some freedom to the designer to set up priorities among different modules. It is also implicitly assumed that the processing modules are cooperative and do not exceed the limits assigned by the bus manager.
It is seen from the algorithm that a service cycle time is logically divided into two subintervals (assuming Q ≤ N, a condition that is enforced by the admission rule). The first subinterval corresponds to q < n. In this subinterval random traffic is given a high priority. If all random queues (RQs) are empty, the bus is granted to CM traffic. Note that in any case n is decremented by one in any time slot. The second subinterval corresponds to q = n. (Note that q > n should not logically occur unless due to a fault.) In this subinterval CM traffic gains priority over random traffic, and if there is no CM cell in any periodic queue, random cells are served. This case may occur due to small random variations of CM traffic. It is interesting to note that the boundary between the two subintervals is dynamically adjusted toward the end of the cycle time as CM cells are served in the first subinterval. For moderate amounts of traffic load, CM traffic is almost transparent to random traffic (almost totally served in the background). Therefore random cells suffer delay due only to the random traffic in the total load. This is a great advantage of this scheme over the schemes that give continuous streams priority over random traffic. Simulation results show this benefit clearly. Another advantage is due to the fact that the CM performance is completely "unaffected" by random traffic because a sufficient number of time slots are reserved for CM traffic in every cycle. It is shown later that a mild condition on the parameters of continuous streams results in perfect backlog avoidance. Figure 3 shows an example of a completely busy service cycle (top) and a partially busy cycle (bottom), both for the case N = 12 and Q = 6.
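The two-subinterval behaviour described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' listing: all random queues are lumped into one deque `rq` and all periodic queues into one deque `pq`, so priorities among modules are ignored, and the names `service_cycle`, `rq` and `pq` are illustrative. It also assumes that q is decremented whenever a CM cell is served in the first subinterval and in every slot of the second subinterval, which is what keeps q from exceeding n.

```python
from collections import deque


def service_cycle(N, Q, rq, pq):
    """Run one service cycle of N slots and return the cell granted in each slot.

    rq, pq: deques of pending random and CM (periodic) cells.
    A slot in which nothing is pending is recorded as None.
    """
    n, q = N, Q
    grants = []
    while n > 0:
        if q < n:
            # First subinterval: random traffic has priority, CM fills unused slots.
            if rq:
                grants.append(rq.popleft())
            elif pq:
                grants.append(pq.popleft())
                q = max(q - 1, 0)  # one reserved CM slot has been used early
            else:
                grants.append(None)
        else:
            # Second subinterval (q == n): CM traffic has priority.
            if pq:
                grants.append(pq.popleft())
            elif rq:
                grants.append(rq.popleft())
            else:
                grants.append(None)
            q -= 1  # the reserved slot is consumed whichever cell was sent
        n -= 1  # n is decremented in every time slot
    return grants


# Completely busy cycle with N = 12 and Q = 6: random cells first, then CM cells.
busy = service_cycle(12, 6, deque(["R"] * 12), deque(["C"] * 6))
# Partially busy cycle: CM cells are served early in slots left free by random traffic.
partial = service_cycle(12, 6, deque(["R"] * 2), deque(["C"] * 6))
print(busy)
print(partial)
```

In the partially busy case the CM cells move forward into the idle slots, which is the dynamic adjustment of the boundary between the two subintervals mentioned in the text.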
Admission rule
The admission of new continuous media sessions is the job of the bus manager. A request for a new CM session, including the parameters (B_new, T_new), is sent to the bus manager, which computes M_new using (1) and admits the session if the following condition holds:

$Q_{current} + M_{new} \le N - \alpha$ (3)

where Q_current is the total number of time slots in a cycle reserved for currently admitted CM streams using (2) and α denotes the number of time slots reserved for random traffic. α is another system parameter set by the super-user. 1 − α/N is the maximum traffic load allocatable to CM traffic.
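A bus manager applying this rule might look like the sketch below. Two things are assumptions rather than statements from the text: the per-cycle allocation is taken to be M = ceil(B·N/T), which is one plausible reading of (1), and the admission test is taken to be Q_current + M_new ≤ N − α. The names `cells_per_cycle` and `admit` are illustrative.

```python
def cells_per_cycle(B, T, N):
    """Assumed form of (1): smallest integer M with M/N >= B/T."""
    return (B * N + T - 1) // T  # integer ceiling of B*N/T


def admit(B_new, T_new, N, Q_current, alpha):
    """Return (admitted, Q) where Q is the value to load into the arbiter.

    alpha is the number of slots per cycle kept for random traffic.
    """
    M_new = cells_per_cycle(B_new, T_new, N)
    if Q_current + M_new <= N - alpha:
        return True, Q_current + M_new
    return False, Q_current


# With N = 20 and alpha = 4, at most 16 slots per cycle can go to CM traffic.
first = admit(B_new=100, T_new=400, N=20, Q_current=0, alpha=4)
second = admit(B_new=300, T_new=400, N=20, Q_current=first[1], alpha=4)
print(first, second)
```

The first session needs 5 slots per cycle and is admitted; the second would need 15 more, exceeding the 16 available, so it is rejected and Q is left unchanged.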
Backlog avoidance condition
Backlog avoidance for a CM source with demand (B, T) can be achieved if the source has a chance to send at least B cells in every period T. (The index i is dropped to denote the generality of the statement.) Thus loss-free transmission of a continuous stream is possible with no more than 2B cells worth of buffer space (assuming a double-buffered scheme). Backlog avoidance is ensured if

$\lfloor T/N \rfloor \, M \ge B$ (4)

where N is the bus service cycle and M is found using (1). When N << T, condition (4) is not a big restriction because it is very close to condition (1), which is already satisfied. Another way to have backlog avoidance is to choose T and N such that T is an integral multiple of N and to have CM source periods be synchronized with the bus service cycles. This requirement is not difficult to meet in practice. For example, if a bus service cycle is 10 μs, the continuous media application should choose its transmission period to be a multiple of 10 μs. If the backlog avoidance condition is ignored, then there is a chance that at most 2M cells occasionally miss their deadlines (due to the possibility that at most two service cycles would not be fully overlapped by a period T). This may still be a reasonable performance in some applications because cell loss is upper-bounded by a known value.
Choice of parameters
The scheme has two parameters N and T that should be selected. N is a system parameter under control of the super-user, and T is an application parameter selected by a user. For best results, N should be much smaller than the smallest T. For a fixed random and CM load, the increase of the cycle time N gives more room for the deferment of CM cells in favor of random cells and helps CM traffic be served in the background. This results in a reduction of the delay of random traffic. As an extreme case let N be infinitely large. This implies that all CM cells may be delayed forever; consequently random cells do not notice the CM traffic. (Obviously it is not possible to let N be arbitrarily large because it would conflict with the deadlines of CM traffic.) Also, increasing N implies the increase of the Ms, and as a result the increase of the required buffer space in the BIUs. Therefore there is a trade-off between the delay of random cells and the BIU buffer space. Another implication of a large N is the difficulty of fulfilling the backlog avoidance condition.
The choice of T has similar effects. For example, for a given random and CM load, suppose that the Ts can be doubled. N can also be doubled as a consequence, resulting in smaller delay for random cells. On the other hand, in order to keep the CM loads fixed, the Bs must be doubled, implying an increase in the buffer requirement in the BIUs. The same trade-off arises again.
B may be a parameter of choice as well. In the case of uncompressed video/audio, a fixed number of cells p is generated every frame time. For best quality B is set equal to p. For a low resolution video, a user may set B equal to some fraction of p.
Implementation
The service algorithm can be realized with simple hardware which fulfills the fifth and sixth requirements mentioned in Section 1. Every module is connected to the bus via a BIU. The transmit section of a BIU is shown in Figure 4. Each BIU should have two FIFO queues, RQ (random queue) and PQ (periodic queue).
Figure 4: Transmit Section of a BIU
The bus arbiter grants the bus to only one queue in one BIU during any time slot based on the service algorithm. Two signal lines are associated with each queue. BRq (bus request) is an outgoing signal which is asserted (low) when the queue is nonempty. BGR (bus grant) is an incoming signal from the arbiter and is asserted (low) when the bus is granted. When BGR is asserted for a queue, the BIU transmits one cell out of that queue in the next time slot. BRq and BGR are prefixed with R (or P) to denote their association with an RQ (or PQ). The bus arbiter has two registers N and Q loaded by the bus manager, as well as two counters n and q. At the end of every cycle n and q are re-loaded from N and Q respectively. Many implementations, from a fully serial design (based on daisy-chaining) to a fully parallel design (using combinational logic), are possible. A fully serial design is elaborated here. Other designs are not elaborated further.
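Returning to the backlog avoidance condition and the advice to make T a multiple of N, a small numerical check may help. The sketch assumes that the condition reads floor(T/N)·M ≥ B and that M = ceil(B·N/T); both forms are reconstructions rather than quotations, and the function names are illustrative.

```python
def cells_per_cycle(B, T, N):
    """Assumed per-cycle allocation: M = ceil(B * N / T)."""
    return (B * N + T - 1) // T


def avoids_backlog(B, T, N):
    """Check the assumed backlog avoidance condition floor(T / N) * M >= B."""
    M = cells_per_cycle(B, T, N)
    return (T // N) * M >= B


# T is an integral multiple of N: the condition holds.
print(avoids_backlog(B=20, T=100, N=10))  # True
# Same load (B/T = 0.2), but T is not a multiple of N: the condition fails.
print(avoids_backlog(B=21, T=105, N=10))  # False
```

In the second case only 10 full service cycles fit in a period, giving 20 guaranteed cells against a demand of 21, which illustrates why synchronizing T with the service cycle is recommended.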
Application example
A packet bus is described here which serves as both an example of a high performance multimedia bus and a model used in our simulations. The bus, which resembles Sun's XDbus, has a 64-bit data path (not counting control lines) operating at 40 MHz. In a bus clock time (25 ns) 8 bytes of data can be transferred. For efficiency reasons there are two types of packet: request and reply. A request packet is of fixed size of 16 bytes, while a reply packet is 136 bytes (including an 8-byte header). The bus is divided into two separate buses, called A-bus and B-bus. The A-bus is dedicated to reply packets and consists of 68 lines, 64 lines for data and 4 lines for header. Thus the overhead time associated with headers is eliminated at the cost of adding 4 lines. The B-bus is only 8 bits wide and carries only request packets. Therefore it is possible to transfer 128 bytes of useful data (i.e. excluding header and request packets) every 400 ns, which is equivalent to a useful bandwidth of 320 Mbytes/sec. Also note that it takes the same time to transfer a request or a reply packet, allowing the packets to be fully overlapped and concurrent. A useful bandwidth of 320 Mbytes/sec can be considered a good fulfilment of the first requirement of a multimedia interconnect.
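The timing figures in this example can be re-derived from the stated bus parameters. The short calculation below uses only numbers given in the text; the constant names are illustrative.

```python
CLOCK_NS = 25               # one bus clock at 40 MHz
A_BUS_DATA_BYTES = 8        # 64 data lines carry 8 bytes per clock
A_BUS_HEADER_LINES = 4      # 4 extra lines carry the header in parallel
B_BUS_BYTES = 1             # the B-bus is 8 bits wide
REPLY_PAYLOAD_BYTES = 128   # 136-byte reply packet minus its 8-byte header
REPLY_HEADER_BYTES = 8
REQUEST_BYTES = 16

# Clocks and time needed to move one reply payload over the A-bus data lines.
slot_clocks = REPLY_PAYLOAD_BYTES // A_BUS_DATA_BYTES
slot_ns = slot_clocks * CLOCK_NS

# Header bits that fit on the 4 header lines during the same slot.
header_bits = A_BUS_HEADER_LINES * slot_clocks

# Time needed to move one request packet over the B-bus.
request_ns = (REQUEST_BYTES // B_BUS_BYTES) * CLOCK_NS

# Useful bandwidth in Mbytes/sec (bytes per ns times 1000).
useful_mbytes_per_sec = REPLY_PAYLOAD_BYTES * 1000 // slot_ns

print(slot_ns, request_ns, header_bits, useful_mbytes_per_sec)
```

The header lines carry exactly the 8-byte header during one slot, and a request takes as long as a reply, which is why the two packet types can be fully overlapped.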
The arbitration for the B-bus needs conventional schemes because request packets are associated with random traffic and transferred out of band. CM traffic does not exist on the B-bus because this type of traffic is reservation-oriented. However the A-bus carries a mix of random and CM traffic, and its arbitration needs to use the scheme proposed in this paper.
n ← n − 1;

This logic, which can be realized in simple hardware, works properly if time slots are longer than two round-trip times for a grant signal through the chain. This sets a lower bound on the duration of a time slot for a given number of modules, or equivalently an upper bound on the number of modules for a given time slot duration.

Example Traffic: Suppose, as an example, that there are 10 modules connected to the above multimedia bus. In this example, 44% of the bandwidth is used up by continuous media, and the rest is available for random traffic. Random traffic mainly arises due to virtual memory page faults, shared memory transactions, data and control messages passed between modules, and various non-real-time network activities such as image retrieval, database transactions, and multimedia e-mail.
Simulation results
The proposed scheme was compared, using simulation, with two alternative approaches.
The first approach is based on the assumption that the support of continuous media in a computer requires nothing more than a high bandwidth bus. (It is believed that multimedia can be supported with a little over-engineering.) This approach, referred to later as the indiscriminate policy, treats all cells equally. The buffer space in every BIU is arranged as a single FIFO queue where CM cells and random cells generated in each module are queued and served in order of arrival. The contention among different BIUs is resolved by a conventional arbitration scheme (a daisy-chain or a distributed contention resolution scheme [S] ). This scheme has the advantage of simplicity, but it suffers from drawbacks with respect to the third and fourth requirements outlined in Section 1. First, there is no guarantee that CM cells are transferred within their time frames. Second, there is no control on the ratio of lost (overly delayed) to delivered CM cells because random traffic is under no constraint (the normal situation on a conventional bus). Random momentary bus overloads result in unacceptable CM performance (freeze of movie, picture noise, etc.). Also the delay of random packets is generally higher than the proposed scheme because all cells are served in a FIFO order. Simulation shows that over-engineering is a very expensive way to get a performance comparable to the proposed scheme.
The second approach, hereafter called the dispersion policy, is based on the belief that CM traffic should be given priority over non-CM traffic. This approach makes CM performance independent of non-CM traffic but results in two drawbacks. First, CM streams can disturb each other, resulting in overly delayed (lost) CM cells. Second, random cells are at a disadvantage with respect to CM cells; this translates into poor delay performance for random transactions. In our simulations, this policy is implemented as follows: Each BIU has two FIFO buffers, RQ and PQ. A daisy-chain priority is used to resolve the contention of PQs for the bus. If all PQs are empty, the RQs containing random cells are served. A second daisy-chain priority is used for RQs.
Random traffic in our simulations is generated using a Bernoulli trial model in which module j generates in any time slot a random cell with probability pj. This model is not realistic, but it serves well as a common reference input to different schemes so that performance comparison is possible. It is reasonable to expect the numerical values of simulation results to change with the adoption of different traffic models, but the conclusions will remain the same (based on priority arguments).
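A Bernoulli source of this kind is simple to reproduce. The sketch below is one possible way to generate such arrivals, not the simulator used for the reported results; the function name `bernoulli_arrivals` and the fixed seed are illustrative choices made so that runs are repeatable.

```python
import random


def bernoulli_arrivals(p, num_slots, seed=0):
    """Return a list of booleans, one per time slot.

    True means the module generates a random cell in that slot,
    which happens independently with probability p.
    """
    rng = random.Random(seed)  # private generator, so runs are repeatable
    return [rng.random() < p for _ in range(num_slots)]


arrivals = bernoulli_arrivals(p=0.3, num_slots=10_000)
offered_load = sum(arrivals) / len(arrivals)
print(f"offered load per slot: {offered_load:.3f}")
```

Feeding the same arrival sequence to each policy under comparison gives the common reference input that the text calls for.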
CM traffic is chosen to be deterministic and close to the scenario described in the previous example. The value of CM load at 0.5 . It is seen that the delay is lowest for the cyclic dynamic priority policy especially when random traffic is high. Figure 8 shows the mean delay of random cells in the presence and absence of CM traffic for the proposed scheme. In the absence of CM traffic, the mean delay is expectedly a constant independent of N . When CM traffic is present, as N is increased, the mean delay of random cells asymptotically approaches the value of the mean delay in the absence of CM traffic. This shows clearly that CM traffic is almost totally unnoticed by random traffic. 
Conclusion
This paper presented the requirements of a multimedia interconnect. It also proposed a bandwidth management and arbitration scheme for a packet-switched bus that fulfills all those requirements. In particular, this scheme allows a mix of continuous streams and random transactions to coexist on a bus. It offers the desirable feature that no packet of CM data misses its deadline and that random packets are served with a high priority most of the time. The implementation guidelines also showed that this scheme is practical.
