Abstract-A good crossbar switch scheduler should be able to sustain full bandwidth and maintain fairness among competing flows. A pure input-queued (IQ) non-buffered switch requires an impractically complex scheduler to achieve this goal. Common solutions are to use crossbar speedup and/or buffered crossbar.
I. INTRODUCTION
Input-queued (IQ) crossbar switch scheduling has been studied for decades. However, challenges remain, partly because network bandwidth is increasing rapidly and partly because crossbar scheduling is intrinsically difficult.
The problem of IQ crossbar scheduling can be formalized as the classical graph-theory problem of maximum weight matching on a bipartite graph, where nodes represent input and output ports and edges represent packets to be switched. The maximum weight matching (MWM) algorithm (see footnote 1) has been proved to achieve 100% throughput [1] but is too complex for fast hardware implementation.
For an algorithm to be practical, it must be fast. For example, with a 10-Gbps (approximately OC-192) line card speed and a 64-byte packet size, a scheduling decision must be made within 51.2 ns. Heuristic algorithms such as iSLIP [2], iFS [3] and iDRR [4] meet the time constraint but fail to provide 100% throughput for admissible (see footnote 2) non-uniform traffic. One approach is to apply moderate crossbar speedup (the ratio of the crossbar speed to the line card speed). An exciting result is that any maximal matching algorithm with a speedup of 2 can support 100% throughput [5]. However, the downside of this approach is that doubling the crossbar speed requires the memory speed to be doubled and the scheduling time to be halved.
Recently, with the advance of very large scale integration (VLSI) technology, integrated circuit density has increased dramatically. Current technology allows hundreds of millions of transistors, and hence a large amount of memory, to be integrated into a single chip. This makes the buffered crossbar (a small buffer resides at each crosspoint) a very promising solution. In [6], Yoshigoe et al. show a field programmable gate array (FPGA) based design of a 24×24 10-Gbps buffered crossbar switch.
Footnote 1: The MWM algorithm assigns each VOQ_ij a weight $w_{ij}$ and finds a matching M that maximizes $\sum_{(i,j) \in M} w_{ij}$. The weight can be the queue length, the waiting time, or another metric.
Footnote 2: Admissibility will be defined subsequently in section IV.
This research is supported by NSF grant CCR 0311437. This work was done when Xiao Zhang was at the Department of Computer Science and Engineering, University of California, Riverside.
A big advantage of a buffered crossbar is the simplification of the scheduling algorithm. The crosspoint buffers separate the input contentions from the output contentions so that each input and output arbiter can work independently.
Early studies demonstrate by simulation that a buffered crossbar switch provides better throughput than a non-buffered crossbar switch with much simpler schedulers, such as oldest cell first (OCF)-OCF [7] and round-robin (RR)-RR [8]. Later, longest queue first (LQF)-RR [9] was proved to achieve 100% throughput for uniform admissible traffic. More recently, shortest crosspoint buffer first (SCBF) [10] was also proved to support 100% throughput for any admissible traffic. Unfortunately, these algorithms fail to provide fairness. On the other hand, by applying a packet fair queuing (PFQ) algorithm [11]-[14] at each input and output, PFQ-PFQ has been shown to provide fairness among competing flows [15]; but as we will show later, it fails to sustain full bandwidth under admissible non-uniform traffic.
To provide both 100% throughput and quality of service (QoS), researchers again resort to speedup. Magill et al. show how to emulate an OQ switch with a speedup of 2 [16]. A more recent work by Chuang et al. further describes a set of scheduling algorithms that provide throughput, rate and delay guarantees with a speedup of 2 or 3 [17].
One open question is: can we achieve both 100% throughput and fairness in a buffered crossbar without speedup? Our answer is both yes and no. It is no because it is impossible to achieve both 100% throughput and strict fairness at the same time (even in the case of admissible traffic!) due to the coexistence of input and output contentions. On the other hand, the answer is yes because if we relax the strict fairness criterion to a dynamic max-min fairness, fairness implies 100% throughput for admissible traffic. Surprisingly enough, in a buffered crossbar switch, this fairness can be achieved with a very simple modification to an existing algorithm: a PFQ-PFQ scheme with dynamic weights based on both queue lengths and assigned weights. We name this algorithm adaptive max-min fair scheduling (AMFS), provide a detailed description, and present analysis and simulation results.
0743-166X/07/$25.00 ©2007 IEEE
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE INFOCOM 2007 proceedings.
The rest of the paper is organized as follows. In section II, we briefly overview the switch model used in this paper. In section III, we discuss the relationship between fairness and throughput in the context of crossbar scheduling without speedup. In sections IV, V, VI and VII, we describe our main algorithms, provide the max-min fairness definition, and present the throughput and fairness analysis. In section VIII, we show simulation results to verify our scheme and compare it with existing schemes. In section IX, we briefly discuss the hardware implementation of the AMFS algorithm. Finally, in section X, we present our conclusions.
II. BACKGROUND: SWITCH MODEL
Fig. 1 shows a high-level diagram of an input-queued non-buffered crossbar switch. The crossbar operates on fixed-size packets (called cells) at the same speed as the line cards. Time is divided into time slots, and it takes one slot to transfer one cell. Variable-size packets are segmented at inputs and reassembled at outputs. To avoid head-of-line (HOL) blocking [18], virtual output queuing (VOQ) [19] is used, where a logically separate FIFO queue is maintained for each input-output pair. Because of both input and output contentions, a crossbar scheduler is necessary to decide which cells are transferred across the crossbar in the next slot.
In a buffered crossbar switch, a small buffer, called a crosspoint buffer (CB), is placed at each crosspoint, as shown in Fig. 2. From the queuing point of view, this switch architecture is also called a combined input crosspoint queued (CICQ) switch. Note that if the crosspoint buffer size is infinite or very large, CICQ is equivalent to output queuing, and input arbiters are not necessary since packets can be stored directly in the crosspoint buffer upon arrival. To make a single-chip implementation feasible, the crosspoint buffer has to be limited, which imposes a challenge on scheduling.
A great benefit of a buffered crossbar is that scheduling becomes much simpler. Instead of considering inputs and outputs at the same time, a buffered crossbar allows input and output arbiters to work independently. In this paper, we address the problem of how to achieve throughput and fairness in a buffered crossbar switch without speedup. In the next section, we first examine the problem and present the motivation behind this work.
III. MOTIVATION: THROUGHPUT VS. FAIRNESS
With appropriate crossbar speedup and an appropriate scheduler, an IQ/CICQ switch can emulate an OQ switch [16], [20], which means that throughput and fairness can be satisfied at the same time. However, this is not the case when there is no speedup. Note that our discussion in this section applies to both buffered and non-buffered crossbars.
Fairness implies equal allocation of resources, but there is little agreement among researchers as to what needs to be equalized. In this paper, we focus on the best-effort traffic which is not delay sensitive. Therefore, by fairness, we mean fair sharing of bandwidth among competing flows.
In general, the goal of achieving fairness conflicts with the goal of maximizing throughput. Consider three backlogged flows $f_{11}$, $f_{12}$ and $f_{22}$ going through a 2×2 crossbar, where $f_{ij}$ is the flow from input i to output j. If we want to maximize the overall crossbar throughput, the only choice is to schedule $f_{11}$ and $f_{22}$, as shown in Fig. 3, so that the throughput is 2. Clearly such scheduling starves $f_{12}$. If we want to treat each flow fairly, we have to schedule each with a per-flow rate of 0.5, where the overall throughput is only 1.5. Even though output 1 is idle when $f_{12}$ is scheduled, the second scheduling strategy avoids starvation and enforces fairness.
Fig. 4 shows the throughput of the three flows as a function of workload using OCF [21]. When the workload is above 0.5, $f_{12}$ receives much less than the other two flows. This case shows that it is necessary to provide overload protection to ensure fairness. Otherwise, a malicious user can easily steal bandwidth by flooding the network.
Notice that in many switch scheduling studies, it is assumed that at most 1 cell arrives at each input in one slot. Under this assumption, an input can never be over-subscribed. For the discussion of non-admissible traffic, we remove this assumption and allow more than 1 cell to arrive at each input in one slot. Thus the aggregate arrival rate at each input can be greater than 1, as shown in Fig. 4. Non-admissible traffic includes input over-subscription as well as output over-subscription. Although allowing input over-subscription sounds unnecessary in a crossbar switch, because the switch can only transfer 1 cell per input per slot, it is still practical. For example, an input can be an aggregate of many links or queues, and the aggregate arrival rate can be much higher than 1 cell per slot. If only 1 cell were allowed to arrive in one slot, we would have to put another scheduler at the aggregate link to schedule cells from the aggregate link to the VOQs. This scheduler is redundant because the switch scheduler can do the same task.
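The throughput and fairness numbers quoted for the 2×2 example with flows $f_{11}$, $f_{12}$ and $f_{22}$ can be checked by brute force. The following is a minimal Python sketch, not part of the original paper: the helper names (`is_matching`, `feasible_matchings`, `rates`) are hypothetical, and the fair schedule is assumed to be a simple half-and-half time share between the two conflicting matchings.

```python
from itertools import product

# Backlogged flows of the 2x2 example, written as (input, output) pairs.
FLOWS = [(1, 1), (1, 2), (2, 2)]

def is_matching(flows):
    """True if no input and no output is used by more than one flow."""
    ins = [i for i, _ in flows]
    outs = [j for _, j in flows]
    return len(set(ins)) == len(ins) and len(set(outs)) == len(outs)

def feasible_matchings(flows):
    """All non-empty sets of flows that can be served in the same slot."""
    result = []
    for mask in product([0, 1], repeat=len(flows)):
        chosen = tuple(f for f, m in zip(flows, mask) if m)
        if chosen and is_matching(chosen):
            result.append(chosen)
    return result

def rates(schedule):
    """Per-flow rate of a time-shared schedule {matching: fraction of slots}."""
    r = {f: 0.0 for f in FLOWS}
    for matching, fraction in schedule.items():
        for f in matching:
            r[f] += fraction
    return r

# Throughput-maximizing choice: always use the largest matching.
best = max(feasible_matchings(FLOWS), key=len)
max_throughput = rates({best: 1.0})

# Fair choice (assumed): alternate between {f11, f22} and {f12}.
fair = rates({((1, 1), (2, 2)): 0.5, ((1, 2),): 0.5})
```

Under these assumptions the largest matching serves $f_{11}$ and $f_{22}$ and leaves $f_{12}$ with rate 0, while the time-shared schedule gives every flow rate 0.5 at a total throughput of 1.5.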
In the case of admissible traffic, strict fairness can lead to bandwidth under-utilization. To illustrate this problem, let's look at the example shown in [...] This example again shows that achieving both absolute fairness and 100% throughput at the same time is impossible. Therefore, some trade-off must be made. However, unlike the case in Fig. 3, here we prefer throughput to fairness. In this particular case, we can give priority to flow $f_{21}$ by postponing each cell of $f_{11}$ by 2 slots. Although instantaneous fairness is not maintained, this trade-off is justifiable. First, best-effort traffic is not delay sensitive, and throughput is more important than delay. Because the traffic is admissible, giving preference to heavily backlogged queues will not affect the throughput of lightly backlogged queues, and over a relatively long interval fairness is still maintained. Second, although cells of a lightly backlogged flow incur longer delay, the increased delay only affects the initial delay observed by receivers.
Note that fairness does not necessarily mean equal distribution of resources. In many cases, it is justifiable to give more bandwidth to some flows than others. How to assign weights depends on applications. From now on, we assume that each VOQ ij is assigned a weight w ij .
So far, we have not formulated the fairness criterion for an overloaded crossbar. At first glance, it is appealing to use the fairness definition of the well-known GPS system [11], i.e., for any two competing flows i and j that are continuously backlogged in the interval [τ, t),
$$\frac{W_i(\tau, t)}{\phi_i} = \frac{W_j(\tau, t)}{\phi_j},$$
where $W_i(\tau, t)$ is the amount of flow i traffic served in the interval [τ, t) and $\phi_i$ is the weight of flow i. However, in the case of a crossbar, this definition in general does not hold without unnecessarily wasting bandwidth. In fact, in terms of bandwidth distribution, it is more appropriate to represent the crossbar as a network, as shown in Fig. 6.
It is clear from Fig. 6 that there are 2N bottlenecks. Different flows might have different bottlenecks. In this scenario, a natural fair allocation scheme called max-min fairness [22] can be used. The goal of max-min fairness is to achieve fairness among competing flows while not unnecessarily wasting bandwidth, i.e., to maximize the minimum service rate of each flow. Fig. 7 illustrates the max-min bandwidth allocation. We have four equal-weight flows $f_{11}$, $f_{21}$, $f_{31}$ and $f_{12}$. Although each flow has the same weight, $f_{12}$ can have 2/3 of the total bandwidth without affecting other flows' throughput.
To summarize, we need a scheduler that fulfills the following: 1) sustain 100% throughput for admissible traffic, and 2) ensure max-min fairness for non-admissible traffic.
In the next 4 sections, we show how to achieve the above objective in a buffered crossbar employing a very simple scheme that requires no speedup. First in section IV, based on a packet fair queuing (PFQ) algorithm, we present a queue length driven packet fair queuing (QLD-PFQ) algorithm and prove that it provides 100% throughput. Then in section V, we formally define a dynamic max-min fairness criterion for crossbar switches. In section VI, we describe an adaptive maxmin fair scheduling (AMFS) algorithm and show its max-min fairness property. Finally in section VII, we apply AMFS to the case of finite buffers.
IV. QUEUE LENGTH DRIVEN PACKET FAIR QUEUING
In [15] , Stephens and Zhang study a distributed packet fair queuing (D-PFQ) system, where each input and output apply PFQ independently, hence we refer to this scheme as PFQ-PFQ in this paper. It is shown that PFQ-PFQ provides fairness among competing flows. In [23] , PFQ-PFQ is shown to automatically converge to the max-min fair rate allocation under a fully overloaded situation.
However, PFQ-PFQ cannot sustain full bandwidth when traffic is admissible. To see why this is the case, assume all flows have the same weight, then PFQ-PFQ becomes equivalent to RR-RR which is already demonstrated in [8] , [9] to fail to provide 100% throughput under admissible nonuniform traffic. In fact, we've already shown in section III that an algorithm focusing on fairness alone fails to provide 100% throughput because of the conflict between throughput and fairness.
The underlying reason that PFQ-PFQ fails to sustain full bandwidth under admissible traffic is that it does not take queue status into consideration. This motivates us to use the queue length as the weight in the scheduling decision. Formally, let $\phi_{ij}(t)$ be the weight of queue ij (from input i to output j) at time t,
$$\phi_{ij}(t) = x_{ij}(t), \qquad (1)$$
where $x_{ij}(t)$ is the length of queue ij (the combined queue length of VOQ_ij and CB_ij) at time t. We call PFQ-PFQ with the above weight definition queue length driven packet fair queuing (QLD-PFQ).
Now we show that QLD-PFQ achieves 100% throughput. We adopt the notations and definitions introduced in [24]. For an N×N switch, define [...]. Hence, the evolution of the system of queues can be described as: [...]
• Λ: [...]
This definition basically means that in a strongly stable system, the average queue length, and hence the average queuing delay, is bounded. To prove that QLD-PFQ leads to a stable system, we first introduce the following lemma.
Lemma 1: A switch with a scheduling algorithm such that [...] and [...] is a function of $X_t$, and $\hat{X}_t$ is the normalized vector of $X_t$, i.e.,
$$\hat{X}_t = \frac{X_t}{\max_{i,j}\left(\sum_k x_{ik}(t),\ \sum_k x_{kj}(t)\right)}.$$
Proof: refer to Theorems 6, 7 and 8 in [24].
Lemma 1 states that if the departure rate vector $E[D_t]$ is parallel to the queue length vector $X_t$ and longer than $\hat{X}_t$, the system is stable. We show that in a buffered crossbar employing QLD-PFQ, the departure rate indeed satisfies Lemma 1.
Proof: Given a buffered crossbar, consider its corresponding fluid model using the GPS scheduler [11] . Let l ij (t) be the queue level of VOQ ij at time t, b ij (t) be the queue level of CB ij at time t, and x ij (t) = l ij (t) + b ij (t).
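According to equation (1), $\phi_{ij}(t) = x_{ij}(t)$. Therefore, with PFQ, at each input, the service rate $\mu^{in}_{ij}(t)$ for queue ij at time t is
$$\mu^{in}_{ij}(t) = \frac{x_{ij}(t)}{\sum_k x_{ik}(t)}.$$
Similarly, at each output, the service rate $\mu^{out}_{ij}(t)$ for queue ij at time t is
$$\mu^{out}_{ij}(t) = \frac{x_{ij}(t)}{\sum_k x_{kj}(t)}.$$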
Because inputs and outputs are coupled by crosspoint buffers,
where d † ij (t) ∈ R + indicates the extra service due to the fact that GPS automatically converges to max-min rate, and convergence takes finite time depending on the CB size [23] .
Finally, let
Theorem 1: A buffered crossbar switch operating under the QLD-PFQ algorithm is strongly stable.
Proof: straightforward from Lemma 1 and 2.
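As a numerical illustration of the quantities used in the proof above, the following Python sketch computes fluid GPS service shares at each input and each output when the weight of a queue equals its length, and compares them with the normalized queue-length vector of Lemma 1. It is only a sketch under stated assumptions: the function names are hypothetical, each port is assumed to have unit capacity, and taking the smaller of the input and output shares as a conservative per-queue rate is an assumption of this example, not a statement of the paper's Lemma 2.

```python
def gps_shares(x):
    """Fluid GPS shares when the weight of queue (i, j) is its length x[i][j].

    Returns (mu_in, mu_out): the share of input i and of output j that
    queue (i, j) would receive, assuming unit-capacity ports.
    """
    n = len(x)
    row = [sum(x[i]) for i in range(n)]
    col = [sum(x[i][j] for i in range(n)) for j in range(n)]
    mu_in = [[x[i][j] / row[i] if row[i] else 0.0 for j in range(n)]
             for i in range(n)]
    mu_out = [[x[i][j] / col[j] if col[j] else 0.0 for j in range(n)]
              for i in range(n)]
    return mu_in, mu_out

def normalized(x):
    """Queue lengths divided by the largest row or column sum."""
    n = len(x)
    row = [sum(x[i]) for i in range(n)]
    col = [sum(x[i][j] for i in range(n)) for j in range(n)]
    peak = max(row + col)
    return [[x[i][j] / peak if peak else 0.0 for j in range(n)]
            for i in range(n)]

# Hypothetical queue-length snapshot of a 2x2 switch.
x = [[4, 0],
     [2, 2]]
mu_in, mu_out = gps_shares(x)
x_hat = normalized(x)

# Conservative per-queue rate: the smaller of the two shares.
lower = [[min(mu_in[i][j], mu_out[i][j]) for j in range(2)] for i in range(2)]
dominates = all(lower[i][j] >= x_hat[i][j] for i in range(2) for j in range(2))
```

Because every row and column sum is at most the largest one, each entry of `lower` is at least the corresponding entry of `x_hat`, which is the "longer than the normalized vector" condition described after Lemma 1.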
V. MAX-MIN FAIRNESS CRITERION
Unfortunately, QLD-PFQ cannot provide fairness when the switch is overloaded because it does not take into consideration the pre-assigned weight of each queue. To discuss fairness, we first need to define it formally. We introduce the max-min fairness criterion in an N×N crossbar switch based on the general definition in [22]. Let [...]
We also call a feasible allocation R weighted max-min fair when it is impossible to increase $r_{ij}$ without losing feasibility or reducing some $r_{pq}$ satisfying $r_{pq}/w_{pq} \le r_{ij}/w_{ij}$. Given W, Č and Ĉ, it is easy to find the max-min rate matrix R using the water-filling approach [22]. However, the rates calculated this way only represent the maximum throughput each flow gets when all active flows are continuously backlogged. If a flow only uses a portion of its max-min rate, the unused portion should be used by other flows. Naturally, we want this unused bandwidth to be distributed in the max-min fair manner.
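The water-filling computation mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the function name `weighted_max_min`, the default unit port capacities and the tolerance `tol` are assumptions. All active flows are raised together in proportion to their weights until some input or output saturates; flows through a saturated port are then frozen and the process repeats.

```python
def weighted_max_min(w, c_in=None, c_out=None, tol=1e-12):
    """Weighted max-min rates for an n x n crossbar by progressive filling.

    w[i][j] > 0 marks an active (continuously backlogged) flow from input i
    to output j. c_in / c_out are port capacities (assumed 1.0 by default).
    """
    n = len(w)
    res_in = list(c_in) if c_in is not None else [1.0] * n
    res_out = list(c_out) if c_out is not None else [1.0] * n
    r = [[0.0] * n for _ in range(n)]
    active = {(i, j) for i in range(n) for j in range(n) if w[i][j] > 0}
    while active:
        # Total active weight seen by each port.
        w_in = [0.0] * n
        w_out = [0.0] * n
        for i, j in active:
            w_in[i] += w[i][j]
            w_out[j] += w[i][j]
        # Largest common increase before some port runs out of capacity.
        step = min([res_in[i] / w_in[i] for i in range(n) if w_in[i] > 0] +
                   [res_out[j] / w_out[j] for j in range(n) if w_out[j] > 0])
        for i, j in active:
            inc = w[i][j] * step
            r[i][j] += inc
            res_in[i] -= inc
            res_out[j] -= inc
        # Freeze every flow that crosses a saturated port.
        active = {(i, j) for (i, j) in active
                  if res_in[i] > tol and res_out[j] > tol}
    return r

# Four equal-weight flows f11, f21, f31 and f12 (0-based indices below).
w = [[1, 1, 0],
     [1, 0, 0],
     [1, 0, 0]]
r = weighted_max_min(w)
```

For this input the three flows sharing output 1 each receive 1/3, and $f_{12}$ picks up the remaining 2/3 of input 1, matching the allocation described for Fig. 7.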
To [...] Clearly every non-zero entry in $R^{w\lambda}$ falls into one of the following two cases:
• $\lambda_{ij} > r^{w\lambda}_{ij}$: in this case, we call $f_{ij}$ a non-admissible flow.
• $\lambda_{ij} = r^{w\lambda}_{ij}$ [...]
[...] where k ≥ 1, p = i, q = j and $f_{ij}$ is an absolutely admissible flow with respect to $\Psi_{k-1}$. It is easy to get the following facts: [...]
This definition simply states that for admissible flows, the service rate is equal to the workload, and for non-admissible flows, the service rate is the dynamic max-min rate. Note that with the dynamic max-min fairness criterion, this fairness definition implies 100% throughput for admissible traffic. In fact, if $\lambda_{ij} = r$ [...]
VI. THE ADAPTIVE MAX-MIN FAIR SCHEDULING (AMFS) ALGORITHM
Taking the above max-min fairness definition into consideration, we now extend the weight definition in equation (1) as follows:
where T is a large number such that for a given ε > 0, $\lim_{n \to \infty} \Pr\{\|X_n\| > T\} < \varepsilon$ when the switch uses the QLD-PFQ algorithm and is loaded with admissible traffic. This T exists because the switch is strongly stable under QLD-PFQ. In addition, to make sure that $\phi_{ij}$ is non-decreasing as a function of queue length, we also scale all $w_{ij}$ in proportion so that $w_{ij} > 1$, ∀i, j. We call PFQ-PFQ using this weight definition adaptive max-min fair scheduling (AMFS).
When a flow's queue length is below T , the flow is regarded as admissible, and φ ij is in proportion to its queue length. Note, it is possible that a flow is admissible when its queue length grows beyond T . Thanks to the strong stability of QLD-PFQ with infinite queue and the asymptotic construction of T , the probability of queue length greater than T can be asymptotically small.
When a flow's queue length is between T and 2T , the flow is at the boundary between admissible and non-admissible, φ ij is increased as a combination of queue length and assigned weight. The purpose of this region is to provide a transition from admissible to non-admissible traffic so that φ ij can smoothly change from queue length to assigned weight.
When a flow's queue length reaches 2T, the flow is regarded as non-admissible, and $\phi_{ij} = w_{ij}$, the flow's assigned weight. The PFQ-PFQ algorithm will then ensure max-min fairness among competing flows, as discussed before. Later we will see that 2T also serves to protect well-behaved flows from overflow.
[...] (10), where the weight of each queue is proportional to its queue length. According to the asymptotic construction of T, this approximation is accurate with probability 1 − ε. Therefore, from Theorem 1, asymptotically the switch is strongly stable, i.e., $\mu_{ij} = \lambda_{ij} = r$ [...]
[...] This is the fully overloaded case (all queues are continuously backlogged), and corresponds to the third case of $2T \le x_{ij} \le \infty$, where the weight of each queue is its pre-assigned weight. It has been shown in [23] that the bandwidth will be distributed in the max-min fair fashion, i.e., $\mu_{ij} = r$ [...]
VII. THE AMFS ALGORITHM WITH FINITE QUEUE
In the previous discussion, queues are assumed to be infinite, which is impractical. In this section, we apply AMFS to the case of finite buffers. Without loss of generality, we normalize the maximum queue length to 1, set two thresholds α and β (0 < α < β < 1), and define $\phi_{ij}$ as follows: [...]
Clearly, α and β serve as T and 2T in equation (10). Note that α should be set large enough to accommodate reasonable traffic bursts. This implies that the queue capacity should be large enough. This assumption is valid for today's high-performance routers, where each line card can easily contain buffers with capacities of hundreds of megabytes.
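To make the three regions of the finite-buffer weight concrete, here is a small Python sketch of one possible weight function. The exact expression is not reproduced in this text, so the form below is an assumption: the weight is taken as x/α below the threshold α, as the assigned weight w at or above β, and as a linear blend between 1 and w in between. The function name `amfs_weight` is hypothetical; the defaults α = 0.7 and β = 0.8 are the values used in the simulation section.

```python
def amfs_weight(x, w, alpha=0.7, beta=0.8):
    """Illustrative dynamic weight for a queue with normalized length x.

    x     -- queue length normalized to the range [0, 1]
    w     -- assigned weight, scaled so that w > 1
    alpha -- below this level the flow is treated as admissible
    beta  -- at or above this level the flow is treated as non-admissible
    The linear transition between alpha and beta is an assumption.
    """
    if not 0.0 < alpha < beta < 1.0:
        raise ValueError("thresholds must satisfy 0 < alpha < beta < 1")
    if w <= 1.0:
        raise ValueError("assigned weight must be scaled so that w > 1")
    if x < alpha:
        # Admissible region: weight proportional to queue length.
        return x / alpha
    if x < beta:
        # Transition region: blend from queue-length weight to assigned weight.
        frac = (x - alpha) / (beta - alpha)
        return 1.0 + frac * (w - 1.0)
    # Non-admissible region: fall back to the assigned weight.
    return w

# Sample the weight of a flow with assigned weight 3 across queue levels.
levels = [k / 100.0 for k in range(101)]
curve = [amfs_weight(x, 3.0) for x in levels]
```

With the assigned weights scaled above 1, this choice keeps the weight non-decreasing in the queue length, which is the property the weight scaling in section VI is meant to guarantee.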
A. Description of AMFS with finite buffers
In an N×N buffered crossbar switch with finite VOQ buffers, for each input i (1 ≤ i ≤ N):
 1) for each j do
 2)   [...]
      if VOQ_ij is not empty and CB_ij is not full, then
 5)     req_ij ← true
      [...]
 9)   ftime_ij ← stime_ij + 1.0/φ_ij
10)   valid_ij ← true
11) if req_ij = false, ∀j then
12)   input i idles for the next slot
step 2: adjust system virtual finish time.
13) vtime_i ← max(vtime_i, min(stime_ij | req_ij = true))
step 3: select the output with the smallest eligible finish time.
14) j ← arg_j min(ftime_ij | stime_ij ≤ vtime_i and valid_ij = true)
15) valid_ij ← false
step 4: update system virtual time.
16) vtime_i ← vtime_i + 1/Σ_j φ_ij
Note, we replace x ij (combined length of VOQ ij and CB ij ) with l ij (length of VOQ ij ). Since CBs are very small compared to VOQs, this modification won't affect performance. In fact, this change makes input arbiters work-conserving.
An output arbiter is almost the same as an input arbiter, except that it uses x ij instead of l ij and that it also checks assembly buffers (on line cards) for flow control.
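The per-slot behavior of an input arbiter can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the class and method names are hypothetical, the start-time rule `max(vtime, previous finish time)` stands in for steps of the listing that are not recoverable here, the eligibility test additionally requires an active request, and the virtual time is advanced by the reciprocal of the summed weights of requesting queues.

```python
class InputArbiter:
    """Simplified WF2Q+-style arbiter for one input of a buffered crossbar."""

    def __init__(self, n):
        self.vtime = 0.0            # system virtual time of this input
        self.stime = [0.0] * n      # per-flow virtual start times
        self.ftime = [0.0] * n      # per-flow virtual finish times
        self.valid = [False] * n    # True if stime/ftime describe the HOL cell

    def schedule(self, voq_len, cb_full, phi):
        """Pick the output to serve in the next slot, or None to idle."""
        n = len(voq_len)
        # Step 1: a queue requests service if it has a cell, its crosspoint
        # buffer has room and its weight is positive.
        req = [voq_len[j] > 0 and not cb_full[j] and phi[j] > 0
               for j in range(n)]
        for j in range(n):
            if req[j] and not self.valid[j]:
                # Assumed start-time rule for a new head-of-line cell.
                self.stime[j] = max(self.vtime, self.ftime[j])
                self.ftime[j] = self.stime[j] + 1.0 / phi[j]
                self.valid[j] = True
        if not any(req):
            return None
        # Step 2: never let virtual time lag behind every requesting flow.
        self.vtime = max(self.vtime,
                         min(self.stime[j] for j in range(n) if req[j]))
        # Step 3: among eligible flows, choose the smallest finish time.
        eligible = [j for j in range(n)
                    if req[j] and self.valid[j] and self.stime[j] <= self.vtime]
        chosen = min(eligible, key=lambda j: self.ftime[j])
        self.valid[chosen] = False
        # Step 4: advance virtual time by one slot of normalized service.
        self.vtime += 1.0 / sum(phi[j] for j in range(n) if req[j])
        return chosen

# Two always-backlogged queues with equal weights are served alternately.
arb = InputArbiter(2)
order = [arb.schedule([5, 5], [False, False], [1.0, 1.0]) for _ in range(4)]
```

In a full AMFS arbiter the weights `phi` would be recomputed from the queue lengths at the beginning of every slot; here they are passed in so that the selection logic can be exercised in isolation.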
B. Remarks
In theory, any PFQ algorithm works. For the sake of simple and fast hardware implementation, we choose WF²Q+ [14] because it provides the tightest delay bound and the smallest worst-case fair index (WFI) while maintaining the lowest algorithmic complexity by computing the system virtual time directly from the packet system. WF²Q+ also maintains per-flow (instead of per-packet) virtual start and finish times, and virtual times are updated only when a packet arrives at the head of its queue, thus greatly reducing the hardware complexity.
AMFS uses a PFQ algorithm which is designed for scheduling packets at output links. However, there are still some major differences between AMFS and an output-link scheduler. The first, of course, is the weight definition. For each flow with a fixed weight, an output-link scheduler using WF²Q+ (or another PFQ algorithm) simply keeps the service interval, which is the inverse of the weight. In addition, using normalized weights, the system virtual time can simply be increased by 1 instead of $1/\sum_j \phi_{ij}$ (in line 16). AMFS, on the other hand, has to keep track of the queue length and adjust the weight dynamically, so an arithmetic reciprocal operation is needed to get the service interval (in line 9).
Second, an output-link scheduler is packet driven, i.e., the algorithm executes enqueue when a packet arrives and dequeue when a packet departs. AMFS, on the other hand, is time-driven, i.e., the enqueue and dequeue operation occur at the same time in every slot because of the slotted crossbar operation. To take advantage of this synchronized operation, we also sample queue status and update virtual times only at the beginning of a slot. For cells arriving in the middle of a slot, we can regard them as arriving at the beginning of the next slot. This little postponement should have virtually no impact on the performance.
Third, an output-link scheduler considers the case of N input queues and one output queue (or link). Dequeue occurs only when the output queue is available and some input queue has packets. When the output is busy, all inputs are blocked. In crossbar scheduling, each arbiter has N input queues (e.g. VOQs) and N output queues (e.g. CBs). When one output queue is full, only the corresponding input queue should be disabled in order to remain work-conserving. Therefore, when the HOL packet of VOQ_ij is selected and VOQ_ij still has packets but CB_ij is full, we cannot update the virtual time of VOQ_ij until CB_ij is available. This is why we use valid_ij to indicate whether the virtual time of VOQ_ij is valid or not.
VIII. SIMULATION RESULTS
In our simulation, we implement AMFS based on WF²Q+ [14] (with α = 0.7 and β = 0.8). WF²Q+ is also used in the case of PFQ-PFQ with fixed weights, which is referred to as PFQ in the figures in this section. We also implement OCF [21] for IQ switches because it is proved to achieve 100% throughput for any admissible traffic. An output queuing scheme with WF²Q+ as the output-link scheduler is also shown for comparison, as it is the optimal solution. Note that in the case of IQ and CICQ switches, the output-link scheduler is simply FCFS.
The switch size is 16×16. Cell size is 64 bytes. VOQs or output queues (OQs) are statically partitioned with 256K bytes (4K cells) per input-output pair. The total buffer size per line card for a 16×16 switch is 4M bytes. The crosspoint buffer size of CICQ switches is 8 cells unless otherwise stated.
Packet arrival is modeled as a 2-state ON-OFF process. The number of ON state slots is defined as the packet length, which is obtained from a profile of NLANR traces at the AIX site [25]. We collected over 119 million packets. The packet length ranges from 20 to 1500 bytes, with mean $E_{on} = 566$ bytes and a standard deviation of 615 bytes. The number of OFF state slots is exponentially distributed with average $E_{off} = \frac{1-\rho}{\rho} E_{on}$, where ρ is defined as the offered workload.
We consider the following performance metrics:
• average packet delay: In the case of IQ and CICQ switches, packet delay is measured from when the first bit of a packet arrives at its VOQ to the last bit leaves its assembly buffer (see Fig. 1 and 2 ). In the case of OQ switches, packet delay is measured from when the first bit of a packet arrives at its OQ to the last bit leaves its OQ.
• average queue length: the average queue length of VOQs (in the case of IQ/CICQ switches) or OQs (in the case of OQ switches).
• average throughput: number of cells per slot leaving the assembly buffers (in the case of IQ/CICQ switches) or OQs (in the case of OQ switches).
The simulations run long enough to ensure a 95% confidence interval with a ±5% error margin for the average packet delay, queue length and throughput. Evaluation is performed under various traffic patterns. Due to the page limit, however, we only report some of the results.
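The ON-OFF arrival model described at the start of this section can be sketched as a small Python generator. This is an illustrative sketch only: the function names are hypothetical, the ON period is drawn from a geometric distribution with the requested mean as a stand-in for the trace-driven packet-length profile (which is not available here), and the exponentially distributed OFF period is rounded to a whole number of slots.

```python
import random

def mean_off_slots(rho, mean_on):
    """Average OFF length that yields offered load rho for a given mean ON length."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("offered load must be in (0, 1]")
    return (1.0 - rho) / rho * mean_on

def on_off_source(rho, mean_on, num_slots, rng):
    """Return a list of 0/1 flags, one per slot (1 = a cell arrives).

    mean_on is the mean ON period in slots and must be at least 1.
    """
    mean_off = mean_off_slots(rho, mean_on)
    slots = []
    while len(slots) < num_slots:
        # Assumed ON distribution: geometric with mean mean_on.
        on = 1
        while rng.random() > 1.0 / mean_on:
            on += 1
        # OFF period: exponential with mean mean_off, rounded to whole slots.
        off = int(round(rng.expovariate(1.0 / mean_off))) if mean_off > 0 else 0
        slots.extend([1] * on + [0] * off)
    return slots[:num_slots]

# Illustrative parameters: load 0.8 and a mean burst of 9 cells.
rng = random.Random(2007)
trace = on_off_source(0.8, 9.0, 10000, rng)
measured_load = sum(trace) / len(trace)
```

The measured load of a finite trace only approximates ρ, both because of sampling noise and because rounding the OFF periods slightly changes their mean.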
A. Admissible traffic
For uniform traffic, all schemes work well. Here we only report simulation results under the diagonal traffic: $\lambda_{ii} = 2\rho/3$, $\lambda_{i|i+1|} = \rho/3$ and $\lambda_{ij} = 0$ for $j \ne i$ and $j \ne |i+1|$, where $|i+1| = (i \bmod N) + 1$. This is a very skewed traffic pattern. No maximal algorithm has been found that can sustain admissible workload under the diagonal traffic pattern. Therefore this traffic pattern can be used as a litmus test. Fig. 8 and Fig. 9 show the average packet delay and queue length of traffic from input i to output i as a function of the workload per input. In this situation, AMFS and OCF still perform very well. PFQ starts to drop packets at about 90% workload. To see what happens, we also plot the packet delay of traffic from input i to output |i + 1| in Fig. 10, where the delay is very low in the case of PFQ. This case clearly shows that favoring fairness affects the throughput adversely, as pointed out in section III.
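The diagonal traffic pattern above is easy to generate and check. The Python sketch below builds the rate matrix with 0-based port indices, so that output `(i + 1) % n` plays the role of |i + 1|. The function names are hypothetical, and the admissibility test used here (no input or output loaded above 1) is an assumed working definition.

```python
def diagonal_traffic(n, rho):
    """Rate matrix with 2*rho/3 on the diagonal and rho/3 on the next output."""
    if n < 2:
        raise ValueError("the pattern needs at least two ports")
    lam = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lam[i][i] = 2.0 * rho / 3.0
        lam[i][(i + 1) % n] = rho / 3.0
    return lam

def is_admissible(lam, tol=1e-12):
    """True if no input (row) and no output (column) is loaded above 1."""
    n = len(lam)
    rows_ok = all(sum(lam[i]) <= 1.0 + tol for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) <= 1.0 + tol
                  for j in range(n))
    return rows_ok and cols_ok

# 16x16 switch at 90% load per input, as in the simulated configuration.
lam = diagonal_traffic(16, 0.9)
input_loads = [sum(row) for row in lam]
output_loads = [sum(lam[i][j] for i in range(16)) for j in range(16)]
```

Every input and every output carries exactly ρ in total, so the pattern is admissible for any ρ up to 1 even though each input concentrates two thirds of its load on a single output.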
B. Non-admissible traffic
To evaluate the weighted max-min fairness allocation, we run the simulation on a 4×4 switch with weight matrix W = [...]
On the other hand, AMFS and PFQ always maintain max-min allocation among competing flows. When the workload is below 1/3, the throughput is equal to the workload for every flow because the switch is not overloaded.
At workload 1/3, input 1 is saturated first, and the throughput of $f_{13}$ goes down until it reaches its max-min rate 1/6. For $f_{12}$, although $\lambda_{12} > r^{w}_{12} = 1/3$ when 1/3 < λ < 2/5, it is still an admissible flow because $\lambda_{12} \le r^{w\lambda}_{12}$. After workload 2/5, the throughput of $f_{12}$ also starts to go down until it reaches its max-min rate.
At workload 1/2, output 1 is also saturated, and f 11 gets its max-min rate 1/2. Note that although the weight of f 21 is 1 (corresponding to 1/6 of the total bandwidth), it also receives 1/2 of the total bandwidth because of the max-min fair allocation policy.
Finally when output 2 and output 3 become saturated at workload 2/3 and 5/6, f 32 and f 43 receive their max-min fair shares respectively which are also much more than their assigned bandwidth.
IX. HARDWARE IMPLEMENTATION ISSUES
Due to page limit, we only show in Fig. 12 a block 
X. CONCLUSION
In this paper, we address the problem of how to achieve both throughput and fairness in a buffered crossbar without speedup. The solution is surprisingly simple: apply a PFQ algorithm at each input and output with dynamic weights based on queue lengths and assigned weights. Our analysis and simulation results show that the adaptive max-min fair scheduling (AMFS) scheme achieves 100% throughput for any admissible traffic and provides max-min fairness under overload. With each arbiter on its line card, AMFS is feasible for very high speed networks.
