This paper presents a theoretical throughput analysis of two buffered-crossbar switches, called shared-memory crosspoint buffered (SMCB) switches, in which crosspoint buffers are shared by two or more inputs. In one of the switches, the shared crosspoint buffers are dynamically partitioned and assigned to the sharing inputs, and memory is sped up. In the other switch, inputs are arbitrated to determine which of them accesses the shared crosspoint buffers, and memory speedup is avoided. SMCB switches have been shown to achieve a throughput comparable to that of a combined input-crosspoint buffered (CICB) switch with a dedicated crosspoint buffer for each input, but with less memory than a CICB switch. The two analyzed SMCB switches use random selection as the arbitration scheme. We model the states of the shared crosspoint buffers of the two switches using a Markov-modulated process and prove that the throughput of the proposed switches approaches 100% under independent and identically distributed uniform traffic. In addition, we provide numerical evaluations of the derived formulas to show how the throughput asymptotically approaches 100%.
This paper follows the mainstream practice of segmenting incoming variable-size packets into fixed-length packets, called cells, at the ingress side of a switch. Packets are reassembled at the egress side before they depart from the switch. Therefore, the time it takes to transmit a cell from an input to an output is constant, and it is called a time slot. In this paper, the terms cells and time slots are used interchangeably. This paper considers admissible traffic, defined as $\sum_{i=0}^{N-1} \rho_{i,j} \le 1, \forall j$, and $\sum_{j=0}^{N-1} \rho_{i,j} \le 1, \forall i$, where $i$ is the input port number ($0 \le i \le N-1$), $j$ is the output port number ($0 \le j \le N-1$), and $\rho_{i,j}$ is the input load from input $i$ destined to output $j$.
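The admissibility condition above can be checked mechanically for a given load matrix. The following is a minimal sketch, not part of the original analysis; the function names `is_admissible` and `uniform_load` and the tolerance `tol` are illustrative assumptions.

```python
def is_admissible(rho, tol=1e-12):
    """Return True if every row sum and every column sum of rho is at most 1.

    rho[i][j] is the load from input i destined to output j.
    tol absorbs floating-point rounding (an illustrative choice).
    """
    n = len(rho)
    if any(len(row) != n for row in rho):
        raise ValueError("rho must be an N x N matrix")
    rows_ok = all(sum(rho[i][j] for j in range(n)) <= 1 + tol for i in range(n))
    cols_ok = all(sum(rho[i][j] for i in range(n)) <= 1 + tol for j in range(n))
    return rows_ok and cols_ok


def uniform_load(n, load):
    """Uniform traffic: each input spreads `load` evenly over the n outputs."""
    return [[load / n] * n for _ in range(n)]
```

For instance, `uniform_load(4, 1.0)` is admissible, whereas a matrix in which two inputs each send a load of 0.9 to the same output is not.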
CICB switches, like IQ switches, adopt virtual output queues (VOQs) at the inputs to avoid head-of-line (HOL) blocking [23]. A VOQ stores cells from an input port destined to an output port. The amount of memory in the buffered crossbar of a CICB switch is $kLN^2$ bytes, where $N$ is the number of switch ports, $L$ is the cell size in bytes, and $k$ is the number of cells that can be stored in a crosspoint buffer. A CICB switch uses a dedicated crosspoint buffer for each $(i, j)$ pair [2], [5].
The CICB switch also uses a credit-based flow control mechanism to avoid crosspoint-buffer overflow. Buffer underflow is another phenomenon that can decrease the throughput of buffered-crossbar switches. Buffer underflow occurs when cells arrive at the buffers at a lower rate than that at which cells are served from the buffers. This may occur when the line cards of a CICB switch are placed far from the buffered crossbar. Buffer underflow is avoided by setting the size of crosspoint buffers equal to or larger than the round-trip time ($RTT$) measured between the line cards (inputs) and the buffered crossbar. Here, $RTT$ is defined as the sum of the time it takes to transmit a cell from an input to the buffered crossbar, $d_1$; the time it takes to transmit flow-control information, which indicates crosspoint-buffer occupancy, from the buffered crossbar to the input, $d_2$; and the input and output arbitration delays, $IA$ and $OA$, respectively ($RTT = d_1 + d_2 + IA + OA$). The unit of $RTT$ is the number of time slots; however, as one cell can be transmitted in one time slot, $RTT$ can also be expressed as a number of cells. In a CICB switch, the required crosspoint-buffer size to avoid underflow for flows with data rate $R_c$ b/s (or cells/time slot), where $R_c$ is the port speed, is such that $k \ge \frac{R_c \, RTT}{L}$. Herein, a flow is defined as the set of packets from input $i$ destined to output $j$.
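As a small worked sketch of the sizing rule, the functions below compute $RTT$ from its four components and the smallest integer buffer size that satisfies $k \ge RTT$. They assume all delays are expressed in time slots and that the flow runs at full port rate (one cell per time slot); the function names are illustrative.

```python
import math


def round_trip_time(d1, d2, ia, oa):
    """RTT in time slots: RTT = d1 + d2 + IA + OA."""
    return d1 + d2 + ia + oa


def min_crosspoint_buffer_cells(d1, d2, ia, oa):
    """Smallest integer k (in cells) such that k >= RTT.

    Assumes a flow at full port rate, i.e., one cell per time slot.
    """
    return math.ceil(round_trip_time(d1, d2, ia, oa))
```

For example, with $d_1 = d_2 = 2$ and $IA = OA = 1$ time slots, $RTT = 6$ and a crosspoint buffer of at least six cells is needed to avoid underflow.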
When the amount of memory in the buffered crossbar of a CICB switch is small, such that $k < RTT$, the throughput of the switch under large-rate flows may decrease because a crosspoint buffer may become empty before a new cell is received from the inputs [8], [9]. This underflow problem was demonstrated under these conditions [12] using the unbalanced traffic model [5], [6]. Buffer underflow was also observed in a buffered-crossbar switch with internal variable-length segments [11].
To reduce the effect of buffer underflow, a method to improve the utilization of crosspoint buffers has been adopted in a switch with prioritized services [10]. In that study, a crosspoint buffer hosts logical queues, one for each service priority. The size of crosspoint buffers is set to $k \le P + 1$, where $P$ is the number of service priorities, so that $k > RTT$. Therefore, the switch uses the prioritized buffers to support either long RTTs for high-priority flows or services with different priorities.
As buffered crossbars are often implemented on a single chip, the amount of memory that can be implemented is limited by physical space [24], [25]. The interconnection technology that supports high speeds requires large on-chip real estate [26]. Increasing the port speed requires more on-chip real estate for interconnection, which limits the amount of space for on-chip memory. A solution to reduce the memory amount (to support long RTTs under best-effort traffic) was proposed using shared-memory crosspoint buffered (SMCB) switches with round-robin arbitration [12]. Shared-memory buffered switches were also considered to support multicast traffic [16]-[18]. Virtual crosspoint queues (VCQs) in ingress ports of a buffered crossbar were later proposed as another approach to reduce crosspoint-buffer size [13]. The VCQs resemble VOQs; however, they are placed at the buffered crossbar. This work showed that the adoption of VCQs requires more memory in the crossbar to achieve performance similar to that of a shared-memory crosspoint switch [14].
Another approach focused on increasing the efficiency of the flow control mechanism to decrease the dependence on memory [15]. Furthermore, the implementation of input arbiters placed at the buffered crossbar was then proposed [19]. Here, the latency of the information exchange between input and output arbiters is avoided as the arbiters are placed in the same chip. This approach increases the efficiency of the flow control mechanism, but the minimum memory requirement remains at $kLN^2$.
These works showed that an SMCB switch is a buffered-crossbar switch with the smallest amount of memory in the buffered crossbar. Therefore, a question arises: What is the maximum throughput of SMCB switches under uniform traffic?
As an answer to this question, we present a theoretical throughput analysis of two SMCB switches whose crosspoint buffers are shared by $m$ inputs and that use random selection as the arbitration scheme. The first switch, called the SMCBxm switch, uses dynamic partitioning of buffer space and a memory speedup of $m$. The second, called the mSMCB switch, arbitrates inputs to access the shared crosspoint buffers and thereby avoids memory speedup. Random selection has been used to analyze the throughput of packet switches [27], [28], and here we adopt it for a similar reason. The analysis proves that the achievable throughput of the two shared-memory CICB switches is 100% under i.i.d. uniform traffic. Furthermore, the high throughput achieved by the mSMCB switch indicates that memory speedup is not required to achieve high switching performance.
The remainder of this paper is organized as follows. Section II describes the SMCBxm and mSMCB switches. Section III demonstrates that the throughput of the SMCB switches with random selection approaches 100% under i.i.d. uniform traffic. Section IV presents the conclusions.
II. SHARED-MEMORY CROSSPOINT BUFFERED (SMCB) SWITCHES
To reduce the required memory amount in the buffered crossbar of an SMCB switch, a crosspoint buffer is shared by $m$ inputs, where $2 \le m \le N$. This section introduces two SMCB switches. Previously, SMCB switches were considered with round-robin selection for both input and output arbitration; the SMCB switches presented in this paper use random selection for the input and output arbitration.
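The memory saving from sharing can be illustrated with a short calculation. This sketch assumes that the buffered crossbar of an SMCB switch holds $N^2/m$ shared buffers of $k_s$ cells each, which is consistent with the $1/m$ memory ratio stated in the conclusions; the function names are illustrative.

```python
def cicb_memory_bytes(n, k, cell_bytes):
    """Crossbar memory of a CICB switch: k * L * N^2 bytes."""
    return k * cell_bytes * n * n


def smcb_memory_bytes(n, k_s, cell_bytes, m):
    """Crossbar memory of an SMCB switch, assuming N^2 / m shared buffers
    of k_s cells each (an assumption consistent with the 1/m ratio)."""
    if not 2 <= m <= n or n % m != 0:
        raise ValueError("expected 2 <= m <= N with m dividing N")
    return k_s * cell_bytes * n * n // m
```

For a 32-port switch with 64-byte cells and $k = k_s = 1$, the CICB crossbar needs 65536 bytes, whereas an SMCB crossbar with $m = 2$ needs 32768 bytes.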
A. Shared-Memory Crosspoint Buffered Switch with Memory Allocation and Speedup of m (SMCBxm)
The SMCBxm switch has $N$ VOQs at each input and $N^2$ crosspoints. Because $m$ inputs might need to access the shared memory at the same time, this switch requires a speedup of $m$ for the shared memory. A credit-based flow control mechanism, in combination with the dynamic memory allocation, is used to avoid buffer overflow. With the flow control mechanism, an input is kept from sending more cells to the SMB than the permitted allocation. To minimize the speedup of the shared memory in a practical implementation, the number of inputs sharing a crosspoint buffer is set to two (i.e., $m = 2$). This description considers an even $N$ for the sake of clarity. However, an odd $N$ can also be adopted (with one dedicated crosspoint buffer for the non-sharing input). The output arbiter selects a cell from the SMBs to be forwarded to the output. The selected cell is sent to the output in the next time slot.
B. Shared-Memory Crosspoint Buffered Switch with Input-Crosspoint Matching (mSMCB)
In the mSMCB switch, only one input is allowed to access an SMB in a time slot; therefore, memory speedup is not required. To schedule the access to the SMB among $m$ inputs, an input-access scheduler, denoted as $S_q$, is used to match $m$ inputs to $N$ SMBs. Figure 3 shows the architecture of the mSMCB switch for $m = 2$ (i.e., 2SMCB). The size of an SMB, in number of cells that can be stored, is also denoted as $k_s$. There are $\frac{N}{m}$ schedulers $S_q$ in the buffered crossbar. $S_q$ matches non-empty inputs to SMBs that have room for storing at least one cell. The matching in $S_q$ follows a three-phase process, like that used by some IQ switches [27], [29]. In this section, the matching scheme in $S_q$ uses random selection [27]. A credit-based flow control is used to monitor the available space in SMBs and to avoid buffer overflow. $S_q$ considers as eligible for matching the non-empty VOQs whose corresponding SMBs have room available for at least one cell. Allocation of memory is not used in this switch because the matching process regulates access to the SMBs: only one input is allowed to be matched to one SMB.
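The matching performed by one scheduler $S_q$ can be sketched as a single request-grant-accept pass with random selection. This is an illustrative sketch only: it assumes that the three phases are request, grant, and accept, as in the randomized matching used by some IQ switches, and the function name and data layout are hypothetical rather than taken from the switch description.

```python
import random


def match_once(nonempty_voqs, smb_has_room, rng=random):
    """One request-grant-accept pass with random selection.

    nonempty_voqs: dict input i -> set of outputs j with a non-empty VOQ(i, j)
    smb_has_room:  dict output j -> True if SMB(q, j) can store one more cell
    Returns a dict input i -> output j of matched (input, SMB) pairs.
    """
    # Request: each input requests every SMB for which it has an eligible VOQ.
    requests = {}
    for i, outputs in nonempty_voqs.items():
        for j in outputs:
            if smb_has_room.get(j, False):
                requests.setdefault(j, []).append(i)

    # Grant: each requested SMB picks one requesting input at random.
    grants = {}
    for j, inputs in requests.items():
        chosen = rng.choice(sorted(inputs))
        grants.setdefault(chosen, []).append(j)

    # Accept: each granted input picks one granting SMB at random.
    return {i: rng.choice(sorted(js)) for i, js in grants.items()}
```

Because each SMB grants to exactly one input and each input accepts exactly one grant, no SMB is written by more than one input in a time slot, which is the property that removes the need for memory speedup.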
At each output in the buffered crossbar, there is an output arbiter to select a cell from non-empty SMBs. An output arbiter considers up to two cells from each SMB, where each cell belongs to a different input. The output arbiter uses random selection and is represented as a rectangular block in Figure 3.
The 2SMCB switch works as follows: Cells destined to output $j$ arrive at $VOQ(i, j)$ and wait for dispatching. Input $i$ notifies $S_q$ about new cell arrivals. $S_q$ selects the next cells to be forwarded to the crossbar by performing matching between inputs and SMBs. After a cell (or VOQ) is matched by $S_q$, the input is notified and sends the cell in the next time slot. A cell going from input $i$ to output $j$ enters the buffered crossbar and is stored in $SMB(q, j)$. Cells leave output $j$ after being selected by the output arbiter. The output arbiters at Outputs 0 and 3 select a cell, in a random fashion, to be forwarded to each corresponding output. Flow control information is sent back to Inputs 0 and 3 to indicate the availability of the SMBs. Cells A and G are forwarded to their destined outputs in time slot $T + 2$.
III. THROUGHPUT OF THE SMCB SWITCHES WITH RANDOM SELECTION
This section presents the throughput analysis of the mSMCB and SMCBxm switches that use random-based selection schemes under Bernoulli uniform traffic. The presented analysis is based on the following assumptions of the incoming traffic:
1) Arrivals at each input are i.i.d.
2) Arrivals at each input are independent of previous arrivals and are modeled as Bernoulli arrivals.
3) Cell destinations are uniformly distributed over all outputs.
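The three assumptions above translate directly into a per-time-slot traffic generator. The following is a minimal sketch for simulation purposes; the function name and the use of `None` to mark an idle input are illustrative choices.

```python
import random


def bernoulli_uniform_arrivals(n, load, rng=random):
    """Arrivals for one time slot in an N x N switch.

    Each input independently receives a cell with probability `load`
    (Bernoulli arrival); the destination is drawn uniformly from the
    n outputs. Returns a list whose entry i is the destination output
    of the cell arriving at input i, or None if no cell arrives.
    """
    return [rng.randrange(n) if rng.random() < load else None
            for _ in range(n)]
```

Under this generator, the load $\rho_{i,j}$ from input $i$ to output $j$ is `load / n` for every pair, so the traffic is admissible whenever `load` does not exceed 1.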
It has been shown that the throughput of a buffered-crossbar switch with dedicated crosspoint buffers increases asymptotically to 100% as $N \to \infty$ under Bernoulli i.i.d. traffic if the crossbar can buffer one cell at each crosspoint [28], [30].
This throughput is also referred to as the saturation throughput as $N \to \infty$.
In this paper, the presented analysis focuses on SMCB switches with the minimum number of inputs sharing an SMB, i.e., $m = 2$, and it shows that the 2SMCB and SMCBx2 switches achieve 100% throughput under uniform i.i.d. traffic with Bernoulli arrivals as $N \to \infty$. The results show that memory speedup, as required by the SMCBxm switch, is not a strict requirement for switches of moderate and large sizes (i.e., $N = 32$ or larger). Furthermore, the analysis considers the case where $m = N$ for the mSMCB switch to show the relationship between switching performance and the amount of memory required.
The performance analysis uses the probability that a VOQ receives service to identify the maximum throughput of the proposed switches. For the mSMCB switch, the analysis focuses on the effect that the matching process has on the switching performance.
In an SMCBx2 switch, SMBs are partitioned before the cells are forwarded from the input to the crosspoints, and therefore, a partition can be considered as a dedicated queue for input i if the ratio of the occupancies of the two VOQs sharing the SMB remains unchanged. To simplify the description, the remainder of this section refers to the partition of an SMB as a queue.
Considering the characteristics of the two SMCB switches, $k_s$ is one or more cells for the mSMCB switch, and two or more cells for the SMCBxm switch.
The probability that a queue is full is denoted as $P_f$, and the probability that a VOQ is blocked (from sending a cell to the corresponding crosspoint buffer) is denoted as $P_b$. The probability that a VOQ sends the HOL cell to the queue is denoted as $P$, where $P = 1 - P_b$.
A. mSMCB Switch with $k_s = 1$
The blocking probability of a VOQ in the mSMCB switch with $k_s = 1$, denoted as $P_b$, is defined by two possible cases:
I) When the SMB is full, with probability $P_f$. In this case, $P_b$ is a function of the probability that a cell is forwarded to the corresponding output. The probability that there is a cell destined to this specific output is $\frac{1}{N}$.
II) When a given input contends with $t$ inputs (where $0 \le t \le m - 1$) for access to an available SMB, and the input is not granted because another input is matched. The probability that an input receives no grant is $\frac{t}{t+1}$.
Considering these two cases, the blocking probability for the mSMCB switch is stated in (1).
Figure 5 shows a Markov chain describing the occupancy of $SMB(q, j)$ in an mSMCB switch. $P_{S_y}$ represents the state probability, where $0 \le y \le k_s$. $P_{uv}$ is the transition probability from state $u$ to state $v$. The probability that the SMB is full, $P_f$, is equivalent to the state probability $P_{S_{k_s}}$. For any $k_s$, $P_{01}$ is defined by the product of the probability of input arrival $\rho_{i,j}$ and the matching probability between the inputs and the SMBs.
The service probability $P_{service}$ is the probability that the output arbiter selects a non-empty $SMB(q, j)$ to forward a cell to the output. For $m = 2$, $P_{service} = \frac{2}{N}$ for any state. $P_{10}$, which occurs when input $i$ has no requests and $SMB(q, j)$ is selected by the output arbiter, is defined by
The service probability $P_{service}$ is the probability that the output arbiter selects $SMB(q, j)$ to forward a cell, or
$P_{01} P_{S_0} = P_{10} P_{S_1}$;
The probability that the SMB is full is represented as
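For the two-state chain ($k_s = 1$), the full-buffer probability follows from the balance equation $P_{01} P_{S_0} = P_{10} P_{S_1}$ together with the normalization of the state probabilities. The derivation below is a reconstruction from those two relations and is offered as a sketch; it is not necessarily the typeset form used in the original.

```latex
\begin{align*}
P_{01} P_{S_0} &= P_{10} P_{S_1}, \qquad P_{S_0} + P_{S_1} = 1 \\
\Rightarrow \quad P_{S_0} &= \frac{P_{10}}{P_{01} + P_{10}}, \\
P_f = P_{S_1} &= \frac{P_{01}}{P_{01} + P_{10}}.
\end{align*}
```

This has the same structure as the $k_s = 2$ expression, in which the numerator is the product of the upward transition probabilities and the denominator sums one term per state.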
B. mSMCB Switch with $k_s = 2$
To compare the performance of the two SMCB switches with the same amount of memory, the size of SMBs is set to $k_s = 2$. The probability that an SMB is full follows (1). The following balance equations are obtained when $k_s = 2$:
$P_{01} P_{S_0} = P_{10} P_{S_1}$; $P_{12} P_{S_1} = P_{21} P_{S_2}$;
and from these equations:
$P_f = P_{S_2} = \frac{P_{01} P_{12}}{P_{01} P_{12} + P_{01} P_{21} + P_{10} P_{21}}$.
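The closed form for $P_f$ can be cross-checked numerically by solving the balance equations of a general birth-death chain. The sketch below is illustrative: the function names are assumptions, and the transition probabilities in the usage note are arbitrary test values, not results from the analysis.

```python
def birth_death_stationary(up, down):
    """Stationary probabilities [P_S0, ..., P_Sk] of a birth-death chain.

    up[y]   is the transition probability from state y to state y + 1,
    down[y] is the transition probability from state y + 1 to state y.
    Uses the balance equations up[y] * P_Sy = down[y] * P_S(y+1).
    """
    weights = [1.0]
    for u, d in zip(up, down):
        weights.append(weights[-1] * u / d)
    total = sum(weights)
    return [w / total for w in weights]


def p_full_ks2(p01, p12, p10, p21):
    """Closed form of P_f = P_S2 for k_s = 2."""
    return p01 * p12 / (p01 * p12 + p01 * p21 + p10 * p21)
```

With the arbitrary values $P_{01} = 0.3$, $P_{12} = 0.2$, $P_{10} = 0.5$, and $P_{21} = 0.4$, the last entry of `birth_death_stationary([0.3, 0.2], [0.5, 0.4])` agrees with `p_full_ks2(0.3, 0.2, 0.5, 0.4)`.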
Here, the transition probabilities are defined as:
;
C. SMCBxm Switch with $k_s = 2$
Without loss of generality, let us assume that all VOQs in the SMCBx2 switch have a backlog longer than $RTT$ such that SMBs are partitioned into two equally sized parts of $k_s/2$ each, one part for each VOQ that shares the SMB. In this case, the blocking probability of a VOQ is a function of the occupancy of the allocated portion of the memory, where the portion is either full or available. The blocking probability of a VOQ is obtained by considering whether the allocated portion of the SMB is full. The blocking probability of a VOQ to forward a cell to the corresponding SMB, $p_b$, is represented as
The allocated SMB partition of one-cell size in the SMCBx2 switch is modeled as a queuing system, as shown in Figure 5 for the mSMCB switch. The state transition probability $p_{01}$ for the SMCBx2 switch is the product of the probability of input arrival $\rho_{i,j}$ and the probability that the input arbiter selects $VOQ(i, j)$, which is $\frac{1}{N}$, as in (8).
The transition probability $P_{10}$ is the probability that an SMB is selected by the output arbiter while there is no request from the input, or
$P_f$ is calculated as in (5):
D. Maximum Throughput of the 2SMCB and SMCBx2 with Random Selection
The throughput of the proposed switches is determined by considering the blocking probability of the VOQs. This section demonstrates that these two switches can achieve 100% throughput for large switch sizes. The analysis is performed through numerical evaluations of the blocking probability of VOQs for the SMCB switches. The following equations state the limits of $P_b$ for the 2SMCB switch for $k_s = 1$ and $k_s = 2$ cells, respectively, and for the SMCBx2 switch when $k_s = 2$.
In the case of the 2SMCB switch with $k_s = 1$:
and
For the case of the 2SMCB switch with $k_s = 2$:
then, the limit follows:
For the SMCBx2 switch with $k_s = 2$, using (8), (9), and (10):
Because the minimum possible values of $k_s$ are addressed, the nonblocking probability for these two switches for any $k_s$ value approaches 1.0 as $N$ grows. Therefore, both switches achieve 100% throughput for a large $N$. This high throughput with random-based selection schemes is achieved because of the expansion gain provided by the matching size, 2-to-$N$, in the 2SMCB switch, and because a speedup of two is used for the SMCBx2 switch.
Considering that the limit of the non-matching probability is (16) and that the probability that an SMB is full converges to
the blocking probability when all N inputs share the SMBs converges to
In other words, the nonblocking probability converges to 0.632. Therefore, the throughput of this switch when $m = N$ is 63.2% (as in an $N \times N$ IQ switch with random selection [27], [31]). This shows that the performance of the mSMCB switch is also determined by the matching process. Therefore, the performance of the mSMCB switch increases as the number of inputs sharing the crosspoint buffers decreases, i.e., $m \le N$, as a product of the $m$-to-$N$ matching. While a larger $m$ would save more memory at the buffered crossbar, the throughput would decrease.
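The value 0.632 coincides with the well-known limit $1 - 1/e$. The sketch below evaluates it numerically under the assumption that, for $m = N$, the blocking probability takes the classical random-selection form $(1 - 1/N)^N$; this form is an assumption made for illustration and is not copied from the derivation above.

```python
import math


def nonblocking_probability_m_equals_n(n):
    """1 - (1 - 1/N)^N, assuming the classical random-selection form
    for the blocking probability when all N inputs share the SMBs."""
    return 1.0 - (1.0 - 1.0 / n) ** n


# The value decreases toward 1 - 1/e as N grows.
limit = 1.0 - 1.0 / math.e
```

Evaluating the function for increasing $N$ shows the nonblocking probability falling from 0.75 at $N = 2$ toward the limit $1 - 1/e$, which rounds to 0.632.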
E. Nonblocking Probability of Small Switch Sizes
We evaluated the nonblocking probabilities of switches, shown in Figure 6, using (5) and (10). We further compare the nonblocking probabilities of the 2SMCB and the SMCBx2 switches with a CICB switch with random selection for both input and output arbitration and an input-queued (IQ) switch with parallel iterative matching (PIM) [27]. The nonblocking probability of the CICB switch follows that of the SMCBx2 switch without memory speedup. We use $k$ to represent the crosspoint buffer size of the CICB switch. When $k = k_s = 1$, the amount of memory in the SMCB switches is half of that in the CICB switch. Figure 6 shows the nonblocking probability of a VOQ in the 2SMCB switch for $k_s = 1$, in the SMCBx2 switch for $k_s = 2$, in the CICB switch for $k = 1$, and in the IQ switch with one iteration, all under uniform traffic with i.i.d. Bernoulli arrivals.
The results show that the SMCBx2 and CICB switches achieve slightly higher throughput than the 2SMCB switch when the switch size is small (e.g., N < 16). As the switch size increases, the throughput of both switches approaches 100%, and the throughput of the IQ switch with PIM approaches 63%.
The throughputs of the SMCB switches and the CICB switch with the same amount of memory (i.e., $k_s = 2$ and $k = 1$) are also compared. Figure 7 was generated using (5) for the SMCBx2 and CICB switches and (6) for the 2SMCB switch. This figure shows that as the switch size increases, the throughput of both switches approaches 100%. The SMCBx2 switch has the same performance as that of the CICB switch, but at the cost of a memory speedup of 2.
F. Discussion of Nonblocking Probability under Hot Spot and Bursty Traffic
Internet traffic may not show uniformity. We discuss the nonblocking probability under hot spot and bursty traffic in this section. In this paper, the assumption of having cell destinations uniformly distributed over all outputs can be relaxed to allow hot spot traffic, where the traffic arriving at the inputs is directed to a single output. This is still governed by the M/M/1 model with a different arrival probability. Assuming the probability of arrival as $P_f = 0$ still holds. The performance of the SMCB switches under unbalanced traffic depends on the number of inputs that are sharing the crosspoint buffers, $m$, and on the arbitration scheme used. It has been shown that the SMCB switches achieve a performance comparable to that of a CICB switch with weighted arbitration schemes and the minimum number of inputs sharing the crosspoint buffers, e.g., $m = 2$ [12].
Bursty traffic is also of interest when analyzing switch performance. As observed by the CAIDA traffic analysis project, the probability that the average Internet packet size is smaller than 100 bytes is more than 50% [32], [33]. Packet segmentation may not be necessary when the size of a packet is small. Cell length can be determined by the expected packet length [32], [33]. An appropriate cell length reduces the level of packet segmentation and improves switching performance [11], [34]-[40].
For bursty traffic, the proposed analysis can be relaxed from fixed-size cells to variable-size cells, since cell arrivals can still be modeled as Bernoulli arrivals for variable-length cells.
IV. CONCLUSIONS
This paper presents a theoretical throughput analysis of two shared-memory CICB switches, the SMCBxm and mSMCB switches, which use crosspoint buffers shared by m input ports and random-based arbitrations, under independent and identically distributed traffic with uniform distributions and Bernoulli arrivals. The SMCBxm switch uses memory speedup and dynamically partitions the shared memory among the sharing inputs. The mSMCB switch arbitrates the access to crosspoint buffers using an input access scheduler and avoids memory speedup.
In this paper, we have proved that these switches achieve 100% throughput under i.i.d. traffic with uniform distribution and Bernoulli arrivals. This result also indicates that memory speedup is not needed by the mSMCB switch to achieve this high performance. Both of the analyzed SMCB switches reduce the amount of memory to $\frac{1}{m}$ of that in the buffered crossbar of a CICB switch with dedicated crosspoint buffers.
