Abstract: Input-queued cell switches employing the oldest-cell-first (OCF) policy have been shown to yield low mean delay characteristics. Moreover, it has been proven that OCF is stable for admissible i.i.d. arrival traffic when executed with a scheduling speedup of 2. However, an increase in link rates and port densities directly leads to a decrease in packet duration times, to a point where cell-by-cell switching is no longer considered practical. To address this challenge, this paper studies frame-based scheduling algorithms for a scalable combined input-output queued (CIOQ) switch architecture. The latter is decomposed into independent subgroups, each employing multiple simple crosspoint switches. A key outcome of this decomposition is a substantial reduction of scheduling times. Unlike many other schemes, which necessitate custom integrated circuits, the architecture proposed here utilizes commercially available crosspoint switches. We present a Lyapunov-based stability analysis that establishes moderate conditions under which the switch is stable for all admissible traffic patterns. By reconfiguring the crossbar switch once every several time slots, the timing constraints imposed on the scheduling algorithm are significantly relaxed. Simulation results are presented, demonstrating the merits of the approach, particularly in the presence of bursty traffic scenarios.
I. INTRODUCTION
Recently, several novel architectures have been proposed for the design of packet switches with large aggregate capacities and high-speed link data rates. Examples include the Parallel Packet Switch (PPS) [1], the Parallel Shared Memory (PSM) router [2], and the Load-Balanced router [3]. In theory, both the PPS and the PSM can emulate a first-come-first-served output-queued (FCFS-OQ) packet switch and support quality of service (QoS) guarantees. Nevertheless, implementation of PPS and PSM switches involves the design of intricate centralized schedulers, which inherently introduce scalability limitations. It has been shown that the Load-Balancing architecture can guarantee 100% throughput for a broad class of traffic patterns and requires no scheduler. However, the Load-Balancing architecture suffers from packet reordering, a consequence of allowing multiple internal input-to-output paths for each flow. Moreover, it imposes frequent switch fabric reconfigurations [4]. These two attributes introduce delay and scalability limitations, and consequently require the introduction of custom-designed VLSI components.
Input-queued (IQ) packet switching architectures with virtual output queueing (VOQ) are commonly utilized in Internet routers as they offer pragmatic scalability while requiring moderate memory bandwidth. A scheduling algorithm is needed for an IQ switch to dynamically determine the configuration of the crossbar by finding matchings between ingress and egress ports. However, an increase in link rates directly causes a decrease in packet duration times to a point where cell-by-cell switching is no longer considered a practical approach. This is true particularly for optical switch fabrics that employ slowly reconfiguring crossconnect elements. The reconfiguration overhead for a typical optical switch fabric can be in the range of 50-100 ns [4]. However, with 64-byte packets and speeds of 40 Gbps, a reconfiguration time of a few nanoseconds is necessary for the cell-by-cell switching mechanism. To address this issue, in a previous paper [5], the authors proposed a frame-based maximal weight matching (FMWM) algorithm with transfer speedup, in which scheduling decisions are issued in accordance with the MWM algorithm but are kept unchanged for a duration of k consecutive time slots. It has been proven that a CIOQ switch running the FMWM scheduling algorithm with a transfer speedup of 2 is stable under admissible traffic for any frame size. By reconfiguring the crossbar switch once every several time slots, we significantly relax the timing constraints imposed on the scheduling algorithm.
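The timing argument can be checked with a short calculation. The sketch below is illustrative only (the function name is ours, not part of the paper): it computes the duration of a 64-byte cell on a 40 Gbps link and expresses a 50 ns and a 100 ns reconfiguration overhead in cell times.

```python
def cell_time_ns(cell_bytes: int, link_gbps: float) -> float:
    """Duration of one cell on the wire, in nanoseconds."""
    # bits divided by Gbit/s gives nanoseconds
    return cell_bytes * 8 / link_gbps


slot_ns = cell_time_ns(64, 40.0)  # 512 bits at 40 Gbps: 12.8 ns
for overhead_ns in (50, 100):
    # how many cell times a single fabric reconfiguration would occupy
    print(f"{overhead_ns} ns overhead = {overhead_ns / slot_ns:.2f} cell times")
```

With a cell time of 12.8 ns, an overhead of several cell times per reconfiguration cannot be absorbed on a per-cell basis, which is the motivation for holding each schedule for k slots.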
In order to scale to high port densities, we propose a novel scalable packet switching architecture which is straightforward to implement, as it employs a group of memoryless passive crosspoint switches. The architecture is a CIOQ switch whereby an N x N switch is partitioned into G identical and independent switching groups, each hosting a pool of smaller crosspoint switches. The motivation to employ such an architecture is to benefit from the idea of partitioning one (large) crossbar into several small crosspoints [6], so as to facilitate scalability and reduce the timing requirements imposed on the scheduling algorithm. The approach is characterized by offering a 100% throughput guarantee for a broad class of traffic patterns when employing the FMWM/OCF scheduling algorithm. Packet reordering need not be considered and switch fabric reconfiguration is infrequent. Since the algorithm studied is frame-based, by reconfiguring the crossbar switch once every several time slots, the timing constraints imposed on crosspoint devices are significantly relaxed.
II. GENERAL SWITCH ARCHITECTURE
The proposed switch architecture is based on a design first introduced in [6]. Consider an N x N switch as shown in figure 1, whereby N input modules are equally partitioned into G groups, each of which is independently connected to all N outputs via a pool of non-blocking crosspoint switches. It is assumed that each of the crosspoint switches is small to facilitate practical scalable switch implementations.
Throughout this paper we shall refer to a flow as the collection of all packets with the same input and output index values. We further let a group flow denote the set of all packets from a given group destined to a unique output. All packets belonging to the same group flow will be buffered in the same memory module at their destination output. For example, all packets from group #1 that are destined to output N will be buffered in memory 1 of output N. Hence, multiple memory modules must be maintained at each output to hold packets from different group flows. Clearly, the number of memory modules maintained at each output is G, since for each output, there can be at most G different group flows, each of which corresponds to one group. We shall let all packets from the same flow traverse through the same path, i.e., we only consider single-path switching, discarding multipath scenarios that incur packet reordering.
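The mapping from a packet to its output memory module can be sketched as follows. This is a minimal illustration, assuming 0-based indices and that inputs are assigned to groups in contiguous blocks of N/G ports; the paper does not specify the assignment, and the function names are hypothetical.

```python
def group_of(input_port: int, n_ports: int, n_groups: int) -> int:
    """Group index of an input, ASSUMING contiguous equal-size groups."""
    if n_ports % n_groups != 0:
        raise ValueError("N must be divisible by G")
    return input_port // (n_ports // n_groups)


def output_memory(input_port: int, output_port: int,
                  n_ports: int, n_groups: int) -> tuple[int, int]:
    """(output, memory module) that buffers the packet's group flow.

    Each output keeps G memory modules, one per group flow.
    """
    return output_port, group_of(input_port, n_ports, n_groups)


# 12 x 12 switch with 4 groups: inputs 0..2 share memory 0 at every output
print(output_memory(input_port=1, output_port=11, n_ports=12, n_groups=4))
```

Because the memory index depends only on the group of the input, all packets of a flow follow one path and land in one module, so no reordering can arise.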
The core switching fabric comprises two stages of passive crosspoint switches. The first connects the ingress ports to the rest of the fabric, hosting a pool of $\alpha \times \alpha K$ crosspoint switches per group, where $K$ denotes the number of crosspoint switches in the second stage and $\alpha$ is the maximal number of outputs that can be matched to a single input. Note that $\alpha = 1$ represents the common case in which each input can be matched to at most one output. If $\alpha > 1$ then a transfer speedup is required. Each of the crosspoint switches in the second stage has $\alpha N/G$ inputs and N outputs. By placing an $\alpha \times \alpha K$ switch between the crosspoints pool and each input port, maximal traffic throughput is guaranteed, as will be elaborated on in the following section. From a crosspoint optimization perspective, since the highest number of inputs or outputs on any crosspoint device in the system is a key scalability metric, we note that the maximal port count on any crosspoint is given by $\max\{\alpha K, \alpha N/G\}$. For example, if one wishes to design a 512-port switch (i.e., N = 512), where $\alpha = 2$, G = 16, K = 8, then $\max\{\alpha K, \alpha N/G\} = 64$, suggesting that the switch can be realized using existing off-the-shelf crosspoint devices.
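The port-count example can be reproduced numerically. The sketch below is a hedged illustration: it assumes that the second term of the maximum is $\alpha N/G$, a reading chosen because it yields the quoted value of 64 for N = 512, $\alpha$ = 2, G = 16, K = 8. The function name is hypothetical.

```python
def max_crosspoint_ports(n_ports: int, n_groups: int, k: int, alpha: int) -> int:
    """Largest port count on any crosspoint device, under the stated assumption."""
    if n_ports % n_groups != 0:
        raise ValueError("N must be divisible by G")
    first_stage_side = alpha * k                     # larger side of an alpha x alpha*K device
    second_stage_side = alpha * n_ports // n_groups  # ASSUMED alpha*N/G ports per device
    return max(first_stage_side, second_stage_side)


print(max_crosspoint_ports(n_ports=512, n_groups=16, k=8, alpha=2))  # 64
```

Under this reading, increasing G shrinks the second term, so the partition granularity is the main lever for keeping each device within off-the-shelf port counts.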
III. FMWM/OCF STABILITY ANALYSIS
In this section we derive the necessary conditions for the switch supporting a single class of service to be stable.
With reference to figure 1, let $Q_{ij}(t)$ denote the VOQ size at input $i$ holding packets destined to output $j$ at time $t$. Let us also define the corresponding random arrival process, $A_{ij}(t) \in \{0, 1\}$, with a mean (normalized) rate of packet arrivals from input $i$ to output $j$, $E[A_{ij}(t)] = \lambda_{ij} < 1$.
Since the switch is equally partitioned into G independent groups and each group supports its own non-blocking paths, stability analysis focused on any particular group can be easily extended to all other groups with minor modifications, as will be described later. Throughout this paper, we consider a simple FMWM/OCF scheduling algorithm pertaining to the gth group. The algorithm consists of an iterative process whereby during each iteration the maximal weight among the currently contending set of nodes is found, and a match is registered between the corresponding input-output pair. An iteration example is depicted in figure 2. Upon matching an input to an output, the respective input and output pair is removed from contention during subsequent iterations (shown in scenario 1 of figure 2). Alternatively, only the associated output is removed from future contention, as illustrated in scenario 2 of figure 2, allowing other inputs from the same group to be matched to available outputs. Assuming the weight matrix is not completely null, the number of iterations can range between 1 and N/G.
Configuration of the crosspoints, determined by the FMWM/OCF algorithm, can be represented by a service matrix, $S(t) = \{S_{ij}(t)\}$, where $S_{ij}(t) = 1$ if input $i$ is matched to output $j$ at time $t$, and $S_{ij}(t) = 0$ otherwise. Based on the weights of the queues, a schedule is obtained which remains unchanged for $k$ consecutive time slots. A new schedule takes effect only at time $t + k$, reflected by $S_{ij}(t + k)$. Consider the Lyapunov function $L(t)$ [7], such that $L(t) = \sum_{i,j} \lambda_{ij} T_{ij}^2(t)$ [8], [9], [5]. As an expression of a $k$ time slot lag, we write

$$L(t+k) - L(t) = \sum_{i,j} \lambda_{ij} \left[ T_{ij}^2(t+k) - T_{ij}^2(t) \right]$$
By partitioning the above into the cases $Q_{ij}(t) < \eta k$ and $Q_{ij}(t) > \eta k$, we obtain the following:

[...] for $Q_{ij}(t) > \eta k$, and

$$T_{ij}^2(t+k) - T_{ij}^2(t) \le k^2 + 2kT_{ij}(t) \le k^2 + 2\eta k^2 T_{ij}(t) \qquad (8)$$

[...] $\eta k\, S_{ij}(t)\, T_{ij}(t) + \sum_{i,j} 2\left(\lambda_{ij} k^2 + \eta k^2\right)$ [...] $S^{g}_{\mathrm{FMWM/OCF}}(t),\ Q^{g}(t)$ [...] $+\, C$ [...]

We focus our attention on the two basic scenarios described above, as illustrated in figure 2:

Scenario 1: For each matching generated, its respective input and output pair is removed from contention during subsequent iterations. Hence, $\eta \geq 2$ is sufficient to guarantee stability.
Scenario 2: For each matching generated, only the associated output is removed from future contention. In this case, we remove the restriction of only one VOQ being matched per ingress port, such that there can now be up to $\alpha$ VOQs matched per ingress port.
We extend the stability analysis devised thus far to address scenario 2, i.e., the case in which up to $\alpha > 1$ VOQs can be matched in each ingress port during every schedule (note that $\alpha$ is a fixed number; in each schedule different input ports may have different numbers of actual matches, but the number of matches cannot exceed $\alpha$). A schedule here comprises multiple rounds/iterations, each of which produces one input-output matching. As [...]

Proof: Let us first briefly review the matching process. At the beginning of each schedule interval, the VOQ with the largest weight is selected; all VOQs with the same output as that chosen are then removed from subsequent contention rounds. Next, the scheduler chooses the VOQ with the largest weight value among those in the current contention list, and then removes all VOQs with the same output as the one chosen. The scheduler also checks whether the number of matches along the same input has reached $\alpha$. If so, it removes all VOQs in the same input from future contention. This process is repeated until no more matchings can be made.
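The matching process reviewed above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name is hypothetical, weights are assumed to be non-negative numbers with zero marking an empty VOQ (under OCF the weight would be, for example, the age of the oldest cell), and ties are assumed to be broken in favour of the lowest input and output index.

```python
def greedy_frame_match(weights, alpha=1):
    """Greedy maximal-weight matching for one group and one frame.

    weights[i][j] is the weight of the VOQ at input i for output j.
    Each output is matched at most once, each input at most `alpha` times.
    Returns the matches in the order they were made, as (input, output) pairs.
    """
    n_in = len(weights)
    n_out = len(weights[0]) if weights else 0
    free_inputs = set(range(n_in))
    free_outputs = set(range(n_out))
    per_input = [0] * n_in
    matches = []
    while True:
        best = None
        # find the largest weight among the VOQs still in contention
        for i in sorted(free_inputs):
            for j in sorted(free_outputs):
                w = weights[i][j]
                if w > 0 and (best is None or w > best[0]):
                    best = (w, i, j)
        if best is None:
            break  # no more matchings can be made
        _, i, j = best
        matches.append((i, j))
        free_outputs.discard(j)        # the chosen output leaves contention
        per_input[i] += 1
        if per_input[i] >= alpha:      # the input has reached its match limit
            free_inputs.discard(i)
    return matches


w = [[5, 4, 0],
     [3, 0, 2],
     [0, 1, 0]]
print(greedy_frame_match(w, alpha=1))  # [(0, 0), (1, 2), (2, 1)]
print(greedy_frame_match(w, alpha=2))  # [(0, 0), (0, 1), (1, 2)]
```

Setting `alpha=1` corresponds to scenario 1, while `alpha=2` lets input 0 claim two outputs at the expense of input 2. In a frame-based scheduler the returned matches would be held unchanged for k time slots.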
Without loss of generality, assume that a schedule produces a total of i matches in a given interval/frame. Hence 
IV. SIMULATION RESULTS
In order to evaluate the performance of the FMWM/OCF algorithm under the proposed multi-crosspoint architecture, three sets of simulations were carried out. In all cases, a 12 x 12 switch was considered with a transfer speedup of 2. The switch was partitioned into 4 independent switching groups, each of which supported 3 ingress ports. In the first three sets of simulations $\alpha = 1$ (i.e., the transfer speedup is 2).
The first set of simulations was targeted at examining the impact of bursty traffic on the delay characteristics. A two-state Markov-modulated (ON/OFF) process was employed [10], whereby bursts are uniformly distributed across the outputs. Figure 3 shows the average delay as a function of the mean burst size (MBS) for a fixed frame size of 8 packets. An inverse relationship between the MBS and the average delay is observed. Since the FMWM scheduling discipline is inherently correlated, bursty traffic better utilizes the transmission intervals.
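A two-state ON/OFF source of this kind can be sketched as follows. The details are assumptions made for illustration, since the paper does not give them: ON periods are geometrically distributed with mean equal to the MBS, OFF periods are geometric (possibly of zero length) with a mean chosen to meet the target load, and every cell of a burst goes to one uniformly chosen output. The function name and parameters are hypothetical.

```python
import random


def on_off_arrivals(n_slots, n_outputs, mean_burst, load, rng):
    """Per-slot destinations (or None for an idle slot) for a single input."""
    if not 0.0 < load <= 1.0 or mean_burst < 1:
        raise ValueError("need 0 < load <= 1 and mean_burst >= 1")
    p_stop = 1.0 / mean_burst                     # a burst ends after each cell w.p. p_stop
    mean_idle = mean_burst * (1.0 - load) / load  # mean OFF length that yields the load
    p_wake = 1.0 / (1.0 + mean_idle)              # OFF period ends w.p. p_wake per trial
    slots = []
    while len(slots) < n_slots:
        # OFF period: idle slots until the source wakes up (may be zero slots)
        while len(slots) < n_slots and rng.random() >= p_wake:
            slots.append(None)
        # ON period: one cell per slot, all destined to the same output
        dest = rng.randrange(n_outputs)
        while len(slots) < n_slots:
            slots.append(dest)
            if rng.random() < p_stop:
                break
    return slots


trace = on_off_arrivals(20, n_outputs=12, mean_burst=8, load=0.5,
                        rng=random.Random(1))
print(trace)
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing delay curves across frame sizes.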
In the second set of simulations, the FMWM/OCF algorithm was allowed to make up to 2 matches per ingress port and the transfer speedup was correspondingly reduced to 1.5. The arrival process was Bernoulli i.i.d. with uniformly distributed destinations. Figure 4 depicts the average delay measured for different frame sizes and shows that despite the relaxed switching times and distributed passive crosspoint switches, the overall performance is kept high.
V. CONCLUSIONS
This paper presents a novel scalable multi-crosspoint packet switching architecture coupled with a frame-based scheduling algorithm for routers with large port densities and high-speed line rates. It has been shown that the architecture can guarantee 100% throughput for a broad class of traffic scenarios. By equally partitioning an N x N CIOQ switch into multiple independent switching groups, the timing requirements imposed on the FMWM/OCF algorithm are substantially reduced. Moreover, by reconfiguring the crosspoint switches once every several time slots, it is possible to significantly relax the timing constraints imposed on the scheduling algorithm. Compared with other architectures targeting high-end routers, the proposed multi-crosspoint architecture is scalable, easy to implement, and does not entail complex packet processing or reordering.
