ABSTRACT. As the next generation network begins to incorporate Internet, telecommunication and TV services, it is becoming one of the most critical infrastructures of our society. Routers form the skeleton of the network. Their kernel, the structure and configuration (scheduler) of the switching fabric, dominates the network's performance, scalability, reliability and cost. Based on the research in [1], we proposed an interleaved architecture of multistage switching fabrics in [2] that meets the requirements of next generation routers. In this paper, we first assess its performance with a theoretical model that complements our simulation results in [2]. Moreover, the interleaved fabrics show great tolerance against internal hardware failures. Based on these properties, we propose the RAIF (Redundant Array of Independent Fabrics) architecture for the next generation network, which improves performance and fault tolerance in a manner similar to RAID [3].
Introduction
The advantage of packet switching with statistical multiplexing makes the convergence of the Internet, telecommunication and TV services inevitable. For example, in the past few years, Britain has updated its entire telephone network to the Internet Protocol [4], so there is no longer a technical difference between the telephone network and the Internet in the UK. At the same time, both telecommunication carriers and cable companies provide integrated voice, video and data services to their customers with IPTV [5], which uses the IP network to deliver TV programs. This convergence requires that a large number of line cards be integrated in a single high performance router. However, most present routers are based on a single stage crossbar, whose complexity grows as O(N^2) (N is the fabric size, i.e. the number of inputs/outputs). As a result, these routers support only up to 16×16 interconnections in real applications, as in the Cisco 12000 high-end router.
To address these issues of scalability and high performance, we proposed an interleaved architecture of switching fabrics in [2], shown in Figure 1, which has Y panels of switching fabrics (numbered 1 to Y). Each panel is an N×N I-Cubeout (ICO) network (scalable with a complexity of O(N×log_2 N)) with recirculation to the last copy of the fabric, as shown in Figure 2 [1]. The fabric is composed of b×2b switching elements (SEs), each of which has b remote outlets that connect to the next stage and b local outlets that terminate cells from the switching fabric at the destination queues. Adjacent stages are interconnected according to the indirect n-cube connection pattern, so that the local SE scheduler simply follows the shortest path algorithm with self-routing, as shown in [2]. If cells fail to reach their destination queue by the primary output (the output of the last stage), they re-enter the last copy of the next panel, in modular order, by recirculation (the last copy is used to avoid intensive collisions at the previous copies). Thus the recirculation flows of panel i go to panel ((i mod Y)+1). Because a recirculated cell can obtain availability information directly from the outlet latch indicator of the prior SE, we can choose the first available point (ICO_FA) [1] to feed it into the switching fabric through the multiplexers. As shown in Figure 1, N demultiplexers at the N inputs distribute the input traffic to the panels in synchrony with a clock: at clock cycle t, all input cells enter panel ((t mod Y)+1). Because the multiple switching fabrics are interleaved by the recirculation, the scheme provides another opportunity to balance the traffic, which in turn eases the hot flows after collisions. In [2], we demonstrated the high performance of this scheme by simulation under uniform and hot-spot traffic. In Section 2, we analyze its throughput with a theoretical model for more general cases.
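The two modular rules above (input distribution by clock cycle and recirculation to the next panel) can be illustrated with a minimal Python sketch. The function names `input_panel` and `recirculation_panel` are illustrative choices and do not come from [1] or [2]; only the two formulas are taken from the text.

```python
def input_panel(t: int, num_panels: int) -> int:
    """Panel (numbered 1..Y) that receives all cells arriving at clock cycle t."""
    return (t % num_panels) + 1


def recirculation_panel(i: int, num_panels: int) -> int:
    """Panel (numbered 1..Y) that receives the cells recirculated from panel i."""
    return (i % num_panels) + 1


# Example with Y = 3 panels (an arbitrary value chosen for illustration).
Y = 3
print([input_panel(t, Y) for t in range(6)])                     # [1, 2, 3, 1, 2, 3]
print({i: recirculation_panel(i, Y) for i in range(1, Y + 1)})  # {1: 2, 2: 3, 3: 1}
```

With Y = 3, successive clock cycles visit the panels in round-robin order, and the recirculation flows form a ring 1 → 2 → 3 → 1, so every panel both feeds and is fed by exactly one neighbor.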
Besides the scalability and high performance of the interleaved architecture, its fault tolerance has so far been underestimated. We evaluate its reliability under different fault models in Section 3. Based on Sections 2 and 3, we introduce the concept of RAIF in Section 4. Finally, Section 5 concludes the paper.
Analytical model analysis
In the last section, we introduced the interleaved multistage switching fabrics. Throughput, or equivalently cell drop rate (throughput = 1.0 − cell drop rate), is one of the most important parameters for evaluating a switching fabric. An important issue is therefore to determine how many panels are enough for a real system to obtain good throughput at a reasonable hardware cost. Here, we address this issue with an analytical model under uniform traffic, and we corroborate the model's validity with simulation.
Analytical model for the single panel fabric
In general, it is very difficult to analyze the interleaved multistage switching fabrics if the load is non-uniform across panels or stages. For modeling simplicity, we assume that the traffic is evenly loaded across the panels and that the traffic passing from each stage to the next is uniformly distributed over the ports. Moreover, we assume that the SEs have no buffers on the inter-stage links. With these modeling assumptions, the complex switching system can be decomposed into relatively independent single panels; consequently, we only need to analyze the throughput of one of them.
We use a recursive method to obtain the analytical model for the single panel fabric, as in [6, 7, 8]. The load to stage (k+1) is computed from the load that is not transferred to the output queues at stage k. So if we know the random load entering the first stage, we can compute the load to each stage through the whole fabric. Throughout the paper, the notation in Table 1 is used for the analytical model:
Table 1. Notation used in this paper.
Notation | Explanation
p | The load from outside the fabric, i.e. the probability that a cell is generated at an input port during each cycle. So F_0 = p.
π | The cell drop rate of the fabric.
With the ICO_FA mechanism, cells are dropped if none of the recirculation points is available. In order to calculate π, we need to compute the load to each stage as follows:
To calculate P_{X-n+1} to P_X with recirculation, we use a recursive expression of the form:
Moreover, each P_k for 1≤k≤X+1 is composed of q_{k,d} with:
To evaluate (2)-(5), it is necessary to compute O_k first. A tagged cell in stage k can exit the fabric only if its distance to the destination queues becomes 0. Furthermore, one of the following conditions must be met: (i) the tagged cell is the only one requiring a local outlet of the SE, or (ii) more than one cell requires a local outlet of the SE, but the tagged cell is chosen over the others. Then
Apart from the tagged cell at one SE inlet, there are (b-1) other SE inlets, of which h have cells that require local outlets (distance 0), m-h have cells that require remote outlets to the next stage (nonzero distance), and (b-1-m) have no cell in this cycle. V(h) is the probability that the tagged cell is chosen in the conflict that may occur if some of the h cells require the same local outlet.
To propagate the load from P_k to P_{k+1}, we need the conditional probabilities to compute q_{k+1,d} from the q_{k,d} distribution for d=0, 1, …, n-1:
P{q_{k+1,d} | q_{k,j}} is the conditional probability that a cell has distance d in stage k+1 after it has been switched from stage k, where it had distance j. Because the SE's local scheduler uses the shortest path algorithm with a deflection scheme, most of the P{q_{k+1,d} | q_{k,j}} parameters are zero. Depending on the value of d, three cases, (11)-(13), are distinguished for (10):
In (11), (q_{k,0} - O_k) are the flows that failed to reach their destination queues due to collisions but still have zero distance at the next stage along the shortest path. In (13), q_{k+1,n-1} collects all the deflected flows with distance n-1. Because we assume that the traffic passing from each stage to the next is uniformly distributed, we ignore the effects of (
Apart from the tagged cell at one SE inlet, there are (b-1) other SE inlets, of which h have cells that require remote outlets with distance j+1 and m have cells that require remote outlets with distance from 1 to j.
T(z,i) is the probability that z inlets of the SE have the following conditions: (1) 
) is the probability that the tagged cell is deflected, if m cells with lower distance and h cells with equal distance are switched to next stage by the SE:
From (7) to (16), we can compute the load of each stage without recirculation. For the last copy of the fabric, from stage (X-n+1) to X, we must also count the recirculation load from the final stage, P_{X+1}. So, based on (4) and (11)-(13), the distributions for 0≤j≤n-1 when X-n+1≤k≤X are computed:
In (17), the q_{k,j} on the right-hand side comes from the previous stage and is calculated with equations (11)-(13). The second term is the recirculation load. The coefficient e_{k,j,i} is determined by the fabric structure (N and b) and the recirculation point. For each stage k with X-n+1≤k≤X, the coefficients [e_{k,j,i}] form a two-dimensional matrix with row j and column i, 0≤j, i≤n-1. As an example, when N=256, b=4 and n=4, the four coefficient matrices are shown below:
Based on equations (2)-(17), we compute P_k, q_{k,d} and O_k recursively until they become steady. Then, with (6), the drop rate π (or throughput) of the single panel fabric is obtained.
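The idea of sweeping the stages repeatedly until the loads become steady can be illustrated with a deliberately simplified Python sketch. It is a toy, not the model of equations (2)-(17): the per-stage delivery fraction `deliver` is a hypothetical constant standing in for O_k, distances are not tracked, all recirculated load enters at stage X-n+1 only, and any load above the link capacity of 1.0 is counted as dropped. The resulting numbers therefore have no relation to the results in Figure 3.

```python
def sweep(p, recirculated, num_stages, n, deliver):
    """One pass through the panel; returns (stage loads, residual load, dropped load)."""
    entry = num_stages - n + 1          # first stage of the last copy (1-based)
    loads, dropped = [], 0.0
    carried = p                         # external load entering stage 1
    for k in range(1, num_stages + 1):
        offered = carried + (recirculated if k == entry else 0.0)
        load = min(1.0, offered)        # a link carries at most one cell per cycle
        dropped += offered - load       # toy rule: the excess is dropped
        loads.append(load)
        carried = load * (1.0 - deliver)  # load not delivered moves to the next stage
    return loads, carried, dropped


def steady_state(p, num_stages=4, n=4, deliver=0.5, tol=1e-12, max_iter=10_000):
    """Repeat the sweep until the recirculated load stops changing."""
    recirculated = 0.0
    for _ in range(max_iter):
        loads, residual, dropped = sweep(p, recirculated, num_stages, n, deliver)
        if abs(residual - recirculated) < tol:
            return loads, dropped / p   # toy drop rate, relative to the offered load
        recirculated = residual
    raise RuntimeError("loads did not converge")


loads, drop_rate = steady_state(0.6)
print(loads[0], drop_rate)
```

The structure mirrors the procedure in the text: the residual of the final stage is fed back as the recirculation load of the next sweep, and the iteration stops once that feedback term no longer changes.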
Analytical model for the interleaved switching fabrics
We introduced an analytical model for the single panel fabric in the last section and assumed that the traffic is evenly loaded across the panels. Thus, each panel runs at load (
Validation of the analytical model
The accuracy of the analytical models in Sections 2.1 and 2.2 has been assessed by comparing their results with those from simulations of the system. Figure 3 shows the cell drop rate under uniform traffic with N=256, b=4 and X=4. In this paper, we use the notation SX/PY, where X specifies the number of stages per panel and Y the number of panels. We choose N=256, b=4 and X=4 because this is one full copy of the fabric, which makes the results more pronounced.
Figure 3 shows a difference between the analytical model and the simulation results for S4/P1 at loads 0.3<p<0.9. As mentioned in [6, 7], the basic reason is that the traffic between the stages is unbalanced, so the assumption of uniform distribution no longer holds. When the load is very low (p<0.3), the traffic between stages can still be considered uniform. When the load is high enough (p>0.9), all inter-stage links are saturated with traffic from the previous stage or from recirculation, so the traffic between stages can again be considered uniform when viewed from outside. As a result, the diagram shows a satisfactory match between model and simulation in these two load ranges. For S4/P2 and S4/P3, the lower load on each panel, the doubled or tripled recirculation paths, and the interleaved connections make the traffic between stages uniform. Thus, there is a good match between the models and the simulation.
Fault tolerance of the interleaved architecture
A critical design aspect of high performance routers is their reliability. Though the Internet itself is designed to tolerate the failure of some router nodes, the loss of core routers still causes considerable congestion at other routers because of unbalanced traffic. Moreover, subnets that connect through the failed node become unreachable. At the same time, VLSI is moving into the nanometer range, which provides faster systems and higher integration. However, the devices suffer from extreme process variation, particle-induced transient errors, and transistor wear-out. In the near future, it is unlikely that faults in VLSI systems can be avoided.
In our interleaved switching fabrics, the parallel architecture already has built-in redundancy, which in turn provides fault tolerance. Our scheme treats a faulty element in a similar fashion to a hot congestion area and deflects the traffic away from it. Both link and SE failures can occur inside the switching fabrics. SE hardware is much more complex and, therefore, more prone to faults than the internal link connections. An internal link fault can also be modeled as an SE fault, since a faulty link renders the following SE unusable. Thus, in this paper, we use the SE fault model to evaluate the detrimental effects of faults on system performance.
Following the simulation models in [1, 2], we choose a fabric speedup of ξ=2, and each SE output queue (either local or remote) is equipped with a 12-cell buffer. For all results, 200,000 system clocks are simulated, which is long enough to reach steady state.
Single fault model
In the single fault model, one SE is faulty and cannot accept any cells. Thus, the cells that need to pass through the faulty SE are deflected in the prior stage. If the faulty SE is located in the last copy of the fabric, the recirculated cells need to skip this faulty point under the ICO_FA approach. We have observed that the stage location of the faulty SE, rather than its row location, determines the performance degradation. Thus, in our simulations the faulty SE is placed in the same row but in different stages. For the results reported here, we have chosen row 31 (and different stages) for symmetry. Because hot-spot traffic saturates part of the fabric along its path, its effect would depend on the location of the faulty SE. Thus, we use only uniform traffic for the fault tolerance tests throughout the paper.
In this section, we compare how a fault impacts the performance of the single panel and interleaved double panels; both single and double panels have the same length of 6 stages. 
[Plot: mean latency vs. offered load. Curves: S6/P1 (No Fault), S6/P1 (S1), S6/P1 (S3), S6/P1 (S4), S6/P1 (S6), S6/P2 (No Fault), S6/P2 (S1), S6/P2 (S3), S6/P2 (S4), S6/P2 (S6).]
[Plot: drop rate vs. offered load. Curves: S6/P1 (No Fault), S6/P1 (S1), S6/P1 (S3), S6/P1 (S4), S6/P1 (S6), S6/P2 (No Fault), S6/P2 (S1), S6/P2 (S3), S6/P2 (S4), S6/P2 (S6).]
Figure 5. Drop rate vs offered load for single fault test.
The fault location determines the degradation of S6/P1, especially for the drop rate. In Figure 5, if the faulty SE is in the first stage, the drop rate starts from 1.59%. There are 64 SEs in total in each stage, and 1/64 = 1.56%, so one faulty SE means that 1.56% of the input traffic is lost immediately without being switched through the fabric. Thus, the simulation results match the theoretical value and confirm that the first stage is the most critical one for the single panel fabric.
Another critical stage of S6/P1 is the first stage of the last copy, which coincides with the first entrance point of the FA approach. If the fault occurs in this stage, both the latency and the drop rate deteriorate noticeably, with a 3.23% drop rate as opposed to 1.97% in the fault-free situation. Finally, the performance is insensitive to a fault in the last stage of S6/P1, since most cells have already been switched to their destination queues in prior stages.
As expected, a faulty SE has a negligible impact on the performance of S6/P2 regardless of its location. The interleaved fabrics in parallel not only substantially enhance the performance but also tolerate a single hardware failure. Even for the fault in the first stage that is fatal for S6/P1, S6/P2 still gives the inputs another chance to divert the flows into the fabrics.
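The 1/64 figure for a first-stage fault can be reproduced with a short Python calculation. The sketch assumes N=256 and b=4, the values used in the validation section, which are consistent with the 64 SEs per stage stated here; the helper name `first_stage_loss` is illustrative only.

```python
def first_stage_loss(num_ports: int, b: int, faulty_ses: int = 1) -> float:
    """Fraction of input traffic cut off by faulty SEs in the first stage.

    Each first-stage SE serves b of the N inputs, so a stage has N / b SEs
    and each faulty SE removes 1 / (N / b) of the uniformly offered traffic.
    """
    ses_per_stage = num_ports // b
    return faulty_ses / ses_per_stage


loss = first_stage_loss(256, 4)   # assumed N = 256, b = 4
simulated = 0.0159                # starting drop rate of S6/P1 (S1) reported for Figure 5
print(f"analytical: {loss:.4%}  simulated: {simulated:.2%}  gap: {simulated - loss:.4%}")
```

The gap between the analytical value and the simulated starting point is a few hundredths of a percentage point, which is the basis for saying that the two match.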
Multiple fault model
The multiple fault model is considerably more complicated than the single fault model because of the many combinations of fault numbers and locations. However, faults in the identical stage of different copies create a switching bottleneck and make the performance deteriorate significantly, since all of these stages correct the same position of the tag bits. The simulations in Figures 6 and 7 therefore depict this situation. As before, we specify the fault location in parentheses.
[Plot legend: S12/P1 (S4+S8+S12), S6/P2 (No Fault), S6/P2[(S2+S6)/P1+S2/P2], S6/P2[(S2+S6)/(P1+P2)], S6/P2[S3/(P1+P2)], S6/P2[(S2+S6)/(P1+P2)]-double, S6/P2[S3/(P1+P2)]-double.]
Figure 6. Mean latency vs offered load for multiple faults test.
As can be observed in Figures 6 and 7, S8/P1(S4+S8) and S12/P1(S4+S8+S12) are considerably affected by this fault model. First, the added latency pulls their curves above the fault-free ones. Second, their drop rate jumps from 9.8×10^-8 in the fault-free case to 3.4% at full load p=1.0. However, the redundant stage in the first copy of S12/P1(S8+S12) still provides the capability to tolerate faults in stages 8 and 12.
For S6/P2, stages 2 and 6 (a total of 4 affected stages) correct the second position of the tag bits. It is reasonable that one or more redundant stages in S6/P2[(S2+S6)/P1] and S6/P2[(S2+S6)/P1+S2/P2] compensate for the faults with only slight degradation in latency and drop rate. Stage 3 of S6/P2 (a total of 2 affected stages) corrects the third position of the tag bits. If both of them fail, one could expect performance as poor as that of S8/P1(S4+S8) and S12/P1(S4+S8+S12) above. However, it is important to notice that both S6/P2[S3/(P1+P2)] and S6/P2[(S2+S6)/(P1+P2)] cause a negligible increase in drop rate, 0.086% as opposed to 0.013% in the fault-free case.
To understand why the interleaved architecture is fault tolerant, we first consider the two points where cells are dropped. One is in the SE itself, when all local buffers are full. For example, if the SE at row 0, stage 2 in Figure 2 fails, all flows from inputs 0 and 4 will merge and go through port 4 of the SE at row 0, stage 1. The intense flows fill the local buffers quickly and make cell drops unavoidable. The second point is at the FA entrance points: if none of the FA entrance points is available, the recirculated cells are dropped. Compared with the single panel architecture, S6/P2 first halves the traffic to each panel and then doubles the FA recirculation points, which broadens the switching path and mitigates collisions. Although correcting the deflected cells of S6/P2[S3/(P1+P2)] increases the latency slightly, to 6.6 cycles as opposed to 4.7 cycles in the fault-free case as shown in Figure 6, this latency is still far below that of its counterpart S12/P1. Moreover, with a total of four stages that correct the second position of the tag bits, S6/P2[(S2+S6)/(P1+P2)] still achieves remarkable performance with low latency and drop rate.
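The stage counts quoted above can be checked with a small Python sketch. It assumes that stage k corrects tag position ((k-1) mod n)+1 with n=4; this mapping is an inference from the examples in the text (stages 2 and 6 share the second position, stage 3 has the third, stages 4, 8 and 12 share one position) and is not a formula stated in the paper. The names `tag_position` and `stages_correcting` are illustrative.

```python
def tag_position(stage: int, n: int = 4) -> int:
    """Tag-bit position (1..n) corrected by a stage, under the assumed mapping."""
    return ((stage - 1) % n) + 1


def stages_correcting(position: int, stages_per_panel: int, panels: int, n: int = 4):
    """All (panel, stage) pairs in an SX/PY fabric that correct the given position."""
    return [
        (panel, stage)
        for panel in range(1, panels + 1)
        for stage in range(1, stages_per_panel + 1)
        if tag_position(stage, n) == position
    ]


# S6/P2: four stages correct position 2, but only two correct position 3.
print(stages_correcting(2, 6, 2))   # [(1, 2), (1, 6), (2, 2), (2, 6)]
print(stages_correcting(3, 6, 2))   # [(1, 3), (2, 3)]

# Stages with no faulty SE left for position 2 in S6/P2[(S2+S6)/P1].
faults = {(1, 2), (1, 6)}
print([s for s in stages_correcting(2, 6, 2) if s not in faults])  # [(2, 2), (2, 6)]
```

Note that a fault here means a single faulty SE within the stage, so a stage listed as affected still has working SEs in its other rows; the sketch only counts stages and says nothing about the resulting drop rate.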
Furthermore, we have performed simulations of some extreme cases that double the faults, from row 28 to row 35, to a total of 8 faults in specific stages (Figures 6 and 7).
[Plot: mean latency vs. offered load. Curves: S6/P2[(S2+S6)/(P1+P2)] and S6/P2[S3/(P1+P2)], each with suffixes -4, -8, -16, -32 and -64.]
From the simulations and analysis in [2] and the sections above, S6/P2, as a good example of the interleaved architecture, shows better performance and a much stronger capability to tolerate internal hardware failures than the single panel architecture. Specifically, each panel in S6/P2 is a switching board that is relatively independent, as mentioned in Section 2. Even in the worst case scenario where one panel is broken, the other panel in S6/P2 allows the router to continue running with some performance degradation. In the case of S12/P1, on the other hand, the whole system malfunctions. Moreover, inspired by RAID (Redundant Array of Independent Disks) technology [3], we could build a Redundant Array of Independent Fabrics (RAIF) by upgrading S6/P1 with more panels in parallel. Each switching panel in RAIF then works much like a hard disk in RAID. The extra panels could work as RAIF 0 (similar to RAID level 0): working in parallel, as in S6/P2. Alternatively, one additional panel could work as RAIF 1 (similar to RAID level 1): it stands by, which lowers power consumption while there is no fault, and it replaces a malfunctioning panel when a fault occurs. Combining RAIF 0 and 1, we could build RAIF 2 with Y panels (Y>2): Y-1 panels working in parallel and one panel standing by for fault tolerance.
In general, RAIF provides flexible scalability with fault tolerance and graceful performance degradation.
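The three RAIF configurations can be summarized in a minimal Python sketch of panel bookkeeping. It is only an illustration of the description above: the function names are hypothetical, RAIF 1 is assumed to mean exactly two panels (one active, one standby), and the failover rule (promote the standby panel when an active panel fails) is the behavior described in the text reduced to list operations.

```python
def raif_layout(level: int, num_panels: int):
    """Return (active, standby) panel lists for the assumed RAIF levels 0, 1 and 2."""
    panels = list(range(1, num_panels + 1))
    if level == 0:                       # all panels switch traffic in parallel
        return panels, []
    if level == 1:                       # assumed: one active panel, one standby
        if num_panels != 2:
            raise ValueError("RAIF 1 is modeled here with exactly 2 panels")
        return panels[:1], panels[1:]
    if level == 2:                       # Y-1 panels in parallel, one standby
        if num_panels <= 2:
            raise ValueError("RAIF 2 needs Y > 2 panels")
        return panels[:-1], panels[-1:]
    raise ValueError("unknown RAIF level")


def handle_failure(active, standby, failed):
    """Remove a failed panel and promote a standby panel if an active one was lost."""
    was_active = failed in active
    active = [p for p in active if p != failed]
    standby = [p for p in standby if p != failed]
    if was_active and standby:
        active.append(standby.pop(0))
    return active, standby


active, standby = raif_layout(2, 4)
print(active, standby)                        # [1, 2, 3] [4]
print(handle_failure(active, standby, 2))     # ([1, 3, 4], [])
print(handle_failure(*raif_layout(0, 2), 1))  # ([2], [])
```

In the RAIF 2 example the number of active panels is unchanged after the failure, whereas in RAIF 0 the fabric keeps running on the remaining panel with reduced capacity, which corresponds to the graceful degradation mentioned above.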
Concluding Remarks
In [2], we presented a novel architecture of interleaved switching fabrics for scalable high performance routers. In this paper, we present an analytical model to assess its throughput. The benefit of the interleaved architecture comes from avoiding the steep non-linear increase that the single panel fabric exhibits over the load range 0.5≤p≤1. The simulations in [2] corroborate the analysis here.
Moreover, our scheme treats a faulty element in a similar fashion to a hot congestion area and deflects the traffic away from it. Extensive simulations under different fault models have revealed that the interleaved multistage switching fabrics are highly tolerant of internal hardware failures, which the single panel fabric is not. In the worst case, the single panel fabric drops all packets when faults are located at the first stage. In general, it is possible to build a reliable, scalable, high performance switching system using a Redundant Array of Independent Fabrics (RAIF) scheme, in a similar fashion to RAID [3].
