The switching fabric in the Avici TSR uses a 3-D torus topology in which each line card carries one node of the torus. It requires no centralized switching fabric, but its scalability is limited by the bisection bandwidth. This paper proposes a novel architecture called the Cellular Router (CR). The basic CR architecture has some problems, which are solved by introducing Mirror Points (MPs). This paper also presents the design of line cards for this architecture and, finally, introduces the design of routing algorithms for it. The CR architecture shows excellent scalability and fault tolerance and is a promising choice for the design of scalable routers.
YUE Zu-Hui et al.: CR: Scalable routers based on a new architecture

Historically, switching fabrics based on buses and crossbar switches were widely used in routers. It is well known that buses cannot be scaled to support high bit rates because of their limited bandwidth. For small numbers of ports, the crossbar is often selected as the switch topology, owing to its simplicity and non-blocking property. However, its cost grows as the square of the number of ports, so it cannot be economically scaled to a large number of ports.
Multistage switching fabric architectures can handle moderate or large numbers of ports. Researchers have been studying such topologies since the days of electromechanical telephony [2]. The Banyan network [3] has a low cost of N·logN (N is the number of ports) and a lot of paths, but it suffers from internal blocking. The Benes network [4] features a low cost of N·2logN and is free of internal blocking. However, the Benes network is only rearrangeably non-blocking, so setting up new connections may destroy existing ones. The Clos network [5] is useful in practice and stimulating in research, but some theoretical problems with the Clos network remain unsolved. The Load-Balanced switch architecture [6] is considered promising because it eliminates the centralized scheduler and can be realized with optics [7]. However, a router based on this switch architecture cannot be built with the technology available today.
A router implemented with any of the switches listed above contains a centralized switching fabric. This greatly limits the scalability of the router, and the centralized switch becomes a Single Point of Failure (SPF).
Most of these switches also need complicated schedulers. For example, Maximum Weight Matching (MWM) schedulers, such as LQF and OCF [8], can achieve 100% throughput asymptotically under any admissible workload, uniform or not, with no speedup needed. However, they have a high computational complexity of O(N^3) and hence are infeasible in practice.
Although some other algorithms have lower computational complexity, their cost still depends on the number of ports. That is, as the number of ports increases, the number of time slots used increases nonlinearly.
Interconnection networks were originally used as switches for processor-memory interconnects and for I/O interconnects. Later, interconnection networks based on the 3-D torus topology [9] were used as router fabrics in the Avici TSR [10]. In the Avici TSR switching fabric, each line card carries one node of the torus, and there are many optional paths between a source node and a destination node. This design offers some good properties [10]: economical scalability, incremental extensibility, load balance, fault tolerance, non-blocking behavior and low bounded delay for CBR traffic. Although the 3-D torus topology shows good scalability, its implementation in the TSR limits the number of line cards to at most 560 [10]. Adding more line cards cannot improve the bisection width of the TSR.
The Cellular Router (CR) introduced in this paper is a new architecture that can be used to scale routers. The number of line cards is not limited, and there are more optional paths between each source-destination pair. With some modifications, this architecture provides better fault tolerance than the 3-D torus topology.
The rest of this paper is organized as follows. Section 2 presents the basic CR architecture. In Section 3, we give a brief introduction to the line card design. The basic CR architecture shows poor fault tolerance, and some improvements are described in Section 4. Section 5 introduces several routing algorithms that can be applied to the CR architecture. Section 6 presents a simulation system and some results obtained from it. Section 7 summarizes our work and discusses future work.
Introduction of the Basic CR Architecture
In this section, we will introduce the basic CR architecture. This architecture is motivated by some observations and shows good scalability. When we build scalable routers with this architecture, all the active components of the fabric are carried on the line cards and the line cards are connected only with data channels.
Motivations
The basic CR architecture is motivated by the following three observations. However, these equilateral triangles are well organized. If we only make use of some links as shown in Fig.2(c), the modified hexagons can cover the squares case. In an interconnection network, if we increase the number of links, the degree of each node increases accordingly. This means that there are more channels between nodes, which can improve the fault tolerance of the network. Meanwhile, more paths can also be used to achieve good load balance. If we only consider the 2-D case, the degree of a fully connected node in Fig.1(a), Fig.1(b) and Fig.2(a) is 6, 4 and 6 respectively. The CR architecture provides more links between nodes (line cards) than the 2-D torus architecture.
The basic CR architecture
Now we introduce the basic CR architecture. As line cards are plugged in, they may occupy only one layer in a single rack, several layers in a single rack, or several different racks. Here we describe all three cases.
One layer in a single rack
We first consider the case in which there is only one layer in a single rack. According to the modified hexagon, we can place 7 line cards in one layer (in this layer we can also place some control cards or some redundant line cards, so it is reasonable to place a hexagon in one layer). As shown in Fig.3(a), we number these line cards from 0 to 6. These numbers represent the sequence in which the line cards are plugged into the rack. When a new line card is added, we also add the related link(s) to the existing node(s). Each link is unidirectional, so there are two separate links between any adjacent nodes, one for each direction. We define such a hexagon as a Basic Element (BE). The numbers shown in Fig.3(a) are defined as BE Numbers (BENs).
In practice, the line cards are always placed in parallel. So we can reorganize the line cards in one BE as shown in 
Multi-Layer in a single rack
We can place one BE in each layer and connect the corresponding line cards as shown in Fig.4(a).
The maximum degree of each line card is 8 (here we consider the 3-D case). We reserve two of these links to connect line cards of adjacent layers. In Fig.4(a), to increase the connectivity of the nodes in the topmost and bottommost layers, we can connect the nodes in the topmost layer with the corresponding ones in the bottommost layer (these links are not shown in Fig.4(a)). In the 3-D case there exist two types of polygons: squares and regular hexagons.

Each generation of routers consumes more power than the last, and it is now difficult to package a router in only one rack of equipment. There has therefore been a move towards multi-rack systems. We can see that CR shows good scalability in a multi-rack system.
First, we show how to place the racks. In Fig.5 , each node represents a rack with one layer or multiple layers.
We define the central point as the Polar Point (PP). Six axes in total start from the PP. They are defined as the 0-axis, −π/3-axis, −2π/3-axis, π-axis, 2π/3-axis and π/3-axis respectively. These six axes divide the whole plane into six equal-area regions, which we define as region0, region1, region2, region3, region4 and region5. We first place one rack on the PP. Then we place the second rack on the π/3-axis. After we have finished placing six racks around the PP clockwise, we place other racks around these six racks. In Fig.5, the arrowed solid line starting from the PP shows the order in which the racks are deployed in such a multi-rack system. We define the PP as

In the Avici TSR [10], each line card is assigned a three-coordinate address (

Here we first discuss the BE again. As shown in Fig.7(b), if the central point (line card) is down, the BE changes to a dual ring (because there are two unidirectional links between any adjacent nodes). If one of the remaining nodes also goes down, the dual ring turns into a chain. If another node breaks down, the remaining nodes become disconnected, as shown in Fig.7(d). So the fault tolerance of the BE is poor. The maximum degree of each line card is 6, excluding the two reserved links used for adjacent layers. However, in the BE, the degree of each node except the central point is only 3. When some nodes fail, the remaining nodes are affected greatly.
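The failure sequence described for the BE can be checked with a small connectivity test. The following Python sketch is only an illustration: it assumes the ring nodes are numbered node1 to node6 consecutively around node0, and the particular failed nodes (node0, then node3, then node6) are chosen here for the example rather than taken from Fig.7. Each pair of unidirectional links is modeled as one undirected edge.

```python
from collections import deque

def basic_element():
    """Adjacency of the BE: node0 in the centre, node1..node6 on the ring.
    Ring numbering is an assumption of this sketch."""
    adj = {n: set() for n in range(7)}
    for n in range(1, 7):
        nxt = n % 6 + 1              # ring successor: 1->2, ..., 6->1
        adj[0].add(n)
        adj[n].add(0)
        adj[n].add(nxt)
        adj[nxt].add(n)
    return adj

def degrees(adj, failed):
    """Degree of every surviving node, counting only surviving neighbours."""
    alive = set(adj) - set(failed)
    return {n: len(adj[n] & alive) for n in sorted(alive)}

def is_connected(adj, failed):
    """Breadth-first search over the surviving nodes."""
    alive = set(adj) - set(failed)
    if not alive:
        return True
    start = min(alive)
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nb in adj[cur]:
            if nb in alive and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen == alive

be = basic_element()
print(degrees(be, []))               # centre has degree 6, ring nodes 3
print(degrees(be, [0]))              # ring only: every node has degree 2
print(is_connected(be, [0, 3]))      # chain 4-5-6-1-2 is still connected
print(is_connected(be, [0, 3, 6]))   # {4, 5} and {1, 2} are separated
```

Note that a third failure disconnects the chain only when it hits an interior node of the chain; losing an end node (for example node2 after node0 and node3) leaves a shorter chain.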
Improvements of the basic CR architecture
To improve the basic CR architecture, we design two schemes: Largest Number First and Mirror Points. They can improve the connectivity of edge nodes and reduce the diameter.
Largest number first
We define the set Γ, which consists of the nodes with a Linked Degree (LD) of less than 6, and the set Λ, which consists of the nodes with an LD of 6. In the BE, we choose from Γ the node with the largest number, node6. The Unlinked Degree (UD) of node6 is 3, which means that we can connect node6 with 3 nodes other than node0, node1 and node5. We choose from Γ the nodes that have not yet been connected with node6; in this case, we connect node6 with node2, node3 and node4. Now the LD of node6 is 6, so we move it from Γ to Λ. Then we connect node5 with node1, node2 and node3. Next we connect node4 with node1 and node2. At this point node1 and node3 each still have an LD of 5, so a final link between node3 and node1 is needed. Now Γ is empty and all the nodes have been moved to Λ.
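The walk-through for the BE can be reproduced with a short Python sketch. It is a simplified reading of the procedure, not the authors' implementation: the function names are hypothetical, each pair of unidirectional links is modeled as one undirected edge, and partner nodes are simply taken from Γ in ascending order, which is an assumption that happens to match the BE example. For the BE, the sketch ends by linking node3 with node1, which is what brings the total to 21 links.

```python
MAX_DEGREE = 6   # maximum in-layer degree of a line card

def basic_element():
    """Adjacency of the BE: node0 in the centre, node1..node6 on the ring."""
    adj = {n: set() for n in range(7)}
    for n in range(1, 7):
        nxt = n % 6 + 1
        adj[0].add(n)
        adj[n].add(0)
        adj[n].add(nxt)
        adj[nxt].add(n)
    return adj

def largest_number_first(adj, max_degree=MAX_DEGREE):
    """Add links in place and return them in the order they were added."""
    added = []
    # Gamma: nodes whose Linked Degree is still below the maximum.
    gamma = {n for n in adj if len(adj[n]) < max_degree}
    while gamma:
        node = max(gamma)            # largest number first
        # Partner choice (assumption): unlinked members of Gamma, ascending.
        candidates = sorted(c for c in gamma
                            if c != node and c not in adj[node])
        for c in candidates:
            if len(adj[node]) >= max_degree:
                break
            adj[node].add(c)
            adj[c].add(node)
            added.append((node, c))
            if len(adj[c]) >= max_degree:
                gamma.discard(c)     # c moves to Lambda
        gamma.discard(node)          # node is full or has no partner left
    return added

adj = basic_element()
for a, b in largest_number_first(adj):
    print(f"connect node{a} with node{b}")
print(sum(len(v) for v in adj.values()) // 2, "links in total")
```

With 7 nodes the result has 21 links, which equals both 3n and n(n−1)/2, so the improved BE is fully connected.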
In general, we can design an algorithm, Largest Number First. With this algorithm, we choose the line card with the largest number in Γ. Then we choose from Γ the nodes that are evenly distributed on the boundary and connect them to it. If the LD of any node reaches 6, we move it to Λ. We repeat the procedure until Γ is empty or the graph is already fully connected. This algorithm also applies to irregular CR topologies. In a fully connected graph with n nodes, there are n(n−1)/2 links. In a modified CR architecture with n nodes, there are 3n links. When n is less than 8, the modified CR architecture is fully connected, because 3n ≥ n(n−1)/2 holds exactly when n ≤ 7.
Journal of Software 软件学报 Vol.18, No.10, October 2007
Mirror points
First we introduce the concept of the Envelope. The Envelope is the regular hexagon that encloses the CR architecture. If no nodes lie on the envelope, we define it as the External Envelope. If some nodes lie on the envelope, we define it as the Internal Envelope. As shown in Fig.8(a), the dashed regular hexagon is the External Envelope of the BE. In Fig.8(a), node4′ on the External Envelope is defined as the Mirror Point (MP) of node4 with respect to node0. Node4′ can be connected to node6, node1 and node2. That is, in a practical system we connect node4 with node6, node1 and node2. Similarly, we can connect node1 to node3, node4 and node5 with the help of its MP, and so on, as shown in Fig.8(b). In the end we get a fully connected system, as shown in Fig.8(c).
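The two examples in the text (node4 gains links to node6, node1 and node2; node1 gains links to node3, node4 and node5) suggest a simple rule, sketched below in Python. The rule is our reading of those examples and not a definition from the paper: the MP of a ring node is assumed to reach the opposite ring node and that node's two ring neighbors, with the ring numbered node1 to node6 consecutively. The function names are hypothetical.

```python
def ring(k):
    """Wrap an index onto the ring node1..node6."""
    return (k - 1) % 6 + 1

def mirror_links(k):
    """Ring nodes reached by node k through its mirror point (assumed rule)."""
    opposite = ring(k + 3)
    return {ring(opposite - 1), opposite, ring(opposite + 1)}

def be_with_mirror_points():
    """Undirected link set of the BE after adding all mirror-point links."""
    links = set()
    for n in range(1, 7):
        links.add(frozenset((0, n)))              # spoke to the centre
        links.add(frozenset((n, ring(n + 1))))    # ring link
        for m in mirror_links(n):
            links.add(frozenset((n, m)))          # mirror-point links
    return links

print(sorted(mirror_links(4)))        # nodes reached by node4 via its MP
print(sorted(mirror_links(1)))        # nodes reached by node1 via its MP
print(len(be_with_mirror_points()))   # number of distinct links
```

Under this rule the 7-node BE ends up with every pair of nodes linked, which agrees with the fully connected result of Fig.8(c).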
Minimal Routing Algorithms
Before discussing the minimal algorithms, we first introduce some basic concepts.
Preliminaries
The following discussions describe the routing algorithms that are minimal. That is, they select the shortest path among all the optional paths. We restrict our discussion to 2-D CR architecture and ignore the impact of mirror points. Each link is unidirectional, and we define the length of each link as 1.
Suppose the source node is s and the destination node is d. When we choose the next-hop node on the shortest path, n2 is selected in Fig.9(a), while in Fig.9(b) we can choose either n2 or n3. We call n2 the Right Neighbor (RN) of s and n3 the Left Neighbor (LN) of s for the source-destination pair (s, d). With the minimal routing algorithms, we can always choose the RN of the node at which the cell is currently located. We call this the Right Neighbor First (RNF) algorithm. In the same way, we can design the Left Neighbor First (LNF) algorithm. Although these two algorithms only provide a single path between the source node and the destination node, they are very simple and inexpensive to implement in hardware.
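A minimal Python sketch of RNF and LNF on an unbounded 2-D CR lattice follows. Several things here are assumptions rather than the paper's method: nodes are addressed by hypothetical integer axial coordinates (q, r) instead of the paper's identifiers, mirror points are ignored, and "right" is interpreted as the clockwise side of the direction from the current node to d.

```python
# Six unit steps of the triangular lattice in axial coordinates (q, r),
# listed counterclockwise starting from the 0-axis direction.
STEPS = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]

def hops(a, b):
    """Minimal number of unit links between lattice nodes a and b."""
    dq, dr = b[0] - a[0], b[1] - a[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2

def minimal_next_hops(s, d):
    """Neighbors of s that lie on some shortest path to d (one or two nodes)."""
    neighbors = [(s[0] + dq, s[1] + dr) for dq, dr in STEPS]
    return [n for n in neighbors if hops(n, d) == hops(s, d) - 1]

def side(s, d, n):
    """Negative if n lies to the right of the direction s->d, positive if left.
    This determinant has the same sign as the planar cross product."""
    return (d[0] - s[0]) * (n[1] - s[1]) - (d[1] - s[1]) * (n[0] - s[0])

def route(s, d, right_first=True):
    """Follow RNF (right_first=True) or LNF (right_first=False) from s to d."""
    pick = min if right_first else max
    path = [s]
    while path[-1] != d:
        cur = path[-1]
        candidates = minimal_next_hops(cur, d)
        path.append(pick(candidates, key=lambda n: side(cur, d, n)))
    return path

print(route((0, 0), (2, 1)))                      # RNF path
print(route((0, 0), (2, 1), right_first=False))   # LNF path
```

Both paths use exactly `hops(s, d)` links; they differ only in which of the two minimal neighbors is taken when a choice exists.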
Identification of each node
To find the next-hop node on the shortest path, we should first identify each node in the architecture. There are two methods to achieve this: method based on the PP and method based on the source node.
Method based on the PP
Given the characteristics of the CR architecture, it is natural to identify each node in polar coordinates. We choose the 0-axis as the polar axis OX, so node P can be identified as (ρ,θ). Here ρ represents the length of OP, and θ is the value of ∠POX. In Fig.10,
tan θ = |PN|/|ON| = (|PM|⋅sin(π/3))/(|OM|−|PM|⋅cos(π/3)) = (3⋅sin(π/3))/(5−3⋅cos(π/3)).
In this way, we can easily calculate the values of ρ and θ for each node.
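The calculation for the example of Fig.10 can be written as a few lines of Python. This is a sketch under the geometry implied by the formula: M lies on the polar axis at distance |OM| from O, P lies at distance |PM| from M, and N is the foot of the perpendicular from P to OX. The function name and parameters are hypothetical.

```python
import math

def polar_from_axis_offsets(om, pm, angle=math.pi / 3):
    """Return (rho, theta) of P given |OM|, |PM| and the angle at M."""
    on = om - pm * math.cos(angle)   # |ON|, along the polar axis
    pn = pm * math.sin(angle)        # |PN|, perpendicular to the axis
    return math.hypot(on, pn), math.atan2(pn, on)

rho, theta = polar_from_axis_offsets(5, 3)
print(rho, theta)
```

For |OM| = 5 and |PM| = 3 this gives |ON| = 3.5 and |PN| = 3·sin(π/3), so ρ² = 12.25 + 6.75 = 19.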
To find the next-hop node on the shortest path, we should find the association between s→d and s→n0, …, s→n5. In 
Method based on the source node
It is easy to calculate the coordinates of each node with the method based on the PP. However, the relationship between s and d must then be calculated again to determine the next-hop node. If we set up the coordinates of each node relative to the source node, the speed of switching is improved greatly. All the calculations can be carried out offline; that is, we can obtain these values before switching.
Performance Analysis
We develop a simulation system to evaluate the performance of the CR architecture. There are two common approaches for the design of the simulation system: cycle-based and event-driven. We choose the former. In each time slot, one or zero cell arrives at the node and is put into one of the VOQs according to its source node and destination node (the source and destination addresses of each cell are translated to the node identifiers in the CR).
We choose the BE as our target and RNF as the routing algorithm. We choose two traffic patterns: uniform random traffic and tornado traffic. With uniform random traffic, each node in the BE randomly chooses one destination node for each incoming cell. With tornado traffic, node0 sends cells to node6, node1 sends cells to node5, node6 sends cells to node4, etc., as shown in Fig.12. We assume that a cell arrives at each node in each time slot.
Simulator warm-up
Our simulator is initialized with empty queues before any cells are injected. This introduces a systematic error into our measurements. Cells that are injected early see a relatively empty network. These cells meet less contention and therefore traverse the network more quickly. However, as the buffers begin to fill up, later cells meet more contention, which increases their latencies. Over time the influence of the initialization becomes minimal, and at this point the simulation is warmed up. By ignoring all the events that happen before the warm-up point, the impact of the systematic error on the measurements can be minimized.
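Discarding the warm-up period amounts to computing statistics only over the slots after the warm-up point. The Python sketch below illustrates this for the drop ratio; the per-slot counters, the function name and the sample numbers are hypothetical and are not simulation results from this paper.

```python
def drop_ratio(arrivals, drops, warmup=0):
    """Ratio of dropped cells to input cells, ignoring slots before `warmup`.

    arrivals[t] and drops[t] are the cells offered and dropped in slot t.
    """
    offered = sum(arrivals[warmup:])
    return sum(drops[warmup:]) / offered if offered else 0.0

# Made-up counters: 7 cells offered per slot, drops start once queues fill.
arrivals = [7] * 10
drops = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]

print(drop_ratio(arrivals, drops))             # biased by the empty start
print(drop_ratio(arrivals, drops, warmup=6))   # steady-state slots only
```

In this toy data the ratio over all slots is 10/70, while the ratio after the warm-up point is 8/28, which shows how the initially empty queues bias the measurement downwards.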
Here the length of each queue is set to 100 and the random traffic pattern is chosen. We run our simulator for 10 000 time slots and calculate the ratio of dropped cells to input cells. Fig.13(a) shows the statistics for 1000 time slots and Fig.13(b) shows the result for 10 000 time slots. We can see that Fig.13(a) is only a snippet of Fig.13(b), and 
Throughput
Here we set the length of each queue to 100 and run our simulator for 10 000 time slots. We compare the throughput of tornado traffic with that of random traffic. As shown in Fig.14(a can get a better result. The random traffic pattern is benign for the RNF algorithm since the load is better balanced.
When designing a routing algorithm, we always assume that each node has buffers of infinite length. In practice, however, this is not possible. We choose the tornado traffic pattern and set the length of each queue to 20, 40, 60, 80 and 100 respectively. We run our program for 10 000 time slots.
Conclusion
As we mentioned at the very beginning of this paper, the driving force for the evolution of router design is the stupendous growth of user traffic. It is becoming more and more difficult to increase the speed of ports because we are encountering not only some intrinsic limitations of silicon technology but also a whole set of physical, electrical and mechanical issues. Traditional switching fabrics are not suitable for supporting a very large number of ports because of their complicated schedulers. The 3-D torus topology used in the Avici TSR is a pleasant surprise for its simplicity and good scalability. However, the TSR can only support up to 560 line cards because of its physical limitations.
The CR architecture is a new architecture that is promising and suitable for scalable routers. In theory, this architecture can be scaled from a system with only one line card to a system with an unlimited number of line cards. With some improvements, this architecture works well even when some nodes or links are down.
In the future, we plan to study the properties of the regular and irregular CR architectures in more depth. We plan to develop routing algorithms for normal operation, as well as other algorithms that can be used in case of faults. With this architecture, we can also carry out research on multicast and QoS switching.
YUE Zu-Hui
