Abstract: Based on a realistic yet simple cost model, we compute the switch radix that minimizes the cost of a fat tree network supporting a given number of end nodes. The cost model comprises two parameters indicating the relative cost of a crosspoint vs. a link, and the crosspoint-independent base cost of a switch. These parameters can be adapted to represent a given technology used to implement links and switches. Based on these inputs, the resulting model allows a quick evaluation of the switch radix that minimizes the overall cost of the network. We demonstrate that the optimum radix depends most strongly on the relative cost of a link, and turns out to be largely independent of the network size. Using a first-order cost bounds analysis based on current CMOS and link technology, our model indicates that the optimum switch radix for large fat trees is driven almost entirely by link cost and as a result lies in the range of hundreds of ports, rather than the tens of ports offered by most commercial switch products today.
I. INTRODUCTION
Interconnection networks for supercomputers and data centers often employ a topology from the family of topologies collectively referred to as fat trees. Examples are supercomputers such as the Connection Machine CM-5 [1], IBM's Roadrunner at LANL, MareNostrum at Barcelona Supercomputing Center, and the JUROPA and HPC-FF systems at Jülich Supercomputing Centre. The fat tree is also currently the preferred topology for InfiniBand networks. Moreover, fat trees have recently attracted increasing attention for use in commercial data center networks [5], [6].
An HPC or data center installation generally comprises end nodes, which send and receive data (e.g., compute nodes, storage servers, and database servers), and an interconnection network, which serves to transport data between end nodes. Because systems are becoming increasingly distributed in nature, the interconnection network has grown in importance and is currently one of the key factors determining overall system performance. The flip side of this is that the interconnect also accounts for an increasingly substantial part of the cost of the overall system. Based on this premise, this paper answers a straightforward question: To interconnect a given number of end nodes with a fat tree network, what is the switch radix r that minimizes the overall network cost? By switch radix we mean the number of ports per switch, assuming that all switches in the network have the same radix.
The authors of [7] also performed a study of switch radix optimization, but with a different objective, namely that of minimizing the end-to-end latency, based on the premise that for a given technology the aggregate switch bandwidth is a given, raising the question whether to divide this aggregate up into fewer faster ports or more slower ports. Their main conclusion was that, given technology trends, the optimum switch radix is increasing from the point of view of minimum end-to-end latency. Here, on the other hand, we focus on optimizing overall network cost rather than latency.
We first review the definition of a fat tree, or more specifically k-ary n-tree, in Sec. II. In Sec. III, we propose a cost model for the network as a whole based on a simple switch cost model that is quadratic in r, and we discuss practical aspects of this model. We use the cost model to obtain the optimum switch radix as a function of the number of nodes n and cost-model parameters a and b for single-sized and double-sized fat trees in Secs. IV and V, respectively. In Sec. VI we estimate the optimum switch radix based on current technologies. We conclude in Sec. VII.
Fig. 1: (a) Binary 4-level fat tree. (b) Binary 4-tree.
II. FAT TREES
Fat tree networks were introduced by Leiserson [2] as k-ary tree topologies in which the upward links at each level are a factor k faster than the downward links to ensure that the bisection bandwidth remains constant; see Fig. 1(a). The main problem with implementing such a tree is that the switch port rates become unmanageably high towards its root.
A more practical and scalable variant of such trees, requiring only switches with the same radix and the same port speed at all levels, is the k-ary n-tree [3]; see Fig. 1(b). Formally, these topologies and their slimmed versions (i.e., those not providing full bisectional bandwidth) belong to the family of extended generalized fat trees (XGFTs) [4]. This family includes many popular multi-stage interconnection networks, such as m-ary complete trees, k-ary n-trees, Leiserson's fat trees, and slimmed k-ary n-trees. Here, we use the term "fat tree" as a synonym for k-ary n-tree.
An XGFT$(h; m_1, \ldots, m_h; w_1, \ldots, w_h)$ of height $h$ has $h + 1$ levels, divided into $N = \prod_{i=1}^{h} m_i$ end nodes (leaves of the tree) at level $l = 0$, and switching nodes at levels $1 \le l \le h$ (inner nodes of the tree). Each non-leaf node in level $i$ has $m_i$ children, and each non-root node has $w_{i+1}$ parents [4]. XGFTs are constructed recursively, with each sub-tree at level $l$ having parents numbered from 0 to $w_{l+1} - 1$.
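The XGFT parameters can be turned into per-level node counts with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the per-level count (the product of the remaining $m_i$ times the product of the $w_i$ seen so far) is taken from the usual XGFT definition, not from a formula stated in this text.

```python
from math import prod

def xgft_level_sizes(m, w):
    """Node counts per level, from level 0 (leaves) up to level h.

    m[i-1] holds m_i (children of a node at level i) and w[i-1] holds w_i
    (parents of a node at level i-1). The count used for level l,
    prod(m_{l+1..h}) * prod(w_{1..l}), is an assumption based on the
    standard XGFT definition.
    """
    h = len(m)
    if len(w) != h:
        raise ValueError("m and w must both have h entries")
    return [prod(m[l:]) * prod(w[:l]) for l in range(h + 1)]

def k_ary_l_tree(k, l):
    """Level sizes of XGFT(l; k, ..., k; 1, k, ..., k)."""
    return xgft_level_sizes([k] * l, [1] + [k] * (l - 1))

if __name__ == "__main__":
    # Binary 3-tree: 8 end nodes and three switch levels of 4 switches each.
    print(k_ary_l_tree(2, 3))
```

For a k-ary l-tree this yields $k^l$ leaves and $k^{l-1}$ switches on every switch level, which is why all switches can share the radix $2k$.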
III. COST MODEL
Here, we will specifically consider $k$-ary $l$-trees,¹ which can be described using the XGFT specification shown above as XGFT$(l; k, \ldots, k; 1, k, \ldots, k)$. In such a network, the radix $r$ of each switch at level $1 \le i < l$ equals $r = m_i + w_{i+1} = 2k$. We assume that the switches at level $l$ also have radix $r = 2k$, with the upward-facing ports being unconnected to allow for future network extension. Later on, we will use the term "single-sized" fat tree to distinguish this network from the "double-sized" one in which the upward ports of the top stage are used to connect $k^{n-1}$ additional subtrees, thus doubling the total number of end nodes. Note that $k$-ary $l$-trees have constant bisection bandwidth.
The number of end nodes $N(r, l)$ supported by a (single-sized) $k$-ary $l$-tree equals
$$N(r, l) = \left(\frac{r}{2}\right)^{l} \qquad (1)$$
Using (1), we derive simple expressions for several basic complexity metrics for a fat tree network that supports n end nodes using switches with radix r.
Equations (2)-(4) express the number of levels L(r, n), the number of switches S(r, n), and the number of links I(r, n) of such a fat tree. Note that L(r, n) counts only the number of switch levels and that I(r, n) counts the number of bidirectional inter-switch links, not including the links between end nodes and switches, because this number is independent of the topology.
The overall fat tree cost function is given by (6):
where $C_{sw}(r, b)$ is the switch cost function (see Sec. III-B) and $a$ is a parameter that allows tuning of the relative cost of a link versus the relative cost of a crosspoint and a switch. The above functions are only valid for $n \ge 2$ and $4 \le r \le 2n$. The case $r = 2n$ corresponds to a single switch connecting all $n$ nodes (and having $n$ unconnected upward ports). It does not make sense to choose $r > 2n$, because then there will be unconnected downward ports.
¹ In the literature, the term "k-ary n-tree" is normally used, but from here on we prefer to use l to indicate the number of levels and n the number of nodes.
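As a concrete illustration of the quantities described above, the following minimal sketch evaluates one plausible reading of the cost model. The display equations are not reproduced in this text, so the specific forms used here are assumptions: the level count is the smallest L with $(r/2)^L \ge n$, each level holds $\lceil n/(r/2) \rceil$ switches, there are $(L-1)\,n$ inter-switch links, and the total cost is the number of switches times $(r^2 + b)$ plus $a$ times the number of links. All function names are hypothetical.

```python
def levels(r, n):
    """Smallest number of switch levels L such that (r/2)**L >= n."""
    k = r // 2
    count, capacity = 1, k
    while capacity < n:
        capacity *= k
        count += 1
    return count

def switches(r, n):
    """Assumed switch count: ceil(n / (r/2)) switches on each level."""
    k = r // 2
    return levels(r, n) * -(-n // k)

def links(r, n):
    """Assumed inter-switch link count: n links between adjacent levels."""
    return (levels(r, n) - 1) * n

def switch_cost(r, b):
    """Quadratic crosspoint term plus the fixed base cost b."""
    return r * r + b

def fat_tree_cost(r, n, a, b):
    """Assumed overall cost: switches * switch cost + a * links."""
    if n < 2 or r % 2 or not 4 <= r <= 2 * n:
        raise ValueError("need n >= 2 and an even radix with 4 <= r <= 2n")
    return switches(r, n) * switch_cost(r, b) + a * links(r, n)

if __name__ == "__main__":
    # 4,096 end nodes with 16-port switches: an 8-ary 4-tree.
    print(levels(16, 4096), switches(16, 4096), links(16, 4096))
    print(fat_tree_cost(16, 4096, a=10, b=100))
```

With $r = 2n$ the sketch degenerates to a single switch and no inter-switch links, matching the boundary case described in the text.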
A. Link cost model
In the model of (6) we assume the same cost for all links, independently of their length. In reality, of course, cost does depend on length to some extent. Figure 2 shows prices as a function of length for 40 Gb/s (4 full-duplex QDR lanes) Mellanox cables with QSFP connectors, comparing active optical cables with passive and active copper cables. Also shown are the least-squares-fitted linear functions for each data set. Unsurprisingly, for short distances (< 12 m) copper is the cheapest option; for distances beyond 12 m, fiber is the only option. The fiber cable costs, especially the shorter ones, are dominated by the cost of the connectors (which include the E/O/E conversions): the y-axis offset is about $280, plus about $4.5 per meter of fiber.
For the purpose of this study, we ignore the dependency on cable length. In a k-ary n-tree, the mean length of cables connecting level i to level i + 1 increases with i. This implies that our model underestimates the cost of adding another level, hence the resulting optimum switch radix will tend to be underestimated as well. Therefore, taking into account cable lengths would result in larger switch radices than those reported by the model (see Sec. VI).
In addition to the cable costs, we include the per-port costs physically related to the switch in the per-link cost, such as transceivers and per-port logic (e.g., handling of link-layer aspects, including flow control, priorities, virtual channels, reliable delivery) as well as any dedicated port buffers.
B. Switch cost model
For the switch cost model we consider pure single-stage implementations for the switching nodes, i.e., we do not consider "single-box" switch solutions that are actually multistage under the covers; we will touch upon these in Sec. III-C.
As we already account for the per-port functions in the link cost, the switch cost does not include a term linear in $r$. The remaining switch functionality is dominated by the cost of performing contention resolution for $r$ ports. Therefore, under the condition that each switching node is implemented in a single-stage fashion, our switch cost function $C_{sw}(r, b)$ is quadratic in $r$:
$$C_{sw}(r, b) = r^2 + b \qquad (7)$$
Fig. 3(a): Cost function $C_{ss}$ as a function of $r$ for $n$ from 256 to 64M, $a = 10$, $b = 100$.
The rationale behind this is that the complexity of a single-stage switching node always scales quadratically with $r$ in some way, regardless of its specific implementation:
• The number of crosspoints in an unbuffered or buffered crossbar equals $r^2$.
• In an input-queued switch with virtual output queues (VOQs), the number of VOQs equals $r^2$.
• In a purely output-queued switch, the aggregate write bandwidth into the output queues equals $r^2$ times the port rate.
• In an output-queued shared-memory switch, the aggregate write bandwidth into the output queues equals $r^2$ times the packet rate times the address width. Moreover, the wiring complexity of the shared memory scales with $r^2$, as there is at least one memory location per port, which needs to be connected to each input and output. In practice, the wiring complexity is more likely to be in the order of $b \cdot r^3$, where $b$ is the data-path width [8].
• In a combined-input-output-queued switch, a combination of the above quadratic complexities arises.
Because the term $r^2$ corresponds to the number of crosspoints in a crossbar, we refer to it as crosspoint complexity, with the cost of a crosspoint being normalized to unity. In this context the term crosspoint is being used generically, and should not be equated with a crosspoint in a crossbar switch. This implies that the unit crosspoint cost depends strongly on switch architecture and implementation and needs to be assessed accordingly.
To allow for some flexibility in the switch cost function and account for a fixed per-switch overhead, we include a base cost component b that is independent of r, as shown in (7). Note that per-port cost components such as transceivers can be associated with the links and should therefore be incorporated in the per-link cost factor a, which is also the reason that we do not include a linear term in the switch cost function.
Parameters a and b are normalized with respect to the unit cost of a crosspoint; in practice, they can be expected to be greater than one.
Note that, although some plots may show functions as being continuous in radix r, there is clearly no practical relevance of non-integer radices. Moreover, as fat trees with constant bisection bandwidth also necessitate even radices, we always round up to the nearest even radix.
C. Implementation considerations
In practice, the maximum radix of a single-stage, single-chip switch is limited by the following factors (see also [10]):
• CMOS manufacturing limits, such as die size, gate count, wiring, and above all, manufacturing yield.
• Chip packaging limits: High-performance, high-density packages (e.g., ceramic ball or column grid arrays) are limited to about 1,000 to 2,500 pins. Usually about half of these are required for $V_{dd}$ and ground, leaving the other half for IO. As each high-speed bidirectional port requires at least four pins, the radix of a practical single-chip switch will be limited to about 125 to 300 ports from a pure pin-IO perspective.
• PCB and connector limits: The bandwidth through the chip must also be brought to and from the chip, by means of PCB traces that lead to connectors at the edge of the PCB. Wirability issues as well as the length of the card edge in conjunction with connector density impose limits in this respect.
• Timing constraints: The most timing-critical part is usually the arbitration process, which is determined by packet duration and radix. Depending on the switch architecture, the time complexity of the arbitration process imposes another limit.
• Power budget, heat dissipation: Power limits at the chip, rack and system level, and the capability to dissipate waste heat can also be limiting factors.
In particular, the manufacturing and packaging-related factors will cause strong non-linearities in the switch cost function once the radix exceeds a certain value. In essence, they will impose an upper limit on the maximum switch radix that can feasibly be implemented in a single chip, which we need to take into account when interpreting the results of the idealized model.
Fig. 4(a): $C_{ss}$ vs. its smoothed version $\tilde{C}_{ss}$ for $n$ from 256 to 64K, $a = 10$, $b = 100$.
D. Exact cost function
Figure 3(a) shows C ss (r, n, a, b) for a large range of n, with a = 10 and b = 100. Each curve clearly exhibits a cost minimum at around r = 16, almost independent of n.
We varied both $a$ and $b$ from $10^0$ to $10^4$ for $n$ ranging from $2^8$ to $2^{64}$ and determined for each combination of these parameters the optimum switch radix $r_{opt}$ that minimizes $C_{ss}$. Figure 3(b) displays the results in a 3D plot, with $a$ and $b$ along the x- and y-axes and $n$ as a parameter, i.e., one surface for each $n$. The results indicate that as $a$ and $b$ increase, $r_{opt}$ also increases. In addition, the effect of increasing $a$ is significantly stronger than that of increasing $b$. Moreover, these results confirm that the value of $r_{opt}$ does not vary much with $n$ as long as $a < 10^3$. To illustrate this point, we averaged $r_{opt}$ over all values of $b$ and plotted the result as a function of $n$ with $a$ as a parameter, see Fig. 3(c) ($a$ increases from the bottom to the top curve). For small values of $a$, the curves are almost flat. As $a$ increases, the curves exhibit a "see-saw" pattern, which is due to the ceiling operations in (2) and (3). Nevertheless, even for large $a$, the amplitude of the see-saw decreases as $n$ increases, converging on a limit value, which we will compute in the next section.
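The parameter sweep described above amounts to an exhaustive search over even radices. The following sketch shows the idea under stated assumptions: since the display equations are not reproduced here, the cost function uses a reconstructed form (levels as the smallest L with $(r/2)^L \ge n$, $\lceil n/(r/2) \rceil$ switches per level, $(L-1)\,n$ inter-switch links, switch cost $r^2 + b$). The names `cost` and `optimum_radix` are hypothetical.

```python
def cost(r, n, a, b):
    """Assumed exact cost of a single-sized fat tree (reconstructed form)."""
    k = r // 2
    num_levels, capacity = 1, k
    while capacity < n:                 # smallest L with k**L >= n
        capacity *= k
        num_levels += 1
    num_switches = num_levels * -(-n // k)   # ceil(n / k) per level
    num_links = (num_levels - 1) * n         # inter-switch links only
    return num_switches * (r * r + b) + a * num_links

def optimum_radix(n, a, b):
    """Exhaustive search over all even radices with 4 <= r <= 2n."""
    return min(range(4, 2 * n + 1, 2), key=lambda r: cost(r, n, a, b))

if __name__ == "__main__":
    # Sweep the network size for fixed cost parameters.
    for exponent in (8, 10, 12, 14):
        n = 2 ** exponent
        print(n, optimum_radix(n, a=10, b=100))
```

Because the level count changes in integer steps, the optimum found this way jumps between neighbouring radices as n varies, which is one way to see the see-saw pattern mentioned above.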
E. Differentiable (smooth) cost function
To explore the behavior of C ss in more depth, we proceed by rendering (7) differentiable by removing the ceiling and floor operations, thus eliminating the discontinuities in its derivative:
The "smoothed" cost function $\tilde{C}_{ss}(r, n, a, b)$ is given by (11):
Substituting (7), (8), (9), and (10) into (11) yields (12):
Figure 4(a) compares $C_{ss}$ and $\tilde{C}_{ss}$ for $n$ ranging from 256 to 64K nodes. We observe that $\tilde{C}_{ss}$ tends to underestimate the actual cost. The main cause is that the smoothing allows fractional tree levels and switches, which obviously does not correspond to reality.
Fig. 4(b): Error of $\tilde{C}_{ss}$ relative to $C_{ss}$ as a function of $r$ for $n = 1{,}024$ and $b = 100$, and different values for $a$.
It can be shown that (for $4 \le r \le 2n$) the error reaches its maximum at $r = 2(n - 1)$ and that this maximum value approaches 0.75 for large $n$. However, Fig. 4(b) also shows that for realistic values of $n$, $r$, $a$, and $b$, the error can be expected to be below 20%.
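The effect of smoothing can be checked numerically. The sketch below compares an exact and a smoothed cost under assumed forms, since the display equations are not reproduced in this text: the smoothed version simply drops the rounding, using $L = \ln n / \ln(r/2)$, $S = L \cdot n/(r/2)$ and $I = (L-1)\,n$. The relative error is taken as (exact minus smoothed) divided by exact, which is also an assumption. All names are hypothetical.

```python
from math import log

def exact_cost(r, n, a, b):
    """Assumed exact cost with integer levels and switch counts."""
    k = r // 2
    num_levels, capacity = 1, k
    while capacity < n:
        capacity *= k
        num_levels += 1
    num_switches = num_levels * -(-n // k)
    num_links = (num_levels - 1) * n
    return num_switches * (r * r + b) + a * num_links

def smooth_cost(r, n, a, b):
    """Assumed smoothed cost: same expressions without ceilings."""
    k = r / 2
    num_levels = log(n) / log(k)
    num_switches = num_levels * n / k
    num_links = (num_levels - 1) * n
    return num_switches * (r * r + b) + a * num_links

def relative_error(r, n, a, b):
    """Fraction by which the smoothed cost underestimates the exact cost."""
    exact = exact_cost(r, n, a, b)
    return (exact - smooth_cost(r, n, a, b)) / exact

if __name__ == "__main__":
    n = 1024
    for r in (8, 16, 22, 64, 2 * (n - 1)):
        print(r, round(relative_error(r, n, a=10, b=100), 3))
```

Under these assumptions the two functions agree whenever n is an exact power of r/2, and the worst case at $r = 2(n-1)$ comes out close to the 0.75 limit quoted above: the exact tree needs two levels of two switches each, where the smoothed model counts roughly one switch.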
IV. SINGLE-SIZED FAT TREE
To determine the switch radix r opt that minimizes cost function (11), we differentiate (12) with respect to r, which yields (13):
and then solve
To find a solution to (14), note that the first product term of (13) is only zero when n = 1, which is not a meaningful solution. Therefore, to obtain useful solutions, we need to find the roots of f ropt (r, a, b), as defined by (15):
Note that r ≥ 4, a ≥ 0, b ≥ 0 must hold. We first treat the special cases b = 0 and a = 0, before proceeding with the general case.
A. Case a > 0, b = 0
The special case $b = 0$ happens to have an elegant, closed-form solution for the optimum radix $r_{opt,a}$, given by (16):
where $W(z)$ is the Lambert W-function, which satisfies $z = W(z)\, e^{W(z)}$ for all complex numbers $z$. Most notably, (16) is independent of $n$, i.e., the optimum switch radix depends only on the cost factor $a$, but not on the size of the network.
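To see how a Lambert W expression can arise here, the following derivation works through the case $b = 0$ under an assumed form of the smoothed cost (the display equations are not reproduced in this text): $L = \ln n / \ln(r/2)$, $S = 2nL/r$, $I = (L-1)\,n$ and switch cost $r^2 + b$. The result is one form consistent with the statements above; the published expression (16) may be written differently.

```latex
% Assumed smoothed cost per end node for b = 0:
\frac{\tilde{C}_{ss}}{n}
  = \frac{\ln n}{\ln(r/2)}\,\bigl(2r + a\bigr) - a .
% Setting the derivative with respect to r to zero:
2\ln(r/2) = \frac{2r + a}{r}
\quad\Longrightarrow\quad
2r\bigl(\ln(r/2) - 1\bigr) = a .
% Substituting r = 2e^{1+u}:
4e\,u\,e^{u} = a
\quad\Longrightarrow\quad
u = W\!\left(\frac{a}{4e}\right),
\qquad
r_{opt,a} = 2\exp\!\left(1 + W\!\left(\frac{a}{4e}\right)\right).
```

The factor $\ln n$ drops out when the derivative is set to zero, which is why this optimum does not depend on the network size. For $a = 0$ the expression reduces to $r = 2e \approx 5.4$.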
B. Case a = 0, b > 0
If $a = 0$, the optimum radix $r_{opt,b}$ can be shown to be equal to the solution to (17):
Using a first-order approximation for $\log(x)$ around $x = 1$, we can approximate the term $\frac{\log(r/2) - 1}{\log(r/2) + 1}$ by $1 - \frac{4}{r}$. Substituting this in (17) and solving for $r$ yields (18):
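The approximation step can be made explicit. The worked equations below assume that (17) has the form $r^2 \cdot \frac{\log(r/2) - 1}{\log(r/2) + 1} = b$, which is what an assumed smoothed cost with switch cost $r^2 + b$ gives for $a = 0$; the display equations themselves are not reproduced in this text, so the resulting estimate is a reconstruction, not a quotation of (18).

```latex
% First-order approximation around x = 1: log(x) ~ x - 1, so log(r/2) ~ r/2 - 1
\frac{\log(r/2) - 1}{\log(r/2) + 1}
  \approx \frac{r/2 - 2}{r/2} = 1 - \frac{4}{r} .
% Substituting into the assumed form of (17):
r^2\left(1 - \frac{4}{r}\right) = b
\quad\Longrightarrow\quad
r^2 - 4r - b = 0
\quad\Longrightarrow\quad
r_{opt\text{-}est,b} = 2 + \sqrt{b + 4} .
```

For example, $b = 100$ gives an estimate of $2 + \sqrt{104} \approx 12.2$. The linearization of the logarithm is crude for large $r$, so this value is best treated as a starting point for a numerical search.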
Figure 4(c) plots $r_{opt,a}$, $r_{opt,b}$, and $r_{opt\text{-}est,b}$. We exploited the fact that $r_{opt,a}$ and $r_{opt,b}$ provide reasonable estimates of $r_{opt}$ to seed the zero-finding process of the general case (see Sec. IV-C). This technique was also used to find the actual values for $r_{opt,b}$ shown in Fig. 4(c).
C. General case
To solve the general case, we used Matlab™ to numerically find the root of (15) for any given combination of $n$, $a$, and $b$. As Matlab's fzero() function requires a seed value for the root, we seeded it with (16) if $a > b$ or with (18) otherwise. Figures 5(a,b) show the results for $n$ ranging from $2^8$ to $2^{64}$ and $a$ and $b$ from $10^0$ to $10^4$. Both subplots display the same data set, but Fig. 5(a) plots $r_{opt}$ as a function of $a$ with $b$ as a parameter, whereas Fig. 5(b) is the opposite. The values of $a$ and $b$ increase from left to right (x-axis) or bottom to top (curves within same plot).
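The seeded root search can be imitated without Matlab. The sketch below is not the authors' code: it swaps fzero() for plain bisection, uses the seed only as an initial upper bracket, and relies on reconstructed forms of (15), (16) and (18), all of which are assumptions because the display equations are not reproduced in this text. The Lambert W helper is a simple Newton iteration for non-negative arguments.

```python
from math import e, exp, log, sqrt

def lambert_w(z, tol=1e-12):
    """Principal branch of the Lambert W-function for z >= 0 (Newton)."""
    w = log(1.0 + z)                      # starting guess at or above W(z)
    for _ in range(100):
        ew = exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def f_ropt(r, a, b):
    """Assumed form of (15), up to a positive factor."""
    lg = log(r / 2)
    return r * r * (lg - 1) - b * (lg + 1) - a * r / 2

def seed(a, b):
    """Closed form for b = 0 if a > b, square-root estimate otherwise."""
    if a > b:
        return 2 * exp(1 + lambert_w(a / (4 * e)))
    return 2 + sqrt(b + 4)

def r_opt(a, b):
    """Bracket the root starting from the seed, then bisect."""
    lo = 2 * e                            # f_ropt is <= 0 at r = 2e
    hi = max(seed(a, b), 6.0)
    while f_ropt(hi, a, b) < 0:           # widen until the sign changes
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if f_ropt(mid, a, b) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    for a, b in ((10, 100), (100, 100), (1000, 100), (10, 0)):
        print(a, b, round(r_opt(a, b), 2))
```

The returned value is a real number; following the rounding convention stated in Sec. III-B, it would then be rounded up to the nearest even radix.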
Figures 5(d,e,f) plot $r_{opt}$ and $\tilde{r}_{opt}$ as a function of $n$ with $b$ as a parameter for $a = 1$, $10^2$, and $10^4$, respectively. These figures clearly demonstrate that for given $a$ and $b$ the values of $r_{opt}$ and $\tilde{r}_{opt}$ converge as $n$ increases. In addition, the decreasing dependence on $b$ as $a$ increases is also obvious from comparing Figs. 5(d,e,f) against each other.
V. DOUBLE-SIZED FAT TREE
The fat tree configuration considered up to here left half the ports of the top-level switches unconnected. We will now consider a fat tree in which these top-level ports are connected to another subtree of height $h - 1$, so that the total number of end nodes is doubled; hence, we refer to this topology as the double-sized fat tree. In XGFT terms, this corresponds to an XGFT$(l; k, \ldots, k, 2k; 1, k, \ldots, k)$, with $k = r/2$. This topology gives rise to slightly different expressions for the number of nodes (19), levels (20), switches (21), and links (22):
with the overall cost function C ds (r, n, a, b) given by (23):
In a similar approach to Sec. IV, we derive a smoothed cost function $\tilde{C}_{ds}$ and follow a similar procedure to find its minima for given $n$, $a$, and $b$. In this case, $4 \le r \le n$ should hold. Figure 6(a) shows $C_{ds}$ as a function of $r$ for a large range of $n$ and $a = 10$, $b = 100$. Upon close inspection, it appears that the minima occur at different values of $r$ for different $n$, suggesting that, unlike the single-sized fat tree, the optimum radix depends on $n$. Figure 6(b) confirms this notion: it plots the derivative of $\tilde{C}_{ds}$ with respect to $r$ for the same parameters, zooming in to the range where the roots are. Here, we can clearly see how the roots move left as $n$ increases, from about 22 for $n = 256$ down to just over 19 for $n = 64$M. However, we shall demonstrate that, as $n$ increases, the roots converge to those of the single-sized case.
Equation (24) shows the derivative of $\tilde{C}_{ds}(r, n, a, b)$ with respect to $r$; unfortunately, it does not have the clean product form of (13). As a consequence, the zeroes of (24) are indeed not independent of $n$ in this case. Note that the optimum switch radix actually decreases as $n$ increases.
Taking the limit n → ∞ of (24) yields (25):
Hence, for asymptotically large n we can obtain the optimum switch radix r opt by finding the roots of f rds-opt (r, a, b):
which by making use of (15) can be rewritten as
Therefore, any root $r > 0$ of $f_{r_{opt}}(r, a, b)$ is also a root of $f_{r_{ds\text{-}opt}}(r, a, b)$ and vice versa. It follows that for large $n$ and given $a$ and $b$ the double- and single-sized fat tree networks have the same optimum switch radix. Figure 6(c) plots $r_{ds\text{-}opt}$ as a function of $a$ and $n$, with $b$ as a parameter, i.e., there is one surface for each value of $b$. These results confirm that $r_{ds\text{-}opt}$ declines slightly as $n$ increases and grows with increasing $a$ and $b$.
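The drift of the optimum with n in the double-sized case can be reproduced numerically. The sketch below uses an assumed smoothed cost, since equations (19) to (25) are not reproduced in this text: with $n = 2(r/2)^L$ end nodes there are $L = \ln(n/2)/\ln(r/2)$ levels, the top level holds half as many switches as the others (giving $(L - 1/2) \cdot n/(r/2)$ switches in total), and there are $(L-1)\,n$ inter-switch links. The minimum is located by a ternary search, which assumes the cost has a single minimum on the search interval. All names are hypothetical.

```python
from math import log

def smooth_ds_cost(r, n, a, b):
    """Assumed smoothed cost of a double-sized fat tree."""
    k = r / 2
    num_levels = log(n / 2) / log(k)          # from n = 2 * k**L
    num_switches = (num_levels - 0.5) * n / k  # top level is half-populated
    num_links = (num_levels - 1) * n
    return num_switches * (r * r + b) + a * num_links

def argmin_radix(n, a, b, lo=4.0, hi=64.0, iters=200):
    """Ternary search for the minimizing (real-valued) radix in [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if smooth_ds_cost(m1, n, a, b) < smooth_ds_cost(m2, n, a, b):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

if __name__ == "__main__":
    for exponent in (8, 12, 16, 20, 26):
        n = 2 ** exponent
        print(n, round(argmin_radix(n, a=10, b=100), 2))
```

Under these assumptions the minimizing radix for $a = 10$, $b = 100$ should fall from roughly 22 at $n = 256$ to just over 19 at $n = 64$M, in line with the values read from Fig. 6(b), and keep approaching the single-sized optimum as n grows.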
VI. ESTIMATED OPTIMUM RADIX FOR CURRENT TECHNOLOGY
In this section we apply our theory to get an insight into what would be a good choice for the radix based on current technology. A comprehensive analysis that accurately reflects all the choices of architecture, chip technology, packaging, board, and link technology is beyond the scope of this paper. Instead, we perform a first-order bounds analysis by analyzing two extreme cases. For the lower-bound case we assume a low-cost, entry-level switch and link system. We further assume that it is based on a buffered-crossbar architecture which is implemented in a chip. For a buffered crossbar the cost is dominated by the crosspoint memory. To achieve entry-level performance we assume an 8-packet crosspoint memory with a packet length of 64 bytes. To calculate the cost, we use a transistor cost of $10^{-8}$ dollars for current CMOS technology [9]. We assume SRAM technology using 6 transistors per memory bit. To account for the additional SRAM control and crosspoint control and arbitration we simply double the crosspoint cost. For the chip overhead we assume 100k gates that implement configuration, control, and other overheads.
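The entry-level crosspoint cost follows directly from the assumptions just listed. The short sketch below only multiplies out those stated numbers; the constant and function names are hypothetical, and no link or overhead costs are computed because their inputs are not fully specified in the text.

```python
TRANSISTOR_COST = 1e-8        # dollars per transistor, as quoted from [9]
TRANSISTORS_PER_BIT = 6       # 6-transistor SRAM cell
CONTROL_FACTOR = 2            # doubling for SRAM and crosspoint control

def crosspoint_cost(packets=8, packet_bytes=64):
    """Cost in dollars of one buffered-crossbar crosspoint."""
    bits = packets * packet_bytes * 8
    return CONTROL_FACTOR * bits * TRANSISTORS_PER_BIT * TRANSISTOR_COST

if __name__ == "__main__":
    # Entry-level case: 8 packets of 64 bytes per crosspoint.
    print(f"{crosspoint_cost():.4e} dollars per crosspoint")
```

The entry-level crosspoint thus stores 4,096 bits and costs roughly 0.05 cents, so an $r$-port switch carries about $r^2$ times that amount in crosspoint cost. Any link cost expressed relative to this unit gives the model parameter $a$.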
For both cases we assume high-performance, high-density chip packaging as required for switching chips, having a cost of around 10 cents per pin. Note that assuming a fixed per-pin cost ignores the effect of steeply increasing package costs as a function of total pin count. However, the link cost is dominated anyway by the cable cost (see Fig. 2 ).
For the sake of argument, we assume that in both cases the entire system is cabled using either copper or fiber. In the former case, this will underestimate the relative link cost, because beyond a certain system size, fiber will become indispensable. To illustrate an extremely optimistic boundary case, we also include a column for the low-end system without the cable costs, i.e., we only account for the per-pin cost of the chip package.
For the second bound, we investigate a high-end system consisting of a high-performance switch using optical link technology. We assume a crosspoint with 10 times the packet storage capacity of the entry-level system and active optical fibers instead of copper cables. Table I shows the resulting costs in dollars of crosspoint, link, and switch overhead for both switch implementations. Based on these numbers, we can compute the link cost and switch overhead in relation to the crosspoint cost, corresponding to the model variables a and b. This enables us to determine the optimum switch radix using the model presented in Sec. IV.
Thus, for the entry-level case we obtain a lower bound on the optimum radix (i.e., not taking cable cost into account) of about 128, whereas for the high-end case the lower bound is around 32. However, if we take into account the costs of copper or fiber cables, the optimum radix immediately increases to at least 1024 ports in both scenarios. As such large switches are no longer feasible in a single chip, the actual radix will be determined mainly by feasibility limits of a practical implementation, see Sec. III-C. However, we do conclude that the optimum radix should be in the order of several hundred ports, somewhere in between the two extreme cases.
Our bounds analysis, admittedly presenting an oversimplified version of reality, clearly shows that even in the low-end case without taking into account cabling (per-link cost only a few cents), the optimum radix is quite high. With substantially higher link costs, closer to the reality of today's systems, the radix is drastically higher, on the order of thousands of ports. As this is clearly impractical, the optimum radix in these cases is limited by the factors described in Sec. III-C.
At present, switch vendors offer radices in the order of a few tens of ports only. We believe this is a result of optimizing the cost of a single chip, rather than optimizing cost at the system level. HPC and data center system designers, however, base their technology choices on system cost. The cost of transistors will continue to decrease exponentially, whereas the cost of links generally decreases at a much slower pace. Given our analysis, the radices of current switch products are too small. We believe that a 200-400 port single-stage 10G Ethernet L2 switch would offer optimum cost in medium to large data centers and supercomputers.
VII. CONCLUSIONS
Using a straightforward cost model for fat tree ($k$-ary $l$-tree) topologies with full bisectional bandwidth and same-radix ($r = 2k$) switches at every level, we demonstrated that the optimum switch radix $r_{opt}$ is independent of the number of end nodes $n$, with $n = k^l$. In the "double-sized" case ($n = 2k^l$), the optimum radix does depend on $n$: interestingly, it slightly decreases as $n$ increases, converging on the optimum for the single-sized case ($n = k^l$) as $n \to \infty$. Regarding the relative cost parameters $a$ and $b$, which represent the fixed per-link and per-switch costs, we observed that the optimum radix increases as either $a$ or $b$ increases. Moreover, the optimum radix is more sensitive to $a$; as $a$ increases, the dependence on $b$ decreases.
Specific values for a and b depend on the implementation of the switching nodes and links, including choices for technology and architecture. We derived values for a and b using a cost analysis based on current CMOS and link technology under some reasonable assumptions on switch implementation, which yielded optimum switch radix values (for the idealized case) that are clearly not feasible in practice at present. Nevertheless, we can draw the conclusion that one should strive to implement the largest radix that is still feasible under the constraints outlined in Sec. III-C.
Many switch vendors are tackling this problem by recognizing that the cost of links that consist of only PCB traces and connectors is much lower than those that require cabling. This has led to several high-radix switch products with up to several hundred ports in a single enclosure. Examples are Myrinet™ switches with 256 ports (two-level fat tree using 32-port switches) and InfiniBand switches with 144, 288, 324 or even 648 ports (also two-level fat trees, but using 24-port and 36-port switches, respectively).
As, historically, link costs have decreased at a much slower rate than logic gate costs, a can be expected to increase further, implying that the optimum switch radix can be expected to grow as well. Vice versa, we should be on the lookout for technological breakthroughs, be they in high-speed link technologies, CMOS or post-CMOS integrated circuits, or other key aspects, that could fundamentally change the values of the link and switch overhead costs in relation to the crosspoint cost (i.e., the a and b parameters).
These results should be of interest to switch manufacturers, who, based on an assessment of values for a and b based on the target technology for a specific switch implementation, can determine which switch radix is most attractive. This applies in particular to InfiniBand, 10G Ethernet (also known as Convergence Enhanced Ethernet (CEE)), Myrinet™, and QsNet™, which are networking technologies often employed in HPC and DC installations. Our work indicates that optimizing costs at the level of the individual switch chip is clearly sub-optimal from a cost perspective at the network level.
Of course, the design of an interconnection network is not only driven by cost; for instance, requirements regarding performance (latency, bandwidth) and power may impose certain boundaries that limit the choice of switch radix.
As an example, for a fixed n, a larger switch radix will reduce the number of stages and thus reduce latency. However, fat trees generally have a relatively low diameter (compared to k-ary n-meshes and -cubes) and per-hop zero-load latency can be minimized by using cut-through switches, so the latency penalty of adding stages is small.
Nevertheless, as the cost of the interconnection network accounts for a substantial share of the total cost of a supercomputer or data center and this share can be expected to grow, optimizing interconnect cost is a worthwhile endeavor.
We intend to continue this work by determining reasonable estimates for a and b based on actual commercial switch implementations. In addition, we will incorporate more realistic models to account for cable costs and packaging. The same methodology can also be applied to "slimmed" (or "pruned") fat trees, as well as indirect networks, such as k-ary n-meshes and k-ary n-cubes.
