Automated Design of Torus Networks by Solnushkin, Konstantin S.
Automated Design of Torus Networks
Konstantin S. Solnushkin
konstantin@solnushkin.org
ABSTRACT
This paper presents an algorithm to automatically design
networks with torus topologies, such as ones widely used
in large-scale supercomputers. The characteristic feature of
our approach is that real life equipment prices and values
of technical characteristics are used. As a result, we also
have the opportunity to compare costs of torus and fat-tree
networks.
The algorithm is useful as a part of a bigger design procedure
that selects optimal hardware for a cluster supercomputer as a
whole.
Categories and Subject Descriptors
C.2.1 [Computer-Communication Networks]: Network
Architecture and Design—Network topology ; K.6.2 [Mana-
gement of Computing and Information Systems]: In-
stallation Management—Computer selection
General Terms
Design, Economics
Keywords
Torus network
1. INTRODUCTION
Torus networks are frequently used in large-scale supercom-
puters as a cost-efficient alternative to other topologies. Re-
cently it was demonstrated that torus networks for computer
clusters can be built from affordable commodity hardware
such as InfiniBand.
We describe an algorithm that automatically designs torus
networks. The algorithm is implemented in a software tool [8].
This algorithm is intended to be used as a part of a CAD
system for cluster supercomputers [9]. Such a system would
iterate through different combinations of hardware, vary-
ing the number of compute nodes and other parameters.
Thus, designing an interconnection network for every hard-
ware combination under review is a self-contained and highly
repetitive operation that must be performed efficiently.
During the design process, other characteristics of intercon-
nection networks, such as reliability, can be estimated and
used as design constraints or as a part of a complex objective
function.
The rest of the article is organized as follows. Section 2 de-
scribes relevant work in the field of torus networks. Section
3 details the design of the “Gordon” supercomputer. Section
4 introduces the algorithm, and section 5 compares costs of
torus networks with fat-trees. Finally, section 6 concludes
the article.
2. RELATED WORK
Torus networks have found widespread use in supercomputing.
IBM used a 3D torus network in BlueGene/L, and a 5D
network [3] in BlueGene/Q. A 6D mesh-torus network was
used in the “K Computer” [1]. All of these are direct networks,
where compute nodes are connected directly to their neighbours,
as opposed to switched fabrics, where nodes are first connected
to switches, and the switches are then connected to each
other in a torus topology. An example of the latter is the 3D
torus network of the Gordon supercomputer [11].
Torus networks are inherently prone to congestion, but
designers mitigate this by increasing the number of dimensions.
Commenting on the Gordon project, Strande [10]
cites the following benefits of torus networks: (a) lower
cost compared to fat-trees and (b) easy linear scaling along
one of the dimensions. However, such scaling may result in
unbalanced topologies, leading to higher latencies and more
congestion on the links in that dimension. Strande also mentions
that the torus topology uses short cables, which makes
fibre-optic cables unnecessary and leads to further
cost savings.
Navaridas and Miguel-Alonso [6] analysed performance of
2D switch-based torus topologies and fat-trees for up to 7680
compute nodes, on a range of workloads, using simulation
techniques. They conclude that performance degradation
from using torus networks, compared to fat-trees, can reach
20 to 40%, and sometimes more, on communication-intensive
workloads, which limits the applicability of tori in larger
installations.
Cámara et al. [2] introduced a technique that turns unbalanced
rectangular 2D and 3D tori into twisted tori by rearranging
peripheral links, which improves performance characteristics
and restores network symmetry.
3. 3D DUAL-RAIL TORUS NETWORK OF
THE GORDON SUPERCOMPUTER
arXiv:1301.6180v1 [cs.DC] 25 Jan 2013
The Gordon supercomputer [11] uses InfiniBand switches with
P = 36 ports of 4X QDR technology. Switches form a 4x4x4
torus; each switch has 6 neighbours, to each of which it connects
with 3 links, thereby utilizing 18 ports out of 36. 17 more
ports are used to connect 16 compute nodes and one I/O
node.
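The port budget described above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function and variable names are assumptions made for this example and are not taken from the design tool [8].

```python
# Port budget of one Gordon switch, using the figures quoted in the text.
# All names here are illustrative assumptions.

def ports_used(neighbours: int, links_per_neighbour: int,
               compute_nodes: int, io_nodes: int) -> int:
    """Return the number of switch ports occupied on one switch."""
    inter_switch = neighbours * links_per_neighbour  # ports facing other switches
    return inter_switch + compute_nodes + io_nodes

SWITCH_PORTS = 36                      # P = 36 ports per switch
used = ports_used(neighbours=6, links_per_neighbour=3,
                  compute_nodes=16, io_nodes=1)
switches_per_torus = 4 * 4 * 4         # 4x4x4 torus

print(used, SWITCH_PORTS - used)       # ports used and ports left spare
print(switches_per_torus * 16)         # compute nodes attached to one torus
```

With these figures one port per switch stays unused, and a 4x4x4 torus of such switches attaches 64 · 16 compute nodes.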
The network is dual-rail, therefore there are actually two
tori made of switches, and compute and I/O nodes have two
network interfaces, one of which is used to connect to the
switch in the first torus (“rail”), and the other to the second
one. Currently, one rail is used for MPI, and the other one
for I/O traffic. According to Strande [10], there are plans to
use both rails simultaneously to provide failover capabilities
and improve bandwidth.
4. ALGORITHM FOR DESIGNING TORUS
NETWORKS
We propose an algorithm that calculates the number of switches
in a torus network, using as input the number of compute
nodes to be interconnected and, optionally, a blocking factor
that determines the distribution of ports on a switch
between compute nodes and neighbouring switches. The
algorithm is suitable for designing networks built with commodity
hardware, such as Gordon’s network.
As torus networks are inherently prone to congestion,
imposing additional blocking at the switch level is very
disadvantageous. However, sometimes blocking is stipulated by
the hardware manufacturer and cannot be avoided. For example,
in [6] the hardware under review was a blade chassis
equipped with N = 20 compute nodes and an InfiniBand
switch with P = 36 ports. Only 16 ports of the switch were
used to connect it to the outside world, which resulted in
a blocking factor of Bl = 20/16 = 1.25. In order to build torus
networks for such hardware with the proposed algorithm, we
need to specify the blocking factor as an input.
The algorithm tries to build a network using identical switches
with PE ports. Let us describe the algorithm by stages. In
line 1 we check if the switch has enough ports to connect all
N nodes. If it does, we use the star topology with only
one switch and exit.
Otherwise, we build a ring or a torus. In lines 8 to 10
we calculate the number of switch ports that go to compute
nodes and to the neighbouring switches, and then recalculate
the blocking factor for the network. On line 11 we derive
the minimal number of switches required to connect N nodes
with a given blocking factor. The actual torus will contain
slightly more switches (generally, the increase is within 20%
for small networks, and within several percent for the large
ones).
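The port split and the lower bound on the switch count (lines 8 to 11 of Algorithm 1) can be sketched as follows. This is a minimal illustration rather than the code of the tool [8]; the function names are assumptions, and exact rational arithmetic is used here only to keep the floor operation free of floating-point round-off.

```python
import math
from fractions import Fraction

def split_ports(switch_ports: int, blocking):
    """Lines 8-10: split switch ports between nodes and other switches.

    Returns (ports to nodes, ports to switches, resulting blocking factor).
    """
    bl = Fraction(blocking)                      # exact value of Bl
    to_nodes = math.floor(switch_ports * bl / (1 + bl))
    to_switches = switch_ports - to_nodes
    return to_nodes, to_switches, Fraction(to_nodes, to_switches)

def min_switches(nodes: int, ports_to_nodes: int) -> int:
    """Line 11: minimal number of switches (integer ceiling division)."""
    return -(-nodes // ports_to_nodes)

print(split_ports(36, 1))                # non-blocking 36-port switch
print(split_ports(36, Fraction(5, 4)))   # Bl = 1.25, as in the chassis of [6]
print(min_switches(1000, 18))
```

For Bl = 1.25 and 36 ports the split is 20 ports to nodes and 16 to other switches, which matches the blade chassis configuration discussed earlier.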
On line 12, we use a heuristic to determine the number of
torus dimensions, based on the number of switches. It is
important to note that there are no hard rules for choosing
the number of dimensions. Choosing a low number of
dimensions for a high number of compute nodes leads to
increased network diameter and therefore higher latencies. On the
other hand, choosing too high a number of dimensions for a
low number of compute nodes does not provide network
performance benefits but results in complex cabling patterns.
In the case of direct networks this scenario also requires
network adapters with an unnecessarily large number of ports.
The optimal number of dimensions depends on the com-
munication pattern of the application, and can be reliably
determined, for any given application, only through bench-
marking on real hardware or by using simulation such as in
[6]. Therefore we relied on using a heuristic.
Currently, the dimension choice heuristic returns the number
of dimensions as per Table 1, up to D = 5. The layout of
switches in the maximal configuration for that number of
dimensions is provided in the last column of the table for
reference.
If the heuristic returns D = 1, then we use the ring topology
(line 14). Otherwise, we use the torus topology, and need to
calculate the number of switches along each of D dimensions
by rounding $\sqrt[D]{E}$ to the nearest integer (line 17).
Algorithm 1 Design a torus network
Input:
N: Number of nodes to interconnect
Bl: Blocking factor
PE: Number of switch ports
Goal: Optimal network structure:
D: Number of torus dimensions
d = 〈d1, . . . , dD〉: Number of switches along each dimension
E: Total number of switches
Blr: Resulting blocking factor
L: Number of cables
f: Objective function for the optimal network structure
1: if PE ≥ N then
2: { If there exists a switch with N or more ports }
3: print Topology: star
4: E ← 1; Blr ← 1; L ← N
5: Compute f
6: Exit
7: end if
8: PEn ← ⌊PE · (Bl/(1 + Bl))⌋ { Ports to nodes }
9: PEc ← PE − PEn { Ports to other switches }
10: Blr ← PEn/PEc { Resulting blocking }
11: E ← ⌈N/PEn⌉ { Minimal number of switches }
12: D ← GetDimCount(E) { Heuristic for the number of torus dimensions }
13: if D = 1 then
14: print Topology: ring
15: else
16: print Topology: torus
17: di ← round($\sqrt[D]{E}$) | i = 1 . . . D − 1 { Number of switches along dimensions }
18: dD ← ⌈E / $d_1^{D-1}$⌉ { Switches in the last dimension }
19: E ← $\prod_{i=1}^{D} d_i$ { Actual number of switches }
20: end if
21: L ← N + E · PEc/2 { Number of cables }
22: Compute f
This creates a topology close to an ideal square, cube, etc.
Packaging constraints, however, may preclude the use of
this particular ideal layout, and in the resulting unbalanced
torus the number of switches along dimensions may differ
significantly. The number of switches, E, still remains the
same as returned by the algorithm, which allows equipment
cost and other metrics to be calculated correctly.
Switch count, E   Topology   Dimensions, D   Max. configuration
2 or 3            Ring       1               n/a
up to 36          Torus      2               6x6
up to 125         Torus      3               5x5x5
up to 2401        Torus      4               7x7x7x7
more than 2401    Torus      5               (As appropriate)
Table 1: Heuristic for the number of torus dimensions
Compute nodes, N   Dimensions, D   Torus topology   Supercomputer of comparable size
1,000              3               4x4x4            Gordon [11]
6,000              4               4x4x4x6          Stampede [12]
8,000              4               5x5x5x4          Tianhe-1A [5]
10,000             4               5x5x5x5          SuperMUC [4]
19,000             4               6x6x6x5          Titan [7]
Table 2: Sample output for Algorithm 1
In the next step, we calculate the number of switches in the
last dimension (line 18) and recalculate the total number of
switches as the product of switch counts along all dimensions
(line 19).
The number of cables is determined on line 21. The number
of switch ports facing neighbouring switches, PEc, is
divided by two, because one cable connects two ports.
This is then multiplied by the number of switches
E. Compute nodes are connected with an additional N cables.
The network is expandable from N up to E · PE compute
nodes. Inter-switch links run in bundles of approximately
PEc/(2 ·D), therefore it is often possible to use cables that
integrate several links (such as a 12x InfiniBand cable that
integrates three 4x links) to reduce the number of physical
cables, simplifying installation.
Sample output of the algorithm for commodity InfiniBand
switches with PE = 36 ports and a non-blocking network
(Bl = 1) is presented in Table 2.
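The steps of Algorithm 1 can be put together in a short script. The sketch below is an illustrative transcription and not the implementation used in the tool [8]: the names `get_dim_count` and `design_torus` are assumptions, the thresholds are copied from Table 1, the objective function f is omitted, and the cable count uses integer division, which assumes E · PEc is even.

```python
import math
from fractions import Fraction

def get_dim_count(switches: int) -> int:
    """Dimension heuristic transcribed from Table 1."""
    if switches <= 3:
        return 1              # ring
    if switches <= 36:
        return 2
    if switches <= 125:
        return 3
    if switches <= 2401:
        return 4
    return 5

def design_torus(nodes: int, switch_ports: int = 36, blocking=1):
    """Return (topology, dims, switches, cables) following Algorithm 1."""
    if switch_ports >= nodes:                              # lines 1-7: star
        return "star", (1,), 1, nodes
    bl = Fraction(blocking)
    to_nodes = math.floor(switch_ports * bl / (1 + bl))    # line 8
    to_switches = switch_ports - to_nodes                  # line 9
    switches = -(-nodes // to_nodes)                       # line 11 (ceiling)
    dim_count = get_dim_count(switches)                    # line 12
    if dim_count == 1:
        topology, dims = "ring", (switches,)               # line 14
    else:
        topology = "torus"
        side = round(switches ** (1.0 / dim_count))        # line 17
        last = -(-switches // side ** (dim_count - 1))     # line 18 (ceiling)
        dims = (side,) * (dim_count - 1) + (last,)
        switches = math.prod(dims)                         # line 19
    cables = nodes + switches * to_switches // 2           # line 21
    return topology, dims, switches, cables

# Node counts taken from Table 2, with PE = 36 and Bl = 1.
for n in (1000, 6000, 8000, 10000, 19000):
    print(n, design_torus(n))
```

Run with the node counts of Table 2, the sketch yields the same torus layouts as listed there, which is a useful sanity check on the reading of lines 17 to 19.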
5. COST COMPARISON OF TORUS AND
FAT-TREE NETWORKS
We used real life equipment costs provided by Mellanox
Technologies to derive costs of fat-tree and torus networks
for up to 3,888 compute nodes. We utilized the tool for
automated design of cluster interconnection networks [8].
Equipment costs are given for the older generation of equip-
ment (InfiniBand QDR), and technical characteristics are
summarized in Table 3. Cable cost is assumed to be $80.
We consider three models of switches. The first of them, the
36-port switch, is used for building torus networks, and is
also utilized on edge level of fat-tree networks. The other
two are modular switches that have 108 and 216 ports in
their maximal configurations. The actual number of sup-
ported ports depends on the number of installed line cards,
which leads to 6 and 12 configurations of these switches, re-
spectively. Each configuration has its own set of technical
characteristics as well as cost.
The set of equipment described above allows non-blocking
fat-tree networks to be built with up to Nmax = PE · PC/2 =
36 · 216/2 = 3888 nodes, where PC is the port count of the
largest core switch. In Fig. 1 we plot costs of non-blocking
as well as 2:1 blocking fat-tree networks, and torus
networks. As expected, the cost of 2:1 blocking fat-trees is
lower than that of their non-blocking counterparts, but the
reduction in cost is less than twofold. Torus networks are consistently
cheaper than fat-trees; however, their inherent blocking may
have a detrimental effect on application performance that will
not be offset by lower costs.
We also consider an alternative way of building fat-trees:
using 36-port switches for both core and edge layers. This
allows non-blocking fat-tree networks to be built with up to
Nmax = 36 · 36/2 = 648 nodes. Such networks are characterized
by complex wiring patterns between the two layers,
but are marginally cheaper to build. Fig. 2 is essentially a
close-up of the previous figure, focusing on values of N up
to 648 nodes, with an additional curve representing costs of
the alternative fat-tree building method.
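The two Nmax values follow from the same formula. The snippet below is a small illustrative sketch (the function name is an assumption): in a non-blocking two-level fat-tree, each edge switch gives half of its ports to nodes, and a core switch can reach at most as many edge switches as it has ports.

```python
def max_nonblocking_nodes(edge_ports: int, core_ports: int) -> int:
    """Upper bound on nodes in a non-blocking two-level fat-tree.

    At most `core_ports` edge switches can be attached, and each of them
    connects `edge_ports // 2` nodes (the other half are uplinks).
    """
    return core_ports * (edge_ports // 2)

print(max_nonblocking_nodes(36, 216))  # 216-port modular core: 3888
print(max_nonblocking_nodes(36, 36))   # 36-port switches in both layers: 648
```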
As the diagram indicates, using 36-port switches for building
fat-trees does indeed lead to certain cost savings: for
N = 648 nodes, the per-port cost of such networks is roughly
$1,060, while for the usual way of building fat-trees, using
modular switches on the core level, the per-port cost is
roughly $1,930. However, these savings should be weighed
against the cost of compute nodes: if the latter is much
higher than the per-port cost of the interconnection network,
then cost savings might not justify the increased wiring
and maintenance complexity of this type of network.
Example. Let us assume the cost of a compute node is
$5,000. If the per-port cost of two types of interconnection
networks is $1,000 and $2,000, respectively, then a node
together with its network port costs $6,000 or $7,000. The
second option is therefore 7000/6000 ≈ 1.17 times as expensive,
i.e. roughly 17% more (equivalently, the first option saves
about 14%). Factoring in costs of other equipment, as well as
operating expenses, further dilutes savings.
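The arithmetic of the example can be reproduced with the short sketch below. It is illustrative only, with an assumed function name; it reports both the ratio 7000/6000 expressed as extra cost and the corresponding saving measured against the more expensive option, since the two percentages differ.

```python
def compare_options(node_cost: float, port_cost_a: float, port_cost_b: float):
    """Compare per-node totals (node plus one network port) of two networks.

    Returns (extra cost of B relative to A, saving of A relative to B),
    both as fractions.
    """
    total_a = node_cost + port_cost_a
    total_b = node_cost + port_cost_b
    extra_b_over_a = total_b / total_a - 1.0
    saving_a_vs_b = 1.0 - total_a / total_b
    return extra_b_over_a, saving_a_vs_b

extra, saving = compare_options(5000, 1000, 2000)
print(f"B costs {extra:.1%} more than A; A saves {saving:.1%} relative to B")
```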
Figure 1: Cost comparison of fat-tree and torus networks
(Plot not reproduced. Horizontal axis: node count N, 0 to 4000. Vertical axis: total cost of network, $0 to $9,000,000. Curves: fat-tree with modular switches, non-blocking; fat-tree with modular switches, 2:1 blocking; torus, non-blocking.)
Figure 2: Cost comparison of alternative fat-tree building methods
(Plot not reproduced. Horizontal axis: node count N, 0 to 700. Vertical axis: total cost of network, $0 to $1,600,000. Curves: fat-tree with 36-port switches, non-blocking; fat-tree with modular switches, non-blocking; fat-tree with modular switches, 2:1 blocking; torus, non-blocking.)
Switch applicability           Switch model                  Port count   Size, U   Weight, kg   Power, W   Cost, $
Torus, fat-tree (edge layer)   Mellanox Grid Director 4036   36           1         7.7          202        10,820
Fat-tree (core layer)          Mellanox IS5100               18           7         75.1         516        78,500
Fat-tree (core layer)          Mellanox IS5100               36           7         77.8         606        90,000
Fat-tree (core layer)          Mellanox IS5100               54           7         80.6         696        101,500
Fat-tree (core layer)          Mellanox IS5100               72           7         83.3         786        113,000
Fat-tree (core layer)          Mellanox IS5100               90           7         86.1         876        124,500
Fat-tree (core layer)          Mellanox IS5100               108          7         88.9         966        136,000
Fat-tree (core layer)          Mellanox IS5200               18           10        115.7        516        125,500
Fat-tree (core layer)          Mellanox IS5200               36           10        118.4        606        137,000
Fat-tree (core layer)          Mellanox IS5200               54           10        121.2        696        148,500
Fat-tree (core layer)          Mellanox IS5200               72           10        123.9        786        160,000
Fat-tree (core layer)          Mellanox IS5200               90           10        126.7        876        171,500
Fat-tree (core layer)          Mellanox IS5200               108          10        129.5        966        183,000
Fat-tree (core layer)          Mellanox IS5200               126          10        132.2        1,056      194,500
Fat-tree (core layer)          Mellanox IS5200               144          10        135.0        1,146      206,000
Fat-tree (core layer)          Mellanox IS5200               162          10        137.7        1,236      217,500
Fat-tree (core layer)          Mellanox IS5200               180          10        140.5        1,326      229,000
Fat-tree (core layer)          Mellanox IS5200               198          10        143.3        1,416      240,500
Fat-tree (core layer)          Mellanox IS5200               216          10        146.0        1,506      252,000
Table 3: Characteristics of InfiniBand QDR equipment
Network type        Non-blocking                  2:1 blocking
Topology            Star                          Two-level fat-tree
Edge level switch   Mellanox IS5200 (162 ports)   Mellanox Grid Director 4036 (36 ports)
Core level switch   N/A                           Mellanox IS5100 (90 ports)
Power, W            1,236                         2,290
Weight, kg          137.7                         140.0
Size, U             10                            14
Cost, $             229,500                       218,960
Table 4: Structure comparison for two types of fat-tree networks, for N = 150 nodes.
Figure 2 is particularly helpful to emphasize the structure
of networks generated by the network design tool [8].
Consider, for example, the case of non-blocking and 2:1 blocking
fat-trees, for N = 150 compute nodes. The costs of these
two networks are very close, but their structure is entirely
different, which is summarized in Table 4.
If the tool is requested to design a non-blocking network,
it chooses a star topology with a single modular switch. If,
however, a 2:1 blocking network is requested, the result is
a two-layer fat-tree, with 36-port switches on the edge level
and a 90-port switch on the core level. The latter network is
chosen because it is marginally (5%) cheaper. At the same
time, it draws 85% more power and requires 40% more space
in the rack.
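The percentages quoted above can be recomputed from the totals in Table 4. The sketch below is illustrative; the dictionary keys and the helper name are assumptions, while the numbers are the Table 4 totals for N = 150 nodes.

```python
# Totals from Table 4 (N = 150 nodes).
star = {"cost_usd": 229_500, "power_w": 1_236, "size_u": 10}      # non-blocking
fat_tree = {"cost_usd": 218_960, "power_w": 2_290, "size_u": 14}  # 2:1 blocking

def percent_change(new: float, old: float) -> float:
    """Change of `new` relative to `old`, in percent."""
    return 100.0 * (new - old) / old

changes = {key: percent_change(fat_tree[key], star[key]) for key in star}
for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```

The cost difference comes out slightly below 5%, while power and rack space increase by roughly 85% and 40%, in line with the figures in the text.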
This example illustrates two points: (A) more complex cri-
terion functions, such as total cost of ownership, should
preferably be used instead of capital costs; (B) trying to
design blocking networks doesn’t necessarily save consider-
able amounts of money, therefore designers should consider
non-blocking networks first.
6. CONCLUSIONS
We presented a simple algorithm for automated design of
torus networks. The algorithm relies on a heuristic to choose
the number of torus dimensions. We also compared real life
costs of torus and fat-tree networks. We found that torus
networks are consistently cheaper than non-blocking and 2:1
blocking fat-trees; however, these cost savings may not offset
performance penalties, depending on applications used.
7. REFERENCES
[1] Y. Ajima, T. Inoue, S. Hiramoto, and T. Shimizu.
Tofu: Interconnect for the K computer. Fujitsu Sci.
Tech. J, 48(3):280–285, 2012. http://www.fujitsu.
com/downloads/MAG/vol48-3/paper05.pdf.
[2] J. Cámara, M. Moretó, E. Vallejo, R. Beivide,
J. Miguel-Alonso, C. Martínez, and J. Navaridas.
Mixed-radix twisted torus interconnection networks.
In Parallel and Distributed Processing Symposium,
2007. IPDPS 2007. IEEE International, pages 1–10.
IEEE, 2007.
[3] D. Chen, N. Eisley, P. Heidelberger, R. Senger,
Y. Sugawara, S. Kumar, V. Salapura, D. Satterfield,
B. Steinmacher-Burow, and J. Parker. The IBM Blue
Gene/Q interconnection network and message unit. In
High Performance Computing, Networking, Storage
and Analysis (SC), 2011 International Conference for,
pages 1–10. IEEE, 2011. http://mmc.geofisica.
unam.mx/edp/SC11/src/pdf/papers/tp19.pdf.
[4] Leibniz Rechenzentrum. SuperMUC.
http://www.lrz.de/services/compute/supermuc/.
[5] National Supercomputing Center in Tianjin (NSCC).
Tianhe-1A. http://www.nscc-tj.gov.cn/en/.
[6] J. Navaridas and J. Miguel-Alonso. Indirect cube: A
power-efficient topology for compute clusters. Optical
Switching and Networking, 8(3):162–170, 2011.
[7] Oak Ridge National Laboratory (ORNL). Titan.
http://www.olcf.ornl.gov/titan/.
[8] K. S. Solnushkin. Fat-tree and torus network design
tool at ClusterDesign.org.
http://clusterdesign.org/torus/.
[9] K. S. Solnushkin. Computer cluster design automation
using web services. In Proceedings of International
Supercomputing Conference, ISC’12, June 2012.
[10] S. Strande. Gordon – design and performance of a 3D
torus interconnect for data intensive computing. In
Proceedings of HPC Advisory Council Held in
Conjunction with the International Supercomputing
Conference, 2012.
www.hpcadvisorycouncil.com/events/2012/
European-Workshop/Presentations/4_SDSC.pdf.
[11] S. Strande, P. Cicotti, R. Sinkovits, W. Young,
R. Wagner, M. Tatineni, E. Hocks, A. Snavely, and
M. Norman. Gordon: design, performance, and
experiences deploying and supporting a data intensive
supercomputer. In Proceedings of the 1st Conference
of the Extreme Science and Engineering Discovery
Environment: Bridging from the eXtreme to the
campus and beyond, page 3. ACM, 2012.
[12] Texas Advanced Computing Center (TACC).
Stampede. http://www.tacc.utexas.edu/stampede.
