Communication delay analysis of fault-tolerant pipelined circuit switching in torus  by Safaei, F. et al.
Journal of Computer and System Sciences 73 (2007) 1131–1144
www.elsevier.com/locate/jcss
Communication delay analysis of fault-tolerant pipelined circuit
switching in torus
F. Safaei a,c, A. Khonsari b,a,∗, M. Fathy c, M. Ould-Khaoua d
a IPM, School of Computer Science, Tehran, Iran
b Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
c Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
d Department of Computing Science, University of Glasgow, UK
Received 3 October 2005; received in revised form 11 March 2006
Available online 24 February 2007
Abstract
Large-scale parallel systems, Multiprocessor Systems-on-Chip (MP-SoCs), multicomputers, and cluster computers are often
composed of hundreds or thousands of components (such as routers, channels and connectors) that collectively have failure rates
higher than those of ordinary systems. One of the most important issues in the design of such systems is the development
of efficient fault-tolerant mechanisms that provide high throughput and low latency in communications, so that these
systems keep running in a degraded mode until the faulty components are repaired. Pipelined Circuit Switching (PCS) has
been suggested as an efficient switching method for supporting inter-processor communications in networks due to its ability to
meet both the communication performance and the fault-tolerance demands of such systems. This paper presents a new mathematical
model to investigate the effects of failures and capture the mean message latency in torus using PCS in the presence of faulty
components. Simulation experiments confirm that the analytical model exhibits a good degree of accuracy under different working
conditions.
© 2007 Elsevier Inc. All rights reserved.
Keywords: Large-scale parallel systems; Fault-tolerance; PCS; Torus; Adaptive routing; Virtual channels; Message latency; Queuing theory;
Performance evaluation
1. Introduction
Large-scale parallel systems, Multiprocessor Systems-on-Chip (MP-SoCs), multicomputers, and cluster computers
are potential candidates for providing very high computational power. These systems are usually organized as an
ensemble of nodes, each with its own processor, local memory, and other supporting devices. The success of such
systems is highly dependent on the efficiency of their underlying interconnect networks, which are constructed from
routers and channels. In these systems, nodes exchange data and coordinate their efforts by sending and receiving
messages through the underlying network. Consequently, the performance of such systems depends heavily on the
performance of the interconnect network, which in turn depends mainly on its topology, switching method, and routing
algorithm [1].
* Corresponding author.
E-mail addresses: safaei@ipm.ir (F. Safaei), ak@ipm.ir (A. Khonsari), mahfathy@iust.ac.ir (M. Fathy), mohamed@dcs.gla.ac.uk
(M. Ould-Khaoua).
0022-0000/$ – see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.jcss.2007.02.003
The topology of the network defines how the nodes are interconnected and is generally modelled as a graph in which
the vertices represent the nodes and the edges denote the channels. Multidimensional meshes and k-ary n-cubes are the
basic topologies used in most current parallel computers [2–4]. The k-ary n-cube has an n-dimensional grid structure
with k nodes in each dimension such that every node is connected to its neighbouring nodes in each dimension by
direct channels. The hypercube (where k = 2) and the torus (where n = 2 or 3) are two popular topologies for direct
networks. The former has been used in multicomputers such as the cosmic cube [5] and iPSC/2 [6] while the latter
has been adopted in systems like the iWarp [4], Cray T3E [2] and Cray T3D [3].
In most parallel computers, a message enters the network from a source node and is switched or routed towards its
destination through a series of intermediate nodes. Wormhole switching (also known as wormhole routing) is a variant
of the virtual cut-through technique [1] that avoids the need for large buffers. In wormhole switching, a message
is transmitted between nodes in units of flits [7], the smallest units of a message on which flow control can be
performed. The header flit of a message contains all the necessary routing information and all the other flits contain
the data elements. The flits of the message are transmitted through the network in a pipelined fashion. Since only the
header flit has the routing information, all the trailing flits follow the header flit contiguously. Wormhole switching
achieves high performance but is prone to deadlock in the presence of faults.
Routing is the process of transmitting data from one node to another node in a given communication network.
Most past parallel systems have adopted deterministic routing where messages with the same source and destination
addresses always take the same network path. This form of routing has been popular because it requires a simple
deadlock-avoidance algorithm, resulting in a simple router implementation [1,8]. However, if any channel along the
message path is heavily loaded, the message experiences large delays and if any node/channel along the path is
faulty, the message cannot be delivered at all. Alternatively, adaptive routing overcomes the performance limitations
of deterministic routing by enabling messages to explore all potential paths between a pair of nodes. This routing
flexibility, however, often requires dedicated hardware resources to ensure deadlock-freedom [1,8].
Gaughan and Yalamanchili [9] have proposed PCS that combines aspects of Circuit Switching (CS) and wormhole
switching. PCS sets up a path before starting data transmission as in CS. However, PCS differs from CS in the way that
a message establishes a path. When a message header encounters blocking and cannot progress towards its destination,
it releases the last reserved channel by backtracking to the previous node, and attempts to find an alternative path. In
PCS data flits do not immediately follow the header flits into the network as in wormhole switching. Consequently,
increased flexibility is available in routing the header flit. For instance, rather than blocking on a faulty output channel
at an intermediate router, the header may backtrack to the preceding router and release the previously reserved
channel. A new output channel may then be tried at the preceding router to find an alternative path to the
destination. When the header finally reaches the destination node, an acknowledgement flit is transmitted back to the
source node. Now data flits can be pipelined over the path just as in wormhole switching.
This approach is flexible in that headers can perform a backtracking search of the network, reserving and releasing
virtual channels [10] in an attempt to establish a fault-free path to the destination. A virtual channel has its own flit
queue, but shares the bandwidth of the physical channel with other virtual channels in a time-multiplexed manner [10].
PCS has been adopted as an efficient switching method for supporting inter-processor communications in massively
parallel networks because it meets both the communication performance and the fault-tolerance demands of these networks,
and it has been used in systems including the Ariadne [11], METRO [12], and MMR Router [13]. An advantage
of PCS is its ability to provide messages with a guaranteed latency once a connection has been established. This
feature makes it attractive for transmitting multimedia data, which is very sensitive to jitter (i.e., variation
in latency) [13].
Almost all of the methods and routing algorithms proposed for recent parallel systems, such as
multiprocessor systems-on-chip (MP-SoCs) and multicomputers, have resorted to simulation [11–13] to
evaluate their performance merits. Simulation evaluates the performance of a system for a specific
configuration. However, such detailed investigations through simulation are expensive in both computation and time.
An alternative to simulation is analytical modelling, which offers cost-effective evaluation
tools that can help designers assess the performance merits of parallel systems. Analytical models of PCS under uniform
traffic have been reported in the literature [14–16]. However, to the best of our knowledge, no analytical model
has been reported that evaluates the performance of a network with faulty components. This paper proposes
a novel analytical model for the performance evaluation of PCS in the torus with faults when adaptive routing and
virtual channel flow control are used. The model achieves a good degree of accuracy, as shown by the results
collected from simulation experiments.
The rest of the paper is organized as follows. Section 2 reviews some definitions and background that will be useful
for the subsequent sections. Section 3 describes the analytical model while Section 4 validates the model through
simulation experiments. Section 5 conducts an extensive comparative performance analysis of wormhole switching
and PCS in the presence of failures by means of analytical modelling. Finally, Section 6 draws some conclusions and
outlines future work.
2. Preliminaries
This section briefly describes the k-ary 2-cube (2D torus) and its router structure.
2.1. The torus and its router structure
The radix-k two-dimensional torus has $N = k^2$ nodes, arranged in two dimensions, with k nodes per dimension.
Each node can be identified by a 2-digit radix-k address (x, y), $0 \le x, y \le k - 1$, where x represents the row number
and y indicates the column number of the node. Nodes with addresses $(x_1, y_1)$ and $(x_2, y_2)$ are connected if
$x_1 = (x_2 \pm 1) \bmod k$ and $y_1 = y_2$, or $y_1 = (y_2 \pm 1) \bmod k$ and $x_1 = x_2$. Thus, each node is connected through four bidirectional channels to four
neighbouring nodes, two in each dimension. Fig. 1 shows a 9 × 9 torus where each node consists of a Processing
Element (PE) and a switching element or router.
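The wrap-around connectivity described above can be illustrated with a minimal Python sketch. The function name torus_neighbours is hypothetical and is not part of the original model; the sketch assumes k ≥ 3 so that the four neighbours are distinct.

```python
def torus_neighbours(x, y, k):
    """Return the four neighbours of node (x, y) in a k x k torus.

    Illustrative helper (not from the paper): each dimension wraps around
    modulo k, so border nodes are linked to the opposite border.
    """
    return [
        ((x + 1) % k, y),   # next row
        ((x - 1) % k, y),   # previous row
        (x, (y + 1) % k),   # next column
        (x, (y - 1) % k),   # previous column
    ]
```

For the 9 × 9 torus of Fig. 1, the corner node (0, 0) is therefore adjacent to (1, 0), (8, 0), (0, 1) and (0, 8).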
The PE contains a processor and some local memory. A node is connected to its neighbouring nodes through
four input and four output channels. The remaining channels are used by the PE to inject messages into and eject messages from
the network. Messages generated by the PE are transferred to the router through the injection channel.
Messages at the destination are transferred to the local PE through the ejection channel. Each physical channel is
associated with some number, say V, of virtual channels. The router contains flit buffers for each incoming virtual channel.
A V-way crossbar switch directs message flits from any input virtual channel to any output virtual channel. Such a
switch can simultaneously connect multiple input virtual channels to multiple output virtual channels provided there is no conflict.
3. The proposed analytical model
This section describes first the assumptions used in the analysis, and then presents the analytical model for PCS
in the presence of faults. Moreover, a summary of notation used in derivation of the analytical model is provided in
Table 1.
Fig. 1. A 9 × 9 torus and its router structure.
Table 1
Summary of notation used in the analytical model
Notation              Description
$\lambda$             Message generation rate at a node
$\lambda_c$           Traffic rate on a network channel
$\bar{k}$             Mean message distance along a dimension
$\bar{d}$             Mean message distance
$r$                   Distance between source and destination nodes (in hops)
$k$                   Network radix (number of nodes per dimension)
$N$                   Network size (number of nodes)
$M$                   Message length (in flits)
$P$                   Probability that a node is faulty
$V$                   Number of virtual channels per physical channel
$\bar{V}$             Average degree of virtual channel multiplexing that takes place at a physical channel
$\bar{T}_r$           Mean network latency
$\bar{W}_s$           Mean waiting time seen by a message in the source node
$\sigma_r^2$          Variance of the message service time at a channel
$P_v$                 Probability that v virtual channels are busy at a given physical channel
$Q_v$                 Intermediate variable used in the calculation of $P_v$
$p_{b_j}$             Probability that a message is blocked at the jth hop channel
$C$                   Random variable denoting the time to set up a connection
$n_s$                 Intermediate variable used for characterizing traffic on network channels
$L^*_C(S)$            Laplace–Stieltjes transform of C
$\alpha, \beta$       Intermediate variables used in the calculation of $L^*_C(S)$
$E[C^2]$              Second moment of C
$P_{\varphi_j}$       Probability that only one dimension remains for a message to cross on its j-hop path
$\xi_j^t$             Probability that the header has entirely crossed t dimensions on its j-hop path
$E[L]$                Expected number of active channels
$\theta$              Traffic intensity from a given source to a particular destination
$E[Y]$                Mean number of surviving nodes
$\gamma$              Throughput of the network
$E[n]$                Mean queue occupancy experienced by a message at the source node
$C_r$                 Coefficient of variation of the service time distribution ($\sigma_r/\bar{T}_r$)
$E[T_r^2]$            Second moment of $T_r$
$\lambda_{opt}$       The point on the power curve at which the power is maximized
3.1. Assumptions
The analytical model is built on the following assumptions, which are commonly used in the
literature [10,14–19]:
(a) Nodes generate traffic independently of each other, which follows a Poisson process with a mean rate of λ mes-
sages per cycle.
(b) The arrival process at a given channel is approximated by an independent Poisson process.
(c) Message length is M flits, where M is a random variable with Laplace–Stieltjes transform L∗M(S). Each flit
requires one cycle to cross from one node to the next.
(d) Message destinations are uniformly distributed across network nodes.
(e) The local queue at the injection channel in the source node has infinite capacity. Furthermore, messages are
transferred to the local PE as soon as they arrive at their destinations through the ejection channel.
(f) V (V ≥ 1) virtual channels are used per physical channel. The router contains one flit buffer for each input virtual
channel. At a given routing step, the header chooses randomly one of the available virtual channels at one of the
physical channels that brings it closer to its destination.
(g) Node failures are equiprobable and independent of each other; each
node fails with probability P.
(h) Fault patterns are static [1,8], distributed uniformly through the network, and do not disconnect the network.
(i) Nodes (processors) are more complex than links and thus have higher failure rates [1,8]. Therefore, we assume
only node failures.
(j) When the header encounters a faulty node or experiences blocking because all the required virtual channels
are busy, it has to backtrack to the preceding node and then seek an alternative path to advance towards its
destination [9].
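Assumptions (g) and (h) can be illustrated with a short Python sketch that draws a static fault pattern. The function names, the seed parameter and the use of rejection sampling (redrawing until the surviving nodes are connected) are assumptions made here for illustration; the paper does not state how connected fault patterns are obtained.

```python
import random
from collections import deque


def is_connected(alive, k):
    """Breadth-first search over surviving nodes using torus wrap-around links."""
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        for nxt in (((x + 1) % k, y), ((x - 1) % k, y),
                    (x, (y + 1) % k), (x, (y - 1) % k)):
            if nxt in alive and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(alive)


def sample_fault_pattern(k, p_fail, seed=0, max_tries=1000):
    """Draw a static set of faulty nodes in a k x k torus.

    Each node fails independently with probability p_fail (assumption (g)).
    Patterns that disconnect the surviving nodes are rejected and redrawn
    (assumption (h)); this rejection step is an illustrative choice.
    """
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(k) for y in range(k)]
    for _ in range(max_tries):
        faulty = {n for n in nodes if rng.random() < p_fail}
        alive = set(nodes) - faulty
        if alive and is_connected(alive, k):
            return faulty
    raise RuntimeError("no connected fault pattern found; try a smaller p_fail")
```

Because the pattern is drawn once and then held fixed, it matches the static-fault assumption; a different seed gives a different pattern with the same failure probability.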
3.2. Outline of the analytical model
The model computes the mean message latency as follows. First, the mean network latency, $\bar{T}_r$, that is, the time
to cross the network, is determined. Then, the mean waiting time seen by a message in the source node, $\bar{W}_s$, is
evaluated. Finally, to capture the effects of virtual channel multiplexing, the mean message latency is scaled
by a factor, $\bar{V}$, representing the average degree of virtual channel multiplexing that takes place at a given physical
channel. Therefore, the mean message latency can be written as [18]
$$\text{Mean message latency} = (\bar{T}_r + \bar{W}_s)\,\bar{V}. \tag{1}$$
Under a uniform traffic pattern, the average number of channels that the header visits along each dimension, $\bar{k}$,
and across the network, $\bar{d}$, to set up a connection from the source to the destination node are given by Agarwal [19] as
$$\bar{k} \approx k/4, \tag{2}$$
$$\bar{d} = 2\bar{k}. \tag{3}$$
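Eqs. (2) and (3) can be checked numerically with a small Python sketch. The function names are hypothetical, and the averaging convention (over all k positions of a ring, the source included) is an assumption chosen here because it makes the k/4 value exact for even k; for odd k the result is only close to k/4.

```python
def mean_ring_distance(k):
    """Mean shortest-path distance along one bidirectional ring of k nodes.

    The average is taken over all k destination offsets, including offset 0
    (an illustrative convention, not stated in the paper).
    """
    return sum(min(i, k - i) for i in range(k)) / k


def mean_torus_distance(k):
    """Mean distance in a 2D torus: each of the two dimensions contributes one ring."""
    return 2 * mean_ring_distance(k)
```

For the 8 × 8 and 16 × 16 networks used later in the validation, this gives per-dimension means of 2 and 4 hops and network-wide means of 4 and 8 hops, in line with Eqs. (2) and (3).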
3.2.1. Calculation of the traffic rate on a network channel (λc)
The traffic rate of messages received by each channel, $\lambda_c$, can be approximated as follows. In PCS,
the header reserves channels as it advances towards its destination and data flits are transmitted through the
reserved path. On average, $\bar{C}$ channels are visited to establish a path. Among these $\bar{C}$ visits, let $C^+$ denote the number
of visits that take place when the header advances (i.e., reserves a channel) in the direction leading to the destination
node. Similarly, let $C^-$ denote the number of visits that occur when the header encounters a busy or faulty situation
and backtracks to its preceding node in the direction leading to the source. Given that the acknowledgement flit travels,
on average, $\bar{d}$ hops to reach the source node, $C^+$ and $C^-$ satisfy the following equations
$$\begin{cases} C^+ + C^- + \bar{d} = \bar{C}, \\ C^+ - C^- = \bar{d}. \end{cases} \tag{4}$$
Adding the two equations eliminates $C^-$ and $\bar{d}$, which yields the number of channels that the header crosses in the direction leading to the
destination node as
$$C^+ = \bar{C}/2. \tag{5}$$
Fully adaptive routing allows the header to use any available channel that brings it closer to its destination,
resulting in an evenly distributed traffic rate on all network channels. In a 2D torus, each node has $4(1-P)^2$ surviving
output network channels, so the traffic arriving at a network channel is $n_s$ times that generated by a
source node, where $n_s$ is estimated as
$$n_s = \frac{N C^+}{4N(1-P)^2} = \frac{\bar{C}}{8(1-P)^2}. \tag{6}$$
Therefore, the traffic rate arriving at a network channel can be written as
$$\lambda_c = \lambda n_s = \frac{\lambda \bar{C}}{8(1-P)^2}. \tag{7}$$
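As a worked illustration of Eqs. (6) and (7), the following Python sketch evaluates the channel traffic rate. The function and parameter names are hypothetical, and the input check is an addition made here for safety.

```python
def channel_traffic_rate(lam, c_bar, p_fail):
    """Eq. (7): traffic rate on a network channel.

    lam    -- message generation rate per node (messages/cycle)
    c_bar  -- mean number of channels visited to establish a path
    p_fail -- node failure probability P
    """
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must lie in [0, 1)")
    n_s = c_bar / (8.0 * (1.0 - p_fail) ** 2)   # Eq. (6)
    return lam * n_s
```

The sketch makes the effect of faults visible: with the same generation rate and set-up effort, raising P from 0 to 0.5 multiplies the load on each surviving channel by four.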
3.2.2. Calculation of the mean time to set up a reserved path (C)
In the PCS flow control mechanism, the path set-up and data transmission stages are decoupled [8,14]. The header
is first routed to construct a path. When the header encounters a faulty node or experiences blocking because all the
required virtual channels are busy, it may perform controlled and limited backtracking. The mean path set-up time
of an r-hop message ($1 \le r \le \bar{d}$), that is, a message that traverses r hops to reach its destination, can be calculated using
random walk theory [20], as illustrated in Fig. 2.
Fig. 2. The Markov chain diagram for modelling the behaviour of the header.
In this figure, each state represents the current location of the header along its network path. States $\pi_0$ and $\pi_r$ denote
that the header is at the source and destination nodes, respectively. State $\pi_j$ ($1 \le j \le r-1$) corresponds to the
case where the header is at an intermediate node that is j hops away from the source. A transition from state $\pi_j$ to $\pi_{j+1}$
implies that the header has encountered a non-faulty node and has succeeded in acquiring a virtual channel that brings it one
hop closer to its destination. A transition from state $\pi_j$ to state $\pi_{j-1}$ means that, at the node corresponding to
state $\pi_j$, the header is blocked by a fault or none of the output channels along a path to the destination is available.
Consequently, the header has to backtrack to the previous node. We denote by $p_{b_j}$ the probability that the
header is blocked at state $\pi_j$. Thus, $1 - p_{b_j}$ is the probability of advancing across the reserved
channel. State $\pi_r$ is a final state, known as an “absorbing state”, since once the header reaches its destination, a full
path is reserved. State $\pi_0$ is a “partially reflecting” state, with reflecting probability $p_{b_0}$, because a header that
backtracks to its source node due to blocking has not succeeded in establishing a path and has to
make a new attempt. In what follows, we calculate the expected time to reach state $\pi_r$ starting from state
$\pi_0$, which corresponds to the average time for the header to reserve a path from the source to the destination. This time can
be computed using first-step analysis applied to the Markov chain [20].
Let $C_j$ be the mean time to reach the absorbing state starting from state $\pi_j$; $C_j$ is always finite [20]. If
the header at state $\pi_j$ succeeds in acquiring a virtual channel, it enters the node corresponding to
state $\pi_{j+1}$ and the residual mean time is $C_{j+1}$. When the header encounters a fault or blocking, it backtracks to the previous node, corresponding to
state $\pi_{j-1}$, and the residual mean time is $C_{j-1}$. It follows that the average time, $C_j$, satisfies
the following equation
$$C_j = \begin{cases} (1 - p_{b_j})\,C_{j+1} + p_{b_j}\,C_{j-1} + 1, & 1 \le j \le r-1, \\ (1 - p_{b_0})(C_1 + 1) + p_{b_0}\,C_0, & j = 0, \\ 0, & j = r. \end{cases} \tag{8}$$
When the header reaches its destination, an acknowledgement flit is transmitted back to the source node. Solving the
above equations iteratively [21], the first two moments, $\bar{C}$ and $E[C^2]$, of the time to set up a path can be given by
$$\bar{C} = C_0 + r, \tag{9}$$
$$E[C^2] = \bar{C}^2 + (\bar{C} - 2r)^2. \tag{10}$$
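The iterative solution of Eq. (8) can be sketched in Python as follows. The function name and the difference variable used below are illustrative and not taken from the paper. Writing $D_j = C_j - C_{j+1}$ for the mean time to move from state $\pi_j$ to $\pi_{j+1}$, the j = 0 case of Eq. (8) reduces to $D_0 = 1$ (for $p_{b_0} < 1$) and the intermediate case rearranges to $(1 - p_{b_j}) D_j = p_{b_j} D_{j-1} + 1$, so that $C_0$ is the sum of the $D_j$.

```python
def path_setup_moments(p_block, r):
    """Solve Eq. (8) for C_0 and return (C_bar, E[C^2]) from Eqs. (9) and (10).

    p_block[j] is the blocking probability p_bj at state j, for j = 0 .. r-1.
    Note that p_block[0] does not affect C_0 in Eq. (8) as long as it is below 1.
    """
    if r < 1 or len(p_block) != r:
        raise ValueError("need r >= 1 and exactly r blocking probabilities")
    if any(not 0.0 <= p < 1.0 for p in p_block):
        raise ValueError("each blocking probability must lie in [0, 1)")
    # step holds D_j = C_j - C_{j+1}; Eq. (8) with j = 0 gives D_0 = 1.
    step = 1.0
    c0 = step
    for j in range(1, r):
        # Rearranged Eq. (8): (1 - p_j) * D_j = p_j * D_{j-1} + 1.
        step = (p_block[j] * step + 1.0) / (1.0 - p_block[j])
        c0 += step
    c_bar = c0 + r                                        # Eq. (9)
    second_moment = c_bar ** 2 + (c_bar - 2 * r) ** 2     # Eq. (10)
    return c_bar, second_moment
```

With no blocking the header needs exactly r steps, so for r = 4 the sketch returns $\bar{C} = 8$ once the r acknowledgement cycles of Eq. (9) are added. With a blocking probability of 0.5 at every intermediate state and r = 3, it gives $C_0 = 9$, the familiar $r^2$ hitting time of a symmetric random walk with a reflecting origin.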
3.2.3. Calculation of the mean network latency (T r)
Since the torus topology is symmetric and message destinations are uniformly distributed across the network, the
arrival patterns of messages (and the service times seen by messages) at the network channels exhibit similar statistical
behaviour. In PCS, the network latency of a message consists of two parts: the time to set up a path and the
delay due to the actual message transmission. Therefore, the network latency of an r-hop message can be written
as
$$T_r = M + r + C, \tag{11}$$
where M and C (given by Eq. (9)) are random variables denoting the message length and the path set-up time,
respectively. Note that in Eq. (11) the term r accounts for the r cycles that are required to send the acknowledgement flit
back to the source node.
The Laplace–Stieltjes transform [21] of $T_r$ can be written as
$$L^*_{T_r}(S) = L^*_M(S)\, e^{-rS}\, L^*_C(S), \tag{12}$$
where $L^*_M(S)$ and $L^*_C(S)$ represent the Laplace–Stieltjes transforms of M and C, respectively. $L^*_C(S)$ can be approximately
expressed as [14,21]
$$L^*_C(S) = \frac{\alpha}{S + \beta}, \quad \alpha, \beta > 0. \tag{13}$$
The parameters α and β are selected to match the first two moments, $\bar{C}$ and $E[C^2]$, of the time to establish a path
and are found to be [14]
$$\alpha = 4\bar{C}^3 / E^2[C^2], \tag{14}$$
$$\beta = 2\bar{C} / E[C^2]. \tag{15}$$
Using the definition of the moments of a random variable [21], $\bar{C}$ and $E[C^2]$ can be obtained from Eqs. (9) and (10).
3.2.4. Calculation of the probability of header message blocking (pbj )
In this section we compute the probability of header blocking. Consider a header that has to
cross r hops to reach its destination. The header may be stopped at a given node in two situations: when
it reaches a faulty node, or when it finds that all the virtual channels it could use are busy. When this occurs, the header does
not wait to acquire a virtual channel but immediately backtracks to the preceding node. Therefore, the
probability, $p_{b_j}$, that an r-hop message is stopped after making j hops is written as
$$p_{b_j} = \sum_{t=0}^{1} \xi_j^t \left(P_V^{\,2-t}\right) + P, \quad 0 \le j \le r-1, \tag{16}$$
where $P_V$ is the probability of all virtual channels being busy and P is the probability of a node being faulty. Note
that the probability of node failure and that of virtual channels being busy are independent. So, the probability of a
message returning to the previous node is taken as the sum of these two probabilities.
Moreover, parameter $\xi_j^t$ is the probability that the header has entirely crossed t dimensions on its j-hop path.
This probability is a function of the number of dimensions that the message still has to cross. Therefore, to calculate $\xi_j^t$
we first need to find how many dimensions remain for the message to reach its destination. The number of alternative
routes that a message header can select at its next hop to advance towards its destination depends on the number of
hops already made in both dimensions. When the message header has made j ($0 \le j \le r-1$) hops, these hops can be
a combination of $(i_1, i_2)$ hops, with $i_1$ and $i_2$ being the number of hops made in the first and second dimensions,
respectively, so that $i_1 + i_2 = j$ and $0 \le i_1, i_2 \le \bar{k}$.
To compute the probability that a message header has crossed all the channels of one dimension, two cases need to
be considered:
(i) When $0 \le j < \bar{k}$, the number of $(i_1, i_2)$ combinations is $j + 1$. In this case, the message header still has to
cross channels in both dimensions and can therefore choose among the adaptive virtual channels of both dimensions.
(ii) When $\bar{k} \le j < r$, the number of $(i_1, i_2)$ combinations is $r - j + 1$. In only two of these
combinations, $(\bar{k}, j - \bar{k})$ and $(j - \bar{k}, \bar{k})$, has the message header crossed all channels of one dimension; for the remaining hops,
the header then only crosses channels of the other dimension.
The probability that only one dimension remains for a message to cross on its j-hop path, $P_{\varphi_j}$, can be written
as
$$P_{\varphi_j} = \begin{cases} \dfrac{2}{r-j+1}, & \bar{k} \le j < r, \\ 0, & 0 \le j < \bar{k}. \end{cases} \tag{17}$$
Finally, the probability that the header has entirely crossed t dimensions on its j-hop path is given by
$$\xi_j^t = \begin{cases} 1 - P_{\varphi_j}, & t = 0, \\ P_{\varphi_j}, & t = 1. \end{cases} \tag{18}$$
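Eqs. (16)–(18) can be combined in a short Python sketch. The function and parameter names are hypothetical; k_bar stands for the mean distance per dimension and p_all_busy for $P_V$, and the range check on j is an addition made here.

```python
def one_dimension_left_prob(j, r, k_bar):
    """Eq. (17): probability that only one dimension remains after j hops."""
    if not 0 <= j < r:
        raise ValueError("j must satisfy 0 <= j < r")
    return 2.0 / (r - j + 1) if j >= k_bar else 0.0


def blocking_prob(j, r, k_bar, p_all_busy, p_fail):
    """Eq. (16): probability that an r-hop header is stopped after j hops.

    p_all_busy -- P_V, probability that all virtual channels of a channel are busy
    p_fail     -- P, node failure probability
    """
    p_phi = one_dimension_left_prob(j, r, k_bar)
    xi = (1.0 - p_phi, p_phi)                  # Eq. (18): xi[t] for t = 0, 1
    # With t dimensions already crossed, 2 - t physical channels can still be used,
    # and the header is blocked only if all of them have every virtual channel busy.
    busy = sum(xi[t] * p_all_busy ** (2 - t) for t in (0, 1))
    return busy + p_fail
```

The exponent 2 − t reflects the adaptivity that is lost once a dimension has been fully crossed: early in the path both dimensions must be busy for the header to be blocked, whereas on the last hops a single busy channel is enough.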
3.2.5. Calculation of the mean waiting time in the source node (Ws)
In this section, we compute the mean waiting time at the source node. We assume that each node fails independently
with probability P. We also assume that messages have an arbitrary (but known) length or service distribution.
The arrival process is taken to be Poisson, a single server is assumed, and the queue buffer size is taken
to be infinite. Such a queue is called an M/G/1 queue in Kendall's notation [21]. Since a channel survives if and
only if both nodes at its ends survive, the surviving probability of a channel is given by
$$\text{Prob}[\text{a channel survives}] = (1 - P)^2. \tag{19}$$
The expected number of active channels, E[L], crossing any dimension is [21]
$$E[L] = 2k^2(1 - P)^2. \tag{20}$$
We assume that messages are uniformly destined to all other surviving nodes in the network. The traffic intensity
from a given source to a particular destination, θ, is defined as
$$\theta = \lambda / [k^2(1 - P) - 1]. \tag{21}$$
In the torus, every message generated by a node must travel over one of the surviving channels in order to
reach its destination. We assume that the traffic crossing these channels is balanced. The mean number of
surviving nodes, E[Y], is given by
$$E[Y] = k^2(1 - P). \tag{22}$$
Each node sends θ units of traffic to every other surviving node in the network. Thus, each node sends
$\theta k^2(1 - P)$ units of traffic. Since there are $k^2(1 - P)$ surviving nodes in the network and the traffic is balanced on each
channel, we can define the traffic load per channel as
$$\rho \triangleq \frac{\text{traffic intensity of each node} \times \text{number of surviving nodes}}{\text{average service rate of a channel}} = \lambda \bar{T}_r. \tag{23}$$
We define the throughput of the network, γ, as the traffic actually transmitted by the network, i.e., the
number of messages per unit time accepted by the system and hence delivered at the output:
$$\gamma = \lambda k^2(1 - P). \tag{24}$$
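The quantities in Eqs. (19)–(22) and (24) are simple to evaluate; a minimal Python sketch follows. The function name and the dictionary keys are hypothetical, and the check that more than one node survives is an addition made here so that Eq. (21) stays well defined.

```python
def surviving_network_stats(k, p_fail, lam):
    """Evaluate Eqs. (19)-(22) and (24) for a k x k torus with node failure probability p_fail."""
    channel_survives = (1.0 - p_fail) ** 2             # Eq. (19)
    active_channels = 2 * k ** 2 * channel_survives    # Eq. (20)
    surviving_nodes = k ** 2 * (1.0 - p_fail)          # Eq. (22)
    if surviving_nodes <= 1.0:
        raise ValueError("fewer than two surviving nodes on average")
    theta = lam / (surviving_nodes - 1.0)              # Eq. (21)
    throughput = lam * surviving_nodes                 # Eq. (24)
    return {
        "channel_survives": channel_survives,
        "active_channels": active_channels,
        "surviving_nodes": surviving_nodes,
        "theta": theta,
        "throughput": throughput,
    }
```

For a fault-free 8 × 8 torus this gives 128 active channels and 64 surviving nodes; with P = 0.5 the expected number of active channels drops to 32 while the expected number of surviving nodes only halves, because a channel needs both of its end nodes to survive.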
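Applying the Pollaczek–Khinchine (P–K) mean value formula [21] yields the mean queue occupancy experienced
by a message at the source node, E[n], which is given by
$$E[n] = \left(\frac{\rho}{1-\rho}\right)\left[1 - \frac{\rho}{2}\left(1 - (\sigma_r/\bar{T}_r)^2\right)\right] = \left(\frac{\rho}{1-\rho}\right)\left[1 - \frac{\rho}{2}\left(1 - C_r^2\right)\right], \tag{25}$$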
where parameters $\bar{T}_r$ and $\sigma_r^2$ are the mean and variance of the service time distribution, respectively.
Parameter $C_r$ is the coefficient of variation of the service time distribution, defined as the ratio of the standard
deviation to the mean service time (i.e., $C_r = \sigma_r/\bar{T}_r$). To calculate the mean waiting time at the source node we apply
Little's formula [22]:
$$\bar{W}_s = E[W] = E[n]/\lambda - \bar{T}_r = \frac{\bar{T}_r}{1-\rho}\left(1 - \frac{\rho}{2}\left(1 - C_r^2\right)\right) - \bar{T}_r = \frac{\bar{T}_r\,\rho\,(1 + C_r^2)}{2(1-\rho)} = \frac{\lambda E[T_r^2]}{2(1-\rho)}, \tag{26}$$
where parameter $E[T_r^2]$ is the second moment of $T_r$ and can be expressed as follows
$$E[T_r^2] = \left.\frac{d^2}{dS^2} L^*_{T_r}(S)\right|_{S=0} = \sigma_r^2 + \bar{T}_r^2. \tag{27}$$
A message in the source node can enter the network through any of the V virtual channels. Modelling the local
queue in the source node as an M/G/1 queue, with the average arrival rate on each virtual channel being λ/V and
service time $\bar{T}_r$, with an approximated variance $\sigma_r^2 = (\bar{T}_r - M - 3\bar{d} + 1)^2$ [17], yields the mean waiting time as
$$\bar{W}_s = \frac{\frac{\lambda}{V} E[T_r^2]}{2\left(1 - \frac{\lambda}{V}\bar{T}_r\right)} = \frac{\frac{\lambda}{V}\,\bar{T}_r^2\left(1 + \frac{(\bar{T}_r - M - 3\bar{d} + 1)^2}{\bar{T}_r^2}\right)}{2\left(1 - \frac{\lambda}{V}\bar{T}_r\right)}. \tag{28}$$
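A direct evaluation of Eq. (28) can be sketched in Python as below. The function and parameter names are hypothetical, and returning infinity at or beyond saturation is a convention adopted here rather than something specified in the paper.

```python
def source_waiting_time(lam, num_vcs, t_bar, msg_len, d_bar):
    """Eq. (28): mean waiting time at the source node (M/G/1 queue per virtual channel).

    lam     -- message generation rate per node
    num_vcs -- V, virtual channels per physical channel
    t_bar   -- mean network latency
    msg_len -- M, message length in flits
    d_bar   -- mean message distance
    """
    rate = lam / num_vcs                       # arrival rate seen by one virtual channel
    rho = rate * t_bar                         # utilisation of that virtual channel
    if rho >= 1.0:
        return float("inf")                    # queue is unstable beyond saturation
    variance = (t_bar - msg_len - 3.0 * d_bar + 1.0) ** 2   # approximation quoted from [17]
    second_moment = variance + t_bar ** 2      # Eq. (27)
    return rate * second_moment / (2.0 * (1.0 - rho))
```

As an arithmetic check, λ = 0.02, V = 2, a mean network latency of 50 cycles, M = 32 and a mean distance of 4 hops give a utilisation of 0.5, a variance of 49 and a waiting time of 25.49 cycles.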
Fig. 3. Transition diagram to compute virtual channel occupancy probabilities.
3.2.6. Calculation of the mean degree of virtual channels multiplexing (V )
The probability, $P_v$, that v virtual channels are busy at a physical channel can be determined using a Markovian
model, shown in Fig. 3. State $\psi_v$ ($0 \le v \le V$) corresponds to v virtual channels being busy. The transition rate out of
state $\psi_v$ to state $\psi_{v+1}$ is the traffic rate $\lambda_c$ (given by Eq. (7)), while the rate out of state $\psi_v$ to state $\psi_{v-1}$ is $1/\bar{T}_r$ ($T_r$ is
given by Eq. (11)). The transition rate out of state $\psi_V$ is reduced by $\lambda_c$ to account for the arrival of messages while a
channel is in this state. The steady-state solution of the Markovian model yields the probability $P_v$ ($1 \le v \le V$) as [10]
$$Q_0 = 1, \tag{29}$$
$$Q_v = Q_{v-1}\,\lambda_c \bar{T}_r \quad (1 \le v \le V-1), \tag{30}$$
$$Q_V = Q_{V-1}\,\frac{\lambda_c}{1/\bar{T}_r - \lambda_c}, \tag{31}$$
$$P_0 = \left(\sum_{j=0}^{V} Q_j\right)^{-1}, \tag{32}$$
$$P_v = P_{v-1}\,\lambda_c \bar{T}_r \quad (1 \le v \le V-1), \tag{33}$$
$$P_V = P_{V-1}\,\frac{\lambda_c}{1/\bar{T}_r - \lambda_c}. \tag{34}$$
When multiple virtual channels are used per physical channel they share the bandwidth in a time-multiplexed manner.
The mean degree of multiplexing of virtual channels that takes place at a given physical channel can be estimated
by [10]
$$\bar{V} = \frac{\sum_{v=1}^{V} v^2 P_v}{\sum_{v=1}^{V} v P_v}. \tag{35}$$
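Eqs. (29)–(35) translate almost line by line into code; a minimal Python sketch is given below. The function names are hypothetical, the saturation check is an addition made here, and returning a multiplexing degree of 1 when no virtual channel is ever busy is a convention chosen for this sketch.

```python
def vc_occupancy(lam_c, t_bar, num_vcs):
    """Eqs. (29)-(34): probabilities P_0 .. P_V that v virtual channels are busy."""
    if t_bar <= 0.0 or lam_c * t_bar >= 1.0:
        raise ValueError("need t_bar > 0 and lam_c * t_bar < 1")
    q = [1.0]                                           # Eq. (29)
    for _ in range(1, num_vcs):
        q.append(q[-1] * lam_c * t_bar)                 # Eq. (30)
    q.append(q[-1] * lam_c / (1.0 / t_bar - lam_c))     # Eq. (31)
    total = sum(q)
    # Dividing by the sum gives P_0 = 1 / total and P_v = P_0 * Q_v, Eqs. (32)-(34).
    return [x / total for x in q]


def mean_multiplexing_degree(probs):
    """Eq. (35): average degree of virtual channel multiplexing."""
    num = sum(v * v * p for v, p in enumerate(probs))
    den = sum(v * p for v, p in enumerate(probs))
    return num / den if den > 0.0 else 1.0
```

The returned list has V + 1 entries that sum to one, and the multiplexing degree always lies between 1 and V, approaching 1 at light load where a busy physical channel rarely carries more than one message.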
4. Model validation
The analytical model has been validated through a discrete-event simulator that mimics the behaviour of PCS with
faults at the flit level in the torus. In each simulation experiment, a total of 100 000 messages are delivered. Statistics
gathering was inhibited for the first 10 000 messages to avoid distortions due to the initial warm-up conditions.
The simulator uses the same assumptions as the analysis, and some of these assumptions are detailed here with a
view to making the network operation clearer. The network cycle time is defined as the transmission time of a single
flit from one router to the next. Messages are generated at each node according to a Poisson process with a mean
rate of λ messages/cycle. Message length is fixed at M flits. Faulty nodes are determined using a uniform
random number generator. The mean message latency is defined as the mean amount of time from the generation of
a message until the last data flit reaches the local PE at the destination node. The other measures are the mean
network latency (the time taken to cross the network) and the mean queuing time at the source node (the time spent
in the local queue before entering the first network channel). Numerous validation experiments have been performed
for different network sizes and message lengths. However, for the sake of specific illustration, latency results are
presented for the following cases only:
• Network size is N = 64 (8 × 8 torus) and 256 (16 × 16 torus) nodes.
• Message length is M = 32 and 64 flits.
• Number of virtual channels V = 3, 6, 10 per physical channel.
• Failure rates P = 0.0, 0.1, 0.3 and 0.4.
Fig. 4 depicts the results for the mean latency predicted by the above model plotted against those provided by
the simulator for a varying number of virtual channels and different failure rates. The horizontal axis in the figures
shows the traffic generation rate at each node (λ) while the vertical axis shows the mean message latency. The figures
reveal that in all cases, the analytical model predicts the mean message latency with a good degree of accuracy in
the steady-state regions. Moreover, the model predictions are still good even when the network operates in the heavy
traffic region, and when it starts to approach the saturation region. However, some discrepancies around the saturation
point are apparent. These can be accounted for by the approximation made to estimate the variance of the service
time distribution at a channel. This approximation greatly simplifies the model as it allows us to avoid computing
the exact distribution of the message service time at a given channel, which is not a straightforward task due to the
inter-dependencies between service times at successive channels. However, the main advantage of the proposed model
is its simplicity, which makes it a practical evaluation tool for assessing the performance behaviour of PCS in the torus
with faults.
Fig. 4. The mean message latency predicted by the model against simulation results for the 8 × 8 and 16 × 16 torus networks with message length
M = 32, 64 flits, V = 3,6,10 virtual channels per physical channel and failure rates P = 0,0.1,0.3 and 0.4.
5. Performance comparison
In this section, we use the proposed analytical model, together with the model of wormhole switching (WS)
developed in [18], to compare the relative performance merits of PCS and WS in the torus under various working
conditions. For the sake of our present discussion, the network size is set to N = 64 and 256 nodes, the message
length to M = 32 and 50 flits, the failure rates to P = 0.0, 0.16, 0.32, 0.47 and 0.63, and the number of virtual
channels per physical channel to V = 4 and 10; the conclusions reached here are found to be similar when other
network configurations are considered. We have found that the chosen parameter values allow us to examine
different cases that reveal important conclusions about the relative merits of the two considered switching methods.
The measures used for this set of results are throughput and Power. Throughput is the rate at which messages
are delivered by the network for a particular traffic pattern. It is measured by counting the messages that arrive at
destination over a time interval for each flow in the traffic pattern and computing from these flow rates the fraction of
the traffic pattern delivered. Furthermore, the Power of a network (defined below) synthesises the network’s various
performance merits and characteristics such as throughput and latency. To investigate how the network throughput is
associated with traffic load, we plot throughput as a function of traffic, as shown in Fig. 5. Traffic load is the average
amount of traffic generated by each source terminal of the network. This figure depicts the throughput results for WS
and PCS under different failure rates (i.e., P = 0.0,0.16,0.32 and 0.47). At traffic levels below the saturation point,
the throughput is equal to the traffic rate and the curve is a straight line. Continuing to increase the traffic load, we
eventually reach saturation, the highest level of traffic for which throughput is equal to traffic rate. As traffic load
increases beyond the saturation point, the network is unable to deliver messages as fast as they are being generated.
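The notions of throughput and saturation described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the procedure used to produce Fig. 5: the function names, the 5% tolerance used to decide that throughput "equals" the offered load, and the sample data points are all hypothetical, and throughput is reduced to delivered messages per node per cycle rather than a per-flow fraction.

```python
def measured_throughput(delivery_times, start, end, num_nodes):
    """Messages delivered per node per cycle inside the window [start, end)."""
    delivered = sum(1 for t in delivery_times if start <= t < end)
    return delivered / ((end - start) * num_nodes)

def saturation_point(offered, accepted, tolerance=0.05):
    """Largest offered load whose accepted throughput is within `tolerance` of it."""
    best = None
    for load, thr in sorted(zip(offered, accepted)):
        if thr >= (1.0 - tolerance) * load:
            best = load
    return best

# Hypothetical (offered load, accepted throughput) pairs in messages/node/cycle.
offered = [0.002, 0.004, 0.006, 0.008]
accepted = [0.002, 0.004, 0.0059, 0.0061]
lam_sat = saturation_point(offered, accepted)
```

For these made-up points the accepted throughput tracks the offered load up to 0.006 and falls behind at 0.008, so `saturation_point` selects 0.006.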
Fig. 5 reveals that PCS and WS have the same throughput under light traffic in a fault-free network. However,
when the traffic increases, the performance merits of WS become more apparent. This is because the overhead
associated with path set-up in PCS grows as the network traffic increases. However, the figure also reveals that
under failure conditions PCS outperforms WS, since the header in WS, unlike the header in PCS, is unable to
backtrack over the last reserved channel, which favours PCS. As the number of failed nodes increases, the probability that faults
isolate a node/channel also increases, and eventually the network becomes partitioned such that there is no available
path between some particular nodes. In addition, in WS two virtual channels are reserved as deterministic virtual
channels to avoid deadlock. A message cannot therefore use all network resources to progress towards its destination,
resulting in a higher probability of blocking, which is another point in favour of PCS. When the traffic is heavy, WS suffers
from the degrading effects of chained blocking inherent in its blocking-based flow control mechanism. In contrast,
the header in PCS can backtrack to the previous channel to search for an alternative available channel, thus avoiding
blocked channels. This flexibility of PCS offsets the overhead of setting up a path. Our analytical results show that,
in the absence of failures, WS behaves the same as PCS or even better under light and moderate traffic, especially
when message length is short and network size is large. However, the main advantage of PCS compared to WS is its
superior performance when nodes or links are faulty.
Fig. 5. Comparison between the throughput of wormhole switching (WS) and PCS against the traffic load in an 8 × 8 torus with message length
M = 32 flits, V = 4 virtual channels per physical channel, and different failure rates P = 0.0,0.16,0.32, and 0.47.
Fig. 6. Performance of wormhole switching (WS) and PCS in a 16 × 16 torus with message length M = 50 flits and V = 10 virtual channels per
physical channel.
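The remark above that a growing number of failed nodes can eventually partition the network can be checked with a small connectivity test. The Python sketch below is an illustration only and is not part of the analytical model: the function names are hypothetical, and it assumes node faults in a two-dimensional k × k torus with bidirectional wrap-around links. It runs a breadth-first search over the healthy nodes and reports whether some healthy node is unreachable.

```python
from collections import deque

def torus_neighbours(node, k):
    """The four neighbours of `node` in a k x k torus (with wrap-around links)."""
    x, y = node
    return [((x + 1) % k, y), ((x - 1) % k, y), (x, (y + 1) % k), (x, (y - 1) % k)]

def is_partitioned(k, faulty):
    """True if the healthy nodes of a k x k torus do not form one connected component."""
    healthy = {(x, y) for x in range(k) for y in range(k)} - set(faulty)
    if not healthy:
        return False
    start = next(iter(healthy))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in torus_neighbours(queue.popleft(), k):
            if nxt in healthy and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) < len(healthy)

# Failing all four neighbours of node (0, 0) isolates it from the rest of the torus.
isolating_faults = {(1, 0), (7, 0), (0, 1), (0, 7)}
isolated = is_partitioned(8, isolating_faults)
```

Because of the wrap-around links, a single fully failed column does not partition the torus, whereas two failed columns that are not adjacent do; this is one way to see why the torus tolerates a moderate number of faults before connectivity is lost.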
Fig. 6 compares the graceful degradation of PCS and wormhole switching (WS) as the number of failed nodes grows.
In this figure, a 16 × 16 torus is analysed for a varying number of failed nodes (horizontal axis) with message length
M = 50 flits and V = 10 virtual channels per physical channel. For each number of failures, the accepted traffic
rate of the network (vertical axis) under uniform traffic is normalised by the number of all nodes in the network. As
illustrated in this figure, the normalised accepted traffic of the fault-free network is above 80% of capacity for both PCS
and WS. In the presence of failures, the PCS curve shows only a correspondingly small drop in the accepted
rate. The network remains resilient even as the number of failed nodes grows up to 120, with only a slight
increase in the rate of accepted traffic degradation. Beyond this point, increasing the number of faults reveals the potential
impact of failures on the network. This is because, with a large number of faults, the number of channels
that can access nodes in the network may be reduced, which, in turn, can significantly increase the load on the remaining
channels. Scrutinising both curves reveals that PCS is more resilient to failures than WS, and displays good relative
performance as the number of failed nodes in the network increases.
In many networks, two performance measures, mean latency and throughput, compete with each other. Typically,
raising the throughput of the system (which is desirable) also raises the mean message latency (which is undesirable).
Combining the throughput and the mean message latency of the network into a single measure yields Power, which is
defined as the ratio of the network’s throughput to the mean latency:
Power throughput of the network
mean latency
.
As traffic load grows, throughput increases (see Eq. (24)) but latencies become bigger (especially near network
saturation). There is clearly a trade-off between those two measures. The notion of Power proves to be useful in
addressing this trade-off issue. It appears as a natural measurement for characterising multimedia traffic such as data,
voice, video, and image communications across parallel computers, distributed computers, and telecommunication
systems, whose efficient transmission requires, simultaneously, high throughput and low latency [23].
Fig. 7 compares the Power of the network for both PCS and WS as a function of the traffic load generated by each node in an 8 × 8
torus with message length M = 32 flits, number of virtual channels per physical channel V = 4 and different failure
rates P = 0.0,0.16,0.32,0.47 and 0.63. Such curves provide an indication of the degree of performance degradation
caused by faulty nodes in the network. The figure reveals that as the traffic load in the system increases, the Power
of the network increases almost linearly until it reaches an optimal point on the curve (depicted by λopt in the figure). This
reflects the fact that under light traffic loads, the mean message latency in the network is small. Beyond this
point, the message latency increases steadily and the achieved Power decreases abruptly. Note that when λ is greater than
or equal to the maximum saturation throughput, the mean latency tends to infinity, which means that the Power approaches
zero. As illustrated in Fig. 7, the traffic load at which the Power is maximised can be found by taking the tangent to
the latency curve from the origin, in which case $\left.\frac{\partial}{\partial\lambda}\,\text{Power}\right|_{\lambda=\lambda_{opt}} = 0$. From our discussion above it can be further argued that as
the Power increases, the performance of the network becomes higher.
Fig. 7. The Power of network (%) in an 8 × 8 torus vs. traffic rate for PCS and wormhole switching (WS) with different failure rates
P = 0.0,0.16,0.32,0.47,0.63, message length M = 32 flits and V = 4 virtual channels per physical channel.
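The location of the optimal load can be illustrated numerically. The Python sketch below does not use the latency model of this paper: it assumes a hypothetical latency curve T(λ) = T0 / (1 − λ/λsat) that merely has the right qualitative shape (small under light load, diverging at saturation), and it takes the throughput to be equal to λ below saturation, as discussed in connection with Fig. 5. The names `latency`, `power`, `optimal_load`, `t0` and `lam_sat` are introduced here for illustration only.

```python
def latency(lam, t0=40.0, lam_sat=0.01):
    """Hypothetical mean latency (cycles); diverges as lam approaches lam_sat."""
    if lam >= lam_sat:
        return float("inf")
    return t0 / (1.0 - lam / lam_sat)

def power(lam, t0=40.0, lam_sat=0.01):
    """Power = throughput / mean latency, taking throughput = lam below saturation."""
    return lam / latency(lam, t0, lam_sat)

def optimal_load(t0=40.0, lam_sat=0.01, steps=1000):
    """Grid search for the load that maximises Power on the interval (0, lam_sat)."""
    grid = [lam_sat * i / steps for i in range(1, steps)]
    return max(grid, key=lambda lam: power(lam, t0, lam_sat))

lam_opt = optimal_load()
```

For this particular toy curve, Power(λ) = λ(1 − λ/λsat)/T0, which is maximised at λsat/2, so the grid search settles at half the saturation load. Setting the derivative of λ/T(λ) to zero gives T'(λopt) = T(λopt)/λopt, i.e., at the optimum the tangent to the latency curve passes through the origin, which is the geometric construction mentioned above.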
6. Conclusions
Massively parallel systems, Multiprocessors System-on-Chip (MP-SoCs), multicomputers, and cluster computers
are considered today a very promising approach to achieve high computational power. One of the key issues in the
design of such systems is the development of an efficient communication network that provides high throughput
and low latency under different working conditions and, more importantly, is able to survive the failure
of individual components (i.e., nodes or links). Fault-tolerant designs of these systems aim at providing continuous
operation in the presence of faults by allowing the graceful degradation of the system. Pipelined Circuit Switching (PCS)
has been suggested as an efficient switching method for supporting inter-processor communications in networks due
to its ability to preserve both communication performance and fault-tolerant demands in such systems. This paper
has described a new analytical model to compute the mean message latency in torus using PCS in the presence of
faulty components. Simulation experiments have revealed that the latency results predicted by the analytical model
are in good agreement with those obtained through simulation. Our next objective is to extend our suggested modelling
approach to consider other well-known fault-tolerant routing algorithms and other network topologies.
References
[1] W.J. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers, 2004.
[2] Cray Research Inc., The Cray T3E scalable parallel processing system, on Cray’s web page, http://www.cray.com/PUBLIC/product-info/T3E.
[3] R.E. Kessler, J.L. Schwarzmeier, Cray T3D: A new dimension for Cray research, in: Computer Conf., 1993, pp. 176–182.
[4] C. Peterson, et al., iWarp: A 100-MOPS VLIW microprocessor for multicomputers, IEEE Micro 11 (2) (1991) 26–37.
[5] C.L. Seitz, The cosmic cube, Comm. ACM 28 (1985) 22–33.
[6] S.F. Nugent, The iPSC/2 direct-connect communication technology, in: Proceedings of the Conference on Hypercube Concurrent Computers
and Applications, vol. 1, 1988, pp. 51–60.
[7] W.J. Dally, C.L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Comput. 36 (5) (1987) 547–
553.
[8] J. Duato, S. Yalamanchili, L.M. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers, 2003.
[9] P.T. Gaughan, S. Yalamanchili, A family of fault-tolerant routing protocols for direct multiprocessor networks, IEEE Trans. Parallel & Dis-
tributed Systems 6 (5) (1995) 482–497.
[10] W.J. Dally, Virtual channel flow control, IEEE Trans. Parallel & Distributed Systems 3 (2) (1992) 194–205.
[11] J.D. Allen, P.T. Gaughan, D.E. Schimmel, S. Yalamanchili, Ariadne—An adaptive router for fault-tolerant multicomputers, in: Proc.
ACM/IEEE 21st Int. Symp. Computer Architecture, ISCA-21, ACM Press, 1994, pp. 278–288.
[12] A. DeHon, F. Chong, M. Becker, E. Egozy, H. Minsky, S. Peretz, T.F. Knight, METRO: A router architecture for high-performance, short-haul
routing networks, in: Proc. ACM/IEEE 21st Int. Symp. Computer Architecture, ISCA-21, ACM Press, 1994, pp. 266–277.
[13] J. Duato, S. Yalamanchili, M.B. Caminero, D. Love, F. Quiles, MMR: A high performance multimedia router: Architecture and design trade-
offs, in: Proc. 5th Int. Symp. High Performance Computer Architecture, HPCA’99, IEEE Computer Society Press, 1999, pp. 300–309.
[14] G. Min, Performance modelling and analysis of multicomputer interconnection networks, PhD thesis, Computing Science Department, Glas-
gow University, 2003.
[15] G. Min, M. Ould-Khaoua, A comparative study of switching methods in multicomputer networks, J. Supercomput. 21 (2002) 227–238.
[16] G. Min, M. Ould-Khaoua, H. Sarbazi-Azad, Modeling of pipelined circuit switching in multicomputer networks, in: MASCOTS, 2000,
pp. 299–306.
[17] J.T. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, Journal Parallel & Distributed
Computing 32 (2) (1994) 202–214.
[18] M. Ould-Khaoua, A performance model of Duato’s adaptive routing algorithm in k-ary n-cubes, IEEE Trans. Comput. 48 (1999) 1297–1304.
[19] A. Agarwal, Limits on interconnection network performance, IEEE Trans. Parallel & Distributed Systems 2 (4) (1991) 398–412.
[20] W. Feller, An Introduction to Probability Theory and Its Applications, vol. 1, John Wiley & Sons, New York, 1967.
[21] L. Kleinrock, Queuing Systems, Wiley, New York, 1975.
[22] J.C. Little, A proof of the queuing formula L = λW , Oper. Res. 9 (1961) 383–387.
[23] L. Kleinrock, Power and deterministic rules of thumb for probabilistic problems in computer communications, in: Int. Conf. on Commun.,
vol. 1(10), 1979, pp. 43.1.1–43.1.10.
