The traffic pattern has a significant impact on the performance of networks-on-chip. Many recent studies have shown that multimedia applications can be supported in on-chip interconnects. Motivated by the need to evaluate on-chip interconnects in multimedia embedded systems, this article proposes a new analytical model to investigate the performance of the fat-tree based on-chip interconnection network under bursty multimedia traffic and nonuniform message destinations. Extensive simulation experiments are conducted to validate the accuracy of the model, which is then adopted as a cost-efficient tool to investigate the effects of bursty multimedia traffic with nonuniform destinations on the network performance.
INTRODUCTION
The latest development in multimedia embedded systems that are implemented with an on-chip architecture [Dally and Towles 2004; Majeti et al. 2009; Varatkar and Marculescu 2002] not only requires processing of multichannel real-time audio or video signals, but also expects efficient interconnection networks for transport of multimedia content. The emerging chip-multiprocessor (CMP) architectures consist of many processing cores on a single chip owing to the advances of miniaturization in semiconductor technologies [Marculescu et al. 2009; Peng and Lin 2010; Sanchez et al. 2010]. To date, the network-on-chip (NoC), also known as on-chip interconnects, has emerged to play an important role in providing dominant solutions for the interconnection design of CMP architectures. The topology of an on-chip network specifies the structure in which the processing cores are connected. An NoC may adopt any topology proposed for interconnection networks, such as fat tree, mesh, torus, and folded torus [Dally and Towles 2004; Moadeli et al. 2010]. In this article, we focus on the fat-tree topology, which has been adopted by related studies [Grecu et al. 2004; Kapre et al. 2006; Taktak et al. 2008; Wang et al. 2012].
The traffic pattern has a significant impact on the performance of on-chip interconnects. To obtain a proper and deep understanding of the network performance, it is necessary to adopt accurate models that capture realistic network traffic patterns. The message arrival process and the destination distribution are two of the most important characteristics used to define network traffic patterns [Duato et al. 2003]. A number of recent studies have convincingly shown that multimedia applications can be supported in on-chip interconnects [Lee et al. 2006; Ogras and Marculescu 2008; Varatkar and Marculescu 2004]. Furthermore, message destinations often exhibit nonuniform distributions over on-chip interconnects [Mirza-Aghatabar et al. 2007; Zhang and Jones 2009].
Wormhole switching is an efficient switching scheme for on-chip networks [Bjerregaard and Mahadevan 2006; Marculescu et al. 2009], where a message is divided into a sequence of fixed-size units, called flits. The header flit governs the path through the network and the remaining data flits follow it in a pipelined fashion. Without the complexity caused by adaptive routing, a deterministic routing algorithm is suitable for NoC [Dally and Towles 2001]. In deterministic routing, a message traverses a fixed path between its source and destination, which simplifies the implementation, avoids message deadlock, and guarantees in-order delivery. Therefore, in this study, deterministic routing based on the Up*/Down* algorithm [Schroeder et al. 1991] is adopted. The performance study of on-chip interconnects can be carried out by either simulation or analytical modelling [Duato et al. 2003]. However, the simulation-based approach may be time-consuming and costly since the convergence of simulation towards a steady state in the presence of multimedia traffic with nonuniform destinations is often very slow. In contrast, analytical modelling can capture the essential features of the network, provide significant insights, and offer a cost-effective and versatile tool for investigating the network performance with different design alternatives and under various working conditions. In particular, an analytical model provides quantitative relations between input parameters and performance metrics, which allows a thorough investigation of the network performance over a complete parametric range.
Most existing studies of on-chip interconnects resort to simulation to evaluate the network performance. The lack of analytical performance models for such on-chip interconnects hinders efficient design for multimedia embedded systems. With the aim of capturing the characteristics of multimedia traffic patterns and obtaining a comprehensive understanding of network performance, this article makes the following contributions.
- A new analytical model is proposed to investigate the performance of on-chip interconnects in CMP in the presence of multimedia traffic with nonuniform destinations. The multimedia traffic is captured by the bursty and correlated Markov-modulated Poisson process (MMPP) and the nonuniform destination distribution is modelled by a hot-spot in the network. The popular fat-tree topology is adopted as the underlying interconnection architecture in CMP.
- Extensive simulation experiments are conducted to validate the accuracy of the model. The comparison between analytical and simulation results reveals that the model possesses a good degree of accuracy under different design alternatives and with various traffic conditions.
- To illustrate its applications, the analytical model is then applied to investigate the impact of multimedia traffic with hot-spot destinations on the performance of the fat-tree based on-chip interconnects. The analytical results demonstrate that the network performance degrades considerably under such traffic patterns.
The rest of the article is organized as follows. Section 2 presents the related work. The network architecture is shown in Section 3. Section 4 derives the analytical model to investigate the performance of fat-tree based on-chip interconnects. Extensive simulation experiments are used to validate the accuracy of the analytical model in Section 5. Section 6 carries out performance analysis by virtue of the developed analytical model. Finally, Section 7 concludes this study.
RELATED WORK
Studies on the performance evaluation of on-chip interconnects have been widely reported in the literature [Ascia et al. 2008; Kodi et al. 2008; Matsutani et al. 2009; Pande et al. 2005; Sanchez et al. 2010]. However, most of these studies rely on simulation experiments to evaluate the performance of interconnects in the NoC architecture. For example, Ascia et al. [2008] proposed a selection strategy coupled with a routing algorithm to improve the performance of on-chip interconnects, evaluated by means of flit-level simulators. The objective of the proposed strategy is to choose the channel that allows the packet to travel to its destination along the path with the fewest congested nodes. Kodi et al. [2008] proposed a low-power, low-area on-chip interconnection network architecture that reduces the number of buffers within the router. To minimise the performance degradation caused by the reduced buffer size, circuit-level enhancements were applied to the existing repeaters so that they double as buffers when required. Matsutani et al. [2009] proposed a tree-based interconnection network that efficiently uses the abundant wire resources for low-latency and high-throughput communication in NoC, and employed a simulation study to evaluate its performance. Pande et al. [2005] developed a consistent and meaningful evaluation methodology to compare the performance and characteristics of a variety of on-chip interconnection architectures and explored the design tradeoffs that characterise the NoC for the optimal development of integrated network-based designs. Sanchez et al. [2010] explored the architectural-level implications of network design for NoC. They further evaluated and compared different network topologies using simulation of the full chip.
The analytical modelling and evaluation of on-chip interconnects are rarely reported in the current literature. Very recently, Moadeli et al. [2010] proposed an analytical model to evaluate the performance of ring-based on-chip interconnects. However, this model was developed under the assumptions that message arrivals follow a nonbursty Poisson process and that message destinations are uniformly distributed. To the best of our knowledge, no analytical model has been reported in the current literature for on-chip interconnects that handles multimedia traffic with hot-spot destinations.
THE NETWORK-ON-CHIP ARCHITECTURE
The NoC closely resembles the architecture of interconnection networks in high-performance computing (HPC) systems [Benini and Micheli 2002]. Thus, the interconnection topologies adopted in the early NoC architectures can be traced back to the field of HPC systems. As technology scaling into the nanoscale regime brings physical design issues to the forefront, 2D mesh and torus topologies, whose grid-based regular structure is intuitively considered to match the 2D chip layout, have been adopted in on-chip networks. Meanwhile, NoC architectures aiming at low-latency communication, performance scalability and flexible routing still choose fat-trees as their reference topology. Moreover, numerous research efforts on tree-based topologies in the NoC community have reported their superior performance over 2D meshes under different types of traffic patterns [Pande et al. 2005]. As a result, the fat-tree topology and its variants are widely adopted in practice. The m-port n-tree is a typical example of fat-tree topologies [Lin et al. 2004]. Figure 1 depicts the topology of the 4-port 2-tree and its on-chip layout. In this topology, each network switch has m communication ports that are connected with other switches or processing cores. The height of the tree is (n + 1). The switches (except for root switches) use half of their ports to connect with their descendants or processing nodes, and the other half to connect with their ancestors. Each root switch employs all of its communication ports to connect with its descendants or processing nodes.
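The m-port n-tree consists of $N_{node}$ processing nodes and $N_{switch}$ switches (including the root switches) [Lin et al. 2004], where $N_{node}$ and $N_{switch}$ are given by

$$N_{node} = 2\left(\frac{m}{2}\right)^{n}, \qquad N_{switch} = (2n-1)\left(\frac{m}{2}\right)^{n-1}.$$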
Let $P_j$ denote the probability that a newly generated message needs to cross 2j channels (j channels in the ascending phase and j channels in the descending phase) to reach its destination in the m-port n-tree. As the number of nodes located at distance 2j (1 ≤ j ≤ n − 1) in the m-port n-tree is $(m/2 - 1)(m/2)^{j-1}$, we have

$$P_j = \frac{(m/2 - 1)(m/2)^{j-1}}{N_{node} - 1}, \qquad 1 \le j \le n-1.$$
The number of nodes located at distance 2n in the m-port n-tree is $(m - 1)(m/2)^{n-1}$. Thus, $P_n$ can be expressed as

$$P_n = \frac{(m - 1)(m/2)^{n-1}}{N_{node} - 1}.$$
Consequently, the average message distance, $\bar{d}$, in the m-port n-tree can be expressed as

$$\bar{d} = \sum_{j=1}^{n} 2j\,P_j.$$
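The distance distribution described above can be checked numerically. The following is a minimal sketch, assuming the standard m-port n-tree node count $N_{node} = 2(m/2)^n$ and that the mean distance is measured in channels (2j channels for a destination at distance 2j). The function name `fat_tree_distance_profile` is hypothetical and is not part of the original model.

```python
from fractions import Fraction


def fat_tree_distance_profile(m: int, n: int):
    """Return (N_node, [P_1, ..., P_n], mean distance) for an m-port n-tree.

    Assumes N_node = 2 * (m/2)**n and uses the per-distance node counts
    quoted in the text. Exact fractions avoid rounding noise.
    """
    if m < 2 or m % 2 != 0 or n < 1:
        raise ValueError("m must be an even number >= 2 and n must be >= 1")
    half = m // 2
    n_node = 2 * half ** n
    others = n_node - 1  # every node except the source

    # Nodes at distance 2j for 1 <= j <= n-1, then nodes at distance 2n.
    counts = [(half - 1) * half ** (j - 1) for j in range(1, n)]
    counts.append((m - 1) * half ** (n - 1))

    probs = [Fraction(c, others) for c in counts]
    mean_distance = sum(2 * j * p for j, p in enumerate(probs, start=1))
    return n_node, probs, mean_distance


if __name__ == "__main__":
    nodes, probs, mean_distance = fat_tree_distance_profile(8, 2)
    print(nodes, probs, float(mean_distance))
```

Because the per-distance node counts sum to $N_{node} - 1$, the probabilities always sum to one, which gives a quick sanity check on the counts.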
The m-port n-tree contains two types of connections: node-to-switch (or switch-to-node) connections and switch-to-switch connections. Given that the basic unit of transmission in on-chip communication is a flit, let $t_{cn}$ denote the time required to transmit a flit on a node-to-switch (or switch-to-node) connection and $t_{cs}$ the time required to transmit a flit on a switch-to-switch connection in the m-port n-tree topology. Following Javadi et al. [2006], $t_{cn}$ and $t_{cs}$ are determined by the network latency $\theta_n$, the switch latency $\theta_s$, the bandwidth $B_n$ of the connections in the m-port n-tree, and the length $L$ of each flit.

Table I. Key notations
$N_{channel_i}$ : number of channels that are i (1 ≤ i ≤ 2n) hops away from the hot-spot node traversed by the regular messages and hot-spot messages
$N^{t}_{channel_i}$ : total number of output channels that are i (1 ≤ i ≤ 2n) hops away from the hot-spot node
$Pb_{r,k,j}$ : blocking probability of the regular messages at stage k
$Wc_{r,k,j}$ : waiting time experienced by the regular messages to acquire a channel in the event of blocking
[...] : Laplace-Stieltjes transform of the service time of regular messages on network channels at stage k
$\tau_r$ : mean time for the tail flit of a regular message to reach the destination
THE ANALYTICAL MODEL
This section first presents the methods for modelling the message arrival processes of multimedia applications and for modelling nonuniform destination distributions. The derivation of the analytical model is then reported. The major difference between the previous publications [Wu et al. 2008, 2011] and this article is that the former considered 2D torus interconnection networks, whereas this study focuses on the modelling and analysis of the m-port n-tree interconnection topology. The key notations used in the derivation of the analytical model are listed in Table I.
Modeling the Message Arrival Process
The message arrival process of multimedia applications deviates significantly from traditional renewal processes such as the Poisson process. A model of multimedia traffic should capture its distinguishing characteristics, namely its bursty and correlated nature, which can significantly affect the network performance [Liu et al. 2008]. A highly bursty message arrival process tends to have a large variance-to-mean ratio of the inter-arrival time. Let X denote the inter-arrival time; burstiness can then be characterised by the squared coefficient of variation (SCV) of X. The other important feature of multimedia traffic is the high correlation between inter-arrival times, which is typically measured by the correlation coefficient of X. In this article, the arrival process of multimedia traffic is represented by an MMPP [Fischer and Meier-Hellstern 1993], which is a doubly stochastic process with the arrival rate varying according to a multistate ergodic continuous-time Markov chain. The two-state MMPP has been widely used in numerous studies to model the message arrival behaviour of bursty traffic for the following reasons: (1) many studies [Heffes 1980; Liu et al. 2008; Shah-Heydari and Le-Ngoc 2000] have revealed that the MMPP is able to capture the time-varying arrival rate and the important correlations among inter-arrival times of multimedia traffic; (2) the MMPP is closed under the splitting and superposition operations and thus can be used to model the decomposition and superposition of network traffic in on-chip interconnection networks; and (3) the queueing-related results of the MMPP have been widely studied [Fischer and Meier-Hellstern 1993; Heffes 1980], which makes the modelling of networks with the MMPP arrival process analytically tractable.
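In this study, a two-state MMPP, denoted by $\mathrm{MMPP}_s$, is adopted to model the traffic burstiness of the message arrival process generated by the source node [Wu et al. 2011]. $\mathrm{MMPP}_s$ can be characterized by the infinitesimal generator $Q_s$ of the underlying Markov chain and the rate matrix $\Lambda_s$ as

$$Q_s = \begin{bmatrix} -\varphi_{s1} & \varphi_{s1} \\ \varphi_{s2} & -\varphi_{s2} \end{bmatrix}, \qquad \Lambda_s = \begin{bmatrix} \lambda_{s1} & 0 \\ 0 & \lambda_{s2} \end{bmatrix},$$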
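where the element $\varphi_{s1}$ is the transition rate from state 1 to state 2 and $\varphi_{s2}$ is the transition rate from state 2 to state 1; $\lambda_{s1}$ and $\lambda_{s2}$ are the traffic rates when the Markov chain is in state 1 and state 2, respectively. The mean, $\bar{\lambda}_s$, variance, $\lambda_s^{(2)}$, covariance function, Cov(t), and integral of the covariance function, $\delta_s$, of the traffic rate are listed here and can also be found in Heffes [1980] and Min and Ould-Khaoua [2004]:

$$\bar{\lambda}_s = \frac{\lambda_{s1}\varphi_{s2} + \lambda_{s2}\varphi_{s1}}{\varphi_{s1} + \varphi_{s2}}, \qquad \lambda_s^{(2)} = \frac{\varphi_{s1}\varphi_{s2}(\lambda_{s1} - \lambda_{s2})^2}{(\varphi_{s1} + \varphi_{s2})^2},$$

$$\mathrm{Cov}(t) = \lambda_s^{(2)} e^{-(\varphi_{s1} + \varphi_{s2})t}, \qquad \delta_s = \int_0^{\infty} \mathrm{Cov}(t)\,dt = \frac{\varphi_{s1}\varphi_{s2}(\lambda_{s1} - \lambda_{s2})^2}{(\varphi_{s1} + \varphi_{s2})^3}.$$

These quantities denote how dependent the rate at one instant of time is on that at another instant and play a major role in the method for the superposition operations of MMPP.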
The SCV of the inter-arrival time and the correlation between inter-arrival times can be derived from these MMPP parameters [Liu et al. 2008].
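The rate statistics of a two-state MMPP follow from the stationary distribution of its two-state modulating chain, so they are easy to evaluate directly. The sketch below is illustrative only: the function names are hypothetical, and the third central moment (used later for moment matching) is derived here from the two-point distribution of the rate rather than quoted from the article.

```python
import math


def mmpp2_rate_stats(phi1, phi2, lam1, lam2):
    """Statistics of the arrival *rate* of a two-state MMPP.

    phi1: transition rate from state 1 to state 2
    phi2: transition rate from state 2 to state 1
    lam1, lam2: arrival rates in states 1 and 2
    """
    total = phi1 + phi2
    pi1 = phi2 / total  # stationary probability of state 1
    pi2 = phi1 / total  # stationary probability of state 2
    diff = lam1 - lam2

    mean = pi1 * lam1 + pi2 * lam2
    variance = pi1 * pi2 * diff ** 2
    third_central = pi1 * pi2 * (pi2 - pi1) * diff ** 3
    delta = variance / total  # integral of Cov(t) over [0, infinity)
    return {"mean": mean, "variance": variance,
            "third_central": third_central, "delta": delta}


def rate_covariance(phi1, phi2, lam1, lam2, t):
    """Covariance of the rate at two instants separated by t >= 0."""
    variance = mmpp2_rate_stats(phi1, phi2, lam1, lam2)["variance"]
    return variance * math.exp(-(phi1 + phi2) * t)


if __name__ == "__main__":
    print(mmpp2_rate_stats(1.0, 3.0, 10.0, 2.0))
```

With equal rates in both states the variance, third central moment and covariance integral all vanish, which recovers the nonbursty Poisson case.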
Modeling the Message Destination Distribution
Message destinations often exhibit nonuniform distributions in on-chip interconnects. Hot-spot traffic is able to capture the characteristics of the nonuniform distribution of message destinations, where a number of nodes direct a fraction of their messages to the hot-spot node [Ascia et al. 2008; Pfister and Norton 1985]. Hot-spot traffic may lead to paralysis of the hot-spot node and even of the whole network [Xiong et al. 2001]. It has attracted significant research efforts over the past few years due to the strong evidence of its existence and its great effects on network performance [Ould-Khaoua and Sarbazi-Azad 2001; Sarbazi-Azad, Ould-Khaoua and Mackenzie 2001; Wu et al. 2008]. For example, messages routed through an on-chip interconnect with the same destination address (i.e., the network coordinator) may result in contention. The hot-spot traffic model proposed in Pfister and Norton [1985] is employed to generate the nonuniform distribution of message destinations in this study. Specifically, each message has the probability, h, of being directed to the hot-spot node, and the probability, (1 − h), of being evenly directed to all other network nodes.
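The destination rule can be illustrated with a short sampling routine. This is a sketch under stated assumptions rather than the simulator used in the article: the function name `pick_destination` is hypothetical, regular messages are assumed to be spread uniformly over all nodes except the source, and the hot-spot node itself is assumed to send only regular messages.

```python
import random


def pick_destination(source, n_nodes, hot_spot, h, rng):
    """Draw a destination for a message generated at `source`.

    With probability h the message goes to the hot-spot node; otherwise
    it is a regular message with a uniformly chosen destination.
    """
    # Assumption: the hot-spot node does not send hot-spot messages to itself.
    if source != hot_spot and rng.random() < h:
        return hot_spot
    # Regular message: uniform over the n_nodes - 1 nodes other than the source.
    dest = rng.randrange(n_nodes - 1)
    return dest if dest < source else dest + 1


if __name__ == "__main__":
    rng = random.Random(2024)
    sample = [pick_destination(0, 32, 5, 0.2, rng) for _ in range(10)]
    print(sample)
```

Under these assumptions the hot-spot node receives a share h + (1 − h)/(N − 1) of the messages generated by any other node, since regular messages can also land on it.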
Derivation of the Analytical Model
A critical performance metric used to evaluate on-chip interconnects is the communication latency [Moadeli et al. 2010], which consists of three parts: (a) the waiting time at the source; (b) the transmission delay, that is, the time for a message to cross the network; and (c) the time for the tail flit to reach the destination. The latency reflects the dynamic behaviour of the network and may be high if the network traffic is nonuniformly distributed, for example, in the presence of hot-spot traffic. Let us refer to the traffic caused by the regular messages and hot-spot messages as regular traffic and hot-spot traffic, respectively. This section considers the effects of both regular messages and hot-spot messages on the performance of on-chip interconnection networks. Let $L_r$ and $L_h$ represent the communication latency experienced by regular messages and hot-spot messages in on-chip interconnects, respectively. Since each message has the probability, h, of being directed to the hot-spot node and the probability, (1 − h), of being evenly directed to all other network nodes, the communication latency, $\bar{L}$, for a given message in the on-chip interconnects can be given as follows:

$$\bar{L} = (1 - h)\,L_r + h\,L_h.$$
The regular messages and hot-spot messages experience different latencies due to the nonuniform traffic loads and varying blocking time over different network channels, depending on their locations with respect to the hot-spot node. In what follows, we first determine the waiting time at the source node, and then calculate the transmission delay and blocking time experienced by both regular messages and hot-spot messages. Finally, we determine the time for the tail flit to reach the destination so as to calculate the message communication latency.

Let $\mathrm{MMPP}_r$ denote the regular traffic arriving at the queue in the source node, which is a fraction, (1 − h), of the traffic generated by the source. Based on the principle of splitting an MMPP [Fischer and Meier-Hellstern 1993], the corresponding infinitesimal generator $Q_r$ and rate matrix $\Lambda_r$ of $\mathrm{MMPP}_r$ can be given by

$$Q_r = Q_s, \qquad \Lambda_r = (1 - h)\,\Lambda_s.$$
In this article, we consider a finite buffer queue of size B at the source; thus, arriving messages are dropped when the buffer is full. Let $Pl_s$ denote the probability that an arriving packet finds the buffer full; the calculation of $Pl_s$ will be given later by Eq. (19). The effective regular traffic entering the queue at the source, denoted by $\mathrm{MMPP}_{er}$, is a fraction, $(1 - Pl_s)$, of $\mathrm{MMPP}_r$. Based on Eq. (17), the infinitesimal generator $Q_{er}$ and rate matrix $\Lambda_{er}$ of $\mathrm{MMPP}_{er}$ can be determined.
To calculate the waiting time experienced by regular messages at the source, we adopt a bi-variate Markov chain, shown in Figure 2, from which the transition rate matrix of the chain can be obtained. The steady-state probability vector, $\mathbf{P} = (P_{a,b}) = (\mathbf{P}_0, \mathbf{P}_1, \ldots, \mathbf{P}_B)$, where $\mathbf{P}_b = (P_{1,b}, P_{2,b})$, satisfies the global balance equations defined by this transition rate matrix together with the normalization condition $\mathbf{P}\mathbf{e} = 1$. Let $P_b$, 0 ≤ b ≤ B, denote the probability that there are b flits in the buffer; $P_b$ is given by $P_b = \sum_{a=1}^{2} P_{a,b}$. According to Little's law [Kleinrock 1975], the waiting time, $W_r$, experienced by regular messages at the source can be determined by

$$W_r = \frac{\sum_{b=0}^{B} b\,P_b}{\bar{\lambda}_{er}},$$
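where $\bar{\lambda}_{er}$ is the mean arrival rate of $\mathrm{MMPP}_{er}$ and can be computed based on Eq. (9). To determine $Pl_s$ used in Eq. (17), let us first calculate the probability that there are b flits in the queue as observed by an arriving packet, written here as $\hat{P}_b$ to distinguish it from the time-average probability $P_b$. It can be given by [Meier-Hellstern 1989]

$$\hat{P}_b = \left(\sum_{b=0}^{B} \mathbf{P}_b \Lambda_r \mathbf{e}\right)^{-1} \mathbf{P}_b \Lambda_r \mathbf{e}.$$

Therefore, the probability, $Pl_s$, that an arriving message finds the finite buffer full can be written as

$$Pl_s = \hat{P}_B.$$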
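The source-queue calculation can be sketched numerically as follows. Since Figure 2 is not reproduced here, the sketch makes an explicit assumption that is not taken from the article: the queue is served at a constant exponential rate `mu`. The function names are hypothetical. The code builds the generator of the bi-variate chain (MMPP state, buffer occupancy), solves the balance equations with the normalization condition, and then applies the arrival-weighted loss probability and Little's law.

```python
def solve_stationary(gen):
    """Solve pi * gen = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    size = len(gen)
    # Transposed balance equations, with the last one replaced by normalization.
    a = [[gen[col][row] for col in range(size)] + [0.0] for row in range(size)]
    a[-1] = [1.0] * size + [1.0]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(size):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][size] / a[i][i] for i in range(size)]


def source_queue(phi1, phi2, lam1, lam2, mu, capacity):
    """Return (loss probability, mean time in queue) for an MMPP-fed finite queue.

    Assumption (not from the article): exponential service with rate mu.
    State index is 2*b + a, with b items in the queue and MMPP state a in {0, 1}.
    """
    lam, switch = (lam1, lam2), (phi1, phi2)
    size = 2 * (capacity + 1)
    gen = [[0.0] * size for _ in range(size)]
    for b in range(capacity + 1):
        for a in range(2):
            i = 2 * b + a
            gen[i][2 * b + (1 - a)] += switch[a]  # MMPP state change
            if b < capacity:
                gen[i][i + 2] += lam[a]           # accepted arrival
            if b > 0:
                gen[i][i - 2] += mu               # service completion
            gen[i][i] = -sum(gen[i][k] for k in range(size) if k != i)
    pi = solve_stationary(gen)

    p_b = [pi[2 * b] + pi[2 * b + 1] for b in range(capacity + 1)]
    # Arrival-weighted occupancy: rate of arrivals that observe b items.
    seen = [pi[2 * b] * lam1 + pi[2 * b + 1] * lam2 for b in range(capacity + 1)]
    loss = seen[capacity] / sum(seen)
    accepted_rate = sum(seen[:capacity])
    mean_occupancy = sum(b * p for b, p in enumerate(p_b))
    return loss, mean_occupancy / accepted_rate   # Little's law


if __name__ == "__main__":
    print(source_queue(0.1, 0.4, 0.9, 0.1, 1.0, 8))
```

When both MMPP states have the same rate, the chain collapses to an M/M/1/B queue, which provides a closed-form check of the sketch.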
The output process of the regular traffic from the queue in the source node, $\mathrm{MMPP}_{or}$, can be modelled approximately by that of the queueing system with an infinite buffer and $\mathrm{MMPP}_r$ input traffic. This approximation is validated by comparing the analytical performance results with those obtained through simulation; it is worth noting that this approximation of the output process is not used in the simulation experiments. $\mathrm{MMPP}_{or}$ can be obtained by matching the moments of the inter-departure time of the packets. Following the method used in Ferng and Chang [2001] to derive the output process of a queueing system with MMPP input, the infinitesimal generator $Q_{or}$ and rate matrix $\Lambda_{or}$ of $\mathrm{MMPP}_{or}$ can be determined.
Similarly, let $\mathrm{MMPP}_h$ denote the hot-spot traffic arriving at the queue in the source node, which is a fraction, h, of its generated traffic. Based on the method for deriving the expression of $W_r$, we can readily obtain the waiting time, $W_{h_j}$, experienced by the 2j-channel hot-spot messages at the source node. The output of the hot-spot traffic from the queue, $\mathrm{MMPP}_{oh}$, can be determined accordingly.
Transmission Delay in m-Port n-Tree Based On-Chip Interconnects. In this section, we first determine the traffic characteristics in m-port n-tree based on-chip interconnects under bursty multimedia traffic and hot-spot destinations, and then calculate the transmission delay and blocking time experienced by both regular messages and hot-spot messages in on-chip interconnects.
Traffic characteristics for regular messages and hot-spot messages.
Due to the uniformity of regular messages on network channels, the arrivals of regular traffic at network channels exhibit similar statistical behavior. Since the network has $N_{node}$ source nodes and $4nN_{node}$ network channels [Javadi et al. 2006], the regular traffic arriving at a given network channel is $f_r$ times the traffic that enters the network from the queue in the source node. $f_r$ can be given by
Generally, $f_r$ is not an integer value because it is determined by the network size, the properties of the traffic generated by the source, and the hot-spot fraction. Let $Z_r$ and $F_r$ denote the integral and fractional parts of $f_r$. Given that the splitting and superposition of multiple MMPPs again yield an MMPP [Fischer and Meier-Hellstern 1993; Heffes and Lucantoni 1986], let $\mathrm{MMPP}_{cr}$ denote the regular traffic arriving at a given network channel. $\mathrm{MMPP}_{cr}$ can be determined by the superposition of $Z_r$ traffic flows modelled by $\mathrm{MMPP}_{or}$ and one $\mathrm{MMPP}_F$, where $\mathrm{MMPP}_F$ represents the traffic flow resulting from the splitting of $\mathrm{MMPP}_{or}$ with the splitting probability $F_r$. According to Eq. (17), the infinitesimal generator $Q_F$ and rate matrix $\Lambda_F$ of $\mathrm{MMPP}_F$ can be obtained.
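The parameters of $\mathrm{MMPP}_{cr}$ can be determined by matching the following four statistical characteristics: the mean, variance, third central moment, and integral of the covariance function of the arrival rate. Based on the parameter matrices of $\mathrm{MMPP}_{or}$ and $\mathrm{MMPP}_F$, their statistical characteristics can be calculated using Eqs. (9)-(13). Since $\mathrm{MMPP}_{cr}$ is the superposition of $Z_r$ $\mathrm{MMPP}_{or}$ flows and one $\mathrm{MMPP}_F$, we can further obtain the mean ($\bar{\lambda}_{cr}$), variance ($\lambda^{(2)}_{cr}$), third central moment ($\lambda^{(3)}_{cr}$), and integral of the covariance function ($\delta_{cr}$) of the traffic rate of $\mathrm{MMPP}_{cr}$ and compute its infinitesimal generator $Q_{cr}$ and rate matrix $\Lambda_{cr}$ as follows [Heffes 1980; Min and Ould-Khaoua 2004]:

$$Q_{cr} = \begin{bmatrix} -\varphi_{cr1} & \varphi_{cr1} \\ \varphi_{cr2} & -\varphi_{cr2} \end{bmatrix}, \qquad \Lambda_{cr} = \begin{bmatrix} \lambda_{cr1} & 0 \\ 0 & \lambda_{cr2} \end{bmatrix},$$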
where the parameters, ϕ cr1 , ϕ cr2 , λ cr1 and λ cr2 are given by 
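The superposition and matching step can be sketched as follows. The sketch rests on two facts: the mean, variance, third central moment and covariance integral of the rate are additive over independent flows, and splitting a flow with probability F scales its rate by F. The inversion in `match_mmpp2` is derived here from the two-point distribution of the rate and is not necessarily the form used in the article; all function names are hypothetical.

```python
import math


def rate_stats(phi1, phi2, lam1, lam2):
    """(mean, variance, third central moment, covariance integral) of the rate."""
    total = phi1 + phi2
    pi1, pi2 = phi2 / total, phi1 / total
    diff = lam1 - lam2
    var = pi1 * pi2 * diff ** 2
    return (pi1 * lam1 + pi2 * lam2, var,
            pi1 * pi2 * (pi2 - pi1) * diff ** 3, var / total)


def superposed_stats(stats, f):
    """Statistics of Z independent copies plus one copy split with probability F,
    where Z and F are the integral and fractional parts of f."""
    whole = int(f)
    frac = f - whole
    mean, var, third, delta = stats
    return (whole * mean + frac * mean,
            whole * var + frac ** 2 * var,
            whole * third + frac ** 3 * third,
            whole * delta + frac ** 2 * delta)


def match_mmpp2(mean, var, third, delta):
    """Two-state MMPP (phi1, phi2, lam1, lam2), lam1 >= lam2, with these statistics.

    Requires var > 0 and delta > 0; the result is only meaningful when
    the returned lam2 is non-negative.
    """
    total = var / delta                   # phi1 + phi2
    skew = third / var ** 1.5
    x = skew / math.sqrt(4.0 + skew ** 2)  # pi2 - pi1
    pi1, pi2 = (1.0 - x) / 2.0, (1.0 + x) / 2.0
    diff = math.sqrt(var / (pi1 * pi2))    # lam1 - lam2
    return (pi2 * total, pi1 * total, mean + pi2 * diff, mean - pi1 * diff)


if __name__ == "__main__":
    per_source = rate_stats(1.0, 3.0, 10.0, 2.0)
    channel = superposed_stats(per_source, 2.5)
    print(match_mmpp2(*channel))
```

Matching the statistics of a single two-state MMPP returns its original parameters, which is a convenient round-trip check before applying the routine to superposed traffic.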
Because of the nonuniformity of hot-spot messages on network channels, the hot-spot traffic on different channels varies and can be determined according to their locations with respect to the hot-spot node. With hot-spot traffic, the network channels located at the same distance from the hot-spot node have identical traffic characteristics. Therefore, we need to determine the number of channels, $N_{channel_i}$, located i hops away from the hot-spot node that are traversed by the hot-spot messages. $N_{channel_i}$ can be given by
Let MMPP ch i denote the arrival process of hot-spot messages on network channels located at i hops away from the hot-spot node. MMPP ch i can be obtained by considering the following two cases.
(a) With the use of deterministic routing, the channels located i (1 ≤ i ≤ n) hops away from the hot-spot node can receive messages generated by the nodes located more than 2i hops away from the hot-spot node. The number of nodes located 2i (1 ≤ i ≤ n) hops away from the hot-spot node, $N_{node_i}$, can be expressed as

$$N_{node_i} = \begin{cases} (m/2 - 1)(m/2)^{i-1}, & 1 \le i \le n-1, \\ (m - 1)(m/2)^{n-1}, & i = n. \end{cases}$$
(b) The channels located at i (n + 1 ≤ i ≤ 2n) hops away from the hot-spot node can receive messages generated by 2 2n−i N channel i nodes in the m-port n-tree.
Since there are N channel i channels located at i hops away from the hot-spot node to be traversed by the hot-spot messages, the hot-spot traffic arriving at a given network channel located at i hops away from the hot-spot node is f ch i times of the traffic that enters into the network from the queue in the source node. f ch i can be expressed as
Adopting a method similar to that used to determine the regular traffic on network channels, we can readily obtain the infinitesimal generator, $Q_{ch_i}$, and rate matrix, $\Lambda_{ch_i}$, of $\mathrm{MMPP}_{ch_i}$. The superposition of these two types of traffic yields the load on network channels that are i hops away from the hot-spot node, modelled by $\mathrm{MMPP}_{c_i}$ with the infinitesimal generator, $Q_{c_i}$, and rate matrix, $\Lambda_{c_i}$.
Transmission delay for regular messages and hot-spot messages.
Since regular messages may cross different numbers of channels to reach their destinations, we denote the transmission delay of a 2j-channel regular message (i.e., a message that needs to traverse 2j channels to reach its destination) by $T_{r,j}$. Averaging over all the possible destinations of a given regular message yields the transmission delay as

$$T_r = \sum_{j=1}^{n} P_j\,T_{r,j}.$$
For the sake of clarity, the numbering of network stages in the m-port n-tree topology is based on the location of the switches between the source and destination. The numbering starts from the stage next to the source (stage 0) and increases toward the destination. In the m-port n-tree, the number of stages to be crossed by a 2j-channel message is K = 2j − 1. Since messages are transferred to the local processing core upon arriving at their destinations, the analysis starts from the last stage and continues backward to the first stage. Therefore, the service time experienced by regular messages on network channels at the last stage (K − 1), $T_{r,K-1,j}$, can be given by
where is the message length in flits. The service time experienced by messages on the network channels at the internal stages k (0 ≤ k ≤ K − 2) can be obtained by actual message transmission time and the delay due to blocking at subsequent stages. Thus, the service time, T r,k, j , experienced by messages on network channels at internal stages can be expressed as
where Wb r,k, j is the blocking time that messages experience to acquire a channel at stage k. T r, j is the service time of a regular message at stage 0, that is, T r, j = T r,0, j . Similarly, we consider the transmission delay of a 2 j-channel hot-spot message as T h, j . Following the derivation of T r,k, j given by Eqs. (30) and (31), the service time, T h,k, j , experienced by 2 j-channel hot-spot messages on network channels can be readily determined.
Blocking time for regular messages and hot-spot messages on network channels.
The blocking time experienced by 2j-channel regular messages on network channels at stage k, $Wb_{r,k,j}$, can be determined by the blocking probability of messages at this stage, $Pb_{k,j}$, and the waiting time, $Wc_{k,j}$, that the messages experience to acquire a channel when blocking occurs. Since there are $N_{channel_i}$ channels that are i (1 ≤ i ≤ 2n) hops away from the hot-spot node traversed by the regular messages and hot-spot messages, and the total number of output channels that are i (1 ≤ i ≤ 2n) hops away from the hot-spot node is $N^{t}_{channel_i}$, the probability that a channel located i hops away from the hot-spot node is traversed by both regular messages and hot-spot messages (i.e., the superposed messages) is $N_{channel_i} / N^{t}_{channel_i}$. Therefore, $Wb_{r,k,j}$ can be expressed as
where i = 2j − k − 1. $Pb_{r,k,j}$ and $Pb_{k,j}$ are the blocking probabilities of the regular messages and of all messages (i.e., including both regular messages and hot-spot messages) at stage k. $Wc_{r,k,j}$ and $Wc_{k,j}$ are the waiting times experienced by the regular messages and by all messages to acquire a channel in the event of blocking. $N^{t}_{channel_i}$ can be given by
Taking both the regular messages and hot-spot messages with their appropriate weights into account yields the service time on network channels at stage k as follows:
where λ cr , λ ch i and λ c i denote the mean arrival rates of regular traffic, hot-spot traffic and the superposed traffic on network channels located i hops away from the hot-spot node. These quantities can be obtained by virtue of Eq. (9). The blocking probability, Pb k, j , can be determined using a Markov chain. The state of the Markov chain is described by a pair of random variables, (t, s), where t denotes the status of the channel and s is the state of MMPP c i . The transition rate from state (t, s) to (t + 1, s) is λ cs i , where λ cs i is the traffic rate on network channels when MMPP c i is in state s, while the rate from (t + 1, s) to (t, s) is 1/T k, j − λ cs i . The transition rates from state (t, s) to (t, s + 1) and from (t, s) to (t, s − 1) are ϕ c2 i and ϕ c1 i , respectively. The steady-state vector of the Markov chain yields the blocking probability Pb k, j [Min and Ould-Khaoua 2004].
Since the hot-spot traffic is nonuniformly distributed over the network channels, the waiting time experienced by messages due to blocking on network channels depends on the location of the current network channels with respect to the hot-spot node because the traffic rate varies from one channel to the other. To determine the waiting time, Wc k, j , the network channels are modelled as MMPP/G/1 queueing systems [Fischer and Meier-Hellstern 1993] . As the arrival process is modelled by MMPP c i and the service time is T k, j , Wc k, j can be expressed as
where i = 2j − k − 1. In these two equations, $t_{c_{k,i}}$ and $t^{(2)}_{c_{k,i}}$ denote the first two moments of the service time on network channels and can be determined from the Laplace-Stieltjes transform of the service time on network channels at stage k [Kleinrock 1975]. The traffic intensity is $\rho_{c_{k,i}} = t_{c_{k,i}}\lambda_{c_i}$, where $\lambda_{c_i}$ is the mean traffic rate arriving at the network channels and is equal to $\pi_{c_i}\bar{\lambda}_{c_i}$. Here $\pi_{c_i}$ is the steady-state vector of $\mathrm{MMPP}_{c_i}$, $\bar{\lambda}_{c_i} = \Lambda_{c_i} e_c$, and $e_c$ is the column unit vector of length 2. The algorithm for computing the matrix $g_{c_{k,i}}$ can be found in Fischer and Meier-Hellstern [1993]. $Wc_{r,k,j}$ can be determined according to Eqs. (35) and (36), by modelling the network channel as an MMPP/G/1 queueing system where the arrival process is modelled by $\mathrm{MMPP}_{cr}$ and the service time is $T_{r,k,j}$.
4.3.3. Time for the Tail Flit of Regular and Hot-Spot Messages to Reach the Destination. The mean time for the tail flit of a regular message to reach the destination, τ r , can be given by
The time for the tail of a 2 j-channel hot-spot message to reach the destination, τ h j , can be determined by
The communication latency for regular messages, L r , can be written as L r = T r + W r + τ r , and that for a 2 j-channel hot-spot message is given by L h j = T h j + W h j + τ h j . Averaging over all possible values of j gives the communication latency for a hot-spot message, L h .
Implementation of the Analytical Model.
To make the derivation of the analytical model easier to follow, we outline below the key steps for implementing the model.
Step 1. Calculate the parameter matrices denoting the traffic patterns on network channels.
Step 1.1. Calculate the parameter matrices of MMPP cr for modelling the regular traffic arriving at a given network channel using Eqs. (20)- (25);
Step 1.2. Calculate the parameter matrices of MMPP ch i for modelling the hot-spot traffic arriving at network channels located i hops away from the hot-spot node based on Eqs. (21)- (25) and (28);
Step 1.3. Apply Eqs. (21)- (25) again to calculate the parameter matrices of MMPP c i for modelling the traffic at network channels that are i hops away from the hot-spot node.
Step 2. Based on the parameter matrices of the traffic patterns obtained from Step 1, calculate the communication latency for regular messages and hot-spot messages in on-chip interconnects.
Step 2.1. Calculate the waiting time at the source node for regular messages and hot-spot messages using Eq. (18);
Step 2.2. Calculate the transmission delay for regular messages and hot-spot messages using Eqs. (30)- (32);
Step 2.3. Calculate the time for the tail flit of a message to reach the destination using Eqs. (37) and (38).
Step 3. Based on the communication latencies derived from Step 2, calculate the communication latency for a given message in the on-chip interconnects using Eq. (16).
VALIDATION OF THE MODEL
The accuracy of the analytical model is validated by means of a flit-level discrete-event simulator built on the OMNeT++ simulation framework. The communication latency is defined as the mean amount of time from the generation of a message until the last data flit reaches the processing core of the destination. Extensive simulation experiments have been performed to validate the accuracy of the model for various combinations of message lengths, MMPP s parameters and hot-spot fractions. However, for the sake of specific illustration and without loss of generality, the latency results are presented for the following cases [Moadeli et al. 2010; Salminen et al. 2008; Wu et al. 2011]:
An 8-port 2-tree is used to construct the underlying on-chip interconnects;
Message length: 16 and 32 flits;
Flit length: L = 16 bytes;
Buffer size: B = 32 flits;
The bandwidth is set to be 20 messages per cycle;
The point-to-point latency and switch latency are 0.2 cycles;
Parameters ϕ s1 and ϕ s2 in the infinitesimal generator Q s of MMPP s are set to ϕ s1 = 0.08 and ϕ s2 = 0.04 (i.e., ϕ s1 = 2ϕ s2 ), and to ϕ s1 = 0.09 and ϕ s2 = 0.06 (i.e., ϕ s1 = 3ϕ s2 /2), representing different degrees of traffic burstiness and correlation;
The hot-spot fraction is set to δ = 0.05, 0.1, 0.15, and 0.2, representing different degrees of nonuniformity of message destinations.
Figures 3 and 4 depict the communication latency predicted by the analytical model plotted against that provided by the simulator as a function of the traffic rate in the 8-port 2-tree on-chip interconnects. In these figures, the horizontal axis represents the traffic rate, λ s1 , at which a processing core injects messages into the network when MMPP s is in state 1, while the vertical axis denotes the communication latency. For the sake of clarity of the figures, we have deliberately set the arrival rate at state 2, λ s2 , to zero; otherwise, three-dimensional graphs would be needed to represent the results.
These figures reveal that the communication latency obtained from the derived model closely matches that obtained from the simulation: the average prediction error, calculated as |result simulation − result analytical | / result simulation over all simulation points, is less than 6%. The tractability and accuracy of the model make it a practical and cost-effective tool to gain insight into the performance of on-chip interconnection networks in the presence of bursty multimedia traffic with hot-spot destinations.
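The error metric quoted above can be computed as in the following Python sketch, which averages the relative difference between simulated and analytical latencies over all points. The function name and the sample latency values are hypothetical and do not correspond to the results in Figures 3 and 4.

```python
def average_prediction_error(simulated, analytical):
    """Mean of |simulation - analytical| / simulation over all points."""
    if not simulated or len(simulated) != len(analytical):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(s - a) / s for s, a in zip(simulated, analytical)]
    return sum(errors) / len(errors)


# Hypothetical latencies (in cycles) at three traffic rates.
sim_latency = [100.0, 200.0, 400.0]
model_latency = [95.0, 210.0, 380.0]
error = average_prediction_error(sim_latency, model_latency)
```

Each of the three hypothetical points differs from its simulated value by 5%, so the average error in this example is 5%.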
PERFORMANCE ANALYSIS

The Impact of Traffic Patterns on Network Performance
Having validated the accuracy of the analytical model, we now use it to investigate how bursty multimedia traffic and hot-spot destinations affect the performance of on-chip interconnection networks, for different degrees of traffic burstiness and correlation imposed by the MMPP input parameters (which can be calculated by Eqs. (14) and (15)) and for different hot-spot fractions. We consider four different cases of parameter settings for various traffic patterns, as shown in Table II.
From Figure 5(a), we can find that the bursty multimedia traffic degrades the network performance considerably, since the communication latency increases, especially under moderate and heavy traffic loads. Moreover, the maximum throughput that the network is able to support decreases when it is subject to the bursty multimedia traffic. Comparing Case (I) with Case (II) shows that the maximum network throughput decreases significantly and the network performance degrades in the presence of hot-spot destinations. This is because the hot-spot traffic causes higher traffic loads on network channels located closer to the hot-spot node, so these channels become overloaded quickly. To take into account the impact of hot-spot destinations on the performance of multimedia embedded systems, we further compare the results under Case (III) and Case (IV) and observe a phenomenon similar to that under Case (I) and Case (II). Examining Figure 5(b) for different settings of the message size reveals the same results.
From this analysis, we can find that the proposed model manages to predict the increase in the communication latency and the decrease in the maximum network throughput in the presence of bursty multimedia traffic with hot-spot destinations. These observations highlight the importance of developing and using realistic models for the study and optimisation of on-chip interconnection networks.
These results also demonstrate that the network suffers significant performance degradation in the presence of bursty multimedia traffic with hot-spot destinations.
Comparison of Runtime between Analytical Model and Simulator
To investigate the efficiency of the analytical model, in this section we compare the runtime required by the analytical model and by the simulation experiments to obtain the desired performance results. To this end, we use the scenario of Figure 3(a) as an example and present the runtime needed to obtain the analytical and simulation results with message lengths of 16 and 32 flits. The other parameter settings are the same as those presented in Section 5. All the results were obtained on a 32-bit PC using an Intel(R) Core(TM) 2 Quad CPU 2.66GHz with 3.46GB of RAM. Tables III and IV list the runtime, for message lengths of 16 and 32 flits, respectively, required by the analytical model and the simulation experiments. The tables reveal that the runtime required to reach reliable performance results in the simulation experiments is about 300 times longer than that required by the analytical model. The results demonstrate that the analytical model can be used as a cost-effective tool for performance evaluation of on-chip interconnection networks in multimedia embedded systems.
CONCLUSIONS AND FUTURE WORK
This article has developed an analytical model to evaluate the performance of on-chip interconnection networks under bursty multimedia traffic and non-uniform destinations. The bursty traffic is modelled by the well-known MMPP and the destination distribution is modelled by hot-spot destinations. The on-chip network architecture is built on the popular fat-tree topology. Extensive simulation experiments have been conducted to validate the accuracy of the model. The tractability and accuracy of the model make it a practical and cost-effective tool to gain insight into the performance of on-chip interconnection networks in the presence of realistic network traffic. The model has then been applied to investigate the impact of bursty multimedia traffic with hot-spot destinations on the performance of on-chip interconnection networks. The analytical results have shown that the network performance degrades considerably under such traffic patterns. In future work, we will extend the analytical model to consider the application of virtual channel flow control. The key tasks for this extension include calculating the status of each virtual channel at a given physical channel when determining the waiting time experienced by messages, and capturing the effect of virtual-channel multiplexing on the average latency.
