Long-Range Dependence and On-chip Processor traffic by Scherrer, Antoine et al.
Long-Range Dependence and On-chip Processor traffic
Antoine Scherrer, Antoine Fraboulet, Tanguy Risset
To cite this version:
Antoine Scherrer, Antoine Fraboulet, Tanguy Risset. Long-Range Dependence and On-chip
Processor traffic. Microprocessors and Microsystems: Embedded Hardware Design (MICPRO),
Elsevier, 2009, 33 (1), pp.72-80. <10.1016/j.micpro.2008.08.010>. <hal-00391215>
HAL Id: hal-00391215
https://hal.archives-ouvertes.fr/hal-00391215
Submitted on 3 Jun 2009
Long-Range Dependence
and On-chip Processor Traffic
Antoine Scherrer b Antoine Fraboulet a Tanguy Risset a
aCITI - INSA Lyon
Bat 502, 20 avenue Albert Einstein,
F-69621, Villeurbanne, France
firstname.lastname@insa-lyon.fr
bLaboratoire de Physique - ENS Lyon
69364 Lyon Cedex 07, France
firstname.lastname@ens-lyon.fr
Abstract
Long-range dependence is a property of stochastic processes that has an important
impact on network performance, especially on the buffer usage in routers. We ana-
lyze the presence of long-range dependence in on-chip processor traffic and we study
the impact of long-range dependence on networks-on-chip. We propose to investi-
gate the presence of long-range dependence in communication traces of processor ips
at the cycle-accurate level. We also study the impact of long-range dependence on a
real network-on-chip using the SocLib simulation environment and traffic generators
of our own. Our experiments show that long-range dependence is not a ubiquitous
property of on-chip processor traffic and that its impact on the network-on-chip is
highly correlated with the low level communication protocol used.
Key words: Network on Chip, System on Chip, embedded software, long range
dependence, network traffic
1 Introduction
The next generation of embedded systems on chip (SoC) will soon integrate
many processor cores together with dedicated circuits connected via a network.
This is particularly true for chips integrated in telecommunication and multi-
media devices which require more and more computing power. These devices
must be designed faster to meet the market demand which implies that more
and more embedded software is present on these chips. Embedded software
brings a faster and more flexible design process. However, as performance is
Preprint submitted to Elsevier 15 May 2008
decreased compared to dedicated hardware, parallelization and efficient com-
munication schemes must be implemented to reach the required computing
power. In particular, controlling network throughput and contention is essen-
tial and, in general, very difficult because of the non-determinism introduced
by parallelism. This is why network prototyping will become a major issue in
future SoC design processes.
Recent advances in Internet traffic analysis have shown that predicting per-
formance of a network is a difficult task. Stochastic modeling of the traffic
must be used and it has been shown [24] that this modeling must take into ac-
count second order stochastic properties (covariance in addition to marginal
law). More recently Varatkar and Marculescu [30] have shown that on-chip
traffic might require a similar modeling. They have found long-range depen-
dent behavior in communications between different parts of the Mpeg-2 video
decoding application. The presence of long-range dependence (lrd) in embed-
ded applications, as in Internet traffic, could imply that ips communication
behavior will not be correctly modeled by a simple random traffic even if it is
adjusted to a particular first order statistical law. However, the experiments
of Varatkar and Marculescu were not done at cycle accurate level and it is still
not clear how the lrd will impact the network-on-chip.
We want to emphasize the originality of our work with respect to the seminal
paper of Varatkar and Marculescu [30], in which it was shown that the aggre-
gated throughput of the macro-block arrival process at the IDCT/IQ module
of a hardware Mpeg-2 decoder exhibits long-range dependence. Standard
queuing analysis with self-similar input was then used in order to evaluate the
buffer usage. In the present work, we first investigate the presence of long-
range dependence in the traffic produced by an on-chip processor ip at the
cycle accurate level and then we show that lrd has a very different impact
when the low level communication protocol uses non-split transactions. In
practice, if the application is executed by a processor connected to a cache,
the fact that each read transaction is waiting for an answer (this is called a
non-split transaction), makes the impact of lrd negligible. If the application
is executed by a dedicated ip that computes on data streams (i.e. not waiting
for answers: split transactions), lrd decreases performance.
We also demonstrate here a way to adjust the size of the fifos given particular
statistical properties of the executed application. As pointed out in [9], “buffers
account for the main part of the NoC router area, it is a major concern to
minimize the amount of buffering necessary under performance requirements”.
We quantify experimentally the impact of lrd on fifo usage and on average
communication latency and hence propose a way to evaluate the required
size of fifos to cope with the presence of lrd. This is possible because we
developed a complete NoC prototyping environment [26, 27] in which we are
able to produce lrd traffic. The NoC used in the simulations is dspin [14], and
the simulation is done at cycle accurate level: routers are precisely simulated
and instrumented in order to observe the exact behavior of what would occur
on a real chip.
The paper is organized as follows: the next section presents the theoretical
background needed to understand the notion of lrd, Section 3 presents our
NoC prototyping environment. Experiments and results about the presence
and the impact of lrd are presented in Section 4. Related works are referenced
in Section 5.
2 Network traffic and long-range dependence
We start this section by introducing a simple on-chip traffic modeling formal-
ism. We then give theoretical background on long-range dependence, whose
impact on the performance of on-chip networks is the main topic of the paper.
2.1 On-chip traffic modeling
Fig. 1. Traffic modelling formalism (timing diagram of the REQUEST and RESPONSE signals against CLOCK for transactions k and k+1, annotated with A(k), C(k), S(k), D(k), I(k), L(k) and R(k))
The traffic produced by a component is modeled as a sequence of transactions
composed of flits (flow transfer units), each corresponding to a bus word. The kth
transaction is a 5-tuple (A(k), C(k), S(k), D(k), I(k)) whose elements are, respectively, the
target address, the command (read or write), the size of the transaction, the delay and the inter-
request time. This is illustrated in Fig. 1. We also define the latency of the
kth transaction L(k) as the number of cycles between the start of the kth
request and the start of the associated response. This is basically the round-
trip time in the network and will be used in our experiments to evaluate
contention. One can distinguish two main communication schemes used by
ips: the non-split transactions scheme where the ip is not able to send a
request until the response to the previous one has been received, and the split
transactions scheme in which new requests can be sent without waiting for the
responses. The non-split transaction scheme is used by processors and caches
for instance (although for cache, it might depend on the cache parameters).
The split transaction scheme is used by dedicated ips performing computations
on streams of data which are transmitted via dedicated direct memory access
(dma) modules.
Our goal is to emulate real traffic by means of traffic generators. With the
above formalism, traffic generation means producing (A(k), C(k), S(k), D(k),
I(k)), for each k. This can be done either deterministically (replay of a recorded
trace), or using stochastic processes. In this work, we use stochastic traffic gen-
erators. Note that, in this case, we make the assumption that the elements of
the transaction sequence are independent.
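As an illustration of the formalism above, the following Python sketch represents a transaction as a 5-tuple and draws a sequence of independent transactions. The class name, the field names, the address range and the choice of exponential laws are assumptions made for this example; they are not taken from the traffic generator described in this paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One transaction (A(k), C(k), S(k), D(k), I(k)); field names are illustrative."""
    address: int        # A(k): target address
    command: str        # C(k): "read" or "write"
    size: int           # S(k): size of the transaction in flits
    delay: int          # D(k): delay in cycles
    inter_request: int  # I(k): inter-request time in cycles


def generate_transactions(n, mean_size=10.0, mean_delay=20.0, seed=0):
    """Draw n transactions whose elements are mutually independent.

    The numeric defaults are hypothetical values chosen for the example.
    """
    rng = random.Random(seed)
    transactions = []
    for _ in range(n):
        transactions.append(Transaction(
            address=rng.randrange(0, 1 << 16) * 4,  # word-aligned address (assumption)
            command=rng.choice(("read", "write")),
            size=max(1, round(rng.expovariate(1.0 / mean_size))),
            delay=round(rng.expovariate(1.0 / mean_delay)),
            inter_request=round(rng.expovariate(1.0 / mean_delay)),
        ))
    return transactions


if __name__ == "__main__":
    for transaction in generate_transactions(3):
        print(transaction)
```

Each element is drawn from its own random stream of values here, which mirrors the independence assumption stated above; a deterministic generator would instead replay the recorded 5-tuples in order.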
Let us recall that a stochastic process X is a sequence of random variables X[i]
(we use brackets to denote random variables). We will consider two statistical
characteristics of stochastic processes: the marginal law, which represents how
the values taken by the process are distributed, and the covariance function,
which gives information on the correlation between the random variables of the
process as a function of the time lag between them. For instance, the sequence
of delays D(k) can be generated as the sample path of some stochastic process
{D[i]}i∈N, with prescribed first and second statistical orders.
2.2 Long-range dependence
2.2.1 Definitions
Long-range dependence (lrd) is a property of a stochastic process which is
defined as a slow decrease of its covariance function [23]. This function de-
scribes how the random variables of a process are correlated with each other
as a function of the time lag between these random variables. The covariance
function γX of a stochastic process {X[i]}i∈N is defined as follows (E is the
expectation):
$$\gamma_X(i, j) = E(X[i]X[j]) - E(X[i])\,E(X[j])$$
If the process X is wide-sense stationary, then its mean is constant ($\forall (i, j) \in \mathbb{N}^2,\ E(X[i]) = E(X[j]) \triangleq E(X)$) and its covariance reduces to a function of one variable:
$$\forall (i, j) \in \mathbb{N}^2,\quad \gamma_X(i, j) = \gamma_X(0, |i - j|) \triangleq \gamma_X(|i - j|)$$
We expect this function to be decreasing, because correlated data are more
likely to be close (in time) to each other. However, if the process is long-range
dependent, then the covariance decays very slowly and is not summable:
$$\sum_{k \in \mathbb{N}} \gamma_X(k) = \infty$$
Therefore, long-range dependence reflects the ability of the process to be highly
correlated with its past, because even at large lags, the covariance function
is not negligible. This property is also linked to self-similarity, which is more
general: it can be shown that asymptotic second order self-similarity implies
long-range dependence [7].
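To make the covariance function concrete, here is a minimal Python sketch of the usual biased sample autocovariance of a finite, stationary series. The function name is an assumption made for this example. Note that the experiments of this paper rely on the wavelet-based estimator of Section 2.2.2, not on this direct estimate.

```python
def sample_autocovariance(x, max_lag):
    """Biased sample autocovariance of x for lags 0..max_lag.

    Every lag is normalised by len(x) (not by len(x) - lag), which is the
    common "biased" convention.
    """
    n = len(x)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(x)")
    mean = sum(x) / n
    centered = [value - mean for value in x]
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]


if __name__ == "__main__":
    # An alternating series is strongly (negatively) correlated at lag 1.
    print(sample_autocovariance([1.0, -1.0] * 50, 3))
```

For a long-range dependent series, the values returned for large lags would decay slowly instead of dropping quickly towards zero.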
lrd is a ubiquitous property of Internet traffic [18,24], and it has also been
observed in on-chip multimedia (MPEG-2) applications by Varatkar and Mar-
culescu [30]. The main interest in lrd resides in its strong impact on network
performance [23]. Above all, more buffer space is needed
when the input traffic has this property [12]. As a consequence, for macro-
networks as well as for on-chip networks, lrd should be taken into account if
it is present in the traffic that the network will have to handle.
A long-range dependent process is usually modeled with a power-law decrease
of the covariance function as follows:
$$\gamma_X(k) \underset{k \to +\infty}{\sim} c\,k^{-\alpha}$$
The exponent α (also called scaling index) provides a parameter to quantify
the long-range dependence of the process (0 < α ≤ 1). The Hurst exponent, denoted
H, is the classical parameter for describing self-similarity [23]. Because of the
analogy between lrd and self-similarity, it can be shown that a simple relation
exists between H and α: H = (2− α)/2. As a consequence, H (1/2 < H < 1)
is the commonly used parameter for characterizing long-range dependence.
Note that when H = 0.5, then there is no long-range dependence (this is also
referred to as short-range dependence).
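The following short derivation may help to see why a power-law covariance with 0 < α ≤ 1 is not summable, and how α maps to H. It is a sketch based on a standard comparison between a sum and an integral, and it assumes that the covariance is exactly c k^{-α} with c > 0 for all k ≥ 1, which is stronger than the asymptotic equivalence stated above.

```latex
% Since x^{-\alpha} is decreasing, k^{-\alpha} \ge \int_{k}^{k+1} x^{-\alpha}\,dx, hence
\sum_{k=1}^{n} c\,k^{-\alpha} \;\ge\; c\int_{1}^{n+1} x^{-\alpha}\,dx
= \begin{cases}
c\,\dfrac{(n+1)^{1-\alpha}-1}{1-\alpha} & \text{if } 0<\alpha<1,\\[2ex]
c\,\ln(n+1) & \text{if } \alpha = 1,
\end{cases}
\qquad\xrightarrow[n\to+\infty]{}\; +\infty .

% Worked examples of the relation H = (2 - \alpha)/2:
\alpha = 0.4 \;\Longrightarrow\; H = \frac{2-0.4}{2} = 0.8,
\qquad
\alpha = 1 \;\Longrightarrow\; H = \frac{2-1}{2} = 0.5 .
```

The slower the covariance decays (smaller α), the closer H is to 1.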
2.2.2 Estimation of the Hurst parameter
We have used a standard wavelet-based methodology for the estimation of
the Hurst parameter [7] in our experiments. Let $\psi_{j,k}(t) = 2^{-j/2}\,\psi_0(2^{-j}t - k)$
denote an orthonormal wavelet basis, designed from the mother wavelet ψ0.
The j index represents the scale: the larger j is, the more the wavelet is dilated.
The k index is a shift in time.
For any (j, k), dX(j, k) = 〈ψj,k, X〉 are called the wavelet coefficients of the
stochastic process X (〈., .〉 is the inner product in the L2 functional space).
These wavelet coefficients permit a study of the X process at various times
(values of k) and various scales (values of j).
When X is a long-range dependent process with parameter H, the following
limit behavior for the expectation of wavelet coefficients can be shown [7]:
$$\forall j,\quad E\left(d_X(j,k)^2\right) \underset{j \to +\infty}{\sim} C\,2^{j(2H-1)} \qquad (1)$$
It can also be shown that the time averages Sj for each scale j (nj is the
number of wavelet coefficients available at scale j):
$$S_j = \frac{1}{n_j} \sum_{k=1}^{n_j} |d_X(j,k)|^2 \qquad (2)$$
can be used as relevant, efficient and robust estimators for $E(d_X(j,k)^2)$ [7].
From Eq. (1) and (2), the estimation of H is as follows: i) plot log2 Sj versus
log2 2^j = j and ii) perform a weighted linear regression in the limit of the
coarsest scales (see for instance Fig. 2). These plots are commonly referred
to as log-scale diagrams (LD). In such diagrams, lrd is demonstrated by a
straight line behavior in the limit of large scales; by Eq. (1), the slope of this
line is 2H − 1. In particular, if the line is
horizontal, then H = 0.5 and there is no long-range dependence.
To illustrate how we use this tool to evaluate the Hurst parameter, we provide
in Fig. 2 a typical log-scale diagram extracted from an Internet trace [6].
Along the x axis are the different values of the scale j at which the process is
observed. For each scale, log2Sj is plotted together with its confidence interval
(vertical bars). The Hurst parameter can be estimated if the different points
plotted are aligned on a straight line for large scales.
Fig. 2. Example of log-scale diagram (LD), plotting log2 Sj versus the scale j. The Hurst parameter is estimated by
the slope of the dashed line (here, H = 0.83)
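The estimation procedure of this section can be sketched in a few lines of Python. The sketch below is a simplified illustration, not the estimator of [7]: it assumes the Haar wavelet as mother wavelet, it uses the number of coefficients nj as regression weight, and it omits the bias corrections and confidence intervals of the full method. The function names are chosen for this example.

```python
import math
import random


def haar_logscale_diagram(x, max_scale=None):
    """Return a list of (j, log2(S_j), n_j) computed with the orthonormal Haar wavelet."""
    approx = [float(value) for value in x]
    diagram = []
    j = 1
    while len(approx) >= 2 and (max_scale is None or j <= max_scale):
        n_j = len(approx) // 2
        # Wavelet (detail) coefficients d_X(j, k) and next approximation.
        details = [(approx[2 * k] - approx[2 * k + 1]) / math.sqrt(2.0) for k in range(n_j)]
        approx = [(approx[2 * k] + approx[2 * k + 1]) / math.sqrt(2.0) for k in range(n_j)]
        s_j = sum(d * d for d in details) / n_j  # Eq. (2)
        if s_j > 0.0:
            diagram.append((j, math.log2(s_j), n_j))
        j += 1
    return diagram


def estimate_hurst(x, j_min=1, j_max=None):
    """Weighted linear regression of log2(S_j) on j; the slope is 2H - 1 by Eq. (1)."""
    points = [p for p in haar_logscale_diagram(x, j_max) if p[0] >= j_min]
    if len(points) < 2:
        raise ValueError("need at least two scales for the regression")
    total = sum(n for _, _, n in points)
    j_bar = sum(n * j for j, _, n in points) / total
    y_bar = sum(n * y for _, y, n in points) / total
    numerator = sum(n * (j - j_bar) * (y - y_bar) for j, y, n in points)
    denominator = sum(n * (j - j_bar) ** 2 for j, _, n in points)
    slope = numerator / denominator
    return (slope + 1.0) / 2.0


if __name__ == "__main__":
    rng = random.Random(0)
    white_noise = [rng.gauss(0.0, 1.0) for _ in range(2 ** 14)]
    # White noise has no memory, so the estimate should be close to 0.5.
    print(estimate_hurst(white_noise, j_min=1, j_max=8))
```

In practice the range [j_min, j_max] has to be chosen by inspecting the diagram, since the straight-line behavior is only expected at the coarsest scales.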
2.2.3 Synthesis of long-range dependent processes
The synthesis (generation of sample paths) of long-range dependent processes
is easy if the marginal law is Gaussian [8]. The so-called Fractional Gaussian
Noise (fgn) is commonly used for that. However, if one wants to generate a
long-range dependent process whose marginal law is non-Gaussian, the prob-
lem is more complex. The inverse method [30] only guarantees an asymptotic
behavior of the covariance function. We have developed, for several common
laws (exponential, gamma, χ2, etc.), an exact method of synthesis described
in [28]. We can thus produce synthetic long-range dependent sample paths that
can be used, as will be shown in Section 4, to evaluate the impact of
lrd on the performance of an on-chip network. For instance, delay sequences
are not likely to have a Gaussian distribution, but rather an exponential one
as we expect many small delays and few big ones. Using our synthesis method,
we can produce a synthetic exponential process with long-range dependence.
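For the Gaussian case, a sample path of fgn can be synthesized exactly from its known autocovariance with a Cholesky factorization. The Python sketch below illustrates this for short paths only (the cost grows as the cube of the length). It then applies an inverse-distribution transform to obtain an exponential marginal; this transform is the kind of approach that only preserves the covariance asymptotically, and it is not the exact method of [28]. All function names are assumptions made for this example.

```python
import math
import random


def fgn_autocovariance(lag, hurst):
    """Autocovariance of unit-variance fractional Gaussian noise at the given lag."""
    k = abs(lag)
    return 0.5 * ((k + 1) ** (2 * hurst) - 2 * k ** (2 * hurst) + abs(k - 1) ** (2 * hurst))


def cholesky_lower(matrix):
    """Lower-triangular factor L such that L times its transpose equals matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def synthesize_fgn(n, hurst, seed=0):
    """Exact Gaussian synthesis: colour white noise with the Cholesky factor."""
    rng = random.Random(seed)
    covariance = [[fgn_autocovariance(i - j, hurst) for j in range(n)] for i in range(n)]
    lower = cholesky_lower(covariance)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(lower[i][k] * white[k] for k in range(i + 1)) for i in range(n)]


def to_exponential_marginal(gaussian_path, mean):
    """Map standard Gaussian values to exponential ones with the given mean.

    Uses 1 - Phi(g) = erfc(g / sqrt(2)) / 2; the covariance is only
    preserved asymptotically by this transform.
    """
    return [-mean * math.log(0.5 * math.erfc(g / math.sqrt(2.0))) for g in gaussian_path]


if __name__ == "__main__":
    delays = to_exponential_marginal(synthesize_fgn(64, hurst=0.8), mean=10.0)
    print(delays[:5])
```

With H = 0.5 the covariance matrix is the identity, so the synthesized path reduces to plain white noise.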
3 Multi-phase traffic generator environment
In the following, we present our synthesis flow for building multi-phase traffic
generators that can be used to replace an ip in cycle-accurate NoC performance
evaluation.
Fig. 3. Multi Phase Traffic Generator (mptg) Framework: Traffic analysis and syn-
thesis flow (three stages: reference trace, multi-phase traffic generator configuration, performance evaluation)
3.1 SoC simulation environment
We use an open source, SystemC-based, cycle-accurate and bit-accurate sim-
ulation environment: SocLib [1]. The environment contains cycle-accurate
models for various ips. For instance, it contains a Mips R3000 processor (with
its associated data and instruction cache), on-chip memories, and a component
used for displaying output (referred to as tty). SocLib also includes cycle true
simulators of the networks on chip developed at the LIP6 laboratory: spin and
dspin [14].
The application running on the Mips is composed, in addition to bootstrap-
ping information, of the C program cross-compiled with gcc to a Mips target.
In our experiments, we use several embedded programs from the multimedia
domain (still image, video and audio decoding). We make the assumption that
contention only delays the communications, without changing their ordering.
This is realistic for most applications and networks-on-chip.
The global simulation flow is depicted in Fig. 3. First, we generate a reference
trace by running the processor ip to be emulated. This trace is obtained with
an ideal network environment (no network contention), hence capturing the
intrinsic communication behavior of the application. Next, we process the
trace in our traffic analysis and synthesis tool explained hereafter and we
obtain configuration files for our traffic generators. A generic traffic generator,
hereafter referred to as mptg, has been written once and for all.
Fig. 4. Segmentation of the MP3 aggregated throughput into four phases (top: normalized delay versus transaction index; bottom: phase number versus transaction index)
3.2 Traffic analysis
The first step of the analysis is the segmentation. It consists in identifying
phases in the trace with a regular behavior. We have developed an automatic
procedure for that based on existing work on cpu simulation. This part of the
work is detailed in [26]. An example of phases detected for the mp3 application
is shown in Fig. 4.
On each of the identified phases, each element of a transaction: A(k), C(k),
S(k), D(k) and I(k) (see 2.1), is independently analyzed with the following
procedure:
• If the designer chooses deterministic traffic generation, each transaction 5-
tuple is recorded in a file and compressed using the bz2 program, which
reduces the size of the trace by a factor of 70 on average. The resulting
trace can be directly replayed by our mptg.
• If the designer chooses stochastic traffic generation, a statistical analysis
is performed by an automatic fitting procedure that adjusts the first and
second statistical orders. The designer must choose which model they want
to use (with or without lrd for instance) before the analysis can take place.
The probability distribution function (first statistical order) can be either
fitted to some classical distributions (Gaussian, Exponential, Gamma, Log-
Normal, etc.) or kept as it is (the model is then the probability of appearance
of each value of the process). The fit is done using Maximum Likelihood
Estimation and a χ2 goodness-of-fit test is used to compare and evaluate
all different solutions [16].
The covariance function (second statistical order) can be fitted to the one
of an arma¹ process (short-range correlations only), an fgn² process (long-range
dependence only), or a farima³ process (both short and long-range
correlations). We use a wavelet-based estimation of the Hurst parameter [7]
widely adopted in the network traffic analysis domain (see Section 2.2.2).
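The first-order part of the fitting procedure above can be illustrated for one candidate law. The Python sketch below fits an exponential distribution by maximum likelihood (for this law the estimate of the mean is the sample mean) and computes a Pearson χ2 statistic over bins of equal probability. It is only a sketch for the exponential case: the function names and the number of bins are assumptions, and the actual tool compares several candidate laws.

```python
import bisect
import math
import random


def fit_exponential_mle(samples):
    """Maximum likelihood estimate of the mean of an exponential law."""
    return sum(samples) / len(samples)


def chi_square_statistic(samples, mean, n_bins=10):
    """Pearson chi-square statistic against an exponential law with the given mean.

    Bins have equal probability 1 / n_bins under the fitted law, so the
    expected count is the same in every bin.
    """
    # Interior bin edges are the quantiles i / n_bins of the exponential law.
    edges = [-mean * math.log(1.0 - i / n_bins) for i in range(1, n_bins)]
    observed = [0] * n_bins
    for value in samples:
        observed[bisect.bisect_right(edges, value)] += 1
    expected = len(samples) / n_bins
    return sum((count - expected) ** 2 / expected for count in observed)


if __name__ == "__main__":
    rng = random.Random(0)
    delays = [rng.expovariate(1.0 / 20.0) for _ in range(5000)]
    fitted_mean = fit_exponential_mle(delays)
    # With 10 bins and one fitted parameter, the statistic is compared with a
    # chi-square law with 10 - 1 - 1 = 8 degrees of freedom.
    print(fitted_mean, chi_square_statistic(delays, fitted_mean))
```

A small statistic means the candidate law is compatible with the data; repeating the computation for each candidate law gives a common scale on which the fits can be compared.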
At the end of this step, a configuration file is generated. It contains, for each
phase, all needed information for the traffic synthesis.
3.3 Traffic synthesis
The deterministic traffic generation is simply done by reading the compressed
traffic trace.
For stochastic traffic generation purposes, we have developed an indepen-
dent random number generator that can produce realizations of a wide va-
riety of processes including non-Gaussian long-range-dependent ones (see Sec-
tion 2.2.3). This generator is integrated in the mptg and the analysis simply
produces the adequate mptg configuration file.
The platform designer then describes the desired platform architecture (e.g.
the one presented in Fig. 12) and uses a Perl script (referred to as SocGen
in Fig. 3) that generates all the files needed for the simulation.
At simulation time, transactions are generated by mptg according to a phase
description file and a sequencer is in charge of switching between phases. At the
end of the simulation, performance analysis indicates whether any parameters
of the platform must be changed or not.
Footnotes: 1 Auto-Regressive Moving Average. 2 Fractional Gaussian Noise. 3 Fractionally Integrated ARMA.
3.4 Traffic generator key features
As a summary, we highlight the main characteristics of our mptg:
• Our mptg takes into account network properties (latency, contention, etc.).
This means that the mptg will offer a good emulation of the processor
independently of the network it is connected to. Hence, we can re-use the
same configuration file for many different platforms.
• The global methodology requires as little manual editing as possible (e.g.
the complete simulation platform is generated). Moreover, the traffic anal-
ysis and synthesis flow can very easily be plugged into another synthesis
environment: only the traffic generator ip needs to be adapted.
• Our mptg is able to run processes having a wide variety of parameters:
deterministic replay, various stochastic models with prescribed first and
second order statistics. To our knowledge this feature is not present in any
other cycle-accurate simulator.
• Our mptg is multi-phase and hence can take into account the non-stationarity
of the traffic.
With all these features, the designer has a very flexible tool for the design
space exploration of NoC. The accuracy of the generated traffic with respect
to the original trace it is supposed to emulate has been demonstrated in [27]:
on any metric, the mean error is less than 4%.
4 Experiments
In this section, we investigate the presence of long-range dependence in on-
chip processor traffic, and we present experimental results about the impact
of long-range dependence on network-on-chip performance. In particular, we
show how to use this information in order to evaluate the size of fifos in
routers.
4.1 Presence of long-range dependence
In order to check for lrd in on-chip traffic, we have run several software
implementations of applications from the multimedia domain:
• Mpeg-2: we use libmpeg2, an open-source implementation of the Mpeg-2
video decoding standard [5].
• Mp3: we use libmad, an open-source implementation of the Mp3 audio
decoding standard [4].
• Jpeg: we use two implementations of the Jpeg still image decoding standard. The first one, known
as M-Jpeg, is a multi-thread implementation and the second one, known as
S-Jpeg, is a single-thread implementation [2].
• Jpeg-2000: we use libj2k, an open-source implementation of the Jpeg-
2000 image decoding standard [3].
The input data for each application are described in Tab. 1.
Fig. 5. The simulation platform for lrd detection (Mips r3000, cache and ram, with the measurement point)
We use a simple platform (see Fig. 5) in order to truly characterize the traffic of
the triplet (implementation/processor/cache). As explained in Section 3, our
goal is to replace this triplet by a traffic generator. If we study communication
on a more complex platform (the one in Fig. 12 for instance), the traffic of an
ip is influenced by communications between other ips and by network-on-chip
configuration (topology, routing protocol, etc.). Therefore, each simulation is
first done on a simple platform including a Mips r3000 processor (associated
with instruction and data caches), directly connected to a memory holding all
necessary data, as represented in Fig. 5.
App.        Input
Mpeg-2      2 images from a clip (176×144 color pixels)
Mp3         2 frames from a sound (44.1 kHz, 128 kbps)
M-Jpeg      “Lena” picture (256×256)
S-Jpeg      “Lena” picture (256×256)
Jpeg-2000   “Lena” picture (256×256)
Table 1. Inputs used in the simulations
The traffic trace is recorded at the network’s interface of the cache. A commu-
nication trace is extracted from the vcd (Value Change Dump) trace file, as
explained in Section 3. We study the aggregated throughput, computed as the
number of flits sent in consecutive time-windows of size 100 cycles. Aggregated
throughput is interesting because it combines delay and size of transactions.
For the sake of clarity, we present results on this quantity alone. Note that
lrd is present neither in commands nor in target addresses time series.
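The aggregated throughput defined above is simple to compute from a trace. The Python sketch below assumes that the trace has already been reduced to the list of cycles at which a flit was sent; this input format and the function names are assumptions made for the example. The second function converts a scale 2^j of a log-scale diagram computed on this series back into cycles, which is how a scale such as 2^14 can be read as roughly 1.6 million cycles.

```python
def aggregated_throughput(flit_cycles, window=100):
    """Number of flits sent in consecutive time windows of `window` cycles."""
    if not flit_cycles:
        return []
    counts = [0] * (max(flit_cycles) // window + 1)
    for cycle in flit_cycles:
        counts[cycle // window] += 1
    return counts


def scale_to_cycles(j, window=100):
    """Duration in cycles spanned by scale 2**j of the aggregated series."""
    return (2 ** j) * window


if __name__ == "__main__":
    # Three flits fall in window [0, 100), one in [100, 200), one in [200, 300).
    print(aggregated_throughput([0, 5, 99, 100, 250]))
    print(scale_to_cycles(14))
```

Windows in which nothing is sent are kept as zeros, so the resulting series is regularly sampled in time, as the wavelet analysis requires.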
In order to estimate the covariance function and investigate the presence of
lrd in these traffic traces, we have used the wavelet-based estimator [7] de-
scribed in Section 2.2. Each log-scale diagram (Fig. 7-11) includes the esti-
mated covariance (full line) and the linear regression for the estimation of
the Hurst parameter (dotted line). The estimated value for this parameter,
denoted Ĥ, is reported in the caption.
As stated in Section 3, traffic phases can be identified, and one should pay
attention, when checking for long-range dependence in a traffic trace, to con-
sider stationary parts only. Indeed, covariance estimation tools can produce
misleading results when used on non-stationary time series. Therefore, we have
extracted stationary parts of the traffic traces based on information provided
by our phase determination software [26]. This is shown in Fig. 6. In each part
we compute LD diagrams and estimations of the Hurst parameter as shown
in Fig. 7-11. Below are the comments for each application.
• Mpeg-2 (see Fig. 7). The shape of the LD does not exhibit evidence of
lrd. Indeed, the estimated value of the Hurst parameter, 0.56, means that no
lrd is present in the trace. In this case, an iid (Independent Identically Dis-
tributed) process is a good approximation of the traffic. One can note a peak
around scale 2^5, meaning that a recurrent operation with this periodicity is
present in the algorithm, which might have an impact on network contention.
Such behavior could be captured by an ARMA process for instance [10].
• Mp3 (see Fig. 8). As for the Mpeg-2 implementation, no trace of lrd can
be found in this case: the estimated Hurst parameter is close to 0.5. The
other parts of the communication trace do not present any lrd either.
• M-Jpeg (see Fig. 9). For this simulation, a linear behavior can be observed
in the range of scales [2^4, 2^14]. On top of this behavior, a peak is present
around scale 2^11. The behavior at larger time scales does not exhibit long-
range dependence, but scale 2^14 corresponds to approximately 1.6 million
cycles, which means lrd should be taken into account. For the joint mod-
eling of lrd and small-scale behavior, farima processes can be used.
• S-Jpeg (see Fig. 10). Firstly, it is interesting to note that the shape of the
LD is very different from the M-Jpeg implementation, showing that the
statistical properties observed are implementation rather than algorithm
dependent. For the S-Jpeg implementation, long-range dependence can be
observed in the limit of large scales ([2^10, 2^13]). As for other applications,
peaks are present in the LD.
• Jpeg-2000 (see Fig. 11). On this application, the traffic trace exhibits a
strong non-stationarity, so that the trace must be split into rather short parts
for the analysis. In some of these parts, corresponding specifically to the
Tier-1 entropy decoder of the Jpeg-2000 algorithm, lrd is present with
an estimated Hurst parameter value between 0.85 and 0.92 (depending on
the parts). In the other parts of the algorithm, no long-range dependence
could be found.
Fig. 6. Aggregated throughput traces (flits versus thousands of cycles) obtained from the execution of each imple-
mentation (Mpeg-2, Mp3, M-Jpeg, S-Jpeg and Jpeg-2000). The boxes mark the part of the trace on which the LD curves were
obtained.
Fig. 7. LD (log2 Sj versus j) of the traffic trace corresponding to the Mpeg-2 implementation. Ĥ = 0.56
Fig. 8. LD (log2 Sj versus j) of the traffic trace corresponding to the Mp3 implementation. Ĥ = 0.52
Fig. 9. LD (log2 Sj versus j) of the traffic trace corresponding to the M-Jpeg implementation. Ĥ = 0.77
Fig. 10. LD (log2 Sj versus j) of the traffic trace corresponding to the S-Jpeg implementation. Ĥ = 0.95
We can conclude from these experiments that long-range dependence is not
a ubiquitous property of the traffic produced by a processor associated with
a cache executing a multimedia application. In some parts of the algorithms,
lrd is present; however, it is combined with periodicity effects which may have
an equivalent impact on the network-on-chip performance.
Fig. 11. LD (log2 Sj versus j) of the traffic trace corresponding to the Jpeg-2000 implementation. Ĥ = 0.89
4.2 Impact of long-range dependence
In this section we study the impact of long-range dependence on FIFO usage
and communication latency in a network-on-chip.
4.2.1 Experimental Setup
Fig. 12. Platform used for evaluating the impact of LRD: the LRD TG and BACK TG traffic generators both address the TARGET (RAM) through the NoC and meet at a contention point. Data paths along the
switches of the NoC are represented as dashed lines, FIFOs used by communications
are represented inside the switches.
All the simulations are done on the platform shown in figure 12 in the SocLib
simulation environment [21]. The network on chip is dspin developed at the
Lip6 laboratory as an evolution of the spin network [14]. It uses a mesh
topology, static XY routing and wormhole memorization in the switches.
The lrd tg component is a traffic generator that produces traffic in which
the delay or inter-request time between transactions has an exponential marginal
distribution and long-range dependence with parameter H. Sizes of transac-
tions are iid exponential random variables, with mean 10 flits (the typical
value observed in the simulation of multimedia applications). The back tg
traffic generator is used to introduce contention in the network. It injects, at a
constant rate, a traffic composed of constant-sized transactions of 10 flits. The
load introduced by this component is expressed as a percentage of the max-
imum load (1 transaction every cycle). The back tg component uses split
transactions in order to truly inject a constant rate in the network. All trans-
actions of both traffic generators are addressed to the target component: a
ram (with random read or write commands) which answers in one cycle (ideal
memory). The simulation length has been fixed to 10000 transactions of the
lrd tg.
The major goal of this work is to be able to quantify the size of the buffers
present in the switches and the wrappers (components inserted between ips
and switches) as a function of the amount of lrd present in the traffic. For
that, we have recorded the evolution of the usage of all fifos along the com-
munication path used by traffic generators, as shown in figure 12. We have
also computed the latency of the communications adding the latency of all
fifos. We have run two types of simulations:
• fixed: in these simulations, the fifo size in the switches has been fixed to
2 flits in order to study the usage of the wrapper fifos. Indeed, with such
small-sized fifos in the switches, the contention is directly transmitted to
the wrappers, which are in charge of adjusting the traffic of the ip to the
availability of the network.
• inf: in these simulations, the fifo size in the switches and in the wrappers
has been fixed to a very high (non-realistic) value of 250 flits so that the
usage of all fifos (switches and wrappers) can be studied.
In both types of simulation, we have used two different communication schemes:
• no-split: in this configuration the lrd tg does not use split transaction. It
waits for the reception of the kth transaction before attempting to send the
(k + 1)th. The generated delay is therefore the time between the reception
of the last response and the start of the new request. This communication
scheme is used by peripherals and some processors with caches.
• split: in this configuration the lrd tg uses split-transaction, that is to
say the responses do not influence the sending of the requests. Here, the
generated delay is the time between requests. This communication scheme
is representative of hardware accelerators with direct memory accesses.
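The difference between the two communication schemes can be sketched as
follows. This is a minimal illustration, not the SocLib traffic generator
itself: the function name and the representation of delays and response
latencies as plain lists of cycle counts are assumptions made for the example.

```python
def request_times(delays, latencies, split):
    """Return the cycle at which each request is sent, starting at cycle 0.

    delays[k]    : generated delay (in cycles) preceding request k + 1
    latencies[k] : cycles between sending request k and receiving its response
    split        : True for split transactions, False for no-split
    """
    times = [0]
    for delay, latency in zip(delays, latencies):
        if split:
            # split: the delay is the time between two consecutive requests
            times.append(times[-1] + delay)
        else:
            # no-split: wait for the response, then apply the generated delay
            times.append(times[-1] + latency + delay)
    return times
```

With delays of 5 cycles and response latencies of 20 cycles, the split scheme
sends requests at cycles 0, 5, 10, 15, whereas the no-split scheme sends them
at cycles 0, 25, 50, 75: in the no-split case the spacing of requests depends
on the network response time as well as on the generated delays.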
A particular simulation will therefore be referred to as a couple (T, S), where
T ∈ {fixed, inf} and S ∈ {no-split, split}. For instance, the simulation
(fixed, split) corresponds to small fifos in the switches and the
split-transaction communication scheme. The SoC platforms are simulated at
approximately 100 000 cycles per second.
4.2.2 Results
Each figure contains five curves that correspond to different values of the
Hurst parameter (H) of the lrd tg. The vertical axis is always a performance
metric. It might be, for instance, the average latency over the whole execution
or the usage (maximum or mean over the execution) of a particular fifo.
The horizontal axis corresponds to the load introduced by the back tg (in
percentage of maximum load).
[Figure 13, panels (a) to (c): average latency, average fifo usage and maximum
fifo usage, each plotted against the Back TG load, with one curve per value of
H (0.5, 0.6, 0.7, 0.8, 0.9).]
Fig. 13. Simulations FIXED and NO-SPLIT: average latency and usage of the LRD
TG wrapper FIFO. (a: Latency, b: LRD TG wrapper FIFO (Mean), c: LRD TG
wrapper FIFO (Max))
[Figure 14, panels (a) to (c): average latency, average fifo usage and maximum
fifo usage, each plotted against the Back TG load, with one curve per value of
H (0.5, 0.6, 0.7, 0.8, 0.9).]
Fig. 14. Simulations FIXED and SPLIT: same measures as in Fig. 13 but using split
transactions. (a: Latency, b: LRD TG wrapper FIFO (Mean), c: LRD TG wrapper
FIFO (Max))
The impact of lrd in the case of fixed is evidenced by the difference between
Fig. 13 and 14. Recall that for these simulations, all the buffering occurs in
the wrapper fifos because the sizes of fifos in switches are fixed to 2 flits.
Fig. 13 clearly shows that lrd has no impact on performance in the case of
non-split transactions because the five curves representing different values of
H are identical (similar results occur when observing the fifo of the back tg
wrapper). This is typically what would appear if a processor ip (with cache)
were running an application. Conversely, in the case of split transactions, the
results of Fig. 14 confirm the results of [30]: the lrd has a strong impact
on performance and must be taken into account to correctly prototype NoC
performance. The greater the H parameter is, the sooner network contention
occurs.
The second set of figures (Fig. 15 and 16) shows how we evaluate the size of
the fifos needed in the switches. This must be linked with a static analysis
of the path taken by the data. Fig. 12 highlights the fifos used by the data
flowing from the ips to the target. We may for instance deduce that the North
input fifo of router 10 will never be filled with more than 2 flits because all
data coming into router 10 comes from a single source (router 11). This is
confirmed in Fig. 16-(c). In these (inf,split) experiments (Fig. 15 and 16)
the size of the fifos in switches is fixed to a very large value (250 flits) to
evaluate the maximum and mean usage of the fifos. The contention of the network
occurs later than in the (fixed,split) experiments (comparing Fig. 15-(a) and
Fig. 14-(a)) because there are more buffers on the data path. Note that the
average latency on Fig. 15-(a) does not diverge except for H = 0.9. The latency
diverges if the throughput given to communication is less than the injection
rate of that communication. Recall that in these simulations, only the back
tg component injection rate varies. The average load introduced by the lrd
tg component is constant. Moreover, the round robin arbitration scheme in
the router guarantees a minimum throughput to the communications of the
lrd tg. Only if the fifos are small (see for instance Fig. 14-(a)), or if the
injection rate of back tg is sufficiently high, will the latency diverge. In this
particular simulation (inf,split), the average input rate of the lrd tg
is not large enough to make the latency diverge. However, it is interesting to
note that when H is close to 1 the latency diverges as if the average rate of
lrd tg were much larger.
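The divergence condition stated above can be illustrated with a toy fifo
model. The sketch below is not the NoC simulation used in this work: it is a
single unbounded fifo updated once per cycle with the Lindley recursion, and
the function name, arrival patterns and service rates are hypothetical values
chosen for the example.

```python
def fifo_occupancy(arrivals, service_per_cycle):
    """Occupancy (in flits) of an unbounded fifo after each cycle.

    arrivals[t] is the number of flits arriving at cycle t; up to
    service_per_cycle flits are drained per cycle (Lindley recursion).
    """
    occupancy = []
    q = 0
    for a in arrivals:
        q = max(0, q + a - service_per_cycle)
        occupancy.append(q)
    return occupancy


# Injection rate below the available throughput: the fifo stays empty.
stable = fifo_occupancy([1] * 1000, service_per_cycle=2)
# Injection rate above the available throughput: occupancy grows without bound.
overloaded = fifo_occupancy([2] * 1000, service_per_cycle=1)
# Same mean rate as `stable` (1 flit per cycle) but sent in bursts of 10 flits.
bursty = fifo_occupancy(([10] + [0] * 9) * 100, service_per_cycle=2)
```

In this toy model the overloaded fifo holds 1000 flits after 1000 cycles and
keeps growing, while the bursty source, whose mean rate is below the service
rate, remains bounded but requires 8 flits of buffering where the smooth source
requires none.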
Looking at Fig. 15 and 16, the designer can fix the sizes of different fifos.
For instance, if the lrd tg ip has H = 0.8 and if the back tg ip emits at
a rate of 40% of the maximum rate, then we can see by looking at Fig. 16-(a)
that the input West FIFO of router 11 should have a size of 100 flits if we do
not want any contention to appear (for instance, if hard real time is required).
If the real time constraints are soft, the designer can decide to look at the
mean usage of the fifos rather than the maximum usage. To have a more
precise evaluation of the impact of setting this FIFO size, the designer can
also re-run the simulation by fixing the size of router 11 input West FIFO to
50 for instance and check the impact on performance.
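The sizing rule described above can be sketched as a small helper. It is an
illustration only: the function name and the representation of the recorded
fifo usage as a list of per-cycle occupancies are assumptions, and rounding
the mean up to a whole number of flits is a choice made for the example.

```python
import math


def suggested_fifo_size(usage_trace, hard_real_time=True):
    """Suggest a fifo size (in flits) from a recorded usage trace.

    usage_trace: fifo occupancy recorded during an (inf, ...) simulation,
    where fifos are large enough never to saturate.
    """
    if not usage_trace:
        raise ValueError("empty usage trace")
    if hard_real_time:
        # hard real time: no contention allowed, so cover the maximum usage
        return max(usage_trace)
    # soft real time: size for the mean usage, rounded up to whole flits
    return math.ceil(sum(usage_trace) / len(usage_trace))
```

For a trace such as [0, 1, 2, 3, 100], the hard real time rule gives 100 flits
and the soft rule gives 22 flits; the effect of the smaller size would then be
checked by re-running the simulation, as explained above.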
[Figure 15, panels (a) to (c): average latency, average fifo usage and maximum
fifo usage, each plotted against the Back TG load, with one curve per value of
H (0.5, 0.6, 0.7, 0.8, 0.9).]
Fig. 15. Simulations INF and SPLIT: Maximum usage of the first three FIFOs on
the path from LRD TG to TARGET (a: Latency, b: LRD TG wrapper FIFO, c:
router 10 East out FIFO)
[Figure 16, panels (a) to (c): maximum fifo usage plotted against the Back TG
load, with one curve per value of H (0.5, 0.6, 0.7, 0.8, 0.9).]
Fig. 16. Simulations INF and SPLIT: Maximum usage of the last three FIFOs on
the path from LRD TG to TARGET: contention occurs in router 11 (input West
FIFO). (a: router 11 West in FIFO, b: router 11 South out FIFO, c: router 10 North
in FIFO)
As a summary of these experiments: lrd is more likely to have an impact on
NoC performance when a split-transaction communication scheme is used by
ips. Another original result presented here is a practical way to compute fifo
usage and to precisely fix the size of the NoC fifos in order to reach a certain
level of performance.
5 Related work
A recent and up-to-date review of work on networks-on-chip is given by
Bjerregaard and Mahadevan in [9]. In this survey, traffic characterization
for NoC performance evaluation is highlighted as a key issue.
More specifically related to on-chip traffic generation, two main approaches
have been studied: deterministic traffic generation, in which the objective is
to exactly reproduce the traffic of a given ip [13, 19, 20] and stochastic traf-
fic generation which uses random sources in place of real ips [17, 25, 31]. In
the deterministic approach, Mahadevan et al. introduced, for instance, a trace
compiler able to accurately reproduce the traffic of a processor [20]. As net-
works on chip are still in early stages of development, the main part of their
performance evaluation is done using a random source. In [17], stochastic pro-
cesses are used for generating transaction sizes and transaction delay using
several statistical laws (Poisson, Exponential, and Normal). However, none
of these works proposes a fitting procedure to determine the adequate statis-
tical models that should be used to simulate a given traffic. Thid et al. use
self-similar traffic for on-chip prototyping in a recent work [29]. The work of
Varatkar and Marculescu about long-range dependence in on-chip traffic is
referenced in Section 2.2.
Among the works addressing the fifo sizing, Hu et al. introduced a non-
uniform NoC buffer space allocation algorithm [15]. Their algorithm was evalu-
ated under various random traffic patterns as well as exponentially distributed
traffic fitted to the communications of real applications. Chandra et al. pro-
posed a methodology to size the fifos along a communication path [11]. They
identified the various parameters that impact fifo sizing, among which the
production rate, consumption rate and data burst size are the most important.
Ogras et al. also pointed out in [22] the fact that the NoC buffer sizing prob-
lem suffers from critical issues such as realistic traffic patterns. In this work,
we have presented our proposition for having more realistic traffic patterns as
well as an application of this to the buffer sizing problem.
6 Conclusion
We have studied both the presence and the impact of long-range dependence
in the scope of networks-on-chip. Our results show firstly that long-range
dependence is not a ubiquitous property of the traffic produced by on-chip
processors running multimedia applications. Secondly, using cycle-accurate
simulation of a complete SoC, we have shown that the low level protocol used
by ips to communicate has an important impact on the possible performance
degradation due to the presence of long-range dependence. In the case of non-
split transaction, lrd is likely to have a very small impact on performance
while in the case of split transaction, lrd greatly affects performance. We have
also shown how the NoC designer can use our traffic generation environment
to optimize the NoC for a particular SoC platform.
References
[1] Soclib simulation environment, On line: http://soclib.lip6.fr/ (2005).
[2] Independent jpeg group, On line: http://ijg.org/ (2006).
[3] J2000 - a jpeg2000 codec, On line: http://j2000.almacom.com/ (2006).
[4] libmad - mpeg audio decoder library, On line: http://www.underbit.com/
products/mad/ (2006).
[5] The libmpeg2 library page, On line: http://libmpeg2.sourceforge.net/
(2006).
[6] Wand network research group, On line: http://wand.cs.waikato.ac.nz/
wand/wits/ (2006).
[7] P. Abry, D. Veitch, Wavelet analysis of long-range dependent traffic, IEEE
Transactions on Information Theory 44 (1) (1998) 2–15.
[8] J. Bardet, G. Lang, G. Oppenheim, A. Philippe, M. Taqqu, Long-Range
Dependence: Theory and Applications, chap. Generators of long-range
dependent processes: A survey, Birkhäuser, 2003, pp. 579–623.
[9] T. Bjerregaard, S. Mahadevan, A survey of research and practices of network-
on-chip, ACM Comput. Surv. 38 (1) (2006) 1.
[10] P. Brockwell, R. Davis, Time Series: Theory and Methods, 2nd edn., Springer
Series in Statistics, New York, 1991.
[11] V. Chandra, A. Xu, H. Schmit, L. Pileggi, An interconnect channel design
methodology for high performance integrated circuits, in: DATE 04, 2004.
[12] A. Erramilli, O. Narayan, W. Willinger, Experimental queueing analysis with
long-range dependent packet traffic, ACM/IEEE transactions on Networking
4 (2) (1996) 209–223.
[13] N. Genko, D. Atienza, G. D. Micheli, J. M. Mendias, R. Hermida, F. Catthoor,
A complete network-on-chip emulation framework, in: DATE 05, 2005.
[14] A. Greiner, P. Guerrier, A generic architecture for on-chip packet-switched
interconnections, in: Design, Automation and Test in Europe, 2000.
[15] J. Hu, R. Marculescu, Application-Specific Buffer Space Allocation for
Networks-on-Chip Router Design, in: IEEE/ACM Intl. Conf. on Computer Aided
Design, San Jose, CA, 2004.
[16] R. Jain, The Art of Computer Systems Performance Analysis, John Wiley and
Sons, Inc., 1991.
[17] K. Lahiri, A. Raghunathan, S. Dey, Evaluation of the traffic performance
characteristics of system-on-chip communication architectures, in: Int. Conf.
VLSI Design, 2001.
URL citeseer.nj.nec.com/lahiri01evaluation.html
[18] W. E. Leland, M. S. Taqqu, W. Willinger, D. V. Wilson, On the self-similar
nature of Ethernet traffic (extended version), ACM/IEEE transactions on
Networking 2 (1) (1994) 1–15.
[19] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, R. Zafalon, Analyzing on-chip
communication in a mpsoc environment, in: DATE 04, 2004.
[20] S. Mahadevan, F. Angiolini, M. Storgaard, R. G. Olsen, J. Sparsø, J. Madsen, A
network traffic generator model for fast network-on-chip simulation., in: DATE
05, 2005.
[21] C. S. L. of Paris VI, Soclib simulation environment, On line: http://soclib.
lip6.fr/ (2006).
[22] U. Y. Ogras, J. Hu, R. Marculescu, Key research problems in noc design: A
holistic perspective, in: Intl. Conf. on Hardware/Software Codesign and System
Synthesis, 2005.
[23] K. Park, W. Willinger (eds.), Self-Similar Network Traffic and Performance
Evaluation, John Wiley & Sons, 2000.
[24] V. Paxson, S. Floyd, Wide-area traffic: The failure of Poisson modeling,
ACM/IEEE transactions on Networking 3 (3) (1995) 226–244.
[25] S. G. Pestana, E. Rijpkema, A. Rădulescu, K. Goossens, O. P. Gangwal, Cost-
performance trade-offs in networks on chip: A simulation-based approach, in:
DATE 04, 2004.
[26] A. Scherrer, A. Fraboulet, T. Risset, Automatic phase detection for stochastic
on-chip traffic generation, in: CODES+ISSS, Seoul, South Korea, 2006.
[27] A. Scherrer, A. Fraboulet, T. Risset, Generic multi-phase on-chip traffic
generator, in: ASAP, Steamboat Springs, USA, 2006.
[28] A. Scherrer, N. Larrieu, P. Borgnat, P. Owezarski, P. Abry, Non gaussian and
long memory statistical characterisations for internet traffic with anomalies,
IEEE Transactions on Dependable and Secure Computing (TDSC), to appear.
[29] R. Thid, K. Sander, A. Jantsch, Flexible bus and noc performance analysis
with configurable synthetic workloads, in: 9th Euromicro Conference on Digital
System Design (DSD 2006), 2006.
[30] G. Varatkar, R. Marculescu, On-chip traffic modeling and synthesis for mpeg-2
video applications., IEEE Transactions on Very Large Scale Integration (VLSI)
Systems 12 (1) (2004) 108–119.
[31] D. Wiklund, S. Sathe, D. Liu, Network on chip simulations for benchmarking.,
in: IWSOC, 2004.
