Characterization and modeling of multicast communication in
cache-coherent manycore processors
Sergi Abadal a,∗, Raúl Martínez b,1, Josep Solé-Pareta a, Eduard Alarcón a, Albert Cabellos-Aparicio a
a NaNoNetworking Center in Catalonia (N3Cat), Universitat Politècnica de Catalunya, Barcelona, Spain
b Oracle Labs, Oracle Corporation, Vancouver, BC, Canada
∗ Corresponding author. Tel.: +34 662086489.
E-mail address: abadal@ac.upc.edu (S. Abadal).
1 Raúl Martínez was working for INTEL at the INTEL Barcelona Research Center when the main ideas of the paper were developed.
http://dx.doi.org/10.1016/j.compeleceng.2015.12.018
Article info
Article history:
Received 10 May 2015
Revised 23 December 2015
Accepted 23 December 2015
Available online xxx
Keywords:
Manycore processors
Multicast
Broadcast
On-chip traﬃc analysis
Network-on-chip
Scalability
Abstract
The scalability of Network-on-Chip (NoC) designs has become a rising concern as we en-
ter the manycore era. Multicast support represents a particular yet relevant case within
this context, mainly due to the poor performance of NoCs in the presence of this type of
traﬃc. Multicast techniques are typically evaluated using synthetic traﬃc or within a full
system, which is either simplistic or costly, given the lack of realistic traﬃc models that
distinguish between unicast and multicast ﬂows. To bridge this gap, this paper presents a
trace-based multicast traﬃc characterization, which explores the scaling trends of aspects
such as the multicast intensity or the spatiotemporal injection distribution for different
coherence schemes. This analysis is the basis upon which the concept of multicast source
prediction is proposed, and upon which a multicast traﬃc model is built. Both aspects pave
the way for the development and accurate evaluation of advanced NoCs in the context of
manycore computing.
© 2015 Elsevier Ltd. All rights reserved.

1. Introduction
In the ever-changing world of microprocessor design, multicore architectures are currently the dominant trend for both
conventional and high-performance computing. Chip Multiprocessors (CMPs) resulting from the interconnection of several
processing cores were conceived to overcome the complexity and power scalability hurdles of processors with a single CPU;
however, the scalability concerns have now migrated to facets such as memory management, programmability or the limits
of parallelism as the core count increases.
Inherent parallelism limits aside, these scalability concerns are generally dependent on the architecture or programming
model of choice. A long-running debate has brought up strong arguments for the adoption of two widely-known models
in manycore CMPs: shared memory and message passing. Shared memory provides remarkable programmability and com-
patibility with legacy code. However, its scalability is arguably limited by performance and architectural complexity issues
related to data consistency. On the contrary, message passing offers unique validation and hand-tuned performance bene-
fits, which come at the cost of placing an increasingly heavy burden upon the programmer. The differences between these
two extremes contrast with one common point: most of the scalability issues are tightly coupled to on-chip communication limitations. Due to this, the research focus in multiprocessors has gradually shifted from how cores compute to how cores communicate.

Fig. 1. Simulation-based multicast traffic characterization and modeling.
The limited scalability of conventional Networks-on-Chip (NoCs) in the presence of global and multicast traﬃc repre-
sents an important constraint to multiprocessor architects and programmers. In shared-memory multiprocessors, cache co-
herence is the main source of on-chip communication and is generally maintained through directory-based protocols that
limit the use of multicast to the invalidation of cache blocks on a shared write. This reduces the injection of multicast traf-
ﬁc, yet at the cost of non-scalable area and energy overheads required to track the sharers of the data. In message passing
systems, where communication is explicitly set by the programmer, the use of collective communication routines such as
MPI_Bcast or MPI_Allgather is often avoided. This, however, may lower the maximum achievable performance and in-
crease the complexity of parallel programming over message passing even further. Also, the lack of proper multicast support
hampers the development of novel programming models and manycore systems that may be multicast-intensive [1].
Given that multicast and broadcast may become a critical factor guiding the design of future manycore CMPs, there is
a need to understand how the characteristics of such traﬃc will scale with the number of cores. Providing accurate mul-
ticast traﬃc characterization in different scenarios would be useful for the early-stage design and evaluation of NoCs in
general and multicast mechanisms in particular. However, to the best of the authors’ knowledge, no tools are available for
the analysis and modeling of multicast traﬃc. Different works have characterized or modeled inter-processor communica-
tion in moderately-sized shared memory processors (see [2] and references therein), as well as in message passing clusters
or supercomputers [3–5]. However, none of them has analyzed how collective communication scales for a wide set of repre-
sentative applications and architectures at the CMP scale. Also, existing traﬃc models do not differentiate between unicast
and multicast ﬂows and only offer data for a given system size.
In this paper, we aim to address these issues in one of the contending programming models (shared memory) by per-
forming a scalability-oriented multicast traﬃc characterization. Our contribution and methodology is summarized in Fig. 1.
We ﬁrst analyze traces of widely known SPLASH-2 and PARSEC benchmark applications running over a set of existing cache
coherence alternatives and for different processor sizes. Since scarce data is available beyond 64 cores, we additionally per-
form an exploratory extension of the study up to 512 cores whenever possible. Although most of these applications were
not built to scale to thousands of processors [6,7], the analysis of their multicast traffic patterns may still be useful to ex-
trapolate the behavior of future scalable shared-memory applications. We later employ the results of the characterization
process to (1) propose and evaluate the potential of multicast source prediction, and (2) create a synthetic traﬃc generator
that faithfully models multicast communication and that can be used to generate realistic mixed traﬃc proﬁles. With this,
we aim to trigger further research in the area of multicast support for manycore systems.
The paper is a direct extension of previous work by the authors [2]. The original contribution is augmented as follows:
• The analysis now includes results regarding Token Coherence [9], as well as directory-based coherence with imprecise
tracking up to 512-core CMPs.
• The concept of multicast source prediction is presented and evaluated, placing emphasis on its potential applicability at
the NoC design level.
• The characterization results are used to implement a multicast traﬃc model and propose an appropriate synthetic traﬃc
generator.
The remainder of this paper is organized as follows. In Section 2, we provide background on cache coherence and on-chip network-
ing which may further motivate this work and be useful to understand its results. Within the same section, we present re-
lated work in NoC traﬃc analysis and modeling. Section 3 details the characterization methodology used to obtain multicast
communication traces and to analyze them. The results of the multicast traffic characterization are presented in Section 4, which are later used in Section 5 to propose and evaluate multicast source prediction and in Section 6 to create and validate a multicast traffic model. Section 7 concludes the paper.
2. Background and related work
This section seeks to further motivate this work by explaining why it is necessary to analyze unicast and multicast traﬃc
separately (Section 2.1); by providing background on the relation between cache coherence and the characteristics of the multicast traffic, in an attempt to both explain the importance of multicast in shared-memory manycore chips and justify
the choice of protocols for analysis (Section 2.2); and by detailing in which aspects this paper differs from related work in
multicast traﬃc analysis and modeling (Section 2.3).
2.1. Serving multicast traﬃc in NoCs
The nature of on-chip interconnects has changed with the number of processors. Buses are feasible for a few cores and, in them, all communications are inherently multicast (broadcast, in fact). As more cores are integrated within a single chip,
though, the interconnect design shifts to the NoC paradigm which is point-to-point in nature. Due to this, NoCs require that
multicast packets be replicated and delivered to each of the intended destinations. When dense multicasts and broadcasts go
through this process, the performance of the network may suffer a severe drop due to both the delay incurred by the packet
replication and the contention associated with these additional messages. The performance loss is generally proportional to the
multicast intensity and to the number of destinations per message for a ﬁxed network size.
The way packets are treated in conventional NoCs has been the subject of different studies. The simplest approach is to
generate and inject one unicast message per destination. This process is performed at the source network interface (NIF)
and, besides being highly power-ineﬃcient, implies a potentially large serialization delay and increases contention around
the transmitting node. To alleviate these effects, two in-network multicast support techniques have been explored. On the
one hand, path-based multicast [10] relies on the transmission of a number of messages which travel around distinct regions
of the chip and are replicated and delivered at the NIF of each of the destinations. On the other hand, tree-based multicast
[11] requires the injection of a single message, which is replicated at the intermediate routers following a virtual tree until it
reaches the intended destinations. In general, path-based methods are more contention-aware, whereas tree-based methods
incur lower latency.
Another way to reduce the impact of multicast communications is to improve the connectivity of the network or to use globally shared media. This can be done with traditional RC wires and buses [13] or with emerging interconnect
technologies such as 3D stacking [10], nanophotonics [14] or wireless RF [1].
Regardless of the approach taken, multicast support proposals have been tested either using synthetic traffic or within full-
system simulations, generally assuming a ﬁxed network size. Consequently, their impact upon the network performance is
imprecise and their scalability remains largely unknown. In this paper, we aim to set the foundations of effective NoC testing
for different network sizes and for a set of representative and realistic cases.
2.2. Cache coherence and multicast traﬃc
The modest performance of conventional NoCs in the presence of dense multicast and broadcast communication has
guided the design of shared memory multiprocessors through the years, encouraging the adoption of memory architectures
and cache coherence protocols that avoid such type of traﬃc. As a result, traditional snoopy coherence schemes gave way to
directory-based coherence. The former requires an ordered broadcast mechanism, whereas the latter relies on an entity (the
directory), which serves as an ordering point and coordinates, through point-to-point messages, the memory transactions
among the involved processors. As shown in Fig. 2, these represent the two extremes of cache coherence which trade off
architectural cost against interconnect cost.
In directory-based coherence, the use of multicast is generally limited to the invalidation of cache blocks on a shared
write. To send invalidation messages only to the sharers of that block, the directory needs to provide precise sharer track-
ing. This implies having one bit per core for each cache block to store that information. Obviously, this scheme does not
scale well as storage requirements skyrocket for hundreds of cores. To relax these constraints, one can limit the number of
tracked sharers and send broadcasts to invalidate heavily shared blocks. An extreme case of imprecise tracking could be the
protocol implemented by AMD HyperTransport [16], which is stateless and directly broadcasts all requests. However, broad-
cast delivery does not need to be ordered as in snoopy protocols, and responses are still unicast. Other alternatives such as Token Coherence [9] or Destination Set Prediction [15] propose advanced techniques to further limit their architectural or communication costs.

Fig. 2. Representation of the cache coherence design space, inspired by the work in [15].
As mentioned above, the adoption of multicast-demanding methods is hampered by the performance of NoCs and their
lack of ordering. Performance penalties are signiﬁcant already at 64 cores and are expected to dramatically increase with
the core count [11], progressively cornering the coherence solution space far from the ideal area. This further motivates
the pursuit of improved multicast and broadcast support in NoCs, indirectly highlighting the importance of this work as a
vehicle for the evaluation of future interconnect fabrics in realistic conditions.
2.3. Related work in traﬃc characterization and modeling
The driving motivation behind traﬃc characterization is the need for a cost-effective (more than using real traces) but
accurate (more than using synthetic traﬃc) way to evaluate networks. In NoCs, traﬃc characterization can be performed by
analyzing communication traces obtained from full-system simulations. Multiprocessor benchmarks such as SPLASH-2 [6]
and PARSEC [7] are commonly used to serve this purpose in shared memory architectures. Other benchmark suites such as
NAS could be employed in message passing systems more oriented to supercomputing [3].
Traﬃc characterization: Shared memory – On-chip traﬃc generated by shared-memory architectures has been analyzed
in a wide variety of settings. Our previous work in [2] contains a comprehensive view of such characterization efforts,
including the seminal papers that explore the SPLASH-2 and PARSEC benchmarks [6–8]. Their main drawback, though, is
that those works do not distinguish between unicast and multicast in most cases. In fact, the few existing explicit multicast
analyses are thus far limited to very speciﬁc use cases. Here, instead, we perform a complete characterization of multicast
traﬃc for different representative cases using a uniﬁed approach and, for the ﬁrst time, up to 512 cores to inspect their
scalability.
Traﬃc characterization: Message passing – In message passing, simpler steps can be taken to analyze traﬃc due to the
explicit nature of communication. One can obtain a traﬃc characterization by looking into the types of communication
routines used within a given program and projecting them into a given target system. The communication routines can be
point-to-point (e.g. MPI_Send) or collective (e.g. MPI_Allgather), which may directly involve multicast and broadcast
patterns. By breaking down these operations into messages and aggregating all contributions, metrics such as the multicast
intensity or the number of destinations per message could be obtained. In the literature, several works have analyzed the
communication primitives of different parallel algorithms. In [4], collective communication patterns and their impact on
scalability are analyzed. The communication characteristics of NAS benchmarks and other scientiﬁc applications have been
also extensively evaluated in [3,5] including information on collective routines.
Traﬃc modeling – As mentioned above, traﬃc analysis also enables the faithful yet simple modeling of traﬃc for NoC
evaluation. First proposals in this regard consider that three dimensions are enough to model the injection of traﬃc in
NoCs of different benchmarks and architectures [17]: the degree of temporal burstiness resulting from the widely proven
self-similarity of NoC traﬃc, the degree of concentration in the spatial injection distribution, and the hop distribution that
models the probability of a packet going through a given number of hops as a function of the NoC topology or how a given
application is mapped onto the processor.
When modeling traﬃc using such methods, no distinction is made between unicast and multicast ﬂows, and a uniﬁed
model is used. This may be acceptable from a behavioral perspective only if the NoC treats each multicast as a set of in-
dependent unicast packets. In contrast, both types of traﬃc may have to be modeled separately and then interleaved in
path-based or tree-based multicast schemes, since the network behavior changes with the type of packet. Such a heteroge-
neous approach has been adopted in works like [10], which evaluates a path-based multicast routing method using mixed
traﬃc proﬁles. Our work builds upon this premise, sustaining that the same model should not be used for both proﬁles
as they may have fundamentally different sources and, therefore, different characteristics. Actually, we aim to improve the
quality of existing models by focusing on the less-explored area of multicast characterization.
An alternative traﬃc modeling methodology based upon full-system simulations is presented in [18]. Instead of relying
blindly on the temporal and spatial information given by the traces, their approach uses additional information on the type
of communication (e.g. purpose) to establish dependencies between messages and to determine their typology. Therefore, this
method contemplates the possibility of distinguishing between multicast and the rest of traﬃc.
3. Methodology
The main objective of this work is to perform a multicast traﬃc analysis targeting NoC scalability. To this end, we employ
the methodology summarized in Fig. 1. Basically, we simulate different architectures running different benchmarks in order
to obtain a set of communication traces. These traces are then parsed with the aim of extracting a set of statistics, which
can be further processed and graphically visualized. Such a characterization can later be used to model realistic multicast traffic (see Section 6), making it possible to evaluate multicast routing schemes in practical scenarios without having to resort again to
full-system simulation.
It is worth noting that, even though the methodology is used here to study shared-memory systems, a similar workﬂow
could be used to obtain the multicast traﬃc characteristics of message-passing applications provided that the appropriate
tools are available.
Table 1
Simulation parameters.

Parameter              GEM5                Graphite
Number of cores        4–64                16–512
Coherence protocols    MESI, HT, TokenB    MSI
Max sharers tracked                        2–64
Benchmarks             PARSEC, SPLASH-2
L1 cache (I&D)         32+32 KB, 2-way, 2 cycles, 64 B lines
L2 cache               512 KB/core, 8-way, 10 cycles, 64 B lines
NoC topology           Fully connected

3.1. Simulation tools
To capture traﬃc in a cache-coherent processor, a simulation of the whole memory hierarchy is required, since L1–L2
interactions are the main determinants of this traﬃc. Unlike in our previous work, we use two different tools to obtain the
communication traces. On the one hand, we use GEM5, a widely used open-source framework for cycle-accurate full-system
simulation [19]. Due to its depth and accuracy, GEM5 admits up to 64 cores with up to three levels of cache and, out of
the box, includes different cache coherence schemes such as MESI, HyperTransport (HT) or TokenB (an implementation of
Token Coherence). On the other hand, we use Graphite [12], an architectural simulator developed by MIT that employs
more relaxed functional and performance models, allowing it to simulate large systems (hundreds and even thousands of
cores) with reasonable accuracy. Currently, Graphite models MSI coherence with a variety of directory types.
In order to obtain custom statistics and traces oriented to multicast traﬃc analysis, we slightly modiﬁed both simulators.
In GEM5, the NIF modules now capture the time of arrival, origin, destinations, type and size of each multicast packet that is to be injected into the NoC. This way, the results are independent of the multicast routing strategy. NIFs are not modeled as
such in Graphite; therefore, we identify points where multicast packets are generated (i.e. invalidations) and register relevant
information for each of them. Note that, due to differences in terms of simulated instruction set and internal models, we
will not directly compare the output of both schemes. Instead, we will use the results from Graphite to complement the
scalability analysis performed with GEM5. The main aim is to provide trends rather than exact ﬁgures beyond 64 cores; and,
for that purpose, Graphite is an adequate tool despite the loss of accuracy with respect to GEM5.
3.2. Architecture under study
Table 1 shows a summary of the simulation parameters used in the study. We assume a common tiled conﬁguration,
wherein each tile comprises a core, private instruction and data caches, and a slice of a shared L2 cache. The main variables
of the study are the multiprocessor size, which ranges from 4 to 64 tiles in GEM5 and from 16 to 512 tiles in Graphite,
and the coherence mechanism. We consider three different protocols, namely the MSI/MESI, HT and TokenB schemes mentioned in Section 2.2. In the first case, we take the maximum number of sharers per block as a variable in order to study the impact
of imprecise tracking on the number of destinations per multicast.
From the network perspective, it is important to note that we consider an ideal fully connected NoC where the delay
introduced by the network is fixed and independent of the source and destination(s). This approach has been proposed in
[18] to minimize the impact of a speciﬁc topology or routing algorithm upon the traﬃc characteristics, which essentially
depend on the cache coherence scheme. The resulting characterization is a base upon which models can be later obtained
for a wide range of NoC conﬁgurations. Additionally, this reduces the complexity of the large-scale simulations and the
impact of inaccurate network models in Graphite.
PARSEC and SPLASH-2 benchmarks are simulated in their entirety whenever possible, always using input sets large
enough to scale the workload up to hundreds of cores. In all cases, statistics and traces are only collected within the region
of interest, and results are only shown for those applications that were successfully executed both in GEM5 and Graphite.
4. Characterization results
Next, we present the results of the multicast traﬃc analysis and discuss the possible implications of each explored char-
acteristic on the NoC design.
4.1. Message type and size
The architecture of a multiprocessor is the main factor that determines the methods that will trigger the transmission of
multicasts. Therefore, the size of these messages can be easily inferred from it and, as further explained in [2], taken into
account in the NoC design process.
In the MESI coherence protocol, multicast messages are mostly invalidations which are generated upon a write to shared
data and sent to the cores that are currently sharing it. Invalidations are short control messages and account for more than 99% of the multicasts on average, regardless of the system size (rarely dropping below 96%).

Fig. 3. Number of multicast messages per 10^6 instructions with the MESI, HT and TokenB protocols (GEM5 traces) and number of invalidations per 10^6 instructions with the MSI protocol (Graphite traces), as a function of the processor size. Note the logarithmic scale.

In the remaining cases, multicasts are
long data responses sent to the main memory and to a set of caches after reading an invalidated block. These replies are much less frequent and their size corresponds to the cache line size plus a given overhead. In the HT protocol, the percentage of long multicast messages is expected to be even lower since not only invalidations, but all control requests, are broadcast. Actual results, however, show a size distribution similar to that of MESI. Finally, in the case of TokenB, short requests represent the
totality of multicast messages. These are used to collect the tokens required to perform a shared write, to request a value
from a remote cache, or to avoid starvation situations by means of persistent requests.
4.2. Multicast traﬃc intensity
The number of multicast messages per instruction is a measure of the multicast intensity. It is loosely dependent on
the choice of NoC, and basically determined by the multiprocessor architecture, as it deﬁnes the methods that generate
multicast messages; and the application, which deﬁnes the sharing structures and memory intensity.
Fig. 3 plots the number of injected multicast messages per one million instructions. It is observed that applications gen-
erally become more multicast intensive as the number of cores grows: note the steep increases of particular cases such as lu.
In comparative terms, TokenB shows the largest multicast requirements, around one order of magnitude above the require-
ments of HT coherence and two orders of magnitude larger than MESI. Although such increase is application-dependent
and does not follow a common scaling trend, fitting methods on the average values yield a roughly logarithmic rela-
tion between multicast intensity and number of cores. The scalability of the communication-to-computation ratio, which is
generally a function of the square root of the number of processors [6], may explain such tendency.
To provide hints on the evolution of this metric beyond 64 cores in directory-based settings, we used Graphite to ob-
tain the number of multicast invalidations per instruction in MSI. The bottom plot of Fig. 3 shows that such a metric keeps
increasing beyond 64 cores and, even though results are not directly comparable with those obtained with GEM5 due to
differences in the instruction set and protocol, this conﬁrms the increasing importance of multicast in manycore processors.
4.3. Number of destinations
The number of destinations per message is a metric that mainly depends upon the multiprocessor architecture. In cases
such as HT or TokenB, the coherence protocol issues a broadcast in most of the transactions. Given that a very large per-
centage of the multicast messages are due to coherence, the average number of destinations is around N − 1 in an N-core
system in those cases and regardless of the target application.
In invalidation-based protocols such as MSI/MESI, the sharing structures of each speciﬁc application play an important
role in determining the number of destinations per multicast message, as they define the number of sharers to invalidate in each shared write. The left plot in Fig. 4 shows the number of destinations per multicast message assuming perfect tracking. It can be observed that the number of destinations increases linearly with the system size, and that some 64-threaded applications involve more than 16 destinations per multicast on average.
Another aspect that deﬁnes the number of destinations is the maximum number of sharers that the directory can pre-
cisely track. To evaluate this and to have a reference on how multicast traffic will scale beyond 64 cores, we used Graphite to obtain the number of sharers per invalidation assuming different directory capacities. Results are shown in the
right plot of Fig. 4 and show that the number of destinations not only keeps increasing with the system size, but also sig-
niﬁcantly grows as the number of tracked sharers per cache block decreases. This is because an increasing number of shared
variables will have to be invalidated through broadcast.
Fig. 4. Number of destinations per multicast message in directory-based coherence. The left plot shows the minimum, average, and maximum of the per-application average assuming precise tracking up to 64 cores, whereas the right plot projects the number of sharers per invalidation with imprecise tracking beyond 64 cores.
Fig. 5. Percentage of delivered flits due to multicast communication for (a) MESI, (b) HT, and (c) TokenB. Labels indicate the percentage of communication transactions that are multicast.

The importance of this metric lies in its impact on the overall bandwidth requirements. While the number of mul-
ticast transactions increases with the number of cores, they only represent a fraction of all the communication transactions
(see labels in Fig. 5). Less than 0.5% of all the transactions are multicast in MESI; in HT, the ratio is higher but it decreases
sharply with the number of cores (from 12% to 1.5%); ﬁnally, broadcasts increase up to around 24% in TokenB. In contrast,
as shown in Fig. 5, the amount of delivered (ejected) ﬂits belonging to multicast transactions consistently increases due to
the in-network replication of each multicast message. This metric grows above 2% in MESI with perfect tracking, and would
exceed 4% and 7% with 8 and 3 maximum sharers, respectively. In multicast-intensive schemes, the percentage of ejected
ﬂits due to multicast dramatically increases up to 50% in HT and goes beyond 85% in TokenB. This fact further encourages
the employment of shared-medium NoCs, if feasible, to eﬃciently serve multicast traﬃc.
4.4. Spatial distribution
An interesting aspect to investigate is the spatial distribution of the multicast traﬃc injection. Results in this regard may
be useful for the identiﬁcation of potential hotspots and could be employed to optimize the underlying NoC by, for instance,
applying priority policies. To evaluate the spatial distribution, we calculated the coeﬃcient of variation (CoV) as cv = σ/μ,
where σ and μ are the standard deviation and mean of the multicasts injected by each node. We chose this metric in order
to measure dispersion while filtering out the dependence of the standard deviation on the overall number of injected
messages. A higher CoV means a higher concentration of the multicast injection over given cores.
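For concreteness, the CoV computation reduces to a few lines. The sketch below is an illustration (not the authors' tooling) that assumes a per-node vector of injected multicast counts extracted from the traces:

```python
import numpy as np

def injection_cov(injections_per_node):
    """Coefficient of variation cv = sigma/mu of the number of
    multicasts injected by each node."""
    counts = np.asarray(injections_per_node, dtype=float)
    return counts.std() / counts.mean()

# Hypothetical 4-node examples: a hotspot versus a balanced injection.
print(injection_cov([900, 40, 35, 25]))     # high CoV: injection concentrated
print(injection_cov([250, 240, 255, 255]))  # low CoV: injection balanced
```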
Fig. 6 plots the CoV of each application in the target systems, as well as the average over all the applications. The
CoV grows steadily with the number of cores in an application-dependent manner: results have been in fact sorted in
descending order based on the absolute growth of the imbalance. This implies that applications that appear ﬁrst may yield
more pronounced imbalance in manycore processors and would especially beneﬁt from NoC designs that eﬃciently handle
hotspot situations. From the average behavior, it is observed that MESI shows a higher imbalance in general terms. This is
because applications heavily based on producer–consumer patterns rely on a few producers, which are the main sources of
multicast traﬃc in MESI. The number of producers does not necessarily scale with the system size, aggravating the hotspot
behavior. Such assumptions are not valid in HT or TokenB, since broadcast requests come from a wider base of cores. In
such cases, the spatial distribution provides insight about the general memory activity of different processors: cores that
frequently access shared data will generate more broadcasts than those that do not.
4.5. Temporal distribution
In order to accurately model any kind of traffic, it is crucial to have complete knowledge of its temporal distribution. As
shown in Section 2, related works have shown that on-chip traﬃc is self-similar given the long-range dependency between
arrivals, i.e. the generation of new messages is dependent on the delivery of prior messages.

Fig. 6. Coefficient of Variation (CoV) of the spatial injection distribution of multicasts for (a) MESI, (b) HT, (c) TokenB, and (d) the averages. Applications are sorted in descending order based on the difference between CoV(256)/CoV(64) and CoV(4).

Table 2
Hurst exponents (geometric mean).

Number of cores    4        16       64
MESI               0.8955   0.9152   0.9433
HT                 0.9140   0.9254   0.9450
TokenB             0.8350   0.8788   0.9456

This creates memory effects and implies a given burstiness at the transmitting end, a property that is widely known to have a negative impact on network
performance. Provided that multicast traﬃc is a subset of the on-chip traﬃc, it is reasonable to deduce that multicasts will
also exhibit self-similarity, albeit not necessarily with the same intensity. In order to confirm this fact, we calculated the Hurst exponent H (0.5 < H ≤ 1) by applying the R/S plot method [17] to the temporal information of the full-system traces.
In light of the results of Table 2 and given that an H value close to 1 denotes strong self-similarity, it can be concluded
that multicast traﬃc is self-similar and that burstiness generally increases with the core count. Also, note that the NoC has
an impact upon the value of H, although similar results are expected for most NoC implementations since burstiness stems
from memory effects inherent to the application level. For instance, our previous work assumes a mesh NoC and yields
slightly lower Hurst exponents in almost all cases. We refer the reader to [2] for more details.
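As an illustration of the estimation procedure, the following sketch computes a rough Hurst exponent with the rescaled range (R/S) method over a per-cycle injection count series. The window choices and fitting range are our assumptions; the actual R/S plot analysis in [17] may differ in detail.

```python
import numpy as np

def hurst_rs(series, min_window=8):
    """Rough Hurst exponent estimate: slope of log(R/S) vs log(window size)."""
    x = np.asarray(series, dtype=float)
    sizes, rs_means = [], []
    n = min_window
    while n <= len(x) // 2:
        rs_vals = []
        for start in range(0, len(x) - n + 1, n):
            block = x[start:start + n]
            dev = np.cumsum(block - block.mean())
            r = dev.max() - dev.min()   # range of cumulative deviations
            s = block.std()             # standard deviation of the block
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            sizes.append(n)
            rs_means.append(np.mean(rs_vals))
        n *= 2
    slope, _ = np.polyfit(np.log(sizes), np.log(rs_means), 1)
    return slope

# Example with a hypothetical per-cycle injection count series.
rng = np.random.default_rng(0)
print(hurst_rs(rng.poisson(1.0, 65536)))  # uncorrelated traffic -> H near 0.5
```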
4.6. Spatiotemporal correlation
Spatial and temporal analyses performed above yield two independent characterizations of the injection process: ﬁrst,
on the generic probability of any node transmitting and, second, on the probability of any node transmitting shortly after
any other node. The potential correlation between both aspects could provide further insight on, for instance, how easy it is to determine that a given node X will transmit shortly after a transmission of another given node Y. While correlation does not necessarily imply causality between both transmissions (the message from X may not be triggered by the message from Y), it is highly valuable information when designing predictive strategies for NoCs. We refer the reader to Section 5 for
more details on NoC-related prediction.
To evaluate spatiotemporal correlation, we consider transmissions separated by less than a given time period τ. If X = Y, we have a potential source of autocorrelation, whereas if X ≠ Y, we are facing a case of crosscorrelation. The choice of τ
depends on several factors as further explained in [2]. Two interesting correlation metrics can be obtained from this analysis.
First, we evaluate the degree of correlation of multicast transmissions. We obtain these values by marking the second
transmitter of a correlated pair and, at the end of the execution, counting the number of marked transmissions. In MESI,
results beyond 64 cores are approximate and given as a scalability reference. Fig. 7 shows the correlation distribution of
multicast transmissions assuming two different τ values. It is observed that the percentage of crosscorrelated transmissions
grows with the number of cores, especially in the case of MESI and TokenB, and that autocorrelation levels are generally low.
Note that these ﬁgures represent the geometric mean of all the applications and, therefore, the correlation percentages may
take higher/lower values: radix and canneal, for instance, show crosscorrelation levels above the average; bodytrack yields
lower correlation.

Fig. 7. Degree of correlation of multicast transmissions with different τ values: (a) MESI, (b) HT, (c) TokenB for τ = 10·TCLK; (d) MESI, (e) HT, (f) TokenB for τ = 50·TCLK.

Fig. 8. Factor of predictability and percentage of crosscorrelation with different τ values: (a) MESI, (b) HT, (c) TokenB for τ = 10·TCLK; (d) MESI, (e) HT, (f) TokenB for τ = 50·TCLK.

Since a high percentage of correlated traffic could imply not only subpar NoC performance, but also a great opportunity to improve it by means of predictive strategies, results suggest that such mechanisms will gain importance as
the number of cores increases. Finally, it is observed that τ impacts upon the percentage of correlated transmissions, but
not much upon its scalability with respect to multiprocessor size.
Another interesting aspect to investigate is the strength of the crosscorrelation between any two pairs of nodes, in an
attempt to quantify the predictability of the source of correlated transmissions. To this end, we deﬁne the predictability of
each node X as:
\mathrm{pred}_X = \frac{\max_{i \neq X} N_{Xi}}{\sum_{i \neq X} N_{Xi}}, \qquad (1)
where N_{XY} is the number of transmissions of Y correlated to X. This metric captures how predictable the transmissions that happen shortly after a transmission by X are, since a low value means that crosscorrelation is spread over a set of cores, therefore complicating the prediction (0 if transmissions are not correlated). A high value indicates a strong correlation with
few cores (1 if correlation is deterministic). The factor of predictability between any two pairs is calculated as the weighted
average of the predictability of each core:
\mathrm{pred} = \sum_{i} \frac{\sum_{j \neq i} N_{ij}}{\sum_{k} \sum_{j \neq k} N_{kj}} \, \mathrm{pred}_i = \frac{\sum_{i} \max_{j \neq i} N_{ij}}{\sum_{i} \sum_{j \neq i} N_{ij}}. \qquad (2)
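Eqs. (1) and (2) can be computed directly from a matrix of correlated-transmission counts. The sketch below is a minimal illustration, assuming N[x][y] holds the number of transmissions of y correlated to x as defined above:

```python
import numpy as np

def predictability(N):
    """Per-node predictability (Eq. (1)) and overall factor of
    predictability (Eq. (2)) from the correlation-count matrix N,
    where N[x, y] is the number of transmissions of y correlated to x.
    Diagonal entries (autocorrelation) are excluded."""
    N = np.asarray(N, dtype=float).copy()
    np.fill_diagonal(N, 0.0)                 # keep i != j terms only
    row_sums = N.sum(axis=1)
    pred_x = np.divide(N.max(axis=1), row_sums,
                       out=np.zeros_like(row_sums), where=row_sums > 0)
    overall = N.max(axis=1).sum() / N.sum()  # weighted average, Eq. (2)
    return pred_x, overall

# Hypothetical 3-node example: node 0 is almost always followed by node 1.
per_node, overall = predictability([[0, 95, 5], [10, 0, 12], [7, 9, 0]])
print(per_node, overall)
```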
Fig. 8 shows the factors of predictability and correlation assuming two different τ values. Results beyond 64 cores are
approximate and given as a scalability reference. It is observed that both the crosscorrelation and predictability levels in-
crease with the number of cores in MESI and HT. The results in these cases support the hypothesis that predictive strategies have more potential in larger multiprocessors. The predictability is much higher for MESI, probably due to the clear dependence of multicast traffic on potentially predictable memory sharing patterns. In HT, the sources of multicast traffic
are more varied and the use of more sophisticated predictors would be required. In TokenB, the predictability is inversely
proportional to the number of cores, which suggests that the injection of multicast traﬃc behaves as a uniformly distributed
random variable and that predictive techniques would be useless in this case. Finally, it is worth noting that increasing the
observation window hardly affects the overall predictability. This could assist in the evaluation of the scope of a predictor,
i.e. which events to predict.
4.7. Phase behavior
Apart from investigating self-similarity, trace-based analysis may allow the study of the different phases found in paral-
lel applications. Phase changes affect a wide variety of metrics, including communication intensity, and recent literature
demonstrates that such phase behavior is predictable [20]. Multicast communication is also likely to be influenced by
the existence of phases within an application, and, therefore, the metrics presented above could be evaluated on a per-
phase basis. Our previous work [2] exempliﬁes this fact by showing how different multicast traﬃc metrics change with
application phases. The results therein suggest that (1) reconﬁgurable multicast techniques could be of use in the on-
chip scenario, and that (2) multicast prediction could improve if assisted by phase tracking techniques. Notwithstanding
this, phase behavior is not taken into consideration in the remainder of this article and will be explored in future work
instead.
5. Multicast source prediction
In computer architecture, prediction has been pervasively used as a tool to improve performance. The outcome of a
conditional branch, the value of certain variables in memory, the sharer set of certain cache lines, or the load at a given
NoC link are aspects that may show correlation in different situations due to, among other factors, the iterative nature
of computer programs. Predictive systems exploit such information to optimize the processor pipeline [21], the coherence
protocol [15] or the routing mechanism of the CMP [22].
In Section 4.6, we showed that multicast traﬃc in cache-coherent processors is highly correlated and potentially pre-
dictable. In particular, one can guess which will be the source of the next multicast message, information that could be
exploited by reconﬁgurable NoC designs. Here, we evaluate the accuracy of two simple predictors by running the different
sets of traces over a simulated environment. Then, we qualitatively discuss how NoC design could beneﬁt from multicast
source prediction.
5.1. Implementation
Fig. 9 shows an abstract representation of the basic architecture of a NoC-juxtaposed predictive system. Basically, the
predictor is attached to a local NoC component, which can be a NIF or its associated router. The communication between
the NoC and the predictor is two-way: the predictor reads the packets that go through the local component to take a guess
on future events and to validate previous predictions, whereas the local component reads the predictions and modiﬁes its
operation depending on a given policy.
In the case of multicast source prediction, the predictor extracts the source of each multicast packet and uses this infor-
mation to guess who will be the source of the next multicast. Both the header and the prediction are kept by the predictor
during a pre-deﬁned amount of time (i.e. observation window) and then discarded. If the next multicast transmission oc-
curs before the information is discarded, the predictor may update its table with the new source. This way, predictions over distant and potentially non-correlated events are avoided.

Fig. 9. Multicast source prediction scheme, with the detail of a static predictor (left box) and a last value predictor (right box).

The choice of the length of the observation window will depend
on how the impact of consecutive multicasts behaves over time.
The speciﬁc implementation of the predictor will depend on the nature of the multicast correlation and on the required
level of accuracy to obtain acceptable speedups. Here, we evaluate two simple and widely known predictors as depicted in
Fig. 9.
Static predictor (SP): This approach basically consists of the off-line proﬁling of code or traces that represent a given
application (e.g. the results obtained in Section 4.6), which is later used to statically assign a prediction to each possible
input. We will use the results of Section 4.6 to build the prediction table. This method, however, is highly application-
dependent and does not allow changing the predictions at runtime.
Last value predictor (LVP): The predictor consists of a buffer containing the last M multicast sources, which indexes a
table of N entries where N is the number of cores. In this work, we consider the most repeated source in an 8-slot buffer
as the index for the next guess, giving preference to the most recent transmissions in case of a tie. This way, guesses are
dynamically taken and updated at runtime. In order to further increase the accuracy, 2-bit saturating counters are associated
with each entry. These counters are incremented when predictions are correct and decremented otherwise, and predictions are only made effective if the value of the counter is ‘10’ or ‘11’, thus reducing the frequency of incorrect predictions.
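A possible realization of the LVP logic is sketched below. The buffer size, tie-breaking rule and confidence gate follow the description above, while the interface (predict/observe on a stream of observed sources) is our own assumption:

```python
from collections import deque, Counter

class LastValuePredictor:
    """Sketch of the last value predictor: an M-slot buffer of recent
    multicast sources indexes a table of 2-bit saturating counters; a
    prediction is only made effective when its counter is '10' or '11'."""

    def __init__(self, num_cores, buffer_slots=8):
        self.history = deque(maxlen=buffer_slots)  # last M multicast sources
        self.counters = [0] * num_cores            # 2-bit saturating counters

    def _raw_guess(self):
        if not self.history:
            return None
        counts = Counter(self.history)
        top = max(counts.values())
        # Most repeated source, preferring the most recent one on ties.
        for src in reversed(self.history):
            if counts[src] == top:
                return src

    def predict(self):
        guess = self._raw_guess()
        # Confidence gate: only cast the prediction if the counter is >= 2.
        return guess if guess is not None and self.counters[guess] >= 2 else None

    def observe(self, actual_source):
        guess = self._raw_guess()
        if guess is not None:                      # update the counter
            delta = 1 if guess == actual_source else -1
            self.counters[guess] = min(3, max(0, self.counters[guess] + delta))
        self.history.append(actual_source)
```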
5.2. Evaluation
To provide a performance evaluation of multicast source prediction, we inject the traces into a MATLAB script that accu-
rately simulates the aforementioned designs. For simplicity, we assume that the predictor has access to all multicast mes-
sages, which could be realizable assuming a globally shared medium [1]. The metrics used for the assessment are coverage,
which accounts for the number of predictions made effective over the number of events of interest; and accuracy, which
accounts for the number of correct predictions over the number of predictions cast and executed. Basically, a good predictor
should achieve a high accuracy without reducing the coverage.
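In trace-replay terms, the two metrics can be computed as below; this is a sketch, and the None-as-abstention interface is our assumption rather than the paper's:

```python
def coverage_and_accuracy(actual_sources, predictions):
    """Coverage: effective predictions over all events of interest.
    Accuracy: correct predictions over the predictions cast and executed.
    predictions[i] is the predicted source for event i, or None when the
    predictor abstained."""
    cast = [(p, a) for p, a in zip(predictions, actual_sources) if p is not None]
    coverage = len(cast) / len(actual_sources) if actual_sources else 0.0
    accuracy = sum(p == a for p, a in cast) / len(cast) if cast else 0.0
    return coverage, accuracy
```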
Fig. 10 shows the coverage and accuracy of SP and LVP averaged over all the SPLASH2 and PARSEC applications for
different CMP sizes and coherence protocols. We assume that the observation window is τ = 50 · TCLK . Consistently with
the results in Section 4.6, both SP and LVP show substantially better accuracy in MESI. In HT and TokenB, the use of the
static predictor is highly discouraged due to its low reliability. LVP, by means of the added conﬁdence counter, achieves
accuracies over 50%, yet with decreasing prediction coverage. Note, however, that the performance of the prediction
scheme strongly depends on the application: the accuracy of SP using HT is below 15% for 64-core canneal and above 86%
for 64-core cholesky, to cite an example.
These ﬁgures could be improved with more sophisticated predicting schemes like two-level predictors [21]. Also, as
implied in Section 4.7, phase detection can help predictors by limiting predictions to phases in which the predictor has
historically been more effective [20].
5.3. Possible uses in NoC
Thus far, prediction has been sparsely used in NoCs. The work in [23] proposes to speed up NoC routers by predicting the
output port of incoming ﬂits. Others advocate for the use of prediction to anticipate changes in the load of certain channels
to improve performance or eﬃciency [22]. In emerging interconnect technologies, prediction has been proposed to reduce
the laser power constraints of optical NoCs [24], among others. Here, we break away from these generic traffic prediction techniques and provide a qualitative discussion on designs that could exploit multicast source prediction instead.
In packet-switched NoCs, multicast source prediction could help alleviate the congestion caused by tree-based methods
by timely reacting to the small bursts of ﬂits generated by a single dense multicast. Upon predicting the generation of
close multicasts, routers in the vicinity could temporarily change their priority vector or routing algorithm to avoid the
potential area of congestion. Another approach would be to conﬁgure the DVFS system so that predictions are used to
increase performance around the multicast source.
In NoCs based on arbitrated shared media such as nanophotonic buses [14] or wireless channels [1], arbitration will be
much faster if nodes can know in advance who will transmit next. Take, for instance, a broadcast-oriented wireless NoC whose medium access control is based on a contention-based approach to minimize performance-degrading collisions [1].

Fig. 10. Geometric mean of the prediction accuracy in SPLASH2 and PARSEC for the SP and LVP assuming a 50-cycle observation window: (a) MESI, (b) HT, (c) TokenB. Labels indicate the coverage in LVP (we assume 100% coverage in the static case).
By allowing only predicted sources to transmit under given circumstances, a large percentage of collisions could be pre-
vented, especially in communication-intensive phases of an application. This could represent a great performance improve-
ment over non-optimized protocols.
6. A model for multicast traﬃc in shared-memory processors
The results of the traﬃc analysis here presented can be used to create models that faithfully capture the characteristics
of multicast traﬃc in shared-memory multiprocessors. With them, NoCs can be evaluated in realistic conditions without
having to resort to lengthy traces or full-system simulations.
Algorithm 1 outlines a possible implementation of a multicast traﬃc generator. This traﬃc generator is inspired by the
works summarized in Section 2.3, which use the Hurst exponent and a hotspot factor to model the spatiotemporal distribution
Algorithm 1: Proposed multicast traffic generator.

input: λ (load; flits/cycle), H (Hurst exponent), σ (hotspot factor), D (destinations), totalMsg

Calculate spatial distribution with σ;
Calculate distribution of destinations with D;
while numMsg < totalMsg do
    Identify source src using the σ distribution;
    Identify number of destinations Ndests using the D distribution;
    for Ndests do
        Randomly select one destination;
    end
    if burst > 0 then
        burst ← burst − 1;
        τ ← TCLK;
    else
        aON ← 3 − 2H; bON ← 1;
        burst ← pareto_dist(aON, bON);
        aOFF ← aON; bOFF ← TCLK · bON · (1/λ − 1);
        τ ← pareto_dist(aOFF, bOFF);
    end
    Wait τ;
    Send message through src;
end
of on-chip traﬃc. We maintain these parameters and then leverage the knowledge on the destination set of multicasts to
determine the number of destinations and their location. This algorithm boils down to a unicast traﬃc generator if the
number of destinations is set to 1. Mixed proﬁles can be created by having two independent generators with their own
parameters.
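A compact Python rendering of Algorithm 1 could look as follows. The probability-vector inputs and the Pareto sampling helper are our assumptions; a real implementation would plug the fitted distributions of Sections 6.1 and 6.2 into them:

```python
import random

def pareto(a, b):
    """Pareto sample with shape a and location b (inverse-transform)."""
    return b / (1.0 - random.random()) ** (1.0 / a)

def multicast_generator(load, hurst, spatial_pdf, dest_pdf, total_msgs, t_clk=1.0):
    """Sketch of Algorithm 1: yields (time, src, destinations) tuples.
    spatial_pdf is a probability vector over nodes; dest_pdf[k-1] is the
    probability of a multicast having k destinations."""
    nodes = list(range(len(spatial_pdf)))
    a_on, b_on = 3.0 - 2.0 * hurst, 1.0
    burst, t = 0, 0.0
    for _ in range(total_msgs):
        src = random.choices(nodes, weights=spatial_pdf)[0]
        n_dests = random.choices(range(1, len(dest_pdf) + 1), weights=dest_pdf)[0]
        dests = random.sample([n for n in nodes if n != src], n_dests)
        if burst > 0:            # inside an ON period: back-to-back messages
            burst -= 1
            tau = t_clk
        else:                    # draw new ON burst length and OFF gap
            burst = int(pareto(a_on, b_on))
            b_off = t_clk * b_on * (1.0 / load - 1.0)
            tau = pareto(a_on, b_off)
        t += tau
        yield t, src, dests

# Example: 16 nodes, uniform injection, 1-4 destinations per multicast.
for msg in multicast_generator(0.1, 0.9, [1 / 16] * 16, [0.4, 0.3, 0.2, 0.1], 5):
    print(msg)
```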
Our implementation of the multicast traﬃc generator is a central module virtually connected to the NIF of each tile. This
module calculates which tile should be sending each multicast message, to which destinations, and with which delay. The
multicast message is passed to the source NIF, which treats it according to the multicast communication policies and injects
it into the NoC. In the following, we provide further details on the speciﬁcities of the multicast traﬃc generator along with
proof of its validity.
6.1. Source
The work in [17] revealed that a gaussian standard deviation σ may be enough to model the spatial distribution of the
injection process in NoCs. A large value of σ represents a rather flat, uniform distribution among all cores, whereas a small
value indicates that injection of traﬃc follows a hotspot distribution. To obtain the σ parameter, ﬁtting methods are applied
to the vector of injected multicasts per tile.
Given that multicast is a subset of all the on-chip traﬃc, its injection will likely follow a similar pattern. However, our
simulations have revealed that using a normal distribution with a single σ may yield inaccurate results in some applications.
As illustrated in Fig. 11b and c, some cases would beneﬁt from a double-σ ﬁtting. To evaluate the conﬁdence of both
methods, we show their coeﬃcient of determination R2 averaged over all the benchmarks and for the different coherence
methods in Fig. 11a. A fully accurate fitting would yield R² = 1.
Coherence   Single-σ   Double-σ
MESI        0.9556     0.9841
HT          0.8039     0.9513
TokenB      0.7551     0.9799
(a) Geometric mean of all SPLASH-2 and PARSEC benchmarks

Fig. 11. Coefficient of determination (R²) of the spatial injection distribution, including the gaussian fitting in two relevant cases: (a) geometric mean of all SPLASH-2 and PARSEC benchmarks; (b) MESI/swaptions, where R² is 0.92424 (single σ) and 0.98861 (double σ); (c) TokenB/fft, where R² is 0.73948 (single σ) and 0.95351 (double σ).
Fig. 12. Statistical analysis of the multicast destinations in MESI (perfect tracking): (a) probability distribution function of the number of destinations per multicast; (b) cumulative distribution function of the multicast destinations.

These results suggest that double-σ distributions may be more appropriate to model the spatial injection distribution of
multicast traﬃc, at the cost of slightly higher complexity. In light of Fig. 11, however, one may consider using a single-σ
model plus a constant which would increase the accuracy without increasing the complexity of the model.
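As an illustration of the fitting step, the sketch below contrasts single-σ and double-σ gaussian fits on a per-node injection probability vector. The curve shapes and naive initial guesses are our assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, mu, sigma, k):
    return k * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def double_gauss(x, mu1, s1, k1, mu2, s2, k2):
    return gauss(x, mu1, s1, k1) + gauss(x, mu2, s2, k2)

def fit_r2(nodes, inj_prob):
    """R^2 of single- and double-sigma gaussian fits to the spatial
    injection distribution (naive initial guesses)."""
    def r2(model, popt):
        resid = inj_prob - model(nodes, *popt)
        return 1.0 - (resid ** 2).sum() / ((inj_prob - inj_prob.mean()) ** 2).sum()
    p1, _ = curve_fit(gauss, nodes, inj_prob,
                      p0=[nodes.mean(), nodes.std(), inj_prob.max()], maxfev=20000)
    p2, _ = curve_fit(double_gauss, nodes, inj_prob,
                      p0=[nodes.mean() / 2, nodes.std(), inj_prob.max(),
                          1.5 * nodes.mean(), nodes.std(), inj_prob.max()],
                      maxfev=20000)
    return r2(gauss, p1), r2(double_gauss, p2)
```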
6.2. Destinations
One of the aspects where existing NoC traffic models are not directly applicable in the context of this work is the selection
of the destinations of a message. In multicast traﬃc in general, both the number of destinations per message and the
destinations themselves need to be modeled. One possibility is to model the number of destinations and then to use existing
approaches to independently choose the destinations of each message. More complicated schemes could try to correlate both
aspects in order to faithfully characterize certain multicast ﬂows that go from a small set of sources to a deterministic set of
destinations. Due to its simplicity, we choose the ﬁrst option to model MESI multicast traﬃc; in HT and TokenB, the process
is straightforward since almost all the generated multicast messages are broadcasts.
To model the number of destinations per message, a trivial approach would be to compute the average number of destinations for a given application. More accurate models can be obtained by applying fitting methods to the histogram of destinations. As shown in Fig. 12a, the distribution of the number of destinations of most applications would be accurately modeled with a power function [f(x) = a · x^b + c] or a rational function [f(x) = a/(x + b)], whose coefficients of determination R² are 0.9774 and 0.9668 on average, respectively. It is worth noting that broadcast messages in MESI coherence represent a particular case and may need to be modeled separately, especially when imprecise tracking is considered.
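The fitting step can be sketched in the same fashion as above; the snippet below fits both candidate functions to a hypothetical destination-count histogram and reports their R² (the histogram is synthetic, chosen only to make the example self-contained).

    import numpy as np
    from scipy.optimize import curve_fit

    def power_fn(x, a, b, c):
        # Power function f(x) = a * x**b + c.
        return a * np.power(x, b) + c

    def rational_fn(x, a, b):
        # Rational function f(x) = a / (x + b).
        return a / (x + b)

    def r_squared(y, y_hat):
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot

    # Synthetic stand-in for the histogram of destinations per multicast.
    x = np.arange(2, 65, dtype=float)
    hist = 0.4 * x ** -1.3 + 0.002
    hist /= hist.sum()  # normalize to probabilities

    pp, _ = curve_fit(power_fn, x, hist, p0=[0.5, -1.0, 0.0])
    pr, _ = curve_fit(rational_fn, x, hist, p0=[0.5, 1.0])
    print("R2 power   :", r_squared(hist, power_fn(x, *pp)))
    print("R2 rational:", r_squared(hist, rational_fn(x, *pr)))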
To model the destinations of each message, we consider the destination set to be independent of the source for sim-
plicity. We collected the number of received multicast messages per NIF and performed initial modeling tests. Unlike the
injection process, the spatial distribution of the received multicasts does not fit well with a Gaussian distribution and ex-
hibits a more uniform behavior instead. Fig. 12b plots the cumulative distribution function (CDF) of the received multicasts,
conﬁrming that the destinations may be modeled using a uniform random variable. A linear ﬁtting further validates this
fact by achieving a coefficient of determination of 0.9964 on average.
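Putting both steps together, a destination generator for the model can be sketched as follows; the probability mass function over destination counts is a hypothetical power-law stand-in for a fitted function such as the one above, and N denotes the tile count.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 64  # number of tiles in the modeled CMP

    # Hypothetical PMF over destination counts (2..N-1); in practice it would
    # come from the power or rational fit of the destination histogram.
    counts = np.arange(2, N)
    pmf = counts.astype(float) ** -1.3
    pmf /= pmf.sum()

    def gen_destinations(source):
        # Step 1: draw the number of destinations from the fitted PMF.
        n_dest = rng.choice(counts, p=pmf)
        # Step 2: pick the destinations uniformly among all other tiles,
        # independently of the source (consistent with the CDF of Fig. 12b).
        candidates = np.delete(np.arange(N), source)
        return rng.choice(candidates, size=n_dest, replace=False)

    print(gen_destinations(source=5))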
6.3. Delay
The modeling of self-similar traﬃc has been the subject of different studies in all areas of networking [25], including
on-chip communication [17]. In light of the results shown in Section 4.5 regarding the temporal distribution of the injection
of multicasts, the knowledge obtained thus far can be leveraged to model the burstiness of multicast traffic.
The most widespread method to generate self-similar traffic is the alternate generation of ON and OFF periods. During the ON periods, the generator outputs one message per clock cycle, whereas it remains silent the rest of the time. The lengths of both periods follow a Pareto distribution [25], a heavy-tailed distribution with probability density function

f(x) = a · b^a / x^(a+1),   x ≥ b. (3)
The shape parameter a is related to the Hurst exponent as

aON = aOFF = 3 − 2H, (4)

whereas the location parameter b needs to be set to the minimum value of the distribution. In NoC environments, one can take bON as the equivalent of a burst of a single multicast, whereas bOFF is scaled in order to fix the load to the desired λ value, as:

bOFF = bON · (1/λ − 1). (5)
Using this method, we successfully created synthetic traffic with the desired H and λ characteristics. To prove the validity of the approach, we generated streams of bursty traffic of 100K messages each, with H = {0.53, 0.7, 0.9} and different loads between 0 and 0.5 flits per cycle. The traces containing the timestamps of each generated message are then analyzed to obtain the real load and Hurst exponent, and the results are averaged over a variable number of repetitions. Fig. 13 shows the measured Hurst exponents and loads as functions of the input load for the three analyzed burstiness levels. In both plots, the measurements become noisier as the input Hurst exponent increases. Still, both the measured load and the measured Hurst exponent grow, on average, proportionally to the input load and Hurst exponent, respectively. Finally, it is important to
note that the average error of the measured Hurst exponent increases with the input exponent. Therefore, corrective factors need to be applied as H approaches 1 and, consequently, the ON and OFF periods become large. Increasing the length of the simulations also helps to reduce this error.

[Fig. 13: measured Hurst exponent (top) and measured load (bottom) vs. theoretical load, for H = 0.53, 0.7, and 0.9.]
Fig. 13. Measured Hurst exponent (top) and load (bottom) as functions of the input load for H = {0.53, 0.7, 0.9}. Dashed and solid lines represent the theoretical value and the geometric mean of the measured values, respectively.
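To make the generation procedure concrete, the following sketch implements the ON/OFF method of Eqs. (3)-(5) and measures the resulting load; function and parameter names are ours, introduced only for illustration. Note that NumPy's pareto draws from the Lomax distribution, hence the +1 shift to obtain a classical Pareto variate with minimum value equal to the scale.

    import numpy as np

    rng = np.random.default_rng(2)

    def pareto(shape, scale):
        # Classical Pareto sample with minimum value `scale` (Eq. 3).
        return (rng.pareto(shape) + 1.0) * scale

    def onoff_timestamps(H, load, n_messages, b_on=1.0):
        a = 3.0 - 2.0 * H                   # shape parameter, Eq. (4)
        b_off = b_on * (1.0 / load - 1.0)   # OFF-period location, Eq. (5)
        t, stamps = 0.0, []
        while len(stamps) < n_messages:
            on = max(1, int(round(pareto(a, b_on))))  # ON length in cycles
            stamps.extend(t + i for i in range(on))   # one message per cycle
            t += on + pareto(a, b_off)                # silent OFF period
        return np.array(stamps[:n_messages])

    ts = onoff_timestamps(H=0.7, load=0.3, n_messages=100_000)
    print("measured load:", len(ts) / (ts[-1] - ts[0] + 1.0), "flits/cycle")

Consistent with Fig. 13, the measured load should match the target λ on average, with larger deviations as H approaches 1.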
7. Conclusions
We have analyzed the scaling trends of multicast communication in cache-coherent processors by performing a trace-
based characterization of a wide set of architectures, applications and system sizes. The results point towards a sustained
increase of the multicast intensity, as well as of its spatial imbalance and temporal burstiness, conﬁrming the need for
proper multicast support in manycore scenarios. To assist the evaluation of future NoCs, we proposed a simple yet accurate
multicast traffic model; to optimize their design, we demonstrated a consistent growth in the spatiotemporal correlation of multicast traffic. This trend implies an increase in the chances of correctly predicting the source of multicast transfers, as shown here both by using a predictability metric and by evaluating the performance of two simple predictors.
Acknowledgment
The authors gratefully acknowledge support from INTEL through the Doctoral Student Honor Program, from the Comis-
sionat per a Universitats i Recerca of the Catalan Government under grant 2014SGR-1427, and from the Spanish Ministry of
Economy and Competitiveness under grant PCIN-2015-012.
References
[1] Abadal S, Sheinman B, Katz O, Markish O, Elad D, Fournier Y, et al. Broadcast-enabled massive multicore architectures: a wireless RF approach. IEEE
Micro 2015;35(5).
[2] Abadal S, Mestres A, Martínez R, Alarcón E, Cabellos-Aparicio A. Multicast on-chip traﬃc analysis targeting manycore NoC design. In: Proceedings of
the PDP’15; 2015. p. 370–8.
[3] Wong F, Martin R, Arpaci-Dusseau R, Culler D. Architectural requirements and scalability of the NAS parallel benchmarks. In: Proceedings of the ACM/IEEE SC'99; 1999. p. 1–18.
[4] Vetter J, Yoo A. An empirical performance evaluation of scalable scientiﬁc applications. In: Proceedings of the ACM/IEEE SC’02; 2002. p. 1–16.
[5] Shalf J, Kamil S, Oliker L, Skinner D. Analyzing ultra-scale application communication requirements for a reconﬁgurable hybrid interconnect. In: Pro-
ceedings of the ACM/IEEE SC’05; 2005. p. 17–30.
[6] Woo S, Ohara M, Torrie E, Singh J. The SPLASH-2 programs: characterization and methodological considerations. ACM SIGARCH Comput Archit News 1995;23(2):24–36.
[7] Bienia C, Kumar S, Singh J, Li K. The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of the PACT'08; 2008. p. 72–81.
[8] Barrow-Williams N, Fensch C, Moore S. A communication characterisation of SPLASH-2 and PARSEC. In: Proceedings of the IISWC'09; 2009. p. 86–97.
[9] Martin M. Token coherence: decoupling performance and correctness. In: Proceedings of the ISCA-30; 2003. p. 182–93.
[10] Ebrahimi M, Daneshtalab M. Path-based partitioning methods for 3D networks-on-chip with minimal adaptive routing. IEEE Trans Comput 2014;63(3):718–33.
[11] Krishna T, Peh L, Beckmann B, Reinhardt SK. Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication. In: Proceedings of the
MICRO-44; 2011. p. 71–82.
[12] Miller JE, Kasture H, Kurian G, Beckmann N, Celio C, Eastep J, et al. Graphite: a distributed parallel simulator for multicores. In: Proceedings of the
HPCA-16; 2010. p. 1–12.
[13] Manevich R, Cidon I, Kolodny A. Handling global traﬃc in future CMP NoCs. In: Proceedings of the SLIP ’12; 2012. p. 40–7.
[14] Kim J, Choi K. Exploiting new interconnect technologies in on-chip communication. IEEE J Emerg Sel Top Circuits Syst 2012;2(2):124–36.
[15] Martin MMK, Harper PJ, Sorin DJ, Hill MD, Wood DA. Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory
multiprocessors. In: Proceedings of the ISCA ’03; 2003. p. 206–17.
[16] Conway P, Hughes B. The AMD Opteron Northbridge Architecture. IEEE Micro 2007;27(2):10–21.
[17] Soteriou V, Wang H, Peh L. A statistical traffic model for on-chip interconnection networks. In: Proceedings of the MASCOTS'06; 2006. p. 104–16.
[18] Badr M, Jerger NE. SynFull: synthetic traffic models capturing cache coherent behaviour. In: Proceedings of the ISCA-41; 2014. p. 109–20.
[19] Binkert N, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, et al. The gem5 simulator. ACM SIGARCH Comput Archit News 2011;39(2):1–7.
[20] Sherwood T, Sair S, Calder B. Phase tracking and prediction. ACM SIGARCH Comput Archit News 2003;31(2):336–49.
[21] Hennessy J, Patterson D. Computer architecture: a quantitative approach. Morgan Kaufmann; 2012.
[22] Ogras U, Marculescu R. Prediction-based ﬂow control for network-on-chip traﬃc. In: Proceedings of the DAC’06; 2006. p. 839–44.
[23] Matsutani H, Koibuchi M, Amano H, Yoshinaga T. Prediction router: yet another low-latency on-chip router architecture. In: Proceedings of the HPCA'09; 2009. p. 367–78.
[24] Zhou L, Kodi AK. PROBE: prediction-based optical bandwidth scaling for energy-efficient NoCs. In: Proceedings of the NOCS'13; 2013. p. 1–8.
[25] Leland WE, Taqqu MS, Willinger W, Wilson DV. On the self-similar nature of ethernet traﬃc (extended version). IEEE/ACM Trans Netw 1994;2(1):1–15.
Sergi Abadal is a PhD candidate at the NaNoNetworking Center in Catalonia (N3Cat), Universitat Politècnica de Catalunya (UPC), Spain. In 2013, he received the INTEL Doctoral Student Honor fellowship. His research interests include on-chip net-
working, many-core architectures, and graphene-based wireless communications. Abadal has an MSc in information and commu-
nication technologies (2011) from the UPC.

Raúl Martínez is a consulting member of Technical Staff at Oracle Labs. Before that, he worked at the INTEL Barcelona Research
Center. His research interests include high-performance interconnects, hardware/software co-design, dynamic binary optimiza-
tions, and networks-on-chip. Martínez has a PhD (2007) from the University of Castilla-La Mancha.
Josep Solé-Pareta is a full professor at the Computer Architecture Department at the UPC. He is co-founder of the CCABA and
N3Cat groups also at UPC. His current research interests are in nanonetworking communications, traﬃc monitoring and analysis,
high-speed and optical networking, and energy-efficient transport networks. Solé-Pareta has a PhD in computer science (1991)
from the UPC.
Eduard Alarcón is an associate professor at the N3Cat at UPC. He has co-authored more than 250 scientiﬁc publications, and
has been involved in different R&D projects within his research interests including nanocommunications, energy harvesting, and
wireless energy transfer. Alarcón has a PhD in electrical engineering (2000) from the UPC.
Albert Cabellos-Aparicio is an associate professor at the N3Cat at UPC. He has given more than 10 invited talks and co-authored
more than 70 papers within his research interests including graphene technology, nanocommunications, and software-deﬁned
networking. Cabellos-Aparicio has a PhD in computer science engineering (2008) from the UPC.