圧縮とマルチキャストを用いた低遅延オンチップネットワーク by 和 远 & He Yuan
Low Latency On-Chip Networks through
Compression and Multicasting
(圧縮とマルチキャストを用いた
低遅延オンチップネットワーク)
Yuan He
The University of Tokyo
March 2014

Advisor: Professor Hiroshi Nakamura Author: Yuan He
Abstract
The inevitable advent of the multi-core era has driven an increasing demand for
low latency on-chip interconnection networks (or NoCs). Being a critical part of
the memory hierarchy for modern chip multi-processors (CMPs), these networks
face stringent design constraints to provide fast communication. A modern NoC's
first-order concern is clearly its latency, so we present three low latency techniques
in this thesis. One is traffic compression, while the other two are low latency
router designs based on multicast-capable crossbar switches.
Firstly, an adaptive traffic compression scheme is proposed, taking account of
the vertical bandwidth limitation of 3D NoCs and traffic compressibility. It is
found that compressibility-based adaptive compression is very useful against
incompressible traffic, while location-based adaptive compression is more
effective with more layers.
Secondly, an improvement is made to the Prediction Router (PR), which routes
packets based on predicted destinations. This improvement (the so-called
Predict-more Router) applies multiple prediction algorithms at the same time to
one packet. It helps increase the prediction accuracy by 15% and outperforms the
best PR by 3.5% in system speed-up.
Thirdly, to further lower the latency and to get rid of predictions, a novel low
latency router (McRouter) is proposed, which multicasts a packet to all possible
outputs. This design allows a single-cycle transfer of flits when there is enough
bandwidth within the router, so it may behave like an always-hit prediction router.
Evaluation shows that McRouter achieves system speed-ups of 1.28, 1.17
and 1.05 over the conventional router, the VSA router and the best prediction
router, respectively.
Acknowledgments
Throughout this long journey, I had so many great moments to be engraved
into my memory and so much great support from many to appreciate. Without
these, it would have been impossible for me to look back with such joy and to
complete this research endeavor.
First and foremost, my heartiest gratitude goes to my advisor, Professor Hiroshi
Nakamura. From the day I started my study in his laboratory, he opened up a door
for me to work as a researcher by sharing his knowledge and enthusiasm. His
incisive comments and valuable advice have been the key to my progress and the
completion of this dissertation. Not only did he support me in various aspects
during my study, he also clearly taught me the real essence of research and
steadily led my way through it. It has been both an honor and a joy for me to
work with him.
I would also like to acknowledge Professors Hiroyuki Morikawa, Masahiro Fujita,
Masahiro Goshima and Masaaki Kondo for their thoughtful advice on my
dissertation in the defense. In addition, Professor Masaaki Kondo's work (while
he was working with Professor Hiroshi Nakamura) was the reason for my admission
to Nakamura Laboratory at the very beginning.
I am also very grateful to Professors Shinobu Miwa, Hiroshi Sasaki and Hiroki
Matsutani (in addition to Professor Hiroshi Nakamura) for being co-authors of
our technical papers. It was vital to have them when I was struggling
in writing these papers. Their collaboration made these papers materialize,
and the papers became important building blocks of my dissertation. Professor
Hiroshi Sasaki was also the one who brought me to the field of computer
architecture by sharing his thoughts and ideas and providing hints when I was stuck.
Professor Hiroki Matsutani has been very kind to me as well. He helped me
understand the basics of on-chip interconnections and helped me build the evaluation
platforms for my work. For the same reason, I would also like to thank Professor
Takashi Nakada; discussions with him have always been helpful in improving
the quality of my work.
At the early stage of my study, Professors Takashi Nanya and Masashi Imai
also provided me with various support. I am indebted to them as well. I also
appreciate the help of Professors Zoran Salcic, Morteza Biglari-Abhari and Partha
Roop in bringing me into the field of computer systems while I was studying at
the University of Auckland.
During my time at Nakamura Laboratory, Ms. Setsuko Yuge, Akiko Kabayama,
Yumiko Kumaoka, Remi Maehashi and Sara Noguchi helped me through many
general and administrative issues. I want to express my sincere thanks to them too.
Time would not have passed with such joy without many friends during these years.
They helped me make up my life other than this dissertation. I would like to thank
Roberto Drebes, James Weston, Seidai Takeda, Kyundong Kim, Toshiya Komoda,
Eishi Arima, all other lab members and many personal friends during my study for
their company. It has been so supportive and I really appreciate it.
Lastly, I want to thank my family, especially my wife and daughter, for their care
and love. Nothing would be possible without them.
Contents
List of Figures
List of Tables
1 Introduction
1.1 On-chip Networks
1.2 Objective, Problem Definitions and Contributions
1.3 Assumptions and Scope
1.4 Dissertation Organization
2 Background: Low Latency Techniques for On-chip Networks
2.1 Traffic Compression
2.2 Low Latency Routers
2.2.1 Conventional Router
2.2.2 Router Optimizations
3 Latency Reduction through Traffic Compression
3.1 Motivation
3.2 Traffic Compression on NoCs
3.2.1 Compression Algorithm and Implementation
3.2.2 Proposed Adaptive Compression for 3D NoCs
3.3 Methodology
3.4 Results
3.5 Summary and Discussions
4 Latency Reduction through In-router Multicasting: Predict-more Router
4.1 Motivation
4.2 The Predict-more Router
4.2.1 Design
4.2.2 Architectural Discussions
4.3 Methodology
4.4 Results
4.4.1 Prediction Accuracy and Routing Efficiency
4.4.2 Synthetic Performance
4.4.3 Application Performance
4.4.4 Summary and Discussions
5 Latency Reduction through In-router Multicasting: McRouter
5.1 Motivation
5.2 Multicast within a Router Approach and Architecture
5.2.1 Overview
5.2.2 Architectural Changes and the Multicast Operation
5.2.3 Timing
5.2.4 Critical Path Delay
5.3 Architecture Discussions and Qualitative Comparisons
5.3.1 Control Dependency
5.3.2 Speculation
5.3.3 Routing Efficiency
5.3.4 Power Overhead
5.4 Methodology
5.5 Results
5.5.1 Synthetic Traffic
5.5.2 Application Traffic
5.5.3 Sensitivity Studies
5.6 Summary and Discussions
6 Conclusions
6.1 Further Discussions and Future Work
References
List of Publications by the Author
List of Figures
1.1.1 16-tile CMP connected with an NoC.
1.1.2 A few topologies for on-chip networks.
1.3.1 Parallel speed-up for some multi-threaded workloads.
2.2.1 Pipeline stages of various router designs.
3.1.1 2D and 3D NoC topologies.
3.1.2 System performance degradations under link limitations for 3D NoC.
3.1.3 Tiles of 2D and 3D NoCs.
3.2.1 Patterns of the frequent pattern compression.
3.2.2 An example of the frequent pattern compression.
3.2.3 Compressibility-based adaptive control.
3.2.4 Location-based adaptive control.
3.2.5 Compressibility- and location-based adaptive control.
3.4.1 Normalized execution time with static/adaptive compression on 3D NoCs.
4.1.1 Prediction accuracy.
4.1.2 Fractions for different numbers of concurrent flits arriving at routers each cycle.
4.2.1 Architectures of the prediction router and the predict-more router.
4.4.1 Percentage of packets successfully accelerated with predictions.
4.4.2 Network latency with synthetic traffic.
4.4.3 Normalized per-flit latency.
4.4.4 Normalized system speed-up.
4.4.5 Normalized network power consumption.
5.1.1 Average link utilization on a 16-core CMP connected with 4 by 4 mesh network.
5.1.2 Fractions for different numbers of concurrent flits arriving at routers each cycle.
5.2.1 Architecture of the conventional router.
5.2.2 Architecture of McRouter.
5.2.3 Pipeline stages of McRouter.
5.2.4 Best case transmission of a multi-flit packet in McRouter.
5.5.1 Evaluations with synthetic traffic.
5.5.2 Normalized per-flit latency.
5.5.3 Normalized system speed-up.
5.5.4 Normalized network power consumption.
5.5.5 System speed-up with router parameter downscaling.
List of Tables
3.2.1 Qualitative comparisons of adaptive compression policies.
3.3.1 System parameters.
3.3.2 Benchmark programs and inputs.
4.3.1 System parameters.
4.3.2 Benchmark programs and inputs.
5.2.1 Destination output ports when multicasting considering incoming ports.
5.2.2 Destination output ports when multicasting considering packet types.
5.3.1 Qualitative comparisons of low latency routers including McRouter.
5.4.1 System parameters.
5.4.2 Benchmark programs and inputs.
Well begun is half done.
Aristotle
1
Introduction
It has been nearly half a century since Moore's Law was first observed, and it is
still used today to guide technology scaling in the semiconductor industry. As a
result of this still-ongoing technology scaling, the number of transistors per chip
continues to double roughly every 2 years [1, 40]. However, the end of Dennard
scaling [23], after which improvements in the energy efficiency of semiconductor
devices slowed substantially, marked a historic paradigm shift to multi-core
designs, in which the number of processor cores per chip is scaled up rather than
spending the transistor budget on core performance [8, 12, 42].
As such multi-core systems continue to scale with the technology trend, industry
and researchers are moving quickly from shared buses and dedicated wires to
on-chip networks (also called networks-on-chip, or NoCs for short) in order to
confront the growing wire delay and increasing arbitration overheads [14, 19, 20, 46].
Although there are many ways to employ on-chip networks in multi-core systems, a
lot of challenges are still left unaddressed. Performance-wise, one of the most
critical concerns of on-chip networks is their communication latency, which scales
up with their size.
Hence, three solutions targeted at shortening the communication latency of
on-chip networks are presented in this dissertation. The next few sections provide
an introduction to on-chip networks, followed by an overview of this dissertation
including its objective, problem definitions, contributions, assumptions, scope
and organization.
1.1 On-chip Networks
The concept of NoCs emerged as a replacement for buses and crossbars in the
multi-core era. The reason is straightforward: as core count scales up with
technology scaling, neither buses nor crossbars provide sufficient scalability.
For buses, more cores require longer buses and more complex arbiters, which
substantially add both wire delay and arbitration delay. Another problem of buses
is their bandwidth: with more and more cores, a bus can easily be saturated. For
crossbars, scalability is limited by their power and area consumption when they
are used to connect more and more cores, since their complexity grows rapidly.
As an example, a typical on-chip network is shown in Figure 1.1.1a.
[Panels: (a) 16-core CMP with tiled design; (b) Components inside a tile]
Figure 1.1.1: 16-tile CMP connected with an NoC.
With NoCs, the benefits are as follows. Firstly, they scale much better in power
and area than buses and crossbars. Secondly, they make good use of wiring, since
links are distributed across the network and wires are much shorter than those of
buses. This also enables NoCs to supply higher bandwidth. Thirdly, NoCs are
modular in design.
Typical NoCs have repetitive structures and can be seen as a composition of a few
basic components: network interfaces, routers and links. Each of them is briefly
introduced below.
• A network interface sits between a network node (such as a processor core,
cache bank or memory controller) and the network. It breaks packets into
flits and reassembles flits into packets. Flits are the minimum units of
network traffic, and their size is the same as the link width. Network
interfaces are also used to inject flits into the network and receive flits
from it.
• A router is the most important part of an NoC, since its delay determines the
communication latency per hop. A typical NoC router is connected to network
interfaces and neighboring routers through links, and it has buffers, routing
circuits, allocators and a crossbar inside. Depending on the topology of
the network, the router determines the route that a packet takes through the
network. Router architecture has been extensively studied in order to
minimize router delay and thereby shorten the network communication latency.
• A link is a set of wires joining other components in the network. NoC links
are distributed through the network and are much shorter than conventional
buses.
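The packetization role of the network interface described above can be illustrated with a minimal Python sketch. It is only an illustration under assumptions that are not from this dissertation: the `Flit` class, the head/body/tail/single labels, the zero padding of the last flit and the 128-bit link width used in the usage lines are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    packet_id: int
    kind: str       # "head", "body", "tail" or "single" (labels assumed here)
    payload: bytes  # exactly one link width of data


def packetize(packet_id: int, data: bytes, link_width_bits: int) -> List[Flit]:
    """Break a packet into flits whose size equals the link width."""
    flit_bytes = link_width_bits // 8
    chunks = [data[i:i + flit_bytes] for i in range(0, len(data), flit_bytes)] or [b""]
    flits = []
    for i, chunk in enumerate(chunks):
        if len(chunks) == 1:
            kind = "single"
        elif i == 0:
            kind = "head"
        elif i == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        # Pad the last chunk with zeros so that every flit fills the link (assumption).
        flits.append(Flit(packet_id, kind, chunk.ljust(flit_bytes, b"\x00")))
    return flits


def reassemble(flits: List[Flit], length: int) -> bytes:
    """Rebuild the original packet from its flits at the receiving interface."""
    return b"".join(f.payload for f in flits)[:length]


# Usage: a 72-byte packet on a 128-bit (16-byte) link needs 5 flits.
packet = bytes(range(72))
flits = packetize(1, packet, link_width_bits=128)
print(len(flits), [f.kind for f in flits])
```

The receiving interface needs the original packet length to strip the padding; a real design would carry it in the header flit.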
These basic components also form the topology of the network. A topology not
only reflects the layout of the network but also determines the average number
of hops a packet travels and the distance between two network nodes. Common
on-chip network topologies include (but are not limited to) ring, mesh, torus and
fat tree [19, 46, 49], as shown in Figure 1.1.2.
[Panels: (a) Ring; (b) Mesh; (c) Torus; (d) Fat tree]
Figure 1.1.2: A few topologies for on-chip networks.
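To make the link between topology and hop count concrete, the following sketch computes the average distance between distinct nodes for 16-node ring, mesh and torus networks. It assumes uniformly distributed traffic, minimal routing and one hop per link traversed between routers; these assumptions and all function names are illustrative rather than taken from the evaluations in this dissertation. The fat tree is omitted for brevity.

```python
from itertools import product


def ring_distance(a: int, b: int, n: int) -> int:
    """Shortest distance between positions a and b on a ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def mesh_distance(a, b) -> int:
    """Manhattan distance between two (x, y) tiles on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_distance(a, b, k: int) -> int:
    """Distance on a k-by-k torus: each dimension behaves like a ring."""
    return ring_distance(a[0], b[0], k) + ring_distance(a[1], b[1], k)


def average_distance(nodes, distance) -> float:
    """Average distance over all ordered pairs of distinct nodes."""
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


tiles = list(product(range(4), repeat=2))  # 4-by-4 arrangement of 16 tiles
avg_ring = average_distance(range(16), lambda a, b: ring_distance(a, b, 16))
avg_mesh = average_distance(tiles, mesh_distance)
avg_torus = average_distance(tiles, lambda a, b: torus_distance(a, b, 4))
print(f"ring {avg_ring:.2f}, mesh {avg_mesh:.2f}, torus {avg_torus:.2f}")
# prints "ring 4.27, mesh 2.67, torus 2.13"
```

Under these assumptions the ring has the longest average path and the torus the shortest, which is one reason the choice of topology matters for latency.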
1.2 Objective, Problem Definitions and Contributions
The objective of this dissertation is to shorten the communication latency of
on-chip networks. The reason is that the on-chip network is part of the
memory hierarchy of modern multi-core chips, so its communication latency
is closely tied to the memory performance of the chip. Not only main memory
accesses go through such a network; accesses to the L2 cache, accesses to the L1
caches of other processor cores and coherence messages travel through it as well.
Before going into the problem definitions, two formulae¹ are given which detail
the packet latency and router delay in NoCs.
¹These two formulae are derived from [25, 46].
\[
\mathit{PacketLatency} = T_{\mathit{Injection}} + T_{\mathit{LinkPropagation}}\,(h+1) + T_{\mathit{Router}}\,h + \frac{\mathit{PacketSize}}{\mathit{LinkBandwidth}} + T_{\mathit{Reception}} \tag{1.1}
\]
In Formula (1.1), packet latency is composed of the injection and reception delays
at the network interfaces, the link propagation delay multiplied by the number of
links a packet travels (h + 1), the router delay multiplied by the number of hops a
packet takes (h), and the serialization delay, which is the packet size divided by
the link bandwidth.
\[
T_{\mathit{Router}} = T_{\mathit{Buffering}} + T_{\mathit{Routing}} + T_{\mathit{Allocation}} + T_{\mathit{Switching}} \tag{1.2}
\]
Looking further into the router delay, as in Formula (1.2), time is spent on
buffering flits, computing the route, allocating router resources (failed
allocations are a reflection of resource contention) and switching the flits.
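Formulae (1.1) and (1.2) can be evaluated directly, as in the minimal sketch below. The function and parameter names are hypothetical, and every number in the usage lines (one cycle per router term, a 3-hop path, a 640-bit packet on a 128-bit link) is an illustrative assumption rather than a parameter from the evaluations in this dissertation.

```python
def router_delay(t_buffering, t_routing, t_allocation, t_switching):
    """Formula (1.2): per-hop router delay in cycles."""
    return t_buffering + t_routing + t_allocation + t_switching


def packet_latency(hops, packet_bits, link_bits_per_cycle, t_router,
                   t_link=1, t_injection=1, t_reception=1):
    """Formula (1.1): end-to-end packet latency in cycles."""
    serialization = packet_bits / link_bits_per_cycle
    return (t_injection
            + t_link * (hops + 1)   # a packet crossing h routers uses h + 1 links
            + t_router * hops
            + serialization
            + t_reception)


# Illustrative values only: each router term costs one cycle.
baseline = packet_latency(hops=3, packet_bits=640, link_bits_per_cycle=128,
                          t_router=router_delay(1, 1, 1, 1))
# The same path with a hypothetical single-cycle router.
fast = packet_latency(hops=3, packet_bits=640, link_bits_per_cycle=128, t_router=1)
print(baseline, fast)  # prints "23.0 14.0"
```

In this toy setting the router term accounts for 12 of the 23 cycles, which is why both the router delay and the serialization term are natural targets for optimization.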
So far, it is clear that minimizing any term in Formula (1.1) helps reduce
the packet latency. In this dissertation, the packet latency is optimized with
two ideas. Firstly, under a fixed link bandwidth, traffic compression is employed
to shrink the size of a packet in order to reduce both the serialization delay and
the allocation delay (compression may reduce the number of flits, resulting in
less pressure during router resource allocation). Secondly, two low latency router
designs are proposed to minimize the router delay.
The contributions of this dissertation can be stated as follows.
• Identifying the bandwidth limitations in 3D NoCs when traffic travels
between layers.
• Proposing and exploring adaptive control of traffic compression on 3D NoCs.
This adaptive control invokes compression when it is beneficial and/or the
traffic is encountering bandwidth limitations.
• Demonstrating through simulation the effectiveness of the three adaptive
compression policies when they are applied to 3D NoCs.
• Finding that prediction accuracy can be significantly improved if multiple
predictions are carried out for one packet at once (the Wisdom of Crowds
in predictive routing with multiple algorithms).
• Utilizing this finding to implement a low latency router called the
predict-more router (PmR), which hides route computation and arbitration
delays more successfully than the original prediction router, with a marginal
power overhead.
• Finding that a router's internal bandwidth can be very plentiful for
multi-threaded workloads.
• Proposing a new low latency router named McRouter, which successfully
hides route computation and arbitration delays by utilizing the remaining
bandwidth inside a router more productively. Unlike previous techniques
that speculate on route computation results (such as the prediction
router), multicast never misses. Also, multicast operations are both
bandwidth-dependent and speculative, so McRouter only consumes the remaining
bandwidth within a router and barely degrades the routing efficiency.
• Presenting the detailed design of McRouter and demonstrating its
effectiveness through performance and power evaluations against its counterparts.
1.3 Assumptions and Scope
To make the evaluations in this dissertation tractable, some basic assumptions
are made about the evaluated configurations. A general-purpose chip
multi-processor (CMP) system with 16 in-order cores is assumed, and the network
topology is a 16-tile mesh: 2D for the low latency routers in Chapter 4 and
Chapter 5, and 3D for the traffic compression work in Chapter 3.
In addition to a processor core, each tile has a router, a bank of the L2 cache
and sometimes a memory controller. Figure 1.1.1 depicts such a system in a 2D mesh
topology and what a tile looks like. Variations of this system and the tiles in
3D mesh topologies are described in Chapter 3. For 3D integration, the vertical
links are set to 16 bits wide while the planar links are 128 bits wide.
General-purpose CMPs with in-order cores under a mesh topology are assumed for
their popularity and simplicity. For example, available large-scale CMPs such
as the Tilera TILE64 processors and the Intel Single-Chip Cloud Computer employ
in-order cores and mesh topologies [13, 26]. Both in-order cores and mesh
topologies are very foreseeable design decisions when considering scalability and
power efficiency. The choice of 16-bit vertical links in 3D integration follows
from considerations of the chip size of 16 in-order cores, current industry
standards (such as JEDEC Wide I/O) and different 3D integration implementation
technologies (such as wire-bonding, micro-bump and through-silicon via) [2, 22].
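The gap between the 128-bit planar links and the 16-bit vertical links can be quantified with a small sketch of the serialization term. The 64-byte cache line, the 50% compression ratio and the decision to ignore header bits are assumptions chosen only for illustration; they are not parameters stated in this section.

```python
import math


def serialization_cycles(packet_bits: int, link_width_bits: int) -> int:
    """Cycles needed to push a packet through a link, one link width per cycle."""
    return math.ceil(packet_bits / link_width_bits)


CACHE_LINE_BITS = 64 * 8       # assumed 64-byte cache line, header bits ignored
PLANAR_BITS, VERTICAL_BITS = 128, 16

for ratio in (1.0, 0.5):       # uncompressed, then an assumed 2:1 compression
    bits = int(CACHE_LINE_BITS * ratio)
    print(ratio,
          serialization_cycles(bits, PLANAR_BITS),
          serialization_cycles(bits, VERTICAL_BITS))
# prints "1.0 4 32" and then "0.5 2 16"
```

Under these assumptions a packet spends eight times as many cycles on a vertical link as on a planar one, so any reduction in packet size pays off most between layers.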
Figure 1.3.1: Parallel speed-up for some multi-threaded workloads.
The workloads used are multi-threaded parallel applications, since the focus of
this dissertation is on high-performance computing [3, 54]. They are parallelized
with the OpenMP API [4]. All threads do similar jobs, so their communication
characteristics can be very similar. For workloads that achieve near-ideal
parallel speed-up (such as EP in NPB 3 [27, 53]), this means that the performance
improvements from the solutions in this dissertation can be very similar
regardless of the number of parallel threads: the solutions improve the memory
access latency of every parallel thread. For workloads whose speed-up is limited
by communication overhead after parallelization, that communication overhead can
also be reduced by the solutions proposed in this dissertation. Parallel speed-ups
for some workloads from 4 threads to 16 threads are shown in Figure 1.3.1.
Evaluation conditions are the same as those in Section 4.3 and Section 5.4. In
this figure, nearly all tested workloads scale well with more threads (even better
than EP). The reason is that the evaluations in this dissertation simulate only
the parallel regions of these workloads. One exception is raytrace, whose parallel
speed-up scales poorly with more threads and differs from previous work [54]. A
possible reason is that raytrace suffers considerably from non-uniform cache
access latencies. The operating system (Solaris 9 [6] is used in the evaluations)
attaches threads to processor cores through natural task mapping. Natural mapping
is chosen for its simplicity and popularity, and because it is not specifically
optimized for the solutions proposed in this dissertation.
For the memory hierarchy, each core has its own private L1 caches, while an L2
cache is shared by all cores. This shared L2 cache is exclusive. These two
assumptions on the L2 cache are made to gain the capacity advantage of an
exclusive shared last level cache (LLC). The larger effective cache capacity of an
exclusive shared LLC is very good for multi-core designs, since off-chip bandwidth
is believed to be a serious concern for them as the number of cores scales [48].
Cache coherence is maintained through a MOESI directory protocol, which means
that when an L1 miss happens, a request is directed to a directory sitting beside
the L2 cache banks to find the exact location of the piece of data. Such a
directory-based protocol is assumed for its scalability advantage over
broadcast-based protocols, since less traffic is needed to complete a coherence
transaction [51]. This means that directory-based protocols are more future-proof
as the number of cores per chip grows.
This dissertation is based on the performance and power models of a conventional
NoC router and two of its variations [19, 45, 46]. All these router designs
assume that virtual channel allocation (VCA) is the most time-consuming pipeline
stage (hence the critical path). Link traversal is thus set to 1 cycle.
With the above assumptions on the memory hierarchy and network configurations,
there are typically two types of packets with different sizes. Larger packets are
data packets, which transmit a piece of data the size of a cache line, while
smaller packets (1 flit in this dissertation) simply carry a control message.
Based on their purposes, packets can also be categorized into 3 classes: request,
response and forward.
Although the above assumptions do not cover the entire design space, they are
relaxed in the discussions to give a qualitative picture of how different numbers
of cores, topologies, coherence protocols, types of applications and so on may
affect the effectiveness of the solutions proposed in this dissertation.
For the scope of this dissertation, the power implications are not discussed for
the first solution (traffic compression), while area overhead is left out for all
proposals. These are going to be addressed in future work.
1.4 Dissertation Organization
The remainder of this dissertation is organized as follows. Chapter 2 gives a
review of related work, which summarizes other low latency techniques for
on-chip networks. This is followed by Chapter 3, which presents a low latency
technique through traffic compression for 3D NoCs. Chapter 4 describes the
predict-more router, one of the two low latency on-chip routers with in-router
multicasting, while the other low latency on-chip router design, called
multicast-within-a-router (McRouter), is introduced in Chapter 5. Finally, this
dissertation is concluded in Chapter 6.
Study the past if you would define the future.
Confucius
2
Background: Low Latency Techniques for
On-chip Networks
After the introduction to NoCs and this dissertation, this chapter covers existing
low latency techniques for NoCs. These studies are separated into two groups, each
with a different focus. The first group covers compression-based techniques, while
the second group reviews most of the low latency routing techniques.
2.1 Traffic Compression
Traffic compression for NoCs, as an efficient on-chip optimization, has been
extensively studied for 2D designs [21, 28, 55]. The authors of [21] were the first
to apply frequent pattern compression to a CMP with a network-on-chip
architecture. Their primary goal was to compare cache compression and network
compression using the same algorithm, in terms of their effects on performance
and energy consumption. Both [28] and [55] compress data on NoCs with another
candidate algorithm, frequent value compression. Although their results are
positive, it is believed that for any architecture with multiple communicating
nodes, frequent value compression can be inefficient because its area and
synchronization overheads make it scale poorly. The authors of [28] also propose a
solution to the area overhead and an adaptive compression control mechanism that
takes network congestion into account.
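As a rough illustration of how pattern-based compression shrinks traffic, the sketch below classifies 32-bit words into a handful of frequent patterns and counts the encoded bits. It is a simplified sketch in the spirit of frequent pattern compression, not the algorithm evaluated in [21] or in this dissertation: the pattern subset, the 3-bit prefix per word and the omission of zero-run encoding are assumptions made to keep the example short.

```python
PREFIX_BITS = 3  # assumed: every encoded word starts with a 3-bit pattern tag


def fits_signed(word: int, bits: int) -> bool:
    """True if the 32-bit word is the sign extension of a `bits`-bit value."""
    signed = word - (1 << 32) if word & 0x80000000 else word
    return -(1 << (bits - 1)) <= signed < (1 << (bits - 1))


def classify(word: int):
    """Return (pattern name, payload bits) for one 32-bit word."""
    if word == 0:
        return "zero", 0
    if fits_signed(word, 4):
        return "sign-extended 4-bit", 4
    if fits_signed(word, 8):
        return "sign-extended byte", 8
    if fits_signed(word, 16):
        return "sign-extended halfword", 16
    if word == (word & 0xFF) * 0x01010101:
        return "repeated byte", 8
    return "uncompressed", 32


def compressed_bits(words) -> int:
    """Total encoded size: one prefix plus the pattern payload per word."""
    return sum(PREFIX_BITS + classify(w)[1] for w in words)


words = [0, 5, 0xFFFFFFFF, 0x00001234, 0x7F7F7F7F, 0xDEADBEEF]
print(compressed_bits(words), "bits instead of", 32 * len(words))
# prints "82 bits instead of 192"
```

Because each word is matched against a fixed set of patterns, no dictionary has to be kept in sync between sender and receiver, unlike value-based schemes.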
Before traffic compression on NoCs was studied, there were already many efforts
to apply compression to buses and caches [9–11, 52]. Moreover, the study in [48]
showed that both cache and bus compression are highly efficient in terms of
further scaling CMP designs.
2.2 Low Latency Routers
Since the introduction of NoCs, many works have focused on shortening the latency
of a router [24, 33, 39, 41, 45]. Major improvements on router designs are covered
in this section, starting with the conventional router.
(a) Conventional router (4-cycle)
(b) VSA router (3-cycle when speculation succeeds)
(c) Look-ahead router (3-cycle)
(d) Look-ahead VSA router (2-cycle when speculation succeeds)
(e) Prediction router (1-cycle when prediction succeeds)
Figure 2.2.1: Pipeline stages of various router designs.
2.2.1 Conventional Router
The structure of a conventional virtual channel router (CR) with a 4-stage
pipeline [19] is shown in Figure 2.2.1a. When a packet is transferred through this
router, its header flit serially invokes the 4 pipeline stages, which are route
computation (RC), virtual channel allocation (VA), switch allocation (SA) and
switch traversal (ST); any body flits of the same packet are instead processed by
the SA and ST stages only. As their names suggest, RC finds the output port for a
packet by decoding its header flit and computing its route; VA locates a proper
group of buffers in which this packet will be stored at the next hop; SA allocates
a proper time slot for the flits of a packet to traverse the crossbar switch; and
finally, flits traverse the crossbar switch at the ST stage.
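The stage sequence described above can be laid out as a cycle-by-cycle schedule. The sketch below assumes a contention-free hop in which every stage takes one cycle and each body flit follows the previous flit by one cycle; the function name and these timing assumptions are illustrative and are not taken from the router models used in this dissertation.

```python
HEAD_STAGES = ["RC", "VA", "SA", "ST"]  # the header flit visits all four stages


def schedule(num_flits: int) -> dict:
    """Map each flit index to {stage: cycle} for one contention-free hop."""
    timeline = {}
    for i in range(num_flits):
        if i == 0:
            timeline[i] = {stage: cycle
                           for cycle, stage in enumerate(HEAD_STAGES, start=1)}
        else:
            # A body flit reuses the route and virtual channel of its header,
            # so it only needs SA and ST, one cycle behind the previous flit.
            sa = timeline[i - 1]["SA"] + 1
            timeline[i] = {"SA": sa, "ST": sa + 1}
    return timeline


for flit, stages in schedule(5).items():
    print(flit, stages)
```

With these assumptions the header flit leaves the router in cycle 4 and the last of five flits leaves in cycle 8, so the 4-cycle pipeline depth is paid once per packet per hop.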
2.2.2 Router Optimizations
The latency of a router determines the transmission delay of a packet. Figure 2.2.1b
to Figure 2.2.1e demonstrate the evolution of router pipeline designs, with each
design resulting in a different per-hop latency.
The VSA router (VSAR) [45] overlaps the VA and SA stages to produce a 3-cycle
router, as shown in Figure 2.2.1b. The key point is that if both operations
succeed while being performed at the same time, VA and SA together cost only
1 cycle (as for Router C). This design acts as a CR if VA fails to return a
free virtual channel, regardless of the result of the speculative SA attempt (as
for Router A).
Look-ahead routing (LAR), shown in Figure 2.2.1c, is a non-speculative technique
which effectively shortens the router latency to 3 cycles. A router with LAR
always performs RC for the next hop, which hides the RC delay.
Moreover, as illustrated in Figure 2.2.1d, VSA and LAR can be merged to form a
look-ahead VSA router, which shortens the router latency to 2 cycles when
speculation succeeds (as for Router C).
There are two single-cycle routers that both utilize LAR. One is a router in which
a flit is able to speculatively traverse the crossbar switch if the network is not
busy, on the assumption that arbitration is not needed under such a network
condition [41]. The other is a non-speculative wormhole router named NoX [24].
Instead of arbitrating for the crossbar switch, the NoX router relies on a new
XOR-based switch which hides the arbitration delay by XORing contending flits
(so that routed flits may be XORed).
So far, the most aggressive speculation found in routers is that of the prediction
router (PR) [39], whose speculative switch traversal is enabled by predicting
output ports before a packet actually arrives at a router. As depicted in
Figure 2.2.1e, if a predictive routing succeeds (as for Routers B and C), RC, VA
and SA are all hidden, since they are already carried out with a predicted RC result.
Kumar et al. have proposed another two low latency techniques [33, 34]. The first one introduces a single-cycle router with an aggressively optimized control path [34]. It is able to route in one cycle by sending an advanced bundle to help set up the control before a flit actually arrives. The second one proposes an approach called express channels, which enables a multi-hop packet to bypass intermediate routers [33].
Size is not a reality, but a construct of the mind; and space a
construct to contain constructs.
Robert Anton Wilson
3
Latency Reduction through Traffic
Compression
Recently, the concept of NoCs has been extended to ICs that have three-dimensional structures, namely the 3D NoC [50], in order to mitigate the wire delay and wire energy which increasingly pose severe problems for modern VLSI design. Traditionally, wire delay can be mitigated by inserting inverting buffers (i.e., repeaters) on long wires, but the buffers themselves add gate delay and consume energy; thus repeater insertion is not a fundamental solution to the problem. With 3D ICs, a number of wafers or dies are stacked very closely (e.g., 5 μm to 50 μm apart); thus a 3D structure significantly reduces wire length, wire delay, and wire energy compared to its 2D counterparts.
For these reasons, the 3D NoC is an emerging research topic, and its network topology [44], router architecture [31, 43], and routing algorithms [47] have already been extensively studied.
However, many studies on 3D IC architectures have underestimated the negative impact of vertical interconnects, as reported in [30]. Unfortunately, these vertical interconnects, such as through-silicon vias (TSVs) and microbumps, also consume a certain amount of area. In addition, they affect the routability of wires negatively, because some vertical interconnects interfere with metal layers. Thus, although 3D IC technologies are regarded as a sound path beyond Moore's Law, their vertical bandwidth is still a major concern. In practice, such a vertical bandwidth limitation can significantly degrade system performance (see Section 3.1).
Since vertical bandwidth limitations come from the physical design constraints mentioned above, the only way to mitigate the performance degradation is to reduce the amount of communication data, especially data moving vertically. In this chapter, therefore, a study of traffic compression on 3D NoC architectures is presented with a comprehensive set of scientific workloads.
3.1 Motivation
3D ICs bring many benefits such as increased system integration, reduced wire length and increased data locality, but how different wafers or dies are stacked vertically remains an open question for the research community and the industry. Various interconnection technologies for 3D ICs have been developed for the purpose of vertical stacking, such as wire-bonding, micro-bump [16, 32] and through-silicon via (TSV) [17, 22].
• Wire-bonding is a die-to-die interconnection formed with bonding wires. It has a recorded footprint of 35 to 100 um. It is the most common approach and has been widely used in System-in-Package designs. Its limitation is the number of wires and their density, as only the edges of a chip are used for bonding. In addition, the bonding wire length can cause a considerable communication delay.
• Micro-bump forms a die-to-die interconnection through solder balls. It has a footprint known to be from 10 to 100 um. This approach is generally limited to stacking only two dies with face-to-face connections. It can also be used to connect more than two dies with a face-to-back design, although this is believed to be inefficient because of factors such as heat.
• Through-silicon via (TSV) is a wafer-level interconnection making use of via-holes formed through multiple wafers. The footprint of a TSV is 5 to 50 um, thus it has the potential of offering a better interconnection density than wire-bonding and micro-bump. However, it suffers from high manufacturing cost because an extra process is needed to form these interconnects. Another constraint of TSV comes from routing, as TSV interconnects interfere with gates and wires. So considering yield and cost, the number of TSV interconnects has a major impact on the design and should be considered carefully ahead of manufacturing [30].

Figure 3.1.1: 2D and 3D NoC topologies. Panels: (a) 2D planar (4 by 4); (b) 3D (4 by 2 by 2); (c) 3D (2 by 2 by 4); (d) 3D (2 by 1 by 8). Legend: a tile consists of a CPU core, a bank of L2 cache and a router; horizontal link; vertical link.
As briefly explained above, all three interconnection technologies for 3D ICs are limited in the vertical direction; that is, the die-to-die or wafer-to-wafer interconnection can become a bandwidth bottleneck. With larger numbers of such interconnects, design complexity and manufacturing cost also become severe. To capture this vertical bandwidth limitation, a 3D NoC model with heterogeneous link widths is evaluated: vertical links, which are used to move data between dies or wafers, are modeled as having smaller bit widths than horizontal links. In this chapter, the effects of having different numbers of layers (dies/wafers) are also evaluated, with the 3D NoCs modeling 2, 4 and 8 layers. An example of the baseline 2D NoC and three 3D NoC configurations is illustrated in Figure 3.1.1. A square represents a tile of the modeled NoCs while the thick and thin arrows denote horizontal and vertical links, respectively.

Figure 3.1.2: System performance degradations under link limitations for 3D NoC. Axes: normalized execution time versus workload (fft, ocean, radix, raytrace, BT, EP, LU, MG, SP). Series: 2D with 128-bit links; 3D with 128-bit/16-bit links.
Moreover, Figure 3.1.2 presents an example of how this link limitation can affect system performance. Please note that the detailed evaluation conditions and environment are given in Section 3.3. Both 2D and 3D NoCs are configured in the same way except for their link widths. For this particular evaluation, we tested a 2D NoC having 128-bit links and an 8-layer 3D NoC. For the 3D NoC, the horizontal links are set to 128-bit while the vertical links are 16-bit wide. Both configurations assume a total of 16 cores. In this evaluation, the execution time of the same workload increases by up to 130%. As shown in Figure 3.1.2, the execution times are far larger than those of the 2D NoC with 128-bit links. Thus, the vertical link bandwidth limitation can be a major bottleneck for any system moving to a 3D design.

Figure 3.1.3: Tiles of 2D and 3D NoCs. Panels: (a) 2D; (b) 3D.
In the network model shown in Figure 3.1.3b, the basic building blocks (tiles) of the 3D NoCs are connected with each other by routers and links. For comparison purposes, the tile of a 2D design is shown in Figure 3.1.3a; its router has at most six ports, two of which are used to connect to a processor core and an L2 cache bank. For 3D NoCs, two more ports may be added to the router, and through two additional links, different dies/wafers are connected. The network routing scheme is also re-defined since X-Y routing for 2D is not sufficient for the 3D design. As explained above, because of the layer-to-layer bandwidth limitation, the 3D NoC models narrower vertical links. More details on the configurations of the 3D NoC model are covered in Section 3.3, where the simulation methodology is described.
3.2 Traffic Compression on NoCs
Traffic compression is a popular architectural technique and it has been applied in many fields to conserve on-chip/off-chip bandwidth, to enlarge cache/memory capacity or to reduce communication latency. In this solution, traffic compression is used to conserve bandwidth and to reduce latency for 3D NoCs. In this section, the compression technique is discussed in detail. Firstly, the compression algorithm, frequent pattern compression, is briefly introduced. After that, the focus shifts to implementation issues of this compression algorithm. Finally, the proposed adaptive control of compression is presented.
3.2.1 Compression Algorithm and Implementation
There are several state-of-the-art traffic compression algorithms which have been applied on NoCs, including frequent pattern compression (FPC) [21] and frequent value compression (FVC) [28, 55]. In this work, FPC is chosen because of its simplicity and effectiveness. FPC is a significance-based compression scheme having small compression/de-compression overheads; unlike FVC, it has no synchronization overhead. FPC compresses frequent patterns appearing in data packets. In this solution, there are seven such patterns with which each 32-bit data word is compressed, and a full description of these patterns is presented in Figure 3.2.1. The patterns were selected based on their frequencies. In [21], it is found that zero words, words with 8-bit data and words with 16-bit data are the most frequent patterns for workloads from SPLASH-2 [54] and NPB 3 [3]. Therefore, these patterns are selected as shown in Figure 3.2.1. A 3-bit index is assigned to each of the seven data patterns. Along with another index for uncompressed data words, there are in total eight indexes, which are the compression overhead. For example, a data word of 32 zeros will be replaced with an index of 000 after compression, while an 8-bit sign-extended data word will be replaced with an index of 001 plus the 8-bit data. Please note that although indexes are fixed at 3 bits, the actual data appended to the index may differ in size. For the last index, which is "111", the data is uncompressed, so the combination of index and data is larger than the original word. FPC has the advantages of a high compression ratio and parallel compression. 128-bit data can always be split into 4 parts, each of which is compressed with a separate compression circuit. But since FPC employs variable length compression, the de-compression may have to be done in a serial manner.
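The per-word encoding described above can be sketched in Python as follows. This is a minimal software sketch, not the hardware compressor: the index assignment follows Figure 3.2.1, while the function names and the order in which patterns are tried when several match are assumptions made for illustration.

```python
def _sign_extends(value, bits, width):
    """True if the `width`-bit `value` equals the sign extension of its low `bits` bits."""
    low = value & ((1 << bits) - 1)
    if low >> (bits - 1):                       # negative: upper bits must all be ones
        low |= ((1 << width) - 1) & ~((1 << bits) - 1)
    return low == value

def fpc_encode_word(word):
    """Return (3-bit index, payload value, payload size in bits) for one 32-bit word."""
    word &= 0xFFFFFFFF
    if word == 0:
        return 0b000, 0, 0                      # zero word: index only
    if _sign_extends(word, 8, 32):
        return 0b001, word & 0xFF, 8            # 8-bit sign-extended
    if _sign_extends(word, 16, 32):
        return 0b010, word & 0xFFFF, 16         # 16-bit sign-extended
    if word & 0x00FFFFFF == 0:
        return 0b011, word >> 24, 8             # 8-bit data padded with zeros
    if word & 0x0000FFFF == 0:
        return 0b100, word >> 16, 16            # 16-bit data padded with zeros
    high, low = word >> 16, word & 0xFFFF
    if _sign_extends(high, 8, 16) and _sign_extends(low, 8, 16):
        return 0b101, ((high & 0xFF) << 8) | (low & 0xFF), 16
    byte = word & 0xFF
    if word == byte * 0x01010101:
        return 0b110, byte, 8                   # four repeated bytes
    return 0b111, word, 32                      # uncompressed: 35 bits in total
```

Because each word is encoded independently, several such encoders can work side by side, which corresponds to the parallel compression mentioned above.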
Regarding the implementation, similar to [21, 28, 55], the traffic compression/de-compression circuits in this solution are assumed to be implemented in the network interfaces (NIs) of the 3D NoCs. At the NIs, any injected data traffic is compressed and any received data traffic is de-compressed; but it is important to note that the enhanced NIs also have area, latency and energy overheads. The compression and de-compression processes are carried out for data packets only. In the evaluation, any data packet has a 512-bit body, which is the size of a cache line. When compression is applied, the 512-bit data is broken into 32-bit pieces, which are then encoded with the eight patterns shown in Figure 3.2.1. If the flit size is 128-bit and the compression ratio is between 4:3 and 2:1 (like the one shown in Figure 3.2.2), the original packet is composed of a header flit and 4 body flits while the compressed packet carries a header flit and 3 body flits. This results in a 5-flit to 4-flit packet size reduction.

Pattern 1: Full Zero Runs
00000000000000000000000000000000 => 000
Pattern 2: 8-bit Sign-extended
000000000000000000000000XXXXXXXX => 001 + XXXXXXXX
111111111111111111111111XXXXXXXX => 001 + XXXXXXXX
Pattern 3: 16-bit Sign-extended
0000000000000000XXXXXXXXXXXXXXXX => 010 + XXXXXXXXXXXXXXXX
1111111111111111XXXXXXXXXXXXXXXX => 010 + XXXXXXXXXXXXXXXX
Pattern 4: 8-bit Data Padded with Zeros
XXXXXXXX000000000000000000000000 => 011 + XXXXXXXX
Pattern 5: 16-bit Data Padded with Zeros
XXXXXXXXXXXXXXXX0000000000000000 => 100 + XXXXXXXXXXXXXXXX
Pattern 6: Two Half-words, Each as a Sign-extended Byte
00000000XXXXXXXX00000000XXXXXXXX => 101 + XXXXXXXXXXXXXXXX
00000000XXXXXXXX11111111XXXXXXXX => 101 + XXXXXXXXXXXXXXXX
11111111XXXXXXXX00000000XXXXXXXX => 101 + XXXXXXXXXXXXXXXX
11111111XXXXXXXX11111111XXXXXXXX => 101 + XXXXXXXXXXXXXXXX
Pattern 7: Word Consisting of Repeated Bytes
[Byte X][Byte X][Byte X][Byte X] => 110 + [Byte X]
Non-compressible Word
[Word X] => 111 + [Word X]
Figure 3.2.1: Patterns of the frequent pattern compression.

Figure 3.2.2: An example of the frequent pattern compression. The packet before compression (header and body, 5 flits) is shown above the packet after compression (header and compressed body, 4 flits). Markers denote the flit boundary and the indexes (compression overhead).
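The flit arithmetic of this example can be checked with a short Python sketch. It assumes, for illustration only, that the sixteen 3-bit indexes travel in the body flits together with the payload and that the header flit is unaffected by compression; the constant and function names are hypothetical.

```python
import math

WORDS_PER_LINE = 16      # a 512-bit cache line split into 32-bit words
INDEX_BITS = 3           # one FPC index per word
FLIT_BITS = 128

def packet_flits(payload_bits_per_word=None):
    """Number of flits in a data packet: one header flit plus the body flits.

    Pass None for an uncompressed packet, or a list with the payload size in
    bits of each of the 16 words for an FPC-compressed packet.
    """
    if payload_bits_per_word is None:
        body_bits = WORDS_PER_LINE * 32
    else:
        body_bits = WORDS_PER_LINE * INDEX_BITS + sum(payload_bits_per_word)
    return 1 + math.ceil(body_bits / FLIT_BITS)

if __name__ == "__main__":
    print(packet_flits())              # uncompressed packet
    print(packet_flits([16] * 16))     # every word keeps a 16-bit payload
    print(packet_flits([32] * 16))     # no word matches a pattern
```

Under these assumptions the three calls give 5, 4 and 6 flits: the middle case reproduces the 5-flit to 4-flit reduction, and the last case shows how an incompressible line grows by one flit because of the index overhead.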
As mentioned earlier, the compression process of FPC can be done in parallel for several data words at a time. As stated in [9], this compression process takes only one cycle per data word; thus, with multiple parallel encoders, the timing overhead of compression is one cycle per packet. For de-compression, since FPC is a variable length compression scheme, the de-compression cannot be carried out in parallel. But as proposed in [21], the network latency can be overlapped with part of this de-compression latency. In detail, the receiving and de-compression pipeline is designed to work with only a fraction of a packet received. After the first body flit containing the indexes of all compressed words (the compression overhead) is received, a pre-computation process obtains the length of the compressed data before its arrival. Hence, the de-compression does not need to wait for the entire compressed packet. By applying this improvement, the de-compression timing overhead can be kept within two cycles per packet. Thus, in the evaluation, one cycle of compression delay and two cycles of de-compression delay are assumed for any data packet. However, for incompressible packets, their sizes, in terms of number of flits, will be the same or even larger after compression. This opens up an opportunity for adaptive control in order to avoid negative effects, such as increased packet latency due to having more flits, or needless de-compression.
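The pre-computation step can be sketched as a prefix sum over the payload sizes implied by the indexes. This is an illustrative software model in Python rather than the pipeline of [21]; the payload sizes follow Figure 3.2.1, and the function name is hypothetical.

```python
# Payload size in bits that follows each 3-bit FPC index (per Figure 3.2.1).
PAYLOAD_BITS = {
    0b000: 0, 0b001: 8, 0b010: 16, 0b011: 8,
    0b100: 16, 0b101: 16, 0b110: 8, 0b111: 32,
}

def payload_offsets(indexes):
    """Return (offsets, total_bits) computed from the indexes alone.

    offsets[i] is the bit position at which the payload of word i starts, so
    each word can be located as soon as the indexes have been received.
    """
    offsets, position = [], 0
    for index in indexes:
        offsets.append(position)
        position += PAYLOAD_BITS[index]
    return offsets, position
```

For example, `payload_offsets([0b000, 0b001, 0b111, 0b010])` gives the offsets `[0, 0, 8, 40]` and a total payload of 56 bits, all known before any payload bit has arrived.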
In [21], it is recorded that with a 45 nm process, the area overhead and dynamic power consumption of the compressor/de-compressor circuits are 0.183 mm^2 and 0.273 W, respectively. In this work, since both the packet size and the compression/de-compression algorithm and process are the same as in [21], a similar area overhead is expected.
3.2.2 Proposed Adaptive Compression for 3D NoCs
In Section 3.1 and Section 3.2.1, the 3D NoC model and its vertical bandwidth limitation are discussed. To help mitigate the vertical bandwidth limitation and make better use of FPC, an adaptive compression technique is presented for 3D NoCs. Based on FPC, this adaptive compression scheme utilizes compressibility- and location-based mechanisms to control the compression process, whereas static FPC employs a constant-on rule under which every data packet gets compressed. For any data packet waiting to be injected into the network, two policies have been set up to determine whether the compressor should be invoked or not. There is also a third policy which combines these two proposed policies. These 3 adaptive policies are described in detail below and their characteristics, advantages and disadvantages are summarized in Table 3.2.1.

Figure 3.2.3: Compressibility-based adaptive control. Panels: (a) compression not taken; (b) compression not taken; (c) compression taken.
• Compressibility based control requires the compression process, which incurs the overhead of compression. The reason for proposing this policy is that negative compressibility and needless de-compression should always be avoided. After the actual compression process, the size of the compressed packet is known. If the compressed packet cannot achieve any flit reduction over the original packet, the network interface discards the compressed packet and instead splits and injects the original packet. With this policy, for packets whose compressibility is not good enough for any flit reduction, the timing overhead of sending more flits or carrying out a needless de-compression is saved compared to static compression. However, if the data is incompressible, one cycle per packet is lost compared to no compression. When this is the only adaptive control implemented, compressibility is always checked regardless of the packet direction. Figure 3.2.3 gives three examples of this adaptive control; only the third case has compression applied, since that packet has fewer flits after compression.

Figure 3.2.4: Location-based adaptive control. Panels: (a) compression not taken; (b) compression taken; (c) compression taken.
• Location based control is simple. It does not require the compression process to make its decision. As shown in Figure 3.2.4, this method detects packets going across layers, such as those in Figure 3.2.4b and Figure 3.2.4c, and compresses them. Layer-crossing packets can be easily detected by checking several bits of the packet header indicating the destination node. There are two reasons for proposing this policy. Firstly, it is believed that 3D NoCs will grow with an increasing number of layers, which means more traffic will be layer-crossing. Secondly, if most of the compressible packets cross layers, compressing this traffic is more promising since it also suffers from the vertical bandwidth limitation described in Section 3.1.

Figure 3.2.5: Compressibility- and location-based adaptive control. Panels: (a) compression not taken; (b) compression not taken; (c) compression taken.
• Compressibility and Location based control is the logical conjunction of the above two policies. A layer-crossing packet is examined for compressibility to determine whether its compressed form is going to be injected into the network. Please note that packets traveling within the same layer are neither checked for compressibility nor compressed. Like the second policy, this policy also targets the vertical bandwidth limitation. However, it removes any negative compressibility or needless de-compression for layer-crossing packets, and it also removes the timing overhead of the compressibility check for packets traveling in the same layer. Compared to no compression, it has one cycle of timing overhead if a layer-crossing packet is incompressible. Three examples are shown in Figure 3.2.5; only the third one has compression applied, since both conditions are satisfied.
To implement this adaptive control on FPC, it is necessary to have a bit in the header indicating the compression status of every data packet. This bit is set to "1" when the packet is compressed and to "0" when it is not.
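The decision made at the network interface under each policy can be summarized with the following Python sketch. It is a simplified model under the timing assumptions stated in the text (1 cycle for compression, which is also the cost of the compressibility check, and 2 cycles for de-compression); the class and function names are hypothetical. The returned flag corresponds to the compression status bit in the packet header.

```python
from enum import Enum

class Policy(Enum):
    NC = "no compression"
    SC = "static compression"
    AC1 = "compressibility-based"
    AC2 = "location-based"
    AC1_2 = "compressibility- and location-based"

COMPRESS_CYCLES = 1      # also the cost of the compressibility check
DECOMPRESS_CYCLES = 2

def decide(policy, crosses_layer, original_flits, compressed_flits):
    """Return (send_compressed, overhead_cycles) for one data packet."""
    full_overhead = COMPRESS_CYCLES + DECOMPRESS_CYCLES
    if policy is Policy.NC:
        return False, 0
    if policy is Policy.SC:
        return True, full_overhead
    if policy is Policy.AC2:
        # Decided from the destination bits in the header, without compressing.
        return (True, full_overhead) if crosses_layer else (False, 0)
    if policy is Policy.AC1_2 and not crosses_layer:
        return False, 0              # intra-layer packets are not even checked
    # AC1, and AC1+2 for layer-crossing packets: compress, then compare sizes.
    if compressed_flits < original_flits:
        return True, full_overhead
    return False, COMPRESS_CYCLES    # the check is paid, the original is sent
```

For instance, `decide(Policy.AC1, True, 5, 6)` returns `(False, 1)`: the packet is sent uncompressed and only the one-cycle check is lost.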
3.3 Methodology
To quantify the effects of applying the 3 adaptive compression policies, full system simulation is employed. In this section, the simulation platform is explained in detail. Firstly, the parameters of the simulation model are covered; secondly, the workloads used in simulation are briefly introduced.
Table 3.2.1: Qualitative comparisons of adaptive compression policies.
Adaptive compression policies: SC, AC1, AC2 and AC1+2.
Compression overhead per packet: SC, 3 cycles; AC1, 1 or 3 cycles; AC2, 0 or 3 cycles; AC1+2, 1 or 3 cycles.
Packets to compress: SC, all; AC1, compressible ones; AC2, die-crossing ones; AC1+2, compressible die-crossing ones.
Ignored beneficial packets: SC, none; AC1, none; AC2, compressible intra-die ones; AC1+2, compressible intra-die ones.
Compressed harmful packets: SC, incompressible ones; AC1, none; AC2, incompressible die-crossing ones; AC1+2, none.
Favored cases: SC, when all packets are compressible; AC1, when there exist incompressible packets; AC2, when most of the compressible traffic is die-crossing; AC1+2, when there exist incompressible die-crossing packets.

For the 3D NoC model, simulation is carried out for a 16-core CMP system with a shared L2 cache using the Multifacet GEMS simulator [37] based on Simics [36]. To correctly simulate traffic compression and its effect on NoCs, the detailed network model of GEMS is modified. Each core has a pair of dedicated instruction/data L1 caches and the L2 cache is divided into 16 banks. The cache coherence model uses the MOESI protocol with 2 distributed on-chip directories implemented on the bottom layer. Directories are used to maintain coherence of the memory hierarchy and serve as memory controllers; in simulation, a directory entry access costs 6 cycles, the same as the L2 cache. So any L2 cache miss at a core will result in a directory access to locate the needed data, which is either in another core's L1 cache or in the main memory. The whole memory address space is interleaved across these two directories, each of which is also a channel to the main memory. The router has a fixed 3-stage pipeline, wormhole switching and 3 virtual channels; the network interface is implemented with a 2-stage pipeline. Compression always consumes one cycle of latency while de-compression takes two cycles.
The simulation parameters also assume each core has 64 KB of L1 cache split for instruction and data. Each L2 cache bank is 256 KB. Three 3D topologies are evaluated. One has eight cores per die and two stacked dies, which forms a 4 by 2 by 2 3D mesh network. The other two have 4 cores per layer stacked as 4 layers and 2 cores per layer stacked as 8 layers; they form 2 by 2 by 4 and 2 by 1 by 8 3D mesh topologies, respectively. Note that all planar links of the 3D NoCs are 128-bit wide and all vertical links are 16-bit wide. These two link widths were chosen after considering the footprint of TSVs. The footprint of a TSV is much larger than that of a wire or a driver cell. For example, a typical size of via-last TSVs ranges from 5 um to 20 um [30], while that of an inverter cell is only 0.57 um by 2.47 um in the case of OSU's free 45 nm standard cell library. Furthermore, wire-bonding and microbump are believed to be more area hungry according to [22], as mentioned in Section 3.1.
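To give a feel for how the three topologies differ in their exposure to the narrow vertical links, the following Python sketch counts the fraction of source/destination pairs that lie on different layers. It assumes uniformly random traffic between distinct nodes, which is only an illustrative assumption; the evaluated workloads generate traffic toward L2 banks and the two directories and will not be uniform.

```python
from itertools import product

def layer_crossing_fraction(x, y, z):
    """Fraction of ordered source/destination pairs on different layers of an x-by-y-by-z mesh."""
    nodes = list(product(range(x), range(y), range(z)))
    pairs = [(src, dst) for src in nodes for dst in nodes if src != dst]
    crossing = sum(1 for src, dst in pairs if src[2] != dst[2])
    return crossing / len(pairs)

if __name__ == "__main__":
    for dims in [(4, 2, 2), (2, 2, 4), (2, 1, 8)]:
        print(dims, round(layer_crossing_fraction(*dims), 3))
```

Under the uniform assumption the fraction is 8/15 for 4 by 2 by 2, 12/15 for 2 by 2 by 4 and 14/15 for 2 by 1 by 8, so the share of layer-crossing traffic grows with the number of layers.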
Routers in this 3D NoC model employ deterministic X-Y-Z routing, and 2 more ports are needed as connections to routers on neighboring dies/wafers. Packet communication between layers assumes that each 128-bit flit is transferred over a 16-bit link in 8 cycles; however, routing and arbitration for vertically traveling flits are no different from those for non-vertical ones.
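Both mechanisms in this paragraph are easy to sketch in Python. The routing function below is a plain dimension-order router that corrects X first, then Y, then Z, and the serialization function is a ceiling division; node coordinates as (x, y, z) tuples and the function names are assumptions made for illustration.

```python
def xyz_route(src, dst):
    """List the nodes visited after leaving src, correcting X, then Y, then Z."""
    path, current = [], list(src)
    for axis in range(3):
        step = 1 if dst[axis] > current[axis] else -1
        while current[axis] != dst[axis]:
            current[axis] += step
            path.append(tuple(current))
    return path

def link_cycles(flit_bits=128, link_bits=16):
    """Cycles needed to move one flit across a link of the given width."""
    return -(-flit_bits // link_bits)    # ceiling division

if __name__ == "__main__":
    print(xyz_route((0, 0, 0), (1, 1, 2)))
    print(link_cycles(), link_cycles(128, 128))
```

With these widths a flit occupies a vertical link for 8 cycles and a planar link for 1 cycle, so on a single vertical hop a 5-flit packet holds the link for 40 cycles while a 4-flit compressed packet holds it for 32.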
For simplicity, the configurations are summarized in Table 3.3.1. Wormhole switching with credit-based flow control is used for both horizontal and vertical transfers. It is also assumed that the flow control signals for vertical transfers are implemented with TSVs, while those for horizontal transfers are implemented with metal wires within the die.
In order to have a diverse performance evaluation, nine workloads from the SPLASH-2 and NPB 3 suites, run on 16 cores, are used for the simulations [3, 54]. Both benchmark suites are implemented with OpenMP and their input sizes are stated in Table 3.3.2.
3.4 Results
Table 3.3.1: System parameters.
Component: Parameter
Processors: 16
L1 Cache: Each core has a total of 64 KB of private L1 cache (split I and D), which is 4-way set-associative and has 64 bytes per line and 1 cycle of access latency.
L2 Cache: Shared L2 cache divided into 16 banks. Each bank is 256 KB, 16-way set-associative and has 6 cycles of access latency.
Memory: 4 GB of DRAM with 160 cycles of access latency.
Topology: 16 nodes organized in three 3D mesh topologies: 4 by 2 by 2 layers, 2 by 2 by 4 layers and 2 by 1 by 8 layers.
Network Interface: 2-stage pipeline for splitting packets into flits and flit injection; and 2-stage pipeline for flit reception and combining flits into a packet. The compression/de-compression circuits are implemented here.
Router: 3-stage pipeline with X-Y-Z routing, wormhole switching and 3 virtual channels.
Link: Uneven link width is implemented; the planar link width is 128-bit and the vertical link width is 16-bit.
Compression Overhead: For all compression methods, compression takes 1 cycle while de-compression takes 2 cycles. For the compressibility based adaptive policy, the compressibility check takes 1 cycle. For the location-based adaptive policy, the destination node detection does not cost any additional cycle. Similarly, for the compressibility and location based adaptive policy, the compressibility check takes 1 cycle but it is only for packets which travel across layers, and the destination node detection does not take any additional cycle.

Table 3.3.2: Benchmark programs and inputs.
Application: Input
fft: 2^?? complex data points
ocean, contiguous: grid of 18 x 18
radix: 1048576 keys, radix of 1024
raytrace: head, scaled down by 16
BT: grid size of 12 x 12 x 12, 60 iterations, time step of 0.01
EP: 2^?? random number pairs
LU: grid size of 12 x 12 x 12, 50 iterations, time step of 0.5
MG: grid size of 32 x 32 x 32, 4 iterations
SP: grid size of 12 x 12 x 12, 100 iterations, time step of 0.015

Depending on the 3D topology, memory access characteristics, on-chip bandwidth requirements and compressibility of the workload, traffic compression on 3D NoCs can bring several benefits. This section makes clear what these benefits look like in practice. The normalized execution time for 3D NoCs under two compression schemes, static and adaptive, is quantified and discussed. Note that for adaptive compression, each policy is applied separately. In total, there are four sets of results: static compression (SC), compressibility-based compression (AC1), location-based compression (AC2) and the conjunction of AC1 and AC2 (AC1+2). These results are normalized to the execution time under no compression (NC) and are presented in Figure 3.4.1, with each histogram representing a workload.

Figure 3.4.1: Normalized execution time with static/adaptive compression on 3D NoCs.
Firstly, static traffic compression on 3D NoCs is fairly effective. Of the 27 cases simulated (9 workloads with 3 topologies), only 4 show zero or negative performance improvement, which means that for these cases the overhead of compression is not covered by the reduction in network latency. These 4 cases are SC of BT on 4 by 2 by 2 in Figure 3.4.1e and SC of EP on 2 by 1 by 8, 2 by 2 by 4 and 4 by 2 by 2 in Figure 3.4.1f.
Secondly, adaptive control of traffic compression is more effective than SC. AC1 outperforms SC for all tested workloads and configurations. This is explained by the fact that if compression is beneficial, AC1 behaves the same as SC, while if compression is not carried out because it results in more flits or no flit reduction, one cycle is spent on the compressibility check but 2 cycles are saved at de-compression, and possibly more latency is saved in the network. The improvement is found to range from 1 to 5%, thus avoiding the compression of incompressible packets is very useful. As expected, AC2 performs better than SC with more layers. With the 4 by 2 by 2 topology, AC2 outperforms SC in only one case, fft in Figure 3.4.1a; but this number climbs to 6 with the 2 by 1 by 8 topology. The 6 cases are fft in Figure 3.4.1a, ocean in Figure 3.4.1b, radix in Figure 3.4.1c, raytrace in Figure 3.4.1d, BT in Figure 3.4.1e and MG in Figure 3.4.1h. Similarly, AC1+2 also outperforms SC with more layers; it can also be noted that, because it avoids unnecessary and harmful compression of layer-crossing traffic, AC1+2 is better than SC for all workloads with the 2 by 1 by 8 topology.
Thirdly, between AC1 and AC2, AC2 misses chances to compress traffic that travels within a layer and it also suffers from unnecessary compression of packets going across layers. For these two reasons, AC1 is generally better than AC2, but with the 2 by 1 by 8 topology, AC2 is observed to outperform AC1 in two cases, raytrace and BT. This means the benefit AC1 gains by compressing intra-layer packets does not compensate for its compression and de-compression overhead, while AC2's gain from compressing layer-crossing packets well exceeds the cost of its unnecessary compression. Another reason is that with more layers, AC2 is less likely to lose chances to compress data within a layer.
Finally, after combining the two policies, AC1+2 is seen to outperform AC2 in almost all cases, for the same reason that AC1 outperforms SC. This means layer-crossing packets also benefit from the compressibility check, which improves on AC2 by rejecting all incompressible layer-crossing packets. Another important observation is that AC1+2 outperforms AC1 in two cases under the 4 by 2 by 2 topology, which are fft and radix. This number grows to 3 for the 2 by 2 by 4 topology with BT, EP and LU, and it further grows to 4 for the 2 by 1 by 8 topology with radix, raytrace, BT and MG. It can be seen that AC1+2 also performs better when the chip is implemented with more layers. This is the same as for AC2; with more layers, AC1+2 also loses fewer chances to compress traffic within a layer. One more observation is that, in some cases, the performance improvement is not larger for configurations having more layers under the location-based adaptive policies. For example, ocean with 4 layers under AC1+2 obtains the best performance improvement when compared to the cases with 2 and 8 layers. An explanation is that more critical packets are well compressed in the 4-layer case, since it is the criticality of a packet which determines whether compressing it is going to benefit performance.
More specifically, for both 2 by 2 by 4 and 4 by 2 by 2, AC1 achieves a performance improvement of up to 7% over NC, and is better than SC, AC2 and AC1+2. This 7% improvement with AC1 is seen in Figure 3.4.1b, Figure 3.4.1d and Figure 3.4.1h for ocean on 2 by 2 by 4, raytrace on 4 by 2 by 2 and MG on 4 by 2 by 2. However, with 2 by 1 by 8, AC1+2 is seen to have the best performance improvement of up to 11% over NC, in Figure 3.4.1d for raytrace.
3.5 Summary and Discussions
In this solution, it is evaluated how adaptive traffic compression affects system performance for CMPs implemented with 3D NoCs. It is also shown what difference in performance is made by the adaptive traffic compression schemes proposed in this work. In a bandwidth-limited situation such as a CMP with 3D NoCs having multiple connected layers, adaptive traffic compression with location-based control, or with both compressibility- and location-based control, is very promising if the number of layers continues to grow.
Furthermore, according to the evaluation results, if frequent pattern compression is to be utilized, then a compressibility check is a must, since it is always better than static compression. Secondly, if a 3D implementation has many layers and few cores per layer, AC1+2 is very efficient since it specifically targets the vertical bandwidth limitation and most of the traffic is layer-crossing. Finally, although the improvements vary case by case, these results are quite conservative, since the simulations are carried out with Simics, whose processor model is in-order, and relatively small input sizes are used for the workloads. In practice, modern processor cores are generally more advanced and have higher bandwidth requirements. Consequently, a larger improvement than the results shown here can be expected if a similar 3D design has the adaptive FPC implemented.
Regarding the coherence protocols, traffic compression works better with directory based protocols, since they result in fewer control packets, which are not compressible at all. As for having larger numbers of cores, stacking more processor cores is expected to benefit from adaptive traffic compression, since more traffic will suffer from the bandwidth limitation.
The council of three gave good counsel.
Anonymous
4
Latency ReduČion through In-router
Multicaﬆing: PrediČ-more Router
Of all parameters that affect system performance and power consumption, router
latency is especially critical, since inter-node communication in NoCs is
carried out hop by hop through routers. Many attempts have thus been made to
shorten router latency efficiently. The technique with the most aggressive
speculation so far, the prediction router (PR) [38], works by predictively
routing packets to pre-determined outputs before route computation is finished.
For the application traffic we test, only about 65% of the predictions made by
this technique match the computed route, even with the best algorithm. This
means that, on average, only about 65% of the packets can succeed in predictive
routing, even before contention is taken into consideration.
In this chapter, a new low-latency router is proposed for NoCs by improving PR's
prediction accuracy through "the Wisdom of Crowds". The essence of this new
router (the predict-more router, or PmR) is to carry out multiple route
predictions for one incoming packet with different prediction algorithms. This
increases the prediction accuracy (by more than 15% on average) and consequently
helps more packets succeed in predictive routing, which results in more
opportunities for latency reduction. Unlike PR, PmR has multiple predictors
using different algorithms working at the same time and a crossbar switch that
allows one-to-many traversals of the same flit. This simple change enables
multiple predictive switch traversals of a packet following multiple
predictions. Like PR, PmR maintains modularity and portability as a standalone
design and fits well in any NoC with wormhole or virtual channel routers.
4.1 Motivation
Although PR is capable of single-cycle flit transfer, its predictions hit only
46% to 80% of the time across different applications under the algorithms that
Matsutani et al. [38] proposed and evaluated (see Figure 4.1.1; evaluation
conditions are stated in Section 4.3 and the prediction algorithms are described
in Section 4.2.1). With the two better-performing algorithms, LP and FCM,
prediction hits around 65% of the time on average, so about 35% of the packets
have no chance of being accelerated. This leaves room for improving the
prediction accuracy. A very simple approach is to predict with multiple
algorithms at the same time. According to the evaluation (see Figure 4.1.1),
this increases the prediction accuracy by more than 15% on average with the best
combinations of algorithms (LP+FCM or SS+LP+FCM). Although this improvement
looks good, it does not necessarily mean that 15% more packets can be
accelerated. In fact, the more predictions are carried out, the more contention
may be created within the router. More contention degrades the efficiency of
arbitration, so the desired improvement may not be realized.

Figure 4.1.1: Prediction accuracy. [Series: PR (SS), PR (LP), PR (FCM),
PmR (SS+LP), PmR (SS+FCM), PmR (LP+FCM), PmR (SS+LP+FCM), per workload.]

Figure 4.1.2: Fractions for different numbers of concurrent flits arriving at
routers each cycle.²
² The y-axis of this figure starts from 80%.

Figure 4.2.1: Architectures of the prediction router and the predict-more
router. [Block labels: Route Computation, VC Allocator, Switch Allocator, VCs,
Predictor(s), Pipeline Register, Kill Signals, Credits In, Credits Out; inputs
and outputs numbered up to n.]

Fortunately, another evaluation (see Figure 4.1.2) shows that, for the workloads
evaluated, a router is most of the time either idle or transmitting only a
single flit. This means that most of the time the benefit of having concurrent
predictions on a packet can be readily realized. In short, the primary goal of
PmR is to accelerate more packets through more accurate predictions whenever the
internal bandwidth of a router allows.
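The reason why predicting with several algorithms at once raises the hit rate
can be illustrated with a small sketch: a packet counts as a hit if at least one
of the concurrent predictions matches the computed route, so the combined hit
rate is never lower than that of any single algorithm. The trace below is made
up purely for illustration and the function name `hit_rate` is hypothetical; no
figure from the evaluation is reproduced here.

```python
from typing import Dict, List


def hit_rate(predictions: List[Dict[str, int]], actual: List[int],
             algos: List[str]) -> float:
    """Fraction of packets for which at least one algorithm in `algos`
    predicted the output port that route computation finally selected."""
    hits = sum(
        any(pred[a] == port for a in algos)
        for pred, port in zip(predictions, actual)
    )
    return hits / len(actual)


# Made-up trace: for each packet, the output port guessed by each predictor.
trace = [
    {"LP": 1, "FCM": 1},
    {"LP": 2, "FCM": 1},
    {"LP": 3, "FCM": 2},
    {"LP": 0, "FCM": 3},
]
# Made-up route computation results for the same four packets.
actual_ports = [1, 1, 3, 2]

lp_rate = hit_rate(trace, actual_ports, ["LP"])
fcm_rate = hit_rate(trace, actual_ports, ["FCM"])
combined_rate = hit_rate(trace, actual_ports, ["LP", "FCM"])
print(lp_rate, fcm_rate, combined_rate)
```

In this toy trace each single algorithm hits for two of the four packets, while
the combination hits for three. The sketch deliberately ignores contention,
which is why a higher hit rate does not translate one-to-one into accelerated
packets.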
4.2 The Predict-more Router
In this section, the design of PmR is covered, and the architectures of PR and
PmR are discussed qualitatively.
4.2.1 Design
PmR is basically a PR with three enhancements. The first enables multiple
predictions on one packet. The second enables one-to-many traversals on the
crossbar switch. The last concerns how contention among predictions is resolved.
These enhancements are detailed below, component by component. Figure 4.2.1
shows the architectures of both PR and PmR; the components in gray are the ones
that differentiate PmR from PR.
• Input units: Predictors reside in the input units. To carry out multiple
predictions at a time, multiple predictors per input port are necessary. The
three prediction algorithms are static straight (SS), latest port (LP) and
finite context method (FCM). SS is a simple algorithm targeting
dimension-ordered routing. It predicts that a packet keeps travelling along the
same dimension. At corner nodes SS does not work, while at edge nodes a
prediction can be made along only one dimension. For this reason, not all
packets can be accelerated with SS, and its efficiency is the lowest of the
three. Because of its simplicity, the power overhead of SS is also the lowest.
LP predicts that a packet arriving at an input will be routed to the same output
as the previous packet from that input. LP works well if there is a lot of
repeated traffic. A single history buffer is required to implement LP, since it
stores the previous routing result. FCM predicts the most frequently used output
(0th-order finite context method [18]). For an N-radix router, an FCM predictor
requires a history table of N-1 entries to keep a record of the routing history.
It has the most complex and expensive design of the three.
• Virtual channel allocator and switch allocator: Both allocators in PmR are
very similar to those in PR. In PR, a VA or SA request placed following a
prediction is treated as a low-priority request, since it is speculative. This
rule is inherited from PR by PmR. A virtual channel (VC) or a time slot of the
crossbar switch can therefore be reserved if it belongs to an output predicted
at some input port and has not been allocated through a non-speculative VA or SA
operation; the reservation is turned into an allocation only once an actual
packet arrives. If there is only one packet, then only one input port's
prediction is meaningful in that cycle, so the allocations are simply granted to
this packet as the only winner of the VA or SA operation. One thing not
discussed in the work by Matsutani et al. [38] is how contention arising from
predictions is resolved. The solution adopted here is that, if two or more
packets arrive at the router from different inputs with overlapping predictions,
past prediction accuracy is used to resolve the contention. For example, if an
output is predicted by algorithm A (or combination A of multiple algorithms) at
input 1 and also by algorithm B (or combination B of multiple algorithms) at
input 2, and A at input 1 has performed better than B at input 2 in the past,
then the packet at input 1 wins the allocation. This process can be complex, but
it can be carried out before any packet arrives, since past prediction accuracy
is always known.
• Crossbar switch: Multicast support needs to be implemented in the crossbar
switch, which means that an N by N crossbar switch needs N² control signals to
invoke a multi-destination traversal.
• Kill circuit: PmR may cause multiple copies of a head flit to traverse the
crossbar switch and reach multiple predicted outputs, and only the correctly
routed copy must leave the router. The same kill circuit as in PR is therefore
also required by PmR. The difference is that multiple flit-killing operations
may be needed for one prediction in PmR.
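The component descriptions above can be summarized in a behavioural sketch. It
models the three predictors (SS, LP and 0th-order FCM), the union of their
guesses that PmR forwards the head flit to, and the set of copies the kill
circuit must discard once route computation is known. It is a minimal
illustration under stated assumptions, not the thesis implementation: the port
numbering, class names and function names are hypothetical, and timing,
allocation and hardware table sizes are not modelled.

```python
from collections import Counter
from typing import Dict, List, Optional, Set

# Hypothetical port numbering for a 5-port mesh router (an assumption).
LOCAL, NORTH, EAST, SOUTH, WEST = range(5)
OPPOSITE = {NORTH: SOUTH, SOUTH: NORTH, EAST: WEST, WEST: EAST}


class StaticStraight:
    """SS: predict that the packet keeps going straight along its dimension."""

    def predict(self, in_port: int, valid_outputs: Set[int]) -> Optional[int]:
        out = OPPOSITE.get(in_port)  # None for the local port
        return out if out in valid_outputs else None

    def update(self, in_port: int, actual_out: int) -> None:
        pass  # SS keeps no history


class LatestPort:
    """LP: predict the output used by the previous packet from this input."""

    def __init__(self) -> None:
        self.last: Dict[int, int] = {}

    def predict(self, in_port: int, valid_outputs: Set[int]) -> Optional[int]:
        return self.last.get(in_port)

    def update(self, in_port: int, actual_out: int) -> None:
        self.last[in_port] = actual_out


class FiniteContext:
    """0th-order FCM: predict the most frequently used output of this input."""

    def __init__(self) -> None:
        self.counts: Dict[int, Counter] = {}

    def predict(self, in_port: int, valid_outputs: Set[int]) -> Optional[int]:
        counter = self.counts.get(in_port)
        return counter.most_common(1)[0][0] if counter else None

    def update(self, in_port: int, actual_out: int) -> None:
        self.counts.setdefault(in_port, Counter())[actual_out] += 1


def predict_more(predictors: List, in_port: int,
                 valid_outputs: Set[int]) -> Set[int]:
    """PmR: the head flit is speculatively sent to every distinct guess."""
    guesses = {p.predict(in_port, valid_outputs) for p in predictors}
    return {g for g in guesses if g is not None}


def kill_mispredicted(predicted: Set[int], rc_output: int) -> Set[int]:
    """Outputs whose speculative copy must be discarded once RC is known."""
    return predicted - {rc_output}


ss, lp, fcm = StaticStraight(), LatestPort(), FiniteContext()
for out in (EAST, EAST, NORTH):  # made-up routing history at the WEST input
    lp.update(WEST, out)
    fcm.update(WEST, out)

all_outputs = {LOCAL, NORTH, EAST, SOUTH}
predicted = predict_more([ss, lp, fcm], WEST, all_outputs)
killed = kill_mispredicted(predicted, rc_output=EAST)
```

With this made-up history, SS and FCM both guess EAST while LP guesses NORTH, so
two copies of the head flit traverse the crossbar and the NORTH copy is killed
when route computation returns EAST. The overlap between SS and FCM in this
example also shows why adding a predictor does not always add a new guess.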
4.2.2 Architectural Discussions
In this subsection, PmR and PR are discussed from an architectural point of
view. The purpose is to identify the pros and cons of PmR qualitatively. The
discussion covers three aspects: router timing, routing efficiency and critical
path delay.
• Router timing: The timing of PmR is the same as that of PR, as shown in
Figure 2.2.1e. If a prediction hits and both a VC and a time slot of the
crossbar switch are successfully reserved, it takes 1 cycle for either PR or PmR
to transmit a flit. Otherwise, if a prediction misses or a reservation fails
because of contention, the flit is processed through the conventional datapath.
• Routing efficiency: PmR clearly causes more prediction-induced contention,
since multiple predictions at all input ports create more overlapping
predictions. The enhancement added to the allocators, however, greatly
alleviates this problem. Firstly, when the network load is low (as in
Figure 4.1.2), there is very little contention, since only one prediction is
meaningful at such a moment. Secondly, if two or more flits arrive concurrently,
past prediction accuracy helps the packet most likely to be predicted correctly
win the arbitration. PR also has overlapping predictions, but far fewer. The
original PR work did not discuss this issue, so the same convention is adopted
for both PR and PmR: prediction accuracy is used to resolve such contention.
Furthermore, the way PmR handles contention reveals an important insight: PmR's
gain in prediction accuracy from multiple predictions can be exploited
effectively when the network load is low, whereas PR should route more
efficiently than PmR when the network load is high. Likewise, predicting with
two algorithms should be more efficient than predicting with three when the
network load is high, since the former creates fewer overlapping predictions.
• Critical path delay: In terms of critical path delay, PmR is very similar to
PR if only gate delay is considered. For PR with SS, the critical path delay was
evaluated to be 5.6% longer than that of a CR, mainly because of the additional
state control that follows predictions [38]. For the same reason, the predictive
switch traversal stage of PmR is also longer. However, inside a virtual channel
router, VA is actually the longest stage [19, 45]. Since half of the VA
operation is already done through reservation following a prediction in both PR
and PmR, the critical path delay of PR and PmR should be the same as that of a
conventional router.

Table 4.3.1: System parameters.
Component            Parameter
Number of cores      16
Topology             4 × 4 mesh
Processor            4 GHz, in-order
L1 I/D cache         32 KB per core, 4-way set associative, 1 cycle access latency
L2 cache             256 KB per bank, 16-way set associative, 6 cycles access latency
Cache line size      64 bytes
Main memory          4 GB, 160 cycles access latency
Coherence protocol   MOESI, directory
Link                 128-bit, 1 cycle traversal
Packet               128-bit control, 640-bit data
Router               1 GHz, virtual channel router
Virtual channel      4 per virtual network
Virtual network      3 per physical link
Routing algorithm    X-Y routing
Process technology   32 nm
Vdd                  1 V
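The accuracy-based contention resolution discussed in Section 4.2.1 and under
routing efficiency above can be sketched as follows. When several inputs
speculatively request the same output, the input whose predictor has been more
accurate in the past wins. The record type, the function name and the
tie-breaking rule (the earlier requester keeps the grant) are assumptions made
for this illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AccuracyRecord:
    """Running count of prediction outcomes for one input port (hypothetical)."""
    hits: int = 0
    total: int = 0

    @property
    def accuracy(self) -> float:
        return self.hits / self.total if self.total else 0.0


def resolve_contention(
    requests: List[Tuple[int, int]],      # (input port, predicted output)
    history: Dict[int, AccuracyRecord],   # past accuracy per input port
) -> Dict[int, int]:
    """Return {output port: winning input port} for speculative requests.

    Ties keep the earlier requester, which is an assumption of this sketch.
    """
    winners: Dict[int, int] = {}
    for in_port, out_port in requests:
        current = winners.get(out_port)
        if current is None or history[in_port].accuracy > history[current].accuracy:
            winners[out_port] = in_port
    return winners


# Made-up history: input 1 has been right 70% of the time, input 2 only 55%.
history = {
    1: AccuracyRecord(hits=70, total=100),
    2: AccuracyRecord(hits=55, total=100),
    3: AccuracyRecord(hits=10, total=20),
}
# Inputs 1 and 2 both predict output 4; input 3 predicts output 0.
grants = resolve_contention([(1, 4), (2, 4), (3, 0)], history)

# Because past accuracy is known before any packet arrives, the priority order
# among inputs can be prepared ahead of time.
priority_order = sorted(history, key=lambda p: history[p].accuracy, reverse=True)
```

Here input 1 wins output 4 over input 2, while input 3 is granted output 0
without contention. The precomputed `priority_order` reflects the remark that
the comparison can be done before any packet comes.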
4.3 Methodology
For this solution, evaluations of performance and power are carried out with
GEMS [37] and Simics [36], extended with the network model from Garnet [7] and
the network power model from Orion [29]. To evaluate performance, the source
code of GEMS and Garnet is modified to provide cycle-accurate timing models of
PR and PmR. PR is evaluated with the three prediction algorithms described in
Section 4.2.1. For PmR, predictions are set up with combinations of these three
algorithms used at the same time. For example, SS+LP denotes a prediction with
both SS and LP, while SS+LP+FCM uses all three algorithms in one prediction. For
the power evaluation, the kill circuit is not included. For predictor power,
only the memory components inside the predictors are considered. For each LP
predictor, the power model of a 3-bit register is used, while for each FCM
predictor, an 8-bit register file is used. The number of registers in the
register file equals N-1 if the predictor is implemented in an N-radix router.
The evaluation conditions are summarized in Table 4.3.1.

Table 4.3.2: Benchmark programs and inputs.
Application              Input
cholesky                 tk29.O
fmm                      16384 particles
ocean, contiguous        grid of 258 × 258
ocean, non-contiguous    grid of 258 × 258
volrend                  head
water, nsquared          512 molecules
water, spatial           512 molecules
EP                       Ʀ?? random number pairs
FT                       grid size of 128 × 128 × 32, 6 iterations
LU                       grid size of 64 × 64 × 64, 250 iterations, time step of 2.0
In all evaluations, a 16-tile mesh network with 128-bit links is assumed. Each
tile has an in-order processor core and a bank of L2 cache with a directory.
Each corner node also has a memory controller. The access latencies of the L1
cache, L2 cache and main memory are 1 cycle, 6 cycles and 160 cycles,
respectively. These components are connected to a router individually, and
network traffic travels through routers and links. Low-latency routers such as
PmR can therefore accelerate remote L1, L2 and main memory accesses; the former
two are the best candidates, since main memory access latency is much larger
than network latency. Figure 1.1.1 in Section 1.1 illustrates the schematic view
of this simulated system and the composition of a tile. The network has three
virtual networks to support the MOESI directory coherence protocol, which has
three classes of traffic. Each router has a maximum of 6 ports, each port has
four virtual channels and each virtual channel has four 128-bit buffers. More
details are presented in Table 4.3.1. The evaluations are based on both
synthetic and application traffic. The synthetic traffic patterns used are
uniform random, bit complement and tornado. All packets in synthetic traffic
consist of 5 flits. Applications are chosen from the NPB 3.3 [3] and
SPLASH-2 [54] benchmark suites. Table 4.3.2 lists these applications and their
inputs.
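For readers unfamiliar with the three synthetic patterns named above, the
following sketch shows one common way of generating destinations for them on a
4 × 4 mesh. The definitions follow the usual textbook forms and are an
assumption here, since the source does not spell them out; in particular, the
tornado shift of ceil(k/2) - 1 along the x dimension and the exclusion of the
source node under uniform random are choices of this sketch. Function names are
hypothetical.

```python
import math
import random

K = 4              # mesh radix: 4 x 4 mesh as in Table 4.3.1
N_NODES = K * K    # 16 nodes, numbered row by row


def uniform_random(src: int, rng: random.Random) -> int:
    """Every node other than the source is equally likely (assumed variant)."""
    dst = rng.randrange(N_NODES - 1)
    return dst if dst < src else dst + 1


def bit_complement(src: int) -> int:
    """Destination address is the bitwise complement of the source address."""
    return ~src & (N_NODES - 1)


def tornado(src: int) -> int:
    """Shift by ceil(K/2) - 1 positions along the x dimension, wrapping around."""
    x, y = src % K, src // K
    shift = math.ceil(K / 2) - 1
    return y * K + (x + shift) % K


rng = random.Random(0)  # fixed seed so that a run is repeatable
sample = [(s, uniform_random(s, rng), bit_complement(s), tornado(s))
          for s in range(N_NODES)]
```

Under these definitions node 0 sends to node 15 with bit complement, and node 3
sends to node 0 with tornado on this 4 × 4 layout. Bit complement and tornado
are deterministic permutations, which is why they stress particular links much
more than uniform random traffic does.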
4.4 Results
In this section, evaluation results for various router designs, including PmR,
are presented and discussed in terms of performance and power consumption. The
performance evaluation of PmR concentrates on two specific points: whether the
improvement in prediction accuracy pays off, and how the bandwidth consumption
of the workloads affects PmR's effectiveness.
4.4.1 Prediction Accuracy and Routing Efficiency
Prediction accuracy was briefly discussed in Section 4.1. A closer look at
Figure 4.1.1 yields four further findings. Firstly, for all workloads evaluated,
LP+FCM and SS+LP+FCM are the best-performing combinations of algorithms.
Secondly, SS+LP provides a moderate improvement in prediction accuracy. Thirdly,
these three combinations always achieve better prediction accuracy than any
single prediction algorithm. Finally, SS+FCM is only marginally better than FCM
and is even outperformed by LP for half of the workloads. The reason is that the
predictions of SS and FCM mostly overlap, which means that the output on the
same dimension as the input is also the most frequently routed target.
Figure 4.4.1 provides another angle on prediction accuracy. It records the
fraction of routing that is successfully accelerated by predictions, meaning
that the prediction hits and the required router resources are also successfully
allocated to it. One observation from this figure is that LP+FCM is the most
effective combination of algorithms in routing efficiency: besides providing one
of the best prediction accuracies, it also enables the most accelerations
through prediction. Hence, it is the most productive at utilizing the router's
internal bandwidth. A second observation is that, although SS+LP+FCM is the
best-performing combination in accuracy, it does not hold the top position in
routing efficiency, since using all three algorithms creates the largest amount
of contention among predictions. Thirdly, the numbers in this figure are always
smaller than the corresponding prediction accuracies in Figure 4.1.1; the
difference comes from predictions that hit but fail to acquire the router
resources needed to speed up the routing process.
PR with SS performs poorly in routing efficiency even though its prediction
accuracy is about 60% on average. The reason is that not all packets can be
predicted with SS (as described in Section 4.2.1).
Figure 4.4.1: Percentage of packets successfully accelerated with predictions.
4.4.2 Synthetic Performance
Figure 4.4.2 presents the average latency per flit with varying injection rates
under different synthetic traffic patterns. For simplicity, only four cases are
shown: CR, PR with FCM, PmR with LP+FCM and PmR with SS+LP+FCM. The figure
yields three observations. Firstly, all three cases with prediction work well at
reducing latency under low injection load. Secondly, all prediction cases still
work under high injection load, as they saturate only at higher injection rates
than CR. Thirdly, PmR is at least as good as PR in terms of both latency
reduction (marginally better, but hard to see in Figure 4.4.2 since the best
algorithms are chosen) and bandwidth consumption (nearly the same as PR or even
smaller). Although PmR creates more contention among predictions, the
accuracy-based approach to resolving it appears to work well, at least for these
synthetic traffic patterns.
4.4.3 Application Performance
For network latency, shown in Figure 4.4.3, PmR with LP+FCM is on average the
best at latency reduction. It outperforms all PRs and, among the other PmRs, is
outperformed only by SS+LP+FCM in two cases (cholesky and ocean
(non-contiguous)). On average, the best-performing PmR (with LP+FCM) outperforms
the best PR (with LP) by 3.8% in latency reduction.
Figure 4.4.2: Network latency with synthetic traffic. [Panels: (a) Uniform
Random, (b) Bit Complement, (c) Tornado. x-axis: injection rate
(flits/node/cycle); y-axis: average latency per flit (cycles). Series: CR,
PR (FCM), PmR (LP+FCM), PmR (SS+LP+FCM).]

Figure 4.4.3: Normalized per-flit latency.

Figure 4.4.4: Normalized system speed-up.
For system speed-up, shown in Figure 4.4.4, the result is very similar to that
for network latency. PmR with LP+FCM is the best-performing combination of
algorithms on average. It outperforms all other PRs and PmRs except in three
cases (two with SS+LP+FCM, for workloads cholesky and FT, and one with both
SS+FCM and SS+LP+FCM, for workload ocean (non-contiguous)). On average, the
best-performing PmR (with LP+FCM) outperforms the best PR (with FCM) by 3.5% in
system speed-up.
The results in Figure 4.4.3 and Figure 4.4.4 do not match each other strictly.
This is not surprising, since another factor, the criticality of packets,
determines whether a reduction in network latency affects system
performance [35].
Working set size also plays an important role here. For workloads that consume
more memory (such as EP, FT and LU [5, 27, 53, 54]), the improvement in system
performance with PmRs is relatively small. More L2 cache misses result in more
main memory accesses, whose latencies are much larger than network latencies.
Workload cholesky is an exception: its system speed-up is relatively low even
with the ideal 1-cycle router, which means that cholesky is not as sensitive to
network performance as the others.
One more observation is that PmR is well behind the ideal 1-cycle router in both
network and system performance. This means that, although multiple predictions
help improve the prediction accuracy, some traffic still does not benefit from
better prediction.
As shown in Figure 4.4.5, PmR in general consumes more power than CR and PR, but
the difference varies with the algorithms used. For both PR and PmR, the power
overhead comes from the dynamic power spent at the arbiters and the crossbar
switch following predictions, and from the memory components inside the LP and
FCM predictors. PmR consumes more power since it incurs higher utilization of
router components and predictors. In the evaluation, the best PmR (with LP+FCM)
consumes only marginally more power than the best PR (with FCM), since LP has
very little power overhead. Finally, compared to CR, PmR with LP+FCM speeds up
the system by 25.2% with a power overhead of only 8.2% on the network.

Figure 4.4.5: Normalized network power consumption.
4.4.4 Summary and Discussions
In this chapter, a new low-latency on-chip router named the predict-more router
is proposed. It improves the prediction accuracy and the number of successful
predictive routings of the original prediction router by allowing multiple
predictions on one packet.
In summary, PmR is better at latency reduction than PR, although it also incurs
more contention among predictions. There are two reasons why this extra
contention does not hurt bandwidth. Firstly, link bandwidth consumption is not
affected, since the extra bandwidth consumption occurs only inside a router for
both PR and PmR. Secondly, resolving the contention with prediction accuracy
works well, so the internal bandwidth of the router is not a concern either.
From the evaluations carried out, PmR (with the best combination of algorithms)
improves the prediction accuracy by more than 15% over PR with its best
algorithm. Along with this higher prediction accuracy, a significant improvement
in routing efficiency is also observed: over 14% more packets are accelerated
through additional predictions. Finally, a 3.8% improvement in latency reduction
and 3.5% more system speed-up are also identified, on average. These benefits
come at a nearly negligible cost in power consumption if the best PmR (LP+FCM)
is compared with the best PR (FCM).
To extend the above summary, a few qualitative remarks are made regarding
network topology and core scaling.
Firstly, topology matters because it determines the radix of routers, and the
radix can affect the prediction accuracy. According to the PR work [39],
prediction accuracy with a fat tree topology is much lower than with a mesh.
Secondly, there are two forms of core scaling, which have opposite influences on
PmR. If the number of tiles scales with the number of cores, PmR should remain
effective or become even more so, since traffic then travels longer distances in
the network, which means more chances to be accelerated. However, if the size of
the network stays the same while the number of cores scales up (a denser
design), it is hard for PmR to maintain its effectiveness, since the network is
simply more stressed and, with more cores, prediction accuracy can degrade as
the radix of routers increases.
When in doubt, use brute force.
Kenneth Lane Thompson
5
Latency Reduction through In-router
Multicasting: McRouter
Going further from Predict-more Router, this chapter introduces another
low-latency on-chip router design utilizing a technique called
multicast-within-a-router, or McRouter for short.
The essence of McRouter is to achieve latency reduction by allowing itself to
consume some additional bandwidth within the router (not in the on-chip
network). Unlike the on-chip routers designed so far, McRouter has a crossbar
switch that allows multicast operations. This simple change, together with
control circuitry that checks whether the internal bandwidth of a router is
sufficient to carry out a multicast, helps hide the route computation and
arbitration delays, which enables single-cycle transfer of flits. Unlike most
speculative routers [39, 45], McRouter utilizes its internal bandwidth more
aggressively in order to shorten the router's per-hop latency. In contrast to
single-cycle routers employing look-ahead routing [41], McRouter also maintains
its modularity and portability as a standalone design and fits well in any NoC
having wormhole or virtual channel routers with virtually any routing algorithm.
5.1 Motivation
Route computation (RC) is important within a NoC router. This importance does
not come from its complexity, which depends on the routing algorithm, but from
the fact that all other operations, such as VA, SA and ST, depend on the result
of RC. Among the low-latency routing techniques reviewed in Subsection 2.2.2,
two have succeeded in hiding the RC delay and obtaining the RC result earlier.
The first is look-ahead routing (LAR). By computing the route at the preceding
router, it removes the RC delay from the per-hop latency of a router. This
technique has been extended to allow single-cycle flit transfer in [24, 41].
However, these LAR-based ideas all suffer from a common design issue: the
portability and modularity of such routers are compromised, since a router with
LAR requires assistance from upstream routers or network interfaces to carry out
RC. Moreover, in some single-cycle routers with look-ahead routing [41],
removing even the decoding of RC results from the critical path (so that the RC
delay is completely hidden) requires extra wiring between routers to carry a
one-hot encoding of the RC results. Similarly, another low-latency router has
the same design issue, since it depends on sending a packet called an advanced
bundle to set up the control for a flit to traverse a future router [34].
Another technique, which forms a standalone router design (free from the design
issue mentioned above) and is able to hide most of the RC delay, is PR [39]. It
is so far the most aggressive form of speculation found in router designs. By
predicting the RC result well ahead of an arriving packet, it also removes the
arbitration delays from the per-hop latency of a router if the prediction hits.
When the prediction misses, extra bandwidth is consumed, as a mis-routed flit
has to traverse the data path. In the evaluation, prediction hits around 65% of
the time even with the best algorithm; therefore, about 35% of the packets
cannot be conveyed in one cycle at all.
Regarding the portability and modularity of a router design, LAR is not a good
option, since it causes problems when one replaces CR with LAR in a design. For
speculation on RC (such as prediction), two important questions arise:
(1) whether such speculation is necessary and (2) whether it is possible to
trade more bandwidth for more accurate speculation.
Before going further into these questions, an evaluation is set up with
application traffic from the NPB 3.3 [3] and SPLASH-2 [54] benchmarks to
understand how the internal bandwidth of a router, especially that of its
crossbar switch, is utilized. A CMP based on a 16-tile mesh NoC is used, where
each tile is composed of a core, an L2 cache bank and, in some tiles, a memory
controller (see Figure 1.1.1). The result is shown in Figure 5.1.1; the detailed
conditions of these evaluations are covered in Section 5.4.

Figure 5.1.1: Average link utilization on a 16-core CMP connected with 4 by
4 mesh network.

As can be seen from the figure, even in the worst case a link is injected with
only about 0.031 flits/cycle on average (equal to 470 MBytes/link/sec).
Considering a router with 6 input/output ports (4 ports connected to 4
neighboring routers, 1 to a core and 1 to a bank of L2 cache), this translates
to roughly 0.2 flits/crossbar/cycle. This means that there is plenty of
bandwidth inside a router that could in turn be used to help shorten its latency
for these parallel workloads. This observation is also supported by a recent
work [49], although its evaluation has 64 processors connected with a 16-tile
mesh topology and uses another set of workloads (the PARSEC benchmark
suite [15]).
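The bandwidth estimate above can be checked with a few lines of arithmetic. The
sketch assumes 128-bit flits and a 1 GHz router clock, as listed in
Table 4.3.1; this section does not restate those values, so carrying them over
from Chapter 4 is an assumption. Under these assumptions the quoted figure of
470 MBytes/link/sec appears consistent with binary megabytes.

```python
# Worked arithmetic for the link and crossbar utilization estimate.
# Assumptions: 128-bit flits and a 1 GHz router clock (from Table 4.3.1).
FLITS_PER_LINK_PER_CYCLE = 0.031  # worst-case average read from Figure 5.1.1
FLIT_BITS = 128
ROUTER_HZ = 1_000_000_000
PORTS = 6                         # 4 neighbours + 1 core + 1 L2 bank

bytes_per_flit = FLIT_BITS // 8   # 16 bytes per flit
link_bytes_per_sec = FLITS_PER_LINK_PER_CYCLE * bytes_per_flit * ROUTER_HZ
link_mib_per_sec = link_bytes_per_sec / 2**20

# If every port injects at the per-link rate, the crossbar sees the sum.
crossbar_flits_per_cycle = FLITS_PER_LINK_PER_CYCLE * PORTS

print(f"{link_bytes_per_sec / 1e6:.0f} MB/s or "
      f"{link_mib_per_sec:.0f} MiB/s per link")
print(f"{crossbar_flits_per_cycle:.3f} flits/crossbar/cycle")
```

The per-link rate works out to about 496 × 10^6 bytes per second, or about 473
binary megabytes per second, and the crossbar load to 0.186 flits per cycle,
which rounds to the 0.2 quoted in the text. A crossbar with 6 ports can move up
to 6 flits per cycle, so only a few percent of its capacity is used.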
This observation shows that, like many other components in a computer system, a
NoC router is designed for the worst case. Two further facts can be identified
from Figure 5.1.2, which is obtained with the same evaluation parameters as
Figure 5.1.1. This figure presents the fractions of different numbers of
concurrent flits arriving at routers (the cases with 2 or more concurrent flits
are aggregated). Firstly, a router can be inactive for most of the cycles.
Secondly, when active, a router is transmitting no more than a single flit in
most cycles. Even for workload FT, which stresses the network the most, a single
flit is crossing a router only 15% of the time, while about 83% of the time a
router receives no flit at all.

Figure 5.1.2: Fractions for different numbers of concurrent flits arriving at
routers each cycle.³
³ The y-axis of this figure starts from 80%.

Now, back to the questions on speculation: if the internal bandwidth of a router
(observed in Figure 5.1.1) allows a packet to be routed to all possible outputs
(multicast), there is no longer any need to have the RC delay in the per-hop
latency, and RC exists only to acknowledge the copy of the packet that is
correctly routed. This makes the RC, VA and ST operations parallelizable, while
SA is no longer needed. To summarize, speculation on RC is no longer necessary
with multicast, which trades more bandwidth for lower latency. Last but not
least, relying on multicast also avoids the design issues related to LAR, since
neither extra wiring nor assistance from other routers is needed.
5.2 Multicast within a Router Approach and Architecture
Following Section 5.1, the proposed multicast-within-a-router architecture, or
McRouter, is presented in this section. Subsection 5.2.1 gives an overview,
while details of its design, timing and critical path delay are covered in later
subsections.
Ǎ.Ǌ.ǉ OŋĹŇŋĽĹŌ
As mentioned before, McRouter’s potential of having ǉ-cycle latency comes
from themulticast operationwithin the router. It multicasts when there is remain-
ing bandwidth on the crossbar switch; in other words, a packetmay be transmiĨed
to all possible output ports, which always includes its target output port, only if the
remaining bandwidth is able to support such an operation within a router. When
multicast operation is taken for a packet, its head Ěit immediately traverses the
entire crossbar switch (except where it comes from) while at the same time, RC
is carried out and its result acknowledges the one correctly routed head Ěit. Mis-
routed head Ěits are discarded while this RC result can then be used for any body
Ěits of this packet. If the internal bandwidth of a router is suﬃcient for a multicast
Ǐǈ
operation, SA for the head Ěit is not needed anymore. ĉe router is able to serve a
multicast operation for any single Ěit coming if the entire crossbar switch is not go-
ing to serve any traversal. Multicast only happens when the remaining bandwidth
within a router can aﬀord it. Flits beingmulticast also have lower priority when ac-
quiring router resources. ĉese two reasons havemadeMcRouter less demanding
than it looks.
Like LAR and PR, McRouter works with virtually any routing algorithm, since
multicasting simply covers all possible RC outcomes. In this work, X-Y routing
is used since it is the most popular and straightforward routing algorithm for a
NoC router.
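As an illustration of the routing algorithm assumed here, the following is a
minimal sketch of dimension-ordered X-Y routing. The function name, the port
names and the coordinate convention (y grows towards "north") are assumptions
made for this sketch only and are not taken from the evaluated implementation.

```python
def xy_route(cur, dst):
    """Return the output port chosen by dimension-ordered X-Y routing.

    cur and dst are (x, y) tile coordinates. The port names and the
    convention that y grows towards "north" are assumptions of this sketch.
    """
    cx, cy = cur
    dx, dy = dst
    # Movement along the X dimension is finished first.
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    # Only then does the flit turn into the Y dimension.
    if dy > cy:
        return "north"
    if dy < cy:
        return "south"
    return "local"  # the flit has reached its destination tile
```

Because the X dimension is always resolved first, a flit that already travels
along Y never turns back into X, which is the property the multicast operation
can exploit to limit the outputs it covers.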
In a multicast operation, the crossbar switch of McRouter is almost entirely
occupied (not always, as detailed in the next subsection), thus a conflict occurs
if more than one packet is traversing McRouter in the same cycle. In such a
situation (rare, as seen from Figure 5.1.2), a contention-free policy is followed:
no multicast is allowed and all packets traverse the conventional data and
control paths.
Another thing worth mentioning is that McRouter handles multi-flit packets well.
To retain the single-cycle operation, McRouter carries out SA for body flits before
they actually show up. More details are covered in Subsection 5.2.3.
Figure 5.2.1: Architecture of the conventional router. (Labeled blocks: Route
Computation, VC Allocator, Switch Allocator, VCs, Pipeline Registers; ports:
Input 1 to Input n, Output 1 to Output n, Credits In, Credits Out.)
Figure 5.2.2: Architecture of McRouter. (Labeled blocks: Route Computation,
VC Allocator, Switch Allocator, VCs, Multicast Unit; signals: ACK, Valid, VCID;
ports: Input 1 to Input n, Output 1 to Output n, Credits In, Credits Out.)
5.2.2 Architectural Changes and the Multicast Operation
To support multicast operations inside a router, several architectural changes
are necessary. These changes are detailed in this subsection, together with how
the components function when multicasting. Starting from a CR, the
modifications to each component of McRouter are described. Figure 5.2.1
illustrates the architecture of CR and Figure 5.2.2 that of the proposed
McRouter. In Figure 5.2.2, changes made from Figure 5.2.1 are shown in grey.
Input units: Direct connections from the input links to the crossbar switch are
required here, since when a packet is being multicast, all of its flits may
traverse the crossbar switch immediately.
Multicast unit: This unit determines if a multicast operation is going to be
taken and, if so, how other parts of the router need to function. Two sets of
signals from the input units are supplied to this unit: the first is the flit type
and the second is the virtual channel ID (VCID). Another two sets of input signals
come from the switch allocator and the route computation unit. The input signals
from the switch allocator tell which input and output ports of the crossbar switch
were granted in the last cycle and are therefore occupied in this cycle. The input
signals from the route computation unit supply the RC result of a packet being
multicast. The flit type helps determine if a multicast operation is going to be
taken. For example, if only one head flit enters the router, it can be multicast if
no SA request was granted in the last cycle. The flit type also helps invoke the
speculative VAs needed when multicasting a head flit. Flit type, VCID, SA grants
and the RC result are used together to let a body flit traverse in a single cycle
if a multi-flit packet is multicast. In more detail, before a body flit of a
multi-flit packet enters McRouter, if the input and output required by this body
flit are not granted to other SA requests, then this flit is able to traverse the
crossbar switch immediately. Last but not least, in the case of multicasting, this
unit controls the crossbar switch directly.
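The decisions described for the multicast unit can be sketched in software as
follows. This is only an illustrative model under stated assumptions: the
function names, the string encoding of flit types and the representation of SA
grants as (input, output) pairs are hypothetical, and the real unit is a piece
of combinational logic rather than code.

```python
def can_multicast_head(arriving_flit_types, sa_grants_last_cycle):
    """Decide whether the single arriving head flit may be multicast.

    arriving_flit_types: list of flit types ("head", "body", "tail") seen on
    the input links this cycle, one entry per arriving flit.
    sa_grants_last_cycle: set of (input_port, output_port) pairs granted by
    the switch allocator in the previous cycle (occupied in this cycle).
    """
    return (
        len(arriving_flit_types) == 1          # no other packet competes
        and arriving_flit_types[0] == "head"   # only head flits are multicast
        and not sa_grants_last_cycle           # crossbar is entirely free
    )


def body_can_bypass(in_port, out_port, sa_grants_last_cycle):
    """A body flit of a multicast packet traverses immediately when neither
    its input nor its acknowledged output is held by another SA grant."""
    busy_inputs = {i for i, _ in sa_grants_last_cycle}
    busy_outputs = {o for _, o in sa_grants_last_cycle}
    return in_port not in busy_inputs and out_port not in busy_outputs
```

For instance, a lone head flit arriving while the switch allocator has granted
nothing is multicast, whereas two head flits arriving in the same cycle both
fall back to the conventional pipeline.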
Route computation unit: For McRouter to work correctly, the route of a packet
still needs to be computed in order to acknowledge the correctly routed head flit
while the packet is being multicast. Therefore, the route computation unit requires
a set of additional input and output signals. Firstly, a group of input signals
comes directly from the input links, carrying the information encoded in the head
flit for immediate RC. Secondly, a 1-bit output from this unit is required to form
an “ACK” signal, which is used for the acknowledgment of multicast. The time
required to generate this signal depends on which routing algorithm is used, and
it is much shorter than the ST delay in practice [19, 45]. Finally, the RC result
is also supplied to the multicast unit to help body flits traverse in the case of
multicasting a multi-flit packet.
VC allocator: Apart from its normal operation, the VC allocator is required to
allocate a VC to the correctly routed head flit in case of a multicast. Although
there is only one correctly routed head flit, this VA has to be done for all
potential outputs of this head flit since the RC result is not yet known at this
point. Additionally, since this head flit is going to leave the router in 1 cycle,
modifying the “VCID” field at the input buffers would come too late, so this has
to be done at the pipeline registers of the ST stage of the router. If a valid
“VCID” is not successfully acquired for the correctly routed head flit because all
VCs are occupied, the multicast operation fails and the packet enters the
conventional data path. Thus, a 1-bit output acknowledging that a valid “VCID” has
been obtained is also required to form the “ACK” signal mentioned in the last
paragraph. To invoke VA immediately while a head flit is being multicast, a 1-bit
signal taken from the flit type field of each input is needed; it is “1” when a
head flit enters a router, and it helps determine if a multicast is about to be
taken and which outputs have to be allocated VCs. Note that there is in fact no
input arbitration needed when obtaining VCs for a multicast head flit, since it
(the input VC) requires multiple output VCs (one for each potential output)
allocated before RC is out. Moreover, output arbitration for such a head flit in VA
has lower priority than VA carried out for non-multicast head flits in the
conventional data and control paths. In practice, McRouter is implemented with a
separable output-first VC allocator. After the output arbitration, remaining VCs
that are least recently allocated are used to supply the multiple VCs that a
multicast head flit needs. Note that only the VC allocated to the correctly routed
head flit is actually meaningful, while all the other VC allocations related to
this multicast operation are simply reverted.
Switch allocator: If the switch allocator holds no request to occupy the crossbar
switch, the crossbar switch is able to multicast a head flit in the next cycle.
This information is passed on to the multicast unit, along with which input and
output ports of the crossbar are occupied if any request is granted.
Table 5.2.1: Destination output ports when multicasting considering incoming
ports.⁴
            Destination
Source      North   South   West   East
North       No      Yes     No     No
South       Yes     No      No     No
West        Yes     Yes     No     Yes
East        Yes     Yes     Yes    No
Table 5.2.2: Destination output ports when multicasting considering packet
types.⁴
            Destination
Type        Core    L2$/Dir   MC
Request     Yes     Yes       No
Forward     No      No        Yes
Response    Yes     Yes       Yes
⁴Table entries denote “Yes” or “No”.
Crossbar switch: Multicast support is implemented in the crossbar switch,
which means that, with an N by N crossbar switch, N^2 control signals are required
to invoke a proper multicast. A multicast operation does not always occupy
the entire crossbar switch. The routing algorithm and packet types may help
minimize the usage of the crossbar while multicasting. Firstly, taking X-Y routing
(used in all evaluations) as an example, a flit turns its direction when its
movement on the X-dimension has finished. This means that if a flit comes from the
northern or southern port, it will never be routed to the western or eastern port,
which reduces the cost of multicasting an already-turned flit. This is summarized
in Table 5.2.1. Secondly, coherence protocols such as MOESI have three types
of packets to maintain cache coherence: request, forward and response.
Request messages target cores or L2 cache banks, which means that McRouter
need not consider a memory controller as a routing destination for request
messages. Similarly, forward messages target memory controllers only, so cores and
L2 cache banks are not considered as routing destinations for forward messages.
Response messages can target all types of destinations, so the packet type does
not help here. This is summarized in Table 5.2.2.
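The two pruning rules above (incoming port under X-Y routing, and packet type)
can be combined into a small sketch that computes which outputs a multicast
head flit has to cover. The port names ("core", "l2dir", "mc"), the function
name and the set-based representation are hypothetical and only serve this
illustration.

```python
DIRECTIONS = ("north", "south", "west", "east")

# Local destinations a packet type may target, following the text above.
LOCAL_TARGETS = {
    "request": ("core", "l2dir"),
    "forward": ("mc",),
    "response": ("core", "l2dir", "mc"),
}


def multicast_outputs(in_port, packet_type, router_ports):
    """Return the set of output ports a head flit is multicast to.

    in_port: port the flit arrived on (a direction or a local port name).
    packet_type: "request", "forward" or "response".
    router_ports: ports that physically exist on this router.
    """
    outputs = set()
    for d in DIRECTIONS:
        if d == in_port:
            continue  # a flit is never sent back to where it came from
        # X-Y routing: a flit already moving along Y never turns back into X.
        if in_port in ("north", "south") and d in ("west", "east"):
            continue
        outputs.add(d)
    # Packet type restricts which local ports can be the destination.
    outputs.update(p for p in LOCAL_TARGETS[packet_type] if p != in_port)
    return outputs & set(router_ports)
```

Under these assumptions, a request arriving from the north only needs the
south, core and L2/directory outputs, so far fewer than all crosspoints are
driven.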
Head flit acknowledgment circuit: A multicast operation results in multiple
copies of a head flit traversing the crossbar switch to reach all potential outputs.
It is necessary that only the correctly routed one gets acknowledged. This requires
two NAND gates and a multiplexer. In the case of multicasting, one NAND gate
is used to acknowledge the correctly routed head flit while the multiplexer and the
other NAND gate are used to write the VCID in time.
5.2.3 Timing
Figure 5.2.3: Pipeline stages of McRouter.
Figure 5.2.4: Best case transmission of a multi-flit packet in McRouter.
(Legend: A = the multicast switch traversal operation for the head flit; B = this
VA is carried out for all potential output ports of the head flit; C = these SAs
are carried out before the body flits actually come; different from a normal SA
operation, these SAs simply check if the input and output of the crossbar switch
needed by a future body flit are available.)
This subsection presents the best-case timing of McRouter. Figure 5.2.3
illustrates how a head flit of a packet is transmitted through the pipeline stages
of McRouter. When a head flit is being multicast (shown as McST at Routers B and
C), it takes one cycle to reach the links; if it is not being multicast (at Router
A), the conventional data path of the router is invoked and it spends four cycles
to finish routing. Figure 5.2.4 shows how an entire packet is transmitted in the
case of multicasting. For any packet whose head flit is being multicast, the first
cycle at McRouter is the key. Not only is the head flit multicast, but a route
computation is also carried out to acknowledge the correctly transmitted head
flit. In addition, multiple VCs are required to be allocated for this head flit
before RC is done. Meanwhile, SA (mainly to check which inputs and outputs of the
crossbar are vacant) is carried out to see if the switch allocator grants anything;
a body flit of the packet is able to traverse the crossbar if the input and output
it needs are not going to be occupied.
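The best-case timing above can be turned into a small zero-load latency model
for a head flit. This is a minimal sketch: the constant and function names are
hypothetical, and charging one link traversal after every router is an
assumption of the sketch rather than something stated by the timing figures.

```python
CR_CYCLES = 4      # RC, VA, SA and ST on the conventional path
MCST_CYCLES = 1    # multicast switch traversal (McST)
LINK_CYCLES = 1    # one-cycle link traversal


def head_flit_latency(multicast_per_hop):
    """Zero-load latency (in cycles) of a head flit over a path of routers.

    multicast_per_hop: iterable of booleans, one per router on the path,
    True where the head flit is multicast. One link traversal is charged
    after every router, which is an assumption of this sketch.
    """
    total = 0
    for multicast in multicast_per_hop:
        total += MCST_CYCLES if multicast else CR_CYCLES
        total += LINK_CYCLES
    return total
```

With these assumptions, a three-router path where only the first router falls
back to the conventional pipeline costs `head_flit_latency([False, True, True])`
= 9 cycles, against 15 cycles when no router can multicast.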
5.2.4 Critical Path Delay
The longest pipeline stage determines the critical path delay of a router.
According to the literature [19, 45], VA is generally the longest pipeline stage in
a router. As described in Subsection 5.2.2, the operation of the multicast unit is
mostly in parallel with VA, so its delay can be hidden. While the first rank of
arbiters of the VC allocator is arbitrating the output VCs for non-multicast
packets, the multicast unit is deciding if a multicast operation is going to be
taken and, if yes, it then sets up the crossbar switch. This decision on
multicasting is then used at the VC allocator to allocate the additional output VCs
needed to accomplish this multicast operation. At the same time, the VC allocator's
second rank of arbiters is arbitrating the input channels for non-multicast
packets. Thus, the delay of the multicast unit is in parallel with the delay of the
VA stage. The operation and existence of the multicast unit do add delay to the ST
stage, since the multicast unit needs to control the crossbar switch directly when
multicasting. However, the simplicity of the multicast unit, which is pure logic,
guarantees that this delay is well hidden, since ST delay is much shorter than VA
delay in practice [19, 45].
A delay which cannot be hidden in McRouter's data and control paths is at its
head flit acknowledgment circuit. In Figure 5.2.2, a NAND gate has to be placed
between an output of the VC allocator and the pipeline registers of the ST stage.
This NAND gate adds roughly one FO4 delay to McRouter's critical path, which is
negligible.
5.3 Architecture Discussions and Qualitative Comparisons
In this section, discussions on McRouter and comparisons against VSAR [45]
and PR [39] are presented. The purpose of these discussions and comparisons is
to identify the pros and cons of McRouter. VSAR and PR are chosen as
counterparts because both are standalone designs like McRouter, which maintain
portability and modularity and do not require any assistance from upstream
routers or any change in top-level wiring; this means that VSAR, PR and McRouter
can simply replace any CR in NoCs. Before going into details, one thing in
common for all routers discussed in this section is that they are all based on CR,
so when speculation, prediction or multicast fails, they all follow the control and
data paths of CR.
As reviewed in Section 2.2, CR is a 4-cycle router when free of contention, VSAR
is a 3-cycle router if speculation succeeds and PR is a 1-cycle router if prediction
succeeds; as introduced in Section 5.2, McRouter is a 1-cycle router if multicast
is successfully carried out. Regardless of how latencies are hidden, differences
can be identified in four aspects: control dependency, speculation, routing
efficiency and power overhead. These comparisons are summarized in Table 5.3.1.
5.3.1 Control Dependency
Every pair of neighboring pipeline stages in a CR has a control dependency
(RC and VA, VA and SA, and SA and ST). VSAR hides the control dependency
between VA and SA by speculating that SA can be successfully performed in
parallel with VA if VA returns a valid VCID; otherwise carrying them out in
parallel is meaningless. PR does not hide any control dependency by prediction.
With predicted routing available every cycle, VA and SA are carried out in parallel
to shorten the preparation time for a potential predictive switch traversal.

Table 5.3.1: Qualitative comparisons of low latency routers including McRouter.
Control dependency
  VSAR:     Control dependency between VA and SA can be hidden with speculation.
  PR:       No control dependency hidden.
  McRouter: Control dependencies between RC and VA/ST are broken while the
            dependency between VA and ST is hidden.
Speculation
  VSAR:     VA/SA can succeed at the same cycle.
  PR:       RC is predicted.
  McRouter: A valid VC can be obtained for the head flit while a valid time slot
            of the crossbar switch is available for the first body flit.
Routing efficiency
  VSAR:     3-cycle routing when contention is low at VA/SA.
  PR:       1-cycle routing when prediction hits.
  McRouter: 1-cycle routing when router bandwidth allows multicasting.
Power overhead
  VSAR:     Static: an extra switch allocator; dynamic: additional accesses to
            the switch allocator when speculation fails.
  PR:       Static: predictors and kill circuit; dynamic: predictor activities,
            additional access to the virtual channel allocator, the switch
            allocator and the crossbar switch when prediction fails, and activity
            at the kill circuit in case of mis-prediction.
  McRouter: Static: multicast unit and the acknowledgment circuit; dynamic:
            multicast unit activity, additional accesses to the virtual channel
            allocator, the switch allocator and the crossbar switch when multicast
            is taken, and activity at the flit acknowledgment circuit in the case
            of a multicast operation.
For McRouter, dependencies between RC and VA/ST are broken when a head
flit is being multicast, as going to every possible output means routing is always
correct. The dependency between VA and ST is also hidden since VA is carried out
for every participating output in a multicast operation. For a multicast head flit,
SA is no longer needed and the dependency between SA and multicast ST (or McST)
does not exist, since multicast operations only take place when the crossbar has
enough bandwidth and there is no need to allocate the crossbar switch in such
a situation. However, the dependency between SA and ST still exists for a body
flit whose head flit is being multicast. After route computation acknowledges a
head flit, any body flit which follows this head flit requires two stages, SA and
ST, to traverse the crossbar. Hence, SA should be carried out before the actual
body flit comes, which helps McRouter maintain its one flit per cycle performance.
5.3.2 Speculation
CR has no speculation, while VSAR speculates that VA and SA both return valid
results in the same cycle. For PR, three speculations are made if a head flit is
going to be routed predictively. First, RC is speculated with predictions. Secondly,
VA and SA have to be performed in parallel based on this predicted RC result.
Thirdly, while the head flit predictively traverses the crossbar switch, the first
body flit which follows it has to enter SA speculatively with the predicted RC
result.
McRouter has two sources of speculation. Firstly, for VA, every output involved
in a multicast operation has to obtain a valid VC, and the speculation is that this
VA succeeds for the correctly routed output, which guarantees the success of
this multicast. Secondly, while the head flit is multicast, the first body flit
which follows it has to enter SA speculatively (at every output involved in this
multicast operation). The success of this SA helps guarantee that the first body
flit traverses McRouter in one cycle. However, there is no speculation on RC
because multicasting is always correct. Flits being multicast always reach the
target output port without RC, and RC is only used to acknowledge the correctly
routed head flit.
5.3.3 Routing Efficiency
The amount of contention determines the efficiency of VSAR: the more flits
VSAR has in its data path competing in VA/SA, the less chance it has to speculate
successfully. For PR, besides the contention from having more flits, another
problem is that contention also comes from predictions. VA/SA following a
predicted RC may fail for two reasons. First, predictions for two or more input
ports may overlap. Secondly, VA and SA following a prediction may overlap with
flits that are not predictively routed, for example, the body flits. In a word,
having prediction is similar to having more flits in the router. Furthermore,
results of predictions also affect the efficiency because mis-predictions result in
a head flit being re-routed through the conventional pipeline.
McRouter's efficiency is determined by how many flits can be multicast, and
this is related to how plentiful the bandwidth is within the router; that is,
McRouter is more efficient when the network load is low (as in Figure 5.1.1 and
Figure 5.1.2). When the network load is too high, multicast cannot be taken and
its routing efficiency is as low as that of a CR.
5.3.4 Power Overhead
The power overhead of VSAR, PR and McRouter comes as both static and dynamic
power. For VSAR, the static power overhead is from the extra switch allocator
which handles speculative SA requests, while the dynamic power overhead is
from the additional accesses to the switch allocators when speculation fails. For
PR, more static power is consumed by the predictors and the kill circuit. More
dynamic power is consumed by PR in three ways. Firstly, predictors are
activated when a packet arrives at a PR (the predicted RC result may be refreshed).
Secondly, additional accesses to the virtual channel allocator, the switch allocator
and the crossbar switch exist when prediction fails. Thirdly, the kill circuit is
also activated in case of mis-prediction.
For McRouter, the static power overhead comes from the multicast unit and the
acknowledgment circuit. The dynamic power overhead of McRouter is incurred when
multicast succeeds. Firstly, the multicast unit is activated when any packet
arrives at a McRouter. Secondly, extra accesses to the virtual channel allocator,
the switch allocator and the crossbar switch exist when multicast is taken.
Thirdly, the flit acknowledgment circuit is activated to allow the correctly routed
flit to pass in the case of a multicast operation.
Apart from the power overhead, low latency routers like VSAR, PR and McRouter
can help reduce the total energy consumed by the system, since these router
designs are able to speed up the system and thus shorten the execution time.

Table 5.4.1: System parameters.
Component             Parameter
Number of cores       16
Topology              4 × 4 mesh
Processor             4 GHz, in-order
L1 I/D cache          32 KB per core, 4-way set associative, 1 cycle access latency
L2 cache              256 KB per bank, 16-way set associative, 6 cycles access latency
Cache line size       64 Bytes
Main memory           4 GB, 160 cycles access latency
Coherence protocol    MOESI, directory
Link                  128-bit, 1 cycle traversal
Packet                128-bit control, 640-bit data
Router                1 GHz, virtual channel router
Virtual channel       4 per virtual network
Virtual network       3 per physical link
Routing algorithm     X-Y routing
Process technology    32 nm
Vdd                   1 V
5.4 Methodology
Evaluations are carried out on performance and power by using GEMS [37] and
Simics [36], extended with the network model from GARNET [7] and the power
model from Orion [29]. To evaluate performance, the source code of GARNET
is modified to provide cycle-accurate timing models of VSAR, PR and McRouter.
We evaluate PR with two prediction algorithms, latest port (LP) and finite
context method (FCM, the same as most-frequently-used). In the power
evaluation, the power consumption of low latency routers is quantified by looking
at the component power models in Orion. For VSAR, the power consumed by the
extra switch allocator is added. For PR, both the power consumption of memory
components inside the predictors and the power consumption from extra router
component accesses incurred by mis-predicted flits are considered. For each LP
predictor, the power model of a 3-bit register is used, while for each FCM
predictor, an 8-bit register file is used. The number of registers in the register
file equals N-1 if the predictor is implemented in an N-radix router. Similarly,
the power model of McRouter includes power overheads from the extra accesses to the
router components resulting from multicasting. The evaluation conditions are
summarized in Table 5.4.1.

Table 5.4.2: Benchmark programs and inputs.
Application              Input
barnes                   4096 particles
cholesky                 tk29.O
fmm                      16384 particles
ocean, contiguous        grid of 258 × 258
ocean, non-contiguous    grid of 258 × 258
raytrace                 teapot
volrend                  head
water, nsquared          512 molecules
water, spatial           512 molecules
EP                       2^?? random number pairs
FT                       grid size of 128 × 128 × 32, 6 iterations
IS                       2^?? keys with a max key of 2^??
LU                       grid size of 64 × 64 × 64, 250 iterations, time step of 2.0
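To make the two PR prediction algorithms named in this section concrete, the
sketch below models a latest port (LP) predictor and a most-frequently-used
predictor (the behavior attributed to FCM here). The class names, the use of
saturating counters and the handling of an untrained predictor are assumptions
of this sketch and not details of the evaluated PR implementation.

```python
class LatestPortPredictor:
    """Predicts the output port used by the previous packet on this input."""

    def __init__(self):
        self.last_port = None  # a single port ID is all the state needed

    def predict(self):
        return self.last_port

    def update(self, actual_port):
        self.last_port = actual_port


class MostFrequentPredictor:
    """Predicts the output port that has been used most often so far."""

    def __init__(self, candidate_ports, counter_bits=8):
        self.max_count = (1 << counter_bits) - 1
        self.counts = {p: 0 for p in candidate_ports}

    def predict(self):
        if not any(self.counts.values()):
            return None  # nothing observed yet
        return max(self.counts, key=self.counts.get)

    def update(self, actual_port):
        # Saturating counters: an assumption about overflow handling.
        if self.counts[actual_port] < self.max_count:
            self.counts[actual_port] += 1
```

In both cases `update` would be called with the real RC result once it is
known, so a mis-prediction only costs the fall-back to the conventional
pipeline.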
In all evaluations (except some sensitivity studies), a 16-tile mesh network with
128-bit links is assumed. Each tile has an in-order processor core and a bank of L2
cache/directory. Each corner tile also has a memory controller. Similar to other
low latency routers, McRouter can accelerate remote L1, L2 and main memory
accesses; the former two are the best candidates since main memory access latency
is much larger than network latency. The assumption of using in-order cores in
evaluation is reasonable since these cores run at 4 GHz, 4 times the frequency
of the network (so the network is properly stressed). The schematic view of this
simulated system and what a tile is composed of are illustrated in Figure 1.1.1 in
Section 1.1. The entire network is set to have three virtual networks to support
the MOESI directory coherence protocol, which has three classes of packets. Each
router has a maximum of six ports and each port has four virtual channels, while
each virtual channel has four 128-bit buffers. More details are presented in
Table 5.4.1. Evaluations are carried out with both synthetic and application
traffic. The synthetic traffic pattern is uniform random and the traffic is made of
data packets only (each data packet has 5 flits). The application traffic is based
on workloads chosen from the NPB 3.3 [3] and SPLASH-2 [54] benchmark suites.
Table 5.4.2 lists these applications and their inputs. Last but not least, in
Subsection 5.5.3, some parameters are downscaled to see how McRouter behaves under
different situations.
5.5 Results
In this section, the evaluation results on McRouter are presented and discussed
in terms of its performance and power consumption. As stated in Section 5.4,
McRouter is compared to CR, VSAR and PR in evaluations to test its effectiveness
and how bandwidth consumption affects it. Sensitivity studies are also set
up to clarify McRouter's effectiveness in a few bandwidth-constrained situations.
5.5.1 Synthetic Traffic
Figure 5.5.1: Evaluations with synthetic traffic. (a) Per-flit latency (cycles)
versus injection rate (flits/node/cycle) for CR, VSAR, PR (LP), PR (FCM) and
McRouter. (b) Fraction of accelerated flits versus injection rate
(flits/node/cycle) for PR (LP), PR (FCM) and McRouter.
In Figure 5.5.1a, average per-flit latency versus injection rate per node is
presented, while the fraction of accelerated traffic is reported in Figure 5.5.1b.
All routers mentioned above are included in Figure 5.5.1a, while only PR with the
LP/FCM algorithms and McRouter are shown in Figure 5.5.1b, since the latter only
covers the amount of traffic accelerated to pass a router in 1 cycle.
In Figure 5.5.1a, when different amounts of traffic are injected into the network,
the routers perform very differently. When the injection rate is low (up to
0.05 flits/node/cycle; this is roughly equal to 760 MBytes/node/sec and results in
a link utilization of 0.07 flits/link/cycle), McRouter outperforms all
counterparts. This also matches Figure 5.5.1b, where more flits can be accelerated
with McRouter than with PR with FCM when the injection rate is under
0.05 flits/node/cycle. Judging from Figure 5.1.1, no application injects enough
traffic to reach a link utilization of 0.07 flits/link/cycle, and this is confirmed
again by the evaluations with application traffic in the next subsection.
From Figure 5.5.1a, it can also be seen that McRouter maintains its advantage
over CR until the injection rate reaches 0.225 flits/node/cycle (roughly equal to
3430 MBytes/node/sec, resulting in a link utilization of 0.34 flits/link/cycle).
This means that the performance of McRouter is better than CR unless the injection
rate is very high (already near saturation).
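The conversion between injection rate and per-node bandwidth quoted above can
be reproduced with a short calculation, assuming 128-bit flits and a 1 GHz
router clock as listed in the system parameters. Reading "MBytes" as 2^20 bytes
is an assumption of this sketch, as are the constant and function names.

```python
FLIT_BYTES = 128 // 8        # 128-bit flits and links
ROUTER_HZ = 1_000_000_000    # 1 GHz router clock
MIB = 1024 * 1024            # "MBytes" read as 2**20 bytes: an assumption


def injection_bandwidth_mib(flits_per_node_per_cycle):
    """Per-node injected bandwidth in MiB/s for a given injection rate."""
    bytes_per_second = flits_per_node_per_cycle * FLIT_BYTES * ROUTER_HZ
    return bytes_per_second / MIB
```

Under these assumptions, 0.05 flits/node/cycle gives about 763 MiB/s and
0.225 flits/node/cycle gives about 3433 MiB/s, consistent with the rounded
figures of 760 and 3430 MBytes/node/sec.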
5.5.2 Application Traffic
Figure 5.5.2 shows the network performance (note that lower is better) under
different router designs. Except for ocean (non-contiguous), where PR with the FCM
algorithm is 0.4% better than McRouter, McRouter clearly outperforms all other
router designs. On average, McRouter shortens the per-flit latency by 22% over
CR while it also provides an additional 4% reduction over the best PR. At best,
McRouter achieves a 28% latency reduction over CR for water (spatial) and a 6%
latency reduction over PR for water (nsquared).
Figure 5.5.2: Normalized per-flit latency.
Figure 5.5.3 presents the system performance (note that higher is better). A
similar outcome is observed: McRouter outperforms every other router design
except for workload cholesky, where PR with the LP algorithm is 1% better. Looking
at the geometric mean, McRouter helps the system achieve a speed-up of 1.28 over CR
while on average it also provides a speed-up of 1.05 over the best PR. In the best
case, a speed-up of 1.48 is observed for McRouter over CR for workloads barnes
and water (spatial), while a speed-up of 1.08 is observed for raytrace over the
best PR.
These performance numbers demonstrate that McRouter is the best-performing
router design among those evaluated for these application workloads. Although both
McRouter and PR are 1-cycle routers, the number of times McRouter successfully
multicasts a packet is higher than the number of successful predictions PR has,
under the evaluated workloads.
Another observation is that, as intended by its design, on-chip bandwidth is a
determining factor in how well McRouter can perform. For workloads that consume
more on-chip bandwidth (such as ocean (non-contiguous), EP, FT, IS and LU), the
improvements with McRouter on both network and system performance are relatively
smaller. Another reason behind this is that, similar to PmR, working set size also
plays an important role here. For the NPB workloads that consume more memory, the
improvement with McRouter on system performance is smaller since more L2 cache
misses result in more main memory accesses, whose latencies are far larger than
network latencies. Workload cholesky is an exception since all low latency router
designs (including McRouter and the ideal 1-cycle router) are not efficient for it.
It seems that it is not much affected by the network performance.
Figure 5.5.3: Normalized system speed-up.
In addition, by looking at Figure 1.3.1 in Section 1.3, for those workloads that
scale well with more threads, improvements in their performance mostly come
from speed-up of the serial execution time of each individual thread. raytrace
scales poorly with more threads but has a larger performance improvement, since
slower cache accesses are well accelerated by the low latency routers studied in
this dissertation.
McRouter is also well behind the ideal 1-cycle router. Although average link
utilization is very small for these workloads, it seems that network traffic tends
to be bursty, so that in reality the amount of multicasting taken is smaller than
expected because of temporarily higher crossbar switch utilization.
As demonstrated in Figure 5.5.4, McRouter consumes on average 1% more power
than CR, which is similar to VSAR and PR with LPM. This means that McRouter is
the most power efficient design among its counterparts. With merely 1% more
power consumption, McRouter outperforms CR by 28% in system speed-up.
Similarly, with roughly the same power consumption, McRouter outperforms VSAR
by 17% in system speed-up.
When compared to PR, McRouter outperforms both PRs (regardless of the
prediction algorithm) by 5% while consuming roughly the same amount of power as
PR with LP; better still, it consumes 6% less power than PR with FCM.
Figure 5.5.4: Normalized network power consumption.
Although multicasting may seem to be a power hungry operation, this is not the
case for McRouter, since it does not incur any additional link traversals or
buffer accesses and, at the same time, not every output is needed to carry out
a multicast operation. Another reason that McRouter is more power efficient is
that it is bandwidth driven. A multicast happens, and accelerates the traffic,
only if enough router bandwidth is present; a multicast operation is therefore
very productive and, unlike mis-predictions in PR, there is rarely a case in
which it consumes power but accelerates nothing.
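The trade-off stated above can be checked with a little arithmetic. The sketch below takes the two averages reported in the text (a 1.28 speed-up over CR at 1% more network power) and derives a performance-per-power ratio and a normalized network energy. It assumes that runtime scales as the inverse of the speed-up and that the average network power holds over the whole run; the function and constant names are illustrative and do not come from the evaluation framework.

```python
# Reported averages from the text, normalized to the conventional router (CR).
SPEEDUP_OVER_CR = 1.28  # McRouter system speed-up over CR
POWER_OVER_CR = 1.01    # McRouter network power relative to CR (1% more)


def perf_per_power(speedup: float, power: float) -> float:
    """Speed-up obtained per unit of normalized network power."""
    return speedup / power


def network_energy_ratio(speedup: float, power: float) -> float:
    """Normalized network energy: power times runtime, assuming runtime = 1 / speedup."""
    return power / speedup


if __name__ == "__main__":
    ppp = perf_per_power(SPEEDUP_OVER_CR, POWER_OVER_CR)
    energy = network_energy_ratio(SPEEDUP_OVER_CR, POWER_OVER_CR)
    print(f"performance per power vs. CR: {ppp:.3f}")
    print(f"network energy vs. CR:        {energy:.3f}")
```

Under these assumptions the performance-per-power ratio is about 1.27 and the network energy is about 0.79 of that of CR, which is consistent with the claim that the speed-up comes at a negligible power cost.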
5.5.3 Sensitivity Studies
In this subsection, McRouter's efficiency is examined with less on-chip
bandwidth available. The purpose of these studies is to clear any doubt that
McRouter's advantage shown in Subsection 5.5.2 comes from overly plentiful
on-chip bandwidth (in other words, over-designing). To this end, 2 parameters
are chosen to be downscaled in the evaluation: flit size and the number of VCs.
Changing these parameters decreases the available bandwidth inside a router.
All of these evaluations measure the system speed-up with different routers
when these two parameters are downscaled. We select 3 workloads (raytrace,
volrend and FT) to carry out these evaluations. FT is one of the most on-chip
bandwidth demanding workloads, while raytrace and volrend have relatively low
on-chip bandwidth demands, so two distinct types of applications are covered in
the evaluations.
[Bar charts of system speed-up for CR, VSAR, PR (LP), PR (FCM) and McRouter
under three configurations: 128-bit flit with 4 VCs, 64-bit flit with 4 VCs,
and 128-bit flit with 1 VC.]
(a) Workload: raytrace
(b) Workload: volrend
(c) Workload: FT
Figure 5.5.5: System speed-up with router parameter downscaling.
As shown in Figure 5.5.5, in one downscaling the flit size is halved; a 64-bit
flit is the minimum size found in the literature. In the other, the number of
VCs is reduced from 4 to 1; 1 VC is the minimum number of VCs possible (and all
routers actually turn into wormhole routers in this case).
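The three configurations compared in the sensitivity study can be written down as a small sketch. The baseline of 128-bit flits is inferred from the statement that the halved flit is 64 bits wide. The 512-bit payload (one 64-byte cache line) and the 5-port router radix are assumptions made only for illustration, as are all names used here.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class RouterConfig:
    name: str
    flit_bits: int  # flit (and crossbar datapath) width in bits
    num_vcs: int    # virtual channels per input port


BASELINE = RouterConfig("128-bit flit, 4 VCs", 128, 4)
HALF_FLIT = RouterConfig("64-bit flit, 4 VCs", 64, 4)
ONE_VC = RouterConfig("128-bit flit, 1 VC", 128, 1)


def flits_per_packet(cfg: RouterConfig, payload_bits: int) -> int:
    """Number of flits needed to carry a payload (header overhead ignored)."""
    return ceil(payload_bits / cfg.flit_bits)


def crossbar_bits_per_cycle(cfg: RouterConfig, radix: int = 5) -> int:
    """Peak bits the crossbar can move per cycle; radix 5 is an assumed mesh router."""
    return radix * cfg.flit_bits


if __name__ == "__main__":
    for cfg in (BASELINE, HALF_FLIT, ONE_VC):
        print(cfg.name,
              flits_per_packet(cfg, payload_bits=512),  # assumed 64-byte payload
              crossbar_bits_per_cycle(cfg))
```

Halving the flit size halves the peak crossbar bandwidth and doubles the number of flits per packet, whereas reducing the number of VCs leaves the datapath width unchanged but removes the ability to interleave packets at an input port.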
Figure 5.5.5a and Figure 5.5.5b show that McRouter's efficiency barely changes
in either case of downscaling. This shows that even with less on-chip bandwidth
available, McRouter is still the best design to consider.
With Figure 5.5.5c, things are more interesting. McRouter performs even better
with less available on-chip bandwidth for FT. The speed-ups with downscaled
parameters are 1.24 (when the flit size is halved) and 1.29 (when the number of
VCs is decreased from 4 to 1), while with the original parameters the recorded
speed-up is only 1.21. Since FT is the most on-chip bandwidth demanding
workload in the evaluation, this not only shows that McRouter still works well
with less on-chip bandwidth, but also means that even for the most on-chip
bandwidth demanding workload, the router bandwidth is still plentiful for
McRouter to utilize.
5.6 Summary and Discussions
In this chapter, a novel low latency on-chip router named McRouter is proposed
and designed for high performance NoCs. By allowing the head flit of a packet
to traverse the crossbar switch in a multicast manner, the RC delay is
completely removed from the per-hop latency, and VC allocation and switch
traversal become parallelizable, which enables a single-cycle transfer of
flits. Compared to low latency routers employing LAR, McRouter excels by
preserving portability and modularity in its design. Furthermore, compared to
low latency routers with aggressive speculation like PR, McRouter is better at
low network load, as its multicast nature makes it an “always-hit” prediction
router whenever a multicast can be taken. From the detailed evaluations carried
out for McRouter, it is found that McRouter outperforms its counterpart designs
with a negligible power overhead. On average, system speed-ups of 1.28, 1.17
and 1.05 are observed over CR, VSAR and PR with the best prediction algorithm,
respectively. These are achieved with merely 1% more power than CR. From the
sensitivity study, McRouter is found to work well under a tighter bandwidth
budget, too.
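The multicast idea summarized above can be illustrated with a toy decision function. This is a deliberately simplified sketch rather than the router's actual microarchitecture: it assumes a head flit is multicast only when every candidate output is idle in the crossbar, and that the router otherwise falls back to a conventional pipeline. The function name, the port labels and the default of 3 fallback cycles are all hypothetical.

```python
def route_head_flit(candidate_outputs, busy_outputs, true_output, fallback_cycles=3):
    """Toy model of multicast-within-a-router for one head flit.

    candidate_outputs: outputs the flit could possibly need (known before RC finishes)
    busy_outputs:      outputs already in use in the crossbar this cycle
    true_output:       the output that route computation eventually selects
    fallback_cycles:   hypothetical per-hop latency of the conventional pipeline

    Returns (cycles_spent, outputs_driven).
    """
    candidates = set(candidate_outputs)
    if true_output not in candidates:
        raise ValueError("true_output must be one of the candidate outputs")
    if candidates.isdisjoint(busy_outputs):
        # Enough crossbar bandwidth: drive every candidate in a single cycle.
        # Route computation runs in parallel, so the copy on the true output
        # proceeds and the copies on the other outputs are discarded.
        return 1, candidates
    # Not enough bandwidth: wait for route computation and use one output only.
    return fallback_cycles, {true_output}


if __name__ == "__main__":
    print(route_head_flit({"N", "E"}, set(), "E"))   # idle crossbar: multicast
    print(route_head_flit({"N", "E"}, {"N"}, "E"))   # contention: fallback
```

The sketch reflects the “always-hit” property: when a multicast is taken, the correct output is always among the driven outputs, so there is no mis-prediction penalty to recover from.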
Another observation is that, in a direct comparison between PmR and McRouter
with 10 workloads⁵, McRouter is definitely better in volrend, water (nsquared)
and water (spatial), which have relatively low link utilization (hence low
bandwidth utilization). This means that McRouter is more bandwidth sensitive
than PmR. It should also be noted that McRouter is more power efficient than
PmR because it does not have any predictors.
⁵ 3 workloads failed for PmR in simulation.
Following the above summary, there are also a few qualitative discussions to be
made regarding network topology, core scaling, coherence protocols and the
parallel speed-up of application workloads.
Firstly, topology makes a difference in two aspects: the number of links and
the radix of routers. For example, McRouter should be more efficient with a
torus than with a mesh, since a torus has more links (and the same router
radix), which can result in lower link utilization. Conversely, with a ring
topology, McRouter should be less efficient since the number of links is much
lower while the radix of routers is also slightly lower. This means that the
available bandwidth (in terms of network resources) is less and the link
utilization can be much higher.
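The link-count argument can be made concrete with the standard formulas for these topologies. The sketch below counts bidirectional links for a k x k mesh, a k x k torus and a ring with the same number of nodes; the 4 x 4 example size is an assumption chosen for illustration, and the radix values count the local port and refer to interior routers.

```python
def mesh_links(k: int) -> int:
    """Bidirectional links in a k x k mesh: k rows and k columns, k - 1 links each."""
    return 2 * k * (k - 1)


def torus_links(k: int) -> int:
    """Bidirectional links in a k x k torus (k >= 3): wrap-around gives k links per row and column."""
    return 2 * k * k


def ring_links(n: int) -> int:
    """Bidirectional links in a ring of n nodes (n >= 3)."""
    return n


# Router ports including the local port (interior routers for the mesh).
ROUTER_RADIX = {"mesh": 5, "torus": 5, "ring": 3}

if __name__ == "__main__":
    k = 4  # assumed example size: 16 nodes
    print("mesh ", mesh_links(k), ROUTER_RADIX["mesh"])
    print("torus", torus_links(k), ROUTER_RADIX["torus"])
    print("ring ", ring_links(k * k), ROUTER_RADIX["ring"])
```

For 16 nodes this gives 24 links for the mesh, 32 for the torus and 16 for the ring, which matches the ordering used in the discussion: the same traffic spread over more links yields lower link utilization and more opportunities to multicast.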
Secondly, there are two forms of core scaling which have opposite influences
on McRouter. If the number of tiles scales with the number of cores, McRouter
should still be effective or even better, since with such scaling traffic
travels longer distances in the network. However, if the size of the network
stays the same while the number of cores scales up (a denser design), it is
hard for McRouter to maintain its effectiveness since the network is simply
more stressed in this case.
Finally, McRouter should work better with directory-based cache coherence
protocols because such protocols put less stress on bandwidth than
broadcast-based ones.
The end of a melody is not its goal: but nonetheless, had the
melody not reached its end it would not have reached its goal
either.
Friedrich Wilhelm Nietzsche
6
Conclusions
It is clear that this shift to multi-core designs is somehow inevitable.
Replication of processor cores on chip in such a modular manner not only favors
on-chip networks but also poses a few challenges for them. A key challenge this
dissertation has addressed is optimizing the communication latency of such
networks.
First and foremost, there are three findings which motivate this dissertation.
The first one is the bandwidth limitation identified in 3D NoCs. This physical
limitation shows up when traffic crosses layers. It has been the starting point
of the first solution described in this dissertation, where traffic compression
is applied adaptively on 3D NoCs. The second finding concerns the use of
multiple prediction algorithms for speculating the routing result of a packet
in an on-chip router. Multiple prediction algorithms lead to better prediction
accuracy, which is what motivates the Predict-more Router. The third finding is
that, with multi-threaded workloads, there is plenty of internal bandwidth for
a router to multicast an incoming packet to all possible outputs without
knowing the packet's route computation result. This leads to the idea of the
multicast-within-a-router design. Based on these findings, three solutions are
designed to minimize the communication latency of NoCs, and their effectiveness
has been demonstrated through cycle-by-cycle simulations.
6.1 Further Discussions and Future Work
This section covers further discussion beyond the scope of this dissertation and
identifies possible future work to improve the three solutions presented.
The application traffic used in this dissertation is generated from
multi-threaded workloads which have been optimized to minimize the amount of
communication between threads. If the target applications are changed to
multi-programmed workloads or data-parallel workloads, the effectiveness of
McRouter will degrade, but PmR should still work. Traffic compression, on the
other hand, should be a more effective solution. The reason is that the latter
two types of workloads may generate more traffic than the multi-threaded ones;
hence they consume more bandwidth and network resources, especially when
memory-intensive applications are executed.
As future work, power evaluation is a good candidate for the traffic
compression solution, since it is interesting to find the trade-off between the
power saved by reducing the amount of network traffic through compression and
the power overhead consumed by the compression/decompression circuits. For the
low latency router proposals, HDL synthesis is a promising extension to improve
the accuracy of the power model and to provide an evaluation of the area
overhead of these two solutions. It may also be interesting to apply these
solutions to networks of different scales whose latency and bandwidth
requirements differ from those of NoCs.
List of Publications by the Author
Refereed Journal Publications
[J-1] Yuan He, Hiroki Matsutani, Hiroshi Sasaki, and Hiroshi Nakamura,
“Adaptive Data Compression on 3D Network-on-Chips,” IPSJ Transactions on
Advanced Computing Systems, Vol.5, No.1, pp.80-87, January 2012.
Refereed Conference and Workshop Publications
[C-1] Yuan He, Hiroshi Sasaki, Shinobu Miwa, and Hiroshi Nakamura, “McRouter:
Multicast within a Router for High Performance Network-on-Chips,” In Proc. of
the 22nd International Conference on Parallel Architectures and Compilation
Techniques, pp.319-329, September 2013.
[C-2] Yuan He, Hiroshi Sasaki, Shinobu Miwa, and Hiroshi Nakamura,
“Predict-more Router: A Low Latency NoC Router with More Route Predictions,”
In Proc. of the 2013 IEEE International Parallel and Distributed Processing
Workshops and Phd Forum (the 3rd Workshop on Communication Architecture for
Scalable Systems), pp.842-850, May 2013.
[C-3] Yuan He, Hiroki Matsutani, Hiroshi Sasaki, and Hiroshi Nakamura, “Data
Compression on 2D and 3D Network-on-Chips for CMP,” In Proc. of the 9th
Symposium on Advanced Computing Systems and Infrastructures, pp.391-398,
May 2011.
Patents
[P-1] Yuan He, Shinobu Miwa, and Hiroshi Nakamura, “Router,” Japanese Patent
Application No. 2013-111244, filed May 27, 2013.
