Hardware acceleration of sequential loops by Michaud, Pierre
To cite this version:
Pierre Michaud. Hardware acceleration of sequential loops. [Research Report] RR-7802, INRIA.
2011. <hal-00641350>
HAL Id: hal-00641350
https://hal.inria.fr/hal-00641350
Submitted on 15 Nov 2011
ISSN 0249-6399
ISRN INRIA/RR--7802--FR+ENG
RESEARCH
REPORT
N° 7802
November 2011
Project-Teams ALF
RESEARCH CENTRE
RENNES – BRETAGNE ATLANTIQUE
Campus universitaire de Beaulieu
35042 Rennes Cedex
Hardware aeleration of sequential loops
Pierre Mihaud
Projet-Teams ALF
Researh Report n° 7802  November 2011  19 pages
Abstrat: The urrent trend in general-purpose miroproessors is to take advantage of Moore's
law to inrease the number of ores on the same hip. In a few tehnology generations, this will
lead to hips with hundreds of supersalar ores. Obtaining high performane on these so-alled
many-ores will require to parallelize the appliations. Nevertheless, it is unlikely that all the
appliations will take full advantage of the high number of ores. Hene it is important, along with
inreasing the number of ores, to inrease sequential performane and dediate a relatively large
silion area and power budget for that purpose. In this study, we onsider the possibility to inrease
sequential performane with a loop aelerator. The loop aelerator sits beside a onventional
supersalar ore and is speialized for exeuting dynami loops, i.e., periodi sequenes of dynami
instrutions. Loops are deteted and aelerated automatially, without help from the programmer
or the ompiler. The exeution is migrated from the supersalar ore to the loop aelerator
when a dynami loop is deteted, and bak to the supersalar ore when a loop exit ondition
is enountered. We desribe the proposed loop aelerator and we study its performane on the
SPEC CPU2006 appliations. We show that signiant performane gains may be ahieved on
some appliations.
Key-words: Multi-ore proessor, loop aelerator, sequential performane
Hardware acceleration of sequential loops (French title: Accélération matérielle de boucles séquentielles)
Résumé (translated from French): The current trend in general-purpose microprocessors is to exploit Moore's law by increasing the number of cores on the same chip. In a few technology generations, this trend will produce chips with hundreds of superscalar cores. It will be necessary to parallelize applications in order to obtain high performance on these future multi-cores. However, it will probably not be possible for all applications to exploit all the cores. It is therefore important to keep increasing sequential performance along with parallel performance, by reserving for sequential acceleration a relatively large part of the silicon area and of the electrical power budget of the chip. In this study, we consider the possibility of increasing sequential performance with a loop accelerator. The loop accelerator is associated with a conventional superscalar core and is specialized for executing dynamic loops, i.e., periodic sequences of dynamic instructions. Loops are detected and accelerated automatically, without help from the programmer or the compiler. The execution migrates from the superscalar core to the loop accelerator when a dynamic loop is detected, and back when a loop exit condition is encountered. We describe the proposed loop accelerator and we study its performance on the SPEC CPU2006 applications. We show that relatively large performance gains can be obtained for some applications.
Mots-clés (keywords): Multi-core processor, loop accelerator, sequential performance
Hardware aeleration of sequential loops 3
1 Introdution
During the last deade, single-thread performane has inreased at a slower pae than during previous
deades, despite transistor miniaturization ontinuing as ditated by Moore's law. For several reasons,
proessor makers have preferred to use the silion area oered by miniaturization for implementing multi-
ores. General-purpose multi-ores have atually beneted to the Internet by inreasing the throughput of
server farms. However, the lient side of the internet as not beneted as muh from multi-ores as the server
side. While multi-ores have been denitely useful for inreasing throughput, their potential for dereasing
lateny is still largely underexploited. Most existing ode is sequential, and this is likely to remain so
for some time. Writing parallel appliations with portable performane speedups is very diult, even for
the elite of programmers who understand performane and know parallel programming. The gap between
potential and atual performane will get larger as the number of on-hip ores inrease, beause of limited
o-hip memory bandwidth and beause of Amdahl's law. As the number of on-hip ores keeps inreasing
with years, we will progressively enter the many-ore era and eventually reah a point where implementing
a heterogeneous many-ore with one big and fast ore and several smaller ores will provide a signiant
performane advantage. For instane, let us onsider a homogeneous many-ore with 200 idential normal
ores on one hand, and on the other hand a heterogeneous many-ore with only 100 normal ores and one
monster ore taking the silion area and onsuming the power equivalent to 100 normal ores. Even if the
monster ore is only twie faster than a normal ore despite being 100 times bigger, a simple appliation of
Amdahl's law show that the heterogeneous many-ore will outperform the homogeneous ore on programs
whose sequential fration exeeds 1% of all the instrutions exeuted
1
. This extreme example illustrates
a situation that has been analyzed by Hill and Marty [6℄, one of their onlusions being that researhers
should investigate methods of speeding sequential performane even if they appear loally ineient.
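
The 1% threshold quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the report: the function names are ours, execution time is normalized to one normal core, perfect parallelism is assumed (as in the report's footnote), and we additionally assume that the parallel part of the heterogeneous chip runs only on its 100 normal cores.

```python
def exec_time_homogeneous(seq_fraction, n_cores=200):
    """Normalized execution time on n_cores identical cores (Amdahl's law)."""
    return seq_fraction + (1.0 - seq_fraction) / n_cores


def exec_time_heterogeneous(seq_fraction, n_small=100, monster_speedup=2.0):
    """Normalized time when the sequential part runs on the monster core.

    Assumption: the parallel part uses only the n_small normal cores.
    """
    return seq_fraction / monster_speedup + (1.0 - seq_fraction) / n_small


def break_even_fraction(n_homog=200, n_small=100, monster_speedup=2.0):
    """Sequential fraction s at which both designs take the same time.

    Solves s + (1 - s)/n_homog = s/monster_speedup + (1 - s)/n_small for s.
    """
    gain_seq = 1.0 - 1.0 / monster_speedup    # time saved per unit of sequential work
    loss_par = 1.0 / n_small - 1.0 / n_homog  # time lost per unit of parallel work
    return loss_par / (gain_seq + loss_par)
```

Under these assumptions the break-even sequential fraction is 1/101, slightly below 1%, and the heterogeneous design is faster for any larger sequential fraction.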
What this means is that researchers must find solutions for accelerating sequential execution for future many-cores even if these solutions look absurd for today's multi-cores. However, increasing sequential performance is not obvious, even with the silicon area and power budget of a monster core. Increasing the IPC (number of instructions executed per cycle) without impacting the clock cycle is not straightforward. Perhaps it will be possible to implement wider superscalar cores, e.g., 8-wide cores like the canceled Alpha EV8 processor. But to the best of our knowledge it has not been proved that the superscalar width could be increased beyond 8 without actually losing some performance. The problem comes from structures that do not scale well with the issue width and the instruction window size, like register renaming, dynamic instruction scheduling, register ports, operand bypass network, and memory disambiguation mechanisms. Many propositions for solving these problems have been published in the last 20 years, and some of them are probably worth revisiting in the context of many-cores. Nevertheless, it is also important to explore new approaches.
We propose in this study to explore a new approach to sequential performance: hardware acceleration of loops. Many programs spend a significant part of the execution in dynamic loops, i.e., periodic sequences of dynamic instructions. We explore in this study the possibility of implementing an accelerator that is specialized for dynamic loops and does not require any help from the programmer or the compiler.
Our goal is not to explore the design space of loop accelerators, which we believe is huge. Instead, we tried to imagine an accelerator microarchitecture exploiting loop properties and avoiding as much as possible the usual superscalar bottlenecks. We have simulated the proposed loop accelerator and we show that it can potentially deliver a high sustained IPC. Actually, we believe that the microarchitecture we propose is scalable and can be pipelined at a clock frequency at least as high as the superscalar core frequency.
This work makes the following contributions: (1) We propose a hardware mechanism for detecting dynamic loops; (2) We provide a characterization of the dynamic loop behavior of SPEC CPU2006 benchmarks and we show that a significant fraction of the dynamic loops have a large loop body consisting of several tens of instructions; (3) We propose a loop accelerator that can accelerate most dynamic loops with a small or large body and that can issue simultaneously up to 32 µops out of program order from a window of several thousands of µops, with a local hardware complexity not greater than that of a conventional superscalar core; (4) We propose solutions for memory disambiguation in a window of several thousands of µops, exploiting dynamic loop properties; (5) We show that a significant fraction of dynamic µops are actually redundant and can be removed from the dynamic loops.
Footnote 1: This is assuming perfect parallelism. Limited off-chip memory bandwidth, various overheads and imperfect load balancing will make the situation worse for the homogeneous many-core.
Table 1: Baseline superscalar core.
clock frequency: 3 GHz
decode/rename: 1 inst (any) + 3 "simple" insts (1 or 2 µops)
reorder buffer: 64 µops
load queue: 32 loads
store queue: 16 stores
dispatch: 6 µops, dependency-based steering
schedulers: 4 8-µop int, 2 8-µop FP/SSE, 2 16-µop loads
execution: 4 int (1 mul or 1 div), 2 FP/SSE (1 div), 2 loads/stores
retirement: 6 µops
post-retirement: 16-store post-retirement queue, 2 stores/cycle
branch predictor: 64-Kbit TAGE, 18-Kbit ITTAGE [12]
branch misprediction: recovery at execution, 12 cycles minimum penalty
cache line: 64 B
IL1 cache: 32 KB, 8-way assoc, 1 line/cycle bandwidth, up to 6 pipelined misses
DL1 cache: 32 KB, 8-way assoc, write back, 2 cycles latency, 8 banks, 8-byte bank width, virtually indexed, PC-based stride prefetch
DL1 misses: 16 MSHRs
L2 cache: 512 KB, 8-way assoc, DIP policy [10], write back, 9 cycles latency, 1 line/cycle bandwidth, stream prefetch [16]
L3 cache: 8 MB, 16-way assoc, DIP policy [10], write back, 18 cycles latency, 1 line/cycle bandwidth, stream prefetch [16]
memory+bus: 210 cycles latency, 16 bytes/cycle bandwidth
ITLB: 64 entries, 4-way
DTLB: 64 entries, 4-way
TLB2: 512 entries, 4-way, 4 cycles latency
page size: 4 MB
load/store dependencies: store sets [2] + single-entry forwarding from store queue [8]
This paper is organized as follows. Section 2 discusses prior work. We describe our simulation set-up in Section 3. Section 4 describes our loop detector and provides statistics about dynamic loops in the SPEC CPU2006 applications. Section 5 gives a detailed description of the proposed loop accelerator (LA) and provides a performance evaluation. Finally, Section 6 concludes this study and gives some directions for future work. This paper uses many definitions and acronyms; they are listed in Table 5 at the end of the paper.
2 Related work
Kobayashi found that many programs spend a significant fraction of the execution in dynamic loops, especially scientific programs [7]. Our definition of dynamic loops is close to Kobayashi's. Tubella and González described a hardware mechanism for the automatic detection of dynamic loops [17]. Their definition of dynamic loops is different from ours and targeted toward speculative multithreading. Some recent processors have a loop buffer to decrease the energy consumption by avoiding re-fetching and re-decoding the same loop body again and again. For instance, the Intel Nehalem has a Loop Stream Detector that can hold up to 28 µops [4]. García et al. describe a loop buffer that implements a register renaming mechanism for loops, more efficient than conventional register renaming hardware [5]. Stitt et al. proposed an LA for a system-on-chip, implemented with some configurable logic, able to detect and accelerate loops transparently [14]. The time for analyzing the loop and configuring the LA appears to be very long (tens of millions of clock cycles for relatively simple loops [14]). It is not clear whether this approach could be used in general-purpose systems. Clark et al. described an approach where loops are identified and modulo-scheduled dynamically (hence preserving binary compatibility) onto an LA decoupling memory accesses from computations [3]. Their LA has a configurable compute accelerator which can execute up to 15 integer operations in only 2 clock cycles. Vajapeyam et al. proposed a dynamic vectorization (DV) scheme [18] based on trace processors [19, 11]. This DV scheme has a few similarities with our loop accelerator (e.g., use of FIFO queues), but is overall very different. The DV scheme assumes 64 processing elements (PEs), different instances of a loop iteration being processed by different PEs. Neither memory disambiguation nor problems concerning communication bandwidth and latency between PEs are addressed in [18].
3 Simulation set-up
Our baseline configuration simulates a modern x86 superscalar core. Our simulator is trace driven (we do not simulate wrong-path effects). We used Pin 2.8 [9] to generate a trace for each SPEC CPU2006 benchmark. We compiled the benchmarks for the x86-64 architecture with gcc 4.4.3 "-O2". Each trace consists of 40 samples of 50 million consecutive instructions, for a total of 2 billion dynamic instructions (and the associated memory references). Samples are regularly spaced so as to be representative of the whole benchmark execution. Some parameters of the baseline microarchitecture are listed in Table 1. Instructions are split into micro-ops (µops for short) at decode. We generate separate ADDR µops for memory accesses [8]. Load and store µops get the address from the associated ADDR µop, which is executed by any ALU. At most 8 µops are generated per instruction. A µop has at most 2 source registers and 1 destination register, not counting the flags. There are 4 integer schedulers, 2 floating-point schedulers and 2 load schedulers. Each scheduler can select one ready µop per cycle in its issue buffer. The µop is removed from the buffer when issued. We assume a rescheduling mechanism, but we did not simulate rescheduling penalties. We simulate hardware prefetchers for the DL1, L2 and L3 caches. The DL1 prefetcher is a PC-based stride prefetcher. The L2 and L3 prefetchers are stream prefetchers that issue prefetch requests based on the sequence of misses and hits on prefetched blocks [16]. The prefetch distance is adjusted dynamically with a feedback mechanism. We assume 4 MB memory pages. Using large pages decreases the number of TLB misses and makes prefetching in the L2 and L3 caches more effective, as stream prefetchers are limited by page boundaries [16]. TLB misses are hardware-managed, as in x86 processors [1].
Figure 1: IPC of the simulated core vs. IPC measured on an Intel Nehalem for the SPEC CPU2006 benchmarks. (Bar chart; y-axis: IPC, 0.0 to 2.5; x-axis: benchmarks 400 to 483; series: Nehalem, simulator.)
The baseline IPC (instructions per cycle) numbers for the SPEC CPU2006 benchmarks are reported in Figure 1. We also report the IPC measured on an Intel Nehalem processor when executing each benchmark to completion. The simulated microarchitecture is not identical to a Nehalem. For instance, we do not simulate µop fusion and macro-op fusion. Our goal was to simulate a realistic superscalar core, not to match the Nehalem perfectly.
4 The loop detetor
In this study, the term loop is used in the sense of dynami loop. A dynami loop is a periodi sequene
of dynami instrutions [7℄. Instrutions with dierent program ounter addresses (PC, for short) are
onsidered distint, although they may perform the same ation. The loop length is the number of dynami
instrutions in the sequene. The body size B, in instrutions, is the length of the smallest period. The loop
body (or body, for short) onsists of the rst B instrutions in the sequene, i.e., iteration number 0. The
next B instrutions belong to iteration number 1, and so on. Instrutions in the body are not neessarily
distint. For example, the body itself may inlude a tiny loop unrolled, or a funtion exeuted multiple
times. The loop detetor is a hardware mehanism whose goal is to nd loops. Beause there will be a
transition penalty for entering and leaving the loop aeleration mode, we try to detet loops that are long
enough. The loop detetor onsists of two main parts : the Loop Monitor and the Loop Table.
The loop monitor. The Loop Monitor (LM) is the main mechanism for finding loops. It includes a tiny LM table (e.g., 4 entries). Each LM entry contains an end-of-loop PC (EOL PC) which is the search key, a valid bit, a body size, a body signature, a previous signature, an instruction count and a first-iteration bit. The LM table is fully-associative. For each instruction retiring from the reorder buffer, the body size of each valid LM entry is incremented and the corresponding body signature is updated. The signature is a hash of instruction PCs. It is intended to provide a unique identifier for each loop body encountered during the program execution. The longer the signature, the less likely that two distinct loop bodies have the same signature. In our simulations, we use 32-bit signatures, and we update the signature by rotating the signature 1 bit to the left and applying an XOR with the new instruction PC. When the body size of any valid LM entry exceeds a fixed maximum body size (MBS), we close that entry by resetting its valid bit. If the PC of a retiring instruction hits in the LM table, we look at the information stored in the matching entry. If the first-iteration bit is set or if the signature equals the previous signature, we add the body size to the instruction count, otherwise we reset the instruction count. When a branch instruction jumps backward, it is considered a potential end-of-loop. If no valid entry exists for that branch PC, we create a new entry (e.g., by reusing a closed entry): the EOL PC is set to the branch PC, the valid bit and first-iteration bit are set, and all the other fields are reset. If an entry already exists for the backward jump, we copy the body signature into the previous signature and we reset the first-iteration bit, the body signature and the body size. When the instruction count in one of the LM entries exceeds a predefined threshold MinLM, we close all the other entries, and we enter the loop building state. The only valid LM entry remaining is called the active LM entry. Loop building corresponds to a fixed number of loop iterations during which we extract the information that will be needed to execute the loop on the LA. The number of iterations spent in the loop building state is denoted LBI (loop build iterations). When loop building is done, we migrate the execution to the LA. After the loop exit, we update the instruction count in the active LM entry and we close the entry.
Figure 2: Percentage of execution (instructions and time) spent in loops (MinLM=900 instructions, LBI=5). (Bar chart; y-axis: percentage of execution spent in loops, 0% to 100%; x-axis: benchmarks 400 to 483, with one "instr" bar and one "time" bar per benchmark; series: MBS=256, MBS=128, MBS=64, MBS=32, MBS=16.)
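
The Loop Monitor rules above can be sketched in software as follows. This is an illustrative model, not the report's hardware: the class and method names are ours, closing an entry is modeled by deleting it, a new entry is simply dropped when the table is full, only the low 32 bits of the PC enter the signature, and loop building and migration to the LA are not modeled.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


def update_signature(sig, pc):
    """Rotate the 32-bit signature left by 1 bit, then XOR it with the PC."""
    rotated = ((sig << 1) | (sig >> 31)) & MASK32
    return rotated ^ (pc & MASK32)  # assumption: keep only the low 32 PC bits


@dataclass
class LMEntry:
    eol_pc: int
    body_size: int = 0
    body_sig: int = 0
    prev_sig: int = 0
    instr_count: int = 0
    first_iter: bool = True


class LoopMonitor:
    def __init__(self, entries=4, mbs=128, min_lm=900):
        self.capacity, self.mbs, self.min_lm = entries, mbs, min_lm
        self.table = {}  # EOL PC -> LMEntry; only valid entries are kept

    def retire(self, pc, backward_jump):
        """Process one retired instruction.

        Returns the EOL PC when a loop is detected (instruction count
        above MinLM), otherwise None.
        """
        # Every valid entry grows by one instruction.
        for entry in list(self.table.values()):
            entry.body_size += 1
            entry.body_sig = update_signature(entry.body_sig, pc)
            if entry.body_size > self.mbs:
                del self.table[entry.eol_pc]  # close the entry
        # A hit on an EOL PC accumulates or resets the instruction count.
        hit = self.table.get(pc)
        if hit is not None:
            if hit.first_iter or hit.body_sig == hit.prev_sig:
                hit.instr_count += hit.body_size
            else:
                hit.instr_count = 0
        # A backward jump is a potential end-of-loop.
        if backward_jump:
            if hit is None:
                if len(self.table) < self.capacity:  # assumption: drop if full
                    self.table[pc] = LMEntry(eol_pc=pc)
            else:
                hit.prev_sig = hit.body_sig
                hit.first_iter = False
                hit.body_sig = 0
                hit.body_size = 0
        # Detection: close all the other entries, keep the active one.
        if hit is not None and hit.instr_count > self.min_lm:
            self.table = {pc: hit}
            return pc
        return None
```

With a 3-instruction body and MinLM lowered to 10 for illustration, the entry is created at the first backward jump and the count then grows by 3 per iteration, so the loop is reported once the count reaches 12.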
The loop table. It takes MinLM dynamic instructions before an instruction sequence is considered a loop by the LM. On the one hand, if MinLM is set to a low value, short loops will be detected, but the execution time saved on the LA may not be worth the transition penalty. On the other hand, if MinLM is set to a high value, the MinLM instructions that could have been executed on the LA are instead executed by the superscalar core, which is a probable waste of performance. So MinLM is typically set to a medium value, e.g., 900 instructions. Yet, the next time we encounter this loop, we may assume that it will behave as the last time and enter the loop acceleration mode immediately. This is one of the functions of the loop table (LT). Each LT entry records some information about one loop, identified by its body signature. In particular, there is a confidence bit in each LT entry. When we close an LM entry, we set or reset the confidence bit depending on whether the instruction count is greater or less than MinLM. When the PC of a retired instruction matches one of the valid LM entries, we access the corresponding LT entry. If that LT entry exists and if the confidence bit is set, we close all the other LM entries and we enter loop building immediately.
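
The confidence-bit policy of the loop table can be sketched as below. This is a hypothetical model: the class and method names are ours, the table is an unbounded dictionary (no capacity or replacement policy is modeled), and an instruction count exactly equal to MinLM is treated as "not greater", which the text leaves unspecified.

```python
class LoopTable:
    """Maps a body signature to a one-bit confidence (illustrative layout)."""

    def __init__(self, min_lm=900):
        self.min_lm = min_lm
        self.confidence = {}  # body signature -> confidence bit

    def on_lm_entry_closed(self, body_sig, instr_count):
        # Set the bit if the loop ran for more than MinLM instructions,
        # reset it otherwise.
        self.confidence[body_sig] = instr_count > self.min_lm

    def should_build_immediately(self, body_sig):
        # True only if an LT entry exists and its confidence bit is set.
        return self.confidence.get(body_sig, False)
```

A loop that previously ran long enough skips the MinLM warm-up on its next occurrence, while a loop that exited early loses that privilege.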
4.1 Fration of exeution spent in loops
Figure 2 shows the fration of exeution spent in loops for dierent values of MBS. Two perentages are
given for eah benhmark, a perentage of exeuted instrutions and a perentage of exeution time. MinLM
is set to 900 instrutions and LBI=5. Several benhmarks spend a signiant fration of the exeution in
loops, but this is very dependent on MBS. If we exeute all SPEC CPU2006 benhmarks to ompletion and
RR n° 7802
Hardware aeleration of sequential loops 7
benhmark MBS=128 MBS=256 benhmark MBS=128 MBS=256
401.bzip2 5231 4838 450.soplex 3216 3261
403.g 8315 8312 456.hmmer 5446 5448
433.mil 2190425 6415123 459.GemsFDTD 4249 4953
434.zeusmp 4257 4359 462.libquantum 15362 18216
437.leslie3d 2218 2223 465.tonto 4161 4027
444.namd 3749 6320 481.wrf 2643 2645
447.dealII 2634 2627 483.xalanbmk 4239 5475
Table 2: Average loop length for loopy benhmarks (MinLM=900 instrutions, LBI=5)
accelerator
loop
L2 L3DL1
IL1
arch.
reg.loop
detector
loop
builder
superscalar core
retired uops
Figure 3: The supersalar ore and the sequential loop aelerator
in suession, about one third of all the instrutions exeuted belong to loops (as we have dened them)
with a maximum body size of 256 instrutions. In fat, many loops have a body not exeeding a few tens
of instrutions. Yet, several benhmarks spend a signiant part of the exeution in loops whose body size
exeeds 128 instrutions, in partiular 433.mil, 444.namd, 459.GemsFDTD, 465.tonto and 483.xalanbmk.
We dene a loopy benhmark as a benhmark with more than 20% of dynami loop instrutions with
MBS=256. About half of the SPEC CPU2006 benhmarks are loopy aording to this denition. Loopy
benhmarks are listed in Table 2, along with the average loop length aording to our loop detetor. In the
remaining, we assume MBS=128 instrutions.
5 The loop aelerator
This study onsiders the possibility of aelerating the exeution of dynami loops, without help from the
programmer and preserving binary ompatibility. The goal of this study is not to explore the design spae of
loop aelerators (LAs), whih we believe is huge. Instead, we have tried to imagine a new miroarhiteture
avoiding onventional supersalar tehniques that do not sale well. We tried to exploit loop harateristis
as muh as possible in order to make the LA salable. We have also tried to make the LA as general as
possible so that it an aelerate loops with a small or large body, as it was shown in the previous setion
that both ases are important.
The proposed miroarhiteture is depited in Figure 3. We onsider here only the "sequential" part of
a heterogeneous many-ore. During loop building, the loop builder takes as input the µops retired from the
supersalar ore reorder buer and prepares the loop body for exeution on the LA. When loop building is
done, the supersalar ore instrution window is leared and the exeution is migrated to the LA. When a
loop exit ondition ours, like a branh taking a dierent diretion, the arhitetural registers are updated
and the exeution is migrated bak onto the supersalar ore, until another dynami loop is enountered.
The rest of this setion is organized as follows. Setion 5.1 introdues the yli register dependeny
graph, whih is useful for haraterizing a loop and understanding its exeution on a LA. The proposed LA
is desribed in Setion 5.2 and Setion 5.3 fouses on memory dependenies. Setion 5.4 explains how the
loop body an be redued by removing redundant µops. Setion 5.5 explains how we map the loop body
RR n° 7802
Hardware aeleration of sequential loops 8
12 : 5c8289
13,flags:=ALU(13)
0 : 5c8270
357:=ADDR(13)
1 : 5c8270
109:=LOAD(357)
7 : 5c827c
109:=SSE(109,358)
2 : 5c8274
17,flags:=ALU(17)
21 : 5c82a5
flags:=ALU(17,20)
11 : 5c8285
12,flags:=ALU(12)
3 : 5c8278
357:=ADDR(12)
4 : 5c8278
110:=LOAD(357)
10 : 5c8280
110:=SSE(110,358)
13 : 5c828d
18,flags:=ALU(18)
5 : 5c827c
357:=ADDR(18)
6 : 5c827c
358:=LOAD(357)
14 : 5c8291
109:=SSE(109,110)
8 : 5c8280
357:=ADDR(26)
9 : 5c8280
358:=LOAD(357)
17 : 5c8299
110:=SSE(109,110)
20 : 5c82a1
19,flags:=ALU(19)
15 : 5c8295
357:=ADDR(19)
18 : 5c829d
357:=ADDR(19)
16 : 5c8295
110:=LOAD(357)
19 : 5c829d
STORE(357,110)
22 : 5c82a8
BRANCH(flags)
Figure 4: Example of loop CRDG found in benhmark 481.wrf. The body onsists of 15 instrutions split
up into 23 stati µops ranked from 0 to 22. Dependenies with µops not belonging to the loop are not part
of the CRDG.
onto the LA for good performane. Other performane tuning we did is desribed in Setion 5.6, and the
simulated performane speedups are presented in Setion 5.7.
5.1 Cyli register dependeny graph
To understand the performane of a LA, it is onvenient to onsider the register dependeny graph (RDG)
of a sequene of µops, where eah node of the graph is a dynami µop and direted edges represent register
dependenies. The RDG of a loop has a periodi struture and it is possible to summarize it with a yli
register dependeny graph (CRDG) where eah node of the graph is a stati µop. Figure 4 shows a loop CRDG
onsisting of 23 stati µops ranked from 0 to 22. The rank of a stati µop orresponds to the sequential order.
For example, the loop represented in Figure 4 onsists of the dynami instanes of µops 0,1,...,22, 0,1,...,22,...
and so on. Dependenies in a CRDG are of two sorts : intra-iteration and inter-iteration. When the µops
produing and onsuming a register value belong to the same iteration, it is an intra-iteration dependeny.
Otherwise it is an inter-iteration dependeny. For an inter-iteration dependeny, if the produer belongs to
iteration k, the onsumer belongs to iteration k+1. In Figure 4, inter-iterations dependenies are represented
by edges whose head rank is less than or equal to the tail rank (e.g., both edges originating from µop 11).
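
As an illustration of the intra-/inter-iteration distinction, the sketch below derives CRDG edges from the destination and source registers of the 23 µops of Figure 4. It is our own reconstruction, not the loop builder's algorithm: register names are kept as strings, the loop is assumed to be in steady state (so a value not yet written in the current iteration comes from the last writer in the previous iteration), and registers never written in the body are treated as external.

```python
# (destination registers, source registers) of each static µop, in rank order.
BODY = [
    (["357"], ["13"]),                  # 0  ADDR
    (["109"], ["357"]),                 # 1  LOAD
    (["17", "flags"], ["17"]),          # 2  ALU
    (["357"], ["12"]),                  # 3  ADDR
    (["110"], ["357"]),                 # 4  LOAD
    (["357"], ["18"]),                  # 5  ADDR
    (["358"], ["357"]),                 # 6  LOAD
    (["109"], ["109", "358"]),          # 7  SSE
    (["357"], ["26"]),                  # 8  ADDR
    (["358"], ["357"]),                 # 9  LOAD
    (["110"], ["110", "358"]),          # 10 SSE
    (["12", "flags"], ["12"]),          # 11 ALU
    (["13", "flags"], ["13"]),          # 12 ALU
    (["18", "flags"], ["18"]),          # 13 ALU
    (["109"], ["109", "110"]),          # 14 SSE
    (["357"], ["19"]),                  # 15 ADDR
    (["110"], ["357"]),                 # 16 LOAD
    (["110"], ["109", "110"]),          # 17 SSE
    (["357"], ["19"]),                  # 18 ADDR
    ([], ["357", "110"]),               # 19 STORE
    (["19", "flags"], ["19"]),          # 20 ALU
    (["flags"], ["17", "20"]),          # 21 ALU
    ([], ["flags"]),                    # 22 BRANCH
]


def crdg_edges(body):
    """Return the set of (producer rank, consumer rank, kind) edges."""
    last_writer = {}  # register -> rank of its last writer in the whole body
    for rank, (dests, _) in enumerate(body):
        for reg in dests:
            last_writer[reg] = rank
    edges = set()
    current = {}  # register -> latest writer so far in the current iteration
    for rank, (dests, srcs) in enumerate(body):
        for reg in srcs:
            if reg in current:
                edges.add((current[reg], rank, "intra"))
            elif reg in last_writer:
                # Written only later in the body: the value read here was
                # produced by the previous iteration.
                edges.add((last_writer[reg], rank, "inter"))
            # Otherwise the value comes from outside the loop (not in the CRDG).
        for reg in dests:
            current[reg] = rank
    return edges
```

On this body, the µops that depend on themselves are 2, 11, 12, 13 and 20, and µop 11 has exactly two outgoing edges (to µop 3 and to itself), both inter-iteration, which agrees with the description of Figure 4.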
The longest chain in an RDG defines a minimum execution time. For a loop, the longest RDG chain can be found by considering cycles in the CRDG. Most loops have 1-µop cycles, like the loop shown in Figure 4 (µops 2, 11, 12, 13 and 20 depend on themselves). However, a few loops have cycles consisting of several µops. The longest chain in a loop RDG corresponds to the greatest CRDG cycle, and the chain length is equal to the cycle size times the number of loop iterations. The actual execution time, though, may be longer than this minimum time if resources for executing the µops are limited. The impact of resource constraints on loop performance can be taken into account by adding constraint edges to the CRDG. For instance, in order to simplify the hardware and permit a high clock frequency, we force the dynamic instances of the same static µop to execute in program order. This corresponds to adding to each µop in the CRDG a constraint edge to itself. Doing so does not increase the loop execution time. On the other hand, if we add a constraint edge between two distinct µops of the CRDG, we may create an artificial cycle consisting of several µops, thereby increasing the loop execution time.
5.2 A ring-shaped loop accelerator
Figure 5 depicts the global architecture of the proposed LA. The LA consists of 8 execution nodes (node for short) connected by a 4-lane pipelined bus. There are different node types, each node type being specialized for certain µops. The I-node executes µops having a 1-cycle latency. This includes ADDR µops, MOV µops (integer and floating-point) and all the µops that would normally execute on the superscalar core ALUs. The F-node can execute the most frequent floating-point operations, in particular additions and multiplications. It executes integer multiplications too. The L-node executes load µops and the S-node executes store µops. The M-node is a mutable node whose function is defined at loop building. It is actually an I-node augmented with floating-point dividers. If the loop body contains some floating-point division, the M-node executes only divisions. Otherwise it behaves like a normal I-node. It is not necessary for the LA to be able to execute the whole instruction set. For instance, we noticed that integer divisions are rare in the loops found by our loop detector in the SPEC CPU2006. Therefore, loops featuring integer divisions are executed by the superscalar core.
Figure 5: Ring-shaped loop accelerator: 8 execution nodes connected by a 4-lane pipelined bus. (Diagram; nodes shown: three I-nodes, two F-nodes, one L-node, one S-node and one M-node; the lane segments are drawn with latches and a mask.)
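
The division of labor between node types can be summarized by a small decision function. This is only a sketch under our own assumptions: the µop kind names (such as "FP_ADD" or "INT_DIV") are hypothetical labels, and any kind not mentioned in the text is conservatively treated as unsupported, which the report does not state.

```python
ONE_CYCLE_KINDS = {"ADDR", "MOV", "ALU"}         # executed by I-nodes
F_NODE_KINDS = {"FP_ADD", "FP_MUL", "INT_MUL"}   # executed by F-nodes


def plan_loop(body_kinds):
    """Decide whether a loop body can run on the LA and where its µops go.

    Returns None if the loop must stay on the superscalar core, otherwise
    (role of the M-node, {µop kind: node type}).
    """
    kinds = set(body_kinds)
    if "INT_DIV" in kinds:
        return None  # loops with integer divisions run on the superscalar core
    # The M-node function is fixed at loop building: it executes only
    # divisions if the body has a floating-point division, and otherwise
    # behaves like a normal I-node.
    m_node_role = "FP divider" if "FP_DIV" in kinds else "I-node"
    node_of = {}
    for kind in kinds:
        if kind in ONE_CYCLE_KINDS:
            node_of[kind] = "I-node"
        elif kind in F_NODE_KINDS:
            node_of[kind] = "F-node"
        elif kind == "LOAD":
            node_of[kind] = "L-node"
        elif kind == "STORE":
            node_of[kind] = "S-node"
        elif kind == "FP_DIV":
            node_of[kind] = "M-node"
        else:
            return None  # assumption: unlisted kinds are not supported
    return m_node_role, node_of
```

The point of the sketch is that the decision is taken once per loop, at loop building, rather than per dynamic µop.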
The pipelined bus. The pipelined bus consists of 4 lanes divided into 8 segments. The bus transports packets. A packet consists of a 128-bit data value (see footnote 2), a µop rank, and an 8-bit node vector (one bit per node). The µop rank tells which µop in the body is the producer of the data. The node vector tells which nodes need the data. The node vector is modified by a mask as it passes through segments. When the packet reaches the last node needing the data, the node vector becomes null and the packet is considered empty. It is then possible for that node (or a subsequent node) to overwrite the empty packet. In the worst case, a data value does a complete turn on the lane until it returns to the node from which it was emitted. There are 2 latches per lane segment (i.e., 16 latches per lane). When the bus clock is low, the first latch is transparent and the second latch is opaque. When the bus clock is high, it is the other way around. Because of lane segment RC delays, we assume that the bus is clocked at half the node clock frequency (see footnote 3). To compensate for this loss of throughput, each lane is doubled so that it transports two packets instead of one, as shown in Figure 5.
² We assume 128-bit data, as this is the data size required by Intel SSE operations. We did not try to optimize this. Future
work may try to reduce the data size to 64 bits, exploiting the fact that most operations are actually scalar.
³ It takes 16 LA clock cycles for a data to do a full turn on a lane.
Figure 6: Execution node (here, an I-node). Each execution node contains 4 execution units. [Diagram: lanes 0 to 3 feed four IQs; the DISTRIBUTOR feeds the EVQs of four EUs, each with its CEC, LEQ and OQ; the OQs write back to lanes 0 to 3.]
The execution node. Figure 6 shows the global structure of an I-node. Other node types share many
characteristics with the I-node. Each node contains 4 execution units (EU). Each EU can execute 1 µop
per cycle, i.e., the maximum node throughput is 4 µops per cycle. The loop builder maps static µops to
EUs. All the dynamic instances of a static µop are executed by the same EU. The µops executed by the
same EU execute in program order with respect to each other, but they can execute out of order
with respect to µops on other EUs. Once a µop is executed, its output data is sent to the Output Queue
(OQ), with the corresponding µop rank and node vector. A copy of the data is also pushed into the Loop
Exit Queue (LEQ). The LEQ is used at loop exit time for updating the architectural registers with the
correct values before resuming the execution on the superscalar core. Data in the OQ is sent onto the bus
and removed from the OQ as soon as possible, i.e., when there is an empty packet at the corresponding
lane segment. The execution of µops on a given EU is suspended if the associated OQ is on the verge of
becoming full. At the node input, packets are read from the lanes. If the corresponding bit is set in the
node vector, the data and µop rank are inserted into the Input Queue (IQ). We assume that the IQ never
gets full (we will see later how we enforce this). Consequently, the data produced by a static µop enter
the IQs where they are needed in sequential order. In each cycle, one data is dequeued from each IQ by
the distributor. The distributor sends these data to the External Value Queues (EVQ). Each EU has a
dedicated set of EVQs. A data from an IQ may be needed by several EUs in the same node. In this case
the distributor duplicates the data (up to 4 copies, one per EU). An EVQ is associated with one particular
static µop, i.e., all the data going through a given EVQ come from the same static µop. If a data cannot
be distributed because one of the target EVQs is full, the data remains in the IQ until all the target EVQs
can accept it. Non-constant data feeding the EU come either from an EVQ head or from the LEQ "top".
The LEQ top consists of the Nmax data output most recently by the EU, where Nmax is the maximum
number of static µops that can be mapped on the same EU (the LEQ is physically implemented as two
distinct parts). The EVQ head data is dequeued only after the value is no longer needed by any µop on
the EU. The execution of µops on an EU is controlled by a Cyclic Execution Controller (CEC). The CEC
loops through the Neu static µops mapped onto the EU, repeatedly and in program order. In particular, the
CEC controls the various multiplexors for selecting the EU inputs and dequeues data from EVQs. The CEC
also holds constant values, i.e., values that are used as input by some µops and that are guaranteed to be
constant throughout the loop execution (e.g., register values not modified by the loop). The CEC generates
a sequence number for each dynamic µop: the sequence number for the nth dynamic instance of a static
µop of rank k is Bmax × n + k, where Bmax = 8 × MBS is the maximum number of µops in a loop body⁴
and n corresponds to the iteration number. Each EU executes µops in program order, which means that
the sequence number increases monotonically. For µops with inter-iteration input dependencies, input data
for the first iteration are obtained at loop building. It should be noted that there exists a data path from
each EU to every other EU in the LA, thanks to the distributors. We have chosen not to have any direct
data path between EUs in the same node to simplify the hardware⁵. Hence if a µop takes as input a data
produced by another µop in the same node but on a different EU, this data must travel around the lane to
reach the consumer µop. If a short communication latency between 2 µops is needed for performance, we
must try to map these 2 µops on the same EU, so that the consumer µop can catch the data from the LEQ
top. If we find during loop building that the loop cannot be executed by the LA, the loop keeps executing
on the superscalar core. This happens for instance if the loop contains a µop that the LA cannot execute,
like an integer division. This happens also if Nmax is too small for that loop, or if there are too few EVQs.
The F-node and M-node have a global structure similar to the I-node. The L-node and S-node are somewhat
different.
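The sequence number formula Bmax × n + k can be illustrated with a short Python sketch. It assumes MBS = 128 (the value listed in Table 3), so that Bmax = 1024; because Bmax is a power of 2, the multiplication reduces to a shift. The function names are hypothetical.

```python
MBS = 128                          # maximum body size in instructions (Table 3)
BMAX = 8 * MBS                     # maximum number of µops in a loop body
LOG2_BMAX = BMAX.bit_length() - 1  # valid because BMAX is a power of 2


def sequence_number(iteration, rank):
    """Sequence number of the dynamic instance `iteration` of the static µop `rank`."""
    assert 0 <= rank < BMAX
    return (iteration << LOG2_BMAX) | rank  # same value as BMAX * iteration + rank


def split_sequence_number(seq):
    """Recover (iteration, rank) from a sequence number."""
    return seq >> LOG2_BMAX, seq & (BMAX - 1)
```

Within one EU the CEC visits ranks in increasing order and then moves to the next iteration, so the numbers it generates increase monotonically, as stated above.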
⁴ Bmax is a power of 2, so generating the sequence number is simple.
⁵ Future work may reconsider this choice.
The S-node. The S-node receives store addresses and store data from other nodes; it does not write on the
lanes (no OQ, no LEQ). There is one Loop Store Queue (LSQ) associated with each of the 4 EUs. EUs do
not really execute the store µops, they just gather the address and data for each dynamic store. The address
and data are then sent into the LSQ, along with the store sequence number. The S-node contains a store
validation unit (SVU). The SVU sends stores into the post-retirement store queue, which the LA shares with
the superscalar core and from which stores can write into the DL1 cache. The SVU has a table containing
the ranks of all the static store µops in the loop body, in program order. The SVU logic loops through
this table repeatedly, generating the sequence number corresponding to the next dynamic store, in program
order. This SVU sequence number is searched among the 4 LSQ heads. The store is then validated: it is
dequeued from its LSQ, sent to the post-retirement store queue, and the SVU sequence number is updated.
In our simulations, we have assumed that the S-node can validate 2 stores per cycle.
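The store validation loop can be sketched in Python as follows. This is a simplified model under stated assumptions, not the simulator's code: the class and method names are hypothetical, Bmax = 1024 is taken from Table 3 (8 × MBS), one store is validated per call, and the MSN freeze is left out.

```python
from collections import deque

BMAX = 1024  # 8 x MBS with MBS = 128 (Table 3)


class StoreValidationUnit:
    def __init__(self, store_ranks, num_lsqs=4):
        self.store_ranks = sorted(store_ranks)  # static store ranks, program order
        self.lsqs = [deque() for _ in range(num_lsqs)]
        self.index = 0       # position in the table of static store ranks
        self.iteration = 0   # iteration of the next store to validate
        self.post_retirement_queue = []

    def next_sequence_number(self):
        return BMAX * self.iteration + self.store_ranks[self.index]

    def push(self, lsq_id, seq, address, data):
        """An S-node EU has gathered the address and data of a dynamic store."""
        self.lsqs[lsq_id].append((seq, address, data))

    def validate_one(self):
        """Search the LSQ heads for the next store in program order."""
        expected = self.next_sequence_number()
        for lsq in self.lsqs:
            if lsq and lsq[0][0] == expected:
                self.post_retirement_queue.append(lsq.popleft())
                self.index += 1
                if self.index == len(self.store_ranks):
                    self.index = 0
                    self.iteration += 1
                return True
        return False  # the expected store has not reached an LSQ head yet
```

A store that arrives early simply waits at its LSQ head until the SVU sequence number catches up with it, which is how stores leave the S-node in program order.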
The L-node. The L-node receives load addresses from I-nodes. It reads the DL1 cache and sends the load
data through the bus to the other nodes. In the L-node, EUs are replaced with Load Issue Buffers (LIB)
and OQs are replaced with Load Output Queues (LOQ). Each CEC steps through the static loads that
have been mapped to it, in program order. The CEC allocates an entry in the LIB and in the LOQ for each
dynamic load. The load address and LOQ entry identifier are written in the LIB entry. The LOQ entry has
room for the load data. The CEC writes in the LOQ entry the µop sequence number and the 8-bit node
vector for the data. In each cycle, one load is selected from each LIB and accesses the cache, i.e., the L-node
can issue up to 4 loads per cycle. If the DL1 read succeeds, the load data is written into the LOQ entry, and
the LIB entry is freed. Otherwise (bank conflict, DL1 miss, TLB miss), the load waits in the LIB and will
be reissued later. Upon a miss, when the missing block is eventually inserted into the DL1, the associated
MSHR wakes up the LIB entries that were waiting for that block, and the corresponding loads become ready
for reissue. The LOQ behaves like a reorder buffer for loads. In a clock cycle, the load at the head of a LOQ
is dequeued if the load data is there and if the packet on the lane segment can be overwritten.
The global synchronizer. The CEC has an Nmax-entry µop buffer, a pointer on that buffer, and a Local
Iteration Count (LIC). The pointer points to the µop that will next access the EVQs⁶. The pointer can be
incremented in each clock cycle and is reset when its value becomes equal to Neu. Every time the pointer is
reset, the LIC is incremented. The 32 CECs (8 nodes, 4 lanes) work independently from each other. They
may have different LIC values at a given time. However, we need a global synchronizer for keeping the
program sequential semantics, a mechanism equivalent to the reorder buffer in a superscalar core.
In particular, the synchronizer must prevent a store from leaving the SVU before all the dynamic µops
preceding the store in program order have been executed, as these µops may trigger a loop exit. A natural
loop exit occurs when a static branch changes its behavior, i.e., a branch µop produces a result different from
the one recorded at loop building⁷. Stores after the loop exit point must not write into the DL1. After the
loop exit, the LEQs are used for updating the architectural registers. But the LEQ size is limited. Hence
one of the synchronizer's functions is to prevent the execution on an EU from going too far ahead, making
sure that the values needed for updating the architectural registers have not been pushed out of the LEQ.
The synchronizer is connected to all the nodes, receiving and sending signals from and to the nodes. The
SVU in the S-node maintains a maximum sequence number (MSN). If the loop contains any store, store
validation freezes when it reaches the MSN, and the SVU sends a signal to the synchronizer. Moreover,
each CEC has a maximum value LICmax for its LIC. Different CECs may have different LICmax values,
but with some constraints: in the S-node, all the CECs have the same LICmax = MinLICmax, where
MinLICmax is fixed at loop building. On the other nodes, LICmax ≥ MinLICmax. When the LIC in a
CEC reaches MinLICmax, the CEC sends a signal to the synchronizer. The CEC continues until the LIC
reaches LICmax, then the CEC freezes. When the synchronizer has received a signal from each CEC and
from the SVU, all CECs have a LIC greater than or equal to MinLICmax. The synchronizer then sends
an unfreeze signal to all nodes. Upon receiving the unfreeze signal, each CEC subtracts MinLICmax from its
LIC, which automatically unfreezes the frozen CECs (e.g., those in the S-node). Upon receiving the unfreeze
signal, the SVU adds MinLICmax × Bmax to the MSN, which unfreezes store validation. The synchronizer
maintains the program sequential semantics while preserving some decoupling between CECs. Still, it may
impact performance. LICmax determines the instruction window size. The higher LICmax, the larger the
instruction window and the more latency tolerance. However, the number of loop iterations in the window
must not exceed the LEQ size, i.e., LICmax + MinLICmax must not exceed the LEQ size divided by Neu.
Global synchronization latency may also impact performance: the time during which a CEC is frozen is a
waste of performance. Hence MinLICmax must not be too small. Another constraint is that MinLICmax
cannot be greater than the LSQ size divided by Neu, as the LSQ buffers the stores until the unfreeze signal
increases the MSN, which permits validating the stores waiting in the LSQ. In summary, the LEQs and LSQs
must be large enough for good performance.
⁶ The execution is fully pipelined (except floating-point divisions in the M-node). Data dependencies may stall the execution.
⁷ Exceptions are another loop exit condition.
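The two sizing constraints on LICmax and MinLICmax can be written as simple checks. The Python sketch below uses the LEQ and LSQ sizes from Table 3 as defaults; the function names are hypothetical, and each check is applied per EU with that EU's own Neu, which is how we read the text.

```python
LEQ_SIZE = 128  # entries per LEQ (Table 3)
LSQ_SIZE = 64   # entries per LSQ (Table 3)


def leq_constraint_ok(licmax, min_licmax, neu, leq_size=LEQ_SIZE):
    """LICmax + MinLICmax must not exceed the LEQ size divided by Neu."""
    return (licmax + min_licmax) * neu <= leq_size


def lsq_constraint_ok(min_licmax, neu, lsq_size=LSQ_SIZE):
    """MinLICmax must not exceed the LSQ size divided by Neu (S-node EUs)."""
    return min_licmax * neu <= lsq_size
```

For instance, an EU holding 4 static µops with MinLICmax = 8 can use LICmax up to 24, since (24 + 8) × 4 = 128; an S-node EU holding 4 static stores limits MinLICmax to 16.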
Loop exit. When the execution of a µop triggers a loop exit, the µop sequence number, which we call
the loop exit sequence number (LESN), is sent to the synchronizer⁸. The LESN is then broadcast to
all CECs. CECs whose sequence number already exceeds the LESN freeze immediately and ignore subsequent
unfreeze signals. Other CECs continue to work until exceeding the LESN. If a new loop exit is detected
while a previous loop exit was already pending, and if the new LESN is less than the LESN recorded in the
synchronizer, it becomes the new loop exit point. We can exit the loop acceleration mode when all CEC
sequence numbers exceed the LESN, all EU pipes have been drained, the SVU sequence number exceeds the
LESN, and each LOQ in the L-node is either empty or has its head entry sequence number exceeding the
LESN. Then, architectural registers can be updated with values found in the LEQs and the execution can
resume on the superscalar core, starting from the first dynamic instruction following the loop exit point.
Full IQ and premature loop exit. We have assumed that the IQs never get full. It is actually difficult
to make sure that an IQ can never get full. But it is possible to make it a rare event. If an IQ becomes full,
we trigger a premature loop exit. The LESN is set to MSN-1. Then, the loop exit takes place as described
previously. Nevertheless, for limiting the occurrence of premature loop exits without oversizing the IQs, we
introduce a throttling mechanism. We associate a wired-OR with each lane. When the occupancy of any IQ
on a lane exceeds a certain threshold (e.g., half the IQ capacity), the wired-OR is asserted and the OQs on
that lane stop sending data until the wired-OR is deasserted.
Deadlocks. The LA we have described is not immune to deadlocks. Instead of trying to prevent deadlocks
in all cases, it is simpler to make them as rare as possible. In particular, the EVQs must be large enough
to maintain a low deadlock probability. We detect a deadlock when no unfreeze signal has been generated
for 10000 cycles. When this happens, we trigger a premature loop exit. A deadlock leads to a much higher
performance penalty than a full IQ.
5.3 Memory dependencies
The proposed loop accelerator can emulate a window of several thousands of instructions. But we must
deal with memory dependencies. We must guarantee a correct execution without sacrificing too much
performance. We describe in this section some mechanisms to deal with memory dependencies inside loops.
We exploit loop properties to simplify memory dependency enforcement. First, we expect most loops to
exhibit very good temporal and spatial locality. Second, we expect constant-stride accesses to be frequent
in loops. Third, we expect dependencies between a store and a load in the same loop to repeat on each
iteration.
The memory zone checker. The memory zone checker (MZC) is a table located in the L-node but
accessed both by loads and stores. The purpose of the MZC is to detect memory order violations within
loops. The MZC takes advantage of the spatial locality that exists in loops. Our MZC is conceptually similar
to the Memory Disambiguation Table described by Stone et al. [15] except that we record only loads in the
MZC. We define a zone as a memory region whose size is a power of two and which is aligned in memory.
The main originality of our MZC is that the zone size is fixed at loop building (cf. Section 5.6). The zone
address is obtained from the load/store address by a right shift of the address bits. Each MZC entry is
tagged with a distinct zone address and contains a valid bit, a load sequence number, a load address, a load
data size, a conflict bit, a load PC and a known_pc bit. When a load executes, it accesses the MZC with its
zone address. Upon a MZC miss, we search for a free entry and we initialize it. A free entry is an entry whose
valid bit was reset because its sequence number is less than the SVU sequence number. If no free entry
is found, a premature loop exit is triggered. Upon a MZC hit, the sequence number in the MZC entry is
compared with the load sequence number. If the load sequence number is greater, it overwrites the entry
sequence number. If the data address and data size in the entry do not match those of the load, we set the
conflict bit. If the load PC does not match the PC in the entry, we reset the known_pc bit. The MZC is
also accessed by stores when they are validated by the SVU. A potential memory order violation is detected
if the store zone hits in the MZC, if the store sequence number is less than the entry sequence number, and
if the conflict bit is set or if the store data overlaps with the address and data size recorded in the entry.
The optimal zone size, i.e., the one that minimizes the number of premature loop exits, is not the same for
all loops. If there is good spatial locality in the loop, taking a large zone prevents filling the MZC. On the
other hand, if loads and stores are independent but access the same memory zones, decreasing the zone size
may prevent unnecessary premature loop exits. In our simulations, we have assumed that the MZC could
be updated by 4 loads and read by 2 stores in the same cycle.
⁸ The loop exit point must correspond to an architectural state, i.e., the loop exit µop must be the last µop of an instruction.
Figure 7: Store-to-load bypassing (a and b are the ranks of the store and CHECK µops after transformation). [Diagram with three panels, "dependency is predicted", "bypassing" and "double bypassing", showing the LOAD, STORE, MOV and CHECK µops with ranks a, a+1, a+2 and b, b+1, b+2, b+3.]
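The MZC load and store rules can be condensed into the following Python sketch. It is an illustration under assumptions, not the simulated hardware: the class and method names are hypothetical, the table is modelled as a dictionary keyed by zone address, stale entries are freed lazily on a miss, and accesses that straddle two zones are ignored.

```python
class MemoryZoneChecker:
    def __init__(self, num_entries=64, log2_zone_size=10):
        self.num_entries = num_entries
        self.log2_zone_size = log2_zone_size
        self.entries = {}  # zone address -> entry fields

    def on_load(self, addr, size, seq, pc, svu_seq):
        """Record a load. Returns False if the MZC is full (premature loop exit)."""
        zone = addr >> self.log2_zone_size
        entry = self.entries.get(zone)
        if entry is None:
            # Entries older than the SVU sequence number are free.
            self.entries = {z: e for z, e in self.entries.items()
                            if e["seq"] >= svu_seq}
            if len(self.entries) >= self.num_entries:
                return False
            self.entries[zone] = {"seq": seq, "addr": addr, "size": size,
                                  "conflict": False, "pc": pc, "known_pc": True}
            return True
        if seq > entry["seq"]:
            entry["seq"] = seq
        if (addr, size) != (entry["addr"], entry["size"]):
            entry["conflict"] = True
        if pc != entry["pc"]:
            entry["known_pc"] = False
        return True

    def store_may_violate(self, addr, size, seq):
        """Check a store being validated against the loads recorded so far."""
        entry = self.entries.get(addr >> self.log2_zone_size)
        if entry is None or seq >= entry["seq"]:
            return False  # zone miss, or no younger load recorded in the zone
        overlap = (addr < entry["addr"] + entry["size"]
                   and entry["addr"] < addr + size)
        return entry["conflict"] or overlap
```

Once two different load addresses have hit the same zone, the conflict bit makes every older store to that zone a potential violation, which is why a zone that is too large can cause unnecessary premature loop exits.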
The stride-based checker. The MZC is sufficient to guarantee a correct execution, but it is not sufficient for
good performance. A reasonably-sized MZC may trigger a lot of premature loop exits on nonexistent memory
dependencies. We introduce a stride-based checker (SBC) to assist the MZC. The SBC takes advantage of
constant-stride memory accesses. The SBC is a small fully-associative table where each entry summarizes
the memory accesses generated by a particular static load. Each SBC entry is tagged with the load PC and
contains a valid bit, a sequence number, a first address, a last address, a stride, a data size, and an iteration
count. When executing a load, we search that load PC in the SBC. If an entry exists, we update it as follows.
If the iteration count is null, we initialize the entry with the load address, data size and sequence number.
Else, if the iteration count is non-null, we compute S = load_address − last_address. If the iteration count
equals 1, we set the stride to S. Otherwise, if the iteration count is greater than 1 and if the stride differs
from S, we invalidate the entry. Finally, we set the last address to the load address, and we increment
the iteration count. When validating a store, if the MZC detects a potential memory order violation and if
the known_pc bit is set, we access the SBC with the load PC recorded in the MZC entry. If the entry does
not exist, we create it (possibly evicting a load), reset its content, and a premature loop exit is triggered
with the store as the loop exit point. If the entry exists, we use its content to confirm the possible memory
order violation. We use the first address, the last address, the stride sign and the data size to find which
memory region the load has accessed so far. If the store data does not overlap with that region, we are
sure that there was no memory order violation for that store. Otherwise, we assume that there was one,
and a premature loop exit is triggered. When validating a store, we "remove" the oldest dynamic load of
every valid SBC entry whose iteration count is non-null and whose sequence number is less than the store
sequence number: we decrement the iteration count and, if the iteration count is non-null, we add Bmax to
the sequence number and we add the recorded stride to the first address.
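The update rules of one SBC entry can be sketched in Python as follows. The class and method names are hypothetical, Bmax = 1024 is taken from Table 3 (8 × MBS), and two behaviors are assumptions of this sketch rather than statements of the report: an invalidated entry conservatively confirms the violation, and an entry with a null iteration count reports no overlap.

```python
from dataclasses import dataclass

BMAX = 1024  # 8 x MBS with MBS = 128 (Table 3)


@dataclass
class SBCEntry:
    valid: bool = True
    seq: int = 0      # sequence number of the oldest tracked dynamic load
    first: int = 0    # address of the oldest tracked load
    last: int = 0     # address of the most recent load
    stride: int = 0
    size: int = 0
    count: int = 0    # iteration count

    def on_load(self, addr, size, seq):
        if not self.valid:
            return
        if self.count == 0:
            self.first, self.size, self.seq = addr, size, seq
        else:
            s = addr - self.last
            if self.count == 1:
                self.stride = s
            elif self.stride != s:
                self.valid = False  # not a constant-stride load
                return
        self.last = addr
        self.count += 1

    def store_overlaps(self, addr, size):
        """True if the store may touch the region read by the tracked loads."""
        if not self.valid:
            return True   # assumption: fall back to the MZC verdict
        if self.count == 0:
            return False  # assumption: no tracked load in flight
        lo = min(self.first, self.last)
        hi = max(self.first, self.last) + self.size
        return addr < hi and lo < addr + size

    def on_store_validated(self, store_seq):
        """Drop the oldest tracked load once a younger store is validated."""
        if self.valid and self.count != 0 and self.seq < store_seq:
            self.count -= 1
            if self.count != 0:
                self.seq += BMAX
                self.first += self.stride
```

Taking the minimum and maximum of the first and last addresses covers both stride signs, so the accessed region is [min, max + size) in either direction.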
Store-to-load bypassing. Some loops actually contain true memory dependencies. A solution for allowing
the LA to execute these loops is to do store-to-load bypassing at loop building. The loop builder contains a
Store Sets memory dependency predictor [2], identical to the one implemented in the superscalar core⁹. At
loop building, if a static load is predicted to depend on a static store with the same data size, we transform
the body as shown in Figure 7. The store µop stays in place, but 2 extra MOV µops are added in the loop
body just behind the store¹⁰. The load µop is removed from the body and replaced with a CHECK µop and
a MOV µop. The MOV µop transmits the store data to the µops consuming the load value. The CHECK
µop compares the load and store addresses. All CHECK µops and MOV µops execute on I-nodes. If a
CHECK fails, a loop exit is triggered, taking as loop exit point the dynamic µop preceding (in sequential
order) the instruction containing the removed load.
⁹ Our store sets LFST is fully associative, tagged with the SSID, which permits taking a wide SSID while keeping the LFST
small.
¹⁰ Several loads may be predicted to depend on the same store. The MOV µops for the store are generated only once.
The Bypassed Store Table. Even if a CHECK succeeds, we must still verify that stores between the
dependent store-load pair do not write that memory location. The SVU contains a small fully-associative
Bypassed Store Table (BST). Each BST entry contains a store address, a store PC, a data size and a lifetime.
When validating a bypassed store, we search for a free BST entry for that store, i.e., an entry whose lifetime is
null. We initialize this entry with the store address, PC, data size, and we set the lifetime to the maximum
number of dynamic stores separating the bypassed store-load pair(s). This value, which may be null, is
obtained during loop building. If the lifetime is non-null and no free entry was found in the BST, we trigger
a loop exit with that bypassed store as the loop exit point. When validating a store (whether bypassed or
not), we check whether the store conflicts with any BST entry whose lifetime is non-null. If a conflict is
detected, a loop exit is triggered with the validated store as the loop exit point. For each validated store, we
decrement all the non-null lifetime values in the BST. When a conflict is detected, we train the store sets
predictor with the validated store PC and the bypassed store PC. Doing so merges these two stores' store
sets [2], so that on future occurrences of that loop, the load is predicted to depend on the correct store.
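The BST bookkeeping can be sketched in Python as below. The names are hypothetical, and the order of the three steps for a validated store (conflict check, lifetime decrement, then allocation for a bypassed store) is an assumption of this sketch; the report does not spell it out. Store sets training is omitted.

```python
def ranges_overlap(addr_a, size_a, addr_b, size_b):
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


class BypassedStoreTable:
    def __init__(self, num_entries=16):
        # A null lifetime marks a free entry.
        self.entries = [{"addr": 0, "pc": 0, "size": 0, "lifetime": 0}
                        for _ in range(num_entries)]

    def validate_store(self, addr, size, pc, bypassed=False, lifetime=0):
        """Return None if validation proceeds, else the reason for a loop exit."""
        # 1. Any validated store is checked against the live entries.
        for e in self.entries:
            if e["lifetime"] > 0 and ranges_overlap(addr, size,
                                                    e["addr"], e["size"]):
                return "conflict"
        # 2. Each validated store ages the live entries.
        for e in self.entries:
            if e["lifetime"] > 0:
                e["lifetime"] -= 1
        # 3. A bypassed store with a non-null lifetime needs an entry.
        if bypassed and lifetime > 0:
            free = next((e for e in self.entries if e["lifetime"] == 0), None)
            if free is None:
                return "bst_full"
            free.update(addr=addr, pc=pc, size=size, lifetime=lifetime)
        return None
```

With a lifetime of 1, the entry protects the bypassed location against exactly one intervening store and expires afterwards.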
Double bypassing. The bypassing method described previously is effective only if the distance between
the dynamic store and the dynamic load is less than one loop body. We found that this represents the
majority of cases. However one benchmark, 456.hmmer, suffered many failed CHECKs. To solve this case,
we introduce double bypassing, which is the possibility for a dynamic store to communicate its data to a
dynamic load at a distance between one and two bodies. Double bypassing is illustrated in Figure 7. It uses
the BST like simple bypassing, except that the BST entry lifetime is set to a longer value. The loop builder
contains a Failed Check Table (FCT), which is a small fully associative table. When a CHECK triggers
a loop exit, we record in the FCT the bypassed load PC, so that if a predicted-dependent load hits in the
FCT during loop building, we apply double bypassing.
5.4 Loop body reduction
Redundant execution exists in most programs [13]. In the example of Figure 4, µops 8 and 9 are redundant:
they produce the same result on each iteration. This result is obtained at loop building, hence these µops
can be removed from the loop body before executing the loop on the LA. This is an iterative process: in
each iteration during loop building, we remove from the CRDG the µops that have no inputs left in the
CRDG. The number of redundant µops identified depends on LBI¹¹. We must be careful with loads and
stores. A store µop, even if it produces the same result on each iteration, cannot be considered redundant in
a shared-memory architecture, as removing it would break the memory consistency model. A redundant load
can be removed, but we must check memory dependencies. Some hardware support is necessary for that.
The SVU contains a Removed Loads Table (RLT). Each RLT entry contains a valid bit, a load address, a
load data size and a load PC. During loop building, as long as there is room in the RLT, redundant loads
are removed from the body and recorded in the RLT. When validating a store, we check whether the store
conflicts with any RLT entry. If a conflict is detected, a loop exit is triggered with the store as the loop
exit point. The PC recorded in the conflicting RLT entry is used to train the store sets predictor. For
loopy benchmarks, we found that redundant µops typically represent between 10% and 20% of dynamic
loop µops.
MOV bypassing. It is possible to bypass certain MOV µops, meaning that, at loop building, we make
the MOV's consumer µop depend directly on the MOV's producer µop. MOV bypassing is possible if the
distance (in the sequence of dynamic instructions) between the producer and consumer µops is less than
one loop body. If MOV bypassing can be applied for all the consumers of the MOV, the MOV itself can be
removed from the body¹².
¹¹ With LBI=5, we were able to remove all the redundant µops.
¹² Solutions exist for bypassing all MOVs; this is left as future work.
5.5 Mapping heuristic
The LA performance is very dependent on the mapping of µops onto EUs. The in-depth study of mapping
heuristics is left for future studies. For this preliminary study, we tried to find a good-enough mapping
through trial and error. The heuristic we found helped us understand some important properties for a good
mapping heuristic on such an LA. We just give a high-level description here, omitting some details.
lok frequeny : 3 GHz ; LM table : 4 entries ; LT : 64 entries ; MinLM : 900 instrutions ;
LBI : 5 iterations; MBS : 128 instrutions ; Nmax : 12 µops; EVQs per EU : 12 ; EVQ : 32 data ;
IQ : 32 data ; OQ : 16 data ; LIB : 64 loads ; LOQ : 128 data ; LEQ : 128 data ; LSQ : 64 stores ;
MZC : 64 entries ; SBC : 16 entries ; BST : 16 entries ; FCT : 16 entries ; RLT : 64 entries ; SVU
throughput : 2 stores / yle ; unfreeze signal lateny : 4 yles ; wired-OR lateny : 4 yles
; loop mode transition penalty : 100 yles ; extra transition penalty on a LT miss : 10000
yles ; store sets SSIT : 4096 entries ; store sets LFST : 16 entries ; SSIT learing period : 10
millions yles ;
Table 3: Fixed parameters for the loop detetor, loop builder, and loop aelerator.
EU balancing. The µops mapped onto the same EU form a cycle in the constraint-augmented CRDG.
Therefore, the average number of clock cycles per loop iteration cannot be less than the Neu of the most
loaded EU. The mapping heuristic must try to reach a balanced distribution of µops on EUs. This is
particularly important for loops with a small body: the mapping heuristic should avoid putting two µops on
the same EU whenever possible, as this potentially doubles the loop execution time. Our mapping heuristic
computes, from the body characteristics, a µop quota per EU type, assuming EU balancing. Once a µop is
mapped onto an EU, we consider another µop, and so on until all µops have been mapped. We try to avoid
mapping a µop onto an EU that has already reached its quota.
Lane segment sharing. Data that must go through the same lane segment share its bandwidth, which we
have limited to one data per LA clock cycle. The µops producing these data cannot execute at an average
rate greater than the rate at which the data go through the shared lane segments. Our mapping heuristic
tries to minimize lane usage but does not try explicitly to minimize lane segment sharing¹³. We try to map
a µop on the same EU as one of its input µops, provided this EU has not reached its quota. Many µops
have a single consumer, and this often permits avoiding using the bus for transmitting the data. If we must
use the bus, we try to put the consumer µop on the node following the node where the producer µop has
been mapped, whenever possible.
Natural CRDG cycles. The CRDG contains some natural cycles due only to data dependencies. Most
natural cycles are 1-µop cycles consisting of an integer addition depending on itself. Yet, some loops contain
multi-µop cycles which may increase the loop execution time considerably, as it takes several cycles for a
data to travel from one EU to another. Whenever possible, our mapping heuristic tries to map onto the
same EU the µops forming a natural cycle.
Artificial CRDG cycles. Artificial CRDG cycles are cycles in the constraint-augmented CRDG that
are not natural cycles. Artificial CRDG cycles may increase the loop execution time dramatically, either
because the cycle contains floating-point operations or, worse, because some µops in the cycle are mapped
onto different EUs. Mapping µops onto the same EUs as their producers permits decreasing the probability
of creating costly CRDG cycles, but this is not always sufficient. Whenever possible, our mapping heuristic
tries to avoid mapping a µop on an EU if this would create an artificial CRDG cycle.
5.6 Performance tuning
Our simulator implements the loop detector and loop accelerator as described in Sections 4 and 5, i.e., with
a great level of detail. Parameters we have used for the simulation are given in Table 3. Notice the loop mode
transition penalty, which we have fixed at 100 clock cycles. We assume that this penalty takes into account
the time elapsed after loop building and before the LA starts executing the loop, and the time necessary to
update the architectural registers after the loop exit. In case of a LT miss, we add an extra penalty of 10000
cycles.
Maximum LICmax. For good performance, the values of LICmax and MinLICmax are set dynamically
at loop building. As explained before, we try to set LICmax and MinLICmax as high as possible but under
the limit permitted by the LSQ and the LEQ (which also depends on Neu). However, when LICmax exceeds
the EVQ depth, the deadlock probability becomes non-null. Actually, there is a LICmax value beyond which
the deadlock probability increases dramatically. This value is not the same for all loops. To solve this issue,
we record in each LT entry a value ML which defines a maximum for LICmax. When a deadlock occurs, we
halve the ML value recorded in the LT entry for that loop. Moreover, a high LICmax is useful only for loops
with a short body, for which a large instruction window represents many iterations. If the body exceeds 20
instructions, or if the time elapsed since the last deadlock is less than 1 million cycles, we set ML equal to
the EVQ depth. Otherwise, we allow ML to be as high as 128. Overall, with this method, deadlocks have a
negligible impact on performance.
¹³ This is a possible area of improvement.
Figure 8: Global performance speedups obtained with the loop accelerator, relative to the baseline with no
accelerator (higher is better). The second bar represents the fraction of instructions executed by the loop
accelerator. [Bar chart: x-axis "benchmark" with benchmarks 400, 401, 403, 410, 416, 429, 433, 434, 435, 436, 437, 444, 445, 447, 450, 453, 454, 456, 458, 459, 462, 464, 465, 470, 471, 473, 481, 482, 483; y-axis from 0.0 to 1.6; series "performance" and "fraction loop instructions".]
Memory zone size. The optimal memory zone size is not the same for all loops. We record in each LT
entry the log2 of the zone size. Each LT entry also contains a 3-bit saturating counter. The zone size for
a loop is initially set to 1024 bytes. When a premature loop exit occurs because the MZC is full, the 3-bit
counter is incremented. If the 3-bit counter value is equal to 7, we double the zone size. If the MZC or the
SBC signals a memory order violation, we decrement the 3-bit counter. If the 3-bit counter value is null, we
halve the zone size.
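One possible reading of this zone size adaptation is sketched below in Python. Several details are assumptions of the sketch because the report does not state them: the initial counter value (4 here), the fact that the counter keeps its saturated value after a resize, and the fact that the size test is done right after each counter update. The class name is hypothetical.

```python
class ZoneSizeTuner:
    def __init__(self, log2_zone_size=10, counter=4):
        self.log2_zone_size = log2_zone_size  # 2**10 = 1024 bytes initially
        self.counter = counter                # 3-bit saturating counter (0..7)

    @property
    def zone_size(self):
        return 1 << self.log2_zone_size

    def on_mzc_full(self):
        """Premature loop exit because the MZC is full: zones may be too small."""
        self.counter = min(7, self.counter + 1)
        if self.counter == 7:
            self.log2_zone_size += 1

    def on_order_violation(self):
        """MZC or SBC signalled a memory order violation: zones may be too large."""
        self.counter = max(0, self.counter - 1)
        if self.counter == 0 and self.log2_zone_size > 0:
            self.log2_zone_size -= 1
```

The counter adds hysteresis: several events of the same kind are needed before the zone size changes direction.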
Disabling "bad" loops. Migrating the execution to the LA may actually decrease performance. This
generally happens because of premature loop exits. For good performance, "bad" loops must be detected
and their execution on the LA disabled. Each LT entry contains a TimeLost value which estimates the time
that has been lost so far by executing the loop on the LA rather than on the superscalar core. On a loop exit,
the loop detector is trained with the number $n_i$ of instructions executed by the LA, the time $n_t$ spent on the
LA and the number $n_m$ of L3 cache misses generated by the loop. The TimeLost value is updated as follows:
$$\mathit{TimeLost} \leftarrow \mathit{TimeLost} + n_t - 0.7 \times n_i - 20 \times n_m$$
where the values 0.7 and 20 were found empirically. TimeLost is initially set to 0. For most loops it becomes
negative, indicating that the LA is likely to perform better than the superscalar core. For some loops,
TimeLost increases. When TimeLost exceeds 100000 cycles, we consider that this is a "bad" loop. We
occasionally reset all the TimeLost values in the LT: we do this when the time elapsed since the last reset
exceeds 100 times the sum of positive TimeLost values in the LT.
5.7 Simulation results
Figure 8 shows the performance obtained with the loop accelerator. Speedups are relative to the simulated
baseline (cf. Figure 1). The second bar in Figure 1 shows the fraction of instructions executed by the
LA. Some speedups are obtained on most loopy benchmarks, except 433.milc, whose loop behavior comes
mostly from loop bodies greater than 128 instructions. Performance gains exceeding 25% are obtained on
6 benchmarks: 434.zeusmp, 437.leslie3d, 456.hmmer, 459.GemsFDTD, 462.libquantum and 481.wrf. These
6 benchmarks are the ones that execute at least 60% of the instructions on the LA. We measured the local
acceleration provided by the LA on the loop fraction. For the 6 benchmarks mentioned above, the local
acceleration is respectively 2.3, 2, 1.7, 2.8, 2.1 and 1.9 (ignoring the transition penalty). This level of local
acceleration would be difficult to obtain with conventional superscalar techniques.
Benchmark         all     no SBC   no byp.   no redu.   naive mapping   ML = 32   zone = 1 KB   bad loops
434.zeusmp        1.43    1.32     1.33      1.18       1.00            1.42      1.25          1.43
437.leslie3d      1.45    1.26     1.43      1.10       0.95            1.44      1.40          1.38
456.hmmer         1.38    0.99     0.99      1.22       0.99            1.38      0.99          1.38
459.GemsFDTD      1.32    1.00     1.13      1.13       1.00            1.33      1.28          1.32
462.libquantum    1.39    1.24     1.39      1.37       1.19            1.19      1.39          1.39
481.wrf           1.26    1.12     1.21      1.19       0.95            1.26      1.26          1.26
29 bench. mean    1.084   1.036    1.055     1.045      0.996           1.068     1.061         1.072
Table 4: Performane relative to the baseline. The seond olumn is for all features enabled. Following
olumns give performane when disabling a single feature, respetively, disabling the SBC, store-to-load
bypassing, loop body redution, using a naive mapping heuristi, using a small ML, using a xed zone size,
and allowing bad loops.
Table 4 shows performance for the 6 benchmarks mentioned above and the average performance on
all 29 benchmarks. The second column is for all features enabled. The following columns give performance
when disabling a single feature. The mapping heuristic is very important. To quantify its impact, we have
simulated a "naive" mapping heuristic which scans all EUs in a fixed order: we map on the current EU
one µop that can be mapped on it, then we move to the next EU. We do this repeatedly until all µops have
been mapped. With this naive heuristic, the average performance is even slightly lower than the baseline
(0.996). Another important feature is the SBC. Without it, false memory dependencies trigger many premature
loop exits. Loop body reduction is also very important. Without it, the loop accelerator is clearly not
working at its full potential. Store-to-load bypassing brings significant performance gains on 456.hmmer
and 459.GemsFDTD. The last 3 columns show that the performance tuning described in Section 5.6 brings
non-negligible performance gains. The fixed memory zone size prevents 456.hmmer from benefiting from the
accelerator. Using a large ML is important for 462.libquantum, which has small loop bodies. Disabling bad
loops is a useful feature, especially for benchmarks which do not benefit from the accelerator.
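The "naive" mapping heuristic used as a point of comparison above can be sketched in a few lines. This is only an illustration of the round-robin scan as it is described in the text: the function name, the string identifiers for µops and EUs, and the `can_map` predicate are hypothetical, and real constraints such as the maximum number of static µops per EU are ignored.

```python
from typing import Callable, Sequence


def naive_mapping(
    uops: Sequence[str],
    eus: Sequence[str],
    can_map: Callable[[str, str], bool],
) -> dict[str, str]:
    """Scan the EUs in a fixed order, placing at most one µop per EU per pass."""
    unmapped = list(uops)
    mapping: dict[str, str] = {}
    while unmapped:
        progressed = False
        for eu in eus:                    # fixed scan order
            for uop in unmapped:
                if can_map(uop, eu):
                    mapping[uop] = eu
                    unmapped.remove(uop)  # safe: we leave the inner loop at once
                    progressed = True
                    break                 # one µop for this EU, then move on
        if not progressed:
            # No EU accepts any remaining µop; avoid looping forever.
            raise ValueError(f"no EU accepts: {unmapped}")
    return mapping
```

With three integer µops, one floating-point µop, two integer EUs and one floating-point EU, the first pass places one µop on each EU and the second pass places the remaining integer µop on the first integer EU. The scan never looks at dependencies between µops, which is one plausible reason why such a mapping performs poorly.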
6 Conlusion and future work
We have proposed a hardware mechanism for detecting dynamic loops. We found that about one third of
all the instructions executed by the SPEC CPU2006 suite come from dynamic loops whose body size can be
quite diverse, from a few instructions to several hundreds.
We have proposed a loop accelerator microarchitecture that can accelerate most dynamic loops without
help from the compiler or the programmer. The accelerator configuration we have focused on has 8 nodes
and 4 lanes, and can execute up to 32 µops simultaneously from a window of several thousands of µops.
Its local hardware complexity is no greater than that of a conventional superscalar core. Our efforts for
obtaining good performance speedups have shown the importance of mapping µops onto execution units
very carefully. We have proposed new solutions for dealing with memory dependencies in a window of
several thousands of instructions, exploiting loop properties. We have shown that a significant fraction of
dynamic loop instructions are redundant and need not be executed by the loop accelerator.
This is a preliminary study, intended to provide a basis for future work on loop acceleration. We believe
the design space of loop accelerators is huge. The mapping heuristic is very important for performance.
Some of the characteristics we have outlined for a good mapping heuristic are, we believe, somewhat general,
like the importance of preventing artificial CRDG cycles. Other aspects of our mapping heuristic may be
dependent on some of the choices we made, like not providing any direct data path between EUs on the same
node, or choosing a pipelined ring for communicating between nodes. Future studies may reconsider these
choices and try to find a better tradeoff between the IPC, the hardware complexity of the bus and nodes,
and the mapping heuristic. Nevertheless, we believe that the loop accelerator microarchitecture we have
proposed is scalable beyond the particular configuration we considered. For instance, increasing the number
of I-nodes and F-nodes will not increase the local hardware complexity. Increasing the number of lanes
will inrease the omplexity of the distributor and some parts of the L-node (for instane the MZC, whose
number of ports is proportional to the number of lanes). Yet, we believe that if one an double the width of
the supersalar ore, doubling the number of lanes should not be more diult. The node miroarhiteture
depited in Figure 6 is, we believe, easier to pipeline than a onventional 4-way supersalar exeution ore.
Future work may onsider the possibility to overlok the loop aelerator. The global speedups we have
demonstrated are limited by Amdahl's law, i.e., by the fration of the exeution spent in dynami loops.
Nervertheless, the loal aeleration we have obtained on dynami loops is onsiderable for a hardware-only
solution. A ompiler aware of the presene of a loop aelerator may try to exploit it.
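The Amdahl's law limit mentioned above can be made concrete with a small sketch. It assumes the textbook form $S = 1 / ((1 - f) + f / a)$, where $f$ is the fraction of baseline execution time spent in accelerated loops and $a$ is the local acceleration. Note that $f$ here is a time fraction, which is not the same quantity as the fraction of instructions reported in the figures, and that transition penalties are ignored; the function names are hypothetical.

```python
def global_speedup(loop_time_fraction: float, local_acceleration: float) -> float:
    """Amdahl's law: overall speedup when only part of the run is accelerated."""
    if not 0.0 <= loop_time_fraction <= 1.0:
        raise ValueError("loop_time_fraction must be in [0, 1]")
    if local_acceleration <= 0.0:
        raise ValueError("local_acceleration must be positive")
    sequential_part = 1.0 - loop_time_fraction
    accelerated_part = loop_time_fraction / local_acceleration
    return 1.0 / (sequential_part + accelerated_part)


def speedup_bound(loop_time_fraction: float) -> float:
    """Upper bound on the global speedup for an infinitely fast accelerator."""
    if not 0.0 <= loop_time_fraction < 1.0:
        raise ValueError("loop_time_fraction must be in [0, 1)")
    return 1.0 / (1.0 - loop_time_fraction)
```

For a hypothetical program that spends half of its baseline time in loops, a local acceleration of 2 gives a global speedup of about 1.33, and no accelerator, however fast, could exceed a global speedup of 2.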
Referenes
[1℄ T. W. Barr, A. L. Cox, and S. Rixner. Translation ahing: Skip, don't walk (the page table). In Pro.
of the 37th Int. Symp. on Computer arhiteture, 2010.
[2℄ G. Z. Chrysos and J. S. Emer. Memory dependene predition using store sets. In Pro. of the 25th
Int. Symp. on Computer Arhiteture, 1998.
[3℄ N. Clark, A. Hormati, and S. Mahlke. VEAL : virtualized exeution aelerator for loops. In Pro. of
the 35th Int. Symp. on Computer Arhiteture, 2008.
[4℄ M. Dixon, P. Hammarlund, S. Jourdan, and R. Singhal. The next-generation Intel Core miroarhite-
ture. Intel Tehnology Journal, 14(3), 2010.
[5℄ A. Garía, O. J. Santana, E. Fernández, P. Medina, and M. Valero. LPA : a rst approah to the loop
proessor arhiteture. In Pro. of the 3rd Int. Conf. on High Performane Embedded Arhitetures and
Compilers, 2008.
[6℄ M. D. Hill and M. R. Marty. Amdahl's law in the multiore era. IEEE Computer, 41(7):3338, July
2008.
[7℄ M. Kobayashi. Dynami harateristis of loops. IEEE Transations on Computers, -33(2):125132,
1984.
[8℄ G. H. Loh, S. Subramaniam, and Y. Xie. Zesto : a yle-level simulator for highly detailed miroar-
hiteture exploration. In Pro. of the Int. Symp. on Performane Analysis of Systems and Software,
2009.
[9℄ C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallae, V. Janapa Reddi, and
K. Hazelwood. Pin : building ustomized program analysis tools with dynami instrumentation. In
Pro. of the ACM SIGPLAN Conferene on Programming Language Design and Implementation, 2005.
http://www.pintool.org.
[10℄ M. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr, and J. Emer. Adaptive insertion poliies for high
performane ahing. In Pro. of the 34th Int. Symp. on Computer Arhiteture, 2007.
[11℄ E. Rotenberg, Q. Jaobson, Y. Sazeides, and J. E. Smith. Trae proessors. In Pro. of the 30th Int.
Symp. on Miroarhiteture, 1997.
[12℄ A. Sezne and P. Mihaud. A ase for (partially) tagged geometri history length branh predition.
Journal of Instrution Level Parallelism, April 2006. http://www.jilp.org/vol8.
[13℄ A. Sodani and G. S. Sohi. Dynami instrution reuse. In Pro. of the 24th Int. Symp. on Computer
Arhiteture, 1997.
[14℄ G. Stitt, R. Lyseky, and F. Vahid. Dynami hardware/software partitioning : a rst approah. In
Pro. of the Design Automation Conferene, 2003.
[15℄ S. S. Stone, K. M. Woley, and M. I. Frank. Address-indexed memory disambiguation and store-to-load
forwarding. In Pro. of the 38th Int. Symp. on Miroarhiteture, 2005.
RR n° 7802
Hardware aeleration of sequential loops 19
[16℄ J. M. Tendler, J. S. Dodson, J. S. Field, H. Le, and B. Sinharoy. POWER4 system arhiteture. IBM
Journal of Researh and Development, 46(1), January 2002.
[17℄ J. Tubella and A. González. Control speulation in multithreaded proessors through dynami loop
detetion. In Pro. of the 4th Int. Symp. on High-Performane Computer Arhiteture, 1998.
[18℄ S. Vajapeyam, P. J. Joseph, and T. Mitra. Dynami vetorization : a mehanism for exploiting far-ung
ILP in ordinary programs. In Pro. of the 26th Int. Symp. on Computer Arhiteture, 1999.
[19℄ S. Vajapeyam and T. Mitra. Improving supersalar instrution dispath and issue by exploiting dynami
ode sequenes. In Pro. of the 24th Int. Symp. on Computer Arhiteture, 1997.
Bmax: maximum number of µops for the loop body; BST: bypassed store table; CEC: cyclic
execution controller; CRDG: cyclic register dependency graph; EU: execution unit; EVQ:
external value queue; FCT: failed check table; IQ: input queue; LA: loop accelerator; LBI: loop
build iterations; LEQ: loop exit queue; LESN: loop exit sequence number; LIB: load issue buffer;
LIC: local iteration count; LM: loop monitor; LOQ: load output queue; LSQ: loop store queue;
LT: loop table; MBS: maximum body size in instructions; MinLM: LM threshold in instructions;
ML: maximum value of LICmax; MSN: maximum sequence number; MZC: memory zone checker;
Neu: number of static µops mapped onto a given EU; Nmax: maximum number of static µops per
EU; OQ: output queue; RDG: register dependency graph; RLT: removed loads table; SBC:
stride-based checker; SVU: store validation unit.
Table 5: Acronyms and definitions
Publisher
Inria
Domaine de Voluceau - Rocquencourt
BP 105 - 78153 Le Chesnay Cedex
inria.fr
ISSN 0249-6399
