ReD: A reuse detector for content selection in exclusive shared last-level caches by Díaz, Javier et al.
Please cite this article as: J. Díaz, T. Monreal, P. Ibáñez et al., ReD: A reuse detector for content selection in exclusive shared last-level caches, Journal of Parallel and
Distributed Computing (2018), https://doi.org/10.1016/j.jpdc.2018.11.005.
ReD: A reuse detector for content selection in exclusive shared last-level caches
Javier Díaz a, Teresa Monreal b, Pablo Ibáñez a,∗, José M. Llabería b, Víctor Viñals a
a Aragón Institute of Engineering Research (I3A), University of Zaragoza, and Hipeac, Spain
b Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya and Hipeac, Spain
Highlights
• A new content selection mechanism for Shared Last-Level Caches (SLLC) in chip multiprocessor systems is proposed.
• The mechanism leverages the reuse locality embedded in the SLLC request stream.
• By adding a Reuse Detector (ReD) between each L2 cache and the SLLC, the mechanism discovers useless blocks evicted from L2 and bypasses them.
• The ReD mechanism is designed to overcome, as far as possible, problems affecting previous state-of-the-art proposals, such as low accuracy, reduced visibility window and detector thrashing.
Article info
Article history:
Received 25 July 2017
Received in revised form 8 April 2018









Abstract
The reference stream reaching a chip multiprocessor Shared Last-Level Cache (SLLC) shows poor temporal locality, making conventional cache management policies inefficient. Few proposals address this problem for exclusive caches. In this paper, we propose the Reuse Detector (ReD), a new content selection mechanism for exclusive hierarchies that leverages reuse locality at the SLLC, a property stating that blocks referenced more than once are more likely to be accessed in the near future. Placed between each private L2 cache and the SLLC, ReD prevents the insertion of blocks without reuse into the SLLC. It is designed to overcome problems affecting similar recent mechanisms (low accuracy, reduced visibility window and detector thrashing). ReD improves performance over other state-of-the-art proposals (CHAR, Reuse Cache and EAF cache). Compared with the baseline system with no content selection, it reduces the SLLC miss rate (MPI) by 10.1% and increases harmonic IPC by 9.5%.
© 2018 Elsevier Inc. All rights reserved.
∗ Corresponding author.
E-mail address: imarin@unizar.es (P. Ibáñez).
1. Introduction
Nowadays, chip multiprocessor (CMP) systems dominate the market in high-performance servers, desktop or embedded systems, and mobile devices. Their most common design includes a multilevel memory hierarchy, ending with a shared last-level cache (SLLC). This cache is critical in terms of cost and performance. In cost, because it occupies nearly 50% of the chip area. In performance, because it is the last resource before accessing the DRAM memory, located outside the chip, which is much slower.
Several studies show that conventional SLLC designs are inefficient because they waste a large portion of the cache. This is because they hold many dead blocks, i.e., blocks that are never re-accessed before their eviction. Frequently, those blocks are already dead when they enter the SLLC [10,18,28]. This occurs in multilevel hierarchies because private caches, often encompassing two levels (L1 and L2), exploit most of the temporal locality, which is effectively filtered out before reaching the SLLC [13,15]. To address this drawback and increase the SLLC hit rate, several proposals suggest new SLLC insertion and replacement policies. Most of the work refers to inclusive or non-inclusive caches, and only a small group [3,10] focuses on exclusive SLLCs [16].
An exclusive SLLC acts as a victim cache of the private caches, storing their evicted blocks. Some recent AMD and Intel CMPs use exclusive or partially exclusive SLLCs [4,5,17]. The aggregate on-chip capacity of private caches increases with the number of cores, thus making exclusive hierarchies more appealing than inclusive ones. Over the next few years, we can expect many-core designs with more cores within the chip, and SLLCs not much larger than the current ones [23]. Therefore, using an inclusive cache will be even more inefficient and, unless there are drastic changes in the basic design of the memory hierarchy, the usefulness of exclusive SLLCs will grow in the future [14].
https://doi.org/10.1016/j.jpdc.2018.11.005
0743-7315/© 2018 Elsevier Inc. All rights reserved.
© 2018 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license
http://creativecommons.org/licenses/by-nc-nd/4.0/
This work focuses on enhancing the efficiency and performance of an exclusive SLLC in a chip multiprocessor. Exclusive caches offer the opportunity to implement a cache bypass mechanism with low complexity, in contrast to inclusive hierarchies. Bypassing a block evicted from a private cache means writing it directly into memory if dirty or discarding it if clean. Bypassed blocks do not affect the state of the SLLC.
Fig. 1. Fraction of blocks evicted from the SLLC, in an example mix, according to the number of accesses received before eviction: (U) one access, (R) two accesses, and (M) three or more accesses. Figures on top of each bar show the average number of reuses for an M block.
Our proposal is a content selection mechanism that implements a new policy to select which blocks enter the SLLC and which ones bypass it. Specifically, we propose to take advantage of the reuse locality existing in the stream of requests to the SLLC. A block is said to have reuse locality if it has been referenced at least twice. A block with reuse locality is more likely to be accessed in the near future [2]. Our mechanism prevents the insertion of many useless blocks in the SLLC. It is also an efficient solution to reduce traffic from private caches to the SLLC, one of the drawbacks of exclusive designs.
After an in-depth analysis of previous proposals that exploit reuse locality, we have identified three aspects where there is still room for improvement:
• Most of them predict reuse by linking it with a cache block feature, such as the instruction that brought the block into the SLLC or the memory area the block belongs to. The accuracy of these predictors is usually low.
• Most proposals detect reuse by keeping track of past accesses in a store embedded into the SLLC. In such proposals, the size of the SLLC restricts the number of detected blocks. They effectively lengthen the life of the blocks flagged as reused in the SLLC, but they are not able to detect further blocks.
• As far as we know, global thrashing may appear in all of them, since the reuse detection mechanism is shared among all the threads running on the CMP. A thread bringing too many blocks into the on-chip hierarchy can prematurely replace existing data from other applications, worsening their reuse detection.
The aim of our proposal is to fill these gaps. To achieve this, we monitor blocks evicted from private L2 caches by means of a specialized store that remembers the addresses of recently evicted blocks. That address store, hereafter called the Reuse Detector or ReD, detects which of the blocks evicted from L2 do not have reuse, and avoids inserting them into the SLLC. Clean blocks are discarded, while dirty blocks are sent directly to main memory. ReD is a separate private store near each L2 cache, sized and organized regardless of the SLLC configuration.
We evaluate ReD using a set of multiprogrammed workloads running on a chip multiprocessor with eight cores and a three-level cache hierarchy. Results show that the Reuse Detector enhances performance above other recent proposals such as CHAR [3], Reuse Cache [1], and EAF cache [30].
The work is structured as follows. Section 2 explains the motivation for this work. Section 3 describes in detail the proposed Reuse Detector. Section 4 gives insight into the ReD operation. Section 5 details the methodology used, including the experimental environment and the configuration of the simulated systems. Section 6 presents and analyzes results, and compares them against other relevant proposals. Section 7 explores the trade-offs in the design of ReD. Section 8 reviews the state of the art. Finally, in Section 9 we summarize our conclusions.
2. Motivation
2.1. Problem analysis
Several studies have shown that, in a memory hierarchy, most of the blocks have already received all their accesses when they are evicted from the caches close to the processor. Caches that are further away from the processor are used inefficiently because the stream of references that reaches them has very little temporal locality. Instead, these references show reuse locality. The reuse locality property has been empirically proved in several works [1–3,10]. It can be stated as follows: lines accessed at least twice tend to be reused many times in the near future.
An experiment is conducted to quantify the number of blocks with reuse and the amount of reuse. Fig. 1 plots a classification of the blocks evicted by an exclusive SLLC depending on the number of SLLC accesses that each block registered during its stay in the on-chip caches. Each block is classified according to whether it has received a single access (U), two accesses (R, reuse), or more than two accesses (M, multiple reuse). The average number of reuses for each M block is shown on top of the bars. The figure shows the distribution for eight applications running together in an example mix.
On average, 85% of the blocks do not receive any hit in the SLLC (U). These blocks could bypass the SLLC without loss of performance. Blocks with only one reuse (R) are 4% of the total, and those with more reuses (M) are 11%. For each block classified as M, there are 13.0 reuses on average. A content selection policy that only stored blocks with reuse (at least two accesses, R + M) would keep the small set of blocks that produces most hits. Furthermore, this policy would prevent the storage in the SLLC of the large set of U blocks, reducing the likelihood of M blocks being replaced. Our proposal is a mechanism that detects the second use of a block to classify it as reused, and only stores these reused blocks in the SLLC.
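As a rough illustration of why the R and M blocks matter so much, the percentages above can be turned into a back-of-the-envelope hit count. This is only a sketch: it assumes that every reuse corresponds to exactly one SLLC hit, and the function name and the per-100-blocks normalization are ours, not part of the original study.

```python
# Block mix reported for the example workload, per 100 blocks evicted from the SLLC.
U_BLOCKS = 85        # single access: no SLLC hit
R_BLOCKS = 4         # two accesses: exactly one reuse
M_BLOCKS = 11        # three or more accesses
M_AVG_REUSES = 13.0  # average number of reuses per M block

def hits_by_class():
    """Return (hits from R, hits from M, share of all hits due to M).

    Assumption (ours): each reuse counts as exactly one SLLC hit.
    """
    hits_r = R_BLOCKS * 1
    hits_m = M_BLOCKS * M_AVG_REUSES
    return hits_r, hits_m, hits_m / (hits_r + hits_m)

hits_r, hits_m, share_m = hits_by_class()
print(f"hits from R blocks: {hits_r}")
print(f"hits from M blocks: {hits_m:.0f}")
print(f"share of hits produced by M blocks: {share_m:.1%}")
```

Under that assumption, the 15% of blocks in the R and M classes account for all the hits, and the 11% in the M class alone produce the vast majority of them.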
2.2. Reuse detector design
We have analyzed previous mechanisms designed to classify blocks as reused for an SLLC and have identified three aspects where they could be improved:
Prediction accuracy. Most works predict reuse by linking it with some cache block feature, such as the instruction that brought the block into the SLLC or the memory area the block belongs to [3,10,15,22,28,33]. The accuracy of these predictors is limited. As an example, a mechanism that associates SLLC reuse with the PC will only be accurate if most blocks brought by each PC present the same behavior, whether having reuse in the SLLC or not. However, accuracy will drop if some blocks brought by a PC have reuse in the SLLC but others do not. Our proposal relies on detection instead of prediction of the reuse locality. Our mechanism keeps the addresses of all blocks evicted from the private caches during a certain time window (the reuse detection window). The eviction from the private caches of a block whose address is already stored is a true indicator of reuse locality, and thus the block is tagged as such. In Section 6 we provide results to test the accuracy of our proposal against CHAR, a state-of-the-art predictor [3].
Thread-global reuse detection. All the previous proposals share an important constraint: their reuse detector is shared among all threads running on the CMP. A thread delivering lots of misses will cause premature evictions of addresses previously inserted into the detector by other threads, restricting the detector’s ability to discover more reuse for those other threads. In other words, a thread missing a lot in its private caches shrinks the reuse detection window of the remaining applications. In fact, these mechanisms reproduce in the reuse detector the thrashing problem they try to avoid in the SLLC. To overcome this, we propose implementing reuse detectors that are private to each core. Each detector is placed next to the private L2 cache, and remembers the addresses of the blocks evicted only by its associated L2 cache. In Section 6 we provide results to compare the amount of reuse detected by our proposal versus a global detector.
Size of the reuse detector. A larger detector can remember more blocks for longer, thereby increasing the opportunity to classify more blocks as reused. Most previously proposed techniques track reuse patterns using the SLLC [2,3,9,10,19,22,33]. In these proposals, the SLLC size defines the size of the reuse detector and, consequently, this detector is not able to discover more reuse than an LRU-managed SLLC of the same size would [27]. In other words, blocks categorized as reused do not increase in number. In fact, increasing the lifespan of reused blocks indirectly shortens the life of those blocks that have not yet shown reuse, due to the capacity limit. This leads to a reduction in the detection window size relative to a cache with an LRU replacement policy. Our proposal aims to increase the number of blocks detected as reused. This requires a larger detection window. To achieve this, we include an additional store that is able to remember more block addresses than the SLLC can keep. In Section 6 we provide results that show the amount of reuse detected by our proposal with detection windows of different sizes.
Our ReD proposal is an efficient content selection mechanism that detects with high accuracy when a block has been reused, and only stores these reused blocks in the SLLC. Removing unused blocks lets the SLLC keep more blocks with reuse, and for a longer time. Compared with previous proposals that also exploit reuse locality, ReD is more accurate at detecting reused blocks, permits a greater visibility window, and does not suffer from global thrashing. In fact, among all those proposals, ReD is the only one that manages to have on average more alive than dead blocks in the SLLC. Because the store is private and separate, the SLLC replacement policy does not adversely affect the detector efficiency, so ReD can be implemented in an SLLC managed with any replacement policy. ReD is designed for exclusive SLLCs and chip multiprocessor systems, and results in a bypass mechanism that has low complexity and is easy to implement.
3. Design and implementation of the reuse detector
3.1. Baseline
The baseline system is a three-level cache hierarchy consisting of an SLLC whose contents are managed in exclusion with respect to the contents of two-level private caches, which are inclusive. Coherence is kept by means of a directory that holds, for each block in the hierarchy, both its status and precise location, which can be one or several private caches or the SLLC.
Fig. 2. Placement of the Reuse Detector. Every private L2 cache has one ReD.
Blocks coming from main memory are sent directly to the requesting L2 cache. Eventually, when a block is evicted from L2 it is sent to the SLLC. From then on, either the block is requested again by an L2 cache, in which case it is sent to that cache and invalidated in the SLLC, or it is replaced by another block that needs room for insertion. If a block placed in an L2 cache is requested by another L2 cache, the directory detects this situation and the block is retrieved from the former to be delivered to the latter. Shared blocks are inserted in the SLLC only when the last copy is evicted from the L2 caches.
It is possible to implement ReD with any SLLC replacement policy. We select TC-AGE for our baseline design because this policy has proved to be very efficient in exclusive SLLCs [10]. It is equivalent to SRRIP for inclusive caches. It uses two bits to store the age of each cache line. The age is assigned when the block is inserted into the SLLC: if the block has previously received a hit in the SLLC, it is inserted with age 3; otherwise it is tagged with age 1. Each block in the private L2 cache stores one additional trip count bit (the TC bit) to remember whether it has had a hit in the SLLC. This bit is also sent to the SLLC with the block when it is evicted from the L2 cache. TC-AGE selects a random victim among the blocks in the youngest group (age 0). If there is no block with age 0 in the cache set, the age of all blocks is decremented, and the victim selection restarts. In summary, TC-AGE assigns an older age, and therefore a lower likelihood of replacement, to blocks that have been reused.
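The TC-AGE rules described above can be sketched for a single SLLC set as follows. This is a minimal illustrative model, not the simulator code used in the paper: the class name, the dictionary-based storage and the seeded random generator are our assumptions.

```python
import random

class TCAgeSet:
    """One exclusive-SLLC set with TC-AGE-style 2-bit ages (illustrative model)."""

    def __init__(self, ways, rng=None):
        self.ways = ways
        self.age = {}                         # block address -> age in [0, 3]
        self.rng = rng or random.Random(0)    # seeded only to make the sketch repeatable

    def insert(self, addr, tc_bit):
        """Insert a block evicted from L2; return the replaced address, if any."""
        victim = None
        if addr not in self.age and len(self.age) >= self.ways:
            victim = self._select_victim()
            del self.age[victim]
        # Blocks that already hit in the SLLC (TC bit set) get age 3, others age 1.
        self.age[addr] = 3 if tc_bit else 1
        return victim

    def hit(self, addr):
        """Exclusive hierarchy: a hit moves the block to L2 and frees the way."""
        return self.age.pop(addr, None) is not None

    def _select_victim(self):
        while True:
            candidates = [a for a, g in self.age.items() if g == 0]
            if candidates:
                return self.rng.choice(candidates)   # random victim among age-0 blocks
            for a in self.age:                       # no age-0 block: age everyone and retry
                self.age[a] -= 1

sllc_set = TCAgeSet(ways=2)
sllc_set.insert("A", tc_bit=True)    # reused block, age 3
sllc_set.insert("B", tc_bit=False)   # first-time block, age 1
print(sllc_set.insert("C", tc_bit=False))
```

In this small run the set is full when C arrives, so ages are decremented until B reaches age 0 and is replaced, while the reused block A survives.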
3.2. The reuse detector
We propose placing our Reuse Detector next to every L2 cache, in the path from each L2 cache to the SLLC. ReD receives the addresses of every block evicted from the corresponding L2 cache (see Fig. 2). Being located outside the critical path from the SLLC to the L2 caches, ReD does not affect the SLLC read latency. Instead, it slightly increases the time that a block evicted from the L2 cache takes to be sent to the SLLC or main memory.
When a block is evicted from the L2 cache, ReD decides between sending the block to the SLLC and bypassing it. The decision is driven by the block reuse history: if a block has a single use, it is bypassed. If it has one or more reuses, it is stored in the SLLC.
A block is classified as reused if it satisfies one of these conditions:
- The Reuse Detector remembers the address of the block. ReD is a buffer storing block addresses coming from L2 evictions. It detects whether it is the first time the block is evicted from L2 or it has already experienced a previous eviction. A first eviction means no reuse detected, while subsequent evictions mean reuse.
- The block was provided to the private cache by the SLLC. We add one bit to each L2 cache block, the Reuse bit, which remembers whether the block came from the SLLC or main memory.
ReD is structured as a set-associative buffer, and its capacity, associativity, and replacement policy are design parameters. We define ReD capacity as the number of tracked evicted blocks times the block size. For instance, a ReD able to track 1024 blocks of 64 B has a capacity of 64 KB. The ReD capacity is a metric that allows us to measure its tracking potential relative to the SLLC size. Capacity is a key parameter, since effective reuse detection requires storing a significant number of addresses between consecutive L2 evictions of a given block.
Fig. 3. Reuse Detector operation.
Fig. 4. Detail of the algorithms in operation. Left: Block eviction from L2 cache (non-shared block). Right: Request from a core to its L2 cache.
Neither the SLLC nor the directory requires structural changes to be adapted to the new mechanism. To take into account a possible bypass action, the coherence protocol and control logic will need to be adapted. In addition, our mechanism requires adding the Reuse bit to each L2 cache block.
3.3. Reuse detector operation
Fig. 3 shows a diagram illustrating the operation of ReD. On L2 evictions the Reuse bit is checked first. If the evicted block came from the SLLC or another private cache, it is stored again in the SLLC without looking up ReD (1). Otherwise, if the evicted block came from main memory, its address is looked up in ReD (2). A miss means no reuse, so the block is bypassed, but the address is added to ReD (3). A hit means reuse, so the block is sent to the SLLC (4). Bypassed blocks send a control message to update the directory (6). Then, clean blocks are discarded and dirty blocks are written to main memory (7).
As an exception, a small fraction of bypassed blocks are sent to the SLLC with "low insertion priority" (5). This means the SLLC will store them only if there are free ways in the corresponding set; moreover, those blocks will be inserted with the highest replacement priority. This exceptional filling policy comes from the observation that exclusive SLLCs experiencing many hits and bypasses at the same time may present many empty ways. Experimentation has shown us that diverting to the SLLC one of every 32 bypassed blocks takes advantage of the free space and increases performance.
Fig. 4 shows two block diagrams with the algorithms in operation. On the left we show the steps followed when a block is evicted from an L2 cache, and on the right those followed when a core requests a block, detailing the management of the Reuse bit.
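The eviction-side decision flow described in this section can be summarized with a small software model. It is a hedged sketch rather than the hardware design: it stores full block addresses instead of tags, picks the set with a simple modulo, and uses a per-detector counter to select one of every 32 bypassed blocks. The class name, method names and returned action strings are all hypothetical.

```python
from collections import deque

class ReuseDetector:
    """Per-core reuse detector sketch: set-associative address buffer with FIFO sets."""

    def __init__(self, num_sets=1024, ways=16, low_priority_period=32):
        self.num_sets = num_sets
        # deque(maxlen=ways) drops the oldest address when a full set receives a new one.
        self.sets = [deque(maxlen=ways) for _ in range(num_sets)]
        self.low_priority_period = low_priority_period
        self.bypassed = 0

    def _seen_before(self, block_addr):
        """Look the address up; on a miss, remember it for future evictions."""
        fifo = self.sets[block_addr % self.num_sets]
        if block_addr in fifo:
            return True               # FIFO order is not refreshed on a hit
        fifo.append(block_addr)
        return False

    def on_l2_eviction(self, block_addr, reuse_bit, dirty):
        """Return the action taken for a block evicted from the private L2."""
        if reuse_bit:                           # (1) came from SLLC or another L2
            return "insert_sllc"
        if self._seen_before(block_addr):       # (2) lookup, (4) hit means reuse
            return "insert_sllc"
        self.bypassed += 1                      # (3) miss: bypass, address now stored
        if self.bypassed % self.low_priority_period == 0:
            return "insert_sllc_low_priority"   # (5) one of every 32 bypassed blocks
        # (6)-(7) the directory is notified; dirty data goes to memory, clean is dropped
        return "write_memory" if dirty else "discard"

red = ReuseDetector(num_sets=4, ways=2)
print(red.on_l2_eviction(0x10, reuse_bit=False, dirty=False))  # first eviction
print(red.on_l2_eviction(0x10, reuse_bit=False, dirty=False))  # second eviction
```

The first eviction of a block is bypassed and the second one is sent to the SLLC, which is the behavior the text describes for blocks arriving from main memory.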
3.4. Implementation details
We implement the ReD buffer as a set-associative cache, with entries containing tags for addresses, valid bits, and some bits for the replacement policy. We use a 16-way ReD buffer; higher associativities lead to hardly any performance improvement.
Our experiments indicate that a FIFO replacement policy works well for managing the ReD buffer. FIFO replacement means that the age of an address relates to its insertion (first use) and not to its last access. This is in line with the buffer's goal of finding the first reuse of a block.
Fig. 5. Fraction of blocks evicted from L2 caches, in an example mix, categorized from the ReD standpoint according to the type of reuse: (U) first use, (R) first reuse, (MD) multiple reuse detected only by ReD, (MC) multiple reuse detected only because the block comes from the SLLC or another private cache, and (MA) multiple reuse detected by both of them.
When ReD is implemented on an SLLC with TC-AGE replacement, only one bit is needed in the L2 cache to hold both the Reuse bit and the TC bit. So, in this case, our mechanism does not add any overhead to the L2 caches. From the TC-AGE perspective, this bit keeps the same meaning: it remembers whether the block came from the SLLC or main memory. That is, when blocks are sent to the SLLC, the bit is reset when ReD discovers a first reuse (the second time a block is evicted from L2), and it is set in subsequent L2 evictions (the block has already received at least three accesses). Therefore, TC-AGE is driven to give higher replacement priority in the SLLC to the blocks having one reuse, and lower priority to those having multiple reuses.
3.5. Hardware costs
In this section, we calculate the total number of bits required to implement the ReD buffer attached to each L2 cache.
The hardware cost of ReD depends primarily on its capacity. By increasing capacity, ReD can track blocks that were evicted from the L2 cache longer ago, that is, it can detect more distant reuses. This is beneficial up to the point where it detects more blocks with reuse than the SLLC can effectively store, and performance starts to decline. The optimal performance for our baseline system with eight cores and an 8 MB SLLC (see details in Section 5.2) is obtained with a ReD capacity of 2 MB (see study in Section 7.1).
For a given capacity, the cost depends on how block addresses are stored in the buffer. A naive implementation that stores all individual block addresses and includes the whole tag would require a significant area. For example, for a ReD capacity of 2 MB, and assuming a physical address width of 40 bits, it would require 2 K 16-way sets, with 24 bits (23 tag bits and 1 valid bit) per entry, and four FIFO bits per set. The total size for the eight cores would be 776 KB, 9.5% of an 8 MB SLLC.
To reduce the ReD area, we propose storing sector tags and compressing them. A sector is a set of consecutive blocks aligned to the sector size, a power of two. As our design requires per-block reuse tracking, every sector tag needs as many valid bits as the number of blocks in a sector. For example, a ReD sector size of four blocks requires entries with four valid bits. As some blocks of a sector may not be referenced at all, the performance for a given ReD capacity decreases when the sector size increases. Therefore, the right sector size is a tradeoff between area and performance.
Compression aims to shorten the tag size while maintaining a good ability to distinguish between sectors. To compress, we propose the following bit folding: let t and c be the number of bits of the entire and compressed tags, respectively. The t bits are split into consecutive pieces of size c, padding with zeros if the last piece does not consist of c bits. Then, the compressed tag is obtained by XOR-ing all the pieces. False positives may appear when using compression, as several sectors share the same compressed tag. Therefore, blocks without reuse might get inserted into the SLLC. Such a wrong insertion is not a functional error, but it can hurt performance. So, the right number of tag bits is also a tradeoff between area and performance.
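The bit folding just described can be illustrated with a few lines of code. This is a sketch under one assumption that the text leaves open: pieces are taken starting from the least significant bits, so the last (most significant) piece is the one padded with zeros. The function name is ours.

```python
def fold_tag(tag: int, t: int, c: int) -> int:
    """Compress a t-bit tag into c bits by XOR-ing its consecutive c-bit pieces.

    Assumption (ours): pieces start at the least significant bit, so the
    top piece is implicitly zero-padded when t is not a multiple of c.
    """
    if t <= 0 or c <= 0:
        raise ValueError("t and c must be positive")
    if not 0 <= tag < (1 << t):
        raise ValueError("tag does not fit in t bits")
    mask = (1 << c) - 1
    folded = 0
    while tag:
        folded ^= tag & mask   # XOR in the next c-bit piece
        tag >>= c
    return folded

# A 10-bit tag folded to 4 bits: pieces 1001, 1101 and (zero-padded) 0010.
print(bin(fold_tag(0b1011011001, t=10, c=4)))

# Two different tags that collide after folding (a possible false positive).
print(fold_tag(0x10, t=8, c=4) == fold_tag(0x01, t=8, c=4))
```

The second call shows how aliasing arises: distinct tags can fold to the same compressed value, which is the source of the false positives mentioned above.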
After trading off performance and cost (see details in Section 7), the chosen configuration has a capacity of 2 MB, a sector size of two blocks and 10-bit tags. This balanced configuration is the one used in our experiments unless stated otherwise. It requires 12 bits (10 tag bits and 2 valid bits) per entry, and four FIFO bits per set. The number of entries for each ReD is 16 K, which means 24.5 KB per core. The total size for the eight cores is 196 KB, 2.3% of an 8 MB SLLC. This is a 74.7% reduction compared to the initial size without cost optimizations.
The Reuse bits in the L2 caches do not require additional area if ReD is implemented on top of our baseline design, because they are the same TC bits used by TC-AGE. If ReD is implemented using an alternative replacement policy, 4 KB should be added (1 bit for each of the 4 K entries in our eight 256 KB L2 caches).
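The storage figures in this section can be reproduced with simple arithmetic. The sketch below assumes 64 B blocks and one valid bit per block of a sector, as stated in the text; the helper name and its parameters are ours. By this arithmetic the optimized design is about 2.4% of the SLLC, which the text reports as 2.3%.

```python
KB = 1024
MB = 1024 * KB

def red_storage_bits(capacity_bytes, block_bytes, sector_blocks,
                     ways, tag_bits, fifo_bits_per_set):
    """Bits needed by one ReD buffer (tags, valid bits and per-set FIFO state)."""
    entries = capacity_bytes // (block_bytes * sector_blocks)
    sets = entries // ways
    bits_per_entry = tag_bits + sector_blocks   # one valid bit per block in the sector
    return entries * bits_per_entry + sets * fifo_bits_per_set

CORES = 8
SLLC_BYTES = 8 * MB

# Naive design: one entry per block, full 23-bit tag.
naive_bytes = CORES * red_storage_bits(2 * MB, 64, 1, 16, 23, 4) // 8
# Chosen design: two-block sectors, 10-bit compressed tags.
compact_bytes = CORES * red_storage_bits(2 * MB, 64, 2, 16, 10, 4) // 8

print(f"naive:   {naive_bytes / KB:.1f} KB ({naive_bytes / SLLC_BYTES:.1%} of the SLLC)")
print(f"compact: {compact_bytes / KB:.1f} KB ({compact_bytes / SLLC_BYTES:.1%} of the SLLC)")
print(f"reduction: {1 - compact_bytes / naive_bytes:.1%}")

# Reuse bits when TC-AGE is not used: 1 bit per L2 block, eight 256 KB L2 caches.
reuse_bit_bytes = CORES * (256 * KB // 64) // 8
print(f"Reuse bits: {reuse_bit_bytes / KB:.0f} KB")
```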
4. ReD operation insight
In this section we use an example workload to analyze in depth the ReD operation, and how it is able to reduce the SLLC miss rate.
We plot how ReD classifies the blocks it receives into five classes: first use (U), first reuse (R), multiple reuse detected only by ReD (MD)¹, multiple reuse detected only because the block comes from the SLLC or another private cache (MC), and multiple reuse detected by both mechanisms (MA). Fig. 5 shows the distribution for the eight applications of an example mix.
An L2 eviction of a block classified as U causes an SLLC bypass, while an eviction of a block in any other class causes the insertion of the block into the SLLC. Evictions of blocks classified as U, R or MD denote that the block comes originally from main memory, while blocks classified as MC and MA denote a previous hit in the SLLC or in another private cache.
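The five classes can be expressed as a small decision function. This is only an illustration of the taxonomy used in the analysis: the flag `seen_reused_before` is a hypothetical piece of simulator bookkeeping that separates R from MD (the ReD hardware itself does not distinguish them), and the function names are ours.

```python
def classify_eviction(reuse_bit: bool, red_hit: bool, seen_reused_before: bool) -> str:
    """Label an L2 eviction as U, R, MD, MC or MA.

    reuse_bit: the block was supplied by the SLLC or another private cache.
    red_hit: the block address is present in ReD.
    seen_reused_before: hypothetical analysis-only flag, true if this block
        was already classified as reused by ReD at an earlier eviction.
    """
    if reuse_bit:
        return "MA" if red_hit else "MC"
    if not red_hit:
        return "U"
    return "MD" if seen_reused_before else "R"

def is_bypassed(label: str) -> bool:
    """Only first-use blocks bypass the SLLC; every other class is inserted."""
    return label == "U"

for args in [(False, False, False), (False, True, False), (False, True, True),
             (True, False, False), (True, True, False)]:
    label = classify_eviction(*args)
    print(args, label, "bypass" if is_bypassed(label) else "insert")
```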
As shown, the amount of bypass varies from one program to another. In bwaves and milc, more than 91% of the blocks evicted from L2 show a single use and are bypassed. This is consistent with the measurements presented in Table 1, which show that the SLLC miss rates for these programs are very high. At the opposite extreme, in astar, omnetpp and wrf less than 2% of the blocks evicted from the private caches do not show reuse, so there is almost no bypass for these applications. The rest of the programs present intermediate figures: dealII, gobmk and soplex, with bypass levels of 13%, 7% and 40%, respectively.
¹ Although the proposed ReD hardware cannot distinguish between the R and MD classes, they were separated in the figure to illustrate that both reuse detection mechanisms are complementary.
Fig. 6. Left: fraction of the SLLC occupied by blocks of each program in the example mix. We take data every million cycles and show the average. Right: SLLC MPI reduction of ReD with respect to the base system.
Table 1
L1, L2 and LLC: Average MPKI at each cache level of the base system (exclusive SLLC with 8 MB and TC-AGE replacement policy). IPC: average multi-processor IPC.
Benchmark    L1     L2     LLC    IPC
astar        7.5    1.1    0.7    1.17
bwaves       24.5   21.1   20.1   0.66
bzip2        8.4    3.9    0.9    1.30
cactusADM    20.8   11.4   4.9    0.64
calculix     8.5    4.3    1.5    1.61
dealII       1.6    0.5    0.3    2.69
gamess       6.7    1.0    0.6    2.60
gcc          22     6.4    2.1    0.78
gemsFDTD     42.7   29.7   22.8   0.45
gobmk        13.2   1.1    0.3    1.23
gromacs      11.7   3.0    1.2    1.60
hmmer        3.3    2.4    0.2    2.50
h264ref      4.2    1.4    0.7    1.36
lbm          65.4   38.6   36.7   0.21
leslie3d     40.4   23.2   17.9   0.58
libquantum   45.8   33.2   32.2   0.28
mcf          64.9   36.0   18.9   0.18
milc         24.6   23.5   22.0   0.23
namd         1.7    0.2    0.2    3.16
omnetpp      12.6   9.2    2.2    0.63
perlbench    10.2   1.8    0.8    1.37
povray       11.5   0.2    0.1    2.65
sjeng        6.9    0.8    0.5    1.28
soplex       8.9    7.1    3.1    0.63
sphinx3      18.8   14.3   11.7   0.26
tonto        6.7    1.3    0.5    2.18
wrf          14.3   8.9    1.5    2.26
xalancbmk    15.1   8.7    2.8    0.68
zeusmp       32.3   8.7    7.2    0.87
The number of blocks that are sent to the SLLC after detecting their first reuse (R class) varies between 0.2% and 5.5% for omnetpp and dealII, respectively, with an average of 1.0%. These few blocks showing a first reuse are accessed later multiple times (MD, MC, and MA classes). On average, for each block classified as R (first reuse), ReD detects 61 additional reuses.
Blocks classified as MD have been previously detected by ReD, classified as reused and stored in the SLLC, but they have been prematurely evicted from there. As ReD has a detection window that is larger than the SLLC, it can detect this situation and re-insert them into the SLLC. This occurs on average 13% of the times a multiple reuse is detected.
The bypass of blocks without reuse in bwaves, milc, dealII and
soplex allows the SLLC to better preserve the useful blocks from
these programs, because they will not be evicted as often. Moreover,
the other programs of the mix also benefit from this.
Fig. 6 shows the average fraction of the SLLC occupied by each
program, for the baseline system and for ReD (left). It also shows
the SLLC misses per instruction (MPI) reduction of each application
with respect to the baseline (right). Bwaves and milc take much
less SLLC space with ReD. However, the block selection done by
ReD does not harm their performance, since both maintain a miss
rate similar to that of the baseline (0.9% worse in bwaves). For the
rest of the programs, having more space in the SLLC and a better
block selection mechanism allows them to keep more blocks with
reuse and for a longer time, resulting in reductions in the SLLC MPI
between 30.4% and 97.3% for soplex and omnetpp, respectively. The
MPI reduction for the whole mix (see footnote 2) is 24.0%, and the
normalized hIPC is 1.28.
2 If we order our 100 workloads, described in Section 5, from higher to lower
normalized hIPC, this example mix is in position number 8.
5. Experimental setup
This section details the experimental framework and the configuration
of the baseline system we use to evaluate the proposal.
5.1. The experimental framework
As a simulation engine we use the Simics full-system simulator
[25], the Ruby and Opal plugins from the GEMS Multifacet
toolset [26], and DRAMSim2 from the University of Maryland,
College Park [29]. Ruby is used to accurately model the memory
hierarchy of the CMP system: caches, directory, coherence protocol,
on-chip network, buffering, and blocking of components. Opal
(also known as TFSim) is used to model in detail a superscalar
out-of-order processor. DRAMSim2 provides a cycle-accurate model
of the DDR3 memory system.
Our Simics platform simulates SPARC cores managed by Solaris
10, and runs a multiprogrammed workload made of applications
from the SPEC CPU 2006 suite [12]. For our system with 8 processors
we have generated a set of 100 mixes, each composed of a random
combination of 8 benchmarks taken from among all the 29
included in the SPEC CPU 2006 benchmark suite. Each program
appears between 18 and 41 times, with an average of 27.6 times
and a standard deviation of 6.1.
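The mix construction can be sketched as follows. This is a minimal illustration, assuming each mix draws 8 distinct benchmarks uniformly at random (the text does not say whether a program may repeat within a mix) and using an arbitrary seed; the function name `make_mixes` is hypothetical. Whatever the sampling details, the mean number of appearances per program is fixed at 100 × 8 / 29 ≈ 27.6, which agrees with the reported average.

```python
import random
from collections import Counter

# SPEC CPU 2006 benchmark names as listed in Table 1.
BENCHMARKS = [
    "astar", "bwaves", "bzip2", "cactusADM", "calculix", "dealII", "gamess",
    "gcc", "gemsFDTD", "gobmk", "gromacs", "hmmer", "h264ref", "lbm",
    "leslie3d", "libquantum", "mcf", "milc", "namd", "omnetpp", "perlbench",
    "povray", "sjeng", "soplex", "sphinx3", "tonto", "wrf", "xalancbmk",
    "zeusmp",
]


def make_mixes(num_mixes=100, mix_size=8, seed=0):
    """Return random mixes (assumption: no repeated program within a mix)."""
    rng = random.Random(seed)
    return [rng.sample(BENCHMARKS, mix_size) for _ in range(num_mixes)]


mixes = make_mixes()
appearances = Counter(name for mix in mixes for name in mix)
# Total appearances is always num_mixes * mix_size, so the mean is fixed.
mean_appearances = sum(appearances.values()) / len(BENCHMARKS)
print(f"mean appearances per program: {mean_appearances:.1f}")
```

The minimum, maximum and standard deviation of `appearances` depend on the random draw, so they will generally differ from the 18, 41 and 6.1 reported for the authors' mixes.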
In order to identify initialization phases, we run all the SPARC
binaries until completion, with the reference inputs, on a real
machine. During the execution we use hardware counters to detect
the end of the initialization phase of each benchmark. For every
mix, we ensure that no application is in its initialization phase
by fast-forwarding the simulation until all the initialization phases
are finished. Starting at this point, we first run 300 million cycles
to warm up the memory system, and then collect statistics for
the next 700 million cycles.
The first three columns in Table 1 show the average number
of misses per kilo-instruction (MPKI) in all three levels of the
Table 2
Processor parameters.
Base architecture    SPARC v9
Cores                8, 4-way superscalar, 4 GHz
Pipeline             18 stages: 4 fetch, 4 decode, 4 dispatch/read, 1 (>4) execute, 3 memory, 2 commit
ROB size             128 entries
Register files       int: 160 (logical) + 128 (rename); FP: 64 (logical) + 128 (rename)
Functional units     4 int, 4 FP, 2 load/store
Branch prediction    YAGS cache structure with a PHT of 4,096 entries
Table 3
Memory hierarchy parameters.
Private cache L1 I/D       32 KB, 4-way, LRU replacement, block size of 64 B, 3 cycles access latency
Private cache L2 unified   256 KB in inclusion with L1, 8-way, LRU replacement, block size of 64 B, 7 cycles access latency
Network                    Crossbar, 80 bits bus width, 5 cycles latency
Shared cache L3 (SLLC)     8 MB exclusive (4 banks of 2 MB each), block interleaving, block size of 64 B. Each bank: 16-way, TC-AGE replacement with 2 bits, 10 cycles access latency, 32 demand MSHR
DRAM                       Device Micron 32M 8B x8, 2 channels, 2 ranks per channel, 8 devices per rank, 8 GB total
DRAM bus                   2 channels at 667 MHz, Double Data Rate (DDR3-1333), 8 B bus width, 4 DRAM cycles/line, 24 processor cycles/line
memory hierarchy. These figures are averages for each benchmark
over all mixes in which it appears, when the eight benchmarks in
each mix run together on the base system. The last column shows
the average multi-processor instructions per cycle (IPC).
5.2. Configuration of the baseline system
We model a base system of eight superscalar processors with
speculative out-of-order execution. Each processor has a 4-wide
pipeline of 18 stages and 10 functional units. Branch prediction
uses a YAGS cache structure [6] with a direction pattern history
table (PHT) of 4 K entries. Table 2 summarizes all the parameters
of the simulated processor.
Each processor core has a two-level private cache hierarchy,
and the exclusive third and last level cache is shared among all the
cores. The SLLC has a total size of 8 MB, and is split into four banks
that are interleaved at cache-line granularity (64 B).
A crossbar network connects the eight processors to the four
SLLC banks. The DDR3 memory system is accessed through two
memory channels running at a frequency of 667 MHz (DDR3-1333).
Table 3 shows all the details of the cache hierarchy we
simulate.
5.3. Performance metrics
Two performance metrics are mainly used: the harmonic mean
of weighted IPCs [7,24] normalized to that of the base system
(normalized harmonic IPC or normalized hIPC) and the reduction
in misses per instruction against the base system (MPI reduction).
Unless stated otherwise, figures show the average of the results
obtained for each of the 100 workloads.
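For each mix, the normalized harmonic IPC for a proposal PROP is
computed as

$$\text{normalized hIPC} = \frac{H\left(\dfrac{IPC^{PROP}_{MP_t}}{IPC^{BASE}_{SP_t}}\right)}{H\left(\dfrac{IPC^{BASE}_{MP_t}}{IPC^{BASE}_{SP_t}}\right)}$$

where $IPC^{PROP}_{MP_t}$ is the IPC obtained using PROP for processor t
when run in the multiprogrammed experiment, $IPC^{BASE}_{MP_t}$ is the
IPC obtained using the base system for processor t when run in the
multiprogrammed experiment, $IPC^{BASE}_{SP_t}$ is the IPC obtained using
the base system for processor t when run alone on the system, and
function H is the harmonic mean over the N processors, defined as

$$H(x_1, \ldots, x_N) = \frac{N}{\sum_{t=1}^{N} \dfrac{1}{x_t}}$$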
The harmonic IPC metric is used because it incorporates a
notion of fairness in addition to performance: the harmonic mean
tends to be lower when the weighted IPCs of the different
processors vary widely.
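The metric and its fairness property can be sketched in code. This is a minimal illustration, assuming the weighted IPC of a processor is its multiprogrammed IPC divided by its stand-alone IPC on the base system; the function names and all IPC values are made up for the example.

```python
def harmonic_mean(values):
    """Harmonic mean H of a sequence of positive numbers."""
    values = list(values)
    return len(values) / sum(1.0 / v for v in values)


def normalized_hipc(ipc_prop_mp, ipc_base_mp, ipc_base_sp):
    """Normalized harmonic IPC of a proposal for one mix.

    Each argument is a per-processor list: IPC with the proposal in the
    multiprogrammed run, IPC on the base system in the multiprogrammed run,
    and IPC on the base system when the program runs alone.
    """
    weighted_prop = [p / s for p, s in zip(ipc_prop_mp, ipc_base_sp)]
    weighted_base = [b / s for b, s in zip(ipc_base_mp, ipc_base_sp)]
    return harmonic_mean(weighted_prop) / harmonic_mean(weighted_base)


# Made-up IPC values for a 2-processor mix, for illustration only.
alone = [2.0, 1.0]
base = [1.0, 0.5]
balanced = [1.2, 0.6]    # both programs speed up by 20%
unbalanced = [1.4, 0.5]  # same arithmetic-mean weighted IPC, unevenly spread

print(normalized_hipc(balanced, base, alone))
print(normalized_hipc(unbalanced, base, alone))
```

Both hypothetical proposals have the same arithmetic mean of weighted IPCs (0.6), but the unbalanced one scores lower on the harmonic metric, which is the fairness effect described above.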


















where $M_t$ is the number of SLLC misses counted during simulation
for processor t, and $I_t$ is the number of instructions executed by
processor t.
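A minimal sketch of the MPI reduction metric follows. It assumes that the mix-level MPI is the total number of SLLC misses divided by the total number of instructions over all processors, and that the reduction is taken relative to the base system. Both are assumptions made here for illustration (averaging per-processor ratios would be another plausible reading), and the function names and numbers are hypothetical.

```python
def mpi(misses, instructions):
    """Misses per instruction aggregated over all processors (assumed form)."""
    return sum(misses) / sum(instructions)


def mpi_reduction(misses_prop, instr_prop, misses_base, instr_base):
    """Fractional MPI reduction of a proposal with respect to the base system."""
    return 1.0 - mpi(misses_prop, instr_prop) / mpi(misses_base, instr_base)


# Made-up per-processor counts for a 2-processor mix.
base_misses, base_instr = [100, 300], [1000, 1000]
prop_misses, prop_instr = [80, 240], [1000, 1000]
print(f"{100 * mpi_reduction(prop_misses, prop_instr, base_misses, base_instr):.1f}%")
```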
6. Performance analysis of ReD
In this section we first present our performance results and
compare ReD with state-of-the-art proposals. Next, we present data
on the fraction of alive blocks in the SLLC achieved by each proposal.
In Section 6.3 we analyze the IPC results of our proposal broken
down by application and mix. In Section 6.4 we present results on
single-processor workloads. Next, we analyze the efficiency of our
detector and compare it to the other mechanisms. In Section 6.6
we analyze ReD performance using different SLLC sizes, and using
an alternative replacement policy. Finally, we provide additional
performance metrics.
6.1. Results and comparison with other proposals
Fig. 7 plots normalized hIPC and MPI reduction against the
baseline obtained by ReD. Compared to the baseline, it reduces MPI
by 10.1% and increases harmonic IPC by 9.5%.
Below we compare the performance of our mechanism
with three other recent proposals: cache hierarchy-aware replacement
(CHAR) [3], Reuse Cache [1] and Evicted-Address Filter cache
[30]. We also compare it with a base system with double the SLLC
size, that is, 16 MB.
Comparison with CHAR: CHAR is a content selection proposal
that bases the bypass decision on the access pattern that a block has
at all levels of the memory hierarchy. CHAR was proposed both for
inclusive and exclusive SLLCs. We use here the exclusive version.
Fig. 7 plots normalized hIPC and MPI reduction against the
baseline obtained by ReD and CHAR with an 8 MB SLLC. ReD outperforms
CHAR both in MPI reduction (10.1% vs. 4.3%) and normalized
hIPC (1.095 vs. 1.070).
Fig. 7. Normalized hIPC (left) and MPI reduction (right) compared to the base system with 8 MB, for five systems: ReD (with the balanced configuration), CHAR, Evicted
Address Filter, Reuse Cache (with RC-32/8 and NRR in tag array), and a base system with double the SLLC size (16 MB). All are implemented on an exclusive SLLC with TC-AGE
in the data array.
CHAR uses a predictor to foresee which blocks will show reuse
and which will not. In Section 6.5 we will show that the accuracy of
predictor-based CHAR is lower than the accuracy of our detector-based ReD.
Comparison with Reuse Cache: The Reuse Cache is a content
selection proposal for an SLLC whose tag and data arrays are
decoupled, and that stores data only of those lines that have shown
reuse. To be fair in the comparison, we have modeled a Reuse Cache
in which the data array works in exclusion with the private L2
caches. Our exclusive Reuse Cache works as follows: each block in
the L2 private caches includes a bit indicating whether it should
be inserted into the SLLC when evicted from L2 (bypass/no bypass).
On an SLLC miss (first block access), the block is sent from
main memory to the L2 cache indicating ‘‘bypass’’, and the tag is
inserted into the SLLC tag array. This allows subsequent reuse to be
detected. On a hit in the tag array of the SLLC that misses in the data
array (second block access) the block is sent from main memory to
the L2 cache indicating ‘‘no bypass’’. When the block is evicted again
from L2, it is stored in the SLLC data array. In subsequent accesses
that hit both in the tag and data arrays, the block is sent to the private
L2 indicating ‘‘no bypass’’, and is evicted from the SLLC data array.
There are no changes in the SLLC tag and data arrays.
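The flow described above can be sketched as a small state model. This is a toy illustration only: it ignores capacity limits, replacement and coherence, and the class and method names are hypothetical rather than taken from the simulator.

```python
class ExclusiveReuseCache:
    """Toy model of the exclusive Reuse Cache flow (no capacity limits)."""

    def __init__(self):
        self.tags = set()  # SLLC tag array: addresses seen at least once
        self.data = set()  # SLLC data array: blocks currently stored

    def l2_miss(self, addr):
        """Serve an L2 miss; return (source, bypass_bit) for the L2 copy."""
        if addr in self.data:
            # Hit in tag and data arrays: exclusion moves the block to L2.
            self.data.remove(addr)
            return "sllc", False
        if addr in self.tags:
            # Tag hit, data miss: second access, so reuse is detected.
            return "memory", False
        # SLLC miss: first access, remember only the tag.
        self.tags.add(addr)
        return "memory", True

    def l2_evict(self, addr, bypass):
        """Handle an L2 eviction using the bypass bit stored with the block."""
        if not bypass:
            self.data.add(addr)


rc = ExclusiveReuseCache()
trace = []
for _ in range(3):
    source, bypass = rc.l2_miss(0x40)
    trace.append((source, bypass))
    rc.l2_evict(0x40, bypass)  # the block later leaves L2
print(trace)
```

The first access is bypassed on eviction, the second comes from memory again but is marked ‘‘no bypass’’, and only the third is served from the SLLC data array.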
We use an exclusive Reuse Cache with a data array of 8 MB and a
tag array equivalent to 32 MB. Among the configurations with 8 MB
of data, this ratio between tags and data is the best we have found in
our simulations. TC-AGE is used as the replacement policy in the data
array.
As shown in Fig. 7, ReD outperforms the exclusive Reuse Cache
both in MPI reduction (10.1% vs. 4.5%) and normalized hIPC (1.095
vs. 1.038).
The Reuse Cache uses a global reuse detector, and therefore suffers
from interference between the distinct applications. We analyze
this effect in Section 6.5. In addition, the Reuse Cache embeds the
reuse detector into the SLLC tag array. This increases the detector
complexity, since each entry must keep the complete tag along
with coherency information, limiting the design opportunities.
Comparison with Evicted-Address Filter: The Evicted-Address
Filter Cache is an SLLC that tracks the addresses of blocks that were
recently evicted from the SLLC in a structure called the Evicted-Address
Filter (EAF). Missed blocks whose addresses are present
in the EAF are predicted to have high reuse, while the rest of the
blocks are predicted to have low reuse. This prediction affects the
insertion priority: high-reuse blocks enter at the Most-Recently-Used
(MRU) position and low-reuse ones enter according to a bimodal
policy (MRU with probability 1/64, otherwise LRU). The EAF
is implemented using a Bloom filter, which is cleared periodically.
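The insertion decision can be sketched as follows. This is a simplified illustration, not the EAF implementation: a plain Python set stands in for the Bloom filter (so there are no false positives), and the clearing rule (`clear_after` recorded evictions) is a hypothetical parameter, since the text only says the filter is cleared periodically.

```python
import random


class EvictedAddressFilter:
    """Toy EAF: remembers recently evicted addresses, cleared periodically."""

    def __init__(self, clear_after, seed=0):
        self.addresses = set()          # stand-in for the Bloom filter
        self.clear_after = clear_after  # hypothetical clearing period
        self.recorded = 0
        self.rng = random.Random(seed)

    def record_eviction(self, addr):
        """Remember an address evicted from the SLLC."""
        self.addresses.add(addr)
        self.recorded += 1
        if self.recorded >= self.clear_after:
            # Periodic clear: all reuse information is lost at once.
            self.addresses.clear()
            self.recorded = 0

    def insertion_position(self, addr):
        """High-reuse blocks go to MRU; the rest follow the bimodal policy."""
        if addr in self.addresses:
            return "MRU"
        return "MRU" if self.rng.random() < 1 / 64 else "LRU"


eaf = EvictedAddressFilter(clear_after=4)
eaf.record_eviction(0x100)
print(eaf.insertion_position(0x100))  # predicted high reuse
```

Right after a clear, every block falls back to the bimodal policy regardless of its history, which is the information loss discussed in the comparison with ReD.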
Even though EAF is a replacement policy and not a content
selection policy, we have included it in our comparison because
it also includes a reuse detection mechanism. The information ReD
stores and acts upon is different: ReD monitors blocks sent
from the L2 caches to the SLLC, whereas EAF monitors blocks
evicted from the SLLC to main memory. EAF cannot be used to
implement a content selection policy, because it uses a Bloom filter
to store the reuse information. The filter is periodically cleared,
which produces a loss of information that leads to temporarily
classifying all blocks as not reused. This is beneficial when the
detector is used to adjust the replacement policy, as it is in the
original publication and in our setup. Conversely, it makes it
unsuitable for use as a content selection mechanism, as it would
lead to not inserting any new blocks into the SLLC after the reset,
for any application, until the filter is adequately refilled.
To be fair in the comparison, we have modeled an EAF Cache in
which the data array works in exclusion with the private L2 caches.
The L2 caches are also extended to store the Reuse bit, which is
sent to the SLLC on eviction. At the time the block enters the
SLLC, that is, when it is evicted from an L2 cache, the Reuse bit is
checked first. If it is set, the block is inserted at the MRU position.
If not, the EAF is checked, applying the described policy. We store
the SLLC MRU information using 2 bits per block, in line with the 2
bits that we use for TC-AGE in our other models. Our experimental
results show that for the exclusive version a larger Bloom filter is
required. We obtain the best results with a filter 25% larger than
in the original publication. This is consistent with the increase in
distinct blocks in the whole cache subsystem due to the move from
inclusion to exclusion, from 8 to 10 MB (8 cores with 256 KB of L2
cache each).
As shown in Fig. 7, ReD outperforms the exclusive EAF Cache
both in MPI reduction (10.1% vs. 2.7%) and normalized hIPC (1.095
vs. 1.032).
Comparison with a double-sized base system: Fig. 7 plots
normalized hIPC and MPI reduction against the baseline obtained
by ReD with an 8 MB SLLC, and by a base system with a 16 MB SLLC
(BASE 16MB).
ReD with an 8 MB SLLC achieves 87% of the MPI reduction (10.1%
vs. 11.6%) and 81% of the increase in normalized hIPC (1.095 vs.
1.117) of the double-sized base system, with only a 2.3% increase
in SLLC space.
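As a quick check, the two percentages follow directly from the figures quoted in the text, taking the hIPC increase as the excess over 1:

```latex
\frac{10.1\%}{11.6\%} \approx 0.87,
\qquad
\frac{1.095 - 1}{1.117 - 1} = \frac{0.095}{0.117} \approx 0.81
```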
6.2. Alive and dead blocks
In this section we present the average number of alive blocks
that the SLLC stores at any given time. We define a block in the
SLLC as alive at a given time if it receives a hit in the future before
its eviction. Conversely, a block is defined as dead at a given time if
it does not receive an additional hit before its eviction. Dead blocks
waste storage.
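The definition can be sketched in code, assuming a hypothetical per-block record of insertion time, hit times and the time the block leaves the SLLC. In an exclusive SLLC a hit may itself end the residency, so the sketch accepts a hit time equal to the end time.

```python
from dataclasses import dataclass, field


@dataclass
class Residency:
    """One stay of a block in the SLLC (times in cycles; hypothetical record)."""
    insert_time: int
    evict_time: int
    hit_times: list = field(default_factory=list)

    def is_resident(self, t):
        return self.insert_time <= t < self.evict_time

    def is_alive(self, t):
        """Alive at t: resident and hit again after t, before leaving the SLLC."""
        return self.is_resident(t) and any(
            t < h <= self.evict_time for h in self.hit_times
        )


def alive_fraction(residencies, t):
    """Fraction of the blocks resident at time t that are alive."""
    resident = [r for r in residencies if r.is_resident(t)]
    if not resident:
        return 0.0
    return sum(r.is_alive(t) for r in resident) / len(resident)


# Made-up residencies: one block hit at cycle 50, one never hit, one hit at 20.
blocks = [Residency(0, 100, [50]), Residency(0, 100), Residency(10, 40, [20])]
for t in (15, 30, 60):
    print(t, round(alive_fraction(blocks, t), 2))
```

Sampling `alive_fraction` at regular intervals and averaging mirrors the per-million-cycle measurement used for Fig. 8.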
Fig. 8 plots these results for CHAR, exclusive EAF, exclusive
Reuse Cache and ReD. Additionally, we include the baseline
configuration (labeled TC-AGE), and NRF (Not Recently Filled) as the most
basic replacement policy (NRF is analogous to NRU in inclusive
caches [32]). For each workload, we take measurements every million
cycles and calculate the average. We show the average over all our
workloads.
As Fig. 8 shows, when using the basic 1-bit NRF policy only
14.0% of the blocks are alive on average. Our baseline configuration
Fig. 8. Average fraction of alive blocks present at any given moment in the SLLC,
for different cache management proposals, using an 8 MB SLLC.
(2-bit TC-AGE) increases that figure up to 20.4%. All other proposals,
implemented on top of TC-AGE, improve the management of
the SLLC, increasing the fraction of alive blocks. ReD achieves the
best results with 51.5% of alive blocks, being the only proposal that
manages to have on average more alive than dead blocks.
Comparing Fig. 8 with Fig. 7 (right), we see that a higher
fraction of alive blocks correlates with a higher reduction
in misses per instruction, although the relationship is not
proportional. As ReD prioritizes multiple-reused blocks in the SLLC
(see Section 3.4), it takes more advantage of the alive fraction,
thus leading to a higher miss rate reduction.
6.3. Per-application and per-mix performance
As explained previously, application performance depends both
on the application itself and on the other applications running in
the workload. Fig. 9 shows box-and-whisker plots with the distribution
of speed-ups (normalized IPC) by application, with respect
to the baseline system, for all instances of the applications that
are running in our 100 workloads. Five values are plotted, namely
minimum, first quartile, median, third quartile, and maximum.
Out of all 29 applications, 5 show improved performance in all
workloads they appear in (astar, bzip2, hmmer, tonto and
xalancbmk), with medians as high as 1.41 for xalancbmk. Another
11 show improved performance starting with the first quartile
(bwaves, gamess, gobmk, gromacs, h264ref, mcf, omnetpp, sjeng,
soplex, sphinx3, wrf), although in some mixes they show a reduction.
In 8 of them (cactusADM, dealII, gcc, libquantum, milc, namd,
perlbench, povray), the median shows improvement but the first
quartile shows a reduction. The 5 remaining applications (GemsFDTD,
calculix, lbm, leslie3d, and zeusmp) show lower performance in the
median.
Performance results also vary by workload, depending on the
applications it includes. Fig. 10 plots the normalized harmonic IPC
for all the workloads, relative to the baseline. Out of the 100 mixes,
94 show speed-up improvements of up to 1.70, the worst having a
value of 0.98.
6.4. Single-processor performance
In this section we show the performance of ReD for single-processor
workloads. For these experiments we use a 1 MB LLC,
the same per-processor amount as in our multiprocessor simulations.
All other parameters are the same. We show results for all
benchmarks that have MPKI > 2 at the LLC.
Fig. 11 plots normalized IPC obtained by ReD compared to the
base system. Although ReD is specifically designed for chip
multiprocessor systems, it still provides performance enhancements
for 9 of these 14 sequential workloads, up to a 12.8% speedup for
xalancbmk. It decreases the performance of the other 5, by up to 1.8%
for omnetpp.
It is interesting to compare these results with those shown
in Fig. 9 for multi-processor workloads. There, omnetpp shows a
speed-up (normalized IPC above 1) in most of the workloads it is
present in (starting with the first quartile). As shown in Fig. 6, the content
selection made by ReD changes the distribution of space in the
SLLC, and is often able to assign more space to the alive blocks of
omnetpp, surpassing the 1 MB that we use in this section. With
increasing space, the reused working set fits in the SLLC, and
omnetpp shows IPC improvements in most multiprocessor
workloads.
6.5. Detector efficiency
In this section we analyze ReD efficiency in terms of the number
of blocks selected for SLLC insertion, and their usefulness.
We show figures about the number of blocks selected by the
distinct detectors/predictors, and the subsequent reuse of these
blocks.
Fig. 12 plots the number of blocks in our example workload
that, after coming from main memory to L2, are selected for SLLC
insertion (‘‘new blocks’’). We show figures for ReD, CHAR, and a
detector named ‘‘Shared ReD’’. This detector is similar to ReD, but
it uses a single address buffer that is shared among all cores instead
of multiple private ones. Fig. 13 shows the accuracy of each content
selection mechanism. This is defined as the percentage of new
blocks that are accessed at least once after being sent to the SLLC.
Fig. 14 plots the MPI reduction of each mechanism versus our base
system.
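The accuracy metric defined above reduces to a simple ratio. A minimal sketch, with a hypothetical function name and made-up block identifiers:

```python
def selection_accuracy(inserted_blocks, reused_blocks):
    """Percentage of inserted new blocks accessed at least once afterwards."""
    inserted = set(inserted_blocks)
    if not inserted:
        return 0.0
    return 100.0 * len(inserted & set(reused_blocks)) / len(inserted)


# Made-up example: 100 new blocks inserted, only blocks 3 and 7 reused later.
print(selection_accuracy(range(100), [3, 7]))
```

A mechanism that inserts many blocks but sees few of them reused scores low on this metric even if its absolute number of useful blocks is similar to that of a more selective mechanism.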
As shown in Figs. 12 and 13, predictor-based CHAR is less
selective than our detector-based ReD. CHAR inserts many more
blocks into the SLLC but its accuracy is low. For example, for bwaves
CHAR inserts into the SLLC about 18 times the blocks inserted by
ReD. However, both have a similar SLLC miss ratio because only 2%
of the blocks inserted by CHAR are used before being evicted. The
behavior is similar for milc (12x and 0.1%). To store blocks for these
two applications, CHAR evicts useful blocks from other cores. As a
consequence, ReD is much better than CHAR at reducing MPI for
two other applications of this mix (astar and dealII). For the whole
mix the MPI reduction for CHAR is 20% vs. 24% for ReD 16 MB (Fig. 14).
These differences, or even larger ones, appear consistently across
our workloads, leading to the average 6% difference shown in
Fig. 7.
Shared ReD with a capacity of 16 MB inserts 32% more
blocks overall than Shared ReD with 8 MB. Additionally, the accuracy
increases by 2%. This does not directly translate into a much
higher MPI reduction in this particular workload (only 0.3%), but
on average (across our 100 workloads) it leads to a 1% reduction
in MPI.
ReD 16 MB outperforms Shared ReD 16 MB in 5 applications of the
mix, and obtains similar performance in bwaves, milc and omnetpp.
Fig. 15 plots the average fraction of the reuse detector occupied
by each program using these two mechanisms. Just two applications,
bwaves and milc, occupy 76% of the shared detector, stealing
capacity from the other six applications. The private detector in
ReD protects all applications from this thrashing, which leads to
a fair distribution of the detection window, and ultimately better
performance of the workload.
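The thrashing effect can be reproduced with a toy model. This is only an illustration under strong assumptions: each detector is a plain FIFO buffer of addresses (not ReD's actual organization), two cores stream unique blocks four times faster than a third core that loops over six blocks, and the private and shared configurations have the same total capacity. All names and sizes are hypothetical.

```python
from collections import OrderedDict


class FifoDetector:
    """Toy reuse detector: remembers the last `capacity` distinct addresses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def observe(self, addr):
        """Return True if addr is remembered (reuse detected); else record it."""
        if addr in self.entries:
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry
        self.entries[addr] = None
        return False


def run(detectors, rounds=100):
    """Count reuses detected for core 2; cores 0 and 1 stream unique blocks."""
    next_stream = 0
    detected = 0
    for i in range(rounds):
        for core in (0, 1):
            for _ in range(4):  # streaming cores are 4x more active
                detectors[core].observe(("stream", next_stream))
                next_stream += 1
        detected += detectors[2].observe(("loop", i % 6))
    return detected


private = [FifoDetector(8) for _ in range(3)]  # 3 x 8 entries
shared = [FifoDetector(24)] * 3                # one 24-entry buffer for all
print(run(private), run(shared))
```

With private buffers the looping core keeps its six blocks in its own window and every access after the first pass is detected as a reuse; with the shared buffer the streaming cores push those addresses out before they are referenced again.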
6.6. Additional cache sizes
In this section we show the performance of ReD using distinct
SLLC cache sizes, and compare it with the selected three
proposals.
Fig. 16 plots normalized hIPC and MPI reduction against the
base system obtained by CHAR, exclusive EAF, exclusive Reuse
Cache and ReD for varying SLLC sizes. ReD outperforms all of the
other proposals in both metrics, at all SLLC sizes considered.
Fig. 9. Distribution of normalized IPC, compared to the base system, for all applications in all workloads.
Fig. 10. Normalized harmonic IPC, compared to the base system, for all 100 workloads.
Fig. 11. Normalized IPC obtained by ReD, compared to the base system, for single-processor workloads that have LLC MPKI >2, on an exclusive LLC with 1 MB in the data
array.
Fig. 12. Number of blocks selected for SLLC insertion after coming from main memory, for all applications in the example workload. From left to right: CHAR, shared ReD (8
MB and 16 MB) and ReD (16 MB, 2 MB per core).
6.7. Alternative cache replacement policy: Least Recently Filled (LRF)
Content selection and replacement policies are usually aligned,
as they share the same objectives. Therefore, we use TC-AGE as
replacement policy: ReD selects reused blocks to be stored in the
SLLC, and TC-AGE aims to retain the most reused blocks as long
as possible. However, content selection and replacement policies
are orthogonal. The former chooses which blocks enter the SLLC
Fig. 13. Accuracy of content selection mechanisms: percentage of new blocks used at least once after being sent to the SLLC, in the example workload. From left to right:
CHAR, shared ReD (8 MB and 16 MB) and ReD (16 MB, 2 MB per core).
Fig. 14. SLLC MPI reduction with respect to the base system. From left to right: CHAR, shared ReD (8 MB and 16 MB) and ReD (16 MB, 2 MB per core).
and the latter which blocks are evicted to make room for others.
Therefore, ReD can be implemented on an SLLC managed with any
replacement policy.
As ReD keeps the detector store decoupled from the cache, the
replacement policy cannot adversely affect the detector
efficiency. In other mechanisms where the SLLC itself is used as
the detector store, there is a clear dependence between detection and
replacement: a poor replacement algorithm can adversely affect
the detector.
Next, we analyze the impact of our proposal on an exclusive
SLLC with a 4-bit Least-Recently-Filled (LRF) replacement policy,
similar to LRU in inclusive caches. Fig. 17 plots normalized hIPC
and MPI reduction obtained by ReD when using LRF and TC-AGE,
compared to a base system with the same replacement policy but
without ReD. Adding ReD leads to better MPI reductions and higher
normalized hIPC with LRF than with TC-AGE. This is not a surprising
result, as the former is not as efficient as the latter and leaves more
room for improvement.
6.8. Additional performance metrics
In this section we show our results using two alternative
performance metrics. Fig. 18 plots normalized IPC (speedup) and
normalized weighted speedup [31] for CHAR, exclusive EAF, exclusive
Reuse Cache and ReD.
7. Design space exploration
The following three subsections evaluate the performance-cost
trade-offs of ReD capacity, sector size and tag compression,
searching for a balanced configuration.
7.1. ReD capacity
Here we study how the results vary depending on the ReD
capacity (see Section 3.2). Fig. 19 shows normalized hIPC and
reduction in MPI, with respect to the baseline, as a function of the
ReD capacity per core. For this experiment, the ReD sector size is
one block and it stores the entire tag. We show average values
across the 100 mixes described in Section 5.
By increasing capacity, ReD can track blocks that have been
evicted from the L2 cache longer ago, that is, it can detect more
Fig. 15. Fraction of overall detector space occupied by each application on the
example mix, for ReD (16 MB, 2 MB per core) and Shared ReD 16 MB.
distant reuses. The optimal configuration is achieved with a capacity
of 2 MB per core, which yields an hIPC increase of 9.9% and
reduces MPI by 10.4%. A ReD with a capacity larger than 2 MB detects
more blocks with reuse than the SLLC can effectively store,
leading to a performance decrease compared to the 2 MB ReD.
7.2. ReD sector size
Increasing sector size decreases the ReD hardware cost (see
Section 3.5). We call ReD size the hardware cost of a given ReD
configuration measured in bytes. Fig. 20 shows ReD size as a
function of capacity and sector size, when using tags with 10 bits.
For a given ReD capacity, doubling the sector size halves
the number of ReD sets. Therefore, a ReD with a bigger sector
size has a smaller ReD size. This is because the storage saved by
reducing the number of entries is greater than the storage needed
to add valid bits for each block of a sector.
On the other hand, Fig. 21 shows how performance varies when
the ReD sector size increases.
We have defined the reuse detection window as the set of
block addresses of an executing thread that ReD remembers (ReD
window for short). Note that it can differ from the ReD
capacity because some threads may not use the full capacity of
the detector. If we increase the sector size while maintaining the
ReD capacity, the ReD window decreases because sometimes the
thread will not reference all the blocks of a sector. Therefore, ReD
will detect less reuse, leading to a performance degradation for all
the ReD capacities except for 8 MB, where ReD already detected
more blocks with reuse than the SLLC is able to store.
Fig. 16. Normalized harmonic IPC (left) and MPI reduction (right) compared to the base system, for different SLLC data sizes, for four systems: ReD (with a balanced
configuration for each size), CHAR, Evicted Address Filter Cache and Reuse Cache (RC-16/4, RC-32/8 and RC-64/16). All are implemented on an exclusive SLLC with TC-AGE
in the data array.
Fig. 17. Normalized hIPC (left) and MPI reduction (right) obtained when adding the ReD content selection mechanism to base systems with 4-bit LRF (Least-Recently-Filled)
and 2-bit TC-AGE as replacement policies. Both are implemented on an exclusive SLLC with 8 MB in the data array.
Fig. 18. Normalized IPC (left) and normalized weighted speedup (right) compared to the base system with 8 MB, for four systems: ReD (with the balanced configuration),
CHAR, Evicted Address Filter and Reuse Cache (with RC-32/8 and NRR in tag array). All are implemented on an exclusive SLLC with TC-AGE in the data array.
Fig. 19. Normalized hIPC (left) and reduction of SLLC misses per instruction (right) with respect to the base system, as a function of the ReD capacity per core.
Fig. 20. ReD size per core in KB, as a function of capacity and sector size. The sector size is the number of blocks associated with each ReD tag. We consider tags of 10 bits.
Fig. 21. Normalized hIPC with respect to the base system, as a function of the ReD capacity per core, and for different sector sizes.
Fig. 22. Left: average rate of detection errors in ReD due to tag compression. Center: normalized hIPC with respect to the base system. Right: SLLC MPI reduction with respect
to the base system. The NC bar represents the tag with no compression.
The best configuration in terms of performance, with 2 MB of
capacity and sectors with one block, requires a ReD size of 45 KB per
core. However, other configurations have better performance/cost
ratios: the one with 2 MB of capacity and sectors with 2 blocks
shows 0.35% lower hIPC and a ReD size of 24.5 KB, 46% lower.
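The two sizes quoted above can be reproduced with a short calculation. This is a sketch under assumptions, not the layout given in Section 3.5: each ReD entry is taken to hold a 10-bit tag plus one valid bit per block of its sector, and entries are assumed to be grouped in 16-entry sets with a 4-bit replacement pointer per set. The set organization is a guess made here because it matches the 45 KB and 24.5 KB figures; the function and parameter names are hypothetical.

```python
BLOCK_BYTES = 64
MB = 1024 * 1024


def red_size_bytes(capacity_bytes, sector_blocks, tag_bits,
                   ways=16, pointer_bits=4):
    """Storage of one per-core ReD in bytes (assumed organization)."""
    entries = capacity_bytes // (BLOCK_BYTES * sector_blocks)
    sets = entries // ways                     # assumed 16 entries per set
    bits_per_entry = tag_bits + sector_blocks  # tag + one valid bit per block
    total_bits = sets * (ways * bits_per_entry + pointer_bits)
    return total_bits / 8


one_block = red_size_bytes(2 * MB, sector_blocks=1, tag_bits=10)
two_blocks = red_size_bytes(2 * MB, sector_blocks=2, tag_bits=10)
print(one_block / 1024, two_blocks / 1024)
print(f"saving: {100 * (1 - two_blocks / one_block):.0f}%")
```

Under these assumptions, doubling the sector size halves the number of tags while adding only one valid bit per entry, which is why the size drops by almost half.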
7.3. ReD tag size

As explained in Section 3.5, ReD can store compressed tags to reduce the amount
of storage required. Fig. 22 shows on the left, depending on the tag size, the
average error rate in reuse detection due to tag compression. These errors are
false positives: a false reuse is detected because the compressed tag that is
being searched matches with that of a different sector previously registered.
Inserting not-reused blocks into the SLLC reduces the effectiveness of the
mechanism. The error rate is less than 1% with a tag of only 12 bits.
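
The false positives described above can be illustrated with a small sketch. The compression function used by ReD is defined in Section 3.5 and is not reproduced here; the XOR-folding function `compress_tag`, the 30-bit full tag width, the 16-entry set and the random sector tags are all assumptions made only for illustration. The sketch registers a few sectors under compressed tags and estimates how often a lookup for a sector that was never registered matches one of them.

```python
import random

FULL_TAG_BITS = 30  # assumed width of an uncompressed sector tag


def compress_tag(full_tag: int, bits: int) -> int:
    """Fold a full tag into `bits` bits by XOR-ing successive slices (assumed scheme)."""
    mask = (1 << bits) - 1
    folded = 0
    while full_tag:
        folded ^= full_tag & mask
        full_tag >>= bits
    return folded


def false_positive_rate(bits: int, ways: int = 16, trials: int = 20000, seed: int = 1) -> float:
    """Estimate how often an unregistered sector matches a stored compressed tag."""
    rng = random.Random(seed)
    false_hits = 0
    for _ in range(trials):
        # One hypothetical detector set holding `ways` distinct registered sectors,
        # plus one extra distinct sector that is used as the probe.
        full_tags = rng.sample(range(1 << FULL_TAG_BITS), ways + 1)
        probe = full_tags.pop()  # this sector was never registered
        stored = {compress_tag(tag, bits) for tag in full_tags}
        if compress_tag(probe, bits) in stored:
            false_hits += 1  # compressed tags collide: a false reuse is detected
    return false_hits / trials


if __name__ == "__main__":
    for bits in (8, 10, 12, FULL_TAG_BITS):
        print(bits, false_positive_rate(bits))
```

In this toy model the error rate shrinks as the tag grows and disappears with uncompressed tags, which is the trend that Fig. 22 reports; the absolute values depend on the assumed set size and hash and should not be read as the paper's measurements.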
Fig. 22 also shows, in the center and on the right, the normalized hIPC and MPI
reduction obtained for our selected configuration (2 MB of capacity and a
sector of 2 blocks) as a function of tag size. The performance loss is almost
negligible for a tag size of 10 bits: normalized hIPC decreases 0.29% while MPI
increases 0.26% compared to the configuration with uncompressed tags. This
justifies compression in order to reduce the amount of storage required.

This is the balanced configuration we have selected: 2 MB ReD capacity, sector
size of two blocks and 10-bit tags. It has a ReD size of 24.5 KB.
8. Related work

Any cache content management mechanism is based on a model that forecasts
whether a block is going to be used in the immediate future or not. We can
classify state-of-the-art mechanisms for the SLLC into two groups: those that
rely on the last touch to each block (PC based [18], PC-sequence based [21],
counter based [20], ...), and those that rely on the reuse locality
[1–3,9–11,15,19,22,28,30,33].

In a similar way, Faldu and Grot [8] classify management strategies into Dead
Block Predictors (DBPs) and insertion policies. DBPs try to predict whether a
block has reached the end of its useful lifetime on chip. Insertion policies
try to predict when a block is dead on arrival (it will not see any reuse in
the cache after its insertion). The paper concludes that DBPs are less accurate
than insertion policies.

We focus on proposals relying on the reuse locality property of the SLLC
blocks, which are "insertion policies" in the taxonomy of Faldu and Grot [8].
We can classify them according to three characteristics:
• Replacement/content selection: In some proposals, the replacement algorithm
  gives higher priority to stay in the cache to those blocks showing reuse. On
  the other hand, some works leverage reuse locality to select for insertion
  only those blocks classified as reused, bypassing the rest.
• Detection/prediction: In order to classify blocks, some authors suggest reuse
  detection mechanisms, while others propose mechanisms to predict a reuse
  behavior before it appears.
• Address store: Both detection and prediction mechanisms need to remember past
  block addresses in order to identify the second access to a block. Detection
  mechanisms associate reuse to the block receiving a second access while
  prediction mechanisms associate reuse to a signature of the block receiving a
  second access. The structure that holds past accesses can be implemented in
  several ways, either embedded into the SLLC itself or as an independent
  structure.

Table 4 contains a sample of previous work classified according to this
taxonomy.
8.1. Replacement policies

Both detection and prediction have been used to guide replacement algorithms,
and most of them keep the reuse window in the SLLC itself.

Replacement mechanisms based on detection label a block as not reused when it
comes from main memory (its first use in the reuse window that the SLLC is able
to recall). Subsequent SLLC hits (second and later touches to the block) will
flag the block as reused. Two proposals include an added store to remember
these blocks. Seshadri et al. [30] use a Bloom Filter and Gupta et al. [11] use
a Bypass Buffer.
Table 4
Classification of previous work based on the reuse locality property, according to our taxonomy.

                                  Addresses in the SLLC                In a different store
Replacement        Detection      Gao and Wilkerson [9]                Seshadri et al. [30]
                                  Gaur et al. [10] (TC-AGE)            Gupta et al. [11]
                                  Khan et al. [19]
                                  Albericio et al. [2]
                   Prediction     Qureshi et al. [28]
                                  Jaleel et al. [15]
                                  Wu et al. [33]
Content selection  Detection      Albericio et al. [1]                 Our proposal (ReD)
                   Prediction     Gaur et al. [10] (Bypass + TC-AGE)
                                  Chaudhuri et al. [3]
                                  Li et al. [22]

Replacement mechanisms based on prediction try to figure out whether a block
will be reused before it really is, and flag the block as reused just after its
first touch. Prediction policies categorize blocks according to certain
features (signatures in Wu et al. [33]), and study the reuse characteristics of
any block in each category. Wu et al. [33] analyze distinct types of
signatures: memory region, program counter, or instruction sequence. As an
example, the PC signature policy acts by classifying blocks according to the PC
of the memory instruction responsible for bringing them into the chip. It
identifies the reuse behavior of the blocks that each instruction loads (mainly
categorizing them as reused or not reused), and assigns the same category to
all blocks that the same instruction will bring in the future.
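
The idea behind a PC signature policy can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of Wu et al. [33]: the class name `PCSignaturePredictor`, the 2-bit saturating counters, the unbounded dictionary and the optimistic default for unseen PCs are all assumptions chosen for brevity.

```python
class PCSignaturePredictor:
    """Toy reuse predictor keyed by the PC of the load that fetched a block."""

    def __init__(self, counter_max: int = 3, initial: int = 1):
        self.counter_max = counter_max  # 3 corresponds to a 2-bit saturating counter
        self.initial = initial          # assumed starting value for unseen PCs
        self.counters = {}              # pc -> saturating counter

    def predict_reused(self, pc: int) -> bool:
        """Predict whether blocks brought in by this PC will be reused."""
        return self.counters.get(pc, self.initial) > 0

    def train(self, pc: int, was_reused: bool) -> None:
        """Update the category of a PC once the fate of one of its blocks is known."""
        counter = self.counters.get(pc, self.initial)
        if was_reused:
            counter = min(self.counter_max, counter + 1)
        else:
            counter = max(0, counter - 1)
        self.counters[pc] = counter


if __name__ == "__main__":
    predictor = PCSignaturePredictor()
    predictor.train(0x400, was_reused=False)  # a block from this load left unused
    predictor.train(0x500, was_reused=True)   # a block from this load was hit again
    print(predictor.predict_reused(0x400), predictor.predict_reused(0x500))
```

Every block is classified by the load that brought it in, so a single observed outcome influences all future blocks of the same instruction. This is what distinguishes prediction from detection, which waits for each individual block to show its second access.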
DIP and DRRIP mechanisms are also predictors [15,28]. Using set-dueling
techniques, these mechanisms analyze the reuse behavior of the entire
application and apply it to all its blocks. All blocks are categorized into a
single category, the one of their own application.
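
A minimal sketch of set dueling follows, assuming two candidate policies labeled "A" and "B", a handful of leader sets dedicated to each, and a single saturating selector counter. The class name `SetDuel`, the leader-set layout and the counter width are illustrative assumptions, not the exact parameters of DIP or DRRIP.

```python
class SetDuel:
    """Toy set-dueling selector between two policies, 'A' and 'B'."""

    def __init__(self, num_sets: int = 1024, leaders_per_policy: int = 32, psel_bits: int = 10):
        self.psel_max = (1 << psel_bits) - 1
        self.midpoint = self.psel_max // 2
        self.psel = self.midpoint  # start with no preference
        stride = num_sets // leaders_per_policy
        # Leader sets always use their own policy; all other sets are followers.
        self.leaders_a = set(range(0, num_sets, stride))
        self.leaders_b = set(range(stride // 2, num_sets, stride))

    def on_miss(self, set_index: int) -> None:
        """A miss in a leader set counts against the policy that set is running."""
        if set_index in self.leaders_a:
            self.psel = min(self.psel_max, self.psel + 1)
        elif set_index in self.leaders_b:
            self.psel = max(0, self.psel - 1)

    def policy_for(self, set_index: int) -> str:
        """Return the policy a set should apply to the next incoming block."""
        if set_index in self.leaders_a:
            return "A"
        if set_index in self.leaders_b:
            return "B"
        # Followers adopt whichever policy has accumulated fewer leader misses.
        return "B" if self.psel > self.midpoint else "A"
```

Because the selector summarizes the misses of the whole application, every follower set receives the same decision, which is the single-category behavior described above.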
8.2. Content selection policies

Except for the Reuse Cache proposed by Albericio et al. [1], all other content
selection policies include some sort of prediction: Li et al. [22] use the PC
signature policy, Gaur et al. [10] the trip count and use count of blocks, and
Chaudhuri et al. [3] the behavior of blocks during their stay in private caches
and their coherence status. For each class, an algorithm analyzes its SLLC
behavior and extends it to all future blocks belonging to the same category.

Compared with our proposal, those using prediction tend to be more complex, and
often require transferring data between cache levels or even sending the PC to
the cache subsystem. Additionally, predictors show lower accuracy than
detectors.

On the other hand, all previous content selection techniques track reuse (and
reuse patterns) using the SLLC. Therefore, the SLLC size defines and limits the
size of the reuse detection window.

Finally, all these proposals have in common an important constraint: their
reuse detector is shared among all threads running on the CMP. A single thread
can thrash the detector, shrinking the reuse detection window of the remaining
applications. To overcome this, we propose implementing reuse detectors that
are private to every core.
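
The thrashing argument can be made concrete with a toy model. The sketch below is not the ReD design: it ignores sectors, compressed tags and the actual replacement scheme, and simply models a detector as a FIFO store of block addresses that reports reuse when an evicted address is already present. The names `ReuseDetector` and `count_detected`, the capacities and the access pattern are all assumptions made for illustration.

```python
from collections import OrderedDict


class ReuseDetector:
    """FIFO-managed address store that remembers recently evicted blocks."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.addresses = OrderedDict()

    def on_l2_eviction(self, addr: int) -> bool:
        """Return True if the block shows reuse and should be inserted in the SLLC."""
        if addr in self.addresses:
            return True  # address seen before inside the detection window
        self.addresses[addr] = None
        if len(self.addresses) > self.capacity:
            self.addresses.popitem(last=False)  # forget the oldest address
        return False  # first sighting: the block bypasses the SLLC


def count_detected(private: bool, capacity_per_core: int = 64) -> int:
    """Count reuses detected for core 0 while core 1 streams through memory."""
    if private:
        detectors = [ReuseDetector(capacity_per_core), ReuseDetector(capacity_per_core)]
    else:
        shared = ReuseDetector(2 * capacity_per_core)  # same total storage, shared
        detectors = [shared, shared]
    detected = 0
    stream_addr = 1_000_000
    for _round in range(10):
        for addr in range(32):  # core 0 repeatedly evicts a 32-block working set
            if detectors[0].on_l2_eviction(addr):
                detected += 1
            for _ in range(8):  # core 1 evicts 8 never-reused blocks in between
                detectors[1].on_l2_eviction(stream_addr)
                stream_addr += 1
    return detected


if __name__ == "__main__":
    print("private:", count_detected(private=True))
    print("shared: ", count_detected(private=False))
```

With the same total storage, the shared detector in this model loses the addresses of core 0 before they are evicted again, because the streaming core fills the FIFO in between. The private detectors keep the working set of core 0 visible regardless of what core 1 does.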
9. Conclusions

Previous publications reveal that the stream of references reaching the shared
last-level cache (SLLC) of a multiprocessor chip shows little temporal
locality. However, it shows reuse locality, i.e., blocks referenced more than
once are more likely to be referenced in the near future. As a result,
conventional management makes inefficient use of the cache. There are several
proposals addressing this problem for inclusive caches, but few that focus on
exclusive ones.

This paper proposes a novel content selection mechanism for exclusive SLLCs
that leverages the reuse locality embedded in the SLLC request stream. We
propose adding a Reuse Detector (ReD), placed in between each L2 cache and the
SLLC, to discover which of the L2 evicted blocks have not experienced reuse and
avoid their insertion in the SLLC, bypassing them. We analyze problems
affecting similar recent mechanisms (low accuracy, reduced visibility window
and thrashing in the detector) and design ReD to overcome them as much as
possible. We evaluate the proposal in a multicore chip with eight processors
that executes a multiprogrammed workload. Properly designed, the Reuse Detector
prevents the insertion of many useless blocks in the SLLC, and helps keep the
most reused ones. Experimental results show that this allows for enhancing SLLC
performance beyond other recent proposals. Specifically, ReD reduces the SLLC
misses per instruction by 10.1% with respect to a base cache with TC-AGE
replacement and no content selection, while CHAR and exclusive-cache versions
of the Reuse Cache and the EAF cache reduce MPI by 4.3%, 4.5% and 2.7%
respectively.

Although ReD is proposed here specifically for exclusive caches, we think that
the design can be expanded to also support other hierarchies that are not
strictly inclusive, and therefore can support bypassing. Exploring such a
generalized design is part of our future work. We are also working on ways to
increase the performance of our mechanism in programs that currently benefit
relatively less from ReD, that is, those whose blocks experience fewer multiple
reuses.
Acknowledgments

We thank the anonymous referees for their valuable comments to improve our
paper. This work was supported in part by grants TIN2016-76635-C2-1-R
(AEI/FEDER, UE), TIN2015-65316-P, Consolider NoE TIN2014-52608-REDC (Spanish
Gov.), and gaZ: T58_17R research group (Aragón Gov. and European ESF).
References

[1] J. Albericio, P. Ibáñez, V. Viñals, J.M. Llabería, The reuse cache: downsizing the shared last-level cache, in: Proceedings of the 46th Ann. Int. Symp. on Microarchitecture, 2013, pp. 310–321.
[2] J. Albericio, P. Ibáñez, V. Viñals, J.M. Llabería, Exploiting reuse locality on inclusive shared last-level caches, ACM Trans. Archit. Code Optim. 9 (4) (2013) 38.
[3] M. Chaudhuri, J. Gaur, N. Bashyam, S. Subramoney, J. Nuzman, Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches, in: Proceedings of the 21st Int. Conference on Parallel Architectures and Compilation Techniques, 2012, pp. 293–304.
[4] M. Clark, A new, high performance x86 core design from AMD, Hot Chips 2016, 2016.
[5] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, B. Hughes, Cache hierarchy and memory subsystem of the AMD Opteron processor, IEEE Micro 30 (2) (2010) 16–29.
[6] A.N. Eden, T. Mudge, The YAGS branch prediction scheme, in: Proceedings of the 31st Ann. ACM/IEEE Int. Symp. on Microarchitecture, 1998, pp. 69–77.
[7] S. Eyerman, L. Eeckhout, Restating the case for weighted-IPC metrics to evaluate multiprogram workload performance, IEEE Comput. Archit. Lett. 13 (2) (2014) 93–96.
[8] P. Faldu, B. Grot, LLC dead block prediction considered not useful, in: 13th Workshop on Duplicating, Deconstructing and Debunking (WDDD), 2016.
[9] H. Gao, C. Wilkerson, A dueling segmented LRU replacement algorithm with adaptive bypassing, in: Proceedings of the 1st JILP Workshop on Computer Architecture Competitions, 2010.
[10] J. Gaur, M. Chaudhuri, S. Subramoney, Bypass and insertion algorithms for exclusive last-level caches, in: Proceedings of the 38th Int. Symp. on Computer Architecture, 2011, pp. 81–92.
[11] S. Gupta, H. Gao, H. Zhou, Adaptive cache bypassing for inclusive last level caches, in: Proceedings of the 27th Int. Symp. on Parallel & Distributed Processing, 2013, pp. 1243–1253.
[12] J.L. Henning, SPEC CPU2006 benchmark descriptions, ACM SIGARCH Comput. Archit. News 34 (4) (2006) 1–17.
[13] A. Jaleel, E. Borch, M. Bhandaru, S.C. Steely Jr., J. Emer, Achieving non-inclusive cache performance with inclusive caches. Temporal Locality Aware (TLA) cache management policies, in: Proceedings of the 43rd Ann. Int. Symp. on Microarchitecture, 2010, pp. 151–162.
[14] A. Jaleel, J. Nuzman, A. Moga, S.C. Steely Jr., J. Emer, High performing cache hierarchies for server workloads. Relaxing inclusion to capture the latency benefits of exclusive caches, in: Proceedings of the 21st Int. Symp. on High Performance Computer Architecture, 2015, pp. 343–353.
[15] A. Jaleel, K.B. Theobald, S.C. Steely Jr., J. Emer, High performance cache replacement using re-reference interval prediction (RRIP), in: Proceedings of the 37th Int. Symp. on Computer Architecture, 2010, pp. 60–71.
[16] N.P. Jouppi, S.J.E. Wilton, Tradeoffs in two-level on-chip caching, in: Proceedings of the 21st Ann. Int. Symp. on Computer Architecture, pp. 34–45.
[17] D. Kanter, Skylake-SP scales server systems, Microprocessor Report, July 17, 2017.
[18] S. Khan, Y. Tian, D.A. Jiménez, Sampling dead block prediction for last-level caches, in: Proceedings of the 43rd Ann. Int. Symp. on Microarchitecture, 2010, pp. 175–186.
[19] S. Khan, Z. Wang, D. Jimenez, Decoupled dynamic cache segmentation, in: Proceedings of the IEEE 18th Int. Symp. High Performance Computer Architecture, HPCA, 2012, pp. 1–12.
[20] M. Kharbutli, Y. Solihin, Counter-based cache replacement and bypassing algorithms, IEEE Trans. Comput. 57 (4) (2008) 433–447.
[21] An-Chow Lai, C. Fide, B. Falsafi, Dead-block prediction & dead-block correlating prefetchers, in: Proceedings of the 28th Ann. Int. Symp. on Computer Architecture, 2001, pp. 144–154.
[22] L. Li, D. Tong, Z. Xie, J. Lu, X. Cheng, Optimal bypass monitor for high performance last-level caches, in: Proceedings of the 21st Int. Conference on Parallel Architectures and Compilation Techniques, 2012, pp. 315–324.
[23] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, B. Falsafi, Scale-out processors, ACM SIGARCH Comput. Archit. News 40 (3) (2012) 500–511.
[24] K. Luo, J. Gummaraju, M. Franklin, Balancing throughput and fairness in SMT processors, in: Proceedings of the IEEE Int. Symp. Performance Analysis of Systems and Software, ISPASS, 2001, pp. 164–171.
[25] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, B. Werner, Simics: a full system simulation platform, Computer 35 (2) (2002) 50–58.
[26] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, D. Wood, Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset, Comput. Archit. News 33 (4) (2005) 92–99.
[27] R.L. Mattson, J. Gecsei, D.R. Slutz, I.L. Traiger, Evaluation techniques for storage hierarchies, IBM Syst. J. 9 (2) (1970) 78–117.
[28] M. Qureshi, A. Jaleel, Y. Patt, S. Steely, J. Emer, Adaptive insertion policies for high performance caching, in: Proceedings of the 34th Ann. Int. Symp. on Computer Architecture, 2007, pp. 381–391.
[29] P. Rosenfeld, E. Cooper-Balis, B. Jacob, DRAMSim2: a cycle accurate memory system simulator, Comput. Archit. Lett. 10 (1) (2011) 16–19.
[30] V. Seshadri, O. Mutlu, M.A. Kozuch, T.C. Mowry, The evicted-address filter: a unified mechanism to address both cache pollution and thrashing, in: Proceedings of the 21st Int. Conference on Parallel Architectures and Compilation Techniques, 2012, pp. 355–366.
[31] A. Snavely, D.M. Tullsen, Symbiotic jobscheduling for simultaneous multithreading processor, in: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2000, pp. 234–244.
[32] Sun Microsystems. UltraSPARC T2 supplement to the UltraSPARC architecture 2007. Draft D143.
[33] C.J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S.C. Steely Jr., J. Emer, SHiP: signature-based hit predictor for high performance caching, in: Proceedings of the 44th Ann. Int. Symp. on Microarchitecture, 2011, pp. 430–441.

Javier Díaz is pursuing a Ph.D. degree from the School of Engineering and
Architecture at the University of Zaragoza. His main research topic is the
management of the memory hierarchy in chip multiprocessor systems. He has held
teaching positions as Associate Professor at the University of Zaragoza, and is
currently working at DXC Technology. His research interests include computer
architecture and parallel computing, with focus on cache and memory management.

Teresa Monreal-Arnal received the MS degree in Mathematics and the Ph.D. degree
in Computer Science from the University of Zaragoza, Spain, in 1991 and 2003,
respectively. Until 2007, she was with the Informática e Ingeniería de Sistemas
Department (DIIS) at the University of Zaragoza, Spain. Currently, she is an
Associate Professor with the Computer Architecture Department (DAC) at the
Universitat Politècnica de Catalunya (UPC), Spain. Her research interests
include processor microarchitecture, memory hierarchy, and parallel computer
architecture. She collaborates actively with the Grupo de Arquitectura de
Computadores from the University of Zaragoza (gaZ).

Pablo Ibáñez received the MS degree in Computer Science from the Universitat
Politècnica de Catalunya in 1989, and the Ph.D. degree in Computer Science from
the Universidad de Zaragoza in 1998. He is an Associate Professor in the
Departamento de Informática e Ingeniería de Sistemas (DIIS) at the Universidad
de Zaragoza, Spain. His research interests include processor microarchitecture,
memory hierarchy, parallel computer architecture, and High Performance
Computing (HPC) applications. He is a member of the Instituto de Investigación
en Ingeniería de Aragón (I3A) and the European HiPEAC NoE.

José María Llabería received the MS degree in telecommunication, and the MS and
the Ph.D. degrees in computer science from the Universitat Politècnica de
Catalunya (UPC) in 1980, 1982, and 1983, respectively. He is a full professor
in the Computer Architecture Department at UPC (Barcelona, Spain). His research
interests include processor microarchitecture, memory hierarchy, parallel
computer architecture, vector processors, and compiler technology for these
processors.

Víctor Viñals-Yúfera received the MS degree in Telecommunications, and the
Ph.D. degree in Computer Science from the Universitat Politècnica de Catalunya
(UPC) in 1982 and 1987, respectively. He was an associate professor in the
Facultat d’Informàtica de Barcelona (UPC) in the 1983–88 period. Currently, he
is a full professor in the Informática e Ingeniería de Sistemas Department at
the University of Zaragoza, in Zaragoza (Spain). His research interests include
processor microarchitecture, memory hierarchy and parallel computer
architecture. He is a member of the ACM and the IEEE Computer Society. He also
belongs to the Juslibol Midday Runners Team and to the Computer Architecture
Group of the University of Zaragoza.
