Touché: Towards Ideal and Efficient Cache Compression By Mitigating
  Tag Area Overheads by Hong, Seokin et al.
arXiv:1909.00553v1 [cs.AR] 2 Sep 2019
Touché: Towards Ideal and Efficient Cache
Compression By Mitigating Tag Area Overheads
Seokin Hong∗, Bulent Abali†, Alper Buyuktosunoglu†, Michael B. Healy†, and Prashant J. Nair‡
∗Kyungpook National University †IBM T. J. Watson Research Center ‡The University of British Columbia
seokin@knu.ac.kr [abali,alperb,mbhealy]@us.ibm.com prashantnair@ece.ubc.ca
Abstract—Compression is seen as a simple technique to in-
crease the effective cache capacity. Unfortunately, compression
techniques either incur tag area overheads or restrict data
placement to only include neighboring compressed cache blocks
to mitigate tag area overheads. Ideally, we should be able to
place arbitrary compressed cache blocks without any placement
restrictions and tag area overheads.
This paper proposes Touché, a framework that enables storing
multiple arbitrary compressed cache blocks within a physical
cacheline without any tag area overheads. The Touché framework
consists of three components. The first component, called the
“Signature” (SIGN) engine, creates shortened signatures from
the tag addresses of compressed blocks. Due to this, the SIGN
engine can store multiple signatures in each tag entry. On a
cache access, the physical cacheline is accessed only if there is
a signature match (which has a negligible probability of false
positive). The second component, called the “Tag Appended
Data” (TADA) mechanism, stores the full tag addresses with
data. TADA enables Touché to detect false positive signature
matches by ensuring that the actual tag address is available
for comparison. The third component, called the “Superblock
Marker” (SMARK) mechanism, uses a unique marker in the tag
entry to indicate the occurrence of compressed cache blocks from
neighboring physical addresses in the same cacheline. Touché is
completely hardware-based and achieves an average speedup of
12% (ideal 13%) when compared to an uncompressed baseline.
I. INTRODUCTION
As Moore’s Law slows down, the number of transistors-per-
core for Last-Level caches (LLC) tends to be stagnating [1],
[2], [3], [4], [5]. For instance, in moving from Ivy Bridge (i7-
4930K processor at 22nm) to Broadwell (i5-5675C processor
at 14nm), the LLC capacity per core (thread) has stagnated at
1MB [6], [7]. One can employ data compression to increase
the effective LLC capacity [8], [9], [10], [11]. Unfortunately,
data compression may also incur significant tag area over-
heads [12], [13], [14]. This is because, in conventional caches
each block needs a separate tag. We can reduce the tag area
overheads by storing compressed blocks only from neighbor-
ing addresses [15], [16], [17], [18]. This enables us to use
a single overlapping tag for all compressed blocks. However,
such an approach restricts data compression only to regions
that contain contiguous compressed blocks. Ideally, we would
like to employ LLC compression without any data placement
restrictions and tag area overheads.
Tag overheads are a key roadblock for cache compression.
For instance, if we store 4x more blocks, the effective LLC
capacity can be increased by 4x. But we will also incur the
area overheads for maintaining 4x unique tags. Furthermore,
it is likely that these unique tags have no locality, cannot
be combined together, and therefore incur significant area
overheads [17], [18]. One can reduce the tag area overhead
with placement restrictions. For instance, if we set a rule that
only compressed blocks from neighboring memory addresses
can reside in a physical cacheline, then we can overlap
their tags. A group of such contiguous compressed blocks is called
a “superblock” and its tag is called a “superblock-tag” [15],
[16]. For a 4MB 8-way cache, superblock-tags can track 4
compressed blocks per cacheline with 1.35x tag area.
Restricting block placement by using superblocks reduces
the benefits of compression. Figure 1 shows the effective
LLC capacity for four designs executing 29 memory-intensive
SPEC workloads in mixed and rate modes on a 4MB shared
LLC [19]. The first design is a baseline LLC without data
compression. The second design employs data compression in
LLC while using superblocks. Such a design has a tag
area of 1.35x as compared to the baseline, yet it increases
the effective LLC capacity by only 20%. This is because
only blocks from neighboring addresses can be compressed
and stored in the cacheline. The third design enables data
compression to place arbitrary tags in the same cacheline.
While this design increases the effective LLC capacity by
38%, it also requires 3.7x the tag area. The fourth design is an
ideal design which places arbitrary tags in the same cacheline
without any area overheads. This paper presents Touché, a
framework that helps achieve the fourth design to enable near-
ideal LLC compression.
Fig. 1. The effective capacity and tag area overheads for a 4MB last-level
cache employing compression. Superblock-tags uses 1.35x tag area while
providing 20% higher effective capacity. Arbitrary-tags uses 3.7x tag area
while providing 38% higher effective capacity. The goal of this paper is to
obtain 38% higher effective capacity with no tag overhead.
Touché mitigates the tag area overheads by using a shortened
signature of the full tag address for each compressed
block. This has two key benefits. First, short signatures require
fewer bits as compared to full tag addresses. Due to this,
multiple signatures from different tag addresses can be placed
in the space that was originally reserved for only a single tag
address. Second, by enabling arbitrary signatures to reside next
to each other, we can overcome restrictions of prior work that
require compressed blocks to be from neighboring addresses.
Furthermore, as compression creates unused space in the data
array, tag addresses can be appended to compressed blocks
and stored in this unused space.
The Touché framework consists of three components. The
first component, called the “Signature” (SIGN) engine, cre-
ates shortened signatures from the tag addresses and places
them in the tag array. The second component, called the
“Tag Appended Data” (TADA) mechanism, appends full tag
addresses to the compressed blocks and stores them in the
data array. The third component, called the “Superblock
Marker” (SMARK) mechanism, uses a unique marker in the
tag-bits to enable Touché to identify superblocks that contain
4 contiguous compressed blocks from neighboring physical
addresses. We describe each mechanism below:
1) Signature (SIGN) Engine: The Touché framework is
implemented within the LLC controller. The core provides
the LLC controller with a 48-bit physical address for each
request¹. The LLC controller uses this physical address to
index into the appropriate set. At the same time, Touché
invokes the SIGN engine to create a shortened 9-bit signature
of the tag and looks up all the ways for a matching signature.
On a signature match, the corresponding compressed block
is accessed from the data array. As these signatures are only
9-bits long, several signatures, each belonging to a different
tag address, can co-reside in a tag entry. For instance, a 4MB
8-way LLC with 64 Byte cachelines has tag entries that store
29-bit tag address. The SIGN engine can store up to three 9-bit
signatures in the space that was designed for a single 29-bit
tag address. This enables Touché to store up to three arbitrary
compressed blocks without any tag area overheads.
Unfortunately, simply using shortened 9-bit signatures can
lead to false positive tag matches (signature collisions). On a
signature collision, the LLC controller incorrectly accesses
a block whose tag address does not match that of the request.
For instance, in a workload that has a 0% cache hit-rate (worst
case scenario), a 9-bit signature has an average signature
collision rate of 0.19% (i.e., $1/2^{9}$). Furthermore, as each way in
a set can have up to three 9-bit signatures, Touché potentially
needs to check twenty-four 9-bit signatures in an 8-way LLC
(worst case scenario), which results in a signature-collision rate
of 4.58%. Therefore, it is essential to also check the full tag
address on every signature match.
¹ Processor vendors have already proposed schemes like the Intel 5-level
paging for enabling 57-bit physical addresses to increase the physical address
space from 256 TB to 128 PB [20]. This would increase the tag address bits
within an LLC tag entry by 9.
2) Tag Appended Data (TADA) Mechanism: The full tag
addresses of the compressed blocks can be stored in the data
array. Touché re-provisions a portion of the additional space
that is obtained by compression to store the full tag addresses.
To this end, Touché uses the “Tag Appended Data” (TADA)
mechanism to append full tags beside the compressed blocks.
The TADA mechanism appends metadata information on com-
pressibility (3-bits), dirty and valid state (2-bits), and the full
tag address (29-bits) to each compressed block. Overall, TADA
increases the block size by only 34 bits (4.25 Bytes) and our
experiments show that it only minimally reduces the effective
LLC capacity. On an access, TADA interprets the last few
bits in a compressed cacheline as metadata and tag addresses.
As TADA checks the full tags on all signature matches and
collisions, it guarantees the correctness of each LLC access.
3) Superblock Marker (SMARK) Mechanism: Shortened
9-bit signatures enable storing up to three compressed blocks.
However, there can be instances of four compressed blocks
from neighboring addresses (superblock). To address this scenario,
Touché uses a “Superblock Marker” (SMARK) mechanism
to generate a unique 16-bit marker. Touché stores this
16-bit marker in the tag bits, and uses this marker to indicate
the presence of a superblock within the cacheline.
With a negligible probability (0.012%), the unique 16-bit
marker can flag a match with the signatures that are stored
by the SIGN engine. We call these scenarios SMARK
collisions. Fortunately, SMARK collisions cause no correctness
problems. This is because even after a marker collision, the
TADA mechanism will read the data array and check for full
tag matches. During a collision, the tag addresses will not
match and the compressed blocks are not processed by the
LLC. The SMARK mechanism enables Touché to derive all
the benefits of superblocks while also enabling the storage of
up to three arbitrary compressed blocks.
Touché provides a speedup of 12% (ideal 13%) without any
tag area overheads. It requires only comparators and lookup tables
within the LLC controller. Touché is a completely
hardware-based framework that enables near-ideal compression.
II. BACKGROUND AND MOTIVATION
We provide a brief background on last-level cache organi-
zation and highlight the potential of data compression.
A. Last-Level Caches: Why Size Matters
Processors tend to have several levels of on-chip caches.
Caches are designed to exploit spatial and temporal locality
of accesses. Due to this, caches help improve the performance
of processors as they reduce the number of off-chip accesses
and reduce the latency of memory requests. Caches are usually
designed such that each level is progressively larger than its
previous level. Consequently, the Last-Level Cache (LLC)
tends to have the largest size, is typically shared, and occupies
significant on-chip real-estate. Due to this, it is beneficial to
increase the LLC capacity per core as this would enable the
designers to fit a larger number of blocks on-chip and further
reduce the number of off-chip accesses [21].
B. Last-Level Caches: Capacity Stagnation
Figure 2 shows the LLC capacity per core for commercial
Intel and AMD processors from 2009 until 2018. On average,
as the number of cores has increased, the LLC capacity per
core has reduced. In current multi-core systems, the LLC
capacity per core tends to be less than 1MB. Therefore, going
into the future, it is beneficial to look at techniques to improve
the effective capacity of the LLC [22].
[Figure 2: LLC size (MB) per core versus number of (logical) cores; series: Mean, Intel, AMD.]
Fig. 2. The Last-Level Cache (LLC) capacity per (logical) core for Intel and
AMD processors from 2009 to 2018. On average, as the number of cores has
increased, the LLC capacity per core has reduced.
C. Last-Level Caches: Organization
A Last-Level Cache (LLC) is organized into data arrays
and tag arrays. Each cacheline in the data array has a cor-
responding tag entry in the tag array. Furthermore, groups of
cachelines form “sets” and each cacheline in a set corresponds
to a separate “way”. As the size of the LLC is significantly
smaller than the total physical address space, several blocks
can map into the same set. Because of this, the LLC controller
stores a tag address in the tag entry to uniquely identify the
block in the cacheline. For instance, as shown in Figure 3,
a 4MB 8-way LLC with 64-byte lines, uses 29 bits of tag
address. On a cache access, all the tag entries for each of the
ways in a set are searched in parallel by the LLC controller.
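The 29-bit tag width follows from the cache geometry. The sketch below is illustrative only: the function name `tag_bits` is hypothetical, and it assumes power-of-two sizes and the 48-bit physical address stated in the Introduction.

```python
def tag_bits(phys_addr_bits: int, cache_bytes: int, ways: int, line_bytes: int) -> int:
    """Tag width for a set-associative cache with power-of-two geometry."""
    num_sets = cache_bytes // (line_bytes * ways)
    offset_bits = (line_bytes - 1).bit_length()  # log2(line size)
    index_bits = (num_sets - 1).bit_length()     # log2(number of sets)
    return phys_addr_bits - index_bits - offset_bits


# 4MB, 8-way, 64-byte lines, 48-bit physical address (values from the text).
bits = tag_bits(phys_addr_bits=48, cache_bytes=4 * 1024 * 1024, ways=8, line_bytes=64)
print(bits)  # 29
```

With 8192 sets (13 index bits) and 64-byte lines (6 offset bits), the remaining 48 - 13 - 6 = 29 bits form the tag.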
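[Figure 3: block diagram of a 4MB LLC. The LLC controller connects to/from the L2 cache and to/from main memory and manages tag arrays and data arrays organized as 8192 sets of 8 ways; each tag entry holds a 29-bit tag and each cacheline holds 64 bytes.]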
Fig. 3. The organization of a 4MB Last-Level Cache (LLC). The LLC consists
of data arrays, tag arrays, and an LLC controller. The tags are 29-bits long
and all tag entries across the ways in a set are searched in parallel.
D. Compression: Higher Effective Capacity
Several prior works have proposed using efficient and low-
latency algorithms to compress blocks, thereby storing more
blocks and improving the effective LLC capacity. LLC
compression techniques are typically implemented within
the LLC controller.
1) Efficient Data Compression Algorithms: The Base Delta
Immediate (BDI) and Frequent Pattern Compression (FPC) are
two state-of-the-art low-latency compression algorithms [8],
[23]. BDI uses the insight that data values tend to be similar
within a block and therefore can be compressed by repre-
senting them using small offsets. FPC uses the insight that
blocks contain frequent patterns like all-zeros, all-ones, etc.
FPC represents frequent patterns with fewer bits. Prior work
has shown that both BDI and FPC can be implemented to
execute with a single-cycle delay and can be implemented
within the LLC controller [8], [23].
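To make the base-plus-offset idea concrete, the following is a deliberately simplified sketch. It is not the full BDI algorithm of [8]: it assumes a single base (the first word), a single fixed delta width, and no immediate values. The function name `base_delta_size` and its parameters are hypothetical.

```python
def base_delta_size(block: bytes, word_bytes: int = 8, delta_bytes: int = 1):
    """Return the compressed size in bytes if every word lies within a
    signed delta of the first word; return None if the block does not fit."""
    words = [int.from_bytes(block[i:i + word_bytes], "little")
             for i in range(0, len(block), word_bytes)]
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)  # signed range of one delta
    if all(-limit <= w - base < limit for w in words):
        return word_bytes + delta_bytes * len(words)  # one base + one delta per word
    return None


# Eight nearby pointer-like values: one 8-byte base plus eight 1-byte deltas.
similar = b"".join((0x7FFF0000 + 8 * i).to_bytes(8, "little") for i in range(8))
print(base_delta_size(similar))  # 16
```

A 64-byte block of similar values shrinks to 16 bytes in this sketch, which matches the smallest compressed size used later in the paper.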
[Figure 4: block diagram of the LLC controller with a compression-decompression engine on the data path between the L2 cache / main memory and the data arrays, and a tag manager that records a "Compressed?" flag in the tag arrays.]
Fig. 4. The LLC compression-decompression engine. The compression-
decompression engine taps the data bus and stores compressibility information
in the tag entries.
2) Compression-Decompression Engine: As shown in Fig-
ure 4, the LLC controller implements a compression-
decompression engine that taps the bus going into the cache
data array. The LLC controller contains a separate “tag man-
ager” to manage tag entries. The compression-decompression
engine implements both BDI and FPC and chooses the best
algorithm. The tag manager maintains the compressibility
information in the tag entries.
3) Distribution of Compressed Data Size: Figure 5 shows
the distribution of the size of blocks after compression for
29 SPEC workloads. On average, 55% of the blocks can be
compressed to less than 48 bytes in size. Furthermore, 17%
of the lines can be compressed to be less than 16 bytes in
size. Therefore, several workloads tend to have blocks with
low entropy and can benefit from compression.
[Figure 5: distribution of compressed block sizes (0 to 100%) per workload (e.g., lbm, milc, gcc) and the mean; the legend text is not recoverable.]
Fig. 5. The distribution of block-size for 29 SPEC workloads (rate and mix
modes). On average, up to 55% of the blocks can be compressed to 48 Bytes.
E. LLC Compression: Tag Area Overheads
Modern computing systems tend to operate on 64-byte
blocks. Figure 6 (a) shows the design of the tag entry and
the cacheline in the data array for a 4MB 8-way LLC that
does not employ compression. The tag entry for each block
requires a valid bit and a dirty bit. Furthermore, we assume
that replacement policy is maintained at the cacheline-level
and the largest block in the selected cacheline is evicted.
To reduce the number of encoding bits in the tag array,
blocks are compressed into 16 byte, 32 byte, or 64 byte
boundaries. To reduce the number of bits in the tag entry
further, one can restrict cachelines to store blocks only from
contiguous addresses. Such a contiguous set of blocks is called
a superblock. Prior work has shown that superblocks with 4
compressed blocks can reduce the tag area overheads to 8
bits. As shown in Figure 6 (b), a 4MB 8-way LLC that stores
up to four blocks per cacheline will require 46 bits of tag
entry. While superblocks help reduce tag area overheads, they
limit the potential benefits of LLC compression as they restrict
block placement to include only neighboring addresses. If one
can store blocks from arbitrary addresses, we can unlock all
the benefits of LLC compression. However, the disadvantage
of this approach is that, as shown in Figure 6 (c), a 4MB
8-way LLC that stores up to four blocks per cacheline will
require 127 bits of tag entry (3.7x the baseline).
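The 1.35x and 3.7x figures can be checked from the tag entry widths quoted above. This is a small arithmetic sketch; the constant and function names are hypothetical, and the baseline entry is taken as 29 tag bits plus valid, dirty, and 3 replacement policy bits, as labeled in Figure 6 (a).

```python
BASELINE_ENTRY_BITS = 29 + 1 + 1 + 3   # tag + valid + dirty + replacement = 34
SUPERBLOCK_ENTRY_BITS = 46             # from the text, Figure 6 (b)
ARBITRARY_ENTRY_BITS = 127             # from the text, Figure 6 (c)


def relative_tag_area(entry_bits: int) -> float:
    """Tag entry size relative to the uncompressed baseline entry."""
    return entry_bits / BASELINE_ENTRY_BITS


print(round(relative_tag_area(SUPERBLOCK_ENTRY_BITS), 2))  # 1.35
print(round(relative_tag_area(ARBITRARY_ENTRY_BITS), 1))   # 3.7
```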
[Figure 6: tag entry and data-array cacheline layouts. (a) Baseline (No Compression): valid bit, dirty bit, replacement policy bits (3 bits), and 29 tag address bits, for a 34-bit tag entry and one 64-byte memory block. (b) Compression with Superblock Tag Addresses: labels 35-bits and 46-bits, with compressed memory blocks 1 to 4. (c) Compression with Arbitrary Tag Addresses: labels 116-bits and 127-bits, with compressed blocks from arbitrary addresses (1, 43, 233, 129).]
Fig. 6. The tag area overheads for different techniques. (a) The baseline
technique that does not employ any compression has no tag area overheads.
(b) The superblock technique increases the tag area to 1.35x. (c) Storing
arbitrary tags increases the tag area to 3.7x.
F. LLC Compression: Potential
Figure 7 shows the overheads and benefits of LLC com-
pression for three techniques. The baseline technique does not
employ compression, has no tag overheads and has an average
hit-rate of 31.5%. The second technique employs superblocks
for compression, has a tag area of 1.35x and increases the
average hit-rate to 36%. The third technique highlights the
potential hit-rate with compression when each cacheline can
store up to 4 compressed blocks. Unfortunately, although the third
technique increases the average LLC hit-rate to 38.5%, it uses a
tag area of 3.7x.
[Figure 7: hit rate (%), axis from 20 to 60, for RATE, MIX, and AVERAGE; series: Baseline, Superblock (1.35x Tag Area), Ideal (3.7x Tag Area in Practice).]
Fig. 7. The potential of LLC compression. Enabling blocks from arbitrary
addresses increases the average hit-rate of the LLC from 31.5% to 38.5%.
III. THE TOUCHÉ FRAMEWORK
A. An Overview
Figure 8 shows an overview of the Touché framework.
Touché consists of three components. The first component,
called the Signature (SIGN) Engine, generates shortened sig-
natures of the tag addresses. The SIGN engine is designed
within the tag manager. The second component, called the
Tag Appended Data (TADA) mechanism, attaches full tag ad-
dresses to compressed memory blocks. The TADA mechanism
taps the data bus after the compression-decompression engine
and obtains the full tag address from the tag manager. The third
component, called Superblock Marker (SMARK) mechanism,
keeps track of superblocks by using a unique 16-bit marker
in the tag entry. The SMARK mechanism is implemented in
the tag manager. Touché requires changes only in the LLC
controller.
[Figure 8: block diagram of the LLC controller with Touché. The compression-decompression engine sits between the L2 cache / main memory and the data arrays; the tag manager, containing the SIGN engine and SMARK, manages the tag arrays; TADA receives the full tag address from the tag manager.]
Fig. 8. An overview of Touché. Touché consists of three components:
the Signature (SIGN) Engine, the Tag Appended Data (TADA) mechanism,
and the Superblock Marker (SMARK) mechanism. All components are
implemented in the LLC controller with no changes to the LLC.
B. Signature (SIGN) Engine
The Signature (SIGN) Engine is implemented in the tag
manager. The SIGN Engine generates shortened signatures
from the full tag addresses supplied during the read and write
accesses to the LLC.
1) Identifying Compressed blocks: On an LLC write, the
compression-decompression engine informs the tag manager
if the block is compressible; a block can be compressed to
16B, 32B or 48B. The tag manager uses the original valid bit
and the dirty bit in its tag entry to encode this information.
We use the insight that, for uncompressed blocks, the valid bit
and the dirty bit can only exist in three states. For instance,
a cacheline cannot be marked both invalid and dirty at the
same time. The tag manager uses this unused state to flag
cachelines that contain compressed blocks. Thereafter, for a
cacheline that stores compressed blocks, the 1st and 2nd bits
of the tag address encode its valid bit and dirty bit.
As shown in Table I, on a read, the tag manager checks the
original dirty bit and the valid bit in the tag entry to identify if
the cacheline contains compressed blocks. If the cacheline is
deemed to contain compressed blocks, the tag manager reads
the 1st and 2nd bits from the tag address to determine whether
the cacheline contains blocks that are valid, dirty, or both.
TABLE I
IDENTIFYING COMPRESSED BLOCKS
Cacheline Status              | Valid Bit | Dirty Bit | Tag Address 1st Bit | Tag Address 2nd Bit
Invalid                       | 0         | 0         | N/A                 | N/A
Uncompressed: Valid           | 1         | 0         | N/A                 | N/A
Uncompressed: Valid and Dirty | 1         | 1         | N/A                 | N/A
Compressed: Valid             | 0         | 1         | 1                   | 0
Compressed: Valid and Dirty   | 0         | 1         | 1                   | 1
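The encoding in Table I can be read as a small decoder. The sketch below is a software illustration of the table, not the hardware logic; the function name `cacheline_status` is hypothetical, and the case where the first tag address bit is 0 for a compressed cacheline does not appear in Table I, so its label here is an assumption.

```python
def cacheline_status(valid: int, dirty: int, tag_bit1: int = 0, tag_bit2: int = 0) -> str:
    """Decode a tag entry's state following Table I."""
    if (valid, dirty) == (0, 0):
        return "invalid"
    if (valid, dirty) == (1, 0):
        return "uncompressed: valid"
    if (valid, dirty) == (1, 1):
        return "uncompressed: valid and dirty"
    # (valid, dirty) == (0, 1): the otherwise unused "invalid but dirty"
    # state flags a cacheline holding compressed blocks.
    if tag_bit1 == 0:
        return "compressed: no valid block"  # assumption, not listed in Table I
    return "compressed: valid and dirty" if tag_bit2 else "compressed: valid"


print(cacheline_status(0, 1, tag_bit1=1, tag_bit2=1))  # compressed: valid and dirty
```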
2) Using Shortened Signatures: To store multiple signa-
tures within a single tag entry, the SIGN engine shortens the
full tag address into a 9-bit signature. For a 4MB 8-way LLC,
the full tag address is 29-bits long. For a compressed block,
as the top 2 bits of the tag address space in its tag entry are
already used for valid and dirty bits, we have 27 unused bits
remaining in the tag address space of its tag entry. Therefore,
we can store up to three 9-bit signatures corresponding to three
compressed blocks.
Figure 9 shows the design of the signature generator in the
SIGN engine. The signature generator uses the least-significant 27 bits
of the full tag address and divides them into three 9-bit segments.
These three 9-bit segments are then XORed together bitwise to
generate a 9-bit output. The 9-bit output is then partitioned into
a 4-bit segment containing its lowest bits and a 5-bit segment
that contains its highest bits. These 4-bit and 5-bit partitions
then index into a 16-entry lookup table and a 32-entry lookup
table respectively. The entries in each lookup table are populated at
boot-time with unique numbers. The indexed 4-bit and 5-bit
numbers from the lookup tables are then appended together
to form a 9-bit signature. The overall latency of generating
signatures is the delay of one 3-input XOR gate and one parallel
table lookup. For a high-performance processor executing at
3.2GHz, we estimate the signature generation to incur only
1 cycle. Furthermore, the latency of signature generation is
masked by the latency of reading the tag entries for each LLC
access (up to 5 cycles).
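The signature generation steps above can be sketched in software as follows. This is a minimal sketch under stated assumptions: "unique numbers" is interpreted as a random permutation per table, the seed stands in for boot-time initialization, and the 5-bit lookup result is placed above the 4-bit result when the two are appended. The names `make_lookup_tables` and `signature` are hypothetical.

```python
import random


def make_lookup_tables(seed: int = 0):
    """Build a 16-entry and a 32-entry table of unique values
    (assumption: random permutations stand in for boot-time contents)."""
    rng = random.Random(seed)
    low_table = list(range(16))
    high_table = list(range(32))
    rng.shuffle(low_table)
    rng.shuffle(high_table)
    return low_table, high_table


def signature(tag: int, low_table, high_table) -> int:
    """Fold the low 27 tag bits into a 9-bit signature."""
    tag27 = tag & ((1 << 27) - 1)
    # XOR the three 9-bit segments together, bit by bit.
    folded = (tag27 ^ (tag27 >> 9) ^ (tag27 >> 18)) & 0x1FF
    low4 = folded & 0xF    # lowest 4 bits index the 16-entry table
    high5 = folded >> 4    # highest 5 bits index the 32-entry table
    # Assumed concatenation order: 5-bit result above 4-bit result.
    return (high_table[high5] << 4) | low_table[low4]


LOW_TABLE, HIGH_TABLE = make_lookup_tables()
print(signature(0x1234567, LOW_TABLE, HIGH_TABLE))
```

Because each table holds unique values, the mapping from the folded 9 bits to the signature is one-to-one in this sketch, so the lookup step scrambles the XOR result without adding collisions.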
[Figure 9: the signature generator. The least-significant 27 bits of the tag are split into three 9-bit segments and XORed into 9 bits, which are split into 4 bits and 5 bits that index two lookup tables; the remaining labels are not recoverable.]
Fig. 9. The signature generator within the SIGN Engine. The signature
generator only requires one XOR operation and two parallel table lookups
for each LLC access.
3) Checking for Matching Signatures: On an LLC read,
the tag manager reads the tag entries from all the ways of the
indexed set. At the same time, the SIGN engine forwards its
9-bit signature to the tag manager. The tag manager identifies
if the cacheline contains compressed blocks using the original
valid and dirty bits. For an uncompressed cacheline, the tag
manager ignores the signature and uses the full tag address to
check for a match.
If the cacheline contains compressed blocks, the tag manager
ignores the first two bits of the tag address as they are
valid and dirty bits. Thereafter, the remaining 27 bits in the tag
address space of the tag entry are partitioned into three 9-bit
entries. The tag manager then compares each of these three
9-bit entries with the 9-bit signature from the SIGN Engine.
If no 9-bit entry matches the 9-bit signature, then the
block is guaranteed to be absent. On the other hand, if the
9-bit signature matches in any one of the ways, then the block
is likely to be present. As a 9-bit signature is smaller than its
full 29-bit tag address, there is a small chance of 0.19% ($1/512$)
that each 9-bit entry comparison with the 9-bit signature can
result in a false positive match. We call these false positive
matches of signatures “signature collisions”.
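The comparison described above can be sketched as follows. The bit layout is an assumption for illustration: the top 2 bits of the 29-bit tag field hold the valid and dirty bits, and the three signatures occupy the low 27 bits with slot 0 in the lowest 9 bits. The names `stored_signatures` and `matching_slots` are hypothetical.

```python
SIG_BITS = 9
SIG_MASK = (1 << SIG_BITS) - 1


def stored_signatures(tag_field: int):
    """Split the low 27 bits of a 29-bit tag field into three 9-bit entries."""
    return [(tag_field >> (SIG_BITS * slot)) & SIG_MASK for slot in range(3)]


def matching_slots(tag_field: int, probe_signature: int):
    """Slots whose stored signature equals the probe. An empty list means the
    block is absent; a non-empty list means it is only likely present."""
    return [slot for slot, sig in enumerate(stored_signatures(tag_field))
            if sig == probe_signature]


# Top two bits (0b10) stand for the per-cacheline valid/dirty bits.
tag_field = (0b10 << 27) | (0x1A3 << 18) | (0x055 << 9) | 0x0F0
print(matching_slots(tag_field, 0x055))  # [1]
print(matching_slots(tag_field, 0x100))  # []
```

A non-empty result still has to be confirmed against the full tag address stored in the data array, since a match may be a signature collision.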
4) Collision Rate of Signatures: As each tag entry can store
up to three 9-bit signatures, an 8-way LLC would require up
to twenty-four 9-bit signature comparisons for each access.
As signatures are shorter than full tags, several tags may map
into the same signature. As we perform twenty-four signature
checks (in the worst-case), it is likely that some LLC
accesses will result in signature collisions. Figure 10 shows the
probability of collisions as the number of signatures present
in the 8 ways varies from zero (all ways are uncompressed)
to twenty-four (all ways have three compressed blocks) for
different LLC hit-rates.
[Figure 10: probability of signature collision per access versus number of 9-bit signature entries in a set, for 0%, 25%, 50%, 75%, and 100% hit rates; worst case = 4.58%.]
Fig. 10. The probability of collision for a 9-bit signature as the number of
signature entries in a set varies. In the worst case, for an 8-way LLC, we expect
a signature collision on 4.58% of accesses.
In the worst case, we can expect a collision 4.58% of the
time and this occurs for a workload that has 0% hit-rate. As
signature collisions can cause the LLC to forward blocks with
incorrect tag addresses to the processing cores, it is essential
to check full tags.
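The worst-case figure can be reproduced with a short calculation. The sketch assumes that signatures are uniformly distributed and independent and that none of the stored signatures belongs to the requested block (the 0% hit-rate case); it does not model the other hit-rate curves of Figure 10. The function name `collision_probability` is hypothetical.

```python
def collision_probability(num_signatures: int, sig_bits: int = 9) -> float:
    """Probability that at least one of `num_signatures` unrelated stored
    signatures equals the probe signature."""
    p_single = 1.0 / (1 << sig_bits)  # 1/512 for 9-bit signatures
    return 1.0 - (1.0 - p_single) ** num_signatures


for n in (0, 8, 16, 24):
    print(n, f"{100 * collision_probability(n):.2f}%")
```

For twenty-four signatures this evaluates to about 4.58%, which agrees with the worst case quoted in the text.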
C. Tag Appended Data (TADA) Mechanism
The Tag Appended Data (TADA) mechanism is imple-
mented in the LLC controller and taps the data-bus between
the compression-decompression engine and the data array.
1) Appending Full Tag Addresses to Data: During an LLC
write, the TADA mechanism uses the full tags that are supplied
by the tag manager. The TADA mechanism then appends the
full tag addresses (29 bits), the valid-bit, the dirty-bit, and the
compressibility information (3 bits) to the end of the cacheline
(total 34 bits or 4.25 Bytes). Figure 11 shows a cacheline
storing three compressed blocks and the TADA mechanism
appending the information for each of these blocks at the end
of the cacheline.
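One way to picture the 34-bit TADA record is as a packed bit field. The field order below (compressibility, valid, dirty, tag, from most to least significant) is an assumption, as is the meaning of each 3-bit compressibility code; the paper only gives the field widths. The names `pack_tada` and `unpack_tada` are hypothetical.

```python
TAG_BITS = 29
COMP_BITS = 3
TADA_BITS = COMP_BITS + 1 + 1 + TAG_BITS  # 34 bits per compressed block


def pack_tada(tag: int, valid: int, dirty: int, comp: int) -> int:
    """Pack metadata into 34 bits: [comp(3) | valid(1) | dirty(1) | tag(29)]."""
    assert 0 <= tag < (1 << TAG_BITS) and 0 <= comp < (1 << COMP_BITS)
    assert valid in (0, 1) and dirty in (0, 1)
    return (comp << 31) | (valid << 30) | (dirty << 29) | tag


def unpack_tada(word: int):
    """Inverse of pack_tada; returns (tag, valid, dirty, comp)."""
    tag = word & ((1 << TAG_BITS) - 1)
    dirty = (word >> 29) & 1
    valid = (word >> 30) & 1
    comp = (word >> 31) & ((1 << COMP_BITS) - 1)
    return tag, valid, dirty, comp


word = pack_tada(tag=0x1ABCDEF, valid=1, dirty=0, comp=2)
print(unpack_tada(word) == (0x1ABCDEF, 1, 0, 2))  # True
```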
[Figure 11: a tag entry holding 9-bit signatures and replacement policy bits, next to a data-array cacheline holding compressed blocks with their appended metadata; the remaining labels are not recoverable.]
Fig. 11. A cacheline storing compressed blocks with TADA mechanism.
The TADA mechanism appends full tag addresses, valid bit, dirty bit, and
compressibility information for each block at the end of the cacheline.
2) Appending Full Tag Addresses to Data: The TADA
mechanism appends 34 bits (4.25 Bytes) of information to the
end of the cacheline containing compressed blocks. As a result,
TADA reduces the space available to store the compressed
block. We can store three 16B compressed blocks or a pair
of a 32B and a 16B compressed blocks in the data array; the
block size is stored in the compressibility information field.
Fortunately, this additional loss of space only causes a few
lines to reduce their effective capacity. Figure 12 shows the
reduction in effective LLC capacity due to TADA for an LLC
that can store up to 3 arbitrary compressed blocks. TADA
decreases the effective cache capacity by only 4.15 percentage points
as compared to an ideal scheme that can store three arbitrary
compressed blocks without any storage overheads.
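The space constraint can be checked by counting bits. The sketch assumes a 64-byte cacheline, the three compressed sizes named earlier (16B, 32B, 48B), and 34 bits of TADA metadata per block with no padding; how the hardware actually aligns the metadata is not specified. The names `fits` and `feasible_combinations` are hypothetical.

```python
from itertools import combinations_with_replacement

LINE_BITS = 64 * 8          # one physical cacheline
TADA_BITS = 34              # metadata appended per compressed block
BLOCK_SIZES = (16, 32, 48)  # compressed sizes in bytes


def fits(block_sizes_bytes) -> bool:
    """True if the blocks plus their TADA metadata fit in one cacheline."""
    return sum(8 * size + TADA_BITS for size in block_sizes_bytes) <= LINE_BITS


def feasible_combinations(max_blocks: int = 3):
    """All multisets of up to `max_blocks` compressed blocks that fit."""
    return [combo
            for count in range(1, max_blocks + 1)
            for combo in combinations_with_replacement(BLOCK_SIZES, count)
            if fits(combo)]


print(feasible_combinations())
```

Under these assumptions, the only multi-block combinations that fit are two or three 16B blocks and a 32B block paired with a 16B block, consistent with the text; a 48B block must be stored alone.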
3) Detecting Collisions of Signatures: TADA helps detect
signature collisions. This is because, on a signature collision,
the cachelines from the selected way(s) in the data array are
read by the tag manager. The TADA mechanism extracts the
full tag address from the cachelines and checks if they match
the full tag address of the LLC access. If there is no match,
TADA flags a signature collision. Therefore, TADA guarantees
the detection of signature collisions and thereby ensures
correctness. Furthermore, TADA extracts the compressibility
information and supplies it to the decompression engine. The
valid and dirty bits of compressed blocks are also stored using
TADA. Therefore, TADA helps to avoid using any additional
bits in the tag entry to store additional information.
[Figure 12: effective LLC capacity for Baseline, TADA based Block Storage (3 Compressed Blocks), and a third series whose label is not recoverable; the annotated gap is 4.15 percentage points.]
Fig. 12. The reduction in the effective LLC capacity by the TADA storage
overhead. While TADA uses 4.25 Bytes per compressed block, it decreases
the effective LLC capacity only by 4.15 percentage points as compared to an ideal
scheme that does not require the metadata storage in the data array.
D. Latency Overheads
As the data array needs to be accessed during a signature
collision, it can increase the LLC access latency.
1) Additional Accesses to Data Arrays: In the baseline
system, an LLC access probes all the ways in the indexed
set from the tag array. The cacheline from the data array is
read only in case of an LLC hit. As a tag access occurs for
every access, irrespective of whether the access is a hit or a
miss, the tag array is designed with lower access latency as
compared to the data array. Typically, accessing the LLC tag
array incurs a latency overhead of only 5 cycles. On the other
hand, reading the LLC data array incurs an overhead of 30
cycles in modern processors [24].
In the Touché framework, the data arrays may be
accessed even in case of an LLC miss. This is because the
SIGN engine may incur signature collisions and may invoke
the TADA mechanism to access the data array to detect
signature collisions. In the worst-case, for a workload with
0% hit-rate, this scenario occurs on 4.58% of accesses.
Therefore, signature collisions will increase the overall latency
of LLC access. Table II shows the average latency of an LLC
access during a collision.
TABLE II
ADDITIONAL DATA ARRAYS ACCESSED ON A COLLISION
Number of Data Arrays Accessed | Probability | Latency (cycles)
1                              | 0.9768      | 35
2                              | 0.0229      | 70
3+                             | 0.0003      | 105+
Average: 1.0235                | 1           | 35.82
In the worst-case, all accesses can be a cache miss and a
collision can occur 4.58% of the times. As shown in Table II,
collisions increase the access latency to 35.82 cycles. For a
worst-case workload with a 0% hit-rate, the increase in the
LLC tag access latency is denoted by Equation 1.
New Tag Access Latency = (1− 0.0458)× Old Latency+
0.0458× Collision Latency
(1)
As the old tag access latency is 5 cycles and the collision
latency is 35.82 cycles, the new tag access latency of Touché
is 0.9542 × 5 + 0.0458 × 35.82 ≈ 6.4 cycles.
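The arithmetic behind Table II and Equation 1 can be checked with a short sketch. It uses only the numbers quoted in the text. The function names are illustrative, and treating the "3+" row as exactly three data-array accesses is an assumption made here for the calculation.

```python
# Worked arithmetic for Table II and Equation 1, using only numbers from the text.
TAG_LATENCY = 5           # cycles, LLC tag array access
DATA_LATENCY = 30         # cycles, LLC data array access
COLLISION_RATE = 0.0458   # worst case: fraction of accesses with a signature collision

# Table II rows: (data arrays accessed, probability).
# Assumption: the "3+" row is treated as exactly 3 accesses.
TABLE_II = [(1, 0.9768), (2, 0.0229), (3, 0.0003)]


def collision_latency(rows=TABLE_II):
    """Expected latency (cycles) of an access that suffers a signature collision."""
    per_array = TAG_LATENCY + DATA_LATENCY  # 35 cycles per data array, as in Table II
    return sum(prob * count * per_array for count, prob in rows)


def new_tag_latency(old=TAG_LATENCY, rate=COLLISION_RATE):
    """Equation 1: weighted blend of the collision-free and collision latencies."""
    return (1 - rate) * old + rate * collision_latency()


if __name__ == "__main__":
    print(f"collision latency: {collision_latency():.2f} cycles")
    print(f"new tag latency:   {new_tag_latency():.1f} cycles")
```

Run as a script, this reproduces the 35.82-cycle collision latency and the 6.4-cycle tag access latency stated above.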
2) Mitigating Latency Overheads: Dynamic Touché: One can
mitigate signature-collision latency overheads by compressing
only when it is useful. To this end, Touché continuously
monitors the average memory latency at the LLC controller.
The average memory latency is defined as the total latency
experienced by each request, which can come from both
the LLC and main memory.
Touché enables compression only when the average memory
access latency is 100x greater than the latency overhead of
signature collisions. As signature collisions increase the LLC
tag access latency by 1.4 cycles, Touché enables compression
only when the average memory latency is greater than 140
cycles. This has two key advantages. First, Touché is enabled
for workloads that exhibit a large memory latency and
benefit from LLC compression. Second, the latency overhead
of Touché is capped at 1%, since 1.4 cycles is at most 1% of
any latency above 140 cycles. As shown in Figure 13, for
memory intensive benchmarks, the average memory latency
for reads is 541 cycles (significantly higher than 140 cycles).
Therefore, the latency overhead of Touché is only 0.26%.
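The enable rule of Dynamic Touché reduces to a single comparison. The sketch below is a minimal illustration under stated assumptions: the function and constant names are hypothetical, and how the controller measures the average latency is not modeled.

```python
# Minimal sketch of the Dynamic Touche enable rule (names are hypothetical).
COLLISION_OVERHEAD = 1.4   # cycles added to the LLC tag access (6.4 - 5)
ENABLE_RATIO = 100         # compress only if memory latency exceeds 100x the overhead


def compression_enabled(avg_memory_latency,
                        overhead=COLLISION_OVERHEAD,
                        ratio=ENABLE_RATIO):
    """Return True when the monitored latency is above the 140-cycle threshold."""
    return avg_memory_latency > ratio * overhead


def relative_overhead(avg_memory_latency, overhead=COLLISION_OVERHEAD):
    """Collision overhead as a fraction of the average memory latency."""
    return overhead / avg_memory_latency


if __name__ == "__main__":
    latency = 541  # average read latency reported for memory intensive benchmarks
    print(compression_enabled(latency))
    print(f"{100 * relative_overhead(latency):.2f}%")
```

Because compression is enabled only above the threshold, the relative overhead stays below 1/100 whenever the rule fires; at 541 cycles it is about 0.26%.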
[Figure 13 (bar chart): average memory read latency in cycles per benchmark and the arithmetic mean (Amean), with a reference line at 140 cycles.]
Fig. 13. The average memory latency for reads. Across benchmarks, the
average memory access latency is 541 cycles. Therefore, Touché has a latency
overhead of only 0.26%.
E. Superblock Marker (SMARK) Mechanism
The SIGN Engine enables storage of up to three blocks.
However, some cachelines may contain superblocks (four
compressed blocks from neighboring addresses).
1) Benefits of Including Superblocks: Figure 14 shows the
hit-rate of Touché while maintaining up to three compressed
blocks and compares this against a scheme that also stores
superblocks (up to four blocks). For a superblock, Touché
tries to compress each block to 15 bytes. This enables Touché
to get the benefits of storing both the superblock-tags and
the arbitrary tags. If we can store superblocks and arbitrary
blocks at the same time, we can increase the average hit-rate
of Touché from 31.5% to 37.5%.
[Figure 14 (bar chart): average hit-rate (%), y-axis from 20 to 50, for two configurations: "Touché: 3 Arbitrary Blocks" and "Touché: 3 Arbitrary Blocks + Superblocks".]
Fig. 14. The average hit-rate of Touché with 3 blocks versus 3 blocks
with superblocks. On average, the hit-rate increases to 37.5% by combining
superblocks.
2) Identifying Potential Cachelines: During an LLC install,
if the block is compressible and if the candidate cacheline
already contains compressed blocks from its neighboring ad-
dresses, then this cacheline is also a superblock candidate. The
TADA mechanism is used to identify superblock candidates by
extracting the full tag addresses for all the blocks in a cacheline
during an LLC install.
3) Generating Markers: Touché implements a “Superblock
Marker” (SMARK) mechanism in the tag manager. The SMARK
mechanism generates a random 16-bit marker at boot time and
uses this marker for as long as the system is running.
Once the TADA mechanism identifies a superblock cacheline,
it informs the tag manager. The tag manager then
retrieves the 16-bit marker from the SMARK mechanism.
It then instructs the SIGN engine to ignore the last 2 bits
(corresponding to four neighboring addresses) of the full tag
address when generating the 9-bit signature. This ensures that
neighboring addresses in the superblock generate the same
9-bit signature. Thereafter, the tag manager appends the
9-bit signature to the 16-bit marker and writes the 29-bit full
tag of the first block within the superblock at the end of
the compressed blocks in the data array. Figure 15 shows the
superblock-tag generation.
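The marker and signature steps can be sketched as follows. This is an illustration under loud assumptions: the XOR-fold hash stands in for the SIGN engine's signature function, which this section does not define, and the bit layout (marker in the upper 16 bits, signature in the lower 9) is one plausible reading of "appends the 9-bit signature to the 16-bit marker". All names are hypothetical.

```python
import secrets

SIG_BITS = 9       # signature width stated in the text
MARKER_BITS = 16   # marker width stated in the text


def fold_signature(value, bits=SIG_BITS):
    """XOR-fold an integer into `bits` bits (stand-in hash, not the paper's SIGN function)."""
    mask = (1 << bits) - 1
    sig = 0
    while value:
        sig ^= value & mask
        value >>= bits
    return sig


def superblock_signature(full_tag):
    """Drop the last 2 tag bits so four neighboring addresses share one signature."""
    return fold_signature(full_tag >> 2)


def make_marker():
    """Draw a random 16-bit marker once, e.g. at boot time."""
    return secrets.randbits(MARKER_BITS)


def superblock_tag_entry(full_tag, marker):
    """Assumed layout: 16-bit marker in the upper bits, 9-bit signature below it."""
    return (marker << SIG_BITS) | superblock_signature(full_tag)


if __name__ == "__main__":
    marker = make_marker()
    base = 0x1234564  # example tag whose last 2 bits are zero
    entries = {superblock_tag_entry(base | k, marker) for k in range(4)}
    print(len(entries))  # the four neighbors map to a single tag entry
```

The point of the sketch is the shift by two bits: any hash applied after that shift gives the four neighboring blocks the same signature, which is what lets one tag entry cover the whole superblock.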
[Figure 15 (diagram): a tag entry in the tag array and a cacheline in the data array, with the SMARK and SIGN blocks feeding the tag manager.]
Fig. 15. The Superblock Marker (SMARK) mechanism. The SMARK
mechanism generates a unique 16-bit marker to identify superblocks. It then
appends this marker with the 9-bit signature from the SIGN engine.
4) Checking for Matching Markers: On a read, the tag
manager checks for matching 16-bit marker values in all
the ways that store compressed blocks within a set. If there is
a marker match, then the tag manager uses the 9-bit signature
(generated by ignoring the two least significant bits) and
checks for a match.
[Figure 16 (flowcharts): (a) install requests and (b) access requests. Recoverable node labels include "Process Request", "like Baseline", "Superblock", and "Write Superblock Tag".]
Fig. 16. The flowchart detailing the high-level operations of Touché for install and access requests. (a) Flowchart for install requests. (b) Flowchart
for access requests.
If the signature matches, then the cacheline is read from the
data array. The TADA mechanism extracts the full tag address
and checks if the tag address of the LLC access corresponds to
one of the blocks in the superblock. If there is a match, the
block is processed by the LLC controller. However, the cacheline
may not contain the requested block, in which case the match is
simply a false positive. Similar to signature collisions, we call
this scenario a marker collision.
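The read path for superblocks (marker match, then signature match, then full-tag check through TADA) can be sketched as below. This is a simplified model with several assumptions: the `Way` record, the XOR-fold hash, and the function names are hypothetical, and each data-array read is merely counted rather than timed.

```python
from dataclasses import dataclass, field

SIG_BITS = 9


def fold_signature(value, bits=SIG_BITS):
    """XOR-fold an integer into `bits` bits (stand-in for the SIGN engine's hash)."""
    mask = (1 << bits) - 1
    sig = 0
    while value:
        sig ^= value & mask
        value >>= bits
    return sig


@dataclass
class Way:
    marker: int                      # 16-bit value held in the tag entry
    signature: int                   # 9-bit superblock signature in the tag entry
    full_tags: list = field(default_factory=list)  # TADA: full tags stored with the data


def lookup(ways, tag, marker):
    """Return ("hit" | "miss", number of data-array reads) for one access."""
    sig = fold_signature(tag >> 2)   # ignore the two least significant bits
    data_reads = 0
    for way in ways:
        if way.marker != marker or way.signature != sig:
            continue                 # filtered out using the tag array alone
        data_reads += 1              # read the cacheline to fetch the full tags
        if tag in way.full_tags:
            return "hit", data_reads
    return "miss", data_reads        # data_reads > 0 here means a false positive


if __name__ == "__main__":
    marker = 0xBEEF
    ways = [Way(marker, fold_signature(0x400 >> 2), [0x400, 0x401, 0x402, 0x403])]
    print(lookup(ways, 0x402, marker))  # block present in the superblock
    print(lookup(ways, 0xC04, marker))  # same signature, different block
```

In the second lookup the signature matches but the full-tag comparison fails, so the access costs one data-array read and is still reported as a miss; correctness is preserved and only latency is affected.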
5) Effect of Marker Collisions: Marker collisions are
extremely rare because the markers are 16 bits long.
For instance, in an 8-way cache, the probability of
a marker collision for each access is only 0.012% (about
8/2^16), and the impact on LLC latency is negligible.
Furthermore, even in the case of a marker collision, the TADA
mechanism ensures that the full tag address is checked before
forwarding the compressed block. Therefore, SMARK works with
TADA to guarantee correctness while storing superblocks.
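The 0.012% figure is consistent with a simple probability model. The sketch below assumes that the 16 tag-entry bits compared against the marker are uniformly random and independent across the 8 ways; this assumption is made here for illustration and is not stated in the text.

```python
def marker_collision_probability(ways=8, marker_bits=16):
    """Probability that at least one way matches a random marker by chance.

    Assumption: the compared bits in each way are uniform and independent.
    """
    p_single = 2.0 ** -marker_bits          # chance that one way matches
    return 1.0 - (1.0 - p_single) ** ways   # chance that any of the ways matches


if __name__ == "__main__":
    p = marker_collision_probability()
    print(f"{100 * p:.3f}%")
```

For 8 ways this is very close to 8/65536, which rounds to the 0.012% quoted above.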
F. Touché Operation: Reads and Writes
Figure 16 (a) and Figure 16 (b) show the flowcharts of
Touché for install requests and LLC accesses, respectively.
Touché invokes the SIGN, TADA, and SMARK mechanisms
only for compressed data. For incompressible data, Touché
works just like the baseline. Furthermore, the TADA mechanism
is always invoked for LLC hits on compressed blocks. This
enables Touché to guarantee correctness.
G. Discussion: Coherence and Replacement
In the baseline LLC, the tag entry contains metadata such
as the replacement policy bits and coherence states (for private
LLCs). We discuss how these affect the design of Touché.
1) Handling Cache Coherence: Touché assumes a shared
LLC and therefore does not encounter coherence concerns.
However, if the LLC is private, Touché would need to
maintain coherence states with minimal overheads. Touché
stores the coherence state as well as the full tag for each
compressed block in the data array. Thus, the LLC controller
needs to access the data array for tag matching and for checking
the coherence state. This operation would likely increase the
latency of tag matching for a coherence request.
However, such an operation would likely incur low performance
overheads. This is because most coherence requests tend to be
handled off the critical path, and the
coherence state can be updated after the critical requests are
serviced. In addition, if the coherence request is a “BusRd”,
which is a read request made by another core, the current
core might need to send the entire block to the requesting
core anyway. In this case, the additional access to the data
array does not add any overhead.
Furthermore, we can eliminate the performance impact
of snooping-based coherence protocols by simply using a
directory-based coherence protocol, as implemented in
commercial processors [25].
2) Handling Cache Replacement Policy: Each tag entry
in the baseline system is already equipped with replacement
information bits. As Touché stores multiple compressed blocks
per cacheline, ideally, it would be preferable to equip each
of these blocks with its own replacement bits in the tag
entry. However, this would require us to add 3 to 4 bits per
compressed block in the tag entry.
To minimize the overhead of storing the replacement
information, whenever a cacheline is accessed, Touché only
updates the original replacement bits. Touché does not keep
track of individual replacement bits for each block. During
replacement, Touché selects the victim cacheline based on the
original replacement bits and randomly evicts one block from
within the victim cacheline.
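The two-level victim selection described above can be sketched as follows. This is a toy model under stated assumptions: LRU is used as the cacheline-level policy purely for illustration, and the class and method names are hypothetical.

```python
import random


class CompressedSet:
    """Toy model of one LLC set: per-cacheline LRU state, random eviction within a line."""

    def __init__(self, num_ways, rng=None):
        self.ways = [[] for _ in range(num_ways)]  # compressed block tags per cacheline
        self.lru_order = list(range(num_ways))     # front = least recently used way
        self.rng = rng or random.Random()

    def touch(self, way):
        """Update only the cacheline-level replacement state on an access."""
        self.lru_order.remove(way)
        self.lru_order.append(way)

    def evict_block(self):
        """Pick the victim cacheline by LRU, then evict one of its blocks at random."""
        victim_way = self.lru_order[0]
        blocks = self.ways[victim_way]
        if not blocks:
            return victim_way, None
        victim = blocks.pop(self.rng.randrange(len(blocks)))
        return victim_way, victim


if __name__ == "__main__":
    s = CompressedSet(num_ways=2, rng=random.Random(0))
    s.ways[0] = [10, 11, 12]
    s.ways[1] = [20]
    s.touch(0)              # way 0 becomes most recently used
    print(s.evict_block())  # evicts from way 1, the LRU cacheline
```

No per-block replacement state is kept, which matches the text's goal of avoiding 3 to 4 extra bits per compressed block in the tag entry.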
IV. EXPERIMENTAL METHODOLOGY
To evaluate the performance benefits of Touché, we developed
a trace-based simulator based on USIMM [26],
a detailed memory system simulator. We extended
USIMM to model the processor core and a detailed
cache hierarchy. Our processor model supports
out-of-order (OoO) execution. Our detailed cache model supports
various replacement policies such as LRU, DRRIP [27], and
DIP [28]. The baseline system configuration is described in
Table III. To enable efficient compression, the compression
engine in the cache model employs the BDI [8],
[29] and FPC [23] compression algorithms and uses the one
with the best compression ratio for each cacheline. As per
prior work on BDI and FPC, we assume that compression and
decompression of data incur only a single-cycle latency. We
compare Touché to the previous state-of-the-art scheme called
YACC that uses only “superblocks” [16]. We also compare our
scheme against an “Ideal” scheme that can store either three
arbitrary blocks or a superblock (four neighboring blocks)
without any area overheads. The Ideal scheme uses the same
replacement policy as Touché (described in Section III-G2).
[Figure 17 (bar chart): speedup per workload over the uncompressed baseline. Off-scale bar labels: 1.88, 1.95, 1.91.]
Fig. 17. Speedup of Touché as compared to a baseline system that does not employ compression. On average, Touché achieves a speedup of 12% (Ideal –
13%, YACC – 10.3%) by enabling compressed blocks from arbitrary addresses to be placed next to each other while also allowing superblocks to be stored.
TABLE III
BASELINE SYSTEM CONFIGURATION
Number of cores (OoO)                  4
Processor clock speed                  3.2 GHz
Issue width                            8
L1 Cache (Private)                     32KB, 8-Way, 64B lines, 4 cycles
L2 Cache (Private)                     256KB, 8-Way, 64B lines, 12 cycles
Last Level Cache (Shared)              4MB, 8-Way, 64B lines
LLC Tag Access Latency                 5 cycles
LLC Data Access Latency                30 cycles
Memory bus frequency                   1600MHz (DDR 3200MHz) [30]
Memory channels                        2
Ranks per channel                      1
Bank Groups                            4
Banks per Bank Group                   4
Rows per bank                          64K
Columns (cache lines) per row          128
DRAM Access Timings: TRCD-TRP-TCAS     22-22-22 [31]
DRAM Refresh Timings: TRFC             420ns [32], [33]
We chose memory intensive benchmarks, which have
greater than 1 MPKI (LLC Misses Per 1000 Instructions),
from the SPEC CPU2006 benchmarks. We warm up the caches
for 2 billion instructions and execute 4 billion instructions.
To ensure adequate representation of regions of compressibility
[34] and performance [35], the 4 billion instructions are
collected by sampling 400 million instructions per 1 billion
instructions over a 40 billion instruction window. We execute
all benchmarks in rate mode. We also create twelve 4-threaded
mixed workloads by forming two categories of SPEC2006
benchmarks: low MPKI and high MPKI. As described in
Table IV, we randomly pick one benchmark from each category
to form high MPKI mixed workloads and medium MPKI
mixed workloads. We perform timing simulation until all the
benchmarks in the workload finish execution.
TABLE IV
WORKLOAD MIXES
mix1    mcf, libquantum, GemsFDTD, wrf
mix2    lbm, gcc, bzip2, bwaves
mix3    milc, sphinx, leslie3d, zeusmp
mix4    soplex, omnetpp, cactusADM, dealII
mix5    xalancbmk, mcf, gcc, sphinx
mix6    omnetpp, lbm, milc, xalancbmk
mix7    astar, mcf, milc, calculix
mix8    omnetpp, gobmk, sjeng, libquantum
mix9    namd, gcc, lbm, dealII
mix10   soplex, tonto, hmmer, perlbench
mix11   GemsFDTD, bwaves, povray, zeusmp
mix12   wrf, xalancbmk, h264, gamess
V. RESULTS
This section discusses the performance, hit-rate, and
sensitivity results of Touché.
A. Performance Impact
Figure 17 shows the speedup of Touché when compared
to a baseline system that does not employ compression. On
average, Touché has a speedup of 12%. Ideally, when we can
place compressed memory blocks without any area overheads
in tag and data arrays, we get a speedup of 13%. On the
other hand, YACC achieves a 10.3% speedup by capturing
superblocks. Our analysis shows that gcc benefits the most
from LLC compression. gcc is extremely sensitive to the LLC
capacity, and as the miss rate of gcc drops from 45% to 5%
(by 9x) due to Touché, gcc experiences very low memory
access latency. This is because, at a 5% miss-rate, almost all of
its working set fits in the LLC. Therefore, gcc shows a
speedup of 91% due to Touché. For all other workloads, the
drop in miss-rate is at most 2.4x (see Figure 18); hence they
show up to 50% speedup.
B. Effect on Last-Level Cache Hit-Rate
Figure 18 shows the LLC hit-rate of Touché when compared
to a baseline system that does not employ compression. On
average, Touché increases the hit-rate by 6 percentage points.
In the ideal case, when we can place compressed memory blocks
from arbitrary addresses without any area overheads in tag
and data arrays, the hit-rate increases by 7 percentage points.
On the other hand, YACC increases the hit-rate by 4 percentage
points. Furthermore, some workloads like gcc, mix2, mix5, and
mix9 see a significant increase in hit-rate.
[Figure 18 (bar chart): LLC hit-rate (%) per workload and the mean, for the baseline and the compared schemes.]
Fig. 18. The hit-rate of Touché as compared to a baseline system that does not employ compression. On average, Touché increases the hit-rate by 6 percentage points
(Ideal – 7 percentage points, YACC – 4 percentage points) by storing a larger number of blocks within the LLC.
[Figure 19 (bar chart): speedup per workload for the baseline, Touché with LRU, Touché with DIP, and Touché with DRRIP. Off-scale bar labels: 1.91, 1.98, 2.04.]
Fig. 19. The sensitivity of Touché to different replacement policies. As Touché is only an LLC compression framework, it is orthogonal to the replacement
policy. Touché shows an increasing speedup of 12%, 14.5%, and 16.7% for the LRU, DIP, and DRRIP replacement policies respectively.
We also observe that hit-rates either increase or remain the
same for all benchmarks. Furthermore, Touché closely follows
the hit-rate of an ideal LLC compression technique. The slight
loss in hit-rate from the ideal LLC compression is due to the
capacity loss in the data array from the TADA mechanism.
C. Sensitivity to Replacement Policy
As Touché is an LLC compression technique, it does not
interfere with the replacement policy. Typically, the LLC
controller will choose a cacheline based on its replacement policy.
Touché then randomly evicts a block from within the selected
cacheline. Therefore, replacement policies are orthogonal to
the Touché framework.
Figure 19 shows the speedup of Touché for different
cache replacement policies. On average, the speedup of Touché
increases from 12% while using LRU to 14.5% while using
DIP. The speedup increases to 16.7% while using the DRRIP
replacement policy. Therefore, irrespective of the replacement
policy, Touché continues to provide high performance by
enabling efficient compression.
D. Impact on Memory Latency
Figure 20 shows the impact of Touché on the average
memory latency for reads. As Touché provides a higher LLC
hit-rate, it also reduces the average memory read latency. On
average, Touché reduces the memory read latency from 541
cycles to 489 cycles. In the ideal case, we can reduce the
average memory read latency to 478 cycles as this scheme
provides a slightly higher hit-rate.
[Figure 20 (bar chart): average memory latency for reads, y-axis from 400 to 550 cycles; series include Touché and Ideal.]
Fig. 20. The average memory read latency for Touché. On average, Touché
reduces the memory read latency from 541 cycles to 489 cycles.
E. Sensitivity to Last-Level Cache Size
Figure 21 shows the impact of LLC size on the effectiveness
of Touché. Touché is robust to different LLC sizes and
continues to be effective. For instance, even while using a
2MB cache, Touché provides an average speedup of 10%.
Even after doubling the LLC size from 4MB to 8MB, Touché
still provides a 9% average speedup.
[Figure 21 (bar chart): speedup, y-axis from 0.95 to 1.2, for LLC sizes of 2MB, 4MB, and 8MB; series: Baseline and Touché.]
Fig. 21. The sensitivity of Touché to the size of the LLC. Even after varying
the LLC size, Touché continues to provide at least 9% average speedup.
F. Impact on Low-MPKI Benchmarks
Until now, we have presented results only for high MPKI
benchmarks. However, for implementation purposes, it is vital
that Touché does not hurt the performance of low MPKI
benchmarks. Figure 22 shows the impact of Touché on the
performance of low MPKI workloads from the SPEC2006
suite. Overall, Touché does not cause a slowdown for any low
MPKI workload. On the contrary, Touché provides an average
speedup of 1.9% for these workloads.
[Figure 22 (bar chart): per-benchmark results for low MPKI workloads (including hmmer, astar, gromacs, namd, tonto, and gamess) and the mean, y-axis from 0.9 to 1.1.]
Fig. 22. Impact of Touché on low MPKI workloads. Touché does not hurt
the performance of any low MPKI workload. Touché provides an average
speedup of 1.9% for these workloads.
VI. RELATED WORK
In this section, we describe prior work that is closely related
to the ideas discussed in this paper.
A. Efficient Compression Algorithms
Cache compression algorithms like Frequent Pattern
Compression (FPC) [23], Base-Delta-Immediate (BDI) [8], and
Cache Packer (C-PACK) [36] have low decompression latency
and low implementation cost (i.e., area overhead). The
C-PACK algorithm can be improved further by detecting zero
cache lines [15]. Recently, Kim et al. [37] introduced a bit-plane
compression algorithm that uses a bit-plane transformation
to achieve a high compression ratio. Touché is orthogonal
to all of these compression algorithms. Touché can select
any of these algorithms to meet the hardware budget, latency
constraints, and application’s requirements.
B. Cache Compression with Tag Management
Prior works have proposed compressed cache architectures
to improve the effective cache capacity [12], [15], [16],
[17], [18]. For instance, a variable-size compressed cache
architecture using FPC was proposed [12]. This architecture
doubles the cache size when all cachelines are compressed
while requiring twice as many tag entries.
To reduce the tag overhead of the compressed cache, DCC [17]
and SCC [15] use superblocks to track multiple neighboring
blocks with a single tag entry. Recently, YACC [16] was
proposed to reduce the complexity of SCC by exploiting
compression and spatial locality. YACC still restricts the
mapping of compressed cachelines as it requires superblocks
that contain cachelines only from neighboring addresses.
Furthermore, YACC requires that those cachelines be of the same
compressed size. Touché eliminates this fundamental limitation
of the superblock-based compressed cache. On average,
YACC provides a 10.3% speedup while requiring additional bits
in the tag area, resulting in a 1.35x tag area. Touché provides
a 12% speedup without any area overheads. To increase LLC
efficiency, Amoeba-Cache [38] proposes storing tag and data
together while eliminating the tag area. However, to create
space for tags, Amoeba-Cache stores only parts of the memory
block within the cache. As DRAM caches do not encounter tag
storage problems and tend to be bandwidth sensitive, Young
et al. [39] use compression in DRAM caches to improve both
capacity and bandwidth dynamically.
C. Compression using Deduplication
Data deduplication exploits the observation that several
memory blocks in the LLC contain the same
value [40], [41], [42]. To improve efficiency, these techniques
store only a single copy of these memory blocks within the
LLC and design techniques to maintain tags that point to such
memory blocks.
To exploit the presence of identical memory blocks in the LLC,
Dedup [40] changes the LLC to enable several tags to point
to the same data. To this end, the tag array is decoupled
from the data array. Each tag entry is then equipped with
pointers to enable it to point to arbitrary memory blocks
in the data array. Touché is orthogonal to Dedup, as Touché
is a compression technique that compresses arbitrary memory
blocks independently and enables Dedup to be applied over
it.
D. Main Memory Compression Techniques
Compression can also be used for main memory. Memzip
compresses data to improve the bandwidth of the main
memory [43]. Pekhimenko et al. [29] and Abali et al. [44]
have proposed efficient techniques to improve the effective
capacity of main memory using compression. Compression
can also be used in Non-Volatile Memories (NVM) to reduce
energy and improve performance [45]. As compression
increases the number of bit-toggles on the bus, Pekhimenko
et al. [46] minimize bit-toggles and reduce the bus energy
consumption. Recently, the Compresso [47] memory system was
proposed to reduce the additional data movement caused by
metadata accesses for additional translation, changes in
compressibility of the cacheline, and compression across cacheline
boundaries. Similarly, DMC [48] was proposed to improve
memory capacity.
Compression can also rely on software support to increase the
main memory capacity. Products like IBM MXT and VMWare
ESX use “Balloon Drivers” to allocate and hold unused
memory when data becomes incompressible or when Virtual
Machines exceed capacity thresholds [49], [50], [51], [52],
[53]. One can also use compression in the context of memory
security. Morphable Counters [54] compress integrity tree
and encryption counters to reduce the size and height of the
integrity tree within the main memory.
E. Metadata Management for Main Memory
To reduce metadata bandwidth overheads from compression,
Attaché [55] and PTMC [56] enable data and metadata to be
accessed together. Deb et al. [57] describe the challenges in
maintaining metadata in main memory and recommend using
ECC to store metadata. While this technique is useful for
memory modules that have ECC in them, the LLC uses tag entries
and does not have to rely on ECC to store metadata [58], [59],
[60], [61], [62], [63], [64], [65], [66], [67]. However, Touché
can be expanded to include the ECC within the LLC to store
metadata.
F. Other Relevant Work
Sardashti and Wood [68] observed that cachelines in the
same page may not have similar compressibility. Hallnor
et al. [69] proposed using compressed data throughout the
memory hierarchy. This approach reduces the overheads of
compression and decompression at every level of the memory
hierarchy. Sathish et al. [70] try to save memory bandwidth
by using both lossy and lossless compression for GPUs.
Recent work from Han et al. [71] and Kadetotad et al. [72]
used compression with deep neural networks to significantly
improve performance and reduce energy. These prior works are
orthogonal to Touché.
Cache compression has also been used to reduce cache
power consumption. The residue cache architecture [10] reduces
the last-level cache area by half, resulting in power savings.
Other prior works have been proposed to lower the negative
impact of compression on replacement. ECM [14] reduces
cache misses using Size-Aware Insertion and Size-Aware
Replacement. CAMP [73] exploits the compressed cache block
size as a reuse indicator. Base-Victim [74] was also proposed
to avoid performance degradation due to the effect of
compression on replacement. The power-performance efficiency
of Touché can be improved using these prior works.
VII. SUMMARY
The Last-Level Cache (LLC) capacity per core has stagnated
over the past decade. One way to increase the effective
capacity of LLC is by employing data compression. Data
compression enables the LLC controller to pack more memory
blocks within the LLC. Unfortunately, the additional com-
pressed memory blocks require additional tag entries. The
LLC designer needs to provision additional tag area to store
the tag entries of compressed blocks. We can also restrict
data placement within each cacheline to neighboring addresses
(superblocks) and reduce the tag area overheads. Ideally, we
would like to get the benefits of LLC compression without
incurring any tag area overheads.
To this end, this paper proposes Touché, a framework
that enables LLC compression without any area overheads
in the tag or data arrays. Touché uses shortened signatures
to represent full tag addresses and appends the full tags to the
compressed memory blocks in the data array. This enables
Touché to store arbitrary memory blocks as neighbors.
Furthermore, Touché can be enhanced to include superblocks.
Touché is completely hardware-based and achieves a
near-ideal speedup of 12% (ideal 13%) without any changes or
area overheads to the tag and data arrays.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their feedback.
We thank Amin Azar for his feedback on compression. This
work was partially supported by the Natural Sciences and
Engineering Research Council of Canada (NSERC) [funding
reference number RGPIN-2019-05059] and by the National
Research Foundation of Korea (NRF) grant funded by the
Korea government (MSIT) [funding reference number NRF-
2019R1G1A1011403].
REFERENCES
[1] The Economist, “Technology Quarterly: After Moore’s Law,” 2019,
accessed: 2019-03-07.
[2] The Verge, “Intel is forced to do less with Moore,” 2019, accessed:
2019-03-08.
[3] BBC Science, “The End of Moore’s Law: What Happens Next,” 2019,
accessed: 2019-03-08.
[4] H. Esmaeilzadeh et al., “Dark silicon and the end of multicore scaling,”
IEEE Micro, vol. 32, no. 3, pp. 122–134, May 2012.
[5] H. Esmaeilzadeh et al., “Dark silicon and the end of multicore scaling,”
in Proceedings of the 38th Annual International Symposium on Com-
puter Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011,
pp. 365–376.
[6] Intel Inc., “Ivy Bridge: Intel Core X-series Processors,” 2019, accessed:
2019-03-07.
[7] ——, “Broadwell: 5th Generation Intel Core i5 Processors,” 2019,
accessed: 2019-03-07.
[8] G. Pekhimenko et al., “Base-delta-immediate compression: Practical
data compression for on-chip caches,” in 2012 21st International Con-
ference on Parallel Architectures and Compilation Techniques (PACT),
Sept 2012, pp. 377–388.
[9] A. Arelakis and P. Stenstrom, “Sc2: A statistical compression
cache scheme,” in Proceeding of the 41st Annual International
Symposium on Computer Architecture, ser. ISCA ’14. Piscataway,
NJ, USA: IEEE Press, 2014, pp. 145–156. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2665671.2665696
[10] S. Kim, J. Kim, J. Lee, and S. Hong, “Residue cache: A low-energy
low-area l2 cache architecture via compression and partial hits,” in 2011
44th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), Dec 2011, pp. 420–429.
[11] J. Dusser, T. Piquet, and A. Seznec, “Zero-content augmented caches,”
in Proceedings of the 23rd International Conference on Supercomputing,
ser. ICS ’09. New York, NY, USA: ACM, 2009, pp. 46–55. [Online].
Available: http://doi.acm.org/10.1145/1542275.1542288
[12] A. R. Alameldeen and D. A. Wood, “Adaptive cache compression for
high-performance processors,” in Proceedings. 31st Annual International
Symposium on Computer Architecture, 2004., June 2004, pp. 212–223.
[13] N. S. Kim, T. Austin, and T. Mudge, “Low-energy data cache using sign
compression and cache line bisection.” Citeseer.
[14] S. Baek, H. G. Lee, C. Nicopoulos, J. Lee, and J. Kim, “Ecm: Effective
capacity maximizer for high-performance compressed caching,” in 2013
IEEE 19th International Symposium on High Performance Computer
Architecture (HPCA), Feb 2013, pp. 131–142.
[15] S. Sardashti, A. Seznec, and D. A. Wood, “Skewed compressed caches,”
in 2014 47th Annual IEEE/ACM International Symposium on Microar-
chitecture, Dec 2014, pp. 331–342.
[16] ——, “Yet another compressed cache: A low-cost yet effective
compressed cache,” ACM Trans. Archit. Code Optim., vol. 13,
no. 3, pp. 27:1–27:25, Sep. 2016. [Online]. Available:
http://doi.acm.org/10.1145/2976740
[17] S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting
spatial locality for energy optimization,” IEEE Micro, vol. 34, no. 3, pp.
91–99, May 2014.
[18] S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting
spatial locality for energy-optimized compressed caching,” in 2013
46th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), Dec 2013, pp. 62–73.
[19] www.spec.org, “The SPEC2006 Benchmark Suite,” 2006.
[20] Intel Inc., “Intel 5-Level Paging and 5-Level EPT,” 2019, accessed:
2019-03-07.
[21] D. Weiss, M. Dreesen, M. Ciraula, C. Henrion, C. Helt, R. Freese,
T. Miles, A. Karegar, R. Schreiber, B. Schneller, and J. Wuu, “An 8mb
level-3 cache in 32nm soi with column-select aliasing,” in 2011 IEEE
International Solid-State Circuits Conference, Feb 2011, pp. 258–260.
[22] P. J. Nair, B. Asgari, and M. K. Qureshi, “Sudoku: Tolerating high-
rate of transient failures for enabling scalable sttram,” in 2019 49th
Annual IEEE/IFIP International Conference on Dependable Systems and
Networks (DSN), June 2019, pp. 388–400.
[23] A. R. Alameldeen and D. A. Wood, “Frequent pattern compression:
A significance-based compression scheme for l2 caches,” Dept. Comp.
Scie., Univ. Wisconsin-Madison, Tech. Rep, vol. 1500, 2004.
[24] Intel Inc., “Intel 64 and IA-32 Architectures Optimization Reference
Manual,” 2016, accessed: 2019-04-08.
[25] ——, “Intel Ultra Path Interconnect: Directory-Based Protocol,” 2019,
accessed: 2019-05-08.
[26] N. Chatterjee et al., “Usimm: the utah simulated memory module a sim-
ulation infrastructure for the jwac memory scheduling championship,”
2012.
[27] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer,
“High performance cache replacement using re-reference interval
prediction (rrip),” in Proceedings of the 37th Annual International
Symposium on Computer Architecture, ser. ISCA ’10. New
York, NY, USA: ACM, 2010, pp. 60–71. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1815971
[28] M. K. Qureshi et al., “Adaptive insertion policies for high
performance caching,” in Proceedings of the 34th Annual International
Symposium on Computer Architecture, ser. ISCA ’07. New
York, NY, USA: ACM, 2007, pp. 381–391. [Online]. Available:
http://doi.acm.org/10.1145/1250662.1250709
[29] G. Pekhimenko et al., “Linearly compressed pages: A low-
complexity, low-latency main memory compression framework,”
in Proceedings of the 46th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO-46. New York,
NY, USA: ACM, 2013, pp. 172–184. [Online]. Available:
http://doi.acm.org/10.1145/2540708.2540724
[30] JEDEC Standard, “DDR4 Standard,” in JESD79-4, 2015.
[31] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and O. Mutlu,
“Low-cost inter-linked subarrays (LISA): Enabling fast inter-subarray data
movement in DRAM,” in 2016 IEEE International Symposium on High
Performance Computer Architecture (HPCA), March 2016, pp. 568–580.
[32] P. Nair, C.-C. Chou, and M. Qureshi, “A case for refresh pausing in
DRAM memory systems,” in High Performance Computer Architecture
(HPCA2013), 2013 IEEE 19th International Symposium on, Feb 2013,
pp. 627–638.
[33] M. K. Qureshi, D. Kim, S. Khan, P. J. Nair, and O. Mutlu, “Avatar:
A variable-retention-time (VRT) aware refresh for DRAM systems,” in
2015 45th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks, June 2015, pp. 427–437.
[34] E. Choukse, M. Erez, and A. Alameldeen, “Compresspoints: An evalu-
ation methodology for compressed memory systems,” IEEE Computer
Architecture Letters, vol. 17, no. 2, pp. 126–129, July 2018.
[35] E. Perelman et al., “Using SimPoint for accurate and efficient simula-
tion,” ACM SIGMETRICS Performance Evaluation Review, 2003.
[36] X. Chen, L. Yang, R. P. Dick, L. Shang, and H. Lekatsas, “C-pack: A
high-performance microprocessor cache compression algorithm,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18,
no. 8, pp. 1196–1208, Aug 2010.
[37] J. Kim, M. Sullivan, E. Choukse, and M. Erez, “Bit-plane compression:
Transforming data for better compression in many-core architectures,”
in 2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), June 2016, pp. 329–340.
[38] S. Kumar, H. Zhao, A. Shriraman, E. Matthews, S. Dwarkadas, and
L. Shannon, “Amoeba-cache: Adaptive blocks for eliminating waste in
the memory hierarchy,” in 2012 45th Annual IEEE/ACM International
Symposium on Microarchitecture, Dec 2012, pp. 376–388.
[39] V. Young, P. J. Nair, and M. K. Qureshi, “DICE: Compressing DRAM
caches for bandwidth and capacity,” in Proceedings of the 44th Annual
International Symposium on Computer Architecture, ser. ISCA ’17.
New York, NY, USA: ACM, 2017, pp. 627–638. [Online]. Available:
http://doi.acm.org/10.1145/3079856.3080243
[40] Y. Tian, S. M. Khan, D. A. Jiménez, and G. H. Loh, “Last-
level cache deduplication,” in Proceedings of the 28th ACM
International Conference on Supercomputing, ser. ICS ’14. New
York, NY, USA: ACM, 2014, pp. 53–62. [Online]. Available:
http://doi.acm.org/10.1145/2597652.2597655
[41] B. Hong, D. Plantenberg, D. D. E. Long, and M. Sivan-Zimet, “Duplicate
data elimination in a SAN file system,” in MSST, 2004.
[42] T. E. Denehy and W. W. Hsu, “Duplicate
management for reference data,” in Research Report RJ10305, IBM,
2003.
[43] A. Shafiee et al., “Memzip: Exploring unconventional benefits from
memory compression,” in 2014 IEEE 20th International Symposium on
High Performance Computer Architecture (HPCA), Feb 2014, pp. 638–
649.
[44] B. Abali et al., “Performance of hardware compressed main mem-
ory,” in Proceedings HPCA Seventh International Symposium on High-
Performance Computer Architecture, 2001, pp. 73–81.
[45] P. M. Palangappa and K. Mohanram, “CompEx: Compression-expansion
coding for energy, latency, and lifetime improvements in MLC/TLC NVM,”
in 2016 IEEE International Symposium on High Performance Computer
Architecture (HPCA), March 2016, pp. 90–101.
[46] G. Pekhimenko et al., “A case for toggle-aware compression for GPU
systems,” in 2016 IEEE International Symposium on High Performance
Computer Architecture (HPCA), March 2016, pp. 188–200.
[47] E. Choukse, M. Erez, and A. R. Alameldeen, “Compresso: Pragmatic
main memory compression,” in 2018 51st Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture (MICRO), Oct 2018, pp. 546–
558.
[48] S. Kim, S. Lee, T. Kim, and J. Huh, “Transparent dual memory
compression architecture,” in 2017 26th International Conference on
Parallel Architectures and Compilation Techniques (PACT), Sept 2017,
pp. 206–218.
[49] P. Franaszek and J. Robinson, “Design and analysis of internal organi-
zations for compressed random access memories,” in IBM Report, RC
21146, 1998.
[50] C. D. Benveniste, P. A. Franaszek, and J. T. Robinson, “Cache-memory
interfaces in compressed memory systems,” IEEE Transactions on
Computers, vol. 50, no. 11, pp. 1106–1116, Nov 2001.
[51] T. B. Smith, B. Abali, D. E. Poff, and R. B. Tremaine, “Memory
expansion technology (MXT): Competitive impact,” IBM Journal of
Research and Development, vol. 45, no. 2, pp. 303–309, March 2001.
[52] R. B. Tremaine, T. B. Smith, M. Wazlowski, D. Har, K.-K. Mak, and
S. Arramreddy, “Pinnacle: IBM MXT in a memory controller chip,” IEEE
Micro, vol. 21, no. 2, pp. 56–68, Mar 2001.
[53] VMware Inc., “Understanding memory resource management in VMware
ESX 4.1,” 2010.
[54] G. Saileshwar et al., “Morphable counters: Enabling compact in-
tegrity trees for low-overhead secure memories,” in 2018 51st Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO),
Oct 2018, pp. 416–427.
[55] S. Hong, P. J. Nair, B. Abali, A. Buyuktosunoglu, K. Kim, and M. Healy,
“Attaché: Towards ideal memory compression by mitigating metadata
bandwidth overheads,” in 2018 51st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), Oct 2018, pp. 326–338.
[56] V. Young, S. Kariyappa, and M. Qureshi, “Enabling transparent memory-
compression for commodity memory systems,” in 2019 IEEE Interna-
tional Symposium on High Performance Computer Architecture (HPCA),
Feb 2019, pp. 570–581.
[57] A. Deb et al., “Enabling technologies for memory compression: Meta-
data, mapping, and prediction,” in 2016 IEEE 34th International Con-
ference on Computer Design (ICCD), Oct 2016, pp. 17–24.
[58] D. H. Yoon, M. K. Jeong, and M. Erez, “Adaptive granularity memory
systems: A tradeoff between storage efficiency and throughput,” in
Proceedings of the 38th Annual International Symposium on Computer
Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp.
295–306.
[59] P. J. Nair, D.-H. Kim, and M. K. Qureshi, “ArchShield: Architectural
framework for assisting DRAM scaling by tolerating high error rates,” in
Proceedings of the 40th Annual International Symposium on Computer
Architecture, ser. ISCA ’13. New York, NY, USA: ACM, 2013, pp.
72–83.
[60] P. J. Nair, D. A. Roberts, and M. K. Qureshi, “FaultSim:
A fast, configurable memory-reliability simulator for conventional
and 3D-stacked systems,” ACM Trans. Archit. Code Optim.,
vol. 12, no. 4, pp. 44:1–44:24, Dec. 2015. [Online]. Available:
http://doi.acm.org/10.1145/2831234
[61] D. Roberts and P. Nair, “FaultSim: A fast, configurable memory-
resilience simulator,” in The Memory Forum: In conjunction with ISCA,
vol. 41, 2014.
[62] S. Khan et al., “The efficacy of error mitigation techniques for DRAM
retention failures: A comparative experimental study,” in The 2014 ACM
International Conference on Measurement and Modeling of Computer
Systems, ser. SIGMETRICS ’14. New York, NY, USA: ACM, 2014,
pp. 519–532.
[63] P. J. Nair, V. Sridharan, and M. K. Qureshi, “XED: Exposing on-die error
detection information for strong memory reliability,” in 2016 ACM/IEEE
43rd Annual International Symposium on Computer Architecture (ISCA),
June 2016, pp. 341–353.
[64] C. Chou, P. Nair, and M. K. Qureshi, “Reducing refresh power in
mobile devices with morphable ECC,” in 2015 45th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, June
2015, pp. 355–366.
[65] P. J. Nair, D. A. Roberts, and M. K. Qureshi, “Citadel: Efficiently
protecting stacked memory from large granularity failures,” in 2014 47th
Annual IEEE/ACM International Symposium on Microarchitecture, Dec
2014, pp. 51–62.
[66] ——, “Citadel: Efficiently protecting stacked memory from TSV
and large granularity failures,” ACM Trans. Archit. Code Optim.,
vol. 12, no. 4, pp. 49:1–49:24, Jan. 2016. [Online]. Available:
http://doi.acm.org/10.1145/2840807
[67] G. Saileshwar et al., “Synergy: Rethinking secure-memory design for
error-correcting memories,” in 2018 IEEE International Symposium on
High Performance Computer Architecture (HPCA), Feb 2018, pp. 454–
465.
[68] S. Sardashti and D. A. Wood, “Could compression be of general use?
evaluating memory compression across domains,” ACM Trans. Archit.
Code Optim., vol. 14, no. 4, pp. 44:1–44:24, Dec. 2017. [Online].
Available: http://doi.acm.org/10.1145/3138805
[69] E. G. Hallnor and S. K. Reinhardt, “A unified compressed memory
hierarchy,” in 11th International Symposium on High-Performance Com-
puter Architecture, Feb 2005, pp. 201–212.
[70] V. Sathish, M. J. Schulte, and N. S. Kim, “Lossless and lossy
memory I/O link compression for improving performance of GPGPU
workloads,” in Proceedings of the 21st International Conference on
Parallel Architectures and Compilation Techniques, ser. PACT ’12.
New York, NY, USA: ACM, 2012, pp. 325–334. [Online]. Available:
http://doi.acm.org/10.1145/2370816.2370864
[71] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
deep neural network with pruning, trained quantization and Huffman
coding,” CoRR, vol. abs/1510.00149, 2015. [Online]. Available:
http://arxiv.org/abs/1510.00149
[72] D. Kadetotad et al., “Efficient memory compression in deep neural
networks using coarse-grain sparsification for speech applications,” in
Proceedings of the 35th International Conference on Computer-Aided
Design, ser. ICCAD ’16. New York, NY, USA: ACM, 2016, pp. 78:1–
78:8. [Online]. Available: http://doi.acm.org/10.1145/2966986.2967028
[73] G. Pekhimenko et al., “Exploiting compressed block size as an indicator
of future reuse,” in 2015 IEEE 21st International Symposium on High
Performance Computer Architecture (HPCA), Feb 2015, pp. 51–63.
[74] J. Gaur, A. R. Alameldeen, and S. Subramoney, “Base-victim com-
pression: An opportunistic cache compression architecture,” in 2016
ACM/IEEE 43rd Annual International Symposium on Computer Archi-
tecture (ISCA), June 2016, pp. 317–328.
