CRAM: Efficient Hardware-Based Memory Compression for Bandwidth
  Enhancement by Young, Vinson et al.
Vinson Young, Sanjay Kariyappa, and Moinuddin K. Qureshi
Georgia Institute of Technology
{vyoung,sanjaykariyappa,moin}@gatech.edu
Abstract—This paper investigates hardware-based memory
compression designs to increase the memory bandwidth. When
lines are compressible, the hardware can store multiple lines
in a single memory location, and retrieve all these lines in
a single access, thereby increasing the effective memory band-
width. However, relocating and packing multiple lines together
depending on the compressibility causes a line to have multi-
ple possible locations. Therefore, memory compression designs
typically require metadata to specify the compressibility of the
line. Unfortunately, even in the presence of dedicated metadata
caches, maintaining and accessing this metadata incurs significant
bandwidth overheads and can degrade performance by as much
as 40%. Ideally, we want to implement memory compression
while eliminating the bandwidth overheads of metadata accesses.
This paper proposes CRAM, a bandwidth-efficient design for
memory compression that is entirely hardware based and does
not require any OS support or changes to the memory modules
or interfaces. CRAM uses a novel implicit-metadata mechanism,
whereby the compressibility of the line can be determined by
scanning the line for a special marker word, eliminating the
overheads of metadata access. CRAM is equipped with a low-cost
Line Location Predictor (LLP) that can determine the location
of the line with 98% accuracy. Furthermore, we also develop
a scheme that can dynamically enable or disable compression
based on the bandwidth cost of storing compressed lines and the
bandwidth benefits of obtaining compressed lines, ensuring no
degradation for workloads that do not benefit from compression.
Our evaluations, over a diverse set of 27 workloads, show that
CRAM provides a speedup of up to 73% (average 6%) without
causing slowdown for any of the workloads, and consuming a
storage overhead of less than 300 bytes at the memory controller.
I. INTRODUCTION
As modern compute systems pack more and more cores
on the processor chip, the memory systems must also scale
proportionally in terms of bandwidth in order to supply data to
all the cores. Unfortunately, memory bandwidth is dictated by
the pin count of the processor chip, and this limited memory
bandwidth is one of the bottlenecks for system performance.
Data compression is a promising solution for enabling a higher
effective memory bandwidth. For example, when the data
is compressible, we can pack multiple memory lines within
one memory location and retrieve all of these lines with a
single memory request, increasing memory bandwidth and
performance. In this paper, we study main memory compression
designs that can increase memory bandwidth.
Prior works on memory compression [1] [2] [3] aim to obtain
both the capacity and bandwidth benefits from compression,
trying to accommodate as many pages as possible in the main
memory, depending on the compressibility of the data. As
the effective memory capacity of such designs can change
at runtime, these designs need support from the Operating
System (OS) or the hypervisor, to handle the dynamically
changing memory capacity. Unfortunately, this means such
memory compression solutions are not viable unless both the
hardware vendors (e.g., Intel, AMD) and the OS vendors
(e.g., Microsoft, Linux) can coordinate with each other on the
interfaces, or such solutions will be limited to systems where
the same vendor provides both the hardware and the OS. We
are interested in practical designs for memory compression
that can be implemented entirely in hardware, without relying
on any OS/hypervisor support. Such designs would provide the
bandwidth benefits while keeping memory capacity constant (see footnote 1).
A prior study, MemZip [5], tried to increase the memory
bandwidth using hardware-based compression; however, it
requires significant changes to the memory organization and
the memory access protocols. Instead of striping the line across
all the chips on a memory DIMM, MemZip places the entire
line in one chip, and changes the number of bursts required
to stream out the line, depending on the compressibility of
the data. Thus, MemZip requires significant changes to the
data organization of commodity memories and the memory
controller to support variable burst lengths. Ideally, we want
to obtain the memory bandwidth benefits from compression
while retaining support for commodity DRAM modules and
using conventional data organization and bus protocols.
Compression can change both the size and location of the
line. Without additional information, the memory controller
would not know how to interpret the data obtained from
the memory (compressed or uncompressed). Conventional
designs for memory compression rely on explicit metadata
that indicates the compressibility status of the line, and this
information is used to determine the location of the line. Such
designs store the metadata in a separate region of memory.
Unfortunately, accessing the metadata can incur significant
bandwidth overhead. While on-chip metadata caches [3] reduce
the bandwidth required to obtain the metadata, such caches
are designed mainly to exploit spatial locality and are not as
effective for workloads that have limited spatial locality. Our
goal is to design hardware-compressed memory, without any
OS support, using commodity memories and protocols, and
without the metadata lookup.
Footnote 1: In fact, a few months ago, Qualcomm’s Centriq [4] system was announced
with a feature that tries to provide higher bandwidth through memory
compression while forgoing the extra capacity available from memory
compression. Centriq’s design relies on increasing the linesize to 128 bytes,
striping this wider line across two channels, having ECC DIMMs in each
channel to track compression status, and obtaining the 128-byte line from
one channel if the line is compressible. Ideally, we want to obtain bandwidth
benefits without changing the linesize, without relying on ECC DIMMs, and without
being limited to a 2x compression ratio. Nonetheless, the Centriq announcement
shows the commercial appeal of such hardware-based memory compression.
arXiv:1807.07685v1 [cs.AR] 20 Jul 2018
[Figure 1 panels: (a) No Compression, accesses = 4; (b) Ideal Compression, accesses = 3; (c) Compression with Metadata, accesses = 5. Lines A and B are compressible; lines X and Y are incompressible.]
Fig. 1. Number of accesses to read 4 lines: A, B, X, Y for (a) uncompressed memory, (b) ideal compressed memory, and (c) compressed memory with
metadata lookup overhead. Metadata lookup causes significant bandwidth overhead.
We explain the problem of bandwidth overhead of metadata
with an example. Figure 1 shows three memory systems, each
servicing four memory requests A, B, X and Y. A and B
are compressible and can reside in one line, whereas X and
Y are incompressible. For the baseline system (a), servicing
these four requests would require four memory accesses. For
an idealized compressed memory system (b) (that does not
require metadata lookup), lines A and B can be obtained in
a single access, whereas X and Y would require one access
each, for a total of 3 accesses for all the four lines. However,
when we account for metadata lookup (c), it could take up to 5
accesses to read and interpret all the lines, causing degradation
relative to an uncompressed scheme. Our studies show that
even in the presence of metadata caching, the metadata lookup
can degrade performance by as much as 40%. Ideally, we
want to implement memory compression without incurring the
bandwidth overheads of metadata accesses.
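The access counts in the Figure 1 example can be reproduced with a short sketch. The function name `count_accesses` and the choice of two metadata lookups for case (c) are assumptions made for illustration; the text only states that the total can reach five accesses.

```python
def count_accesses(requests, packed_groups, metadata_lookups=0):
    """Count memory accesses needed to service `requests`.

    packed_groups: groups of lines that share one memory location, so a
    single access returns every line in the group.
    metadata_lookups: extra accesses spent fetching metadata (assumed value).
    """
    fetched = set()
    accesses = 0
    for line in requests:
        if line in fetched:
            continue  # already obtained together with a packed neighbor
        accesses += 1
        group = next((g for g in packed_groups if line in g), {line})
        fetched |= set(group)
    return accesses + metadata_lookups


requests = ["A", "B", "X", "Y"]
uncompressed = count_accesses(requests, [])                    # case (a)
ideal = count_accesses(requests, [{"A", "B"}])                 # case (b)
with_metadata = count_accesses(requests, [{"A", "B"}],
                               metadata_lookups=2)             # case (c)
```

Under these assumptions the three cases need 4, 3, and 5 accesses, matching the figure.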
To this end, this paper presents CRAM, an efficient hardware-
based main-memory compression design. CRAM decouples
and separately solves the issue of (i) how to interpret the data,
and (ii) where to look for the data, to eliminate the metadata
lookup. To efficiently interpret the data received on an access,
we propose an implicit-metadata scheme, whereby compressed
lines are required to contain a special value, called a marker.
For example, with a four-byte marker, the last four bytes of a
compressed line are always required to equal the marker
value. We leverage the insight that compressed data rarely
uses the full 64-byte space, so we can store compressed data
within 60 bytes and use the remaining four bytes to store
the marker. On a read to a line that contains the marker, the
line is interpreted as a compressed line. Similarly, an access
to a line that does not contain the marker is interpreted as
an uncompressed line. The likelihood that an uncompressed
line coincidentally matches with a marker is quite small (less
than one in a billion), and CRAM handles such rare cases of
marker collisions simply by identifying lines that cause marker
collisions on a write and storing such lines in an inverted form
(more details in Section V-A).
The implicit-metadata scheme eliminates the need to do
a separate metadata lookup. However, CRAM now needs an
efficient mechanism to determine the location of the given line.
CRAM restricts the possible locations of the line, based on
compressibility. For example, in Figure 1(b), when A and B are
compressible, CRAM restricts that both A and B must reside
in the location of A. Therefore, the location of A remains
unchanged regardless of compression. However, for B, the
location depends on compressibility. We propose a history-
based Line Location Predictor (LLP), that can identify the
correct location of the line with a high accuracy, which helps
in obtaining a given line in a single memory access. The LLP
is based on the observation that lines within a page tend to
have similar compressibility. We propose a page-based last-
compressibility predictor to predict compressibility and thus
location, and this allows us to access the correct location with
98% accuracy. CRAM, combined with implicit-metadata and
LLP, eliminates metadata lookups and achieves an average
speedup of 8.5% on SPEC workloads.
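A page-based last-compressibility predictor of this kind might look like the following minimal sketch. The table size, the 4 KB page size, the modulo indexing, and all names are assumptions made for illustration; the text here does not specify how the LLP is organized.

```python
LINES_PER_PAGE = 64  # assumes 4 KB pages and 64-byte lines


class LineLocationPredictor:
    """Remembers the last compression status seen for each page (sketch)."""

    def __init__(self, entries=512):  # table size is an assumed value
        self.entries = entries
        # Status per entry: 1 = uncompressed, 2 = 2-to-1, 4 = 4-to-1.
        self.table = [1] * entries

    def _index(self, line_addr):
        page = line_addr // LINES_PER_PAGE
        return page % self.entries

    def predict(self, line_addr):
        return self.table[self._index(line_addr)]

    def update(self, line_addr, observed_status):
        self.table[self._index(line_addr)] = observed_status


def predicted_location(line_addr, status):
    """Line address to access first, given a predicted status of 1, 2, or 4.

    A packed group lives at the address of its first line, so the low
    address bits are cleared according to the predicted packing.
    """
    return line_addr & ~(status - 1)
```

For example, once a line in a page has been observed as 2-to-1 compressed, a later request for line address 7 in that page would first be sent to line address 6.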
Unfortunately, even after eliminating the bandwidth overheads
of the metadata lookup, some workloads still slow down
with compression due to the inherent bandwidth
overheads associated with compressing memory. For example,
compressing and writing back clean lines incurs bandwidth
overhead, as those lines are not written to memory in an
uncompressed design. For workloads with poor reuse, this
bandwidth overhead of writing compressed data does not get
amortized by the subsequent accesses. To avoid performance
degradation in such scenarios, we develop Dynamic-CRAM, a
sampling-based scheme that can dynamically enable or disable
compression depending on when compression is beneficial.
Dynamic-CRAM ensures no slowdown for workloads that do
not benefit from compression.
Overall, this paper makes the following contributions:
1. It proposes CRAM, a hardware-based compressed memory
design to provide bandwidth benefits without requiring OS
support or changes to the memory module and protocols.
CRAM performs memory accesses using conventional linesize
(64 bytes) and does not rely on the availability of ECC-DIMMs.
2. It proposes implicit-metadata to eliminate the storage and
bandwidth overheads of metadata, by providing both the
compressibility status and data in a single memory access.
3. It proposes a low-cost Line Location Predictor (LLP) to
determine the location of the line with a high accuracy.
4. It proposes Dynamic-CRAM, to enable or disable compres-
sion at runtime based on the cost and benefit of compression.
Our evaluations show that CRAM improves bandwidth by
9% and provides a speedup of up to 73% (average 6%), while
ensuring no slowdown for any of the workloads. CRAM can
be implemented with minor changes to the memory controller,
while incurring a storage overhead of less than 300 bytes.
II. BACKGROUND AND MOTIVATION
Compression exploits redundancy in data values and can
provide both larger effective capacity and higher effective
bandwidth for the memory systems. While exploiting the
increased memory capacity requires OS support (to handle
the dynamically changing capacity depending on data values),
memory compression for exploiting only the bandwidth benefits
can potentially be implemented entirely in hardware. We
provide background on hardware-based memory compression,
the potential benefit from compression, the challenges in
implementing such a compressed design, and insights that
can help in developing efficient designs.
A. Memory Compression for Bandwidth
Figure 2 provides an overview of a compressed memory
design. We will assume that the design is geared towards
obtaining only the bandwidth benefits, and the extra capacity
created by compression remains unused. Compressed memory
designs leverage compression algorithms [6] [7] [8] [9] [10]
[11] [12] to accommodate data into a smaller space. If the lines
are compressible we can either store them in their original
location and stream out in a smaller burst, or place multiple
compressed lines in one location and stream out all these lines
in one access. We use the second option as it avoids dynamically
changing the burst length, thus retaining compliance with the
protocols and data mapping used in conventional memory
designs. Thus, if two lines A and B are compressible, then
both are resident in the location of A. One memory access
can provide both A and B, thereby increasing the effective
bandwidth if both A and B get used. The bandwidth benefit of
such a design is dictated by both (a) the ability to pack multiple
lines together, and (b) the spatial locality of the workload
(ability to use adjacent lines).
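[Figure 2 diagram: the processor contains the core, the LLC, and a memory controller with a compression-decompression engine and a metadata cache (compressibility and location info). The controller sends read/write requests and metadata transfers to DRAM, which also holds the metadata region.]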
Fig. 2. Overview of Compressed Memories. Conventional designs track
compression status of each line in a metadata region and use a metadata cache.
B. The Challenge of Metadata Accesses
An access to the compressed memory obtains a 64-byte line;
however, the memory controller would not know if the line
contains compressed data or not. For example, an access to A
would provide both A and B, if both lines are compressible,
and only A if the lines are uncompressed. Simply obtaining the
line from location A is insufficient to provide the information
about compressibility of the line. A separate region in memory,
which we refer to as the metadata region, keeps track of the
Compression Status Information (CSI) for each line. Thus, we
need the CSI of the line along with the data line to not only
interpret the data line, but also to determine the location of
the data line. Even if we provisioned only one bit per line in
memory, the size of this metadata would be quite large. For
example, for our 16GB memory, having 1 bit per line to specify
if the line is compressed or not would require a capacity of 32
megabytes. Therefore, conventional designs keep the metadata
region in memory and access this metadata region on a demand
basis and cache it in an on-chip metadata cache [3] [5]. Such
designs are effective only when the metadata cache has a high
hit-rate, due to either high spatial locality or small workload
footprint. However, these approaches become ineffective when
scaled to much larger workloads with low spatial locality. In
the worst-case, these designs may need a separate metadata
access for every data access, constituting a potential bandwidth
overhead of 50-80% (e.g., in xz and cactu). Therefore, avoiding
the bandwidth overhead of metadata accesses is vital to building
an effective memory compression design.
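The 32-megabyte figure above follows directly from the memory and line sizes. A small sketch of the arithmetic, with the helper name `metadata_megabytes` chosen only for illustration:

```python
MEMORY_BYTES = 16 * 2**30   # 16 GB main memory
LINE_BYTES = 64             # cache line size


def metadata_megabytes(memory_bytes, line_bytes, bits_per_line):
    """Size of a metadata region holding `bits_per_line` bits for every line."""
    num_lines = memory_bytes // line_bytes
    return num_lines * bits_per_line / 8 / 2**20


num_lines = MEMORY_BYTES // LINE_BYTES                    # 2**28 lines
one_bit_region = metadata_megabytes(MEMORY_BYTES, LINE_BYTES, 1)
```

With one bit per line, the region occupies 32 MB, far more than an on-chip structure can hold.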
C. Potential for Performance Improvement
Our goal is to develop an efficient memory compression
design that provides higher bandwidth. Figure 3 shows the
performance benefit from an idealized compression scheme
that does not maintain any metadata and simply transfers all the
lines that would be together in a compressed memory system,
thereby obtaining all the benefits of compression and none of
the overheads. We also show the performance of a practical
memory compression that maintains metadata in memory and
is equipped with a 32KB metadata cache.
[Figure 3 bar chart: y-axis Speedup (0.40 to 1.80); x-axis workloads fotonik, lbm17, soplex, libq, mcf17, milc, Gems, parest, sphinx, leslie, cactu17, omnet17, gcc06, xz, wrf17, bc twi, bc web, cc twi, cc web, pr twi, pr web, plus SPEC, GAP, and ALL27 averages; series: Ideal Design (no metadata), Design with metadata.]
Fig. 3. Speedup from ideal compression (no metadata lookup) and practical
compression (w/ metadata cache).
On average, compression can provide a speedup of 9%; however,
the overheads of implementing compression erode this benefit.
In fact, we observe significant degradation with compression
for several workloads. For example, the graph workloads (bc twi
through pr web) have small potential benefits from compression, due
to poor spatial locality and low data reuse. It is important that
the implementation of compressed memory does not degrade
the performance of such workloads.
D. Insight: Store Metadata in Unused Space
To reduce the metadata access of compressed memory, we
leverage the insight that not all the space of the 64-byte line
is used by compressed memory. For example, when we are
trying to compress two lines (A and B) they must fit within
64 bytes; however, the compressed size could still be smaller
than 64 bytes (and not large enough to store additional lines C
and D). We can leverage the unused space in the compressed
memory line to store metadata information within the line. For
example, we could require that the compressed lines store a
4-byte marker (a predefined value) at the end of the line, and
the space available to store the compressed lines would now
get reduced to 60 bytes. Figure 4 shows the probability of a
pair of adjacent lines compressing to ≤64B and ≤60B. As the
probabilities of compressing pairs of lines to ≤64B and ≤60B
are 38% and 36%, respectively, we find that reserving space
for this marker does not substantially impact the likelihood
of compressing two lines together and thus would not have a
significant impact on compression ratio.
[Figure 4 bar chart: y-axis % of compressible lines (0 to 100); x-axis workloads fotonik, lbm17, soplex, libq, mcf17, milc, Gems, parest, sphinx, leslie, cactu17, omnet17, gcc06, xz, wrf17, bc twi, bc web, cc twi, cc web, pr twi, pr web, plus the average; series: Double ≤ 64, Double ≤ 60.]
Fig. 4. Probability of a pair of adjacent lines compressing to ≤64B and ≤60B.
We can use 4 bytes to indicate status of the line, without significantly affecting
compressibility.
We can use this insight to store the metadata implicitly within
the line and avoid the bandwidth overheads of accessing the
metadata explicitly. If the line obtained from memory contains
the marker value, the line is deemed compressed, whereas,
if the line does not have the marker value, then it is
deemed uncompressed. However, there could be a case where
the uncompressed line coincidentally stores the marker value.
A practical solution must efficiently handle such collisions,
even though such collisions are expected to be extremely rare.
We propose CRAM, an efficient hardware-based compression
design, that does not require any OS support or changes to the
memory module/protocols, and avoids the bandwidth overheads
of metadata lookups. We discuss our evaluation methodology
before discussing our solution.
III. METHODOLOGY
A. Framework and Configuration
We use USIMM [13], an x86 simulator with a detailed memory
system model. Table I shows the configuration used in our
study. We assume a three-level cache hierarchy (L1, L2, L3
being on-chip SRAM caches). All caches use a line size of 64
bytes. The DRAM model is based on DDR4.
We model a virtual memory system to perform virtual to
physical address translations, and this ensures that the memory
accesses of different cores do not map to the same physical
page. Note that, other than the virtual memory translation,
the OS is not extended to provide any support to enable the
compressed memory.
For compression, we use a hybrid compression scheme
that applies both FPC and BDI and keeps whichever result
is smaller. Information about the compression
algorithm used and the compression-specific metadata (e.g.
base for BDI) are stored within the compressed line, and are
counted towards determining the size of the compressed line.
TABLE I
SYSTEM CONFIGURATION
Processors              8 cores; 3.2GHz, 4-wide OoO
Last-Level Cache        8MB, 16-way
Compression Algorithm   FPC + BDI
Main Memory
  Capacity              16GB
  Bus Frequency         800MHz (DDR 1.6GHz)
  Configuration         2 channel, 2x rank, 64-bit bus
  tCAS-tRCD-tRP-tRAS    11-11-11-39 ns
B. Workloads
We use a representative slice of 1 billion instructions selected
by PinPoints [14], from benchmark suites that include SPEC
2006 [15], SPEC 2017 [16], and GAP [17]. We evaluate all
SPEC 2006 and SPEC 2017 workloads, and mark ’06 or ’17
to denote the version when the workload is common to both.
We additionally run the GAP suite, which performs graph analytics on
real data sets (twitter, web sk-2005) [18]. We show
a detailed evaluation of the workloads with at least five misses
per thousand instructions (MPKI). The evaluations execute
benchmarks in rate mode, where all eight cores execute the
same benchmark. Table II shows L3 miss rates and memory
footprints of the workloads we have evaluated in detail. In
addition to these workloads, we also include 6 mixed workloads
that are formed by randomly mixing the SPEC workloads.
TABLE II
WORKLOAD CHARACTERISTICS
Suite   Workload   L3 MPKI   Footprint
SPEC    fotonik    26.2      6.8 GB
SPEC    lbm17      25.5      3.4 GB
SPEC    soplex     23.3      2.1 GB
SPEC    libq       23.1      418 MB
SPEC    mcf17      22.8      4.4 GB
SPEC    milc       21.9      3.1 GB
SPEC    Gems       17.2      5.8 GB
SPEC    parest     16.4      465 MB
SPEC    sphinx     11.9      223 MB
SPEC    leslie     11.9      861 MB
SPEC    cactu17    10.6      2.1 GB
SPEC    omnet17    8.6       1.9 GB
SPEC    gcc06      5.8       205 MB
SPEC    xz         5.7       943 MB
SPEC    wrf17      5.2       798 MB
GAP     bc twi     66.6      9.2 GB
GAP     bc web     7.4       10.0 GB
GAP     cc twi     101.8     6.0 GB
GAP     cc web     8.1       5.3 GB
GAP     pr twi     144.8     8.3 GB
GAP     pr web     13.1      8.2 GB
We perform timing simulation until each benchmark in
a workload executes at least 1 billion instructions. We use
weighted speedup to measure aggregate performance of the
workload normalized to the baseline and report geometric
mean for the average speedup across all the 27 workloads (7
SPEC2006, 8 SPEC2017, 6 GAP, 6 MIX). For other workloads
that are not memory bound, we present full results of all 64
benchmarks evaluated (29 SPEC2006, 23 SPEC2017, 6 GAP,
6 MIX) in Section VII-B.
IV. CRAM: BASIC DESIGN
Our proposed design, CRAM, tries to obtain bandwidth
benefits using memory compression without requiring OS
support, without changes to bus protocol, and while maintaining
the existing organization for the memory modules. In this
section, we provide an overview of the basic CRAM design,
and discuss the shortcomings of maintaining and retrieving
metadata associated with compression.
A. Organization and Operation of CRAM
Figure 5 shows an overview of CRAM. The main memory
can store compressed data, and the job of compression and
decompression is performed by the logic on the memory
controller on the processor chip. The L3 cache is assumed
to store data in uncompressed form. The bus connecting the
memory controller and the memory modules uses the existing
JEDEC protocol and transfers 64 bytes on each access. If the
lines are compressible, then a single access can provide multiple
neighboring lines, and increase effective bandwidth.
Fig. 5. An Overview of the CRAM Design.
Restricted Data Mapping: Without loss of generality, CRAM
supports up to 4-to-1 compression, which means up to four
compressed lines can be resident in one memory location. If
the compressibility is not high enough to store 4 lines in one
location, then the design tries 2-to-1 compression, where two
neighboring lines are placed in one location. If the lines are
uncompressed they retain their original location. If we organize
the data layout appropriately, we can reduce the amount of
uncertainty (i.e., number of possible positions) in locating lines.
CRAM restricts the location of the lines in a group of 4 lines
to help in locating the lines easily.
[Figure 6 diagram: layouts of a 4-line group (A, B, C, D) when uncompressed, 2:1 compressed, and 4:1 compressed. Possible locations per line: A = 1, B = 2, C = 2, D = 3.]
Fig. 6. CRAM relocates and packs adjacent lines. Restricting placement
reduces number of valid positions.
Figure 6 shows the five different line permutations for
a group of 4 lines under CRAM, based on whether the
lines undergo 2-to-1 compression, 4-to-1 compression, or no
compression. Thus, line A (lines with line-address ending in
"00") is always resident in the same location, whereas line
B (lines with address ending in "01") can be in the original
location at B (if B is uncompressed) or at A (if B is compressed).
Note that, on average there are only two locations for each
line in the group. An access to line A can provide location
information for all four lines in the group if the line is 4-to-1
compressed, or for line B otherwise. Thus, a sequential access
across the memory would obtain the first line in the group
always from the original location, and this line can provide
location information for the subsequent lines in the group.
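The restricted mapping can be expressed as a small function over line addresses. This is a sketch under the assumption that the two low bits of the line address select the position within the 4-line group; the function name is hypothetical.

```python
def candidate_locations(line_addr):
    """Line addresses where `line_addr` may reside under the restricted mapping.

    Ordered as: original location, 2-to-1 location, 4-to-1 location,
    with duplicates removed.
    """
    original = line_addr
    pair_base = line_addr & ~0b1     # first line of the 2-line pair
    group_base = line_addr & ~0b11   # first line of the 4-line group
    locations = []
    for loc in (original, pair_base, group_base):
        if loc not in locations:
            locations.append(loc)
    return locations


# Lines A, B, C, D of the group starting at line address 8.
counts = [len(candidate_locations(addr)) for addr in range(8, 12)]
```

For this group the counts are 1, 2, 2, and 3, which averages to two possible locations per line, as stated above.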
Write Operation: When a cacheline gets evicted from LLC,
the memory controller checks if (a) the neighboring cachelines
are present in the LLC, and if (b) the group of 2 or 4 cachelines
can be compressed to the size of a single uncompressed
cacheline (64 Bytes). If the group of cachelines can be
compressed to a single block, the memory controller compacts
them together and issues a write containing the 2 or 4
compressed lines to one physical location. Note that for a
compressed memory, the controller can have the flexibility to
compress and write back clean lines as well, otherwise the
benefits of compression will become restricted only to dirty
lines. Our default policy compacts and writes back clean lines
if they are compressible, in the hope that this bandwidth cost
will be amortized by future re-use.
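The write-path decision described above can be sketched as follows. The function name, the dictionary of compressed sizes, and the order of trying 4-to-1 before 2-to-1 are illustrative assumptions; compressed sizes would come from the compression engine, which is not modeled here.

```python
LINE_BYTES = 64


def choose_packing(evicted_slot, sizes, budget=LINE_BYTES):
    """Decide where an evicted line is written and which lines go with it.

    evicted_slot: position (0..3) of the evicted line in its 4-line group.
    sizes: compressed size in bytes for each group member present in the
           LLC, keyed by slot; absent neighbors are simply missing.
    Returns (target_slot, slots_written_together).
    """
    group = [0, 1, 2, 3]
    if all(s in sizes for s in group) and sum(sizes[s] for s in group) <= budget:
        return 0, group                      # 4-to-1: everything lives at A
    pair_base = evicted_slot & ~1
    pair = [pair_base, pair_base + 1]
    if all(s in sizes for s in pair) and sum(sizes[s] for s in pair) <= budget:
        return pair_base, pair               # 2-to-1: pair lives at its first line
    return evicted_slot, [evicted_slot]      # stored uncompressed in place
```

The `budget` parameter is left adjustable because a design that reserves part of the line for other purposes would pack into fewer than 64 bytes.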
Read Operation: On a read, the controller needs to determine
the compressibility and the location of the line, as the line can
get relocated based on compressibility. Conventional designs
rely on metadata in memory to provide the Compression Status
Information (CSI) of each line. In our case, this CSI-metadata
would be a 3-bit entity for a group of 4 lines (to indicate one
of 5 possible states for the group, based on Figure 6). If the
CSI metadata is available, the read can determine the location
of the line, access the memory, decompress all the line(s) if
needed, and store all the retrieved lines in the L3 cache.
[Figure 7 bar chart: y-axis Speedup (0.40 to 1.80); x-axis workloads fotonik, lbm17, soplex, libq, mcf17, milc, Gems, parest, sphinx, leslie, cactu17, omnet17, gcc06, xz, wrf17, bc twi, bc web, cc twi, cc web, pr twi, pr web, plus SPEC, GAP, and ALL27 averages.]
Fig. 7. Speedup of CRAM with explicit metadata (+ metadata cache) compared
to uncompressed memory.
B. The Problem With Explicit Metadata
We can design CRAM with explicit metadata, where the
metadata specifies the compression status of the line. Given
that the size of the metadata is 3 bits per group of four lines,
we need on average 0.75 bits per line. For our 16GB memory,
containing about 268 million 64-byte lines, the total size of this metadata would
be 24 megabytes, much larger than the capacity of on-chip
structures. We can keep the metadata in memory and cache
it in an on-chip metadata cache, as done in prior works [3],
[5]. For workloads with good spatial locality or small memory
footprint, most metadata requests and updates will be serviced
[Figure 8 stacked bar chart: y-axis Normalized Bandwidth Consumption (0.00 to 2.00); x-axis workloads fotonik, lbm17, soplex, libq, mcf17, milc, Gems, parest, sphinx, leslie, cactu17, omnet17, gcc06, xz, wrf17, bc twi, bc web, cc twi, cc web, pr twi, pr web, plus SPEC, GAP, and ALL21 averages; series: Data, Clean Evict, Metadata.]
Fig. 8. Bandwidth consumption for data, compressed writebacks, and metadata for CRAM with explicit metadata, normalized to uncompressed memory.
Metadata accesses constitute significant bandwidth overhead.
by the cache and such a design would work well. Unfortunately,
having explicit metadata requires accessing memory on a
miss in the metadata cache. Figure 7 shows the performance
of CRAM with explicit metadata with a 32KB metadata cache.
On average, this scheme shows a 10% slowdown relative to an
uncompressed memory, because of the metadata accesses.
Figure 8 shows the breakdown of the bandwidth consumed
by the CRAM design, normalized to the uncompressed memory.
In general, compression is effective at reducing the number of
requests for data. However, depending on the workload, the
metadata cache can have a poor hit rate and require frequent
memory accesses to obtain the metadata. These extra metadata accesses
can constitute a significant bandwidth overhead. For example,
xz needs over 50% extra bandwidth just to fetch the metadata.
Thus, schemes that require a separate metadata lookup can end
up degrading performance relative to uncompressed memory.
We develop a solution that eliminates metadata lookups.
V. CRAM: OPTIMIZED DESIGN
To avoid the bandwidth overheads of the metadata access,
CRAM decouples the information provided by the metadata
into two parts: (a) determining the compression status, and (b)
determining the location of the cachelines, and solves each
of them separately. The first component tackles the problem of
interpreting accessed lines with implicit-metadata using marker
values. The second component handles the problem of
locating lines with a line-location predictor.
A. Implicit-Metadata: No Metadata Lookups
We exploit the insight that a pair of compressed lines does not
always use all the available 64 bytes. Our analysis (presented
in Figure 4) shows that the probability that a pair of lines
is compressible within 64 bytes but not within 60 bytes is
quite small (close to 2%). We exploit this leftover space in
compressed lines to specify the compression status of the line,
using a predefined special value, which we call the marker.
Figure 9 shows the implicit-metadata design using markers,
for lines that are compressible with 2-to-1 compression, 4-to-
1 compression, or no compression. If the line contains two
compressed lines (e.g. A and B both reside in A), then the
line is required to contain the marker corresponding to 2-to-1
compression (x22222222 in our example) in the last four bytes.
Similarly, if the line contains four compressed lines (A, B, C,
and D, all reside in A), then the line is required to contain the
marker corresponding to 4-to-1 compression (x44444444 in our
example). The marker reduces the available space for compressed
lines to 60 bytes. If the compressed lines cannot fit
within 60 bytes, then they are stored in uncompressed form.
An incompressible line is stored in its original form, without
any space reserved for the marker. The probability that the
uncompressed line coincidentally matches with the 32-bit
marker is quite small (less than 1 in a billion). Our solution
handles such extremely rare cases of collision with marker
values by storing such lines in an inverted form. This ensures
that the only lines in memory that contain the marker value
(in the last four bytes) are the compressed lines.
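[Figure 9 diagram: a 64-byte line holds 60 B of compressed data followed by a 4 B marker. 2-to-1 compressed: Line A, Line B, marker; 4-to-1 compressed: A, B, C, D, marker; uncompressed: Line X fills the line with no marker. Values shown in the figure: x00000000, xFFFFFFFF, x22222222, x44444444.]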
Fig. 9. Implicit metadata using markers: Compressed lines always contain a
marker in the last four bytes.
Determining Compression Status with Markers: When a
line is retrieved, the memory controller scans the last four
bytes for a match with the markers. If there is a match with
either the 2-to-1 marker or the 4-to-1 marker, we know that
the line contains compressed data for either two lines, or four
lines, respectively. If there is no match, the line is deemed to
store uncompressed data. Thus, with a single access, CRAM
obtains both the data and the compression status.
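The marker check described above can be sketched as follows. This is a minimal illustration, assuming the fixed example marker values from Figure 9 rather than per-line markers; the names `Status` and `compression_status` are hypothetical.

```python
from enum import Enum

LINE_BYTES = 64
MARKER_2TO1 = bytes.fromhex("22222222")  # example value from Fig. 9
MARKER_4TO1 = bytes.fromhex("44444444")  # example value from Fig. 9


class Status(Enum):
    UNCOMPRESSED = 1
    TWO_TO_ONE = 2
    FOUR_TO_ONE = 4


def compression_status(line: bytes) -> Status:
    """Classify a 64-byte line by comparing its last four bytes to the markers."""
    if len(line) != LINE_BYTES:
        raise ValueError("expected a 64-byte line")
    tail = line[-4:]
    if tail == MARKER_2TO1:
        return Status.TWO_TO_ONE
    if tail == MARKER_4TO1:
        return Status.FOUR_TO_ONE
    return Status.UNCOMPRESSED
```

A single read therefore yields both the payload (the first 60 bytes of a compressed line) and its status, with no separate metadata access.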
Handling Collisions with Marker via Inversion: We define
a marker collision as the scenario where the data in an
uncompressed line (last four bytes) matches with one of the
markers. Since our design generates per-line markers,
marker collisions are quite rare (less than one
in a billion). However, we still need a way to handle them without
incurring significant storage or complexity. CRAM handles
marker collisions simply by inverting the uncompressed line
and writing this inverted line to memory, as shown in Figure 10.
Doing so ensures that the only lines in memory that contain
the marker value (in the last four bytes) are compressed lines.
[Figure 10: flow for a line to install at memory address A. If the line collides with the marker (example: last four bytes 44444444 against marker x44444444), (1) invert the line (last four bytes become BBBBBBBB) and (2) update the Line Inversion Table (LIT); otherwise install the line as is.]
Fig. 10. Line inversion handles collisions of uncompressed lines with marker.
A dedicated on-chip structure, called the Line Inversion
Table (LIT), keeps track of all the lines in memory that are
stored in an inverted form. The likelihood that multiple lines
resident in memory concurrently encounter marker collisions
is negligibly small. For example, if the system continuously
writes to memory, then it will take more than 10 million years
to obtain a scenario where more than 16 lines are concurrently
stored in inverted form. Therefore, for our 16GB memory, we
provision a 16-entry LIT in CRAM.
When a line is fetched from memory, it is not only checked
against the marker, but also against the complement of the
marker. If the line matches with the inverted value of the
marker, then we know that the line is uncompressed. However,
we do not know if the retrieved data is the original data for the
line or if the line was stored in memory in an inverted form
due to a collision with the marker. In such cases, we consult
the LIT. If the line address is present in the LIT, then we know
the line was stored in an inverted fashion and we will write
the reverted value in the LLC. Otherwise, the data obtained
from the memory is written as-is to the LLC.
On a write to the memory, if the line address is present in
the LIT, and the last four bytes of the line no longer match
with any of the markers, then we write the line in its original
form and remove this line address from the LIT. Each entry
in the LIT contains a valid bit and the line address (30 bits),
so our 16-entry LIT incurs a storage overhead of only 64
bytes. We recommend that the size of the LIT be increased in
proportion to the memory size.
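The install and fetch paths for uncompressed lines can be sketched as below. This is an illustrative model only: it assumes the fixed example markers of Figure 9 instead of per-line markers, models the LIT as a small set of line addresses, and uses hypothetical names (`LineInversionTable`, `install`, `fetch`).

```python
LIT_ENTRIES = 16
MARKERS = (bytes.fromhex("22222222"), bytes.fromhex("44444444"))  # Fig. 9 examples


def invert(data: bytes) -> bytes:
    """Bitwise complement of every byte."""
    return bytes(b ^ 0xFF for b in data)


INVERTED_MARKERS = tuple(invert(m) for m in MARKERS)


class LineInversionTable:
    def __init__(self, entries=LIT_ENTRIES):
        self.entries = entries
        self.addrs = set()  # addresses of lines currently stored inverted

    def install(self, addr, line):
        """Return the bytes to write to memory for an uncompressed line."""
        if line[-4:] in MARKERS:  # marker collision
            if addr not in self.addrs and len(self.addrs) >= self.entries:
                raise OverflowError("LIT overflow")
            self.addrs.add(addr)
            return invert(line)
        self.addrs.discard(addr)  # no collision any more: free the entry
        return line

    def fetch(self, addr, stored):
        """Return the original data for an uncompressed line read from memory."""
        if stored[-4:] in INVERTED_MARKERS and addr in self.addrs:
            return invert(stored)  # line was stored inverted
        return stored  # data is already in its original form
```

Note that a line whose data naturally ends in the complement of a marker is returned as-is, because its address is not in the LIT.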
Efficiently Handling LIT Overflows: In extremely rare
cases, the LIT can overflow, and we have two solutions to handle
this scenario: (Option-1) Make the LIT memory-mapped (one
inversion-bit for every line in memory, stored in memory)
and this can support every line in memory having a collision.
On marker-collision, the memory system has to make two
accesses: one access to the memory, and another to the LIT
to resolve collision. Under adversarial settings, the worst-case
effect would simply be twice the bandwidth consumption. We
implement updates to the LIT by resetting the LIT entry when
lines with marker-collisions are brought into the LLC and
marking these cachelines as dirty. On eviction, these lines will
be forced to go through the marker-collision check and will
appropriately set the corresponding LIT entry. (Option-2) On
an LIT overflow, CRAM can regenerate new marker values
using the random number generator, encode the entire memory
with new marker values, and resume the execution. As cases of
LIT overflows are rare (once per 10 million years), the latency
of handling LIT overflows does not affect performance.
Attack-Resilient Marker Codes: The markers in Figure 9
were chosen for simplicity of explanation. Markers generated
from simple address based hash functions can be a target for a
Denial-of-Service Attack. An adversary with knowledge of the
hash function can write data values intended to cause frequent
LIT overflows resulting in severe performance degradation.
We address this vulnerability by using a cryptographically
secure hash function (e.g., DES [19], given that marker
generation can happen off-the-critical path) to generate marker
values on a per-line basis. This would make the marker values
impractical to guess without knowledge of the secret-keys of
the hash function, which are generated randomly for each
machine. Furthermore, the secret-keys are regenerated in the
event of an LIT overflow which changes the per-line markers.
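A minimal sketch of keyed per-line marker generation follows. It substitutes keyed BLAKE2b from Python's standard library for the DES-based construction mentioned in the text, purely for illustration; the function names, the 32-byte key length, and the way the line address and marker kind are encoded are all assumptions.

```python
import hashlib
import secrets


def new_secret_key() -> bytes:
    """Generate a fresh per-machine key (also used after an LIT overflow)."""
    return secrets.token_bytes(32)


def line_marker(key: bytes, line_addr: int, kind: str) -> bytes:
    """Derive a 4-byte marker for one line; kind is e.g. '2to1' or '4to1'."""
    msg = kind.encode() + line_addr.to_bytes(8, "little")
    return hashlib.blake2b(msg, key=key, digest_size=4).digest()
```

Without the key, an adversary cannot predict which 32-bit tail would collide for a given line; regenerating the key changes every per-line marker at once.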
Efficiently Invalidating Stale Data: Compression can relocate
the lines, and, when lines get moved, they can leave behind a
potentially stale copy of the line. For example, in Figure 11,
if adjacent lines A and B became compressible (into values
A’ and B’), we could move B’ and store lines A’ and B’
together in one physical location. However, an old value of B
would still exist in the previous location. Reading the previous
location would reveal an old value of B that could still be
erroneously interpreted as a valid uncompressed cacheline.
Keeping all locations of the line in sync requires significant
bandwidth overheads. Therefore, we simply mark such lines
as invalid using a special 64-byte marker value, called Invalid
Line Marker (Marker-IL). Marker-IL is also initialized at boot
time using a randomly generated value. Per-line Marker-IL
can be generated as in Section V-A. Collisions with Marker-IL
are extremely rare (1 in 2^512 probability, less than once in a
quadrillion years), are also handled using line inversion,
and are tracked by the LIT.
[Figure 11: three panels. Before Compression: Line A and Line B in separate locations. After Compression: A' and B' share one location while a stale value of B remains. Compression with Invalidate: the stale location holds the 64B Invalid Marker.]
Fig. 11. Compression relocates lines and can create copies of data. We mark
such lines as invalid to ensure correct operation.
Handling Updates to Compressed Lines: An update to a
compressed line can render the entire group (of two or four
lines) from compressible to incompressible. Such updates must
be performed carefully so that the data of the other line(s)
in the group gets relocated to their original location(s). To
accomplish this, we need to know the compressibility of the
line when the line was obtained from memory. To track this
information, we provision 2 bits in the tag store of the LLC
that denote the compression level when the data was read
from memory. On an eviction, we can determine if the lines
were previously uncompressed, 2-to-1 compressed, or 4-to-1
compressed by checking these two bits, and we can send writes
and invalidates (when applicable) to the appropriate locations.
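The choice of write and invalidate targets on eviction can be sketched as follows. This is a simplified model with assumptions: a four-line group A, B, C, D occupies slots 0 to 3, the whole group shares one compression level, 2-to-1 packing places (A, B) in slot 0 and (C, D) in slot 2, and 4-to-1 packing places all four lines in slot 0. The names `PLACEMENT` and `eviction_plan` are hypothetical.

```python
# Slot that holds each of the lines A, B, C, D at a given compression level.
PLACEMENT = {
    1: (0, 1, 2, 3),  # uncompressed: every line in its own slot
    2: (0, 0, 2, 2),  # 2-to-1: (A, B) in slot 0, (C, D) in slot 2
    4: (0, 0, 0, 0),  # 4-to-1: all four lines in slot 0
}


def eviction_plan(prev_level, new_level):
    """Return (slots holding valid data afterwards, slots to invalidate).

    prev_level comes from the 2 bits kept in the LLC tag store;
    new_level is the compressibility found at eviction time.
    """
    before = set(PLACEMENT[prev_level])
    after = set(PLACEMENT[new_level])
    # Slots that held valid data before but not afterwards would be stale.
    return sorted(after), sorted(before - after)
```

For example, a group that was read uncompressed and is now 2-to-1 compressible keeps data in slots 0 and 2, and slots 1 and 3 receive the invalid-line marker.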
Ganged Eviction: Write-back of a cacheline that belongs to a
compressed group can require a read-modify-write operation if
the other cachelines in the group are not present in the cache.
Our design avoids this by using a ganged-eviction scheme
which forces the eviction of all members of a compressed
group if one of its members gets evicted. This ensures that all
the members of a compressed group are either simultaneously
present or absent from the LLC, effectively avoiding the need
for read-modify-write operations. Our evaluations show that
ganged eviction has negligible impact on the LLC hit rate.
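Ganged eviction can be illustrated with a small sketch. It assumes the LLC is modeled as a dictionary from line address to the compression level recorded in the tag store (1, 2, or 4), and that a compressed group is aligned to its size; `gang_members` and `evict` are hypothetical names.

```python
def gang_members(line_addr, level):
    """Line addresses that share a compressed group with line_addr.

    level is 1 (uncompressed), 2 (2-to-1), or 4 (4-to-1).
    """
    if level not in (1, 2, 4):
        raise ValueError("level must be 1, 2, or 4")
    base = line_addr - (line_addr % level)
    return list(range(base, base + level))


def evict(cache, victim):
    """Remove the victim and the rest of its group; return evicted addresses."""
    level = cache[victim]
    evicted = [a for a in gang_members(victim, level) if a in cache]
    for a in evicted:
        del cache[a]
    return evicted
```

Because the whole group leaves the cache together, the write-back always has every member's data on hand and never needs to read the compressed line from memory first.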
[Figure 12: bar chart; y-axis: Speedup; x-axis: SPEC, GAP, and MIX workloads with SPEC, GAP, MIX, and ALL27 averages; series: CRAM (Explicit Metadata) and CRAM (Implicit Metadata + LLP).]
Fig. 12. Speedup of CRAM with explicit metadata and CRAM with implicit metadata, normalized to uncompressed memory. CRAM with implicit metadata
eliminates metadata lookup and improves performance.
B. Prediction for Line Location
With implicit-metadata, CRAM can efficiently determine
the compressibility status of any line retrieved from memory.
Reading the line from an incorrect location returns the invalid-
line marker (Marker-IL). However, in such cases, a second
request must be sent to another location to obtain the line (for
example, an access to B gets routed to A because A contains
both A and B). Sending multiple accesses to retrieve a line
from memory wastes bandwidth. To obtain the line in a single
access (in the common case), we develop a Line Location
Predictor (LLP), that predicts the compressibility status of
the line. Knowing the compressibility helps in determining
the location of the line (e.g. B will be in original location if
incompressible and at A if compressible).
[Figure 13: the page address is hashed to index the Last Compressibility Table; the predicted compression status is combined with the line address to look up the predicted location.]
Fig. 13. Line Location Predictor uses line address and compressibility-
prediction (based on last-time compressibility) to predict location.
To design a low-cost LLP, we exploit the observation that
lines within a page are likely to have similar compressibility
[3] [20]. Figure 13 shows the organization of the LLP. LLP
contains the Last Compressibility Table (LCT), that tracks the
last compression status seen for a given index. The LCT is
indexed with the hash of the page address. So, for a given
access, the index corresponding to the page address is used to
predict the compressibility, then line location. The LCT is used
only when a prediction is needed (for example, A is always
resident in its own location and does not need a prediction).
We use a 512-entry LCT, so the storage overhead is 128 bytes.
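The predictor can be sketched as below. Several details are assumptions made for illustration: 4 KB pages with 64-byte lines, a simple XOR-fold as the page-address hash, an LCT that initially predicts "uncompressed", and a placement in which 2-to-1 packing puts (A, B) at A and (C, D) at C while 4-to-1 packing puts all four lines at A.

```python
LCT_ENTRIES = 512
LINES_PER_PAGE = 64  # assumes 4 KB pages and 64-byte lines


class LineLocationPredictor:
    def __init__(self):
        # Last compression level seen per index: 1, 2, or 4 (2 bits each).
        self.lct = [1] * LCT_ENTRIES

    def _index(self, line_addr):
        page = line_addr // LINES_PER_PAGE
        return (page ^ (page >> 9)) % LCT_ENTRIES  # hypothetical hash

    def predict(self, line_addr):
        """Return the line address predicted to hold line_addr's data."""
        if line_addr % 4 == 0:
            return line_addr  # first line of a group never moves
        level = self.lct[self._index(line_addr)]
        return line_addr - (line_addr % level)

    def update(self, line_addr, observed_level):
        """Record the compression level observed when the line was read."""
        self.lct[self._index(line_addr)] = observed_level
```

With 512 entries of 2 bits each, the table in this sketch occupies 128 bytes, matching the storage figure quoted above.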
[Figure 14: bar chart; y-axis: Location Prediction Accuracy (%); x-axis: SPEC and GAP workloads with SPEC, GAP, and ALL27 averages; series: Metadata-Cache Hit and CRAM Prediction Accuracy.]
Fig. 14. Probability of finding line in one access for explicit-metadata and
CRAM with LLP predictor.
With explicit-metadata, if there is a hit in the metadata
cache, we can determine the location of the line and obtain the
line in one memory access. However, a miss in the metadata
cache means we need to send two accesses, one for the metadata
and a second for the data. Figure 14 compares the hit-rate of
the metadata cache (32KB) with the prediction accuracy of
the LLP (128 bytes). Even though the LLP is quite small, it
provides an accuracy of 98%, much higher than the hit-rate of
the metadata cache. On an LLP misprediction, we re-issue the
request to the other possible locations of the line.
C. Speedup of CRAM with Optimizations
CRAM, when combined with implicit-metadata and LLP, can
accomplish the task of locating and interpreting lines, without
the need for a separate metadata lookup. Figure 12 shows
the performance of CRAM (with implicit-metadata + LLP)
compared to the basic CRAM design (with explicit-metadata).
CRAM (with implicit-metadata) eliminates the metadata lookup,
which significantly helps both compressible and incompressible
workloads. For SPEC workloads, CRAM provides a speedup.
However, for Graph workloads, CRAM still causes a slowdown.
We investigate the bandwidth consumption of CRAM to determine the cause.
D. Bandwidth Breakdown of CRAM
Figure 15 shows the bandwidth consumption of CRAM
(with implicit-metadata + LLP), normalized to uncompressed
memory. The components of bandwidth consumption of CRAM
are data, second access due to LLP mispredictions, and clean
writebacks + invalidates (for writing compressed data). High
location prediction accuracy means we are able to effectively
remove the cost of metadata lookup, except for bc twi. For
Graph workloads, the inherent cost of compression (i.e.,
compressing and writing back clean lines, and invalidating) is
the dominant source of bandwidth overhead and the cause for
performance degradation. We develop an effective scheme to
disable compression when compression degrades performance.
[Figure 15: stacked bar chart; y-axis: Normalized Bandwidth Consumption; x-axis: SPEC and GAP workloads with SPEC, GAP, and ALL21 averages; components: Data, Clean Evict + Inv., and LLP Mispredict.]
Fig. 15. Bandwidth consumption for Optimized CRAM approach, normalized
to uncompressed memory.
[Figure 16: bar chart; y-axis: Speedup; x-axis: SPEC, GAP, and MIX workloads with SPEC, GAP, MIX, and ALL27 averages; series: CRAM (Always-Compress), CRAM (Dynamic), and Ideal CRAM (no metadata + no BW for inval and compressed write).]
Fig. 16. Speedup of Static-CRAM, Dynamic-CRAM, and Ideal memory compression. Dynamic-CRAM avoids slowdown for workloads that do not benefit
from compression, and performs similar to ideal scheme with no overheads.
VI. CRAM: DYNAMIC DESIGN
Thus far, we have focused only on avoiding the metadata
access overheads of compressed memory. However, even after
eliminating all of the bandwidth overheads of the metadata,
there is still performance degradation for several workloads.
Compression requires additional writebacks to memory which
can consume additional bandwidth. For example, when a
cacheline is found to be compressible on eviction from LLC,
it needs to be written back in its compressed form to memory.
What could have been a clean evict in an uncompressed memory
is now an additional writeback, which becomes a bandwidth
overhead (see footnote 2). Additionally, CRAM requires invalidates to be sent,
which further adds to the bandwidth cost of implementing
compression. In general, if the workload has enough reuse
and spatial locality, the bandwidth cost of compression yields
bandwidth savings in the long run. But for a workload with poor
reuse and spatial locality (such as several Graph workloads),
the cost of compression does not get recovered, causing
performance degradation.
A. Design of Dynamic-CRAM
We can avoid the degradation by dynamically disabling com-
pression, when compression is found to degrade performance.
Doing so would return the workload to its baseline performance
with an uncompressed memory. We call this design Dynamic-CRAM.
Dynamic-CRAM compares at runtime the “bandwidth
cost of doing compression” with the “bandwidth benefits from
compression”, and enables or disables compression based
on this cost-benefit analysis.
Bandwidth Cost of Compression: The bandwidth overhead
of compression comes from sending extra writebacks (com-
pressed writebacks from clean locations), sending invalidates,
and sending requests to mispredicted locations. These are
additional requests incurred due to compression that could
have been avoided if we had used an uncompressed design.
Bandwidth Benefits of Compression: Compression pro-
vides bandwidth benefits by enabling bandwidth-free prefetch-
ing. On reading a compressed line, adjacent lines get fetched
without any extra bandwidth. This saves bandwidth if the
prefetched lines are useful. Tracking useful prefetches can
allow us to determine the benefits from compression.
Footnote 2: CRAM installs new pages in an uncompressed form to avoid inaccurate
prefetches. By compressing adjacent lines that are evicted from the LLC, we
can ensure that prefetches are done only when the neighboring lines have been
previously accessed together and are thus expected to be useful.
Dynamic-CRAM monitors the bandwidth costs and benefits
of compression at run-time, to determine if compression should
be enabled or disabled. To efficiently implement Dynamic-
CRAM, we use set-sampling, whereby a small fraction of sets
in the LLC (1% in our study) always implement compression
and we track the cost-benefit statistics only for the sampled sets.
The decision for the remaining sets (99%) is determined
by the cost-benefit analysis on the sampled sets, as shown in
Figure 17. To track the cost and benefit of compression, we
use a simple saturating counter. The counter is decremented
on seeing the bandwidth cost and is incremented on seeing the
bandwidth benefit of compression. The Most Significant Bit
(MSB) of the counter determines if the compression should be
enabled or disabled for the remaining sets. We use a 12-bit
counter in our design. We extend Dynamic-CRAM to support
per-core decision by maintaining a 12-bit counter per core and
a 3-bit tag storage for the lines in the sampled sets to identify
the core that requested the cacheline.
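The cost-benefit counter can be sketched as follows. This is an illustration under assumptions: every cost or benefit event moves the counter by one, and the counter starts at the midpoint with compression enabled; the text does not specify either detail. The class and method names are hypothetical.

```python
COUNTER_BITS = 12
COUNTER_MAX = (1 << COUNTER_BITS) - 1  # 4095


class UtilityCounter:
    def __init__(self):
        # Assumed starting point: midpoint, so the MSB is set (enabled).
        self.value = 1 << (COUNTER_BITS - 1)

    def benefit(self):
        """Useful prefetch observed in a sampled set."""
        self.value = min(COUNTER_MAX, self.value + 1)

    def cost(self):
        """Compressed writeback of a clean line, invalidate, or misprediction."""
        self.value = max(0, self.value - 1)

    def compression_enabled(self):
        """The most significant bit decides the policy for non-sampled sets."""
        return (self.value >> (COUNTER_BITS - 1)) == 1
```

Only events from the sampled sets update the counter; the remaining sets simply read `compression_enabled()` when deciding whether to write lines back in compressed form.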
[Figure 17: sampled sets (always compress) update a saturating utility counter (0 to 4096). The counter is incremented on (1) a useful prefetch and decremented on (2) a compressed writeback, (3) a misprediction, and (4) an invalidate request. The counter enforces the enable/disable policy on the other sets.]
Fig. 17. Dynamic-CRAM analyzes cost-benefit of compression on sampled
sets. This analysis determines compression policy for the other sets.
B. Effectiveness of Dynamic-CRAM
Figure 16 shows the performance of Optimized CRAM (that
always tries to compress) and Dynamic-CRAM. CRAM without
the Dynamic optimization improves performance for SPEC
workloads but degrades performance for GAP workloads.
Dynamic-CRAM eliminates all of the degradation, ensuring
robust performance: the design obtains speedup when
compression is beneficial and avoids degradation when
compression is harmful. On average, Dynamic-CRAM provides 6%
performance improvement, capturing nearly two-thirds of the
benefit of an idealized compression design that does not incur
any bandwidth overheads for implementing compression. Thus,
Dynamic-CRAM is a robust and efficient way to implement
hardware-based main memory compression.
[Figure 18: S-curve; y-axis: Speedup.]
Fig. 18. S-curve showing speedup of Dynamic-CRAM for 64 workloads, sorted by speedup.
VII. RESULTS AND ANALYSIS
A. Storage Overhead of CRAM Structures
CRAM can be implemented with minor changes at the
memory controller. Table III shows the storage overheads
required for implementing Dynamic-CRAM. The total storage
of the additional structures at the memory controller is less
than 300 bytes. In addition to these structures, CRAM needs
2 bits in the tag-store of each line in the LLC to track prior
compressibility. Per-core Dynamic-CRAM also needs 4 bits
per line in the sampled sets (1%) for reuse and core id.
TABLE III
STORAGE OVERHEAD OF CRAM STRUCTURES
Structure                        Storage Cost
Marker for 2-to-1                4 Bytes
Marker for 4-to-1                4 Bytes
Marker for Invalid Line          64 Bytes
Line Inversion Table (LIT)       64 Bytes
Line Location Predictor (LLP)    128 Bytes
Dynamic-CRAM counter             12 Bytes
Total                            276 Bytes
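The entries of Table III can be cross-checked with a few lines of arithmetic. Two inputs are inferences rather than statements from the text: a 2-bit LCT entry (enough for three compression levels) and eight cores (suggested by the 3-bit core id), each with one 12-bit counter.

```python
BITS_PER_BYTE = 8

lit_bits = 16 * (1 + 30)  # 16 entries: valid bit plus 30-bit line address
lct_bits = 512 * 2        # 512 entries, assumed 2 bits each
counter_bits = 8 * 12     # assumed 8 cores, one 12-bit counter per core

storage_bytes = {
    "Marker for 2-to-1": 4,
    "Marker for 4-to-1": 4,
    "Marker for Invalid Line": 64,
    "Line Inversion Table (LIT)": 64,  # 496 bits = 62 bytes, listed as 64
    "Line Location Predictor (LLP)": lct_bits // BITS_PER_BYTE,
    "Dynamic-CRAM counter": counter_bits // BITS_PER_BYTE,
}

total = sum(storage_bytes.values())
print(f"LIT needs {lit_bits / BITS_PER_BYTE:.0f} bytes; total = {total} bytes")
```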
B. Extended Evaluation
We perform our study on 27 workloads that are memory
intensive. Figure 18 shows the speedup with Dynamic-CRAM
across an extended set of 64 workloads (29 SPEC2006, 23
SPEC2017, 6 GAP, and 6 mixes), including ones that are
not memory intensive. Dynamic-CRAM is robust in terms of
performance, as it avoids degradation for any of the workloads
while retaining improvement when compression helps.
C. Impact on Energy and Power
Figure 19 shows the power, energy consumption and energy-
delay-product (EDP) of a system using Dynamic-CRAM,
normalized to a baseline uncompressed main memory. Energy
consumption is reduced as a consequence of fewer
requests to main memory. Overall, Dynamic-CRAM reduces
energy by 5% and improves EDP by 10%.
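The EDP figure is roughly consistent with the reported energy and speedup, as the following back-of-the-envelope check suggests. It uses the standard definition EDP = energy x delay and treats the ALL27 average speedup from Table V as the delay reduction; combining averages this way is an approximation, not the paper's methodology.

```python
energy = 0.95          # normalized energy (5% reduction, from the text)
speedup = 1.055        # ALL27 average speedup of Dynamic-CRAM (Table V)
delay = 1.0 / speedup  # normalized execution time
edp = energy * delay   # normalized energy-delay product

print(f"normalized EDP = {edp:.3f}")
```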
[Figure 19: bar chart; bars: Speedup, Power, Energy, and EDP; y-axis: Normalized to Baseline.]
Fig. 19. Dynamic-CRAM impact on energy and power
D. CRAM Sensitivity to Number of Memory Channels
CRAM offers bandwidth-free adjacent-line prefetch, which
provides latency benefits that exist regardless of the number of
memory channels. Table IV shows that CRAM consistently
provides a speedup of about 5% even with a larger number of channels.
TABLE IV
CRAM SENSITIVITY TO NUMBER OF MEMORY CHANNELS
Num. Channels    Avg. Speedup of CRAM
1                4.8%
2                5.5%
4                4.6%
E. Comparison to Larger Fetch for L3
CRAM can install adjacent lines from the memory to the
L3 cache. While this may seem similar to prefetching, we note
there is a fundamental difference. CRAM installs additional
lines in L3 only when those lines are obtained without any
bandwidth overhead. Meanwhile, prefetches result in an extra
memory access, which consumes additional bandwidth. We compare
the performance of next-line prefetching and Dynamic-CRAM in
Table V. Next-line prefetching causes an average slowdown
of 10%, while CRAM achieves a speedup of 6% as it obtains
adjacent lines without the bandwidth cost.
TABLE V
COMPARISON OF CRAM TO NEXT-LINE PREFETCH
Workloads    Next-Line Prefetch    Dynamic-CRAM
SPEC         -5.7%                 +8.5%
GAP          -21.1%                +0.0%
MIX          -7.3%                 +4.2%
ALL27        -9.7%                 +5.5%
VIII. RELATED WORK
To the best of our knowledge, this is the first paper to
propose a robust hardware-based main-memory compression
for bandwidth improvement, without requiring any OS-support
and without causing changes to the memory organization and
protocols. We discuss prior research related to our study.
A. Low-Latency Compression Algorithms
As decompression latency is in the critical path of memory
accesses, hardware compression techniques typically use simple
per-line compression schemes [6] [7] [8] [9] [10] [11] [12]. We
evaluate CRAM using a hybrid compression scheme based on FPC [6] and
BDI [10]. However, CRAM is orthogonal to the compression
algorithm and can be implemented with any compression
algorithm, including dictionary-based [21] [22] [23] [24] [25].
B. Main Memory Compression
Hardware-based memory compression has been applied to
increase the capacity of main memory [1] [2] [3]. To locate
the line, these approaches extend the page table entries to
include information on the compressibility of the page. These
approaches are attractive as they allow locating and interpreting
lines using the TLB. However, such approaches inherently
require software-support (from the OS or hypervisor) that limits
their applicability. We want a design that can be built entirely
in hardware, without any OS support.
[Figure 20: bar chart; y-axis: Speedup; x-axis: SPEC, GAP, and MIX workloads with SPEC, GAP, MIX, and ALL27 averages; series: Explicit-Metadata Optimized for Row-Buffer Hits (Memzip / LCP) and Dynamic CRAM.]
Fig. 20. Performance of schemes that need explicit metadata management (Memzip, LCP) optimized for row-buffer hits. Prior approaches still require
significant bandwidth to retrieve and update metadata.
Several studies [5] [26] [27] propose to send compressed
data across links in smaller bursts, and send additional ECC or
metadata bits when there is still room in a burst length. These
proposals try to improve the bandwidth of the memory system
by sending fewer bursts per memory request. However, these
proposals require either non-traditional data organization (such
as MiniRank) [28] [29] [30] or changes to the bus protocols
or both. CRAM enables compressed memory systems with
existing memory devices and protocols.
Prior studies [3] [5] have advocated reducing the latency
for metadata lookups by placing the metadata in the same
row buffer as the data line. However, this does not reduce the
bandwidth required to obtain the metadata. For comparison, we
implement an explicit-metadata scheme optimized to access the
same row as the data line. Figure 20 compares the performance
of Dynamic-CRAM with optimized explicit-metadata provisioned
with a 32KB metadata cache. The bandwidth overhead
of obtaining metadata is still significant, causing slowdown,
whereas Dynamic-CRAM provides performance improvement.
COP [31] proposes in-lining ECC into compressed lines
and uses the ECC as markers to identify compressed lines.
Unfortunately, COP is designed to provide reliability at low
cost and provides no performance benefit if the system does
not need ECC or already has an ECC-DIMM. In contrast,
CRAM is designed to provide bandwidth benefits by fetching
multiple lines, and helps regardless of whether the system
has ECC-DIMM or not. Furthermore, COP relies on a fairly
complex mechanism to handle marker collisions (locking lines
in the LLC, memory-mapped linked-list, etc.), whereas CRAM
handles marker collisions efficiently via data inversion.
C. SRAM-Cache Compression
Prior work has looked at using compression to increase
capacity of on-chip SRAM caches. Cache compression is
typically done by accommodating more ways in a cache set
and statically allocating more tags [11] [32]. Recent proposals,
such as SCC, investigate reducing SRAM tag overhead by
sharing tags across contiguous sets [33] [34] [35]. Compressed
caches typically obtain compression metadata by storing
metadata beside tag and retrieving them along with tag accesses.
However, these approaches do not scale for memory, as there is
no tag space or tag lookup to enable easy access to metadata.
Our restricted data-mapping in CRAM is inspired by the
placement in SCC [34], in that the location of the line gets
determined by compressibility. However, unlike SCC, our
placement ensures that a significant fraction of lines do not
change their locations, regardless of their compression status.
Furthermore, SCC requires skewed-associative lookup of all
possible positions, which is possible to do in a cache; however,
such unrestricted probes of all possible placement locations
would incur intolerable bandwidth overheads in main memory.
D. Adaptive Cache-Compression
Prior works have looked at adaptive or dynamic cache
compression [11] [32] [36] [37] to avoid performance degradation
due to latency overheads of decompression or due
to extra misses caused by sub-optimal replacement in compressed
caches. These designs primarily target cache hit
rate, whereas our main memory proposal targets bandwidth
overheads inherent in memory compression (metadata or
compressed writes). Additionally, fine-grain adaptive memory
compression has been previously unexplored, as prior
approaches have had no capability to turn compression off
(except by an expensive global operation).
E. Predicting Cache Indices
Several studies have looked at predicting indices in asso-
ciative caches [38] [39] [40] [41] [42] [43] [44] [45] [20]. A
cache can verify such predictions simply by checking the tag,
and issuing a second request in case of a misprediction. Our
work is quite different from these, in that we try to predict
the location for memory. Since memory does not provide tags
to identify the data like caches, we verify our prediction by
integrating implicit-metadata within the line, allowing memory
accesses to provide information about whether the location
contains compressed data or not. Our predictors utilize this
implicit-metadata to verify the location prediction and issue a
request to an alternate location on a misprediction.
IX. CONCLUSIONS
This paper investigates practical designs for main-memory
compression to obtain higher memory bandwidth. The proposed
design, CRAM, is hardware-based, does not require any
OS/hypervisor support, or changes to the memory modules
or access protocols. We show that for compressed memory
designs, the bandwidth overheads of accessing metadata can
be significant enough to cause slowdown for several workloads.
We propose the implicit-metadata design, based on marker
values, to eliminate the storage and bandwidth overheads of
the metadata access. We also propose a simple and effective
predictor to predict the location of the line in compressed
memory, and a dynamic scheme to disable compression when
compression degrades performance. Our proposed design
provides an average speedup of 6%, and avoids slowdown
for any of the workloads. This design can be implemented
with minor additions to the memory controller.
REFERENCES
[1] B. Abali, H. Franke, X. Shen, D. Poff, and T. Smith, “Performance of
hardware compressed main memory,” in High-Performance Computer
Architecture, 2001. HPCA. The Seventh International Symposium on,
2001.
[2] M. Ekman and P. Stenstrom, “A robust main-memory compression
scheme,” in ACM SIGARCH Computer Architecture News, vol. 33, no. 2.
IEEE Computer Society, 2005.
[3] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. B. Gibbons,
M. A. Kozuch, and T. C. Mowry, “Linearly compressed pages: A
low-complexity, low-latency main memory compression framework,” in
Proceedings of the 46th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO-46. New York, NY, USA: ACM, 2013.
[Online]. Available: http://doi.acm.org/10.1145/2540708.2540724
[4] Qualcomm, “Qualcomm centriq 2400 processor,” https://www.qualcomm.
com/media/documents/files/qualcomm-centriq-2400-processor.pdf, 2017,
[Online].
[5] A. Shafiee, M. Taassori, R. Balasubramonian, and A. Davis, “Memzip:
Exploring unconventional benefits from memory compression,” in
High Performance Computer Architecture (HPCA), 2014 IEEE 20th
International Symposium on. IEEE, 2014.
[6] A. R. Alameldeen and D. A. Wood, “Frequent pattern compression: A
significance-based compression scheme for l2 caches,” Dept. Comp. Scie.,
Univ. Wisconsin-Madison, Tech. Rep, vol. 1500, 2004.
[7] J. Dusser, T. Piquet, and A. Seznec, “Zero-content augmented caches,”
in Proceedings of the 23rd International Conference on Supercomputing,
ser. ICS ’09. New York, NY, USA: ACM, 2009. [Online]. Available:
http://doi.acm.org/10.1145/1542275.1542288
[8] Y. Zhang, J. Yang, and R. Gupta, “Frequent value locality and value-
centric data cache design,” in ACM SIGOPS Operating Systems Review,
vol. 34, no. 5. ACM, 2000.
[9] J. Yang, Y. Zhang, and R. Gupta, “Frequent value compression
in data caches,” in Proceedings of the 33rd Annual ACM/IEEE
International Symposium on Microarchitecture, ser. MICRO 33.
New York, NY, USA: ACM, 2000. [Online]. Available: http:
//doi.acm.org/10.1145/360128.360154
[10] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Base-delta-immediate compression: practical data
compression for on-chip caches,” in Proceedings of the 21st international
conference on Parallel architectures and compilation techniques. ACM,
2012.
[11] A. R. Alameldeen, D. Wood et al., “Adaptive cache compression for high-
performance processors,” in Computer Architecture, 2004. Proceedings.
31st Annual International Symposium on. IEEE, 2004.
[12] J. Kim, M. Sullivan, E. Choukse, and M. Erez, “Bit-plane compression:
Transforming delta for better compression in many-core architectures,”
in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual Interna-
tional Symposium on, 2016.
[13] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi,
A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah
simulated memory module,” University of Utah, Tech. Rep, 2012.
[14] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi,
“Pinpointing representative portions of large intel itanium programs with
dynamic instrumentation,” in Microarchitecture, 2004. MICRO-37 2004.
37th International Symposium on, Dec 2004.
[15] J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH
Comput. Archit. News, vol. 34, Sep. 2006. [Online]. Available:
http://doi.acm.org/10.1145/1186736.1186737
[16] S. P. E. Corporation, “Spec cpu 2017,” 2017, accessed: 2017-11-10.
[Online]. Available: https://www.spec.org/cpu2017/
[17] S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP
benchmark suite,” CoRR, vol. abs/1508.03619, 2015. [Online]. Available:
http://arxiv.org/abs/1508.03619
[18] T. A. Davis and Y. Hu, “The university of florida sparse matrix
collection,” ACM Trans. Math. Softw., vol. 38, Dec. 2011. [Online].
Available: http://doi.acm.org/10.1145/2049662.2049663
[19] D. Coppersmith, “The data encryption standard (DES) and its strength
against attacks,” IBM J. Res. Dev., vol. 38, May 1994. [Online].
Available: http://dx.doi.org/10.1147/rd.383.0243
[20] V. Young, P. J. Nair, and M. K. Qureshi, “DICE: Compressing
DRAM caches for bandwidth and capacity,” in Proceedings of the
44th Annual International Symposium on Computer Architecture, ser.
ISCA ’17. New York, NY, USA: ACM, 2017. [Online]. Available:
http://doi.acm.org/10.1145/3079856.3080243
[21] X. Chen, L. Yang, R. P. Dick, L. Shang, and H. Lekatsas, “C-pack: A
high-performance microprocessor cache compression algorithm,” Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18,
2010.
[22] T. M. Nguyen and D. Wentzlaff, “MORC: A manycore-oriented compressed
cache,” in Microarchitecture (MICRO), 2015 48th Annual IEEE/ACM
International Symposium on. IEEE, 2015.
[23] A. Arelakis and P. Stenstrom, “SC2: A statistical compression cache
scheme,” in Computer Architecture (ISCA), 2014 ACM/IEEE 41st
International Symposium on, June 2014.
[24] A. Arelakis, F. Dahlgren, and P. Stenstrom, “HyComp: A hybrid cache
compression method for selection of data-type-specific compression
methods,” in Proceedings of the 48th International Symposium on
Microarchitecture, ser. MICRO-48. New York, NY, USA: ACM, 2015.
[Online]. Available: http://doi.acm.org/10.1145/2830772.2830823
[25] Y. Tian, S. M. Khan, D. A. Jiménez, and G. H. Loh, “Last-level cache
deduplication,” in Proceedings of the 28th ACM International Conference
on Supercomputing, ser. ICS ’14. New York, NY, USA: ACM, 2014.
[Online]. Available: http://doi.acm.org/10.1145/2597652.2597655
[26] V. Sathish, M. J. Schulte, and N. S. Kim, “Lossless and lossy
memory I/O link compression for improving performance of GPGPU
workloads,” in Proceedings of the 21st International Conference
on Parallel Architectures and Compilation Techniques, ser. PACT
’12. New York, NY, USA: ACM, 2012. [Online]. Available:
http://doi.acm.org/10.1145/2370816.2370864
[27] H. Kim, P. Ghoshal, B. Grot, P. V. Gratz, and D. A. Jiménez, “Reducing
network-on-chip energy consumption through spatial locality speculation,”
in Proceedings of the Fifth ACM/IEEE International Symposium on
Networks-on-Chip, ser. NOCS ’11. New York, NY, USA: ACM, 2011.
[Online]. Available: http://doi.acm.org/10.1145/1999946.1999983
[28] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu, “Mini-
rank: Adaptive DRAM architecture for improving memory power efficiency,”
in 2008 41st IEEE/ACM International Symposium on Microarchitecture,
Nov 2008.
[29] D. H. Yoon, M. K. Jeong, and M. Erez, “Adaptive granularity memory
systems: A tradeoff between storage efficiency and throughput,” in
Proceedings of the 38th Annual International Symposium on Computer
Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011.
[Online]. Available: http://doi.acm.org/10.1145/2000064.2000100
[30] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber,
“Future scaling of processor-memory interfaces,” in Proceedings of the
Conference on High Performance Computing Networking, Storage and
Analysis, Nov 2009.
[31] D. J. Palframan, N. S. Kim, and M. H. Lipasti, “COP: To compress and
protect main memory,” in 2015 ACM/IEEE 42nd Annual International
Symposium on Computer Architecture (ISCA), June 2015.
[32] J. Gaur, A. R. Alameldeen, and S. Subramoney, “Base-victim compression:
An opportunistic cache compression architecture,” in Computer
Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium
on, 2016.
[33] S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting
spatial locality for energy-optimized compressed caching,” in Proceedings
of the 46th Annual IEEE/ACM International Symposium on Microarchitecture.
ACM, 2013.
[34] S. Sardashti, A. Seznec, D. Wood et al., “Skewed compressed caches,” in
Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International
Symposium on. IEEE, 2014.
[35] B. Panda and A. Seznec, “Dictionary sharing: An efficient cache
compression scheme for compressed caches,” in 49th Annual IEEE/ACM
International Symposium on Microarchitecture, 2016.
[36] Y. Xie and G. H. Loh, “Thread-aware dynamic shared cache compression
in multi-core processors,” in 2011 IEEE 29th International Conference
on Computer Design (ICCD), Oct 2011.
[37] S. Kim, S. Lee, T. Kim, and J. Huh, “Transparent dual memory
compression architecture,” in 2017 26th International Conference on
Parallel Architectures and Compilation Techniques (PACT), Sept 2017.
[38] A. Agarwal, J. Hennessy, and M. Horowitz, “Cache performance
of operating system and multiprogramming workloads,” ACM
Trans. Comput. Syst., vol. 6, Nov. 1988. [Online]. Available:
http://doi.acm.org/10.1145/48012.48037
[39] A. Agarwal and S. D. Pudar, Column-associative caches: A technique for
reducing the miss rate of direct-mapped caches. ACM, 1993, vol. 21,
no. 2.
[40] J. J. Valls, A. Ros, J. Sahuquillo, and M. E. Gomez, “Ps-cache: An
energy-efficient cache design for chip multiprocessors,” J. Supercomput.,
vol. 71, Jan. 2015. [Online]. Available:
http://dx.doi.org/10.1007/s11227-014-1288-5
[41] B. Calder, D. Grunwald, and J. Emer, “Predictive sequential associative
cache,” in Proceedings of the 2nd IEEE Symposium on High-
Performance Computer Architecture, ser. HPCA ’96. Washington,
DC, USA: IEEE Computer Society, 1996. [Online]. Available:
http://dl.acm.org/citation.cfm?id=525424.822662
[42] D. H. Albonesi, “Selective cache ways: On-demand cache resource
allocation,” in Microarchitecture, 1999. MICRO-32. Proceedings. 32nd
Annual International Symposium on. IEEE, 1999.
[43] M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, and
K. Roy, “Reducing set-associative cache energy via way-prediction
and selective direct-mapping,” in Proceedings of the 34th Annual
ACM/IEEE International Symposium on Microarchitecture, ser. MICRO
34. Washington, DC, USA: IEEE Computer Society, 2001. [Online].
Available: http://dl.acm.org/citation.cfm?id=563998.564007
[44] H.-C. Chen and J.-S. Chiang, “Low-power way-predicting cache using
valid-bit pre-decision for parallel architectures,” in 19th International
Conference on Advanced Information Networking and Applications
(AINA’05) Volume 1 (AINA papers), vol. 2, March 2005.
[45] A. Deb, P. Faraboschi, A. Shafiee, N. Muralimanohar, R. Balasubramonian,
and R. Schreiber, “Enabling technologies for memory compression:
Metadata, mapping, and prediction,” in 2016 IEEE 34th International
Conference on Computer Design (ICCD), Oct 2016.
