Surface Compression Using Dynamic Color Palettes by Gubran, Ayub A. et al.
Surface Compression Using Dynamic Color Palettes
Ayub A. Gubran
University of British Columbia
Vancouver, BC, Canada
ayoubg@ece.ubc.ca
Felix Huang
Carnegie Mellon University
Pittsburgh, PA, United States
felixh@andrew.cmu.edu
Tor M. Aamodt
University of British Columbia
Vancouver, BC, Canada
aamodt@ece.ubc.ca
ABSTRACT
Off-chip memory traffic is a major source of power and
energy consumption on mobile platforms. A large amount
of this off-chip traffic is used to manipulate graphics
framebuffer surfaces. To cut down the cost of accessing
off-chip memory, framebuffer surfaces are compressed
to reduce the bandwidth consumed on surface manipu-
lation when rendering or displaying.
In this work, we study the compression properties
of framebuffer surfaces and highlight the fact that sur-
faces from different applications have different compres-
sion characteristics. We use the results of our analysis
to propose a scheme, Dynamic Color Palettes (DCP),
which achieves higher compression rates with UI and
2D surfaces.
DCP is a hardware mechanism for exploiting inter-
frame coherence in lossless surface compression; it im-
plements a scheme that dynamically constructs color
palettes, which are then used to efficiently compress
framebuffer surfaces. To evaluate DCP, we created an
extensive set of OpenGL workload traces from 124 An-
droid applications. We found that DCP improves com-
pression rates by 91% for UI and 20% for 2D applica-
tions compared to previous proposals [1, 2]. We also
evaluate a hybrid scheme that combines DCP with a
generic compression scheme [1], and found that com-
pression rates improve over previous proposals [1, 2] by
161%, 124% and 83% for UI, 2D and 3D applications,
respectively.
1. INTRODUCTION
Off-chip memory traffic, including that of framebuffer
surfaces, is one of the major sources of power consump-
tion on mobile systems-on-chip (SoCs). In some cases,
the energy consumption to access data on the off-chip
memory can dominate that from computations [3]. In
this work, we study the properties of framebuffer sur-
faces and propose a set of unique compression tech-
niques to reduce the bandwidth consumed by frame-
buffer operations.
In graphics rendering, a framebuffer surface is an off-
chip memory space that contains pixels generated by
the graphics processing unit (GPU) and then used by
the display controller to read pixels to the screen. In
some cases, the display controller operates on multiple
framebuffer surfaces, which are composited to a single
surface for screen display. GPUs can also use framebuffer
surfaces as inputs to additional rendering stages,
e.g., render to texture and deferred shading; as a result,
any given application may utilize one or more framebuffer
surfaces.
This work studies a large set of Android workloads to
infer the compression properties of framebuffer surfaces
generated by mobile UI, 2D and 3D applications. Our
study found that framebuffers from different classes of
workloads have different compression properties. We
exploit these properties to propose an effective palette-
based framebuffer compression scheme that focuses on
common UI and 2D applications. In addition, we ex-
ploit temporal coherence in graphics, where applications
exhibit minor changes between frames that can be ex-
ploited for compression.
Using temporal coherence, and by focusing on common
use cases, we propose and evaluate our Dynamic
Color Palettes (DCP) technique. DCP uses palette-based
compression and focuses on reducing the traffic caused
by framebuffer operations in UI and 2D applications.
To evaluate our compression scheme, we created an ex-
tensive set of workloads from 124 Android applications.
We show that by combining DCP with other compres-
sion techniques [1], DCP is able to improve compression
rates between 83% and 161% across UI, 2D and 3D ap-
plications.
This paper makes the following contributions:
1. Characterizes compression properties of framebuffer
surfaces from user-interface (UI) as well as non-UI
2D and 3D applications;
2. Uses characterization results to propose and eval-
uate dynamic color palettes (DCP), a compression
technique that offers higher compression rates for
common UI and non-UI 2D applications;
3. Proposes two DCP variations that dynamically choose
an optimal palette size based on the frequencies of
the values in color palettes;
4. Evaluates our compression schemes using an extensive
set of workloads created from the OpenGL
traces of 124 Android applications.
arXiv:1903.06658v1 [cs.GR] 19 Jan 2019
[Figure 1 diagram: at the app level, Applications 1 to 3 issue graphics API calls through the OS graphics libraries; the CPU (software rendering) or GPU (hardware rendering) writes Render Framebuffers 1 to 3, at refresh rates that vary by application and user activity (1); the system compositor (e.g., Android's SurfaceFlinger), called upon updating any of the framebuffers, uses GPU or CPU composition to produce the display framebuffer (2); the display controller reads it to the device screen at the screen refresh rate (i.e., 60 FPS) (3). Compression is marked at each of the three stages.]
Figure 1: The life-cycle of a surface from rendering to display. In (1), applications render to their corresponding
framebuffers. In (2), a compositor combines the surfaces generated by different applications. In (3), the composited
surface is used and displayed on the screen by the display controller.
2. BACKGROUND AND RELATED WORK
2.1 The life-cycle of a framebuffer surface
Figure 1 summarizes the life-cycle of a framebuffer surface
in contemporary mobile systems (Android Ice Cream
Sandwich 4.0 and later [4]).
Figure 1 shows a typical scenario of drawing mul-
tiple surfaces simultaneously from multiple processes:
the status bar, Facebook, and the navigation bar. Each
process independently renders to its own surface (1);
for example, Facebook renders a new surface when the
user scrolls or clicks, while the navigation bar updates
the corresponding surface when the user clicks on one
of its buttons.
For display, a system compositor, such as Surface-
Flinger [4] in Android, combines surfaces from multiple
applications before sending them to the screen (2).
The compositor actively monitors the surfaces of all ap-
plications and when a process updates a surface, the
compositor subsequently updates the composited sur-
face. Simultaneously, the display controller hardware
continuously reads the composited surface to the screen
at 60 frames per second (FPS) or higher (3). Note that
because using the same surface for updates and screen
refresh operations can cause artifacts, such as flickering
and tearing, double (or triple) buffering is used [5].
The example in Figure 1 shows how a surface can be
used and re-used multiple times and this is why it is
important to reduce the overhead of framebuffer ma-
nipulation through compression.
2.2 Surface compression techniques
Surface compression is used to reduce off-chip mem-
ory traffic, which can improve performance and/or re-
duce energy consumption. Graphics pipeline implemen-
tations utilize compression for textures [6], surfaces [1,
7, 8], depth [9] and vertex data [10].
Many of these compression techniques use lossy as well
as lossless compression. For framebuffer
surfaces, lossless compression is used to avoid error accumulation
upon reading then re-writing surfaces (as is
the case with composition).
Surface compression differs from texture compression
in that both encoding and decoding are performed
in real time. Also, unlike surface compression
schemes, most texture compression algorithms are
lossy [6, 11, 12].
Another crucial aspect of surface compression is random
accessibility. Techniques like Run-Length Encoding
(RLE) are unable to provide such accessibility. However,
it is important to be able to randomly access a surface
when it is used for sampling (e.g., used as a texture),
resizing, or composition. Compression algorithms have
used block-based schemes to enable random access for
their simplicity and practicality. Block-based compres-
sion mechanisms define preconfigured compression sizes
that allow random access to compressed surfaces. Block-
based mechanisms have been used for compressing inte-
ger (e.g., RGBA) surfaces [1], floating-point surfaces [13,
14] and depth buffers [9, 14].
The work by Rasmusson et al. [1] (which we refer
to as RAS) evaluated several surface compression proposals
[15, 16, 17] and compared them against their
technique. RAS is a lossless block-based compression
technique for integer buffers that encodes the difference
between adjacent pixel values. RGB pixels are converted
to the YCoCg (luminance-chrominance) format
to increase compression efficiency. We compare against
RAS in this paper since it reports better compression
results than prior work.
We also evaluate our scheme against the compression
scheme proposed by Nvidia [2]. In this scheme, for each
block going to memory, the algorithm checks if 4×2
pixels in sub-blocks within a block are identical. If so,
the block is compressed 1:8. When that is not possible,
the algorithm then checks if 2×2 regions have identical
colors, if so the block is compressed 1:4, otherwise the
block remains uncompressed. This algorithm works well
with regions of identical color values, as is the case with
UI surfaces.
Other compression work includes the work by Daniel-
son [18], which proposes using a dictionary-based com-
pression in which the operating system and/or program
specify the colors to configure a dictionary. In contrast
to Danielson’s work, our work exploits temporal co-
herence to dynamically construct dictionaries (palettes)
avoiding the need for software changes.
Another work, by Shim et al. [19], uses a dictionary-based
compression mechanism targeted at display buffer
compression. Shim et al.'s approach compresses surfaces
using Huffman coding after rendering is completed
to reduce the bandwidth of display refresh operations.
Rendered surfaces are read to construct critical color
differences which are used in a second stage to con-
struct a Huffman dictionary. The third stage re-reads
the surface buffer and writes out a compressed buffer
that is then used for screen refresh operations. In contrast
to previous work, we propose employing temporal
coherence to predict the values for the dictionary, which avoids
submitting uncompressed surfaces to memory or requiring
additional surface read/write operations. We also
propose an adaptive compression scheme that avoids
Huffman coding inefficiencies with probability distributions
that are not exact powers of two.
Finally, a body of work has exploited temporal co-
herence in real-time rendering through inter-frame data
reuse. These techniques, in addition to off-line rendering
techniques like ray-tracing, are summarized in the
survey work of Scherzer et al. [20]. Here we propose a
different application for temporal coherence by exploiting
it for compression.
2.3 GPU Architectures
GPU architectures are broadly categorized as either
tile-based or immediate-mode architectures. Tile-based
architectures aim to save bandwidth by handling all
raster operations, like blending and depth testing, using
an on-chip buffer. Most mobile GPUs are tile-based
architectures (including Qualcomm, Imagination and
ARM GPUs).
In tiled rendering, the screen is divided into render
tiles (e.g., 32×32 or 64×64 pixels). For each tile, the
GPU renders all primitives that map to that tile using
an on-chip buffer before committing that buffer to the
off-chip memory. As a result, what is being compressed
and sent to the off-chip memory is the final surface value
of each tile.
On the other hand, immediate-mode architectures
render primitives in their drawing order. They avoid
the overhead of the tiling process, but the values sent
to the off-chip memory will contain intermediate sur-
face values. In the case of overdrawing (i.e., when a
pixel location is covered by more than one primitive),
the same memory location will be written to multiple
times. When a compression scheme is deployed in an
immediate-mode GPU, it will compress the values sent
to the off-chip memory as rendering progresses; this
means compressing blocks from the GPU’s LLC instead
of an on-chip buffer.
As tile-based architectures are the dominant choice
for mobile GPUs, going forward, we assume a tile-based
architecture when evaluating surfaces for compression.
2.4 Mobile Use Patterns
In this work, we focus on developing an effective scheme
for compressing UI and 2D framebuffer surfaces. The
reason for this choice is that studies found that users
spend 70% of their time running UI applications [21,
22], where over half of the time is spent on web browsing,
messaging and social media alone, whereas games
of all types, 2D and 3D, account for only 30% of the
usage time. Thus, we saw the opportunity to design
an effective scheme that targets such common use
cases. In Section 7.5, we show how our DCP scheme can
be combined with other generic compression algorithms
to provide a comprehensive compression solution for all
use cases.
3. TEMPORAL COHERENCE IN MOBILE GRAPHICS
Temporal coherence is the property of inter-frame
similarity [20]; this means that in a sequence of frames,
content only gradually changes from one frame to the
next. To quantify temporal coherence, we use two mea-
surements: Color change and Pixel change.
Color change is the total difference in pixel color frequencies
between two frames regardless of the locations
of the pixels. Pixel change, on the other hand, is the total
number of pixels that change color between frames,
measured by counting the number of pixel locations
that differ in value between two frames. Color
change estimates how similar two frames are with
regard to color frequency only, while Pixel change captures
the movement of content on a surface.
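The two measurements defined above can be illustrated with a minimal Python sketch. The paper does not state how either quantity is normalized, so the choices here are assumptions: frames are flat lists of color values, both metrics are reported as a fraction of the frame's pixels, and the histogram difference is halved so that a pixel that switches color is counted once. The function names are hypothetical.

```python
from collections import Counter

def pixel_change(prev, curr):
    """Fraction of pixel locations whose value differs between two frames."""
    assert len(prev) == len(curr)
    changed = sum(1 for a, b in zip(prev, curr) if a != b)
    return changed / len(curr)

def color_change(prev, curr):
    """Difference in color frequencies between two frames, ignoring location.

    Assumption: the summed histogram difference is halved so that one pixel
    switching from color X to color Y contributes one pixel of change.
    """
    assert len(prev) == len(curr)
    hist_prev, hist_curr = Counter(prev), Counter(curr)
    colors = set(hist_prev) | set(hist_curr)
    diff = sum(abs(hist_prev[c] - hist_curr[c]) for c in colors)
    return diff / (2 * len(curr))

# A four-pixel "frame" scrolled by one position: content moves, colors do not change.
frame_a = ["white", "white", "blue", "white"]
frame_b = ["white", "blue", "white", "white"]
print(pixel_change(frame_a, frame_b), color_change(frame_a, frame_b))
```

In this toy case half of the pixel locations change while the color histogram is identical, which is the "moving but not changing" situation that palette prediction relies on.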
To illustrate color and pixel change, we use an example
from the Google Chrome browser in Figure 2.
The example shows pixel and color change for different
events. Notice that in some cases, when new content is
displayed, both pixel and color change values are high
(e.g., new web search). In most cases, however, pixel
change is higher than color change; this means
that in many cases the content is moving but not changing,
as is the case with BBC news in Figure 2.
Looking at a range of mobile workloads, we found
that temporal color coherence is reflected by low Color
change values, especially in UI applications. We analyzed
a set of nine Android UI applications and games
(UI: Twitter, Facebook, Chrome, and Android Home
Screen; 3D: Fruit Ninja, Need 4 Speed, Gunship
2, and Temple Run 2). In 3D applications, color and
pixel change rates are 15.7% and 65%, respectively. On
the other hand, UI applications have rates of 3.3% and
14.5% for color and pixel change values, respectively.
These numbers show that 2D and UI applications exhibit
higher temporal color coherence relative to 3D
applications.
In addition to higher temporal coherence, we found
that UI applications tend to use fewer colors. Figure 3
demonstrates how a small number of frequent pixel color
values dominates a typical UI application compared to
a 3D one. Figure 3 shows the cumulative distribution
function of colors used in Twitter UI (a), compared to
a 3D game, Temple Run 2 (b). In (a), the top 100 most
common color values cover over 80% of the frame's surface,
while coverage is 10% for Temple Run 2. Measuring
compressibility with Shannon entropy, we found
that Twitter has an entropy of 4.5 bits per pixel, while
it is 14 bits per pixel for Temple Run 2, indicating higher
compressibility for Twitter.
[Figure 2 plot: change (% of pixels) vs. frame number in Chrome, with Color change and Pixel change series; annotated events: new web search, scrolling, loading BBC news, loading Amazon.]
Figure 2: Pixel change and Color change in Google
Chrome. In most cases, pixel change is higher than
color change; this means that content is moving but
not changing most of the time. Our compression scheme
takes advantage of lower color change between frames
to predict compression palettes.
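As an illustration of the entropy measurement mentioned above, the following Python sketch computes the Shannon entropy of a frame's color distribution. It assumes a zeroth-order model (each pixel treated as an independent draw from the frame's color histogram); the paper does not spell out the exact model used, and `color_entropy_bits` is a hypothetical name.

```python
import math
from collections import Counter

def color_entropy_bits(pixels):
    """Shannon entropy, in bits per pixel, of the color distribution of a frame."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Two equally likely colors need one bit per pixel on average.
print(color_entropy_bits([0xFFFFFFFF, 0xFFFFFFFF, 0x000000FF, 0x000000FF]))
```

A frame dominated by a few colors yields a low value, which is the property that makes palette-based encoding attractive for UI surfaces.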
The next section shows how to take advantage of tem-
poral coherence and the color characteristics of UI ap-
plications to design a dynamic color palette scheme for
compressing UI and 2D surfaces.
4. DYNAMIC COLOR PALETTES (DCP) COMPRESSION
DCP is a technique that exploits temporal color
coherence in graphics for framebuffer compression. For each frame,
DCP carries out two operations in parallel: color frequency
collection and framebuffer compression. For color frequency
collection, DCP tracks the most frequently used
colors as the rendering of a frame progresses; meanwhile,
DCP compresses the pixels in the
frame with a palette constructed using the frequency
information of the previous frame.
DCP has two main advantages over previous dictionary-
based techniques [18, 19]. First, it employs sampling to
exploit temporal coherence to predict future dictionary
values on-the-fly, alleviating the need for software hints
or a multi-stage dictionary update process. This allows
DCP to compress intermediate surfaces (i.e., applica-
tion surfaces) as well as the framebuffer surface used by
the display unit. Second, as we will show later, DCP
maximizes compression using adaptive dictionary siz-
ing, which puts to use the color frequency data collected
for each frame.
DCP relies on two structures (shown in Figure 4):
the Frequent Values Collector (FVC) for color frequency
collection, and the Common Colors Dictionary (CCD)
for compressing new pixels. The FVC identifies the most
commonly occurring colors, while the CCD encodes the
most frequent colors as identified by the FVC from the
previous frame. As shown in Figure 4, in each frame the
FVC collects color frequency information that is then
used to construct the CCD of the next frame.
4.1 DCP workflow
Figure 5 shows the DCP workflow. In (1), the GPU commits
tiles to the off-chip memory in multiple batches,
i.e., blocks of spatially adjacent pixels [23, 24, 25]. For
the example in Figure 5, we use a block size of 4×4 pixels
and a sub-block size of 2×2. Pixels in each block are
sent to the FVC (a1) and the CCD (b1).
In (a1), the FVC uses pixel values in each block to update
the common color frequencies of the current frame
(see Section 4.2 for details).
In (b1), the CCD compresses pixel blocks in batches
of sub-blocks. In (b2), if all pixel values in a sub-block
have an entry in the CCD, the sub-block is determined
to be compressible. Each color value in a compressible
sub-block is represented using log2(CCD size) bits,
e.g., 6 bits per pixel for a CCD with 64 entries. If
any pixel value in a sub-block does not have a
CCD entry, the whole sub-block remains uncompressed.
Compressed and non-compressed sub-blocks
are buffered and, once a full block is processed, it is
written to the off-chip memory (b3).
Like other block-based compression schemes [1, 26],
DCP uses a metadata compression status buffer (CSB)
that contains a compression status bit for each sub-block.
Upon compressing a sub-block, the corresponding
entry in the CSB is set (b4), and upon reading a
compressed surface, the CSB is consulted to determine
how much data should be fetched from memory.
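The sub-block decision described in this section can be sketched in Python as follows. This is a software illustration under stated assumptions, not the hardware design: the CCD is modeled as a dictionary from color value to index, and the function names, the string color values, and the 32-bit default pixel size are hypothetical choices for the example.

```python
import math

def compress_sub_block(sub_block, ccd):
    """Return (csb_bit, payload) for one sub-block.

    ccd maps a color value to its dictionary index (assumed representation).
    A sub-block is compressible only if every pixel has a CCD entry.
    """
    if all(p in ccd for p in sub_block):
        return 1, [ccd[p] for p in sub_block]   # one CCD index per pixel
    return 0, list(sub_block)                   # raw pixels, left uncompressed

def sub_block_bits(csb_bit, n_pixels, ccd_size, pixel_bits=32):
    """Payload size in bits, excluding the CSB bit itself."""
    bits_per_pixel = int(math.log2(ccd_size)) if csb_bit else pixel_bits
    return n_pixels * bits_per_pixel

ccd = {"white": 0, "blue": 1}
print(compress_sub_block(["white", "blue", "white", "white"], ccd))
print(compress_sub_block(["white", "red", "white", "white"], ccd))
```

With a 64-entry CCD and a 2×2 sub-block, `sub_block_bits` gives 24 bits for a compressed sub-block against 128 bits for an uncompressed one.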
4.2 The Frequent Values Collector (FVC)
The FVC is a relatively small (e.g., 16 to 128 entries) associative
memory structure. The FVC stores a set of pixel values
and their corresponding frequencies as value-frequency
pairs. For each pixel access, the FVC determines whether the
pixel already has an entry; if so, the FVC
increases the corresponding frequency counter by one.
However, because the FVC size is limited, the FVC uses an
eviction policy to determine which pixel frequencies to
keep track of. Organized like a fully associative cache, the
FVC uses the least frequent color (LRC) policy: it
evicts the pixel value with the smallest frequency when
an entry is needed to track the frequency of a new pixel
value.
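A minimal software model of the FVC behavior described above is sketched below in Python. The class and method names are hypothetical, and two details are assumptions not stated in the text: a newly inserted value starts with a count of one, and ties between equally infrequent entries are broken arbitrarily.

```python
class FrequentValuesCollector:
    """Software model of the FVC: tracks up to `capacity` value/frequency pairs."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.counts = {}  # pixel value -> frequency counter

    def observe(self, pixel):
        if pixel in self.counts:
            self.counts[pixel] += 1
            return
        if len(self.counts) >= self.capacity:
            # Evict the entry with the smallest frequency to make room.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[pixel] = 1  # assumption: a new entry starts at one

    def most_frequent(self):
        """Tracked values sorted from most to least frequent."""
        return sorted(self.counts, key=self.counts.get, reverse=True)

fvc = FrequentValuesCollector(capacity=2)
for pixel in ["a", "a", "a", "b", "c"]:
    fvc.observe(pixel)
print(fvc.most_frequent())
```

In the usage above, "b" is evicted when "c" arrives because it holds the smallest count, which also shows why the collected frequencies are estimates rather than exact histograms.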
Hardware Cost.
Each FVC entry contains a color value (32 bits for
RGBA), a validity flag (1 bit), and a counter with
log2(number of screen pixels) bits. For example, a 64-entry FVC
sized for a 4k×4k display requires only 456 bytes of
storage.
4.3 The Common Colors Dictionary (CCD)
The CCD is used to encode compressed pixels. At the end
of each frame, the FVC holds the frequencies of the estimated
most common colors. The FVC is then used to
construct the CCD for the next frame. Each CCD entry
maps a pixel value to a dictionary (encoding) value. The
CCD is implemented using a fully associative structure.
[Figure 3 plots: (a) Twitter, (b) Temple Run 2.]
Figure 3: The cumulative distribution function (CDF) of unique color values in UI (Twitter) and 3D (Temple Run
2) Android applications.
[Figure 4 diagram: a timeline over Frames 0, 1 and 2. In each frame, the FVC is constructed from the pixel values of the current frame, the CCD dictionary is constructed using the FVC values of the previous frame, and framebuffer accesses to memory are compressed using the CCD.]
Figure 4: Using DCP across frames.
[Figure 5 diagram: the GPU on-chip buffer dispatches pixel blocks (1) to pixel sampling and the FVC (a1, a2), which constructs a CCD for the next frame, and to the CCD (b1), which compresses sub-blocks. If each pixel in the current sub-block has a CCD entry (b2), a compressed sub-block is written to the block buffer; otherwise a non-compressed sub-block is written. The block is then written to off-chip memory (b3) and the CSB buffer is updated (b4).]
Figure 5: DCP stages.
[Figure 6 diagram: the CSB buffer is used to determine the block size; the block is fetched from off-chip memory, decompressed using the rCCD, and sent to the display, texture or composition unit.]
Figure 6: Reading a DCP compressed surface.
When reading a surface, the CCD mapping is reversed
to decompress encoded pixels. We call the direct-mapped
structure that holds this reversed mapping the
rCCD. Upon compressing a frame, or a set of frames,
the rCCD mapping is attached to the frame and stored
in main memory. Later on, when the frame is read, the
rCCD is used to decompress the frame as described in
Section 4.5 below.
Hardware Cost.
A CCD/rCCD with 64 entries requires only 264 bytes
of storage.
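The storage figures quoted for the FVC and the CCD/rCCD can be checked with a short Python calculation. Two assumptions are made here: "4k×4k" is read as 4096×4096 pixels, and each CCD/rCCD entry is taken to hold a 32-bit color plus a 1-bit validity flag, which is an inference that reproduces the stated 264 bytes rather than a layout given in the text. The function names are hypothetical.

```python
import math

def fvc_bytes(entries, screen_pixels, color_bits=32):
    """FVC storage: color value + 1-bit validity flag + frequency counter per entry."""
    counter_bits = math.ceil(math.log2(screen_pixels))
    entry_bits = color_bits + 1 + counter_bits
    return entries * entry_bits / 8

def ccd_bytes(entries, color_bits=32):
    """CCD/rCCD storage, assuming a color value plus a 1-bit validity flag per entry."""
    return entries * (color_bits + 1) / 8

print(fvc_bytes(64, 4096 * 4096))  # 64 entries x (32 + 1 + 24) bits
print(ccd_bytes(64))               # 64 entries x 33 bits
```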
4.4 The Compression Status Buffer (CSB)
Similar to other block-based compression algorithms [1,
9, 13, 14], a metadata buffer is used to hold the status
of each compression block. For DCP, the CSB indicates
whether a given sub-block is compressed, and
holds one bit per sub-block. In our baseline, this
translates to a cost of 1 bit per 128 bits of surface data.
Both the CSB and the rCCD are needed to read a compressed
frame, as explained in the next section.
4.5 Reading a Compressed Framebuffer Surface
Figure 6 shows the process of reading a compressed
surface. It starts with loading the corresponding rCCD
and CSB. To read a pixel, CSB entries are decoded to
determine the size of the compressed data and how many
bytes should be fetched for each block. To avoid double
latency, and since the CSB is relatively small, the
CSB can be prefetched to a small on-chip buffer/cache.
Once the CSB is used to determine the size of a compressed
block, the block is fetched and the rCCD is used
to decompress the values in each sub-block as shown
in Figure 6.
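The read path described above can be sketched in Python as follows. This is an illustrative model under assumptions: the rCCD is a list indexed by dictionary code, the CSB holds one bit per sub-block as in baseline DCP, and the defaults (2×2 sub-blocks, 6-bit codes, 32-bit pixels) follow the examples in the text. The function names are hypothetical, and rounding to DRAM bursts is left out.

```python
def block_payload_bits(csb_bits, pixels_per_sub_block=4, code_bits=6, pixel_bits=32):
    """Bits to fetch for one block, derived from its CSB entries alone."""
    return sum(pixels_per_sub_block * (code_bits if bit else pixel_bits)
               for bit in csb_bits)

def decompress_sub_block(csb_bit, payload, rccd):
    """Recover pixel values; rccd[code] gives the color for a dictionary code."""
    if csb_bit:
        return [rccd[code] for code in payload]
    return list(payload)  # sub-block was stored uncompressed

rccd = ["white", "blue"]
print(block_payload_bits([1, 1, 0, 1]))
print(decompress_sub_block(1, [0, 1, 0, 0], rccd))
```

Because the fetch size depends only on the CSB, the CSB has to be available before the block request is issued, which is why prefetching it on-chip helps.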
4.6 Multi-Surface Support
Multiple Render Targets (MRT): Some graphics
applications may render to multiple target surfaces.
Techniques that use MRT, like deferred shading, are
popular in 3D applications and are used to render scenes
with complex lighting [27]. To support multiple render
targets, we need to replicate some of the structures
in Figure 5 to match the maximum possible number of
target surfaces. DCP will need one FVC and one
CCD unit per render target. However, additional FVC
and CCD units are not needed if multiple passes are
used to process MRT.
Since most UI and 2D workloads render to a single
target, a typical hardware implementation may only
need to support a single render target, and the rare case
of multiple targets is handled by using DCP with just
a single surface. However, as discussed in Section 7.4,
adding extra structures is relatively cheap and costs little
chip area.
Multi-Surface Composition: Contemporary com-
positor engines can composite up to 16 surfaces in one
pass [28]. To support multi-surface composition, the
number of rCCD structures in Figure 6 should match
the number of surfaces that can be composited in par-
allel.
4.7 Coupling DCP with Other Compression Algorithms
DCP targets common UI and 2D applications. Other
compression algorithms are better suited to 3D and
some 2D applications. Industry practitioners have pro-
posed supporting multiple compression algorithms [2,
26]. This means that in a hybrid scheme, each block
can be compressed either using DCP or an alternative
algorithm. In Section 7.5 we evaluate the results of
combining DCP with RAS.
4.8 Dynamically Enabling DCP
In this section, we explain how DCP can be enabled or disabled
based on the expected compression performance. DCP
performance can be predicted using the frequencies collected
by the FVC at the end of a frame. By adding the
frequency values in the FVC, then comparing the sum to the
total number of pixels (sample size), we can calculate
what we call FVC coverage, which can be used to predict
DCP performance, where:
$$\text{FVC coverage} = \frac{\text{Sum of FVC frequencies}}{\text{Number of samples}}$$
By defining a coverage threshold (CT) and comparing
it to FVC coverage, DCP can be used only
if FVC coverage ≥ CT. By periodically enabling the FVC,
e.g., once every n frames, FVC coverage can be updated
and used to determine if DCP should be enabled. For
an N-entry FVC, calculating coverage takes N − 1 integer
additions and one division operation per frame.
[Figure 7 plot: compression rate (0 to 6) vs. workload FVC coverage (0.11 to 1.00), sorted by compression rate.]
Figure 7: Compression rates vs. FVC coverage.
Figure 7 shows FVC coverage vs. compression rates
across the workloads in Table 3. Higher compression
rates are generally achieved with higher FVC coverage.
In our set of workloads, using DCP with FVC coverage
≥ 0.7 seems to achieve good compression rates (> 2).
Figure 7 also shows some cases where larger FVC coverage
yields lower compression. These cases represent
workloads that exhibit sudden changes between frames; as a
result, their temporal coherence is lower than that of other
benchmarks with similar FVC coverage. Two examples
from Figure 7 (the two large dips at the right end) are
Unwind, which exhibits a UI with changing color brightness,
and Super Hexagon, which exhibits an interface that
continuously switches theme colors.
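The enable/disable decision described in this section can be sketched in Python as follows. The function names are hypothetical, and the default threshold of 0.7 is taken from the observation above about this set of workloads; it is used here as an example value, not as a recommended setting.

```python
def fvc_coverage(fvc_frequencies, num_samples):
    """Sum of the FVC frequency counters divided by the number of sampled pixels."""
    return sum(fvc_frequencies) / num_samples

def dcp_enabled(fvc_frequencies, num_samples, coverage_threshold=0.7):
    """Enable DCP only when FVC coverage reaches the coverage threshold (CT)."""
    return fvc_coverage(fvc_frequencies, num_samples) >= coverage_threshold

print(dcp_enabled([60, 20, 5], 100))  # coverage 0.85
print(dcp_enabled([30, 10], 100))     # coverage 0.40
```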
5. DCP SCHEMES
5.1 Baseline DCP
In baseline DCP, the CCD is constructed using all
FVC entries; thus, the number of entries in the CCD
always matches that of the FVC, and compressed blocks
have a fixed size of log2(FVC size) × (pixels per block)
bits.
Memory layout and effective compression rates.
Figure 8 shows the memory layout of a DCP compressed
surface. The space allocated to each DCP block (Blocks 0-2)
is fixed at S₀, i.e., the size of an uncompressed block.
The space actually utilized, on the other hand, is determined
by the size of the compressed data (S₂). However, because
DRAM reads and writes data in multiples of the
burst size, a block whose nominal compression rate is
S₀/S₂ will have an
effective compression rate of S₀/S₁, where S₁ is the total size
of the DRAM bursts needed to read the compressed data.
In this work, we use the effective compression rate
which reflects the reduction in memory bandwidth. In
the remainder of this section, two variants of DCP (ADCP
and VDCP) are introduced in addition to a hybrid scheme
combining DCP and RAS (HDCP).
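The effective compression rate defined in this section can be illustrated with a short Python sketch. The function name and the example sizes are hypothetical; the 128-bit default burst size follows the baseline configuration in Table 1.

```python
import math

def effective_rate(s0_bits, s2_bits, burst_bits=128):
    """Effective compression rate S0/S1, with S1 rounded up to whole DRAM bursts."""
    bursts = math.ceil(s2_bits / burst_bits)
    s1_bits = bursts * burst_bits
    return s0_bits / s1_bits

# An 8x8 block of 32-bit pixels is 2048 bits uncompressed (S0).
print(effective_rate(2048, 400))  # 400 bits of data occupy four 128-bit bursts
print(2048 / 400)                 # nominal rate S0/S2 for comparison
```

In this example the nominal rate of 5.12 drops to an effective rate of 4.0, because the last burst is only partly filled.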
5.2 Adaptive DCP (ADCP)
ADCP is a variation of DCP that uses the distribution
of frequent color values in the FVC to adjust the
number of CCD entries. ADCP looks for the best trade-off
between the number of compressible blocks and the
size of their encoding.
[Figure 8 diagram: blocks layout in DRAM (Blocks 0 to 2, each allocated S₀, with unused space after the compressed data). For compressed Block 0, the DCP compressed data occupies S₂, while the effective size S₁ = N × (DRAM burst size) is what is read from or written to DRAM; the difference is DRAM burst overhead.]
Figure 8: DCP memory layout.
[Figure 9 plot: DCP compression vs. number of FVC/CCD entries (8 to 512) for Kindle and Facebook.]
Figure 9: DCP compression vs. CCD size.
Frames of different applications, or different frames
within the same application, may perform better or worse
under larger or smaller palette sizes. Figure 9 shows DCP
compression rates for Facebook and Kindle using 16 to
512 entry CCDs. Kindle, with its simple text, achieves
higher compression rates using smaller CCDs. On the
other hand, Facebook achieves its best compression rate
using a 256-entry CCD. A larger CCD covers a wider
range of values and is able to compress more blocks,
while a smaller CCD uses smaller encoding sizes. For
example, consider a frame that uses 32-bit pixels with blocks
that are 80% white, 18% blue, 1% black and 1% red. With
a 2-entry CCD, 98% of the blocks can be compressed
using 1 bit per pixel, for a total compression rate of
19.75:1 (ignoring metadata overhead). Another option
is to use a 4-entry CCD to compress the entire frame using
2 bits per pixel, producing a compression rate of 16:1.
ADCP tries to optimize CCD size for each case by
actively predicting the optimal number of CCD entries.
CCD size determines encoding sizes and subsequently
the size of compressed blocks. ADCP uses the FVC to predict
the optimal CCD size using Algorithm 1. In Algorithm 1,
FVC frequencies, sorted from most to least
frequent in FVC_Val, are used as input. Note that to
simplify calculations, DRAM burst size and pixel layout
were ignored.
ADCP has a negligible overhead; the number of iterations
in Algorithm 1 depends on the number of FVC
entries. For example, for a 64-entry FVC, the loop will
only execute six times (i.e., log2(FVC size)).
Algorithm 1 Predicting optimal CCD size
INPUTS (FrameSizePixels, PixelSizeBits, FVC_Val, Max_FVC_Size)
  ▷ predicted compressed frame size in bits
  expected_frame_size = FrameSizePixels × PixelSizeBits
  ▷ optimal number of CCD entries = 2^opt_CCD
  opt_CCD = 0
  for i = 0 to log2(Max_FVC_Size) do
    sum = SumFrequencies(FVC_Val(0) to FVC_Val(2^i − 1))
    frame_size = sum × i + (FrameSizePixels − sum) × PixelSizeBits
    if frame_size < expected_frame_size then
      expected_frame_size = frame_size
      opt_CCD = i
    end if
  end for
  return 2^opt_CCD
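Algorithm 1 can be transcribed into Python as a minimal sketch. The loop bound is read here as inclusive of log2(Max FVC Size), which is one interpretation of the "for i=0 to ..." line; the function and variable names are adapted for Python and are not taken from an actual implementation.

```python
import math

def predict_optimal_ccd_size(frame_size_pixels, pixel_size_bits, fvc_val, max_fvc_size):
    """Predict the CCD size that minimizes the estimated compressed frame size.

    fvc_val holds the FVC frequencies sorted from most to least frequent.
    """
    best_bits = frame_size_pixels * pixel_size_bits  # uncompressed frame size
    opt_exp = 0
    for i in range(int(math.log2(max_fvc_size)) + 1):
        covered = sum(fvc_val[:2 ** i])  # pixels covered by the top 2^i colors
        frame_bits = covered * i + (frame_size_pixels - covered) * pixel_size_bits
        if frame_bits < best_bits:
            best_bits, opt_exp = frame_bits, i
    return 2 ** opt_exp

# The 80% / 18% / 1% / 1% example, expressed as counts out of 100 pixels.
print(predict_optimal_ccd_size(100, 32, [80, 18, 1, 1], 4))
```

For the four-color example, the estimate is 162 bits with two entries and 200 bits with four, so the sketch selects a 2-entry CCD, in line with the 19.75:1 vs. 16:1 comparison.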
(a) CCD entries
Color   Index   Code
C0      0       000
C1      1       001
C2      2       010
C3      3       011
C4      4       100
C5      5       101
C6      6       110
C7      7       111
(b) Sub-block encodings
Block       Encoding      CSB
{C2, C3}    {10, 11}      010
{C0, C1}    {0, 1}        001
{C0, C0}    {φ}           000
{C1, C5}    {001, 101}    011
{C6, C7}    {110, 111}    011
{C3, Cy}    {C3, Cy}      111
{Cx, C4}    {Cx, C4}      111
Figure 10: VDCP example encoding. (a) CCD entries.
(b) The encoding of 2-pixel blocks using the CCD in
(a).
5.3 Variable DCP (VDCP)
VDCP is another DCP variation that goes further
than ADCP by adapting palette sizes to optmize com-
pression at the sub-block level. VDCP uses variable-
length coding by changing the number of rCCD en-
tries used to encode/decode each sub-block. VDCP
reduces the number of encoding bits per pixel to i =
ceil(log2(max(pixel color index))), which means that i
is determined by the pixel within the sub-block that has
the highest index (i.e., lowest frequency) in the CCD.
With VDCP, CSB is used to determine the number of
rCCD entries used for each sub-block, where the number
rCCD entries equals to 2CSB V alue (i.e., encoded colors
fall in the first 2CSB V alue CCD entries), and a special
CSB Value is used for uncompressed sub-blocks.
Figure 10 shows a VDCP example. In Figure 10a, an example CCD is shown, where the most frequent color value, C0, is encoded to 000 and the least frequent value, C7, is encoded to 111. Figure 10b shows the VDCP encoding for seven 2-pixel sub-blocks. As shown in the figure, the CSB tracks each sub-block’s encoding. $000_2$ in the CSB indicates that only the most frequent color in the CCD, C0, is used, while $001_2$ indicates that the top 2 CCD colors, {C0 and C1}, are used, and so on. $111_2$ is used for uncompressed sub-blocks.
Baseline DCP Configurations
FVC size                   64 entries
FVC replacement policy     Least-frequent value
CCD size                   64 entries
Pixel block size           8×8
Pixel sub-block size       2×2
CSB bits per sub-block     1 (DCP, ADCP & HuffDCP), 3 (VDCP), 5 (HDCP)
Memory burst size          128 bits
Pixel sampling rate        1:1
Table 1: Baseline Configurations

In Figure 10b, the first row shows a sub-block with values C2 and C3, which means that only the top $2^2$ entries in the CCD are used for encoding the sub-block. Consequently, the corresponding CSB entry is set to $010_2$, and each pixel color is encoded using 2 bits. The second sub-block in Figure 10b contains {C0, C1}, encoded using the top $2^1$ CCD entries (1 bit per color), and the corresponding CSB entry is set to $001_2$. The third sub-block contains only C0, and the corresponding CSB value in this case is $000_2$, which indicates the content for the entire sub-block (since only the top entry in the table is used). Note that to make this example easy to follow, the CCD is shown with eight entries (instead of 64), so the CSB values $100_2$ to $110_2$ are not in use.
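The sub-block encoding walked through for Figure 10 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `vdcp_encode`, the list representation of the CCD, and the use of string codes are all hypothetical. The number of bits per pixel is taken as the bit length of the highest CCD index in the sub-block, a choice made here because it reproduces the rows of Figure 10b.

```python
def vdcp_encode(sub_block, ccd, csb_bits=3):
    """Encode one sub-block; returns (csb_value, per-pixel codes).

    ccd is assumed to be a list of colors ordered from most to least
    frequent. Colors missing from the CCD force the uncompressed mode.
    """
    index = {color: i for i, color in enumerate(ccd)}
    uncompressed = (1 << csb_bits) - 1  # all-ones CSB value, e.g. 111
    if any(color not in index for color in sub_block):
        return uncompressed, list(sub_block)
    # The pixel with the highest (least frequent) index sets the width.
    bits = max(index[color] for color in sub_block).bit_length()
    # With zero bits, the sub-block holds only the top CCD color,
    # so no per-pixel code is emitted.
    codes = [format(index[c], f"0{bits}b") if bits else "" for c in sub_block]
    return bits, codes
```

With the eight-entry CCD of Figure 10a, `vdcp_encode(["C2", "C3"], ccd)` yields CSB value 2 (binary 010) and codes 10 and 11, while a sub-block containing a color outside the CCD yields CSB value 7 (binary 111) and is left uncompressed.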
5.4 Hybrid DCP (HDCP)
DCP is only effective on a subset of applications. Ideally, it should be used alongside other compression algorithms. We evaluate a Hybrid DCP (HDCP) that combines DCP with RAS. HDCP compresses each block using both DCP and RAS and uses the result with the higher compression rate. To support the additional compression modes, the number of CSB bits is increased. Results in Section 7.5 show that this technique produces higher compression rates at the cost of additional on-chip computations.
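The per-block selection performed by HDCP can be sketched as follows. The two size estimators below are deliberately simple stand-ins invented for this example; they are not DCP or RAS, and only the selection logic (compress with every candidate, keep the smallest result) reflects the description above.

```python
def palette_size_bits(block):
    """Toy palette estimate: fixed-width index per pixel (hypothetical)."""
    distinct = len(set(block))
    return len(block) * max(distinct - 1, 0).bit_length()

def delta_size_bits(block, pixel_bits=32):
    """Toy predictor estimate: raw first pixel, then signed deltas (hypothetical)."""
    deltas = (abs(b - a) for a, b in zip(block, block[1:]))
    return pixel_bits + sum(d.bit_length() + 1 for d in deltas)

def hybrid_select(block, compressors):
    """Return (name, size_in_bits) of the candidate with the smallest output."""
    sizes = {name: estimate(block) for name, estimate in compressors.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes[best]

COMPRESSORS = {"palette": palette_size_bits, "delta": delta_size_bits}
```

For a flat block such as `[5] * 64`, the palette stand-in wins with zero payload bits, whereas for a smooth ramp such as `list(range(64))` the delta stand-in is selected. The selected mode must be recorded per block, which is why HDCP needs additional CSB bits.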
5.5 DCP Implementation
In addition to the hardware structures (FVC, CCD and rCCD), DCP requires some support from the software layer. To implement DCP, the graphics driver attaches DCP data as part of the state associated with a surface (along with other state data such as size and format). For VDCP, Algorithm 1 can be added to the driver as well, where it can calculate the next CCD size at the end of each frame.
6. METHODOLOGY
Our experimental configurations are listed in Table 1. We calculated compression rates using a model that assumes a tile-based GPU architecture. Our model works as follows:
• First, we feed the frames of each workload to our model, which then splits each frame into 8×8 blocks.
• For each block, the model calculates the compressed size of each sub-block. The sizes of the compressed and uncompressed sub-blocks are summed to obtain the compressed size of the block.
• The compressed block size is then used to calculate the number of DRAM bursts required. The model then calculates the total bandwidth consumed by a compressed frame by summing the number of DRAM bursts of all the blocks in the frame.
Note that the model computes compression rates starting from the second frame, using the first frame to populate the first FVC and CCD.
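The burst accounting in the steps above can be sketched as follows. This is a simplified illustration rather than our evaluation code: the function names are hypothetical, and the sketch assumes 32-bit pixels and that the CSB bits of a block are fetched together with its payload, which is only one possible way to charge metadata overhead.

```python
import math

BURST_BITS = 128  # memory burst size from Table 1

def block_bursts(sub_block_bits, csb_bits_per_sub_block):
    """DRAM bursts needed for one block, given each sub-block's size in bits."""
    payload = sum(sub_block_bits)
    metadata = csb_bits_per_sub_block * len(sub_block_bits)
    return math.ceil((payload + metadata) / BURST_BITS)

def frame_compression_rate(blocks, csb_bits_per_sub_block,
                           pixels_per_block=64, pixel_bits=32):
    """Ratio of uncompressed to compressed bursts over all blocks of a frame."""
    raw_per_block = math.ceil(pixels_per_block * pixel_bits / BURST_BITS)
    raw = raw_per_block * len(blocks)
    compressed = sum(block_bursts(b, csb_bits_per_sub_block) for b in blocks)
    return raw / compressed
```

For example, an 8×8 block whose sixteen 2×2 sub-blocks each compress to 8 bits needs 2 bursts with 3 CSB bits per sub-block, compared to 16 bursts for the raw block; rounding up to whole bursts is what distinguishes the effective rate from the raw bit-level rate.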
We evaluated surface compression using our set of
randomly chosen popular Android applications (Table 2).
Our traces will be published and made available for any
future studies.
We split applications into three groups: UI applications, 2D applications, and 3D games. All of our benchmarks use OpenGL ES and render to a single target buffer (up to OpenGL ES 2.0, multiple render targets (MRT) are only supported through vendor extensions [29]).
We manually interacted with each application to ex-
ecute a simple task. In total, we used 34468 frames
that represent 124 applications (shown in Table 3). We
only consider regions of interest in each workload that
represent the typical use case of the workload (i.e., load-
ing/initialization frames are not considered).
The remaining configurations are listed in Table 1. The effective compression rate and metadata overhead are taken into account when calculating the total compression rate. We use a block size of 8×8 pixels (256 bytes), which matches the block size used by RAS.
In addition to DCP, we evaluate two lossless methods described in Section 2: RED, which uses Nvidia’s compression [2], and RAS, which is based on the work of Rasmusson et al. [1]. RAS is a prediction-based algorithm that predicts the value of a pixel using neighboring pixel values. The difference between the prediction and the actual value is then encoded using Golomb-Rice coding. We used the parameters suggested by Rasmusson et al. [1], namely 8×8 blocks, and, as described in the paper, we set the value of the Golomb-Rice parameter k by exploring values between 0 and 6, use k = 7 for the “special mode”, and use the suggested “3 sizes mode” for higher compression rates. We organize color values by their color channel as described in Ström et al. [14]. We experimented with RAS using the RGBA and YCoCg formats and found that for many applications, particularly UI and 2D, RAS shows favorable results using RGBA channels. We therefore used RAS with RGBA channels in our comparison.
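To make the role of the Golomb-Rice parameter k concrete, the sketch below computes Golomb-Rice code lengths for a set of prediction residuals and picks the k in the range 0 to 6 that minimizes the total. It is a generic illustration and not the RAS implementation: the zigzag mapping of signed residuals to non-negative integers and all function names are assumptions made for this example, and the special mode and sizes mode are not modeled.

```python
def zigzag(residual):
    """Map a signed residual to a non-negative integer: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return 2 * residual if residual >= 0 else -2 * residual - 1

def rice_length(value, k):
    """Bits in the Golomb-Rice code of value: unary quotient, stop bit, k-bit remainder."""
    return (value >> k) + 1 + k

def best_k(residuals, k_values=range(0, 7)):
    """Return the k with the smallest total code length (first one on ties)."""
    mapped = [zigzag(r) for r in residuals]
    return min(k_values, key=lambda k: sum(rice_length(v, k) for v in mapped))
```

Residuals close to zero favor small k (each zero costs a single bit at k = 0), while larger residuals favor larger k, which is why the parameter is explored per block rather than fixed.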
For the CSB, DCP and ADCP use 1 bit per sub-block. VDCP uses 3 bits per sub-block; with an FVC size of 64, seven combinations are used (1, 2, 4, 8, 16, 32 and 64 entries), plus a combination for non-compressed sub-blocks. To compare against techniques that use Huffman coding [19], we implemented a Huffman-coded DCP (HuffDCP), where FVC frequencies are used to construct a CCD with variable-length Huffman codes.
7. RESULTS AND DISCUSSION
7.1 DCP Schemes
# Cat. Benchmark | # Cat. Benchmark | # Cat. Benchmark | # Cat. Benchmark | # Cat. Benchmark
1 UI Android Settings | 26 UI Pocket | 51 UI Yellowpages | 76 UI Textra | 101 2D Unwind
2 UI Morecast | 27 UI ES File Explorer | 52 UI Eye in the Sky | 77 UI WPS Office | 102 2D Color Switch
3 UI Poweramp | 28 UI Chrome | 53 UI OfficeSuite | 78 UI People Contacts | 103 2D Impossible Game
4 UI Speedest | 29 UI Applock | 54 UI Dictionary.com | 79 UI Unit Conv. Ult. | 104 2D Flow
5 UI Twitter | 30 UI Accuweather | 55 UI Walgreens | 80 UI Skyscanner | 105 2D 2048
6 UI Facebook | 31 UI Flipboard | 56 UI Walmart | 81 UI Calendar | 106 2D Gyro
7 UI Twitch | 32 UI Booking.com | 57 UI CNN | 82 UI Merriam Webster | 107 2D 99 Problems
8 UI Wish | 33 UI Shazam | 58 UI File Commander | 83 UI ESPN | 108 2D Dumb Ways to Die
9 UI Imgur | 34 UI Zedge | 59 UI Terminal Emulator | 84 UI Tumblr | 109 2D Piano Tiles
10 UI Soundcloud | 35 UI Indeed | 60 UI Adobe Acrobat | 85 UI Quickpic | 110 2D loop
11 UI Automate | 36 UI Runkeeper | 61 UI Android Call | 86 UI Duolingo | 111 2D Ultraflow
12 UI Musixmatch | 37 UI Steam | 62 UI Gallery | 87 UI Clock | 112 2D Okay
13 UI Airbnb | 38 UI Khan Academy | 63 UI Feedly | 88 UI Google Messenger | 113 3D Traffic Rider
14 UI CBS Sports | 39 UI The Weather Channel | 64 UI Baconreader | 89 UI Calculator | 114 3D Extreme Car Driving
15 UI Etsy | 40 UI Yahoo Finance | 65 UI aCalendar | 90 UI Soundhound | 115 3D 3D Bowling
16 UI Android Home | 41 UI Tapatalk | 66 UI Bakareader | 91 UI Translate | 116 3D Dr. Driving
17 UI Pinterest | 42 UI Kickstarter | 67 UI Kindle | 92 UI Any.do | 117 3D Paper Toss
18 UI Aldiko | 43 UI Amazon Store | 68 UI eBay | 93 2D Candy Crush Saga | 118 3D Rolling Sky
19 UI Letgo | 44 UI Zomato | 69 UI Venmo | 94 2D Trainyard | 119 3D Stack
20 UI Yelp | 45 UI Spotify | 70 UI Mcdonalds | 95 2D Mines | 120 3D Zigzag
21 UI Android Messaging | 46 UI Runtastic | 71 UI Colornote | 96 2D Cut the Rope 2 | 121 3D Stargather
22 UI BBC iPlayer | 47 UI theScore | 72 UI Reddit | 97 2D Angry Birds | 122 3D Commute H. Traffic
23 UI Tachiyomi | 48 UI Food Network | 73 UI Checkout 51 | 98 2D Strata | 123 3D Crossy Road
24 UI gReader | 49 UI MX Player | 74 UI Tasker | 99 2D Brain it On | 124 3D Smashy Road
25 UI Google Maps | 50 UI VLC | 75 UI IFTTT | 100 2D Super Hexagon
Table 2: List of Android workloads

[Figure 11a: per-workload compression rate (vertical axis, 1 to 35) for the DCP schemes (legend as extracted: VDCP, DCP, ADCP, HDCP). Workloads are grouped as 1-92 (UI), 93-112 (2D) and 113-124 (3D); circled markers 1 to 9 label the workloads discussed in the text.]
(a) DCP schemes compression.
[Figure 11b: per-workload compression rate (vertical axis, 1 to 31) for RAS, RED and VDCP, with the same workload grouping.]
(b) Comparing RAS, RED, and VDCP compression rates.
Figure 11: Compression rates of workloads ordered from left to right following their order in Table 2.

Scheme     UI     2D     3D
DCP        2.87   2.22   1.59
ADCP       4.22   2.62   1.70
VDCP       5.56   3.02   1.78
HuffDCP    5.19   2.90   1.80
Figure 12: Harmonic mean of DCP schemes compression rates per application category.

Scheme     UI     2D     3D
RAS        2.73   2.90   2.54
RED        2.77   1.93   1.35
VDCP       5.26   2.90   1.75
Figure 13: Comparing RAS, RED and VDCP effective compression rates.

System Configurations
Operating system     Android 4.2.2 (API 17)
Display size         720×1280
Android Workloads
Category             # of workloads
UI applications      92 (24031 frames)
2D applications      20 (7888 frames)
3D applications      12 (2549 frames)
Total                124 (34468 frames)
Table 3: System configurations and workloads summary

To compare DCP schemes, we isolate the effect of memory burst size and only take into account CSB overhead. Later, bursts are taken into account when comparing DCP, RAS and RED. We compare baseline DCP against ADCP, VDCP and HuffDCP.
Figure 11a shows compression rates in each category sorted by baseline DCP compression rate (the same order used in Table 2). The figure shows that the baseline DCP is the least effective scheme. UI applications at ①, such as Android Settings, Morecast, and Poweramp, show low compression rates of less than 2. After examining these applications, we found that they feature gradient backgrounds and graphical elements that DCP cannot compress using small palettes.
In ② (Zedge) and ③ (Spotify), ADCP achieves higher compression rates than HuffDCP and VDCP. Looking at these applications, we found that they contain a mix of frames with solid backgrounds and frames that contain images, which DCP will mostly not be able to compress. ADCP can compress frames with solid backgrounds with lower overhead than VDCP since it has a lower CSB overhead. On the other hand, in frames containing images, neither ADCP nor VDCP is able to perform well, but ADCP incurs lower CSB overhead.
For applications with simple color schemes, such as OfficeSuite ④ and Any.do ⑤, VDCP and DCP achieve high compression rates. Nevertheless, VDCP, ADCP, and HuffDCP were all able to achieve even higher compression rates. Looking at 2D applications, performance varies significantly.
In ⑥, applications with sophisticated graphics, like Candy Crush, Trainyard, Mines, Cut the Rope and Angry Birds, have low compression rates (< 1.7). On the other hand, applications using simpler graphics (e.g., loop, Ultraflow, and Okay) achieve high compression rates, especially with VDCP ⑦. A similar trend is exhibited in ⑧, where graphically rich 3D games (Traffic Rider, Extreme Car Driving, and 3D Bowling) show low compression rates. On the other hand, games like Smashy Road show good compression rates (highest with VDCP at 10.63). Also, Stargather ⑨, with similar characteristics to the UI applications in ② and ③, shows higher rates with HuffDCP and VDCP.
Figure 12 summarizes the results in Figure 11a. VDCP shows the best compression rates for UI and 2D applications, with 5.56 and 3.02, respectively. For 3D games, HuffDCP shows the highest rate (1.80). HuffDCP and VDCP do better with 3D workloads since their compression rates are similar to VDCP but with lower CSB overhead. Interestingly, using Huffman encoding in HuffDCP achieves lower compression rates than VDCP in UI and 2D workloads. This is due to the inefficiency of Huffman coding with probability distributions that are not exact powers of two. For example, if we have 32-bit values with frequencies of A (49.5%), B (49.5%), C (0.5%) and D (0.5%), then Huffman encoding will assign codes of 1 bit to A, 2 bits to B and 3 bits to C and D, for a total compression rate of 21.12. ADCP and VDCP encode A and B using 1 bit while keeping C and D uncompressed, resulting in a compression rate of 24.4.
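The arithmetic of this example can be checked with a short script. The sketch below builds Huffman code lengths with a standard heap-based construction and compares the resulting rate with a 2-entry palette that leaves C and D uncompressed; the function names are hypothetical, and CSB overhead is ignored, as in the example.

```python
import heapq

PIXEL_BITS = 32
PROBS = {"A": 0.495, "B": 0.495, "C": 0.005, "D": 0.005}

def huffman_lengths(probs):
    """Return the Huffman code length of each symbol."""
    # Heap entries carry a unique tie-breaker so symbol lists are never compared.
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for symbol in group1 + group2:
            lengths[symbol] += 1  # merged symbols move one level deeper
        heapq.heappush(heap, (p1 + p2, next_id, group1 + group2))
        next_id += 1
    return lengths

def huffman_rate(probs):
    lengths = huffman_lengths(probs)
    average_bits = sum(p * lengths[s] for s, p in probs.items())
    return PIXEL_BITS / average_bits

def palette_rate(probs, palette, index_bits):
    """Palette colors cost index_bits each; all others stay at PIXEL_BITS."""
    covered = sum(probs[s] for s in palette)
    average_bits = covered * index_bits + (1 - covered) * PIXEL_BITS
    return PIXEL_BITS / average_bits
```

Here `huffman_rate(PROBS)` evaluates to about 21.12 (an average of 1.515 bits per value), while `palette_rate(PROBS, ["A", "B"], 1)` evaluates to about 24.43 (an average of 1.31 bits per value), matching the figures quoted above.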
7.2 Comparing VDCP, RAS and RED
Figure 11b compares VDCP with RAS and RED, and Figure 13 summarizes the results in Figure 11b. Memory bursts and CSB overhead were taken into account. For UI applications, VDCP achieves a mean effective compression rate of 5.26 compared to 2.73 for RAS and 2.77 for RED. VDCP performs well with UI and 2D applications. On the other hand, RAS, a more generic compression algorithm, has consistent performance across all workloads. RAS outperforms VDCP in 3D games (2.54, compared to 1.75 for VDCP). Similar to VDCP, RED performs well with UI and 2D applications, but with lower rates than VDCP. VDCP's performance with 3D workloads is the reason we suggest a hybrid approach consisting of DCP and another general-purpose compression algorithm, similar to what is described in some implementations [2, 26]. A hybrid VDCP-RAS scheme is discussed in Section 7.5.
7.3 Factors affecting FVC Fidelity
In this section we discuss and quantitatively evaluate
four factors that affect FVC and should be considered
when using DCP.
FVC Size.
Larger FVC sizes can capture frequent colors more accurately, as they are less likely to evict a frequent value from the FVC because of capacity.
To evaluate how FVC size affects accuracy, we use relative coverage. For an N-entry FVC, we calculate relative coverage by dividing the number of pixels represented by the top N colors collected by the FVC by the number of pixels represented by the actual N most frequent colors.
Figure 14a shows the effect of FVC size for UI applications. A 16-entry FVC has a relative coverage of 94% compared to 98.3% for a 512-entry FVC. This means a 16-entry FVC is able to capture colors that cover 94% of the area covered by the actual 16 most frequent colors, while the 512-entry FVC is able to capture 98% of the coverage that the actual 512 most common colors are able to cover. Figure 14b shows how accuracy affects compression rates, as frequencies collected using larger FVCs are a better representation of the actual most common colors.
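The relative coverage metric defined above can be computed as in the following sketch. The function name and the flat list representation of a frame are assumptions made for the example; the exact per-color counts are obtained offline from the full frame.

```python
from collections import Counter

def relative_coverage(frame_pixels, fvc_colors):
    """Pixels covered by the FVC's colors, relative to the true top-N colors.

    frame_pixels: iterable of pixel color values for one frame.
    fvc_colors:   the N distinct colors held by the FVC.
    """
    true_counts = Counter(frame_pixels)
    n = len(fvc_colors)
    # Coverage of the actual N most frequent colors (the ideal case).
    ideal = sum(count for _, count in true_counts.most_common(n))
    # Coverage of the colors the FVC actually collected.
    captured = sum(true_counts[color] for color in fvc_colors)
    return captured / ideal
```

For a hypothetical 100-pixel frame with color counts of 50, 30, 15 and 5, a 2-entry FVC that holds the first and third colors covers 65 pixels against an ideal of 80, giving a relative coverage of 0.8125.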
[Figure 14a: relative coverage (vertical axis, 0.7 to 1) versus FVC size (16, 32, 64, 128, 256 and 512 entries) for UI and 2D workloads.]
(a) FVC size vs. Relative Coverage.
# of FVC entries   16     32     64     128    256    512
UI                 4.53   4.85   5.26   5.84   6.57   7.22
2D                 2.57   2.77   2.90   3.21   3.51   3.89
(b) FVC size vs. VDCP compression rate.
Figure 14: Comparing FVC size with relative coverage (a) and its effect on compression rates (b).

[Figure 15: normalized compression rate (vertical axis, 0.4 to 1) versus number of sets (2, 4, 8, 16, 32 and 64) in a 64-entry FVC, for UI and 2D workloads.]
Figure 15: 64-entry FVC associativity vs. the compression rate of fully associative FVC.

Replacement Policy and Associativity.
We evaluated a number of replacement policies: the baseline least-frequent color (LFC), second least-frequent (2LFC), least-recently-used (LRU), and random replacement. The idea behind including 2LFC is to see the effect of avoiding thrashing of newly discovered colors that are prone to eviction.
Using UI workloads with a 64-entry FVC, the mean compression rate with LFC is 5.26, while it is 5.25 for 2LFC. On the other hand, LRU and random achieve lower rates of 3.04 and 2.92, respectively.
We also evaluated changing FVC associativity from fully associative to direct-mapped, and used color channel values to determine the set. As expected, the FVC performance degrades as we increase the number of sets (as shown in Figure 15).
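A software model of a fully associative FVC with least-frequent replacement can be sketched as follows. This is an illustrative model only: the class name, the use of a dictionary of counters, and the choice to insert new colors with a count of one are assumptions made for the example, not a description of the hardware.

```python
class FrequentValueCollector:
    """Fully associative FVC model with least-frequent-color replacement."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.counts = {}  # color -> number of observed pixels

    def access(self, color):
        if color in self.counts:
            self.counts[color] += 1
            return
        if len(self.counts) >= self.capacity:
            # Evict the entry with the lowest count (oldest entry on ties).
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[color] = 1

    def top(self, n):
        """Return up to n colors, most frequent first."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]

def collect(pixels, capacity=64, sample_every=1):
    """Feed every sample_every-th pixel of a frame to a fresh FVC."""
    fvc = FrequentValueCollector(capacity)
    for color in pixels[::sample_every]:
        fvc.access(color)
    return fvc
```

In this model a newly inserted color holds the lowest count and is therefore the first candidate for eviction, which is the thrashing behavior that the 2LFC policy is meant to probe.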
Pixel Sampling.
We noticed that the FVC can be constructed using a subset of frame pixels, i.e., by sampling only one in every n pixels to collect frequent-color statistics.
Figure 16 illustrates the effect of pixel sampling on VDCP. We evaluate sampling rates from 1:1 (every pixel accesses the FVC) to 1:16384. 1:16 sampling achieves 98.7% (UI) and 102% (2D) of the compression achieved by 1:1 sampling. We suspect that the slightly higher compression rate for 2D workloads is caused by sampling acting as a noise filter.
[Figure 16: normalized compression rate (vertical axis, 0.75 to 1.05) versus pixel sampling rate for UI and 2D workloads.]
Figure 16: Normalized compression rates vs. FVC pixel sampling rate for UI applications.

[Figure 17: normalized compression rate (vertical axis, 0.2 to 1) versus sampling period in frames (2, 5, 8, 10, 15, 20, 30, 40, 50 and 60) for UI and 2D workloads.]
Figure 17: VDCP normalized (to sampling period of 1) compression rates vs. FVC frame sampling period.

Frame sampling.
In frame sampling, the same CCD is used for a number of frames (N) instead of for just one frame. We vary the sampling period (N) for VDCP between 1 (every frame) and 60 frames. Figure 17 shows compression rates relative to N=1. VDCP maintains good compression rates with N=2, with a relative compression rate of 97%. Compression rates, however, decrease significantly with higher N values, dropping to 44.6% and 43.3% for N values of 50 and 60, respectively.
7.4 Implementation Cost and Energy Savings
Section 5 mentions the storage requirements associated with DCP. Specifically, for a 64-entry FVC/CCD, 456 bytes are needed for the FVC and 264 bytes for the CCD. The cost of rCCDs is (264 bytes) × (maximum number of surfaces that can be read in parallel). Current systems support up to 16 surfaces [28].
For energy, we used DRAMPower v4.0 [30] to estimate the energy cost of accessing a MICRON 1600 x32 LPDDR3 DRAM. We found the cost of DRAM accesses to be around 451.2 pJ/byte (this number excludes DRAM idle energy and other system energy costs like the interconnection network). For DCP, we used CACTI v7.0 [31] with the 22nm process to estimate the area/energy/latency of DCP structures, as shown in Table 4.
Using the numbers in Table 4, the total area cost of DCP with support for 16 surfaces equals 0.009527672 mm². For comparison with current hardware, this is less than 0.003% of the die area of Nvidia’s Xavier [32]. For the dynamic energy cost of compressing/decompressing a byte using DCP, we found it to be around 1.3 pJ/byte, i.e., less than 0.29% of the DRAM access cost.
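The area and energy figures above follow from simple arithmetic, reproduced in the sketch below. The area total assumes one FVC, one CCD and sixteen rCCDs, which matches the reported number; the total storage figure applies the same composition to the byte counts and is a derived value, not one reported in the paper.

```python
# Per-structure area from Table 4, in mm^2.
FVC_AREA = 0.00304232
CCD_AREA = 0.00197988
RCCD_AREA = 0.000281592

# Storage for 64-entry structures, in bytes.
FVC_BYTES = 456
CCD_BYTES = 264
RCCD_BYTES = 264

def total_area(surfaces=16):
    """Area of one FVC, one CCD and one rCCD per concurrently read surface."""
    return FVC_AREA + CCD_AREA + surfaces * RCCD_AREA

def total_storage(surfaces=16):
    """Storage in bytes under the same composition (derived, not reported)."""
    return FVC_BYTES + CCD_BYTES + surfaces * RCCD_BYTES

def dynamic_energy_overhead(dcp_pj_per_byte=1.3, dram_pj_per_byte=451.2):
    """DCP dynamic energy per byte as a fraction of the DRAM access cost."""
    return dcp_pj_per_byte / dram_pj_per_byte
```

With 16 surfaces, `total_area()` gives 0.009527672 mm², `total_storage()` gives 4944 bytes, and `dynamic_energy_overhead()` is about 0.288%.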
Structure  Type   Area cost (mm²)  Leakage power (mW)  Access cost (pJ)  Access latency (ns)  Max bandwidth (MPixels/s)
FVC        CAM    0.00304232       0.881899            0.766572          0.131695             7241
CCD        CAM    0.00197988       0.39                0.402             0.128338             7431
rCCD       Cache  0.000281592      0.281112            0.104106          0.0722227            13203
Table 4: DCP structures hardware cost.

[Figure 18: ratio of blocks compressed using VDCP versus RAS (vertical axis, 0 to 1) for each workload, with workloads ordered by the ratio of VDCP-compressed blocks. Average VDCP ratio: 0.49.]
Figure 18: Ratio of DCP vs. RAS compressed blocks across all workloads.

Energy savings.
DRAM consumes around 199.6 mW (629.4 mW including static power) for framebuffer operations under a typical rate of 60 FPS using HD frames (GPU writing/display controller reading, or 949.21 MB/s). We calculated DCP's total compression/decompression static and dynamic energy consumption (4.83 pJ/byte) and compared it to the DRAM dynamic energy consumption alone (451.2 pJ/byte). We found that VDCP reduces the energy consumed by framebuffer operations by 79.9% for UI apps, 64.4% for 2D apps, and 41.8% for 3D apps.
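One simple way to relate these savings to the compression rates is sketched below. The formula is an assumption made for illustration (the text does not state it): DRAM dynamic energy is taken to scale with the number of bytes transferred, and the DCP cost of 4.83 pJ is charged for every uncompressed byte. Under that assumption, the effective VDCP compression rates of 5.26 (UI), 2.90 (2D) and 1.75 (3D) from Figure 13 reproduce the reported percentages.

```python
DRAM_PJ_PER_BYTE = 451.2  # DRAM dynamic access energy
DCP_PJ_PER_BYTE = 4.83    # DCP static + dynamic energy

def energy_saving(compression_rate):
    """Fraction of framebuffer DRAM energy saved under the assumed model."""
    dram_fraction = 1.0 / compression_rate          # traffic that remains
    dcp_fraction = DCP_PJ_PER_BYTE / DRAM_PJ_PER_BYTE  # codec overhead
    return 1.0 - dram_fraction - dcp_fraction
```

For example, `energy_saving(5.26)` is approximately 0.799, in line with the 79.9% reported for UI applications.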
7.5 Hybrid Schemes
Our hybrid compression scheme uses RAS and VDCP. We compress using both algorithms and then use the better of the two. This exploits VDCP's high compression rates for simpler surfaces while falling back on RAS for other cases. RAS+VDCP outperforms both RAS and VDCP (with rates of 7.2, 5.206 and 3.23 for UI, 2D and 3D applications, respectively). The ratio of VDCP vs. RAS compressed blocks varies by application, as shown in Figure 18. However, we found that, on average, VDCP and RAS compress a roughly equal number of blocks.
8. CONCLUSION
This work presents surface compression techniques
that reduce the off-chip bandwidth of framebuffer op-
erations in energy-constrained mobile devices. In this
work, we analyze and characterize the framebuffer sur-
faces of UI, 2D and 3D applications and highlight the
unique characteristics of each.
To evaluate our compression schemes, we created and
used a set of workloads that represents 124 popular
mobile applications. Our results show that VDCP im-
proves compression by an average of 93% relative to
RAS for UI applications, while improving UI and 2D
applications over RED by 89% and 50%, respectively.
DCP focuses on 2D and UI applications and can com-
plement other generic compression algorithms. We eval-
uated a hybrid VDCP+RAS (HDCP) scheme; the scheme
was able to increase compression rates by 163%, 79%
and 27% over RAS, and by 159%, 169% and 139% over
RED for UI, 2D and 3D applications, respectively.
9. REFERENCES
[1] J. Rasmusson, J. Hasselgren, and T. Akenine-Möller,
“Exact and error-bounded approximate color buffer
compression and decompression,” in
SIGGRAPH/EUROGRAPHICS Conference On Graphics
Hardware: Proceedings of the 22nd ACM
SIGGRAPH/EUROGRAPHICS symposium on Graphics
hardware, vol. 4, no. 05, 2007, pp. 41–48.
[2] NVIDIA, “NVIDIA Tegra X1 Whitepaper,” 2015. [Online].
Available: http://international.download.nvidia.com/pdf/
tegra/Tegra-X1-whitepaper-v1.0.pdf
[3] T. J. Olson, “Saving the planet, one handset at a time:
Designing low-power, low-bandwidth gpus,” in ACM
SIGGRAPH 2012 Mobile, ser. SIGGRAPH ’12. New
York, NY, USA: ACM, 2012, pp. 1:1–1:1. [Online].
Available: http://doi.acm.org/10.1145/2341910.2341912
[4] A. SurfaceFlinger, “SurfaceFlinger and Hardware
Composer,” 2016. [Online]. Available: https:
//source.android.com/devices/graphics/arch-sf-hwc.html
[5] Android, “Android : Graphics architecture,” 2016. [Online].
Available: https:
//source.android.com/devices/graphics/architecture.html
[6] J. Ström and T. Akenine-Möller, “iPACKMAN:
high-quality, low-complexity texture compression for mobile
phones,” in Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS conference on Graphics
hardware. ACM, 2005, pp. 63–70.
[7] T. Akenine-Möller and J. Ström, “Graphics for the masses:
a hardware rasterization architecture for mobile phones,” in
ACM Transactions on Graphics (TOG), vol. 22. ACM,
2003, pp. 801–808.
[8] I. Antochi, B. Juurlink, S. Vassiliadis, and P. Liuha,
“Memory bandwidth requirements of tile-based rendering,”
in Computer Systems: Architectures, Modeling, and
Simulation. Springer, 2004, pp. 323–332.
[9] J. Hasselgren and T. Akenine-Möller, “Efficient depth
buffer compression,” in SIGGRAPH/EUROGRAPHICS
Conference On Graphics Hardware: Proceedings of the
21st ACM SIGGRAPH/Eurographics symposium on Graphics
hardware: Vienna, Austria, vol. 3, 2006, pp. 103–110.
[10] A. Khodakovsky, P. Schröder, and W. Sweldens,
“Progressive geometry compression,” in Proceedings of the
27th annual conference on Computer graphics and
interactive techniques. ACM Press/Addison-Wesley
Publishing Co., 2000, pp. 271–278.
[11] J. Ström and M. Pettersson, “Etc 2: texture compression
using invalid combinations,” in Graphics Hardware, 2007,
pp. 49–54.
[12] J. Nystad, A. Lassen, A. Pomianowski, S. Ellis, and
T. Olson, “Adaptive scalable texture compression,” in
Proceedings of the Fourth ACM SIGGRAPH/Eurographics
conference on High-Performance Graphics. Eurographics
Association, 2012, pp. 105–114.
[13] J. Pool, A. Lastra, and M. Singh, “Lossless compression of
variable-precision floating-point buffers on GPUs,” in
Proceedings of the ACM SIGGRAPH Symposium on
Interactive 3D Graphics and Games. ACM, 2012, pp.
47–54.
[14] J. Ström, P. Wennersten, J. Rasmusson, J. Hasselgren,
J. Munkberg, P. Clarberg, and T. Akenine-Möller,
“Floating-point buffer compression in a unified codec
architecture,” in Proceedings of the 23rd ACM
SIGGRAPH/EUROGRAPHICS symposium on Graphics
hardware. Eurographics Association, 2008, pp. 75–84.
[15] T. J. Van Hook, “Method and apparatus for compression
and decompression of color data,” May 2 2006, US Patent
7,039,241.
[16] S. E. Molnar, B.-O. Schneider, J. Montrym, J. M.
Van Dyke, and S. D. Lew, “System and method for
real-time compression of pixel colors,” Nov. 30 2004, US
Patent 6,825,847.
[17] S. L. Morein and M. A. Natale, “System, method, and
apparatus for compression of video data using offset
values,” Jul. 13 2004, US Patent 6,762,758.
[18] B. H. Danielson, J. J. Watters, and T. J. McDonald,
“Method and apparatus for displaying computer graphics
data stored in a compressed format with an efficient color
indexing system,” Apr. 14 1998, US Patent 5,740,345.
[19] H. Shim, Y. Cho, and N. Chang, “Frame buffer compression
using a limited-size code book for low-power display
systems,” in Embedded Systems for Real-Time Multimedia,
2005. 3rd Workshop on. IEEE, 2005, pp. 7–12.
[20] D. Scherzer, L. Yang, O. Mattausch, D. Nehab, P. V.
Sander, M. Wimmer, and E. Eisemann, “Temporal
Coherence Methods in Real-Time Rendering,” in Computer
Graphics Forum, vol. 31, no. 8. Wiley Online Library,
2012, pp. 2378–2408.
[21] Flurry Analytics, “Flurry Five-Year Report: It’s an App
World. The Web Just Lives in It,” 2013. [Online]. Available:
http://www.flurry.com/bid/95723/Flurry-Five-Year-
Report-It-s-an-App-World-The-Web-Just-Lives-in-It
[22] Nielsen, “All about Android,” 2011. [Online]. Available:
http://www.nielsen.com/us/en/insights/webinars/2011/all-
about-android-insights-from-nielsens-smartphone-
meters.html
[23] JEDEC, “JEDEC LPDDR2 standard (JESD209-2F),” 2013.
[Online]. Available: https://www.jedec.org/standards-
documents/results/JESD209-2F
[24] ——, “JEDEC LPDDR3 standard (JESD209-3C),” 2015.
[Online]. Available: http://www.jedec.org/standards-
documents/results/jesd209-3c
[25] ——, “JEDEC LPDDR4 standard (JESD209-4A ),” 2015.
[Online]. Available: http://www.jedec.org/standards-
documents/results/jesd209-4a
[26] N. Kulshrestha, D. K. McAllister, and S. E. Molnar,
“Selecting and representing multiple compression methods,”
Oct. 7 2010, US Patent App. 12/900,362.
[27] Unity3D, “Deferred Shading Rendering Path.” [Online].
Available: https://docs.unity3d.com/Manual/RenderTech-
DeferredShading.html
[28] Vivante Corporation, “COMPOSITION PROCESSSING
CORES (CPC).” [Online]. Available:
http://www.vivantecorp.com/index.php/en/technology/
composition.html
[29] Khronos, “OpenGL ES Extension 91,” 2016. [Online].
Available: https://www.khronos.org/registry/gles/
extensions/NV/NV_draw_buffers.txt
[30] K. Chandrasekar, C. Weis, Y. Li, B. Akesson, N. Wehn,
and K. Goossens, “Drampower: Open-source dram power &
energy estimation tool,” URL: http://www.drampower.info,
vol. 22, 2012.
[31] R. Balasubramonian, A. B. Kahng, N. Muralimanohar,
A. Shafiee, and V. Srinivas, “Cacti 7: New tools for
interconnect exploration in innovative off-chip memories,”
ACM Trans. Archit. Code Optim., vol. 14, no. 2, pp.
14:1–14:25, Jun. 2017. [Online]. Available:
http://doi.acm.org/10.1145/3085572
[32] M. Ditty, A. Karandikar, and D. Reed, “Nvidia’s xavier
soc,” 2018. [Online]. Available: https://www.hotchips.org/
hc30/1conf/1.12 Nvidia XavierHotchips2018Final 814.pdf
