Exploiting Data Skew for Improved Query Performance by Zhang, Wangda & Ross, Kenneth A.
Exploiting Data Skew for Improved Query Performance *
Wangda Zhang
Columbia University
zwd@cs.columbia.edu
Kenneth A. Ross†
Columbia University
kar@cs.columbia.edu
Abstract—Analytic queries enable sophisticated large-scale
data analysis within many commercial, scientific and medical
domains today. Data skew is a ubiquitous feature of these
real-world domains. In a retail database, some products are
typically much more popular than others. In a text database,
word frequencies follow a Zipf distribution with a small
number of very common words, and a long tail of infrequent
words. In a geographic database, some regions have much
higher populations (and data measurements) than others.
Current systems do not make the most of caches for
exploiting skew. In particular, a whole cache line may remain
cache resident even though only a small part of the cache
line corresponds to a popular data item. In this paper, we
propose a novel index structure for repositioning data items to
concentrate popular items into the same cache lines. The net
result is better spatial locality, and better utilization of limited
cache resources. We develop a theoretical model for analyzing
the cache behavior, and implement database operators that
are efficient in the presence of skew. Our experiments on real
and synthetic data show that exploiting skew can significantly
improve in-memory query performance. In some cases, our
techniques can speed up queries by over an order of magnitude.
1. Introduction
In online analytic processing (OLAP) a user executes
a collection of complex queries over large data sets, in
order to understand the data at hand and to obtain ac-
tionable knowledge. With the increasing main-memory ca-
pacity of contemporary hardware, query execution can
occur entirely in RAM. Analytical query workloads that
are typically read-only need no disk access after the ini-
tial load. In response to this trend, several commercial
and research database management systems have been de-
signed (or re-designed) for memory-resident data [1]. Exam-
ples of recent systems include H-Store/VoltDB [2], Heka-
ton [3], HyPer [4], IBM BLINK [5], DB2 BLU [6], SAP
HANA [7], Vectorwise [8], Oracle TimesTen [9], Mon-
etDB [10], HYRISE [11], HIQUE [12], LegoBase [13],
Peloton [14], and Quickstep [15]. Most analytic database
* This is an extended version of a paper appearing at 36th IEEE Interna-
tional Conference on Data Engineering (ICDE 2020)
† Work partly supported by a gift from Oracle Corporation.
(a) Distribution of book titles (b) Access to cache lines (Q1)
Figure 1: An example of data skew in a real dataset.
systems use some variant of columnar storage, since only the
columns needed to answer the query need to be read [16].
Skew is a common feature of many real-world domains.
Power-law distributions apply to many types of data, in-
cluding word-usage in text databases, protein interactions,
internet routing node-degree, phone-call data, city popula-
tions, email contact-lists, surname frequencies, and paper
citations [17]. Several other kinds of skewed distributions
can also be found [18]. Figure 1(a) shows the cumulative
distribution of book reviews in the publicly available Ama-
zon reviews dataset, where books are ranked by the number
of reviews. The distribution is likely to be a representative
proxy of the actual sales data. As shown, the 100,000 most
popular book titles (< 5%) account for roughly 50% of the
entire data, and 75% of the sales concentrate on the top
500,000 titles.
Although skew has been recognized as common and im-
portant in empirical database deployments, standard bench-
marks such as TPC-H and SSB have specified uniform
distributions and avoided skew [19]. Even when benchmarks
such as TPC-DS adopt skew, they are constrained by query
comparability issues to make the data more uniform than it
would otherwise be [20]. As a result, there is little research
on the impact of skew on analytic query performance.
In the context of a data warehouse, skew is likely to
affect query performance. Consider a large fact table corre-
sponding to the sales of a bookseller. One of the columns
of this fact table is the id of a book, represented as an
integer foreign key bid into a dimension table of detailed
book information. Skew as shown in Figure 1(a) in the bid
column would be expected due to different popularity of
the books. The following example queries utilize bid in an
important way.
ar
X
iv
:1
91
0.
10
06
3v
1 
 [c
s.D
B]
  2
2 O
ct 
20
19
-- Q1: Consult a column of the dimension table
Select B.price
From Sales S, Books B
Where S.bid = B.bid
Query Q1 executes a foreign key join to read or material-
ize a column from the dimension table. The obtained prices
can then be used to calculate revenues. To answer this query,
the database system can scan the fact table, and look up the
price column of table B using S.bid as an array offset.
If the number of products is such that the price column
is larger than the cache in size, and S.bids are uniformly
distributed in the dimension domain, then this lookup could
incur expensive cache misses due to unclustered memory
acccesses, which can often be the dominant cost [21]. With
skew in S.bid, the situation is somewhat better because the
most frequently occurring items are likely to reside in the
cache. Nevertheless, the cache is still underutilized because
a single cache-resident cache line will typically hold a small
number of popular items and many unpopular items.
-- Q2: Filter the fact table based on B.price
Select S.*
From Sales S, Books B
Where S.bid = B.bid and B.price < 100
To process this selective filtering, the database system
can preprocess the dimension table to determine which
products meet the B.price<100 condition. It can then
create a bitmap indexed by bid that can be consulted using
S.bid as an offset. Q2 has similar cache miss issues to
Q1, except that the data being consulted is one bit per bid
rather than several bytes. As a result, the product cardinality
thresholds for cache-residence will be larger under the same
cache capacity.
-- Q3: Aggregate grouped by book ids
Select bid, count(*)
From Sales
Group by bid
For this query, the database system can create a table
of counts, indexed by bid, and update the count for each
corresponding fact table record in turn. The memory access
pattern (and cache behavior) is similar to Q1, with the added
observation that because cache lines are updated, cache-line
replacement triggers some additional memory-write traffic.
-- Q3a: Heavy hitter counts
Select bid, count(*)
From Sales
Group by bid
Order by count(*) desc
Limit 4000
Query Q3a is similar to Q3 except that we are only
interested in the counts of heavy hitters, which for this
query means the bids with a count among the top 4,000
book titles. Because the cache footprint of the heavy hitters
is much smaller than the entire bid domain, there are
opportunities for further performance enhancement if the
candidate heavy hitters can be identified in advance.
For these core database operations, skew-aware data
management is not adequately considered in existing main
memory database architectures. In this paper, we propose a
novel index structure called a permutation index for reorder-
ing data items by their access frequency (Section 3). Under
skewed data distribution, popular data items are concentrated
into common cache lines using permutation indexes, leading
to improved locality for query processing. By carefully
organizing data at the cache line level, we can exploit the
data skew for better performance, which has not be utilized
by previous research.
The permutation index method is simple, yet very effec-
tive at improving the cache utilization. Figure 1(b) illustrates
the cumulative frequency of cache line accesses for the
example book dataset during query Q1. Under a random-
ized data layout, the most popular book titles are sparsely
distributed, so the distribution of cache line accesses is
smoothed out. Even with significant data skew, the most
frequent 8,000 cache lines (about the size of L1 cache using
4-byte bids) only correspond to 22% of the fact table data
(see the dashed line). With the permutation index, frequent
data are concentrated and the percentage increases to 52%.
In Section 3.3, we further analyze the improvement in cache
hit rates.
Built on permutation indexes, we develop efficient
database operators in the presence of skew. Permutation
indexes provide frequency information about data items,
allowing for threshold-based algorithms that execute differ-
ent code paths for items with different degrees of skew.
To take full advantage of modern architectures, our im-
plementation makes use of single-instruction multiple-data
(SIMD) instruction sets, multithreaded execution, and soft-
ware prefetching (Section 4).
Finally, we present a detailed experimental evaluation
of our techniques in Section 5. We conduct experiments
on Intel Skylake and Xeon Phi processors, using both syn-
thetic microbenchmarks and real data sets. Our results show
that exploiting skew and reordering data can significantly
improve performance, making queries using permutation
indexes up to an order of magnitude faster.
2. Background
2.1. Data Skew
As introduced in Section 1, a wide variety of real-world
phenomena approximately follow skewed distributions [17],
[18]. For database systems, skewed distribution of empirical
data has been recoginized in previous studies as a serious
challenge to efficient query processing, since databases have
to carefully handle parallel query execution and avoid per-
formance degradation due to the long tail effect [22]. Various
partitioning algorithms and load balancing strategies have
been studied to combat skew [23], [24]. Different from these
studies, we demonstrate that uneven distribution of data can
in fact improve performance when the skewed accesses are
exploited.
While the techniques described in this paper apply to
any skewed distribution, for ease of exposition, we focus
on Zipf distributions in our microbenchmark study because
they are common and allow one to model the degree of
skew with a single parameter z. In a Zipf distribution, the
frequency of the data item having rank r is proportional
to r−z . z = 0 corresponds to a uniform distribution, while
many real-world skewed data sets can be modeled by Zipf
distributions with z ≈ 1 [17], [25].
2.2. Data Representation
For physical storage, we assume a columnar format as
is common in analytic databases. An individual column is
represented as a dense array of integers. The compact integer
representations can be used as data values and/or as foreign
key offsets into dimension tables stored as a collection
of column arrays. When the true data values are strings,
products, or other types of non-numerical data, databases
will typically use some form of dictionary coding to map
the domain to a contiguous range of integers.
2.3. Architectural Issues
We assume a columnar format as is common in analytic
databases. An individual column is represented as a dense
array of integers. The compact integer representations can
be used as data values and/or as foreign key offsets into
dimension tables stored as a collection of column arrays. In
order to perform well on such RAM-resident data, database
management systems must be sensitive to modern computer
architectures. We highlight two significant trends in modern
architectures that influence implementation choices.
Single Instruction Multiple Data. Modern processors
support single-instruction multiple-data (SIMD) instruction
sets. These instructions process many data items at a time,
enhancing the data-parallelism of algorithms that can be
written in a SIMD fashion. Further, SIMD instructions con-
vert control dependencies to data dependencies, helping to
eliminate branch misprediction latencies [26]. Current main-
stream CPUs (e.g., Skylake [27]) and Xeon Phi processors
(Knights Landing [28]) now support the 512-bit extensions
(AVX-512).
Cache and TLB Performance. Modern architectures
provide multiple levels of data and instruction caches. These
small but fast memories improve performance when algo-
rithms display spatial or temporal locality. Conversely, algo-
rithms that ignore the caches (e.g., by randomly accessing
data structures whose footprint exceeds the cache size) incur
many cache misses and/or TLB misses that can reduce per-
formance by an order of magnitude. Addressing cache and
TLB performance is a necessity for a database management
system given that most data structures for representing the
underlying data will be much larger than the data caches.
When access patterns are deterministic, prefetching can hide
some of the memory access latency. For common patterns
such as sequential access, the hardware can automatically
prefetch data that is soon needed.
In some cases, these architectural features interact. For
example, SIMD gather and scatter instructions are used to
access multiple memory locations in a single instruction.
Such instructions are highly efficient on L1-cache-resident
data, but perform no better than scalar code on data outside
the L1 cache [29], [30]. This kind of observation argues
for synergy. SIMD optimizations enhance the potential of
data locality to improve performance, and cache locality
optimizations stand to benefit from additional SIMD en-
hancement.
2.4. Cache Performance Optimizations
For the example queries of Section 1, some existing tech-
niques could be used to improve performance by optimizing
the cache behavior. One approach is to measure the footprint
of the accessed data column to see whether it is larger than
the performance-critical cache. If so, a range-partitioning
pass over the input bid references (and their fact table
payloads for Q1 and Q2) can redistribute the data into frag-
ments. With a sufficient partitioning factor, each fragment
will reference a subset of bids that fits into the cache.
As long as the partitioning pass is done efficiently with
mostly-sequential data accesses and few cache misses [31],
the overhead of partitioning may be smaller than the gain
from avoiding cache misses. On the other hand, multiple
passes through the very large sales table would be needed,
meaning that the overhead is nontrivial even if there is a
performance improvement.
An alternative approach is to use software prefetching
to overlap the latencies of multiple cache misses [29]. A
prefetching distance d is determined empirically, and the
bid value for the record d steps ahead of the current record
is prefetched into the cache using machine-specific prefetch
instructions. The hope is that the cache miss latency is paid
while the processor is doing useful work on other items.
Modern CPUs can have as many as 10 outstanding memory
requests per core, allowing many prefetches to be in flight
at once.
While prefetching helps by overlapping latencies, it does
not completely eliminate cache miss effects in workloads
that are memory-bandwidth bound. In the queries of Sec-
tion 1, there is limited work that can be done while waiting
for the miss to resolve. Therefore, there remains an oppor-
tunity for methods that reduce the total volume of data that
needs to be brought into the cache from slow memory.
3. Permutation Indexes
To address the cache utilization problem as outlined in
Section 1, we propose a novel index structure that we call
a permutation index. We start by identifying a fact table
column C with an integer data type. For simplicity we
assume that the integer column is an offset into a dimension
table as in the example queries. Let |C| denote the number
of distinct values appearing in column C.
The permutation index is an array of size |C| containing
the values appearing in column C in decreasing frequency
order. The left part of Figure 2(a) shows an example in
which 5 is the most frequent value in the referencing column
(Figure 2(c)), then 1, 8, 4, etc. The arrows in Figure 2(a)
A
B
C
D
E
F
G
H
5
1
8
4
3
7
6
2
...
Permutation Index Dimension Columns
(a) Dimension positions
A
B
C
D
E
F
G
H
...
Dimension Columns Permutation Index (Rank)
2
8
5
4
1
7
6
3
(b) Frequency ranks
123
456
789
864
753
222
567
...
8
5
1
5
5
4
1
...
Fact Table FK
...
(c) Original table
123
456
789
864
753
222
567
...
3
1
2
1
1
4
2
...
Fact Table r(FK)
...
(d) Transformed table
Figure 2: (a) A permutation index, (b) its alternative representation, (c) a foreign key column C before applying the
permutation index transformation, and (d) the same column C ′ with new identifiers after the transformation.
show how these integers may be interpreted as offsets into
a dimension table (starting from 1). Figure 2(b) shows an
equivalent representation of the same information in which
the popularity rank of each distinct data item in C is stored
as an additional column of the dimension table. If there are
dimension table offsets that do not appear in C, then the
representation of Figure 2(a) will be slightly more compact
than Figure 2(b) because no information needs to be stored
for the missing values.
Note that the array representation of a permutation index
using |C| log |C| bits is already within O(|C|) bits of the
information theoretic bound of log(|C|!) [32]. To support
efficient inverse operations, we can store a permutation
index using compressed data structures for permutations,
so that both lookups and reverse-lookups take sublinear
time [32], [33]. Alternatively, we can store both the offset
and rank representations, so that lookups in either direction
take only constant time.
3.1. Index Building and Maintenance
To achieve a better access pattern, we must recode
the references in column C into a new version with the
permuted references. For the data of Figure 2, references
to dimension table row i are replaced by references to row
Rank(i) according to the rank representation in Figure 2(b)
to obtain Figure 2(d). We assume that this replacement in the
fact table has occurred prior to the query execution, typically
when the database administrator (or automated physical
design optimizer) decided to build the permutation index.
Replacement of one integer value with another yields no net
change in space for column C. The space cost of building
the permutation index is one integer value per dimension
table row, and is independent of the size of column C of
the (presumably large) referencing column in the fact table.
Building the permutation index can be implemented as
an aggregation query (Q3) to obtain the frequency of distinct
dimension values, taking linear time, followed by sorting the
values by their frequencies. The sorting step takes super-
linear time but it executes over the much smaller dimension
domain. Then, the identifier replacement in a fact table can
be implemented as a materialization query (Q1) using the
permutation index we just built, again taking linear time.
After this transformation, the old column C (Figure 2(c)) is
dropped, and replaced with a new column C ′ with the new
identifiers (Figure 2(d)).
Updates to the permutation index can be handled without
major reorganization by simply ignoring the effect of fact
table insertions/deletions on the frequencies of existing do-
main values. When a new dimension table row is inserted,
it is appended to the end of the dimension table, and its
identifier is appended to the end of the permutation index
as the new least-frequent item. A small number of updates
is unlikely to dramatically change the relative ordering of
popular value frequencies from a large fact table, so these
simple choices will preserve most of the performance ad-
vantages of the structure even after a few updates. After
many updates (e.g., a batch update in a data warehouse) the
index should be rebuilt.
3.2. Overview of Query Processing
The permutation index functions as a reordering tem-
plate for preprocessing the dimension table during the initial
phases of query processing. We create a copy of the needed
dimension table columns for a query, permuted according
to the corresponding permutation index. For Figure 2(a),
assuming we needed the first column, we would create an
intermediate result in the order E,A,H,D,C,G,F,B. Because
the cost of the query is likely to be dominated by the large
fact table containing the source column C, preprocessing the
dimension table in this way will be relatively fast. Note that
a dimension table may have several permutation indexes that
refer to it from different source columns, each with different
orders. If the dimension table is referenced just from a single
source, then its reordering can be done entirely ahead of
query processing.
Having reordered the data, popular items are adjacent
and therefore are likely to share cache lines with other pop-
ular items. Therefore, during query processing if we access
the data via offsets in the reordered table, we will get much
better utilization of all cache levels, particularly when there
is skew in an otherwise large domain. As we shall discuss
in detail in Section 4, several basic database operations can
make use of a permutation index and the reordered data to
improve efficiency (in different ways). Since a permutation
index simply replaces a fact table column and adds a copy
of the dimension column, there is minimal impact on other
database queries.
Using permutation indexes over skewed data essentially
reorders data to change the memory access pattern dur-
ing query execution. Physically reordering the dimension
table by access frequency would eliminate the need for
an explicit permutation index. However, we can impose
only one ordering on the dimension table and there may
be many competing demands (e.g., ordering by a domain
value to support lexicographic comparisions via identifiers,
or additional references from fact columns having different
skew properties). A similar observation holds for the fact
table, where reordering rows might help locality in the
ordered columns, but there can be only one order unless
data is stored redundantly. In light of these observations,
one can think of permutation indexes as a way to influence
physical database design. For some skewed data columns,
permutation indexes eliminate the need for physical ordering
of the fact table by those columns in order to get good cache
performance. Physical organization can then focus on the
remaining columns that may have more serious nonlocality.
3.3. Cache Behavior Analysis
To decide whether to build a permutation index or not,
we need to estimate the improvement in cache utilization.
A sophisticated query optimizer also requires knowledge of
the cache behavior to compute the cost of query execution,
together with hardware characteristics. For these purposes,
we now study a model for estimating the cache hit rates
under different data layouts.
Prior work has developed analytical cache models based
on stack distances [34], [35]. The stack distance is the num-
ber of distinct cache lines referred between two references
to the same line. For an LRU cache with S lines, accesses
with stack distance less than S will be cache hits, while
others are misses. To apply this model in our setting, we
need an estimation of access frequency to the cache lines,
and the stack distance distribution for each line.
Conventional databases store statistics about column
distributions to help estimate the selectivities of query
conditions. It is also possible to sample the data and fit
the samples to known distributions [36]. Most power-law
distributions can be modeled with a few parameters, such
as the slope and intercept in a log-log plot. Given the
value frequency distribution, we can then map it to the
cache line distribution for a particular layout. For example,
using permutation indexes over 4-byte values, the access
frequency of the most frequent 64-byte cache line would be
the sum of the 16 largest frequencies in the value distribu-
tion. For a randomized mapping from value distribution to
cache line distribution, the computed frequency would be
an approximation.
Let fi denote the access frequency to cache line Li,
where 0 ≤ fi ≤ 1 and
∑
i fi = 1. For line Li, we
model its next reference as a geometric distribution with
probability p = fi. Suppose after k trials, we see cache
line Li. According to the stack distance model, if there
are fewer than S distinct lines occurred within the k trials,
then the reference will be a cache hit. Due to data skew
(espectially the very frequent lines), there are potentially
repeated occurrences of the same cache lines within the k
trials. For every line Lj where j 6= i, its expected number
of occurrences is
nj = k ∗ fj/(1− fi)
If nj > 1, then there are (nj − 1) repeated occurrences of
Lj . Thus, an estimation of the number of distinct lines is
d = k −
∑
j 6=i
max(0, nj − 1)
As k increases, d also increases (at a lower rate).
For cache line Li, we want to find a threshold K so that
after K trials, we see Li and there are estimated d = S
distinct lines Lj 6=i seen within the K trials. Then we know
for any k < K, we have d < S so the access will be a
cache hit. To compute the threshold K, a binary search can
be used where K is at least S. Modeled as a geometric
distribution, the cumulative distribution (CDF) of (k < K)
for Li is
cdfi = 1− (1− fi)K
Therefore, the overall estimated cache hit rate is
∑
i(fi ·
cdfi).
Empirically we find this model is accurate with errors
less than 5% on our microbenchmarks given that we have a
good estimation of the cache line frequencies. In practice,
we can also sample the cache line distribution directly to
verify the accuracy of the model and adjust accordingly if
there is partial clustering, which we plan to address as future
work.
4. Skew-Aware Operator Implementation
We now discuss how to implement efficient database
operators that can take advantage of permutation indexes
and the reordered data. As introduced in Section 1, we focus
on three types of basic database operations: materialization
(Q1), selection (Q2), and aggregation (Q3 and Q3a).
The benefits of the proposed permutation index approach
are twofold. First, reordering data using permutation indexes
improves cache utilization during execution of important
database operations (Section 3). Second, the transformed
identifiers in the fact table (Figure 2(d)) provide valuable
information about data frequency, which the operators can
exploit to perform threshold-based processing (Section 4.2).
In this way, the operators are “aware” of the degree of skew,
and are able to take appropriate code paths for different
actions.
We find that even a straightforward scalar implementa-
tion of these operators can take advantage of the improved
cache utilization. To further enhance performance, we study
the use of a variety of techniques including SIMD vector-
ization (Section 4.1), multithreaded execution (Section 4.3),
and software prefetching (Section 5.1).
4.1. Data-Parallel Execution
Data-parallel execution using SIMD instructions has
been successful at speeding up various database opera-
tions [26], [30], [37], [38], [39], especially when the data
is cache resident. For data-parallel read and write, we use
AVX-512 gather and scatter instructions extensively. For
simplicity of presentation, we assume a database column
is simply an array of 32-bit integers. Most AVX-512 in-
structions provide variants for other data types including 8,
16, and 64-bit integers as well.
Materialization. The operator uses the fact table values
as indexes to perform gather instructions from the dimen-
sion data array. Given 32-bit integer values, an AVX-512
gather instruction retrieves 16 dimension values at once.
The latency of this gather operation depends directly on
the number of cache misses occurred during its execution
(Section 4.2.1).
Selection. This operation produces an array of qualifying
fact table offsets. After preprocessing the dimension table
using the selection condition to obtain a bitmap, the operator
checks each row of the fact table against the bitmap using
the referencing identifier as the index, writing out the row
offset in the fact table if the bitmap testing succeeds (the
scalar implementation needs to be branch-free to avoid the
branch misprediction penalty). Whenever the operator needs
to test a random bit, reads from the bitmap incur random
memory accesses, similar to materialization. For SIMD, we
use gather and shift instructions to compute addresses in the
bitmap and to extract the bits into a mask. Using the AVX-
512 compressed store instruction vpcompressd, we can
contiguously store the selected fact table offsets (those with
their respective bits set in the mask) into an output array.
Aggregation. This operation generates an array of nu-
meric types to compute an aggregate of some fact table col-
umn, grouped by the dimension table offset. The implemen-
tation basically scatters into an output array after gathering
old aggregates and peforming arithmetic computations. In a
scalar implementation, the memory access pattern is similar
to materialization with additional writes.
SIMD aggregation, however, needs an additional step to
check for conflicts before scattering, since different SIMD
lanes may write to the same memory location. When in-
ternal conflicts occur, we identify a subset of conflict-free
SIMD lanes using an AVX-512 conflict detection instruction
vpconflictd, and allow only writes from this subset
to succeed using the masked scatter instruction. The other
SIMD lanes are retained for processing during the next iter-
ation, alongside new data. If there are many conflicts (e.g.,
when fact table data is highly skewed), then performance de-
teriorates severely due to this conflict resolution step. In the
worst case where all SIMD lanes attempt to update the same
aggregate value, only one (instead of 16 assuming 4-byte
integers) can proceed. We shall discuss this problem further
Figure 3: Materialization performance with varying cache
misses per 16 elements.
in Section 4.2.2. Note that these SIMD implementations do
not rely on permutation indexes. Data-parallel optimizations
and cache locality optimizations are orthogonal, but they
work together to further boost performance.
4.2. Threshold-Based Processing
An important insight in the use of permutation indexes
is that the transformed identifier in the fact table (see Fig-
ure 2(d)) can be used as a proxy for the value frequency
in the table. Given an estimate of the function mapping
the numeric identifier to the value frequency, the system
can estimate the likely cache behavior at each level using
the model in Section 3.3. In general, the most frequently
accessed data are likely cached under common cache re-
placement policies. Given the available cache capacity, it
is possible to derive a threshold t such that accesses to
data items with identifiers smaller than t are likely cache
hits, while identifiers larger than t lead to misses. We can
therefore choose different code paths for items based on
their anticipated cache residence using a simple comparison
between the identifiers and the threshold t.
4.2.1. Materialization. One situation where such an opti-
mization pays off is when most, but not all data references
are cache hits. Consider a materialization query like Q1
in which roughly one in 16 accesses is a cache miss,
and the other 15 accesses are cache hits. A profile of the
performance of materialization, with data constructed to
achieve specific cache miss rates, is shown in Figure 3. For
this profiling, we use 128 million dimension keys and 1
billion fact table tuples (the default microbenchmark setting
in Section 5.1). Using single-thread implementations of both
scalar and SIMD versions described in Section 4.1, we
measured the elapsed time of the query execution on a
Skylake processor and a Xeon Phi processor.
On both platforms, there is a big jump from 0 misses
to 1 miss per 16 references, and then a less steep increase
with additional misses. When there are no misses, the code
can run at full capacity with essentially no memory traffic.
At one miss, there is a stall on average once for each SIMD
instruction. Additional misses have a less dramatic impact
because the system can have multiple outstanding misses at
any given time.
Another observation from Figure 3 is that the SIMD
version has about 2x better performance over the scalar ver-
sion when data is cache resident for the Phi. When there are
many cache misses and the performance is bounded by the
memory bandwidth, scalar and SIMD implementations have
similar performances. In other words, SIMD optimization is
most effective for accessing cache-resident data.
Given the apparent performance difference between in-
and out-of-cache accesses, it would therefore be advanta-
geous to split a data stream with a low cache miss rate
into two pieces: The first (and largest) fragment always hits
the cache and runs at cache speeds. The smaller fragment
contains the likely misses, but when misses occur they occur
together and their latencies can be overlapped.
The implementation for such a splitting method can be
done in a data-parallel way. Based on an estimated threshold,
a SIMD comparison operator determines which items are
likely hits, and their references are resolved immediately. We
use masked gather instructions in AVX-512 to directly write
materialized data into the output array. For items failing
the comparison, the item and its address are written to the
tail of a buffer. A second pass then iterates through the
buffer to resolve the remaining references. In Section 5.1, we
demonstrate that for a materialization operation, the splitting
method performs better than the baseline when there are one
or two cache misses per 16 accesses.
4.2.2. Aggregation. The frequency information implicit in
transformed identifiers can also be used to improve the
performance of aggregation. As we discussed in Section 4.1,
SIMD implementations of aggregation perform poorly when
the degree of skew is high in data. The reason for this
performance degradation is that the most popular data items
become so common that conflicts within the SIMD scatter
step are frequent. In other words, two or more different
SIMD lanes try to update the aggregate value (e.g., count)
for the same group. Conflicts take several steps to resolve,
and during this resolution process additional conflicting
values can be read into the SIMD register, exacerbating the
problem.
To mitigate this behavior, one can avoid conflicts on
common items by re-mapping accesses to the most frequent
items to distinct copies, one per SIMD lane. Identifying
whether an item is among the most common is simple: it
has an identifier that is no more than some threshold t.
Because of data skew, remapping just a few of the most
frequent items can already be quite effective in reducing
conflicts. For example, in our microbenchmark experiments
(Section 5.1), remapping the 40 most common 4-byte values
was a good choice. In the modified aggregation algorithm,
we keep 16 copies of each of the top 40 values. Accesses
to these common values never conflict because the copy
used is determined by the SIMD lane. The remaining items
have only one copy, as before. The copies can be stored
immediately before the original data array, preserving data
contiguity. For SIMD lane i (where i = 0 . . . 15), if the
identifier loaded for this lane is not greater than t, then the
copy at offset (−i ∗ t) is updated.
Using SIMD instructions, the remapping can be simply
implemented as masked arithmetic computations based on
the comparison with the threshold t. The top-t locations
in the original array now store only the aggregates from
a single SIMD lane. When t = 40, the cache footprint is
bigger by 40 ∗ 15 ∗ 4 = 2400 bytes, which is small relative
to typical cache sizes and thus of minor impact. In the end,
combining the copies for all SIMD lanes produces the final
aggregates. In our experimental evaluations, we find this
copying method effectively reduces conflicts in case of high
skew, drastically improving performance.
Heavy hitter aggregation. The permutation index has
direct benefits for queries like Q3a that compute counts
for just the most frequent elements (i.e., heavy hitters).
The most common 4,000 values for Q3a are simply the
transformed identifiers from 1 to 4,000, assuming that the
permutation index is up-to-date. We can directly aggregate
these values and ignore the rest (using the limit value as a
cut-off threshold), unlike conventional methods that would
need to compute the exact counts for all bid values. The
cache footprint of this approach is much smaller. For 4-
byte integers, the output array is entirely L1-cache resident.
Since there are almost no out-of-cache memory accesses,
the performance of heavy hitter aggregation is significantly
better than a full aggregation.
4.3. Intra-Query Parallelization
A modern CPU contains many cores, each capable of
independent work. It is therefore essential that this available
parallelism is exploited by implementing multi-threaded ver-
sions of query processing algorithms. Fortunately, materi-
alization (Q1) and selection (Q2) operations are relatively
easy to parallelize. We can partition the input fact data
into non-overlapping chunks so that different threads can
work on different chunks independently. Shared reads (e.g.,
from a common dimension array) typically do not cause
performance issue since they are cached across cores.
Aggregation (Q3) is trickier because independent threads
may try to update the count for the same bid. Similar to the
SIMD conflict detection, implementations must guarantee
the atomicity of potentially conflicting updates to the shared
result array, which may induce a performance overhead.
Data skew potentially exacerbates the problem by increasing
the probability that conflicting updates occur.
On the other hand, to fully avoid conflicts and the
overhead, each thread can have its own independent copy of
the result array, so that threads do not concurrently update
the same memory location. The obvious disadvantage is
memory consumption, which is multiplied by the number
of threads used. Another cost is the overhead of the final
aggregation that combines partial results from all threads,
which gradually becomes non-negligible with more threads
used.
In case of data skew, we are able to get the best of both
approaches. We use a hybrid approach to allow different
threads access to their own private versions of the hot data,
while using atomic operations on shared representations of
TABLE 1: Hardware Specifications
Microarchitecture Skylake Knights Landing
Model Number 8175M Phi 7210
Clock Frequency 2.5 GHz 1.3 GHz
Cores x SMT 24 x 2 64 x 4
L1 Size / Core 32 KB 32 KB
L2 size / Core 1 MB 512 KB
L3 Size 33 MB -
Memory Bandwidth 60 GB/s 55 GB/s
the less frequent data. Such an approach reduces conflicts
while remaining space-efficient, similar to recent approaches
to parallel aggregation [40], [41], [42]. Given the available
memory space and the frequency estimator, it is possible to
derive (or obtain experimentally) a threshold for distinguish-
ing the frequent data based on the transformed identifiers.
As we demonstrate in our experiments, a small threshold is
often enough to speed up queries in the case of skewed data.
For example, typically each thread can use a private buffer
for the most frequent 8,192 data items so that the buffer fits
in the L1 cache, using four-byte identifiers. A limitation of
the current SIMD extensions is that they lack support for
vector atomic instructions. Introduction of such instructions
could significantly improve performance [43].
5. Experiments
We conducted experiments on an Intel Skylake CPU
and on an Intel Xeon Phi (Knights Landing) processor,
both of which support AVX-512. The Xeon Phi does not
have a L3 cache, and its 1MB L2 cache is shared between
two adjacent cores. Table 1 summarizes the details of our
experimental platforms. Both machines are configured to
use 2MB hugepages to avoid the TLB thrashing problem
discussed in Section 2.3. 1GB hugepages are also available
and result in similar performance. Both hugepage options
lead to much better performance than using the regular 4KB
pages.
Our code was compiled using GCC 7.3 with -O3
optimization and loop-unrolling enabled, and ran on 64-
bit Linux operating systems. Performance counters such
as cache misses and branch misses were obtained using
the perf events interface. We employed thread pinning to
avoid undesired thread migration, maximizing the utilization
of private cache and local memory on multicore NUMA
platforms.
5.1. Microbenchmark Results
We have implemented microbenchmarks to evaluate the
potential performance improvements enabled by the per-
mutation index. We simulate skew in empirical data by
generating zipf distributions with varying zipf parameters.
For a particular zipf factor, we vary the number of keys
in a dimension table (i.e., the cardinality of the domain).
We fix the column cardinality to 1 billion tuples in a fact
table, and assign key values (4-byte integers) from the
dimension domain to the fact table column under the chosen
zipf distribution, in random order. We also experiment with
varying sizes of the dimension domain.
For the baseline approach, the tuple identifiers in a
dimension table are randomized (denoted as “rand” in the
figures). For the permutation index approach, the dimension
column is sorted by frequency and the referencing column
in a fact table is preprocessed to hold transformed identifiers
(denoted as “freq”). Using both scalar and SIMD (AVX-512)
implementations, we measure the elapsed time for database
operations discussed in Section 4. For a particular operation,
all tuples in a column are executed in a tight loop. We
also evaluate the approaches both with and without software
prefetching of the data references.
Figure 4 shows the performance of materialization (Q1)
using a SIMD implementation, which was faster than the
scalar implementation for this query. Figure 4(a) shows that
at z = 1.0, a common distributional parameter in practice,
permutation indexes speed up the query by more than 20%
on both architectures. With the number of keys fixed at 128
million, Figure 4(b) shows that performance improvements
occur over a range of z values. For smaller z values, the data
is close to uniform and both methods suffer cache misses.
For larger z values, the skew is so concentrated in a handful
of data values that both methods enjoy cache hits most of
the time. Without software prefetching, the speedup at z = 1
and 128 million keys is 1.3x on the Skylake and 1.8x on
the Phi.
Figure 4(c) demonstrates that the time improvements on
the Phi are indeed due to cache behavior. Half of the data
loads are to the array of probes (almost always hits due to
clustered sequential access and hardware prefetching), while
half are to the accessed data in the dimension array. Thus, at
small z values we see a 50% load hit rate in the L1 cache,
and also a 50% miss rate for the L2 cache. The dimension
table footprint is much bigger than the L2 cache, making L2
hits that are L1 misses insignificant at small z values. As z
increases, the L1 hit rate of the permutation index method
is better (higher) than the randomized method, while the
L2 miss rate corresponding to the access frequency from
RAM is also better (lower). In addition, the permutation
index method has much better data TLB utilization at 128
million keys. For example, at z = 1.0, the DLTB miss rate
is 2.6% using permutation indexes, while the miss rate of
the baseline method is 20%.
As discussed in Section 2.4, it is often possible to utilize
out-of-order execution and software prefetching to hide the
latency of cache misses. Figures 4(a) and 4(b) also show
that the benefits of permutation indexes are still present
even when prefetching is used. For example, at z = 1.0
and 128 million keys, using permutation indexes on top of
prefetching makes the query 3x faster on the Phi. For these
results, we chose the prefetch distances empirically. Fig-
ure 4(d) shows performance with varying prefetch distances.
Since Skylake does not support AVX-512 PF extensions,
we simulate the vpgatherpf instruction with a loop of
16 scalar prefetch instructions. As shown in the figure,
the best prefetch distance on the Phi is 3(x16) for the
baseline and 6(x16) for the permutation index method. As
(a) Varying number of keys, z = 1 (b) Varying zipf parameter, 128M keys (c) Cache behavior (Phi), 128M keys
(d) Varying prefetch distance, z = 1, 128M
keys
(e) Varying cache misses per 16 elements,
using threshold-based processing
(f) Multithreaded performance, z = 1, 128M
keys
Figure 4: Materialization (Q1) performance using permutation indexes (SIMD).
we prefetch more, the baseline approach quickly suffers due
to cache thrashing, while the permutation index approach
is almost unaffected. In general, we find that permutation
indexes enable longer prefetch distances comparing with the
baseline, making our approach even faster. On the Skylake,
it turns out that the prefetch overhead is more than the
benefits from prefetching, so we did not show its prefetch
performance in Figures 4(a) and 4(b).
In the case of high skew, Figure 4(e) presents a zoomed-
in view of the SIMD performance (the red and green curves)
from Figure 3 for 0–6 cache misses, together with the per-
formance of the threshold-based splitting methods described
in Section 4.2.1. Without splitting, there is a big difference
of 2.1–3.1x from no cache misses at all to just a single miss,
while subsequent misses cause much less latency. When the
number of cache misses is small (1 or 2 per 16 elements),
the threshold-based implementation splitting in- and out-of-
cache accesses is better than the baseline. With more misses,
the benefits of this optimization are gradually outweighed
by the overhead of writing out intermediate buffers.
Figure 4(f) shows the performance of multithreaded
materialization. We vary the number of threads from 1
to 256 on the Phi, and to 48 on the Skylake. Since the
performance scales almost linearly up to 16 threads, we
omitted the results with fewer than 8 threads used. Even
though threads are sharing cache resources, the permutation
index method always performs better, achieving a speedup
of over 1.3x on both platforms with varying threads. On the
Phi, using more than 64 threads does not provide any more
benefits than using the baseline approach, and prefetching
makes the query even slower. In contrast, the permutation
index approach still slightly improves performance.
Figure 5 shows the performance of selection (Q2) using
the SIMD implementation, which was again faster than the
scalar implementation for this query. We omit the results
without prefetching on the Phi since they are slower. The
performance profile is similar to Figure 4, except that for
low key-counts the performance is better. Because only one
bit (rather than 32 bits) is needed per key, a larger number
of keys is required before the cache capacity is exceeded.
The relative performance of the Skylake processor is better
in Figure 5 than in Figure 4 because it has a large L3
cache that can hold the entire bitmap. Figure 5(c) shows the
multithreaded performance. Comparing with the baseline,
the permutation index method is 1.4–2.2x faster on the Phi,
and 1.3–1.8x faster on the Skylake.
Figure 6 shows the performance of aggregation (Q3)
using the scalar and SIMD implementations. On Skylake,
scalar code generally outperforms the SIMD code because
the SIMD code has the overhead of conflict detection and
resolution. In fact, for high z values, the impact of conflict
resolution creates a severe degration in performance. The
threshold-based copying method described in Section 4.2.2
addresses this issue. On Skylake, the scalar version with our
permutation index optimization is fastest for all z values in
Figure 6(b). The modified SIMD algorithm with copying is
the best-performing method at high skew levels on the Phi
(Figure 6(e)), but not on the Skylake machine.
Figure 6 also shows the performance of heavy hitter
aggregation for computing the counts of the 4,000 most
frequent data items (“top-4k”). Our optimization discussed
in Section 4.2.2 significantly improves query performance,
since the threshold-based approach only updates cache-
resident data. In comparison, performing full aggregates
without using the permutation index would have much larger
memory footprint and have to do extra work to extract the
top-4,000 results. As an example, for z = 0.5, the perfor-
mance improvement over computing the entire aggregate is
(a) Varying number of keys, z = 1 (b) Varying zipf parameter, 128M keys (c) Multithreaded, z = 1, 128M keys
Figure 5: Selection (Q2) performance using permutation indexes (SIMD).
(a) Skylake, varying number of keys, z = 1 (b) Skylake, varying zipf parameter, 128M keys (c) Skylake, multithreaded, z = 1, 128M keys
(d) Phi, varying number of keys, z = 1 (e) Phi, varying zipf parameter, 128M keys (f) Phi, multithreaded, z = 1, 128M keys
Figure 6: Aggregation (Q3) performance using permutation indexes.
8.6x for the Phi, and 6.6x for the Skylake processor.
Note that we report the performance results for all
operations using branch-free implementations, with one ex-
ception: for the heavy hitter aggregation on the Phi, we
use an explicit if-branch to check whether the identifiers
are smaller than 4,000, since we find the scalar branch-
free version is particularly slow on the Phi. As shown in
Figure 6(e), when there are no heavy hitters (z = 0.0), or
when the data is very skewed (z = 2.0), branching does
not hurt the performance of heavy hitter aggregation. At
high skew, the latency is longer because of more updates.
When z is close to 1.0, performance suffers due to branch
misprediction penalty. For example, at z = 1.0 the branch
misprediction rate is 45.5%. In contrast, Figure 6(b) shows
the performance of the branch-free heavy hitter aggregation
on the Skylake, where there is no performance degration
around z = 1.0.
Figures 6(c) and 6(f) present the aggregation perfor-
mance using all 48 threads on the Skylake, and 256 threads
on the Phi, respectively. Single-threaded results are shown
as a reference. We compare the performance of independent-
buffer implementations, shared buffer implementations us-
ing atomic operations, and the threshold-based hybrid ap-
proach described in Section 4.3. For independent-buffer
implementations, the permutation index method is better
than the baseline, but both methods do not scale well
because of the increased contention and the overhead of
final combination when all threads are used. On the Phi, they
are the slowest methods when all threads are used, despite
using much more memory. For the shared-buffer implemen-
tations, the permutation index method performs worse with
multithreading because it leads to more severe contention
over popular cache lines for atomic operations. The hybrid
method addresses the contention problem. As shown in the
figure, the hybrid methods perform the best. Using 256K
as the threshold results in improved latency comparing to
the 8K threshold (these threshold values correspond to L1
and L2 cache sizes), but uses more memory. Comparing the
independent-buffer baseline approach, this method is 4.7x
faster on the Skylake and 10.4x faster on the Phi, with much
smaller memory usage.
To estimate the cost of building a permutation index, one
can add the cost of an aggregation query (Q3) and a materi-
alization query (Q1) using the baseline (rand) organization.
In every configuration described above, building the index
using all available threads would take just a few seconds.
TABLE 2: Table Cardinality in Real-World Datasets
Dataset Dimension Fact
Pageview wikipedia 11,880,596 3,351,629,753
Product book 2,717,050 19,531,329
dvd 221,086 5,069,140
ebook 1,292,480 17,622,415
home 918,287 6,221,559
music 675,893 4,751,577
pc 382,331 6,908,554
sports 753,280 4,850,360
wireless 767,830 9,002,021
Graph friendster 65,608,366 1,806,067,135
orkut 3,072,441 117,185,083
livejournal 4,847,571 68,993,773
pokec 1,632,803 30,622,564
youtube 1,134,890 2,987,624
wiki-talk 2,394,385 5,021,410
patent 3,774,768 16,518,948
ca-road 1,965,206 5,533,214
5.2. Results on Real Datasets
We tested the permutation index method on three types
of real-world datasets with different sizes and degrees of
skew:
• Pageview1 is a dataset of web request logs of
wikipedia pages. The data is cleaned to filter out
requests from search engine spiders, leaving only
human traffic. In a database system, the request logs
are stored as a fact table referencing a dimension ta-
ble describing page information. For our experiment,
we used the monthly request data from December
2018.
• Product2 is the Amazon reviews data. We believe
the number of public product reviews is a proxy of
the sales data, and it also exhibits skew. We consider
the products as dimension data, and each review is
a data item in the fact table. The data is separated
into different categories, and each category exhibits
different skewed data distribution. Our experiments
uses eight categories with the most products.
• Graph3. We tested several large graphs from the
Stanford Large Network Dataset Collection [44],
including social, communication, citation, and road
networks. In our experiments, edges are stored as a
fact table, with nodes stored in a dimension table.
As an example, in a social network, nodes represent
users and the dimension table describes user infor-
mation. An aggregation query on the fact table, for
instance, is to count the number of friends for every
user (i.e., node degrees).
Table 2 summarizes the sizes of the all the real-world
datasets used in our experiments.
In the following experiments, we report the results of
SIMD implementations for materialization and selection,
1. https://dumps.wikimedia.org/other/analytics
2. https://s3.amazonaws.com/amazon-reviews-pds
3. https://snap.stanford.edu/data
and scalar implementations for aggregation (and heavy hitter
aggregation). These implementations are generally fast as
revealed in our microbenchmark analysis. For heavy hitter
aggregation, we again compute the counts for the top 4,000
most frequent data items.
Figure 7 shows the performance improvements on the
largest dataset Pageview, using a varying number of threads.
On the Skylake with a single thread, the permutation index
speedup of materialization, selection, full aggregation, and
heavy hitter aggregation is 1.5x, 2.2x, 1.3x, and 4.0x, respec-
tively. The results on the Phi are generally slower than on the
Skylake, but using permutation indexes similarly achieves
1.10–2.27x performance using a single thread. With more
threads, permutation indexes speed up queries similarly for
materialization, selection, and full aggregation, up to 2x.
When all threads are used, the performance ratio for heavy
hitter aggregation is 20.4x on the Skylake and 22.3x on the
Phi. Because all threads can use their cache-resident private
buffer to execute heavy hitter aggregation in parallel, the
permutation index method is particularly effective.
Tables 3 and 4 present the single-threaded latency re-
sults (in milliseconds) and performance speedups of using
permutation indexes on all other datasets. On the Skylake
for the product review and graph data, the performance
improvements are up to 6.0x for heavy hitter aggregations,
and up to 1.4–1.5x for other operations. Results on the Phi
are generally slower, but they achieve similar speedups using
permutation indexes.
6. Related Work
Skew has been used to improve OLAP query execution
in several ways that are orthogonal to those described in
this paper. High-frequency data items can be represented
with fewer bits to reduce data transfer costs [45], [46].
When skew causes load imbalances in a parallel join, ex-
plicit scheduling of common keys can overcome this im-
balance [23], [47], [48]. At lower levels of the memory
hierarchy, popular rows may be surrounded by cold rows
in a disk page. This observation has led some systems to
adopt row caches that keep popular rows in RAM even
if their page has been evicted from the buffer pool [49].
Such a solution is not viable at higher levels of the memory
hierarchy because the programmer has little direct control
of what data is loaded into the processor caches.
Techniques for tracing data access streams to quantify
the degree of latent locality have been described [50]. Dis-
tributional parameters of skewed data streams can be esti-
mated [51]. Skew can affect the performance of selectivity
estimators in query optimization, and skew-aware methods
are required for accurate estimation [52].
The advent of large main memory capacity paved the
way for high-performance OLAP query execution. Column-
oriented execution [53] and cache-conscious operators [54]
were proposed before the advent of multi-core CPUs. Ana-
lytical database systems adopted column-oriented storage,
while focusing on compression [6], [38], [55] and com-
plex materialization strategies [56], [57] to further optimize
(a) Skylake (b) Phi
Figure 7: Performance speedup on Pageview using permutation indexes.
TABLE 3: Performance Results with Real-World Datasets on Skylake
Materialize Select Aggregate Top-4k
Product rand freq ratio rand freq ratio rand freq ratio freq ratio
book 74.89 53.54 1.40 11.88 10.07 1.18 71.40 62.23 1.15 25.46 2.80
dvd 5.57 4.61 1.21 1.77 1.75 1.01 5.47 4.84 1.13 5.34 1.02
ebook 49.04 35.53 1.38 10.10 8.20 1.23 53.09 39.48 1.34 20.76 2.56
home 14.40 10.46 1.38 3.10 2.53 1.23 14.38 10.60 1.36 6.99 2.06
music 10.23 7.49 1.37 2.18 1.88 1.16 10.24 7.70 1.33 5.61 1.82
pc 10.96 7.60 1.44 2.95 2.80 1.05 10.19 6.82 1.49 6.99 1.46
sports 10.13 7.64 1.33 2.41 1.95 1.23 10.04 6.87 1.46 5.47 1.84
wireless 19.96 13.46 1.48 4.62 3.67 1.26 21.61 13.38 1.62 9.78 2.21
Graph rand freq ratio rand freq ratio rand freq ratio freq ratio
friendster 16514.6 12136.3 1.36 5243.83 3753.34 1.40 16423.1 12692.4 1.29 2718.59 6.04
orkut 400.85 386.93 1.04 73.30 69.67 1.05 459.44 440.40 1.04 169.06 2.72
livejournal 248.14 230.95 1.07 47.31 44.02 1.07 276.49 267.41 1.03 99.58 2.78
pokec 100.37 88.78 1.13 18.34 16.66 1.10 112.33 101.78 1.10 44.56 2.52
youtube 6.23 4.32 1.44 1.45 1.12 1.30 6.00 4.53 1.32 3.46 1.73
wiki-talk 11.25 9.46 1.19 2.54 1.95 1.30 10.07 9.38 1.07 5.00 2.02
patent 59.39 55.11 1.08 10.10 9.92 1.02 62.32 60.71 1.03 24.44 2.55
ca-road 16.94 16.57 1.02 2.95 2.9 1.02 16.46 16.19 1.02 8.37 1.97
TABLE 4: Performance Results with Real-World Datasets on Phi
Materialize Select Aggregate Top-4k
Product rand freq ratio rand freq ratio rand freq ratio freq ratio
book 218.36 167.07 1.31 49.28 49.11 1 289.77 277.2 1.05 141.81 2.04
dvd 21.45 19.39 1.11 8.76 8.58 1.02 36.98 33.73 1.10 36.58 1.01
ebook 145.41 105.27 1.38 43.62 39.08 1.12 248.39 220.59 1.13 125.68 1.98
home 50.97 35.06 1.45 15.85 13.67 1.16 87.07 78.32 1.11 44.62 1.95
music 36.59 26.83 1.36 9.98 9.12 1.09 64.68 58.57 1.10 34.60 1.87
pc 36.06 29.73 1.21 15.64 13.81 1.13 75.33 57.59 1.31 49.75 1.51
sports 39.08 26.9 1.45 12.13 10.48 1.16 68.68 59.3 1.16 35.29 1.95
wireless 60.45 45.23 1.34 21.69 18.81 1.15 118.96 98.45 1.21 66.18 1.80
Graph rand freq ratio rand freq ratio rand freq ratio freq ratio
friendster 27300.2 26776.80 1.02 25146.1 17061.5 1.47 30691.60 28938.5 1.06 12902.7 2.38
orkut 1627.33 1483.74 1.10 303.01 290.24 1.04 1890.73 1763.08 1.07 842.5 2.24
livejournal 978.51 855.56 1.14 231.75 178.46 1.3 1140.16 1081.29 1.05 495.52 2.30
pokec 404.86 316.48 1.28 77.64 77.41 1 488.34 465.34 1.05 220.46 2.22
youtube 30.29 21.08 1.44 7.72 7.29 1.06 45.7 41.68 1.10 21.88 2.09
wiki-talk 44.36 33.09 1.34 13.13 12.2 1.08 71.82 68.61 1.05 36.36 1.97
patent 234.93 223.26 1.05 44.99 41.64 1.08 266.96 261.31 1.02 117.75 2.27
ca-road 76.14 75.23 1.01 12.76 12.76 1 89.91 85.75 1.05 41.31 2.18
memory access. Block-at-a-time execution [58] and code
generation [59], [60], [61], [62] are both state-of-the-art
designs for analytical query engines [63].
SIMD optimizations have been applied to isolated oper-
ators using key–rid pairs. Many join implementations exist
[21], [37], [64], [65], [66], including for many-core plat-
forms [67]. SIMD implementations of stand-alone operators
such as sorting [31], [39], [68], [69] are also common.
Linear-access operators such as scans [26] and compres-
sion [38], [70], [71], are inherently data-parallel. Advanced
SIMD optimizations [30], [72] include non-linear-access
operators. A recently proposed vector algebra [73] proposes
templates for both SIMD CPUs and SIMT GPUs. Non-
linear-access operators are wrappers to vector gathers, such
as joins executed by dereferencing join indexes, which are
likely to benefit from the permutation indexes when there is
skew in the join attribute distribution.
Other architecture-specific optimizations such as soft-
ware prefetching and query compilation are used to improve
database performance [29]. Cache partitioning is used to
address the cache pollution problem in shared cache to
accelerate concurrent workloads [74]. The performance of
frequent pattern mining algorithms can also be improved by
tuning data layout and access patterns [75].
7. Conclusions
We propose permutation indexes to reorder data so that
popular data items are concentrated in the cache hierarchy.
Using this method, we can exploit the data skew inherent
in many practical domains to improve cache utilization at
all levels. Efficient database operators can be developed
using permutation indexes, combining the benefits of cache
optimization with SIMD vectorization. Our method is also
effective when working together with multithreaded execu-
tion and software prefetching, further speeding up query
execution. Through extensive experiments with real and
synthetic data, we demonstrate that the performance of
materialization, selection, and aggregation queries can be
significantly improved.
References
[1] F. Faerber, A. Kemper, P.-A˚. Larson, J. Levandoski, T. Neumann,
A. Pavlo et al., “Main memory database systems,” Foundations and
Trends® in Databases, vol. 8, no. 1-2, pp. 1–130, 2017.
[2] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik,
E. P. Jones, S. Madden, M. Stonebraker, Y. Zhang et al., “H-store:
a high-performance, distributed main memory transaction processing
system,” Proceedings of the VLDB Endowment, vol. 1, no. 2, pp.
1496–1499, 2008.
[3] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal,
R. Stonecipher, N. Verma, and M. Zwilling, “Hekaton: Sql server’s
memory-optimized oltp engine,” in Proceedings of the 2013 ACM
SIGMOD International Conference on Management of Data. ACM,
2013, pp. 1243–1254.
[4] A. Kemper and T. Neumann, “Hyper: A hybrid oltp&olap main
memory database system based on virtual memory snapshots,” in
Data Engineering (ICDE), 2011 IEEE 27th International Conference
on. IEEE, 2011, pp. 195–206.
[5] R. Barber, P. Bendel, M. Czech, O. Draese, F. Ho, N. Hrle, S. Idreos,
M.-S. Kim, O. Koeth, J.-G. Lee et al., “Business analytics in (a)
blink.” IEEE Data Eng. Bull., vol. 35, no. 1, pp. 9–14, 2012.
[6] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. Ku-
landaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman et al.,
“Db2 with blu acceleration: So much more than just a column store,”
Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 1080–1091,
2013.
[7] F. Fa¨rber, S. K. Cha, J. Primsch, C. Bornho¨vd, S. Sigg, and W. Lehner,
“Sap hana database: data management for modern business applica-
tions,” ACM Sigmod Record, vol. 40, no. 4, pp. 45–51, 2012.
[8] M. Zukowski, M. Van de Wiel, and P. Boncz, “Vectorwise: A vector-
ized analytical dbms,” in 2012 IEEE 28th International Conference
on Data Engineering. IEEE, 2012, pp. 1349–1350.
[9] T. Lahiri, M.-A. Neimat, and S. Folkman, “Oracle timesten: An in-
memory database for enterprise applications.” IEEE Data Eng. Bull.,
vol. 36, no. 2, pp. 6–13, 2013.
[10] P. A. Boncz, M. L. Kersten, and S. Manegold, “Breaking the memory
wall in monetdb,” Communications of the ACM, vol. 51, no. 12, pp.
77–85, 2008.
[11] M. Grund, J. Kru¨ger, H. Plattner, A. Zeier, P. Cudre-Mauroux, and
S. Madden, “Hyrise: a main memory hybrid storage engine,” Pro-
ceedings of the VLDB Endowment, vol. 4, no. 2, pp. 105–116, 2010.
[12] K. Krikellas, S. D. Viglas, and M. Cintra, “Generating code for
holistic query evaluation,” in Data Engineering (ICDE), 2010 IEEE
26th International Conference on. IEEE, 2010, pp. 613–624.
[13] Y. Klonatos, C. Koch, T. Rompf, and H. Chafi, “Building efficient
query engines in a high-level language,” Proceedings of the VLDB
Endowment, vol. 7, no. 10, pp. 853–864, 2014.
[14] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C.
Mowry, M. Perron, I. Quah et al., “Self-driving database management
systems.” in CIDR, 2017.
[15] J. M. Patel, H. Deshmukh, J. Zhu, N. Potti, Z. Zhang, M. Spehlmann,
H. Memisoglu, and S. Saurabh, “Quickstep: a data platform based
on the scaling-up approach,” Proceedings of the VLDB Endowment,
vol. 11, no. 6, pp. 663–676, 2018.
[16] D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden et al.,
“The design and implementation of modern column-oriented database
systems,” Foundations and Trends® in Databases, vol. 5, no. 3, pp.
197–280, 2013.
[17] A. Clauset, C. R. Shalizi, and M. E. Newman, “Power-law distribu-
tions in empirical data,” SIAM review, vol. 51, no. 4, pp. 661–703,
2009.
[18] M. E. Newman, “Power laws, pareto distributions and zipf’s law,”
Contemporary physics, vol. 46, no. 5, pp. 323–351, 2005.
[19] T. Rabl, M. Poess, H.-A. Jacobsen, P. O’Neil, and E. O’Neil, “Vari-
ations of the star schema benchmark to test the effects of data
skew on query performance,” in Proceedings of the 4th ACM/SPEC
International Conference on Performance Engineering. ACM, 2013,
pp. 361–372.
[20] R. O. Nambiar and M. Poess, “The making of tpc-ds,” in Proceedings
of the 32nd international conference on Very large data bases. VLDB
Endowment, 2006, pp. 1049–1058.
[21] S. Schuh, X. Chen, and J. Dittrich, “An experimental comparison
of thirteen relational equi-joins in main memory,” in Proceedings of
the 2016 International Conference on Management of Data. ACM,
2016, pp. 1961–1976.
[22] H. Ma¨rtens, “A classification of skew effects in parallel database
systems,” in European Conference on Parallel Processing. Springer,
2001, pp. 291–300.
[23] D. J. DeWitt, J. F. Naughton, D. A. Schneider, S. Seshadri et al.,
Practical skew handling in parallel joins. University of Wisconsin-
Madison. Computer Sciences Department, 1992.
[24] A. Pavlo, C. Curino, and S. Zdonik, “Skew-aware automatic database
partitioning in shared-nothing, parallel oltp systems,” in Proceedings
of the 2012 ACM SIGMOD International Conference on Management
of Data. ACM, 2012, pp. 61–72.
[25] I. Moreno-Sa´nchez, F. Font-Clos, and A´. Corral, “Large-scale analysis
of zipfs law in english texts,” PloS one, vol. 11, no. 1, p. e0147073,
2016.
[26] J. Zhou and K. A. Ross, “Implementing database operations using
simd instructions,” in Proceedings of the 2002 ACM SIGMOD in-
ternational conference on Management of data. ACM, 2002, pp.
145–156.
[27] J. Doweck, W.-F. Kao, A. K.-y. Lu, J. Mandelblat, A. Rahatekar,
L. Rappoport, E. Rotem, A. Yasin, and A. Yoaz, “Inside 6th-
generation intel core: new microarchitecture code-named skylake,”
IEEE Micro, no. 2, pp. 52–62, 2017.
[28] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod,
S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights land-
ing: Second-generation intel xeon phi product,” Ieee micro, vol. 36,
no. 2, pp. 34–46, 2016.
[29] P. Menon, T. C. Mowry, and A. Pavlo, “Relaxed operator fusion for in-
memory databases: Making compilation, vectorization, and prefetch-
ing work together at last,” Proceedings of the VLDB Endowment,
vol. 11, no. 1, pp. 1–13, 2017.
[30] O. Polychroniou, A. Raghavan, and K. A. Ross, “Rethinking simd
vectorization for in-memory databases,” in Proceedings of the 2015
ACM SIGMOD International Conference on Management of Data.
ACM, 2015, pp. 1493–1508.
[31] O. Polychroniou and K. A. Ross, “A comprehensive study of main-
memory partitioning and its application to large-scale comparison-and
radix-sort,” in Proceedings of the 2014 ACM SIGMOD international
conference on Management of data. ACM, 2014, pp. 755–766.
[32] J. I. Munro, R. Raman, V. Raman, and S. Rao, “Succinct representa-
tions of permutations and functions,” Theoretical Computer Science,
vol. 438, pp. 74–88, 2012.
[33] J. Barbay, “Succinct and compressed data structures for permutations
and integer functions,” Encyclopedia of Algorithms, pp. 1–7, 2008.
[34] C. CaBcaval and D. A. Padua, “Estimating cache misses and locality
using stack distances,” in Proceedings of the 17th annual interna-
tional conference on Supercomputing. ACM, 2003, pp. 150–159.
[35] N. Beckmann and D. Sanchez, “Modeling cache performance beyond
lru,” in 2016 IEEE International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2016, pp. 225–236.
[36] J. Alstott, E. Bullmore, and D. Plenz, “powerlaw: a python package
for analysis of heavy-tailed distributions,” PloS one, vol. 9, no. 1, p.
e85777, 2014.
[37] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish,
J. Chhugani, A. Di Blas, and P. Dubey, “Sort vs. hash revisited: fast
join implementation on modern multi-core cpus,” Proceedings of the
VLDB Endowment, vol. 2, no. 2, pp. 1378–1389, 2009.
[38] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and
J. Schaffner, “Simd-scan: ultra fast in-memory table scan using on-
chip vector processing units,” Proceedings of the VLDB Endowment,
vol. 2, no. 1, pp. 385–394, 2009.
[39] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim,
and P. Dubey, “Fast sort on cpus and gpus: a case for bandwidth
oblivious simd sort,” in Proceedings of the 2010 ACM SIGMOD
International Conference on Management of data. ACM, 2010, pp.
351–362.
[40] J. Cieslewicz and K. A. Ross, “Adaptive aggregation on chip multi-
processors,” in Proceedings of the 33rd international conference on
Very large data bases. VLDB Endowment, 2007, pp. 339–350.
[41] J. Cieslewicz, K. A. Ross, K. Satsumi, and Y. Ye, “Automatic con-
tention detection and amelioration for data-intensive operations,” in
Proceedings of the 2010 ACM SIGMOD International Conference on
Management of data. ACM, 2010, pp. 483–494.
[42] Y. Ye, K. A. Ross, and N. Vesdapunt, “Scalable aggregation on
multicore processors,” in Proceedings of the Seventh International
Workshop on Data Management on New Hardware. ACM, 2011,
pp. 1–9.
[43] S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J.
Hughes, C. Kim, V. W. Lee, and A. D. Nguyen, “Atomic vector
operations on chip multiprocessors,” in ACM SIGARCH Computer
Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008,
pp. 441–452.
[44] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network
dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
[45] Y. Li, C. Chasseur, and J. M. Patel, “A padded encoding scheme
to accelerate scans by leveraging skew,” in Proceedings of the 2015
ACM SIGMOD International Conference on Management of Data.
ACM, 2015, pp. 1509–1524.
[46] B. Hentschel, M. S. Kester, and S. Idreos, “Column sketches: A scan
accelerator for rapid and robust predicate evaluation,” in Proceedings
of the 2018 International Conference on Management of Data. ACM,
2018, pp. 857–872.
[47] J. L. Wolf, D. M. Dias, and P. S. Yu, “An effective algorithm
for parallelizing hash joins in the presence of data skew,” in Data
Engineering, 1991. Proceedings. Seventh International Conference
on. IEEE, 1991, pp. 200–209.
[48] J. Duggan, O. Papaemmanouil, L. Battle, and M. Stonebraker, “Skew-
aware join optimization for array databases,” in Proceedings of the
2015 ACM SIGMOD International Conference on Management of
Data. ACM, 2015, pp. 123–135.
[49] N. Lastovica, “Guide to database performance and tuning: Row
cache enhancements,” https://www.oracle.com/technetwork/database/
rdb/0308-row-cache-712a-134300.pdf, 2003, accessed 01/15/2019.
[50] T. M. Chilimbi, “Efficient representations and abstractions for quan-
tifying and exploiting data reference locality,” in ACM SIGPLAN
Notices, vol. 36, no. 5. ACM, 2001, pp. 191–202.
[51] F. Korn, S. Muthukrishnan, and Y. Wu, “Modeling skew in data
streams,” in Proceedings of the 2006 ACM SIGMOD international
conference on Management of data. ACM, 2006, pp. 181–192.
[52] C. A. Lynch, “Selectivity estimation and query optimization in large
databases with highly skewed distribution of column values.” in
VLDB, 1988, pp. 240–251.
[53] S. Manegold, P. A. Boncz, and M. L. Kersten, “Optimizing database
architecture for the new bottleneck: memory access,” The VLDB
JournalThe International Journal on Very Large Data Bases, vol. 9,
no. 3, pp. 231–246, 2000.
[54] S. Manegold, P. Boncz, and M. Kersten, “Optimizing main-memory
join on modern hardware,” IEEE Transactions on Knowledge & Data
Engineering, no. 4, pp. 709–730, 2002.
[55] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack,
M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil et al., “C-store:
a column-oriented dbms,” in Proceedings of the 31st international
conference on Very large data bases. VLDB Endowment, 2005, pp.
553–564.
[56] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden, “Materi-
alization strategies in a column-oriented dbms,” in Data Engineering,
2007. ICDE 2007. IEEE 23rd International Conference on. IEEE,
2007, pp. 466–475.
[57] L. Shrinivas, S. Bodagala, R. Varadarajan, A. Cary, V. Bharathan, and
C. Bear, “Materialization strategies in the vertica analytic database:
Lessons learned,” in Data Engineering (ICDE), 2013 IEEE 29th
International Conference on. IEEE, 2013, pp. 1196–1207.
[58] P. A. Boncz, M. Zukowski, and N. Nes, “Monetdb/x100: Hyper-
pipelining query execution.” in Cidr, vol. 5, 2005, pp. 225–237.
[59] A. Gupta, D. Agarwal, D. Tan, J. Kulesza, R. Pathak, S. Stefani,
and V. Srinivasan, “Amazon redshift and the case for simpler data
warehouses,” in Proceedings of the 2015 ACM SIGMOD international
conference on management of data. ACM, 2015, pp. 1917–1923.
[60] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen,
T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey, “Fast: fast
architecture sensitive tree search on modern cpus and gpus,” in
Proceedings of the 2010 ACM SIGMOD International Conference
on Management of data. ACM, 2010, pp. 339–350.
[61] T. Neumann, “Efficiently compiling efficient query plans for modern
hardware,” Proceedings of the VLDB Endowment, vol. 4, no. 9, pp.
539–550, 2011.
[62] R. Y. Tahboub, G. M. Essertel, and T. Rompf, “How to architect a
query compiler, revisited,” in Proceedings of the 2018 International
Conference on Management of Data. ACM, 2018, pp. 307–322.
[63] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz,
“Everything you always wanted to know about compiled and vec-
torized queries but were afraid to ask,” Proceedings of the VLDB
Endowment, vol. 11, no. 13, pp. 2209–2222, 2018.
[64] C. Balkesen, G. Alonso, J. Teubner, and M. T. O¨zsu, “Multi-core,
main-memory joins: Sort vs. hash revisited,” Proceedings of the
VLDB Endowment, vol. 7, no. 1, pp. 85–96, 2013.
[65] C. Balkesen, J. Teubner, G. Alonso, and M. T. O¨zsu, “Main-memory
hash joins on multi-core cpus: Tuning to the underlying hardware,” in
Data Engineering (ICDE), 2013 IEEE 29th International Conference
on. IEEE, 2013, pp. 362–373.
[66] S. Blanas, Y. Li, and J. M. Patel, “Design and evaluation of main
memory hash join algorithms for multi-core cpus,” in Proceedings of
the 2011 ACM SIGMOD International Conference on Management
of data. ACM, 2011, pp. 37–48.
[67] S. Jha, B. He, M. Lu, X. Cheng, and H. P. Huynh, “Improving main
memory hash joins on intel xeon phi processors: An experimental
approach,” Proceedings of the VLDB Endowment, vol. 8, no. 6, pp.
642–653, 2015.
[68] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K.
Chen, A. Baransi, S. Kumar, and P. Dubey, “Efficient implementation
of sorting on multi-core simd cpu architecture,” Proceedings of the
VLDB Endowment, vol. 1, no. 2, pp. 1313–1324, 2008.
[69] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “Aa-sort: A
new parallel sorting algorithm for multi-core simd processors,” in
Parallel Architecture and Compilation Techniques, 2007. PACT 2007.
16th International Conference on. IEEE, 2007, pp. 189–198.
[70] H. Lang, T. Mu¨hlbauer, F. Funke, P. A. Boncz, T. Neumann, and
A. Kemper, “Data blocks: Hybrid oltp and olap on compressed storage
using both vectorization and compilation,” in Proceedings of the 2016
International Conference on Management of Data. ACM, 2016, pp.
311–326.
[71] O. Polychroniou and K. A. Ross, “Efficient lightweight compression
alongside fast scans,” in Proceedings of the 11th International Work-
shop on Data Management on New Hardware. ACM, 2015, p. 9.
[72] ——, “Vectorized bloom filters for advanced simd processors,” in
Proceedings of the Tenth International Workshop on Data Manage-
ment on New Hardware. ACM, 2014, p. 6.
[73] H. Pirk, O. Moll, M. Zaharia, and S. Madden, “Voodoo-a vector
algebra for portable database performance on modern hardware,”
Proceedings of the VLDB Endowment, vol. 9, no. 14, pp. 1707–1718,
2016.
[74] S. Noll, J. Teubner, N. May, and A. Bo¨hm, “Accelerating concurrent
workloads with cpu cache partitioning,” in 2018 IEEE 34th Interna-
tional Conference on Data Engineering (ICDE). IEEE, 2018, pp.
437–448.
[75] M. Wei, C. Jiang, and M. Snir, “Programming patterns for
architecture-level software optimizations on frequent pattern mining,”
in 2007 IEEE 23rd International Conference on Data Engineering.
IEEE, 2007, pp. 336–345.
